My Weblog

Blog about programming and math

Scraping Hotel URLs from Tripadvisor

Recently I started working on a web scraping project. The problem was to find the Tripadvisor URLs of the hotels in a given city so that we could scrape their reviews. Our review scraper works fine once we supply the URL of a given hotel, but finding that URL was the problem: it needed a human to look it up, and I wanted to automate the process. After investigating the Tripadvisor home page, I automated the whole process of filling in the search form with Selenium.

;; Something went wrong with my Firefox update.
(defn go-to-tripadv
  "Returns the Tripadvisor hotel-listing URLs for city."
  [city]
  (set-driver! {:browser :firefox})
  (get-url "https://www.tripadvisor.in")
  (wait-until #(= (title) "TripAdvisor: Read Reviews, Compare Prices & Book"))
  (input-text "#GEO_SCOPED_SEARCH_INPUT" city)
  (click "#GEO_SCOPE_CONTAINER .scopedSearchDisplay li")
  (apply quick-fill-submit
         [{"#mainSearch" "Hotel"}
          {"#SEARCH_BUTTON" click}])
  ;; The title changes once the search results page has loaded.
  (wait-until
   #(not= (title) "TripAdvisor: Read Reviews, Compare Prices & Book"))
  (let [home-url (current-url)]
    (quit)
    ;; fetch-all-nav-urls takes a URL (not the page source) and already
    ;; returns it as the first element of its result.
    (fetch-all-nav-urls home-url)))

This works fine, but I did not like involving a browser. I wondered what request the browser sends to Tripadvisor to fetch the data, and I found the Firefox plugin Firebug. After analyzing the traffic I figured out the request (replace city with the required city name), and now the whole process is trivial. You can see the whole code on GitHub.
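As a rough sketch of what that request gives back: judging by how the code in this post reads the response, the JSON has a "results" list whose entries carry a site-relative "url". The snippet works on an already-parsed map, so it needs no HTTP or JSON library. The names first-result-url, base-url and sample-response, and the sample data itself, are made up for illustration; the real response has more fields.

```clojure
(def base-url "https://www.tripadvisor.in")

(defn first-result-url
  "Returns the absolute URL of the first entry in a parsed type-ahead
  response (a map with string keys), or nil if there are no results."
  [parsed]
  (when-let [path (-> parsed (get "results") first (get "url"))]
    (str base-url path)))

;; Hypothetical sample data, shaped like the response the post's code expects.
(def sample-response
  {"results" [{"url" "/Hotels-g000000-Example_City-Hotels.html"}]})

(first-result-url sample-response)
;; => "https://www.tripadvisor.in/Hotels-g000000-Example_City-Hotels.html"

(first-result-url {"results" []})
;; => nil
```

Returning nil when nothing matches keeps the no-result case explicit, instead of handing back the bare site root.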

(ns tripadvisorurl.core
  (:require [clojure.string]                       ; used by rest-url-from-page
            [clj-webdriver.taxi :refer :all]
            [net.cgrand.enlive-html :as ehtml]
            [clj-json.core :as json]))


(defn rest-url-from-page
  "Builds the URLs of result pages 2..page-no by inserting an oaN offset
  (N = 30, 60, ...) after the second dash-separated part of url."
  [url page-no]
  (let [[f-el s-el t-el] (clojure.string/split url #"-" 3)
        all-page-no (map (fn [x] (str "oa" (* 30 x))) (rest (range page-no)))]
    (map (fn [x] (str f-el "-" s-el "-" x "-" t-el)) all-page-no)))


(defn fetch-all-nav-urls
  "fetch all the tripadvisor navigation urls for a city"
  [url]
  (as-> (java.net.URL. url) d
    (ehtml/html-resource d)
    (ehtml/select d [:div.pageNumbers :a])
    (last d)
    (first (:content d))
    (cons url
          (if (nil? d) nil
              (rest-url-from-page url (Integer/valueOf d))))))


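(defn go-to-tripadv-api
  "Queries Tripadvisor's type-ahead JSON endpoint and returns the
  hotel-listing URL of the first result for city."
  [city]
  (as-> (str "https://www.tripadvisor.in/TypeAheadJson?query="
             ;; encode the city so names with spaces or accents stay valid
             (java.net.URLEncoder/encode city "UTF-8")
             "&action=API&types=geo,theme_park&link_type=hotel&details=false") d
    (java.net.URL. d)
    (ehtml/html-resource d)
    ;; enlive wraps the raw JSON in html/body nodes; dig out the text
    (map (comp first :content first :content) d)
    (first d)
    (json/parse-string d)
    (get d "results")
    (first d)
    (get d "url")
    (str "https://www.tripadvisor.in" d)))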

(defn parse-hotel-url
  "returns all the hotels from tripadvisor page"
  [url]
  (as-> (java.net.URL. url) d
    (ehtml/html-resource d)
    (ehtml/select d [:a.property_title])
    (map (juxt (comp first :content) (comp :href :attrs)) d)))
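
To make the paging scheme in rest-url-from-page concrete, here is a small worked example. The function is repeated so the snippet runs on its own. The listing URL is a made-up placeholder in the shape the code expects, and the offset of 30 per page is taken from the code above rather than from any Tripadvisor documentation.

```clojure
(require '[clojure.string :as string])

(defn rest-url-from-page
  "Builds the URLs of result pages 2..page-no by inserting an oaN offset
  after the second dash-separated part of url."
  [url page-no]
  (let [[f-el s-el t-el] (string/split url #"-" 3)
        offsets (map #(str "oa" (* 30 %)) (rest (range page-no)))]
    (map #(str f-el "-" s-el "-" % "-" t-el) offsets)))

;; Hypothetical listing URL; only its dash-separated shape matters here.
(def example-url
  "https://www.tripadvisor.in/Hotels-g000000-Example_City-Hotels.html")

;; Three pages in total: page 1 is example-url itself, so two more URLs.
(rest-url-from-page example-url 3)
;; => ("https://www.tripadvisor.in/Hotels-g000000-oa30-Example_City-Hotels.html"
;;     "https://www.tripadvisor.in/Hotels-g000000-oa60-Example_City-Hotels.html")
```

The split limit of 3 matters: it keeps any further dashes (here the one in Example_City-Hotels.html) inside the last part, so the offset always lands right after the g-number.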




January 7, 2016 - Posted in Programming
