My Weblog

Blog about programming and math

Scraping Hotel URLs from TripAdvisor

Recently I started working on a web scraping project. The problem was to find the TripAdvisor URLs of the hotels in a given city so that we could scrape their reviews. Our review scraper works fine once we supply the URL for a given hotel, but finding that URL was the problem. It needed a human to look up each URL, and I wanted to automate the process. After investigating the TripAdvisor home page, I automated the whole process of filling in the search form with Selenium.

;; something went wrong with my Firefox update
(defn go-to-tripadv
  "returns tripadvisor urls for city"
  [city]
  (set-driver! {:browser :firefox})
  (get-url "")
  ;; wait for the home page to finish loading
  (wait-until #(= (title) "TripAdvisor: Read Reviews, Compare Prices & Book"))
  (input-text "#GEO_SCOPED_SEARCH_INPUT" city)
  (click "#GEO_SCOPE_CONTAINER .scopedSearchDisplay li")
  (apply quick-fill-submit
         [{"#mainSearch" "Hotel"}
          {"#SEARCH_BUTTON" click}])
  ;; wait until the results page has replaced the home page
  (wait-until #(not= (title) "TripAdvisor: Read Reviews, Compare Prices & Book"))
  (let [home-url (current-url)
        page (page-source)
        rest-url (fetch-all-nav-urls page)]
    (cons home-url rest-url)))

It works fine, but I did not like involving a browser. I wondered what request the browser sends to TripAdvisor to fetch the data, and I found the Firefox plugin Firebug. After analyzing the traffic I figured out the request (replace city with the required city name), and now the whole process is trivial. You can see the whole code on GitHub.

(ns tripadvisorurl.core
  (:require [ :refer :all]
            [net.cgrand.enlive-html :as ehtml]
            [clj-json.core :as json]))

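(defn rest-url-from-page [url page-no]
  ;; split the url on its first two hyphens, then insert an "oaN" offset
  ;; segment (30 results per page) for every page after the first
  (let [[f-el s-el t-el] (clojure.string/split url #"-" 3)
        all-page-no (map (fn [x] (str "oa" (* 30 x))) (rest (range page-no)))
        final-url (map (fn [x] (str f-el "-" s-el "-" x "-" t-el)) all-page-no)]
    final-url))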

(defn fetch-all-nav-urls
  "fetch all the tripadvisor navigation urls for a city"
  [url]
  (as-> ( url) d
    (ehtml/html-resource d)
    (ehtml/select d [:div.pageNumbers :a])
    (last d)
    (first (:content d))
    (cons url
          (if (nil? d) nil
              (rest-url-from-page url (Integer/valueOf d))))))
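
To make the pagination scheme concrete, here is a minimal sketch of the same offset logic applied to a made-up listing URL. The URL and the name `page-urls` are hypothetical; the only things taken from the code above are the split on the first two hyphens and the "oa" offset of 30 results per page.

```clojure
(require '[clojure.string :as str])

(defn page-urls
  "Builds the URLs for pages 2..page-count by inserting an oaN offset
  segment after the second hyphen-separated part of first-page-url."
  [first-page-url page-count]
  (let [[head geo tail] (str/split first-page-url #"-" 3)]
    (for [page (range 1 page-count)]
      (str head "-" geo "-oa" (* 30 page) "-" tail))))

;; Hypothetical URL, for illustration only.
(page-urls "https://example.com/Hotels-g12345-Some_City-Hotels.html" 3)
```

For a listing with three pages this yields two extra URLs, one with `oa30` and one with `oa60`; the first page is the original URL itself.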

(defn go-to-tripadv-api
  "Fetches JSON from the Tripadvisor API and extracts the url"
  [city]
  (as-> (str ""
             "&action=API&types=geo,theme_park&link_type=hotel&details=false") d
    ( d)
    (ehtml/html-resource d)
    (map (comp first :content first :content) d)
    (first d)
    (json/parse-string d)
    (get d "results")
    (first d)
    (get d "url")
    (str "" d)))
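
The chain of `get` and `first` calls above can be tried on an already-parsed response. In this minimal sketch the map is a hypothetical stand-in for what the JSON parser returns; only the "results" and "url" keys come from the code above, and `first-result-url` and the base URL are placeholders of my own.

```clojure
(defn first-result-url
  "Returns base-url joined with the \"url\" of the first entry under
  \"results\", or nil when there is no such entry."
  [base-url parsed]
  (when-let [path (get (first (get parsed "results")) "url")]
    (str base-url path)))

;; Hypothetical parsed response, for illustration only.
(def sample-response
  {"results" [{"url" "/Hotels-g12345-Some_City-Hotels.html"}]})

(first-result-url "https://example.com" sample-response)
```

Unlike the plain `str` call, this version returns nil when the search finds nothing, so the caller can tell an empty result from a real URL.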

(defn parse-hotel-url
  "returns all the hotels from tripadvisor page"
  [url]
  (as-> ( url) d
    (ehtml/html-resource d)
    (ehtml/select d [:a.property_title])
    (map (juxt (comp first :content) (comp :href :attrs)) d)))
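
Enlive represents each element as a map with `:tag`, `:attrs` and `:content` keys, so the `juxt` step can be tried without any network access. The nodes below are hand-written, hypothetical examples of what selecting `a.property_title` might return, and `name-and-href` is just a name I gave to the function built in the last line above.

```clojure
;; Hypothetical nodes, for illustration only.
(def sample-nodes
  [{:tag :a
    :attrs {:class "property_title" :href "/Hotel_Review-d1-Hotel_One.html"}
    :content ["Hotel One"]}
   {:tag :a
    :attrs {:class "property_title" :href "/Hotel_Review-d2-Hotel_Two.html"}
    :content ["Hotel Two"]}])

(def name-and-href
  "Maps a node to a [name href] pair."
  (juxt (comp first :content) (comp :href :attrs)))

(map name-and-href sample-nodes)
```

Each result is a pair of the link text and the link target. In this sample the targets are relative, so a base URL would still have to be prepended before fetching them.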


January 7, 2016 | Programming