My Weblog

Blog about programming and math

Converting Wikipedia HTML files to PDF

I want to convert HTML files from Wikipedia to PDF for offline reading. After a bit of searching, I found that Wikipedia itself provides a link in the left sidebar of every article [Print/export] to convert it to PDF. After a couple of clicks we can download the PDF, but I wanted to write a Haskell script for it. The script below finds the rendering URL. Fetching the rendering URL from the script returns empty tags, while copying and pasting the same URL into a web browser generates the PDF file. Asking on Haskell-cafe revealed that the link is generated by JavaScript, so I would have to script an actual browser to generate the PDF from this code. Technically this is still an unfinished project 😦 but it was the first time I played with some sort of web programming.

import Network.HTTP
import Text.HTML.TagSoup
import Data.Maybe
 
-- Return the absolute rendering url if this tag is the "Download as PDF" link.
parseHelp :: Tag String -> Maybe String 
parseHelp ( TagOpen _ y ) 
   | any ( \( _ , b ) -> b == "Download a PDF version of this wiki page" ) y 
       = fmap ( "http://en.wikipedia.org" ++ ) ( lookup "href" y )
parseHelp _ = Nothing   -- any other tag is not the link
 
 
-- First tag for which parseHelp succeeds, if any.
parse :: [ Tag String ] -> Maybe String
parse = listToMaybe . mapMaybe parseHelp
 
 
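main :: IO ()
main = do 
        x <- getLine 
        tags_1 <-  fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest x ) --open url
        -- pattern match instead of head / fromJust so a missing link does not crash
        case sections ( ~== "<div class=portal id=p-coll-print_export>" ) tags_1 of
          [] -> putStrLn "Print/export section not found"
          ( lst : _ ) -> case parse lst of
              Nothing -> putStrLn "PDF link not found"
              Just url -> do   --rendering url
                  putStrLn url
                  tags_2 <-  fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest url )
                  print tags_2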
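
For comparison, here is a minimal Python sketch of the same link-finding step using only the standard library. It assumes the PDF link can be recognised by its title attribute (the Haskell version matches that text against any attribute value), and the names PdfLinkFinder and find_rendering_url are hypothetical helpers, not part of any library. It only locates the rendering URL; it does not get around the JavaScript problem described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Both constants are taken from the Haskell script above.
PDF_LINK_TITLE = "Download a PDF version of this wiki page"
BASE_URL = "http://en.wikipedia.org"


class PdfLinkFinder(HTMLParser):
    """Remember the href of the first tag whose title matches PDF_LINK_TITLE."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if self.href is not None:
            return  # already found the link, ignore the rest of the page
        attr_map = dict(attrs)
        if attr_map.get("title") == PDF_LINK_TITLE:
            self.href = attr_map.get("href")


def find_rendering_url(html):
    """Return the absolute rendering URL, or None if no matching link exists."""
    finder = PdfLinkFinder()
    finder.feed(html)
    if finder.href is None:
        return None
    return urljoin(BASE_URL, finder.href)
```

Calling find_rendering_url on the text of an article page gives either the absolute URL or None, which plays the role of the Maybe String returned by parse.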
 

My second choice was obviously Python, and it finished the job perfectly. Here is the Python script for this purpose; in fact it can convert any HTML file to PDF. It is like opening an HTML file in a web browser and printing it to a PDF file.

import sys
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

#http://www.rkblog.rk.edu.pl/w/p/webkit-pyqt-rendering-web-pages/
#http://pastebin.com/xunfQ959
#http://bharatikunal.wordpress.com/2010/01/31/converting-html-to-pdf-with-python-and-qt/
#http://www.riverbankcomputing.com/pipermail/pyqt/2009-January/021592.html

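def convertFile():
    # Called once the page has finished loading: print it to the PDF printer and quit.
    web.print_(printer)
    print("done")
    QApplication.exit()


if __name__ == "__main__":
    # Written for Python 2 and PyQt4; on Python 3 use input() instead of raw_input().
    url = raw_input("enter url:")
    filename = raw_input("enter file name:")
    app = QApplication(sys.argv)
    web = QWebView()
    web.load(QUrl(url))
    #web.show()
    # Configure a printer that writes an A4 PDF instead of sending to a real device.
    printer = QPrinter(QPrinter.HighResolution)
    printer.setPageSize(QPrinter.A4)
    printer.setOutputFormat(QPrinter.PdfFormat)
    printer.setOutputFileName(filename + ".pdf")
    # Convert only after the page has loaded completely.
    QObject.connect(web, SIGNAL("loadFinished(bool)"), convertFile)
    sys.exit(app.exec_())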
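
The script hands whatever is typed at the prompt straight to QUrl. For an HTML file saved on disk, my understanding is that a bare path is treated as a relative URL, so a file:// URL is probably needed. A minimal sketch of a helper for that, written for Python 3 with the standard library only; to_url is a hypothetical name, not something from PyQt:

```python
from pathlib import Path
from urllib.parse import urlparse


def to_url(target):
    """Return target unchanged if it already is a URL, else a file:// URL for the path."""
    scheme = urlparse(target).scheme
    if scheme in ("http", "https", "file"):
        return target
    # Anything else is assumed to be a local path (relative paths use the current directory).
    return Path(target).resolve().as_uri()
```

The result could then be passed on as web.load(QUrl(to_url(url))), so the same prompt would accept both web addresses and local file names.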

September 9, 2011 - Posted in Programming
