Converting Wikipedia HTML files to PDF
I want to convert HTML files from Wikipedia to PDF for offline reading. After a bit of searching, I found that Wikipedia itself provides a link in the left sidebar [ Print/export ] of every article to convert it to PDF. After a couple of clicks we can download the PDF, but I wanted to write a Haskell script for it. This script generates the rendering URL. Fetching the rendering URL from the script returns empty tags, while copying and pasting the same URL into a web browser generates the PDF file. Asking on Haskell-cafe revealed that the link is generated by JavaScript, and that I would have to script an actual browser to generate the PDF from this code. Technically this is still an unfinished project, but it was the first time I played with some sort of web programming.
import Network.HTTP
import Text.HTML.TagSoup
import Data.Maybe

-- Return the rendering url if this tag is the "Download as PDF" link.
-- Assumes the href is the first attribute of the tag.
parseHelp :: Tag String -> Maybe String
parseHelp ( TagOpen _ y ) = if any ( \( _ , b ) -> b == "Download a PDF version of this wiki page" ) y
                               then Just $ "http://en.wikipedia.org" ++ snd ( y !! 0 )
                               else Nothing
parseHelp _ = Nothing

-- Scan the tags and return the first rendering url found.
parse :: [ Tag String ] -> Maybe String
parse [] = Nothing
parse ( x : xs )
    | isTagOpen x = case parseHelp x of
                         Just s -> Just s
                         Nothing -> parse xs
    | otherwise = parse xs

main :: IO ()
main = do
    x <- getLine
    tags_1 <- fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest x ) --open url
    let lst = head . sections ( ~== "<div class=portal id=p-coll-print_export>" ) $ tags_1
        url = fromJust . parse $ lst --rendering url
    putStrLn url
    tags_2 <- fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest url )
    print tags_2
My second choice was obviously Python, and it finished the job perfectly. Here is the Python script for this purpose; in fact it can convert any HTML file to PDF. It is like opening an HTML file in a web browser and printing it to a PDF file.
import sys
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

#http://www.rkblog.rk.edu.pl/w/p/webkit-pyqt-rendering-web-pages/
#http://pastebin.com/xunfQ959
#http://bharatikunal.wordpress.com/2010/01/31/converting-html-to-pdf-with-python-and-qt/
#http://www.riverbankcomputing.com/pipermail/pyqt/2009-January/021592.html

def convertFile( ):
    # Runs once the page has finished loading: print it to the pdf printer and quit.
    web.print_( printer )
    print "done"
    QApplication.exit()

if __name__=="__main__":
    url = raw_input("enter url:")
    filename = raw_input("enter file name:")
    app = QApplication( sys.argv )
    web = QWebView()
    web.load(QUrl( url ))
    #web.show()
    # A printer that writes an A4 pdf file instead of sending to a real device.
    printer = QPrinter( QPrinter.HighResolution )
    printer.setPageSize( QPrinter.A4 )
    printer.setOutputFormat( QPrinter.PdfFormat )
    printer.setOutputFileName( filename + ".pdf" )
    QObject.connect( web , SIGNAL("loadFinished(bool)"), convertFile )
    sys.exit(app.exec_())
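For comparison, the link-extraction step of the Haskell script above can also be sketched in Python using only the standard library. This is a rough sketch, not a tested replacement: the names PdfLinkFinder and rendering_url are made up for illustration, the sample markup is invented, and unlike the Haskell version it looks up the href attribute by name instead of taking the first attribute.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

# Title text and base url taken from the Haskell script in this post.
PDF_LINK_TITLE = "Download a PDF version of this wiki page"
BASE_URL = "http://en.wikipedia.org"


class PdfLinkFinder(HTMLParser):
    """Remember the href of the first <a> whose title is PDF_LINK_TITLE."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.href = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything once a matching link has been found.
        if tag != "a" or self.href is not None:
            return
        attributes = dict(attrs)
        if attributes.get("title") == PDF_LINK_TITLE:
            self.href = attributes.get("href")


def rendering_url(page_html):
    """Return the absolute rendering url, or None if the link is absent."""
    finder = PdfLinkFinder()
    finder.feed(page_html)
    if finder.href is None:
        return None
    return BASE_URL + finder.href


# Invented sample markup, only for illustration (the href is hypothetical).
sample = (
    '<div class="portal" id="p-coll-print_export">'
    '<a href="/w/index.php?title=Special:Book" '
    'title="Download a PDF version of this wiki page">Download as PDF</a>'
    '</div>'
)
print(rendering_url(sample))
```

Returning None when the link is missing mirrors the Maybe String result of parse in the Haskell version, so the caller decides what to do when the sidebar link is not there.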
Hello all, my name is Mukesh Tiwari and I graduated from