I wrote a class that adds a feature to the Apache PDFBox library.

The added feature: converting a PDF file into a text file while preserving the layout of the PDF.
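
To give a rough idea of what "keeping the layout" means, here is a minimal sketch of the general technique, not the actual code of my class. It assumes each piece of text arrives with the position of its left edge and its distance from the top of the page, and it maps those positions onto a grid of characters. The names `LayoutGridSketch`, `Fragment`, `CHAR_WIDTH` and `LINE_HEIGHT` are hypothetical, and the two constants are assumed values that would really depend on the fonts in the document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical illustration of layout-preserving text output; not the library's API. */
class LayoutGridSketch {

    /** A piece of text with its left edge (x) and distance from the page top (y), in points. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Assumed average glyph width and line height, in points.
    static final double CHAR_WIDTH = 6.0;
    static final double LINE_HEIGHT = 12.0;

    /** Renders the fragments as plain text, padding with spaces so columns line up. */
    static String render(List<Fragment> fragments) {
        // Group fragments by output row; TreeMap keeps rows ordered top to bottom.
        TreeMap<Integer, List<Fragment>> rows = new TreeMap<>();
        for (Fragment f : fragments) {
            int row = (int) Math.round(f.y / LINE_HEIGHT);
            rows.computeIfAbsent(row, k -> new ArrayList<>()).add(f);
        }

        StringBuilder out = new StringBuilder();
        for (List<Fragment> row : rows.values()) {
            // Within a row, place fragments from left to right.
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder line = new StringBuilder();
            for (Fragment f : row) {
                int column = (int) Math.round(f.x / CHAR_WIDTH);
                // Pad with spaces until the fragment's target column is reached.
                while (line.length() < column) {
                    line.append(' ');
                }
                // If earlier text already ran past the column, keep one separating space.
                if (line.length() > column && line.length() > 0) {
                    line.append(' ');
                }
                line.append(f.text);
            }
            out.append(line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Fragment> fragments = Arrays.asList(
                new Fragment("08:15", 60, 24),
                new Fragment("Line", 0, 12),
                new Fragment("12", 0, 24),
                new Fragment("Time", 60, 12));
        System.out.print(render(fragments));
    }
}
```

In this sketch, text that starts at the same horizontal position in the PDF starts at the same character column in the output, which is what makes tables readable as plain text. Vertical gaps between rows are not reproduced here.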

This is particularly useful when you need to extract data from tables or forms embedded in a PDF document.
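
Once the layout is preserved, pulling rows out of a table becomes ordinary string handling. The sketch below shows one possible approach, assuming that columns in the extracted text are separated by at least two spaces and that cells themselves contain only single spaces. The class `TableRowSketch`, its methods, and the sample timetable are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that splits layout-preserved text into table cells. */
class TableRowSketch {

    /** Splits one line into cells, treating runs of two or more spaces as column gaps. */
    static List<String> splitColumns(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s{2,}"));
    }

    /** Parses every non-blank line of the text into a list of cells. */
    static List<List<String>> parseTable(String text) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            List<String> cells = splitColumns(line);
            if (!cells.isEmpty()) {
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample of what layout-preserved timetable text could look like.
        String text = "Stop          Departure\n"
                + "Main Station  08:15\n"
                + "\n"
                + "Old Town      08:27\n";
        for (List<String> row : parseTable(text)) {
            System.out.println(row);
        }
    }
}
```

Splitting on whitespace runs is the simplest option, but an empty cell shifts every cell after it one column to the left. For tables with optional cells, cutting each line at fixed character offsets taken from the header row would be more robust.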

See it on GitHub.

See the Hacker News discussion.





I used this program to parse all the PDF timetables from my local bus company, and built myself a mobile app that gives offline access to the schedule of every bus line.

I also wrote a program that parses PDF bank statements and presents the spending visually.





name: PDF LAYOUT TEXT STRIPPER
date: 2015
tech: Java, Apache PDFBox
credits: Jonathan Link