This project aims to develop an efficient rule-based extractor of references from scientific papers written in English. The application takes a PDF file or a directory of PDFs and returns an HTML file containing the list of all reference entries with their respective titles. In addition, the title of each article is searched through the Google Web Service to obtain the URL that identifies the article on the web. If the page at that URL provides a BibTeX entry (as sites such as CiteSeer and IEEE Xplore do), it appears in the HTML under the corresponding entry. The application cannot search image-based PDF files. The project is released under the GNU General Public License.
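To give an idea of what "rule-based" means here, the following is a minimal sketch in Python, not the actual pdftoref code. It assumes the text has already been extracted from the PDF, that the reference list follows a "References" or "Bibliography" heading, that entries are numbered like [1], and that titles appear in double quotes. The function name, the regular expressions, and the sample entries are all hypothetical.

```python
import re

# Hypothetical rules for illustration; the real project may use different ones.
# A line that contains only the heading of the reference section.
HEADING_RE = re.compile(r"^[ \t]*(references|bibliography)[ \t]*$",
                        re.IGNORECASE | re.MULTILINE)
# A numbered marker such as "[12]" at the start of a line.
ENTRY_RE = re.compile(r"^[ \t]*\[(\d+)\][ \t]*", re.MULTILINE)
# A title in straight or curly double quotes, dropping a trailing comma or period.
QUOTED_TITLE_RE = re.compile(r"[\"\u201c](.+?)[,.]?[\"\u201d]")


def extract_references(text):
    """Return (number, entry, title) tuples found after the last heading.

    title is None when no quoted title is found in the entry.
    """
    headings = list(HEADING_RE.finditer(text))
    if not headings:
        return []
    section = text[headings[-1].end():]

    # split() with one capture group yields:
    # [preamble, number1, body1, number2, body2, ...]
    parts = ENTRY_RE.split(section)
    results = []
    for number, body in zip(parts[1::2], parts[2::2]):
        entry = " ".join(body.split())  # collapse line breaks and extra spaces
        match = QUOTED_TITLE_RE.search(entry)
        title = match.group(1) if match else None
        results.append((int(number), entry, title))
    return results


if __name__ == "__main__":
    # Made-up sample text, not taken from a real paper.
    sample = (
        "Introduction\n"
        "Some text.\n"
        "\n"
        "References\n"
        '[1] A. Author and B. Writer, "Parsing citations with rules," in Proc.\n'
        "    Example Conf., 2007, pp. 1-10.\n"
        "[2] C. Person. Another paper without quotes. Journal of Examples, 2006.\n"
    )
    for number, entry, title in extract_references(sample):
        print(number, title)
```

Entries without a quoted title come back with the title set to None, so a real extractor would need further rules (for example, based on punctuation or on the position of the year) to cover other citation styles.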
Involved Technologies: Python, Python Frameworks, PyDev, Google API, RegEx.
Released on Google Code at http://code.google.com/p/pdftoref/
Year: 2008.
See the live demo below: