Extract reference info from PDFs

Sunday, 27 May 2007

I have collected an enormous number of scientific publications in PDF format over the last couple of years, and I was wondering whether it would be possible to automatically extract the reference information (author names, title, journal, date) from a PDF (unless it is a scanned PDF, of course; then you would first need OCR software…). The only journal I have had some success with is the Journal of Geophysical Research, because its articles carry the full reference on the title page: using Perl and a Perl add-on called CAM::PDF I was able to extract the references. But it would be much nicer to have software that tries to find the reference for almost any scientific PDF. I found this publication describing a possible method. Perhaps the best solution is to add an XML header containing the reference information to scientific PDFs. This could then be read by bibliography software like EndNote…
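To give an idea of the title-page approach, here is a minimal sketch in Python rather than Perl. It assumes the text of the first page has already been extracted (with CAM::PDF or any other tool) and that the page contains a line of the form "Citation: authors (year), title, J. Geophys. Res., volume, article number". That layout, the pattern `CITATION_RE`, the function `parse_citation` and the sample text are all assumptions made up for illustration; real title pages will need a more forgiving pattern.

```python
import re

# Hypothetical pattern for a "Citation:" line on the title page.
# The exact layout is an assumption, not something every PDF follows.
CITATION_RE = re.compile(
    r"Citation:\s*(?P<authors>.+?)\s*\((?P<year>\d{4})\),\s*"
    r"(?P<title>.+?),\s*(?P<journal>J\. Geophys\. Res\.),\s*"
    r"(?P<volume>\d+),\s*(?P<article>[A-Z0-9]+)"
)


def parse_citation(page_text):
    """Return the reference fields as a dict, or None if no citation line is found."""
    # Collapse line breaks and repeated spaces left over from PDF text extraction.
    flat = " ".join(page_text.split())
    match = CITATION_RE.search(flat)
    if match is None:
        return None
    return match.groupdict()


# Invented sample text, not a real reference.
sample = """Citation: Doe, J., and R. Roe (2006), A made-up title
for testing, J. Geophys. Res., 111, D00000."""

if __name__ == "__main__":
    print(parse_citation(sample))
```

The function returns `None` whenever the pattern does not match, which makes it easy to list the PDFs that still need manual attention.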

2 thoughts on “Extract reference info from PDFs”
  • Robert says:

    Web of Science is what you need mate: http://isiknowledge.com/

    It’ll probably only work at the uni though since you need IP identification.

    From there you can export your references to your EndNote online library. There you can manage your references and further export them as BibTeX, EndNote or any other format.

  • leukvoorj says:

    Yeah, I know that one, but I was wondering if it is possible to extract reference info from my PDF collection…
