Observing OCR Technologies for PDF Parsing
I’ve gotten the opportunity to investigate some Java-based OCR technologies recently for the purpose of analyzing PDFs, and wanted to write about some aspects of them that aren’t very well-documented. I hope to incorporate this into these tools' documentation at some point, but for now, here it is... in loooong prose.
TightOCR
Couldn’t get this one working at all. I was hoping to run it in Python, but it claims that certain functions for parsing JPGs, TIFFs, and PNGs do not exist, even though Tesseract on the command line obviously handles these file types adroitly. It also depends on CTesseract, which does not seem to have been updated for the revised Tesseract APIs (function headers with more arguments) in Tesseract version 3.03, so you have to install Tesseract 3.02 to work with CTesseract.
Tess4J
This was a real hassle to install on my Mac. I first tried to compile everything from scratch with GCC, but faced a number of weird compilation problems. Here was the dependency chart, listed backwards (the things to build first are at the top):
- libtool
- Leptonica
- Tesseract
- Ghostscript
- Tess4J
Once I installed Homebrew (brew) and used it to install libtool, I was able to successfully compile the other libraries. Even then, Tess4J still required some Java dependencies that weren’t easily resolved. What did the trick was switching to a Maven project and letting Maven install Tess4J by adding this to my pom.xml file:
<dependency>
  <groupId>net.sourceforge.tess4j</groupId>
  <artifactId>tess4j</artifactId>
  <version>2.0.1</version>
</dependency>
After simply allowing Maven to configure Tess4J, I still had to tell it where to find its native dependencies (various .dylib files on the Mac). Ghostscript and Tesseract ended up installing themselves in two different locations, and Eclipse did not properly split the path given in -Djava.library.path on ; or :, so a single command-line option would not work. Instead, I set up an environment variable on the VM called LD_LIBRARY_PATH, and set it to /opt/local/lib:/usr/local/Cellar/ghostscript/9.16/lib, the value I was hoping to put on the “command line” when running Java.
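If the native libraries still aren't found, a quick sanity check is to split the variable the same way the loader would and see which directories actually exist. Here's a minimal sketch using only the standard library; the class name LibraryPathCheck and the method splitPath are made up for illustration, and it assumes the variable is LD_LIBRARY_PATH as in my setup.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class LibraryPathCheck {
    /** Splits a path-list value on the given separator, skipping empty entries. */
    public static List<String> splitPath(String value, String separator) {
        List<String> entries = new ArrayList<>();
        if (value == null) {
            return entries; // variable not set: nothing to check
        }
        for (String entry : value.split(Pattern.quote(separator))) {
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Assumption: the libraries are located through LD_LIBRARY_PATH.
        String value = System.getenv("LD_LIBRARY_PATH");
        for (String dir : splitPath(value, File.pathSeparator)) {
            String status = new File(dir).isDirectory() ? "found" : "missing";
            System.out.println(dir + " (" + status + ")");
        }
    }
}
```

Running this from the same Eclipse run configuration as the OCR code should show whether both the Tesseract and Ghostscript directories made it into the environment.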
Once I reached this stage, it was time to utilize it to read from PDFs. The results were very Tesseract-y (i.e. L’s tend to become |_), but luckily, it seemed to do a fairly good job overall. However, it couldn’t read any data contained inside tables, which renders it relatively useless if you’re trying to parse data from, say, tax returns or product datasheets. At first, I was thinking of finding a way to expose image-cropping tools from Leptonica to Java. There is a nice solution for this in the Tess4J API, though, that’ll allow you to crop a PDF down to the specific area you care about:
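// Needs java.io.File, java.awt.Rectangle, and net.sourceforge.tess4j.Tesseract
File imageFile = new File("/path/to/my.pdf");
Tesseract instance = Tesseract.getInstance();
// rectangle is a java.awt.Rectangle; doOCR returns the recognized text
String text = instance.doOCR(imageFile, rectangle);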
Of course, one thing the documentation never mentions about this bounding rectangle (yet is very important) is which units you need to specify it in. Want to know the Tess4J bounding box rectangle units? They're pixels at the document's resolution, so the numbers scale with its DPI. As such, if you want a 2” x 2” rectangle that starts 1” in from the left and 1” down from the top, and your PDF is 300dpi, you would define your rectangle as follows:
instance.doOCR(imageFile, new Rectangle(300, 300, 600, 600));
Note that the rectangle is defined as (X distance from left, Y distance from top, width (to the right), height (downward)), all in "dpi-dots" (i.e. 300 "dpi-dots" per inch with a document of 300dpi).
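Since the numbers depend on the resolution, it can help to do the inches-to-dots arithmetic in one place. Here's a minimal sketch using only java.awt.Rectangle; the class name OcrRegion and the method fromInches are my own inventions for illustration, not part of Tess4J, and it assumes you already know the DPI of the page.

```java
import java.awt.Rectangle;

public class OcrRegion {
    /**
     * Converts a region measured in inches from the top-left corner of the page
     * into a rectangle measured in dots at the given DPI.
     */
    public static Rectangle fromInches(double left, double top,
                                       double width, double height, int dpi) {
        return new Rectangle(
                (int) Math.round(left * dpi),    // X distance from the left edge
                (int) Math.round(top * dpi),     // Y distance from the top edge
                (int) Math.round(width * dpi),   // width, going right
                (int) Math.round(height * dpi)); // height, going down
    }

    public static void main(String[] args) {
        // 2" x 2" box, 1" in from the left and 1" down from the top, at 300dpi
        Rectangle region = fromInches(1, 1, 2, 2, 300);
        System.out.println(region.x + ", " + region.y + ", "
                + region.width + ", " + region.height); // 300, 300, 600, 600
    }
}
```

With that helper, the call above could be written as instance.doOCR(imageFile, OcrRegion.fromInches(1, 1, 2, 2, 300)).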
Overall, once the installation headaches were solved, it works pretty nicely, and does exactly as expected when reading from fields. However, reading from fields is Tesseract-y, slow compared with iText (covered next), and fetches exactly what happens to be within the rectangle you ask for, meaning that it may crop letters and symbols falling out of bounds.
Another interesting note is how some facets of this library appear to be aging: the argument taken by the Tesseract object’s doOCR() function is a File (java.io), the older file API that Java 7 supplemented with Path and Files (java.nio.file). This also seems to hold true for their slightly different Tesseract1 object.
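If the rest of your code already uses java.nio.file, bridging to the older type is a one-liner, since Path has a toFile() method. A minimal sketch follows; the class name PathToFile and the helper toFile are hypothetical, just there to show the conversion.

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathToFile {
    /** Builds a Path from its pieces and converts it to the older java.io.File. */
    public static File toFile(String first, String... more) {
        Path path = Paths.get(first, more);
        return path.toFile();
    }

    public static void main(String[] args) {
        File imageFile = toFile("/path/to", "my.pdf");
        // imageFile can now be handed to an API that only accepts java.io.File
        System.out.println(imageFile.getName());
    }
}
```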
iTextPDF
This is an extremely simple library to install if you have a Maven project. All you need to do is add the following dependency:
<dependency>
  <groupId>com.itextpdf</groupId>
  <artifactId>itextpdf</artifactId>
  <version>5.0.6</version>
</dependency>
Then add these imports:
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.*;
It is fairly simple to read an entire document. The Java code is a touch more complex to set up for reading from a particular user-defined rectangle, though:
PdfReader reader = new PdfReader("/path/to/my.pdf");
// Keep only text that touches the region described by rectangle
RenderFilter filter = new RegionTextRenderFilter(rectangle);
TextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
// pageNum is 1-based
String output = PdfTextExtractor.getTextFromPage(reader, pageNum, strategy);
Nevertheless, it works flawlessly once you get it set up. Finding the correct specification for the bounding rectangle was a bit tricky, though, because, of course, the units iText prefers have nothing to do with the ones Tess4J uses. Also, like with Tess4J, the units to use in the rectangle are not specified in the documentation. It's as if we're expected to read the minds of the original developers. Through experimentation (made difficult because it returns all text from any object that overlaps the rectangle, rather than strictly the text within the rectangle), I found that iText doesn’t want DPI-dots but points (there are always 72 points per inch). Also, the Y-origin is at the bottom of each page, which is actually the standard for PDF files (rather than at the top, which is how Tess4J counts).
Also, as mentioned earlier, iText pulls all text contained within any object whose bounds overlap the rectangle you specify, rather than simply the text within the rectangle. I imagine this is because they’re actually reading the data from the PDF and pulling text directly from the objects rather than doing OCR. As such, I haven’t seen any errors in the results from iText (e.g. no “L” -> “|_”), and it runs much faster than Tess4J.
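To make the overlap-versus-containment distinction concrete, here's a small sketch using plain java.awt.geom rectangles. To be clear, this is not iText's actual code; the class name OverlapVsContain and both methods are hypothetical, and the text bounds are made-up numbers. It only illustrates the behavior I observed: a text object that merely touches the region still gets returned.

```java
import java.awt.geom.Rectangle2D;

public class OverlapVsContain {
    /** True if the text object's bounds touch the region at all (what I observed). */
    public static boolean overlaps(Rectangle2D region, Rectangle2D textBounds) {
        return region.intersects(textBounds);
    }

    /** True only if the text object's bounds lie entirely inside the region. */
    public static boolean fullyInside(Rectangle2D region, Rectangle2D textBounds) {
        return region.contains(textBounds);
    }

    public static void main(String[] args) {
        // Region in points: x from 72 to 216, y from 576 to 720
        Rectangle2D region = new Rectangle2D.Double(72, 576, 144, 144);
        // Made-up line of text that starts inside the region and runs past its right edge
        Rectangle2D text = new Rectangle2D.Double(200, 600, 100, 12);

        System.out.println(overlaps(region, text));    // true
        System.out.println(fullyInside(region, text)); // false
    }
}
```

In practice this means you may want to make the rectangle a little tighter than the field you care about, or post-process the returned string.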
To specify the bounding box for the same area as above (1” from the top left corner, and 2” each side), now we must assume you have a page that’s 11” tall (US Letter size, Portrait orientation). In that case, you would use:
...
RenderFilter filter = new RegionTextRenderFilter(new Rectangle(72, 576, 144, 144));
...
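As these arguments go, 72 sets your X distance as 1 inch away from the left edge, 576 sets your Y distance as 8 inches up from the bottom edge (11 inches of page, minus the 1 inch above the box and the 2 inches of the box itself), 144 is the width going to the right of X, and 144 is the height of the rectangle going up from Y. This reading assumes Rectangle is java.awt.Rectangle, as in the Tess4J example; com.itextpdf.text.Rectangle instead takes the coordinates of two opposite corners.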
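Flipping the Y axis by hand is easy to get wrong, so here's a minimal sketch that converts a box measured in inches from the top-left corner into points measured from the bottom-left. The class name PdfRegion and the method fromTopLeftInches are hypothetical, and you have to supply the page height yourself (11 inches for US Letter in portrait).

```java
import java.awt.geom.Rectangle2D;

public class PdfRegion {
    public static final double POINTS_PER_INCH = 72.0;

    /**
     * Converts a box given in inches from the page's top-left corner into a
     * rectangle in points whose origin is the page's bottom-left corner.
     */
    public static Rectangle2D.Double fromTopLeftInches(double left, double top,
                                                       double width, double height,
                                                       double pageHeightInches) {
        double x = left * POINTS_PER_INCH;
        // The bottom edge of the box sits (page height - top - height) above the page bottom.
        double y = (pageHeightInches - top - height) * POINTS_PER_INCH;
        return new Rectangle2D.Double(x, y, width * POINTS_PER_INCH, height * POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        Rectangle2D.Double r = fromTopLeftInches(1, 1, 2, 2, 11);
        System.out.println(r.x + ", " + r.y + ", " + r.width + ", " + r.height);
        // 72.0, 576.0, 144.0, 144.0
    }
}
```

The result could then be handed to RegionTextRenderFilter in place of the hand-computed Rectangle, assuming your iText version accepts a java.awt.geom.Rectangle2D there (the java.awt.Rectangle used above is one).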
Hopefully you find this useful in your quest to extract data from PDFs. May your data-scraping activities go much smoother!