Observing OCR Technologies for PDF Parsing

I recently had the opportunity to investigate some Java-based OCR technologies for analyzing PDFs, and wanted to write about some aspects of them that aren’t very well documented. I hope to incorporate this into these tools' documentation at some point, but for now, here it is... in loooong prose.

TightOCR

Couldn’t get this one working at all. I was hoping to run it from Python, but it tends to claim that certain functions for parsing JPGs, TIFFs, and PNGs do not exist, even though Tesseract on the command line clearly handles these file types adroitly. It also depends on CTesseract, which does not seem to have been updated for the revised Tesseract APIs (function headers with more arguments) introduced in Tesseract version 3.03, so you have to install Tesseract 3.02 to work with CTesseract.

Tess4J

This was a real hassle to install on my Mac. I first started by trying to compile everything from scratch and use GCC, but faced a ...