Future suggestion about using available text layers

#21
by jondecker76 - opened

The current state of OCR is still not close enough to 100% accuracy for most production use cases, IMO. Every state-of-the-art (SOTA) model we have tested produces inaccuracies in the extracted text (for example, we have one field where a long alphanumeric value has an S in the middle of a run of digits, and it almost always gets incorrectly extracted as a 5).

Given that most PDFs these days have a text layer, and the OCR step is more about capturing the physical layout, why not use the text layer when present to avoid these transcription errors? Even a model that is 99.9% accurate will still produce many errors across 100,000 pages. Sometimes there is no choice but to do full OCR because no text layer exists, but when one is present it makes sense to at least feed it to the extraction as context to help avoid these errors.
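To illustrate the idea, here is a minimal sketch (assuming `pypdf` is available) that prefers the embedded text layer and only falls back to OCR when a page has no extractable text. The `run_ocr` hook and the `min_chars` threshold are hypothetical placeholders, not part of any existing pipeline.

```python
# Sketch: prefer the PDF text layer, fall back to OCR only when it is missing.
# Assumes pypdf is installed; run_ocr() is a hypothetical hook for your OCR model.
from pypdf import PdfReader


def run_ocr(pdf_path: str, page_index: int) -> str:
    """Placeholder for whatever OCR pipeline you actually use."""
    raise NotImplementedError


def extract_page_text(pdf_path: str, page_index: int, min_chars: int = 20) -> str:
    reader = PdfReader(pdf_path)
    text = reader.pages[page_index].extract_text() or ""
    if len(text.strip()) >= min_chars:
        # Text layer present: use it verbatim, or pass it to the model as context.
        return text
    # No usable text layer on this page: fall back to full OCR.
    return run_ocr(pdf_path, page_index)
```

Even when OCR is still run for layout, the text-layer output could be supplied alongside the image so the model can cross-check ambiguous characters like S vs. 5.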

Thanks for taking this into consideration
