Suggestion for a future version: use the available text layer
The current state of OCR is still not close enough to 100% accuracy for most production use cases, IMO. Every state-of-the-art model we have tested produces inaccuracies in the extracted text. For example, we have one field where a long alphanumeric value contains an S in the middle of a run of digits, and it is almost always extracted as a 5.
Given that most PDFs these days have a text layer, and the OCR step is mostly about capturing the physical layout, why not use the text layer when present and avoid these transcription errors entirely? Even a model that is 99.9% accurate will produce many errors across 100,000 pages. Sometimes there is no choice but to run full OCR because a document has no text layer, but when one exists it makes sense to at least use it as context for the extraction to help avoid these kinds of errors.
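
To make the suggestion concrete, here's a rough sketch of the kind of per-page fallback I have in mind. This isn't your pipeline, obviously; the library choices (pypdf for the text layer, pdf2image + pytesseract for the OCR fallback) and the `MIN_TEXT_CHARS` threshold are just placeholders for illustration:

```python
# Sketch: prefer the embedded text layer, fall back to OCR only when a
# page has no usable text. Libraries and threshold are illustrative.
from pypdf import PdfReader
from pdf2image import convert_from_path
import pytesseract

MIN_TEXT_CHARS = 20  # below this, treat the page as image-only


def extract_pages(pdf_path: str) -> list[str]:
    reader = PdfReader(pdf_path)
    pages = []
    for index, page in enumerate(reader.pages):
        text = (page.extract_text() or "").strip()
        if len(text) >= MIN_TEXT_CHARS:
            # Text layer present: use it verbatim, no transcription risk.
            pages.append(text)
        else:
            # No usable text layer: rasterize just this page and OCR it.
            image = convert_from_path(
                pdf_path, first_page=index + 1, last_page=index + 1
            )[0]
            pages.append(pytesseract.image_to_string(image))
    return pages
```

Even if the OCR/layout model still runs on every page, feeding it the text-layer output as additional context would go a long way toward fixing the S-vs-5 class of errors.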
Thanks for taking this into consideration.