
Extracting Information from Documents Using Images: Challenges and Solutions with OCR, Custom Processors and Multimodal Models

Writer: Marcos Recolons

Document digitization has brought with it a constant challenge: reliably extracting information when images do not follow a standardized pattern. Variable lighting, tilt, differing document types and uneven resolutions mean that, in many cases, basic OCR is insufficient. In this article, I present the experience of a real project where we addressed this challenge and explored several solutions: from traditional OCRs and custom processors to a migration to multimodal LLMs (Large Language Models). We will look at advantages, disadvantages and costs, and at why combining preprocessing techniques with the new generative models can deliver better results at a lower cost.





1. Introduction


The demand for automating the extraction of key data (customer name, date, invoice amount, etc.) from images has grown significantly. While OCR lets us read the text in an image and know the location of each word, the real problem often lies in correctly linking each label to its value. A slight change in capture angle, a tilted page, or a watermarked background can lead to inaccurate results.

In this article, I will explain:

  • How we got started with basic OCRs and discovered their limitations.

  • Our experience training custom processors (like Google's) to adjust to specific documents.

  • The move towards multimodal models, such as GPT or Gemini, which do not require such exhaustive labeling and provide great flexibility.

  • Preprocessing challenges (rotation, watermarking) and the solutions we found.




2. Basic OCR: Getting Started and Limitations


2.1 Text Recognition and Vertex Location

Basic OCRs, such as Tesseract or cloud services, offer:

  • Recognition of text present in the image.

  • Positioning (coordinates) of each word.

However, when trying to extract specific information (for example, “Total height” or “Weight”), a merely sequential reading of the text can confuse labels with contiguous values. In ideal conditions (well-scanned documents with no tilt), OCR obtains acceptable results. But reality showed us documents with varying capture angles, poor lighting and irregular resolutions.
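As a reference point, here is a minimal sketch of reading both the text and the word coordinates with pytesseract; the image path is a placeholder:

```python
import pytesseract
from PIL import Image

image = Image.open("document.png")  # placeholder path

# image_to_data returns, per detected word, its text, bounding box and confidence.
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

for text, x, y, w, h, conf in zip(
    data["text"], data["left"], data["top"],
    data["width"], data["height"], data["conf"]
):
    if text.strip():
        print(f"{text!r} at ({x}, {y}) size {w}x{h}, confidence {conf}")
```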

2.2 Key Problems

  • Spatial confusion: The model took as the “value” whatever word or number appeared closest in the image, without considering that the true value could sit above, below, or diagonally from the label.

  • Repeated labels: Sometimes the document repeated “Height” in several places, and the value ended up associated with the wrong instance of the label.

The main lesson learned was that simple text recognition was not sufficient to maintain label-value correspondence reliably.
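To make that failure mode concrete, here is the kind of naive nearest-box pairing that produces it. This is a hypothetical sketch; the box dictionaries follow the pytesseract layout shown earlier:

```python
import math

def nearest_value(label_box: dict, candidate_boxes: list[dict]) -> dict:
    """Pick the candidate whose center is closest to the label's center.
    This is exactly what goes wrong on tilted or multi-column documents:
    the geometrically closest box is often not the semantically linked value."""
    lx = label_box["left"] + label_box["width"] / 2
    ly = label_box["top"] + label_box["height"] / 2

    def distance(box: dict) -> float:
        cx = box["left"] + box["width"] / 2
        cy = box["top"] + box["height"] / 2
        return math.hypot(cx - lx, cy - ly)

    return min(candidate_boxes, key=distance)
```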



3. Custom Processors: Training and Deployment


3.1 Platform Selection and Market Analysis

To refine the extraction, there are custom processors on the market (e.g. Google Document AI or Amazon Textract) that allow you to train a model with examples of labeled documents. This option, however, usually involves a high cost, both at the training stage and in per-document prediction.

After trying several alternatives, we found that the Google suite offered very good performance:

  • Pre-trained models that recognize common layouts (invoices, receipts, passports, etc.).

  • Custom templates that require a data labeling phase to fit specific documents.


3.2 Training and Costs

Training the model consisted of:

  1. Label at least 100 documents, indicating where each field of interest is located.

  2. Upload this data to the platform and let the model learn to extract the information.
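Once the processor is trained, a single-document prediction from Python looks roughly like this. This is a minimal sketch using the official google-cloud-documentai client; the project, processor ID and file name are placeholders, and the EU endpoint is shown because of the GDPR point below:

```python
from google.api_core.client_options import ClientOptions
from google.cloud import documentai

# EU endpoint keeps processing within the EU (see the GDPR note below).
client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint="eu-documentai.googleapis.com")
)

# Placeholders: your project ID, region and trained processor ID.
name = client.processor_path("my-project", "eu", "my-processor-id")

with open("invoice.pdf", "rb") as f:
    raw_document = documentai.RawDocument(content=f.read(), mime_type="application/pdf")

result = client.process_document(
    request=documentai.ProcessRequest(name=name, raw_document=raw_document)
)

# Each entity is one extracted field: its label, text and model confidence.
for entity in result.document.entities:
    print(entity.type_, entity.mention_text, entity.confidence)
```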

Advantages:

  • We achieved 90% accuracy after labeling and training.

  • The model learned the typical spatial layout of the document and performed better than basic OCR.

Disadvantages:

  • Cost of use: between 20 and 30 cents per document for extraction.

  • Monthly fee: a fixed fee is paid for keeping the processor active (around €60 per month), unless the resource is deactivated automatically during inactivity hours.

  • Latency and location: new features are often available sooner in the US than in the EU, and deploying the model in the US may introduce additional latency and legal considerations (GDPR).


4. Multimodal LLM Models: A New Generation of Solutions


4.1 Why Test Multimodal Models?

With the arrival of generative models capable of processing both text and images (for example, GPT or Gemini), a less rigid alternative to custom processors emerges. These models can follow instructions, or “prompts,” that describe the task, with the image supplied alongside.

  • They do not require exhaustive labeling for each type of document.

  • Flexibility and speed in configuration: all you need to do is design a good prompt to request the desired information.


4.2 Extraction Techniques

Block extraction:

  • Group nearby fields in the image (e.g. “total height, width, weight”) and ask the model to extract all of these in a single instruction.

  • Avoid “jumping” between very distant places within the same prompt, as this can confuse the model.

Individual extraction:

  • For fields that are far apart in the image, or for unclear fields, extracting them separately sometimes improves accuracy.

  • This means issuing multiple prompts when more than one value is required.
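As an illustration of the block approach described above, here is a minimal sketch using the OpenAI Python SDK; the model name, file path and field list are illustrative, and the same pattern applies to Gemini's API:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("document.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# One prompt asks for a whole block of nearby fields at once.
prompt = (
    "Extract the fields 'Total Height', 'Width' and 'Weight' from the image. "
    "If a field is not clearly visible, use 'Not Readable'. "
    'Answer only with JSON: {"Total Height": "...", "Width": "...", "Weight": "..."}'
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; any vision-capable model works
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```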


4.3 Cost and Scalability

Compared to custom processors:

  • There is no fixed fee for having the service active.

  • In many cases, the API for multimodal models is cheaper depending on their size and the number of tokens consumed.

  • Almost immediate scalability: we can go from processing a few documents to thousands, as long as computing capacity is available.


5. Technical Challenges and Effective Solutions

5.1 Image Watermarks

In documents with a background watermark (even a faint one), both OCR and multimodal models often confuse the text: to the human eye the watermark is easy to ignore, but the algorithm sees superimposed text.

Solution:

  • Filter the image by reducing resolution or adjusting contrast until the watermark text becomes illegible to the model (a sketch follows this list).

  • Use simple OCR libraries (e.g. Tesseract) to “recognize and discard” what is likely to be part of the watermark before final extraction.
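A minimal sketch of this kind of filtering with Pillow; the file names are placeholders, and the threshold and scale factor would need tuning per document type:

```python
from PIL import Image, ImageEnhance

img = Image.open("document_with_watermark.png").convert("L")  # grayscale

# Downscale so the faint watermark blurs into the background.
w, h = img.size
img = img.resize((w // 2, h // 2))

# Boost contrast: dark foreground text stays legible, the pale watermark fades.
img = ImageEnhance.Contrast(img).enhance(2.0)

# Threshold: anything lighter than the cutoff becomes pure white.
img = img.point(lambda p: 255 if p > 160 else 0)

img.save("document_filtered.png")
```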

5.2 Image Rotation

When the photo arrives rotated or tilted, extraction is affected. Although some OCRs offer automatic rotation detection, it is not always reliable.

Solution:

  • Use Tesseract to estimate the rotation and a confidence index (see the sketch after this list).

  • Rotate the image through 4 angles (0°, 90°, 180°, 270°), run each one through Tesseract and choose the one with the highest confidence rating.

  • This process increases accuracy by identifying the correct orientation before extracting the fields.
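A sketch of that four-angle check using pytesseract; scoring by mean word confidence is one reasonable choice, and the file name is a placeholder:

```python
import pytesseract
from PIL import Image

def best_orientation(path: str):
    """Try the four right-angle rotations and keep the one that
    Tesseract reads with the highest mean word confidence."""
    image = Image.open(path)
    best_angle, best_conf = 0, -1.0
    for angle in (0, 90, 180, 270):
        rotated = image.rotate(angle, expand=True)
        data = pytesseract.image_to_data(rotated, output_type=pytesseract.Output.DICT)
        # conf is -1 for non-word boxes; keep real word confidences only.
        confs = [float(c) for c in data["conf"] if float(c) >= 0]
        mean_conf = sum(confs) / len(confs) if confs else 0.0
        if mean_conf > best_conf:
            best_angle, best_conf = angle, mean_conf
    return best_angle, image.rotate(best_angle, expand=True)

angle, upright = best_orientation("tilted_scan.jpg")
print(f"Best orientation: rotate by {angle} degrees")
```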


6. Conclusions and Future Perspectives

  1. Custom Processors vs. Multimodal LLMs

    • Cost: LLMs are often more attractive if the volume of documents is high, as they have no fixed monthly costs.

    • Accuracy: Training a custom processor can achieve great results, but the flexibility of a multimodal model allows it to adapt to a wide variety of documents.

    • Ease of implementation: LLMs dramatically reduce the need to label hundreds of documents.

  2. Importance of Preprocessing

    • Rotation correction and removing or reducing the impact of watermarks are key steps to ensure that extraction tools work properly.

  3. Looking to the Future

    • Next-generation models with advanced reasoning capabilities may be better able to handle complex documents and different scenarios without as much preprocessing.

    • Platforms are expected to emerge that offer a simple labeling interface for LLMs, allowing fine-tuning without retraining the entire model.

  4. Recommendations for New Projects

    • Evaluate ROI based on document volume and complexity.

    • Design a flexible pipeline where new preprocessing steps can be quickly incorporated.

    • Maintain a test environment to compare accuracy between a custom processor and a multimodal model, before deploying to production.




7. Annexes

Annex A: Example Prompt for a Multimodal LLM

Prompt: “Please see the attached image. I need you to extract the following fields: 'Total Height', 'Width', 'Weight'. If a field is not clearly visible, return 'Not Readable'. Provide the response in JSON format: {"Total Height": "x", "Width": "y", "Weight": "z"}.”
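When consuming the reply programmatically, a tolerant parse helps, since models sometimes wrap the JSON in prose or code fences. A minimal sketch:

```python
import json
import re

def parse_reply(reply: str) -> dict:
    """Pull the first JSON object out of a model reply, tolerating
    surrounding prose or Markdown code fences."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        # Fall back to the 'Not Readable' convention from the prompt.
        return {k: "Not Readable" for k in ("Total Height", "Width", "Weight")}
    return json.loads(match.group(0))
```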

Annex B: Comparative Table of Challenges

Challenge             | Cause                                                 | Proposed Solution                                    | Effectiveness
----------------------|-------------------------------------------------------|------------------------------------------------------|--------------
Watermarks            | Background text confuses the algorithm                | Filter the image (lower resolution, contrast)        | High
Image rotation        | Unscanned photograph, variable angles                 | Rotation detection algorithm (Tesseract, etc.)       | Medium-High
Latency in EU vs. US  | New features roll out earlier in the US; network lag  | Choose the EU region if GDPR compliance is required  | Variable

Final Conclusion

Extracting information from document images is a process that combines OCR techniques, preprocessing algorithms and, increasingly, the power of multimodal models. The choice of the ideal solution depends on the budget, the diversity of documents and the scale of the project. Although custom-trained (“ad hoc”) processors offer great accuracy, multimodal LLMs have become an excellent alternative thanks to their lower cost, flexibility and rapid adaptability, without the need to label large amounts of data.

Staying up to date with the latest AI developments and having a robust preprocessing pipeline are key elements to ensure success in large-scale digitization and information extraction projects.



