OCR 2.0: VLM Structured Output

This talk explores using small vision-language models to create instruction-augmented OCR systems that produce structured outputs for complex, multi-format documents.

Overview

OCR is almost a solved problem, but not yet a generalizable one. I am not an industry expert, but from talking to some experts I learned that classical OCR systems (object detection, text recognition, rule-based) normally parse the whole document and return loosely organized results. A lot of manual post-processing is involved with this kind of OCR. The amount of post-processing and hardcoded logic needed for a family of documents grows with the complexity of the documents (for example, when a document combines tables, images, and so on).

With the rise of LLMs and Vision Language Models (VLMs), these problems can be solved in a way that generalizes over a range of documents. Last week, I started a new side project: building a more generalized OCR pipeline in which the user uploads a document and also provides the expected output schema. The pipeline performs the OCR and structures the result so that it adheres to the uploaded schema. I have tried it on a range of documents (complex documents with tables, invoices, and multilingual documents).
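The schema-driven step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the schema format (field name mapped to a Python type), the `INVOICE_SCHEMA` example, and the helper names `build_prompt` and `validate_output` are all assumptions made for the example, and the call to the vision-language model itself is left out.

```python
import json
from typing import Any

# Hypothetical user-supplied schema: field name -> expected Python type.
INVOICE_SCHEMA = {"invoice_number": str, "total": float, "line_items": list}


def build_prompt(schema: dict[str, type]) -> str:
    """Render the schema as an instruction to send alongside the document image."""
    fields = ", ".join(f'"{name}" ({tp.__name__})' for name, tp in schema.items())
    return f"Extract these fields from the document and reply with JSON only: {fields}"


def validate_output(raw: str, schema: dict[str, type]) -> dict[str, Any]:
    """Parse the model's reply and check that it adheres to the schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("model reply is not a JSON object")
    missing = [name for name in schema if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    result: dict[str, Any] = {}
    for name, tp in schema.items():
        value = data[name]
        # JSON has a single number type, so accept ints where floats are expected.
        if tp is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, tp):
            raise ValueError(f"field {name!r} should be {tp.__name__}")
        result[name] = value
    # Fields outside the schema are dropped.
    return result
```

In a pipeline of this shape, a reply that fails validation could be retried or routed to a different model, which is one place where running small models locally keeps the loop cheap.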

Now, I do not want to make this yet another OpenAI wrapper, for several reasons:

For enterprise-focused documents, invoices might contain a lot of PII, and users would not be comfortable handing them to a third party.
I ran a number of experiments and learned that I can build very good pipelines with models under 7B parameters. I have been using ensembling approaches, and they really work. For instance, there is a recent model called GOT-OCR 2.0 that gives awesome results even though it is based on Qwen 0.5B. Other open-source models, such as InternLM, the Qwen models, and Llama 3.2 Vision, are also amazing and adhere to the schema.
It also gives me full flexibility to further fine-tune models on very complex documents where they fail to give results.
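
The abstract does not say which ensembling method is used. One simple possibility, sketched here purely as an assumption, is a field-level majority vote over the structured outputs of several small models. The function name `majority_vote` and the sample predictions are hypothetical.

```python
import json
from collections import Counter
from typing import Any


def majority_vote(predictions: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge per-model outputs by keeping the most common value for each field."""
    merged: dict[str, Any] = {}
    # Collect every field name that any model produced.
    fields = sorted({name for pred in predictions for name in pred})
    for field in fields:
        # Serialize values so that unhashable ones (lists, dicts) can be counted.
        votes = Counter(
            json.dumps(pred[field], sort_keys=True)
            for pred in predictions
            if field in pred
        )
        # On a tie, the value seen first (from the earliest model) wins.
        winner, _count = votes.most_common(1)[0]
        merged[field] = json.loads(winner)
    return merged


# Hypothetical outputs from three different models for the same invoice.
outputs = [
    {"invoice_number": "INV-1", "total": "120.00"},
    {"invoice_number": "INV-1", "total": "128.00"},
    {"invoice_number": "INV-I", "total": "120.00"},
]
print(majority_vote(outputs))
```

A vote like this only helps when the models make different mistakes, so it is worth checking per field how often they actually disagree.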

Speaking of fine-tuning, I have certainly faced challenges. For example, I have been fine-tuning GOT-OCR for a lesser-known language, and I found that it overfitted to my training dataset and could not generalize to anything outside the training distribution. Generating data samples for fine-tuning is another challenge.

In this talk I will share these learnings and my roadmap in more detail, and then open things up for discussion.

PS: I will show a demo during the presentation.

Tech stack