New Delhi: The BharatGen project team unveiled Patram earlier this month, a seven-billion-parameter vision-language model designed for document analysis in India. The technology is comparable to Google’s NotebookLM or Nvidia’s ChatRTX. Patram enables users to query documents, even scanned or photographed ones, using natural language, with text-based outputs. Patram currently supports English instructions, with the open-source model available on AiKosh and Hugging Face. The team aims to expand to multilingual capabilities.
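For readers who want to try the model, the sketch below shows one way to query a document image through the Hugging Face transformers library. Patram’s exact model identifier, processor interface, and prompt format are assumptions here; the model card on Hugging Face is the authoritative reference.

```python
# A minimal sketch of querying a document image with a Hugging Face
# vision-language model. The model ID and prompt handling below are
# assumptions for illustration; check BharatGen's page on Hugging Face
# for Patram's actual identifier and usage instructions.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "bharatgenai/Patram-7B-Instruct"  # assumed identifier

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, trust_remote_code=True)

image = Image.open("scanned_invoice.png")  # a scanned or photographed document
question = "What is the total amount on this invoice?"

inputs = processor(images=image, text=question, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```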
Developing AI for Indic languages poses unique challenges: abugida writing systems are poorly supported by Unicode, and conventional software and hardware are primarily designed for English, with its discrete alphabet. Limited digitised data and inadequate tools complicate progress further. Led by Ravi Kiran Sarvadevabhatla of IIIT-Hyderabad, the Patram team tackled these hurdles by training the model on a mix of real data and synthetic question-answer pairs, using clustering-based data curation to ensure diverse document representation. Patram also offers an advanced API version that supports end-user applications, with added guardrails to keep answers aligned with the intended application. We spoke to Sarvadevabhatla to better understand the challenges and the creative approaches used to overcome them.
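The team has not published the details of its curation pipeline, but the idea behind clustering-based curation can be illustrated: embed the documents, cluster them, then sample evenly across clusters so that no single document type dominates the training mix. The featuriser, cluster count, and sampling rule below are illustrative choices, not Patram’s actual pipeline.

```python
# Illustrative sketch of clustering-based data curation for diversity.
# All parameter choices here (TF-IDF features, k-means, equal per-cluster
# sampling) are our own illustration, not BharatGen's method.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def curate(documents, n_clusters=10, per_cluster=100, seed=0):
    # Embed documents and group them into rough "document type" clusters.
    vectors = TfidfVectorizer(max_features=5000).fit_transform(documents)
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(vectors)

    # Sample a fixed budget from each cluster so the mix stays diverse.
    rng = np.random.default_rng(seed)
    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        take = rng.choice(members, size=min(per_cluster, members.size), replace=False)
        selected.extend(int(i) for i in take)
    return [documents[i] for i in selected]
```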
Roadmap for future development
As of now, Patram can process single-page documents in English, but the team plans to expand its capabilities. Sarvadevabhatla says, “We want to add capabilities in a couple of directions. One is the ability to handle more languages. Next on our roadmap would be Hindi, so basically a bilingual model, and thereafter go to multilingual, capable of handling all the 22 recognised languages, so that is one aspect. The second aspect is, we know that many documents are not just single page, they are a collection of pages, so multi-page. So that is another dimension that we want to work on. The third is what is known as grounding. For certain kinds of answers, we want some kind of attribution, like where in the document the data for a particular answer comes from. This is technically referred to as grounding, and we would like to add the grounding capability to the model. So these are the three primary directions which we want to pursue.”
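Grounding, as described above, means attaching to each answer a pointer to where in the document it was read from. The snippet below sketches one possible shape for such a grounded answer; the schema and field names are our illustration, not Patram’s planned output format.

```python
# A sketch of what a "grounded" answer could look like once the capability
# lands: the answer text plus a pointer to the page region it was read from.
# This schema is hypothetical, not Patram's planned output format.
from dataclasses import dataclass

@dataclass
class GroundedAnswer:
    text: str    # the answer itself
    page: int    # page index (relevant once multi-page support lands)
    bbox: tuple  # (x0, y0, x1, y1) region, normalised to [0, 1]

answer = GroundedAnswer(
    text="The total amount is Rs. 4,200.",
    page=0,
    bbox=(0.62, 0.81, 0.93, 0.86),
)
print(f"Answer: {answer.text} (page {answer.page}, region {answer.bbox})")
```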
The model is also being developed to handle the unique requirement of references to other documents. Sarvadevabhatla explains, “Let us take one use case, government regulations. There are lots of government regulations that are released, and typically, these are released in a PDF format. Now, you may have seen some regulations which refer to other regulations, with reference to a previous regulation. This is the case with not only government regulations but all office documents; you refer to some other document. Now, if answering a question well requires following this chain, then it is important that we have a mechanism to link all these things together. So we want to enable these kinds of applications to be built. That is, you start with a bunch of PDFs, but the model can recognise that there is a particular link to another PDF. So, circular number so and so, okay, so I have to go and refer to that. So now, when you ask a question, it can actually jump across documents. That is where eventually we want to go. What we have developed is the first stage in such a system. By unlocking the information in these kinds of documents, we can get closer to the connected document hubs that we want to build.”
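One way to picture the document-linking stage Sarvadevabhatla describes is as a reference graph: scan each document’s text for citations of other circulars and record the links, so a question-answering system can hop across documents when an answer requires it. The reference pattern and document IDs below are hypothetical.

```python
# Illustrative sketch of cross-document linking: find references to other
# circulars in each document's text and build a graph to traverse.
# The reference pattern and document IDs are hypothetical examples.
import re

REF_PATTERN = re.compile(r"[Cc]ircular\s+[Nn]o\.?\s*([\w/\-]+)")

def build_reference_graph(documents):
    """Map each document ID to the set of circular numbers it cites."""
    return {
        doc_id: set(REF_PATTERN.findall(text))
        for doc_id, text in documents.items()
    }

docs = {
    "circular-2024-17": "With reference to Circular No. 2023/09, fees are revised ...",
    "circular-2023-09": "This supersedes Circular No. 2021/05 ...",
}
print(build_reference_graph(docs))
```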
From India and for India
Self-reliance is one of the driving motivations for developing the technology. Sarvadevabhatla says, “So far, if you look at the discourse in tech circles about these kinds of models, it has focused a lot on speech and text. Even other entities have been talking a lot about speech and text models. With Patram, we are really breaking new ground. Patram is not just about training vision-language models; it is also about making a statement: that we are capable of training good foundational language models, for India and from India. There is more to it than just training this model. It is also a technology capability demonstration.”
Training the model from the ground up proved a rich learning experience for the team. Sarvadevabhatla adds, “That has been our experience as well; we discovered a lot of things that were not written anywhere, things we had to discover from scratch. We told ourselves that we do not want to simply piggyback on existing models and fine-tune them. Let us get hands-on experience of what it is to train these kinds of models from scratch. It has been a great learning experience for us as a team. We have set a good pace, and from here we can get to the capabilities that I have mentioned previously: additional languages, being capable of giving a reference for the answer, multi-page and so on.”