Sarvam AI, a Bengaluru-based startup, has claimed that its vision and speech models for Indian languages outperform those of global giants such as Google Gemini and ChatGPT. According to co-founder Pratyush Kumar, Sarvam Vision achieved an accuracy of 84.3% on olmOCR-Bench and 93.28% on OmniDocBench v1.5 for English subsets, surpassing leading models such as Gemini 3 Pro, and is able to handle various scan qualities and content types. The company’s Bulbul V3 text-to-speech model supports 35 voices across all 22 scheduled Indian languages.
Kumar described Sarvam Vision as the top model for Indian languages, with support for all 22 scheduled languages. The Vision series comprises a 3-billion-parameter state-space model capable of tasks such as image captioning, scene text recognition, chart interpretation, and table parsing. Sarvam AI’s core focus is democratizing access to artificial intelligence across India by building foundational components suited to the nation’s unique requirements.
In social media posts, Sarvam AI demonstrated the model extracting technical terms from complex tables and charts. Its capabilities also extend to general natural scene understanding, such as accurately describing images of scenic landscapes. Union IT Minister Ashwini Vaishnaw commended the startup’s work, calling it a testament to the success of India’s AI mission.
