IISc Bengaluru research paper on AI training reaches CVPR 2026 finals

A research paper authored by a team of scientists from the Indian Institute of Science (IISc) in Bengaluru reached the finals of the Conference on Computer Vision and Pattern Recognition (CVPR) 2026 in Colorado, USA, earlier this month. The paper was selected among the top 15 out of approximately 16,000 submissions globally.

The paper, titled "Rethinking Dataset Distillation: Hard Truths about Soft Labels," was written by Priyam Dey, R Venkatesh Babu, Additya Sahdev, Sunny Bhati, and Konda Reddy Mopuri. All five authors are researchers from the Computational and Data Science (CDS) Department at the IISc campus in Bengaluru.

The annual CVPR conference is a major global event on computer vision, the field concerned with how computers recognize and interpret images. The institute announced the paper's selection in the top 15 earlier this month.

The team's research centers on "dataset distillation," a process that could drastically lower the cost of training artificial intelligence models. Training AI models currently requires vast amounts of data, and processing that data demands significant electricity, Graphics Processing Units (GPUs), and other infrastructure.

Professor R Venkatesh Babu, the head of the CDS Department, explained that token-based AI systems such as Claude are very expensive to run because of these infrastructure and power demands. The research questions the continued use of massive datasets, suggesting that a handful of carefully selected samples, or even random samples, can yield the same training accuracy.
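
As a rough illustration of the random-sample idea described above, the sketch below draws a small random subset with a fixed number of items per category. It is not the IISc team's method: the function name sample_per_class, the per_class parameter, and the toy labels are assumptions made only for this example.

```python
import random
from collections import defaultdict


def sample_per_class(labels, per_class, seed=0):
    """Pick up to `per_class` random item indices for every label.

    Hypothetical helper for illustration only; it is not taken from the paper.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group item indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Draw a random sample without replacement from each class.
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        k = min(per_class, len(indices))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy stand-in for a labeled dataset: 10,000 items spread evenly over 10 classes.
labels = [i % 10 for i in range(10_000)]
subset = sample_per_class(labels, per_class=5)
print(len(subset), "of", len(labels), "items kept")  # 50 of 10000 items kept
```

A model trained on such a subset would then be compared with one trained on the full dataset. Whether the two reach similar accuracy is the question the paper examines, and this sketch does not answer it.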

According to Professor Babu, the volume of data currently fed into AI systems is so large that developers rarely scrutinize it, and the hardware needed to process it generates significant carbon emissions. The researchers believe that reducing the amount of data required for training could significantly lower the carbon footprint of AI technologies.

So far, the IISc team has applied this research to image classification, such as sorting one million images into 1,000 different categories. However, the researchers noted that the same principles could be applied to other domains, including audio samples.