Professor Baharan Mirzasoleiman has received a National Science Foundation CAREER award, the agency’s highest honor for faculty members at the start of their research and teaching careers. This award includes a five-year grant to support her research on Coresets for Robust and Efficient Machine Learning.

Large datasets have enabled modern machine learning models to achieve unprecedented success in applications ranging from medical diagnostics to urban planning and autonomous driving. However, learning from massive data depends on exceptionally large and expensive computational resources. Such infrastructure consumes substantial energy, leaves a large carbon footprint, and often becomes obsolete and turns into e-waste within a few years. While there has been a persistent effort to improve the performance and reliability of machine learning models, their sustainability is frequently neglected. This project aims to address the sustainability, reliability, and efficiency of machine learning by selecting the most relevant data for training. The resulting algorithms will be broadly applicable for learning from massive datasets across a wide range of applications, such as medical diagnosis and environmental sensing.

The main objective of this project is to develop a new generation of theoretically rigorous methods that enable efficient and robust learning from massive datasets. To achieve this goal, this project will develop scalable combinatorial optimization algorithms to extract weighted subsets (coresets) of data that guarantee training dynamics similar to those of training on the full data. This enables sustainable, efficient, and accurate learning from massive data. As datasets grow larger, maintaining their quality becomes very expensive, so mislabeled and malicious examples become ubiquitous in large datasets. To ensure reliability in addition to sustainability and efficiency, the developed techniques will be leveraged to extract coresets of data points that enable robust learning against noisy labels and adversarial attacks. This project will also seek to learn better objectives to automatically extract the most valuable data for efficient and robust learning from massive data. Finally, this research will enable efficient and robust learning frameworks that can be applied to many real-world applications through interdisciplinary collaborations.
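
For readers unfamiliar with the idea of a weighted subset, the following is a minimal illustrative sketch, not the project's actual method. It assumes a generic greedy facility-location selection over feature vectors, a standard combinatorial heuristic, and the function name `greedy_coreset` and its parameters are hypothetical. Each selected point receives a weight equal to the number of data points it stands in for, so that a weighted sum over the subset can roughly approximate a sum over the full data.

```python
import numpy as np

def greedy_coreset(features, k):
    """Pick k rows of `features` by greedy facility-location maximization.

    Hypothetical helper for illustration only. Returns (indices, weights),
    where weights[j] counts the points most similar to indices[j].
    """
    n = features.shape[0]
    # Pairwise Euclidean distances, turned into non-negative similarities.
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    sims = dists.max() - dists

    selected = []
    best_sim = np.zeros(n)  # best similarity of each point to the selected set
    for _ in range(k):
        # Marginal gain of each candidate: total improvement in coverage.
        gains = np.maximum(sims, best_sim[None, :]).sum(axis=1) - best_sim.sum()
        if selected:
            gains[selected] = -np.inf  # never pick the same point twice
        choice = int(np.argmax(gains))
        selected.append(choice)
        best_sim = np.maximum(best_sim, sims[choice])

    # Weight each selected point by the number of points it represents.
    assignment = np.argmax(sims[selected], axis=0)
    weights = np.bincount(assignment, minlength=k)
    return np.array(selected), weights
```

This sketch builds the full pairwise similarity matrix, so it is only practical for small datasets. Scaling such a selection to massive data, and guaranteeing that training on the subset behaves like training on the full data, is the kind of problem the project describes.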