
Beyond Accuracy: Rethinking LLM Evaluation for Real-World, Interactive, and Culturally Inclusive Scenarios
Abstract:
Traditional evaluation methods for large language models (LLMs)—often centered on accuracy in static multiple-choice or short-answer questions—fail to capture the complexities of real-world use. As we envision LLMs serving users in dynamic, multicultural, and interactive scenarios, we must rethink what meaningful evaluation looks like. This talk presents our recent research advancing LLM evaluation through culturally aware, socially grounded, and interaction-driven benchmarks. We assess factual consistency across languages and regions [1], explore everyday knowledge in underrepresented cultures [2], and examine cultural inclusivity [3][4][5][6]. We highlight that while LLMs may not appear socially biased in simple question answering [7], they reveal their biases in generation tasks [8], which better reflect actual LLM usage. We further introduce dynamic and interactive evaluation paradigms: LLM-as-an-Interviewer [9], which mimics real-time user interaction, and Flex-TravelPlanner [10], which evaluates planning adaptability under evolving and prioritized constraints. Together, these papers reveal that accuracy alone is insufficient; LLM evaluation must consider culture, context, interactivity, and adaptation. This talk calls for a broader evaluation agenda and presents these ten papers as starting points for more robust, inclusive, and realistic assessments.
References:
[1] Shafayat, et al. Multi-FAct: Assessing Factuality of Multilingual LLMs using FActScore. COLM 2024
[2] Myung, et al. BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages. NeurIPS D&B 2024
[3] Winata, et al. WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines. NAACL 2025
[4] Bayramli, et al. Diffusion Models Through a Global Lens: Are They Culturally Inclusive? C3NLP Workshop@NAACL 2025
[5] Lee, et al. Exploring Cross-Cultural Differences in English Hate Speech Annotations. NAACL 2024
[6] Kim, et al. When Tom Eats Kimchi: Evaluating Cultural Bias of Multimodal LLMs in Cultural Mixture Contexts. ArXiv 2025
[7] Jin, et al. KoBBQ: Korean Bias Benchmark for Question Answering. TACL 2024
[8] Jin, et al. Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations. ArXiv 2025
[9] Kim, et al. LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation. ArXiv 2025
[10] Oh, et al. Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents. ICLR 2025
Bio:
Alice Oh is a Professor in the School of Computing at KAIST. Her major research area is at the intersection of natural language processing (NLP) and computational social science, with a recent focus on multilingual and multicultural aspects of LLMs. She collaborates with scholars in the humanities and social sciences, including political science, education, and history. She has served as Program Chair for ICLR 2021 and NeurIPS 2022, General Chair for ACM FAccT 2022 and NeurIPS 2023, and DEI Chair for COLM 2024. She is the current President of SIGDAT, which oversees EMNLP.
Date/Time:
May 13, 2025
4:00 pm - 5:45 pm
Location:
3400 Boelter Hall
420 Westwood Plaza, Los Angeles, CA 90095