Maria Teleki

Socially Robust AI

Howdy! I’m a PhD student in Computer Science at Texas A&M University, advised by James Caverlee. I’m currently at Google DeepMind as a Student Researcher.

I build methods, benchmarks, and audits that improve AI robustness in socially complex situations. My research spans Speech AI and Computational Social Science, across three threads:

Disfluency-Aware AI: Is AI robust to how people naturally speak? → ICASSP '26, Preprint, Preprint, INTERSPEECH '25, INTERSPEECH '24, LREC-COLING '24

Personalized AI: Is AI robust to who people actually are? → ICWSM '25, INTERSPEECH '26, ACL '26

Governable AI: Is AI robust within the societal structures it operates in? → FAccT '26

I hold the Avilés–Johnson Fellowship in Computer Science and Engineering and have been featured in invited talks and public media, including the MASKulinity Podcast.

I'm on the job market!

I'm seeking tenure-track faculty and research scientist positions starting Fall 2027. I'd love to connect — email me or grab my CV.

Publications

Filters:

Thread 1

Disfluency-Aware AI

: Is AI robust to how people naturally speak?

Real human speech is full of disfluency — pauses, repairs, and restarts — that current systems often treat as noise. I develop benchmarks and evaluation frameworks that model these patterns as structured signal, revealing robustness failures in conversational AI systems operating on real speech.

	PREPRINTSpeechSpectrum: A Linguistic Fidelity Spectrum for Accountable Speech-to-Text Anna Seo Gyeong Choi, Maria Teleki*, Miguel del Rio, Corey Miller^†, James Caverlee^†, Allison Koenecke^† Collaboration w/ Cornell University, RevAI* Presented at IC2S2, SpeechAI4All@CHI '26 Paper IC2S2 Abstract Poster Speech-to-text (STT) systems are increasingly embedded in everyday technologies, yet they largely continue to treat transcription as a technical problem of accuracy, assuming a single "correct" representation of speech. This overlooks that speech can be transcribed in multiple legitimate ways, and that different contexts demand different balances of fidelity, conciseness, and emphasis. We contribute SpeechSpectrum, a framework reconceptualizing STT as cross-modal translation along a continuum of representational fidelity that makes these representational decisions explicit and user-controllable. Through theoretical analysis and empirical investigation, we show that existing STT systems already impose spectrum-based choices without user input, indicating the normative significance of who controls transcription outcomes. Our user study (N=52) demonstrates that granting users explicit control over transcript representation improves task support, while a comparative study shows that large language models fail to capture the diversity and context-sensitivity of human preferences. We derive implications and recommendations for building STT systems that prioritize user agency in representational decisions, and release open-source code – including the speechspectrum Python package – and a prototype to support future research. Our work positions control over transcription fidelity as a core site of user agency in speech technologies, and shows that system-imposed defaults constitute an accountability gap. `@inproceedings{choi26_speechspectrum, title = {SpeechSpectrum: A Linguistic Fidelity Spectrum for Accountable Speech-to-Text}, author = {Anna Seo Gyeong Choi and Maria Teleki and Miguel del Rio and James Caverlee and Corey Miller and Allison Koenecke}, year = {2026}, booktitle = {Preprint} }`
	PREPRINTConversational Speech Reveals Structural Robustness Failures in SpeechLLM Backbones Maria Teleki, Sai Janjur, Haoran Liu, Oliver Grabner, Ketan Verma, Thomas Docog, Xiangjue Dong, Lingfeng Shi, Cong Wang, Stephanie Birkelbach, Jason Kim, Yin Zhang, Éva Székely, James Caverlee Collaboration w/ KTH Royal Institute of Technology Paper LinkedIn Explainer Code Poster LLMs serve as the backbone in SpeechLLMs, yet their behavior on spontaneous conversational input remains poorly understood. Conversational speech contains pervasive disfluencies -- interjections, edits, and parentheticals -- that are rare in the written corpora used for pre-training. Because gold disfluency removal is a deletion-only task, it serves as a controlled probe to determine whether a model performs faithful structural repair or biased reinterpretation. Using the DRES evaluation framework, we evaluate proprietary and open-source LLMs across architectures and scales. We show that model performance clusters into stable precision-recall regimes reflecting distinct ``editing policies.'' Notably, reasoning models systematically over-delete fluent content, revealing a bias toward semantic abstraction over structural fidelity. While fine-tuning achieves SOTA results, it harms generalization. Our findings demonstrate that robustness to speech is shaped by specific training objectives. `@inproceedings{teleki26_dres, title = {Conversational Speech Reveals Structural Robustness Failures in SpeechLLM Backbones}, author = {Maria Teleki and Sai Janjur and Haoran Liu and Oliver Grabner and Ketan Verma and Thomas Docog and Xiangjue Dong and Lingfeng Shi and Cong Wang and Stephanie Birkelbach and Jason Kim and Yin Zhang and James Caverlee}, year = {2025}, }`
	PREPRINTBeyond Single Ground Truth: Reference Monism as Epistemic Injustice in ASR Evaluation Anna Seo Gyeong Choi, Maria Teleki, James Caverlee, Miguel del Rio^†, Corey Miller^†, Hoon Choi^† Collaboration w/ Cornell University, RevAI, Kangwon National University Presented at IC2S2 Paper IC2S2 Abstract Automatic speech recognition (ASR) evaluation compares system output to ground truth transcripts, with Word Error Rate (WER) quantifying the distance between them. But ground truth transcripts are not discovered – they are produced by human annotators following conventions that encode normative assumptions about which speech features matter. Different conventions (verbatim, non-verbatim, legal) produce different transcripts of identical speech and judge the same ASR output differently. This paper argues that reference monism – enforcing a single transcription convention as ground truth – commits epistemic injustice. Speakers with aphasia, whose speech includes clinically meaningful disfluencies, are systematically disadvantaged when evaluated against "clean" references that treat those disfluencies as errors. The harm is not merely differential performance, but that evaluative infrastructure lacks interpretive resources to recognize their contributions as legitimate. We develop a philosophical framework introducing the hermeneutical gap, formalize Epistemic Injustice Distance (EID) to measure reference monism's cost, and demonstrate empirically using AphasiaBank that WER varies depending on which convention defines ground truth. We propose WER-Range: reporting performance across legitimate conventions rather than assuming a single correct answer. `@inproceedings{range-wer, title = {Beyond Single Ground Truth: Reference Monism as Epistemic Injustice in ASR Evaluation}, author = {Anna Seo Gyeong Choi and Maria Teleki and James Caverlee and Miguel del Rio and Corey Miller and Hoon Choi}, year = {2026}, booktitle = {Preprint} }`
	Z-Scores: A Metric for Linguistically Assessing Disfluency Removal Maria Teleki, Sai Janjur, Haoran Liu, Oliver Grabner, Ketan Verma, Thomas Docog, Xiangjue Dong, Lingfeng Shi, Cong Wang, Stephanie Birkelbach, Jason Kim, Yin Zhang, James Caverlee ICASSP 2026 Paper Code Poster LinkedIn Explainer Evaluating disfluency removal in speech requires more than aggregate token-level scores. Traditional word-based metrics such as precision, recall, and F1 (E-Scores) capture overall performance but cannot reveal why models succeed or fail. We introduce Z-Scores, a span-level linguistically-grounded evaluation metric that categorizes system behavior across distinct disfluency types (EDITED, INTJ, PRN). Our deterministic alignment module enables robust mapping between generated text and disfluent transcripts, allowing Z-Scores to expose systematic weaknesses that word-level metrics obscure. By providing category-specific diagnostics, Z-Scores enable researchers to identify model failure modes and design targeted interventions -- such as tailored prompts or data augmentation -- yielding measurable performance improvements. A case study with LLMs shows that Z-scores uncover challenges with INTJ and PRN disfluencies hidden in aggregate F1, directly informing model refinement strategies. `@inproceedings{teleki25_zscores, title = {Z-Scores: A Metric for Linguistically Assessing Disfluency Removal}, author = {Maria Teleki and Sai Janjur and Haoran Liu and Oliver Grabner and Ketan Verma and Thomas Docog and Xiangjue Dong and Lingfeng Shi and Cong Wang and Stephanie Birkelbach and Jason Kim and Yin Zhang and James Caverlee}, year = {2025}, booktitle = {ICASSP}, }`
	I want a horror -- comedy -- movie: Slips-of-the-Tongue Impact Conversational Recommender System Performance Maria Teleki, Lingfeng Shi, Chengkai Liu, and James Caverlee INTERSPEECH 2025 Paper GitHub Poster Disfluencies are a characteristic of speech. We focus on the impact of a specific class of disfluency -- whole-word speech substitution errors (WSSE) -- on LLM-based conversational recommender system performance. We develop Syn-WSSE, a psycholinguistically-grounded framework for synthetically creating genre-based WSSE at varying ratios to study their impact on conversational recommender system performance. We find that LLMs are impacted differently: llama and mixtral have improved performance in the presence of these errors, while gemini, gpt-4o, and gpt-4o-mini have deteriorated performance. We hypothesize that this difference in model resiliency is due to differences in the pre- and post-training methods and data, and that the increased performance is due to the introduced genre diversity. Our findings indicate the importance of a careful choice of LLM for these systems, and more broadly, that disfluencies must be carefully designed for as they can have unforeseen impacts. `@inproceedings{teleki25_horror, title = {{I want a horror -- comedy -- movie: Slips-of-the-Tongue Impact Conversational Recommender System Performance}}, author = {Maria Teleki and Lingfeng Shi and Chengkai Liu and James Caverlee}, year = {2025}, booktitle = {INTERSPEECH} }`
	Comparing ASR Systems in the Context of Speech Disfluencies Maria Teleki, Xiangjue Dong, Soohwan Kim, and James Caverlee INTERSPEECH 2024 Paper Code Project Website Poster ISCA Archive Link LinkedIn Explainer In this work, we evaluate the disfluency capabilities of two automatic speech recognition systems -- Google ASR and WhisperX -- through a study of 10 human-annotated podcast episodes and a larger set of 82,601 podcast episodes. We employ a state-of-the-art disfluency annotation model to perform a fine-grained analysis of the disfluencies in both the scripted and non-scripted podcasts. We find, on the set of 10 podcasts, that while WhisperX overall tends to perform better, Google ASR outperforms in WIL and BLEU scores for non-scripted podcasts. We also find that Google ASR's transcripts tend to contain closer to the ground truth number of edited-type disfluent nodes, while WhisperX's transcripts are closer for interjection-type disfluent nodes. This same pattern is present in the larger set. Our findings have implications for the choice of an ASR model when building a larger system, as the choice should be made depending on the distribution of disfluent nodes present in the data. `@inproceedings{teleki24_interspeech, title = {Comparing ASR Systems in the Context of Speech Disfluencies}, author = {Maria Teleki and Xiangjue Dong and Soohwan Kim and James Caverlee}, year = {2024}, booktitle = {Interspeech 2024}, pages = {4548--4552}, doi = {10.21437/Interspeech.2024-1270}, }`

	Quantifying the Impact of Disfluency on Spoken Content Summarization Maria Teleki, Xiangjue Dong, and James Caverlee LREC-COLING 2024 Paper Code Poster Video Slides ACL Anthology Link Spoken content is abundant -- including podcasts, meeting transcripts, and TikTok-like short videos. And yet, many important tasks like summarization are often designed for written content rather than the looser, noiser, and more disfluent style of spoken content. Hence, we aim in this paper to quantify the impact of disfluency on spoken content summarization. Do disfluencies negatively impact the quality of summaries generated by existing approaches? And if so, to what degree? Coupled with these goals, we also investigate two methods towards improving summarization in the presence of such disfluencies. We find that summarization quality does degrade with an increase in these disfluencies and that a combination of multiple disfluency types leads to even greater degradation. Further, our experimental results show that naively removing disfluencies and augmenting with special tags can worsen the summarization when used for testing, but that removing disfluencies for fine-tuning yields the best results. We make the code available at https://github.com/mariateleki/Quantifying-Impact-Disfluency. @inproceedings{teleki-etal-2024-quantifying-impact, title = "Quantifying the Impact of Disfluency on Spoken Content Summarization", author = "Teleki, Maria and Dong, Xiangjue and Caverlee, James", editor = "Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen", booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)", month = may, year = "2024", address = "Torino, Italia", publisher = "ELRA and ICCL", url = "https://aclanthology.org/2024.lrec-main.1175", pages = "13419--13428", abstract = "Spoken content is abundant {--} including podcasts, meeting transcripts, and TikTok-like short videos. And yet, many important tasks like summarization are often designed for written content rather than the looser, noiser, and more disfluent style of spoken content. Hence, we aim in this paper to quantify the impact of disfluency on spoken content summarization. Do disfluencies negatively impact the quality of summaries generated by existing approaches? And if so, to what degree? Coupled with these goals, we also investigate two methods towards improving summarization in the presence of such disfluencies. We find that summarization quality does degrade with an increase in these disfluencies and that a combination of multiple disfluency types leads to even greater degradation. Further, our experimental results show that naively removing disfluencies and augmenting with special tags can worsen the summarization when used for testing, but that removing disfluencies for fine-tuning yields the best results. We make the code available at https://github.com/mariateleki/Quantifying-Impact-Disfluency.", }
	PREPRINTA Survey on LLM Inference-Time Self-Improvement *Xiangjue Dong, Maria Teleki,* and James Caverlee** arXiv 2024 Paper GitHub Poster Techniques that enhance inference through increased computation at test-time have recently gained attention. In this survey, we investigate the current state of LLM Inference-Time Self-Improvement from three different perspectives: Independent Self-improvement, focusing on enhancements via decoding or sampling methods; Context-Aware Self-Improvement, leveraging additional context or datastore; and Model-Aided Self-Improvement, achieving improvement through model collaboration. We provide a comprehensive review of recent relevant studies, contribute an in-depth taxonomy, and discuss challenges and limitations, offering insights for future research. `@inproceedings{dong24_survey, title = {A Survey on LLM Inference-Time Self-Improvement}, author = {Xiangjue Dong and Maria Teleki and James Caverlee}, year = {2024}, booktitle = {arXiv} }`
	MS 1ST AUTHORDACL: Disfluency Augmented Curriculum Learning for Fluent Text Generation Rohan Chaudhury, Maria Teleki, Xiangjue Dong, and James Caverlee LREC-COLING 2024 Paper Code Poster Video Slides ACL Anthology Link Voice-driven software systems are in abundance. However, language models that power these systems are traditionally trained on fluent, written text corpora. Hence there can be a misalignment between the inherent disfluency of transcribed spoken content and the fluency of the written training data. Furthermore, gold-standard disfluency annotations of various complexities for incremental training can be expensive to collect. So, we propose in this paper a Disfluency Augmented Curriculum Learning (DACL) approach to tackle the complex structure of disfluent sentences and generate fluent texts from them, by using Curriculum Learning (CL) coupled with our synthetically augmented disfluent texts of various levels. DACL harnesses the tiered structure of our generated synthetic disfluent data using CL, by training the model on basic samples (i.e. more fluent) first before training it on more complex samples (i.e. more disfluent). In contrast to the random data exposure paradigm, DACL focuses on a simple-to-complex learning process. We comprehensively evaluate DACL on Switchboard Penn Treebank-3 and compare it to the state-of-the-art disfluency removal models. Our model surpasses existing techniques in word-based precision (by up to 1%) and has shown favorable recall and F1 scores. @inproceedings{chaudhury-etal-2024-dacl-disfluency, title = "{DACL}: Disfluency Augmented Curriculum Learning for Fluent Text Generation", author = "Chaudhury, Rohan and Teleki, Maria and Dong, Xiangjue and Caverlee, James", editor = "Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen", booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)", month = may, year = "2024", address = "Torino, Italia", publisher = "ELRA and ICCL", url = "https://aclanthology.org/2024.lrec-main.385", pages = "4311--4321", abstract = "Voice-driven software systems are in abundance. However, language models that power these systems are traditionally trained on fluent, written text corpora. Hence there can be a misalignment between the inherent disfluency of transcribed spoken content and the fluency of the written training data. Furthermore, gold-standard disfluency annotations of various complexities for incremental training can be expensive to collect. So, we propose in this paper a Disfluency Augmented Curriculum Learning (DACL) approach to tackle the complex structure of disfluent sentences and generate fluent texts from them, by using Curriculum Learning (CL) coupled with our synthetically augmented disfluent texts of various levels. DACL harnesses the tiered structure of our generated synthetic disfluent data using CL, by training the model on basic samples (i.e. more fluent) first before training it on more complex samples (i.e. more disfluent). In contrast to the random data exposure paradigm, DACL focuses on a simple-to-complex learning process. We comprehensively evaluate DACL on Switchboard Penn Treebank-3 and compare it to the state-of-the-art disfluency removal models. Our model surpasses existing techniques in word-based precision (by up to 1{\%}) and has shown favorable recall and F1 scores.", }
	DEMOHowdy Y’all: An Alexa TaskBot Majid Alfifi, Xiangjue Dong, Timo Feldman, Allen Lin, Karthic Madanagopal, Aditya Pethe, Maria Teleki, Zhuoer Wang, Ziwei Zhu, James Caverlee Alexa Prize TaskBot Challenge Proceedings 2022 Paper Amazon Science Link In this paper, we present Howdy Y’all, a multi-modal task-oriented dialogue agent developed for the 2021-2022 Alexa Prize TaskBot competition. Our design principles guiding Howdy Y’all aim for high user satisfaction through friendly and trustworthy encounters, minimization of negative conversation edge cases, and wide coverage over many tasks. Hence, Howdy Y’all is built upon a rapid prototyping platform to enable fast experimentation and powered by four key innovations to enable this vision: (i) First, it combines a rules, phonetic matching, and a transformer-based approach for robust intent understanding. (ii) Second, to accurately elicit user preferences and guide users to the right task, Howdy Y’all is powered by a contrastive learning search framework over sentence embeddings and a conversational recommender for eliciting preferences. (iii) Third, to support a variety of user question types, it introduces a new data augmentation method for question generation and a self-supervised answer selection approach for improving question answering. (iv) Finally, to help motivate our users and keep them engaged, we design an emotional conversation tracker that provides empathetic responses to keep users engaged and a monitor of conversation quality. `@inproceedings{University2022, author={Alfifi, Majid and Dong, Xiangjue and Feldman, Timo and Lin, Allen and Madanagopal, Karthic and Pethe, Aditya and Teleki, Maria and Wang, Zhuoer and Zhu, Ziwei and Caverlee, James}, title = {Howdy Y’all: An Alexa TaskBot}, year = {2022}, url = {https://www.amazon.science/alexa-prize/proceedings/howdy-yall-an-alexa-taskbot}, booktitle = {Alexa Prize TaskBot Challenge Proceedings}, }`

Thread 2

Personalized AI

: Is AI robust to who people actually are?

AI systems respond differently to different people — shaped by signals of who we are, like gender, accent, and voice, and signals of how we express ourselves, like creativity and discourse. Understanding these signals is what it takes to build personalized AI that serves every user.

	Masculine Defaults via Gendered Discourse in Podcasts and Large Language Models Maria Teleki, Xiangjue Dong, Haoran Liu, and James Caverlee ICWSM 2025 Presented at IC2S2 (Oral), SICon@ACL Paper (Preprint) Paper (AAAI) GitHub Project Website Slides Poster IC2S2 Abstract Masculine defaults are widely recognized as a significant type of gender bias, but they are often unseen as they are under-researched. Masculine defaults involve three key parts: (i) the cultural context, (ii) the masculine characteristics or behaviors, and (iii) the reward for, or simply acceptance of, those masculine characteristics or behaviors. In this work, we study discourse-based masculine defaults, and propose a twofold framework for (i) the large-scale discovery and analysis of gendered discourse words in spoken content via our Gendered Discourse Correlation Framework (GDCF); and (ii) the measurement of the gender bias associated with these gendered discourse words in LLMs via our Discourse Word-Embedding Association Test (D-WEAT). We focus our study on podcasts, a popular and growing form of social media, analyzing 15,117 podcast episodes. We analyze correlations between gender and discourse words -- discovered via LDA and BERTopic -- to automatically form gendered discourse word lists. We then study the prevalence of these gendered discourse words in domain-specific contexts, and find that gendered discourse-based masculine defaults exist in the domains of business, technology/politics, and video games. Next, we study the representation of these gendered discourse words from a state-of-the-art LLM embedding model from OpenAI, and find that the masculine discourse words have a more stable and robust representation than the feminine discourse words, which may result in better system performance on downstream tasks for men. Hence, men are rewarded for their discourse patterns with better system performance by one of the state-of-the-art language models -- and this embedding disparity is a representational harm and a masculine default. `@inproceedings{teleki25_icwsm, title = {Masculine Defaults via Gendered Discourse in Podcasts and Large Language Models}, author = {Maria Teleki and Xiangjue Dong and Haoran Liu and James Caverlee}, year = {2025}, booktitle = {ICWSM 2025} }`
	POSTER"Walk a Mile in My Voice": Voice Conversion Shapes Trust, Attribution, and Empathy in Human–AI Speech Interactions Shree Harsha Bokkahalli Satish, Maria Teleki, Christoph Minixhofer, Ondrej Klejch, Peter Bell, Éva Székely Collaboration w/ KTH Royal Institute of Technology, University of Edinburgh IUI 2026 Presented at CHI Paper Project Website LinkedIn Explainer 1 LinkedIn Explainer 2 Speech Large Language Models (SpeechLLMs) represent a new generation of conversational AI that processes spoken language directly from audio. This enables sensitivity to prosodic cues while also inheriting voice-based demographic information that has been shown to lead to biased system behaviour. Studying how people react and reflect on AI responses to different gender and accent presentation can contribute to understanding the potential societal impact. In this study, we examine how vocal identity factors of accent and perceived gender shape user evaluations of AI responses while the underlying linguistic content remains constant. Through two complementary studies (Interactive Study, N=24; Observational Study, N=19), we investigate whether experiencing interactions through voice converted identities versus observing pre-recorded conversations affects perceived harm, acceptability, trust, and responsibility attribution. We find that participants who experienced voice conversion rated benign AI responses as significantly more acceptable and reported significantly higher trust compared to those observing identical interactions, while perceived harm remained low across conditions. Qualitative feedback reveals that participants attributed different AI behaviours to voice characteristics, noting perceived differences in tone, helpfulness, and respect based on accent and gender presentation. Our findings suggest that vocal identity functions as a design variable, with systematic effects on user perception even when lexical content is held constant. `title = {\textbf{Walk a Mile in My Voice: Voice Conversion Shapes Trust, Attribution, and Empathy in Human–AI Speech Interactions}}, author = {Shree Harsha Bokkahalli Satish and \underline{Maria Teleki} and Christoph Minixhofer and Ondrej Klejch and Peter Bell and Éva Székely}, year = {2026}, booktitle = {IUI (Short)} }`
	New! Extended VersionA Survey on LLMs for Story Generation Maria Teleki, Xiangjue Dong, Peter Carragher, Vedangi Bengali, Tian Liu, Haoran Liu, Sai Tejas Janjur, Thomas Docog, Stephanie Birkelbach, Oliver Grabner, Cong Wang, Ting Liu, Yin Zhang, Frank Shipman, James Caverlee Collaboration w/ Carnegie Mellon University EMNLP Findings 2025 Extended Paper Paper GitHub Methods for story generation with Large Language Models (LLMs) have come into the spotlight recently. We create a novel taxonomy of LLMs for story generation consisting of two major paradigms: (i) independent story generation by an LLM, and (ii) author-assistance for story generation -- a collaborative approach with LLMs supporting human authors. We compare existing works based on their methodology, datasets, generated story types, evaluation methods, and LLM usage. With a comprehensive survey, we identify potential directions for future work. `@inproceedings{teleki25_survey, title = {{A Survey on LLMs for Story Generation}}, author = {Maria Teleki and Vedangi Bengali and Xiangjue Dong and Sai Tejas Janjur and Haoran Liu and Tian Liu and Cong Wang and Ting Liu and Yin Zhang and Frank Shipman and James Caverlee}, year = {2025}, booktitle = {EMNLP Findings} }`
	The Voice Behind the Words: Quantifying Intersectional Bias in SpeechLLMs Shree Harsha Bokkahalli Satish, Christoph Minixhofer, Maria Teleki, James Caverlee, Ondřej Klejch, Peter Bell, Gustav Eje Henter, Éva Székely Collaboration w/ KTH Royal Institute of Technology, University of Edinburgh INTERSPEECH 2026 Paper LinkedIn Explainer Speech Large Language Models (SpeechLLMs) process spoken input directly, retaining cues such as accent and perceived gender that were previously removed in cascaded pipelines. This introduces speaker identity dependent variation in responses. We present a large-scale intersectional evaluation of accent and gender bias in three SpeechLLMs using 2,880 controlled interactions across six English accents and two gender presentations, keeping linguistic content constant through voice cloning. Using pointwise LLM-judge ratings, pairwise comparisons, and Best-Worst Scaling with human validation, we detect consistent disparities. Eastern European-accented speech receives lower helpfulness scores, particularly for female-presenting voices. The bias is implicit: responses remain polite but differ in helpfulness. While LLM judges capture the directional trend of these biases, human evaluators exhibit significantly higher sensitivity, uncovering sharper intersectional disparities. `@inproceedings{satish26_voicebehind, title = {The Voice Behind the Words: Quantifying Intersectional Bias in SpeechLLMs}, author = {Shree Harsha Bokkahalli Satish and Christoph Minixhofer and Maria Teleki and James Caverlee and Ondřej Klejch and Peter Bell and Gustav Eje Henter and Éva Székely}, year = {2026}, booktitle = {Interspeech} }`
	CHOIR: Harmonizing Structured Persona Diversity for Robust Collaborative LLM Reasoning Xiangjue Dong, Cong Wang, Maria Teleki, Millennium Bismay, Ruihong Huang, and James Caverlee ACL 2026 Paper LinkedIn Explainer Persona-assigned Large Language Models can adopt diverse roles, enabling personalized and context-aware reasoning. However, even minor demographic perturbations in personas, such as simple pronoun swaps, can alter reasoning trajectories, leading to divergent sets of correct answers on reasoning benchmarks. We explore the potential of these variations as a constructive resource to improve LLM reasoning performance. We propose CHOIR (Collaborative Harmonization fOr Inference Robustness), a test-time framework that harmonizes a set of demographically perturbed, persona-conditioned reasoning signals into a unified prediction. CHOIR orchestrates a collaborative decoding process among counterfactual personas perturbed across dimensions of gender, race, religion, disability, and age, dynamically balancing agreement and divergence in their reasoning paths to improve performance. Experiments demonstrate that CHOIR consistently enhances LLM reasoning across model architectures, scales, and tasks. Improvements reach up to 20.1% for individual groups and 15.1% on average, and we show that CHOIR remains effective even when base personas are suboptimal. @inproceedings{dong-etal-2026-choir, title = "{CHOIR}: Harmonizing Structured Persona Diversity for Robust Collaborative {LLM} Reasoning", author = "Dong, Xiangjue and Wang, Cong and Teleki, Maria and Bismay, Millennium and Huang, Ruihong and Caverlee, James", booktitle = "Proceedings of the 64th Annual Meeting of the {A}ssociation for {C}omputational {L}inguistics (Volume 1: Long Papers)", month = jul, year = "2026", address = "San Diego, California, United States", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2026.acl-long.2175/", pages = "46997--47014", ISBN = "979-8-89176-390-6" }
	POSTERUG 1ST AUTHORPromptHelper: A Prompt Recommender System for Encouraging Creativity in AI Chatbot Interactions Jason Kim, Maria Teleki, James Caverlee CHI 2026 Paper Poster IC2S2 Abstract GitHub LinkedIn Explainer Prompting is central to interaction with AI systems, yet many users struggle to explore alternative directions, articulate creative intent, or understand how variations in prompts shape model outputs. We introduce prompt recommender systems (PRS) as an interaction approach that supports exploration, suggesting contextually relevant follow-up prompts. We present PromptHelper, a PRS prototype integrated into an AI chatbot that surfaces semantically diverse prompt suggestions while users work on real writing tasks. We evaluate PromptHelper in a 2x2 fully within-subjects study (N=32) across creative and academic writing tasks. Results show that PromptHelper significantly increases users' perceived exploration and expressiveness without increasing cognitive workload. Qualitative findings illustrate how prompt recommendations help users branch into new directions, overcome uncertainty about what to ask next, and better articulate their intent. We discuss implications for designing AI interfaces that scaffold exploratory interaction while preserving user agency, and release open-source resources to support research on prompt recommendation. `@inproceedings{kim26_prompthelper, title = {{PromptHelper: A Prompt Recommender System for Encouraging Creativity in AI Chatbot Interactions}}, author = {Jason Kim and Maria Teleki and James Caverlee}, year = {2026}, booktitle = {arXiv} }`
	SHORTUG 1ST AUTHORSocialPulse: An Open-Source Subreddit Sensemaking Toolkit Stephanie Birkelbach, Maria Teleki, Peter Carragher, Xiangjue Dong, Nehul Bhatnagar, James Caverlee Collaboration w/ Carnegie Mellon University, Revionics SocialLLM@ICWSM 26 Presented at IC2S2 (Oral) Paper GitHub IC2S2 Abstract Video LinkedIn Explainer Understanding how online communities discuss and make sense of complex social issues is a central challenge in social media research, yet existing tools for large-scale discourse analysis are often closed-source, difficult to adapt, or limited to single analytical views. We present SocialPulse, an open-source subreddit sensemaking toolkit that unifies multiple complementary analyses -- topic modeling, sentiment analysis, user activity characterization, and bot detection -- within a single interactive system. SocialPulse enables users to fluidly move between aggregate trends and fine-grained content, compare highly active and long-tail contributors, and examine temporal shifts in discourse across subreddits. The demo showcases end-to-end exploratory workflows that allow researchers and practitioners to rapidly surface themes, participation patterns, and emerging dynamics in large Reddit datasets. By offering an extensible and openly available platform, SocialPulse provides a practical and reusable foundation for transparent, reproducible sensemaking of online community discourse. `@inproceedings{birkelbach26_socialpulse, title = {{SocialPulse: An Open-Source Subreddit Sensemaking Toolkit}}, author = {Stephanie Birkelbach and Maria Teleki and Peter Carragher and Xiangjue Dong and Nehul Bhatnagar and James Caverlee year = {2026}, booktitle = {arXiv} }`
	DEMOAI in the Loop: A Multimodal Assistant for Real-Time Lecture Comprehension Amanda Lacy, Maria Teleki, Esau Hutcherson, Jobin Varughese, Jun Kwon, Frank Shipman, Tracy Hammond L@S 2026 Paper LinkedIn Explainer Synchronous learning environments often present a bandwidth problem where high-density visual and auditory information is lost due to fast-paced delivery or inaccessible presentation methods. We present AI in the Loop, a multimodal assistant that supports real-time lecture comprehension by integrating live speech recognition, slide content extraction, and an LLM-powered question answering interface. The system allows students to query lecture content in natural language during class, receiving contextually grounded responses drawn from both spoken audio and displayed slides. Our demo showcases the system in a live classroom setting, highlighting its potential to reduce cognitive load and support diverse learners. `@inproceedings{lacy-etal-2026-aiintheloop, title = "{AI} in the Loop: A Multimodal Assistant for Real-Time Lecture Comprehension", author = "Lacy, Amanda and Teleki, Maria and Hutcherson, Esau and Varughese, Jobin and Kwon, Jun and Shipman, Frank and Hammond, Tracy", booktitle = "Proceedings of the Thirteenth ACM Conference on Learning at Scale", year = "2026", }`
	Co2PT: Mitigating Bias in Pre-trained Language Models through Counterfactual Contrastive Prompt Tuning Xiangjue Dong, Ziwei Zhu, Zhuoer Wang, Maria Teleki, and James Caverlee Collaboration w/ George Mason University EMNLP Findings 2023 Paper Code ACL Anthology Link Pre-trained Language Models are widely used in many important real-world applications. However, recent studies show that these models can encode social biases from large pre-training corpora and even amplify biases in downstream applications. To address this challenge, we propose Co2PT, an efficient and effective debias-while-prompt tuning method for mitigating biases via counterfactual contrastive prompt tuning on downstream tasks. Our experiments conducted on three extrinsic bias benchmarks demonstrate the effectiveness of Co2PT on bias mitigation during the prompt tuning process and its adaptability to existing upstream debiased language models. These findings indicate the strength of Co2PT and provide promising avenues for further enhancement in bias mitigation on downstream tasks. @inproceedings{dong-etal-2023-co2pt, title = "{C}o$^2${PT}: Mitigating Bias in Pre-trained Language Models through Counterfactual Contrastive Prompt Tuning", author = "Dong, Xiangjue and Zhu, Ziwei and Wang, Zhuoer and Teleki, Maria and Caverlee, James", editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika", booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023", month = dec, year = "2023", address = "Singapore", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.findings-emnlp.390", doi = "10.18653/v1/2023.findings-emnlp.390", pages = "5859--5871", abstract = "Pre-trained Language Models are widely used in many important real-world applications. However, recent studies show that these models can encode social biases from large pre-training corpora and even amplify biases in downstream applications. To address this challenge, we propose Co$^2$PT, an efficient and effective debias-while-prompt tuning method for mitigating biases via counterfactual contrastive prompt tuning on downstream tasks. Our experiments conducted on three extrinsic bias benchmarks demonstrate the effectiveness of Co$^2$PT on bias mitigation during the prompt tuning process and its adaptability to existing upstream debiased language models. These findings indicate the strength of Co$^2$PT and provide promising avenues for further enhancement in bias mitigation on downstream tasks.", }

Thread 3

Governable AI

: Is AI robust within the societal structures it operates in?

I study the infrastructures of power — governments, universities, and tech companies — that determine how AI systems are authorized, deployed, and contested. I develop audits and systems that reveal where accountability breaks down.

	The Due Process Deficit: Auditing AI Governance in U.S. Higher Education Maria Teleki, Anna Seo Gyeong Choi^, Anne Duray^, Haoran Liu, Junyan Zhang, Xiangjue Dong, Dilma Da Silva, Allison Koenecke^†, James Caverlee^† Collaboration w/ Cornell University, AAAS Science and Technology Policy Fellow FAccT 2026 Presented at IC2S2 (Oral) Paper ACAI-US79.ORG Slides IC2S2 Abstract U.S. universities have met artificial intelligence with policy -- revised honor codes, AI use guidelines, and detection tools to enforce them -- but the procedural infrastructure that would let affected parties understand, contest, and seek redress for AI-related decisions has not kept pace. We present ACAI-US79, an institutional audit of AI governance across 79 U.S. universities spanning R1, R2, and teaching-focused institutions, and ACAI, the Academic AI Capacity Index: an interpretable measure of the public legibility of governance structures that allocate authority and accountability. Unlike prior audits focused on pedagogical guidance at elite or R1-only samples, ACAI shifts the analytic frame to procedural accountability, evaluating four domains — policy clarity, faculty support, feedback mechanisms, and AI detection tool governance — through time-bounded review of institutionally authoritative materials. We find that governance capacity is systematically concentrated in rule articulation while mechanisms for participation, feedback, and procedural constraint are underdeveloped, and that AI research intensity is largely uncorrelated with AI governance capacity -- i.e., we find that technical leadership does not translate into procedural accountability. We further show that LLM-based audits produce unstable institutional rankings, demonstrating the continued necessity of human judgment in high-stakes audit contexts. We release the dataset, audit instrument, and public website at acai-us79.org to support transparency, critique, and institutional self-reflection, laying empirical groundwork for accountable AI governance in higher education. `@inproceedings{teleki26_acai, title = {The Due Process Deficit: Auditing AI Governance in U.S. Higher Education}, author = {Maria Teleki and Anna Seo Gyeong Choi and Anne Duray and Haoran Liu and Junyan Zhang and Xiangjue Dong and Dilma Da Silva and James Caverlee and Allison Koenecke}, year = {2026}, booktitle = {ACM FAccT 2026} }`
	PREPRINTWhen Behavioral Traces Mislead: Evaluating Evidence Adequacy in Learning Analytics Junhyeon Kwon, Jobin Varughese, Maria Teleki, Mohsen Dorodchi Collaboration w/ UNC Charlotte Presented at IC2S2 (Oral) IC2S2 Abstract The transition from high school to college requires students to shift from externally guided behaviors to independent self-regulated learning. In large-enrollment computing courses, learning management system data provides scalable traces of how students distribute coursework across a semester. Such temporal behaviors are widely used in learning analytics systems to infer risk and trigger interventions, yet the evidentiary adequacy of these signals remains unclear. We examine whether submission timing patterns meaningfully distinguish learners and whether they support individual-level decisions. We analyze 32,275 assignment submissions from 3,982 students across four introductory computing courses over six semesters. To compare heterogeneous assignments, we introduce the Submission Fraction Index (SFI), a normalized representation of when work is submitted within an available time window. Using semester-long SFI trajectories, we identify recurring pacing profiles---consistent-early, gradual-early, gradual-late, and consistent-late---and evaluate their relationship to academic outcomes. Earlier and more organized pacing corresponds to moderately higher grades at the group level; however, outcome distributions substantially overlap across profiles. Predictive models trained on submission timing explain little individual performance variance (R² ≈ 0.05) indicating limited diagnostic power. These findings show that temporal behavior differentiates strategies but weakly identifies individual learning needs. When multiple pacing patterns lead to similar outcomes, behavioral traces alone cannot justify fine-grained intervention decisions. Our results suggest that learning analytics systems should align intervention granularity with measurement reliability, using temporal pacing information to scaffold planning and awareness. This work reframes common analytics practices by emphasizing evidentiary limits in large-scale educational decision-making and highlights the need for context-aware approaches when designing responsible support systems.
	DEMOUG 1ST AUTHORPodChecker: An Interpretable Fact-Checking Companion for Podcasts Anna Irmetova, Haoran Liu, Maria Teleki, Peter Carragher, Julie Zhang, James Caverlee Collaboration w/ Carnegie Mellon University MisD@ICWSM 2026 Paper GitHub Slides We present PodChecker, a user-facing system for automated, claim-level fact-checking of podcast content. PodChecker processes podcast audio or RSS feeds by transcribing episodes, extracting atomic factual claims, and assigning each claim one of four fine-grained labels -- \textit{true, false, misleading/partially true, or unverifiable} -- using retrieval-augmented verification. The system presents fact-checking results at the level of individual claims, accompanied by simple visual indicators and links to supporting/conflicting sources. This design, implemented via an interactive web-based interface, enables users to inspect fact-checking outputs and underlying evidence directly, supporting interpretable and critical engagement with long-form audio content. By presenting claim-level evidence and labels, PodChecker assists both general listeners and professional fact-checkers in assessing podcast factuality. `@inproceedings{irmetova26_podchecker, title = {{PodChecker: An Interpretable Fact-Checking Companion for Podcasts}}, author = {Anna Irmetova, Haoran Liu, Maria Teleki, Peter Carragher, Xiangjue Dong, Julie Zhang, James Caverlee year = {2026}, }`
	SHORTUG 1ST AUTHORPREPRINTDetecting and Mitigating Demographic Bias in LLM-Based Resume Evaluation Oluwadayo Bamgbelu, Maria Teleki**, Xiangjue Dong, James Caverlee** Presented at IC2S2 `@misc{bamgbelu_resume_bias_llm, title = {Detecting and Mitigating Demographic Bias in LLM-Based Resume Evaluation}, author = {Oluwadayo Bamgbelu and Maria Teleki and Xiangjue Dong and James Caverlee}, note = {Short paper / work in progress} }`

Education

2022 – Present | PhD Computer Science at Texas A&M University

Advised by James Caverlee.

2017 – 2022 | B.S. Computer Science at Texas A&M University

Summa Cum Laude. Emphasis in Electrical Engineering. Minor in Math.

Experience

Google DeepMind | Student Researcher | Cambridge, MA | June - Sep '26

Working on Speech AI on the Frontier AI Team.

RetailMeNot | Software Engineering Intern | Austin, TX | May - August '21

BERT embeddings and PCA to surface which semantic dimensions of coupon content predicted user click-through rates.

The Hi, How Are You Project | Volunteer | Austin, TX | May - Dec '20

I built the "Friendly Frog" Alexa Skill with the organization at the beginning of the COVID-19 pandemic to promote mental health.

RetailMeNot | Software Engineering Intern | Austin, TX | May - August '20

I built the "RetailMeNot DealFinder" Alexa Skill to help users activate cash back offers.

Silicon Labs | Applications Engineering Intern | Austin, TX | May - August '19

I built a Python library adopted by customers for real-time hardware bus traffic analysis across multiple protocols.

Service

Program Committee (Reviewer) for INTERSPEECH | '26
Program Committee (Reviewer) for ICWSM | Jan '24, May '24, Sep '24, Jan '25, Sep '25, Jan '26
Program Committee (Reviewer) for IC2S2 | '26
External Program Committee (External Reviewer) for RecSys | '24
Program Committee (Reviewer) for ACL ARR | Aug '24, Oct '24, Dec '24, May '25, Oct '25, Jan '26

Workshop Organizer for Speech AI for All: The What, How, and Who of Measurement
Kimi V. Wenzel, Alisha Pradhan, Maria Teleki, Tobias Weinberg, Robin Netzorg, Alyssa Hillary Zisk, Anna Seo Gyeong Choi, Jingjin Li, Raja Kushalnagar, Colin Lea, Abraham Glasser, Christian Vogler, Nan Bernstein Ratner, Ly Xīnzhèn M. Zhǎngsūn Brown, Allison Koenecke, Karen Nakamura, Shaomei Wu
CHI 2026

Paper Website

Optimized for ``typical'' and fluent speech, today's speech AI systems perform poorly for people with speech diversities, sometimes to an unusable or even harmful degree. These harms play out in daily life through household voice assistants and workplace meeting services, in higher stakes scenarios like medical transcription, and in emerging applications of AI in augmentative and alternative communication. Standard metrics aiming to quantify these inequities, however, fail to comprehensively understand the impact of speech AI on diverse user groups, and furthermore do not easily generalize to newer speech language and speech generation models. To address these social inequities and measurement limitations, this workshop brings academics, practitioners, and non-profit workers together in proactive dialogue to improve measurement of speech AI performance and user impact. Through a poster session and breakout group discussions, our workshop will extend current understanding on how to best leverage existing metrics, like Word Error Rate, within the HCI design ecosystem, and also explore new innovations in speech AI measurement. Key outcomes of this workshop include: a research agenda for CHI community to guide and contribute to speech AI development, groundwork for new papers on speech AI measurement, and a diversity-centered benchmark suite for external evaluators.


                  @inproceedings{wenzel26_speech,

                    title     = {{Speech AI for All: The What, How, and Who of Measurement}},

                    author    = {Kimi V. Wenzel, Alisha Pradhan, Maria Teleki, Tobias Weinberg, Robin Netzorg, Alyssa Hillary Zisk, Anna Seo Gyeong Choi, Jingjin Li, Raja Kushalnagar, Colin Lea, Abraham Glasser, Christian Vogler, Nan Bernstein Ratner, Ly Xīnzhèn M. Zhǎngsūn Brown, Allison Koenecke, Karen Nakamura, Shaomei Wu},

                    year      = {2026},

                    booktitle = {CHI}

                  }

Media

🎙️ Podcast Feature: I was interviewed on the MASKulinity Podcast, where we discussed our ICWSM ‘25 work on gendered discourse: how simple masculine defaults (e.g., the use of “like”) shape AI systems, and why these patterns matters for gender equality. [🎧 Listen Here] [💬 About MASKulinity] [📄 ICWSM ‘25 Paper]

Mentoring

Whether you have prior research experience or are just starting out, I have a few spots each semester to mentor and collaborate with students who have a passion for learning, a growth mindset, and who want to contribute to impactful projects. [Here is information on how I work with mentees], if you're interested: fill out this form

★ student was an author on a published paper; ♠ student was an author on a Preprint paper; ▲ student completed their thesis; ◆ student received course credit (i.e. CSCE 485, CSCE 691); ♣ student had no publications prior to mentorship.

MS Students

Rohan Chaudhury [★▲] M.S. Computer Science – First Employment at Amazon

1st author DACL: Disfluency Augmented Curriculum Learning for Fluent Text Generation LREC-COLING '24

Sai Tejas Janjur [★♠♣] M.S. Computer Science – 🥇1st Place Award at Student Research Week, First Employment at NVIDIA

co-author A Survey on LLMs for Story Generation EMNLP Findings '25

co-author Z-Scores: A Metric for Linguistically Assessing Disfluency Removal ICASSP '26

co-author Conversational Speech Reveals Structural Robustness Failures in SpeechLLM Backbones Preprint

Cong Wang [★♠♣] M.S. Statistics – Presented at Texas NLP Symposium

co-author Z-Scores: A Metric for Linguistically Assessing Disfluency Removal ICASSP '26

co-author Conversational Speech Reveals Structural Robustness Failures in SpeechLLM Backbones Preprint

Junhyeon Kwon [♠♣] M.S. Computer Science

1st author When Behavioral Traces Mislead: Evaluating Evidence Adequacy in Learning Analytics Preprint

Undergraduate Students

Soohwan Kim [★◆♣] B.S. Computer Science – First Employment at UPS

co-author Comparing ASR Systems in the Context of Speech Disfluencies INTERSPEECH '24

Oliver Grabner [★♠♣] B.S. Computer Science – Presented at Texas NLP Symposium & Student Research Week, Intern at Samsung, Intern at Google

co-author Z-Scores: A Metric for Linguistically Assessing Disfluency Removal ICASSP '26

co-author Conversational Speech Reveals Structural Robustness Failures in SpeechLLM Backbones Preprint

Thomas Docog [★♠♣] B.S. Computer Science – Presented at Texas NLP Symposium, Intern at Ramirez

co-author Z-Scores: A Metric for Linguistically Assessing Disfluency Removal ICASSP '26

co-author Conversational Speech Reveals Structural Robustness Failures in SpeechLLM Backbones Preprint

Ketan Verma [★♠♣] B.S. Computer Science – Intern at Samsung, M.S. at Cornell Tech

co-author Z-Scores: A Metric for Linguistically Assessing Disfluency Removal ICASSP '26

co-author Conversational Speech Reveals Structural Robustness Failures in SpeechLLM Backbones Preprint

Stephanie Birkelbach [★♠◆♣] B.S. Computer Science + B.S. Statistics – REU at Rochester Institute of Technology w/ Ashique KhudaBukhsh

1st author SocialPulse: An Open-Source Subreddit Sensemaking Toolkit SocialLLM@ICWSM '26

co-author Z-Scores: A Metric for Linguistically Assessing Disfluency Removal ICASSP '26

co-author Conversational Speech Reveals Structural Robustness Failures in SpeechLLM Backbones Preprint

Jason Kim [★♠♣] B.S. Computer Science – Presented at Texas NLP Symposium, 🥇1st Place Award at Student Research Week

1st author PromptHelper: A Prompt Recommender System for Encouraging Creativity in AI Chatbot Interactions CHI '26

co-author Z-Scores: A Metric for Linguistically Assessing Disfluency Removal ICASSP '26

co-author Conversational Speech Reveals Structural Robustness Failures in SpeechLLM Backbones Preprint

Anna Irmetova [★♣◆] B.S. Computer Science – Presented at Student Research Week, Intern at Target

1st author PodChecker: An Interpretable Fact-Checking Companion for Podcasts MisD@ICWSM '26

Oluwadayo (Dayo) Bamgbelu [♣] B.S. Computer Science – Presented at Texas NLP Symposium, Intern at Meta

1st author Detecting and Mitigating Demographic Bias in LLM-Based Resume Evaluation Preprint

Teaching

Certifications

2025 | Professional Development Mastery Certificate in Instruction & Assessment | GRAD Aggies, CIRTL@TAMU
Completed multi-semester program in evidence-based teaching, including syllabus design, learning assessment, and inclusive pedagogy.

Teaching Assistantships

Spring 2026 | Teaching Assistant | Texas A&M University, CSCE 676 Data Mining & Analysis
Helped Prof. Caverlee redesign the course for the AI era.

Guest Lectures

Fall 2025 | Speech Language Models | Texas A&M University, CSCE 676 Data Mining & Analysis
Spring 2025 | IR Evaluation and Learning to Rank | Texas A&M University, CSCE 670 Information Storage & Retrieval

Earlier Teaching

Dec '18 – Dec '19 | Peer Teacher | Texas A&M University
Sep '16 – July '17 | Afterschool Instructor | The Y (YMCA)
Taught weekly classes at local elementary schools and authored robotics & science instruction manuals for the Williamson County YMCA Afterschool program.

Awards

2022-2026 | Dr. Dionel Avilés '53 and Dr. James Johnson '67 Fellowship in Computer Science and Engineering

Multi-year fellowship supporting doctoral research in Computer Science.

2024, 2025, Spring 2026, Summer 2026 | Department of Computer Science & Engineering Travel Grant

2024 | CRA-WP Grad Cohort for Women

Selected participant; supported with travel support.

2017-2021 | President's Endowed Scholarship

Multi-year fellowship supporting undergraduate studies.

2018 | Bertha & Samuel Martin Scholarship

Invited Talks

Fall 2026 | Socially Robust AI | George Mason University

Fall 2026 | The Due Process Deficit | The Contentious Politics of AI (Remote)

Spring 2026 | Accountable AI | IPT Global

Fall 2025 | Conversational AI | Revionics

Spring 2025 | How does ChatGPT work? & My research! | Texas A&M University, Club of Aggie Female Engineers (C.A.F.E.)

Fall 2024 | The Other AI: An Intuitive Understanding of Artificial Intelligence | Texas Tech University, School of Veterinary Medicine VBMA

Resources
[Review Response Worksheet] [Working With Collaborators] [PhD Professionalization by Allison Koenecke] [PhD Advice] [Prompting Best Practices] [Shomir Wilson's Advice] [SPARC Advice] [Hey Maria, How Can I Learn About AI?]
I'm happy to chat about grad school applications & my experience — send me an email!

Outside the lab 🐶
I spend my time outside work with our very good boy, Apollo 🐶 [Smiling] [Sunflowers] [Generated-1], traveling 🌎 [On top of volcano in Hawaii] [Casper, Wyoming] [Bastrop, Texas], doing CrossFit 🏋️‍♀️, and watching New Girl 👓 and Parks & Rec 🪴. Fun fact: I have a midnight blue belt in Tang Soo Do Mi Guk Kwon! 🥋 We use the "midnight blue belt" (instead of "black belt") to symbolize that you can never be perfect, and you are never done learning.

Publications

Disfluency-Aware AI

Personalized AI

Governable AI

Education

Experience

Service

Media

Mentoring

Teaching

Awards

Invited Talks

More