Eray Turkel

Senior ML Research Engineer · LLM Post-Training & Evaluation · Causal ML · Stanford Ph.D.

I’m an ML research engineer working on LLM post-training and evaluation. My career has been a progression through increasingly ambiguous versions of measurement and attribution problems. The questions that drive my work are how to design training signals that build genuine capability rather than reward hacking, how to build evaluation systems whose measurements can be trusted, and how to assign credit correctly across long trajectories and complex reward structures. I approach these as ML problems informed by the tools of statistics and causal inference.

At Roblox, I work on end-to-end LLM post-training for agentic AI: synthetic data generation, supervised fine-tuning, reinforcement learning with verifiable and rubric-based rewards, and the evaluation infrastructure that measures whether these interventions produce genuine improvement. I maintain Open Game Eval, Roblox’s open-source LLM evaluation framework for game-development tasks.

Previously, at Google Search AI Overviews, I built LLM-as-judge evaluation systems, uncertainty quantification methods for LLM judges, and hybrid human-LLM evaluation algorithms. This work fed directly into fine-tuning, reward design, and RLHF for Search’s generative AI products. Before that, on Google’s Central Causal Inference team, I built ML models across Maps, YouTube, Play, and Ads: Bayesian ML models for understanding routing interventions on Maps across the 10 biggest US cities, the YouTube Hype Small Creator bonus mechanism (covered by The Verge), and sales-intervention models for Ads affecting millions of dollars in operations.

Earlier, at Uber, I implemented variance-reduction methods for marketplace experimentation and worked on identifying spillovers in switchback experiments. My Stanford Ph.D. combined machine learning, statistical modeling, and causal inference, with published work in PNAS and ACM WWW.

The questions I find most interesting these days are how to design reward functions that correctly assign credit across long trajectories, and how to improve the quality and quantify the uncertainty of LLM-as-judge evaluation methods. One current independent project is conditional conformal inference for LLM judges: calibrated, conditionally valid uncertainty quantification for LLM-as-judge scores. The goal is per-prompt coverage rather than marginal coverage, with conformal procedures that stay robust to judge drift and prompt-distribution shift.