Eray Turkel, Ph.D.

I’m an ML researcher working on post-training, evaluation, and reinforcement learning for modern AI systems. My work focuses on the problems that determine whether post-training actually produces the capabilities we intend: designing training signals that generate genuine capability rather than reward hacking, building evaluation systems whose measurements can be trusted, and assigning credit correctly across long trajectories and complex reward structures. I approach these as ML problems informed by the tools of statistics and causal inference.

My career has been a progression through increasingly ambiguous versions of measurement and attribution problems. At Roblox, I work on end-to-end LLM post-training for agentic AI: synthetic data generation, supervised fine-tuning, reinforcement learning with verifiable and rubric-based rewards, and the evaluation infrastructure that measures whether these interventions produce genuine improvement. Previously, at Google Search AI Overviews, I built LLM-as-judge evaluation systems, uncertainty quantification methods for LLM judges, and robust hybrid human-LLM evaluation algorithms. Before that, on Google’s Causal Inference team, I developed Bayesian ML models and statistical packages used across Maps, YouTube, Play, and Ads. Earlier, at Uber, I implemented variance reduction methods for marketplace experimentation and worked on spillovers in switchback experiments. My Stanford PhD combined machine learning, statistical modeling, and causal inference, with published work in PNAS and ACM.

Post-training, evaluation, and credit assignment are difficult measurement problems at their core. My background in statistics and causal inference gives me an edge, because identification and rigorous measurement are exactly the tools modern ML post-training needs. The questions I find most interesting sit at that intersection: designing reward functions that correctly assign credit across long trajectories, and bringing causal identification to process-level rewards and LLM-as-judge evaluation.

Research Directions

My current research lives at the intersection of LLM post-training and the measurement and identification machinery of statistics and causal inference. Three active directions:

  • Doubly-robust estimation of model performance. Treating evaluation as a causal-inference problem: building doubly-robust (DR) estimators of LLM performance that combine outcome models with propensity-style adjustments and give valid uncertainty estimates under finite-sample, off-policy, and judge-noise regimes.
  • Partial reward models and credit assignment. Designing reward functions that correctly attribute credit across long agentic trajectories: process-level signals, partial and rubric-based rewards, and methods for disentangling which steps in a trajectory actually drove the outcome.
  • Conditional conformal inference for LLM judges. Calibrated, conditionally valid uncertainty for LLM-as-judge scores: per-prompt coverage rather than marginal coverage, and conformal procedures robust to judge drift and prompt-distribution shift.