As organizations adopt AI at scale, evaluation becomes the backbone of trust, safety, and product readiness. Yet building effective evals is deceptively hard: there is no single metric or benchmark that captures the complexity of user-facing AI systems. This talk unpacks the evolving landscape of AI evaluations, drawing lessons from personalization infrastructure and enterprise adoption.
Evals aren’t just “does the model answer right?”: they span model correctness, infra robustness, product safety, human experience, and long-term systemic effects. We’ll explore this multi-layered nature, from core metrics like precision and recall, to system-level robustness (latency, drift, stability), to product-facing guardrails that ensure fairness, safety, and alignment with user expectations. Enterprise AI adoption fails when evals stop at precision/recall and never reach guardrails, human experience, and systemic impact. We’ll also discuss why manual labeling and gold-standard datasets remain indispensable for grounding LLM judgments, and how iterative evaluation loops let criteria evolve alongside product goals, avoiding the trap of static benchmarks.
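To make the layering concrete, here is a minimal sketch of how such a harness might look in Python. `GoldExample`, `run_model`, and the latency budget are illustrative assumptions for this sketch, not a description of any specific production system:

```python
# Minimal sketch of a layered eval harness. GoldExample, run_model, and the
# latency budget are illustrative placeholders, not a real production API.
import time
from dataclasses import dataclass

@dataclass
class GoldExample:
    query: str
    expected: bool  # manually assigned gold label

def run_model(query: str) -> bool:
    # Stand-in for the system under test; a trivial heuristic so the
    # sketch runs end to end.
    return "refund" in query.lower()

def evaluate(gold_set: list[GoldExample], latency_budget_ms: float = 200.0) -> dict:
    tp = fp = fn = slow = 0
    for ex in gold_set:
        start = time.perf_counter()
        pred = run_model(ex.query)
        latency_ms = (time.perf_counter() - start) * 1000
        # Layer 1: model correctness against the manually labeled gold set.
        if pred and ex.expected:
            tp += 1
        elif pred and not ex.expected:
            fp += 1
        elif not pred and ex.expected:
            fn += 1
        # Layer 2: system robustness -- latency against an explicit budget.
        if latency_ms > latency_budget_ms:
            slow += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "over_latency_budget": slow / len(gold_set),
    }

gold = [GoldExample("How do I get a refund?", True),
        GoldExample("What shows are trending?", False)]
print(evaluate(gold))
```

Product-facing guardrail checks (fairness, safety, tone) would slot in as further layers over the same gold set, which is what makes the manually labeled data the anchor for every layer above it.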
The session will also address organizational dimensions: who owns evaluation, how to align infra, product, and ML teams on rigor, and what “good” looks like when rolling AI systems into production responsibly. Attendees will gain a framework for understanding evaluation as a dynamic, multi-layered practice that spans technical correctness, human experience, and long-term product impact.
Key Takeaways:
- Evals are multi-layered, spanning model correctness, infra reliability, product guardrails, and user experience.
- Manual labeling and curated datasets are essential for grounding AI judgments (see the sketch after this list).
- Iterative evaluation loops ensure relevance as products and models evolve.
- Ownership and governance of evals are as much organizational as technical.
- A structured, layered approach accelerates safe, impactful AI adoption.
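As one concrete illustration of the grounding and iteration takeaways, here is a hedged sketch of gating an LLM judge on agreement with a human-labeled gold set before trusting it on unlabeled traffic. `llm_judge`, the toy judge, and the 0.9 threshold are hypothetical, chosen only for illustration:

```python
# Hedged sketch: measure an LLM judge's agreement with a human-labeled gold
# set before trusting it at scale. llm_judge and MIN_AGREEMENT are
# illustrative assumptions, not a prescribed tool or threshold.
from typing import Callable

MIN_AGREEMENT = 0.9  # bar the judge must clear on the gold set

def judge_agreement(gold: list[tuple[str, bool]],
                    llm_judge: Callable[[str], bool]) -> float:
    # Fraction of gold examples where the judge matches the human label.
    hits = sum(llm_judge(text) == label for text, label in gold)
    return hits / len(gold)

def maybe_promote(gold: list[tuple[str, bool]],
                  llm_judge: Callable[[str], bool]) -> bool:
    # Iterative loop: re-run this check whenever the judge prompt, the
    # model, or the product criteria change, so the eval evolves with them.
    agreement = judge_agreement(gold, llm_judge)
    print(f"judge/human agreement: {agreement:.2%}")
    return agreement >= MIN_AGREEMENT

# Toy judge so the sketch runs: flags responses mentioning personal data.
toy_judge = lambda text: "ssn" in text.lower()
gold_set = [("Here is the SSN you asked for", True),
            ("Here are tonight's top picks", False)]
print("promote judge:", maybe_promote(gold_set, toy_judge))
```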
Speaker

Mallika Rao
Engineering Leader @Netflix
Mallika Rao is an Engineering Leader at Netflix with deep expertise in building and operating large-scale distributed systems, including search, recommendations, and personalization infrastructure. She brings a systems-thinking mindset to infrastructure strategy and is passionate about integrating AI into product and engineering in ways that enhance resilience, transparency, and operational excellence. Her work focuses on enabling teams to innovate rapidly while maintaining the stability and rigor required in enterprise-scale environments. Beyond her technical leadership, Mallika mentors senior engineers and leaders, and draws inspiration from the elegance of mathematics and the improvisational creativity of music.