Summary
Disclaimer: This summary has been generated by AI. It is experimental, and feedback is welcomed. Please reach out to info@qcon.ai with any comments or concerns.
The presentation titled Building Evals for AI Adoption: From Principles to Practice by Mallika Rao focuses on the complexities and importance of building effective evaluation frameworks for AI adoption in enterprises. The transcript covers several key points:
Introduction
Mallika Rao introduces herself and discusses her experience leading infrastructure teams at Twitter, Walmart, and Netflix, where she built and operated large-scale systems for search and personalization.
Importance of Evaluation Frameworks
- Evaluation frameworks are crucial for maintaining user trust and product success.
- Evaluation debt can silently build up, leading to significant problems if not addressed.
Challenges and Solutions
- Evaluation Debt: Accumulates unnoticed and can cause product failures.
- Chatbot Personalization: Personalized models and optimization across multiple surfaces (search, recommendations, etc.) pose unique challenges.
- Multi-Layered Evaluations: Rao recommends evaluations that integrate product, infrastructure, and user experience considerations.
Case Studies
Rao discusses building personalized search systems at Twitter and cash rewards at Walmart, highlighting scale challenges and lessons learned.
Organizational Dimensions
- Discusses the importance of aligning infrastructure, product, and ML teams.
- Proposes an organizational framework to handle evaluation governance effectively.
Key Takeaways
- Evals are not just about model correctness but also include user experience and systemic impact.
- Manual labeling and gold-standard datasets remain essential.
- Iterative evaluation loops are critical for keeping evaluations relevant as the product evolves.
Mallika Rao emphasizes the need for ongoing commitment to evaluation as a dynamic practice that adapts as the product and user expectations evolve.
This is the end of the AI-generated content.
As organizations adopt AI at scale, evaluation becomes the backbone of trust, safety, and product readiness. Yet building effective evals is deceptively hard: there is no single metric or benchmark that captures the complexity of user-facing AI systems. This talk unpacks the evolving landscape of AI evaluations, drawing lessons from personalization infrastructure and enterprise adoption.
Evals aren’t just “does the model answer right?”: they span model correctness, infra robustness, product safety, human experience, and long-term systemic effects. We’ll explore the multi-layered nature of evals: from core metrics like precision and recall, to system-level robustness (latency, drift, stability), to product-facing guardrails that ensure fairness, safety, and alignment with user expectations. Enterprise AI adoption fails when evals stop at precision/recall and never reach those outer layers. We’ll discuss why manual labeling and gold-standard datasets remain indispensable for grounding LLM judgments, and how iterative evaluation loops allow criteria to evolve alongside product goals, avoiding the trap of static benchmarks.
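To make the layering concrete, here is a minimal sketch of what such a harness might look like in Python. It is illustrative only, not the framework from the talk: EvalCase, run_system, violates_guardrail, and judge_agreement are hypothetical names, and each layer is reduced to a single check.

```python
# Illustrative sketch only -- all names here are hypothetical placeholders,
# not APIs from the talk. The system under test is assumed to be a callable
# run_system(query) -> set[str].
import time
from dataclasses import dataclass


@dataclass
class EvalCase:
    query: str
    gold_items: set[str]  # manually labeled gold-standard answers


def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Layer 1: model correctness against the gold set."""
    if not predicted or not gold:
        return 0.0, 0.0
    hits = len(predicted & gold)
    return hits / len(predicted), hits / len(gold)


def violates_guardrail(items: set[str], blocklist: set[str]) -> bool:
    """Layer 3: a stand-in for product-facing guardrails (safety, fairness)."""
    return any(term in item for item in items for term in blocklist)


def judge_agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Grounding check: how often an LLM judge agrees with manual labels."""
    if not human_labels:
        return 0.0
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def run_layered_eval(run_system, cases: list[EvalCase],
                     blocklist: set[str], latency_budget_s: float = 1.0) -> dict:
    precisions, recalls, latencies, guardrail_failures = [], [], [], 0
    for case in cases:
        start = time.perf_counter()
        predicted = run_system(case.query)
        latencies.append(time.perf_counter() - start)  # Layer 2: robustness
        p, r = precision_recall(predicted, case.gold_items)
        precisions.append(p)
        recalls.append(r)
        guardrail_failures += violates_guardrail(predicted, blocklist)
    return {
        "precision": sum(precisions) / len(cases),
        "recall": sum(recalls) / len(cases),
        "max_latency_ok": max(latencies) <= latency_budget_s,
        "guardrail_failures": guardrail_failures,
    }
```

The point is the shape rather than the specific checks: a single report carries every layer, and a release gate that reads only the first two entries is exactly the precision/recall trap described above.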
The session will also address organizational dimensions: who owns evaluation, how to align infra, product, and ML teams on rigor, and what “good” looks like when rolling AI systems into production responsibly. Attendees will gain a framework for understanding evaluation as a dynamic, multi-layered practice that spans technical correctness, human experience, and long-term product impact.
Key Takeaways:
- Evals are multi-layered, spanning model correctness, infra reliability, product guardrails, and user experience.
- Manual labeling and curated datasets are essential for grounding AI judgments.
- Iterative evaluation loops ensure relevance as products and models evolve (a small sketch follows this list).
- Ownership and governance of evals are as much organizational as technical.
- A structured, layered approach accelerates safe, impactful AI adoption.
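As a companion to the harness above, here is a hedged sketch of an iterative evaluation loop with versioned criteria, reading the report produced by the hypothetical run_layered_eval; EvalCriteria and gate_release are invented names and the thresholds are made up for illustration.

```python
# Illustrative sketch only: versioned eval criteria, so thresholds can
# evolve with the product while past runs stay comparable.
from dataclasses import dataclass


@dataclass
class EvalCriteria:
    version: str
    thresholds: dict[str, float]  # e.g. {"precision": 0.8, "recall": 0.7}


def gate_release(report: dict, criteria: EvalCriteria) -> list[str]:
    """Return the names of the criteria the current eval report fails."""
    return [name for name, minimum in criteria.thresholds.items()
            if report.get(name, 0.0) < minimum]


# Each revision of the criteria is a new version, so the bar can rise as
# the product matures instead of freezing into a static benchmark.
criteria_v1 = EvalCriteria("v1", {"precision": 0.70, "recall": 0.60})
criteria_v2 = EvalCriteria("v2", {"precision": 0.80, "recall": 0.70})

report = {"precision": 0.75, "recall": 0.72}
print(gate_release(report, criteria_v1))  # [] -> passes under v1
print(gate_release(report, criteria_v2))  # ['precision'] -> fails under v2
```

In a real loop, the failing criteria would feed back into labeling and criteria review, which is what keeps evals relevant as the product and its users evolve.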
Speaker
Mallika Rao
Engineering Leader @Netflix
Mallika Rao has been an Engineering Leader at Twitter, Walmart, and Netflix, with deep expertise in building and operating large-scale distributed systems, including search, recommendations, and personalization infrastructure. She brings a systems-thinking mindset to infrastructure strategy and is passionate about integrating AI into product and engineering in ways that enhance resilience, transparency, and operational excellence. Her work focuses on enabling teams to innovate rapidly while maintaining the stability and rigor required in enterprise-scale environments. Beyond her technical leadership, Mallika mentors senior engineers and leaders, and draws inspiration from the elegance of mathematics and the improvisational creativity of music.