Capstone Project

Evaluation Copilot

[Image: team members collaborating in an office, two focused on a computer screen, with a whiteboard of notes in the background. Caption: Evaluation Copilot: comprehensive evaluation that empowers developers to develop with trust and deploy with confidence.]

Problem

The rapid integration of Large Language Models (LLMs) into app development makes it difficult for developers to understand and trust AI-generated responses.

Approach

Our team combined user-centered design with technological innovation, starting with extensive user research to understand developers’ needs and pain points. We iteratively developed a series of prototypes within an Agile framework, incorporating feedback from user-testing sessions and the latest research in the field.

Solution

The “Evaluation Copilot” is a web app that demystifies LLM evaluation metrics for developers, offering an intuitive platform to test, understand, and refine AI-generated text. It gives clear, actionable feedback on how to improve prompts for better LLM responses, helping developers make the AI in their applications more reliable and effective.
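
To make the evaluate-and-suggest loop concrete, the sketch below shows one way such feedback could be computed. It is purely illustrative: the function name, the EvalResult structure, and the heuristics are assumptions made for this example, not the team's actual implementation or metrics.

# Illustrative sketch only: a toy evaluate-and-suggest loop in the spirit of
# the project description. All names and heuristics here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class EvalResult:
    score: float                                     # 0.0 (poor) to 1.0 (good)
    feedback: list[str] = field(default_factory=list)  # prompt-improvement tips


def evaluate_response(response: str, required_terms: list[str]) -> EvalResult:
    """Score an LLM response with simple heuristics and suggest prompt fixes."""
    feedback: list[str] = []
    checks = 0
    passed = 0

    # Heuristic 1: the response should mention the terms the task requires.
    for term in required_terms:
        checks += 1
        if term.lower() in response.lower():
            passed += 1
        else:
            feedback.append(f"Response omits '{term}'; name it explicitly in the prompt.")

    # Heuristic 2: a very short response often means the prompt was underspecified.
    checks += 1
    if len(response.split()) >= 30:
        passed += 1
    else:
        feedback.append("Response is terse; ask for a specific length or level of detail.")

    return EvalResult(score=passed / checks, feedback=feedback)


if __name__ == "__main__":
    result = evaluate_response(
        response="Refunds are available within 30 days.",
        required_terms=["refund", "receipt"],
    )
    print(f"score={result.score:.2f}")
    for tip in result.feedback:
        print("-", tip)

Running this prints a score of 0.33 along with two suggestions, illustrating the pattern the app's description implies: evaluate the response, then translate each failed check into an actionable prompt edit.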
