Description
This project will create a visual, automated, and continuous evaluation pipeline for an LLM-powered feature (e.g., a summarizer, a classification agent, or a RAG system) using the Promptfoo framework.
The core deliverable is a self-hosted, interactive dashboard that allows developers and product managers to immediately see and compare the performance, cost, and security of different prompt templates and LLM providers. Instead of relying on manual output review, we will implement model-graded assertions and LLM red-teaming to provide quantitative, objective performance scores.
This moves our LLM development from manual "trial-and-error" to a robust, "test-driven" engineering workflow.
Goals
Automated Evaluation Baseline (P0):
The primary goal is to implement a functional promptfooconfig.yaml that defines a baseline prompt and a comprehensive set of 10 test cases (inputs), and to run evaluations against two different LLM providers (e.g., GPT-4o-mini and Gemini 2.5 Pro). This establishes the foundational benchmarking capability.
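A minimal promptfooconfig.yaml for this baseline could look like the sketch below. The prompt text, test variables, and assertions are placeholders for whichever feature we pick, and the Gemini provider ID should be verified against the promptfoo provider documentation.

```yaml
# Sketch of a baseline promptfooconfig.yaml; all values are illustrative.
description: "Hackweek LLM evaluation baseline"

prompts:
  - "Summarize the following support ticket in three sentences:\n\n{{ticket}}"

# Provider IDs follow promptfoo's vendor:model convention; double-check the
# exact Gemini identifier in the promptfoo docs.
providers:
  - openai:gpt-4o-mini
  - google:gemini-2.5-pro

# Each test supplies input variables and, optionally, assertions.
tests:
  - vars:
      ticket: "Customer reports that the VPN client crashes after the 6.2 update."
    assert:
      - type: contains
        value: "VPN"
  - vars:
      ticket: "User cannot reset their password; the reset email never arrives."
```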
Quantitative Metrics Implementation (P0):
We must move beyond simple qualitative review by using Promptfoo's assertions and model-graded rubrics (e.g., scoring for factuality, relevance, or coherence). This will provide quantitative performance scores for each configuration tested.
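As a sketch of how this could look, promptfoo lets a defaultTest block attach model-graded assertions to every test case. The rubric wording and thresholds below are placeholders, and the assertion types used (llm-rubric, answer-relevance) should be confirmed against the current promptfoo assertion reference.

```yaml
# Sketch: model-graded assertions applied to every test case via defaultTest.
defaultTest:
  assert:
    # Free-form rubric graded by an LLM judge; the score must reach the threshold.
    - type: llm-rubric
      value: "The output is coherent, neutral in tone, and omits no key fact from the input."
      threshold: 0.7
    # Model-graded check that the answer is relevant to the original input.
    - type: answer-relevance
      threshold: 0.8
```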
Shareable Dashboard/Report (P0):
The project must generate a final, shareable HTML report using promptfoo eval -o report.html. This deliverable is the interactive dashboard that allows the Hackweek team to easily view and interpret the side-by-side comparison matrix, proving the value of automated testing.
Red Teaming Baseline (P1 Stretch Goal):
As an advanced capability, we aim to implement a basic red-teaming strategy to test the prompt for critical security vulnerabilities such as prompt injection or PII leakage, integrating security testing into the development cycle.
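One possible shape for this, assuming we use promptfoo's built-in red-teaming support, is a redteam block in the same config; the plugin and strategy names below are our best reading of the promptfoo red-teaming docs and need to be verified before use.

```yaml
# Sketch: basic red-team configuration (plugin/strategy names to be verified).
redteam:
  purpose: "Summarize internal support tickets without leaking customer data."
  plugins:
    - pii               # probe for PII leakage
  strategies:
    - prompt-injection  # wrap probes in prompt-injection attacks
```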
CI/CD Integration Proof-of-Concept (P1 Stretch Goal):
Finally, we will write a simple GitHub Action or script that integrates promptfoo eval into a pull request workflow. The ideal outcome is for a developer to see the objective pass/fail status of their prompt changes before they merge code, demonstrating how this tool enables Continuous Integration for LLM apps.
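A minimal proof-of-concept workflow might look like the following. The file paths, trigger filters, and secret names are assumptions for illustration, and the API keys required depend on which providers end up in the config.

```yaml
# .github/workflows/prompt-eval.yml (sketch; paths and secrets are placeholders)
name: Prompt evaluation
on:
  pull_request:
    paths:
      - "promptfooconfig.yaml"
      - "prompts/**"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # Run the evaluation and write the shareable HTML report.
      - name: Run promptfoo eval
        run: npx promptfoo@latest eval -c promptfooconfig.yaml -o report.html
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          # Add the key for the second provider (e.g., Gemini) here as needed.
      # Attach the report to the workflow run so reviewers can download it.
      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: promptfoo-report
          path: report.html
```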
Resources
Promptfoo CLI: The primary tool for defining tests, running the evaluations, and generating the report.
Node.js / NPM: To install and run the Promptfoo Command Line Interface (CLI).
YAML Configuration: The declarative format for defining the entire evaluation matrix (prompts, providers, tests, and assertions).
API Keys: To authenticate and make calls to the large language models being tested.
LLM Providers (the models to test): We will review which models are available to us and select the ones to evaluate.
This project is part of:
Hack Week 25