giskard-oss: a modular testing and red-teaming framework for agentic systems and LLM applications
What it solves
Giskard is designed to test and evaluate agentic systems, specifically addressing the challenge of non-deterministic outputs where a single input can produce multiple valid responses. It provides tools to catch regressions, validate RAG quality, enforce safety rules, and evaluate multi-turn conversations.
How it works
The project is organized as a modular set of Python packages that can wrap any LLM, black-box agent, or multi-step pipeline. It consists of three primary components:
- Giskard Checks: A library for creating evaluations (evals) using a scenario API, featuring built-in checks for string matching, regex, semantic similarity, and "LLM-as-judge" assessments (such as Groundedness and Conformity).
- Giskard Scan: A red-teaming layer that automatically generates adversarial test suites based on a plain-language description of the agent to detect vulnerabilities like prompt injection, harmful content, and misinformation.
- Giskard RAG (planned): A tool for RAG evaluation and synthetic data generation.
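To make the idea of checking non-deterministic outputs concrete, here is a minimal sketch in plain Python of how string-matching and regex checks might be applied to several differently worded responses to the same prompt. It does not use the giskard-oss API: the names `contains`, `matches`, `evaluate`, and `Result` are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types for illustration only; these are NOT the giskard-oss API.
Check = Callable[[str], bool]


def contains(substring: str) -> Check:
    """Build a check that passes when the output contains the substring (case-insensitive)."""
    needle = substring.lower()
    return lambda output: needle in output.lower()


def matches(pattern: str) -> Check:
    """Build a check that passes when the regex is found anywhere in the output."""
    compiled = re.compile(pattern)
    return lambda output: compiled.search(output) is not None


@dataclass
class Result:
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # Guard against an empty sample to avoid division by zero.
        return self.passed / self.total if self.total else 0.0


def evaluate(outputs: List[str], checks: List[Check]) -> Result:
    """Count the outputs that satisfy every check."""
    passed = sum(1 for out in outputs if all(check(out) for check in checks))
    return Result(passed=passed, total=len(outputs))


# Three sampled responses to the same prompt; two are valid but worded differently.
outputs = [
    "Your refund will arrive in 5 business days.",
    "Expect the refund within 5 days.",
    "I cannot help with that.",
]
result = evaluate(
    outputs,
    [contains("refund"), matches(r"\b\d+\s+(business\s+)?days\b")],
)
```

Because the checks describe properties of a valid answer rather than one exact string, both correct phrasings pass while the unhelpful response fails. Semantic-similarity and LLM-as-judge checks would need a model call and are left out of this sketch.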
Who it’s for
Developers and AI engineers building LLM-based agents and RAG pipelines who need to ensure their systems are safe, grounded, and reliable through automated testing and red-teaming.
Highlights
- Async-first architecture: Designed for dynamic, multi-turn testing of AI agents.
- Automated Red-Teaming: Generates adversarial inputs across OWASP LLM Top-10 threat categories.
- LLM-as-Judge: Supports advanced evaluation metrics like groundedness and conformity.
- Modular Design: Lightweight packages with minimal dependencies to fit into any pipeline.
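As a rough illustration of why an async-first design suits multi-turn testing, the sketch below drives several conversations concurrently with `asyncio`. It is not giskard-oss code: `echo_agent`, `run_conversation`, and `run_all` are hypothetical names, and the stand-in agent simply echoes the last user message where a real agent would call an LLM.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

# Hypothetical shapes for illustration only; NOT the giskard-oss API.
Turn = Tuple[str, str]  # (role, message)
Agent = Callable[[List[Turn]], Awaitable[str]]


async def echo_agent(history: List[Turn]) -> str:
    """Stand-in agent: a real one would await an LLM call here."""
    await asyncio.sleep(0)  # yield control, as a network call would
    last_user_message = history[-1][1]
    return f"You said: {last_user_message}"


async def run_conversation(agent: Agent, user_turns: List[str]) -> List[Turn]:
    """Feed user messages one at a time, passing the full history on each turn."""
    history: List[Turn] = []
    for message in user_turns:
        history.append(("user", message))
        reply = await agent(history)
        history.append(("assistant", reply))
    return history


async def run_all(agent: Agent, conversations: List[List[str]]) -> List[List[Turn]]:
    """Run independent conversations concurrently; results keep input order."""
    return await asyncio.gather(
        *(run_conversation(agent, turns) for turns in conversations)
    )


transcripts = asyncio.run(run_all(echo_agent, [["hi", "bye"], ["hello"]]))
```

Turns within one conversation stay sequential because each reply depends on the history so far, while separate conversations overlap whenever the agent is waiting on I/O.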
Sources
- Giskard-AI/giskard-oss