giskard-oss: a modular testing and red-teaming framework for agentic systems and LLM applications

giskard-oss: a modular testing and red-teaming framework for agentic systems and LLM applications

What it solves

Giskard is designed to test and evaluate agentic systems, specifically addressing the challenge of non-deterministic outputs where a single input can produce multiple valid responses. It provides tools to catch regressions, validate RAG quality, enforce safety rules, and evaluate multi-turn conversations.

How it works

The project is organized as a modular set of Python packages that can wrap any LLM, black-box agent, or multi-step pipeline. It consists of three primary components:

  • Giskard Checks: A library for creating evaluations (evals) using a scenario API, featuring built-in checks for string matching, regex, semantic similarity, and "LLM-as-judge" assessments (such as Groundedness and Conformity).
  • Giskard Scan: A red-teaming layer that automatically generates adversarial test suites based on a plain-language description of the agent to detect vulnerabilities like prompt injection, harmful content, and misinformation.
  • Giskard RAG: (Planned) A tool for RAG evaluation and synthetic data generation.

Who it’s for

Developers and AI engineers building LLM-based agents and RAG pipelines who need to ensure their systems are safe, grounded, and reliable through automated testing and red-teaming.

Highlights

  • Async-first architecture: Designed for dynamic, multi-turn testing of AI agents.
  • Automated Red-Teaming: Automatically generates adversarial inputs across OWASP LLM Top-10 threat categories.
  • LLM-as-Judge: Supports advanced evaluation metrics like groundedness and conformity.
  • Modular Design: Lightweight packages with minimal dependencies to fit into any pipeline.

Sources