synthetic-data-generator: a privacy-preserving tabular data generator supporting GANs, LLMs, and billion-scale datasets
What it solves
SDG (synthetic-data-generator) creates high-quality structured tabular data that retains the statistical characteristics of an original dataset without containing its sensitive information. Users can then share data, train models, and test systems while remaining compliant with privacy regulations such as GDPR and ADPPA.
How it works
SDG provides a framework that integrates multiple synthesis approaches:
- Statistical and GAN-based models: It implements algorithms like CTGAN, TVAE, and GaussianCopula to learn patterns from existing data and generate synthetic versions.
- LLM-based generation: It uses Large Language Models to generate synthetic data based solely on metadata (without needing training data) or to perform "off-table feature inference," where the LLM infers new columns based on existing data and its internal knowledge.
- Data Processing Pipeline: A dedicated Data Processor module handles format conversions (e.g., for Datetime columns), manages null values, and performs pre- and post-processing.
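To make the statistical approach concrete, here is a minimal sketch of a Gaussian-copula-style synthesizer for two columns, with a date column encoded before fitting and decoded after sampling, in the spirit of the Data Processor step. It uses only the Python standard library and is not SDG's implementation: the function names (`pearson`, `normal_scores`, `sample_copula`) and the toy data are assumptions made for illustration. Because it reuses observed values per column, it shows the mechanics only and offers no privacy guarantee.

```python
import random
from datetime import date
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sigma 1


def pearson(xs, ys):
    """Pearson correlation, clamped to [-1, 1] to absorb rounding error."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return max(-1.0, min(1.0, cov / (var_x * var_y) ** 0.5))


def normal_scores(values):
    """Replace each value by the normal quantile of its empirical rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        scores[idx] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def sample_copula(col_a, col_b, n_samples, seed=0):
    """Fit a two-column Gaussian copula and draw n_samples synthetic rows."""
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        # Draw two standard normals with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + (1.0 - rho * rho) ** 0.5 * rng.gauss(0.0, 1.0)
        # Map each normal back through its column's empirical inverse CDF.
        idx_a = min(int(STD_NORMAL.cdf(z1) * len(sorted_a)), len(sorted_a) - 1)
        idx_b = min(int(STD_NORMAL.cdf(z2) * len(sorted_b)), len(sorted_b) - 1)
        rows.append((sorted_a[idx_a], sorted_b[idx_b]))
    return rows


# Toy "original" table (made up for this sketch): age and sign-up date.
ages = [20 + i for i in range(40)]
start = date(2020, 1, 1).toordinal()
signup_dates = [
    date.fromordinal(start + 10 * i + (i * 37) % 50).isoformat() for i in range(40)
]

# Pre-processing: encode ISO dates as integer day ordinals.
signup_ordinals = [date.fromisoformat(d).toordinal() for d in signup_dates]

synthetic = sample_copula(ages, signup_ordinals, n_samples=200)

# Post-processing: decode the ordinals back into ISO date strings.
synthetic_rows = [(age, date.fromordinal(day).isoformat()) for age, day in synthetic]
```

Each synthetic row pairs an age with a sign-up date, drawn so that the dependence between the two original columns is roughly preserved. Production synthesizers typically fit smooth marginal distributions rather than reusing observed values as this sketch does.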
Who it’s for
- Data scientists and ML engineers who need privacy-preserving datasets for model training and debugging.
- Software developers requiring realistic tabular data for system testing.
- Organizations needing to share data across teams or partners without violating privacy laws.
Highlights
- Big Data Optimization: Reduces memory consumption so that CTGAN can process billion-scale datasets.
- Zero-Data Synthesis: Generates tabular data with LLMs from metadata alone, with no training data required.
- Privacy Features: Supports differential privacy and anonymization methods.
- Extensible Architecture: Uses a plug-in system for adding new models, data connectors, and processing steps.
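A plug-in architecture of this kind is often built around a registry that maps names to implementations. The sketch below shows that general pattern in plain Python. It is an assumption about how such a system could look, not SDG's actual plug-in API, and every name in it (`SynthesisModel`, `register_model`, `create_model`, `BootstrapModel`) is hypothetical.

```python
import random
from typing import Dict, List, Type


class SynthesisModel:
    """Hypothetical interface that every synthesis plug-in implements."""

    def fit(self, rows: List[dict]) -> None:
        raise NotImplementedError

    def sample(self, count: int) -> List[dict]:
        raise NotImplementedError


# Maps a plug-in name to the class that implements it.
_REGISTRY: Dict[str, Type[SynthesisModel]] = {}


def register_model(name: str):
    """Class decorator that adds a model to the registry under `name`."""

    def decorator(cls: Type[SynthesisModel]) -> Type[SynthesisModel]:
        if name in _REGISTRY:
            raise ValueError(f"model {name!r} is already registered")
        _REGISTRY[name] = cls
        return cls

    return decorator


def create_model(name: str) -> SynthesisModel:
    """Instantiate a registered model by name."""
    if name not in _REGISTRY:
        raise ValueError(f"unknown model {name!r}; available: {sorted(_REGISTRY)}")
    return _REGISTRY[name]()


@register_model("bootstrap")
class BootstrapModel(SynthesisModel):
    """Toy plug-in: resamples each column independently from the training rows."""

    def __init__(self) -> None:
        self._columns: Dict[str, list] = {}
        self._rng = random.Random(0)  # fixed seed keeps the demo reproducible

    def fit(self, rows: List[dict]) -> None:
        for key in rows[0]:
            self._columns[key] = [row[key] for row in rows]

    def sample(self, count: int) -> List[dict]:
        return [
            {key: self._rng.choice(values) for key, values in self._columns.items()}
            for _ in range(count)
        ]


# Callers select a model by name and never import the concrete class.
model = create_model("bootstrap")
model.fit([{"age": 31, "city": "Oslo"}, {"age": 45, "city": "Lima"}])
samples = model.sample(3)
```

With this layout, a new model only needs to subclass the interface and apply the decorator. The same idea extends to data connectors and processing steps by keeping one registry per plug-in type.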
Sources
- hitsz-ids/synthetic-data-generator