RAISE Synthetic Data Generator
Description
RAISE Synthetic Data Generator is a Python module that generates shareable versions of tabular datasets, ready to upload to the RAISE platform. The module can also be used beyond the scope of the RAISE project as a lightweight synthetic data generator.
Features
In the current version (0.1.5), the module provides a single main function: generate_synthetic_data. It has the following capabilities:
- Input flexibility: accepts either
  - A CSV file path, or
  - A pandas.DataFrame.
- Automatic or manual model selection
  - "auto-select" (default) automatically chooses the best synthetic data generation model based on the input data.
  - You can also specify a model explicitly (for the moment one of CTGAN, TVAE or Copulas).
- Synthetic data generation
  - Generates a synthetic dataset with the same properties as the input data.
  - Number of synthetic samples to generate can be specified via n_samples (defaults to the size of the input data).
- Results storage
  - Saves the generated synthetic dataset as synthetic_data.csv.
  - Stores model information (info.txt) inside the chosen output folder.
  - Creates a run-specific folder under the desired output path.
- Evaluation report (optional)
  - If evaluation_report=True (default), runs a quality assessment comparing original vs synthetic data.
  - Produces an evaluation report (evaluation_report.pdf) with figures and summary statistics.
- Logging and error handling
  - Provides informative log messages for each step (dataset loading, model selection, data generation, report creation).
  - Exceptions are logged with full traceback and re-raised for debugging.
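The "log with full traceback, then re-raise" behaviour described in the last item can be pictured with the standard library alone. The following is a minimal sketch of that general pattern, not the module's actual implementation; the logger name and the run_step helper are hypothetical.

```python
import logging

logger = logging.getLogger("synthetic_data_demo")  # hypothetical logger name


def run_step(step_name, func, *args, **kwargs):
    """Run one pipeline step, logging its start, its end, and any failure."""
    logger.info("Starting step: %s", step_name)
    try:
        result = func(*args, **kwargs)
    except Exception:
        # logger.exception records the message together with the full traceback.
        logger.exception("Step failed: %s", step_name)
        raise  # re-raise so the caller can still debug or handle the error
    logger.info("Finished step: %s", step_name)
    return result
```

Because the exception is re-raised unchanged, calling code sees the original error type while the log keeps a record of where it happened.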
Installation
You can install raise-synthetic-data-generator directly from PyPI using pip:
    pip install raise-synthetic-data-generator

To verify that the installation was successful, open a Python shell and try to import the module:

    import raise_synthetic_data_generator

If no errors are raised, the package is installed and ready to use.
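Beyond a bare import, the installed version can also be checked, which helps when comparing against the version mentioned in this document. A minimal sketch using only the standard library (Python 3.8 or later); the installed_version helper is a hypothetical name, and the distribution name is the one used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="raise-synthetic-data-generator"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


print(installed_version())
```

A printed None means pip did not install the package into the interpreter that is running this snippet.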
Troubleshooting
If you encounter any issues during installation:
- Ensure your Python environment is properly set up.
- If pip on your system is not associated with Python 3, you might need to run:

      pip3 install raise-synthetic-data-generator

- If you’re working within a virtual environment, make sure it is activated before running the install command.
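To check which interpreter is in use and whether a virtual environment is active, a short standard-library script can help. This is a sketch that assumes an environment created with venv or a recent virtualenv; conda environments are not detected this way. The in_virtual_environment helper is a hypothetical name.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside such an environment, sys.prefix points at the environment,
    # while sys.base_prefix points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtual_environment())
```

Running it with the same python command used for the project shows whether pip installed the package into the environment you expected.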
Usage
    from raise_synthetic_data_generator import generate_synthetic_data
    import pandas as pd

    # Example input dataframe
    df = pd.DataFrame(
        {"age": [23, 35, 44, 29, 31], "country": ["ES", "FR", "DE", "IT", "ES"]}
    )

    # Generate synthetic data (in memory + saved to disk)
    generate_synthetic_data(
        dataset=df,  # if desired, the CSV filename can also be given
        selected_model="auto-select",  # or explicitly: "CTGAN", "TVAE" or "Copulas"
        n_samples=10,  # number of synthetic samples to generate
        evaluation_report=True,  # if True, an evaluation PDF report is generated
        output_dir="results",  # base output directory (if None, a results path will be created)
        # optional run name: the subfolder where generated objects will be stored
        # (if None, a subfolder will be created)
        run_name="demo-run",
    )

This will save the following in the specified output folder:
- The generated synthetic dataset (synthetic_data.csv).
- A text file with the applied model information (info.txt).
- (If selected) A folder with the resulting evaluation figures (evaluation_figures).
- (If selected) A PDF report with synthetic data quality evaluation results (evaluation_report.pdf).
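After a run, it can be useful to confirm which of the files listed above were actually written. The following sketch uses only the standard library. It assumes the outputs land in output_dir/run_name (here results/demo-run), which is an inference from the parameter descriptions; the check_run_outputs helper is hypothetical.

```python
from pathlib import Path

# File and folder names taken from the list above.
EXPECTED_OUTPUTS = (
    "synthetic_data.csv",
    "info.txt",
    "evaluation_figures",
    "evaluation_report.pdf",
)


def check_run_outputs(output_dir="results", run_name="demo-run"):
    """Map each expected output name to whether it exists in the run folder."""
    # Assumption: outputs are written to <output_dir>/<run_name>.
    run_folder = Path(output_dir) / run_name
    return {name: (run_folder / name).exists() for name in EXPECTED_OUTPUTS}


for name, found in check_run_outputs().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

With evaluation_report=False, the last two entries would be expected to show as missing.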
API Reference
generate_synthetic_data(dataset, selected_model, n_samples, evaluation_report, output_dir, run_name)
This function is the primary entry point for generating a synthetic dataset.
Parameters:
- dataset (str or pandas DataFrame): The path to your input data in CSV format, or a pandas DataFrame that contains the original dataset.
- selected_model (str, optional): One of "auto-select", "CTGAN", "TVAE" or "Copulas". The default is "auto-select".
- n_samples (int, optional): Number of synthetic samples to generate. The default is the number of samples in dataset.
- evaluation_report (boolean, optional): Whether a synthetic data quality evaluation report will be created. The default is True.
- output_dir (str or Path, optional): The directory where generated objects will be stored. The default is "results".
- run_name (str, optional): The name of the folder where generated objects will be stored. The default is generated automatically.
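When the function is called from a script or pipeline, it may help to validate the arguments before starting a potentially long model fit. The sketch below is one possible wrapper, not part of the package: build_arguments and generate are hypothetical helpers, data.csv is a placeholder file name, and whether the library performs similar checks itself is not stated here.

```python
ALLOWED_MODELS = ("auto-select", "CTGAN", "TVAE", "Copulas")  # values listed above


def build_arguments(dataset, selected_model="auto-select", n_samples=None,
                    evaluation_report=True, output_dir="results", run_name=None):
    """Validate the documented parameters and return them as keyword arguments."""
    if selected_model not in ALLOWED_MODELS:
        raise ValueError(f"selected_model must be one of {ALLOWED_MODELS}")
    if n_samples is not None and n_samples <= 0:
        raise ValueError("n_samples must be a positive integer")
    kwargs = {
        "dataset": dataset,
        "selected_model": selected_model,
        "evaluation_report": evaluation_report,
        "output_dir": output_dir,
    }
    # Optional values are left out so the function falls back to its own defaults.
    if n_samples is not None:
        kwargs["n_samples"] = n_samples
    if run_name is not None:
        kwargs["run_name"] = run_name
    return kwargs


def generate(**kwargs):
    # Imported lazily so the validation above works without the package installed.
    from raise_synthetic_data_generator import generate_synthetic_data
    return generate_synthetic_data(**kwargs)


kwargs = build_arguments("data.csv", selected_model="TVAE", n_samples=100)
# generate(**kwargs)  # uncomment once the package is installed and data.csv exists
```

Keeping validation separate from the call makes typos such as an unsupported model name fail immediately, before any data is loaded.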