RAISE Synthetic Data Generator

Description

RAISE Synthetic Data Generator is a Python module that generates shareable versions of tabular datasets, ready to upload to the RAISE platform. The module can also be used outside the RAISE project as a lightweight synthetic data generator.

Features

In the current version (0.1.5), the module provides a single main function: generate_synthetic_data. It has the following capabilities:

  • Input flexibility. Accepts either:

    • A CSV file path, or
    • A pandas.DataFrame.
  • Automatic or manual model selection

    • "auto-select" (default) automatically chooses the best synthetic data generation model based on the input data.
    • You can also specify a model explicitly (currently one of CTGAN, TVAE or Copulas).
  • Synthetic data generation

    • Generates a synthetic dataset with the same properties as the input data.
    • The number of synthetic samples to generate can be set via n_samples (defaults to the number of rows in the input data).
  • Results storage

    • Saves the generated synthetic dataset as synthetic_data.csv.
    • Stores model information (info.txt) inside the chosen output folder.
    • Creates a run-specific folder under the desired output path.
  • Evaluation report (optional)

    • If evaluation_report=True (default), runs a quality assessment comparing original vs synthetic data.
    • Produces an evaluation report (evaluation_report.pdf) with figures and summary statistics.
  • Logging and error handling

    • Provides informative log messages for each step (dataset loading, model selection, data generation, report creation).
    • Exceptions are logged with full traceback and re-raised for debugging.
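
Because exceptions are re-raised, the calling code decides what happens after a failure. The following is a minimal caller-side sketch, not part of the module: run_safely is a hypothetical helper that takes the generation function as an argument, logs any exception and returns None instead of propagating it.

```python
import logging

logger = logging.getLogger("synthetic-demo")  # hypothetical logger name


def run_safely(generate, **kwargs):
    """Call `generate(**kwargs)`; log and swallow any exception it raises.

    `generate` is expected to be generate_synthetic_data, but any callable
    works, which keeps this sketch independent of the installed package.
    """
    try:
        return generate(**kwargs)
    except Exception:
        # logger.exception records the message together with the traceback.
        logger.exception("Synthetic data generation failed")
        return None
```

With the package installed, this could be used as run_safely(generate_synthetic_data, dataset="data.csv"), where data.csv is a placeholder path.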

Installation

You can install raise-synthetic-data-generator directly from PyPI using pip:

pip install raise-synthetic-data-generator

To verify that the installation was successful, open a Python shell and import the module:

import raise_synthetic_data_generator

If no errors are raised, the package is installed and ready to use.
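
The same check can be scripted without importing the package, which is convenient in setup scripts. This is a minimal sketch using only the standard library; the helper names installed_version and is_importable are illustrative, and the distribution and module names are taken from the pip and import commands above.

```python
from importlib import metadata
from importlib.util import find_spec


def installed_version(dist_name="raise-synthetic-data-generator"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_importable(module_name="raise_synthetic_data_generator"):
    """Return True if the top-level module can be located on the import path."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    print("version:", installed_version())
    print("importable:", is_importable())
```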

Troubleshooting

If you encounter any issues during installation:

  • Ensure your Python environment is properly set up.
  • If pip is not linked to your Python 3 installation, you might need to run:
    pip3 install raise-synthetic-data-generator
  • If you’re working within a virtual environment, make sure it is activated before running the install command.

Usage

from raise_synthetic_data_generator import generate_synthetic_data
import pandas as pd

# Example input dataframe
df = pd.DataFrame(
    {"age": [23, 35, 44, 29, 31], "country": ["ES", "FR", "DE", "IT", "ES"]}
)

# Generate synthetic data (in memory + saved to disk)
generate_synthetic_data(
    dataset=df,  # a CSV file path can be given instead
    selected_model="auto-select",  # or explicitly: "CTGAN", "TVAE" or "Copulas"
    n_samples=10,  # number of synthetic samples to generate
    evaluation_report=True,  # if True, a PDF evaluation report is generated
    output_dir="results",  # base output directory (defaults to "results")
    run_name="demo-run",  # optional subfolder for this run's outputs; auto-generated if omitted
)

This saves the following in the specified output folder:

  • The generated synthetic dataset (synthetic_data.csv).
  • A text file with information about the applied model (info.txt).
  • (If selected) A folder with the resulting evaluation figures (evaluation_figures).
  • (If selected) A PDF report with the synthetic data quality evaluation results (evaluation_report.pdf).
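
After a run, the presence of these outputs can be checked with a few lines of standard-library code. This is a sketch under one assumption: that the run folder is output_dir/run_name (here results/demo-run, as in the example above). The helper name check_run_outputs is illustrative and not part of the module.

```python
from pathlib import Path

# Output names as listed in this document.
EXPECTED_OUTPUTS = (
    "synthetic_data.csv",
    "info.txt",
    "evaluation_figures",
    "evaluation_report.pdf",
)


def check_run_outputs(output_dir="results", run_name="demo-run"):
    """Map each expected output name to True if it exists in the run folder."""
    run_dir = Path(output_dir) / run_name  # assumed layout: <output_dir>/<run_name>
    return {name: (run_dir / name).exists() for name in EXPECTED_OUTPUTS}


if __name__ == "__main__":
    for name, present in check_run_outputs().items():
        print(f"{name}: {'found' if present else 'missing'}")
```

With evaluation_report=False, the last two entries are expected to be missing.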

API Reference

generate_synthetic_data(dataset, selected_model, n_samples, evaluation_report, output_dir, run_name)

This function is the primary entry point for generating a synthetic dataset.

Parameters:

  • dataset (str or pandas DataFrame): The path where your input data is stored in CSV format or a pandas DataFrame that contains the original dataset.
  • selected_model (str, optional): One of “auto-select”, “CTGAN”, “TVAE” or “Copulas”. The default is "auto-select".
  • n_samples (int, optional): Number of synthetic samples to generate. The default is the number of samples in dataset.
  • evaluation_report (boolean, optional): Boolean indicating if a synthetic data quality evaluation report will be created. The default is True.
  • output_dir (str or Path, optional): The base directory where generated objects will be stored. The default is "results".
  • run_name (str, optional): The name of the run-specific subfolder of output_dir where generated objects will be stored. The default is generated automatically.
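
The parameters above can be checked before starting a potentially long generation run. The helper below, build_generation_kwargs, is a hypothetical pre-flight sketch and not part of the module. It assumes model names are matched exactly as written in this document, and the module may perform its own validation as well.

```python
from pathlib import Path

# Model names as documented above; exact, case-sensitive matching is an assumption.
ALLOWED_MODELS = ("auto-select", "CTGAN", "TVAE", "Copulas")


def build_generation_kwargs(dataset, selected_model="auto-select", n_samples=None,
                            evaluation_report=True, output_dir="results",
                            run_name=None):
    """Validate arguments and return them as a keyword-argument dict."""
    if selected_model not in ALLOWED_MODELS:
        raise ValueError(f"selected_model must be one of {ALLOWED_MODELS}")
    if n_samples is not None:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(n_samples, bool) or not isinstance(n_samples, int) or n_samples <= 0:
            raise ValueError("n_samples must be a positive integer")
    if isinstance(dataset, str) and not Path(dataset).is_file():
        raise FileNotFoundError(f"CSV file not found: {dataset}")

    kwargs = {
        "dataset": dataset,
        "selected_model": selected_model,
        "evaluation_report": bool(evaluation_report),
        "output_dir": output_dir,
    }
    # Leave optional values out so the function's own defaults apply.
    if n_samples is not None:
        kwargs["n_samples"] = n_samples
    if run_name is not None:
        kwargs["run_name"] = run_name
    return kwargs
```

With the package installed, the result can be passed on as generate_synthetic_data(**build_generation_kwargs(df, n_samples=10)).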