RAISE Synthetic Data Generator

Description

RAISE Synthetic Data Generator is a Python module that generates shareable versions of tabular datasets, ready to upload to the RAISE platform. The module can also be used outside the RAISE project as a lightweight synthetic data generator.

Features

In the current version (0.1.5), the module provides a single main function: generate_synthetic_data. It has the following capabilities:

Input flexibility
- Accepts either a CSV file path or a pandas.DataFrame.
Automatic or manual model selection
- "auto-select" (default) automatically chooses the best synthetic data generation model based on the input data.
- You can also specify a model explicitly (currently one of "CTGAN", "TVAE" or "Copulas").
Synthetic data generation
- Generates a synthetic dataset that aims to preserve the statistical properties of the input data.
- The number of synthetic samples to generate can be set via n_samples (defaults to the size of the input data).
Results storage
- Saves the generated synthetic dataset as synthetic_data.csv.
- Stores model information (info.txt) inside the chosen output folder.
- Creates a run-specific folder under the desired output path.
Evaluation report (optional)
- If evaluation_report=True (default), runs a quality assessment comparing original vs synthetic data.
- Produces an evaluation report (evaluation_report.pdf) with figures and summary statistics.
Logging and error handling
- Provides informative log messages for each step (dataset loading, model selection, data generation, report creation).
- Exceptions are logged with full traceback and re-raised for debugging.

Installation

You can install raise-synthetic-data-generator directly from PyPI using pip:

pip install raise-synthetic-data-generator

To verify that the installation was successful, open a Python shell and try to import the module:

import raise_synthetic_data_generator

If no errors are raised, the package is installed and ready to use.
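
As an additional check, the installed version can be read from the package metadata with the standard library. This is a minimal sketch: the helper name `installed_version` is hypothetical, and the distribution name is assumed to match the PyPI name used in the install command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="raise-synthetic-data-generator"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version())
```

If this prints None although the import succeeds, the package was probably installed under a different distribution name or into a different environment.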

Troubleshooting

If you encounter any issues during installation:

Ensure your Python environment is properly set up.
If you’re using Python 3, you might need to run:
```
pip3 install raise-synthetic-data-generator
```
If you’re working within a virtual environment, make sure it is activated before running the install command.
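
To see which interpreter is in use and whether a virtual environment is active, a short standard-library check can help. This is a sketch, not part of the package: `in_virtualenv` is a hypothetical helper, and it only detects environments created with venv or virtualenv (a conda environment, for example, is not reported as active).

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv/virtualenv environment."""
    # Inside a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtualenv())
```

Run it with the same interpreter you use for pip (for example via `python -m pip`) so that the install command and the check refer to the same environment.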

Usage

from raise_synthetic_data_generator import generate_synthetic_data
import pandas as pd

# Example input dataframe
df = pd.DataFrame(
    {"age": [23, 35, 44, 29, 31], "country": ["ES", "FR", "DE", "IT", "ES"]}
)

# Generate synthetic data (in memory + saved to disk)
generate_synthetic_data(
    dataset=df,  # a CSV file path can be given instead
    selected_model="auto-select",  # or explicitly: "CTGAN", "TVAE" or "Copulas"
    n_samples=10,  # number of synthetic samples to generate
    evaluation_report=True,  # if True, a PDF evaluation report is generated
    output_dir="results",  # base output directory (if None, a "results" folder is created)
    run_name="demo-run",  # optional subfolder for the generated objects (created automatically if None)
)

This saves the following in the specified output folder:

The generated synthetic dataset (synthetic_data.csv).
A text file with information about the applied model (info.txt).
(If selected) A folder with the resulting evaluation figures (evaluation_figures).
(If selected) A PDF report with the synthetic data quality evaluation results (evaluation_report.pdf).
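
After a run, the presence of these outputs can be checked with a few lines of standard-library code. This is a sketch under one assumption: the files are placed directly inside `output_dir/run_name`, as the parameter descriptions suggest. The helper `check_run_outputs` is hypothetical and not part of the package.

```python
from pathlib import Path

# Names as listed above; the last two are only expected
# when evaluation_report=True.
EXPECTED_OUTPUTS = (
    "synthetic_data.csv",
    "info.txt",
    "evaluation_figures",
    "evaluation_report.pdf",
)


def check_run_outputs(output_dir="results", run_name="demo-run"):
    """Map each expected output name to whether it exists in the run folder."""
    run_dir = Path(output_dir) / run_name  # assumed layout: <output_dir>/<run_name>/
    return {name: (run_dir / name).exists() for name in EXPECTED_OUTPUTS}


for name, found in check_run_outputs().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

The default arguments match the values used in the usage example above.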

API Reference

`generate_synthetic_data(dataset, selected_model, n_samples, evaluation_report, output_dir, run_name)`

This function is the primary entry point for generating a synthetic dataset.

Parameters:

dataset (str or pandas DataFrame): Path to a CSV file containing the input data, or a pandas DataFrame that contains the original dataset.
selected_model (str, optional): One of "auto-select", "CTGAN", "TVAE" or "Copulas". The default is "auto-select".
n_samples (int, optional): Number of synthetic samples to generate. The default is the number of samples in dataset.
evaluation_report (bool, optional): Whether a synthetic data quality evaluation report is created. The default is True.
output_dir (str or Path, optional): The base directory where generated objects will be stored. The default is "results".
run_name (str, optional): The folder name where generated objects will be stored. The default is generated automatically.
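
Because the accepted values are listed above, arguments can be checked before starting a potentially long run. The following is a sketch, not part of the package: `check_arguments` is a hypothetical helper, the requirement that n_samples be positive is an assumption, and the library may perform its own validation.

```python
from pathlib import Path

# Model names as documented in the parameter list above.
ALLOWED_MODELS = {"auto-select", "CTGAN", "TVAE", "Copulas"}


def check_arguments(dataset, selected_model="auto-select", n_samples=None):
    """Raise ValueError for arguments that contradict the documented parameters."""
    if selected_model not in ALLOWED_MODELS:
        raise ValueError(
            f"Unknown model {selected_model!r}; expected one of {sorted(ALLOWED_MODELS)}"
        )
    # Assumption: a sample count must be a positive integer.
    if n_samples is not None and (not isinstance(n_samples, int) or n_samples < 1):
        raise ValueError("n_samples must be a positive integer")
    # A str or Path is treated as a CSV file path; anything else is
    # assumed to be a pandas DataFrame and is not inspected here.
    if isinstance(dataset, (str, Path)) and not Path(dataset).is_file():
        raise ValueError(f"CSV file not found: {dataset}")
```

A typical use is to call `check_arguments(dataset, selected_model, n_samples)` immediately before `generate_synthetic_data(...)` with the same values.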