Interesting Cases with Null Values¶

This cookbook explores specific scenarios you may encounter when synthesizing data that has missing values. Each section is self-contained, so feel free to jump to the one most relevant to your case.

Prerequisite: Familiarity with SDV's basic null handling. If you're new to generating synthetic data from datasets with nulls, start with the How to Generate Synthetic Data When Your Data Has Null Values cookbook.

Setup¶

We'll load a demo dataset and fit a default synthesizer that we can reference throughout the notebook.

import warnings

warnings.filterwarnings("ignore")

import pandas as pd

from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer
from rdt.transformers.pii import AnonymizedFaker

data, metadata = download_demo(
    modality="single_table", dataset_name="null_values_demo_dataset"
)


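def phone_transformer():
    # Build a new mapping on every call so each synthesizer gets its own
    # AnonymizedFaker instance for the phone column.
    return {
        "customer_phone": AnonymizedFaker(
            provider_name="phone_number", function_name="phone_number"
        )
    }


# Fit a default synthesizer, overriding only the phone column's transformer.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.auto_assign_transformers(data)
synthesizer.update_transformers(phone_transformer())
synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_rows=2000)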

print(f"Dataset: {data.shape[0]} rows, {data.shape[1]} columns")
print(f"Columns with nulls: {data.isnull().any().sum()}")

Dataset: 2000 rows, 17 columns
Columns with nulls: 13

How different synthesizers handle null values¶

SDV offers several single-table synthesizers: GaussianCopulaSynthesizer, CTGANSynthesizer, TVAESynthesizer, and CopulaGANSynthesizer. Null handling happens in the preprocessing layer, before the data reaches the underlying model, so all of them share the same null-handling pipeline.
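As a rough illustration of why the choice of model should matter little here, the sketch below mimics the idea for a single numerical column: record the null rate, hand the model a fully populated column, and re-insert nulls at the recorded rate after sampling. The helper names (`strip_nulls`, `reinsert_nulls`) and the two stand-in "models" are hypothetical, and this is a simplified picture of random null re-insertion, not SDV's actual implementation.

```python
import numpy as np
import pandas as pd


def strip_nulls(column):
    """Return a fully populated copy of the column and its observed null rate."""
    null_rate = column.isna().mean()
    filled = column.fillna(column.mean())
    return filled, null_rate


def reinsert_nulls(values, null_rate, rng):
    """Blank out a random subset of sampled values at the recorded null rate."""
    mask = rng.random(len(values)) < null_rate
    return pd.Series(values).mask(mask)


rng = np.random.default_rng(0)
real = pd.Series(rng.normal(10.0, 2.0, size=5000))
real[rng.random(5000) < 0.3] = np.nan  # roughly 30% missing

filled, null_rate = strip_nulls(real)

# Two deliberately different stand-in "models"; neither ever sees a null.
model_a = rng.normal(filled.mean(), filled.std(), size=5000)
model_b = rng.choice(filled.to_numpy(), size=5000)

synthetic_a = reinsert_nulls(model_a, null_rate, rng)
synthetic_b = reinsert_nulls(model_b, null_rate, rng)

print(round(null_rate, 3))
print(round(synthetic_a.isna().mean(), 3), round(synthetic_b.isna().mean(), 3))
```

Both synthetic columns end up with a null rate close to the real one, even though the values themselves come from different generators.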

Let's verify by comparing GaussianCopula and CTGAN on our dataset:

from sdv.single_table import CTGANSynthesizer

ctgan = CTGANSynthesizer(metadata, epochs=5)
ctgan.auto_assign_transformers(data)
ctgan.update_transformers(phone_transformer())
ctgan.fit(data)
ctgan_synthetic = ctgan.sample(num_rows=2000)

null_compare = pd.DataFrame(
    {
        "Real": data.isnull().mean().round(3),
        "GaussianCopula": synthetic_data.isnull().mean().round(3),
        "CTGAN": ctgan_synthetic.isnull().mean().round(3),
    }
)
null_compare[null_compare["Real"] > 0]

	Real	GaussianCopula	CTGAN
customer_email	0.101	0.107	0.096
category	0.146	0.131	0.166
is_escalated	0.200	0.193	0.272
response_time_hours	0.049	0.044	0.044
resolution_time_hours	0.305	0.306	0.306
satisfaction_score	0.684	0.664	0.664
internal_notes_count	0.928	0.925	0.925
resolved_at	0.305	0.314	0.314
resolution_status	0.305	0.304	0.228
agent_name	0.078	0.078	0.078
customer_phone	0.252	0.252	0.252
num_reassignments	0.195	0.192	0.192
customer_notes	1.000	1.000	1.000
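One way to summarize a comparison like this is the absolute gap between each synthesizer's null rate and the real one. The sketch below hard-codes a few rows copied from the output above so that it runs on its own; in the notebook, the same steps could be applied to `null_compare` directly.

```python
import pandas as pd

# A few rows copied from the comparison output above.
null_compare = pd.DataFrame(
    {
        "Real": [0.146, 0.200, 0.305, 0.684, 0.305],
        "GaussianCopula": [0.131, 0.193, 0.306, 0.664, 0.304],
        "CTGAN": [0.166, 0.272, 0.306, 0.664, 0.228],
    },
    index=[
        "category",
        "is_escalated",
        "resolution_time_hours",
        "satisfaction_score",
        "resolution_status",
    ],
)

# Absolute difference between each synthesizer's null rate and the real one.
gaps = (
    null_compare[["GaussianCopula", "CTGAN"]]
    .sub(null_compare["Real"], axis=0)
    .abs()
)

# Column with the largest gap for each synthesizer.
worst = gaps.idxmax()
print(gaps.round(3))
print(worst)
```

In these rows, the largest CTGAN gaps fall on `is_escalated` and `resolution_status`, while `resolution_time_hours` and `satisfaction_score` match the GaussianCopula rates exactly.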

Column	Real	`random` (default)	`from_column`
`response_time_hours`	4.9%	4.5%	100.0%
`resolution_time_hours`	30.5%	30.6%	32.3%
`satisfaction_score`	68.4%	66.4%	69.2%
`internal_notes_count`	92.8%	92.5%	98.9%
`num_reassignments`	19.5%	19.2%	20.5%
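For context on the `from_column` figures above: the general idea behind an indicator-based approach is to give the model an extra column that marks which values were missing, so that nulls are modeled alongside the other columns instead of being re-inserted at random. The sketch below illustrates that idea in plain pandas. The helper names and the `.is_null` suffix are illustrative assumptions, and this is not RDT's implementation.

```python
import numpy as np
import pandas as pd


def add_null_indicator(df, column):
    """Fill a numerical column and add a 0/1 column flagging the original nulls."""
    out = df.copy()
    out[f"{column}.is_null"] = out[column].isna().astype(float)
    out[column] = out[column].fillna(out[column].mean())
    return out


def apply_null_indicator(df, column):
    """Reverse step: blank out values wherever the indicator rounds to 1."""
    out = df.copy()
    is_null = out.pop(f"{column}.is_null").round().astype(bool)
    out[column] = out[column].mask(is_null)
    return out


df = pd.DataFrame({"response_time_hours": [1.5, np.nan, 4.0, np.nan, 2.5]})
encoded = add_null_indicator(df, "response_time_hours")
decoded = apply_null_indicator(encoded, "response_time_hours")

print(encoded)
print(decoded)
```

Because the indicator is itself a modeled column in this scheme, the synthetic null rate depends on how well the model reproduces it, which is one way such rates could drift from the real ones.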

	ticket_id	num_reassignments
0	TKT-28524	1.0
1	TKT-70367	2.0
2	TKT-00785	4.0
3	TKT-30724	4.0
4	TKT-21867	0.0
5	TKT-19435	4.0
6	TKT-41578	5.0
7	TKT-85912	3.0
8	TKT-32422	3.0
9	TKT-31125	NaN

	ticket_id	num_reassignments
0	TKT-28524	1
1	TKT-70367	2
2	TKT-00785	4
3	TKT-30724	4
4	TKT-21867	0
5	TKT-19435	4
6	TKT-41578	5
7	TKT-85912	3
8	TKT-32422	<NA>
9	TKT-31125	<NA>
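The two outputs above show `num_reassignments` rendered as `1.0`/`NaN` and as `1`/`<NA>`, which is how pandas displays `float64` and nullable `Int64` data respectively. A minimal pandas-only sketch of that difference, using made-up values:

```python
import numpy as np
import pandas as pd

# Integers plus a missing value: pandas falls back to float64 by default.
as_float = pd.Series([1, 2, np.nan, 4])
print(as_float.dtype)
print(as_float.tolist())

# The nullable extension dtype keeps whole numbers and marks gaps as <NA>.
as_int = as_float.astype("Int64")
print(as_int.dtype)
print(as_int.tolist())
```

If a synthetic column comes back as `float64`, a cast such as `.round().astype("Int64")` is one possible post-processing step; whether it is needed may depend on the SDV version and how the column is described in the metadata.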

When null rates don't match: `from_column` distortion¶

How mean replacement affects synthetic data quality¶

What `None` mode does (and doesn't do) across column types¶

Columns that are entirely null¶

Preserving nullable integer types (`Int64`)¶

Conclusion¶
