Synthesizing with Null Values

When you pass data with missing values to SDV, nulls are handled automatically. You don't need to clean or impute them yourself — SDV's preprocessing layer takes care of filling in nulls before the synthesizer trains, and then reintroduces them into the synthetic output.

This cookbook walks you through the default behavior, shows you how to customize it when the defaults aren't enough, and highlights common pitfalls to avoid along the way.

When to use this cookbook: You have a dataset with missing values and want to generate synthetic data that preserves the null patterns from your real data.

1. Loading and exploring the dataset

We'll work with the null_values_demo_dataset, a single-table dataset designed for this cookbook. It contains columns with varying amounts of missing values — from 0% to 100% — making it a good stress test for null handling.

import pandas as pd
import warnings
from sdv.datasets.demo import download_demo

warnings.filterwarnings("ignore")

data, metadata = download_demo(
    modality="single_table", dataset_name="null_values_demo_dataset"
)

data.head()

	ticket_id	created_at	customer_email	priority	category	is_escalated	response_time_hours	resolution_time_hours	satisfaction_score	internal_notes_count	resolved_at	resolution_status	agent_name	response_time_legacy	customer_phone	num_reassignments	customer_notes
0	TKT-00000	2025-09-26 12:36:44	james.davis@gmail.com	Medium	Technical	False	1.61	NaN	NaN	NaN	NaN	NaN	NaN	1.61	(713) 584-8784	1.0	NaN
1	TKT-00001	2025-07-06 06:41:18	linda.jones_760@gmail.com	Critical	Billing	False	0.95	3.05	NaN	NaN	2025-07-06 09:44:18	Resolved	Derek Patel	0.95	(206) 459-3081	3.0	NaN
2	TKT-00002	2025-09-27 23:57:30	NaN	Critical	Technical	False	4.50	14.60	NaN	NaN	2025-09-28 14:33:30	Workaround	Raymond Reddington	4.50	(415) 598-6575	0.0	NaN
3	TKT-00003	2025-11-07 20:42:47	patricia.miller@outlook.com	Medium	Feature Request	False	NaN	17.29	3.0	NaN	2025-11-08 14:00:11	Resolved	Marko Zakic	-1.00	NaN	2.0	NaN
4	TKT-00004	2025-09-02 00:03:08	michael.wilson@gmail.com	Medium	Bug Report	NaN	1.61	34.52	NaN	NaN	2025-09-03 10:34:20	Workaround	Raymond Reddington	1.61	(310) 613-5736	1.0	NaN

What are the null rates across columns?

data.isnull().mean().round(3)

ticket_id                0.000
created_at               0.000
customer_email           0.101
priority                 0.000
category                 0.146
is_escalated             0.200
response_time_hours      0.049
resolution_time_hours    0.305
satisfaction_score       0.684
internal_notes_count     0.928
resolved_at              0.305
resolution_status        0.305
agent_name               0.078
response_time_legacy     0.000
customer_phone           0.252
num_reassignments        0.195
customer_notes           1.000
dtype: float64

The null rates range from 0% (columns like ticket_id and priority) all the way to 100% (customer_notes, which is entirely null). Some columns share a similar null rate — for example, resolution_time_hours, resolved_at, and resolution_status are all around 30%, hinting that they might be null together. We'll investigate this pattern later in Section 7.

Null Value Map
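A quick way to follow up on the "null together" hint is to group columns by their exact null mask. The sketch below uses a small hypothetical frame rather than the demo dataset, and the helper name co_null_groups is made up for illustration; it is not part of SDV or pandas.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values, not the demo dataset).
toy = pd.DataFrame(
    {
        "resolution_time_hours": [3.0, np.nan, 14.6, np.nan],
        "resolved_at": ["2025-07-06", None, "2025-09-28", None],
        "satisfaction_score": [np.nan, 4.0, np.nan, np.nan],
        "priority": ["Low", "High", "Medium", "Low"],
    }
)


def co_null_groups(df):
    """Group columns whose null masks are identical row by row."""
    groups = {}
    for col in df.columns:
        mask = tuple(df[col].isnull())
        if any(mask):  # skip columns that have no nulls at all
            groups.setdefault(mask, []).append(col)
    # Only groups with two or more columns are interesting.
    return [cols for cols in groups.values() if len(cols) > 1]


print(co_null_groups(toy))
```

Calling co_null_groups(data) on the real table would list any column groups that are always null in the same rows, which is a stricter check than comparing null rates alone.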

2. Understanding the metadata

The metadata describes the structure of our dataset, including the type of each column (its sdtype). The sdtype drives how SDV handles missing values during preprocessing. Let's look at the metadata that came with the dataset:

metadata.visualize()


Each column has an sdtype — numerical, categorical, datetime, email, phone, or id. The sdtype determines how nulls are handled: numerical columns get their missing values filled with the column mean before training, categorical columns treat null as just another category, and PII columns like email and phone regenerate values using Faker while reintroducing nulls at the original rate.

3. Generating synthetic data with default settings

Generating synthetic data from a dataset with nulls requires no special configuration. Just create a synthesizer, fit it to the data, and sample. The GaussianCopulaSynthesizer is a good starting point — it uses classic statistical methods, produces high-quality results, and supports extensive customization.

We use AnonymizedFaker for the phone column to generate plausible phone numbers that aren't real. This ensures the synthetic data looks realistic without exposing anyone's actual contact information. If you're using SDV Enterprise, this also avoids a known compatibility issue with phone number columns.

from sdv.single_table import GaussianCopulaSynthesizer
from rdt.transformers.pii import AnonymizedFaker


def phone_transformer():
    """Return a fresh AnonymizedFaker for phone columns."""
    return {
        "customer_phone": AnonymizedFaker(
            provider_name="phone_number", function_name="phone_number"
        )
    }


synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.auto_assign_transformers(data)
synthesizer.update_transformers(phone_transformer())
synthesizer.fit(data)

synthetic_data = synthesizer.sample(num_rows=2000)

How do the null rates compare between real and synthetic data?

pd.DataFrame(
    {
        "Real": data.isnull().mean().round(3),
        "Synthetic": synthetic_data.isnull().mean().round(3),
    }
)

	Real	Synthetic
ticket_id	0.000	0.000
created_at	0.000	0.000
customer_email	0.101	0.107
priority	0.000	0.000
category	0.146	0.131
is_escalated	0.200	0.193
response_time_hours	0.049	0.044
resolution_time_hours	0.305	0.306
satisfaction_score	0.684	0.664
internal_notes_count	0.928	0.925
resolved_at	0.305	0.314
resolution_status	0.305	0.304
agent_name	0.078	0.078
response_time_legacy	0.000	0.000
customer_phone	0.252	0.252
num_reassignments	0.195	0.192
customer_notes	1.000	1.000

The null rates are very close. SDV preserves them out of the box without any null-related configuration. This is because the default mode ('random') reintroduces nulls at roughly the original proportion for each column.
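To build intuition for why the rates line up, here is a conceptual sketch of the 'random' idea for one numerical column: learn a fill value and a null rate, fill before modeling, then blank out a random subset of rows afterwards. This is a simplified illustration in plain pandas and NumPy, not SDV's or RDT's actual implementation; the function names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)


def fit_random_mode(column):
    """Learn the fill value and the null rate from a numerical column."""
    return {"fill": column.mean(), "null_rate": column.isnull().mean()}


def forward(column, state):
    """Fill nulls so the model only ever sees complete numbers."""
    return column.fillna(state["fill"])


def reverse(sampled, state):
    """Blank out a random subset of rows at the learned rate."""
    mask = rng.random(len(sampled)) < state["null_rate"]
    return sampled.mask(mask)


real = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0] * 200)  # 40% null
state = fit_random_mode(real)
filled = forward(real, state)        # stands in for the modeled values
synthetic = reverse(filled, state)   # nulls reappear at random positions

print(f"real null rate:      {real.isnull().mean():.3f}")
print(f"synthetic null rate: {synthetic.isnull().mean():.3f}")
```

Because the mask is drawn independently of every other column, this approach reproduces how often a column is null but not which rows are null.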

4. Evaluating synthetic data quality

SDV includes a built-in quality evaluation workflow. The evaluate_quality() function compares the real and synthetic data across two dimensions: Column Shapes (does each column's distribution look similar?) and Column Pair Trends (are correlations between columns preserved?). The overall score ranges from 0% to 100%.

It's okay — and even expected — to have a score that is not exactly 100%. A perfect score could actually indicate the synthetic data is too close to the real data.

from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    real_data=data, synthetic_data=synthetic_data, metadata=metadata
)

Generating report ...

(1/2) Evaluating Column Shapes: |██████████| 17/17 [00:00<00:00, 550.58it/s]|

Column Shapes Score: 90.53%

(2/2) Evaluating Column Pair Trends: |██████████| 136/136 [00:00<00:00, 560.45it/s]|

Column Pair Trends Score: 67.05%

Overall Score (Average): 78.79%

How well do null rates match specifically?

The Quality Report includes MissingValueSimilarity as part of Column Shapes. This metric scores how well the synthetic null rates match the real data for each column — from 0.0 (completely different) to 1.0 (identical). Let's compute it per column:

from sdmetrics.single_column import MissingValueSimilarity

for col in data.columns[data.isnull().any()]:
    score = MissingValueSimilarity.compute(
        real_data=data[col], synthetic_data=synthetic_data[col]
    )
    print(f"{col:30s} MissingValueSimilarity: {score:.3f}")

customer_email                 MissingValueSimilarity: 0.994
category                       MissingValueSimilarity: 0.986
is_escalated                   MissingValueSimilarity: 0.992
response_time_hours            MissingValueSimilarity: 0.996
resolution_time_hours          MissingValueSimilarity: 0.998
satisfaction_score             MissingValueSimilarity: 0.980
internal_notes_count           MissingValueSimilarity: 0.998
resolved_at                    MissingValueSimilarity: 0.991
resolution_status              MissingValueSimilarity: 0.999
agent_name                     MissingValueSimilarity: 1.000
customer_phone                 MissingValueSimilarity: 1.000
num_reassignments              MissingValueSimilarity: 0.998
customer_notes                 MissingValueSimilarity: 1.000
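The scores above are consistent with a simple formula: one minus the absolute difference between the real and synthetic null rates (for customer_email, 1 - |0.101 - 0.107| = 0.994). The sketch below reproduces that calculation by hand on toy data. The formula is an assumption inferred from these numbers and the metric's description, and the helper is not the SDMetrics implementation.

```python
import numpy as np
import pandas as pd


def missing_value_similarity(real, synthetic):
    """1 - |real null rate - synthetic null rate| (assumed formula)."""
    return 1 - abs(real.isnull().mean() - synthetic.isnull().mean())


real = pd.Series([1.0, np.nan, 3.0, 4.0])          # 25% null
synthetic = pd.Series([np.nan, np.nan, 2.0, 5.0])  # 50% null

print(missing_value_similarity(real, synthetic))  # 0.75
```

Note that this kind of score only compares rates. Two columns can score 1.0 while having nulls in completely different rows.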

5. Controlling how nulls appear in synthetic data

By default, SDV places nulls randomly at roughly the original rate. But what if the pattern of missingness matters — for example, unresolved support tickets always have null values for resolution time, resolution date, and resolution status? In that case, you might want the synthesizer to learn when values should be null, not just how often.

SDV offers three modes for null generation, configured through the missing_value_generation parameter:

Mode	Behavior	Best for
`'random'` (default)	Nulls placed randomly at the original column-level rate	Most use cases — accurate null rates without extra complexity
`'from_column'`	The synthesizer learns when values should be null based on patterns in other columns	Meaningful missingness — e.g., nulls that are correlated with other features
`None`	No nulls are generated for that column	Columns that must be complete in the synthetic output

You configure these modes using update_transformers(). Let's set up synthesizers with 'from_column' and None modes so we can compare all three:

Defining nullable columns

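First, let's list the numerical columns whose null behavior we want to control. The datetime column resolved_at is configured separately in the synthesizer code. Note that response_time_legacy is included even though it contains no NaN values (its missing entries are stored as -1, see Section 6), so the null-generation setting should have no visible effect on it: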

from rdt.transformers.numerical import FloatFormatter
from rdt.transformers.datetime import UnixTimestampEncoder

nullable_numerical = [
    "response_time_hours",
    "resolution_time_hours",
    "satisfaction_score",
    "internal_notes_count",
    "num_reassignments",
    "response_time_legacy",
]

`from_column` synthesizers

In 'from_column' mode, the synthesizer adds a binary indicator column for each nullable column — tracking whether each row was originally null. It then learns when values should be null based on patterns across all columns, not just the overall rate. This is useful when missingness is meaningful (e.g., unresolved tickets always have null resolution times).

fc_synthesizer = GaussianCopulaSynthesizer(metadata)
fc_synthesizer.auto_assign_transformers(data)
fc_synthesizer.update_transformers(phone_transformer())
fc_synthesizer.update_transformers(
    {
        col: FloatFormatter(
            missing_value_replacement="mean", missing_value_generation="from_column"
        )
        for col in nullable_numerical
    }
)
fc_synthesizer.update_transformers(
    {
        "resolved_at": UnixTimestampEncoder(missing_value_generation="from_column"),
    }
)
fc_synthesizer.fit(data)
synthetic_fc = fc_synthesizer.sample(num_rows=2000)
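The indicator mechanism described for 'from_column' can be pictured with a small round trip on one column. This is a conceptual sketch only: the column name value.is_null and both helper functions are illustrative assumptions, and a real synthesizer would sample the indicator jointly with the other columns instead of passing it through unchanged.

```python
import numpy as np
import pandas as pd


def forward(column):
    """Split one nullable column into a filled column plus a 0/1 indicator."""
    return pd.DataFrame(
        {
            "value": column.fillna(column.mean()),
            "value.is_null": column.isnull().astype(float),
        }
    )


def reverse(modeled):
    """Rebuild the column: rows whose indicator rounds to 1 become null."""
    is_null = modeled["value.is_null"].round().astype(bool)
    return modeled["value"].mask(is_null)


real = pd.Series([1.0, np.nan, 3.0, np.nan])
modeled = forward(real)      # what the model would be trained on
restored = reverse(modeled)  # indicator decides where nulls come back

print(modeled)
print(restored.isnull().tolist())  # [False, True, False, True]
```

Since the indicator is itself a modeled column, the synthetic null rate depends on how well the model captures it, which is one plausible reason a rate can drift away from the real one in this mode.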

`None` synthesizers

Setting missing_value_generation=None tells SDV not to generate any null values for that column. The transformer still fills in nulls before training (using the mean for numerical columns), but it does not reintroduce them when generating synthetic data. Use this when a column must be complete in the output.

# --- None mode (no nulls for numerical/datetime columns) ---
none_synth = GaussianCopulaSynthesizer(metadata)
none_synth.auto_assign_transformers(data)
none_synth.update_transformers(phone_transformer())
none_synth.update_transformers(
    {
        col: FloatFormatter(
            missing_value_replacement="mean", missing_value_generation=None
        )
        for col in nullable_numerical
    }
)
none_synth.update_transformers(
    {
        "resolved_at": UnixTimestampEncoder(missing_value_generation=None),
    }
)
none_synth.fit(data)
synthetic_none = none_synth.sample(num_rows=2000)

How do the three modes compare?

null_comparison = pd.DataFrame(
    {
        "Real": data.isnull().mean().round(3),
        "random": synthetic_data.isnull().mean().round(3),
        "from_column": synthetic_fc.isnull().mean().round(3),
        "None": synthetic_none.isnull().mean().round(3),
    }
)
null_comparison[null_comparison["Real"] > 0]

	Real	random	from_column	None
customer_email	0.101	0.107	0.102	0.107
category	0.146	0.131	0.132	0.131
is_escalated	0.200	0.193	0.208	0.193
response_time_hours	0.049	0.044	1.000	0.000
resolution_time_hours	0.305	0.306	0.308	0.000
satisfaction_score	0.684	0.664	0.674	0.000
internal_notes_count	0.928	0.925	1.000	0.000
resolved_at	0.305	0.314	0.308	0.000
resolution_status	0.305	0.304	0.306	0.304
agent_name	0.078	0.078	0.078	0.078
customer_phone	0.252	0.252	0.252	0.252
num_reassignments	0.195	0.192	0.197	0.000
customer_notes	1.000	1.000	1.000	1.000

Real vs Synthetic Null Rates — All Three Modes

A few things stand out in the comparison:

'random' closely matches the real null rates across all columns. This is the safest default.
'from_column' can sometimes produce very different null rates. Notice response_time_hours jumping to 100% null (vs 4.9% real) and internal_notes_count rising to 100% (vs 92.8% real). This is a known behavior with GaussianCopula, which we explore in the Interesting Cases with Null Values in SDV cookbook.
None eliminates nulls for the numerical and datetime columns we reconfigured, but categorical columns (category, is_escalated, resolution_status) and PII columns (customer_email, agent_name, customer_phone) still show nulls because they handle missingness differently and were not reconfigured here.

Key takeaway: Use 'random' (default) unless you specifically need null patterns tied to other columns.

6. Cleaning placeholder values before synthesis

A common real-world pattern is using placeholder values like -1, 999, or "N/A" instead of actual nulls. SDV does not detect these as missing — it treats them as legitimate data points and will reproduce them in the synthetic output.

Our dataset includes a column called response_time_legacy that uses -1.0 as a placeholder for missing values instead of NaN. Since SDV sees -1 as a valid number, the synthesizer will learn it as part of the data's distribution — producing synthetic rows with -1 values that look like real response times.

Does our dataset have any placeholders?

print("response_time_legacy (-1 placeholder):")
print(f"  Null count: {data['response_time_legacy'].isnull().sum()}")
print(f"  -1 count:   {(data['response_time_legacy'] == -1.0).sum()}")
print(f"  Min value:  {data['response_time_legacy'].min():.2f}")

response_time_legacy (-1 placeholder):
  Null count: 0
  -1 count:   98
  Min value:  -1.00

SDV treats -1 as a real value, not as a missing indicator

The synthetic data faithfully reproduces -1 values in response_time_legacy because SDV learned them as real data points — it has no way to know that -1 is a placeholder for "missing." Always convert placeholders to NaN before fitting:

import numpy as np

data["response_time_legacy"] = data["response_time_legacy"].replace(-1, np.nan)
data = data.replace("", np.nan)

Also watch for empty strings — pd.isna() does not flag "" as missing, so these need to be converted explicitly.
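Before converting anything, it can help to scan for likely placeholders. The sketch below counts a short list of suspect values per column on a toy frame. The suspect list and the helper name count_placeholders are assumptions for illustration; which values count as placeholders depends entirely on your data.

```python
import pandas as pd

# Candidate placeholders to look for (an assumption; adjust for your data).
SUSPECTS = [-1, 999, "N/A", ""]


def count_placeholders(df, suspects=SUSPECTS):
    """Count, per column, how often each suspect value appears."""
    rows = []
    for col in df.columns:
        for value in suspects:
            hits = int((df[col] == value).sum())
            if hits:
                rows.append({"column": col, "value": value, "count": hits})
    return pd.DataFrame(rows, columns=["column", "value", "count"])


toy = pd.DataFrame(
    {
        "response_time_legacy": [1.61, -1.0, 4.5, -1.0],
        "customer_notes": ["", "call back", "N/A", ""],
    }
)

print(count_placeholders(toy))
print(pd.isna(""))  # False: an empty string is not treated as missing
```

A hit is only a hint. A value like -1 can be legitimate in some columns, so it is safer to convert placeholders column by column than across the whole table.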

7. Preserving correlated null patterns

In our dataset, three columns are always null or non-null together: resolution_time_hours, resolved_at, and resolution_status. This makes sense — they all describe the resolution of a support ticket, so if a ticket isn't resolved, all three fields are missing. This is a correlated null pattern.

By default, SDV's 'random' mode treats each column independently when deciding which synthetic rows should have nulls. This breaks the correlation — you'll see rows where resolution_time_hours is null but resolution_status is not, which would never happen in the real data.

Does the default synthesizer preserve this correlation?

correlated_cols = ["resolution_time_hours", "resolved_at", "resolution_status"]

real_null_pattern = data[correlated_cols].isnull()
real_all_same = (real_null_pattern.nunique(axis=1) == 1).all()
print(f"Real data — nulls always aligned: {real_all_same}")

synth_null_pattern = synthetic_data[correlated_cols].isnull()
synth_all_same = (synth_null_pattern.nunique(axis=1) == 1).all()
print(f"Synthetic data — nulls always aligned: {synth_all_same}")

if not synth_all_same:
    misaligned = (synth_null_pattern.nunique(axis=1) > 1).sum()
    print(f"  Rows with broken correlation: {misaligned} / {len(synthetic_data)}")

Real data — nulls always aligned: True
Synthetic data — nulls always aligned: False
  Rows with broken correlation: 1293 / 2000

Default SDV breaks correlated null patterns
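The size of that number is roughly what independence predicts. If three columns are each null about 30% of the time with no link between them, a row is fully aligned with probability 0.3³ + 0.7³ ≈ 0.37, so about 63% of rows would be misaligned, close to the 1293 / 2000 observed. A complementary way to quantify alignment is the correlation between null indicators. The sketch below contrasts a shared mask with independent masks on toy data; the helper null_correlation is a hypothetical name, not an SDV function.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Aligned: one shared mask decides nullness for both columns.
shared = rng.random(n) < 0.3
aligned = pd.DataFrame(
    {"a": np.where(shared, np.nan, 1.0), "b": np.where(shared, np.nan, 2.0)}
)

# Independent: each column draws its own mask at the same rate.
independent = pd.DataFrame(
    {
        "a": np.where(rng.random(n) < 0.3, np.nan, 1.0),
        "b": np.where(rng.random(n) < 0.3, np.nan, 2.0),
    }
)


def null_correlation(df):
    """Pearson correlation between the columns' 0/1 null indicators."""
    return df.isnull().astype(int).corr()


print(null_correlation(aligned).loc["a", "b"])      # 1.0 by construction
print(null_correlation(independent).loc["a", "b"])  # close to 0
```

Running null_correlation on data[correlated_cols] and on the synthetic equivalent gives a pairwise view of how much of the pattern survived.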

The null indicators of these three columns are perfectly correlated in the real data (1.0 across all pairs), but the synthetic data shows a much weaker correlation. We can reduce the misalignment by switching to 'from_column' mode on the correlated columns, which tells the synthesizer to learn when these values should be null:

from rdt.transformers.numerical import FloatFormatter
from rdt.transformers.datetime import UnixTimestampEncoder

fc_corr_synth = GaussianCopulaSynthesizer(metadata)
fc_corr_synth.auto_assign_transformers(data)
fc_corr_synth.update_transformers(phone_transformer())

fc_corr_synth.update_transformers(
    {
        "resolution_time_hours": FloatFormatter(
            missing_value_replacement="mean", missing_value_generation="from_column"
        ),
        "resolved_at": UnixTimestampEncoder(missing_value_generation="from_column"),
    }
)

fc_corr_synth.fit(data)
fc_corr_synthetic = fc_corr_synth.sample(num_rows=2000)

Did from_column mode improve the correlation?

fc_corr_null_pattern = fc_corr_synthetic[correlated_cols].isnull()
fc_corr_misaligned = (fc_corr_null_pattern.nunique(axis=1) > 1).sum()

print(
    f"Default (random) — misaligned rows: "
    f"{(synth_null_pattern.nunique(axis=1) > 1).sum()} / {len(synthetic_data)}"
)
print(f"from_column — misaligned rows: {fc_corr_misaligned} / {len(fc_corr_synthetic)}")

Default (random) — misaligned rows: 1293 / 2000
from_column — misaligned rows: 402 / 2000

Correlated nulls: from_column reduces misalignment

'from_column' reduces misalignment significantly but doesn't guarantee perfect alignment — each indicator column is modeled independently by the synthesizer.

For strict enforcement, SDV Enterprise offers FixedNullCombinations — a constraint purpose-built for enforcing null co-occurrence patterns. It locks the null/non-null structure while still allowing flexibility in how non-null values combine.

8. Customizing null handling per column

You can mix and match null modes across columns using update_transformers(). This lets you apply 'from_column' where missingness is meaningful, None where a column must be complete, and leave the default 'random' for everything else.

The two key parameters on each transformer are:

missing_value_generation — controls how nulls reappear in synthetic data ('random', 'from_column', or None)
missing_value_replacement — controls what value replaces nulls during training (default: 'mean'; alternative: 'random', which chooses a value uniformly at random from the column's min/max range)
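To make the two replacement strategies concrete, the sketch below applies each one by hand to a toy column. It mirrors the descriptions above in plain pandas and NumPy and is an illustration only, not the transformer's internal code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
column = pd.Series([2.0, np.nan, 10.0, np.nan, 6.0])

# 'mean': every null gets the same value, the column mean (6.0 here).
mean_filled = column.fillna(column.mean())

# 'random': each null gets its own draw from [min, max] = [2.0, 10.0].
draws = pd.Series(
    rng.uniform(column.min(), column.max(), len(column)), index=column.index
)
random_filled = column.fillna(draws)

print(mean_filled.tolist())  # [2.0, 6.0, 10.0, 6.0, 6.0]
print(random_filled.tolist())
```

Mean filling piles all formerly null rows onto a single value, while random filling spreads them across the observed range. Which one distorts the learned distribution less depends on the column.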

custom = GaussianCopulaSynthesizer(metadata)
custom.auto_assign_transformers(data)
custom.update_transformers(phone_transformer())

# Use 'from_column' for the correlated resolution group
custom.update_transformers(
    {
        "resolution_time_hours": FloatFormatter(
            missing_value_replacement="mean", missing_value_generation="from_column"
        ),
        "resolved_at": UnixTimestampEncoder(missing_value_generation="from_column"),
    }
)

# Eliminate nulls entirely for satisfaction_score
custom.update_transformers(
    {
        "satisfaction_score": FloatFormatter(
            missing_value_replacement="mean", missing_value_generation=None
        ),
    }
)

custom.fit(data)
custom_result = custom.sample(num_rows=2000)

updated_cols = ["resolution_time_hours", "resolved_at", "satisfaction_score"]
for col in updated_cols:
    print(
        f"{col} null rate — Real: {data[col].isnull().mean():.1%}, "
        f"Synthetic: {custom_result[col].isnull().mean():.1%}"
    )

resolution_time_hours null rate — Real: 30.5%, Synthetic: 30.6%
resolved_at null rate — Real: 30.5%, Synthetic: 30.6%
satisfaction_score null rate — Real: 68.4%, Synthetic: 0.0%

Each column follows the mode you assigned: 'from_column' for the resolution group and None for satisfaction_score. You can mix and match modes freely — SDV applies them independently per column.

9. Recommendations

Scenario	Recommended approach
Nulls are random / no meaningful pattern	`'random'` (default) — no configuration needed
Missingness is meaningful or correlated with other features	`'from_column'` via `update_transformers()`
Column must have zero nulls in synthetic output	`None` via `update_transformers()`
Columns are always null together	`'from_column'` mode + `FixedNullCombinations` constraint (Enterprise)
Data uses `-1`, `"N/A"`, or `""` as missing indicators	Convert to `NaN` before fitting

Checklist before submitting synthetic data:

Replace placeholders with NaN before fitting — SDV doesn't auto-detect -1, 999, "N/A", or empty strings
Compare null rates after sampling: synthetic_data.isnull().mean() vs data.isnull().mean()
Run evaluate_quality() for a full quality assessment including null rate similarity
Test constraints on a small data subset first — null-related constraint errors are among the most common SDV issues
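The null-rate comparison in the checklist can be wrapped in a small helper that flags columns whose rates drift too far. This is a minimal sketch on toy data; the function name null_rate_report and the 0.05 tolerance are arbitrary assumptions, not SDV defaults.

```python
import numpy as np
import pandas as pd


def null_rate_report(real, synthetic, tolerance=0.05):
    """Compare per-column null rates and flag gaps above the tolerance."""
    report = pd.DataFrame(
        {
            "real": real.isnull().mean(),
            "synthetic": synthetic.isnull().mean(),
        }
    )
    report["gap"] = (report["real"] - report["synthetic"]).abs()
    report["ok"] = report["gap"] <= tolerance
    return report


real = pd.DataFrame({"x": [1.0, np.nan, 3.0, 4.0], "y": [np.nan] * 4})
synthetic = pd.DataFrame({"x": [np.nan, np.nan, 2.0, np.nan], "y": [np.nan] * 4})

print(null_rate_report(real, synthetic))
```

In the cookbook's setting the call would be null_rate_report(data, synthetic_data), followed by a look at any rows where ok is False.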

Conclusion

SDV handles missing values automatically. The default behavior preserves null rates without any configuration — just fit and sample. When you need more control, update_transformers() lets you choose how each column handles nulls, and constraints like FixedNullCombinations can enforce correlated null patterns.

The most important step is data preparation: always convert placeholder values to NaN before fitting, and validate null rates after sampling.