Synthesize a Table (Gaussian Copula)
In this notebook, we'll use the SDV to create synthetic data for a single table and evaluate it. The SDV uses machine learning to learn patterns from real data and emulates them when creating synthetic data.
We'll use the Gaussian Copula algorithm to do this. Gaussian Copula is a fast, customizable and transparent way to synthesize data.
import warnings
warnings.filterwarnings('ignore')
1. Loading the demo data
For this demo, we'll use a fake dataset that describes some fictional guests staying at a hotel.
from sdv.datasets.demo import download_demo
real_data, metadata = download_demo(
modality='single_table',
dataset_name='fake_hotel_guests'
)
Details:
- The data is available as a single table.
- guest_email is a primary key that uniquely identifies every row.
- Other columns have a variety of data types, and some data may be missing.
real_data.head()
| | guest_email | has_rewards | room_type | amenities_fee | checkin_date | checkout_date | room_rate | billing_address | credit_card_number |
|---|---|---|---|---|---|---|---|---|---|
| 0 | michaelsanders@shaw.net | False | BASIC | 37.89 | 27 Dec 2020 | 29 Dec 2020 | 131.23 | 49380 Rivers Street\nSpencerville, AK 68265 | 4075084747483975747 |
| 1 | randy49@brown.biz | False | BASIC | 24.37 | 30 Dec 2020 | 02 Jan 2021 | 114.43 | 88394 Boyle Meadows\nConleyberg, TN 22063 | 180072822063468 |
| 2 | webermelissa@neal.com | True | DELUXE | 0.00 | 17 Sep 2020 | 18 Sep 2020 | 368.33 | 0323 Lisa Station Apt. 208\nPort Thomas, LA 82585 | 38983476971380 |
| 3 | gsims@terry.com | False | BASIC | NaN | 28 Dec 2020 | 31 Dec 2020 | 115.61 | 77 Massachusetts Ave\nCambridge, MA 02139 | 4969551998845740 |
| 4 | misty33@smith.biz | False | BASIC | 16.45 | 05 Apr 2020 | NaN | 122.41 | 1234 Corporate Drive\nBoston, MA 02116 | 3558512986488983 |
The demo also includes metadata, a description of the dataset. It includes the primary keys as well as the data types for each column (called "sdtypes").
metadata.visualize()
2. Basic Usage
2.1 Creating a Synthesizer
An SDV synthesizer is an object that you can use to create synthetic data. It learns patterns from the real data and replicates them to generate synthetic data.
from sdv.single_table import GaussianCopulaSynthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
Now the synthesizer is ready to use!
2.2 Generating Synthetic Data
Use the sample function and pass in any number of rows to synthesize.
synthetic_data = synthesizer.sample(num_rows=500)
synthetic_data.head()
| | guest_email | has_rewards | room_type | amenities_fee | checkin_date | checkout_date | room_rate | billing_address | credit_card_number |
|---|---|---|---|---|---|---|---|---|---|
| 0 | dsullivan@example.net | False | BASIC | 0.29 | 27 Mar 2020 | 09 Mar 2020 | 135.15 | 90469 Karla Knolls Apt. 781\nSusanberg, CA 70033 | 5161033759518983 |
| 1 | steven59@example.org | False | DELUXE | 8.15 | 07 Sep 2020 | 25 Jun 2020 | 183.24 | 6108 Carla Ports Apt. 116\nPort Evan, MI 71694 | 4133047413145475690 |
| 2 | brandon15@example.net | False | BASIC | 11.65 | 22 Mar 2020 | 01 Apr 2020 | 163.57 | 86709 Jeremy Manors Apt. 786\nPort Garychester... | 4977328103788 |
| 3 | humphreyjennifer@example.net | False | BASIC | 48.12 | 04 Jun 2020 | 14 May 2020 | 127.75 | 8906 Bobby Trail\nEast Sandra, NY 43986 | 3524946844839485 |
| 4 | joshuabrown@example.net | False | DELUXE | 11.07 | 08 Jan 2020 | 13 Jan 2020 | 180.12 | 732 Dennis Lane\nPort Nicholasstad, DE 49786 | 4446905799576890978 |
The synthesizer is generating synthetic guests in the same format as the original data.
2.3 Evaluating Real vs. Synthetic Data
SDV has built-in functions for evaluating the synthetic data and getting more insight.
As a first step, we can run a diagnostic to ensure that the data is valid. SDV's diagnostic performs some basic checks such as:
- All primary keys must be unique
- Continuous values must adhere to the min/max of the real data
- Discrete columns (non-PII) must have the same categories as the real data
- Etc.
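These checks are simple enough to sketch in plain pandas. The helper below is a toy illustration of the same ideas, not SDV's implementation; the function and column names are ours:

```python
import pandas as pd

def basic_diagnostic(real, synthetic, primary_key, numeric_cols, discrete_cols):
    """Toy versions of the validity checks listed above (not SDV's implementation)."""
    return {
        # All primary keys must be unique.
        'unique_keys': synthetic[primary_key].is_unique,
        # Continuous values must stay within the real data's min/max.
        'in_range': all(
            synthetic[c].dropna().between(real[c].min(), real[c].max()).all()
            for c in numeric_cols
        ),
        # Discrete columns must only use categories seen in the real data.
        'known_categories': all(
            set(synthetic[c].dropna()).issubset(set(real[c].dropna()))
            for c in discrete_cols
        ),
    }

real = pd.DataFrame({'id': [1, 2, 3], 'rate': [100.0, 150.0, 200.0],
                     'room': ['BASIC', 'SUITE', 'BASIC']})
fake = pd.DataFrame({'id': [10, 11, 12], 'rate': [120.0, 180.0, 199.0],
                     'room': ['SUITE', 'BASIC', 'BASIC']})
print(basic_diagnostic(real, fake, 'id', ['rate'], ['room']))
# {'unique_keys': True, 'in_range': True, 'known_categories': True}
```

A full diagnostic also verifies the data structure (column names and types), which is what the second half of SDV's report covers.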
from sdv.evaluation.single_table import run_diagnostic
diagnostic = run_diagnostic(
real_data=real_data,
synthetic_data=synthetic_data,
metadata=metadata
)
Generating report ...
(1/2) Evaluating Data Validity: |██████████| 9/9
Data Validity Score: 100.0%
(2/2) Evaluating Data Structure: |██████████| 1/1
Data Structure Score: 100.0%
Overall Score (Average): 100.0%
The score is 100%, indicating that the data is fully valid.
We can also measure the data quality or the statistical similarity between the real and synthetic data. This value may vary anywhere from 0 to 100%.
from sdv.evaluation.single_table import evaluate_quality
quality_report = evaluate_quality(
real_data,
synthetic_data,
metadata
)
Generating report ...
(1/2) Evaluating Column Shapes: |██████████| 9/9
Column Shapes Score: 90.06%
(2/2) Evaluating Column Pair Trends: |██████████| 36/36
Column Pair Trends Score: 83.47%
Overall Score (Average): 86.76%
According to the report, the synthetic data captures the statistical properties of the real data reasonably well.
We can also get more details from the report. For example, the Column Shapes sub-score is about 90%. Which columns had the highest vs. the lowest scores?
quality_report.get_details('Column Shapes')
| | Column | Metric | Score |
|---|---|---|---|
| 0 | has_rewards | TVComplement | 0.982000 |
| 1 | room_type | TVComplement | 0.984000 |
| 2 | amenities_fee | KSComplement | 0.764778 |
| 3 | checkin_date | KSComplement | 0.962000 |
| 4 | checkout_date | KSComplement | 0.968750 |
| 5 | room_rate | KSComplement | 0.742000 |
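The metric names in this table map to simple statistical ideas: KSComplement is 1 minus the Kolmogorov-Smirnov statistic (used for the numerical and datetime columns) and TVComplement is 1 minus the Total Variation distance between category frequencies (used for the discrete columns). A minimal sketch of both, for intuition rather than as SDV's exact code:

```python
import numpy as np
import pandas as pd
from scipy import stats

def ks_complement(real_col, synthetic_col):
    # 1 - KS statistic: 1.0 means the two empirical CDFs match exactly.
    statistic, _ = stats.ks_2samp(real_col, synthetic_col)
    return 1.0 - statistic

def tv_complement(real_col, synthetic_col):
    # 1 - Total Variation distance between the two category frequency tables.
    f_real = pd.Series(real_col).value_counts(normalize=True)
    f_synth = pd.Series(synthetic_col).value_counts(normalize=True)
    categories = f_real.index.union(f_synth.index)
    tv = 0.5 * sum(abs(f_real.get(c, 0.0) - f_synth.get(c, 0.0)) for c in categories)
    return 1.0 - tv

rng = np.random.default_rng(0)
# Two samples from the same distribution score close to 1.
print(round(ks_complement(rng.normal(150, 30, 1000), rng.normal(150, 30, 1000)), 2))
# Slightly different category frequencies score slightly below 1.
print(round(tv_complement(['BASIC'] * 60 + ['SUITE'] * 40,
                          ['BASIC'] * 55 + ['SUITE'] * 45), 2))  # 0.95
```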
2.4 Visualizing the Data
For more insights, we can visualize the real vs. synthetic data.
Let's perform a 1D visualization comparing a column of the real data to the synthetic data.
from sdv.evaluation.single_table import get_column_plot
fig = get_column_plot(
real_data=real_data,
synthetic_data=synthetic_data,
column_name='room_rate',
metadata=metadata
)
fig.show()
We can also visualize in 2D, comparing the correlations of a pair of columns.
from sdv.evaluation.single_table import get_column_pair_plot
fig = get_column_pair_plot(
real_data=real_data,
synthetic_data=synthetic_data,
column_names=['room_rate', 'room_type'],
metadata=metadata
)
fig.show()
2.5 Anonymization
In the original dataset, we had some sensitive columns such as the guest's email, billing address and credit card number. In the synthetic data, these columns are fully anonymized -- they contain entirely fake values that follow the format of the original.
PII columns are not included in the quality report, but we can inspect them to see that they are different.
sensitive_column_names = ['guest_email', 'billing_address', 'credit_card_number']
real_data[sensitive_column_names].head(3)
| | guest_email | billing_address | credit_card_number |
|---|---|---|---|
| 0 | michaelsanders@shaw.net | 49380 Rivers Street\nSpencerville, AK 68265 | 4075084747483975747 |
| 1 | randy49@brown.biz | 88394 Boyle Meadows\nConleyberg, TN 22063 | 180072822063468 |
| 2 | webermelissa@neal.com | 0323 Lisa Station Apt. 208\nPort Thomas, LA 82585 | 38983476971380 |
synthetic_data[sensitive_column_names].head(3)
| | guest_email | billing_address | credit_card_number |
|---|---|---|---|
| 0 | dsullivan@example.net | 90469 Karla Knolls Apt. 781\nSusanberg, CA 70033 | 5161033759518983 |
| 1 | steven59@example.org | 6108 Carla Ports Apt. 116\nPort Evan, MI 71694 | 4133047413145475690 |
| 2 | brandon15@example.net | 86709 Jeremy Manors Apt. 786\nPort Garychester... | 4977328103788 |
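We can turn that spot check into a quick programmatic one. The helper below is our own sketch (not an SDV API): it measures the fraction of synthetic values that also appear in the real column, which should be zero for properly anonymized columns:

```python
import pandas as pd

def pii_overlap(real_col, synthetic_col):
    """Fraction of synthetic values that also appear in the real column."""
    return pd.Series(synthetic_col).isin(set(real_col)).mean()

real_emails = ['michaelsanders@shaw.net', 'randy49@brown.biz', 'webermelissa@neal.com']
synth_emails = ['dsullivan@example.net', 'steven59@example.org', 'brandon15@example.net']
print(pii_overlap(real_emails, synth_emails))  # 0.0
```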
2.6 Saving and Loading
We can save the synthesizer to share with others and sample more synthetic data in the future.
synthesizer.save('my_synthesizer.pkl')
synthesizer = GaussianCopulaSynthesizer.load('my_synthesizer.pkl')
synthesizer.sample(num_rows=3)
| | guest_email | has_rewards | room_type | amenities_fee | checkin_date | checkout_date | room_rate | billing_address | credit_card_number |
|---|---|---|---|---|---|---|---|---|---|
| 0 | cburns@example.com | False | SUITE | NaN | 14 Jun 2020 | 23 Jun 2020 | 298.22 | Unit 7909 Box 7283\nDPO AA 48458 | 30158284644887 |
| 1 | jordanpamela@example.org | False | BASIC | 0.04 | 26 Oct 2020 | 09 Nov 2020 | 244.69 | 0837 Stewart Pike Suite 951\nPort Zachary, IA ... | 4375705762878737 |
| 2 | amanda33@example.org | False | DELUXE | 35.96 | 10 Apr 2020 | 05 Apr 2020 | 176.64 | 24644 Pollard Burgs Apt. 192\nSarahview, PA 78396 | 180027361090256 |
3. Gaussian Copula Customization
A key benefit of using the Gaussian Copula is customization and transparency. This synthesizer estimates the shape of every column using a 1D distribution. We can set these shapes ourselves.
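The general recipe behind a Gaussian copula can be illustrated with a small self-contained sketch (ours, not SDV's internals): map each column to normal scores through its empirical CDF, learn the correlations in that normal space, sample correlated normals, and map them back through each column's quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Two toy columns with different shapes and some correlation.
base = rng.normal(150, 30, 1000)
real = np.column_stack([base, rng.exponential(20, 1000) + 0.3 * base])

def to_normal_scores(col):
    # Probability integral transform: ranks -> uniforms -> standard normals.
    ranks = stats.rankdata(col) / (len(col) + 1)
    return stats.norm.ppf(ranks)

# Learn the correlations in the normal space.
z = np.column_stack([to_normal_scores(real[:, i]) for i in range(real.shape[1])])
corr = np.corrcoef(z, rowvar=False)

# Sample correlated normals, then invert back through each column's quantiles.
z_new = rng.multivariate_normal(np.zeros(2), corr, size=500)
u_new = stats.norm.cdf(z_new)
synthetic = np.column_stack([
    np.quantile(real[:, i], u_new[:, i]) for i in range(real.shape[1])
])

# The synthetic columns keep roughly the same pairwise correlation as the real ones.
print(np.corrcoef(real, rowvar=False)[0, 1].round(2),
      np.corrcoef(synthetic, rowvar=False)[0, 1].round(2))
```

This sketch uses empirical quantiles for the marginals; SDV instead fits a named parametric distribution per column, which is what the default_distribution and numerical_distributions arguments below control.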
custom_synthesizer = GaussianCopulaSynthesizer(
metadata,
default_distribution='truncnorm',
numerical_distributions={
'checkin_date': 'uniform',
'checkout_date': 'uniform',
'room_rate': 'gaussian_kde'
}
)
custom_synthesizer.fit(real_data)
After training, we can inspect the distributions. In this case, the synthesizer returns the parameters it learned using the truncnorm distribution.
More information about the truncnorm distribution is available in the scipy documentation.
learned_distributions = custom_synthesizer.get_learned_distributions()
learned_distributions['has_rewards']
{'distribution': 'truncnorm',
'learned_parameters': {'a': np.float64(-0.5413473371285197),
'b': np.float64(0.46476035354052103),
'loc': np.float64(0.5389123373114356),
'scale': np.float64(0.9878956258627398)}}
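To interpret these numbers: in scipy's parameterization, a and b are the truncation bounds expressed in standard deviations of the underlying normal (loc, scale). Plugging the learned values in shows the support is roughly [0, 1], which matches a boolean column encoded as 0/1:

```python
from scipy import stats

# Parameters copied from the learned_distributions output above.
a, b = -0.5413473371285197, 0.46476035354052103
loc, scale = 0.5389123373114356, 0.9878956258627398
dist = stats.truncnorm(a, b, loc=loc, scale=scale)

# Support in data units is (loc + a*scale, loc + b*scale) -- roughly (0, 1).
print(dist.support())
```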
By setting these distributions strategically, you can make tradeoffs in the quality of your synthetic data.
synthetic_data_customized = custom_synthesizer.sample(num_rows=500)
quality_report = evaluate_quality(
real_data,
synthetic_data_customized,
metadata
)
Generating report ...
| | 0/9 [00:00<?, ?it/s]|
(1/2) Evaluating Column Shapes: | | 0/9 [00:00<?, ?it/s]|
(1/2) Evaluating Column Shapes: |ββββββββββ| 9/9 [00:00<00:00, 1308.27it/s]|
Column Shapes Score: 93.49%
| | 0/36 [00:00<?, ?it/s]|
(2/2) Evaluating Column Pair Trends: | | 0/36 [00:00<?, ?it/s]|
(2/2) Evaluating Column Pair Trends: |ββββββββββ| 36/36 [00:00<00:00, 625.19it/s]|
Column Pair Trends Score: 88.91%
Overall Score (Average): 91.2%
And we can verify this using the visualization functions.
fig = get_column_plot(
real_data=real_data,
synthetic_data=synthetic_data_customized,
column_name='room_rate',
metadata=metadata
)
fig.show()
4. Conditional Sampling
Another benefit of using the Gaussian Copula is the ability to sample conditionally in an efficient way. This allows us to simulate hypothetical scenarios.
Let's start by creating a scenario where every hotel guest is staying in a SUITE (half with rewards and half without).
from sdv.sampling import Condition
suite_guests_with_rewards = Condition(
num_rows=250,
column_values={'room_type': 'SUITE', 'has_rewards': True}
)
suite_guests_without_rewards = Condition(
num_rows=250,
column_values={'room_type': 'SUITE', 'has_rewards': False}
)
Now we can simulate this scenario efficiently using our trained synthesizer.
simulated_synthetic_data = custom_synthesizer.sample_from_conditions(conditions=[
suite_guests_with_rewards,
suite_guests_without_rewards
])
Sampling conditions: 100%|██████████| 500/500
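Conditional sampling is cheap for this model because, once the columns live in the normal space, fixing some of them gives a closed-form conditional normal for the rest, with no rejection sampling needed. A small sketch of that identity (illustrative, not SDV's code):

```python
import numpy as np

def conditional_normal(mu, cov, fixed_idx, fixed_vals):
    """Mean/covariance of the remaining dimensions given the fixed ones."""
    free_idx = [i for i in range(len(mu)) if i not in fixed_idx]
    mu_f, mu_x = mu[free_idx], mu[fixed_idx]
    s_ff = cov[np.ix_(free_idx, free_idx)]
    s_fx = cov[np.ix_(free_idx, fixed_idx)]
    s_xx = cov[np.ix_(fixed_idx, fixed_idx)]
    w = s_fx @ np.linalg.inv(s_xx)
    return mu_f + w @ (fixed_vals - mu_x), s_ff - w @ s_fx.T

# Fixing the second of two correlated standard normals at 1.0:
mu = np.zeros(2)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
cond_mu, cond_cov = conditional_normal(mu, cov, [1], np.array([1.0]))
print(cond_mu, cond_cov)  # conditional mean is 0.8, conditional variance is 0.36
```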
We can verify this by plotting the data.
fig = get_column_plot(
real_data=real_data,
synthetic_data=simulated_synthetic_data,
column_name='room_type',
metadata=metadata
)
fig.update_layout(
title='Using synthetic data to simulate room_type scenario'
)
fig.show()
5. What's Next?
For more information about the Gaussian Copula Synthesizer, visit the documentation.