Synthesize a Table (Gaussian Copula)
In this notebook, we'll use the SDV to create synthetic data for a single table and evaluate it. The SDV uses machine learning to learn patterns from real data and emulates them when creating synthetic data.
We'll use the Gaussian Copula algorithm to do this. Gaussian Copula is a fast, customizable and transparent way to synthesize data.
import warnings
warnings.filterwarnings('ignore')
1. Loading the demo data
For this demo, we'll use a fake dataset that describes some fictional guests staying at a hotel.
from sdv.datasets.demo import download_demo
real_data, metadata = download_demo(
modality='single_table',
dataset_name='fake_hotel_guests'
)
Details:
- The data is available as a single table.
- guest_email is a primary key that uniquely identifies every row.
- Other columns have a variety of data types, and some data may be missing.
real_data.head()
| | guest_email | has_rewards | room_type | amenities_fee | checkin_date | checkout_date | room_rate | billing_address | credit_card_number |
|---|---|---|---|---|---|---|---|---|---|
| 0 | michaelsanders@shaw.net | False | BASIC | 37.89 | 27 Dec 2020 | 29 Dec 2020 | 131.23 | 49380 Rivers Street\nSpencerville, AK 68265 | 4075084747483975747 |
| 1 | randy49@brown.biz | False | BASIC | 24.37 | 30 Dec 2020 | 02 Jan 2021 | 114.43 | 88394 Boyle Meadows\nConleyberg, TN 22063 | 180072822063468 |
| 2 | webermelissa@neal.com | True | DELUXE | 0.00 | 17 Sep 2020 | 18 Sep 2020 | 368.33 | 0323 Lisa Station Apt. 208\nPort Thomas, LA 82585 | 38983476971380 |
| 3 | gsims@terry.com | False | BASIC | NaN | 28 Dec 2020 | 31 Dec 2020 | 115.61 | 77 Massachusetts Ave\nCambridge, MA 02139 | 4969551998845740 |
| 4 | misty33@smith.biz | False | BASIC | 16.45 | 05 Apr 2020 | NaN | 122.41 | 1234 Corporate Drive\nBoston, MA 02116 | 3558512986488983 |
The demo also includes metadata, a description of the dataset. It includes the primary keys as well as the data types for each column (called "sdtypes").
metadata.visualize()
2. Basic Usage
2.1 Creating a Synthesizer
An SDV synthesizer is an object that you can use to create synthetic data. It learns patterns from the real data and replicates them to generate synthetic data.
from sdv.single_table import GaussianCopulaSynthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
Now the synthesizer is ready to use!
2.2 Generating Synthetic Data
Use the sample function and pass in any number of rows to synthesize.
synthetic_data = synthesizer.sample(num_rows=500)
synthetic_data.head()
| | guest_email | has_rewards | room_type | amenities_fee | checkin_date | checkout_date | room_rate | billing_address | credit_card_number |
|---|---|---|---|---|---|---|---|---|---|
| 0 | dsullivan@example.net | False | BASIC | 0.29 | 27 Mar 2020 | 09 Mar 2020 | 135.15 | 90469 Karla Knolls Apt. 781\nSusanberg, CA 70033 | 5161033759518983 |
| 1 | steven59@example.org | False | DELUXE | 8.15 | 07 Sep 2020 | 25 Jun 2020 | 183.24 | 6108 Carla Ports Apt. 116\nPort Evan, MI 71694 | 4133047413145475690 |
| 2 | brandon15@example.net | False | BASIC | 11.65 | 22 Mar 2020 | 01 Apr 2020 | 163.57 | 86709 Jeremy Manors Apt. 786\nPort Garychester... | 4977328103788 |
| 3 | humphreyjennifer@example.net | False | BASIC | 48.12 | 04 Jun 2020 | 14 May 2020 | 127.75 | 8906 Bobby Trail\nEast Sandra, NY 43986 | 3524946844839485 |
| 4 | joshuabrown@example.net | False | DELUXE | 11.07 | 08 Jan 2020 | 13 Jan 2020 | 180.12 | 732 Dennis Lane\nPort Nicholasstad, DE 49786 | 4446905799576890978 |
The synthesizer is generating synthetic guests in the same format as the original data.
2.3 Evaluating Real vs. Synthetic Data
SDV has built-in functions for evaluating the synthetic data and getting more insight.
As a first step, we can run a diagnostic to ensure that the data is valid. SDV's diagnostic performs some basic checks such as:
- All primary keys must be unique
- Continuous values must adhere to the min/max of the real data
- Discrete columns (non-PII) must have the same categories as the real data
- Etc.
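These checks are simple enough to sketch in plain pandas. The helper below is a toy illustration of the same ideas, not SDV's implementation; the function and column names are ours:

```python
import pandas as pd

def basic_diagnostic(real, synthetic, primary_key, numeric_cols, discrete_cols):
    """Toy versions of the validity checks listed above (not SDV's implementation)."""
    return {
        # All primary keys must be unique.
        'unique_keys': synthetic[primary_key].is_unique,
        # Continuous values must stay within the real data's min/max.
        'in_range': all(
            synthetic[c].dropna().between(real[c].min(), real[c].max()).all()
            for c in numeric_cols
        ),
        # Discrete columns must only use categories seen in the real data.
        'known_categories': all(
            set(synthetic[c].dropna()).issubset(set(real[c].dropna()))
            for c in discrete_cols
        ),
    }

real = pd.DataFrame({'id': [1, 2, 3], 'rate': [100.0, 150.0, 200.0],
                     'room': ['BASIC', 'SUITE', 'BASIC']})
fake = pd.DataFrame({'id': [10, 11, 12], 'rate': [120.0, 180.0, 199.0],
                     'room': ['SUITE', 'BASIC', 'BASIC']})
print(basic_diagnostic(real, fake, 'id', ['rate'], ['room']))
# {'unique_keys': True, 'in_range': True, 'known_categories': True}
```

A full diagnostic also verifies the data structure (column names and types), which is what the second half of SDV's report covers.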
from sdv.evaluation.single_table import run_diagnostic
diagnostic = run_diagnostic(
real_data=real_data,
synthetic_data=synthetic_data,
metadata=metadata
)
Generating report ...
(1/2) Evaluating Data Validity: |██████████| 9/9
Data Validity Score: 100.0%
(2/2) Evaluating Data Structure: |██████████| 1/1
Data Structure Score: 100.0%
Overall Score (Average): 100.0%
The score is 100%, indicating that the data is fully valid.
We can also measure the data quality or the statistical similarity between the real and synthetic data. This value may vary anywhere from 0 to 100%.
from sdv.evaluation.single_table import evaluate_quality
quality_report = evaluate_quality(
real_data,
synthetic_data,
metadata
)
Generating report ...
(1/2) Evaluating Column Shapes: |██████████| 9/9
Column Shapes Score: 90.06%
(2/2) Evaluating Column Pair Trends: |██████████| 36/36
Column Pair Trends Score: 83.47%
Overall Score (Average): 86.76%
According to the report, the synthetic data captures the statistical properties of the real data reasonably well.
We can also get more details from the report. For example, the Column Shapes sub-score is about 90%. Which columns had the highest vs. the lowest scores?
quality_report.get_details('Column Shapes')
| | Column | Metric | Score |
|---|---|---|---|
| 0 | has_rewards | TVComplement | 0.982000 |
| 1 | room_type | TVComplement | 0.984000 |
| 2 | amenities_fee | KSComplement | 0.764778 |
| 3 | checkin_date | KSComplement | 0.962000 |
| 4 | checkout_date | KSComplement | 0.968750 |
| 5 | room_rate | KSComplement | 0.742000 |
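The metric names in this table map to simple statistical ideas: KSComplement is 1 minus the Kolmogorov-Smirnov statistic (used for the numerical and datetime columns) and TVComplement is 1 minus the Total Variation distance between category frequencies (used for the discrete columns). A minimal sketch of both, for intuition rather than as SDV's exact code:

```python
import numpy as np
import pandas as pd
from scipy import stats

def ks_complement(real_col, synthetic_col):
    # 1 - KS statistic: 1.0 means the two empirical CDFs match exactly.
    statistic, _ = stats.ks_2samp(real_col, synthetic_col)
    return 1.0 - statistic

def tv_complement(real_col, synthetic_col):
    # 1 - Total Variation distance between the two category frequency tables.
    f_real = pd.Series(real_col).value_counts(normalize=True)
    f_synth = pd.Series(synthetic_col).value_counts(normalize=True)
    categories = f_real.index.union(f_synth.index)
    tv = 0.5 * sum(abs(f_real.get(c, 0.0) - f_synth.get(c, 0.0)) for c in categories)
    return 1.0 - tv

rng = np.random.default_rng(0)
# Two samples from the same distribution score close to 1.
print(round(ks_complement(rng.normal(150, 30, 1000), rng.normal(150, 30, 1000)), 2))
# Slightly different category frequencies score slightly below 1.
print(round(tv_complement(['BASIC'] * 60 + ['SUITE'] * 40,
                          ['BASIC'] * 55 + ['SUITE'] * 45), 2))  # 0.95
```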
2.4 Visualizing the Data
For more insights, we can visualize the real vs. synthetic data.
Let's perform a 1D visualization comparing a column of the real data to the synthetic data.
from sdv.evaluation.single_table import get_column_plot
fig = get_column_plot(
real_data=real_data,
synthetic_data=synthetic_data,
column_name='room_rate',
metadata=metadata
)
fig.show()
We can also visualize in 2D, comparing the correlations of a pair of columns.
from sdv.evaluation.single_table import get_column_pair_plot
fig = get_column_pair_plot(
real_data=real_data,
synthetic_data=synthetic_data,
column_names=['room_rate', 'room_type'],
metadata=metadata
)
fig.show()
2.5 Anonymization
In the original dataset, we had some sensitive columns such as the guest's email, billing address and credit card number. In the synthetic data, these columns are fully anonymized -- they contain entirely fake values that follow the format of the original.
PII columns are not included in the quality report, but we can inspect them to see that they are different.
sensitive_column_names = ['guest_email', 'billing_address', 'credit_card_number']
real_data[sensitive_column_names].head(3)
| | guest_email | billing_address | credit_card_number |
|---|---|---|---|
| 0 | michaelsanders@shaw.net | 49380 Rivers Street\nSpencerville, AK 68265 | 4075084747483975747 |
| 1 | randy49@brown.biz | 88394 Boyle Meadows\nConleyberg, TN 22063 | 180072822063468 |
| 2 | webermelissa@neal.com | 0323 Lisa Station Apt. 208\nPort Thomas, LA 82585 | 38983476971380 |
synthetic_data[sensitive_column_names].head(3)
| | guest_email | billing_address | credit_card_number |
|---|---|---|---|
| 0 | dsullivan@example.net | 90469 Karla Knolls Apt. 781\nSusanberg, CA 70033 | 5161033759518983 |
| 1 | steven59@example.org | 6108 Carla Ports Apt. 116\nPort Evan, MI 71694 | 4133047413145475690 |
| 2 | brandon15@example.net | 86709 Jeremy Manors Apt. 786\nPort Garychester... | 4977328103788 |
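We can turn that spot check into a quick programmatic one. The helper below is our own sketch (not an SDV API): it measures the fraction of synthetic values that also appear in the real column, which should be zero for properly anonymized columns:

```python
import pandas as pd

def pii_overlap(real_col, synthetic_col):
    """Fraction of synthetic values that also appear in the real column."""
    return pd.Series(synthetic_col).isin(set(real_col)).mean()

real_emails = ['michaelsanders@shaw.net', 'randy49@brown.biz', 'webermelissa@neal.com']
synth_emails = ['dsullivan@example.net', 'steven59@example.org', 'brandon15@example.net']
print(pii_overlap(real_emails, synth_emails))  # 0.0
```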
2.6 Saving and Loading
We can save the synthesizer to share with others and sample more synthetic data in the future.
synthesizer.save('my_synthesizer.pkl')
synthesizer = GaussianCopulaSynthesizer.load('my_synthesizer.pkl')
synthesizer.sample(num_rows=3)
| | guest_email | has_rewards | room_type | amenities_fee | checkin_date | checkout_date | room_rate | billing_address | credit_card_number |
|---|---|---|---|---|---|---|---|---|---|
| 0 | cburns@example.com | False | SUITE | NaN | 14 Jun 2020 | 23 Jun 2020 | 298.22 | Unit 7909 Box 7283\nDPO AA 48458 | 30158284644887 |
| 1 | jordanpamela@example.org | False | BASIC | 0.04 | 26 Oct 2020 | 09 Nov 2020 | 244.69 | 0837 Stewart Pike Suite 951\nPort Zachary, IA ... | 4375705762878737 |
| 2 | amanda33@example.org | False | DELUXE | 35.96 | 10 Apr 2020 | 05 Apr 2020 | 176.64 | 24644 Pollard Burgs Apt. 192\nSarahview, PA 78396 | 180027361090256 |
3. Gaussian Copula Customization
A key benefit of using the Gaussian Copula is customization and transparency. This synthesizer estimates the shape of every column using a 1D distribution. We can set these shapes ourselves.
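The general recipe behind a Gaussian copula can be illustrated with a small self-contained sketch (ours, not SDV's internals): map each column to normal scores through its empirical CDF, learn the correlations in that normal space, sample correlated normals, and map them back through each column's quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Two toy columns with different shapes and some correlation.
base = rng.normal(150, 30, 1000)
real = np.column_stack([base, rng.exponential(20, 1000) + 0.3 * base])

def to_normal_scores(col):
    # Probability integral transform: ranks -> uniforms -> standard normals.
    ranks = stats.rankdata(col) / (len(col) + 1)
    return stats.norm.ppf(ranks)

# Learn the correlations in the normal space.
z = np.column_stack([to_normal_scores(real[:, i]) for i in range(real.shape[1])])
corr = np.corrcoef(z, rowvar=False)

# Sample correlated normals, then invert back through each column's quantiles.
z_new = rng.multivariate_normal(np.zeros(2), corr, size=500)
u_new = stats.norm.cdf(z_new)
synthetic = np.column_stack([
    np.quantile(real[:, i], u_new[:, i]) for i in range(real.shape[1])
])

# The synthetic columns keep roughly the same pairwise correlation as the real ones.
print(np.corrcoef(real, rowvar=False)[0, 1].round(2),
      np.corrcoef(synthetic, rowvar=False)[0, 1].round(2))
```

This sketch uses empirical quantiles for the marginals; SDV instead fits a named parametric distribution per column, which is what the default_distribution and numerical_distributions arguments below control.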
custom_synthesizer = GaussianCopulaSynthesizer(
metadata,
default_distribution='truncnorm',
numerical_distributions={
'checkin_date': 'uniform',
'checkout_date': 'uniform',
'room_rate': 'gaussian_kde'
}
)
custom_synthesizer.fit(real_data)
After training, we can inspect the distributions. In this case, the synthesizer returns the parameters it learned using the truncnorm distribution.
More information about the truncnorm distribution is available in the scipy documentation.
learned_distributions = custom_synthesizer.get_learned_distributions()
learned_distributions['has_rewards']
{'distribution': 'truncnorm',
'learned_parameters': {'a': np.float64(-0.5413473371285197),
'b': np.float64(0.46476035354052103),
'loc': np.float64(0.5389123373114356),
'scale': np.float64(0.9878956258627398)}}
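To interpret these numbers: in scipy's parameterization, a and b are the truncation bounds expressed in standard deviations of the underlying normal (loc, scale). Plugging the learned values in shows the support is roughly [0, 1], which matches a boolean column encoded as 0/1:

```python
from scipy import stats

# Parameters copied from the learned_distributions output above.
a, b = -0.5413473371285197, 0.46476035354052103
loc, scale = 0.5389123373114356, 0.9878956258627398
dist = stats.truncnorm(a, b, loc=loc, scale=scale)

# Support in data units is (loc + a*scale, loc + b*scale) -- roughly (0, 1).
print(dist.support())
```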
By setting these distributions strategically, you can make tradeoffs in the quality of your synthetic data.
synthetic_data_customized = custom_synthesizer.sample(num_rows=500)
quality_report = evaluate_quality(
real_data,
synthetic_data_customized,
metadata
)
Generating report ...
| | 0/9 [00:00<?, ?it/s]|
(1/2) Evaluating Column Shapes: | | 0/9 [00:00<?, ?it/s]|
(1/2) Evaluating Column Shapes: |ββββββββββ| 9/9 [00:00<00:00, 1308.27it/s]|
Column Shapes Score: 93.49%
| | 0/36 [00:00<?, ?it/s]|
(2/2) Evaluating Column Pair Trends: | | 0/36 [00:00<?, ?it/s]|
(2/2) Evaluating Column Pair Trends: |ββββββββββ| 36/36 [00:00<00:00, 625.19it/s]|
Column Pair Trends Score: 88.91%
Overall Score (Average): 91.2%
And we can verify this using the visualization functions.
fig = get_column_plot(
real_data=real_data,
synthetic_data=synthetic_data_customized,
column_name='room_rate',
metadata=metadata
)
fig.show()
4. Conditional Sampling
Another benefit of using the Gaussian Copula is the ability to sample conditionally in an efficient way. This allows us to simulate hypothetical scenarios.
Let's start by creating a scenario where every hotel guest is staying in a SUITE (half with rewards and half without).
from sdv.sampling import Condition
suite_guests_with_rewards = Condition(
num_rows=250,
column_values={'room_type': 'SUITE', 'has_rewards': True}
)
suite_guests_without_rewards = Condition(
num_rows=250,
column_values={'room_type': 'SUITE', 'has_rewards': False}
)
Now we can simulate this scenario efficiently using our trained synthesizer.
simulated_synthetic_data = custom_synthesizer.sample_from_conditions(conditions=[
suite_guests_with_rewards,
suite_guests_without_rewards
])
Sampling conditions: 100%|██████████| 500/500
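Conditional sampling is cheap for this model because, once the columns live in the normal space, fixing some of them gives a closed-form conditional normal for the rest, with no rejection sampling needed. A small sketch of that identity (illustrative, not SDV's code):

```python
import numpy as np

def conditional_normal(mu, cov, fixed_idx, fixed_vals):
    """Mean/covariance of the remaining dimensions given the fixed ones."""
    free_idx = [i for i in range(len(mu)) if i not in fixed_idx]
    mu_f, mu_x = mu[free_idx], mu[fixed_idx]
    s_ff = cov[np.ix_(free_idx, free_idx)]
    s_fx = cov[np.ix_(free_idx, fixed_idx)]
    s_xx = cov[np.ix_(fixed_idx, fixed_idx)]
    w = s_fx @ np.linalg.inv(s_xx)
    return mu_f + w @ (fixed_vals - mu_x), s_ff - w @ s_fx.T

# Fixing the second of two correlated standard normals at 1.0:
mu = np.zeros(2)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
cond_mu, cond_cov = conditional_normal(mu, cov, [1], np.array([1.0]))
print(cond_mu, cond_cov)  # conditional mean is 0.8, conditional variance is 0.36
```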
We can verify this by plotting the data.
fig = get_column_plot(
real_data=real_data,
synthetic_data=simulated_synthetic_data,
column_name='room_type',
metadata=metadata
)
fig.update_layout(
title='Using synthetic data to simulate room_type scenario'
)
fig.show()
5. What's Next?
For more information about the Gaussian Copula Synthesizer, visit the documentation.