Synthetic Data
Data & Tracking
Also: AI-Generated Data · Simulated Data
Quick definition
Synthetic data is artificially generated information that statistically resembles real data without containing any actual user records. It is produced by algorithms or AI models trained on real datasets. Marketers use it to train predictive models, test campaigns, and fill gaps where real data is too sparse or too sensitive to use.
How it varies across Australia
Adoption of synthetic data in Australian marketing sits well behind the US and UK, where enterprise brands have been using it for model training and privacy compliance for several years. Uptake in Australia is accelerating as Privacy Act reform puts more pressure on how first-party data is stored and used.
The main types in marketing use
Fully synthetic: generated entirely from a statistical model, with no real records used as seeds. Highest privacy protection.
Partially synthetic: real records in which sensitive fields are replaced with generated values. Balances fidelity and privacy.
Augmented: real data extended with generated rows to increase volume for model training. Common for sparse cohorts.
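To make the middle option concrete, here is a minimal sketch of replacing sensitive fields in real records with generated values. The field names, the sample rows and the `partially_synthesise` helper are all hypothetical, chosen for illustration; a production tool would generate realistic replacements rather than simple placeholders.

```python
import random

# Hypothetical CRM rows; field names and values are illustrative assumptions.
real_rows = [
    {"name": "Alice Nguyen", "email": "alice@example.com", "postcode": "3000", "monthly_spend": 82.0},
    {"name": "Ben Carter", "email": "ben@example.com", "postcode": "2000", "monthly_spend": 45.5},
    {"name": "Chloe Singh", "email": "chloe@example.com", "postcode": "4000", "monthly_spend": 120.0},
]


def partially_synthesise(rows, seed=0):
    """Replace the name and email fields with generated placeholders; keep the rest."""
    rng = random.Random(seed)
    output = []
    for row in rows:
        token = f"{rng.randrange(10**8):08d}"
        new_row = dict(row)  # copy so the real rows are left untouched
        new_row["name"] = f"Customer {token}"
        new_row["email"] = f"user{token}@example.invalid"
        output.append(new_row)
    return output


hybrid_rows = partially_synthesise(real_rows)
```

The analytical fields (postcode, spend) pass through unchanged, which is where the fidelity comes from. It is also why this option protects less than fully generated data: combinations of untouched fields can still point to a real person.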
What it actually means
Synthetic data generation usually starts with a real dataset. A model learns its statistical shape, then generates new records with similar distributions, correlations and edge cases without copying any original row.
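As a rough illustration of "learn the shape, then generate", the sketch below fits the means, spreads and one correlation of a tiny two-column dataset and samples new rows from a bivariate normal distribution. The sample records and the `fit` and `generate` helpers are assumptions made for this example. Commercial tools use far richer models, so treat this as a toy, not as how any particular product works.

```python
import math
import random

# Hypothetical "real" records: (monthly_spend, sessions_per_month).
real = [(20, 2), (35, 3), (50, 5), (65, 4), (80, 7), (95, 8), (110, 9), (140, 12)]


def mean(values):
    return sum(values) / len(values)


def std(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def correlation(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (std(xs) * std(ys))


def fit(records):
    """Learn the statistical shape: two means, two spreads, one correlation."""
    xs = [r[0] for r in records]
    ys = [r[1] for r in records]
    return {"mx": mean(xs), "my": mean(ys), "sx": std(xs), "sy": std(ys),
            "r": correlation(xs, ys)}


def generate(model, n, seed=0):
    """Sample n new records from a bivariate normal with the fitted shape."""
    rng = random.Random(seed)
    r = model["r"]
    residual = math.sqrt(max(0.0, 1 - r * r))  # guard against rounding error
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (r * z1 + residual * z2)
        rows.append((x, y))
    return rows


model = fit(real)
synthetic = generate(model, 5000)
```

None of the 5,000 generated rows is drawn from the original eight, yet spend and sessions move together in roughly the same way. The toy also shows a limitation: a plain normal model can produce impossible values such as negative spend, which is one reason dedicated tools handle bounds and categorical fields explicitly.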
The marketing use cases divide roughly into two buckets. The first is privacy-safe model training. If you want to build a churn prediction model or a lookalike audience model but your real customer data is too thin, too sensitive, or subject to consent restrictions, synthetic data fills the gap. You train on data that behaves like your customers without using your customers.
The second bucket is testing and simulation. Synthetic data lets you stress-test a campaign personalisation engine, a CRM workflow, or a data pipeline before real user records flow through it. You catch attribution bugs, consent failures, and segmentation errors before they touch live data.
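A minimal sketch of that kind of pre-launch test, assuming a hypothetical contact schema and a hypothetical `eligible_for_email` segmentation rule: generate contacts that cover every consent state, including the awkward "never recorded" case, and confirm that only explicit consent makes it into the audience.

```python
import random

# Consent states the pipeline must handle; None means consent was never recorded.
CONSENT_STATES = ["granted", "denied", "withdrawn", None]


def make_synthetic_contacts(n, seed=0):
    """Build n made-up contacts, cycling through every consent state."""
    rng = random.Random(seed)
    contacts = []
    for i in range(n):
        contacts.append({
            "id": f"syn-{i:05d}",
            "email": f"user{i}@example.invalid",
            "email_consent": CONSENT_STATES[i % len(CONSENT_STATES)],
            "monthly_spend": round(rng.uniform(0, 300), 2),
        })
    return contacts


def eligible_for_email(contact):
    """Hypothetical segmentation rule under test: explicit consent only."""
    return contact.get("email_consent") == "granted"


contacts = make_synthetic_contacts(1000)
audience = [c for c in contacts if eligible_for_email(c)]
```

Because the contacts are generated, the expected audience size is known in advance (a quarter of the list here), so a miscounted segment or a leaked non-consenting contact shows up as a failed check before any live record is involved.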
The connection to AI-generated search and generative AI is direct. Large language models are themselves trained partly on synthetic data to fill gaps in real text corpora. Marketers building AI content tools, chatbots, or product recommendation engines face the same tradeoff: real data is better but not always available, legal, or sufficient in volume. Synthetic data is the bridge.
The main risk is inherited bias. Synthetic data is only as good as the real data it was generated from. If the original dataset was biased, the synthetic version inherits and sometimes amplifies that bias. Garbage in, garbage out applies twice over: once when the generator learns from the data, and again when a model is trained on its output.
Synthetic data is not fake data. It is data that reflects real patterns without exposing real people.
How it shows up
Synthetic data shows up in marketing in three common places.
First, in audience modelling. When a CRM segment is too small to train a reliable lookalike model, synthetic augmentation can expand it to a trainable size. The signal holds up only if the generator has captured the segment's real patterns.
Second, in privacy compliance workflows. Under the Australian Privacy Act and emerging consent frameworks, some teams use synthetic exports in place of real customer records when sharing data with agencies, partners, or testing environments.
Third, in generative AI tooling. If you are building or fine-tuning a model on your brand voice, product descriptions, or customer FAQs, synthetic data generation can create training examples the real corpus does not contain.
The Australian context
The Privacy Act 1988 and the proposed reforms before the Australian Parliament have made data minimisation a live compliance concern for most mid-market businesses. Synthetic data offers a practical path: you retain the analytical value of your customer data without storing or sharing real records in contexts where consent is ambiguous.
ACCC enforcement activity around data misuse has also sharpened board-level attention. Teams that can demonstrate their AI models were trained on synthetic rather than raw personal data are in a materially better position if a complaint is made.
The counter-consideration is that the Australian Privacy Act does not yet have formal guidance on synthetic data. Whether a synthetic dataset derived from personal information counts as personal information itself is still unsettled. Legal advice before deploying a synthetic data pipeline is worth the cost.
Common questions
Is synthetic data legal to use in Australia?
The Privacy Act 1988 does not specifically address synthetic data. Whether a synthetic dataset derived from personal information is itself personal information remains unsettled. Most legal advisers recommend treating it as personal information unless it can be demonstrated the original records cannot be re-identified from the output. Get advice before deploying.
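One basic sanity check before sharing a synthetic dataset is to confirm that no generated row simply copies a real one. The sketch below assumes hypothetical field names and an `exact_match_leaks` helper written for this example. Passing it is not evidence that re-identification is impossible, and it is not a legal test; near matches and rare outliers can still identify someone.

```python
def exact_match_leaks(real_rows, synthetic_rows, fields):
    """Return synthetic rows whose values on `fields` equal those of some real row."""
    real_keys = {tuple(row[f] for f in fields) for row in real_rows}
    return [row for row in synthetic_rows
            if tuple(row[f] for f in fields) in real_keys]


# Hypothetical records for illustration only.
real_rows = [
    {"postcode": "3000", "age": 34, "monthly_spend": 82.0},
    {"postcode": "2000", "age": 51, "monthly_spend": 45.5},
]
synthetic_rows = [
    {"postcode": "3000", "age": 36, "monthly_spend": 79.1},
    {"postcode": "2000", "age": 51, "monthly_spend": 45.5},  # copies a real row
]

leaks = exact_match_leaks(real_rows, synthetic_rows, ["postcode", "age", "monthly_spend"])
```

Here the second generated row is flagged because it reproduces a real customer exactly. A generator that memorises its training data can produce rows like this, which is part of why advisers are cautious.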
Can synthetic data replace A/B testing?
No. Synthetic data is useful for building and validating the systems you will test with, not for running the tests themselves. A/B testing requires real user behaviour to produce real signal. Synthetic data can help you design a better experiment but cannot substitute for running it.
How is synthetic data different from anonymised data?
Anonymisation removes identifying fields from real records. Synthetic data generation creates entirely new records that were never real. Synthetic data is generally considered the more privacy-protective option because no output row corresponds directly to a real person, though regulatory guidance on this distinction in Australia is still developing.
What tools do marketers use to generate synthetic data?
Common tools include Gretel, Mostly AI and Tonic for structured customer data, and Python libraries like SDV (Synthetic Data Vault) for more custom generation. Most of these require a data analyst or data scientist to operate. Off-the-shelf options suitable for non-technical marketers are still limited.
About New Rebellion
New Rebellion is a marketing intelligence consultancy. We build tools, score Australian businesses on how their marketing actually performs, and publish Debrief every day. This dictionary is part of how we work in the open.