Synthetic Data

Data & Tracking

Also: AI-Generated Data · Simulated Data

What it is: Artificially generated data that mirrors real data

Main use: Training AI without exposing real users

Watch for: Garbage in, garbage out

Marketing angle: Fills gaps real data cannot

Quick definition

Synthetic data is artificially generated information that statistically resembles real data without containing any actual user records. It is produced by algorithms or AI models trained on real datasets. Marketers use it to train predictive models, test campaigns, and fill gaps where real data is too sparse or too sensitive to use.

How it varies across Australia

Adoption of synthetic data in Australian marketing sits well behind the US and UK, where enterprise brands have been using it for model training and privacy compliance for several years. Uptake in Australia is accelerating as Privacy Act reform puts more pressure on how first-party data is stored and used.

The main types in marketing use

Fully synthetic

Every record is generated from a statistical model, with no real rows carried into the output. Offers the strongest privacy protection of the three types.

Partially synthetic

Real records mixed with generated replacements for sensitive fields. Balances fidelity and privacy.
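A minimal sketch of the partially synthetic approach, using only the Python standard library. The field names (email, postcode, spend), the sample records and the helper partially_synthesise are made up for illustration; a production pipeline would more likely rely on a dedicated tool.

```python
import random

def partially_synthesise(records, sensitive_fields, seed=0):
    """Copy each record, replacing sensitive fields with generated values.

    Hypothetical helper for illustration. Non-sensitive fields are kept as-is.
    """
    rng = random.Random(seed)
    # Pool of observed values per sensitive field, used for resampling.
    pools = {f: [r[f] for r in records] for f in sensitive_fields}
    output = []
    for i, record in enumerate(records):
        new = dict(record)  # shallow copy so the real record is untouched
        for field in sensitive_fields:
            if field == "email":
                # Direct identifiers get a placeholder that maps to no real person.
                new[field] = f"customer{i:04d}@example.invalid"
            else:
                # Other sensitive fields are resampled from the observed values,
                # which keeps the overall mix but weakens the link to the person.
                new[field] = rng.choice(pools[field])
        output.append(new)
    return output

# Made-up records standing in for a real CRM export.
real = [
    {"email": "ana@example.com", "postcode": "3000", "spend": 120.0},
    {"email": "ben@example.com", "postcode": "2000", "spend": 89.5},
    {"email": "cat@example.com", "postcode": "4000", "spend": 310.0},
]
mixed = partially_synthesise(real, ["email", "postcode"])
```

The spend column keeps its real values, which is where the fidelity comes from. It is also why partially synthetic data protects privacy less than fully synthetic data: the untouched fields can still help identify someone.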

Augmented synthetic

Real data extended with generated rows to increase volume for model training. Common in sparse cohorts.
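One simple way to sketch augmentation is to create new rows by interpolating between pairs of real rows, similar in spirit to SMOTE-style oversampling. This is an illustrative stand-in, not how any particular tool works, and the cohort values and the augment function are invented for the example.

```python
import random

def augment(rows, target_size, seed=0):
    """Extend numeric rows to target_size by interpolating between random pairs.

    Hypothetical helper. Each generated row lies on the line between two
    real rows, so it stays inside the range the real data already covers.
    """
    if len(rows) < 2:
        raise ValueError("need at least two real rows to interpolate")
    rng = random.Random(seed)
    output = [tuple(r) for r in rows]  # the real rows are kept first
    while len(output) < target_size:
        a, b = rng.sample(rows, 2)     # two distinct real rows
        t = rng.random()               # position between a and b, in [0, 1)
        output.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return output

# Made-up sparse cohort: (sessions per month, average order value).
cohort = [(2.0, 40.0), (5.0, 65.0), (9.0, 120.0), (3.0, 55.0)]
augmented = augment(cohort, target_size=50)
```

Interpolated rows add volume, not new information. They cannot invent behaviour the real cohort never showed, and they tend to shrink the spread of the data slightly.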

What it actually means

Synthetic data starts with a real dataset, learns its statistical shape, and then generates new records that have the same distributions, correlations and edge cases without containing any original row.
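The learn-then-generate loop can be sketched in a few lines of Python. This example assumes just two numeric columns and fits a plain bivariate normal distribution, which is far simpler than the models commercial tools use and will not capture skew or rare edge cases. The "real" rows are themselves simulated stand-ins, and the names fit, sample and pearson are hypothetical.

```python
import math
import random
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

def fit(rows):
    """Learn the statistical shape: means, spreads and correlation."""
    xs, ys = zip(*rows)
    return {"mx": mean(xs), "my": mean(ys),
            "sx": pstdev(xs), "sy": pstdev(ys),
            "rho": pearson(xs, ys)}

def sample(model, n, seed=0):
    """Generate n new rows that follow the learned shape."""
    rng = random.Random(seed)
    rho = model["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        # Mixing z1 into y reproduces the learned correlation.
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows

# Stand-in for a real dataset: (monthly visits, monthly spend), made up.
_rng = random.Random(42)
real_rows = []
for _ in range(500):
    visits = _rng.gauss(10, 3)
    real_rows.append((visits, 20 + 5 * visits + _rng.gauss(0, 10)))

model = fit(real_rows)
synthetic_rows = sample(model, 5000, seed=1)
```

The synthetic rows share the averages and the visits-to-spend relationship of the originals, yet none of them is a copy of a real row.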

The marketing use cases divide roughly into two buckets. The first is privacy-safe model training. If you want to build a churn prediction model or a lookalike audience model but your real customer data is too thin, too sensitive, or subject to consent restrictions, synthetic data fills the gap. You train on data that behaves like your customers without using your customers.

The second bucket is testing and simulation. Synthetic data lets you stress-test a campaign personalisation engine, a CRM workflow, or a data pipeline before real user records flow through it. You catch attribution bugs, consent failures, and segmentation errors before they touch live data.
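As a rough illustration of the testing use case, the sketch below generates rule-based test contacts (not learned from any real data) and checks that a segmentation rule never lets through a contact without recorded consent. The field names, the make_test_contacts generator and the high_value_segment rule are all hypothetical.

```python
import random

def make_test_contacts(n, seed=0):
    """Generate made-up CRM contacts, including awkward consent states."""
    rng = random.Random(seed)
    contacts = []
    for i in range(n):
        contacts.append({
            "id": f"test-{i:05d}",
            "email": f"test{i:05d}@example.invalid",
            # None models a contact whose consent was never recorded.
            "consent_marketing": rng.choice([True, False, None]),
            "lifetime_spend": round(rng.uniform(0, 2000), 2),
        })
    return contacts

def high_value_segment(contacts, min_spend=500.0):
    """Segmentation rule under test: consented contacts above a spend floor."""
    return [c for c in contacts
            if c["consent_marketing"] is True
            and c["lifetime_spend"] >= min_spend]

contacts = make_test_contacts(1000)
segment = high_value_segment(contacts)

# The check a test suite would run before real records reach the workflow.
leaked = [c for c in segment if c["consent_marketing"] is not True]
```

If leaked is ever non-empty, the segmentation logic has a consent bug, and it was found without a single real customer record being involved.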

The connection to AI-generated search and generative AI is direct. Large language models are themselves trained partly on synthetic data to fill gaps in real text corpora. Marketers building AI content tools, chatbots, or product recommendation engines face the same tradeoff: real data is better but not always available, legal, or sufficient in volume. Synthetic data is the bridge.

The main risk is inherited bias. Synthetic data is only as good as the real data it was generated from. If the original dataset was biased, the synthetic version inherits and sometimes amplifies that bias. Garbage in, garbage out applies twice over.

Synthetic data is not fake data. It is data that reflects real patterns without exposing real people.

How it shows up

Synthetic data shows up in marketing in three common places.

First, in audience modelling. When a CRM segment is too small to train a reliable lookalike model, synthetic augmentation expands it to a trainable size. The generated rows add volume rather than new information, so the result is only as reliable as the original segment.

Second, in privacy compliance workflows. Under the Australian Privacy Act and emerging consent frameworks, some teams use synthetic exports in place of real customer records when sharing data with agencies, partners, or testing environments.

Third, in generative AI tooling. If you are building or fine-tuning a model on your brand voice, product descriptions, or customer FAQs, synthetic data generation can create training examples the real corpus does not contain.

The Australian context

The Privacy Act 1988 and the proposed reforms before the Australian Parliament have made data minimisation a live compliance concern for most mid-market businesses. Synthetic data offers a practical path: you retain the analytical value of your customer data without storing or sharing real records in contexts where consent is ambiguous.

ACCC enforcement activity around data misuse has also sharpened board-level attention. Teams that can demonstrate their AI models were trained on synthetic rather than raw personal data are likely to be in a better position if a complaint is made.

The counter-consideration is that the Australian Privacy Act does not yet have formal guidance on synthetic data. Whether a synthetic dataset derived from personal information counts as personal information itself is still unsettled. Legal advice before deploying a synthetic data pipeline is worth the cost.

Where people get this wrong

Treating synthetic data as interchangeable with real data for reporting

Synthetic data is for building and testing systems, not for reporting performance metrics. A conversion rate calculated on synthetic sessions is not a conversion rate. It is a simulation.

Assuming synthetic data solves the bias problem

Synthetic data inherits the biases of the real data it was generated from. If your real customer base skews toward one demographic, the synthetic version will too, and the model trained on it will reflect that skew.
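A small Python sketch makes the inheritance visible. It assumes a made-up customer base that is 80% metro and 20% regional, and a generator that simply samples groups in their observed proportions, which is a deliberately simplified stand-in for a real synthetic data model.

```python
import random
from collections import Counter

def shares(values):
    """Proportion of each group in a list of labels."""
    counts = Counter(values)
    return {group: count / len(values) for group, count in counts.items()}

# Made-up real customer base with a strong metro skew.
real_regions = ["metro"] * 80 + ["regional"] * 20
real_shares = shares(real_regions)

# Simplified generator: draw groups at the rates seen in the real data.
rng = random.Random(0)
groups = list(real_shares)
weights = [real_shares[g] for g in groups]
synthetic_regions = rng.choices(groups, weights=weights, k=10_000)
synthetic_shares = shares(synthetic_regions)
```

Comparing real_shares with synthetic_shares shows roughly the same 80/20 split. Generating a hundred times more rows changes the volume, not the skew.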

Generating synthetic data without checking distributional fidelity

A synthetic dataset that looks superficially correct but has wrong correlations between variables will train a model that behaves well in testing and breaks in production. Always validate the statistical shape before using the synthetic set for training.
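A basic fidelity check might look like the Python sketch below. It compares column averages and pairwise correlations between the real and synthetic sets and lists anything outside a tolerance. The 0.1 tolerance, the column names and the fidelity_report helper are assumptions for illustration, and the check assumes numeric columns that are not constant. The deliberately bad synthetic set reuses the real spend values in reverse order, so each column looks right on its own while the relationship between them is broken.

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

def fidelity_report(real, synth, tolerance=0.1):
    """List the ways synth departs from real. Inputs map column -> values."""
    columns = list(real)
    issues = []
    for col in columns:
        # Mean difference, measured in real standard deviations.
        shift = abs(mean(real[col]) - mean(synth[col])) / pstdev(real[col])
        if shift > tolerance:
            issues.append(f"mean shift in {col}: {shift:.2f} sd")
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            gap = abs(pearson(real[a], real[b]) - pearson(synth[a], synth[b]))
            if gap > tolerance:
                issues.append(f"correlation gap between {a} and {b}: {gap:.2f}")
    return issues

# Made-up data: spend rises with visits in the real set.
visits = [float(v) for v in range(1, 21)]
spend = [10.0 * v + (v % 3) for v in visits]
real = {"visits": visits, "spend": spend}

# Same values per column, but the visits-to-spend relationship is flipped.
synth_bad = {"visits": visits, "spend": spend[::-1]}

issues = fidelity_report(real, synth_bad)
```

Here the report flags the correlation but not the averages, which is exactly the failure a column-by-column glance would miss.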

Common questions

Is synthetic data legal to use in Australia?

The Privacy Act 1988 does not specifically address synthetic data. Whether a synthetic dataset derived from personal information is itself personal information remains unsettled. Most legal advisers recommend treating it as personal information unless it can be demonstrated the original records cannot be re-identified from the output. Get advice before deploying.
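One basic technical check that can feed into that conversation is whether any synthetic row is simply a copy of a real one. The Python sketch below is a hypothetical helper with made-up rows. Passing it is a minimum bar only: it does not show that records cannot be re-identified, and it is not a legal test.

```python
def copied_rows(real_rows, synthetic_rows, ndigits=2):
    """Return synthetic rows that exactly match a real row after rounding."""
    def key(row):
        # Round floats so trivial numeric noise does not hide a copy.
        return tuple(round(v, ndigits) if isinstance(v, float) else v
                     for v in row)
    real_keys = {key(r) for r in real_rows}
    return [s for s in synthetic_rows if key(s) in real_keys]

# Made-up rows: (age, postcode, monthly spend).
real_rows = [(34, "3000", 120.50), (51, "2000", 89.99)]
synthetic_rows = [(34, "3000", 120.50), (40, "4000", 75.10)]

copies = copied_rows(real_rows, synthetic_rows)
```

In this example the first synthetic row duplicates a real customer, so it would need to be removed or regenerated before the dataset is shared.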

Can synthetic data replace A/B testing?

No. Synthetic data is useful for building and validating the systems you will test with, not for running the tests themselves. A/B testing requires real user behaviour to produce real signal. Synthetic data can help you design a better experiment but cannot substitute for running it.

How is synthetic data different from anonymised data?

Anonymised data removes identifying fields from real records. Synthetic data generates entirely new records that were never real. Synthetic is generally considered the more privacy-protective option because there are no original records to re-identify, though regulatory guidance on this distinction in Australia is still developing.

What tools do marketers use to generate synthetic data?

Common tools include Gretel, Mostly AI and Tonic for structured customer data, and Python libraries like SDV (Synthetic Data Vault) for more custom generation. Most of these require a data analyst or data scientist to operate. Off-the-shelf options suitable for non-technical marketers are still limited.

About New Rebellion

New Rebellion is a marketing intelligence consultancy. We build tools, score Australian businesses on how their marketing actually performs, and publish Debrief every day. This dictionary is part of how we work in the open.
