AI’s Achilles’ Heel: Garbage Data
- Marc Shull
- Jun 17
- 6 min read

Artificial Intelligence (“AI”) has become one of the most transformative technologies of our time. From autonomous vehicles and personalized marketing to medical diagnostics and counterfeit goods detection, AI systems are increasingly ingrained in our work and personal lives, often in ways we do not see. Frequently overlooked among the many concerns about AI is one that undermines both its value to businesses and its impact on consumers: data quality. No matter how advanced an AI capability may be, if the data used to train the model, or the data a user inputs after training, has quality issues, then the outputs will be unreliable and could produce harmful results. Hence the adage “garbage in, garbage out.”
What does this mean? It means that if an AI system is trained on poor-quality, biased, or incomplete data, its outputs -- no matter how sophisticated the AI -- will also be flawed, though this may not be apparent to the user. This problem has undermined businesses’ use of data for decades, and failing to fix it while investing in more advanced systems will continue to undermine the effectiveness, fairness, and reliability of AI systems across sectors. It can also exacerbate the risk of discriminatory data uses, which can lead to regulatory fines (e.g., under GDPR) and damage a business’s brand image.
To solve this problem, businesses will need to invest in fixing their data quality issues. This often requires identifying and removing bad or outdated data, finding higher-quality data sources, integrating disparate data assets, and addressing collection gaps. While not specific to data quality, businesses must also verify that they have the legal right to process personal data, often at the record level. With the explosive rise of AI, businesses that continue to ignore their data quality problems risk turning their AI investments into wasteful, avoidable failures.
The Backbone of AI: Training Data
At its core, AI relies on data to learn patterns, make predictions, and generate insights. These models don’t possess inherent understanding or reasoning capabilities. Instead, they "learn" from examples in their training datasets. For instance:
A facial recognition algorithm is trained on thousands (or millions) of images to learn what different faces look like.
A chatbot like ChatGPT is trained on huge volumes of text to understand language patterns and context.
A medical diagnostic tool might be trained on patient data, lab results, and historical diagnoses.
The quality of this data -- its accuracy, recency, representativeness, completeness, and lack of bias (far easier said than done) -- determines how well the model will perform in the real world.
What Constitutes Poor Quality Data?
Poor quality data can take many forms:
Inaccurate data: Values that are simply wrong due to human error, faulty data capture, or outdated information.
Incomplete data: Most data sets have missing values, which can force a model to “guess” at crucial steps or force records to be excluded; the model developer must determine the best way to handle them.
Biased data: Data sets that are not representative of the audience being modeled, or that exclude or underrepresent certain segments. Data bias is not always apparent.
Unrepresentative data: A data set that does not reflect the real world, or the model’s target audience, can cause models to fail or underperform outside of narrow scenarios.
Noisy data: Irrelevant information can dilute useful signals and confuse the model development process, producing a sub-optimal model.
When data suffers from one or more of these issues, the AI system’s predictions and decisions become unreliable -- or worse, actively harmful. A quick automated audit, as shown below, can surface several of these problems before they reach a model.
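As a starting point, here is a minimal sketch of such an audit in Python, assuming a pandas DataFrame with hypothetical column names. It flags missing values, duplicate rows, and stale records in a single pass; a real audit would add checks tailored to your data.

```python
import pandas as pd

def quality_report(df: pd.DataFrame, freshness_col: str, max_age_days: int = 365):
    """Summarize common data quality problems in one pass."""
    age = (pd.Timestamp.now() - pd.to_datetime(df[freshness_col])).dt.days
    return {
        "rows": len(df),
        "missing_pct": df.isna().mean().round(3).to_dict(),  # per column
        "duplicate_rows": int(df.duplicated().sum()),
        "stale_rows": int((age > max_age_days).sum()),
    }

# Hypothetical customer extract with a duplicate, a gap, and a stale record.
df = pd.DataFrame({
    "email": ["a@x.com", None, "a@x.com"],
    "age": [34, 29, 34],
    "last_updated": ["2025-05-01", "2019-02-10", "2025-05-01"],
})
print(quality_report(df, freshness_col="last_updated"))
```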
Real-World Examples of AI Failing Due to Poor Quality Data
Facial Recognition Bias
Facial recognition technologies have been criticized for disproportionately misidentifying people of color, particularly Black women. A study by the MIT Media Lab showed error rates of up to 35% for darker-skinned women compared to less than 1% for lighter-skinned men. Why? The training datasets were heavily skewed toward white male faces, making the models less accurate for anyone outside that group.
Financial Credit Scoring
AI tools used to determine creditworthiness can inherit biases from historical lending data. If certain communities were historically underserved by traditional banks, an AI model trained on that data may penalize those communities again, even if the model doesn’t explicitly factor in the reasons for those communities being underserved.
Data Quality in AI Deployment: Not Just Training
While training data gets most of the attention, data quality at the point of use -- when AI is deployed in the real world -- is equally crucial. AI models operate in dynamic environments, and their effectiveness depends on receiving accurate, relevant input data. For example, a predictive product recommendation model may have been trained on clean consumer data from Italy. But if the deployed system receives less complete data about Japanese consumers, its predictions will likely be off. This often leads to wasted time and undermines investment in AI technologies and applications. To address this, teams must monitor data drift -- changes in the statistical properties of input data over time -- which is essential for maintaining AI performance and determining when a model rebuild, a new model, or additional training is necessary.
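One common way to quantify drift is the Population Stability Index (PSI). The sketch below is a minimal illustration, assuming you have a reference (training) sample and a recent production sample of a single numeric feature; the thresholds cited are conventional rules of thumb, not fixed standards.

```python
import numpy as np

def population_stability_index(reference, current, bins=10):
    """Quantify drift between a reference sample and a current sample.

    Bin edges come from the reference distribution's quantiles, so each
    bin holds roughly the same share of reference data.
    """
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the range

    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)

    # Clipping avoids division by zero / log of zero in empty bins.
    ref_pct = np.clip(ref_counts / len(reference), 1e-6, None)
    cur_pct = np.clip(cur_counts / len(current), 1e-6, None)

    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Hypothetical example: the production population has shifted upward.
rng = np.random.default_rng(42)
train_ages = rng.normal(35, 8, 10_000)
prod_ages = rng.normal(42, 8, 10_000)

psi = population_stability_index(train_ages, prod_ages)
print(f"PSI = {psi:.3f}")  # above ~0.25 is often read as significant drift
```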
Why This Problem Persists
Despite growing awareness, poor data quality remains a chronic issue for most business and marketing data applications, far beyond AI. These problems largely exist because:
Data collection is expensive and time-consuming: It takes significant resources to gather high-quality, identified data -- especially in emerging industries, in highly regulated industries like healthcare, or where collection opportunities are limited (e.g., autonomous driving).
It is a daunting task to evaluate years’ or decades’ worth of data: Historical data is difficult to assess at best, especially without the context necessary to understand what was done and why.
Data ownership: Because data fuels many marketing programs and business operations, a business’s data assets are rarely owned by one person or team, so any changes, no matter how justifiable, often run into access and control barriers. Such changes can also have unintended impacts, as the platforms that consume the data are frequently unmapped and not fully understood outside the individual teams that use them.
Historical bias is embedded in many datasets: Much of our data reflects past consumer behaviors in response to specific targeting, content, or other factors, which are not always good predictors of how new and evolving audiences will respond. Historical bias in data can be extremely hard to identify or remove.
Lack of transparency: Proprietary, disparate datasets and black-box models can make it difficult for internal or external auditors to assess data quality.
Speed over substance: Many businesses are rushing to develop AI solutions without fully investigating or investing in the integrity of their data and data pipelines.
Building Impactful AI Starts With Quality Data
So what can be done to address the data quality problem?
1. Data Assessments and Validation
Regular assessments can help identify bias, gaps, and errors in datasets. Techniques like representativeness testing, correlation analysis, and data drift checking should be standard in AI development and monitoring workflows.
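For instance, a simple representativeness test compares the distribution of a key attribute in the training set against a trusted population benchmark. The sketch below is a minimal illustration using a chi-square goodness-of-fit test; the age bands and benchmark shares are hypothetical stand-ins for whatever census or market data you trust.

```python
from scipy.stats import chisquare

# Hypothetical age-band shares from a census-style benchmark.
benchmark_shares = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

# Observed record counts per age band in the training set.
training_counts = {"18-34": 5_200, "35-54": 3_100, "55+": 1_700}

bands = list(benchmark_shares)
observed = [training_counts[b] for b in bands]
total = sum(observed)
expected = [benchmark_shares[b] * total for b in bands]  # scaled to match

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.1f}, p = {p_value:.2g}")
if p_value < 0.05:
    print("Training data deviates significantly from the benchmark.")
```

A significant result does not say which segments to fix on its own, but comparing observed and expected counts band by band points directly at the gaps.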
2. Keep the Human Element
Involving domain experts to review, clean, and label data improves quality and contextual accuracy. Crowd-sourced or automated labeling methods, while scalable, often fall short where the data inputs or model outputs are sensitive to nuances.
3. Synthetic Data and Augmentation
Where real-world data is scarce or biased, synthetic data -- artificially generated data that mimics real scenarios -- can help fill in gaps. However, care must be taken to ensure that synthetic data does not introduce new biases.
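As a toy illustration, the sketch below augments an underrepresented segment by resampling its records and adding small Gaussian noise to numeric features. This is a deliberately simple stand-in for more sophisticated techniques (e.g., SMOTE or generative models), and the segment and features are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

def augment_segment(X, n_new, noise_scale=0.05):
    """Create n_new synthetic rows by jittering resampled real rows.

    noise_scale is relative to each feature's standard deviation, so
    synthetic points stay close to the observed distribution.
    """
    idx = rng.integers(0, len(X), size=n_new)
    noise = rng.normal(0.0, noise_scale * X.std(axis=0), size=(n_new, X.shape[1]))
    return X[idx] + noise

# Hypothetical numeric features (e.g., income, tenure) for a rare segment.
rare_segment = rng.normal(loc=[52_000, 3.2], scale=[9_000, 1.1], size=(120, 2))

synthetic = augment_segment(rare_segment, n_new=480)
print(synthetic.shape)  # (480, 2): four synthetic rows per real one
```

A sensible safeguard is to rerun your representativeness and drift checks on the augmented set, so the synthetic rows correct a gap rather than create a new one.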
4. Implement Robust Data Governance
Businesses need clear policies for data collection, consent management, data retention, access controls, anonymization, privacy by design, documentation, usage restrictions, classification, and more. What data may be used for model training, which models are appropriate to build, and other critical questions should be addressed by data governance policies and processes. With the proliferation of data privacy laws and regulations, compliance is key to minimizing the risk of crippling fines.
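To make the record-level point concrete, here is a minimal sketch of an eligibility gate applied before data enters a training pipeline. It assumes each record carries consent and retention metadata; the field names and rules are hypothetical, not a reference implementation.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Record:
    user_id: str
    consented_uses: set[str]   # e.g., {"analytics", "model_training"}
    retention_expires: date

def eligible_for_training(record: Record, today: date | None = None) -> bool:
    """Gate a record before it enters a training pipeline."""
    today = today or date.today()
    return (
        "model_training" in record.consented_uses
        and record.retention_expires >= today
    )

records = [
    Record("u1", {"analytics", "model_training"}, date(2026, 1, 1)),
    Record("u2", {"analytics"}, date(2026, 1, 1)),       # no training consent
    Record("u3", {"model_training"}, date(2020, 1, 1)),  # retention expired
]

training_set = [r for r in records if eligible_for_training(r)]
print([r.user_id for r in training_set])  # ['u1']
```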
5. Post-Deployment Monitoring
Implementing systems that detect data drift, anomalies, or performance drops ensures that the model performs as expected as market conditions and consumer behaviors evolve.
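One simple pattern is a rolling-window monitor that compares recent model accuracy against a training-time baseline and alerts when it degrades beyond a tolerance. The sketch below assumes labeled outcomes arrive with some delay; the baseline, window, and tolerance values are illustrative.

```python
from collections import deque

class AccuracyMonitor:
    """Alert when rolling accuracy drops well below a baseline."""

    def __init__(self, baseline: float, window: int = 500, tolerance: float = 0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = incorrect

    def record(self, prediction, actual) -> bool:
        """Log one prediction/outcome pair; return True if alerting."""
        self.outcomes.append(int(prediction == actual))
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # wait for a full window before judging
        rolling = sum(self.outcomes) / len(self.outcomes)
        return rolling < self.baseline - self.tolerance

monitor = AccuracyMonitor(baseline=0.91)
# In production, per prediction: alert = monitor.record(model_pred, observed_label)
```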
The Hidden Cost of Bad Data
However loudly Artificial Intelligence tools are touted as a magic solution, they are ultimately limited by the data used to train them and to produce the outputs businesses and marketers rely on. When trained or operated on poor quality data, even the most advanced AI system can produce misleading, biased, or harmful outcomes. As the stakes of AI continue to rise -- with decisions affecting billions of dollars in ad spend, the processing of billions of consumers’ personal data, and potentially discriminatory automated decision-making -- the importance of data quality cannot be overstated. Businesses must invest in better data practices, not just better AI tools, or it will remain “garbage in, garbage out.”
If you need help getting your data into shape, we can help.
Read more about our Data Strategy Services or email us at info@mkt-iq.com.
Photo Credit: https://unsplash.com/@sigmund