Community Expert Article #1: The Bad Label
When a Single Annotation Broke Our Fraud Detection Model
The Community Expert program is a new initiative we’re launching to gather and share insights from real-world practitioners in software and AI. We will send these out periodically as they come in. If you would like to contribute, please apply here.
This article is by Ravi Teja Thutari, senior software engineer at Hopper.
Introduction
In machine learning, it is easy to believe that accurate models come from complicated architectures and large datasets. In practice, the real foundation of a reliable model is excellent data.
A movie shot with the latest technology can still fail on a weak script. In a domain like fraud detection, accuracy is non-negotiable, and when a single annotation mistake produced widespread misclassifications, both our users and our team lost confidence in the system.
This article explains how a seemingly minor labeling error disrupted our fraud detection tool in production, what corrective steps we took, and what other teams can learn from it.
Background
We had developed a machine learning system to rapidly catch irregular transactions. The model was a Gradient Boosted Decision Trees (GBDT) classifier trained on over 10 million historical transactions with a rich set of features, including the following (a minimal training sketch follows the list):
Merchant category codes (the industry codes assigned to merchants)
Transaction amounts and frequency
Geolocation metadata
Device and browser fingerprints
Historical user behavior patterns
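To make that setup concrete, here is a minimal training sketch for a GBDT classifier of this kind. It assumes LightGBM, and the column names and file path are hypothetical stand-ins rather than our production schema.

```python
# Minimal sketch of a GBDT fraud classifier (assumes LightGBM; column names
# and the CSV path are hypothetical, not our production schema).
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("transactions_labeled.csv")

# Treat identifier-like columns as categorical so LightGBM can split on them natively.
for col in ["merchant_id", "merchant_category_code", "device_fingerprint"]:
    df[col] = df[col].astype("category")

features = [
    "merchant_id",              # merchant identifier (the feature at the center of this story)
    "merchant_category_code",   # industry code assigned to the merchant
    "amount",                   # transaction amount
    "txn_count_30d",            # how frequently the user transacts
    "geo_distance_km",          # geolocation-derived signal
    "device_fingerprint",       # device/browser fingerprint
    "avg_amount_90d",           # historical user behavior
]
X, y = df[features], df["is_fraud"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
model.fit(X_train, y_train)

# Same style of thresholding as in production, where flags fired above 0.9.
preds = (model.predict_proba(X_val)[:, 1] > 0.9).astype(int)
print("validation F1:", f1_score(y_val, preds))
```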
Over several months, the training data had been labeled by a team of fraud analysts, supplemented by rule-based heuristics. With a strong F1 score and a low false positive rate in offline evaluation, we felt confident going to production.
The Red Flag
The first week after deployment went smoothly, but then I started receiving a flood of tickets from users who were unable to make their regular payments to popular merchants.
I debugged the issue, reran the data, and checked every pipeline stage I had built. The data pipeline was functioning correctly and the configuration was correct too. Yet the model kept reliably flagging these transactions as risky, with predicted probabilities above 0.9.
The Investigation
We pulled the transactions that users had disputed. The same merchant kept appearing across these cases. It was a popular subscription merchant, so its high-risk classification made little sense.
Using SHapley Additive exPlanations (SHAP), we found that a single merchant ID was strongly influencing the predictions, even though none of its other features stood out.
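For readers who have not used SHAP before, the analysis looked roughly like the sketch below; it assumes the hypothetical model and validation frame from the earlier training sketch.

```python
# Sketch of the SHAP inspection: which features dominate the model's
# high-risk predictions? Assumes `model` and `X_val` from the earlier sketch.
import numpy as np
import shap

explainer = shap.TreeExplainer(model)

# Restrict to the transactions the model calls risky (probability > 0.9).
flagged = X_val[model.predict_proba(X_val)[:, 1] > 0.9]
shap_values = explainer.shap_values(flagged)

# Depending on SHAP/LightGBM versions, binary classifiers may return a single
# array or a list of per-class arrays; take the positive class if it's a list.
if isinstance(shap_values, list):
    shap_values = shap_values[1]

# Rank features by mean absolute contribution across the flagged transactions.
mean_abs = np.abs(shap_values).mean(axis=0)
for name, contrib in sorted(zip(flagged.columns, mean_abs), key=lambda t: -t[1]):
    print(f"{name}: {contrib:.3f}")
```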
In the end, the mistake was revealed: a single transaction for this merchant had been mislabeled as fraudulent. The error traced back to a labeling script written for a previous version of the label schema.
Why It Mattered
Several factors compounded the damage from this single bad label.
The error was rare: the mislabeled transaction was a high-priced subscription from a user with a multi-currency profile, so the model had very few similar examples to learn from.
Gradient boosting models are sensitive to outliers: GBDTs latch onto unusual values in the training data more readily than neural networks typically do.
The label skewed feature importance: because the mislabeled transaction looked otherwise normal, the model learned to associate ordinary, non-risky feature values for that merchant with fraud.
The model was trained on imbalanced data: fraudulent transactions made up only 0.5% of the dataset, so each ‘fraudulent’ label carried outsized weight (a quick back-of-the-envelope sketch follows this list).
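Here is that back-of-the-envelope sketch. The weighting scheme (LightGBM's scale_pos_weight) is one common way to handle this kind of imbalance, shown for illustration rather than as our exact configuration.

```python
# Back-of-the-envelope: with 0.5% fraud in ~10M transactions, positive labels
# are rare, and a common mitigation (up-weighting positives via LightGBM's
# scale_pos_weight) makes every 'fraudulent' label, right or wrong, count ~200x.
import lightgbm as lgb

n_total = 10_000_000
fraud_rate = 0.005
n_fraud = int(n_total * fraud_rate)      # ~50,000 fraud examples
n_legit = n_total - n_fraud

scale_pos_weight = n_legit / n_fraud     # ~199x weight on each positive label
print(f"fraud examples: {n_fraud:,}; effective weight per positive: {scale_pos_weight:.0f}x")

model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    scale_pos_weight=scale_pos_weight,   # one mislabeled 'fraud' row now carries ~200x weight
)
```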
The Fix
Once identified, fixing the mislabeled data point was trivial. But the larger challenge was rebuilding trust in the model.
Short-term Remediation
Rolled back the model to the previous stable version
Removed and relabeled the erroneous entry (a minimal sketch of this step follows the list)
Rebalanced the dataset to include more examples of the rare edge case
Retrained the model with additional validation against live user feedback
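The relabeling itself amounted to flipping one label and leaving an audit trail; a rough sketch, with placeholder IDs, paths, and column names:

```python
# Hypothetical sketch of the relabeling step: flip the single bad label and
# record why it changed. IDs, paths, and column names are placeholders.
import pandas as pd

labels = pd.read_parquet("labels_v1.parquet")

bad_txn_id = "placeholder-transaction-id"   # the transaction surfaced by the SHAP analysis
mask = labels["transaction_id"] == bad_txn_id

labels.loc[mask, "is_fraud"] = 0
labels.loc[mask, "label_note"] = "Corrected: false fraud label from legacy labeling script"

labels.to_parquet("labels_v2.parquet")      # written as a new version of the label file
```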
Long-term Improvements
Data Versioning: We integrated tools like DVC (Data Version Control) to version datasets and track changes
Annotation Review Pipeline: Implemented a second-pass human review for any manually labeled examples
Active Learning: Introduced a loop where low-confidence predictions are sent to fraud analysts for review, gradually improving the dataset (sketched after this list)
Explainability-first Dashboard: Built internal tooling to monitor high-contribution features and sudden shifts in prediction behavior
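As an illustration of the active learning piece, the routing step can be as small as a confidence-band filter feeding an analyst queue. The band and the queue helper below are assumptions:

```python
# Sketch of the active-learning routing step: predictions in an "uncertain"
# band go to fraud analysts, whose verdicts become new training labels.
# The band and the send_to_analyst_queue helper are hypothetical.
LOW, HIGH = 0.4, 0.6

def route_for_review(transactions, model, send_to_analyst_queue):
    """Queue uncertain transactions for human review; return how many were queued."""
    probs = model.predict_proba(transactions)[:, 1]
    uncertain = transactions[(probs >= LOW) & (probs <= HIGH)]
    for _, txn in uncertain.iterrows():
        send_to_analyst_queue(txn)   # the analyst's decision is written back as a label
    return len(uncertain)
```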
Case Study Snapshot: Fixing a Five-Figure Mistake
To quantify the impact, we analyzed a single user who experienced three transaction declines due to our model's false positives. This user had two failed payments for a $200/month subscription, followed by a canceled card due to repeated alerts.
The loss breakdown:
Lost revenue from one subscription: $400
Increased support costs: Estimated $150 (3 tickets, 1 manual override)
Reputation damage: a one-star review and social media backlash
Multiply that by dozens of similar users, and the cost of one bad label quickly scaled to tens of thousands of dollars.
Lessons Learned
1. Never Underestimate the Power of One Label
A single data point can carry outsized influence, especially in low-signal, low-frequency problems like fraud detection.
2. Invest in Data Quality, Not Just Model Quality
The best models trained on noisy data still produce noise. Data validation, clean-up pipelines, and annotation tooling are critical.
3. Trust But Verify
Even experienced teams make mistakes. Instituting verification steps (e.g., human-in-the-loop or sampling audits) is worth the overhead.
4. Monitoring is Not Optional
Model observability is a non-negotiable part of production AI. We now track the following (a sample drift check follows the list):
Drift in feature distributions
SHAP value anomalies
Changes in prediction confidence
User-reported discrepancies
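One of the simpler checks behind that monitoring is a per-feature drift test. The sketch below uses a two-sample Kolmogorov-Smirnov test, with the alerting hook left as a placeholder:

```python
# Sketch of a feature-drift check: compare live values of a feature against
# the training distribution with a two-sample KS test. The `alert` hook is a
# placeholder for whatever pages the on-call or updates the dashboard.
from scipy.stats import ks_2samp

def check_feature_drift(train_values, live_values, feature_name, alert, p_threshold=0.01):
    stat, p_value = ks_2samp(train_values, live_values)
    if p_value < p_threshold:
        alert(f"Drift in {feature_name}: KS={stat:.3f}, p={p_value:.4f}")
    return stat, p_value
```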
5. Labels Are Infrastructure
Treat your labeled dataset with the same respect you treat your codebase. Use pull requests, code review-like approval for annotations, and Continuous Integration for validation checks.
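In practice, that can be as simple as a validation script run by CI on every change to the label files. The schema and the acceptable fraud-rate band below are assumptions for illustration:

```python
# Sketch of a CI-style sanity check on the labeled dataset. The schema and
# the expected fraud-rate band are assumptions for illustration.
import pandas as pd

def validate_labels(path="labels_v2.parquet"):
    labels = pd.read_parquet(path)

    # Schema checks: every row has a unique ID and a binary label.
    assert labels["transaction_id"].notna().all(), "missing transaction IDs"
    assert not labels["transaction_id"].duplicated().any(), "duplicate transaction IDs"
    assert labels["is_fraud"].isin([0, 1]).all(), "labels must be 0 or 1"

    # Distribution check: fraud rate should stay near the historical ~0.5%.
    fraud_rate = labels["is_fraud"].mean()
    assert 0.001 <= fraud_rate <= 0.02, f"fraud rate {fraud_rate:.4%} outside expected band"

if __name__ == "__main__":
    validate_labels()
```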
Conclusion
The entire experience was humbling. We believed our work on training, evaluating, and testing the model covered everything; we had underestimated how much the data beneath it mattered.
Most failures in machine learning are not caused by ineffective algorithms. They come from incorrect assumptions or overlooked edge cases, which is exactly what happened to us with one wrong label. Those issues are resolved with humility as much as with better models. If you want your AI systems to be trustworthy, take your data just as seriously.