
AWS ML Exam Preparation FREE Guide: Pretraining Bias, Class & Label Imbalance, SMOTE, DPL & Clarify Explained (Part 7)

Welcome back to our AWS Machine Learning Associate Series, which we started so that our non-tech blog readers can learn AWS Machine Learning for free and clear the AWS ML Associate certification. In the last post, we saw how raw data must be cleaned and validated before training.

Now, let’s imagine Jake and Ethan have done that work. The ledger is neat, the blanks are filled, and the duplicates are gone. But Ethan knows there’s still a hidden danger: bias in the data before training. In this post, let’s look at pretraining bias concepts like class imbalance, label imbalance, SMOTE, and DPL in AWS Machine Learning, told through the same Jake and Ethan story so that non-tech readers can follow along. Let’s begin!

Chapter 1: Understanding Class Imbalance: The Core ML Bias Problem

Jake slid his old leather ledger across the counter with a grin. “You said you’d do some training to predict my iPhone sales, right? Well, here’s the book. Go ahead, train your magic machine.”

Ethan opened the ledger, flipping through the pages. “Uncle Jake, this isn’t magic. It’s machine learning. But before I can train anything, I need to check if your data is fair.”

Jake raised an eyebrow. “Fair? It’s just sales. What’s unfair about that?”

What is Class Imbalance in ML?

Ethan tapped a page. “Look here. Out of 1,000 entries, 900 are men buying phones and only 100 are women. That’s called class imbalance.”

Jake blinked. “Wait, wait. Class imbalance? Last time I went to class was tenth standard. Are you talking about school classes?”

Ethan laughed. “Not that kind of class. In machine learning, a class just means a category. Like ‘male’ vs. ‘female,’ or ‘approved’ vs. ‘rejected.’ When one category has way more examples than the other, we call it class imbalance.”

Jake scratched his head. “So because I sold more phones to men, the machine will think only men buy phones?”

“Exactly. The model doesn’t know the real world. It only knows the data you give it. If the data is skewed, the predictions will be skewed too.”
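For readers who want to see this in code, here is a minimal sketch, assuming a pandas DataFrame with a hypothetical gender column that mirrors Jake’s 900-vs-100 ledger:

```python
import pandas as pd

# Hypothetical ledger: 900 male buyers, 100 female buyers
ledger = pd.DataFrame({"gender": ["male"] * 900 + ["female"] * 100})

# Count each class and its share of the dataset
print(ledger["gender"].value_counts())                # male 900, female 100
print(ledger["gender"].value_counts(normalize=True))  # male 0.9, female 0.1
```

A 90/10 split like this is exactly the kind of skew a bias report would flag before training.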


Ethan leaned back. “Think of it like your investments. Imagine you put 90% of your money into one risky penny stock and only 10% into safe blue‑chip shares. Your portfolio is unbalanced. If that one stock crashes, you lose everything.

That’s what happens with class imbalance — the model learns one side really well, but ignores the other. It’s like betting your entire future on one stock.”

Jake whistled. “So my ledger is like a bad investment portfolio — too much in one place.”

Why Class Imbalance is Dangerous in Machine Learning!

Ethan explained:

  • In credit scoring, if most applications in the dataset are rejections, the model may just reject everyone — even good customers.

  • In fraud detection, if 99% of transactions are safe and only 1% are fraud, the model may ignore fraud completely.

  • In insurance underwriting, if most claims are small, the model may fail to predict rare but costly claims.

Jake frowned. “So imbalance isn’t just unfair — it can cost money, security, even trust.”

Handling Class Imbalance: Starting with Resampling in ML

Jake leaned forward, tapping the ledger. “Alright, Ethan. You’ve convinced me this class imbalance thing is a problem. But what do we actually do about it? I can’t just go out and invent new customers.”

Ethan smiled. “True, you can’t invent customers in real life. But in machine learning, we have a trick called resampling.”

What is Resampling in ML?

Jake squinted. “Resampling? Sounds like when I give out free samples of phone cases to customers.”

Ethan laughed. “Not quite. In machine learning, resampling means adjusting the dataset so that the categories — the classes — are more balanced. We do this in two main ways:

  • Oversampling: Add more examples of the smaller class.

  • Undersampling: Reduce examples of the bigger class.”

Jake frowned. “Wait, you’re telling me you just… make up data or throw some away?”

“Exactly. Let me explain.”

Oversampling in ML

Ethan pointed at the ledger. “See how you have 900 male customers and only 100 female customers? With oversampling, we copy or generate more female entries until the numbers are closer. So instead of 900 vs. 100, we might make it 900 vs. 900.”

Jake raised an eyebrow. “Copy entries? Isn’t that cheating?”

“Not cheating — balancing. Think of it like giving equal airtime in an election debate. If one candidate gets 90% of the time and the other only 10%, the audience won’t hear both sides. Oversampling makes sure both voices are heard.”

Undersampling in ML

Ethan flipped another page. “Now, undersampling is the opposite. Instead of adding more female entries, we reduce the male entries. So instead of 900 vs. 100, we might cut it down to 100 vs. 100.”

Jake gasped. “You’re throwing away my sales records?”

“Not literally. We’re just training the model on a smaller, balanced sample. It’s like a survey — you don’t need to ask every single person in the city, just a fair sample from each group.”
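For readers who want to try both ideas on their own data, here is a minimal sketch using plain pandas; the column names and the 900/100 split are assumptions that mirror Jake’s ledger, and libraries like imbalanced-learn offer more robust versions of the same trick:

```python
import pandas as pd

# Hypothetical ledger: 900 male and 100 female entries
ledger = pd.DataFrame({
    "gender": ["male"] * 900 + ["female"] * 100,
    "phones_bought": [1] * 1000,
})

majority = ledger[ledger["gender"] == "male"]
minority = ledger[ledger["gender"] == "female"]

# Oversampling: randomly duplicate minority rows until both classes match
oversampled_minority = minority.sample(n=len(majority), replace=True, random_state=42)
oversampled = pd.concat([majority, oversampled_minority])

# Undersampling: randomly drop majority rows down to the minority size
undersampled_majority = majority.sample(n=len(minority), random_state=42)
undersampled = pd.concat([undersampled_majority, minority])

print(oversampled["gender"].value_counts())   # 900 vs 900
print(undersampled["gender"].value_counts())  # 100 vs 100
```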

Oversampling vs. Undersampling in Machine Learning

Jake scratched his head. “So which one’s better?”

Ethan shrugged. “It depends.

  • Oversampling keeps all the data but risks overfitting — the model might memorize the repeated entries.

  • Undersampling avoids duplication but throws away information.”

What if Resampling (Oversampling/Undersampling) Fails?

Ethan continued, “That’s why we often combine them or use smarter methods like SMOTE.”

Jake chuckled. “SMOTE? Sounds like a new phone brand.”

Ethan grinned. “Not quite. It stands for Synthetic Minority Oversampling Technique. Instead of just copying entries, it creates new, realistic ones. 

But let’s not get ahead of ourselves — we’ll get to SMOTE in a moment.”

How AWS Helps with Resampling

Jake leaned in. “Alright, but how do you actually do this? You’re not going to sit here copying ledger entries by hand, are you?”

Ethan shook his head. “Nope. On AWS, we can:

  • Use SageMaker Processing to run resampling scripts.

  • Store balanced datasets in Amazon S3.

  • Automate the workflow with AWS Glue or Step Functions.

And before we even start, SageMaker Clarify can generate a bias report to show us the imbalance in the first place.”

Jake nodded slowly. “So AWS is like my assistant, helping me tidy up the ledger before we train the model.”
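As a rough sketch of how that workflow can look with the SageMaker Python SDK: the IAM role, bucket names, and the resample.py script below are placeholders, not real resources, so treat this as an outline rather than a copy-paste recipe.

```python
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Placeholder role -- replace with your own account's execution role
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# resample.py is a hypothetical script that balances the classes,
# e.g. with logic like the pandas sketch shown earlier
processor.run(
    code="resample.py",
    inputs=[ProcessingInput(
        source="s3://my-bucket/raw/ledger.csv",
        destination="/opt/ml/processing/input",
    )],
    outputs=[ProcessingOutput(
        source="/opt/ml/processing/output",
        destination="s3://my-bucket/balanced/",
    )],
)
```

The balanced output lands back in Amazon S3, ready for training, and the same job can be chained into a Step Functions or AWS Glue workflow.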

Chapter 2: When the Answers Are Skewed — Label Imbalance in Machine Learning (ML)

Jake leaned back after hearing about resampling. “Alright, nephew, I get it. Too many of one group, too few of another — that’s class imbalance. But you also mentioned something about label imbalance. What’s that now? Labels? Like the stickers on my iPhone boxes?”

Ethan chuckled. “Not those labels, Uncle Jake. In machine learning, a label is the answer we’re trying to predict. Think of it like the final column in your ledger that says whether a loan was approved or rejected, or whether a transaction was fraud or safe. That’s the label.”

Jake scratched his head. “Alright, so what’s label imbalance then?”

Ethan flipped to another page in the ledger. “Imagine this: out of 1,000 loan applications, 950 are marked rejected and only 50 are approved. That’s label imbalance. The outcomes themselves are skewed.

If we train a model on this, it’ll just learn to reject everyone, because that’s the majority pattern.”

Jake frowned. “So the machine will look smart — 95% accurate — but it’s actually dumb, because it never learns what an approval looks like?”

“Exactly,” Ethan nodded. “It’s like a teacher who fails everyone because most students failed last year. Easy job, but terrible teaching.”

Ethan continued:

  • Credit Card Approvals: If most applications are rejected, the model may reject everyone.

  • Fraud Detection: If 99% of transactions are safe, the model may ignore fraud completely.

  • Insurance Claims: If most claims are small, the model may fail to predict rare but costly claims.

  • Medical Diagnosis: If most patients are healthy, the model may miss rare but critical diseases.

Jake sighed. “So even if the groups are balanced — like men and women customers — the outcomes can still be lopsided.”

“Right,” Ethan said. “Class imbalance is about the groups. Label imbalance is about the answers.”

Jake leaned forward, tapping the ledger. “Alright, nephew, I get it. If most of my loan applications are marked rejected, the machine will just learn to reject everyone. That’s label imbalance. But how do we fix this?”

Ethan smiled. “Good question. There are several ways. Let me walk you through them.”

Balanced Accuracy in ML

Ethan began, “Normally, people look at accuracy — how many predictions the model got right. But with label imbalance, accuracy can be misleading.

Imagine this: 95% of your applications are rejected. If the model just says rejected every time, it’s 95% accurate — but it’s useless, because it never approves anyone.”

Jake frowned. “So the machine looks smart, but it’s actually dumb.”

“Exactly. That’s why we use balanced accuracy. Instead of just looking at overall accuracy, we measure how well the model does on each label separately.

  • For approved cases, how many did it get right?

  • For rejected cases, how many did it get right?

Then we average those scores. That way, the model can’t hide behind the majority label.”

Jake nodded slowly. “So it’s like grading a student in both math and English separately, instead of just giving them a high score because they’re good at math.”

“Perfect analogy,” Ethan said.
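Here is a tiny sketch of the difference using scikit-learn and made-up labels: a lazy model that rejects everyone looks 90% accurate, but its balanced accuracy is only 50% because it never gets an approval right.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up labels: 1 = approved, 0 = rejected (9 rejections, 1 approval)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0] * 10  # lazy model: reject everyone

print(accuracy_score(y_true, y_pred))           # 0.9 -- looks great
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 -- (100% on rejections + 0% on approvals) / 2
```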

F1 Score

Ethan continued, “Another metric we use is the F1 score. It balances two things:

  • Precision: Of the cases the model predicted as approved, how many were actually approved?

  • Recall: Of all the cases that should have been approved, how many did the model catch?”

Jake scratched his head. “Sounds like exam results again. Give me an example.”

“Alright,” Ethan said. “Imagine fraud detection.

  • If the model flags 10 transactions as fraud, but only 2 are actually fraud, that’s low precision.

  • If there were 100 fraud cases in total, but the model only caught 2, that’s low recall.

The F1 score combines both, so the model can’t just be good at one and ignore the other.”

Jake chuckled. “So it’s like a shopkeeper who not only spots fake currency but also doesn’t accuse honest customers by mistake.”

“Exactly,” Ethan said.
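Plugging Ethan’s made-up fraud numbers into the formulas shows why the F1 score stays low when either precision or recall is poor:

```python
# Ethan's hypothetical fraud-detection numbers
flagged = 10             # transactions the model called fraud
true_frauds_caught = 2   # of those, how many really were fraud
total_frauds = 100       # fraud cases that actually existed

precision = true_frauds_caught / flagged       # 0.20
recall = true_frauds_caught / total_frauds     # 0.02
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f1, 3))  # 0.2 0.02 0.036
```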

Stratified Sampling

Jake leaned forward. “Alright, but what about when you split the data into training and testing sets? Doesn’t the imbalance carry over?”

“Good catch,” Ethan said. “That’s why we use stratified sampling.

When we split the data, we make sure both sets have the same proportion of labels. So if 90% are rejections and 10% are approvals, both the training set and the test set keep that ratio.

That way, the model is tested fairly, and we don’t end up with a test set that has no approvals at all.”

Jake nodded. “So it’s like making sure both cricket teams have the same number of batsmen and bowlers before the match starts.”

“Exactly,” Ethan said. “Balanced teams, balanced evaluation.”
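In scikit-learn, stratified splitting is a single argument. A minimal sketch with made-up labels (90% rejections, 10% approvals) shows that both splits keep roughly the same ratio:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 900 rejections (0) and 100 approvals (1)
X = pd.DataFrame({"feature": range(1000)})
y = pd.Series([0] * 900 + [1] * 100)

# stratify=y keeps the 90/10 label ratio in both the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.value_counts(normalize=True))  # ~0.9 / ~0.1
print(y_test.value_counts(normalize=True))   # ~0.9 / ~0.1
```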

Resampling & SMOTE

Ethan leaned back. “Finally, we can actually change the dataset itself.

  • Oversampling: Add more examples of the minority label.

  • Undersampling: Reduce examples of the majority label.

  • SMOTE (Synthetic Minority Oversampling Technique): Instead of just copying, it creates new, realistic examples of the minority label.”

Jake leaned back, still digesting the talk about label imbalance. “Alright, nephew, I get it. Too many rejections, not enough approvals — that’s label imbalance. But you still haven’t fully explained that fancy SMOTE thing. What is it, really? It still sounds like a new phone brand to me.”

SMOTE in ML

Ethan laughed. “Not a phone brand, Uncle Jake. SMOTE stands for Synthetic Minority Oversampling Technique. It’s a way to handle imbalance — whether it’s class imbalance or label imbalance — by creating new, realistic examples for the smaller group.”

Jake raised an eyebrow. “Wait, wait. You mean you just make up data? Isn’t that cheating?”

“Not cheating,” Ethan said. “Think of it like training a cashier. If you don’t have enough fake banknotes to practice with, you print realistic practice notes. They’re not real money, but they help the cashier learn to spot counterfeits. SMOTE does the same — it generates synthetic but realistic data points for the minority group.”

How SMOTE Works in ML

Ethan pulled out a notepad. “Here’s the idea:

  • We take the few minority examples we have.

  • We look at their nearest neighbors — similar cases.

  • Then we create new examples in between them.

So instead of just copying the same data, SMOTE creates fresh, slightly different examples that look realistic.”

Jake scratched his head. “So it’s like if I only had 10 loyal customers, and I imagined a few more who behave almost like them, but with small differences?”

“Exactly,” Ethan said. “That way, the model learns a richer picture of what that group looks like.”
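For readers who want to see SMOTE in action, here is a minimal sketch using the imbalanced-learn library; the toy dataset below is just an assumption for illustration:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Build a toy dataset with roughly a 9:1 imbalance between the two classes
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=42
)
print(Counter(y))  # roughly {0: 900, 1: 100}

# SMOTE interpolates between minority examples and their nearest neighbors
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_resampled))  # both classes now equal
```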

Jake leaned forward. “Alright, but how do you actually do this? You’re not sitting here inventing fake customers by hand, are you?”

Ethan shook his head. “Nope. On AWS, we can:

  • Use SageMaker Processing to run SMOTE scripts on the dataset.

  • Store the balanced dataset in Amazon S3.

  • Automate the workflow with AWS Glue or Step Functions.

And before we even start, SageMaker Clarify can show us the imbalance that SMOTE needs to fix.”

Jake grinned. “So AWS is like my assistant, generating practice cases so the model doesn’t get lazy.”

Jake asked, “Alright, but is SMOTE always the answer?”

Ethan shook his head. “Not always.

  • Pros: It creates realistic new examples, avoids simple duplication, and helps the model learn minority patterns.

  • Cons: If the minority data is noisy or wrong, SMOTE can generate bad synthetic examples. And it can increase training time.”

Jake nodded. “So it’s like seasoning food. A little makes it better, too much can spoil the dish.”

“Exactly,” Ethan said.

Jake leaned back, impressed. “So SMOTE is a way to create synthetic but realistic examples for the smaller group. That way, the model doesn’t just memorize the majority, but actually learns both sides.”

“Right,” Ethan said. “It’s one of the most powerful tools for handling imbalance. And with AWS services like SageMaker Processing and Clarify, we can use it at scale.”

Jake grinned. “Alright, nephew. First resampling, then label imbalance, now SMOTE. What’s next in your bag of tricks?”

Ethan smiled. “Next, we’ll talk about DPL — Difference in Proportions of Labels. That’s when one group consistently gets fewer good outcomes than another. Another kind of unfairness we need to check.”

Chapter 3: DPL in Machine Learning: Deep Dive into Fairness Metrics

Jake squinted. “DPL? Sounds like some stock ticker you want me to buy.”

Ethan chuckled. “Not a stock, Uncle Jake. DPL stands for Difference in Proportions of Labels. It’s a fairness metric: it measures whether different groups are getting positive outcomes at the same rate.”

Jake frowned. “Positive outcomes? You mean like approvals?”

“Exactly. Imagine two groups applying for credit cards. If men get approved 70% of the time but women only 30%, that’s a big gap. DPL quantifies that difference.”

Jake raised an eyebrow. “So the machine could end up being biased, giving one group more yeses than another?”

“Exactly. And that’s why we check DPL before training. It tells us if one group is consistently getting fewer good outcomes.”

Ethan explained, “Suppose a bank trains a model on past loan data. Historically, men were approved more often than women. If the model learns that pattern, it will continue approving men more and rejecting women more.

DPL measures that gap. If the difference is large, it’s a red flag for unfairness.”

Jake sighed. “So even if the model is accurate overall, it can still be unfair to one group.”

“Exactly,” Ethan said.
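Under the hood, DPL is simple arithmetic: the share of positive outcomes in one group minus the share in the other. A small sketch with Ethan’s made-up approval rates (the group names and counts are assumptions for illustration):

```python
# Hypothetical approval counts per group
men_approved, men_total = 700, 1000
women_approved, women_total = 300, 1000

# DPL = proportion of positive labels in one group minus the other
p_men = men_approved / men_total        # 0.70
p_women = women_approved / women_total  # 0.30

dpl = p_men - p_women
print(dpl)  # 0.4 -- a large gap, the kind of red flag Clarify reports
```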

Jake asked, “And how do you check this in practice?”

Ethan opened his laptop. “With AWS SageMaker Clarify, we can:

  • Define the outcome column (like approved/rejected).

  • Define the group column (like gender or region).

Clarify then calculates DPL automatically, and the bias report shows whether one group is consistently getting fewer positive outcomes.”

Jake nodded. “So Clarify is like an auditor checking if one group of customers is always getting fewer yeses.”

“Exactly,” Ethan said.
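A rough sketch of what that check can look like with the SageMaker Python SDK; the S3 paths, column names, and IAM role below are placeholders, and the exact options may differ slightly by SDK version:

```python
from sagemaker import clarify

# Placeholder role and paths -- replace with your own values
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/loans/applications.csv",
    s3_output_path="s3://my-bucket/clarify-reports/",
    label="approved",  # the outcome column
    headers=["gender", "income", "region", "approved"],
    dataset_type="text/csv",
)

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],  # which label value counts as "positive"
    facet_name="gender",            # the group column to check
)

# Compute pretraining bias metrics such as CI and DPL and write a report to S3
clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    methods=["CI", "DPL"],
)
```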

Jake leaned back, impressed. “So DPL is about fairness in outcomes. Even if the data looks balanced, if one group gets fewer positive results, the model is unfair. And SageMaker Clarify can measure that.”

“Right,” Ethan said. “That’s why we always check DPL before training. It’s like checking if both sides of the scale are balanced before making a deal.”

Jake leaned back, rubbing his temples. “Alright, nephew. Today you’ve taken me through a whole maze — class imbalance, label imbalance, resampling, balanced accuracy, F1 scores, and even that fancy SMOTE trick. My head’s spinning, but I think I’m starting to get it.”

Ethan smiled. “That’s the idea, Uncle Jake. These are the hidden traps in your data before training even begins. If we don’t check for them, the model will look accurate but act unfair or useless.”

Jake chuckled. “So basically, if I don’t balance my ledger, the machine will just keep saying rejected, rejected, rejected — like a grumpy loan officer.”

“Exactly,” Ethan laughed. “But don’t worry, we’ve got the tools — resampling, SMOTE, and AWS SageMaker Clarify — to keep things fair.”

Jake leaned forward. “And you said there’s more?”

Ethan smiled. “Next, we’ll talk about divergence metrics — KL, JS, and TVD. Those help us measure fairness gaps and compare distributions. Think of it as the next layer of the fairness audit.”

Jake grinned. “Alright then. I’ll bring the ledger, you bring the math pies. Let’s continue this fairness detective work in our next session.”

I think it’s too much to process, so let’s pause here today! In the next chapter of Jake & Ethan’s journey, we’ll explore:

  • KL, JS, and TVD Divergence Metrics: The math detectives that compare distributions and reveal hidden bias.

  • AWS SageMaker Clarify Reports: How these metrics show up in real bias audits.

Stay tuned!
