From iPhone Shop to AI Insights: AWS Machine Learning Basics Explained with Real Stories

Welcome to Part 5 of the AWS Machine Learning Associate series. In the last post, we looked at business problem formulation: does our business need an ML solution or not? Now let's imagine our problem does require an ML solution. The first step is collecting data, but we may not know how to do it or where to start. So in this post, let's look at the fundamentals of data, the building blocks of every machine learning project. If you’ve ever wondered how Apple predicts iPhone sales, how Amazon recommends products, or how a small shopkeeper can avoid wasting rice, the answer is the same: data. That is why it is important to understand the fundamentals first. As usual, I have written this so that even a non-technical reader can follow along.

This story will take you on a journey with Jake, a small‑town iPhone shop owner, to show how AWS services like Amazon S3, Elastic File System (EFS), and Elastic Block Store (EBS) fit together with concepts like JSON Lines, Parquet, regression, and word embeddings.

Alright, let's begin...

Chapter 1 – Short Introduction/Recap

Jake wasn’t a tech guy. He didn’t own a laptop. His world was measured in iPhones sold, customer smiles, and the weight of cash in his register at the end of the day.

But Jake had a problem. Some months, he ordered too many iPhones from his distributor. The extra boxes sat in the storeroom, gathering dust while newer models launched. Other months, he ordered too few, and customers left disappointed, walking straight into a competitor’s shop.

One weekend, Jake’s nephew Ethan came to visit. Ethan worked in the city as a junior data analyst. He loved technology, cloud computing, and explaining things in simple ways. As he watched his uncle struggle with the same problem month after month, he smiled and said, “Uncle Jake, you don’t realize it, but you’re already sitting on something powerful. You just need to look at it differently.”

Jake raised an eyebrow. “Powerful? All I’ve got is a shop and a ledger.”

Ethan grinned. “Exactly. That’s where it all begins.”

So, what is Data?

Simply put, data is a collection of facts, figures, observations, or symbols that can be stored and processed by a computer. It's the raw material for information and knowledge, and it can exist in many forms, from numbers and text to images and audio. A single piece of data is known as a datum.

Chapter 2 – The Ledger of Truth (Data Collection)

Every evening, Jake sat behind the counter of his iPhone shop with a cup of coffee and his old leather ledger. He would carefully write down the day’s sales:

“3 iPhone 15 Pro sold.”
“2 iPhone SE sold.”
“Rainy day, fewer walk‑ins.”
“Holiday weekend, more sales.”

To Jake, it was just habit. But Ethan leaned over the counter and tapped the page.

“Uncle Jake, think of your ledger as the soil. If you want to grow crops, you need good soil. If you want to grow machine learning models, you need good data. Without it, nothing else works.”

Jake chuckled. “So you’re saying this old book is… data?”

“Exactly,” Ethan said. “This is data collection. You’re gathering information so you can use it later. Apple does the same thing when they track iPhone sales and customer reviews. Amazon does it every time someone clicks ‘Buy Now.’ You’re doing it too — just on paper.”

What is Data Collection?

In plain words, data collection means gathering information so you can use it later. It doesn’t matter if the information is written in a notebook, typed into Excel, or automatically recorded by a computer system — it’s all data collection.

Chapter 3 – The Containers of Memory (Data Formats)

One evening, as Jake closed his shop, Ethan noticed the old ledger again. He smiled and said, “Uncle Jake, this book is great for you, but computers don’t read ledgers. They need something called data formats.”
Jake frowned. “Formats? Like what?”
Ethan pulled out his phone and showed him. “Think of it like storing iPhones. Sometimes you keep them in display boxes, sometimes in cartons, sometimes on shelves. Each container has its purpose. Data is the same — it needs the right container.”
He explained:
CSV (Comma-Separated Values) is like a simple Excel sheet. Each row is a record, each column is a detail. Easy for humans to read, but not always efficient for huge data.
JSON Lines (JSONL) is like a diary where each line is a separate story. Perfect for streaming data, like when Apple tracks iPhone sales every second during launch week.
Apache Parquet is like a cupboard with shelves, where similar items are grouped together. It stores data column by column instead of row by row, which makes it faster and cheaper to search, especially when dealing with big data like Amazon’s Prime Day transactions.
Jake nodded slowly. “So just like I choose the right box for an iPhone, computers need the right box for data.”
“Exactly,” Ethan said.
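To make the containers a little more concrete, here is a minimal sketch that writes the same two ledger entries as CSV and as JSON Lines using only Python's standard library. The dates and field names (date, model, units) are invented for illustration. Parquet is a binary format and usually needs a third-party library such as pyarrow, so it is left out of this sketch.

```python
import csv
import io
import json

# Hypothetical ledger entries; the dates and field names are invented for illustration.
sales = [
    {"date": "2024-03-01", "model": "iPhone 15 Pro", "units": 3},
    {"date": "2024-03-01", "model": "iPhone SE", "units": 2},
]

# CSV: one header row, then one row per record.
csv_buffer = io.StringIO()
writer = csv.DictWriter(
    csv_buffer, fieldnames=["date", "model", "units"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(sales)
csv_text = csv_buffer.getvalue()

# JSON Lines: one complete JSON object per line, with no enclosing list.
jsonl_text = "\n".join(json.dumps(row) for row in sales) + "\n"

print(csv_text)
print(jsonl_text)

# Reading JSON Lines back is a simple line-by-line loop, which suits streaming.
restored = [json.loads(line) for line in jsonl_text.splitlines()]
```

Notice that CSV states the column names once at the top, while every JSON Lines record carries its own field names. That makes each line self-describing, at the cost of some repetition.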

Chapter 4 – The Journey of the Box (Data Ingestion)

The next day, Jake asked, “Okay, Ethan, I get the containers. But how does the data even get into the computer in the first place?”
Ethan grinned. “That’s called data ingestion. Think of it like moving iPhones from the warehouse to your shop. You can bring them in trucks once a week — that’s batch ingestion. Or you can bring them in smaller shipments throughout the day — that’s real-time ingestion.”
Jake laughed. “So my ledger is the warehouse, and the computer is the shop?”
“Exactly,” Ethan said. “And AWS has tools for this. AWS Glue helps clean and move data. Amazon Kinesis streams data in real time, like Apple tracking iPhone sales per second during launch week. AWS Data Pipeline moves data on a schedule, like trucks delivering stock every Friday.”
Jake leaned back. “So ingestion is just about moving data from one place to another, like moving iPhones from storage to shelves.”
“Right,” Ethan said. “And the faster and cleaner you move it, the better your shop runs.”
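The difference between the weekly truck and the small shipments can be sketched in a few lines of Python. This is only an illustration of the idea, not how AWS Glue or Amazon Kinesis work internally. The event list and both function names are hypothetical.

```python
from typing import Iterable, Iterator

# Hypothetical sale events; in practice these would come from a point-of-sale system.
events = [
    {"model": "iPhone 15 Pro", "units": 1},
    {"model": "iPhone SE", "units": 2},
    {"model": "iPhone 15 Pro", "units": 2},
]

def ingest_batch(records: Iterable[dict]) -> int:
    """Batch: collect everything first, then process it in one go."""
    collected = list(records)
    return sum(r["units"] for r in collected)

def ingest_stream(records: Iterable[dict]) -> Iterator[int]:
    """Real time: update a running total as each record arrives."""
    total = 0
    for r in records:
        total += r["units"]
        yield total

batch_total = ingest_batch(events)
running_totals = list(ingest_stream(events))
print(batch_total)
print(running_totals)
```

Both approaches end at the same total. The streaming version simply makes an up-to-date answer available after every sale instead of only once the whole batch has arrived.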

Chapter 5 – The Cloud Warehouses (AWS Storage Services)

Jake was curious now. “So once the data is moved, where does the computer keep it?”

Ethan smiled. “In the cloud. Think of it as magical warehouses in the sky.”

He explained:

Amazon S3 (Simple Storage Service) is like a giant warehouse with endless shelves. Apple could store millions of iPhone photos here. It’s cheap, durable, and practically unlimited.

Amazon Elastic File System (EFS) is like a shared storeroom where many employees can walk in at the same time. Perfect for collaborative apps where multiple people need access.

Amazon Elastic Block Store (EBS) is like a personal locker, fast and private. It attaches directly to a computer (an EC2 instance) and is great for workloads that need quick access, like Jake’s real-time sales dashboard.

Jake imagined his iPhone boxes stacked neatly in these magical warehouses in the sky. For the first time, he realized the cloud wasn’t just about rain — it was about storage, speed, and scale.

(We will learn about these services in upcoming posts. For now, let's stick to the basics.)

Chapter 6 – The Language of Numbers and Words (Data Types)

One evening, Jake was flipping through his ledger when Ethan leaned over. “Uncle Jake, do you know what kind of data you’re collecting?”
Jake shrugged. “It’s just sales, isn’t it?”
Ethan smiled. “That’s where data types come in. Computers need to know what kind of information they’re dealing with. Just like you know the difference between cash in your register and customer comments, the computer needs to know whether something is a number or text.”
In machine learning, data types are the foundation. They tell the system how to treat the information. Numbers can be added, averaged, or used in calculations. Words, on the other hand, need special handling because computers don’t naturally understand language.

Chapter 7 – What is Text Data?

Ethan pointed to Jake’s notes: “Rainy day, fewer walk‑ins.”
“This,” Ethan explained, “is text data. It’s made of words, sentences, or even whole documents. Think of customer reviews on the Apple Store about the latest iPhone, or comments on Amazon about a phone case. Those reviews are text data. They carry meaning, but computers can’t understand them directly.”
He continued, “Text data is powerful because it captures human opinions and context. Apple uses it to analyze customer feedback on iPhones. Amazon uses it to improve product listings and recommendations. But before a computer can use it, we need to convert those words into numbers.”

Chapter 8 – What is Numerical Data?

Then Ethan pointed at another entry: “3 iPhone 15 Pro sold.”
“This,” he said, “is numerical data. It’s easy for computers to handle. Your age, the price of an iPhone, the number of customers in a day — all of these are numerical. Machine learning loves numbers because they can be measured, compared, and predicted.”
He gave an example: “When Apple reports that 10 million iPhones were sold in the first week, that’s numerical data. When Amazon tracks that a product’s price dropped from $999 to $899, that’s numerical data too. These numbers help companies forecast demand, set prices, and manage inventory.”
Jake nodded. Numbers were straightforward. Words, though — that was trickier.

Chapter 9 – Teaching the Computer to Read Words (Word‑to‑Word Mapping)

Jake frowned. “But how can a computer ever understand words like ‘rainy day’ or ‘holiday weekend’?”
Ethan explained: “We start with word‑to‑word mapping. It’s like a dictionary. For example, the word ‘iPhone’ maps to ‘Apple,’ and ‘Galaxy’ maps to ‘Samsung.’ This helps the computer see connections between words.”
He added, “This is part of Natural Language Processing (NLP) — the field of machine learning that teaches computers to understand human language. Word‑to‑word mapping is the simplest way to give computers a sense of meaning.”

Chapter 10 – The Simple Count Method (Bag of Words)

“But a dictionary isn’t enough,” Ethan continued. “Sometimes we just count.”
He wrote on a scrap of paper: ‘I like iPhones, I like Android phones.’
“I” appears 2 times
“like” appears 2 times
“iPhones” appears 1 time
“Android” appears 1 time
“phones” appears 1 time
“This is called the simple count method, or the Bag of Words model. We turn text into numbers by counting how often each word appears. It’s simple, but it works. That’s how companies like Amazon can quickly see which products are mentioned most in reviews.”
Jake laughed. “So the computer is just keeping score, like a basketball game.”
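Ethan's scrap-paper count can be reproduced with Python's built-in Counter. This is a minimal sketch: it keeps words exactly as written, so "iPhones" and "phones" count as different words, whereas a real pipeline would usually lowercase and normalize the text first.

```python
import re
from collections import Counter

sentence = "I like iPhones, I like Android phones."

# Split into words, dropping punctuation. Case is kept as written,
# so "iPhones" and "phones" are treated as different words.
tokens = re.findall(r"[A-Za-z]+", sentence)
counts = Counter(tokens)

# A fixed vocabulary order turns the counts into a vector of numbers.
vocabulary = sorted(counts)
vector = [counts[word] for word in vocabulary]

print(counts)
print(vocabulary)
print(vector)
```

The vector is what a machine learning model would actually receive: the sentence has become a row of numbers, one per word in the vocabulary.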

Chapter 11 – Word Embeddings (Giving Words Meaning)

Ethan shook his head. “Counting is useful, but it doesn’t capture meaning. That’s where word embeddings come in.”
He drew a little map. “Imagine placing words on a chart. Words with similar meanings are close together. ‘Apple’ and ‘iPhone’ are near each other because they’re related. ‘Samsung’ and ‘Galaxy’ are near each other too. This way, the computer doesn’t just see words — it sees relationships.”
Word embeddings are one of the most powerful tools in machine learning and NLP. They allow algorithms to capture context and meaning, not just counts. This is the kind of technique that can help a store like Amazon show you “Lightning cable” when you search for “iPhone charger.”
Jake’s eyes lit up. “So the computer makes friends between words, just like people do.”
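Ethan's little map can be sketched with made-up coordinates. The two-dimensional vectors below are invented purely for illustration. Real embeddings are learned from large amounts of text and have many more dimensions. Closeness is measured here with cosine similarity, one common choice.

```python
import math

# Made-up 2-D vectors purely for illustration; real embeddings are learned
# from text and have many more dimensions.
embeddings = {
    "apple": [0.9, 0.1],
    "iphone": [0.8, 0.2],
    "samsung": [0.1, 0.9],
    "galaxy": [0.2, 0.8],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word):
    """Return the other word whose vector points in the most similar direction."""
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))

print(nearest("apple"))
print(nearest("galaxy"))
```

With these toy vectors, "apple" lands next to "iphone" and "galaxy" next to "samsung", which is the relationship Ethan drew on his chart.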

Quick Refresher: What Are Data Types?

Before we dive deeper, let’s make sure “data types” are crystal clear.
Data types are simply the categories of data that tell a computer how to treat information.
Numerical data → Numbers you can measure or calculate with. Example: “Jake sold 25 iPhones today.”
Categorical data → Labels or groups. Example: “Customer bought iPhone 15 Pro vs iPhone SE.”
Text data → Words, sentences, reviews. Example: “Battery life is amazing.”
Image/Video data → Photos, videos, or visuals. Example: A selfie taken with an iPhone camera.
Audio data → Sounds, voice notes, or recordings. Example: A customer’s voice message asking about stock.
Think of it like the sections in Jake’s shop: phones (numerical count), models (categories), customer reviews (text), demo photos (images), and customer calls (audio). Each is a different type of data, and each needs to be handled differently in machine learning.

Chapter 12 – The First Prediction (Regression)

Jake asked Ethan, “So if I have all this data, how do I actually use it to predict sales?”
Ethan explained: “That’s where regression comes in. Regression is a type of machine learning algorithm that predicts numbers based on past data. For example, if you sold 20 iPhones last week and 25 this week, regression can help predict how many you’ll sell next week.”
He added, “Apple uses regression to forecast iPhone sales before a launch. Amazon uses it to predict how many units of a product will sell during Prime Day. You can use it to decide how many iPhones to order next month.”
Jake’s eyes widened. “So no more guessing?”
“Exactly,” Ethan said. “It’s like weather forecasting, but for your shop.”
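Here is a minimal sketch of the simplest kind of regression, a straight line fitted by least squares. The first two weekly figures (20 and 25) come from Ethan's example. Weeks 3 and 4 are invented so the line has more than two points to work with. The function name fit_line is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

weeks = [1, 2, 3, 4]
units_sold = [20, 25, 27, 32]  # weeks 3 and 4 are invented for illustration

slope, intercept = fit_line(weeks, units_sold)
prediction = slope * 5 + intercept  # estimate for week 5
print(round(slope, 2), round(intercept, 2), round(prediction, 1))
```

The prediction is an estimate based on the past trend, not a guarantee. A real forecast would also use the other things Jake writes down, such as rainy days and holiday weekends.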

Chapter 13 – Sorting the Shelves (Classification Algorithms)

Jake then asked, “But what about different types of customers? Some want the Pro, some want the SE. How do I sort them?”
Ethan replied, “That’s where classification algorithms come in. Instead of predicting a number, classification sorts things into categories. Think of it like labeling boxes in your shop: ‘High‑end buyers,’ ‘Budget buyers,’ ‘Window shoppers.’”
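As a rough sketch of the idea, the toy classifier below learns the average spend for each label from a handful of hypothetical customers and then assigns a new customer to the closest average. This is a nearest-centroid classifier, chosen only because it fits in a few lines. The amounts and the function names are assumptions.

```python
from collections import defaultdict

# Hypothetical labeled examples: (amount spent in dollars, label).
training = [
    (1199, "High-end buyer"),
    (1099, "High-end buyer"),
    (429, "Budget buyer"),
    (499, "Budget buyer"),
]

def fit_centroids(examples):
    """Average the spend for each label (a nearest-centroid classifier)."""
    grouped = defaultdict(list)
    for spend, label in examples:
        grouped[label].append(spend)
    return {label: sum(values) / len(values) for label, values in grouped.items()}

def classify(spend, centroids):
    """Pick the label whose average spend is closest to this customer's spend."""
    return min(centroids, key=lambda label: abs(spend - centroids[label]))

centroids = fit_centroids(training)
print(classify(999, centroids))
print(classify(449, centroids))
```

Unlike regression, the output here is a label rather than a number, which is exactly the difference Ethan describes.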

Chapter 14 – Cleaning the Data (ETL: Extract, Transform, Load)

Ethan noticed Jake’s ledger was messy. Sometimes he wrote “iPhone 14 Pro,” sometimes just “14 Pro,” and sometimes “Pro Max.”
“Computers don’t like that,” Ethan said. “That’s why we use ETL — Extract, Transform, Load.”
Extract → Pull the data from its source (Jake’s ledger, Apple’s sales systems, Amazon’s order logs).
Transform → Clean and standardize it. “iPhone 14 Pro Max” should always be written the same way.
Load → Put the cleaned data into a database or storage system like Amazon S3.
Without ETL, machine learning models get confused. With ETL, the data becomes reliable fuel for predictions.
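The three steps can be sketched in plain Python. The raw entries and the alias table below are hypothetical, and the "load" step only writes CSV text in memory. A real pipeline might upload that file to Amazon S3 or run the whole job in AWS Glue.

```python
import csv
import io

# Extract: raw ledger lines as Jake might have written them (hypothetical).
raw_entries = ["iPhone 14 Pro, 2", "14 Pro, 1", "iphone 14 pro , 3", "iPhone SE, 4"]

# Transform: map each spelling variant to one canonical name.
# This alias table is an assumption; a real one would be built from the actual data.
ALIASES = {
    "14 pro": "iPhone 14 Pro",
    "iphone 14 pro": "iPhone 14 Pro",
    "iphone se": "iPhone SE",
}

def transform(entry):
    """Split 'name, units', normalize the name, and convert units to an int."""
    name, units = entry.rsplit(",", 1)
    key = " ".join(name.lower().split())  # lowercase and collapse extra spaces
    return {"model": ALIASES[key], "units": int(units)}  # unknown names raise KeyError

cleaned = [transform(entry) for entry in raw_entries]

# Load: write the cleaned rows as CSV text (kept in memory for this sketch).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["model", "units"], lineterminator="\n")
writer.writeheader()
writer.writerows(cleaned)
loaded_csv = buffer.getvalue()
print(loaded_csv)
```

After the transform step, the three different spellings of the same phone all read "iPhone 14 Pro", so their sales can finally be added up correctly.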

Chapter 15 – Structured vs. Unstructured Data

Ethan showed Jake two piles of information. One was neat: rows of numbers, dates, and product codes. The other was messy: customer comments, photos of receipts, even voice messages.
“This,” Ethan said, pointing to the neat pile, “is structured data. It’s organized, like a spreadsheet. Easy for computers to process. Your daily sales numbers, iPhone prices, and inventory counts all fit here.”
He pointed to the messy pile. “This is unstructured data. It’s text, images, audio, video. Customer reviews on the Apple Store, unboxing videos on YouTube, tweets about iPhones — all of this is unstructured. It’s harder for computers, but it’s also incredibly valuable.”
AWS services like Amazon Comprehend (for text) and Amazon Rekognition (for images and video) help turn unstructured data into insights.
He also said you can use a service like Amazon Mechanical Turk to have humans label the data manually, since not all data comes already labeled.

Chapter 16 – Case Study: Apple’s iPhone Launch

Ethan gave Jake a real example. “When Apple launches a new iPhone, they don’t just guess how many to produce. They use:
Structured data: Pre‑order numbers, past sales figures, store inventory.
Unstructured data: Tweets, YouTube reviews, customer feedback.
By combining both, Apple can forecast demand more accurately.”

Chapter 17 – Case Study: Amazon Prime Day

“Amazon does the same thing during Prime Day,” Ethan continued. “They ingest billions of transactions in real time using Amazon Kinesis, clean and transform the data with AWS Glue, and store it in Amazon S3. Machine learning models then classify shoppers, recommend products, and even adjust prices dynamically.”
Jake was amazed. “So that’s why my Amazon homepage looks different from yours?”
“Exactly,” Ethan said. “It’s personalized.”

Chapter 18 – Case Study: Jake’s Shop

Finally, Ethan turned back to Jake. “You may not be Apple or Amazon, but the principles are the same. If you collect your sales data properly, clean it, and store it in the right format, you can use machine learning to predict demand, classify customers, and even personalize offers. Imagine sending a text to your regular buyers: ‘Hey, the new iPhone just arrived — want me to reserve one for you?’ That’s machine learning in action, scaled down to your shop.”
Jake leaned back, smiling. For the first time, he saw his little iPhone shop as part of a much bigger story — one that stretched from his small town all the way to Cupertino and Seattle.
We are almost at the end of this post, so let's close with data types.
Understanding Data Types
Before Jake could fully appreciate the power of machine learning, Ethan wanted him to understand one last foundation: data types.
Ethan leaned back and said, “Uncle Jake, before we wrap this up, you need to understand one last thing: the kinds of data you’re working with. Some of your notes are numbers, like ‘25 iPhones sold today’ or ‘average selling price was $999.’ That’s called quantitative data because it can be measured and expressed in numbers. Sometimes it’s whole numbers, like the count of iPhones sold, which we call discrete, and sometimes it’s values with decimals, like battery life in hours or how long a customer waited, which we call continuous.
“On the other hand, some of your notes aren’t numbers at all. They’re labels or descriptions, like ‘customer bought iPhone 15 Pro instead of SE’ or ‘satisfaction was Excellent instead of Poor.’ That’s qualitative data, also called categorical, because it describes qualities or groups. Some categories don’t have an order, like iPhone model names; those are nominal. Others do have an order, like satisfaction ratings from Poor to Excellent; those are ordinal.
“Both types are important: the numbers help you predict how many iPhones to stock, while the categories help you understand what kinds of customers you have and how they feel. Together, they give you the full picture of your shop.”
Jake smiled, finally seeing it clearly. His ledger wasn’t just a messy notebook anymore — it was a mix of numbers and stories, quantities and categories, all waiting to be turned into insights. For the first time, he realized he wasn’t just running a shop; he was sitting on a treasure chest of data, the same kind of treasure Apple and Amazon use to power their decisions.
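These distinctions matter in practice because each kind of data is turned into numbers differently. The sketch below shows one common approach: ordered categories become integers that preserve the order, while unordered categories get one 0/1 column each (one-hot encoding). The middle satisfaction levels ("Fair" and "Good") and the record itself are assumptions added for illustration.

```python
# Ordinal: categories have an order, so integers that preserve it make sense.
# The middle levels are assumed; the story only names "Poor" and "Excellent".
SATISFACTION_ORDER = ["Poor", "Fair", "Good", "Excellent"]

def encode_ordinal(value):
    """Map a satisfaction rating to its position in the ordered scale."""
    return SATISFACTION_ORDER.index(value)

# Nominal: no order, so each category gets its own 0/1 column (one-hot encoding).
MODELS = ["iPhone 15 Pro", "iPhone SE"]

def encode_one_hot(value):
    """Return a list with 1 in the slot for this model and 0 elsewhere."""
    return [1 if value == model else 0 for model in MODELS]

# A hypothetical ledger record mixing all four kinds of data.
record = {
    "units_sold": 25,        # quantitative, discrete
    "avg_price": 999.0,      # quantitative, continuous
    "model": "iPhone SE",    # qualitative, nominal
    "satisfaction": "Excellent",  # qualitative, ordinal
}

features = [
    record["units_sold"],
    record["avg_price"],
    *encode_one_hot(record["model"]),
    encode_ordinal(record["satisfaction"]),
]
print(features)
```

Giving model names plain integers instead would suggest to the computer that one model is "greater than" another, which is why nominal data is usually one-hot encoded.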
Well, that's the end of this story, and of this post too 😅. I hope the fundamentals of data are clear now. Keep learning! See you in the next post.
Read: AWS Certified ML Exam preparation series: Data Cleaning, Imputation, Outlier Detection, and Feature Engineering with AWS Services (Part 6)

Logeshwaran.org

Data Fundamentals, ingestion, transformations: Beginner's guide for AWS Machine Learning Exam Series Part 5