SQL Mastery for AI & Data: From Basics to Feature Engineering
The Complete SQL Roadmap for AI/ML Engineers
Python + AI is a strong combo, and SQL is a must-have because most AI/ML workflows need structured data from databases.
For AI/ML/data-related jobs, you don’t need to be a full-time DBA, but you must be comfortable with querying, cleaning, and transforming data efficiently.
Here’s a SQL roadmap for AI/Data/ML jobs 👇
🔹 1. SQL Basics (Foundations)
What is SQL? (DDL, DML, DQL, DCL, TCL)
Database concepts: tables, rows, columns, primary keys, foreign keys
SELECT, FROM, WHERE basics
Filtering data with conditions (=, <, >, <>, BETWEEN, IN, LIKE)
Sorting results with ORDER BY
Limiting results (LIMIT / TOP)
Aliases (AS)
🔹 2. Aggregations & Grouping
COUNT(), SUM(), AVG(), MIN(), MAX()
GROUP BY (single and multiple columns)
HAVING vs WHERE
Distinct values (DISTINCT)
🔹 3. Joins (Very Important for AI Jobs)
Inner Join
Left Join
Right Join
Full Outer Join
Self Join
Cross Join
Multi-table joins
👉 Most ML feature engineering needs joining multiple datasets.
🔹 4. Subqueries & Derived Tables
Simple subqueries (in WHERE, SELECT, FROM)
Correlated subqueries
EXISTS vs IN
Common Table Expressions (CTEs) with WITH
🔹 5. Advanced Filtering & Window Functions
CASE WHEN for conditional columns
COALESCE, NULLIF, handling NULL values
Ranking functions (ROW_NUMBER, RANK, DENSE_RANK)
Aggregates with windowing (SUM() OVER(), AVG() OVER())
Moving averages, cumulative sums
👉 Very useful for time series AI problems (stock, IoT, logs).
🔹 6. Data Manipulation
INSERT, UPDATE, DELETE
MERGE (UPSERT operations)
Bulk inserts
🔹 7. Set Operations
UNION vs UNION ALL
INTERSECT
EXCEPT / MINUS
🔹 8. Data Cleaning & Transformation (Critical for AI/ML)
String functions (TRIM, SUBSTRING, CONCAT, UPPER/LOWER)
Date & time functions (DATEADD, DATEDIFF, EXTRACT)
Converting data types (CAST, CONVERT)
Pivoting & unpivoting data (PIVOT, UNPIVOT)
🔹 9. Performance & Optimization (Intermediate Level)
Indexes basics (clustered, non-clustered)
Query execution plan (basic understanding)
EXPLAIN keyword to analyze queries
Using appropriate joins vs subqueries for performance
🔹 10. AI/Data-Job Specific SQL Topics
Writing queries for feature engineering
SQL for time series (lag, lead, window)
SQL for ETL workflows
SQL for data quality checks (missing, duplicates, anomalies)
SQL integration with Python (pandas + SQLAlchemy)
Using SQL in data warehouses (BigQuery, Snowflake, Redshift, Synapse)
✅ Summary:
For AI/data jobs, you need to be very strong in SELECT queries, joins, aggregations, subqueries, and window functions.
DBA-level topics (backups, replication, clustering) are not required unless you aim for data engineering.
✅ Run environment: examples below are written for SQLite (sqlite3). SQLite is zero-config and runs on Windows/Mac/Linux. If you use MySQL/Postgres, the logic is the same, but a few small syntax differences apply.
Chapter 1 — SQL Basics (Foundations)
What this chapter covers (plain language)
SQL is the language used to talk to relational databases. Think of a database as a set of spreadsheets (tables). SQL lets you:
Create and change tables (like creating a new sheet).
Put rows into tables (like typing rows into spreadsheets).
Ask questions (queries) — that’s the most common: “Give me all customers who bought X”.
Make simple filters, sort the results, and rename columns in the results.
You don’t need to be a database admin to be useful for AI/data work — you just need to be comfortable with asking the right questions and extracting the right data.
Very short vocabulary (layman)
Table: like a spreadsheet page (e.g., employees).
Row: one record (one person, one order).
Column: property of the row (name, price).
Primary Key (PK): unique id for each row (like a student ID).
Foreign Key (FK): a link from one table to a row in another table (like employee’s department id).
Query: a question you ask the database (SELECT ...).
DDL: commands that define structure (CREATE, DROP).
DML: commands that change data (INSERT, UPDATE, DELETE).
DQL: query language — SELECT statements (data retrieval).
How I’ll show examples
Each example block is self-contained: it drops the table if it exists, creates the table, inserts rows, and runs the SELECT you need to see the result. Copy the whole block, paste it into sqlite3, and it will run.
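If you prefer driving these blocks from Python rather than the sqlite3 shell, the standard-library sqlite3 module runs the same SQL unchanged. A minimal sketch (the three-column employees table here is a cut-down illustration, not the full example table):

```python
import sqlite3

# In-memory database: nothing touches disk, ideal for practice runs.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# executescript() runs a whole multi-statement block, just like pasting
# it into the sqlite3 shell.
cur.executescript("""
DROP TABLE IF EXISTS employees;
CREATE TABLE employees (id INTEGER PRIMARY KEY, first_name TEXT, salary INTEGER);
INSERT INTO employees (id, first_name, salary) VALUES
  (1, 'Asha', 50000),
  (2, 'John', 90000);
""")

# execute() runs a single statement; fetchall() returns rows as tuples.
rows = cur.execute(
    "SELECT first_name, salary FROM employees ORDER BY salary DESC"
).fetchall()
print(rows)  # [('John', 90000), ('Asha', 50000)]
conn.close()
```

Swap ":memory:" for a filename like practice.db if you want the tables to persist between sessions.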
Example 1 — Very first query: SELECT and SELECT *
What we want: get all rows and columns from a table.
-- Example 1: list all employees
DROP TABLE IF EXISTS employees;
CREATE TABLE employees (
id INTEGER PRIMARY KEY,
first_name TEXT,
last_name TEXT,
role TEXT,
salary INTEGER
);
INSERT INTO employees (id, first_name, last_name, role, salary) VALUES
(1, 'Asha', 'Kumar', 'Data Analyst', 50000),
(2, 'John', 'Doe', 'ML Engineer', 90000),
(3, 'Maya', 'Singh', 'Data Scientist', 85000),
(4, 'Liam', 'Ng', 'DevOps', 75000),
(5, 'Sara', 'Khan', 'Intern', 20000);
-- The simplest question: give me everything
SELECT * FROM employees;
Layman explanation:
DROP TABLE IF EXISTS — remove old table so the code can be run many times without error.
CREATE TABLE — defines columns.
INSERT INTO — adds rows (we manually provide IDs so this works everywhere).
SELECT * FROM employees; — “give me every column of every row” (like opening the whole spreadsheet).
Example 2 — Select specific columns and WHERE filter
What we want: pick certain columns and filter rows.
-- Example 2: basic column selection and filtering
DROP TABLE IF EXISTS products;
CREATE TABLE products (
product_id INTEGER PRIMARY KEY,
name TEXT,
category TEXT,
price INTEGER,
stock INTEGER
);
INSERT INTO products (product_id, name, category, price, stock) VALUES
(101, 'Latte', 'Beverage', 4, 100),
(102, 'Espresso', 'Beverage', 3, 50),
(103, 'Muffin', 'Food', 2, 0),
(104, 'Bagel', 'Food', 2, 30),
(105, 'GreenTea', 'Beverage', 3, 20);
-- Show only name and price for products that cost 3 or more
SELECT name, price
FROM products
WHERE price >= 3;
Layman explanation:
SELECT name, price — only fetch these columns (faster and cleaner).
WHERE price >= 3 — only rows where the price is at least 3.
WHERE is used to filter, like Excel filters.
More filters (run separately):
-- Products out of stock
SELECT * FROM products WHERE stock = 0;
-- Products that are either category 'Food' or price > 3
SELECT * FROM products WHERE category = 'Food' OR price > 3;
-- Products with names containing 'e' (simple text match)
SELECT * FROM products WHERE name LIKE '%e%';
-- Products with IDs in this small set
SELECT * FROM products WHERE product_id IN (101, 104);
Example 3 — ORDER BY and LIMIT (sorting and top-n)
What we want: sort results and take only top rows (e.g., top 3 salaries).
-- Example 3: orders and top-n
DROP TABLE IF EXISTS orders;
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer TEXT,
amount INTEGER,
order_date TEXT -- We'll store date as text for this beginner example
);
INSERT INTO orders (order_id, customer, amount, order_date) VALUES
(1, 'Asha', 120, '2025-09-01'),
(2, 'John', 500, '2025-09-02'),
(3, 'Maya', 300, '2025-09-03'),
(4, 'Liam', 50, '2025-08-28'),
(5, 'Sara', 700, '2025-09-05');
-- Show orders sorted by amount, highest first
SELECT * FROM orders ORDER BY amount DESC;
-- Show only the top 3 biggest orders
SELECT * FROM orders ORDER BY amount DESC LIMIT 3;
Layman explanation:
ORDER BY amount DESC — sorts rows by amount from big → small. ASC would be small → big.
LIMIT 3 — return only the first 3 rows of that sorted list (useful for top-k queries).
Example 4 — Column aliases & simple expressions
What we want: rename output columns and compute new columns.
-- Example 4: aliases and computed columns
DROP TABLE IF EXISTS customers;
CREATE TABLE customers (
cid INTEGER PRIMARY KEY,
first TEXT,
last TEXT,
country TEXT,
purchases INTEGER
);
INSERT INTO customers (cid, first, last, country, purchases) VALUES
(1, 'Ravi', 'Patel', 'India', 5),
(2, 'Emma', 'Brown', 'UK', 2),
(3, 'Chen', 'Lee', 'China', 8),
(4, 'Olivia', 'Garcia', 'Spain', 1);
-- Make a full_name column (concatenate) and show an adjusted metric
SELECT
cid,
first || ' ' || last AS full_name, -- in SQLite use || to join strings
purchases,
purchases * 10 AS purchase_score -- a computed column
FROM customers;
Layman explanation:
AS full_name gives a friendlier name to the calculated column.
first || ' ' || last combines first and last name (SQLite style).
purchases * 10 AS purchase_score shows how we can compute values on the fly — great for feature engineering later.
Note: In MySQL/Postgres you might use CONCAT(first, ' ', last) instead of || — but the above is SQLite-safe.
Example 5 — NULL, IS NULL, and COALESCE
What we want: learn how missing values behave and how to replace them.
-- Example 5: handling NULLs
DROP TABLE IF EXISTS inventory;
CREATE TABLE inventory (
item_id INTEGER PRIMARY KEY,
name TEXT,
location TEXT,
quantity INTEGER
);
INSERT INTO inventory (item_id, name, location, quantity) VALUES
(1, 'Widget A', 'Aisle 1', 10),
(2, 'Widget B', NULL, 5),
(3, 'Widget C', 'Aisle 3', NULL),
(4, 'Widget D', NULL, NULL);
-- Find rows where location is unknown
SELECT * FROM inventory WHERE location IS NULL;
-- Find rows where quantity is missing
SELECT * FROM inventory WHERE quantity IS NULL;
-- Use COALESCE to replace NULL with a default for display
SELECT item_id, name,
COALESCE(location, 'UNKNOWN') AS location_text,
COALESCE(quantity, 0) AS qty
FROM inventory;
Layman explanation:
NULL means “unknown” or “no value”.
IS NULL is how you check for unknowns. = NULL will not work.
COALESCE picks the first non-null value — handy to show readable defaults.
Example 6 — Primary key, foreign key, and a simple relationship
What we want: show how two tables relate (department → employees).
-- Example 6: departments and employees (basic FK)
PRAGMA foreign_keys = ON; -- make SQLite enforce foreign keys in this session
DROP TABLE IF EXISTS employees_dept;
DROP TABLE IF EXISTS departments;
CREATE TABLE departments (
id INTEGER PRIMARY KEY,
dept_name TEXT
);
CREATE TABLE employees_dept (
emp_id INTEGER PRIMARY KEY,
name TEXT,
dept_id INTEGER,
salary INTEGER,
FOREIGN KEY (dept_id) REFERENCES departments(id)
);
INSERT INTO departments (id, dept_name) VALUES
(10, 'Engineering'),
(20, 'Data'),
(30, 'HR');
INSERT INTO employees_dept (emp_id, name, dept_id, salary) VALUES
(100, 'Asha', 20, 60000),
(101, 'John', 10, 90000),
(102, 'Maya', 20, 85000),
(103, 'Liam', 10, 70000),
(104, 'Sara', 30, 40000);
-- Show employees and their dept_id (raw)
SELECT * FROM employees_dept;
-- Very simple join to show department name (we'll cover joins deeply later)
SELECT e.emp_id, e.name, e.salary, d.dept_name
FROM employees_dept e
JOIN departments d ON e.dept_id = d.id;
Layman explanation:
departments has a unique id per row — that’s the primary key.
employees_dept.dept_id stores the department id — a foreign key linking to departments.
The JOIN pulls department names into the employee view. This is how you stitch related tables together.
Quick checklist: the key basic commands you just learned
CREATE TABLE — make a new table.
DROP TABLE IF EXISTS — remove an old table.
INSERT INTO ... VALUES (...) — add rows.
SELECT column1, column2 FROM table — pick columns you want.
WHERE — filter rows by conditions.
ORDER BY ... ASC|DESC — sort results.
LIMIT n — take only the first n rows.
AS — rename a column in the results.
IS NULL / COALESCE — handle missing values.
Primary Key (unique id) and Foreign Key (link between tables).
10 Exercises (Chapter 1) — practice (use the tables from the examples)
Do these on your own. Each exercise assumes you already ran the example blocks above (so the tables exist). If they don’t, re-run the relevant example block before starting.
List names and roles: From employees, show first_name, last_name, and role only.
High earners: From employees, list rows where salary is greater than 70,000. Sort them by salary descending.
Products with vowel: From products, show items whose name contains the letter a (case sensitive is fine).
Out of stock: From products, return product name and category for all items with stock = 0.
Top customers: From orders, list the top 2 customers by amount.
Create a view (or simulated): Write a query that shows each customer and their amount as order_amount (use AS to rename the column).
Replace NULLs: From inventory, produce a list where location is replaced with 'UNKNOWN' and quantity replaced with 0. (Use COALESCE.)
Department lookup: Using employees_dept and departments, show name and dept_name for employees in the Data department only.
Add a new product: Insert a new product into products with product_id 106 and then query to show it exists.
Count rows: Use a SELECT to count how many rows are in employees (use COUNT(*)). This introduces a tiny aggregation — we’ll do full aggregations next chapter.
Short tips & pitfalls (layman)
Always DROP TABLE IF EXISTS at the top of example scripts when learning — avoids "table already exists" errors.
NULL is not the same as an empty string ''. Use IS NULL to check nulls.
WHERE filters rows; HAVING (we’ll meet later) filters groups.
Practice by copying the code into sqlite3 and stepping through it line by line. Change values, re-run, see what changes.
Chapter 2 — Aggregations & Grouping
What this chapter covers (plain language)
When you have lots of rows, you usually don’t want to see each row — you want summary numbers:
How many customers bought something?
What’s the average salary?
What’s the total revenue last week?
That’s what aggregate functions do. Then, with GROUP BY, you can split data into categories (e.g., sales per country).
The 5 most common aggregate functions
COUNT() → how many rows (or values).
SUM() → total of numbers.
AVG() → average of numbers.
MIN() → smallest value.
MAX() → biggest value.
Example 1 — Counting rows
-- Example 1: counting rows
DROP TABLE IF EXISTS sales;
CREATE TABLE sales (
sale_id INTEGER PRIMARY KEY,
product TEXT,
quantity INTEGER,
price INTEGER
);
INSERT INTO sales (sale_id, product, quantity, price) VALUES
(1, 'Latte', 2, 4),
(2, 'Latte', 1, 4),
(3, 'Espresso', 3, 3),
(4, 'Muffin', 5, 2),
(5, 'Espresso', 2, 3),
(6, 'Latte', 4, 4);
-- Count all rows
SELECT COUNT(*) AS total_sales FROM sales;
-- Count only rows where product = 'Latte'
SELECT COUNT(*) AS latte_sales FROM sales WHERE product = 'Latte';
Explanation:
COUNT(*) = number of rows in the table.
You can add WHERE to count only a subset.
Example 2 — SUM and AVG
-- Example 2: SUM and AVG
-- Total quantity sold
SELECT SUM(quantity) AS total_quantity FROM sales;
-- Average price of items
SELECT AVG(price) AS avg_price FROM sales;
-- Total revenue = quantity * price for all rows
SELECT SUM(quantity * price) AS total_revenue FROM sales;
Explanation:
SUM(quantity) adds up all quantities.
AVG(price) gives the mean of all prices.
You can calculate expressions inside aggregate functions (here: quantity * price).
Example 3 — MIN and MAX
-- Example 3: MIN and MAX
-- Cheapest item price
SELECT MIN(price) AS min_price FROM sales;
-- Most expensive item price
SELECT MAX(price) AS max_price FROM sales;
-- Largest quantity in a single sale
SELECT MAX(quantity) AS largest_order FROM sales;
Explanation:
MIN() finds the smallest number.
MAX() finds the largest number.
Example 4 — GROUP BY (splitting into categories)
-- Example 4: GROUP BY
-- Total quantity sold per product
SELECT product, SUM(quantity) AS total_qty
FROM sales
GROUP BY product;
-- Average price per product
SELECT product, AVG(price) AS avg_price
FROM sales
GROUP BY product;
-- Count how many times each product was sold
SELECT product, COUNT(*) AS order_count
FROM sales
GROUP BY product;
Explanation:
GROUP BY product means: split the table by product, then run the aggregate separately for each group.
Without GROUP BY, aggregates give one row. With GROUP BY, you get one row per group.
Example 5 — GROUP BY with multiple columns
-- Example 5: grouping by multiple columns
DROP TABLE IF EXISTS employee_salaries;
CREATE TABLE employee_salaries (
emp_id INTEGER PRIMARY KEY,
department TEXT,
role TEXT,
salary INTEGER
);
INSERT INTO employee_salaries (emp_id, department, role, salary) VALUES
(1, 'Data', 'Analyst', 50000),
(2, 'Data', 'Scientist', 85000),
(3, 'Engineering', 'DevOps', 75000),
(4, 'Engineering', 'Developer', 70000),
(5, 'Engineering', 'Developer', 72000),
(6, 'HR', 'Recruiter', 40000);
-- Average salary per department
SELECT department, AVG(salary) AS avg_salary
FROM employee_salaries
GROUP BY department;
-- Average salary per department + role
SELECT department, role, AVG(salary) AS avg_salary
FROM employee_salaries
GROUP BY department, role;
Explanation:
You can group by more than one column.
First query groups by department.
Second query groups by both department and role.
Example 6 — HAVING (filter groups)
-- Example 6: HAVING
-- Only show departments where average salary > 60000
SELECT department, AVG(salary) AS avg_salary
FROM employee_salaries
GROUP BY department
HAVING AVG(salary) > 60000;
-- Only show products sold more than 2 times
SELECT product, COUNT(*) AS order_count
FROM sales
GROUP BY product
HAVING COUNT(*) > 2;
Explanation:
WHERE filters rows before grouping.
HAVING filters after grouping.
Rule of thumb: WHERE is for rows, HAVING is for groups.
Example 7 — DISTINCT vs GROUP BY
-- Example 7: DISTINCT vs GROUP BY
DROP TABLE IF EXISTS students;
CREATE TABLE students (
id INTEGER PRIMARY KEY,
name TEXT,
class TEXT
);
INSERT INTO students (id, name, class) VALUES
(1, 'Asha', 'Math'),
(2, 'John', 'Science'),
(3, 'Maya', 'Math'),
(4, 'Liam', 'English'),
(5, 'Sara', 'Science');
-- Unique classes with DISTINCT
SELECT DISTINCT class FROM students;
-- Same result using GROUP BY
SELECT class FROM students GROUP BY class;
Explanation:
DISTINCT removes duplicates.
GROUP BY can also achieve that — but it’s more powerful because you can add aggregates.
10 Exercises (Chapter 2) — practice
Use the sales, employee_salaries, and students tables created above.
Find the total revenue per product from the sales table.
Find the average quantity per product.
Show the product with the maximum average price.
Find the total number of distinct products in the sales table (use COUNT(DISTINCT ...)).
From employee_salaries, show the highest salary in each department.
From employee_salaries, show department and role combinations where average salary is below 60,000.
From sales, show only products where the total quantity sold is more than 5.
From students, count how many students are in each class.
From sales, calculate the average revenue (quantity * price) per product.
From employee_salaries, calculate the total salary cost per department and sort results from highest to lowest.
✅ Summary of this chapter
COUNT, SUM, AVG, MIN, MAX summarize data.
GROUP BY splits data into categories.
HAVING filters groups.
DISTINCT removes duplicates (like Excel’s "Remove Duplicates").
Chapter 3 — Joins (Combining Tables)
Now let’s move to the most critical SQL concept for AI/Data jobs — Joins.
Without joins, you’ll only query one table at a time. But in the real world, data is spread across many tables — customer info, orders, payments, products. Joins let you connect them.
What this chapter covers (plain language)
Think of each table like a sheet in Excel. A join is like using VLOOKUP to bring data from one sheet into another — but more powerful.
We’ll cover:
Inner Join
Left Join
Right Join (not in older SQLite, but I’ll show the workaround)
Full Outer Join (also a workaround in older SQLite)
Cross Join
Self Join
Example setup (for all join examples)
We’ll create two related tables: customers and orders.
DROP TABLE IF EXISTS customers;
DROP TABLE IF EXISTS orders;
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
name TEXT,
country TEXT
);
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER,
product TEXT,
amount INTEGER
);
INSERT INTO customers (customer_id, name, country) VALUES
(1, 'Asha', 'India'),
(2, 'John', 'USA'),
(3, 'Maya', 'UK'),
(4, 'Liam', 'Canada');
INSERT INTO orders (order_id, customer_id, product, amount) VALUES
(101, 1, 'Laptop', 1000),
(102, 1, 'Mouse', 20),
(103, 2, 'Keyboard', 50),
(104, 2, 'Monitor', 200),
(105, 3, 'Tablet', 300);
Example 1 — INNER JOIN
-- Inner join: only customers who have placed orders
SELECT c.customer_id, c.name, o.product, o.amount
FROM customers c
INNER JOIN orders o
ON c.customer_id = o.customer_id;
Layman explanation:
INNER JOIN = "Only keep rows where both tables have a match."
Here: only customers who placed at least one order appear.
Liam (customer 4) does not appear because he has no orders.
Example 2 — LEFT JOIN
-- Left join: all customers, even if they have no orders
SELECT c.customer_id, c.name, o.product, o.amount
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id;
Layman explanation:
LEFT JOIN = “Take all rows from the left table (customers), and bring data from the right (orders) if it exists.”
If no match, the columns from the right table are NULL.
Liam now appears, but with NULL for product and amount.
Example 3 — RIGHT JOIN (simulated in SQLite)
SQLite only added RIGHT JOIN in version 3.39, so on older versions the portable trick is to flip the tables and use LEFT JOIN.
-- Right join equivalent: all orders, even if no matching customer
SELECT o.order_id, o.product, o.amount, c.name, c.country
FROM orders o
LEFT JOIN customers c
ON o.customer_id = c.customer_id;
Layman explanation:
This is a "right join" in disguise: take all orders, bring in customer info if available.
If there’s an order with a customer_id not in customers, it will show NULL for name/country.
Example 4 — FULL OUTER JOIN (simulated in SQLite)
Older SQLite versions (before 3.39) don’t support FULL JOIN directly either. But we can simulate it with UNION.
-- Full outer join = all customers + all orders
SELECT c.customer_id, c.name, o.product, o.amount
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
UNION
SELECT o.customer_id, c.name, o.product, o.amount
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id; -- the flipped LEFT JOIN stands in for RIGHT JOIN
Layman explanation:
FULL OUTER JOIN = “Give me everything, whether or not there’s a match.”
You’ll see customers without orders and orders without customers.
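To watch the simulation surface unmatched rows from both sides, here is a self-contained check via Python's built-in sqlite3. The two-row tables and the orphan 'Ghost' order are invented for illustration, and the second SELECT uses the LEFT JOIN flip trick instead of RIGHT JOIN so it runs on any SQLite version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, product TEXT);
INSERT INTO customers VALUES (1, 'Asha'), (4, 'Liam');            -- Liam has no orders
INSERT INTO orders VALUES (101, 1, 'Laptop'), (999, 99, 'Ghost'); -- customer 99 does not exist
""")

# Full outer join simulation: LEFT JOIN in both directions, glued with UNION
# (UNION also removes the duplicated matched rows).
rows = cur.execute("""
    SELECT c.name, o.product
    FROM customers c LEFT JOIN orders o ON c.customer_id = o.customer_id
    UNION
    SELECT c.name, o.product
    FROM orders o LEFT JOIN customers c ON o.customer_id = c.customer_id
""").fetchall()
print(rows)  # contains ('Asha', 'Laptop'), ('Liam', None), and (None, 'Ghost')
conn.close()
```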
Example 5 — CROSS JOIN
-- Cross join: every customer paired with every product
DROP TABLE IF EXISTS products;
CREATE TABLE products (
pid INTEGER PRIMARY KEY,
pname TEXT
);
INSERT INTO products (pid, pname) VALUES
(1, 'Phone'),
(2, 'Laptop'),
(3, 'Headphones');
-- Cartesian product
SELECT c.name, p.pname
FROM customers c
CROSS JOIN products p;
Layman explanation:
CROSS JOIN = every row of one table combined with every row of another.
Here, 4 customers × 3 products = 12 rows.
Usually used for generating combinations, not for normal reporting.
Example 6 — SELF JOIN
-- Self join: employees referencing managers
DROP TABLE IF EXISTS employees;
CREATE TABLE employees (
emp_id INTEGER PRIMARY KEY,
name TEXT,
manager_id INTEGER
);
INSERT INTO employees (emp_id, name, manager_id) VALUES
(1, 'Alice', NULL), -- CEO
(2, 'Bob', 1), -- Alice manages Bob
(3, 'Charlie', 1), -- Alice manages Charlie
(4, 'David', 2); -- Bob manages David
-- Self join: find employee and their manager name
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m
ON e.manager_id = m.emp_id;
Layman explanation:
A self join joins a table to itself.
Here, employees reference their manager via manager_id.
David → Bob, Bob → Alice, Alice has no manager.
Quick checklist: Joins
INNER JOIN = only matching rows.
LEFT JOIN = all from left + matches from right.
RIGHT JOIN = all from right + matches from left (simulate in SQLite).
FULL OUTER JOIN = all rows from both (simulate with UNION).
CROSS JOIN = every combination.
SELF JOIN = table joined with itself.
10 Exercises (Chapter 3)
Show all customers and their orders (use INNER JOIN).
Show all customers, even those without orders (use LEFT JOIN).
Show all orders, even if no customer exists for them (use RIGHT JOIN simulation).
Create a query to show customer name + country + total order amount (hint: join + SUM + GROUP BY).
Find customers who have no orders (LEFT JOIN + WHERE order_id IS NULL).
Find products that have never been ordered (use products + orders).
Using employees, list all employees with their manager name.
Using employees, find employees who are managers (appear in manager_id).
Generate all combinations of customer names and products (use CROSS JOIN).
Create a query that shows all customers with number of orders, including those with 0 orders.
✅ Summary of this chapter
Joins let you combine related tables.
They are the heart of SQL for AI/data jobs (most feature engineering is multi-table joins).
Must master LEFT JOIN + GROUP BY for real projects.
Chapter 4 — Subqueries & CTEs
Now we go to Chapter 4 — Subqueries & CTEs (Common Table Expressions).
This is where SQL gets really powerful: you can write queries inside queries to handle complex data questions.
What this chapter covers (plain language)
Subquery = a query inside another query.
You can put a subquery in:
WHERE → filter rows based on another query.
SELECT → create new calculated columns.
FROM → use subquery results as a temporary table.
Correlated Subquery = subquery that depends on the outer query.
CTE (Common Table Expression) = a readable way to build temporary tables with WITH.
Example setup
We’ll reuse customers and orders, and add payments.
DROP TABLE IF EXISTS customers;
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS payments;
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
name TEXT,
country TEXT
);
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER,
product TEXT,
amount INTEGER
);
CREATE TABLE payments (
payment_id INTEGER PRIMARY KEY,
order_id INTEGER,
payment_amount INTEGER
);
INSERT INTO customers (customer_id, name, country) VALUES
(1, 'Asha', 'India'),
(2, 'John', 'USA'),
(3, 'Maya', 'UK'),
(4, 'Liam', 'Canada');
INSERT INTO orders (order_id, customer_id, product, amount) VALUES
(101, 1, 'Laptop', 1000),
(102, 1, 'Mouse', 20),
(103, 2, 'Keyboard', 50),
(104, 2, 'Monitor', 200),
(105, 3, 'Tablet', 300);
INSERT INTO payments (payment_id, order_id, payment_amount) VALUES
(1, 101, 1000),
(2, 102, 20),
(3, 103, 40), -- underpaid
(4, 104, 200),
(5, 105, 300);
Example 1 — Subquery in WHERE
-- Customers who have placed at least one order
SELECT name
FROM customers
WHERE customer_id IN (
SELECT DISTINCT customer_id
FROM orders
);
Layman explanation:
First the inner query finds all customer_id from orders.
The outer query checks which customers are in that list.
This avoids having to join.
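You can confirm from Python (stdlib sqlite3, same rows as the chapter setup minus the unused columns) that the IN subquery returns exactly the customers the equivalent join would:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'John'), (3, 'Maya'), (4, 'Liam');
INSERT INTO orders VALUES
  (101, 1, 1000), (102, 1, 20), (103, 2, 50), (104, 2, 200), (105, 3, 300);
""")

# Version 1: subquery in WHERE (the query from this example)
sub = cur.execute("""
    SELECT name FROM customers
    WHERE customer_id IN (SELECT DISTINCT customer_id FROM orders)
    ORDER BY name
""").fetchall()

# Version 2: the join-based equivalent; DISTINCT collapses repeat buyers to one row
joined = cur.execute("""
    SELECT DISTINCT c.name
    FROM customers c JOIN orders o ON c.customer_id = o.customer_id
    ORDER BY c.name
""").fetchall()

print(sub)  # [('Asha',), ('John',), ('Maya',)] -- Liam has no orders
conn.close()
```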
Example 2 — Subquery in SELECT
-- Show each customer and their total order amount
SELECT
name,
(SELECT SUM(amount)
FROM orders o
WHERE o.customer_id = c.customer_id) AS total_spent
FROM customers c;
Layman explanation:
For each customer row, SQL runs the subquery to sum their orders.
This is like an Excel formula in each row.
Example 3 — Subquery in FROM (derived table)
-- Find customers and their average order size
SELECT c.name, avg_orders.avg_amount
FROM customers c
JOIN (
SELECT customer_id, AVG(amount) AS avg_amount
FROM orders
GROUP BY customer_id
) avg_orders
ON c.customer_id = avg_orders.customer_id;
Layman explanation:
The subquery in FROM calculates average order per customer.
We join that temporary table back to customers.
This is more efficient when reusing aggregates.
Example 4 — Correlated subquery
-- Find customers who have an order larger than 500
SELECT name
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.amount > 500
);
Layman explanation:
EXISTS returns true if the subquery finds at least one row.
Here, it checks if each customer has a big order (>500).
This depends on the outer query → correlated subquery.
Example 5 — CTE (WITH clause)
-- CTE to calculate total payments vs orders
WITH order_totals AS (
SELECT o.order_id, o.amount AS order_amount,
COALESCE(SUM(p.payment_amount), 0) AS total_paid
FROM orders o
LEFT JOIN payments p ON o.order_id = p.order_id
GROUP BY o.order_id
)
SELECT order_id, order_amount, total_paid,
(order_amount - total_paid) AS balance
FROM order_totals;
Layman explanation:
WITH order_totals AS (...) defines a temporary table.
Then you can query it like a normal table.
Much cleaner than nested subqueries, easier to read.
Example 6 — Multiple CTEs
-- CTEs can be chained
WITH order_totals AS (
SELECT customer_id, SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id
),
customer_payments AS (
SELECT o.customer_id, SUM(p.payment_amount) AS total_paid
FROM orders o
JOIN payments p ON o.order_id = p.order_id
GROUP BY o.customer_id
)
SELECT c.name, o.total_amount, cp.total_paid
FROM customers c
LEFT JOIN order_totals o ON c.customer_id = o.customer_id
LEFT JOIN customer_payments cp ON c.customer_id = cp.customer_id;
Layman explanation:
First CTE calculates total order amount per customer.
Second CTE calculates total payments per customer.
Final query joins them all for a clear report.
Quick checklist
Subqueries can be in WHERE, SELECT, or FROM.
Correlated subqueries depend on the outer query.
EXISTS is efficient for "does a row exist?" checks.
CTEs (WITH) make queries cleaner and reusable.
10 Exercises (Chapter 4)
Write a query to list customers who have not placed any orders (use subquery in WHERE).
Show each customer’s name and their largest order amount (subquery in SELECT).
Using a subquery in FROM, calculate average payment per order.
Find customers who spent more than 500 in total (use correlated subquery).
Use EXISTS to find orders that have payments recorded.
Write a CTE to show orders and their remaining balance (order amount – total paid).
Write a query with two CTEs: one for total orders per customer, one for total payments, then join.
Find customers who have orders but no payments.
Use a subquery in FROM to calculate the top spender customer_id, then join to show their name.
Use a correlated subquery to show each order and whether it is above that customer’s average order size.
✅ Summary of this chapter
Subqueries let you nest questions inside questions.
CTEs make big queries readable and modular.
These are essential for complex analytics, especially in AI/ML pipelines where you need to prepare features from multiple data sources.
Chapter 5 — Advanced Filtering & Window Functions
Now we’re moving into Chapter 5 — Advanced Filtering & Window Functions.
This is one of the most critical chapters for AI/Data jobs because:
You’ll need to handle ranking (e.g., top N customers).
You’ll calculate running totals, moving averages, differences → essential in time-series problems (stocks, sensors, logs).
You’ll handle conditional logic for feature engineering.
What this chapter covers (plain language)
CASE WHEN → if-else conditions inside SQL.
COALESCE / NULLIF → deal with NULL values.
Window functions (OVER()) → look at rows relative to each other.
Ranking: ROW_NUMBER(), RANK(), DENSE_RANK().
Running totals, moving averages: SUM() OVER(), AVG() OVER().
Time-series helpers: LAG(), LEAD().
Example setup (sales data)
DROP TABLE IF EXISTS sales;
CREATE TABLE sales (
sale_id INTEGER PRIMARY KEY,
customer TEXT,
amount INTEGER,
sale_date TEXT
);
INSERT INTO sales (sale_id, customer, amount, sale_date) VALUES
(1, 'Asha', 100, '2025-09-01'),
(2, 'John', 200, '2025-09-01'),
(3, 'Asha', 300, '2025-09-02'),
(4, 'Maya', 150, '2025-09-02'),
(5, 'John', 400, '2025-09-03'),
(6, 'Asha', 250, '2025-09-04'),
(7, 'Maya', 500, '2025-09-05');
Example 1 — CASE WHEN (if-else logic)
-- Label sales as 'High' if > 300 else 'Low'
SELECT sale_id, customer, amount,
CASE
WHEN amount > 300 THEN 'High'
ELSE 'Low'
END AS sale_category
FROM sales;
Layman explanation:
CASE WHEN works like IF in Excel or Python.
Here, sales above 300 are "High", others "Low".
You can nest multiple conditions too.
Example 2 — COALESCE and NULLIF
DROP TABLE IF EXISTS payments;
CREATE TABLE payments (
pay_id INTEGER PRIMARY KEY,
customer TEXT,
amount INTEGER,
discount INTEGER
);
INSERT INTO payments (pay_id, customer, amount, discount) VALUES
(1, 'Asha', 100, NULL),
(2, 'John', 200, 20),
(3, 'Maya', 150, 0),
(4, 'Liam', 300, NULL);
-- Replace NULL discount with 0
SELECT pay_id, customer, amount,
COALESCE(discount, 0) AS discount_applied
FROM payments;
-- Avoid divide-by-zero errors with NULLIF
SELECT pay_id, customer,
amount / NULLIF(discount, 0) AS ratio
FROM payments;
Layman explanation:
COALESCE(x, 0) replaces NULL with a default (0).
NULLIF(x, 0) returns NULL if x = 0 (avoids divide-by-zero errors).
Example 3 — ROW_NUMBER, RANK, DENSE_RANK
-- Rank sales per customer
SELECT customer, amount,
ROW_NUMBER() OVER (PARTITION BY customer ORDER BY amount DESC) AS row_num,
RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS rank_val,
DENSE_RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS dense_rank_val
FROM sales;
Layman explanation:
ROW_NUMBER() → unique sequence (1, 2, 3, …).
RANK() → leaves gaps (1, 2, 2, 4).
DENSE_RANK() → no gaps (1, 2, 2, 3).
PARTITION BY customer → restart numbering for each customer.
Example 4 — Running totals with SUM OVER
-- Running total of sales by date
SELECT sale_date, customer, amount,
SUM(amount) OVER (ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM sales;
Layman explanation:
SUM(amount) OVER (ORDER BY sale_date) → running total.
UNBOUNDED PRECEDING = start from the first row.
CURRENT ROW = stop at this row.
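As a sanity check, you can run the running total end to end with Python's sqlite3 module (window functions need SQLite 3.25+, which recent Python builds bundle). One tweak versus the query above: sale_id is added to ORDER BY here as a tie-breaker, since two sales share the date '2025-09-01' and the row order would otherwise be ambiguous:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # window functions require SQLite 3.25+
conn.executescript("""
CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER, sale_date TEXT);
INSERT INTO sales VALUES
 (1,'Asha',100,'2025-09-01'), (2,'John',200,'2025-09-01'),
 (3,'Asha',300,'2025-09-02'), (4,'Maya',150,'2025-09-02'),
 (5,'John',400,'2025-09-03'), (6,'Asha',250,'2025-09-04'),
 (7,'Maya',500,'2025-09-05');
""")

rows = conn.execute("""
    SELECT sale_date, amount,
           SUM(amount) OVER (
               ORDER BY sale_date, sale_id   -- sale_id breaks date ties deterministically
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM sales;
""").fetchall()

# The final running total equals the grand total of all sales
print(rows[-1])  # ('2025-09-05', 500, 1900)
```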
Example 5 — Moving averages with AVG OVER
-- 2-day moving average sales amount
SELECT sale_date, customer, amount,
AVG(amount) OVER (
ORDER BY sale_date
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
) AS moving_avg
FROM sales;
Layman explanation:
Looks at "current row + previous 1 row".
Calculates average over that window.
Useful for smoothing noisy time-series data.
Example 6 — LAG and LEAD (time-series difference)
-- Compare each sale with previous sale (LAG)
SELECT sale_id, customer, amount, sale_date,
LAG(amount) OVER (PARTITION BY customer ORDER BY sale_date) AS prev_amount,
(amount - LAG(amount) OVER (PARTITION BY customer ORDER BY sale_date)) AS change_from_prev
FROM sales;
-- Look ahead with LEAD
SELECT sale_id, customer, amount, sale_date,
LEAD(amount) OVER (PARTITION BY customer ORDER BY sale_date) AS next_amount
FROM sales;
Layman explanation:
LAG(x) → fetch the previous row's value.
LEAD(x) → fetch the next row's value.
Great for calculating differences, growth rates, and trends.
Quick checklist
CASE WHEN = if-else in SQL.
COALESCE = replace NULL.
NULLIF = avoid divide-by-zero.
ROW_NUMBER, RANK, DENSE_RANK = row ranking.
SUM OVER = running total.
AVG OVER = moving average.
LAG / LEAD = compare with previous/next row.
10 Exercises (Chapter 5)
Label sales as "Big" if > 250, "Medium" if between 100–250, else "Small".
Show payments with NULL discounts replaced by 0 using COALESCE.
Find the top 2 highest sales per customer (use ROW_NUMBER).
Calculate the total revenue so far (running total) ordered by sale_date.
Find the average sales amount per customer (use AVG() OVER (PARTITION BY ...)).
Show each sale and the difference from the previous sale amount for that customer (use LAG).
Show each sale and the next sale amount for that customer (use LEAD).
Create a query that calculates a 3-day moving average of sales.
Rank customers by total sales amount across all dates (use RANK).
Write a query that shows sales where the amount is greater than the customer's average sale amount (use window AVG + comparison).
✅ Summary of this chapter
Window functions let you analyze trends across rows — essential for AI (time-series, ranking, anomaly detection).
Mastering ROW_NUMBER, RANK, LAG, LEAD, and running totals will make your SQL stand out.
Chapter 6 — Data Manipulation (INSERT, UPDATE, DELETE, MERGE)
Now we’ll cover Chapter 6 — Data Manipulation (INSERT, UPDATE, DELETE, MERGE/UPSERT).
This is about changing data, not just reading it. In AI/data workflows, you’ll often insert cleaned data, update values, or delete errors. For ETL (Extract → Transform → Load), this chapter is key.
What this chapter covers (plain language)
INSERT → add new rows.
UPDATE → change values in existing rows.
DELETE → remove rows.
MERGE (or UPSERT) → insert new rows, or update them if they already exist.
Example setup
DROP TABLE IF EXISTS employees;
CREATE TABLE employees (
emp_id INTEGER PRIMARY KEY,
name TEXT,
role TEXT,
salary INTEGER
);
INSERT INTO employees (emp_id, name, role, salary) VALUES
(1, 'Asha', 'Data Analyst', 50000),
(2, 'John', 'ML Engineer', 90000),
(3, 'Maya', 'Data Scientist', 85000);
Example 1 — INSERT
-- Insert a single row
INSERT INTO employees (emp_id, name, role, salary)
VALUES (4, 'Liam', 'DevOps', 75000);
-- Insert multiple rows at once
INSERT INTO employees (emp_id, name, role, salary)
VALUES
(5, 'Sara', 'Intern', 20000),
(6, 'Chen', 'Data Engineer', 80000);
-- Check the data
SELECT * FROM employees;
Layman explanation:
INSERT INTO table (columns) VALUES (values) adds a row.
You can add multiple rows in one statement.
Example 2 — UPDATE
-- Update Sara's salary
UPDATE employees
SET salary = 25000
WHERE name = 'Sara';
-- Increase salary of all "Data" roles by 10%
UPDATE employees
SET salary = salary * 1.1
WHERE role LIKE 'Data%';
-- Check results
SELECT * FROM employees;
Layman explanation:
UPDATE table SET column = value WHERE condition.
Without WHERE, all rows will be updated (dangerous!).
Example 3 — DELETE
-- Delete interns
DELETE FROM employees
WHERE role = 'Intern';
-- Delete all employees with salary < 30000
DELETE FROM employees
WHERE salary < 30000;
-- Check
SELECT * FROM employees;
Layman explanation:
DELETE FROM table WHERE condition.
If you forget WHERE, it deletes everything. Always double-check!
Example 4 — UPSERT (INSERT OR REPLACE in SQLite)
-- If emp_id exists, replace it. If not, insert new.
INSERT OR REPLACE INTO employees (emp_id, name, role, salary)
VALUES (2, 'John', 'ML Engineer', 95000);
-- Insert new employee with same method
INSERT OR REPLACE INTO employees (emp_id, name, role, salary)
VALUES (7, 'Emma', 'AI Researcher', 100000);
SELECT * FROM employees;
Layman explanation:
UPSERT = Update if exists, Insert if not.
In SQLite: INSERT OR REPLACE.
In PostgreSQL: INSERT ... ON CONFLICT (id) DO UPDATE.
In MySQL: INSERT ... ON DUPLICATE KEY UPDATE.
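SQLite has also supported the PostgreSQL-style ON CONFLICT clause since version 3.24, which is usually safer than INSERT OR REPLACE because it updates only the listed columns rather than deleting and re-inserting the whole row. A minimal sketch via Python's sqlite3 (assuming a build with SQLite 3.24+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # ON CONFLICT ... DO UPDATE needs SQLite 3.24+
conn.executescript("""
CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, role TEXT, salary INTEGER);
INSERT INTO employees VALUES (2, 'John', 'ML Engineer', 90000);
""")

# One statement handles both cases: update on key collision, insert otherwise.
# 'excluded' refers to the row that failed to insert.
upsert = """
    INSERT INTO employees (emp_id, name, role, salary)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (emp_id) DO UPDATE SET salary = excluded.salary;
"""
conn.execute(upsert, (2, 'John', 'ML Engineer', 95000))    # existing -> salary updated
conn.execute(upsert, (7, 'Emma', 'AI Researcher', 100000)) # new -> row inserted

rows = conn.execute("SELECT emp_id, name, salary FROM employees ORDER BY emp_id").fetchall()
print(rows)  # [(2, 'John', 95000), (7, 'Emma', 100000)]
```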
Example 5 — MERGE (manual simulation for SQLite)
SQLite doesn’t support MERGE directly, but we can simulate it.
Suppose we have a new data table:
DROP TABLE IF EXISTS new_employees;
CREATE TABLE new_employees (
emp_id INTEGER PRIMARY KEY,
name TEXT,
role TEXT,
salary INTEGER
);
INSERT INTO new_employees (emp_id, name, role, salary) VALUES
(2, 'John', 'ML Engineer', 97000), -- existing employee (update)
(8, 'Olivia', 'Data Engineer', 82000); -- new employee
Simulate MERGE:
-- Step 1: Update existing
UPDATE employees
SET salary = (SELECT salary FROM new_employees ne WHERE ne.emp_id = employees.emp_id)
WHERE emp_id IN (SELECT emp_id FROM new_employees);
-- Step 2: Insert new ones
INSERT INTO employees (emp_id, name, role, salary)
SELECT ne.emp_id, ne.name, ne.role, ne.salary
FROM new_employees ne
WHERE ne.emp_id NOT IN (SELECT emp_id FROM employees);
-- Check final
SELECT * FROM employees;
Layman explanation:
First, UPDATE matches rows by emp_id and changes the salary.
Then, INSERT adds any rows missing from employees.
That’s what MERGE does in big databases (Oracle, SQL Server).
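The two-step simulation above can be run end to end with Python's sqlite3 module to confirm it behaves like a MERGE, using the chapter's tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, role TEXT, salary INTEGER);
INSERT INTO employees VALUES
 (1,'Asha','Data Analyst',50000), (2,'John','ML Engineer',90000), (3,'Maya','Data Scientist',85000);

CREATE TABLE new_employees (emp_id INTEGER PRIMARY KEY, name TEXT, role TEXT, salary INTEGER);
INSERT INTO new_employees VALUES
 (2,'John','ML Engineer',97000),      -- existing: should be updated
 (8,'Olivia','Data Engineer',82000);  -- new: should be inserted

-- Step 1: update rows that already exist
UPDATE employees
SET salary = (SELECT salary FROM new_employees ne WHERE ne.emp_id = employees.emp_id)
WHERE emp_id IN (SELECT emp_id FROM new_employees);

-- Step 2: insert rows that don't exist yet
INSERT INTO employees (emp_id, name, role, salary)
SELECT ne.emp_id, ne.name, ne.role, ne.salary
FROM new_employees ne
WHERE ne.emp_id NOT IN (SELECT emp_id FROM employees);
""")

rows = dict(conn.execute("SELECT emp_id, salary FROM employees").fetchall())
print(rows)  # John updated to 97000, Olivia (8) inserted, others untouched
```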
Quick checklist
INSERT → add rows.
UPDATE → change rows (always use WHERE).
DELETE → remove rows (always use WHERE).
UPSERT/MERGE → insert or update depending on whether the row exists.
10 Exercises (Chapter 6)
Insert a new employee "Ravi" with role "Data Intern" and salary 15000.
Insert 2 new employees in one query: "Emma" (AI Engineer, 110000) and "Noah" (ML Intern, 18000).
Update the salary of all "Engineer" roles by +5000.
Update "Asha"'s role from "Data Analyst" to "Senior Data Analyst".
Delete all employees with role "Intern".
Delete employees whose salary is less than 20000.
Use UPSERT to update John's salary to 120000.
Use UPSERT to insert a new employee "David" (DevOps, 70000).
Create a new table new_hires with 2 rows (1 existing emp_id, 1 new). Simulate MERGE: update the existing one, insert the new one.
Write a query that deletes all employees with "Data" in their role, but only if salary < 60000.
✅ Summary of this chapter
These commands are part of ETL and data cleaning.
In AI/data projects, you often insert raw data, update corrections, delete wrong rows, or merge new data feeds.
Always be careful with UPDATE and DELETE: forgetting WHERE wipes everything.
Chapter 7 — Set Operations
Let’s move to Chapter 7 — Set Operations (UNION, INTERSECT, EXCEPT).
Set operations are about combining results of multiple queries. This is super useful in data integration, filtering, and feature creation.
What this chapter covers (plain language)
UNION → combine results, remove duplicates.
UNION ALL → combine results, keep duplicates.
INTERSECT → rows that appear in both queries.
EXCEPT / MINUS → rows in one query but not the other.
👉 These are like Venn diagram operations for SQL.
Example setup
DROP TABLE IF EXISTS team_a;
DROP TABLE IF EXISTS team_b;
CREATE TABLE team_a (
name TEXT
);
CREATE TABLE team_b (
name TEXT
);
INSERT INTO team_a (name) VALUES
('Asha'),
('John'),
('Maya'),
('Liam');
INSERT INTO team_b (name) VALUES
('Maya'),
('John'),
('Olivia'),
('Emma');
Team A: Asha, John, Maya, Liam
Team B: Maya, John, Olivia, Emma
Example 1 — UNION (no duplicates)
-- All unique members from both teams
SELECT name FROM team_a
UNION
SELECT name FROM team_b;
Result: Asha, John, Liam, Maya, Olivia, Emma (no duplicates).
Layman explanation:
UNION merges results from both queries.
Duplicates are removed automatically.
Example 2 — UNION ALL (keep duplicates)
-- All members, including duplicates
SELECT name FROM team_a
UNION ALL
SELECT name FROM team_b;
Result: John and Maya appear twice (once from each team).
Layman explanation:
UNION ALL keeps duplicates.
Faster than UNION (since there is no deduplication step).
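A quick sanity check of the difference, using Python's sqlite3 module with the two team tables: UNION should yield 6 unique names, UNION ALL all 8 rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE team_a (name TEXT);
CREATE TABLE team_b (name TEXT);
INSERT INTO team_a VALUES ('Asha'), ('John'), ('Maya'), ('Liam');
INSERT INTO team_b VALUES ('Maya'), ('John'), ('Olivia'), ('Emma');
""")

# UNION deduplicates; John and Maya are counted once
union_rows = conn.execute(
    "SELECT name FROM team_a UNION SELECT name FROM team_b").fetchall()
# UNION ALL keeps every row from both queries
union_all_rows = conn.execute(
    "SELECT name FROM team_a UNION ALL SELECT name FROM team_b").fetchall()

print(len(union_rows), len(union_all_rows))  # 6 8
```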
Example 3 — INTERSECT
-- Members common to both teams
SELECT name FROM team_a
INTERSECT
SELECT name FROM team_b;
Result: John, Maya
Layman explanation:
INTERSECT = only rows that appear in both sets.
Like the overlap in a Venn diagram.
Example 4 — EXCEPT (or MINUS)
-- Members in Team A but not in Team B
SELECT name FROM team_a
EXCEPT
SELECT name FROM team_b;
Result: Asha, Liam
Layman explanation:
EXCEPT = rows from the first query that are not in the second.
In Oracle, it's called MINUS.
Example 5 — More practical example (Orders vs Payments)
Let’s reuse orders and payments.
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS payments;
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer TEXT
);
CREATE TABLE payments (
payment_id INTEGER PRIMARY KEY,
order_id INTEGER
);
INSERT INTO orders (order_id, customer) VALUES
(1, 'Asha'),
(2, 'John'),
(3, 'Maya'),
(4, 'Liam');
INSERT INTO payments (payment_id, order_id) VALUES
(101, 1),
(102, 2),
(103, 3);
Orders: 1–4
Payments: 1–3
-- Orders that have been paid (intersection)
SELECT order_id FROM orders
INTERSECT
SELECT order_id FROM payments;
-- Orders that have not been paid (set difference)
SELECT order_id FROM orders
EXCEPT
SELECT order_id FROM payments;
Result:
Paid: 1, 2, 3
Not Paid: 4
Quick checklist
UNION = combine, remove duplicates.
UNION ALL = combine, keep duplicates.
INTERSECT = common rows.
EXCEPT = rows in the first query but not the second.
10 Exercises (Chapter 7)
List all unique names from team_a and team_b (use UNION).
(use UNION).List all names including duplicates (use UNION ALL).
Find names common to both teams (use INTERSECT).
Find names only in Team A (use EXCEPT).
Find names only in Team B (use EXCEPT).
Using orders and payments, find orders that have been paid (use INTERSECT).
Using orders and payments, find orders that have not been paid (use EXCEPT).
Insert a duplicate row into team_a (John again). Show the difference between UNION and UNION ALL results.
Combine all customers and all employees into one list (create small extra tables).
Write a query to get customers who placed orders but are not in payments table.
✅ Summary of this chapter
Set operations are like Venn diagrams in SQL.
They are powerful for data comparison (paid vs unpaid, active vs inactive, A/B group differences).
Very handy in AI for feature creation and labeling.
Chapter 8 — Data Cleaning & Transformation
Let’s continue with Chapter 8 — Data Cleaning & Transformation.
This is one of the most practical chapters for AI/ML, because before you train any model, you’ll spend 70–80% of your time cleaning and transforming raw data.
What this chapter covers (plain language)
String functions → cleaning names, trimming spaces, searching text.
Date & time functions → extracting day/month/year, calculating differences.
Type conversion → converting text to numbers or dates.
Pivot/Unpivot → reshape data from wide → long or long → wide.
Example setup
DROP TABLE IF EXISTS raw_data;
CREATE TABLE raw_data (
id INTEGER PRIMARY KEY,
name TEXT,
email TEXT,
signup_date TEXT,
score TEXT
);
INSERT INTO raw_data (id, name, email, signup_date, score) VALUES
(1, ' Asha ', 'asha@example.com', '2025-01-10', '85'),
(2, 'John', 'john@example.com', '2025-02-15', '90'),
(3, 'Maya Singh', 'maya_singh@example.com', '2025-03-20', 'NULL'),
(4, 'Liam', 'liam@example.com', '2025-03-25', '72'),
(5, 'Sara ', 'sara@example.com', '2025-04-05', '88');
Example 1 — String functions
-- TRIM spaces
SELECT id, name, TRIM(name) AS clean_name FROM raw_data;
-- UPPER / LOWER
SELECT id, UPPER(name) AS upper_name, LOWER(name) AS lower_name FROM raw_data;
-- SUBSTRING (get first 4 letters)
SELECT id, SUBSTR(name, 1, 4) AS short_name FROM raw_data;
-- FIND position of '@' in email
SELECT id, INSTR(email, '@') AS at_position FROM raw_data;
-- CONCAT (combine name + email)
SELECT id, name || ' <' || email || '>' AS contact_info FROM raw_data;
Layman explanation:
TRIM() → removes extra spaces.
UPPER(), LOWER() → normalize text.
SUBSTR() → extract part of a string.
INSTR() → find a character's position.
|| (SQLite) → concatenate strings.
Example 2 — Date functions
-- Extract year, month, day
SELECT id, signup_date,
STRFTIME('%Y', signup_date) AS year,
STRFTIME('%m', signup_date) AS month,
STRFTIME('%d', signup_date) AS day
FROM raw_data;
-- Days since signup
SELECT id, signup_date,
JULIANDAY('2025-09-16') - JULIANDAY(signup_date) AS days_since_signup
FROM raw_data;
Layman explanation:
STRFTIME('%Y') → year, %m → month, %d → day.
JULIANDAY(date) → converts a date to a number, so you can subtract dates.
Example 3 — Type conversion
-- Convert score from TEXT to INTEGER
SELECT id, name, CAST(score AS INTEGER) AS score_num
FROM raw_data;
Layman explanation:
CAST(x AS INTEGER) converts text → number.
Useful when data is stored as text in CSV files.
Example 4 — Handling NULLs in transformation
-- Replace invalid score 'NULL' with 0
SELECT id, name,
CASE WHEN score = 'NULL' THEN 0 ELSE CAST(score AS INTEGER) END AS clean_score
FROM raw_data;
-- OR using NULLIF + COALESCE
SELECT id, name,
COALESCE(NULLIF(score, 'NULL'), 0) AS clean_score
FROM raw_data;
Layman explanation:
NULLIF(score, 'NULL') → treat the string 'NULL' as a real NULL.
COALESCE(..., 0) → replace NULL with 0.
Example 5 — Pivot (long → wide)
Let’s make a table with scores in subjects.
DROP TABLE IF EXISTS marks;
CREATE TABLE marks (
student TEXT,
subject TEXT,
score INTEGER
);
INSERT INTO marks (student, subject, score) VALUES
('Asha', 'Math', 85),
('Asha', 'Science', 90),
('John', 'Math', 78),
('John', 'Science', 88);
Pivot (manual in SQLite):
SELECT student,
SUM(CASE WHEN subject = 'Math' THEN score END) AS math_score,
SUM(CASE WHEN subject = 'Science' THEN score END) AS science_score
FROM marks
GROUP BY student;
Result:
Asha → 85, 90
John → 78, 88
Layman explanation:
Pivot = convert "rows" into "columns".
We simulate it with CASE WHEN.
Example 6 — Unpivot (wide → long)
Suppose you have wide-format data:
DROP TABLE IF EXISTS student_scores;
CREATE TABLE student_scores (
student TEXT,
math INTEGER,
science INTEGER
);
INSERT INTO student_scores (student, math, science) VALUES
('Asha', 85, 90),
('John', 78, 88);
Unpivot (manual in SQLite):
SELECT student, 'Math' AS subject, math AS score FROM student_scores
UNION ALL
SELECT student, 'Science' AS subject, science AS score FROM student_scores;
Result:
Asha, Math, 85
Asha, Science, 90
John, Math, 78
John, Science, 88
Layman explanation:
Unpivot = convert "columns" into "rows".
Useful when you need to normalize wide Excel-like tables.
Quick checklist
TRIM, UPPER, SUBSTR, INSTR → clean strings.
STRFTIME, JULIANDAY → extract and calculate with dates.
CAST, COALESCE, NULLIF → fix dirty types and missing values.
CASE WHEN inside SELECT → conditional transformation.
Pivot = rows → columns.
Unpivot = columns → rows.
10 Exercises (Chapter 8)
Clean the raw_data.name column by trimming spaces.
Convert all names to uppercase.
Extract the year of signup for each customer.
Find customers who signed up in March (%m = '03').
Convert the score column to integer, treating 'NULL' as 0.
Show the number of days since each customer signed up (relative to today).
Create a pivot table from marks showing each student's math and science scores.
Create an unpivot table from student_scores showing subject + score.
Find customers whose email domain is example.com (use SUBSTR/INSTR).
Create a new column score_category: "High" if score ≥ 85, "Low" otherwise.
✅ Summary of this chapter
This is the bread and butter of feature engineering: clean strings, handle dates, fix types, reshape data.
Pivot/unpivot is especially important in analytics pipelines.
Mastering this chapter makes you valuable for AI jobs, since 80% of ML time is data prep.
Chapter 9 — Performance & Optimization
Let’s dive into Chapter 9 — Performance & Optimization.
This is a must-know for AI/data jobs, because when you query millions of rows, slow SQL will waste hours. Even as a data/AI engineer, you need to understand indexes and query tuning basics.
What this chapter covers (plain language)
Why queries get slow.
How indexes speed things up.
How to check what SQL is doing (EXPLAIN).
Basic tips to make queries faster.
⚡ Reminder: You don’t need DBA-level tuning, but you must know enough to avoid writing queries that take hours instead of seconds.
Example setup
DROP TABLE IF EXISTS big_orders;
CREATE TABLE big_orders (
order_id INTEGER PRIMARY KEY,
customer TEXT,
amount INTEGER,
order_date TEXT
);
-- Insert 20 rows (imagine millions in real life)
INSERT INTO big_orders (order_id, customer, amount, order_date) VALUES
(1, 'Asha', 200, '2025-01-01'),
(2, 'John', 500, '2025-01-02'),
(3, 'Maya', 300, '2025-01-03'),
(4, 'Liam', 700, '2025-01-04'),
(5, 'Sara', 150, '2025-01-05'),
(6, 'Asha', 900, '2025-02-01'),
(7, 'John', 450, '2025-02-02'),
(8, 'Maya', 600, '2025-02-03'),
(9, 'Liam', 1000, '2025-02-04'),
(10, 'Sara', 120, '2025-02-05'),
(11, 'Asha', 400, '2025-03-01'),
(12, 'John', 250, '2025-03-02'),
(13, 'Maya', 850, '2025-03-03'),
(14, 'Liam', 950, '2025-03-04'),
(15, 'Sara', 500, '2025-03-05'),
(16, 'Asha', 100, '2025-04-01'),
(17, 'John', 750, '2025-04-02'),
(18, 'Maya', 200, '2025-04-03'),
(19, 'Liam', 300, '2025-04-04'),
(20, 'Sara', 400, '2025-04-05');
Example 1 — Full table scan
-- Find orders where amount > 800
EXPLAIN QUERY PLAN
SELECT * FROM big_orders WHERE amount > 800;
Layman explanation:
Without an index, SQL will check every row (called a full table scan).
With 20 rows, it’s fine. With 20 million rows, it’s very slow.
Example 2 — Creating an index
-- Create index on amount
CREATE INDEX idx_amount ON big_orders(amount);
-- Now run the same query
EXPLAIN QUERY PLAN
SELECT * FROM big_orders WHERE amount > 800;
Layman explanation:
An index is like the index in a book → faster lookup.
Now SQL doesn’t scan all rows; it jumps straight to the rows where amount > 800.
Example 3 — Index on multiple columns
-- Index on (customer, order_date)
CREATE INDEX idx_customer_date ON big_orders(customer, order_date);
-- Query using that index
SELECT * FROM big_orders
WHERE customer = 'Asha' AND order_date > '2025-02-01';
Layman explanation:
Composite indexes speed up queries that filter by two or more columns.
But they only help if your WHERE clause uses the index's leading column(s), in order.
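You can check this column-order behavior yourself with EXPLAIN QUERY PLAN via Python's sqlite3 module. The exact plan wording varies across SQLite versions, so this sketch only looks for key substrings:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE big_orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER, order_date TEXT);
CREATE INDEX idx_customer_date ON big_orders(customer, order_date);
""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail); keep the detail text
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Filter starts with the leading column (customer) -> index is used
p1 = plan("SELECT * FROM big_orders WHERE customer = 'Asha' AND order_date > '2025-02-01'")
# Filter only on the second column (order_date) -> full table scan
p2 = plan("SELECT * FROM big_orders WHERE order_date > '2025-02-01'")

print(p1)  # mentions USING INDEX idx_customer_date
print(p2)  # a SCAN, no index
```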
Example 4 — Avoiding SELECT *
-- Bad: SELECT *
SELECT * FROM big_orders WHERE customer = 'John';
-- Good: only select needed columns
SELECT customer, amount FROM big_orders WHERE customer = 'John';
Layman explanation:
SELECT * loads all columns, even if you don't need them.
In huge tables with 50+ columns, this slows down queries.
Example 5 — Avoiding unnecessary functions in WHERE
-- Bad: function on column (index not used)
SELECT * FROM big_orders
WHERE STRFTIME('%Y', order_date) = '2025';
-- Good: compare directly (uses index)
SELECT * FROM big_orders
WHERE order_date BETWEEN '2025-01-01' AND '2025-12-31';
Layman explanation:
If you wrap columns in functions, indexes are ignored.
Instead, filter using raw values.
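The same EXPLAIN QUERY PLAN trick shows the index being ignored when the column is wrapped in a function. Here idx_order_date is a hypothetical index added just for this demo (the chapter's setup doesn't create one on order_date), and the plan wording varies by SQLite version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE big_orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER, order_date TEXT);
CREATE INDEX idx_order_date ON big_orders(order_date);  -- hypothetical index for the demo
""")

def plan(sql):
    # Collect the human-readable 'detail' column of EXPLAIN QUERY PLAN
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Wrapping the column in a function hides it from the index -> scan
bad = plan("SELECT * FROM big_orders WHERE STRFTIME('%Y', order_date) = '2025'")
# Comparing the raw column lets SQLite do an index range search
good = plan("SELECT * FROM big_orders WHERE order_date BETWEEN '2025-01-01' AND '2025-12-31'")

print(bad)   # a SCAN
print(good)  # mentions USING INDEX idx_order_date
```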
Example 6 — LIMIT for efficiency
-- Get top 5 biggest orders
SELECT customer, amount
FROM big_orders
ORDER BY amount DESC
LIMIT 5;
Layman explanation:
Always use LIMIT when exploring.
Otherwise, queries may return millions of rows and choke memory.
Quick checklist (Performance Rules of Thumb)
Always index columns used in WHERE, JOIN, ORDER BY.
Avoid SELECT *; pick only the needed columns.
Avoid wrapping columns in functions inside WHERE.
Use LIMIT when exploring.
Use EXPLAIN to check if indexes are used.
10 Exercises (Chapter 9)
Create an index on customer in big_orders. Run a query filtering by customer and check it with EXPLAIN.
Create a composite index on (customer, amount) and query John's orders above 400.
Write a query to get the top 3 biggest orders.
Write a query to find orders in March 2025 without using functions in WHERE.
Compare SELECT * vs selecting only customer, amount.
Drop the index idx_amount and check how EXPLAIN changes for amount > 800.
Find all orders by Asha after 2025-03-01 (make sure your index helps).
Create a query that sorts by order_date DESC and limits to the last 2 orders.
Insert 2 duplicate rows for John with amount 500. How would you ensure uniqueness? (Hint: PRIMARY KEY, UNIQUE.)
Use EXPLAIN QUERY PLAN on a simple query, and check whether it says "USING INDEX".
✅ Summary of this chapter
SQL performance is mostly about indexes.
Good habits: avoid SELECT *, use LIMIT, and filter smartly.
For AI/data jobs, you don't need to be a tuning expert, but knowing this saves you from slow pipelines.
Now we’ve reached the final practical chapter:
Chapter 10 — AI/Data Job Specific SQL (Feature Engineering, Time-Series, Data Quality Checks).
This is where everything we’ve learned ties into AI/ML workflows. SQL here is used to prepare datasets, create features, and ensure data quality before training models.
Chapter 10 — AI/Data Job Specific SQL
What this chapter covers (plain language)
Feature engineering in SQL (new columns from raw data).
Time-series operations (lags, moving averages).
ETL workflows (Extract → Transform → Load).
Data quality checks (missing values, duplicates).
SQL + Python integration (using Pandas + SQLAlchemy).
Example setup (transactions)
DROP TABLE IF EXISTS transactions;
CREATE TABLE transactions (
txn_id INTEGER PRIMARY KEY,
customer TEXT,
amount INTEGER,
txn_date TEXT
);
INSERT INTO transactions (txn_id, customer, amount, txn_date) VALUES
(1, 'Asha', 100, '2025-01-01'),
(2, 'Asha', 200, '2025-01-05'),
(3, 'John', 500, '2025-01-02'),
(4, 'John', 50, '2025-01-10'),
(5, 'Maya', 300, '2025-01-03'),
(6, 'Maya', 400, '2025-01-06'),
(7, 'Maya', 150, '2025-01-09');
Example 1 — Feature engineering: totals and averages
-- Total and average spend per customer
SELECT customer,
SUM(amount) AS total_spent,
AVG(amount) AS avg_spent,
COUNT(*) AS txn_count
FROM transactions
GROUP BY customer;
Layman explanation:
Features = new columns for ML (like "total spend", "average spend").
This is classic SQL feature engineering.
Example 2 — Recency feature (days since last transaction)
-- Days since last transaction (relative to 2025-01-15)
SELECT customer,
MAX(txn_date) AS last_txn,
JULIANDAY('2025-01-15') - JULIANDAY(MAX(txn_date)) AS days_since_last
FROM transactions
GROUP BY customer;
Layman explanation:
Recency is a key feature in customer churn models.
Here, we calculate days since the most recent transaction.
Example 3 — Time-series: lag feature
-- Compare each txn amount to previous txn of same customer
SELECT txn_id, customer, txn_date, amount,
LAG(amount) OVER (PARTITION BY customer ORDER BY txn_date) AS prev_amount,
(amount - LAG(amount) OVER (PARTITION BY customer ORDER BY txn_date)) AS diff_from_prev
FROM transactions;
Layman explanation:
LAG() → fetch the previous transaction's amount.
The diff_from_prev feature helps detect anomalies and trends.
Example 4 — Moving average feature
-- 2-transaction moving average per customer
SELECT txn_id, customer, txn_date, amount,
AVG(amount) OVER (
PARTITION BY customer
ORDER BY txn_date
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
) AS moving_avg
FROM transactions;
Layman explanation:
Moving averages smooth out noisy data.
Very useful for forecasting (stock prices, sales).
Example 5 — Data quality checks
-- Missing values check
SELECT * FROM transactions
WHERE customer IS NULL OR amount IS NULL OR txn_date IS NULL;
-- Duplicate detection
SELECT customer, txn_date, COUNT(*) AS dup_count
FROM transactions
GROUP BY customer, txn_date
HAVING COUNT(*) > 1;
-- Outlier detection (txns > 3x average)
WITH avg_txn AS (
SELECT AVG(amount) AS avg_amt FROM transactions
)
SELECT t.*
FROM transactions t, avg_txn
WHERE t.amount > 3 * avg_txn.avg_amt;
Layman explanation:
AI pipelines break if you have missing values or duplicates.
SQL can quickly check data quality before loading into ML.
Example 6 — SQL + Python integration (concept)
In real projects, you’ll pull SQL data into Pandas for ML.
import pandas as pd
from sqlalchemy import create_engine
# Connect to SQLite DB
engine = create_engine("sqlite:///mydb.db")
# Load features directly from SQL
query = """
SELECT customer,
SUM(amount) AS total_spent,
AVG(amount) AS avg_spent,
COUNT(*) AS txn_count
FROM transactions
GROUP BY customer;
"""
df = pd.read_sql(query, engine)
print(df)
Layman explanation:
Use pandas.read_sql() to fetch query results directly.
SQL does the aggregation, Python does the ML.
Quick checklist (AI/ML focus)
Feature engineering → totals, averages, recency, frequency.
Time-series → lag, lead, moving averages.
Data quality → check for NULLs, duplicates, outliers.
ETL in SQL → clean + transform before exporting to Python.
SQL + Pandas → combine the best of both worlds.
10 Exercises (Chapter 10)
Calculate each customer’s total transactions and average spend.
Find customers who spent more than 500 in total.
For each customer, calculate days since last txn (relative to 2025-01-15).
Use LAG() to show the difference between each txn and the previous txn.
Use a 3-transaction moving average per customer.
Check if any rows have NULL values in transactions.
Find if there are duplicate (customer, txn_date) rows.
Detect outlier transactions > 2× the customer's average.
Combine SQL + Python: pull the transaction summary into Pandas and plot a bar chart (code in Python).
Create features total_spent, txn_count, days_since_last → imagine these as model inputs.
✅ Summary of this chapter
SQL is not just for querying → it’s a feature engineering toolkit.
Time-series SQL (LAG, LEAD, moving averages) = gold for forecasting.
Data quality SQL prevents garbage-in → garbage-out in ML.
In real AI projects: SQL prepares the data → Python/R trains the model.