SQL Mastery for AI & Data: From Basics to Feature Engineering
The Complete SQL Roadmap for AI/ML Engineers
Python + AI is a strong combo, and SQL is a must-have because most AI/ML workflows need structured data from databases.
For AI/ML/data-related jobs, you don’t need to be a full-time DBA, but you must be comfortable with querying, cleaning, and transforming data efficiently.
Here’s a SQL roadmap for AI/Data/ML jobs 👇
🔹 1. SQL Basics (Foundations)
What is SQL? (DDL, DML, DQL, DCL, TCL)
Database concepts: tables, rows, columns, primary keys, foreign keys
SELECT, FROM, WHERE basics
Filtering data with conditions (=, <, >, <>, BETWEEN, IN, LIKE)
Sorting results with ORDER BY
Limiting results (LIMIT / TOP)
Aliases (AS)
🔹 2. Aggregations & Grouping
COUNT(), SUM(), AVG(), MIN(), MAX()
GROUP BY (single and multiple columns)
HAVING vs WHERE
Distinct values (DISTINCT)
🔹 3. Joins (Very Important for AI Jobs)
Inner Join
Left Join
Right Join
Full Outer Join
Self Join
Cross Join
Multi-table joins
👉 Most ML feature engineering needs joining multiple datasets.
🔹 4. Subqueries & Derived Tables
Simple subqueries (in WHERE, SELECT, FROM)
Correlated subqueries
EXISTS vs IN
Common Table Expressions (CTEs) with WITH
🔹 5. Advanced Filtering & Window Functions
CASE WHEN for conditional columns
COALESCE, NULLIF, handling NULL values
Ranking functions (ROW_NUMBER, RANK, DENSE_RANK)
Aggregates with windowing (SUM() OVER(), AVG() OVER())
Moving averages, cumulative sums
👉 Very useful for time series AI problems (stock, IoT, logs).
🔹 6. Data Manipulation
INSERT, UPDATE, DELETE
MERGE (UPSERT operations)
Bulk inserts
🔹 7. Set Operations
UNION vs UNION ALL
INTERSECT
EXCEPT / MINUS
🔹 8. Data Cleaning & Transformation (Critical for AI/ML)
String functions (TRIM, SUBSTRING, CONCAT, UPPER/LOWER)
Date & time functions (DATEADD, DATEDIFF, EXTRACT)
Converting data types (CAST, CONVERT)
Pivoting & unpivoting data (PIVOT, UNPIVOT)
🔹 9. Performance & Optimization (Intermediate Level)
Indexes basics (clustered, non-clustered)
Query execution plan (basic understanding)
EXPLAIN keyword to analyze queries
Using appropriate joins vs subqueries for performance
🔹 10. AI/Data-Job Specific SQL Topics
Writing queries for feature engineering
SQL for time series (lag, lead, window)
SQL for ETL workflows
SQL for data quality checks (missing, duplicates, anomalies)
SQL integration with Python (pandas + SQLAlchemy)
Using SQL in data warehouses (BigQuery, Snowflake, Redshift, Synapse)
✅ Summary:
For AI/data jobs, you need to be very strong in SELECT queries, joins, aggregations, subqueries, and window functions.
DBA-level topics (backups, replication, clustering) are not required unless you aim for data engineering.
✅ Run environment: examples below are written for SQLite (sqlite3). SQLite is zero-config and runs on Windows/Mac/Linux. If you use MySQL/Postgres, the logic is the same, but a few small syntax differences apply.
Chapter 1 — SQL Basics (Foundations)
What this chapter covers (plain language)
SQL is the language used to talk to relational databases. Think of a database as a set of spreadsheets (tables). SQL lets you:
Create and change tables (like creating a new sheet).
Put rows into tables (like typing rows into spreadsheets).
Ask questions (queries) — that’s the most common: “Give me all customers who bought X”.
Make simple filters, sort the results, and rename columns in the results.
You don’t need to be a database admin to be useful for AI/data work — you just need to be comfortable with asking the right questions and extracting the right data.
Very short vocabulary (layman)
Table: like a spreadsheet page (e.g., employees).
Row: one record (one person, one order).
Column: property of the row (name, price).
Primary Key (PK): unique id for each row (like a student ID).
Foreign Key (FK): a link from one table to a row in another table (like employee’s department id).
Query: a question you ask the database (SELECT ...).
DDL: commands that define structure (CREATE, DROP).
DML: commands that change data (INSERT, UPDATE, DELETE).
DQL: query language — SELECT statements (data retrieval).
How I’ll show examples
Each example block is self-contained: it drops the table if it exists, creates the table, inserts rows, and runs the SELECT you need to see the result. Copy the whole block, paste it into sqlite3, and it will run.
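If you prefer driving these blocks from Python rather than the sqlite3 shell, the standard-library sqlite3 module runs the same SQL unchanged. A minimal sketch (the three-column employees table here is a cut-down illustration, not the full example table):

```python
import sqlite3

# In-memory database: nothing touches disk, ideal for practice runs.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# executescript() runs a whole multi-statement block, just like pasting
# it into the sqlite3 shell.
cur.executescript("""
DROP TABLE IF EXISTS employees;
CREATE TABLE employees (id INTEGER PRIMARY KEY, first_name TEXT, salary INTEGER);
INSERT INTO employees (id, first_name, salary) VALUES
  (1, 'Asha', 50000),
  (2, 'John', 90000);
""")

# execute() runs a single statement; fetchall() returns rows as tuples.
rows = cur.execute(
    "SELECT first_name, salary FROM employees ORDER BY salary DESC"
).fetchall()
print(rows)  # [('John', 90000), ('Asha', 50000)]
conn.close()
```

Swap ":memory:" for a filename like practice.db if you want the tables to persist between sessions.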
Example 1 — Very first query: SELECT and SELECT *
What we want: get all rows and columns from a table.
-- Example 1: list all employees
DROP TABLE IF EXISTS employees;
CREATE TABLE employees (
id INTEGER PRIMARY KEY,
first_name TEXT,
last_name TEXT,
role TEXT,
salary INTEGER
);
INSERT INTO employees (id, first_name, last_name, role, salary) VALUES
(1, 'Asha', 'Kumar', 'Data Analyst', 50000),
(2, 'John', 'Doe', 'ML Engineer', 90000),
(3, 'Maya', 'Singh', 'Data Scientist', 85000),
(4, 'Liam', 'Ng', 'DevOps', 75000),
(5, 'Sara', 'Khan', 'Intern', 20000);
-- The simplest question: give me everything
SELECT * FROM employees;
Layman explanation:
DROP TABLE IF EXISTS — remove old table so the code can be run many times without error.
CREATE TABLE — defines columns.
INSERT INTO — adds rows (we manually provide IDs so this works everywhere).
SELECT * FROM employees; — “give me every column of every row” (like opening the whole spreadsheet).
Example 2 — Select specific columns and WHERE filter
What we want: pick certain columns and filter rows.
-- Example 2: basic column selection and filtering
DROP TABLE IF EXISTS products;
CREATE TABLE products (
product_id INTEGER PRIMARY KEY,
name TEXT,
category TEXT,
price INTEGER,
stock INTEGER
);
INSERT INTO products (product_id, name, category, price, stock) VALUES
(101, 'Latte', 'Beverage', 4, 100),
(102, 'Espresso', 'Beverage', 3, 50),
(103, 'Muffin', 'Food', 2, 0),
(104, 'Bagel', 'Food', 2, 30),
(105, 'GreenTea', 'Beverage', 3, 20);
-- Show only name and price for products that cost 3 or more
SELECT name, price
FROM products
WHERE price >= 3;
Layman explanation:
SELECT name, price — only fetch these columns (faster and cleaner).
WHERE price >= 3 — only rows where the price is at least 3.
WHERE is used to filter, like Excel filters.
More filters (run separately):
-- Products out of stock
SELECT * FROM products WHERE stock = 0;
-- Products that are either category 'Food' or price > 3
SELECT * FROM products WHERE category = 'Food' OR price > 3;
-- Products with names containing 'e' (simple text match)
SELECT * FROM products WHERE name LIKE '%e%';
-- Products with IDs in this small set
SELECT * FROM products WHERE product_id IN (101, 104);
Example 3 — ORDER BY and LIMIT (sorting and top-n)
What we want: sort results and take only top rows (e.g., top 3 salaries).
-- Example 3: orders and top-n
DROP TABLE IF EXISTS orders;
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer TEXT,
amount INTEGER,
order_date TEXT -- We'll store date as text for this beginner example
);
INSERT INTO orders (order_id, customer, amount, order_date) VALUES
(1, 'Asha', 120, '2025-09-01'),
(2, 'John', 500, '2025-09-02'),
(3, 'Maya', 300, '2025-09-03'),
(4, 'Liam', 50, '2025-08-28'),
(5, 'Sara', 700, '2025-09-05');
-- Show orders sorted by amount, highest first
SELECT * FROM orders ORDER BY amount DESC;
-- Show only the top 3 biggest orders
SELECT * FROM orders ORDER BY amount DESC LIMIT 3;
Layman explanation:
ORDER BY amount DESC — sorts rows by amount from big → small. ASC would be small → big.
LIMIT 3 — return only the first 3 rows of that sorted list (useful for top-k queries).
Example 4 — Column aliases & simple expressions
What we want: rename output columns and compute new columns.
-- Example 4: aliases and computed columns
DROP TABLE IF EXISTS customers;
CREATE TABLE customers (
cid INTEGER PRIMARY KEY,
first TEXT,
last TEXT,
country TEXT,
purchases INTEGER
);
INSERT INTO customers (cid, first, last, country, purchases) VALUES
(1, 'Ravi', 'Patel', 'India', 5),
(2, 'Emma', 'Brown', 'UK', 2),
(3, 'Chen', 'Lee', 'China', 8),
(4, 'Olivia', 'Garcia', 'Spain', 1);
-- Make a full_name column (concatenate) and show an adjusted metric
SELECT
cid,
first || ' ' || last AS full_name, -- in SQLite use || to join strings
purchases,
purchases * 10 AS purchase_score -- a computed column
FROM customers;
Layman explanation:
AS full_name gives a friendlier name to the calculated column.
first || ' ' || last combines first and last name (SQLite style).
purchases * 10 AS purchase_score shows how we can compute values on the fly — great for feature engineering later.
Note: In MySQL/Postgres you might use CONCAT(first, ' ', last) instead of || — but the above is SQLite-safe.
Example 5 — NULL, IS NULL, and COALESCE
What we want: learn how missing values behave and how to replace them.
-- Example 5: handling NULLs
DROP TABLE IF EXISTS inventory;
CREATE TABLE inventory (
item_id INTEGER PRIMARY KEY,
name TEXT,
location TEXT,
quantity INTEGER
);
INSERT INTO inventory (item_id, name, location, quantity) VALUES
(1, 'Widget A', 'Aisle 1', 10),
(2, 'Widget B', NULL, 5),
(3, 'Widget C', 'Aisle 3', NULL),
(4, 'Widget D', NULL, NULL);
-- Find rows where location is unknown
SELECT * FROM inventory WHERE location IS NULL;
-- Find rows where quantity is missing
SELECT * FROM inventory WHERE quantity IS NULL;
-- Use COALESCE to replace NULL with a default for display
SELECT item_id, name,
COALESCE(location, 'UNKNOWN') AS location_text,
COALESCE(quantity, 0) AS qty
FROM inventory;
Layman explanation:
NULL means “unknown” or “no value”.
IS NULL is how you check for unknowns. = NULL will not work.
COALESCE picks the first non-null value — handy to show readable defaults.
Example 6 — Primary key, foreign key, and a simple relationship
What we want: show how two tables relate (department → employees).
-- Example 6: departments and employees (basic FK)
PRAGMA foreign_keys = ON; -- make SQLite enforce foreign keys in this session
DROP TABLE IF EXISTS employees_dept;
DROP TABLE IF EXISTS departments;
CREATE TABLE departments (
id INTEGER PRIMARY KEY,
dept_name TEXT
);
CREATE TABLE employees_dept (
emp_id INTEGER PRIMARY KEY,
name TEXT,
dept_id INTEGER,
salary INTEGER,
FOREIGN KEY (dept_id) REFERENCES departments(id)
);
INSERT INTO departments (id, dept_name) VALUES
(10, 'Engineering'),
(20, 'Data'),
(30, 'HR');
INSERT INTO employees_dept (emp_id, name, dept_id, salary) VALUES
(100, 'Asha', 20, 60000),
(101, 'John', 10, 90000),
(102, 'Maya', 20, 85000),
(103, 'Liam', 10, 70000),
(104, 'Sara', 30, 40000);
-- Show employees and their dept_id (raw)
SELECT * FROM employees_dept;
-- Very simple join to show department name (we'll cover joins deeply later)
SELECT e.emp_id, e.name, e.salary, d.dept_name
FROM employees_dept e
JOIN departments d ON e.dept_id = d.id;
Layman explanation:
departments has a unique id per row — that’s the primary key.
employees_dept.dept_id stores the department id — a foreign key linking to departments.
The JOIN pulls department names into the employee view. This is how you stitch related tables together.
Quick checklist: the key basic commands you just learned
CREATE TABLE — make a new table.
DROP TABLE IF EXISTS — remove an old table.
INSERT INTO ... VALUES (...) — add rows.
SELECT column1, column2 FROM table — pick columns you want.
WHERE — filter rows by conditions.
ORDER BY ... ASC|DESC — sort results.
LIMIT n — take only the first n rows.
AS — rename a column in the results.
IS NULL / COALESCE — handle missing values.
Primary Key (unique id) and Foreign Key (link between tables).
10 Exercises (Chapter 1) — practice (use the tables from the examples)
Do these on your own. Each exercise assumes you already ran the example blocks above (so the tables exist). If they don’t, re-run the relevant example block before starting.
List names and roles: From employees, show first_name, last_name, and role only.
High earners: From employees, list rows where salary is greater than 70,000. Sort them by salary descending.
Products with vowel: From products, show items whose name contains the letter a (case sensitive is fine).
Out of stock: From products, return product name and category for all items with stock = 0.
Top customers: From orders, list the top 2 customers by amount.
Create a view (or simulated): Write a query that shows each customer and their amount as order_amount (use AS to rename the column).
Replace NULLs: From inventory, produce a list where location is replaced with 'UNKNOWN' and quantity replaced with 0. (Use COALESCE.)
Department lookup: Using employees_dept and departments, show name and dept_name for employees in the Data department only.
Add a new product: Insert a new product into products with product_id 106 and then query to show it exists.
Count rows: Use a SELECT to count how many rows are in employees (use COUNT(*)). This introduces a tiny aggregation — we’ll do full aggregations next chapter.
Short tips & pitfalls (layman)
Always DROP TABLE IF EXISTS at the top of example scripts when learning — avoids "table already exists" errors.
NULL is not the same as an empty string ''. Use IS NULL to check nulls.
WHERE filters rows; HAVING (we’ll meet later) filters groups.
Practice by copying the code into sqlite3 and stepping through it line by line. Change values, re-run, see what changes.
Chapter 2 — Aggregations & Grouping
What this chapter covers (plain language)
When you have lots of rows, you usually don’t want to see each row — you want summary numbers:
How many customers bought something?
What’s the average salary?
What’s the total revenue last week?
That’s what aggregate functions do. Then, with GROUP BY, you can split data into categories (e.g., sales per country).
The 5 most common aggregate functions
COUNT() → how many rows (or values).
SUM() → total of numbers.
AVG() → average of numbers.
MIN() → smallest value.
MAX() → biggest value.
Example 1 — Counting rows
-- Example 1: counting rows
DROP TABLE IF EXISTS sales;
CREATE TABLE sales (
sale_id INTEGER PRIMARY KEY,
product TEXT,
quantity INTEGER,
price INTEGER
);
INSERT INTO sales (sale_id, product, quantity, price) VALUES
(1, 'Latte', 2, 4),
(2, 'Latte', 1, 4),
(3, 'Espresso', 3, 3),
(4, 'Muffin', 5, 2),
(5, 'Espresso', 2, 3),
(6, 'Latte', 4, 4);
-- Count all rows
SELECT COUNT(*) AS total_sales FROM sales;
-- Count only rows where product = 'Latte'
SELECT COUNT(*) AS latte_sales FROM sales WHERE product = 'Latte';
Explanation:
COUNT(*) = number of rows in the table.
You can add WHERE to count only a subset.
Example 2 — SUM and AVG
-- Example 2: SUM and AVG
-- Total quantity sold
SELECT SUM(quantity) AS total_quantity FROM sales;
-- Average price of items
SELECT AVG(price) AS avg_price FROM sales;
-- Total revenue = quantity * price for all rows
SELECT SUM(quantity * price) AS total_revenue FROM sales;
Explanation:
SUM(quantity) adds up all quantities.
AVG(price) gives the mean of all prices.
You can calculate expressions inside aggregate functions (here: quantity * price).
Example 3 — MIN and MAX
-- Example 3: MIN and MAX
-- Cheapest item price
SELECT MIN(price) AS min_price FROM sales;
-- Most expensive item price
SELECT MAX(price) AS max_price FROM sales;
-- Largest quantity in a single sale
SELECT MAX(quantity) AS largest_order FROM sales;
Explanation:
MIN() finds the smallest number.
MAX() finds the largest number.
Example 4 — GROUP BY (splitting into categories)
-- Example 4: GROUP BY
-- Total quantity sold per product
SELECT product, SUM(quantity) AS total_qty
FROM sales
GROUP BY product;
-- Average price per product
SELECT product, AVG(price) AS avg_price
FROM sales
GROUP BY product;
-- Count how many times each product was sold
SELECT product, COUNT(*) AS order_count
FROM sales
GROUP BY product;
Explanation:
GROUP BY product means: split the table by product, then run the aggregate separately for each group.
Without GROUP BY, aggregates give one row. With GROUP BY, you get one row per group.
Example 5 — GROUP BY with multiple columns
-- Example 5: grouping by multiple columns
DROP TABLE IF EXISTS employee_salaries;
CREATE TABLE employee_salaries (
emp_id INTEGER PRIMARY KEY,
department TEXT,
role TEXT,
salary INTEGER
);
INSERT INTO employee_salaries (emp_id, department, role, salary) VALUES
(1, 'Data', 'Analyst', 50000),
(2, 'Data', 'Scientist', 85000),
(3, 'Engineering', 'DevOps', 75000),
(4, 'Engineering', 'Developer', 70000),
(5, 'Engineering', 'Developer', 72000),
(6, 'HR', 'Recruiter', 40000);
-- Average salary per department
SELECT department, AVG(salary) AS avg_salary
FROM employee_salaries
GROUP BY department;
-- Average salary per department + role
SELECT department, role, AVG(salary) AS avg_salary
FROM employee_salaries
GROUP BY department, role;
Explanation:
You can group by more than one column.
First query groups by department.
Second query groups by both department and role.
Example 6 — HAVING (filter groups)
-- Example 6: HAVING
-- Only show departments where average salary > 60000
SELECT department, AVG(salary) AS avg_salary
FROM employee_salaries
GROUP BY department
HAVING AVG(salary) > 60000;
-- Only show products sold more than 2 times
SELECT product, COUNT(*) AS order_count
FROM sales
GROUP BY product
HAVING COUNT(*) > 2;
Explanation:
WHERE filters rows before grouping.
HAVING filters after grouping.
Rule of thumb: WHERE is for rows, HAVING is for groups.
Example 7 — DISTINCT vs GROUP BY
-- Example 7: DISTINCT vs GROUP BY
DROP TABLE IF EXISTS students;
CREATE TABLE students (
id INTEGER PRIMARY KEY,
name TEXT,
class TEXT
);
INSERT INTO students (id, name, class) VALUES
(1, 'Asha', 'Math'),
(2, 'John', 'Science'),
(3, 'Maya', 'Math'),
(4, 'Liam', 'English'),
(5, 'Sara', 'Science');
-- Unique classes with DISTINCT
SELECT DISTINCT class FROM students;
-- Same result using GROUP BY
SELECT class FROM students GROUP BY class;
Explanation:
DISTINCT removes duplicates.
GROUP BY can also achieve that — but it’s more powerful because you can add aggregates.
10 Exercises (Chapter 2) — practice
Use the sales, employee_salaries, and students tables created above.
Find the total revenue per product from the sales table.
Find the average quantity per product.
Show the product with the maximum average price.
Find the total number of distinct products in the sales table (use COUNT(DISTINCT ...)).
From employee_salaries, show the highest salary in each department.
From employee_salaries, show department and role combinations where average salary is below 60,000.
From sales, show only products where the total quantity sold is more than 5.
From students, count how many students are in each class.
From sales, calculate the average revenue (quantity * price) per product.
From employee_salaries, calculate the total salary cost per department and sort results from highest to lowest.
✅ Summary of this chapter
COUNT, SUM, AVG, MIN, MAX summarize data.
GROUP BY splits data into categories.
HAVING filters groups.
DISTINCT removes duplicates (like Excel’s "Remove Duplicates").
Chapter 3 — Joins (Combining Tables)
Now let’s move to the most critical SQL concept for AI/Data jobs — Joins.
Without joins, you’ll only query one table at a time. But in the real world, data is spread across many tables — customer info, orders, payments, products. Joins let you connect them.
What this chapter covers (plain language)
Think of each table like a sheet in Excel. A join is like using VLOOKUP to bring data from one sheet into another — but more powerful.
We’ll cover:
Inner Join
Left Join
Right Join (not in older SQLite, but I’ll show the workaround)
Full Outer Join (also a workaround in older SQLite)
Cross Join
Self Join
Example setup (for all join examples)
We’ll create two related tables: customers and orders.
DROP TABLE IF EXISTS customers;
DROP TABLE IF EXISTS orders;
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
name TEXT,
country TEXT
);
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER,
product TEXT,
amount INTEGER
);
INSERT INTO customers (customer_id, name, country) VALUES
(1, 'Asha', 'India'),
(2, 'John', 'USA'),
(3, 'Maya', 'UK'),
(4, 'Liam', 'Canada');
INSERT INTO orders (order_id, customer_id, product, amount) VALUES
(101, 1, 'Laptop', 1000),
(102, 1, 'Mouse', 20),
(103, 2, 'Keyboard', 50),
(104, 2, 'Monitor', 200),
(105, 3, 'Tablet', 300);
Example 1 — INNER JOIN
-- Inner join: only customers who have placed orders
SELECT c.customer_id, c.name, o.product, o.amount
FROM customers c
INNER JOIN orders o
ON c.customer_id = o.customer_id;
Layman explanation:
INNER JOIN = "Only keep rows where both tables have a match."
Here: only customers who placed at least one order appear.
Liam (customer 4) does not appear because he has no orders.
Example 2 — LEFT JOIN
-- Left join: all customers, even if they have no orders
SELECT c.customer_id, c.name, o.product, o.amount
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id;
Layman explanation:
LEFT JOIN = “Take all rows from the left table (customers), and bring data from the right (orders) if it exists.”
If no match, the columns from the right table are NULL.
Liam now appears, but with NULL for product and amount.
Example 3 — RIGHT JOIN (simulated in SQLite)
SQLite only added RIGHT JOIN in version 3.39, so on older versions the portable trick is to flip the tables and use LEFT JOIN.
-- Right join equivalent: all orders, even if no matching customer
SELECT o.order_id, o.product, o.amount, c.name, c.country
FROM orders o
LEFT JOIN customers c
ON o.customer_id = c.customer_id;
Layman explanation:
This is a "right join" in disguise: take all orders, bring in customer info if available.
If there’s an order with a customer_id not in customers, it will show NULL for name/country.
Example 4 — FULL OUTER JOIN (simulated in SQLite)
Older SQLite versions (before 3.39) don’t support FULL JOIN directly either. But we can simulate it with UNION.
-- Full outer join = all customers + all orders
SELECT c.customer_id, c.name, o.product, o.amount
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
UNION
SELECT o.customer_id, c.name, o.product, o.amount
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id; -- the flipped LEFT JOIN stands in for RIGHT JOIN
Layman explanation:
FULL OUTER JOIN = “Give me everything, whether or not there’s a match.”
You’ll see customers without orders and orders without customers.
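To watch the simulation surface unmatched rows from both sides, here is a self-contained check via Python's built-in sqlite3. The two-row tables and the orphan 'Ghost' order are invented for illustration, and the second SELECT uses the LEFT JOIN flip trick instead of RIGHT JOIN so it runs on any SQLite version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, product TEXT);
INSERT INTO customers VALUES (1, 'Asha'), (4, 'Liam');            -- Liam has no orders
INSERT INTO orders VALUES (101, 1, 'Laptop'), (999, 99, 'Ghost'); -- customer 99 does not exist
""")

# Full outer join simulation: LEFT JOIN in both directions, glued with UNION
# (UNION also removes the duplicated matched rows).
rows = cur.execute("""
    SELECT c.name, o.product
    FROM customers c LEFT JOIN orders o ON c.customer_id = o.customer_id
    UNION
    SELECT c.name, o.product
    FROM orders o LEFT JOIN customers c ON o.customer_id = c.customer_id
""").fetchall()
print(rows)  # contains ('Asha', 'Laptop'), ('Liam', None), and (None, 'Ghost')
conn.close()
```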
Example 5 — CROSS JOIN
-- Cross join: every customer paired with every product
DROP TABLE IF EXISTS products;
CREATE TABLE products (
pid INTEGER PRIMARY KEY,
pname TEXT
);
INSERT INTO products (pid, pname) VALUES
(1, 'Phone'),
(2, 'Laptop'),
(3, 'Headphones');
-- Cartesian product
SELECT c.name, p.pname
FROM customers c
CROSS JOIN products p;
Layman explanation:
CROSS JOIN = every row of one table combined with every row of another.
Here, 4 customers × 3 products = 12 rows.
Usually used for generating combinations, not for normal reporting.
Example 6 — SELF JOIN
-- Self join: employees referencing managers
DROP TABLE IF EXISTS employees;
CREATE TABLE employees (
emp_id INTEGER PRIMARY KEY,
name TEXT,
manager_id INTEGER
);
INSERT INTO employees (emp_id, name, manager_id) VALUES
(1, 'Alice', NULL), -- CEO
(2, 'Bob', 1), -- Alice manages Bob
(3, 'Charlie', 1), -- Alice manages Charlie
(4, 'David', 2); -- Bob manages David
-- Self join: find employee and their manager name
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m
ON e.manager_id = m.emp_id;
Layman explanation:
A self join joins a table to itself.
Here, employees reference their manager via manager_id.
David → Bob, Bob → Alice, Alice has no manager.
Quick checklist: Joins
INNER JOIN = only matching rows.
LEFT JOIN = all from left + matches from right.
RIGHT JOIN = all from right + matches from left (simulate in SQLite).
FULL OUTER JOIN = all rows from both (simulate with UNION).
CROSS JOIN = every combination.
SELF JOIN = table joined with itself.
10 Exercises (Chapter 3)
Show all customers and their orders (use INNER JOIN).
Show all customers, even those without orders (use LEFT JOIN).
Show all orders, even if no customer exists for them (use RIGHT JOIN simulation).
Create a query to show customer name + country + total order amount (hint: join + SUM + GROUP BY).
Find customers who have no orders (LEFT JOIN + WHERE order_id IS NULL).
Find products that have never been ordered (use products + orders).
Using employees, list all employees with their manager name.
Using employees, find employees who are managers (appear in manager_id).
Generate all combinations of customer names and products (use CROSS JOIN).
Create a query that shows all customers with number of orders, including those with 0 orders.
✅ Summary of this chapter
Joins let you combine related tables.
They are the heart of SQL for AI/data jobs (most feature engineering is multi-table joins).
Must master LEFT JOIN + GROUP BY for real projects.
Chapter 4 — Subqueries & CTEs
Now we go to Chapter 4 — Subqueries & CTEs (Common Table Expressions).
This is where SQL gets really powerful: you can write queries inside queries to handle complex data questions.
What this chapter covers (plain language)
Subquery = a query inside another query.
You can put a subquery in:
WHERE → filter rows based on another query.
SELECT → create new calculated columns.
FROM → use subquery results as a temporary table.
Correlated Subquery = subquery that depends on the outer query.
CTE (Common Table Expression) = a readable way to build temporary tables with WITH.
Example setup
We’ll reuse customers and orders, and add payments.
DROP TABLE IF EXISTS customers;
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS payments;
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
name TEXT,
country TEXT
);
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER,
product TEXT,
amount INTEGER
);
CREATE TABLE payments (
payment_id INTEGER PRIMARY KEY,
order_id INTEGER,
payment_amount INTEGER
);
INSERT INTO customers (customer_id, name, country) VALUES
(1, 'Asha', 'India'),
(2, 'John', 'USA'),
(3, 'Maya', 'UK'),
(4, 'Liam', 'Canada');
INSERT INTO orders (order_id, customer_id, product, amount) VALUES
(101, 1, 'Laptop', 1000),
(102, 1, 'Mouse', 20),
(103, 2, 'Keyboard', 50),
(104, 2, 'Monitor', 200),
(105, 3, 'Tablet', 300);
INSERT INTO payments (payment_id, order_id, payment_amount) VALUES
(1, 101, 1000),
(2, 102, 20),
(3, 103, 40), -- underpaid
(4, 104, 200),
(5, 105, 300);
Example 1 — Subquery in WHERE
-- Customers who have placed at least one order
SELECT name
FROM customers
WHERE customer_id IN (
SELECT DISTINCT customer_id
FROM orders
);
Layman explanation:
First the inner query finds all customer_id from orders.
The outer query checks which customers are in that list.
This avoids having to join.
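You can confirm from Python (stdlib sqlite3, same rows as the chapter setup minus the unused columns) that the IN subquery returns exactly the customers the equivalent join would:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'John'), (3, 'Maya'), (4, 'Liam');
INSERT INTO orders VALUES
  (101, 1, 1000), (102, 1, 20), (103, 2, 50), (104, 2, 200), (105, 3, 300);
""")

# Version 1: subquery in WHERE (the query from this example)
sub = cur.execute("""
    SELECT name FROM customers
    WHERE customer_id IN (SELECT DISTINCT customer_id FROM orders)
    ORDER BY name
""").fetchall()

# Version 2: the join-based equivalent; DISTINCT collapses repeat buyers to one row
joined = cur.execute("""
    SELECT DISTINCT c.name
    FROM customers c JOIN orders o ON c.customer_id = o.customer_id
    ORDER BY c.name
""").fetchall()

print(sub)  # [('Asha',), ('John',), ('Maya',)] -- Liam has no orders
conn.close()
```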
Example 2 — Subquery in SELECT
-- Show each customer and their total order amount
SELECT
name,
(SELECT SUM(amount)
FROM orders o
WHERE o.customer_id = c.customer_id) AS total_spent
FROM customers c;
Layman explanation:
For each customer row, SQL runs the subquery to sum their orders.
This is like an Excel formula in each row.
Example 3 — Subquery in FROM (derived table)
-- Find customers and their average order size
SELECT c.name, avg_orders.avg_amount
FROM customers c
JOIN (
SELECT customer_id, AVG(amount) AS avg_amount
FROM orders
GROUP BY customer_id
) avg_orders
ON c.customer_id = avg_orders.customer_id;
Layman explanation:
The subquery in FROM calculates average order per customer.
We join that temporary table back to customers.
This is more efficient when reusing aggregates.
Example 4 — Correlated subquery
-- Find customers who have an order larger than 500
SELECT name
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.amount > 500
);
Layman explanation:
EXISTS returns true if the subquery finds at least one row.
Here, it checks if each customer has a big order (>500).
This depends on the outer query → correlated subquery.
Example 5 — CTE (WITH clause)
-- CTE to calculate total payments vs orders
WITH order_totals AS (
SELECT o.order_id, o.amount AS order_amount,
COALESCE(SUM(p.payment_amount), 0) AS total_paid
FROM orders o
LEFT JOIN payments p ON o.order_id = p.order_id
GROUP BY o.order_id
)
SELECT order_id, order_amount, total_paid,
(order_amount - total_paid) AS balance
FROM order_totals;
Layman explanation:
WITH order_totals AS (...) defines a temporary table.
Then you can query it like a normal table.
Much cleaner than nested subqueries, easier to read.
Example 6 — Multiple CTEs
-- CTEs can be chained
WITH order_totals AS (
SELECT customer_id, SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id
),
customer_payments AS (
SELECT o.customer_id, SUM(p.payment_amount) AS total_paid
FROM orders o
JOIN payments p ON o.order_id = p.order_id
GROUP BY o.customer_id
)
SELECT c.name, o.total_amount, cp.total_paid
FROM customers c
LEFT JOIN order_totals o ON c.customer_id = o.customer_id
LEFT JOIN customer_payments cp ON c.customer_id = cp.customer_id;
Layman explanation:
First CTE calculates total order amount per customer.
Second CTE calculates total payments per customer.
Final query joins them all for a clear report.
Quick checklist
Subqueries can be in WHERE, SELECT, or FROM.
Correlated subqueries depend on the outer query.
EXISTS is efficient for "does a row exist?" checks.
CTEs (WITH) make queries cleaner and reusable.
10 Exercises (Chapter 4)
Write a query to list customers who have not placed any orders (use subquery in WHERE).
Show each customer’s name and their largest order amount (subquery in SELECT).
Using a subquery in FROM, calculate average payment per order.
Find customers who spent more than 500 in total (use correlated subquery).
Use EXISTS to find orders that have payments recorded.
Write a CTE to show orders and their remaining balance (order amount – total paid).
Write a query with two CTEs: one for total orders per customer, one for total payments, then join.
Find customers who have orders but no payments.
Use a subquery in FROM to calculate the top spender customer_id, then join to show their name.
Use a correlated subquery to show each order and whether it is above that customer’s average order size.
✅ Summary of this chapter
Subqueries let you nest questions inside questions.
CTEs make big queries readable and modular.
These are essential for complex analytics, especially in AI/ML pipelines where you need to prepare features from multiple data sources.
Chapter 5 — Advanced Filtering & Window Functions
Now we’re moving into Chapter 5 — Advanced Filtering & Window Functions.
This is one of the most critical chapters for AI/Data jobs because:
You’ll need to handle ranking (e.g., top N customers).
You’ll calculate running totals, moving averages, differences → essential in time-series problems (stocks, sensors, logs).
You’ll handle conditional logic for feature engineering.
What this chapter covers (plain language)
CASE WHEN → if-else conditions inside SQL.
COALESCE / NULLIF → deal with NULL values.
Window functions (OVER()) → look at rows relative to each other.
Ranking: ROW_NUMBER(), RANK(), DENSE_RANK().
Running totals, moving averages: SUM() OVER(), AVG() OVER().
Time-series helpers: LAG(), LEAD().
Example setup (sales data)
DROP TABLE IF EXISTS sales;
CREATE TABLE sales (
sale_id INTEGER PRIMARY KEY,
customer TEXT,
amount INTEGER,
sale_date TEXT
);
INSERT INTO sales (sale_id, customer, amount, sale_date) VALUES
(1, 'Asha', 100, '2025-09-01'),
(2, 'John', 200, '2025-09-01'),
(3, 'Asha', 300, '2025-09-02'),
(4, 'Maya', 150, '2025-09-02'),
(5, 'John', 400, '2025-09-03'),
(6, 'Asha', 250, '2025-09-04'),
(7, 'Maya', 500, '2025-09-05');
Example 1 — CASE WHEN (if-else logic)
-- Label sales as 'High' if > 300 else 'Low'
SELECT sale_id, customer, amount,
CASE
WHEN amount > 300 THEN 'High'
ELSE 'Low'
END AS sale_category
FROM sales;
Layman explanation:
CASE WHEN works like IF in Excel or Python.
Here, sales above 300 are "High", others "Low".
You can nest multiple conditions too.
Example 2 — COALESCE and NULLIF
DROP TABLE IF EXISTS payments;
CREATE TABLE payments (
pay_id INTEGER PRIMARY KEY,
customer TEXT,
amount INTEGER,
discount INTEGER
);
INSERT INTO payments (pay_id, customer, amount, discount) VALUES
(1, 'Asha', 100, NULL),
(2, 'John', 200, 20),
(3, 'Maya', 150, 0),
(4, 'Liam', 300, NULL);
-- Replace NULL discount with 0
SELECT pay_id, customer, amount,
COALESCE(discount, 0) AS discount_applied
FROM payments;
-- Avoid divide-by-zero errors with NULLIF
SELECT pay_id, customer,
amount / NULLIF(discount, 0) AS ratio
FROM payments;
Layman explanation:
COALESCE(x, 0) replaces NULL with a default (0).
NULLIF(x, 0) returns NULL if x = 0 (avoids divide-by-zero errors).
Example 3 — ROW_NUMBER, RANK, DENSE_RANK
-- Rank sales per customer
SELECT customer, amount,
ROW_NUMBER() OVER (PARTITION BY customer ORDER BY amount DESC) AS row_num,
RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS rank_val,
DENSE_RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS dense_rank_val
FROM sales;
Layman explanation:
ROW_NUMBER() → unique sequence (1, 2, 3, …).
RANK() → leaves gaps (1, 2, 2, 4).
DENSE_RANK() → no gaps (1, 2, 2, 3).
PARTITION BY customer → restart numbering for each customer.
Example 4 — Running totals with SUM OVER
-- Running total of sales by date
SELECT sale_date, customer, amount,
SUM(amount) OVER (ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM sales;
Layman explanation:
SUM(amount) OVER (ORDER BY sale_date) → running total.
UNBOUNDED PRECEDING = start from the first row.
CURRENT ROW = stop at this row.
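As a sanity check, you can run the running total end to end with Python's sqlite3 module (window functions need SQLite 3.25+, which recent Python builds bundle). One tweak versus the query above: sale_id is added to ORDER BY here as a tie-breaker, since two sales share the date '2025-09-01' and the row order would otherwise be ambiguous:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # window functions require SQLite 3.25+
conn.executescript("""
CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER, sale_date TEXT);
INSERT INTO sales VALUES
 (1,'Asha',100,'2025-09-01'), (2,'John',200,'2025-09-01'),
 (3,'Asha',300,'2025-09-02'), (4,'Maya',150,'2025-09-02'),
 (5,'John',400,'2025-09-03'), (6,'Asha',250,'2025-09-04'),
 (7,'Maya',500,'2025-09-05');
""")

rows = conn.execute("""
    SELECT sale_date, amount,
           SUM(amount) OVER (
               ORDER BY sale_date, sale_id   -- sale_id breaks date ties deterministically
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM sales;
""").fetchall()

# The final running total equals the grand total of all sales
print(rows[-1])  # ('2025-09-05', 500, 1900)
```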
Example 5 — Moving averages with AVG OVER
-- 2-day moving average sales amount
SELECT sale_date, customer, amount,
AVG(amount) OVER (
ORDER BY sale_date
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
) AS moving_avg
FROM sales;
Layman explanation:
Looks at "current row + previous 1 row".
Calculates average over that window.
Useful for smoothing noisy time-series data.
Example 6 — LAG and LEAD (time-series difference)
-- Compare each sale with previous sale (LAG)
SELECT sale_id, customer, amount, sale_date,
LAG(amount) OVER (PARTITION BY customer ORDER BY sale_date) AS prev_amount,
(amount - LAG(amount) OVER (PARTITION BY customer ORDER BY sale_date)) AS change_from_prev
FROM sales;
-- Look ahead with LEAD
SELECT sale_id, customer, amount, sale_date,
LEAD(amount) OVER (PARTITION BY customer ORDER BY sale_date) AS next_amount
FROM sales;
Layman explanation:
LAG(x) → fetch the previous row's value.
LEAD(x) → fetch the next row's value.
Great for calculating differences, growth rates, and trends.
Quick checklist
CASE WHEN = if-else in SQL.
COALESCE = replace NULL.
NULLIF = avoid divide-by-zero.
ROW_NUMBER, RANK, DENSE_RANK = row ranking.
SUM OVER = running total.
AVG OVER = moving average.
LAG / LEAD = compare with previous/next row.
10 Exercises (Chapter 5)
Label sales as "Big" if > 250, "Medium" if between 100–250, else "Small".
Show payments with NULL discounts replaced by 0 using COALESCE.
Find the top 2 highest sales per customer (use ROW_NUMBER).
Calculate the total revenue so far (running total) ordered by sale_date.
Find the average sales amount per customer (use AVG() OVER (PARTITION BY ...)).
Show each sale and the difference from the previous sale amount for that customer (use LAG).
Show each sale and the next sale amount for that customer (use LEAD).
Create a query that calculates a 3-day moving average of sales.
Rank customers by total sales amount across all dates (use RANK).
Write a query that shows sales where the amount is greater than the customer's average sale amount (use window AVG + comparison).
✅ Summary of this chapter
Window functions let you analyze trends across rows — essential for AI (time-series, ranking, anomaly detection).
Mastering ROW_NUMBER, RANK, LAG, LEAD, and running totals will make your SQL stand out.
Chapter 6 — Data Manipulation (INSERT, UPDATE, DELETE, MERGE)
Now we’ll cover Chapter 6 — Data Manipulation (INSERT, UPDATE, DELETE, MERGE/UPSERT).
This is about changing data, not just reading it. In AI/data workflows, you’ll often insert cleaned data, update values, or delete errors. For ETL (Extract → Transform → Load), this chapter is key.
What this chapter covers (plain language)
INSERT → add new rows.
UPDATE → change values in existing rows.
DELETE → remove rows.
MERGE (or UPSERT) → insert new rows, or update them if they already exist.
Example setup
DROP TABLE IF EXISTS employees;
CREATE TABLE employees (
emp_id INTEGER PRIMARY KEY,
name TEXT,
role TEXT,
salary INTEGER
);
INSERT INTO employees (emp_id, name, role, salary) VALUES
(1, 'Asha', 'Data Analyst', 50000),
(2, 'John', 'ML Engineer', 90000),
(3, 'Maya', 'Data Scientist', 85000);
Example 1 — INSERT
-- Insert a single row
INSERT INTO employees (emp_id, name, role, salary)
VALUES (4, 'Liam', 'DevOps', 75000);
-- Insert multiple rows at once
INSERT INTO employees (emp_id, name, role, salary)
VALUES
(5, 'Sara', 'Intern', 20000),
(6, 'Chen', 'Data Engineer', 80000);
-- Check the data
SELECT * FROM employees;
Layman explanation:
INSERT INTO table (columns) VALUES (values) adds a row.
You can add multiple rows in one statement.
Example 2 — UPDATE
-- Update Sara's salary
UPDATE employees
SET salary = 25000
WHERE name = 'Sara';
-- Increase salary of all "Data" roles by 10%
UPDATE employees
SET salary = salary * 1.1
WHERE role LIKE 'Data%';
-- Check results
SELECT * FROM employees;
Layman explanation:
UPDATE table SET column = value WHERE condition.
Without WHERE, all rows will be updated (dangerous!).
Example 3 — DELETE
-- Delete interns
DELETE FROM employees
WHERE role = 'Intern';
-- Delete all employees with salary < 30000
DELETE FROM employees
WHERE salary < 30000;
-- Check
SELECT * FROM employees;
Layman explanation:
DELETE FROM table WHERE condition.
If you forget WHERE, it deletes everything. Always double-check!
Example 4 — UPSERT (INSERT OR REPLACE in SQLite)
-- If emp_id exists, replace it. If not, insert new.
INSERT OR REPLACE INTO employees (emp_id, name, role, salary)
VALUES (2, 'John', 'ML Engineer', 95000);
-- Insert new employee with same method
INSERT OR REPLACE INTO employees (emp_id, name, role, salary)
VALUES (7, 'Emma', 'AI Researcher', 100000);
SELECT * FROM employees;
Layman explanation:
UPSERT = Update if exists, Insert if not.
In SQLite: INSERT OR REPLACE.
In PostgreSQL: INSERT ... ON CONFLICT (id) DO UPDATE.
In MySQL: INSERT ... ON DUPLICATE KEY UPDATE.
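SQLite has also supported the PostgreSQL-style ON CONFLICT clause since version 3.24, which is usually safer than INSERT OR REPLACE because it updates only the listed columns rather than deleting and re-inserting the whole row. A minimal sketch via Python's sqlite3 (assuming a build with SQLite 3.24+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # ON CONFLICT ... DO UPDATE needs SQLite 3.24+
conn.executescript("""
CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, role TEXT, salary INTEGER);
INSERT INTO employees VALUES (2, 'John', 'ML Engineer', 90000);
""")

# One statement handles both cases: update on key collision, insert otherwise.
# 'excluded' refers to the row that failed to insert.
upsert = """
    INSERT INTO employees (emp_id, name, role, salary)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (emp_id) DO UPDATE SET salary = excluded.salary;
"""
conn.execute(upsert, (2, 'John', 'ML Engineer', 95000))    # existing -> salary updated
conn.execute(upsert, (7, 'Emma', 'AI Researcher', 100000)) # new -> row inserted

rows = conn.execute("SELECT emp_id, name, salary FROM employees ORDER BY emp_id").fetchall()
print(rows)  # [(2, 'John', 95000), (7, 'Emma', 100000)]
```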
Example 5 — MERGE (manual simulation for SQLite)
SQLite doesn’t support MERGE directly, but we can simulate it.
Suppose we have a new data table:
DROP TABLE IF EXISTS new_employees;
CREATE TABLE new_employees (
emp_id INTEGER PRIMARY KEY,
name TEXT,
role TEXT,
salary INTEGER
);
INSERT INTO new_employees (emp_id, name, role, salary) VALUES
(2, 'John', 'ML Engineer', 97000), -- existing employee (update)
(8, 'Olivia', 'Data Engineer', 82000); -- new employee
Simulate MERGE:
-- Step 1: Update existing
UPDATE employees
SET salary = (SELECT salary FROM new_employees ne WHERE ne.emp_id = employees.emp_id)
WHERE emp_id IN (SELECT emp_id FROM new_employees);
-- Step 2: Insert new ones
INSERT INTO employees (emp_id, name, role, salary)
SELECT ne.emp_id, ne.name, ne.role, ne.salary
FROM new_employees ne
WHERE ne.emp_id NOT IN (SELECT emp_id FROM employees);
-- Check final
SELECT * FROM employees;
Layman explanation:
First, UPDATE matches rows by emp_id and changes the salary.
Then, INSERT adds any rows missing from employees.
That’s what MERGE does in big databases (Oracle, SQL Server).
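The two-step simulation above can be run end to end with Python's sqlite3 module to confirm it behaves like a MERGE, using the chapter's tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, role TEXT, salary INTEGER);
INSERT INTO employees VALUES
 (1,'Asha','Data Analyst',50000), (2,'John','ML Engineer',90000), (3,'Maya','Data Scientist',85000);

CREATE TABLE new_employees (emp_id INTEGER PRIMARY KEY, name TEXT, role TEXT, salary INTEGER);
INSERT INTO new_employees VALUES
 (2,'John','ML Engineer',97000),      -- existing: should be updated
 (8,'Olivia','Data Engineer',82000);  -- new: should be inserted

-- Step 1: update rows that already exist
UPDATE employees
SET salary = (SELECT salary FROM new_employees ne WHERE ne.emp_id = employees.emp_id)
WHERE emp_id IN (SELECT emp_id FROM new_employees);

-- Step 2: insert rows that don't exist yet
INSERT INTO employees (emp_id, name, role, salary)
SELECT ne.emp_id, ne.name, ne.role, ne.salary
FROM new_employees ne
WHERE ne.emp_id NOT IN (SELECT emp_id FROM employees);
""")

rows = dict(conn.execute("SELECT emp_id, salary FROM employees").fetchall())
print(rows)  # John updated to 97000, Olivia (8) inserted, others untouched
```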
Quick checklist
INSERT → add rows.
UPDATE → change rows (always use WHERE).
DELETE → remove rows (always use WHERE).
UPSERT/MERGE → insert or update depending on whether the row exists.
10 Exercises (Chapter 6)
Insert a new employee "Ravi" with role "Data Intern" and salary 15000.
Insert 2 new employees in one query: "Emma" (AI Engineer, 110000) and "Noah" (ML Intern, 18000).
Update the salary of all "Engineer" roles by +5000.
Update "Asha"'s role from "Data Analyst" to "Senior Data Analyst".
Delete all employees with role "Intern".
Delete employees whose salary is less than 20000.
Use UPSERT to update John's salary to 120000.
Use UPSERT to insert a new employee "David" (DevOps, 70000).
Create a new table new_hires with 2 rows (1 existing emp_id, 1 new). Simulate MERGE: update the existing one, insert the new one.
Write a query that deletes all employees with "Data" in their role, but only if salary < 60000.
✅ Summary of this chapter
These commands are part of ETL and data cleaning.
In AI/data projects, you often insert raw data, update corrections, delete wrong rows, or merge new data feeds.
Always be careful with UPDATE and DELETE: forgetting WHERE wipes everything.
Chapter 7 — Set Operations
Let’s move to Chapter 7 — Set Operations (UNION, INTERSECT, EXCEPT).
Set operations are about combining results of multiple queries. This is super useful in data integration, filtering, and feature creation.
What this chapter covers (plain language)
UNION → combine results, remove duplicates.
UNION ALL → combine results, keep duplicates.
INTERSECT → rows that appear in both queries.
EXCEPT / MINUS → rows in one query but not the other.
👉 These are like Venn diagram operations for SQL.
Example setup
DROP TABLE IF EXISTS team_a;
DROP TABLE IF EXISTS team_b;
CREATE TABLE team_a (
name TEXT
);
CREATE TABLE team_b (
name TEXT
);
INSERT INTO team_a (name) VALUES
('Asha'),
('John'),
('Maya'),
('Liam');
INSERT INTO team_b (name) VALUES
('Maya'),
('John'),
('Olivia'),
('Emma');
Team A: Asha, John, Maya, Liam
Team B: Maya, John, Olivia, Emma
Example 1 — UNION (no duplicates)
-- All unique members from both teams
SELECT name FROM team_a
UNION
SELECT name FROM team_b;
Result: Asha, John, Liam, Maya, Olivia, Emma (no duplicates).
Layman explanation:
UNION merges results from both queries.
Duplicates are removed automatically.
Example 2 — UNION ALL (keep duplicates)
-- All members, including duplicates
SELECT name FROM team_a
UNION ALL
SELECT name FROM team_b;
Result: John and Maya appear twice (once from each team).
Layman explanation:
UNION ALL keeps duplicates.
Faster than UNION (since there is no deduplication step).
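A quick sanity check of the difference, using Python's sqlite3 module with the two team tables: UNION should yield 6 unique names, UNION ALL all 8 rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE team_a (name TEXT);
CREATE TABLE team_b (name TEXT);
INSERT INTO team_a VALUES ('Asha'), ('John'), ('Maya'), ('Liam');
INSERT INTO team_b VALUES ('Maya'), ('John'), ('Olivia'), ('Emma');
""")

# UNION deduplicates; John and Maya are counted once
union_rows = conn.execute(
    "SELECT name FROM team_a UNION SELECT name FROM team_b").fetchall()
# UNION ALL keeps every row from both queries
union_all_rows = conn.execute(
    "SELECT name FROM team_a UNION ALL SELECT name FROM team_b").fetchall()

print(len(union_rows), len(union_all_rows))  # 6 8
```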
Example 3 — INTERSECT
-- Members common to both teams
SELECT name FROM team_a
INTERSECT
SELECT name FROM team_b;
Result: John, Maya
Layman explanation:
INTERSECT = only rows that appear in both sets.
Like the overlap in a Venn diagram.
Example 4 — EXCEPT (or MINUS)
-- Members in Team A but not in Team B
SELECT name FROM team_a
EXCEPT
SELECT name FROM team_b;
Result: Asha, Liam
Layman explanation:
EXCEPT = rows from the first query that are not in the second.
In Oracle, it's called MINUS.
Example 5 — More practical example (Orders vs Payments)
Let’s reuse orders and payments.
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS payments;
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer TEXT
);
CREATE TABLE payments (
payment_id INTEGER PRIMARY KEY,
order_id INTEGER
);
INSERT INTO orders (order_id, customer) VALUES
(1, 'Asha'),
(2, 'John'),
(3, 'Maya'),
(4, 'Liam');
INSERT INTO payments (payment_id, order_id) VALUES
(101, 1),
(102, 2),
(103, 3);
Orders: 1–4
Payments: 1–3
-- Orders that have been paid (intersection)
SELECT order_id FROM orders
INTERSECT
SELECT order_id FROM payments;
-- Orders that have not been paid (set difference)
SELECT order_id FROM orders
EXCEPT
SELECT order_id FROM payments;
Result:
Paid: 1, 2, 3
Not Paid: 4
Quick checklist
UNION = combine, remove duplicates.
UNION ALL = combine, keep duplicates.
INTERSECT = common rows.
EXCEPT = rows in the first query but not the second.
10 Exercises (Chapter 7)
List all unique names from team_a and team_b (use UNION).
(use UNION).List all names including duplicates (use UNION ALL).
Find names common to both teams (use INTERSECT).
Find names only in Team A (use EXCEPT).
Find names only in Team B (use EXCEPT).
Using orders and payments, find orders that have been paid (use INTERSECT).
Using orders and payments, find orders that have not been paid (use EXCEPT).
Insert a duplicate row into team_a (John again). Show the difference between UNION and UNION ALL results.
Combine all customers and all employees into one list (create small extra tables).
Write a query to get customers who placed orders but are not in payments table.
✅ Summary of this chapter
Set operations are like Venn diagrams in SQL.
They are powerful for data comparison (paid vs unpaid, active vs inactive, A/B group differences).
Very handy in AI for feature creation and labeling.
Chapter 8 — Data Cleaning & Transformation
Let’s continue with Chapter 8 — Data Cleaning & Transformation.
This is one of the most practical chapters for AI/ML, because before you train any model, you’ll spend 70–80% of your time cleaning and transforming raw data.
What this chapter covers (plain language)
String functions → cleaning names, trimming spaces, searching text.
Date & time functions → extracting day/month/year, calculating differences.
Type conversion → converting text to numbers or dates.
Pivot/Unpivot → reshape data from wide → long or long → wide.
Example setup
DROP TABLE IF EXISTS raw_data;
CREATE TABLE raw_data (
id INTEGER PRIMARY KEY,
name TEXT,
email TEXT,
signup_date TEXT,
score TEXT
);
INSERT INTO raw_data (id, name, email, signup_date, score) VALUES
(1, ' Asha ', 'asha@example.com', '2025-01-10', '85'),
(2, 'John', 'john@example.com', '2025-02-15', '90'),
(3, 'Maya Singh', 'maya_singh@example.com', '2025-03-20', 'NULL'),
(4, 'Liam', 'liam@example.com', '2025-03-25', '72'),
(5, 'Sara ', 'sara@example.com', '2025-04-05', '88');
Example 1 — String functions
-- TRIM spaces
SELECT id, name, TRIM(name) AS clean_name FROM raw_data;
-- UPPER / LOWER
SELECT id, UPPER(name) AS upper_name, LOWER(name) AS lower_name FROM raw_data;
-- SUBSTRING (get first 4 letters)
SELECT id, SUBSTR(name, 1, 4) AS short_name FROM raw_data;
-- FIND position of '@' in email
SELECT id, INSTR(email, '@') AS at_position FROM raw_data;
-- CONCAT (combine name + email)
SELECT id, name || ' <' || email || '>' AS contact_info FROM raw_data;
Layman explanation:
TRIM() → removes extra spaces.
UPPER(), LOWER() → normalize text.
SUBSTR() → extract part of a string.
INSTR() → find a character's position.
|| (SQLite) → concatenate strings.
Example 2 — Date functions
-- Extract year, month, day
SELECT id, signup_date,
STRFTIME('%Y', signup_date) AS year,
STRFTIME('%m', signup_date) AS month,
STRFTIME('%d', signup_date) AS day
FROM raw_data;
-- Days since signup
SELECT id, signup_date,
JULIANDAY('2025-09-16') - JULIANDAY(signup_date) AS days_since_signup
FROM raw_data;
Layman explanation:
STRFTIME('%Y') → year, %m → month, %d → day.
JULIANDAY(date) → converts a date to a number, so you can subtract dates.
Example 3 — Type conversion
-- Convert score from TEXT to INTEGER
SELECT id, name, CAST(score AS INTEGER) AS score_num
FROM raw_data;
Layman explanation:
CAST(x AS INTEGER) converts text → number.
Useful when data is stored as text in CSV files.
Example 4 — Handling NULLs in transformation
-- Replace invalid score 'NULL' with 0
SELECT id, name,
CASE WHEN score = 'NULL' THEN 0 ELSE CAST(score AS INTEGER) END AS clean_score
FROM raw_data;
-- OR using NULLIF + COALESCE
SELECT id, name,
COALESCE(NULLIF(score, 'NULL'), 0) AS clean_score
FROM raw_data;
Layman explanation:
NULLIF(score, 'NULL') → treat the string 'NULL' as a real NULL.
COALESCE(..., 0) → replace NULL with 0.
Example 5 — Pivot (long → wide)
Let’s make a table with scores in subjects.
DROP TABLE IF EXISTS marks;
CREATE TABLE marks (
student TEXT,
subject TEXT,
score INTEGER
);
INSERT INTO marks (student, subject, score) VALUES
('Asha', 'Math', 85),
('Asha', 'Science', 90),
('John', 'Math', 78),
('John', 'Science', 88);
Pivot (manual in SQLite):
SELECT student,
SUM(CASE WHEN subject = 'Math' THEN score END) AS math_score,
SUM(CASE WHEN subject = 'Science' THEN score END) AS science_score
FROM marks
GROUP BY student;
Result:
Asha → 85, 90
John → 78, 88
Layman explanation:
Pivot = convert "rows" into "columns".
We simulate it with CASE WHEN.
Example 6 — Unpivot (wide → long)
Suppose you have wide-format data:
DROP TABLE IF EXISTS student_scores;
CREATE TABLE student_scores (
student TEXT,
math INTEGER,
science INTEGER
);
INSERT INTO student_scores (student, math, science) VALUES
('Asha', 85, 90),
('John', 78, 88);
Unpivot (manual in SQLite):
SELECT student, 'Math' AS subject, math AS score FROM student_scores
UNION ALL
SELECT student, 'Science' AS subject, science AS score FROM student_scores;
Result:
Asha, Math, 85
Asha, Science, 90
John, Math, 78
John, Science, 88
Layman explanation:
Unpivot = convert "columns" into "rows".
Useful when you need to normalize wide Excel-like tables.
Quick checklist
TRIM, UPPER, SUBSTR, INSTR → clean strings.
STRFTIME, JULIANDAY → extract and calculate with dates.
CAST, COALESCE, NULLIF → fix dirty types and missing values.
CASE WHEN inside SELECT → conditional transformation.
Pivot = rows → columns.
Unpivot = columns → rows.
10 Exercises (Chapter 8)
Clean the raw_data.name column by trimming spaces.
Convert all names to uppercase.
Extract the year of signup for each customer.
Find customers who signed up in March (%m = '03').
Convert the score column to integer, treating 'NULL' as 0.
Show the number of days since each customer signed up (relative to today).
Create a pivot table from marks showing each student's math and science scores.
Create an unpivot table from student_scores showing subject + score.
Find customers whose email domain is example.com (use SUBSTR/INSTR).
Create a new column score_category: "High" if score ≥ 85, "Low" otherwise.
✅ Summary of this chapter
This is the bread and butter of feature engineering: clean strings, handle dates, fix types, reshape data.
Pivot/unpivot is especially important in analytics pipelines.
Mastering this chapter makes you valuable for AI jobs, since 80% of ML time is data prep.
Chapter 9 — Performance & Optimization
Let’s dive into Chapter 9 — Performance & Optimization.
This is a must-know for AI/data jobs, because when you query millions of rows, slow SQL will waste hours. Even as a data/AI engineer, you need to understand indexes and query tuning basics.
What this chapter covers (plain language)
Why queries get slow.
How indexes speed things up.
How to check what SQL is doing (EXPLAIN).
Basic tips to make queries faster.
⚡ Reminder: You don’t need DBA-level tuning, but you must know enough to avoid writing queries that take hours instead of seconds.
Example setup
DROP TABLE IF EXISTS big_orders;
CREATE TABLE big_orders (
order_id INTEGER PRIMARY KEY,
customer TEXT,
amount INTEGER,
order_date TEXT
);
-- Insert 20 rows (imagine millions in real life)
INSERT INTO big_orders (order_id, customer, amount, order_date) VALUES
(1, 'Asha', 200, '2025-01-01'),
(2, 'John', 500, '2025-01-02'),
(3, 'Maya', 300, '2025-01-03'),
(4, 'Liam', 700, '2025-01-04'),
(5, 'Sara', 150, '2025-01-05'),
(6, 'Asha', 900, '2025-02-01'),
(7, 'John', 450, '2025-02-02'),
(8, 'Maya', 600, '2025-02-03'),
(9, 'Liam', 1000, '2025-02-04'),
(10, 'Sara', 120, '2025-02-05'),
(11, 'Asha', 400, '2025-03-01'),
(12, 'John', 250, '2025-03-02'),
(13, 'Maya', 850, '2025-03-03'),
(14, 'Liam', 950, '2025-03-04'),
(15, 'Sara', 500, '2025-03-05'),
(16, 'Asha', 100, '2025-04-01'),
(17, 'John', 750, '2025-04-02'),
(18, 'Maya', 200, '2025-04-03'),
(19, 'Liam', 300, '2025-04-04'),
(20, 'Sara', 400, '2025-04-05');
Example 1 — Full table scan
-- Find orders where amount > 800
EXPLAIN QUERY PLAN
SELECT * FROM big_orders WHERE amount > 800;
Layman explanation:
Without an index, SQL will check every row (called a full table scan).
With 20 rows, it’s fine. With 20 million rows, it’s very slow.
Example 2 — Creating an index
-- Create index on amount
CREATE INDEX idx_amount ON big_orders(amount);
-- Now run the same query
EXPLAIN QUERY PLAN
SELECT * FROM big_orders WHERE amount > 800;
Layman explanation:
An index is like the index in a book → faster lookup.
Now SQL doesn’t scan all rows; it jumps straight to the rows where amount > 800.
Example 3 — Index on multiple columns
-- Index on (customer, order_date)
CREATE INDEX idx_customer_date ON big_orders(customer, order_date);
-- Query using that index
SELECT * FROM big_orders
WHERE customer = 'Asha' AND order_date > '2025-02-01';
Layman explanation:
Composite indexes speed up queries that filter by two or more columns.
But they only help if your WHERE clause uses the index's leading column(s), in order.
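You can check this column-order behavior yourself with EXPLAIN QUERY PLAN via Python's sqlite3 module. The exact plan wording varies across SQLite versions, so this sketch only looks for key substrings:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE big_orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER, order_date TEXT);
CREATE INDEX idx_customer_date ON big_orders(customer, order_date);
""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail); keep the detail text
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Filter starts with the leading column (customer) -> index is used
p1 = plan("SELECT * FROM big_orders WHERE customer = 'Asha' AND order_date > '2025-02-01'")
# Filter only on the second column (order_date) -> full table scan
p2 = plan("SELECT * FROM big_orders WHERE order_date > '2025-02-01'")

print(p1)  # mentions USING INDEX idx_customer_date
print(p2)  # a SCAN, no index
```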
Example 4 — Avoiding SELECT *
-- Bad: SELECT *
SELECT * FROM big_orders WHERE customer = 'John';
-- Good: only select needed columns
SELECT customer, amount FROM big_orders WHERE customer = 'John';
Layman explanation:
SELECT * loads all columns, even if you don't need them.
In huge tables with 50+ columns, this slows down queries.
Example 5 — Avoiding unnecessary functions in WHERE
-- Bad: function on column (index not used)
SELECT * FROM big_orders
WHERE STRFTIME('%Y', order_date) = '2025';
-- Good: compare directly (uses index)
SELECT * FROM big_orders
WHERE order_date BETWEEN '2025-01-01' AND '2025-12-31';
Layman explanation:
If you wrap columns in functions, indexes are ignored.
Instead, filter using raw values.
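The same EXPLAIN QUERY PLAN trick shows the index being ignored when the column is wrapped in a function. Here idx_order_date is a hypothetical index added just for this demo (the chapter's setup doesn't create one on order_date), and the plan wording varies by SQLite version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE big_orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER, order_date TEXT);
CREATE INDEX idx_order_date ON big_orders(order_date);  -- hypothetical index for the demo
""")

def plan(sql):
    # Collect the human-readable 'detail' column of EXPLAIN QUERY PLAN
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Wrapping the column in a function hides it from the index -> scan
bad = plan("SELECT * FROM big_orders WHERE STRFTIME('%Y', order_date) = '2025'")
# Comparing the raw column lets SQLite do an index range search
good = plan("SELECT * FROM big_orders WHERE order_date BETWEEN '2025-01-01' AND '2025-12-31'")

print(bad)   # a SCAN
print(good)  # mentions USING INDEX idx_order_date
```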
Example 6 — LIMIT for efficiency
-- Get top 5 biggest orders
SELECT customer, amount
FROM big_orders
ORDER BY amount DESC
LIMIT 5;
Layman explanation:
Always use LIMIT when exploring.
Otherwise, queries may return millions of rows and choke memory.
Quick checklist (Performance Rules of Thumb)
Always index columns used in WHERE, JOIN, ORDER BY.
Avoid SELECT *; pick only the needed columns.
Avoid wrapping columns in functions inside WHERE.
Use LIMIT when exploring.
Use EXPLAIN to check if indexes are used.
10 Exercises (Chapter 9)
Create an index on customer in big_orders. Run a query filtering by customer and check it with EXPLAIN.
Create a composite index on (customer, amount) and query John's orders above 400.
Write a query to get the top 3 biggest orders.
Write a query to find orders in March 2025 without using functions in WHERE.
Compare SELECT * vs selecting only customer, amount.
Drop the index idx_amount and check how EXPLAIN changes for amount > 800.
Find all orders by Asha after 2025-03-01 (make sure your index helps).
Create a query that sorts by order_date DESC and limits to the last 2 orders.
Insert 2 duplicate rows for John with amount 500. How would you ensure uniqueness? (Hint: PRIMARY KEY, UNIQUE.)
Use EXPLAIN QUERY PLAN on a simple query, and check whether it says "USING INDEX".
✅ Summary of this chapter
SQL performance is mostly about indexes.
Good habits: avoid SELECT *, use LIMIT, and filter smartly.
For AI/data jobs, you don't need to be a tuning expert, but knowing this saves you from slow pipelines.
Now we’ve reached the final practical chapter:
Chapter 10 — AI/Data Job Specific SQL (Feature Engineering, Time-Series, Data Quality Checks).
This is where everything we’ve learned ties into AI/ML workflows. SQL here is used to prepare datasets, create features, and ensure data quality before training models.
Chapter 10 — AI/Data Job Specific SQL
What this chapter covers (plain language)
Feature engineering in SQL (new columns from raw data).
Time-series operations (lags, moving averages).
ETL workflows (Extract → Transform → Load).
Data quality checks (missing values, duplicates).
SQL + Python integration (using Pandas + SQLAlchemy).
Example setup (transactions)
DROP TABLE IF EXISTS transactions;
CREATE TABLE transactions (
txn_id INTEGER PRIMARY KEY,
customer TEXT,
amount INTEGER,
txn_date TEXT
);
INSERT INTO transactions (txn_id, customer, amount, txn_date) VALUES
(1, 'Asha', 100, '2025-01-01'),
(2, 'Asha', 200, '2025-01-05'),
(3, 'John', 500, '2025-01-02'),
(4, 'John', 50, '2025-01-10'),
(5, 'Maya', 300, '2025-01-03'),
(6, 'Maya', 400, '2025-01-06'),
(7, 'Maya', 150, '2025-01-09');
Example 1 — Feature engineering: totals and averages
-- Total and average spend per customer
SELECT customer,
SUM(amount) AS total_spent,
AVG(amount) AS avg_spent,
COUNT(*) AS txn_count
FROM transactions
GROUP BY customer;
Layman explanation:
Features = new columns for ML (like "total spend", "average spend").
This is classic SQL feature engineering.
Example 2 — Recency feature (days since last transaction)
-- Days since last transaction (relative to 2025-01-15)
SELECT customer,
MAX(txn_date) AS last_txn,
JULIANDAY('2025-01-15') - JULIANDAY(MAX(txn_date)) AS days_since_last
FROM transactions
GROUP BY customer;
Layman explanation:
Recency is a key feature in customer churn models.
Here, we calculate days since the most recent transaction.
Example 3 — Time-series: lag feature
-- Compare each txn amount to previous txn of same customer
SELECT txn_id, customer, txn_date, amount,
LAG(amount) OVER (PARTITION BY customer ORDER BY txn_date) AS prev_amount,
(amount - LAG(amount) OVER (PARTITION BY customer ORDER BY txn_date)) AS diff_from_prev
FROM transactions;
Layman explanation:
LAG() → fetch the previous transaction's amount.
The diff_from_prev feature helps detect anomalies and trends.
Example 4 — Moving average feature
-- 2-transaction moving average per customer
SELECT txn_id, customer, txn_date, amount,
AVG(amount) OVER (
PARTITION BY customer
ORDER BY txn_date
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
) AS moving_avg
FROM transactions;
Layman explanation:
Moving averages smooth out noisy data.
Very useful for forecasting (stock prices, sales).
Example 5 — Data quality checks
-- Missing values check
SELECT * FROM transactions
WHERE customer IS NULL OR amount IS NULL OR txn_date IS NULL;
-- Duplicate detection
SELECT customer, txn_date, COUNT(*) AS dup_count
FROM transactions
GROUP BY customer, txn_date
HAVING COUNT(*) > 1;
-- Outlier detection (txns > 3x average)
WITH avg_txn AS (
SELECT AVG(amount) AS avg_amt FROM transactions
)
SELECT t.*
FROM transactions t, avg_txn
WHERE t.amount > 3 * avg_txn.avg_amt;
Layman explanation:
AI pipelines break if you have missing values or duplicates.
SQL can quickly check data quality before loading into ML.
Example 6 — SQL + Python integration (concept)
In real projects, you’ll pull SQL data into Pandas for ML.
import pandas as pd
from sqlalchemy import create_engine
# Connect to SQLite DB
engine = create_engine("sqlite:///mydb.db")
# Load features directly from SQL
query = """
SELECT customer,
SUM(amount) AS total_spent,
AVG(amount) AS avg_spent,
COUNT(*) AS txn_count
FROM transactions
GROUP BY customer;
"""
df = pd.read_sql(query, engine)
print(df)
Layman explanation:
Use pandas.read_sql() to fetch query results directly.
SQL does the aggregation, Python does the ML.
Quick checklist (AI/ML focus)
Feature engineering → totals, averages, recency, frequency.
Time-series → lag, lead, moving averages.
Data quality → check for NULLs, duplicates, outliers.
ETL in SQL → clean + transform before exporting to Python.
SQL + Pandas → combine the best of both worlds.
10 Exercises (Chapter 10)
Calculate each customer’s total transactions and average spend.
Find customers who spent more than 500 in total.
For each customer, calculate days since last txn (relative to 2025-01-15).
Use LAG() to show the difference between each txn and the previous txn.
Use a 3-transaction moving average per customer.
Check if any rows have NULL values in transactions.
Find if there are duplicate (customer, txn_date) rows.
Detect outlier transactions > 2× the customer's average.
Combine SQL + Python: pull the transaction summary into Pandas and plot a bar chart (code in Python).
Create features total_spent, txn_count, days_since_last → imagine these as model inputs.
✅ Summary of this chapter
SQL is not just for querying → it’s a feature engineering toolkit.
Time-series SQL (LAG, LEAD, moving averages) = gold for forecasting.
Data quality SQL prevents garbage-in → garbage-out in ML.
In real AI projects: SQL prepares the data → Python/R trains the model.