Basic query patterns for fast recall

This note is literally just a list of queries for fast recall. It is not intended to be educational in any way.

A in B

Q: Give me a query for rows in A that have a match in B

SELECT a.*
FROM A a
WHERE EXISTS (
  SELECT 1 FROM B b WHERE b.key = a.key
);

-- NOT IN (SELECT b.key FROM B b) can be less than ideal as an alternative if sub-query can return NULL

I.e. “if (key in b)”

A not in B

Q: Give me a query for rows in A with no match in B

SELECT a.*
FROM A a
WHERE NOT EXISTS (
  SELECT 1 FROM B b WHERE b.key = a.key
);

A XOR B

Q: Give me a query for rows that are either in A or B, but not both

SELECT a.key
FROM A a
WHERE NOT EXISTS (SELECT 1 FROM B b WHERE b.key = a.key)
UNION ALL
SELECT b.key
FROM B b
WHERE NOT EXISTS (SELECT 1 FROM A a WHERE a.key = b.key);

TL;DR: NOT EXISTS (SELECT 1 FROM B WHERE B.key = A.key) is basically

(NR==FNR) {
    b[$1]=1;
    next;
}
!($1 in b) {
    print;
}

Find top-n (`ROW_NUMBER()`)

Q: Give me the second highest salary in an employees table

SELECT salary
FROM (
    SELECT salary, ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn
    FROM employees
) ranked
WHERE rn = 2;

Find top and bottom n

Q: Give me the second highest and second lowest salary in an employees table

SELECT salary
FROM (
    SELECT salary, ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn,
    COUNT('x') OVER () AS tc
    FROM employees
) ranked
WHERE rn = 2 OR rn = (tc-1);

NOTE: doesn’t consider ties -> DENSE_RANK() if that matters

Find top-n per group (`ROW_NUMBER()` + `PARTITION BY`)

Q: Give me the top three salaries per department in an employees table

SELECT *
FROM (
    SELECT
        dept,
        employee,
        salary,
        ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
    FROM employees
) ranked
WHERE rn <= 3;

Find top-n of summed value

Q: How do you find the top 3 customers by revenue?

WITH total AS (
    SELECT
        c.customer_name,
        SUM(t.price) AS price_sum,
        DENSE_RANK() OVER (ORDER BY SUM(t.price) DESC) as rnk
    FROM customers c
    JOIN transaction t ON c.customer_id = t.customer_id
    GROUP BY c.customer_id, c.customer_name
)
SELECT customer_name, price_sum FROM total WHERE rnk <= 3;

NOTE: If customers with 0 transactions should show up:

-> JOIN to LEFT JOIN wrap SUM(t.price) inside COALESCE(..., 0)

Why not `ROW_NUMBER()`?

-> ROW_NUMBER() assigns a unique sequential number to each result, which gets exactly n rows
-> DENSE_RANK() assigns the same number for ties, may not exactly get n rows, but gets top n values correctly in case ties need to be considered

a,100
b,100
c,90
d,80

-> ROW_NUMBER() -> 1,2,3,4
-> DENSE_RANK() -> 1,1,2,3

Same attribute grouping

Q: Give me all the employees who have the same manager

SELECT manager_id, GROUP_CONCAT(employee_name) AS team_members
FROM employees
WHERE manager_id IS NOT NULL
GROUP BY manager_id;

PostgreSQL version:

SELECT
    manager_id,
    STRING_AGG(employee_name, ', ' ORDER BY employee_name) AS team_members
FROM employees
WHERE manager_id IS NOT NULL
GROUP BY manager_id;

PostgreSQL’s STRING_AGG requires an explicit delimiter.

Duplicate Filtering

Q: Give me a query to delete duplicate products from a product table

WITH duplicates AS (
   SELECT product_id, product_name, price,
   ROW_NUMBER() OVER (PARTITION BY product_name, price ORDER BY product_id) AS rn
   FROM product
)
DELETE p FROM product p JOIN duplicates d ON p.product_id = d.product_id
WHERE d.rn > 1;

For small ad-hoc tasks/reports, consider piping output through awk and rebuilding from the results, the idiom is easier to remember than this.

PostgreSQL syntax differs ever so slightly:

WITH duplicates AS (
   SELECT product_id, product_name, price,
   ROW_NUMBER() OVER (PARTITION BY product_name, price ORDER BY product_id) AS rn
   FROM product
)
DELETE p FROM product p USING duplicates d WHERE p.product_id = d.product_id
AND d.rn > 1;

Relational division (`GROUP BY` + `HAVING COUNT` = n)

Q: Give me a query to find customers who made purchases in every month of a year.

WITH purchases AS (
    SELECT customer_id, MONTH(date) as transaction_month
    FROM transaction
    WHERE YEAR(date) = 2025
    GROUP BY customer_id, MONTH(date)
)
SELECT customer_id FROM purchases GROUP BY customer_id HAVING COUNT(transaction_month) = 12;

PostgreSQL has no MONTH(), use DATE_TRUNC().

Contribution analysis

Q: Give me a query to calculate the % contribution of each product to total value

WITH product_summary AS (
    SELECT
        t.product_id,
        SUM(t.price) AS price_sum
    FROM
        transaction t
    GROUP BY
        t.product_id
)
SELECT
    ps.product_id,
    ps.price_sum,
    ROUND((ps.price_sum * 100.0) / SUM(ps.price_sum) OVER (), 2) AS contrib
FROM
    product_summary ps;

Anti-Join

Q: Give me a query to look for products that had no sales in the last 30 days

WITH product_summary AS (
    SELECT DISTINCT
        t.product_id
    FROM
        transaction t
    WHERE
        t.date >= (NOW() - INTERVAL 30 DAY)
)
SELECT p.product_id FROM product p
LEFT JOIN product_summary ps
ON p.product_id = ps.product_id
WHERE ps.product_id IS NULL;

In a real system, this kind of query is not always feasible. A transaction ledger might be hundreds of millions, or even billions of rows. This can potentially turn into a full table scan. A summary table would be the realistic option, if this information is frequently requested.

Without CTE for old DB versions

-- products that have at least one transaction (Semi-Join)
SELECT p.*
FROM product p
WHERE EXISTS (
    SELECT 1 FROM transaction t WHERE t.product_id = p.product_id AND t.date >= (NOW() - INTERVAL 30 DAY)
);

-- products with no transactions (Anti-Join)
SELECT p.*
FROM product p
WHERE NOT EXISTS (
    SELECT 1 FROM transaction t WHERE t.product_id = p.product_id AND t.date >= (NOW() - INTERVAL 30 DAY)
);

Self-Join

Q: Give me a query to find all pairs of employees who work for the same department

SELECT e1.name AS employee1, e2.name AS employee2, e1.department
FROM employees e1
JOIN employees e2 ON e1.department = e2.department
WHERE e1.employee_id < e2.employee_id;  -- Avoid duplicates and self-pairs

The WHERE clause makes sure we don’t pick up things like [Bob,Bob].

Gaps and Islands

Q: Give me a query to find all periods of consecutive days where a user logged in

WITH unique_login_days AS (
    -- 1. Ensure one record per user per day
    SELECT DISTINCT user_id, CAST(login_at AS DATE) AS login_date
    FROM user_logins
),
login_summary AS (
    -- 2. Summary of each login day sequentially for each user
    SELECT user_id, login_date,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS rn
    FROM unique_login_days
),
buckets AS (
    -- 3. Subtract the rank from the date.
    -- Consecutive dates will result in the same 'bucket_id' date.
    SELECT user_id, login_date,
           (login_date - INTERVAL rn DAY) AS bucket_id
    FROM login_summary
)
-- 4. Group by the bucket_id to find the start, end, and duration
SELECT user_id,
       MIN(login_date) AS streak_start,
       MAX(login_date) AS streak_end,
       COUNT('x') AS consecutive_days
FROM buckets
GROUP BY user_id, bucket_id
ORDER BY user_id, streak_start;

If only streaks of a certain length are of interest, slap a HAVING onto the final query, e.g. HAVING COUNT('x') >= 3;

PostgreSQL: login_date - rn * INTERVAL '1 day'

Conditional aggregation (SUM(CASE WHEN …))

Q: Give me a query to find out how many windows/linux machines are available to a department

SELECT
    dept,
    SUM(CASE WHEN symbol= 'W' THEN 1 ELSE 0 END) AS windows,
    SUM(CASE WHEN symbol = 'L' THEN 1 ELSE 0 END) AS linux
FROM machines
GROUP BY dept;

Running totals

Q: Give me a query for a cumulative sum of positive sales

WITH transaction_summary AS (
    SELECT DAYOFYEAR(date) AS d, COUNT('x') AS c
    FROM transaction
    WHERE price > 0
    GROUP BY d
)
SELECT d, c, SUM(c) OVER (ORDER BY d) FROM transaction_summary ORDER BY d;

PostgreSQL: No DAYOFYEAR(), use EXTRACT(DOY FROM TIMESTAMP date)

Compare with previous - `LAG()`

Q: Give me a query to find instances where sales increased from the previous day

WITH sales_summary AS (
    SELECT
        date,
        sales,
        LAG(sales) OVER (ORDER BY date) AS previous_sales
    FROM daily_sales
)
SELECT date, sales, previous_sales
FROM sales_summary
WHERE sales > previous_sales;

UPSERT

MariaDB/MySQL:

INSERT INTO price_history (product_id, price)
VALUES (1, 9.99)
ON DUPLICATE KEY UPDATE price = VALUES(price);

PostgreSQL:

INSERT INTO price_history (product_id, price)
VALUES (1, 9.99)
ON CONFLICT (product_id)
DO UPDATE SET price = EXCLUDED.price;