Sitemap

Finding Product Pairs in Transactions Using Google BigQuery: Multiple SQL Approaches

3 min readFeb 4, 2025

Introduction

Understanding which products are frequently bought together is crucial for businesses looking to optimize their inventory, create targeted promotions, and enhance customer experience. This article will explore multiple SQL approaches to finding product pairs in Google BigQuery using a transactional dataset.

Use Case: Market Basket Analysis

Imagine we have a dataset of transactions where each row represents a product purchased within a transaction. Our goal is to find pairs of products that are frequently bought together.

๐Ÿ“Œ Sample Table: transactions

Transaction table

We aim to generate a result showing product pairs and their frequency across transactions.

๐Ÿ”น Method 1: Self Join (Classic Approach)

The simplest method is using a self-join, where we join the transactions table with itself to find pairs within the same transaction.

SELECT
a.product AS product1,
b.product AS product2,
COUNT(*) AS pair_count
FROM transactions a
JOIN transactions b
ON a.transaction_id = b.transaction_id
AND a.product < b.product -- Avoid duplicates (A,B) == (B,A)
GROUP BY product1, product2
ORDER BY pair_count DESC;

๐Ÿ“Š Sample Output:

Sample output

โœ… Pros: Simple and easy to understand.
โŒ Cons: Slightly inefficient for large datasets due to JOIN operations.

๐Ÿ”น Method 2: Using ARRAY_AGG & UNNEST (BigQuery-Optimized)

BigQuery provides an efficient way to handle array-based operations. We first group products into arrays bytransaction_id, then generate all possible pairs using UNNEST.

WITH product_combinations AS (
SELECT
transaction_id,
ARRAY_AGG(product ORDER BY product) AS product_list
FROM transactions
GROUP BY transaction_id
)
SELECT
product_pair,
COUNT(*) AS pair_count
FROM product_combinations,
UNNEST(GENERATE_ARRAY(0, ARRAY_LENGTH(product_list) - 2)) AS i,
UNNEST(GENERATE_ARRAY(i + 1, ARRAY_LENGTH(product_list) - 1)) AS j,
UNNEST([STRUCT(product_list[OFFSET(i)] AS product1, product_list[OFFSET(j)] AS product2)]) AS product_pair
GROUP BY product_pair
ORDER BY pair_count DESC;

โœ… Pros: Optimized for BigQuery, avoids expensive JOIN operations.
โŒ Cons: Requires knowledge of BigQuery-specific functions (ARRAY_AGG, UNNEST).

๐Ÿ”น Method 3: CROSS JOIN with ROW_NUMBER()

Another approach is to use ROW_NUMBER() to create a unique ranking for each product within a transaction, then perform a JOIN based on rank.

WITH ranked_products AS (
SELECT
transaction_id,
product,
ROW_NUMBER() OVER (PARTITION BY transaction_id ORDER BY product) AS rn
FROM transactions
)
SELECT
a.product AS product1,
b.product AS product2,
COUNT(*) AS pair_count
FROM ranked_products a
JOIN ranked_products b
ON a.transaction_id = b.transaction_id
AND a.rn < b.rn -- Avoid duplicate pairs
GROUP BY product1, product2
ORDER BY pair_count DESC;

โœ… Pros: Reduces duplicate pair generation by leveraging ROW_NUMBER().
โŒ Cons: Still requiresJOIN, but more optimized.

๐Ÿ”น Method 4: Using STRING_AGG() and SPLIT() (BigQuery-Specific)

For datasets already aggregated as strings, we can useSTRING_AGG() to create product lists per transaction and then split them into pairs.

WITH aggregated AS (
SELECT
transaction_id,
STRING_AGG(product, ',') AS product_list
FROM transactions
GROUP BY transaction_id
)
SELECT
p1 AS product1,
p2 AS product2,
COUNT(*) AS pair_count
FROM aggregated,
UNNEST(SPLIT(product_list, ',')) AS p1,
UNNEST(SPLIT(product_list, ',')) AS p2
WHERE p1 < p2 -- Avoid duplicate pairs
GROUP BY product1, product2
ORDER BY pair_count DESC;

โœ… Pros: Efficient for string-based processing.
โŒ Cons: Not suitable for very large datasets with long strings.

๐Ÿ”น Summary: Best Methods for Different Scenarios

Conclusion

Finding product pairs in transactions is a powerful technique for market basket analysis, which can help businesses optimize sales strategies. Google BigQuery offers multiple ways to achieve this, from classic SQL joins to more efficient BigQuery-specific functions like ARRAY_AGG() and UNNEST().

If youโ€™re working with large-scale data, Method 2 (ARRAY_AGG + UNNEST) is the best choice. For smaller datasets or SQL-friendly approaches, Method 1 (Self Join) remains a solid option.

๐Ÿ’ก Which method worked best for you? Let me know in the comments!

Related Articles

--

--

Yannawut Kimnaruk
Yannawut Kimnaruk

No responses yet