Saturday, November 7, 2009

Optimizing Basket Analysis

Basket Analysis is the analysis of sales by basket (or transaction). It can be useful in understanding questions such as
  • What is the total value of sales for transactions that include products x (and y, and z...)?
  • What is the average transaction value for transactions that include products x (and y, and z...)?
  • What proportion of transactions include products x (or y or z)? And what proportion of transactions that include products a (and b, and c...) also include products x (and y, and z...)?
  • How did the attach rate of product x change during the sales promotion series? And was there a lasting effect.
  • Any analysis that requires mix of products within transactions.
Traditionally basket analysis has been achieved by running SQL queries on very large, highly scalable, databases that dynamically summarize by transaction id. You can imagine that these queries involve huge numbers of transaction and item records and really enjoy massively parallel computers. However, it is now possible to run quite effective basket analysis systems using low cost commodity hardware and Microsoft OLAP cubes. Many people do this by creating a cube with a dimension on the transaction id. If you are considering creating such a solution, I have two major suggestions. This has helped me create basket analysis solutions that consume hundreds of millions of transactions in cubes that support relatively fast ad hoc queries (sub 30 seconds).
  1. Rather than creating a dimension on transaction id, create a dimension on basket id. A basket is defined by the unique combination of items. So all the transactions that include items m, n, o and p, would be considered the same basket. Essentially, the basket id can be used as a dimension like the transaction id, but the basket it dimension is a fraction of the size. This makes the solution much more scalable and faster. A basket id dimension table can be created quite simply by creating a bit string with one bit representing each item. For example an 8000 byte string will support 64000 items. There is a little bit of math to create the bit string. but essentially each ItemID can create an exponential of 2 to get its position in the string. ie
    POWER(convert(bigint,2),ItemID). Of course, a bigint can only support 2 to the power of 64, so you will need to do this by a series of 8 byte strings. This would be very very tedious SQL to write, but I found that it is quite easy to write a little SQL to generate the full SQL. Also note, there is no need to keep the full 8000 bytes, it can be a varbinary to the right most byte with an item. See below.

  2. The second major suggestion I have, is to create a dimension using an identity column from the basket id table (there is no need to have the varbinary(8000) column in the cube, and write calculated measures for the basket analysis. ie. I would resist using the distinct count physical measures in the cube. You really need a Basket dimension, so that you can identify specific combinations of items.

The result is a relatively small cube, with a relatively large basket id dimension. The cube will have two measure groups, one for transactions with transaction count and sale value, while the item measure group will have item sales value and item count measures. Here is the short SQL query that will generate the full query for the basket ids. It is only a few lines of SQL, but it will generate over 1100 lines of SQL that will form the query for calculating the unique basket id. You will want to run this query incrementally, so just put a where clause at the bottom to ensure it is over the high water mark. I would also encourage you to put the query in an SSIS package with an OLEDB call to a procedure to look up the identity column on the basket table, so that can be stored with the transaction. The rest, including building the cube, should be easy.

A complementary application to this basket analysis cube is a data mining model targeted at the relationships between sales items and useful for predicting sales items.

If you are having performance issues with your basket analysis solutions or want help with your application, don't hesitate to ask me to help. This is the type of BI activity I really enjoy.

For real-time demonstrations of OLAP, data mining and related BI technologies see http://EasternMining.com.au/Sites/Demonstrations

4 comments:

Richard Lees said...

Here is a soft copy of the SQL. Note, you will need to change the name of your TransactionItem table and columns.

select 'Select TransactionID,sum(ItemID) ItemIDSum,count(*) ItemIDCount,convert(varbinary(max),left('
union all
select ' +CONVERT(BINARY(8),sum(s'+CONVERT(VARCHAR(5),dWord)+'))'
from (select dWord,dWord*64 Topn
from (select top 999 row_number()
over(order by TransactionID) dWord
from TransactionItem) as x) as y
union all
select ',max(ItemID)/8+8)) BasketBitMap From ('
union all
select 'Select TransactionID,ItemID'
union all
select ',max(CASE WHEN ItemID='+convert(varchar(50),TopN-1)
+ ' THEN convert(bigint,-9223372036854775808) WHEN ItemID<'
+convert(varchar(50),TopN-64)
+ ' THEN convert(bigint,0) WHEN ItemID <'
+convert(varchar(50),TopN)
+' THEN POWER(convert(bigint,2),ItemID-'
+convert(varchar(50),TopN-64)
+') ELSE 0 End) S'+convert(varchar(5),dWord)
from (select dWord,dWord*64 Topn from
(select top 999 row_number()
over(order by TransactionID) dWord
from TransactionItem with (nolock)) as x) as y
union all
select 'from TransactionItem group by TransactionID,ItemID'
union all
select ') as q group by TransactionID'

Igor Krupitsky said...

Hi Richard,

Nice entry. I wonder what you think of my implementation of Market Basket Analysis in SSAS:

http://www.codeproject.com/KB/database/Market-Basket-Analysis.aspx

Sam Kane said...

Here are this and some other articles on SSAS Market Basket Analysis:

http://ssas-wiki.com/w/Articles#Market_Basket_Analysis

Anonymous said...

I was wondering if you can expand on this a little. Can you show the resulting cube structure and calculations? You mentioned
What proportion of transactions include products x (or y or z)? And what proportion of transactions that include products a (and b, and c...) also include products x (and y, and z...) and I'm not sure how you would figure this.
Thanks