Sunday, May 31, 2009

Data Mining as an Enhancement to Basket Analysis

Big retailers all want to perform some form of basket analysis, where they analyse the mix of products that customers purchase in one basket. This tends to be an extremely hardware intensive exercise. Using OLAP technologies can help in this exercise, but even with OLAP hardware can be stressed since you really need a dimension on TransactionID. (OK, there are some optimisations you can make to minimise this dimension, but it is still very large and expensive.) However, if you want to analyse which products are bought together (along with other variables such as time of day, customer gender etc) then data mining (DM) can be a great tool and offer its analysis at a very low cost (ie you only need a small commodity server to perform the analysis). DM algorithms have been available as part of the standard edition of SQL Server since 2000.

Essentially, you ask the data mining to create a model to predict products, based on other products (and other variables such as time of day, store, customer gender etc). The data mining algorithms can scan billions of transactions and produce a data mining model in a relatively short time. The data mining algorithms search through many combinations of products/attributes and discover which products/attributes have strong associations. Essentially, the DM model will only stores relevant associations and their predictive strength. The data mining model tends to be very small since it only contains the associations, even though the work it has done to find them can be resource consuming. The data mining model does not need to store the billions of baskets, or the associations between irrelevant products. You are able to very easily browse the model to visualise products (and potentially other attributes such as gender and time of day) and their associations. Also, you can use the data mining model to predict which products might be bought in a particular basket. This data mining prediction query is extremely fast, typically a few milliseconds, so it is quite plausible that online transactions are probing the data mining model for suggested products at real time. For example, your billions of transactions might be terabytes of data, while the resultant DM model could be a couple of hundred megabytes. of data.
I have an online demonstration of a model that does something like this on http://EasternMining.com.au/sites/Demonstrations/Pages/LibrariesSuggestions.aspx I don't have a retailer's transaction history. Most companies privacy policies prevent them from me displaying them on the internet, however, if yours doesn't let me know and I will host the db on my site. What I do have is the borrowing history of a real municipal library (with names and personal identifiers fudged), which is very similar in structure to retailer transactions. The book is like a product and a borrower's loaned books are a transaction or basket. In my demonstration, you don't need to enter the books and gender, just select any borrower, and the demonstration will select the books that they have borrowed and their sex. This is a relational query and is actually embedded in the data mining (DMX) query. Here is the "product suggestion" DMX query. It looks a bit wordy, but it is not complicated, and the SQL Server wizard creates most of the query for you.

SELECT flattened
t.[BorrowerNo],t.sex,t.[decadeofbirth],
TopCount(predict([Loans n Titles],include_statistics),$probability,12)
From BookSuggestionFullDT
PREDICTION JOIN
SHAPE {OPENQUERY([Libraries],
'SELECT
BorrowerNo,
Sex,
DecadeOfBirth
FROM Borrowers_Interesting')}
APPEND
({OPENQUERY([Libraries],
'SELECT
Title,
BorrowerNo
FROM LoansInterestingFull
WHERE Borrower = ''Damen Roxburgh'' ')}
RELATE BorrowerNo TO BorrowerNo)
AS Loans_interesting AS t
ON BookSuggestionFullDT.Sex = t.Sex AND
BookSuggestionFullDT.[Decade Of Birth] = t.DecadeOfBirth AND
BookSuggestionFullDT.[Loans n Titles].Title = t.Loans_interesting.Title

Here is a similar query with the values coded into the query. As you can see, it is much easier to read. I have found the DMX language easier to learn than MDX, since the tasks you are requesting tend to be simpler. I.e. you are generally only asking for a prediction based on some input variables.

SELECT flattened
MyCustomer.Customer,MyCustomer.Sex,MyCustomer.DecadeOfBirth,
TopCount(predict([Loans n Titles],include_statistics),$probability,12)
From BookSuggestionFullDT
PREDICTION JOIN
(SELECT "Richard Lees" as Customer,
"Male" as [Sex] ,
"1960s" as DecadeOfBirth,
(select 'Marie Claire' as [Title] union
select 'Cleo' as Title) as books) as MyCustomer
ON BookSuggestionFullDT.Sex = MyCustomer.Sex AND
BookSuggestionFullDT.[Decade Of Birth] = MyCustomer.[DecadeOfBirth] AND
BookSuggestionFullDT.[Loans n Titles].Title = MyCustomer.Books.Title

So what I am suggesting here is that if you are performing basket analysis, or have a desire to, then you should also consider creating a data mining model. The DM model will help find the associations and optionally predict potential products. You could even have online shopping sites use DM models to make predictions based on the basket content so far. The query tends to be a few milliseconds, certainly sub-second.

If you require any assistance in creating or exploiting your data mining models, please don't hesitate to ask me for assistance.

For more online real-time business intelligence demonstrations see http://RichardLees.com.au/Sites/Demonstrations


Richard

No comments: