A few months back I worked on developing an Ecommerce Connector for Sitecore for a certain Ecommerce product. The idea of the connector was to bring Ecommerce catalogs (products, categories, etc.) into Sitecore and keep them in sync with the Ecommerce System. Of the many aspects of the connector, the one that challenged us most was how to profile the imported items programmatically so that we could use them for customer engagement. The problem becomes more critical when the Ecommerce System holds a large volume of catalog items. A small number of items can be profiled manually, but a large volume requires some programmatic way to profile them. I did not get the chance to spend more time on this problem, but it never stopped bugging me. In this blog post I will try to find some answers to it.
Let’s start with an example. Say we want to create an Online Camera Store. It is an online Ecommerce site for which customer engagement will be handled by the Sitecore Experience Platform (formerly known as Digital Marketing System, aka DMS). So, we bring all the products (products are Sitecore items) from the Ecommerce System into Sitecore. There can be thousands of products: camera bodies, lenses, filters, flashes and many more. To effectively engage customers to buy the products, we have to profile these products. Profiling really means classifying products or, in simple terms, finding the similarity among the products. The products can be classified from historical purchase data using statistical classification or machine learning algorithms. Some well-known classification algorithms are the Decision Tree Classifier, the Bayesian Classifier, Logistic Regression and Neural Networks.
In contrast to all these classification methods, profiling in Sitecore doesn’t force the user to use any algorithm to classify Sitecore items. In Sitecore, a user can score an item purely from his or her knowledge of the product. This is not necessarily a bad thing, because businesses usually know their products, and if they have a small number of products, profiling them manually can produce good results. The profiling can also be adjusted later on, based on performance. Manually profiling products becomes a problem when we have thousands of them. In that case manual profiling is very tedious and it is easy to lose objectivity. This problem can be solved if we can come up with a formula that gives us the profile scores of a product from some given characteristics of that product. Ahhh, a magic formula. Let’s see how far we can go.
A quick refresher on profiling items in Sitecore
- Profile – A category that describes a customer. For example, the Photography Skill Level of a customer.
- Profile Key – A particular value of a Profile. For the Skill Level Profile, the keys can be Beginner, Semi Professional and Professional.
- Profile Card – A combination of Profile Key values. It is easier to profile a product with a Profile Card than to set each Profile Key value individually. For example, there can be a Profile Card for ‘Expensive Professional’, for which both Skill Level and Spending Habit have high scores.
- Personas – Profile Cards that define fictitious characters representing real customers.
- Patterns – Predefined combinations of Profile Key values that are matched against a customer’s behavior.
Going back to my camera store example, we can have a profile called Skill Level whose profile keys are Beginner, Semi Professional and Professional. A product such as a camera body will be profiled with a Profile Card, which is built from Profile Key values. If we can calculate the Profile Key values for a camera body from its different characteristics, we can find the closest matching Profile Card and profile the camera body with that card. So, I have to find a way to calculate a Profile Key value from known characteristics of the camera body.
Let’s recap: we are trying to find the Skill Level value (say, between 0 and 5) of a camera body given some of its characteristics. Characteristics we can consider are, for example, Frames per Second (fps), size of the CCD/CMOS sensor, maximum shutter speed, etc. Based on these characteristics, a camera body is designated as professional or not professional in the survey data. Yes, there can be only two outcomes.
For binary outcomes, an appropriate classification model to consider is the Logistic Regression model. It has the same linear form as Multiple Least Squares Regression, applied to the logarithm of the odds. If E is the event that a camera body is for professionals and F(E) is the odds of that event, then the Logistic Regression model of the odds is defined as
$$\ln F(E) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$$
where x1, x2, …, xk are the predictors. In our example, they are Frames per Second, size of the CCD/CMOS sensor, maximum shutter speed, etc. β0, β1, …, βk are coefficients that we have to estimate from observed data and ε is the error.
The odds that a camera body is for professionals are
$$F(E) = \frac{P(E)}{1 - P(E)}$$
where P(E) is the probability that a camera body is for professionals. That means
$$\ln \frac{P(E)}{1 - P(E)} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$$
If b0, b1, …, bk are the estimates of β0, β1, …, βk, then the estimate of the probability that a camera body is for professionals is
$$\hat{P}(E) = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k)}}$$
Now, for simplicity, we will work with only one predictor, Frames per Second (fps). Then the estimate of the probability that a camera body is for professionals becomes
$$\hat{P}(E) = \frac{1}{1 + e^{-(b_0 + b_1 x_1)}}$$
where x1 is the value of fps.
Now, the question is how we are going to estimate the values of b0 and b1. To do that we need some data. The following table, which I completely made up, shows observations for different values of fps. It is as if we gave camera bodies to 100 photographers and asked them to decide, based on fps, whether each camera is Professional or not.
Each row represents the midpoint of a 3 fps interval.
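The model we will fit to estimate the coefficients is based on the Binomial distribution. Up to an additive constant that does not depend on b0 and b1, its log-likelihood is
$$\ln L = \sum_i \left[ y_i \ln p_i + (n_i - y_i) \ln (1 - p_i) \right]$$
where, for row i of the table, n_i is the number of observations, y_i is the number of those rated Professional, and p_i is the probability given by the single-predictor formula at that row's fps value.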
We will find the pi, and thus b0 and b1, by maximizing the value of ln L above. This is called Maximum Likelihood estimation. The idea is to find the pi for which ln L is at its maximum. We will find this maximum using Newton’s method. Newton’s method finds the root of a nonlinear equation starting from a guess value; here the equations to solve are that the derivatives of ln L with respect to b0 and b1 are zero. In our case we start with the guess pi=0.5, which corresponds to b0=b1=0. After several iterations of Newton’s method, ln L converges to a maximum and we have the coefficients b0 and b1. Here are the iterations.
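The fitting procedure described here can be sketched in plain Python. This is only a minimal sketch under stated assumptions: the counts in `XS`, `NS` and `YS` are hypothetical stand-ins (not the survey table from this post), the function names are made up for illustration, and the log-likelihood leaves out the constant binomial term. Each Newton step solves a 2x2 linear system built from the first and second derivatives of ln L.

```python
import math

def sigmoid(z):
    """Logistic function: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ns, ys):
    """Binomial log-likelihood (constant term omitted) for grouped data."""
    total = 0.0
    for x, n, y in zip(xs, ns, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (n - y) * math.log(1.0 - p)
    return total

def fit_logistic_newton(xs, ns, ys, max_iter=25, tol=1e-10):
    """Estimate b0 and b1 with Newton's method, starting from p_i = 0.5."""
    b0, b1 = 0.0, 0.0  # b0 = b1 = 0 gives p_i = 0.5 for every row
    for _ in range(max_iter):
        g0 = g1 = 0.0           # gradient of ln L
        h00 = h01 = h11 = 0.0   # negative Hessian of ln L
        for x, n, y in zip(xs, ns, ys):
            p = sigmoid(b0 + b1 * x)
            residual = y - n * p
            weight = n * p * (1.0 - p)
            g0 += residual
            g1 += residual * x
            h00 += weight
            h01 += weight * x
            h11 += weight * x * x
        # Solve the 2x2 system (negative Hessian) * step = gradient.
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

# Hypothetical grouped observations, invented for this sketch:
XS = [2, 5, 8, 11, 14, 17, 20]      # midpoint of each 3 fps interval
NS = [100] * 7                      # photographers asked per interval
YS = [9, 18, 33, 52, 71, 84, 92]    # how many answered "Professional"

b0_hat, b1_hat = fit_logistic_newton(XS, NS, YS)
print("b0 =", b0_hat, "b1 =", b1_hat)
print("ln L =", log_likelihood(b0_hat, b1_hat, XS, NS, YS))
```

Because the counts are hypothetical, the coefficients and ln L this prints are not the ones obtained from the survey table.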
For the survey data, the total value LL (ln L) converges to -371.911 after 5 iterations, giving b0=-2.82164 and b1=0.26403. Using these, we can calculate the probability that a camera body is for professionals from figure 1. For example, if the fps value is 6, the probability is 0.2249. Multiplying this by 5 gives the Skill Level Profile Key value, which is 1.1245.
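The arithmetic for the fps = 6 example can be checked with a few lines of Python. The coefficients and the scale of 5 come from this post; the function names are just illustrative.

```python
import math

B0 = -2.82164   # intercept estimate quoted in the post
B1 = 0.26403    # fps coefficient estimate quoted in the post
MAX_KEY_VALUE = 5  # Skill Level is scored between 0 and 5

def professional_probability(fps, b0=B0, b1=B1):
    """Estimated probability that a camera body is for professionals."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * fps)))

def skill_level_key_value(fps, scale=MAX_KEY_VALUE):
    """Scale the probability to the 0..scale range used for the Profile Key."""
    return scale * professional_probability(fps)

p = professional_probability(6)
print(round(p, 4))                        # 0.2249
print(round(skill_level_key_value(6), 2)) # 1.12
```

Scaling the unrounded probability gives a value a hair below 1.1245; the figure in the text comes from multiplying the rounded 0.2249 by 5.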
For simplicity I showed the process for one predictor. We can extend it to multiple predictors to calculate the Skill Level Profile Key value. Other Profile Keys can be calculated in the same way, which gives us a Profile Pattern for a piece of camera equipment that we can match to the closest Profile Card. That Profile Card is then used for profiling the equipment.
Profiling Sitecore items with a statistical model can look cumbersome at first, but it has many benefits. Since we are using data to profile items, the scores do not depend on one person's judgment. The model can be checked and improved using statistical goodness-of-fit tests. Once the model is developed, it can be applied repeatedly, whenever items (products) are imported and whenever the model is revised.
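Matching the calculated values to the closest Profile Card could look something like the sketch below. Everything in it is an assumption made for illustration: the card names and scores are hypothetical, "closest" is taken to mean smallest Euclidean distance, and none of this is Sitecore's own API.

```python
import math

# Hypothetical Profile Cards: each maps a profile name to a 0..5 score.
PROFILE_CARDS = {
    "Budget Beginner": {"Skill Level": 1.0, "Spending Habit": 1.0},
    "Enthusiast": {"Skill Level": 3.0, "Spending Habit": 3.0},
    "Expensive Professional": {"Skill Level": 5.0, "Spending Habit": 5.0},
}

def distance(scores_a, scores_b):
    """Euclidean distance between two score dictionaries (missing keys count as 0)."""
    names = set(scores_a) | set(scores_b)
    return math.sqrt(
        sum((scores_a.get(n, 0.0) - scores_b.get(n, 0.0)) ** 2 for n in names)
    )

def closest_profile_card(scores, cards=PROFILE_CARDS):
    """Return the name of the card whose scores are nearest to the given scores."""
    return min(cards, key=lambda name: distance(scores, cards[name]))

# Scores calculated for one camera body (illustrative values only).
camera_body_scores = {"Skill Level": 1.12, "Spending Habit": 1.5}
print(closest_profile_card(camera_body_scores))  # Budget Beginner
```

A different distance measure, or weights per profile, would fit the same structure if some profiles matter more than others.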
- Programming Collective Intelligence is an excellent book for learning ecommerce personalization using statistics and data mining algorithms.
- real-statistics is a fantastic resource for learning statistical analysis using Excel.
- Sitecore Predictive Personalization is a good resource for learning how predictive personalization works in Sitecore.