A statistical approach for profiling a large number of Sitecore Items

A few months back I worked on developing an Ecommerce Connector for Sitecore for a certain Ecommerce product. The idea of the connector was to bring Ecommerce catalogs (products, categories etc.) into Sitecore and keep them in sync with the Ecommerce System. Of the many aspects of the connector, the one that challenged us most was how to profile the imported items programmatically so that we could use them for customer engagement. The problem becomes more critical when the Ecommerce System holds a large volume of catalog items. A small number of items can be profiled manually, but a large volume requires some programmatic way to profile them. I did not get a chance to spend more time on this problem, but it never stopped bugging me. In this blog post I will try to find some answers to it.

Let’s start with an example. Say we want to create an Online Camera Store. It is an online Ecommerce site for which customer engagement will be handled using the Sitecore Experience Platform (formerly known as the Digital Marketing System, aka DMS). So, we bring all the products (products are Sitecore items) from the Ecommerce System into Sitecore. There can be thousands of products, such as camera bodies, lenses, filters, flashes and many more. To effectively engage customers to buy the products, we have to profile these products. Profiling really means classifying products or, in simple terms, finding the similarity among the products. The products can be classified from historical purchase data using statistical classification or machine learning algorithms. Some well-known classification algorithms are the Decision Tree Classifier, Bayesian Classifier, Logistic Regression and Neural Networks.

In contrast to all these classification methods, profiling in Sitecore doesn’t force the user to use any algorithm to classify Sitecore Items. In Sitecore, a user can score an item purely from his/her knowledge of the product. This is not necessarily a bad thing, because businesses usually know their products, and if they have a small number of products, profiling them manually can produce good results. Also, the profiling of the products can always be adjusted later based on performance. I think manually profiling products becomes a bit of a problem when we have thousands of products. In that case profiling manually can be very tedious and someone can easily lose objectivity. This problem can be solved if we can come up with a formula that gives us profile scores for a product from some given characteristics of the product. Ahhh, magic formula. Let’s see how far we can go.

A refresher on profiling items in Sitecore

  • Profile – It describes an aspect of a customer. For example, the Photography Skill Level of a customer.
  • Profile Key – It is a particular value of a Profile. For the Skill Level Profile, the values can be Beginner, Semi Professional and Professional.
  • Profile Card – It is a combination of Profile Key Values. It’s easier to profile a product using a Profile Card than by scoring Profile Keys individually. For example, there can be a Profile Card for ‘Expensive Professional’, for which both Skill Level and Spending Habit have high scores.
  • Personas – are Profile Cards that define fictitious characters representing real customers.
  • Patterns – are predefined Profiles that are tested against a customer’s behavior.

Going back to my camera store example, we can have a Profile called Skill Level whose Profile Keys are Beginner, Semi Professional and Professional. A product such as a camera body will be profiled with a Profile Card, which is built from Profile Key Values. If we can calculate the Profile Key Values for a camera body from different characteristics of the camera body, we can find the closest matching Profile Card and profile the camera body with that Profile Card. So, I have to find a way to calculate the Profile Key Value from known characteristics of the camera body.

Let’s recap: we are trying to find the Skill Level Value (say, between 0 and 5) of a camera body given some characteristics of the camera body. Some characteristics we can consider are, for example, Frames per Second (fps), Size of the CCD/CMOS, Max Shutter Speed etc. Based on these characteristics, a camera body is designated as professional or not professional in the survey data. Yes, there can be only two outcomes.

For binary outcomes, the appropriate classification model to consider is the Logistic Regression model. The Logistic Regression model builds on the idea of Multiple Least Squares Regression, but the quantity modeled as a linear function of the predictors is the log of the odds. If E is the event that a camera body is for professionals and F(E) is the odds that a camera body is for professionals, then the Logistic Regression of the odds is defined as below

$$\ln F(E) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$$

where x1, x2, …, xk are the predictors. In our example, they are Frames per Second, Size of the CCD/CMOS, Max Shutter Speed etc. β0, β1, β2, …, βk are coefficients that we have to estimate from observed data and ε is the error.

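The odds that a camera body is for professionals are

$$F(E) = \frac{P(E)}{1 - P(E)}$$

where P(E) is the probability that a camera body is for professionals. That means

$$\ln \frac{P(E)}{1 - P(E)} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$$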

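So,

$$\frac{P(E)}{1 - P(E)} = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon}$$

and

$$P(E) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon)}}$$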

If b0, b1, b2, …, bk are the estimates of β0, β1, β2, …, βk, then the estimate of the probability that a camera body is for professionals is

$$p = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k)}}$$

Now, for simplicity, we will work with only one predictor, Frames per Second (fps). Then the estimate of the probability that a camera body is for professionals becomes

$$p = \frac{1}{1 + e^{-(b_0 + b_1 x_1)}} \qquad \text{(Figure 1)}$$

where x1 is the value of fps.

Now, the question is how are we going to estimate the values of b0 and b1? To do that we need some data. The following table, which I completely made up, shows observations for different values of fps. It’s as if we gave camera bodies to 100 photographers and asked them to decide whether each camera is Professional or not, based on fps.

[Table: observed Professional / not Professional responses for each fps value]

Each row represents the midpoint of an fps interval of width 3.
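
A table of this shape could be built from raw survey responses by grouping them into fps intervals. The following is a minimal sketch in Python, assuming each response is a hypothetical (fps, is_professional) pair and that the intervals start at 0 with a width of 3. The function name and the sample responses are illustrative only and are not the data behind the table.

```python
from collections import defaultdict


def bin_responses(responses, width=3.0):
    """Group (fps, is_professional) pairs into fixed-width fps intervals.

    Returns a list of (midpoint, professional_count, total_count) tuples,
    sorted by midpoint. Interval i covers [i * width, (i + 1) * width).
    """
    counts = defaultdict(lambda: [0, 0])  # interval index -> [yes, total]
    for fps, is_professional in responses:
        index = int(fps // width)
        if is_professional:
            counts[index][0] += 1
        counts[index][1] += 1
    return [((i + 0.5) * width, yes, total)
            for i, (yes, total) in sorted(counts.items())]


# Hypothetical responses: (fps of the camera shown, rated Professional?)
responses = [(2.5, False), (4.0, False), (5.0, True),
             (7.5, True), (8.0, True), (8.9, False)]
rows = bin_responses(responses)
for midpoint, yes, total in rows:
    print(midpoint, yes, total)
```

Each output row then supplies the interval midpoint used as x1, the number of Professional responses, and the number of responses in that interval.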

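The model we will fit to estimate the coefficients is based on the Binomial distribution and, up to a constant factor that does not depend on b0 and b1, is defined as below

$$L \propto \prod_i p_i^{y_i} (1 - p_i)^{n_i - y_i}$$

where yi is the observed frequency of Professional responses in row i, ni is the total number of responses in row i, and pi is the actual probability defined above (Figure 1). Taking logs on both sides of the above equation, and dropping the constant, we get the Log Likelihood statistic

$$\ln L = \sum_i \left[ y_i \ln p_i + (n_i - y_i) \ln (1 - p_i) \right]$$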

We will find p, and thus b0 and b1, by maximizing the value of ln L above. This is called Maximum Likelihood estimation. The idea is to find the pi values for which ln L is at its maximum. We will find this maximum using Newton’s method. Newton’s method finds a root of a nonlinear equation starting from a guess value; here the equations are the derivatives of ln L with respect to b0 and b1, set to zero. In our case we will start with a guess value pi=0.5. After multiple iterations of Newton’s method, ln L converges to a maximum and we find the coefficients b0 and b1. Here are the iterations.
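
To make the iteration concrete, here is a minimal sketch of Newton’s method for the one-predictor model in Python. It rests on several assumptions: the data are hypothetical (they are not my table), ln L is taken as the grouped binomial log likelihood Σ[yi ln pi + (ni − yi) ln(1 − pi)] with ni the number of responses in row i, and the starting guess pi=0.5 is expressed as b0=b1=0. The function names are my own.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b0, b1, xs, ys, ns):
    """Grouped binomial log likelihood, without the constant term."""
    total = 0.0
    for x, y, n in zip(xs, ys, ns):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (n - y) * math.log(1.0 - p)
    return total


def fit_logistic_newton(xs, ys, ns, max_iterations=25, tol=1e-10):
    """Estimate b0 and b1 by Newton's method on the gradient of ln L."""
    b0, b1 = 0.0, 0.0  # equivalent to the guess p_i = 0.5 for every row
    for _ in range(max_iterations):
        g0 = g1 = 0.0          # gradient of ln L
        h00 = h01 = h11 = 0.0  # negative Hessian of ln L
        for x, y, n in zip(xs, ys, ns):
            p = sigmoid(b0 + b1 * x)
            residual = y - n * p
            weight = n * p * (1.0 - p)
            g0 += residual
            g1 += residual * x
            h00 += weight
            h01 += weight * x
            h11 += weight * x * x
        # Solve the 2x2 system (negative Hessian) * step = gradient.
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1


# Hypothetical data: fps midpoints, Professional counts, responses per row.
xs = [2, 5, 8, 11, 14, 17, 20]
ys = [8, 18, 35, 52, 70, 84, 93]
ns = [100] * len(xs)

b0, b1 = fit_logistic_newton(xs, ys, ns)
print(b0, b1, log_likelihood(b0, b1, xs, ys, ns))
```

With only two coefficients the Newton step can be solved by hand as a 2x2 system; with more predictors a linear algebra library would be the more practical choice.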

[Figures: tables of the Newton’s method iterations]

As you can see, the total value of LL (ln L) converges to -371.911 after 5 iterations. So, b0=-2.82164 and b1=0.26403. Using these, we can calculate the probability that a camera body is for professionals from Figure 1. For example, if the fps value is 6, then the probability will be 0.2249. Multiplying this by 5 gives the Skill Level Profile Key value, which is 1.1245.
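
The calculation for fps = 6 can be checked with a few lines of Python. This is a small sketch using the coefficients above; the function names are my own, and the scale of 5 is the Skill Level range assumed earlier.

```python
import math


def professional_probability(fps, b0=-2.82164, b1=0.26403):
    """Estimated probability that a camera body is for professionals."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * fps)))


def skill_level_value(fps, max_value=5.0):
    """Scale the probability to a Skill Level Profile Key value."""
    return max_value * professional_probability(fps)


p = professional_probability(6)
score = skill_level_value(6)
print(round(p, 4))      # 0.2249
print(round(score, 4))  # 1.1244
```

The unrounded score is about 1.1244; the 1.1245 quoted above comes from multiplying the already rounded probability 0.2249 by 5.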

For simplicity I showed the process for one predictor. We can extend this to multiple predictors to calculate the Skill Level Profile Key value. Similarly, other Profile Keys can be calculated, so that we obtain a Profile Pattern for a piece of camera equipment and match it to the closest Profile Card. This Profile Card will be used for profiling that equipment.
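
One way the matching step could work is sketched below in Python. Everything here is an assumption made for illustration: "closest" is taken to mean the smallest Euclidean distance between Profile Key values, the Profile Cards and their scores are hypothetical, and the Spending Habit value for the camera body is invented. It does not use or represent the Sitecore API.

```python
import math


def closest_profile_card(key_values, profile_cards):
    """Return the name of the card nearest to the given Profile Key values.

    key_values:    dict of Profile Key name -> calculated value
    profile_cards: dict of card name -> dict of Profile Key name -> value
    A key missing from a card is treated as 0.
    """
    def distance(card_values):
        return math.sqrt(sum((value - card_values.get(key, 0.0)) ** 2
                             for key, value in key_values.items()))

    return min(profile_cards, key=lambda name: distance(profile_cards[name]))


# Hypothetical Profile Cards on a 0 to 5 scale.
profile_cards = {
    "Budget Beginner": {"Skill Level": 1.0, "Spending Habit": 1.0},
    "Enthusiast": {"Skill Level": 3.0, "Spending Habit": 3.0},
    "Expensive Professional": {"Skill Level": 5.0, "Spending Habit": 5.0},
}

# Skill Level from the fps example; Spending Habit is a made-up value.
camera_body = {"Skill Level": 1.1245, "Spending Habit": 2.0}
best_card = closest_profile_card(camera_body, profile_cards)
print(best_card)
```

Other distance measures would work as well; Euclidean distance is simply the most familiar choice when all Profile Keys share the same scale.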

Closing Thoughts

The process of profiling Sitecore Items using a statistical model can initially look cumbersome, but it has many benefits. Since we are using data to profile items, the scores are less prone to one person’s bias. The model can be assessed and improved using statistical goodness-of-fit tests. Once the model is developed, it can be reused whenever items (products) are imported, and applied again whenever the model is revised.

About Himadri Chakrabarti

I am a software developer architect and a Sitecore MVP. My professional interest is everything and anything related to Software Architecture, .NET, Sitecore, Node.js, NoSQL etc. Outside of my profession, I am a hobbyist photographer. Link to my photography site http://himadriphotography.com/