Data Science & Analytics Trainee: March 2018

Saturday, 24 March 2018

4th Project - Customer Segmentation

Today's project is about a different approach for segmentation of a customer's Retail dataset.

I will segment the customer database according to Sales amount spent and Number of categories bought by customer. Before going into further details, and before starting the analysis, let me share something with you:

This week i found in Linkedin a very interesting post on datascience+ website, which is "An online community for showcasing R & Python tutorials. It operates as a networking platform for data scientists to promote their talent and get hired. Our mission is to empower data scientists by bridging the gap between talent and opportunity."

I was just browsing on this site and I found this "Editorial Stylebook" that explains how authors can be consistent for writing interesting and relevant tutorials. Some of the questions they advise to address are pasted in the end of this post, with my answers!

Now, let's go into the analysis of today!
The segmentation I will perform has the main objective of identify tactical consumer groups. With this approach, we are able to understand how our customers behave with more detail and thus better target them.

The dataset:

InvoiceNo: Unique Invoice number. Nominal, if starting with "C" it means cancellation.
StockCode: Unique Product (item) code. Nominal

Description: Product (item) name. Nominal.
Quantity: The quantities of each product (item) per transaction.
InvoiceDate: Invoice Date and time.

UnitPrice: Unit price.
CustomerID: Unique Customer number.

Country: Country name.

#Installing packages
install.packages("reshape2")
install.packages("dplyr")
install.packages("ggplot2")
install.packages ("stargazer")
library("reshape2")
library("dplyr")
library("ggplot2")
library("stargazer")

To read the dataset, I used the "Import dataset" tool in RStudio.

#Selecting only customers from Country "United Kingdom"
Online_Retail_f=subset(Online_Retail,Country=="United Kingdom")

Dataset:

On a first looking through the dataset, I can see that Quantity variable have negative values which I already expected since there are in the dataset Cancelation transactions. Let's consider only cases where quantity >0 to exclude cancelations from our dataset.
#Excluding cancelations
Online_Retail_f=subset(Online_Retail_f, Quantity>0)

#Creating variable "Sales" as Sales = quantity*price
Online_Retail_f$sales= Online_Retail_f$Quantity* Online_Retail_f$UnitPrice

#Summary statistics
stargazer(Online_Retail_f,omit= c("InvoiceNo","StockCode","Description", "InvoiceDate"),type = "text", title="Descriptive statistics", digits=1)

We can see from the output that it still exists transactions with negative UnitPrice, and thus negative sales. Also, there are transactions where customerID is missing.So let's treat this cases. I will call this new dataset "Retail".

#Removing cases where UnitPrice is negative
Retail=subset(Online_Retail_f,UnitPrice>0)

#Removing cases where CustomerID is null
Retail=subset(Retail,!is.na(CustomerID))

#Watching summary statistics again
stargazer(Retail,omit= c("InvoiceNo","StockCode","Description", "InvoiceDate"),type = "text", title="Descriptive statistics", digits=1)

Now we can see that minimum value of Unit Price is 0.001 which does not make sense. Regarding customerID, we can see that NA values were excluded. Let's check what are the cases where Unit Price has minimum values.
I also notice that maximum values for Unit Price and Quantity are strange so let's check also what are the cases where this is happening. I will check case by case, by sorting the columns in the dataset viewer:

For minumum values of unitPrice, i will consider that values below 4 cents are wrong and I will exclude them. For maximum values of UnitPrice, I will leave as it is since it seems that this cases are related to postal services. I can see that "Manual" appears frequently in dataset and I'm not really aware of what this is related to, so I will let this as it is. For maximum quantity values we can see that values are strange but again, I'm not aware if these are normal values for this categories, so I'll let them.
#Selecting only cases where UnitPrice is bigger then 3 cents
Retail=subset(Retail, UnitPrice>0.03)

Let's now reshape the transactional dataset into a dataset where I will have for each customer, the total number of different categories bought and the total amount spent. In my last post, I used a different approach on R to reshape the file but this time, I found this "sqldf" package which allows to deal with data with SQL queries , which is the best way for me to do it !
#Installing package and creating a new reshaped dataset
require(sqldf)
Segm_table=sqldf("
SELECT CustomerID, COUNT(distinct(StockCode)) as P, SUM(sales) as M
FROM Retail
GROUP BY CustomerID")

#Different command for summary statistics

summary(Segm_table)

We are able to see that maximum values for total categories and total sales are very high so probably they will be treated as oultiers. Let's check for outliers and exclude them as they can influence the choice of the intervals for the segmentation variables, as we will see further in this analysis.

Last post, I presented a function which I found to be very usefull to deal with outliers. Nevertheless, and since for beginners in R language, it can be of difficult interpretation the function, I will present here a new (and very simplified) approach to deal with it.
#First, to check the outliers of Sales and Products amounts, we call the boxplot.stats function and retreive the outliers with $out command.
outliers = boxplot.stats(Segm_table$M)$out
outliers2= boxplot.stats(Segm_table$P)$out

#Then we create two columns which are the "outliers identifiers". OutliersM will have values 1 if the value of the variable "M" (sales) was found as an outlier and 0 if it was not. The same for OutliersP
Segm_table$OutliersM = ifelse(Segm_table$M %in% outliers, 1,0)
Segm_table$OutliersP = ifelse(Segm_table$P %in% outliers2, 1,0)

#Removing outliers by subseting the data by selecting only cases where this two variables have value "0" (no outlier)
Segm_table_v1=subset(Segm_table,Segm_table$OutliersM==0 & Segm_table$OutliersP==0 )

#Finally, we can remove the "outlier identifiers" variables
Segm_table_v1= Segm_table_v1[,c(-4,-5)]

#Visualization
par(mfrow=c(1,2))
hist(Segm_table_v1$M, xlab="Sales", )
hist(Segm_table_v1$P, xlab="Quantity")

It seems there are no outliers now!

I will now create the segment variables in a new dataset called "cust.segm". For the tresholds, I will select the quantiles of each variable.
#creating segment variable for Sales
cust.segm = Segm_table_v1 %>%
mutate(segm.sales=ifelse(between(M, 0, quantile(Segm_table_v1$M, 0.25)), "1",
ifelse(between(M, quantile(Segm_table_v1$M, 0.25), quantile(Segm_table_v1$M, 0.50)), "2",
ifelse(between(M, quantile(Segm_table_v1$M, 0.50), quantile(Segm_table_v1$M, 0.75)), "3",
ifelse(M>quantile(Segm_table_v1$M, 0.75), '4','4'))))) %>%
#creating segment variable for Quantities
mutate(segm.quantity=ifelse(between(P, 0, quantile(Segm_table_v1$P, 0.25)), '1',
ifelse(between(P, quantile(Segm_table_v1$P, 0.25),quantile(Segm_table_v1$P, 0.50)), '2',
ifelse(between(P, quantile(Segm_table_v1$P, 0.50), quantile(Segm_table_v1$P, 0.75)), '3',
ifelse(P>quantile(Segm_table_v1$P, 0.75), '4','4')))))

#Changing segment variables to factos and set the levels with the correspondent labels

cust.segm$segm.sales=as.factor(cust.segm$segm.sales)

levels(cust.segm$segm.sales)[levels(cust.segm$segm.sales)=="1"] <- "<€261"

levels(cust.segm$segm.sales)[levels(cust.segm$segm.sales)=="2"] <- "[€261 €532["

levels(cust.segm$segm.sales)[levels(cust.segm$segm.sales)=="3"] <- "[€533-€1095]"

levels(cust.segm$segm.sales)[levels(cust.segm$segm.sales)=="4"] <- ">€1905"

cust.segm$segm.quantity=as.factor(cust.segm$segm.quantity)

levels(cust.segm$segm.quantity)[levels(cust.segm$segm.quantity)=="1"] <- "<14"

levels(cust.segm$segm.quantity)[levels(cust.segm$segm.quantity)=="2"] <- "[14-29["

levels(cust.segm$segm.quantity)[levels(cust.segm$segm.quantity)=="3"] <- "[29-60]"

levels(cust.segm$segm.quantity)[levels(cust.segm$segm.quantity)=="4"] <- ">60"

# For visualization I will create this new dataset where i will group per segment variables the number of customers.

visualization <- cust.segm %>%

group_by(segm.sales, segm.quantity) %>%

summarise(customers=n()) %>%

ungroup()

#Plot

require(gplots)

with(visualization, balloonplot(segm.quantity, segm.sales, customers, zlab = "#

Customers"))

We are able to see that 574 customers (the majority) are customers that buy less quantities and spend small amounts of money. Nevertheless, 529 customers are in the best segment, once they buy a lot and spend very high amounts.

Customers in the segment >=1905€ and <14 are customers that spend a lot of money but buy very small amounts of different categories. They can receive cross-sell campaigns to make them buy more (different categories from the ones they actually buy).

As we can see, whit this visualization, we can select targeted campaigns and approaches to communicate with the segments that have more interest to develop or retain.

Editorial Stylebook questions:

The first time I performed this analysis in R, did I struggle with it?

Yes, in two major points 1) Getting the dataset in the right "shape" to perform the analysis and 2) in creating the visualization tool

What other tutorials online cover this topic?
If you search for "RFM segmentation in R" you find some usefull tutorials, which I used to help me performing this analysis.

What aspects of this analysis do other tutorials miss or explain poorly?
Once every tutorial I found was about doing segmentation on RFM variables, they all missed other approaches (i,e using different segmentation variables), which I cover.

Does my post contribute novel information that others have missed?
I tried to use other approaches on first hand and as mentioned above, on the segmentation variables to use and also on the way of getting the dataset in the correct "shape" (in every tutorial I found, the transactional database was not shaped in the same way I do in this post)

Does my post use a novel dataset, or are these the same data used in other tutorials?
I didn't find any tutorial on same analysis as mine with the dataset I'm using

Sunday, 18 March 2018

3rd Project - Cluster Analysis in Retail Dataset

Today's data relates to a store retail agency. This data which is from an online Course i took some time ago of SAS is described below:

Variable	Description
Saledate_Key	Key of date purchase was made
Shipping Date Key	Key of date purchase was shipped
Sales_Quantity	Quantity sold
Transaction_type	Type of transaction (mastercard, email, paypal..)
Sales_key	Key of transactional sales
Product_Key	Key of respective product
Customer_key	Key of customer
Promotion_ind	0 if sold in normal price, 1 if sold in promotion
Promotion_key	Key of promotiom
Product_price	Price of product
Weight	Weight of product (unit, killos,etc)
Department_key	Key of Product Department
Sku_number	Key of SKU
Product cost	Cost of product
Sales_amount	Total sales amount

Objective: My objective in this project is to analyze the data to identify groups of customers with same behaviour regarding purchases (sales amount) of the different products.

To achieve this, i will follow 3 steps:

Descriptive statistics and visualization
Clustering interesting/different groups of customers
Recommend strategies

Let's start.

1. Some Exploratory analysis

Dataset:

#installing packages

install.packages("sas7bdat")

install.packages("stargazer")

install.packages("dplyr")

install_github("easyGgplot2", "kassambara")

install.packages("ave")

install.packages("reshape2")

install.packages("factoextra")

library("sas7bdat")

library("stargazer")

library("dplyr")

library(easyGgplot2)

library("ave")

library(reshape2)

library("factoextra")

#Reading dataset

all_sales=read.sas7bdat("allsales.sas7bdat")

all_sales$Promotion_ind=as.factor(all_sales$Promotion_ind)

str(all_sales)

#Summary Statistics

stargazer(all_sales,omit=c("sales_key","product_key","Customer_key","Promotion_key","Department_key","Sku_number"),type="text", title = "Descriptive statistics", digits=1)

Looking to simple statistics, it seems that there are no strange values in dataset that need to be corrected.

#histograms for continuous variables

par(mfrow=c(1,2))
hist(all_sales$Sales_quantity, freq=T, col="pink", xlab="Quantity", xlim=c(0,20), main="Quantity Histogram")
hist(all_sales$sales_amount, freq=T, col="pink", xlab= "Sales", xlim=c(0,3500), main="Sales Histogram")

Looking to this histograms we are not really able to see what are most frequent intervals. On Quantity variable, we are able to see that most frequently, products quantity vary between 0 and 5 but let's take a closer look on this interval. Also, on Sales variable we are able to see that amount between 0 and 1000 are most frequent and for that reason I will check closer to this intervals:

#Adjusting x limits

hist(all_sales$Sales_quantity, freq=T, col="pink", xlab="Quantity", xlim=c(0,5), main="Quantity Histogram")

hist(all_sales$sales_amount, freq=T, col="pink", xlab= "Sales", xlim=c(0,1000), main="Sales Histogram")

Now, I am able to see that most frequently, quantity sold varies between 1 and 2 products and most frequently, sales amount varies between €200 and €400.

Let's check now what is the most used payment method

ggplot(data=na.omit(all_sales), aes(x=Transaction_type)) + ggtitle("Type of Transaction") +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5, na.rm=FALSE)+ ylab("Percentage") +coord_flip() +theme_bw() +theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank())

The most common type of payment method is Amex, followed by Mastercard.

Let's check now if products bought in promotion are frequent in this dataset:
#Simple bar chart
barchart(all_sales$Promotion_ind, col="Darkred")

We can see that majority of products bought weren't in Promotion.

2. Cluster Analysis

For this cluster analysis, I will be using K-Means algorithm. Since the dataset is in the form of a transactional ds (each row represents a product), and the objective is grouping customers with simillar purchases amount of certain categories of products, I will have to reshape the file:

#Treating data for cluster analysis
group_customers = all_sales %>%
group_by(Customer_key, product) %>%
summarise(sales = sum(sales_amount))
cluster=dcast(group_customers, Customer_key ~ product, value.var="sales", fill=0)

If initally each row of the dataset represented a transaction (product sold), now each row represents a customer and we can see in columns the sales amount spent on each product:

Because name of variables are now too long and to facilitate visualization, i will rename the column names. Here I will show only example of code i used for first three variables:
#Renaming variable names, beginning in the second column
colnames(cluster)[2] <- "beverA"
colnames(cluster)[3] <- "beverNA"
colnames(cluster)[4] <- "digEq"

#Looking to summary statistics
stargazer(cluster,type="text", title = "Descriptive statistics", digits=1)

We can see that Flooring, Frozen, Shelving and Store Furniture products have highest maximum values of sales.

#Selecting only the variables to the cluster analysis (exclude customer ID)

mydata=cluster[,2:34]

newdata <- mydata[c(-3,-5)]

#Treat data for clustering

boxplot(mydata)

We can see that there are several outliers in the variables in study. Once k-means is very sensitive to this outliers, I will remove them (set as NA) and imput mean value of specific variable.

#Dealing with Outliers with a function that detects oultiers as values which are higher than 3*standard deviation:
findOutlier <- function(data, cutoff = 3) {
## Calculate the sd
sds <- apply(data, 2, sd, na.rm = TRUE)
## Identify the cells with value greater than cutoff * sd (column wise)
result <- mapply(function(d, s) {
which(d > cutoff * s)
}, data, sds)
result
}

outliers <- findOutlier(mydata)

#Removing Outliers with a function that receives as parameters the data that we want the outliers to be removed and the outliers found in the function above.
removeOutlier <- function(data, outliers) {
result <- mapply(function(d, o) {
res <- d
res[o] <- NA
return(res)
}, data, outliers)
return(as.data.frame(result))
}

dataFilt <- removeOutlier(mydata, outliers)

#Replacing missing values with Mean
for(i in 1:ncol(dataFilt)){
dataFilt[is.na(dataFilt[,i]), i] <- mean(dataFilt[,i], na.rm = TRUE)
}

Let's check now the boxplots of the variables:

boxplot(dataFilt)

We can see that now outliers were removed. Nevertheless, there are some variables with high varibility and higher interval of values than others, which can be a problem for the results of K-means analysis. For this reason, I will standardize the variables before performing the cluster analysis.

#Standardizing variables

dataFilt=scale(dataFilt)

#Selecting number of clusters to be considered with Elbow Graph

wss <- (nrow(na.omit(dataFilt))-1)*sum(apply(na.omit(dataFilt),2,var))
for (i in 2:15)
wss[i] <- sum(kmeans(na.omit(dataFilt),centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")

Looking to the graph, I will pick best solution as k=5 clusters. Nevertheless, I will also conduct the analysis with k=6 clusters and k=4 clusters and based on interpretation last solution will be selected.

#Running kmeans with different numbers of clusters
fit4 <- kmeans(dataFilt, 4)
fit5 <- kmeans(dataFilt, 5)
fit6 <- kmeans(dataFilt, 6)

#Visualizing cluster centroids and understand patterns
install.packages("toaster")
library(toaster)
createCentroidPlot( fit4, format="line", groupByCluster=TRUE, coordFlip = TRUE,title = "Centroids")

Cluster 1&2 (n=165+163)

This segments represents customers with highest value for the company.

They are simillar in respect to sales amount when comparing all four clusters, but they differ in the categories that bring more value to the company. We can see that while cluster 1 has customers that spend a lot in Pharmacy, Natural Organic, Merchandising, or Instore Systems, Cluster 2 has customers where this categories represents lowest amount of sales spent in. The objective of the company should be to focus on the retention and satisfaction of this clients through strategies that make them feel emotionally connected and make them consider everything they would lose if they left (eg custom packaging for home, sending a celebration card and small supply of anniversary, delivery of priority orders). Adittionaly, cross sell campaings could also be used to increase sales amount in worst categories of each group.

Cluster 3 (n=215)

This segment is made up of around 27% of the total number of customers and is where "average" customers are included (with an actual value much higher than the first segment but much lower than the third and fourth segments).

Strategies to promote the growth of this group can include cross-selling campaigns. Using as example Food categories, and taking in consideration that they spend more in categories such as Natural Organic, Gourmet or Frozen and that they spend less in categories such as Non alcoholic beverages, do campaigns like"Buy three packs of Ice Tea and you can receive a free product from our Gourmet list at your choice" , could make the amount sold higher and the customer satisfied.

Cluster 4 (n=232)

This cluster represents customers with sales amount below average in every products and thus, customers with lower value for the company. Strategies to grow their potential include the creation of loyalty campaings (eg accumulation of points for each € spent and possibility to convert these points on purchases or special direct promotions) on categories they buy most.