Tuesday 3 April 2018

5th Project - Factor Analysis on a Survey dataset


Today, I will perform a Factor Analysis on a dataset from a survey conducted at San Francisco Airport. The dataset contains data on customer demographics and satisfaction with airport facilities, services, and initiatives, and was collected in May 2017 through interviews with 2,831 customers across San Francisco Airport's four terminals and seven boarding areas. The dataset is available on the official SFO website.

The overall objective of Factor Analysis is to condense the information in a set of original variables into a smaller set of composite dimensions with a minimum loss of information. Factor analysis may have different specific objectives, such as: 1) identifying and interpreting the underlying dimensions that explain the correlations between groups of the original variables; 2) reducing the original variables to a smaller, more easily interpretable set of dimensions of analysis; 3) replacing the original variables with new uncorrelated variables in subsequent multivariate analyses (e.g. regression analysis and cluster analysis).

My main objective is to identify the structuring dimensions of passenger satisfaction. In my next post, I will also perform a cluster analysis to segment the passengers into homogeneous groups based on the satisfaction structures found in this analysis, and characterize those groups using sociodemographic variables and a hierarchical clustering method.

About the dataset:
The original data contain a great deal of information about waiting times, cleanliness of spaces, flight types, airlines, sociodemographic data, and the satisfaction levels, perceptions, preferences, and suggestions of the travelers interviewed. Since, as already mentioned, the objective of the study is to identify the structuring dimensions of passenger satisfaction, I decided not to use all the variables present in the questionnaire, but only those that allow me to meet the objectives:

RESPNUM: Respondent Number (automatically generated)
TRIP_PURPOSE: What is the main purpose of your trip today?
GET_AIRPORT: How did you get to the airport today?
CHECK_BAGAGE: While at SFO today, did you check baggage?
PURCHASE_STORE: While at SFO today, did you purchase anything from an airport store?
PURCHASE_RESTAU: While at SFO today, did you make a restaurant purchase?
USE_WIFI: While at SFO today, did you use free Wi-Fi?
Q7_ART: Rating SFO regarding Artwork and exhibitions (1-5 scale)
Q7_REST: Rating SFO regarding restaurants (1-5 scale)
Q7_SHOPS: Rating SFO regarding retail shops and concessions (1-5 scale)
Q7_SIGNS: Rating SFO regarding signs and directions inside SFO (1-5 scale)
Q7_WALKWAYS: Rating SFO regarding escalators/elevators/moving walkways (1-5 scale)
Q7_SCREENS: Rating SFO regarding information on screens/monitors (1-5 scale)
Q7INFO_DOWN: Rating SFO regarding Information booths (lower level - near baggage claim) (1-5 scale)
Q7INFO_UP: Rating SFO regarding Information booths (upper level - departure area) (1-5 scale)
Q7_WIFI: Rating SFO regarding accessing and using free WiFi at SFO (1-5 scale)
Q7_ROADSIGNS: Rating SFO regarding signs and directions on SFO airport roadways (1-5 scale)
Q7_PARKING: Rating SFO regarding airport parking facilities (1-5 scale)
Q7_AIRTRAIN: Rating SFO regarding AirTrain (1-5 scale)
Q7LTP_BUS: Rating SFO regarding long term parking lot shuttle (bus ride) (1-5 scale)
Q7_RENTAL: Rating SFO regarding airport rental car center (1-5 scale)
Q9_BOARDING: Rating SFO regarding cleanliness of boarding areas (1-5 scale)
Q9_AIRTRAIN: Rating SFO regarding cleanliness of AirTrain (1-5 scale)
Q9_RENTAL: Rating SFO regarding cleanliness of airport Rental Car Center (1-5 scale)
Q9_FOOD: Rating SFO regarding cleanliness of airport restaurants (1-5 scale)
Q9_RESTROOM: Rating SFO regarding cleanliness of restrooms (1-5 scale)
AGE: Categorical
INCOME: Categorical
GENDER: Categorical
FLY: Do you fly 100,000 miles or more per year? Categorical

#Reading dataset
airport=read.csv("SFO2017.csv", header=T, sep=";")
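Right after loading, a quick structural check helps confirm that the separator was read correctly and that the column names match the lowercase names used below (an optional sanity check):
str(airport)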


#Data wrangling

#We need to set the categorical variables as factors and name the category levels

airport$trip_purpose=as.factor(airport$trip_purpose)

levels(airport$trip_purpose)=c(" ","Business/Work","Pleasure/Vacation","Visit friends or relatives","School","Conference/convention","Wedding/funeral/graduation","Other","Escorting others","Volunteer","Moving/immigration")


airport$get_airport=as.factor(airport$get_airport)
levels(airport$get_airport)=c(" ","Drove and parked","Dropped off","Connecting from other flight","Taxi","Uber or similar","BART","Door-to-door van service","Free hotel shuttle","Rental Car Center","Other","Limo/town car","Sonoma/similar airport bus","Company rented bus","SamTrans/other bus","Caltrain/train","VTA","Carshare")

airport$age=as.factor(airport$age)
levels(airport$age)=c(" ", "Under 18", "18-24","25-34","35-44","45-54","55-64","> 65"," ")

airport$gender=as.factor(airport$gender)
levels(airport$gender)=c(" ", "Male", "Female","Other"," ")

airport$income=as.factor(airport$income)
levels(airport$income)=c(" ", "Under 50,000", "50,000-100,000","100,001-150,000","Over 150,000")

airport$fly_miles=as.factor(airport$fly_miles)
levels(airport$fly_miles)=c(" ", "Yes", "No","Don't know"," ")
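A quick way to confirm the recoding worked as intended is to tabulate one or two of the recoded factors (an optional sanity check):
table(airport$trip_purpose)
table(airport$fly_miles)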

#Summary statistics
install.packages("stargazer")
library(stargazer)
stargazer(airport,omit= "RESPNUM", type = "text", title="Descriptive statistics", digits=1)
If we look at N in the descriptive statistics, we can see that many variables have missing values. For the factor analysis, and also for the clustering solution, we need to treat these missing values, so I will impute the variable mean for the NA values.

#Replacing missing values of the variables used in the analysis (the rating variables, columns 8 to 26) with the rounded variable mean
for(i in 8:26){
  airport[,i]=ifelse(is.na(airport[,i]), round(mean(airport[,i], na.rm=T), 0), airport[,i])
}
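To confirm that no missing values remain in the rating variables, a quick count can be run (an optional check):
colSums(is.na(airport[,8:26]))
#All counts should now be zero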
The next step is to explore the categorical variables that I chose to use in this analysis. An easy way to explore these variables is to use the R equivalent of Excel pivot tables. I will use the rpivotTable package and show you this powerful tool, which is dynamic and allows us to change the variables in columns and rows just like in Excel.
#Load rpivotTable
install.packages("rpivotTable")
library(rpivotTable)
#create pivot table
rpivotTable(airport, rows="trip_purpose", cols="gender", aggregatorName="Count", vals="RESPNUM", rendererName="Treemap")
As you can see, we can drag the variables to rows or columns and choose the type of visualization. In this case, I switched to "Table Barchart", which shows the distribution of the number of respondents per trip purpose and age. We can see that the majority of respondents were flying for pleasure/vacation and, within this purpose, most passengers were aged between 25 and 34 years old.
Business/work trips were also very common, with most passengers flying for this purpose aged between 35 and 44 years old.

In the graph below, also computed with rpivotTable, we can see that the majority of respondents didn't purchase anything in the airport stores.
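The underlying call might look something like this (illustrative parameters; purchase_store is assumed to follow the same lowercase naming as the other columns above):
rpivotTable(airport, rows="purchase_store", aggregatorName="Count", vals="RESPNUM", rendererName="Bar Chart")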
rpivotTable also allowed me to create the following table:
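A possible call for this table (illustrative; income and fly_miles as recoded above):
rpivotTable(airport, rows="income", cols="fly_miles", aggregatorName="Count", vals="RESPNUM", rendererName="Table")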
Here we can see that the majority of passengers surveyed have an annual income of over $150,000 and don't fly 100,000 miles or more per year. As expected, most of the passengers who fly more than 100,000 miles are in the highest annual income bracket.

Finally, we can also compute a regular table with the distribution of gender, which shows that the majority of passengers are male:
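Base R is enough for this one (one possible call):
table(airport$gender)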
#Factor analysis
install.packages("psych")
install.packages("GPArotation")
library(psych)
library(GPArotation)
install.packages("corrplot")
library(corrplot)
In order to understand whether the rating variables can be grouped on the basis of their correlations (that is, whether there are variables strongly correlated with each other but only weakly correlated with other groups of variables, and whether each of these groups represents a latent factor/variable), the first step is to analyze the correlations between the variables. The more correlated they are, the higher the probability that the FA will produce relevant results.
#Computing correlation
cor=cor(airport[,8:26])
corrplot(cor, type = "upper", tl.col = "black", tl.srt = 60)

We can see that some variables are highly correlated, such as Q7_art with Q7_shops and Q7_restaurants, Q7_signs with Q7_walkways and Q7_screens, and so on. At this stage, it seems that a FA makes sense. Nevertheless, there is a measure of the adequacy of FA to the dataset used, known as KMO.
This measure is most useful with large samples and produces an overall value, but also a value per variable, so one should begin by analyzing the measure for each variable. For individual variables, acceptable KMO values are greater than 0.5; overall, values larger than 0.8 indicate that the data are quite adequate for FA.
#Verifying adequacy of factor analysis
KMO(cor)
From the output, we can see that both the overall KMO and the per-variable KMO values are greater than the minimum acceptable, indicating that we can proceed with the analysis.

To conduct the FA, we need to choose the method we want to use. There are two types of FA: Principal Component Analysis (PCA) and Common and Specific Factor Analysis (CSFA). PCA may be more suitable when the main goal is to reduce the number of analysis variables to the minimum number of components representing the variance in the original data; when all the variables are expected to be interrelated, reducing the likelihood of specific factors (specific variances are very small); when you want to produce a small number of indices that represent the variability of the original data as completely as possible; or when the main objective is to produce a small number of uncorrelated components, representing the variability of the original data, for use in subsequent multivariate analysis (e.g. regression or cluster analysis).
CSFA may be more suitable when one admits the existence of specific factors that influence the levels of the original variables, or when the goal is not only to reduce the original data to a smaller set of factors, but also to identify and interpret the meaning of the common factors that explain (even if partially) the original variables, i.e. that explain the correlations (or covariances) between them. In this case it is necessary to choose a method for estimating the communalities and a method for extracting the factors (many methods are available, such as principal component, maximum likelihood, and unweighted least squares).
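As a rough sketch of how the two approaches look in the psych package (illustrative only; fa() with fm="pa" is one possible CSFA estimation method, the rotation is deferred for now, and the 4 factors anticipate the choice made below):
#PCA-style extraction
pca_solution = principal(airport[,8:26], nfactors = 4, rotate = "none")
#Common factor model, estimated with principal axis factoring
csfa_solution = fa(airport[,8:26], nfactors = 4, fm = "pa", rotate = "none")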

Because my main goal "is not only to reduce the original data to a smaller set of factors, but also the identification and interpretation of the meaning of the common factors that explain (even partially) the original variables", I will perform a CSFA using the principal component method for the extraction of factors (the most common choice). But first, we need to choose the number of factors to retain, i.e., the number of latent variables that will represent the correlations between the original variables of the dataset. For this, there are some rules of thumb, which I will explain right away:
1. A previous hypothesis about the number of dimensions;
2. Retaining factors with eigenvalues greater than 1 (with standardized data) or with eigenvalues greater than the average variance (with original data);
3. Choosing the number of factors so that the accumulated percentage of variance reaches a satisfactory level (with human-behavior variables it is customary to consider solutions representing at least 50% of the variance of the original data);
4. Using a graph of the successive eigenvalues (scree plot), retaining the number of factors after which there is a sharp decline in the slope of the curve;
5. After choosing the number of factors with these criteria, producing solutions with different numbers of factors (in the neighborhood of the suggested criteria) and taking the final decision based on the interpretation of the solutions.

#Retain factors with eigenvalues > 1 (I am not using standardized data, but using the correlation matrix instead of the covariance matrix is equivalent to using standardized data)
eigen <- eigen(cor)
eigen$values
#Solution: m=4 factors (four eigenvalues greater than 1)

#Cumulative proportion of eigenvalues
cumulative.proportion <- 0
prop <- c()
cumulative <- c()
for (i in eigen$values) {
  proportion <- i / dim(airport[,8:26])[2]
  cumulative.proportion <- cumulative.proportion + proportion
  prop <- append(prop, proportion)
  cumulative <- append(cumulative, cumulative.proportion)
}
data.frame(cbind(prop, cumulative))
#We can see that the first 4 factors account for 40% of the sample variance (m=4)
#Scree plot
par(mfrow=c(1,1))
plot(eigen$values, xlab = 'Nr. Factors', ylab = 'Eigenvalue', main = 'Scree Plot', type = 'b', xaxt = 'n')
axis(1, at = seq(1, 19, by = 1))
abline(h=1, col="blue")
#Final solution m=4

The initial solution of a factor analysis rarely results in interpretable factors. One should produce a rotation of the factors (which redistributes the variance), seeking a new solution where each factor presents high correlations with only a subset of the original variables. The rotation redistributes the variance of the original factors to the new ones, aiming to achieve a more meaningful pattern. Several rotations can be tried, and one retains the one leading to the most interpretable and useful solution (considering the objectives of the analysis).

After performing the rotation with the Varimax, Quartimax, and Equamax methods, the method chosen based on interpretability and usefulness was Varimax (a possible comparison is sketched below). After the rotation, it is usual to interpret the factors by attributing a label to each of them, which should translate the main idea/information underlying the factor. To this end, we analyze all the original variables and identify the factor or factors on which each variable shows its most significant loading. Factor loadings equal to or larger than 0.5 are usually considered significant. Variables that do not exhibit a significant loading on any factor, or with very small communality (<50%), should be removed from the analysis. In the interpretation of each factor, the larger a variable's loading, the more important that variable is.
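A sketch of how the three rotations might be compared with the psych package (illustrative object names; the equamax rotation relies on GPArotation, loaded above):
rot_varimax = principal(airport[,8:26], nfactors = 4, rotate = "varimax")
rot_quartimax = principal(airport[,8:26], nfactors = 4, rotate = "quartimax")
rot_equamax = principal(airport[,8:26], nfactors = 4, rotate = "equamax")
#Inspect and compare the loading patterns of each solution
print(rot_varimax$loadings, cutoff = 0.5)
print(rot_quartimax$loadings, cutoff = 0.5)
print(rot_equamax$loadings, cutoff = 0.5)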
#Rotation
factor_analysis = principal(airport[,8:26], nfactors = 4, rotate = "varimax")
loadings = print(factor_analysis$loadings, cutoff = 0.5)
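As a complement, the communalities mentioned above can be checked directly on the fitted object (a quick check; principal() stores them in the communality element):
#Variables with communality below 0.5 would be candidates for removal
round(factor_analysis$communality, 2)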
Factor 1 - Internal Services
We can see that for factor 1 (RC1), the variables with loadings >0.5 are Q7_signs, Q7_walkways, Q7_screens, Q7info_down, Q7info_up, and Q7_wifi, which represent internal services (inside the airport) such as signs, information given on screens or in boarding and baggage claim areas, and WiFi.

Factor 2 - Cleanliness services
We can see that for factor 2 (RC2), the variables with loadings >0.5 are Q9_boarding, Q9_airtrain, Q9_rental, Q9_food, and Q9_restroom, i.e. all the variables related to ratings of cleanliness in areas such as boarding areas, the AirTrain, the Rental Car Center, restaurant areas, and restrooms.

Factor 3 - External services
We can see that for factor 3 (RC3), the variables with loadings >0.5 are Q7_roadsigns, Q7_parking, Q7_airtrain, Q7LTP_bus, and Q7_rental, i.e. all the variables related to ratings of services outside the airport, such as road signs, parking facilities, the AirTrain, the airport shuttle bus, and the Rental Car Center.

Factor 4 - Entertainment services
We can see that for factor 4 (RC4), the variables with loadings >0.5 are Q7_art, Q7_rest, and Q7_shops, i.e. all the variables related to ratings of the artwork and exhibitions, restaurants, and retail shops in the airport.

Now that we have "replaced" the 19 rating variables with 4 factors that represent the 4 latent variables, we can create homogeneous groups of passengers based on these 4 factors and then characterize them by sociodemographic characteristics. I will leave this for my next post :)
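For that clustering step, the respondents' factor scores can be carried over from the fitted object (a sketch; principal() returns the scores when fitted on the raw data):
factor_scores = factor_analysis$scores
head(factor_scores)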


2 comments:

  1. Hey Mafalda, congratulations once again! I'm enjoying your posts more and more. Just one question: when you do as.factor, why do you then use levels? Doesn't as.factor already do that?

    1. Hi Amadú :) Thanks for the comment! No, as.factor doesn't do that. There are two commands in R for factors: factor() and as.factor(). The second doesn't take levels as a parameter, so you can only call as.factor(x), while the first already includes it: factor(x, levels, ...).
      Hope this helps :)

