Monday, 26 February 2018

2nd Project - Exploring Unemployment in US

Image of area being analyzed

Today I will make some more exploratory data analysis, on a dataset regarding unemployment statistics in USA. I will be using different approaches to answer some random questions as I don't have a specific objective.  
I will be using the September 2013 version of this this dataset, available online.
I retrieved the data from the edX course "The Analytics Edge" which I highly recommend! 
The observations in the dataset represent people surveyed in the September 2013 CPS who actually completed a survey. While the full dataset has 385 variables, in this exercise I will use a more compact version of the dataset, which has the following variables:
Variable
Description
PeopleInHousehold
Number of People in the interviewee’s household

Region
The census region where the interviewee lives

MetroAreaCode
A code that identifies the metropolitan area in which the interviewee lives (missing if the interviewee does not live in a metropolitan area)

Age
The age, in years, of the interviewee. 80 represents people aged 80-84 and 85 represents people aged 85 and higher

Married
The marriage status of the interviewee
Sex
The sex of the interviewee
Education
The maximum level of education obtained by the interviewee

Race
The Race of the interviewee
Hispanic
Wheter the interviewee is of Hispanic ethnicity

CountryOfBirthdayCode
A code identifying the country of birth of the interviewee.

Citizenship
The United States citizenship status of the interviewee

EmploymentStatus
The status of employment of the enterviewee

Industry
The industry of employment of the interviewee (only available if they are employed)

Let's start


#Installing packages
install.packages("ggplot2")
install.packages ("dplyr")
install.packages("devtools")
install_github("easyGgplot2", "kassambara")
library(ggplot2)
library("dplyr")
library(devtools)
library(easyGgplot2)

#Reading CSV file
cps=read.csv("CPSData.csv",header=T,sep=",")
str(cps)
# Set to factor once the variable is 0 if it's not Hispanic and 1 if it's Hispanic 
cps$Hispanic=as.factor(cps$Hispanic)  

#Summary statistics for metric variables
require(stargazer)
stargazer(cps, omit= c("MetroAreaCode","CountryOfBirthCode"),type="text", 
          title = "Descriptive statistics", digits=1)


Summary statistics

#boxplot for numeric variables
par(mfrow=c(1,2))
boxplot(cps$PeopleInHousehold, xlab="People in Household")
boxplot(cps$Age, xlab="Age")
Boxplots of PinHous and Age

Watching to summary statistics we can see that maximum of People in households value is 15.When looking also to the boxplot, we are able to see that there are outliers (the ones presenting people in household >=8)
Let's discover what is maximum number of people living in Household where number of PeopleInHousehold>8  in each Region and State.
Minimum value of Age variable is 0 which is not possible. Considering that minimum age of a person to be able to answer a questionnaire is 14 years old, let's also treat this wrong values.

#from dplyr package
#Treating Age
cps$Age=ifelse(cps$Age<14,NA, cps$Age)

#Maximum number of people living in Household per Region and State
table2 = cps %>%
  filter(PeopleInHousehold>8) %>%
  group_by(Region, State) %>%
  summarise(Max_People_Household = max(PeopleInHousehold)) %>%
  arrange(Region, desc(Max_People_Household))
view(table2)
Summary table

We can see the results sorted by Region and number of Maximum People in Households in descendent order.
The maximum number of people living in a Household that we can find in this dataset living in Missouri, Midwest, is 15. 

#How many interviewees are in each Region
table(cps$Region)
Summary of Region

 We can see that the majority of interviwees are from South region

#How many interviews are Native citizens of US
table(cps$Citizenship)
Summary of Citizenship
 We can see that the majority of interviwees are US Native.

#What is the average age in each Region 
tapply(cps$Age, cps$Region, mean)
Tapply function for Mean age by region

 In the output (which have more information than shown here), we are able to see that Utah is the youngest Region, with an average age of the interviewees of ~31 years old. On the contrary, West Virginia is the oldest Region, with an average age of the interviewees of 42,5. 

#What is the percentage of each employed status of interviewees
ggplot(data=na.omit(cps), aes(x=EmploymentStatus)) + ggtitle("Employment Status") + 
  geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5, na.rm=FALSE)+ ylab("Percentage")  +coord_flip()  +theme_bw() +theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank())


Bar chart of Employment Status
We can see that the majority of interviewees are Employed (more than 75%).

#Let's check the distribution of Age between employed and unemployed people
ggplot2.histogram(data=na.omit(subset(cps,EmploymentStatus=="Employed"|EmploymentStatus=="Unemployed")), xName='Age',groupName='EmploymentStatus', legendPosition="top",faceting=TRUE, facetingVarNames="EmploymentStatus",facetingDirection="horizontal")+theme_bw()+ theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank())


Histograms of Employment status

The majority of Employees are between 40 and 60 years old. Again, is very easy to understand in this plot that the amount of employed people is much higher than unemployed. Here we are not able to see in a very clear way the distribution of unemployed people. We could do separate histograms and change the axis options if needed.

# What is the distribution of Employed Status per Industry
require(ggthemes)
require(viridis)
Employed_count = cps %>% group_by(EmploymentStatus,Industry)%>%   summarise(Total = n())
Employed_count$EmploymentStatus = ordered(Employed_count$EmploymentStatus)
ggplot(na.omit(Employed_count), aes(EmploymentStatus, Industry, fill = Total)) +
  geom_tile(size = 1, color = "white") +
  scale_fill_viridis()  +
  geom_text(aes(label=Total), color='white') +
  ggtitle("Employment Status per Industry") +
  xlab('Employment Status') +theme_bw()+ theme(plot.title = element_text(size = 16), 
   axis.title = element_text(size = 12, face = "bold"))  + theme(panel.border = element_blank(),        panel.grid.major = element_blank(), panel.grid.minor = element_blank())


Employment status per Industry
Educational and health services is the industry where we have more employed interviewees. On the other hand, there are no employees in Armed forces.



Saturday, 24 February 2018

First Project - Exploring Insurance Dataset


Image about topic that is being adressed


Today I am developing my first project in Data Science Training with a dataset regarding a fictional insurance company in Portugal. In the ABT (Analytic Based Table) I have data regarding a sample of 10.290 Customers.

For this first post, the ideia is to develop a descriptive analysis in R with summary statistics and some visualization. Let's start!
For each customer, the following variables are available:
Variable
Description
ID
ID
First_Policy
Year of the Customer's first policy
Birthday
Customer's Age
Education
Academic Degree (Categorical)
Salary
Gross Monthly Salary (€)
Area
Living Area (Categorical)
Children
Binary Variable (Y=1)
CMV
Customer Monetary Value
Claims
Claims Rate
Motor
Premiums (€) in LOB: Motor
Household
Premiums (€) in LOB: Household
Health
Premiums (€) in LOB: Health
Life
Premiums (€) in LOB: Life
Work_Compensation
Premiums (€) in LOB: Work Compensations


#Installing some relevant packages for specific functions
install.packages("stargazer") #Summary statistics
library(stargazer)
install.packages("ggplot2") #Graphics 
library(ggplot2)

#Read dataset and show information about the variables

insurance=read.csv("A2Z_Insurance.csv",header=T, sep=";")
str(insurance)


str function for variables


Wrangling to do...


#In the output, we are able to see that Customer Identity, First Policy Year, Has.Children and Geograpgic.Living.Area are set as int. Customer Identity is an identifier, so we want it to be read by R as different type rather than int . The other 3 variables will be treated as factor.  Let's change the variable types:
insurance$Customer.Identity=as.character(insurance$Customer.Identity)
insurance$First.Policy.s.Year=as.factor(insurance$First.Policy.s.Year)
insurance$Has.Children..Y.1.=as.factor(insurance$Has.Children..Y.1.)
insurance$Geographic.Living.Area=as.factor(insurance$Geographic.Living.Area)

Exploratory data analysis 


1. Summary Statistics

#Let's now retreive some summary statistics. The first line shows statistics for every variable in dataset. As I want to download the results and to be able to see it in a more "organized" way, I use stargazer function from Stargazer package to retreive the summary and i will save it in csv file called "Estatisticas.csv". 
summary(insurance)
stargazer(insurance, type = "text", title="Descriptive statistics", digits=1, out="Estatisticas.csv")


Output of summary statistics

#As we can see in the output, Customer Age has a minimum value of 15 and a miximum value of 988. This values represent wrong data since : 

   1) To be able to make an insurance policy, you have to be 18+ 
   2) No one has 988 years old 
Claims Rate is given in % and maximum value presented is 256.2 which also doesn't make sense.
Premiums represents the payment customers do regarding respective Insurance (motor, household, health, etc). Some variables of Premium shows negative values so I'll consider them as wrong data.
Let's treat wrong values. In first line, and second line, I'm telling R to change the variable Age to NA (no value, or missing value) if age is >100 or age is <18. Ifelse function in very intiuitive as it works as "if" function in Excel. I applied same methodology for Premiums variables.

insurance$Customer.Age=ifelse(insurance$Customer.Age>100,NA,insurance$Customer.Age)

insurance$Customer.Age=ifelse(insurance$Customer.Age<18,NA,insurance$Customer.Age)
insurance$Claims.Rate=ifelse(insurance$Claims.Rate>100,NA,insurance$Claims.Rate)
insurance$Premiums.in.LOB..Motor=ifelse(insurance$Premiums.in.LOB..Motor<0,NA,insurance$Premiums.in.LOB..Motor)
insurance$Premiums.in.LOB..Household=ifelse(insurance$Premiums.in.LOB..Household<0,NA,insurance$Premiums.in.LOB..Household)
insurance$Premiums.in.LOB..Health=ifelse(insurance$Premiums.in.LOB..Health<0,NA,insurance$Premiums.in.LOB..Health)
insurance$Premiums.in.LOB...Life=ifelse(insurance$Premiums.in.LOB...Life<0,NA,insurance$Premiums.in.LOB...Life)
insurance$Premiums.in.LOB..Work.Compensations=ifelse(insurance$Premiums.in.LOB..Work.Compensations<0,NA,insurance$Premiums.in.LOB..Work.Compensations)

#After doing the changes, and running again summary statistics I was able to see that for some Premium variables, minimum value are 0. Doesn't really make sense that a customer pays €0 for an insurance. Nevertheless, and since I'm not able to check how much actually customer payed, i will let the values.


2. Visualization

#Let's now plot some Histograms for continuous variables. I'll use ggplot function from ggplo2 package.

#Histogram for Premiums Motor

plot1=ggplot(data=insurance, aes(insurance$Premiums.in.LOB..Motor)) + 
  geom_histogram(breaks=seq(0, 700, by = 100), 
                 col="red", 
                 fill="green", 
                 alpha = .2) + 
  labs(title="Histogram") +
  labs(x="Premium_Motor", y="Count") + 
  xlim(c(0,700)) +theme_bw()+ theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank())

#Histogram for Premiums Life

plot2=ggplot(data=insurance, aes(insurance$Premiums.in.LOB...Life)) + 
  geom_histogram(breaks=seq(0, 350, by = 50), 
                 col="red", 
                 fill="green", 
                 alpha = .2) + 
  labs(title="Histogram") +
  labs(x="Premium_Life", y="Count") + 
  xlim(c(0,350)) +theme_bw()+ theme(panel.border = element_blank(), panel.grid.major = element_blank(),panel.grid.minor = element_blank())

#Histogram for Premiums Household

plot3=ggplot(data=insurance, aes(insurance$Premiums.in.LOB..Health)) + 
  geom_histogram(breaks=seq(0, 500, by = 50), 
                 col="red", 
                 fill="green", 
                 alpha = .2) + 
  labs(title="Histogram") +
  labs(x="Premium_Household", y="Count") + 
  xlim(c(0,500)) +theme_bw()+ theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank())

#Histogram for Premiums Health

plot4=ggplot(data=insurance, aes(insurance$Premiums.in.LOB..Health)) + 
  geom_histogram(breaks=seq(0, 500, by = 50), 
                 col="red", 
                 fill="green", 
                 alpha = .2) + 
  labs(title="Histogram") +
  labs(x="Premium_Health", y="Count") + 
  xlim(c(0,500)) +theme_bw()+ theme(panel.border = element_blank(), panel.grid.major = element_blank(),   panel.grid.minor = element_blank())

#Histogram for Premiums Work Compensations

plot5=ggplot(data=insurance, aes(insurance$Premiums.in.LOB..Work.Compensations)) + 
  geom_histogram(breaks=seq(0, 500, by = 50), 
                 col="red", 
                 fill="green", 
                 alpha = .2) + 
  labs(title="Histogram") +
  labs(x="Premium_Work", y="Count") + 
  xlim(c(0,500)) +theme_bw()+ theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank())

#As I want to display the histograms all in the same window, I'll install gridExtra package to use grid.arrange function. I'll plot all histograms in a window of 2x3 dimension.

require(gridExtra)
grid.arrange(plot1, plot2, plot3, plot4, plot5,nrow=2,ncol=3)




Histograms for continuous variables Part 1


#We can see by ploting the histograms that typically, Premium Motor payments were around €200 and €400. In the case of Premium Life and Premium work this value is lower : between €0 and  €50. Premium Hosehould payments and Premium Health payments were typically around €100 and €200.

#Histogram for Gross Monthly Salary
p1=ggplot(data=insurance, aes(insurance$Gross.Monthly.Salary)) + 
  geom_histogram(breaks=seq(0, 10000, by = 1000), 
                 col="red", 
                 fill="green", 
                 alpha = .2) + 
  labs(title="Histogram") +
  labs(x="Gross Monthly Salary", y="Count") + 
  xlim(c(0,10000)) +theme_bw()+ theme(panel.border = element_blank(), panel.grid.major = element_blank(),  panel.grid.minor = element_blank())

#Histogram for Customer Monetary Value

p2=ggplot(data=insurance, aes(insurance$Gross.Monthly.Salary)) + 
  geom_histogram(breaks=seq(-20, 12000, by = 1000), 
                 col="red", 
                 fill="green", 
                 alpha = .2) + 
  labs(title="Histogram") +
  labs(x="Monetary Value", y="Count") + 
  xlim(c(-20,12000)) +theme_bw()+ theme(panel.border = element_blank(), panel.grid.major = element_blank(),panel.grid.minor = element_blank())

#Histogram for Claims Rate

p3=ggplot(data=insurance, aes(insurance$Claims.Rate)) + 
  geom_histogram(breaks=seq(0, 2.5, by = 0.2), 
                 col="red", 
                 fill="green", 
                 alpha = .2) + 
  labs(title="Histogram") +
  labs(x="Claims_Rate", y="Count") + 
  xlim(c(0,2.5)) +theme_bw()+ theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank())

grid.arrange(p1,p2,p3,ncol=3)



Histograms of continuous variables part 2

#Gross Monthly Salary and Monetary Values between €3000 and €6000 are quite frequent and Claims Rate values are typically between 0.8 and 1.0


#Let's now plot some Bar plots for categorical variables. 

#Residential Area
plot6 = ggplot(data=subset(insurance,!is.na(Geographic.Living.Area)), aes(x=Geographic.Living.Area)) + ggtitle("Living Area") + xlab("Area") +
  geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5, na.rm=FALSE)+ ylab("Percentage")  +theme_bw()+ theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank())

#Education Degree

plot7= ggplot(data=subset(insurance,Educational.Degree!=0), aes(x=Educational.Degree)) + ggtitle("Educational Degree") + xlab("Educational Degree") +
  geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5, na.rm=FALSE)+ ylab("Percentage")  + coord_flip() +theme_bw()+ theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank())

#Has Children

plot8= ggplot(data=subset(insurance,!is.na(Has.Children..Y.1.)), aes(x=Has.Children..Y.1.)) + ggtitle("Children") + xlab("Children") +
  geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5, na.rm=FALSE)+ ylab("Percentage") +theme_bw()+ theme(panel.border = element_blank(), panel.grid.major = element_blank(),panel.grid.minor = element_blank())

grid.arrange(plot6,plot8,ncol=2)

plot7
Bar plots for Area and Children Variables



We can see that the majority of the customers  (around 40%) live in Area 4 and have children (around 70%).



Bar plot for educational degree

#Regarding Educational Degree, we can see that the highest percentage (around 50%) of customers have BSc or MSc degree, followed by High School (around 35%).



7th Project: Multinomial Logistic Regression - Predicting Crime in Sydney

Sydney Crime Spot Sydney Crime Spot Mafalda Silva 14th November 2018 T...