HR Analytics using Supervised and Unsupervised Machine Learning in R

Problem Statement

Why do so many tech employees choose to leave their companies? According to a news release in 2018, the tech industry was notorious for a high employee turnover rate of 13.2%, as compared to other industries such as Media and Entertainment (11.4%), Government/Education (11.2%) and Financial Services (10.8%) (viGlobal, 2018). Given the pervasive turnover within the industry, should such culture be accepted in HR talent management teams? Or are there preemptive data-driven strategies that can be employed to tackle staff retention?

High turnover has adverse effects on the business. From a monetary standpoint, the amount that the company needs to spend on a replacement staff can range from one to seven times the employee’s annual salary, and this is due to many factors including, inter alia, the cost and time to find replacements and the revenue contribution attributed to that employee (Kochanski & Ledford, 2001 and The Build Network, 2014). Other impacts include low morale for the team and bottlenecks caused by the laborious efforts needed to onboard incoming staff, exacerbated by the knowledge drain of the outgoing staff.

Since 2012, International Business Machines Corp., otherwise known as IBM, has been facing a steady decline in the number of employees year-on-year, with a total reduction of about 20% of its workforce from then till 2020 (Alsop, 2021). This does not bode well for the company especially when talent retention is key to driving revenue growth. The perennial challenge of retaining talents and skilled employees therefore presents a strong business case for IBM to invest in talent retention strategies. For this reason, this project seeks to explore various machine learning methodologies to predict turnover, characterize and personify the type of employee that would leave, and delve into the reasons that could be driving talents away. Armed with these insights, IBM can curate targeted HR strategies for more effective talent retention.

Literature Review

A myriad of research studies have been conducted on tech employees’ intention to leave (Ghapanchi & Aurum, 2011). While there are many organizational antecedents/factors that affect attrition, the reasons can be categorized into five (5) broad categories: individual factors, organizational factors, job-related factors, psychological factors and environmental factors, among which, job-related antecedents such as role ambiguity and role conflicts are the most often cited drivers that trigger turnovers (Ghapanchi & Aurum, 2011). Further corroborating this phenomena, according to a leadership management whitepaper by IBM Smarter Workforce Institute, who conducted an in-depth research among more than 20,000 employees, job-related reasons are the biggest factors (cited by 40% of respondents) for leaving because employees are not happy with their jobs (Zhang & Feinzig, 2016).

While Ghapanchi & Aurum (2011) and Zhang & Feinzig (2016) suggest that job-related factors are the most cited reason for turnover, recent studies have found psychological factors such as pay satisfaction to be the dominant factor. According to the Deloitte Global Millennial Survey in 2019, close to half of the respondents (49%) reported that they would leave their jobs if they had a choice, of which 43% of them cited dissatisfaction with their pay and benefits as the top reason for leaving (Deloitte, 2019). In the tech industry, the results are worse. According to the Dice Survey in 2019, which is a salary report of United States tech firms, 45% of respondents highlighted they would like to change employers within the next year, of which 71% of them cited they are seeking salary compensation as the top reason as to why they would leave (The Dice, 2020). The high turnover in the entire tech industry, and in top tech firms in particular, are likely a result of the increasing demand for tech talents, driven by the rapid growth in technology (Hecker, 2005), which results in a race to attract talents commensurate with increased compensation (ViGlobal, 2018).

Diving deeper into the insights by the IBM whitepaper, it also suggests that turnover rates can vary across generations. For instance, early and mid-career workers are more likely to leave the job for better career opportunities, as mentioned by 74% of Millennials, 68% of Generation Xs and 54% of Baby Boomers (Zhang & Feinzig, 2016). In a similar trend, the study also highlighted that early and mid-career workers have higher propensity to leave the job for greater employer brands and greater flexibility at work (Zhang & Feinzig, 2016).

Regardless of the reasons for turnover, the high turnover of employees in the tech industry presents many ramifications to the organization. The direct effect is that the turnover of employees will slow down the company’s overall business and bring productivity losses. For instance, if an existing software developer leaves, the company often has to take 43 days on average to hire a new one, which is approximately a month and a half of productivity loss and it does not account for onboarding (Lewis, 2020). This consequence will cost the company as much as US$33,251 for each employee who leaves (Lewis, 2020). Apart from the cost related to productivity loss, some sources also factor in the loss of intellectual capital attributed to the departing employees and the time needed to onboard new staff (Kochanski & Ledford, 2001).

As can be seen from the various studies mentioned above, there are a variety of reasons as to why employees would leave a company – job fit, pay satisfaction, career development, etc. and the list goes on. As the tech industry grows and evolves, one is unable to generalize a particular evergreen reason for employees leaving a company because of cultural shifts, demography shifts and generational differences over time, among others (i.e., the reason for an employee leaving IBM in the early 2000s may be different than an employee leaving in 2020). Likewise, an evergreen solution to tackle talent retention in a company would be futile due to the said reasons. Furthermore, due to the many ramifications turnover has on the organization, there is a strong impetus to improve staff retention. This underscores the need for a current, up-to-date, data-driven approach that leverages on existing live employee data to uncover the latest/current reasons that could be driving talents away so that IBM can predict and personify/characterize the type of employees that might leave the organization. These insights would enable better formulation of HR strategies for talent retention.

Research Questions

Myriad factors could affect employee attrition. However, this project aims to identify the salient factors that influences attrition at IBM so that management can make HR decisions for the greatest impact. Our study focuses on internal attrition data from IBM and employee reviews from Glassdoor. We aim to provide IBM with valuable insights that could lead to the most effective and efficient solutions to improve talent retention by answering the following three (3) questions:

What are the key driving factors influencing attrition the most at IBM?

The first question explores the key factors that influences attrition at IBM, among the many other factors cited by Ghapanchi & Aurum (2011) and Zhang & Feinzig (2016). Having such insights would allow IBM HR to develop watch-areas on these factors to improve retention.

Who is likely to leave IBM?

The second question is a prediction problem, where we aim to fit a model with the key factors identified in Question 1. The model accuracy will be optimized by various supervised and unsupervised techniques such as logistic regression, sentiment analysis and clustering. This will help IBM HR to identify talents who are at risk of leaving.

What is the employee type that has the highest tendency to leave IBM?

Finally, the third question seeks to personify/characterize the type of employee that has the highest risk of leaving IBM, which will allow better formulation of HR strategies to retain such individuals with such a persona. Depending on how many personas we can achieve through our analysis, we can adopt a risk-based approach to curate unique HR strategies for employees with different personas.

Data Description

We have two datasets for this project, as shown:

IBM Attrition Dataset: dat_hr
Glassdoor Scraped Text Dataset: dat_gd

The first dataset (dat_hr) is taken from IBM and it consists of attrition data and demographic variables such as employee satisfaction, income and education, among others, while the second dataset (dat_gd) is collected by scraping Glassdoor and it contains past and present IBM employee text reviews and ratings of the company. Details of the various data fields can be found in the Appendix.

Data Preparation

IBM Attrition Dataset

The first thing that we will do is to get a detailed overview of the data in the IBM attrition dataset, dat_hr. We use skim() instead of the conventional glimpse() and str() as it offers additional insights such as the number of missing observations, complete rate and the number of unique observations for both factor and numeric column types. It also returns statistical summaries, which are really useful for quick analysis.

Data summary
Name	dat_hr
Number of rows	1470
Number of columns	35
_______________________
Column type frequency:
factor	9
numeric	26
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
Attrition	1	FALSE	2	No: 1233, Yes: 237
BusinessTravel	1	FALSE	3	Tra: 1043, Tra: 277, Non: 150
Department	1	FALSE	3	Res: 961, Sal: 446, Hum: 63
EducationField	1	FALSE	6	Lif: 606, Med: 464, Mar: 159, Tec: 132
Gender	1	FALSE	2	Mal: 882, Fem: 588
JobRole	1	FALSE	9	Sal: 326, Res: 292, Lab: 259, Man: 145
MaritalStatus	1	FALSE	3	Mar: 673, Sin: 470, Div: 327
Over18	1	FALSE	1	Y: 1470
OverTime	1	FALSE	2	No: 1054, Yes: 416

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ï..Age	1	36.92	9.14	18	30.00	36.0	43.00	60	▂▇▇▃▂
DailyRate	1	802.49	403.51	102	465.00	802.0	1157.00	1499	▇▇▇▇▇
DistanceFromHome	1	9.19	8.11	1	2.00	7.0	14.00	29	▇▅▂▂▂
Education	1	2.91	1.02	1	2.00	3.0	4.00	5	▂▃▇▆▁
EmployeeCount	1	1.00	0.00	1	1.00	1.0	1.00	1	▁▁▇▁▁
EmployeeNumber	1	1024.87	602.02	1	491.25	1020.5	1555.75	2068	▇▇▇▇▇
EnvironmentSatisfaction	1	2.72	1.09	1	2.00	3.0	4.00	4	▅▅▁▇▇
HourlyRate	1	65.89	20.33	30	48.00	66.0	83.75	100	▇▇▇▇▇
JobInvolvement	1	2.73	0.71	1	2.00	3.0	3.00	4	▁▃▁▇▁
JobLevel	1	2.06	1.11	1	1.00	2.0	3.00	5	▇▇▃▂▁
JobSatisfaction	1	2.73	1.10	1	2.00	3.0	4.00	4	▅▅▁▇▇
MonthlyIncome	1	6502.93	4707.96	1009	2911.00	4919.0	8379.00	19999	▇▅▂▁▂
MonthlyRate	1	14313.10	7117.79	2094	8047.00	14235.5	20461.50	26999	▇▇▇▇▇
NumCompaniesWorked	1	2.69	2.50	0	1.00	2.0	4.00	9	▇▃▂▂▁
PercentSalaryHike	1	15.21	3.66	11	12.00	14.0	18.00	25	▇▅▃▂▁
PerformanceRating	1	3.15	0.36	3	3.00	3.0	3.00	4	▇▁▁▁▂
RelationshipSatisfaction	1	2.71	1.08	1	2.00	3.0	4.00	4	▅▅▁▇▇
StandardHours	1	80.00	0.00	80	80.00	80.0	80.00	80	▁▁▇▁▁
StockOptionLevel	1	0.79	0.85	0	0.00	1.0	1.00	3	▇▇▁▂▁
TotalWorkingYears	1	11.28	7.78	0	6.00	10.0	15.00	40	▇▇▂▁▁
TrainingTimesLastYear	1	2.80	1.29	0	2.00	3.0	3.00	6	▂▇▇▂▃
WorkLifeBalance	1	2.76	0.71	1	2.00	3.0	3.00	4	▁▃▁▇▂
YearsAtCompany	1	7.01	6.13	0	3.00	5.0	9.00	40	▇▂▁▁▁
YearsInCurrentRole	1	4.23	3.62	0	2.00	3.0	7.00	18	▇▃▂▁▁
YearsSinceLastPromotion	1	2.19	3.22	0	0.00	1.0	3.00	15	▇▁▁▁▁
YearsWithCurrManager	1	4.12	3.57	0	2.00	3.0	7.00	17	▇▂▅▁▁

Here we perform some data transformation, like renaming column names with typos in them and converting some column types into the appropriate data types.

# Renaming the Age column
dat_hr <- dat_hr %>%
  dplyr::rename(Age = ï..Age)

# Converting numeric to factor
dat_hr$Education <- as.factor(dat_hr$Education)
dat_hr$EnvironmentSatisfaction <- as.factor(dat_hr$EnvironmentSatisfaction)
dat_hr$JobInvolvement <- as.factor(dat_hr$JobInvolvement)
dat_hr$JobLevel <- as.factor(dat_hr$JobLevel)
dat_hr$JobSatisfaction <- as.factor(dat_hr$JobSatisfaction)
dat_hr$StockOptionLevel <- as.factor(dat_hr$StockOptionLevel)
dat_hr$PerformanceRating <- as.factor(dat_hr$PerformanceRating)
dat_hr$RelationshipSatisfaction <- as.factor(dat_hr$RelationshipSatisfaction)
dat_hr$WorkLifeBalance <- as.factor(dat_hr$WorkLifeBalance)

Based on the skim() summary above, there are a few variables that only have one unique value. These variables are: Over18, EmployeeCount and StandardHours. We will remove these columns as they are not useful for the modeling process.

dat_hr <- dat_hr %>%
  dplyr::select(-Over18, -EmployeeCount, -StandardHours)

As a final check, we ensure that there are no NAs in the dataset.

sum(is.na(dat_hr))

## [1] 0

Finally, this is the cleaned dat_hr dataset that is ready for analysis work.

## Rows: 1,470
## Columns: 32
## $ Age                      <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2~
## $ Attrition                <fct> Yes, No, Yes, No, No, No, No, No, No, No, No,~
## $ BusinessTravel           <fct> Travel_Rarely, Travel_Frequently, Travel_Rare~
## $ DailyRate                <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,~
## $ Department               <fct> Sales, Research & Development, Research & Dev~
## $ DistanceFromHome         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, ~
## $ Education                <fct> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, ~
## $ EducationField           <fct> Life Sciences, Life Sciences, Other, Life Sci~
## $ EmployeeNumber           <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,~
## $ EnvironmentSatisfaction  <fct> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, ~
## $ Gender                   <fct> Female, Male, Male, Female, Male, Male, Femal~
## $ HourlyRate               <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4~
## $ JobInvolvement           <fct> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, ~
## $ JobLevel                 <fct> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, ~
## $ JobRole                  <fct> Sales Executive, Research Scientist, Laborato~
## $ JobSatisfaction          <fct> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, ~
## $ MaritalStatus            <fct> Single, Married, Single, Married, Married, Si~
## $ MonthlyIncome            <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269~
## $ MonthlyRate              <int> 19479, 24907, 2396, 23159, 16632, 11864, 9964~
## $ NumCompaniesWorked       <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, ~
## $ OverTime                 <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~
## $ PercentSalaryHike        <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1~
## $ PerformanceRating        <fct> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, ~
## $ RelationshipSatisfaction <fct> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, ~
## $ StockOptionLevel         <fct> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, ~
## $ TotalWorkingYears        <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3~
## $ TrainingTimesLastYear    <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, ~
## $ WorkLifeBalance          <fct> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, ~
## $ YearsAtCompany           <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,~
## $ YearsInCurrentRole       <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, ~
## $ YearsSinceLastPromotion  <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, ~
## $ YearsWithCurrManager     <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, ~

Glassdoor Scraped Text Dataset

The first thing we will do is to get an overview of what the scraped data looks like. We use the glimpse() function for that. As you can see from the summary below, much cleaning is needed to prepare the data for text mining. For instance, Ratings need to be in numeric form, Current_Former needs to be factorized, Review_ID has web links that need to be truncated, Role has dates concatenated with the roles, Location can be shortened to the 2-letter state abbreviation for standardization purposes and to tokenize the Pros and Cons columns using tidytext.

## Rows: 890
## Columns: 8
## $ Ratings        <chr> "5.0", "4.0", "4.0", "4.0", "5.0", "4.0", "2.0", "4.0",~
## $ Current_Former <chr> "Current Employee", "Current Employee, more than 8 year~
## $ Title          <chr> "EXCELLENT", "Advisory Engineer in STG, IBM", "IBM is n~
## $ Review_ID      <chr> "https://www.glassdoor.sg/Reviews/Employee-Review-IBM-R~
## $ Role           <chr> "19 Feb 2021 - Executive", "26 Aug 2014 - Advisory Engi~
## $ Location       <chr> "Chicago, IL", "Hopewell Junction, NY", "Dallas, TX", "~
## $ Pros           <chr> "Great place to work IBM", "Disclaimer: A lot of what I~
## $ Cons           <chr> "No cons with IBM employment. Great place to work.", "1~

Here we perform the necessary data preparation.

Clean the Ratings column.

# Convert Ratings to numeric
dat_gd$Ratings <- as.numeric(dat_gd$Ratings)

Clean the Current_Former column. Notice that the employment length is concatenated with the string. Therefore we need to separate the employment length from the column and create a new column for it.

# Split the string into 2 columns: Current_Former and Employment_Length
dat_gd <- dat_gd %>%
  separate(Current_Former, c("Current_Former", "Employment_Length"), ", ")

# Convert Current_Former to factor
dat_gd$Current_Former <- as.factor(dat_gd$Current_Former)

Clean Review_ID column.

# Extract review IDs from web links
dat_gd$Review_ID <- sub("https://www.glassdoor.sg/Reviews/Employee-Review-IBM-RVW", "", dat_gd$Review_ID)
dat_gd$Review_ID <- sub(".htm", "", dat_gd$Review_ID)

# Convert Review_ID to numeric
dat_gd$Review_ID <- as.numeric(dat_gd$Review_ID)

Check for duplicates in the Review_ID.

dim(dat_gd[duplicated(dat_gd$Review_ID),])[1]

## [1] 1

There is 1 duplicate entry. Check for the specific Review_ID that is duplicated.

dat_gd[duplicated(dat_gd$Review_ID),]

## # A tibble: 1 x 9
##   Ratings Current_Former  Employment_Length Title Review_ID Role  Location Pros 
##     <dbl> <fct>           <chr>             <chr>     <dbl> <chr> <chr>    <chr>
## 1       5 Current Employ~ NA                EXCE~  42772946 19 F~ Chicago~ Grea~
## # ... with 1 more variable: Cons <chr>

Review_ID 42772946 is duplicated and we will remove it from the data.

dat_gd <- dat_gd[!duplicated(dat_gd$Review_ID),]

Clean the Role column.

# Separate the date portion from the role portion
dat_gd <- dat_gd %>%
  separate(Role, c("Review_Date", "Role"), " - ") 

# Convert Review_Date to date format
dat_gd$Review_Date <- as.Date(dat_gd$Review_Date, "%d %b %Y")

Clean the Location column.

# Extract the 2-letter state abbreviation
dat_gd$Location <- sub(".*,\\s*", "", dat_gd$Location)

Check whether there are outliers in the Location column that do not conform to the 2-letter state abbreviation.

dat_gd$Location[nchar(dat_gd$Location) != 2]

## [1] "Tarlac"  "Chennai"

There are two places that are not in the United States. Tarlac is in Philippines while Chennai is in India. We will remove these rows from the data and convert the type to factor.

# Remove rows
dat_gd <- dat_gd[nchar(dat_gd$Location) == 2,]

# Convert character to factor
dat_gd$Location <- as.factor(dat_gd$Location)

Finally, this is the cleaned dat_gd dataset that is ready for analysis work.

## Rows: 887
## Columns: 10
## $ Ratings           <dbl> 5, 4, 4, 4, 5, 4, 2, 4, 5, 5, 4, 4, 5, 4, 5, 5, 4, 1~
## $ Current_Former    <fct> Current Employee, Current Employee, Current Employee~
## $ Employment_Length <chr> NA, "more than 8 years", "more than 10 years", "more~
## $ Title             <chr> "EXCELLENT", "Advisory Engineer in STG, IBM", "IBM i~
## $ Review_ID         <dbl> 42772946, 4852131, 33522445, 47288155, 46360847, 470~
## $ Review_Date       <date> 2021-02-19, 2014-08-26, 2020-06-04, 2021-05-21, 202~
## $ Role              <chr> "Executive", "Advisory Engineer", "Bid Proposal Mana~
## $ Location          <fct> IL, NY, TX, LA, TX, FL, MI, NY, NC, DC, NY, NY, NY, ~
## $ Pros              <chr> "Great place to work IBM", "Disclaimer: A lot of wha~
## $ Cons              <chr> "No cons with IBM employment. Great place to work.",~

Exploratory Data Analysis

IBM Attrition Dataset

The outcome variable is Attrition and we want to know its distribution.

table(dat_hr$Attrition)

## 
##   No  Yes 
## 1233  237

Perhaps visualizing it may be better to understand the distribution.

dat_hr %>%
  dplyr::group_by(Attrition) %>%
  dplyr::summarise(Total = n()) %>%
  ggplot(aes(x=Attrition, y=Total, fill=Attrition)) +
  geom_col() +
  ggtitle("Distribution of Attrition") +
  theme_classic()

Let’s explore the attrition by income level.

dat_hr %>%
  ggplot(aes(x=MonthlyIncome, fill=Attrition)) +
  geom_density(alpha=0.6) +
  labs(x="Monthly Income", y="") +
  ggtitle("Attrition by Income Level") +
  theme_classic()

Exploring attrition rates with respect to age.

dat_hr %>%
  ggplot(aes(x=Attrition, y=Age, fill=Attrition)) +
  geom_boxplot() +
  ggtitle("Attrition by Age") +
  theme_classic()

Here we calculate the correlation between Attrition and the rest of the variables in the dataset. A correlation funnel is plotted using the correlationfunnel library and it shows us the variables that have the highest to lowest correlation with Attrition (like a funnel). To interpret the plot, for instance, we notice that OverTime is highly correlated with Attrition as it sits on the top of the funnel. Heuristically speaking, employees who work overtime would likely lead to higher attrition due to increased workloads and lengthier working hours. Our correlation plot reflects this observation, which shows that employees who work overtime have positive correlation (right side of the funnel) while those who do not work overtime have negative correlation (left side of the funnel) to Attrition.

To select variables for modeling and clustering work, we will select a correlation threshold (e.g., > 0.1) as it does not make sense to include all variables – this will complicate the analysis of the results and make it more difficult for IBM stakeholders to act upon the findings.

Glassdoor Scraped Text Dataset

The first thing we will do is to analyze the Pros and Cons text reviews in the dat_gd dataset to just have a glimpse of what employees (past and present) are feeling about IBM.

Pros

Analyzing the Pros:

dat_gd %>% 
  unnest_tokens(input=Pros, output=word) %>% 
  anti_join(stop_words) %>% 
  count(word, sort=TRUE)

## # A tibble: 1,787 x 2
##    word              n
##    <chr>         <int>
##  1 ibm             163
##  2 people          157
##  3 company         149
##  4 opportunities   136
##  5 benefits        126
##  6 life            101
##  7 balance          94
##  8 learning         78
##  9 culture          72
## 10 learn            70
## # ... with 1,777 more rows

# Visualizing top 10 most common words in Pros
dat_gd %>% 
  unnest_tokens(input=Pros, output=word) %>% 
  anti_join(stop_words) %>% 
  count(word, sort=TRUE) %>% 
  slice(1:10) %>% 
  ggplot() + geom_bar(aes(x=reorder(word, -n), y=n), stat="identity", fill="#40B5AD") +
  theme_classic() +
  labs(title = "Top 10 Single Words in Pros",
       caption = "Data Source: Glassdoor") +
  theme(axis.title.x=element_blank())

# Visualizing top 30 bigrams in Pros
dat_gd %>% 
  unnest_tokens(input=Pros, output=word, token="ngrams", n=2) %>% 
  separate(word, c("word1", "word2"), sep=" ") %>% 
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>% 
  unite(word, word1, word2, sep=" ") %>% 
  count(word, sort=TRUE) %>% 
  slice(1:30) %>% 
  ggplot() + geom_bar(aes(x=reorder(word, n), y=n), stat = "identity", fill = "#40B5AD") +
  theme_classic() +
  coord_flip() +
  labs(title = "Top 30 Bi-Words in Pros",
       caption = "Data Source: Glassdoor") +
  theme(axis.title.y=element_blank())

Cons

Analyzing the Cons:

dat_gd %>% 
  unnest_tokens(input=Cons, output=word) %>% 
  anti_join(stop_words) %>% 
  count(word, sort=TRUE)

## # A tibble: 2,399 x 2
##    word           n
##    <chr>      <int>
##  1 company      152
##  2 ibm          133
##  3 management   112
##  4 time          60
##  5 pay           59
##  6 salary        51
##  7 job           47
##  8 people        46
##  9 team          44
## 10 hard          43
## # ... with 2,389 more rows

# Visualizing top 10 most common words in Cons
dat_gd %>% 
  unnest_tokens(input=Cons, output=word) %>% 
  anti_join(stop_words) %>% 
  count(word, sort=TRUE) %>% 
  slice(1:10) %>% 
  ggplot() + geom_bar(aes(x=reorder(word, -n), y=n), stat="identity", fill="#de5833") +
  theme_classic() +
  labs(title = "Top 10 Single Words in Cons",
       caption = "Data Source: Glassdoor") +
  theme(axis.title.x=element_blank())

# Visualizing top 30 bigrams in Cons
dat_gd %>% 
  unnest_tokens(input=Cons, output=word, token="ngrams", n=2) %>% 
  separate(word, c("word1", "word2"), sep=" ") %>% 
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>% 
  unite(word, word1, word2, sep=" ") %>% 
  count(word, sort=TRUE) %>% 
  slice(1:30) %>% 
  ggplot() + geom_bar(aes(x=reorder(word, n), y=n), stat = "identity", fill = "#de5833") +
  theme_classic() +
  coord_flip() +
  labs(title = "Top 30 Bi-Words in Cons",
       caption = "Data Source: Glassdoor") +
  theme(axis.title.y=element_blank())

Sentiment Analysis on Glassdoor Reviews

In the Glassdoor reviews, there are two columns of text data called Pros and Cons – one for positive reviews and one for negative reviews. The idea here is to compute the sentiment scores and to use them as inputs to the prediction model that would predict attrition at IBM. We want to explore whether the sentiments in the Glassdoor reviews would better predict attrition at IBM.

Renaming Roles in Glassdoor Dataset

Before any sentiment analysis can be done, we need to categorize the Role column of the dat_gd dataset into a smaller group of categories as there are myriad type of roles that were written in the text reviews, which would make it difficult to perform any sort of summary statistic and joining of data with the dat_hr dataset. The goal is to have an aggregated/averaged sentiment score for each job category based on the text reviews. Eventually, we grouped all roles based on 6 distinct job categories (shown below) because they have distinct career ladder paths from each other, and within each job category, they would have similarities such as pay and education backgrounds, among other things.

The 6 job categories are:

AESP (Assistant Engineering & Scientific Personnel): E.g., Lab Technicians, Technical Support/Specialists, etc.
Corporate: E.g., Admin Executive, Procurement, Human Resources, Business Partners, etc.
Director: E.g., Vice President, Senior Vice President, Chief Technology Offier, etc.
ESP (Engineering & Scientific Personnel): E.g., Data Scientists, Engineers, Member of Technical Staff, Programmers, etc.
Manager: E.g., Team Leads, Project Managers, Service Delivery Managers, etc.
Sales: E.g., Sales Support, Client Representative, Account Representative, etc.

However, while performing the re-categorization exercise, there were roles that were ambiguous and needed manual research in order to determine which category it belonged to. For instance, there were roles such as “sdm” and “rsm”, to which we used Glassdoor’s website to understand what the individual job descriptions were before assigning them into the appropriate categories (“sdm” and “rsm” are the abbreviations for service delivery manager and real storage manager respectively).

# Convert all roles to lowercase
dat_gd$Role <- tolower(dat_gd$Role)

# Rename all roles into standard role types
dat_gd_newroles <- dat_gd %>%
  mutate(Role  = case_when(
    str_detect(Role, "manager") ~ "Manager",
    str_detect(Role, "bdm") ~ "Manager",
    str_detect(Role, "director") ~ "Director",
    str_detect(Role, "engineer") ~ "ESP",
    str_detect(Role, "developer") ~ "ESP",
    str_detect(Role, "analyst") ~ "ESP",
    str_detect(Role, "programmer") ~ "ESP",
    str_detect(Role, "architect") ~ "ESP",
    str_detect(Role, "scientist") ~ "ESP",
    str_detect(Role, "data science") ~ "ESP",
    str_detect(Role, "data scientist") ~ "ESP",
    str_detect(Role, "data scentist") ~ "ESP",
    str_detect(Role, "senior technical staff member") ~ "ESP",
    str_detect(Role, "researcher") ~ "ESP",
    str_detect(Role, "design") ~ "ESP",
    str_detect(Role, "sales") ~ "Sales",
    str_detect(Role, "talent") ~ "Corporate",
    str_detect(Role, "hr") ~ "Corporate",
    str_detect(Role, "human") ~ "Corporate",
    str_detect(Role, "executive") ~ "Corporate",
    str_detect(Role, "partner") ~ "Corporate",
    str_detect(Role, "accountant") ~ "Corporate",
    str_detect(Role, "technician") ~ "AESP",
    str_detect(Role, "specialist") ~ "AESP",
    str_detect(Role, "senior") ~ "Manager",
    str_detect(Role, "managing") ~ "Manager",
    str_detect(Role, "consultant") ~ "Corporate",
    str_detect(Role, "admin") ~ "Corporate",
    str_detect(Role, "vice pres") ~ "Director",
    str_detect(Role, "svp") ~ "Director",
    str_detect(Role, "research") ~ "ESP",
    str_detect(Role, "lead") ~ "Manager",
    str_detect(Role, "writer") ~ "Corporate",
    str_detect(Role, "chief") ~ "Director",
    str_detect(Role, "tech") ~ "AESP",
    str_detect(Role, "client rep") ~ "Sales",
    str_detect(Role, "service rep") ~ "Sales",
    str_detect(Role, "account rep") ~ "Sales",
    str_detect(Role, "support serv") ~ "Sales",
    str_detect(Role, "acquisition") ~ "Corporate",
    str_detect(Role, "president") ~ "Director",
    str_detect(Role, "legal") ~ "Corporate",
    str_detect(Role, "business") ~ "Corporate",
    str_detect(Role, "manage") ~ "Manager",
    str_detect(Role, "product") ~ "Manager",
    str_detect(Role, "procurement") ~ "Corporate",
    str_detect(Role, "market") ~ "Corporate",
    str_detect(Role, "finance") ~ "Corporate",
    str_detect(Role, "people") ~ "Corporate",
    str_detect(Role, "rep") ~ "Corporate",
    str_detect(Role, "dev") ~ "ESP",
    str_detect(Role, "agent") ~ "Corporate",
    str_detect(Role, "bdr") ~ "Corporate",
    str_detect(Role, "quality") ~ "AESP",
    str_detect(Role, "support") ~ "AESP",
    str_detect(Role, "attorney") ~ "Corporate",
    str_detect(Role, "social") ~ "Corporate",
    str_detect(Role, "relations") ~ "Corporate",
    str_detect(Role, "content") ~ "Corporate",
    str_detect(Role, "hardware") ~ "AESP",
    str_detect(Role, "scrum") ~ "Corporate",
    str_detect(Role, "sdm") ~ "Manager",
    str_detect(Role, "rsm") ~ "Manager",
    TRUE ~ Role
  ))

# Discard all other rows that do not have well defined roles
dat_gd_filtered <- dat_gd_newroles %>%
  filter(Role == "ESP" | Role == "Manager" | Role == "Corporate" |
           Role == "Sales" | Role == "AESP" | Role == "Director")

# Convert Role to factor
dat_gd_filtered$Role <- as.factor(dat_gd_filtered$Role)

Ensure that the roles have 6 distinct categories and explore the distribution.

levels(dat_gd_filtered$Role)

## [1] "AESP"      "Corporate" "Director"  "ESP"       "Manager"   "Sales"

summary(dat_gd_filtered$Role)

##      AESP Corporate  Director       ESP   Manager     Sales 
##        53       139        32       334       232        58

Conduct Sentiment Analysis

For the sentiment analysis, we use the sentimentr library as it handles valence shifters (i.e., negators amplifiers, de-amplifiers and adversative conjunctions) while maintaining speed (an alternative is the qdap library, but it is slower in speed). Another reason is because the library gives us the option to aggregate the sentiment scores by grouping variables, which is something that we want to do (i.e., we want to group it by the job roles).

Here we computed the average sentiment scores for the Pros in the text reviews and we visualized them in a density plot for easy viewing.

## Pros
# Avg sentiment scores summarized by Role
dat_gd_avgsentiment_pros <- dat_gd_filtered %>%
  select(-Cons) %>%
  get_sentences() %>%
  sentiment_by("Role")

dat_gd_avgsentiment_pros

##         Role word_count        sd ave_sentiment
## 1:      AESP        871 0.3842475     0.5730397
## 2: Corporate       2186 0.3597134     0.4696195
## 3:  Director        642 0.5245235     0.5968319
## 4:       ESP       6274 0.3822417     0.5135917
## 5:   Manager       3470 0.3515144     0.5727328
## 6:     Sales        740 0.3391197     0.5839903

# Sentiment scores of Pros reviews faceted by Role
dat_gd_filtered %>%
  select(-Cons) %>%
  get_sentences() %>%
  sentiment() %>%
  group_by(element_id, Role) %>%
  summarise(avg_sentiment = mean(sentiment)) %>%
  ungroup() %>%
  left_join(dat_gd_avgsentiment_pros, by="Role") %>%
  ggplot(aes(x=avg_sentiment, fill=Role)) +
  geom_density(alpha=0.6) +
  facet_wrap(~Role) +
  geom_vline(aes(xintercept=ave_sentiment), size=1, color="red") +
  geom_text(aes(x=ave_sentiment+0.6, label=paste0("Mean\n",round(ave_sentiment,3)), y=1.2)) +
  ggtitle("Distribution of Sentiment Scores for Reviews in Pros") +
  labs(x="Sentiment Score", y="Density") +
  theme_classic()

Now we compute the average sentiment scores for the Cons in the text reviews and we visualize them in a density plot.

## Cons
# Avg sentiment scores summarized by Role
dat_gd_avgsentiment_cons <- dat_gd_filtered %>%
  select(-Pros) %>%
  get_sentences() %>%
  sentiment_by("Role")

dat_gd_avgsentiment_cons

##         Role word_count        sd ave_sentiment
## 1:      AESP        960 0.3226684  -0.061601871
## 2: Corporate       2392 0.3291854   0.031341807
## 3:  Director        664 0.3920271  -0.030371233
## 4:       ESP       7116 0.3493464   0.002896978
## 5:   Manager       4079 0.3316412   0.044559058
## 6:     Sales       1310 0.3581594  -0.015564249

# Sentiment scores of Cons reviews faceted by Role
dat_gd_filtered %>%
  select(-Pros) %>%
  get_sentences() %>%
  sentiment() %>%
  group_by(element_id, Role) %>%
  summarise(avg_sentiment = mean(sentiment)) %>%
  ungroup() %>%
  left_join(dat_gd_avgsentiment_cons, by="Role") %>%
  ggplot(aes(x=avg_sentiment, fill=Role)) +
  geom_density(alpha=0.6) +
  facet_wrap(~Role) +
  geom_vline(aes(xintercept=ave_sentiment), size=1, color="red") +
  geom_text(aes(x=ave_sentiment+0.6, label=paste0("Mean\n",round(ave_sentiment,3)), y=1.2)) +
  ggtitle("Distribution of Sentiment Scores for Reviews in Cons") +
  labs(x="Sentiment Score", y="Density") +
  theme_classic()

As you can see from the distribution, the average sentiment scores from the Pros resulted in a more positive polarity while the Cons resulted in a less positive polarity relative to each other, which is anecdotally expected. These sentiment scores will subsequently be used as inputs to the prediction model in the later steps.

Join Both Datasets

With the amalgamation of the roles in the Glassdoor reviews into 6 distinct categories, we shall do the same for the IBM dataset, dat_hr, and attempt to join the sentiment scores to it for prediction modeling purposes.

Renaming Roles in IBM Dataset

First, we performed the re-categorization of the job roles in the IBM dataset. This re-categorization work was definitely easier and more trivial as compared to Glassdoor reviews because the data had distinct job role categories to begin with.

# Convert JobRole from factor to character
dat_hr$JobRole <- as.character(dat_hr$JobRole)

# Rename all roles into standard role types
dat_hr_newroles <- dat_hr %>%
  mutate(JobRole = case_when(
    str_detect(JobRole, "Sales Executive") ~ "Sales",
    str_detect(JobRole, "Research Scientist") ~ "ESP",
    str_detect(JobRole, "Laboratory Technician") ~ "AESP",
    str_detect(JobRole, "Manufacturing Director") ~ "Director",
    str_detect(JobRole, "Healthcare Representative") ~ "Corporate",
    str_detect(JobRole, "Sales Representative") ~ "Sales",
    str_detect(JobRole, "Research Director") ~ "Director",
    str_detect(JobRole, "Human Resources") ~ "Corporate",
    TRUE ~ JobRole
  ))

# Convert JobRole to factor
dat_hr_newroles$JobRole <- as.factor(dat_hr_newroles$JobRole)

# Rename JobRole to Role in order to match dat_gd_avgsentiment_pros and cons
dat_hr_renamed <- dat_hr_newroles %>%
  rename(Role = JobRole)

Ensure that the roles have 6 distinct categories and explore the distribution.

levels(dat_hr_renamed$Role)

## [1] "AESP"      "Corporate" "Director"  "ESP"       "Manager"   "Sales"

summary(dat_hr_renamed$Role)

##      AESP Corporate  Director       ESP   Manager     Sales 
##       259       183       225       292       102       409

Join Sentiment Scores to IBM Dataset

Now that both datasets have a common baseline (i.e., job roles) for joining, we did a left_join to merge the sentiments scores for Pros and Cons into the IBM dataset. The sentiment scores are appended in the last two columns of dat_hr_final.

# dat_gd_avgsentiment_pros joined to dat_hr_renamed
dat_hr_final <- dat_hr_renamed %>%
  left_join(dat_gd_avgsentiment_pros, by="Role") %>%
  select(-word_count, -sd) %>%
  rename(ave_sentiment_pros = ave_sentiment) %>%
  left_join(dat_gd_avgsentiment_cons, by="Role") %>%
  select(-word_count, -sd) %>%
  rename(ave_sentiment_cons = ave_sentiment)

glimpse(dat_hr_final)

## Rows: 1,470
## Columns: 34
## $ Age                      <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2~
## $ Attrition                <fct> Yes, No, Yes, No, No, No, No, No, No, No, No,~
## $ BusinessTravel           <fct> Travel_Rarely, Travel_Frequently, Travel_Rare~
## $ DailyRate                <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,~
## $ Department               <fct> Sales, Research & Development, Research & Dev~
## $ DistanceFromHome         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, ~
## $ Education                <fct> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, ~
## $ EducationField           <fct> Life Sciences, Life Sciences, Other, Life Sci~
## $ EmployeeNumber           <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,~
## $ EnvironmentSatisfaction  <fct> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, ~
## $ Gender                   <fct> Female, Male, Male, Female, Male, Male, Femal~
## $ HourlyRate               <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4~
## $ JobInvolvement           <fct> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, ~
## $ JobLevel                 <fct> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, ~
## $ Role                     <fct> Sales, ESP, AESP, ESP, AESP, AESP, AESP, AESP~
## $ JobSatisfaction          <fct> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, ~
## $ MaritalStatus            <fct> Single, Married, Single, Married, Married, Si~
## $ MonthlyIncome            <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269~
## $ MonthlyRate              <int> 19479, 24907, 2396, 23159, 16632, 11864, 9964~
## $ NumCompaniesWorked       <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, ~
## $ OverTime                 <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~
## $ PercentSalaryHike        <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1~
## $ PerformanceRating        <fct> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, ~
## $ RelationshipSatisfaction <fct> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, ~
## $ StockOptionLevel         <fct> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, ~
## $ TotalWorkingYears        <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3~
## $ TrainingTimesLastYear    <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, ~
## $ WorkLifeBalance          <fct> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, ~
## $ YearsAtCompany           <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,~
## $ YearsInCurrentRole       <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, ~
## $ YearsSinceLastPromotion  <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, ~
## $ YearsWithCurrManager     <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, ~
## $ ave_sentiment_pros       <dbl> 0.5839903, 0.5135917, 0.5730397, 0.5135917, 0~
## $ ave_sentiment_cons       <dbl> -0.015564249, 0.002896978, -0.061601871, 0.00~

Predicting Attrition

In this section, we perform the modeling process for predicting attrition at IBM. We shall use inputs from the default IBM dataset together with clustering and supplemental data from the Glassdoor reviews. The prediction modeling methodology is as follows:

Model a baseline logistic regression with original IBM dataset predictors. Use backward selection to identify the best model.
Conduct a variance inflation factor (VIF) test to check for variables with high multicolliearity. Drop selected variables if necessary.
Model a logistic regression model with sentiment scores added into the model. Make an assessment if it improves the prediction accuracy.
Perform clustering with the original IBM dataset before using the clusters as inputs to the model. Make an assessment if it improves the prediction accuracy.
Incorporate both sentiment scores and clustering results as inputs to the model. Make and assessment if it improves the prediction accuracy.
Try other prediction models such as trees to see how it performs against logistic regression.

Since we are attempting to predict attrition, this becomes a classification problem (i.e., “will this employee leave?”) and we will use confusion matrices to compute the accuracy, specificity and sensitivity figures.

The main aim of this project is to reduce the attrition at IBM; in our business context, we want to focus on the false negatives (FN) as opposed to the false positives (FP). Why?

FP: Predicting that an employee would leave but he/she did not.
FN: Predicting that an employee would not leave but he/she did.

Under a FP case, we predict that an employee would leave (Attrition=Yes) but he/she did not. This would mean that the company would have taken additional measures to prevent this employee from leaving, leading to a “wastage” of resources. But this might not be a bad thing because in the process of doing so, the company would have forged a stronger relationship with the employee, which might further reduce the chance of the employee from leaving.

However, under a FN case, we predict that an employee would not leave (Attrition=No) but he/she did leave. This is more detrimental to the company because it means that we have failed to take preventive action to prevent attrition, thereby losing valuable talents in IBM. For this reason, the priority is in reducing the FN in our prediction models. Hence, the crucial metric in our business context is the sensitivity figure because we want to reduce FN, which is in the denominator of the equation, as shown below.

\[ Sensitivity = TP/(TP+FN) \] \[ Specificity = TN/(TN+FP) \] \[ Accuracy = (TN+TP)/(TN+TP+FN+FP) \]

Lastly, we will use the area under the curve (AUC) as a measure of the model quality. The higher the AUC, the better the model.

Split Data

# Split data
set.seed(1000)
split <- createDataPartition(dat_hr_final$Attrition, p=0.7, list=F, groups=100)
train <- dat_hr_final[split,]
test <- dat_hr_final[-split,]

Model 1: Logistic Regression with Backward Selection

We model a logistic regression with the original predictors from the IBM dataset first (without sentiment scores and clustering). We used backward selection for feature selection.

model1 <- glm(Attrition~., family="binomial", data=train[,1:32])
summary(model1)

## 
## Call:
## glm(formula = Attrition ~ ., family = "binomial", data = train[, 
##     1:32])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7902  -0.4553  -0.2171  -0.0694   3.4280  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       5.257e+00  1.622e+00   3.241 0.001190 ** 
## Age                              -4.446e-02  1.691e-02  -2.629 0.008573 ** 
## BusinessTravelTravel_Frequently   1.911e+00  5.026e-01   3.803 0.000143 ***
## BusinessTravelTravel_Rarely       1.044e+00  4.620e-01   2.258 0.023916 *  
## DailyRate                        -4.898e-04  2.764e-04  -1.772 0.076406 .  
## DepartmentResearch & Development  1.641e-01  8.497e-01   0.193 0.846837    
## DepartmentSales                  -3.832e-01  1.605e+00  -0.239 0.811333    
## DistanceFromHome                  4.743e-02  1.394e-02   3.401 0.000670 ***
## Education2                        2.398e-01  4.013e-01   0.598 0.550161    
## Education3                        2.035e-01  3.492e-01   0.583 0.560018    
## Education4                        2.311e-01  3.826e-01   0.604 0.545927    
## Education5                       -1.100e+00  9.529e-01  -1.154 0.248549    
## EducationFieldLife Sciences      -1.257e+00  1.015e+00  -1.238 0.215839    
## EducationFieldMarketing          -7.822e-01  1.082e+00  -0.723 0.469793    
## EducationFieldMedical            -1.196e+00  1.008e+00  -1.186 0.235716    
## EducationFieldOther              -1.530e+00  1.095e+00  -1.397 0.162370    
## EducationFieldTechnical Degree   -6.066e-01  1.057e+00  -0.574 0.565971    
## EmployeeNumber                   -1.163e-04  1.898e-04  -0.613 0.540157    
## EnvironmentSatisfaction2         -9.162e-01  3.515e-01  -2.606 0.009149 ** 
## EnvironmentSatisfaction3         -1.014e+00  3.215e-01  -3.155 0.001604 ** 
## EnvironmentSatisfaction4         -1.202e+00  3.259e-01  -3.689 0.000225 ***
## GenderMale                        5.812e-01  2.395e-01   2.426 0.015248 *  
## HourlyRate                        1.814e-03  5.550e-03   0.327 0.743727    
## JobInvolvement2                  -1.391e+00  4.674e-01  -2.977 0.002912 ** 
## JobInvolvement3                  -1.665e+00  4.380e-01  -3.802 0.000144 ***
## JobInvolvement4                  -2.145e+00  5.896e-01  -3.637 0.000275 ***
## JobLevel2                        -1.386e+00  4.410e-01  -3.142 0.001678 ** 
## JobLevel3                         2.610e-01  7.951e-01   0.328 0.742689    
## JobLevel4                        -5.122e-01  1.510e+00  -0.339 0.734491    
## JobLevel5                         1.259e+00  1.804e+00   0.698 0.485285    
## RoleCorporate                    -1.283e-01  5.894e-01  -0.218 0.827642    
## RoleDirector                     -3.472e-01  5.809e-01  -0.598 0.550009    
## RoleESP                          -9.264e-01  3.369e-01  -2.750 0.005964 ** 
## RoleManager                       4.274e-01  1.059e+00   0.404 0.686551    
## RoleSales                         1.067e+00  1.493e+00   0.715 0.474751    
## JobSatisfaction2                 -9.525e-01  3.556e-01  -2.678 0.007396 ** 
## JobSatisfaction3                 -8.181e-01  3.004e-01  -2.723 0.006464 ** 
## JobSatisfaction4                 -1.423e+00  3.268e-01  -4.355 1.33e-05 ***
## MaritalStatusMarried              4.239e-01  3.390e-01   1.251 0.211084    
## MaritalStatusSingle               1.163e+00  4.925e-01   2.362 0.018174 *  
## MonthlyIncome                    -1.704e-04  1.052e-04  -1.620 0.105275    
## MonthlyRate                       5.186e-06  1.595e-05   0.325 0.745119    
## NumCompaniesWorked                2.111e-01  4.877e-02   4.328 1.51e-05 ***
## OverTimeYes                       2.123e+00  2.545e-01   8.343  < 2e-16 ***
## PercentSalaryHike                -2.324e-02  4.724e-02  -0.492 0.622818    
## PerformanceRating4                2.530e-01  4.861e-01   0.521 0.602692    
## RelationshipSatisfaction2        -1.105e+00  3.590e-01  -3.077 0.002088 ** 
## RelationshipSatisfaction3        -1.132e+00  3.150e-01  -3.595 0.000324 ***
## RelationshipSatisfaction4        -1.247e+00  3.190e-01  -3.910 9.24e-05 ***
## StockOptionLevel1                -6.250e-01  3.884e-01  -1.609 0.107551    
## StockOptionLevel2                -7.834e-01  5.775e-01  -1.357 0.174904    
## StockOptionLevel3                -4.366e-01  5.941e-01  -0.735 0.462464    
## TotalWorkingYears                -5.732e-02  3.645e-02  -1.573 0.115779    
## TrainingTimesLastYear            -2.158e-01  9.017e-02  -2.393 0.016703 *  
## WorkLifeBalance2                 -8.450e-01  4.564e-01  -1.851 0.064110 .  
## WorkLifeBalance3                 -1.334e+00  4.299e-01  -3.103 0.001914 ** 
## WorkLifeBalance4                 -8.533e-01  5.175e-01  -1.649 0.099176 .  
## YearsAtCompany                    9.188e-02  4.934e-02   1.862 0.062571 .  
## YearsInCurrentRole               -1.395e-01  5.695e-02  -2.450 0.014290 *  
## YearsSinceLastPromotion           1.959e-01  5.370e-02   3.649 0.000264 ***
## YearsWithCurrManager             -1.228e-01  5.584e-02  -2.199 0.027857 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 909.69  on 1029  degrees of freedom
## Residual deviance: 569.35  on  969  degrees of freedom
## AIC: 691.35
## 
## Number of Fisher Scoring iterations: 7

# Backward selection
start_mod <- glm(Attrition~., family="binomial", data=train[,1:32])
empty_mod <- glm(Attrition~1, family="binomial", data=train[,1:32])
full_mod <- glm(Attrition~., family="binomial", data=train[,1:32])
model_bw <- step(start_mod,
                 scope=list(upper=full_mod, lower=empty_mod),
                 direction="backward")

We check the VIF of the logistic regression model and realized that JobLevel has high VIF. We will drop JobLevel from the model and check the VIF once more to ensure that the variables are not multicollinear.

# Check VIF
vif(model_bw)

##                               GVIF Df GVIF^(1/(2*Df))
## Age                       1.854540  1        1.361815
## BusinessTravel            1.233003  2        1.053758
## DailyRate                 1.074445  1        1.036555
## DistanceFromHome          1.153649  1        1.074080
## EnvironmentSatisfaction   1.294510  3        1.043961
## Gender                    1.115951  1        1.056386
## JobInvolvement            1.245723  3        1.037298
## JobLevel                 32.054649  4        1.542540
## Role                      5.852106  5        1.193249
## JobSatisfaction           1.320068  3        1.047368
## MaritalStatus             1.257844  2        1.059026
## MonthlyIncome            14.132514  1        3.759324
## NumCompaniesWorked        1.420018  1        1.191645
## OverTime                  1.350808  1        1.162243
## RelationshipSatisfaction  1.265591  3        1.040037
## TotalWorkingYears         5.092692  1        2.256699
## TrainingTimesLastYear     1.082879  1        1.040615
## WorkLifeBalance           1.246539  3        1.037411
## YearsAtCompany            6.194323  1        2.488840
## YearsInCurrentRole        2.866187  1        1.692982
## YearsSinceLastPromotion   2.332125  1        1.527130
## YearsWithCurrManager      2.784844  1        1.668785

# Drop highest VIF (JobLevel)
model_bw <- glm(Attrition~Age+BusinessTravel+DailyRate+Department+
                  DistanceFromHome+EnvironmentSatisfaction+Gender+
                  JobInvolvement+JobSatisfaction+MonthlyIncome+
                  NumCompaniesWorked+OverTime+PercentSalaryHike+
                  RelationshipSatisfaction+StockOptionLevel+
                  TotalWorkingYears+TrainingTimesLastYear+
                  WorkLifeBalance+YearsSinceLastPromotion+
                  YearsWithCurrManager, family="binomial", data=train[,1:32])

# Check VIF once more
vif(model_bw)

##                              GVIF Df GVIF^(1/(2*Df))
## Age                      1.780608  1        1.334394
## BusinessTravel           1.175488  2        1.041249
## DailyRate                1.057085  1        1.028146
## Department               1.155066  2        1.036697
## DistanceFromHome         1.107330  1        1.052297
## EnvironmentSatisfaction  1.186843  3        1.028961
## Gender                   1.072582  1        1.035656
## JobInvolvement           1.192317  3        1.029750
## JobSatisfaction          1.213064  3        1.032715
## MonthlyIncome            2.039577  1        1.428138
## NumCompaniesWorked       1.316823  1        1.147529
## OverTime                 1.224833  1        1.106722
## PercentSalaryHike        1.070351  1        1.034578
## RelationshipSatisfaction 1.216061  3        1.033140
## StockOptionLevel         1.258684  3        1.039089
## TotalWorkingYears        3.603873  1        1.898387
## TrainingTimesLastYear    1.080614  1        1.039526
## WorkLifeBalance          1.177136  3        1.027554
## YearsSinceLastPromotion  1.706395  1        1.306290
## YearsWithCurrManager     1.770733  1        1.330689

Now that the VIF is good, we performed the predictions with with a 0.5 threshold.

# Predict
pred <- predict(model_bw, newdata=test[,1:32], "response")
ct <- table(attrition = test$Attrition,
            predictions = as.integer(pred>0.5))
ct

##          predictions
## attrition   0   1
##       No  355  14
##       Yes  30  41

accuracy_df <- data.frame(accuracy = sum(ct[1,1],ct[2,2])/nrow(test),
                          specificity = ct[1,1]/sum(ct[1,1],ct[1,2]),
                          sensitivity = ct[2,2]/sum(ct[2,1],ct[2,2]))
accuracy_df

##   accuracy specificity sensitivity
## 1      0.9   0.9620596   0.5774648

Next, we plot the AUC to determine the optimal threshold value. From the chart, it seems that a threshold of 0.2 is good.

# Plot AUC
ROCRpred <- prediction(pred, test$Attrition)
ROCRperf <- performance(ROCRpred, "acc", "fnr")
plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))

as.numeric(performance(ROCRpred,"auc")@y.values)

## [1] 0.8940799

# Use 0.2 as threshold
ct_0.2 <- table(attrition = test$Attrition,
                predictions = as.integer(pred>0.2))
ct_0.2

##          predictions
## attrition   0   1
##       No  299  70
##       Yes   9  62

accuracy_df_0.2 <- data.frame(accuracy = sum(ct_0.2[1,1],ct_0.2[2,2])/nrow(test),
                              specificity = ct_0.2[1,1]/sum(ct_0.2[1,1],ct_0.2[1,2]),
                              sensitivity = ct_0.2[2,2]/sum(ct_0.2[2,1],ct_0.2[2,2]))
accuracy_df_0.2

##    accuracy specificity sensitivity
## 1 0.8204545   0.8102981   0.8732394

Model 2: Logistic Regression with Sentiment Scores (Pros)

Next, we perform the same logistic regression as Model #1, but we add the sentiment scores for the Pros as a predictor into the model. The main aim is to observe if there is an improvement to the prediction. We check the VIF again just to ensure that there is no multicollinearity.

# Logistic regression
model_senti <- glm(Attrition~Age+BusinessTravel+DailyRate+Department+
                  DistanceFromHome+EnvironmentSatisfaction+Gender+
                  JobInvolvement+JobSatisfaction+MonthlyIncome+
                  NumCompaniesWorked+OverTime+PercentSalaryHike+
                  RelationshipSatisfaction+StockOptionLevel+
                  TotalWorkingYears+TrainingTimesLastYear+
                  WorkLifeBalance+YearsSinceLastPromotion+
                  YearsWithCurrManager+ave_sentiment_pros, 
                  family="binomial", data=train)

# Check the VIF
vif(model_senti)

##                              GVIF Df GVIF^(1/(2*Df))
## Age                      1.794026  1        1.339412
## BusinessTravel           1.179690  2        1.042178
## DailyRate                1.058433  1        1.028802
## Department               1.839307  2        1.164564
## DistanceFromHome         1.108061  1        1.052645
## EnvironmentSatisfaction  1.193185  3        1.029875
## Gender                   1.078229  1        1.038378
## JobInvolvement           1.191892  3        1.029689
## JobSatisfaction          1.229669  3        1.035058
## MonthlyIncome            2.103336  1        1.450288
## NumCompaniesWorked       1.315024  1        1.146745
## OverTime                 1.257477  1        1.121373
## PercentSalaryHike        1.073632  1        1.036162
## RelationshipSatisfaction 1.238651  3        1.036314
## StockOptionLevel         1.263413  3        1.039739
## TotalWorkingYears        3.634832  1        1.906524
## TrainingTimesLastYear    1.084505  1        1.041396
## WorkLifeBalance          1.196203  3        1.030309
## YearsSinceLastPromotion  1.745552  1        1.321193
## YearsWithCurrManager     1.794884  1        1.339733
## ave_sentiment_pros       1.803946  1        1.343111

We now predict the results with a threshold of 0.5.

# Predict
pred_senti <- predict(model_senti, newdata=test, "response")
ct_senti <- table(attrition = test$Attrition,
                  predictions = as.integer(pred_senti>0.5))
ct_senti

##          predictions
## attrition   0   1
##       No  354  15
##       Yes  30  41

accuracy_senti_df <- data.frame(accuracy = sum(ct_senti[1,1],ct_senti[2,2])/nrow(test),
                                specificity = ct_senti[1,1]/sum(ct_senti[1,1],ct_senti[1,2]),
                                sensitivity = ct_senti[2,2]/sum(ct_senti[2,1],ct_senti[2,2]))
accuracy_senti_df

##    accuracy specificity sensitivity
## 1 0.8977273   0.9593496   0.5774648

We plot the AUC curve to determine the optimal threshold value. Based on the plot, the optimal threshold value should be 0.24.

# Plot AUC
ROCRpred_senti <- prediction(pred_senti, test$Attrition)
ROCRperf_senti <- performance(ROCRpred_senti, "acc", "fnr")
plot(ROCRperf_senti, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))

as.numeric(performance(ROCRpred_senti,"auc")@y.values)

## [1] 0.8982404

# Use 0.24 as threshold
ct_senti_0.24 <- table(attrition = test$Attrition,
                       predictions = as.integer(pred_senti>0.24))
ct_senti_0.24

##          predictions
## attrition   0   1
##       No  316  53
##       Yes   9  62

accuracy_senti_0.24 <- data.frame(
  accuracy = sum(ct_senti_0.24[1,1],ct_senti_0.24[2,2])/nrow(test),
  specificity = ct_senti_0.24[1,1]/sum(ct_senti_0.24[1,1],ct_senti_0.24[1,2]),
  sensitivity = ct_senti_0.24[2,2]/sum(ct_senti_0.24[2,1],ct_senti_0.24[2,2]))
accuracy_senti_0.24

##    accuracy specificity sensitivity
## 1 0.8590909   0.8563686   0.8732394

Model 3: Logistic Regression with Sentiment Scores (Cons)

Next, we perform the same logistic regression as Model #1, but we add the sentiment scores for the Cons as a predictor into the model. The main aim is to observe if there is an improvement to the prediction. We check the VIF again just to ensure that there is no multicollinearity.

# Logistic regression
model_senti2 <- glm(Attrition~Age+BusinessTravel+DailyRate+Department+
                     DistanceFromHome+EnvironmentSatisfaction+Gender+
                     JobInvolvement+JobSatisfaction+MonthlyIncome+
                     NumCompaniesWorked+OverTime+PercentSalaryHike+
                     RelationshipSatisfaction+StockOptionLevel+
                     TotalWorkingYears+TrainingTimesLastYear+
                     WorkLifeBalance+YearsSinceLastPromotion+
                     YearsWithCurrManager+ave_sentiment_cons, 
                   family="binomial", data=train)

# Check the VIF
vif(model_senti2)

##                              GVIF Df GVIF^(1/(2*Df))
## Age                      1.788089  1        1.337194
## BusinessTravel           1.175845  2        1.041328
## DailyRate                1.060567  1        1.029838
## Department               1.414878  2        1.090636
## DistanceFromHome         1.108395  1        1.052804
## EnvironmentSatisfaction  1.192065  3        1.029714
## Gender                   1.079315  1        1.038901
## JobInvolvement           1.195493  3        1.030207
## JobSatisfaction          1.227236  3        1.034716
## MonthlyIncome            2.086244  1        1.444384
## NumCompaniesWorked       1.316093  1        1.147211
## OverTime                 1.267715  1        1.125929
## PercentSalaryHike        1.074498  1        1.036580
## RelationshipSatisfaction 1.240934  3        1.036632
## StockOptionLevel         1.261519  3        1.039479
## TotalWorkingYears        3.654708  1        1.911729
## TrainingTimesLastYear    1.086453  1        1.042330
## WorkLifeBalance          1.191752  3        1.029669
## YearsSinceLastPromotion  1.736311  1        1.317691
## YearsWithCurrManager     1.790538  1        1.338110
## ave_sentiment_cons       1.325863  1        1.151461

We now predict the results with a threshold of 0.5.

# Predict
pred_senti2 <- predict(model_senti2, newdata=test, "response")
ct_senti2 <- table(attrition = test$Attrition,
                   predictions = as.integer(pred_senti2>0.5))
ct_senti2

##          predictions
## attrition   0   1
##       No  355  14
##       Yes  31  40

accuracy_senti2_df <- data.frame(accuracy = sum(ct_senti2[1,1],ct_senti2[2,2])/nrow(test),
                                 specificity = ct_senti2[1,1]/sum(ct_senti2[1,1],ct_senti2[1,2]),
                                 sensitivity = ct_senti2[2,2]/sum(ct_senti2[2,1],ct_senti2[2,2]))
accuracy_senti2_df

##    accuracy specificity sensitivity
## 1 0.8977273   0.9620596   0.5633803

We plot the AUC curve to determine the optimal threshold value. Based on the plot, the optimal threshold is 0.245.

# Plot AUC
ROCRpred_senti2 <- prediction(pred_senti2, test$Attrition)
ROCRperf_senti2 <- performance(ROCRpred_senti2, "acc", "fnr")
plot(ROCRperf_senti2, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))

as.numeric(performance(ROCRpred_senti2,"auc")@y.values)

## [1] 0.9003015

# Use 0.245 as threshold
ct_senti2_0.245 <- table(attrition = test$Attrition,
                         predictions = as.integer(pred_senti2>0.245))
ct_senti2_0.245

##          predictions
## attrition   0   1
##       No  317  52
##       Yes   9  62

accuracy_senti2_0.245 <- data.frame(
  accuracy = sum(ct_senti2_0.245[1,1],ct_senti2_0.245[2,2])/nrow(test),
  specificity = ct_senti2_0.245[1,1]/sum(ct_senti2_0.245[1,1],ct_senti2_0.245[1,2]),
  sensitivity = ct_senti2_0.245[2,2]/sum(ct_senti2_0.245[2,1],ct_senti2_0.245[2,2]))
accuracy_senti2_0.245

##    accuracy specificity sensitivity
## 1 0.8613636   0.8590786   0.8732394

Model 4: Logistic Regression with Clustering

Next, before we perform any modeling, we will perform clustering to the IBM dataset first. We do so by using the correlation funnel to determine the variables that are highly correlated with Attrition in order to perform the clustering analysis.

# Binarize continuous variables and one hot encode factors
corr_tbl  <- dat_hr %>%
  select(-EmployeeNumber) %>%
  binarize(n_bins=5, thresh_infreq=0.01, name_infreq="OTHER", one_hot=T) %>%
  correlate(Attrition__Yes)

# Plot correlation funnel
corr_tbl %>%
  plot_correlation_funnel() %>%
  ggplotly()

As can be seen from the correlation funnel, the following variables were highly correlated with the outcome variable: Overtime, JobLevel, MonthlyIncome, YearsAtCompany, StockOptionLevel, YearsWithCurrManager, TotalWorkingYears, MaritalStatus, Age, YearsInCurrentRole, JobRole, EnvironmentSatisfaction, JobInvolvement and BusinessTravel. Therefore, we use them for the clustering analysis. We add in EmployeeNumber to the variable selection as well as a means to index the data.

# Select top correlated variables for analysis
var_selection <- c("EmployeeNumber", "Attrition", "OverTime", "JobLevel", 
                   "MonthlyIncome", "YearsAtCompany", "StockOptionLevel",
                   "YearsWithCurrManager", "TotalWorkingYears", "MaritalStatus",
                   "Age", "YearsInCurrentRole", "JobRole", "EnvironmentSatisfaction",
                   "JobInvolvement", "BusinessTravel")

dat_hr_subset <- dat_hr %>%
  select(one_of(var_selection)) %>%
  mutate_if(is.character, as.factor)

Finally, we perform the clustering analysis to determine how similar each employees are to each other. If our variables are of numeric (continuous) nature, we could use the Euclidean Distance as a way for computation. However, our dataset consists of both continuous and ordinal types, which would require other techniques for analysis. Based on some research, we discovered that the Gower Distance is able to handle both data types, therefore, we use it for the analysis. Here we compute a distance matrix that compares each employee to every other employee in the data.

# Use Gower distance to handle both numeric and factor types
gower_dist <- daisy(dat_hr_subset[,2:13], metric="gower")
gower_mat <- as.matrix(gower_dist)

Just for a sanity check, we view the most similar employees based on the distance score to ensure that the Gower Distance is working fine. Based on the results, Employee #1272 and Employee #371 are most similar to each other because they match on most of the variables – they had similar monthly incomes, worked the same number of years at IBM, etc.

# View most similar employees to test Gower distance
similar_tbl <- dat_hr_subset[which(gower_mat==min(gower_mat[gower_mat!=min(gower_mat)]), 
                                   arr.ind=T)[1,],]
similar_tbl

##      EmployeeNumber Attrition OverTime JobLevel MonthlyIncome YearsAtCompany
## 1272           1780       Yes       No        1          2679              1
## 371             494       Yes       No        1          2716              1
##      StockOptionLevel YearsWithCurrManager TotalWorkingYears MaritalStatus Age
## 1272                0                    0                 1        Single  21
## 371                 0                    0                 1        Single  21
##      YearsInCurrentRole              JobRole EnvironmentSatisfaction
## 1272                  0 Sales Representative                       2
## 371                   0 Sales Representative                       3
##      JobInvolvement BusinessTravel
## 1272              3  Travel_Rarely
## 371               4  Travel_Rarely

To determine the optimal number of clusters to segment the data, we use the silhouette plot. As shown below, we select the 7-cluster solution due to the highest average silhouette width at 7.

# Choose number of clusters using silhouette plot
silhouette_width <- sapply(2:10, FUN=function(x){
  pam(x=gower_dist, k=x)$silinfo$avg.width
})

ggplot(data=data.frame(cluster = 2:10,silhouette_width),aes(x=cluster,y=silhouette_width))+
  geom_line(col='steelblue',size=1.2)+
  geom_point()+
  scale_x_continuous(breaks=seq(2,10,1)) # choose 7 clusters soln

For the analysis, we use the Partitioning Around Mediods (PAM) method because it is less sensitive to outliers in the data (e.g., high salary income) and it provides an exemplary case for each cluster (i.e., medoid), which would be very helpful for us to personify the type of employees that have a high chance of leaving IBM.

Now we run the cluster analysis with 7 clusters and join the cluster membership with the IBM dataset.

# Perform clustering using PAM
k <- 7
pam_fit <- pam(gower_dist, diss=T, k)

# Add cluster assignments to each employee
dat_hr_clusters <- dat_hr_subset %>%
  mutate(cluster=pam_fit$clustering)

We now look at each medoid to understand the demography of each cluster. This shows the exemplary employee representing each cluster. It is obvious that Employee #991 is the model type of employee that would best represent those who left IBM (this is the only row that has Attrition=Yes).

# Look at medoids to understand each cluster
dat_hr_clusters[pam_fit$medoids,]

##      EmployeeNumber Attrition OverTime JobLevel MonthlyIncome YearsAtCompany
## 463             621        No       No        2          5337             10
## 1381           1945        No       No        2          5561              5
## 710             991       Yes      Yes        1          2321              3
## 69               88        No       No        1          2194              3
## 1002           1411        No       No        1          3629              3
## 118             154        No       No        3          9738              9
## 700             976        No       No        4         17099              9
##      StockOptionLevel YearsWithCurrManager TotalWorkingYears MaritalStatus Age
## 463                 0                    7                10        Single  34
## 1381                1                    4                 6       Married  35
## 710                 0                    2                 4        Single  31
## 69                  1                    2                 5       Married  35
## 1002                0                    2                 8        Single  37
## 118                 1                    8                10       Married  36
## 700                 1                    8                26       Married  52
##      YearsInCurrentRole               JobRole EnvironmentSatisfaction
## 463                   7       Sales Executive                       4
## 1381                  3       Sales Executive                       2
## 710                   2    Research Scientist                       3
## 69                    2    Research Scientist                       2
## 1002                  2 Laboratory Technician                       1
## 118                   7       Sales Executive                       2
## 700                   8               Manager                       4
##      JobInvolvement    BusinessTravel cluster
## 463               4     Travel_Rarely       1
## 1381              3     Travel_Rarely       2
## 710               2        Non-Travel       3
## 69                3 Travel_Frequently       4
## 1002              3     Travel_Rarely       5
## 118               3 Travel_Frequently       6
## 700               3     Travel_Rarely       7

To better understand the attrition in the population, we computed the attrition in each cluster and how much each cluster captures overall attrition in the entire population. As seen below, approximately 88% of employees in Cluster #3 left IBM and that represents about 52% of all attrition in the entire population.

# Group by attrition rate
attrition_rate_tbl <- dat_hr_clusters %>%
  select(cluster, Attrition) %>%
  mutate(attrition_num = fct_relevel(Attrition, "No", "Yes") %>% as.numeric()-1) %>%
  group_by(cluster) %>%
  summarise(
    Cluster_Turnover_Rate = (sum(attrition_num)/length(attrition_num))*100,
    Turnover_Count = sum(attrition_num),
    Cluster_Size = length(attrition_num)
  ) %>%
  ungroup() %>%
  mutate(Population_Turnover_Rate = (Turnover_Count/sum(Turnover_Count))*100)

attrition_rate_tbl

## # A tibble: 7 x 5
##   cluster Cluster_Turnover_Ra~ Turnover_Count Cluster_Size Population_Turnover_~
##     <int>                <dbl>          <dbl>        <int>                 <dbl>
## 1       1                10.9              27          247                 11.4 
## 2       2                 6.37             20          314                  8.44
## 3       3                87.9             123          140                 51.9 
## 4       4                 7.39             19          257                  8.02
## 5       5                11.3              24          213                 10.1 
## 6       6                12.6              19          151                  8.02
## 7       7                 3.38              5          148                  2.11

We are now ready to use the cluster membership results as an input to the prediction model. Here we append the results to the IBM data, split the data and perform the logistic regression based on the same inputs as Model #1, but with the added clustering results.

# Merge clusters with IBM HR data
dat_hr_full <- dat_hr_final %>%
  mutate(cluster=pam_fit$clustering)

# Split data
set.seed(1000)
split2 <- createDataPartition(dat_hr_full$Attrition, p=0.7, list=F, groups=100)
train2 <- dat_hr_full[split2,]
test2 <- dat_hr_full[-split2,]

# Logistic regression
model_clust <- glm(Attrition~Age+BusinessTravel+DailyRate+Department+
                     DistanceFromHome+EnvironmentSatisfaction+Gender+
                     JobInvolvement+JobSatisfaction+MonthlyIncome+
                     NumCompaniesWorked+OverTime+PercentSalaryHike+
                     RelationshipSatisfaction+StockOptionLevel+
                     TotalWorkingYears+TrainingTimesLastYear+
                     WorkLifeBalance+YearsSinceLastPromotion+
                     YearsWithCurrManager+cluster, 
                   family="binomial", data=train2)

We now predict the results with a threshold of 0.5.

# Predict
pred_clust <- predict(model_clust, newdata=test2, "response")
ct_clust <- table(attrition = test2$Attrition,
                  predictions = as.integer(pred_clust>0.5))
ct_clust

##          predictions
## attrition   0   1
##       No  356  13
##       Yes  33  38

accuracy_clust_df <- data.frame(accuracy = sum(ct_clust[1,1],ct_clust[2,2])/nrow(test2),
                                specificity = ct_clust[1,1]/sum(ct_clust[1,1],ct_clust[1,2]),
                                sensitivity = ct_clust[2,2]/sum(ct_clust[2,1],ct_clust[2,2]))
accuracy_clust_df

##    accuracy specificity sensitivity
## 1 0.8954545   0.9647696   0.5352113

We plot the AUC curve to determine the optimal threshold value. Based on the plot, the optimal threshold is 0.22.

# Plot AUC
ROCRpred_clust <- prediction(pred_clust, test2$Attrition)
ROCRperf_clust <- performance(ROCRpred_clust, "acc", "fnr")
plot(ROCRperf_clust, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))

as.numeric(performance(ROCRpred_clust,"auc")@y.values)

## [1] 0.8963319

# Use 0.22 as threshold
ct_clust_0.22 <- table(attrition = test2$Attrition,
                       predictions = as.integer(pred_clust>0.22))
ct_clust_0.22

##          predictions
## attrition   0   1
##       No  307  62
##       Yes   9  62

accuracy_clust_0.22 <- data.frame(
  accuracy = sum(ct_clust_0.22[1,1],ct_clust_0.22[2,2])/nrow(test2),
  specificity = ct_clust_0.22[1,1]/sum(ct_clust_0.22[1,1],ct_clust_0.22[1,2]),
  sensitivity = ct_clust_0.22[2,2]/sum(ct_clust_0.22[2,1],ct_clust_0.22[2,2]))
accuracy_clust_0.22

##    accuracy specificity sensitivity
## 1 0.8386364   0.8319783   0.8732394

Model 5: Logistic Regression with both Sentiment Scores and Clustering

Since we have already modeled a few different logistic regression models using sentiment scores and cluster memberships as inputs, it would be interesting to have a model with both sentiment scores and cluster memberships as inputs to the same model and see if it improves the prediction accuracy. We shall use the sentiment scores for Cons (instead of Pros) in this model since it had a better AUC (compare Model #2 and Model #3).

# Logistic regression
model_clust_senti <- glm(Attrition~Age+BusinessTravel+DailyRate+Department+
                     DistanceFromHome+EnvironmentSatisfaction+Gender+
                     JobInvolvement+JobSatisfaction+MonthlyIncome+
                     NumCompaniesWorked+OverTime+PercentSalaryHike+
                     RelationshipSatisfaction+StockOptionLevel+
                     TotalWorkingYears+TrainingTimesLastYear+
                     WorkLifeBalance+YearsSinceLastPromotion+
                     YearsWithCurrManager+cluster+ave_sentiment_cons, 
                   family="binomial", data=train2)

We now predict the results with a threshold of 0.5.

# Predict
pred_clust_senti <- predict(model_clust_senti, newdata=test2, "response")
ct_clust_senti <- table(attrition = test2$Attrition,
                        predictions = as.integer(pred_clust_senti>0.5))
ct_clust_senti

##          predictions
## attrition   0   1
##       No  355  14
##       Yes  31  40

accuracy_clust_senti_df <- data.frame(
  accuracy = sum(ct_clust_senti[1,1],ct_clust_senti[2,2])/nrow(test2),
  specificity = ct_clust_senti[1,1]/sum(ct_clust_senti[1,1],ct_clust_senti[1,2]),
  sensitivity = ct_clust_senti[2,2]/sum(ct_clust_senti[2,1],ct_clust_senti[2,2]))
accuracy_clust_senti_df

##    accuracy specificity sensitivity
## 1 0.8977273   0.9620596   0.5633803

We plot the AUC curve to determine the optimal threshold value. Based on the plot, the optimal threshold is 0.26.

# Plot AUC
ROCRpred_clust_senti <- prediction(pred_clust_senti, test2$Attrition)
ROCRperf_clust_senti <- performance(ROCRpred_clust_senti, "acc", "fnr")
plot(ROCRperf_clust_senti, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))

as.numeric(performance(ROCRpred_clust_senti,"auc")@y.values)

## [1] 0.9015611

# Use 0.26 as threshold
ct_clust_senti_0.26 <- table(attrition = test2$Attrition,
                    predictions = as.integer(pred_clust_senti>0.26))
ct_clust_senti_0.26

##          predictions
## attrition   0   1
##       No  321  48
##       Yes   9  62

accuracy_clust_senti_0.26 <- data.frame(
  accuracy = sum(ct_clust_senti_0.26[1,1],ct_clust_senti_0.26[2,2])/nrow(test2),
  specificity = ct_clust_senti_0.26[1,1]/sum(ct_clust_senti_0.26[1,1],ct_clust_senti_0.26[1,2]),
  sensitivity = ct_clust_senti_0.26[2,2]/sum(ct_clust_senti_0.26[2,1],ct_clust_senti_0.26[2,2]))
accuracy_clust_senti_0.26

##    accuracy specificity sensitivity
## 1 0.8704545   0.8699187   0.8732394

Model 6: Trees with 10-Fold CV

So far we have only tried logistic regression for the prediction model. It would be worthwhile to try the other machine learning models to see if there can be improvements to the model AUC. In this step, we attempted the decision trees using 10-fold cross validation (CV) and we shall compare the results with the baseline model in Model #1 where we used the default variables from the dat_hr dataset (we will not include the sentiment scores and cluster memberships into the modeling). This comparison should give us an indication as to which modeling technique is better.

# Model using CV
trControl = trainControl(method="cv",number=10) #10-fold cross validation
tuneGrid = expand.grid(.cp=seq(0,0.1,0.001)) 

set.seed(777)
trainCV = train(Attrition~., 
                data=train,
                method="rpart", 
                trControl=trControl,
                tuneGrid=tuneGrid)

trainCV$bestTune

##      cp
## 21 0.02

We shall use a cp value of 0.02.

tree_CV <- rpart(Attrition~.,
               data=train,
               method="class",
               control=rpart.control(cp=trainCV$bestTune))

# Predict
pred_CV = predict(tree_CV, newdata=test, type='class')
ct_tree = table(attrition = test$Attrition, 
                predictions = pred_CV)
ct_tree

##          predictions
## attrition  No Yes
##       No  357  12
##       Yes  54  17

accuracy_tree_df <- data.frame(
  accuracy = sum(ct_tree[1,1],ct_tree[2,2])/nrow(test),
  specificity = ct_tree[1,1]/sum(ct_tree[1,1],ct_tree[1,2]),
  sensitivity = ct_tree[2,2]/sum(ct_tree[2,1],ct_tree[2,2]))
accuracy_tree_df

##   accuracy specificity sensitivity
## 1     0.85   0.9674797   0.2394366

# Plot AUC
pred_tree_auc <- predict(tree_CV, newdata=test, type="prob")
ROCRpred_tree <- prediction(pred_tree_auc[,2], test$Attrition)
ROCRperf_tree <- performance(ROCRpred_tree, "acc", "fnr")
plot(ROCRperf_tree, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))

as.numeric(performance(ROCRpred_tree,"auc")@y.values)

## [1] 0.6325814

Summary of Results

This is a summary of all the prediction models that were done. Model 5 had the highest AUC and the best sensitivity and accuracy figures. This was the model that incorporated both sentiment scores (cons) and cluster memberships as predictors, which resulted in improvements to the model AUC, accuracy and specificity figures. Our main aim is to maximize sensitivity because we want a low FN rate due to our business context. Model 5 performs the best in all parameters that we are concerned about. Therefore we shall use that model for interpretation of effects and discussion.

Although the decision tree model performed the best in specificity, it was outperformed in sensitivity and AUC (these factors were far more important to our business context). Hence it is clear that we will not be using the tree model.

##   model               description       auc  accuracy specificity sensitivity
## 1     1     logmod with bw select 0.8940799 0.8204545   0.8102981   0.8732394
## 2     2  logmod with senti (pros) 0.8982404 0.8590909   0.8563686   0.8732394
## 3     3  logmod with senti (cons) 0.9003015 0.8613636   0.8590786   0.8732394
## 4     4         logmod with clust 0.8963319 0.8386364   0.8319783   0.8732394
## 5     5 logmod with senti & clust 0.9015611 0.8704545   0.8699187   0.8732394
## 6     6     trees with 10-fold cv 0.6325814 0.8500000   0.9674797   0.2394366

Discussion

In this section, we shall discuss the analysis findings in relation to the three (3) research questions outlined in this project.

Research Qn 1

What are the key driving factors influencing attrition the most at IBM?

We shall use our best model—which happens to be the logistic regression model with sentiment scores and cluster memberships—to understand the key factors that influence attrition at IBM, knowing which would allow us to create specific recommendations and hypotheses to improve the retention rates (e.g., a 5% increase in monthly salary would reduce attrition by 10%). Note: the benefit of using logistic regression is that it is easy to interpret for communication to stakeholders.

As shown in the model summary below, for instance, it can be inferred that working overtime would result in the highest increase in likelihood towards attrition as compared to the other variables (see coefficient of OverTimeYes).

summary(model_clust_senti)

## 
## Call:
## glm(formula = Attrition ~ Age + BusinessTravel + DailyRate + 
##     Department + DistanceFromHome + EnvironmentSatisfaction + 
##     Gender + JobInvolvement + JobSatisfaction + MonthlyIncome + 
##     NumCompaniesWorked + OverTime + PercentSalaryHike + RelationshipSatisfaction + 
##     StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + 
##     WorkLifeBalance + YearsSinceLastPromotion + YearsWithCurrManager + 
##     cluster + ave_sentiment_cons, family = "binomial", data = train2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6336  -0.5224  -0.2700  -0.1051   3.3038  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       4.922e+00  1.225e+00   4.019 5.85e-05 ***
## Age                              -4.541e-02  1.573e-02  -2.887 0.003893 ** 
## BusinessTravelTravel_Frequently   1.648e+00  4.483e-01   3.675 0.000238 ***
## BusinessTravelTravel_Rarely       7.686e-01  4.118e-01   1.866 0.061985 .  
## DailyRate                        -3.994e-04  2.571e-04  -1.553 0.120321    
## DepartmentResearch & Development -1.080e+00  5.490e-01  -1.968 0.049069 *  
## DepartmentSales                  -4.019e-01  5.588e-01  -0.719 0.471974    
## DistanceFromHome                  3.649e-02  1.285e-02   2.839 0.004530 ** 
## EnvironmentSatisfaction2         -1.041e+00  3.239e-01  -3.213 0.001315 ** 
## EnvironmentSatisfaction3         -1.012e+00  2.943e-01  -3.437 0.000589 ***
## EnvironmentSatisfaction4         -1.120e+00  2.972e-01  -3.768 0.000164 ***
## GenderMale                        4.865e-01  2.197e-01   2.215 0.026770 *  
## JobInvolvement2                  -1.113e+00  4.278e-01  -2.601 0.009298 ** 
## JobInvolvement3                  -1.383e+00  4.011e-01  -3.448 0.000564 ***
## JobInvolvement4                  -1.938e+00  5.501e-01  -3.523 0.000426 ***
## JobSatisfaction2                 -6.460e-01  3.254e-01  -1.985 0.047142 *  
## JobSatisfaction3                 -6.544e-01  2.795e-01  -2.341 0.019220 *  
## JobSatisfaction4                 -1.250e+00  3.008e-01  -4.156 3.23e-05 ***
## MonthlyIncome                    -8.597e-05  4.379e-05  -1.963 0.049621 *  
## NumCompaniesWorked                1.677e-01  4.419e-02   3.795 0.000148 ***
## OverTimeYes                       1.881e+00  2.334e-01   8.058 7.74e-16 ***
## PercentSalaryHike                -1.076e-02  2.998e-02  -0.359 0.719691    
## RelationshipSatisfaction2        -9.957e-01  3.282e-01  -3.034 0.002414 ** 
## RelationshipSatisfaction3        -9.384e-01  2.903e-01  -3.233 0.001226 ** 
## RelationshipSatisfaction4        -1.186e+00  3.019e-01  -3.927 8.59e-05 ***
## StockOptionLevel1                -1.172e+00  2.368e-01  -4.949 7.45e-07 ***
## StockOptionLevel2                -1.609e+00  4.692e-01  -3.430 0.000603 ***
## StockOptionLevel3                -9.239e-01  4.676e-01  -1.976 0.048156 *  
## TotalWorkingYears                -5.515e-02  3.212e-02  -1.717 0.085977 .  
## TrainingTimesLastYear            -2.132e-01  8.381e-02  -2.543 0.010978 *  
## WorkLifeBalance2                 -7.111e-01  4.124e-01  -1.724 0.084631 .  
## WorkLifeBalance3                 -1.245e+00  3.903e-01  -3.190 0.001423 ** 
## WorkLifeBalance4                 -7.878e-01  4.734e-01  -1.664 0.096045 .  
## YearsSinceLastPromotion           1.974e-01  4.491e-02   4.395 1.11e-05 ***
## YearsWithCurrManager             -1.217e-01  4.432e-02  -2.747 0.006020 ** 
## cluster                           5.927e-02  7.038e-02   0.842 0.399688    
## ave_sentiment_cons               -8.335e+00  4.096e+00  -2.035 0.041891 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 909.69  on 1029  degrees of freedom
## Residual deviance: 627.80  on  993  degrees of freedom
## AIC: 701.8
## 
## Number of Fisher Scoring iterations: 6

Specifically, there is a 6.557 times higher likelihood of an attrition when an employee works overtime as compared to not working overtime (while holding other factors constant). Phrased differently, there is a 555.7% increase in likelihood of attrition when an employee works overtime.

# How many times is the likelihood of attrition when an employee works overtime?
exp(summary(model_clust_senti)$coef[21])

## [1] 6.557427

# How much more is the likelihood of attrition when an employee works overtime?
100*(exp(summary(model_clust_senti)$coef[21])-1)

## [1] 555.7427

Working overtime is just one of the factors that affect the likelihood of attrition. We want to explore the other factors that also affect attrition. We use the varImp() function to see how the model ranks the importance of each variable. Not surprisingly, the model ranks OverTimeYes as the one with highest importance because it has the greatest statistical significance relative to other variables.

var_imp <- varImp(model_clust_senti) %>%
  arrange(desc(Overall))

var_imp %>%
  ggplot(aes(x=reorder(rownames(var_imp),Overall), y=Overall)) +
  geom_point(color="blue", size=4, alpha=0.6) +
  geom_segment(aes(x=rownames(var_imp), xend=rownames(var_imp), 
                   y=0, yend=Overall), color="skyblue") +
  xlab("Variable") +
  ylab("Overall Importance") +
  theme_classic() +
  coord_flip()

From the plot, we see that StockOptionLevel1 is the second most important variable in the model, as determined by the low p-value (statistically significant). StockOptionLevel1 has a model coefficient of -1.172, which implies a 0.31 times lower likelihood of leaving IBM (-69%) as compared to not having any stock option at all.

Intuitively speaking, we see that YearsSinceLastPromotion holds a high spot on the variable importance chart with a model coefficient of 0.197, which implies a 1.218 times higher likelihood of leaving IBM for each year that the employee has not been promoted (while holding other factors constant). This makes sense because the longer the employee remains in the same salary grade, he/she may not feel recognized by the organization, leading to a higher chance of leaving.

JobSatisfaction4 (job satisfaction score of 4) holds a high importance in the model as well with a model coefficient of -1.25, which implies a 0.286 times lower likelihood of leaving IBM as compared to having a satisfaction score of 1 (while holding other factors constant) (Note: satisfaction scores range from 1 to 4, with 4 indicating that the staff is the most satisfied with the job). This finding agrees with Ghapanchi & Aurum (2011) and Zhang & Feinzig (2016), where having high job satisfaction and happiness would lead to better retention in a company. The same can be said of RelationshipSatisfaction4 as well.

It is worthy to note that NumCompaniesWorked holds a high spot on the variable importance chart too. The model coefficient of 0.168 implies a 1.183 times higher likelihood of leaving IBM for each company that the employee has worked for in the past (while holding other factors constant). This could suggest the “job-hopping” tendency of people who are not able to commit to a particular job, hence the greater tendency to leave IBM if he/she had many jobs in the past. While that is a plausible hypothesis, it might not be entirely true for employees who belong to the older age category. The reason is because while older employees may have more past jobs as compared to younger workers who just entered the workforce, which should suggest that they have a greater likelihood of leaving IBM, interestingly, older workers had a lower likelihood to leave IBM (see Age coefficient in the model) (the lower likelihood of older workers leaving their jobs is also corroborated by the study by Zhang & Feinzig (2016)). Therefore there is room for further research on potential interaction terms between Age and NumCompaniesWorked for better interpretation of the effects.

Lastly, we see that PercentSalaryHike was ranked the least important/significant in the model. This was an interesting finding because reports by The Dice (2020) and Deloitte (2019) suggest that majority of employees are looking for higher salaries. But this was not reflected in the attrition model. One possible reason could be that salary is a crucial entry requirement into a company, but not a crucial exit criteria. This is also an area for possible further research.

Research Qn 2

Who is likely to leave IBM?

The best prediction model was able to achieve a sensitivity of 87.3% (recall that we want to reduce the FN for our business context) and accuracy of 90.1% on the test set as shown below. This is by far the best model, which included the sentiment scores (cons) and the cluster memberships.

##   model               description       auc  accuracy specificity sensitivity
## 1     5 logmod with senti & clust 0.9015611 0.8704545   0.8699187   0.8732394

For future work, it might be worthwhile to attempt averaging both sentiments scores for Pros and Cons to see if it further improves the prediction accuracy. Other methods such as random forest and boosting could be explored as well. Given time and resource constraints in this project and for the ease of interpretation, we went with logistic regression.

Research Qn 3

What is the employee type that has the highest tendency to leave IBM?

Based on the clustering work, the type of employees that have the highest tendency to leave IBM are the employees that are similar to Employee #991 (Cluster #3). His/her persona is shown in the table below.

dat_hr_clusters[pam_fit$medoids,]

##      EmployeeNumber Attrition OverTime JobLevel MonthlyIncome YearsAtCompany
## 463             621        No       No        2          5337             10
## 1381           1945        No       No        2          5561              5
## 710             991       Yes      Yes        1          2321              3
## 69               88        No       No        1          2194              3
## 1002           1411        No       No        1          3629              3
## 118             154        No       No        3          9738              9
## 700             976        No       No        4         17099              9
##      StockOptionLevel YearsWithCurrManager TotalWorkingYears MaritalStatus Age
## 463                 0                    7                10        Single  34
## 1381                1                    4                 6       Married  35
## 710                 0                    2                 4        Single  31
## 69                  1                    2                 5       Married  35
## 1002                0                    2                 8        Single  37
## 118                 1                    8                10       Married  36
## 700                 1                    8                26       Married  52
##      YearsInCurrentRole               JobRole EnvironmentSatisfaction
## 463                   7       Sales Executive                       4
## 1381                  3       Sales Executive                       2
## 710                   2    Research Scientist                       3
## 69                    2    Research Scientist                       2
## 1002                  2 Laboratory Technician                       1
## 118                   7       Sales Executive                       2
## 700                   8               Manager                       4
##      JobInvolvement    BusinessTravel cluster
## 463               4     Travel_Rarely       1
## 1381              3     Travel_Rarely       2
## 710               2        Non-Travel       3
## 69                3 Travel_Frequently       4
## 1002              3     Travel_Rarely       5
## 118               3 Travel_Frequently       6
## 700               3     Travel_Rarely       7

Cluster #3 accounts for approximately 52% of all attrition in IBM and about 88% of employees in this cluster leave. It is evident that more can be done to improve the retention rates in this cluster. In the next section, we will propose some recommendations to improve the retention rate of those in this cluster.

attrition_rate_tbl

## # A tibble: 7 x 5
##   cluster Cluster_Turnover_Ra~ Turnover_Count Cluster_Size Population_Turnover_~
##     <int>                <dbl>          <dbl>        <int>                 <dbl>
## 1       1                10.9              27          247                 11.4 
## 2       2                 6.37             20          314                  8.44
## 3       3                87.9             123          140                 51.9 
## 4       4                 7.39             19          257                  8.02
## 5       5                11.3              24          213                 10.1 
## 6       6                12.6              19          151                  8.02
## 7       7                 3.38              5          148                  2.11

Recommendations

We shall use the prediction model (Model #5) to make certain recommendations that are amenable to experimentation. Specifically, we will propose recommendations on variables that are within IBM’s control to make adjustments (e.g., salary, training opportunities for employees, etc.). On the other hand, variables that are out of IBM’s control are like gender, age, total working years, etc.; we will not make recommendations on these variables as they are inherent to the employees. Lastly, for the purpose of keeping it measurable and simple, we shall exclude “indirect” variables such as job satisfaction, environment satisfaction and relationship satisfaction as they require more effort to measure and it involves a summation of many other factors that influences the satisfaction scores.

Based on the model, it suggests that the rescission of overtime culture in IBM can have the potential to reduce the likelihood of attrition by 6 times (while holding all other variables constant). For instance, this means that we can have the potential of reducing the attrition rate in cluster group 3 by 6 times. From a turnover count of 123 out of 140 employees in that cluster, we can expect to see a reduction in attrition to about 21 employees only (reduction of a turnover rate from 87.9% to 14.6%).

However, this hypothesis is only true with the assumption that all other variables are held constant and that the only difference is whether the employee worked overtime. To test this hypothesis (and for purposes of conducting a research experiment), it is imperative to keep the other variables constant as much as possible. In a practical setting, this can be achieved by performing a clustering analysis on existing IBM employees as they would be segmented in accordance to how similar they are to each other (most of the other variables would be similar from employee to employee to some extent), thereby controlling the other variables the best we can. Then, with the other variables being more similar for employees within each cluster, it gives us the opportunity to tweak the attrition variable (i.e., whether the employee would need to work overtime or not) – this allows us to create a control and treatment group accordingly. Finally, we should conduct the experiment on the cluster that is most similar to cluster group 3 in order to yield the greatest impact.

In addition to the reviewing of overtime culture in IBM, another recommendation to reduce attrition is to explore giving employees stock options as it can lower the likelihood of turnover by about 69% (while keeping the other variables constant). To test this, we can easily conduct an experiment with a treatment and control group. Briefly, the experiment would involve giving one group of employees stock options while the other group would not have stock options. Similar to the previous recommendation, this experiment can be conducted on employees who are in the cluster that is most similar to cluster group 3.

Limitations

A limitation in our project is that there is imbalance in the distribution of attrition statuses in our data. For instance, our data has more employees who stayed with IBM as compared to those who left. Such an imbalance would likely result in poorer prediction accuracy in our models. To improve the accuracy in our models, we may need to treat the imbalance in our training data with, say, some upsampling techniques such as the ovun.sample() function from the ROSE package. This can be explored in future works.

References

viGlobal. (2018). Tech Industry Battles Cost of High Attrition Rates. viGlobal. https://www.viglobal.com/2018/06/13/tech-industry-battles-highest-attrition-rate-in-the-world-and-its-costly/.

Kochanski, J., & Ledford, G. (2001). “HOW TO KEEP ME”—RETAINING TECHNICAL PROFESSIONALS. Research Technology Management, 44(3), 31-38. Retrieved June 6, 2021, from http://www.jstor.org/stable/24133992.

Alsop, T. (2021). IBM number of employees worldwide from 2000 to 2020. Statista. https://www.statista.com/statistics/265007/number-of-employees-at-ibm-since-2000/

The Build Network. (2014). Try Fixing the Problem Before Replacing It. Inc. https://www.inc.com/the-build-network/turnover-costs.html

Ghapanchi, A. H., Aurum, A. (2011). Antecedents to IT personnel’s intentions to leave: A systematic literature review. Journal of Systems and Software, Volume 84, 238-249. Retrieved February 21, 2021, from https://www.sciencedirect.com/science/article/abs/pii/S0164121210002645?via%3Dihub.

Zhang, H., & Feinzig, S. (2016). Should I stay or should I go? Global insights into employees’ decisions to leave their jobs [White paper]. IBM Smarter Workforce Institute. https://www.ibm.com/downloads/cas/08GZQKL1

The Deloitte Global Millennial Survey: Societal Discord and Technological Transformation Create a “Generation Disrupted”. (2019). Retrieved February 21, 2021, from https://www2.deloitte.com/global/en/pages/about-deloitte/articles/millennialsurvey.html.

The Dice 2020 Tech Salary Report | eBook | Dice Resources. (2020). Retrieved 20 February 2021, from https://techhub.dice.com/Dice-2020-Tech-Salary-Report.html.

Hecker, D. E. (2005). Occupational employment projections to 2014. Retrieved 21 February 2021, from https://www.bls.gov/opub/mlr/2005/11/art5full.pdf.

Lewis, S. (2020). What Is A High Employee Attrition Rate In Tech Industry. DevSkiller - Powerful tool to test developers skills. Retrieved 21 February 2021, from https://devskiller.com/attrition-rate-in-tech/.

Appendix

Description of dat_hr dataset:

Age: Age of employee
Attrition: Whether employee has left IBM (Yes/No)
BusinessTravel: 3 levels: No Travel, Travel Frequently, Travel Rarely
DailyRate: Daily salary rate
Department: 3 levels: HR, Sales, R&D
DistanceFromHome: Distance from work to home
Education: 1=Below College, 2=College, 3=Bachelor, 4=Master, 5=Doctor
EducationField: 6 levels: HR, Life Sciences, Marketing, Medical Sciences, Technical, Others
EmployeeCount: Number of employees
EmployeeNumber: Employee ID
EnvironmentSatisfaction: 1=Low, 2=Medium, 3=High, 4=Very High
Gender: Male/Female
HourlyRate: Hourly salary
JobInvolvement: 1=Low, 2=Medium, 3=High, 4=Very High
JobLevel: Level of job
JobRole: Healthcare Representative, HR, Lab Technician, Manager, Manufacturing Director, Research Director, Research Scientist, Sales Executive, Sales Representative
JobSatisfaction: 1=Low, 2=Medium, 3=High, 4=Very High
MaritalStatus: Single, Married, Divorced
MonthlyIncome: Monthly salary
MonthlyRate: Monthly salary rate
NumCompaniesWorked: Number of companies employee has worked at
Over18: Whether employee is above 18 (Yes/No)
OverTime: Whether employee has worked overtime (Yes/No)
PercentSalaryHike: Percentage increase in salary
PerformanceRating: Performance rating of employee
RelationshipSatisfaction: 1=Low, 2=Medium, 3=High, 4=Very High
StandardHours: Number of hours worked
StockOptionLevel: Stock option level
TotalWorkingYears: Total years of working
TrainingTimesLastYear: Hours spent on training
WorkLifeBalance: 1=Bad, 2=Good, 3=Better, 4=Best
YearsAtCompany: Total years of working at IBM
YearsInCurrentRole: Total years in current role
YearsSinceLastPromotion: Total years since last promotion
YearsWithCurrManager: Total years with current manager

Description of dat_gd dataset:

Ratings: Review ratings of 1 to 5
Current_Former: Whether current of former employee of IBM
Employment_Length: Length of employment at IBM
Title: Title of Glassdoor reviews
Review_ID: Review ID
Review_Date: Date of review
Role: Job role
Location: State of employment in the US
Pros: Text reviews of the pros of working at IBM
Cons: Text reviews of the cons of working at IBM

HR Analytics using Supervised and Unsupervised Machine Learning in R

Tackling Talent Retention at IBM

Marcus Loke, Jisu Baek, Julia Ju, Yuzhe Sun

July 22, 2021

Problem Statement

Literature Review

Research Questions

Data Description

Data Preparation

IBM Attrition Dataset

Glassdoor Scraped Text Dataset

Exploratory Data Analysis

IBM Attrition Dataset

Glassdoor Scraped Text Dataset

Pros

Cons

Sentiment Analysis on Glassdoor Reviews

Renaming Roles in Glassdoor Dataset

Conduct Sentiment Analysis

Join Both Datasets

Renaming Roles in IBM Dataset

Join Sentiment Scores to IBM Dataset

Predicting Attrition

Split Data

Model 1: Logistic Regression with Backward Selection

Model 2: Logistic Regression with Sentiment Scores (Pros)

Model 3: Logistic Regression with Sentiment Scores (Cons)

Model 4: Logistic Regression with Clustering

Model 5: Logistic Regression with both Sentiment Scores and Clustering

Model 6: Trees with 10-Fold CV

Summary of Results

Discussion

Research Qn 1

Research Qn 2

Research Qn 3

Recommendations

Limitations

References

Appendix