Why do so many tech employees choose to leave their companies? According to a 2018 news release, the tech industry was notorious for a high employee turnover rate of 13.2%, compared to other industries such as Media and Entertainment (11.4%), Government/Education (11.2%) and Financial Services (10.8%) (viGlobal, 2018). Given the pervasive turnover within the industry, should HR talent management teams simply accept this culture? Or are there preemptive, data-driven strategies that can be employed to improve staff retention?
High turnover has adverse effects on the business. From a monetary standpoint, the amount a company needs to spend on replacing a staff member can range from one to seven times the employee’s annual salary, owing to many factors including the cost and time to find replacements and the revenue contribution attributed to that employee (Kochanski & Ledford, 2001; The Build Network, 2014). Other impacts include low team morale and bottlenecks caused by the laborious effort needed to onboard incoming staff, exacerbated by the knowledge drain of outgoing staff.
Since 2012, International Business Machines Corp. (IBM) has faced a steady year-on-year decline in headcount, with its workforce shrinking by about 20% between 2012 and 2020 (Alsop, 2021). This does not bode well for the company, especially when talent retention is key to driving revenue growth. The perennial challenge of retaining skilled employees therefore presents a strong business case for IBM to invest in talent retention strategies. For this reason, this project explores various machine learning methodologies to predict turnover, characterize the type of employee who would leave, and delve into the reasons that could be driving talent away. Armed with these insights, IBM can curate targeted HR strategies for more effective talent retention.
A myriad of research studies have been conducted on tech employees’ intention to leave (Ghapanchi & Aurum, 2011). While many organizational antecedents affect attrition, the reasons can be categorized into five (5) broad categories: individual, organizational, job-related, psychological and environmental factors; among these, job-related antecedents such as role ambiguity and role conflict are the most often cited drivers of turnover (Ghapanchi & Aurum, 2011). Corroborating this phenomenon, a leadership management whitepaper by the IBM Smarter Workforce Institute, based on in-depth research among more than 20,000 employees, found job-related reasons to be the biggest factor for leaving, cited by 40% of respondents (Zhang & Feinzig, 2016).
While Ghapanchi & Aurum (2011) and Zhang & Feinzig (2016) suggest that job-related factors are the most cited reason for turnover, recent studies have found psychological factors such as pay satisfaction to be dominant. According to the 2019 Deloitte Global Millennial Survey, close to half of the respondents (49%) reported that they would leave their jobs if they had a choice, and 43% of those cited dissatisfaction with pay and benefits as the top reason for leaving (Deloitte, 2019). In the tech industry, the results are worse. According to the 2019 Dice Survey, a salary report on United States tech firms, 45% of respondents said they would like to change employers within the next year, and 71% of those cited higher compensation as the top reason for leaving (The Dice, 2020). The high turnover across the tech industry, and in top tech firms in particular, is likely a result of the increasing demand for tech talent driven by rapid technological growth (Hecker, 2005), which fuels a race to attract talent with ever-increasing compensation (viGlobal, 2018).
Diving deeper into the IBM whitepaper, it also suggests that turnover rates vary across generations. For instance, early- and mid-career workers are more likely to leave for better career opportunities, as mentioned by 74% of Millennials, 68% of Generation X and 54% of Baby Boomers (Zhang & Feinzig, 2016). In a similar trend, the study also highlighted that early- and mid-career workers have a higher propensity to leave for stronger employer brands and greater flexibility at work (Zhang & Feinzig, 2016).
Regardless of the reasons, high employee turnover in the tech industry has many ramifications for the organization. The direct effect is that turnover slows down the company’s overall business and brings productivity losses. For instance, if an existing software developer leaves, the company takes 43 days on average to hire a new one, roughly a month and a half of lost productivity, and that does not account for onboarding (Lewis, 2020). This consequence can cost the company as much as US$33,251 for each employee who leaves (Lewis, 2020). Apart from the cost of lost productivity, some sources also factor in the loss of intellectual capital attributed to departing employees and the time needed to onboard new staff (Kochanski & Ledford, 2001).
As the studies above show, there are many reasons why employees leave a company: job fit, pay satisfaction, career development, and the list goes on. As the tech industry grows and evolves, no single evergreen reason for leaving can be generalized, given cultural shifts, demographic shifts and generational differences over time (i.e., the reason an employee left IBM in the early 2000s may differ from one leaving in 2020). Likewise, an evergreen solution to talent retention would be futile for the same reasons. Given the many ramifications turnover has on the organization, there is a strong impetus to improve staff retention. This underscores the need for a current, data-driven approach that leverages live employee data to uncover the latest reasons that could be driving talent away, so that IBM can predict and characterize the type of employees likely to leave. These insights would enable better formulation of HR strategies for talent retention.
Myriad factors could affect employee attrition. This project aims to identify the salient factors that influence attrition at IBM so that management can make HR decisions with the greatest impact. Our study focuses on internal attrition data from IBM and employee reviews from Glassdoor. We aim to provide IBM with valuable insights that could lead to the most effective and efficient solutions to improve talent retention by answering the following three (3) questions:
- What are the key driving factors influencing attrition the most at IBM?
The first question explores the key factors that influence attrition at IBM, among the many factors cited by Ghapanchi & Aurum (2011) and Zhang & Feinzig (2016). Such insights would allow IBM HR to develop watch-areas on these factors to improve retention.
- Who is likely to leave IBM?
The second question is a prediction problem, where we aim to fit a model using the key factors identified in Question 1. We will tune model accuracy using a mix of supervised and unsupervised techniques such as logistic regression, sentiment analysis and clustering. This will help IBM HR identify talents who are at risk of leaving.
- What is the employee type that has the highest tendency to leave IBM?
Finally, the third question seeks to characterize the type of employee at highest risk of leaving IBM, allowing better formulation of HR strategies to retain individuals matching that persona. Depending on how many personas emerge from our analysis, we can adopt a risk-based approach and curate unique HR strategies for each persona.
We have two datasets for this project: dat_hr and dat_gd. The first dataset (dat_hr) is taken from IBM and consists of attrition data and demographic variables such as employee satisfaction, income and education, among others, while the second dataset (dat_gd) is collected by scraping Glassdoor and contains past and present IBM employees’ text reviews and ratings of the company. Details of the various data fields can be found in the Appendix.
The first thing we will do is get a detailed overview of the data in the IBM attrition dataset, dat_hr. We use skim() instead of the conventional glimpse() and str() as it offers additional insights such as the number of missing observations, the complete rate and the number of unique observations for both factor and numeric column types. It also returns statistical summaries, which are useful for quick analysis.
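For reference, the summary below can presumably be reproduced with a single call on the loaded data frame (assuming the skimr package is attached):

```r
# One-line overview of the attrition data; produces the tables shown below
library(skimr)
skim(dat_hr)
```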
Name | dat_hr |
Number of rows | 1470 |
Number of columns | 35 |
_______________________ | |
Column type frequency: | |
factor | 9 |
numeric | 26 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Attrition | 0 | 1 | FALSE | 2 | No: 1233, Yes: 237 |
BusinessTravel | 0 | 1 | FALSE | 3 | Tra: 1043, Tra: 277, Non: 150 |
Department | 0 | 1 | FALSE | 3 | Res: 961, Sal: 446, Hum: 63 |
EducationField | 0 | 1 | FALSE | 6 | Lif: 606, Med: 464, Mar: 159, Tec: 132 |
Gender | 0 | 1 | FALSE | 2 | Mal: 882, Fem: 588 |
JobRole | 0 | 1 | FALSE | 9 | Sal: 326, Res: 292, Lab: 259, Man: 145 |
MaritalStatus | 0 | 1 | FALSE | 3 | Mar: 673, Sin: 470, Div: 327 |
Over18 | 0 | 1 | FALSE | 1 | Y: 1470 |
OverTime | 0 | 1 | FALSE | 2 | No: 1054, Yes: 416 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ï..Age | 0 | 1 | 36.92 | 9.14 | 18 | 30.00 | 36.0 | 43.00 | 60 | ▂▇▇▃▂ |
DailyRate | 0 | 1 | 802.49 | 403.51 | 102 | 465.00 | 802.0 | 1157.00 | 1499 | ▇▇▇▇▇ |
DistanceFromHome | 0 | 1 | 9.19 | 8.11 | 1 | 2.00 | 7.0 | 14.00 | 29 | ▇▅▂▂▂ |
Education | 0 | 1 | 2.91 | 1.02 | 1 | 2.00 | 3.0 | 4.00 | 5 | ▂▃▇▆▁ |
EmployeeCount | 0 | 1 | 1.00 | 0.00 | 1 | 1.00 | 1.0 | 1.00 | 1 | ▁▁▇▁▁ |
EmployeeNumber | 0 | 1 | 1024.87 | 602.02 | 1 | 491.25 | 1020.5 | 1555.75 | 2068 | ▇▇▇▇▇ |
EnvironmentSatisfaction | 0 | 1 | 2.72 | 1.09 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
HourlyRate | 0 | 1 | 65.89 | 20.33 | 30 | 48.00 | 66.0 | 83.75 | 100 | ▇▇▇▇▇ |
JobInvolvement | 0 | 1 | 2.73 | 0.71 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▁ |
JobLevel | 0 | 1 | 2.06 | 1.11 | 1 | 1.00 | 2.0 | 3.00 | 5 | ▇▇▃▂▁ |
JobSatisfaction | 0 | 1 | 2.73 | 1.10 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
MonthlyIncome | 0 | 1 | 6502.93 | 4707.96 | 1009 | 2911.00 | 4919.0 | 8379.00 | 19999 | ▇▅▂▁▂ |
MonthlyRate | 0 | 1 | 14313.10 | 7117.79 | 2094 | 8047.00 | 14235.5 | 20461.50 | 26999 | ▇▇▇▇▇ |
NumCompaniesWorked | 0 | 1 | 2.69 | 2.50 | 0 | 1.00 | 2.0 | 4.00 | 9 | ▇▃▂▂▁ |
PercentSalaryHike | 0 | 1 | 15.21 | 3.66 | 11 | 12.00 | 14.0 | 18.00 | 25 | ▇▅▃▂▁ |
PerformanceRating | 0 | 1 | 3.15 | 0.36 | 3 | 3.00 | 3.0 | 3.00 | 4 | ▇▁▁▁▂ |
RelationshipSatisfaction | 0 | 1 | 2.71 | 1.08 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
StandardHours | 0 | 1 | 80.00 | 0.00 | 80 | 80.00 | 80.0 | 80.00 | 80 | ▁▁▇▁▁ |
StockOptionLevel | 0 | 1 | 0.79 | 0.85 | 0 | 0.00 | 1.0 | 1.00 | 3 | ▇▇▁▂▁ |
TotalWorkingYears | 0 | 1 | 11.28 | 7.78 | 0 | 6.00 | 10.0 | 15.00 | 40 | ▇▇▂▁▁ |
TrainingTimesLastYear | 0 | 1 | 2.80 | 1.29 | 0 | 2.00 | 3.0 | 3.00 | 6 | ▂▇▇▂▃ |
WorkLifeBalance | 0 | 1 | 2.76 | 0.71 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▂ |
YearsAtCompany | 0 | 1 | 7.01 | 6.13 | 0 | 3.00 | 5.0 | 9.00 | 40 | ▇▂▁▁▁ |
YearsInCurrentRole | 0 | 1 | 4.23 | 3.62 | 0 | 2.00 | 3.0 | 7.00 | 18 | ▇▃▂▁▁ |
YearsSinceLastPromotion | 0 | 1 | 2.19 | 3.22 | 0 | 0.00 | 1.0 | 3.00 | 15 | ▇▁▁▁▁ |
YearsWithCurrManager | 0 | 1 | 4.12 | 3.57 | 0 | 2.00 | 3.0 | 7.00 | 17 | ▇▂▅▁▁ |
Here we perform some data transformations, such as renaming the malformed Age column (ï..Age, an encoding artifact from the CSV import) and converting some columns to more appropriate data types.
# Renaming the Age column
dat_hr <- dat_hr %>%
dplyr::rename(Age = ï..Age)
# Converting ordinal survey-scale columns from numeric to factor
dat_hr <- dat_hr %>%
  dplyr::mutate(dplyr::across(c(Education, EnvironmentSatisfaction, JobInvolvement,
                                JobLevel, JobSatisfaction, StockOptionLevel,
                                PerformanceRating, RelationshipSatisfaction,
                                WorkLifeBalance),
                              as.factor))
Based on the skim() summary above, a few variables have only one unique value: Over18, EmployeeCount and StandardHours. We will remove these columns as they are not useful for the modeling process.
dat_hr <- dat_hr %>%
dplyr::select(-Over18, -EmployeeCount, -StandardHours)
As a final check, we ensure that there are no NAs in the dataset.
sum(is.na(dat_hr))
## [1] 0
Finally, this is the cleaned dat_hr dataset, ready for analysis.
## Rows: 1,470
## Columns: 32
## $ Age <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2~
## $ Attrition <fct> Yes, No, Yes, No, No, No, No, No, No, No, No,~
## $ BusinessTravel <fct> Travel_Rarely, Travel_Frequently, Travel_Rare~
## $ DailyRate <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,~
## $ Department <fct> Sales, Research & Development, Research & Dev~
## $ DistanceFromHome <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, ~
## $ Education <fct> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, ~
## $ EducationField <fct> Life Sciences, Life Sciences, Other, Life Sci~
## $ EmployeeNumber <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,~
## $ EnvironmentSatisfaction <fct> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, ~
## $ Gender <fct> Female, Male, Male, Female, Male, Male, Femal~
## $ HourlyRate <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4~
## $ JobInvolvement <fct> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, ~
## $ JobLevel <fct> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, ~
## $ JobRole <fct> Sales Executive, Research Scientist, Laborato~
## $ JobSatisfaction <fct> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, ~
## $ MaritalStatus <fct> Single, Married, Single, Married, Married, Si~
## $ MonthlyIncome <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269~
## $ MonthlyRate <int> 19479, 24907, 2396, 23159, 16632, 11864, 9964~
## $ NumCompaniesWorked <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, ~
## $ OverTime <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~
## $ PercentSalaryHike <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1~
## $ PerformanceRating <fct> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, ~
## $ RelationshipSatisfaction <fct> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, ~
## $ StockOptionLevel <fct> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, ~
## $ TotalWorkingYears <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3~
## $ TrainingTimesLastYear <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, ~
## $ WorkLifeBalance <fct> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, ~
## $ YearsAtCompany <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,~
## $ YearsInCurrentRole <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, ~
## $ YearsSinceLastPromotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, ~
## $ YearsWithCurrManager <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, ~
The first thing we will do is get an overview of what the scraped data looks like, using the glimpse() function. As the summary below shows, substantial cleaning is needed to prepare the data for text mining: Ratings needs to be converted to numeric form, Current_Former needs to be factorized, Review_ID contains web links that need to be truncated, Role has dates concatenated with the role names, Location can be shortened to the two-letter state abbreviation for standardization, and the Pros and Cons columns need to be tokenized using tidytext.
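The overview below comes from a plain glimpse() call on the raw scrape:

```r
# Structural overview of the raw Glassdoor reviews
dplyr::glimpse(dat_gd)
```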
## Rows: 890
## Columns: 8
## $ Ratings <chr> "5.0", "4.0", "4.0", "4.0", "5.0", "4.0", "2.0", "4.0",~
## $ Current_Former <chr> "Current Employee", "Current Employee, more than 8 year~
## $ Title <chr> "EXCELLENT", "Advisory Engineer in STG, IBM", "IBM is n~
## $ Review_ID <chr> "https://www.glassdoor.sg/Reviews/Employee-Review-IBM-R~
## $ Role <chr> "19 Feb 2021 - Executive", "26 Aug 2014 - Advisory Engi~
## $ Location <chr> "Chicago, IL", "Hopewell Junction, NY", "Dallas, TX", "~
## $ Pros <chr> "Great place to work IBM", "Disclaimer: A lot of what I~
## $ Cons <chr> "No cons with IBM employment. Great place to work.", "1~
Here we perform the necessary data preparation. First, clean the Ratings column.
# Convert Ratings to numeric
dat_gd$Ratings <- as.numeric(dat_gd$Ratings)
Clean the Current_Former column. Notice that the employment length is concatenated with the employment status, so we separate it out into a new column.
# Split the string into 2 columns: Current_Former and Employment_Length
dat_gd <- dat_gd %>%
separate(Current_Former, c("Current_Former", "Employment_Length"), ", ")
# Convert Current_Former to factor
dat_gd$Current_Former <- as.factor(dat_gd$Current_Former)
Clean the Review_ID column.
# Extract review IDs from web links (fixed = TRUE treats the patterns as literal strings)
dat_gd$Review_ID <- sub("https://www.glassdoor.sg/Reviews/Employee-Review-IBM-RVW", "", dat_gd$Review_ID, fixed = TRUE)
dat_gd$Review_ID <- sub(".htm", "", dat_gd$Review_ID, fixed = TRUE)
# Convert Review_ID to numeric
dat_gd$Review_ID <- as.numeric(dat_gd$Review_ID)
Check for duplicates in Review_ID.
dim(dat_gd[duplicated(dat_gd$Review_ID),])[1]
## [1] 1
There is 1 duplicate entry. Let's find the specific Review_ID that is duplicated.
dat_gd[duplicated(dat_gd$Review_ID),]
## # A tibble: 1 x 9
## Ratings Current_Former Employment_Length Title Review_ID Role Location Pros
## <dbl> <fct> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 5 Current Employ~ NA EXCE~ 42772946 19 F~ Chicago~ Grea~
## # ... with 1 more variable: Cons <chr>
Review_ID 42772946 is duplicated, so we remove the duplicate row from the data.
dat_gd <- dat_gd[!duplicated(dat_gd$Review_ID),]
Clean the Role column.
# Separate the date portion from the role portion
dat_gd <- dat_gd %>%
separate(Role, c("Review_Date", "Role"), " - ")
# Convert Review_Date to date format
dat_gd$Review_Date <- as.Date(dat_gd$Review_Date, "%d %b %Y")
Clean the Location column.
# Extract the 2-letter state abbreviation
dat_gd$Location <- sub(".*,\\s*", "", dat_gd$Location)
Check whether there are entries in the Location column that do not conform to the two-letter state abbreviation.
dat_gd$Location[nchar(dat_gd$Location) != 2]
## [1] "Tarlac" "Chennai"
There are two locations outside the United States: Tarlac in the Philippines and Chennai in India. We remove these rows from the data and convert the column to a factor.
# Remove rows
dat_gd <- dat_gd[nchar(dat_gd$Location) == 2,]
# Convert character to factor
dat_gd$Location <- as.factor(dat_gd$Location)
Finally, this is the cleaned dat_gd dataset, ready for analysis.
## Rows: 887
## Columns: 10
## $ Ratings <dbl> 5, 4, 4, 4, 5, 4, 2, 4, 5, 5, 4, 4, 5, 4, 5, 5, 4, 1~
## $ Current_Former <fct> Current Employee, Current Employee, Current Employee~
## $ Employment_Length <chr> NA, "more than 8 years", "more than 10 years", "more~
## $ Title <chr> "EXCELLENT", "Advisory Engineer in STG, IBM", "IBM i~
## $ Review_ID <dbl> 42772946, 4852131, 33522445, 47288155, 46360847, 470~
## $ Review_Date <date> 2021-02-19, 2014-08-26, 2020-06-04, 2021-05-21, 202~
## $ Role <chr> "Executive", "Advisory Engineer", "Bid Proposal Mana~
## $ Location <fct> IL, NY, TX, LA, TX, FL, MI, NY, NC, DC, NY, NY, NY, ~
## $ Pros <chr> "Great place to work IBM", "Disclaimer: A lot of wha~
## $ Cons <chr> "No cons with IBM employment. Great place to work.",~
The outcome variable is Attrition, and we want to know its distribution.
table(dat_hr$Attrition)
##
## No Yes
## 1233 237
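From these counts, the attrition rate works out to roughly 16%, a notable class imbalance to keep in mind when modeling:

```r
# Attrition rate implied by the table above: Yes / (Yes + No)
attrition_rate <- 237 / (1233 + 237)
round(attrition_rate, 3)  # 0.161
```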
Visualizing the counts makes the imbalance easier to appreciate.
dat_hr %>%
dplyr::group_by(Attrition) %>%
dplyr::summarise(Total = n()) %>%
ggplot(aes(x=Attrition, y=Total, fill=Attrition)) +
geom_col() +
ggtitle("Distribution of Attrition") +
theme_classic()
Let’s explore the attrition by income level.
dat_hr %>%
ggplot(aes(x=MonthlyIncome, fill=Attrition)) +
geom_density(alpha=0.6) +
labs(x="Monthly Income", y="") +
ggtitle("Attrition by Income Level") +
theme_classic()
Exploring attrition rates with respect to age.
dat_hr %>%
ggplot(aes(x=Attrition, y=Age, fill=Attrition)) +
geom_boxplot() +
ggtitle("Attrition by Age") +
theme_classic()
Here we calculate the correlation between Attrition and the rest of the variables in the dataset. A correlation funnel, plotted using the correlationfunnel library, ranks the variables from the highest to the lowest correlation with Attrition (hence the funnel shape). To interpret the plot: OverTime sits at the top of the funnel, meaning it is highly correlated with Attrition. Heuristically, employees who work overtime face increased workloads and longer working hours, which would plausibly lead to higher attrition. Our correlation plot reflects this observation: working overtime shows a positive correlation with Attrition (right side of the funnel) while not working overtime shows a negative correlation (left side).
To select variables for the modeling and clustering work, we will apply a correlation threshold (e.g., > 0.1): including all variables would complicate the analysis of the results and make the findings harder for IBM stakeholders to act upon.
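As a sketch of the workflow behind the funnel, using the standard correlationfunnel pipeline (binarize the features, correlate against the positive attrition class, then plot); the bin settings and the 0.1 cut-off here are our assumptions, not package defaults:

```r
library(dplyr)
library(correlationfunnel)

dat_hr %>%
  binarize(n_bins = 4, thresh_infreq = 0.01) %>%  # discretize/one-hot all variables
  correlate(target = Attrition__Yes) %>%          # correlation with leaving
  filter(abs(correlation) > 0.1) %>%              # assumed selection threshold
  plot_correlation_funnel()
```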
The first thing we will do is analyze the Pros and Cons text reviews in the dat_gd dataset to get a glimpse of what employees, past and present, feel about IBM.
Analyzing the Pros:
dat_gd %>%
unnest_tokens(input=Pros, output=word) %>%
anti_join(stop_words) %>%
count(word, sort=TRUE)
## # A tibble: 1,787 x 2
## word n
## <chr> <int>
## 1 ibm 163
## 2 people 157
## 3 company 149
## 4 opportunities 136
## 5 benefits 126
## 6 life 101
## 7 balance 94
## 8 learning 78
## 9 culture 72
## 10 learn 70
## # ... with 1,777 more rows
# Visualizing top 10 most common words in Pros
dat_gd %>%
unnest_tokens(input=Pros, output=word) %>%
anti_join(stop_words) %>%
count(word, sort=TRUE) %>%
slice(1:10) %>%
ggplot() + geom_bar(aes(x=reorder(word, -n), y=n), stat="identity", fill="#40B5AD") +
theme_classic() +
labs(title = "Top 10 Single Words in Pros",
caption = "Data Source: Glassdoor") +
theme(axis.title.x=element_blank())
# Visualizing top 30 bigrams in Pros
dat_gd %>%
unnest_tokens(input=Pros, output=word, token="ngrams", n=2) %>%
separate(word, c("word1", "word2"), sep=" ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
unite(word, word1, word2, sep=" ") %>%
count(word, sort=TRUE) %>%
slice(1:30) %>%
ggplot() + geom_bar(aes(x=reorder(word, n), y=n), stat = "identity", fill = "#40B5AD") +
theme_classic() +
coord_flip() +
labs(title = "Top 30 Bi-Words in Pros",
caption = "Data Source: Glassdoor") +
theme(axis.title.y=element_blank())
Analyzing the Cons:
dat_gd %>%
unnest_tokens(input=Cons, output=word) %>%
anti_join(stop_words) %>%
count(word, sort=TRUE)
## # A tibble: 2,399 x 2
## word n
## <chr> <int>
## 1 company 152
## 2 ibm 133
## 3 management 112
## 4 time 60
## 5 pay 59
## 6 salary 51
## 7 job 47
## 8 people 46
## 9 team 44
## 10 hard 43
## # ... with 2,389 more rows
# Visualizing top 10 most common words in Cons
dat_gd %>%
unnest_tokens(input=Cons, output=word) %>%
anti_join(stop_words) %>%
count(word, sort=TRUE) %>%
slice(1:10) %>%
ggplot() + geom_bar(aes(x=reorder(word, -n), y=n), stat="identity", fill="#de5833") +
theme_classic() +
labs(title = "Top 10 Single Words in Cons",
caption = "Data Source: Glassdoor") +
theme(axis.title.x=element_blank())
# Visualizing top 30 bigrams in Cons
dat_gd %>%
unnest_tokens(input=Cons, output=word, token="ngrams", n=2) %>%
separate(word, c("word1", "word2"), sep=" ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
unite(word, word1, word2, sep=" ") %>%
count(word, sort=TRUE) %>%
slice(1:30) %>%
ggplot() + geom_bar(aes(x=reorder(word, n), y=n), stat = "identity", fill = "#de5833") +
theme_classic() +
coord_flip() +
labs(title = "Top 30 Bi-Words in Cons",
caption = "Data Source: Glassdoor") +
theme(axis.title.y=element_blank())
In the Glassdoor reviews, there are two columns of text data, Pros and Cons: one for positive reviews and one for negative reviews. The idea is to compute sentiment scores from them and use the scores as inputs to the model that predicts attrition at IBM. We want to explore whether the sentiment in the Glassdoor reviews improves attrition prediction.
Before any sentiment analysis can be done, we need to collapse the Role column of the dat_gd dataset into a smaller set of categories: the free-text reviews contain a myriad of role names, which would make summary statistics and joins with the dat_hr dataset difficult. The goal is to compute an aggregated (averaged) sentiment score for each job category based on the text reviews. We eventually grouped all roles into 6 distinct job categories because each has a distinct career-ladder path, and roles within a category share similarities such as pay and education background, among other things.

The 6 job categories are: AESP, Corporate, Director, ESP, Manager and Sales.

However, while performing the re-categorization, some roles were ambiguous and needed manual research to determine the category they belonged to. For instance, for roles such as “sdm” and “rsm” we consulted Glassdoor’s website to understand the individual job descriptions before assigning them to the appropriate categories (“sdm” and “rsm” abbreviate service delivery manager and real storage manager respectively).
# Convert all roles to lowercase
dat_gd$Role <- tolower(dat_gd$Role)
# Rename all roles into standard role types
dat_gd_newroles <- dat_gd %>%
mutate(Role = case_when(
str_detect(Role, "manager") ~ "Manager",
str_detect(Role, "bdm") ~ "Manager",
str_detect(Role, "director") ~ "Director",
str_detect(Role, "engineer") ~ "ESP",
str_detect(Role, "developer") ~ "ESP",
str_detect(Role, "analyst") ~ "ESP",
str_detect(Role, "programmer") ~ "ESP",
str_detect(Role, "architect") ~ "ESP",
str_detect(Role, "scientist") ~ "ESP",
str_detect(Role, "data science") ~ "ESP",
str_detect(Role, "data scientist") ~ "ESP",
str_detect(Role, "data scentist") ~ "ESP",
str_detect(Role, "senior technical staff member") ~ "ESP",
str_detect(Role, "researcher") ~ "ESP",
str_detect(Role, "design") ~ "ESP",
str_detect(Role, "sales") ~ "Sales",
str_detect(Role, "talent") ~ "Corporate",
str_detect(Role, "hr") ~ "Corporate",
str_detect(Role, "human") ~ "Corporate",
str_detect(Role, "executive") ~ "Corporate",
str_detect(Role, "partner") ~ "Corporate",
str_detect(Role, "accountant") ~ "Corporate",
str_detect(Role, "technician") ~ "AESP",
str_detect(Role, "specialist") ~ "AESP",
str_detect(Role, "senior") ~ "Manager",
str_detect(Role, "managing") ~ "Manager",
str_detect(Role, "consultant") ~ "Corporate",
str_detect(Role, "admin") ~ "Corporate",
str_detect(Role, "vice pres") ~ "Director",
str_detect(Role, "svp") ~ "Director",
str_detect(Role, "research") ~ "ESP",
str_detect(Role, "lead") ~ "Manager",
str_detect(Role, "writer") ~ "Corporate",
str_detect(Role, "chief") ~ "Director",
str_detect(Role, "tech") ~ "AESP",
str_detect(Role, "client rep") ~ "Sales",
str_detect(Role, "service rep") ~ "Sales",
str_detect(Role, "account rep") ~ "Sales",
str_detect(Role, "support serv") ~ "Sales",
str_detect(Role, "acquisition") ~ "Corporate",
str_detect(Role, "president") ~ "Director",
str_detect(Role, "legal") ~ "Corporate",
str_detect(Role, "business") ~ "Corporate",
str_detect(Role, "manage") ~ "Manager",
str_detect(Role, "product") ~ "Manager",
str_detect(Role, "procurement") ~ "Corporate",
str_detect(Role, "market") ~ "Corporate",
str_detect(Role, "finance") ~ "Corporate",
str_detect(Role, "people") ~ "Corporate",
str_detect(Role, "rep") ~ "Corporate",
str_detect(Role, "dev") ~ "ESP",
str_detect(Role, "agent") ~ "Corporate",
str_detect(Role, "bdr") ~ "Corporate",
str_detect(Role, "quality") ~ "AESP",
str_detect(Role, "support") ~ "AESP",
str_detect(Role, "attorney") ~ "Corporate",
str_detect(Role, "social") ~ "Corporate",
str_detect(Role, "relations") ~ "Corporate",
str_detect(Role, "content") ~ "Corporate",
str_detect(Role, "hardware") ~ "AESP",
str_detect(Role, "scrum") ~ "Corporate",
str_detect(Role, "sdm") ~ "Manager",
str_detect(Role, "rsm") ~ "Manager",
TRUE ~ Role
))
# Discard all other rows that do not have well defined roles
dat_gd_filtered <- dat_gd_newroles %>%
  filter(Role %in% c("ESP", "Manager", "Corporate", "Sales", "AESP", "Director"))
# Convert Role to factor
dat_gd_filtered$Role <- as.factor(dat_gd_filtered$Role)
Ensure that the roles have 6 distinct categories and explore the distribution.
levels(dat_gd_filtered$Role)
## [1] "AESP" "Corporate" "Director" "ESP" "Manager" "Sales"
summary(dat_gd_filtered$Role)
## AESP Corporate Director ESP Manager Sales
## 53 139 32 334 232 58
For the sentiment analysis, we use the sentimentr library as it handles valence shifters (i.e., negators, amplifiers, de-amplifiers and adversative conjunctions) while maintaining speed (the qdap library is an alternative, but slower). Another reason is that the library can aggregate sentiment scores by grouping variables, which is exactly what we want to do (i.e., group by job role).
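A minimal illustration of the valence-shifter handling (the negator in the second sentence flips an otherwise positive word); the example sentences are ours, and exact scores depend on the lexicon version:

```r
library(sentimentr)

# "good" scores positive on its own; the negator "not" in the second
# sentence flips the polarity contributed by "good"
sentiment(get_sentences(c("The pay is good.", "The pay is not good.")))
```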
Here we compute the average sentiment scores for the Pros in the text reviews and visualize them in a density plot for easy viewing.
## Pros
# Avg sentiment scores summarized by Role
dat_gd_avgsentiment_pros <- dat_gd_filtered %>%
select(-Cons) %>%
get_sentences() %>%
sentiment_by("Role")
dat_gd_avgsentiment_pros
## Role word_count sd ave_sentiment
## 1: AESP 871 0.3842475 0.5730397
## 2: Corporate 2186 0.3597134 0.4696195
## 3: Director 642 0.5245235 0.5968319
## 4: ESP 6274 0.3822417 0.5135917
## 5: Manager 3470 0.3515144 0.5727328
## 6: Sales 740 0.3391197 0.5839903
# Sentiment scores of Pros reviews faceted by Role
dat_gd_filtered %>%
select(-Cons) %>%
get_sentences() %>%
sentiment() %>%
group_by(element_id, Role) %>%
summarise(avg_sentiment = mean(sentiment)) %>%
ungroup() %>%
left_join(dat_gd_avgsentiment_pros, by="Role") %>%
ggplot(aes(x=avg_sentiment, fill=Role)) +
geom_density(alpha=0.6) +
facet_wrap(~Role) +
geom_vline(aes(xintercept=ave_sentiment), size=1, color="red") +
geom_text(aes(x=ave_sentiment+0.6, label=paste0("Mean\n",round(ave_sentiment,3)), y=1.2)) +
ggtitle("Distribution of Sentiment Scores for Reviews in Pros") +
labs(x="Sentiment Score", y="Density") +
theme_classic()
Now we compute the average sentiment scores for the Cons in the text reviews and visualize them in a density plot.
## Cons
# Avg sentiment scores summarized by Role
dat_gd_avgsentiment_cons <- dat_gd_filtered %>%
select(-Pros) %>%
get_sentences() %>%
sentiment_by("Role")
dat_gd_avgsentiment_cons
## Role word_count sd ave_sentiment
## 1: AESP 960 0.3226684 -0.061601871
## 2: Corporate 2392 0.3291854 0.031341807
## 3: Director 664 0.3920271 -0.030371233
## 4: ESP 7116 0.3493464 0.002896978
## 5: Manager 4079 0.3316412 0.044559058
## 6: Sales 1310 0.3581594 -0.015564249
# Sentiment scores of Cons reviews faceted by Role
dat_gd_filtered %>%
select(-Pros) %>%
get_sentences() %>%
sentiment() %>%
group_by(element_id, Role) %>%
summarise(avg_sentiment = mean(sentiment)) %>%
ungroup() %>%
left_join(dat_gd_avgsentiment_cons, by="Role") %>%
ggplot(aes(x=avg_sentiment, fill=Role)) +
geom_density(alpha=0.6) +
facet_wrap(~Role) +
geom_vline(aes(xintercept=ave_sentiment), size=1, color="red") +
geom_text(aes(x=ave_sentiment+0.6, label=paste0("Mean\n",round(ave_sentiment,3)), y=1.2)) +
ggtitle("Distribution of Sentiment Scores for Reviews in Cons") +
labs(x="Sentiment Score", y="Density") +
theme_classic()
As you can see from the distributions, the Pros reviews carry a more positive polarity than the Cons reviews, which is expected. These sentiment scores will subsequently be used as inputs to the prediction models.
With the roles in the Glassdoor reviews amalgamated into 6 distinct categories, we do the same for the IBM dataset, dat_hr, and join the sentiment scores to it for prediction modeling purposes.
First, we re-categorize the job roles in the IBM dataset. This re-categorization is more straightforward than it was for the Glassdoor reviews because the data already has distinct job role categories.
# Convert JobRole from factor to character
dat_hr$JobRole <- as.character(dat_hr$JobRole)
# Rename all roles into standard role types
dat_hr_newroles <- dat_hr %>%
mutate(JobRole = case_when(
str_detect(JobRole, "Sales Executive") ~ "Sales",
str_detect(JobRole, "Research Scientist") ~ "ESP",
str_detect(JobRole, "Laboratory Technician") ~ "AESP",
str_detect(JobRole, "Manufacturing Director") ~ "Director",
str_detect(JobRole, "Healthcare Representative") ~ "Corporate",
str_detect(JobRole, "Sales Representative") ~ "Sales",
str_detect(JobRole, "Research Director") ~ "Director",
str_detect(JobRole, "Human Resources") ~ "Corporate",
TRUE ~ JobRole
))
# Convert JobRole to factor
dat_hr_newroles$JobRole <- as.factor(dat_hr_newroles$JobRole)
# Rename JobRole to Role in order to match dat_gd_avgsentiment_pros and cons
dat_hr_renamed <- dat_hr_newroles %>%
rename(Role = JobRole)
Ensure that the roles have 6 distinct categories and explore the distribution.
levels(dat_hr_renamed$Role)
## [1] "AESP" "Corporate" "Director" "ESP" "Manager" "Sales"
summary(dat_hr_renamed$Role)
## AESP Corporate Director ESP Manager Sales
## 259 183 225 292 102 409
Now that both datasets have a common key (the job roles) for joining, we perform a left_join to merge the sentiment scores for Pros and Cons into the IBM dataset. The sentiment scores are appended as the last two columns of dat_hr_final.
# dat_gd_avgsentiment_pros joined to dat_hr_renamed
dat_hr_final <- dat_hr_renamed %>%
left_join(dat_gd_avgsentiment_pros, by="Role") %>%
select(-word_count, -sd) %>%
rename(ave_sentiment_pros = ave_sentiment) %>%
left_join(dat_gd_avgsentiment_cons, by="Role") %>%
select(-word_count, -sd) %>%
rename(ave_sentiment_cons = ave_sentiment)
glimpse(dat_hr_final)
## Rows: 1,470
## Columns: 34
## $ Age <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2~
## $ Attrition <fct> Yes, No, Yes, No, No, No, No, No, No, No, No,~
## $ BusinessTravel <fct> Travel_Rarely, Travel_Frequently, Travel_Rare~
## $ DailyRate <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,~
## $ Department <fct> Sales, Research & Development, Research & Dev~
## $ DistanceFromHome <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, ~
## $ Education <fct> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, ~
## $ EducationField <fct> Life Sciences, Life Sciences, Other, Life Sci~
## $ EmployeeNumber <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,~
## $ EnvironmentSatisfaction <fct> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, ~
## $ Gender <fct> Female, Male, Male, Female, Male, Male, Femal~
## $ HourlyRate <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4~
## $ JobInvolvement <fct> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, ~
## $ JobLevel <fct> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, ~
## $ Role <fct> Sales, ESP, AESP, ESP, AESP, AESP, AESP, AESP~
## $ JobSatisfaction <fct> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, ~
## $ MaritalStatus <fct> Single, Married, Single, Married, Married, Si~
## $ MonthlyIncome <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269~
## $ MonthlyRate <int> 19479, 24907, 2396, 23159, 16632, 11864, 9964~
## $ NumCompaniesWorked <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, ~
## $ OverTime <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~
## $ PercentSalaryHike <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1~
## $ PerformanceRating <fct> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, ~
## $ RelationshipSatisfaction <fct> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, ~
## $ StockOptionLevel <fct> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, ~
## $ TotalWorkingYears <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3~
## $ TrainingTimesLastYear <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, ~
## $ WorkLifeBalance <fct> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, ~
## $ YearsAtCompany <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,~
## $ YearsInCurrentRole <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, ~
## $ YearsSinceLastPromotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, ~
## $ YearsWithCurrManager <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, ~
## $ ave_sentiment_pros <dbl> 0.5839903, 0.5135917, 0.5730397, 0.5135917, 0~
## $ ave_sentiment_cons <dbl> -0.015564249, 0.002896978, -0.061601871, 0.00~
In this section, we perform the modeling process for predicting attrition at IBM, using inputs from the original IBM dataset together with the clustering results and supplemental data from the Glassdoor reviews. The prediction modeling methodology is as follows:
Since we are attempting to predict attrition, this becomes a classification problem (i.e., “will this employee leave?”) and we will use confusion matrices to compute the accuracy, specificity and sensitivity figures.
The main aim of this project is to reduce attrition at IBM; in our business context, we care more about false negatives (FN) than false positives (FP). Why?
Under a FP, we predict that an employee would leave (Attrition=Yes) but he/she does not. The company would then have taken additional measures to retain an employee who was not at risk, leading to some "wastage" of resources. But this is not necessarily bad: in the process, the company forges a stronger relationship with the employee, which might further reduce the chance of the employee leaving.
Under a FN, however, we predict that an employee would not leave (Attrition=No) but he/she does. This is more detrimental to the company because it means we failed to take preventive action, thereby losing valuable talent. For this reason, the priority in our prediction models is reducing FNs. Hence, the crucial metric in our business context is sensitivity, because FN sits in its denominator, as shown below.
\[ Sensitivity = TP/(TP+FN) \] \[ Specificity = TN/(TN+FP) \] \[ Accuracy = (TN+TP)/(TN+TP+FN+FP) \]
Lastly, we will use the area under the curve (AUC) as a measure of the model quality. The higher the AUC, the better the model.
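The three formulas above can be wrapped into a small helper (purely illustrative; the chunks that follow compute the same quantities inline from each confusion matrix):

```r
# Compute accuracy, specificity and sensitivity from a 2x2 confusion matrix
# laid out with actuals in rows (No, Yes) and predictions in columns (0, 1).
classification_metrics <- function(ct) {
  tn <- ct[1, 1]; fp <- ct[1, 2]
  fn <- ct[2, 1]; tp <- ct[2, 2]
  data.frame(
    accuracy    = (tn + tp) / (tn + tp + fn + fp),
    specificity = tn / (tn + fp),
    sensitivity = tp / (tp + fn)
  )
}
```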
# Split data
set.seed(1000)
split <- createDataPartition(dat_hr_final$Attrition, p=0.7, list=F, groups=100)
train <- dat_hr_final[split,]
test <- dat_hr_final[-split,]
We first model a logistic regression with the original predictors from the IBM dataset (without sentiment scores and clustering), and use backward selection for feature selection.
model1 <- glm(Attrition~., family="binomial", data=train[,1:32])
summary(model1)
##
## Call:
## glm(formula = Attrition ~ ., family = "binomial", data = train[,
## 1:32])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7902 -0.4553 -0.2171 -0.0694 3.4280
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.257e+00 1.622e+00 3.241 0.001190 **
## Age -4.446e-02 1.691e-02 -2.629 0.008573 **
## BusinessTravelTravel_Frequently 1.911e+00 5.026e-01 3.803 0.000143 ***
## BusinessTravelTravel_Rarely 1.044e+00 4.620e-01 2.258 0.023916 *
## DailyRate -4.898e-04 2.764e-04 -1.772 0.076406 .
## DepartmentResearch & Development 1.641e-01 8.497e-01 0.193 0.846837
## DepartmentSales -3.832e-01 1.605e+00 -0.239 0.811333
## DistanceFromHome 4.743e-02 1.394e-02 3.401 0.000670 ***
## Education2 2.398e-01 4.013e-01 0.598 0.550161
## Education3 2.035e-01 3.492e-01 0.583 0.560018
## Education4 2.311e-01 3.826e-01 0.604 0.545927
## Education5 -1.100e+00 9.529e-01 -1.154 0.248549
## EducationFieldLife Sciences -1.257e+00 1.015e+00 -1.238 0.215839
## EducationFieldMarketing -7.822e-01 1.082e+00 -0.723 0.469793
## EducationFieldMedical -1.196e+00 1.008e+00 -1.186 0.235716
## EducationFieldOther -1.530e+00 1.095e+00 -1.397 0.162370
## EducationFieldTechnical Degree -6.066e-01 1.057e+00 -0.574 0.565971
## EmployeeNumber -1.163e-04 1.898e-04 -0.613 0.540157
## EnvironmentSatisfaction2 -9.162e-01 3.515e-01 -2.606 0.009149 **
## EnvironmentSatisfaction3 -1.014e+00 3.215e-01 -3.155 0.001604 **
## EnvironmentSatisfaction4 -1.202e+00 3.259e-01 -3.689 0.000225 ***
## GenderMale 5.812e-01 2.395e-01 2.426 0.015248 *
## HourlyRate 1.814e-03 5.550e-03 0.327 0.743727
## JobInvolvement2 -1.391e+00 4.674e-01 -2.977 0.002912 **
## JobInvolvement3 -1.665e+00 4.380e-01 -3.802 0.000144 ***
## JobInvolvement4 -2.145e+00 5.896e-01 -3.637 0.000275 ***
## JobLevel2 -1.386e+00 4.410e-01 -3.142 0.001678 **
## JobLevel3 2.610e-01 7.951e-01 0.328 0.742689
## JobLevel4 -5.122e-01 1.510e+00 -0.339 0.734491
## JobLevel5 1.259e+00 1.804e+00 0.698 0.485285
## RoleCorporate -1.283e-01 5.894e-01 -0.218 0.827642
## RoleDirector -3.472e-01 5.809e-01 -0.598 0.550009
## RoleESP -9.264e-01 3.369e-01 -2.750 0.005964 **
## RoleManager 4.274e-01 1.059e+00 0.404 0.686551
## RoleSales 1.067e+00 1.493e+00 0.715 0.474751
## JobSatisfaction2 -9.525e-01 3.556e-01 -2.678 0.007396 **
## JobSatisfaction3 -8.181e-01 3.004e-01 -2.723 0.006464 **
## JobSatisfaction4 -1.423e+00 3.268e-01 -4.355 1.33e-05 ***
## MaritalStatusMarried 4.239e-01 3.390e-01 1.251 0.211084
## MaritalStatusSingle 1.163e+00 4.925e-01 2.362 0.018174 *
## MonthlyIncome -1.704e-04 1.052e-04 -1.620 0.105275
## MonthlyRate 5.186e-06 1.595e-05 0.325 0.745119
## NumCompaniesWorked 2.111e-01 4.877e-02 4.328 1.51e-05 ***
## OverTimeYes 2.123e+00 2.545e-01 8.343 < 2e-16 ***
## PercentSalaryHike -2.324e-02 4.724e-02 -0.492 0.622818
## PerformanceRating4 2.530e-01 4.861e-01 0.521 0.602692
## RelationshipSatisfaction2 -1.105e+00 3.590e-01 -3.077 0.002088 **
## RelationshipSatisfaction3 -1.132e+00 3.150e-01 -3.595 0.000324 ***
## RelationshipSatisfaction4 -1.247e+00 3.190e-01 -3.910 9.24e-05 ***
## StockOptionLevel1 -6.250e-01 3.884e-01 -1.609 0.107551
## StockOptionLevel2 -7.834e-01 5.775e-01 -1.357 0.174904
## StockOptionLevel3 -4.366e-01 5.941e-01 -0.735 0.462464
## TotalWorkingYears -5.732e-02 3.645e-02 -1.573 0.115779
## TrainingTimesLastYear -2.158e-01 9.017e-02 -2.393 0.016703 *
## WorkLifeBalance2 -8.450e-01 4.564e-01 -1.851 0.064110 .
## WorkLifeBalance3 -1.334e+00 4.299e-01 -3.103 0.001914 **
## WorkLifeBalance4 -8.533e-01 5.175e-01 -1.649 0.099176 .
## YearsAtCompany 9.188e-02 4.934e-02 1.862 0.062571 .
## YearsInCurrentRole -1.395e-01 5.695e-02 -2.450 0.014290 *
## YearsSinceLastPromotion 1.959e-01 5.370e-02 3.649 0.000264 ***
## YearsWithCurrManager -1.228e-01 5.584e-02 -2.199 0.027857 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 909.69 on 1029 degrees of freedom
## Residual deviance: 569.35 on 969 degrees of freedom
## AIC: 691.35
##
## Number of Fisher Scoring iterations: 7
# Backward selection
start_mod <- glm(Attrition~., family="binomial", data=train[,1:32])
empty_mod <- glm(Attrition~1, family="binomial", data=train[,1:32])
full_mod <- glm(Attrition~., family="binomial", data=train[,1:32])
model_bw <- step(start_mod,
scope=list(upper=full_mod, lower=empty_mod),
direction="backward")
We check the VIFs of the logistic regression model and find that JobLevel has a very high VIF. We drop JobLevel from the model and check the VIFs once more to ensure that the remaining variables are not multicollinear.
# Check VIF
vif(model_bw)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.854540 1 1.361815
## BusinessTravel 1.233003 2 1.053758
## DailyRate 1.074445 1 1.036555
## DistanceFromHome 1.153649 1 1.074080
## EnvironmentSatisfaction 1.294510 3 1.043961
## Gender 1.115951 1 1.056386
## JobInvolvement 1.245723 3 1.037298
## JobLevel 32.054649 4 1.542540
## Role 5.852106 5 1.193249
## JobSatisfaction 1.320068 3 1.047368
## MaritalStatus 1.257844 2 1.059026
## MonthlyIncome 14.132514 1 3.759324
## NumCompaniesWorked 1.420018 1 1.191645
## OverTime 1.350808 1 1.162243
## RelationshipSatisfaction 1.265591 3 1.040037
## TotalWorkingYears 5.092692 1 2.256699
## TrainingTimesLastYear 1.082879 1 1.040615
## WorkLifeBalance 1.246539 3 1.037411
## YearsAtCompany 6.194323 1 2.488840
## YearsInCurrentRole 2.866187 1 1.692982
## YearsSinceLastPromotion 2.332125 1 1.527130
## YearsWithCurrManager 2.784844 1 1.668785
# Drop highest VIF (JobLevel)
model_bw <- glm(Attrition~Age+BusinessTravel+DailyRate+Department+
DistanceFromHome+EnvironmentSatisfaction+Gender+
JobInvolvement+JobSatisfaction+MonthlyIncome+
NumCompaniesWorked+OverTime+PercentSalaryHike+
RelationshipSatisfaction+StockOptionLevel+
TotalWorkingYears+TrainingTimesLastYear+
WorkLifeBalance+YearsSinceLastPromotion+
YearsWithCurrManager, family="binomial", data=train[,1:32])
# Check VIF once more
vif(model_bw)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.780608 1 1.334394
## BusinessTravel 1.175488 2 1.041249
## DailyRate 1.057085 1 1.028146
## Department 1.155066 2 1.036697
## DistanceFromHome 1.107330 1 1.052297
## EnvironmentSatisfaction 1.186843 3 1.028961
## Gender 1.072582 1 1.035656
## JobInvolvement 1.192317 3 1.029750
## JobSatisfaction 1.213064 3 1.032715
## MonthlyIncome 2.039577 1 1.428138
## NumCompaniesWorked 1.316823 1 1.147529
## OverTime 1.224833 1 1.106722
## PercentSalaryHike 1.070351 1 1.034578
## RelationshipSatisfaction 1.216061 3 1.033140
## StockOptionLevel 1.258684 3 1.039089
## TotalWorkingYears 3.603873 1 1.898387
## TrainingTimesLastYear 1.080614 1 1.039526
## WorkLifeBalance 1.177136 3 1.027554
## YearsSinceLastPromotion 1.706395 1 1.306290
## YearsWithCurrManager 1.770733 1 1.330689
Now that the VIFs look good, we perform the predictions with a 0.5 threshold.
# Predict
pred <- predict(model_bw, newdata=test[,1:32], "response")
ct <- table(attrition = test$Attrition,
predictions = as.integer(pred>0.5))
ct
## predictions
## attrition 0 1
## No 355 14
## Yes 30 41
accuracy_df <- data.frame(accuracy = sum(ct[1,1],ct[2,2])/nrow(test),
specificity = ct[1,1]/sum(ct[1,1],ct[1,2]),
sensitivity = ct[2,2]/sum(ct[2,1],ct[2,2]))
accuracy_df
## accuracy specificity sensitivity
## 1 0.9 0.9620596 0.5774648
Next, we plot the performance curve to pick a better threshold and compute the area under the curve (AUC). From the chart, a threshold of 0.2 looks suitable.
# Plot AUC
ROCRpred <- prediction(pred, test$Attrition)
ROCRperf <- performance(ROCRpred, "acc", "fnr")
plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
as.numeric(performance(ROCRpred,"auc")@y.values)
## [1] 0.8940799
# Use 0.2 as threshold
ct_0.2 <- table(attrition = test$Attrition,
predictions = as.integer(pred>0.2))
ct_0.2
## predictions
## attrition 0 1
## No 299 70
## Yes 9 62
accuracy_df_0.2 <- data.frame(accuracy = sum(ct_0.2[1,1],ct_0.2[2,2])/nrow(test),
specificity = ct_0.2[1,1]/sum(ct_0.2[1,1],ct_0.2[1,2]),
sensitivity = ct_0.2[2,2]/sum(ct_0.2[2,1],ct_0.2[2,2]))
accuracy_df_0.2
## accuracy specificity sensitivity
## 1 0.8204545 0.8102981 0.8732394
Next, we fit the same logistic regression as Model #1, but add the sentiment scores for the Pros as a predictor. The main aim is to see whether the prediction improves. We check the VIFs again to ensure that there is no multicollinearity.
# Logistic regression
model_senti <- glm(Attrition~Age+BusinessTravel+DailyRate+Department+
DistanceFromHome+EnvironmentSatisfaction+Gender+
JobInvolvement+JobSatisfaction+MonthlyIncome+
NumCompaniesWorked+OverTime+PercentSalaryHike+
RelationshipSatisfaction+StockOptionLevel+
TotalWorkingYears+TrainingTimesLastYear+
WorkLifeBalance+YearsSinceLastPromotion+
YearsWithCurrManager+ave_sentiment_pros,
family="binomial", data=train)
# Check the VIF
vif(model_senti)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.794026 1 1.339412
## BusinessTravel 1.179690 2 1.042178
## DailyRate 1.058433 1 1.028802
## Department 1.839307 2 1.164564
## DistanceFromHome 1.108061 1 1.052645
## EnvironmentSatisfaction 1.193185 3 1.029875
## Gender 1.078229 1 1.038378
## JobInvolvement 1.191892 3 1.029689
## JobSatisfaction 1.229669 3 1.035058
## MonthlyIncome 2.103336 1 1.450288
## NumCompaniesWorked 1.315024 1 1.146745
## OverTime 1.257477 1 1.121373
## PercentSalaryHike 1.073632 1 1.036162
## RelationshipSatisfaction 1.238651 3 1.036314
## StockOptionLevel 1.263413 3 1.039739
## TotalWorkingYears 3.634832 1 1.906524
## TrainingTimesLastYear 1.084505 1 1.041396
## WorkLifeBalance 1.196203 3 1.030309
## YearsSinceLastPromotion 1.745552 1 1.321193
## YearsWithCurrManager 1.794884 1 1.339733
## ave_sentiment_pros 1.803946 1 1.343111
We now predict the results with a threshold of 0.5.
# Predict
pred_senti <- predict(model_senti, newdata=test, "response")
ct_senti <- table(attrition = test$Attrition,
predictions = as.integer(pred_senti>0.5))
ct_senti
## predictions
## attrition 0 1
## No 354 15
## Yes 30 41
accuracy_senti_df <- data.frame(accuracy = sum(ct_senti[1,1],ct_senti[2,2])/nrow(test),
specificity = ct_senti[1,1]/sum(ct_senti[1,1],ct_senti[1,2]),
sensitivity = ct_senti[2,2]/sum(ct_senti[2,1],ct_senti[2,2]))
accuracy_senti_df
## accuracy specificity sensitivity
## 1 0.8977273 0.9593496 0.5774648
We plot the performance curve to determine the optimal threshold value and compute the AUC. Based on the plot, the optimal threshold is 0.24.
# Plot AUC
ROCRpred_senti <- prediction(pred_senti, test$Attrition)
ROCRperf_senti <- performance(ROCRpred_senti, "acc", "fnr")
plot(ROCRperf_senti, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
as.numeric(performance(ROCRpred_senti,"auc")@y.values)
## [1] 0.8982404
# Use 0.24 as threshold
ct_senti_0.24 <- table(attrition = test$Attrition,
predictions = as.integer(pred_senti>0.24))
ct_senti_0.24
## predictions
## attrition 0 1
## No 316 53
## Yes 9 62
accuracy_senti_0.24 <- data.frame(
accuracy = sum(ct_senti_0.24[1,1],ct_senti_0.24[2,2])/nrow(test),
specificity = ct_senti_0.24[1,1]/sum(ct_senti_0.24[1,1],ct_senti_0.24[1,2]),
sensitivity = ct_senti_0.24[2,2]/sum(ct_senti_0.24[2,1],ct_senti_0.24[2,2]))
accuracy_senti_0.24
## accuracy specificity sensitivity
## 1 0.8590909 0.8563686 0.8732394
Next, we fit the same logistic regression as Model #1, but add the sentiment scores for the Cons as a predictor. The main aim is to see whether the prediction improves. We check the VIFs again to ensure that there is no multicollinearity.
# Logistic regression
model_senti2 <- glm(Attrition~Age+BusinessTravel+DailyRate+Department+
DistanceFromHome+EnvironmentSatisfaction+Gender+
JobInvolvement+JobSatisfaction+MonthlyIncome+
NumCompaniesWorked+OverTime+PercentSalaryHike+
RelationshipSatisfaction+StockOptionLevel+
TotalWorkingYears+TrainingTimesLastYear+
WorkLifeBalance+YearsSinceLastPromotion+
YearsWithCurrManager+ave_sentiment_cons,
family="binomial", data=train)
# Check the VIF
vif(model_senti2)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.788089 1 1.337194
## BusinessTravel 1.175845 2 1.041328
## DailyRate 1.060567 1 1.029838
## Department 1.414878 2 1.090636
## DistanceFromHome 1.108395 1 1.052804
## EnvironmentSatisfaction 1.192065 3 1.029714
## Gender 1.079315 1 1.038901
## JobInvolvement 1.195493 3 1.030207
## JobSatisfaction 1.227236 3 1.034716
## MonthlyIncome 2.086244 1 1.444384
## NumCompaniesWorked 1.316093 1 1.147211
## OverTime 1.267715 1 1.125929
## PercentSalaryHike 1.074498 1 1.036580
## RelationshipSatisfaction 1.240934 3 1.036632
## StockOptionLevel 1.261519 3 1.039479
## TotalWorkingYears 3.654708 1 1.911729
## TrainingTimesLastYear 1.086453 1 1.042330
## WorkLifeBalance 1.191752 3 1.029669
## YearsSinceLastPromotion 1.736311 1 1.317691
## YearsWithCurrManager 1.790538 1 1.338110
## ave_sentiment_cons 1.325863 1 1.151461
We now predict the results with a threshold of 0.5.
# Predict
pred_senti2 <- predict(model_senti2, newdata=test, "response")
ct_senti2 <- table(attrition = test$Attrition,
predictions = as.integer(pred_senti2>0.5))
ct_senti2
## predictions
## attrition 0 1
## No 355 14
## Yes 31 40
accuracy_senti2_df <- data.frame(accuracy = sum(ct_senti2[1,1],ct_senti2[2,2])/nrow(test),
specificity = ct_senti2[1,1]/sum(ct_senti2[1,1],ct_senti2[1,2]),
sensitivity = ct_senti2[2,2]/sum(ct_senti2[2,1],ct_senti2[2,2]))
accuracy_senti2_df
## accuracy specificity sensitivity
## 1 0.8977273 0.9620596 0.5633803
We plot the performance curve to determine the optimal threshold value and compute the AUC. Based on the plot, the optimal threshold is 0.245.
# Plot AUC
ROCRpred_senti2 <- prediction(pred_senti2, test$Attrition)
ROCRperf_senti2 <- performance(ROCRpred_senti2, "acc", "fnr")
plot(ROCRperf_senti2, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
as.numeric(performance(ROCRpred_senti2,"auc")@y.values)
## [1] 0.9003015
# Use 0.245 as threshold
ct_senti2_0.245 <- table(attrition = test$Attrition,
predictions = as.integer(pred_senti2>0.245))
ct_senti2_0.245
## predictions
## attrition 0 1
## No 317 52
## Yes 9 62
accuracy_senti2_0.245 <- data.frame(
accuracy = sum(ct_senti2_0.245[1,1],ct_senti2_0.245[2,2])/nrow(test),
specificity = ct_senti2_0.245[1,1]/sum(ct_senti2_0.245[1,1],ct_senti2_0.245[1,2]),
sensitivity = ct_senti2_0.245[2,2]/sum(ct_senti2_0.245[2,1],ct_senti2_0.245[2,2]))
accuracy_senti2_0.245
## accuracy specificity sensitivity
## 1 0.8613636 0.8590786 0.8732394
Next, before any further modeling, we perform clustering on the IBM dataset. We use the correlation funnel to identify the variables most highly correlated with Attrition and use them for the clustering analysis.
# Binarize continuous variables and one hot encode factors
corr_tbl <- dat_hr %>%
select(-EmployeeNumber) %>%
binarize(n_bins=5, thresh_infreq=0.01, name_infreq="OTHER", one_hot=T) %>%
correlate(Attrition__Yes)
# Plot correlation funnel
corr_tbl %>%
plot_correlation_funnel() %>%
ggplotly()
As can be seen from the correlation funnel, the following variables are highly correlated with the outcome variable: OverTime, JobLevel, MonthlyIncome, YearsAtCompany, StockOptionLevel, YearsWithCurrManager, TotalWorkingYears, MaritalStatus, Age, YearsInCurrentRole, JobRole, EnvironmentSatisfaction, JobInvolvement and BusinessTravel. We therefore use them for the clustering analysis, and we add EmployeeNumber to the variable selection as a means to index the data.
# Select top correlated variables for analysis
var_selection <- c("EmployeeNumber", "Attrition", "OverTime", "JobLevel",
"MonthlyIncome", "YearsAtCompany", "StockOptionLevel",
"YearsWithCurrManager", "TotalWorkingYears", "MaritalStatus",
"Age", "YearsInCurrentRole", "JobRole", "EnvironmentSatisfaction",
"JobInvolvement", "BusinessTravel")
dat_hr_subset <- dat_hr %>%
select(one_of(var_selection)) %>%
mutate_if(is.character, as.factor)
Finally, we perform the clustering analysis to determine how similar the employees are to one another. If our variables were all numeric (continuous), we could use the Euclidean distance. However, our dataset consists of both continuous and ordinal types, which calls for a different technique. The Gower distance handles both data types, so we use it for the analysis. Here we compute a distance matrix that compares each employee to every other employee in the data.
# Use Gower distance to handle both numeric and factor types
gower_dist <- daisy(dat_hr_subset[,2:13], metric="gower")
gower_mat <- as.matrix(gower_dist)
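As a toy illustration of how the Gower distance blends mixed types (a sketch with made-up data; `daisy()` range-normalizes numeric columns and scores factor mismatches as 0/1 before averaging across variables):

```r
library(cluster)  # provides daisy()

# Hypothetical mini data frame mixing a numeric and a factor column.
toy <- data.frame(
  MonthlyIncome = c(2000, 4000, 10000),           # numeric: range-normalized to [0, 1]
  OverTime      = factor(c("Yes", "Yes", "No"))   # factor: 0 if levels match, 1 otherwise
)
# Pairwise Gower distances average the per-variable dissimilarities,
# so both variable types contribute on a comparable scale.
daisy(toy, metric = "gower")
```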
As a sanity check, we view the most similar pair of employees by distance score to ensure that the Gower distance behaves sensibly. Based on the results, Employee #1272 and Employee #371 are most similar to each other because they match on most of the variables: similar monthly incomes, the same number of years at IBM, and so on.
# View most similar employees to test Gower distance
similar_tbl <- dat_hr_subset[which(gower_mat==min(gower_mat[gower_mat!=min(gower_mat)]),
arr.ind=T)[1,],]
similar_tbl
## EmployeeNumber Attrition OverTime JobLevel MonthlyIncome YearsAtCompany
## 1272 1780 Yes No 1 2679 1
## 371 494 Yes No 1 2716 1
## StockOptionLevel YearsWithCurrManager TotalWorkingYears MaritalStatus Age
## 1272 0 0 1 Single 21
## 371 0 0 1 Single 21
## YearsInCurrentRole JobRole EnvironmentSatisfaction
## 1272 0 Sales Representative 2
## 371 0 Sales Representative 3
## JobInvolvement BusinessTravel
## 1272 3 Travel_Rarely
## 371 4 Travel_Rarely
To determine the optimal number of clusters to segment the data, we use a silhouette plot. As shown below, we select the 7-cluster solution because the average silhouette width peaks at k = 7.
# Choose number of clusters using silhouette plot
silhouette_width <- sapply(2:10, FUN=function(x){
pam(x=gower_dist, k=x)$silinfo$avg.width
})
ggplot(data=data.frame(cluster = 2:10,silhouette_width),aes(x=cluster,y=silhouette_width))+
geom_line(col='steelblue',size=1.2)+
geom_point()+
scale_x_continuous(breaks=seq(2,10,1)) # choose 7 clusters soln
For the analysis, we use the Partitioning Around Medoids (PAM) method because it is less sensitive to outliers in the data (e.g., very high incomes) and it provides an exemplar for each cluster (the medoid), which helps us personify the type of employee who is likely to leave IBM.
Now we run the cluster analysis with 7 clusters and join the cluster membership with the IBM dataset.
# Perform clustering using PAM
k <- 7
pam_fit <- pam(gower_dist, diss=T, k)
# Add cluster assignments to each employee
dat_hr_clusters <- dat_hr_subset %>%
mutate(cluster=pam_fit$clustering)
We now look at each medoid to understand the demography of its cluster: each row shows the exemplary employee representing one cluster. Employee #991 best represents those who left IBM, as it is the only medoid with Attrition=Yes.
# Look at medoids to understand each cluster
dat_hr_clusters[pam_fit$medoids,]
## EmployeeNumber Attrition OverTime JobLevel MonthlyIncome YearsAtCompany
## 463 621 No No 2 5337 10
## 1381 1945 No No 2 5561 5
## 710 991 Yes Yes 1 2321 3
## 69 88 No No 1 2194 3
## 1002 1411 No No 1 3629 3
## 118 154 No No 3 9738 9
## 700 976 No No 4 17099 9
## StockOptionLevel YearsWithCurrManager TotalWorkingYears MaritalStatus Age
## 463 0 7 10 Single 34
## 1381 1 4 6 Married 35
## 710 0 2 4 Single 31
## 69 1 2 5 Married 35
## 1002 0 2 8 Single 37
## 118 1 8 10 Married 36
## 700 1 8 26 Married 52
## YearsInCurrentRole JobRole EnvironmentSatisfaction
## 463 7 Sales Executive 4
## 1381 3 Sales Executive 2
## 710 2 Research Scientist 3
## 69 2 Research Scientist 2
## 1002 2 Laboratory Technician 1
## 118 7 Sales Executive 2
## 700 8 Manager 4
## JobInvolvement BusinessTravel cluster
## 463 4 Travel_Rarely 1
## 1381 3 Travel_Rarely 2
## 710 2 Non-Travel 3
## 69 3 Travel_Frequently 4
## 1002 3 Travel_Rarely 5
## 118 3 Travel_Frequently 6
## 700 3 Travel_Rarely 7
To better understand attrition in the population, we compute the attrition rate within each cluster and each cluster's share of overall attrition. As seen below, approximately 88% of employees in Cluster #3 left IBM, accounting for about 52% of all attrition in the entire population.
# Group by attrition rate
attrition_rate_tbl <- dat_hr_clusters %>%
select(cluster, Attrition) %>%
mutate(attrition_num = fct_relevel(Attrition, "No", "Yes") %>% as.numeric()-1) %>%
group_by(cluster) %>%
summarise(
Cluster_Turnover_Rate = (sum(attrition_num)/length(attrition_num))*100,
Turnover_Count = sum(attrition_num),
Cluster_Size = length(attrition_num)
) %>%
ungroup() %>%
mutate(Population_Turnover_Rate = (Turnover_Count/sum(Turnover_Count))*100)
attrition_rate_tbl
## # A tibble: 7 x 5
## cluster Cluster_Turnover_Ra~ Turnover_Count Cluster_Size Population_Turnover_~
## <int> <dbl> <dbl> <int> <dbl>
## 1 1 10.9 27 247 11.4
## 2 2 6.37 20 314 8.44
## 3 3 87.9 123 140 51.9
## 4 4 7.39 19 257 8.02
## 5 5 11.3 24 213 10.1
## 6 6 12.6 19 151 8.02
## 7 7 3.38 5 148 2.11
We are now ready to use the cluster memberships as an input to the prediction model. Here we append the cluster assignments to the IBM data, split the data and fit a logistic regression with the same inputs as Model #1 plus the cluster assignments.
# Merge clusters with IBM HR data
dat_hr_full <- dat_hr_final %>%
mutate(cluster=pam_fit$clustering)
# Split data
set.seed(1000)
split2 <- createDataPartition(dat_hr_full$Attrition, p=0.7, list=F, groups=100)
train2 <- dat_hr_full[split2,]
test2 <- dat_hr_full[-split2,]
# Logistic regression
model_clust <- glm(Attrition~Age+BusinessTravel+DailyRate+Department+
DistanceFromHome+EnvironmentSatisfaction+Gender+
JobInvolvement+JobSatisfaction+MonthlyIncome+
NumCompaniesWorked+OverTime+PercentSalaryHike+
RelationshipSatisfaction+StockOptionLevel+
TotalWorkingYears+TrainingTimesLastYear+
WorkLifeBalance+YearsSinceLastPromotion+
YearsWithCurrManager+cluster,
family="binomial", data=train2)
We now predict the results with a threshold of 0.5.
# Predict
pred_clust <- predict(model_clust, newdata=test2, "response")
ct_clust <- table(attrition = test2$Attrition,
predictions = as.integer(pred_clust>0.5))
ct_clust
## predictions
## attrition 0 1
## No 356 13
## Yes 33 38
accuracy_clust_df <- data.frame(accuracy = sum(ct_clust[1,1],ct_clust[2,2])/nrow(test2),
specificity = ct_clust[1,1]/sum(ct_clust[1,1],ct_clust[1,2]),
sensitivity = ct_clust[2,2]/sum(ct_clust[2,1],ct_clust[2,2]))
accuracy_clust_df
## accuracy specificity sensitivity
## 1 0.8954545 0.9647696 0.5352113
We plot the performance curve to determine the optimal threshold value and compute the AUC. Based on the plot, the optimal threshold is 0.22.
# Plot AUC
ROCRpred_clust <- prediction(pred_clust, test2$Attrition)
ROCRperf_clust <- performance(ROCRpred_clust, "acc", "fnr")
plot(ROCRperf_clust, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
as.numeric(performance(ROCRpred_clust,"auc")@y.values)
## [1] 0.8963319
# Use 0.22 as threshold
ct_clust_0.22 <- table(attrition = test2$Attrition,
predictions = as.integer(pred_clust>0.22))
ct_clust_0.22
## predictions
## attrition 0 1
## No 307 62
## Yes 9 62
accuracy_clust_0.22 <- data.frame(
accuracy = sum(ct_clust_0.22[1,1],ct_clust_0.22[2,2])/nrow(test2),
specificity = ct_clust_0.22[1,1]/sum(ct_clust_0.22[1,1],ct_clust_0.22[1,2]),
sensitivity = ct_clust_0.22[2,2]/sum(ct_clust_0.22[2,1],ct_clust_0.22[2,2]))
accuracy_clust_0.22
## accuracy specificity sensitivity
## 1 0.8386364 0.8319783 0.8732394
Having modeled several logistic regressions with sentiment scores and cluster memberships as separate inputs, it is worth fitting a model with both together to see whether it improves the prediction. We use the sentiment scores for Cons (instead of Pros) in this model since they gave a better AUC (compare Model #2 and Model #3).
# Logistic regression
model_clust_senti <- glm(Attrition~Age+BusinessTravel+DailyRate+Department+
DistanceFromHome+EnvironmentSatisfaction+Gender+
JobInvolvement+JobSatisfaction+MonthlyIncome+
NumCompaniesWorked+OverTime+PercentSalaryHike+
RelationshipSatisfaction+StockOptionLevel+
TotalWorkingYears+TrainingTimesLastYear+
WorkLifeBalance+YearsSinceLastPromotion+
YearsWithCurrManager+cluster+ave_sentiment_cons,
family="binomial", data=train2)
We now predict the results with a threshold of 0.5.
# Predict
pred_clust_senti <- predict(model_clust_senti, newdata=test2, type="response")
ct_clust_senti <- table(attrition = test2$Attrition,
predictions = as.integer(pred_clust_senti>0.5))
ct_clust_senti
## predictions
## attrition 0 1
## No 355 14
## Yes 31 40
accuracy_clust_senti_df <- data.frame(
accuracy = sum(ct_clust_senti[1,1],ct_clust_senti[2,2])/nrow(test2),
specificity = ct_clust_senti[1,1]/sum(ct_clust_senti[1,1],ct_clust_senti[1,2]),
sensitivity = ct_clust_senti[2,2]/sum(ct_clust_senti[2,1],ct_clust_senti[2,2]))
accuracy_clust_senti_df
## accuracy specificity sensitivity
## 1 0.8977273 0.9620596 0.5633803
We plot the model's accuracy against the false negative rate across cutoff values and compute the AUC. Based on the plot, the optimal threshold is 0.26.
# Plot AUC
ROCRpred_clust_senti <- prediction(pred_clust_senti, test2$Attrition)
ROCRperf_clust_senti <- performance(ROCRpred_clust_senti, "acc", "fnr")
plot(ROCRperf_clust_senti, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
as.numeric(performance(ROCRpred_clust_senti,"auc")@y.values)
## [1] 0.9015611
# Use 0.26 as threshold
ct_clust_senti_0.26 <- table(attrition = test2$Attrition,
predictions = as.integer(pred_clust_senti>0.26))
ct_clust_senti_0.26
## predictions
## attrition 0 1
## No 321 48
## Yes 9 62
accuracy_clust_senti_0.26 <- data.frame(
accuracy = sum(ct_clust_senti_0.26[1,1],ct_clust_senti_0.26[2,2])/nrow(test2),
specificity = ct_clust_senti_0.26[1,1]/sum(ct_clust_senti_0.26[1,1],ct_clust_senti_0.26[1,2]),
sensitivity = ct_clust_senti_0.26[2,2]/sum(ct_clust_senti_0.26[2,1],ct_clust_senti_0.26[2,2]))
accuracy_clust_senti_0.26
## accuracy specificity sensitivity
## 1 0.8704545 0.8699187 0.8732394
So far we have only tried logistic regression for the prediction model. It would be worthwhile to try other machine learning models to see if they improve on the AUC. In this step, we fit a decision tree using 10-fold cross-validation (CV) and compare the results with the baseline in Model #1, where we used the default variables from the dat_hr dataset (we will not include the sentiment scores or cluster memberships in this model). This comparison should indicate which modeling technique is better.
# Model using CV
trControl = trainControl(method="cv",number=10) #10-fold cross validation
tuneGrid = expand.grid(.cp=seq(0,0.1,0.001))
set.seed(777)
trainCV = train(Attrition~.,
data=train,
method="rpart",
trControl=trControl,
tuneGrid=tuneGrid)
trainCV$bestTune
## cp
## 21 0.02
We shall use a cp value of 0.02.
tree_CV <- rpart(Attrition~.,
data=train,
method="class",
control=rpart.control(cp=trainCV$bestTune$cp))
# Predict
pred_CV = predict(tree_CV, newdata=test, type='class')
ct_tree = table(attrition = test$Attrition,
predictions = pred_CV)
ct_tree
## predictions
## attrition No Yes
## No 357 12
## Yes 54 17
accuracy_tree_df <- data.frame(
accuracy = sum(ct_tree[1,1],ct_tree[2,2])/nrow(test),
specificity = ct_tree[1,1]/sum(ct_tree[1,1],ct_tree[1,2]),
sensitivity = ct_tree[2,2]/sum(ct_tree[2,1],ct_tree[2,2]))
accuracy_tree_df
## accuracy specificity sensitivity
## 1 0.85 0.9674797 0.2394366
# Plot AUC
pred_tree_auc <- predict(tree_CV, newdata=test, type="prob")
ROCRpred_tree <- prediction(pred_tree_auc[,2], test$Attrition)
ROCRperf_tree <- performance(ROCRpred_tree, "acc", "fnr")
plot(ROCRperf_tree, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
as.numeric(performance(ROCRpred_tree,"auc")@y.values)
## [1] 0.6325814
This is a summary of all the prediction models. Model 5, which incorporated both sentiment scores (cons) and cluster memberships as predictors, had the highest AUC, accuracy and specificity, while matching the other logistic models on sensitivity. Our main aim is to maximize sensitivity, because our business context calls for a low false negative (FN) rate, and Model 5 performs best across all the parameters we care about. We shall therefore use that model for interpreting effects and for the discussion.
Although the decision tree model had the best specificity, it was clearly outperformed on sensitivity and AUC, the metrics that matter most in our business context. Hence we will not use the tree model.
## model description auc accuracy specificity sensitivity
## 1 1 logmod with bw select 0.8940799 0.8204545 0.8102981 0.8732394
## 2 2 logmod with senti (pros) 0.8982404 0.8590909 0.8563686 0.8732394
## 3 3 logmod with senti (cons) 0.9003015 0.8613636 0.8590786 0.8732394
## 4 4 logmod with clust 0.8963319 0.8386364 0.8319783 0.8732394
## 5 5 logmod with senti & clust 0.9015611 0.8704545 0.8699187 0.8732394
## 6 6 trees with 10-fold cv 0.6325814 0.8500000 0.9674797 0.2394366
In this section, we shall discuss the analysis findings in relation to the three (3) research questions outlined in this project.
- What are the key driving factors influencing attrition the most at IBM?
We shall use our best model, the logistic regression with sentiment scores and cluster memberships, to understand the key factors influencing attrition at IBM. Knowing these factors allows us to create specific recommendations and testable hypotheses to improve retention rates (e.g., that a 5% increase in monthly salary would reduce attrition by 10%). A further benefit of logistic regression is that it is easy to interpret when communicating with stakeholders.
As shown in the model summary below, working overtime is associated with the largest increase in the odds of attrition relative to the other variables (see the coefficient of OverTimeYes).
summary(model_clust_senti)
##
## Call:
## glm(formula = Attrition ~ Age + BusinessTravel + DailyRate +
## Department + DistanceFromHome + EnvironmentSatisfaction +
## Gender + JobInvolvement + JobSatisfaction + MonthlyIncome +
## NumCompaniesWorked + OverTime + PercentSalaryHike + RelationshipSatisfaction +
## StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
## WorkLifeBalance + YearsSinceLastPromotion + YearsWithCurrManager +
## cluster + ave_sentiment_cons, family = "binomial", data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6336 -0.5224 -0.2700 -0.1051 3.3038
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.922e+00 1.225e+00 4.019 5.85e-05 ***
## Age -4.541e-02 1.573e-02 -2.887 0.003893 **
## BusinessTravelTravel_Frequently 1.648e+00 4.483e-01 3.675 0.000238 ***
## BusinessTravelTravel_Rarely 7.686e-01 4.118e-01 1.866 0.061985 .
## DailyRate -3.994e-04 2.571e-04 -1.553 0.120321
## DepartmentResearch & Development -1.080e+00 5.490e-01 -1.968 0.049069 *
## DepartmentSales -4.019e-01 5.588e-01 -0.719 0.471974
## DistanceFromHome 3.649e-02 1.285e-02 2.839 0.004530 **
## EnvironmentSatisfaction2 -1.041e+00 3.239e-01 -3.213 0.001315 **
## EnvironmentSatisfaction3 -1.012e+00 2.943e-01 -3.437 0.000589 ***
## EnvironmentSatisfaction4 -1.120e+00 2.972e-01 -3.768 0.000164 ***
## GenderMale 4.865e-01 2.197e-01 2.215 0.026770 *
## JobInvolvement2 -1.113e+00 4.278e-01 -2.601 0.009298 **
## JobInvolvement3 -1.383e+00 4.011e-01 -3.448 0.000564 ***
## JobInvolvement4 -1.938e+00 5.501e-01 -3.523 0.000426 ***
## JobSatisfaction2 -6.460e-01 3.254e-01 -1.985 0.047142 *
## JobSatisfaction3 -6.544e-01 2.795e-01 -2.341 0.019220 *
## JobSatisfaction4 -1.250e+00 3.008e-01 -4.156 3.23e-05 ***
## MonthlyIncome -8.597e-05 4.379e-05 -1.963 0.049621 *
## NumCompaniesWorked 1.677e-01 4.419e-02 3.795 0.000148 ***
## OverTimeYes 1.881e+00 2.334e-01 8.058 7.74e-16 ***
## PercentSalaryHike -1.076e-02 2.998e-02 -0.359 0.719691
## RelationshipSatisfaction2 -9.957e-01 3.282e-01 -3.034 0.002414 **
## RelationshipSatisfaction3 -9.384e-01 2.903e-01 -3.233 0.001226 **
## RelationshipSatisfaction4 -1.186e+00 3.019e-01 -3.927 8.59e-05 ***
## StockOptionLevel1 -1.172e+00 2.368e-01 -4.949 7.45e-07 ***
## StockOptionLevel2 -1.609e+00 4.692e-01 -3.430 0.000603 ***
## StockOptionLevel3 -9.239e-01 4.676e-01 -1.976 0.048156 *
## TotalWorkingYears -5.515e-02 3.212e-02 -1.717 0.085977 .
## TrainingTimesLastYear -2.132e-01 8.381e-02 -2.543 0.010978 *
## WorkLifeBalance2 -7.111e-01 4.124e-01 -1.724 0.084631 .
## WorkLifeBalance3 -1.245e+00 3.903e-01 -3.190 0.001423 **
## WorkLifeBalance4 -7.878e-01 4.734e-01 -1.664 0.096045 .
## YearsSinceLastPromotion 1.974e-01 4.491e-02 4.395 1.11e-05 ***
## YearsWithCurrManager -1.217e-01 4.432e-02 -2.747 0.006020 **
## cluster 5.927e-02 7.038e-02 0.842 0.399688
## ave_sentiment_cons -8.335e+00 4.096e+00 -2.035 0.041891 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 909.69 on 1029 degrees of freedom
## Residual deviance: 627.80 on 993 degrees of freedom
## AIC: 701.8
##
## Number of Fisher Scoring iterations: 6
Specifically, the odds of attrition are 6.557 times higher when an employee works overtime compared to not working overtime (holding other factors constant). Phrased differently, working overtime is associated with a 555.7% increase in the odds of attrition.
# How many times is the likelihood of attrition when an employee works overtime?
exp(summary(model_clust_senti)$coef[21])
## [1] 6.557427
# How much more is the likelihood of attrition when an employee works overtime?
100*(exp(summary(model_clust_senti)$coef[21])-1)
## [1] 555.7427
Working overtime is just one of the factors affecting attrition; we want to explore the others as well. We use the varImp() function to see how the model ranks the importance of each variable. Unsurprisingly, the model ranks OverTimeYes highest, because it has the greatest statistical significance relative to the other variables.
var_imp <- varImp(model_clust_senti) %>%
arrange(desc(Overall))
var_imp %>%
ggplot(aes(x=reorder(rownames(var_imp),Overall), y=Overall)) +
geom_point(color="blue", size=4, alpha=0.6) +
geom_segment(aes(x=rownames(var_imp), xend=rownames(var_imp),
y=0, yend=Overall), color="skyblue") +
xlab("Variable") +
ylab("Overall Importance") +
theme_classic() +
coord_flip()
From the plot, we see that StockOptionLevel1 is the second most important variable in the model, as determined by its low p-value (statistically significant). Its coefficient of -1.172 implies that employees at stock option level 1 have 0.31 times the odds of leaving IBM (-69%) compared to those with no stock options at all.
Intuitively, YearsSinceLastPromotion also holds a high spot on the variable importance chart. Its coefficient of 0.197 implies 1.218 times higher odds of leaving IBM for each additional year without a promotion (holding other factors constant). This makes sense: the longer an employee remains in the same salary grade, the less recognized he/she may feel by the organization, raising the chance of leaving.
JobSatisfaction4 (a job satisfaction score of 4) also holds high importance in the model, with a coefficient of -1.25, implying 0.286 times the odds of leaving IBM compared to a satisfaction score of 1 (holding other factors constant). (Note: satisfaction scores range from 1 to 4, with 4 indicating the most satisfied.) This finding agrees with Ghapanchi & Aurum (2011) and Zhang & Feinzig (2016): high job satisfaction leads to better retention. The same can be said of RelationshipSatisfaction4.
It is worth noting that NumCompaniesWorked also holds a high spot on the variable importance chart. Its coefficient of 0.168 implies 1.183 times higher odds of leaving IBM for each additional company the employee has worked for in the past (holding other factors constant). This could reflect a "job-hopping" tendency among people less able to commit to a particular job. While that is a plausible hypothesis, it may not hold for older employees: although they tend to have more past jobs than younger workers just entering the workforce, which should imply a greater likelihood of leaving, older workers in fact had lower odds of leaving IBM (see the Age coefficient in the model), a pattern also corroborated by Zhang & Feinzig (2016). There is therefore room for further research on potential interaction terms between Age and NumCompaniesWorked for a better interpretation of these effects.
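The suggested follow-up is straightforward to prototype: in R, an interaction term is added to a glm formula with `*` (or `:`). This is a sketch on simulated data, since the real test would simply add the term to model_clust_senti's formula; the data and effect sizes here are invented for illustration only.

```r
# Sketch: testing an Age x NumCompaniesWorked interaction in a logistic model.
# Simulated data with a built-in negative interaction effect.
set.seed(777)
n <- 1000
Age <- sample(20:60, n, replace = TRUE)
NumCompaniesWorked <- rpois(n, 3)
logit <- 1 + 0.15 * NumCompaniesWorked - 0.04 * Age -
  0.01 * Age * NumCompaniesWorked
Attrition <- rbinom(n, 1, plogis(logit))

fit <- glm(Attrition ~ Age * NumCompaniesWorked, family = "binomial")
# Estimate, std. error, z value and p-value for the interaction term
summary(fit)$coefficients["Age:NumCompaniesWorked", ]
```

If the interaction's p-value is small in the real data, the effect of past job count on attrition genuinely differs by age, which would reconcile the two observations above.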
Lastly, we see that PercentSalaryHike was ranked least important/significant in the model. This is an interesting finding: reports by The Dice (2020) and Deloitte (2019) suggest that the majority of employees are looking for higher salaries, yet this was not reflected in the attrition model. One possible reason is that salary is a crucial entry requirement into a company but not a crucial exit criterion. This is also an area for further research.
- Who is likely to leave IBM?
The best prediction model achieved a sensitivity of 87.3% (recall that we want to reduce the FN rate in our business context), an accuracy of 87.0% and an AUC of 0.90 on the test set, as shown below. This is by far the best model; it included both the sentiment scores (cons) and the cluster memberships.
## model description auc accuracy specificity sensitivity
## 1 5 logmod with senti & clust 0.9015611 0.8704545 0.8699187 0.8732394
For future work, it might be worthwhile to average the sentiment scores for Pros and Cons to see if that further improves prediction accuracy. Other methods such as random forests and boosting could be explored as well. Given the time and resource constraints of this project, and for ease of interpretation, we went with logistic regression.
- What is the employee type that has the highest tendency to leave IBM?
Based on the clustering work, the employees with the highest tendency to leave IBM are those similar to Employee #991 (Cluster #3). His/her persona is shown in the table below.
dat_hr_clusters[pam_fit$medoids,]
## EmployeeNumber Attrition OverTime JobLevel MonthlyIncome YearsAtCompany
## 463 621 No No 2 5337 10
## 1381 1945 No No 2 5561 5
## 710 991 Yes Yes 1 2321 3
## 69 88 No No 1 2194 3
## 1002 1411 No No 1 3629 3
## 118 154 No No 3 9738 9
## 700 976 No No 4 17099 9
## StockOptionLevel YearsWithCurrManager TotalWorkingYears MaritalStatus Age
## 463 0 7 10 Single 34
## 1381 1 4 6 Married 35
## 710 0 2 4 Single 31
## 69 1 2 5 Married 35
## 1002 0 2 8 Single 37
## 118 1 8 10 Married 36
## 700 1 8 26 Married 52
## YearsInCurrentRole JobRole EnvironmentSatisfaction
## 463 7 Sales Executive 4
## 1381 3 Sales Executive 2
## 710 2 Research Scientist 3
## 69 2 Research Scientist 2
## 1002 2 Laboratory Technician 1
## 118 7 Sales Executive 2
## 700 8 Manager 4
## JobInvolvement BusinessTravel cluster
## 463 4 Travel_Rarely 1
## 1381 3 Travel_Rarely 2
## 710 2 Non-Travel 3
## 69 3 Travel_Frequently 4
## 1002 3 Travel_Rarely 5
## 118 3 Travel_Frequently 6
## 700 3 Travel_Rarely 7
Cluster #3 accounts for approximately 52% of all attrition at IBM, and about 88% of the employees in this cluster leave. More can clearly be done to improve retention here, and in the next section we propose recommendations to improve the retention rate of this cluster.
attrition_rate_tbl
## # A tibble: 7 x 5
## cluster Cluster_Turnover_Ra~ Turnover_Count Cluster_Size Population_Turnover_~
## <int> <dbl> <dbl> <int> <dbl>
## 1 1 10.9 27 247 11.4
## 2 2 6.37 20 314 8.44
## 3 3 87.9 123 140 51.9
## 4 4 7.39 19 257 8.02
## 5 5 11.3 24 213 10.1
## 6 6 12.6 19 151 8.02
## 7 7 3.38 5 148 2.11
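A per-cluster turnover table like attrition_rate_tbl can be assembled with base R aggregation. This is a sketch on toy data standing in for dat_hr_clusters; the column names mirror the table above, but the counts are simulated, not the report's.

```r
# Sketch: build a per-cluster turnover table (base R, toy data).
set.seed(1)
df <- data.frame(
  cluster   = sample(1:7, 1470, replace = TRUE),
  Attrition = sample(c("Yes", "No"), 1470, replace = TRUE, prob = c(0.16, 0.84))
)

tbl <- do.call(rbind, lapply(split(df, df$cluster), function(g) {
  data.frame(
    cluster                  = g$cluster[1],
    Cluster_Turnover_Rate    = 100 * mean(g$Attrition == "Yes"),
    Turnover_Count           = sum(g$Attrition == "Yes"),
    Cluster_Size             = nrow(g),
    # share of ALL leavers contributed by this cluster
    Population_Turnover_Rate = 100 * sum(g$Attrition == "Yes") /
                                 sum(df$Attrition == "Yes")
  )
}))
tbl
```

Note the distinction between the two rate columns: Cluster_Turnover_Rate is within-cluster (123/140 = 87.9% for Cluster #3), while Population_Turnover_Rate is that cluster's share of all leavers (123/237 ≈ 51.9%).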
We shall use the prediction model (Model #5) to make recommendations that are amenable to experimentation. Specifically, we will focus on variables that IBM can adjust (e.g., salary, training opportunities for employees). Variables outside IBM's control, such as gender, age and total working years, are inherent to the employees, so we will not make recommendations on them. Lastly, to keep things measurable and simple, we exclude "indirect" variables such as job satisfaction, environment satisfaction and relationship satisfaction, as they are harder to measure and are themselves aggregates of many other influencing factors.
Based on the model, rescinding the overtime culture at IBM has the potential to reduce the odds of attrition roughly six-fold (holding all other variables constant). For cluster group 3, this suggests that a turnover count of 123 out of 140 employees could fall to about 21 (a reduction in turnover rate from 87.9% to 14.6%).
However, this hypothesis holds only under the assumption that all other variables are constant and the only difference is whether the employee works overtime. To test it as a research experiment, it is imperative to keep the other variables as constant as possible. In a practical setting, this can be achieved by clustering existing IBM employees: employees within a cluster are, by construction, similar to one another on most variables, which controls for them as best we can. With the other variables broadly similar within each cluster, we can then vary the overtime variable (i.e., whether the employee needs to work overtime), creating control and treatment groups accordingly. Finally, we should run the experiment on the cluster most similar to cluster group 3 in order to yield the greatest impact.
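Before committing resources to such an experiment, a quick power calculation indicates how many employees per group would be needed. Base R's power.prop.test() can be used with the turnover rates hypothesized above; the 60% figure in the second call is our own, more conservative, assumed effect, not a number from the report.

```r
# Sample size per group to detect a drop in turnover from 87.9% to 14.6%
# with 80% power at the conventional 5% significance level.
res_large <- power.prop.test(p1 = 0.879, p2 = 0.146,
                             sig.level = 0.05, power = 0.80)

# A more conservative assumed effect (87.9% -> 60%) needs larger groups.
res_small <- power.prop.test(p1 = 0.879, p2 = 0.60,
                             sig.level = 0.05, power = 0.80)

c(n_large_effect = ceiling(res_large$n), n_small_effect = ceiling(res_small$n))
```

The hypothesized effect is so large that only a handful of employees per arm would suffice; the practical constraint is rather that cluster 3 has only 140 members to randomize.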
In addition to reviewing the overtime culture at IBM, another recommendation is to explore giving employees stock options, as this can lower the odds of turnover by about 69% (holding the other variables constant). This can be tested with a straightforward treatment/control experiment: one group of employees receives stock options while the other does not. As with the previous recommendation, the experiment can be run on employees in the cluster most similar to cluster group 3.
A limitation of our project is the imbalance in the distribution of attrition statuses in our data: there are far more employees who stayed with IBM than who left. Such an imbalance likely hurts the prediction accuracy of our models. To improve it, we may need to treat the imbalance in the training data with upsampling techniques such as the ovun.sample() function from the ROSE package. This can be explored in future work.
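The idea behind such upsampling can be sketched without any extra package: randomly duplicate minority-class rows until the classes are balanced, which is what ROSE's ovun.sample(method = "over") does in essence. The toy training frame below is invented for illustration; only the training set should ever be resampled, never the test set.

```r
# Minimal base-R random oversampling sketch (in the spirit of
# ROSE::ovun.sample(method = "over")): duplicate minority-class rows
# until the two classes are balanced.
set.seed(123)
train_toy <- data.frame(
  Attrition = factor(c(rep("No", 850), rep("Yes", 150))),
  x         = rnorm(1000)
)

minority <- train_toy[train_toy$Attrition == "Yes", ]
n_extra  <- sum(train_toy$Attrition == "No") - nrow(minority)
extra    <- minority[sample(nrow(minority), n_extra, replace = TRUE), ]

train_bal <- rbind(train_toy, extra)
table(train_bal$Attrition)
```

A model refit on train_bal would see equal numbers of leavers and stayers, which typically raises sensitivity at some cost to specificity; evaluation must still be done on the original, untouched test set.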
viGlobal. (2018). Tech Industry Battles Cost of High Attrition Rates. viGlobal. https://www.viglobal.com/2018/06/13/tech-industry-battles-highest-attrition-rate-in-the-world-and-its-costly/.
Kochanski, J., & Ledford, G. (2001). "How to keep me": Retaining technical professionals. Research Technology Management, 44(3), 31-38. Retrieved June 6, 2021, from http://www.jstor.org/stable/24133992.
Alsop, T. (2021). IBM number of employees worldwide from 2000 to 2020. Statista. https://www.statista.com/statistics/265007/number-of-employees-at-ibm-since-2000/
The Build Network. (2014). Try Fixing the Problem Before Replacing It. Inc. https://www.inc.com/the-build-network/turnover-costs.html
Ghapanchi, A. H., & Aurum, A. (2011). Antecedents to IT personnel's intentions to leave: A systematic literature review. Journal of Systems and Software, 84(2), 238-249. Retrieved February 21, 2021, from https://www.sciencedirect.com/science/article/abs/pii/S0164121210002645?via%3Dihub.
Zhang, H., & Feinzig, S. (2016). Should I stay or should I go? Global insights into employees’ decisions to leave their jobs [White paper]. IBM Smarter Workforce Institute. https://www.ibm.com/downloads/cas/08GZQKL1
Deloitte. (2019). The Deloitte Global Millennial Survey: Societal discord and technological transformation create a "generation disrupted". Retrieved February 21, 2021, from https://www2.deloitte.com/global/en/pages/about-deloitte/articles/millennialsurvey.html.
The Dice. (2020). The Dice 2020 Tech Salary Report. Retrieved February 20, 2021, from https://techhub.dice.com/Dice-2020-Tech-Salary-Report.html.
Hecker, D. E. (2005). Occupational employment projections to 2014. Retrieved 21 February 2021, from https://www.bls.gov/opub/mlr/2005/11/art5full.pdf.
Lewis, S. (2020). What Is A High Employee Attrition Rate In Tech Industry. DevSkiller - Powerful tool to test developers skills. Retrieved 21 February 2021, from https://devskiller.com/attrition-rate-in-tech/.
Description of dat_hr dataset:
- Age: Age of employee
- Attrition: Whether employee has left IBM (Yes/No)
- BusinessTravel: 3 levels: No Travel, Travel Frequently, Travel Rarely
- DailyRate: Daily salary rate
- Department: 3 levels: HR, Sales, R&D
- DistanceFromHome: Distance from work to home
- Education: 1=Below College, 2=College, 3=Bachelor, 4=Master, 5=Doctor
- EducationField: 6 levels: HR, Life Sciences, Marketing, Medical Sciences, Technical, Others
- EmployeeCount: Number of employees
- EmployeeNumber: Employee ID
- EnvironmentSatisfaction: 1=Low, 2=Medium, 3=High, 4=Very High
- Gender: Male/Female
- HourlyRate: Hourly salary
- JobInvolvement: 1=Low, 2=Medium, 3=High, 4=Very High
- JobLevel: Level of job
- JobRole: Healthcare Representative, HR, Lab Technician, Manager, Manufacturing Director, Research Director, Research Scientist, Sales Executive, Sales Representative
- JobSatisfaction: 1=Low, 2=Medium, 3=High, 4=Very High
- MaritalStatus: Single, Married, Divorced
- MonthlyIncome: Monthly salary
- MonthlyRate: Monthly salary rate
- NumCompaniesWorked: Number of companies employee has worked at
- Over18: Whether employee is above 18 (Yes/No)
- OverTime: Whether employee has worked overtime (Yes/No)
- PercentSalaryHike: Percentage increase in salary
- PerformanceRating: Performance rating of employee
- RelationshipSatisfaction: 1=Low, 2=Medium, 3=High, 4=Very High
- StandardHours: Number of hours worked
- StockOptionLevel: Stock option level
- TotalWorkingYears: Total years of working
- TrainingTimesLastYear: Hours spent on training
- WorkLifeBalance: 1=Bad, 2=Good, 3=Better, 4=Best
- YearsAtCompany: Total years of working at IBM
- YearsInCurrentRole: Total years in current role
- YearsSinceLastPromotion: Total years since last promotion
- YearsWithCurrManager: Total years with current manager

Description of dat_gd dataset:
- Ratings: Review ratings of 1 to 5
- Current_Former: Whether current or former employee of IBM
- Employment_Length: Length of employment at IBM
- Title: Title of Glassdoor review
- Review_ID: Review ID
- Review_Date: Date of review
- Role: Job role
- Location: State of employment in the US
- Pros: Text reviews of the pros of working at IBM
- Cons: Text reviews of the cons of working at IBM