This document summarizes a final project for a Big Data course at Stockholm University that aimed to provide insights for predicting employee resignations through human resources analytics. It discusses using HDFS for data storage because it can handle large, varied data, and using decision trees for analysis because they are simple, fast, and can be pruned to avoid overfitting. The results indicate that employees tend to resign when they perform well but feel underestimated or find better opportunities thanks to their skills, or when they are dissatisfied with the company. The discussion covers scaling to data from more companies, using dimensionality reduction for sparse data, and retraining the decision tree model frequently. Potential extensions include using multi-organization structured data and unstructured data sources.
1. Human Resources Analytics: providing useful insights for employee resignation prediction
Final project for the course Big Data with NoSQL
Stockholm University, DSV dept.
Academic year 2017/18
Presented by:
• Giacomo Bartoli
• Giorgos Ntymenos
2. Introduction
Why do people leave this company?
Now we have to hire and train new employees.
All the knowledge they gained will benefit other companies!
I would like to know beforehand.
3. Data
• Level of satisfaction
• Grade of last evaluation
• Number of projects
• Average monthly hours at work
• Number of years spent in the company
• Whether the employee had accidents at work
• Whether the employee was promoted in the last 5 years
• Department
• Level of salary
• Left (whether the employee left the company)
Vs of Big Data:
- Volume: data come from many different companies.
- Variety: data may have different formats, or even different attributes. The same attribute, for example the evaluation grade, may be computed differently from company to company, so preprocessing is required. Different data types are also possible.
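The preprocessing step for an attribute such as the evaluation grade could look roughly like the sketch below. It is only an illustration: the field names `company` and `last_evaluation`, the sample records, and the choice of per-company min-max rescaling are assumptions, not part of the project. Note that it rescales by the range observed in each company, not by the nominal grading scale.

```python
from collections import defaultdict

def rescale_evaluations(records):
    """Min-max rescale 'last_evaluation' to [0, 1] separately per company."""
    by_company = defaultdict(list)
    for rec in records:
        by_company[rec["company"]].append(rec["last_evaluation"])

    # Observed range of the evaluation score inside each company.
    ranges = {c: (min(v), max(v)) for c, v in by_company.items()}

    rescaled = []
    for rec in records:
        lo, hi = ranges[rec["company"]]
        # If every score in a company is identical, fall back to 0.0.
        score = 0.0 if hi == lo else (rec["last_evaluation"] - lo) / (hi - lo)
        rescaled.append({**rec, "last_evaluation": score})
    return rescaled

# Made-up records: company A grades on a 1-5 scale, company B on 0-100.
records = [
    {"company": "A", "last_evaluation": 1},
    {"company": "A", "last_evaluation": 3},
    {"company": "A", "last_evaluation": 5},
    {"company": "B", "last_evaluation": 20},
    {"company": "B", "last_evaluation": 100},
]
harmonized = rescale_evaluations(records)
```

After this step the grades from both hypothetical companies share the same [0, 1] range and can be stored and analysed together.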
4. Method
Storage
HDFS is our choice because:
Volume: it can handle massive amounts of data.
Variety: it can accept data in almost any format.
Analysis
Our aim is to predict whether an employee will leave, so we need to solve a classification task.
We chose decision trees because:
- they are simple
- they are fast to train
- they are easy to scale
- overfitting can be avoided using pre- and post-pruning
- they are a white-box model, so the learned rules can be inspected
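To make the idea of a pre-pruned decision tree concrete, here is a minimal pure-Python sketch. It is not the project's implementation: the Gini criterion, the function names, the two features (satisfaction, last evaluation) and the toy rows are all assumptions chosen for illustration. Pre-pruning is done through `max_depth` and `min_leaf`; post-pruning is not shown.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, min_leaf):
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            # Pre-pruning: refuse splits that create too-small leaves.
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, max_depth=3, min_leaf=1, depth=0):
    """Grow a tree; leaves are class labels, inner nodes are 4-tuples."""
    majority = Counter(labels).most_common(1)[0][0]
    # Pre-pruning: stop growing once max_depth is reached.
    split = None if depth >= max_depth else best_split(rows, labels, min_leaf)
    if split is None:
        return majority
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f, t,
        build_tree([r for r, _ in left], [y for _, y in left],
                   max_depth, min_leaf, depth + 1),
        build_tree([r for r, _ in right], [y for _, y in right],
                   max_depth, min_leaf, depth + 1),
    )

def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Made-up rows: (satisfaction, last_evaluation); label 1 means "left".
rows = [(0.10, 0.50), (0.15, 0.80), (0.40, 0.50), (0.60, 0.60),
        (0.70, 0.70), (0.65, 0.55), (0.85, 0.95), (0.80, 0.90)]
labels = [1, 1, 1, 0, 0, 0, 1, 1]

tree = build_tree(rows, labels, max_depth=3)
stump = build_tree(rows, labels, max_depth=0)  # fully pruned: majority class
```

Because the tree is a nested tuple of `(feature, threshold, left, right)`, its rules can be printed and read directly, which is the white-box property listed above.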
6. Results
The reasons why employees resign are:
• They perform really well and, although they are generally satisfied, they feel underestimated or they find better job opportunities thanks to their skills.
• They are not satisfied with the company at all.
• They are not satisfied, but are not very effective either.
7. Discussion
Scaling
Using data from many companies will lead to more accurate results, but preprocessing is required to integrate them.
Problem: possibly sparse data
Solution: dimensionality reduction (PCA)
Replicate data across different servers for partition tolerance.
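As a rough illustration of the PCA idea, the sketch below finds the first principal component with power iteration and projects the rows onto it. It is an assumption-laden toy: the function names and data are made up, only one component is extracted, and a real pipeline would more likely rely on a numerical library. It also assumes the data have non-zero variance.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (column means, unit vector of the first principal component)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    # Power iteration: repeatedly multiply by cov and renormalise.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(rows, means, component):
    """Reduce each row to a single coordinate along the component."""
    return [sum((x - m) * c for x, m, c in zip(r, means, component))
            for r in rows]

# Made-up data: column 2 is twice column 1, column 3 is always zero.
rows = [(1.0, 2.0, 0.0), (2.0, 4.0, 0.0), (3.0, 6.0, 0.0), (4.0, 8.0, 0.0)]
means, component = first_principal_component(rows)
reduced = project(rows, means, component)  # 3 columns reduced to 1
```

In this toy case all the variance lies along one direction, so a single coordinate per row keeps the full information while dropping the empty column.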
8. Discussion
Value of the method and analysis results
We have clean data without missing values or many outliers, so with a decision tree we can have both speed and high performance without worrying about overfitting.
We might have very frequent writes and updates to our data.
Example: when new records are inserted, they can be classified using the tree we already have from the last training. The training phase can take place as often as the IT department believes is required.
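The separation between classifying on insert and retraining on demand can be sketched as follows. This is a hypothetical illustration, not the project's code: the class `EmployeeStore`, its methods, and the sample values are invented, and a depth-1 tree (a single threshold on satisfaction) stands in for the full decision tree to keep the example short.

```python
def train_stump(values, labels):
    """Pick threshold t minimising errors of the rule 'predict 1 if value <= t'."""
    best_t, best_err = None, len(labels) + 1
    for t in sorted(set(values)):
        err = sum((v <= t) != bool(y) for v, y in zip(values, labels))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

class EmployeeStore:
    """Keeps labelled records and the model from the last training run."""

    def __init__(self):
        self.values, self.labels = [], []
        self.threshold = None  # set by retrain()

    def add_labelled(self, satisfaction, left):
        """Store a record whose outcome is known; does not retrain."""
        self.values.append(satisfaction)
        self.labels.append(left)

    def retrain(self):
        """Rebuild the model from all stored records (run as often as needed)."""
        self.threshold = train_stump(self.values, self.labels)

    def classify(self, satisfaction):
        """Score a new record with the model from the last training run."""
        if self.threshold is None:
            raise RuntimeError("model has not been trained yet")
        return int(satisfaction <= self.threshold)

store = EmployeeStore()
for s, y in [(0.1, 1), (0.2, 1), (0.7, 0), (0.9, 0)]:  # made-up history
    store.add_labelled(s, y)
store.retrain()

before = store.classify(0.45)   # uses the model from the first training
store.add_labelled(0.4, 1)      # new writes arrive...
store.add_labelled(0.5, 1)
stale = store.classify(0.45)    # ...but the old model is still in use
store.retrain()                 # scheduled retraining picks up the new data
after = store.classify(0.45)
```

The design choice illustrated here is that writes never block on training: inserts are scored immediately with the existing model, and retraining is a separate step whose frequency can be tuned.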