Cis520 group e
1. Background
Over the course of their day-to-day operations, Human Resources teams are accumulating increasingly large pools of
data consisting of employee profiles, activity records and performance appraisals. Extracting insights from this data has
become a challenge, and it presents both a key demand and an opportunity for big data analysis. Our goal is to address these
pain points of today’s employee management by turning HR data into descriptive and inferential statistics. The resulting
analytics can give HR professionals the information they need to make effective decisions regarding their workforce.
2. Overview
To prove our concept, we will acquire the raw data and distribute it across a Hadoop cluster. We will then query the data with
basic filtering functions to demonstrate the system's flexibility and customization. In addition, we will apply analytical tools to
the refined data, representing a set of business intelligence tools that are in line with today’s human resources management needs.
3. System Requirements
Storage: Standard cloud-based object storage node with a business continuity system in place.
File System: Hadoop Distributed File System
MapReduce Platform: Apache Hive
Analytics and Visualization Tools: Microsoft Excel with an ODBC driver enabled, and Microsoft Power BI.
4. Dataset
The dataset consists of 17 comma-delimited text files, in which commas separate the columns and each row starts a new record.
Each record has 20 data elements in numeric and text format. The dataset comprises employment records from various
government agencies of the United States.
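A quick structural check before upload can catch malformed rows. The following is a minimal sketch, assuming only what the dataset description states (comma-delimited text, 20 data elements per record); the function name `check_records` and the sample rows are hypothetical and not part of the project.

```python
import csv
import io

EXPECTED_FIELDS = 20  # from the dataset description: 20 data elements per record


def check_records(text, expected=EXPECTED_FIELDS):
    """Split comma-delimited text into well-formed rows and bad line numbers."""
    good, bad = [], []
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) == expected:
            good.append(row)
        else:
            bad.append(line_no)
    return good, bad


# Hypothetical sample: two well-formed records and one truncated record.
sample = "\n".join([
    ",".join(f"a{i}" for i in range(20)),
    ",".join(f"b{i}" for i in range(20)),
    "too,short",
])
good_rows, bad_lines = check_records(sample)
print(len(good_rows), bad_lines)
```

Running the same check over each of the 17 files would flag any record that does not match the expected layout before it reaches the cluster.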
5. Storage Deployment
The storage is deployed from the Azure portal and configured as a classic Azure storage account located in the Central United States. It is geo-redundant, keeping local and
geo-distributed copies with a replication factor of three for disaster recovery. It can accommodate block and page blobs, tables and queues, with a maximum of 500 IOPS per disk.
6. Hadoop Cluster Deployment
The selected Apache Hadoop distribution is Azure HDInsight. The cluster is deployed from the Azure portal and configured
with a Windows operating system and Hadoop version 2.6.0. Its resources comprise 4 worker nodes with a total of 16
cores, 14 GB of RAM and 8 disks. The head node is set up with the same specifications.
7. Data Upload
The dataset is transferred using the CloudBerry Explorer client application. It is uploaded directly into the default container of the blob storage linked to the HDInsight cluster.
8. Querying the Data
The MapReduce queries are performed from Azure’s Hive Editor.
a. Table Creation: A CREATE TABLE query is used to create the table with the appropriate columns.
b. Data Loading: LOAD DATA INPATH queries are used to populate the table with the uploaded data.
c. Validation: SELECT queries are used to validate the information loaded into the table.
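The create, load and validate sequence above can be sketched locally. This is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for Hive, and the table name `employees`, the four columns and the sample rows are hypothetical (the real records have 20 elements). In Hive, the table definition would also declare the comma delimiter, and the load step would be a LOAD DATA INPATH statement rather than row inserts.

```python
import csv
import io
import sqlite3

# Hypothetical 4-column subset of the 20-element record.
RAW = """DCAA,F,31,D
DCAA,M,12,E
NAVY,F,8,G
"""

conn = sqlite3.connect(":memory:")

# a. Table creation: define the columns the records will be loaded into.
conn.execute(
    "CREATE TABLE employees ("
    "agency TEXT, gender TEXT, length_of_service INTEGER, salary_level TEXT)"
)

# b. Data loading: sqlite3 has no LOAD DATA INPATH, so the parsed
#    rows are inserted directly as a stand-in for that step.
rows = list(csv.reader(io.StringIO(RAW)))
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", rows)

# c. Validation: compare the loaded row count with the source
#    and inspect the first record.
loaded = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
first = conn.execute("SELECT * FROM employees ORDER BY rowid LIMIT 1").fetchone()
print(loaded == len(rows), first)
```

Comparing the row count in the table against the row count in the source files is a cheap first validation before any analysis is run.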
9. Data Refinement: The refinement of the data is based on the identified business requirements. The filtering is
performed with queries that combine COUNT, WHERE and GROUP BY under a variety of conditions.
a. COUNT
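The kind of filtering described in this step can be sketched as follows. This is a minimal illustration using Python's sqlite3 module as a stand-in for Hive; the table, column names and rows are hypothetical. The two conditions mirror the business questions in this report: female employees in one agency, and employees with 30 or more years of service.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (agency TEXT, gender TEXT, length_of_service INTEGER)"
)
# Hypothetical sample rows, not taken from the real dataset.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("DCAA", "F", 31),
        ("DCAA", "F", 5),
        ("DCAA", "M", 33),
        ("NAVY", "F", 30),
        ("NAVY", "M", 2),
    ],
)

# COUNT with WHERE: number of female employees in one agency.
female_dcaa = conn.execute(
    "SELECT COUNT(*) FROM employees WHERE agency = ? AND gender = ?",
    ("DCAA", "F"),
).fetchone()[0]

# COUNT with WHERE and GROUP BY: employees with 30+ years of service per agency.
long_service = conn.execute(
    "SELECT agency, COUNT(*) FROM employees "
    "WHERE length_of_service >= 30 "
    "GROUP BY agency ORDER BY agency"
).fetchall()

print(female_dcaa, long_service)
```

The SELECT statements use only standard COUNT, WHERE and GROUP BY syntax, so the same query shapes should carry over to the Hive Editor with the real table and column names.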
10. Data Visualization and Interpretation
The data supports the following conclusions:
For demographic analysis, the first query displays all the female employees working in the Department of Defense's
Defense Contract Audit Agency.
The second query addresses the position-vacancy analysis need of HR professionals by searching for the employees with a
Length of Service of 30 years and above, since they are the most likely to retire.
a. Graph 3.10.1 shows that salary expense is concentrated in salary levels D, E and G, each of which accounts for 12% of the total.
This means that 36% of the total salary expense is allocated to those who make $40,000 - 59,999 or $70,000 - 79,999. Salary
level F, with a salary range of $60,000 - 69,999, is next behind the top three, taking 10% of the total salary expense.
b. Graph 3.10.2 shows that the largest share of the personnel with a supervisory level of 2 (Manager), 6 (Leader) or 7 (Team Leader)
have an education level of 13, a bachelor’s degree. More than 80,000 supervisors share this education level, followed by level
4, a high school diploma, which is shared by 64,700 supervisors. Together, the supervisors in these top two education
levels account for 49% of this specific group.
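The share arithmetic behind item a can be reproduced with a short calculation. This is a sketch under stated assumptions: the per-level totals are hypothetical units, chosen only so that the shares match the percentages reported for levels D, E, F and G. The helper name `expense_shares` is also hypothetical.

```python
def expense_shares(totals):
    """Map each salary level to its percentage of the total salary expense."""
    grand_total = sum(totals.values())
    return {level: 100 * amount / grand_total for level, amount in totals.items()}


# Hypothetical totals per salary level (arbitrary units summing to 100),
# picked to reproduce the reported shares: D, E, G at 12% each and F at 10%.
totals = {"D": 12, "E": 12, "F": 10, "G": 12, "other": 54}

shares = expense_shares(totals)
top_three = shares["D"] + shares["E"] + shares["G"]
print(round(top_three), round(shares["F"]))
```

With real data, `totals` would come from summing salaries grouped by salary level, and the same function would yield the percentages plotted in the graph.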
11. Conclusion
a. System Review - The system provides an effective and user-friendly platform for manipulating and enhancing large amounts of
HR data, and it was created in an efficient and cost-effective manner.
b. Opportunities - Globalization, advancements in technology and even the growing population in general will only mean
more jobs and people to manage in the future. And “without analytics, corporations could face an increase in skills gaps
throughout the entire company, less engaged employees, a lack of internal development, along with many other challenges,” says
Brittany Hink, Editor in Chief of Human Resources IQ [2]. These factors and trends will result in a continuous explosion of data
that will be very important to tap into and interpret just to perform daily human resources management operations. In summary,
human resources is an aspect of business and human behavior that Big Data analytics should focus on more.
12. Reflection
Our team learned a great deal about big data and the impact that human resources departments have on global business as a whole.
More importantly, we learned how critical the System Development Lifecycle can be, as we ran into related difficulties in
implementation and testing. We learned to respect the SDLC process, as it is a proven framework for
establishing systems that really work.