1. Background
Over the course of their day-to-day operations, Human Resources teams accumulate increasingly large pools of
data composed of employee profiles, activity records and performance appraisals. Drawing insights from this data has become
a challenge, creating both a demand and an opportunity for big data analysis. Our goal is to address this pain point of today’s
employee management by turning HR data into descriptive and inferential statistics. The resulting analytics can give
HR professionals the information they need to make effective decisions regarding their workforce.
2. Overview
To prove our concept, we will acquire the raw data and distribute it across a Hadoop cluster. We will then query the data with
basic filtering functions to demonstrate the flexibility and customization the platform allows. In addition, we will apply analytical tools to
the refined data, representing a set of business intelligence tools in line with today’s human resources management needs.
3. System Requirements
Storage: Standard cloud-based object storage node with a business continuity system in place.
File System: Hadoop Distributed File System
MapReduce Platform: Apache Hive
Analytics and Visualization Tool: Microsoft Excel enabled with ODBC driver and Microsoft Power BI.
4. Dataset
The dataset consists of 17 comma-delimited text files, in which commas separate the columns and each row starts a new record.
Each record has 20 data elements in numeric and text format. The dataset is composed of employment records from various
government agencies of the United States.
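A quick structural check before upload can catch malformed rows early. The following is a minimal sketch, assuming plain comma-delimited input; the function name `check_records` and the sample rows are hypothetical and are not taken from the dataset.

```python
import csv
import io

EXPECTED_FIELDS = 20  # field count stated in the dataset description


def check_records(lines, expected=EXPECTED_FIELDS):
    """Return (good_count, bad_record_numbers) for an iterable of CSV lines."""
    good = 0
    bad = []
    for record_no, row in enumerate(csv.reader(lines), start=1):
        if len(row) == expected:
            good += 1
        else:
            bad.append(record_no)
    return good, bad


# Hypothetical sample: two well-formed records and one short record.
sample = io.StringIO(
    ",".join(f"v{i}" for i in range(20)) + "\n"
    + ",".join(str(i) for i in range(20)) + "\n"
    + "too,few,fields\n"
)
good, bad = check_records(sample)
print(good, bad)
```

The same function could be applied to each of the 17 files by passing an open file object instead of the in-memory sample.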
5. Storage Deployment
The storage is deployed from the Azure portal and configured as a classic Azure storage account located in the Central United States. It is geo-redundant, keeping local and geo-distributed copies with a replication factor of three for disaster recovery. It can accommodate block and page blobs, tables and queues, with a maximum of 500 IOPS per disk.
6. Hadoop Cluster Deployment
The selected Apache Hadoop distribution is an Azure HDInsight cluster. It is deployed from the Azure portal and configured
with a Windows operating system and Hadoop version 2.6.0. Its resources consist of 4 worker nodes with a total of 16
cores, 14GB RAM and 8 disks. The head node is set up with the same specifications.
7. Data Upload
The dataset is transferred using the CloudBerry Explorer client application. It is uploaded directly into the default container of the blob storage linked to the HDInsight cluster.
8. Querying the Data
The queries, which Hive runs as MapReduce jobs, are performed from Azure’s Hive Editor.
a. Table Creation: A CREATE TABLE query is used to create the table with the appropriate columns.
b. Data Loading: LOAD DATA INPATH queries are used to load the uploaded data files into the table.
c. Validation: SELECT queries are used to validate the information loaded into the table.
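The three steps above can be sketched locally. This is only an illustration under stated assumptions: Python's built-in `sqlite3` stands in for Hive, the table name `employees` and its five columns are hypothetical (the slides do not list the real 20-column schema), and a plain `INSERT` replaces Hive's `LOAD DATA INPATH`, which SQLite does not have.

```python
import sqlite3

# Hypothetical rows covering a subset of columns; not taken from the dataset.
ROWS = [
    ("E001", "Defense Contract Audit Agency", "F", 31, "D"),
    ("E002", "Defense Contract Audit Agency", "M", 12, "E"),
    ("E003", "Other Agency", "F", 5, "G"),
]

conn = sqlite3.connect(":memory:")

# a. Table creation. In Hive, a CSV-backed table would also declare
#    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','.
conn.execute(
    """
    CREATE TABLE employees (
        employee_id TEXT,
        agency TEXT,
        gender TEXT,
        length_of_service INTEGER,
        salary_level TEXT
    )
    """
)

# b. Loading. Stand-in for LOAD DATA INPATH, which moves files into the table.
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?)", ROWS)

# c. Validation. Count the rows and preview a few of them.
loaded = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
preview = conn.execute("SELECT * FROM employees LIMIT 2").fetchall()
print(loaded, preview)
```

Comparing the loaded row count against the record count of the source files is a simple way to confirm that nothing was dropped during loading.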
9. Data Refinement
The refinement of the data is based on the identified business requirements. The filtering is
performed with queries that use COUNT, WHERE and GROUP BY under a variety of conditions.
a. COUNT
b. WHERE
c. WHERE
d. GROUP BY
e. GROUP BY and WHERE
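The query shapes listed above can be sketched as follows. As an assumption for illustration, Python's built-in `sqlite3` stands in for Hive, and the table, column names and rows are hypothetical; the SQL itself uses only COUNT, WHERE and GROUP BY, which HiveQL also supports.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (agency TEXT, gender TEXT, salary_level TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Agency A", "F", "D"),
        ("Agency A", "M", "D"),
        ("Agency A", "F", "E"),
        ("Agency B", "F", "G"),
        ("Agency B", "M", "E"),
    ],
)

# COUNT: total number of records.
total = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]

# WHERE: restrict the count to one agency.
in_agency_a = conn.execute(
    "SELECT COUNT(*) FROM employees WHERE agency = ?", ("Agency A",)
).fetchone()[0]

# GROUP BY: head count per salary level.
by_level = conn.execute(
    "SELECT salary_level, COUNT(*) FROM employees "
    "GROUP BY salary_level ORDER BY salary_level"
).fetchall()

# GROUP BY and WHERE: head count per salary level within one agency.
by_level_in_a = conn.execute(
    "SELECT salary_level, COUNT(*) FROM employees WHERE agency = ? "
    "GROUP BY salary_level ORDER BY salary_level",
    ("Agency A",),
).fetchall()

print(total, in_agency_a, by_level, by_level_in_a)
```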
10. Data Visualization and Interpretation
The data supports the following conclusions:
For demographic analysis, the first query displays all female employees working in the Department of Defense,
Defense Contract Audit Agency.
The second query addresses HR professionals’ need for position-vacancy analysis by searching for employees with a
Length of Service of 30 years or more, since they are the most likely to retire.
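The two filters described in this section can be sketched as follows. This is an illustration only: Python's built-in `sqlite3` stands in for Hive, and the table, column names and rows are hypothetical rather than drawn from the dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id TEXT, agency TEXT, gender TEXT, length_of_service INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        ("E001", "Defense Contract Audit Agency", "F", 31),
        ("E002", "Defense Contract Audit Agency", "M", 30),
        ("E003", "Defense Contract Audit Agency", "F", 8),
        ("E004", "Other Agency", "F", 35),
        ("E005", "Other Agency", "M", 29),
    ],
)

# Demographic analysis: female employees in one agency.
dcaa_women = [
    row[0]
    for row in conn.execute(
        "SELECT employee_id FROM employees "
        "WHERE gender = 'F' AND agency = ? ORDER BY employee_id",
        ("Defense Contract Audit Agency",),
    )
]

# Position-vacancy analysis: 30 or more years of service.
near_retirement = [
    row[0]
    for row in conn.execute(
        "SELECT employee_id FROM employees "
        "WHERE length_of_service >= 30 ORDER BY employee_id"
    )
]

print(dcaa_women, near_retirement)
```

Using `>=` rather than `>` keeps employees with exactly 30 years of service in the result, matching the "30 years or more" criterion.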
a. Graph 3.10.1 shows that salary expense is concentrated in salary levels D, E and G, each of which accounts for 12% of the total.
This means that 36% of the total salary expense goes to employees who make $40,000 - 59,999 or $70,000 - 79,999. Salary
level F, with a salary range of $60,000 - 69,999, follows the top three at 10% of the total salary expense.
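The share-of-total calculation behind a chart like this can be sketched in a few lines. The records below are hypothetical values chosen to fall inside the stated salary ranges, and `salary_share_by_level` is an illustrative helper, not the method used to build the original graph.

```python
from collections import defaultdict

# Hypothetical (salary_level, annual_salary) pairs; not taken from the dataset.
records = [
    ("D", 45_000),
    ("D", 48_000),
    ("E", 52_000),
    ("F", 65_000),
    ("G", 75_000),
    ("G", 78_000),
]


def salary_share_by_level(rows):
    """Return each level's fraction of the total salary expense."""
    totals = defaultdict(int)
    for level, salary in rows:
        totals[level] += salary
    grand_total = sum(totals.values())
    return {level: amount / grand_total for level, amount in totals.items()}


shares = salary_share_by_level(records)

# Arithmetic from the slide: three levels at 12% each.
reported_top_three = 3 * 12

for level in sorted(shares):
    print(f"{level}: {shares[level]:.1%}")
print(f"Reported top three combined: {reported_top_three}%")
```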
b. Graph 3.10.2 shows that the largest share of personnel with a supervisory level of 2 (Manager), 6 (Leader) or 7 (Team Leader)
have an education level of 13, a bachelor’s degree. More than 80,000 supervisors share this education level, followed by level
4, a high school diploma, which is shared by 64,700 supervisors. Together, these top two education
levels account for 49% of the supervisors in this group.
11. Conclusion
a. System Review - The system is an effective and user-friendly platform for manipulating and enriching large amounts of
HR data, and it was built in an efficient and cost-effective manner.
b. Opportunities - Globalization, advancements in technology and even the growing population in general will only mean
more jobs and people to manage in the future. And “without analytics, corporations could face an increase in skills gaps
throughout the entire company, less engaged employees, a lack of internal development, along with many other challenges,” says
Brittany Hink, Editor in Chief of Human Resources IQ [2]. These factors and trends will result in a continuous explosion of data
that will be important to tap into and interpret just to perform daily human resources management operations. In summary,
human resources is an aspect of business and human behavior that Big Data analytics should focus on more.
12. Reflection
Our team learned a great deal about big data and the impact that human resources departments have on global business as a whole.
More importantly, we learned how critical the System Development Lifecycle can be, as we had correlated difficulties in
implementation and testing. We learned to respect the SDLC process as a proven framework for
establishing systems that really work.
Cis520 group e
