DATA MANAGEMENT FINAL PROJECT
ANALYSIS OF SAN-FRANCISCO EMPLOYEE COMPENSATION FOR
FISCAL YEAR 2014 AND 2015
SUBMITTED BY
SAGAR VINAYKUMAR TUPKAR
MS-BUSINESS ANALYTICS’16
UNIVERSITY OF CINCINNATI, OHIO
CHAPTER 01
DATA INFORMATION
1.01 ABOUT DATA
This project works with the San Francisco employee compensation dataset for fiscal years 2014 and
2015. The San Francisco Controller's Office maintains a database of the salary and benefits paid
to City employees since fiscal year 2013. The data is also summarized and presented in the
Employee Compensation report hosted at http://openbook.sfgov.org. New data is added on a
bi-annual basis, when available, for each fiscal and calendar year.
1.02 DATA SOURCE
The data was obtained from San Francisco's open-data website (www.data.sfgov.org). The dataset
is available at:
https://data.sfgov.org/City-Management-and-Ethics/Employee-Compensation/88g8-5mnd
1.03 MODIFICATIONS DONE TO THE DATA
a. The original data downloaded from the website did not have correct datatypes. Using Excel,
the datatypes of the measures were changed to Number and those of the dimensions to Text.
b. Using Excel, a filter was applied so that only records for FISCAL years 2014 and 2015 were
retained. All CALENDAR year records and all year 2013 records were excluded from the
dataset to be analyzed.
c. Some columns were also excluded, e.g. Year Type, Union Code, Union Name, Employee
Identifier etc., as this information was not used in the analysis.
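Steps b and c were done in Excel; the same filtering can be sketched in Python for readers who prefer a scripted, repeatable version. This is only an illustration under assumptions: the column headers follow the names used in this report, the values "Fiscal" and "Calendar" for Year Type are assumed, and the sample rows are invented.

```python
import csv
import io

# Hypothetical miniature of the raw export. Column names follow the report;
# the rows and the "Fiscal"/"Calendar" labels are assumptions for illustration.
RAW = """Year Type,Year,Department,Union Code,Employee Identifier,Total Compensation
Fiscal,2013,Fire Department,798,1001,150000.00
Fiscal,2014,Fire Department,798,1002,160000.00
Calendar,2014,Public Health,250,1003,90000.00
Fiscal,2015,Public Health,250,1004,95000.00
"""

# Columns the report excludes from the analysis.
DROPPED = {"Year Type", "Union Code", "Union Name", "Employee Identifier"}
KEPT_YEARS = {"2014", "2015"}


def filter_rows(text):
    """Keep fiscal-year 2014/2015 rows and drop the unused columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["Year Type"] == "Fiscal" and row["Year"] in KEPT_YEARS:
            rows.append({k: v for k, v in row.items() if k not in DROPPED})
    return rows


filtered = filter_rows(RAW)
for row in filtered:
    print(row)
```

Filtering on Year Type before dropping that column matters: once the column is gone, fiscal and calendar records for the same year can no longer be told apart.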
CHAPTER 02
TABLE OVERVIEW
2.01 GENERIC OVERVIEW OF THE DATA
The dataset used for the study is the Employee Compensation data for the City of San Francisco
for fiscal years 2014 and 2015. After the modifications above, it contains 83946 rows and
18 columns. The columns run hierarchically from Organization Group down to Job, and
compensation is broken down into Salaries and Benefits, each of which is further split into
categories. Here are all the column names in the dataset, with descriptions –
1) Year – the year for which the data exists (2014 or 2015)
2) Organization Group Code – a unique code given to an Organization Group
3) Organization Group – name of the Organization Group
4) Department Code – a unique code given to a Department
5) Department – name of the Department
6) Job Family Code – a unique code given to a Job Family
7) Job Family – name of the Job Family
8) Job Code – a unique code given to a Job
9) Job – name of the Job
10) Salaries – salary for that job in USD
11) Overtime – overtime pay in USD
12) Other Salaries – other salaries besides the main salary in USD
13) Total Salaries – total salary (sum of the 3 columns above) in USD
14) Retirement – benefit due to retirement plan in USD
15) Health/Dental – benefit due to health/dental privileges in USD
16) Other Benefits – other benefits in USD
17) Total Benefits – total benefits (sum of the 3 benefit columns above) in USD
18) Total Compensation – total compensation (total salary + total benefits) in USD
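The relationships between columns 10 to 18 can be written as a small consistency check. The sketch below is illustrative only: the record is invented, and the one-cent tolerance is an assumption to absorb rounding in the source data.

```python
from decimal import Decimal

# Hypothetical record; field names follow the column list, values are invented.
record = {
    "Salaries": Decimal("82000.00"),
    "Overtime": Decimal("3500.50"),
    "Other Salaries": Decimal("1200.00"),
    "Total Salaries": Decimal("86700.50"),
    "Retirement": Decimal("17000.00"),
    "Health/Dental": Decimal("12500.25"),
    "Other Benefits": Decimal("6800.00"),
    "Total Benefits": Decimal("36300.25"),
    "Total Compensation": Decimal("123000.75"),
}


def totals_consistent(rec, tolerance=Decimal("0.01")):
    """Return True when each total equals the sum of its component columns."""
    salary = rec["Salaries"] + rec["Overtime"] + rec["Other Salaries"]
    benefits = rec["Retirement"] + rec["Health/Dental"] + rec["Other Benefits"]
    checks = [
        (salary, rec["Total Salaries"]),
        (benefits, rec["Total Benefits"]),
        (rec["Total Salaries"] + rec["Total Benefits"], rec["Total Compensation"]),
    ]
    return all(abs(computed - stated) <= tolerance for computed, stated in checks)


print(totals_consistent(record))
```

Decimal is used instead of float so that currency amounts compare exactly.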
2.02 ANALYSIS TO BE DONE ON THE DATASET
The remainder of the report probes the dataset to extract information from it. The dataset is
analyzed for the average, minimum, maximum and sum of salary, benefits and compensation
across Organization Groups, Departments, Job Families and Jobs, with attention to outliers, as
they can be insightful to the reader. The analysis also compares the statistics for fiscal year
2015 against fiscal year 2014. Important information such as the number of employees in a
particular department or organization group, or doing a particular type of job, is also presented
and analyzed.
CHAPTER 03
NORMALIZATION OF THE DATA
3.01 IS THE DATASET NORMALIZED?
A dataset is usually normalized before analysis to remove redundancy and repetition in the
information it contains. A relational database system is also generally easier to analyze and
maintain than a non-relational one. The dataset analyzed here, although uniform and granular,
is not normalized. The same values are repeated across many rows. There is also no explicit
link between columns that are intuitively related, e.g. a Job is part of a Job Family, Job
Families differ between Departments, and Departments are categorized into Organization
Groups. All these columns can be related.
3.02 HOW TO NORMALIZE THE DATASET?
As mentioned above, the dataset needs to be normalized to remove the redundancy from the
rows. So,
1. To normalize the dataset, new tables need to be created and linked to each other through
the relations they have. E.g.
Table 1 – Organization Group Code and Organization Group Name, because every code
has a unique name associated with it.
Similar tables can be created for Department, Job Family and Job.
2. Joining these tables to a fact table in SQL reproduces the same dataset, but in normalized
form.
3. A Total Salary table can also be created from the columns Salary, Overtime, Other
Salary and Total Salary; in this new table the Total Salary column is computed by an SQL
expression that sums the other 3 columns. Hence, whenever the values of those 3 columns
change, the total salary is updated automatically. The same can be done for Total
Benefits and thus Total Compensation, as all these values are linked to each other.
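The three steps above can be sketched as a small schema. This is a minimal illustration, not the schema used in the project: it runs on SQLite through Python's sqlite3 module rather than SQL Server, shows only two of the lookup tables, and all table names, column names, codes and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Step 1: one lookup table per dimension (code -> name).
CREATE TABLE organization_group (
    org_group_code TEXT PRIMARY KEY,
    org_group_name TEXT NOT NULL
);
CREATE TABLE department (
    department_code TEXT PRIMARY KEY,
    department_name TEXT NOT NULL
);
-- Step 2: the fact table stores only codes and measures.
CREATE TABLE compensation_fact (
    year            INTEGER NOT NULL,
    org_group_code  TEXT NOT NULL REFERENCES organization_group(org_group_code),
    department_code TEXT NOT NULL REFERENCES department(department_code),
    salaries        REAL NOT NULL,
    overtime        REAL NOT NULL,
    other_salaries  REAL NOT NULL
);
-- Step 3: total salary is derived in a view, so it always matches its parts.
CREATE VIEW compensation_flat AS
SELECT f.year, g.org_group_name, d.department_name,
       f.salaries, f.overtime, f.other_salaries,
       f.salaries + f.overtime + f.other_salaries AS total_salaries
FROM compensation_fact AS f
JOIN organization_group AS g ON g.org_group_code = f.org_group_code
JOIN department AS d ON d.department_code = f.department_code;
""")

# Hypothetical codes and amounts, invented for illustration.
conn.execute("INSERT INTO organization_group VALUES ('01', 'Public Protection')")
conn.execute("INSERT INTO department VALUES ('FIR', 'Fire Department')")
conn.execute(
    "INSERT INTO compensation_fact VALUES (2015, '01', 'FIR', 100000.0, 20000.0, 5000.0)"
)

# Changing a component changes the derived total without any extra update.
conn.execute("UPDATE compensation_fact SET overtime = 25000.0")
row = conn.execute(
    "SELECT department_name, total_salaries FROM compensation_flat"
).fetchone()
print(row)
```

A view is one way to get the "automatically updated" total described in step 3; a computed column in the fact table would be another.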
CHAPTER 04
PROBLEMS IN THE DATASET AND DATA CLEANING
4.01 PROBLEMS IN THE DATASET
Although the dataset is well organized and maintained by the SF government, it has certain
problems that should be fixed to make it better.
1. The values in the ‘Job Family Code’ and ‘Job Code’ columns are inconsistent in format.
While most of the codes are numeric, some entries are alphanumeric. This causes
problems in data manipulation.
2. The measure columns such as ‘Salaries’, ‘Benefits’ etc. have many negative entries.
Such records should be removed from the dataset and, if they have any significance,
saved in another table for separate analysis. Negative values in these columns make no
sense and distort the overall aggregates (sum, average etc.).
3. Some Job Codes and Job names are the same across different departments. This can
create confusion when drawing conclusions about the salaries and compensation for a
particular job name unless filters are applied.
4. The initial dataset contained NULLs, which could have caused serious problems.
5. As mentioned earlier, the datatypes of the columns were not in a standard format,
which could have caused problems when importing the data into any other tool for analysis.
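Problems 1 to 4 can each be detected with a short check. The sketch below is a hypothetical illustration in Python: the column names follow this report, while the rows, codes and amounts are invented and not taken from the dataset.

```python
# Hypothetical sample rows; column names follow the report, values are invented.
rows = [
    {"Job Code": "2320", "Job": "Registered Nurse", "Department": "Public Health", "Salaries": 98000.0},
    {"Job Code": "H002", "Job": "Firefighter", "Department": "Fire Department", "Salaries": 110000.0},
    {"Job Code": "1406", "Job": "Senior Clerk", "Department": "Public Health", "Salaries": -350.0},
    {"Job Code": "1406", "Job": "Senior Clerk", "Department": "Human Services", "Salaries": 52000.0},
    {"Job Code": "9163", "Job": "Transit Operator", "Department": None, "Salaries": 67000.0},
]

# Problem 1: codes that are not purely numeric.
non_numeric_codes = sorted({r["Job Code"] for r in rows if not r["Job Code"].isdigit()})

# Problem 2: records with a negative measure, set aside for separate analysis.
negative_rows = [r for r in rows if r["Salaries"] < 0]

# Problem 4: records containing a NULL (None) in any column.
rows_with_nulls = [r for r in rows if any(v is None for v in r.values())]

# Problem 3: job codes that appear in more than one department.
departments_by_job = {}
for r in rows:
    if r["Department"] is not None:
        departments_by_job.setdefault(r["Job Code"], set()).add(r["Department"])
shared_codes = sorted(code for code, depts in departments_by_job.items() if len(depts) > 1)

print(non_numeric_codes, len(negative_rows), len(rows_with_nulls), shared_codes)
```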
4.02 IMPROVEMENT AND SUGGESTIONS
As discussed above, the dataset has several issues that could interfere with further analysis,
so it was cleaned using Excel and SQL. All the datatypes were corrected in Excel before any
other operation was performed on the table. After the data was trimmed as needed, it was
imported into SQL Server Express and all the NULLs (present only in the dimensions) were
replaced by ‘0.00’.
Apart from fixing the problems present in the dataset, a few additional changes could
increase the utility of the table and allow much more information to be extracted.
1. New columns with the name, age, work experience and work history of the employee
could be added to the dataset (for government officials whose names may legally be
published).
2. The columns containing the codes could be dropped to make the dataset smaller and
tidier. The identifiers could be added back later while normalizing the dataset.
CHAPTER 05
GENERAL STATISTICS OF THE DATA
A) USING EXCEL –
For our dataset, we first check the number of records for each organization group for
2014 and 2015 combined. Here is the output from the Excel pivot table.
To probe further into the number of records for each department within an organization
group, a pivot table was used again; snapshots of the results are attached below, one per
organization group –
a. General City Responsibilities
b. Culture and Recreation
c. General Administration and Finance
d. Community Health
e. Human Welfare and Neighborhood Development
f. Public Protection
g. Public Works, Transportation and Commerce
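The pivot tables count records per organization group and then per department within each group. As a rough equivalent outside Excel, the same counts can be sketched in Python; the records below are invented for illustration and do not reproduce the actual counts.

```python
from collections import Counter

# Hypothetical (organization group, department) pairs, invented for illustration.
records = [
    ("Public Protection", "Fire Department"),
    ("Public Protection", "Police"),
    ("Public Protection", "Fire Department"),
    ("Community Health", "Public Health"),
    ("Community Health", "Public Health"),
    ("Culture and Recreation", "Public Library"),
]

# First pivot: number of records per organization group.
by_group = Counter(group for group, _ in records)

# Second pivot: number of records per department within each group.
by_group_and_department = Counter(records)

for group, count in by_group.most_common():
    print(group, count)
    for (g, dept), n in sorted(by_group_and_department.items()):
        if g == group:
            print("   ", dept, n)
```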
B) USING SQL –
The dataset was imported into Microsoft SQL Server Management Studio for initial
analysis. A SQL file containing all the queries with descriptions is attached to the
submission. A snapshot of the top 15 records was taken separately in SQL for the
dimensions and for the measures. Here are the snapshots of the sample, to give the
reader an idea of the data.
1. Dimensions
2. Measures
Some queries were written and run in SQL. Here are some of the observations –
1. An initial summary of the data was obtained, e.g. total number of records, number of
records in 2014 and in 2015, number and names of distinct organization groups, and
number of distinct departments, job families and jobs. There are 83946 records in
total; 43078 are reported for 2015 and 40686 for 2014 (as reported, these two counts
sum to 83764 rather than 83946, so one of the figures appears to be misstated). On the
reported counts, the number of registered employees in San Francisco appears to have
increased by 2392 from fiscal year 2014 to 2015. There are also 7 different
Organizational Groups, 53 Departments, 55 Job Families and 1068 different job titles
for the year 2015 in San Francisco. Here are the snapshots of the output from SQL –
2. A query was written and run in SQL to find the top 10 departments with the largest
number of employees in 2015. The Public Health Department had the most employees,
9148 for fiscal year 2015, followed by the Municipal Transportation Agency with 6427
employees. Here is the output of the query –
3. The top 10 compensations in the entire database for the year 2015 were extracted with
a query. The job title ‘Asst Med Examiner’, from the Job Family ‘Med Therapy and
Auxiliary’ in the department ‘General Services Agency-City Admin’ under the
Organizational Group ‘General Administration & Finance’, has the highest recorded
compensation, around $497505.
4. The sum of compensation in an organizational group depends on the number of
employees in it, so it is a biased basis for comparing groups; the average is used
instead. A query for the average compensation of each organizational group showed that
the Public Protection group has the highest average compensation, $144452, and the rest
follow the pattern shown in the snapshot of the SQL output –
5. A similar query was written to pull out the top 10 departments by average
compensation. The Fire Department had the highest average compensation, $182231,
for fiscal year 2015. Here is the output.
6. The record with the maximum total salary was retrieved for each department, along with
its other columns. The output has 53 records, which cannot all be shown here, but the
output table looks somewhat like this.
7. Finally, the departments were pulled out for which the difference between the maximum
and minimum salary was greater than 250k for fiscal year 2015. The department
‘General Services Agency-City Admin’ under the Organizational Group ‘General
Administration & Finance’ has the largest spread in salaries, with a difference
between maximum and minimum of $413272. Here is the output –
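The queries behind observations 2, 4 and 7 are in the attached SQL file and are not reproduced in this report. As a rough sketch of what such queries could look like, the example below runs on SQLite through Python's sqlite3 module instead of SQL Server; the table name, column names and rows are all hypothetical. In T-SQL, `SELECT TOP 10` would take the place of SQLite's `LIMIT 10`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE compensation (
        year INTEGER, org_group TEXT, department TEXT,
        total_salary REAL, total_compensation REAL
    )
""")
# Hypothetical rows, invented for illustration.
conn.executemany(
    "INSERT INTO compensation VALUES (?, ?, ?, ?, ?)",
    [
        (2015, "Public Protection", "Fire Department", 150000.0, 200000.0),
        (2015, "Public Protection", "Fire Department", 90000.0, 120000.0),
        (2015, "Community Health", "Public Health", 300000.0, 380000.0),
        (2015, "Community Health", "Public Health", 20000.0, 30000.0),
        (2015, "Community Health", "Public Health", 70000.0, 100000.0),
        (2014, "Community Health", "Public Health", 65000.0, 95000.0),
    ],
)

# Observation 2: departments ranked by number of 2015 records.
top_departments = conn.execute("""
    SELECT department, COUNT(*) AS n
    FROM compensation WHERE year = 2015
    GROUP BY department ORDER BY n DESC LIMIT 10
""").fetchall()

# Observation 4: average (not summed) compensation per organization group.
avg_by_group = conn.execute("""
    SELECT org_group, AVG(total_compensation) AS avg_comp
    FROM compensation WHERE year = 2015
    GROUP BY org_group ORDER BY avg_comp DESC
""").fetchall()

# Observation 7: departments whose salary spread exceeds 250k.
wide_spread = conn.execute("""
    SELECT department, MAX(total_salary) - MIN(total_salary) AS spread
    FROM compensation WHERE year = 2015
    GROUP BY department
    HAVING MAX(total_salary) - MIN(total_salary) > 250000
""").fetchall()

print(top_departments, avg_by_group, wide_spread, sep="\n")
```

The spread filter uses HAVING rather than WHERE because it applies to the per-department aggregates, not to individual rows.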
C) USING TABLEAU
Excel and SQL gave an overview of the statistics of the table. We now use Tableau, a tool
mainly used for data visualization, to get a visual picture of the statistics.
1. As before, we visualize the employee count in the year 2015 for departments and
organizational groups.
a. For Organizational Groups
b. For Departments
2. Here is a visualization of the average total salary for the top 10 department codes in the
year 2015.
3. To get a better idea, we plot a bar graph of the total compensation, total salary and
total benefits for the top 10 departments for the year 2015.
4. The trend of average compensation for the organizational groups was studied and
plotted in Tableau. For two organizational groups, General City Responsibilities and
Human Welfare and Neighborhood Development, the average compensation decreased
significantly from 2014 to 2015. Here is the plot
5. To probe further, we plotted the trend of the employee count and average
compensation for just these two organizational groups, broken down by department.
The number of records/employees decreased significantly for the ‘Human Services’
Department from 2014 to 2015, while there was no significant change in the number of
employees in the General Fund Unallocated Department.
6. The plot of average salaries by Job Family Code shows that a single Job Family Code
or Job name appears in multiple departments. The visualization stacks the output for
different departments under the same Job Family Code column. Here is a glimpse –
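The trend comparison in item 4 reduces to computing the average compensation per organizational group for each year and taking the difference. The sketch below shows that calculation in Python under stated assumptions: the records are invented, so the resulting numbers illustrate the method only and are not the figures plotted in Tableau.

```python
from collections import defaultdict

# Hypothetical (year, organization group, total compensation) records,
# invented for illustration.
records = [
    (2014, "Human Welfare and Neighborhood Development", 90000.0),
    (2014, "Human Welfare and Neighborhood Development", 110000.0),
    (2015, "Human Welfare and Neighborhood Development", 80000.0),
    (2015, "Human Welfare and Neighborhood Development", 100000.0),
    (2014, "Public Protection", 140000.0),
    (2015, "Public Protection", 150000.0),
]

# Accumulate [sum, count] per (group, year).
totals = defaultdict(lambda: [0.0, 0])
for year, group, comp in records:
    totals[(group, year)][0] += comp
    totals[(group, year)][1] += 1

averages = {key: s / n for key, (s, n) in totals.items()}

# Change in average compensation from 2014 to 2015 for each group.
groups = sorted({group for group, _ in averages})
change = {g: averages[(g, 2015)] - averages[(g, 2014)] for g in groups}

for group, delta in change.items():
    print(group, delta)
```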
CHAPTER 06
SUMMARY OF THE FINDINGS AND SUGGESTIONS
6.01 SUMMARY OF THE FINDINGS
The San Francisco Employee Compensation dataset for fiscal years 2014 and 2015 was
analyzed in this project, with the following observations –
1. The job title ‘Asst Med Examiner’, from the Job Family ‘Med Therapy and Auxiliary’ in
the department ‘General Services Agency-City Admin’ under the Organizational Group
‘General Administration & Finance’, has the highest recorded compensation, around
$497505.
2. The Public Protection group has the highest average compensation.
3. The Fire Department had the highest average compensation for fiscal year 2015.
4. The department ‘General Services Agency-City Admin’ under the Organizational Group
‘General Administration & Finance’ has the largest spread in salaries.
5. For two organizational groups, General City Responsibilities and Human Welfare and
Neighborhood Development, the average compensation decreased significantly from
2014 to 2015.
6. The number of records/employees decreased significantly for the ‘Human Services’
Department from 2014 to 2015, while there was no significant change in the number of
employees in the General Fund Unallocated Department.
6.02 SUGGESTIONS
Although the dataset has a lot of information on employee compensation and its breakdown,
it could be improved by including more columns. Apart from normalizing and cleaning the
dataset, here are a few suggestions –
1. A column showing the age of the employee or their work experience could be added, so
that more information can be extracted about the distribution of salaries by experience.
2. A column showing demographic information about the employee could be added to the
dataset. This would meet the need for a distribution of salaries across demographics.
3. A column showing the qualification of the employee, e.g. PhD or Masters, could be very
useful. For a person with a certain qualification who is looking for a job in SF, this data
can give an idea of the average salary an employee with that qualification gets in the
particular field/department they are planning to apply to.
4. A flag indicating whether the employee has worked in California before could also be
useful. Generally, some departments prefer people who have worked in the State before,
and there is a difference in the CTC for these employees compared to those who haven’t,
so this information can also be useful.
REFERENCES –
1. Data –
https://data.sfgov.org/City-Management-and-Ethics/Employee-Compensation/88g8-5mnd
2. Picture –
http://highincomerealestate.com/wp-content/uploads/2014/09/SanFrancisco2.jpg

More Related Content

Similar to EDA of San Francisco Employee Compensation for Fiscal Year 2014-15

ISAS 600 – Database Project Phase III RubricAs the final ste.docx
ISAS 600 – Database Project Phase III RubricAs the final ste.docxISAS 600 – Database Project Phase III RubricAs the final ste.docx
ISAS 600 – Database Project Phase III RubricAs the final ste.docx
bagotjesusa
 
Health Workforce Plan.docx
Health Workforce Plan.docxHealth Workforce Plan.docx
Health Workforce Plan.docx
write4
 
IRJET- Testing Improvement in Business Intelligence Area
IRJET- Testing Improvement in Business Intelligence AreaIRJET- Testing Improvement in Business Intelligence Area
IRJET- Testing Improvement in Business Intelligence Area
IRJET Journal
 
Kevin Fahy Bi Portfolio
Kevin Fahy   Bi PortfolioKevin Fahy   Bi Portfolio
Kevin Fahy Bi Portfolio
KevinPFahy
 
William Schaffrans Bus Intelligence Portfolio
William Schaffrans Bus Intelligence PortfolioWilliam Schaffrans Bus Intelligence Portfolio
William Schaffrans Bus Intelligence Portfolio
wschaffr
 
Data visualization via Tableau
Data visualization via TableauData visualization via Tableau
Data visualization via Tableau
kahhuey
 
Business Intelligence
Business IntelligenceBusiness Intelligence
Business Intelligence
Ting Yin
 
Integrated Business Processes with ERP Systems 1st Edition Magal Solutions Ma...
Integrated Business Processes with ERP Systems 1st Edition Magal Solutions Ma...Integrated Business Processes with ERP Systems 1st Edition Magal Solutions Ma...
Integrated Business Processes with ERP Systems 1st Edition Magal Solutions Ma...
SydneyMorgans
 
Showcase of Reports
Showcase of ReportsShowcase of Reports
Showcase of ReportsGarth Wilson
 
Data warehousing and business intelligence project report
Data warehousing and business intelligence project reportData warehousing and business intelligence project report
Data warehousing and business intelligence project report
sonalighai
 
FIN 320 Module Four Excel Assignment Rubric This assign.docx
FIN 320 Module Four Excel Assignment Rubric  This assign.docxFIN 320 Module Four Excel Assignment Rubric  This assign.docx
FIN 320 Module Four Excel Assignment Rubric This assign.docx
ssuser454af01
 
A Deep Dive into NetSuite Data Migration.pdf
A Deep Dive into NetSuite Data Migration.pdfA Deep Dive into NetSuite Data Migration.pdf
A Deep Dive into NetSuite Data Migration.pdf
Pratik686562
 
Part OneFirst, use the provided  MS Excel Spreadshe.docx
Part OneFirst, use the provided             MS Excel Spreadshe.docxPart OneFirst, use the provided             MS Excel Spreadshe.docx
Part OneFirst, use the provided  MS Excel Spreadshe.docx
LacieKlineeb
 
17_New_Zealand.pdf
17_New_Zealand.pdf17_New_Zealand.pdf
17_New_Zealand.pdf
Jose Ramon Albert
 
industrial manpower resource manager
industrial manpower resource managerindustrial manpower resource manager
industrial manpower resource manager
Freelancer
 
IT 8100 Database Architecture And Design.docx
IT 8100 Database Architecture And Design.docxIT 8100 Database Architecture And Design.docx
IT 8100 Database Architecture And Design.docx
stirlingvwriters
 
Ais Romney 2006 Slides 15 Database Design Using The Rea
Ais Romney 2006 Slides 15 Database Design Using The ReaAis Romney 2006 Slides 15 Database Design Using The Rea
Ais Romney 2006 Slides 15 Database Design Using The Rea
sharing notes123
 
Ais Romney 2006 Slides 15 Database Design Using The Rea
Ais Romney 2006 Slides 15 Database Design Using The ReaAis Romney 2006 Slides 15 Database Design Using The Rea
Ais Romney 2006 Slides 15 Database Design Using The Rea
Sharing Slides Training
 

Similar to EDA of San Francisco Employee Compensation for Fiscal Year 2014-15 (20)

ISAS 600 – Database Project Phase III RubricAs the final ste.docx
ISAS 600 – Database Project Phase III RubricAs the final ste.docxISAS 600 – Database Project Phase III RubricAs the final ste.docx
ISAS 600 – Database Project Phase III RubricAs the final ste.docx
 
Health Workforce Plan.docx
Health Workforce Plan.docxHealth Workforce Plan.docx
Health Workforce Plan.docx
 
IRJET- Testing Improvement in Business Intelligence Area
IRJET- Testing Improvement in Business Intelligence AreaIRJET- Testing Improvement in Business Intelligence Area
IRJET- Testing Improvement in Business Intelligence Area
 
Kevin Fahy Bi Portfolio
Kevin Fahy   Bi PortfolioKevin Fahy   Bi Portfolio
Kevin Fahy Bi Portfolio
 
ACCOUNTING & AUDITING WITH EXCEL2011
ACCOUNTING & AUDITING WITH EXCEL2011ACCOUNTING & AUDITING WITH EXCEL2011
ACCOUNTING & AUDITING WITH EXCEL2011
 
Kpi handbook implementation on bizforce one
Kpi handbook implementation on bizforce oneKpi handbook implementation on bizforce one
Kpi handbook implementation on bizforce one
 
William Schaffrans Bus Intelligence Portfolio
William Schaffrans Bus Intelligence PortfolioWilliam Schaffrans Bus Intelligence Portfolio
William Schaffrans Bus Intelligence Portfolio
 
Data visualization via Tableau
Data visualization via TableauData visualization via Tableau
Data visualization via Tableau
 
Business Intelligence
Business IntelligenceBusiness Intelligence
Business Intelligence
 
Integrated Business Processes with ERP Systems 1st Edition Magal Solutions Ma...
Integrated Business Processes with ERP Systems 1st Edition Magal Solutions Ma...Integrated Business Processes with ERP Systems 1st Edition Magal Solutions Ma...
Integrated Business Processes with ERP Systems 1st Edition Magal Solutions Ma...
 
Showcase of Reports
Showcase of ReportsShowcase of Reports
Showcase of Reports
 
Data warehousing and business intelligence project report
Data warehousing and business intelligence project reportData warehousing and business intelligence project report
Data warehousing and business intelligence project report
 
FIN 320 Module Four Excel Assignment Rubric This assign.docx
FIN 320 Module Four Excel Assignment Rubric  This assign.docxFIN 320 Module Four Excel Assignment Rubric  This assign.docx
FIN 320 Module Four Excel Assignment Rubric This assign.docx
 
A Deep Dive into NetSuite Data Migration.pdf
A Deep Dive into NetSuite Data Migration.pdfA Deep Dive into NetSuite Data Migration.pdf
A Deep Dive into NetSuite Data Migration.pdf
 
Part OneFirst, use the provided  MS Excel Spreadshe.docx
Part OneFirst, use the provided             MS Excel Spreadshe.docxPart OneFirst, use the provided             MS Excel Spreadshe.docx
Part OneFirst, use the provided  MS Excel Spreadshe.docx
 
17_New_Zealand.pdf
17_New_Zealand.pdf17_New_Zealand.pdf
17_New_Zealand.pdf
 
industrial manpower resource manager
industrial manpower resource managerindustrial manpower resource manager
industrial manpower resource manager
 
IT 8100 Database Architecture And Design.docx
IT 8100 Database Architecture And Design.docxIT 8100 Database Architecture And Design.docx
IT 8100 Database Architecture And Design.docx
 
Ais Romney 2006 Slides 15 Database Design Using The Rea
Ais Romney 2006 Slides 15 Database Design Using The ReaAis Romney 2006 Slides 15 Database Design Using The Rea
Ais Romney 2006 Slides 15 Database Design Using The Rea
 
Ais Romney 2006 Slides 15 Database Design Using The Rea
Ais Romney 2006 Slides 15 Database Design Using The ReaAis Romney 2006 Slides 15 Database Design Using The Rea
Ais Romney 2006 Slides 15 Database Design Using The Rea
 

Recently uploaded

一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 

Recently uploaded (20)

一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 

EDA of San Francisco Employee Compensation for Fiscal Year 2014-15

  • 1. 1 DATA MANAGEMENT FINAL PROJECT ANALYSIS OF SAN-FRANCISCO EMPLOYEE COMPENSATION FOR FISCAL YEAR 2014 AND 2015 SUBMITTED BY SAGAR VINAYKUMAR TUPKAR MS-BUSINESS ANALYTICS’16 UNIVERSITY OF CINCINNATI, OHIO
  • 2. 2 CHAPTER 01 DATA INFORMATION 1.01 ABOUT DATA The Data that is worked upon this project is the dataset of the compensation of employees in San Francisco for the Fiscal Year 2014 and 2015. The San Francisco Controller's Office maintains a database of the salary and benefits paid to City employees since fiscal year 2013. This data has also been summarized and presented on the Employee Compensation report hosted at http://openbook.sfgov.org. New data is added on a bi-annual basis when available for each fiscal and calendar year. 1.02 DATA SOURCE The data was obtained from an open-data source website (www.data.sfgov.org) from the internet. Here is the link of the dataset. https://data.sfgov.org/City-Management-and-Ethics/Employee-Compensation/88g8-5mnd 1.03 MODIFICATIONS DONE TO THE DATA a. The Original data that was downloaded from the website did not have the datatypes correct. So, using excel the datatypes for the measures were changed to Numbers and Dimensions to Text. b. Using Excel, a filter was applied to the dataset and the data was extracted only for FISCAL year 2014 and 2015. All CALENDAR year and year 2013 data was excluded from the dataset to be analyzed. c. Some of the columns in the table were also excluded, e.g. Year type, Union Code, Union Name, Employee Identifier etc. as these information were not used in the analysis to be done.
  • 3. CHAPTER 02
    TABLE OVERVIEW
    2.01 GENERIC OVERVIEW OF THE DATA
    The dataset used for the study is the Employee Compensation data for the city of San Francisco for Fiscal Years 2014 and 2015. The dataset as modified for analysis contains 83946 rows and 18 columns. The columns run hierarchically from Organization down to Job, and the Compensation is broken down into Salaries and Benefits, which are further split into different categories. Here are all the column names present in the dataset, with descriptions –
    1) Year – the year for which the data exists (2014 or 2015)
    2) Organization Group Code – a unique code given to an Organization Group
    3) Organization Group – name of the Organization Group
    4) Department Code – a unique code given to a department
    5) Department – name of the department
    6) Job Family Code – a unique code given to a Job Family
    7) Job Family – name of the Job Family
    8) Job Code – a unique code given to a Job
    9) Job – name of the Job
    10) Salaries – salary for that job in USD
    11) Overtime – overtime pay in USD
    12) Other Salaries – other salaries besides the main salary in USD
    13) Total Salaries – total salary (aggregate of the 3 columns above) in USD
    14) Retirement – benefit due to retirement plan in USD
    15) Health/Dental – benefit due to health/dental privileges in USD
    16) Other Benefits – other benefits in USD
    17) Total Benefits – total benefits (aggregate of the 3 benefit columns above) in USD
    18) Total Compensation – total compensation (total salary + total benefits) in USD
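The relationships between the total columns and their components, as described in the column list, can be checked row by row. The following is a minimal sketch only: the helper name `totals_consistent`, the tolerance and the sample values are assumptions made for illustration and are not taken from the dataset.

```python
# Hypothetical helper; the dictionary keys follow the column names listed above.
def totals_consistent(row, tol=0.01):
    """Return True if the three total columns equal the sum of their parts."""
    total_salary = row["Salaries"] + row["Overtime"] + row["Other Salaries"]
    total_benefits = row["Retirement"] + row["Health/Dental"] + row["Other Benefits"]
    return (
        abs(row["Total Salaries"] - total_salary) <= tol
        and abs(row["Total Benefits"] - total_benefits) <= tol
        and abs(row["Total Compensation"] - (total_salary + total_benefits)) <= tol
    )

# Made-up record for illustration only (not a row from the dataset).
sample_row = {
    "Salaries": 60000.0, "Overtime": 5000.0, "Other Salaries": 1000.0,
    "Total Salaries": 66000.0,
    "Retirement": 12000.0, "Health/Dental": 9000.0, "Other Benefits": 4000.0,
    "Total Benefits": 25000.0,
    "Total Compensation": 91000.0,
}
print(totals_consistent(sample_row))
```

A check of this kind could be run over every record before analysis to flag rows whose totals do not add up.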
  • 4. 2.02 ANALYSIS TO BE DONE ON THE DATASET
    The latter part of the report probes the dataset to extract information from it. The dataset will be analyzed for the Average/Minimum/Maximum/Sum of Salary, Benefits and Compensation for the various Organization Groups, Departments, Job Families and Jobs, looking for outliers, as these would be insightful to the reader. The statistics for Fiscal Year 2015 will also be compared with those for Fiscal Year 2014 to study the trend. Important information, such as the number of employees in a particular department or organization, or doing a particular type of job, will also be showcased and analyzed.
  • 5. CHAPTER 03
    NORMALIZATION OF THE DATA
    3.01 IS THE DATASET NORMALIZED?
    A dataset is usually normalized before analysis to remove redundancy and repetition in the information it contains. Also, a relational database is much easier to analyze and maintain than a non-relational one.
    The dataset analyzed here, although uniform and well broken down, is not normalized. The same values are repeated across many rows. Also, there is no linkage between columns that are intuitively related to each other, e.g. a Job is part of a Job Family, Job Families differ between departments, and departments are categorized into organization groups. All these columns can be related.
    3.02 HOW TO NORMALIZE THE DATASET?
    As mentioned above, the dataset needs to be normalized in order to remove the redundancy from the rows. So,
    1. To normalize the dataset, new tables need to be created and linked with each other through the relations they have. E.g. Table 1 – Organization Group Code and Organization Group Name, because every code has a unique name associated with it. Similar tables can be created for Department, Job Family and Job.
    2. Using the above tables and a fact table, the same dataset can be rebuilt in normalized form using joins in SQL.
    3. A Total Salary table can also be created from the columns Salary, Overtime, Other Salary and Total Salary; in this new table, the total salary column is computed by an SQL expression that adds the other 3 columns. Hence, whenever the values of the other 3 columns change, the total salary is updated automatically. The same can be done for Total Benefits and thus for Total Compensation, as all these values are linked with each other.
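The three normalization steps above can be sketched as follows. This is an illustration under stated assumptions, not the schema used in the project: it runs on SQLite through Python's standard `sqlite3` module rather than on SQL Server, and the table names, column names and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Step 1: a lookup table pairing each code with its unique name.
CREATE TABLE organization_group (
    org_group_code TEXT PRIMARY KEY,
    org_group_name TEXT NOT NULL
);
-- Fact table: stores the code only, plus the component measures.
CREATE TABLE compensation_fact (
    year           INTEGER,
    org_group_code TEXT REFERENCES organization_group(org_group_code),
    salaries       REAL,
    overtime       REAL,
    other_salaries REAL
);
-- Steps 2 and 3: a join rebuilds the flat dataset, and the total is
-- computed from its parts instead of being stored.
CREATE VIEW compensation_flat AS
SELECT f.year, g.org_group_name, f.salaries, f.overtime, f.other_salaries,
       f.salaries + f.overtime + f.other_salaries AS total_salaries
FROM compensation_fact AS f
JOIN organization_group AS g ON g.org_group_code = f.org_group_code;
""")

# Made-up rows for illustration only.
conn.execute("INSERT INTO organization_group VALUES ('G1', 'Example Group')")
conn.execute("INSERT INTO compensation_fact VALUES (2015, 'G1', 1000.0, 200.0, 50.0)")

query = "SELECT org_group_name, total_salaries FROM compensation_flat"
flat_before = conn.execute(query).fetchall()

# Changing a component changes the computed total automatically.
conn.execute("UPDATE compensation_fact SET overtime = 300.0")
flat_after = conn.execute(query).fetchall()
print(flat_before, flat_after)
```

A view is used here so that the total can never drift out of sync with its components; the same pattern would extend to Department, Job Family and Job lookup tables.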
  • 6. CHAPTER 04
    PROBLEMS IN THE DATASET AND DATA CLEANING
    4.01 PROBLEMS IN THE DATASET
    Although the dataset is well organized and maintained by the SF government, it has certain problems which should be fixed to make it better.
    1. The values in the ‘Job Family Code’ and ‘Job Code’ columns are not consistent in format. While most of the codes are numeric, some entries are alphanumeric. This causes problems in data manipulation.
    2. The measure columns such as ‘Salaries’, ‘Benefits’ etc. have many negative entries. Such records should be deleted from the dataset and, if they have any significance, saved in another table for separate analysis. Negative values in these columns make no sense, and they affect the overall analysis (Sum, Average etc.) as well.
    3. Some Job Codes and Job names are the same across different departments. This can create confusion when drawing conclusions about the salaries and compensation for a particular job name unless filters are applied.
    4. There were NULLs in the initial dataset which might have caused serious problems.
    5. As mentioned earlier, the datatypes of the columns were not in a standard format, which could have caused problems while importing the data into any other tool for analysis.
    4.02 IMPROVEMENT AND SUGGESTIONS
    As discussed above, the dataset has a number of issues that could interfere with further analysis, so it was cleaned using Excel and SQL. All the datatypes were corrected in Excel before any operation was done on the table. After truncating the data as needed, it was imported into SQL Server Express and all the Nulls (only present in the dimensions) were replaced by ‘0.00’.
    Apart from fixing the problems present in the dataset, a few additional changes could increase the utility of the table and allow much more information to be extracted.
    1. New columns with the name, age, work experience and work history of the employee can be added to the dataset (for government officials where extracting names is legal).
    2. The columns containing the codes could have been dropped to make the dataset small and tidy. The identifiers could be added back later while normalizing the dataset.
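The cleaning steps described in this chapter (setting negative records aside and replacing NULLs) might look like the sketch below. It is an assumption-laden illustration: it uses SQLite via Python's `sqlite3` instead of SQL Server Express, the table and column names are hypothetical, the rows are made up, and which column actually held NULLs in the original data is not shown in the report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compensation (job TEXT, salaries REAL, total_benefits REAL)")
conn.executemany(
    "INSERT INTO compensation VALUES (?, ?, ?)",
    [
        ("Job A", 50000.0, 12000.0),   # ordinary record
        ("Job B", -300.0, 40.0),       # negative salary entry
        ("Job C", 42000.0, None),      # NULL value (column chosen for illustration)
    ],
)

negative = "salaries < 0 OR total_benefits < 0"

# Keep negative records in a separate table before removing them.
conn.execute(f"CREATE TABLE compensation_negative AS SELECT * FROM compensation WHERE {negative}")
conn.execute(f"DELETE FROM compensation WHERE {negative}")

# Replace the remaining NULLs with 0.00, as the report describes.
conn.execute("UPDATE compensation SET total_benefits = COALESCE(total_benefits, 0.0)")

clean_rows = conn.execute(
    "SELECT job, salaries, total_benefits FROM compensation ORDER BY job"
).fetchall()
negative_rows = conn.execute("SELECT job FROM compensation_negative").fetchall()
print(clean_rows, negative_rows)
```

Copying the negative rows out before deleting them keeps them available for the separate analysis suggested in section 4.01.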
  • 7. CHAPTER 05
    GENERAL STATISTICS OF THE DATA
    A) USING EXCEL
    For our dataset, we first check the number of records for each organization group for the years 2014 and 2015 combined. Here is the output from an Excel pivot table.
    To probe further into the number of records for each department within an organization group, a pivot table was used again; snapshots of the results are attached below –
    a. General City Responsibilities
  • 8. b. Culture and Recreation
    c. General Administration and Finance
    d. Community Health
  • 9. e. Human Welfare and Neighborhood Development
    f. Public Protection
    g. Public Works, Transportation and Commerce
  • 10. B) USING SQL
    The dataset was imported into Microsoft SQL Server Management Studio for initial analysis. A SQL file containing all the queries, with descriptions, is attached with the submission. Snapshots of the top 15 records were taken in SQL, separately for the dimensions and the measures, to give the reader an idea of the data.
    1. Dimensions
    2. Measures
  • 11. Some queries were written and run in SQL. Here are some of the observations –
    1. An initial overview of the data was obtained, e.g. total number of records, number of records in 2014 and in 2015, number and names of distinct organization groups, number of distinct departments, number of distinct job families and number of distinct jobs. It was observed that there are a total of 83946 records, of which 43078 are from the year 2015 and 40686 are from the year 2014. It appears that the number of registered employees in San Francisco increased by 2392 from Fiscal Year 2014 to 2015. Also, it was observed that there are 7 different Organizational Groups, 53 Departments, 55 Job Families and 1068 different job titles for the year 2015 in San Francisco. Here are the snapshots of the output from SQL –
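Overview queries of the kind described above can be sketched as follows. The original SQL file is not reproduced in this transcript, so this is only an illustration: it uses SQLite via Python's `sqlite3`, and the table name, column names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compensation (year INTEGER, org_group TEXT, department TEXT, job TEXT)")
conn.executemany(
    "INSERT INTO compensation VALUES (?, ?, ?, ?)",
    [   # Made-up rows for illustration only.
        (2014, "Group A", "Dept 1", "Clerk"),
        (2014, "Group B", "Dept 2", "Nurse"),
        (2015, "Group A", "Dept 1", "Clerk"),
        (2015, "Group A", "Dept 1", "Analyst"),
        (2015, "Group B", "Dept 2", "Nurse"),
    ],
)

# Number of records per fiscal year.
records_per_year = dict(
    conn.execute("SELECT year, COUNT(*) FROM compensation GROUP BY year").fetchall()
)

# Distinct organization groups, departments and jobs for 2015.
distinct_2015 = conn.execute(
    """
    SELECT COUNT(DISTINCT org_group), COUNT(DISTINCT department), COUNT(DISTINCT job)
    FROM compensation
    WHERE year = 2015
    """
).fetchone()
print(records_per_year, distinct_2015)
```

Comparing the per-year counts against the overall `COUNT(*)` is a quick way to confirm that every record carries one of the expected year values.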
  • 12. 2. A query was written and run in SQL to find the top 10 departments with the largest number of employees in 2015. It was observed that the Public Health Department had the most employees, 9148 for Fiscal Year 2015, followed by the Municipal Transportation Agency with 6427 employees. Here is the output of the query –
    3. The top 10 compensations in the entire database for the year 2015 were extracted with a query. It was observed that the job title ‘Asst Med Examiner’ from the Job Family ‘Med Therapy and Auxiliary’ in the department ‘General Services Agency-City Admin’ under the Organizational Group ‘General Administration & Finance’ has the highest recorded compensation, around $497505.
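The two "top N" queries above might be written along these lines. This is a sketch under assumptions: it runs on SQLite via Python's `sqlite3`, where `LIMIT` plays the role that `TOP` plays in SQL Server, and the table, columns and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compensation (year INTEGER, department TEXT, job TEXT, total_compensation REAL)"
)
conn.executemany(
    "INSERT INTO compensation VALUES (?, ?, ?, ?)",
    [   # Made-up rows for illustration only.
        (2015, "Dept 1", "Job A", 90000.0),
        (2015, "Dept 1", "Job B", 120000.0),
        (2015, "Dept 1", "Job C", 75000.0),
        (2015, "Dept 2", "Job D", 300000.0),
        (2015, "Dept 2", "Job E", 80000.0),
        (2015, "Dept 3", "Job F", 60000.0),
        (2014, "Dept 3", "Job F", 500000.0),  # excluded by the year filter
    ],
)

# Departments ranked by number of records (employees) in 2015.
top_departments = conn.execute(
    """
    SELECT department, COUNT(*) AS employees
    FROM compensation
    WHERE year = 2015
    GROUP BY department
    ORDER BY employees DESC, department
    LIMIT 10
    """
).fetchall()

# Highest individual compensation records in 2015 (top 2 shown here).
top_paid = conn.execute(
    """
    SELECT department, job, total_compensation
    FROM compensation
    WHERE year = 2015
    ORDER BY total_compensation DESC
    LIMIT 2
    """
).fetchall()
print(top_departments, top_paid)
```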
  • 13. 4. The total compensation of an organizational group depends heavily on how many employees it has, so it is a poor indicator of what a typical employee in that group earns. To compare groups, a query was written to find the average compensation of each organizational group. It was observed that the Public Protection group has the maximum average compensation of $144452, and the rest follow the pattern shown in the snapshot of the SQL output –
    5. A similar query was written to pull out the top 10 departments with the highest average compensation. It was observed that the Fire Department had the maximum average compensation of $182231 for fiscal year 2015. Here is the output.
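The difference between ranking groups by their total and by their average compensation can be illustrated with a small example. The sketch below assumes a hypothetical table and made-up rows, and uses SQLite via Python's `sqlite3` rather than SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compensation (year INTEGER, org_group TEXT, total_compensation REAL)")
conn.executemany(
    "INSERT INTO compensation VALUES (?, ?, ?)",
    [   # Made-up rows: Group A is large and lower paid, Group B small and higher paid.
        (2015, "Group A", 50000.0),
        (2015, "Group A", 50000.0),
        (2015, "Group A", 50000.0),
        (2015, "Group A", 50000.0),
        (2015, "Group B", 150000.0),
    ],
)

group_stats = conn.execute(
    """
    SELECT org_group,
           SUM(total_compensation) AS sum_comp,
           AVG(total_compensation) AS avg_comp
    FROM compensation
    WHERE year = 2015
    GROUP BY org_group
    ORDER BY avg_comp DESC
    """
).fetchall()
print(group_stats)
```

In this made-up data, Group A has the larger sum only because it has more records, while Group B has the higher average, which is why the average is the fairer basis for comparing groups.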
  • 14. 6. The record with the maximum total salary was shown for each department, along with its other column information. The output has 53 records, which cannot all be shown here, but the output table looks somewhat like this.
    7. Finally, those departments were pulled out for which the difference between the maximum and minimum salary was greater than $250k for fiscal year 2015. The observation was that the department ‘General Services Agency-City Admin’ under the Organizational Group ‘General Administration & Finance’ has the maximum spread in salaries, with a difference between maximum and minimum of $413272. Here is the output –
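A salary-spread query like the one in item 7 can be expressed with `GROUP BY` and `HAVING`. The following is a minimal sketch assuming a hypothetical table and made-up rows, run on SQLite through Python's `sqlite3`; only the 250000 threshold comes from the report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compensation (year INTEGER, department TEXT, total_salaries REAL)")
conn.executemany(
    "INSERT INTO compensation VALUES (?, ?, ?)",
    [   # Made-up rows for illustration only.
        (2015, "Dept 1", 20000.0),
        (2015, "Dept 1", 300000.0),
        (2015, "Dept 2", 40000.0),
        (2015, "Dept 2", 90000.0),
    ],
)

# Departments whose max-min salary difference exceeds 250k in 2015.
wide_spread = conn.execute(
    """
    SELECT department,
           MAX(total_salaries) - MIN(total_salaries) AS spread
    FROM compensation
    WHERE year = 2015
    GROUP BY department
    HAVING MAX(total_salaries) - MIN(total_salaries) > 250000
    ORDER BY spread DESC
    """
).fetchall()
print(wide_spread)
```

The filter on the spread has to go in `HAVING` rather than `WHERE`, because it applies to the per-department aggregates and not to individual rows.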
  • 15. C) USING TABLEAU
    Excel and SQL gave an overview of the statistics of the table. We now use Tableau, a tool mainly used for data visualization, to get a visual idea of the same statistics.
    1. As done earlier, we visualize the employee count in the year 2015 for departments and organizational groups.
    a. For Organizational Groups
    b. For Departments
  • 16. 2. Here is a visualization of the Average Total Salary for the top 10 department codes in the year 2015.
    3. To get a better idea, we plot a bar graph of the total compensation, total salary and total benefits for the top 10 departments for the year 2015.
  • 17. 4. The trend of average compensation for the organizational groups was studied and plotted in Tableau. It was observed that for two organizational groups, General City Responsibilities and Human Welfare and Neighborhood Development, the average compensation decreased significantly from 2014 to 2015. Here is the plot.
    5. To probe further, we plotted the trend of the employee count and average compensation for just these two organizational groups, broken down by department. It was observed that the number of records/employees decreased significantly for the ‘Human Services’ Department from 2014 to 2015, while there was no significant change in the number of employees in the General Fund Unallocated Department.
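The report produced this trend as a Tableau plot. The same year-over-year comparison could also be computed numerically; the sketch below is such an alternative, not the author's method, and assumes a hypothetical table with made-up rows on SQLite via Python's `sqlite3`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compensation (year INTEGER, org_group TEXT, total_compensation REAL)")
conn.executemany(
    "INSERT INTO compensation VALUES (?, ?, ?)",
    [   # Made-up rows for illustration only.
        (2014, "Group A", 100000.0),
        (2014, "Group A", 120000.0),
        (2015, "Group A", 80000.0),
        (2015, "Group A", 100000.0),
        (2014, "Group B", 70000.0),
        (2015, "Group B", 75000.0),
    ],
)

# Conditional aggregation: AVG skips the NULLs produced for the other year.
yearly_avgs = conn.execute(
    """
    SELECT org_group,
           AVG(CASE WHEN year = 2014 THEN total_compensation END) AS avg_2014,
           AVG(CASE WHEN year = 2015 THEN total_compensation END) AS avg_2015
    FROM compensation
    GROUP BY org_group
    ORDER BY org_group
    """
).fetchall()

# Change in average compensation from 2014 to 2015 per group.
trend = [(group, avg_2015 - avg_2014) for group, avg_2014, avg_2015 in yearly_avgs]
print(trend)
```

A negative value marks a group whose average compensation fell between the two years, which is the pattern the plot highlights for two of the groups.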
  • 18. 6. The plot of average salaries by Job Family Code shows that a single Job Family Code, or Job Name, appears in multiple departments. The visualization stacks the output for the different departments under the same Job Family Code column. Here is a glimpse –
  • 19. CHAPTER 06
    SUMMARY OF THE FINDINGS AND SUGGESTIONS
    6.01 SUMMARY OF THE FINDINGS
    The dataset of San Francisco Employee Compensation for Fiscal Years 2014 and 2015 was analyzed in this project and the following observations were made –
    1. The job title ‘Asst Med Examiner’ from the Job Family ‘Med Therapy and Auxiliary’ in the department ‘General Services Agency-City Admin’ under the Organizational Group ‘General Administration & Finance’ has the highest recorded compensation, around $497505.
    2. The Public Protection group has the maximum average compensation.
    3. The Fire Department had the maximum average compensation for fiscal year 2015.
    4. The department ‘General Services Agency-City Admin’ under the Organizational Group ‘General Administration & Finance’ has the maximum spread in salaries.
    5. For two organizational groups, General City Responsibilities and Human Welfare and Neighborhood Development, the average compensation decreased significantly from 2014 to 2015.
    6. The number of records/employees decreased significantly for the ‘Human Services’ Department from 2014 to 2015, while there was no significant change in the number of employees in the General Fund Unallocated Department.
    6.02 SUGGESTIONS
    Although the dataset has a lot of information about Employee Compensation and its breakdown, it could be made better by including more columns. Apart from normalizing and cleaning the dataset, here are a few suggestions –
    1. A column showing the age of the employee or their work experience could be added, so that more information can be extracted about the distribution of salaries according to a person's experience.
    2. A column showing demographic information about the employee could be added. This would make it possible to see the distribution of salaries across different demographics.
    3. A column showing the qualification of the employee, e.g. PhD or Masters, could be very useful. For a person with a certain qualification who is looking for a job in SF, this data can give an idea of the average salary an employee with that qualification gets in the particular field/department they plan to apply to.
  • 20. 4. A column with a flag showing whether the employee has worked in California before could also be put to good use. Generally, some departments prefer people who have worked in the State before, and there is a difference in the CTC for these employees compared to those who haven't, so this information can also be useful.
    REFERENCES –
    1. Data – https://data.sfgov.org/City-Management-and-Ethics/Employee-Compensation/88g8-5mnd
    2. Picture – http://highincomerealestate.com/wp-content/uploads/2014/09/SanFrancisco2.jpg