Data Source:
The dataset is downloaded from Kaggle.com. The following is the link to the dataset:
https://www.kaggle.com/wsj/college-salaries. All data was obtained from the Wall Street Journal, based on
data from PayScale Inc.
Dataset Description:
The dataset includes three tables:
• Salaries for colleges by type
• Salaries for colleges by region
• Degrees that pay you back
The dataset gives information about the starting salary, mid-level salary and percentage increase
according to the school attended. It also gives information regarding the type of the school attended
and the region in which the school is located. The last table gives information about the median salaries
according to undergraduate major.
The following are the details regarding the columns in the tables:
Salaries for colleges by type:
• This table has 269 rows and 8 columns.
• School name and school type give the name and type of the school. There are 249 unique
school names and 5 school types.
• The remaining columns give the range of salaries of the students who graduated from the
respective schools over a 10-year period.
Degrees that pay you back:
• This table has 50 rows and 8 columns.
• Undergraduate Major has 50 distinct values. The remaining columns give the range of salaries
according to the undergraduate degree.
Salaries for colleges by region:
• This table has 320 rows and 8 columns.
• Region gives the region where the school is located. It has 5 distinct values. The other columns
give the range of salaries by school and region.
Importing data into SQL and Creating Tables:
The tables are downloaded from the source website in csv format. They are imported into SQL by using
the import wizard. While importing the data for the 'salaries for colleges by type' table we had to define
school name and school type as a composite primary key. The reason is that some schools are listed
under more than one type, depending on the courses offered, so school name alone is not unique.
A composite primary key is therefore defined to maintain entity integrity.
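The composite key described above can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module in place of the SQL Server database used in the report, and the table name, column names and rows are hypothetical placeholders, not values from the dataset.

```python
import sqlite3

# In-memory SQLite stands in for the SQL Server database used in the report.
conn = sqlite3.connect(":memory:")

# Hypothetical table and column names (assumption, not the report's schema).
conn.execute(
    """
    CREATE TABLE college_by_type (
        school_name TEXT NOT NULL,
        school_type TEXT NOT NULL,
        starting_median_salary NUMERIC,
        PRIMARY KEY (school_name, school_type)
    )
    """
)

# Invented placeholder rows: "School A" appears under two school types.
rows = [
    ("School A", "Engineering", 60000),
    ("School A", "State", 52000),
    ("School B", "Liberal Arts", 45000),
]
conn.executemany("INSERT INTO college_by_type VALUES (?, ?, ?)", rows)

# Re-inserting an existing (name, type) pair violates the composite key.
try:
    conn.execute("INSERT INTO college_by_type VALUES ('School A', 'State', 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM college_by_type").fetchone()[0]
print(duplicate_rejected, row_count)
```

The same school name may repeat as long as the (school name, school type) pair is unique, which is the property the report relies on.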
After importing the data, new tables are created, and the imported data is inserted into the new tables.
While inserting the data into the new tables, the data types of the columns containing the salary information
are converted from money to decimal.
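The report performs this conversion inside SQL. As a rough Python analogue, the sketch below turns a currency-formatted string into an exact decimal. The input format "$46,000.00" and the helper name money_to_decimal are assumptions made for illustration.

```python
from decimal import Decimal


def money_to_decimal(text):
    """Convert a currency string such as '$46,000.00' to a Decimal."""
    # Strip the currency symbol and thousands separators (assumed format).
    cleaned = text.strip().replace("$", "").replace(",", "")
    return Decimal(cleaned)


print(money_to_decimal("$46,000.00"))
```

Decimal keeps the value exact, which is usually preferable to binary floating point for monetary amounts.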
General statistics using SQL:
After normalizing and inserting the data we get the following tables:
• DBO.DEGREE
• DBO.COLLEGE
• DBO.REGION
We compute the following general statistics using SQL:
• Calculating the average of median salaries by school type:
• Calculating the average of median salaries by region:
• Highest starting median salary by region with college name:
• Highest starting median salary by school type with school name:
• Selecting top 5 undergraduate majors by starting median salary:
• Selecting top 5 undergraduate majors according to percentage change in salaries:
• Adding a category variable according to starting median salary:
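The kinds of queries listed above can be sketched as follows. This is not the report's actual SQL: the example runs against an in-memory SQLite database through Python's sqlite3 module, the table layout and rows are invented, and the 50000 cutoff for the category variable is arbitrary. SQL Server would use TOP 5 where SQLite uses LIMIT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified stand-in for DBO.COLLEGE.
conn.execute(
    "CREATE TABLE college (school_name TEXT, school_type TEXT, starting_median REAL)"
)
conn.executemany(
    "INSERT INTO college VALUES (?, ?, ?)",
    [
        ("School A", "Engineering", 60000),  # invented rows
        ("School B", "Engineering", 56000),
        ("School C", "State", 44000),
        ("School D", "State", 48000),
    ],
)

# Average of starting median salaries by school type.
avg_by_type = conn.execute(
    """
    SELECT school_type, AVG(starting_median)
    FROM college
    GROUP BY school_type
    ORDER BY school_type
    """
).fetchall()

# Highest starting median salary per school type, with the school name.
top_by_type = conn.execute(
    """
    SELECT c.school_type, c.school_name, c.starting_median
    FROM college AS c
    JOIN (
        SELECT school_type, MAX(starting_median) AS max_start
        FROM college
        GROUP BY school_type
    ) AS m
      ON m.school_type = c.school_type AND m.max_start = c.starting_median
    ORDER BY c.school_type
    """
).fetchall()

# Top 2 rows by starting salary (the report selects the top 5 majors).
top_n = conn.execute(
    "SELECT school_name FROM college ORDER BY starting_median DESC LIMIT 2"
).fetchall()

# Category variable derived from the starting salary (arbitrary cutoff).
categories = conn.execute(
    """
    SELECT school_name,
           CASE WHEN starting_median >= 50000 THEN 'High' ELSE 'Low' END
    FROM college
    ORDER BY school_name
    """
).fetchall()

print(avg_by_type, top_by_type, top_n, categories, sep="\n")
```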
Analysis in R:
The database is imported into R for further analysis. The RODBC library is used for establishing a
connection and importing the tables into R.
The three tables from the database are imported and saved in data frames. Also, we run an inner join on
the region and college tables, and the result is imported into R as a data frame named reg.col, using the
sqlQuery function.
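The inner join described above can be illustrated with a small sketch. The report does not state the join key, so joining on the school name is an assumption here, as are the table layouts and the invented rows. SQLite via Python's sqlite3 module stands in for the RODBC connection to SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified versions of the college and region tables.
conn.execute("CREATE TABLE college (school_name TEXT, school_type TEXT)")
conn.execute("CREATE TABLE region (school_name TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO college VALUES (?, ?)",
    [("School A", "Engineering"), ("School B", "State")],
)
conn.executemany(
    "INSERT INTO region VALUES (?, ?)",
    [("School A", "Northeastern"), ("School C", "Western")],
)

# Inner join: only schools present in BOTH tables survive.
reg_col = conn.execute(
    """
    SELECT r.school_name, r.region, c.school_type
    FROM region AS r
    INNER JOIN college AS c ON c.school_name = r.school_name
    ORDER BY r.school_name
    """
).fetchall()

print(reg_col)
```

School B and School C are dropped because each appears in only one of the two tables, which is why a joined frame can have fewer rows than either input.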
The following output gives the summary of the imported tables:
The following output gives the total number of missing values and the missing values by column for the
data frames:
There are no missing values in the degree data frame. There are 88 missing values in the reg.col data
frame.
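Counting missing values in total and by column amounts to the tally sketched below. The rows and column names are invented for illustration; in R, the same per-column count is commonly obtained with colSums(is.na(df)).

```python
# Invented rows; None marks a missing value.
rows = [
    {"school": "School A", "starting": 60000, "mid_p90": None},
    {"school": "School B", "starting": 44000, "mid_p90": 150000},
    {"school": "School C", "starting": None, "mid_p90": None},
]

# Missing values per column, then the overall total.
missing_by_column = {
    col: sum(1 for row in rows if row[col] is None) for col in rows[0]
}
total_missing = sum(missing_by_column.values())

print(missing_by_column, total_missing)
```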
The following are the histograms of the starting median salary and mid-level median salary for the two
data frames:
The histograms give the distribution of the starting median salary and the mid-level median salary in the
two tables.
The following shows the boxplots for the starting median salary and mid-level median salary. The boxplot
depicts the inter-quartile range and shows possible outliers.
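How a boxplot flags possible outliers can be sketched with the usual 1.5 × IQR rule. The salaries below are invented, and statistics.quantiles uses a quartile method that may differ slightly from the one R's boxplot applies, so the fences here are approximate rather than a reproduction of the report's plots.

```python
from statistics import quantiles

# Invented starting salaries, one of them unusually high.
salaries = [40000, 42000, 44000, 46000, 48000, 50000, 52000, 90000]

# Quartiles and inter-quartile range.
q1, q2, q3 = quantiles(salaries, n=4)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as possible outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in salaries if s < lower_fence or s > upper_fence]

print(outliers)
```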
Starting median salary in the degree data frame:
Mid-level median salary in the degree data frame:
Starting median salary in the reg.col data frame:
Mid-level median salary in the reg.col data frame:
The following is a bar graph of the top 7 undergraduate degrees according to the starting median
salaries:
We can see that physician assistant, chemical engineering and computer engineering are the top 3
majors.
A linear regression model for predicting the starting median salary based on the school type and
region is fitted. Below is the summary of the model:
As school type and region are categorical variables, dummy variables are created by R in the
regression model. The coefficients of the categories for the two variables are shown in the summary
output. We can see that all the coefficient estimates except the one for school type Ivy League are significant. The
R-squared value is 57%, which means that 57% of the variance in the response variable is explained by the
predictor variables.
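Dummy coding and R-squared can be illustrated on a deliberately simplified case with a single categorical predictor. In that special case the least-squares fit reduces to group means: the intercept is the mean of the reference level and each dummy coefficient is the difference from it. The report's model uses two predictors, where this shortcut does not apply, and the data below are invented.

```python
from statistics import mean

# Invented (school_type, starting salary) pairs.
data = [
    ("Engineering", 60000), ("Engineering", 56000),
    ("State", 44000), ("State", 48000),
    ("Liberal Arts", 45000), ("Liberal Arts", 47000),
]

# R orders factor levels alphabetically by default; the first level is the
# reference and is absorbed into the intercept.
levels = sorted({school_type for school_type, _ in data})
reference = levels[0]


def dummy_row(school_type):
    """Treatment coding: one 0/1 indicator per non-reference level."""
    return [1 if school_type == level else 0 for level in levels[1:]]


# With one categorical predictor, OLS fitted values are the group means.
group_mean = {lvl: mean(y for t, y in data if t == lvl) for lvl in levels}
intercept = group_mean[reference]
coefficients = {lvl: group_mean[lvl] - intercept for lvl in levels[1:]}

# R-squared = 1 - residual sum of squares / total sum of squares.
y = [salary for _, salary in data]
y_bar = mean(y)
fitted = [group_mean[t] for t, _ in data]
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(reference, intercept, coefficients, round(r_squared, 3))
```

Each coefficient is read relative to the reference level, which is why one category of every factor never appears in the summary output.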
Also, a linear regression model, with the mid-level median salary as the response and the starting median
salary and school type as the predictors, is developed. Below is the summary of the model:
All the coefficient estimates are significant. The p-value for the F-statistic also suggests that there is a
linear relationship between the response variable and the predictor variables. The R-squared is 85%,
which means that 85% of the variance in the response variable can be explained by the predictors.
Visualization in Tableau:
Below is the bar graph of the average salaries according to school type:
Below is the bar graph of the average salaries according to region:
Below is the scatter plot of starting median salaries against mid-level median salaries according to
school type:
Below is the text plot of the school names: the size varies according to the average of mid-level
median salaries and the color varies according to the average of starting median salaries.
Summary:
From the data we can see that the starting salaries vary significantly according to college type. But the
increased earning power shows less disparity. After 10 years, graduates of Ivy League schools earned
99% more than they did at graduation. Party school graduates saw an 85% increase. Engineering school
graduates had the least growth, earning 76% more 10 years after school.
Midwest college graduates tend to earn the lowest salary both at graduation and at mid-career,
according to the PayScale Inc. survey. Graduates of schools in the Northeast and California fared best.
The data shows that graduates of majors like Philosophy and International Relations earned 103.5% and
97.8% more, respectively, about 10 years post-commencement. Majors that didn't show as much salary
growth include Nursing and Information Technology.
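The growth figures quoted above are percentage increases from the starting salary to the mid-career salary. A minimal sketch of that arithmetic follows; the function name and the dollar amounts are invented for illustration and are not taken from the dataset.

```python
def percent_increase(starting, mid_career):
    """Percentage growth from the starting salary to the mid-career salary."""
    return (mid_career - starting) / starting * 100


# Invented amounts: a salary that grows from 50000 to 99500.
growth = percent_increase(50000, 99500)
print(round(growth))
```

An increase close to 100% therefore means the mid-career salary is roughly double the starting salary.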
Challenges:
While importing the csv files into SQL, I faced errors related to data type and delimiter. I converted the source
files into xlsx format and then imported them without any errors. Also, a regression model involving
categorical variables creates dummy variables and assigns coefficient estimates to them. It becomes
a little confusing when there are multiple categorical variables as predictors.
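The report does not say what exactly caused the delimiter errors. One plausible scenario, shown here purely as an assumption, is a salary field that itself contains commas: splitting on every comma miscounts the columns, while a parser that honors quoting does not. The header and row below are invented.

```python
import csv
import io

# Invented CSV text: the quoted salary field contains a comma.
raw = 'School Name,Starting Median Salary\n"School A","$60,000.00"\n'

# Naive split on every comma breaks the salary into two pieces.
naive_fields = raw.splitlines()[1].split(",")

# csv.reader respects the quotes and keeps the salary as one field.
parsed_fields = list(csv.reader(io.StringIO(raw)))[1]

print(len(naive_fields), parsed_fields)
```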
