
What is Data Science?

Data Science is the art of breaking down large amounts of complex data and extracting meaningful information from them. It combines tools and methods from several fields, including mathematics, statistics and machine learning.

EmergenTeck – Kausal Vikash Data Science training Pune is an integrated program in Data Science and Machine Learning, designed exclusively for professionals pursuing careers in Data Science, Artificial Intelligence, Data Visualization, Statistics and Analytics.

Our customized Data Science and Machine Learning training courses in Pune are designed by senior Data Scientists who have vast industry experience. This master program will help you achieve proficiency in data analysis, machine learning algorithms, decision trees, regression models, K-Means clustering, deploying R statistical computing and connecting R with the Hadoop framework, and lets you work on real-world projects and case studies. These skills will help you prepare for the role of Data Scientist.
www.kausalvikash.in

Published in: Technology
What is Data Science?

  13. Data
  14. Data and its types
      • What is data? Raw, unorganized facts that need to be processed.
      • What is information? Processed, organized, structured data that is useful.
      Data is plain facts; once processed, organized, structured or presented, it becomes useful information.
      Types of data
      • Structured – information with a degree of organization that makes it readily searchable and quick to consolidate into facts. Examples: RDBMS, spreadsheet
      • Unstructured – information that lacks structure, so searching it and consolidating it into facts takes time and effort. Examples: email, documents, images, reports
  15. Facts about data
      • Data is growing at an incredible rate
      • Gartner and IDC state that data is doubling every 18 months
      • An estimated 4 zettabytes of data exist in the world
      • If the trend continues, by 2020 data will exceed 40 zettabytes
      • 1 zettabyte = 1 billion terabytes
  16. What is causing the data explosion?
      • Internet
        – Connecting everything to everyone
        – Billions of people to billions of devices
        – Online shopping (Amazon, Wal-Mart, eBay, BestBuy)
        – File sharing (Dropbox, Google Drive, iCloud, SkyDrive)
      • Social media
        – Facebook
        – Google+
        – Twitter
        – YouTube
      • Store everything, delete nothing, keep multiple copies of it all
  17. This is where Data Science comes into the picture
  28. 2. Exploratory Data Analysis
      • Remember, the quality of your inputs decides the quality of your output. So once your business hypothesis is ready, it makes sense to spend a lot of time and effort here. Data exploration, cleaning and preparation can take up to 70% of your total project time.
      • Below are the steps involved in understanding, cleaning and preparing your data for building your predictive model:
        – Variable identification
        – Univariate analysis
        – Bivariate analysis
        – Missing values treatment
        – Outlier treatment
        – Variable transformation
        – Variable creation
  29. Variable Identification: Types of Data Variables
      • Data consists of a combination of "variables", which contain the actual values
      • At a high level, variables are of two types, depending on the kind of values they store:
        – Numerical / Quantitative
        – Categorical / Qualitative
      Numerical / Quantitative variables
      • Discrete
        – Arises from counting
        – Can take only a set of particular values, which may include negative and fractional values
        – Examples: credit score, number of credit cards owned by a person, number of states in a country, charge on an electron
      • Continuous
        – Arises from measuring
        – Can take any value within a specified range
        – Examples: height, amount of money
      Categorical / Qualitative variables
      • Binary (or Dichotomous)
        – Has only two categories
        – Examples: yes/no, male/female, pass/fail
      • Nominal
        – Has several unordered categories
        – Examples: type of bank account, type of insurance policy
      • Ordinal
        – Has several ordered categories
        – Examples: questionnaire responses such as "strongly in favour / … / strongly against"
      Interval: numerical data measured on a scale with fixed, equal intervals between values
      Ratio: similar to interval but with a true zero; when the variable equals 0, there is none of that quantity
  30. Variable Identification: Types of Data Variables - Summary
      Data (consists of variables)
      • Numerical
        – Discrete: arises from counting
        – Continuous: arises from measuring
      • Categorical
        – Nominal: several unordered categories
        – Dichotomous or Binary: only two categories
        – Ordinal: several ordered categories
  31. Variable Identification: Case Study
      • Romanov, an Analytics consultant, works with Credit One bank. His manager gave him a list with the names of the bank's customers and asked him to pull information about those customers from the bank's database. The information concerns the credit cards issued by the bank. He needs to define the variable types and the type of value each variable will contain. Romanov, who has just started his professional career, does not have a good idea of the different variable types.
      • Now suppose that, after extracting the data, he approaches you and asks for your help in categorizing the different variables. Help Romanov with variable categorization.
  32. Variable Identification: Data snapshot
      Sl# | Name of Customer | Customer ID | Number of Credit Cards | Age of Customer | Gender of the Customer | Marital Status of the Customer | Annual Salary (in USD) | Monthly Credit Card Usage
      1 | Josh | 111669 | 5 | 42 | F | Never Married | 88,001 | Low
      2 | Janice | 146861 | 6 | 25 | F | Married | 592,489 | Low
      3 | Dandre | 171690 | 3 | 50 | M | Divorced | 272,304 | Low
      4 | Aiden | 161721 | 6 | 37 | M | Married | 726,593 | Low
      5 | Celine | 170359 | 7 | 50 | F | Never Married | 612,075 | Low
      6 | Emilio | 175646 | 5 | 41 | M | Never Married | 490,356 | Low
      7 | Joaquin | 180732 | 2 | 62 | F | Divorced | 164,732 | Low
      8 | Justus | 113136 | 7 | 26 | F | Never Married | 510,321 | Low
      9 | Chaya | 169254 | 4 | 24 | M | Never Married | 358,534 | Low
      10 | Justyn | 149771 | 4 | 35 | M | Married | 140,400 | Low
      11 | Jadon | 166226 | 7 | 36 | M | Never Married | 105,259 | Low
  33. Variable Identification
      Variable Name | Value Stored | Variable Type
      Name of Customer | ? | ?
      Customer ID | ? | ?
      Number of Credit Cards | ? | ?
      Age of Customer | ? | ?
      Gender of Customer | ? | ?
      Marital Status of Customer | ? | ?
      Annual Salary | ? | ?
      Monthly Credit Card Usage | ? | ?
      Remarks: Information to be extracted by Romanov.
  34. Variable Identification: Types of Data Variables
      Variable Name | Value Stored | Variable Type
      Name of Customer | Name of the individual customer | ?
      Customer ID | Unique identifier | ?
      Number of Credit Cards | 1, 2, 3… | ?
      Age of Customer | 18, 19, 20… | ?
      Gender of Customer | Male / Female | ?
      Marital Status of Customer | Married / Divorced / Never Married | ?
      Annual Salary | Amount | ?
      Monthly Credit Card Usage | Low (<25%) / Medium (<50%) / High (<75%) / Very High (>75%) | ?
  35. Variable Identification: Types of Data Variables
      Variable Name | Value Stored | Variable Type | Remarks
      Name of Customer | Name of the individual customer | -- | Identifier
      Customer ID | Unique identifier | -- | Identifier
      Number of Credit Cards | 1, 2, 3… | Numerical (Discrete) | Arises from counting. Takes certain discrete values in a given range
      Age of Customer | 18, 19, 20… | Numerical (Discrete) | Arises from counting. Takes certain discrete values in a given range
      Gender of Customer | Male / Female | Categorical (Binary) | Only two categories
      Marital Status of Customer | Married / Divorced / Never Married | Categorical (Nominal) | Several unordered categories
      Annual Salary | Amount | Numerical (Continuous) | Takes many values in a given range
      Monthly Credit Card Usage | Low (<25%) / Medium (<50%) / High (<75%) / Very High (>75%) | Categorical (Ordinal) | Several ordered categories
  36. 2. Univariate Analysis
      • Romanov, an Analytics consultant, works with Credit One bank. His manager gave him some credit card data covering the number of credit cards issued to a set of customers and the credit limit of the cards. He has been tasked with summarizing the data in a presentable form and preparing a report. Romanov, who has just started his professional career, has never worked with this kind of data, so he is clueless about the different summarizing techniques.
      • Now suppose he approaches you and asks for your help in preparing the report. Help Romanov summarize the data and prepare the report.
  37. Univariate Analysis: Continuous Variables
  38. Univariate Analysis: Measure of Central Tendency/Location
      • There are a number of different quantities which can be used to estimate the central point of a sample.
      • These are called measures of central tendency, or measures of location.
      • They are just different ways of calculating the "average" value of a dataset
      • These are:
        – Mean
        – Median
        – Mode
  39. Univariate Analysis: Measure of Central Tendency/Location - Mean
      • By far the most common measure for describing the location of a set of data is the mean.
      • For a set of observations denoted by x1, x2, …, xn the mean is defined by
        <x> = (x1 + x2 + … + xn) / n (also denoted by x-bar, i.e. x̄).
      • For a frequency distribution with values x1, x2, …, xn and corresponding frequencies f1, f2, …, fn it is defined as
        <x> = (f1 * x1 + f2 * x2 + … + fn * xn) / (f1 + f2 + … + fn).
      • Illustration 2.4: Calculating the mean for a sample of 3000 individuals having credit cards.
        1. Using Excel function for granular data
        2. Using Excel function for frequency distribution table
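The two mean formulas can be sketched in Python as a rough alternative to the Excel illustration. The frequency table is the one given for the 3000-customer credit card sample on the mode slide; the function names are illustrative only.

```python
# Frequency table from the credit card sample: number of cards -> number of customers.
card_counts = {1: 150, 2: 300, 3: 450, 4: 660, 5: 540,
               6: 300, 7: 240, 8: 150, 9: 120, 10: 90}


def mean_granular(values):
    """<x> = (x1 + x2 + ... + xn) / n for individual observations."""
    return sum(values) / len(values)


def mean_from_frequencies(freq):
    """<x> = sum(fi * xi) / sum(fi) for a {value: frequency} table."""
    return sum(x * f for x, f in freq.items()) / sum(freq.values())


# Expanding the table back into 3000 individual observations gives the same mean.
granular = [x for x, f in card_counts.items() for _ in range(f)]

print(mean_from_frequencies(card_counts))  # 4.7
print(mean_granular(granular))
```

Both calls should agree, since the frequency formula is just the granular formula with equal values grouped together.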
  40. Univariate Analysis: Measure of Central Tendency/Location - Median
      • Another useful measure of location.
      • The median is a value which splits the data set into two equal halves,
      • so that half the observations are less than the median and half are greater than the median.
      • If n is odd, the median is the middle observation, i.e. the ((n + 1) / 2)th observation.
      • If n is even, the median is the midpoint of the middle two observations, i.e. the average of the (n / 2)th and (n / 2 + 1)th observations.
      • One potential advantage of the median for certain data sets is that it is robust, or resistant, to the effects of extreme observations.
      • Illustration 2.5: Calculating the median for a sample of 3000 individuals having credit cards, along with a demonstration of extreme observations.
  41. Univariate Analysis: Measure of Central Tendency/Location - Median
      Median #Cards = 4
      1. Using Excel function for granular data
      2. For summarized data in the form of a frequency table
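As a rough Python counterpart to the Excel steps, the sketch below computes the median both from granular data and directly from a frequency table. The table is the 3000-customer credit card sample; the short `typical` and `extreme` lists are made-up values used only to show robustness to extreme observations.

```python
card_counts = {1: 150, 2: 300, 3: 450, 4: 660, 5: 540,
               6: 300, 7: 240, 8: 150, 9: 120, 10: 90}


def median_granular(values):
    """Median of individual observations."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: the middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: midpoint of the middle two


def median_from_frequencies(freq):
    """Median from a {value: frequency} table, without expanding it."""
    n = sum(freq.values())
    # 1-based positions of the middle observation(s).
    positions = [(n + 1) // 2] if n % 2 == 1 else [n // 2, n // 2 + 1]
    middle = []
    cumulative = 0
    for value in sorted(freq):
        cumulative += freq[value]
        while len(middle) < len(positions) and positions[len(middle)] <= cumulative:
            middle.append(value)
    return sum(middle) / len(middle)


print(median_from_frequencies(card_counts))  # 4.0

# One extreme observation moves the mean a lot but leaves the median unchanged.
typical = [2, 3, 4, 5, 6]      # hypothetical values
extreme = [2, 3, 4, 5, 600]    # same data with one extreme observation
print(median_granular(typical), median_granular(extreme))
print(sum(typical) / len(typical), sum(extreme) / len(extreme))
```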
  42. Univariate Analysis: Measure of Central Tendency/Location - Mode
      • A third measure of location is the mode.
      • Defined as the value which occurs with the greatest frequency, or the most typical value.
      • Illustration 2.6: Finding the mode for a sample of 3000 individuals having credit cards.
      • Excel has an inbuilt function "Mode" for granular data
      • For summarized data it can be found easily by visual inspection
      Tabular representation
      Number of Credit Cards | #Customers
      1 | 150
      2 | 300
      3 | 450
      4 | 660
      5 | 540
      6 | 300
      7 | 240
      8 | 150
      9 | 120
      10 | 90
      Mode = 4, i.e. the highest number of individuals have 4 cards
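The "visual inspection" of the frequency table amounts to picking the value with the largest count. A minimal Python sketch follows; the granular list reuses the Number of Credit Cards column from the 11-row data snapshot slide.

```python
from collections import Counter

# Summarized data: the mode is the key with the highest frequency.
card_counts = {1: 150, 2: 300, 3: 450, 4: 660, 5: 540,
               6: 300, 7: 240, 8: 150, 9: 120, 10: 90}
mode_from_table = max(card_counts, key=card_counts.get)
print(mode_from_table, card_counts[mode_from_table])  # 4 660

# Granular data: count the values first, then take the most frequent one.
snapshot_cards = [5, 6, 3, 6, 7, 5, 2, 7, 4, 4, 7]
mode_value, mode_count = Counter(snapshot_cards).most_common(1)[0]
print(mode_value, mode_count)  # 7 3
```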
  43. Univariate Analysis: Measure of Spread
      • After Romanov presented the summarized data along with the "measures of central tendency" to his manager at Credit One, he was asked to add the various measures of spread to the report.
      • Romanov, being unaware of the term "measures of spread", again approached you and asked for your help. Help Romanov carry out his task.
  44. Univariate Analysis: Measure of Spread
      • The central tendency of a data set is usually the main feature of interest.
      • Another feature of interest is the spread (or variability, dispersion or scatter),
      • meaning how widely spread the data are about the mean (or another measure of location).
      • The different measures of spread are:
        – Variance and Standard Deviation
        – The Range
        – The Interquartile range
  45. Univariate Analysis: Measure of Spread - Variance and Standard Deviation
      • The most commonly used measure of spread is the standard deviation.
      • Essentially it is a measure of how far, on average, the observations are from the mean.
      • For a data set having values x1, x2, …, xn (or xi where i = 1, 2, …, n) and mean <x>, the variance is calculated as
        – For granular data: Variance (σ²) = ∑(xi - <x>)² / n
        – For a summarized frequency table: Variance (σ²) = ∑{fi * (xi - <x>)²} / n, where n = ∑fi
      • The standard deviation is the positive square root of the variance, denoted by σ
      • For a sample, the variance is calculated as
        – Variance (s²) = ∑(xi - <x>)² / (n - 1)
      • Dividing by (n - 1) makes the sample variance an unbiased estimator of the population variance. We will look into the details in a later part of the course
      • Illustration 2.7: Calculating the variance and standard deviation for a sample of 3000 individuals having credit cards
      • Exercise: Do the algebra to make sure that the above-mentioned formulae for the variance are equivalent.
  46. Univariate Analysis: Measure of Spread - Variance and Standard Deviation (Using MS Excel)
      1. Using Excel function for granular data
      2. For summarized data in the form of a frequency table
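Outside Excel, the variance formulas for a frequency table can be sketched as below, using the 3000-customer credit card table. The `sample` flag switches between dividing by n and by n - 1. The last lines check the common "mean of squares minus square of the mean" shortcut, which is not on the slides and is included only as a numerical sanity check.

```python
import math

card_counts = {1: 150, 2: 300, 3: 450, 4: 660, 5: 540,
               6: 300, 7: 240, 8: 150, 9: 120, 10: 90}


def mean_from_frequencies(freq):
    return sum(x * f for x, f in freq.items()) / sum(freq.values())


def variance_from_frequencies(freq, sample=False):
    """sum(fi * (xi - <x>)^2) divided by n (population) or n - 1 (sample)."""
    n = sum(freq.values())
    mean = mean_from_frequencies(freq)
    squared_deviations = sum(f * (x - mean) ** 2 for x, f in freq.items())
    return squared_deviations / (n - 1 if sample else n)


population_var = variance_from_frequencies(card_counts)
sample_var = variance_from_frequencies(card_counts, sample=True)
std_dev = math.sqrt(population_var)
print(population_var, sample_var, std_dev)

# Sanity check: the population variance also equals mean(x^2) - mean(x)^2.
n = sum(card_counts.values())
mean_of_squares = sum(f * x * x for x, f in card_counts.items()) / n
shortcut_var = mean_of_squares - mean_from_frequencies(card_counts) ** 2
print(shortcut_var)
```

For this table the population variance comes out at about 4.69 and the standard deviation at about 2.17; with n = 3000 the sample variance is only marginally larger.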
  47. Univariate Analysis: Measure of Spread - Range
      • The range is a very simple measure of spread defined, as its name suggests, by the difference between the largest and smallest observations in the data set.
      • Range = max(xi) – min(xi)
      • A poor measure of the spread of the data, as it relies on the extreme values,
      • which aren't necessarily representative of the data as a whole.
      • Illustration 2.8: Calculating the range for a sample of 3000 individuals having credit cards
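A short sketch of the range, and of why it depends so heavily on the extremes. The first list is the Number of Credit Cards column from the data snapshot slide; the extra value 40 is a made-up extreme observation.

```python
def value_range(values):
    """Range = max(xi) - min(xi)."""
    return max(values) - min(values)


snapshot_cards = [5, 6, 3, 6, 7, 5, 2, 7, 4, 4, 7]
print(value_range(snapshot_cards))  # 5

# A single (hypothetical) extreme observation changes the range completely.
with_extreme = snapshot_cards + [40]
print(value_range(with_extreme))    # 38
```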
  48. Univariate Analysis: Measure of Spread - Interquartile Range
      • Similar to the range but not affected by the data extremes.
      • Just as the median divides a set of data into two halves, the quartiles divide a set of data into four quarters. They are denoted by Q1, Q2 and Q3.
      • Q2 is just the median, while Q1 is called the lower quartile and Q3 the upper quartile.
      • Q1 can be defined as the ((n + 2) / 4)th observation counting from below and Q3 as the same counting from above, with relevant interpolation if needed.
      • The interquartile range is defined as Q3 − Q1.
      • Illustration 2.9: Calculating the interquartile range for a sample of 3000 individuals having credit cards
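The quartile definition above can be sketched as follows, assuming linear interpolation between neighbouring observations when (n + 2) / 4 is not a whole number. Other quartile conventions exist and may give slightly different values, so treat this as one possible reading of the slide rather than a standard implementation.

```python
card_counts = {1: 150, 2: 300, 3: 450, 4: 660, 5: 540,
               6: 300, 7: 240, 8: 150, 9: 120, 10: 90}


def observation_at(ordered, position):
    """Value at a 1-based, possibly fractional position, linearly interpolated."""
    lower = int(position)
    fraction = position - lower
    value = ordered[lower - 1]
    if fraction > 0 and lower < len(ordered):
        value += fraction * (ordered[lower] - ordered[lower - 1])
    return value


def quartiles(values):
    """Q1 and Q3 as the ((n + 2) / 4)th observation from below and from above."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    position = (n + 2) / 4
    q1 = observation_at(ordered, position)
    q3 = observation_at(ordered[::-1], position)  # same position, counted from above
    return q1, q3


granular = [x for x, f in card_counts.items() for _ in range(f)]
q1, q3 = quartiles(granular)
print(q1, q3, q3 - q1)  # 3.0 6.0 3.0
```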
  49. Univariate Analysis: Symmetry and skewness of data
      • Romanov was appreciated after he presented the summarized data along with the "measures of central tendency" and "measures of spread" to his manager at Credit One. But he was then asked to create an illustration of the symmetry and skewness of data, and after that to carry out the analysis of the credit card data.
      • Romanov, being unaware of the term "symmetry and skewness", again approached you and asked for your help. In return he promised to gift you a bottle of Champagne. Help Romanov carry out his task.
  50. Univariate Analysis: Symmetry and skewness
      • It deals with the shape of the distribution of a data set, that is, whether it is symmetric or skewed to one side or the other.
      • The approximate shape of a distribution can be determined by looking at a histogram.
      • Illustration 2.9: Calculating the mean, median, mode and variance for symmetric and skewed data.
      [Figure: three histograms labelled Symmetrical, Positively Skewed and Negatively Skewed]
  51. Univariate Analysis: Symmetry and skewness
  52. Univariate Analysis: Symmetry and skewness
      • Symmetrical: Mean = Median = Mode
      • Positively Skewed: Mean > Median > Mode
      • Negatively Skewed: Mean < Median < Mode
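A minimal sketch of the rule of thumb above. As a simplifying assumption it compares only the mean and the median (the mode is printed for reference), and the three small data sets are made up to show each case.

```python
import statistics


def skew_direction(values, tolerance=1e-9):
    """Classify the shape by comparing the mean with the median (simplified rule)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if abs(mean - median) <= tolerance:
        return "symmetrical"
    return "positively skewed" if mean > median else "negatively skewed"


# Hypothetical data sets, one per case.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]   # mean = median = mode = 3
right_tail = [1, 1, 1, 2, 2, 3, 10]       # mean > median > mode
left_tail = [1, 8, 9, 9, 10, 10, 10]      # mean < median < mode

for data in (symmetric, right_tail, left_tail):
    print(statistics.mean(data), statistics.median(data),
          statistics.mode(data), skew_direction(data))
```

For the credit card sample, the mean (4.7) is above the median (4), which by this rule suggests a mild positive skew.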
  53. Univariate Analysis: Summarizing Data - Frequency distribution
      • A technique to summarize categorical/discrete data
      • A simple process which involves counting distinct categorical/discrete values
      • The representation can be either tabular or graphical
      • Example: Number of credit cards owned in a sample of 3000 individuals
      Tabular representation
      Number of Credit Cards | #Customers
      1 | 150
      2 | 300
      3 | 450
      4 | 660
      5 | 540
      6 | 300
      7 | 240
      8 | 150
      9 | 120
      10 | 90
      Graphical representation - Bar Chart
      [Figure: bar chart "Freq Distribution - #Cards vs. #Customers"; x-axis #Cards (1 to 10), y-axis #Customers (0 to 700)]
  54. Univariate Analysis: Summarizing Data - Frequency distribution (Using MS Excel)
      [Figure: Excel worksheet with the "Number of Credit Cards" column and the resulting #Customers bar chart, annotated with steps 1 to 7]
      4. Press "Ctrl+Shift+Enter"
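Counting distinct values can also be done in a few lines of Python. The sketch below uses the Number of Credit Cards column from the 11-row data snapshot slide and prints a crude text bar chart in place of the Excel chart.

```python
from collections import Counter

# Number of credit cards for the 11 customers in the data snapshot.
snapshot_cards = [5, 6, 3, 6, 7, 5, 2, 7, 4, 4, 7]

frequency = Counter(snapshot_cards)

# Tabular representation plus a simple text bar per value.
print("#Cards  #Customers")
for cards in sorted(frequency):
    print(f"{cards:>6}  {frequency[cards]:>10}  {'#' * frequency[cards]}")
```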
  55. Univariate Analysis: Summarizing Data - Grouped Frequency distribution
      • A technique to summarize continuous data, or discrete data having a large number of observations and an extended range
      • A simple process which involves counting the values falling under the different intervals (groups)
      • Example and illustration 2.2: Number of customers falling under different salary groups
      Graphical representation - Bar Chart
      [Figure: bar chart "Freq Distribution - Salary Band vs. #Customers"; x-axis Salary Band, y-axis #Customers (0 to 120)]
  56. Univariate Analysis: Summarizing Data - Grouped Frequency distribution
      [Figure: two bar charts of #Customers (0 to 120) by salary band, with axis labels running from 0-75000 to 950001-975000, annotated with steps 1 to 5]
      1. Press "Ctrl+Shift+Enter"
      4. From "Edit" select the salary bands as the horizontal axis
      5. Observe the difference between the horizontal axes of the two charts
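Grouping into bands can be sketched as below. The salaries are the Annual Salary column from the 11-row data snapshot (the first value is read as 88,001), and the band width of 250,000 is an assumption chosen to keep the output short; it is not the banding used in the slide's chart.

```python
from collections import Counter

# Annual salaries (USD) of the 11 customers in the data snapshot.
salaries = [88001, 592489, 272304, 726593, 612075, 490356,
            164732, 510321, 358534, 140400, 105259]

BAND_WIDTH = 250_000  # assumed band width, for illustration only


def salary_band(salary, width=BAND_WIDTH):
    """Return the (lower, upper) bounds of the band containing the salary."""
    lower = (salary // width) * width
    return lower, lower + width - 1


grouped = Counter(salary_band(s) for s in salaries)

for (lower, upper), customers in sorted(grouped.items()):
    print(f"{lower:>7}-{upper:<7} {customers}")
```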
  57. Univariate Analysis: Summarizing Data - Cumulative Frequency distribution
      • Cumulative frequencies are obtained by accumulating the frequencies to give the total number of observations up to and including the value or group in question.
      • Example and illustration 2.3: Cumulative number of cards in the sample of 3000 individuals
      Tabular representation
      Number of Credit Cards (up to) | Cumulative #Customers
      1 | 150
      2 | 450
      3 | 900
      4 | 1560
      5 | 2100
      6 | 2400
      7 | 2640
      8 | 2790
      9 | 2910
      10 | 3000
      Graphical representation
      [Figure: line chart of Cumulative #Customers (0 to 3000) against #Cards (0 to 10)]
  58. Univariate Analysis: Summarizing Data - Cumulative Frequency distribution (Using MS Excel)
      [Figure: Excel worksheet and chart of Cumulative #Customers, annotated with steps 1 to 5]
      3. Observe the last entry. It is equal to the total number of observations
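The cumulative column is a running total of the frequency column. A minimal sketch using `itertools.accumulate` on the credit card frequency table:

```python
from itertools import accumulate

card_counts = {1: 150, 2: 300, 3: 450, 4: 660, 5: 540,
               6: 300, 7: 240, 8: 150, 9: 120, 10: 90}

values = sorted(card_counts)
cumulative = list(accumulate(card_counts[v] for v in values))

for value, running_total in zip(values, cumulative):
    print(f"up to {value:>2} cards: {running_total}")

# The last entry equals the total number of observations.
print(cumulative[-1] == sum(card_counts.values()))
```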
  59. Univariate Analysis: Summarizing Data - Line Plots
      • Line plot diagram
      • Not suitable for large data. Hence, not extensively used in industry.
      • Illustration: Given the test scores of 20 students, represent them using a line plot diagram
      Sl# | Score | Score (Sorted)
      1 | 50 | 20
      2 | 20 | 20
      3 | 50 | 20
      4 | 50 | 30
      5 | 50 | 30
      6 | 30 | 30
      7 | 30 | 30
      8 | 40 | 30
      9 | 30 | 40
      10 | 40 | 40
      11 | 30 | 40
      12 | 20 | 40
      13 | 50 | 40
      14 | 40 | 50
      15 | 20 | 50
      16 | 30 | 50
      17 | 40 | 50
      18 | 40 | 50
      19 | 50 | 50
      20 | 50 | 50
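A line plot places one mark per observation above its value. The sketch below renders the 20 test scores from the table as a text version, one row per score; the layout is just one simple way to draw it.

```python
from collections import Counter

# Test scores of the 20 students, in the original (unsorted) order.
scores = [50, 20, 50, 50, 50, 30, 30, 40, 30, 40,
          30, 20, 50, 40, 20, 30, 40, 40, 50, 50]

counts = Counter(scores)

# One "x" per student who obtained that score.
lines = [f"{score:>3} | " + " ".join(["x"] * counts[score])
         for score in sorted(counts)]
print("\n".join(lines))
```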
  60. Visualization Techniques: Histograms
  61. Bar Charts
      • Definition: A bar graph is a chart that uses either horizontal or vertical bars to show comparisons among categories
      • Types of bar charts:
  62. Single bar charts
      • Single bar graphs are used to convey the discrete value of the item for each category shown on the opposing axis.
  63. Horizontal Bar chart
      • It is also possible to draw bar charts so that the bars are horizontal; the longer the bar, the larger the value for that category
  64. Grouped Bar Chart
      • A grouped or clustered bar graph is used to represent discrete values for more than one item that share the same category.
  65. Pie Chart
      • A pie chart is a type of graph in which a circle is divided into sectors that each represent a proportion of the whole
  66. Visualization Techniques
  67. Box Plots
  75. 3. Bivariate Analysis
      • Bivariate analysis finds out the relationship between two variables. Here, we look for association and disassociation between variables at a pre-defined significance level.
      • The combination can be: Categorical & Categorical, Categorical & Continuous, and Continuous & Continuous. Different methods are used to tackle these combinations during the analysis process.
      • Categorical & Categorical: To find the relationship between two categorical variables, we can use the following method:
        – Chi-Square Test: This test is used to derive the statistical significance of the relationship between the variables. It returns the probability of the computed chi-square statistic for the given degrees of freedom.
          Probability of 0: indicates that both categorical variables are dependent
          Probability of 1: shows that both variables are independent
          Probability less than 0.05: indicates that the relationship between the variables is significant at 95% confidence
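A minimal sketch of the chi-square test of independence, using only the standard library. The statistic is computed for any contingency table; the probability is computed only for the 2x2 case (1 degree of freedom), where it reduces to the complementary error function, and no continuity correction is applied. Both tables are hypothetical counts of gender against card usage.

```python
import math


def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def p_value_2x2(table):
    """Probability for a 2x2 table (1 degree of freedom, no continuity correction)."""
    return math.erfc(math.sqrt(chi_square_statistic(table) / 2))


# Hypothetical counts: rows = gender, columns = low / high card usage.
related = [[30, 20],
           [20, 30]]
unrelated = [[10, 20],
             [20, 40]]

print(chi_square_statistic(related), p_value_2x2(related))
print(chi_square_statistic(unrelated), p_value_2x2(unrelated))
```

The first table gives a statistic of 4.0 and a probability just under 0.05, so the relationship would be called significant at 95% confidence; the second table has identical row proportions, so the probability is 1.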
  76. Bivariate Analysis
      • Categorical & Continuous: While exploring the relationship between categorical and continuous variables, we can draw box plots for each level of the categorical variable.
        – Z-Test / T-Test: Either test assesses whether the means of two groups are statistically different from each other.
        – ANOVA: It assesses whether the averages of more than two groups are statistically different.
      • Continuous & Continuous: While doing bivariate analysis between two continuous variables, we should look at a scatter plot. A scatter plot shows the relationship between two variables but does not indicate the strength of the relationship between them. To find the strength of the relationship, we use correlation. Correlation varies between -1 and +1.
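The slide does not say which correlation measure is meant; the sketch below assumes the Pearson correlation coefficient. The two straight-line examples are made up to show the extremes of +1 and -1, and the last example pairs the Age and Number of Credit Cards columns from the data snapshot.

```python
import math


def pearson_correlation(xs, ys):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical perfectly linear relationships.
x = [1, 2, 3, 4, 5]
print(pearson_correlation(x, [2, 4, 6, 8, 10]))   # 1.0
print(pearson_correlation(x, [10, 8, 6, 4, 2]))   # -1.0

# Age vs. number of credit cards for the 11 customers in the data snapshot.
ages = [42, 25, 50, 37, 50, 41, 62, 26, 24, 35, 36]
cards = [5, 6, 3, 6, 7, 5, 2, 7, 4, 4, 7]
age_cards_r = pearson_correlation(ages, cards)
print(age_cards_r)
```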
  77. Y is continuous, X is categorical
  78. 4. Missing Value Imputation
      There are a variety of techniques for missing value imputation, but these should be considered scenario-specific rather than a set of interchangeable choices.
      Missing Value Imputation Techniques
      A. Impute missing values with ZERO
      B. Impute missing values with MEDIAN
      C. Impute missing values with MEAN
      D. Impute missing values with MODE
      E. Impute using regression on other non-missing predictors
      F. KNN imputation
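Techniques A to D can be sketched with the standard library as below; regression and KNN imputation need more machinery and are left out. Missing values are represented by `None`, and the column of card counts is hypothetical.

```python
import statistics

# Each strategy maps the observed (non-missing) values to a single fill value.
FILL_FUNCTIONS = {
    "zero": lambda observed: 0,
    "mean": statistics.mean,
    "median": statistics.median,
    "mode": statistics.mode,
}


def impute(values, strategy):
    """Replace None entries with the fill value chosen by the strategy."""
    observed = [v for v in values if v is not None]
    fill = FILL_FUNCTIONS[strategy](observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries.
cards = [4, None, 6, 4, None, 10]

for strategy in ("zero", "median", "mean", "mode"):
    print(strategy, impute(cards, strategy))
```

On this column the four strategies give four different fill values (0, 5.0, 6 and 4), which illustrates why the choice should depend on the scenario.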
  79. 5. Outlier Treatment
      An outlier is a single observation "far away" from the rest of the data.
      Reasons for outliers:
      • Errors
        – Data errors
        – Sampling error
        – Standardization failure
        – Faulty distributional assumptions
        – Human error
      • Genuine outliers
      Why do we care about outliers?
      • Outliers are BAD
        – The presence of outliers can lead to inflated error rates and substantial distortions of results, which can lead to wrong conclusions and inferences.
      • Outliers are GOOD
        – Outliers can provide useful information in the data; for example, a spike in the spend behavior of some customers may prove to be the deciding factor in marketing response campaigns. So care should be taken while dealing with outliers.
      In short, outliers are important and hence should not be ignored.
      Techniques for outlier detection / treatment:
      • Capping and Flooring Technique
      • Exponential Smoothing Technique
      • Sigma Approach
      • Robust Regression Technique
      • Mahalanobis Distance Technique
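Two of the listed techniques can be sketched together: the sigma approach flags values more than k standard deviations from the mean, and capping and flooring pulls such values back to the bounds. The spend figures and the choice k = 2 are assumptions for illustration; in a sample this small a single outlier cannot exceed 3 standard deviations, so the more common k = 3 would flag nothing.

```python
import statistics


def sigma_bounds(values, k=2.0):
    """Lower and upper bounds at mean -/+ k sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean - k * sd, mean + k * sd


def find_outliers(values, k=2.0):
    """Sigma approach: values falling outside the bounds."""
    low, high = sigma_bounds(values, k)
    return [v for v in values if v < low or v > high]


def cap_and_floor(values, low, high):
    """Capping and flooring: clamp every value into [low, high]."""
    return [min(max(v, low), high) for v in values]


# Hypothetical monthly spend for nine customers, with one spike.
spend = [52, 48, 50, 47, 51, 49, 53, 50, 400]

low, high = sigma_bounds(spend)
outliers = find_outliers(spend)
capped = cap_and_floor(spend, low, high)
print(outliers)
print(capped)
```

Whether a flagged value should be capped, removed or kept depends on whether it is an error or a genuine outlier, as the slide points out.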