DevSpace would like to thank our sponsors!
The Buzzwords
• Big Data
• Fast Data
• Dark Data
• Unstructured Data
• Data Mining
• Data Visualization
• Predictive Analytics
• Machine Learning
• [Deep] Neural Network
The Growth
The Demand
• National gap for analytical expertise at 140k+ by 2017. –McKinsey 2011
• Shortage of 100k Data Scientists by 2020. –Gartner 2012
• 90% of clients need expertise, 40% cite lack of talent. –Accenture 2014
• Survey finds 83% of data scientists see a shortage. –CrowdFlower 2016
The Salary
https://www.paysa.com/salaries/data-scientist--t
The Definition
A data scientist is a job title for an employee or
business intelligence (BI) consultant who excels at
analyzing data, particularly large amounts of data, to
help a business gain a competitive edge.
–WhatIs.com
The Definition
The Breakdown
Data
•Define
•Collect
•Store
•Explore
Scientist
•Hypothesis
•Plan Approach
•Analysis
•Report Results
The Job
• Educate the business
• Look for problems to solve
• Research new techniques
• Collate data for analysis (ETL)*
• Implement algorithms
• Design big data-capable architecture
• Present insights
The Wrangling
Sample the Data
•Random
•Stratified
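A minimal sketch of the two sampling options above, using only the Python standard library. The `rows` data, the `plan` field, and the `stratified_sample` helper are hypothetical names chosen for illustration, and keeping at least one row per stratum is an assumption rather than something the slides prescribe.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from each group (stratum) of rows."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        # Assumption: keep at least one row per stratum so small groups survive.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical data: 90 "free" customers and 10 "paid" customers.
rows = [{"id": i, "plan": "paid" if i % 10 == 0 else "free"} for i in range(100)]

# Plain random sample: the share of "paid" rows can drift from 10%.
simple = random.Random(0).sample(rows, 20)

# Stratified sample: each plan contributes 20% of its own rows.
stratified = stratified_sample(rows, lambda r: r["plan"], 0.2)
paid_in_stratified = sum(r["plan"] == "paid" for r in stratified)
```

With a stratified sample the rare group keeps its original proportion, which a plain random sample only matches on average.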
Reconcile Missing Data
•Discard
•Infer
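The two options above might look like this in plain Python. The `records` list is hypothetical, and filling gaps with the mean of the observed values is only one of several ways to infer a missing value.

```python
from statistics import mean

# Hypothetical records; None marks a missing age.
records = [
    {"name": "a", "age": 34},
    {"name": "b", "age": None},
    {"name": "c", "age": 41},
    {"name": "d", "age": 27},
]

# Option 1: discard rows that have a missing value.
discarded = [r for r in records if r["age"] is not None]

# Option 2: infer the missing value, here from the mean of the observed ages.
observed_mean = mean([r["age"] for r in discarded])
inferred = [
    {**r, "age": observed_mean if r["age"] is None else r["age"]}
    for r in records
]
```

Discarding shrinks the data set, while inferring keeps every row at the cost of introducing estimated values.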
Normalize Numeric Values
•Standard Unit of Measure
•Subtract Average (Mean = 0)
•Divide by Standard Deviation
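As a sketch of the last two steps, the function below subtracts the mean and divides by the standard deviation (often called a z-score). The `heights_cm` values are hypothetical and assumed to be in a single unit already; using the population standard deviation is a choice made for this example.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical values
z = standardize(heights_cm)
```

After standardizing, the column has mean 0 and standard deviation 1, so columns measured on different scales become comparable.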
Reduce Dimensionality
•Irrelevant Input Variables
•Redundant Input Variables
Add Derivative Values
•Generalize Attributes
•Discretize Attributes to Categories
•Binarize Categorical Attributes
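A small illustration of discretizing a numeric attribute and then binarizing the resulting category (one-hot encoding). The bin edges, category names, and helper functions are hypothetical.

```python
def discretize_age(age):
    """Map a numeric age to a coarse category (bin edges are illustrative)."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"

def one_hot(value, categories):
    """Binarize a categorical value into one 0/1 column per category."""
    return {f"is_{c}": int(value == c) for c in categories}

categories = ["minor", "adult", "senior"]
ages = [12, 30, 70]  # hypothetical values
encoded = [one_hot(discretize_age(a), categories) for a in ages]
```

Each encoded row has exactly one column set to 1, which lets algorithms that expect numeric input consume categorical attributes.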
Design Training Data
•Select
•Combine
•Aggregate
Power and Log transformation
•Approximate Normal Distribution
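A rough sketch of why a log transformation helps with right-skewed values. The `incomes` list is hypothetical, and `mean_median_gap` is a crude indicator invented for this example, not a standard skewness statistic. A log transform reduces skew but does not guarantee a normal distribution.

```python
import math
from statistics import mean, median

incomes = [20_000, 25_000, 30_000, 40_000, 1_000_000]  # hypothetical, right-skewed
logged = [math.log10(x) for x in incomes]

def mean_median_gap(values):
    """Relative gap between mean and median; a rough indicator of skew."""
    return abs(mean(values) - median(values)) / median(values)

gap_before = mean_median_gap(incomes)
gap_after = mean_median_gap(logged)
```

The single large value pulls the mean far from the median in the raw data; after the transformation the two sit much closer together.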
The Analysis Tools
The Tool Trends
Python
KNIME
RapidMiner
R
SPSS
SAS
Hadoop
The Top Tools
• SQL
• Excel
• Python
• R
• MySQL
The Languages
• R
• Python
• Java/Scala
• Stata
• SAS
• SPSS
• Matlab
• Julia
• Kafka/Storm
http://www.kdnuggets.com/2015/05/r-vs-python-data-science.html
The Math
• basic statistics (e.g., p-value)
• statistical modeling
• statistical tests
• experiment design
• distributions
• maximum likelihood estimators
• probability theory
• linear algebra
• multivariable calculus
The Visualization Tools
• Tableau (enterprise visualization products) - www.tableau.com
• ggvis (R visualization package) - ggvis.rstudio.com
• ggplot (plotting system) - ggplot.yhathq.com
• D3.js (declarative DOM manipulation) - d3js.org
• Vega (visualization grammar) - trifacta.github.com/vega
• Rickshaw (charting library) - code.shutterstock.com/rickshaw
• modest maps (map library) - modestmaps.com
• Chart.js (plotting library) - www.chartjs.org
The Machine Learning
Concepts
•k-nearest neighbors
•random forests
•ensemble methods
•…use Python libraries!
Tools
•Weka - www.cs.waikato.ac.nz/ml/weka/
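To make the first concept concrete, here is a minimal k-nearest neighbors classifier in plain Python. The training points and labels are hypothetical. In practice, as the slide suggests, a library such as scikit-learn (`KNeighborsClassifier`) would normally be used instead of hand-written code.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Sort training items by Euclidean distance to the query and keep k of them.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points with two classes.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

label_near_a = knn_predict(train, (1.1, 1.0))
label_near_b = knn_predict(train, (5.1, 5.0))
```

Sorting the whole training set is fine for a toy example; libraries use spatial indexes to avoid that cost on large data sets.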
The Results
• Report
• Presentation
• Demo
• Prototype
• Component
The Skills
• Data Analyst (A)
• Data Engineer (B)
• Academic (Ab)
• Generalist (AB)
http://blog.udacity.com/2014/11/data-science-job-skills.html
The Path
1. Fundamentals
2. Statistics
3. Programming
4. ML
5. Text Mining
6. Visualization
7. Big Data
8. Data Munging
9. Toolbox
Fundamentals
1. Matrices & Linear Algebra
2. Hash Functions, Binary Tree, O(n)
3. Relational Algebra, DB Basics
4. Inner, Outer, Cross, Theta Join
5. CAP Theorem
6. Tabular Data
7. Data Frames & Series
8. Sharding
9. OLAP
Fundamentals
10. Multidimensional Data Model
11. ETL
12. Reporting vs BI vs Analytics
13. JSON & XML
14. NoSQL
15. Regex
16. Vendor Landscape
17. Env Setup
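The join types listed under the fundamentals (inner, outer, cross, theta) can be sketched with plain Python lists. The `employees` and `departments` tables are hypothetical, and only the left variant of the outer join is shown.

```python
from itertools import product

# Hypothetical tables as lists of rows.
employees = [
    {"name": "Ann", "dept_id": 1},
    {"name": "Bob", "dept_id": 2},
    {"name": "Cy", "dept_id": None},
]
departments = [
    {"dept_id": 1, "dept": "Sales"},
    {"dept_id": 3, "dept": "Ops"},
]

# Cross join: every employee paired with every department.
cross = list(product(employees, departments))

# Inner join: keep only pairs whose keys are equal.
inner = [(e, d) for e, d in cross if e["dept_id"] == d["dept_id"]]

# Left outer join: keep every employee; unmatched ones are paired with None.
matched = {id(e) for e, _ in inner}
left_outer = inner + [(e, None) for e in employees if id(e) not in matched]

# Theta join: the condition can be any comparison, not only equality.
theta = [
    (e, d) for e, d in cross
    if e["dept_id"] is not None and e["dept_id"] < d["dept_id"]
]
```

Filtering a cross join shows the idea clearly, although real databases use indexes and hashing rather than comparing every pair.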
Statistics
1. Pick a Dataset
2. Descriptive Statistics
3. Exploratory Data Analysis
4. Histograms
5. Percentiles and Outliers
6. Probability Theory
7. Bayes' Theorem
8. Random Variables
9. Cumulative Distribution Function (CDF)
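As a worked example for item 7 in the statistics list, the function below applies Bayes' theorem to a diagnostic-style question. The prior, sensitivity, and false positive rate are made-up numbers for illustration.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    # Total probability of a positive test, with or without the condition.
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

# Hypothetical inputs: 1% base rate, 95% sensitivity, 5% false positives.
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
```

With these inputs the posterior comes out at roughly 16%, far below the 95% sensitivity, because the condition is rare to begin with.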
The Training
• Coursera - www.coursera.org
• edX - www.edx.org
• Udacity - www.udacity.com
• Kaggle - www.kaggle.com
• YouTube - projects.iq.harvard.edu/stat110/youtube
• Boot Camps

From Developer to Data Scientist


Editor's Notes

  • #3 Chris Gardner
  • #4 Trying to understand the world of Data Science and “Big Data” can be overwhelming. Not only is it huge, but it is constantly changing! Innovation is shifting from Infrastructure and Analytics toward Applications. http://mattturck.com/wp-content/uploads/2016/03/Big-Data-Landscape-2016-v18-FINAL.png
  • #5 Before we discuss data scientists, let’s look at some common buzzwords.
    Big Data: Data sets so large they require specialized techniques to analyze.
    Fast Data: Data whose utility is going to decline over time (fast ingest, streaming, preparation, analytics, user response).
    Dark Data: People don’t know it’s there, don’t know how to access it, aren’t allowed access, or the systems haven’t been set up to leverage it yet.
    Data Mining: Examining large data sets to generate new insights.
    Predictive Analytics: Extracting information from existing data to determine patterns.
    Machine Learning: The use of algorithms to learn from and make predictions on data.
    Deep Neural Network: Graphical models in which data is computed upon by successive layers of nodes.
    Any other major buzzwords? Let’s talk about Data Science.
    http://www.zipfianacademy.com/blog/post/46864003608/a-practical-intro-to-data-science
  • #6 Within the last few years the amount of data being generated has drastically increased. The majority of this new data is unstructured.
  • #7 As storage prices dropped and the world became increasingly computerized, the amount of data being gathered grew exponentially, as did the opportunities to benefit from its analysis.
    McKinsey predicts need for expertise.
    Gartner predicts talent shortage.
    Accenture sees widespread (unmet) need.
    A survey of the industry confirms the shortage.
    Studies and surveys are great, but money talks!
  • #8 Software Engineer market salary of $118k.
    Developer market salary of $94k.
    So we know there’s a need, but what is a Data Scientist?
  • #14 Domain Expertise! https://digitalmarketing.temple.edu/romannicholas/2016/01/28/when-i-grow-up-i-want-to-be-a-data-scientist/
  • #15 Data munging is the process of converting “raw” data into a format that can be consumed.
    Input Variable = “Feature”
    Also: power and log transformation. Many machine learning models assume a linear relationship (i.e., the output is linearly dependent on the input), as well as normally distributed error with constant standard deviation. In many cases, however, reality does not align exactly with these assumptions.
    http://horicky.blogspot.com.au/2012/05/predictive-analytics-data-preparation.html
  • #18 http://r4stats.com/articles/popularity/
  • #19 The O’Reilly survey better represents the open-source community.
    Conclusion: Among the software that tends to be used as a collection of pre-written methods, R, SAS, SPSS and Stata tend to always be in the top, with R and SAS occasionally swapping places depending on the criteria used. I don’t include Python in this group as I rarely see someone using it exclusively to call pre-written routines.
    Using commodity hardware and open source software, Hadoop’s distributed file system (HDFS) facilitates the storage, management and rapid analysis of vast datasets across distributed clusters of servers.
    Hadoop
      MapReduce: persists to disk (slow)
      Hive/Presto (new): SQL-like abstraction layers
      Pig: workflow-driven abstraction
    Spark *
      Runs in-memory
      Standalone or with Hadoop
    http://r4stats.com/articles/popularity/
  • #23 Tool
  • #25 A = Analyst, B = Builder. Don’t forget Domain Expertise!
  • #26 Author: Swami Chandrasekaran http://nirvacana.com/thoughts/becoming-a-data-scientist/
  • #30
    1. Python
      Learn Python Programming From Scratch by Udemy
      Learn to program in Python by CodeCademy
      LearnPython.org interactive Python tutorial
    2. Machine Learning
      Machine learning online
      Operational Intelligence and Machine Data with Splunk
    3. R Language
      R Basics – R Programming Language Introduction by Udemy
      Introduction to R at DataCamp
      Learn R at Code school
    4. Big Data
      Big Data University
      Big Data and Hadoop Essentials by Udemy
      Basic overview of Big Data Hadoop by Udemy
    5. Statistics
      Statistics One by Coursera
      Statistics and Probability
      Probability & Statistics
    6. Data Mining
      Data Mining and Web Scraping: How to Convert Sites into Data by Udemy
      Data Mining by Coursera
    7. SQL
      Interactive Online SQL Training for Beginners
      Sachin Quickly Learns (SQL) – Structured Query Language by Udemy
      SQL Tutorial by w3schools
    8. Java
      Learn Java: The Java Programming Tutorial For Beginners by Udemy
      Learn Java – Free Interactive Java Tutorial
      Learn Java Programming From Scratch – Udemy