Due to recent advances in technology, humanity is collecting vast amounts of data at an unprecedented rate, making the skills necessary to mine insights from this data increasingly valuable. So what does it take for a Developer to enter the world of data science?
Join me on a journey into the world of big data and machine learning where we will explore what the work actually looks like, identify which skills are most important, and design a road map for how you too can join this exciting and profitable industry.
4. The Buzzwords
• Big Data
• Fast Data
• Dark Data
• Unstructured Data
• Data Mining
• Data Visualization
• Predictive Analytics
• Machine Learning
• [Deep] Neural Network
6. The Demand
• National gap for analytical expertise at 140k+ by 2017. –McKinsey 2011
• Shortage of 100k Data Scientists by 2020. –Gartner 2012
• 90% of clients need expertise, 40% cite lack of talent. –Accenture 2014
• Survey finds 83% of data scientists see shortage. –Crowdflower 2016
8. The Definition
A data scientist is a job title for an employee or
business intelligence (BI) consultant who excels at
analyzing data, particularly large amounts of data, to
help a business gain a competitive edge.
–WhatIs.com
12. The Job
• Educate the business
• Look for problems to solve
• Research new techniques
• Collate data for analysis (ETL)*
• Implement algorithms
• Design big data-capable architecture
• Present insights
13. The Wrangling
Sample the Data
• Random
• Stratified
Reconcile Missing Data
• Discard
• Infer
Normalize Numeric Values
• Standard Unit of Measure
• Subtract Average (Mean = 0)
• Divide by Standard Deviation
Reduce Dimensionality
• Irrelevant Input Variables
• Redundant Input Variables
Add Derivative Values
• Generalize Attributes
• Discretize Attributes to Categories
• Binarize Categorical Attributes
Design Training Data
• Select
• Combine
• Aggregate
Power and Log Transformation
• Approximate Normal Distribution
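A few of the wrangling steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the choice of mean substitution for "Infer", and the sample values are all hypothetical and are not taken from the talk.

```python
import statistics

def impute_mean(values):
    """Infer missing entries (None) by substituting the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def standardize(values):
    """Subtract the average (mean becomes 0), then divide by the standard deviation."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / stdev for v in values]

def binarize(categories):
    """Turn one categorical attribute into 0/1 indicator columns, one per level."""
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return levels, rows

# Hypothetical sample data, used only to exercise the helpers.
heights_cm = [150.0, None, 170.0, 190.0]
filled = impute_mean(heights_cm)
scaled = standardize(filled)
levels, encoded = binarize(["red", "green", "red", "blue"])
```

Mean substitution is only one way to infer a missing value. Discarding the row, or predicting the value from other features, may suit a given dataset better.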
17. The Languages
• R
• Python
• Java/Scala
• Stata
• SAS
• SPSS
• Matlab
• Julia
• Kafka/Storm
http://www.kdnuggets.com/2015/05/r-vs-python-data-science.html
18. The Math
• basic statistics (e.g., p-value)
• statistical modeling
• statistical tests
• experiment design
• distributions
• maximum likelihood estimators
• probability theory
• linear algebra
• multivariable calculus
22. The Skills
• Data Analyst (A)
• Data Engineer (B)
• Academic (Ab)
• Generalist (AB)
http://blog.udacity.com/2014/11/data-science-job-skills.html
23. The Path
1. Fundamentals
2. Statistics
3. Programming
4. ML
5. Text Mining
6. Visualization
7. Big Data
8. Data Munging
9. Toolbox
24. Fundamentals
1. Matrices & Linear Algebra
2. Hash Functions, Binary Tree, O(n)
3. Relational Algebra, DB Basics
4. Inner, Outer, Cross, Theta Join
5. CAP Theorem
6. Tabular Data
7. Data Frames & Series
8. Sharding
9. OLAP
25. Fundamentals
10. Multidimensional Data Model
11. ETL
12. Reporting vs BI vs Analytics
13. JSON & XML
14. NoSQL
15. Regex
16. Vendor Landscape
17. Env Setup
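The join types named in item 4 of the Fundamentals list can be illustrated without a database. The sketch below uses plain Python tuples. The tables, column layout, and the `<` predicate chosen for the theta join are hypothetical examples, not material from the talk.

```python
from itertools import product

# Hypothetical tables: (name, dept_id) and (dept_id, dept_name).
employees = [("Ann", 1), ("Bob", 2), ("Cy", 3)]
departments = [(1, "Data"), (2, "Web"), (4, "Ops")]

# Cross join: every employee row paired with every department row.
cross = list(product(employees, departments))

# Theta join: the cross join filtered by an arbitrary comparison predicate.
theta = [(e, d) for e, d in cross if e[1] < d[0]]

# Inner join: the special case where the predicate is equality on the key.
inner = [(e, d) for e, d in cross if e[1] == d[0]]

# Left outer join: inner join plus unmatched left rows padded with None.
matched = {e for e, _ in inner}
left_outer = inner + [(e, None) for e in employees if e not in matched]
```

Real database engines avoid materializing the full cross product. The filter-over-product form is used here only because it makes the relationship between the join types easy to see.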
26. Statistics
1. Pick a Dataset
2. Descriptive Statistics
3. Exploratory Data Analysis
4. Histograms
5. Percentiles and Outliers
6. Probability Theory
7. Bayes' Theorem
8. Random Variables
9. Cumulative Distribution Function (CDF)
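As a rough illustration of items 5 and 9 in the Statistics list, the sketch below finds outliers from quartiles and evaluates an empirical CDF using only Python's `statistics` module. The 1.5 × IQR cutoff (Tukey's rule) is one common convention chosen here as an assumption, and the sample data and function names are hypothetical.

```python
import statistics

def iqr_outliers(values):
    """Flag values more than 1.5 * IQR outside the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

def empirical_cdf(values, x):
    """Fraction of observations less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)

# Hypothetical sample with one obviously extreme value.
sample = [2, 3, 3, 4, 5, 5, 6, 7, 8, 40]
outliers = iqr_outliers(sample)
share_at_most_5 = empirical_cdf(sample, 5)
```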
Trying to understand the world of Data Science and “Big Data” can be overwhelming.
Not only is it huge, but it is constantly changing!
Innovation is shifting from Infrastructure and Analytics toward Applications.
http://mattturck.com/wp-content/uploads/2016/03/Big-Data-Landscape-2016-v18-FINAL.png
Before we discuss data scientists, let's look at some common buzzwords.
Big Data – Data sets so large they require specialized techniques to analyze.
Fast Data – Data whose utility is going to decline over time (fast ingest, streaming, preparation, analytics, user response).
Dark Data – People don’t know it’s there, don’t know how to access it, aren’t allowed access, or the systems haven’t been set up to leverage it yet.
Data Mining – Examining large data sets to generate new insights.
Predictive Analytics – Extracting information from existing data to determine patterns.
Machine Learning – The use of algorithms to learn from and make predictions on data.
Deep Neural Network – Graphical models in which data is computed upon by successive layers of nodes.
Any other major buzzwords? Let's talk about Data Science.
http://www.zipfianacademy.com/blog/post/46864003608/a-practical-intro-to-data-science
Within the last few years the amount of data being generated has drastically increased.
The majority of this new data is unstructured.
As storage prices dropped and the world became increasingly computerized, the amount of data being gathered grew exponentially, as did the opportunities to benefit from its analysis.
McKinsey predicts need for expertise
Gartner predicts talent shortage
Accenture sees widespread need (unmet)
Survey of industry confirms shortage
Studies and surveys are great, but money talks!
Software Engineer market salary of $118k
Developer market salary of $94k
So we know there’s a need, but what is a Data Scientist?
Data munging is the process of converting “raw” data into a format that can be consumed.
Input Variable = “Feature”
Also Power and Log transformation
Many machine learning models assume a linear relationship (i.e., the output is linearly dependent on the input), as well as normally distributed error with constant standard deviation. In many cases, however, reality does not align exactly with these assumptions.
http://horicky.blogspot.com.au/2012/05/predictive-analytics-data-preparation.html
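To make the log transformation note concrete, the sketch below compares the skewness of a right-skewed variable before and after taking logarithms. The income figures are made-up illustration data, and the `skewness` helper is a hypothetical name for the usual mean-cubed-z-score measure. A value nearer zero suggests a more symmetric, closer-to-normal shape.

```python
import math
import statistics

def skewness(values):
    """Population skewness: the mean of cubed z-scores (0 for symmetric data)."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / stdev) ** 3 for v in values])

# Hypothetical right-skewed data: a few very large values stretch the tail.
incomes = [20_000, 25_000, 30_000, 40_000, 60_000, 120_000, 500_000]
log_incomes = [math.log(v) for v in incomes]

raw_skew = skewness(incomes)
log_skew = skewness(log_incomes)
```

For this sample the log-transformed values are still somewhat right-skewed, only less so. A transformation moves data toward normality but does not guarantee it.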
http://r4stats.com/articles/popularity/
O’Reilly survey better represents open-source community
Conclusion: Among the software that tends to be used as a collection of pre-written methods, R, SAS, SPSS and Stata tend to always be in the top, with R and SAS occasionally swapping places depending on the criteria used. I don’t include Python in this group as I rarely see someone using it exclusively to call pre-written routines.
Using commodity hardware and open source software, Hadoop’s distributed file system (HDFS) facilitates the storage, management and rapid analysis of vast datasets across distributed clusters of servers.
Hadoop MapReduce
Persists to disk (slow)
Hive/Presto (new)
SQL-like abstraction layers
Pig
Workflow-driven abstraction
Spark *
Runs in-memory
Standalone or with Hadoop
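The MapReduce programming model behind the tools listed above can be sketched as a single-process word count in plain Python. This shows the pattern only and does not use the Hadoop or Spark APIs. The function names and input documents are hypothetical, and a real cluster would run the map and reduce phases in parallel across machines.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input: each string stands in for one file split.
documents = ["big data fast data", "dark data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(pairs))
```

In this sketch every intermediate result stays in memory. Per the notes above, Hadoop MapReduce persists to disk between stages, which is the main reason Spark's in-memory execution is faster.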
http://r4stats.com/articles/popularity/
1. Python
• Learn Python Programming From Scratch by Udemy
• Learn to program in Python by CodeCademy
• LearnPython.org interactive Python tutorial
2. Machine Learning
• Machine learning online
• Operational Intelligence and Machine Data with Splunk
3. R Language
• R Basics – R Programming Language Introduction by Udemy
• Introduction to R at DataCamp
• Learn R at Code school
4. Big Data
• Big Data University
• Big Data and Hadoop Essentials by Udemy
• Basic overview of Big Data Hadoop by Udemy
5. Statistics
• Statistics One by Coursera
• Statistics and Probability
• Probability & Statistics
6. Data Mining
• Data Mining and Web Scraping: How to Convert Sites into Data by Udemy
• Data Mining by Coursera
7. SQL
• Interactive Online SQL Training for Beginners
• Sachin Quickly Learns (SQL) – Structured Query Language by Udemy
• SQL Tutorial by w3schools
8. Java
• Learn Java: The Java Programming Tutorial For Beginners by Udemy
• Learn Java – Free Interactive Java Tutorial
• Learn Java Programming From Scratch – Udemy