So your boss wants you to learn
data science
Susan Ibach
Data Science has become a buzzword
When your boss walks up to you and says
we need to do data science, where do
you start?
What is a data scientist?
math skills
Follow the 7 Steps to data science success
Step 1: Identify your problem and
data to define the problem
What insights might help solve/define the problem?
An airline wants to prevent flight delays
Different insights require different tools
1997 and 2017
Data science tools include
Data Mining
gain insights from data
• Those who bought this also bought
• Keyword extraction
Machine Learning
make predictions
• Who will need hospitalization from the flu?
• How many copies of this book will I sell?
Deep Learning
For complex data processed in layers
• Is there a bird in this photo?
• Will this person get cancer?
Do we need Artificial Intelligence?
•AI is when a computer completes a task that
normally requires human intelligence
• Answering questions from a customer
• Recognizing the content of a photo
• Understanding human speech
•We use data science to analyze and recognize
patterns and responses so we can do AI
Step 2: Collect data
What data would help you determine:
Which flights are most likely to be delayed next week?
Where do I get all that data?
Relational databases
BLOB storage
NoSQL databases
Data warehouses
Flat Files
Open source data
When does data become “big data”?
High Volume
High Velocity
High Variety
Step 3: Prepare data
Your data will need clean-up/prep
Flight #  Dep Date     Sched Dep  Dep Airport  Dep Delay
041       15-dec-2016  09:20      YYZ          253:26
386       15-dec-2016  15:20      YYZ
415       15-dec-2016  19:15      YYZ          0:02
415       15-dec-2016  19:15      YYZ          0:02

Date        Airport  Wind       Precipitation  Precipitation Type
15/12/2016  Pearson  NNE 5 MPH  150 mm         Snow
15/12/2016  Dulles   SW 18 MPH  7 mm           Rain
15/12/2016  Reagan   SW 18 MPH  7 mm           Rain
Missing Values
Duplicate rows
Different data formats
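The three problems named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a definitive clean-up routine: the rows are copied from the flight table above, while the function names, the reading of the delay column as hours:minutes, and the choice to keep a blank delay as None are all assumptions. It uses only the standard library; pandas offers equivalent tools for larger data sets.

```python
from datetime import datetime

# Rows copied from the flight table: flight, date, scheduled departure, airport, delay
raw_flights = [
    ("041", "15-dec-2016", "09:20", "YYZ", "253:26"),
    ("386", "15-dec-2016", "15:20", "YYZ", ""),
    ("415", "15-dec-2016", "19:15", "YYZ", "0:02"),
    ("415", "15-dec-2016", "19:15", "YYZ", "0:02"),
]

def parse_date(text):
    """Different data formats: try each known date layout in turn."""
    for fmt in ("%d-%b-%Y", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")

def delay_minutes(text):
    """Missing values: a blank delay stays None instead of becoming 0.

    Assumes the column is written as hours:minutes.
    """
    if not text:
        return None
    hours, minutes = text.split(":")
    return int(hours) * 60 + int(minutes)

def clean(rows):
    """Duplicate rows: keep only the first copy of each identical record."""
    seen = set()
    cleaned = []
    for flight, date, dep, airport, delay in rows:
        record = (flight, parse_date(date), dep, airport, delay_minutes(delay))
        if record in seen:
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned

flights = clean(raw_flights)
for row in flights:
    print(row)
```

Keeping the missing delay as None, rather than guessing a value, leaves the decision about how to treat it to a later step.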
What tools might you use for data prep?
Start with what you already know
• Excel, SQL
Write your own code
• Python pandas library, R
Third-party products
• Experian, Paxata, Alteryx, SAP Lumira, Teradata Data Lab, Knowledge Works, Datameer
If you have Big Data
•Preparing and pulling together your data will require
a LOT of storage and processing power
Step 4: Identify the data that
influences outcomes
Which fields (“features”) might help us
predict if a flight will be late (“label”)?
Flight #  Dep Date     Sched Dep  Dep Airport  Dep Delay
041       15-dec-2016  09:20      YYZ          253:26
386       15-dec-2016  15:20      YYZ
415       15-dec-2016  19:15      YYZ          0:02
415       15-dec-2016  19:15      YYZ          0:02

Date        Airport  Wind       Precipitation  Precipitation Type
15/12/2016  Pearson  NNE 5 MPH  15 cm          Snow
15/12/2016  Dulles   SW 18 MPH  7 mm           Rain
15/12/2016  Reagan   SW 18 MPH  7 mm           Rain
Are there any fields we can
decompose to get more
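One way to read "decompose" here is to split a single date or time column into several simpler columns a model can use. The sketch below does that for the departure date and scheduled departure time from the flight table. The function name and the particular features chosen (month, day of week, hour, weekend flag) are illustrative assumptions, and the month abbreviation is assumed to be English.

```python
from datetime import datetime

def decompose(dep_date, sched_dep):
    """Split a departure date and time into simpler candidate features."""
    moment = datetime.strptime(f"{dep_date} {sched_dep}", "%d-%b-%Y %H:%M")
    return {
        "month": moment.month,
        "day_of_week": moment.weekday(),   # 0 = Monday ... 6 = Sunday
        "hour": moment.hour,
        "is_weekend": moment.weekday() >= 5,
    }

# First row of the flight table
features = decompose("15-dec-2016", "09:20")
print(features)
```

Whether any of these derived fields actually influences delays is something only the data can show; the point is that one raw column can yield several candidate features.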
Which fields “features” help us predict if a
picture contains a dog or cat “label”?
• Pixel1Color, Pixel2Color, Pixel3Color,….Pixel9036Color
Break out the deep learning
Pixel → Edge → Shape → Cat
Step 5: Pick the right algorithm
What are you trying to predict?
Prediction                                  Algorithm          Example
Predict continuous values                   Regression         Predict what time a flight will land
Predict what category something falls into  Classification     Predict if a flight will be late or on time
Detect unusual data                         Anomaly detection  Predict if a credit card transaction is fraudulent
                                                               Predict if a runner cheated on a marathon
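To make the regression row concrete, here is a minimal sketch of fitting a straight line by ordinary least squares. The numbers are invented for illustration (departure delay against arrival delay, in minutes) and are not from the deck; in practice a library such as scikit-learn would normally do this work.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up history: departure delay and arrival delay, both in minutes
departure_delay = [0, 10, 20, 30]
arrival_delay = [-5, 5, 15, 25]

intercept, slope = fit_line(departure_delay, arrival_delay)

def predict(x):
    """Predicted arrival delay (a continuous value) for a new departure delay."""
    return intercept + slope * x

print(predict(40))
```

The output is a number on a continuous scale, which is what separates regression from classification, where the answer would be a category such as late or on time.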
Supervised vs Unsupervised
Supervised
• Definition: You have existing data with known inputs and known outputs to help make predictions
• Example: When I try to predict if a flight next week will be late, I know what flights have been late in the past
Unsupervised
• Definition: You have input data but no known outcomes in your data
• Example: When I try to predict if a runner cheated on a marathon, I don’t have a history of runners who cheated in the past.
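The unsupervised case can be sketched without any labels at all: look for values that sit far from the rest. The example below flags unusual second-half marathon times using the modified z-score (based on the median and the median absolute deviation). The times, the 3.5 threshold, and the function name are assumptions for illustration, and an unusual time is only a reason to look closer, not proof of cheating.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return the indexes whose modified z-score exceeds the threshold."""
    mid = median(values)
    mad = median(abs(v - mid) for v in values)   # median absolute deviation
    if mad == 0:
        return []                                # no spread, nothing stands out
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - mid) / mad > threshold
    ]

# Made-up second-half times in minutes; no runner is labelled as a cheater
second_half_minutes = [118, 121, 119, 125, 122, 120, 62, 123]

suspects = flag_outliers(second_half_minutes)
print(suspects)
```

Nothing in the input says which runners cheated; the method only relies on most runners being similar to each other.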
Step 6: Train your model
Once you have data and your algorithm,
you can train and create your predictive model
There are lots of tools to choose from
scikit-learn (based on NumPy, SciPy, and matplotlib)
Azure Machine Learning Service
Cognitive Toolkit/TensorFlow (deep learning)
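To show what "training" means in the simplest terms, the sketch below builds a frequency-table classifier by hand: it learns the share of late flights for each weather condition from labelled history, then predicts "late" when that share is at least one half. This is a toy stand-in, not how the tools listed here work internally, and the history rows, function names, and 0.5 cut-off are all invented for illustration.

```python
from collections import defaultdict

def train(history):
    """Learn the share of late flights for each weather condition."""
    counts = defaultdict(lambda: [0, 0])   # condition -> [late flights, all flights]
    for condition, was_late in history:
        counts[condition][0] += int(was_late)
        counts[condition][1] += 1
    return {c: late / total for c, (late, total) in counts.items()}

def predict(model, condition, default=False):
    """Predict late when at least half of past flights in that condition were late."""
    if condition not in model:
        return default                     # condition never seen during training
    return model[condition] >= 0.5

# Made-up labelled history: (weather condition, was the flight late?)
history = [
    ("Snow", True), ("Snow", True), ("Snow", False),
    ("Rain", False), ("Rain", True), ("Rain", False),
    ("Clear", False), ("Clear", False),
]

model = train(history)
print(predict(model, "Snow"), predict(model, "Rain"))
```

The "model" here is just a dictionary of learned rates. Real libraries learn far richer structures, but the pattern of fitting on known outcomes and then predicting new cases is the same.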
Step 7: Test your model
You need to know the accuracy of your model
Flt #406
Air Canada
April 1, 2016
3:15 PM
Late: No
Flt #351
West Jet
April 12, 2016
8:01 AM
Late: No
Flt #141
Delta
Sep 25, 2016
1:45 PM
Late: Yes
Flt #406, Air Canada, April 1, 2016, 3:15 PM, YYZ-YVR, Late: Yes
Flt #351, West Jet, April 12, 2016, 8:01 AM, YOW-YYZ, Late: No
Flt #141, Delta, Sep 25, 2016, 1:45 PM, HND-SEA, Late: Yes
66.6% accuracy
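The accuracy figure can be reproduced with a short calculation. The sketch assumes that the first set of "Late" values on this slide is what the model predicted and the second set, with routes, is what actually happened; that reading is an assumption, as are the variable and function names.

```python
# Assumed model output for the three test flights
predicted = {"406": "No", "351": "No", "141": "Yes"}
# Assumed recorded outcome for the same flights
actual = {"406": "Yes", "351": "No", "141": "Yes"}

def accuracy(predicted, actual):
    """Share of flights where the prediction matches the recorded outcome."""
    correct = sum(predicted[flight] == actual[flight] for flight in actual)
    return correct / len(actual)

score = accuracy(predicted, actual)
print(f"{score:.1%}")
```

Two of the three predictions match, which is the two-thirds accuracy shown on the slide. With only three test flights the figure is very coarse; a real test set would be much larger.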
What do I do if my accuracy is lousy?
Go back to step 1
For additional information
•Appendix A
What is Hadoop anyway?
•Appendix B
What cloud tools exist to help with data science?
•Appendix C Lexicon
Susan Ibach
Appendix A –
What is Hadoop anyway?
It’s a tool for analyzing Big Data
Hadoop is an open source framework
•Based on Java
•Distributed processing of large datasets across
clusters of computers
•Distributed storage and computation across clusters
of computers
•Scales from single server to thousands of machines
Hadoop components
• Hadoop Common – java libraries used by Hadoop to abstract the filesystem and
• Hadoop YARN – framework for job scheduling and managing cluster resources
• HDFS – distributed File system for access to application data (distributed storage)
• Based on Google File System (GFS)
• Hadoop can run on any distributed file system (local FS, HFTP FS, S3 FS) but usually runs on HDFS
• File in HDFS is split into blocks which are stored in DataNodes. Name nodes map blocks to
• MapReduce – the programming model for parallel processing of large data sets
(distributed computation)
• Map data into key/value pairs (tuples)
• Reduce data tuples into a smaller set of tuples
• Input/output stored in file system
• JobTracker and TaskTracker schedule and monitor tasks and re-execute failed tasks
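The map and reduce steps described above can be imitated on a single machine with the classic word count. This is only a sketch of the programming model in plain Python: it does not use Hadoop, nothing is distributed, and the function names and sample lines are made up for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: turn one line of text into (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single pair."""
    return key, sum(values)

lines = ["big data is big", "data science"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

On a real cluster each call to the map function could run on a different machine against a different block of the input, which is what makes the model scale.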
Hadoop components
• Hive – similar to SQL, hides complexity of Map Reduce
programming, generates a MapReduce job
• Pig (Pig Latin) – high-level data flow language for
parallel computation & ETL
• HBase – scalable distributed non-relational database that
supports structured data storage for large tables (billions
of rows × millions of columns)
• Spark - compute engine for Hadoop data used for ETL,
machine learning, stream /real-time processing and
graph computation (gradually replacing MapReduce
because it is faster for iterative algorithms)
How does Hadoop work
• User submits a job to Hadoop
• Location of input and output files
• Java classes containing map and reduce functions
• Job configuration parameters
•Hadoop submits the job to JobTracker which
distributes the job to the slaves, schedules tasks and
monitors them
•Task trackers execute the task and output is stored in
output files on the file system
Why is it popular?
• Allows user to quickly write and test distributed systems
• It automatically distributes data and work across the
machines and utilizes the parallelism of CPU cores
• Does not rely on hardware for fault tolerance and high
• Servers can be added or removed dynamically
• It’s open source and runs on many platforms because
it is Java-based
Appendix B –
What cloud tools exist to help with
data science?
Microsoft Azure
Cortana, Bot Framework – Interact with it – Type messages, talk, send images, or video and get answers
Power BI – See it – Visualize data with heat maps, graphs and charts
Stream Analytics – Stream it – Monitor data as it arrives and act on it in real time
Azure Machine Learning, Microsoft R – Learn it – Analyze past data to learn by finding patterns you can use to predict outcomes for new data
SQL Data Warehouse, SQL DB, Document DB, Blob storage – Relate it – Store related data together using the best data store for the job
Data Lake – Store it – A data store that can handle data of any size, shape or speed
Event Hubs – Collect it – Collect data from sources such as IoT sensors that send large amounts of data over small amounts of time
Data Factory – Move it – Move data from one place to another, transform it as you move it
Data Catalog – Document it – Document all your data sources
Cognitive Services – Use it – Pre-trained models available for use
HD Insight & Azure Data Bricks – Scale it – Create clusters for Hadoop or Spark (DataBricks for Spark)
Google Cloud Platform
Cloud Dataprep – Prepare it – Prepare your data for analysis
BigQuery ML, BigQuery GIS – Train it – Train machine learning models
BigQuery, GCP Data Lake – Store it – Data warehouse
Cloud Dataproc – Scale it – Spin up clusters for Hadoop and Spark
Cloud Pub/Sub, Cloud Dataflow – Stream it – Ingest events in real time
Cloud DataFlow – Store it – A data store that can handle data of any size, shape or speed
Prepackaged AI solutions – Use it – Pre-trained models available for use
IBM
Analytics Engine – Scale it – Build and deploy clusters for Hadoop and Spark
InfoSphere Information Server on – Access it – Extract, transform & load data + data standardization
Streaming Analytics – Stream it – Monitor data as it arrives and act on it in real time
IBM Watson – Train it or Use it – Train your own models or leverage pre-trained models for features such as speech to text, natural language processing, and image
Watson IoT Platform – Collect it – Connect devices and analyze the associated data
Deep Learning – Analyze it – Design and deploy deep learning modules using neural networks
IBM Data Refinery – Prepare it – Data preparation tool
Amazon Web Services
Data Lakes, Redshift – Store it – Store your data
Lake Formation – Move it – Get data into your data lake
Streaming Analytics – Stream it – Monitor data as it arrives and act on it in real time
Amazon Kinesis, IoT Core – Collect it – Collect, process and analyze real time data including data from IoT
Glue – Document it – Create a catalog of your data that is searchable and queryable by
Athena – Analyze it – Analyze your data
EMR, Deep Learning AMIs – Scale it – Scale using Hadoop and Spark
QuickSight – See it – Visualizations and dashboards
Application Services – Use it – Pre-trained models ready for use
Deep Learning AMIs, SageMaker – Train it – Tools to help you build and train models
Appendix C Lexicon
Buzzwords and Tools
Amazon Redshift – Data warehouse infrastructure
Ambari – web-based tool for managing Apache Hadoop clusters: provision, manage and monitor your Hadoop clusters
Avro – a data serialization system (like XML or JSON)
Apache Hadoop – distributed storage and processing of big data. Splits files into large blocks and distributes them across nodes in a cluster. It then transfers the packaged code into nodes to process the data in parallel, for faster processing
Apache Flink – open source stream processing framework to help you move data from your sensors and applications to your data stores and applications
Apache Storm – open source realtime computation system. Storm does for realtime processing what Hadoop does for batch processing
Azure DataBricks – platform for managing and deploying Spark at scale
Azure Data Lake Analytics – allows you to write queries against data in a wide variety of data stores
Azure notebooks – Jupyter notebooks hosted on Azure, supporting Python, F# and R
Azure SQL Data Warehouse – Data warehouse infrastructure
Caffe – Deep learning framework
Cassandra – NoSQL Database
Cognitive Toolkit (CNTK) – Microsoft’s deep learning toolkit (competes with Google TensorFlow) for training machine learning models. Provides APIs you call with
CouchDB – NoSQL Database
Chukwa – Data collection system for managing large distributed systems
H2O – Open source deep learning platform (competes with TensorFlow and Cognitive Toolkit)
Hadoop Distributed File System (HDFS) – the distributed file system used by Hadoop; great for horizontal scalability (does not support insert, update & delete)
Hadoop MapReduce – programming model used to process data, provides horizontal scalability
Hadoop YARN – platform for managing resources and scheduling in Hadoop clusters
HD Insight – Microsoft Azure service used to spin up Hadoop clusters to help analyze big data with Hadoop, Spark, HBase, R-Server, Storm, etc.
Hive – Data warehouse infrastructure
HBase – Scalable distributed non-relational database that supports structured data storage for large tables (billions of rows × millions of columns)
Jupyter Notebooks – web applications that allow you to create shareable interactive documents containing text, equations, code, and data
visualizations. Very useful for data scientists to explore and manipulate data sets and to share results. You can use them for data cleaning
and transformation, machine learning, and data visualization. They support Python, R, Julia, and Scala. You can use Jupyter notebooks on a Spark
Kafka – distributed publisher subscriber messaging system. Used in the extraction step of ETL for high volume high velocity data flow
MapReduce – a two-stage algorithm for processing large datasets. Data is split across a Hadoop cluster, the map function breaks data into key/value pairs (e.g. individual words in a text file), and the reduce function combines the mapped data (e.g. total counts of each word). MapReduce functions can be written in Java, Python, C# or Pig
MATLAB – tools for machine learning – build models
MongoDB – NoSQL Database
MySQL – Relational (SQL) database
Scikit-learn – tools for data mining and data analysis built on Python (NumPy, SciPy and matplotlib)
Spark – compute engine for Hadoop data used for ETL, machine learning, stream processing and graph computation (starting to replace
MapReduce because Spark is faster)
Sqoop – Used for transferring data between structured databases and Hadoop
TensorFlow – Google’s deep learning toolkit. An open source software library for training machine learning models, allows you to deploy
computation across one or more CPUs or GPUs with a single API. TensorFlow provides APIs you call from Python
Torch – computing framework for Machine learning algorithms that puts GPUs first (good for deep learning)
TPU – Tensor processing unit. Custom-built ASIC designed for high-performance running of models rather than training them.
The second generation of Google TPUs is available through Google Compute Engine
Tez – data flow programming framework built on YARN, runs projects like Hive and Pig, starting to replace MapReduce as the execution
engine on Hadoop because it can process data in a single job instead of multiple jobs
ZooKeeper – high performance coordination service for distributed applications
Programming languages and libraries
Scala – programming language for the JVM, with libraries and tools for performing data analysis
Python –
• pandas (for exploring data, data preparation: e.g. missing values, joins, string manipulation)
• NumPy – fundamental package for scientific computing with Python
• SciPy – routines for numerical integration and optimization
• matplotlib (for graphing, charting and visualizing data sets or query results)
• Keras – deep learning library for building your own neural networks
R – language for statistical computing (linear and nonlinear modelling, classification, clustering) and graphics
Julia – numerical computing language that supports parallel execution and can call C functions directly
Mahout – Scalable Machine Learning and data mining library
Pig (Pig Latin) – High level data flow language for parallel computation & ETL
HiveQL – similar to SQL, hides complexity of MapReduce programming, generates a MapReduce job
U-SQL – data language used by Azure Data Lake to query across data sources

RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter

Recently uploaded (20)

DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...

So your boss says you need to learn data science

  • 1. So your boss wants you to learn data science Susan Ibach @HockeyGeekGirl
  • 2. Data Science has become a buzzword I THINK WE NEED TO DO DATA SCIENCE YOUR DATA SCIENCE
  • 3. When your boss walks up to you and says we need to do data science, where do you start? PLATFORM TO USE DATA SCIENCE BIG DATA AI ML
  • 4. What is a data scientist? Advanced math skills Subject Matter Expertise Data Engineering skills
  • 5. Follow the 7 Steps to data science success 1 2 3 4 5 6 7
  • 6. Step 1: Identify your problem and data to define the problem 1
  • 7. What insights might help solve/define the problem? An airline wants to prevent flight delays
  • 9. Data science tools include Data Mining gain insights from data • Those who bought this also bought • Keyword extraction Machine Learning make predictions • Who will need hospitalization from the flu? • How many copies of this book will I sell? Deep Learning For complex data processed in layers • Is there a bird in this photo? • Will this person get cancer?
  • 10. Do we need Artificial Intelligence? •AI is when a computer completes a task that normally requires human intelligence • Answering questions from a customer • Recognizing the content of a photo • Understanding human speech •We use data science to analyze and recognize patterns and responses so we can do AI
  • 11. Step 2: Collect data 1 2
  • 12. Which flights are most likely to be delayed next week? What data would help you determine:
  • 13. Relational databases BLOB storage NoSQL databases Data warehouses Flat Files Open source data Sensors Where do I get all that data?
  • 14. BIG DATA When does data become “big data”? High Volume High Velocity High Variety
  • 15. Step 3: Prepare data 1 2 3
  • 16. Your data will need clean-up/prep Flight # Dep Date Sched Dep Time Dep Airport Dep Delay 041 15-dec-2016 09:20 YYZ 253:26 386 15-dec-2016 15:20 YYZ 415 15-dec-2016 19:15 YYZ 0:02 415 15-dec-2016 19:15 YYZ 0:02 Date Airport Wind Precipitation Precipitation Type 15/12/2016 Pearson NNE 5 MPH 150 mm Snow 15/12/2016 Dulles SW 18 MPH 7 mm Rain 15/12/2016 Reagan SW 18 MPH 7 mm Rain Missing Values Duplicate rows Different data formats Decomposition Outliers Scaling
  • 17. Start with what you already know • Excel, SQL Write your own Code • Python Pandas library, R Third party products Experian, Paxata, Alteryx, SAP Lumira, Teradata Data Lab, Knowledge Works, Datameer What tools might you use for data prep?
  • 18. If you have Big Data •Preparing and pulling together your data will require a LOT of storage and processing power
  • 19. Step 4: Identify the data that influences outcomes 1 2 3 4
  • 20. Which fields (“features”) might help us predict if a flight will be late (“label”)? Flight # Dep Date Sched Dep Time Dep Airport Dep Delay 041 15-dec-2016 09:20 YYZ 253:26 386 15-dec-2016 15:20 YYZ 415 15-dec-2016 19:15 YYZ 0:02 415 15-dec-2016 19:15 YYZ 0:02 Date Airport Wind Precipitation Precipitation Type 15/12/2016 Pearson NNE 5 MPH 15 cm Snow 15/12/2016 Dulles SW 18 MPH 7 mm Rain 15/12/2016 Reagan SW 18 MPH 7 mm Rain Are there any fields we can decompose to get more information?
  • 21. Which fields “features” help us predict if a picture contains a dog or cat “label”? • Pixel1Color, Pixel2Color, Pixel3Color,….Pixel9036Color
  • 22. Break out the deep learning GPUs Storage Pixel Edge Shape Cat
  • 23. Step 5: Pick the right algorithm 1 2 3 4 5
  • 24. What are you trying to predict? Prediction Algorithm Example Predict continuous values Regression Predict what time a flight will land Predict what category something falls into Classification Predict if a flight will be late or on time Detect unusual data points Anomaly detection Predict if a credit card transaction is fraudulent Predict if a runner cheated on a marathon
  • 25. Supervised vs Unsupervised Type Definition Example Supervised You have existing data with known inputs and known outputs to help make predictions When I try to predict if a flight next week will be late, I know what flights have been late in the past Unsupervised You have input data but no known outcomes in your data When I try to predict if a runner cheated on a marathon, I don’t have a history of runners who cheated in the past.
  • 26. Step 6: Train your model 1 2 3 4 5 6
  • 27. Once you have data and your algorithm you can train and create your predictive model
  • 28. Python R scikit-learn (based on NumPy, SciPy, and matplotlib) Azure Machine Learning Service Cognitive Toolkit/Tensorflow (deep learning) There are lots of tools to choose from
  • 29. Step 7: Test your model 1 2 3 4 5 6 7
  • 30. You need to know the accuracy of your model! Predictive/Trained Model Flt #406 Air Canada April 1, 2016 3:15 PM YYZ-YVR Late: No Flt #351 West Jet April 12, 2016 8:01 AM YOW-YYZ Late: No Flt #141 Delta Sep 25, 2016 1:45 PM HND-SEA Late: Yes Flt #406, Air Canada, April 1, 2016, 3:15 PM, YYZ-YVR Flt #351, West Jet, April 12, 2016, 8:01 AM, YOW-YYZ Flt #141, Delta, Sep 25, 2016, 1:45 PM HND-SEA Flt #406, Air Canada, April 1, 2016, 3:15 PM, YYZ-YVR, Late: Yes Flt #351, West Jet, April 12, 2016, 8:01 AM, YOW-YYZ, Late: No Flt #141, Delta, Sep 25, 2016, 1:45 PM HND-SEA, Late: Yes 66.6% accuracy
  • 31. What do I do if my accuracy is lousy? Go back to step 1
  • 32. For additional information •Appendix A What is Hadoop anyway? •Appendix B What cloud tools exist to help with data science? •Appendix C Lexicon
  • 34. Appendix A – What is Hadoop anyway? It’s a tool for analyzing Big Data
  • 35. Hadoop is an open source framework •Based on Java •Distributed processing of large datasets across clusters of computers •Distributed storage and computation across clusters of computers •Scales from single server to thousands of machines
  • 36. Hadoop components • Hadoop Common – Java libraries used by Hadoop to abstract the filesystem and OS • Hadoop YARN – framework for job scheduling and managing cluster resources • HDFS – distributed file system for access to application data (distributed storage) • Based on Google File System (GFS) • Hadoop can run on any distributed file system (Local FS, HFTP FS, S3 FS) but usually HDFS • A file in HDFS is split into blocks which are stored in DataNodes. The NameNode maps blocks to DataNodes • MapReduce – the programming model for parallel processing of large data sets (distributed computation) • Map data into key/value pairs (tuples) • Reduce data tuples into a smaller set of tuples • Input/output stored in file system • JobTracker and TaskTracker schedule, monitor tasks and re-execute failed tasks
  • 37. Hadoop components • Hive – similar to SQL, hides complexity of MapReduce programming, generates a MapReduce job • Pig (Pig Latin) – High level data flow language for parallel computation & ETL • HBase – Scalable distributed non-relational database that supports structured data storage for large tables (billions of rows X millions of columns) • Spark – compute engine for Hadoop data used for ETL, machine learning, stream/real-time processing and graph computation (gradually replacing MapReduce because it is faster for iterative algorithms)
  • 38. How does Hadoop work • User submits a job to Hadoop • Location of input and output files • Java classes containing map and reduce functions • Job configuration parameters •Hadoop submits the job to JobTracker which distributes the job to the slaves, schedules tasks and monitors them •Task trackers execute the task and output is stored in output files on the file system
  • 39. Why is it popular? • Allows user to quickly write and test distributed systems • It automatically distributes data and work across the machines and utilizes the parallelism of CPU cores • Does not rely on hardware for fault tolerance and high availability • Servers can be added or removed dynamically • It’s open source and compatible with many platforms since it is Java based
  • 40. Appendix B – What cloud tools exist to help with data science?
  • 41. Cortana, Bot Framework Interact with it Type messages, talk, send images, or video and get answers Power BI See it Visualize data with heat maps, graphs and charts Stream Analytics Stream it Monitor data as it arrives and act on it in real time Azure Machine Learning, Microsoft R Server Learn it Analyze past data to learn by finding patterns you can use to predict outcomes for new data SQL Data Warehouse, SQL DB, Document DB, Blob storage Relate it Store related data together using the best data store for the job Data Lake Store it A data store that can handle data of any size, shape or speed Event Hubs Collect it Collect data from sources such as IoT sensors that send large amounts of data over small amounts of time Data Factory Move it Move data from one place to another, transform it as you move it Data Catalog Document it Document all your data sources Cognitive Services Use it Pre-trained models available for use HD Insight & Azure Data Bricks Scale it Create clusters for Hadoop or Spark (DataBricks for Spark) Microsoft Azure
  • 42. Cloud Dataprep Prepare it Prepare your data for analysis BigQuery ML, BigQuery GIS Train it Train machine learning models BigQuery, GCP Data Lake Store it Data warehouse Cloud Dataproc Scale it Spin up clusters for Hadoop and Spark Cloud Pub/Sub, Cloud Dataflow Stream it Ingest events in real time Cloud Dataflow Store it A data store that can handle data of any size, shape or speed Prepackaged AI solutions Use it Pre-trained models available for use Google Cloud Platform
  • 43. Analytics Engine Scale it Build and deploy clusters for Hadoop and Spark InfoSphere Information Server on cloud Access it Extract, transform & load data + data standardization Streaming Analytics Stream it Monitor data as it arrives and act on it in real time IBM Watson Train it or Use it Train your own models or leverage pre-trained models for features such as speech to text, natural language processing, and image analysis Watson IoT Platform Collect it Connect devices and analyze the associated data Deep Learning Analyze it Design and deploy deep learning modules using neural networks IBM Data Refinery Prepare it Data preparation tool IBM
  • 44. Data Lakes, Redshift Store it Store your data Lake formation Move it Get data into your data lake Streaming Analytics Stream it Monitor data as it arrives and act on it in real time Amazon Kinesis, IoT Core Collect it Collect, process and analyze real time data including data from IoT devices Glue Document it Create a catalog of your data that is searchable and queryable by users Athena Analyze it Analyze your data EMR, Deep Learning AMIs Scale it Scale using Hadoop and Spark QuickSight See it Visualizations and dashboards Application Services Use it Pre-trained models ready for use Deep Learning AMIs, SageMaker Train it Tools to help you build and train models AWS
  • 46. Amazon Redshift – Data warehouse infrastructure Ambari – web based tool for managing Apache Hadoop clusters – provision, manage and monitor your Hadoop clusters Avro – a data serialization system (like XML or JSON) Apache Hadoop – distributed storage and processing of big data. Splits files into large blocks and distributes them across nodes in a cluster. It then transfers the packaged code into nodes to process the data in parallel, for faster processing Apache Flink – open source stream processing framework to help you move data from your sensors and applications to your data stores and applications Apache Storm – Open source realtime computation system. Storm does for realtime processing what Hadoop does for batch processing Azure DataBricks – platform for managing and deploying Spark at scale Azure Data Lake Analytics – allows you to write queries against data in a wide variety of data stores Azure notebooks – Basically Jupyter notebooks on Azure supporting Python, F# and R Azure SQL Data Warehouse – Data warehouse infrastructure Caffe – Deep learning framework Cassandra – NoSQL Database Cognitive Toolkit (CNTK) – Microsoft’s Deep learning toolkit (competes with Google Tensorflow) for training machine learning models. Provides APIs you call with Python CouchDB – NoSQL Database Chukwa – Data collection system for managing large distributed systems H2O – Open source deep learning platform (competes with Tensorflow and Cognitive Toolkit) Hadoop Distributed File System (HDFS) – the distributed file system used by Hadoop, great for horizontal scalability (does not support insert, update & delete) Hadoop Map Reduce – programming model used to process data, provides horizontal scalability Hadoop YARN – Platform for managing resources and scheduling in Hadoop clusters HD Insight – Microsoft Azure service used to spin up Hadoop clusters to help analyze big data with Hadoop, Spark, HBase, R-Server, Storm, etc.
  • 47. Hive – Data warehouse infrastructure HBase – Scalable distributed non-relational database that supports structured data storage for large tables (billions of rows X millions of columns) Jupyter Notebooks – web applications that allow you to create shareable interactive documents containing text, equations, code, and data visualizations. Very useful for data scientists to explore and manipulate data sets and to share results. You can use them for data cleaning and transformation, machine learning, and data visualization. Supports Python, R, Julia, and Scala. You can use Jupyter notebooks on a Spark Cluster. Kafka – distributed publisher subscriber messaging system. Used in the extraction step of ETL for high volume high velocity data flow MapReduce – a two stage algorithm for processing large datasets. Data is split across a Hadoop cluster, the map function breaks data into key/value pairs (e.g. individual words in a text file), the Reduce function combines the mapped data (e.g. total counts of each word). MapReduce functions can be written in Java, Python, C# or Pig MATLAB – tools for machine learning – build models MongoDB – NoSQL Database MySQL – Relational Database Scikit-learn – tools for data mining and data analysis built on Python (NumPy, SciPy and matplotlib) Spark – compute engine for Hadoop data used for ETL, machine learning, stream processing and graph computation (starting to replace MapReduce because Spark is faster) Sqoop – Used for transferring data between structured databases and Hadoop Tensorflow – Google’s deep learning toolkit. An open source software library for training machine learning models, allows you to deploy computation across one or more CPUs or GPUs with a single API. Tensorflow provides APIs you call from Python Torch – computing framework for Machine learning algorithms that puts GPUs first (good for deep learning) TPU – Tensor processing unit. Custom built ASIC designed for high performance for running models rather than training them, Google Compute Engine – second generation of Google TPUs Tez – data flow programming framework built on YARN, runs projects like Hive and Pig, starting to replace MapReduce as the execution engine on Hadoop because it can process data in a single job instead of multiple jobs ZooKeeper – high performance coordination service for distributed applications
  • 48. Scala – libraries and tools for performing data analysis Python – Pandas (for exploring data, data preparation: e.g. missing values, joins, string manipulation) NumPy – fundamental package for scientific computing with Python SciPy – numerical routines for numerical integration and optimization matplotlib (for graphing, charting and visualizing data sets or query results) Keras – deep learning for building your own neural networks R – language for statistical computing (linear and nonlinear modelling, classification, clustering) and graphics Julia – numerical computing language supports parallel execution based on C Mahout – Scalable Machine Learning and data mining library Pig (Pig Latin) – High level data flow language for parallel computation & ETL HiveQL – similar to SQL, hides complexity of MapReduce programming, generates a MapReduce job USQL – data language used by Azure Data Lake to query across data sources Programming languages and libraries
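
The MapReduce entries in the slides describe a map step that breaks data into key/value pairs and a reduce step that combines them, with word counting as the example. The sketch below imitates that flow on a single machine in plain Python. It is not Hadoop code, and the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are illustrative assumptions rather than part of any Hadoop API.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, roughly what the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: combine the values for each key into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of text lines."""
    pairs = []
    for line in lines:  # on a cluster, each block of lines could be mapped on a different node
        pairs.extend(map_words(line))
    return reduce_counts(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["late flight", "flight on time", "late again"]))
```

On a real cluster the map calls would run in parallel next to the data blocks, and the grouping would happen across the network. The sketch keeps only the shape of the computation.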

Editor's Notes

  1. 1
  2. What insights might help solve/define the problem? How many flights were late last year? How much money did we spend on flight delays? What are the most common causes of late flights? Is the number of late flights increasing or decreasing year over year? Which flights are most likely to be delayed next week?
  3. SQL Queries are used to extract data from relational databases Data Warehouses are used to aggregate historical data to see trends Dashboards are used to provide visualizations of important data
  4. Data mining to gain insights from data Those who bought this also bought Keyword extraction Machine Learning to make predictions by using algorithms to parse and learn from historical data Predict if this credit card was stolen based on the most recent transactions Deep learning to analyze data with a lot of different features Is there a bird in this photo? Will this person get cancer?
  5. Weather forecast Crew schedules Maintenance history Passenger information Airport information
  6. Big data is what we call data that is so big and complex that traditional data processing is inadequate (e.g. internet search, financial, genomics) High volume (amount of data) High variety (range of data types and sources) High velocity (speed of data in or out)
  7. Missing values Duplicate rows Different data formats Outliers Decomposition Aggregation Scaling
  8. You will need parallel processing and distributed storage Hadoop gives you distributed storage and processing across one or more servers You set up a cluster and run Hadoop on your cluster to abstract the hardware Numerous tools run on top of Hadoop to access the data and perform the processing (Hive, Spark, MapReduce, Pig)
  9. Flight Number Scheduled Departure time Flight distance Day of week Month Year
  10. Sometimes even a subject matter expert cannot identify the features You can use deep learning with neural networks to identify the significant features The more processing power the better! Cheap storage and GPUs enabled breakthroughs in deep learning
  11. 33
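
Note 7 lists the clean-up problems shown on the data prep slide: missing values, duplicate rows, different data formats and decomposition. The sketch below works through them on rows modelled on the flight table, using plain Python instead of pandas so it runs without extra packages. It rests on several assumptions: the delay column is read as hours:minutes, the row for flight 502 is a hypothetical addition to show a second date format, and the names `parse_date`, `parse_delay_minutes` and `prepare` are illustrative.

```python
from datetime import datetime

# Rows modelled on the slides: (flight, dep_date, sched_dep, airport, dep_delay)
ROWS = [
    ("041", "15-dec-2016", "09:20", "YYZ", "253:26"),
    ("386", "15-dec-2016", "15:20", "YYZ", ""),      # missing delay value
    ("415", "15-dec-2016", "19:15", "YYZ", "0:02"),
    ("415", "15-dec-2016", "19:15", "YYZ", "0:02"),  # duplicate row
    ("502", "15/12/2016", "07:05", "YYZ", "0:10"),   # hypothetical row in the weather table's date format
]


def parse_date(text):
    """Accept either date format seen in the sample tables."""
    for fmt in ("%d-%b-%Y", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")


def parse_delay_minutes(text):
    """Convert a delay to minutes (assumed hours:minutes); None marks a missing value."""
    if not text:
        return None
    hours, minutes = text.split(":")
    return int(hours) * 60 + int(minutes)


def prepare(rows):
    """Drop exact duplicates, normalise dates, and decompose the date into a weekday."""
    seen = set()
    cleaned = []
    for row in rows:
        if row in seen:
            continue
        seen.add(row)
        flight, date_text, sched_dep, airport, delay_text = row
        dep_date = parse_date(date_text)
        cleaned.append({
            "flight": flight,
            "dep_date": dep_date.isoformat(),   # one consistent date format
            "day_of_week": dep_date.weekday(),  # decomposition: 0 = Monday
            "sched_dep": sched_dep,
            "airport": airport,
            "delay_min": parse_delay_minutes(delay_text),
        })
    return cleaned


if __name__ == "__main__":
    for record in prepare(ROWS):
        print(record)
```

The 253:26 delay survives this pass unchanged. Deciding whether it is an outlier to drop or a genuine extreme delay is a separate step that usually needs subject matter expertise.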