Interested in data science, but confused by all the terminology? Not sure where to start? This presentation breaks down the concepts and the terminology.
7. What insights might help solve/define the problem?
An airline wants to prevent flight delays
8. Different insights require different tools
SELECT COUNT(*) FROM FLIGHTS
WHERE ACTUAL_ARR_TIME > SCHED_ARR_TIME;

SELECT COUNT(*) FROM FLIGHTS
WHERE ACTUAL_ARR_TIME > SCHED_ARR_TIME
  AND YEAR(DEP_DATE) BETWEEN 1997 AND 2017; -- assumes a DEP_DATE column holds the departure date
9. Data science tools include
Data Mining – gain insights from data
• Those who bought this also bought
• Keyword extraction
Machine Learning – make predictions
• Who will need hospitalization from the flu?
• How many copies of this book will I sell?
Deep Learning – for complex data processed in layers
• Is there a bird in this photo?
• Will this person get cancer?
10. Do we need Artificial Intelligence?
• AI is when a computer completes a task that normally requires human intelligence
• Answering questions from a customer
• Recognizing the content of a photo
• Understanding human speech
• We use data science to analyze and recognize patterns and responses so we can do AI
16. Your data will need clean-up/prep

Flight #  Dep Date     Sched Dep Time  Dep Airport  Dep Delay
041       15-dec-2016  09:20           YYZ          253:26
386       15-dec-2016  15:20           YYZ
415       15-dec-2016  19:15           YYZ          0:02
415       15-dec-2016  19:15           YYZ          0:02

Date        Airport  Wind       Precipitation  Precipitation Type
15/12/2016  Pearson  NNE 5 MPH  150 mm         Snow
15/12/2016  Dulles   SW 18 MPH  7 mm           Rain
15/12/2016  Reagan   SW 18 MPH  7 mm           Rain

• Missing values
• Duplicate rows
• Different data formats
• Decomposition
• Outliers
• Scaling
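The clean-up tasks listed above can be sketched in a few lines of Python. In practice you would reach for pandas, but here is a dependency-free sketch; the rows and column names are hypothetical, and delays are simplified to whole minutes:

```python
from datetime import datetime

# Hypothetical raw rows: flight #, departure date, delay in minutes (may be missing)
rows = [
    {"flight": "041", "dep_date": "15-dec-2016", "delay_min": 253},
    {"flight": "386", "dep_date": "15-dec-2016", "delay_min": None},  # missing value
    {"flight": "415", "dep_date": "15-dec-2016", "delay_min": 2},
    {"flight": "415", "dep_date": "15-dec-2016", "delay_min": 2},     # duplicate row
]

# Duplicate rows: keep only the first occurrence of each (flight, date) pair
seen, deduped = set(), []
for r in rows:
    key = (r["flight"], r["dep_date"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Missing values: fill with 0 here (a median is another common choice)
for r in deduped:
    if r["delay_min"] is None:
        r["delay_min"] = 0

# Different data formats: normalize the date to ISO format
for r in deduped:
    r["dep_date"] = datetime.strptime(r["dep_date"], "%d-%b-%Y").date().isoformat()

# Scaling: min-max scale delays into the 0-1 range
lo = min(r["delay_min"] for r in deduped)
hi = max(r["delay_min"] for r in deduped)
for r in deduped:
    r["delay_scaled"] = (r["delay_min"] - lo) / (hi - lo)
```

With pandas the same steps collapse to `drop_duplicates()`, `fillna()`, `to_datetime()`, and a one-line scaling expression.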
17. What tools might you use for data prep?
Start with what you already know
• Excel, SQL
Write your own code
• Python Pandas library, R
Third-party products
• Experian, Paxata, Alteryx, SAP Lumira, Teradata Data Lab, Knowledge Works, Datameer
18. If you have Big Data
•Preparing and pulling together your data will require
a LOT of storage and processing power
20. Which fields (features) might help us predict if a flight will be late (the label)?

Flight #  Dep Date     Sched Dep Time  Dep Airport  Dep Delay
041       15-dec-2016  09:20           YYZ          253:26
386       15-dec-2016  15:20           YYZ
415       15-dec-2016  19:15           YYZ          0:02
415       15-dec-2016  19:15           YYZ          0:02

Date        Airport  Wind       Precipitation  Precipitation Type
15/12/2016  Pearson  NNE 5 MPH  15 cm          Snow
15/12/2016  Dulles   SW 18 MPH  7 mm           Rain
15/12/2016  Reagan   SW 18 MPH  7 mm           Rain

Are there any fields we can decompose to get more information?
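Decomposing the Dep Date field, for example, yields extra candidate features such as day of week, month, and year. A quick sketch with Python's standard library, using a date value from the table above:

```python
from datetime import datetime

# Parse one Dep Date value from the table
dep_date = datetime.strptime("15-dec-2016", "%d-%b-%Y")

# Decompose a single field into several candidate features
features = {
    "day_of_week": dep_date.strftime("%A"),  # weekend traffic may differ from weekday
    "month": dep_date.month,                 # winter storms vs. summer weather
    "year": dep_date.year,                   # long-term delay trends
}
```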
21. Which fields (features) help us predict if a picture contains a dog or a cat (the label)?
• Pixel1Color, Pixel2Color, Pixel3Color, …, Pixel9036Color
22. Break out the deep learning
• GPUs
• Storage
Pixel → Edge → Shape → Cat (each layer learns higher-level features)
24. What are you trying to predict?

• Regression – predict continuous values (e.g. predict what time a flight will land)
• Classification – predict what category something falls into (e.g. predict if a flight will be late or on time)
• Anomaly detection – detect unusual data points (e.g. predict if a credit card transaction is fraudulent, or if a runner cheated on a marathon)
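To give a flavor of anomaly detection, here is a minimal z-score sketch in Python. The 5 km split times (in minutes) are made up for illustration; a real system would use a proper anomaly-detection model:

```python
from statistics import mean, stdev

# Hypothetical 5 km split times (minutes) for one marathon runner
splits = [25.1, 24.8, 25.3, 24.9, 14.0, 25.2]  # 14.0 looks like a shortcut

mu = mean(splits)
sigma = stdev(splits)

# Flag any split more than 2 standard deviations from the runner's mean pace
anomalies = [s for s in splits if abs(s - mu) > 2 * sigma]
```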
25. Supervised vs Unsupervised

• Supervised – you have existing data with known inputs and known outputs to help make predictions. Example: when I try to predict if a flight next week will be late, I know which flights have been late in the past.
• Unsupervised – you have input data but no known outcomes in your data. Example: when I try to predict if a runner cheated on a marathon, I don't have a history of runners who cheated in the past.
27. Once you have data and your algorithm
you can train and create your predictive
model
28. There are lots of tools to choose from
• Python
• R
• scikit-learn (based on NumPy, SciPy, and matplotlib)
• Azure Machine Learning Service
• Cognitive Toolkit/Tensorflow (deep learning)
30. You need to know the accuracy of your model!

Feed test flights (with the Late label withheld) into the predictive/trained model:
Flt #406, Air Canada, April 1, 2016, 3:15 PM, YYZ-YVR
Flt #351, West Jet, April 12, 2016, 8:01 AM, YOW-YYZ
Flt #141, Delta, Sep 25, 2016, 1:45 PM, HND-SEA

The model predicts:
Flt #406 – Late: No
Flt #351 – Late: No
Flt #141 – Late: Yes

The actual outcomes:
Flt #406 – Late: Yes
Flt #351 – Late: No
Flt #141 – Late: Yes

2 of 3 predictions correct = 66.6% accuracy
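The accuracy figure is simply correct predictions divided by total predictions. With scikit-learn you would call `accuracy_score`; in plain Python, using the three test flights above:

```python
# Model predictions vs. actual outcomes for the three test flights
predicted = ["No", "No", "Yes"]   # Flt #406, #351, #141
actual    = ["Yes", "No", "Yes"]

# Accuracy = correct predictions / total predictions
correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)  # 2 of 3 correct
```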
31. What do I do if my accuracy is lousy?
Go back to step 1
34. Appendix A –
What is Hadoop anyway?
It’s a tool for analyzing Big Data
35. Hadoop is an open-source framework
• Based on Java
• Distributed processing of large datasets across clusters of computers
• Distributed storage and computation across clusters of computers
• Scales from a single server to thousands of machines
36. Hadoop components
• Hadoop Common – Java libraries used by Hadoop to abstract the filesystem and OS
• Hadoop YARN – framework for job scheduling and managing cluster resources
• HDFS – distributed file system for access to application data (distributed storage)
  • Based on the Google File System (GFS)
  • Hadoop can run on other distributed file systems (e.g. local FS, HFTP FS, S3 FS), but usually HDFS
  • A file in HDFS is split into blocks which are stored in DataNodes; the NameNode maps blocks to DataNodes
• MapReduce – the programming model for parallel processing of large data sets (distributed computation)
  • Map data into key/value pairs (tuples)
  • Reduce data tuples into smaller pairs of tuples
  • Input/output stored in the file system
  • JobTracker and TaskTracker schedule and monitor tasks and re-execute failed tasks
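The two MapReduce phases can be sketched in a few lines of Python; Hadoop runs the same idea distributed across a cluster. The sample text is made up:

```python
from itertools import groupby

text = "the quick fox and the lazy dog"

# Map phase: emit a (key, value) tuple for every word
mapped = [(word, 1) for word in text.split()]

# Shuffle: group tuples by key (Hadoop does this between the two phases)
mapped.sort(key=lambda kv: kv[0])

# Reduce phase: combine each group of tuples into a single, smaller tuple
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=lambda kv: kv[0])}
```

This is the classic word-count example: the map function emits (word, 1) pairs, and the reduce function sums the values for each word.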
37. Hadoop components
• Hive – similar to SQL; hides the complexity of MapReduce programming by generating a MapReduce job
• Pig (Pig Latin) – high-level data flow language for parallel computation & ETL
• HBase – scalable distributed non-relational database that supports structured data storage for large tables (billions of rows × millions of columns)
• Spark – compute engine for Hadoop data used for ETL, machine learning, stream/real-time processing and graph computation (gradually replacing MapReduce because it is faster for iterative algorithms)
38. How does Hadoop work?
• The user submits a job to Hadoop:
  • Location of input and output files
  • Java classes containing the map and reduce functions
  • Job configuration parameters
• Hadoop submits the job to the JobTracker, which distributes the job to the slaves, schedules tasks and monitors them
• TaskTrackers execute the tasks, and the output is stored in output files on the file system
39. Why is it popular?
• Allows users to quickly write and test distributed systems
• Automatically distributes data and work across the machines and utilizes the parallelism of CPU cores
• Does not rely on hardware for fault tolerance and high availability
• Servers can be added or removed dynamically
• It's open source and compatible with many platforms since it is Java based
41. Microsoft Azure

• Cortana, Bot Framework – Interact with it: type messages, talk, send images or video and get answers
• Power BI – See it: visualize data with heat maps, graphs and charts
• Stream Analytics – Stream it: monitor data as it arrives and act on it in real time
• Azure Machine Learning, Microsoft R Server – Learn it: analyze past data to learn by finding patterns you can use to predict outcomes for new data
• SQL Data Warehouse, SQL DB, Document DB, Blob storage – Relate it: store related data together using the best data store for the job
• Data Lake – Store it: a data store that can handle data of any size, shape or speed
• Event Hubs – Collect it: collect data from sources such as IoT sensors that send large amounts of data over small amounts of time
• Data Factory – Move it: move data from one place to another, transforming it as you move it
• Data Catalog – Document it: document all your data sources
• Cognitive Services – Use it: pre-trained models available for use
• HD Insight & Azure Data Bricks – Scale it: create clusters for Hadoop or Spark (DataBricks for Spark)
42. Google Cloud Platform

• Cloud Dataprep – Prepare it: prepare your data for analysis
• BigQuery ML, BigQuery GIS – Train it: train machine learning models
• BigQuery, GCP Data Lake – Store it: data warehouse
• Cloud Dataproc – Scale it: spin up clusters for Hadoop and Spark
• Cloud Pub/Sub, Cloud Dataflow – Stream it: ingest events in real time
• Cloud Dataflow – Store it: a data store that can handle data of any size, shape or speed
• Prepackaged AI solutions – Use it: pre-trained models available for use
43. IBM

• Analytics Engine – Scale it: build and deploy clusters for Hadoop and Spark
• InfoSphere Information Server on Cloud – Access it: extract, transform & load data, plus data standardization
• Streaming Analytics – Stream it: monitor data as it arrives and act on it in real time
• IBM Watson – Train it or Use it: train your own models or leverage pre-trained models for features such as speech to text, natural language processing, and image analysis
• Watson IoT Platform – Collect it: connect devices and analyze the associated data
• Deep Learning – Analyze it: design and deploy deep learning modules using neural networks
• IBM Data Refinery – Prepare it: data preparation tool
44. AWS

• Data Lakes, Redshift – Store it: store your data
• Lake Formation – Move it: get data into your data lake
• Streaming Analytics – Stream it: monitor data as it arrives and act on it in real time
• Amazon Kinesis, IoT Core – Collect it: collect, process and analyze real-time data, including data from IoT devices
• Glue – Document it: create a catalog of your data that is searchable and queryable by users
• Athena – Analyze it: analyze your data
• EMR, Deep Learning AMIs – Scale it: scale using Hadoop and Spark
• QuickSight – See it: visualizations and dashboards
• Application Services – Use it: pre-trained models ready for use
• Deep Learning AMIs, SageMaker – Train it: tools to help you build and train models
46. Amazon Redshift – Data warehouse infrastructure
Ambari – web-based tool for managing Apache Hadoop clusters: provision, manage and monitor your Hadoop clusters
Avro – a data serialization system (like XML or JSON)
Apache Hadoop – distributed storage and processing of big data. Splits files into large blocks and distributes them across nodes in a cluster, then transfers the packaged code to the nodes to process the data in parallel, for faster processing
Apache Flink – open source stream processing framework to help you move data from your sensors and applications to your data stores and applications
Apache Storm – open source realtime computation system. Storm does for realtime processing what Hadoop does for batch processing
Azure DataBricks – platform for managing and deploying Spark at scale
Azure Data Lake Analytics – allows you to write queries against data in a wide variety of data stores
Azure Notebooks – basically Jupyter notebooks on Azure, supporting Python, F# and R
Azure SQL Data Warehouse – Data warehouse infrastructure
Caffe – Deep learning framework
Cassandra –NoSQL Database
Cognitive Toolkit (CNTK) – Microsoft’s Deep learning toolkit (competes with Google Tensorflow) for training machine learning models. Provides APIs you call with
Python
CouchDB – NoSQL Database
Chukwa – Data collection system for managing large distributed systems
H2O – Open source deep learning platform (competes with Tensorflow and Cognitive Toolkit)
Hadoop Distributed File System (HDFS) – the distributed file system used by Hadoop; great for horizontal scalability (does not support insert, update & delete)
Hadoop MapReduce – programming model used to process data; provides horizontal scalability
Hadoop YARN – platform for managing resources and scheduling in Hadoop clusters
HD Insight – Microsoft Azure service used to spin up Hadoop clusters to help analyze big data with Hadoop, Spark, HBase, R Server, Storm, etc.
47. Hive – Data warehouse infrastructure
HBase – scalable distributed non-relational database that supports structured data storage for large tables (billions of rows × millions of
columns)
Jupyter Notebooks – web applications that allow you to create shareable interactive documents containing text, equations, code, and data visualizations. Very useful for data scientists to explore and manipulate data sets and to share results. You can use them for data cleaning and transformation, machine learning, and data visualization; supports Python, R, Julia, and Scala. You can use Jupyter notebooks on a Spark cluster.
Kafka – distributed publisher-subscriber messaging system. Used in the extraction step of ETL for high-volume, high-velocity data flows
MapReduce – a two-stage algorithm for processing large datasets. Data is split across a Hadoop cluster; the Map function breaks data into key/value pairs (e.g. individual words in a text file), and the Reduce function combines the mapped data (e.g. total counts of each word). MapReduce functions can be written in Java, Python, C# or Pig
MATLAB – tools for machine learning and building models
MongoDB – NoSQL Database
MySQL – relational (SQL) database
Scikit-learn – tools for data mining and data analysis built on Python (NumPy, SciPy and matplotlib)
Spark – compute engine for Hadoop data used for ETL, machine learning, stream processing and graph computation (starting to replace
MapReduce because Spark is faster)
Sqoop – Used for transferring data between structured databases and Hadoop
Tensorflow – Google’s deep learning toolkit. An open source software library for training machine learning models; allows you to deploy computation across one or more CPUs or GPUs with a single API. Tensorflow provides APIs you call from Python
Torch – computing framework for Machine learning algorithms that puts GPUs first (good for deep learning)
TPU – Tensor Processing Unit. Custom-built ASIC designed for high performance when running models rather than training them. Second-generation Google TPUs are available through Google Compute Engine
Tez – data flow programming framework built on YARN, runs projects like Hive and Pig, starting to replace MapReduce as the execution
engine on Hadoop because it can process data in a single job instead of multiple jobs
ZooKeeper – high performance coordination service for distributed applications
48. Programming languages and libraries

• Scala – libraries and tools for performing data analysis
• Python:
  • Pandas – for exploring data and data preparation (e.g. missing values, joins, string manipulation)
  • NumPy – fundamental package for scientific computing with Python
  • SciPy – routines for numerical integration and optimization
  • matplotlib – for graphing, charting and visualizing data sets or query results
  • Keras – deep learning library for building your own neural networks
• R – language for statistics (linear and nonlinear modelling, classification, clustering) and graphics
• Julia – numerical computing language that supports parallel execution, with performance comparable to C
• Mahout – scalable machine learning and data mining library
• Pig (Pig Latin) – high-level data flow language for parallel computation & ETL
• HiveQL – similar to SQL; hides the complexity of MapReduce programming by generating a MapReduce job
• U-SQL – query language used by Azure Data Lake to query across data sources
Editor's Notes
What insights might help solve/define the problem?
How many flights were late last year?
How much money did we spend on flight delays?
What are the most common causes of late flights?
Is the number of late flights increasing or decreasing year over year?
Which flights are most likely to be delayed next week?
SQL Queries are used to extract data from relational databases
Data Warehouses are used to aggregate historical data to see trends
Dashboards are used to provide visualizations of important data
Data mining to gain insights from data
Those who bought this also bought
Keyword extraction
Machine Learning to make predictions by using algorithms to parse and learn from historical data
Predict if this credit card was stolen based on the most recent transactions
Deep learning to analyze data with a lot of different features
Is there a bird in this photo? Will this person get cancer?
Weather forecast
Crew schedules
Maintenance history
Passenger information
Airport information
Big data is what we call data that is so big and complex that traditional data processing is inadequate (e.g. internet search, financial, genomics)
High volume (amount of data)
High variety (range of data types and sources)
High velocity (speed of data in or out)
Missing values
Duplicate rows
Different data formats
Outliers
Decomposition
Aggregation
Scaling
You will need parallel processing and distributed storage
Hadoop gives you distributed storage and processing across one or more servers
You set up a cluster and run Hadoop on your cluster to abstract the hardware
Numerous tools run on top of Hadoop to access the data and perform the processing (Hive, Spark, MapReduce, Pig)
Flight Number
Scheduled Departure time
Flight distance
Day of week
Month
Year
Sometimes even a subject matter expert cannot identify the features
You can use deep learning with neural networks to identify the significant features
The more processing power the better!
Cheap storage and GPUs enabled breakthroughs in deep learning