1 © Hortonworks Inc. 2011–2018. All rights reserved
DWS Washington, D.C. 2019
Robert Hryniewicz
@robhryniewicz
Data Science Crash Course
INTRO TO DATA SCIENCE
What is Data Science?
The scientific exploration of data to extract meaning or insight, using statistics and mathematical models, with the end goal of making smarter, quicker decisions.
What is Machine Learning? Favorite cocktail party definitions
• Machine Learning is programming with data.
• Machine Learning is a way to use data to draw meaningful conclusions, including identifying patterns, anomalies and trends that may not be obvious to humans.
• Machine Learning is math, at scale.
• Using statistical analysis of data to build predictive systems without needing to design or maintain explicit rules.
Examples where Machine Learning can be applied
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates
Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels
Insurance
• Risk assessment
• Customer insights/experience
• Real-time financial analysis
Life sciences
• Genome sequencing
• Drug development
• Sensor data
What is an ML Model?
• Mathematical formula with a number of parameters that need to be learned from the
data. Fitting a model to the data is a process known as model training.
• E.g. linear regression
• Goal: fit a line y = mx + c to data points
• After model training: y = 2x + 5
Input: 1, 0, 7, 2, … → Model: y = 2x + 5 → Output: 7, 5, 19, 9, …
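
As a minimal sketch of what "model training" means here, the snippet below fits y = mx + c to the four input/output pairs from the slide using ordinary least squares in plain Python. The function name `fit_line` is illustrative and not part of the original deck.

```python
def fit_line(xs, ys):
    """Fit y = m*x + c by ordinary least squares (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

xs = [1, 0, 7, 2]      # inputs from the slide
ys = [7, 5, 19, 9]     # outputs from the slide
m, c = fit_line(xs, ys)
print(f"y = {m:g}x + {c:g}")
```

The learned parameters m and c are the "model"; predicting for a new input is just evaluating m * x + c.
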
Types of Learning
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Supervised Learning
Use labeled (training) datasets to learn the relationship of given inputs to outputs.
Once the model is trained, use it to predict outputs on new input data.
[Diagram: several inputs feed a model that produces Output 1, Output 2, …, Output n]
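
The train-then-predict pattern can be sketched with a 1-nearest-neighbor classifier, where "training" simply means storing the labeled examples. The data points and the "cat"/"dog" labels below are made up for illustration.

```python
import math

# Labeled training data: (features, label). Values are made up for illustration.
training_data = [
    ((1.0, 1.2), "cat"),
    ((0.8, 1.0), "cat"),
    ((3.0, 3.5), "dog"),
    ((3.2, 2.9), "dog"),
]

def predict(features, labeled_examples):
    """Return the label of the closest training example (1-nearest neighbor)."""
    nearest = min(labeled_examples, key=lambda ex: math.dist(ex[0], features))
    return nearest[1]

# New, unlabeled inputs the model has not seen before
new_inputs = [(0.9, 1.1), (3.1, 3.0)]
predictions = [predict(x, training_data) for x in new_inputs]
print(predictions)
```
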
Unsupervised Learning
Explore, classify & find
patterns in the input data
without being explicit
about the output.
Reinforcement Learning
[Diagram: the algorithm sends an Action to the environment, which returns a Reward and a State]
The algorithm learns to maximize the rewards it receives for its actions (e.g. maximizes points for investment returns).
Use when you don’t have lots of training data, you can’t clearly define the ideal end state, or the only way to learn is by interacting with the environment.
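
One of the simplest settings that shows this loop is a multi-armed bandit, which has a single state and a few possible actions. The sketch below uses an epsilon-greedy agent; the fixed reward values, the `run_bandit` name and all parameters are assumptions chosen for illustration.

```python
import random

def run_bandit(rewards, steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a toy environment with one state.

    rewards[a] is the fixed reward the environment returns for action a.
    """
    rng = random.Random(seed)
    n_actions = len(rewards)
    counts = [0] * n_actions
    estimates = [0.0] * n_actions   # learned value of each action
    total = 0.0
    for t in range(steps):
        if t < n_actions:
            action = t                                # try every action once
        elif rng.random() < epsilon:
            action = rng.randrange(n_actions)         # explore
        else:
            action = max(range(n_actions), key=lambda a: estimates[a])  # exploit
        reward = rewards[action]                      # environment responds
        counts[action] += 1
        # Incremental average of the rewards seen for this action
        estimates[action] += (reward - estimates[action]) / counts[action]
        total += reward
    return estimates, total

estimates, total_reward = run_bandit(rewards=[1.0, 5.0, 2.0])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_action, estimates)
```

The agent never sees the reward table directly; it only learns from the rewards returned after each action.
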
ALGORITHMS
Regression
• Linear Regression
Classification
• Logistic Regression
• Support Vector Machines (SVM)
• Random Forest (RF)
• Naïve Bayes
Recommender Systems / Collaborative Filtering
• Alternating Least Squares (ALS)
Clustering
• K-Means, LDA
Dimensionality Reduction
• Principal Component Analysis (PCA)
Deep Learning
• Fully Connected Neural Nets → Tabular or Recommender Systems
• Convolutional Neural Nets (CNNs) → Images
• Recurrent Neural Nets (RNNs) → Natural Language Processing (NLP) / Text
REGRESSION
Predicting a continuous-valued output
Example: Predicting house prices based on number of bedrooms and square footage
Algorithms: Linear Regression
CLASSIFICATION
Identifying which category an object belongs to
Examples: spam detection, diabetes diagnosis, text labeling
Algorithms:
• Logistic Regression
  • Fast training (linear model)
  • Classes expressed in probabilities
  • Less overfitting [+]
  • Less fitting (accuracy) [-]
• Support Vector Machines (SVM)
  • “Best” supervised learning algorithm, effective
  • State of the art prior to Deep Learning
  • More robust to outliers than Logistic Regression
  • Handles non-linearity
  • Check out: blog.statsbot.co/support-vector-machines-tutorial-c1618e635e93
• Random Forest (ensemble of Decision Trees)
  • Fast training
  • Handles categorical features
  • Does not require feature scaling
  • Captures non-linearity and feature interaction
  • i.e. implicitly performs a form of feature selection
• Naïve Bayes
  • Good for text classification
  • Assumes independent variables / words
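
To illustrate "classes expressed in probabilities", here is a minimal logistic regression trained with batch gradient descent on a made-up one-feature dataset. The learning rate, epoch count and data are assumptions for the sketch, not values from the deck.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the average log loss with respect to w and b
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [1, 2, 3, 4, 5, 6]      # single feature
ys = [0, 0, 0, 1, 1, 1]      # class labels
w, b = train_logistic(xs, ys)

p_low = sigmoid(w * 1.5 + b)    # probability of class 1 for a small input
p_high = sigmoid(w * 5.5 + b)   # probability of class 1 for a large input
print(round(p_low, 3), round(p_high, 3))
```

The output is a probability rather than a hard label; a threshold such as 0.5 turns it into a class decision.
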
CLASSIFICATION: Visual Intro to Decision Trees
• http://www.r2d3.us/visual-intro-to-machine-learning-part-1
CLUSTERING
Automatic grouping of similar objects into sets (clusters)
Example: market segmentation – automatically group customers into different market segments
Algorithms: K-means, LDA
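
A bare-bones K-means can be sketched in a few lines: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. The customer features and the choice of starting centroids below are hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (annual spend in $1000s, store visits per month)
customers = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2), (8.0, 9.0), (8.5, 9.5), (9.0, 8.8)]
centroids, clusters = kmeans(customers, centroids=[customers[0], customers[3]])
print(centroids)
```

Note that no labels are supplied; the two segments emerge from the data alone.
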
COLLABORATIVE FILTERING
Fill in the missing entries of a user-item association matrix
Applications: Product/movie recommendation
Algorithms: Alternating Least Squares (ALS)
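
The idea behind ALS can be sketched with a deliberately simplified rank-1 version: approximate each known rating as user[i] * item[j], solve for the user factors with the item factors held fixed, then swap. Production implementations use several latent factors per user and item; the tiny matrix, the `als_rank1` name and the regularization value here are assumptions for illustration.

```python
def als_rank1(ratings, iterations=50, reg=1e-6):
    """Rank-1 alternating least squares on a matrix with missing entries (None).

    Approximates ratings[i][j] by user[i] * item[j], solving for the user
    factors with the item factors fixed, then the other way round.
    """
    n_users, n_items = len(ratings), len(ratings[0])
    user = [1.0] * n_users
    item = [1.0] * n_items
    for _ in range(iterations):
        for i in range(n_users):
            seen = [j for j in range(n_items) if ratings[i][j] is not None]
            user[i] = (sum(ratings[i][j] * item[j] for j in seen)
                       / (sum(item[j] ** 2 for j in seen) + reg))
        for j in range(n_items):
            seen = [i for i in range(n_users) if ratings[i][j] is not None]
            item[j] = (sum(ratings[i][j] * user[i] for i in seen)
                       / (sum(user[i] ** 2 for i in seen) + reg))
    return user, item

# Rows are users, columns are items; the known entries follow a rank-1 pattern
ratings = [
    [1, 2, 4],
    [2, 4, 8],
    [3, 6, None],   # missing entry to fill in
]
user, item = als_rank1(ratings)
predicted = user[2] * item[2]
print(round(predicted, 2))
```

Because the known entries are exactly rank-1, the filled-in value should land very close to 12 (3 × 4), which is what a recommender would surface as the predicted rating.
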
DIMENSIONALITY REDUCTION
Reducing the number of redundant features/variables
Applications:
• Removing noise in images by selecting only
“important” features
• Removing redundant features, e.g. MPH & KPH are
linearly dependent
Algorithms: Principal Component Analysis (PCA)
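
The MPH/KPH example can be checked numerically. For two features, PCA reduces to finding the eigenvalues of a 2x2 covariance matrix, which has a closed form. The speed values below are made up; the sketch only shows that the second principal component carries essentially no variance when one feature is a multiple of the other.

```python
import math

def pca_2d_eigenvalues(xs, ys):
    """Eigenvalues of the 2x2 covariance matrix of two features, largest first."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of the symmetric matrix [[var_x, cov], [cov, var_y]]
    mid = (var_x + var_y) / 2
    half_gap = math.sqrt(((var_x - var_y) / 2) ** 2 + cov ** 2)
    return mid + half_gap, mid - half_gap

mph = [10.0, 25.0, 40.0, 55.0, 70.0]
kph = [v * 1.609344 for v in mph]          # same information in other units
first, second = pca_2d_eigenvalues(mph, kph)
explained_by_first = first / (first + second)
print(explained_by_first)
```

With nearly all variance on the first component, one of the two columns can be dropped without losing information.
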
Deep Learning
Simple/shallow vs Deep Neural Net
Popular Neural Net Architectures
• Convolutional Neural Nets (CNNs) → Images
• Recurrent Neural Nets (RNNs) → Text / Language (NLP) & Time Series
• Long Short-Term Memory (LSTM) → Text / Language (NLP) & Time Series
Number Probability
0 0.03
1 0.01
2 0.04
3 0.08
4 0.05
5 0.08
6 0.07
7 0.02
8 0.54
9 0.08
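
Reading a classifier output like the table above is a matter of picking the class with the highest probability. A short sketch, with the probabilities copied from the table:

```python
# Class probabilities copied from the table above (index = digit)
probabilities = [0.03, 0.01, 0.04, 0.08, 0.05, 0.08, 0.07, 0.02, 0.54, 0.08]

# The predicted class is the index with the highest probability
predicted_digit = max(range(len(probabilities)), key=lambda d: probabilities[d])
confidence = probabilities[predicted_digit]
print(predicted_digit, confidence)   # 8 0.54
```
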
scs.ryerson.ca/~aharley/vis/conv/flat.html
Quickly Training Deep Learning Models with Transfer Learning
How to Build a Deep Learning Image Recognition System?
Step 1: Download examples to train the model with
[Example images: African Bush Elephant, Indian Elephant, Sri Lankan Elephant, Borneo Pygmy Elephant]
Step 2: Augment dataset to enrich training data
→ Adds 5-10x more training examples
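
Augmentation can be sketched with a few geometric transforms on a tiny pixel grid. The helper names are illustrative, and this sketch produces only four variants per image; real pipelines typically add crops, shifts, and color changes to reach larger multiples.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Reverse the order of the rows."""
    return image[::-1]

def rotate_90(image):
    """Rotate clockwise by 90 degrees."""
    return [list(row) for row in zip(*image[::-1])]

def augment(image):
    """Return the original image plus simple transformed copies."""
    return [image, flip_horizontal(image), flip_vertical(image), rotate_90(image)]

image = [[1, 2],
         [3, 4]]     # tiny stand-in for a pixel grid
augmented = augment(image)
for variant in augmented:
    print(variant)
```

Each variant keeps the same label as the original image, so one labeled example becomes several.
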
Step 3: Check out DAWNBench (dawn.cs.stanford.edu/benchmark), then select and download a pre-trained model.
Source: https://www.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html
Sample Architecture of a CNN
Sample Architecture of a CNN
[Diagram labels: Pretrained Parameters, Random Parameters]
Step 4: Apply transfer learning to a downloaded model
[Diagram: INPUT (image) → Pretrained Network (millions of parameters) → Random Parameters → OUTPUT (label, e.g. Borneo Pygmy Elephant)]
• Step A: Train Parameters
• Step B: Adjust Parameters
Step 5: Save the trained model
[Diagram: INPUT → Pretrained Network (millions of parameters) → Random Parameters → OUTPUT; after the Train Parameters and Adjust Parameters steps, the whole network is saved as the Trained Model (Neural Net)]
Step 6: Host a trained model on a server and make it accessible via a web app
[Diagram: user uploads an image; web app returns “Borneo Pygmy Elephant”]
DATA SCIENCE JOURNEY
Start by Asking Relevant Questions
• Specific (can you think of a clear answer?)
• Measurable (quantifiable? data driven?)
• Actionable (if you had an answer, could you do something with it?)
• Realistic (can you get an answer with data you have?)
• Timely (answer in reasonable timeframe?)
Data Preparation
1. Data analysis (audit for anomalies/errors)
2. Creating an intuitive workflow (formulate seq. of prep operations)
3. Validation (correctness evaluated against sample representative dataset)
4. Transformation (actual prep process takes place)
5. Backflow of cleaned data (replace original dirty data)
Approx. 80% of a Data Analyst’s job is Data Preparation!
Example of multiple values used for one U.S. state: California, CA, Cal., Cal
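
A transformation step for the state example could look like the sketch below, which maps the known variants to one canonical code. The `STATE_ALIASES` table and `normalize_state` function are hypothetical names; a real cleanup would build the table from the audit in step 1.

```python
# Hypothetical lookup of known variants; extend it as the audit finds more.
STATE_ALIASES = {
    "california": "CA",
    "ca": "CA",
    "cal.": "CA",
    "cal": "CA",
}

def normalize_state(raw):
    """Map a raw state string to a canonical code, or None if unrecognized."""
    return STATE_ALIASES.get(raw.strip().lower())

dirty = ["California", "CA", "Cal.", "Cal", " ca "]
cleaned = [normalize_state(value) for value in dirty]
print(cleaned)
```

Returning None for unknown values, rather than guessing, keeps unrecognized entries visible for the next audit pass.
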
Visualizing Data
https://www.autodeskresearch.com/publications/samestats
Feature Selection
• Also known as variable or attribute selection
• Why important?
• simplification of models → easier to interpret by researchers/users
• shorter training times
• enhanced generalization by reducing overfitting
• Dimensionality reduction vs feature selection
• Dimensionality reduction: create new combinations of attributes
• Feature selection: include/exclude attributes in data without changing them
Q: Which features should you use to create a predictive model?
Hyperparameters
• Define higher-level model properties, e.g. complexity or learning rate
• Cannot be learned during training → need to be predefined
• Can be decided by
• setting different values
• training different models
• choosing the values that test better
• Hyperparameter examples
• Number of leaves or depth of a tree
• Number of latent factors in a matrix factorization
• Learning rate (in many models)
• Number of hidden layers in a deep neural network
• Number of clusters in a k-means clustering
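
The "set different values, train different models, choose the values that test better" loop is a grid search. A minimal sketch, using the number of neighbors k in a k-nearest-neighbor regressor as the hyperparameter; the data and candidate values are made up for illustration.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def validation_error(train, validation, k):
    """Mean squared error on held-out data for a given k."""
    return sum((knn_predict(train, x, k) - y) ** 2
               for x, y in validation) / len(validation)

# Made-up data: y is roughly 2x with some noise
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1), (6, 12.0)]
validation = [(1.5, 3.0), (3.5, 7.0), (5.5, 11.0)]

candidates = [1, 2, 3, 5]                      # hyperparameter values to try
scores = {k: validation_error(train, validation, k) for k in candidates}
best_k = min(scores, key=scores.get)           # value that tests best
print(best_k, scores)
```

The key point is that the candidates are scored on held-out data, not on the data used to fit each model.
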
• Residuals
  • the residual of an observed value is the difference between the observed value and the estimated value
• R² (R Squared) – Coefficient of Determination
  • indicates goodness of fit
  • R² of 1 means the regression line perfectly fits the data
• RMSE (Root Mean Square Error)
  • measure of the differences between the values predicted by a model and the values actually observed
  • good measure of accuracy, but only for comparing the errors of different models on the same data, since it is scale-dependent
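
All three quantities can be computed directly from their definitions. The observed and predicted values below are made up for illustration.

```python
import math

def regression_metrics(observed, predicted):
    """Return (residuals, R squared, RMSE) for paired observations and predictions."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r ** 2 for r in residuals)                 # unexplained variation
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)     # total variation
    r_squared = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / len(observed))
    return residuals, r_squared, rmse

observed = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 6.0, 7.0]     # made-up model output
residuals, r_squared, rmse = regression_metrics(observed, predicted)
print(residuals, r_squared, rmse)
```

RMSE is expressed in the same units as the target, which is why it cannot be compared across datasets with different scales.
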
With that in mind…
• No simple formula for “good questions”, only general guidelines
• The right data is better than lots of data
• Understanding relationships matters
Enterprise Data Science @ Scale
• Enterprise-Grade: Leverage enterprise-grade security, governance and operations
• Tools: Enhance productivity by enabling data scientists to use their favorite tools, technologies and libraries
• Deployment: Compress the time to insight by deploying models into production faster
• Data: Build more robust models by using all the data in the data lake
Thanks!
Robert Hryniewicz
@robhryniewicz
Machine Learning Books
Easier intro books (less math)
• The Hundred-Page Machine Learning Book by Andriy Burkov
• Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron
• Deep Learning with Python by François Chollet
• Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies by John D. Kelleher, Brian Mac Namee, Aoife D’Arcy
More thorough books (more math)
• Deep Learning by Ian Goodfellow, Yoshua Bengio, Aaron Courville
• Information Theory, Inference and Learning Algorithms, 1st Edition, by David J. C. MacKay
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
DataWorks Summit
 

Data Science Crash Course

  • 1. Data Science Crash Course. Robert Hryniewicz (@robhryniewicz), DWS Washington, D.C. 2019. © Hortonworks Inc. 2011–2018. All rights reserved.
  • 2. INTRO TO DATA SCIENCE
  • 3. What is Data Science? The scientific exploration of data to extract meaning or insight, using statistics and mathematical models with the end goal of making smarter, quicker decisions.
  • 5. What is Machine Learning? Favorite cocktail party definitions (the slide ranks them 1st, 2nd, 3rd):
      • Using statistical analysis of data to build predictive systems without needing to design or maintain explicit rules. (labeled 1st)
      • Machine Learning is programming with data.
      • Machine Learning is a way to use data to draw meaningful conclusions including identifying patterns, anomalies and trends that may not be obvious to humans.
      • Machine learning is math, at scale.
  • 6. Examples where Machine Learning can be applied:
      • Healthcare: predict diagnosis; prioritize screenings; reduce re-admittance rates
      • Financial services: fraud detection/prevention; predict underwriting risk; new account risk screens
      • Public Sector: analyze public sentiment; optimize resource allocation; law enforcement & security
      • Retail: product recommendation; inventory management; price optimization
      • Telco/mobile: predict customer churn; predict equipment failure; customer behavior analysis
      • Oil & Gas: predictive maintenance; seismic data management; predict well production levels
      • Insurance: risk assessment; customer insights/experience; finance real time analysis
      • Life sciences: genome sequencing; drug development; sensor data
  • 7. What is an ML Model? A mathematical formula with a number of parameters that need to be learned from the data. Fitting a model to the data is a process known as model training.
      • E.g. linear regression
      • Goal: fit a line y = mx + c to data points
      • After model training: y = 2x + 5
      • Input: 1, 0, 7, 2, … → Model: y = 2x + 5 → Output: 7, 5, 19, 9, …
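The linear regression example on this slide can be reproduced with ordinary least squares. The following is a minimal Python sketch, assuming the four input/output pairs shown on the slide are the whole training set; the function name `fit_line` is hypothetical and does not come from the talk.

```python
def fit_line(xs, ys):
    """Fit y = m*x + c by ordinary least squares (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point of means.
    c = mean_y - m * mean_x
    return m, c


# Inputs and outputs taken from the slide.
inputs = [1, 0, 7, 2]
outputs = [7, 5, 19, 9]

m, c = fit_line(inputs, outputs)
print(m, c)  # → 2.0 5.0
```

On these four points the fit recovers the slide's trained model, y = 2x + 5.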
  • 8. Types of Learning: Supervised Learning, Unsupervised Learning, Reinforcement Learning
  • 9. Supervised Learning: Use labeled (training) datasets to learn the relationship of given inputs to outputs. Once the model is trained, use it to predict outputs on new input data. (Diagram: many inputs mapped to Output 1, Output 2, …, Output n.)
  • 10. Unsupervised Learning: Explore, classify & find patterns in the input data without being explicit about the output.
  • 11. Reinforcement Learning: Algorithm learns to maximize rewards it receives for its actions (e.g. maximizes points for investment returns). Use when you don’t have lots of training data, you can’t clearly define ideal end-state, or the only way to learn is by interacting with the environment. (Diagram: Algorithm and Environment exchanging Action, Reward, State.)
  • 12. ALGORITHMS
  • 13. Algorithm families:
      • Regression: Linear Regression
      • Classification: Logistic Regression, Support Vector Machines (SVM), Random Forest (RF), Naïve Bayes
      • Recommender Systems / Collaborative Filtering: Alternating Least Squares (ALS)
      • Clustering: K-Means, LDA
      • Dimensionality Reduction: Principal Component Analysis (PCA)
      • Deep Learning: Fully Connected Neural Nets → Tabular or Recommender Systems; Convolutional Neural Nets (CNNs) → Images; Recurrent Neural Nets (RNNs) → Natural Language Processing (NLP) / Text
  • 14. REGRESSION: Predicting a continuous-valued output. Example: predicting house prices based on number of bedrooms and square footage. Algorithms: Linear Regression
  • 15. CLASSIFICATION: Identifying which category an object belongs to. Examples: spam detection, diabetes diagnosis, text labeling. Algorithms:
      • Logistic Regression: fast training (linear model); classes expressed in probabilities; less overfitting [+]; less fitting (accuracy) [-]
      • Support Vector Machines (SVM): “best” supervised learning algorithm, effective; state of the art prior to Deep Learning; more robust to outliers than Logistic Regression; handles non-linearity; check out blog.statsbot.co/support-vector-machines-tutorial-c1618e635e93
      • Random Forest (ensemble of Decision Trees): fast training; handles categorical features; does not require feature scaling; captures non-linearity and feature interaction, i.e. performs feature selection / PCA implicitly
      • Naïve Bayes: good for text classification; assumes independent variables / words
  • 16. CLASSIFICATION: Visual Intro to Decision Trees: http://www.r2d3.us/visual-intro-to-machine-learning-part-1
  • 17. CLUSTERING: Automatic grouping of similar objects into sets (clusters). Example: market segmentation, i.e. automatically grouping customers into different market segments. Algorithms: K-means, LDA
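The K-means idea named on this slide can be sketched in a few lines of Python. This is a simplified one-dimensional version under stated assumptions: the customer spend figures are invented for illustration, the function name `kmeans_1d` is hypothetical, and the centroids are initialized from the first k points rather than with the random or k-means++ initialization that production libraries typically use.

```python
def kmeans_1d(points, k, iterations=10):
    """Plain k-means on 1-D data (Lloyd's algorithm)."""
    # Assumption: deterministic initialization from the first k points.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: abs(p - centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = sum(members) / len(members)
    return centroids, labels


# Hypothetical monthly spend per customer (illustrative values only).
spend = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids, labels = kmeans_1d(spend, k=2)
print(centroids)  # → [1.5, 10.5]
print(labels)     # → [0, 0, 0, 1, 1, 1]
```

The two resulting groups correspond to a low-spend and a high-spend segment; the number of clusters k has to be chosen up front.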
  • 18. COLLABORATIVE FILTERING: Fill in the missing entries of a user-item association matrix. Applications: product/movie recommendation. Algorithms: Alternating Least Squares (ALS)
  • 19. DIMENSIONALITY REDUCTION: Reducing the number of redundant features/variables. Applications: removing noise in images by selecting only “important” features; removing redundant features, e.g. MPH & KPH are linearly dependent. Algorithms: Principal Component Analysis (PCA)
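The MPH/KPH example on this slide can be checked numerically. The sketch below is not PCA itself; it only computes the Pearson correlation between two columns to show why one of them carries no extra information. The speed values and the function name `pearson` are assumptions made for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical speed readings; KPH is derived from MPH by a constant factor.
mph = [30.0, 45.0, 60.0, 75.0]
kph = [v * 1.609344 for v in mph]

r = pearson(mph, kph)
# A correlation of (numerically) 1 means one column is redundant.
is_redundant = abs(r) > 0.999
print(is_redundant)  # → True
```

A dimensionality reduction method such as PCA would collapse these two columns into a single component, since all of the variance lies along one direction.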
  • 20. Deep Learning
  • 22. Simple/shallow vs Deep Neural Net
  • 23. Popular Neural Net Architectures: Convolutional Neural Nets (CNNs) → Images; Recurrent Neural Nets (RNNs) and Long Short-Term Memory (LSTM) → Text / Language (NLP) & Time Series
  • 24. Number: Probability. 0: 0.03; 1: 0.01; 2: 0.04; 3: 0.08; 4: 0.05; 5: 0.08; 6: 0.07; 7: 0.02; 8: 0.54; 9: 0.08
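The table on this slide reads as a classifier's per-digit output, where the predicted digit is the one with the highest probability. The Python sketch below picks that digit from the slide's values. It also includes a softmax helper as an assumption about how such probabilities are commonly produced from raw network scores; the slide itself does not say which function was used.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Probabilities for digits 0-9, copied from the slide.
probs = [0.03, 0.01, 0.04, 0.08, 0.05, 0.08, 0.07, 0.02, 0.54, 0.08]

# The prediction is the index (digit) with the largest probability.
predicted_digit = max(range(len(probs)), key=lambda d: probs[d])
print(predicted_digit)  # → 8

# Softmax on two equal (hypothetical) scores gives a 50/50 split.
print(softmax([0.0, 0.0]))  # → [0.5, 0.5]
```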
  • 25. scs.ryerson.ca/~aharley/vis/conv/flat.html
  • 26. Quickly Training Deep Learning Models with Transfer Learning
  • 27. How to Build a Deep Learning Image Recognition System? Step 1: Download examples to train the model with (African Bush Elephant, Indian Elephant, Sri Lankan Elephant, Borneo Pygmy Elephant)
  • 28. How to Build a Deep Learning Image Recognition System? Step 2: Augment dataset to enrich training data → adds 5-10x more training examples
  • 29. How to Build a Deep Learning Image Recognition System? Step 3: Check out DAWNBench (dawn.cs.stanford.edu/benchmark), then select and download a pre-trained model.
  • 30. Sample Architecture of a CNN. Source: https://www.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html
  • 31. Sample Architecture of a CNN, annotated with Pretrained Parameters and Random Parameters. Source: https://www.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html
  • 32. How to Build a Deep Learning Image Recognition System? Step 4: Apply transfer learning to a downloaded model. (Diagram: INPUT image → Pretrained Network (millions of parameters) → Random Parameters → OUTPUT label, e.g. Borneo Pygmy Elephant. Step A: Train Parameters. Step B: Adjust Parameters.)
  • 33. How to Build a Deep Learning Image Recognition System? Step 5: Save the trained model. (Diagram: INPUT → Pretrained Network (millions of parameters) → Random Parameters → OUTPUT; Train Parameters, Adjust Parameters; the result is the Trained Model (Neural Net).)
  • 34. How to Build a Deep Learning Image Recognition System? Step 6: Host a trained model on a server and make it accessible via a web app. (Diagram: user uploads an image; web app returns “Borneo Pygmy Elephant”.)
  • 35. DATA SCIENCE JOURNEY
  • 37. Start by Asking Relevant Questions:
      • Specific (can you think of a clear answer?)
      • Measurable (quantifiable? data driven?)
      • Actionable (if you had an answer, could you do something with it?)
      • Realistic (can you get an answer with data you have?)
      • Timely (answer in reasonable timeframe?)
  • 38. Data Preparation. Approx. 80% of Data Analyst’s job is Data Preparation!
      1. Data analysis (audit for anomalies/errors)
      2. Creating an intuitive workflow (formulate seq. of prep operations)
      3. Validation (correctness evaluated against sample representative dataset)
      4. Transformation (actual prep process takes place)
      5. Backflow of cleaned data (replace original dirty data)
      Example of multiple values used for U.S. States → California, CA, Cal., Cal
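The U.S. states example on this slide is a typical transformation step: several spellings of the same value are mapped onto one canonical form. A minimal Python sketch, assuming a hand-written alias table; `STATE_ALIASES` and `normalize_state` are hypothetical names, and a real pipeline would cover every state and log the values it cannot map.

```python
# Hypothetical alias table; only California is covered, as on the slide.
STATE_ALIASES = {
    "california": "CA",
    "ca": "CA",
    "cal": "CA",
}


def normalize_state(raw):
    """Map a free-text state value onto its two-letter code."""
    cleaned = raw.strip()
    # Ignore case and a trailing period ("Cal." and "cal" are the same).
    key = cleaned.rstrip(".").lower()
    # Unknown values are returned unchanged so they can be audited later.
    return STATE_ALIASES.get(key, cleaned)


dirty = ["California", "CA", "Cal.", "Cal"]
print([normalize_state(v) for v in dirty])  # → ['CA', 'CA', 'CA', 'CA']
```

Returning unknown values unchanged, rather than guessing, keeps the validation step meaningful: anything left unmapped shows up in the audit.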
  • 39. Visualizing Data: https://www.autodeskresearch.com/publications/samestats
  • 40. Feature Selection. Q: Which features should you use to create a predictive model?
      • Also known as variable or attribute selection
      • Why important? Simplification of models → easier to interpret by researchers/users; shorter training times; enhanced generalization by reducing overfitting
      • Dimensionality reduction vs feature selection. Dimensionality reduction: create new combinations of attributes. Feature selection: include/exclude attributes in data without changing them
  • 41. Hyperparameters
      • Define higher-level model properties, e.g. complexity or learning rate
      • Cannot be learned during training → need to be predefined
      • Can be decided by: setting different values; training different models; choosing the values that test better
      • Hyperparameter examples: number of leaves or depth of a tree; number of latent factors in a matrix factorization; learning rate (in many models); number of hidden layers in a deep neural network; number of clusters in a k-means clustering
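The procedure on this slide (set different values, train different models, choose the values that test better) can be sketched as a small search over one hyperparameter. The Python example below tunes the learning rate of a gradient-descent linear regression. Everything in it is an assumption made for illustration: the candidate learning rates, the fixed 200 training steps, the held-out validation points, and the function names `train` and `mse`.

```python
def train(xs, ys, learning_rate, steps=200):
    """Fit y = m*x + c by gradient descent on mean squared error."""
    m, c = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to m and c.
        grad_m = sum(2 * (m * x + c - y) * x for x, y in zip(xs, ys)) / n
        grad_c = sum(2 * (m * x + c - y) for x, y in zip(xs, ys)) / n
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c


def mse(m, c, xs, ys):
    """Mean squared error of the line y = m*x + c on the given points."""
    return sum((m * x + c - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Training data follows y = 2x + 5; validation points are held out.
train_x, train_y = [1, 0, 7, 2], [7, 5, 19, 9]
val_x, val_y = [3, 4, 5], [11, 13, 15]

# One model per candidate value, each scored on the validation set.
candidates = [0.001, 0.01, 0.05]
scores = {
    lr: mse(*train(train_x, train_y, lr), val_x, val_y)
    for lr in candidates
}
best_lr = min(scores, key=scores.get)
print(best_lr)  # → 0.05
```

Scoring on data the model was not trained on is what makes the comparison fair; with a fixed step budget, the smaller learning rates here simply have not converged yet.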
  • 42. Residuals: the residual of an observed value is the difference between the observed value and the estimated value.
      • R² (R Squared), Coefficient of Determination: indicates a goodness of fit; R² of 1 means regression line perfectly fits data
      • RMSE (Root Mean Square Error): measure of differences between values predicted by a model and values actually observed; good measure of accuracy, but only to compare forecasting errors of different models (individual variables are scale-dependent)
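The three quantities on this slide follow directly from their definitions. A minimal Python sketch, assuming the observed values from the earlier linear regression slide and a set of deliberately imperfect predictions invented for illustration; the function names are hypothetical.

```python
import math


def residuals(observed, predicted):
    """Observed value minus estimated value, per data point."""
    return [o - p for o, p in zip(observed, predicted)]


def rmse(observed, predicted):
    """Root mean square error: square root of the mean squared residual."""
    res = residuals(observed, predicted)
    return math.sqrt(sum(r * r for r in res) / len(res))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum(r * r for r in residuals(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


observed = [7, 5, 19, 9]
predicted = [6, 6, 18, 10]  # hypothetical, slightly-off predictions

print(residuals(observed, predicted))  # → [1, -1, 1, -1]
print(rmse(observed, predicted))       # → 1.0
```

RMSE is expressed in the same units as the target variable, which is why it is only comparable between models that predict the same quantity; R² is unitless.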
  • 43. With that in mind…
      • No simple formula for “good questions”, only general guidelines
      • The right data is better than lots of data
      • Understanding relationships matters
  • 44. Enterprise Data Science @ Scale
      • Enterprise-Grade: leverage enterprise-grade security, governance and operations
      • Tools: enhance productivity by enabling data scientists to use their favorite tools, technologies and libraries
      • Deployment: compress the time to insight by deploying models into production faster
      • Data: build more robust models by using all the data in the data lake
  • 45. Thanks! Robert Hryniewicz @robhryniewicz
  • 46. Machine Learning Books
      Easier intro books (less math):
      • The Hundred-Page Machine Learning Book by Andriy Burkov
      • Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron
      • Deep Learning with Python by Francois Chollet
      • Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies by John D. Kelleher, Brian Mac Namee, Aoife D’Arcy
      More thorough books (more math):
      • Deep Learning by Ian Goodfellow, Yoshua Bengio, Aaron Courville
      • Information Theory, Inference and Learning Algorithms 1st Edition by David J. C. MacKay

Editor's Notes

  1. Specific: Can you think of what an answer to your question would look like? The more clearly you can see it, the more specific the question is. Measurable: Is the answer something you can quantify? It’s hard to make decisions in a truly data-driven way based on things you can’t quantify. Actionable: If you had the answer to your question, could you do something useful with it? If not, you don’t necessarily have a bad question, but you may not want to expend a lot of resources answering it. Realistic: Can you get an answer to your question with the data you have? If not, can you get the data that would get you an answer? Timely: Can you get an answer in a reasonable time frame, or at least before you need it? This is usually not a big issue, but if you operate according to a tight schedule, you may need to think about it.