SlideShare a Scribd company logo
1 of 40
Download to read offline
10-605/805 – ML for
Large Datasets
Lecture 1: Course
Overview
Henry Chai
8/30/22
Machine
Learning
– Premise:
– There exists some pattern/behavior of interest
– The pattern/behavior is difficult to describe
– There is data
– Use data to “learn” the pattern
– Definition:
– A computer program learns if its performance, P, at
some task, T, improves with experience, E.
Henry Chai - 8/30/22 2
Machine
Learning:
Pipeline
Henry Chai - 8/30/22 3
Dataset
ML
Method
Insight
Machine
Learning:
Example
Henry Chai - 8/30/22 4
Dataset
ML
Method
Insight
Regression
Source: https://muppet.fandom.com/wiki/Oscar's_trash_can?file=IMG_0815.jpg
𝑥
𝑦
=
$1,000,000
Machine
Learning:
Example
Henry Chai - 8/30/22 5
Dataset
ML
Method
Insight
Classification
Figure courtesy of Matt Gormley
Dear sir or madam,
your balance
currently is 0$.
please send money.
immediately.
Machine
Learning:
Example
Henry Chai - 8/30/22 6
Dataset
ML
Method
Insight
Clustering
Figure courtesy of Pat Virtue
CustomerID Purchases
1
2
⋮
Machine
Learning:
Example
Henry Chai - 8/30/22 7
Dataset
ML
Method
Insight
Dimensionality Reduction
Figure courtesy of Matt Gormley
Machine
Learning:
Terminology
– Datasets will (usually) consist of
– Observations – individual entries used in learning or
evaluating a learned model
– Features – attributes used to represent an
observation during learning
– Labels – values or categories associated with an
observation
Henry Chai - 8/30/22 8
– Running Example: Sentiment analysis of course evaluations
– Raw training dataset
Machine
Learning:
Pipeline
Henry Chai - 8/30/22 13
Machine
Learning:
Pipeline
– Running Example: Sentiment analysis of course evaluations
– Raw training dataset
– Data preprocessing
Henry Chai - 8/30/22 14
Features Labels
🙂
🙁
🙂
🙁
🙃
Machine
Learning:
Pipeline
– Running Example: Sentiment analysis of course evaluations
– Raw training dataset
– Data preprocessing
Henry Chai - 8/30/22 15
Features Labels
easy course taught well +1
homework takes way too long −1
great +1
too much work −1
this course had lot problems but none them were henry fault 0
– Running Example: Sentiment analysis of course evaluations
– Training dataset
– Feature engineering - transform observations into a
form appropriate for the machine learning method
– Example: bag of words model
Machine
Learning:
Pipeline
Henry Chai - 8/30/22 16
easy course taught well
homework takes way too long
great
too much work
this course had lot problems but
none them were henry fault
Vocabulary
but
course
easy
fault
great
henry
homework
long
⋮
– Running Example: Sentiment analysis of course evaluations
– Training dataset
– Feature engineering - transform observations into a
form appropriate for the machine learning method
– Example: bag of words model
Machine
Learning:
Pipeline
Henry Chai - 8/30/22 17
Vocabulary
but
course
easy
fault
great
henry
homework
long
⋮
this course had lot problems but
none them were henry fault
1
1
0
1
0
1
0
0
⋮
Machine
Learning:
Pipeline
Henry Chai - 8/30/22 18
– Running Example: Sentiment analysis of course evaluations
– Model training
– Just throw a narwhal neural network at it?
Machine
Learning:
Pipeline
Henry Chai - 8/30/22 19
– Running Example: Sentiment analysis of course evaluations
– Model training
– Hyperparameter tuning – most machine
learning/optimization methods will have values/design
choices that need to be specified/made in order to run
– Example: neural networks trained using mini-batch
gradient descent
– architecture
– batch size
– learning rate/step size
– termination criteria
– etc...
Machine
Learning:
Pipeline
Henry Chai - 8/30/22 20
– Suppose we want to compare multiple
hyperparameter settings 𝜃&, … , 𝜃'
– For 𝑘 = 1, 2, … , 𝐾
– Train a model on 𝐷()*+, using 𝜃-
– Evaluate each model on 𝐷.*/ and find
the best hyperparameter setting, 𝜃-∗
𝐷()*+,
𝐷.*/
– Running Example: Sentiment analysis of course evaluations
– Model training
Machine
Learning:
Pipeline
Henry Chai - 8/30/22 21
– Running Example: Sentiment analysis of course evaluations
– Model evaluation
– How do you know if you’ve learned a good model?
– If a model is trained by minimizing the training error,
then the training error at termination is (typically)
overly optimistic about the model’s performance
– The model has been overfit to training data
– Likewise, the validation error is also (typically)
optimistic about the model’s performance
– Usually less so than the training error
– Idea: use a held-out test dataset to assess our model’s
ability to generalize to unseen observations
Machine
Learning:
Pipeline
Henry Chai - 8/30/22 22
– Suppose we want to compare multiple
hyperparameter settings 𝜃&, … , 𝜃'
– For 𝑘 = 1, 2, … , 𝐾
– Train a model on 𝐷()*+, using 𝜃-
– Evaluate each model on 𝐷.*/ and find
the best hyperparameter setting, 𝜃-∗
– Compute the error of a model trained
with 𝜃-∗ on 𝐷(01(
𝐷()*+,
𝐷.*/
𝐷(01(
– Running Example: Sentiment analysis of course evaluations
– Model evaluation
Machine
Learning:
Pipeline
Revisited
Henry Chai - 8/30/22 23
Dataset
ML
Method
Insight
?
Raw
Data-
set
Preprocessing
Model
Postprocessing
Feature
Engineering
Hyperparameter
Tuning
Model
Evaluation
Machine
Learning:
Challenges
– Contemporary issues in modern machine learning:
– Privacy
– Fairness
– Interpretability
– Big data
Henry Chai - 8/30/22 24
Machine
Learning:
Challenges
– Contemporary issues in modern machine learning:
– Privacy
– Fairness
– Interpretability
– Big data
Henry Chai - 8/30/22 25
Machine
Learning
with Large
Datasets
– Premise:
– There exists some pattern/behavior of interest
– The pattern/behavior is difficult to describe
– There is data (sometimes a lot of it!)
– More data usually helps
– Use data efficiently/intelligently to “learn” the pattern
– Definition:
– A computer program learns if its performance, P, at
some task, T, improves with experience, E.
Henry Chai - 8/30/22 26
– Datasets can be big in two ways
Henry Chai - 8/30/22 27
Large
Datasets
Dataset
Large 𝑘 (# of features)
Large
𝑛
(#
of
observations)
Henry Chai - 8/30/22 28
Large
Datasets:
Example
– Image processing
– Large 𝑛: potentially massive number of observations
(e.g., pictures on the internet)
– Use-cases: object recognition, annotation generation
– Medical data
– Large 𝑘: potentially massive feature set (e.g., genome
sequence, electronic medical records, etc…)
– Use-cases: personalized medicine, diagnosis prediction
– Business analytics
– Large 𝑛 (e.g., all customers & all products) and 𝑘 (e.g.,
customer data, product specifications, transaction
records, etc…)
– Use-cases: product recommendations, customer
segmentation
Henry Chai - 8/30/22 29
Large
Datasets:
Example
Tons of
Features
Henry Chai - 8/30/22 30
– High-dimensional datasets present numerous issues:
– Curse of dimensionality
– Overfitting
– Computational issues
– Strategies:
– Learn low-dimensional representations
– Perform feature selection to eliminate “low-yield”
features
Tons of
Observations
– Typically, we consider exponential time complexity
(e.g., 𝑂 2, ) bad and polynomial complexity (e.g.,
𝑂 𝑛2 ) good
– However, if 𝑛 is massive, then even 𝑂 𝑛 can be
problematic!
– Strategies:
– Speed up processing e.g., stochastic gradient
descent vs. gradient descent
– Make approximations/subsample the dataset
– Exploit parallelism
Henry Chai - 8/30/22 31
Tons of
Observations
– Typically, we consider exponential time complexity
(e.g., 𝑂 2, ) bad and polynomial complexity (e.g.,
𝑂 𝑛2 ) good
– However, if 𝑛 is massive, then even 𝑂 𝑛 can be
problematic!
– Strategies:
– Speed up processing e.g., stochastic gradient
descent vs. gradient descent
– Make approximations/subsample the dataset
– Exploit parallelism
Henry Chai - 8/30/22 32
Parallel
Computing
– Multi-core processing – scale up one big machine
– Data can fit on one machine
– Usually requires high-end, specialized hardware
– Simpler algorithms that don’t necessarily scale well
– Distributed processing – scale out many machines
– Data stored across multiple machines
– Scales to massive problems on standard hardware
– Added complexity of network communication
Henry Chai - 8/30/22 33
Parallel
Computing
– Multi-core processing – scale up one big machine
– Data can fit on one machine
– Usually requires high-end, specialized hardware
– Simpler algorithms that don’t necessarily scale well
– Distributed processing – scale out many machines
– Data stored across multiple machines
– Scales to massive problems on standard hardware
– Added complexity of network communication
Henry Chai - 8/30/22 34
– Multi-core processing – scale up one big machine
– Data can fit on one machine
– Usually requires high-end, specialized hardware
– Simpler algorithms that don’t necessarily scale well
– Distributed processing – scale out many machines
– Data stored across multiple machines
– Scales to massive problems on standard hardware
– Added complexity of network communication
Parallel
Computing
Henry Chai - 8/30/22 35
Apache Spark
– Open-source engine for parallel computing/large-scale data
processing
– Lots of convenient features for machine learning specifically
– Fast iterative procedures
– Efficient communication primitives
– Interactive IPython-style notebooks (Databricks)
Henry Chai - 8/30/22 36
Course
Overview
– Data preprocessing
– Cleaning
– Summarizing/visualizing
– Dimensionality reduction
– Model training
– Distributed machine learning
– Large-scale optimization
– Scalable deep learning
– Efficient data structures
– Hyperparameter tuning
– Inference
– Hardware for ML
– Low-latency inference
(Compression, Pruning, Distillation)
Henry Chai - 8/30/22 37
– Infrastructure/Frameworks
– Apache Spark
– TensorFlow
– AWS/Google Cloud/Azure
– Advanced Topics
– Federated Learning
– Neural architecture search
– Machine learning in practice
“Front”
Matter
– HW1 released 8/30 (today!), due 9/13 at 11:59 PM
– All HWs consist of two parts: written and programming
– For HW1 only, the programming part is optional (but
strongly encouraged)
– The written part is nominally about PCA but can be
solved using pre-requisite knowledge (linear algebra)
– Recitations on Friday, 11:50 – 1:10 (different from lecture)
in GHC 4401 (same as lecture)
– Recitation 1 on 9/2: Introduction to PySpark/Databricks
– Recitation 2 on 9/9: Review of linear algebra
Henry Chai - 8/30/22 38
– Course website: https://10605.github.io/
Course
Logistics
Henry Chai - 8/30/22 39
Course
Logistics
Henry Chai - 8/30/22 40
– Course website: https://10605.github.io/
– Exam 1: 10/11
– Exam 2: 12/8
– Course website: https://10605.github.io/
Course
Logistics
Henry Chai - 8/30/22 41
Course
Logistics
Henry Chai - 8/30/22 42
– Course website: https://10605.github.io/
– Mini-project:
– Complete in groups of 2-3
– Two pre-specified options:
– Groups with only 10-605 students will pick one to
complete
– Groups with any 10-805 students must complete both
– No late days may be used on project deliverables
– More details about project options and deliverables will
be announced later in the semester
Course
Technologies
– Piazza for Q&A / announcements
– Gradescope for assignment submissions
– Canvas for hosting recordings and gradebook
– Google calendar for lecture, recitation and OH schedule
Henry Chai - 8/30/22 43
Course
Staff
Henry Chai - 8/30/22 44
TAs
Instructors EA

More Related Content

Similar to Lecture_1_-_Course_Overview_(Inked).pdf

Human in the loop: Bayesian Rules Enabling Explainable AI
Human in the loop: Bayesian Rules Enabling Explainable AIHuman in the loop: Bayesian Rules Enabling Explainable AI
Human in the loop: Bayesian Rules Enabling Explainable AIPramit Choudhary
 
DMDW Lesson 04 - Data Mining Theory
DMDW Lesson 04 - Data Mining TheoryDMDW Lesson 04 - Data Mining Theory
DMDW Lesson 04 - Data Mining TheoryJohannes Hoppe
 
Big Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloBig Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloOCTO Technology
 
DSCI 552 machine learning for data science
DSCI 552 machine learning for data scienceDSCI 552 machine learning for data science
DSCI 552 machine learning for data sciencepavithrak2205
 
Machine learning with Big Data power point presentation
Machine learning with Big Data power point presentationMachine learning with Big Data power point presentation
Machine learning with Big Data power point presentationDavid Raj Kanthi
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401butest
 
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713Mathieu DESPRIEE
 
H2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupH2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupSri Ambati
 
Simulacion luis garciaguzman-21012011
Simulacion luis garciaguzman-21012011Simulacion luis garciaguzman-21012011
Simulacion luis garciaguzman-21012011lideresacademicos
 
Machine Learning part 2 - Introduction to Data Science
Machine Learning part 2 -  Introduction to Data Science Machine Learning part 2 -  Introduction to Data Science
Machine Learning part 2 - Introduction to Data Science Frank Kienle
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning CCG
 
Machine Learning vs Decision Optimization comparison
Machine Learning vs Decision Optimization comparisonMachine Learning vs Decision Optimization comparison
Machine Learning vs Decision Optimization comparisonAlain Chabrier
 
The importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsThe importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsFrancesca Lazzeri, PhD
 
lecture-intro-pet-nams-ai-in-toxicology.pptx
lecture-intro-pet-nams-ai-in-toxicology.pptxlecture-intro-pet-nams-ai-in-toxicology.pptx
lecture-intro-pet-nams-ai-in-toxicology.pptxMarc Teunis
 
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...Databricks
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsSri Ambati
 
Computational optimization, modelling and simulation: Recent advances and ove...
Computational optimization, modelling and simulation: Recent advances and ove...Computational optimization, modelling and simulation: Recent advances and ove...
Computational optimization, modelling and simulation: Recent advances and ove...Xin-She Yang
 

Similar to Lecture_1_-_Course_Overview_(Inked).pdf (20)

Human in the loop: Bayesian Rules Enabling Explainable AI
Human in the loop: Bayesian Rules Enabling Explainable AIHuman in the loop: Bayesian Rules Enabling Explainable AI
Human in the loop: Bayesian Rules Enabling Explainable AI
 
DMDW Lesson 04 - Data Mining Theory
DMDW Lesson 04 - Data Mining TheoryDMDW Lesson 04 - Data Mining Theory
DMDW Lesson 04 - Data Mining Theory
 
Big Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloBig Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao Paulo
 
DSCI 552 machine learning for data science
DSCI 552 machine learning for data scienceDSCI 552 machine learning for data science
DSCI 552 machine learning for data science
 
Mahout
MahoutMahout
Mahout
 
Machine learning with Big Data power point presentation
Machine learning with Big Data power point presentationMachine learning with Big Data power point presentation
Machine learning with Big Data power point presentation
 
Slides barcelona risk data
Slides barcelona risk dataSlides barcelona risk data
Slides barcelona risk data
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
 
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
 
H2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupH2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User Group
 
Simulacion luis garciaguzman-21012011
Simulacion luis garciaguzman-21012011Simulacion luis garciaguzman-21012011
Simulacion luis garciaguzman-21012011
 
Machine Learning part 2 - Introduction to Data Science
Machine Learning part 2 -  Introduction to Data Science Machine Learning part 2 -  Introduction to Data Science
Machine Learning part 2 - Introduction to Data Science
 
Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning Afternoons with Azure - Azure Machine Learning
Afternoons with Azure - Azure Machine Learning
 
Machine Learning vs Decision Optimization comparison
Machine Learning vs Decision Optimization comparisonMachine Learning vs Decision Optimization comparison
Machine Learning vs Decision Optimization comparison
 
The importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsThe importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systems
 
lecture-intro-pet-nams-ai-in-toxicology.pptx
lecture-intro-pet-nams-ai-in-toxicology.pptxlecture-intro-pet-nams-ai-in-toxicology.pptx
lecture-intro-pet-nams-ai-in-toxicology.pptx
 
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
 
Computational optimization, modelling and simulation: Recent advances and ove...
Computational optimization, modelling and simulation: Recent advances and ove...Computational optimization, modelling and simulation: Recent advances and ove...
Computational optimization, modelling and simulation: Recent advances and ove...
 
Ai in finance
Ai in financeAi in finance
Ai in finance
 

Recently uploaded

"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 

Recently uploaded (20)

"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

Lecture_1_-_Course_Overview_(Inked).pdf

  • 1. 10-605/805 – ML for Large Datasets Lecture 1: Course Overview Henry Chai 8/30/22
  • 2. Machine Learning – Premise: – There exists some pattern/behavior of interest – The pattern/behavior is difficult to describe – There is data – Use data to “learn” the pattern – Definition: – A computer program learns if its performance, P, at some task, T, improves with experience, E. Henry Chai - 8/30/22 2
  • 3. Machine Learning: Pipeline Henry Chai - 8/30/22 3 Dataset ML Method Insight
  • 4. Machine Learning: Example Henry Chai - 8/30/22 4 Dataset ML Method Insight Regression Source: https://muppet.fandom.com/wiki/Oscar's_trash_can?file=IMG_0815.jpg 𝑥 𝑦 = $1,000,000
  • 5. Machine Learning: Example Henry Chai - 8/30/22 5 Dataset ML Method Insight Classification Figure courtesy of Matt Gormley Dear sir or madam, your balance currently is 0$. please send money. immediately.
  • 6. Machine Learning: Example Henry Chai - 8/30/22 6 Dataset ML Method Insight Clustering Figure courtesy of Pat Virtue CustomerID Purchases 1 2 ⋮
  • 7. Machine Learning: Example Henry Chai - 8/30/22 7 Dataset ML Method Insight Dimensionality Reduction Figure courtesy of Matt Gormley
  • 8. Machine Learning: Terminology – Datasets will (usually) consist of – Observations – individual entries used in learning or evaluating a learned model – Features – attributes used to represent an observation during learning – Labels – values or categories associated with an observation Henry Chai - 8/30/22 8
  • 9. – Running Example: Sentiment analysis of course evaluations – Raw training dataset Machine Learning: Pipeline Henry Chai - 8/30/22 13
  • 10. Machine Learning: Pipeline – Running Example: Sentiment analysis of course evaluations – Raw training dataset – Data preprocessing Henry Chai - 8/30/22 14 Features Labels 🙂 🙁 🙂 🙁 🙃
  • 11. Machine Learning: Pipeline – Running Example: Sentiment analysis of course evaluations – Raw training dataset – Data preprocessing Henry Chai - 8/30/22 15 Features Labels easy course taught well +1 homework takes way too long −1 great +1 too much work −1 this course had lot problems but none them were henry fault 0
  • 12. – Running Example: Sentiment analysis of course evaluations – Training dataset – Feature engineering - transform observations into a form appropriate for the machine learning method – Example: bag of words model Machine Learning: Pipeline Henry Chai - 8/30/22 16 easy course taught well homework takes way too long great too much work this course had lot problems but none them were henry fault Vocabulary but course easy fault great henry homework long ⋮
  • 13. – Running Example: Sentiment analysis of course evaluations – Training dataset – Feature engineering - transform observations into a form appropriate for the machine learning method – Example: bag of words model Machine Learning: Pipeline Henry Chai - 8/30/22 17 Vocabulary but course easy fault great henry homework long ⋮ this course had lot problems but none them were henry fault 1 1 0 1 0 1 0 0 ⋮
  • 14. Machine Learning: Pipeline Henry Chai - 8/30/22 18 – Running Example: Sentiment analysis of course evaluations – Model training – Just throw a narwhal neural network at it?
  • 15. Machine Learning: Pipeline Henry Chai - 8/30/22 19 – Running Example: Sentiment analysis of course evaluations – Model training – Hyperparameter tuning – most machine learning/optimization methods will have values/design choices that need to be specified/made in order to run – Example: neural networks trained using mini-batch gradient descent – architecture – batch size – learning rate/step size – termination criteria – etc...
  • 16. Machine Learning: Pipeline Henry Chai - 8/30/22 20 – Suppose we want to compare multiple hyperparameter settings 𝜃&, … , 𝜃' – For 𝑘 = 1, 2, … , 𝐾 – Train a model on 𝐷()*+, using 𝜃- – Evaluate each model on 𝐷.*/ and find the best hyperparameter setting, 𝜃-∗ 𝐷()*+, 𝐷.*/ – Running Example: Sentiment analysis of course evaluations – Model training
  • 17. Machine Learning: Pipeline Henry Chai - 8/30/22 21 – Running Example: Sentiment analysis of course evaluations – Model evaluation – How do you know if you’ve learned a good model? – If a model is trained by minimizing the training error, then the training error at termination is (typically) overly optimistic about the model’s performance – The model has been overfit to training data – Likewise, the validation error is also (typically) optimistic about the model’s performance – Usually less so than the training error – Idea: use a held-out test dataset to assess our model’s ability to generalize to unseen observations
  • 18. Machine Learning: Pipeline Henry Chai - 8/30/22 22 – Suppose we want to compare multiple hyperparameter settings 𝜃&, … , 𝜃' – For 𝑘 = 1, 2, … , 𝐾 – Train a model on 𝐷()*+, using 𝜃- – Evaluate each model on 𝐷.*/ and find the best hyperparameter setting, 𝜃-∗ – Compute the error of a model trained with 𝜃-∗ on 𝐷(01( 𝐷()*+, 𝐷.*/ 𝐷(01( – Running Example: Sentiment analysis of course evaluations – Model evaluation
  • 19. Machine Learning: Pipeline Revisited Henry Chai - 8/30/22 23 Dataset ML Method Insight ? Raw Data- set Preprocessing Model Postprocessing Feature Engineering Hyperparameter Tuning Model Evaluation
  • 20. Machine Learning: Challenges – Contemporary issues in modern machine learning: – Privacy – Fairness – Interpretability – Big data Henry Chai - 8/30/22 24
  • 21. Machine Learning: Challenges – Contemporary issues in modern machine learning: – Privacy – Fairness – Interpretability – Big data Henry Chai - 8/30/22 25
  • 22. Machine Learning with Large Datasets – Premise: – There exists some pattern/behavior of interest – The pattern/behavior is difficult to describe – There is data (sometimes a lot of it!) – More data usually helps – Use data efficiently/intelligently to “learn” the pattern – Definition: – A computer program learns if its performance, P, at some task, T, improves with experience, E. Henry Chai - 8/30/22 26
  • 23. – Datasets can be big in two ways Henry Chai - 8/30/22 27 Large Datasets Dataset Large 𝑘 (# of features) Large 𝑛 (# of observations)
  • 24. Henry Chai - 8/30/22 28 Large Datasets: Example
  • 25. – Image processing – Large 𝑛: potentially massive number of observations (e.g., pictures on the internet) – Use-cases: object recognition, annotation generation – Medical data – Large 𝑘: potentially massive feature set (e.g., genome sequence, electronic medical records, etc…) – Use-cases: personalized medicine, diagnosis prediction – Business analytics – Large 𝑛 (e.g., all customers & all products) and 𝑘 (e.g., customer data, product specifications, transaction records, etc…) – Use-cases: product recommendations, customer segmentation Henry Chai - 8/30/22 29 Large Datasets: Example
  • 26. Tons of Features Henry Chai - 8/30/22 30 – High-dimensional datasets present numerous issues: – Curse of dimensionality – Overfitting – Computational issues – Strategies: – Learn low-dimensional representations – Perform feature selection to eliminate “low-yield” features
  • 27. Tons of Observations – Typically, we consider exponential time complexity (e.g., 𝑂 2, ) bad and polynomial complexity (e.g., 𝑂 𝑛2 ) good – However, if 𝑛 is massive, then even 𝑂 𝑛 can be problematic! – Strategies: – Speed up processing e.g., stochastic gradient descent vs. gradient descent – Make approximations/subsample the dataset – Exploit parallelism Henry Chai - 8/30/22 31
  • 28. Tons of Observations – Typically, we consider exponential time complexity (e.g., 𝑂 2, ) bad and polynomial complexity (e.g., 𝑂 𝑛2 ) good – However, if 𝑛 is massive, then even 𝑂 𝑛 can be problematic! – Strategies: – Speed up processing e.g., stochastic gradient descent vs. gradient descent – Make approximations/subsample the dataset – Exploit parallelism Henry Chai - 8/30/22 32
  • 29. Parallel Computing – Multi-core processing – scale up one big machine – Data can fit on one machine – Usually requires high-end, specialized hardware – Simpler algorithms that don’t necessarily scale well – Distributed processing – scale out many machines – Data stored across multiple machines – Scales to massive problems on standard hardware – Added complexity of network communication Henry Chai - 8/30/22 33
  • 30. Parallel Computing – Multi-core processing – scale up one big machine – Data can fit on one machine – Usually requires high-end, specialized hardware – Simpler algorithms that don’t necessarily scale well – Distributed processing – scale out many machines – Data stored across multiple machines – Scales to massive problems on standard hardware – Added complexity of network communication Henry Chai - 8/30/22 34
  • 31. – Multi-core processing – scale up one big machine – Data can fit on one machine – Usually requires high-end, specialized hardware – Simpler algorithms that don’t necessarily scale well – Distributed processing – scale out many machines – Data stored across multiple machines – Scales to massive problems on standard hardware – Added complexity of network communication Parallel Computing Henry Chai - 8/30/22 35
  • 32. Apache Spark – Open-source engine for parallel computing/large-scale data processing – Lots of convenient features for machine learning specifically – Fast iterative procedures – Efficient communication primitives – Interactive IPython-style notebooks (Databricks) Henry Chai - 8/30/22 36
  • 33. Course Overview – Data preprocessing – Cleaning – Summarizing/visualizing – Dimensionality reduction – Model training – Distributed machine learning – Large-scale optimization – Scalable deep learning – Efficient data structures – Hyperparameter tuning – Inference – Hardware for ML – Low-latency inference (Compression, Pruning, Distillation) Henry Chai - 8/30/22 37 – Infrastructure/Frameworks – Apache Spark – TensorFlow – AWS/Google Cloud/Azure – Advanced Topics – Federated Learning – Neural architecture search – Machine learning in practice
  • 34. “Front” Matter – HW1 released 8/30 (today!), due 9/13 at 11:59 PM – All HWs consist of two parts: written and programming – For HW1 only, the programming part is optional (but strongly encouraged) – The written part is nominally about PCA but can be solved using pre-requisite knowledge (linear algebra) – Recitations on Friday, 11:50 – 1:10 (different from lecture) in GHC 4401 (same as lecture) – Recitation 1 on 9/2: Introduction to PySpark/Databricks – Recitation 2 on 9/9: Review of linear algebra Henry Chai - 8/30/22 38
  • 35. – Course website: https://10605.github.io/ Course Logistics Henry Chai - 8/30/22 39
  • 36. Course Logistics Henry Chai - 8/30/22 40 – Course website: https://10605.github.io/ – Exam 1: 10/11 – Exam 2: 12/8
  • 37. – Course website: https://10605.github.io/ Course Logistics Henry Chai - 8/30/22 41
  • 38. Course Logistics Henry Chai - 8/30/22 42 – Course website: https://10605.github.io/ – Mini-project: – Complete in groups of 2-3 – Two pre-specified options: – Groups with only 10-605 students will pick one to complete – Groups with any 10-805 students must complete both – No late days may be used on project deliverables – More details about project options and deliverables will be announced later in the semester
  • 39. Course Technologies – Piazza for Q&A / announcements – Gradescope for assignment submissions – Canvas for hosting recordings and gradebook – Google calendar for lecture, recitation and OH schedule Henry Chai - 8/30/22 43
  • 40. Course Staff Henry Chai - 8/30/22 44 TAs Instructors EA