SlideShare a Scribd company logo
1 of 48
Download to read offline
10-605/805 – ML for
Large Datasets
Lecture 2: Distributed
Computing
Henry Chai
9/1/22
Front Matter
– HW1 released 8/30, due 9/13 at 11:59 PM
– For HW1 only, the programming part is optional (but
strongly encouraged)
– The written part is nominally about PCA but can be
solved using pre-requisite knowledge (linear algebra)
– Recitations on Friday, 11:50 – 1:10 (different from lecture)
in GHC 4401 (same as lecture)
– Recitation 1 on 9/2: Introduction to PySpark/Databricks
Henry Chai - 9/1/22 2
Recall:
Machine
Learning
with Large
Datasets
– Premise:
– There exists some pattern/behavior of interest
– The pattern/behavior is difficult to describe
– There is data (sometimes a lot of it!)
– More data usually helps
– Use data efficiently/intelligently to “learn” the pattern
– Definition:
– A computer program learns if its performance, P, at
some task, T, improves with experience, E.
Henry Chai - 9/1/22 3
Okay, but how
much data are
we talking
about here?
– Premise:
– There exists some pattern/behavior of interest
– The pattern/behavior is difficult to describe
– There is data (sometimes a lot of it!)
– More data usually helps
– Use data efficiently/intelligently to “learn” the pattern
– Definition:
– A computer program learns if its performance, P, at
some task, T, improves with experience, E.
Henry Chai - 9/1/22 4
Units of Data
Unit Value Scale
Kilobyte (KB) 1000 bytes A paragraph of text
Megabyte (MB) 1000 KB A short novel
Gigabyte (GB) 1000 GB Beethoven’s 5th symphony
Terabyte (TB) 1000 TB
All the x-rays in a large
hospital
Petabyte (PB) 1000 PB
≈ ⁄
!
" of the content of all
US research libraries
Exabyte (EB) 1000 EB
≈ ⁄
!
# of the words humans
have ever spoken
Henry Chai - 9/1/22 5
Henry Chai - 9/1/22 6
Source: https://res.cloudinary.com/yumyoshojin/image/upload/v1/pdf/future-data-2019.pdf
The Big Data
Problem
– The sources and amount of data keep growing
– Data storage and processing can’t keep up
– A 1 TB hard disk costs ≈ $25 USD
– Reading 1 TB from disk ≈ 5 hours
– So it would cost $100,000 USD to store Facebook’s
4PB of data/day and take ≈ 2.5 years to read!
Henry Chai - 9/1/22 7
Recall:
Parallel
Computing
– Multi-core processing – scale up one big machine
– Data can fit on one machine
– Usually requires high-end, specialized hardware
– Simpler algorithms that don’t necessarily scale well
– Distributed processing – scale out many machines
– Data stored across multiple machines
– Scales to massive problems on standard hardware
– Added complexity of network communication
Henry Chai - 9/1/22 8
– Multi-core processing – scale up one big machine
– Data can fit on one machine
– Usually requires high-end, specialized hardware
– Simpler algorithms that don’t necessarily scale well
– Distributed processing – scale out many machines
– Data stored across multiple machines
– Scales to massive problems on standard hardware
– Added complexity of network communication
Recall:
Parallel
Computing
Henry Chai - 9/1/22 9
Cloud
Computing
– Enables distributed processing by democratizing access
to storage and computational resources
– Costs continue to decrease each year
Henry Chai - 9/1/22 10
Cloud
Computing
Henry Chai - 9/1/22 11
Source: https://www.google.com/about/datacenters/gallery/
How can we
program in this
setting?
Henry Chai - 9/1/22 12
Source: https://www.google.com/about/datacenters/gallery/
Challenges in
Cloud
Computing
1. How can we divide work across multiple machines?
– Key consideration: moving data across
machines/over a network is costly!
2. What can we do if a machine fails?
– If a node fails every 3 years …
– … then a center with 10,000 nodes will have 10
faults/day!
– Even worse are stragglers, or nodes that run slower
than others
Henry Chai - 9/1/22 13
Example:
Word Counting
– Given a text, count the number of times each word appears
Henry Chai - 9/1/22 14
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
Example:
Word Counting
– Given a text, count the number of times each word appears
Henry Chai - 9/1/22 15
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
Word Count
I 1
Example:
Word Counting
– Given a text, count the number of times each word appears
Henry Chai - 9/1/22 16
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
Word Count
I 1
am 1
Example:
Word Counting
– Given a text, count the number of times each word appears
Henry Chai - 9/1/22 17
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
Word Count
I 1
am 1
Sam 1
Example:
Word Counting
– Given a text, count the number of times each word appears
Henry Chai - 9/1/22 18
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
Word Count
I 1
am 1
Sam 2
– Given a text, count the number of times each word appears
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
Example:
Word Counting
Henry Chai - 9/1/22 19
Word Count
I 6
am 5
Sam 5
that 3
do 1
not 1
like 1
– Given a text, count the number of times each word appears
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
do you like
green eggs and ham
I do not like them
Sam I am
⋮
Example:
Word Counting
Henry Chai - 9/1/22 20
– Given a text, count the number of times each word appears
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
do you like
green eggs and ham
I do not like them
Sam I am
⋮
Example:
Word Counting
Henry Chai - 9/1/22 21
I 2
⋮ ⋮
that 2
⋮ ⋮
I 2
⋮ ⋮
do 1
⋮ ⋮
I 2
⋮ ⋮
Word Count
I 8
am 6
Sam 6
that 3
⋮ ⋮
– Given a text, count the number of times each word appears
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
do you like
green eggs and ham
I do not like them
Sam I am
⋮
Example:
Word Counting
Henry Chai - 9/1/22 22
I 2
⋮ ⋮
that 2
⋮ ⋮
I 2
⋮ ⋮
do 1
⋮ ⋮
I 2
⋮ ⋮
I 8
am 6
Sam 6
that 3
do 2
not ⋮
like
you 1
green ⋮
eggs
and 2
ham ⋮
them
– Given a text, count the number of times each word appears
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
do you like
green eggs and ham
I do not like them
Sam I am
⋮
MapReduce
(Apache
Hadoop)
Henry Chai - 9/1/22 23
I 2
⋮ ⋮
that 2
⋮ ⋮
I 2
⋮ ⋮
do 1
⋮ ⋮
I 2
⋮ ⋮
I 8
am 6
Sam 6
that 3
do 2
not ⋮
like
you 1
green ⋮
eggs
and 2
ham ⋮
them
Map Reduce
Challenges in
Cloud
Computing
1. How can we divide work across multiple machines?
– Key consideration: moving data across
machines/over a network is costly!
2. What can we do if a machine fails?
– If a node fails every 3 years …
– … then a center with 10,000 nodes will have 10
faults/day!
– Even worse are stragglers, or nodes that run slower
than others
Henry Chai - 9/1/22 24
Machine
Failures
Henry Chai - 9/1/22 25
– Machines are plentiful, so just launch another job!
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
do you like
green eggs and ham
I do not like them
Sam I am
⋮
that 2
⋮ ⋮
I 2
⋮ ⋮
do 1
⋮ ⋮
I 2
⋮ ⋮
I 2
⋮ ⋮
Machine
Failures and
Stragglers
Henry Chai - 9/1/22 26
– Machines are plentiful, so just launch another job!
I am Sam
Sam I am
that Sam I am
that Sam I am
I do not like
that Sam I am
do you like
green eggs and ham
I do not like them
Sam I am
⋮
that 2
⋮ ⋮
I 2
⋮ ⋮
do 1
⋮ ⋮
I 2
⋮ ⋮
I 2
⋮ ⋮
Challenges in
Cloud
Computing
1. How can we divide work across multiple machines?
– Key consideration: moving data across
machines/over a network is costly!
2. What can we do if a machine fails?
– If a node fails every 3 years …
– … then a center with 10,000 nodes will have 10
faults/day!
– Even worse are stragglers, or nodes that run slower
than others
Henry Chai - 9/1/22 27
Communication
Hierarchy
Henry Chai - 9/1/22 28
main memory (RAM)
disk
≈ 1 GB/s
≈ 50 GB/s
≈ 1 GB/s
top-of-rack
switch
≈
1
GB/s
CPU
≈
0
.
5
G
B
/
s
in-rack nodes
other racks
What are the
implications for
MapReduce?
Henry Chai - 9/1/22 29
main memory (RAM)
disk
≈ 1 GB/s
≈ 50 GB/s
≈ 1 GB/s
top-of-rack
switch
≈
1
GB/s
CPU
≈
0
.
5
G
B
/
s
in-rack nodes
other racks
MapReduce
Henry Chai - 9/1/22 30
I 8
am 6
Sam 6
that 3
do 2
not ⋮
like
you 1
green ⋮
eggs
and 2
ham ⋮
them
Reduce
I 2
⋮ ⋮
that 2
⋮ ⋮
I 2
⋮ ⋮
do 1
⋮ ⋮
I 2
⋮ ⋮
Map
MapReduce &
Iterative
Procedures
Henry Chai - 9/1/22 31
– Iterative jobs involve lots of disk reading/writing per
iteration, which is slow!
– Iterative jobs involve lots of disk reading/writing per
iteration, which is slow!
– Issue: many machine learning/optimization algorithms are
inherently iterative!
MapReduce &
Iterative
Procedures
Machine
Learning
Henry Chai - 9/1/22 32
– Insight: the cost of RAM has been steadily decreasing
– Idea: shove more RAM into each machine and hold more
data in main memory → Apache Spark (circa 2010)
MapReduce &
Iterative
Procedures
Machine
Learning
Henry Chai - 9/1/22 33
RAM
Disk
Source: https://jcmit.net/mem2015.htm
1¢/MB
Apache Spark
vs. Hadoop
Henry Chai - 9/1/22 34
Figure courtesy of Virginia Smith
Iteration and Big Data Processing
Iteration in Hadoop:
Input
(e.g., from HDFS)
iteration 1 iteration 2 iteration 3 …
file system
read file system
write file system
read file system
write file system
read file system
write
Read/write
intermediate data
Read/write
intermediate data
Read/write
intermediate data
MapReduce program MapReduce program MapReduce program
Iteration in Spark:
Input
(e.g., from HDFS)
iteration 1 iteration 2 iteration 3 …
file system
read
In-memory computations, no need to read/write to disk.
Apache Spark
vs. Hadoop
Henry Chai - 9/1/22 35
Source: http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
Apache Spark:
3 Main APIs
1. Resilient distributed datasets (RDDs):
– A collection of elements partitioned across machines in
a fault-tolerant way that can be operated on in parallel
2. Dataframes:
– An abstraction on top of the RDD API
– Similar to tables in a relational database with named
and typed columns that support SQL-like queries
3. Datasets
– A combination of RDDs and Dataframes
– Not available in PySpark
Henry Chai - 9/1/22 36
Apache Spark:
3 Main APIs
1. Resilient distributed datasets (RDDs):
– A collection of elements partitioned across machines in
a fault-tolerant way that can be operated on in parallel
2. Dataframes:
– An abstraction on top of the RDD API
– Similar to tables in a relational database with named
and typed columns that support SQL-like queries
3. Datasets
– A combination of RDDs and Dataframes
– Not available in PySpark
Henry Chai - 9/1/22 37
RDDs
Element 1
Element 2
Element 3
Element 4
Element 5
Element 6
Element 7
Element 8
Element 9
Element 10
Henry Chai - 9/1/22 38
Machine 2
Machine 1
Machine 3
An RDD split into 5 partitions
Rule of thumb: 2-4 partitions per CPU
Operations on
RDDs
– Transformations: return a new RDD
– Lazy – the new RDD is not
computed immediately
– Actions: compute a result from an RDD
and return/write that result to disk
– Eager – the result is computed
immediately
– Lazy vs. eager operations allows Spark to
efficiently manage data transfer
Henry Chai - 9/1/22 39
semantic
difference
functional
difference
Transformations
Henry Chai - 9/1/22 40
– Transformations: return a new RDD
– Lazy – the new RDD is not
computed immediately
– Sequence of transformations defines a
“recipe” for computing some result
– Allows Spark to efficiently recover
from failures/stragglers by recalling
the steps required to bring a new
machine up to speed
Transformations:
Examples
Henry Chai - 9/1/22 41
Transformation Description
map(func) returns a new RDD by applying the function func
to each element of the source RDD
flatMap(func) similar to map except each element of the source
RDD can be mapped to 0 or more outputs
filter(func) returns a new RDD consisting of the elements
where func evaluates to TRUE
distinct() returns a new RDD consisting of the unique
elements of the source RDD
union(otherRDD) returns a new RDD consisting of the union of
elements in the source RDD and otherRDD
intersection
(otherRDD)
returns a new RDD consisting of the intersection
of elements in the source RDD and otherRDD
Actions
Henry Chai - 9/1/22 42
– Transformations: return a new RDD
– Lazy – the new RDD is not
computed immediately
– Actions: compute a result from an RDD
and return/write that result to disk
– Eager – the result is computed
immediately
– Required to get Spark to “do” anything,
i.e., generate the output we want
Actions:
Examples
Henry Chai - 9/1/22 43
Action Description
reduce(func) aggregates the elements of the source RDD using
the function func, a commutative and associative
function that takes two arguments and returns
one (allowing for parallelization)
collect() returns all elements of the source RDD as an array
to the driver program
take(n) returns the first n elements of the source RDD as
an array to the driver program
first() returns the first element of the source RDD as an
array to the driver program
count() returns the number of elements in the source RDD
For A Detailed
Walkthrough…
– Recitation 1 and HW1 will walk you step-by-step through…
– The architecture of a typical Spark job
– Lambda functions
– RDDs and caching
– Dataframes and schema
– Getting setup in Databricks
Henry Chai - 9/1/22 44
Key Takeaways
– MapReduce as a framework for distributed processing
– Heavy disk I/O → poorly suited for iterative procedures
– Apache Spark
– Big idea: keep data in memory as much as possible
– RDDs are the foundational API
– Transformations (lazy) vs actions (eager)
– Laziness/eagerness allows Spark to optimize the
execution of operations
Henry Chai - 9/1/22 45
Homework 1
Programming
Preview
Henry Chai - 9/1/22 46
– Part 1: Entity Resolution with RDDs
– Identifying and linking different occurrences of the
same object across multiple data sources
– Different titles/names for the same person
– Different pictures of the same physical object
– Motivating example: product listings on different
e-commerce websites
– Google shopping: clickart 950000 - premier
image pack (dvd-rom) massive collection of
images & fonts for all your design needs …
– Amazon: clickart 950 000 - premier image pack
(dvd-rom)
Homework 1
Programming
Preview
– Part 1: Entity Resolution with RDDs
– Determine if two product listings correspond to the
same product using text similarity
– Distance between texts defined as the cosine
similarity of their TF-IDF representations:
– Similar to bag-of-words model except instead of
occurrences, each word is defined by its term
frequency times its inverse document frequency
– TF(word, doc) =
# &' ()*+, -./0 122+13, )4 0.5
(&(16 # &' 7&38, )4 0.5
– IDF(word) =
(&(16 # &' 8&9:*+4(,
# &' 8&9:*+4(, (;1( 9&4(1)4 -./0
– Upweights words that occur a lot in a few specific
documents and rarely in other documents
Henry Chai - 9/1/22 47
Homework 1
Programming
Preview
– Part 2: Machine Learning Pipeline with DataFrames
– Task: predicting power generation of power plants
under different environmental conditions
1. Data ETL (extract-transform-load): converting raw
data into a workable form
2. Exploration and visualization
3. Modelling
4. Hyperparameter optimization and model selection
Henry Chai - 9/1/22 48

More Related Content

Similar to Lecture_2_-_Distributed_Computing.pdf

DataDay 2023 Presentation - Notes
DataDay 2023 Presentation - NotesDataDay 2023 Presentation - Notes
DataDay 2023 Presentation - NotesMax De Marzi
 
Dmdh winter 2015 session #1
Dmdh winter 2015 session #1Dmdh winter 2015 session #1
Dmdh winter 2015 session #1sarahkh12
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big DataArjen de Vries
 
MBA632 Lecture, Morehead State University
MBA632 Lecture, Morehead State UniversityMBA632 Lecture, Morehead State University
MBA632 Lecture, Morehead State UniversityBrad Ward
 
Reproducible datascience [with Terraform]
Reproducible datascience [with Terraform]Reproducible datascience [with Terraform]
Reproducible datascience [with Terraform]David Przybilla
 
Chapter-1-Computer-Concept.pdf
Chapter-1-Computer-Concept.pdfChapter-1-Computer-Concept.pdf
Chapter-1-Computer-Concept.pdfSaSShakil
 
Super Computers
Super ComputersSuper Computers
Super ComputersTHEFPS
 
How to get started with Site Reliability Engineering
How to get started with Site Reliability EngineeringHow to get started with Site Reliability Engineering
How to get started with Site Reliability EngineeringAndrew Kirkpatrick
 
Mobile? Who are we kidding?
Mobile? Who are we kidding?Mobile? Who are we kidding?
Mobile? Who are we kidding?William Anderson
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudJan Aerts
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012c.titus.brown
 
Apache Con 2008 Top 10 Mistakes
Apache Con 2008 Top 10 MistakesApache Con 2008 Top 10 Mistakes
Apache Con 2008 Top 10 MistakesJohn Coggeshall
 
Top 10 Scalability Mistakes
Top 10 Scalability MistakesTop 10 Scalability Mistakes
Top 10 Scalability MistakesJohn Coggeshall
 
NameBISM 1200 Introduction to Computing Sound Byte Lab Chap.docx
NameBISM 1200 Introduction to Computing Sound Byte Lab Chap.docxNameBISM 1200 Introduction to Computing Sound Byte Lab Chap.docx
NameBISM 1200 Introduction to Computing Sound Byte Lab Chap.docxrosemarybdodson23141
 
Narrative Essay If I Were Invisible
Narrative Essay If I Were InvisibleNarrative Essay If I Were Invisible
Narrative Essay If I Were InvisiblePatty Loen
 
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...Data Con LA
 
Munich 2016 - Z011597 Martin Packer - How To Be A Better Performance Specialist
Munich 2016 - Z011597 Martin Packer - How To Be A Better Performance SpecialistMunich 2016 - Z011597 Martin Packer - How To Be A Better Performance Specialist
Munich 2016 - Z011597 Martin Packer - How To Be A Better Performance SpecialistMartin Packer
 

Similar to Lecture_2_-_Distributed_Computing.pdf (20)

Pc Or Mainframe
Pc Or MainframePc Or Mainframe
Pc Or Mainframe
 
DataDay 2023 Presentation - Notes
DataDay 2023 Presentation - NotesDataDay 2023 Presentation - Notes
DataDay 2023 Presentation - Notes
 
Dmdh winter 2015 session #1
Dmdh winter 2015 session #1Dmdh winter 2015 session #1
Dmdh winter 2015 session #1
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big Data
 
MBA632 Lecture, Morehead State University
MBA632 Lecture, Morehead State UniversityMBA632 Lecture, Morehead State University
MBA632 Lecture, Morehead State University
 
Reproducible datascience [with Terraform]
Reproducible datascience [with Terraform]Reproducible datascience [with Terraform]
Reproducible datascience [with Terraform]
 
Chapter-1-Computer-Concept.pdf
Chapter-1-Computer-Concept.pdfChapter-1-Computer-Concept.pdf
Chapter-1-Computer-Concept.pdf
 
Super Computers
Super ComputersSuper Computers
Super Computers
 
Computer Crazy
Computer CrazyComputer Crazy
Computer Crazy
 
How to get started with Site Reliability Engineering
How to get started with Site Reliability EngineeringHow to get started with Site Reliability Engineering
How to get started with Site Reliability Engineering
 
Mobile? Who are we kidding?
Mobile? Who are we kidding?Mobile? Who are we kidding?
Mobile? Who are we kidding?
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloud
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012
 
Apache Con 2008 Top 10 Mistakes
Apache Con 2008 Top 10 MistakesApache Con 2008 Top 10 Mistakes
Apache Con 2008 Top 10 Mistakes
 
Top 10 Scalability Mistakes
Top 10 Scalability MistakesTop 10 Scalability Mistakes
Top 10 Scalability Mistakes
 
NameBISM 1200 Introduction to Computing Sound Byte Lab Chap.docx
NameBISM 1200 Introduction to Computing Sound Byte Lab Chap.docxNameBISM 1200 Introduction to Computing Sound Byte Lab Chap.docx
NameBISM 1200 Introduction to Computing Sound Byte Lab Chap.docx
 
Narrative Essay If I Were Invisible
Narrative Essay If I Were InvisibleNarrative Essay If I Were Invisible
Narrative Essay If I Were Invisible
 
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
 
TPCiP 2019 Wrap Up
TPCiP 2019 Wrap UpTPCiP 2019 Wrap Up
TPCiP 2019 Wrap Up
 
Munich 2016 - Z011597 Martin Packer - How To Be A Better Performance Specialist
Munich 2016 - Z011597 Martin Packer - How To Be A Better Performance SpecialistMunich 2016 - Z011597 Martin Packer - How To Be A Better Performance Specialist
Munich 2016 - Z011597 Martin Packer - How To Be A Better Performance Specialist
 

Recently uploaded

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 

Recently uploaded (20)

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

Lecture_2_-_Distributed_Computing.pdf

  • 1. 10-605/805 – ML for Large Datasets Lecture 2: Distributed Computing Henry Chai 9/1/22
  • 2. Front Matter – HW1 released 8/30, due 9/13 at 11:59 PM – For HW1 only, the programming part is optional (but strongly encouraged) – The written part is nominally about PCA but can be solved using pre-requisite knowledge (linear algebra) – Recitations on Friday, 11:50 – 1:10 (different from lecture) in GHC 4401 (same as lecture) – Recitation 1 on 9/2: Introduction to PySpark/Databricks Henry Chai - 9/1/22 2
  • 3. Recall: Machine Learning with Large Datasets – Premise: – There exists some pattern/behavior of interest – The pattern/behavior is difficult to describe – There is data (sometimes a lot of it!) – More data usually helps – Use data efficiently/intelligently to “learn” the pattern – Definition: – A computer program learns if its performance, P, at some task, T, improves with experience, E. Henry Chai - 9/1/22 3
  • 4. Okay, but how much data are we talking about here? – Premise: – There exists some pattern/behavior of interest – The pattern/behavior is difficult to describe – There is data (sometimes a lot of it!) – More data usually helps – Use data efficiently/intelligently to “learn” the pattern – Definition: – A computer program learns if its performance, P, at some task, T, improves with experience, E. Henry Chai - 9/1/22 4
  • 5. Units of Data Unit Value Scale Kilobyte (KB) 1000 bytes A paragraph of text Megabyte (MB) 1000 KB A short novel Gigabyte (GB) 1000 GB Beethoven’s 5th symphony Terabyte (TB) 1000 TB All the x-rays in a large hospital Petabyte (PB) 1000 PB ≈ ⁄ ! " of the content of all US research libraries Exabyte (EB) 1000 EB ≈ ⁄ ! # of the words humans have ever spoken Henry Chai - 9/1/22 5
  • 6. Henry Chai - 9/1/22 6 Source: https://res.cloudinary.com/yumyoshojin/image/upload/v1/pdf/future-data-2019.pdf
  • 7. The Big Data Problem – The sources and amount of data keep growing – Data storage and processing can’t keep up – A 1 TB hard disk costs ≈ $25 USD – Reading 1 TB from disk ≈ 5 hours – So it would cost $100,000 USD to store Facebook’s 4PB of data/day and take ≈ 2.5 years to read! Henry Chai - 9/1/22 7
  • 8. Recall: Parallel Computing – Multi-core processing – scale up one big machine – Data can fit on one machine – Usually requires high-end, specialized hardware – Simpler algorithms that don’t necessarily scale well – Distributed processing – scale out many machines – Data stored across multiple machines – Scales to massive problems on standard hardware – Added complexity of network communication Henry Chai - 9/1/22 8
  • 9. – Multi-core processing – scale up one big machine – Data can fit on one machine – Usually requires high-end, specialized hardware – Simpler algorithms that don’t necessarily scale well – Distributed processing – scale out many machines – Data stored across multiple machines – Scales to massive problems on standard hardware – Added complexity of network communication Recall: Parallel Computing Henry Chai - 9/1/22 9
  • 10. Cloud Computing – Enables distributed processing by democratizing access to storage and computational resources – Costs continue to decrease each year Henry Chai - 9/1/22 10
  • 11. Cloud Computing Henry Chai - 9/1/22 11 Source: https://www.google.com/about/datacenters/gallery/
  • 12. How can we program in this setting? Henry Chai - 9/1/22 12 Source: https://www.google.com/about/datacenters/gallery/
  • 13. Challenges in Cloud Computing 1. How can we divide work across multiple machines? – Key consideration: moving data across machines/over a network is costly! 2. What can we do if a machine fails? – If a node fails every 3 years … – … then a center with 10,000 nodes will have 10 faults/day! – Even worse are stragglers, or nodes that run slower than others Henry Chai - 9/1/22 13
  • 14. Example: Word Counting – Given a text, count the number of times each word appears Henry Chai - 9/1/22 14 I am Sam Sam I am that Sam I am that Sam I am I do not like that Sam I am
  • 15. Example: Word Counting – Given a text, count the number of times each word appears Henry Chai - 9/1/22 15 I am Sam Sam I am that Sam I am that Sam I am I do not like that Sam I am Word Count I 1
  • 16. Example: Word Counting – Given a text, count the number of times each word appears Henry Chai - 9/1/22 16 I am Sam Sam I am that Sam I am that Sam I am I do not like that Sam I am Word Count I 1 am 1
  • 17. Example: Word Counting – Given a text, count the number of times each word appears Henry Chai - 9/1/22 17 I am Sam Sam I am that Sam I am that Sam I am I do not like that Sam I am Word Count I 1 am 1 Sam 1
  • 18. Example: Word Counting – Given a text, count the number of times each word appears Henry Chai - 9/1/22 18 I am Sam Sam I am that Sam I am that Sam I am I do not like that Sam I am Word Count I 1 am 1 Sam 2
  • 19. – Given a text, count the number of times each word appears I am Sam Sam I am that Sam I am that Sam I am I do not like that Sam I am Example: Word Counting Henry Chai - 9/1/22 19 Word Count I 6 am 5 Sam 5 that 3 do 1 not 1 like 1
  • 20. – Given a text, count the number of times each word appears I am Sam Sam I am that Sam I am that Sam I am I do not like that Sam I am do you like green eggs and ham I do not like them Sam I am ⋮ Example: Word Counting Henry Chai - 9/1/22 20
  • 21. – Given a text, count the number of times each word appears I am Sam Sam I am that Sam I am that Sam I am I do not like that Sam I am do you like green eggs and ham I do not like them Sam I am ⋮ Example: Word Counting Henry Chai - 9/1/22 21 I 2 ⋮ ⋮ that 2 ⋮ ⋮ I 2 ⋮ ⋮ do 1 ⋮ ⋮ I 2 ⋮ ⋮ Word Count I 8 am 6 Sam 6 that 3 ⋮ ⋮
  • 22. – Given a text, count the number of times each word appears I am Sam Sam I am that Sam I am that Sam I am I do not like that Sam I am do you like green eggs and ham I do not like them Sam I am ⋮ Example: Word Counting Henry Chai - 9/1/22 22 I 2 ⋮ ⋮ that 2 ⋮ ⋮ I 2 ⋮ ⋮ do 1 ⋮ ⋮ I 2 ⋮ ⋮ I 8 am 6 Sam 6 that 3 do 2 not ⋮ like you 1 green ⋮ eggs and 2 ham ⋮ them
  • 23. – Given a text, count the number of times each word appears I am Sam Sam I am that Sam I am that Sam I am I do not like that Sam I am do you like green eggs and ham I do not like them Sam I am ⋮ MapReduce (Apache Hadoop) Henry Chai - 9/1/22 23 I 2 ⋮ ⋮ that 2 ⋮ ⋮ I 2 ⋮ ⋮ do 1 ⋮ ⋮ I 2 ⋮ ⋮ I 8 am 6 Sam 6 that 3 do 2 not ⋮ like you 1 green ⋮ eggs and 2 ham ⋮ them Map Reduce
  • 24. Challenges in Cloud Computing 1. How can we divide work across multiple machines? – Key consideration: moving data across machines/over a network is costly! 2. What can we do if a machine fails? – If a node fails every 3 years … – … then a center with 10,000 nodes will have 10 faults/day! – Even worse are stragglers, or nodes that run slower than others Henry Chai - 9/1/22 24
  • 25. Machine Failures Henry Chai - 9/1/22 25 – Machines are plentiful, so just launch another job! I am Sam Sam I am that Sam I am that Sam I am I do not like that Sam I am do you like green eggs and ham I do not like them Sam I am ⋮ that 2 ⋮ ⋮ I 2 ⋮ ⋮ do 1 ⋮ ⋮ I 2 ⋮ ⋮ I 2 ⋮ ⋮
  • 26. Machine Failures and Stragglers Henry Chai - 9/1/22 26 – Machines are plentiful, so just launch another job! I am Sam Sam I am that Sam I am that Sam I am I do not like that Sam I am do you like green eggs and ham I do not like them Sam I am ⋮ that 2 ⋮ ⋮ I 2 ⋮ ⋮ do 1 ⋮ ⋮ I 2 ⋮ ⋮ I 2 ⋮ ⋮
  • 27. Challenges in Cloud Computing 1. How can we divide work across multiple machines? – Key consideration: moving data across machines/over a network is costly! 2. What can we do if a machine fails? – If a node fails every 3 years … – … then a center with 10,000 nodes will have 10 faults/day! – Even worse are stragglers, or nodes that run slower than others Henry Chai - 9/1/22 27
  • 28. Communication Hierarchy Henry Chai - 9/1/22 28 main memory (RAM) disk ≈ 1 GB/s ≈ 50 GB/s ≈ 1 GB/s top-of-rack switch ≈ 1 GB/s CPU ≈ 0 . 5 G B / s in-rack nodes other racks
  • 29. What are the implications for MapReduce? Henry Chai - 9/1/22 29 main memory (RAM) disk ≈ 1 GB/s ≈ 50 GB/s ≈ 1 GB/s top-of-rack switch ≈ 1 GB/s CPU ≈ 0 . 5 G B / s in-rack nodes other racks
  • 30. MapReduce Henry Chai - 9/1/22 30 I 8 am 6 Sam 6 that 3 do 2 not ⋮ like you 1 green ⋮ eggs and 2 ham ⋮ them Reduce I 2 ⋮ ⋮ that 2 ⋮ ⋮ I 2 ⋮ ⋮ do 1 ⋮ ⋮ I 2 ⋮ ⋮ Map
  • 31. MapReduce & Iterative Procedures Henry Chai - 9/1/22 31 – Iterative jobs involve lots of disk reading/writing per iteration, which is slow!
  • 32. – Iterative jobs involve lots of disk reading/writing per iteration, which is slow! – Issue: many machine learning/optimization algorithms are inherently iterative! MapReduce & Iterative Procedures Machine Learning Henry Chai - 9/1/22 32
  • 33. – Insight: the cost of RAM has been steadily decreasing – Idea: shove more RAM into each machine and hold more data in main memory → Apache Spark (circa 2010) MapReduce & Iterative Procedures Machine Learning Henry Chai - 9/1/22 33 RAM Disk Source: https://jcmit.net/mem2015.htm 1¢/MB
  • 34. Apache Spark vs. Hadoop Henry Chai - 9/1/22 34 Figure courtesy of Virginia Smith Iteration and Big Data Processing Iteration in Hadoop: Input (e.g., from HDFS) iteration 1 iteration 2 iteration 3 … file system read file system write file system read file system write file system read file system write Read/write intermediate data Read/write intermediate data Read/write intermediate data MapReduce program MapReduce program MapReduce program Iteration in Spark: Input (e.g., from HDFS) iteration 1 iteration 2 iteration 3 … file system read In-memory computations, no need to read/write to disk.
  • 35. Apache Spark vs. Hadoop Henry Chai - 9/1/22 35 Source: http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
  • 36. Apache Spark: 3 Main APIs 1. Resilient distributed datasets (RDDs): – A collection of elements partitioned across machines in a fault-tolerant way that can be operated on in parallel 2. Dataframes: – An abstraction on top of the RDD API – Similar to tables in a relational database with named and typed columns that support SQL-like queries 3. Datasets – A combination of RDDs and Dataframes – Not available in PySpark Henry Chai - 9/1/22 36
  • 37. Apache Spark: 3 Main APIs 1. Resilient distributed datasets (RDDs): – A collection of elements partitioned across machines in a fault-tolerant way that can be operated on in parallel 2. Dataframes: – An abstraction on top of the RDD API – Similar to tables in a relational database with named and typed columns that support SQL-like queries 3. Datasets – A combination of RDDs and Dataframes – Not available in PySpark Henry Chai - 9/1/22 37
  • 38. RDDs Element 1 Element 2 Element 3 Element 4 Element 5 Element 6 Element 7 Element 8 Element 9 Element 10 Henry Chai - 9/1/22 38 Machine 2 Machine 1 Machine 3 An RDD split into 5 partitions Rule of thumb: 2-4 partitions per CPU
  • 39. Operations on RDDs – Transformations: return a new RDD – Lazy – the new RDD is not computed immediately – Actions: compute a result from an RDD and return/write that result to disk – Eager – the result is computed immediately – Lazy vs. eager operations allows Spark to efficiently manage data transfer Henry Chai - 9/1/22 39 semantic difference functional difference
  • 40. Transformations Henry Chai - 9/1/22 40 – Transformations: return a new RDD – Lazy – the new RDD is not computed immediately – Sequence of transformations defines a “recipe” for computing some result – Allows Spark to efficiently recover from failures/stragglers by recalling the steps required to bring a new machine up to speed
  • 41. Transformations: Examples Henry Chai - 9/1/22 41 Transformation Description map(func) returns a new RDD by applying the function func to each element of the source RDD flatMap(func) similar to map except each element of the source RDD can be mapped to 0 or more outputs filter(func) returns a new RDD consisting of the elements where func evaluates to TRUE distinct() returns a new RDD consisting of the unique elements of the source RDD union(otherRDD) returns a new RDD consisting of the union of elements in the source RDD and otherRDD intersection (otherRDD) returns a new RDD consisting of the intersection of elements in the source RDD and otherRDD
  • 42. Actions Henry Chai - 9/1/22 42 – Transformations: return a new RDD – Lazy – the new RDD is not computed immediately – Actions: compute a result from an RDD and return/write that result to disk – Eager – the result is computed immediately – Required to get Spark to “do” anything, i.e., generate the output we want
  • 43. Actions: Examples Henry Chai - 9/1/22 43 Action Description reduce(func) aggregates the elements of the source RDD using the function func, a commutative and associative function that takes two arguments and returns one (allowing for parallelization) collect() returns all elements of the source RDD as an array to the driver program take(n) returns the first n elements of the source RDD as an array to the driver program first() returns the first element of the source RDD as an array to the driver program count() returns the number of elements in the source RDD
  • 44. For A Detailed Walkthrough… – Recitation 1 and HW1 will walk you step-by-step through… – The architecture of a typical Spark job – Lambda functions – RDDs and caching – Dataframes and schema – Getting setup in Databricks Henry Chai - 9/1/22 44
  • 45. Key Takeaways – MapReduce as a framework for distributed processing – Heavy disk I/O → poorly suited for iterative procedures – Apache Spark – Big idea: keep data in memory as much as possible – RDDs are the foundational API – Transformations (lazy) vs actions (eager) – Laziness/eagerness allows Spark to optimize the execution of operations Henry Chai - 9/1/22 45
  • 46. Homework 1 Programming Preview Henry Chai - 9/1/22 46 – Part 1: Entity Resolution with RDDs – Identifying and linking different occurrences of the same object across multiple data sources – Different titles/names for the same person – Different pictures of the same physical object – Motivating example: product listings on different e-commerce websites – Google shopping: clickart 950000 - premier image pack (dvd-rom) massive collection of images & fonts for all your design needs … – Amazon: clickart 950 000 - premier image pack (dvd-rom)
  • 47. Homework 1 Programming Preview – Part 1: Entity Resolution with RDDs – Determine if two product listings correspond to the same product using text similarity – Distance between texts defined as the cosine similarity of their TF-IDF representations: – Similar to bag-of-words model except instead of occurrences, each word is defined by its term frequency times its inverse document frequency – TF(word, doc) = # &' ()*+, -./0 122+13, )4 0.5 (&(16 # &' 7&38, )4 0.5 – IDF(word) = (&(16 # &' 8&9:*+4(, # &' 8&9:*+4(, (;1( 9&4(1)4 -./0 – Upweights words that occur a lot in a few specific documents and rarely in other documents Henry Chai - 9/1/22 47
  • 48. Homework 1 Programming Preview – Part 2: Machine Learning Pipeline with DataFrames – Task: predicting power generation of power plants under different environmental conditions 1. Data ETL (extract-transform-load): converting raw data into a workable form 2. Exploration and visualization 3. Modelling 4. Hyperparameter optimization and model selection Henry Chai - 9/1/22 48