Doing Data Science
Chapter 1
What is Data Science?
• Big Data and Data Science Hype
• Getting Past the Hype / Why Now?
• Datafication
• The Current Landscape (with a Little History)
• Data Science Jobs
• A Data Science Profile
• Thought Experiment: Meta-Definition
• OK, So What Is a Data Scientist, Really?
– In Academia
– In Industry
Big Data and Data Science Hype
• Big Data, how big?
• Data Science, who is doing it?
• Academia has been doing this for years.
• Statisticians have been doing this work.
Getting Past the Hype / Why Now
• The Hype: Understanding the cultural phenomenon of data
science and how others were experiencing it. Study how
companies and universities are “doing data science”.
• Why Now: Technology makes this possible: infrastructure for
large-scale data processing, increased memory, and
bandwidth, as well as a cultural acceptance of technology in
the fabric of our lives. This wasn't true a decade ago.
• Consideration should be given to the ethical and technical
responsibilities of the people responsible for the process.
Datafication
• Definition: a process of “taking all aspects of
life and turning them into data.”
• For example:
– Google's augmented-reality glasses “datafy” the
gaze.
– Twitter “datafies” stray thoughts.
– LinkedIn “datafies” professional networks.
Current Landscape of Data Science
• Drew Conway's Venn diagram of data science
from 2010
Data Science Jobs
Job descriptions ask data scientists to be experts in:
• computer science,
• statistics,
• communication,
• data visualization, and to have
• extensive domain expertise.
Observation: Nobody is an expert in everything, which is
why it makes more sense to create teams of people who
have different profiles and different expertise. Together, as
a team, they can specialize in all those things.
Data Science Profile
Data Science Team
What is Data Science, Really?
• Data scientists in academia?
• Who in academia plans to become a data scientist?
• statisticians, applied mathematicians, and computer
scientists, sociologists, journalists, political scientists,
biomedical informatics students, students from
government agencies and social welfare, someone
from the architecture school, environmental
engineering, pure mathematicians, business marketing
students, and students who already worked as data
scientists.
• They were all interested in figuring out ways to solve
important problems, often of social value, with data.
• In Academia: an academic data scientist is a
scientist, trained in anything from social
science to biology, who works with large
amounts of data, and must wrestle with
computational problems posed by the
structure, size, messiness, and the complexity
and nature of the data, while simultaneously
solving a real-world problem.
In Industry?
• What do data scientists look like in industry?
• It depends on the level of seniority.
• A chief data scientist sets everything up,
from the engineering and infrastructure for
collecting data and logging, to privacy
concerns, to deciding what data will be
user-facing, how data is going to be used to make
decisions, and how it’s going to be built back
into the product.
• They manage a team of engineers, scientists, and
analysts, and should communicate with
leadership across the company, including the
CEO, CTO, and product leadership.
• In Industry: someone who knows how to extract
meaning from and interpret data, which requires
both tools and methods from statistics and
machine learning, as well as being human.
They spend a lot of time in the process of
collecting, cleaning, and “munging” data,
because data is never clean. This process requires
persistence, statistics, and software engineering
skills, which are also necessary for understanding
biases in the data and for debugging logging
output from code.
Doing Data Science
Chapter 2, Pages 15 - 34
Big Data Statistics (pages 17 -33)
• Statistical thinking in the Age of Big Data
• Statistical Inference
• Populations and Samples
• Big Data Examples
• Big Assumptions due to Big Data
• Modeling
Statistical Thinking in the Age of Big
Data
Big Data?
• First, it is a bundle of technologies.
• Second, it is a potential revolution in
measurement.
• And third, it is a point of view, or philosophy,
about how decisions will be—and perhaps
should be—made in the future.
Statistical Thinking – Age of Big Data
• Prerequisites – massive skills!! (Pages 14 -16)
– Math/Comp Science: stats, linear algebra, coding.
– Analytical: Data preparation, modeling,
visualization, communication.
Statistical Inference
• The world is complex, random, and uncertain. (Page 18)
• As we commute to work on subways and in cars,
• as we shop, email, browse the Internet, and watch the stock
market,
• as we build things, eat things,
• and talk to our friends and family about things,
• all of these processes potentially produce data.
– Data are small traces of real-world processes.
– Which traces we gather is decided by our data
collection or sampling method.
• Note: two forms of randomness exist: (Page 19)
– Underlying the process (system property)
– Collection methods (human errors)
• Need a solid method to extract meaning and
information from random, dubious data. (Page 19)
– This is Statistical Inference!
• This overall process of going from the world to the data,
and then from the data back to the world, is the field of
statistical inference.
• More precisely, statistical inference is the discipline that
concerns itself with the development of procedures,
methods, and theorems that allow us to extract meaning
and information from data that has been generated by
stochastic (random) processes.
Populations and Samples
• Population: the population of India, or the population
of the world?
• It could be any set of objects or units, such as
tweets, photographs, or stars.
• If we could measure the characteristics of all
those objects, we would have the complete set of
observations; N denotes the total number of observations.
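The population/sample distinction can be sketched in R. This is only an illustration under stated assumptions: the population is simulated (the slides give no data), and the names `population`, `n_sample`, and `sample_obs` are made up for the example.

```r
# Hypothetical population: N simulated measurements (not data from the slides).
set.seed(42)
N <- 100000
population <- rnorm(N, mean = 50, sd = 10)

# In practice we rarely observe all N units; we observe a sample of size n.
n_sample <- 1000
sample_obs <- sample(population, size = n_sample)

# The sample mean is an estimate of the population mean.
pop_mean <- mean(population)
sample_mean <- mean(sample_obs)
cat("population mean:", round(pop_mean, 2), "\n")
cat("sample mean:    ", round(sample_mean, 2), "\n")
```

The two means should be close but not identical; the gap is the sampling error that statistical inference tries to quantify.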
Big Data Domain - Sampling
• Scientific validity issues with “Big Data”
populations and samples. (Page 21 –
Engineering problems + Bias)
– Incompleteness Assumptions (Page 22)
• Statistics and analyses should start from the assumption that
the sample may not represent the population; where that holds,
scientifically tenable conclusions about the population cannot be drawn.
• That is, a result is a guess at best. Assertions framed this way will
stand up better to academic and scientific scrutiny.
Big Data Domain - Assumptions
• Other Bad or Wrong Assumptions
– N = 1 vs. N = ALL (multiple layers) (Page 25 -26)
• Big Data introduces a 2nd degree to the data context.
• There are infinite levels of depth and breadth in the data.
• Individuals become populations. Populations become
populations of populations – to the nth degree. (meta-data)
– My Example:
• 1 billion Facebook posts (one from each user) vs. 1 billion
Facebook posts from one unique user.
• 1 billion tweets vs. 1 billion images from one unique user.
• Danger: Drawing conclusions from incomplete
populations. Understand the boundaries/context.
Modeling
• What’s a model? (bottom page 27 – middle 28)
– An attempt to understand the population of interest
and represent that in a compact form which can be
used to experiment/analyze/study and determine
cause-and-effect and similar relationships amongst
the variables under study IN THE POPULATION.
• Data model
• Statistical model – fitting?
• Mathematical model
Probability Distributions (Page 31)
Fitting a model
• estimate the parameters of the model using
the observed data.
Overfitting:
• model isn’t that good at capturing reality
beyond your sampled data.
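A rough sketch of fitting and overfitting in R, assuming simulated data and an arbitrary choice of a degree-9 polynomial as the "too flexible" model (neither comes from the slides). Each model's parameters are estimated from the training half only, and the error is then compared on held-out data.

```r
# Simulated data (an assumption): a linear signal plus noise.
set.seed(1)
x <- runif(40, 0, 10)
y <- 1 + 2 * x + rnorm(40, sd = 3)

# Hold out half of the data to see how each model does beyond the fitted sample.
train <- data.frame(x = x[1:20], y = y[1:20])
test  <- data.frame(x = x[21:40], y = y[21:40])

simple_fit  <- lm(y ~ x, data = train)            # 2 parameters
complex_fit <- lm(y ~ poly(x, 9), data = train)   # 10 parameters

# Mean squared error of a fitted model on a given data frame.
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

# The flexible model fits the training data at least as well as the simple one...
train_mse <- c(simple = mse(simple_fit, train), complex = mse(complex_fit, train))
# ...but that does not guarantee it does well on data it has not seen.
test_mse  <- c(simple = mse(simple_fit, test),  complex = mse(complex_fit, test))
print(rbind(train_mse, test_mse))
```

A large gap between training and test error for the complex model is the usual symptom of overfitting.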
Doing Data Science
Chapter 2, Pages 34 - 50
Exploratory Data Analysis (EDA)
• “It is an attitude, a state of flexibility, a willingness
to look for those things that we believe are not
there, as well as those we believe to be there.”
-John Tukey
• Traditionally presented as a bunch of histograms
and stem-and-leaf plots.
Features
• EDA is a critical part of the data science process.
• Represents a philosophy or way of doing
statistics.
• There are no hypotheses and there is no model.
• “Exploratory” aspect means that your
understanding of the problem you are
solving, or might solve, is changing as you go.
Basic Tools of EDA
• Plots, graphs and summary statistics.
• Method of systematically going through the
data, plotting distributions of all variables.
• EDA is a set of tools, but it’s also a mindset.
• The mindset is about your relationship with the data.
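The basic tools listed above can be tried in a few lines of R. This is a sketch only: the `time_spent` values are simulated here (an assumption), standing in for whatever variable you are exploring.

```r
# Hypothetical variable: simulated session lengths in minutes (not real data).
set.seed(7)
time_spent <- round(rexp(200, rate = 1 / 12), 1)

# Summary statistics: min, quartiles, median, mean, max.
print(summary(time_spent))

# Text-based stem-and-leaf plot of the distribution.
stem(time_spent)

# Histogram counts without opening a graphics device; call hist(time_spent)
# in an interactive session to draw the plot itself.
h <- hist(time_spent, plot = FALSE)
print(data.frame(bin_start = head(h$breaks, -1), count = h$counts))
```

In a real analysis you would repeat this for every variable, looking for surprises such as impossible values, spikes, or missing ranges.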
Philosophy of EDA
• There are many reasons anyone working with data
should do EDA.
• EDA helps with debugging the logging
process.
• EDA helps ensure the product is performing
as intended.
• EDA is done toward the beginning of the
analysis.
Data Science Process
A Data Scientist’s Role in This process
Doing Data Science
Chapter 3
What is an algorithm?
• A series of steps or rules to accomplish a task,
such as:
– Sorting
– Searching
– Graph-based computational problems
• Because one problem could be solved by
several algorithms, the “best” is the one that
can do it with most efficiency and least
computational time.
Three Categories of Algorithms
• Data munging, preparation, and processing
– Sorting, MapReduce, Pregel
– Considered data engineering
• Optimization
– Parameter estimation
– Gradient Descent, Newton’s Method, least
squares
• Machine learning
– Predict, classify, cluster
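As an illustration of the optimization category, here is a minimal gradient descent sketch in R that estimates the two parameters of a straight line by minimizing the mean squared error. The toy data, the starting values, the learning rate, and the iteration count are all assumptions picked for the example.

```r
# Toy data (an assumption): points that lie exactly on y = 2 + 3x.
x <- c(1, 2, 3, 4, 5)
y <- 2 + 3 * x

b0 <- 0; b1 <- 0          # starting guesses
learning_rate <- 0.02     # illustrative step size, small enough to converge here

for (step in 1:5000) {
  residual <- y - (b0 + b1 * x)
  # Gradient of the mean squared error with respect to b0 and b1.
  grad_b0 <- -2 * mean(residual)
  grad_b1 <- -2 * mean(residual * x)
  # Move each parameter a small step against its gradient.
  b0 <- b0 - learning_rate * grad_b0
  b1 <- b1 - learning_rate * grad_b1
}

cat("b0 =", round(b0, 4), " b1 =", round(b1, 4), "\n")
```

For plain least squares a closed-form solution exists, so gradient descent is shown here only because it is easy to follow on a small problem.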
Data Scientists
• Good data scientists use both statistical
modeling and machine learning algorithms.
• Statisticians:
– Want to interpret parameters
in terms of real-world scenarios.
– Provide confidence
intervals and quantify the
uncertainty in their estimates.
– Make explicit assumptions
about data generation.
• Software engineers:
– Want to build a model into
production code without
interpreting its parameters.
– Many machine learning
algorithms have no built-in
notion of uncertainty.
– Don’t make explicit assumptions about
probability distributions; any
assumptions are implicit.
Linear Regression (supervised)
• Express the mathematical relationship between two
variables and build a model of it.
• How does Y (response variable) change with X (explanatory
variable)? A good fit alone does not show that X causes Y.
• Assumptions:
– Quantitative variables
– Linear form
Linear Regression (supervised)
• Steps:
– Create a scatterplot of data
– Ensure that data looks linear (maybe apply
transformation?)
– Find the “line of least squares”, or fit line.
• This is the line with the lowest sum of squared
residuals (actual values – fitted values).
– Check your model for “goodness” with R-squared,
p-values, etc.
– Apply your model within reason.
Suppose you run a social networking site that
charges a monthly subscription fee of $25, and that
this is your only source of revenue.
Each month you collect data and count your number
of users and total revenue.
You’ve done this daily over the course of two years,
recording it all in a spreadsheet.
You could express this data as a series of points. Here
are the first four:
S = {(x, y)} = {(1, 25), (10, 250), (100, 2500), (200, 5000)}
The names of the columns are
total_num_friends,
total_new_friends_this_week, num_visits,
time_spent, number_apps_downloaded,
number_ads_shown, gender, age, and so on.
Equation of the Line
y = β0 + β1x
What are β0 (the intercept) and β1 (the slope)?
Fitting the model
• To find this line, you’ll define the “residual
sum of squares” (RSS), denoted RSS(β), to
be:
RSS(β) = Σ_i (y_i - (β0 + β1x_i))²
Fitting Linear model
• model <- lm(y ~ x)
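To make the `lm` call concrete, here is a small sketch that fits the four (users, revenue) points listed earlier and computes the residual sum of squares by hand. The variable names `beta` and `rss` are illustrative.

```r
# The first four (users, revenue) points from the subscription example.
x <- c(1, 10, 100, 200)
y <- c(25, 250, 2500, 5000)

model <- lm(y ~ x)
beta <- coef(model)   # beta[1] is the intercept (β0), beta[2] is the slope (β1)
print(beta)

# Residual sum of squares: squared gaps between observed and fitted values.
rss <- sum((y - (beta[1] + beta[2] * x))^2)
print(unname(rss))
```

Because every point lies exactly on a line through the origin, the fitted slope matches the $25 fee, and the intercept and RSS are zero up to floating-point rounding. Real data would leave a nonzero RSS.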
Extending beyond least squares
• We have a simple linear regression model
that uses least squares estimation to estimate the
βs.
• This model can be extended in three primary
ways:
1. Adding in modeling assumptions about the errors
2. Adding in more predictors
3. Transforming the predictors
Adding in modeling assumptions about
the errors
• If you use your model to predict y for a given
value of x, your prediction is deterministic
(y = β0 + β1x)
• and doesn’t capture the variability in the observed
data.
• To capture this variability in your model,
you extend it to:
y = β0 + β1x + ϵ
• The error term ϵ represents the actual error:
• the difference between the observations and
the true regression line,
• which you’ll never know and can only
estimate with your estimated βs.
• The usual assumption is that the noise is normally
distributed, which is denoted:
ϵ ~ N(0, σ²)
• The conditional distribution of y given x is then
p(y | x) ~ N(β0 + β1x, σ²)
• You need to estimate your parameters β0, β1, and σ
(the standard deviation of the noise) from the data.
• Then you estimate the variance (σ²) of ϵ as:
Σ_i e_i² / (n - 2)
(the mean squared error), where e_i is the observed error
(residual) for observation i.
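The variance estimate can be checked numerically. The sketch below simulates data from a line with normally distributed noise (the true values 3, 2, and 1.5 are arbitrary assumptions), fits the model, and divides the sum of squared residuals by n - 2.

```r
# Simulated data (an assumption): true β0 = 3, β1 = 2, noise sd σ = 1.5.
set.seed(123)
n <- 200
x <- runif(n, 0, 10)
epsilon <- rnorm(n, mean = 0, sd = 1.5)
y <- 3 + 2 * x + epsilon

model <- lm(y ~ x)
e <- residuals(model)            # observed errors: y_i minus fitted value

# Estimate of the noise variance σ^2: divide by n - 2 because two
# parameters (β0 and β1) were estimated from the data.
sigma2_hat <- sum(e^2) / (n - 2)

cat("estimated variance:", round(sigma2_hat, 3), "\n")
cat("lm's own estimate: ", round(summary(model)$sigma^2, 3), "\n")
```

The hand-computed value agrees with the squared residual standard error that `summary()` reports, and both should land near the true variance of 1.5² = 2.25.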
Evaluation metrics
• R-squared and p-values
• R-squared
• p-values
To see the p-values, look at the Pr(>|t|) column of the model summary.
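A short sketch of where these numbers live in R's output, assuming simulated data (the slides do not supply any). R-squared is recomputed by hand to show what the reported value means.

```r
# Simulated data (an assumption), used only to have a model to inspect.
set.seed(10)
x <- runif(100, 0, 10)
y <- 5 + 1.5 * x + rnorm(100, sd = 2)
model <- lm(y ~ x)

fit_summary <- summary(model)

# R-squared: share of the variance in y explained by the model.
r2_manual <- 1 - sum(residuals(model)^2) / sum((y - mean(y))^2)
cat("R-squared (manual): ", round(r2_manual, 4), "\n")
cat("R-squared (summary):", round(fit_summary$r.squared, 4), "\n")

# Coefficient table; the last column, Pr(>|t|), holds the p-values.
coef_table <- coef(fit_summary)
print(coef_table)
p_values <- coef_table[, "Pr(>|t|)"]
```

Each p-value tests the hypothesis that the corresponding β is zero; a small value suggests the coefficient is unlikely to be zero under the model's assumptions.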
Cross-validation
Other models for error terms
• Adding other predictors
y = β0 + β1x_1 + β2x_2 + β3x_3 + ϵ
model <- lm(y ~ x_1 + x_2 + x_3)
Transformations
• polynomial relationship
y = β0 + β1x + β2x² + β3x³
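A polynomial relationship like this can still be fitted with `lm`, because the model remains linear in the βs. The sketch below assumes simulated cubic data with made-up coefficients.

```r
# Simulated data (an assumption) with a cubic relationship plus noise.
set.seed(5)
x <- seq(-3, 3, length.out = 60)
y <- 1 + 2 * x - 0.5 * x^2 + 0.8 * x^3 + rnorm(60, sd = 0.5)

# I() protects ^ so that it means arithmetic power inside the formula.
model <- lm(y ~ x + I(x^2) + I(x^3))
print(round(coef(model), 2))
```

The four fitted coefficients should be close to the values used to simulate the data (1, 2, -0.5, 0.8).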
k-NN
The intuition behind k-NN
• is to consider the most similar other items
defined in terms of their attributes, look at
their labels, and give the unassigned item the
majority vote.
• If there’s a tie, you randomly select among the
labels that have tied for first.
• To automate it, two decisions must be made.
First, how do you define similarity or closeness?
• Once you define it, for a given unrated item, you
can say how similar all the labeled items are to it,
• and you can take the most similar items and call
them neighbors, who each have a “vote.”
• Second, how many neighbors should you look at or “let
vote”? This value is k.
Overview of the process:
1. Decide on your similarity or distance metric.
2. Split the original labeled dataset into training and test
data.
3. Pick an evaluation metric. (Misclassification rate is a
good one. We’ll explain this more in a bit.)
4. Run k-NN a few times, changing k and checking the
evaluation measure.
5. Optimize k by picking the one with the best
evaluation measure.
6. Once you’ve chosen k, use the same training set and
now create a new test set with the people’s ages and
incomes that you have no labels for and want to predict.
Similarity or distance metrics
• Euclidean distance
• Cosine Similarity
• Jaccard Distance or Similarity
• Mahalanobis Distance
• Hamming Distance
• Manhattan Distance
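The metrics listed above can each be written in a line or two of R. The vectors below are made-up values for illustration. The Mahalanobis example uses an identity covariance matrix as a placeholder; with real data you would estimate the covariance from the data.

```r
# Two illustrative numeric vectors and two binary vectors (made-up values).
a <- c(1, 2, 3)
b <- c(4, 6, 3)
u <- c(1, 0, 1, 1, 0)
v <- c(1, 1, 0, 1, 0)

euclidean <- sqrt(sum((a - b)^2))        # straight-line distance
manhattan <- sum(abs(a - b))             # sum of absolute differences
cosine    <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))  # cosine similarity
hamming   <- sum(u != v)                 # number of positions that differ
jaccard   <- 1 - sum(u & v) / sum(u | v) # 1 - |intersection| / |union|

# mahalanobis() returns the SQUARED distance. With the identity covariance
# (a placeholder) it equals the squared Euclidean distance.
mahal_sq <- mahalanobis(a, center = b, cov = diag(3))

print(c(euclidean = euclidean, manhattan = manhattan, cosine = cosine,
        hamming = hamming, jaccard = jaccard, mahalanobis_sq = mahal_sq))
```

Which metric is appropriate depends on the data: Hamming and Jaccard suit binary or set-valued attributes, while Euclidean and Manhattan suit numeric ones measured on comparable scales.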
Training and test sets
• Train Test split
Pick an evaluation metric
• Sensitivity (the true positive rate, or recall) is defined as
the probability of correctly diagnosing an ill
patient as ill.
• Specificity (the true negative rate) is defined as the
probability of correctly diagnosing a well
patient as well.
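These definitions translate directly into counts. The sketch below uses ten made-up patient labels (an assumption for illustration) and also computes the misclassification rate mentioned in the overview.

```r
# Made-up labels for ten patients: 1 = ill, 0 = well.
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)

tp <- sum(actual == 1 & predicted == 1)  # ill, diagnosed ill
fn <- sum(actual == 1 & predicted == 0)  # ill, diagnosed well
tn <- sum(actual == 0 & predicted == 0)  # well, diagnosed well
fp <- sum(actual == 0 & predicted == 1)  # well, diagnosed ill

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
misclassification_rate <- (fp + fn) / length(actual)

cat("sensitivity:", sensitivity, "\n")
cat("specificity:", specificity, "\n")
cat("misclassification rate:", misclassification_rate, "\n")
```

Here 3 of the 4 ill patients and 4 of the 6 well patients are diagnosed correctly, so 3 of the 10 labels are wrong overall.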
Choosing k
• Run k-NN a few times, changing k, and
checking the evaluation metric each time.
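The whole process (distance metric, train/test split, evaluation metric, and a search over k) can be sketched in base R. Everything specific here is an assumption for illustration: the two-feature data are simulated, the split is 80/20, Euclidean distance is used, and `knn_predict` is a hypothetical helper rather than a library function.

```r
# Simulated labeled data (an assumption): two features, two classes.
set.seed(2024)
n <- 200
features <- rbind(matrix(rnorm(n, mean = 0), ncol = 2),
                  matrix(rnorm(n, mean = 3), ncol = 2))
labels <- rep(c("low", "high"), each = n / 2)

# Split the labeled data into training (80%) and test (20%) sets.
train_idx <- sample(nrow(features), size = round(0.8 * nrow(features)))
train_x <- features[train_idx, ];  train_y <- labels[train_idx]
test_x  <- features[-train_idx, ]; test_y  <- labels[-train_idx]

# Classify one point by majority vote among its k nearest training points
# (Euclidean distance; ties are broken at random).
knn_predict <- function(point, k) {
  distances <- sqrt(rowSums(sweep(train_x, 2, point)^2))
  votes <- table(train_y[order(distances)[1:k]])
  winners <- names(votes)[votes == max(votes)]
  winners[sample.int(length(winners), 1)]
}

# Misclassification rate on the test set for several values of k.
ks <- c(1, 3, 5, 7, 9)
error_rate <- sapply(ks, function(k) {
  predicted <- apply(test_x, 1, knn_predict, k = k)
  mean(predicted != test_y)
})
names(error_rate) <- paste0("k=", ks)
print(error_rate)

# Keep the k with the lowest misclassification rate.
best_k <- ks[which.min(error_rate)]
cat("best k:", best_k, "\n")
```

Odd values of k avoid ties in a two-class problem. With real data, features on different scales would normally be standardized before computing distances.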
k-Nearest Neighbor/k-NN (supervised)
• Used when you have many objects that are
classified into categories but have some
unclassified objects (e.g. movie ratings).
k-Nearest Neighbor/k-NN (supervised)
• Pick a k value (usually a low odd number, but
up to you to pick).
• Find the k points closest to the
unclassified point (using one of the distance
metrics).
• Assign the new point to the class to which the
majority of those closest points belong.
• Run the algorithm again with different values of k
and compare the results.
k-means (unsupervised)
• Goal is to segment data into clusters or strata
– Important for marketing research where you need
to determine your sample space.
• Assumptions:
– Labels are not known.
– You pick k (more of an art than a science).
k-means (unsupervised)
• Randomly pick k centroids (cluster centers),
near the data but different from one another.
• Assign each data point to its closest centroid.
• Move each centroid to the average location of
the data points assigned to it.
• Repeat the previous two steps until the data
point assignments don’t change.
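The loop described above is what R's built-in `kmeans` runs when `algorithm = "Lloyd"` is selected. The sketch below applies it to six made-up points; the data, the choice of k = 2, and the number of restarts are assumptions for illustration.

```r
# Six made-up 2-D points forming two well-separated groups.
pts <- rbind(c(1.0, 1.1), c(1.2, 0.9), c(0.8, 1.0),
             c(5.0, 5.2), c(5.1, 4.9), c(4.9, 5.0))

set.seed(3)
# centers = 2 is the k we picked; nstart reruns the algorithm from several
# random starting centroids and keeps the best result.
fit <- kmeans(pts, centers = 2, nstart = 10, algorithm = "Lloyd")

print(fit$cluster)   # cluster assignment of each point
print(fit$centers)   # final centroid locations (mean of assigned points)
```

The cluster numbers themselves are arbitrary; what matters is which points end up grouped together. Because the starting centroids are random, using several restarts (`nstart`) makes the result less dependent on an unlucky start.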

Lateral Ventricles.pdf very easy good diagrams comprehensive
 
Structural Classification Of Protein (SCOP)
Structural Classification Of Protein  (SCOP)Structural Classification Of Protein  (SCOP)
Structural Classification Of Protein (SCOP)
 
Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...
Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...
Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
 
Predicting property prices with machine learning algorithms.pdf
Predicting property prices with machine learning algorithms.pdfPredicting property prices with machine learning algorithms.pdf
Predicting property prices with machine learning algorithms.pdf
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
platelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptxplatelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptx
 
filosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptxfilosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptx
 
Anemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditionsAnemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditions
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
 
Richard's entangled aventures in wonderland
Richard's entangled aventures in wonderlandRichard's entangled aventures in wonderland
Richard's entangled aventures in wonderland
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 
justice-and-fairness-ethics with example
justice-and-fairness-ethics with examplejustice-and-fairness-ethics with example
justice-and-fairness-ethics with example
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
 
Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
 

Data Science-1 (1).ppt

  • 3. What is Data Science? • Big Data and Data Science Hype • Getting Past the Hype / Why Now? • Datafication • The Current Landscape (with a Little History) • Data Science Jobs • A Data Science Profile • Thought Experiment: Meta-Definition • OK, So What Is a Data Scientist, Really? – In Academia – In Industry
  • 4. Big Data and Data Science Hype • Big Data, how big? • Data Science, who is doing it? • Academia has been doing this for years. • Statisticians have been doing this work.
  • 5. Getting Past the Hype / Why Now • The Hype: Understanding the cultural phenomenon of data science and how others were experiencing it. Study how companies and universities are “doing data science”. • Why Now: Technology makes this possible: infrastructure for large-scale data processing, increased memory and bandwidth, as well as a cultural acceptance of technology in the fabric of our lives. This wasn't true a decade ago. • Consideration should be given to the ethical and technical responsibilities of the people responsible for the process.
  • 6. Datafication • Definition: A process of “taking all aspects of life and turning them into data.” • For example: – Google's augmented-reality glasses “datafy” the gaze. – Twitter “datafies” stray thoughts. – LinkedIn “datafies” professional networks.
  • 7. Current Landscape of Data Science • Drew Conway's Venn diagram of data science from 2010.
  • 8. Data Science Jobs Job descriptions ask for: • experts in computer science, • statistics, • communication, • data visualization, and • extensive domain expertise. Observation: Nobody is an expert in everything, which is why it makes more sense to create teams of people who have different profiles and different expertise. Together, as a team, they can specialize in all those things.
  • 12. What is Data Science, Really? • Data scientist in academia? • Who in academia plans to become a data scientist?
  • 13. What is Data Science, Really? • Data scientist in academia? • Who in academia plans to become a data scientist? • Statisticians, applied mathematicians, computer scientists, sociologists, journalists, political scientists, biomedical informatics students, students from government agencies and social welfare, someone from the architecture school, environmental engineers, pure mathematicians, business marketing students, and students who already worked as data scientists.
  • 14. What is Data Science, Really? • Data scientist in academia? • Who in academia plans to become a data scientist? • Statisticians, applied mathematicians, computer scientists, sociologists, journalists, political scientists, biomedical informatics students, students from government agencies and social welfare, someone from the architecture school, environmental engineers, pure mathematicians, business marketing students, and students who already worked as data scientists. • They were all interested in figuring out ways to solve important problems, often of social value, with data.
  • 15. • In Academia: an academic data scientist is a scientist, trained in anything from social science to biology, who works with large amounts of data, and must wrestle with computational problems posed by the structure, size, messiness, and the complexity and nature of the data, while simultaneously solving a real-world problem.
  • 16. In Industry? • What do data scientists look like in industry?
  • 17. In Industry? • What do data scientists look like in industry? • It depends on the level of seniority. • A chief data scientist sets everything up, from the engineering and infrastructure for collecting data and logging, to privacy concerns, to deciding what data will be user-facing, how data is going to be used to make decisions, and how it’s going to be built back into the product.
  • 18. • They manage a team of engineers, scientists, and analysts, and should communicate with leadership across the company, including the CEO, CTO, and product leadership.
  • 19. • In Industry: Someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human. He/She spends a lot of time in the process of collecting, cleaning, and “munging” data, because data is never clean. This process requires persistence, statistics, and software engineering skills that are also necessary for understanding biases in the data, and for debugging logging output from code.
  • 20. Doing Data Science Chapter 2, Pages 15 - 34
  • 21. Big Data Statistics (pages 17 -33) • Statistical thinking in the Age of Big Data • Statistical Inference • Populations and Samples • Big Data Examples • Big Assumptions due to Big Data • Modeling
  • 22. Statistical Thinking in the Age of Big Data What is Big Data? • First, it is a bundle of technologies. • Second, it is a potential revolution in measurement. • And third, it is a point of view, or philosophy, about how decisions will be, and perhaps should be, made in the future.
  • 23. Statistical Thinking – Age of Big Data • Prerequisites – massive skills!! (Pages 14 -16) – Math/Comp Science: stats, linear algebra, coding. – Analytical: Data preparation, modeling, visualization, communication.
  • 24. Statistical Inference • The world is complex, random, and uncertain. (Page 18) • As we commute to work on subways and in cars, • shop, email, browse the Internet and watch the stock market, • as we’re building things, eating things, • talking to our friends and family about things, • all of these processes potentially produce data. – Data are small traces of real-world processes. – Which traces we gather is decided by our data collection or sampling method.
  • 25. ….. • Note: two forms of randomness exist: (Page 19) – Underlying the process (system property) – Collection methods (human errors) • Need a solid method to extract meaning and information from random, dubious data. ( Page 19) – This is Statistical Inference!
  • 26. • This overall process of going from the world to the data, and then from the data back to the world, is the field of statistical inference. • More precisely, statistical inference is the discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic (random) processes.
  • 27. Populations and Samples • Population: the population of India, or the population of the world? • It could be any set of objects or units, such as tweets, photographs, or stars. • If we could measure the characteristics of all those objects, we would have a complete set of N observations.
  • 28. Big Data Domain - Sampling • Scientific Validity Issues with “Big Data” populations and samples. (Page 21 – Engineering problems + Bias) – Incompleteness Assumptions (Page 22) • All statistics and analyses must assume that samples do not represent the population and therefore scientifically tenable conclusions cannot be drawn. • i.e. It’s a guess at best. These types of assertions will stand up better against academic/scientific scrutiny.
  • 29. Big Data Domain - Assumptions • Other Bad or Wrong Assumptions – N = 1 vs. N = ALL (multiple layers) (Page 25 -26) • Big Data introduces a 2nd degree to the data context. • There are infinite levels of depth and breadth in the data. • Individuals become populations. Populations become populations of populations – to the nth degree. (meta-data) – My Example: • 1 billion Facebook posts (one from each user) vs. 1 billion Facebook posts from one unique user. • 1 billion tweets vs. 1 billion images from one unique user. • Danger: Drawing conclusions from incomplete populations. Understand the boundaries/context.
  • 30. Modeling • What’s a model? (bottom page 27 – middle 28) – An attempt to understand the population of interest and represent that in a compact form which can be used to experiment/analyze/study and determine cause-and-effect and similar relationships amongst the variables under study IN THE POPULATION. • Data model • Statistical model – fitting? • Mathematical model
  • 32. Fitting a model • estimate the parameters of the model using the observed data. Overfitting: • model isn’t that good at capturing reality beyond your sampled data.
  • 33. Doing Data Science Chapter 2, Pages 34 - 50
  • 34. Exploratory Data Analysis (EDA) • “It is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there.” -John Tukey • Traditionally presented as a bunch of histograms and stem-and-leaf plots.
  • 35. Features • EDA is a critical part of the data science process. • Represents a philosophy or way of doing statistics. • There are no hypotheses and there is no model. • The “exploratory” aspect means that your understanding of the problem you are solving, or might solve, is changing as you go.
  • 36. Basic Tools of EDA • Plots, graphs and summary statistics. • Method of systematically going through the data, plotting distributions of all variables. • EDA is a set of tools, it’s also a mindset. • Mindset is about relationship with the data.
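The summary statistics and distribution plots mentioned above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the `ages` values, the function names, and the bin width of 10 are assumptions made for the example, not data from the slides.

```python
import statistics
from collections import Counter

def summarize(values):
    """Return basic summary statistics for one numeric variable."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return {
        "n": len(ordered),
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "stdev": statistics.stdev(ordered),
    }

def bin_counts(values, bin_width):
    """Count how many values fall into each bin of the given width."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical sample of user ages, invented for illustration.
ages = [23, 25, 31, 35, 35, 41, 44, 52, 58, 67]

summary = summarize(ages)
histogram = bin_counts(ages, 10)

for name, value in summary.items():
    print(f"{name:>6}: {value}")
for start, count in histogram.items():
    # One '#' per observation gives a quick text histogram.
    print(f"{start}-{start + 9}: {'#' * count}")
```

Running the same two functions over every column of a dataset is one way to go through the data systematically, as the slide suggests.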
  • 37. Philosophy of EDA • There are many reasons anyone working with data should do EDA. • EDA helps with debugging the logging process. • EDA helps ensure the product is performing as intended. • EDA is done toward the beginning of the analysis.
  • 39. A Data Scientist’s Role in This process
  • 41. What is an algorithm? • A series of steps or rules to accomplish a task such as: – Sorting – Searching – Graph-based computational problems • Because one problem could be solved by several algorithms, the “best” is the one that can do it with the most efficiency and the least computational time.
  • 42. Three Categories of Algorithms • Data munging, preparation, and processing – Sorting, MapReduce, Pregel – Considered data engineering • Optimization – Parameter estimation – Gradient Descent, Newton’s Method, least squares • Machine learning – Predict, classify, cluster
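As an illustration of the optimization category, here is a minimal sketch of gradient descent used for parameter estimation: it fits an intercept and slope by repeatedly stepping against the gradient of the mean squared error. The data, the learning rate, and the step count are assumptions chosen so the example converges; they do not come from the slides.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Estimate b0 and b1 in y = b0 + b1 * x by minimizing mean squared error."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for each point under the current parameters.
        errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_b0 = 2 * sum(errors) / n
        grad_b1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        # Step downhill.
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1

# Hypothetical points lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

b0, b1 = gradient_descent(xs, ys)
print(b0, b1)
```

For a problem this small the closed-form least squares solution is simpler; gradient descent is shown only because the slide lists it as an optimization method.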
  • 43. Data Scientists • Good data scientists use both statistical modeling and machine learning algorithms. • Statisticians: – Want to interpret parameters in terms of real-world scenarios. – Provide confidence intervals and quantify the uncertainty in their estimates. – Make explicit assumptions about data generation. • Software engineers: – Want to build a model into production code without interpreting its parameters. – Machine learning algorithms typically don’t have notions of uncertainty. – Don’t make explicit assumptions about the probability distribution; any assumptions are implicit.
  • 44. Linear Regression (supervised) • Determine if there is causation and build a model if we think so. • Does X (explanatory var) cause Y (response var)? • Assumptions: – Quantitative variables – Linear form
  • 45. Linear Regression (supervised) • Steps: – Create a scatterplot of the data. – Ensure that the data looks linear (maybe apply a transformation?). – Find the “line of least squares”, or fit line. • This is the line that has the lowest sum of squared residuals (actual values minus fitted values). – Check your model for “goodness” with R-squared, p-values, etc. – Apply your model within reason.
  • 46. Suppose you run a social networking site that charges a monthly subscription fee of $25, and that this is your only source of revenue. Each month you collect data and count your number of users and total revenue. You’ve done this daily over the course of two years, recording it all in a spreadsheet. You could express this data as a series of points. Here are the first four: S = {(x, y)} = {(1, 25), (10, 250), (100, 2500), (200, 5000)}
  • 47. The names of the columns are total_num_friends, total_new_friends_this_week, num_visits, time_spent, number_apps_downloaded, number_ads_shown, gender, age, and so on.
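  • 50. Equation of a line: y = β0 + β1x. What are β0 and β1? Finding them is called fitting the model.
  • 52. • To find this line, you’ll define the “residual sum of squares” (RSS), denoted RSS(β), to be: RSS(β) = Σ_i (y_i − (β0 + β1x_i))², where i ranges over the data points. The least squares line is the one whose β0 and β1 minimize RSS(β).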
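The residual sum of squares is the sum, over all data points, of the squared gap between the observed y and the value the line predicts. A minimal sketch follows, evaluated on the four subscription points from the earlier slide; the function name `rss` and the candidate coefficients are assumptions made for illustration.

```python
def rss(beta0, beta1, xs, ys):
    """Residual sum of squares for the line y = beta0 + beta1 * x."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))

# The first four (users, revenue) points from the subscription example.
xs = [1, 10, 100, 200]
ys = [25, 250, 2500, 5000]

perfect = rss(0, 25, xs, ys)    # the line y = 25x passes through every point
too_flat = rss(0, 24, xs, ys)   # a slightly wrong slope leaves residuals

print(perfect, too_flat)
```

Comparing the two values shows why minimizing RSS picks out the slope of 25: any other line leaves a strictly larger sum of squared residuals on this data.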
  • 53. Fitting Linear model • model <- lm(y ~ x)
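The slide fits the model with R's `lm`. For simple linear regression the least squares estimates also have a closed form, sketched here in plain Python as an assumption-labeled equivalent rather than a reproduction of what `lm` does internally. The helper name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least squares estimates for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope

# The first four (users, revenue) points from the subscription example.
xs = [1, 10, 100, 200]
ys = [25, 250, 2500, 5000]

intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

On this data the fitted slope recovers the $25 subscription fee and the intercept is zero, because revenue is exactly proportional to the number of users.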
  • 54. Extending beyond least squares • We have a simple linear regression model that uses least squares estimation to estimate your βs. • This model can be extended in three primary ways: 1. Adding in modeling assumptions about the errors 2. Adding in more predictors 3. Transforming the predictors
  • 55. Adding in modeling assumptions about the errors • If you use your model to predict y for a given value of x, your prediction is deterministic (y = β0 + β1x) • and doesn’t capture the variability in the observed data. • To capture this variability in your model, you extend it to: y = β0 + β1x + ϵ
  • 56. • The error term ϵ represents the actual error: • the difference between the observations and the true regression line, • which you’ll never know and can only estimate with your β̂s. • The usual assumption is that the noise is normally distributed, which is denoted: ϵ ~ N(0, σ²)
  • 57. • The conditional distribution of y given x is p(y | x) ~ N(β0 + β1x, σ²). • You need to estimate your parameters β0, β1, and σ² (the variance) from the data. • You estimate the variance σ² of ϵ as the mean squared error: σ̂² = Σ_i e_i² / (n − 2), where e_i = y_i − (β̂0 + β̂1x_i) are the observed residuals.
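The variance estimate can be sketched as follows: fit the line, compute the residuals, and divide their sum of squares by n − 2 (two degrees of freedom are used up by the intercept and slope). The noisy data points and helper names are assumptions invented for the example.

```python
def fit_line(xs, ys):
    """Closed-form least squares estimates for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_squared_error(xs, ys):
    """Estimate the noise variance as sum of squared residuals / (n - 2)."""
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return sum(e ** 2 for e in residuals) / (len(xs) - 2)

# Hypothetical noisy observations scattered around a line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

sigma2_hat = mean_squared_error(xs, ys)
print(sigma2_hat)
```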
  • 58. Evaluation metrics • R-squared and p-values • R-squared: the proportion of the variance in y explained by the model • p-values: to see them, look at the Pr(>|t|) column of the model summary.
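R-squared can be computed directly as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes a line has already been fitted; the data and the coefficients 0.05 and 1.99 are hypothetical values chosen for illustration.

```python
def r_squared(ys, predictions):
    """Proportion of variance explained: 1 - RSS / TSS."""
    mean_y = sum(ys) / len(ys)
    rss = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    tss = sum((y - mean_y) ** 2 for y in ys)
    return 1 - rss / tss

# Hypothetical observations and an assumed fitted line y = 0.05 + 1.99x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
predictions = [0.05 + 1.99 * x for x in xs]

r2 = r_squared(ys, predictions)
print(r2)
```

A value close to 1 means the line accounts for almost all of the variation in y; it says nothing by itself about whether the relationship is causal.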
  • 60. Other models for error terms • Adding other predictors: y = β0 + β1x1 + β2x2 + β3x3 + ϵ • In R: model <- lm(y ~ x_1 + x_2 + x_3)
  • 62. KNN
  • 66. The intuition behind k-NN • is to consider the most similar other items defined in terms of their attributes, look at their labels, and give the unassigned item the majority vote. • If there’s a tie, you randomly select among the labels that have tied for first.
  • 67. • To automate it, two decisions must be made. First, how do you define similarity or closeness? • Once you define it, for a given unrated item, you can say how similar all the labeled items are to it, • and you can take the most similar items and call them neighbors, who each have a “vote.” • Second, how many neighbors should you look at or “let vote”? This value is k.
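The two decisions above (a distance measure and a value of k) are all that a basic k-NN classifier needs. Here is a minimal sketch that uses Euclidean distance and breaks ties at random, as the slides describe. The (age, income) training points and the "low"/"high" labels are invented for illustration.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k, rng=random):
    """Label `query` by majority vote among its k nearest labeled points."""
    # Sort labeled points by Euclidean distance to the query; keep the k closest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    top = max(votes.values())
    # Random choice among labels tied for first place.
    winners = sorted(label for label, count in votes.items() if count == top)
    return rng.choice(winners)

# Hypothetical labeled data: (age, income in thousands) -> credit rating.
train = [
    ((25, 40), "low"), ((30, 45), "low"), ((35, 60), "low"),
    ((45, 80), "high"), ((50, 95), "high"), ((55, 100), "high"),
]

prediction_a = knn_predict(train, (48, 90), k=3)
prediction_b = knn_predict(train, (28, 42), k=3)
print(prediction_a, prediction_b)
```

Because age and income are on different scales, income dominates the distance here; in practice the attributes are usually rescaled before computing distances.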
  • 70. overview of the process: 1. Decide on your similarity or distance metric. 2. Split the original labeled dataset into training and test data. 3. Pick an evaluation metric. (Misclassification rate is a good one. We’ll explain this more in a bit.) 4. Run k-NN a few times, changing k and checking the evaluation measure. 5. Optimize k by picking the one with the best evaluation measure. 6. Once you’ve chosen k, use the same training set and now create a new test set with the people’s ages and incomes that you have no
  • 71. Similarity or distance metrics • Euclidean distance • Cosine similarity • Jaccard distance or similarity • Mahalanobis distance • Hamming distance • Manhattan distance
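Most of the metrics listed above take only a line or two to write down. The sketch below covers five of them using the standard library; Mahalanobis distance is left out because it needs the inverse covariance matrix of the data. Function names and sample inputs are assumptions for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard_distance(a, b):
    """1 minus (size of intersection / size of union) for two sets."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))
print(manhattan((0, 0), (3, 4)))
print(cosine_similarity((1, 0), (0, 1)))
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))
print(hamming("karolin", "kathrin"))
```

Which metric is appropriate depends on the data: Euclidean and Manhattan suit numeric attributes, cosine suits direction-only comparisons such as text vectors, Jaccard suits sets, and Hamming suits fixed-length strings or bit vectors.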
  • 72. Training and test sets • Train Test split
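A train/test split simply shuffles the labeled rows and holds some of them out for evaluation. This is a minimal sketch; the 20% test fraction, the fixed seed, and the function name are assumptions made for the example.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)      # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset of ten labeled rows, represented by their ids.
rows = list(range(10))
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model is fitted on `train` only; `test` is used afterwards to check how well it generalizes to rows it has not seen.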
  • 73. Pick an evaluation metric • Sensitivity (also called the true positive rate, or recall) is defined as the probability of correctly diagnosing an ill patient as ill. • Specificity (also called the true negative rate) is defined as the probability of correctly diagnosing a well patient as well.
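Both metrics come straight from counting the four outcomes of a binary classifier. In this minimal sketch, 1 stands for "ill" and 0 for "well"; the label lists are invented for illustration.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for binary labels, 1 = ill, 0 = well."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    true_neg = sum(1 for a, p in pairs if a == 0 and p == 0)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    sensitivity = true_pos / (true_pos + false_neg)  # ill patients caught
    specificity = true_neg / (true_neg + false_pos)  # well patients cleared
    return sensitivity, specificity

# Hypothetical diagnoses for ten patients.
actual =    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sensitivity, specificity = sensitivity_specificity(actual, predicted)
print(sensitivity, specificity)
```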
  • 74. Choosing k • Run k-NN a few times, changing k, and checking the evaluation metric each time.
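Choosing k can be sketched as a loop: classify the held-out test points for each candidate k, record the misclassification rate, and keep the k with the lowest rate. The training and test points below are hypothetical, including one deliberately noisy "low" label near the "high" cluster; with two labels and odd k there are no ties, so the random tie-breaking step is omitted here.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority label among the k training points closest to `query`."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def misclassification_rate(train, test, k):
    """Fraction of test points whose predicted label is wrong."""
    wrong = sum(1 for point, label in test if knn_predict(train, point, k) != label)
    return wrong / len(test)

# Hypothetical (age, income in thousands) -> credit rating data.
train = [
    ((25, 40), "low"), ((30, 45), "low"), ((35, 60), "low"),
    ((45, 80), "high"), ((50, 95), "high"), ((55, 100), "high"),
    ((47, 85), "low"),  # noisy label sitting inside the "high" cluster
]
test = [
    ((46, 84), "high"), ((28, 44), "low"),
    ((52, 97), "high"), ((33, 55), "low"),
]

rates = {k: misclassification_rate(train, test, k) for k in (1, 3, 5)}
# Prefer the lowest error; break ties in favor of the smaller k.
best_k = min(rates, key=lambda k: (rates[k], k))
print(rates, best_k)
```

With k = 1 the noisy point misleads the classifier; a larger k lets the surrounding neighbors outvote it.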
  • 75. k-Nearest Neighbor/k-NN (supervised) • Used when you have many objects that are classified into categories but have some unclassified objects (e.g. movie ratings).
  • 76. k-Nearest Neighbor/k-NN (supervised) • Pick a k value (usually a low odd number, but up to you to pick). • Find the k points closest to the unclassified point (using one of the various distance measures). • Assign the new point to the class to which the majority of those closest points belong. • Run the algorithm again and again using different values of k.
  • 78. k-means (unsupervised) • Goal is to segment data into clusters or strata – Important for marketing research where you need to determine your sample space. • Assumptions: – Labels are not known. – You pick k (more of an art than a science).
  • 79. k-means (unsupervised) • Randomly pick k centroids (centers of data) and place them near “clusters” of data. • Assign each data point to its closest centroid. • Move each centroid to the average location of the data points assigned to it. • Repeat the previous two steps until the data point assignments don’t change.
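The steps above can be sketched as a short loop. This is a minimal illustration, assuming Euclidean distance and initial centroids drawn at random from the data points themselves (one common choice, not something the slides specify). The points, the seed, and the iteration cap are invented for the example.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Cluster points into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assign each point to the index of its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing, so we are done
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical 2-D data with two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]

centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

The cluster indices themselves are arbitrary; what matters is which points end up sharing an index. Different starting centroids can give different results on less clearly separated data, which is one reason the choice of k and of the initialization is described as more art than science.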