Doing Data Science
Chapter 1
What is Data Science?
• Big Data and Data Science Hype
• Getting Past the Hype / Why Now?
• Datafication
• The Current Landscape (with a Little History)
• Data Science Jobs
• A Data Science Profile
• Thought Experiment: Meta-Definition
• OK, So What Is a Data Scientist, Really?
– In Academia
– In Industry
Big Data and Data Science Hype
• Big Data, how big?
• Data Science, who is doing it?
• Academia has been doing this for years
• Statisticians have been doing this work.
Getting Past the Hype / Why Now
• The Hype: Understanding the cultural phenomenon of data
science and how others were experiencing it. Study how
companies and universities are “doing data science”.
• Why Now: Technology makes this possible: infrastructure for
large-scale data processing, increased memory, and bandwidth,
as well as a cultural acceptance of technology in the fabric of
our lives. This wasn't true a decade ago.
• Consideration should be given to the ethical and technical
responsibilities of the people who carry out the process.
Datafication
• Definition: A process of “taking all aspects of
life and turning them into data.”
• For example:
– Google’s augmented-reality glasses “datafy” the gaze.
– Twitter “datafies” stray thoughts.
– LinkedIn “datafies” professional networks.
Current Landscape of Data Science
• Drew Conway's Venn diagram of data science
from 2010:
[Venn diagram: three circles labeled Hacking Skills, Math and Statistics, and Substantive Expertise; their overlaps mark Machine Learning, Traditional Research, the Danger Zone, and, at the center, Data Science]
Data Science Jobs
Job descriptions:
• experts in computer science,
• statistics,
• communication,
• data visualization, and
• extensive domain expertise.
Observation: Nobody is an expert in everything, which is why
it makes more sense to create teams of people who have
different profiles and different expertise; together, as a team,
they can specialize in all those things.
Data Science Profile
Data Science Team
What is Data Science, Really?
• Data scientist in academia?
• Who in academia plans to become a data scientist?
• statisticians, applied mathematicians, and computer
scientists, sociologists, journalists, political scientists,
biomedical informatics students, students from government
agencies and social welfare, someone from the architecture
school, environmental engineering, pure mathematicians,
business marketing students, and students who already
worked as data scientists.
• They were all interested in figuring out ways to solve
important problems, often of social value, with data.
• In Academia: an academic data scientist is a
scientist, trained in anything from social
science to biology, who works with large
amounts of data, and must wrestle with
computational problems posed by the
structure, size, messiness, and the complexity
and nature of the data, while simultaneously
solving a real-world problem.
In Industry?
• What do data scientists look like in industry?
• It depends on the level of seniority
• A chief data scientist sets everything up, from the
engineering and infrastructure for collecting and logging
data, to privacy concerns, to deciding what data will be
user-facing, how data is going to be used to make
decisions, and how it’s going to be built back into the product.
• They manage a team of engineers, scientists, and
analysts and should communicate with
leadership across the company, including the
CEO, CTO, and product leadership.
• In Industry: Someone who knows how to extract
meaning from and interpret data, which requires
both tools and methods from statistics and machine
learning, as well as being human. They spend a
lot of time collecting, cleaning, and
“munging” data, because data is never clean. This
process requires persistence and the statistics and
software engineering skills that are also necessary for
understanding biases in the data and for debugging
logging output from code.
Doing Data Science
Chapter 2, Pages 15 - 34
Big Data Statistics (Pages 17–33)
• Statistical thinking in the Age of Big Data
• Statistical Inference
• Populations and Samples
• Big Data Examples
• Big Assumptions due to Big Data
• Modeling
Statistical Thinking in the Age of Big Data
Big Data?
• First, it is a bundle of technologies.
• Second, it is a potential revolution in
measurement.
• And third, it is a point of view, or philosophy,
about how decisions will be—and perhaps
should be—made in the future.
Statistical Thinking – Age of Big Data
• Prerequisites – massive skills! (Pages 14–16)
– Math/Comp Science: stats, linear algebra, coding.
– Analytical: Data preparation, modeling,
visualization, communication.
Statistical Inference
• The world is complex, random, and uncertain. (Page 18)
• As we commute to work on subways and in cars,
• shop, email, browse the Internet and watch the stock market,
• build things, eat things,
• talk to our friends and family about things,
• all of these processes potentially produce data.
– Data are small traces of real-world processes.
– Which traces we gather is decided by our data
collection or sampling method.
• Note: two forms of randomness exist: (Page 19)
– Underlying the process (system property)
– Collection methods (human errors)
• We need a solid method to extract meaning and
information from random, dubious data. (Page 19)
– This is Statistical Inference!
• This overall process of going from the world to the data,
and then from the data back to the world, is the field of
statistical inference.
• More precisely, statistical inference is the discipline that
concerns itself with the development of procedures,
methods, and theorems that allow us to extract meaning
and information from data that has been generated by
stochastic (random) processes.
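As a minimal sketch of this world-to-data-to-world loop (hypothetical numbers, not from the book): simulate a stochastic process that generates noisy observations, then use the data alone to estimate the parameter that produced it.

```python
import random

# Hypothetical sketch: a stochastic process generates noisy
# observations, and inference recovers the underlying parameter.
random.seed(42)

true_mean = 5.0  # the unknown quantity we want to infer

# The random process (world -> data): each observation is truth plus noise
sample = [true_mean + random.gauss(0, 1) for _ in range(10_000)]

# Inference (data -> world): estimate the parameter from the data alone
estimate = sum(sample) / len(sample)
```

With 10,000 observations the estimate lands very close to the true mean, even though every individual observation is contaminated by randomness.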
Populations and Samples
• Population: not just the population of India or of
the world.
• It could be any set of objects or units, such as
tweets, photographs, or stars.
• If we could measure the characteristics of all
those objects, we would have the complete set of N observations.
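A small sketch of the population/sample distinction, using made-up integer units as stand-ins for tweets or photographs:

```python
import random

random.seed(0)

# Hypothetical population: every unit we could observe (stand-ins for tweets)
population = list(range(100_000))  # N = 100,000 units
N = len(population)

# A sample: the subset our collection method actually gathers
sample = random.sample(population, 1_000)  # n = 1,000 units
n = len(sample)
```

`random.sample` draws without replacement, so the sample contains n distinct units taken from the N units of the population.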
Big Data Domain - Sampling
• Scientific Validity Issues with “Big Data”
populations and samples. (Page 21 –
Engineering problems + Bias)
– Incompleteness Assumptions (Page 22)
• Unless shown otherwise, analyses should assume that a
sample does not fully represent the population, so
conclusions drawn from it are tentative, not scientifically settled.
• i.e., it’s a guess at best. Assertions framed this way
stand up better against academic/scientific scrutiny.
Big Data Domain - Assumptions
• Other Bad or Wrong Assumptions
– N = 1 vs. N = ALL (multiple layers) (Pages 25–26)
• Big Data introduces a 2nd degree to the data context.
• There are infinite levels of depth and breadth in the data.
• Individuals become populations. Populations become populations of
populations, to the nth degree. (meta-data)
– My Example:
• 1 billion Facebook posts (one from each user) vs. 1 billion Facebook
posts from one unique user.
• 1 billion tweets vs. 1 billion images from one unique user.
• Danger: Drawing conclusions from incomplete
populations. Understand the boundaries/context.
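The Facebook example above can be sketched directly (hypothetical post logs, invented for illustration): two datasets with the same row count can describe a thousand individuals or a single one, and the two "populations" support very different conclusions.

```python
from collections import Counter

# Hypothetical post logs as (user_id, text) pairs - same size, different context
one_per_user = [(uid, "post") for uid in range(1_000)]  # 1,000 users, 1 post each
one_user_only = [(42, "post") for _ in range(1_000)]    # 1 user, 1,000 posts

def users_covered(posts):
    """How many distinct individuals does this set of posts actually represent?"""
    return len(Counter(uid for uid, _ in posts))
```

Checking `users_covered` on each log makes the boundary explicit: one dataset spans 1,000 individuals, the other is an N = 1 population dressed up as a large one.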
Modeling
• What’s a model? (bottom page 27 – middle 28)
– An attempt to understand the population of interest
and represent that in a compact form which can be
used to experiment/analyze/study and determine
cause-and-effect and similar relationships amongst
the variables under study IN THE POPULATION.
• Data model
• Statistical model – fitting?
• Mathematical model
Probability Distributions (Page 31)
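Since this slide only names the topic, here is a hedged sketch of what a probability distribution buys you: individual draws scatter randomly, but their aggregates obey the distribution's stated parameters.

```python
import random

random.seed(7)

# Draw many values from two common distributions
uniform_draws = [random.uniform(0, 1) for _ in range(50_000)]
normal_draws = [random.gauss(0, 1) for _ in range(50_000)]

# Sample means should sit near the distributions' theoretical means
uniform_mean = sum(uniform_draws) / len(uniform_draws)  # theory: 0.5
normal_mean = sum(normal_draws) / len(normal_draws)     # theory: 0.0
```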
Fitting a model
• Estimate the parameters of the model using
the observed data.
Overfitting:
• The model isn’t that good at capturing reality
beyond your sampled data.
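A minimal illustration of both bullets (hypothetical data; the polynomial degrees are chosen for contrast): fit a model by estimating its parameters from observed data, then watch a too-flexible model fail beyond the sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed data: a linear process plus noise
x = np.linspace(0, 1, 20)
y = 2.0 * x + 1.0 + rng.normal(0, 0.2, size=x.size)

# Fitting: estimate model parameters from the observed data
fit_line = np.polyfit(x, y, deg=1)   # matches the true process
fit_over = np.polyfit(x, y, deg=15)  # too flexible: chases the noise

# Evaluate beyond the sampled data (slightly past the observed range)
x_new = np.linspace(0, 1.1, 50)
truth = 2.0 * x_new + 1.0
err_line = np.abs(np.polyval(fit_line, x_new) - truth).mean()
err_over = np.abs(np.polyval(fit_over, x_new) - truth).mean()
```

The degree-15 fit hugs the noisy training points but diverges from reality off the sample, which is exactly what "not good at capturing reality beyond your sampled data" means.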
Doing Data Science
Chapter 2, Pages 34 - 50
Exploratory Data Analysis (EDA)
• “It is an attitude, a state of flexibility, a willingness to look
for those things that we believe are not there, as well as
those we believe to be there.”
-John Tukey
• Traditionally presented as a bunch of histograms and
stem-and-leaf plots.
Features
• EDA is a critical part of the data science process.
• Represents a philosophy or way of doing
statistics.
• There are no hypotheses and there is no model.
• The “exploratory” aspect means that your
understanding of the problem you are
solving, or might solve, is changing as you go.
Basic Tools of EDA
• Plots, graphs and summary statistics.
• Method of systematically going through the
data, plotting distributions of all variables.
• EDA is a set of tools, but it’s also a mindset.
• The mindset is about your relationship with the data.
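A toy version of the "plots and summary statistics" first pass, on made-up numbers: compute summaries for one variable, then print a crude text histogram of its distribution.

```python
import statistics

# Hypothetical variable pulled from a dataset, e.g. session lengths in minutes
values = [3, 7, 7, 2, 9, 4, 7, 5, 10, 6]

# Summary statistics: the first systematic look at a variable
summary = {
    "n": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": round(statistics.stdev(values), 2),
    "min": min(values),
    "max": max(values),
}

# A crude text histogram: plotting the variable's distribution
for lo in range(0, 12, 4):
    count = sum(lo <= v < lo + 4 for v in values)
    print(f"{lo:2d}-{lo + 3:2d} | {'#' * count}")
```

In practice the same pass is repeated for every variable in the dataset, which is what "systematically going through the data" means on the previous slide.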
Philosophy of EDA
• There are many reasons anyone working with data
should do EDA.
• EDA helps with debugging the logging process.
• EDA helps assure the product is performing as
intended.
• EDA is done toward the beginning of the
analysis.
Data Science Process
A Data Scientist’s Role in This Process
