Data Science
Unit 1
By : Professor Lili Saghafi
https://professorlilisaghafiquantumcomputing.wordpress.com
proflilisaghafi@gmail.com
https://sites.google.com/site/professorlilisaghafi/home
@Lili_PLS
Introduction
• The course is designed as an introduction
to programming and statistics for students
from many different majors.
• It teaches practical techniques that apply
across many disciplines and also serves
as the technical foundation for more
advanced courses in data
science, statistics, and computer science.
Programming Prerequisite
• No prior programming experience is necessary,
but many of the programming techniques
covered in this course do not appear in a typical
introduction to programming.
• The programming content of this course focuses
on manipulating data tables, rather than
building software applications.
• Students who take the course after taking other
programming courses often learn a new
approach to programming that they haven't
encountered before.
Statistics Prerequisite
• No prior statistics experience is necessary, but
many of the statistical inference
techniques covered in this course do not appear
in an undergraduate statistics curriculum.
• The techniques in this course rely heavily on
sampling and simulation, and they require
computers to carry out.
• Students who have taken statistics courses
before often learn new methods to complement
what they already know.
Understanding problem domains
Prerequisite
• Data science is more than just a combination
of programming and statistics.
• Effective data science requires understanding
problem domains and correctly interpreting
domain-specific approaches.
• The examples in this course are largely drawn
from real-world data sets, and one of the main
goals of this course is to develop the ability to
apply analysis and prediction techniques to real-
world scenarios.
NO Prerequisite
• This course is designed specifically
for those who have not previously
taken statistics or computer science
courses.
Equipment and Supplies
• A computer
• R and RStudio (R is available from https://cran.r-project.org/)
• Math Player
• NVDA reader
• SAS or Python
• MS Azure
• A browser that supports Jupyter (Project Jupyter exists to develop
open-source software, open standards, and services for interactive
computing across dozens of programming languages).
https://jupyter.org/
• Jupyter notebooks to complete lab assignments.
• We highly recommend using Google Chrome to complete Jupyter
notebook lab assignments. https://jupyter.org/
Using Jupyter Notebooks on
Microsoft Azure
• https://notebooks.azure.com
• An overview of using Jupyter Notebooks
with Python 3.
• For further information on Jupyter
Notebooks, see the documentation
at https://jupyter.org.
R
Jupyter notebook
• This is a Jupyter notebook, and it's not running on your computer. Instead,
Google has generously donated cloud compute credits so that we
can run your code on Google's machines in order to execute
whatever examples you want, including all of the labs for the
course. So, thanks, Google!
• You'll learn about how to use this Jupyter environment in the labs.
https://jupyter.org/try OR https://jupyter.org/install.html
• For now, all you need to know is that you can run whatever
examples you want by clicking on a cell, holding down shift, and
pressing return or enter. So in this case, we told the computer to add
two and two together and that made four.
• Now the examples are going to get a lot more interesting soon, and
you'll learn how to use this environment, which is one of the most
popular environments for data science work out there in the world
today.
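As a minimal illustration of the cell described above, assuming a Python 3 notebook (the variable name `result` is only for illustration), the following can be typed into a cell and run with Shift+Enter:

```python
# A single notebook cell: ask the computer to add two and two.
result = 2 + 2

# Show the value that was computed.
print(result)
```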
Jupyter notebook
• Thanks to Google's support, all of the
software relevant to the course is already
pre-installed on their systems that you
have access to, so you can start working
on examples right away without having to install
anything. https://jupyter.org/try
Azure Machine Learning Studio
Azure AI Gallery.
Recommended Text
• Analytics, Data Science, &
Artificial Intelligence: Systems
for Decision Support, 11/E
– Ramesh Sharda
– Dursun Delen, Oklahoma
State University
– Efraim Turban, Oklahoma
State University
– ISBN-10: 0135192013 • ISBN-13: 9780135192016
– ©2020 • Pearson • Cloth, 832 pp
– Published 02/11/2019
– Reading this text is not
required but it is helpful for
mastering the course material.
Why Data science?
• It's about taking large data sets and trying
to make them useful or
informative, especially for understanding
the world or making informed decisions.
• We need to use ideas from computing,
ideas from statistics and also domain
knowledge that informs what the data
really represents.
Domain knowledge
• You can't do an analysis in the legal
domain without understanding something
about the law, so that's what we mean by
domain knowledge.
• It's that you really have to understand
when you have a data set, some big table
of numbers and descriptions, what's really
going on behind those numbers and what
they represent about the world.
So what is data science?
• What do you get when you combine
computing and statistics and domain
knowledge together?
• You get a science that's about drawing
useful conclusions from data using
computation as our primary tool.
Data science,
as a practice,
has three core
activities.
1-Exploration
• Exploration is figuring out what patterns exist in
the data.
• When you have many observations about some
phenomenon, what can you conclude about the
phenomenon itself?
• Instead of just looking at large tables of
numbers, we'll draw data visualizations because
it's much easier to interpret a lot of information at
once if it's portrayed in some kind of visual way.
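A rough sketch of the exploration idea, using only the Python standard library. The list of majors is made-up data and `text_bar_chart` is a hypothetical helper. The course's own tables and plotting tools are not shown in these slides. Counting the values and drawing a simple text bar chart makes the pattern easier to see than the raw list:

```python
from collections import Counter

# Hypothetical observations: the major of each student in a small class.
majors = ["Biology", "Economics", "Biology", "History",
          "Economics", "Biology", "Physics", "Biology"]

# Count how many times each major appears.
counts = Counter(majors)

def text_bar_chart(counts):
    """Return one line per category, with a bar of '#' marks for its count."""
    width = max(len(name) for name in counts)
    lines = []
    for name, n in counts.most_common():
        lines.append(f"{name.ljust(width)} | {'#' * n} {n}")
    return lines

# Print the chart, most common category first.
for line in text_bar_chart(counts):
    print(line)
```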
2-Statistical Inference
• Once we've found a pattern, we need to perform
statistical inference, and that's because some
patterns are there just by chance and some are
there because they're a reflection of some
underlying process that's really interesting about
the world.
• The goal of statistical inference is to quantify
whether the patterns that we observe during
the exploration phase are reliable.
• If we collected more data, would we see this
pattern again or not?
Randomization
• The primary tool we have is
randomization because by simulating
random processes, we can see what kinds
of patterns appear just by chance.
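A minimal sketch of this idea, assuming a made-up scenario in which we observed 60 heads in 100 coin flips. The observed value, the number of trials, and the function names are all illustrative assumptions. Simulating many rounds of fair flips shows how often a result at least that extreme appears just by chance:

```python
import random

def count_heads(num_flips, rng):
    """Flip a fair coin num_flips times and return the number of heads."""
    return sum(rng.random() < 0.5 for _ in range(num_flips))

def simulate(num_flips, num_trials, seed=0):
    """Repeat the coin-flipping experiment num_trials times."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [count_heads(num_flips, rng) for _ in range(num_trials)]

# Hypothetical observation: 60 heads out of 100 flips.
observed = 60

# What does pure chance look like? Simulate 10,000 rounds of 100 fair flips.
results = simulate(100, 10_000)

# Share of simulated rounds that were at least as extreme as what we observed.
share = sum(r >= observed for r in results) / len(results)
print("Share of chance results with at least", observed, "heads:", share)
```

If that share is small, the observed pattern is unlikely to be the work of chance alone.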
3-Prediction
• And if the pattern we observe is not the kind of
thing that could just appear by chance, then we
can conclude that it's because of some robust or
reliable pattern in the underlying phenomenon
we want to study.
• We'll perform prediction.
• This is where we have partial information about
something we want to know, and we want to
guess about the things we don't know yet.
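One simple way to make such a guess is to fit a straight line to known examples and read the prediction off the line. The sketch below uses least squares on made-up data (hours studied and exam scores). The data, the function `fit_line`, and the choice of a straight line are illustrative assumptions, not necessarily the method used later in the course:

```python
# Hypothetical data: hours studied (known) and exam score (what we want to predict).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 58.0, 65.0, 70.0, 75.0]

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(hours, scores)

# Guess the score of a student who studied 6 hours (not in our data).
predicted = slope * 6.0 + intercept
print("Predicted score for 6 hours:", round(predicted, 1))
```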
Machine Learning
• We are making informed guesses, quantitative
guesses using a discipline called machine
learning.
• Normally when we write programs, we just focus
on the particular logic of what the computer
should do, but machine learning is about not
programming every detail, but instead using
the data to make decisions or choices within
that program.
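As a small sketch of letting data drive the decision, the program below contains no hand-written rule for telling the two labels apart. It simply copies the label of the closest known example, a technique usually called nearest-neighbor classification. The measurements, labels, and the function `classify` are made up for illustration:

```python
import math

# Hypothetical labeled examples: (height in cm, weight in kg) and a label.
examples = [
    ((20.0, 4.0), "cat"),
    ((25.0, 5.0), "cat"),
    ((60.0, 30.0), "dog"),
    ((55.0, 25.0), "dog"),
]

def classify(point, examples):
    """Return the label of the labeled example closest to `point`."""
    nearest = min(examples, key=lambda ex: math.dist(point, ex[0]))
    return nearest[1]

# The decision comes from the examples, not from rules we wrote by hand.
print(classify((22.0, 4.5), examples))
print(classify((58.0, 28.0), examples))
```

Changing the examples changes the program's behavior without changing any of its logic.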
A form of prediction
• So when we write a program, for instance, to
recognize speech or automatically translate
languages or control a car or a robot, we don't
actually write down all the details of what to do,
but instead use examples from the world to help
computers automatically learn how to behave.
• And that's a form of prediction, one that we'll
talk about in this course.
Three stages in this course
• And these three stages correspond to
how we'll approach the material in this
course.
1. First, we'll talk about how to identify patterns,
2. then we'll talk about quantifying whether
those patterns are reliable.
3. And finally, based on the patterns we've
discovered, the reliable ones can help us
make informed guesses about the
information that we wish we knew.
On the way to becoming a data
scientist
• Once you can do all that, you're well on
your way to being a data scientist.
• Now in the process of doing all these
things, it's important that you learn how to
program a computer, because computing
underlies each step of the way and
learning to program is just an essential
part of participating in this discipline.
Data Science
Thanks
Professor Lili Saghafi
proflilisaghafi@gmail.com
https://sites.google.com/site/professorlilisaghafi/home
@Lili_PLS
