Course Overview
An overview of data science, CS 577, and the data science lifecycle.
Josh Hug and Lisa Yan
• Intros
• What is data science?
• What will you learn in this class?
• Course overview
• Lots of important details
• Data Science Lifecycle
• Demo
What is Data Science?
Why I Care About Data Science
Why Data Science?
● The world is complicated, and data is a tool for finding truth in this complicated
world!
● We have many questions across different domains that need answers.
Data science uses a combination of methods and principles from statistics and
computer science to work with and draw insights from data.
Data-Centric Problems
● Assess whether a vaccine works
● Filter out fake news automatically
● Calibrate air quality sensors
● Advise analysts on policy changes
Primary Goal of This Course
Be able to take data and produce useful insights on the world’s most challenging and
ambiguous problems.
What is Data Science?
PRINCIPLES AND TECHNIQUES OF DATA SCIENCE
Data is changing the world
From Joey Gonzalez.
Data science is a fundamentally interdisciplinary field
Joey Gonzalez
Data Science is the application of data-centric, computational, and inferential
thinking to:
● Understand the world (science).
● Solve problems (engineering).
Data Science Venn Diagram
by Drew Conway, 2010
10
Insight
Good data analysis is not:
● Simple application of a statistics recipe.
● Simple application of statistical software.
There are many tools out there for data science, but they are merely tools.
● They don’t do any of the important thinking!
“The purpose of computing is insight, not numbers.” - R. Hamming. Numerical Methods
for Scientists and Engineers (1962).
11
Example Questions in Data Science
Some (broad) questions we might try to answer with data science:
● What show should we recommend to our user to watch?
● In which markets should we focus our advertising campaign?
● Which areas of the world are at higher risk of climate change impact in 10 years?
In 20?
● What should we eat to avoid dying early of heart disease?
● Do immigrants from poor countries have a positive or negative impact on the
economy?
● Is the world getting better or worse?
12
• Intros
• What is data science?
• What will you learn in this class?
• Course overview
• Lots of important details
• Data Science Lifecycle
• Demo
What will you learn in this class?
Tentative List of Topics to be Covered in CS-577
● Pandas and NumPy
● Relational Databases & SQL
● Exploratory Data Analysis
● Regular Expressions
● Visualization
○ matplotlib
○ Seaborn
○ plotly
● Sampling
● Probability and random variables
● Model design and loss formulation
● Linear Regression
● Feature Engineering
● Regularization, Bias-Variance Tradeoff, Cross-Validation
● Gradient Descent
● Logistic Regression
● Decision Trees and Random Forests
Course Websites / Platforms
Online platforms
Course website on Canvas
● Where all lectures, assignments, and discussions are posted.
Textbook (www.textbook.ds100.org)
● Supplemental reading.
Programming Environment
Jupyter Notebook
“Jupyter notebooks are documents that combine live runnable code
with narrative text (Markdown), equations (LaTeX), images,
interactive visualizations and other rich output”
Installing Jupyter
https://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html
JupyterLab
JupyterLab offers notebooks and more tools for data science.
● Use JupyterLab locally on your own machine.
● Use Google Colab
Learning Advanced JupyterLab
Resources for learning fancier JupyterLab functionality:
● The quickest intro is this great 2-minute overview by Serena Bonaretti.
○ Note: Unlike Serena’s example, in our course we’re using JupyterLab notebooks hosted
on the internet, not on your own local computer.
● The interface overview from the official docs has more details and short, embedded videos.
● A more detailed discussion from a bio/data angle: ~45 minute video.
● Full ~3h in-depth tutorial is available from the core team.
Google Colab
What is Colab?
Colab, or "Colaboratory", allows you to write and execute Python in your browser, with
● Zero configuration required
● Access to GPUs free of charge
● Easy sharing
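Since notebooks in this course may run either locally in JupyterLab or in Colab, a first cell along the following lines can report which environment is in use. This is only a sketch: the function names are hypothetical, and checking for the `google.colab` module is a common heuristic that assumes Colab has loaded that package at kernel start, not an official detection API.

```python
import platform
import sys


def running_in_colab() -> bool:
    """Heuristic: treat the session as Colab if google.colab is already loaded."""
    # Assumption: Colab kernels import the google.colab package at startup.
    return "google.colab" in sys.modules


def describe_environment() -> str:
    """Build a one-line summary of the Python version and where it runs."""
    where = "Google Colab" if running_in_colab() else "a local or other hosted session"
    return f"Python {platform.python_version()} running in {where}"


print(describe_environment())
```

Running this as the first cell of a notebook is a quick way to confirm that the kernel starts and which Python version it provides.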
Course Logistics
Content and workflow
Weekly Flow
Class Days: Tuesday and Thursday (TTR)
Class Times Section 1: 12:30 pm -- 1:45 pm
Class Times Section 2: 2:00 pm -- 3:15 pm
Class Location: LH 347
Discussion Section
There is a discussion board in Canvas. Two types of topics:
● Topics covered in lecture
● Topics covered in assignments
Homework
● 4 assignments in Jupyter Notebook that must be individually submitted
● Midterm exam: Oct. 19
● Final exam: Dec. 12
● A group term project: by Dec. 10
Format:
● Current plan: Primarily in-person exams with the option for virtual exams. Details
TBD.
● Alternate exam times will be provided for all exams for pre-approved reasons,
such as a concurrent final exam.
● If you miss an exam due to a personal emergency or illness, please contact me.
Grading Logistics
Grades will be posted on Canvas
Deadlines are firm at 11:59 PM. Extensions are granted only to students with DSP
accommodations or, in exceptional circumstances, to students who email me
before the deadline.
● You can submit assignments up to 2 days late, at 10% off per day.
○ Rounded up to the next day: 2 minutes late = 1 day late.
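The late policy above can be sketched as a small calculation. This is an illustration under stated assumptions, not the official grading script: it assumes the 10% per day is taken off the raw score, and that a submission more than 2 days late earns no credit (the slide does not say what happens after the late window). All names are hypothetical.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60
MAX_LATE_DAYS = 2        # from the slide: up to 2 days late
PENALTY_PER_DAY = 0.10   # from the slide: 10% off per day


def late_days(seconds_late: float) -> int:
    """Late days charged; any partial day is rounded up to a full day."""
    if seconds_late <= 0:
        return 0
    return math.ceil(seconds_late / SECONDS_PER_DAY)


def adjusted_score(raw_score: float, seconds_late: float) -> float:
    """Score after the late penalty is applied."""
    days = late_days(seconds_late)
    if days > MAX_LATE_DAYS:
        # Assumption: work submitted after the late window receives no credit.
        return 0.0
    # Assumption: the penalty is a fraction of the raw score.
    return raw_score * (1 - PENALTY_PER_DAY * days)
```

For example, `late_days(120)` treats a submission that is 2 minutes late as 1 full day late, matching the rounding rule on the slide.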
Mid-term exam 25%
Final exam 25%
Assignments 30%
Discussions 5%
Semester project 15%
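
The grade breakdown above amounts to a weighted average. The sketch below shows that arithmetic, assuming each component is scored on a 0 to 100 scale; the dictionary keys and function name are hypothetical, and any rounding or letter-grade cutoffs are not specified in the slides.

```python
from typing import Mapping

# Weights from the grading slide; they sum to 100%.
WEIGHTS = {
    "midterm": 0.25,
    "final": 0.25,
    "assignments": 0.30,
    "discussions": 0.05,
    "project": 0.15,
}


def course_grade(scores: Mapping[str, float]) -> float:
    """Weighted average of component scores, each assumed to be on a 0-100 scale."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

Calling `course_grade` with one score per component returns the overall percentage; a missing component raises an error instead of silently counting as zero.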

1_Course Overview, Data Science Lifecycle.pptx


Editor's Notes

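  • #10 (Data Science Venn Diagram): "Hacking skills" means programming skills.
  • #11 (Insight): Black-box use of tools is shallow: "here is the tool, here is the data, and here are the results."
  • #12 (Example Questions in Data Science): Question 1 applies to a streaming service and production company.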