Data science programming .ppt
1. Welcome to IST 380 !
Data Science Programming
When the course was over, I knew it was a good thing. - New York Times Review of Courses
We don't have strong enough words to describe this class. - US News and Course Report
We give this course two thumbs! - Ebert and Roeper
an advocate of concrete computing – and HMC's mascot
3. About myself
Who: Zach Dodds
Where: Harvey Mudd College
What: Research includes robotics and computer vision
When: Mondays 7-10pm here in ACB 119
Contact Information: dodds@cs.hmc.edu, 909-607-0867
Office Hours: Friday mornings, 9-11 am, or set up a time...
HMC Beckman B111
6. IST 380 ~ the big picture
What is it?
Data Science Venn Diagram
Hmmm… where am I on this diagram?
7. Data?!
• Neighbor's name
• A place they consider home
• Are they working at a company now? Where?
• How many U.S. states have they visited?
• Their favorite unhealthy food… ?
• Do they have any "Data Science" background? (statistics, machine learning, CS)
9. Data!
• Neighbor's name: Zachary Dodds
• A place they consider home: Pittsburgh, PA
• Are they working at a company now? Where? Harvey Mudd
• How many U.S. states have they visited? 44
• Their favorite unhealthy food… ? M&Ms
• Do they have any "Data Science" background? (statistics, machine learning, CS) mostly CS for me…
10. Data!
be sure to set up your login + profile for the submission site…
17. Make data easier to use ~ by using it!
It may be true that Data Science isn't a science – but that doesn't mean it's not useful!
18. IST 380 ~ the big picture
What? Data Science Programming
Why? Data Rules
All of our insights – large and small, permanent and
ephemeral, natural and artificial – come about
through the integration of lots of data.
Data Science simply recognizes that the rules and
skills behind those insights are widely applicable…
19. A few examples…
Make3d
Andrew Ng ~ Computers and Thought award, 2009
How is this being done? And how do we succeed?
… Data Science is at the heart of computer science
20. A few examples…
… Data Science is at the heart of computer science
Stanford's Autonomous Vehicles project (Thrun et al.)
Learning to Powerslide
21. A few examples…
… Data Science is at the heart of computer science
"my summer was finding that red line"
Learning ground from obstacles
28. Bob Bell, winner of the "Netflix prize"
Netflix Prize
(I don't know this guy)
Napoleon Dynamite = 1.22
Batman Begins = .75
Finding Nemo = ??
Lord of the Rings = ??
Some films are difficult to predict…
29. Bob Bell, winner of the "Netflix prize"
Netflix Prize
(I don't know this guy)
Napoleon Dynamite = 1.22
Batman Begins = .75
Finding Nemo = .67
Lord of the Rings = .42
Some films are difficult to predict… and others are easier!
30. Why IST 380 ?
Specific skills:
R statistical environment (and the S programming language)
Experience with several statistical analyses (descriptive statistics)
Experience with predictive statistics (modeling) and machine learning algorithms
31. Why IST 380 ?
Specific skills:
R statistical environment (and the S programming language)
Experience with several statistical analyses (descriptive statistics)
Experience with predictive statistics (modeling) and machine learning algorithms
Final project ~ open-ended with datasets of your choice
Broad background:
You'll be confident and capable with whatever datasets you encounter in the future – on your own or as part of a team.
33. Details
Web Page: http://www.cs.hmc.edu/~dodds/IST380
Assignments, online text, necessary files, lecture slides are linked
First week's assignment: Getting started with R
Programming: R (www.r-project.org/)
Textbook: An Introduction to Data Science (jsresearch.net/groups/teachdatascience/)
Grab both of these now… freely available online
and many online resources…
35. Homework
Assignments
~ 2-5 problems/week, ~ 100 points; extra credit, often
Due Tuesday of the following week by 11:59 pm.
Assignment 1 due Tuesday, February 5.
1 week + 1 day…
36. Homework
Working on programs:
On your own or in groups of 2.
Divide the work at the keyboard evenly!
Submitting programs: at the submission website
Today's Lab:
install software, ensure accounts are working
try out R - the first HW is officially due on 2/5
37. Outline
Weeks 1-5
using R
descriptive statistics
predictive statistics
probability distributions
Weeks 6-10
"Data Science"
"Machine Learning"
statistical modeling
support vector machines (SVMs)
random forests
k-means algorithm
nearest neighbors (NN)
Weeks 11-15
approximate!
Final Project
No breaks?!
38. Grading
Grades
Based on points percentage
~ 800 points for assignments
~ 400 points for the final project
if score >= 0.95: grade = "A"
elif score >= 0.90: grade = "A-"
elif score >= 0.86: grade = "B+"
see the course syllabus for the full list...
Final project
• the last ~4 weeks will work towards a larger, final project
• there will be a short design phase and a short final presentation
• choose your own problem to study (I'll have some suggestions, too.)
• I'd encourage you to connect R and our Data Science techniques to other datasets or projects that you use/need/like, etc.
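The grade cutoffs listed on this slide could be sketched in R roughly as follows. This is only an illustration: the function name letter_grade is made up here, and only the three cutoffs shown on the slide are included, since the full list lives in the syllabus.

```r
# Hypothetical helper (not from the course materials): maps a points
# fraction between 0 and 1 to the letter grades shown on the slide.
letter_grade <- function(score) {
  if (score >= 0.95) {
    "A"
  } else if (score >= 0.90) {
    "A-"
  } else if (score >= 0.86) {
    "B+"
  } else {
    NA_character_  # lower cutoffs are in the syllabus, not reproduced here
  }
}

# Example: 1100 of the roughly 1200 available points
letter_grade(1100 / 1200)
```

The checks run from the highest cutoff down, so each score stops at the first grade it qualifies for.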
39. Academic Honesty
This course operates under CGU's (and all of Claremont Schools')
Academic Honesty policies…
• Your work must be your own. This must be true for the whole team, if you're working in a pair.
• Consulting with others (except team members or myself) is encouraged, but has to be limited to discussion and debugging of problems. Sharing of written, electronic, or verbal solutions/files/code is a violation of CGU's academic honesty policy.
• A reasonable guideline: Work is your own if you could delete all of it and recreate it yourself.
42. Getting to know… R
http://lang-index.sourceforge.net/#categ
R is the programmer's toolkit for statistics; SAS, Stata, SPSS are preferred by those in business intelligence
44. Getting to know… R
R is responsive, up-to-date, and flexible: Data Science vs. Statistics
45. Getting to know… R
1) Find the IST 380 course webpage
www.cs.hmc.edu/~dodds/IST380/
2) Download and install R
3) Run R and try some basic commands at the prompt:
6 * 7
rnorm(10)
x <- 380
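As a rough guide to what those three commands do, here is the same session with comments. The variable name draws is an assumption added for illustration; the slide calls rnorm(10) directly.

```r
6 * 7               # arithmetic at the prompt; R prints the result
draws <- rnorm(10)  # ten random draws from a standard normal distribution
length(draws)       # confirms how many values were drawn
x <- 380            # assignment with <- stores a value and prints nothing
x                   # typing the name prints the stored value
```

Because rnorm is random, the ten values will differ from run to run.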
46. Getting started!
1) Open Matloff's Why R? notes
2) Skip ahead to page 7, the "5 minute example session"
3) Try out the commands in section 2.2 to get started…
4) When you finish, save your session and submit it!
This is problem 1 this week
47. Saving your session
This is problem 1 this week
1) Create a folder named hw1, perhaps on your desktop
2) Use the Save to file… (Windows) or Save as… (Mac) in order to save your current console session into hw1
3) Name that file pr1.txt
4) From your operating system, open up that file in order to confirm it contains your whole session!
48. Submitting your work
1) Zip up hw1 into hw1.zip
2) From the course webpage, click on the submission site link.
3) Choose a submission site login name & let me know!
4) Once your account is made, log in, change your password to something you know, and submit hw1.zip
5) You can submit again – all copies are saved…
You've completed Problem 1!
This webserver can be spacey -- I should know!
troubles? email me!
55. NA
R uses NA to represent data that is "not available"
What is going on here?
The function is.na( ) tests for NA
56. NA
R uses NA to represent data that is "not available"
What is going on here?
The function is.na( ) tests for NA
This uses subsetting to remove NA values!
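The console example from this slide did not survive in the text, so here is a minimal sketch of the same ideas. The vector states_visited and its values are made up for illustration.

```r
# Hypothetical data: two of the five answers are missing.
states_visited <- c(44, NA, 12, 30, NA)

is.na(states_visited)   # logical vector, TRUE wherever a value is NA
mean(states_visited)    # NA, because missing values propagate through mean()

# Subsetting with a logical index keeps only the entries that are not NA.
kept <- states_visited[!is.na(states_visited)]
mean(kept)              # now computed over the three available values
```

Many R functions also accept na.rm = TRUE (for example, mean(states_visited, na.rm = TRUE)), which skips the NA values without an explicit subsetting step.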
60. Lab…
The 2nd part of each class meeting is dedicated to lab work.
I welcome you to stay for the lab, but it is not required.
Today's lab:
Work through Santorico and Shin's Tutorial for the R
Statistical Package and submit the console sessions as
pr2_1.txt, pr2_1.txt, pr2_1.txt, pr2_1.txt, and pr2_1.txt.
This is a nice reinforcement of vectors, introduction to
data frames, and a look at the graphics that R supports.
61. Homework
Problem 3: Challenge exercises in R
These will reinforce the "subsetting" and data-analysis introduction from pr2's tutorial.
Problem 4: Introduction to Data Science, early chapters
This is a fuller background on R and the field
of data science
(submit your console session for both of these…)
63. CS vs. IS and IT ?
www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf
greater integration
system-wide issues
smaller details
machine specifics
69. The bigger picture
Weeks 10-12: Objects
Week 10: classes vs. objects
Week 11: methods and data
Week 12: inheritance
Weeks 13-15: Final Projects
Week 13: final projects
Week 14: final projects
Week 15: final exam
73. Data!
be sure to set up your login + profile for the submission site…
This class is truly seminar-style: we're developing expertise in this field together.