Welcome to IST 380!
Data Science Programming
"When the course was over, I knew it was a good thing." - New York Times Review of Courses
"We don't have strong enough words to describe this class." - US News and Course Report
"We give this course two thumbs!" - Ebert and Roeper
an advocate of concrete computing – and HMC's mascot
About myself
Who: Zach Dodds
Where: Harvey Mudd College, HMC Beckman B111
What: Research includes robotics and computer vision
When: Mondays 7-10pm here in ACB 119
Contact Information: dodds@cs.hmc.edu, 909-607-0867
Office Hours: Friday mornings, 9-11 am, or set up a time...
TMI?
fan of low-tech games
fan of low-level AI
IST 380 ~ the big picture
What is it? Why me?
IST 380 ~ the big picture
Data Science
Venn Diagram
Hmmm… where am I
on this diagram?
What is it?
Data?!
• Neighbor's name
• A place they consider home
• Are they working at a company now?
• How many U.S. states have they visited?
• Their favorite unhealthy food… ?
• Do they have any "Data Science" background?
(statistics, machine learning, CS)
Where?
state reminders…
Data!
• Neighbor's name: Zachary Dodds
• A place they consider home: Pittsburgh, PA
• Are they working at a company now? Harvey Mudd
• How many U.S. states have they visited? 44 (Where?)
• Their favorite unhealthy food… ? M&Ms
• Do they have any "Data Science" background? (statistics, machine learning, CS) mostly CS for me…
be sure to set up your login + profile for the submission site…
Data Science concerns
Is "Data Science"
important or just trendy?
Hmmm…
Data Science concerns
the companies are expanding as fast as the data!
There's certainly a lot of it!
Data, data everywhere… (logarithmic scale)
Units: 1 TB = 1000 GB; 1 Petabyte == 1000 TB; 1 Exabyte; 1 Zettabyte
Data produced each year:
• 2002: 5 EB
• 2006: 161 EB
• 2009: 800 EB
• 2011: 1.8 ZB
• 2015: 8.0 ZB
Human brain's capacity: 14 PB
100-years of HD video + audio: 60 PB
Also marked on the scale: 120 PB
References
(brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store
(2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
(2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
(2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf
(2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm
(2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
(life in video) 60 PB: in 4320p resolution, extrapolated from 16MB for 1:21 of 640x480 video (w/sound) – almost certainly a gross overestimate, as sleep can be compressed significantly!
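To see why the chart needs a logarithmic scale, here is a minimal R sketch that converts the yearly figures from the slide into bytes. It assumes decimal (SI) prefixes, as the slide's "1 TB = 1000 GB" suggests; the variable names are made up for illustration.

```r
# Decimal (SI) prefixes, following the slide's "1 TB = 1000 GB" convention.
unit_bytes <- c(GB = 1e9, TB = 1e12, PB = 1e15, EB = 1e18, ZB = 1e21)

# Yearly data-production figures as listed on the slide.
year   <- c(2002, 2006, 2009, 2011, 2015)
amount <- c(5, 161, 800, 1.8, 8.0)
unit   <- c("EB", "EB", "EB", "ZB", "ZB")

# Convert every figure to bytes and label it by year.
bytes <- amount * unit_bytes[unit]
names(bytes) <- year

log10(bytes)                    # where each year sits on a log10 axis
bytes["2015"] / bytes["2002"]   # growth factor between the first and last figures
```

On a linear axis the 2002 value would be invisible next to the 2015 one; on the log10 axis the five points are spread over about three units.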
data
information
knowledge
wisdom
I'd call it data,
not information
Big Data?
I agree with this…
Make data easier to use ~ by using it!
It may be true that
Data Science isn't a
science – but that
doesn't mean it's
not useful!
IST 380 ~ the big picture
What? Why?
Data Science
Programming Data Rules
All of our insights – large and small, permanent and
ephemeral, natural and artificial – come about
through the integration of lots of data.
Data Science simply recognizes that the rules and
skills behind those insights are widely applicable…
A few examples…
Make3d
How is this being done?
Andrew Ng ~
Computers and
Thought award,
2009
… Data Science is at the heart of computer science
and how do we succeed?
A few examples…
… Data Science is at the heart of computer science
Stanford's
Autonomous
Vehicles project
(Thrun et al.)
Learning to
Powerslide
A few examples…
… Data Science is at the heart of computer science
"my summer was
finding that red line"
Learning ground
from obstacles
A few examples…
Learning ground from obstacles
classification segmentation
Insights beyond science
Marketing
Visualization
Motivation
Recommender Systems
predicting
movie ratings
Netflix Prize
Bob Bell, winner of the "Netflix prize" (I don't know this guy)
Napoleon Dynamite = 1.22
Batman Begins = .75
Finding Nemo = .67
Lord of the Rings = .42
Some films are difficult to predict… and others are easier!
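The Netflix Prize scored predictions by root-mean-squared error (RMSE). Whether the per-film numbers on the slide are RMSE values is an assumption here. The sketch below shows how such an error is computed, using made-up ratings and a hypothetical helper named rmse:

```r
# Root-mean-squared error between actual and predicted ratings.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up 1-5 star ratings for one film, for illustration only.
actual    <- c(5, 1, 4, 2, 5)
predicted <- c(3.5, 3.0, 3.8, 3.1, 3.6)

rmse(actual, predicted)   # larger values mean the film is harder to predict
```

A film that divides its audience (many 1s and many 5s) tends to give a large error no matter what single rating is predicted.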
Why IST 380 ?
Specific skills:
• R statistical environment (and the S programming language)
• Experience with several statistical analyses (descriptive statistics)
• Experience with predictive statistics (modeling) and machine learning algorithms
Broad background:
You'll be confident and capable with whatever datasets you encounter in the future – on your own or as part of a team.
Final project ~ open-ended with datasets of your choice
About IST 380 …
Details
Web Page: http://www.cs.hmc.edu/~dodds/IST380/
Assignments, online text, necessary files, and lecture slides are linked from the course page.
First week's assignment: Getting started with R
Programming: R (www.r-project.org/)
Textbook: An Introduction to Data Science (jsresearch.net/groups/teachdatascience/)
Both are freely available online, and there are many other online resources…
Grab both of these now: go to the course page and get R and the text from these two links.
Homework
Assignments
~ 2-5 problems/week, ~ 100 points; extra credit, often
Due Tuesday of the following week by 11:59 pm.
Assignment 1 due Tuesday, February 5 (1 week + 1 day…).
Working on programs:
On your own or in groups of 2. Divide the work at the keyboard evenly!
Submitting programs: at the submission website
Today's Lab:
• install software
• ensure accounts are working
• try out R - the first HW is officially due on 2/5
Outline
Weeks 1-5
using R
descriptive statistics
predictive statistics
probability distributions
Weeks 6-10
"Data Science"
"Machine Learning"
statistical modeling
support vector machines (SVMs)
random forests
k-means algorithm
nearest neighbors (NN)
Weeks 11-15
approximate!
Final Project
No breaks?!
Grading
Grades
Based on points percentage
~ 800 points for assignments
~ 400 points for the final project
if score >= 0.95: grade = "A"
elif score >= 0.90: grade = "A-"
elif score >= 0.86: grade = "B+"
see the course syllabus for the full list...
Final project
• the last ~4 weeks will work towards a larger, final project
• there will be a short design phase and a short final presentation
• choose your own problem to study (I'll have some suggestions, too.)
• I'd encourage you to connect R and our Data Science techniques to other datasets or projects that you use/need/like, etc.
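The grade cutoffs on the Grades slide can be written as a small R function. This is only a sketch: the function name letter_grade and the example point total are hypothetical, and only the three cutoffs shown on the slide are included (the rest are in the syllabus).

```r
# Map a points percentage (0-1) to a letter grade.
# Only the three cutoffs from the slide are known; anything lower returns NA.
letter_grade <- function(score) {
  if (score >= 0.95) {
    "A"
  } else if (score >= 0.90) {
    "A-"
  } else if (score >= 0.86) {
    "B+"
  } else {
    NA_character_   # lower cutoffs are in the syllabus, not on the slide
  }
}

total_points  <- 800 + 400   # assignments + final project, per the slide
earned_points <- 1100        # hypothetical student total
letter_grade(earned_points / total_points)
```

The chained else-if matters: with three independent if statements, a score of 0.96 would pass all three tests and end up as a "B+".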
Academic Honesty
This course operates under CGU's (and all of the Claremont schools') Academic Honesty policies…
• Your work must be your own. This must be true for the whole team, if you're working in a pair.
• Consulting with others (other than team members or myself) is encouraged, but has to be limited to discussion and debugging of problems. Sharing of written, electronic, or verbal solutions/files/code is a violation of CGU's academic honesty policy.
• A reasonable guideline: Work is your own if you could delete all of it and recreate it yourself.
Thoughts?
Getting to know… R
Getting to know… R
http://lang-index.sourceforge.net/#categ
R is the programmer's toolkit for statistics; SAS, Stata,
SPSS are preferred by those in business intelligence
Getting to know… R
Free… and very well supported online…
Getting to know… R
R is responsive, up-to-date, and flexible: Data Science vs. Statistics
Getting to know… R
1) Find the IST 380 course webpage
www.cs.hmc.edu/~dodds/IST380/
2) Download and install R
3) Run R and try some basic commands at the prompt:
6 * 7
rnorm(10)
x <- 380
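Here is what the three commands in step 3 do, as an annotated sketch. The set.seed call and the name draws are additions for illustration, not part of the slide.

```r
6 * 7                # arithmetic at the prompt; R prints [1] 42

set.seed(380)        # assumption: fix the seed so the random draws repeat
draws <- rnorm(10)   # ten draws from a standard normal distribution
length(draws)        # 10

x <- 380             # "<-" assigns the value 380 to the name x
x                    # typing a name prints its value: [1] 380
```

Without set.seed, rnorm(10) gives different numbers every time it is run, which is expected.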
Getting started!
1) Open Matloff's Why R? notes
2) Skip ahead to page 7, the "5 minute example session"
3) Try out the commands in section 2.2 to get started…
4) When you finish, save your session and submit it!
This is problem 1 this week
Saving your session
1) Create a folder named hw1, perhaps on your desktop
2) Use the Save to file… (Windows) or Save as… (Mac) in order to save your current console session into hw1
3) Name that file pr1.txt
4) From your operating system, open up that file in order to confirm it contains your whole session!
This is problem 1 this week
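The check in step 4 can also be done from inside R. This is a minimal sketch, assuming R's working directory is the folder that contains hw1; the helper name check_session is made up.

```r
# Hypothetical helper: report how many lines a saved session file contains.
check_session <- function(path) {
  if (!file.exists(path)) {
    message("No file found at ", path)
    return(NA_integer_)
  }
  session_lines <- readLines(path, warn = FALSE)
  print(head(session_lines))   # peek at the start of the saved session
  length(session_lines)        # total number of lines saved
}

# Assumes the working directory is the folder that contains hw1.
check_session(file.path("hw1", "pr1.txt"))
```

If the count looks too small, the save probably captured only part of the console.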
Submitting your work
1) Zip up hw1 into hw1.zip
2) From the course webpage, click on the submission site link.
3) Choose a submission site login name & let me know!
4) Once your account is made, log in, change your password to something you know, and submit hw1.zip
5) You can submit again – all copies are saved…
You've completed Problem 1!
This webserver can be spacey -- I should know! Troubles? Email me!
Reflection
Average and standard deviation?
Assignment?
Comments?
Printing?
Comments?
Creating a vector?
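Each of the questions above maps onto a small R construct. Here is a short sketch with made-up values (the vector name scores is hypothetical):

```r
# Comments start with "#" and run to the end of the line.
scores <- c(1, 2, 4)   # creating a vector with c(); "<-" is assignment
print(scores)          # printing explicitly...
scores                 # ...or implicitly, by typing the name at the prompt
mean(scores)           # average
sd(scores)             # sample standard deviation (divides by n - 1)
```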
R types
You can use mode() to view the type of a variable.
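A few calls to mode() on different kinds of values, as a quick illustration (the values themselves are arbitrary):

```r
mode(380)          # "numeric"
mode("IST")        # "character"
mode(TRUE)         # "logical"
mode(c(1, 2, 3))   # "numeric": a vector reports the type of its elements
mode(mean)         # "function"
```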
Where's the big data?
Vectors are R's sequences of elements of a single type
c ~ concatenate
the colon : also creates vectors
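A small sketch of both ways to build a vector, with arbitrary values. The last line shows what "a single type" means in practice:

```r
v <- c(3, 8, 0)        # c() concatenates its arguments into one vector
w <- c(v, 42, v)       # vectors can be concatenated too: 3 8 0 42 3 8 0
r <- 1:10              # the colon builds the sequence 1, 2, ..., 10
length(w)              # 7
mode(c(1, "two", 3))   # "character": mixed inputs are coerced to one type
```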
Analyzing vectors – try these…
Square brackets [] can "subset" (or "slice") vectors
you can use a boolean vector to subset another vector
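Some subsetting forms to try, sketched on a made-up vector (the names v and keep are just for illustration):

```r
v <- c(10, 20, 30, 40, 50)

v[2]              # a single element: 20
v[2:4]            # a slice: 20 30 40
v[-1]             # a negative index drops that element: 20 30 40 50

keep <- v > 25    # a boolean (logical) vector: FALSE FALSE TRUE TRUE TRUE
v[keep]           # keeps the TRUE positions: 30 40 50
v[v %% 20 == 0]   # the test can go directly inside the brackets: 20 40
```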
NA
R uses NA to represent data that is "not available"
What is going on here?
The function is.na( ) tests for NA
This uses subsetting to remove NA values!
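The slide's screenshot is not in this transcript, so here is a stand-in sketch of the same idea with a made-up vector named heights:

```r
heights <- c(61, NA, 70, 65, NA)    # hypothetical data with two missing values

mean(heights)                       # NA: arithmetic involving NA yields NA
is.na(heights)                      # FALSE TRUE FALSE FALSE TRUE

clean <- heights[!is.na(heights)]   # subset with the negated test: 61 70 65
mean(clean)                         # average of the available values
mean(heights, na.rm = TRUE)         # same result without building clean
```

Many of R's summary functions, including mean and sd, accept na.rm = TRUE.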
Data frames
R's fundamental data structures are data frames
The next tutorial will introduce them…
Irises…
setosa
virginica
data() yields many built-in data files. This is iris
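A quick first look at iris, using only base R:

```r
data(iris)             # load the built-in iris data set into the workspace
class(iris)            # "data.frame"
dim(iris)              # 150 rows, 5 columns
names(iris)            # four measurement columns plus Species
levels(iris$Species)   # "setosa" "versicolor" "virginica"
head(iris, 3)          # the first three rows
```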
Subsetting iris data
As with vectors, you can "subset" data frames.
df[rows,cols]
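The df[rows,cols] pattern applied to iris, as a sketch. Leaving a slot empty means "all rows" or "all columns"; the name setosa is just for illustration.

```r
iris[1:3, ]                                   # rows 1-3, all columns
head(iris[, c("Sepal.Length", "Species")])    # all rows, two columns (first few shown)

setosa <- iris[iris$Species == "setosa", ]    # a logical test selects rows
nrow(setosa)                                  # 50

# Rows and columns can be chosen at the same time.
iris[iris$Petal.Length > 6, c("Petal.Length", "Species")]
```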
Lab…
The 2nd part of each class meeting is dedicated to lab work.
I welcome you to stay for the lab, but it is not required.
Today's lab:
Work through Santorico and Shin's Tutorial for the R Statistical Package and submit the console sessions as pr2_1.txt, pr2_2.txt, pr2_3.txt, pr2_4.txt, and pr2_5.txt.
This is a nice reinforcement of vectors, introduction to
data frames, and a look at the graphics that R supports.
Homework
Problem 3: Challenge exercises in R
These will reinforce the "subsetting" and data-analysis introduction from pr2's tutorial.
Problem 4: Introduction to Data Science, early chapters
This is a fuller background on R and the field of data science
(submit your console session for both of these…)
Lab !
CS vs. IS and IT ?
www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf
greater integration
system-wide issues
smaller details
machine specifics
CS vs. IS and IT ?
Where will IS go?
CS vs. IS and IT ?
IT ?
Where will IT go?
IT ?
The bigger picture
Weeks 10-12
Objects
Week 10
Week 11
Week 12
Weeks 13-15
Final Projects
classes vs. objects
methods and data
inheritance
Week 13
Week 14
Week 15
final projects
final projects
final exam
This class is truly seminar-style: we're developing expertise in this field together.

Lec1cgu13updated.ppt

  • 1. Welcome to IST 380 ! When the course was over, I knew it was a good thing. We don't have strong enough words to describe this class. Data Science Programming an advocate of concrete computing – and HMC's mascot - New York Times Review of Courses - US News and Course Report We give this course two thumbs! - Ebert and Roeper
  • 2. Welcome to IST 380 ! Data Science Programming an advocate of concrete computing – and HMC's mascot
  • 3. About myself Who Zach Dodds Harvey Mudd College Where What Research includes robotics and computer vision Contact Information dodds@cs.hmc.edu 909-607-0867 Office Hours: Friday mornings, 9-11 am or set up a time... When Mondays 7-10pm here in ACB 119 HMC Beckman B111
  • 4. TMI? fan of low-tech games fan of low-level AI
  • 5. IST 380 ~ the big picture What is it? Why me?
  • 6. IST 380 ~ the big picture Data Science Venn Diagram Hmmm… where am I on this diagram? What is it?
  • 7. Data?! • Neighbor's name • A place they consider home • Are they working at a company now? • How many U.S. states have they visited? • Their favorite unhealthy food… ? • Do they have any "Data Science" background? (statistics, machine learning, CS) Where?
  • 9. Data! • Neighbor's name • A place they consider home • Are they working at a company now? • How many U.S. states have they visited? • Their favorite unhealthy food… ? • Do they have any "Data Science" background? (statistics, machine learning, CS) Zachary Dodds Pittsburgh, PA Harvey Mudd Where? 44 mostly CS for me… M&Ms
  • 10. Data! • Neighbor's name • A place they consider home • Are they working at a company now? • How many U.S. states have they visited? • Their favorite unhealthy food… ? • Do they have any "Data Science" background? (statistics, machine learning, CS) Zachary Dodds Pittsburgh, PA Harvey Mudd Where? 44 mostly CS for me… M&Ms be sure to set up your login + profile for the submission site…
  • 11. Data Science concerns Is "Data Science" important or just trendy?
  • 13. the companies are expanding as fast as the data!
  • 14. Data, data everywhere… There's certainly a lot of it!
      Data produced each year (logarithmic scale, marked at 1 Petabyte, 1 Exabyte, 1 Zettabyte): 2002: 5 EB; 2006: 161 EB; 2009: 800 EB; 2011: 1.8 ZB; 2015: 8.0 ZB.
      Human brain's capacity: 14 PB. 100 years of HD video + audio: 60 PB. Also marked: 120 PB.
      Units: 1 TB = 1000 GB; 1 Petabyte == 1000 TB.
      References:
      (brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store
      (life in video) 60 PB: in 4320p resolution, extrapolated from 16MB for 1:21 of 640x480 video (w/sound) – almost certainly a gross overestimate, as sleep can be compressed significantly!
      (2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
      (2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
      (2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf
      (2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm
      (2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
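To get a feel for these magnitudes, here is a minimal R sketch that puts the slide's yearly estimates on a base-10 logarithmic scale and compares them. The variable names (`produced`, `growth`, `brains_2015`) are made up for illustration; the figures are the ones quoted on the slide, using its decimal convention (1 TB = 1000 GB).

```r
# Decimal storage units in bytes, following the slide's 1 TB = 1000 GB convention
GB <- 1e9
TB <- 1000 * GB
PB <- 1000 * TB
EB <- 1000 * PB
ZB <- 1000 * EB

# Yearly data-production estimates quoted on the slide
produced <- c("2002" = 5 * EB, "2006" = 161 * EB, "2009" = 800 * EB,
              "2011" = 1.8 * ZB, "2015" = 8.0 * ZB)

# Where each estimate sits on a logarithmic (base-10) axis
log_bytes <- log10(produced)
print(round(log_bytes, 2))

# Growth factor from the 2002 estimate to the 2015 estimate
growth <- unname(produced["2015"] / produced["2002"])
print(growth)   # 1600

# How many 14 PB "brains" the 2015 estimate corresponds to
brains_2015 <- unname(produced["2015"] / (14 * PB))
print(round(brains_2015))
```

On a log scale, each step of 3 in `log_bytes` is one unit prefix (PB to EB to ZB), which is why the chart can show petabytes and zettabytes together.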
  • 16. Big Data? I agree with this…
  • 17. Make data easier to use ~ by using it! It may be true that Data Science isn't a science – but that doesn't mean it's not useful!
  • 18. IST 380 ~ the big picture What? Why? Data Science Programming Data Rules All of our insights – large and small, permanent and ephemeral, natural and artificial – come about through the integration of lots of data. Data Science simply recognizes that the rules and skills behind those insights are widely applicable…
  • 19. A few examples… Make3d How is this being done? Andrew Ng ~ Computers and Thought award, 2009 … Data Science is at the heart of computer science and how do we succeed?
  • 20. A few examples… … Data Science is at the heart of computer science Stanford's Autonomous Vehicles project (Thrun et al.) Learning to Powerslide
  • 21. A few examples… … Data Science is at the heart of computer science "my summer was finding that red line" Learning ground from obstacles
  • 22. A few examples… Learning ground from obstacles classification segmentation
  • 28. Netflix Prize. Bob Bell, winner of the "Netflix prize" (I don't know this guy). Some films are difficult to predict… Napoleon Dynamite = 1.22; Batman Begins = .75; Finding Nemo = ??; Lord of the Rings = ??
  • 29. Netflix Prize. Bob Bell, winner of the "Netflix prize" (I don't know this guy). Some films are difficult to predict… and others are easier! Napoleon Dynamite = 1.22; Batman Begins = .75; Finding Nemo = .67; Lord of the Rings = .42
  • 30. Why IST 380 ? Specific skills: R statistical environment (and the S programming language) Experience with several statistical analyses (descriptive statistics) Experience with predictive statistics (modeling) and machine learning algorithms
  • 31. Why IST 380? Specific skills: R statistical environment (and the S programming language); experience with several statistical analyses (descriptive statistics); experience with predictive statistics (modeling) and machine learning algorithms; final project ~ open-ended with datasets of your choice. Broad background: You'll be confident and capable with whatever datasets you encounter in the future – on your own or as part of a team.
  • 33. Details. Web Page: http://www.cs.hmc.edu/~dodds/IST380 (assignments, online text, necessary files, and lecture slides are linked). First week's assignment: Getting started with R. Programming: R, www.r-project.org/ Textbook: An Introduction to Data Science, jsresearch.net/groups/teachdatascience/ Grab both of these now… they are freely available online, and there are many online resources…
  • 34. Homepage http://www.cs.hmc.edu/~dodds/IST380/ Go to the course page Grab R and the text from these two links…
  • 35. Homework. Assignments: ~ 2-5 problems/week, ~ 100 points, extra credit often. Due Tuesday of the following week by 11:59 pm. Assignment 1 due Tuesday, February 5: 1 week + 1 day…
  • 36. Homework. Assignments: ~ 2-5 problems/week, ~ 100 points, extra credit often. Due Tuesday of the following week by 11:59 pm. Assignment 1 due Tuesday, February 5. Working on programs: on your own or in groups of 2. Divide the work at the keyboard evenly! Submitting programs: at the submission website. Today's Lab: install software; ensure accounts are working; try out R - the first HW is officially due on 2/5.
  • 37. Outline (approximate!). Weeks 1-5: using R, descriptive statistics, predictive statistics, probability distributions. Weeks 6-10: "Data Science" / "Machine Learning": statistical modeling, support vector machines (SVMs), random forests, k-means algorithm, nearest neighbors (NN). Weeks 11-15: Final Project. No breaks?!
  • 38. Grading. Grades: based on points percentage. ~ 800 points for assignments; ~ 400 points for the final project. if score >= 0.95: grade = "A"; if score >= 0.90: grade = "A-"; if score >= 0.86: grade = "B+"; see the course syllabus for the full list... Final project: • the last ~4 weeks will work towards a larger, final project • choose your own problem to study (I'll have some suggestions, too.) • there will be a short design phase and a short final presentation • I'd encourage you to connect R and our Data Science techniques to other datasets or projects that you use/need/like, etc.
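The grade cutoffs on this slide can be sketched as a small R function. This is only an illustration: `letter_grade` is a hypothetical helper, it encodes just the three cutoffs shown here (the rest are in the syllabus, so lower scores return `NA`), and the point totals are the slide's approximate figures.

```r
# Hypothetical helper: map a points percentage (0 to 1) to a letter grade.
# Only the three cutoffs shown on the slide are encoded; anything below
# 0.86 returns NA because the remaining cutoffs are in the course syllabus.
letter_grade <- function(score) {
  if (score >= 0.95) {
    "A"
  } else if (score >= 0.90) {
    "A-"
  } else if (score >= 0.86) {
    "B+"
  } else {
    NA_character_
  }
}

# Assumed totals: the slide gives only approximate figures (~800 + ~400)
total_points <- 800 + 400
earned <- 1100                  # made-up example value
score <- earned / total_points  # about 0.917
print(letter_grade(score))      # "A-"
```

The `else if` chain matters: testing the highest cutoff first means each score lands in exactly one grade.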
  • 39. Academic Honesty. This course operates under CGU's (and all of Claremont Schools') Academic Honesty policies… • Your work must be your own. This must be true for the whole team, if you're working in a pair. • Consulting with others (except team members or myself) is encouraged, but has to be limited to discussion and debugging of problems. Sharing of written, electronic, or verbal solutions/files/code is a violation of CGU’s academic honesty policy. • A reasonable guideline: Work is your own if you could delete all of it and recreate it yourself.
  • 42. Getting to know… R http://lang-index.sourceforge.net/#categ R is the programmer's toolkit for statistics; SAS, Stata, SPSS are preferred by those in business intelligence
  • 43. Getting to know… R Free… and very well supported online…
  • 44. Getting to know… R R is responsive, up-to-date, and flexible: Data Science vs. Statistics
  • 45. Getting to know… R. 1) Find the IST 380 course webpage: www.cs.hmc.edu/~dodds/IST380/ 2) Download and install R. 3) Run R and try some basic commands at the prompt, one per line: 6 * 7; rnorm(10); x <- 380
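As a rough guide to what those three commands do at the prompt, here is the same session with comments. The final line (typing `x` by itself) is an addition for illustration, not part of the slide.

```r
6 * 7        # arithmetic: R prints [1] 42
rnorm(10)    # ten random draws from a standard normal; values differ on every run
x <- 380     # assignment with <- ; nothing is printed
x            # typing a name prints its value: [1] 380
```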
  • 46. Getting started! 1) Open Matloff's Why R? notes 2) Skip ahead to page 7, the "5 minute example session" 3) Try out the commands in section 2.2 to get started… 4) When you finish, save your session and submit it! This is problem 1 this week
  • 47. Saving your session. 1) Create a folder named hw1, perhaps on your desktop. 2) Use the Save to file… (Windows) or Save as… (Mac) in order to save your current console session into hw1. 3) Name that file pr1.txt. 4) From your operating system, open up that file in order to confirm it contains your whole session! This is problem 1 this week.
  • 48. Submitting your work. 1) Zip up hw1 into hw1.zip. 2) From the course webpage, click on the submission site link. 3) Choose a submission site login name & let me know! 4) Once your account is made, login, change your password to something you know, and submit hw1.zip. 5) You can submit again – all copies are saved… You've completed Problem 1! This webserver can be spacey -- I should know! troubles? email me!
  • 49. Reflection Average and standard deviation? Assignment? Comments? Printing? Comments? Creating a vector?
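The ideas this slide asks about can be collected into one short sketch. The vector values are made up for illustration; the functions (`c`, `mean`, `sd`, `print`, `cat`) are standard R.

```r
# Creating a vector with c(); these values are made up for illustration
states_visited <- c(44, 12, 30, 7)

# Assignment uses <- and comments start with #
avg <- mean(states_visited)      # average
spread <- sd(states_visited)     # sample standard deviation

# Printing: print() shows a value, cat() assembles text and values
print(avg)                       # [1] 23.25
cat("standard deviation:", spread, "\n")
```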
  • 50. R types You can use mode() to view the type of a variable.
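A few calls to `mode()` at the prompt show the common cases; the values passed in are arbitrary examples.

```r
mode(380)          # "numeric"
mode(6L)           # "numeric" (integers report the same mode)
mode("IST 380")    # "character"
mode(TRUE)         # "logical"
mode(c(1, 2, 3))   # "numeric": a vector's mode is the mode of its elements
mode(mean)         # "function"
```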
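  • 51. Where's the big data? Vectors are R's ordered collections of a single type of element (R also has a separate "list" type, which can mix types). c ~ concatenate
  • 52. Where's the big data? Vectors are R's ordered collections of a single type of element (R also has a separate "list" type, which can mix types). c ~ concatenate. The colon : also creates vectors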
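A minimal sketch of both ways to build a vector, with made-up values. The last example shows what "a single type of element" means in practice: mixed inputs are coerced to one type.

```r
ages <- c(21, 35, 42)        # c() concatenates values into a vector
more <- c(ages, 58, 19)      # c() also joins existing vectors
countdown <- 5:1             # the colon builds a sequence: 5 4 3 2 1
mixed <- c(1, "two", TRUE)   # mixed types are coerced to character

length(more)   # 5
mode(mixed)    # "character"
print(countdown)
```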
  • 53. Analyzing vectors – try these… Square brackets [] can "subset" (or "slice") vectors
  • 54. Analyzing vectors Square brackets [] can "subset" (or "slice") vectors you can use a boolean vector to subset another vector
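Here is a short sketch of the main subsetting forms, using a made-up vector `v`. Note that R indexes from 1, and that a negative index drops elements rather than counting from the end.

```r
v <- c(10, 20, 30, 40, 50)

v[1]         # first element: 10
v[2:4]       # a slice: 20 30 40
v[c(1, 5)]   # chosen positions: 10 50
v[-1]        # everything except the first: 20 30 40 50

big <- v > 25            # boolean vector: FALSE FALSE TRUE TRUE TRUE
v[big]                   # keeps elements where big is TRUE: 30 40 50
v[v > 25 & v < 50]       # conditions can be combined: 30 40
```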
  • 55. NA R uses NA to represent data that is "not available" What is going on here? The function is.na( ) tests for NA
  • 56. NA R uses NA to represent data that is "not available" What is going on here? The function is.na( ) tests for NA This uses subsetting to remove NA values!
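The slide's screenshots are not in this transcript, so the following is a sketch with made-up values of how NA typically behaves and how subsetting removes it.

```r
visits <- c(44, NA, 12, NA, 30)

is.na(visits)                     # FALSE TRUE FALSE TRUE FALSE
mean(visits)                      # NA: one missing value makes the mean unknown
visits == NA                      # all NA: comparing with NA never gives TRUE

clean <- visits[!is.na(visits)]   # subsetting keeps the non-NA values: 44 12 30
mean(clean)
mean(visits, na.rm = TRUE)        # same result without building clean first
```

Because `visits == NA` is itself NA, `is.na()` is the way to test for missing values.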
  • 57. Data frames. Data frames are R's fundamental structure for tabular data. The next tutorial will introduce them…
  • 58. Irises… setosa virginica data() yields many built-in data files. This is iris
  • 59. Subsetting iris data As with vectors, you can "subset" data frames. df[rows,cols]
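The df[rows,cols] pattern can be tried on the built-in iris data. This is a sketch; the variable names `setosa` and `long_petals` and the cutoff of 6 are chosen here for illustration.

```r
data(iris)     # load the built-in dataset
dim(iris)      # 150 rows, 5 columns

iris[1:3, ]                              # first three rows, all columns
iris[1:3, c("Sepal.Length", "Species")]  # first three rows, two named columns

# Rows can be chosen with a boolean vector, just as with plain vectors
setosa <- iris[iris$Species == "setosa", ]
nrow(setosa)   # 50

# Rows by condition, columns by name
long_petals <- iris[iris$Petal.Length > 6, c("Petal.Length", "Species")]
print(long_petals)
```

Leaving the rows or cols position empty, as in `iris[1:3, ]`, means "all of them".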
  • 60. Lab… The 2nd part of each class meeting is dedicated to lab work. I welcome you to stay for the lab, but it is not required. Today's lab: Work through Santorico and Shin's Tutorial for the R Statistical Package and submit the console sessions as pr2_1.txt, pr2_2.txt, pr2_3.txt, pr2_4.txt, and pr2_5.txt. This is a nice reinforcement of vectors, introduction to data frames, and a look at the graphics that R supports.
  • 61. Homework. Problem 3: Challenge exercises in R. These will reinforce the "subsetting" and data-analysis introduction from pr2's tutorial. Problem 4: Introduction to Data Science, early chapters. This is a fuller background on R and the field of data science (submit your console session for both of these…)
  • 62. Lab !
  • 63. CS vs. IS and IT ? www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf greater integration system-wide issues smaller details machine specifics
  • 64. CS vs. IS and IT ? Where will IS go?
  • 65. CS vs. IS and IT ?
  • 66. IT ? Where will IT go?
  • 67. IT ?
  • 69. The bigger picture. Weeks 10-12: Objects (Week 10: classes vs. objects; Week 11: methods and data; Week 12: inheritance). Weeks 13-15: Final Projects (Week 13: final projects; Week 14: final projects; Week 15: final exam).
  • 73. Data! • Neighbor's name • A place they consider home • Are they working at a company now? • How many U.S. states have they visited? • Their favorite unhealthy food… ? • Do they have any "Data Science" (statistics, machine learning, CS) background? Zachary Dodds Pittsburgh, PA Harvey Mudd Where? 44 mostly CS for me… M&Ms be sure to set up your login + profile for the submission site… This class is truly seminar-style: we're developing expertise in this field together.