Big Data Curricula at the UW eScience Institute, JSM 2013
A 25 minute talk from a panel on big data curricula at JSM 2013
http://www.amstat.org/meetings/jsm/2013/onlineprogram/ActivityDetails.cfm?SessionID=208664

Published in: Education, Technology
Notes
  • Observe the world vs. observe the data. Instruments vs. algorithms.
  • In part as an attempt to relate “eScience” and “data science,” and in part to make sure the idea of data science wasn’t completely taken over by the machine learning people, we ran a massively open online course last Spring called Introduction to Data Science. We taught Scalable Databases, MapReduce, Statistics, Machine Learning, and Visualization.
  • “Data Jujitsu,” “Data Wrangling,” “Data Munging”
  • Our collaborators tell us that loading data into memory with R is the major bottleneck. It actually changes the science they can do: I would say that we can start answering questions about macro-ecology (the study of relationships between organisms and their environment at large spatial scales).

    1. Big Data Curricula at the University of Washington eScience Institute
       Bill Howe, PhD
       Director of Research, Scalable Data Analytics
       University of Washington eScience Institute
       8/7/2013
    2. “It’s a great time to be a data geek.” -- Roger Barga, Microsoft Research
       “The greatest minds of my generation are trying to figure out how to make people click on ads” -- Jeff Hammerbacher, co-founder, Cloudera
    3. 1) Theory (last 2000 yrs)
       2) Experiment (last 200 yrs)
       3) Simulation (last 50 yrs)
       4) Data-Driven Discovery (last 5 yrs)
    4. The University of Washington eScience Institute
       • Rationale
         – The exponential increase in sensors is transitioning all fields of science and engineering from data-poor to data-rich
         – As a result, the techniques and technologies of data science must be widely practiced and widely adopted
       • Mission
         – Advance the forefront of research both in modern data science techniques and technologies, and in the fields that depend upon them
       • Strategy
         – Provide an umbrella organization for Big Data activities at UW and beyond (new curricula, collaborations, funding sources, hiring practices)
         – Bootstrap a national network of partners and peer institutes
         – Attract, develop, and retain “Pi-shaped people”
    5. π-shaped researchers
       Broad in many areas; deep in at least two
    6. UW Data Science Education Efforts
       Audiences: Students (CS/Informatics and Non-Major; undergrads and grads) and Non-Students (professionals, researchers)
       Programs:
       • UWEO Data Science Certificate
       • Graduate Certificate in Big Data
       • CS Data Management Courses
       • eScience workshops
       • Intro to data programming
       • eScience Masters (planned)
       • MOOC: Intro to Data Science
       • Incubator: On-the-job-training
       Previous courses:
       • Scientific Data Management, Graduate CS, Summer 2006, Portland State University
       • Scientific Data Management, Graduate CS, Spring 2010, University of Washington
    7. Three Activities
       • Massively Open Online Course
       • New PhD Tracks in Big Data
       • An Incubator for Data Science Projects
       • Other activities I won’t discuss
         – Undergraduate “Data Wizardry” Courses
         – 2-day Bootcamps in Python, SQL, GitHub, …
         – Certificate Programs in Data Science
         – Hackathons
    10. • 8600 completed all programming assignments
        • 7000 earned a certificate
    11. Syllabus
        • Data Science Landscape (~1 week)
        • Data Manipulation at Scale
          – Relational Databases (~1 week)
          – MapReduce (~1 week)
          – NoSQL (~1 week)
        • Analytics
          – Statistics Pearls (~1 week)
          – Machine Learning Pearls (~1 week)
        • Visualization (~1 week)
    12. [Figure: “This Course” placed along four axes: tools / abstr., desk / cloud, structs / stats, hackers / analysts]
    13. What are the abstractions of data science? (axis: tools / abstr.)
        “Data Jujitsu”
        “Data Wrangling”
        “Data Munging”
        Translation: “We have no idea what this is all about”
    14. What are the abstractions of data science? (axis: tools / abstr.)
        • matrices and linear algebra?
        • relations and relational algebra?
        • objects and methods?
        • files and scripts?
        • data frames and functions?
    15. Data Access Hitting a Wall (axis: desk / cloud) [slide src: Jim Gray]
        Current practice based on data download (FTP/GREP)
        Will not scale to the datasets of tomorrow
        • You can GREP 1 MB in a second
        • You can GREP 1 GB in a minute
        • You can GREP 1 TB in 2 days
        • You can GREP 1 PB in 3 years
        • Oh!, and 1 PB ~5,000 disks
        • At some point you need indices to limit search, parallel data search and analysis
        • This is where databases can help
        • You can FTP 1 MB in 1 sec
        • You can FTP 1 GB / min (~1$)
        • … 2 days and 1K$
        • … 3 years and 1M$
    16. US faces shortage of 140,000 to 190,000 people “with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.” -- McKinsey Global Institute (axis: hackers / analysts)
    17. Three types of tasks: (axis: structs / stats)
        1) Preparing to run a model
        2) Running the model
        3) Interpreting the results
        Gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging
        “80% of the work” -- Aaron Kimball
        “The other 80% of the work” -- Aaron Kimball
    19. New PhD Track: “Big Data U”
        • Open to all departments
        • New courses to “level the playing field”
          – “Molecular Biology for Computer Scientists” offered this Fall
        • Dual advising in two disciplines
        • Joint projects leading to multiple theses
          – Each methods thesis will include domain impact component
          – Each domain thesis will include methods impact component
        • Contribution to a shared cyberinfrastructure
          – Software engineering experience as a side effect
        • “Application Assistantships”
          – Like RAs and TAs; focused on solving a concrete problem
        Magda Balazinska, Carlos Guestrin
    21. Data Science Incubator: Motivation
        • We need the right people
          – We produce “builders,” but 99% of them go to industry to “make people click on ads”
          – They aren’t motivated by writing papers
          – No viable career path in the academy
        • We need the right processes
          – Hands-on, extended, intensive experience is required to produce π-shaped people
          – Data-driven discovery requires intensive collaboration
    22. Science Domains; Stats, Computer Science, Applied Math
        • “Where’s the funding?”
        • “How does this help me write a paper in my field?”
        • Thin collaborations; nobody to work on the short-term, high-risk, high-impact “triage” projects
        • “Does method X work on dataset Y?”
    23. Domain Labs; Research Programmers
        • Expensive; doesn’t scale
        • “Code Monkey” – No viable career path
        • Can’t attract top people
        • No sharing, no community, no cross-pollination
    24. Data Science Incubator: Structure
        • Recruit top-flight data science talent
        • Give them autonomy to select collaborations and projects
        • Promote them according to “altmetrics” and project impact
          – “Data Scientist” → “Senior Data Scientist” → “Technical Fellow”
          – “Data Science Fellows”
        • Perhaps non-tenure, but 3-5 year commitments
        • Funded with contributions from Academic units, IT, Libraries, and soft money
    25. Data Science Incubator: Seed Grants
        • Domain researchers submit Seed Grant applications for short, intensive 1-6 month projects
          – Reviewed by the Data Scientists themselves
        • Awardees send 1+ students, postdocs, staff, or faculty to come and physically sit in the incubator space X days per week for the project duration
          – Application may or may not include funding for the student
    26. Domain Labs; Incubator
        • Data Scientists have their own identity and prestige
        • Cross-pollination between disciplines
        • Awardees leave with skills and knowledge; become “disciples”
    29. MOOC “Introduction to Data Science”: https://www.coursera.org/course/datasci
        Certificate program: http://www.pce.uw.edu/courses/data-science-intro
        http://escience.washington.edu
        billhowe@cs.washington.edu
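
The scan-time figures on the “Data Access Hitting a Wall” slide (slide 15) are round numbers, but the arithmetic behind them is easy to reproduce. The sketch below is illustrative only: the 10 MB/s sustained scan rate, the record-count example, and the helper names scan_seconds, humanize, and index_probes are assumptions made here, not values or code from the talk. It shows why a full scan grows linearly with data size while a lookup through a sorted index grows only logarithmically.

```python
import math

# ASSUMPTION: a fixed sustained sequential scan rate of 10 MB/s.
# The slide gives only rounded times, not a rate.
SCAN_RATE_BYTES_PER_SEC = 10 * 10**6

SIZES = {"1 MB": 10**6, "1 GB": 10**9, "1 TB": 10**12, "1 PB": 10**15}


def scan_seconds(size_bytes, rate=SCAN_RATE_BYTES_PER_SEC):
    """Seconds needed to read every byte once at a fixed rate."""
    return size_bytes / rate


def humanize(seconds):
    """Render a duration in the largest unit that fits."""
    units = (("years", 365 * 86400), ("days", 86400),
             ("hours", 3600), ("minutes", 60))
    for name, length in units:
        if seconds >= length:
            return f"{seconds / length:.1f} {name}"
    return f"{seconds:.1f} seconds"


def index_probes(n_records):
    """Comparisons for a binary search over a sorted index of n records."""
    return max(1, math.ceil(math.log2(n_records)))


if __name__ == "__main__":
    for label, size in SIZES.items():
        print(f"full scan of {label}: {humanize(scan_seconds(size))}")
    # Hypothetical example: one trillion indexed records.
    print(f"index lookup among 10**12 records: {index_probes(10**12)} probes")
```

At the assumed rate, a 1 PB scan takes on the order of three years, the same order of magnitude as the slide, while an index lookup over a trillion records needs only a few dozen comparisons. The exact figures move with the assumed rate; the linear versus logarithmic gap does not.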
