Bill Howe, PhD
Director of Research, Scalable Data Analytics
University of Washington eScience Institute
Big Data Curricula at the University of Washington eScience Institute
8/7/2013 Bill Howe, UW 1
“It’s a great time to be a data geek.”
-- Roger Barga, Microsoft Research
“The greatest minds of my generation are trying
to figure out how to make people click on ads”
-- Jeff Hammerbacher, co-founder, Cloudera
1. Theory (last 2000 yrs)
2. Experiment (last 200 yrs)
3. Simulation (last 50 yrs)
4. Data-Driven Discovery (last 5 yrs)
The University of Washington
eScience Institute
• Rationale
– The exponential increase in sensors is transitioning all fields of science
and engineering from data-poor to data-rich
– As a result, the techniques and technologies of data science must be
widely practiced and widely adopted
• Mission
– Advance the forefront of research both in modern data science
techniques and technologies, and in the fields that depend upon them
• Strategy
– Provide an umbrella organization for Big Data activities at UW and
beyond (new curricula, collaborations, funding sources, hiring practices)
– Bootstrap a national network of partners and peer institutes
– Attract, develop, and retain “Pi-shaped people”
π-shaped researchers
Broad in many areas; deep in at least two
UW Data Science Education Efforts
Students Non-Students
CS/Informatics Non-Major
professionals researchers
undergrads grads undergrads grads
UWEO Data Science Certificate
Graduate Certificate in Big Data
CS Data Management Courses
eScience workshops
Intro to data programming
eScience Masters (planned)
MOOC: Intro to Data Science
Incubator: On-the-job-training
Previous courses:
Scientific Data Management, Graduate CS, Summer 2006, Portland State University
Scientific Data Management, Graduate CS, Spring 2010, University of Washington
Three Activities
• Massively Open Online Course
• New PhD Tracks in Big Data
• An Incubator for Data Science Projects
• Other activities I won’t discuss
– Undergraduate “Data Wizardry” Courses
– 2-day Bootcamps in Python, SQL, GitHub, …
– Certificate Programs in Data Science
– Hackathons
• 8600 completed all programming assignments
• 7000 earned a certificate
Syllabus
• Data Science Landscape (~1 week)
• Data Manipulation at Scale
– Relational Databases (~1 week)
– MapReduce (~1 week)
– NoSQL (~1 week)
• Analytics
– Statistics Pearls (~1 week)
– Machine Learning Pearls (~1 week)
• Visualization (~1 week)
This Course, placed along four axes: tools vs. abstractions, desktop vs. cloud, structures vs. statistics, hackers vs. analysts
What are the abstractions of
data science?
tools abstr.
“Data Jujitsu”
“Data Wrangling”
“Data Munging”
Translation: “We have no idea what
this is all about”
matrices and linear algebra?
relations and relational algebra?
objects and methods?
files and scripts?
data frames and functions?
What are the abstractions of
data science?
tools abstr.
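To make the question concrete, here is a minimal sketch of one small task (mean reading per station) written in two of the abstractions listed above: "files and scripts" style procedural code, and "relations and relational algebra" via SQL. The sample data, the table name, and the function names are all hypothetical and are not taken from the talk.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (station, reading) pairs.
readings = [("A", 10.0), ("B", 20.0), ("A", 14.0), ("B", 22.0)]


def mean_by_station_script(rows):
    """Script-style abstraction: explicit loops and mutable accumulators."""
    totals = defaultdict(lambda: [0.0, 0])  # station -> [sum, count]
    for station, value in rows:
        totals[station][0] += value
        totals[station][1] += 1
    return {station: total / count for station, (total, count) in totals.items()}


def mean_by_station_relational(rows):
    """Relational abstraction: declare the result and let the engine plan it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (station TEXT, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT station, AVG(value) FROM readings GROUP BY station"
    )
    result = dict(cursor)  # each row is a (station, mean) pair
    conn.close()
    return result
```

Both functions return the same mapping; the difference is whether the grouping logic is spelled out step by step or stated once as a query.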
Data Access Hitting a Wall
Current practice based on data download (FTP/GREP)
Will not scale to the datasets of tomorrow
• You can GREP 1 MB in a second
• You can GREP 1 GB in a minute
• You can GREP 1 TB in 2 days
• You can GREP 1 PB in 3 years.
• Oh, and 1 PB ≈ 5,000 disks
• At some point you need
– indices to limit search
– parallel data search and analysis
• This is where databases can help
• You can FTP 1 MB in 1 sec
• You can FTP 1 GB / min (~1$)
• … 2 days and 1K$
• … 3 years and 1M$
desk cloud
[slide src: Jim Gray]
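The arithmetic behind the slide can be sketched as follows. The 10 MB/s sequential scan rate is an assumption chosen for illustration, not a figure from the slide; at that rate a petabyte takes a little over three years, the same order of magnitude as the slide's estimate. The two probe-counting helpers are hypothetical names that contrast a grep-style linear scan with an index lookup, modeled here as binary search over sorted keys.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

# Assumed sequential scan rate (bytes per second); not taken from the slide.
ASSUMED_SCAN_RATE = 10**7  # 10 MB/s


def scan_seconds(size_bytes, bytes_per_second=ASSUMED_SCAN_RATE):
    """Time to read every byte once at a fixed throughput."""
    return size_bytes / bytes_per_second


def linear_probes(sorted_keys, target):
    """Comparisons made by a grep-style scan that checks every key in order."""
    for count, key in enumerate(sorted_keys, start=1):
        if key == target:
            return count
    return len(sorted_keys)


def index_probes(sorted_keys, target):
    """Comparisons made by binary search, standing in for an index lookup."""
    lo, hi, count = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        count += 1
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return count
```

For example, `scan_seconds(10**15) / SECONDS_PER_YEAR` gives the years needed to scan 1 PB at the assumed rate, while `index_probes` on a million sorted keys needs at most 20 comparisons where `linear_probes` may need a million.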
US faces shortage of 140,000 to 190,000
people “with deep analytical skills, as well
as 1.5 million managers and analysts with
the know-how to use the analysis of big
data to make effective decisions.”
-- McKinsey Global Institute
hackers analysts
Three types of tasks:
1) Preparing to run a model
2) Running the model
3) Interpreting the results
Gathering, cleaning, integrating, restructuring,
transforming, loading, filtering, deleting, combining,
merging, verifying, extracting, shaping, massaging
“80% of the work”
-- Aaron Kimball
“The other 80% of the work”
-- Aaron Kimball
structs stats
New PhD Track: “Big Data U”
• Open to all departments
• New courses to “level the playing field”
– “Molecular Biology for Computer Scientists” offered this Fall
• Dual advising in two disciplines
• Joint projects leading to multiple theses
– Each methods thesis will include domain impact component
– Each domain thesis will include methods impact component
• Contribution to a shared cyberinfrastructure
– Software engineering experience as a side effect
• “Application Assistantships”
– Like RAs and TAs; focused on solving a concrete problem
Magda Balazinska
Carlos Guestrin
Data Science Incubator: Motivation
• We need the right people
– We produce “builders,” but 99% of them go to industry to
“make people click on ads”
– They aren’t motivated by writing papers
– No viable career path in the academy
• We need the right processes
– Hands-on, extended, intensive experience is required to
produce π-shaped people
– Data-driven discovery requires intensive collaboration
Science Domains
Stats, Computer Science, Applied Math
• “Where’s the funding?”
• “How does this help me write a paper in my field?”
• Thin collaborations; nobody to work on the short-term, high-risk, high-impact “triage” projects
• “Does method X work on dataset Y?”
Domain Labs
Research Programmers
• Expensive; doesn’t scale
• “Code Monkey” – No viable career path
• Can’t attract top people
• No sharing, no community, no cross-pollination
Data Science Incubator: Structure
• Recruit top-flight data science talent
• Give them autonomy to select collaborations and projects
• Promote them according to “altmetrics” and project impact
– “Data Scientist” → “Senior Data Scientist” → “Technical Fellow”
– “Data Science Fellows”
• Perhaps non-tenure, but 3-5 year commitments
• Funded with contributions from Academic units, IT,
Libraries, and soft money
Data Science Incubator: Seed Grants
• Domain researchers submit Seed Grant applications
for short, intensive 1-6 month projects
– Reviewed by the Data Scientists themselves
• Awardees send 1+ students, postdocs, staff, or faculty
to come and physically sit in the incubator space X
days per week for the project duration
– Application may or may not include funding for the student
Domain Labs
Incubator
• Data Scientists have their own identity and prestige
• Cross-pollination between disciplines
• Awardees leave with skills and knowledge; become “disciples”
MOOC “Introduction to Data Science”:
https://www.coursera.org/course/datasci
Certificate program:
http://www.pce.uw.edu/courses/data-science-intro
http://escience.washington.edu
billhowe@cs.washington.edu

ComPTIA Overview | Comptia Security+ Book SY0-701
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 

Big Data Curricula at the UW eScience Institute, JSM 2013

  • 1. Bill Howe, PhD Director of Research, Scalable Data Analytics University of Washington eScience Institute Big Data Curricula at the University of Washington eScience Institute 8/7/2013 Bill Howe, UW 1
  • 2. “It’s a great time to be a data geek.” -- Roger Barga, Microsoft Research “The greatest minds of my generation are trying to figure out how to make people click on ads” -- Jeff Hammerbacher, co-founder, Cloudera
  • 3. 1. Theory (last 2000 yrs) 2. Experiment (last 200 yrs) 3. Simulation (last 50 yrs) 4. Data-Driven Discovery (last 5 yrs)
  • 4. The University of Washington eScience Institute • Rationale – The exponential increase in sensors is transitioning all fields of science and engineering from data-poor to data-rich – As a result, the techniques and technologies of data science must be widely practiced and widely adopted • Mission – Advance the forefront of research both in modern data science techniques and technologies, and in the fields that depend upon them • Strategy – Provide an umbrella organization for Big Data activities at UW and beyond (new curricula, collaborations, funding sources, hiring practices) – Bootstrap a national network of partners and peer institutes – Attract, develop, and retain “Pi-shaped people”
  • 5. π-shaped researchers Broad in many areas; deep in at least two
  • 6. UW Data Science Education Efforts. Audiences: Students (CS/Informatics undergrads and grads; Non-Major undergrads and grads) and Non-Students (professionals, researchers). Programs: UWEO Data Science Certificate; Graduate Certificate in Big Data; CS Data Management Courses; eScience workshops; Intro to data programming; eScience Masters (planned); MOOC: Intro to Data Science; Incubator: On-the-job-training. Previous courses: Scientific Data Management, Graduate CS, Summer 2006, Portland State University; Scientific Data Management, Graduate CS, Spring 2010, University of Washington
  • 7. Three Activities • Massively Open Online Course • New PhD Tracks in Big Data • An Incubator for Data Science Projects • Other activities I won’t discuss – Undergraduate “Data Wizardry” Courses – 2-day Bootcamps in Python, SQL, GitHub, … – Certificate Programs in Data Science – Hackathons
  • 8. Three Activities • Massively Open Online Course • New PhD Tracks in Big Data • An Incubator for Data Science Projects • Other activities I won’t discuss – Undergraduate “Data Wizardry” Courses – 2-day Bootcamps in Python, SQL, GitHub, … – Certificate Programs in Data Science – Hackathons
  • 10. • 8600 completed all programming assignments • 7000 earned a certificate
  • 11.
  • 12. Syllabus • Data Science Landscape (~1 week) • Data Manipulation at Scale – Relational Databases (~1 week) – MapReduce (~1 week) – NoSQL (~1 week) • Analytics – Statistics Pearls (~1 week) – Machine Learning Pearls (~1 week) • Visualization (~1 week)
  • 13. [Figure: “This Course” positioned along four axes: tools / abstr., desk / cloud, structs / stats, hackers / analysts]
  • 14. What are the abstractions of data science? “Data Jujitsu”, “Data Wrangling”, “Data Munging”. Translation: “We have no idea what this is all about” tools abstr.
  • 15. What are the abstractions of data science? matrices and linear algebra? relations and relational algebra? objects and methods? files and scripts? data frames and functions? tools abstr.
  • 16. Data Access Hitting a Wall. Current practice based on data download (FTP/GREP) will not scale to the datasets of tomorrow • You can GREP 1 MB in a second • You can GREP 1 GB in a minute • You can GREP 1 TB in 2 days • You can GREP 1 PB in 3 years • Oh!, and 1 PB ~5,000 disks • At some point you need indices to limit search, and parallel data search and analysis • This is where databases can help • You can FTP 1 MB in 1 sec • You can FTP 1 GB / min (~1$) • … 2 days and 1K$ • … 3 years and 1M$ desk cloud [slide src: Jim Gray]
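The scaling argument on slide 16 can be sketched in a few lines of Python. This is an illustration only, not code from the talk: the throughput is taken from the slide's "1 GB in a minute" as a constant-rate assumption (so the output will not reproduce the slide's rounded figures exactly), and `scan_seconds`, `linear_find`, and `indexed_find` are hypothetical helper names. The second half contrasts a grep-style full scan with a lookup over sorted keys, which stands in for the index the slide says is needed.

```python
import bisect

MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15


def scan_seconds(size_bytes, bytes_per_second):
    """Time for one full sequential pass at a fixed throughput."""
    return size_bytes / bytes_per_second


def linear_find(records, key):
    """grep-style search: inspect records one by one until the key appears."""
    for position, record in enumerate(records):
        if record == key:
            return position
    return -1


def indexed_find(sorted_records, key):
    """Index-style search: binary search over records kept in sorted order."""
    position = bisect.bisect_left(sorted_records, key)
    if position < len(sorted_records) and sorted_records[position] == key:
        return position
    return -1


if __name__ == "__main__":
    # Assumed constant throughput, taken from the slide's "1 GB in a minute".
    rate = GB / 60
    for label, size in [("1 MB", MB), ("1 GB", GB), ("1 TB", TB), ("1 PB", PB)]:
        hours = scan_seconds(size, rate) / 3600
        print(f"{label}: {hours:.2f} hours per full scan")

    # Hypothetical sorted keys; both searches must agree on the position.
    records = list(range(0, 2_000_000, 2))
    assert linear_find(records, 1_999_998) == indexed_find(records, 1_999_998)
```

The full scan grows linearly with data size, while the sorted lookup touches only about log2(n) records, which is the sense in which an index "limits search".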
  • 17. US faces shortage of 140,000 to 190,000 people “with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.” -- McKinsey Global Institute hackers analysts
  • 18. Three types of tasks: 1) Preparing to run a model: gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging (“80% of the work” -- Aaron Kimball) 2) Running the model 3) Interpreting the results (“The other 80% of the work” -- Aaron Kimball) structs stats
  • 19. Three Activities • Massively Open Online Course • New PhD Tracks in Big Data • An Incubator for Data Science Projects • Other activities I won’t discuss – Undergraduate “Data Wizardry” Courses – 2-day Bootcamps in Python, SQL, GitHub, … – Certificate Programs in Data Science – Hackathons
  • 20. New PhD Track: “Big Data U” • Open to all departments • New courses to “level the playing field” – “Molecular Biology for Computer Scientists” offered this Fall • Dual advising in two disciplines • Joint projects leading to multiple theses – Each methods thesis will include domain impact component – Each domain thesis will include methods impact component • Contribution to a shared cyberinfrastructure – Software engineering experience as a side effect • “Application Assistantships” – Like RAs and TAs; focused on solving a concrete problem Magda Balazinska Carlos Guestrin
  • 21. Three Activities • Massively Open Online Course • New PhD Tracks in Big Data • An Incubator for Data Science • Other activities I won’t discuss – Undergraduate “Data Wizardry” Courses – 2-day Bootcamps in Python, SQL, GitHub, … – Certificate Programs in Data Science – Hackathons
  • 22. Data Science Incubator: Motivation • We need the right people – We produce “builders,” but 99% of them go to industry to “make people click on ads” – They aren’t motivated by writing papers – No viable career path in the academy • We need the right processes – Hands-on, extended, intensive experience is required to produce π-shaped people – Data-driven discovery requires intensive collaboration
  • 23. Science Domains Stats, Computer Science, Applied Math • “Where’s the funding?” • “How does this help me write a paper in my field?” • Thin collaborations; nobody to work on the short-term, high-risk, high-impact “triage” projects • “Does method X work on dataset Y?”
  • 24. Domain Labs Research Programmers • Expensive; doesn’t scale • “Code Monkey” – No viable career path • Can’t attract top people • No sharing, no community, no cross-pollination
  • 25. Data Science Incubator: Structure • Recruit top-flight data science talent • Give them autonomy to select collaborations and projects • Promote them according to “altmetrics” and project impact – “Data Scientist” → “Senior Data Scientist” → “Technical Fellow” – “Data Science Fellows” • Perhaps non-tenure, but 3-5 year commitments • Funded with contributions from Academic units, IT, Libraries, and soft money
  • 26. Data Science Incubator: Seed Grants • Domain researchers submit Seed Grant applications for short, intensive 1-6 month projects – Reviewed by the Data Scientists themselves • Awardees send 1+ students, postdocs, staff, or faculty to come and physically sit in the incubator space X days per week for the project duration – Application may or may not include funding for the student
  • 27. Domain Labs Incubator • Data Scientists have their own identity and prestige • Cross-pollination between disciplines • Awardees leave with skills and knowledge; become “disciples”
  • 29. Three Activities • Massively Open Online Course • New PhD Tracks in Big Data • An Incubator for Data Science • Other activities I won’t discuss – Undergraduate “Data Wizardry” Courses – 2-day Bootcamps in Python, SQL, GitHub, … – Certificate Programs in Data Science – Hackathons
  • 30. MOOC “Introduction to Data Science”: https://www.coursera.org/course/datasci Certificate program: http://www.pce.uw.edu/courses/data-science-intro http://escience.washington.edu billhowe@cs.washington.edu

Editor's Notes

  1. Observe the world vs. observe the data. Instruments vs. algorithms.
  2. So in part as an attempt to relate “eScience” and “data science,” and in part to make sure the idea of data science wasn’t completely taken over by the machine learning people, we ran a massively open online course last Spring called Introduction to Data Science. We taught Scalable Databases, MapReduce, Statistics, Machine Learning, Visualization.
  3. “Data Jujitsu”, “Data Wrangling”, “Data Munging”
  4. Our collaborators tell us that loading data into memory with R is the major bottleneck. It actually changes the science they can do: I would say that we can start answering questions about macro-ecology (study of relationships between organisms and their environment at large spatial scales).