SlideShare a Scribd company logo
1 of 45
Data Science Curricula at the
             University of Washington
                         Bill Howe, PhD
                             Director of
                      Research, Scalable Data
                              Analytics
                      University of Washington
                         eScience Institute



10/9/2012              Bill Howe, UW             1
http://escience.washington.edu
The University of Washington
    eScience Institute
• Rationale
   – The exponential increase in sensors is transitioning all fields of science and
     engineering from data-poor to data-rich
   – Techniques and technologies include
         • Sensors and sensor networks, databases, data mining, machine learning,
           visualization, cluster/cloud computing
   – If these techniques and technologies are not widely available and widely
     practiced, UW will cease to be competitive
• Mission
   – Help position the University of Washington at the forefront of research both in
     modern eScience techniques and technologies, and in the fields that depend
     upon them
• Strategy
   – Bootstrap a cadre of Research Scientists
   – Add faculty in key fields
   – Build out a “consultancy” of students and non-research staff


    10/9/2012                      Bill Howe, eScience Institute                    4
eScience Big Data Group
Bill Howe, Phd (databases, cloud, data-intensive scalable computing, visualization)
Director of Research, Scalable Data Analytics


Staff
     –   Seung-Hee Bae, Phd (postdoc, scalable machine learning algorithms)
     –   Dan Halperin, Phd (postdoc; scalable systems)
     –   Sagar Chitnis, Research Engineer (Azure, databases, web services)
     –   (alumna) Marianne Shaw, Phd (hadoop, semantic graph databases)
     –   (alumna) Alicia Key, Research Engineer (visualization, web applications)
Students
     – Scott Moe (2nd yr Phd, Applied Math)
     – Daniel Perry (2nd yr Phd, HCDE)
Partners
     –   CSE DB Faculty: Magda Balazinska, Dan Suciu
     –   CSE students: Paris Koutris, Prasang Upadhyaya,
     –   UW-IT (web applications, QA/support)
     –   Cecilia Aragon, Phd, Associate Professor, HCDE (visualization, scientific applications)
                10/9/2012                Bill Howe, UW                                   5
eScience ~= Data Science

             …modulo application area




10/9/2012             Bill Howe, UW     6
10/9/2012   Bill Howe, UW   7
UW Data Science Education Efforts
                                                  Students                  Non-Students
                                     CS/Informatics        Non-Major
                                                                       professionals researchers
                                   undergrads grads undergrads grads
UWEO Data Science Certificate
Graduate Certificate in Big Data
CS Data Management Courses
eScience workshops
Intro to data programming
eScience Masters (planned)
Coursera Course: Intro to Data
Science


  Previous courses:
  Scientific Data Management, Graduate CS, Summer 2006, Portland State University
  Scientific Data Management, Graduate CS, Spring 2010, University of Washington

  10/9/2012                               Bill Howe, UW                                  8
Huge number of
relevant
courses, new and
existing.




10/9/2012          Bill Howe, UW   9
“I worry that the Data Scientist role is like
            the mythical “webmaster” of the 90s:
            master of all trades.”

                               -- Aaron Kimball, CTO Wibidata




10/9/2012                      Bill Howe, UW                    10
Breadth


       tools                          abstractions

        Hadoop                           MapReduce

        PostgreSQL                       Relational Algebra

        glm(…) in R                      Logistic Regression

        Tableau                          InfoVis


10/9/2012             Bill Howe, UW                            11
Depth


            structures                       statistics

        Management                            Analysis

        Relational Algebra                    Linear Algebra

        Standards                             ad hoc files




10/9/2012                    Bill Howe, UW                     12
Scale


            desktop                       cloud

            main memory                   distributed

            R                             Hadoop

            local files                   S3, Azure Storage




10/9/2012                 Bill Howe, UW                   13
Target


            hackers                           analysts

            Assume                            Assume little or no
            proficiency in                    programming
            Python, Java, R




10/9/2012                     Bill Howe, UW                         14
Breadth
                       tools       abstr.



            Depth
                      structs       stats



            Scale
                       desk         cloud



            Target
                      hackers        analysts

10/9/2012                   Bill Howe, UW       15
tools    abstr.




                            structs   stats




                             desk     cloud




                            hackers    analysts
10/9/2012   Bill Howe, UW                      16
William W. Cohen




Machine                      tools    abstr.
                                                tools    abstr.

Learning
                                               structs   stats
                            structs   stats




                                                desk     cloud
                             desk     cloud




                                          hackers         analysts
                            hackers    analysts

10/9/2012   Bill Howe, UW                                   17
Dan                      Magda
                             Suciu                    Balazinska




                                                       tools       abstr.
            CSE 344 Introduction to Data Management


                                                      structs      stats




                                                        desk       cloud




                                                      hackers       analysts




10/9/2012                      Bill Howe, UW                          18
Jeff Hammerbacher Mike Franklin




                                                tools     abstr.




                                               structs    stats




                                                 desk     cloud




                                              hackers      analysts
10/9/2012   Bill Howe, UW                                    19
Introduction to Data Science
                                        Rachel Schutt



                                                         tools    abstr.




                                                        structs   stats




                                                         desk     cloud




                                                        hackers    analysts

  10/9/2012             Bill Howe, UW                                 20
tools    abstr.




                            structs   stats




                             desk     cloud




                            hackers    analysts


10/9/2012   Bill Howe, UW                21
Bill Howe Richard Sharp Roger Barga




                                        tools    abstr.




                                       structs    stats




                                        desk      cloud




                                      hackers      analysts
10/9/2012   Bill Howe, UW                                 22
Bill Howe



                                         tools    abstr.




                                        structs   stats




                                         desk     cloud




                                        hackers    analysts

10/9/2012   Bill Howe, UW                            23
tools        abstr.




            structs        stats




             desk          cloud




            hackers         analysts


10/9/2012             Bill Howe, UW    24
tools    abstr.
What goes around comes around

   •   2004 Dean et al. MapReduce
   •   2008 Hadoop 0.17 release
   •   2008 Olston et al. Pig: Relational Algebra on Hadoop
   •   2008 DryadLINQ: Relational Algebra in a Hadoop-like system
   •   2009 Thusoo et al. HIVE: SQL on Hadoop
   •   2009 Hbase: Indexing for Hadoop
   •   2010 Dietrich et al. Schemas and Indexing for Hadoop
   •   2012 Transactions in HBase (plus VoltDB, other NewSQL systems)

   •   But also some permanent contributions:
        – Fault tolerance
        – Schema-on-Read
        – User-defined functions that don’t suck

10/9/2012                          Bill Howe, UW                         25
tools      abstr.



        What are the abstractions of
        data science?

“Data Jujitsu”
“Data Wrangling”   Translation: “We have no idea what
“Data Munging”                   this is all about”




10/9/2012              Bill Howe, UW                     26
tools   abstr.



        What are the abstractions of
        data science?

      matrices and linear algebra?
      relations and relational algebra?
      objects and methods?
      files and scripts?
      data frames and functions?



10/9/2012                    Bill Howe, UW            27
structs    stats




  “80% of analytics is sums and averages”
                                -- Aaron Kimball, wibidata



     God created the integers; all else is the work of man

    Codd created relations; all else is the work of man




10/9/2012                      Bill Howe, UW                           28
structs    stats
Three types of tasks:

            1) Preparing to run a model                        “80% of the work”
                                                                       -- Aaron Kimball
              Gathering, cleaning, integrating, restructuring, transformi
              ng, loading, filtering, deleting, combining, merging, verifyi
              ng, extracting, shaping, massaging


            2) Running the model

            3) Interpreting the results                   The other 80% of the work



10/9/2012                            Bill Howe, UW                               29
Problem            structs   stats




How much time do you spend “handling
data” as opposed to “doing science”?



  Mode answer: “90%”
structs   stats




            src: Christian Grant, MADSkills



10/9/2012                Bill Howe, UW                  31
(Sparse) Matrix Multiply in SQL
                                               structs   stats




            src: Christian Grant, MADSkills



10/9/2012                      Bill Howe, UW             32
Aside: Schema-on-Write vs. Schema-on-Read
• A schema* is a shared consensus about some
  universe of discourse
• At the frontier of research, this shared consensus
  does not exist, by definition
• Any schema that does emerge will change
  frequently, by definition
• Data found “in the wild” will typically not conform to
  any schema, by definition
• But this doesn’t mean we have to live with ad hoc
  scripts and files
• My answer: Schema-later, “lazy schemification”
     * ontology/metadata standard/controlled vocabulary/etc.


10/9/2012                        Bill Howe, UW                 33
Data Access Hitting a Wall                                 desk   cloud


Current practice based on data download (FTP/GREP)
  Will not scale to the datasets of tomorrow
•   You can GREP 1 MB in a second      •   You can FTP 1 MB in 1 sec
•   You can GREP 1 GB in a minute
                                       •   You can FTP 1 GB / min (~1$)
•   You can GREP 1 TB in 2 days
•   You can GREP 1 PB in 3 years.
                                       •   … 2 days and 1K$
                                       •   … 3 years and 1M$
•   Oh!, and 1PB ~5,000 disks

• At some point you need
   indices to limit search
   parallel data search and analysis
• This is where databases can help



      [slide src: Jim Gray]                                        34
hackers   analysts




US faces shortage of 140,000 to 190,000
people “with deep analytical skills, as well
as 1.5 million managers and analysts with
the know-how to use the analysis of big
data to make effective decisions.”


                 --Mckinsey Global Institute

10/9/2012          Bill Howe, UW               35
hackers        analysts
Biologists are beginning to write very complex
queries (rather than relying on staff programmers)
Example: Computing the overlaps of two sets of blast results

 SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
   , x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
   , w.category as nc_category
   , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
  THEN x.end_bp - x.start_bp + 1
  WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
  THEN x.end_bp - w.start_bp + 1                                            We see thousands     of
  WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)                    queries written by
  THEN w.end_bp - x.start_bp + 1
 END AS len_overlap
                                                                            non-programmers

  FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
  INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
  ON x.chr = w.chr
  WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
  OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
  OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
  ORDER BY x.strain, x.chr ASC, x.start_bp ASC
UW Curricular Activities
                                                  Students                  Non-Students
                                     CS/Informatics        Non-Major
                                                                       professionals researchers
                                   undergrads grads undergrads grads
UWEO Data Science Certificate
Graduate Certificate in Big Data
Database Courses
eScience workshops
Intro to data programming
eScience Masters (planned)
Coursera Course Intro to Data
Science




  10/9/2012                               Bill Howe, UW                                 37
How do you deliver hands-on big data
experience to 10k students?

Cloud vendors’ free tiers?
      – 1 micro instance is not “big data”
Cloud vendors’ academic discounts? (e.g., Amazon’s education grants)
      – $100 / head = $1M
      – invites abuse: free credits with no obligation to complete course
Out of pocket?
      – Don’t want to be the only non-free Coursera course
      – Unclear that we can require it (perhaps analogous to a textbook?)



10/9/2012                          Bill Howe, UW                            38
10k Students on 10k GB for $10k
• Requirements
     – Inexpensive: Need a fixed, small budget; O(10k) maximum
     – Fair: All students need to be able to complete the assignment
• Non-solutions
     – Fixed cluster
             • Fairness problems; no quality of service guarantees
     – Autoscaling cluster
             • No upper bound to cost
     – Budget cap (via, e.g., Amazon’s IAM)
             • Fairness problems: Different students consume different levels of
               resources depending on background, etc.


 10/9/2012                            Bill Howe, UW                            39
10k Students on 10k GB for $10k
• Key idea: 10k students all working on the same
  assignment = lots of redundancy
• Students debug locally on scaled down datasets
• Then submit 10k jobs
• Prune the queue aggressively
      – Remove duplicates
      – Detect typical mistakes syntactically; return cached results
      – Global common subexpression elimination (feasible thanks to
        abstractions)




10/9/2012                     Bill Howe, UW                       40
10k Students on 10k GB for $10k
  • Another approach we are considering:                                 Kaggle-
    Kaggle-style Prize assignments
 student date      runtime          output quality notes                                  votes
sarah123 4/23/12 5 min 44 sec      result.txt 45% I removed the dirty data for this run    456
 jane456 4/22/12 3 min 23 sec      result.txt 23% I used gradient descent this time        97
    :
    :
• Pay-to-play: Students submit successful jobs to access leaderboards
• Cast votes for their preferred solution.
• Grade determined by
        f(votes(your_solution), grade(your_solution), grade(solutions_you_voted_for))
• We run the top-k highest ranked solutions on the full size dataset

• Problem: Easy to game the system and just coast
  10/9/2012                               Bill Howe, UW                                   41
http://escience.washington.edu

                                        billhowe@cs.washington.edu




Coursera course:
https://www.coursera.org/course/datasci

Certificate program:
http://www.pce.uw.edu/courses/data-science-intro


10/9/2012                    Bill Howe, UW                           42
Science is becoming a database query problem
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquisition supports many hypotheses)
  – Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
  – Biology: lab automation, high-throughput sequencing,
  – Oceanography: high-resolution models, cheap sensors, satellites




40TB / 2 nights
1 device




                          ~1TB / day
                          100s of devices


    10/9/2012                  Bill Howe, eScience Institute               43
[src: Carol Goble]


                     • Power distribution
                     • 80:20 rule
Popularity / Sales




                        Head




                                             Tail


                                         Products / Results
          First published May 2007, Wired Magazine article 2004
A “Needs Hierarchy” of Science Data Management

“As each need is satisfied, the
next higher level in the hierarchy
dominates conscious functioning.”
                         -- Maslow 43




            full semantic integration

             analytics

            query
            sharing

            storage
10/9/2012                         Bill Howe, UW   45

More Related Content

Similar to Data science curricula at UW

Mining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the WebMining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the WebStefan Dietze
 
Big Data, Beyond the Data Center
Big Data, Beyond the Data CenterBig Data, Beyond the Data Center
Big Data, Beyond the Data CenterGilles Fedak
 
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
Web Science Synergies: Exploring Web Knowledge through the Semantic WebWeb Science Synergies: Exploring Web Knowledge through the Semantic Web
Web Science Synergies: Exploring Web Knowledge through the Semantic WebStefan Dietze
 
From Data to Knowledge - Profiling & Interlinking Web Datasets
From Data to Knowledge - Profiling & Interlinking Web DatasetsFrom Data to Knowledge - Profiling & Interlinking Web Datasets
From Data to Knowledge - Profiling & Interlinking Web DatasetsStefan Dietze
 
Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)Stefan Dietze
 
Research Dataspaces: Pay-as-you-go Integration and Analysis
Research Dataspaces: Pay-as-you-go Integration and AnalysisResearch Dataspaces: Pay-as-you-go Integration and Analysis
Research Dataspaces: Pay-as-you-go Integration and AnalysisUniversity of Washington
 
Service Integration - A Web of Things Perspective
Service Integration - A Web of Things PerspectiveService Integration - A Web of Things Perspective
Service Integration - A Web of Things PerspectiveSimon Mayer
 
What's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked DatasetsWhat's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked DatasetsStefan Dietze
 
Beyond research data infrastructures: exploiting artificial & crowd intellige...
Beyond research data infrastructures: exploiting artificial & crowd intellige...Beyond research data infrastructures: exploiting artificial & crowd intellige...
Beyond research data infrastructures: exploiting artificial & crowd intellige...Stefan Dietze
 
Educating a New Breed of Data Scientists for Scientific Data Management
Educating a New Breed of Data Scientists for Scientific Data Management Educating a New Breed of Data Scientists for Scientific Data Management
Educating a New Breed of Data Scientists for Scientific Data Management Jian Qin
 
A New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScienceA New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScienceUniversity of Washington
 
Open Access Statistics: An Examination how to Generate Interoperable Usage In...
Open Access Statistics: An Examination how to Generate Interoperable Usage In...Open Access Statistics: An Examination how to Generate Interoperable Usage In...
Open Access Statistics: An Examination how to Generate Interoperable Usage In...Daniel Beucke
 
There's no such thing as big data
There's no such thing as big dataThere's no such thing as big data
There's no such thing as big dataAndrew Clegg
 
On Evaluating and Publishing Data Concerns for Data as a Service
On Evaluating and Publishing Data Concerns for Data as a ServiceOn Evaluating and Publishing Data Concerns for Data as a Service
On Evaluating and Publishing Data Concerns for Data as a ServiceHong-Linh Truong
 
TUW- 184.742 Emerging Dynamic Distributed Systems and Challenges for Advanced...
TUW- 184.742 Emerging Dynamic Distributed Systems and Challenges for Advanced...TUW- 184.742 Emerging Dynamic Distributed Systems and Challenges for Advanced...
TUW- 184.742 Emerging Dynamic Distributed Systems and Challenges for Advanced...Hong-Linh Truong
 
Augmented Analytics and Automation in the Age of the Data Scientist
Augmented Analytics and Automation in the Age of the Data ScientistAugmented Analytics and Automation in the Age of the Data Scientist
Augmented Analytics and Automation in the Age of the Data ScientistWhereScape
 
Towards research data knowledge graphs
Towards research data knowledge graphsTowards research data knowledge graphs
Towards research data knowledge graphsStefan Dietze
 

Similar to Data science curricula at UW (20)

Mining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the WebMining and Understanding Activities and Resources on the Web
Mining and Understanding Activities and Resources on the Web
 
Big Data, Beyond the Data Center
Big Data, Beyond the Data CenterBig Data, Beyond the Data Center
Big Data, Beyond the Data Center
 
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
Web Science Synergies: Exploring Web Knowledge through the Semantic WebWeb Science Synergies: Exploring Web Knowledge through the Semantic Web
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
 
From Data to Knowledge - Profiling & Interlinking Web Datasets
From Data to Knowledge - Profiling & Interlinking Web DatasetsFrom Data to Knowledge - Profiling & Interlinking Web Datasets
From Data to Knowledge - Profiling & Interlinking Web Datasets
 
Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)
 
Research Dataspaces: Pay-as-you-go Integration and Analysis
Research Dataspaces: Pay-as-you-go Integration and AnalysisResearch Dataspaces: Pay-as-you-go Integration and Analysis
Research Dataspaces: Pay-as-you-go Integration and Analysis
 
Service Integration - A Web of Things Perspective
Service Integration - A Web of Things PerspectiveService Integration - A Web of Things Perspective
Service Integration - A Web of Things Perspective
 
What's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked DatasetsWhat's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked Datasets
 
Beyond research data infrastructures: exploiting artificial & crowd intellige...
Beyond research data infrastructures: exploiting artificial & crowd intellige...Beyond research data infrastructures: exploiting artificial & crowd intellige...
Beyond research data infrastructures: exploiting artificial & crowd intellige...
 
Educating a New Breed of Data Scientists for Scientific Data Management
Educating a New Breed of Data Scientists for Scientific Data Management Educating a New Breed of Data Scientists for Scientific Data Management
Educating a New Breed of Data Scientists for Scientific Data Management
 
A New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScienceA New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScience
 
Open Access Statistics: An Examination how to Generate Interoperable Usage In...
Open Access Statistics: An Examination how to Generate Interoperable Usage In...Open Access Statistics: An Examination how to Generate Interoperable Usage In...
Open Access Statistics: An Examination how to Generate Interoperable Usage In...
 
Resume
ResumeResume
Resume
 
There's no such thing as big data
There's no such thing as big dataThere's no such thing as big data
There's no such thing as big data
 
On Evaluating and Publishing Data Concerns for Data as a Service
On Evaluating and Publishing Data Concerns for Data as a ServiceOn Evaluating and Publishing Data Concerns for Data as a Service
On Evaluating and Publishing Data Concerns for Data as a Service
 
Data-Intensive Scalable Science
Data-Intensive Scalable ScienceData-Intensive Scalable Science
Data-Intensive Scalable Science
 
TUW- 184.742 Emerging Dynamic Distributed Systems and Challenges for Advanced...
TUW- 184.742 Emerging Dynamic Distributed Systems and Challenges for Advanced...TUW- 184.742 Emerging Dynamic Distributed Systems and Challenges for Advanced...
TUW- 184.742 Emerging Dynamic Distributed Systems and Challenges for Advanced...
 
Augmented Analytics and Automation in the Age of the Data Scientist
Augmented Analytics and Automation in the Age of the Data ScientistAugmented Analytics and Automation in the Age of the Data Scientist
Augmented Analytics and Automation in the Age of the Data Scientist
 
Towards research data knowledge graphs
Towards research data knowledge graphsTowards research data knowledge graphs
Towards research data knowledge graphs
 
Mini-Training: DataViz, data-driven documents and D3.js
Mini-Training: DataViz, data-driven documents and D3.jsMini-Training: DataViz, data-driven documents and D3.js
Mini-Training: DataViz, data-driven documents and D3.js
 

More from University of Washington

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)University of Washington
 
Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceUniversity of Washington
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureUniversity of Washington
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsUniversity of Washington
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsUniversity of Washington
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceUniversity of Washington
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionUniversity of Washington
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingUniversity of Washington
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe University of Washington
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)University of Washington
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaUniversity of Washington
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsUniversity of Washington
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013University of Washington
 
Enabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareEnabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareUniversity of Washington
 

More from University of Washington (20)

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)
 
Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data science
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State Legislature
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD Models
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data Science
 
Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
 
Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity Computing
 
Urban Data Science at UW
Urban Data Science at UWUrban Data Science at UW
Urban Data Science at UW
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
 
Data Science and Urban Science @ UW
Data Science and Urban Science @ UWData Science and Urban Science @ UW
Data Science and Urban Science @ UW
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and Myria
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) Scientists
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013
 
eResearch New Zealand Keynote
eResearch New Zealand KeynoteeResearch New Zealand Keynote
eResearch New Zealand Keynote
 
Enabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareEnabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShare
 
End-to-End eScience
End-to-End eScienceEnd-to-End eScience
End-to-End eScience
 

Recently uploaded

DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

Data science curricula at UW

  • 1. Data Science Curricula at the University of Washington Bill Howe, PhD Director of Research, Scalable Data Analytics University of Washington eScience Institute 10/9/2012 Bill Howe, UW 1
  • 2.
  • 4. The University of Washington eScience Institute • Rationale – The exponential increase in sensors is transitioning all fields of science and engineering from data-poor to data-rich – Techniques and technologies include • Sensors and sensor networks, databases, data mining, machine learning, visualization, cluster/cloud computing – If these techniques and technologies are not widely available and widely practiced, UW will cease to be competitive • Mission – Help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon them • Strategy – Bootstrap a cadre of Research Scientists – Add faculty in key fields – Build out a “consultancy” of students and non-research staff 10/9/2012 Bill Howe, eScience Institute 4
  • 5. eScience Big Data Group Bill Howe, Phd (databases, cloud, data-intensive scalable computing, visualization) Director of Research, Scalable Data Analytics Staff – Seung-Hee Bae, Phd (postdoc, scalable machine learning algorithms) – Dan Halperin, Phd (postdoc; scalable systems) – Sagar Chitnis, Research Engineer (Azure, databases, web services) – (alumna) Marianne Shaw, Phd (hadoop, semantic graph databases) – (alumna) Alicia Key, Research Engineer (visualization, web applications) Students – Scott Moe (2nd yr Phd, Applied Math) – Daniel Perry (2nd yr Phd, HCDE) Partners – CSE DB Faculty: Magda Balazinska, Dan Suciu – CSE students: Paris Koutris, Prasang Upadhyaya, – UW-IT (web applications, QA/support) – Cecilia Aragon, Phd, Associate Professor, HCDE (visualization, scientific applications) 10/9/2012 Bill Howe, UW 5
  • 6. eScience ~= Data Science …modulo application area 10/9/2012 Bill Howe, UW 6
  • 7. 10/9/2012 Bill Howe, UW 7
  • 8. UW Data Science Education Efforts Students Non-Students CS/Informatics Non-Major professionals researchers undergrads grads undergrads grads UWEO Data Science Certificate Graduate Certificate in Big Data CS Data Management Courses eScience workshops Intro to data programming eScience Masters (planned) Coursera Course: Intro to Data Science Previous courses: Scientific Data Management, Graduate CS, Summer 2006, Portland State University Scientific Data Management, Graduate CS, Spring 2010, University of Washington 10/9/2012 Bill Howe, UW 8
  • 9. Huge number of relevant courses, new and existing. 10/9/2012 Bill Howe, UW 9
  • 10. “I worry that the Data Scientist role is like the mythical “webmaster” of the 90s: master of all trades.” -- Aaron Kimball, CTO Wibidata 10/9/2012 Bill Howe, UW 10
  • 11. Breadth tools abstractions Hadoop MapReduce PostgreSQL Relational Algebra glm(…) in R Logistic Regression Tableau InfoVis 10/9/2012 Bill Howe, UW 11
  • 12. Depth structures statistics Management Analysis Relational Algebra Linear Algebra Standards ad hoc files 10/9/2012 Bill Howe, UW 12
  • 13. Scale desktop cloud main memory distributed R Hadoop local files S3, Azure Storage 10/9/2012 Bill Howe, UW 13
  • 14. Target hackers analysts Assume Assume little or no proficiency in programming Python, Java, R 10/9/2012 Bill Howe, UW 14
  • 15. Breadth tools abstr. Depth structs stats Scale desk cloud Target hackers analysts 10/9/2012 Bill Howe, UW 15
  • 16. tools abstr. structs stats desk cloud hackers analysts 10/9/2012 Bill Howe, UW 16
  • 17. William W. Cohen Machine tools abstr. tools abstr. Learning structs stats structs stats desk cloud desk cloud hackers analysts hackers analysts 10/9/2012 Bill Howe, UW 17
  • 18. Dan Magda Suciu Balazinska tools abstr. CSE 344 Introduction to Data Management structs stats desk cloud hackers analysts 10/9/2012 Bill Howe, UW 18
  • 19. Jeff Hammerbacher Mike Franklin tools abstr. structs stats desk cloud hackers analysts 10/9/2012 Bill Howe, UW 19
  • 20. Introduction to Data Science Rachel Schutt tools abstr. structs stats desk cloud hackers analysts 10/9/2012 Bill Howe, UW 20
  • 21. tools abstr. structs stats desk cloud hackers analysts 10/9/2012 Bill Howe, UW 21
  • 22. Bill Howe Richard Sharp Roger Barga tools abstr. structs stats desk cloud hackers analysts 10/9/2012 Bill Howe, UW 22
  • 23. Bill Howe tools abstr. structs stats desk cloud hackers analysts 10/9/2012 Bill Howe, UW 23
  • 24. tools abstr. structs stats desk cloud hackers analysts 10/9/2012 Bill Howe, UW 24
  • 25. tools abstr. What goes around comes around • 2004 Dean et al. MapReduce • 2008 Hadoop 0.17 release • 2008 Olston et al. Pig: Relational Algebra on Hadoop • 2008 DryadLINQ: Relational Algebra in a Hadoop-like system • 2009 Thusoo et al. HIVE: SQL on Hadoop • 2009 Hbase: Indexing for Hadoop • 2010 Dietrich et al. Schemas and Indexing for Hadoop • 2012 Transactions in HBase (plus VoltDB, other NewSQL systems) • But also some permanent contributions: – Fault tolerance – Schema-on-Read – User-defined functions that don’t suck 10/9/2012 Bill Howe, UW 25
  • 26. tools abstr. What are the abstractions of data science? “Data Jujitsu” “Data Wrangling” Translation: “We have no idea what “Data Munging” this is all about” 10/9/2012 Bill Howe, UW 26
  • 27. tools abstr. What are the abstractions of data science? matrices and linear algebra? relations and relational algebra? objects and methods? files and scripts? data frames and functions? 10/9/2012 Bill Howe, UW 27
  • 28. structs stats “80% of analytics is sums and averages” -- Aaron Kimball, wibidata God created the integers; all else is the work of man Codd created relations; all else is the work of man 10/9/2012 Bill Howe, UW 28
  • 29. structs stats Three types of tasks: 1) Preparing to run a model “80% of the work” -- Aaron Kimball Gathering, cleaning, integrating, restructuring, transformi ng, loading, filtering, deleting, combining, merging, verifyi ng, extracting, shaping, massaging 2) Running the model 3) Interpreting the results The other 80% of the work 10/9/2012 Bill Howe, UW 29
  • 30. Problem structs stats How much time do you spend “handling data” as opposed to “doing science”? Mode answer: “90%”
  • 31. structs stats src: Christian Grant, MADSkills 10/9/2012 Bill Howe, UW 31
  • 32. (Sparse) Matrix Multiply in SQL structs stats src: Christian Grant, MADSkills 10/9/2012 Bill Howe, UW 32
  • 33. Aside: Schema-on-Write vs. Schema-on-Read • A schema* is a shared consensus about some universe of discourse • At the frontier of research, this shared consensus does not exist, by definition • Any schema that does emerge will change frequently, by definition • Data found “in the wild” will typically not conform to any schema, by definition • But this doesn’t mean we have to live with ad hoc scripts and files • My answer: Schema-later, “lazy schemification” * ontology/metadata standard/controlled vocabulary/etc. 10/9/2012 Bill Howe, UW 33
  • 34. Data Access Hitting a Wall desk cloud Current practice based on data download (FTP/GREP) Will not scale to the datasets of tomorrow • You can GREP 1 MB in a second • You can FTP 1 MB in 1 sec • You can GREP 1 GB in a minute • You can FTP 1 GB / min (~1$) • You can GREP 1 TB in 2 days • You can GREP 1 PB in 3 years. • … 2 days and 1K$ • … 3 years and 1M$ • Oh!, and 1PB ~5,000 disks • At some point you need indices to limit search parallel data search and analysis • This is where databases can help [slide src: Jim Gray] 34
  • 35. hackers analysts US faces shortage of 140,000 to 190,000 people “with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.” --Mckinsey Global Institute 10/9/2012 Bill Howe, UW 35
  • 36. hackers analysts Biologists are beginning to write very complex queries (rather than relying on staff programmers) Example: Computing the overlaps of two sets of blast results SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp , x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp , w.category as nc_category , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1 WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1 We see thousands of WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) queries written by THEN w.end_bp - x.start_bp + 1 END AS len_overlap non-programmers FROM [koesterj@washington.edu].[hotspots_deserts.tab] x INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w ON x.chr = w.chr WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) ORDER BY x.strain, x.chr ASC, x.start_bp ASC
  • 37. UW Curricular Activities Students Non-Students CS/Informatics Non-Major professionals researchers undergrads grads undergrads grads UWEO Data Science Certificate Graduate Certificate in Big Data Database Courses eScience workshops Intro to data programming eScience Masters (planned) Coursera Course Intro to Data Science 10/9/2012 Bill Howe, UW 37
  • 38. How do you deliver hands-on big data experience to 10k students? Cloud vendors’ free tiers? – 1 micro instance is not “big data” Cloud vendors’ academic discounts? (e.g., Amazon’s education grants) – $100 / head = $1M – invites abuse: free credits with no obligation to complete course Out of pocket? – Don’t want to be the only non-free Coursera course – Unclear that we can require it (perhaps analogous to a textbook?) 10/9/2012 Bill Howe, UW 38
  • 39. 10k Students on 10k GB for $10k • Requirements – Inexpensive: Need a fixed, small budget; O(10k) maximum – Fair: All students need to be able to complete the assignment • Non-solutions – Fixed cluster • Fairness problems; no quality of service guarantees – Autoscaling cluster • No upper bound to cost – Budget cap (via, e.g., Amazon’s IAM) • Fairness problems: Different students consume different levels of resources depending on background, etc. 10/9/2012 Bill Howe, UW 39
  • 40. 10k Students on 10k GB for $10k • Key idea: 10k students all working on the same assignment = lots of redundancy • Students debug locally on scaled down datasets • Then submit 10k jobs • Prune the queue aggressively – Remove duplicates – Detect typical mistakes syntactically; return cached results – Global common subexpression elimination (feasible thanks to abstractions) 10/9/2012 Bill Howe, UW 40
  • 41. 10k Students on 10k GB for $10k • Another approach we are considering: Kaggle- Kaggle-style Prize assignments student date runtime output quality notes votes sarah123 4/23/12 5 min 44 sec result.txt 45% I removed the dirty data for this run 456 jane456 4/22/12 3 min 23 sec result.txt 23% I used gradient descent this time 97 : : • Pay-to-play: Students submit successful jobs to access leaderboards • Cast votes for their preferred solution. • Grade determined by f(votes(your_solution), grade(your_solution), grade(solutions_you_voted_for)) • We run the top-k highest ranked solutions on the full size dataset • Problem: Easy to game the system and just coast 10/9/2012 Bill Howe, UW 41
  • 42. http://escience.washington.edu billhowe@cs.washington.edu Coursera course: https://www.coursera.org/course/datasci Certificate program: http://www.pce.uw.edu/courses/data-science-intro 10/9/2012 Bill Howe, UW 42
  • 43. Science is becoming a database query problem Old model: “Query the world” (Data acquisition coupled to a specific hypothesis) New model: “Download the world” (Data acquisition supports many hypotheses) – Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS) – Biology: lab automation, high-throughput sequencing, – Oceanography: high-resolution models, cheap sensors, satellites 40TB / 2 nights 1 device ~1TB / day 100s of devices 10/9/2012 Bill Howe, eScience Institute 43
  • 44. [src: Carol Goble] • Power distribution • 80:20 rule Popularity / Sales Head Tail Products / Results First published May 2007, Wired Magazine article 2004
  • 45. A “Needs Hierarchy” of Science Data Management “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow 43 full semantic integration analytics query sharing storage 10/9/2012 Bill Howe, UW 45

Editor's Notes

  1. eSciec
  2. R data frames or matrices?Hadoop or MapReduce?Greenplum or Relational Algebra?
  3. How much statistical sophistcation do you emphasize? Select, project, joinslicing and dicing tableskey value storesvsLinear algebraEMGradient descentStatistical pattern recognitionUnsupervised learning
  4. Using R to analyze survey resultsUsing Hadoop to analyze clickstream
  5. How much programming background do you assume? How much do you teach?
  6. “Data Jujitsu”“Data Wrangling”“Data Munging”
  7. R and files vs. databasesHadoop and friends vs. databasesGod created …. Codd created….
  8. (granted we had a minute for Bill (clearly Bill) to describe this new eScience movement)We want to give a little background of our project before we launch into it, so we will discuss the problem we are trying to solve.Essentially, we want to remove the speed-bump of data handling from the scientists.
  9. PortableConciseAutomatically parallelizableComposable with cleaning, filtering, munging, etc.
  10. The Schema-on-read folks are winning this debate: there’s just too much heterogeity
  11. Our collaborators tell us that loading data into memory with R is the major bottleneck.It actually changes the science they can do:I would say that we can start answering questions about macro-ecology (study of relationships between organisms and their environment at large spatial scales).
  12. But the fundamental error made by computer scientists, and it’s probably the fault of the database community, is to assume that semantic integration is a prerequisite for query and analytics.It isn’t. It’s the final goal, not some insignificant preamble to analysis.Domain scientists know this – they take a very pragmatic approach. They write code to do data handling, they write code to do analytics, and they do data integration on the fly in a task-specific way.So one of my goals is to convince you of is that you can decouple declarative query from semantic integration, and doing so gives scientists a very powerful tool.