Bill Howe, PhD
Director of Research,
Scalable Data Analytics
University of Washington
eScience Institute
eScience and Data...
2
“It’s a great time to be a data geek.”
-- Roger Barga, Microsoft Research
“The greatest minds of my generation are tryin...
Jim Gray
The University of Washington
eScience Institute
• Rationale
– The exponential increase in physical and virtual sensing tec...
7/8/2013 Bill Howe, UW 5
# of bytes
# of data sources
telescopes
spectra
LSST (~100PB; images, spectra)
PanSTARRS (~40PB; im...
PDB
GenBank
UniProt
Pfam
Spreadsheets, Notebooks
Local, Lost
High throughput experimental methods
Industrial scale
Commons...
Types of Data Stored
5.20%
11.70%
15.20%
20.50%
22.80%
48.00%
61.20%
71.60%
0% 20% 40% 60% 80% 100%
Quantitative/tabular/s...
Where do you store your data?
src: Conversations with Research Leaders (2008)
src: Faculty Technology Survey (2011)
5%
6%
...
How much data do you work with?
Wright 2013
The Long Tail is worthy of investment
Jean-Michel Fortin, David J. Currie, Big Science vs. Little Science:
How Scientific ...
7/8/2013 Bill Howe, UW 12
“Ensuring a successful future for the biological sciences will require
restraint in the growth o...
100x more impact
How much more productive is a great
programmer than a mediocre programmer?
Problem
How much time do you spend “handling
data” as opposed to “doing science”?
Mode answer: “90%”
7/8/2013 Bill Howe, U...
7/8/2013 Bill Howe, UW
Simple Example
###query length COG hit #1 e-value #1 identity #1 score #1 hit length #1 description...
7/8/2013 Bill Howe, UW 18
“The future is already here; it’s just
not very evenly distributed.”
-- William Gibson
Three Avenues of Attack
• Technological
• Educational
• Organizational
7/8/2013 Bill Howe, UW 19
SQLSHARE: QUERY-AS-A-SERVICE
Technology for the Long Tail
Alex Szalay Jim Gray
How can we deliver 1000 little SDSSs?
1) Upload data “as is”
Cloud-hosted; no need to
install or design a database;
no pre-defined schema
2) Analyze data with S...
7/8/2013 23
Browse English descriptions
7/8/2013 24
Save the results, share them with others
Edit a Query
VizDeck
Python
Client
“Flagship”
SQLShare App
(Python) on EC2
SQLShare REST API
Excel
Addin
R
Client
Spreadsheet
Crawler
A...
METADATA
Aside
7/8/2013 Bill Howe, UW 26
What about metadata?
• Claim: Comprehensive metadata standards*
represent a shared consensus about the world
• At the fron...
7/8/2013 Bill Howe, UW 28
“As each need is satisfied, the
next higher level in the hierarchy
dominates conscious functioni...
A “Needs Hierarchy” of Science Data Management
storage
sharing
7/8/2013 Bill Howe, UW 29
query
curation
analytics
“As each...
A “Needs Hierarchy” of Science Data Management
storage
sharing
7/8/2013 Bill Howe, UW 30
curation
query
analytics
“As each...
Views
• A view is just a query with a name
• We can use the view just like a real table
7/8/2013 Bill Howe, UW 31
Why can ...
Scientific data management reduces to sharing views
• Integrate data from multiple sources?
– joins and unions with views
...
Deeply nested hierarchies of views
Provenance
Controlled Sharing
Implicit re-execution
Bring the computation
to the data
• Rich query services, not data
cemeteries
• Avoid “Transloading:” Pointless
data moveme...
USE CASES
7/8/2013 Bill Howe, UW 35
Laser
Microscope Objective
Pinhole Lens
Nozzle d1
d2
FSC
(Forward scatter)
Orange fluo
Red fluo
SeaFlow
Francois
Ribalet...
SeaFlow
[Figure: flow cytometry scatter plots, log-scale axes from 10^0 to 10^4; panel "ps3.fcs…Focus" showing D1/FSC and D2/FSC]
SeaFlow Francois
Ribalet
Jarred
Swalwell
Ginger
Armbrust
Script-oriented computing was killing them
• Scripts (typically in R) must be pre-shared with all collaborators
• When the...
Howe, et al., CISE 2012
Workflow Systems
• Scripts (typically in R) must be pre-shared with all
collaborators
• When the data changes, everybody h...
Steven
Roberts
SQL as a lab notebook:
http://bit.ly/16Xj2JP
Calculate #
methylated CGs
Calculate #
all CGs
Calculate
methy...
Halperin, Howe, et al. SSDBM 2013
Robin
Kodner
7/8/2013 Bill Howe, UW 44
“I have had two students who are struggling with R come up
and tell me how much mor...
Bill Howe, UW
“An undergraduate student and I are working with gigabytes of tabular data
derived from analysis of protein ...
SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
, x.end_bp as snp_end_bp, w.start_bp as nc_star...
DATA SCIENCE
7/8/2013 Bill Howe, UW 47
Education
7/8/2013 Bill Howe, UW 48
Drew Conway’s Data Science Venn Diagram
7/8/2013 Bill Howe, UW 49
7/8/2013 Bill Howe, UW 50
“I worry that the Data Scientist role is like
the mythical “webmaster” of the 90s:
master of all...
7/8/2013 Bill Howe, UW 51
But what are the abstractions
of data science?
tools abstr.
“Data Jujitsu”
“Data Wrangling”
“Dat...
UW Data Science Education Efforts
7/8/2013 Bill Howe, UW 52
Students Non-Students
CS/Informatics Non-Major
professionals r...
7/8/2013 Bill Howe, UW 53
Bill Howe
“Very very interesting tht there is high correlation
in the way the election results were being
announced and the way the ...
“Darth Grader”
… I was frankly amazed that you were so fast in responding to my
query, in spite of the class being so huge. I’d like to t...
INTELLECTUAL
INFRASTRUCTURE
Institution
7/8/2013 Bill Howe, UW 57
Seek work in “Pasteur’s Quadrant”
Considerations of use
Quest for fundamental understanding
Pasteur
Edison
Bohr
7/8/2013 Bill Howe, UW 59
Multiple modes of interaction, multiple time scales
1-2 years and up 1-2 weeks and down 1-2 quart...
2018
2008
Some local observations:
• Big data work exposes common ground
• Every job is becoming “data scientist”
• More π...
7/8/2013 Bill Howe, UW 61
On NoSQL
7/8/2013 Bill Howe, UW 62
http://sqlshare.escience.washington.edu
billhowe@cs.washington.edu
http://escience.washington.edu
7/8/2013 Bill Howe, UW 63
WHY SQL?
7/8/2013 Bill Howe, UW 64
5/18/10 Garret Cole, eScience Institute
What’s the point?
• Conventional wisdom says “Science data isn’t relational”
– Thi...
7/8/2013 Bill Howe, UW 66
God made the integers;
all else is the work of man.
(Leopold Kronecker, 19th Century Mathematici...
7/8/2013 Bill Howe, UW 67
Key Idea: Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = ...
Data
Curation
Data
Management
Data
Analytics
Cyberinfrastructure
Database &
Systems
Researchers
Stats, ML, and Viz
Library...
SQLShare as a CS Research Platform
• Automatic “Starter” Queries
– (Bill Howe, Garret Cole, Nodira Khoussainova, Leilani B...
Two Failure Modes Serving the Long Tail
over-abstraction
“uber-system”
“neither fish nor fowl”
tries to address so many
re...
A stripped-down version of Jim Gray’s
“20 questions” methodology
Experimental Engagement Algorithm for the Long Tail
1. Ge...

eResearch New Zealand Keynote


Bill Howe gave a keynote at eResearch New Zealand in early July 2013

Published in: Technology
  • Who we are. Education: MOOC structure, quotes. Research: Long tail: it’s got to get dramatically more accessible. Ex: Using tools from 10 years ago. Ex: Spreadsheets. EaaS. Databases: not used; make it a service, used. Vis: shareable; Shiny; R; SQLShare; Tools; VizDeck. Myria: scale way up. Aside on Datalog / RA. Aside on algebraic optimization.
    Goal: an online “home” for every dataset. Repositories aren’t enough; they are data museums. Data needs to be manipulable, not just discoverable. Ex: OData (but no joins). Ex: social science thing (no query at all). Ex: Fusion Tables (great, but limited. Example). Ex: Data.gov. Ex: BigQuery (excellent; may subsume us one day. But limited SQL, and limited support for data ingest. No views.)
    Workflow systems: too much focus on the verbs rather than the nouns. Not as reusable as they seem. They make hard things simple and simple things hard, and really hard things are really hard. Databases make simple things simple and hard things hard, and really hard things are really hard. But everything scales. No code. “Complex workflows” are usually 90% simple table operations (“data munging”); most of this is expressible in RA, or goes away entirely. 80% of data science is mucking about.
    Organizational innovation. Generation 1: Research Scientists, augmented with faculty. The research scientists become 2nd-class citizens, with primary allegiance to some home department. The faculty become “regular” faculty, with primary allegiance to some home department. In both cases, respect is the coin of the realm. “Like regular faculty, but different” == “Not like regular faculty”. Need new metrics.
    Generation 2: Observation: We are producing people who are motivated by building things rather than publishing papers. Observation: In the US, to a first approximation, we lose 100% of these people to industry. Observation: Most papers, especially in CS, don’t actually work. Observation: A startup can build a prototype in 2 months for $10k that would take a research project 2 years and $100k. Conclusion: We need to stop writing papers, build things that work, and do it fast. Problem: We need a career path for these people. Maybe enlightened institutions will offer faculty positions. But we want to try something different. Incubator:
  • So what is Big Data? You’ll hear about the three V’s: Volume, Variety, Velocity. I like to think about this in terms of # of bytes vs. # of unique sources.
  • Insidious pressure. Sample datasets to make sure they fit in memory.
  • So the long tail is real, and they are struggling with data, and they have high impact. So I don’t like the term “long tail”. They are the “breakaway group”. Analogy: Big science is the relative safety of the peloton: riders all draft off of one another, trading the lead, making incremental progress. Breakaway groups have to do a lot more work because they have a smaller team. But this is where you typically find the strongest riders, and you have to break away at some point to make the big discoveries.
  • So here’s what scares the hell out of me: These breakaway riders are using shitty bikes, and they can’t afford a crew. The minimum cost of doing data-intensive science is going up, but worse, they don’t know it. A brilliant researcher with a brilliant student with a brilliant idea is no longer enough. You need the cyberinfrastructure, and you need the intellectual infrastructure.
  • SDSS offered one solution: put the data into a relational database and serve it on the web. But SDSS ultimately involved a single source of data: the telescope. It involved a carefully engineered schema, designed in part by a Turing Award-winning database expert, and a significant computing and application infrastructure. So we started asking: How can we support 1000 little “SDSSs” for small- and medium-sized projects? We started thinking about a new tool. We can’t afford to build a database + applications from scratch for every project, and nobody wants to maintain such a system anyway. Most importantly, the data comes from all over the place instead of a single source like SDSS; we can’t pretend the data will arrive clean and coherent.
  • So we developed SQLShare to support a very simple workflow: you can upload data “as is” from spreadsheets or anything. It’s in the cloud, so no need to install or design a database. You can immediately begin writing queries, right in your browser, and put queries on top of queries on top of queries. Then you can share the results online: your colleagues can browse the science questions and see the SQL that answers them.
    Key ideas to get data in: a) Use the cloud to avoid having to install and run a database. b) Give up on the schema: just throw your data in “as is” and do “lazy integration.” c) Use some magic to automate parsing, integration, recommendations, and more.
    Key ideas to get data out: a) Associate science questions (in English) with each SQL query, which makes them easy to understand and easy to find. b) Saving and reusing queries is a first-class requirement. Given an example, it’s easy to modify it into an “adjacent” query. c) Expose the whole system through a REST API to make it easy to bring new client applications online.
  • List of English descriptions
  • But the fundamental error made by computer scientists, and it’s probably the fault of the database community, is to assume that strong semantic integration is a prerequisite for query and analytics. It isn’t. It’s the final goal, not some insignificant preamble to analysis. Domain scientists know this; they take a very pragmatic approach. They write code to do data handling, they write code to do analytics, and they do data integration on the fly in a task-specific way. So one of my goals is to convince you that you can decouple declarative query from semantic integration, and that doing so gives scientists a very powerful tool.
  • Advantage/ inconvenient sheath fluid alignment particles/laser. Sheath fluid replacement. Loading samples to the instrument.Advantage/ inconvenient sheathless
  • I’ll give you a couple of examples of use cases we are seeing. Robin Kodner at Friday Harbor Labs in the San Juan Islands works on a program called Sound Experience that gives K-12 students and undergrads experience sailing, but reuses this ship time to do some real science and collect samples from areas that would otherwise be very expensive to reach.
  • “Data Jujitsu”, “Data Wrangling”, “Data Munging”
  • 100k students; 50k active and watching videos regularly; 15k completing the assignments. Programming assignments, peer review.
  • I see UW-IT and eScience “joined at the hip” in providing the intellectual infrastructure needed to support science.
  • Finding the balance
  • eResearch New Zealand Keynote

    1. 1. Bill Howe, PhD Director of Research, Scalable Data Analytics University of Washington eScience Institute eScience and Data Science at the UW eScience Institute 7/8/2013 Bill Howe, UW 1
    2. 2. 2 “It’s a great time to be a data geek.” -- Roger Barga, Microsoft Research “The greatest minds of my generation are trying to figure out how to make people click on ads” -- Jeff Hammerbacher, co-founder, Cloudera
    3. 3. Jim Gray
    4. 4. The University of Washington eScience Institute • Rationale – The exponential increase in physical and virtual sensing tech is transitioning all fields of science and engineering from data-poor to data-rich – Techniques and technologies include • Sensors and sensor networks, data management, data mining, machine learning, visualization, cluster/cloud computing • Mission – Help position the University of Washington and partners at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon them. • Strategy – Bootstrap a cadre of Research Scientists – Add faculty in key fields – Build out a “consultancy” of students and non-research staff 7/8/2013 Bill Howe, UW 4
    5. 5. 7/8/2013 Bill Howe, UW 5 # of bytes # of data sources telescopes spectra LSST (~100PB; images, spectra) PanSTARRS (~40PB; images, trajectories) OOI (~50TB/year; sims, RSN) IOOS (~50TB/year; sims, satellite, gliders, AUVs, vessels, more) CMOP (~10TB/year; sims, stations, gliders, AUVs, vessels, more) SDSS (~400TB; images, spectra, catalogs) n-body sims models AUVs stations cruises, CTDs flow cytometry gliders ADCP satellites Astronomy Ocean Sciences 3 V’s of Big Data Volume Variety Velocity
    6. 6. PDB GenBank UniProt Pfam Spreadsheets, Notebooks Local, Lost High throughput experimental methods Industrial scale Commons based production Publicly data sets Cherry picked results Preserved CATH, SCOP (Protein Structure Classification) ChemSpider Long Tail of Research Data [src: Carol Goble]
    7. 7. Types of Data Stored 5.20% 11.70% 15.20% 20.50% 22.80% 48.00% 61.20% 71.60% 0% 20% 40% 60% 80% 100% Quantitative/tabular/statistical Text (literature, transcriptions, field notes) Images (photographs, maps) Video recordings Audio recordings Multimedia digital objects Geo-tagged objects/ spatial data Other Lewis et al 2008 src: Conversations with Research Leaders (2008)
    8. 8. Where do you store your data? src: Conversations with Research Leaders (2008) src: Faculty Technology Survey (2011) 5% 6% 12% 27% 41% 66% 87% 0% 20% 40% 60% 80% 100% Other Department-managed data center External (non-UW) data center Server managed by research group Department-managed server External device (hard drive, thumb drive) My computer Lewis et al 2011
    9. 9. How much data do you work with? Wright 2013
    10. 10. The Long Tail is worthy of investment Jean-Michel Fortin, David J. Currie, Big Science vs. Little Science: How Scientific Impact Scales with Funding. PLoS ONE 8(6) log(NSERC grants) ($) log(NSERC + CIHR grants) ($)
    11. 11. The Long Tail is worthy of investment Jean-Michel Fortin, David J. Currie, Big Science vs. Little Science: How Scientific Impact Scales with Funding. PLoS ONE 8(6) “In sum, greater productivity is not strongly related to greater funding.” “…the best article of one rich researcher received, on average, 14% fewer citations than the best article from any random pair of researchers, each of whom received only half as much funding!” “if maximizing the total impact of the entire pool of grantees is the goal, then the “few big” strategy would be less effective than the “many small” strategy.”
    12. 12. 7/8/2013 Bill Howe, UW 12 “Ensuring a successful future for the biological sciences will require restraint in the growth of large centers and -omics-like projects, so as to provide more financial support for the critical work of innovative small laboratories striving to understand the wonderful complexity of living systems.” Alberts B (2012) The End of “Small Science”? Science 337: 1583 The long tail is worthy of investment
    13. 13. 7/3/13 13
    14. 14. 7/8/2013 Bill Howe, UW 14
    15. 15. 100x more impact How much more productive is a great programmer than a mediocre programmer?
    16. 16. Problem How much time do you spend “handling data” as opposed to “doing science”? Mode answer: “90%” 7/8/2013 Bill Howe, UW 16
    17. 17. 7/8/2013 Bill Howe, UW Simple Example ###query length COG hit #1 e-value #1 identity #1 score #1 hit length #1 description #1 chr_4[480001-580000].287 4500 chr_4[560001-660000].1 3556 chr_9[400001-500000].503 4211 COG4547 2.00E-04 19 44.6 620 Cobalamin biosynthesis protein C chr_9[320001-420000].548 2833 COG5406 2.00E-04 38 43.9 1001 Nucleosome binding factor SPN, chr_27[320001-404298].20 3991 COG4547 5.00E-05 18 46.2 620 Cobalamin biosynthesis protein C chr_26[320001-420000].378 3963 COG5099 5.00E-05 17 46.2 777 RNA-binding protein of the Puf fa chr_26[400001-441226].196 2949 COG5099 2.00E-04 17 43.9 777 RNA-binding protein of the Puf fa chr_24[160001-260000].65 3542 chr_5[720001-820000].339 3141 COG5099 4.00E-09 20 59.3 777 RNA-binding protein of the Puf fa chr_9[160001-260000].243 3002 COG5077 1.00E-25 26 114 1089 Ubiquitin carboxyl-terminal hydr chr_12[720001-820000].86 2895 COG5032 2.00E-09 30 60.5 2105 Phosphatidylinositol kinase and p chr_12[800001-900000].109 1463 COG5032 1.00E-09 30 60.1 2105 Phosphatidylinositol kinase and p chr_11[1-100000].70 2886 chr_11[80001-180000].100 1523 ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome id query hit e_value identity_ score query_start query_end hit_start hit_end hit_length 1 FHJ7DRN01A0TND.1 COG0414 1.00E-08 28 51 1 74 180 257 285 2 FHJ7DRN01A1AD2.2 COG0092 3.00E-20 47 89.9 6 85 41 120 233 3 FHJ7DRN01A2HWZ.4 COG3889 0.0006 26 35.8 9 94 758 845 872 … 2853 FHJ7DRN02HXTBY.5 COG5077 7.00E-09 37 52.3 3 77 313 388 1089 2854 FHJ7DRN02HZO4J.2 COG0444 2.00E-31 67 127 1 73 135 207 316 … 3566 FHJ7DRN02FUJW3.1 COG5032 1.00E-09 32 54.7 1 75 1965 2038 2105 … COGAnnotation_coastal_sample.txt SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit 17
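The join on slide 17 can be tried locally with a minimal sketch like the one below. It assumes Python's built-in `sqlite3` module rather than SQLShare itself, and the table layouts are a hypothetical subset of the columns shown on the slide (only `COG_hit` and `hit` are named in the slide's query). The rows are copied from the slide's sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, reduced schemas: only a few of the slide's columns are kept.
conn.executescript("""
    CREATE TABLE Phaeo_genome (query TEXT, length INTEGER, COG_hit TEXT, e_value REAL);
    CREATE TABLE coastal_sample (id INTEGER, query TEXT, hit TEXT, e_value REAL);
""")

# A handful of rows taken from the slide; the last genome row has no COG hit.
genome_rows = [
    ("chr_9[400001-500000].503", 4211, "COG4547", 2.00e-04),
    ("chr_9[320001-420000].548", 2833, "COG5406", 2.00e-04),
    ("chr_9[160001-260000].243", 3002, "COG5077", 1.00e-25),
    ("chr_12[720001-820000].86", 2895, "COG5032", 2.00e-09),
    ("chr_11[1-100000].70", 2886, None, None),
]
sample_rows = [
    (1, "FHJ7DRN01A0TND.1", "COG0414", 1.00e-08),
    (2853, "FHJ7DRN02HXTBY.5", "COG5077", 7.00e-09),
    (3566, "FHJ7DRN02FUJW3.1", "COG5032", 1.00e-09),
]
conn.executemany("INSERT INTO Phaeo_genome VALUES (?, ?, ?, ?)", genome_rows)
conn.executemany("INSERT INTO coastal_sample VALUES (?, ?, ?, ?)", sample_rows)

# The slide's join: pair genome ORFs with sample reads that hit the same COG.
shared = conn.execute("""
    SELECT p.query, c.query, p.COG_hit
    FROM Phaeo_genome p, coastal_sample c
    WHERE p.COG_hit = c.hit
    ORDER BY p.COG_hit
""").fetchall()

for genome_query, sample_query, cog in shared:
    print(cog, genome_query, sample_query)
```

Rows without a COG hit drop out of the result on their own, because a NULL never compares equal in SQL.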
    18. 18. 7/8/2013 Bill Howe, UW 18 “The future is already here; it’s just not very evenly distributed.” -- William Gibson
    19. 19. Three Avenues of Attack • Technological • Educational • Organizational 7/8/2013 Bill Howe, UW 19
    20. 20. SQLSHARE: QUERY-AS-A-SERVICE Technology for the Long Tail
    21. 21. Alex Szalay Jim Gray How can we deliver 1000 little SDSSs?
    22. 22. 1) Upload data “as is” Cloud-hosted; no need to install or design a database; no pre-defined schema 2) Analyze data with SQL Right in your browser, writing queries on top of queries on top of queries ... SELECT hit, COUNT(*) AS cnt FROM tigrfam_surface GROUP BY hit ORDER BY cnt DESC 3) Share the results Queries on queries on queries…
    23. 23. 7/8/2013 23 Browse English descriptions
    24. 24. 7/8/2013 24 Save the results, share them with others Edit a Query
    25. 25. VizDeck Python Client “Flagship” SQLShare App (Python) on EC2 SQLShare REST API Excel Addin R Client Spreadsheet Crawler ASP.NET OAuth2 WCF
    26. 26. METADATA Aside 7/8/2013 Bill Howe, UW 26
    27. 27. What about metadata? • Claim: Comprehensive metadata standards* represent a shared consensus about the world • At the frontier of research, this shared consensus does not exist, by definition • Any consensus that does emerge will change frequently, by definition • Data found “in the wild” will typically not conform to any standard, by definition So are we stuffed? * ontology / schema / controlled vocabulary / etc. 7/8/2013 Bill Howe, UW 27
    28. 28. 7/8/2013 Bill Howe, UW 28 “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow 43 Maslow’s Needs Hierarchy
    29. 29. A “Needs Hierarchy” of Science Data Management storage sharing 7/8/2013 Bill Howe, UW 29 query curation analytics “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow 43
    30. 30. A “Needs Hierarchy” of Science Data Management storage sharing 7/8/2013 Bill Howe, UW 30 curation query analytics “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow 43
    31. 31. Views • A view is just a query with a name • We can use the view just like a real table 7/8/2013 Bill Howe, UW 31 Why can we do this? Because we know that every query returns a relation: We say that the language is “algebraically closed”
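Slide 31's point that a view is a named query, usable like a table, can be sketched as follows. This is an illustration using Python's built-in `sqlite3` module, not SQLShare. The table name `tigrfam_surface` comes from slide 22, while its columns, the view names, and the row values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tigrfam_surface (sample TEXT, hit TEXT);
    INSERT INTO tigrfam_surface VALUES
        ('s1', 'TIGR00001'), ('s2', 'TIGR00001'), ('s3', 'TIGR00001'),
        ('s1', 'TIGR00002'), ('s2', 'TIGR00002'),
        ('s1', 'TIGR00003');

    -- A view is just a query with a name.
    CREATE VIEW hit_counts AS
        SELECT hit, COUNT(*) AS cnt FROM tigrfam_surface GROUP BY hit;

    -- Because every query returns a relation, a view can be queried
    -- like a table, including by another view.
    CREATE VIEW common_hits AS
        SELECT hit, cnt FROM hit_counts WHERE cnt >= 2;
""")

common = conn.execute(
    "SELECT hit, cnt FROM common_hits ORDER BY cnt DESC, hit"
).fetchall()

# Views are re-evaluated on read, so new base data shows up with no re-run step.
conn.execute("INSERT INTO tigrfam_surface VALUES ('s2', 'TIGR00003')")
updated = conn.execute(
    "SELECT hit, cnt FROM common_hits ORDER BY cnt DESC, hit"
).fetchall()

print(common)
print(updated)
```

The second read picks up the new row without anyone re-running a script, which is the "implicit re-execution" the deck mentions for nested views.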
    32. 32. Scientific data management reduces to sharing views • Integrate data from multiple sources? – joins and unions with views • Standardize on units, apply naming conventions? – rename columns, apply functions with views • Attach metadata? – add new tables with descriptive names, add new columns with views • Data cleaning, quality control? – hide bad values with views • Maintain provenance? – inspect view dependencies • Propagate updates? – view maintenance • Protect sensitive data? – expose subsets with views (assuming views carry permissions) 7/8/2013 Bill Howe, UW 32
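Several items on slide 32 (standardizing units, renaming columns, hiding bad values, inspecting provenance) can be illustrated with a single view. The sketch below assumes Python's built-in `sqlite3` module. The table `raw_casts`, its columns, and the `-999.0` sentinel are hypothetical and do not come from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_casts (station TEXT, depth_ft REAL, temp_c REAL);
    INSERT INTO raw_casts VALUES
        ('A', 10.0, 12.5),
        ('A', 20.0, -999.0),   -- hypothetical sentinel for a failed reading
        ('B', 30.0, 11.0);

    -- One view standardizes units, applies a naming convention,
    -- and hides bad values, without touching the uploaded data.
    CREATE VIEW clean_casts AS
        SELECT station,
               depth_ft * 0.3048 AS depth_m,
               temp_c AS temperature_c
        FROM raw_casts
        WHERE temp_c <> -999.0;
""")

rows = conn.execute(
    "SELECT station, depth_m, temperature_c FROM clean_casts ORDER BY station"
).fetchall()

# Provenance by inspection: the view definition records how the data was derived.
view_sql = conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'view' AND name = 'clean_casts'"
).fetchone()[0]

print(rows)
print(view_sql)
```

The raw table stays exactly as uploaded, so a collaborator who disagrees with the cleaning rule can define a different view over the same data.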
    33. 33. Deeply nested hierarchies of views Provenance Controlled Sharing Implicit re-execution
    34. 34. Bring the computation to the data • Rich query services, not data cemeteries • Avoid “Transloading:” Pointless data movement from one environment to another – Compute → Vis – Server → Client – Cloud → Cloud • “Share the soup” and curate incrementally as a side effect of usage • Pay-as-you-go metadata
    35. 35. USE CASES 7/8/2013 Bill Howe, UW 35
    36. 36. Laser Microscope Objective Pinhole Lens Nozzle d1 d2 FSC (Forward scatter) Orange fluo Red fluo SeaFlow Francois Ribalet Jarred Swalwell Ginger Armbrust
    37. 37. SeaFlow [Figure: flow cytometry scatter plots on log-scale axes (10^0 to 10^4); panels ps3.fcs…Focus (D1/FSC, D2/FSC), ps3.fcs…subset, and P35-surf; axis and gate labels include FSC, 692-40 RED fluorescence, 580-30, Picoplankton, Nanoplankton, Ultraplankton, Prochlorococcus, Small Stuff, IS] • Continuous observations of various phytoplankton groups from 1-20 µm in size • Based on RED fluo: Prochlorococcus, Pico-, Ultra- and Nanoplankton • Based on ORANGE fluo: Synechococcus, Cryptophytes • Based on FSC: Coccolithophores Francois Ribalet Jarred Swalwell Ginger Armbrust
    38. 38. SeaFlow Francois Ribalet Jarred Swalwell Ginger Armbrust
    39. 39. Script-oriented computing was killing them • Scripts (typically in R) must be pre-shared with all collaborators • When the data changes, everybody has to re-run all the scripts • When the scripts change, everybody has to re-run all the scripts. • Implicit assumption that all data fits in main memory • Terrible provenance, terrible reproducibility • Pipeline of scripts dependent on intricate file formats and file naming schemes 7/8/2013 Bill Howe, UW 39
    40. 40. Howe, et al., CISE 2012
    41. 41. Workflow Systems • Scripts (typically in R) must be pre-shared with all collaborators • When the data changes, everybody has to re-run all the scripts • When the scripts change, everybody has to re-run all the scripts. • Implicit assumption that data fits in main memory • No provenance • Pipeline of scripts dependent on intricate file formats and/or file naming schemes 7/8/2013 Bill Howe, UW 41
    42. 42. Steven Roberts SQL as a lab notebook: http://bit.ly/16Xj2JP Calculate # methylated CGs Calculate # all CGs Calculate methylation ratio Link methylation with gene description GFF of methylated CG locations GFF of all genes GFF of all CG locations Gene descriptions Join Reorder columns Count Count JoinJoin Reorder columns Reorder columns Compute Trim Excel Join Join misstep: join w/ wrong fill Calculate # methylated CGs Calculate # all CGs GFF of methylated CG locations GFF of all genes GFF of all CG locations Gene descriptions Calculate methylation ratio and link with gene description Popular service for Bioinformatics Workflows
    43. 43. Halperin, Howe, et al. SSDBM 2013
    44. 44. Robin Kodner 7/8/2013 Bill Howe, UW 44 “I have had two students who are struggling with R come up and tell me how much more they like working in SQLShare.” Data management and statistics for biologists
    45. 45. Bill Howe, UW “An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces. Previously, we were using huge directory trees and plain text files. Now we can accomplish a 10 minute 100 line script in 1 line of SQL.” -- Andrew D White Andrew White, UW Chemistry 7/8/2013 45 Decoding nonspecific interactions from nature. A. White, A. Nowinski, W. Huang, A. Keefe, F. Sun, S. Jiang. (2012) Chemical Science. Accepted
    46. 46. SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp , x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp , w.category as nc_category , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1 WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1 WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) THEN w.end_bp - x.start_bp + 1 END AS len_overlap FROM [koesterj@washington.edu].[hotspots_deserts.tab] x INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w ON x.chr = w.chr WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) ORDER BY x.strain, x.chr ASC, x.start_bp ASC Non-programmers can write very complex queries (rather than relying on staff programmers) Example: Computing the overlaps of two sets of blast results We see thousands of queries written by non-programmers
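The `CASE` expression on slide 46 computes how many base pairs two regions share. As a cross-check, here is a minimal sketch of the same quantity in closed form, assuming inclusive start and end coordinates. The function name and the sample coordinates are hypothetical, and this is not the query author's method.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Base pairs shared by the closed intervals [a_start, a_end] and [b_start, b_end]."""
    # The shared span runs from the later start to the earlier end;
    # a negative span means the intervals do not touch.
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


# SNP region fully inside a noncoding region.
inside = overlap_length(120, 130, 100, 200)
# SNP region starts first and runs into the noncoding region.
left_partial = overlap_length(100, 200, 150, 250)
# Noncoding region fully inside the SNP region.
contains = overlap_length(100, 200, 120, 130)
# No overlap at all.
disjoint = overlap_length(100, 200, 300, 400)

print(inside, left_partial, contains, disjoint)
```

A helper like this is handy for spot-checking query output, particularly the nested case where one region lies entirely inside the other, which a branch-by-branch `CASE` can easily miscount.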
    47. 47. DATA SCIENCE 7/8/2013 Bill Howe, UW 47 Education
    48. 48. 7/8/2013 Bill Howe, UW 48
    49. 49. Drew Conway’s Data Science Venn Diagram 7/8/2013 Bill Howe, UW 49
    50. 50. 7/8/2013 Bill Howe, UW 50 “I worry that the Data Scientist role is like the mythical “webmaster” of the 90s: master of all trades.” -- Aaron Kimball, CTO Wibidata
    51. 51. 7/8/2013 Bill Howe, UW 51 But what are the abstractions of data science? tools abstr. “Data Jujitsu” “Data Wrangling” “Data Munging” Translation: “We have no idea what this is all about” Claim: Relational Algebra is the universal formalism for data modeling and manipulation, and every data scientist should know it
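To make slide 51's claim about relational algebra concrete, here is a minimal sketch of three core operators (selection, projection, join) over lists of dictionaries. It is a teaching toy, not a database engine, and the relation contents are hypothetical.

```python
def select(relation, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, columns):
    """Projection: keep only the named columns, dropping duplicate rows."""
    seen, out = set(), []
    for row in relation:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(columns, key)))
    return out


def join(left, right, column):
    """Equi-join: pair rows that agree on the shared column."""
    return [{**l, **r} for l in left for r in right if l[column] == r[column]]


# Hypothetical relations.
genes = [
    {"gene": "g1", "chr": "chr_4"},
    {"gene": "g2", "chr": "chr_9"},
    {"gene": "g3", "chr": "chr_9"},
]
hits = [
    {"gene": "g2", "cog": "COG4547"},
    {"gene": "g3", "cog": "COG5406"},
]

# Every operator takes relations and returns a relation, so they compose freely.
result = project(
    select(join(genes, hits, "gene"), lambda r: r["chr"] == "chr_9"),
    ["cog"],
)
print(result)
```

The free composition in the last expression is the closure property the Views slide relies on.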
    52. 52. UW Data Science Education Efforts 7/8/2013 Bill Howe, UW 52 Students Non-Students CS/Informatics Non-Major professionals researchers undergrads grads undergrads grads UWEO Data Science Certificate Graduate Certificate in Big Data CS Data Management Courses eScience workshops Intro to data programming eScience Masters (planned) Coursera Course: Intro to Data Science Previous courses: Scientific Data Management, Graduate CS, Summer 2006, Portland State University Scientific Data Management, Graduate CS, Spring 2010, University of Washington
    53. 53. 7/8/2013 Bill Howe, UW 53 Bill Howe
    54. 54. “Very very interesting tht there is high correlation in the way the election results were being announced and the way the graph is shaped.” “I was quite amazined that I was able to obtain this analysis with just 2 days of lecturing and practice!” “Inspired by all my new learning, I thought about doing a little sentiment analysis myself for our national elections!”
    55. 55. “Darth Grader”
    56. 56. … I was frankly amazed that you were so fast in responding to my query, in spite of the class being so huge. I’d like to thank you so much for teaching this course. This has been one of the most useful courses I’ve ever taken and you’re an awesome instructor. With such great quality education freely available, I wonder why I joined a Master’s program haha. Thanks again and take care. I’ll try my best to finish another competition. “With such great quality education freely available, I wonder why I joined a Master’s program haha.”
    57. 57. INTELLECTUAL INFRASTRUCTURE Institution 7/8/2013 Bill Howe, UW 57
    58. 58. Seek work in “Pasteur’s Quadrant” Considerations of use Quest for fundamental understanding Pasteur Edison Bohr
    59. 59. 7/8/2013 Bill Howe, UW 59 Multiple modes of interaction, multiple time scales 1-2 years and up 1-2 weeks and down 1-2 quarters Incubation (projects) Communication (events) Collaboration (partnerships)
    60. 60. 2018 2008 Some local observations: • Big data work exposes common ground • Every job is becoming “data scientist” • More π-shaped people! • Democratization to the long tail is key • Industry and research aren’t too different Incubator • Seed grants to students and postdocs • Rotating staff from science and industry • An evolving portfolio of reusable tools • Produce digital capital and human capital Data Science Incubator 2013
    61. 61. 7/8/2013 Bill Howe, UW 61 On NoSQL
    62. 62. 7/8/2013 Bill Howe, UW 62 http://sqlshare.escience.washington.edu billhowe@cs.washington.edu http://escience.washington.edu
    63. 63. 7/8/2013 Bill Howe, UW 63
    64. 64. WHY SQL? 7/8/2013 Bill Howe, UW 64
    65. 65. 5/18/10 Garret Cole, eScience Institute What’s the point? • Conventional wisdom says “Science data isn’t relational” – This is nonsense • Conventional wisdom says “Scientists won’t write SQL” – This is nonsense • So why aren’t databases being used more often? – They’re a PITA • We implicate difficulty in – installation, configuration – schema design, data loading – performance tuning – app-building (NoGUI?) We ask instead, “What kind of platform can support ad hoc scientific Q&A with SQL?”
    66. 66. 7/8/2013 Bill Howe, UW 66 God made the integers; all else is the work of man. (Leopold Kronecker, 19th Century Mathematician) slide src: Mike Franklin Codd made relations; all else is the work of man. (Raghu Ramakrishnan, DB text book author)
    67. 67. 7/8/2013 Bill Howe, UW 67 Key Idea: Algebraic Optimization N = ((z*2)+((z*3)+0))/1 Algebraic Laws: 1. (+) identity: x+0 = x 2. (/) identity: x/1 = x 3. (*) distributes: (n*x+n*y) = n*(x+y) 4. (*) commutes: x*y = y*x Apply rules 1, 3, 4, 2: N = (2+3)*z two operations instead of five, no division operator Same idea works with the Relational Algebra!
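The rewrite on slide 67 can be reproduced mechanically. The sketch below is a toy rule-based simplifier over expression trees, with hypothetical names throughout. It applies the slide's four laws and confirms that the result needs two operations instead of five. A real query optimizer applies the same idea to relational operators, with far more rules and a cost model.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, "/": operator.truediv}


def simplify(e):
    """Rewrite an expression tree (op, left, right) using the slide's laws."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e
    a, b = simplify(a), simplify(b)
    if op == "+" and b == 0:  # law 1: x + 0 = x
        return a
    if op == "/" and b == 1:  # law 2: x / 1 = x
        return a
    if op == "+" and isinstance(a, tuple) and isinstance(b, tuple) and a[0] == b[0] == "*":
        # Law 4 (commutativity) lets us try both operand orders to find a
        # common factor; law 3 (distributivity) then factors it out.
        for x, y in ((a[1], a[2]), (a[2], a[1])):
            for u, v in ((b[1], b[2]), (b[2], b[1])):
                if x == u:
                    return ("*", simplify(("+", y, v)), x)
    return (op, a, b)


def count_ops(e):
    """Count the operators in an expression tree."""
    return 1 + count_ops(e[1]) + count_ops(e[2]) if isinstance(e, tuple) else 0


def evaluate(e, env):
    """Evaluate a tree, looking variables up in env."""
    if isinstance(e, tuple):
        return OPS[e[0]](evaluate(e[1], env), evaluate(e[2], env))
    return env.get(e, e)


# N = ((z*2) + ((z*3) + 0)) / 1
original = ("/", ("+", ("*", "z", 2), ("+", ("*", "z", 3), 0)), 1)
optimized = simplify(original)

print(optimized, count_ops(original), count_ops(optimized))
```

Both trees evaluate to the same value for any `z`, which is what makes the rewrite safe to apply automatically.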
    68. 68. Data Curation Data Management Data Analytics Cyberinfrastructure Database & Systems Researchers Stats, ML, and Viz Library Science Researchers DataONE Hadoop GraphLab Vertica Greenplum Oracle/MS/IBM Dataverse R/SPSS/MATLAB/Stata Dryad ICPSR Geodata.gov Tableau Weka GenBank Intellectual Infrastructure Spark Pig HIVE Shark Dremel
    69. 69. SQLShare as a CS Research Platform • Automatic “Starter” Queries – (Bill Howe, Garret Cole, Nodira Khoussainova, Leilani Battle) • VizDeck: Automatic Mashups and Visualization – (Bill Howe, Alicia Key, Daniel Perry, Cecilia Aragon) • Info Extraction from Spreadsheets – (Mike Cafarella, Dave Maier, Bill Howe, Sagar Chitnis, Abdu Alwani) • Scalable Analytics-as-a-Service – (Dan Suciu, Magda Balazinska, Bill Howe) • Optimizing Iterative Queries for Machine Learning – (Dan Suciu, Magda Balazinska, Bill Howe) • Case Studies in Metagenomics, Chemistry, more SSDBM 2011 SIGMOD 2011 (demo) SSDBM 2011 CHI 2012 SIGMOD 2012 (demo) 7/8/2013 Bill Howe, UW 69 VLDB 2010 Datalog2.0 2012 CIDR 2013 Data engineering 2012 CiSE 2012
    70. 70. Two Failure Modes Serving the Long Tail over-abstraction “uber-system” “neither fish nor fowl” tries to address so many requirements, it addresses none too reactive, ad hoc, one-off Addresses exactly 1 application no leverage; doesn’t scale
    71. 71. A stripped-down version of Jim Gray’s “20 questions” methodology Experimental Engagement Algorithm for the Long Tail 1. Get the data 2. Load the data “as is” – no schema design 3. Get ~20 questions (in English) 4. Translate the questions into SQL (when possible) 5. Provide these “starter queries” to the researchers Q: Can researchers’ questions be expressed in SQL? Q: Are a few examples sufficient for novices to self-train with SQL? Q: Can we scale this process up? Q: If so, will the use of SQL reduce their data handling overhead?
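Steps 2 through 5 of the engagement algorithm on slide 71 can be sketched in a few lines. The sketch assumes Python's built-in `csv` and `sqlite3` modules in place of SQLShare. The helper `load_as_is`, the sample data, and the English question are all hypothetical.

```python
import csv
import io
import sqlite3


def load_as_is(conn, table, text):
    """Load delimited text with no schema design: every column is stored as TEXT."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    marks = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)


# Hypothetical upload, standing in for a researcher's spreadsheet export.
raw = "station,depth_m,chl\nA,5,0.8\nA,10,1.1\nB,5,0.4\n"

conn = sqlite3.connect(":memory:")
load_as_is(conn, "casts", raw)

# "Starter queries": each English question is paired with the SQL that answers it.
starter_queries = {
    "What is the average chlorophyll at each station?":
        "SELECT station, AVG(CAST(chl AS REAL)) FROM casts "
        "GROUP BY station ORDER BY station",
}

answers = {question: conn.execute(sql).fetchall()
           for question, sql in starter_queries.items()}

for question, rows in answers.items():
    print(question, rows)
```

Types are applied at query time with `CAST` rather than at load time, which keeps the upload step free of design decisions. That is one possible reading of "as is", not a description of how SQLShare ingests data.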
