Key Idea: An Algebra of Tables select ...
Algebraic Optimization N = ((4*2)+((4*3)+0))/1 Algebraic Laws: 1. (+) identity: x+0 = x 2. (/) identi...
Architecture Excel Python R Addi...
Science is becoming a database query problem. Old model: “Query the world” (Data acquisition coupled to a specific hypothesi...
Problem – Research data is captured and manipulated in spreadsheets – This perhaps made sense five years ago; the data v...
Astronomy LSST Oceanography PanSTARRS ...
Parallel Iterative Analytics Large scale GridFields "Lon...
Parallel Iterative Analytics Large scale GridFields "Long ta...
A "Needs Hierarchy" of Science Data Management analytics query full integration ...
A "Needs Hierarchy" of Science Data Management analytics query full integration ...
Four Conjectures about Declarative Query for Science • Most science data manipulation tasks can be expressed in relationa...
metadata sequence data search results
SQL
The Long Tail LSST (~100PB) “The future is already here; it’s just...
Experimental Engagement Algorithm for the Long Tail A stripped-down version of Jim Gray’s “20 questions” method...
• Which samples have not been cloned? SELECT * FROM plasmiddb WHERE NOT (ISDATE(cloned) OR cloned = 'yes') 10/3/2012 ...
Details • Backend is Microsoft's SQL Azure • About a year old, essentially zero advertising • 50+ active users (UW and ...
View Semantics and Features • "Saved query" = View with attached metadata • Unify views and tables as "datasets" ...
Karen Dvornich
Enabling Collaborative Research Data Management with SQLShare
Relational databases remain underused in the long tail of science, despite a number of significant
success stories and a natural correspondence between scientific inquiry and ad hoc database query.
Barriers to adoption have been articulated in the past, but spreadsheets and other file-oriented
approaches still dominate. At the University of Washington eScience Institute, we are exploring a new
“delivery vector” for selected database features targeting researchers in the long tail: a web-based
query-as-a-service system called SQLShare that eschews conventional database design, instead
emphasizing a simple Upload-Query-Share workflow and exposing a direct, full-SQL query interface over
“raw” tabular data. We augment the basic query interface with services for cleaning and integrating
data, recommending and authoring queries, and automatically generating visualizations. We find that
even non-programmers are able to create and share SQL views for a variety of tasks, including quality
control, integration, basic analysis, and access control. Researchers in oceanography, molecular
biology, and ecology report migrating data to our system from spreadsheets, from conventional databases,
and from ASCII files. In this paper, we will provide some examples of how the platform has enabled
science in other domains, describe our SQLShare system, and propose some emerging research directions
in this space for the database community.
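
As a rough illustration of the Upload-Query-Share workflow described above, the sketch below uses Python's built-in sqlite3 module as a stand-in; it is not SQLShare's implementation or API. The table name `casts`, the column names, the sample rows, and the simple "does every value parse?" type inference are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical "raw" upload: a small CSV with no schema declared up front.
RAW_CSV = """sample,depth_m,temp_c
A1,5,11.2
A2,20,9.8
A3,50,
"""

def infer_type(values):
    """Guess a column type from its non-empty values (illustrative heuristic)."""
    present = [v for v in values if v != ""]
    for sql_type, cast in (("INTEGER", int), ("REAL", float)):
        try:
            for v in present:
                cast(v)
            return sql_type
        except ValueError:
            continue
    return "TEXT"

def upload(conn, name, text):
    """Load CSV text into a table whose column types are inferred from the data."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    types = [infer_type([r[i] for r in body]) for i in range(len(header))]
    cols = ", ".join(f'"{h}" {t}' for h, t in zip(header, types))
    conn.execute(f'CREATE TABLE "{name}" ({cols})')
    marks = ", ".join("?" for _ in header)
    # Empty cells become NULL instead of empty strings.
    cleaned = [[v if v != "" else None for v in r] for r in body]
    conn.executemany(f'INSERT INTO "{name}" VALUES ({marks})', cleaned)
    return dict(zip(header, types))

conn = sqlite3.connect(":memory:")
schema = upload(conn, "casts", RAW_CSV)  # Upload

# Query: a saved view acts as a cleaned, reusable dataset layered on the raw table.
conn.execute(
    "CREATE VIEW casts_clean AS "
    "SELECT sample, depth_m, temp_c FROM casts WHERE temp_c IS NOT NULL"
)

# Share: collaborators query the view rather than re-cleaning the raw file.
shallow = conn.execute(
    "SELECT sample FROM casts_clean WHERE depth_m < 30 ORDER BY sample"
).fetchall()
```

The point of the design is that cleanup happens after loading, as a view, so the original upload is never modified and every derived result can be traced back to it.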

Published in: Technology
  • We want to give a little background on our project before we launch into it, so we will discuss the problem we are trying to solve. Essentially, we want to remove the speed bump of data handling for the scientists.
  • List of English descriptions
  • Data files can be uploaded as is, and we make an attempt to automatically infer the appropriate schema based on a type analysis of the values and a variety of heuristics.
  • The goal is to be inclusive rather than exclusive: accept everything, then clean it up with queries and views. So we can tolerate an irregular number of columns, mixed-type columns, and missing column names. There is a lot more to do here. Kathleen Fisher’s work on LearnPADS provides a great mechanism for automatically inferring the structure of ad hoc data formats, which is very promising for us. Jeff Heer’s and Joe Hellerstein’s work on Data Wrangler is relevant, as is Google Refine.
  • Every uploaded table is associated with a trivial view. You can modify this view to do some basic cleanup, and in fact that is typically the first step we are seeing people perform.
  • Lots of features you can imagine here – anything you can do with a YouTube video, you should be able to do with a query: share it, rate it, “more like this”, recommendations. We are exploring some of these.
  • I’ll give you a couple of examples of use cases we are seeing. Robin Kodner at Friday Harbor Labs in the San Juan Islands works on a program called Sound Experience that gives K-12 students and undergrads experience sailing, but reuses this ship time to do some real science and collect samples from areas that would otherwise be very expensive.
  • But the fundamental error made by computer scientists, and it’s probably the fault of the database community, is to assume that semantic integration is a prerequisite for query and analytics. It isn’t. It’s the final goal, not some insignificant preamble to analysis. Domain scientists know this – they take a very pragmatic approach. They write code to do data handling, they write code to do analytics, and they do data integration on the fly in a task-specific way. So one of my goals is to convince you that you can decouple declarative query from semantic integration, and that doing so gives scientists a very powerful tool.
  • So how do we do this? Back to the needs hierarchy. It’s already happening. To a first approximation, I consider storage and sharing solved thanks to cloud computing. There are plenty of issues to resolve, especially in sustainability and uptake, but these systems exist. And they’re getting cheaper. But semantic integration and analytics are still largely handled in an application-specific, ad hoc, siloed manner. Query services for science are underexplored, with a few mind-blowingly successful examples such as the Sloan Digital Sky Survey. I think many of you are familiar with SDSS, but briefly: it is not an overstatement to say that the liberal application of database technology has transformed the field of astronomy. Undergrads and grads use SQL alongside IDL and Fortran. 5000 papers were written about the SDSS data, and only about 100 of those were co-authored by the PIs. The rest were written by astronomers exercising the public query interfaces. So the roadmap here is to FIRST expose data through query services, THEN use these services to generalize, scale, and democratize analytics and semantic integration.
  • To begin, we ask: what kind of questions would you ask your data once you have it ready to be worked on? For just about EVERY question that we have heard a scientist ask, we have found an equivalent SQL statement. If we could just turn their questions into SQL our job would be done, but there are many other problems to solve before that becomes a reality. For example, their data may not reside in a relational database. This brings us to our next problem: how can we bring the power of SQL to the scientists to answer their questions without the overhead of everything that a database administrator would need to do?
  • Why has MapReduce been so successful? One reason: it turned “mere mortal” Java programmers into distributed systems programmers. MapReduce raised the level of abstraction for big data. But not high enough, evidently: two of the earliest and most successful projects in the Hadoop ecosystem were Pig and HIVE, declarative query languages on top of Hadoop. NoSQL is a misnomer. NoSchema? NoLoading? NoLicenseFees? NoOverpricedDBA? NoMySQL? It taught database people to think about fault tolerance differently and to not ignore dirty data.
  • The DNA material in the samples taken from the water is then sequenced in a machine to produce millions of short strings
  • These DNA reads can then be cross-referenced in public databases to determine what organisms were present in the water, and what genes were being expressed
  • Each step generates a bunch of “residual” data, usually in the form of spreadsheets or text files. This process is repeated many times, leading to 100s of spreadsheets. At this point, the actual science questions are answered using these spreadsheets by computing “manual joins”, creating plots, searching and filtering, copying and pasting, etc. It’s a mess. When we ask how much time is spent “handling data” as opposed to “doing science”, we hear that 90% of their work is manipulating the data before they can actually answer a question!
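
The “manual joins” mentioned above correspond to a single relational join. The following is a minimal sketch, again using Python's sqlite3 as a stand-in; the two tables play the role of two annotation spreadsheets, and their names, columns, and rows are made up for illustration rather than taken from the talk's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical "spreadsheets" of annotation hits, loaded as tables.
conn.executescript("""
CREATE TABLE genome_hits (query TEXT, cog_hit TEXT);
CREATE TABLE sample_hits (read_id TEXT, hit TEXT);
INSERT INTO genome_hits VALUES
  ('orf_1', 'COG0001'), ('orf_2', 'COG0002'), ('orf_3', NULL);
INSERT INTO sample_hits VALUES
  ('read_a', 'COG0002'), ('read_b', 'COG0002'), ('read_c', 'COG0009');
""")

# One declarative join replaces copy-and-paste matching between the two files.
matches = conn.execute("""
    SELECT g.query, s.read_id, g.cog_hit
    FROM genome_hits AS g
    JOIN sample_hits AS s ON g.cog_hit = s.hit
    ORDER BY g.query, s.read_id
""").fetchall()

# Aggregate over the join: how many sample reads support each annotated ORF?
# LEFT JOIN keeps ORFs with no matching read (or no annotation) at a count of 0.
support = conn.execute("""
    SELECT g.query, COUNT(s.read_id) AS n_reads
    FROM genome_hits AS g
    LEFT JOIN sample_hits AS s ON g.cog_hit = s.hit
    GROUP BY g.query
    ORDER BY g.query
""").fetchall()
```

Once both files are tables, a new question usually means writing a new query, not repeating the matching by hand.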
  • Transcript of "Enabling Collaborative Research Data Management with SQLShare"

    1. Enabling Collaborative Research Data Management with SQLShare. Bill Howe, PhD, Director of Research, Scalable Data Analytics, University of Washington eScience Institute. Tom Lewis, Director, Academic & Collaborative Applications, University of Washington Information Technology. 10/3/2012 Bill Howe, UW
    2. Assessing UW Researchers' Data Management Needs. 1. Conversations with Research Leaders (2008) • First large-scale assessment of researchers' needs • 124 interviews with top researchers. 2. Faculty Technology Survey (2011) • Use of teaching and research technologies • Paired with student and TA surveys • Reached all disciplines, levels of research • 689 instructors responded
    3. Data Management Needs • Expertise: designing new systems, enhancing current practices, analysis • Storage: large amounts of data for current projects and data archives • Backup: inconsistent systems among researchers, some unreliable practices • Security: secure access needed for inter-institutional partners
    4. Types of Data Stored: Quantities/statistics 71.60% 61.20% Text (literature, transcriptions, field notes) 48.00% Images (photographs, maps) 22.80% Video recordings 20.50% Audio recordings 15.20% Multimedia digital objects 11.70% Geo-tagged objects/spatial data 5.20%
    5. Data Storage Location: My computer 87% External device (hard drive, thumb drive) 66% 41% Department-managed server 27% Server managed by research team 12% External (Non-UW) data center 6% Department-managed data center 6% Other 5% Data managed by research team 5%
    6. http://escience.washington.edu
    7. The University of Washington eScience Institute • Rationale – The exponential increase in sensors is transitioning all fields of science and engineering from data-poor to data-rich – Techniques and technologies include • Sensors and sensor networks, databases, data mining, machine learning, visualization, cluster/cloud computing – If these techniques and technologies are not widely available and widely practiced, UW will cease to be competitive • Mission – Help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon them • Strategy – Bootstrap a cadre of Research Scientists – Add faculty in key fields – Build out a "consultancy" of students and non-research staff
    8. eScience Big Data Group. Bill Howe, PhD (databases, cloud, data-intensive scalable computing, visualization). Staff – Seung-Hee Bae, PhD (postdoc, scalable machine learning algorithms) – Dan Halperin, PhD (postdoc; scalable systems) – Sagar Chitnis, Research Engineer (Azure, databases, web services) – (alumna) Marianne Shaw, PhD (hadoop, semantic graph databases) – (alumna) Alicia Key, Research Engineer (visualization, web applications). Students – Scott Moe (2nd yr PhD, Applied Math) – Daniel Perry (2nd yr PhD, HCDE). Partners – CSE DB Faculty: Magda Balazinska, Dan Suciu – CSE students: Paris Koutris, Prasang Upadhyaya – UW-IT (web applications, QA/support) – Cecilia Aragon, PhD, Associate Professor, HCDE (visualization, scientific applications)
    9. [src: Carol Goble] • Power distribution • 80:20 rule. Figure axes: Popularity / Sales vs. Products / Results; regions: Head, Tail. First published May 2007, Wired Magazine article 2004
    10. [src: Carol Goble] Long Tail of Research Data: High throughput experimental methods; Industrial scale; Commons based production; Publicly data sets; Cherry picked results; Preserved. GenBank, PDB, UniProt, ChemSpider, Pfam, CATH, SCOP (Protein Structure Classification). Spreadsheets, Notebooks; Local, Lost
    11. Problem: How much time do you spend “handling data” as opposed to “doing science”? Mode answer: “90%”
    12. Simple Example
        ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
        ###query | length | COG hit #1 | e-value #1 | identity #1 | score #1 | hit length #1 | description #1
        chr_4[480001-580000].287 | 4500 | | | | | |
        chr_4[560001-660000].1 | 3556 | | | | | |
        chr_9[400001-500000].503 | 4211 | COG4547 | 2.00E-04 | 19 | 44.6 | 620 | Cobalamin biosynthesis protein C
        chr_9[320001-420000].548 | 2833 | COG5406 | 2.00E-04 | 38 | 43.9 | 1001 | Nucleosome binding factor SPN,
        chr_27[320001-404298].20 | 3991 | COG4547 | 5.00E-05 | 18 | 46.2 | 620 | Cobalamin biosynthesis protein C
        chr_26[320001-420000].378 | 3963 | COG5099 | 5.00E-05 | 17 | 46.2 | 777 | RNA-binding protein of the Puf fa
        chr_26[400001-441226].196 | 2949 | COG5099 | 2.00E-04 | 17 | 43.9 | 777 | RNA-binding protein of the Puf fa
        chr_24[160001-260000].65 | 3542 | | | | | |
        chr_5[720001-820000].339 | 3141 | COG5099 | 4.00E-09 | 20 | 59.3 | 777 | RNA-binding protein of the Puf fa
        chr_9[160001-260000].243 | 3002 | COG5077 | 1.00E-25 | 26 | 114 | 1089 | Ubiquitin carboxyl-terminal hydr
        chr_12[720001-820000].86 | 2895 | COG5032 | 2.00E-09 | 30 | 60.5 | 2105 | Phosphatidylinositol kinase and p
        chr_12[800001-900000].109 | 1463 | COG5032 | 1.00E-09 | 30 | 60.1 | 2105 | Phosphatidylinositol kinase and p
        chr_11[1-100000].70 | 2886 | | | | | |
        chr_11[80001-180000].100 | 1523 | | | | | |
        COGAnnotation_coastal_sample.txt
        id | query | hit | e_value | identity_ | score | query_start | query_end | hit_start | hit_end | hit_length
        1 | FHJ7DRN01A0TND.1 | COG0414 | 1.00E-08 | 28 | 51 | 1 | 74 | 180 | 257 | 285
        2 | FHJ7DRN01A1AD2.2 | COG0092 | 3.00E-20 | 47 | 89.9 | 6 | 85 | 41 | 120 | 233
        3 | FHJ7DRN01A2HWZ.4 | COG3889 | 0.0006 | 26 | 35.8 | 9 | 94 | 758 | 845 | 872
        …
        2853 | FHJ7DRN02HXTBY.5 | COG5077 | 7.00E-09 | 37 | 52.3 | 3 | 77 | 313 | 388 | 1089
        2854 | FHJ7DRN02HZO4J.2 | COG0444 | 2.00E-31 | 67 | 127 | 1 | 73 | 135 | 207 | 316
        …
        3566 | FHJ7DRN02FUJW3.1 | COG5032 | 1.00E-09 | 32 | 54.7 | 1 | 75 | 1965 | 2038 | 2105
        …
        SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit
    13. Data management becoming the bottleneck • Data is captured and manipulated in spreadsheets • This approach made sense five years ago; the data volumes were manageable • But now: – 50k rows in each of 50-100 spreadsheets, per scientist – Significantly more sharing required: "mega-collabs" – Continuously producing new results – The management overhead is beginning to dominate – Difficult to share data publicly (emailing files is a mess) – Lots of work to get an answer to any new question
    14. Why not build a database? • Assumes a pre-defined schema – more on this later • Requires specialized technical skills and a huge amount of up-front effort • Not a perfect fit – it's hard to design a "permanent" database for a fast-moving research target • Researchers have little interest in operating and maintaining a data system – they just want to organize, manipulate, and share data
    15. Database-as-a-Service for the 99%. Approach: Strip down databases to bare essentials – Upload -> Query -> Share – Try to eliminate installation, configuration, schema design, data loading, tuning, app-building (slide annotations: "easy, thanks to the cloud"; "harder")
    16. Getting Data In • Upload data through a browser to a cloud database (so no need to manage a local system) • No need to "design a database" before you use your data – just upload and get to work • Write SQL queries, with some automated help and some guided help from experts • Build on your own results – Output of one query can be the input to another – Encourages sharing and reuse – Avoids everyone on the team running the same data processing steps over and over – Provides provenance – "how did you get this result?"
    17. Select from a list of English descriptions
    18. Edit a Query. Save the results, share them with others
    19. http://sqlshare.escience.washington.edu
    20. http://sqlshare.escience.washington.edu
    21. http://sqlshare.escience.washington.edu
    22. http://sqlshare.escience.washington.edu
    23. http://sqlshare.escience.washington.edu
    24. (Diagram) Clients: Spreadsheet Crawler, Excel Addin, Python Client, R Client. "Flagship" VizDeck SQLShare App ASP.NET (Python) on EC2. OAuth2. SQLShare REST API. WCF
    25. Deeply nested hierarchies of views • Provenance • Controlled Sharing
    26. Robin Kodner. 2010 Pilot - Outreach and Education-based sampling: Schooner Adventuress (map labels: 4-16)
    27. NatureMapping Program. Karen Dvornich. Wildlife Observations (1902- ). Data collection and submission options: 1. Download/upload spreadsheet 2. Online data entry 3. NatureTracker on handheld/GPS 4. Android ODK (Open Data Kit). Water Quality Monitoring Sites (2003 - )
    28. Andrew White, UW Chemistry: “An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces. Previously, we were using huge directory trees and plain text files. Now we can accomplish a 10 minute 100 line script in 1 line of SQL.” -- Andrew D White
    29. WHY SQL?
    30. A "Needs Hierarchy" of Science Data Management. “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow 43. Hierarchy levels: full semantic integration, analytics, query, sharing, storage
    31. NoSchema (not NoSQL) • A schema* is a shared consensus about some universe of discourse • At the frontier of research, this shared consensus does not exist, by definition • Any schema that does emerge will change frequently, by definition • Data found "in the wild" will typically not conform to any schema, by definition • But this doesn't mean we have to punt on databases and go back to ad hoc scripts and files. * ontology/metadata standard/controlled vocabulary/etc.
    32. A "Needs Hierarchy" of Science Data Management. Hierarchy levels: full semantic integration, analytics, query, sharing, storage (slide annotations: "Silos", "Cloud")
    33. What's the point? • Conventional wisdom says "Science data isn't relational" – This is utter nonsense • Conventional wisdom says "Scientists won't write SQL" – This is utter nonsense • So why aren't databases being used more often? – They're a PITA • We implicate difficulty in – installation, configuration – schema design, data loading – performance tuning – app-building (NoGUI?). We ask instead, “What kind of platform can support ad hoc scientific Q&A with SQL?” 5/18/10 Garret Cole, eScience Institute
    34. Biologists are beginning to write very complex queries (rather than relying on staff programmers). We see thousands of queries written by non-programmers. Example: Computing the overlaps of two sets of blast results
        SELECT x.strain, x.chr, x.region AS snp_region, x.start_bp AS snp_start_bp,
               x.end_bp AS snp_end_bp, w.start_bp AS nc_start_bp, w.end_bp AS nc_end_bp,
               w.category AS nc_category,
               CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
                         THEN x.end_bp - x.start_bp + 1
                    WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
                         THEN x.end_bp - w.start_bp + 1
                    WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
                         THEN w.end_bp - x.start_bp + 1
               END AS len_overlap
        FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
        INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
                ON x.chr = w.chr
        WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
           OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
           OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
        ORDER BY x.strain, x.chr ASC, x.start_bp ASC
    35. So why SQL? • Covers 80% of what we need – Ex: Sloan Digital Sky Survey – Ex: Hybrid Hash Join algorithm published in BMC Bioinformatics • Empower a new class of data-savvy scientist who isn't forced to be trained as an IT professional • Algebraic optimization – Ask me about this if you're interested • Views: Logical and physical data independence – Reason about the problem independently of the data representation – No re-execution of "workflows" – No file format incompatibilities – No version mismatches – Data and code tightly coupled and (logically) centralized
    36. SQLShare as a CS Research Platform. SSDBM 2011 • Automatic "Starter" Queries – (Bill Howe, Garret Cole, Nodira Khoussainova, Leilani Battle) SIGMOD 2011 (demo) • VizDeck: Automatic Mashups and Visualization SSDBM 2011 – (Bill Howe, Alicia Key, Daniel Perry, Cecilia Aragon) CHI 2012 • Info Extraction from Spreadsheets SIGMOD 2012 (demo) – (Mike Cafarella, Dave Maier, Bill Howe, Sagar Chitnis, Abdu Alwani) • Scalable Analytics-as-a-Service – (Dan Suciu, Magda Balazinska, Bill Howe) • Optimizing Iterative Queries for Machine Learning VLDB 2010 – (Dan Suciu, Magda Balazinska, Bill Howe) Datalog2.0 2012 CIDR 2013
    37. Where we're headed: • Local or cloud-hosted deployments (done!) • Multi-institution sharing • Global users and permissions • Distributed data and distributed query. We are looking for partners!
    38. http://sqlshare.escience.washington.edu billhowe@cs.washington.edu http://escience.washington.edu
    40. Scientific data management reduces to sharing views • Integrate data from multiple sources? – joins and unions with views • Standardize on units, apply naming conventions? – rename columns, apply functions with views • Attach metadata? – add new tables with descriptive names, add new columns with views • Data cleaning, quality control? – hide bad values with views • Maintain provenance? – inspect view dependencies • Propagate updates? – view maintenance • Protect sensitive data? – expose subsets with views (assuming views carry permissions)
    41. 41. An observation about “handling data”
        • How many plasmids were bombarded in July and have a rescue and expression?
            SELECT count(*)
            FROM [bombardment_log]
            WHERE bomb_date BETWEEN '7/1/2010' AND '7/31/2010'
              AND [rescue clone] IS NOT NULL
              AND [expression?] = 'yes'
        5/18/10 Garret Cole, eScience Institute
    42. 42. An observation about “handling data”
        • Which samples have not been cloned?
            SELECT *
            FROM plasmiddb
            WHERE NOT (ISDATE(cloned) OR cloned = 'yes')
    43. 43. An observation about “handling data”
        • How often does each RNA hit appear inside the annotated surface group?
            SELECT hit, COUNT(*) AS cnt
            FROM tigrfamannotation_surface
            GROUP BY hit
            ORDER BY cnt DESC
    44. 44. An observation about “handling data”
        • For a given promoter (or protein fusion), how many expressing lines have been generated? (They would all have different strain designations.)
            SELECT strain, count(distinct line)
            FROM glycerol_stocks
            GROUP BY strain
    45. 45. Find all TIGRFam ids (proteins) that are missing from at least one of three samples (relations)
            SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
            UNION
            SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
            UNION
            SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
            EXCEPT
            SELECT col0 FROM [refseq_hma_fasta_TGIRfam_refs]
            INTERSECT
            SELECT col0 FROM [est_hma_fasta_TGIRfam_refs]
            INTERSECT
            SELECT col0 FROM [combo_hma_fasta_TGIRfam_refs]
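The query above depends on INTERSECT binding more tightly than UNION and EXCEPT, which is how SQL Server evaluates set operators. SQLite instead evaluates compound operators strictly left to right, so the sketch below makes the grouping explicit with subqueries. The short table names and the sample ids are hypothetical stand-ins for the three relations on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE refseq (col0 TEXT);
    CREATE TABLE est (col0 TEXT);
    CREATE TABLE combo (col0 TEXT);
    INSERT INTO refseq VALUES ('TIGR00001'), ('TIGR00002'), ('TIGR00003');
    INSERT INTO est    VALUES ('TIGR00001'), ('TIGR00002');
    INSERT INTO combo  VALUES ('TIGR00001'), ('TIGR00004');
""")

# (ids present in any sample) EXCEPT (ids present in every sample)
query = """
    SELECT col0 FROM (
        SELECT col0 FROM refseq
        UNION SELECT col0 FROM est
        UNION SELECT col0 FROM combo
    )
    EXCEPT
    SELECT col0 FROM (
        SELECT col0 FROM refseq
        INTERSECT SELECT col0 FROM est
        INTERSECT SELECT col0 FROM combo
    )
"""
missing = sorted(row[0] for row in conn.execute(query))
```

Only `TIGR00001` appears in all three tables, so every other id is reported as missing from at least one sample.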
    46. 46. On NoSQL
    47. 47. An Observation on NoSQL
        • 2004 Dean et al. MapReduce
        • 2008 Hadoop 0.17 release
        • 2008 Olston et al. Pig: Relational Algebra on Hadoop
        • 2008 DryadLINQ: Relational Algebra in a Hadoop-like system
        • 2009 Thusoo et al. HIVE: SQL on Hadoop
        NoSQL is a misnomer
          – NoMySQL?
          – NoSchema?
          – NoLoading?
          – NoLicenseFees!
    48. 48. Digression: Relational Database History
        Pre-relational: if your data changed, your application broke.
        Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
        “Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- Codd 1970
        Key idea: programs that manipulate tabular data exhibit an algebraic structure, which allows reasoning and manipulation independently of the physical data representation.
    49. 49. Key Idea: An Algebra of Tables
        [diagram labels: select, project, join, join]
        Other operators: aggregate, union, difference, cross product
    50. 50. Algebraic Optimization
        N = ((4*2)+((4*3)+0))/1
        Algebraic laws:
          1. (+) identity: x+0 = x
          2. (/) identity: x/1 = x
          3. (*) distributes: (n*x+n*y) = n*(x+y)
          4. (*) commutes: x*y = y*x
        Apply rules 1, 3, 4, 2:
        N = (2+3)*4
        Two operations instead of five, and no division operator.
        The same idea works with very large tables / graphs, but the payoff is much higher.
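The rewrite on this slide can be reproduced with a tiny rule-based rewriter. This is a minimal sketch, assuming expressions are encoded as nested `(op, left, right)` tuples; the encoding and the function names are illustrative choices and are not taken from the talk or from any database optimizer.

```python
def simplify(e):
    """Rewrite an expression tree bottom-up using the slide's four laws."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e
    a, b = simplify(a), simplify(b)
    if op == "+" and b == 0:      # law 1: x + 0 = x
        return a
    if op == "/" and b == 1:      # law 2: x / 1 = x
        return a
    if (op == "+" and isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == "*" and b[0] == "*" and a[1] == b[1]):
        # law 3: n*x + n*y = n*(x+y); law 4 lets us write it as (x+y)*n
        return ("*", ("+", a[2], b[2]), a[1])
    return (op, a, b)


def count_ops(e):
    """Count the operators in an expression tree."""
    if not isinstance(e, tuple):
        return 0
    return 1 + count_ops(e[1]) + count_ops(e[2])


def evaluate(e):
    """Evaluate an expression tree to a number."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e
    x, y = evaluate(a), evaluate(b)
    if op == "+":
        return x + y
    if op == "*":
        return x * y
    return x / y


# N = ((4*2)+((4*3)+0))/1
original = ("/", ("+", ("*", 4, 2), ("+", ("*", 4, 3), 0)), 1)
optimized = simplify(original)
```

The rewriter never evaluates the expression; it only uses the laws, which is why the same approach carries over to query plans where the operands are large tables.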
    51. 51. Architecture
        [diagram labels: Excel Addin; Python Client; R Client; Spreadsheet Ingest; “Flagship” SQLShare App (Python) on EC2; VizDeck; SQLShare REST API (ASP.NET); Windows Azure; IIS; SQL Azure; SQL Server]
    52. 52. Science is becoming a database query problem
        Old model: “Query the world” (data acquisition coupled to a specific hypothesis)
        New model: “Download the world” (data acquisition supports many hypotheses)
          – Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
          – Biology: lab automation, high-throughput sequencing
          – Oceanography: high-resolution models, cheap sensors, satellites
        [figure labels: 40TB / 2 nights; 1 device ~1TB / day; 100s of devices]
    53. 53. Problem
          – Research data is captured and manipulated in spreadsheets
          – This perhaps made sense five years ago; the data volumes were manageable
          – But now: 50k rows, 100s of files, “mega-collabs”
        Why not put everything into a database?
          – A huge amount of up-front effort
          – Hard to design for a moving target
          – Running a database system is a huge drain
        Approach: SQLShare
          – Upload data through your browser: no setup, no install
          – Log in to browse science questions in English
          – Click a question, see the SQL that answers it
          – Edit the SQL to answer an "adjacent" question, even if you wouldn’t know how to write it from scratch
        https://sqlshare.escience.washington.edu/
    54. 54. [figure: data sources plotted by # of bytes against # of sources; labels: Astronomy, Oceanography, Biology, LSST, PanSTARRS, SRA, MG-RAST, OOI, GenBank, PDB, SDSS, IOOS, UniProt, Pfam, NDBC, LANL HIV, Galaxy, Pathway Commons, BioMart, GEO, SQLShare]
    55. 55. [diagram labels: Parallel Iterative Analytics; Large scale; GridFields; “Long tail”; Scale, training; SQLShare: Query-as-a-Service; VizDeck; citizen science; public outreach; astrophysics; life sciences; earth sciences; commercial]
    57. 57. A “Needs Hierarchy” of Science Data Management
        [pyramid levels as listed on the slide: analytics, query, full integration, sharing, storage]
    59. 59. Four Conjectures about Declarative Query for Science
        • Most science data manipulation tasks can be expressed in relational algebra
        • Most science analytics tasks can be expressed in relational algebra + recursion (Hellerstein 09, Re 12)
        • These expressions can be efficiently and scalably executed in the cloud
        • Researchers are willing and able to program using relational algebra languages (cf. SDSS)
    63. 63. metadata sequence data search results
    64. 64. SQL
    65. 65. The Long Tail
        “The future is already here; it’s just not very evenly distributed.” -- William Gibson
        [figure: data volume plotted against rank; labels: LSST (~100PB), CERN (~15PB/year), PanSTARRS (~40PB), SDSS (~100TB), CARMEN (~50TB), Ocean Modelers, Seismologists, Microbiologists, <Spreadsheet users>]
    66. 66. Experimental Engagement Algorithm for the Long Tail
        A stripped-down version of Jim Gray’s “20 questions” methodology
        1. Get the data
        2. Load the data “as is” – no schema design
        3. Get ~20 questions (in English)
        4. Translate the questions into SQL (when possible)
        5. Provide these “starter queries” to the researchers
        Q: Can researchers’ questions be expressed in SQL?
        Q: Are a few examples sufficient for novices to self-train with SQL?
        Q: Can we scale this process up?
        Q: If so, will the use of SQL reduce their data handling overhead?
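Steps 2, 4, and 5 can be sketched in a few lines of Python with SQLite. This is an illustration only, not SQLShare's actual ingest path: the helper `load_as_is`, the CSV contents, and the rule that an empty `cloned` cell means "not cloned" are all assumptions made for the example.

```python
import csv
import io
import sqlite3


def load_as_is(conn, table, csv_text):
    """Load CSV text into a table whose columns come straight from the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Every column is TEXT and identifiers are quoted, so no schema design is
    # needed and headers like "expression?" survive unchanged.
    cols = ", ".join('"{}" TEXT'.format(h.replace('"', '""')) for h in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, cols))
    marks = ", ".join("?" for _ in header)
    conn.executemany('INSERT INTO "{}" VALUES ({})'.format(table, marks), reader)


# Hypothetical spreadsheet export; the "cloned" column mixes dates and 'yes'.
CSV_TEXT = """sample,cloned,expression?
p1,2010-07-02,yes
p2,,no
p3,yes,yes
"""

conn = sqlite3.connect(":memory:")
load_as_is(conn, "plasmiddb", CSV_TEXT)

# A "starter query" for the English question: which samples have not been cloned?
starter = "SELECT sample FROM plasmiddb WHERE cloned = '' ORDER BY sample"
not_cloned = [row[0] for row in conn.execute(starter)]
```

A researcher can then edit `starter` to ask an adjacent question, for example changing the filter to `"expression?" = 'yes'`.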
    67. 67. • Which samples have not been cloned?
            SELECT *
            FROM plasmiddb
            WHERE NOT (ISDATE(cloned) OR cloned = 'yes')
    68. 68. Details
        • Backend is Microsoft’s SQL Azure
        • About a year old, essentially zero advertising
        • 50+ active users (UW and external)
        • 2500+ tables/views (~20% are public)
        • Largest table: 1.1M rows; smallest table: 1 row
        • Transitioning to support by UW-IT
    69. 69. View Semantics and Features
        • “Saved query” = view with attached metadata
        • Unify views and tables as “datasets”
          – table = “select * from [raw_table]”
        • Replacement semantics for name conflicts
          – old versions materialized and archived
        • Materialize downstream views
          – when dependencies deleted
          – when dependencies become incompatible
        • Permissions
          – public, private, ACLs (may work on groups in the future)
        • Sharing, social querying, CQMS*
          – search, recent queries, friends’ queries, favorites, ratings
          – facilitate sharing and recommendations of not just whole queries, but common predicates, join patterns, etc.
        * [Khoussainova, CIDR 2009]
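The "datasets" idea and the replacement semantics can be sketched with SQLite views. This is a rough illustration under stated assumptions, not SQLShare's implementation: the function `replace_dataset`, the `_v1` naming scheme for archived versions, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_table (strain TEXT, line TEXT);
    INSERT INTO raw_table VALUES ('A', 'l1'), ('A', 'l1'), ('A', 'l2'), ('B', 'l3');

    -- An uploaded table is exposed as a trivial view, so every dataset is a view.
    CREATE VIEW stocks AS SELECT * FROM raw_table;

    -- A saved query is another view layered on the first.
    CREATE VIEW lines_per_strain AS
        SELECT strain, COUNT(DISTINCT line) AS n_lines FROM stocks GROUP BY strain;
""")


def replace_dataset(conn, name, new_sql, version):
    """Hypothetical replacement semantics: archive the old result, then redefine."""
    # Materialize the current contents under a versioned name before replacing.
    conn.execute('CREATE TABLE "{}_v{}" AS SELECT * FROM "{}"'.format(name, version, name))
    conn.execute('DROP VIEW "{}"'.format(name))
    conn.execute('CREATE VIEW "{}" AS {}'.format(name, new_sql))


replace_dataset(
    conn,
    "lines_per_strain",
    "SELECT strain, COUNT(*) AS n_lines FROM stocks GROUP BY strain",
    1,
)

archived = conn.execute("SELECT * FROM lines_per_strain_v1 ORDER BY strain").fetchall()
current = conn.execute("SELECT * FROM lines_per_strain ORDER BY strain").fetchall()
```

After the replacement, the archived table still holds the result of the old definition, while the name `lines_per_strain` resolves to the new one.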
    70. 70. Karen Dvornich