Research Dataspaces: Pay-as-you-go Integration and Analysis

Invited talk to the Center for HIV/AIDS Vaccine Immunology (CHAVI) all-hands meeting at Duke University, May.

  • My name is Bill Howe and I’m from the University of Washington eScience Institute
  • Drowning in data; starving for information
    We’re at war with these engineering companies. FlowCAM is bragging about the amount of data they can spray out of their device. How to use this enormous data stream to answer scientific questions is someone else’s problem.
    “Typical large pharmas today are generating 20 terabytes of data daily. That’s probably going up to 100 terabytes per day in the next year or so.”
    “tens of terabytes of data per day” -- genome center at Washington University
    FlowCAM alone increases data collection exponentially.
  • Steps 1,2,3,4 are expensive
    At the frontier of research, 1,2,3 are by definition elusive. By definition, the domain is not fully understood. By definition, researchers do not universally agree on the interpretation of data; there is no universal domain model.
    How do you build a domain model and build consensus? You perform experiments, analyze the data, and publish the results -- the overall goal of the scientific enterprise is to achieve shared knowledge. So it is a mistake to presuppose the existence of an ontology as a way to facilitate data analysis.
    We perform data analysis in order to establish an ontology; we can’t make establishing an ontology a prerequisite for data analysis.
    shared knowledge is the end result of research, not a precondition for it.
    So in practice, you find individual researchers and individual analysts managing their own data ad hoc -- mostly in text files, and mostly with SAS, Perl, Python, or MATLAB, and maybe Excel if they do not have a programming background.
    So what results is this ecosystem of “desultory” data -- varying levels of metadata, varying levels of quality, varying levels of accessibility.
    Now, it is a phenomenally good idea to strive for machine-readable representations of knowledge. When these exist, there are a variety of ways to exploit them to improve interoperability, build applications, and facilitate communication. But at the frontier of research, they simply can't exist until we have a chance to analyze all the data.
    So this talk is about technology that can tolerate the heterogeneity and ambiguity of research data.
    Aside: Formally, an interpretation of a model is a mapping from the elements in the model to the elements in the real world. A sound interpretation of a logical model is one where all true statements in the artificial model are true when mapped into the real world. (A formal sketch follows at the end of this list.)
    A global ontology or other form of comprehensive schema presumes some measure of consensus (which may take the form of mappings between different sub-ontologies or sub-schemas -- yet these are still a form of consensus).
    Even if you can successfully capture global agreement into a shared schema, it’s only a snapshot -- its half-life will be very short.
    Even if your schema is sound and complete, and even if you can keep up with changing requirements, the rigidity of the model still limits what data can enter the system.
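    A minimal formalization of that aside, with notation introduced here for illustration (not from the slides): let M be the logical model and W the world.

      % An interpretation maps model elements to world elements:
      \[ I : M \to W \]
      % Soundness: every statement true in the model holds of the world under I:
      \[ I \text{ is sound} \iff \forall \varphi \,\bigl( M \models \varphi \implies W \models I(\varphi) \bigr) \]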
  • My claim is that the value of a repository is quadratic in the number of datasets it holds. The reason is that every pairwise comparison of two datasets potentially provides new insight (see the counting sketch at the end of this list).
    Some things to observe: many systems only provide basic retrieval -- a unary operation -- and so they scale linearly. (Though of course the user can download two datasets and compare them locally, so the B coefficient is non-zero.)
    I think of C as the benefit of having a shared domain model to focus the conversation, resolve ambiguities, and encourage converging mental models among users, especially new users (students). So this is important, but it gets no additional benefit from having 1, 10, 1,000, or 1M datasets: its value is constant.
    So my recommended strategy is to make sure B is non-zero, and crank up D as high as possible.
    Now, having a rich, thorough domain model can help facilitate more operations -- maybe a new genome browser might be easier to build if we have a well-tuned schema to work from. However, it’s demonstrably NOT the case that a genome browser *could not* be built without such a schema. Indeed, a huge number of simple desktop genome browsers exist that do not have any shared semantics.
    The disadvantage to having a rich, thorough domain model is that it restricts D -- it limits the amount of data you can put into the system. Data with missing, incomplete, or ambiguous metadata cannot be ingested. So you’ve increased C (and possibly B and U) at the expense of D -- this is not a good idea.
    So we need systems that are inclusive, that emphasize breadth over depth (at least initially), emphasize coverage.
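    To make the quadratic claim concrete, here is the counting argument behind it, written against the formula from slide 5:

      % Each unordered pair of datasets is a potential comparison:
      \[ \binom{D}{2} = \frac{D(D-1)}{2} = O(D^2) \]
      % so the pairwise (binary-operation) term dominates the repository's value:
      \[ V_R = B\,D^2 + U\,D + C \]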
  • Many failed systems in and out of science attribute their failure to inflexibility -- too hard to get data into the system due to over-engineered metadata standards.
    The value of a data repository is quadratic in the number of datasets it holds:
    V_R = BD^2 + UD + C
    D = number of datasets
    B = analysis capability (binary operations over pairs of datasets)
    U = simple accessibility (unary operations over single datasets)
    C = communicability <-- intrinsic value of domain knowledge, metadata standard
    So a rich and thorough domain model increases C, but can decrease D -- it's too difficult to put data into the correct format, with all the required metadata, so the system is underused.
    Example: the FGDC geospatial metadata standard.
  • Data in a variety of formats, ensconced in autonomous systems with different capabilities and different schemas.
    How do you get started in this environment? What’s the first thing you do?
  • Data sources do not share a schema, and may not exhibit a schema at all. Data is allowed to exist in its native form behind its native interfaces.
    These data sources are also autonomous, so you're not necessarily allowed to take all the flat files and replace them with XML.
    You have to pay for all this freedom and flexibility somewhere, and here's where you do it: only with lots of global properties can you define the sophisticated services that exploit them -- structured query and strong integrity guarantees.
  • Put another way, databases are inherently exclusive, helping you reject data that does not conform to your schema or satisfy your integrity constraints.
    The dataspace support platform is inclusive -- everybody is welcome
  • The dataspace provides a hierarchy of services to accommodate varying degrees of data “maturity”
  • Add a screenshot of Google
    There is no global schema for the Internet
    Search is approximate, “best effort”…and highly effective
  • What do you need to do to forecast the physical state of the ocean? You're going to be solving a set of partial differential equations, so you need forcings at the boundaries of the domain -- river discharge, tides, atmospheric conditions, and bathymetry. Every day, you can download results of atmospheric forecasts, compute tidal forcings, and estimate river discharge, as well as some observational data to compare with your simulation and see how well you're doing.
    These data go into files and relational databases. When the forecast is ready to run, these inputs are staged out to a compute cluster, along with the FORTRAN code that will solve the equations and some post-processing routines. The forecast executes, incrementally generating data files, visualizations, log files, and status information.
    This information is pushed back to the storage servers and the visualizations are served over the web.
    This process generates lots of intermediate data of a variety of types -- we want to provide browse and query services over these data without disturbing the operational system and without a lengthy design phase -- we want results by 5:00pm. (A sketch of such a catalog follows at the end of this list.)
    Some data loaded into a relational database
    Others left as files (no need for ad hoc query; one-time use; large size)
    SELFE: a semi-implicit Eulerian-Lagrangian finite-element model. Solves the 3D Navier-Stokes equations. Produces 6 variables * 700MB/day.
    Hindcast runs: compare code versions, compare inputs, long term behavior, what if analysis (river dredging), Tsunami model assumptions
    Data products:
    Animations, maps, timeseries plots station extractions, model-data comparisons
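    As a minimal sketch of the browse/query catalog mentioned above -- the table and column names here are hypothetical, not the operational system's actual schema:

      -- One row per forecast or hindcast run.
      CREATE TABLE forecast_runs (
        run_id       INTEGER PRIMARY KEY,
        run_date     DATE,
        code_version VARCHAR(32),   -- e.g., which SELFE build produced the run
        run_type     VARCHAR(16),   -- 'forecast' or 'hindcast'
        status       VARCHAR(16)    -- 'running', 'complete', 'failed'
      );

      -- Browse without disturbing the operational system:
      -- compare code versions across completed hindcast runs.
      SELECT code_version, COUNT(*) AS num_runs
      FROM forecast_runs
      WHERE run_type = 'hindcast' AND status = 'complete'
      GROUP BY code_version;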
  • Slide from Robin Kodner. Key idea: This protocol is far more precise than BLAST for sequence searching, but it generates a lot of heterogeneous intermediate results that must be analyzed -- this step had completely roadblocked the research. With SQLShare, you can just throw all the data up to the cloud and start asking questions right away -- collaboratively. None of the overhead associated with an RDBMS.
    == Name ==
    SQLShare: Cloud-based Collaborative Query
    == PI ==
    Ginger Armbrust
    == Science Perspective ==
    From Robin:
    “The SQLShare database is allowing me to do basic sorting and clustering of my data that took me a week to do in Excel, now in a matter of seconds. It is also making it possible to correlate the analyzed data that results from different kinds of analysis from different analysis pathways, which maximizes the use of the data. Further, it allows for finding correlations between different projects and the corresponding environmental metadata that would be impossible without the database. Without the database, I'd only be able to utilize a fraction of my data, and find only a fraction of the interesting nuggets that we are looking for. I conceived of the database to help me with metagenomic data but it's so useful, we are now using it to do comparative genomics and evolutionary studies.”
    == Computational View ==
    Goal: Tolerate the “spreadsheet tsunami”
    Each user has O(100) files with O(100k) rows each
    heterogeneous, changing schemas
    Observation: Databases underused in science
    Hypothesis: Scientists dislike RDBMS, not SQL
    Installation, configuration, schema design, physical tuning
    Approach: Just put it in the cloud and query it
    ignore DB design; do auto-scaling and auto-tuning
    System-enabled sharing of data and queries
    Only makes sense for science!
    Ex: Two labs both buy an AB SOLiD sequencer, so both may use the same queries to process the output
    == Resources ==
    Azure for the application, SQL Azure for the system data, EC2 for the user data (to avoid 10GB limit on SQL Azure)
    == Comparison ==
    Quotes: “I can do science again” “That took me a week to do with spreadsheets!” “I spend 90% of my time manipulating data in spreadsheets.” “My research was stuck on data analysis before SQLShare”
  • Environmental samples are sequenced.
    Sequence fragments are looked up in public databases, and passed through phylogenetic analysis to place them at the appropriate location in the tree.
    Each step generates a bunch of “residual” data, usually in the form of spreadsheets or text files.
    This process is repeated many times, leading to 100s of “desultory” spreadsheets
    The actual science questions are answered using these spreadsheets by computing “manual joins”, creating plots, searching and filtering, copying and pasting, etc.
    It’s a mess -- when asked how much time is spent “handling data” as opposed to “doing science”, one postdoc said a staggering 90%!
  • Here are two datasets: sequence annotations for the Phaeodactylum organism and sequence annotations from an environmental sample.
    The task is to compare these sets of annotations to determine what role Phaeo is serving in the metagenomic population, if present.
    Previously, researchers had to manually cross-reference data between spreadsheets.
    But the join between these datasets is trivially expressed in SQL (see the query after this list, taken from slide 22).
    Now, that was just the first step -- counting subsets, finding intersections, finding “top K” matches, etc. must also be performed manually, but these are also easily expressed in SQL.
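    The query from slide 22, reformatted for readability (table and column names exactly as they appear on the slide):

      -- Cross-reference the Phaeo annotations with the environmental sample.
      SELECT *
      FROM annotationsummary_combinedorfannotation16_phaeo_genome,
           COGAnnotation_surface
      WHERE phaeo_gene = surf_hit;

    A follow-on “top K” question is just as direct; the grouping column here is illustrative:

      -- Ten most common COG hits in the environmental sample
      -- (keep the first 10 rows via LIMIT or TOP, depending on dialect).
      SELECT hit, COUNT(*) AS n
      FROM COGAnnotation_surface
      GROUP BY hit
      ORDER BY n DESC;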
  • -- No schema design: Just upload everything “as is” and start querying. No one wants to create one, and the schema’s going to change anyway.
    -- We find that ALL of the scientists’ English queries are expressible in SQL. Some can be complex, however.
    -- Challenge: SQL is hard
    -- Solution: Let scientists train themselves. Give them examples to modify instead of a “blinking cursor.” More generally: Facilitate collaborative query authoring, sharing, and reuse. Support collaboration between the “carpet lab” and the “tile lab” (Computer geeks work in carpeted offices, bio geeks work in the wet lab.)
    -- How?
    1) Use the cloud to logically and physically co-locate all data across all labs -- no more islands
    2) Let queries be saved and shared (see the view sketch after this list)
    3) Log everything and do machine learning on the log to perform “Query autocomplete” (Nodira and Magda’s work)
    4) Automatically adapt queries for use on ‘similar’ datasets (change table names, etc.)
    many more ideas….
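    A minimal sketch of ideas 1 and 2 -- the view name is hypothetical, and which table each column belongs to is an assumption:

      -- Save the cross-reference join under a name...
      CREATE VIEW phaeo_vs_coastal AS
      SELECT p.phaeo_gene, s.surf_hit, p.description
      FROM annotationsummary_combinedorfannotation16_phaeo_genome p,
           COGAnnotation_surface s
      WHERE p.phaeo_gene = s.surf_hit;

      -- ...so a collaborator can build on it without rewriting the join:
      SELECT description, COUNT(*) AS n
      FROM phaeo_vs_coastal
      GROUP BY description;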
  • Items to point out:
    -- IMPORTANT: These are not trivial queries! But with some help, scientists can write them. We give them an example, and they modify the example and save it for reuse. Queries can be optionally shared across users.
  • Currently we have insular data sources
    Pay as you go
    Smoothing the ROI curve!
  • Science slide for Ginger
  • In the previous case, the same source of database identifiers was used; when the identifiers differ, the process can be more complicated.
    Here we have two datasets: Phaeo gene annotations again, and a set of sample annotations with references to the TIGRFam database.
    The workflow here might look like:
    Find an annotation of interest in Phaeo dataset
    Look up COG Id to get Protein Name
    Search for Protein Name in various online databases (here we use SwissProt) to collect additional information
    Browse to cross-reference information to find TIGRFam Id,
    Find Gene Ontology synonym of the TIGRFam Id to collect additional metadata (other metadata not shown -- another step)
    Finally, match TIGRFam Ids back to original sample.
    By putting all of this data into a database, you can write these expressions as joins (a sketch follows at the end of this list). More importantly, you can go beyond “lookup” tasks and express the actual science questions directly:
    What percentage of Phaeo genes are present in this sample? What metabolic processes are those genes involved in?
    Note that we do NOT want to attempt to create “YAUDB” (yet another universal database). These data are uploaded and manipulated in an exploratory, task-specific manner. We aim to provide SQL over YOUR data, not to build a universal reference database from scratch.
    (That being said, our research involves learning a universal database schema -- incrementally and organically -- based on the uploaded data, the executed queries, and any available user input.)
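    A minimal sketch of that lookup chain as a single join -- all table and column names here are hypothetical stand-ins for the uploaded spreadsheets and mapping tables:

      -- Each JOIN replaces one manual browse-and-cross-reference step.
      SELECT s.query AS sample_fragment,
             g.go_id AS gene_ontology_term
      FROM phaeo_annotations  p                                        -- COG ids per Phaeo gene
      JOIN cog_to_protein     cp ON p.cog_id = cp.cog_id               -- COG id -> protein name
      JOIN protein_to_tigrfam pt ON cp.protein_name = pt.protein_name  -- via the SwissProt lookup
      JOIN tigrfam_to_go      g  ON pt.tigrfam_id = g.tigrfam_id       -- TIGRFam -> GO synonym
      JOIN coastal_sample     s  ON s.hit = pt.tigrfam_id;             -- back to the original sample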
  • It provides a means of describing data with its natural structure only--that is, without superimposing any additional structure for machine representation purposes. Accordingly, it provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation on the other.
  • So what’s wrong?
    Applications write queries, not users
    Schema design, tuning, “protectionist” attitudes
  • It turns out that you can express a wide variety of computations using only a handful of operators; for instance:
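    Composing just select, project, and join (the operator set from slide 46; table and column names are illustrative):

      -- Relational algebra: project_description( select_{e_value < 1e-5}(Sample) join_{cog_id} COG )
      SELECT c.description
      FROM Sample s
      JOIN COG c ON s.cog_id = c.cog_id
      WHERE s.e_value < 1e-5;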
  • Data Management != Storage Management
    Storage management is:
      SATA/SCSI/Fiber
      Backup policies and procedures
      Redundancy decisions (RAID 0, 1+0, 0+1, 5)
    Data management adds:
      Access methods
      Query languages
      Data mining, analysis, visualization
      Data integration
  • Research Dataspaces: Pay-as-you-go Integration and Analysis

    1. Research Dataspaces: Pay-as-you-go Integration and Analysis. Bill Howe, PhD, University of Washington
    2. Data acquisition is no longer the bottleneck to scientific discovery. Old model: “Query the world” (data acquisition coupled to a specific hypothesis). New model: “Download the world” (data acquired en masse, in support of many hypotheses). • Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS) • Oceanography: high-resolution models, cheap sensors, satellites • Biology: lab automation, high-throughput sequencing
    3. [Chart] Two dimensions: # of bytes vs. # of data types, with example systems from biology, oceanography, and astronomy plotted: LSST, SDSS, PanSTARRS, Galaxy, BioMart, GEO, IOOS, OOI, LANL HIV, Pathway Commons
    4. Building a Research Data Management System: Status Quo. 1. Establish (scientific) consensus -- scope, vision, requirements, terminology. 2. Derive and encode a domain model (schema) -- encode shared knowledge in a machine-readable manner: (a) relational schema, ontology, metadata standards, conventions, controlled vocabularies, object model, API; (b) mappings between existing models. 3. Retrofit the new domain model to existing data -- populate the schema, attach semantics, clean data. 4. Build applications -- use the domain model to inform design. 5. Analyze data -- do science.
    5. The Value of a Data Repository: V_R = B*D^2 + U*D + C, where D = # of datasets in the repository, B = # of binary operations facilitated, U = # of unary operations facilitated, C = intrinsic value of the schema (for communication, etc.)
    6. “A typical biological data management system involves accessing or gathering data from multiple sources, followed by data correlation, classification, review, and curation using domain-specific tools (e.g., functional clusters, ontologies) and expertise. In practice, biological data management is less daunting when it is considered in the context of an iterative strategy based on gradual data integration while accumulating domain-specific knowledge throughout the integration process.” -- Victor Markowitz, LBNL
    7. Outline: • Challenges • Dataspaces • Dataspace Support Platforms • Next Steps
    8. Dataspaces [Franklin, Halevy, Maier 2005]. Slide source: Alon Halevy
    9. Data Management Solutions
    10. Databases vs. Dataspaces: single schema vs. data “coexistence”; centralized administration vs. autonomous sources; structured query vs. search, browse, and approximate answers; strict integrity constraints vs. patterns and trends with few global properties
    11. Dataspaces vs. Databases (2). Databases are exclusive: they reject data that violates types, schema, integrity constraints, rules + triggers; in return you get structured query, logical and physical data independence, and transactions…over the clean subset of your data. Dataspaces are inclusive: few restrictions, all data is welcome; in return you get best-effort services at first -- cataloging, keywords, attribute-value…over (almost) everything
    12. Dataspace Services: Catalog → Keyword Search → Structured Query → Analysis and Vis → Task-specific Tools. Over time, a dataset becomes accessible by additional services.
    13. Dataspace Services: Cataloguing, Keyword Search, Structured Query, Analysis and Visualization, Task-specific Applications
    14. Dataspace Services: Cataloguing, Keyword Search, Structured Query, Analysis and Vis, Task-specific Tools
    15. Example: The Internet
    16. Example: Ocean Circulation Forecasting System. [Diagram] Forcings (i.e., inputs) -- atmospheric models, tides, river discharge -- are staged by perl and cron to a cluster running the FORTRAN model; simulation results, config and log files, intermediate files, annotations, and data products land in the filesystem and an RDBMS; products -- salinity isolines, station extractions, model-data comparisons -- are served via the web
    17. Example: Environmental Metagenomics. [Diagram] Sampling (metagenomes 1-4, plus environment metadata) → sequencing (raw data) → CAMERA annotation; HMMer search of the meta*ome against seed alignments; PPLACER placement on precomputed reference trees → annotation tables (Pfams, TIGRfams, COGs, FIGfams), stats, taxonomic info → analyzed data in SQLShare: correlate diversity w/environment, correlate diversity and nutrients, find new genes, find new taxa and their distributions, compare meta*omes. src: Robin Kodner
    18. Example: CHAVI Relational Dataspace. [Diagram] Interface and analysis over: B Cell Control, T Cell Control, NK Cell Control, Genetics Databases, NHP Database, Virus Seq. Data. src: Bart Haynes
    19. Outline: • Challenges • Dataspaces • Dataspace Support Platforms • Next Steps
    20. Example Systems cast as DSSPs: • Atlas (LabKey) -- data model: tables and files; Mark Igra will present • “Data Warehouse” prototype (SCHARP) -- data model: relations • SQLShare (UW eScience) -- data model: relations • Quarry [Howe, et al. 2006] -- data model: triples • iTrails [Salles et al. 2007] -- data model: triples • Google Fusion Tables [Halevy 2010] -- data model: relations
    21. [Diagram] Environmental sampling → sequencing → search hits against public annotation DBs (Pfams, TIGRfams, COGs, FIGfams) → phylogeny, taxonomic info, metadata. Questions: correlate diversity w/environment? correlate diversity w/nutrients? find new genes? find new taxa and their distributions? compare meta*omes? “90% of my time spent manipulating data rather than doing science”
    22. Simple Example. [Two datasets shown as tables: ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome (query, length, COG hit, e-value, identity, score, hit length, description) and COGAnnotation_coastal_sample.txt (id, query, hit, e_value, identity_score, query_start, query_end, hit_start, hit_end, hit_length).] Query: select * from annotationsummary_combinedorfannotation16_phaeo_genome, COGAnnotation_surface where phaeo_gene = surf_hit
    23. [Diagram] The same environmental sampling pipeline as slide 21, with SQLShare and SQL closing the loop on the science questions. “That took me a week with Excel” “I can do science again”
    24.
    25.
    26. SQLShare Motivation. Conventional wisdom says “Scientists won't write SQL.” We don't believe it. Instead, we implicate difficulty in: installation, configuration, schema design, performance tuning, data ingest, and over-reliance on GUIs. We ask: what kind of technology would make SQL a natural fit for hypothesis testing?
    27. SQLShare Features: • Collaborative SQL authoring and sharing • Views for incremental abstraction and integration • Semi-automatic integration: identify “natural” unions and joins • SQL Autocomplete: the user starts typing and the system uses query logs to make suggestions [Khoussainova 10] • English Query: bootstrap a SQL query from an English question • Simple visualization via integration with Google Fusion Tables
    28. Outline: • Challenges • Dataspaces • Dataspace Support Platforms • Next Steps
    29. Next Steps: • Define scope • Define the HIV Dataspace team • Build a minimal technical team: a “Data Wrangler” and an “Application Wrangler” • Identify and catalog dataspace “participants” (i.e., sources) • Review data access rights and security requirements • Gather a “spanning basis” of questions to answer (Jim Gray's “20 questions” methodology) • Gather a “spanning basis” of existing data: use exemplars if necessary; load data “as is” into a database
    30. Next Steps (2): • Answer initial questions (data wrangler); RDBMS example: create views • Visualize initial answers (application wrangler) • Demonstrate early progress • Check breadth (what's missing?) • Check depth (did the “hard” questions get answered?)
    31. Summary: • Conventional “schema-first” approaches break down in research contexts • The dataspace abstraction and DSSPs offer a way forward • Systems and best practices are emerging in the literature and from production deployments
    32.
    33. BACKUP SLIDES
    34. Feature: Sharing SQL
    35. Feature: SQL Autocomplete. • The user requests suggestions on the fly as he or she types a query • Recommends snippets: predicates in the WHERE clause, tables in the FROM clause, attributes in the SELECT clause • Recommendations are context-aware • Leverages past queries by the user and collaborators. Src: Nodira Khoussainova
    36. Feature: English Query. • Lots of research on natural language interfaces to databases, c.f. [Etzioni 2008, Zettlemoyer 2009] • A very hard problem in general • Significant simplification: the user can inspect and “fix” the generated SQL prior to execution
    37. Feature: Simple Visualization. “For each phaeo gene, count the number of matches in the COGAnnotation_surface dataset, joining on COG id. Return the top 10 most commonly found genes.” Implementation: export to Google Fusion Tables
    38. Dataspaces: Summary. A “Dataspace Support Platform” should: • use a “lowest common denominator” data model • not rely crucially on upfront global consensus • not rely crucially on “perfect” metadata • embrace exceptions, but exploit patterns • support task-specific, “top down” integration…but seek and exploit cross-cutting patterns where possible • deliver incremental return for incremental investment…in data quality enhancement, in metadata normalization, in usage standardization, in application “convergence”
    39. Timeline. [Chart] Value for users vs. time, scope, and effort, comparing: insular data sources; data integration tools; federated databases; the Semantic Web (RDF/OWL, ontologies); dataspace support platforms; dataspaces
    40. Example: Metagenomics -- study microbial populations sampled from the environment instead of individual organisms. 1. Who is there? Which organisms make up the population? (metagenomics) 2. What are they doing? Which metabolic pathways are present and active, and who is doing what? (metatranscriptomics, metaproteomics) 3. Compare datasets: across a transect (nearshore vs. deep ocean); before/after some event (e.g., spring freshet); across salinity/temperature gradients; diurnal cycles (day/night). Source: Robin Kodner, Armbrust Lab
    41. Complex Example. [Tables shown: a coastal sample of TIGRFam hits (id, query, hit, e_value, query_start, query_end, hit_start, hit_end, hit_length); COG database entries, e.g., COG4547 “Cobalamin biosynthesis protein CobT” and COG5099 “RNA-binding protein of the Puf family, translational repressor”; the ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome annotations; a SwissProt web-service lookup and browser cross-reference; and a TIGRFAM-to-GO mapping (e.g., TIGR01651 → GO:0009236).]
    42. Background: Relational Databases. Pre-relational brittleness: if your data changed, your application often broke. Early RDBMSs were buggy and slow (and often reviled), but required only 5% of the application code. Key idea: programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation. “Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- E.F. Codd, 1979
    43. Relational Databases: Summary. • A general data model: “just tables” • Logical and physical data independence • A declarative query language, via the “relational algebra” • Good scalability: “SQL is the most successful parallel language in the world” • Results: a $15B industry; nearly every (non-search-engine) website you visit is backed by an RDBMS; one of the all-time best examples of CS research impact
    44. So what went wrong? • DBAs! • “Schema design” became paramount • “Applications write queries, not users” • Applications became tightly coupled to the schema • Ad hoc queries, ad hoc views, and ad hoc data confounded predictable performance, centralized management, and strong global guarantees • Result: other tools were enlisted to fill the gap -- Java, etc.; XML, RDF, etc.; web services
    45. Key Idea: Data Independence. Physical data independence separates relations from files and pointers; logical data independence separates views from relations. Instead of low-level file code -- f = fopen('table_file'); fseek(10030440); while (True) { fread(&buf, 1, 8192, f); if (buf == GATTACGATATTA) { . . . -- you write SELECT seq FROM all_sequences WHERE seq = 'GATTACGATATTA'; or, against the underlying relation, SELECT dna FROM ncbi_sequences WHERE dna = 'GATTACGATATTA';
    46. Key Idea: An Algebra of Tables -- select, project, join. Other operators: aggregate, union, difference, cross product
    47. Key Idea: Algebraic Optimization. N = ((z*2)+((z*3)+0))/1. Algebraic laws: 1. (+) identity: x+0 = x; 2. (/) identity: x/1 = x; 3. (*) distributes: (n*x+n*y) = n*(x+y); 4. (*) commutes: x*y = y*x. Apply rules 1, 3, 4, 2: N = (2+3)*z -- two operations instead of five, and no division operator. The same idea works with the relational algebra!
    48. My Interests: computer science; scientific data management; databases; data-intensive scalable computing; research data integration; cloud computing; visual data analytics
    49. Research Cycle: Observe → Experiment → Analyze → Publish/Share → Synthesize
    50. [Word cloud] Web services, data management, query languages, storage, cloud computing, visualization, workflow, information integration, information extraction, access methods, data mining, distributed programming models, complexity-hiding interfaces
