
SciDB : Open Source Data Management System for Data-Intensive Scientific Analytics

  1. 1. Open Source Data Management System for Data-Intensive Scientific Analytics Jacek Becla San Diego Supercomputer Center 05/29/2009
  2. 2. Outline <ul><li>Needs, challenges, today’s solutions and emerging trends </li></ul><ul><li>SciDB design and planned features </li></ul><ul><li>SciDB structure and timeline </li></ul>
  3. 3. Size Challenge <ul><li>Data set sizes grow dramatically </li></ul><ul><li>Growth rate increases </li></ul><ul><li>Implications </li></ul><ul><ul><li>Failures are routine </li></ul></ul><ul><ul><li>Provenance tracking is a must </li></ul></ul><ul><ul><li>Massive parallelization is a must </li></ul></ul><ul><ul><li>Full automation, self-adjustment is a must </li></ul></ul>
  4. 4. Analytics Complexity <ul><li>More data varieties = more ways to analyze it </li></ul><ul><li>Rapid growth of complexity of analytics </li></ul><ul><ul><li>Time series comparisons </li></ul></ul><ul><ul><li>N² and N³ correlations </li></ul></ul><ul><ul><li>Proximity and grouping-based searches </li></ul></ul><ul><li>Interactive exploration enables most science </li></ul><ul><li>Data uncertainty matters </li></ul><ul><li>Provenance is an integral part of analytics </li></ul><ul><li>User annotations are important </li></ul><ul><li>Ad-hoc integration of derived data with raw data desired </li></ul><ul><li>True for science and industry </li></ul>
  5. 5. Today’s Technologies <ul><li>Existing databases </li></ul><ul><ul><li>Most too monolithic </li></ul></ul><ul><ul><li>Expensive to scale </li></ul></ul><ul><ul><li>Expensive to provide high availability </li></ul></ul><ul><ul><li>Built for perfect schemas and clean data </li></ul></ul><ul><ul><li>Relational data model far from ideal for most projects </li></ul></ul><ul><ul><li>APIs far from ideal </li></ul></ul><ul><ul><ul><li>Desired intuitive interfaces </li></ul></ul></ul><ul><li>Most very large systems shy away from databases </li></ul>
  6. 6. Today’s Solutions <ul><li>Metadata in lightweight database plus bulk data in files </li></ul><ul><ul><li>BaBar, LHC, LCLS </li></ul></ul><ul><li>Bulk data stored as unstructured data in database </li></ul><ul><ul><li>NIF </li></ul></ul><ul><li>Raw data in files, derived data in database </li></ul><ul><ul><li>PanSTARRS, LSST (future projects) </li></ul></ul><ul><li>Complete (or mostly) home-grown systems </li></ul><ul><ul><li>ATT, Google, Yahoo, Amazon, Facebook </li></ul></ul><ul><ul><li>Most common solution </li></ul></ul><ul><li>All in database </li></ul><ul><ul><li>WalMart (very expensive) </li></ul></ul><ul><ul><li>eBay (very expensive, testing new home grown solution) </li></ul></ul><ul><ul><li>SDSS, bio, genomics (small-ish, single-server databases) </li></ul></ul><ul><li>Little reusing, roll-your-own mentality </li></ul>
  7. 7. Future <ul><li>Emerging trends </li></ul><ul><ul><li>Shared nothing parallel database </li></ul></ul><ul><ul><li>Lightweight, specialized components </li></ul></ul><ul><ul><li>On low-cost commodity hardware </li></ul></ul><ul><ul><li>Aggressive compression </li></ul></ul><ul><li>Several attempts to push state-of-the-art forward </li></ul><ul><ul><li>Aster Data, Vertica, ParAccel, EnterpriseDB, Greenplum, Netezza </li></ul></ul><ul><li>Some issues not addressed by anyone </li></ul><ul><ul><li>Arrays, provenance, uncertainty, partial results, intuitive interfaces </li></ul></ul>
  8. 8. XLDB Activities <ul><li>2007 </li></ul><ul><li>Identify trends, roadblocks </li></ul><ul><li>Bridge the gaps </li></ul><ul><li>2008 </li></ul><ul><li>Complex analytics </li></ul><ul><li>Bridge the gaps </li></ul><ul><li>2009 </li></ul><ul><li>Reach out to non-US communities </li></ul><ul><li>Connect with remaining sciences </li></ul>
  9. 9. Outline <ul><li>Needs, challenges, today’s solutions and emerging trends </li></ul><ul><li>SciDB design and planned features </li></ul><ul><li>SciDB structure and timeline </li></ul>
  10. 10. New Open Source Science Database System <ul><li>Philosophy </li></ul><ul><ul><li>address common scientific needs </li></ul></ul><ul><ul><li>geared for analytics, not OLTP </li></ul></ul><ul><li>Key requirements </li></ul><ul><ul><li>open source </li></ul></ul><ul><ul><li>commercial quality </li></ul></ul><ul><ul><li>peta-scale </li></ul></ul>
  11. 11. Data Model - Types <ul><li>Scalars </li></ul><ul><ul><li>standard base types (int, float, string, date, …) </li></ul></ul><ul><ul><li>geospatial (3-D points, lines, polygons, boxes) </li></ul></ul><ul><li>Multi-d arrays </li></ul><ul><ul><li>regular or irregular </li></ul></ul><ul><ul><li>any number of dimensions </li></ul></ul><ul><ul><li>nesting allowed </li></ul></ul><ul><ul><li>dense or sparse </li></ul></ul>
  12. 12. Data Model - Operators <ul><li>Native (built-in) </li></ul><ul><ul><li>array-sql (filter, project, group_by, aggregation, …) </li></ul></ul><ul><ul><li>array (pivot, regridding, reshaping, transformations, nest, flatten, …) </li></ul></ul><ul><li>User-defined functions </li></ul><ul><ul><li>Postgres-style </li></ul></ul><ul><ul><li>coded in C++ </li></ul></ul><ul><li>Native operators coded as UDFs </li></ul><ul><li>All UDFs treated equally </li></ul><ul><ul><li>optimizer might do more with built-in UDFs </li></ul></ul><ul><li>Two kinds: per cell, per array </li></ul><ul><li>All UDFs executable in parallel </li></ul><ul><li>Paradigm: primitives for data-heavy compute </li></ul>
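The distinction between per-cell and per-array UDFs can be illustrated with a small sketch. This is not SciDB code: the slide says UDFs are coded in C++, and the Python function names below (`apply_per_cell`, `apply_per_array`, `transpose`) are hypothetical, chosen only to show the two calling shapes.

```python
from typing import Callable, List

Array2D = List[List[float]]


def apply_per_cell(array: Array2D, udf: Callable[[float], float]) -> Array2D:
    """Per-cell UDF: each output cell depends only on the matching input cell."""
    return [[udf(cell) for cell in row] for row in array]


def apply_per_array(array: Array2D, udf: Callable[[Array2D], Array2D]) -> Array2D:
    """Per-array UDF: the function receives the whole array at once."""
    return udf(array)


def transpose(array: Array2D) -> Array2D:
    """A whole-array operation (a simple 2-D pivot)."""
    return [list(column) for column in zip(*array)]


data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
doubled = apply_per_cell(data, lambda x: 2 * x)   # cell-wise transformation
pivoted = apply_per_array(data, transpose)        # needs the full array
```

A per-cell function of this kind can be run on any partition of the array independently, whereas a per-array function may need data from several partitions, which is one plausible reason for keeping the two kinds separate.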
  13. 13. Data Model – Match To Science Needs <ul><li>astronomy </li></ul><ul><li>earth and environmental sciences, including oceanography, remote sensing, seismology </li></ul><ul><li>bio-medical imaging </li></ul><ul><li>fusion </li></ul><ul><li>bio (need sequences) </li></ul><ul><li>chemistry (need network structures) </li></ul>
  14. 14. Query Language <ul><li>“Parse-tree” representation of operations </li></ul><ul><li>“Bindings” to C++, Python, IDL, ... (TBD) </li></ul><ul><li>Tight integration with popular statistical tools like R or MATLAB </li></ul>
  15. 15. Storage Model <ul><li>Granularity </li></ul><ul><ul><li>“Chunked” arrays </li></ul></ul><ul><ul><ul><li>Chunk = unit of storage, buffering and compression </li></ul></ul></ul><ul><ul><li>Chunks partitioned across nodes </li></ul></ul><ul><li>Parallel model </li></ul><ul><ul><li>Shared-nothing parallel DBMS </li></ul></ul><ul><ul><ul><li>runs on a grid of computers, uniformity not required </li></ul></ul></ul><ul><ul><li>Data exchanged between nodes as needed </li></ul></ul><ul><li>Format </li></ul><ul><ul><li>Loaded or in-situ modes </li></ul></ul><ul><ul><ul><li>in-situ: limited capabilities </li></ul></ul></ul><ul><ul><li>Adaptors to translate external popular formats (like HDF5) on the fly </li></ul></ul>
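As a rough illustration of chunking and partitioning, the sketch below splits a multi-dimensional index space into fixed-size chunks and places the chunks on nodes. The slides do not say how SciDB sizes or places chunks; the regular chunk shape and the round-robin placement are assumptions made for this example, and all names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

ChunkId = Tuple[int, ...]


def chunk_of(cell: Tuple[int, ...], chunk_shape: Tuple[int, ...]) -> ChunkId:
    """Return the id of the chunk that holds the given cell coordinates."""
    return tuple(c // s for c, s in zip(cell, chunk_shape))


def chunk_ids(array_shape: Tuple[int, ...], chunk_shape: Tuple[int, ...]) -> List[ChunkId]:
    """Enumerate all chunk ids needed to cover an array of the given shape."""
    counts = [-(-n // s) for n, s in zip(array_shape, chunk_shape)]  # ceiling division
    return list(product(*(range(k) for k in counts)))


def assign_round_robin(chunks: List[ChunkId], num_nodes: int) -> Dict[ChunkId, int]:
    """Assumed placement policy: deal chunks out to nodes in turn."""
    return {chunk: i % num_nodes for i, chunk in enumerate(chunks)}


chunks = chunk_ids((100, 100), (40, 40))        # 3 x 3 chunks, edge chunks partly filled
placement = assign_round_robin(chunks, num_nodes=4)
home_chunk = chunk_of((85, 10), (40, 40))       # chunk holding cell (85, 10)
home_node = placement[home_chunk]
```

Because every cell maps to exactly one chunk and every chunk to exactly one node, a node can work on its own chunks without shared storage, which matches the shared-nothing model named on the slide.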
  16. 16. Versioning <ul><li>No overwrite storage </li></ul><ul><li>Named versions </li></ul><ul><li>Delta compression </li></ul>
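The three versioning ideas can be combined in a toy model: nothing is overwritten, each commit can carry a name, and only changed cells are stored. This is a minimal sketch under stated assumptions, not SciDB's storage format; the class `VersionedArray` and its methods are hypothetical, and cell deletion is deliberately left out.

```python
from typing import Dict, List, Optional, Tuple, Union

Cell = Tuple[int, ...]


class VersionedArray:
    """Append-only array store: every commit adds a delta, nothing is overwritten."""

    def __init__(self) -> None:
        self._deltas: List[Dict[Cell, float]] = []   # one delta per version
        self._names: Dict[str, int] = {}             # version name -> version number

    def commit(self, new_state: Dict[Cell, float], name: Optional[str] = None) -> int:
        """Store only the cells that differ from the latest version."""
        previous = self.read(len(self._deltas))
        delta = {c: v for c, v in new_state.items() if previous.get(c) != v}
        self._deltas.append(delta)
        version = len(self._deltas)
        if name is not None:
            self._names[name] = version
        return version

    def read(self, version: Union[int, str]) -> Dict[Cell, float]:
        """Rebuild a version by replaying deltas 1..version (version 0 is empty)."""
        if isinstance(version, str):
            version = self._names[version]
        state: Dict[Cell, float] = {}
        for delta in self._deltas[:version]:
            state.update(delta)
        return state

    def delta_size(self, version: int) -> int:
        """Number of cells physically stored for the given version."""
        return len(self._deltas[version - 1])


arr = VersionedArray()
arr.commit({(0, 0): 1.0, (0, 1): 2.0}, name="raw")
arr.commit({(0, 0): 1.0, (0, 1): 2.5}, name="calibrated")
```

Reading `"raw"` after the second commit still returns the original values, and the second version stores a single cell rather than a full copy.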
  17. 17. Provenance <ul><li>Need: </li></ul><ul><ul><li>what operations led to creation of given element </li></ul></ul><ul><ul><li>what operations used this element </li></ul></ul><ul><ul><li>what data elements were used as input to this operation </li></ul></ul><ul><ul><li>what data elements were created as output from this operation </li></ul></ul><ul><li>Natively supported </li></ul><ul><ul><li>easy if workflow in SciDB </li></ul></ul><ul><li>Loading external provenance </li></ul><ul><li>Efficient querying </li></ul><ul><li>No-overwrite + delta compression helps </li></ul>
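The four provenance questions listed above map naturally onto a log of operations with their inputs and outputs. The sketch below is only an illustration of that mapping; `ProvenanceLog`, its methods, and the element names are hypothetical and do not describe SciDB's actual provenance support.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class ProvenanceLog:
    """Records, per operation, which elements it read and which it produced."""
    inputs: Dict[str, Set[str]] = field(default_factory=dict)
    outputs: Dict[str, Set[str]] = field(default_factory=dict)

    def record(self, op: str, inputs: Iterable[str], outputs: Iterable[str]) -> None:
        self.inputs[op] = set(inputs)
        self.outputs[op] = set(outputs)

    def lineage(self, element: str) -> Set[str]:
        """Which operations led (directly or indirectly) to this element?"""
        seen: Set[str] = set()
        pending = [element]
        while pending:
            current = pending.pop()
            for op, outs in self.outputs.items():
                if current in outs and op not in seen:
                    seen.add(op)
                    pending.extend(self.inputs[op])
        return seen

    def used_by(self, element: str) -> Set[str]:
        """Which operations used this element as input?"""
        return {op for op, ins in self.inputs.items() if element in ins}

    def inputs_of(self, op: str) -> Set[str]:
        """Which elements were inputs to this operation?"""
        return self.inputs[op]

    def outputs_of(self, op: str) -> Set[str]:
        """Which elements were created by this operation?"""
        return self.outputs[op]


log = ProvenanceLog()
log.record("calibrate", ["raw_image"], ["calibrated_image"])
log.record("detect", ["calibrated_image"], ["source_catalog"])
catalog_lineage = log.lineage("source_catalog")
```

When the whole workflow runs inside one system, entries like these can be written automatically as each operation executes, which is presumably what the slide means by provenance being easy if the workflow is in SciDB.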
  18. 18. Uncertainty <ul><li>Error bars carried along in the computation </li></ul><ul><li>Initial version </li></ul><ul><ul><li>interval arithmetic </li></ul></ul><ul><ul><li>uniform error distribution </li></ul></ul><ul><li>More complex models usually science-specific </li></ul><ul><ul><li>might consider implementing some in the future if enough commonalities </li></ul></ul><ul><li>Approximate results </li></ul>
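Interval arithmetic, named above as the initial uncertainty model, treats each value as a range and propagates the range through every operation. The following is a minimal sketch of the standard rules for addition, subtraction and multiplication; the `Interval` class is hypothetical, and it ignores the outward rounding a production implementation would need for floating-point safety.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A value known only to lie somewhere in [lo, hi]."""
    lo: float
    hi: float

    def __post_init__(self) -> None:
        if self.lo > self.hi:
            raise ValueError("lower bound exceeds upper bound")

    @classmethod
    def from_error_bar(cls, value: float, error: float) -> "Interval":
        """Build an interval from a measurement and its symmetric error bar."""
        return cls(value - error, value + error)

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other: "Interval") -> "Interval":
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other: "Interval") -> "Interval":
        # The extremes of a product lie at some pair of endpoints.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))


a = Interval.from_error_bar(10.0, 1.0)   # [9.0, 11.0]
b = Interval.from_error_bar(4.0, 0.5)    # [3.5, 4.5]
total = a + b
difference = a - b
area = a * b
```

The result intervals are guaranteed to contain the true value but can be wider than necessary, which is one reason more refined, science-specific error models are mentioned as future work.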
  19. 19. Resource Management <ul><li>Query scheduling </li></ul><ul><ul><li>including shared scans (train scheduling) </li></ul></ul><ul><li>Query progress </li></ul><ul><li>Support for long-running queries (cancel/stop/restart) </li></ul><ul><li>Pre-execution query cost estimates </li></ul><ul><li>Per user/query limits </li></ul>
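One way to read "shared scans" is that several queries over the same data are served by a single pass, so each chunk is read from disk once rather than once per query. The sketch below illustrates that reading only; the slides do not describe SciDB's scheduler, and `shared_scan`, `read_chunks` and the fold-style query functions are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Fold = Callable[[float, List[float]], float]

reads = 0  # counts how many chunks were fetched from "storage"


def read_chunks(storage: List[List[float]]) -> Iterator[List[float]]:
    """Stand-in for reading chunks from disk; increments the read counter."""
    global reads
    for chunk in storage:
        reads += 1
        yield chunk


def shared_scan(chunks: Iterable[List[float]],
                queries: Dict[str, Fold],
                initial: Dict[str, float]) -> Dict[str, float]:
    """Feed every chunk to all registered queries before moving to the next chunk."""
    results = dict(initial)
    for chunk in chunks:                      # each chunk is read exactly once
        for name, fold in queries.items():
            results[name] = fold(results[name], chunk)
    return results


storage = [[1.0, 2.0], [3.0, 4.0]]
queries = {
    "sum": lambda acc, chunk: acc + sum(chunk),
    "max": lambda acc, chunk: max(acc, max(chunk)),
}
results = shared_scan(read_chunks(storage), queries,
                      {"sum": 0.0, "max": float("-inf")})
```

Running the two queries separately would read four chunks in total; the shared pass reads two, which is the I/O saving the Green Computing slide refers to.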
  20. 20. Other Features <ul><li>High availability / automatic fail over </li></ul><ul><li>Auto config and auto self-healing </li></ul>
  21. 21. Green Computing <ul><li>Aggressive compression → less disk </li></ul><ul><li>Approximate results → stop computing early </li></ul><ul><li>Shared scans → share I/O </li></ul><ul><li>Scale out as you go → incremental provisioning </li></ul>
  22. 22. Science / Industry Needs <ul><li>Scale </li></ul><ul><li>Complex analytics </li></ul><ul><ul><li>time series, needle in haystack, group based </li></ul></ul><ul><li>Summary statistics @petascale </li></ul><ul><li>Arrays </li></ul><ul><li>Provenance </li></ul><ul><li>Uncertainty </li></ul><ul><li>Integration with statistical tools </li></ul>… all needed by industry
  23. 23. Outline <ul><li>Needs, challenges, today’s solutions and emerging trends </li></ul><ul><li>SciDB design and planned features </li></ul><ul><li>SciDB structure and timeline </li></ul>
  24. 24. Partnership - Roles <ul><li>Science and high-end commercial </li></ul><ul><ul><li>provide input, including usecases </li></ul></ul><ul><ul><li>provide some resources </li></ul></ul><ul><ul><li>review design, test the product </li></ul></ul><ul><li>DBMS brain trust </li></ul><ul><ul><li>design, oversee construction, perform research </li></ul></ul><ul><li>Non profit company </li></ul><ul><ul><li>manage the project </li></ul></ul><ul><ul><li>support resulting system </li></ul></ul>
  25. 25. Partnership – Current Players <ul><li>Science and high-end commercial </li></ul><ul><ul><li>LSST, PNNL, UCSB, LLNL </li></ul></ul><ul><ul><li>eBay, Vertica, Microsoft </li></ul></ul><ul><ul><li>lighthouse customers: LSST and eBay </li></ul></ul><ul><li>DBMS brain trust </li></ul><ul><ul><li>Michael Stonebraker, David DeWitt, Dave Maier, Jennifer Widom, Stan Zdonik, Sam Madden, Ugur Cetintemel, Magda Balazinska, Jignesh Patel </li></ul></ul><ul><li>Non-profit company </li></ul><ul><ul><li>SciDB, Inc. - 501(c)(3) foundation </li></ul></ul><ul><li>Plus… 5 developers working on 1st prototype </li></ul>
  26. 26. Have Usecases From <ul><li>astronomy (LSST) </li></ul><ul><li>industry (eBay) </li></ul><ul><li>genomics (LLNL) </li></ul><ul><li>climate (PNNL/ARM) </li></ul><ul><li>seismic (Emory Univ) </li></ul><ul><li>environmental observation & modeling (Oregon Univ) </li></ul><ul><li>earth remote sensing (UCSB) </li></ul><ul><li>fusion (LLNL/NIF) </li></ul><ul><li>WE NEED YOUR USECASES </li></ul>
  27. 27. Timeline <ul><li>Mid June ‘09 </li></ul><ul><ul><li>professional-looking scidb.org </li></ul></ul><ul><ul><li>start building user community </li></ul></ul><ul><li>Late August ’09 </li></ul><ul><ul><li>planned demo at VLDB </li></ul></ul><ul><ul><li>reach out to non-US communities through XLDB3 </li></ul></ul><ul><li>End of Q1’10 – alpha </li></ul><ul><li>End of Q4’10 – beta </li></ul>
  28. 28. Manpower <ul><li>All work so far in-kind </li></ul><ul><li>4.5 FTEs working on demo </li></ul><ul><ul><li>SLAC, MIT, UW, RAS </li></ul></ul><ul><li>Good chances to have funds available this FY to hire ~5 full time developers </li></ul><ul><li>Actively looking for more partners </li></ul><ul><ul><li>GET INVOLVED </li></ul></ul>
  29. 29. Summary <ul><li>Many commonalities within science and between science and industry </li></ul><ul><li>Existing off-the-shelf technologies inefficient for very large scale analytics </li></ul><ul><li>SciDB – new open source science DBMS </li></ul><ul><ul><li>Community realizes shared software infrastructure is good </li></ul></ul><ul><ul><li>Big lighthouse customers </li></ul></ul><ul><ul><li>Strong team </li></ul></ul><ul><ul><li>If successful, will enable unprecedented analyses at extreme scale </li></ul></ul>
  30. 30. Related Links <ul><li>http://scidb.org </li></ul><ul><li>http://www-conf.slac.stanford.edu/xldb07 </li></ul><ul><li>http://www-conf.slac.stanford.edu/xldb08 </li></ul><ul><li>http://www-conf.slac.stanford.edu/xldb09 </li></ul>
