Open Source Data Management System for Data-Intensive Scientific Analytics
Jacek Becla
San Diego Supercomputer Center
05/29/2009
Outline
- Needs, challenges, today's solutions and emerging trends
- SciDB design and planned features
- SciDB structure and timeline
Size Challenge
- Data set sizes grow dramatically
- Growth rate keeps increasing
- Implications:
  - Failures are routine
  - Provenance tracking is a must
  - Massive parallelization is a must
  - Full automation and self-adjustment are a must
Analytics Complexity
- More data varieties = more ways to analyze it
- Rapid growth of complexity of analytics:
  - Time series comparisons
  - N² and N³ correlations
  - Proximity and grouping-based searches
- Interactive exploration enables most science
- Data uncertainty matters
- Provenance is an integral part of analytics
- User annotations are important
- Ad-hoc integration of derived data with raw data desired
- True for science and industry
Today's Technologies
- Existing databases:
  - Most too monolithic
  - Expensive to scale
  - Expensive to provide high availability
  - Built for perfect schemas and clean data
  - Relational data model far from ideal for most projects
  - APIs far from ideal; intuitive interfaces desired
- Most very large systems shy away from databases
Today's Solutions
- Metadata in a lightweight database plus bulk data in files
  - BaBar, LHC, LCLS
- Bulk data stored as unstructured data in a database
  - NIF
- Raw data in files, derived data in a database
  - PanSTARRS, LSST (future projects)
- Complete (or mostly) home-grown systems
  - AT&T, Google, Yahoo, Amazon, Facebook
  - Most common solution
- All in a database
  - WalMart (very expensive)
  - eBay (very expensive, testing a new home-grown solution)
  - SDSS, bio, genomics (small-ish, single-server databases)
- Little reuse; roll-your-own mentality
Future
- Emerging trends:
  - Shared-nothing parallel databases
  - Lightweight, specialized components
  - On low-cost commodity hardware
  - Aggressive compression
- Several attempts to push the state of the art forward:
  - Aster Data, Vertica, ParAccel, EnterpriseDB, Greenplum, Netezza
- Some issues not addressed by anyone:
  - Arrays, provenance, uncertainty, partial results, intuitive interfaces
XLDB Activities
- 2007
  - Identify trends and roadblocks
  - Bridge the gaps
- 2008
  - Complex analytics
  - Bridge the gaps
- 2009
  - Reach out to non-US communities
  - Connect with remaining sciences
Outline
- Needs, challenges, today's solutions and emerging trends
- SciDB design and planned features
- SciDB structure and timeline
New Open Source Science Database System
- Philosophy:
  - Address common scientific needs
  - Geared for analytics, not OLTP
- Key requirements:
  - Open source
  - Commercial quality
  - Peta-scale
Data Model - Types
- Scalars
  - Standard base types (int, float, string, date, ...)
  - Geospatial (3-D points, lines, polygons, boxes)
- Multi-dimensional arrays
  - Regular or irregular
  - Any number of dimensions
  - Nesting allowed
  - Dense or sparse
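A rough sketch of the dense-versus-sparse distinction in plain Python (not SciDB syntax; array contents, names, and sizes are made up):

    # Dense 2-D array: every cell is materialized, e.g. a 4x4 image tile.
    dense = [[0.0 for _ in range(4)] for _ in range(4)]
    dense[2][3] = 17.5

    # Sparse 2-D array: only non-empty cells are stored, keyed by (row, col).
    # A nested (record-valued) cell holds more than one attribute.
    sparse = {}
    sparse[(1024, 2048)] = {"flux": 3.2, "err": 0.1}

    print(dense[2][3], sparse[(1024, 2048)]["flux"])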
Data Model - Operators
- Native (built-in)
  - Array-SQL (filter, project, group_by, aggregation, ...)
  - Array (pivot, regridding, reshaping, transformations, nest, flatten, ...)
- User-defined functions (UDFs)
  - Postgres-style, coded in C++
  - Native operators coded as UDFs
  - All UDFs treated equally; optimizer might do more with built-in UDFs
  - Two kinds: per cell, per array
  - All UDFs executable in parallel
- Paradigm: primitives for data-heavy compute
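The planned UDFs are Postgres-style and written in C++; the Python sketch below only illustrates the two UDF kinds conceptually (function names and data are hypothetical):

    import math

    # Per-cell UDF: applied independently to each cell, so it parallelizes trivially.
    def to_magnitude(flux):
        return -2.5 * math.log10(flux) if flux > 0 else None

    # Per-array UDF: sees a whole (sub)array at once, e.g. a simple regridding step.
    def downsample_2x(array_2d):
        # Average non-overlapping 2x2 blocks; assumes even dimensions.
        n, m = len(array_2d), len(array_2d[0])
        return [[sum(array_2d[i + di][j + dj] for di in (0, 1) for dj in (0, 1)) / 4.0
                 for j in range(0, m, 2)]
                for i in range(0, n, 2)]

    tile = [[1.0, 2.0], [3.0, 4.0]]
    print([to_magnitude(x) for x in tile[0]])
    print(downsample_2x(tile))   # [[2.5]]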
Data Model – Match To Science Needs
- Astronomy
- Earth and environmental sciences, including oceanography, remote sensing, seismology
- Bio-medical imaging
- Fusion
- Bio (needs sequences)
- Chemistry (needs network structures)
Query Language
- "Parse-tree" representation of operations
- "Bindings" to C++, Python, IDL, ... (TBD)
- Tight integration with popular statistical tools like R or MATLAB
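A toy illustration, assuming a hypothetical Python binding, of what a parse-tree representation of operations could look like:

    # Hypothetical binding: each node names an operator plus its arguments and input.
    query = {
        "op": "group_by",
        "keys": ["band"],
        "aggregates": [("avg", "flux")],
        "input": {
            "op": "filter",
            "predicate": "flux > 20",
            "input": {"op": "scan", "array": "Observations"},
        },
    }

    def show(node, depth=0):
        """Pretty-print the operation tree, leaf-most operator last."""
        args = {k: v for k, v in node.items() if k not in ("op", "input")}
        print("  " * depth + node["op"], args)
        if "input" in node:
            show(node["input"], depth + 1)

    show(query)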
Storage Model
- Granularity
  - "Chunked" arrays
  - Chunk = unit of storage, buffering and compression
  - Chunks partitioned across nodes
- Parallel model
  - Shared-nothing parallel DBMS
  - Runs on a grid of computers; uniformity not required
  - Data exchanged between nodes as needed
- Format
  - Loaded or in-situ modes
  - In-situ: limited capabilities
  - Adaptors to translate popular external formats (like HDF5) on the fly
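A minimal sketch of chunking and shared-nothing placement (chunk size, node count, and the hash scheme are assumptions, not the actual design): a cell maps to a chunk, and each chunk is assigned to a node, e.g. by hashing its chunk coordinates.

    CHUNK = 1000   # cells per chunk along each dimension (assumed)
    NODES = 8      # number of shared-nothing nodes (assumed)

    def chunk_coords(i, j):
        """Which chunk a cell (i, j) falls into."""
        return (i // CHUNK, j // CHUNK)

    def node_for_chunk(ci, cj):
        """Place a chunk on a node: simple hash partitioning across the grid."""
        return hash((ci, cj)) % NODES

    cell = (123456, 7890)
    ci, cj = chunk_coords(*cell)
    print("cell", cell, "-> chunk", (ci, cj), "-> node", node_for_chunk(ci, cj))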
Versioning
- No-overwrite storage
- Named versions
- Delta compression
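A rough illustration of no-overwrite storage with delta compression (purely conceptual, not SciDB's on-disk format): each named version stores only the cells that changed relative to its parent, and reading a version replays the deltas.

    # Base version stores the full (sparse) array; later named versions store only deltas.
    versions = {
        "v1": {"parent": None, "cells": {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 5.0}},
        "v2": {"parent": "v1", "cells": {(0, 1): 2.5}},   # one cell updated
        "v3": {"parent": "v2", "cells": {(2, 2): 9.0}},   # one cell added
    }

    def materialize(name):
        """Rebuild a version by replaying deltas on top of the base; nothing is overwritten."""
        chain = []
        while name is not None:
            chain.append(versions[name]["cells"])
            name = versions[name]["parent"]
        result = {}
        for delta in reversed(chain):   # base first, newest delta last
            result.update(delta)
        return result

    print(materialize("v3"))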
Provenance
- Need to answer:
  - What operations led to the creation of a given element
  - What operations used this element
  - What data elements were used as input to this operation
  - What data elements were created as output from this operation
- Natively supported
  - Easy if the workflow runs in SciDB
  - Loading external provenance
- Efficient querying
  - No-overwrite storage + delta compression helps
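A minimal sketch of the four provenance questions above, answered from a simple operation log (operation and element names are hypothetical):

    # Each log entry records one operation with its input and output elements.
    log = [
        {"op": "calibrate", "inputs": ["raw_17"], "outputs": ["cal_17"]},
        {"op": "coadd", "inputs": ["cal_17", "cal_18"], "outputs": ["img_9"]},
    ]

    def created_by(element):
        """Which operations led to the creation of the given element."""
        return [e["op"] for e in log if element in e["outputs"]]

    def used_by(element):
        """Which operations used the given element as input."""
        return [e["op"] for e in log if element in e["inputs"]]

    def inputs_of(op):
        return [x for e in log if e["op"] == op for x in e["inputs"]]

    def outputs_of(op):
        return [x for e in log if e["op"] == op for x in e["outputs"]]

    print(created_by("img_9"), used_by("cal_17"), inputs_of("coadd"), outputs_of("calibrate"))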
Uncertainty
- Error bars carried along in the computation
- Initial version:
  - Interval arithmetic
  - Uniform error distribution
- More complex models
  - Usually science-specific
  - Might implement some in the future if there are enough commonalities
- Approximate results
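Interval arithmetic can be sketched in a few lines: each value carries a [lo, hi] range and operators combine the bounds (a toy illustration, not the planned implementation):

    def interval_add(a, b):
        return (a[0] + b[0], a[1] + b[1])

    def interval_mul(a, b):
        p = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
        return (min(p), max(p))

    flux = (9.8, 10.2)   # a value of 10 with +/- 0.2 error bars
    gain = (1.9, 2.1)
    print(interval_add(flux, flux))   # (19.6, 20.4)
    print(interval_mul(flux, gain))   # roughly (18.62, 21.42): error bars widen through the computation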
Resource Management
- Query scheduling, including shared scans ("train scheduling")
- Query progress
- Support for long-running queries (cancel/stop/restart)
- Pre-execution query cost estimates
- Per-user and per-query limits
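A toy sketch of a shared scan ("train scheduling"): several queries attach to one pass over the data instead of each scanning separately (purely illustrative, hypothetical names):

    def shared_scan(chunks, queries):
        """Read each chunk once and feed it to every attached query."""
        results = {name: [] for name, _ in queries}
        for chunk in chunks:                    # one pass over the data: "the train"
            for name, predicate in queries:     # all attached queries ride along
                results[name].extend(x for x in chunk if predicate(x))
        return results

    chunks = [[1, 5, 9], [12, 3, 20]]
    queries = [("q_small", lambda x: x < 5),
               ("q_large", lambda x: x >= 10)]
    print(shared_scan(chunks, queries))   # {'q_small': [1, 3], 'q_large': [12, 20]}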
Other Features
- High availability / automatic failover
- Auto-configuration and self-healing
Green Computing
- Aggressive compression → less disk
- Approximate results → stop computing early
- Shared scans → share I/O
- Scale out as you go → incremental provisioning
Science / Industry Needs
- Scale
- Complex analytics: time series, needle in haystack, group-based
- Summary statistics at peta-scale
- Arrays
- Provenance
- Uncertainty
- Integration with statistical tools
- ... all needed by industry as well
Outline
- Needs, challenges, today's solutions and emerging trends
- SciDB design and planned features
- SciDB structure and timeline
Partnership - Roles
- Science and high-end commercial partners
  - Provide input, including use cases
  - Provide some resources
  - Review the design, test the product
- DBMS brain trust
  - Design, oversee construction, perform research
- Non-profit company
  - Manage the project
  - Support the resulting system
Partnership – Current Players
- Science and high-end commercial
  - LSST, PNNL, UCSB, LLNL
  - eBay, Vertica, Microsoft
  - Lighthouse customers: LSST and eBay
- DBMS brain trust
  - Michael Stonebraker, David DeWitt, Dave Maier, Jennifer Widom, Stan Zdonik, Sam Madden, Ugur Cetintemel, Magda Balazinska, Jignesh Patel
- Non-profit company
  - SciDB, Inc. - a 501(c)(3) foundation
- Plus... 5 developers working on the 1st prototype
Have Use Cases From
- Astronomy (LSST)
- Industry (eBay)
- Genomics (LLNL)
- Climate (PNNL/ARM)
- Seismic (Emory Univ)
- Environmental observation & modeling (Oregon Univ)
- Earth remote sensing (UCSB)
- Fusion (LLNL/NIF)

WE NEED YOUR USE CASES
Timeline
- Mid June '09
  - Professional-looking scidb.org
  - Start building the user community
- Late August '09
  - Planned demo at VLDB
  - Reach out to non-US communities through XLDB3
- End of Q1 '10 - alpha
- End of Q4 '10 - beta
Manpower
- All work so far is in-kind
- 4.5 FTEs working on the demo
  - SLAC, MIT, UW, RAS
- Good chances of having funds available this FY to hire ~5 full-time developers
- Actively looking for more partners

GET INVOLVED
Summary
- Many commonalities within science and between science and industry
- Existing off-the-shelf technologies are inefficient for very-large-scale analytics
- SciDB - a new open source science DBMS
  - The community realizes shared software infrastructure is good
  - Big lighthouse customers
  - Strong team
- If successful, will enable unprecedented analyses at extreme scale
Related Links
- http://scidb.org
- http://www-conf.slac.stanford.edu/xldb07
- http://www-conf.slac.stanford.edu/xldb08
- http://www-conf.slac.stanford.edu/xldb09
