Open Source Data Management System for Data-Intensive Scientific Analytics
Jacek Becla
San Diego Supercomputer Center
05/29/2009
Outline
- Needs, challenges, today's solutions and emerging trends
- SciDB design and planned features
- SciDB structure and timeline
Size Challenge
- Data set sizes grow dramatically
- Growth rate keeps increasing
- Implications:
  - Failures are routine
  - Provenance tracking is a must
  - Massive parallelization is a must
  - Full automation and self-adjustment are a must
Analytics Complexity
- More data varieties = more ways to analyze it
- Rapid growth of complexity of analytics:
  - Time series comparisons
  - N² and N³ correlations
  - Proximity and grouping-based searches
- Interactive exploration enables most science
- Data uncertainty matters
- Provenance is an integral part of analytics
- User annotations are important
- Ad-hoc integration of derived data with raw data desired
- True for science and industry
Today's Technologies
- Existing databases:
  - Most too monolithic
  - Expensive to scale
  - Expensive to provide high availability
  - Built for perfect schemas and clean data
  - Relational data model far from ideal for most projects
  - APIs far from ideal; intuitive interfaces desired
- Most very large systems shy away from databases
Today's Solutions
- Metadata in a lightweight database plus bulk data in files
  - BaBar, LHC, LCLS
- Bulk data stored as unstructured data in a database
  - NIF
- Raw data in files, derived data in a database
  - PanSTARRS, LSST (future projects)
- Complete (or mostly) home-grown systems
  - AT&T, Google, Yahoo, Amazon, Facebook
  - Most common solution
- All in a database
  - WalMart (very expensive)
  - eBay (very expensive, testing a new home-grown solution)
  - SDSS, bio, genomics (small-ish, single-server databases)
- Little reuse; roll-your-own mentality
Future
- Emerging trends:
  - Shared-nothing parallel databases
  - Lightweight, specialized components
  - On low-cost commodity hardware
  - Aggressive compression
- Several attempts to push the state of the art forward:
  - Aster Data, Vertica, ParAccel, EnterpriseDB, Greenplum, Netezza
- Some issues not addressed by anyone:
  - Arrays, provenance, uncertainty, partial results, intuitive interfaces
XLDB Activities
- 2007
  - Identify trends and roadblocks
  - Bridge the gaps
- 2008
  - Complex analytics
  - Bridge the gaps
- 2009
  - Reach out to non-US communities
  - Connect with remaining sciences
Outline
- Needs, challenges, today's solutions and emerging trends
- SciDB design and planned features
- SciDB structure and timeline
New Open Source Science Database System
- Philosophy:
  - Address common scientific needs
  - Geared for analytics, not OLTP
- Key requirements:
  - Open source
  - Commercial quality
  - Peta-scale
Data Model - Types
- Scalars
  - Standard base types (int, float, string, date, ...)
  - Geospatial (3-D points, lines, polygons, boxes)
- Multi-dimensional arrays
  - Regular or irregular
  - Any number of dimensions
  - Nesting allowed
  - Dense or sparse
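A rough sketch of the dense-versus-sparse distinction in plain Python (not SciDB syntax; array contents, names, and sizes are made up):

    # Dense 2-D array: every cell is materialized, e.g. a 4x4 image tile.
    dense = [[0.0 for _ in range(4)] for _ in range(4)]
    dense[2][3] = 17.5

    # Sparse 2-D array: only non-empty cells are stored, keyed by (row, col).
    # A nested (record-valued) cell holds more than one attribute.
    sparse = {}
    sparse[(1024, 2048)] = {"flux": 3.2, "err": 0.1}

    print(dense[2][3], sparse[(1024, 2048)]["flux"])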
Data Model - Operators
- Native (built-in)
  - Array-SQL (filter, project, group_by, aggregation, ...)
  - Array (pivot, regridding, reshaping, transformations, nest, flatten, ...)
- User-defined functions (UDFs)
  - Postgres-style, coded in C++
  - Native operators coded as UDFs
  - All UDFs treated equally; optimizer might do more with built-in UDFs
  - Two kinds: per cell, per array
  - All UDFs executable in parallel
- Paradigm: primitives for data-heavy compute
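The planned UDFs are Postgres-style and written in C++; the Python sketch below only illustrates the two UDF kinds conceptually (function names and data are hypothetical):

    import math

    # Per-cell UDF: applied independently to each cell, so it parallelizes trivially.
    def to_magnitude(flux):
        return -2.5 * math.log10(flux) if flux > 0 else None

    # Per-array UDF: sees a whole (sub)array at once, e.g. a simple regridding step.
    def downsample_2x(array_2d):
        # Average non-overlapping 2x2 blocks; assumes even dimensions.
        n, m = len(array_2d), len(array_2d[0])
        return [[sum(array_2d[i + di][j + dj] for di in (0, 1) for dj in (0, 1)) / 4.0
                 for j in range(0, m, 2)]
                for i in range(0, n, 2)]

    tile = [[1.0, 2.0], [3.0, 4.0]]
    print([to_magnitude(x) for x in tile[0]])
    print(downsample_2x(tile))   # [[2.5]]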
Data Model – Match To Science Needs
- Astronomy
- Earth and environmental sciences, including oceanography, remote sensing, seismology
- Bio-medical imaging
- Fusion
- Bio (needs sequences)
- Chemistry (needs network structures)
Query Language
- "Parse-tree" representation of operations
- "Bindings" to C++, Python, IDL, ... (TBD)
- Tight integration with popular statistical tools like R or MATLAB
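A toy illustration, assuming a hypothetical Python binding, of what a parse-tree representation of operations could look like:

    # Hypothetical binding: each node names an operator plus its arguments and input.
    query = {
        "op": "group_by",
        "keys": ["band"],
        "aggregates": [("avg", "flux")],
        "input": {
            "op": "filter",
            "predicate": "flux > 20",
            "input": {"op": "scan", "array": "Observations"},
        },
    }

    def show(node, depth=0):
        """Pretty-print the operation tree, leaf-most operator last."""
        args = {k: v for k, v in node.items() if k not in ("op", "input")}
        print("  " * depth + node["op"], args)
        if "input" in node:
            show(node["input"], depth + 1)

    show(query)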
Storage Model
- Granularity
  - "Chunked" arrays
  - Chunk = unit of storage, buffering and compression
  - Chunks partitioned across nodes
- Parallel model
  - Shared-nothing parallel DBMS
  - Runs on a grid of computers; uniformity not required
  - Data exchanged between nodes as needed
- Format
  - Loaded or in-situ modes
  - In-situ: limited capabilities
  - Adaptors to translate popular external formats (like HDF5) on the fly
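A minimal sketch of chunking and shared-nothing placement (chunk size, node count, and the hash scheme are assumptions, not the actual design): a cell maps to a chunk, and each chunk is assigned to a node, e.g. by hashing its chunk coordinates.

    CHUNK = 1000   # cells per chunk along each dimension (assumed)
    NODES = 8      # number of shared-nothing nodes (assumed)

    def chunk_coords(i, j):
        """Which chunk a cell (i, j) falls into."""
        return (i // CHUNK, j // CHUNK)

    def node_for_chunk(ci, cj):
        """Place a chunk on a node: simple hash partitioning across the grid."""
        return hash((ci, cj)) % NODES

    cell = (123456, 7890)
    ci, cj = chunk_coords(*cell)
    print("cell", cell, "-> chunk", (ci, cj), "-> node", node_for_chunk(ci, cj))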
Versioning
- No-overwrite storage
- Named versions
- Delta compression
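A rough illustration of no-overwrite storage with delta compression (purely conceptual, not SciDB's on-disk format): each named version stores only the cells that changed relative to its parent, and reading a version replays the deltas.

    # Base version stores the full (sparse) array; later named versions store only deltas.
    versions = {
        "v1": {"parent": None, "cells": {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 5.0}},
        "v2": {"parent": "v1", "cells": {(0, 1): 2.5}},   # one cell updated
        "v3": {"parent": "v2", "cells": {(2, 2): 9.0}},   # one cell added
    }

    def materialize(name):
        """Rebuild a version by replaying deltas on top of the base; nothing is overwritten."""
        chain = []
        while name is not None:
            chain.append(versions[name]["cells"])
            name = versions[name]["parent"]
        result = {}
        for delta in reversed(chain):   # base first, newest delta last
            result.update(delta)
        return result

    print(materialize("v3"))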
Provenance
- Need to answer:
  - What operations led to the creation of a given element
  - What operations used this element
  - What data elements were used as input to this operation
  - What data elements were created as output from this operation
- Natively supported
  - Easy if the workflow runs in SciDB
  - Loading external provenance
- Efficient querying
  - No-overwrite storage + delta compression helps
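A minimal sketch of the four provenance questions above, answered from a simple operation log (operation and element names are hypothetical):

    # Each log entry records one operation with its input and output elements.
    log = [
        {"op": "calibrate", "inputs": ["raw_17"], "outputs": ["cal_17"]},
        {"op": "coadd", "inputs": ["cal_17", "cal_18"], "outputs": ["img_9"]},
    ]

    def created_by(element):
        """Which operations led to the creation of the given element."""
        return [e["op"] for e in log if element in e["outputs"]]

    def used_by(element):
        """Which operations used the given element as input."""
        return [e["op"] for e in log if element in e["inputs"]]

    def inputs_of(op):
        return [x for e in log if e["op"] == op for x in e["inputs"]]

    def outputs_of(op):
        return [x for e in log if e["op"] == op for x in e["outputs"]]

    print(created_by("img_9"), used_by("cal_17"), inputs_of("coadd"), outputs_of("calibrate"))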
Uncertainty
- Error bars carried along in the computation
- Initial version:
  - Interval arithmetic
  - Uniform error distribution
- More complex models
  - Usually science-specific
  - Might implement some in the future if there are enough commonalities
- Approximate results
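Interval arithmetic can be sketched in a few lines: each value carries a [lo, hi] range and operators combine the bounds (a toy illustration, not the planned implementation):

    def interval_add(a, b):
        return (a[0] + b[0], a[1] + b[1])

    def interval_mul(a, b):
        p = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
        return (min(p), max(p))

    flux = (9.8, 10.2)   # a value of 10 with +/- 0.2 error bars
    gain = (1.9, 2.1)
    print(interval_add(flux, flux))   # (19.6, 20.4)
    print(interval_mul(flux, gain))   # roughly (18.62, 21.42): error bars widen through the computation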
Resource Management
- Query scheduling, including shared scans ("train scheduling")
- Query progress
- Support for long-running queries (cancel/stop/restart)
- Pre-execution query cost estimates
- Per-user and per-query limits
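A toy sketch of a shared scan ("train scheduling"): several queries attach to one pass over the data instead of each scanning separately (purely illustrative, hypothetical names):

    def shared_scan(chunks, queries):
        """Read each chunk once and feed it to every attached query."""
        results = {name: [] for name, _ in queries}
        for chunk in chunks:                    # one pass over the data: "the train"
            for name, predicate in queries:     # all attached queries ride along
                results[name].extend(x for x in chunk if predicate(x))
        return results

    chunks = [[1, 5, 9], [12, 3, 20]]
    queries = [("q_small", lambda x: x < 5),
               ("q_large", lambda x: x >= 10)]
    print(shared_scan(chunks, queries))   # {'q_small': [1, 3], 'q_large': [12, 20]}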
Other Features
- High availability / automatic failover
- Auto-configuration and self-healing
Green Computing
- Aggressive compression → less disk
- Approximate results → stop computing early
- Shared scans → share I/O
- Scale out as you go → incremental provisioning
Science / Industry Needs
- Scale
- Complex analytics: time series, needle in haystack, group-based
- Summary statistics at peta-scale
- Arrays
- Provenance
- Uncertainty
- Integration with statistical tools
- ... all needed by industry as well
Outline
- Needs, challenges, today's solutions and emerging trends
- SciDB design and planned features
- SciDB structure and timeline
Partnership - Roles
- Science and high-end commercial partners
  - Provide input, including use cases
  - Provide some resources
  - Review the design, test the product
- DBMS brain trust
  - Design, oversee construction, perform research
- Non-profit company
  - Manage the project
  - Support the resulting system
Partnership – Current Players
- Science and high-end commercial
  - LSST, PNNL, UCSB, LLNL
  - eBay, Vertica, Microsoft
  - Lighthouse customers: LSST and eBay
- DBMS brain trust
  - Michael Stonebraker, David DeWitt, Dave Maier, Jennifer Widom, Stan Zdonik, Sam Madden, Ugur Cetintemel, Magda Balazinska, Jignesh Patel
- Non-profit company
  - SciDB, Inc. - a 501(c)(3) foundation
- Plus... 5 developers working on the 1st prototype
Have Use Cases From
- Astronomy (LSST)
- Industry (eBay)
- Genomics (LLNL)
- Climate (PNNL/ARM)
- Seismic (Emory Univ)
- Environmental observation & modeling (Oregon Univ)
- Earth remote sensing (UCSB)
- Fusion (LLNL/NIF)

WE NEED YOUR USE CASES
Timeline
- Mid June '09
  - Professional-looking scidb.org
  - Start building the user community
- Late August '09
  - Planned demo at VLDB
  - Reach out to non-US communities through XLDB3
- End of Q1 '10 - alpha
- End of Q4 '10 - beta
Manpower
- All work so far is in-kind
- 4.5 FTEs working on the demo
  - SLAC, MIT, UW, RAS
- Good chances of having funds available this FY to hire ~5 full-time developers
- Actively looking for more partners

GET INVOLVED
Summary
- Many commonalities within science and between science and industry
- Existing off-the-shelf technologies are inefficient for very-large-scale analytics
- SciDB - a new open source science DBMS
  - The community realizes shared software infrastructure is good
  - Big lighthouse customers
  - Strong team
- If successful, will enable unprecedented analyses at extreme scale
Related Links
- http://scidb.org
- http://www-conf.slac.stanford.edu/xldb07
- http://www-conf.slac.stanford.edu/xldb08
- http://www-conf.slac.stanford.edu/xldb09
