SciDB: Open Source Data Management System for Data-Intensive Scientific Analytics

    SciDB: Open Source Data Management System for Data-Intensive Scientific Analytics - Presentation Transcript

    1. Open Source Data Management System for Data-Intensive Scientific Analytics. Jacek Becla, San Diego Supercomputer Center, 05/29/2009
    2. Outline
      • Needs, challenges, today’s solutions and emerging trends
      • SciDB design and planned features
      • SciDB structure and timeline
    3. Size Challenge
      • Data set sizes grow dramatically
      • Growth rate increases
      • Implications
        • Failures are routine
        • Provenance tracking is a must
        • Massive parallelization is a must
        • Full automation, self-adjustment is a must
    4. Analytics Complexity
      • More data varieties = more ways to analyze it
      • Rapid growth of complexity of analytics
        • Time series comparisons
        • N² and N³ correlations
        • Proximity and grouping-based searches
      • Interactive exploration enables most science
      • Data uncertainty matters
      • Provenance is an integral part of analytics
      • User annotations are important
      • Ad-hoc integration of derived data with raw data desired
      • True for science and industry
    5. Today’s Technologies
      • Existing databases
        • Most too monolithic
        • Expensive to scale
        • Expensive to provide high availability
        • Built for perfect schemas and clean data
        • Relational data model far from ideal for most projects
        • APIs far from ideal
          • Intuitive interfaces desired
      • Most very large systems shy away from databases
    6. Today’s Solutions
      • Metadata in lightweight database plus bulk data in files
        • BaBar, LHC, LCLS
      • Bulk data stored as unstructured data in database
        • NIF
      • Raw data in files, derived data in database
        • PanSTARRS, LSST (future projects)
      • Complete (or mostly) home-grown systems
        • ATT, Google, Yahoo, Amazon, Facebook
        • Most common solution
      • All in database
        • WalMart (very expensive)
        • eBay (very expensive, testing new home grown solution)
        • SDSS, bio, genomics (small-ish, single-server databases)
      • Little reusing, roll-your-own mentality
    7. Future
      • Emerging trends
        • Shared nothing parallel database
        • Lightweight, specialized components
        • On low-cost commodity hardware
        • Aggressive compression
      • Several attempts to push state-of-the-art forward
        • Aster Data, Vertica, ParAccel, EnterpriseDB, Greenplum, Netezza
      • Some issues not addressed by anyone
        • Arrays, provenance, uncertainty, partial results, intuitive interfaces
    8. XLDB Activities
      • 2007
        • Identify trends, roadblocks
        • Bridge the gaps
      • 2008
        • Complex analytics
        • Bridge the gaps
      • 2009
        • Reach out to non-US communities
        • Connect with remaining sciences
    9. Outline
      • Needs, challenges, today’s solutions and emerging trends
      • SciDB design and planned features
      • SciDB structure and timeline
      New Open Source Science Database System
      • Philosophy
        • address common scientific needs
        • geared for analytics, not OLTP
      • Key requirements
        • open source
        • commercial quality
        • peta-scale
    10. Data Model - Types
      • Scalars
        • standard base types (int, float, string, date, …)
        • geospatial (3-D points, lines, polygons, boxes)
      • Multi-d arrays
        • regular or irregular
        • any number of dimensions
        • nesting allowed
        • dense or sparse
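
The dense/sparse distinction above can be illustrated with a small sketch. This is not SciDB code: the class names `DenseArray` and `SparseArray` and the `stored_cells` helper are hypothetical, chosen only to show that both kinds of array answer the same coordinate lookups while storing very different numbers of cells.

```python
class DenseArray:
    """Hypothetical dense array: every cell is stored in a flat row-major list."""

    def __init__(self, shape, fill=0.0):
        self.shape = tuple(shape)
        size = 1
        for extent in self.shape:
            size *= extent
        self._cells = [fill] * size

    def _offset(self, coords):
        # Convert N-dimensional coordinates to a row-major offset.
        if len(coords) != len(self.shape):
            raise IndexError("wrong number of dimensions")
        offset = 0
        for c, extent in zip(coords, self.shape):
            if not 0 <= c < extent:
                raise IndexError(coords)
            offset = offset * extent + c
        return offset

    def __getitem__(self, coords):
        return self._cells[self._offset(coords)]

    def __setitem__(self, coords, value):
        self._cells[self._offset(coords)] = value

    def stored_cells(self):
        return len(self._cells)


class SparseArray:
    """Hypothetical sparse array: only cells that were written are stored."""

    def __init__(self, shape, empty=None):
        self.shape = tuple(shape)
        self._empty = empty
        self._cells = {}

    def __getitem__(self, coords):
        return self._cells.get(tuple(coords), self._empty)

    def __setitem__(self, coords, value):
        self._cells[tuple(coords)] = value

    def stored_cells(self):
        return len(self._cells)


dense = DenseArray((4, 3, 2))
sparse = SparseArray((4, 3, 2))
dense[2, 1, 0] = 7.5
sparse[2, 1, 0] = 7.5
```

After one write, the dense array still holds all 24 cells while the sparse one holds a single entry, which is why the choice matters for mostly empty scientific grids.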
    11. Data Model - Operators
      • Native (built-in)
        • array-sql (filter, project, group_by, aggregation, …)
        • array (pivot, regridding, reshaping, transformations, nest, flatten, …)
      • User-defined functions
        • Postgres-style
        • coded in C++
      • Native operators coded as UDFs
      • All UDFs treated equally
        • optimizer might do more with built-in UDFs
      • Two kinds: per cell, per array
      • All UDFs executable in parallel
      • Paradigm: primitives for data-heavy compute
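
A rough illustration of the two UDF kinds named on this slide, per cell and per array, both run in parallel. The slide gives no API, so the function names `apply_per_cell` and `apply_per_array` are hypothetical, and treating a per-array function as per-chunk partial results plus a final combine step is an assumption made for this sketch. A thread pool stands in for the nodes of a parallel system.

```python
from concurrent.futures import ThreadPoolExecutor


def apply_per_cell(chunks, cell_fn):
    """Apply cell_fn to every cell; each chunk is processed independently."""
    def run(chunk):
        return [cell_fn(value) for value in chunk]

    with ThreadPoolExecutor() as pool:
        # pool.map preserves the order of the input chunks.
        return list(pool.map(run, chunks))


def apply_per_array(chunks, partial_fn, combine_fn):
    """Apply an array-level function as per-chunk partials plus one combine."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_fn, chunks))
    return combine_fn(partials)


# A one-dimensional array split into three chunks (illustrative data).
chunks = [[1.0, 2.0], [3.0, 4.0], [5.0]]

squared = apply_per_cell(chunks, lambda v: v * v)   # per-cell UDF
total = apply_per_array(chunks, sum, sum)           # per-array UDF (sum)
```

Per-cell functions need no coordination between chunks, whereas per-array functions only parallelize this way when they can be decomposed into partial results, as a sum can.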
    12. Data Model – Match To Science Needs
      • astronomy
      • earth and environmental sciences, including oceanography, remote sensing, seismology
      • bio-medical imaging
      • fusion
      • bio (need sequences)
      • chemistry (need network structures)
    13. Query Language
      • “Parse-tree” representation of operations
      • “Bindings” to C++, Python, IDL, ... (TBD)
      • Tight integration with popular statistical tools like R or MATLAB
    14. Storage Model
      • Granularity
        • “Chunked” arrays
          • Chunk = unit of storage, buffering and compression
        • Chunks partitioned across nodes
      • Parallel model
        • Shared-nothing parallel DBMS
          • runs on a grid of computers, uniformity not required
        • Data exchanged between nodes as needed
      • Format
        • Loaded or in-situ modes
          • in-situ: limited capabilities
        • Adaptors to translate external popular formats (like HDF5) on the fly
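
To make the chunking idea concrete, here is a minimal sketch of mapping cells to chunks and chunks to nodes. The slide does not say how chunks are distributed, so the round-robin placement in `node_of` is purely an assumption, and all names and sizes are illustrative.

```python
from itertools import product


def chunk_of(coords, chunk_shape):
    """Return the index of the chunk that holds the cell at coords."""
    return tuple(c // s for c, s in zip(coords, chunk_shape))


def node_of(chunk_index, chunks_per_dim, num_nodes):
    """Assumed placement: round-robin over the row-major chunk number."""
    linear = 0
    for idx, extent in zip(chunk_index, chunks_per_dim):
        linear = linear * extent + idx
    return linear % num_nodes


array_shape = (8, 6)   # illustrative 2-D array
chunk_shape = (4, 3)   # each chunk covers 4 x 3 cells
num_nodes = 3

# Number of chunks along each dimension (ceiling division).
chunks_per_dim = tuple(-(-a // c) for a, c in zip(array_shape, chunk_shape))

# Which node stores each chunk.
placement = {
    idx: node_of(idx, chunks_per_dim, num_nodes)
    for idx in product(*(range(n) for n in chunks_per_dim))
}
```

With these numbers the array splits into 2 x 2 chunks spread over three nodes, and a lookup of cell (5, 2) resolves to chunk (1, 0) without touching any other chunk.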
    15. Versioning
      • No overwrite storage
      • Named versions
      • Delta compression
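
The three bullets above fit together naturally: if nothing is overwritten, each named version can be stored as the difference from the previous one. The sketch below shows the simplest form of that idea. `VersionedArray` is a hypothetical class, not a SciDB interface, and it stores plain cell-level diffs rather than any particular compression scheme.

```python
class VersionedArray:
    """Toy no-overwrite store: a base array plus one delta per named version."""

    def __init__(self, base):
        self._base = list(base)
        self._order = []     # version names, oldest first
        self._deltas = {}    # name -> {cell index: new value}

    def commit(self, name, new_values):
        """Record a named version as the cells that differ from the latest one."""
        if name in self._deltas:
            raise ValueError(f"version {name!r} already exists")
        latest = self.read(self._order[-1]) if self._order else self._base
        if len(new_values) != len(latest):
            raise ValueError("this sketch keeps the array length fixed")
        self._deltas[name] = {
            i: new
            for i, (old, new) in enumerate(zip(latest, new_values))
            if old != new
        }
        self._order.append(name)

    def read(self, name):
        """Rebuild a version by replaying deltas up to and including it."""
        values = list(self._base)
        last = self._order.index(name)
        for version in self._order[: last + 1]:
            for i, value in self._deltas[version].items():
                values[i] = value
        return values

    def delta_size(self, name):
        return len(self._deltas[name])


arr = VersionedArray([1, 2, 3, 4])
arr.commit("calibrated", [1, 2, 30, 4])
arr.commit("cleaned", [0, 2, 30, 4])
```

Both versions stay readable by name, and each one costs only as many stored cells as actually changed.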
    16. Provenance
      • Need:
        • what operations led to creation of given element
        • what operations used this element
        • what data elements were used as input to this operation
        • what data elements were created as output from this operation
      • Natively supported
        • easy if workflow in SciDB
      • Loading external provenance
      • Efficient querying
      • No-overwrite + delta compression helps
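
The four questions under "Need" can be answered from a simple log of operations with their inputs and outputs. The following is a minimal sketch under the assumption that every element is created by exactly one operation, which no-overwrite storage makes plausible. `ProvenanceLog` and its method names are hypothetical.

```python
from collections import defaultdict


class ProvenanceLog:
    """Toy provenance store indexed both by operation and by data element."""

    def __init__(self):
        self._inputs = {}                  # op -> set of input elements
        self._outputs = {}                 # op -> set of output elements
        self._created_by = {}              # element -> op that created it
        self._used_by = defaultdict(set)   # element -> ops that read it

    def record(self, op, inputs, outputs):
        self._inputs[op] = set(inputs)
        self._outputs[op] = set(outputs)
        for element in inputs:
            self._used_by[element].add(op)
        for element in outputs:
            self._created_by[element] = op

    def lineage(self, element):
        """Operations that led to the creation of element, transitively."""
        ops, pending = set(), [element]
        while pending:
            op = self._created_by.get(pending.pop())
            if op is not None and op not in ops:
                ops.add(op)
                pending.extend(self._inputs[op])
        return ops

    def used_by(self, element):
        """Operations that used this element as input."""
        return set(self._used_by.get(element, ()))

    def inputs_of(self, op):
        """Elements used as input to this operation."""
        return set(self._inputs[op])

    def outputs_of(self, op):
        """Elements created as output from this operation."""
        return set(self._outputs[op])


log = ProvenanceLog()
log.record("calibrate", inputs=["raw"], outputs=["cal"])
log.record("detect", inputs=["cal"], outputs=["sources"])
```

Keeping both directions indexed is what makes each of the four lookups cheap. A real system would also have to decide the granularity of an "element" (cell, chunk, or whole array), which this sketch leaves open.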
    17. Uncertainty
      • Error bars carried along in the computation
      • Initial version
        • interval arithmetic
        • uniform error distribution
      • More complex models usually science-specific
        • might consider implementing some in the future if enough commonalities
      • Approximate results
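
Interval arithmetic, listed above for the initial version, is easy to sketch: each value carries a lower and upper bound, and every operation returns bounds that are guaranteed to contain the true result. The `Interval` class below is a hypothetical illustration, not SciDB code, and it ignores floating-point rounding of the bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A value known only to lie somewhere between lo and hi."""
    lo: float
    hi: float

    def __post_init__(self):
        if self.lo > self.hi:
            raise ValueError("lo must not exceed hi")

    @classmethod
    def from_error_bar(cls, value, err):
        """Build an interval from a measurement and a symmetric error bar."""
        return cls(value - err, value + err)

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # Smallest result: our low minus their high, and vice versa.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [
            self.lo * other.lo, self.lo * other.hi,
            self.hi * other.lo, self.hi * other.hi,
        ]
        return Interval(min(products), max(products))


a = Interval.from_error_bar(10.0, 1.0)   # 10 ± 1
b = Interval.from_error_bar(2.0, 0.5)    # 2 ± 0.5
total = a + b
difference = a - b
product = a * b
```

The error bars travel with the values through each step, so a downstream result such as `product` reports its own bounds without any extra bookkeeping by the user.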
    18. Resource Management
      • Query scheduling
        • including shared scans (train scheduling)
      • Query progress
      • Support for long-running queries (cancel/stop/restart)
      • Pre-execution query cost estimates
      • Per user/query limits
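
The shared-scan idea mentioned under query scheduling can be sketched as a single pass over the data that feeds several queries at once. This is an assumption-laden toy: `shared_scan` is a hypothetical function, each query is modeled as an initial state plus a per-chunk update step, and the scheduling of late-arriving queries is not covered.

```python
def shared_scan(chunks, queries):
    """Read each chunk once and hand it to every pending query.

    queries maps a name to (initial_state, step), where step takes the
    current state and a chunk and returns the new state.
    """
    states = {name: init for name, (init, _) in queries.items()}
    reads = 0
    for chunk in chunks:
        reads += 1  # one I/O per chunk, regardless of the number of queries
        for name, (_, step) in queries.items():
            states[name] = step(states[name], chunk)
    return states, reads


chunks = [[3, 8], [1, 9], [4]]   # illustrative data, three chunks
queries = {
    "total": (0, lambda acc, chunk: acc + sum(chunk)),
    "maximum": (float("-inf"), lambda acc, chunk: max(acc, max(chunk))),
    "count_over_3": (0, lambda acc, chunk: acc + sum(1 for v in chunk if v > 3)),
}

results, reads = shared_scan(chunks, queries)
```

Three queries complete with three chunk reads instead of nine, which is the I/O saving that a shared scan is meant to deliver.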
    19. Other Features
      • High availability / automatic fail over
      • Auto config and auto self-healing
    20. Green Computing
      • Aggressive compression → less disk
      • Approximate results → stop computing early
      • Shared scans → share I/O
      • Scale out as you go → incremental provisioning
    21. Science / Industry Needs
      • Scale
      • Complex analytics
        • time series, needle in haystack, group based
      • Summary statistics @petascale
      • Arrays
      • Provenance
      • Uncertainty
      • Integration with statistical tools
      … all needed by industry
    22. Outline
      • Needs, challenges, today’s solutions and emerging trends
      • SciDB design and planned features
      • SciDB structure and timeline
    23. Partnership - Roles
      • Science and high-end commercial
        • provide input, including usecases
        • provide some resources
        • review design, test the product
      • DBMS brain trust
        • design, oversee construction, perform research
      • Non profit company
        • manage the project
        • support resulting system
    24. Partnership – Current Players
      • Science and high-end commercial
        • LSST, PNNL, UCSB, LLNL
        • eBay, Vertica, Microsoft
        • lighthouse customers: LSST and eBay
      • DBMS brain trust
        • Michael Stonebraker, David DeWitt, Dave Maier, Jennifer Widom, Stan Zdonik, Sam Madden, Ugur Cetintemel, Magda Balazinska, Jignesh Patel
      • Non profit company
        • SciDB, Inc. - 501(c)(3) foundation
      • Plus… 5 developers working on 1st prototype
    25. Have Usecases From
      • astronomy (LSST)
      • industry (eBay)
      • genomics (LLNL)
      • climate (PNNL/ARM)
      • seismic (Emory Univ)
      • environmental observation & modeling (Oregon Univ)
      • earth remote sensing (UCSB)
      • fusion (LLNL/NIF)
      • WE NEED YOUR USECASES
    26. Timeline
      • Mid-June ’09
        • professional-looking scidb.org
        • start building user community
      • Late August ’09
        • planned demo at VLDB
        • reach out to non-US communities through XLDB3
      • End of Q1’10 – alpha
      • End of Q4’10 – beta
    27. Manpower
      • All work so far in-kind
      • 4.5 FTEs working on demo
        • SLAC, MIT, UW, RAS
      • Good chance of having funds available this FY to hire ~5 full-time developers
      • Actively looking for more partners
        • GET INVOLVED
    28. Summary
      • Many commonalities within science and between science and industry
      • Existing off-the-shelf technologies inefficient for very large scale analytics
      • SciDB – new open source science DBMS
        • Community realizes shared software infrastructure is good
        • Big lighthouse customers
        • Strong team
        • If successful, will enable unprecedented analyses at extreme scale
    29. Related Links
      • http://scidb.org
      • http://www-conf.slac.stanford.edu/xldb07
      • http://www-conf.slac.stanford.edu/xldb08
      • http://www-conf.slac.stanford.edu/xldb09
