Rise of the scientific database


Slides from the talk, "Rise of the Scientific Database" at Strata 2012 (Santa Clara).

  1. 1. Rise of the Scientific Database, by John A. De Goes (@jdegoes)
  2. 2. Agenda
     • Scientific Computing & Databases
     • Blessing / Curse of the RDBMS
     • Power of the Array
     • Scientific Databases
     • Hadoop
     • Summary & Conclusions
  3. 3. What is Scientific Computing?
     "Scientific computing is concerned with constructing mathematical models and quantitative analysis techniques and using computers to analyze and solve scientific problems." (Wikipedia)
  4. 4. [Timeline figure: scientific computing by decade, 1940s through 2010s, then "The Future". Entries are listed here without their decade placement.]
     • Methods: Monte Carlo, linear programming, numeric linear algebra, iterative methods, gradient methods, conjugate gradient, finite differences, finite difference for PDEs, Poisson solvers, finite element methods, FFT, stable pseudoinverses, stable SVD algorithms, large-scale eigenvalue solvers, modern numerical linear algebra
     • Languages, libraries, and systems: Fortran, APL invented, LINPACK, LAPACK, SAS released, SPSS, MATLAB, Mathematica, J, Python, GNU Octave, SciLab, PDL, NumPy, SciPy, BrookGPU, CUDA, OpenCL, Hadoop, Mahout, HPCC, Rasdaman, MonetDB / SciQL, SciDB, Julia, Spark, MLBase, ???
  5. 5. What is a Database?
     "A technology that combines the ability to store data with a high-level, high-performance means of storing, retrieving, and manipulating that data without having to write code or have knowledge of the mechanisms of implementation."
  6. 6. [Timeline figure: databases by decade, 1960s through 2010s, then "The Future". Entries are listed here without their decade placement.]
     • CODASYL, IMS, SABRE, Relational Model, Ingres (QUEL), System R (SEQUEL), SQL/DBS, DBS2, DB2, Oracle, DBase, "SQL wins", ODBMS, "RDBMS", MySQL, PostgreSQL, SQL Server
     • Other solutions: MongoDB, CouchDB, Riak, Neo4j
     • SciDB, MonetDB / SciQL, Julia, Spark, MLBase, ???
  7. 7. The Relationship between Scientific Computing & Databases
     [Diagram labels: Scientific Computing, Scientific Databases, Data Analysis]
  8. 8. The Database Landscape

                        Operational      Analytical        Scientific
                        (gets & puts)    (sums & counts)   (data analysis)
     Unstructured       2000             ?                 ?
     Semi-structured    2005             2000              ?
     Structured         1970             1980              ?
  9. 9. Relational Algebra
     • Projection: π
     • Selection: σ
     • Rename: ρ
     • Natural join: R ⋈ S
     • Semijoin: R ⋉ S
     • Antijoin: R ▷ S
     • Division: R ÷ S
     • Theta join: R ⋈_θ S
     • Left outer join: R ⟕ S
     • Right outer join: R ⟖ S
     • Full outer join: R ⟗ S
     • Aggregation: $_{G_1, G_2, \ldots, G_m}\, g_{f_1(A_1), f_2(A_2), \ldots, f_k(A_k)}(r)$
  10. 10. The Curse of RDBMS
      [Diagram labels: Sets, Tuples, ???; rows, columns]
  11. 11. The Curse of RDBMS
      [Diagram labels: Sets, Tuples, Arrays; rows, columns]
  12. 12. The Power of the Array
      • Linear Algebra
      • Transforms (Fourier, wavelet, etc.)
      • Spatial Analysis
      • Temporal Analysis
      • Etc.
  13. 13. Poor Man’s Arrays
      SELECT X.row AS row, Y.col AS col, SUM(X.value * Y.value) AS value
      FROM X, Y
      WHERE X.col = Y.row
      GROUP BY X.row, Y.col
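The query on slide 13 encodes a matrix product over tables of (row, col, value) triples, which requires joining the column index of X to the row index of Y. The following is a minimal sketch for trying it out, assuming SQLite as the engine (the talk does not name one) and two small hypothetical matrices. The helper names `to_triples`, `sql_matmul`, and `matmul` are illustrative, and `"row"` is quoted only to keep it unambiguous as an identifier.

```python
import sqlite3

# Hypothetical sample matrices; any rectangular nested lists with
# compatible shapes would do.
X = [[1, 2], [3, 4]]
Y = [[5, 6], [7, 8]]

MATMUL_SQL = """
SELECT X."row" AS "row", Y.col AS col, SUM(X.value * Y.value) AS value
FROM X, Y
WHERE X.col = Y."row"
GROUP BY X."row", Y.col
"""

def to_triples(matrix):
    """Flatten a nested list into (row, col, value) tuples."""
    return [(i, j, v) for i, r in enumerate(matrix) for j, v in enumerate(r)]

def sql_matmul(a, b):
    """Multiply a and b by storing them as triples and running the slide's query."""
    conn = sqlite3.connect(":memory:")
    for name, matrix in (("X", a), ("Y", b)):
        conn.execute(f'CREATE TABLE {name} ("row" INTEGER, col INTEGER, value INTEGER)')
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", to_triples(matrix))
    rows = conn.execute(MATMUL_SQL).fetchall()
    conn.close()
    # Rebuild a dense nested list from the (row, col, value) result rows.
    result = [[0] * len(b[0]) for _ in a]
    for i, j, v in rows:
        result[i][j] = v
    return result

def matmul(a, b):
    """The same product written directly against arrays, for comparison."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

print(sql_matmul(X, Y))  # [[19, 22], [43, 50]]
print(matmul(X, Y))      # [[19, 22], [43, 50]]
```

Both routes give the same answer, but the relational one has to flatten the array into triples on the way in and rebuild it on the way out, which is the "poor man's" part.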
  14. 14. Poor Man’s Arrays
      SELECT A.name, A.sales, SUM(B.sales) AS running_total
      FROM Sales AS A, Sales AS B
      WHERE A.sales < B.sales
         OR (A.sales = B.sales AND A.name = B.name)
      GROUP BY A.name, A.sales
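The self-join on slide 14 computes a running total in descending order of sales: each row is summed with every row that has strictly larger sales, plus itself. A small sketch, again assuming SQLite and hypothetical sample data with distinct sales figures, compares it with the same computation over an ordered sequence. The function names and the added `ORDER BY` are illustrative and not from the talk.

```python
import sqlite3
from itertools import accumulate

# Hypothetical sample data; names and figures are illustrative only.
SALES = [("Ann", 10), ("Bob", 30), ("Cy", 20)]

RUNNING_TOTAL_SQL = """
SELECT A.name, A.sales, SUM(B.sales) AS running_total
FROM Sales AS A, Sales AS B
WHERE A.sales < B.sales
   OR (A.sales = B.sales AND A.name = B.name)
GROUP BY A.name, A.sales
ORDER BY A.sales DESC
"""

def running_totals_sql(rows):
    """Running total via the slide's self-join (quadratic in the row count)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Sales (name TEXT, sales INTEGER)")
    conn.executemany("INSERT INTO Sales VALUES (?, ?)", rows)
    result = conn.execute(RUNNING_TOTAL_SQL).fetchall()
    conn.close()
    return result

def running_totals_array(rows):
    """Running total as a single pass over an ordered sequence."""
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    totals = accumulate(sales for _, sales in ordered)
    return [(name, sales, total) for (name, sales), total in zip(ordered, totals)]

print(running_totals_sql(SALES))
# [('Bob', 30, 30), ('Cy', 20, 50), ('Ann', 10, 60)]
print(running_totals_array(SALES))
# [('Bob', 30, 30), ('Cy', 20, 50), ('Ann', 10, 60)]
```

The two agree only while sales values are distinct; with ties, the self-join gives each tied row the same base total, whereas the sequential pass keeps accumulating.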
  15. 15. Poor Man’s Arrays
  16. 16. What is a Scientific Database?
      • First-class support for multidimensional arrays
        • Creation
        • Manipulation
        • Composition
      • Capable of expressing whole analyses, not just snippets
      • Tremendous benefits across multiple dimensions
        • Scalability & Performance
        • Expressiveness & Usability
        • Robustness & Accuracy
  17. 17. Array Algebra
      • Many different approaches (NRCA, SciQL, AFL, ODMG, etc.)
      • Possible to define as extensions to relational core (but not necessary)
      • Most approaches share common core
        • Array deconstruction
        • Array construction
        • Array reduction
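The three core operations named on slide 17 can be illustrated with a toy sketch. It is one possible reading, restricted to 2-D nested lists, and does not reproduce the semantics of NRCA, SciQL, AFL, or any other language listed. The function names `construct`, `deconstruct`, and `reduce_axis` are hypothetical.

```python
import functools
import operator

def construct(shape, f):
    """Array construction: tabulate f(i, j) over every index of a 2-D shape."""
    rows, cols = shape
    return [[f(i, j) for j in range(cols)] for i in range(rows)]

def deconstruct(array):
    """Array deconstruction: expose the array as ((i, j), value) pairs."""
    return [((i, j), v) for i, row in enumerate(array) for j, v in enumerate(row)]

def reduce_axis(array, axis, op, initial):
    """Array reduction: fold op along one axis (0 = down columns, 1 = across rows)."""
    lines = zip(*array) if axis == 0 else array
    return [functools.reduce(op, line, initial) for line in lines]

a = construct((2, 3), lambda i, j: i * 3 + j)
print(a)                                   # [[0, 1, 2], [3, 4, 5]]
print(reduce_axis(a, 1, operator.add, 0))  # [3, 12]
print(reduce_axis(a, 0, operator.add, 0))  # [3, 5, 7]

# Composing the pieces: a transpose is a deconstruction followed by a
# construction over the swapped index space.
cells = dict(deconstruct(a))
t = construct((3, 2), lambda i, j: cells[(j, i)])
print(t)                                   # [[0, 3], [1, 4], [2, 5]]
```

The point of the sketch is that whole-array operations such as transpose or row sums fall out of a few composable primitives, with no flattening into rows and columns.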
  18. 18. Scientific Databases
      • Rasdaman
      • SciDB
      • MonetDB (+SciQL)
  19. 19. What About Hadoop?
      • Commonly used in scientific computing
      • No scientific database technology
        • But many useful programming libraries
          • Hama
          • Mahout
          • Cascading
      • Hadoop doesn’t make it easy
        • YARN should help (Tez?)
        • Balancing needs help
      • Not the only game in town anymore (BDAS, MPI-2, HPCC, etc.)
  20. 20. Conclusions
      • Scientific computing can benefit from a scientific database
      • Success of RDBMS was also a curse
      • NoSQL, big data, catalysts for disruption
      • Still early for scientific databases
      • Hadoop loves/hates science
  21. 21. Resources
      • SciDB / Array Functional Language: http://bit.ly/VdXJkA
      • Rasdaman / rasql: http://en.wikipedia.org/wiki/Rasdaman
      • MonetDB / SciQL: http://monetdb.org
      • Precog / Quirrel: http://precog.com
      • Query Language for Multidimensional Arrays: Design, Implementation, & Optimization Techniques
      John A. De Goes, @jdegoes