How to do Complex Analytics Michael Stonebraker
  1. How to do Complex Analytics (Michael Stonebraker)
  2. Big Volume - Little Analytics
       • SQL aggregates, group_by
       • Find me the average closing price of MSFT on all trading days within the last 3 years
       • Find me the average closing price of each stock in the DJIA on trading days in the last 5 years
       • High performance on SQL analytics available from the data warehouse crowd
  3. Big Data - Big Analytics
       • Complex math operations (machine learning, clustering, trend detection, …)
           • The world of the "quants"
           • Mostly specified as linear algebra on array data
       • A dozen or so common "inner loops"
           • Matrix multiply
           • QR decomposition
           • SVD decomposition
           • Linear regression
  4. Big Data - Big Analytics: An Example
       • Consider closing price on all trading days for the last 5 years for two stocks A and B
       • What is the covariance between the two time-series?
           • cov(A, B) = (1/N) * sum_i (A_i - mean(A)) * (B_i - mean(B))
  5. Now Make It Interesting …
       • Do this for all pairs of 4000 stocks
           • The data is the following 4000 x 1000 matrix
       [Figure: matrix "Stock" with one row per stock (S_1, S_2, …, S_4000) and one column per time point (t_1, t_2, …, t_1000); callouts: "Hourly data?", "All securities?"]
  6. Solution
       • Except for the constant and subtracting off the means:
           • Stock * Stock^T
  7. Big Data - Big Analytics: Requirements
       • SQL-style data management
           • Filters, joins, …
       • Complex array manipulation
  8. Big Data - Big Analytics: Solution Options
       • Math package
       • RDBMS
       • RDBMS + math package
       • Array database
       • Hadoop
  9. Solution Options: R, SAS, Matlab, et al.
       • Weak or non-existent data management
           • Do the correlation only for companies with revenue > $1B?
       • File system storage
       • R doesn't scale and is not a parallel system
           • Revolution does a bit better
  10. Solution Options: RDBMS alone
       • SQL simulator (MADlib) is slooooow
           • And only does some of the required operations
       • Coding operations as UDFs still requires you to simulate arrays on top of tables, which is also slooooow
           • And the current UDF model is not powerful enough to support iteration
  11. Solution Options: R + RDBMS
       • Have to extract and transform the data from RDBMS table to math package data format (e.g. data frames)
       • "Move the world" nightmare
       • Need to learn 2 systems
       • And R still doesn't scale and is not a parallel system
       • Some RDBMS vendors are working on these issues
  12. Array DBMS (e.g. Paradigm4/SciDB)
       • Array SQL data management
       • With massively scalable array analytics
       • In a single system!
       • Open source
       • Runs in the cloud or private grid of commodity HW
  13. Array Versus Relational Tables
       • Math functions run directly on native storage format
       • Dramatic storage efficiencies as # of dimensions & attributes grows
       • High performance on both sparse and dense data
       [Figure: side-by-side comparison labeled "48 cells" and "16 cells"]
  14. Hadoop
       • Awful performance on data management
           • No indexes, no statistics, …
       • Low level interface
           • 40 years of DBMS research points to high level interfaces
       • At the very least move to Pig, Hive, …
           • Another moving part to integrate
  15. Hadoop
       • No math
           • Roll your own, or
           • Use Mahout (yet another moving part to integrate)
       • And Hadoop is very inefficient on math that is not "embarrassingly parallel"
  16. Summary
       • RDBMS good on data management, bad on math
       • Math products don't scale and have no data management
       • Hadoop is slow and has too many moving parts that are not well integrated
           • Not good at either task!
       • Opportunity for a new DBMS?
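
Slides 4 to 6 reduce the all-pairs covariance to a single matrix multiply. The following is a minimal sketch of that reduction in plain Python, not code from the talk. The function names (pairwise_covariance, covariance_matrix) and the tiny price matrix are made up for illustration. The sketch restores the two pieces slide 6 sets aside (subtracting each row's mean and the 1/N constant) and checks one entry against the slide 4 formula.

```python
from statistics import fmean


def pairwise_covariance(a, b):
    """Slide 4 formula: (1/N) * sum_i (A_i - mean(A)) * (B_i - mean(B))."""
    n = len(a)
    mean_a, mean_b = fmean(a), fmean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def covariance_matrix(stock):
    """All-pairs covariance as (1/N) * C * C^T.

    `stock` is a list of rows, one row per stock and one column per time
    point. C is `stock` with each row's mean subtracted.
    """
    n = len(stock[0])
    centered = []
    for row in stock:
        row_mean = fmean(row)
        centered.append([x - row_mean for x in row])
    # Entry (i, j) is the dot product of centered rows i and j, scaled by 1/N.
    return [
        [sum(p * q for p, q in zip(r, s)) / n for s in centered]
        for r in centered
    ]


# Hypothetical closing prices: 3 stocks x 4 trading days (assumed data).
prices = [
    [10.0, 11.0, 12.0, 13.0],
    [20.0, 19.0, 18.0, 17.0],
    [5.0, 5.0, 6.0, 6.0],
]

cov = covariance_matrix(prices)

# The matrix entry for stocks 0 and 1 matches the two-series formula.
assert abs(cov[0][1] - pairwise_covariance(prices[0], prices[1])) < 1e-12
```

The nested loops are only meant to make the algebra visible. At the 4000 x 1000 scale on slide 5 they would be slow in pure Python, which is where the talk argues for running the multiply inside a system built for array math.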