Michael Stonebraker: How to do Complex Analytics


Published in: Technology, Business

  1. How to do Complex Analytics (Michael Stonebraker)
  2. Big Volume - Little Analytics
     • SQL aggregates, group_by
     • Find me the average closing price of MSFT on all trading days within the last 3 years
     • Find me the average closing price of each stock in the DJIA on trading days in the last 5 years
     • High performance on SQL analytics available from the data warehouse crowd
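The two example queries above can be sketched as plain SQL aggregates. This is a minimal illustration using Python's built-in `sqlite3` module; the `closing_prices` table, its columns, and the toy rows are assumptions made up for the example and do not come from the slides.

```python
import sqlite3

# Hypothetical schema and toy rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE closing_prices (symbol TEXT, trade_date TEXT, close REAL)"
)
conn.executemany(
    "INSERT INTO closing_prices VALUES (?, ?, ?)",
    [
        ("MSFT", "2012-01-03", 26.0),
        ("MSFT", "2012-01-04", 28.0),
        ("IBM", "2012-01-03", 180.0),
        ("IBM", "2012-01-04", 184.0),
    ],
)

# Single aggregate: average closing price of one stock since a cutoff date.
msft_avg = conn.execute(
    "SELECT AVG(close) FROM closing_prices "
    "WHERE symbol = ? AND trade_date >= ?",
    ("MSFT", "2012-01-01"),
).fetchone()[0]

# Grouped aggregate: average closing price of each stock since the cutoff.
avg_by_symbol = dict(
    conn.execute(
        "SELECT symbol, AVG(close) FROM closing_prices "
        "WHERE trade_date >= ? GROUP BY symbol",
        ("2012-01-01",),
    ).fetchall()
)
```

Both questions reduce to a filter plus `AVG`, with `GROUP BY` added for the per-stock version, which is the kind of workload the slide calls "little analytics".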
  3. Big Data - Big Analytics
     • Complex math operations (machine learning, clustering, trend detection, …)
         • The world of the “quants”
         • Mostly specified as linear algebra on array data
     • A dozen or so common ‘inner loops’
         • Matrix multiply
         • QR decomposition
         • Singular value decomposition (SVD)
         • Linear regression
  4. Big Data - Big Analytics: An Example
     • Consider closing price on all trading days for the last 5 years for two stocks A and B
     • What is the covariance between the two time-series?
         • $\frac{1}{N} \sum_{i=1}^{N} (A_i - \mathrm{mean}(A)) \cdot (B_i - \mathrm{mean}(B))$
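The covariance formula on this slide translates directly into a few lines of code. The sketch below is a plain-Python illustration; the function name `covariance` and the sample series are assumptions chosen for the example, not data from the talk.

```python
def covariance(a, b):
    """Population covariance: (1/N) * sum((a_i - mean(a)) * (b_i - mean(b)))."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


# Toy closing prices for two hypothetical stocks A and B.
stock_a = [1.0, 2.0, 3.0, 4.0]
stock_b = [2.0, 4.0, 6.0, 8.0]
cov_ab = covariance(stock_a, stock_b)
```

Note that the slide divides by N (population covariance) rather than N - 1, and the sketch follows the slide.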
  5. Now Make It Interesting …
     • Do this for all pairs of 4000 stocks
         • The data is the following 4000 x 1000 matrix
     [Figure: matrix "Stock" with one row per stock (S_1, S_2, …, S_4000) and one column per time point (t_1, t_2, …, t_1000). Slide callouts: "Hourly data?", "All securities?"]
  6. Solution
     • Except for the constant and subtracting off the means:
         • $\mathrm{Stock} \cdot \mathrm{Stock}^{T}$
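To make the "constant and subtracting off the means" remark concrete: if each row of the matrix is centered on its own mean, then entry (i, j) of the centered matrix times its transpose, divided by N, is the covariance of stocks i and j. The following is a small plain-Python sketch of that idea; the function name and the 3 x 4 toy matrix are assumptions for illustration, and a real system would use an optimized matrix multiply.

```python
def all_pairs_covariance(stock):
    """Covariance of every pair of rows, computed as (C * C^T) / N,
    where C is `stock` with each row's mean subtracted."""
    n = len(stock[0])
    centered = []
    for row in stock:
        mean = sum(row) / n
        centered.append([x - mean for x in row])
    # Entry (i, j) of C * C^T is the dot product of centered rows i and j.
    return [
        [sum(x * y for x, y in zip(ri, rj)) / n for rj in centered]
        for ri in centered
    ]


# Toy matrix: 3 hypothetical stocks (rows) by 4 time points (columns).
stock = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
cov = all_pairs_covariance(stock)
```

The result is a symmetric matrix with the per-stock variances on the diagonal, so the all-pairs problem becomes a single matrix multiply, one of the "inner loops" named earlier in the deck.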
  7. Big Data - Big Analytics: Requirements
     • SQL-style data management
         • Filters, joins, …
     • Complex array manipulation
  8. Big Data - Big Analytics: Solution Options
     • Math package
     • RDBMS
     • RDBMS + math package
     • Array database
     • Hadoop
  9. Solution Options: R, SAS, Matlab, et al.
     • Weak or non-existent data management
         • Do the correlation only for companies with revenue > $1B?
     • File system storage
     • R doesn’t scale and is not a parallel system
         • Revolution does a bit better
  10. Solution Options: RDBMS alone
     • SQL simulator (MADlib) is slooooow
         • And only does some of the required operations
     • Coding operations as UDFs still requires you to simulate arrays on top of tables: sloooow
         • And the current UDF model is not powerful enough to support iteration
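For a sense of what "simulating arrays on top of tables" involves, here is one common way to express a matrix multiply in SQL: store each matrix as (row, column, value) triples, join on the shared index, and sum the products. This is a generic sketch using Python's built-in `sqlite3`; the table names `a` and `b` and their columns are assumptions, and it is not a description of how MADlib or any particular product implements the operation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Each matrix is stored as one row per cell: (row index, column index, value).
conn.execute("CREATE TABLE a (i INTEGER, k INTEGER, val REAL)")
conn.execute("CREATE TABLE b (k INTEGER, j INTEGER, val REAL)")

# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]]
conn.executemany(
    "INSERT INTO a VALUES (?, ?, ?)",
    [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)],
)
conn.executemany(
    "INSERT INTO b VALUES (?, ?, ?)",
    [(0, 0, 5.0), (0, 1, 6.0), (1, 0, 7.0), (1, 1, 8.0)],
)

# Matrix multiply as a join on the shared index k plus a grouped SUM.
rows = conn.execute(
    "SELECT a.i, b.j, SUM(a.val * b.val) "
    "FROM a JOIN b ON a.k = b.k "
    "GROUP BY a.i, b.j "
    "ORDER BY a.i, b.j"
).fetchall()
product = {(i, j): v for i, j, v in rows}
```

Every cell becomes a table row and every multiply goes through a join and an aggregation, which gives some intuition for the performance complaint on this slide.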
  11. Solution Options: R + RDBMS
     • Have to extract and transform the data from RDBMS table to math package data format (e.g. data frames)
     • ‘Move the world’ nightmare
     • Need to learn 2 systems
     • And R still doesn’t scale and is not a parallel system
     • Some RDBMS vendors are working on these issues
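The extract-and-transform step described above can be sketched as follows: the database handles the filter and join, and the client then pivots the resulting rows into the matrix layout a math package expects. The example uses Python's built-in `sqlite3` as a stand-in for the RDBMS and plain lists as a stand-in for the math package's format; the `companies` and `closing_prices` tables, their columns, and the data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (symbol TEXT PRIMARY KEY, revenue REAL)")
conn.execute("CREATE TABLE closing_prices (symbol TEXT, day INTEGER, close REAL)")
conn.executemany(
    "INSERT INTO companies VALUES (?, ?)",
    [("AAA", 5e9), ("BBB", 2e8), ("CCC", 3e9)],
)
conn.executemany(
    "INSERT INTO closing_prices VALUES (?, ?, ?)",
    [
        ("AAA", 1, 10.0), ("AAA", 2, 11.0),
        ("BBB", 1, 5.0), ("BBB", 2, 6.0),
        ("CCC", 1, 20.0), ("CCC", 2, 19.0),
    ],
)

# Data management step: keep only companies with revenue above $1B.
rows = conn.execute(
    "SELECT p.symbol, p.day, p.close "
    "FROM closing_prices p JOIN companies c ON c.symbol = p.symbol "
    "WHERE c.revenue > ? "
    "ORDER BY p.symbol, p.day",
    (1e9,),
).fetchall()

# Transform step: pivot (symbol, day, close) rows into one matrix row per stock.
series = {}
for symbol, day, close in rows:
    series.setdefault(symbol, []).append(close)
symbols = sorted(series)
matrix = [series[s] for s in symbols]
```

On toy data the copy is trivial, but every value crosses the boundary between the two systems before any math runs, which is the cost the slide describes as moving the world.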
  12. Array DBMS (e.g. Paradigm4/SciDB)
     • Array SQL data management
     • With massively scalable array analytics
     • In a single system!
     • Open source
     • Runs in the cloud or private grid of commodity HW
  13. Array Versus Relational Tables
     • Math functions run directly on native storage format
     • Dramatic storage efficiencies as # of dimensions & attributes grows
     • High performance on both sparse and dense data
     [Figure labels: 48 cells, 16 cells]
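One plausible reading of the "48 cells" and "16 cells" labels, offered here as an assumption because the figure itself did not survive, is a dense 4 x 4 array with one attribute: stored natively it needs 16 cells, while a relational table must also store both dimension indices in every row, giving 16 rows of 3 columns, or 48 cells. The sketch below just does that arithmetic; the function names are made up for the example.

```python
def dense_array_cells(shape, n_attributes=1):
    """Cells stored by a dense array: one value per attribute per position."""
    cells = n_attributes
    for extent in shape:
        cells *= extent
    return cells


def table_cells(shape, n_attributes=1):
    """Cells stored by a table with one row per position, where each row
    holds every dimension index plus every attribute value."""
    n_rows = 1
    for extent in shape:
        n_rows *= extent
    return n_rows * (len(shape) + n_attributes)


array_2d = dense_array_cells((4, 4))    # 16
table_2d = table_cells((4, 4))          # 16 rows * 3 columns = 48
array_3d = dense_array_cells((4, 4, 4))
table_3d = table_cells((4, 4, 4))
```

Under this dense, one-row-per-cell assumption the table costs (dimensions + attributes) / attributes times as much as the array, so the overhead grows with each added dimension.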
  14. Hadoop
     • Awful performance on data management
         • No indexes, no statistics, …
     • Low level interface
         • 40 years of DBMS research points to high level interfaces
     • At the very least move to Pig, Hive, …
         • Another moving part to integrate
  15. Hadoop
     • No math
         • Roll your own or
         • Use Mahout (yet another moving part to integrate)
     • And Hadoop is very inefficient on math that is not “embarrassingly parallel”
  16. Summary
     • RDBMS good on data management, bad on math
     • Math products don’t scale and have no data management
     • Hadoop is slow and has too many moving parts that are not well integrated
         • Not good at either task!
     • Opportunity for a new DBMS?