Your SlideShare is downloading. ×
Michael Stonebraker How to do Complex Analytics
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Saving this for later?

Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime - even offline.

Text the download link to your phone

Standard text messaging rates apply

Michael Stonebraker How to do Complex Analytics

1,040
views

Published on

Published in: Technology, Business

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
1,040
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
26
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. How to do Complex Analytics Michael Stonebraker
  • 2. Big Volume - Little Analytics
    • SQL aggregates, group_by
    • Find me the average closing price of MSFT on all trading days within the last 3 years
    • Find me the average closing price of each stock in the DJIA on trading days in the last 5 years
    • High performance on SQL analytics available from the data warehouse crowd
  • 3. Big Data - Big Analytics
    • Complex math operations (machine learning, clustering, trend detection, ….)
      • The world of the “quants”
      • Mostly specified as linear algebra on array data
    • A dozen or so common ‘inner loops’
      • Matrix multiply
      • QR decomposition
      • SVD decomposition
      • Linear regression
  • 4. Big Data - Big Analytics An Example
    • Consider closing price on all trading days for the last 5 years for two stocks A and B
    • What is the covariance between the two time-series?
        • (1/N) * sum (A i - mean(A)) * (B i - mean (B))
  • 5. Now Make It Interesting …
    • Do this for all pairs of 4000 stocks
      • The data is the following 4000 x 1000 matrix
    Hourly data? All securities? Stock t 1 t 2 t 3 t 4 t 5 t 6 t 7 … . t 1000 S 1 S 2 … S 4000
  • 6. Solution
    • Except for the constant and subtracting off the means:
      • Stock * Stock T
  • 7. Big Data - Big Analytics Requirements
    • SQL-style data management
      • Filters, joins, ….
    • Complex array manipulation
  • 8. Big Data - Big Analytics Solution Options
    • Math package
    • RDBMS
    • RDBMS + math package
    • Array data base
    • Hadoop
  • 9. Solution Options R, SAS, Matlab, et al
    • Weak or non-existent data management
      • Do the correlation only for companies with revenue > $1B ?
    • File system storage
    • R doesn ’t scale and is not a parallel system
      • Revolution does a bit better
  • 10. Solution Options RDBMS alone
    • SQL simulator (MadLib) is slooooow
      • And only does some of the required operations
    • Coding operations as UDFs still requires you to simulate arrays on top of tables --- sloooow
      • And current UDF model not powerful enough to support iteration
  • 11. Solution Options R + RDBMS
    • Have to extract and transform the data from RDBMS table to math package data format (e.g. data frames)
    • ‘ move the world’ nightmare
    • Need to learn 2 systems
    • And R still doesn’ t scale and is not a parallel system
    • Some RDBMS vendors are working on these issues
  • 12. Array DBMS (e.g. Paradigm4/SciDB)
    • Array SQL data management
    • With massively scalable array analytics
    • In a single system!
    • Open source
    • Runs in the cloud or private grid of commodity HW
  • 13. Array Versus Relational Tables
    • Math functions run directly on native storage format
    • Dramatic storage efficiencies as # of dimensions & attributes grows
    • High performance on both sparse and dense data
    48 cells 16 cells
  • 14. Hadoop
    • Awful performance on data management
      • No indexes, no statistics, …
    • Low level interface
      • 40 years of DBMS research points to high level interfaces
    • At the very least move to Pig, Hive, …
      • Another moving part to integrate
  • 15. Hadoop
    • No Math
      • Roll your own or
      • Use Mahout (yet another moving part to integrate)
    • And Hadoop is very inefficient on math that is not “embarassingly parallel”
  • 16. Summary
    • RDBMS good on data management, bad on math
    • Math products don’ t scale and have no data management
    • Hadoop is slow and has too many moving parts that are not well integrated
      • Not good at either task!
    • Opportunity for a new DBMS?