Upcoming SlideShare
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

# Saving this for later?

### Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime - even offline.

Standard text messaging rates apply

# Michael Stonebraker How to do Complex Analytics

1,040
views

Published on

0 Likes
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

• Be the first to like this

Views
Total Views
1,040
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
26
0
Likes
0
Embeds 0
No embeds

No notes for slide

### Transcript

• 1. How to do Complex Analytics Michael Stonebraker
• 2. Big Volume - Little Analytics
• SQL aggregates, group_by
• Find me the average closing price of MSFT on all trading days within the last 3 years
• Find me the average closing price of each stock in the DJIA on trading days in the last 5 years
• High performance on SQL analytics available from the data warehouse crowd
• 3. Big Data - Big Analytics
• Complex math operations (machine learning, clustering, trend detection, ….)
• The world of the “quants”
• Mostly specified as linear algebra on array data
• A dozen or so common ‘inner loops’
• Matrix multiply
• QR decomposition
• SVD decomposition
• Linear regression
• 4. Big Data - Big Analytics An Example
• Consider closing price on all trading days for the last 5 years for two stocks A and B
• What is the covariance between the two time-series?
• (1/N) * sum (A i - mean(A)) * (B i - mean (B))
• 5. Now Make It Interesting …
• Do this for all pairs of 4000 stocks
• The data is the following 4000 x 1000 matrix
Hourly data? All securities? Stock t 1 t 2 t 3 t 4 t 5 t 6 t 7 … . t 1000 S 1 S 2 … S 4000
• 6. Solution
• Except for the constant and subtracting off the means:
• Stock * Stock T
• 7. Big Data - Big Analytics Requirements
• SQL-style data management
• Filters, joins, ….
• Complex array manipulation
• 8. Big Data - Big Analytics Solution Options
• Math package
• RDBMS
• RDBMS + math package
• Array data base
• 9. Solution Options R, SAS, Matlab, et al
• Weak or non-existent data management
• Do the correlation only for companies with revenue > \$1B ?
• File system storage
• R doesn ’t scale and is not a parallel system
• Revolution does a bit better
• 10. Solution Options RDBMS alone
• SQL simulator (MadLib) is slooooow
• And only does some of the required operations
• Coding operations as UDFs still requires you to simulate arrays on top of tables --- sloooow
• And current UDF model not powerful enough to support iteration
• 11. Solution Options R + RDBMS
• Have to extract and transform the data from RDBMS table to math package data format (e.g. data frames)
• ‘ move the world’ nightmare
• Need to learn 2 systems
• And R still doesn’ t scale and is not a parallel system
• Some RDBMS vendors are working on these issues
• 12. Array DBMS (e.g. Paradigm4/SciDB)
• Array SQL data management
• With massively scalable array analytics
• In a single system!
• Open source
• Runs in the cloud or private grid of commodity HW
• 13. Array Versus Relational Tables
• Math functions run directly on native storage format
• Dramatic storage efficiencies as # of dimensions & attributes grows
• High performance on both sparse and dense data
48 cells 16 cells
• Awful performance on data management
• No indexes, no statistics, …
• Low level interface
• 40 years of DBMS research points to high level interfaces
• At the very least move to Pig, Hive, …
• Another moving part to integrate