Linear regression on 1 terabyte of data? Some crazy observations and actions

Joint Statistical Meetings 2013, Topic-Contributed Session presentation. Big Data Exploration with Amazon.

Published in: Technology, Business

  1. Linear Regression on 1 Terabyte of Data? Some Crazy Observations and Actions
     Hesen Peng, Amazon.com
     Big Data Exploration with Amazon
  2. Model building procedure for a major internet company
     • Planning and Idea Generation
     • Data collection
     • Model building and offline evaluation
     • Implementation for application online
     • Performance evaluation in real world
     Labels on the slide: Experiment Design, Clinical Trial; Major Machine Learning/Stat research; Interesting weekend project; Unsupervised Machine Learning, Survival analysis; Power Point
  3. Linear regression with 1 TB of data
  4. Wanna try it out?
     • Use Amazon Web Services! (with free tier)
       – http://aws.amazon.com/education/
     • Write a simple distributed algorithm:
       – Python: MRJob (https://github.com/Yelp/mrjob)
       – R: RHadoop (https://github.com/RevolutionAnalytics/RHadoop)
       – Launch your own Sun/Oracle Grid Engine environment for parallel computing (http://star.mit.edu/cluster/)
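One way a "simple distributed algorithm" for linear regression could look, sketched here in plain Python rather than MRJob or RHadoop: each mapper reduces its chunk of rows to the partial sums X'X and X'y, the reducer adds those sums together, and a single small p-by-p system is solved at the end. This is an illustration under stated assumptions, not the method used in the talk; the function names (`map_chunk`, `combine`, `solve`, `fit`) and the toy data are hypothetical, and an intercept column is assumed.

```python
from functools import reduce


def map_chunk(rows):
    """Mapper: reduce one chunk of (features, target) rows to partial X'X and X'y."""
    p = len(rows[0][0]) + 1  # +1 for the assumed intercept column
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for features, y in rows:
        x = [1.0] + list(features)
        for i in range(p):
            xty[i] += x[i] * y
            for j in range(p):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def combine(a, b):
    """Reducer: the partial sums are additive, so merging is element-wise addition."""
    xtx = [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [u + v for u, v in zip(a[1], b[1])]
    return xtx, xty


def solve(xtx, xty):
    """Solve the normal equations by Gaussian elimination with partial pivoting."""
    p = len(xty)
    m = [row[:] + [rhs] for row, rhs in zip(xtx, xty)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, p):
            factor = m[r][col] / m[col][col]
            for c in range(col, p + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(m[i][j] * beta[j] for j in range(i + 1, p))
        beta[i] = (m[i][p] - tail) / m[i][i]
    return beta


def fit(chunks):
    """Map every chunk, reduce the partial sums, then solve once."""
    return solve(*reduce(combine, map(map_chunk, chunks)))


# Toy data generated from y = 2 + 3*x1 - x2, split across two chunks.
chunks = [
    [((1.0, 2.0), 3.0), ((2.0, 0.0), 8.0)],
    [((0.0, 1.0), 1.0), ((3.0, 3.0), 8.0), ((1.0, 1.0), 4.0)],
]
beta = fit(chunks)  # approximately [2.0, 3.0, -1.0]
```

Only the p-by-p sums travel between machines, so the data volume per chunk does not matter to the final solve. Because the sums are additive, a new batch of rows can also be folded into an existing fit with one more `combine` call.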
  5. New Challenges
     • Association beyond linear
       – Make better use of data: (most) factors are statistically significant in linear models with 1 TB of data
       – (Better?) Prediction
     • Everything goes to real time
       – Build / update model, analytics, data storage in real time
       – Faster response to new happenings
       – Save engineering overhead
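The remark that most factors become statistically significant at this scale follows from the standard t statistic for a sample correlation, t = r·sqrt(n − 2) / sqrt(1 − r²), which grows with the square root of the sample size. A small sketch of the arithmetic, where the row counts and the correlation value are hypothetical numbers chosen for illustration, not figures from the talk:

```python
import math


def t_statistic(r, n):
    """t statistic for testing zero correlation, given sample correlation r and n rows."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


r = 0.001                # hypothetical: explains 0.0001% of the variance
n_small = 1_000          # hypothetical small study
n_large = 1_000_000_000  # hypothetical row count for a terabyte-scale table

t_small = t_statistic(r, n_small)  # far below the usual 1.96 cutoff
t_large = t_statistic(r, n_large)  # far above it
```

The same negligible effect is undetectable with a thousand rows and overwhelmingly "significant" with a billion, which is why significance alone stops being a useful filter.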
  6. Real time big data analytics work flow
     • Real time data input (training + testing data)
     • Real time analytics front end
     • Dashboarding / monitoring
     • Model building / update
     • Prediction server
     • Outlier detection and pre-processing
     • Huge Statistical Challenge
     • Tree design rather than ring design, enabling parallel construction and update
  7. Where are we?
     Axes: Timeliness of model build vs. Complexity of association
     • Offline model building and scheduled updating
       – Linear regression / GLM using Mahout etc.
       – Random Forest, SVM, Hashing, and beyond
       – Mutual information, Brownian distance covariance, Mira score, and density estimation!
     • Batch processing and near real time updating
       – Batch update to the linear model
       – Batch update of random forest, adaptively throw away trees
       – ?
     • Real time data processing / cleaning and model building
       – Linear model built and consumed in real time
       – ?
       – Real time universal association discovery!
  8. Universal association discovery
     • Discover associations between two random vectors
     • Regardless of dimension and association form (linear / nonlinear / higher order interaction)
     • E.g. mutual information, Brownian distance covariance, Mira score (1NN edge sum)
  9. Intuition
     Hesen Peng, Tianwei Yu. SeMira: Universal Association Discovery and Variable Selection among Continuous Variables using Functions on the Observation Graph
  10. Mira score: another function on the distance graph
     • Where d(i) is the distance between observation i and its nearest neighbor.
     • Complexity: O(N²P)
     • How to adapt to real time analytics?
       – Segment data for batch processing
       – Keep partial data in memory and change the calculation function
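The score formula itself did not survive in this transcript, so the following is only a sketch of the "1NN edge sum" idea under a loudly stated assumption: that the quantity of interest is the plain sum of d(i), the Euclidean distance from each observation to its nearest other observation. The published Mira score may apply a transformation or normalization that is not shown here, and `nn_edge_sum` is a hypothetical name.

```python
import math


def nn_edge_sum(points):
    """Sum of Euclidean distances from each observation to its nearest other observation.

    Assumption: this plain sum stands in for the score on the slide, whose exact
    formula is not reproduced in the transcript.
    """
    total = 0.0
    for i, a in enumerate(points):
        # Scan every other observation: N comparisons of P coordinates each.
        total += min(math.dist(a, b) for j, b in enumerate(points) if j != i)
    return total


# Three observations on a line: nearest-neighbor distances are 1, 1 and 2.
score = nn_edge_sum([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)])  # 1 + 1 + 2 = 4.0
```

The nested scan over all pairs of N observations with P coordinates each is where the O(N²P) cost comes from.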
  11. From O(N²P) to O(NP)
     • Before: a whole distance matrix between observations
     • After: only keep the most up-to-date few in memory and calculate NN distance between observations kept in memory
     • Yes, loss of power; assuming association is independent of sequence of observation
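A minimal sketch of the "keep only the most recent few in memory" idea, assuming a fixed window of W observations and a running sum of each new observation's distance to its nearest neighbor within that window. The class name `WindowedNNSum` and the choice not to revise earlier observations' distances are assumptions made for illustration; the slide does not spell out the actual update rule.

```python
import math
from collections import deque


class WindowedNNSum:
    """Running sum of nearest-neighbor distances against a window of recent observations."""

    def __init__(self, window):
        # deque(maxlen=...) silently drops the oldest observation when full.
        self.recent = deque(maxlen=window)
        self.total = 0.0

    def update(self, point):
        """Add one observation; costs O(W * P) regardless of how many came before."""
        if self.recent:
            self.total += min(math.dist(point, q) for q in self.recent)
        self.recent.append(point)
        return self.total


stream = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (0.0, 0.0)]
tracker = WindowedNNSum(window=2)
totals = [tracker.update(p) for p in stream]  # [0.0, 1.0, 3.0, 4.0]
```

With W fixed, N observations cost O(NWP), which is O(NP). The last observation in the example illustrates the loss of power noted on the slide: its true nearest neighbor, the identical first point, has already left the window, so it contributes 1.0 instead of 0.0.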
  12. We are still at Day 1
     • Mira score: only capable of detecting association between continuous variables
       – SeMira: variable selection
       – No prediction yet
     • Functions on the distance graph are a gold mine.
     • Real time analytics = $$$
       – Fraud detection
       – Clustering
       – Recommendation systems
  13. Join Us!
     • Ask Hesen for referral: hesepeng@amazon.com
     • http://www.amazon.com/gp/jobs
     • Jobs of all levels:
       – Research Scientist
       – Business Intelligence Engineer
       – Software Development Engineers
       – Machine Learning scientist
       – Manager in Machine Learning