Meetup8 29 2013
Published in Technology, Education
Transcript

  • 1. 4/23/13 Hack Data With Math Meetup 8/29/2013
  • 2. Introduction What H2O is: ● Machine learning platform ● Distributed ● In-memory ● Open source What H2O can do: ● Scales your analysis ● Handles large datasets: billions of rows, 100s of GBs ● Performs computations very quickly (near Fortran speeds)
  • 3. Agenda Use H2O with these data: ● MNIST Data ● Kaggle Allstate Data
  • 4. MNIST DATA: Recognizing Handwritten Digits
  • 5. MNIST Data ➢ Each observation has ~800 features (784 pixel columns in the layout below), one feature for each pixel in the image ➢ Each feature value ranges from 0 to 255, where 0 is blank and 255 is totally black ➢ There are ~60K observations total
    0    1    ...  783  784
    5    0    ...  231  255
    .    .    ...  .    .
    1    120  ...  4    0
    Column 0: Class Labels. Columns 1 to 784: Pixel Values.
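The row layout on this slide (class label in column 0, then 784 pixel values from 0 to 255) can be sketched in a few lines of Python. This is only an illustration: the comma-separated format and the helper names `parse_row` and `scale_pixels` are assumptions, not something the slides specify.

```python
# Minimal sketch, assuming comma-separated rows with the class label first.
# parse_row and scale_pixels are hypothetical helper names.

N_PIXELS = 784  # one feature per pixel of a 28 x 28 image


def parse_row(line):
    """Split one row into (label, pixels) and check the pixel range."""
    values = [int(v) for v in line.strip().split(",")]
    label, pixels = values[0], values[1:]
    if len(pixels) != N_PIXELS:
        raise ValueError(f"expected {N_PIXELS} pixels, got {len(pixels)}")
    if any(p < 0 or p > 255 for p in pixels):
        raise ValueError("pixel values must lie in 0..255")
    return label, pixels


def scale_pixels(pixels):
    """Rescale pixel intensities from 0..255 to 0.0..1.0."""
    return [p / 255.0 for p in pixels]


# A made-up row shaped like the first data row on the slide: label 5,
# mostly blank pixels, and the last two pixels set to 231 and 255.
example_row = ",".join(["5"] + ["0"] * 782 + ["231", "255"])
label, pixels = parse_row(example_row)
scaled = scale_pixels(pixels)
```

Rescaling to 0.0..1.0 is optional for tree-based models, but it is a common preprocessing step for other learners.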
  • 6. Random Forest Random Forest For Classification ➢ Build a committee of decorrelated decision trees, call it a forest ➢ Give data to the committee for prediction ➢ Majority vote on a row of data to classify
  • 7. Random Forest Pros ➢ Decision trees model complex interactions ➢ Committee of trees reduces classification error ➢ Easy to train and tune on small data
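The committee idea from the Random Forest slides can be sketched as follows. This is a toy illustration, not H2O's implementation: the three "trees" are hand-written one-rule functions (an assumption made for brevity), whereas a real forest learns its trees from bootstrapped samples of the training data.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common class label; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


def forest_predict(trees, row):
    """Ask every tree for a prediction and take the majority vote."""
    return majority_vote([tree(row) for tree in trees])


# Hypothetical stand-ins for trained decision trees: each maps a row of
# two feature values to class 0 or 1 using a single rule.
toy_trees = [
    lambda row: 1 if row[0] > 100 else 0,
    lambda row: 1 if row[1] > 100 else 0,
    lambda row: 1 if row[0] + row[1] > 150 else 0,
]

prediction = forest_predict(toy_trees, [120, 40])
```

Here two of the three toy trees vote for class 1, so the committee outvotes the one tree that disagrees.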
  • 8. MNIST Data Class Label Counts Inspect data by piping together command line tools into lengthy statements... or use H2O! Demo time! Direct your browser to 192.168.1.161:xxxxx, where xxxxx is your provided port number
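As a point of comparison for the demo, class label counts can also be computed with a short script. The sketch below assumes comma-separated rows with the label in column 0; `count_labels` is a hypothetical helper, and the three sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_labels(rows, label_column=0):
    """Count how often each class label appears in an iterable of CSV rows."""
    return Counter(row[label_column] for row in rows if row)


# Tiny in-memory stand-in for the data file (labels are read as strings).
sample = io.StringIO("5,0,231\n1,120,4\n5,0,0\n")
counts = count_labels(csv.reader(sample))

for label, n in sorted(counts.items()):
    print(label, n)
```

For a real file, the `io.StringIO` object would be replaced by an opened file handle passed to `csv.reader`.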
  • 9. Bodily Injury Claims: Allstate Data
  • 10. Generalized Linear Modeling (GLM) Supervised Learning For Prediction: ➢ Train on data with known labels ➢ Validate on out-of-sample data with known labels ➢ Test on new data ● Use enough training data for the model to adequately capture complex interactions between variables ● Use shrinkage methods to improve predictive power ● A model can mostly be judged by its “ability to predict” ○ A model is no good when its predictive power falls below some threshold Examples: ○ Regression: Linear, Logistic, Poisson, Tweedie
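The train/validate workflow and the shrinkage idea on this slide can be illustrated with a one-feature linear regression. The sketch below uses ridge (L2) shrinkage as one example of a shrinkage method; it is not H2O's GLM, and the function names, data points, and candidate penalties are all made up for illustration.

```python
def ridge_slope(xs, ys, penalty):
    """Slope minimizing sum((y - b*x)^2) + penalty * b^2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


def mean_squared_error(xs, ys, slope):
    """Average squared prediction error of y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training and validation sets with known labels.
train_x, train_y = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
valid_x, valid_y = [4.0, 5.0], [8.0, 10.1]

# Fit on the training data for each candidate penalty, then keep the
# penalty whose model predicts the out-of-sample validation data best.
candidates = [0.0, 0.1, 1.0, 10.0]
best_penalty = min(
    candidates,
    key=lambda lam: mean_squared_error(
        valid_x, valid_y, ridge_slope(train_x, train_y, lam)
    ),
)
best_slope = ridge_slope(train_x, train_y, best_penalty)
```

A larger penalty pulls the slope toward zero; whether that helps depends on the data, which is why the penalty is chosen on validation data rather than on the training data. A final check on new test data would then estimate the chosen model's predictive power.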
  • 11. Allstate Kaggle Data Demo 2
  • 12. Thanks!