4/23/13
Hack Data With Math
Meetup 8/29/2013
Published in: Technology, Education

  1. 4/23/13 Hack Data With Math Meetup 8/29/2013
  2. Introduction
     What H2O is:
     ● Machine learning platform
     ● Distributed
     ● In-memory
     ● Open Source
     What H2O can do:
     ● Scales your analysis
     ● Handles large datasets: billions of rows, 100s of GBs
     ● Performs computations very quickly (near Fortran speeds)
  3. Agenda
     Use H2O with these data:
     ● MNIST Data
     ● Kaggle Allstate Data
  4. MNIST DATA: Recognizing Handwritten Digits
  5. MNIST Data
     ➢ Each observation has ~800 features (784 pixel columns), one feature for each pixel in the image
     ➢ Each feature value ranges from 0 to 255, where 0 is blank and 255 is totally black
     ➢ There are ~60K observations total

     Class Labels | Pixel Values
     0            | 1     ...   783   784
     5            | 0     ...   231   255
     ...          | ...   ...   ...   ...
     1            | 120   ...   4     0
  6. Random Forest
     Random Forest For Classification
     ➢ Build a committee of decorrelated decision trees, call it a forest
     ➢ Give data to the committee for prediction
     ➢ Classify each row of data by majority vote of the trees
  7. Random Forest
     Pros
     ➢ Decision trees model complex interactions
     ➢ Committee of trees reduces classification error
     ➢ Easy to train and tune on small data
  8. MNIST Data Class Label Counts
     Inspect data by piping together command line tools into lengthy statements...
     Or use H2O! Demo time!
     Direct your browser to 192.168.1.161:xxxxx
     (xxxxx is your provided port number)
  9. Bodily Injury Claims: Allstate Data
  10. Generalized Linear Modeling (GLM)
      Supervised Learning For Prediction:
      > Train on data with known labels
      > Validate on out-of-sample data with known labels
      > Test on new data
      ● Use enough training data for the model to adequately capture complex interactions between variables
      ● Use shrinkage methods to improve predictive power
      ● A model can mostly be judged by its “ability to predict”
        ○ A model is no good when its predictive power falls below some threshold
      Examples:
      ○ Regression: Linear, Logistic, Poisson, Tweedie
  11. Allstate Kaggle Data
      Demo 2
  12. Thanks!
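
The random forest recipe on slide 6 (build a committee of decorrelated trees, give each row to the committee, take the majority vote) can be sketched in plain Python. This is a toy illustration only, not H2O's implementation: each "tree" is a one-split stump, the function names (`train_stump`, `train_forest`, `predict_forest`) are hypothetical, and the two-pixel data set is made up. Decorrelation is approximated by giving every stump its own bootstrap sample and a randomly chosen feature.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Fit a one-split 'tree': threshold one feature at its mean and
    predict the most common training label on each side."""
    threshold = sum(r[feature] for r in rows) / len(rows)
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    fallback = Counter(labels).most_common(1)[0][0]
    left_label = Counter(left).most_common(1)[0][0] if left else fallback
    right_label = Counter(right).most_common(1)[0][0] if right else fallback
    return feature, threshold, left_label, right_label


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def train_forest(rows, labels, n_trees=25, seed=0):
    """Build the committee: one stump per bootstrap sample and random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feature = rng.randrange(n_features)         # random feature decorrelates trees
        forest.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest


def predict_forest(forest, row):
    """Classify one row by majority vote of the committee."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Made-up data: two pixel intensities (0-255) per row, class label 0 or 1.
rows = [[0, 10], [5, 20], [10, 0], [250, 240], [255, 230], [240, 255]]
labels = [0, 0, 0, 1, 1, 1]

forest = train_forest(rows, labels)
print(predict_forest(forest, [3, 5]))
print(predict_forest(forest, [250, 245]))
```

An odd number of trees avoids tied votes in the two-class case. A real forest would grow deeper trees and consider a random subset of features at every split rather than a single feature per tree.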

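
The train / validate / test workflow and the shrinkage idea from slide 10 can be illustrated with the simplest GLM, linear regression with one predictor. The sketch below is plain Python rather than H2O's GLM, and it makes several assumptions: the shrinkage method is an L2 (ridge) penalty on the slope, the helper names `fit_ridge` and `mse` are hypothetical, and the data are invented. The penalty strength is picked on the validation set and the chosen model is then scored once on the test set.

```python
def fit_ridge(xs, ys, lam):
    """Fit y ~ intercept + slope * x, shrinking the slope with an L2 penalty.

    With centered data the penalized least-squares slope has the closed form
    slope = Sxy / (Sxx + lam); lam = 0 gives ordinary least squares.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / (sxx + lam)
    intercept = y_bar - slope * x_bar
    return intercept, slope


def mse(model, xs, ys):
    """Mean squared prediction error of a fitted (intercept, slope) model."""
    intercept, slope = model
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Invented data, roughly y = 2x + 1 with small fixed perturbations.
train_x, train_y = [0, 1, 2, 3, 4, 5], [1.2, 2.9, 5.1, 6.8, 9.3, 10.9]
valid_x, valid_y = [6, 7, 8], [13.1, 14.8, 17.2]
test_x, test_y = [9, 10], [19.0, 21.1]

# Train one model per candidate penalty, validate to choose, then test once.
candidates = [0.0, 0.1, 1.0, 10.0]
models = {lam: fit_ridge(train_x, train_y, lam) for lam in candidates}
best_lam = min(candidates, key=lambda lam: mse(models[lam], valid_x, valid_y))
best_model = models[best_lam]

print("chosen lambda:", best_lam)
print("test MSE:", mse(best_model, test_x, test_y))
```

On data this clean a small penalty is likely to win; shrinkage tends to pay off when there are many correlated predictors or few training rows. The "ability to predict" threshold from the slide would correspond to rejecting the model if the test error exceeds a limit chosen in advance.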