Looking into the Future: Using Google's Prediction API

We would all like to predict the future at some point in our lives. Well, thanks to Google we can now be one step closer! This talk gives an overview of what the Google Prediction API is, how you can use it to analyze data sets, and its strengths and weaknesses, and runs open data sets through the system, covering both regression and categorization models.

Published in: Data & Analytics

  1. Looking into the Future: Using Google’s Prediction API. Justin Grammens, Recursive Awesome & IoT Weekly
  2. What is Prediction? • Defined by Wikipedia as: “A statement about an uncertain event.” • Continues on to read… “It is often, but not always, based upon experience or knowledge.” • In statistics, prediction is a part of statistical inference.
  3. Statistical Inference • Statistical inference is the process of deducing properties of an underlying distribution by analysis of data. • Two major paradigms are used for statistical inference • Frequentist Inference • Bayesian Inference
  4. Frequentist Inference • Data is a repeatable random sample with a specific probability • Parameters and probabilities remain constant during the test • Results are independent of results from prior tests • Q: Will the sun rise tomorrow? What’s the probability of a sun dying, based on all the suns in the universe?
  5. Bayesian Inference • Take into account prior results and subjective beliefs • Update probabilities of occurrence based on new data • Tests are NOT run in isolation and affect one another • Q: Will the sun rise tomorrow? Depends on how many times we have seen it rise in the past
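The Bayesian answer to the sunrise question can be made concrete with Laplace's rule of succession. This is a minimal sketch, assuming a uniform prior over the unknown probability of a sunrise (the slides do not name a prior); the function name `rule_of_succession` is made up for illustration.

```python
from fractions import Fraction


def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Posterior probability that the next trial succeeds.

    Assumes a uniform prior on the success probability, which gives
    Laplace's estimate (successes + 1) / (trials + 2).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return Fraction(successes + 1, trials + 2)


# With no observations the estimate is just the prior mean.
print(rule_of_succession(0, 0))    # 1/2
# Every observed sunrise updates the estimate upward.
print(rule_of_succession(10, 10))  # 11/12
```

Each new observation changes the estimate, which is the "tests affect one another" point from the slide.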
  6. Predictions by Machines • Could therefore define prediction as an “informed guess or opinion.” • Software systems have to be trained before they can be effective. Source: reading.pppst.com
  7. What is the Prediction API? • Announced at Google I/O in 2011 • Provides pattern-matching and machine learning capabilities. • Handles both numeric and text input • Handles both classification and regression output • Access from App Engine, client libs and command line • Able to retrain the model on the fly - Bayesian?
  8. What Are Some Usages?
  9. What Do You Need? • Google Account • Google Platform Console project • Google Prediction API Activated • Google Cloud Storage API Activated
  10. Steps Involved • Define what you are trying to accomplish • Find the training data and format to support your goal (hardest part) • Upload training data to Google Cloud Storage • Train the system against the data you provide • Send queries to your model • Upload additional data with new information gained.
  11. Hosted Model • The Prediction API hosts a gallery of user-submitted models • Owners can charge for the use of the model • Hosted models are versioned so they can be updated easily • Models are submitted in PMML format • XML-based language to define statistical & data models • Appears to currently be a waitlist
  12. How To Train • 3 ways to create and train the correct type of model • CSV File - Lives on Google Cloud Storage • Training data embedded in request • Limited to the size of an HTTP request (< 2 MB) • Empty model created and trained with update calls
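The three training routes differ mainly in what the request body carries. The sketch below only builds those bodies as plain dictionaries and sends nothing. The field names (`id`, `storageDataLocation`, `trainingInstances`, `csvInstance`, `output`) are recalled from the v1.6 `trainedmodels` reference and should be treated as an assumption, as should the model id, the bucket path and the reading of "< 2MB" as 2 × 1024 × 1024 bytes.

```python
import json

# Assumed interpretation of the "< 2MB" request limit from the slide.
MAX_REQUEST_BYTES = 2 * 1024 * 1024


def insert_from_storage(model_id, storage_path):
    """Route 1: point a new model at a CSV file in Cloud Storage."""
    return {"id": model_id, "storageDataLocation": storage_path}


def insert_inline(model_id, examples):
    """Route 2: embed (output, features) examples in the request itself."""
    body = {
        "id": model_id,
        "trainingInstances": [
            {"output": output, "csvInstance": list(features)}
            for output, features in examples
        ],
    }
    size = len(json.dumps(body).encode("utf-8"))
    if size >= MAX_REQUEST_BYTES:
        raise ValueError(f"body is {size} bytes; use Cloud Storage instead")
    return body


def update_example(output, features):
    """Route 3: one update call adding a single example to an existing model."""
    return {"output": output, "csvInstance": list(features)}


# Hypothetical model id and bucket path, for illustration only.
print(json.dumps(insert_from_storage("weather-model", "my-bucket/weather.csv")))
print(json.dumps(insert_inline("weather-model", [(62, ["Seattle", 288, "sunny"])])))
print(json.dumps(update_example(62, ["Seattle", 288, "sunny"])))
```

Checking the size before sending makes the inline route fail early, at which point the Cloud Storage route is the natural fallback.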
  13. CSV File Rules • Maximum file size 2.5 GB • No header row. Yes, to the system it’s irrelevant • One example per line • The first column indicates to the system the type of model. • Ideally remove punctuation (other than apostrophes) from your data.
  14. CSV File Rules • Text Strings • Double quotes around all text strings • Text matching is case-sensitive • Numeric Values • Integers and decimals are supported • Numbers: "1", "23", "999" • Strings: "6 12", "colt 45"
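The CSV rules above can be followed mechanically with Python's standard `csv` module. This is a minimal sketch, not the speaker's script: `training_csv` is a made-up helper, and the second sample row is invented purely for illustration. `csv.QUOTE_NONNUMERIC` puts double quotes around every value that is not an `int` or `float`, so text strings come out quoted and numbers stay bare.

```python
import csv
import io


def training_csv(rows):
    """Render training rows: no header, one example per line,
    example value in the first column, text strings double-quoted."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()


rows = [
    (62, "Seattle", 288, "sunny"),          # values taken from the regression slide
    (48.5, "Minneapolis", 301, "cloudy"),   # hypothetical row
]
print(training_csv(rows), end="")
# 62,"Seattle",288,"sunny"
# 48.5,"Minneapolis",301,"cloudy"
```

Because quoting follows the Python type, a value held as a `str` is always written in quotes, even when it contains only digits.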
  15. Structuring Data • Example Value • “The Answer” • Features • No limit on number of features • More features & examples the better • Training on 16 MB takes ~1 hour
  16. What’s The Answer?
  17. Regression Model Example Data • Define your data to support numbers and strings • Query of “Seattle, 288, sunny” might get back a value of 62 • Don’t need to match any values in the dataset • Fill the model with all columns, then query with the first column missing
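"Query with the first column missing" means the query carries only the feature columns. A minimal sketch of such a query body follows. The `input.csvInstance` shape is recalled from the v1.6 `trainedmodels.predict` reference and is an assumption here, not something the slides show; `predict_body` is a made-up helper and no request is sent.

```python
import json


def predict_body(features):
    """Build a query body holding the feature columns only.

    The example value (first CSV column) is left out because it is
    what the model is being asked to predict.
    """
    return {"input": {"csvInstance": list(features)}}


body = predict_body(["Seattle", 288, "sunny"])
print(json.dumps(body))
# {"input": {"csvInstance": ["Seattle", 288, "sunny"]}}
```

The feature order has to match the column order used in the training file, minus the first column.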
  18. Classification Model Example Data • For a query of “Lose weight now!” you would get a result of “spam” • Returns the category from the dataset
  19. Authorization • You must use OAuth 2.0 to authorize requests • Can share your model with others • View: User can call Analyze, Get, List and Predict on the project and/or any model owned by the project. • Edit: User has all the permissions of Can view, but can also Delete, Insert, and Update any models owned by the project. • Is Owner: User has all the permissions of Can edit, but can also grant permissions to other users to access the project.
  20. Tips & Tricks • The more examples & features, the better the results • However - Adding more features doesn’t always give better predictions • Four columns (is_comedy, is_drama, is_action, is_horror) with values Y, N, N, N vs. a single genre column with value Comedy
  21. Tips & Tricks • Need to add a numeric aspect to the genre? • Add additional genre columns and weight them based on count • Five genre columns with values Drama, Drama, Drama, Comedy, Comedy
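The weighting trick from slide 21 can be sketched as a small helper that repeats each genre across a fixed number of genre columns. This is an illustration under stated assumptions: `genre_columns` is a made-up name, and the requirement that the counts add up to the column width is a simplification chosen so that every row ends up with the same number of columns.

```python
def genre_columns(counts, width=5):
    """Expand {genre: count} into `width` repeated genre columns.

    A genre that appears in more columns carries more weight.
    """
    if sum(counts.values()) != width:
        raise ValueError(f"counts must add up to {width} columns")
    columns = []
    for genre, count in counts.items():
        columns.extend([genre] * count)
    return columns


# A film judged to be three parts drama, two parts comedy.
print(genre_columns({"Drama": 3, "Comedy": 2}))
# ['Drama', 'Drama', 'Drama', 'Comedy', 'Comedy']
```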
  22. Tips & Tricks • Always put something into each feature • Include all the features that you know about • For Regression: • Make sure you will have the time to ensure the values are correct • Conversely, if you have exact numbers use them • Try to have at least a few hundred examples for each category
  23. Tips & Tricks • Can only compare against known relationships • Can’t feed an untrained title and user to get rating • Solution is to break the title into genre, director, actors • Columns: Rating, user_name, movie_title • Rows: 9.5, Justin, Star Wars; 2.2, Justin, Disaster Movie; 5.0, Justin, Billy Madison
  24. Let’s Talk Data! • Nice Ride • Based on the starting station, predict the ending station • New York Cab Rides • Given a starting GPS coordinate, predict where the cab ride will end • Sentiment Analysis • Based on the State of the Union speech, determine the sentiment
  25. Based on the starting station, can we predict the ending station?
  26. Nice Ride Location Rides • https://www.niceridemn.org/data/ • Offers a live XML stream to update along the way
  27. Nice Ride Location Rides • Started with this: • Next: • Ended with this:
  28. Nice Ride Insert Data • ID & Location
  29. Nice Ride Running Prediction • Status
  30. Lessons Learned • I forgot to put the values in quotes, so it was treated as numerical regression. • Verify how it’s interpreting your data with a “get” call (check the model type).
  31. Nice Ride Location Rides • Show Scripts, API & Results
  32. Can we predict the movement of NYC cabs?
  33. NYC Cab Ride Data • Data Dictionary • Data Website
  34. Sample Data • Contains pickup & drop-off latitude and longitude
  35. There’s A Problem • Asking for 2 inputs and 2 outputs! • Not possible with the Prediction API as it only supports one dependent variable. :( • Change of plan…
  36. Let’s predict the cost of a NYC cab ride instead!
  37. Prediction Demo • Features are distances (B) • Examples are prices (A) • Is this accurate? • Different fares based on areas of the city
  38. Ok, not really… Let's use location-based data instead
  39. Prediction Demo • Latitude / Longitude are the features (B, C, D, E) • Price is the example (A) • Examples
  40. NYC Cab Ride Location • Show Scripts, API & Results
  41. Sentiment Analysis of a Speech
  42. Speech Sentiment • Always Check Your Data! • Website incorrectly claimed positive(4), negative(0) and neutral(2) sentiment. • Data had groups of sentiment values. • Source
  43. Speech Sentiment • Feature • Example Value • Training Examples
  44. Sentiment Training
  45. Sentiment Example • Show Scripts, API & Results • Obama State of the Union Speech - 1/16 • Donald Trump Speech, Des Moines, IA - 1/24
  46. Smart Spreadsheets • Install Smart Autofill Add-on
  47. Smart Spreadsheets • Prediction API used to fill in missing values
  48. Smart Spreadsheets • Select columns to use for data training
  49. Smart Spreadsheets • “Example Values” are populated
  50. Final Thoughts - Overfitting • Overfitting the model generally takes the form of making an overly complex model to explain idiosyncrasies in the data under study. • Therefore, a model that has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data. • Exact query should not return EXACT examples
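One common way to notice overfitting, not described in the slides and offered here only as a sketch, is to hold back some examples before training and query the trained model with them afterwards: a model that does well on its own training rows but poorly on the held-out rows has likely fit idiosyncrasies rather than the underlying pattern. The helper name, the 20% fraction and the fixed seed are all arbitrary choices.

```python
import random


def split_examples(examples, holdout_fraction=0.2, seed=0):
    """Shuffle examples and split them into (training, held_out) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_holdout = round(len(shuffled) * holdout_fraction)
    cut = len(shuffled) - n_holdout
    return shuffled[:cut], shuffled[cut:]


training, held_out = split_examples(range(10))
print(len(training), len(held_out))  # 8 2
```

Only the training list would be uploaded; the held-out rows are kept locally and used as queries whose correct answers are already known.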
  51. Thank You • Justin Grammens • justin@recursiveawesome.com • http://recursiveawesome.com • Check out my IoT Weekly Newsletter: http://iotweeklynews.com
