Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

1,975 views

Published on

H2O World 2015 - Nachum Shacham of PayPal

Published in: Software

H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham

  1. 1. © 2015 PayPal Inc. All rights reserved.. Data Science with Big Data in a Corporate Environment Tasks, Challenges, and Tradeoffs Nachum Shacham Principal Data Scientist PayPal 11/10/2015
  2. 2. © 2015 PayPal Inc. All rights reserved. Data Science From 10K ft 2 ML Algorithm Design Optimization Scores Performance Charts Training set Test set(s) Model Fitting Validate, calculate KPIs
  3. 3. © 2015 PayPal Inc. All rights reserved. Predictive Modeling Meets Big Data:Bridging the Gap 3 Data Warehouse Model(s) selection and tuning Data Munging Deploy Feature Engineering Observational big-data Data Exploration Validation Data Sourcing Deep, wide, distributed ,sparse, partiallyunknown, redundant,unreliable • Unified  format  (“data  frame”) • Well  understood • Statistically  known  and  “well  behaved” Data Science with Big Data: Quantitative change à Process adjustment
  4. 4. © 2015 PayPal Inc. All rights reserved. 4 Predictive Models for Serving Customers Rewards Tech Support Customer’s future needs Unblanced Precision/recall latency Personalization Model Organic External Cohort Info gathering vs UX Features DATA Metrics Deplyment
  5. 5. © 2015 PayPal Inc. All rights reserved. Data Exploration: Knowing What’s in the Data and Fitting the Pieces Together 5 Understaning Data Contents & meaning Big data visualization Units Language Join keys Null values: codes, quantities Correlations Distributions and outliers
  6. 6. © 2015 PayPal Inc. All rights reserved. 6 Data Munging: Reshaping the Data à Decisions, Decisions Unify values Money, time Dummy variables Sub groups categorical variable Missing values Join keys do not match Unify null symbols Impute / discard Reformat keys Ignore rows DiscardTransform Nullify Irrelevant/ redundant values Transform Remove outliers Standardize Skewed distributions
  7. 7. © 2015 PayPal Inc. All rights reserved. •R •data.table •Dplyr •Lubridate •complete •{t, l, s}apply •{g}sub •grep{l} •ggplot2 7 Data Munging Packages & Functions •Python •pandas •re •Numpy •Matplotlib •csv •datetime
  8. 8. © 2015 PayPal Inc. All rights reserved. 8 Feature Engineering Transform raw data to better represent the problem to the model Better features à Simpler models, better results Manual Feature Manipulation Feature learning algorithms Integrated in model fitting • Dummy  variables • Text  integration • Sub/super  sampling   (unbalanced  sets)
  9. 9. © 2015 PayPal Inc. All rights reserved. 9 Model Tuning Learning Rate Parameter Grid Ensemble: degree, type Interpretability vs Accuacy
  10. 10. © 2015 PayPal Inc. All rights reserved. Model Building Using H2O 10 Rich library of ML algorithms Auto detect feature format Interoperate with ordinary R: split work for complete munging Runs through parameter grid Big Server
  11. 11. © 2015 PayPal Inc. All rights reserved. • Who should judge the value of the results? •Evaluation criteria • What are the cost/value of FP, TP, FN, TP? •FP: PR cost •FN: Financialloss •TP: Intended gain • ContinuousModel evaluation in production 11 Model Evaluation
  12. 12. © 2015 PayPal Inc. All rights reserved. • Model results’ deployment destinations •Humans (decision makers, analysts) •Computers (recommendations) •Databases (scores, leads) • Latencies: hours to milliseconds •From events to model •From model output to action 12 Model Deployment
  13. 13. © 2015 PayPal Inc. All rights reserved. • Multiple aligned plots • Large number of data points on frame • Statisticalplots • Network plots • Model fitting on plot data 13 Big Data Visualization
  14. 14. © 2015 PayPal Inc. All rights reserved. • Detection of concept drift post deployment •Changein feature set distribution •Changein conditionalprobability P(y | X) • Auto correction for concept drift •Repeat training on time window of recent data •Time window: fixed vs. variable size • Changedetection •CUSUM •StatisticalProcess Control 14 Model Drift
  15. 15. © 2015 PayPal Inc. All rights reserved. Transaction & Behavioral: Site Speed impact 15 NOT  ALL  SPEED  IMPROVEMENTS   ARE  CREATED  EQUAL load  time Bounce  rate
  16. 16. © 2015 PayPal Inc. All rights reserved. System Log Data:Query Cost Comparison 16 TCO CONSUMPTION EXEC  COST
  17. 17. © 2015 PayPal Inc. All rights reserved. Summary: The “Extra” Steps Make The Difference 17 Details: the Critical Factor Platform à Performance Decisions: understand the impact Delayed Incomplete Irrelevant
  18. 18. © 2015 PayPal Inc. All rights reserved. THANK YOU 18

×