
ML Basics

An introductory course on building ML applications with primary focus on supervised learning. Covers the typical ML application cycle - Problem formulation, data definitions, offline modeling, platform design. Also, includes key tenets for building applications.
Note: This is an old slide deck. The content on building internal ML platforms is a bit outdated and slides on the model choices do not include deep learning models.


  1. 1. Building ML Applications – A practitioner’s perspective 3/28/18 1 Srujana Merugu
  2. 2. Outline • Overview of ML – What is ML ? Why ML ? – Predictive modeling recap – ML application lifecycle • Problem formulation & data definition • Offline modeling • Building an internal ML platform • Key tenets for real-world ML applications 2
  3. 3. What is ML? 3/28/18 3 A top bank uses an ML model built for a foreign customer segment to make underwriting decisions on Indian customers, with no changes. HYPE! FEAR! MISUSE!
  4. 4. What is Machine Learning? 3/28/18 4 “Field of study that gives computers the ability to learn from data without being explicitly programmed”. - Arthur Samuel (1959)
  5. 5. What is Machine Learning? 3/28/18 5 “Field of study that gives computers the ability to learn without being explicitly programmed”. - Arthur Samuel (1959) Main elements of ML • Representing relevant information as concrete variables • Collecting empirical observations on variables • Algorithms to infer associations between variables
  6. 6. 3/28/18 6 Abstract problem areas, e.g., time series modeling, active learning • variable dependency structure (e.g., sequences, trees) • nature of observations (e.g., noisy, adversarial) • observation process (e.g., active, incremental) Model classes, e.g., linear models, CNNs, CRFs, SVMs • quantification of variable dependencies & assumptions • specifying an exact optimization problem • theoretical results Learning algorithms, e.g., SGD, EM, distributed LDA • mechanisms to solve the optimization problem • theoretical results • practical enhancements: scalable, distributed, incremental versions ML Research
  7. 7. 3/28/18 7 Implementation of algorithms/models, e.g., sklearn, word2vec, Inception • software encoding of algorithms or models in specific programming languages ML platform utilities, e.g., AzureML, H2O, scikit-learn, Keras • software for efficient management of ML workflows Abstract problem areas, e.g., time series modeling, active learning • variable dependency structure (e.g., sequences, trees) • nature of observations (e.g., noisy, adversarial) • observation process (e.g., active, incremental) Model classes, e.g., linear models, CNNs, CRFs, SVMs • quantification of variable dependencies & assumptions • specifying an exact optimization problem • theoretical results Learning algorithms, e.g., SGD, EM, distributed LDA • mechanisms to solve the optimization problem • theoretical results • practical enhancements: scalable, distributed, incremental versions ML Research; ML Software Development
  8. 8. • Translating application problems to ML problems • Applying ML methodology, making the right modeling choices • Using existing software tools effectively Application problems, e.g., seller fraud detection; ML literature; ML practice; algorithm & model implementations; ML platform utilities; robust production systems. More art than science!
  9. 9. Why Machine Learning? • Learn it when you can’t code it – e.g. product similarity (fuzzy relationships & trade-offs) • Learn it when you need to contextualize – e.g., personalized product recommendations (fine-grained context) • Learn it when you can’t track it over time – e.g., seller fraud detection (input-output mapping changes dynamically) • Learn it when you can’t scale it – e.g., customer service ( complex task, large scale input, low latency) • Learn it when you don’t understand it – e.g., review aspect-sentiment mining (hidden structure to be detected) 3/28/18 9
  10. 10. Why not Machine Learning ? • Low problem complexity – e.g., classes being well separated in chosen representation • Less interpretability – true for complex models relative to simple rules – when it drives critical decisions (e.g., health care) • Lag time in case data has to be collected or available data is biased – when there exists more holistic domain knowledge • Expensive modeling effort 3/28/18 10
  11. 11. Broad Areas of Machine Learning 3/28/18 11 Supervised learning: predict new data based on observed data. Unsupervised learning: detect latent structure in the data. Reinforcement learning: adapt behavior to optimize long-term goals using observed rewards.
  12. 12. Typical Predictive Modeling Problem 12 Given: An input object with some features X (covariates/independent variables) Goal: Predict a new target/label attribute Y (response/dependent variable) Note: – Input & output attributes (X, Y) can be simple (e.g., numeric or categorical values/vectors) or have a complex structure (e.g., time-series, text sequences) – Classification (Y is categorical), Regression (Y is numeric/ordinal)
  13. 13. Shipping Logistics 3/28/18 13 Given: a customer order and seller details Predict: expected shipping time
  14. 14. Product Catalog Management 3/28/18 14 Given: a new product Predict: product category it should be placed in
  15. 15. Product Recommendations 3/28/18 15 Given: a user, current context & a candidate product Predict: preference of user for the product
  16. 16. Advertising 3/28/18 16 Given: a user, search query and a candidate product ad Predict: expected click through rate of user for the product ad
  17. 17. Many more applications ! • Advertising • Product search and browse experience • Forecasting product demand and supply • Product pricing/competitor monitoring • Understanding customer profiles and lifetime value • Detecting seller and customer fraud • Enriching product catalog & review content • … 3/28/18 17
  18. 18. Typical Predictive Modeling Problem 18 Given: An input object with some features X (covariates/independent variables) Goal: Predict a new target/label attribute Y (response/dependent variable) Note: – Input & output attributes (X, Y) can be simple (e.g., numeric or categorical values/vectors) or have a complex structure (e.g., time-series, text sequences)
  19. 19. Supervised Learning: Key Assumptions 19 • Training data with correct input-output pairs (X, Y) • Data samples in both train and unseen data are generated in the same way (i.i.d.)
  20. 20. Supervised Learning Training: Given training examples {(Xi,Yi)} where Xi is input features and Yi the target variable, learn a model F to best fit the training data (i.e., Yi ≈ F(Xi) for all i)
  21. 21. Supervised Learning Training: Given training examples {(Xi,Yi)} where Xi is input features and Yi the target variable, learn a model F to best fit the training data (i.e., Yi ≈ F(Xi) for all i) Prediction: Given a new sample X with unknown Y, predict it using F(X)
  22. 22. Key Elements of a Supervised Learning Algorithm. Training: Find a “good” model f from the training data!
  23. 23. Key Elements of a Supervised Learning Algorithm. Training: Find a “good” model f from the training data! • What is an allowed “model”? – Member of a model class H, e.g., linear models • What is “good”? – Accurate predictions on training data in terms of a loss function L, e.g., squared error (Y − F(X))² • How do you “find” it? – Optimization algorithm A, e.g., gradient descent
  24. 24. Key Elements of a Supervised Learning Algorithm. Training: Apply algorithm A to find the model from the class H that optimizes a loss function L on the training data D • H: model class, L: loss function, A: optimization algorithm • Different choices lead to different models on the same data D
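The three choices above map directly onto most ML libraries. Below is a minimal sketch in scikit-learn on a synthetic toy dataset (an assumption for illustration): the model class H is linear models, the loss L is squared error, and the optimization algorithm A is stochastic gradient descent.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # input features (toy data)
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=1000)   # target with a little noise

# H = linear models, L = squared error, A = stochastic gradient descent
model = SGDRegressor(loss="squared_error", max_iter=1000)
model.fit(X, y)               # training: search H for an f that minimizes L on D
print(model.predict(X[:3]))   # prediction: F(X) for new samples
```

Swapping any one of H, L, or A (e.g., a hinge loss, or a tree ensemble instead of linear models) yields a different learning algorithm and hence a different model on the same data D.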
  25. 25. ML Application Development Life Cycle 3/28/18 25
  31. 31. Offline Modeling Process: Data Collection & Integration → Data Exploration → Data Sampling/Splitting → Data Preprocessing → Feature Engineering → Model Training, Evaluation & Fine-tuning → Meet Business Goals?
  36. 36. Problem Formulation 3/28/18 36
  38. 38. Machine Learning Problem Definition Business Problem: Optimize a decision process to improve business metrics • Sub-optimal decisions due to missing information • Solution strategy: predict missing information from available data using ML
  39. 39. Machine Learning Problem Definition Business Problem: Optimize a decision process to improve business metrics • Sub-optimal decisions due to missing information • Solution strategy: predict missing information from available data using ML Example: Reseller fraud • Business objective: Limit fraud orders to increase #users served and reduce return shipping expenses. • Decision process: Add friction to orders via disabling cash on delivery (COD) • Missing information relevant to the decision: – Likelihood of the buyer reselling the products – Likely return shipping costs – Unserved demand for the product
  40. 40. Key elements of a ML Prediction Problem • Instance definition • Target variable to be predicted • Input features • Sources of training data • Modeling metrics (Online/Offline, Primary/Secondary) • Deployment constraints
  41. 41. Instance Definition • Is it the right granularity from the business perspective? • Is it feasible from the data collection perspective ?
  42. 42. Instance Definition Multiple options • a customer • a purchase order spanning multiple products • a single product order (quantity can be >1 though) 3/28/18 42
  43. 43. Instance Definition Multiple options • a customer • a purchase order spanning multiple products • a single product order (quantity can be >1 though) [preferred choice] Why? • Reselling behavior is also at a single product level • COD presented per product not per entire purchase • Blocking a customer on all his orders can even hurt 3/28/18 43
  44. 44. Target Variable to be Predicted • Can we express the business metrics (approximately) in terms of the prediction quality of the target? • Will accurate predictions of the target improve the business metrics substantially?
  45. 45. Potential Business Impact • Will accurate predictions of the target improve the business metrics substantially? • Compute business metrics for each case – Ideal scenario (perfect predictions on target) – Dumb baseline (default predictions, e.g., majority class) – Simple predictor with rules/domain knowledge – Existing solution (if one exists) – Likely scenario with a reasonable-effort ML model • Assess effort vs. benefits 3/28/18 45
  46. 46. Target Variable to be Predicted • Can we express the business metrics (approximately) in terms of the prediction quality of the target? • Will accurate predictions of the target improve the business metrics substantially? • What is the data collection effort? – manual labeling costs • Is it possible to get high quality observations? – uncertainty in the definition, noise or bias in the labeling process
  47. 47. Target Variable Multiple options • Likelihood of buyer reselling the current order • Number of unserved users because of the current order • Expected return shipping expenses for the current order 3/28/18 47
  48. 48. Target Variable Multiple options • Likelihood of buyer reselling the current order [compromise choice] • Number of unserved users because of the current order • Expected return shipping expenses for the current order Why? • Last two choices are better in terms of business metrics, but data collection is difficult • First choice makes data collection easy (esp. as a binary label) and addresses business metrics in a reasonable, but slightly suboptimal way 3/28/18 48
  49. 49. Input features • Is the feature predictive of the target? • Are the features going to be available in the production setting? – Need to define exact time windows for features based on aggregates – Watch out for time lags in data availability – Be wary of target leakages (esp. conditional expectations of the target) • How costly is it to compute/acquire the feature? – Might be different in training/prediction settings
  50. 50. Input Features Reselling vs. Non-reselling indicators • High product discount • High order quantity relative to other orders of same product – Normalize by median/mean to get relative values • More for some products/verticals – Product/vertical id can be used • Buyer being a business store in product category – Buyer’s category purchase count – Buyer being a business store 3/28/18 50
  51. 51. Input Features Reselling vs. Non-reselling indicators • High product discount [feasible] • High order quantity relative to other orders of same product – Normalize by median/mean to get relative values [with lag] • More for some products/verticals – Product/vertical id can be used [feasible] • Buyer being a business store in product category – Buyer’s category purchase count [with lag] – Buyer being a business store [expensive join with external info] 3/28/18 51
  52. 52. Sources of Training Data • Is the distribution of training data similar to production data? – e.g., if production data evolves over time, can the “training data” be adjusted accordingly ? • Are there systemic biases (e.g., data filters) in training data? – Adjust the scope of prediction process so that it matches with the training data setting
  53. 53. Sources of Training Data Historical order data – input features are available, but target is missing Target observations – Manual labeling on a random subset after focused investigations on the address and the customer purchase history. – Improve labeling efficiency by filtering by order quantity and apply same filtering in production 3/28/18 53
  54. 54. Modeling Metrics • Online metrics are measured on a live system – Can be defined directly in terms of the key business metrics – typically measured via A/B tests and these metrics are potentially influenced by a number of factors (e.g., net revenue) • Offline metrics are meant to be computed on retrospective “labeled” data – typically measured during offline experimentation and more closely tied to prediction quality (e.g., area under ROC curve) • Primary metrics are ones that we are actively trying to optimize – e.g., losses due to fraud • Secondary metrics are ones that can serve as guardrails – e.g., customer base size 3/28/18 54
  55. 55. Offline Modeling Metrics • Does improvement in offline modeling metrics result in gains in online business metrics? Model quality: – A) Maximize coverage of fraud orders at a certain level of certainty (>90%) – B) Binary target: four decision possibilities • Maximize average payoff in terms of expected return costs given the different possibilities 3/28/18 55 Return pay-offs: (Actual Fraud, Predicted Fraud) = 0; (Actual Fraud, Predicted Not Fraud) = −avg. return costs; (Actual Not Fraud, Predicted Fraud) = −avg. lost order costs; (Actual Not Fraud, Predicted Not Fraud) = 0
  56. 56. Deployment Constraints • What are the application latency & hardware constraints ? Computational constraints: – Orders per sec, allowed latency for COD disable action – Available processing power, memory
  57. 57. Problem Formulation Template 3/28/18 57 • Template(s) with questions on all the key elements – Listing of possible choices – Reason for preferred choice • Populated for each project by product manager + ML expert
  58. 58. Exercise: ML Problem Definition Good choices for target variable, features & other elements ? – Predicting shipping time for an order – Forecasting the demand for different products – Determining the nature of a customer complaint – Predicting customer preference for a product 3/28/18 58
  59. 59. Data Definition 3/28/18 59
  61. 61. Motivation 3/28/18 61 • Early detection & prevention of common data related errors • Reproducibility • Auditability • Robustness to failure in data fetch pipelines
  62. 62. Data Elements of Interest 3/28/18 62 • Instance identifiers • Target variables • Input features • Other factors useful for evaluating online/offline metrics Fields to specify for each variable of interest • ID, Name, Version • Modeling role, Owner, Description, Tags
  63. 63. Definitions 3/28/18 63 Three possible copies for same variable based on the stages • Offline training, Offline evaluation, Deployment Fields to specify for each variable of interest for each stage • Precise definition (query + source for raw ones or formula for derived ones) • Data type, value check conditions • Units/Level sets • Is Aggregate? , Exact aggregation set or time window • Missing or invalid value indicators, reasons, mitigations (e.g., div by 0 for ratios) • First creation date • Known quality issues
  64. 64. Review Criteria 3/28/18 64 • Unambiguous definitions to allow for ready implementation • Parity across different stages (training/evaluation/deployment) – Definitions – Data type, value checks, units, level sets – Aggregation windows – Missing/invalid value handling of derived variables
  65. 65. Review Criteria 3/28/18 65 • Is the input X to target Y map invariant across stages? – Do definitions drift with time? (Use averages, not sums, in general) • e.g., customer spend to date in books → order fraud status – Do we have the correct feature snapshot of X for Y? • e.g., customer loyalty category (from when?)
  66. 66. Review Criteria 3/28/18 66 • Common data leakages – Unintentional peeking into future, target, or any kind of unobserved variables – Ambiguously specified aggregates, e.g., customer revenue till the “most recent” order ; interpretation can be different in training data and deployment settings because of delays in data logging – Time-varying features for which only certain (or recent) snapshots are available, e.g., marital status of the customer
  67. 67. Review Criteria 3/28/18 67 • Handling of invalid/missing values of raw variables – Join errors in preprocessing – Service failures in deployment • Handling of known data quality issues – Corruption of data for certain segments/time periods
  68. 68. Data Definition Template 3/28/18 68 • Template(s) with details of all the data elements and review questions • Populated for each project by all the relevant stakeholders
  69. 69. Outline • Overview of ML ecosystem • Problem formulation & data definition • Offline modeling • Building an internal ML platform • Key tenets for real-world ML applications 69
  70. 70. Offline Modeling 3/28/18 70
  72. 72. Offline Modeling Process: Data Collection & Integration → Data Exploration → Data Sampling/Splitting → Data Preprocessing → Feature Engineering → Model Training, Evaluation & Fine-tuning → Meet Business Goals?
  73. 73. Offline Modeling Process: Data Collection & Integration → Data Exploration → Data Sampling/Splitting → Data Preprocessing → Feature Engineering → Model Training, Evaluation & Fine-tuning → Meet Business Goals?
  74. 74. Data Collection & Integration Abstract process: (specifics depend on data management infrastructure) • Find where the data resides – API, database/table names, external web sources • Identify mappings between schemas of different sources • Obtain the instance identifiers • Perform a bunch of queries (joins/aggregations) for obtaining the features/target Data access/integration tools: • SQL • Hive, Pig, SparkSQL (for large joins) • Scrapy, Nutch (web-crawling) 3/28/18 74
  75. 75. Offline Modeling Process: Data Collection & Integration → Data Exploration → Data Sampling/Splitting → Data Preprocessing → Feature Engineering → Model Training, Evaluation & Fine-tuning → Meet Business Goals?
  76. 76. Data Exploration: Why? • Data quality is really critical – “Garbage in, garbage out” – Need to verify that the data conforms to expectations (domain knowledge) • ML modeling requires making a lot of choices! • Better understanding of data – Early detection and fixing of data quality issues – Good preprocessing, feature engineering & modeling choices 3/28/18 76
  77. 77. Data Exploration: What to look for ? • Size and schema – #instances & #features, – feature names & data types (numeric, ordinal, categorical, text) • Univariate feature and target summaries – prevalence of missing, outlier, or junk values – distributional properties of features, – skew in target distribution • Bivariate target-feature dependencies – distributional properties of features conditioned on the target – feature-target correlation measures • Temporal variations of features & targets 3/28/18 77
  78. 78. Example: Data Schema 3/28/18 78
  79. 79. Example: Dataset Summary 3/28/18 79 [Annotations: missing values, class imbalance, skewed distribution]
  80. 80. Univariate (Feature or Target) Histograms 3/28/18 80 [Y-axis: log scale]
  81. 81. Feature-Target Dependencies • Class histograms conditioned on feature value 3/28/18 81 For small feature values, the fraud fraction is almost 7-10 times lower; for large values it is comparable or higher
  82. 82. Offline Modeling Process: Data Collection & Integration → Data Exploration → Data Sampling/Splitting → Data Preprocessing → Feature Engineering → Model Training, Evaluation & Fine-tuning → Meet Business Goals?
  83. 83. Data Sampling & Splitting • Generalization to “unseen” data forms the core of ML • Split “labeled” data into train and test datasets – Train split is used to learn models – Test split is a proxy for the “unseen” data in the deployment setting and is used to evaluate model performance Note: In the test split, the target is known, unlike data in the deployment setting 3/28/18 83
  84. 84. Creating Train & Test Splits • Random disjoint splits – Randomly shuffle & then split into train and test sets (e.g., 80% train & 20% test) • K-fold cross validation – Partition into K subsets: use K−1 to train & one to test – Rotate among the different sets to create K different train-test splits – More reliable avg. performance estimate, variance measures, statistical tests – Leave-one-out: extreme case of K-fold (each fold is a single instance) • Out-of-time splits (data has a temporal dependence) – Train & test splits obtained via a time-based cutoff (from production constraints) • Special case: imbalanced classes (for certain algorithms, e.g., decision trees) – Balance the train split alone by down (or up) sampling the majority (minority) class
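As a concrete illustration, here is a small sketch of these split strategies using scikit-learn; it assumes `X` (feature matrix) and `y` (target array) already hold the labeled data, and the `stratify` argument is an optional choice for classification problems.

```python
from sklearn.model_selection import train_test_split, KFold, TimeSeriesSplit

# Random disjoint split: 80% train / 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# K-fold cross validation: K rotating train/test splits
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]

# Out-of-time splits for temporally ordered data (train on earlier, test on later)
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    pass
```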
  85. 85. K-fold Cross Validation 3/28/18 85
  86. 86. Complex pipelines: Additional Splits • In a simple scenario, the target labels are used only by the “learning algorithm” – Train and test splits suffice for this case • Complex pipelines might have multiple elements that need a peek at the target – e.g., feature selection, meta learning algorithms, output calibration, etc. – Separate data splits for each element lead to better generalization – Need to consider the size of the available labeled data as well 3/28/18 86
  87. 87. Offline Modeling Process: Data Collection & Integration → Data Exploration → Data Sampling/Splitting → Data Preprocessing → Feature Engineering → Model Training, Evaluation & Fine-tuning → Meet Business Goals?
  88. 88. Data Preprocessing • Special handling of text valued features – Necessary to preserve the relevant “information” – Appropriate handling of special characters, punctuation, spaces & markup • Feature/row scaling (for numeric attributes) – Necessary to avoid numerical computation issues, speed up convergence – Columns: • z-scoring: subtract mean, divide by std-deviation → mean = 0, variance = 1 • fixed range: subtract min, divide by range → 0 to 1 range – Rows: L1 norm, L2 norm • Imputing missing/outlier values – Necessary to avoid incorrect estimation of model parameters – Handling strategies depend on the semantics of “missing” 3/28/18 88
  89. 89. Handling Outliers & Missing Values • Indication of a suspect instance: discard the record • Informative w.r.t. target: introduce a new indicator variable • Missing at random – Numeric: replace with mean/median or conditional mean (regression) – Categorical: replace with mode or likely value conditioned on the rest
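A brief sketch of these preprocessing steps with scikit-learn, on a tiny hypothetical numeric matrix; the choice of median imputation and of scaler is illustrative, not prescribed by the deck.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, np.nan],    # a missing value
              [3.0, 180.0]])

# Informative missingness: keep an indicator column before imputing
missing_flag = np.isnan(X[:, 1]).astype(float).reshape(-1, 1)

# Missing at random (numeric): replace with the column median
X_imputed = SimpleImputer(strategy="median").fit_transform(X)

# Column scaling: z-scoring (mean 0, variance 1) or fixed 0-1 range
X_z = StandardScaler().fit_transform(X_imputed)
X_01 = MinMaxScaler().fit_transform(X_imputed)

X_final = np.hstack([X_z, missing_flag])
```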
  90. 90. Offline Modeling Process: Data Collection & Integration → Data Exploration → Data Sampling/Splitting → Data Preprocessing → Feature Engineering → Model Training, Evaluation & Fine-tuning → Meet Business Goals?
  91. 91. Feature Engineering • Case 1: Raw features are not highly predictive of the target, esp. in the case of simple model classes (e.g., linear models) • Solution: Feature extraction, i.e., construct new, more predictive features from raw ones to boost model performance • Case 2: Too many features with few training instances → “memorizing” or “overfitting” situation leading to poor generalization • Solution: Feature selection, i.e., drop non-informative features to improve generalization 3/28/18 91
  92. 92. Feature Extraction • Basic conversions for linear models – e.g., 1-hot encoding, sparse encoding of text • Non-linear feature transformations for linear models – Linear models are scalable, but not expressive → need non-linear features – e.g., binning, quadratic interactions • Domain-specific transformations – Well studied for their effectiveness on special data types such as text, images – e.g., TF-IDF transformation, SIFT features • Dimensionality reduction – High dimensional features (e.g., text) can lead to “overfitting”, but retaining only some dimensions may be sub-optimal – Informative low dimensional approximation of the raw feature – e.g., PCA, clustering 3/28/18 92
  93. 93. Basic Conversions: Categorical Features • One-Hot Encoding – Converts a categorical feature with K values into a binary vector of size K−1 – Just a representation to enable use in linear models 3/28/18 93 e.g., Product_vertical ∈ {Handset, Book, Mobile} → (isBook, isMobile): Handset → (0, 0), Book → (1, 0), Mobile → (0, 1)
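A quick sketch of this encoding with pandas; the column name mirrors the slide's example, and with `drop_first=True` one value (whichever sorts first) becomes the implicit all-zeros baseline.

```python
import pandas as pd

df = pd.DataFrame({"Product_vertical": ["Handset", "Book", "Mobile"]})

# K-1 binary indicator columns; the dropped category is the implicit baseline
encoded = pd.get_dummies(df["Product_vertical"], prefix="is", drop_first=True)
print(pd.concat([df, encoded], axis=1))
```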
  94. 94. Basic Conversions: High Dimensional Text-like Features Sparse Matrix Encoding Text features: • each feature value snippet is split into tokens (dimensions) • Bag of tokens → a sparse vector of “counts” over the token vocabulary • Single text feature → sparse matrix with #columns = vocabulary size Other high dimensional features: • similar process via a map from raw features to a bag of dimensions 3/28/18 94
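For instance, a bag-of-tokens sparse encoding with scikit-learn's CountVectorizer; the product-title strings are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

titles = ["blue cotton shirt", "slim fit cotton shirt", "leather wallet brown"]

vectorizer = CountVectorizer()              # tokenize and build the token vocabulary
X_text = vectorizer.fit_transform(titles)   # sparse matrix, #columns = vocabulary size

print(vectorizer.get_feature_names_out())
print(X_text.toarray())                     # dense view only for this tiny example
```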
  95. 95. Non-linearity: Numeric Features • Non-linear functions of features or target (in regression) – Log transformation, polynomial powers, Box-Cox transforms – Useful given additional knowledge on the feature-target relationship • Binning – Numeric feature → categorical one with #values = #bins – Results in more weights (K−1 for K bins) in linear models instead of just one weight for the raw numeric feature – Useful when the feature-target relation is non-linear or non-monotonic – Bins: equal ranges, equal #examples, maximize bin purity (e.g., entropy) 3/28/18 95
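A minimal binning sketch with scikit-learn's KBinsDiscretizer, assuming a toy order-quantity column; `uniform` gives equal-range bins and `quantile` gives equal-count bins.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

order_qty = np.array([[1], [2], [2], [3], [5], [8], [20], [50]])

equal_range = KBinsDiscretizer(n_bins=4, encode="onehot-dense", strategy="uniform")
equal_count = KBinsDiscretizer(n_bins=4, encode="onehot-dense", strategy="quantile")

print(equal_range.fit_transform(order_qty))   # K binary columns replace one numeric column
print(equal_count.fit_transform(order_qty))
```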
  96. 96. Numeric Feature Binning 3/28/18 96 [3]
  97. 97. Interaction Features • Required when the features' influence on the target is not purely additive – linear combinations of features won't work Example: An order with a 50% discount on mobiles is much more likely to indicate fraud than a simple combination of 50% discount or mobile order. Common Interaction Features: • Non-linear functions of two or more numeric features, e.g., products & ratios • Cross-products of two or more categorical features • Aggregates of numerical features corresponding to categorical feature values • Tree-paths: use leaves from decision trees trained on a smaller sample 3/28/18 97
  98. 98. Categorical-Categorical Interaction Features 3/28/18 98
  99. 99. Numerical-Categorical Interaction Features • Compute aggregates of a numeric feature corresponding to each value of a categorical feature • New interaction feature → numeric one obtained by replacing the categorical feature value with the corresponding numeric aggregate • e.g., brand_id → brand_avg_rating, brand_avg_return_cnt • Especially useful for categorical features with high cardinality (>50) 3/28/18 99
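A small pandas sketch of this aggregate replacement; the column names mirror the slide's example and the values are made up. In practice the aggregates should be computed on the training split only, to avoid leakage.

```python
import pandas as pd

orders = pd.DataFrame({
    "brand_id":   ["b1", "b1", "b2", "b2", "b3"],
    "rating":     [4.0, 5.0, 2.0, 3.0, 4.5],
    "return_cnt": [0, 1, 3, 2, 0],
})

# Replace the high-cardinality id with per-brand numeric aggregates
orders["brand_avg_rating"] = orders.groupby("brand_id")["rating"].transform("mean")
orders["brand_avg_return_cnt"] = orders.groupby("brand_id")["return_cnt"].transform("mean")
print(orders)
```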
  100. 100. Tree Path Features • Learn a decision tree on a small data sample with raw features • Paths to the leaves are conjunctions constructed from conditions on multiple raw features • Highly informative with respect to the target. 3/28/18 100
  101. 101. Domain-Specific Transformations Text Analytics and Natural Language Processing • Stop-words removal/Stemming: Helps focus on semantics • Removing high/low percentiles: Reduces features w/o loss in predictive power • TF-IDF normalization: Corpus wide normalization of word frequency • Frequent N-grams: Capture multi-word concepts • Parts of speech/Ontology tagging: Focus on words with specific roles Web Information Extraction • Hyperlinks, Separating multiple fields of text (URL, in/out anchor text, title, body) • Structural cues: XPaths/CSS; Visual cues: relative sizes/positions of elements • Text style (italics/bold, font-size, colors) Image Processing • SIFT features, Edge extractors, Patch extractors 3/28/18 101
  102. 102. Dimensionality Reduction • Clustering along feature values – K-means variants (along feature values) • Low rank matrix factorization – Principal Component Analysis (PCA) – Non-negative Matrix Factorization (NNMF) • Topic models – Latent Dirichlet Allocation (LDA) – Probabilistic Latent Semantic Analysis (PLSA) • Neural embeddings – Word2Vec(Skip-gram), Para2Vec 3/28/18 102
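As an example of the low-rank family above, a minimal PCA sketch on a random matrix (the dimensions and component count are arbitrary assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_high = rng.normal(size=(200, 100))   # 200 instances, 100 raw dimensions

pca = PCA(n_components=10)
X_low = pca.fit_transform(X_high)      # informative 10-dimensional approximation
print(X_low.shape, pca.explained_variance_ratio_.sum())
```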
  103. 103. Feature Selection Key Idea: Sometimes “less (features) is more (predictive power)” Motivating reasons: • To improve generalization • To meet prediction latency or model storage constraints (for some applications) Broadly, three classes of methods: • Filter or univariate methods, e.g., information-gain filtering • Wrapper methods, e.g., forward search • Embedded methods, e.g., regularization 3/28/18 103
  104. 104. Feature Selection: Filter or Univariate Methods • Goal: Find the “top” individually predictive features – “predictive”: specified correlation metric • Ranking by a univariate score – Score features via an empirical statistical measure that corresponds to predictive power w.r.t. target – Those above a cut-off (count, percentile, score threshold) are retained Note: • Fast, but highly sub-optimal since features evaluated in isolation • Independent of the learning algorithm. • e.g., Chi-squared test, information gain, correlation coefficient 3/28/18 104
  105. 105. Feature-Target Correlation • Mutual information: Captures correlation between a categorical feature (X) and the class label (Y): I(X, Y) = Σ_{x ∈ sup(X)} Σ_{y ∈ sup(Y)} p(x, y) log( p(x, y) / (p(x) p(y)) ) • p(x, y): fraction of examples with X = x and Y = y • p(x), p(y): fraction of examples with X = x and with Y = y, respectively 3/28/18 105
  106. 106. Feature-Target Correlation • Pearson's correlation coefficient: Captures the linear relationship between a numeric feature (X) and target value (Y): ρ(X, Y) = cov(X, Y) / (σ_X σ_Y) = Σ_i (X_i − X̄)(Y_i − Ȳ) / ( (Σ_i (X_i − X̄)²)^(1/2) (Σ_i (Y_i − Ȳ)²)^(1/2) ) • X_i, Y_i: value of X, Y in the i-th instance • X̄, Ȳ: mean of X, Y • Covariance matrix: captures correlations between every pair of features 3/28/18 106
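Both measures are available off the shelf; a tiny sketch with scikit-learn and NumPy on made-up values (note that mutual_info_score uses natural logarithms):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Mutual information between a categorical feature and a class label
x_cat = np.array(["A", "A", "B", "B", "B", "C"])
y_cls = np.array([0, 0, 1, 1, 0, 1])
print(mutual_info_score(x_cat, y_cls))

# Pearson correlation between a numeric feature and a numeric target
x_num = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_num = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
print(np.corrcoef(x_num, y_num)[0, 1])
```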
  107. 107. Feature Selection: Wrapper Methods • Goal: Find the “best” subset from all possible subsets of input features – “best” : specified performance metric & specified learning algo • Iterative search – Start with an initial choice (e.g., entire set, random subset) – Each stage: find a better choice from a pool of candidate subsets. Note: • Computationally very expensive • e.g., Backward search/Recursive feature elimination, Forward search 3/28/18 107
  108. 108. Feature Selection: Embedded Methods • Identify predictive features while the model is being created itself • Penalty methods: learning objective has an additional penalty term that pushes the learning algorithm to prefer simpler models • Good trade-off in terms of the optimality & computational costs • e.g., Regularization methods (LASSO, Elastic Net, Ridge Regression) 3/28/18 108
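A minimal sketch of embedded selection via an L1 penalty (LASSO), on synthetic data where only two of twenty features carry signal; the alpha value is illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty drives most weights to exactly zero
selected = np.flatnonzero(lasso.coef_)
print(selected)                       # indices of the retained features
```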
  109. 109. Offline Modeling Process: Data Collection & Integration → Data Exploration → Data Sampling/Splitting → Data Preprocessing → Feature Engineering → Model Training, Evaluation & Fine-tuning → Meet Business Goals?
  110. 110. Key Elements of a Supervised Learning Algorithm. Training: Find a “good” model f from the training data!
  111. 111. Key Elements of a Supervised Learning Algorithm. Training: Find a “good” model f from the training data! • What is an allowed “model”? – Member of a model class H, e.g., linear models • What is “good”? – Accurate predictions on training data in terms of a loss function L, e.g., squared error (Y − F(X))² • How do you “find” it? – Optimization algorithm A, e.g., gradient descent
  112. 112. Key Elements of a Supervised Learning Algorithm. Training: Apply algorithm A to find the model from the class H that optimizes a loss function L on the training data D • H: model class, L: loss function, A: optimization algorithm • Different choices lead to different models on the same data D
  113. 113. Model Training (Recap) Key elements in learning algorithms: • Model class, e.g., linear models, decision trees, neural networks • Loss function, e.g., logistic loss, hinge loss • Optimization algorithm, e.g., gradient descent, & assoc. params Lot of algorithms & hyper-parameters to choose from ! 3/28/18 113
  114. 114. Scikit-learn Guide 3/28/18 114
  115. 115. Model Choice: Classification 3/28/18 115 Primary factors: High #data instances (> 10 MM) • Linear models – online learning (SGD) High #features/#examples ratio (>1) • Linear models: aggressive (L1) regularization • Linear models: dimensionality reduction • Naïve Bayes (homogeneous independent features) Need non-linear interactions • Kernel methods (e.g., Gaussian SVM) • Tree ensembles (e.g., GBDT, RF) • Deep learning methods (e.g., CNNs, RNNs)
  116. 116. Model Choice: Regression 3/28/18 116 Primary factors: High #data instances (> 10 MM) • Linear models – online learning (SGD) High #features/#examples ratio (>1) • Linear models: aggressive (L1) regularization • Linear models: dimensionality reduction Need non-linear interactions • Kernel methods (e.g., Gaussian SVR) • Tree ensembles (e.g., GBDT, RF) • Deep learning methods (e.g., CNNs, RNNs)
  117. 117. Model Evaluation & Diagnostics Model Evaluation: • Train error: Estimate of the expressive power of the model/algorithm relative to training data • Test error: A more reliable estimate of likely performance on “unseen” data Post evaluation: What is the right strategy to get a better model ? • 1) Get more training data instances • 2a) Get more features or construct more complex features • 2b) Explore more complex models/algorithms • 3a) Drop some features • 3b) Explore simpler models/algorithm
  118. 118. Overfitting • Overfitting: Model fits training data well (low training error) but does not generalize well to unseen data (poor test error) • Complex models with a large #parameters capture not only good patterns (that generalize), but also noisy ones 3/28/18 118 [Figure: Y vs. X with actual, predicted, and model curves; high prediction error]
  119. 119. Underfitting • Underfitting: Model lacks the expressive power to capture the target distribution (poor training and test error) • A simple linear model cannot capture the target distribution 3/28/18 119
  120. 120. Bias & Variance • Bias of algo: Difference between the actual target and the avg. estimated target, where averaging is done over models trained on different data samples • Variance of algo: Variation in predictions of models trained on different data samples 3/28/18 120
  121. 121. Model Complexity: Bias & Variance • Simple learning algos with small #params → high bias & low variance – e.g., linear models with few features – Reduce bias by increasing model complexity (adding more features) • Complex learning algos with large #params → low bias & high variance – e.g., linear models with sparse high dimensional features, decision trees – Reduce variance by increasing training data & decreasing model complexity (feature selection) 3/28/18 121
  122. 122. Validation Curve 3/28/18 122 Prediction performance vs. model complexity parameter. Ideal choice: match of complexity between the learning algorithm and the training data (the optimal choice lies just before the overfitting region).
  123. 123. Learning Curve 3/28/18 123 Prediction performance vs. Num. of training examples Ideal choice: Early portion of the flat region.
  124. 124. Common Evaluation Metrics Standard evaluation metrics exist for each class of predictive learning scenarios – Binary Classification – Multi-class & Multi-label Classification – Regression – Ranking • Loss function used in training objective is just one choice of evaluation metric – Usually picked because the learning algorithm is readily available – Might be a good, but not necessarily ideal choice from business perspective • Critical to work backwards from business metrics to create more meaningful metrics 3/28/18 124
  125. 125. Classification – Making Predictions 3/28/18 125 Customer orders – blues are not fraudulent (P), reds are fraudulent (N). Score using customer order features to create a rank order from low to high certainty. Operational decision point: thresholding on the score (the user has to choose!)
  126. 126. Classification – Operational Point Evaluation Metrics • For each threshold, confusion matrix for binary classification of P vs. N: Predicted P & Actual P = TP, Predicted P & Actual N = FP, Predicted N & Actual P = FN, Predicted N & Actual N = TN • Precision = TP/(TP+FP): How correct are we on the ones we predicted P? • Recall = TP/(TP+FN): What fraction of actual P's did we predict correctly? • True Positive Rate (TPR) = Recall • False Positive Rate (FPR) = FP/(FP+TN): What fraction of N's did we predict wrongly? 3/28/18 126
  127. 127. Receiver Operating Characteristic (ROC) Curve 3/28/18 127 [Figure: trade-off curve of % cumulative frauds (true positive rate) vs. % cumulative non-frauds (false positive rate)] ROC curve: plot of TPR vs. FPR. AUC: area under the ROC curve • Perfect classifier: AUC = 1 • Random classifier: AUC = 0.5 • Odds of scoring P > N • Effective for comparing learning algorithms across operational points. Operational point: • Maximize (TPR − FPR), F1-measure • Other business-driven choices
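A short sketch of these metrics with scikit-learn, on made-up labels and scores; the 0.5 threshold stands in for the chosen operational point.

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

y_true   = [1, 0, 1, 1, 0, 0, 1, 0]                  # 1 = fraud (positive class)
y_scores = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]  # model scores

threshold = 0.5                                      # the operational point
y_pred = [int(s >= threshold) for s in y_scores]

print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(roc_auc_score(y_true, y_scores))   # threshold-free: area under the ROC curve
```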
  128. 128. Precision-Recall Curve 3/28/18 128 [Figure: precision vs. recall trade-off, with high-precision and high-recall operating regions; the curve is noisy in one region for small datasets]
  129. 129. Classification: Picking an Operational Point • Binary classification: the score threshold corresponds to the operational point • Application-specific bounds on precision or recall – Maximize recall (or precision) with a lower bound on precision (or recall) • Application-specific misclassification cost matrix – Optimize overall misclassification cost (TP·C_TP + FP·C_FP + TN·C_TN + FN·C_FN) – Cost matrix: Predicted P & Actual P = C_TP, Predicted P & Actual N = C_FP, Predicted N & Actual P = C_FN, Predicted N & Actual N = C_TN – Reduces to standard misclassification error when C_TP = C_TN = 0 and C_FP = C_FN = 1 3/28/18 129
  130. 130. Regression – Evaluation Metrics • Metrics when regression is used for predicting target values – Root Mean Square Error (RMSE): ( (1/N) Σ_i (Y_i − F(X_i))² )^(1/2) – R²: How much better is the model compared to just picking the best constant? R² = 1 − (Model Mean Squared Error / Variance) – MAPE (Mean Absolute Percent Error): (1/N) Σ_i | (Y_i − F(X_i)) / Y_i | 3/28/18 130
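The same metrics computed with scikit-learn and NumPy on made-up values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 11.5, 16.0, 18.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)                        # 1 - MSE / variance of y_true
mape = np.mean(np.abs((y_true - y_pred) / y_true))   # undefined if any y_true is zero
print(rmse, r2, mape)
```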
  131. 131. Model Fine-tuning • Lot of algorithms and hyper-parameters (e.g., learning rate) to choose from – Infeasible to explore all choices • Practical solution approach – Narrow down a few suitable algorithms from meta data (size/attribute types) – For each chosen algorithm, systematically explore hyper-parameter choices • Alternate optimization • Exhaustive grid search • Bayesian optimization (e.g., Spearmint, MOE) • Each exploration: learning a model on a train split & evaluating on a test split. • Preferred split mechanism: k-fold cross-validation • Best hyperparameter choices based on the test (cross-validation) error 3/28/18 131
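As a concrete example of exhaustive grid search with k-fold cross-validation, a hedged scikit-learn sketch; `X_train` and `y_train` are assumed to be the labeled training split, and the grid and scoring metric are illustrative.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 300], "max_depth": [2, 3, 4]}

search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid,
    cv=5,                 # 5-fold cross-validation per hyper-parameter choice
    scoring="roc_auc",    # pick an offline metric aligned with the business metric
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```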
  132. 132. Multiple stages of optimization • Objective: Find f(.) to optimize some cost L(Y_unseen, f(X_unseen)) • ML methodology: – (Step I) Model learning: Determine good choices of f(.) that optimize L_A(Y_train, X_train) for different choices of hyperparameters and algorithms – (Step II) Hyperparameter fine-tuning: Among choices in Step I, pick the one that optimizes L_B(Y_eval-split, X_eval-split) – (Step III) Operational choices (e.g., score thresholding, output calibration): For the choice in Step II, determine the operational choices so as to optimize L_C(Y_op-split, X_op-split) • Note: – The ideal choice is to have L = L_A = L_B = L_C and the data splits to be i.i.d., but that is not always possible • e.g., need L = max. recall for precision > 90%, but L_A = logistic loss & L_B = area under ROC – Preferable to choose intermediate metrics that are “close” to the desired business metric and have robust off-the-shelf implementations 3/28/18 132
  133. 133. Offline Modeling Process: Data Collection & Integration → Data Exploration → Data Sampling/Splitting → Data Preprocessing → Feature Engineering → Model Training, Evaluation & Fine-tuning → Meet Business Goals?
  134. 134. Building an Internal ML Platform 3/28/18 134
  136. 136. Typical ML Production System 3/28/18 136 [Diagram components: raw data; data fetch + aggregation; offline modeling producing models, configs, reports; prod. re-training; prod. models & configs; prod. scoring; A/B bucketing; business logic optimization; stimulus/action/outcome loop with the environment; data collection & attribution of (stimulus, action, outcome); data monitoring & A/B tests with dashboards, alerts, A/B results]
  137. 137. Offline Modeling 3/28/18 137 [Diagram components: raw data sources; data fetch + aggregation (config); interactive data analysis; model learning (configs) producing models; model evaluation producing reports]
  138. 138. Existing ML Platform Utilities 3/28/18 138 Managed services (e.g., Google Cloud ML) • Not cost effective for large companies • Need to move data to external clouds Open source packages • Free and flexible • Gaps in functionality Large companies need an internal ML platform to make up for the gaps!
  139. 139. Primary Challenges • Fast error-proof productionization • Scalability vs. flexibility trade-off • Reusability & extensibility of modeling effort • Management of offline modeling experiments • Interactive monitoring of modeling experiments 3/28/18 139
  140. 140. Challenge: Road to Productionization • Long, slow road to delivery for each new application; very little reuse across applications • POC code → production code translation is highly error prone • Rigorous evaluation & debugging of actual production systems is unlikely since these tasks are owned by dev ops folks and data scientists don't understand production code 3/28/18 140 [Diagram: Product Manager (app requirements & metrics) → Data Scientist (PoC modeling in R/Python) → Software Engineers (production code) → Dev Ops]
  141. 141. Solution: Self-contained “Models” Data scientists • build application-specific configurations for data collection & modeling • ship self-contained production “models” (i.e., “model”, configurations, library dependencies), say via Docker (not POC code!) Software engineers • build application-agnostic* production code & systems for automation of data collection, model scoring, re-training, evaluation, etc. 141 [Diagram: Data Scientist → self-contained models → Software Engineers (application-agnostic production code) → Dev Ops] * Need to consider data scale and latency for scoring & retraining, which have some dependency on the application
  142. 142. Solution: Self-contained “Models” • ML packages such as scikit-learn, spark-mllib, and Keras allow for easy serialization of the entire processing pipeline (i.e., preprocessing, feature engineering, scoring) along with the fitted parameters as a single “model” that can be exported to be used for scoring. 3/28/18 142
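A minimal sketch of this export pattern with scikit-learn and joblib; the pipeline steps, file name, and the `X_train`/`y_train`/`X_new` arrays are illustrative assumptions rather than the deck's actual setup.

```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("scale", StandardScaler()),      # preprocessing travels with the model
    ("clf", LogisticRegression()),
])
pipeline.fit(X_train, y_train)        # assumed labeled training split

joblib.dump(pipeline, "fraud_model_v1.joblib")  # the self-contained artifact to ship
model = joblib.load("fraud_model_v1.joblib")    # loaded in the serving environment
scores = model.predict_proba(X_new)[:, 1]       # X_new: unlabeled production instances
```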
  143. 143. Challenge: Scalability vs. Flexibility Trade-off • Scalability requirements vary across applications • Factors to consider – Size of training data – Frequency of retraining – Rate of arrival of prediction instances and latency bounds (in case of online predictions) – Size of batch and frequency of scoring (in case of batch predictions) • Data scientists prefer to train models on single machines where possible 3/28/18 143
  144. 144. Solution: Support Multiple Choices • Moderate scale for training & prediction – train models on a single machine (in Python/R) – export the model as is to multiple machines with the same image and predict in parallel • Moderate scale for training, but large prediction scale – train models on a single machine (e.g., spark-mllib in Python/R) – export the model to a different environment (e.g., Scala/Java) that allows more efficient parallelization • Large scale for both training & prediction – train models and predict on a cluster (e.g., via sparkit-learn, PySpark, or Scala) 3/28/18 144
  145. 145. Challenge: Reusability & Extensibility of Modeling Effort • ML workflows are more than just the “models pipeline” – e.g., data fetch/aggregation from multiple sources, evaluation across multiple models, exploratory data analysis • Offline modeling code (notebooks) tends to get dirty fast – Mix of interactive analysis (specific to application) and processing of data • Common approach to reuse – limited use of libraries + cut & paste code 3/28/18 145
  146. 146. Example Workflow: Data Fetch + Aggregation 3/28/18 146 [Workflow: data sources 1-3 → data reader → data aggregation → data writer → consolidated data file(s); libraries: read, aggregation, and write utilities; data aggregation config: read config, aggregation config, write config]
  147. 147. Example Workflow: Model Learning 3/28/18 147 [Workflow: consolidated data → data splitter → target constructor → feature pipeline setup → model set-up → HP search → predict & eval → model & report; libraries: filters/splitters/samplers, transformers, learning algos, param search, eval metrics; learning config: data split/sampling config, target config, feature config, model config, HP search config, eval config]
  148. 148. Example Workflow: Model Evaluation 3/28/18 148 [Workflow: labeled data → feature pipeline setup → model set-up → predict → eval → eval reports; libraries: transformers, learning algos, eval metrics; evaluation config: pre-trained feature/model config, eval config]
  149. 149. Example Workflow: Model Scoring 3/28/18 149 [Workflow: unlabeled data → feature pipeline setup → model set-up → predict → predictions; libraries: transformers, learning algos; prediction config: pre-trained feature/model config]
  150. 150. Solution: Workflow Abstractions • Each workflow is represented as a DAG over nodes – DAGs can be encoded as YAML or JSON files • Each node is a computational unit with the following elements – name – environment of execution (e.g., python/scala) – actual function to be executed (via a link to an existing module, class, method) – inputs (with default choices) and outputs – tags to aid discovery 3/28/18 150
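To make the node/DAG idea concrete, here is a purely illustrative sketch in Python; all field values and the module path are hypothetical, since the deck does not specify the platform's actual schema.

```python
# Hypothetical encoding of one workflow node and a tiny DAG over node names
train_node = {
    "name": "model_learning",
    "environment": "python",
    "function": "ml_lib.training.fit_model",   # assumed link to an existing module/method
    "inputs": {"data": "consolidated_data", "config": "learning_config"},
    "outputs": ["model", "report"],
    "tags": ["training", "supervised"],
}

workflow_dag = {
    "name": "fraud_model_workflow",
    "nodes": [train_node],
    "edges": [("data_fetch_aggregation", "model_learning")],  # DAG edges between nodes
}
```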
  151. 151. Solution: Workflow Abstractions • Wrapper libraries allow hooks to existing ML packages (sklearn, keras, etc) via nodes • Properly indexed repositories of workflow DAGs, nodes and node-configurations allow discovery and reuse • Editing tools for composing DAGs enable extensibility 3/28/18 151
  152. 152. ML Workflow-centric Architecture 152 Components: ML workflow library, orchestrator, deployment engine, physical computing resources, and contribution/edit/discover tools. The orchestrator assembles the composition and manages the deployment with the help of the deployment engine; physical computing resources provide the execution environment. • Discover and Deploy: search the library for workflows meeting certain criteria and deploy them • Edit & Experiment: take an existing ML workflow, create a new one by making some edits (mostly data and config parameters), experiment with it, and publish it • Create & Contribute: create entirely new library functions, nodes, and possibly workflow DAGs, and add them to the repository
  153. 153. Challenge: Management of Experiments • Manual tracking of experimental results requires considerable effort and is error-prone • Low reproducibility and auditability of offline modeling experiments 3/28/18 153
  154. 154. Solution: Automated Repositories of ML Entities • Run: execution of a workflow – consumes datasets and configurations as inputs and generates models, reports and new datasets as outputs – organizes all the inputs/outputs and intermediate results in an appropriate directory structure • Automatically updated versioned repositories – workflow DAGs, nodes, configs – runs, datasets, models, reports • Post each run, the repositories are automatically updated with the appropriate linkages between the different entities 3/28/18 154
  155. 155. Challenge: Interactive Monitoring of Experiments • Interactive execution of experiments → messy code 3/28/18 155
  156. 156. Solution: Read-only monitoring • Additional layer that allows workflow DAGs to be executed one step at a time and outputs to be examined from an interactive tool (e.g., Jupyter notebooks) – run_node(), load_input(), load_output() • Cloning of intermediate inputs & outputs on demand so that these can be analyzed without affecting the original run – Changes to the actual run have to be explicitly made via workflow DAGs, configs 3/28/18 156
  157. 157. Key Tenets for Real-world ML Applications 3/28/18 157
  158. 158. Key Tenets for Real-world ML applications Design phase: • Work backwards from the application use case – ML problem formulation & evaluation metrics aligned with business goals – Software stack/ML libraries based on scalability/latency/retraining needs • Keep the ML problem formulation simple (but ensure validity) – Understand assumptions/limitations of ML methods & apply them with care – Should enable ease of development, testing, and maintenance
  159. 159. Key Tenets for Real-world ML applications Modeling phase: • Ensure data is of high quality – Fix missing values, outliers, target leakages • Narrow down modeling options based on data characteristics – Learn about the relative effectiveness of various preprocessing, feature engineering, and learning algorithms for different types of data. • Be smart on the trade-off between feature engg. effort & model complexity – Sweet spot depends on the problem complexity, available domain knowledge, and computational requirements; • Ensure offline evaluation is a good “proxy” for the “real unseen” data evaluation – Generate train/test splits similar to how it would be during deployment
  160. 160. Key Tenets for Real-world ML applications Deployment phase: • Establish train vs. production parity – Checks on every possible component that could change • Establish improvement in business metrics before scaling up – A/B testing over random buckets of instances • Trust the models, but always audit – Insert safe-guards (automated monitoring) and manual audits • View model building as a continuous process not a one-time effort – Retrain periodically to handle data drifts & design for this need Don’t adopt Machine Learning because of the hype !
  161. 161. Thank You ! Happy Modeling ! Contact: srujana@gmail.com
  162. 162. Useful References 3/29/18 162 • Google AI Course: https://ai.google/education/#?modal_active=none • Rules of Machine Learning: Best Practices for ML Engineering: http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf • What's your ML Test Score? A rubric for ML production systems: https://www.eecs.tufts.edu/~dsculley/papers/ml_test_score.pdf • Practical advice for analysis of large, complex data sets: http://www.unofficialgoogledatascience.com/2016/10/practical-advice-for-analysis-of-large.html
