BIG2016 - Lessons Learned from building real-life user-focused Big Data systems

Invited talk at the BIG2016 conference - colocated with WWW 2016 in Montreal, Canada

Published in: Technology

  1. Lessons Learned from building real-life user-focused Big Data Systems. Xavier Amatriain (@xamat), www.quora.com/profile/Xavier-Amatriain, 04/12/16
  2. A bit about Quora
  3. Our Mission: “To share and grow the world’s knowledge”
     • Millions of questions & answers
     • Millions of users
     • Thousands of topics
     • ...
  4. What we care about: Demand, Quality, Relevance
  5. Lessons Learned
  6. More Data vs. Better Models
  7. More data or better models? Really? (Anand Rajaraman: VC, Founder, Stanford Professor)
  8. More data or better models? Sometimes, it’s not about more data
  9. More data or better models? Norvig: “Google does not have better Algorithms only more Data” (many features / low-bias models)
  10. More data or better models? Sometimes, it’s not about more data
  11. How useful is Big Data?
     ● “Everybody” has Big Data
       ○ Does everyone need it?
       ○ E.g. How many users do you need to compute a matrix factorization (MF) with 100 factors?
     ● Smart (e.g. stratified) sampling can produce as good (or better) results!
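The stratified sampling mentioned on this slide can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the `segment` field, the 10% fraction, and the function name `stratified_sample` are all assumptions made up for the example.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every stratum so small groups survive."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        # Keep at least one row per stratum, even for tiny groups.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical interaction log: 900 casual users and 100 power users.
rows = [{"user": i, "segment": "casual"} for i in range(900)]
rows += [{"user": i, "segment": "power"} for i in range(900, 1000)]

sample = stratified_sample(rows, key=lambda r: r["segment"], fraction=0.1)
power_in_sample = sum(1 for r in sample if r["segment"] == "power")
```

A uniform random sample of the same size could, by chance, under-represent the smaller segment; sampling per stratum keeps its share fixed.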
  12. Sometimes you do need a (more) Complex Model
  13. Better models and features that “don’t work”
     ● E.g. You have a linear model and have been selecting and optimizing features for that model
       ○ More complex model with the same features -> improvement not likely
       ○ More expressive features with the same model -> improvement not likely
     ● More complex features may require a more complex model
     ● A more complex model may not show improvements with a feature set that is too simple
  14. Model selection is also about Hyperparameter optimization
  15. Hyperparameter optimization
     ● Automate hyperparameter optimization by choosing the right metric.
       ○ But, is it as simple as choosing the max?
     ● Bayesian Optimization (Gaussian Processes) is better than grid search
       ○ See spearmint, hyperopt, AutoML, MOE...
  16. Supervised plus (rather than vs.) Unsupervised Learning
  17. Supervised/Unsupervised Learning
     ● Unsupervised learning as dimensionality reduction
     ● Unsupervised learning as feature engineering
     ● The “magic” behind combining unsupervised/supervised learning
       ○ E.g. 1: clustering + kNN
       ○ E.g. 2: Matrix Factorization
         ■ MF can be interpreted as
           ● Unsupervised:
             ○ Dimensionality Reduction a la PCA
             ○ Clustering (e.g. NMF)
           ● Supervised:
             ○ Labeled targets ~ regression
  18. Supervised/Unsupervised Learning
     ● One of the “tricks” in Deep Learning is how it combines unsupervised/supervised learning
       ○ E.g. Stacked Autoencoders
       ○ E.g. training of convolutional nets
  19. Everything is an ensemble
  20. Ensembles
     ● Netflix Prize was won by an ensemble
       ○ Initially Bellkor was using GBDTs
       ○ BigChaos introduced ANN-based ensemble
     ● Most practical applications of ML run an ensemble
       ○ Why wouldn’t you?
       ○ At least as good as the best of your methods
       ○ Can add completely different approaches (e.g. CF and content-based)
       ○ You can use many different models at the ensemble layer: LR, GBDTs, RFs, ANNs...
  21. Ensembles & Feature Engineering
     ● Ensembles are the way to turn any model into a feature!
     ● E.g. Don’t know if the way to go is to use Factorization Machines, Tensor Factorization, or RNNs?
       ○ Treat each model as a “feature”
       ○ Feed them into an ensemble
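A minimal sketch of "treat each model as a feature": the scores of two base models become the inputs of a least-squares linear blender. The scores and labels below are invented for illustration, and a linear blend is only one of the ensemble layers the slides mention (LR, GBDTs, RFs, ANNs).

```python
def fit_linear_blend(preds_a, preds_b, targets):
    """Least-squares weights (w_a, w_b) for blending two models' predictions."""
    saa = sum(a * a for a in preds_a)
    sbb = sum(b * b for b in preds_b)
    sab = sum(a * b for a, b in zip(preds_a, preds_b))
    sat = sum(a * t for a, t in zip(preds_a, targets))
    sbt = sum(b * t for b, t in zip(preds_b, targets))
    # Solve the 2x2 normal equations with Cramer's rule.
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("base model predictions are collinear")
    w_a = (sat * sbb - sbt * sab) / det
    w_b = (sbt * saa - sat * sab) / det
    return w_a, w_b

def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

# Hypothetical scores from two different base models on the same examples.
targets = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
preds_a = [0.9, 0.4, 0.6, 0.7, 0.2, 0.5]   # e.g. a collaborative-filtering model
preds_b = [0.6, 0.1, 0.9, 0.5, 0.4, 0.2]   # e.g. a content-based model

w_a, w_b = fit_linear_blend(preds_a, preds_b, targets)
blend = [w_a * a + w_b * b for a, b in zip(preds_a, preds_b)]
```

On the data the blender is fit on, the blend cannot have a higher squared error than either base model, because using one model alone is just a special choice of weights. In practice the blender would be fit on held-out predictions so that this advantage is not an artifact of overfitting.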
  22. The Master Algorithm? It definitely is the ensemble!
  23. The Lost Art of Feature Engineering
  24. Feature Engineering
     ● Main properties of a well-behaved ML feature
       ○ Reusable
       ○ Transformable
       ○ Interpretable
       ○ Reliable
     ● Reusability: You should be able to reuse features in different models, applications, and teams
     ● Transformability: Besides directly reusing a feature, it should be easy to use a transformation of it (e.g. $\log(f)$, $\max(f)$, $\sum_t f_t$ over a time window…)
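The transformations named on this slide can be sketched as small reusable helpers. The event format (`(timestamp, value)` pairs) and the function names are assumptions for illustration; `log1p` is used in place of a plain log so that a zero count stays defined.

```python
import math

def log_transform(value):
    """log(1 + f): compresses heavy-tailed counts and is defined at f = 0."""
    return math.log1p(value)

def window_sum(events, now, window):
    """Sum of f_t over events with now - window < t <= now."""
    return sum(v for t, v in events if now - window < t <= now)

def window_max(events, now, window):
    """Largest f_t in the same window, or 0.0 if the window is empty."""
    values = [v for t, v in events if now - window < t <= now]
    return max(values) if values else 0.0

# Hypothetical raw feature: (timestamp, value) pairs, e.g. daily upvote counts.
events = [(1, 3.0), (5, 1.0), (9, 4.0), (10, 2.0)]

recent_sum = window_sum(events, now=10, window=5)   # uses t = 9 and t = 10
recent_max = window_max(events, now=10, window=5)
log_sum = log_transform(recent_sum)
```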
  25. Feature Engineering
     ● Main properties of a well-behaved ML feature
       ○ Reusable
       ○ Transformable
       ○ Interpretable
       ○ Reliable
     ● Interpretability: In order to do any of the previous, you need to be able to understand the meaning of features and interpret their values.
     ● Reliability: It should be easy to monitor and detect bugs/issues in features
  26. Feature Engineering Example - Quora Answer Ranking. What is a good Quora answer?
     • truthful
     • reusable
     • provides explanation
     • well formatted
     • ...
  27. Feature Engineering Example - Quora Answer Ranking. How are those dimensions translated into features?
     • Features that relate to the answer quality itself
     • Interaction features (upvotes/downvotes, clicks, comments…)
     • User features (e.g. expertise in topic)
  28. Implicit signals beat explicit ones (almost always)
  29. Implicit vs. Explicit
     ● Many have acknowledged that implicit feedback is more useful
     ● Is implicit feedback really always more useful?
     ● If so, why?
  30. Implicit vs. Explicit
     ● Implicit data is (usually):
       ○ More dense, and available for all users
       ○ Better representative of user behavior vs. user reflection
       ○ More related to final objective function
       ○ Better correlated with AB test results
     ● E.g. Rating vs. watching
  31. Implicit vs. Explicit
     ● However
       ○ It is not always the case that direct implicit feedback correlates well with long-term retention
       ○ E.g. clickbait
     ● Solution:
       ○ Combine different forms of implicit + explicit to better represent long-term goal
  32. Be thoughtful about your Training Data
  33. Defining training/testing data
     ● Training a simple binary classifier for good/bad answer
       ○ Defining positive and negative labels -> Non-trivial task
       ○ Is this a positive or a negative?
         ● funny uninformative answer with many upvotes
         ● short uninformative answer by a well-known expert in the field
         ● very long informative answer that nobody reads/upvotes
         ● informative answer with grammar/spelling mistakes
         ● ...
  34. Other training data issues: Time traveling
     ● Time traveling: usage of features that originated after the event you are trying to predict
       ○ E.g. Your upvoting an answer is a pretty good prediction of you reading that answer, especially because most upvotes happen AFTER you read the answer
       ○ Tricky when you have many related features
       ○ Whenever I see an offline experiment with huge wins, I ask: “Is there time traveling?”
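One way to guard against time traveling is to build every training example from a snapshot of features as they were before the event being predicted. The sketch below assumes a hypothetical log of `(timestamp, feature_name, value)` tuples; the names and numbers are illustrative only.

```python
def features_as_of(feature_events, event_time):
    """Latest value per feature observed strictly before event_time."""
    snapshot = {}
    for ts, name, value in sorted(feature_events, key=lambda e: e[0]):
        if ts < event_time:
            snapshot[name] = value   # later (but still earlier-than-event) values win
    return snapshot

# Hypothetical log; the label is "did the user read the answer at t = 100?"
feature_events = [
    (50, "answer_views", 12),
    (90, "answer_views", 20),
    (130, "user_upvoted", 1),   # happens AFTER the read: must not leak in
]

snapshot = features_as_of(feature_events, event_time=100)
```

Here the upvote is excluded because it was logged after the read it would otherwise "predict".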
  35. Your Model will learn what you teach it to learn
  36. Training a model
     ● Model will learn according to:
       ○ Training data (e.g. implicit and explicit)
       ○ Target function (e.g. probability of user reading an answer)
       ○ Metric (e.g. precision vs. recall)
     ● Example 1 (made up):
       ○ Optimize probability of a user going to the cinema to watch a movie and rate it “highly” by using purchase history and previous ratings. Use NDCG of the ranking as final metric using only movies rated 4 or higher as positives.
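The metric in Example 1 can be sketched as a binary-relevance NDCG in which a movie counts as positive when its rating is 4 or higher. The log2 discount and the cutoff `k` are common conventions assumed here, not details given on the slide.

```python
import math

def ndcg_at_k(ranked_ratings, k, positive_threshold=4):
    """NDCG@k with binary relevance: rating >= threshold counts as a positive."""
    rels = [1 if r >= positive_threshold else 0 for r in ranked_ratings]
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = sorted(rels, reverse=True)       # best possible ordering
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Ratings listed in the order the model ranked the movies (hypothetical data).
perfect = ndcg_at_k([5, 4, 2, 1], k=4)    # both positives ranked first
shuffled = ndcg_at_k([2, 5, 1, 4], k=4)   # positives pushed down the list
no_positives = ndcg_at_k([1, 2, 3], k=3)
```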
  37. Example 2 - Quora’s feed
     ● Training data = implicit + explicit
     ● Target function: Value of showing a story to a user ~ weighted sum of actions: $v = \sum_a v_a \, \mathbb{1}\{y_a = 1\}$
       ○ predict probabilities for each action, then compute expected value: $v_{pred} = E[V \mid x] = \sum_a v_a \, p(a \mid x)$
     ● Metric: any ranking metric
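The expected-value formula on this slide can be sketched directly. The action names, the per-action values, and the predicted probabilities below are made-up numbers; real values would come from the per-action models and from product decisions.

```python
def expected_value(action_values, action_probs):
    """v_pred = sum over actions a of v_a * p(a | x)."""
    return sum(v * action_probs.get(a, 0.0) for a, v in action_values.items())

# Hypothetical per-action values v_a (a downvote is worth a negative amount).
action_values = {"click": 1.0, "upvote": 4.0, "downvote": -2.0}

# Hypothetical predicted probabilities p(a | x) for two candidate stories.
story_1 = {"click": 0.5, "upvote": 0.25, "downvote": 0.125}
story_2 = {"click": 0.75, "upvote": 0.0, "downvote": 0.25}

v1 = expected_value(action_values, story_1)
v2 = expected_value(action_values, story_2)
ranking = sorted(["story_1", "story_2"],
                 key=lambda s: {"story_1": v1, "story_2": v2}[s],
                 reverse=True)
```

Story 2 has the higher click probability, yet story 1 ranks first because the weighted sum also accounts for upvotes and downvotes.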
  38. Offline testing
     ● Measure model performance, using (IR) metrics
     ● Offline performance = indication to make decisions on follow-up A/B tests
     ● A critical (and mostly unsolved) issue is how offline metrics correlate with A/B test results.
  39. Learn to deal with Presentation Bias
  40. 2D Navigational modeling [figure: positions on the page range from “more likely to see” to “less likely”]
  41. The curse of presentation bias
     ● User can only click on what you decide to show
     ● But, what you decide to show is the result of what your model predicted is good
     ● Simply treating things you show as negatives is not likely to work
     ● Better options
       ○ Correcting for the probability a user will click on a position -> Attention models
       ○ Explore/exploit approaches such as MAB
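One simple way to correct for position, sketched under strong assumptions: divide an item's clicks by the number of times users are expected to have actually looked at it, given a per-position examination probability. The probabilities below are invented; in a real system they would have to be estimated (for example by an attention model), and this is not necessarily the approach used in the talk.

```python
def debiased_ctr(impressions, examine_prob):
    """Clicks divided by expected examinations.

    impressions: list of (position, clicked) pairs for one item.
    examine_prob: position -> assumed probability the user looks at that slot.
    """
    clicks = sum(1 for _, clicked in impressions if clicked)
    expected_looks = sum(examine_prob[pos] for pos, _ in impressions)
    return clicks / expected_looks if expected_looks > 0 else 0.0

def raw_ctr(impressions):
    return sum(1 for _, clicked in impressions if clicked) / len(impressions)

# Assumed examination probabilities: top slot always seen, lower slots less so.
examine_prob = {0: 1.0, 1: 0.5, 2: 0.25}

item_a = [(0, True), (0, False), (0, True), (0, False)]    # always shown on top
item_b = [(2, True), (2, False), (2, False), (2, False)]   # always shown third

a_raw, b_raw = raw_ctr(item_a), raw_ctr(item_b)
a_fix, b_fix = debiased_ctr(item_a, examine_prob), debiased_ctr(item_b, examine_prob)
```

By raw click-through rate item A looks better, but once the low visibility of the third slot is accounted for, item B comes out ahead.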
  42. You don’t need to distribute your ML algorithm
  43. Distributing ML
     ● Most of what people do in practice can fit into a multi-core machine
       ○ Smart data sampling
       ○ Offline schemes
       ○ Efficient parallel code
     ● Dangers of “easy” distributed approaches such as Hadoop/Spark
     ● Do you care about costs? How about latencies?
  44. Distributing ML
     ● Example of optimizing computations to fit them into one machine
       ○ Spark implementation: 6 hours, 15 machines
       ○ Developer time: 4 days
       ○ C++ implementation: 10 minutes, 1 machine
     ● Most practical applications of Big Data can fit into a (multicore) implementation
  45. Conclusions
  46. Conclusions
     ● In data, size is not all that matters
     ● Understand dependencies between data, models & systems
     ● Choose the right metric & optimize what matters
     ● Be thoughtful about
       ○ Your ML infrastructure/tools
       ○ Interaction between data and UX
  47. Questions?
