Keira Zhou - Batch and Streaming Processing in the World of Data Engineering and Data Science

PyData Seattle 2017

Published in: Technology

  1. Keira Zhou, Senior Data Engineer @ Capital One
  2. • Data Engineer @ Capital One
     • Previous:
       • Fellow @ Insight Data Engineering
       • BS + MS in Systems Engineering @ UVA
  3. [no transcript text]
  4. Task: Predict phishing websites in near real time
  5. • Public dataset: Phishing Websites Dataset
     • Some examples:
       • Using URL shortening services (“TinyURL”): bit.ly/19DXSk4
       • Age of domain: the minimum age of a legitimate domain is 6 months
       • Adding a prefix or suffix separated by (-) to the domain: http://www.confirme-paypal.com/
  6. • Classification problem
     • Logistic regression with stochastic gradient descent
  7. [no transcript text]
  8. • Offline
       • Retrain the model with historical data + new data
       • Model fits the global distribution of the data better
       • Can be impractical for large data sets
     • Online
       • Use each new observation to further train your model
       • Model is more influenced by recent data
       • Adapts to new trends faster
     • Batch / mini-batch
       • Wait for a batch of observations to further train your model
  9. Near live
  10. [no transcript text]
  11. • Spark 2.0 feature: save model to file
      • Load model file
  12. Online training
  13. • Online algorithms, pros:
        • Computationally much faster
        • Useful when the dataset is too big
        • Adapt to new trends faster
      • Online algorithms, cons:
        • The majority of algorithms only work in batch
        • Some feature extractions are slow
          • IP geo lookup
        • Hard to always get it right in an automatic way
  14. • Building and maintaining a streaming pipeline can be challenging
      • But why?
  15. • Message buffer
      • Keeps logs of messages
  16. • Unordered
      • Duplicates
      • Unbounded
  17. • Processing time
      • Event time
  18. Time-agnostic / Event time matters
      • Second Look: generous tips
      • Top merchants in the past hour
      • Second Look: double swipe
      • Deduplication
  19. • At-most-once: potential data loss
        • e.g. video data sent via UDP
      • At-least-once: potential duplication
        • e.g. email alerts
      • Exactly-once: ideal world
        • Requires better configuration of your infrastructure
        • Consistency in light of machine failures
  20. • Challenges from the data stream itself
        • How to handle time depends on the use case
        • Event time: reflects real life but is harder to implement
      • Challenges from your infrastructure
        • Exactly-once delivery is critical for accurate streaming analytical results
        • You would probably want that for your online model
      • Streaming gives you more timely results
        • Not everything needs to be real-time
  21. BATCH OR STREAMING?
      • Model updating
        • Batch: improve accuracy every time you retrain the model
        • Online: adapt to new data points as they come in
      • Latency & correctness
        • Batch: high latency but more control of the data
        • Streaming: low latency but less control
      • Monitoring
      • Maintainability
        • It is easier to maintain one pipeline rather than two
        • Lambda vs. Kappa
  22. • Data Science
        • Find the right features
        • Get labeled data
          • Manual labeling
        • Develop a model
      • Data Engineering
        • Guarantee “exactly once” for the streaming pipeline
        • Tuning Spark
          • # of executors
          • Memory
        • Event & processing time
        • Spark programming
          • Spark connector or Python connector
          • Use MLlib
  23. • GitHub repo for the example: https://github.com/keiraqz/StreamingLogisticRegression
      • Phishing Websites Data Set: https://archive.ics.uci.edu/ml/datasets/Phishing+Websites#
      • Spark Streaming ML algorithm: https://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
      • The World Beyond Batch: Streaming 101: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
      • Lambda Architecture: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
      • Kappa Architecture: https://www.oreilly.com/ideas/questioning-the-lambda-architectur
  24. [no transcript text]
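
Slides 6, 8 and 12 describe logistic regression trained online with stochastic gradient descent, where each new observation nudges the model. The following is a minimal pure-Python sketch of that idea, not the Spark code from the talk's repo. The class name `OnlineLogisticRegression`, the `partial_fit` method, the learning rate, and the synthetic stream (three features coded as -1/1, with the label decided by the first feature) are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


class OnlineLogisticRegression:
    """Hypothetical helper: logistic regression updated one observation at a time."""

    def __init__(self, n_features: int, learning_rate: float = 0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return sigmoid(z)

    def partial_fit(self, x, y):
        # Gradient of the log loss for a single observation is (p - y) * x.
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


def make_stream(n: int, seed: int = 0):
    """Synthetic stand-in for a stream of labeled websites (assumed data)."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.choice([-1.0, 1.0]) for _ in range(3)]
        y = 1 if x[0] > 0 else 0  # assumed rule: only the first feature matters
        yield x, y


model = OnlineLogisticRegression(n_features=3)
for features, label in make_stream(2000):
    model.partial_fit(features, label)  # one SGD step per new observation
```

Because every update uses only the latest observation, the model never needs the full history in memory, which is the trade-off slide 8 lists under "Online": recent data has more influence than it would in an offline retrain.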

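
Slides 16 to 18 note that a stream can be unordered and contain duplicates, and that some questions ("top merchants in the past hour") depend on event time rather than processing time. The sketch below illustrates one way to handle both in plain Python: deduplicate by an event ID, then bucket by the timestamp carried in the event rather than by arrival order. The `Swipe` record, its field names, and the one-hour tumbling window are assumptions for illustration; the talk itself uses Spark.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Swipe(NamedTuple):
    """Hypothetical card-swipe event."""
    event_id: str    # unique ID, used to drop redelivered duplicates
    merchant: str
    event_time: int  # seconds since epoch, when the swipe actually happened


WINDOW_SECONDS = 3600  # assumed one-hour tumbling windows


def top_merchants_by_event_time(events, top_n=1):
    """Count swipes per merchant in event-time windows, ignoring duplicates."""
    seen_ids = set()
    windows = defaultdict(Counter)
    for event in events:  # iteration order is arrival (processing) order
        if event.event_id in seen_ids:
            continue  # duplicate from an at-least-once delivery
        seen_ids.add(event.event_id)
        # Assign the event to a window by when it happened, not when it arrived.
        window_start = event.event_time - event.event_time % WINDOW_SECONDS
        windows[window_start][event.merchant] += 1
    return {
        start: counts.most_common(top_n)
        for start, counts in sorted(windows.items())
    }


arrivals = [
    Swipe("a", "coffee", 10),
    Swipe("b", "coffee", 3700),
    Swipe("c", "books", 20),   # arrives late, belongs to the first window
    Swipe("a", "coffee", 10),  # redelivered duplicate
    Swipe("d", "books", 30),
]
result = top_merchants_by_event_time(arrivals)
```

On a truly unbounded stream the `seen_ids` set and the window table would grow forever, so a real pipeline would also need a rule for when a window is considered complete and its state can be dropped.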