
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personalization

Flink Forward 2015

  1. Using Flink with MongoDB to enhance relevancy in personalization
     “How to use Flink with MongoDB?”
     Marc Schwering, Sr. Solution Architect – EMEA
     marc@mongodb.com, @m4rcsch
  2. Agenda For This Session
     • Personalization Process Review
     • The Life of an Application
     • Separation of Concerns / Real World Architecture
     • Apache Spark and Flink Data Processing Projects
     • Clustering with Apache Flink
     • Next Steps
  3. High Level Personalization Process
     1. Profile created
     2. Enrich with public data
     3. Capture activity
     4. Clustering analysis (batch analytics)
     5. Define personas
     6. Tag with personas
     7. Personalize interactions
     Common technologies for the batch analytics: R, Hadoop, Spark, Python, Java, and many other options.
     Personas change much less often than tags do.
  4. Evolution of a Profile (1)
     {
       "_id" : ObjectId("553ea57b588ac9ef066428e1"),
       "ipAddress" : "216.58.219.238",
       "referrer" : "kay.com",
       "firstName" : "John",
       "lastName" : "Doe",
       "email" : "johndoe@gmail.com"
     }
  5. Evolution of a Profile (n+1)
     {
       "_id" : ObjectId("553e7dca588ac9ef066428e0"),
       "firstName" : "John",
       "lastName" : "Doe",
       "address" : "229 W. 43rd St.",
       "city" : "New York",
       "state" : "NY",
       "zipCode" : "10036",
       "age" : 30,
       "email" : "john.doe@mongodb.com",
       "twitterHandle" : "johndoe",
       "gender" : "male",
       "interests" : [ "electronics", "basketball", "weightlifting", "ultimate frisbee", "traveling", "technology" ],
       "visitedCounts" : { "watches" : 3, "shirts" : 1, "sunglasses" : 1, "bags" : 2 },
       "purchases" : [
         { "id" : 1, "desc" : "Power Oxford Dress Shoe", "category" : "Mens shoes" },
         { "id" : 2, "desc" : "Striped Sportshirt", "category" : "Mens shirts" }
       ],
       "persona" : "shoe-fanatic"
     }
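     The step from (1) to (n+1) happens one interaction at a time: the application keeps updating the same profile document. A minimal sketch of such an enrichment step with the MongoDB Java driver (the database name "demo" and the new purchase are illustrative, not from the deck):

        import com.mongodb.MongoClient;
        import com.mongodb.client.MongoCollection;
        import org.bson.Document;

        public class ProfileEnrichment {
            public static void main(String[] args) {
                try (MongoClient client = new MongoClient("localhost", 27017)) {
                    MongoCollection<Document> profiles =
                        client.getDatabase("demo").getCollection("profiles");

                    // count a product view and record a purchase on the existing profile
                    profiles.updateOne(
                        new Document("email", "johndoe@gmail.com"),
                        new Document("$inc", new Document("visitedCounts.watches", 1))
                            .append("$push", new Document("purchases",
                                new Document("id", 3)
                                    .append("desc", "Chronograph Watch")
                                    .append("category", "Mens watches"))));
                }
            }
        }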
  6. One size/document fits all?
     • Profile Data
       – Preferences
       – Personal information (contact information; DOB, gender, ZIP, ...)
     • Customer Data
       – Purchase History
       – Marketing History
     • "Session Data"
       – View History
       – Shopping Cart Data
       – Information Broker Data
     • Personalisation Data
       – Persona Vectors
       – Product and Category recommendations
     [Diagram: one document serving both the application and batch analytics]
  7. Separation of Concerns
     • Profile Data
       – Preferences
       – Personal information (contact information; DOB, gender, ZIP, ...)
     • Customer Data
       – Purchase History
       – Marketing History
     • "Session Data"
       – View History
       – Shopping Cart Data
       – Information Broker Data
     • Personalisation Data
       – Persona Vectors
       – Product and Category recommendations
     [Diagram: the same data, now split across a frontend-system layer of Profile, Customer, Session, and Persona services, plus a batch analytics layer]
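     One way to realize this split is one collection per service, linked by a shared profile id. A hedged sketch with the MongoDB Java driver (database, collection, and field names are illustrative):

        import java.util.Arrays;
        import com.mongodb.MongoClient;
        import com.mongodb.client.MongoDatabase;
        import org.bson.Document;
        import org.bson.types.ObjectId;

        public class ServiceCollections {
            public static void main(String[] args) {
                try (MongoClient client = new MongoClient("localhost", 27017)) {
                    MongoDatabase db = client.getDatabase("demo");
                    ObjectId profileId = new ObjectId();

                    // each service owns its own collection; documents share the profile id
                    db.getCollection("profiles").insertOne(new Document("_id", profileId)
                        .append("firstName", "John").append("lastName", "Doe"));
                    db.getCollection("sessions").insertOne(new Document("profileId", profileId)
                        .append("viewHistory", Arrays.asList("watches", "bags")));
                    db.getCollection("personas").insertOne(new Document("profileId", profileId)
                        .append("persona", "shoe-fanatic"));
                }
            }
        }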
  8. Benefits
     • Code does less; documents and code stay focused
     • Ability to split work
       – Different teams
       – New languages
       – Defined dependencies
  9. Advice for Developers (1)
     • Code does less; documents and code stay focused
     • Ability to split work
       – Different teams
       – New languages
       – Defined dependencies
     KISS => Keep it simple and safe! => Clean Code
     • Robert C. Martin: https://cleancoders.com/
     • M. Fowler / B. Meyer et al.: Command Query Separation
  10. Analytics and Personalization: From Query to Clustering
  11. Separation of Concerns (recap of slide 7: the same data split across Profile, Customer, Session, and Persona services)
  12. Separation of Concerns (the same slide, repeated)
  13. Architecture revised
      [Diagram: the frontend-system talks to the Profile, Customer, Session, and Persona services, which sit in front of the backend systems and the data processing layer]
  14. Advice for Developers (2)
      • OWN YOUR DATA! (but only the data that is relevant to your service)
      • Say no! (to handing out direct data access, i.e. letting others query your DB)
  15. Data Processing
  16. Hadoop in a Nutshell
      • An open-source framework for distributed storage and distributed, batch-oriented processing
      • Hadoop Distributed File System (HDFS) to store data on commodity hardware
      • YARN as resource management platform
      • MapReduce as programming model working on top of HDFS (see the sketch below)
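      Not from the deck, but for context: the canonical word count shows the shape of that programming model, a mapper emitting (word, 1) pairs and a reducer summing them per word.

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {

            // map phase: one (word, 1) pair per token in the input line
            public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
                private static final IntWritable ONE = new IntWritable(1);
                @Override
                protected void map(LongWritable offset, Text line, Context ctx)
                        throws java.io.IOException, InterruptedException {
                    for (String word : line.toString().split("\\s+")) {
                        if (!word.isEmpty()) {
                            ctx.write(new Text(word), ONE);
                        }
                    }
                }
            }

            // reduce phase: sum the counts emitted for each word
            public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
                @Override
                protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                        throws java.io.IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable c : counts) sum += c.get();
                    ctx.write(word, new IntWritable(sum));
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance();
                job.setJarByClass(WordCount.class);
                job.setMapperClass(TokenMapper.class);
                job.setReducerClass(SumReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
                FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }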
  17. Spark in a Nutshell
      • Spark is a top-level Apache project
      • Can be run on top of YARN and can read any Hadoop API data, including HDFS or MongoDB
      • Fast and general engine for large-scale data processing and analytics
      • Advanced DAG execution engine with support for data locality and in-memory computing (see the sketch below)
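      Again as context rather than deck content, a minimal Spark job in Java: transformations declare the DAG, and the action at the end triggers execution.

        import java.util.Arrays;
        import org.apache.spark.SparkConf;
        import org.apache.spark.api.java.JavaRDD;
        import org.apache.spark.api.java.JavaSparkContext;

        public class SparkSketch {
            public static void main(String[] args) {
                SparkConf conf = new SparkConf().setAppName("sketch").setMaster("local[*]");
                JavaSparkContext sc = new JavaSparkContext(conf);

                JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

                // filter() only declares the transformation; count() is the action
                // that triggers execution, with data held in memory where possible
                long evens = nums.filter(n -> n % 2 == 0).count();
                System.out.println("evens = " + evens);

                sc.stop();
            }
        }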
  18. Flink in a Nutshell
      • Flink is a top-level Apache project
      • Can be run on top of YARN and can read any Hadoop API data, including HDFS or MongoDB
      • A distributed streaming dataflow engine (see the minimal job below)
      • Streaming and batch
      • Iterative in-memory execution and handling
      • Cost-based optimizer
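      A minimal batch job in the Java DataSet API of that era (data and names are illustrative); the whole pipeline is planned by the optimizer and executed as one dataflow:

        import org.apache.flink.api.common.functions.FilterFunction;
        import org.apache.flink.api.common.functions.MapFunction;
        import org.apache.flink.api.java.DataSet;
        import org.apache.flink.api.java.ExecutionEnvironment;

        public class FlinkHello {
            public static void main(String[] args) throws Exception {
                final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

                DataSet<String> categories =
                    env.fromElements("watches", "shirts", "sunglasses", "bags");

                DataSet<String> result = categories
                    .filter(new FilterFunction<String>() {
                        public boolean filter(String s) { return s.startsWith("s"); }
                    })
                    .map(new MapFunction<String, String>() {
                        public String map(String s) { return s.toUpperCase(); }
                    });

                result.print(); // print() triggers execution of the dataflow
            }
        }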
  19. Latency of query operations
      [Chart: response time per operation type over a time axis, from queries and aggregation (MongoDB) through MapReduce (Hadoop) to cluster algorithms (Spark/Flink)]
  20. Iterative Algorithms / Clustering
  21. K-Means in Pictures
      • Source: Wikipedia, K-Means
  22. K-Means as a Process
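      In code terms, each pass of the process alternates two steps: assign every point to its nearest centroid, then recompute each centroid as the mean of its assigned points. A plain-Java sketch of one such pass over 2-D points (a standalone illustration, not the Flink job from the talk):

        import java.util.Arrays;

        public class KMeansStep {

            static double[][] step(double[][] points, double[][] centroids) {
                int k = centroids.length;
                double[][] sums = new double[k][2];
                int[] counts = new int[k];
                // assignment step: nearest centroid by squared Euclidean distance
                for (double[] p : points) {
                    int best = 0;
                    double bestDist = Double.MAX_VALUE;
                    for (int c = 0; c < k; c++) {
                        double dx = p[0] - centroids[c][0], dy = p[1] - centroids[c][1];
                        double dist = dx * dx + dy * dy;
                        if (dist < bestDist) { bestDist = dist; best = c; }
                    }
                    sums[best][0] += p[0];
                    sums[best][1] += p[1];
                    counts[best]++;
                }
                // update step: mean of assigned points (keep empty centroids in place)
                double[][] next = new double[k][2];
                for (int c = 0; c < k; c++) {
                    next[c][0] = counts[c] > 0 ? sums[c][0] / counts[c] : centroids[c][0];
                    next[c][1] = counts[c] > 0 ? sums[c][1] / counts[c] : centroids[c][1];
                }
                return next;
            }

            public static void main(String[] args) {
                double[][] points = { {1, 1}, {1.5, 2}, {8, 8}, {9, 9} };
                double[][] centroids = { {0, 0}, {10, 10} };
                for (int i = 0; i < 10; i++) { // repeat until convergence in practice
                    centroids = step(points, centroids);
                }
                System.out.println(Arrays.deepToString(centroids));
            }
        }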
  23. Iterations in Hadoop and Spark
  24. Iterations in Flink
      • Dedicated iteration operators (see the skeleton below)
      • Tasks keep running for the iterations and are not redeployed for each step
      • Caching and optimizations done automatically
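      In the DataSet API this surfaces as IterativeDataSet; the bulk-iteration skeleton from the Flink documentation looks roughly like this (the loop body here just increments a number for ten supersteps):

        import org.apache.flink.api.common.functions.MapFunction;
        import org.apache.flink.api.java.DataSet;
        import org.apache.flink.api.java.ExecutionEnvironment;
        import org.apache.flink.api.java.operators.IterativeDataSet;

        public class BulkIterationSketch {
            public static void main(String[] args) throws Exception {
                final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

                // start an iteration with at most 10 supersteps; the operator tasks
                // stay deployed across supersteps instead of being rescheduled
                IterativeDataSet<Integer> initial = env.fromElements(0).iterate(10);

                DataSet<Integer> body = initial.map(new MapFunction<Integer, Integer>() {
                    public Integer map(Integer i) { return i + 1; }
                });

                // closeWith() feeds the step result back in as the next superstep's input
                DataSet<Integer> result = initial.closeWith(body);
                result.print();
            }
        }

      A K-Means job replaces the trivial body above with the assign/recompute pass from slide 22, broadcasting the current centroids to the point set in each superstep.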
  25. Example
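      The deck shows this example only as screenshots; the companion repository linked on slide 29 (https://github.com/m4rcsch/flink-mongodb-example) wires Flink to MongoDB through the mongo-hadoop connector. A read-side sketch along those lines (the URI and field names are illustrative):

        import org.apache.flink.api.common.functions.MapFunction;
        import org.apache.flink.api.java.DataSet;
        import org.apache.flink.api.java.ExecutionEnvironment;
        import org.apache.flink.api.java.hadoop.mapred.HadoopInputFormat;
        import org.apache.flink.api.java.tuple.Tuple2;
        import org.apache.hadoop.mapred.JobConf;
        import com.mongodb.hadoop.io.BSONWritable;
        import com.mongodb.hadoop.mapred.MongoInputFormat;

        public class FlinkMongoRead {
            public static void main(String[] args) throws Exception {
                final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

                // wrap the mongo-hadoop input format so Flink can read from MongoDB
                HadoopInputFormat<BSONWritable, BSONWritable> hdIf =
                    new HadoopInputFormat<BSONWritable, BSONWritable>(
                        new MongoInputFormat(), BSONWritable.class, BSONWritable.class, new JobConf());
                hdIf.getJobConf().set("mongo.input.uri",
                    "mongodb://localhost:27017/demo.profiles");

                DataSet<Tuple2<BSONWritable, BSONWritable>> input = env.createInput(hdIf);

                // e.g. project each profile document to a field of interest
                DataSet<String> emails =
                    input.map(new MapFunction<Tuple2<BSONWritable, BSONWritable>, String>() {
                        public String map(Tuple2<BSONWritable, BSONWritable> record) {
                            return String.valueOf(record.f1.getDoc().get("email"));
                        }
                    });

                emails.print();
            }
        }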
  26. Result
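      The result slide is likewise an image. As a hedged sketch under the same assumptions, writing the clustering output (for example persona tags) back through the connector could look like this (database, collection, and values are illustrative):

        import org.apache.flink.api.common.functions.MapFunction;
        import org.apache.flink.api.java.DataSet;
        import org.apache.flink.api.java.ExecutionEnvironment;
        import org.apache.flink.api.java.hadoop.mapred.HadoopOutputFormat;
        import org.apache.flink.api.java.tuple.Tuple2;
        import org.apache.hadoop.mapred.JobConf;
        import com.mongodb.BasicDBObject;
        import com.mongodb.hadoop.io.BSONWritable;
        import com.mongodb.hadoop.mapred.MongoOutputFormat;

        public class FlinkMongoWrite {
            public static void main(String[] args) throws Exception {
                final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

                // stand-in for the clustering output: build one tagged profile document
                DataSet<Tuple2<BSONWritable, BSONWritable>> personas = env
                    .fromElements("johndoe@gmail.com")
                    .map(new MapFunction<String, Tuple2<BSONWritable, BSONWritable>>() {
                        public Tuple2<BSONWritable, BSONWritable> map(String email) {
                            BasicDBObject doc = new BasicDBObject("email", email)
                                .append("persona", "shoe-fanatic");
                            return new Tuple2<BSONWritable, BSONWritable>(
                                new BSONWritable(doc), new BSONWritable(doc));
                        }
                    });

                JobConf conf = new JobConf();
                conf.set("mongo.output.uri", "mongodb://localhost:27017/demo.personas");
                personas.output(new HadoopOutputFormat<BSONWritable, BSONWritable>(
                    new MongoOutputFormat<BSONWritable, BSONWritable>(), conf));

                env.execute("write personas back to MongoDB");
            }
        }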
  27. More…?
  28. Takeaways
      • Evolution is amazing and exciting!
        – Be ready to learn new things, ask questions across silos!
      • Stay focused => start and stay small
        – Evaluate with big documents, but do a PoC focused on the topic
      • Extending functionality can be challenging
        – Evolution is outpacing help channels
        – A lot of options (Spark, Flink, Storm, Hadoop, ...)
        – More than just a binary
      • Extending functionality is easy
        – Aggregation, MapReduce
        – Connectors opening a new variety of use cases
  29. Next Steps
      • Try out Flink
        – http://flink.apache.org/
        – https://github.com/mongodb/mongo-hadoop
        – https://github.com/m4rcsch/flink-mongodb-example
      • Participate and ask questions!
        – @m4rcsch
        – marc@mongodb.com
      • We are hiring!! :)
  30. Thank you!
      Marc Schwering, Sr. Solutions Architect – EMEA
      marc@mongodb.com, @m4rcsch
