
Kafka at trivago



Short presentation at the Kafka meetup in Munich:
https://www.meetup.com/Apache-Kafka-Germany-Munich/events/236402498/



  1. Apache Kafka at trivago. 2017-01-25, Munich, Germany. Clemens Valiente
  2. Clemens Valiente, Senior Data Engineer, trivago Düsseldorf. Email: clemens.valiente@trivago.com, de.linkedin.com/in/clemensvaliente. Originally a mathematician, studied at Uni Erlangen, at trivago for 5 years.
  3. Collecting price information for BI: As a hotel price comparison engine, our most valuable information is hotel prices. Prices are not only shown to our visitors to support their hotel booking decision, but also stored and later analyzed by Business Intelligence. With over one million hotels and all major booking websites connected to our system, we have one of the most complete sources of information on hotel price development and trends.
  4. The past: Data pipeline 2010 – 2015
  5.–7. The past: Data pipeline 2010 – 2015 (diagram build slides: Java Software Engineering → BI Warehouse)
  8.–10. The past: Data pipeline 2010 – 2015. Facts & Figures
     Price dimensions
     - Around one million hotels
     - 250 booking websites
     - Travellers search for up to 180 days in advance
     - Data collected over five years
     Restrictions
     - Only single-night stays
     - Only prices from European visitors
     - Prices cached for up to 30 minutes
     - One price per hotel, website and arrival date per day
     - "Insert ignore": the first price per key wins
     Size of data
     - We collected a total of 56 billion prices in those five years
     - Towards the end of this pipeline, in early 2015, on average around 100 million prices per day were written to BI
  11.–15. The past: Data pipeline 2010 – 2015 (diagram build slides: Java Software Engineering → BI Warehouse)
  16. Refactoring the pipeline: Requirements
     • Scales with an arbitrary amount of data (future proof)
     • Reliable and resilient
     • Low performance impact on the Java backend
     • Long-term storage of raw input data
     • Fast processing of filtered and aggregated data
     • Open source
     • We want to log everything (see the schema sketch below):
       • More prices: length of stay, room type, breakfast info, room category, domain
       • With more information: net & gross price, city tax, resort fee, affiliate fee, VAT
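To make the "log everything" requirement concrete, here is a minimal sketch of what such a richer price record could look like as an Avro schema, built with Avro's SchemaBuilder. The record and field names are illustrative assumptions based on the dimensions listed above, not trivago's actual schema.

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    // Hypothetical sketch of a richer price record covering the dimensions
    // named on the slide. All names are illustrative, not trivago's schema.
    public class PriceRecordSchemaSketch {
        public static final Schema HOTEL_PRICE = SchemaBuilder.record("HotelPrice")
            .namespace("com.example.prices")       // assumed namespace
            .fields()
            .requiredLong("hotelId")
            .requiredString("bookingSite")
            .requiredString("arrivalDate")         // e.g. an ISO-8601 date string
            .requiredInt("lengthOfStay")
            .requiredString("roomType")
            .requiredString("roomCategory")
            .requiredBoolean("breakfastIncluded")
            .requiredString("domain")              // visitor platform / locale
            .requiredLong("netPriceCents")
            .requiredLong("grossPriceCents")
            .optionalLong("cityTaxCents")
            .optionalLong("resortFeeCents")
            .optionalLong("affiliateFeeCents")
            .optionalInt("vatPercent")
            .endRecord();
    }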
  17.–19. Present data pipeline 2017 – ingestion (data centres: Düsseldorf, San Francisco, Hong Kong)
  20. Present data pipeline 2017 – processing (Camus)
  21. Present data pipeline 2017 – results after two years in production
     • Very reliable, barely any downtime or service interruptions of the system
     • Java team is very happy: less load on their system
     • BI team is very happy: more data, more resources to process it
     • Stakeholders are very happy:
       • Faster results
       • Better quality of results due to more data
       • More detailed results
       • => Shorter research phase, more and better stories
       • => Fewer requests & less workload for BI
  22.–24. Present data pipeline 2017 – facts & figures
     Kafka cluster specifications
     - Cluster of 5 machines in each data centre for logs
     - An additional cluster of two machines in Düsseldorf for aggregation/stream processing
     Data size (price log)
     - Over 4 trillion messages collected so far
     - 10 billion messages/day
     - Over a hundred topics
     Camus
     - MapReduce application that writes prices to HDFS
     - 15 mappers running in parallel
     - Runs pretty much continuously, in 10-minute intervals
     - To be replaced by Gobblin/Kafka Connect (a config sketch follows below)
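The slides note that Camus is to be replaced by Gobblin or Kafka Connect. As a rough illustration only, and not trivago's actual setup, a Kafka Connect HDFS sink that mimics the roughly 10-minute batch cadence could be configured along these lines, assuming the Confluent HDFS sink connector; the topic name, HDFS URL and sizing values are assumptions.

    # Hypothetical Kafka Connect sink config, sketching a Camus replacement.
    # Topic name, HDFS URL and sizing values are assumptions, not trivago's.
    name=price-log-hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    # mirrors the 15 parallel Camus mappers mentioned above
    tasks.max=15
    topics=price_log
    hdfs.url=hdfs://namenode:8020
    format.class=io.confluent.connect.hdfs.avro.AvroFormat
    flush.size=1000000
    # commit files roughly every 10 minutes, like the Camus interval
    rotate.interval.ms=600000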
  25.–27. Present data pipeline 2017 – use cases & status quo
     Uses for price information
     - Monitoring price parity in the hotel market
     - Anomaly and fraud detection
     - Price feed for online marketing
     - Display of price development and delivering price alerts to website visitors
     Other data sources and usage
     - Clicklog information from our website and mobile app
     - Used for marketing performance analysis, product tests, invoice generation, etc.
     - Every Euro of revenue at some point was a message in Kafka
     Status quo
     - Our entire BI business logic runs on and through the Kafka–Hadoop pipeline
     - Almost all departments rely on data, insights and metrics delivered by Hadoop
     - Most of the company could not do their job without Hadoop data
  28.–29. Ongoing Projects: Breaking up the Monolith (diagram: Düsseldorf, Leipzig, Palma)
  30. Key challenges and learnings
     ● Settle on a common message format (Avro/Protobuf, not CSV or JSON)
     ● A common message envelope is helpful (e.g. a header with timestamp and sender)
     ● For stream processing, repeat your key in your message value (see the producer sketch below)
     ● Monitor your consumer offsets with an audit log, especially across data centres
     ● Turn off auto-creation of topics, but have a process in place for topic creation
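As a minimal sketch of the envelope and "repeat the key in the value" points above: a producer could wrap each payload as shown below. The topic, broker address and field names are assumptions for illustration, and in practice the envelope would be an Avro or Protobuf record rather than a JSON string.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Hypothetical sketch: every value carries a small envelope with a timestamp
    // and the sending service, and repeats the partitioning key so downstream
    // stream processors never depend on the Kafka record key alone.
    public class EnvelopeProducerSketch {

        static String envelope(String sender, String key, String payload) {
            long ts = System.currentTimeMillis();
            return String.format("{\"ts\":%d,\"sender\":\"%s\",\"key\":\"%s\",\"payload\":%s}",
                    ts, sender, key, payload);
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka:9092");  // assumed broker address
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String key = "hotel-12345";  // hypothetical partitioning key
                String value = envelope("price-collector", key, "{\"grossPriceCents\":9900}");
                producer.send(new ProducerRecord<>("price_log", key, value));
            }
        }
    }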
  31. Thank you! Questions and comments? Clemens Valiente, Senior Data Engineer, trivago Düsseldorf. Email: clemens.valiente@trivago.com, de.linkedin.com/in/clemensvaliente
  32. Thanks to Jan Filipiak for his brainpower behind most projects. Additional resources:
     ● https://github.com/trivago/gollum – an n:m message multiplexer written in Go
     ● https://github.com/trivago/triava – TriavaCache, a JSR107-compliant cache
