
Polylog: A Log-Based Architecture for Distributed Systems

The talk focuses on a log-based architecture ("The Polylog") we've developed to handle change data capture so that we can easily build new services and databases on top of other services' full datasets. Some of the tools we'll cover include Debezium for database change capture, Kafka for storing the logs, and the Denormalizer, an in-house tool we built to perform left joins on streams.


  1. 1. The Polylog A Log-Based Architecture for Distributed Systems
  2. 2. Agenda 1. JW Player 2. Motivation 3. Inspiration 4. Implementation 5. Use cases
  3. 3. JW Player
  4. 4. JW Player 1. Established - 2008 2. Headquarters - NYC (2 Park Ave) 3. Employees - 200+ 4. Business Model - SaaS 5. JW Player Footprint: 5%+ of all video on the web
  5. 5. Data @ JW Player 1. 1Bn video hours consumed per month 2. 1Bn unique viewers per month 3. 5MM analytics events per minute 4. 3TB of logs per day. Teams: Pipelines - ingestion, pipelines, infrastructure; Discovery - recs & search in production; Insights - customer dashboards; Media Intelligence - media metadata extraction; Data Science - R&D, instrumentation, predictive modeling
  6. 6. Recommendations and Search
  7. 7. Motivation
  8. 8. JW Player is breaking up its monolith 1. JW Player is moving to a Service Oriented Architecture (SOA) 2. SOA promotes loose coupling between services 3. Part of the roadmap is to break up our monolithic database into separate datastores for faster iteration
  9. 9. Some services don’t work under SOA 1. Our data services depend on syncing Elasticsearch with numerous tables from the monolith 2. Traditional API-style architecture doesn’t work for indexing data across many sources and data change monitoring: a. Hard to know when, how and what changed b. Hard to maintain consistency c. Hard to scan the entire dataset
  10. 10. Our Mission: We need the ability to perform both iterative updates and full rebuilds of recommendations simply and efficiently
  11. 11. Inspiration
  12. 12. The Monolog 1. The New York Times solved this problem with a log-based architecture 2. CMSs write to Kafka first, from which other services read and build 3. “Mono” because everything is written to a single Kafka topic and partition https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/
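To make the "write to the log first" idea concrete, here is a minimal sketch using the same confluent_kafka client that appears later in the deck. The topic name, key, and payload are illustrative assumptions, not the Times' actual setup.

    # Sketch only: publish every record to a single topic and partition,
    # Monolog-style, so consumers see one totally-ordered log.
    import json
    import confluent_kafka

    producer = confluent_kafka.Producer({"bootstrap.servers": "my-kafka:9092"})

    def publish(asset_id, asset):
        producer.produce(
            "monolog",                                  # single topic...
            key=str(asset_id),
            value=json.dumps(asset).encode("utf-8"),
            partition=0,                                # ...and single partition
        )

    publish(123, {"type": "article", "title": "Hello, log"})
    producer.flush()  # wait for the broker to confirm delivery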
  13. 13. The simplicity of logs: the simplest possible storage abstraction - an append-only, totally-ordered sequence of records ordered by time.
  14. 14. Apache Kafka: distributed logs 1. Distributed and fault tolerant 2. Stores full history 3. Can replay from beginning 4. Supports log compaction 5. Clients in many languages: JVM, Python, Go

    # hello world in Kafka: a minimal consumer loop
    import confluent_kafka

    consumer = confluent_kafka.Consumer({
        "bootstrap.servers": "my-kafka:9092",
        "group.id": "my_consumer",
    })
    consumer.subscribe(["my_topic"])

    while True:
        # block until the next record arrives, then hand it off
        message = consumer.poll()
        process_message(message)
  15. 15. Implementation
  16. 16. The Polylog 1. Fewer assumptions than the Monolog 2. Can span multiple topics, partitions, or clusters 3. Easier to scale 4. Ability to create a consistent view of denormalized data
  17. 17. Polylog components 1. Producers - populating The Polylog a. Debezium b. Custom 2. Storage - Kafka 3. Intermediate processors a. Denormalizer b. Custom 4. Consumers - consuming off of The Polylog
  18. 18. The Polylog
  19. 19. Debezium: read logs from the database 1. Reads op logs from various databases (MySQL, Postgres, Mongo, etc.) and writes to Kafka 2. Minimal setup 3. Every table is a topic 4. Handles schema changes 5. Configuration options (e.g. table whitelist, column blacklist)
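For context, Debezium connectors are registered through the Kafka Connect REST API. The sketch below is illustrative only: hostnames, credentials, and table names are placeholders, and newer Debezium releases rename the whitelist/blacklist options to table.include.list and column.exclude.list.

    # Sketch: register a Debezium MySQL connector via the Kafka Connect REST API.
    # All hostnames, credentials, and table names are placeholders.
    import requests

    connector = {
        "name": "mydb-connector",
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": "mysql.internal",
            "database.port": "3306",
            "database.user": "debezium",
            "database.password": "secret",
            "database.server.id": "1",
            "database.server.name": "mysql",       # topic prefix: mysql.mydb.table1, ...
            "table.whitelist": "mydb.table1,mydb.table2",   # every table becomes a topic
            "column.blacklist": "mydb.table1.secret_column",
            "database.history.kafka.bootstrap.servers": "my-kafka:9092",
            "database.history.kafka.topic": "schema-changes.mydb",
        },
    }

    resp = requests.post("http://kafka-connect:8083/connectors", json=connector)
    resp.raise_for_status()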
  20. 20. Custom Producers 1. Debezium is not appropriate for all use cases 2. We have custom producers writing to the Polylog a. Derived data (e.g. algorithm results) b. Producers requiring business logic c. Kafka as source of truth
  21. 21. Denormalizer: left joins on streams 1. Join records across multiple topics 2. Create full denormalized records (e.g. media with tags) 3. Generic schema 4. RocksDB with AWS S3 backup 5. Looking to open source
  22. 22. Denormalizer: what does the data look like?

    mysql.mydb.table1:
        { "id": 123, "title": "My title", "duration": 600 }

    mysql.mydb.table2:
        { "id": 234, "table1_id": 123, "val": "hello world" }

    my_denormalizer_topic:
        {
          "PrimaryKey": "0360",
          "Record": { "id": 123, "title": "My title", "duration": 600 },
          "Children": {
            "table2": [{
              "PrimaryKey": "0203",
              "Record": { "id": 234, "table1_id": 123, "val": "hello world" },
              "Children": {...}
            }]
          }
        }
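To make the join mechanics concrete, below is a heavily simplified sketch of the idea behind "left joins on streams". It is not the actual Denormalizer: a plain dict stands in for the RocksDB state store, the Debezium message envelope is reduced to a plain row, and deletes, ordering guarantees, and the S3 backup are all ignored. Topic and field names follow the example above; the PrimaryKey values are placeholders.

    # Highly simplified sketch of "left joins on streams" -- not the real Denormalizer.
    import json
    import confluent_kafka

    parents = {}             # table1 primary key -> latest parent record
    children_by_parent = {}  # table1 primary key -> list of table2 child entries

    consumer = confluent_kafka.Consumer({
        "bootstrap.servers": "my-kafka:9092",
        "group.id": "denormalizer-sketch",
    })
    consumer.subscribe(["mysql.mydb.table1", "mysql.mydb.table2"])
    producer = confluent_kafka.Producer({"bootstrap.servers": "my-kafka:9092"})

    def emit(parent_id):
        # Left join: emit the parent even if it has no children yet.
        parent = parents.get(parent_id)
        if parent is None:
            return
        denormalized = {
            "PrimaryKey": str(parent_id),
            "Record": parent,
            "Children": {"table2": children_by_parent.get(parent_id, [])},
        }
        producer.produce(
            "my_denormalizer_topic",
            key=str(parent_id),
            value=json.dumps(denormalized).encode("utf-8"),
        )

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        if msg.topic() == "mysql.mydb.table1":
            parents[record["id"]] = record
            emit(record["id"])
        else:  # mysql.mydb.table2 rows join back on table1_id
            child = {"PrimaryKey": str(record["id"]), "Record": record, "Children": {}}
            children_by_parent.setdefault(record["table1_id"], []).append(child)
            emit(record["table1_id"])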
  23. 23. Consumers: stream to other datastores 1. Read denormalized records 2. Transform into expected format 3. Write transformed records into another datastore (e.g. Elasticsearch)
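As an illustration of such a consumer (the topic, index name, and transform are assumptions, not production code), something like the following could keep Elasticsearch in sync with the denormalized topic. Older Elasticsearch Python clients take body= instead of document=.

    # Sketch: stream denormalized records into Elasticsearch.
    import json
    import confluent_kafka
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://elasticsearch:9200")
    consumer = confluent_kafka.Consumer({
        "bootstrap.servers": "my-kafka:9092",
        "group.id": "es-sink",
        "auto.offset.reset": "earliest",  # a fresh group replays the full dataset
    })
    consumer.subscribe(["my_denormalizer_topic"])

    def transform(denormalized):
        # Flatten the denormalized record into the document shape the index expects.
        doc = dict(denormalized["Record"])
        doc["tags"] = [c["Record"]["val"] for c in denormalized["Children"].get("table2", [])]
        return doc

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        es.index(index="media", id=record["PrimaryKey"], document=transform(record))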
  24. 24. Use cases
  25. 25. 1. Build data models from disparate data sources
  26. 26. 2. Kafka as primary source of truth a. Write to Kafka first b. Can have multiple consumers c. At-least-once delivery guarantee d. Guarantees consistency - avoids the dual-write problem
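A sketch of the write path this implies, with made-up topic and handler names: the service produces to Kafka and nothing else, and every datastore is populated by consumers of that topic, so there is no second write that can drift out of sync.

    # Sketch: the service writes only to Kafka; downstream stores are updated
    # by consumers of the log. Topic and record below are placeholders.
    import json
    import confluent_kafka

    producer = confluent_kafka.Producer({
        "bootstrap.servers": "my-kafka:9092",
        "enable.idempotence": True,   # retries won't introduce duplicates
    })

    def save_media(media):
        # instead of: db.save(media); es.index(media)  <- dual write, can diverge
        producer.produce(
            "media_source_of_truth",
            key=str(media["id"]),
            value=json.dumps(media).encode("utf-8"),
        )
        producer.flush()  # at-least-once: acknowledge only after Kafka has the record

    save_media({"id": 123, "title": "My title", "duration": 600})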
  27. 27. 3. Database migrations a. Avoid dual write issues! b. Stand up new service while old service still active c. Seamless switch - no hard cutover
  28. 28. 4. Data change monitoring
  29. 29. 5. Disaster recovery and fault tolerance a. Kafka retention means we have an audit trail b. Examples: ➢ Accidentally overwriting data in upstream database ➢ Debugging how data changed over time
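Because Kafka retains the history, a topic can be replayed from the beginning to audit how a record changed over time. A rough sketch (topic name and partition count are assumptions):

    # Sketch: replay a topic from the start to see how data changed over time.
    import confluent_kafka
    from confluent_kafka import TopicPartition, OFFSET_BEGINNING

    consumer = confluent_kafka.Consumer({
        "bootstrap.servers": "my-kafka:9092",
        "group.id": "audit-replay",
    })
    # Assign the partition explicitly and start at the earliest retained offset.
    consumer.assign([TopicPartition("mysql.mydb.table1", 0, OFFSET_BEGINNING)])

    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            break  # nothing new within the timeout; stop the replay
        if msg.error():
            continue
        print(msg.offset(), msg.value())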
  30. 30. 6. New services based on other services' datasets a. “Don’t be a salmon!” - don’t swim upstream by talking directly to upstream services b. The Polylog is a single data source that multiple consumers can work off of c. When you need to build a service that can’t rely on basic API calls
  31. 31. Conclusion
  32. 32. Use log-based architectures! 1. Build data models from disparate data sources 2. Kafka as primary source of truth 3. Database migrations for SOA 4. Data change monitoring 5. Disaster recovery and fault tolerance 6. Building new services based on other services' full datasets
  33. 33. Thank you... and we’re hiring! Questions?
