
Taras Kloba, "ETL Is No Longer Relevant: Long Live Streams with Apache Kafka"

Published at: BigData & Data Engineering

  1. ETL is Dead. Long Live Streams with Apache Kafka. Taras Kloba, BI Team Lead/Data Architect, Intellias
  2. Agenda • About me; • A common problem in data transfer; • Ways to solve it; • About Apache Kafka; • Demo of reliable data delivery; • Questions?
  3. Taras Kloba. Quick facts: • 7 years of experience with databases; • Certified Data Engineer on Google Cloud; • Microsoft Certified Expert in SQL Server; • Co-organizer of "SQL Saturday" in Lviv and Krakow; • Trainer, speaker, consultant; • Owner of the "SQL" trademark in Ukraine. SQL.ua, CEO/Founder; Intellias, BI Team Lead/Data Architect. (Certificate IDs: Q62JCJRJGY77, 9DG5NZ4EVP7A, M2HE6LPRJ6MV)
  4. My current project: one of the biggest B2B software solutions for the iGaming industry in the world, with +300 GB of new data every day.
  5. Previous legacy system: extract jobs poll consecutive 4-minute windows on a timeline from 00:00 to 00:09. SELECT * FROM fact_transactions WHERE upd BETWEEN '2018-11-03 00:00:00' AND '2018-11-03 00:04:00'; then SELECT * FROM fact_transactions WHERE upd BETWEEN '2018-11-03 00:04:00' AND '2018-11-03 00:08:00'
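
For context, the legacy extract step amounts to a scheduled query over a moving time window. A minimal JDBC sketch of that pattern, assuming a SQL Server source; only fact_transactions and upd come from the slides, the connection details are placeholders:

    import java.sql.*;

    public class WindowPoller {
        public static void main(String[] args) throws Exception {
            // Placeholder connection string for the source database.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:sqlserver://localhost;databaseName=dwh", "user", "pass")) {
                Timestamp from = Timestamp.valueOf("2018-11-03 00:00:00");
                long windowMs = 4 * 60 * 1000; // 4-minute extraction window
                PreparedStatement ps = conn.prepareStatement(
                    "SELECT * FROM fact_transactions WHERE upd BETWEEN ? AND ?");
                while (true) {
                    Timestamp to = new Timestamp(from.getTime() + windowMs);
                    ps.setTimestamp(1, from);
                    ps.setTimestamp(2, to);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            // load the row into the warehouse (omitted)
                        }
                    }
                    from = to;              // advance the window: a row whose
                    Thread.sleep(windowMs); // transaction commits late, with an
                }                           // upd inside a window already read,
            }                               // is never picked up
        }
    }

The comment at the end is the crux of the next few slides: the window moves on, but late-committing transactions leave rows behind in it.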
  6. ?
  7. Previous legacy system (with overlapping windows): SELECT * FROM fact_transactions WHERE upd BETWEEN '2018-11-03 00:00:00' AND '2018-11-03 00:04:00'; then SELECT * FROM fact_transactions WHERE upd BETWEEN '2018-11-03 00:02:00' AND '2018-11-03 00:06:00'
  8. ?
  9. Phantom reads (classical definition): a transaction re-executes a query returning a set of rows that satisfies a search condition and finds that the set has changed, because another transaction committed matching rows in the meantime.
  10. Phantom reads (in our case): Tnx 1 starts at 2018-11-03 12:00:00 and Tnx 2 at 12:01:00; Tnx 2 commits at 12:03:00, but Tnx 1 commits only at 12:05:00. A poll running SELECT * FROM fact_transactions WHERE upd BETWEEN '2018-11-03 11:58:00' AND '2018-11-03 12:04:00' between the two commits returns only trans_id 2 (upd 2018-11-03 12:01:00). Tnx 1's row, with upd 12:00:00, becomes visible only after that window has already been processed, so it is lost.
  11. ?
  12. #1. Isolation levels: SERIALIZABLE. With a lock-based concurrency control DBMS implementation, serializability requires read and write locks (acquired on selected data) to be held until the end of the transaction. Range locks must also be acquired when a SELECT query uses a ranged WHERE clause, specifically to avoid the phantom-read phenomenon. Not the best option for high-load systems.
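
For illustration, this is how SERIALIZABLE would be requested from application code over JDBC; connection details are placeholders. The range locks this level takes on the scanned interval are exactly what makes it expensive under high load:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SerializableRead {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:sqlserver://localhost;databaseName=dwh", "user", "pass")) {
                conn.setAutoCommit(false);
                // Range locks are taken on the scanned interval; concurrent
                // inserts into that range block until this transaction ends.
                conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery(
                         "SELECT * FROM fact_transactions " +
                         "WHERE upd BETWEEN '2018-11-03 00:00:00' AND '2018-11-03 00:04:00'")) {
                    while (rs.next()) { /* process row */ }
                }
                conn.commit();
            }
        }
    }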
  13. #2. Triggers. Traditionally, the most common technique for capturing events has been database- or application-level triggers. This technique is still widespread because of its simplicity and familiarity. Not the best option for high-load systems.
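
A sketch of the trigger approach, issued here as T-SQL over JDBC; the audit table fact_transactions_log and the trigger name are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreateCaptureTrigger {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:sqlserver://localhost;databaseName=dwh", "user", "pass");
                 Statement st = conn.createStatement()) {
                // Every INSERT now pays for a second, synchronous write inside
                // the same transaction -- the overhead mentioned above.
                st.execute(
                    "CREATE TRIGGER trg_capture_transactions " +
                    "ON fact_transactions AFTER INSERT AS " +
                    "INSERT INTO fact_transactions_log (trans_id, upd) " +
                    "SELECT trans_id, upd FROM inserted");
            }
        }
    }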
  14. #3. Change Data Capture (CDC) is a set of software design patterns used to determine (and track) the data that has changed, so that action can be taken using the changed data. Change data capture is also an approach to data integration based on the identification, capture, and delivery of changes made to enterprise data sources (Wikipedia). CDC solutions occur most often in data-warehouse environments, since capturing and preserving the state of data across time is one of the core functions of a data warehouse.
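
Log-based CDC avoids both of the previous options by reading changes from the database's transaction log rather than querying tables. On PostgreSQL, for instance, this starts with a logical replication slot (the mechanism the conclusion refers to). A minimal sketch, assuming the wal2json output plugin is installed and the connection details are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class CreateReplicationSlot {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/dwh", "user", "pass");
                 Statement st = conn.createStatement()) {
                // The slot retains WAL from this point on; committed changes
                // are streamed in commit order, so late-committing transactions
                // are no longer missed.
                try (ResultSet rs = st.executeQuery(
                        "SELECT * FROM pg_create_logical_replication_slot('cdc_slot', 'wal2json')")) {
                    while (rs.next()) {
                        System.out.println("slot: " + rs.getString(1) + ", lsn: " + rs.getString(2));
                    }
                }
                // Peek at pending changes without consuming them:
                try (ResultSet rs = st.executeQuery(
                        "SELECT data FROM pg_logical_slot_peek_changes('cdc_slot', NULL, NULL)")) {
                    while (rs.next()) System.out.println(rs.getString(1));
                }
            }
        }
    }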
  15. Apache Kafka. Kafka® is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.
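
As a minimal illustration of the Kafka client API, a producer publishing one change event; the broker address, topic name, and payload are placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TransactionProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for full replication
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("fact_transactions",
                        "trans-2", "{\"trans_id\": 2, \"upd\": \"2018-11-03 12:01:00\"}"));
            }
        }
    }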
  16. Typical data flow in companies
  17. Streaming platform to coordinate all data flows
  18. Kafka Connect API (the E and L in streaming ETL) • Scalability: leverages Kafka for scalability • Fault tolerance: builds on Kafka’s fault-tolerance model • Management and monitoring: one way of monitoring all connectors • Schemas: offers an option for preserving schemas from source to sink
  19. Kafka Connect: create a new connector (see the sketch below).
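
A connector is created by POSTing a JSON definition to the Connect worker's REST API (port 8083 by default). A sketch using the JDK 11+ HttpClient; the connector class shown (Debezium's PostgresConnector) and all property values are illustrative, and exact property names vary by connector and version:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class CreateConnector {
        public static void main(String[] args) throws Exception {
            // Connector definition: a name plus connector-specific config.
            String body = "{ \"name\": \"transactions-source\", \"config\": {"
                    + " \"connector.class\": \"io.debezium.connector.postgresql.PostgresConnector\","
                    + " \"database.hostname\": \"localhost\","
                    + " \"database.port\": \"5432\","
                    + " \"database.user\": \"user\","
                    + " \"database.password\": \"pass\","
                    + " \"database.dbname\": \"dwh\" } }";
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8083/connectors"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }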
  20. Kafka’s Streams API (the T in ETL) • Easiest way to do stream processing with Kafka; • True event-at-a-time stream processing, no micro-batching; • Dataflow-style windowing based on event time; handles late-arriving data
  21. Kafka Streams API: create a new processor (see the sketch below).
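
A minimal Streams topology sketch; the topic names and the transformation itself are placeholders, standing in for whatever the demo processor does:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class TransactionsProcessor {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "transactions-processor");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> transactions = builder.stream("fact_transactions");
            // Event-at-a-time processing: each record is handled as it
            // arrives, with no micro-batching.
            transactions
                .filter((key, value) -> value != null && !value.isEmpty())
                .mapValues(String::toUpperCase) // stand-in for a real transformation
                .to("fact_transactions_enriched");

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }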
  22. Demo
  23. Conclusion • Apache Kafka is robust • Triggers will keep your data in sync but can have significant performance overhead • Utilizing a logical replication slot can eliminate trigger overhead and transfer the computation load elsewhere • Not a panacea: still need to use good architectural patterns
  24. Questions?
  25. Thank you! Taras Klioba +38 093 74 876 15 taras@klioba.com
