Streaming ETL - from RDBMS to Dashboard with KSQL

Apache Kafka is a massively scalable message queue that is being used in more and more places to connect more and more data sources. This presentation introduces Kafka from the perspective of a mere mortal DBA and shares the experience of (and challenges with) getting events from the database to Kafka using Kafka Connect, including poor-man's CDC using flashback query and traditional logical replication tools. To demonstrate how and why this is a good idea, we will build an end-to-end data processing pipeline. We will discuss how to turn changes in database state into events and stream them into Apache Kafka. We will explore the basic concepts of streaming transformations using windows and KSQL before ingesting the transformed stream into a dashboard application.

Streaming ETL - from RDBMS to Dashboard with KSQL

  1. 1. Streaming ETL From rdbms to dashboard with Kafka and KSQL Björn Rost
  2. 2. Things I am good at •Oracle (and relational) databases •Performance •High-Availability •PL/SQL and ETL •Replication •Exadata •Automation/DevOps •Linux and Solaris •VMs and solaris containers © 2016 Pythian 11
  3. 3. Things I am getting good at •Kafka and streaming •Cloud and cloud native data processing •Dataflow, bigquery •Machine learning •docker © 2016 Pythian 12
  4. 4. Things I am not good at And have limited interest in •“real” programming •Especially java •GUIs •Coming up with meaningful demos © 2016 Pythian 13
  5. 5. ABOUT PYTHIAN Pythian’s 400+ IT professionals help companies adopt and manage disruptive technologies to better compete © 2016 Pythian 14
  6. 6. TECHNICAL EXPERTISE © 2016 Pythian. Confidential 15 Infrastructure: Transforming and managing the IT infrastructure that supports the business DevOps: Providing critical velocity in software deployment by adopting DevOps practices Cloud: Using the disruptive nature of cloud for accelerated, cost-effective growth Databases: Ensuring databases are reliable, secure, available and continuously optimized Big Data: Harnessing the transformative power of data on a massive scale Advanced Analytics: Mining data for insights & business transformation using data science
  7. 7. © 2016 Pythian 16 assumptions •You know more about kafka than me •Today you do not want to hear much about how great Oracle is
  8. 8. AGENDA • Motivation / what are we going to build here? • Getting rdbms data into kafka • streaming ETL and KSQL • Feeding kafka into grafana • Demo time! (or Q&A) © 2016 Pythian 17
  9. 9. motivation (noun) /məʊtɪˈveɪʃ(ə)n/ AKA: how to tease you enough to pay attention through the next 42 minutes
  10. 10. overview © 2016 Pythian 19
  11. 11. The full(er) picture © 2016 Pythian 20 mysql maxwell kafka ksql elastic clickstream
  12. 12. The 3 Vs of Big Data © 2016 Pythian 21 Volume Variety Velocity
  13. 13. RDBMS (the "king of state") vs. Streaming © 2016 Pythian 22 RDBMS: • Takes transactions and stores consistent state • Tells you what *is* or *was* • One central "system of record" • Sucks for large volumes of logs • Great at updates, deletes and rollbacks • Every DB speaks SQL Streaming: • Stores and distributes events • Tells you what *happened* • Has a concept of order • Connects many different systems • Sucks at accounting and inventories • Append-only • Processing = programming*
  14. 14. State examples: • I have $42 in my bank account • The address of user xx is yyy • Inventory • Invoice and order data • Spatial objects (maps) Event examples: • A transferred $42 to B • Address change • Add or remove an item • Clickstreams and logs • IoT messages • Location movements (GPS) • Gaming actions © 2016 Pythian 23
  15. 15. Demo setup in mysql © 2016 Pythian 24 mysql>select * from orders order by id desc limit 5; +-------+---------+-------+---------+ | id | product | price | user_id | +-------+---------+-------+---------+ | 10337 | wine | 10 | 3 | | 10336 | olives | 1 | 14 | | 10335 | olives | 3 | 7 | | 10334 | olives | 8 | 32 | | 10333 | salt | 3 | 27 | +-------+---------+-------+---------+ 5 rows in set (0.00 sec)
  16. 16. rdbms -> kafka © 2016 Pythian 25
  17. 17. Kafka-connect-jdbc • open source connector • runs a query every n seconds • Remembers offset • Really only captures inserts • Broken data type mapping (Oracle) • Issues with timezones (Oracle) © 2016 Pythian 26
  18. 18. lumpy - @lumpyACED © 2016 Pythian 27
  19. 19. Simple diary example © 2016 Pythian 28 mysql>describe diary; +-------+-------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +-------+-------------+------+-----+---------+----------------+ | id | smallint(6) | NO | PRI | NULL | auto_increment | | event | varchar(42) | YES | | NULL | | +-------+-------------+------+-----+---------+----------------+ 2 rows in set (0.00 sec) mysql>select * from diary order by id desc limit 5; +----+---------------------------------------------+ | id | event | +----+---------------------------------------------+ | 18 | i hate the snow | | 17 | still jealous i did not get to go to israel | | 16 | i am jealous i did not get to go to Israel | | 15 | i am jealous i did not get to go to india | | 13 | i am very cold and alone | +----+---------------------------------------------+ 5 rows in set (0.00 sec)
  20. 20. Diary example © 2016 Pythian 29 mysql>insert into diary (event) values ('I would love to meet the meetup guys'); Query OK, 1 row affected (0.00 sec) mysql>select * from diary order by id desc limit 2; +----+--------------------------------------+ | id | event | +----+--------------------------------------+ | 19 | I would love to meet the meetup guys | | 18 | i hate the snow | +----+--------------------------------------+ 2 rows in set (0.00 sec)
  21. 21. Connect-jdbc-diary.properties © 2016 Pythian 30 name=mysql-diary-source connector.class=io.confluent.connect.jdbc.JdbcSourceConnector tasks.max=1 connection.url=jdbc:mysql://localhost:3306/code_demo?user=lumpy&password=lumpy table.whitelist=diary mode=incrementing incrementing.column.name=id topic.prefix=mysql-
  22. 22. Still simple but not as easy: inventory © 2016 Pythian 31 SQL>describe inventory; Name Null? Type ----------------------------------------- -------- ------------------------ ID NOT NULL NUMBER(8) NAME VARCHAR2(42) COUNT NUMBER(8) SQL>select * from inventory; ID NAME COUNT ---------- ------------ ---------- 1 nametag 1 4 friends 294 5 selfies 1005
  23. 23. Still simple but not as easy: inventory © 2016 Pythian 32 SQL>update inventory set count=count+2 where name='friends'; 1 row updated. SQL>select * from inventory; ID NAME COUNT ---------- ------------ ---------- 1 nametag 1 4 friends 296 5 selfies 1005
  24. 24. How about one extra column to catch updates? © 2016 Pythian 33 alter table inventory add (last_modified timestamp);
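One way to keep such a column current (not shown in the deck) is a before-row trigger, so that the connector can poll on it in timestamp mode. A minimal sketch, assuming Oracle and a hypothetical trigger name inventory_set_lm:

-- hypothetical trigger that stamps every insert/update with the change time
create or replace trigger inventory_set_lm
before insert or update on inventory
for each row
begin
  :new.last_modified := systimestamp;
end;
/
-- kafka-connect-jdbc could then use mode=timestamp with
-- timestamp.column.name=LAST_MODIFIED to pick up updates as well as inserts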
  25. 25. How about two extra columns to catch deletes? © 2016 Pythian 34 alter table inventory add (valid_from timestamp, valid_to timestamp);
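For this to work, the application has to stop issuing real DELETEs and close the validity interval instead. A hedged sketch of such a soft delete, reusing the mouse ears row from the flashback example a few slides later:

-- instead of: delete from inventory where name = 'mouse ears';
update inventory
   set valid_to = systimestamp
 where name = 'mouse ears'
   and valid_to is null;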
  26. 26. © 2016 Pythian 35 Poor man's CDC: Oracle flashback query • SELECT … VERSIONS BETWEEN … • this adds pseudocolumns • version_starttime in TS format • version_operation • the data is gathered from UNDO by default • 11.2.0.4 and later allow basic flashback data archives without extra licenses • specify retention period for as long as you want
  27. 27. flashback query output © 2016 Pythian 36 ID NAME COUNT O VERSIONS_STARTTIME ---- ------------ ------- - -------------------------------- 4 friends 42 I 27-JUN-17 05.10.17.000000000 AM 3 shrimp 1 I 27-JUN-17 05.10.17.000000000 AM 6 mouse ears 2 D 27-JUN-17 03.51.50.000000000 PM 4 friends 42 U 27-JUN-17 05.10.41.000000000 AM 6 mouse ears 2 I 27-JUN-17 05.23.11.000000000 AM 5 selfies 1001 U 27-JUN-17 03.56.12.000000000 PM 5 selfies 1000 U 27-JUN-17 05.10.41.000000000 AM 4 friends 42 U 27-JUN-17 03.51.22.000000000 PM 4 friends 92 U 27-JUN-17 10.14.14.000000000 PM 4 friends 117 U 27-JUN-17 10.23.17.000000000 PM 4 friends 142 U 27-JUN-17 10.28.21.000000000 PM 5 selfies 1002 U 27-JUN-17 03.56.22.000000000 PM select id, name, count, versions_operation, versions_starttime from inventory versions between scn minvalue and maxvalue order by versions_starttime;
  28. 28. © 2016 Pythian 37 Flashback data archives (aka Total Recall) • background job mines UNDO • saves data to special tables • create flashback archive per table • define retention • extends flashback query
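A minimal sketch of what setting one up looks like; the archive name inventory_fba, the tablespace and the one-year retention are made-up values for illustration, not taken from the deck:

-- 11.2.0.4 and later allow basic archives without the extra license mentioned above
create flashback archive inventory_fba
  tablespace users
  retention 1 year;

-- attach the demo table to the archive
alter table inventory flashback archive inventory_fba;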
  29. 29. flashback query config for connect-jdbc © 2016 Pythian 38 connection.url=jdbc:oracle:thin:lumpy/lumpy@//localhost:1521/BRORCL query=select id, name, count, versions_operation, versions_starttime from inventory versions between scn minvalue and maxvalue mode=timestamp+incrementing timestamp.column.name=VERSIONS_STARTTIME incrementing.column.name=ID topic.prefix=connect-inventory
  30. 30. Using DB tx logs © 2016 Pythian 39
  31. 31. •DBs typically separate data (random and async) from logs (sync and sequential) •This increases performance and recoverability •Bonus: log of all changes •Different names, same concept •Oracle: redo and archivelogs •Mysql: binlogs •Postgres: Write-Ahead-Logs (WAL) •SQL Server: transaction logs Databases already have ”event” logs © 2016 Pythian 40
  32. 32. dbvisit replicate © 2016 Pythian 41
  33. 33. RDBMS CDC tools © 2016 Pythian 42
  34. 34. Maxwell for mysql • Reads binlogs directly • Has its own JSON format (read: not kafka-connect) • Open, easy, awesome © 2016 Pythian 43
  35. 35. Maxwell setup © 2016 Pythian 44 maxwell --user='maxwell' --password='maxwell' --host='127.0.0.1' --producer=kafka --kafka.bootstrap.servers=localhost:9092 --kafka_topic=maxwell_%{database}_%{table}
  36. 36. Maxwell output © 2016 Pythian 45 {"database":"code", "table":"orders", "type":"insert", "ts":1516802610, "xid":42025, "commit":true, "data":{"id":12734, "product":"salt", "price":7, "user_id":24 } }
  37. 37. Data processing © 2016 Pythian 46
  38. 38. © 2016 Pythian 47 ETL for traditional analytics • Transform raw data from transactional systems • Store it again, optimized for analytics and reports • Star-schema • Aggregates and roll-ups • Runs in batches, typically nightly
  39. 39. •In-memory •Column stores •Report in real-time •Decision-support •Machine learning and AI •New data sources •Clickstream •IoT •Big Data © 2016 Pythian 48 Hot topics in analytics
  40. 40. KSQL and Event Stream Processing •Kafka already has kafka streams for processing •But you need to actually write code ▪Same problem with Apache Spark and Dataflow (Apache Beam) etc etc •KSQL allows stream processing with the language you probably already know •Currently in ”developer-preview” © 2016 Pythian 49
  41. 41. What's the deal with streaming data processing? © 2016 Pythian 50 bounded: finite, complete, consistent vs. unbounded: infinite, incomplete, inconsistent, from different sources
  42. 42. Easy: single element transforms •Connect SMT •KSQL •Kafka Streams © 2016 Pythian 51
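As a taste of how simple a stateless, single-row transform looks in KSQL, here is a hedged sketch of a filter over the orders stream that gets built on the next few slides; the stream name wine_orders is made up:

-- keep only wine orders; every input row maps to at most one output row,
-- so no state, windows or joins are needed
create stream wine_orders as
  select id, price, user_id
  from orders
  where product = 'wine';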
  43. 43. Creating stream from topic and transforms © 2016 Pythian 52 create stream orders_raw (data map(varchar, varchar)) with (kafka_topic = 'maxwell_code_orders', value_format = 'JSON'); ksql>describe orders_raw; Field | Type ------------------------------------------------ ROWTIME | BIGINT (system) ROWKEY | VARCHAR(STRING) (system) DATA | MAP[VARCHAR(STRING),VARCHAR(STRING)] ------------------------------------------------
  44. 44. Creating stream from topic and transforms © 2016 Pythian 53 ksql>select * from orders_raw limit 5; 1516805044165 | {"database":"code","table":"orders","pk.id":546} | {product=wine, user_id=31, price=1, id=546} 1516805044304 | {"database":"code","table":"orders","pk.id":547} | {product=salt, user_id=17, price=2, id=547} 1516805044423 | {"database":"code","table":"orders","pk.id":548} | {product=salt, user_id=16, price=6, id=548} 1516805044550 | {"database":"code","table":"orders","pk.id":549} | {product=olives, user_id=11, price=8, id=549} 1516805044683 | {"database":"code","table":"orders","pk.id":550} | {product=salt, user_id=36, price=3, id=550} LIMIT reached for the partition. Query terminated
  45. 45. Creating stream from topic and transforms © 2016 Pythian 54 create stream orders_flat as select data['id'] as id, data['product'] as product, data['price'] as price, data['user_id'] as user_id from orders_raw; ksql>describe orders_flat; Field | Type ------------------------------------- ROWTIME | BIGINT (system) ROWKEY | VARCHAR(STRING) (system) ID | VARCHAR(STRING) PRODUCT | VARCHAR(STRING) PRICE | VARCHAR(STRING) USER_ID | VARCHAR(STRING) -------------------------------------
  46. 46. Creating stream from topic and transforms © 2016 Pythian 55 create stream orders as select cast(id as integer) as id, product, cast(price as bigint) as price, cast(user_id as integer) as user_id from orders_flat; ksql>describe orders; Field | Type ------------------------------------- ROWTIME | BIGINT (system) ROWKEY | VARCHAR(STRING) (system) ID | INTEGER PRODUCT | VARCHAR(STRING) PRICE | BIGINT USER_ID | INTEGER -------------------------------------
  47. 47. Creating stream from topic and transforms © 2016 Pythian 56 ksql>select * from orders limit 5; 1516805228829 | {"database":"code","table":"orders","pk.id":2031} | 2031 | olives | 1 | 21 1516805228964 | {"database":"code","table":"orders","pk.id":2032} | 2032 | salt | 2 | 28 1516805229114 | {"database":"code","table":"orders","pk.id":2033} | 2033 | wine | 1 | 26 1516805229254 | {"database":"code","table":"orders","pk.id":2034} | 2034 | wine | 5 | 2 1516805229377 | {"database":"code","table":"orders","pk.id":2035} | 2035 | salt | 5 | 1 LIMIT reached for the partition. Query terminated
  48. 48. Aggregates are a lot harder © 2016 Pythian 57
  49. 49. And then there are joins © 2016 Pythian 58
  50. 50. © 2016 Pythian 59
  51. 51. Slicing a stream into windows © 2016 Pythian 60
  52. 52. Late arrivals make this more complicated… (e.g. an event with event_ts=8:02 arriving late) © 2016 Pythian 61
  53. 53. Tumbling windows: fixed-size, gap-less © 2016 Pythian 62
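In KSQL a tumbling window is just a WINDOW TUMBLING clause on an aggregate; a sketch against the orders stream from the earlier slides (the table name orders_per_5min is made up):

-- one non-overlapping bucket per product per 5 minutes
create table orders_per_5min as
  select product, sum(price) as amount
  from orders
  window tumbling (size 5 minutes)
  group by product;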
  54. 54. Hopping windows: fixed-size, overlapping © 2016 Pythian 63
  55. 55. session windows: variable-size, timeout per key © 2016 Pythian 64
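Session windows follow the same pattern, with the window closing after a per-key gap; a sketch assuming a 30-second inactivity timeout and counting orders per user (table name made up):

-- a new session starts whenever a user_id has been idle for 30 seconds
create table orders_per_session as
  select user_id, count(*) as orders_in_session
  from orders
  window session (30 seconds)
  group by user_id;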
  56. 56. Create a windowed aggregate in ksql © 2016 Pythian 65 create table orders_per_min as select product, sum(price) amount from orders window hopping (size 60 seconds, advance by 15 seconds) group by product; CREATE TABLE orders_per_min_ts as select rowTime as event_ts, * from orders_per_min;
  57. 57. Create a windowed aggregate in ksql © 2016 Pythian 66 ksql>select event_ts, product, amount from orders_per_min_ts limit 20; 1516805280000 | olives | 444 1516805295000 | olives | 436 1516805310000 | olives | 307 1516805325000 | olives | 125 1516805280000 | salt | 921 1516805295000 | salt | 906 1516805310000 | salt | 528 1516805325000 | salt | 229 1516805280000 | wine | 470 1516805295000 | wine | 470 1516805310000 | wine | 305 1516805325000 | wine | 103
  58. 58. Aggregate functions © 2016 Pythian 67 Function Example Description COUNT COUNT(col1) Count the number of rows MAX MAX(col1) Return the maximum value for a given column and window MIN MIN(col1) Return the minimum value for a given column and window SUM SUM(col1) Sums the column values TOPK TOPK(col1, k) Return the TopK values for the given column and window TOPKDISTINCT TOPKDISTINCT(col1, k) Return the distinct TopK values for the given column and window
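For example, combining TOPK from the table above with a window gives the largest order prices per product and interval; a hedged sketch along the lines of the earlier aggregate:

-- three largest prices seen for each product in each 60-second window
select product, topk(price, 3)
from orders
window tumbling (size 60 seconds)
group by product;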
  59. 59. Demo time! © 2016 Pythian 68 mysql maxwell kafka ksql elastic clickstream Huge credit to the clickstream demo on GitHub
  60. 60. More resources © 2016 Pythian 70 • https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/ • https://www.confluent.io/blog/ksql-open-source-streaming-sql-for-apache-kafka/ • https://www.rittmanmead.com/blog/2017/10/ksql-streaming-sql-for-apache-kafka/
  61. 61. •RDBMS also want to speak “stream” •Stream processing is coming fast and is here to stay •KSQL is something to be excited about © 2016 Pythian 71 Summary https://github.com/bjoernrost/mysql-ksql-etl-demo
  62. 62. THANK YOU © 2016 Pythian 72
