Norikra: Stream Processing with SQL

HadoopCon 2014 Taiwan Tech Talk
* Stream processing overview
* Using SQL as DSL for stream processing
* Details of Norikra
* Norikra queries
* Use cases

Published in: Technology

  1. 1. Norikra: Stream Processing With SQL 2014/09/13 HadoopCon 2014 Taiwan Satoshi Tagomori (@tagomoris)
  2. 2. Satoshi Tagomori (@tagomoris) LINE Corporation Analytics Platform Team
  3. 3. THE ONE THING YOU MUST LEARN TODAY IS
  4. 4. Norikra
  5. 5. Norikra IS NOT Norika
  6. 6. Topics: Basics of stream processing; Stream processing with SQL; Norikra overview; Norikra queries; Use cases in production
  7. 7. Stream Processing: Less latency; Less computing power; No query schedule management
  8. 8. Data Flow And Latency. Batch: data window, then query execution. Stream: incremental query execution.
  9. 9. Query For Stored Data [diagram: a table holding many rows of v1,v2,v3,v4,v5,v6] At first, all data MUST be stored.
  10. 10. Query For Stored Data [diagram: same table as slide 9] SELECT v1,v2,COUNT(*) FROM table WHERE v3='x' GROUP BY v1,v2
  11. 11. Query For Stored Data [diagram: same table as slide 9] SELECT v1,v2,COUNT(*) FROM table WHERE v3='x' GROUP BY v1,v2; SELECT v4,COUNT(*) FROM table WHERE v1 AND v2 GROUP BY v4
  12. 12. Query For Stored Data [diagram: same table and queries as slide 11] “All data” means “data that will not be used”.
  13. 13. Query For Stream Data [diagram: events v1,v2,v3,v4,v5,v6 arrive one after another on a stream that feeds two queries] SELECT v1,v2,COUNT(*) FROM table.win:xxx WHERE v3='x' GROUP BY v1,v2; SELECT v4,COUNT(*) FROM table.win:xxx WHERE v1 AND v2 GROUP BY v4
  14. 14. Query For Stream Data [diagram: same stream and queries as slide 13; the first query takes only v1,v2,v3 from an event and the second takes only v1,v2,v4]
  15. 15. Query For Stream Data [diagram: same as slide 14, with the next event being consumed]
  16. 16. Query For Stream Data [diagram: same as slide 14, with only a few events left on the stream] All data will be discarded right after insertion. (Bye-bye storage system maintenance!)
  17. 17. Incremental Calculation SELECT v1,v2,COUNT(*) FROM table.win:xxx WHERE v3='x' GROUP BY v1,v2 [diagram: events v1,v2,v3,v4,v5,v6 arriving on the stream] internal data (memory):
      v1    | v2    | COUNT
      TRUE  | TRUE  | 0
      TRUE  | FALSE | 1
      FALSE | TRUE  | 33
      FALSE | FALSE | 2
  18. 18. Incremental Calculation (same query and stream as slide 17) internal data (memory):
      v1    | v2    | COUNT
      TRUE  | TRUE  | 1
      TRUE  | FALSE | 1
      FALSE | TRUE  | 33
      FALSE | FALSE | 2
  19. 19. Incremental Calculation (same query and stream as slide 17) internal data (memory):
      v1    | v2    | COUNT
      TRUE  | TRUE  | 1
      TRUE  | FALSE | 1
      FALSE | TRUE  | 34
      FALSE | FALSE | 2
  20. 20. Incremental Calculation (same query as slide 17; fewer events left on the stream) internal data (memory):
      v1    | v2    | COUNT
      TRUE  | TRUE  | 1
      TRUE  | FALSE | 2
      FALSE | TRUE  | 37
      FALSE | FALSE | 3
      memory can store internal data
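The incremental calculation on slides 17 to 20 can be sketched in a few lines of Python. This is only an illustration of the idea, not how Norikra or Esper implement it: the function name and the sample events are made up, and the data window (`win:xxx`) is left out, so counts here never expire.

```python
from collections import Counter


def incremental_count(events):
    """Yield a snapshot of the per-(v1, v2) counts after each event."""
    counts = Counter()  # plays the role of "internal data (memory)"
    for event in events:
        if event.get("v3") == "x":  # WHERE v3='x'
            # GROUP BY v1,v2 with COUNT(*): bump one counter per group
            counts[(event["v1"], event["v2"])] += 1
        # The event itself is not stored; only the counters survive.
        yield dict(counts)


# Hypothetical sample events, not taken from the slides.
stream = [
    {"v1": True, "v2": True, "v3": "x"},
    {"v1": False, "v2": True, "v3": "x"},
    {"v1": False, "v2": True, "v3": "y"},  # filtered out by WHERE
    {"v1": False, "v2": True, "v3": "x"},
]
snapshots = list(incremental_count(stream))
final_counts = snapshots[-1]
```

Only the small counter table is kept between events, which is why memory is enough to hold the internal data.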
  21. 21. Data Window: target time (or size) range of queries. Batch: FROM-TO, e.g. WHERE dt >= '2014-09-13 13:30:00' AND dt < '2014-09-13 14:20:00'. Stream: “Calculate this query every 50 minutes”. Extended SQL required: SELECT v1,v2,COUNT(*) FROM table.win:xxx WHERE v3='x' GROUP BY v1,v2
  22. 22. Stream Processing With SQL. Esper: a Java library for stream processing; it has to be embedded in Java daemon code; it works with schemas for data and queries; OSS under GPLv2. http://esper.codehaus.org/
  23. 23. Esper EPL: Select values of height and weight for all events with age larger than 30: SELECT height, weight FROM tbl WHERE age > 30
  24. 24. Esper EPL: Count records grouped by height for events with age larger than 30: SELECT height, COUNT(*) AS c FROM tbl WHERE age > 30 GROUP BY height (This query never produces results.)
  25. 25. Esper EPL: Count records grouped by height for events with age larger than 30, once every hour: SELECT height, COUNT(*) AS c FROM tbl.win:time_batch(1 hour) WHERE age > 30 GROUP BY height
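To make the effect of `win:time_batch(1 hour)` concrete, here is a rough Python simulation of a batch window over timestamped events. It is a sketch under assumptions: the function name and sample data are hypothetical, batches are aligned to multiples of the batch length (which is not necessarily how Esper aligns them), empty batches are skipped, and the last batch is flushed when the input ends rather than when the clock advances.

```python
from collections import Counter


def time_batch_counts(events, batch_seconds):
    """Count events with age > 30 per height, one result per time batch.

    `events` is an iterable of (timestamp_in_seconds, event_dict) pairs,
    assumed to be sorted by timestamp.
    Returns a list of (batch_start_seconds, {height: count}).
    """
    results = []
    current_batch = None
    counts = Counter()
    for ts, event in events:
        batch = ts // batch_seconds
        if current_batch is not None and batch != current_batch:
            # The window closed: emit its result and start from scratch.
            results.append((current_batch * batch_seconds, dict(counts)))
            counts = Counter()
        current_batch = batch
        if event["age"] > 30:  # WHERE age > 30
            counts[event["height"]] += 1  # GROUP BY height, COUNT(*)
    if current_batch is not None:
        results.append((current_batch * batch_seconds, dict(counts)))
    return results


HOUR = 3600
# Hypothetical sample events.
events = [
    (10, {"height": 170, "age": 34}),
    (20, {"height": 170, "age": 41}),
    (30, {"height": 180, "age": 25}),  # filtered out: age <= 30
    (HOUR + 5, {"height": 180, "age": 52}),
]
batches = time_batch_counts(events, HOUR)
```

Each batch yields its own independent counts, which is what gives the grouped query a point in time at which to report.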
  26. 26. With/without Schema. Schema-full data: strict schema: predefined fields w/ types (or reject); schema on read: try to read known fields (or ignore). Schema-less data: any field (or ignore), any type (implicit/explicit conversion); fit for services under development: all internet services, including ours!
  27. 27. Stream Processing & Schema (TO BE): Queries first, data second for all stream processing. Queries automatically know what fields to query. [diagram: events from API endpoint, events from billing service and events of service X form one schema-less (mixed) data stream; query A and query B each read their own subset of fields]
  28. 28. break.
  29. 29. Norikra: Schema-less Stream Processing with SQL Server software, runs on JVM Open source software (GPLv2) http://norikra.github.io/ https://github.com/norikra/norikra
  30. 30. Norikra: Schema-less event stream: add/remove data fields whenever you want. SQL: no more restarts to add/remove queries; w/ JOINs, w/ SubQueries, w/ UDF (in Java/Ruby from rubygem). Truly complex events: nested Hash/Array, accessible directly from SQL. HTTP RPC w/ JSON or MessagePack (fluentd plugin available!)
  31. 31. How To Setup Norikra: (1) Install JRuby: either download jruby.tar.gz, extract it and export $PATH, or use rbenv: rbenv install jruby-1.7.xx; rbenv shell jruby-.. (2) Install Norikra: gem install norikra (3) Execute Norikra server: norikra start
  32. 32. Norikra Interface: Command line (norikra-client): norikra-client target open ...; norikra-client query add ...; tail -f ... | norikra-client event send ... WebUI: show status; show/add/remove queries. HTTP API: JSON, MessagePack
  33. 33. Norikra Queries: (1) SELECT name, age FROM events ("events" is the target)
  34. 34. Norikra Queries: (1) Event: {"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Taipei"} Query: SELECT name, age FROM events Output: {"name":"tagomoris","age":34}
  35. 35. Norikra Queries: (1) Event (without "age"): {"name":"tagomoris", "address":"Tokyo", "corp":"LINE", "current":"Taipei"} Query: SELECT name, age FROM events Output: nothing
  36. 36. Norikra Queries: (2) Event: {"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Taipei"} Query: SELECT name, age FROM events WHERE current="Taipei" Output: {"name":"tagomoris","age":34}
  37. 37. Norikra Queries: (2) Event: {"name":"hadoop", "age":99, "address":"Somewhere", "corp":"ASF", "current":"Elsewhere"} Query: SELECT name, age FROM events WHERE current="Taipei" Output: nothing
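The behaviour shown on slides 34 to 37 (an event produces output only when it carries every field the query needs and passes the WHERE condition) can be imitated with a small Python function. This mimics the observable results on the slides only; `run_query` and its parameters are hypothetical names, and nothing here reflects how Norikra decides internally which events reach a query.

```python
def run_query(event, select_fields, where=None):
    """Return the selected fields, or None when the event yields nothing.

    `where` is an optional dict of field -> required value, standing in
    for simple equality conditions such as current="Taipei".
    """
    where = where or {}
    needed = set(select_fields) | set(where)
    if not all(field in event for field in needed):
        return None  # a field used by the query is missing
    if any(event[field] != value for field, value in where.items()):
        return None  # WHERE condition not satisfied
    return {field: event[field] for field in select_fields}


full = {"name": "tagomoris", "age": 34, "address": "Tokyo",
        "corp": "LINE", "current": "Taipei"}
no_age = {"name": "tagomoris", "address": "Tokyo",
          "corp": "LINE", "current": "Taipei"}
elsewhere = {"name": "hadoop", "age": 99, "address": "Somewhere",
             "corp": "ASF", "current": "Elsewhere"}

out_full = run_query(full, ["name", "age"])
out_no_age = run_query(no_age, ["name", "age"])
out_where = run_query(full, ["name", "age"], {"current": "Taipei"})
out_elsewhere = run_query(elsewhere, ["name", "age"], {"current": "Taipei"})
```

Extra fields such as "address" or "corp" are simply ignored, so producers can add fields without breaking existing queries.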
  38. 38. Norikra Queries: (3) SELECT age, COUNT(*) as cnt FROM events.win:time_batch(5 mins) GROUP BY age
  39. 39. Norikra Queries: (3) Event: {"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Taipei"} Query: SELECT age, COUNT(*) as cnt FROM events.win:time_batch(5 mins) GROUP BY age Output (every 5 mins): {"age":34,"cnt":3}, {"age":33,"cnt":1}, ...
  40. 40. Norikra Queries: (4) Event: {"name":"tagomoris", "age":34, "address":"Tokyo", "corp":"LINE", "current":"Taipei"} Queries: SELECT age, COUNT(*) as cnt FROM events.win:time_batch(5 mins) GROUP BY age; SELECT max(age) as max FROM events.win:time_batch(5 mins) Output (every 5 mins): {"age":34,"cnt":3}, {"age":33,"cnt":1}, ... and {"max":51}
  41. 41. Norikra Queries: (5) Event: {"name":"tagomoris", "user":{"age":34, "corp":"LINE", "address":"Tokyo"}, "current":"Taipei", "speaker":true, "attend":[true,true,false, ...] } Query: SELECT age, COUNT(*) as cnt FROM events.win:time_batch(5 mins) GROUP BY age
  42. 42. Norikra Queries: (5) Event: {"name":"tagomoris", "user":{"age":34, "corp":"LINE", "address":"Tokyo"}, "current":"Taipei", "speaker":true, "attend":[true,true,false, ...] } Query: SELECT user.age, COUNT(*) as cnt FROM events.win:time_batch(5 mins) GROUP BY user.age
  43. 43. Norikra Queries: (5) Event: {"name":"tagomoris", "user":{"age":34, "corp":"LINE", "address":"Tokyo"}, "current":"Taipei", "speaker":true, "attend":[true,true,false, ...] } Query: SELECT user.age, COUNT(*) as cnt FROM events.win:time_batch(5 mins) WHERE current="Taipei" AND attend.$0 AND attend.$1 GROUP BY user.age
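The nested-field notation on slides 42 and 43 (`user.age` for a hash member, `attend.$0` for an array element) can be read as a path walk over the event. The Python helper below is a hypothetical illustration of that reading, not Norikra code; the `attend` array is cut to the three values visible on the slide, with the elided rest left out.

```python
def resolve(event, path):
    """Resolve a path such as 'user.age' or 'attend.$0' in a nested event."""
    value = event
    for part in path.split("."):
        if part.startswith("$"):
            value = value[int(part[1:])]  # '$0' means list index 0
        else:
            value = value[part]  # plain name means dict key
    return value


event = {
    "name": "tagomoris",
    "user": {"age": 34, "corp": "LINE", "address": "Tokyo"},
    "current": "Taipei",
    "speaker": True,
    "attend": [True, True, False],  # remaining elements omitted
}

# WHERE current="Taipei" AND attend.$0 AND attend.$1
matches = bool(
    event["current"] == "Taipei"
    and resolve(event, "attend.$0")
    and resolve(event, "attend.$1")
)
group_key = resolve(event, "user.age")  # GROUP BY user.age
```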
  44. 44. break. next: use cases
  45. 45. Use case 1: External API call reports for partners (LINE) External API call for LINE Business Connect LINE backend sends requests to partner’s API endpoint using users’ messages http://developers.linecorp.com/blog/?p=3386
  46. 46. Use case 1: External API call reports for partners (LINE) API error response summaries http://developers.linecorp.com/blog/?p=3386
  47. 47. Use case 1: External API call reports for partners (LINE) [diagram labels: channel gateway, partner's server, logs, query results, MySQL, Mail] SELECT channelId AS channel_id, reason, detail, count(*) AS error_count, min(timestamp) AS first_timestamp, max(timestamp) AS last_timestamp FROM api_error_log.win:time_batch(60 sec) GROUP BY channelId,reason,detail HAVING count(*) > 0 http://developers.linecorp.com/blog/?p=3386
  48. 48. Use case 2: Prompt reports for Ad service console. Prompt reports with Norikra + Fixed reports with Hive. [diagram labels: app servers, Fluentd, HDFS, console service, impression logs, execute hive query (daily), fetch query results (frequently)]
  49. 49. Use case 2: Prompt reports for Ad service console Hive query for fixed reports SELECT yyyymmdd, hh, campaign_id, region, lang, COUNT(*) AS click, COUNT(DISTINCT member_id) AS uu FROM ( SELECT yyyymmdd, hh, get_json_object(log, '$.campaign.id') AS campaign_id, get_json_object(log, '$.member.region') AS region, get_json_object(log, '$.member.lang') AS lang, get_json_object(log, '$.member.id') AS member_id FROM applog WHERE service='myservice' AND yyyymmdd='20140913' AND get_json_object(log, '$.type')='click' ) x GROUP BY yyyymmdd, hh, campaign_id, region, lang
  50. 50. Use case 2: Prompt reports for Ad service console Norikra query for prompt reports SELECT campaign.id AS campaign_id, member.region AS region, member.lang AS lang, COUNT(*) AS click, COUNT(DISTINCT member.id) AS uu FROM myservice.win:time_batch(1 hours) WHERE type="click" GROUP BY campaign.id, member.region, member.lang
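For readers less used to SQL aggregates, the Norikra query on slide 50 computes two numbers per (campaign, region, lang) group: the click count and the number of distinct members (uu). The Python sketch below reproduces that calculation for the events of a single window; the function name and the sample events are invented for illustration.

```python
from collections import defaultdict


def click_report(events):
    """Return {(campaign_id, region, lang): {"click": n, "uu": m}}."""
    clicks = defaultdict(int)
    members = defaultdict(set)
    for e in events:
        if e.get("type") != "click":  # WHERE type="click"
            continue
        key = (e["campaign"]["id"], e["member"]["region"], e["member"]["lang"])
        clicks[key] += 1  # COUNT(*)
        members[key].add(e["member"]["id"])  # COUNT(DISTINCT member.id)
    return {k: {"click": clicks[k], "uu": len(members[k])} for k in clicks}


def make_event(kind, campaign_id, member_id, region, lang):
    """Build a hypothetical log event with the nesting used in the query."""
    return {
        "type": kind,
        "campaign": {"id": campaign_id},
        "member": {"id": member_id, "region": region, "lang": lang},
    }


window_events = [
    make_event("click", "c1", "m1", "TW", "zh"),
    make_event("click", "c1", "m1", "TW", "zh"),  # same member again
    make_event("click", "c1", "m2", "TW", "zh"),
    make_event("impression", "c1", "m3", "TW", "zh"),  # not a click
    make_event("click", "c2", "m1", "JP", "ja"),
]
report = click_report(window_events)
```

Unlike a plain count, the distinct count has to remember which member ids it has seen within the window, so its memory use grows with the number of unique members.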
  51. 51. Use case 3: Realtime access dashboard on Google Platform Access log visualization Count using Norikra (2-step), Store on Google BigQuery Dashboard on Google Spreadsheet + Apps Script http://qiita.com/kazunori279/items/6329df57635799405547 https://www.youtube.com/watch?v=EZkw5TDcCGw
  52. 52. Use case 3: Realtime access dashboard on Google Platform [diagram labels: Server, nginx, access log, Fluentd, access logs to BigQuery, norikra query to aggregate locally, norikra query results to aggregate node] http://qiita.com/kazunori279/items/6329df57635799405547 https://www.youtube.com/watch?v=EZkw5TDcCGw
  53. 53. Use case 3: Realtime access dashboard on Google Platform. 70 servers, 120,000 requests/sec (or more!) [diagram labels: nginx (many servers), Fluentd, logs to store, Google BigQuery, Google Spreadsheet + Apps Script, counts per host, total count] http://qiita.com/kazunori279/items/6329df57635799405547 https://www.youtube.com/watch?v=EZkw5TDcCGw
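The "2-step" counting in use case 3 (aggregate locally on each server, then combine on an aggregate node) can be pictured with the following Python sketch. It only illustrates the shape of the computation; the function names and host names are hypothetical, and the real setup uses Norikra queries and Fluentd rather than in-process function calls.

```python
from collections import Counter


def local_counts(request_hosts):
    """Step 1: on each server, count requests locally (one item per request)."""
    return Counter(request_hosts)


def merge_counts(per_node_counts):
    """Step 2: on the aggregate node, sum the locally computed counts."""
    total = Counter()
    for counts in per_node_counts:
        total.update(counts)  # adds counts key by key
    return total


# Hypothetical host names and request lists.
node_a = local_counts(["web01", "web01", "web01"])
node_b = local_counts(["web02", "web02"])

per_host = merge_counts([node_a, node_b])  # "counts per host"
total_count = sum(per_host.values())  # "total count"
```

The point of the first step is that only small count records, not every access log line, have to travel to the aggregate node.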
  54. 54. More queries, more simplicity and less latency. Thanks! photo: by my co-workers
  55. 55. See also: http://norikra.github.io/ ; “Stream processing and Norikra” http://www.slideshare.net/tagomoris/stream-processing-and-norikra ; “Batch processing and Stream processing by SQL” http://www.slideshare.net/tagomoris/hcj2014-sql ; “Log analysis systems and its designs in LINE Corp 2014 Early” http://www.slideshare.net/tagomoris/log-analysis-system-and-its-designs-in-line-corp-2014-early ; “Norikra in Action” http://www.slideshare.net/tagomoris/norikra-in-action-ver-2014-spring
  56. 56. HA? Distributed? NO! I have some ideas, but I have no time to implement them. We have no need for HA/distributed processing.
  57. 57. Data flow & API? Use Fluentd!
  58. 58. Scalability? 10,000 - 100,000 events/sec on 2CPU 8Core server
  59. 59. Storm or Norikra? Simple and fixed workload for huge traffic: use Storm! Complex and fragile workload for non-huge traffic: use Norikra!
