Batch processing and Stream processing by SQL


  1. Batch processing and Stream processing by SQL. @tagomoris (TAGOMORI Satoshi). 2014/07/08 (Tuesday), Hadoop Conference Japan 2014, #hcj2014
  2. TAGOMORI Satoshi (@tagomoris), LINE Corporation, Analytics Platform Team
  6. SQL
  7. BATCH and/or STREAM
  8. Analytics data flow overview (diagram labels: servers, Fluentd Cluster, archive, visualization, notifications, Hadoop / Hive, Presto, Fluentd, Norikra, application metrics). Source: “Log analysis systems and its designs in LINE corp. 2014 early” http://www.slideshare.net/tagomoris/log-analysis-system-and-its-designs-in-line-corp-2014-early
  9. Analytics data flow diagram (same labels as slide 8), annotated with STREAM and BATCH
  10. Analytics data flow diagram (same labels as slide 8), annotated with STREAM, BATCH and SQL
  11. SQL is NOT the best. But SQL is better than NONE.
  12. What supports SQL: RDBMS; Apache Hive (on MR/Spark/Tez); Facebook Presto, Cloudera Impala, Apache Drill; Google BigQuery; ...
  14. SQL SQL SQL SQL (2/6) SQL SQL SQL SQL
  15. Columns: DB | Batch | Short Batch
      non-SQL: NoSQL | Hadoop MR, Pig | ----
      SQL: RDBMS | Hive | Presto, Impala, Drill
  16. Batch processing OR Stream processing?
  17. Batch processing: Hadoop/Hive. Target window: hours - weeks (or more). Total throughput: HIGHEST. Query latency: LARGEST (20 sec - mins - hours)
  18. Short Batch processing: Presto, Impala, Drill. Target window: seconds - hours (- days). Total throughput: Normal. Query latency: Small (seconds - mins)
  19. Stream processing: Storm, Kafka, Esper, Norikra, Fluentd, ... Spark Streaming(?). Target window: seconds - hours. Total throughput: Normal. Query latency: SMALLEST (milliseconds). Queries must be written BEFORE DATA. Once registered, runs forever
  20. Data flow and latency (diagram labels: data window, query execution, Batch, Short Batch, Stream, incremental query execution)
  21. Data window: target time (or size) range of queries. Batch (or short batch): FROM-TO, e.g. WHERE dt >= '2014-07-07 00:00:00' AND dt <= '2014-07-08 23:59:59'. Stream: “Calculate this query for every 3 minutes”. Extended SQL required
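
The two kinds of data window on slide 21 can be contrasted with a small Python sketch. This is only an illustration: the event layout (`dt`, `id`) and both helper functions are hypothetical, and the stream side is simulated over an in-memory list rather than run on a real stream engine.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def batch_window(events, start, end):
    """Batch style: pick the events inside one fixed FROM-TO range."""
    return [e for e in events if start <= e["dt"] <= end]

def tumbling_windows(events, size):
    """Stream style: assign every event to a consecutive fixed-size window."""
    windows = {}
    for e in events:
        index = (e["dt"] - EPOCH) // size      # timedelta // timedelta gives an int
        window_start = EPOCH + index * size
        windows.setdefault(window_start, []).append(e)
    return dict(sorted(windows.items()))

# Hypothetical sample events.
events = [
    {"dt": datetime(2014, 7, 7, 0, 0, 30), "id": 1},
    {"dt": datetime(2014, 7, 7, 0, 2, 59), "id": 2},
    {"dt": datetime(2014, 7, 7, 0, 3, 0), "id": 3},
    {"dt": datetime(2014, 7, 9, 0, 0, 0), "id": 4},
]

# One query over a FROM-TO range, as in the WHERE clause on the slide.
batch = batch_window(
    events, datetime(2014, 7, 7, 0, 0, 0), datetime(2014, 7, 8, 23, 59, 59)
)

# "Calculate this query for every 3 minutes".
windows = tumbling_windows(events, timedelta(minutes=3))
```

The batch call answers one question about a range chosen at query time, while the windowed version keeps producing one result set per 3-minute interval for as long as events arrive.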
  22. Stream processing with SQL. Esper: Java library to process Stream. With schema
  23. Stream processing with SQL. Esper: Java library to process Stream. Esper EPL: SELECT param1, param2 FROM tbl WHERE age > 30
  24. Stream processing with SQL. Esper: Java library to process Stream. Esper EPL: SELECT param, COUNT(*) AS c FROM tbl WHERE age > 30 GROUP BY param
  25. Stream processing with SQL. Esper: Java library to process Stream. Esper EPL: SELECT param, COUNT(*) AS c FROM tbl.win:time_batch(1 hour) WHERE age > 30 GROUP BY param
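
To make the effect of the `time_batch(1 hour)` query on slide 25 concrete, here is a rough Python simulation. It is not Esper and makes simplifying assumptions: events arrive as `(timestamp_seconds, event)` pairs sorted by time, windows are aligned to multiples of the window size, empty windows are skipped, and the last window is flushed when the input ends. The function name and sample data are hypothetical.

```python
from collections import Counter

def time_batch_counts(events, window_seconds=3600):
    """Return [(window_start, {param: count})], one entry per non-empty window.

    Mirrors: SELECT param, COUNT(*) AS c FROM tbl.win:time_batch(1 hour)
             WHERE age > 30 GROUP BY param
    """
    results = []
    current_start = None
    counts = Counter()
    for ts, event in events:
        start = ts - ts % window_seconds
        if current_start is not None and start != current_start:
            # The previous window is complete: emit its grouped counts.
            results.append((current_start, dict(counts)))
            counts = Counter()
        current_start = start
        if event["age"] > 30:               # WHERE age > 30
            counts[event["param"]] += 1     # GROUP BY param, COUNT(*)
    if current_start is not None:
        results.append((current_start, dict(counts)))
    return results

# Hypothetical sample stream: (seconds, event).
stream = [
    (0, {"param": "a", "age": 35}),
    (10, {"param": "a", "age": 20}),
    (3599, {"param": "b", "age": 40}),
    (3600, {"param": "a", "age": 31}),
    (7100, {"param": "a", "age": 50}),
]

hourly = time_batch_counts(stream)
```

Each tuple in `hourly` corresponds to one batch of output rows that the registered query would emit when its window closes.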
  27. Norikra: Schema-less Stream Processing with SQL. OSS, based on Esper EPL, GPLv2. Without pre-defined schema. Complex event processing (w/ nested hash/array) w/ SQL. HTTP RPC w/ JSON or MessagePack (fluentd plugin available!). Dynamic query registration/removal. Ultra fast bootstrap (in 3 minutes!). UDF plugins by Java/Ruby. http://norikra.github.io/
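
As an illustration of what querying nested hash/array events without a pre-defined schema involves, the following Python sketch resolves dotted field paths against a decoded JSON event. It does not use Norikra and does not reproduce its field syntax: `get_path`, the numeric array-index convention and the sample event are all assumptions made for this example.

```python
import json

def get_path(event, path):
    """Resolve a dotted path such as 'member.region' in a nested event.

    Returns None when any step is missing, so events with different
    shapes can flow through the same query without a declared schema.
    """
    current = event
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
            current = current[int(key)]     # hypothetical array-index convention
        else:
            return None
    return current

# Hypothetical event, as it might arrive in JSON.
raw = '{"type": "click", "member": {"id": 42, "region": "JP", "tags": ["a", "b"]}}'
event = json.loads(raw)

region = get_path(event, "member.region")
second_tag = get_path(event, "member.tags.1")
missing = get_path(event, "campaign.id")
```

Fields are discovered from the events themselves, which is the property that lets queries be registered and removed dynamically without a table definition.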
  28. Distributed processing OR NOT? Norikra is NOT a distributed processing platform. Of course, SCALE OUT IS FANTASTIC. Is non-distributed software useless? MySQL / MySQL Cluster. Norikra can handle 10k events/sec on a 2-CPU (8-core) server
  29. Columns: DB | Batch | Short Batch | Stream
      non-SQL: NoSQL | Hadoop MR, Pig | ---- | Storm, Kafka, Dataflow(G)
      SQL: RDBMS | Hive | Presto, Impala, Drill | Norikra
  30. Lambda architecture: just the same 2 processes, on Stream processing and on Batch processing. http://lambda-architecture.net/
  31. Replayable processing: Stream processing MUST NOT be replayable. Queries on stream processing SHOULD be replayable
  32. Hybrid processing: for fault tolerance. Stream processing: executes queries in normal operation. Batch processing: executes recovery queries
  33. Hybrid processing: for latency reduction + accuracy. Stream processing: for prompt reports (preliminary figures). Batch processing: for fixed reports (final figures)
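
One way to picture the prompt/fixed hybrid on slide 33 is a report store in which batch results replace stream results once they exist. The sketch below is an assumption about how such a merge could look, not a description of the system at LINE; the keys, values and `merge_reports` helper are hypothetical.

```python
def merge_reports(prompt, fixed):
    """Combine stream (prompt) and batch (fixed) results per report key.

    A fixed value always wins; a prompt value is shown only until the
    batch job has produced the fixed value for the same key.
    """
    merged = {}
    for key, value in prompt.items():
        merged[key] = {"value": value, "status": "prompt"}
    for key, value in fixed.items():
        merged[key] = {"value": value, "status": "fixed"}
    return merged

# Hypothetical hourly click counts keyed by (date, hour).
prompt_reports = {("20140708", "00"): 980, ("20140708", "01"): 1210}
fixed_reports = {("20140708", "00"): 1003}   # batch has finished hour 00 only

reports = merge_reports(prompt_reports, fixed_reports)
```

Hour 00 shows the batch figure, while hour 01 still shows the stream figure and is marked as such. The same shape also fits the fault-tolerance case on slide 32, where a batch recovery query fills in hours the stream side missed.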
  34. Hybrid stream processing: against complexity. Non-SQL stream processing: for simple, fixed, high-traffic events. SQL stream processing: for complex, fragile events
  35. Case study in LINE. Prompt-report & fixed-report: Norikra + Hive hybrid. Error detection from application and access logs: Norikra + Fluentd hybrid. Realtime aggregation for complex and simple (fixed) objects: Norikra + Fluentd hybrid
  37. Hive: fixed-reports
      SELECT
        yyyymmdd, hh, campaign_id, region, lang,
        COUNT(*) AS click,
        COUNT(DISTINCT member_id) AS uu
      FROM (
        SELECT
          yyyymmdd, hh,
          get_json_object(log, '$.campaign.id') AS campaign_id,
          get_json_object(log, '$.member.region') AS region,
          get_json_object(log, '$.member.lang') AS lang,
          get_json_object(log, '$.member.id') AS member_id
        FROM applog
        WHERE service='myservice'
          AND yyyymmdd='20140708'
          AND hh='00'
          AND get_json_object(log, '$.type')='click'
      ) x
      GROUP BY yyyymmdd, hh, campaign_id, region, lang
  38. Norikra: prompt-reports
      SELECT
        campaign.id AS campaign_id,
        member.region AS region,
        member.lang AS lang,
        COUNT(*) AS click,
        COUNT(DISTINCT member.id) AS uu
      FROM myservice.win:time_batch(1 hours)
      WHERE type="click"
      GROUP BY campaign.id, member.region, member.lang
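
Both the Hive and the Norikra query compute the same two measures per (campaign, region, lang) group: a click count and a count of distinct members ("uu"). The Python sketch below reproduces that aggregation over a handful of in-memory events so the intended result is easy to check by hand. The event shape follows the field names used in the queries, but the sample data and the `aggregate_clicks` helper are hypothetical.

```python
from collections import defaultdict

def aggregate_clicks(events):
    """Group click events and compute COUNT(*) and COUNT(DISTINCT member.id)."""
    clicks = defaultdict(int)
    members = defaultdict(set)
    for e in events:
        if e.get("type") != "click":        # WHERE type = "click"
            continue
        key = (e["campaign"]["id"], e["member"]["region"], e["member"]["lang"])
        clicks[key] += 1                    # COUNT(*) AS click
        members[key].add(e["member"]["id"]) # feeds COUNT(DISTINCT ...) AS uu
    return {k: {"click": clicks[k], "uu": len(members[k])} for k in clicks}

# Hypothetical events for one window (Norikra) or one hour partition (Hive).
sample = [
    {"type": "click", "campaign": {"id": "c1"}, "member": {"id": 1, "region": "JP", "lang": "ja"}},
    {"type": "click", "campaign": {"id": "c1"}, "member": {"id": 1, "region": "JP", "lang": "ja"}},
    {"type": "click", "campaign": {"id": "c1"}, "member": {"id": 2, "region": "JP", "lang": "ja"}},
    {"type": "view", "campaign": {"id": "c1"}, "member": {"id": 3, "region": "JP", "lang": "ja"}},
    {"type": "click", "campaign": {"id": "c2"}, "member": {"id": 2, "region": "TW", "lang": "zh"}},
]

report = aggregate_clicks(sample)
```

Running the same logic once per closed window gives the prompt report, and running it over a complete hour of archived logs gives the fixed report.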
  39. More queries, more simplicity and less latency. Thanks!
