Transcript of "Batch processing and Stream processing by SQL"

  1. Batch processing and Stream processing by SQL
      @tagomoris (TAGOMORI Satoshi)
      2014/07/08
      Hadoop Conference Japan 2014 #hcj2014
      Tuesday, July 8, 2014
  2. TAGOMORI Satoshi (@tagomoris)
      LINE Corporation
      Analytics Platform Team
  3. (no text on slide)
  4. (no text on slide)
  5. (no text on slide)
  6. SQL
  7. BATCH and/or STREAM
  8. Analytics data flow overview
      [Diagram labels: servers; Fluentd Cluster; archive; visualization; notifications; Hadoop / Hive; Presto; Fluentd; Norikra; application metrics]
      "Log analysis systems and its designs in LINE corp. 2014 early"
      http://www.slideshare.net/tagomoris/log-analysis-system-and-its-designs-in-line-corp-2014-early
  9. [Same diagram labels as slide 8, plus: STREAM; BATCH]
  10. [Same diagram labels as slide 8, plus: STREAM; BATCH; SQL]
  11. SQL is NOT the best.
      But, SQL is better than NONE.
  12. What supports SQL:
      RDBMS
      Apache Hive (on MR/Spark/Tez)
      Facebook Presto, Cloudera Impala, Apache Drill
      Google BigQuery, ...
  13. (no text on slide)
  14. SQL SQLSQL SQL (2/6)SQL SQL SQL SQL
  15.            DB       Batch            Short Batch
      non-SQL    NoSQL    HadoopMR, Pig    ----
      SQL        RDBMS    Hive             Presto, Impala, Drill
  16. Batch processing. OR Stream processing?
  17. Batch processing
      Hadoop/Hive
      Target window: hours - weeks (or more)
      Total throughput: HIGHEST
      Query latency: LARGEST (20 sec - mins - hours)
  18. Short Batch processing
      Presto, Impala, Drill
      Target window: seconds - hours (- days)
      Total throughput: Normal
      Query latency: Small (seconds - mins)
  19. Stream processing
      Storm, Kafka, Esper, Norikra, Fluentd, ....
      Spark Streaming(?)
      Target window: seconds - hours
      Total throughput: Normal
      Query latency: SMALLEST (milliseconds)
      Queries must be written BEFORE DATA
      Once registered, runs forever
  20. Data flow and latency
      [Diagram labels: data window; query execution; Batch; Short Batch; Stream; incremental query execution]
  21. Data window
      Target time (or size) range of queries
      Batch (or short-batch)
        FROM-TO: WHERE dt >= '2014-07-07 00:00:00' AND dt <= '2014-07-08 23:59:59'
      Stream
        "Calculate this query for every 3 minutes"
        Extended SQL required
  22. Stream processing with SQL
      Esper: Java library to process Stream
      With schema
  23. Stream processing with SQL
      Esper: Java library to process Stream
      Esper EPL
        SELECT param1, param2
        FROM tbl
        WHERE age > 30
  24. Stream processing with SQL
      Esper: Java library to process Stream
      Esper EPL
        SELECT param, COUNT(*) AS c
        FROM tbl
        WHERE age > 30
        GROUP BY param
  25. Stream processing with SQL
      Esper: Java library to process Stream
      Esper EPL
        SELECT param, COUNT(*) AS c
        FROM tbl.win:time_batch(1 hour)
        WHERE age > 30
        GROUP BY param
  26. (no text on slide)
  27. Norikra: Schema-less Stream Processing with SQL
      OSS, based on Esper EPL, GPLv2
      Without pre-defined schema
      Complex event processing (w/ nested hash/array) w/ SQL
      HTTP RPC w/ JSON or MessagePack (fluentd plugin available!)
      Dynamic query registration/removal
      Ultra fast bootstrap (in 3 minutes!)
      UDF plugins by Java/Ruby
      http://norikra.github.io/
  28. Distributed processing OR NOT?
      Norikra is NOT a distributed processing platform.
      Of course, SCALE OUT IS FANTASTIC.
      Is non-distributed software useless?
      MySQL  MySQL Cluster
      Norikra can handle 10k events/sec on a 2-CPU (8-core) server
  29.            DB       Batch            Short Batch              Stream
      non-SQL    NoSQL    HadoopMR, Pig    ----                     Storm, Kafka, Dataflow(G)
      SQL        RDBMS    Hive             Presto, Impala, Drill    Norikra
  30. Lambda architecture
      Just the same 2 processes on:
        Stream processing
        Batch processing
      http://lambda-architecture.net/
  31. Replayable processing
      Stream processing
        MUST NOT be replayable
      Queries on stream processing
        SHOULD be replayable
  32. Hybrid processing: for fault-tolerance
      Stream processing: executes queries in normal operation
      Batch processing: executes recovery queries
  33. Hybrid processing: for latency-reduction + accuracy
      Stream processing: for prompt reports (preliminary figures)
      Batch processing: for fixed reports (finalized figures)
  34. Hybrid stream processing: against complexity
      Non-SQL stream processing: for simple, fixed, high-traffic events
      SQL stream processing: for complex, fragile events
  35. Case study in LINE
      Prompt-report & fixed-report
        Norikra + Hive Hybrid
      Error detection from application and access logs
        Norikra + Fluentd Hybrid
      Realtime aggregation for complex and simple (fixed) objects
        Norikra + Fluentd Hybrid
  36. Case study in LINE
      Prompt-report & fixed-report
        Norikra + Hive Hybrid
      Error detection from application and access logs
        Norikra + Fluentd Hybrid
      Realtime aggregation for complex and simple (fixed) objects
        Norikra + Fluentd Hybrid
  37. Hive: fixed-reports
        SELECT
          yyyymmdd, hh, campaign_id, region, lang,
          COUNT(*) AS click,
          COUNT(DISTINCT member_id) AS uu
        FROM (
          SELECT
            yyyymmdd, hh,
            get_json_object(log, '$.campaign.id') AS campaign_id,
            get_json_object(log, '$.member.region') AS region,
            get_json_object(log, '$.member.lang') AS lang,
            get_json_object(log, '$.member.id') AS member_id
          FROM applog
          WHERE service='myservice'
            AND yyyymmdd='20140708'
            AND hh='00'
            AND get_json_object(log, '$.type')='click'
        ) x
        GROUP BY yyyymmdd, hh, campaign_id, region, lang
  38. Norikra: prompt-reports
        SELECT
          campaign.id AS campaign_id,
          member.region AS region,
          member.lang AS lang,
          COUNT(*) AS click,
          COUNT(DISTINCT member.id) AS uu
        FROM myservice.win:time_batch(1 hours)
        WHERE type="click"
        GROUP BY campaign.id, member.region, member.lang
  39. More queries, more simplicity and less latency.
      Thanks!
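
Slides 21 and 25 describe a stream query that is evaluated once per time window instead of over a FROM-TO range. As a rough illustration of that idea (not Esper's or Norikra's implementation), the Python sketch below emulates the slide-25 query, `COUNT(*) ... WHERE age > 30 GROUP BY param` over one-hour batches. The function names and the event layout (`time`, `param`, `age` keys) are assumptions made for this example. It also simplifies in three ways: windows are aligned to the clock hour, events are assumed to arrive in time order, and windows with no events are skipped.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)
EPOCH = datetime(1970, 1, 1)


def window_start(ts, size=WINDOW):
    """Floor a timestamp to the start of its fixed-size window."""
    return ts - (ts - EPOCH) % size


def time_batch(events, size=WINDOW):
    """Yield (window_start, Counter of param) each time a window closes.

    Assumption: events arrive in time order, as they would on a stream.
    """
    current, counts = None, Counter()
    for ev in events:
        start = window_start(ev["time"], size)
        if current is not None and start != current:
            yield current, counts      # the previous window is complete
            counts = Counter()
        current = start
        if ev["age"] > 30:             # WHERE age > 30
            counts[ev["param"]] += 1   # COUNT(*) ... GROUP BY param
    if current is not None:
        yield current, counts          # flush the last, still-open window


if __name__ == "__main__":
    sample = [
        {"time": datetime(2014, 7, 8, 0, 5), "param": "a", "age": 31},
        {"time": datetime(2014, 7, 8, 1, 10), "param": "b", "age": 35},
    ]
    for start, counts in time_batch(sample):
        print(start, dict(counts))
```

The generator never needs the whole data set: it keeps only the counters of the open window, which is what lets a registered query keep running as long as events arrive.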
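
Slides 32 and 33 pair a stream query (prompt reports) with a batch query (fixed reports and recovery). One possible way to combine the two result sets is sketched below in Python. The slides do not say how LINE merges them, so the `merge_reports` function, the hour-keyed dictionaries and the `status` field are all hypothetical names chosen for this illustration.

```python
def merge_reports(prompt, fixed):
    """Merge per-hour results; a fixed (batch) value replaces a prompt one.

    prompt: {hour: value} produced by the stream query
    fixed:  {hour: value} produced later by the batch query
    """
    merged = {}
    for hour, value in prompt.items():
        merged[hour] = {"value": value, "status": "prompt"}
    for hour, value in fixed.items():
        # Batch results are treated as authoritative. They also fill hours
        # that the stream side missed, for example during an outage.
        merged[hour] = {"value": value, "status": "fixed"}
    return merged


if __name__ == "__main__":
    prompt = {"2014070800": 98, "2014070802": 120}
    fixed = {"2014070800": 100, "2014070801": 75}
    for hour, row in sorted(merge_reports(prompt, fixed).items()):
        print(hour, row)
```

In this sketch, hour `2014070801` is missing from the stream output and is recovered from the batch output, while hour `2014070802` shows only the prompt value until the batch query catches up.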
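
The queries on slides 37 and 38 compute the same two measures, clicks and unique users (`uu`), per campaign, region and language. The Python sketch below spells out that aggregation over JSON log lines, to make explicit what both engines are asked to do. The field names (`type`, `campaign.id`, `member.region`, `member.lang`, `member.id`) come from the slides; the function name and the sample values are invented for the example.

```python
import json
from collections import defaultdict


def click_report(lines):
    """Return {(campaign_id, region, lang): {"click": n, "uu": m}}."""
    clicks = defaultdict(int)
    members = defaultdict(set)
    for line in lines:
        ev = json.loads(line)
        if ev.get("type") != "click":      # WHERE type = 'click'
            continue
        key = (ev["campaign"]["id"],
               ev["member"]["region"],
               ev["member"]["lang"])       # GROUP BY columns
        clicks[key] += 1                   # COUNT(*)
        members[key].add(ev["member"]["id"])  # COUNT(DISTINCT member.id)
    return {k: {"click": clicks[k], "uu": len(members[k])} for k in clicks}


if __name__ == "__main__":
    sample = [
        '{"type": "click", "campaign": {"id": 1},'
        ' "member": {"id": "u1", "region": "JP", "lang": "ja"}}',
        '{"type": "click", "campaign": {"id": 1},'
        ' "member": {"id": "u1", "region": "JP", "lang": "ja"}}',
    ]
    print(click_report(sample))
```

The nested lookups such as `ev["campaign"]["id"]` correspond to `get_json_object(log, '$.campaign.id')` in the Hive query and to the dotted field `campaign.id` in the Norikra query.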