
Lambda Architecture Using SQL



Keynote at HadoopCon 2014 Taiwan:
* Data analytics platform architecture & designs
* Lambda architecture overview
* Using SQL as DSL for stream processing
* Lambda architecture using SQL

Published in: Technology


  1. Lambda Architecture Platform Using SQL. Sep 13 2014, HadoopCon 2014 Taiwan. TAGOMORI Satoshi (@tagomoris)
  2. Taipei
  3. Topics: About Me & LINE; Data analytics workloads; Batch processing; Stream processing; Lambda architecture; Lambda architecture using SQL. Norikra: Stream processing with SQL (13:30-14:20, 4F)
  4. @tagomoris: Satoshi Tagomori (田籠 聡), LINE Corporation, Analytics Platform Team
  5. Tokyo
  6. LINE Offices: Tokyo HQ, Spain, Thailand, Taipei, USA, Korea
  7. LINE is born! June 23, 2011
  8. Part 01: Data Analytics Workload
  9. Various Data Analytics Workloads: reports; monthly/daily reports; hourly (or shorter) news; real-time metrics; automatically updated reports/graphs; alerts for abuse of services, overload, ...
  10. Batch Processing: Hadoop MapReduce (or Spark, Tez) & DSLs (Hive, Pig, ...), for reports. MPP engines (Cloudera Impala, Apache Drill, Facebook Presto, ...), for interactive analysis and for reports of shorter windows.
  11. Stream Processing: Apache Storm (Incubator project), “Distributed and fault-tolerant realtime computation”. Norikra (by tagomoris), non-distributed, “Stream processing with SQL”.
  12. Why Stream Processing? Less latency (realtime metrics, short-term prompt reports). Less computing power (10 Mbps for batch processing: 100 GB/day; 10 Mbps for stream processing: 1 server). No query schedule management (once a query is registered, it runs forever).
  13. Disadvantages of Stream Processing: Queries must be written before the data arrives (there should be another way to query past data). Queries cannot be run twice (all results are lost when any error occurs; all data is gone by the time bugs are found). Out-of-order events break results (queries based on recorded time, or on arrival time?).
  14. Part 02: Lambda Architecture
  15. Lambda Architecture: “The Lambda-architecture aims to satisfy the needs for a robust system that is fault-tolerant, both against hardware failures and human mistakes, being able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required. The resulting system should be linearly scalable, and it should scale out rather than up.”
  16. Lambda Architecture: Overview (diagram). New data feeds both the batch layer (master dataset) and the speed layer (real-time view); the serving layer (view) and the real-time view answer queries.
  17. Twitter Summingbird: Lambda architecture library. Batch mode: Scalding on Hadoop MapReduce. Realtime mode: Storm. Word counting by Summingbird (Scala):
        def wordCount[P <: Platform[P]]
          (source: Producer[P, String], store: P#Store[String, Long]) =
            source.flatMap { sentence =>
              toWords(sentence).map(_ -> 1L)
            }.sumByKey(store)
  18. What Lambda Architecture Provides: Replayable queries (redo queries anytime if results of the speed layer are broken). Accurate results on demand (prompt reports in the speed layer with arrival time; fixed reports in the batch layer with recorded time). ... And many more benefits of stream processing.
  19. Why Don’t All of Us Use It? Storm doesn’t fit well with many uses. Storm requires too many computing resources to deploy. Summingbird requires many steps to deploy. Many directors/analysts don’t write Scala/Java. The Summingbird DSL is not easy enough for non-professional people.
  20. Part 03: Lambda Architecture Using SQL
  21. Existing Hadoop Platform (diagram labels): new data, Fluentd, HDFS, Hive query, Presto query
  22. Norikra: Schema-less stream processing with SQL. “Norikra is a open source server software provides "Stream Processing" with SQL, written in JRuby, runs on JVM, licensed under GPLv2.”
        SELECT path,
               COUNT(1, status=200) AS success_count,
               COUNT(1, status=500) AS server_error_count,
               COUNT(*) AS count
        FROM AccessLog.win:time_batch(10 min, 0L)
        WHERE service='myservice' AND path LIKE '/api/%'
        GROUP BY path
  23. Added-on Lambda Architecture Platform (diagram labels): new data, HDFS, Hive query, Presto query, Norikra query
  24. “Pseudo Lambda” Architecture Using SQL: Lambda architecture platform with almost the same queries
        SELECT path,
               COUNT(IF(status=200,1,NULL)) AS success_count,
               COUNT(IF(status=500,1,NULL)) AS server_error_count,
               COUNT(*) AS count
        FROM AccessLog
        WHERE service='myservice' AND path LIKE '/api/%'
          AND timestamp >= '2014-09-13 10:40:00'
          AND timestamp < '2014-09-13 10:50:00'
        GROUP BY path

        SELECT path,
               COUNT(1, status=200) AS success_count,
               COUNT(1, status=500) AS server_error_count,
               COUNT(*) AS count
        FROM AccessLog.win:time_batch(10 min, 0L)
        WHERE service='myservice' AND path LIKE '/api/%'
        GROUP BY path
  25. “Pseudo Lambda” Architecture Using SQL: SQL dialects are easy to learn! Standard SQL, Hive, Presto, Impala, Drill, ... + Norikra. For non-professional people too! SQL queries are very easy to write twice!
  26. Use Cases in LINE: Prompt reports for Ads service (short-term prompt reports by Norikra; daily fixed reports by Hive). Summary of application server error log (aggregate error log for alerting by Norikra; check details with Hive, Presto (or grep!)). See you later for details!
  27. TMTOWTDI: “There’s more than one way to do it.” - Perl programming language
  28. SHARE: What I want & What I’m doing! - tagomoris
  29. Q & A
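
The "10 Mbps ... 100 GB/day" figure on the "Why Stream Processing?" slide can be checked with a little arithmetic. The sketch below assumes decimal units (1 Mbit = 10^6 bits, 1 GB = 10^9 bytes) and a rate sustained for a full day; the function name is only illustrative.

```python
# Back-of-the-envelope check of the "10 Mbps is roughly 100 GB/day" figure.
# Assumption: decimal units and a rate sustained for 24 hours.
BITS_PER_MEGABIT = 1_000_000
BYTES_PER_GIGABYTE = 1_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_gb(mbps):
    """Return the volume in gigabytes accumulated per day at a sustained rate in Mbps."""
    bits_per_day = mbps * BITS_PER_MEGABIT * SECONDS_PER_DAY
    return bits_per_day / 8 / BYTES_PER_GIGABYTE


print(daily_volume_gb(10))
```

At 10 Mbps this comes to about 108 GB per day, which matches the slide's rounded figure: a batch system has to store and rescan that volume, while a stream processor only has to keep up with the 10 Mbps flow.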
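
The question raised on the "Disadvantage of Stream Processing" slide (recorded time or arrival time?) can be illustrated with a small sketch. It is not taken from the talk: the event fields `recorded` and `arrived` and both helper functions are hypothetical. It only shows that a delayed event lands in a different 10-minute window depending on which timestamp the query uses.

```python
from collections import Counter
from datetime import datetime


def window_start(ts):
    """Floor a timestamp to the start of its 10-minute window."""
    return ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)


def count_per_window(events, time_field):
    """Count events per 10-minute window, keyed by the chosen timestamp field."""
    return dict(Counter(window_start(e[time_field]) for e in events))


# Hypothetical events: the second one was recorded at 10:49:58 but arrived at 10:50:03.
events = [
    {"recorded": datetime(2014, 9, 13, 10, 41, 0), "arrived": datetime(2014, 9, 13, 10, 41, 2)},
    {"recorded": datetime(2014, 9, 13, 10, 49, 58), "arrived": datetime(2014, 9, 13, 10, 50, 3)},
    {"recorded": datetime(2014, 9, 13, 10, 52, 0), "arrived": datetime(2014, 9, 13, 10, 52, 1)},
]

by_recorded = count_per_window(events, "recorded")
by_arrived = count_per_window(events, "arrived")
print(by_recorded)
print(by_arrived)
```

Counting by recorded time puts two events in the 10:40 window, while counting by arrival time puts only one there. This is the gap the talk attributes to prompt speed-layer reports (arrival time) versus fixed batch-layer reports (recorded time).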
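
The "Lambda Architecture: Overview" slide describes queries that are answered from both a batch view and a real-time view. A minimal sketch of that merge step follows, assuming both views are simple per-key counters and that the real-time view only holds events the batch layer has not yet processed; the names and numbers are made up for illustration.

```python
def merged_count(batch_view, realtime_view, key):
    """Answer a query for one key by summing the batch view and the real-time view."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


# Hypothetical views: the batch view covers everything up to the last batch run,
# the real-time view covers only events that arrived after that run.
batch_view = {"/api/users": 120, "/api/items": 45}
realtime_view = {"/api/users": 7, "/api/new": 3}

print(merged_count(batch_view, realtime_view, "/api/users"))
print(merged_count(batch_view, realtime_view, "/api/new"))
```

If the real-time view turns out to be wrong, it can be discarded once the batch layer has recomputed the same period from the master dataset, which is the "replayable queries" benefit listed in the talk.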
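
The "Pseudo Lambda" idea of writing nearly the same query twice can be tried out locally. The sketch below is only a stand-in and makes several assumptions: SQLite plays the role of the batch engine (using `CASE WHEN` because SQLite has no Hive-style `IF`), a plain Python loop plays the role of the stream processor for a single 10-minute window, and the sample events are invented. It does not use Hive or Norikra themselves.

```python
import sqlite3
from collections import defaultdict

# Hypothetical access-log events; field names follow the queries shown in the slides.
EVENTS = [
    {"service": "myservice", "path": "/api/users", "status": 200, "timestamp": "2014-09-13 10:41:00"},
    {"service": "myservice", "path": "/api/users", "status": 500, "timestamp": "2014-09-13 10:45:30"},
    {"service": "myservice", "path": "/api/items", "status": 200, "timestamp": "2014-09-13 10:49:59"},
    {"service": "myservice", "path": "/index", "status": 200, "timestamp": "2014-09-13 10:42:00"},
    {"service": "other", "path": "/api/users", "status": 200, "timestamp": "2014-09-13 10:43:00"},
]

# Batch-style query over stored rows (SQLite dialect, standing in for the batch engine).
BATCH_SQL = """
SELECT path,
       COUNT(CASE WHEN status = 200 THEN 1 END) AS success_count,
       COUNT(CASE WHEN status = 500 THEN 1 END) AS server_error_count,
       COUNT(*) AS count
FROM AccessLog
WHERE service = 'myservice' AND path LIKE '/api/%'
  AND timestamp >= '2014-09-13 10:40:00'
  AND timestamp < '2014-09-13 10:50:00'
GROUP BY path
"""


def batch_view(events):
    """Load the events into an in-memory table and run the batch query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE AccessLog (service TEXT, path TEXT, status INTEGER, timestamp TEXT)")
    conn.executemany("INSERT INTO AccessLog VALUES (:service, :path, :status, :timestamp)", events)
    rows = conn.execute(BATCH_SQL).fetchall()
    conn.close()
    return {p: {"success_count": s, "server_error_count": e, "count": c} for p, s, e, c in rows}


def speed_view(events):
    """Update per-path counters one event at a time, as a stream processor would for one window."""
    view = defaultdict(lambda: {"success_count": 0, "server_error_count": 0, "count": 0})
    for e in events:
        if e["service"] != "myservice" or not e["path"].startswith("/api/"):
            continue
        row = view[e["path"]]
        row["success_count"] += int(e["status"] == 200)
        row["server_error_count"] += int(e["status"] == 500)
        row["count"] += 1
    return dict(view)


print(batch_view(EVENTS))
print(speed_view(EVENTS))
```

Both paths produce the same per-path counts for this window, which is the property the talk relies on: the speed layer gives a prompt answer, and the batch layer can reproduce or correct it later from stored data.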