Handling not so big data

Talk at YAPC::Asia Tokyo 2014

Published in: Technology

  1. 1. Handling not so big data. YAPC::Asia 2014, Day 2, 2014/08/30, @tagomoris
  2. 2. TAGOMORI Satoshi (@tagomoris), LINE Corporation, Analytics Platform Team
  3. 3. Data Analytics overview (diagram stages: collect, parse, clean up, process, visualize, store, process)
  4. 4. Data Analytics overview (diagram stages: collect, parse, clean up, process, visualize, store, process)
  5. 5. Consider data size: Stored size? (Total? Per day?) Throughput? (Daily average? Peak time?) Structured? Compressed?
  6. 6. DO NOT consider exact data size. It will increase/decrease dramatically!
  7. 7. Consider rough data size. Data size per query: sub-gigabyte; from gigabytes to terabytes; petabytes or more.
  8. 8. Sub GB: Use an RDBMS!
  9. 9. PB or more: Use Hadoooooooooop! And Storm!
  10. 10. From GB to TB: How large? The main target for many service providers. Too large for “a” disk or “a” memory space. Appropriate for many disks. Not so large for many memory spaces.
  11. 11. From GB to TB: For what? Not so small: we cannot do everything. Data analytics methods: search, aggregation, recommendation, anomaly detection. Consider “what you want to do” first; it determines what you need to think about.
  12. 12. Types of data processing. Data size and I/O throughput intensive: search, aggregation. CPU power and memory size intensive: machine learning, graph processing. Select an appropriate processing framework/middleware: in memory only? With spilling?
  13. 13. Architecture of distributed processing systems (diagram labels: job management, resource management, distributed file system, framework, domain specific language, query processing subsystem, monolithic, query engine, middleware)
  14. 14. Architecture of distributed processing systems (diagram labels as on slide 13)
  15. 15. Short break: Languages and DSLs. Java, Scala, ... SQL: Hive, Impala, Drill, Presto, ... Others: Pig, Cascading, ...
  16. 16. Architecture of distributed processing systems (diagram labels as on slide 13, with the callout “where this talk is about”)
  17. 17. A tour of distributed processing frameworks and query engines
  18. 18. MapReduce. Hadoop MapReduce: Map + Combine + Shuffle + Reduce. Intermediate output is written to disk. Shuffle always requires sync. (Diagram: four Map tasks, each followed by a Combine, feeding three Reduce tasks through the shuffle.)
  19. 19. MRv1 vs MRv2 on Hadoop. MRv1: resource management, job management, framework. MRv2: job management, framework.
  20. 20. MRv1 vs MRv2 on Hadoop. MRv1: resource management, job management, framework. MRv2: job management, framework. (Callout: “forget this”)
  21. 21. Apache Spark. “Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.” Directed Acyclic Graph (DAG). Resilient Distributed Datasets (RDDs). Pros: batch, machine learning, graph.
  22. 22. Apache Tez. “The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data.” Directed Acyclic Graph (DAG). Pros: big MR, multiple aggregations.
  23. 23. MR, Spark, Tez (diagram labels as on slide 13)
  24. 24. DAG (directed acyclic graph). Figure from: http://tez.apache.org/
  25. 25. Variations of engines. They make jobs faster than MapReduce, especially memory-intensive, complex jobs. Hive can switch its backend from MR to Spark/Tez. MR’s stability is VERY important. The alternatives are still under development.
  26. 26. MPP Engines. Apache Drill, Cloudera Impala, Facebook Presto. Massively Parallel Processing (MPP): DSL (SQL) + job management. Data source is external datastores. Very low latency: using many threads. Low availability and less tolerance for large memory requirements.
  27. 27. Stream processing. Without any storage. Process data over specified windows: every X events, every Y minutes, for unique values, ... In-memory processing. Ultra-low latency. There are too many things to consider...
  28. 28. Stream processing: more. Twitter Storm: distributed stream processing platform; processing w/ Java or other JVM languages; for super high throughput data (not for minimal data). Norikra: non-distributed stream processing platform; processing w/ SQL, but not distributed...; for low to mid-high throughput data.
  29. 29. What I won’t talk about today...
  30. 30. What is Hadoop? Original Hadoop: HDFS, MapReduce v1. Hadoop v2: HDFS, ResourceManager + MapReduce v2.
  31. 31. What is Hadoop? Hadoop v2: HDFS, ResourceManager + MapReduce v2, Spark.
  32. 32. What is Hadoop? Hadoop v2: HDFS, ResourceManager + MapReduce v2, Apache Spark, Apache Tez.
  33. 33. What is Hadoop? Hadoop v2: HDFS, ResourceManager + MapReduce v2, Apache Spark, Apache Tez, Twitter Storm (Apache Incubator).
  34. 34. What is Hadoop? Hadoop v2: HDFS, ResourceManager + MapReduce v2, Apache Spark, Apache Tez, Twitter Storm (Apache Incubator), Apache Drill.
  35. 35. What is Hadoop? Hadoop v2: HDFS, ResourceManager + MapReduce v2, Apache Spark, Apache Tez, Twitter Storm (Apache Incubator), Apache Drill, Hive, Pig, ...
  36. 36. What is Hadoop? Hadoop v2: HDFS, ResourceManager + MapReduce v2, Apache Spark, Apache Tez, Twitter Storm (Apache Incubator), Apache Drill. What Hadoop is ....
  37. 37. ALL YOUR ENGINES ARE BELONG TO US.
  38. 38. What is Hadoop? The big data platform as a whole is called “Hadoop”, just as “Linux” names not only the kernel but also the distribution. Core: distributed file systems, data flow.
  39. 39. “BigData as a Service” by @naoya_ito. AWS EMR/Redshift, Google BigQuery, Treasure Data, ... They have their own architectures, storage, and data flows. Data flow is always the most important.
  40. 40. Perl? The big data world is dominated by the JVM, with many contributors from many companies. We should not write our own distributed processing software. Stand on the shoulders of giants! Connect the Perl world with JVM systems through CPAN modules.
  41. 41. 1. What do you want? 2. How large is your data? 3. Choose an architecture!
  42. 42. SHARE software, know-how & concerns! Thank you!
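
Slides 7 to 9 sort workloads into three rough buckets by data size per query. A minimal sketch of that rule of thumb in Python follows. The function name `rough_tier` and the exact byte thresholds (decimal 1 GB and 1 PB) are assumptions made for illustration; the talk deliberately gives only rough boundaries.

```python
# Hypothetical thresholds: the slides only say "sub GB", "GB to TB", "PB or more".
GB = 10**9
PB = 10**15


def rough_tier(bytes_per_query: int) -> str:
    """Map a rough per-query data size to one of the talk's three buckets."""
    if bytes_per_query < GB:
        return "sub-GB: use an RDBMS"
    if bytes_per_query < PB:
        return "GB to TB: choose a framework or query engine for the workload"
    return "PB or more: Hadoop (and Storm)"


if __name__ == "__main__":
    for size in (500 * 10**6, 3 * 10**12, 2 * 10**15):
        print(size, "->", rough_tier(size))
```

Because slide 6 warns that exact sizes change dramatically, the input here is meant to be an order-of-magnitude estimate, not a measured value.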
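
Slide 18 lists the MapReduce phases as Map + Combine + Shuffle + Reduce. The sketch below walks a word count through those four phases in a single Python process. It is an illustration only, not Hadoop's API: all function names are hypothetical, intermediate output stays in memory (Hadoop writes it to disk), and the "sync" before the shuffle is simply the point where every map and combine has finished.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for line in chunk for word in line.split()]


def combine_phase(pairs):
    """Pre-aggregate one mapper's output locally to shrink shuffle traffic."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def shuffle_phase(combined_outputs, num_reducers):
    """Route every key to exactly one reducer partition, grouping its values."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in combined_outputs:
        for key, value in pairs:
            partitions[hash(key) % num_reducers][key].append(value)
    return partitions


def reduce_phase(partition):
    """Sum the grouped values for each key in one partition."""
    return {key: sum(values) for key, values in partition.items()}


def word_count(chunks, num_reducers=3):
    # All maps and combines complete here before the shuffle starts (the sync point).
    combined = [combine_phase(map_phase(chunk)) for chunk in chunks]
    result = {}
    for partition in shuffle_phase(combined, num_reducers):
        result.update(reduce_phase(partition))
    return result


if __name__ == "__main__":
    splits = [["a b a"], ["b c"], ["a"], ["c c"]]  # four splits, as in the slide's four mappers
    print(word_count(splits))
```

The DAG engines on slides 21 and 22 aim to avoid running this fixed four-phase pipeline once per step of a complex job.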
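
Slide 27 describes stream processing as computing over specified windows, for example "every X events", in memory and without storage. A minimal sketch of a count-based (tumbling) window in Python follows. The names `count_windows` and `window_sums` are hypothetical, and dropping a trailing partial window is an assumption of this sketch; Storm and Norikra each have their own windowing semantics.

```python
from typing import Iterable, Iterator, List


def count_windows(events: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive, non-overlapping windows of exactly `size` events."""
    window: List[int] = []
    for event in events:
        window.append(event)
        if len(window) == size:
            yield window
            window = []  # nothing is stored once a window has been emitted
    # Assumption: an incomplete trailing window is dropped, not emitted.


def window_sums(events: Iterable[int], size: int) -> List[int]:
    """Aggregate each window as soon as it closes."""
    return [sum(window) for window in count_windows(events, size)]


if __name__ == "__main__":
    print(window_sums(range(1, 8), 3))
```

Time-based windows ("every Y minutes") follow the same shape, with the close condition based on event timestamps instead of a counter.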
