12. From GB to TB: How large?
Main target for many service providers
Too large for “a” single disk or “a” single memory space
Appropriate for many disks
Not so large for many memory spaces
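A rough back-of-the-envelope sketch of why this range is too large for one disk but comfortable for many. The 100 MB/s sequential read throughput per disk is an assumed round number, not a measurement, and `scan_seconds` is a hypothetical helper introduced only for this illustration.

```python
def scan_seconds(data_bytes, disks, mb_per_sec_per_disk=100.0):
    """Time to read data_bytes once, spread evenly over `disks` disks."""
    # Assumption: throughput scales linearly with the number of disks.
    bytes_per_sec = disks * mb_per_sec_per_disk * 1_000_000
    return data_bytes / bytes_per_sec

ONE_TB = 10**12

single = scan_seconds(ONE_TB, disks=1)
many = scan_seconds(ONE_TB, disks=100)

print(f"1 disk:    {single / 3600:.1f} hours")
print(f"100 disks: {many:.0f} seconds")
```

Under these assumptions a full scan of 1 TB drops from hours to minutes once the data is spread over many disks.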
13. From GB to TB: For what?
Not so small: We cannot do everything
Data analytics methods
search, aggregate, recommend, anomaly detect
Consider “what you want to do” first
It determines what else you need to consider
14. Types of data processing
Data size and I/O throughput intensive:
search, aggregation
CPU power and memory size intensive:
machine learning, graph processing
Select appropriate processing framework/middleware
In memory only? With spilling to disk?
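To make the “in memory only vs. spilling” question concrete, here is a minimal sketch of a sort that spills sorted runs to temporary files when a buffer limit is hit and merges them at the end. The function names and the `max_in_memory` parameter are hypothetical; real frameworks implement this very differently and at a much larger scale.

```python
import heapq
import pickle
import tempfile

def spill(sorted_items):
    """Write one sorted run to a temporary file and rewind it for reading."""
    f = tempfile.TemporaryFile()
    for item in sorted_items:
        pickle.dump(item, f)
    f.seek(0)
    return f

def read_run(f):
    """Stream the items of one spilled run back from disk."""
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return

def sort_with_spilling(items, max_in_memory):
    """Sort `items` while holding at most `max_in_memory` of them in the buffer."""
    runs, buffer = [], []
    for item in items:
        buffer.append(item)
        if len(buffer) >= max_in_memory:
            runs.append(spill(sorted(buffer)))  # buffer is full: spill to disk
            buffer = []
    streams = [read_run(f) for f in runs] + [iter(sorted(buffer))]
    try:
        # k-way merge of the sorted runs plus the in-memory remainder
        yield from heapq.merge(*streams)
    finally:
        for f in runs:
            f.close()

result = list(sort_with_spilling([5, 3, 8, 1, 9, 2, 7], max_in_memory=3))
print(result)
```

An engine that works in memory only would fail instead of spilling when the buffer limit is reached; spilling trades speed for the ability to finish.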
15. Architecture of distributed processing systems
(Diagram labels: job management, resource management, distributed file system, framework, domain specific language, query processing subsystem, monolithic query engine, middleware)
17. Short break: Languages and DSLs
Java, Scala, ...
SQL: Hive, Impala, Drill, Presto, ...
Others: Pig, Cascading, ...
18. Architecture of distributed processing systems
(Same architecture diagram, with one region marked: “what this talk is about”)
19. A tour of
distributed processing frameworks
and query engines
20. MapReduce
Hadoop MapReduce:
Map + Combine + Shuffle + Reduce
Intermediate output is written to disk
Shuffle always requires synchronization
(Diagram: four Map tasks, each with a Combine step, feeding a shuffle into three Reduce tasks)
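The Map + Combine + Shuffle + Reduce flow can be sketched as a single-process word count. This is only an illustration of the data flow, not Hadoop's API: everything stays in memory here, whereas Hadoop writes intermediate output to disk, and the partitioner is a toy one chosen to be deterministic.

```python
from collections import Counter, defaultdict

def map_phase(split):
    """Map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for line in split for word in line.split()]

def combine_phase(pairs):
    """Combine: pre-aggregate one mapper's output to shrink the shuffle."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def shuffle_phase(all_outputs, num_reducers):
    """Shuffle: route each key to exactly one reducer partition.

    It needs the output of every mapper, which is why it acts as a
    synchronization point.
    """
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in all_outputs:
        for key, value in pairs:
            index = sum(map(ord, key)) % num_reducers  # toy partitioner
            partitions[index][key].append(value)
    return partitions

def reduce_phase(partition):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in partition.items()}

def word_count(splits, num_reducers=3):
    combined = [combine_phase(map_phase(s)) for s in splits]
    result = {}
    for partition in shuffle_phase(combined, num_reducers):
        result.update(reduce_phase(partition))
    return result

splits = [["a b a"], ["b c"], ["a c c"], ["d"]]  # four input splits
counts = word_count(splits)
print(sorted(counts.items()))
```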
21. MRv1 vs MRv2 on Hadoop
MRv1:
Resource Management
Job Management
Framework
MRv2:
Job Management
Framework
22. MRv1 vs MRv2 on Hadoop
MRv1:
Resource Management
Job Management
Framework
MRv2:
Job Management
Framework
(Slide annotation: “forget this”)
23. Apache Spark
“Spark has an advanced DAG execution engine that
supports cyclic data flow and in-memory computing.”
Directed Acyclic Graph (DAG)
Resilient Distributed Datasets (RDDs)
Pros:
Batch, Machine learning, Graph
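A toy sketch of the two ideas named above: transformations only record a lineage, and a lost in-memory result can be recomputed from that lineage. `LazyDataset` is a hypothetical class written for this illustration; it is not Spark's API and ignores partitioning and distribution entirely.

```python
class LazyDataset:
    """Toy stand-in for an RDD: remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # only set on the root dataset
        self._parent = parent        # the dataset this one was derived from
        self._transform = transform  # function: list -> list
        self._cache = None           # in-memory result, may be lost

    def map(self, fn):
        return LazyDataset(parent=self,
                           transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return LazyDataset(parent=self,
                           transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        """Nothing is computed until an action like this is called."""
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._transform(self._parent.collect())
        return self._cache

    def lose_cache(self):
        """Simulate losing the in-memory copy, e.g. on a failed worker."""
        self._cache = None

numbers = LazyDataset(source=range(10))
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

first = evens_squared.collect()
evens_squared.lose_cache()
second = evens_squared.collect()  # recomputed by replaying the lineage
print(first, first == second)
```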
24. Apache Tez
“The Apache Tez project is aimed at building an
application framework which allows for a complex
directed-acyclic-graph of tasks for processing data.”
Directed Acyclic Graph (DAG)
Pros:
Big MR, Multiple Aggregations
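The “directed-acyclic-graph of tasks” idea can be sketched with Python's standard `graphlib` module (Python 3.9+). The task names below are hypothetical, and this is not Tez's API; it only shows how a job with multiple aggregations becomes one graph whose independent tasks can run in the same wave.

```python
from graphlib import TopologicalSorter

# Hypothetical job: two independent aggregations over one scan, then a join.
# Each key is a task; its value is the set of tasks it depends on.
dag = {
    "scan_logs": set(),
    "count_by_user": {"scan_logs"},
    "count_by_page": {"scan_logs"},
    "join_counts": {"count_by_user", "count_by_page"},
}

def run_dag(graph):
    """Return the waves of tasks that could run in parallel, in order."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks whose inputs are done
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = run_dag(dag)
print(waves)
```

Here the two aggregations land in the same wave, which is the kind of structure a single Map/Reduce pair cannot express on its own.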
25. MR, Spark, Tez
27. Variations of engines
Make jobs faster than MapReduce
Especially for memory-intensive, complex jobs
Hive can switch its backend from MR to Spark/Tez
MR’s stability is VERY important
The alternatives are still under development
28. MPP Engines
Apache Drill, Cloudera Impala, Facebook Presto
Massively Parallel Processing: MPP
DSL (SQL) + job management
Data sources are external datastores
Very low latency: uses many threads
Lower availability, and little tolerance when a query needs more memory than is available
29. Stream processing
Without any storage
Process data for specified windows
every X events, per Y minutes, for unique values, ...
In-memory processing
Ultra low latency
There are too many things to consider...
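Two of the window types mentioned above, “every X events” and “per Y minutes”, can be sketched as plain Python generators. The function names are hypothetical, the time window assumes events arrive sorted by timestamp, and neither Storm nor Norikra is used here.

```python
def count_windows(events, size):
    """Tumbling count window: emit one batch every `size` events."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    # An incomplete trailing batch is held back, as a stream never "ends".

def time_windows(events, seconds):
    """Tumbling time window over (timestamp, value) pairs sorted by timestamp."""
    current, batch = None, []
    for ts, value in events:
        bucket = int(ts // seconds)
        if current is not None and bucket != current:
            yield current * seconds, batch  # window start time, its values
            batch = []
        current = bucket
        batch.append(value)
    if batch:
        yield current * seconds, batch  # flush the last window of the input

by_count = list(count_windows(range(7), size=3))
events = [(0, "a"), (30, "b"), (61, "c"), (125, "d"), (130, "e")]
by_minute = list(time_windows(events, seconds=60))

print(by_count)
print(by_minute)
```

Only the current window is kept in memory, which is what makes processing without storage possible.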
30. Stream processing: more
Twitter Storm
Distributed stream processing platform
Processing w/ Java or JVM languages
For very high throughput data (not for small data)
Norikra
Non-distributed stream processing platform
Processing w/ SQL, but not distributed...
For low-middle-high throughput data
33. What is Hadoop?
Original Hadoop
HDFS
MapReduce v1
Hadoop v2
HDFS
ResourceManager + MapReduce v2
34. What is Hadoop?
Hadoop v2
HDFS
ResourceManager + MapReduce v2
Spark
35. What is Hadoop?
Hadoop v2
HDFS
ResourceManager + MapReduce v2
Apache Spark
Apache Tez
36. What is Hadoop?
Hadoop v2
HDFS
ResourceManager + MapReduce v2
Apache Spark
Apache Tez
Twitter Storm (Apache incubator)
37. What is Hadoop?
Hadoop v2
HDFS
ResourceManager + MapReduce v2
Apache Spark
Apache Tez
Twitter Storm (Apache incubator)
Apache Drill
38. What is Hadoop?
Hadoop v2
HDFS
ResourceManager + MapReduce v2
Apache Spark
Apache Tez
Twitter Storm (Apache incubator)
Apache Drill
Hive, Pig, ...
39. What is Hadoop?
Hadoop v2
HDFS
ResourceManager + MapReduce v2
Apache Spark
Apache Tez
Twitter Storm (Apache incubator)
Apache Drill
What Hadoop is ....
40. ALL YOUR ENGINES ARE BELONG TO US.
41. What is Hadoop?
The whole big data platform is called “Hadoop”
Like “Linux”: not only the kernel, but also the distribution
CORE:
distributed file systems
data flow
42. “BigData as a Service”
by @naoya_ito
AWS EMR/RedShift, Google BigQuery, Treasure Data, ...
They have their own architecture
and their own storage and data flow
Data flow is always the most important part
43. Perl?
The big data world is dominated by the JVM
Many contributors from many companies
We should not build distributed processing software ourselves
Stand on the shoulders of giants!
Connect the Perl world to JVM systems
by CPAN modules
44. 1. What do you want?
2. How large is your data?
3. Choose an architecture!