Handling not so big data

YAPC::Asia 2014 Day 2
2014/08/30
@tagomoris

TAGOMORI Satoshi (@tagomoris)
LINE Corporation
Analytics Platform Team

Data Analytics overview
Stages: collect, parse, clean up, store, process, visualize

Consider data size
Stored size?
Total?
Per day?
Throughput?
Daily average?
Peak time?
Structured?
Compressed?

DO NOT consider exact data size.
It will increase/decrease dramatically!

Consider rough data size
Data size per query
Sub GigaBytes
From GigaBytes to TeraBytes
PetaBytes or More
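A minimal sketch of the three tiers above, in Python. The exact boundaries (1 GB and 1 PB, decimal units) and the name `size_tier` are assumptions for illustration; the slides only name the tiers and say the size should stay rough.

```python
# Tier boundaries are assumptions; the slides give names, not numbers.
GB = 10**9
PB = 10**15


def size_tier(bytes_per_query: int) -> str:
    """Map a rough per-query data size to one of the three tiers."""
    if bytes_per_query < 0:
        raise ValueError("size must be non-negative")
    if bytes_per_query < GB:
        return "sub-GB"
    if bytes_per_query < PB:
        return "GB to TB"
    return "PB or more"


if __name__ == "__main__":
    for size in (200 * 10**6, 50 * GB, 3 * PB):
        print(size, "->", size_tier(size))
```

Only the tier matters for choosing an architecture, so the input can be an order-of-magnitude guess.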

PB or More
Use Hadoooooooooop!
and Storm!

From GB to TB: How large?
Main target for many service providers
Too large
for a single disk or a single machine's memory
Appropriate for many disks
Not so large for the combined memory of many machines

From GB to TB: For what?
Not so small: we cannot do everything
Data analytics methods
search, aggregation, recommendation, anomaly detection
First, consider what you want to do
That determines what you need to think about

Types of data processing
Data size and I/O throughput intensive:
search, aggregation
CPU power and memory size intensive:
machine learning, graph processing
Select appropriate processing framework/middleware
In memory only? With spilling to disk?

Architecture of distributed processing systems
(diagram labels: domain specific language, query processing, query engine, framework, job management, resource management, distributed file system, middleware; monolithic vs. subsystem)

Short break: Languages and DSLs
Java, Scala, ...
SQL: Hive, Impala, Drill, Presto, ...
Others: Pig, Cascading, ...

Architecture of distributed processing systems
(diagram: the layers this talk is about)

A tour of
distributed processing frameworks
and query engines

MapReduce
Hadoop MapReduce:
Map + Combine + Shuffle + Reduce
Intermediate output is written on disk
Shuffle always requires sync
(diagram: four Map tasks, each with a Combine step, shuffled into three Reduce tasks)
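A toy word count in Python that walks through Map, Combine, Shuffle, and Reduce in one process. This is only a sketch of the data flow, not Hadoop's API: all function names are hypothetical, everything stays in memory (Hadoop writes intermediate output to disk), and the partitioner is a simplified stand-in for key hashing.

```python
from collections import Counter, defaultdict


def map_phase(line):
    """Map: emit (word, 1) for every word in one input line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combine: pre-aggregate one mapper's output before the shuffle."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())


def shuffle(mapper_outputs, num_reducers):
    """Shuffle: route every key to exactly one reducer partition.

    Every mapper's output is consumed here before any reducer runs,
    which is the synchronization point named on the slide.
    """
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            # Deterministic toy partitioner (an assumption, not Hadoop's).
            index = sum(key.encode("utf-8")) % num_reducers
            partitions[index][key].append(value)
    return partitions


def reduce_phase(partition):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in partition.items()}


def word_count(splits, num_reducers=3):
    """Run the four phases over a list of input splits (lists of lines)."""
    mapper_outputs = []
    for split in splits:
        pairs = [pair for line in split for pair in map_phase(line)]
        mapper_outputs.append(combine(pairs))
    result = {}
    for partition in shuffle(mapper_outputs, num_reducers):
        result.update(reduce_phase(partition))
    return result


if __name__ == "__main__":
    print(word_count([["a b a"], ["b c"], ["a"], ["c c"]]))
```

Four splits and three reducers mirror the diagram; the Combine step shrinks what each mapper sends into the shuffle.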

MRv1 vs MRv2 on Hadoop
MRv1: Resource Management + Job Management + Framework
MRv2: Job Management + Framework

MRv1 vs MRv2 on Hadoop
Forget this.

Apache Spark
“Spark has an advanced DAG execution engine that
supports cyclic data flow and in-memory computing.”
Directed Acyclic Graph (DAG)
Resilient Distributed Datasets (RDDs)
Pros:
Batch, Machine learning, Graph

Apache Tez
“The Apache Tez project is aimed at building an
application framework which allows for a complex
directed-acyclic-graph of tasks for processing data.”
Directed Acyclic Graph (DAG)
Pros:
Big MR, Multiple Aggregations

MR, Spark, Tez

DAG (directed acyclic graph)
from: http://tez.apache.org/
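A small Python sketch of what "a DAG of tasks" means in practice: each task runs once, after all of its upstream tasks have finished. It uses the standard library's `graphlib` (Python 3.9+). The task names and the `run_dag` helper are hypothetical, and this is not how Spark or Tez schedule work internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, dependencies):
    """Run every task after all of its upstream tasks.

    tasks: name -> function taking the list of upstream results
    dependencies: name -> list of upstream task names
    Returns (results by task name, execution order).
    """
    graph = {name: dependencies.get(name, []) for name in tasks}
    results = {}
    order = []
    # static_order() yields each node after all of its predecessors
    # and raises graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(graph).static_order():
        upstream = [results[dep] for dep in graph[name]]
        results[name] = tasks[name](upstream)
        order.append(name)
    return results, order


if __name__ == "__main__":
    tasks = {
        "load": lambda _: [3, 1, 2, 3],
        "distinct": lambda up: sorted(set(up[0])),
        "total": lambda up: sum(up[0]),
        "report": lambda up: {"distinct": up[0], "total": up[1]},
    }
    dependencies = {
        "distinct": ["load"],
        "total": ["load"],
        "report": ["distinct", "total"],
    }
    results, order = run_dag(tasks, dependencies)
    print(order, results["report"])
```

Here "distinct" and "total" both read the output of "load" directly, a shape that a single Map + Reduce pair cannot express without chaining jobs.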

Variations of engines
Make jobs faster than MapReduce
Especially for memory-intensive, complex jobs
Hive can switch its backend from MR to Spark/Tez
MR’s stability is VERY important
Alternatives are under development

MPP Engines
Apache Drill, Cloudera Impala, Facebook Presto
Massively Parallel Processing: MPP
DSL (SQL) + job management
Data sources are external datastores
Very low latency: using many threads
Lower availability and less tolerance for large memory requirements

Stream processing
Without any storage
Process data for specified windows
every X events, per Y minutes, for unique values, ...
In-memory processing
Ultra low latency
There are too many things to consider...
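A minimal Python sketch of two of the window types listed above: "every X events" and "per Y minutes" (expressed in seconds here). Both keep only the current window in memory and store nothing afterwards. The function names are hypothetical, and the time window assumes events arrive sorted by timestamp, which real streams do not guarantee.

```python
def count_windows(events, size):
    """Tumbling count window: emit one sum for every `size` events.

    A trailing window with fewer than `size` events is not emitted.
    """
    window = []
    for event in events:
        window.append(event)
        if len(window) == size:
            yield sum(window)
            window = []  # nothing is kept once the window closes


def time_windows(events, seconds):
    """Tumbling time window over (timestamp, value) pairs.

    Assumes timestamps are in seconds and already sorted.
    Yields (window start, sum of values in that window).
    """
    current_start = None
    total = 0
    for timestamp, value in events:
        start = timestamp - (timestamp % seconds)
        if current_start is None:
            current_start = start
        elif start != current_start:
            yield current_start, total
            current_start, total = start, 0
        total += value
    if current_start is not None:
        yield current_start, total


if __name__ == "__main__":
    print(list(count_windows([1, 2, 3, 4, 5, 6, 7], 3)))
    print(list(time_windows([(0, 1), (30, 2), (61, 5), (130, 7)], 60)))
```

Late or out-of-order events, windows that overlap, and state that outgrows memory are some of the "too many things to consider".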

Stream processing: more
Twitter Storm
Distributed stream processing platform
Processing w/ Java or JVM languages
For super high throughput data (not for minimal data)
Norikra
Non-distributed stream processing platform
Processing w/ SQL, but not distributed...
For low-middle-high throughput data

What I haven't mentioned
today...

What is Hadoop?
Original Hadoop
HDFS
MapReduce v1
Hadoop v2
HDFS
ResourceManager + MapReduce v2

Hadoop v2
HDFS
Spark

Hadoop v2
HDFS
Apache Spark
Apache Tez

v2: Hadoop
HDFS
Apache Spark
Apache Tez
Twitter Storm (Apache incubator)

v2: Hadoop
HDFS
Apache Spark
Apache Tez
Apache Drill

v2: Hadoop
HDFS
Apache Spark
Apache Tez
Apache Drill
Hive, Pig, ...

v2: Hadoop
HDFS
Apache Spark
Apache Tez
Apache Drill
What Hadoop is ....

ALL YOUR ENGINES ARE BELONG TO US.

What is Hadoop?
The whole big data platform is called “Hadoop”
like “Linux”: not only the kernel, but also the distribution
CORE:
distributed file systems
data flow

“BigData as a Service”
by @naoya_ito
AWS EMR/RedShift, Google BigQuery, Treasure Data, ...
They have their own architectures,
storage, and data flows
Data flow is always the most important

Perl?
The big data world is dominated by the JVM
many contributors from many companies
We should not write our own distributed processing software
Stand on the shoulders of giants!
Connect the Perl world to JVM systems
with CPAN modules

1. What do you want?
2. How large is your data?
3. Choose architecture!

SHARE
software, know-how & concerns!
Thank you!
