Handling not so big data. 
YAPC::Asia 2014 Day 2 
2014/08/30 
@tagomoris
TAGOMORI Satoshi (@tagomoris) 
LINE Corporation 
Analytics Platform Team
Data Analytics overview 
collect parse 
clean up 
process 
visualize 
store process
Consider data size 
Stored size? 
Total? 
Per day? 
Throughput? 
Daily average? 
Peak time? 
Structured? 
Compressed?
DO NOT consider exact data size. 
It will increase/decrease dramatically!
Consider rough data size 
Data size per query 
Sub GigaBytes 
From GigaBytes to TeraBytes 
PetaBytes or More
Sub GB 
Use RDBMS!
PB or More 
Use Hadoooooooooop! 
and Storm!
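The three rough tiers above can be sketched as a small lookup. This is only an illustration: the function name `suggest_platform` and the exact byte boundaries (1 GB and 1 PB) are assumptions, since the slides deliberately give rough sizes only.

```python
# Rough tiers from the slides; the exact byte boundaries are assumptions.
GB = 10**9
PB = 10**15


def suggest_platform(bytes_per_query: int) -> str:
    """Map a rough per-query data size to the tier named in the talk."""
    if bytes_per_query < GB:
        return "RDBMS"
    if bytes_per_query < PB:
        return "GB to TB: pick a framework or query engine by workload"
    return "Hadoop and Storm"


for size in (200 * 10**6, 50 * GB, 3 * PB):
    print(f"{size:>22,d} bytes -> {suggest_platform(size)}")
```

Only the order of magnitude matters here, which matches the advice not to rely on an exact data size.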
From GB to TB: How large? 
Main target for many service providers 
Too large 
for a single disk or a single machine's memory 
Appropriate for many disks 
Not so large for the combined memory of many machines
From GB to TB: For what? 
Not so small: we cannot do everything 
Data analytics methods 
search, aggregation, recommendation, anomaly detection 
Consider “what you want to do” first 
It determines what you need to consider
Types of data processing 
Data size and I/O throughput intensive: 
search, aggregation 
CPU power and memory size intensive: 
machine learning, graph processing 
Select appropriate processing framework/middleware 
In memory only? With spilling?
Architecture of 
distributed processing systems 
job management 
resource management 
distributed file system 
framework 
domain specific language 
query processing subsystem 
monolithic query engine middleware
Short break: Languages and DSLs 
Java, Scala, ... 
SQL: Hive, Impala, Drill, Presto, ... 
Others: Pig, Cascading, ...
Architecture of 
distributed processing systems 
The part this talk is about
A tour of 
distributed processing frameworks 
and query engines
MapReduce 
Hadoop MapReduce: 
Map + Combine + Shuffle + Reduce 
Intermediate output is written on disk 
Shuffle always requires sync 
Diagram: four Map tasks, each followed by a Combine, then a shuffle into three Reduce tasks
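The Map + Combine + Shuffle + Reduce pipeline can be sketched as a word count in plain Python. This is a single-process illustration, not Hadoop's API: all function names are hypothetical, intermediate data stays in memory here (Hadoop writes it to disk), and the byte-sum partitioner is an assumption chosen to keep the output deterministic.

```python
from collections import Counter, defaultdict


def map_phase(split):
    # Map: emit (word, 1) for every word in one input split.
    return [(word, 1) for line in split for word in line.split()]


def combine_phase(pairs):
    # Combine: pre-aggregate on the mapper side to shrink shuffle traffic.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())


def shuffle_phase(combined_outputs, num_reducers):
    # Shuffle: route every key to exactly one reducer. It needs the output
    # of all mappers, which is why it acts as a synchronization point.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in combined_outputs:
        for key, value in output:
            index = sum(key.encode()) % num_reducers
            partitions[index][key].append(value)
    return partitions


def reduce_phase(partition):
    # Reduce: fold all values collected for each key into one result.
    return {key: sum(values) for key, values in partition.items()}


def word_count(splits, num_reducers=3):
    combined = [combine_phase(map_phase(split)) for split in splits]
    result = {}
    for partition in shuffle_phase(combined, num_reducers):
        result.update(reduce_phase(partition))
    return result


splits = [["a b a"], ["b c"], ["a c c"], ["d"]]
print(word_count(splits))
```

Each split is mapped and combined independently, but no reducer can start until the shuffle has seen every mapper's output.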
MRv1 vs MRv2 on Hadoop 
MRv1: 
Resource Management 
Job Management 
Framework 
MRv2: 
Job Management 
Framework 
Forget this.
Apache Spark 
“Spark has an advanced DAG execution engine that 
supports cyclic data flow and in-memory computing.” 
Directed Acyclic Graph (DAG) 
Resilient Distributed Datasets (RDDs) 
Pros: 
Batch, Machine learning, Graph 
Apache Tez 
“The Apache Tez project is aimed at building an 
application framework which allows for a complex 
directed-acyclic-graph of tasks for processing data.” 
Directed Acyclic Graph (DAG) 
Pros: 
Big MR, Multiple Aggregations 
MR, Spark, Tez 
DAG (directed acyclic graph) 
from: http://tez.apache.org/
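A DAG of tasks can be scheduled in dependency order, as in this minimal sketch using Python's standard `graphlib` module (Python 3.9+). The task names and the `run_in_waves` helper are hypothetical. It only shows the ordering idea and is not how Spark or Tez are implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical job: two scans feed a join, whose result feeds two aggregations.
# Each key maps to the set of tasks it depends on.
dag = {
    "scan_logs": set(),
    "scan_users": set(),
    "join": {"scan_logs", "scan_users"},
    "count_by_day": {"join"},
    "count_by_country": {"join"},
}


def run_in_waves(graph):
    """Group tasks into waves; tasks in one wave do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(run_in_waves(dag))
```

Tasks in the same wave could run in parallel, and both aggregations reuse the single join result.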
Variations of engines 
Make jobs faster than MapReduce 
Especially for memory-intensive, complex jobs 
Hive can switch its backend from MR to Spark/Tez 
MR’s stability is VERY important 
Alternatives are under development
MPP Engines 
Apache Drill, Cloudera Impala, Facebook Presto 
Massively Parallel Processing: MPP 
DSL(SQL) + job management 
Data source is external datastores 
Very low latency: using many threads 
Low availability and less tolerance for memory requirements
Stream processing 
Without any storage 
Process data for specified windows 
every X events, per Y minutes, for unique values, .... 
In-memory processing 
Ultra low latency 
There are too many things to consider...
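The two window styles named above ("every X events" and "per Y minutes") can be sketched with plain Python generators. The function names are hypothetical, the windows are tumbling (non-overlapping), and the time-based version assumes events arrive ordered by timestamp. Real stream processors handle many more cases.

```python
from collections import Counter


def count_windows(events, size):
    """Yield per-value counts for every `size` consecutive events."""
    window = []
    for event in events:
        window.append(event)
        if len(window) == size:
            yield dict(Counter(window))
            window = []  # nothing is kept once the window closes
    # A trailing partial window is dropped in this sketch.


def time_windows(timed_events, seconds):
    """Yield (window_start, event_count) for fixed-length time windows.

    `timed_events` is an iterable of (timestamp, payload) pairs that must
    already be ordered by timestamp.
    """
    current_start, count = None, 0
    for timestamp, _payload in timed_events:
        start = timestamp - timestamp % seconds
        if current_start is not None and start != current_start:
            yield current_start, count
            count = 0
        current_start = start
        count += 1
    if current_start is not None:
        yield current_start, count


print(list(count_windows(["a", "b", "a", "c", "c", "c"], 3)))
print(list(time_windows([(0, "x"), (30, "y"), (61, "z"), (125, "w")], 60)))
```

Both functions hold only the current window in memory, which is what allows processing without any storage.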
Stream processing: more 
Twitter Storm 
Distributed stream processing platform 
Processing w/ Java or JVM languages 
For super high throughput data (not for minimal data) 
Norikra 
Non-distributed stream processing platform 
Processing w/ SQL, but not distributed... 
For low-middle-high throughput data
What I won't cover 
today...
What is Hadoop? 
Original Hadoop 
HDFS 
MapReduce v1 
Hadoop v2 
HDFS 
ResourceManager + MapReduce v2 
Apache Spark 
Apache Tez 
Twitter Storm (Apache incubator) 
Apache Drill 
Hive, Pig, ...
What Hadoop is ....
ALL YOUR ENGINES ARE BELONG TO US.
What is Hadoop? 
The whole big data platform is called “Hadoop” 
like “Linux”: not only the kernel, but also the distribution 
CORE: 
distributed file systems 
data flow
“BigData as a Service” 
by @naoya_ito 
AWS EMR/Redshift, Google BigQuery, Treasure Data, ... 
They have their own architecture 
and their own storage and data flow 
Data flow is always the most important
Perl ? 
BigData world is dominated by JVM 
many contributors from many companies 
We should not write our own distributed processing software 
Stand on the shoulders of giants! 
Connect the Perl world with JVM systems 
via CPAN modules
1. What do you want? 
2. How large is your data? 
3. Choose architecture!
SHARE 
software, know-how & concerns! 
Thank you!

More Related Content

What's hot

Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionDataWorks Summit
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container EraSadayuki Furuhashi
 
Expand data analysis tool at scale with Zeppelin
Expand data analysis tool at scale with ZeppelinExpand data analysis tool at scale with Zeppelin
Expand data analysis tool at scale with ZeppelinDataWorks Summit
 
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)Spark Summit
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17spark-project
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystifiedOmid Vahdaty
 
Treasure Data and AWS - Developers.io 2015
Treasure Data and AWS - Developers.io 2015Treasure Data and AWS - Developers.io 2015
Treasure Data and AWS - Developers.io 2015N Masahiro
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Databricks
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Databricks
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰Wayne Chen
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageSATOSHI TAGOMORI
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Databricks
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangDatabricks
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...Holden Karau
 
How to Make Norikra Perfect
How to Make Norikra PerfectHow to Make Norikra Perfect
How to Make Norikra PerfectSATOSHI TAGOMORI
 

What's hot (20)

Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
 
Expand data analysis tool at scale with Zeppelin
Expand data analysis tool at scale with ZeppelinExpand data analysis tool at scale with Zeppelin
Expand data analysis tool at scale with Zeppelin
 
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
Spark on Mesos-A Deep Dive-(Dean Wampler and Tim Chen, Typesafe and Mesosphere)
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
 
Treasure Data and AWS - Developers.io 2015
Treasure Data and AWS - Developers.io 2015Treasure Data and AWS - Developers.io 2015
Treasure Data and AWS - Developers.io 2015
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
 
RubyKaigi 2014: ServerEngine
RubyKaigi 2014: ServerEngineRubyKaigi 2014: ServerEngine
RubyKaigi 2014: ServerEngine
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby Usage
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
 
How to Make Norikra Perfect
How to Make Norikra PerfectHow to Make Norikra Perfect
How to Make Norikra Perfect
 

Viewers also liked

Introducing Swift - and the Sunset of Our Culture?
Introducing Swift - and the Sunset of Our Culture?Introducing Swift - and the Sunset of Our Culture?
Introducing Swift - and the Sunset of Our Culture?dankogai
 
Norikra: Stream Processing with SQL
Norikra: Stream Processing with SQLNorikra: Stream Processing with SQL
Norikra: Stream Processing with SQLSATOSHI TAGOMORI
 
Norikra: SQL Stream Processing In Ruby
Norikra: SQL Stream Processing In RubyNorikra: SQL Stream Processing In Ruby
Norikra: SQL Stream Processing In RubySATOSHI TAGOMORI
 
BigQuery, Fluentd and tagomoris #gcpja
BigQuery, Fluentd and tagomoris #gcpjaBigQuery, Fluentd and tagomoris #gcpja
BigQuery, Fluentd and tagomoris #gcpjaSATOSHI TAGOMORI
 
Dockerで遊んでみよっかー YAPC::Asia Tokyo 2014
Dockerで遊んでみよっかー YAPC::Asia Tokyo 2014Dockerで遊んでみよっかー YAPC::Asia Tokyo 2014
Dockerで遊んでみよっかー YAPC::Asia Tokyo 2014Masahiro Nagano
 
運用とデータ分析の遠くて近い関係、ISUCONを添えて
運用とデータ分析の遠くて近い関係、ISUCONを添えて運用とデータ分析の遠くて近い関係、ISUCONを添えて
運用とデータ分析の遠くて近い関係、ISUCONを添えてSATOSHI TAGOMORI
 
JSON SQL Injection and the Lessons Learned
JSON SQL Injection and the Lessons LearnedJSON SQL Injection and the Lessons Learned
JSON SQL Injection and the Lessons LearnedKazuho Oku
 
作られては消えていく泡のように儚いクラスタの運用話
作られては消えていく泡のように儚いクラスタの運用話作られては消えていく泡のように儚いクラスタの運用話
作られては消えていく泡のように儚いクラスタの運用話Tsuyoshi Torii
 
インフラエンジニアは死んだ Yapc -asia 2014
インフラエンジニアは死んだ Yapc -asia 2014 インフラエンジニアは死んだ Yapc -asia 2014
インフラエンジニアは死んだ Yapc -asia 2014 Satoshi Suzuki
 
「スプラトゥーン」リアルタイム画像解析ツール 「IkaLog」の裏側
「スプラトゥーン」リアルタイム画像解析ツール 「IkaLog」の裏側「スプラトゥーン」リアルタイム画像解析ツール 「IkaLog」の裏側
「スプラトゥーン」リアルタイム画像解析ツール 「IkaLog」の裏側Takeshi HASEGAWA
 
Hadoop and Kerberos
Hadoop and KerberosHadoop and Kerberos
Hadoop and KerberosYuta Imai
 

Viewers also liked (13)

Introducing Swift - and the Sunset of Our Culture?
Introducing Swift - and the Sunset of Our Culture?Introducing Swift - and the Sunset of Our Culture?
Introducing Swift - and the Sunset of Our Culture?
 
Invitation for v1.0.0
Invitation for v1.0.0Invitation for v1.0.0
Invitation for v1.0.0
 
Norikra: Stream Processing with SQL
Norikra: Stream Processing with SQLNorikra: Stream Processing with SQL
Norikra: Stream Processing with SQL
 
Norikra: SQL Stream Processing In Ruby
Norikra: SQL Stream Processing In RubyNorikra: SQL Stream Processing In Ruby
Norikra: SQL Stream Processing In Ruby
 
BigQuery, Fluentd and tagomoris #gcpja
BigQuery, Fluentd and tagomoris #gcpjaBigQuery, Fluentd and tagomoris #gcpja
BigQuery, Fluentd and tagomoris #gcpja
 
Dockerで遊んでみよっかー YAPC::Asia Tokyo 2014
Dockerで遊んでみよっかー YAPC::Asia Tokyo 2014Dockerで遊んでみよっかー YAPC::Asia Tokyo 2014
Dockerで遊んでみよっかー YAPC::Asia Tokyo 2014
 
運用とデータ分析の遠くて近い関係、ISUCONを添えて
運用とデータ分析の遠くて近い関係、ISUCONを添えて運用とデータ分析の遠くて近い関係、ISUCONを添えて
運用とデータ分析の遠くて近い関係、ISUCONを添えて
 
JSON SQL Injection and the Lessons Learned
JSON SQL Injection and the Lessons LearnedJSON SQL Injection and the Lessons Learned
JSON SQL Injection and the Lessons Learned
 
作られては消えていく泡のように儚いクラスタの運用話
作られては消えていく泡のように儚いクラスタの運用話作られては消えていく泡のように儚いクラスタの運用話
作られては消えていく泡のように儚いクラスタの運用話
 
インフラエンジニアは死んだ Yapc -asia 2014
インフラエンジニアは死んだ Yapc -asia 2014 インフラエンジニアは死んだ Yapc -asia 2014
インフラエンジニアは死んだ Yapc -asia 2014
 
Fluentd and WebHDFS
Fluentd and WebHDFSFluentd and WebHDFS
Fluentd and WebHDFS
 
「スプラトゥーン」リアルタイム画像解析ツール 「IkaLog」の裏側
「スプラトゥーン」リアルタイム画像解析ツール 「IkaLog」の裏側「スプラトゥーン」リアルタイム画像解析ツール 「IkaLog」の裏側
「スプラトゥーン」リアルタイム画像解析ツール 「IkaLog」の裏側
 
Hadoop and Kerberos
Hadoop and KerberosHadoop and Kerberos
Hadoop and Kerberos
 

Similar to Handling not so big data

Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Python in big data world
Python in big data worldPython in big data world
Python in big data worldRohit
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 
Hadoop training by keylabs
Hadoop training by keylabsHadoop training by keylabs
Hadoop training by keylabsSiva Sankar
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopMr. Ankit
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampSpotle.ai
 

Similar to Handling not so big data (20)

Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Python in big data world
Python in big data worldPython in big data world
Python in big data world
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
The future of Big Data tooling
The future of Big Data toolingThe future of Big Data tooling
The future of Big Data tooling
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Big data
Big dataBig data
Big data
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
Hadoop training by keylabs
Hadoop training by keylabsHadoop training by keylabs
Hadoop training by keylabs
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Hadoop programming
Hadoop programmingHadoop programming
Hadoop programming
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop Bootcamp
 
Big data with java
Big data with javaBig data with java
Big data with java
 

More from SATOSHI TAGOMORI

Ractor's speed is not light-speed
Ractor's speed is not light-speedRactor's speed is not light-speed
Ractor's speed is not light-speedSATOSHI TAGOMORI
 
Good Things and Hard Things of SaaS Development/Operations
Good Things and Hard Things of SaaS Development/OperationsGood Things and Hard Things of SaaS Development/Operations
Good Things and Hard Things of SaaS Development/OperationsSATOSHI TAGOMORI
 
Invitation to the dark side of Ruby
Invitation to the dark side of RubyInvitation to the dark side of Ruby
Invitation to the dark side of RubySATOSHI TAGOMORI
 
Hijacking Ruby Syntax in Ruby (RubyConf 2018)
Hijacking Ruby Syntax in Ruby (RubyConf 2018)Hijacking Ruby Syntax in Ruby (RubyConf 2018)
Hijacking Ruby Syntax in Ruby (RubyConf 2018)SATOSHI TAGOMORI
 
Make Your Ruby Script Confusing
Make Your Ruby Script ConfusingMake Your Ruby Script Confusing
Make Your Ruby Script ConfusingSATOSHI TAGOMORI
 
Hijacking Ruby Syntax in Ruby
Hijacking Ruby Syntax in RubyHijacking Ruby Syntax in Ruby
Hijacking Ruby Syntax in RubySATOSHI TAGOMORI
 
Lock, Concurrency and Throughput of Exclusive Operations
Lock, Concurrency and Throughput of Exclusive OperationsLock, Concurrency and Throughput of Exclusive Operations
Lock, Concurrency and Throughput of Exclusive OperationsSATOSHI TAGOMORI
 
Data Processing and Ruby in the World
Data Processing and Ruby in the WorldData Processing and Ruby in the World
Data Processing and Ruby in the WorldSATOSHI TAGOMORI
 
Planet-scale Data Ingestion Pipeline: Bigdam
Planet-scale Data Ingestion Pipeline: BigdamPlanet-scale Data Ingestion Pipeline: Bigdam
Planet-scale Data Ingestion Pipeline: BigdamSATOSHI TAGOMORI
 
Technologies, Data Analytics Service and Enterprise Business
Technologies, Data Analytics Service and Enterprise BusinessTechnologies, Data Analytics Service and Enterprise Business
Technologies, Data Analytics Service and Enterprise BusinessSATOSHI TAGOMORI
 
Ruby and Distributed Storage Systems
Ruby and Distributed Storage SystemsRuby and Distributed Storage Systems
Ruby and Distributed Storage SystemsSATOSHI TAGOMORI
 
Perfect Norikra 2nd Season
Perfect Norikra 2nd SeasonPerfect Norikra 2nd Season
Perfect Norikra 2nd SeasonSATOSHI TAGOMORI
 
To Have Own Data Analytics Platform, Or NOT To
To Have Own Data Analytics Platform, Or NOT ToTo Have Own Data Analytics Platform, Or NOT To
To Have Own Data Analytics Platform, Or NOT ToSATOSHI TAGOMORI
 
The Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersThe Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersSATOSHI TAGOMORI
 
How To Write Middleware In Ruby
How To Write Middleware In RubyHow To Write Middleware In Ruby
How To Write Middleware In RubySATOSHI TAGOMORI
 
Modern Black Mages Fighting in the Real World
Modern Black Mages Fighting in the Real WorldModern Black Mages Fighting in the Real World
Modern Black Mages Fighting in the Real WorldSATOSHI TAGOMORI
 
Open Source Software, Distributed Systems, Database as a Cloud Service
Open Source Software, Distributed Systems, Database as a Cloud ServiceOpen Source Software, Distributed Systems, Database as a Cloud Service
Open Source Software, Distributed Systems, Database as a Cloud ServiceSATOSHI TAGOMORI
 
Fluentd Overview, Now and Then
Fluentd Overview, Now and ThenFluentd Overview, Now and Then
Fluentd Overview, Now and ThenSATOSHI TAGOMORI
 

More from SATOSHI TAGOMORI (20)

Ractor's speed is not light-speed
Ractor's speed is not light-speedRactor's speed is not light-speed
Ractor's speed is not light-speed
 
Good Things and Hard Things of SaaS Development/Operations
Good Things and Hard Things of SaaS Development/OperationsGood Things and Hard Things of SaaS Development/Operations
Good Things and Hard Things of SaaS Development/Operations
 
Maccro Strikes Back
Maccro Strikes BackMaccro Strikes Back
Maccro Strikes Back
 
Invitation to the dark side of Ruby
Invitation to the dark side of RubyInvitation to the dark side of Ruby

Handling not so big data

  • 1. Handling not so big data. YAPC::Asia 2014 Day 2 2014/08/30 @tagomoris
  • 2. TAGOMORI Satoshi (@tagomoris) LINE Corporation Analytics Platform Team
  • 3.
  • 4.
  • 5. Data Analytics overview collect parse clean up process visualize store process
  • 6. Data Analytics overview collect parse clean up process visualize store process
  • 7. Consider data size Stored size? Total? Per day? Throughput? Daily average? Peak time? Structured? Compressed?
  • 8. DO NOT consider exact data size. It will increase/decrease dramatically!
  • 9. Consider rough data size Data size per query Sub GigaBytes From GigaBytes to TeraBytes PetaBytes or More
  • 10. Sub GB Use RDBMS!
  • 11. PB or More Use Hadoooooooooop! and Storm!
  • 12. From GB to TB: How large? Main target for many service providers Too large For “a” disk, for “a” memory space Appropriate for many disks Not so large for many memory spaces
  • 13. From GB to TB: For what? Not so small: We cannot do everything Data analytics methods: search, aggregation, recommendation, anomaly detection Consider “what you want to do” first: it determines what else you need to consider
  • 14. Types of data processing Data size and I/O throughput intensive: search, aggregation CPU power and memory size intensive: machine learning, graph processing Select appropriate processing framework/middleware On memory only? With Spilling?
  • 15. Architecture of distributed processing systems job management resource management distributed file system framework domain specific language query processing subsystem monolithic query engine middleware
  • 16. Architecture of distributed processing systems job management resource management distributed file system framework domain specific language query processing subsystem monolithic query engine middleware
  • 17. Short break: Languages and DSLs Java, Scala, ... SQL: Hive, Impala, Drill, Presto, ... Others: Pig, Cascading, ... [Diagram labels: job management, resource management, distributed file system, framework, domain specific language, query processing subsystem, monolithic query engine, middleware]
  • 18. Architecture of distributed processing systems framework job management domain specific language resource management query processing subsystem monolithic distributed file system query engine middleware [Annotation: where this talk is about]
  • 19. A tour of distributed processing frameworks and query engines
  • 20. MapReduce Hadoop MapReduce: Map + Combine + Shuffle + Reduce Intermediate output is written on disk Shuffle always requires sync [Diagram: Map → Combine → shuffle → Reduce; architecture labels: job management, resource management, distributed file system, framework, domain specific language, query processing subsystem, monolithic query engine, middleware]
  • 21. MRv1 vs MRv2 on Hadoop MRv1: Resource Management Job Management Framework MRv2: Job Management Framework domain specific language query processing subsystem monolithic framework resource management distributed file system query engine middleware job management
  • 22. MRv1 vs MRv2 on Hadoop MRv1: Resource Management Job Management Framework MRv2: Job Management Framework [Annotation: forget this] domain specific language query processing subsystem monolithic framework resource management distributed file system query engine middleware job management
  • 23. Apache Spark “Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.” Directed Acyclic Graph (DAG) Resilient Distributed Datasets (RDDs) Pros: Batch, Machine learning, Graph domain specific language query processing subsystem monolithic framework resource management distributed file system query engine middleware job management
  • 24. Apache Tez “The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data.” Directed Acyclic Graph (DAG) Pros: Big MR, Multiple Aggregations domain specific language query processing subsystem monolithic framework resource management distributed file system query engine middleware job management
  • 25. MR, Spark, Tez job management resource management distributed file system framework domain specific language query processing subsystem monolithic query engine middleware
  • 26. DAG (directed acyclic graph) from: http://tez.apache.org/
  • 27. Variations of engines Make jobs faster than MapReduce Especially for memory-intensive, complex jobs Hive can replace backend from MR to Spark/Tez MR’s stability is VERY important Alternatives are under development
  • 28. MPP Engines Apache Drill, Cloudera Impala, Facebook Presto Massively Parallel Processing: MPP DSL(SQL) + job management Data source is external datastores Very low latency: using many threads Low availability and less tolerance for memory requirements [Diagram labels: job management, resource management, distributed file system, framework, domain specific language, query processing subsystem, monolithic query engine, middleware]
  • 29. Stream processing Without any storage Process data for specified windows: every X events, per Y minutes, for unique values, ... In-memory processing Ultra low latency There are too many things to consider...
  • 30. Stream processing: more Twitter Storm Distributed stream processing platform Processing w/ Java or JVM languages For super high throughput data (not for minimal data) Norikra Non-distributed stream processing platform Processing w/ SQL, but not distributed... For low-middle-high throughput data
  • 31.
  • 32. What I won't cover today...
  • 33. What is Hadoop? Original Hadoop HDFS MapReduce v1 Hadoop v2 HDFS ResourceManager + MapReduce v2 job management resource management distributed file system framework domain specific language query processing subsystem monolithic query engine middleware
  • 34. What is Hadoop? Hadoop v2 HDFS ResourceManager + MapReduce v2 Spark job management resource management distributed file system framework domain specific language query processing subsystem monolithic query engine middleware
  • 35. What is Hadoop? Hadoop v2 HDFS ResourceManager + MapReduce v2 Apache Spark Apache Tez job management resource management distributed file system framework domain specific language query processing subsystem monolithic query engine middleware
  • 36. What is Hadoop? v2: Hadoop HDFS ResourceManager + MapReduce v2 Apache Spark Apache Tez Twitter Storm (Apache incubator) job management domain specific language resource management distributed file system framework monolithic query engine middleware query processing subsystem
  • 37. What is Hadoop? v2: Hadoop HDFS ResourceManager + MapReduce v2 Apache Spark Apache Tez Twitter Storm (Apache incubator) Apache Drill job management domain specific language resource management distributed file system framework query processing subsystem monolithic query engine middleware
  • 38. What is Hadoop? v2: Hadoop HDFS ResourceManager + MapReduce v2 Apache Spark Apache Tez Twitter Storm (Apache incubator) framework Apache Drill job management Hive, Pig, ... distributed file system domain specific language query processing subsystem monolithic resource management query engine middleware
  • 39. What is Hadoop? v2: Hadoop HDFS ResourceManager + MapReduce v2 Apache Spark Apache Tez Twitter Storm (Apache incubator) Apache Drill What Hadoop is ....
  • 40. ALL YOUR ENGINES ARE BELONG TO US.
  • 41. What is Hadoop? The whole BigData platform is called “Hadoop”, like “Linux”: not only the kernel, but also the distribution CORE: distributed file systems, data flow
  • 42. “BigData as a Service” by @naoya_ito AWS EMR/RedShift, Google BigQuery, Treasure Data, ... They have their own architecture and their storages and data flow Data flow is always most important
  • 43. Perl? The BigData world is dominated by the JVM, with many contributors from many companies We should not write our own distributed processing software Stand on the shoulders of giants! Connect the Perl world with JVM systems through CPAN modules
  • 44. 1. What do you want? 2. How large is your data? 3. Choose architecture!
  • 45. SHARE software, know-how & concerns! Thank you!
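
Slide 20 names the four MapReduce phases (Map + Combine + Shuffle + Reduce) without showing them. The following is a minimal single-process sketch of those phases in Python, not Hadoop's actual API: the word-count task, the function names, and the in-memory "splits" are all assumptions made for illustration. In real Hadoop the intermediate output is written to disk and the shuffle moves data between machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def combine_phase(pairs):
    """Combine: pre-aggregate one mapper's output locally, before the shuffle."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle_phase(combined_outputs):
    """Shuffle: group values by key across all mappers.

    Every mapper has to finish before grouping is complete; this is the
    synchronization point mentioned on the slide.
    """
    groups = defaultdict(list)
    for key, value in chain.from_iterable(combined_outputs):
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold all values collected for one key into a final result."""
    return key, sum(values)


def word_count(splits):
    """Run the four phases over a list of input splits (each a list of lines)."""
    combined = [
        combine_phase(chain.from_iterable(map_phase(line) for line in split))
        for split in splits
    ]
    grouped = shuffle_phase(combined)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Two hypothetical input splits, one per "mapper".
    print(word_count([["big data big"], ["not so big data"]]))
```

The combiner is what keeps shuffle traffic small: each mapper sends one pair per distinct word instead of one pair per occurrence.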