Technologies for
Data Analytics Platform
YAPC::Asia Tokyo 2015 - Aug 22, 2015
Who are you?
• Masahiro Nakagawa
• github: @repeatedly
• Treasure Data Inc.
• Fluentd / td-agent developer
• https://jobs.lever.co/treasure-data
• I love OSS :)
• D Language, MessagePack, The organizer of several meetups, etc…
Why do we analyze data?
Reporting
Monitoring
Exploratory data analysis
Confirmatory data analysis
etc…
Need data, data, data!
That means we need a
data analytics platform
for our own requirements
Data Analytics Flow

[Diagram: Data source → Collect → Store → Process → Visualize → Reporting / Monitoring]
Let's launch a platform!
• Easy to use and maintain
• Single server
• An RDBMS is popular and has a huge ecosystem

[Diagram: Data source → ETL (Extract + Transform + Load) → RDBMS → Query]

Oops! An RDBMS is not good for data
analytics over large data volumes.
We need more speed and scalability!
Let’s consider
Parallel RDBMS instead!
Parallel RDBMS
• Optimized for OLAP workloads
• Columnar storage, shared-nothing architecture, etc…
• Netezza, Teradata, Vertica, Greenplum, etc…

[Diagram: Query → Leader Node → three Compute Nodes]
Columnar Storage
• Good data format for analytics workloads
• Reads only the selected columns; compresses efficiently
• Not good for inserts / updates

time                 code  method
2015-12-01 10:02:36  200   GET
2015-12-01 10:22:09  404   GET
2015-12-01 10:36:45  200   GET
2015-12-01 10:49:21  200   POST
…                    …     …

[Diagram: the same table stored row by row vs. column by column, one storage unit per column]
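To make the benefit concrete, here is a toy sketch in Python of why reading only selected columns is cheaper (illustrative only, not how a real columnar engine is implemented):

```python
# Row layout: one dict per record; a query must touch every field of every row.
rows = [
    {"time": "2015-12-01 10:02:36", "code": 200, "method": "GET"},
    {"time": "2015-12-01 10:22:09", "code": 404, "method": "GET"},
    {"time": "2015-12-01 10:36:45", "code": 200, "method": "GET"},
    {"time": "2015-12-01 10:49:21", "code": 200, "method": "POST"},
]

# Columnar layout: one list ("unit") per column; a query over `code`
# reads only that list, which also compresses well (few distinct values).
columns = {
    "time":   [r["time"] for r in rows],
    "code":   [r["code"] for r in rows],
    "method": [r["method"] for r in rows],
}

# SELECT COUNT(*) WHERE code = 200 touches a single column.
count_200 = sum(1 for c in columns["code"] if c == 200)
print(count_200)  # 3
```

The same idea explains why inserts and updates are expensive: a new record must be appended to every column unit instead of one row slot.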
Okay, the query is now processed properly.
No silver bullet
• Performance depends on data modeling and queries
• distkey and sortkey are important
• They reduce data transfer and IO cost
• Queries should take advantage of these keys
• There are remaining problems
• Cluster scaling, metadata management, etc…
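The idea behind distkey can be sketched in Python: rows with the same key hash to the same node, so a join on that key needs no data transfer between nodes (a toy model of a shared-nothing cluster, not any vendor's implementation):

```python
def distribute(rows, key, num_nodes):
    """Hash-partition rows by a distribution key, shared-nothing style."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes

# Hypothetical tables, both distributed on the join key "id".
users  = [{"id": i, "name": f"user{i}"} for i in range(6)]
orders = [{"id": i % 6, "amount": i * 10} for i in range(12)]

user_nodes  = distribute(users, "id", 3)
order_nodes = distribute(orders, "id", 3)

# Because equal keys always land on the same node, every order finds its
# matching user locally -- no cross-node shuffle is needed for this join.
local_joins = sum(
    1
    for u_part, o_part in zip(user_nodes, order_nodes)
    for u in u_part
    for o in o_part
    if u["id"] == o["id"]
)
print(local_joins)  # all 12 orders join locally
```

A query that joins or groups on a different key would force a redistribution, which is why the query should take advantage of the chosen keys.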
Performance is good :)
But we often want to change the schema for new workloads, and then it becomes hard to maintain the schema and its data…
Okay, let's separate
data sources into multiple
layers for a reliable platform
Schema on Write (RDBMS)
• Write data with a schema to improve query performance
• Pros:
• Minimum query overhead
• Cons:
• The schema and workload must be designed in advance
• Data loading is an expensive operation
Schema on Read (Hadoop)
• Write data without a schema and map a schema at query time
• Pros:
• Robust against schema and workload changes
• Data loading is a cheap operation
• Cons:
• Higher overhead at query time
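The trade-off can be sketched in a few lines of Python (a toy illustration; the field names are made up):

```python
import json

# Raw events as they arrive; the second record is missing a field.
raw_events = ['{"user": "a", "value": "10"}', '{"user": "b"}']

# Schema on Write: parse and coerce types once, at load time.
# Loading is the expensive step, but queries then read typed columns directly.
def load_with_schema(lines):
    table = []
    for line in lines:
        rec = json.loads(line)
        table.append({"user": str(rec["user"]),
                      "value": int(rec.get("value", 0))})
    return table

# Schema on Read: keep the raw lines as-is; apply the schema per query.
# Loading is cheap and schema changes are easy, but every query pays
# the parsing cost again.
def query_sum_on_read(lines):
    return sum(int(json.loads(line).get("value", 0)) for line in lines)

table = load_with_schema(raw_events)
print(query_sum_on_read(raw_events))  # 10
```
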
Data Lake
• Schema management is hard
• Volume keeps increasing and formats change often
• There are lots of log types
• A feasible approach is storing raw data and converting it before analyzing it
• A Data Lake is a single storage for all logs
• Note that there is no clear definition yet
Data Lake Patterns
• Use a DFS, e.g. HDFS, for log storage
• ETL or data processing via the Hadoop ecosystem
• Logs can be converted by ingestion tools beforehand
• Use a Data Lake storage and related tools
• These storages support the Hadoop ecosystem
Apache Hadoop
• Distributed computing framework
• The first implementation was based on Google's MapReduce

[Diagram: HDFS (http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/) and MapReduce (http://nosqlessentials.com/)]
Cool! Data loading becomes robust!

[Diagram: EL(T) — raw data is loaded first, then transformed into transformed data]
Apache Tez
• Low-level framework for YARN applications
• Hive, Pig, new query engines and more
• Task- and DAG-based processing flow

[Diagram: a Task is Input → Processor → Output; tasks are composed into a DAG]
MapReduce vs Tez

SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
  ON (g1.x = g2.x)
ORDER BY avg;

[Diagram: MapReduce runs this query as separate M→R jobs (GROUP BY a.x, GROUP BY b.x, JOIN (a, b), ORDER BY) with HDFS writes between them; Tez runs the whole plan as one DAG of map and reduce tasks without intermediate HDFS writes]
http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
Superstitions
• "HDFS and YARN have a SPOF"
• Recent versions have no SPOF, in both MapReduce 1 and MapReduce 2
• "You can't build it from scratch"
• Really? Treasure Data builds Hadoop on CircleCI. Cloudera, Hortonworks and MapR do too.
• They also test the dependent toolchain.
Which Hadoop package should we use?
• A distribution from a Hadoop distributor is better
• CDH by Cloudera
• HDP by Hortonworks
• MapR distribution by MapR
• If you are familiar with Hadoop and its ecosystem, the Apache community edition becomes an option.
• For example, Treasure Data has patches and wants to use the patched version.
Good :)
In addition, we want to
collect data in an efficient way!
Ingestion tools
• There are two execution models!
• Bulk load:
• For high throughput
• Most tools transfer data in batches and in parallel
• Streaming load:
• For low latency
• Most tools transfer data in micro-batches
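The micro-batch model can be sketched as a small buffer that flushes by count or by age — roughly what streaming collectors do internally (a simplified, hypothetical sketch, not any tool's real API):

```python
import time

class MicroBatchLoader:
    """Buffer events and flush them as small batches, by count or by age."""
    def __init__(self, sink, max_events=3, max_wait_sec=1.0):
        self.sink = sink              # callable that receives a list of events
        self.max_events = max_events
        self.max_wait_sec = max_wait_sec
        self.buffer = []
        self.last_flush = time.monotonic()

    def emit(self, event):
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_events or
                time.monotonic() - self.last_flush >= self.max_wait_sec):
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)    # ship one micro-batch downstream
            self.buffer = []
        self.last_flush = time.monotonic()

batches = []
loader = MicroBatchLoader(batches.append, max_events=2)
for i in range(5):
    loader.emit({"seq": i})
loader.flush()  # flush the remainder on shutdown
print(batches)  # three batches: two full ones plus the remainder
```

Bulk loaders make the opposite trade: one large batch per run, optimized for throughput rather than latency.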
Bulk load tools
• Embulk
• Pluggable bulk data loader for various inputs and outputs
• Write plugins in Java and JRuby
• Sqoop
• Data transfer between Hadoop and RDBMS
• Included in some distributions
• Or a dedicated bulk loader for each data store
Streaming load tools
• Fluentd
• Pluggable, JSON-based streaming collector
• Lots of plugins on RubyGems
• Flume
• Mainly for the Hadoop ecosystem: HDFS, HBase, …
• Included in some distributions
• Or Logstash, Heka, Splunk, etc…
Data ingestion also becomes robust and efficient!

[Diagram: raw data → transformed data]
It works! But…
we want to issue ad-hoc queries against the entire data set.
We can't wait for data to be loaded into the database.
You can use an MPP query engine on top of your data stores.
MPP query engine
• It has no storage of its own, unlike a parallel RDBMS
• Follows the "Schema on Read" approach
• Data distribution depends on the backend
• The data schema also depends on the backend
• Some products are called "SQL on Hadoop"
• Presto, Impala, Apache Drill, etc…
• Each has its own execution engine and does not use MapReduce.
• Distributed query engine for interactive queries against various data sources and large data sets
• Pluggable connectors for joining multiple backends
• You can join MySQL and HDFS data in one query
• Lots of useful functions for data analytics
• Window functions, approximate queries, machine learning, etc…
[Diagram: a batch analysis platform (HDFS + Hive, daily/hourly batch, interactive query) feeding a visualization platform (PostgreSQL, etc. with commercial BI tools and a dashboard)]
[Same diagram, annotated: the visualization platform is less scalable and an extra cost; managing two platforms is more work; and you can't query "live" data directly]
[Diagram: Presto is added in front of HDFS/Hive for interactive queries, while daily/hourly batch still runs on Hive and PostgreSQL, etc. serves the dashboard]
[Diagram: Presto provides SQL on any data set — HDFS/Hive (daily/hourly batch), Cassandra, MySQL, commercial DBs — and commercial BI tools (IBM Cognos, Tableau, …) query it directly from the dashboard]
Data analysis platform

[Diagram: Presto architecture — a Client sends queries to the Coordinator, which uses a Discovery Service and Connector Plugins; Workers execute the query against Storage / Metadata]
Execution Model

[Diagram: MapReduce vs Presto. MapReduce: map tasks write data to disk and reduce tasks wait between stages. Presto: all stages are pipelined with memory-to-memory data transfer — no wait time and no disk IO, but no fault tolerance, and each data chunk must fit in memory]
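The contrast between staged and pipelined execution can be sketched with Python generators (a toy model: lists stand in for disk writes between stages, generators for memory-to-memory transfer):

```python
data = range(10)

# Staged (MapReduce-like): each stage materializes its full output
# ("writes to disk") before the next stage starts.
def staged(records):
    mapped = [x * 2 for x in records]        # stage 1 output, fully stored
    filtered = [x for x in mapped if x > 5]  # stage 2 waits for stage 1
    return sum(filtered)

# Pipelined (Presto-like): stages are chained generators; each record
# flows through all stages without intermediate storage.
def pipelined(records):
    mapped = (x * 2 for x in records)
    filtered = (x for x in mapped if x > 5)
    return sum(filtered)

print(staged(data), pipelined(data))  # 84 84
```

The results are identical; the difference is that the pipelined version never holds a whole intermediate stage in memory at once, which is also why a failed stage cannot simply be re-read from disk and restarted.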
Okay, we now have the combination of low latency and batch.
Resolved our concern! But…
we also need quick estimation.
Currently, there are several stream processing frameworks.
Let's try!!
Apache Storm
• Distributed realtime processing framework
• Low latency: one tuple at a time
• Trident mode uses micro-batches

[Diagram from https://storm.apache.org/]
Norikra
• Schema-less CEP engine for stream processing
• Uses SQL-like queries (Esper EPL)
• Not distributed, unlike Storm, for now

[Diagram: streams go in, calculated results come out]
Great! We can get insights in both streaming and batch ways :)
One more thing: we can make data transfer more reliable for multiple data streams with a distributed queue.
Apache Kafka
• Distributed messaging system
• Producer - Broker - Consumer pattern
• Pull model, replication, etc…

[Diagram: apps push messages to the broker; consumers pull them]
Push vs Pull
• Push:
• Easy to transfer data to multiple destinations
• Hard to control the flow rate across multiple streams
• Pull:
• Easy to control the flow rate
• Consumers must be managed correctly
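The pull model can be sketched with a queue acting as the broker (a toy illustration of the pattern, not the Kafka protocol):

```python
from collections import deque

class Broker:
    """Producers push; consumers pull at their own pace (pull model)."""
    def __init__(self):
        self.log = deque()

    def push(self, msg):          # producer side
        self.log.append(msg)

    def pull(self, max_msgs):     # consumer side controls its own rate
        batch = []
        while self.log and len(batch) < max_msgs:
            batch.append(self.log.popleft())
        return batch

broker = Broker()
for i in range(5):
    broker.push(f"msg-{i}")

# A slow consumer takes 2 messages per round; nothing is forced on it,
# and unread messages simply stay buffered in the broker.
first = broker.pull(2)   # ['msg-0', 'msg-1']
second = broker.pull(2)  # ['msg-2', 'msg-3']
print(first, second, len(broker.log))
```

This is why the pull side makes flow control easy: backpressure is implicit. The cost is that someone must track which consumers exist and how far each has read.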
This is a modern analytics platform.
Does it seem complex and hard to maintain?
Then let's use managed services!
Amazon Redshift
• Parallel RDBMS on AWS
• Re-uses traditional parallel RDBMS know-how
• Scaling is easier than with traditional systems
• Combining it with Amazon EMR is popular:
1. Store data in S3
2. EMR processes the S3 data
3. Load the processed data into Redshift
• EMR provides the Hadoop ecosystem
Using AWS Services
Google BigQuery
• Distributed query engine and scalable storage
• Tree model, columnar storage, etc…
• Storage is separated from the workers
• High query performance thanks to Google's infrastructure
• Lots of workers
• Storage / IO layer on Colossus
• You can't tune parallel RDBMS properties like distkey, but it works well in most cases.
BigQuery architecture
Using GCP Services
Treasure Data
• Cloud-based end-to-end data analytics service
• Hive, Presto, Pig and Hivemall over one big repository
• Lots of ingestion and output options, scheduling, etc…
• No stream processing for now
• The service concept is a Data Lake
• JSON-based schema-less storage
• The execution model is similar to BigQuery
• Storage is separated from the workers
• You can't specify parallel RDBMS properties
Using Treasure Data Service
Resource Model Trade-off

Model                         | Pros                                        | Cons
Fully guaranteed              | Stable execution; easy to control resources | No boost mechanism
Guaranteed with multi-tenancy | Stable execution; good scalability          | Less controllable resources
Fully multi-tenanted          | Boosted performance; great scalability      | Unstable execution
MS Azure also has useful services:
DataHub, SQL DWH, DataLake,
Stream Analytics, HDInsight…
Use a service or build a platform?
• You should consider using a service first
• AWS, GCP, MS Azure, Treasure Data, etc…
• The important factor is the data analytics, not the platform
• Do you have enough resources to maintain it?
• If a specific analytics platform is a differentiator, building your own is better
• To use state-of-the-art technologies
• Or when it is hard to implement on existing platforms
Conclusion
• There are many software products and services for data analytics
• Lots of trade-offs: performance, complexity, connectivity, execution model, etc.
• SQL is the primary language of data analytics
• Focus on your goal!
• Is the data analytics platform your business core? If not, consider using services first.
Cloud service for
entire data pipeline!
Appendix
Apache Spark
• Another distributed computing framework
• Mainly for in-memory computing with a DAG
• Clean RDD- and DataFrame-based APIs
• Combining it with Hadoop is popular

[Diagram from http://slidedeck.io/jmarin/scala-talk]
Apache Flink
• Streaming-based execution engine
• Supports both batch and pipelined processing
• Hadoop and Spark are batch-based

[Diagram from https://ci.apache.org/projects/flink/flink-docs-master/]
Batch vs Pipelined

[Diagram: batch (staged) execution writes each stage to disk and waits between stages (stage1 → stage2 → stage3); pipelined execution streams all stages with memory-to-memory data transfer — no wait time, disk is used only if needed, and fault tolerance comes from checkpointing]
Visualization
• Tableau
• Popular BI tool in many areas
• Awesome GUI, easy to use, lots of charts, etc.
• Metric Insights
• Dashboard for many metrics
• Scheduled queries, custom handlers, etc.
• Chartio
• Cloud-based BI tool
How to manage job dependency?
We want to issue Job X
after Job A and Job B are finished.
Data pipeline tools
• There are some important features:
• Manage job dependencies
• Handle job failures and retries
• Easy to define the topology
• Separate tasks into sub-tasks
• Apache Oozie, Apache Falcon, Luigi, Airflow, JP1, etc…
Luigi
• Python module for building job pipelines
• Write Python code and run it
• A task is defined as a Python class
• Easy to manage with a VCS
• Needs some extra tooling
• Scheduled jobs, job history, etc…

import luigi

class T1(luigi.Task):
    def requires(self):
        ...  # dependencies
    def output(self):
        ...  # store result
    def run(self):
        ...  # task body
Airflow
• Python- and DAG-based workflow
• Write Python code, but it is for defining a DAG
• A task is defined by an Operator
• There are good features:
• Management web UI
• Task information is stored in a database
• Celery-based distributed execution
dag = DAG('example')
t1 = Operator(..., dag=dag)
t2 = Operator(..., dag=dag)
t2.set_upstream(t1)

My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Recently uploaded (20)

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

Technologies for Data Analytics Platform

  • 17. Okay, let’s separate data sources into multiple layers for a reliable platform
  • 18. Schema on Write (RDBMS) • Writing data using a schema to improve query performance • Pros: • minimum query overhead • Cons: • need to design schema and workload beforehand • data load is an expensive operation
  • 19. Schema on Read (Hadoop) • Writing data without a schema and mapping a schema at query time • Pros: • robust against schema and workload changes • data load is a cheap operation • Cons: • higher overhead at query time
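The schema-on-read model can be sketched in a few lines of plain Python (a toy illustration, not any particular engine): raw JSON logs are stored untouched, and a column list is mapped onto them only when a query runs, so records with new or missing fields still load fine.

```python
import json

# Raw logs are stored as-is (schema on read): no table definition up front.
raw_logs = [
    '{"time": "2015-12-01 10:02:36", "code": 200, "method": "GET"}',
    '{"time": "2015-12-01 10:22:09", "code": 404, "method": "GET", "ua": "curl"}',  # new field appears later
    '{"time": "2015-12-01 10:36:45", "code": 200, "method": "POST"}',
]

def query_schema_on_read(lines, columns):
    """Map a schema onto raw records at query time; missing fields become None."""
    for line in lines:
        record = json.loads(line)           # parsing cost is paid on every query
        yield tuple(record.get(c) for c in columns)

# The same raw data answers queries with different "schemas":
rows = list(query_schema_on_read(raw_logs, ["code", "method"]))
print(rows)  # [(200, 'GET'), (404, 'GET'), (200, 'POST')]
```

The per-query parsing in `query_schema_on_read` is exactly the "high overhead at query time" trade-off: schema-on-write systems pay that cost once at load instead.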
  • 20. Data Lake • Schema management is hard • Volume keeps increasing and formats change often • There are lots of log types • A feasible approach is storing raw data and converting it before analysis • A Data Lake is a single storage for any logs • Note that there is no clear definition for now
  • 21. Data Lake Patterns • Use a DFS, e.g. HDFS, for log storage • ETL or data processing by the Hadoop ecosystem • Can convert logs via ingestion tools beforehand • Use Data Lake storage and related tools • These storages support the Hadoop ecosystem
  • 22. Apache Hadoop • Distributed computing framework • First implementation based on Google MapReduce • http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/
  • 25. Cool! Data load becomes robust! (EL + T: raw data to transformed data)
  • 26. Apache Tez • Low-level framework for YARN applications • Hive, Pig, new query engines and more • Task- and DAG-based processing flow (tasks with Input, Processor and Output, composed into a DAG)
  • 27. MapReduce vs Tez • MapReduce writes each intermediate result to HDFS between map and reduce rounds; Tez runs the whole plan as one DAG of tasks • Example query: SELECT g1.x, g2.avg, g2.cnt FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1 JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2 ON (g1.x = g2.x) ORDER BY avg; • http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
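As a rough illustration of what that query computes (toy data in plain Python, not Tez itself): two group-bys feed a join and a final sort, a shape Tez can schedule as a single DAG instead of chaining several MapReduce jobs through HDFS.

```python
from collections import defaultdict

# Toy tables: a(x, y) and b(x, y)
a = [("p", 10), ("p", 20), ("q", 30)]
b = [("p", 1), ("p", 2), ("q", 3), ("q", 4)]

# g1: SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x
groups = defaultdict(list)
for x, y in a:
    groups[x].append(y)
g1 = {x: sum(ys) / len(ys) for x, ys in groups.items()}

# g2: SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x
g2 = defaultdict(int)
for x, _ in b:
    g2[x] += 1

# JOIN ON g1.x = g2.x, then ORDER BY avg
result = sorted(((x, g1[x], g2[x]) for x in g1.keys() & g2.keys()),
                key=lambda row: row[1])
print(result)  # [('p', 15.0, 2), ('q', 30.0, 2)]
```

In MapReduce each of the two aggregations, the join and the sort would be separate jobs with disk barriers between them; in Tez they are vertices of one DAG.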
  • 28. Superstition • “HDFS and YARN have a SPOF” • Recent versions don’t have a SPOF, on both MapReduce 1 and MapReduce 2 • “Can’t build it from scratch” • Really? Treasure Data builds Hadoop on CircleCI; Cloudera, Hortonworks and MapR too • They also check its dependent toolchain
  • 29. Which Hadoop package should we use? • A distribution from a Hadoop distributor is better • CDH by Cloudera • HDP by Hortonworks • MapR distribution by MapR • If you are familiar with Hadoop and its ecosystem, the Apache community edition becomes an option • For example, Treasure Data has patches and wants to use the patched version
  • 30. Good :) In addition, we want to collect data in an efficient way!
  • 31. Ingestion tools • There are two execution models • Bulk load: • for high throughput • most tools transfer data in batch and in parallel • Streaming load: • for low latency • most tools transfer data in micro-batches
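The micro-batch idea behind streaming loaders can be sketched as a tiny buffer (a toy, not Fluentd's or Flume's actual implementation): events accumulate and are shipped downstream as one batch once a size or time threshold is hit, trading a little latency for far fewer writes.

```python
import time

class MicroBatchBuffer:
    """Toy streaming loader: buffer events, flush by size or interval."""
    def __init__(self, flush_size=3, flush_interval=5.0, now=time.monotonic):
        self.buf, self.flushed = [], []
        self.flush_size, self.flush_interval = flush_size, flush_interval
        self.now = now
        self.last_flush = now()

    def emit(self, event):
        self.buf.append(event)
        if (len(self.buf) >= self.flush_size
                or self.now() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self):
        if self.buf:
            self.flushed.append(list(self.buf))  # ship one micro-batch downstream
            self.buf.clear()
        self.last_flush = self.now()

buf = MicroBatchBuffer(flush_size=3)
for i in range(7):
    buf.emit({"seq": i})
buf.flush()                            # final flush on shutdown
print([len(b) for b in buf.flushed])   # [3, 3, 1]
```

A bulk loader is the degenerate case: one very large batch, scheduled, with parallel workers instead of a resident buffer.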
  • 32. Bulk load tools • Embulk • Pluggable bulk data loader for various inputs and outputs • Write plugins using Java and JRuby • Sqoop • Data transfer between Hadoop and RDBMS • Included in some distributions • Or a dedicated bulk loader for each data store
  • 33. Streaming load tools • Fluentd • Pluggable, JSON-based streaming collector • Lots of plugins on rubygems • Flume • Mainly for the Hadoop ecosystem: HDFS, HBase, … • Included in some distributions • Or Logstash, Heka, Splunk, etc…
  • 34. Data ingestion also becomes robust and efficient!
  • 35. It works! But… we want to issue ad-hoc queries against the entire data set. We can’t wait for data to load into a database.
  • 36. You can use an MPP query engine for your data stores.
  • 37. MPP query engine • It doesn’t have its own storage, unlike a parallel RDBMS • Follows the “Schema on Read” approach • data distribution depends on the backend • data schema also depends on the backend • Some products are called “SQL on Hadoop” • Presto, Impala, Apache Drill, etc… • Each has its own execution engine and doesn’t use MapReduce
  • 38. Presto • Distributed query engine for interactive queries against various data sources and large data • Pluggable connectors for joining multiple backends • You can join MySQL and HDFS data in one query • Lots of useful functions for data analytics • window functions, approximate queries, machine learning, etc…
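Window functions deserve a small illustration. The plain-Python sketch below mimics the semantics of a running average per partition, roughly AVG(latency) OVER (PARTITION BY method ORDER BY t): each row keeps its identity but gains an aggregate computed over its window (toy data; the real engine is of course far more general).

```python
from collections import defaultdict

# (method, t, latency) rows; the window is "all earlier rows with the same method"
rows = [("GET", 1, 100), ("GET", 2, 300), ("POST", 1, 50), ("GET", 3, 200)]

state = defaultdict(lambda: [0, 0])  # per-partition running (sum, count)
out = []
for method, t, latency in sorted(rows, key=lambda r: (r[0], r[1])):
    s = state[method]
    s[0] += latency
    s[1] += 1
    out.append((method, t, s[0] / s[1]))  # row survives, plus its windowed AVG
print(out)
# [('GET', 1, 100.0), ('GET', 2, 200.0), ('GET', 3, 200.0), ('POST', 1, 50.0)]
```

Unlike GROUP BY, no rows are collapsed, which is why window functions are so handy for sessionization and time-series reporting.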
  • 39. (diagram) Two platforms: HDFS + Hive run daily/hourly batch jobs (batch analysis platform); PostgreSQL etc. serve interactive queries to commercial BI tools and dashboards (visualization platform)
  • 40. (diagram) Problems with the two-platform setup: ✓ less scalable ✓ extra cost ✓ more work to manage 2 platforms ✓ can’t query against “live” data directly
  • 41. (diagram) Presto replaces the separate interactive platform: HDFS + Hive keep running daily/hourly batch, while Presto serves interactive queries over the same data to the dashboard
  • 42. (diagram) Data analysis platform: Presto runs interactive queries over HDFS/Hive (daily/hourly batch), Cassandra, MySQL and commercial DBs; SQL on any data sets, with commercial BI tools (IBM Cognos, Tableau, …) on top
  • 44. Execution Model • MapReduce: writes data to disk and waits between stages • Presto: all stages are pipelined with memory-to-memory data transfer ✓ no wait time ✓ no disk IO ✗ no fault tolerance ✗ data chunks must fit in memory
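The staged-versus-pipelined difference can be felt even in plain Python (a toy analogy, not either engine): lists materialize each stage completely before the next starts, like MapReduce spilling to HDFS, while generators stream record-by-record with nothing buffered in between, like Presto.

```python
# Staged (MapReduce-style): each stage's full output exists before the next stage runs.
def staged(records):
    mapped = [r * 2 for r in records]        # stage 1 fully materialized
    filtered = [r for r in mapped if r > 4]  # stage 2 reads stage 1's output
    return sum(filtered)

# Pipelined (Presto-style): generators pull one record at a time, no intermediate buffer.
def pipelined(records):
    mapped = (r * 2 for r in records)
    filtered = (r for r in mapped if r > 4)
    return sum(filtered)

data = range(1, 6)  # 1..5
print(staged(data), pipelined(data))  # 24 24
```

Same answer, different risk profile: the materialized stages can be re-read after a failure, the pipeline cannot, which mirrors the fault-tolerance trade-off on the slide.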
  • 45. Okay, we now have a combination of low-latency and batch processing.
  • 46. Resolved our concern! But… we also need quick estimation.
  • 47. Currently, we have several stream processing frameworks. Let’s try!!
  • 48. Apache Storm • Distributed realtime processing framework • Low latency: one tuple at a time • Trident mode uses micro-batches • https://storm.apache.org/
  • 49. Norikra • Schema-less CEP engine for stream processing • Uses SQL-like queries via Esper EPL • Not distributed, unlike Storm, for now
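The kind of query a CEP engine evaluates continuously, say a per-window count by status code, looks like this in toy form (tumbling windows over an in-memory event list; Norikra itself runs Esper EPL over live streams and emits results as windows close).

```python
from collections import Counter

# (timestamp_seconds, status_code) events; COUNT(*) GROUP BY code per 10s tumbling window
events = [(0, 200), (3, 404), (9, 200), (12, 200), (15, 500), (21, 404)]

windows = Counter()
for ts, code in events:
    window_start = ts // 10 * 10          # assign each event to its tumbling window
    windows[(window_start, code)] += 1

print(sorted(windows.items()))
# [((0, 200), 2), ((0, 404), 1), ((10, 200), 1), ((10, 500), 1), ((20, 404), 1)]
```

The streaming engine's job is doing exactly this incrementally, on unbounded input, with results pushed out as each window ends rather than computed after the fact.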
  • 50. Great! We can get insights in both streaming and batch ways :)
  • 51. One more. We can make data transfer more reliable for multiple data streams with a distributed queue
  • 52. Apache Kafka • Distributed messaging system • Producer - Broker - Consumer pattern • Pull model, replication, etc… (apps push to the broker; consumers pull from it)
  • 53. Push vs Pull • Push: • Easy to transfer data to multiple destinations • Hard to control stream ratio in multiple streams • Pull: • Easy to control stream ratio • Should manage consumers correctly
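The pull model can be sketched with a toy in-memory broker (hypothetical names; real Kafka adds partitions, replication and persisted consumer offsets): the broker keeps an append-only log, and each consumer tracks its own offset and decides when and how much to fetch, which is why stream ratio is easy to control.

```python
class Broker:
    """Toy broker: an append-only log that consumers read by offset."""
    def __init__(self):
        self.log = []

    def append(self, msg):                     # producer side (push to broker)
        self.log.append(msg)

    def fetch(self, offset, max_records=2):    # consumer controls batch size and pace
        return self.log[offset:offset + max_records]

broker = Broker()
for i in range(5):
    broker.append(f"msg-{i}")

offset, seen = 0, []
while True:
    batch = broker.fetch(offset)
    if not batch:
        break
    seen.extend(batch)
    offset += len(batch)                       # the consumer owns its offset
print(offset, seen[-1])  # 5 msg-4
```

A second consumer with its own offset could replay the same log at a different rate, untouched by the first; that independence is the pull model's main win, at the cost of having to manage consumers correctly.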
  • 54. This is a modern analytics platform
  • 55. Seems complex and hard to maintain? Let’s use useful services!
  • 56. Amazon Redshift • Parallel RDBMS on AWS • Re-use traditional parallel RDBMS know-how • Scaling is easier than with traditional systems • Pairing with Amazon EMR is popular: 1. Store data into S3 2. EMR processes the S3 data 3. Load processed data into Redshift • EMR has the Hadoop ecosystem
  • 58. Google BigQuery • Distributed query engine and scalable storage • Tree model, columnar storage, etc… • Separates storage from workers • High-performance queries on Google infrastructure • Lots of workers • Storage / IO layer on Colossus • Can’t manage parallel RDBMS properties like distkey, but it works in most cases
  • 61. Treasure Data • Cloud based end-to-end data analytics service • Hive, Presto, Pig and Hivemall over one big repository • Lots of ingestion and output ways, scheduling, etc… • No stream processing for now • Service concept is the Data Lake • JSON based schema-less storage • Execution model is similar to BigQuery • Separates storage from workers • Can’t specify parallel RDBMS properties
  • 63. Resource Model Trade-off • Fully guaranteed: stable execution, easy to control resources; but no boost mechanism • Guaranteed with multi-tenancy: stable execution, good scalability; but less controllable resources • Fully multi-tenanted: boosted performance, great scalability; but unstable execution
  • 64. MS Azure also has useful services: DataHub, SQL DWH, DataLake, Stream Analytics, HDInsight…
  • 65. Use a service or build a platform? • Should consider using a service first • AWS, GCP, MS Azure, Treasure Data, etc… • The important factor is data analytics, not the platform • Do you have enough resources to maintain it? • If a specific analytics platform is a differentiator, building a platform is better • Use state-of-the-art technologies • Hard to implement on existing platforms
  • 66. Conclusion • There are many software products and services for data analytics • Lots of trade-offs: performance, complexity, connectivity, execution model, etc • SQL is the primary language of data analytics • Should focus on your goal! • Is the data analytics platform your business core? If not, consider using services first.
  • 67. Cloud service for entire data pipeline!
  • 69. Apache Spark • Another distributed computing framework • Mainly for in-memory computing with a DAG • RDD- and DataFrame-based clean APIs • Combination with Hadoop is popular • http://slidedeck.io/jmarin/scala-talk
  • 70. Apache Flink • Streaming based execution engine • Supports batch and pipelined processing • Hadoop and Spark are batch based • https://ci.apache.org/projects/flink/flink-docs-master/
  • 71. Batch vs Pipelined • Batch (staged): tasks wait between stages (stage1 → stage2 → stage3) and write intermediate data to disk • Pipelined: all stages are pipelined with memory-to-memory data transfer ✓ no wait time ✓ fault tolerance with checkpointing ✓ uses disk if needed
  • 72. Visualization • Tableau • Popular BI tool in many areas • Awesome GUI, easy to use, lots of charts, etc • Metric Insights • Dashboard for many metrics • Scheduled queries, custom handlers, etc • Chartio • Cloud based BI tool
  • 73. How to manage job dependency? We want to issue Job X after Job A and Job B are finished.
  • 74. Data pipeline tool • There are some important features • Manage job dependency • Handle job failure and retry • Easy to define topology • Separate tasks into sub-tasks • Apache Oozie, Apache Falcon, Luigi, Airflow, JP1, etc…
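The core dependency requirement, run Job X only after Jobs A and B finish, is a topological-ordering problem; Python's standard graphlib shows the idea in miniature (job names here are hypothetical; real tools like Oozie, Luigi and Airflow add retries, scheduling and persisted state on top).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Job X depends on A and B; B itself depends on A (hypothetical pipeline)
deps = {
    "job_x": {"job_a", "job_b"},
    "job_b": {"job_a"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # ['job_a', 'job_b', 'job_x']
```

The same sorter also detects cycles (it raises `CycleError`), which is the other thing every pipeline tool must guard against when users define arbitrary dependencies.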
  • 75. Luigi • Python module for building job pipelines • Write Python code and run it • a task is defined as a Python class • Easy to manage with VCS • Needs some extra tools • scheduled jobs, job history, etc… class T1(luigi.Task): def requires(self): # dependencies def output(self): # store result def run(self): # task body
  • 76. Airflow • Python and DAG based workflow • Write Python code, but it is for defining a DAG • A task is defined by an Operator • There are good features • Management web UI • Task information is stored in a database • Celery based distributed execution dag = DAG('example') t1 = Operator(..., dag=dag) t2 = Operator(..., dag=dag) t2.set_upstream(t1)