Masahiro Nakagawa
Feb 7, 2015
dots. Summit 2015
Treasure Data and OSS
Who are you?
> Masahiro Nakagawa
> github/twitter: @repeatedly
> Treasure Data, Inc.
> Senior Software Engineer
> Fluentd / td-agent developer
> I love OSS :)
> D language - Phobos committer
> Fluentd - Main maintainer
> MessagePack / RPC - D and Python (only RPC)
> The organizer of several meetups (Presto, DTM, etc)
> etc…
Company background
•  Founded 2011 in Mountain View, CA
–  The first cloud service for the entire data pipeline
–  Including: Acquisition, Storage, & Analysis
•  Provide a “Cloud Data Service”
–  Fast Time to Value
–  Cloud Flexibility and Economics
–  Simple and Well Supported
•  Treasure Data has more than 100 customers in production
–  Incl. Fortune 500 companies
–  400k new records / second
–  Almost 9 trillion records loaded
–  Variety of use cases and verticals
The Treasure Data Team
Hiro Yoshikawa – CEO
Open source business veteran
Kaz Ohta – CTO
Founder of world’s largest Hadoop Group
Sada Furuhashi – Software Architect
MessagePack / Fluentd Author
Notable Investors
Othman Laraki
Ex-VP of Growth at Twitter
Jerry Yang
Founder of Yahoo!
Yukihiro “Matz” Matsumoto
Creator of “Ruby” programming language
James Lindenbaum
Founder of Heroku
Sierra Ventures - Tim Guleri
Leading venture capital firm in Big Data
TD Service Architecture
> Acquire
> Sources: Web Log, App Log, Sensor, CRM, ERP, RDBMS, POS
> Streaming Collector: Treasure Agent (Server), SDK (JS, Android, iOS, Unity)
> Bulk Uploader: Embulk, TD Toolbelt
> Store
> Plazma DB: Flexible, Scalable, Columnar Storage
> Analyze
> SQL-based query (SQL, Pig): Batch / Reliability, Ad-hoc / Low latency
> REST API, ODBC / JDBC
> Result Push (send query result)
> KPI Dashboard
> BI Tools: Metric Insights, Tableau, Motion Board, etc.
> Other Products: RDBMS, Google Docs, AWS S3, FTP Server, etc.
> Runs @AWS or @IDCF
> Time to Value, Connectivity, Economy & Flexibility, Simple & Supported
Data Acquisition
Log collecting in TD
> Treasure Agent
> Fluentd based log collector
> Embulk
> JavaScript SDK
> Mobile SDK (iOS, Android, Unity)
Structured logging
Reliable forwarding
Pluggable architecture
http://fluentd.org/
Fluentd
> Data collector for unified logging layer
> Streaming data transfer based on JSON
> Written in Ruby
> Gem based various plugins
> http://www.fluentd.org/plugins
> Working in production
> http://www.fluentd.org/testimonials
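To make "structured logging" and "streaming data transfer based on JSON" concrete, here is a minimal Python sketch that turns one raw log line into an event made of a tag, a time, and a JSON-serializable record. The line format, the `LINE` pattern, and the `to_event` helper are assumptions made up for this illustration; they are not Fluentd's parser or API.

```python
import json
import re

# Hypothetical log format for this sketch: "<host> <METHOD> <path> <status>"
LINE = re.compile(r"(?P<host>\S+) (?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})")


def to_event(tag, timestamp, line):
    """Parse one raw line into a structured event (tag, time, record)."""
    match = LINE.fullmatch(line)
    if match is None:
        raise ValueError("unparseable line: %r" % line)
    record = match.groupdict()
    record["status"] = int(record["status"])  # keep numbers as numbers, not strings
    return {"tag": tag, "time": timestamp, "record": record}


def to_json(event):
    """Serialize the event so it can be forwarded as one JSON document."""
    return json.dumps(event, sort_keys=True)
```

Once the line is a record with named fields, downstream systems can filter or aggregate on `status` or `path` without re-parsing text.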
Data Analytics Flow
Collect Store Process Visualize
Data source
Reporting
Monitoring
Data Analytics Flow
Store Process
Cloudera
Hortonworks
Treasure Data
Collect Visualize
Tableau
Excel
R
easier & shorter time
???
Divide & Conquer & Retry
(Diagram labels: Batch, Stream, Other stream; error → retry)
Core
> Divide & Conquer
> Buffering & Retrying
> Error handling
> Message routing
> Parallelism
Plugins
> read / receive data
> from API, database, command, etc…
> write / send data
> to API, database, alert, graph, etc…
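The "Divide & Conquer" and "Buffering & Retrying" ideas can be sketched in a few lines of Python: split the record stream into chunks, and when a write fails, retry only the chunk that failed. This is an illustration under assumptions, not Fluentd's implementation; `split_into_chunks`, `flush_with_retry`, and the exponential wait are names and choices made for this sketch.

```python
import time
from typing import Callable, Iterable, Iterator, List


def split_into_chunks(records: Iterable[dict], chunk_size: int) -> Iterator[List[dict]]:
    """Divide a record stream into fixed-size chunks (the last may be smaller)."""
    chunk: List[dict] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def flush_with_retry(chunk: List[dict],
                     write: Callable[[List[dict]], None],
                     max_retries: int = 3,
                     base_wait: float = 0.0) -> bool:
    """Write one chunk; on IOError wait (doubling each time) and retry that chunk only."""
    for attempt in range(max_retries + 1):
        try:
            write(chunk)
            return True
        except IOError:
            if attempt < max_retries:
                time.sleep(base_wait * (2 ** attempt))
    return False  # the caller decides what to do with a chunk that never succeeded
```

Because each chunk is flushed independently, one failing destination call delays only its own chunk rather than the whole stream.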
Architecture (v0.12 or later)
Engine (not pluggable)
Input
> Forward
> File tail
> ...
Parser
Filter
> grep
> record_transformer
> …
Buffer
> File
> Memory
Formatter
Output
> Forward
> File
> ...
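The flow through this architecture (input, filter, buffer, output, with tag-based message routing) can be sketched as a toy Python class. Everything here is an assumption for illustration: the `Engine` class and its methods are invented for this sketch, and the glob matching from Python's `fnmatch` only approximates Fluentd's own tag-matching rules.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


class Engine:
    """Toy event router: filters run first, then records are buffered per tag."""

    def __init__(self):
        self.filters = []                # list of (tag pattern, function)
        self.outputs = []                # list of (tag pattern, function)
        self.buffer = defaultdict(list)  # tag -> records waiting to be flushed

    def add_filter(self, pattern, fn):
        self.filters.append((pattern, fn))

    def add_output(self, pattern, fn):
        self.outputs.append((pattern, fn))

    def emit(self, tag, record):
        """Called by an input: apply matching filters in order, then buffer."""
        for pattern, fn in self.filters:
            if fnmatchcase(tag, pattern):
                record = fn(record)
                if record is None:       # a filter may drop the record
                    return
        self.buffer[tag].append(record)

    def flush(self):
        """Hand each buffered chunk to the first output whose pattern matches.

        In this sketch, records whose tag matches no output are discarded.
        """
        for tag in list(self.buffer):
            records = self.buffer.pop(tag)
            for pattern, write in self.outputs:
                if fnmatchcase(tag, pattern):
                    write(tag, records)
                    break
```

An input only calls `emit(tag, record)`; it does not need to know which filters or outputs exist, which is what keeps the pieces independently pluggable.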
Before (M x N)
After (M + N)
or Embulk
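The "M x N" versus "M + N" comparison is plain arithmetic: without a common layer every source needs its own integration with every destination, while with a hub each side integrates once. A small Python sketch, with function names invented for this illustration:

```python
def integrations_point_to_point(num_sources: int, num_destinations: int) -> int:
    """Every source talks to every destination directly: M x N integrations."""
    return num_sources * num_destinations


def integrations_with_hub(num_sources: int, num_destinations: int) -> int:
    """Every source and destination talks only to the hub: M + N integrations."""
    return num_sources + num_destinations
```

With 5 sources and 4 destinations, for example, that is 20 integrations versus 9, and the gap widens as either side grows.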
Other Fluentd related OSS
> Treasure Agent
> https://github.com/treasure-data/omnibus-td-agent
> Fluentd Forwarder
> https://github.com/fluent/fluentd-forwarder
> Simple forwarder for Windows / Leaf node
> Fluentd UI
> https://github.com/fluent/fluentd-ui
> Management web UI
Other OSS products
> Scribe (C++)
> Developed by Facebook
> No longer maintained
> Apache Flume (Java)
> Mainly for Hadoop HDFS / HBase
> Logstash (JRuby)
> Mainly for Elasticsearch
Embulk
> Bulk Loader version of Fluentd
> Pluggable architecture
> JRuby, JVM languages (TBD)
> High performance parallel processing
> Share your script as a plugin
> https://github.com/embulk
http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed
(Diagram: Embulk bulk loads data between systems through plugins: HDFS, MySQL, Amazon S3, CSV Files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis)
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behaviour
✓ Idempotent retrying
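"Parallel execution" and "idempotent retrying" fit together roughly as in the Python sketch below: the load is split into independent tasks, tasks run in parallel, and each task writes its result under its own key so that a retry overwrites rather than duplicates. This is a sketch of the idea only; `run_bulk_load`, the dict used as a destination, and the use of `ValueError` for a failed task are assumptions, not Embulk's API.

```python
from concurrent.futures import ThreadPoolExecutor


def run_bulk_load(tasks, load_task, destination, max_workers=4, max_attempts=3):
    """Run every task in parallel; retry a failed task up to max_attempts times.

    tasks:       mapping of task id -> list of input rows
    load_task:   function(rows) -> transformed rows, may raise ValueError
    destination: dict standing in for the target system
    Returns a mapping of task id -> True/False (whether the task succeeded).
    """
    def run_one(item):
        task_id, rows = item
        for _ in range(max_attempts):
            try:
                # Keyed by task id: running the same task again replaces
                # its earlier output instead of appending a second copy.
                destination[task_id] = load_task(rows)
                return task_id, True
            except ValueError:
                continue
        return task_id, False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, tasks.items()))
```

The important design choice is that a task's output location depends only on the task itself, so repeating a task any number of times leaves the destination in the same state.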
Computing Framework
3 query engines in TD
> Hive (HiveQL, Batch)
> for ETL and large jobs
> Hivemall for machine learning
> Pig (Pig Latin, Batch)
> DataFu for data mining and statistics
> Presto (SQL, Short batch)
> for Ad hoc queries
Hadoop
> Distributed computing framework
> Consists of many components…
http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/
http://nosqlessentials.com/
Apache Tez
> Low level framework for YARN applications
> New Query Engine
> Provides a good IR for Hive, Pig and more
> Task and DAG based pipelining
(Diagram: a Task is Input → Processor → Output; a DAG connects Tasks)
http://tez.apache.org/
Hive on MR vs. Hive on Tez
http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
ON (g1.x = g2.x) ORDER BY avg;
(Diagram: the plan is GROUP a BY a.x and GROUP b BY b.x, then JOIN (a, b), then ORDER BY. MapReduce runs it as a chain of map (M) / reduce (R) jobs with HDFS writes in between; Tez runs it as a single DAG.)
Avoid unnecessary HDFS write!
Other OSS products
> Apache Spark
> Mainly for on-memory processing
> Spark ecosystem is now growing
> Apache Flink
> Mainly for iterative processing
> Microsoft’s Dryad
> This was premature for human beings…
Presto
A distributed SQL query engine for interactive data analysis against GBs to PBs of data.
Presto overview
> Open sourced by Facebook
> http://prestodb.io/
> written in Java
> Built-in useful features
> Connectors
> Machine Learning
> Window function
> Approximate query
> etc…
> Used by Netflix, Dropbox, Treasure Data,
Qubole, Airbnb, LINE, GREE, Scaleout, etc
(Diagram: a batch analysis platform (HDFS, Hive; daily/hourly batch) feeds a visualization platform (PostgreSQL, etc.; interactive query) that serves dashboards and commercial BI tools)
✓ Less scalable
✓ Extra cost
✓ More work to manage 2 platforms
✓ Can’t query against “live” data directly
(Diagram: with Presto, dashboards run interactive queries against HDFS / Hive; Hive keeps the daily/hourly batch)
(Diagram: Presto as a data analysis platform, SQL on any data sets: HDFS / Hive, Cassandra, MySQL, commercial DBs)
Commercial BI Tools
✓ IBM Cognos
✓ Tableau
✓ ...
MapReduce vs. Presto
MapReduce (map → disk → reduce → disk → map → reduce)
> Write data to disk
> Wait between stages
Presto (task → task → task)
> All stages are pipelined
✓ No wait time
✓ No fault-tolerance
> memory-to-memory data transfer
✓ No disk IO
✓ Data chunk must fit in memory
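The difference between "wait between stages" and "all stages are pipelined" can be shown with a small Python sketch. The staged version finishes and materializes each stage before the next one starts; the pipelined version uses generators so each row flows through every stage straight away. Both functions are simplified illustrations invented for this sketch and do not reflect how MapReduce or Presto are actually implemented.

```python
def run_staged(rows, stages, log):
    """Staged execution: a stage processes every row before the next stage begins."""
    data = list(rows)
    for name, fn in stages:
        output = []
        for row in data:
            log.append((name, row))
            output.append(fn(row))
        data = output  # stands in for writing the stage output to disk
    return data


def run_pipelined(rows, stages, log):
    """Pipelined execution: rows stream through all stages without materializing."""
    def run_stage(name, fn, upstream):
        for row in upstream:
            log.append((name, row))
            yield fn(row)

    stream = iter(rows)
    for name, fn in stages:
        stream = run_stage(name, fn, stream)
    return list(stream)
```

Both return the same result, but the `log` order differs: the staged run records all rows for the first stage before any row of the second, while the pipelined run interleaves stages row by row. The trade-off on the slide follows from this: nothing is saved between stages, so there is no intermediate result to restart from after a failure.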
Other OSS products
> Cloudera Impala
> Mainly for HDFS / HBase
> Apache Drill
> More flexible architecture
> Apache Tajo
> For building data warehouse
Visualization
Hmm…
> There are no popular OSS products
> We don’t focus on developing a visualization tool for now
> Commercial BI tools are popular
> Tableau, Motion Board, etc.
> Maybe the next presentation will cover this area in depth
Treasure Data resources
> https://github.com/treasure-data
> perfectqueue, perfectsched, etc
> https://sql.treasuredata.com/
> HiveQL syntax checker
> https://examples.treasuredata.com/
> Query catalog
http://blog.treasuredata.com/2014/11/26/12-open-source-software-innovations-from-treasure-data-engineers/
Check: treasuredata.com
Cloud service for the entire data pipeline
