Treasure Data
                      The architecture of data analytics PaaS on AWS



                                    Masahiro Nakagawa

                                   JAWS Days: 2013/03/16




Who are you?
          Masahiro Nakagawa
              • @repeatedly / masa@treasure-data.com


          Treasure Data, Inc.
              • Senior Software Engineer, since 2012/11

          Open Source projects
              •   D Programming Language
              •   MessagePack: D, Python, etc...
              •   Fluentd: Core, mongo, etc...
              •   etc...

                                                          2

Introduction to
          Treasure Data




Company Overview
          Silicon Valley-based Company
              • All Founders are Japanese
                      • Hironobu Yoshikawa
                      • Kazuki Ohta
                      • Sadayuki Furuhashi


          OSS Enthusiasts
              • MessagePack, Fluentd, etc.




                                             4

Investors
             Bill Tai
             Naren Gupta - Nexus Ventures, Director of Redhat, TIBCO
             Othman Laraki - Former VP Growth at Twitter
             James Lindenbaum, Adam Wiggins, Orion Henry - Heroku
              Founders
             Anand Babu Periasamy, Hitesh Chellani - Gluster Founders
             Yukihiro “Matz” Matsumoto - Creator of Ruby
             Dan Scheinman - Director of Arista Networks
             Jerry Yang - Founder of Yahoo!
             ... and 10+ more people
                                                                         5

Treasure Data = Cloud + Big Data
      Chart: database offerings positioned by deployment (On-Premise vs. Cloud)
      and data volume (up to vs. beyond roughly 1 billion entries / 10 TB):
          • Lightweight RDBMS and Enterprise RDBMS (e.g. DB2): on-premise, smaller data ($34B market)
          • Traditional Data Warehouse: on-premise, big data ($10B market)
          • Database-as-a-Service: cloud, smaller data
          • Big Data-as-a-Service: cloud + big data (Treasure Data's segment)

          © 2012 Forrester Research, Inc. Reproduction Prohibited                                              6

Why Cloud? ‘Time’ is Money
      Chart: customer value over time.
          • Ideal expectation: value grows steadily from the moment of sign-up (or PO).
          • Reality (on-premise): value arrives only after HW/SW selection, PoC and deployment,
            becomes obsolete over time, and recovers only at each upgrade.

                                                                         7

Big Data Adoption Stages
                       Stages, by intelligence sophistication (bottom to top):

                       Reporting (Treasure Data's FOCUS: 80% of needs)
                           • Standard Reports      What happened?
                           • Ad-hoc Reports        Where?
                           • Drill Down Query      Where exactly?
                           • Alerts                Error?
                       Analytics
                           • Statistical Analysis  Why?
                           • Predictive Analysis   What's a trend?
                           • Optimization          What's the best?
                                                                                8

Full Stack Support for Big Data Reporting

        Our best-in-class architecture and operations team ensure the
        integrity and availability of your data.

        Data from almost any source can be securely and reliably uploaded
        using td-agent in streaming or batch mode.

        Our SQL, REST, JDBC, ODBC and command-line interfaces support all
        major query tools and approaches.

        You can store gigabytes to petabytes of data efficiently and securely
        in our cloud-based columnar datastore.




                                                                       9

Vision: Single Analytics Platform for the World
                                                                   10

         Our Customers – Fortune Global 500 leaders and
         start-ups including:
         (customer logos)
                                                                   11




Treasure Data’s
          Service Architecture




Treasure Data = Collect + Store + Query
                                                                13

Example in AdTech: MobFox




           1. Europe’s largest independent mobile ad exchange.
           2. 20 billion ad impressions/month (circa Jan. 2013)
           3. Serving ads for 15,000+ mobile apps (circa Jan. 2013)
           4. Needed Big Data Analytics infrastructure ASAP.

                                                                  14

Two Weeks From Start to Finish!




                                                        15

Used AWS Products (1)
          RDS
              • Store user information, job status, etc...
              • Store metadata of our columnar database
              • Worker job queue and scheduler (perfectqueue / perfectsched)


          EC2
              • API servers
              • Hadoop clusters
              • Job workers
                      • Using Chef to deploy


                                                                16

Used AWS Products (2)
          ELB
              • Load balancing of API servers
              • Load balancing of td-agents


          S3
              • Columnar storage built on top of S3
                      • MessagePack columnar format
                      • realtime / archive storage
              • Our Result feature supports S3 output.

                  No EMR, SQS, or other AWS products!
                                                         17

Architecture Breakdown



      Data Collection
      • Increasing variety of data sources
      • No single data schema
      • Lack of streaming data collection method
      • 60% of Big Data project resource consumed here

      Data Store/Analytics
      • Remaining complexity in both traditional DWH and Hadoop (very slow time to market)
      • Challenges in scaling data volume and expanding cost

      Connectivity
      • Required to ensure connectivity with existing BI/visualization apps by JDBC, REST and ODBC
      • Output to other services, e.g. S3, RDBMS, etc.
                                                                                         18

1) Data Collection
          60% of BI project resource is consumed here
          Most ‘underestimated’ and ‘unsexy’ but MOST important
          Fluentd: OSS lightweight but robust Log Collector
              • http://fluentd.org/




                                                               19

Fluentd
                      the missing log collector



                               fluentd.org

                                                  20

In short
             Open sourced log collector written in Ruby
             Using rubygems ecosystem for plugins



                  It’s like syslogd, but
              uses JSON for log messages

                                                           21

        Example: Apache writes its access log; Fluentd tails the file, turns each
        line into an event, buffers events, and inserts them into Mongo.

        Raw log lines:
            127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / ...
            127.0.0.1 - - [11/Dec/2012:07:26:30] "GET / ...
            127.0.0.1 - - [11/Dec/2012:07:26:32] "GET / ...
            127.0.0.1 - - [11/Dec/2012:07:26:40] "GET / ...
            127.0.0.1 - - [11/Dec/2012:07:27:01] "GET / ...
            ...

        Resulting event (one per log line):
            Time    2012-02-04 01:33:51
            Tag     apache.log
            Record  { "host": "127.0.0.1", "method": "GET", "path": "/", ... }
                                                                                         22

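The pipeline above maps directly onto a short td-agent configuration. The sketch below is illustrative only: it assumes the fluent-plugin-mongo gem (one of the 3rd-party gems bundled with td-agent) and uses placeholder paths, hosts, and database names.

    # Minimal sketch: tail an Apache access log into MongoDB (placeholder values)
    <source>
      type tail                    # input plugin: follow the access log
      format apache                # parse each line into a structured record
      path /var/log/apache2/access_log
      pos_file /var/log/td-agent/apache_access.pos
      tag apache.log               # events are routed by this tag
    </source>

    <match apache.log>
      type mongo                   # output plugin from fluent-plugin-mongo
      host 127.0.0.1
      port 27017
      database apache
      collection access
      flush_interval 10s           # buffered events are flushed every 10 seconds
    </match>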
Architecture
              Pluggable Input:   Forward, HTTP, File tail, dstat, ...
              Pluggable Buffer:  Memory, File
              Pluggable Output:  Forward, File, Amazon S3, MongoDB, ...
                                                     23

Before Fluentd
              Server1, Server2, Server3, ...: each Application writes its logs locally,
              and the files are shipped in batches to a central Fluentd log server.

                                                High Latency!
                                                must wait for a day...
                                                                  24

After Fluentd
              Server1, Server2, Server3, ...: each Application sends its logs to a local
              Fluentd, which forwards them to aggregator Fluentd nodes.

                                                     In streaming!
                                                                       25

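A configuration sketch of that streaming topology, using Fluentd's built-in forward input and output plugins; the tags, host names, and paths are placeholders. Each application server runs a local Fluentd that spools events to disk and forwards them to two aggregator nodes.

    # On each application server: forward local events to the aggregators
    <match app.**>
      type forward
      buffer_type file                          # spool to disk so events survive restarts
      buffer_path /var/log/td-agent/buffer/forward
      flush_interval 5s
      <server>
        host aggregator1.example.com
        port 24224
      </server>
      <server>
        host aggregator2.example.com
        port 24224
        standby                                 # used only when the primary is unreachable
      </server>
    </match>

    # On each aggregator node: accept forwarded events
    <source>
      type forward
      port 24224
    </source>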
     Inputs:
       • Access logs: Apache
       • App logs: Frontend, Backend
       • System logs: syslogd
       • Databases
     Fluentd: filter / buffer / routing
     Outputs:
       • Alerting: Nagios
       • Analysis: MongoDB, MySQL, Hadoop
       • Archiving: Amazon S3
                                                             26

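This kind of routing is expressed with tag patterns in <match> sections and, where one stream feeds several destinations, the built-in copy output. A sketch with illustrative tags, hosts, and paths:

    # Copy one event stream to an analysis store and a local archive
    <match app.frontend.**>
      type copy
      <store>
        type mongo                  # analysis store (fluent-plugin-mongo)
        host 127.0.0.1
        database logs
        collection frontend
      </store>
      <store>
        type file                   # local archive; an S3 output is shown later in this deck
        path /var/log/td-agent/archive/frontend
      </store>
    </match>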
td-agent
             Open-sourced distribution package of Fluentd
             The ETL part of Treasure Data
             Includes useful components
                 • ruby, jemalloc, fluentd
                 • 3rd-party gems: td, mongo, webhdfs, etc...
                      •   the td plugin is for Treasure Data (see the sketch below)

             http://packages.treasure-data.com/



                                                                27

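A sketch of a typical td output configuration, similar to the default td-agent.conf; the API key is a placeholder, and the td.<database>.<table> tag convention is the one documented for the td plugin.

    # Ship events tagged td.<database>.<table> to Treasure Data
    <match td.*.*>
      type tdlog                               # output plugin bundled with td-agent
      apikey YOUR_TREASURE_DATA_API_KEY        # placeholder
      auto_create_table                        # create the table on first write
      buffer_type file
      buffer_path /var/log/td-agent/buffer/td
    </match>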
Treasure Data Service Architecture
                                                                 This!

                  Data sources (Apache, Apps, RDBMS, other data sources)
                      → td-agent → Treasure Data columnar data warehouse

                  Users and BI apps
                      → td-command / JDBC, REST → Query API
                      → Query Processing Cluster: HIVE (PIG to be supported),
                        run as MapReduce jobs over the columnar data warehouse
                                                                                                    28

AWS plugins
             S3
             SNS
             SQS
             DynamoDB
             forward-aws
             RDS
             Redshift
             CloudWatch
             Yet Another Cloud Watch
             CloudWatch Lite

             See http://fluentd.org/plugin/ for the full list.
                                                                29

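As one example, the S3 output plugin (fluent-plugin-s3) uploads buffered chunks to a bucket. A minimal sketch; the credentials, bucket, and paths are placeholders:

    # Archive events to Amazon S3 in hourly chunks (placeholder values)
    <match archive.**>
      type s3
      aws_key_id YOUR_AWS_KEY_ID
      aws_sec_key YOUR_AWS_SECRET_KEY
      s3_bucket my-log-archive
      path logs/                        # key prefix inside the bucket
      buffer_path /var/log/td-agent/buffer/s3
      time_slice_format %Y%m%d%H        # one chunk per hour
      time_slice_wait 10m               # wait for late events before uploading
    </match>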
2) Data Store / Analytics - Columnar Storage




                                                    30

Treasure Data Service Processing Flow
             Frontend → Job Queue → Worker → Hadoop clusters

             Applications push metrics to Fluentd (via local Fluentd), which feeds:
               • Treasure Data, for historical analysis
               • Librato Metrics, for realtime analysis
                 (sums up data by the minute; partial aggregation)
                                                                                        31

Structure of Columnar Storages

               import → Import Storage           bulk import → Bulk Import Storage

               Data lands in Realtime Storage (chunks keyed by hash):
                   23c82b0ba3405d4c15aa85d2190e
                   6d7b1482412ab14f0332b8aee119
                   8a7bc848b2791b8fd603c719e54f
                   0e3d402b17638477c9a7977e7dab
                   ...
               and is merged every 1 hour into Archive Storage (chunks keyed by time):
                   2013-03-15 00:23:00 912ec80
                   2013-03-16 00:01:00 277a259
                   ...
               SELECT ... queries read from both storages.
                                                                                        33

                      Layers (top to bottom):

                      Query Language
                      Query Execution
                      Columnar Data
                      Object Storage

                                 34

1/4: Compile SQL into MapReduce

                         SQL Statement
                                  SELECT COUNT(DISTINCT ip) FROM tbl;



                              Hive
                      SQL - to - MapReduce




                                                                   35

2/4: MapReduce is executed in parallel

                                                           SELECT COUNT(DISTINCT ip) FROM tbl;




                      cc2.8xlarge cluster compute instances (up to 100 nodes * 32 threads)



                                                                                                 36

3/4: Columnar Data Access

                                                              SELECT COUNT(DISTINCT ip) FROM tbl;




                      10Gbps Network




                                       Read ONLY the Required Part of Data


                                                                                                    37

4/4: Object-based Storage




                                     38

Data first, Schema later


            Raw data (JSON):   {"user":54, "name":"test", "value":"120", "host":"local"}

            Schema:            user:int      name:string      value:int      host:int

            SELECT returns:    54 (int)      "test" (string)  120 (int)      NULL

            Values are cast to the schema at query time: the string "120" becomes the
            integer 120, while "local" cannot be cast to int, so host comes back as NULL.
                                                                                           39

3) Connectivity

       Query:
           • td-command → REST API → Query API → Query Processing Cluster
           • BI apps → JDBC, ODBC Driver → Query API → Query Processing Cluster

       Result: query results can be delivered from Treasure Data Columnar Storage
               to a Web App, MySQL, S3, ...
                                                                              40

Multi-Tenancy
    All customers share the Hadoop clusters (Multi Data Centers)
    Resource Sharing (Burst Cores), Rapid Improvement, Ease of Upgrade

                      Datacenters A, B, C and D each run a Local FairScheduler.
                      A Global Scheduler handles job submission and plan changes,
                      allocating resources on demand across the datacenters.
                                                                                  41

Conclusion
          Treasure Data
              • Cloud-based Big Data analytics platform
              • Provides a "machete" for Big Data reporting

          Big Data processing
              • Collect / Store / Analytics / Visualization
                       (our focus!)

          AWS products we use
              • EC2, S3, RDS, ELB
              • We build Treasure Data-specific systems on top of AWS

                                                                 42

Big Data for the Rest of Us

                      www.treasure-data.com | @TreasureData




