Designing An Evolving
Database Service with Presto
Taro L. Saito
leo@treasure-data.com
Oct 6th, 2015.
Presto Meetup @ Boston
Presto Usage at Treasure Data
• 100+ customers are actively using Presto
• 30,000+ Presto queries every day
• Importing 1,000,000+ records/sec
Import → Store → Analyze with Presto/Hive → Export
Mobile and Web Sources
Mobile SDKs
JavaScript SDK
(web access logs)
Stream Sources
Streaming
Apache Logs
nginx logs
syslog

JSON logs
…
JSON
Existing Data Sources
Bulk Import
Data files (CSV, TSV, etc.)
MySQL

PostgreSQL

Oracle
…
Embedded Devices
• Collect data from embedded Linux, serial devices, MQTT, XBee radio, etc.
Import data, now.
Treasure Data Architecture
[Diagram] Many small log files → Real-Time Storage → log merge job (Hadoop MapReduce) → Archive Storage
• Archive Storage uses time column-based partitioning: 1-hour partitions (2015-09-29 01:00:00, 2015-09-29 02:00:00, 2015-09-29 03:00:00, …)
• Storage backends: S3 (AWS), Riak CS (IDCF)
• Columnar Format
• Hive and Presto (Distributed SQL Query Engine) query the stored data
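The time column-based partitioning shown in the diagram can be illustrated with a minimal sketch. It assumes each record carries a Unix `time` field (as in the JSON examples in this deck) and that partitions are aligned to UTC hours; the function names `partition_start`, `partition_label`, and `merge_into_partitions` are hypothetical and do not come from Treasure Data's code.

```python
from collections import defaultdict
from datetime import datetime, timezone

PARTITION_SECONDS = 3600  # 1-hour partitions, as in the diagram


def partition_start(unix_time: int) -> int:
    """Floor a Unix timestamp to the start of its 1-hour partition."""
    return unix_time - (unix_time % PARTITION_SECONDS)


def partition_label(unix_time: int) -> str:
    """Format the partition start as a UTC timestamp string (assumed convention)."""
    start = partition_start(unix_time)
    return datetime.fromtimestamp(start, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


def merge_into_partitions(records):
    """Group many small log records into 1-hour partitions keyed by partition label."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_label(record["time"])].append(record)
    return dict(partitions)


records = [
    {"time": 1412380700, "user": 1},
    {"time": 1412381000, "user": 2, "status": 200},
    {"time": 1412390000, "user": "U01", "status": 200},
]
partitions = merge_into_partitions(records)
for label in sorted(partitions):
    print(label, len(partitions[label]))
```

A query with a time range predicate then only needs to read the partitions whose hour overlaps that range, which is the usual motivation for this layout.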
Extensible Columnar Store
• JSON data
• {"time": 1412380700, "user": 1}
• Additional Column
• {"time": 1412381000, "user": 2, "status": 200}
• Type Escalation (int -> string)
• {"time": 1412390000, "user": "U01", "status": 200}
• MessagePack
• A fast and compact JSON-like format
• Auto type conversion
• Table schema <=> MessagePack types
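The schema behavior on this slide (a new column appearing, then a column widening from int to string) can be sketched as follows. This is an illustration under assumptions, not the actual store: it uses the standard `json` module in place of MessagePack, models only two column types ("int" and "string"), and the helper names `column_type`, `merge_schema`, and `read_column` are hypothetical.

```python
import json


def column_type(value):
    """Classify a value as 'int' or 'string' (simplified two-type model)."""
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(value, int) and not isinstance(value, bool):
        return "int"
    return "string"


def merge_schema(schema, record):
    """Update the table schema in place from one record."""
    for name, value in record.items():
        new_type = column_type(value)
        old_type = schema.get(name)
        if old_type is None:
            schema[name] = new_type   # additional column
        elif old_type != new_type:
            schema[name] = "string"   # type escalation (int -> string)
    return schema


def read_column(records, schema, name):
    """Read one column, converting stored values to the current schema type."""
    values = []
    for record in records:
        value = record.get(name)
        if value is None:
            values.append(None)       # record predates the column
        elif schema[name] == "string":
            values.append(str(value))  # auto type conversion
        else:
            values.append(value)
    return values


lines = [
    '{"time": 1412380700, "user": 1}',
    '{"time": 1412381000, "user": 2, "status": 200}',
    '{"time": 1412390000, "user": "U01", "status": 200}',
]
records = [json.loads(line) for line in lines]

schema = {}
for record in records:
    merge_schema(schema, record)

print(schema)
print(read_column(records, schema, "user"))
```

The point of the design is that older records are never rewritten: the schema widens, and conversion happens when a column is read.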
Use Cases
E-COMMERCE
BEFORE
AFTER
Biggest Mobile Shopping
WISH.COM
• Reduced costs
• Scalability
• Single data warehouse
GAMING
BEFORE: 2500+ servers, daily upload, delay of 1-2 days
AFTER: 2500+ servers, real-time
1 billion records/day
• Reduced TCO
• Real-time collection
• Real-time access to KPIs
Top 10 globally; 40M+ users
x 20
AD TECH
Publishers’ Dashboard Advertisers’ Dashboard
• 800 B/month
• Live in 2 weeks with 1 engineer!
• 300% growth
Europe’s largest mobile ad-exchange
More than 50 billion impressions/month
LOYALTY
Aggregation
E-Commerce
Marketing Campaigns;
Promotions
• Customer Segmentation
• A/B Testing
Challenges
• Handle Huge Query Result Output
• SELECT * / CREATE TABLE AS / INSERT INTO
• Parallel Result Upload to S3
• Bypass JSON result generation at the coordinator
• td-presto connector
• Accesses MessagePack based columnar store
• Handle S3 access retry / pipelining
• Future:
• Better query plan visualization
• Quickly find performance bottlenecks and memory-consuming tasks
• Storing intermediate query results to disks
• Process large joins, query resource limitation
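The "S3 access retry / pipelining" item above can be illustrated with a generic sketch: retry transient failures with exponential backoff and jitter, and keep several requests in flight at once. Everything here is an assumption for illustration and not the td-presto connector's implementation: `TransientStorageError`, `with_retry`, `fetch_all`, and the simulated `flaky_fetch` are hypothetical names, and no real S3 client is called.

```python
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class TransientStorageError(Exception):
    """Hypothetical stand-in for a retryable storage failure such as a timeout."""


def with_retry(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientStorageError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))


def fetch_all(keys, fetch, max_in_flight=4, **retry_options):
    """Fetch objects with up to max_in_flight requests in progress, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(lambda key: with_retry(lambda: fetch(key), **retry_options), keys))


# Simulated storage: the first request for each key fails, the second succeeds.
attempts = {}
attempts_lock = threading.Lock()


def flaky_fetch(key):
    with attempts_lock:
        attempts[key] = attempts.get(key, 0) + 1
        count = attempts[key]
    if count == 1:
        raise TransientStorageError(f"simulated timeout for {key}")
    return f"data:{key}"


# sleep is replaced with a no-op so the demo finishes immediately
results = fetch_all(["part-0", "part-1", "part-2"], flaky_fetch, sleep=lambda seconds: None)
print(results)
```

Overlapping requests hides per-request latency, while the backoff keeps retries from adding load to a storage service that is already struggling.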
Extensible Schema
SQL via Hive, Presto
Unlimited Users, Queries
Enterprise Apps
Data Science Tools
REST API
Ingestion: Streaming, Bulk
BI Tools
treasuredata.com/request_demo

Presto @ Treasure Data - Presto Meetup Boston 2015