Presto @ Treasure Data - Presto Meetup Boston 2015

Treasure Data simplifies event analytics for the complex digital world. Our customers send us 1,000,000 events per second and issue 30,000+ Presto queries every day to understand their customers better. One of the challenges is designing a cloud database with zero downtime to support a global customer base. We have achieved this goal by developing several open-source technologies: Fluentd and Embulk enable seamless log collection from stream and batch sources, and MessagePack lets us provide an extensible columnar store that accommodates future schema changes. Finally, Presto allows us to serve the wide variety of data processing that our customers perform on our service. In this talk, I will present an overview of our system and how our customers keep using Presto while collecting and extending their data sets.

Published in: Technology

  1. Designing An Evolving Database Service with Presto. Taro L. Saito, leo@treasure-data.com. Oct 6th, 2015. Presto Meetup @ Boston
  2. Presto Usage at Treasure Data
      • 100+ customers are actively using Presto
      • 30,000+ Presto queries every day
      • Importing 1,000,000+ records/sec
      • Diagram labels: Import, Export, Store, Analyze with Presto/Hive
  3. Mobile and Web Sources: Mobile SDKs, JavaScript SDK (web access logs)
  4. Stream Sources. Streaming: Apache logs, nginx logs, syslog, JSON logs, … (JSON)
  5. Existing Data Sources. Bulk Import: data files (CSV, TSV, etc.), MySQL, PostgreSQL, Oracle, …
  6. Embedded Devices
      • Collect data from embedded Linux, serial devices, MQTT, XBee Radio, etc.
  7. Import data, now.
  8. Treasure Data Architecture (diagram)
      • Log: many small log files land in Real-Time Storage
      • Log merge job (Hadoop MapReduce) moves them into Archive Storage
      • Archive Storage: time column-based partitioning into 1-hour partitions (2015-09-29 01:00:00, 2015-09-29 02:00:00, 2015-09-29 03:00:00, …)
      • Columnar Format on S3 (AWS) and Riak CS (IDCF)
      • Hive and Presto (Distributed SQL Query Engine) query the stored data
  9. Extensible Columnar Store
      • JSON data
          • {"time": 1412380700, "user": 1}
      • Additional Column
          • {"time": 1412381000, "user": 2, "status": 200}
      • Type Escalation (int -> string)
          • {"time": 1412390000, "user": "U01", "status": 200}
      • MessagePack
          • A fast and compact JSON-like format
          • Auto type conversion
          • Table schema <=> MessagePack types
  10. Use Cases
  11. E-COMMERCE (before/after diagram). Biggest Mobile Shopping: WISH.COM
      • Reduced costs
      • Scalability
      • Single data warehouse
  12. GAMING (before/after diagram). Top 10 globally; 40M+ users x 20
      • Before: Daily Upload, Delay of 1-2 days, 2500+ servers
      • After: Real-time, 2500+ servers, 1 Billion records/day
      • Reduced TCO
      • Real-time collection
      • Real-time access to KPIs
  13. AD TECH. Europe's largest mobile ad-exchange; more than 50 billion impressions/month
      • Diagram labels: Publishers' Dashboard, Advertisers' Dashboard
      • 800 B/month
      • Live in 2 weeks with 1 engineer!
      • 300% growth
  14. LOYALTY. Aggregation; E-Commerce; Marketing Campaigns; Promotions
      • Customer Segmentation
      • A/B Testing
  15. Challenges
      • Handle Huge Query Result Output
          • SELECT * / CREATE TABLE AS / INSERT INTO
          • Parallel Result Upload to S3
          • Bypass JSON result generation at the coordinator
      • td-presto connector
          • Accesses MessagePack based columnar store
          • Handle S3 access retry / pipelining
      • Future:
          • Better query plan visualization
          • Quickly find the performance bottleneck and memory consuming tasks
          • Storing intermediate query results to disks
          • Process large joins, query resource limitation
  16. Extensible Schema; SQL via Hive, Presto; Unlimited Users, Queries; Ingestion: Streaming, Bulk; REST API; Enterprise Apps; Data Science Tools; BI Tools. treasuredata.com/request_demo
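
The schema evolution described on slide 9 (an additional column, then type escalation from int to string) can be sketched roughly as follows. This is a minimal illustration and not Treasure Data's implementation: plain Python dicts stand in for decoded MessagePack records, and the names `infer_type`, `merge_schema`, and `read_column` are hypothetical. It also assumes that any type conflict escalates to string, which is the only escalation the slide shows.

```python
def infer_type(value):
    """Map a Python value to a (hypothetical) column type name."""
    if isinstance(value, int):
        return "int"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def merge_schema(schema, record):
    """Return a new schema that also accommodates `record`."""
    updated = dict(schema)
    for column, value in record.items():
        new_type = infer_type(value)
        old_type = updated.get(column)
        if old_type is None:
            # Additional column: just add it to the schema.
            updated[column] = new_type
        elif old_type != new_type:
            # Assumed rule: conflicting types escalate to string.
            updated[column] = "string"
    return updated


def read_column(records, column, column_type):
    """Read one column, converting each value to the column's current type."""
    values = []
    for record in records:
        value = record.get(column)
        if value is None:
            values.append(None)  # Rows written before the column existed.
        elif column_type == "string":
            values.append(str(value))  # Auto type conversion on read.
        else:
            values.append(value)
    return values


# The three sample records from slide 9.
records = [
    {"time": 1412380700, "user": 1},
    {"time": 1412381000, "user": 2, "status": 200},
    {"time": 1412390000, "user": "U01", "status": 200},
]

schema = {}
for record in records:
    schema = merge_schema(schema, record)
```

With the three sample records, `time` and `status` end up typed as int and `user` as string. Reading `user` returns string values for every row, including the rows originally stored as integers, and reading `status` returns `None` for the first row, which predates that column.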

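The time column-based partitioning on slide 8 groups records into 1-hour partitions. A minimal sketch of that bucketing is shown below. It assumes the `time` column holds Unix epoch seconds (as in the slide 9 samples) and that partitions align to UTC hour boundaries; the time zone and the helper names `partition_label` and `group_by_partition` are assumptions, not details from the talk.

```python
from collections import defaultdict
from datetime import datetime, timezone

PARTITION_SECONDS = 3600  # 1-hour partitions, as on slide 8


def partition_start(epoch_seconds):
    """Round an epoch timestamp down to the start of its 1-hour partition."""
    return epoch_seconds - (epoch_seconds % PARTITION_SECONDS)


def partition_label(epoch_seconds):
    """Format the partition start like the labels on the slide (UTC assumed)."""
    start = datetime.fromtimestamp(partition_start(epoch_seconds), tz=timezone.utc)
    return start.strftime("%Y-%m-%d %H:%M:%S")


def group_by_partition(records):
    """Group records by the 1-hour partition that their `time` column falls in."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_label(record["time"])].append(record)
    return dict(partitions)


sample = [
    {"time": 1412380700, "user": 1},
    {"time": 1412381000, "user": 2},
    {"time": 1412390000, "user": "U01"},
]
by_partition = group_by_partition(sample)
```

Under the UTC assumption, the first two sample timestamps are only five minutes apart but straddle an hour boundary, so they land in different partitions. A query restricted to a time range then only needs to touch the partitions that overlap that range.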