Presto @ Treasure Data - Presto Meetup Boston 2015


Treasure Data simplifies event analytics for the complex digital world. Our customers send us 1,000,000 events per second and issue 30,000+ Presto queries every day to understand their customers better. One of the challenges is designing a cloud database with zero downtime to support a global customer base. We have achieved this goal by developing several open-source technologies: Fluentd and Embulk enable seamless log collection from stream and batch sources, and MessagePack lets us provide an extensible columnar store that accommodates future schema changes. Finally, Presto allows us to serve the wide variety of data processing that our customers perform on our service. In this talk, I will present an overview of our system and show how our customers keep using Presto while collecting and extending their data sets.

Designing an Evolving Database Service with Presto
Taro L. Saito
leo@treasure-data.com
Oct 6th, 2015
Presto Meetup @ Boston

Presto Usage at Treasure Data
• 100+ customers are actively using Presto
• 30,000+ Presto queries every day
• Importing 1,000,000+ records / sec.
Import, Store, Analyze with Presto/Hive, Export

Mobile and Web Sources
Mobile SDKs
JavaScript SDK
(web access logs)

Stream Sources
Streaming
Apache Logs
nginx logs
syslog 
JSON logs
…
JSON

Existing Data Sources
Bulk Import
Data files (CSV, TSV, etc.)
MySQL 
PostgreSQL 
Oracle
…

Embedded Devices
• Collect data from embedded Linux, serial devices, MQTT, XBee radio, etc.

Treasure Data Architecture
• Logs arrive as many small log files in Real-Time Storage
• A log merge job (Hadoop MapReduce) merges them into Archive Storage
• Time column-based partitioning into 1-hour partitions (2015-09-29 01:00:00, 2015-09-29 02:00:00, 2015-09-29 03:00:00, …)
• Storage back ends: S3 (AWS), Riak CS (IDCF)
• Columnar Format
• Query engines: Hive, Presto (Distributed SQL Query Engine)
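The time column-based partitioning shown on this slide can be illustrated with a small sketch. It assumes that each record carries a Unix `time` value in seconds and that partitions are aligned to UTC hour boundaries. The function names are hypothetical and are not part of any Treasure Data API.

```python
from datetime import datetime, timezone
from typing import List

PARTITION_SECONDS = 3600  # 1-hour partitions, as on the slide


def partition_start(epoch_seconds: int) -> int:
    """Round a Unix timestamp down to the start of its 1-hour partition."""
    return epoch_seconds - (epoch_seconds % PARTITION_SECONDS)


def partition_label(epoch_seconds: int) -> str:
    """Format the partition start the way the slide labels partitions (UTC assumed)."""
    start = datetime.fromtimestamp(partition_start(epoch_seconds), tz=timezone.utc)
    return start.strftime("%Y-%m-%d %H:%M:%S")


def partitions_for_range(start_epoch: int, end_epoch: int) -> List[int]:
    """List the partition starts that overlap the half-open range [start, end)."""
    if end_epoch <= start_epoch:
        return []
    first = partition_start(start_epoch)
    last = partition_start(end_epoch - 1)
    return list(range(first, last + 1, PARTITION_SECONDS))
```

With a layout like this, a query that filters on the time column only needs to read the partitions returned by `partitions_for_range`, which is the usual motivation for partitioning on time.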

Extensible Columnar Store
• JSON data
• {"time": 1412380700, "user": 1}
• Additional Column
• {"time": 1412381000, "user": 2, "status": 200}
• Type Escalation (int -> string)
• {"time": 1412390000, "user": "U01", "status": 200}
• MessagePack
• A fast and compact JSON-like format
• Auto type conversion
• Table schema <=> MessagePack types
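The schema evolution on this slide can be sketched in a few lines. This is a minimal illustration, not the actual store: it uses plain Python dictionaries in place of MessagePack values, models only the `int` and `string` types, and widens any type conflict to `string`, which is the one escalation rule the slide names. All function names are hypothetical.

```python
from typing import Any, Dict, List, Optional


def value_type(value: Any) -> str:
    """Classify a value as 'int' or 'string' (the only types modeled here)."""
    if isinstance(value, int) and not isinstance(value, bool):
        return "int"
    return "string"


def escalate(current: Optional[str], incoming: str) -> str:
    """Keep the type if it agrees; otherwise widen to string (int -> string)."""
    if current is None or current == incoming:
        return incoming
    return "string"


def evolve_schema(schema: Dict[str, str], record: Dict[str, Any]) -> Dict[str, str]:
    """Return a new schema that also accommodates the given record."""
    evolved = dict(schema)
    for column, value in record.items():
        evolved[column] = escalate(evolved.get(column), value_type(value))
    return evolved


def read_column(records: List[Dict[str, Any]], column: str, column_type: str) -> List[Any]:
    """Read one column, converting stored values to the schema type."""
    result = []
    for record in records:
        value = record.get(column)  # a column added later is missing in old rows
        if value is None:
            result.append(None)
        elif column_type == "string":
            result.append(str(value))
        else:
            result.append(value)
    return result


# The three records from the slide, in arrival order.
records = [
    {"time": 1412380700, "user": 1},
    {"time": 1412381000, "user": 2, "status": 200},
    {"time": 1412390000, "user": "U01", "status": 200},
]

schema: Dict[str, str] = {}
for record in records:
    schema = evolve_schema(schema, record)
```

After the loop, `user` is typed as `string` and older integer values are converted on read, while `status` is simply absent (`None`) for the first record. Stored data is never rewritten; only the table schema changes.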

E-COMMERCE
BEFORE
AFTER
Biggest Mobile Shopping
WISH.COM
• Reduced costs
• Scalability
• Single data warehouse

GAMING
Top 10 globally; 40M+ users
BEFORE: 2500+ servers; daily upload; delay of 1-2 days
AFTER: 2500+ servers; real-time
1 billion records/day
x 20
• Reduced TCO
• Real-time collection
• Real-time access to KPIs

AD TECH
Publishers’ Dashboard Advertisers’ Dashboard
• 800 B/month
• Live in 2 weeks with 1 engineer!
• 300% growth
Europe’s largest mobile ad-exchange
More than 50 billion impressions/month

LOYALTY
Aggregation
E-Commerce
Marketing Campaigns;
Promotions
• Customer Segmentation
• A/B Testing

Challenges
• Handle Huge Query Result Output
• SELECT * / CREATE TABLE AS / INSERT INTO
• Parallel Result Upload to S3
• Bypass JSON result generation at the coordinator
• td-presto connector
• Accesses MessagePack based columnar store
• Handle S3 access retry / pipelining
• Future:
• Better query plan visualization
• Quickly find the performance bottleneck and memory consuming tasks
• Storing intermediate query results to disks
• Process large joins, query resource limitation
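
The "Handle S3 access retry" item above can be illustrated with a generic retry loop. This is a sketch under assumptions, not the td-presto connector's code: the slide does not say which retry policy is used, so exponential backoff is chosen here as a common option, and `TransientError`, `with_retry`, and their parameters are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a retryable storage error (hypothetical, e.g. a timeout)."""


def with_retry(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so that the backoff schedule can be checked in tests without actually waiting.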

Extensible Schema
SQL via Hive, Presto
Unlimited Users, Queries
Enterprise Apps
Data Science Tools
REST API
Ingestion: Streaming, Bulk
BI Tools
treasuredata.com/request_demo
