SlideShare a Scribd company logo
1 of 44
Presto Meetup 2016
My
Use case
Small Start
self‐introduction
• @toyama0919
• Analytics Infra.
• Nearly working embulk…
• Presto using one and a half years.
Our development situation
• We commonly used sql.
• Marketing occupation don't write sql.
• I often write the complicated SQL, that is
100 lines..
• We love OSS.
• Not use Update, Insert, Delete by Presto.
Our Business situation
• We manage and operate web site of BtoB.
• Our data lifecycle is long.
• Business side not write sql.
• watching re:dash and Adobe analytics.
• Sales increase 15 straight year.
analyst want data quickly
Ruby Batch
CollectBatchVisualize Data Store
(Digdag)
Analytics Priolity
1. Direct SQL
2. Presto
3. ETL
Cost is large
difference
from 1 to 3
Why use presto?
• Cross server Join
• Window function
• UDF
Cross Server Join
Join
• Cross server and cross database.
• A single Presto query can combine data
from multiple sources.
• We use multiple sources join query.
• reduce ETL pain.
Collect data in one place?
• Equal able to get data by one query.
• I not want to have duplicate data.(master
data, user data)
• Collect the data in one place, high
develop cost.
with mysql_user as (
select
user_id,
user_name
from
mysql.schema.users
),
redshift_user_log as (
select
user_id,
log_time
from
redshift.schema.pageview
)
select
user_id,
user_name,
count(*)
from mysql_user
inner join redshift_user_log on mysql_user.user_id = redshift_user_log.user_id
group by user_id, user_name
UDF
Mysql not support
mechanism
• window function
• with query
– not support Recursive.
• URL function
• Array data type
• cross join unnest
URL Function
select url_encode('Presto最高');
=> Presto%8d%c5%8d%82
select url_decode('Presto%8d%c5%8d%82');
=> Presto最高
Regexp Function
select regexp_extract_all('1a 2b 14m', 'd+')
=> [1, 2, 14]
select regexp_extract(
'超低床型自動梱包機 RQ-8LD',
'([a-zA-z0-9-]+)’
)
=> RQ-8LD
SQL no good at it
• Normalization of the character string.
• split csv string.
• Morphological analysis.
Normalization
select normalize(upper('hoge'), NFKC)
#=> HOGE
Array type
select
split(keywords, ',') as keywords
From
mysql_keywords_table
keywords
----------------------------
keyword1,keyword2,keyword3
keywords
----------------------------
['keyword1','keyword2','keyword3']
horizontal to vertical
SELECT
keyword
FROM
mysql_keywords_table
CROSS JOIN
UNNEST(split(keywords, ',')) AS t (keyword)
horizontal to vertical
keywords
----------------------------
keyword1,keyword2,keyword3
keyword4,keyword5
keyword6
keyword1
keyword
----------------------------
keyword1
keyword2
keyword3
keyword4
keyword5
keyword6
keyword1
window function
• We use window function for Mysql.
(Presto on mysql)
• data source is Mysql, But Presto world
can use.
• But can not use original function of mysql.
Rank function on mysql
select
company_id,
category_id,
count(*),
rank() over (
partition by company_id
order by count(*) desc
)
from
mysql.schema.mysql_table
group by company_id, category_id
other window function
• last_value
• first_value
• dense_rank
• percent_rank
Prestogres
• PostgreSQL protocol gateway for Presto.
• rewrite queries before sending Presto to
PostgreSQL.
• have password-based authentication and
SSL.
Why Prestogres?
• Other application connectivity.
– pgAdmin, psql command.
– re:dash connecte with PostgreSQL protocol to presto.
– But can directly connect to presto.
• We connect to presto, need Presto client.
– I not want use java client.
• Weak security.
– certification is taken by prestogres
Prestogres Limitation
• prepared statement.
– not support Presto too.
– so not work embulk-input-postgresql
• Can’t fetch schema by sql.
• Temporary table
• DROP TABLE
re:dash
• Visualization platform, write by python.
• Supports many data sources.
• Sharing query with member.
• Scheduling query.(per day, per hour)
• Very active contribution.
increased rapidly Presto
query by re:dash
• Number of the presto queries increased
than 10 times.
• That won't change with writing ETL on
re:dash.
• Re:dash having a good reputation in
internal.
Okay,
analytics
problems all clear!
No.. Can’t escape from ETL
Embulk with Presto
• use embulk-input-presto of own making.
– Support json type.
• Create point in time data.
• Create machine learning data.
Why Embulk?
• Very active plugin ecosystem.
• Complicated string analysis can not only
sql.
• With digdag combination is very
powerful.
• Want can do it shortest distance.
• Fluentd overwork..
Operation
Install by RPM
• Presto have RPM.
– not distribution.
– need source build..
• include init script.
• But not support open-jdk..
– Pull requesting..
AWS integration
• We build Presto on ec2.
• Not use EMR.
• Worker is spot instance, multi instance
types.
– prevent down all at once
networking
• Presto cluster(coordinator and workers)
place in the same AZ.
• If other AZ, very high traffic cost(and
money).
– should not multi AZ.
Networking on AWS
Availability Zone Availability
Zone
cordinator
worker
worker
worker
problem
• Very huge repository.
• SPOF cordinator.
• run long range query, occur
OutOfMemory Error.
Very huge repository
• monolithic application.
– I want Separate repository.
• First build takes 30 minutes.
• After the second time build takes 10 minutes.
• All connector is main repository.
– MongoDB、Kafka、cassandra..
– will nearly support Elasticsearch
• Hard to do the contribution.
Big change for jdbc
• support multi data type predicate
pushdown.
• We used apply patch presto…
• Let's try mysql people.
listened Presto impression
• extended technology of Hadoop.
=>I don't know hadoop. Presto have many
connector.
• parallel processing looks difficult.
=>Presto not have storage, There is not so
influence.
・I do not have so big data.
=>I don't so big player.
Summary
• Presto is great software.
• So not difficult.
• Let's use it more.

More Related Content

What's hot

Presto @ Facebook: Past, Present and Future
Presto @ Facebook: Past, Present and FuturePresto @ Facebook: Past, Present and Future
Presto @ Facebook: Past, Present and Future
DataWorks Summit
 
Presto in the cloud
Presto in the cloudPresto in the cloud
Presto in the cloud
Qubole
 

What's hot (20)

Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentation
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure Data
 
Internals of Presto Service
Internals of Presto ServiceInternals of Presto Service
Internals of Presto Service
 
Presto updates to 0.178
Presto updates to 0.178Presto updates to 0.178
Presto updates to 0.178
 
How to ensure Presto scalability 
in multi use case
How to ensure Presto scalability 
in multi use case How to ensure Presto scalability 
in multi use case
How to ensure Presto scalability 
in multi use case
 
Presto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @FacebookPresto meetup 2015-03-19 @Facebook
Presto meetup 2015-03-19 @Facebook
 
Presto @ Facebook: Past, Present and Future
Presto @ Facebook: Past, Present and FuturePresto @ Facebook: Past, Present and Future
Presto @ Facebook: Past, Present and Future
 
Prestogres internals
Prestogres internalsPrestogres internals
Prestogres internals
 
Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015
 
Embulk and Machine Learning infrastructure
Embulk and Machine Learning infrastructureEmbulk and Machine Learning infrastructure
Embulk and Machine Learning infrastructure
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark StreamingBuilding Realtime Data Pipelines with Kafka Connect and Spark Streaming
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
 
Presto in the cloud
Presto in the cloudPresto in the cloud
Presto in the cloud
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
 
tdtechtalk20160330johan
tdtechtalk20160330johantdtechtalk20160330johan
tdtechtalk20160330johan
 
Lightning talk: elasticsearch at Cogenta
Lightning talk: elasticsearch at CogentaLightning talk: elasticsearch at Cogenta
Lightning talk: elasticsearch at Cogenta
 
Presentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membasePresentation: mongo db & elasticsearch & membase
Presentation: mongo db & elasticsearch & membase
 
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual...
 
Presto: Distributed sql query engine
Presto: Distributed sql query engine Presto: Distributed sql query engine
Presto: Distributed sql query engine
 

Viewers also liked

Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoring
Taro L. Saito
 

Viewers also liked (17)

Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
 
Presto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkPresto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talk
 
Presto Meetup (2015-03-19)
Presto Meetup (2015-03-19)Presto Meetup (2015-03-19)
Presto Meetup (2015-03-19)
 
Presto Meetup @ Facebook (3/22/2016)
Presto Meetup @ Facebook (3/22/2016)Presto Meetup @ Facebook (3/22/2016)
Presto Meetup @ Facebook (3/22/2016)
 
Bq sushi(BigQuery lessons learned)
Bq sushi(BigQuery lessons learned)Bq sushi(BigQuery lessons learned)
Bq sushi(BigQuery lessons learned)
 
Data management platform_market
Data management platform_marketData management platform_market
Data management platform_market
 
Zeromq anatomy & jeromq
Zeromq anatomy & jeromqZeromq anatomy & jeromq
Zeromq anatomy & jeromq
 
Presto for the Enterprise @ Hadoop Meetup
Presto for the Enterprise @ Hadoop MeetupPresto for the Enterprise @ Hadoop Meetup
Presto for the Enterprise @ Hadoop Meetup
 
Presto in my_use_case
Presto in my_use_casePresto in my_use_case
Presto in my_use_case
 
Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)
Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)
Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)
 
Presto As A Service - Treasure DataでのPresto運用事例
Presto As A Service - Treasure DataでのPresto運用事例Presto As A Service - Treasure DataでのPresto運用事例
Presto As A Service - Treasure DataでのPresto運用事例
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoring
 
Py "Baseball" Data入門 - 広島東洋カープ編 #pyconhiro
Py "Baseball" Data入門 - 広島東洋カープ編 #pyconhiroPy "Baseball" Data入門 - 広島東洋カープ編 #pyconhiro
Py "Baseball" Data入門 - 広島東洋カープ編 #pyconhiro
 
Presto: Distributed SQL on Anything - Strata Hadoop 2017 San Jose, CA
Presto: Distributed SQL on Anything -  Strata Hadoop 2017 San Jose, CAPresto: Distributed SQL on Anything -  Strata Hadoop 2017 San Jose, CA
Presto: Distributed SQL on Anything - Strata Hadoop 2017 San Jose, CA
 
Upgrading from-hdp-21-to-hdp-25
Upgrading from-hdp-21-to-hdp-25Upgrading from-hdp-21-to-hdp-25
Upgrading from-hdp-21-to-hdp-25
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
20170303 java9 hadoop
20170303 java9 hadoop20170303 java9 hadoop
20170303 java9 hadoop
 

Similar to Presto Meetup 2016 Small Start

Similar to Presto Meetup 2016 Small Start (20)

pandas.(to/from)_sql is simple but not fast
pandas.(to/from)_sql is simple but not fastpandas.(to/from)_sql is simple but not fast
pandas.(to/from)_sql is simple but not fast
 
Breaking data
Breaking dataBreaking data
Breaking data
 
Migration From Oracle to PostgreSQL
Migration From Oracle to PostgreSQLMigration From Oracle to PostgreSQL
Migration From Oracle to PostgreSQL
 
Distributed query deep dive conor cunningham
Distributed query deep dive   conor cunninghamDistributed query deep dive   conor cunningham
Distributed query deep dive conor cunningham
 
The Why and How of Scala at Twitter
The Why and How of Scala at TwitterThe Why and How of Scala at Twitter
The Why and How of Scala at Twitter
 
Distributed Model Validation with Epsilon
Distributed Model Validation with EpsilonDistributed Model Validation with Epsilon
Distributed Model Validation with Epsilon
 
QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...
 
AWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data AnalyticsAWS Redshift Introduction - Big Data Analytics
AWS Redshift Introduction - Big Data Analytics
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
 
U-SQL - Azure Data Lake Analytics for Developers
U-SQL - Azure Data Lake Analytics for DevelopersU-SQL - Azure Data Lake Analytics for Developers
U-SQL - Azure Data Lake Analytics for Developers
 
10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL
 
30334823 my sql-cluster-performance-tuning-best-practices
30334823 my sql-cluster-performance-tuning-best-practices30334823 my sql-cluster-performance-tuning-best-practices
30334823 my sql-cluster-performance-tuning-best-practices
 
Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014
 
VMworld 2013: Deep Dive into vSphere Log Management with vCenter Log Insight
VMworld 2013: Deep Dive into vSphere Log Management with vCenter Log InsightVMworld 2013: Deep Dive into vSphere Log Management with vCenter Log Insight
VMworld 2013: Deep Dive into vSphere Log Management with vCenter Log Insight
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
20140128 webinar-get-more-out-of-mysql-with-tokudb-140319063324-phpapp02
20140128 webinar-get-more-out-of-mysql-with-tokudb-140319063324-phpapp0220140128 webinar-get-more-out-of-mysql-with-tokudb-140319063324-phpapp02
20140128 webinar-get-more-out-of-mysql-with-tokudb-140319063324-phpapp02
 
Your backend architecture is what matters slideshare
Your backend architecture is what matters slideshareYour backend architecture is what matters slideshare
Your backend architecture is what matters slideshare
 
MongoDB Pros and Cons
MongoDB Pros and ConsMongoDB Pros and Cons
MongoDB Pros and Cons
 
Real World Performance - Data Warehouses
Real World Performance - Data WarehousesReal World Performance - Data Warehouses
Real World Performance - Data Warehouses
 

More from Hiroshi Toyama (7)

re:dash is awesome
re:dash is awesomere:dash is awesome
re:dash is awesome
 
Fluentdで本番環境を再現
Fluentdで本番環境を再現Fluentdで本番環境を再現
Fluentdで本番環境を再現
 
Fluentd
FluentdFluentd
Fluentd
 
Vagrant
VagrantVagrant
Vagrant
 
Aws cli
Aws cliAws cli
Aws cli
 
Apache jmeter
Apache jmeterApache jmeter
Apache jmeter
 
java-feature-on-scala
java-feature-on-scalajava-feature-on-scala
java-feature-on-scala
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 

Presto Meetup 2016 Small Start