Presto Meetup 2016 Small Start

Presto Meetup 2016
My Use Case: Small Start

Self-introduction
• @toyama0919
• Analytics infrastructure.
• Mostly working with Embulk…
• Using Presto for a year and a half.

Our development situation
• We commonly use SQL.
• Marketing staff don't write SQL.
• I often write complicated SQL, around 100 lines long.
• We love OSS.
• We don't run UPDATE, INSERT, or DELETE through Presto.

Our business situation
• We manage and operate B2B websites.
• Our data lifecycle is long.
• The business side does not write SQL.
• They watch re:dash and Adobe Analytics.
• Sales have increased for 15 straight years.

[Diagram: Collect, Batch (Ruby Batch, Digdag), Visualize, Data Store]

Analytics Priority
1. Direct SQL
2. Presto
3. ETL

The cost differs greatly between 1 and 3.

Why use Presto?
• Cross-server joins
• Window functions
• UDFs

Join
• Cross-server and cross-database.
• A single Presto query can combine data from multiple sources.
• We use queries that join multiple sources.
• Reduces ETL pain.

Collect data in one place?
• Being able to get the data with one query is equivalent.
• I don't want to keep duplicate data (master data, user data).
• Collecting the data in one place has a high development cost.

with mysql_user as (
select
user_id,
user_name
from
mysql.schema.users
),
redshift_user_log as (
select
user_id,
log_time
from
redshift.schema.pageview
)
select
mysql_user.user_id,
mysql_user.user_name,
count(*)
from mysql_user
inner join redshift_user_log on mysql_user.user_id = redshift_user_log.user_id
group by mysql_user.user_id, mysql_user.user_name
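To make the result of this cross-source join concrete, here is a minimal Python sketch of what the query computes: an inner join on user_id followed by a count per user. The sample rows and the function name count_pageviews are made up for illustration, and the sketch assumes user_id is unique in the users table. It is a stand-in for the logic only, not for how Presto executes the query.

```python
from collections import Counter

# Hypothetical rows standing in for mysql.schema.users
mysql_users = [
    {"user_id": 1, "user_name": "alice"},
    {"user_id": 2, "user_name": "bob"},
]

# Hypothetical rows standing in for redshift.schema.pageview
redshift_pageviews = [
    {"user_id": 1, "log_time": "2016-07-01 10:00:00"},
    {"user_id": 1, "log_time": "2016-07-01 10:05:00"},
    {"user_id": 2, "log_time": "2016-07-01 11:00:00"},
    {"user_id": 3, "log_time": "2016-07-01 12:00:00"},  # no matching user
]


def count_pageviews(users, pageviews):
    """Inner join on user_id, then count rows per (user_id, user_name)."""
    # Assumes user_id is unique in users.
    names = {u["user_id"]: u["user_name"] for u in users}
    counts = Counter(
        (pv["user_id"], names[pv["user_id"]])
        for pv in pageviews
        if pv["user_id"] in names  # the inner join drops unmatched rows
    )
    return sorted((uid, name, n) for (uid, name), n in counts.items())


result = count_pageviews(mysql_users, redshift_pageviews)
print(result)
```

The pageview for user_id 3 disappears from the result because no matching user exists, which is the inner join behaviour.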

Features MySQL does not support
• Window functions
• WITH queries
– RECURSIVE is not supported.
• URL functions
• Array data type
• CROSS JOIN UNNEST

URL Function
select url_encode('Presto最高');
=> Presto%E6%9C%80%E9%AB%98
select url_decode('Presto%E6%9C%80%E9%AB%98');
=> Presto最高
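Presto's url_encode percent-encodes the UTF-8 bytes of each non-ASCII character. As a quick cross-check, the sketch below uses Python's urllib.parse as a stand-in (it is not Presto, and the two do not agree on every punctuation character). It also shows what the same string looks like when the bytes come from Shift_JIS, which gives %8D%C5%8D%82.

```python
from urllib.parse import quote_plus, unquote_plus

text = "Presto最高"

# UTF-8 is the default encoding for quote_plus.
utf8_encoded = quote_plus(text)

# The same string percent-encoded from its Shift_JIS bytes, for comparison.
sjis_encoded = quote_plus(text, encoding="shift_jis")

# Decoding the UTF-8 form gives back the original string.
round_trip = unquote_plus(utf8_encoded)

print(utf8_encoded)
print(sjis_encoded)
print(round_trip)
```

A string encoded from Shift_JIS bytes will not decode back to the original text when the decoder expects UTF-8.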

Regexp Function
select regexp_extract_all('1a 2b 14m', '\d+')
=> [1, 2, 14]
select regexp_extract(
'超低床型自動梱包機 RQ-8LD',
'([a-zA-Z0-9-]+)'
)
=> RQ-8LD
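The digit pattern needs the backslash (\d+); a bare d+ only matches the letter "d". The sketch below reproduces both extractions with Python's re module as a stand-in for Presto's Java-based regular expressions. The two engines differ in details, but behave the same for these simple patterns.

```python
import re

# Stand-in for regexp_extract_all('1a 2b 14m', '\d+'): every run of digits.
numbers = re.findall(r"\d+", "1a 2b 14m")

# Without the backslash the pattern means "one or more letter d".
no_backslash = re.findall(r"d+", "1a 2b 14m")

# Stand-in for regexp_extract(..., '([a-zA-Z0-9-]+)'): first match only.
match = re.search(r"([a-zA-Z0-9-]+)", "超低床型自動梱包機 RQ-8LD")
model = match.group(0) if match else None

print(numbers)
print(no_backslash)
print(model)
```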

Things SQL is not good at
• Normalizing character strings.
• Splitting CSV strings.
• Morphological analysis.

Normalization
select normalize(upper('hoｇｅ'), NFKC)
=> HOGE
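NFKC normalization folds full-width letters into their ASCII counterparts, which is why the mixed-width input becomes plain HOGE. The same steps can be checked with Python's unicodedata module, used here as a stand-in for Presto's normalize function.

```python
import unicodedata

raw = "hoｇｅ"  # the last two letters are full-width

# Upper-casing alone keeps the full-width letters full-width.
uppercased = raw.upper()

# NFKC then maps the full-width letters to ASCII.
normalized = unicodedata.normalize("NFKC", uppercased)

print(uppercased)
print(normalized)
```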

Array type
select
split(keywords, ',') as keywords
from
mysql_keywords_table
Before (keywords is a comma-separated string):
keywords
----------------------------
keyword1,keyword2,keyword3
After (keywords is an array):
keywords
----------------------------
['keyword1','keyword2','keyword3']

horizontal to vertical
SELECT
keyword
FROM
mysql_keywords_table
CROSS JOIN
UNNEST(split(keywords, ',')) AS t (keyword)

horizontal to vertical
Before (one row per comma-separated string):
keywords
----------------------------
keyword1,keyword2,keyword3
keyword4,keyword5
keyword6
keyword1
After (one row per keyword):
keyword
----------------------------
keyword1
keyword2
keyword3
keyword4
keyword5
keyword6
keyword1
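The effect of CROSS JOIN UNNEST(split(...)) is to emit one output row per array element for each input row. A minimal Python sketch of that flattening, using the sample rows from this slide (the variable names are illustrative only):

```python
# Rows of the keywords column, as shown in the "before" table.
rows = [
    "keyword1,keyword2,keyword3",
    "keyword4,keyword5",
    "keyword6",
    "keyword1",
]

# split(keywords, ',') produces an array per row;
# the unnest step emits one output row per array element.
keywords = [keyword for row in rows for keyword in row.split(",")]

for keyword in keywords:
    print(keyword)
```

Duplicates are kept: keyword1 appears twice because it occurs in two input rows.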

Window function
• We use window functions on MySQL data (Presto on MySQL).
• The data source is MySQL, but Presto's functions can be used on it.
• But MySQL's own functions cannot be used.

Rank function on MySQL
select
company_id,
category_id,
count(*),
rank() over (
partition by company_id
order by count(*) desc
)
from
mysql.schema.mysql_table
group by company_id, category_id
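The query counts rows per (company_id, category_id) and then ranks the categories within each company by that count. The Python sketch below spells out the same logic on made-up rows. The function name rank_categories and the sample data are hypothetical. Ties share a rank and the following rank is skipped, which is how rank() differs from dense_rank().

```python
from collections import Counter
from itertools import groupby


def rank_categories(rows):
    """rows: iterable of (company_id, category_id) tuples.

    Returns (company_id, category_id, count, rank) tuples, ranked per
    company by descending count.
    """
    counts = Counter(rows)  # group by company_id, category_id
    ordered = sorted(
        counts.items(),
        key=lambda kv: (kv[0][0], -kv[1], kv[0][1]),
    )
    result = []
    # partition by company_id
    for company_id, group in groupby(ordered, key=lambda kv: kv[0][0]):
        prev_count, rank = None, 0
        for position, ((_, category_id), n) in enumerate(group, start=1):
            if n != prev_count:
                rank = position  # ties keep the earlier rank
                prev_count = n
            result.append((company_id, category_id, n, rank))
    return result


sample = [(1, "a"), (1, "a"), (1, "b"), (1, "b"), (1, "c"), (2, "a")]
ranked = rank_categories(sample)
for row in ranked:
    print(row)
```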

Other window functions
• last_value
• first_value
• dense_rank
• percent_rank

Prestogres
• PostgreSQL protocol gateway for Presto.
• Rewrites queries before sending them through PostgreSQL to Presto.
• Has password-based authentication and SSL.

Why Prestogres?
• Connectivity with other applications.
– pgAdmin, psql command.
– re:dash can connect to Presto over the PostgreSQL protocol.
– But it can also connect to Presto directly.
• Connecting to Presto directly requires a Presto client.
– I don't want to use the Java client.
• Presto's security is weak.
– Authentication is handled by Prestogres.

Prestogres Limitations
• Prepared statements.
– Presto does not support them either.
– So embulk-input-postgresql does not work.
• Can't fetch the schema with SQL.
• Temporary tables
• DROP TABLE

re:dash
• Visualization platform, written in Python.
• Supports many data sources.
• Share queries with team members.
• Schedule queries (daily, hourly).
• Very active contributions.

Presto queries increased rapidly with re:dash
• The number of Presto queries increased more than 10 times.
• That won't change even when writing ETL on re:dash.
• re:dash has a good reputation internally.

Okay, all analytics problems are solved!

Embulk with Presto
• We use embulk-input-presto, a plugin we made ourselves.
– Supports the json type.
• Creates point-in-time data.
• Creates machine learning data.

Why Embulk?
• Very active plugin ecosystem.
• Complicated string analysis cannot be done with SQL alone.
• The combination with Digdag is very powerful.
• We want to take the shortest path.
• Fluentd is overworked..

Install by RPM
• Presto has an RPM.
– It is not distributed.
– It needs a source build..
• Includes an init script.
• But it does not support OpenJDK..
– Pull request pending..

AWS integration
• We build Presto on EC2.
• We do not use EMR.
• Workers are spot instances of multiple instance types.
– Prevents them from all going down at once.

Networking
• Place the Presto cluster (coordinator and workers) in the same AZ.
• Across AZs, traffic is very high (and so is the cost).
– Should not be multi-AZ.

Networking on AWS
[Diagram: two Availability Zones, one coordinator, three workers]

Problems
• Very huge repository.
• The coordinator is a SPOF.
• Running long-range queries causes OutOfMemory errors.

Very huge repository
• Monolithic application.
– I want separate repositories.
• The first build takes 30 minutes.
• Subsequent builds take 10 minutes.
• All connectors are in the main repository.
– MongoDB, Kafka, Cassandra..
– Elasticsearch support is coming soon.
• Hard to contribute.

Big change for JDBC
• Supports predicate pushdown for multiple data types.
• We have been running Presto with a patch applied…
• MySQL people, let's try it.

Impressions of Presto I have heard
• It is an extension of Hadoop technology.
=> I don't know Hadoop. Presto has many connectors.
• Parallel processing looks difficult.
=> Presto has no storage, so that has little impact.
• I do not have such big data.
=> I am not such a big player either.

Summary
• Presto is great software.
• It is not that difficult.
• Let's use it more.
