3. Our development situation
• We use SQL routinely.
• Marketing staff don't write SQL.
• I often write complicated SQL, around 100 lines long.
• We love OSS.
• We don't run UPDATE, INSERT, or DELETE through Presto.
4. Our business situation
• We manage and operate a B2B web site.
• Our data lifecycle is long.
• The business side doesn't write SQL.
• They watch re:dash and Adobe Analytics.
• Sales have increased for 15 straight years.
11. Join
• Cross-server and cross-database.
• A single Presto query can combine data from multiple sources.
• We use join queries that span multiple sources.
• This reduces ETL pain.
12. Collect data in one place?
• Being able to get the data with one query is equivalent.
• I don't want to keep duplicate data (master data, user data).
• Collecting the data in one place has a high development cost.
13. with mysql_user as (
  select
    user_id,
    user_name
  from
    mysql.schema.users
),
redshift_user_log as (
  select
    user_id,
    log_time
  from
    redshift.schema.pageview
)
-- user_id exists in both CTEs, so it is qualified to avoid an ambiguous column error
select
  mysql_user.user_id,
  mysql_user.user_name,
  count(*)
from mysql_user
inner join redshift_user_log on mysql_user.user_id = redshift_user_log.user_id
group by mysql_user.user_id, mysql_user.user_name
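The shape of this cross-source join can be tried locally without a Presto cluster. The sketch below is only an analogy: it uses Python's built-in sqlite3 module and two attached in-memory databases as stand-ins for the mysql and redshift catalogs. The names mysql_db and redshift_db and all sample rows are made up for illustration.

```python
import sqlite3

# Two attached in-memory databases stand in for the two Presto catalogs.
# (Assumption: SQLite is only an analogy here, not how Presto works.)
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS mysql_db")
conn.execute("ATTACH DATABASE ':memory:' AS redshift_db")

conn.execute("CREATE TABLE mysql_db.users (user_id INTEGER, user_name TEXT)")
conn.execute("CREATE TABLE redshift_db.pageview (user_id INTEGER, log_time TEXT)")

# Made-up sample data.
conn.executemany(
    "INSERT INTO mysql_db.users VALUES (?, ?)",
    [(1, "alice"), (2, "bob")],
)
conn.executemany(
    "INSERT INTO redshift_db.pageview VALUES (?, ?)",
    [(1, "t1"), (1, "t2"), (2, "t3"), (3, "t4")],
)

# One query joins tables that live in different databases.
query = """
WITH mysql_user AS (
  SELECT user_id, user_name FROM mysql_db.users
),
redshift_user_log AS (
  SELECT user_id, log_time FROM redshift_db.pageview
)
SELECT mysql_user.user_id, mysql_user.user_name, count(*) AS pageviews
FROM mysql_user
INNER JOIN redshift_user_log
  ON mysql_user.user_id = redshift_user_log.user_id
GROUP BY mysql_user.user_id, mysql_user.user_name
ORDER BY mysql_user.user_id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The page view for user 3 has no matching user row, so the inner join drops it.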
23. Window function
• We use window functions on MySQL data. (Presto on MySQL)
• The data source is MySQL, but Presto's features can be used on it.
• But MySQL's own functions cannot be used.
24. Rank function on MySQL
select
  company_id,
  category_id,
  count(*),
  rank() over (
    partition by company_id
    order by count(*) desc
  )
from
  mysql.schema.mysql_table
group by company_id, category_id
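To see what this ranking returns, the same query shape can be run against a small local table. This is a sketch under assumptions: it uses Python's built-in sqlite3 module (SQLite 3.25 or later is needed for window functions) instead of Presto, and the rows are made up. The aliases cnt and rnk are added only for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mysql_table (company_id INTEGER, category_id INTEGER)")

# Made-up sample rows: (company_id, category_id).
conn.executemany(
    "INSERT INTO mysql_table VALUES (?, ?)",
    [(1, 10), (1, 10), (1, 10), (1, 20), (1, 30), (1, 30),
     (2, 10), (2, 20), (2, 20)],
)

# Count rows per (company, category), then rank the categories
# inside each company by that count, largest first.
query = """
SELECT
  company_id,
  category_id,
  count(*) AS cnt,
  rank() OVER (
    PARTITION BY company_id
    ORDER BY count(*) DESC
  ) AS rnk
FROM mysql_table
GROUP BY company_id, category_id
"""

# Sort in Python by (company_id, rank, category_id) for a stable printout.
ranked = sorted(conn.execute(query).fetchall(), key=lambda r: (r[0], r[3], r[1]))
for row in ranked:
    print(row)
```

The window function is evaluated after the GROUP BY, which is why count(*) can appear inside the ORDER BY of the window.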
26. Prestogres
• PostgreSQL protocol gateway for Presto.
• Rewrites queries for Presto before sending them to PostgreSQL.
• Has password-based authentication and SSL.
27. Why Prestogres?
• Connectivity with other applications.
– pgAdmin, psql command.
– re:dash connects to Presto over the PostgreSQL protocol.
– But it can also connect to Presto directly.
• Connecting to Presto directly requires a Presto client.
– I don't want to use the Java client.
• Presto's own security is weak.
– Authentication is handled by Prestogres.
28. Prestogres limitations
• Prepared statements.
– Presto doesn't support them either.
– So embulk-input-postgresql doesn't work.
• Can't fetch the schema with SQL.
• Temporary table
• DROP TABLE
29. re:dash
• Visualization platform, written in Python.
• Supports many data sources.
• Share queries with team members.
• Schedule queries (daily, hourly).
• Contributions are very active.
30. Presto queries increased rapidly with re:dash
• The number of Presto queries increased more than 10 times.
• It is no different from writing ETL on re:dash.
• re:dash has a good reputation internally.
33. Embulk with Presto
• We use our own embulk-input-presto plugin.
– Supports the JSON type.
• Create point-in-time data.
• Create machine learning data.
34. Why Embulk?
• Very active plugin ecosystem.
• Complicated string analysis can't be done with SQL alone.
• The combination with digdag is very powerful.
• We want to get there by the shortest route.
• Fluentd is overworked.
36. Install by RPM
• Presto has an RPM.
– It is not distributed.
– It has to be built from source.
• It includes an init script.
• But it doesn't support OpenJDK.
– A pull request is in progress.
37. AWS integration
• We build Presto on EC2.
• We don't use EMR.
• Workers are spot instances of multiple instance types.
– This prevents them from all going down at once.
40. Problems
• Very huge repository.
• The coordinator is a SPOF.
• Running long-range queries causes OutOfMemory errors.
41. Very huge repository
• Monolithic application.
– I want separate repositories.
• First build takes 30 minutes.
• Subsequent builds take 10 minutes.
• All connectors are in the main repository.
– MongoDB, Kafka, Cassandra...
– Elasticsearch will be supported soon.
• Hard to contribute.
42. Big change for JDBC
• Supports predicate pushdown for multiple data types.
• We used a patched Presto…
• MySQL users, give it a try.
43. Impressions of Presto I have heard
• It's an extension of Hadoop technology.
=> I don't know Hadoop. Presto has many connectors.
• Parallel processing looks difficult.
=> Presto has no storage, so this has little impact.
• I don't have such big data.
=> I'm not such a big player either.
44. Summary
• Presto is great software.
• It's not that difficult.
• Let's use it more.