● Systems Infrastructure team
○ Checking trends in server performance
○ Performance analysis of Oracle
● Application development team
○ Improving system and business
Application for Oracle DB Performance Analysis
- Check existing/potential problems of the
Oracle database for a certain system.
- Utilize logs stored in HDFS. Queries
were executed on Hive.
- But, it took more than one hour to
get the result...
- (So, we migrated to Presto.)
● Frequent use of Interactive / ad-hoc
● Of course, faster is better.
● A dedicated physical machine as a
● Workers run on each Hadoop slave.
● Logs in HDFS are periodically
converted to RCfiles.
● Presto versions
Deployment Effect - Elapsed time of a single query
- Elapsed time of one of
the queries issued by the
- Query was run on CDH4
Deployment and Operation
○ Easy: just extract the binaries on each server and modify
○ Automated by Ansible + yum.
● What we use in operation
○ Query history
■ Coordinator Web UI
metrics)⇒Fluentd⇒Elasticsearch + Kibana
○ sys schema
● Worker crash / hang.
○ OutOfMemory. When a worker hangs, we resort to “kill -9”.
○ We modified the memory parameters so that:
task.shard.max-threads × task.max-memory < -Xmx
● At first, we set node-scheduler.include-coordinator=true.
In that configuration, the Coordinator crashed under a heavy query.
● SQL difference from HiveQL
○ At first, our application used both Hive and Presto because we were
using Presto experimentally. Hence the application had to support both
HiveQL and Presto (ANSI SQL).
○ Now, the application no longer uses Hive.
● Improve Coordinator’s availability.
○ Now, all queries are executed as Presto’s daemon user.
● Resource isolation between Presto and Hadoop daemons.
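The memory constraint noted under the worker crash item (task.shard.max-threads × task.max-memory < -Xmx) can be checked before a configuration is rolled out. The following is a minimal sketch, not the team's actual tooling: the helper names `parse_size` and `worker_memory_ok` are hypothetical, and the values in the usage note are illustrative only.

```python
import re

# Binary multipliers for the unit letters accepted by parse_size.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a size such as '1GB', '512MB' or '16G' into bytes (hypothetical helper)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([KMGT]?)B?", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])


def worker_memory_ok(shard_max_threads, task_max_memory, jvm_xmx):
    """Return True when shard_max_threads * task_max_memory is strictly below -Xmx."""
    return shard_max_threads * parse_size(task_max_memory) < parse_size(jvm_xmx)
```

For example, under these assumptions `worker_memory_ok(8, "1GB", "16G")` passes the check, while `worker_memory_ok(16, "1GB", "16G")` does not, because the product equals the heap size rather than staying below it.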
Query1: select count(*) from TestTBL
Query2: select * from TestTBL where col1 = 'XXX'
Query3: select * from TestTBL where col1 = 'XXX' and col2 = 'YYY'
Query4: select col1, count(*) from TestTBL group by col1
Query5: select col1, count(*) from TestTBL where col2 = 'YYY' group by col1
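The slides do not say how the five queries were timed. One way to measure repeated runs is sketched below, assuming a caller-supplied `run_query` function that submits SQL to the engine under test and blocks until the result is complete. The names `benchmark` and `fake_run_query` are hypothetical, and the fake runner exists only so the sketch runs without a cluster.

```python
import time

# The five benchmark queries, keyed by the labels used above.
QUERIES = {
    "Query1": "select count(*) from TestTBL",
    "Query2": "select * from TestTBL where col1 = 'XXX'",
    "Query3": "select * from TestTBL where col1 = 'XXX' and col2 = 'YYY'",
    "Query4": "select col1, count(*) from TestTBL group by col1",
    "Query5": "select col1, count(*) from TestTBL where col2 = 'YYY' group by col1",
}


def benchmark(run_query, queries, repeats=3):
    """Run each query `repeats` times and return elapsed seconds per run."""
    results = {}
    for name, sql in queries.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(sql)  # assumed to block until the full result is returned
            timings.append(time.perf_counter() - start)
        results[name] = timings
    return results


def fake_run_query(sql):
    """Stand-in runner for local testing; replace with a real Presto or Hive client call."""
    return len(sql)
```

Keeping every run's timing, rather than only an average, makes it possible to see whether later repetitions of the same query are faster than the first.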
Query × Data format
Is using RCfile the most stable and fastest
● Presto was faster than Hive+Tez in all queries.
● Presto was faster than Hive+Tez in all data formats.
● With repeated querying, Presto became faster.
● With RCfile, Presto was the most stable and the fastest.
● Benchmark of node scaling and data volume
● Benchmark while using compression functions of
● Benchmark with HDP2.2.