Presentation slide at Presto Meetup#1 in Tokyo/Japan

  • Presto has implemented metadata-only query optimization since version 0.80, so more complex queries might give a better picture of the performance difference between Hive and Presto.
20140120 presto meetup_en

  1. 1. Our Presto use case and performance test (Hironori Ogibayashi, Shin Matsuura)
  2. 2. About us
      ● Hironori Ogibayashi (@angostura11)
      ● Shin Matsuura
        ○ IT Infrastructure team at a Japanese telecommunications carrier
        ○ Mainly working on middleware: test, installation, deployment.
  3. 3. Today's Topics
      ● Presto use case
        ○ Deployment
        ○ Use case
        ○ Challenges
        ○ Future work
      ● Performance comparison between Hive+Tez and Presto
  4. 4. Presto use case
  5. 5. Log Collection Flow (diagram labels: Application, Fluentd, Aggregator, WebHDFS, Hadoop Cluster)
      ・1500 Fluentd instances
      ・25,000 msg / sec
      ・400GB / day
      ・150 types of log
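As a rough sanity check on the Log Collection Flow figures, the average message size can be estimated from the daily volume and the message rate. The sketch below assumes that 25,000 msg / sec is a sustained average rather than a peak, and that 400GB means 400 × 10^9 bytes; the slide states neither, and the function name is illustrative only.

```python
# Back-of-the-envelope estimate of the average log message size.
# Assumptions (not stated on the slide): the message rate is a sustained
# average over the whole day, and 1 GB = 10**9 bytes.

SECONDS_PER_DAY = 24 * 60 * 60


def average_message_bytes(bytes_per_day: float, messages_per_second: float) -> float:
    """Return the average size of one message in bytes."""
    messages_per_day = messages_per_second * SECONDS_PER_DAY
    return bytes_per_day / messages_per_day


if __name__ == "__main__":
    size = average_message_bytes(400 * 10**9, 25_000)
    print(f"about {size:.0f} bytes per message")
```

Under those assumptions the average works out to roughly 185 bytes per message. If 25,000 msg / sec is a peak rate, the true average message would be larger.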
  6. 6. Log Usage
      ● Systems Infrastructure team
        ○ Checking trends in server performance
        ○ Performance analysis of Oracle Database
      ● Application development team
        ○ Improving system and business operations.
  7. 7. Application for Oracle DB Performance Analysis
      - Checks for existing or potential problems in an Oracle database, for a given system and period.
      - Uses logs stored in HDFS. Queries were executed on Hive.
      - But it took more than one hour to get the result...
      - (So, we migrated to Presto.)
  8. 8. Why Presto?
      ● Frequent use of interactive / ad-hoc queries.
      ● Of course, faster is better.
  9. 9. Presto Deployment (diagram labels: Application/Client, Presto Coordinator, Hive Metastore, Hadoop Slave, DataNode, TaskTracker, Presto Worker)
      ● A dedicated physical machine serves as the Coordinator.
      ● Workers run on each Hadoop slave.
      ● Logs in HDFS are periodically converted to RCfiles.
      ● Presto versions
        ○ 0.66 ⇒ 0.73 ⇒ 0.75 ⇒ 0.82
  10. 10. Deployment Effect
      - Elapsed time of a single query: 230 sec ⇒ 7 sec
      - Elapsed time of one of the queries issued by the application.
      - Query was run on a CDH4 (MRv1) cluster.
  11. 11. Deployment and Operation
      ● Deployment
        ○ Easy: just extract the binaries on each server and modify the configuration file.
        ○ Automated by Ansible + yum.
      ● What we use in operation
        ○ Query history
          ■ Coordinator Web UI
        ○ Logs
          ■ /var/presto/data/logs/{server.log,launcher.log}
        ○ Metrics
          ■ presto-metrics (metrics) ⇒ Fluentd ⇒ Elasticsearch + Kibana
        ○ sys schema
  12. 12. Challenges
      ● Worker crash / hang.
        ○ OutOfMemory. When a worker hangs, we resort to “kill -9”.
        ○ We modified the memory parameters so that task.shard.max-threads × task.max-memory < -Xmx.
      ● At first, we set node-scheduler.include-coordinator=true, in which case the Coordinator crashed under heavy queries.
      ● SQL differences from HiveQL
        ○ At first, our application used both Hive and Presto because we were using Presto experimentally. Hence the application had to support both HiveQL and Presto (ANSI SQL).
        ○ Now, the application no longer uses Hive.
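The memory rule on the Challenges slide (task.shard.max-threads × task.max-memory < -Xmx) can be checked mechanically before a configuration is rolled out. The following is a minimal sketch, not part of Presto: the helper names `parse_size` and `fits_in_heap` are hypothetical, sizes are assumed to use binary units, and the example values are made up for illustration.

```python
import re

# Multipliers for size prefixes (binary units are an assumption of this sketch).
_MULTIPLIERS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text: str) -> int:
    """Parse a size such as '1GB', '512MB', or the JVM-style '16G' into bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?)B?\s*", text, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, prefix = match.groups()
    return int(float(number) * _MULTIPLIERS[prefix.upper()])


def fits_in_heap(max_threads: int, task_max_memory: str, heap_size: str) -> bool:
    """Return True when max_threads x task_max_memory is strictly below the heap size."""
    return max_threads * parse_size(task_max_memory) < parse_size(heap_size)


if __name__ == "__main__":
    # Hypothetical values: 8 threads x 1GB per task against a 16G heap.
    print(fits_in_heap(8, "1GB", "16G"))
    # 16 threads x 1GB equals the heap exactly, so the strict inequality fails.
    print(fits_in_heap(16, "1GB", "16G"))
```

A check like this could run as part of the Ansible deployment, so that a worker is never started with settings that violate the inequality.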
  13. 13. Future work
      ● Improve the Coordinator’s availability.
      ● Security
        ○ Now, all queries are executed as Presto’s daemon user.
      ● Resource isolation between Presto and Hadoop daemons.
  14. 14. Presto VS Hive+Tez
  15. 15. Contents: Presto VS Hive+Tez from a performance perspective (without tuning any parameters)
  16. 16. Conclusion: Presto VS Hive+Tez (Presto: Win, Hive+Tez: Lose)
  17. 17. How fast? Presto VS Hive+Tez: 2.0 to 136 times faster
  18. 18. more details
  19. 19. Testing environment
      ● Node configuration: 2p12c, 64GB Mem, 36TB Disk
      ● Hadoop (HDP2.1): NN, DN, DN, DN
      ● Presto (0.82): Coordinator, Worker, Worker, Worker
      ● Master: 3 nodes; Slave: 3 nodes
      ● Other diagram labels: NN, Metastore
  20. 20. Sample data
      ● 300GB csv file
      ● 50 columns
      ● 1.1B records
  21. 21. Performance measurement perspectives
      • Query patterns
      • Data format patterns
      • Repetitive Querying
  22. 22. Query patterns
  23. 23. Queries
      Query1: select count(*) from TestTBL
      Query2: select * from TestTBL where col1 = 'XXX'
      Query3: select * from TestTBL where col1 = 'XXX' and col2 = 'YYY'
      Query4: select col1, count(*) from TestTBL group by col1
      Query5: select col1, count(*) from TestTBL where col2 = 'YYY' group by col1
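To make the five query shapes concrete, the sketch below runs them against a tiny in-memory SQLite table. This is only an illustration of what each pattern returns, not the benchmark setup: SQLite stands in for Presto and Hive, the three-column table and its four rows are invented (the real table had 50 columns and 1.1B records), and the helper names are hypothetical. Results are sorted in Python because the queries themselves carry no ORDER BY.

```python
import sqlite3

# The five query shapes from the slide; the placeholder literals are bound as parameters.
QUERIES = {
    "Query1": ("select count(*) from TestTBL", ()),
    "Query2": ("select * from TestTBL where col1 = ?", ("XXX",)),
    "Query3": ("select * from TestTBL where col1 = ? and col2 = ?", ("XXX", "YYY")),
    "Query4": ("select col1, count(*) from TestTBL group by col1", ()),
    "Query5": ("select col1, count(*) from TestTBL where col2 = ? group by col1", ("YYY",)),
}


def make_sample_db() -> sqlite3.Connection:
    """Create an in-memory table with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table TestTBL (col1 text, col2 text, col3 integer)")
    conn.executemany(
        "insert into TestTBL values (?, ?, ?)",
        [("XXX", "YYY", 1), ("XXX", "ZZZ", 2), ("AAA", "YYY", 3), ("AAA", "YYY", 4)],
    )
    conn.commit()
    return conn


def run_all(conn: sqlite3.Connection) -> dict:
    """Run every query and return its rows, sorted for a deterministic order."""
    return {
        name: sorted(conn.execute(sql, params).fetchall())
        for name, (sql, params) in QUERIES.items()
    }


if __name__ == "__main__":
    for name, rows in run_all(make_sample_db()).items():
        print(name, rows)
```

Query1 is a full count, Query2 and Query3 are filtered scans that return whole rows, and Query4 and Query5 add aggregation with and without a filter.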
  24. 24. Results: Query patterns (data format: Txt)
  25. 25. Results: Query patterns (data format: Txt)
      Chart annotation: 100x faster
      Presto processed every query faster than Hive+Tez.
  26. 26. Data format patterns
  27. 27. Data formats
      • Text File (Textfile)
      • Record Columnar File (RCfile)
      • Optimized Row Columnar File (ORCfile)
  28. 28. Results: Data format patterns ※Query: Query2
  29. 29. Results: Data format patterns ※Query: Query2
      Presto was faster than Hive+Tez with every data format.
  30. 30. Repetitive Querying
  31. 31. Change in processing time with repetitions (Presto) ※Query: Query2 ※Data format: Txt
  32. 32. Change in processing time with repetitions (Presto) ※Query: Query2 ※Data format: Txt
      Became faster after the second time (2.5x faster). Caching?
  33. 33. Change in processing time with repetitions (Hive+Tez) ※Query: Query2 ※Data format: Txt
  34. 34. Change in processing time with repetitions (Hive+Tez) ※Query: Query2 ※Data format: Txt
      No real change in processing time
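A repetition measurement like the one on the two preceding slides can be sketched as a small timing harness. The code below is an assumption about how such a test might be structured, not the authors' actual tooling: `run_query` is any callable that submits the query and waits for the result, and `warm_speedup` compares the first run against the mean of the later runs, which is only one possible way to arrive at a figure like "2.5x faster".

```python
import time
from typing import Callable, List


def time_repetitions(run_query: Callable[[], object], repetitions: int) -> List[float]:
    """Run the query `repetitions` times and return each elapsed time in seconds."""
    elapsed = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        elapsed.append(time.perf_counter() - start)
    return elapsed


def warm_speedup(elapsed: List[float]) -> float:
    """Return the ratio of the first (cold) run to the mean of the later (warm) runs."""
    if len(elapsed) < 2:
        raise ValueError("need at least two runs")
    warm = elapsed[1:]
    return elapsed[0] / (sum(warm) / len(warm))


if __name__ == "__main__":
    # Hypothetical stand-in for a real query submission.
    timings = time_repetitions(lambda: sum(range(100_000)), repetitions=5)
    print([f"{t:.4f}" for t in timings])
    print(f"cold/warm ratio: {warm_speedup(timings):.2f}")
```

A ratio near 1 corresponds to the "no real change" behavior reported for Hive+Tez, while a ratio well above 1 would point to caching somewhere in the stack, for example the operating system page cache.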
  36. 36. Engine:Presto Query × Data format
  37. 37. Engine: Presto, Query × Data format
      Is RCfile the most stable and fastest option?
  38. 38. Summary
      Result
      ● Presto was faster than Hive+Tez in all queries.
      ● Presto was faster than Hive+Tez in all data formats.
      ● With repetitive querying, Presto became faster.
      ● Presto was most stable and fastest when using RCfile.
      Next
      ● Benchmark from node scaling and data volume perspectives.
      ● Benchmark while using the compression functions of ORCfile.
      ● Benchmark with HDP2.2.
  39. 39. Appendix
