
20140120 presto meetup_en


Presentation slide at Presto Meetup#1 in Tokyo/Japan

Published in: Technology
  • Presto has implemented metadata-only query optimization since 0.80, so more complex queries might show the performance difference between Hive and Presto better.

  1. 1. Our Presto use case and performance test Hironori Ogibayashi Shin Matsuura
  2. 2. About us ● Hironori Ogibayashi(@angostura11) ● Shin Matsuura ○ IT Infrastructure team in Japanese telecommunications carrier ○ Mainly working on middleware - test, installation, deployment.
  3. 3. Today's Topics ● Presto use case ○ Deployment ○ Use case ○ Challenges ○ Future work ● Performance comparison between Hive+Tez and Presto
  4. 4. Presto use case
  5. 5. Log Collection Flow (diagram: Application, Fluentd, Aggregator, WebHDFS, Hadoop Cluster) ・1,500 Fluentd instances ・25,000 msg/sec ・400 GB/day ・150 types of log
  6. 6. Log Usage ● Systems Infrastructure team ○ Checking trends in server performance ○ Performance analysis of Oracle Database ● Application development team ○ Improving system and business operations.
  7. 7. Application for Oracle DB Performance Analysis - Checks existing and potential problems of an Oracle database for a given system and period. - Uses logs stored in HDFS. Queries were executed on Hive. - However, it took more than one hour to get the result... - (So, we migrated to Presto.)
  8. 8. Why Presto? ● Frequent use of interactive / ad-hoc queries. ● Of course, faster is better.
  9. 9. Presto Deployment (diagram: Application/Client, Presto Coordinator, Hive Metastore, and Hadoop Slaves each running DataNode, TaskTracker, and Presto Worker) ● A dedicated physical machine serves as the Coordinator. ● Workers run on each Hadoop slave. ● Logs in HDFS are periodically converted to RCFiles. ● Presto versions ○ 0.66⇒0.73⇒0.75⇒0.82
  10. 10. Deployment Effect - Elapsed time of a single query: 230 sec ⇒ 7 sec - This is the elapsed time of one of the queries issued by the application. - The query was run on a CDH4 (MRv1) cluster.
  11. 11. Deployment and Operation ● Deployment ○ Easy: just extract the binaries on each server and modify the configuration file. ○ Automated by Ansible + yum. ● What we use in operation ○ Query history ■ Coordinator Web UI ○ Logs ■ /var/presto/data/logs/{server.log,launcher.log} ○ Metrics ■ presto-metrics( metrics)⇒Fluentd⇒Elasticsearch + Kibana ○ sys schema
  12. 12. Challenges ● Worker crash / hang. ○ OutOfMemory. When a worker hangs, we resort to “kill -9”. ○ We modified the memory parameters so that task.shard.max-threads × task.max-memory < -Xmx ● At first, we set node-scheduler.include-coordinator=true. With that setting, the Coordinator crashed under heavy queries. ● SQL differences from HiveQL ○ At first, our application used both Hive and Presto because we were using Presto experimentally. Hence the application had to support both HiveQL and Presto (ANSI SQL). ○ Now, the application no longer uses Hive.
  13. 13. Future work ● Improve the Coordinator’s availability. ● Security ○ Currently, all queries are executed as Presto’s daemon user. ● Resource isolation between Presto and Hadoop daemons.
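The memory rule on the Challenges slide (task.shard.max-threads × task.max-memory < -Xmx) can be checked before a configuration is rolled out. The following is a minimal sketch, not the presenters' tooling: the helper names `parse_size` and `worker_memory_ok` are hypothetical, the size parser only handles simple values such as `1GB` or `16G`, and the numbers in the usage lines are illustrative rather than taken from the slides.

```python
import re

# Binary multipliers for the size suffixes this sketch understands.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a size such as '1GB', '512MB' or '16G' to bytes (hypothetical helper)."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?)B?\s*", text.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def worker_memory_ok(max_threads, task_max_memory, xmx):
    """Apply the rule from the slide: max-threads * max-memory must stay below -Xmx."""
    return max_threads * parse_size(task_max_memory) < parse_size(xmx)


# Illustrative values only (assumptions, not the presenters' settings).
print(worker_memory_ok(8, "1GB", "16G"))   # 8 GB of task memory under a 16 GB heap
print(worker_memory_ok(32, "1GB", "16G"))  # 32 GB of task memory under a 16 GB heap
```

A check like this could run as part of the Ansible rollout mentioned earlier, failing the deployment when the product of the two task settings reaches the JVM heap size.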
  14. 14. Presto VS Hive+Tez
  15. 15. Contents: Presto VS Hive+Tez from a performance perspective (no parameter tuning)
  16. 16. Conclusion: Presto (Win) VS Hive+Tez (Lose)
  17. 17. How fast? Presto VS Hive+Tez: 2.0 to 136 times faster
  18. 18. more details
  19. 19. Testing environment Configurations 2p12c 64GB Mem 36TB Disk NN DN DN DN Hadoop(HDP2.1) Presto(0.82) Coordinator Worker Worker Worker Master : 3nodes Slave : 3nodes NN Metastore
  20. 20. Sample data: 300GB CSV file, 50 columns, 1.1B records
  21. 21. Performance measurement perspectives • Query patterns • Data format patterns • Repetitive Querying
  22. 22. Query patterns
  23. 23. Queries ● Query1: select count(*) from TestTBL ● Query2: select * from TestTBL where col1 = 'XXX' ● Query3: select * from TestTBL where col1 = 'XXX' and col2 = 'YYY' ● Query4: select col1, count(*) from TestTBL group by col1 ● Query5: select col1, count(*) from TestTBL where col2 = 'YYY' group by col1
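To make the five query shapes concrete, here is a small sketch that runs them against a toy table. It uses Python's built-in SQLite purely as a stand-in so the example is runnable anywhere; it is not Presto or Hive, and the four rows and the `col3` column are made up for illustration (the real table had 50 columns and 1.1B records).

```python
import sqlite3

# The five benchmark query patterns, with the slide's placeholder literals.
QUERIES = {
    "Query1": "select count(*) from TestTBL",
    "Query2": "select * from TestTBL where col1 = 'XXX'",
    "Query3": "select * from TestTBL where col1 = 'XXX' and col2 = 'YYY'",
    "Query4": "select col1, count(*) from TestTBL group by col1",
    "Query5": "select col1, count(*) from TestTBL where col2 = 'YYY' group by col1",
}


def make_toy_table():
    """Create an in-memory TestTBL with a few made-up rows (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table TestTBL (col1 text, col2 text, col3 integer)")
    conn.executemany(
        "insert into TestTBL values (?, ?, ?)",
        [("XXX", "YYY", 1), ("XXX", "ZZZ", 2), ("AAA", "YYY", 3), ("AAA", "YYY", 4)],
    )
    return conn


def run_all(conn):
    """Run every query pattern and collect its result rows by name."""
    return {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}


results = run_all(make_toy_table())
for name, rows in results.items():
    print(name, rows)
```

The patterns cover a full-table count, one- and two-column filters, and grouped counts with and without a filter, which is why they exercise scan, filter, and aggregation costs separately.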
  24. 24. data format :Txt Results: Query patterns
  25. 25. Results: Query patterns (data format: Txt) ● 100x faster ● Presto was faster than Hive+Tez in all queries.
  26. 26. Data format patterns
  27. 27. Data formats • Text File (Textfile) • Record Columnar File (RCfile) • Optimized Row Columnar File (ORCfile)
  28. 28. Results: Data format patterns ※Query: Query2
  29. 29. Results: Data format patterns ※Query: Query2 Presto was faster in processing speed than Hive+Tez in all data formats.
  30. 30. Repetitive Querying
  31. 31. Change in processing time with repetitions(Presto) ※Query: Query2 ※Data format: Txt
  32. 32. Change in processing time with repetitions (Presto) ※Query: Query2 ※Data format: Txt Became faster from the second run onward (2.5x faster). Cache?
  33. 33. Change in processing time with repetitions (Hive+Tez) ※Query: Query2 ※Data format: Txt
  34. 34. Change in processing time with repetitions (Hive+Tez) ※Query: Query2 ※Data format: Txt No real change in processing time
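The repetition test above can be approximated with a small timing harness. This is a sketch under assumptions: `run_query` is a hypothetical zero-argument callable that submits one query to whichever engine is under test and blocks until it finishes, and the durations in the usage line are illustrative, not the measured results from the slides.

```python
import statistics
import time


def time_runs(run_query, repeats=5):
    """Call run_query repeatedly and return the wall-clock duration of each run."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return durations


def warm_speedup(durations):
    """Ratio of the first run to the median of all later runs."""
    if len(durations) < 2:
        raise ValueError("need at least two runs to compare")
    return durations[0] / statistics.median(durations[1:])


# Illustrative durations in seconds (assumption, not measured data).
print(warm_speedup([10.0, 4.0, 4.0, 4.0]))
```

A ratio well above 1 means later runs are faster than the first, as the slides report for Presto; a ratio near 1 corresponds to the "no real change" result for Hive+Tez. Using the median of the later runs keeps a single slow outlier from distorting the comparison.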
  35. 35.
  36. 36. Engine:Presto Query × Data format
  37. 37. Engine:Presto Query × Data format Is RCfile the most stable and fastest format?
  38. 38. Summary Result ● Presto was faster than Hive+Tez in all queries. ● Presto was faster than Hive+Tez in all data formats. ● With repetitive querying, Presto became faster. ● With RCfile, Presto was the most stable and fastest. Next ● Benchmark from node scaling and data volume perspectives. ● Benchmark while using the compression functions of ORCfile. ● Benchmark with HDP2.2.
  39. 39. Appendix
  40. 40. Faster from the second run onward under almost all conditions