Treasure Data on The YARN - Hadoop Conference Japan 2014


Published in: Software, Technology


  1. Copyright © 2014 Treasure Data. All Rights Reserved. Treasure Data on The YARN • Ryu Kobayashi • Hadoop Conference Japan 2014 • 8 July 2014
  2. Who am I? • Ryu Kobayashi • @ryu_kobayashi • https://github.com/ryukobayashi • Treasure Data, Inc. • Software Engineer • Background • Hadoop, Cassandra, Machine Learning, ... • I developed the Huahin (Hadoop) Framework: http://huahinframework.org/
  3. What is Treasure Data?
  4. Our Service • Data Collection: sources (Web Log, App Log, Sensor, RDBMS, CRM, ERP); Open-Source Log Collector (Streaming Upload); Bulk Loader for CSV / TSV, MySQL, Postgres, Oracle, etc. (Bulk Upload, Parallel Upload) • Data Warehouse: Columnar Storage + Hadoop MapReduce, schema-less • Data Analysis: BI Tools such as Tableau, QlikView, Pentaho, Excel, etc. (JDBC / ODBC); TD command / Web Console (REST API); SQL (HiveQL) or Pig; External Service / Storage such as Custom App, RDBMS, FTP, etc. (Result push)
  5. Our Service (same diagram as slide 4)
  6. Our Query Language
  7. Our Service (same diagram as slide 4)
  8. Our System • Hadoop Cluster • PlazmaDB • HDFS is not used
  9. Our System • Hadoop Cluster • PlazmaDB • HDFS is not used • Customized Hadoop • Customized Hive • Customized Pig • Customized Impala • Customized Presto
  10. We have 4 production Hadoop clusters
  11. We have 4 production Hadoop clusters • user1, user4, user5, … • user2, user9, user34, … • user10, user40, user102, … • user50, user88, user1023, …
  12. Our Scheduler and Queue • Queue • Scheduler • Hadoop Cluster • Hadoop Cluster
  13. We have 4 production Hadoop clusters and a Hadoop cluster (YARN) • YARN Cluster
  14. MRv1 and YARN • Queue • Queue • Hadoop Cluster • Hadoop Cluster
  15. Our Service • About 4,700 users • About 6 trillion records • About 12 million jobs • About 40,000 jobs per day
  16. What is YARN?
  17. YARN (Yet Another Resource Negotiator) Architecture
  18. MRv1 • JobTracker • TaskTracker
  19. YARN • ResourceManager • NodeManager • ApplicationMaster • Job History Server
  20. MRv1 • JobTracker • TaskTracker • YARN • ResourceManager • NodeManager • ApplicationMaster • Job History Server (we cannot see the job log history if it is not installed)
  21. Note!!!
  22. Use Hadoop 2.4.0 or later!!!
  23. Versions that must not be used • Apache Hadoop 2.2.0 • Apache Hadoop 2.3.0 • HDP 2.0 (2.2.0 based)
  24. Current versions • Apache Hadoop 2.4.1 • CDH 5.0.2 (2.3.0 based, with patches) • HDP 2.1 (2.4.0 based)
  25. Why should they not be used? • Capacity Scheduler • There is a bug • Fair Scheduler • There is a bug
  26. What are the bugs? • Each scheduler can cause a deadlock
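The version guidance on slides 22 to 26 can be turned into a simple pre-deployment check. The following is a minimal sketch, not part of the talk: the function names `parse_version` and `is_recommended` are hypothetical, and the check only looks at Apache release numbers. Vendor builds such as CDH 5.0.2 carry backported patches on an older base version, so a base-version comparison does not apply to them.

```python
import re

# Minimum Apache Hadoop release recommended on the slides.
MIN_RECOMMENDED = (2, 4, 0)


def parse_version(text):
    """Extract (major, minor, patch) from a string such as '2.4.1'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_recommended(version_text):
    """Return True when an Apache Hadoop version is 2.4.0 or later.

    Assumption: the string is a plain Apache release number. Vendor
    distributions with backported fixes need to be judged separately.
    """
    return parse_version(version_text) >= MIN_RECOMMENDED


if __name__ == "__main__":
    for version in ("2.2.0", "2.3.0", "2.4.0", "2.4.1"):
        verdict = "ok" if is_recommended(version) else "avoid"
        print(version, verdict)
```

Comparing tuples of integers rather than raw strings avoids mistakes such as treating "2.10.0" as older than "2.4.0".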
  27. Distribution • CDH 5.0.2 • Red Hat/CentOS/Oracle 5 • Red Hat/CentOS/Oracle 6 • Ubuntu/Debian • HDP 2.1 • Red Hat/CentOS/SLES (64-bit) • (Ubuntu 12 packages are already in the repository) • Windows Server 2008 & 2012
  28. Several configuration files have changed (from MRv1 to YARN) • Reference: http://goo.gl/vBIYQP
  29. Deprecated Properties
  30. Other notes on configuration files • hadoop-conf-pseudo does not work • It contains some mistakes, e.g. yarn.nodemanager.aux-services: mapreduce.shuffle -> mapreduce_shuffle • 2.2.0 and 2.4.0 • There are some differences
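The `mapreduce.shuffle` versus `mapreduce_shuffle` mistake on slide 30 is easy to miss by eye. As a rough sketch (the helper name `find_legacy_aux_services` and the sample XML are illustrative assumptions, not from the talk), a yarn-site.xml can be scanned for the old spelling with Python's standard library:

```python
import xml.etree.ElementTree as ET

AUX_SERVICES_KEY = "yarn.nodemanager.aux-services"
LEGACY_SHUFFLE = "mapreduce.shuffle"    # old spelling
CURRENT_SHUFFLE = "mapreduce_shuffle"   # spelling given on the slide


def find_legacy_aux_services(xml_text):
    """Return aux-service entries that still use the legacy spelling."""
    root = ET.fromstring(xml_text)
    legacy = []
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name != AUX_SERVICES_KEY:
            continue
        # The value may hold a comma-separated list of services.
        for service in prop.findtext("value", default="").split(","):
            if service.strip() == LEGACY_SHUFFLE:
                legacy.append(service.strip())
    return legacy


# Hypothetical yarn-site.xml fragment containing the mistake.
SAMPLE = """<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for entry in find_legacy_aux_services(SAMPLE):
        print(f"replace {entry} with {CURRENT_SHUFFLE}")
```

In practice the XML text would be read from the cluster's own yarn-site.xml instead of the inline sample.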
  31. What should we do? • Copy the configuration files from the CDH VM or HDP VM • Use Ambari or Cloudera Manager • Work it out yourself the hard way!
  32. Slots have changed (from MRv1 to YARN) • MRv1 • map slot, reduce slot • YARN (MRv2) • resource (container)
  33. MRv1 • mapred-site.xml • mapred.tasktracker.map.tasks.maximum • mapred.tasktracker.reduce.tasks.maximum • scheduler.xml • maxMaps, minMaps • maxReduces, minReduces
  34. YARN (MRv2) • yarn-site.xml • yarn.nodemanager.resource.memory-mb • (yarn.nodemanager.vmem-pmem-ratio) • (yarn.scheduler.minimum-allocation-mb) • mapred-site.xml • yarn.app.mapreduce.am.resource.mb • mapreduce.map.memory.mb • mapreduce.reduce.memory.mb • fair-scheduler.xml • maxResources, minResources
  35. YARN Resource Management • yarn.nodemanager.resource.memory-mb => Memory the NodeManager can allocate to containers • yarn.app.mapreduce.am.resource.mb => Memory for the ApplicationMaster container • mapreduce.map.memory.mb => Memory for each map container • mapreduce.reduce.memory.mb => Memory for each reduce container
  36. YARN Resource Example • yarn.nodemanager.resource.memory-mb = 4096 • yarn.app.mapreduce.am.resource.mb = 1024 • mapreduce.map.memory.mb = 1024 • mapreduce.reduce.memory.mb = 2048 • MRv2 Application: ApplicationMaster => 1, Mapper => 3, Reducer => 1
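One way to read the numbers on slide 36 is as simple integer arithmetic on a single node: reserve the ApplicationMaster's memory first, then see how many task containers of each kind fit in the remainder. The sketch below follows that reading. The function name `containers_that_fit` is hypothetical, and the model is deliberately simplified: it ignores vcores and scheduler rounding to `yarn.scheduler.minimum-allocation-mb`, and it assumes the ApplicationMaster runs on the same node.

```python
def containers_that_fit(node_mb, am_mb, task_mb):
    """Count task containers that fit on one node after reserving the AM.

    Simplified model: memory only, single node, no allocation rounding.
    """
    remaining = node_mb - am_mb
    if remaining < 0:
        raise ValueError("ApplicationMaster does not fit on the node")
    return remaining // task_mb


# Values taken from the slide.
NODE_MB = 4096     # yarn.nodemanager.resource.memory-mb
AM_MB = 1024       # yarn.app.mapreduce.am.resource.mb
MAP_MB = 1024      # mapreduce.map.memory.mb
REDUCE_MB = 2048   # mapreduce.reduce.memory.mb

if __name__ == "__main__":
    print("mappers:", containers_that_fit(NODE_MB, AM_MB, MAP_MB))      # 3
    print("reducers:", containers_that_fit(NODE_MB, AM_MB, REDUCE_MB))  # 1
```

With 1024 MB reserved for the ApplicationMaster, 3072 MB remain, which holds three 1024 MB map containers or one 2048 MB reduce container, matching the counts on the slide.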
  37. YARN Resource Example • In addition to this (e.g. Fair Scheduler): minResources • maxResources • maxRunningApps • schedulingPolicy
  38. Fair Scheduler changes • In addition to this (e.g. Fair Scheduler): pool -> queue • user.maxRunningJobs -> user.maxRunningApps • userMaxJobsDefault -> userMaxAppsDefault • etc.
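The Fair Scheduler renames listed on slide 38 are mechanical, so an existing MRv1 allocation file can be given a first pass automatically. This is a minimal sketch under stated assumptions: the function name `rename_fair_scheduler_tags` and the sample allocation file are made up for illustration, and only the three renames named on the slide are applied. Slot-based settings such as minMaps or maxMaps have no mechanical equivalent (YARN uses minResources and maxResources instead), so they are left untouched for manual review.

```python
import xml.etree.ElementTree as ET

# Renames taken from the slide (MRv1 name -> YARN name).
RENAMES = {
    "pool": "queue",
    "maxRunningJobs": "maxRunningApps",
    "userMaxJobsDefault": "userMaxAppsDefault",
}


def rename_fair_scheduler_tags(xml_text):
    """Rename MRv1 Fair Scheduler elements to their YARN names."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Tags not listed in RENAMES are kept as they are.
        element.tag = RENAMES.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


# Hypothetical MRv1 allocation file used only for demonstration.
MRV1_SAMPLE = """<allocations>
  <pool name="batch">
    <maxRunningJobs>5</maxRunningJobs>
  </pool>
  <user name="user1">
    <maxRunningJobs>2</maxRunningJobs>
  </user>
  <userMaxJobsDefault>3</userMaxJobsDefault>
</allocations>"""

if __name__ == "__main__":
    print(rename_fair_scheduler_tags(MRV1_SAMPLE))
```

The output still needs a manual pass for resource limits, since the slide's "etc." indicates the list of changes is not exhaustive.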
  39. yarn.nodemanager.resource.memory-mb
  40. YARN Scheduler Management
  41. YARN Resource Management • e.g. use the hdp-configuration-utils.py script: http://goo.gl/L2hxyq • Use Ambari: http://ambari.apache.org/ (Ubuntu 12 is not supported yet; support is coming soon)
  42. YARN Container Executor • DefaultContainerExecutor • Launches containers as processes • Same as the conventional approach (MRv1) • LinuxContainerExecutor • Linux only • Some restrictions • cgroups, etc.
  43. YARN Directory Structure • MRv1 • The initial directories need to be set up • YARN • The initial directories need to be set up • There are changes from MRv1 (e.g. /tmp/hadoop-yarn/)
  44. What should we do? • Refer to the HDFS directories of the CDH VM or HDP VM • Use Ambari or Cloudera Manager • Work it out yourself the hard way!
  45. Enjoy YARN!!!
  46. We are hiring!!!
  47. Thanks!!!
