
Treasure Data on The YARN - Hadoop Conference Japan 2014


Published in: Software, Technology

  1. Treasure Data on The YARN • Ryu Kobayashi • Hadoop Conference Japan 2014 • 8 July 2014 • Copyright ©2014 Treasure Data. All Rights Reserved.
  2. Who am I? • Ryu Kobayashi • @ryu_kobayashi • https://github.com/ryukobayashi • Treasure Data, Inc. • Software Engineer • Background • Hadoop, Cassandra, Machine Learning, ... • I developed the Huahin (Hadoop) Framework: http://huahinframework.org/
  3. What is Treasure Data?
  4. Our Service • Data Collection: sources (Web Log, App Log, Sensor, RDBMS, CRM, ERP); open-source log collector (Streaming Upload); Bulk Loader for CSV / TSV, MySQL, Postgres, Oracle, etc. (Bulk Upload, Parallel Upload) • Data Warehouse: schema-less columnar storage + Hadoop MapReduce; SQL (HiveQL) or Pig • Data Analysis: BI tools (Tableau, QlikView, Pentaho, Excel, etc.) via JDBC / ODBC; TD command / Web Console via REST API; result push to external services / storage (custom app, RDBMS, FTP, etc.)
  5. Our Service (same diagram as slide 4)
  6. Our Query Language
  7. Our Service (same diagram as slide 4)
  8. Our System • Hadoop Cluster • PlazmaDB • HDFS is not used
  9. Our System • Hadoop Cluster • PlazmaDB • HDFS is not used • Customized Hadoop • Customized Hive • Customized Pig • Customized Impala • Customized Presto
  10. We have 4 production Hadoop clusters
  11. We have 4 production Hadoop clusters • user1, user4, user5, … • user2, user9, user34, … • user10, user40, user102, … • user50, user88, user1023, …
  12. Our Scheduler and Queue • Queue • Scheduler • Hadoop Cluster • Hadoop Cluster
  13. We have 4 production Hadoop clusters and a Hadoop cluster (YARN) • YARN Cluster
  14. MRv1 and YARN • Queue • Queue • Hadoop Cluster • Hadoop Cluster
  15. Our Service • About 4,700 users • About 6 trillion records • About 12 million jobs • About 40,000 jobs per day
  16. What is YARN?
  17. YARN (Yet Another Resource Negotiator) Architecture
  18. MRv1 • JobTracker • TaskTracker
  19. YARN • ResourceManager • NodeManager • ApplicationMaster • Job History Server
  20. MRv1 • JobTracker • TaskTracker • YARN • ResourceManager • NodeManager • ApplicationMaster • Job History Server (if it is not installed, the job log history cannot be viewed)
  21. Note!!!
  22. Use Hadoop 2.4.0 or later!!!
  23. Versions that must not be used • Apache Hadoop 2.2.0 • Apache Hadoop 2.3.0 • HDP 2.0 (2.2.0-based)
  24. Currently • Apache Hadoop 2.4.1 • CDH 5.0.2 (2.3.0-based, with patches) • HDP 2.1 (2.4.0-based)
  25. Why should they not be used? • Capacity Scheduler • There is a bug • Fair Scheduler • There is a bug
  26. What bugs? • Each scheduler has a bug that causes a deadlock
  27. Distribution • CDH 5.0.2 • Red Hat/CentOS/Oracle 5 • Red Hat/CentOS/Oracle 6 • Ubuntu/Debian • HDP 2.1 • Red Hat/CentOS/SLES (64-bit) • (Ubuntu 12 packages are already in the repository) • Windows Server 2008 & 2012
  28. The configuration files have changed in several places (from MRv1 to YARN) • reference: http://goo.gl/vBIYQP
  29. Deprecated Properties
  30. Other notes on the configuration files • hadoop-conf-pseudo does not work • Some mistakes, e.g. the value of yarn.nodemanager.aux-services: mapreduce.shuffle -> mapreduce_shuffle • 2.2.0 and 2.4.0 • There are some differences
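The aux-services rename on slide 30 can be checked mechanically before a NodeManager is started. The following is a minimal sketch, not part of the original talk. It assumes yarn-site.xml uses the usual Hadoop `<configuration><property><name>/<value>` layout, and that service names may contain only letters, digits, and underscores and may not start with a digit (to my understanding this is the rule Hadoop 2.2.0 and later enforce, which is why `mapreduce.shuffle` is rejected). The helper name `find_invalid_aux_services` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# Assumed naming rule: letters, digits, underscore; must not start with a digit.
VALID_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def find_invalid_aux_services(yarn_site_xml: str) -> List[str]:
    """Return the aux-service names in a yarn-site.xml string that break the rule."""
    root = ET.fromstring(yarn_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == "yarn.nodemanager.aux-services":
            value = prop.findtext("value", "")
            # The property holds a comma-separated list of service names.
            names = [n.strip() for n in value.split(",") if n.strip()]
            return [n for n in names if not VALID_NAME.match(n)]
    return []  # property not set, so nothing to report


if __name__ == "__main__":
    old_style = (
        "<configuration><property>"
        "<name>yarn.nodemanager.aux-services</name>"
        "<value>mapreduce.shuffle</value>"
        "</property></configuration>"
    )
    print(find_invalid_aux_services(old_style))
```

Running it against a configuration copied from an older setup lists the names that need the `.` to `_` change.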
  31. What should we do? • Copy the configuration files from the CDH VM and HDP VM • Use Ambari or Cloudera Manager • Or work hard on it yourself!
  32. Slots have changed (from MRv1 to YARN) • MRv1 • map slot, reduce slot • YARN (MRv2) • resource (container)
  33. MRv1 • mapred-site.xml • mapred.tasktracker.map.tasks.maximum • mapred.tasktracker.reduce.tasks.maximum • scheduler.xml • maxMaps, minMaps • maxReduces, minReduces
  34. YARN (MRv2) • yarn-site.xml • yarn.nodemanager.resource.memory-mb • (yarn.nodemanager.vmem-pmem-ratio) • (yarn.scheduler.minimum-allocation-mb) • mapred-site.xml • yarn.app.mapreduce.am.resource.mb • mapreduce.map.memory.mb • mapreduce.reduce.memory.mb • fair-scheduler.xml • maxResources, minResources
  35. YARN Resource Management • yarn.nodemanager.resource.memory-mb => memory that the NodeManager can hand out to containers • yarn.app.mapreduce.am.resource.mb => memory that the ApplicationMaster uses • mapreduce.map.memory.mb => memory that each map task uses • mapreduce.reduce.memory.mb => memory that each reduce task uses
  36. YARN Resource Example • yarn.nodemanager.resource.memory-mb = 4096 • yarn.app.mapreduce.am.resource.mb = 1024 • mapreduce.map.memory.mb = 1024 • mapreduce.reduce.memory.mb = 2048 • MRv2 Application • ApplicationMaster => 1 • Mapper => 3 • Reducer => 1
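The numbers on slide 36 follow from simple division. The sketch below is an illustration, not the speaker's code, and it rests on one reading of the slide: after the ApplicationMaster takes its container on a single node, the remaining memory holds either 3 map containers or 1 reduce container. The function name `containers_that_fit` is hypothetical, and the calculation ignores rounding to yarn.scheduler.minimum-allocation-mb, virtual cores, and multi-node clusters.

```python
def containers_that_fit(node_mb: int, am_mb: int, container_mb: int) -> int:
    """Containers of size container_mb that fit on one node once the AM is placed."""
    remaining = node_mb - am_mb
    if remaining < 0 or container_mb <= 0:
        return 0
    return remaining // container_mb


# Values from slide 36.
NODE_MB = 4096    # yarn.nodemanager.resource.memory-mb
AM_MB = 1024      # yarn.app.mapreduce.am.resource.mb
MAP_MB = 1024     # mapreduce.map.memory.mb
REDUCE_MB = 2048  # mapreduce.reduce.memory.mb

if __name__ == "__main__":
    print("mappers:", containers_that_fit(NODE_MB, AM_MB, MAP_MB))
    print("reducers:", containers_that_fit(NODE_MB, AM_MB, REDUCE_MB))
```

With these values, 4096 - 1024 = 3072 MB remains, which is three 1024 MB mappers or one 2048 MB reducer.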
  37. YARN Resource Example • In addition to this (e.g. Fair Scheduler): • minResources • maxResources • maxRunningApps • schedulingPolicy
  38. Changes in the Fair Scheduler • In addition to this (e.g. Fair Scheduler): • pool -> queue • user.maxRunningJobs -> user.maxRunningApps • userMaxJobsDefault -> userMaxAppsDefault • etc.
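The renames listed on slide 38 lend themselves to a mechanical first pass over an MRv1 allocation file. This is a rough sketch under stated assumptions: the file is well-formed XML, only the three renames from the slide are applied, and `maxRunningJobs` is renamed wherever it appears (slightly broader than the slide's `user.` case). Slot-based elements such as maxMaps and minMaps have no one-to-one replacement and are left untouched for manual conversion to minResources / maxResources. The names `TAG_RENAMES` and `migrate_allocations` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Renames taken from slide 38 (MRv1 Fair Scheduler -> YARN Fair Scheduler).
TAG_RENAMES = {
    "pool": "queue",
    "maxRunningJobs": "maxRunningApps",
    "userMaxJobsDefault": "userMaxAppsDefault",
}


def migrate_allocations(xml_text: str) -> str:
    """Rename MRv1 element names to their YARN equivalents; leave everything else as is."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = TAG_RENAMES.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    mrv1 = (
        "<allocations>"
        "<pool name='etl'><maxRunningJobs>5</maxRunningJobs></pool>"
        "<user name='alice'><maxRunningJobs>2</maxRunningJobs></user>"
        "<userMaxJobsDefault>3</userMaxJobsDefault>"
        "</allocations>"
    )
    print(migrate_allocations(mrv1))
```

The output still needs review by hand, since the slide's "etc." signals further differences that this mapping does not cover.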
  39. yarn.nodemanager.resource.memory-mb
  40. YARN Scheduler Management
  41. YARN Resource Management • e.g. use the hdp-configuration-utils.py script http://goo.gl/L2hxyq • Use Ambari http://ambari.apache.org/ (Ubuntu 12 is not supported yet; support is coming soon)
  42. YARN Container Executor • DefaultContainerExecutor • Launches containers as processes • Same as the conventional approach (MRv1) • LinuxContainerExecutor • Linux only • Some restrictions • cgroups, etc.
  43. YARN Directory Structure • MRv1 • Initial setup is required • YARN • Initial setup is required • Changed from MRv1 (e.g. /tmp/hadoop-yarn/)
  44. What should we do? • Refer to the HDFS directories of the CDH VM and HDP VM • Use Ambari or Cloudera Manager • Or work hard on it yourself!
  45. Enjoy the YARN!!!
  46. We are hiring!!!
  47. Thanks!!!
