Hadoop and subsystems in livedoor #Hcj11f


  1. Hadoop and Subsystems in livedoor / Hadoop Conference Japan 2011 Fall / 2011/09/26 / tagomoris
  2. (no text)
  3. we are hiring!
  4. what's livedoor?
  5. (no text)
  6. large scale web services: 2800+ servers, 3200+ hosts, 530+ web servers
  7. 20 Aug 2009
  8. Aug 2011: 15Gbps (10Gbps + CDN 5Gbps)
  9. Hadoop in livedoor
     • 10 nodes (1+9)
     • 36 cores, 32TB HDFS
     • CDH3b2
     • with libhdfs, fuse-hdfs
     • Hive 0.6.0 (community package)
  10. Hadoop in livedoor: data mining, reporting (page views, unique users, traffic amount per page, ...)
  11. super large scale "sed | grep | wc" with Hadoop Streaming + Hive
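Slide 11's "sed | grep | wc" pipeline maps naturally onto Hadoop Streaming: the mapper plays grep, the reducer plays wc -l. A minimal sketch in Python — the pattern, script name, and layout are illustrative assumptions, not the actual livedoor jobs:

```python
#!/usr/bin/env python
# Sketch of "grep | wc -l" as a Hadoop Streaming job (illustrative only).
# The mapper emits a counter for each input line matching a pattern;
# the reducer sums the counters.
import re
import sys

PATTERN = re.compile(r'GET /blog/')  # hypothetical filter pattern

def mapper(lines):
    # grep: keep only matching lines, emit key\tcount pairs
    for line in lines:
        if PATTERN.search(line):
            yield 'match\t1'

def reducer(lines):
    # wc -l: sum the counts for the single key
    total = 0
    for line in lines:
        _, count = line.rsplit('\t', 1)
        total += int(count)
    yield 'match\t%d' % total

if __name__ == '__main__' and len(sys.argv) > 1:
    func = mapper if sys.argv[1] == 'map' else reducer
    for out in func(sys.stdin):
        print(out)
```

Submitted with something like `hadoop jar hadoop-streaming.jar -input /logs/... -output /out -mapper 'grepwc.py map' -reducer 'grepwc.py reduce'`; Hive covers the cases where the filter is better expressed as SQL.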
  12. httpd logs from 96 servers (apache / nginx), 580GB/day (raw)
  13. overview (diagram: hourly / daily / on-demand flows)
  14. topics
      • log delivery network with scribe and scribeline
      • hive client web application shib
  15. overview (diagram: hourly / daily / on-demand flows)
  16. scribe: log delivery daemon based on Thrift; scalable, reliable; supports HDFS
      https://github.com/facebook/scribe
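For reference, a scribed store definition of the kind this setup implies — a buffer store that forwards entries to an upstream scribed and spools to local disk while the remote is down. Host names and paths are placeholders; the example configs in the scribe repository are the authoritative reference for this format:

```
port=1463

<store>
category=default
type=buffer
retry_interval=30
retry_interval_range=10

<primary>
type=network
remote_host=scribed.example.internal
remote_port=1463
</primary>

<secondary>
type=file
fs_type=std
file_path=/var/spool/scribe
base_filename=default
max_size=3000000
</secondary>
</store>
```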
  17. (diagram: scribe nodes — scribed × 3)
  18. (diagram: delivery node traffic)
  19. (diagram: scribe nodes — scribed × 3)
  20. what we want from a scribe agent
      • easy to deploy
      • works w/o any httpd configurations
      • delivery target failover/takeback
      • lightweight (without JVM)
      • stable
  21. (diagram: scribe nodes with scribed and scribeline)
  22. scribeline: log delivery agent tool
      • python 2.4, thrift
      • easy to set up and start/stop
      • works without any httpd configurations
      • works with logrotate-ed log files
      • automatic delivery target failover/takeback
      https://github.com/tagomoris/scribe_line
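The "automatic delivery target failover/takeback" behavior can be pictured as a small state machine: send to the primary scribed until a delivery fails, fall back to the secondary, then after a retry interval probe the primary and take it back once it answers. A hypothetical Python sketch — not the actual scribe_line code:

```python
import time

class TargetSelector:
    """Sketch of delivery-target failover/takeback (hypothetical; see
    https://github.com/tagomoris/scribe_line for the real implementation)."""

    def __init__(self, primary, secondary, retry_interval=60):
        self.primary = primary
        self.secondary = secondary
        self.retry_interval = retry_interval
        self.using_secondary = False
        self.failed_at = 0.0

    def current(self, now=None, primary_alive=lambda: True):
        now = time.time() if now is None else now
        if self.using_secondary:
            # takeback: after retry_interval, probe the primary again
            if now - self.failed_at >= self.retry_interval and primary_alive():
                self.using_secondary = False
        return self.secondary if self.using_secondary else self.primary

    def mark_failure(self, now=None):
        # failover: a delivery to the primary failed, switch to the secondary
        self.using_secondary = True
        self.failed_at = time.time() if now is None else now
```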
  23. how to set up scribeline in livedoor
      1. yum install scribeline
         (tar xzf && cd && sudo make install)
      2. vi /etc/scribeline.conf
         blog    /var/log/httpd/access_log
         blogimg /var/log/nginx/access_log
      3. /etc/init.d/scribeline start
  24. (diagram: scribe nodes — scribed × 3)
  25. overview (diagram: hourly / daily / on-demand flows)
  26. what we want from a hive client
      • easy to experiment
      • from PCs on our desks
      • result caching
      • protection against data loss
      • friendly look & feel
  27. shib: hive client web application
      • node.js, thrift, kyoto tycoon
      • query history browser
      • query editor, based on copy&paste
      • result caching & download (tsv/csv)
      • filters INSERT/DROP/CREATE ...
      https://github.com/tagomoris/shib
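shib's result caching can be sketched as: hash the normalized query text, and on a repeat submission return the stored rows instead of launching MapReduce again. The key scheme and the dict standing in for Kyoto Tycoon are assumptions for illustration:

```python
import hashlib

class QueryResultCache:
    """Sketch of shib-style result caching (assumption: keyed by a hash of
    the normalized query text; shib itself keeps results in Kyoto Tycoon,
    a plain dict stands in here)."""

    def __init__(self, execute):
        self.execute = execute  # function: query string -> result rows
        self.store = {}

    @staticmethod
    def key(query):
        # collapse whitespace and lowercase so trivially-different
        # copies of the same query share one cache entry
        normalized = ' '.join(query.split()).lower()
        return hashlib.sha1(normalized.encode('utf-8')).hexdigest()

    def run(self, query):
        k = self.key(query)
        if k not in self.store:            # cache miss: run on Hive
            self.store[k] = self.execute(query)
        return self.store[k]               # cache hit: no MapReduce job
```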
  28. (no text)
  29. shib system overview (diagram)
  30. what shib cannot do now
      • access control
      • graphs & charts
      • hive 0.7.0+ features support (database, authentication, and ...)
      • mapreduce status notification
  31. what we are trying now
      • New cluster
        • more nodes
        • CDH3b2 + Hive 0.6.0 -> CDH3u1
      • New tools
        • Hoop (instead of fuse-hdfs)
        • any stream processing framework
  32. thanks!
