NTT Data - Shinichi Yamada - Hadoop World 2010

Hadoop - Lessons Learned from Deploying Enterprise Clusters.

Shinichi Yamada
EVP & CTO, NTT DATA Corporation

Learn more @ http://www.cloudera.com/hadoop/

Published in: Technology

  1. 1. Hadoop – Lessons Learned from Enterprise Clusters Shinichi Yamada EVP & CTO NTT DATA CORPORATION
  2. 2. Company Overview
      • Name: NTT DATA CORPORATION
      • Headquarters: Tokyo, Japan
      • Revenue: USD 11.4 billion (March 2010; USD 1 = JPY 100)
      • Employees: 34,543 (March 2010)
      • Business areas: broad range of IT services
          • Systems integration
          • IT consulting
          • IT outsourcing
      • History:
          • 1967 - established as a division of NTT
          • 1988 - spun off from NTT and incorporated (May 23, 1988)
          • 1995 - went public (Tokyo Stock Exchange: 9613)
      Copyright © 2010 NTT DATA CORPORATION
  3. 3. Net Sales by Sector: Transition (consolidated; USD million; USD 1 = JPY 100)
      Sector                                        FY2005  FY2006  FY2007  FY2008  FY2009  FY2010 (forecast)
      Public Administration Sector                   3,265   3,442   3,005   2,564   2,327   2,160
      Financial Sector                               2,745   3,245   4,210   4,737   4,942   5,110
      Industrial Sector                              2,382   3,482   3,248   3,774   3,826   3,990
      Others (maintenance and operations, etc.)        679     278     279     313     332     740
      Ratio (consolidated, FY ended March 31, 2011): Public Administration Sector 18%, Financial Sector 43%, Industrial Sector 33%, Others 6%
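The sector ratios on this slide can be cross-checked against the FY2010 forecast figures. The sketch below assumes that the four number series in the chart follow the legend order (Public Administration, Financial, Industrial, Others); the function name `sector_shares` is introduced here for illustration only.

```python
# FY2010 forecast net sales by sector, USD million, as read from the slide.
# Assumption: the four number series in the chart appear in legend order.
fy2010_forecast = {
    "Public Administration": 2160,
    "Financial": 5110,
    "Industrial": 3990,
    "Others": 740,
}


def sector_shares(sales):
    """Return each sector's share of total sales as a whole-number percentage."""
    total = sum(sales.values())
    return {sector: round(100 * amount / total) for sector, amount in sales.items()}


shares = sector_shares(fy2010_forecast)
for sector, pct in shares.items():
    print(f"{sector}: {pct}%")
```

Under that assumption the computed shares come out at 18%, 43%, 33% and 6%, which agrees with the ratios printed on the slide.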
  4. 4. Positioning in NTT Group
      • NTT Group is one of the 50 largest companies in the world*, specializing in IT & telecommunications, with USD 104 billion in revenue.
      • NTT DATA is the IT solutions arm of the NTT Group, specializing in IT solutions and systems integration services.
      • NTT Group regards the IT business as one of its most important domains and emphasizes NTT DATA’s growth as the telecom industry faces commoditization.
      Sales Breakdown of NTT Group:
      • NTT Holdings: USD 104 bil
      • NTT EAST (regional telephone company): USD 20 bil
      • NTT WEST (regional telephone company): USD 18 bil
      • NTT DATA (IT solutions and integration company): USD 11 bil
      • NTT COMMUNICATIONS (network, international telecommunications company): USD 11 bil
      • NTT DOCOMO (mobile / network company): USD 44 bil
      • ・・・
  5. 5. Best Fitting Strategic Partnership
      • NTT DATA is a leading IT service provider and already has over 3 years of experience and production cases on Hadoop
      • Helps enterprise customers design, integrate, deploy and run large clusters in the range of 100 to 1000+ nodes
      • Deep and wide experience introducing Open Source Software technologies for enterprise customers. In data management, 8 years with PostgreSQL, including mission-critical cases
      • Cloudera is the leading provider of Hadoop-based software, services and education, and CDH is the best qualified Hadoop distribution
      • Has a strong relationship with the Hadoop OSS community and aggressively promotes Hadoop’s ecosystem
  6. 6. The Objective of Partnership
      Jointly Promote and Accelerate Hadoop Business in Japan / APAC
  7. 7. Deliverables of Partnership
      • NTT DATA Delivers Cloudera’s Product in Japan
          • Promote CDH and provide support in Japanese and with local staff
          • Promote Cloudera’s training in Japan and provide a knowledge base in Japanese
      • Qualified Professional Services for Hadoop
          • Enhance and extend NTT DATA’s Hadoop professional services by sharing experience and resources with Cloudera’s team
      • Common Development and Feedback from NTT DATA’s Enhancement
          • Utilize open source tools (Heartbeat, Puppet, etc.) to improve reliability and to optimize cluster operation
          • Some enhancements are publicly available via: http://www.meti.go.jp/policy/mono_info_service/joho/downloadfiles/2010software_research/clou_dist_software.pdf (currently in Japanese only)
  8. 8. Construction of the Hadoop Environment
      • Established a fully automated Hadoop environment construction system using OSS
      • Utilizes Puppet and Kickstart (based on commodity functions)
      • Developed scripts to set up a cluster consisting of heterogeneous hardware
      • IP addresses and hostnames are assigned to fulfill operational and maintenance rules. For example, each hostname represents the server’s topological location: its rack and the switch port it is connected to.
      • Install 100 servers: 90 minutes / Update the configuration of 100 servers: 3 minutes
      Detailed Flow of Construction (diagram):
      • Components: DHCP server, TFTP server, HTTP server, DNS server, Puppet server, slave servers
      • Phases: (1) Install OS and packages, (2) Configure servers, (3) Configure applications
      • Operators: wire and power on; no human intervention
      • Steps shown: give IP and stage_1 boot loader; get stage_1 boot loader; get OS installer and config files; get install packages; notify hostname made from topology/location (DHCP server); register name (DNS server); notify machine spec and give config files according to spec (Puppet server)
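The slide says that each hostname encodes the server's rack and the switch port it is connected to, but it does not give the actual naming format. The sketch below therefore uses a hypothetical scheme (`rNN-pNN`) only to illustrate how such a name could be generated at install time and decoded again by operations tools; `make_hostname` and `parse_hostname` are illustrative names, not part of NTT DATA's scripts.

```python
import re

# Hypothetical naming scheme: the slide only states that a hostname encodes the
# rack and the switch port. The "rNN-pNN" format below is an assumption.
HOSTNAME_PATTERN = re.compile(r"^r(?P<rack>\d{2})-p(?P<port>\d{2})$")


def make_hostname(rack: int, port: int) -> str:
    """Build a hostname from a rack number and a switch port number."""
    if not (0 <= rack <= 99 and 0 <= port <= 99):
        raise ValueError("rack and port must be in the range 0-99")
    return f"r{rack:02d}-p{port:02d}"


def parse_hostname(hostname: str) -> tuple:
    """Recover (rack, port) from a hostname produced by make_hostname."""
    match = HOSTNAME_PATTERN.match(hostname)
    if match is None:
        raise ValueError(f"hostname does not follow the scheme: {hostname!r}")
    return int(match.group("rack")), int(match.group("port"))


name = make_hostname(rack=3, port=17)
print(name, parse_hostname(name))
```

Because the name can be decoded without a lookup table, an operator who sees a failing host can find the rack and port directly from the hostname, which appears to be the operational rule the slide is describing.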
  9. 9. Master Server Redundancy of the Hadoop Environment
      • The Heartbeat-DRBD method is already known to the Hadoop community.
          • There is downtime during failover from the active server to the stand-by server.
          • The job must be retried after the failover.
      • The Kemari-DRBD method (experimental)
          • Kemari is fault-tolerance software developed by NTT Laboratories.
          • No downtime and no need to retry the job
          • Kemari synchronizes the state of Dom-U, such as memory
          • A preliminary Kemari prototype was implemented on Xen
          • It is now under development for KVM / QEMU
      Diagram:
      • Active: System Disk, Data Disk (VM Image), OS (Dom-0) with DRBD, Heartbeat, Kemari RA and Kemari process (xc_kemari_save); Xen virtual machine with OS (Dom-U) running NameNode
      • Stand-by: System Disk, Data Disk (VM Image), OS (Dom-0) with DRBD, Heartbeat and Kemari process (xc_kemari_restore); Xen virtual machine with OS (Dom-U) running NameNode
      • Between the two: storage sync; memory sync between virtual machines; monitoring of each node; start of the stand-by machine
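With the Heartbeat-DRBD method, a job that was running during failover has to be retried. A minimal sketch of what such a client-side retry could look like is shown below. `MasterUnavailable`, `run_with_retry` and `make_flaky_submit` are hypothetical stand-ins written for this illustration; they are not Hadoop, Heartbeat or DRBD APIs.

```python
import time


class MasterUnavailable(Exception):
    """Raised by the hypothetical submit function while the master fails over."""


def run_with_retry(submit_job, max_attempts=5, wait_seconds=1.0, sleep=time.sleep):
    """Call submit_job until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return submit_job()
        except MasterUnavailable:
            if attempt == max_attempts:
                raise  # give up: the stand-by never became available
            sleep(wait_seconds)  # wait for the stand-by to take over


def make_flaky_submit(failures):
    """Simulate a master that is unavailable for the first `failures` calls."""
    state = {"remaining": failures}

    def submit():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise MasterUnavailable("master is failing over")
        return "job finished"

    return submit


# The sleep function is replaced so the example runs instantly.
result = run_with_retry(make_flaky_submit(2), sleep=lambda seconds: None)
print(result)
```

The Kemari-DRBD method aims to make this kind of retry logic unnecessary, since the stand-by virtual machine continues from the synchronized state.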
  10. 10. Hadoop in Japan
      Working with Japanese enterprises, we observe two types of opportunities from a system integrator’s viewpoint.
      • Early Adopters, i.e. Web/Internet Service Companies
          • Process various types of phenomenal data daily, and those data are growing steadily
          • Have in-house engineering resources and start Hadoop projects as skunkworks
          • Clusters are typically around 20 to 50 nodes; these days, experienced companies are consolidating scattered clusters
      • Optimistic Attitudes Are Not the Majority
          • Japanese enterprises are sophisticated about emerging technology and have high expectations, but are conservative on deployment
          • They want “Best Practices” from the beginning in every scope: quality, robustness, sustainability, and economy of the platform
          • What are the “Best Practices” in Hadoop?
  11. 11. Lessons Learned from Enterprise Customers
      “Frontiers” expect “Scalability is an Objective”
      • Several enterprises already have an excessive amount of data that is not yet being analyzed effectively and economically, typically in the telecom and telemetry industries
      • Hadoop is the inevitable choice for scalability on their big data, so deployments immediately go beyond 100-node clusters, and system integration on top of the Hadoop cluster becomes the major concern
      • “Best Practice” expects knowledge and experience for:
          - tuned integration with data collectors/sensors, i.e. a custom Hadoop cluster
          - specialized custom analytic applications
          - design for operational economies that reduce management complexity
  12. 12. Lessons Learned from Enterprise Customers
      “Establishment” expects “Scalability is a Requirement”
      • The growing amount of data becomes a burden, typically on large batch jobs that have been processed by mainframes or UNIX enterprise servers
      • They start from small clusters, then need consulting: evaluating a POC, comparing with other technologies, then planning for migration
      • “Best Practice” expects handy deployment (up to 20 nodes) and standard tools that support planning the off-load and migration of existing applications
      • Scalability means elastic deployment from the user’s viewpoint
      • The challenge is application migration, which sometimes requires refactoring data and algorithms. The changes should be minimal but bold
  13. 13. Hadoop in RECRUIT Oct 12, 2010 RECRUIT CO.,LTD. Executive Manager, Osamu YONETANI
  14. 14. Company Information and Data
  15. 15. Company Information and Data
      RECRUIT CO.,LTD
      • Founded: March 31, 1960 (incorporated August 26, 1963)
      • Financial information, Recruit Group:
          • Consolidated Sales: about 9 billion dollars (※1)
          • Consolidated Ordinary Income: about 831 million dollars (※1)
      • Financial information, Recruit Co., Ltd.:
          • Capital: 30 million dollars (since March 1, 1995)
          • Number of Employees: 5,929 (male: 2,659, female: 3,270)
          • Sales: about 3.7 billion dollars (※1)
          • Ordinary Income: about 623 million dollars (※1)
      • Affiliated Companies: 86 (as of March 31, 2010)
      • Web site: http://www.recruit.co.jp/corporate/english/
      (※1) April 1, 2009 - March 31, 2010
  16. 16. Products & Services
      Human Resources
      • When you want to get a job: we provide a large amount of top-quality job information through various media such as information magazines and websites.
      • For clients: we support "Strategic Human Resources Management" from recruitment through evaluation, remuneration, and staff training to placement. In the area of "Human Resources Recruitment," we offer business solutions such as human resource arrangement and effective staffing by outsourcing.
  17. 17. Products & Services
      Coupons
      • Support ladies in their 20s and 30s: we provide a service that is based on each local area and targets mainly women, encouraging them to try different shops and restaurants.
      • For clients: our staff members visit each participating business to gather information and suggest the most effective coupon approach.
  18. 18. Products & Services
      Housing
      • Publication and sales of "SUUMO", "HOUSING", etc.
      • Operation of "SUUMO", "SUUMO mobile", etc.
      Further Education and Learning
      • Publication and sales of "KEIKO TO MANABU", "RECRUIT SHINGAKU BOOK", "COLLEGE MANAGEMENT", etc.
      • Operation of "KEIKO TO MANABU.net", "Career Guidance.net", etc.
  19. 19. Products & Services
      Travel
      • Publication and sales of "JALAN", etc.
      • Operation of "jalan.net", "AB-ROAD", mobile sites, etc.
      Bridal
      • Publication and sales of "ZEXY", "ZEXY INTERIOR", "ZEXY Anhelo", etc.
      • Operation of "ZEXY net", "ZEXY net mobile", etc.
  20. 20. Our division
      MIT = "Marketing and IT" Division: the Information Systems division for the whole company.
      • Cost management: checking project budget spending.
      • Project Solution Group (a.k.a. PMO): reviewing major web site development projects.
      • Infrastructure Solution Group: sharing infrastructure; operates over 1,500 servers. The group exploring new technology is here!
      Org chart: Board, CEO, Job Div., MIT, Car Div., ・・・
  21. 21. Comparison of 4 DWH Middlewares
  22. 22. Comparison of 4 DWH Middlewares
      Needs
      • Prolonged processing time and growing needs for analysis. As access and user actions increase, our data size increases.
      • Evolution of shared-nothing technology. Shared-everything technology tends to be expensive.
      Products verified
      • Proprietary RDBMS (DWH version)
      • Proprietary RDBMS with RAM disk
      • Brand new commercial RDBMS (like PostgreSQL cluster)
      • Hadoop + HIVE
      Legend: I, Hadoop HIVE, O, G
  23. 23. Comparison of 4 DWH Middlewares
      Graph with their features
      [Chart: products I, Hadoop HIVE, O, G. Axes: Offline Perf., Serv. for Ope., Reliability, Scalability, Availability, Serv. for Dev., Ease of Migr., Online Perf., Economy, Flex./Opp.]
  24. 24. Comparison of 4 DWH Middlewares
      Evaluation model
      • Model 1: Short Term Target (for EUC platform). Without changing program code. Focus on Availability and Ease of Migration. Online performance is needed.
      • Model 2: Short / Middle Term Target (for offline processing). Small changes are acceptable. Focus on Reliability. Offline performance is needed.
      • Model 3: Long Term Target (for new needs). Can be built from scratch. Focus on Economy, Scalability and Flexibility. TB- or PB-class data size.
  25. 25. Comparison of 4 DWH Middlewares
      Model 1: Short Term Target (For EUC platform)
      [Chart: score comparison by product on a 0 to 100 scale, next to the points distribution. Bars: points distribution, Greenplum, InfoSphere, Hadoop+HIVE, RailGun (legend: I, Hadoop HIVE, O, G). Total scores shown: 47p, 71p, 26p, 79p. Criteria: Offline Perf., Serv. for Ope., Reliability, Scalability, Availability, Serv. for Dev., Ease of Migr., Online Perf., Economy, Flex./Opp.]
  26. 26. Comparison of 4 DWH Middlewares
      Model 2: Short / Middle Term Target (For offline processing)
      [Chart: score comparison by product on a 0 to 100 scale, next to the points distribution. Bars: points distribution, Greenplum, InfoSphere, Hadoop+HIVE, RailGun (legend: I, Hadoop HIVE, O, G). Total scores shown: 54p, 46p, 62p, 62p. Criteria: Offline Perf., Serv. for Ope., Reliability, Scalability, Availability, Serv. for Dev., Ease of Migr., Online Perf., Economy, Flex./Opp.]
  27. 27. Comparison of 4 DWH Middlewares
      Model 3: Long Term Target (For new needs)
      [Chart: score comparison by product on a 0 to 100 scale, next to the points distribution. Bars: points distribution, Greenplum, InfoSphere, Hadoop+HIVE, RailGun (legend: I, Hadoop HIVE, O, G). Total scores shown: 66p, 53p, 35p, 69p. Criteria: Offline Perf., Serv. for Ope., Reliability, Scalability, Availability, Serv. for Dev., Ease of Migr., Online Perf., Economy, Flex./Opp.]
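The three evaluation models appear to score the same ten criteria under different points distributions, each adding up to 100. A minimal sketch of that kind of weighted scoring follows. The weights and ratings below are invented for illustration and are not the values used in the slides; `total_score` is a hypothetical helper.

```python
def total_score(weights, ratings):
    """Weighted score out of 100.

    weights: points allocated to each criterion, summing to 100.
    ratings: how well a product meets each criterion, from 0.0 to 1.0.
    """
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100 points")
    return sum(weights[name] * ratings[name] for name in weights)


# Hypothetical points distribution that emphasizes availability, ease of
# migration and online performance, in the spirit of Model 1.
weights_model1 = {
    "Offline Perf.": 5, "Serv. for Ope.": 5, "Reliability": 10,
    "Scalability": 5, "Availability": 20, "Serv. for Dev.": 5,
    "Ease of Migr.": 20, "Online Perf.": 20, "Economy": 5, "Flex./Opp.": 5,
}

# Hypothetical product that meets every criterion halfway.
ratings_product_x = {name: 0.5 for name in weights_model1}

print(total_score(weights_model1, ratings_product_x))
```

Changing only the weights while keeping the ratings fixed is what lets the same product rank differently under the short-term and long-term models.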
  28. 28. Next Step
  29. 29. Next Step
      To start Hadoop...
      • With small, but real data.
      • Replace some small part of our system with a Hadoop one.
      • Same output with new distributed architecture logic.
      To take advantage of Hadoop...
      • For software development
          • Get know-how and tips through a small project.
          • To enable applying Hadoop to other projects, share this knowledge with other teams.
      • For operating Hadoop
          • Must have an infrastructure engineer familiar with the Hadoop architecture.
          • To save cost, share infrastructure and engineers across Hadoop projects.
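One way to read "same output with new distributed architecture logic" is as a regression check: run the existing logic and the map/reduce-style logic on the same small, real data set and compare the results. The sketch below emulates the map, shuffle and reduce phases in plain Python rather than on a Hadoop cluster; the function names and the sample `page_views` data are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def legacy_count(records):
    """Existing single-process logic: count occurrences of each record."""
    return dict(Counter(records))


def map_phase(chunk):
    """Emit a (key, 1) pair for every record in one input split."""
    return [(record, 1) for record in chunk]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def mapreduce_count(records, num_chunks=3):
    """Split the input, map each split, then shuffle and reduce."""
    chunks = [records[i::num_chunks] for i in range(num_chunks)]
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(pairs))


# Hypothetical sample data standing in for a small slice of real logs.
page_views = ["jobs", "travel", "jobs", "housing", "travel", "jobs"]
assert legacy_count(page_views) == mapreduce_count(page_views)
print(mapreduce_count(page_views))
```

Keeping the old logic around as a reference makes it possible to migrate one small part of a system at a time while checking that the outputs still match.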
  30. 30. Future Challenges
  31. 31. Future Challenges
      To create value for the relevant business...
      • Improve our business. The power of Hadoop will release the limits of our thinking.
      • Example: through our web pages for clients.
          1. Better suggestions for selling their products, using recommendation logic.
          2. Near-realtime reports for specific markets.
      Contribute to the community.
      • Share our experience in making systems for non-specialist users.
      • Share our library and operation tools (maybe!).
  32. 32. Elephant Ear Cookies contact: hadoop at kits.nttdata.co.jp
  33. 33. Thank you
