Hadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTT DATA

NTT DATA has been providing Hadoop professional services for enterprise customers for years. In this talk we will categorize Hadoop integration cases based on our experience and illustrate archetypal design practices for how Hadoop clusters are deployed into existing infrastructure and services. We will also present enhancement cases motivated by customers’ demands, including GPUs for big math and an HDFS-capable storage system.

Speaker notes
  • Hadoop’s Life in Enterprise Systems. NTT DATA has been providing Hadoop professional services for enterprise customers for years. In this talk we categorize Hadoop integration cases based on our experience and illustrate archetypal design practices for how Hadoop clusters are deployed into existing infrastructure and services. We also present enhancement cases motivated by customers’ demands, including GPUs for big math and an HDFS-capable storage system. Y Masatani, Senior Specialist, NTT DATA. Masatani is a senior specialist in the System Platforms Sector at NTT DATA Corporation. He has more than 15 years of experience in software engineering and Internet services. He has directed the OSS professional services unit since 2006, delivering technical services and developing platform solutions. The team first became acquainted with Hadoop in late 2007 and started operational support services in mid-2008.
  • Who we are; the situation of Hadoop in Japan; our experience: what we have learned and what we have observed in our customers and their clusters. More clusters than we can count on the fingers of both hands and both feet.
  • First we will introduce who we are: USD 11.6 billion in revenue; systems integration, consulting, and outsourcing.
  • Left, middle, right: an all-rounder in the Japanese IT services market.
  • Nov 2009: there was one session from Cloudera. The 2nd conference took 15 months to arrive; the 3rd came earlier, in 7 months, with Cloudera, Hortonworks, and MapR attending from the US. We hope to have a Hadoop World Japan or Asia in the near future.
  • Regarding the popularity and deployment of Hadoop: adoption is not yet widespread or mature, but it has clearly accelerated this year.
  • Let’s look at the landscape first…
  • Let’s look at “data processing domains” and “applicable engines”: the transition of data flows, changes, and processing content. Data warehouse servers; mid-tier servers.
  • Let’s talk about our experience. Parallel processing based on “data locality” is beneficial for large amounts of data and for repetitive sweeps of data, e.g., receipt processing in healthcare/insurance.
  • So, the landscape changes from here to here..
  • Transition of data flows, changes, and processing content, according to our customers’ cases. Data warehouse servers; mid-tier servers.
  • We have been supporting customers for over 3 years, and the oldest clusters are now being renewed and expanded. We called these areas “Frontiers” and “Establishment” last year; after some more reasoning, we now call them “Involvement” and “Expansion”. Here is the story.
  • These groups differ not only in processing domain but also in life-cycle. We haven’t seen huge cases yet. Again: parallel processing based on “data locality” is beneficial for large amounts of data and repetitive sweeps, e.g., receipt processing in healthcare/insurance.
  • Many clusters, or one big cluster? A Hadoop cluster itself has good scalability and expandability.
  • Do we have flexible, useful scalability? According to our customers’ cases… Data warehouse servers; mid-tier servers.
  • Parallel processing based on “data locality” is beneficial for large amounts of data and for repetitive sweeps of data, e.g., receipt processing in healthcare/insurance.
  • The same point applies here; on the RDBMS side, PostgreSQL is the more popular choice in Japan.
  • Advantages: fast; low load on the DB server (WAL and shared buffers can be bypassed); records that raise errors can be skipped while the rest of the data is loaded into the RDB; which records raised errors can be confirmed from the log; even if an error occurs, no garbage is left in the export destination table. Disadvantages: hard to use without DB administrator privileges; each Map task creates and drops a temporary table; pg_bulkload logs are output on the DB server side; pg_bulkload must be installed on all slave nodes.
  • RDBMS serves online-batch processing
  • Transcript

    • 1. Hadoop’s Life in Enterprise Systems. Y Masatani, OSS Professional Services, System Platform Sector, NTT DATA CORPORATION. Hadoop World 2011, Nov 8th
    • 2.
      • Who is NTT DATA
      • Hadoop in Japan
      • Archetype of Enterprise Hadoop
        • Where does Hadoop fit..
        • How Hadoop clusters evolve..
        • How Hadoop is integrated..
      • What enhancements are expected from our customers..
      Agenda
    • 3.
      • NTT DATA CORPORATION
        • Headquarters: Tokyo, Japan
        • Revenue: USD 11.6 billion (March 2011; USD 1 = JPY 100)
        • Employees: 10,173 [non-consolidated] (Dec 2010); 51,433 [consolidated] (Dec 2010)
      • Business Areas: Broad range of IT services
        • Systems integration
        • IT consulting
        • IT outsourcing
      • History:
        • 1967 - established as a division of NTT
        • 1988 - spun off from NTT and incorporated (May 23, 1988)
        • 1995 - went public (Tokyo Stock Exchange: 9613)
      Company Overview
    • 4. Size of IT Services Market by Sectors (FY ended March 31, 2011) [chart: the IT services market in Japan (JPY 9.83 trillion, 2010, moderate case) versus NTT DATA’s consolidated net sales (JPY 1.16 trillion), broken down by customer field: government and healthcare, financial, and enterprise/services, etc. (plus other); NTT DATA’s shares in the respective markets are noted as approx. 15.9%, 6.1%, and 21.3%. Field definitions: government and healthcare covers central government and related agencies, overseas public institutions, local government and community-based business, and healthcare; financial covers banks, financial unions, insurance, security and credit corporations, and settlement services; enterprise, services, etc. covers global IT services; other is sales not included in the above. Source: Gartner, “Forecast: IT Services Japan by Industry, 1Q 2011,” Tsuyoshi Ebina, 20 May 2011; chart created by NTT DATA based on Gartner data.]
    • 5. Positioning in NTT Group
        • NTT Group is the 31st largest company in the world*, specializing in IT & telecommunications, with USD 103 billion in revenue.
        • NTT DATA is the IT solutions arm of the NTT Group, specializing in providing IT solutions and systems integration services.
        • NTT Group regards IT business as one of its most important domains, and emphasizes NTT DATA’s growth as the telecom industry faces commoditization
      * “Fortune Global 500,” July 2010 (USD 1 = JPY 100). [chart: sales breakdown of NTT Group. NTT Holdings, USD 103 billion; NTT EAST, regional telephone company, USD 20 billion; NTT WEST, regional telephone company, USD 18 billion; NTT COMMUNICATIONS, network and international telecommunications company, USD 10 billion; NTT DOCOMO, mobile/network company, USD 42 billion; NTT DATA, IT solutions and integration company, USD 11 billion; Dimension Data, IT communication for enterprises and service providers.]
    • 6.
      • NTT DATA is a leading IT service provider with over 4 years of experience and production cases on Hadoop, and has been a partner of Cloudera since 2010
      • Helps enterprise customers design, integrate, deploy, and run large clusters in the range of 20 to 1,200+ nodes
      • Deep and wide experience introducing Open Source Software technologies to enterprise customers; for data management, 8 years with PostgreSQL
      • Members contributed to the first Hadoop book in Japan
      Hadoop and NTT DATA
    • 7.
      • Development of Hadoop Conference in Japan
      Hadoop is Getting Hot in Japan
    • 8. Popularity of Hadoop, Fall 2011 [chart: survey of attendees’ Hadoop experience: none, under 3 months, 3–6 months, 6–12 months, 1–3 years, 3+ years. ~50% of attendees are still at the research stage; ~30% just started within the past 6 months.]
    • 9.
      • Where does Hadoop fit..
      • How Hadoop clusters evolve..
      • How Hadoop is integrated..
      • What enhancements are expected from our customers..
      Archetype of Enterprise Hadoop
    • 10. Data Processing Domains and Engines
      • Hadoop fits natively for Big Data processing
      [chart: data size (GB–TB–PB) versus latency (sec–min–hour–day); engines: RDBMS, Hadoop, and low-latency serving systems (DWH, search engines, etc.); domains: online processing, online batch processing, enterprise batch processing, big data processing, and query & search processing.]
    • 11.
      • Hadoop fits not only “large amounts of data” but also “large numbers of records”
        • The typical target range is from dozens of gigabytes to a few terabytes
      • “Enterprise Batch Processing”
        • Critical windows must finish within “a tolerable elapsed time”, typically daily or hourly
        • Processes large numbers of records with various data types and flags, which are repeatedly swept while items are transferred and copied (a minimal sketch of this pattern follows the slide)
        • Calculation is relatively simple, but conditional
        • Generates large amounts of temporary intermediate data; a small source data set may create multiple sets of large output records
        • In Japan, ASAKUSA*, a Hadoop batch framework specialized for this domain, has emerged
      Fits “Enterprise Batch Processing” too * http://www.asakusafw.com/
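To make the batch pattern above concrete, here is a minimal plain-MapReduce sketch of a conditional record sweep (not Asakusa, and not code from the talk): it sums amounts per account over CSV-like records with the assumed layout accountId,flag,amount, counting refunds negatively. The record layout, class names, and flag value are illustrative assumptions.

```java
// Minimal sketch of an "enterprise batch" style sweep in plain Hadoop MapReduce.
// Assumed record layout (illustrative): accountId,flag,amount
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BatchSweep {

  public static class SweepMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().split(",");
      if (f.length != 3) return;                  // skip malformed records
      long amount;
      try {
        amount = Long.parseLong(f[2].trim());
      } catch (NumberFormatException e) {
        return;                                   // skip unparsable amounts
      }
      // Simple but conditional calculation: refunds count negatively.
      if ("REFUND".equals(f[1])) amount = -amount;
      ctx.write(new Text(f[0]), new LongWritable(amount));
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : vals) sum += v.get();
      ctx.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "batch-sweep");
    job.setJarByClass(BatchSweep.class);
    job.setMapperClass(SweepMapper.class);
    job.setCombinerClass(SumReducer.class);       // safe: the sum is associative
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```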
    • 12. Data Processing Domains and Engines
      • Hadoop fits natively for Big Data processing
      [chart: the same size/latency map as slide 10, shown again before the revision.]
    • 13. Data Processing Domains and Engines “Revised”
      • Hadoop fits natively for Big Data processing
      • Also fits for “Enterprise Batch processing”
      [chart: the revised size/latency map; Hadoop’s region now covers enterprise batch processing in addition to big data processing, alongside RDBMS and low-latency serving systems (DWH, search engines, etc.).]
    • 14. Customers Fit into Two Areas
      • Enterprise customers’ deployment splits into two areas
      [chart: customer deployments (financial, media, public, telecom) plotted on the size/latency map, clustering in the big data processing and enterprise batch processing areas.]
    • 15. Hadoop Cluster’s Life-Cycle
      • Each area has typical Life-Cycle
      [chart: the same customer map with each area labeled by its life-cycle pattern: “Expansion” for big data processing and “Involvement” for enterprise batch processing.]
    • 16.
      • A cluster designed for Big Data and
        • sized by the amount of data based on a two-to-three-year prospect, assuming sufficient margins
        • However, in almost all cases data grows faster than expected (150–250%)
        • More computation demand comes from deep research and data science
        • Primary and ad-hoc tasks require access to the whole accumulated data
      • “Big Mother and Children”
        • “The Mother” cluster continues to grow
        • “Children” clusters are added in order to isolate experimental and secondary activity
      “Expansion”
    • 17.
      • A cluster designed based on the “economy of scale” effect
        • A small cluster is sufficient for dozens of gigabytes to a few terabytes of data
        • Scalability is limited by the serialized data bandwidth to and from existing systems
      • Small-to-medium “Siblings”
        • A successful Hadoop project tends to involve more projects
        • The burden is managing multiple small-to-medium clusters and multiple tenants, which have different owners and policies
      “Involvement” [diagram: data flows between processing steps pass through Hadoop with a repeated conversion between HDFS and POSIX at each boundary.]
    • 18. Archetype of Integration between Engines [chart: the size/latency map with customer deployments, annotated with three integration archetypes: raw data source input into Hadoop, coherent import and export with the RDBMS, and data reduction toward low-latency serving systems (DWH, search engines, etc.).]
    • 19.
      • Typically there is a staging data store and multiple collectors
        • which keep raw input data for a limited time window, allowing the Hadoop cluster to recover from unexpected downtime and to secure scheduled maintenance
        • The staging capacity is determined based on the worst-case scenario
        • It is also important that the Hadoop cluster has extra processing capacity to crunch through the backlog within the targeted time to recovery (a back-of-the-envelope sizing sketch follows the slide)
      • From the implementation viewpoint, this is custom integration
      Large Raw Data Source Input
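As a back-of-the-envelope illustration of worst-case staging sizing (the formula and all numbers are our own assumptions, not figures from the talk): the staging store must hold everything that arrives during the longest tolerated outage plus the catch-up period, and the catch-up time depends on how much spare processing capacity the cluster has beyond the steady load.

```java
// Back-of-the-envelope staging-capacity sizing; all inputs are illustrative assumptions.
public class StagingSizing {
  public static void main(String[] args) {
    double ingestTBperDay = 2.0;   // steady raw input rate (assumed)
    double outageDays     = 3.0;   // worst-case cluster downtime (assumed)
    double headroom       = 1.3;   // cluster capacity vs. steady load: 30% spare (assumed)

    // Backlog accumulated while the cluster is down.
    double backlogTB = ingestTBperDay * outageDays;

    // After restart, only the spare (headroom - 1.0) fraction of capacity drains
    // the backlog, because 1.0x is consumed keeping up with newly arriving data.
    double catchUpDays = outageDays / (headroom - 1.0);

    // Staging must cover arrivals during the outage plus the catch-up window.
    double stagingTB = ingestTBperDay * (outageDays + catchUpDays);

    System.out.printf("backlog: %.1f TB, catch-up: %.1f days, staging: %.1f TB%n",
        backlogTB, catchUpDays, stagingTB);
  }
}
```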
    • 20.
      • Batch integration with Sqoop (a minimal invocation sketch follows the slide)
        • Sqoop is a good general-purpose tool, and it nicely guarantees the integrity of data exports by introducing an optional staging table within the RDBMS
      • For enterprise systems we expect more, so we enhanced Sqoop especially for PostgreSQL, which is the most popular open source RDBMS for enterprises in Japan.
      • And the enhancement is open source now!
      • https://issues.apache.org/jira/browse/SQOOP-390
      • https://issues.apache.org/jira/browse/SQOOP-387
      Coherent Import and Export with RDBMS
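For orientation, a minimal sketch of driving a Sqoop 1 export with a staging table programmatically; Sqoop is more commonly invoked from the command line with these same flags. The connection string, user, table names, and HDFS path below are hypothetical.

```java
// Minimal sketch: a Sqoop 1 export that stages data before the final move.
// Connection details, table names, and paths are hypothetical.
import org.apache.sqoop.Sqoop;

public class ExportToPostgres {
  public static void main(String[] args) {
    int rc = Sqoop.runTool(new String[] {
        "export",
        "--connect", "jdbc:postgresql://dbhost/warehouse",
        "--username", "etl",
        "--table", "dest",               // destination table in PostgreSQL
        "--staging-table", "dest_stg",   // rows land here first, then move in one step
        "--clear-staging-table",         // empty the staging table before loading
        "--export-dir", "/user/etl/out"  // HDFS directory holding the job output
    });
    System.exit(rc);
  }
}
```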
    • 21.
      • Minimizes the load on the DB server by using “pg_bulkload”, which bypasses the WAL and shared buffers and thereby achieves better performance
      • Skips exceptional records and excludes them into error logs, so the export is not accidentally stopped by errors and leaves no garbage in the tables; postmortem recovery is possible via the logged error records
      Enhanced PostgreSQL connector for Sqoop [diagram comparing two export paths from HDFS to the RDBMS.
      Sqoop (baseline implementation): each map task reads a file split and issues an INSERT for each chunk of records (INSERT INTO stg VALUES (?, ?), (?, ?), ...) into an optional staging table, followed by INSERT INTO dest (SELECT * FROM stg) into the destination table.
      Specialized implementation for PostgreSQL: each map task loads its file split with pg_bulkload into its own temporary table, created as CREATE TABLE tmpN (LIKE dest INCLUDING CONSTRAINTS), excluding error records into a separate file; a reduce task then moves the chunks in one transaction: BEGIN; INSERT INTO dest (SELECT * FROM tmp1); DROP TABLE tmp1; INSERT INTO dest (SELECT * FROM tmp2); DROP TABLE tmp2; INSERT INTO dest (SELECT * FROM tmp3); DROP TABLE tmp3; COMMIT.]
    • 22. Features of the Sqoop PostgreSQL Connector. 1: Robust and efficient direct export using “pg_bulkload”
      • “pg_bulkload” bypasses WAL and shared-buffer access, minimizing the load on DB servers
      • Excludes and skips erroneous data that causes parse errors or is screened out by filter rules
      • Requires administrator privileges on the DB servers
      2: Tuned export using PostgreSQL COPY
      • More efficient than INSERT queries
      • Handier than “pg_bulkload” because it does not require administrator access to PostgreSQL
      3: Import using “ctid” for string-type key values
      • Eliminates possible duplication of data on HDFS when the key’s value type is a string, by splitting on “ctid”
      4: Tuned deletion method for the staging table
      • Uses “TRUNCATE” instead of “DELETE”
      5: Balanced import using statistical information (sketched after this slide)
      • When the key value is a numeric type, utilizes PostgreSQL’s statistics in order to allocate records equally across import map tasks
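As an illustration of the idea behind feature 5 (a sketch of the approach, not the connector’s actual code; the table and column names are hypothetical): PostgreSQL’s planner statistics in the pg_stats view expose histogram bounds that approximate equal-frequency buckets, which can serve as split boundaries for import map tasks.

```java
// Sketch: derive roughly equal-sized import splits from PostgreSQL statistics.
// Not the connector's actual code; table/column names and credentials are hypothetical.
// Requires the PostgreSQL JDBC driver on the classpath.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class BalancedSplits {
  public static void main(String[] args) throws Exception {
    try (Connection c = DriverManager.getConnection(
             "jdbc:postgresql://dbhost/warehouse", "etl", "secret")) {
      // pg_stats.histogram_bounds holds boundaries of (approximately)
      // equal-frequency buckets maintained by ANALYZE.
      PreparedStatement ps = c.prepareStatement(
          "SELECT histogram_bounds FROM pg_stats "
        + "WHERE schemaname = 'public' AND tablename = ? AND attname = ?");
      ps.setString(1, "src");
      ps.setString(2, "id");
      try (ResultSet rs = ps.executeQuery()) {
        if (rs.next() && rs.getString(1) != null) {
          // Returned as an array literal like {1,105,220,...}; each adjacent
          // pair gives one map task a WHERE id >= lo AND id < hi range.
          String[] bounds = rs.getString(1).replaceAll("[{}]", "").split(",");
          for (int i = 0; i + 1 < bounds.length; i++) {
            System.out.printf("split %d: id >= %s AND id < %s%n",
                i, bounds[i], bounds[i + 1]);
          }
        }
      }
    }
  }
}
```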
    • 23.
      • Typically it requires reduction of data
        • Mahout is a set of good general-purpose tools and is gradually coming into use, but some of its tools are not efficient enough
        • Currently there is a cluster for web data mining that deploys over a few hundred servers only for data reduction; there is a need to reduce the number of servers
      • To that end, a hybrid of Hadoop and GPGPU is being prototyped based on tools developed by NTT R&D Laboratories *
      Integration with Low-Latency Serving System * http://www.ntt.co.jp/RD/OFIS/index_en.html
    • 24.
      • Hadoop handles pre-processing, which requires I/O throughput against raw data
      • GPGPU handles encoded numeric data, which requires huge iterative calculation (a minimal k-means sketch follows the slide)
      Prototype of Hadoop and GPGPU Integration [pipeline diagram: raw data source → Flume collectors (with a Flume master/ZooKeeper) → Hadoop master and slaves for data collection, feature data extraction, and feature data compression → GPU server for k-means clustering → clustering result. Input data (query logs): 30,000 unique users/hour, on the order of TB. Feature data: 1,000–100,000 rows × 10,000–100,000 columns, on the order of 10 GB. Compressed feature data: 1,000–100,000 rows × 100–1,000 columns, on the order of GB.]
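For reference, the iterative kernel being offloaded is standard Lloyd’s k-means; a minimal CPU sketch follows (data sizes and values are placeholders, not the prototype’s code). The O(n·k·d) assignment loop over the dense feature matrix is what the GPU accelerates.

```java
import java.util.Arrays;
import java.util.Random;

// Minimal Lloyd's k-means sketch; the GPU prototype accelerates this same
// assign/update loop over the compressed feature matrix. Sizes are placeholders.
public class KMeansSketch {
  public static void main(String[] args) {
    int n = 10000, d = 100, k = 8, iters = 20;
    Random rnd = new Random(42);
    double[][] x = new double[n][d];
    for (double[] row : x) for (int j = 0; j < d; j++) row[j] = rnd.nextDouble();

    // Initialize centroids from randomly chosen points.
    double[][] c = new double[k][];
    for (int i = 0; i < k; i++) c[i] = x[rnd.nextInt(n)].clone();

    int[] assign = new int[n];
    int[] cnt = new int[k];
    for (int it = 0; it < iters; it++) {
      // Assignment step: nearest centroid per point (the O(n*k*d) hot loop).
      for (int i = 0; i < n; i++) {
        double best = Double.MAX_VALUE;
        for (int j = 0; j < k; j++) {
          double dist = 0;
          for (int t = 0; t < d; t++) {
            double diff = x[i][t] - c[j][t];
            dist += diff * diff;
          }
          if (dist < best) { best = dist; assign[i] = j; }
        }
      }
      // Update step: recompute each centroid as the mean of its cluster.
      double[][] sum = new double[k][d];
      Arrays.fill(cnt, 0);
      for (int i = 0; i < n; i++) {
        cnt[assign[i]]++;
        for (int t = 0; t < d; t++) sum[assign[i]][t] += x[i][t];
      }
      for (int j = 0; j < k; j++)
        if (cnt[j] > 0)
          for (int t = 0; t < d; t++) c[j][t] = sum[j][t] / cnt[j];
    }
    System.out.println("cluster sizes: " + Arrays.toString(cnt));
  }
}
```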
    • 25. Breakdown of Elapsed Time for K-means [chart: elapsed-time breakdown comparing a 24-core, 3-node configuration with a 256-core, 1-node configuration.]
    • 26.
      • HDFS API enabled shared storage system
        • Best mix of open technology (HDFS APIs) and established storage technology
      Beyond Connector Integration
      • Efficient data transfer and sharing between Hadoop and enterprise systems without time-consuming conversion between HDFS and POSIX
      • Consistent operation across enterprise systems, i.e., similar established policies apply for backup/recovery, access control (ACLs), etc.
      • Utilizes existing system resources, tools, and operational procedures
      • The storage layer meets Hadoop’s scalability requirements in the context of parallel data processing
      • Hadoop applications run without changes (see the sketch after this slide)
      Common development [partner logos]. [diagram: a Hadoop cluster for enterprise batch processing and existing systems (System A, System B) share one storage system through HDFS APIs and POSIX, with common backup and recovery.]
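The reason Hadoop applications run without changes is that they reach storage only through the org.apache.hadoop.fs.FileSystem abstraction, which Hadoop resolves by URI scheme, so a storage product can register its own implementation for its scheme. A minimal sketch of the mechanism (the sharedfs scheme and vendor class are hypothetical; the example resolves file:// only to show the dispatch):

```java
// Sketch: Hadoop selects a FileSystem implementation by URI scheme, so an
// HDFS-API-enabled storage product can be dropped in without application changes.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemePluggability {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // A vendor storage product would register its FileSystem implementation
    // for its scheme; the class and scheme below are hypothetical:
    // conf.set("fs.sharedfs.impl", "com.example.sharedfs.SharedFileSystem");

    // Application code is identical for any scheme: hdfs://, file://, or a
    // vendor scheme. Here we resolve the local FS just to show the dispatch.
    FileSystem fs = FileSystem.get(URI.create("file:///"), conf);
    Path p = new Path("/tmp/demo.txt");
    System.out.println("scheme " + fs.getUri() + " -> " + fs.getClass().getName());
    System.out.println("exists? " + fs.exists(p));
  }
}
```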
    • 27. Enhanced Storage Architecture (Copyright 2011 FUJITSU LIMITED). Established storage management technology (memory caching and disk I/O scheduling) and an enhanced dedicated network enable boosted HDFS performance. [diagram: compute nodes (CPU, memory, local FS) connected over a meshed network (40 Gb bandwidth) to a storage system whose file system supports HDFS APIs; the enhanced bandwidth between nodes and storage extracts disk I/O bandwidth comparable to data locality. Pros: achieves read 5x and write 10x performance versus local-disk HDFS, based on a financial enterprise batch benchmark case. Cons: limited scalability (up to 40–50 nodes in the prototype configuration, to be extended to ~120).]
    • 28.
      • HDFS API enabled shared storage system
      Beyond Connector Integration
      • Efficient data transfer and sharing between Hadoop and enterprise systems without time-consuming conversion between HDFS and POSIX
      • Consistent operation across enterprise systems, i.e., similar established policies apply for backup/recovery, access control (ACLs), etc.
      Common development [partner logos]. [diagram: the shared storage can eliminate the overhead of repeated conversion from HDFS to POSIX around Hadoop.]
    • 29. Hadoop with Enterprise Market
      • NTT DATA intensively promotes Open Source Software in the enterprise market; we do the same with Hadoop and contribute to its community
      • OSS and open architecture are the essence of sustainable systems, so we seek to leverage commodity hardware and open platforms
    • 30. Thank you contact: hadoop at kits.nttdata.co.jp