Hadoop Architecture

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

0 comments

Post a comment

    Post a comment
    Embed Video
    Edit your comment Cancel

    2 Favorites

    Hadoop Architecture - Presentation Transcript

    1. > Hadoop Architecture Philippe Julio – Principal Field Technologist Sun Microsystems France October, 2009
    2. Data Management Vision « The Data are not created relevant, they become so ! » October, 2009 2
    3. Data-Driven on-Line Websites • To run the apps : messages, posts, blog entries, video clips, maps, web graph... • To give the data context : friends networks, social networks, collaborative filtering... • To keep the applications running : web logs, system logs, system metrics, database query logs... October, 2009 3
    4. New Data Management Economics Compute Trend Data Trend New Analytics Emerge Semi-structured Data (MapReduce...) (Mogile, Bigtable, HDFS...) Master/Slave Semi- Object Store structured Architectural shift to the cloud Database and HPC-style workloads ScaleDB, Big Table, SimpleDB HBase Master/Master Open source, general purpose datawarehouse Distributed FS Federated/ Proprietary, dedicated Sharded datawarehouse Unstructured Structured Data Data OLTP is the datawarehouse October, 2009 4
    5. Changing Software Economics 1998 2008 2009 October, 2009 5
    6. What Is Hadoop ? “Flexible and available architecture for large scale computation and data processing on a network of commodity hardware” Open Source + Hardware Commodity = IT Costs Reduction October, 2009 6
    7. Who Used Hadoop ? • Top level Apache Foundation project • Large, active user base, mailing lists, user groups • Very active development, strong development team October, 2009 7
    8. Who Support Hadoop ? • 101tec Inc. Integration, customization, consulting. (Hadoop, Pig, Zookeeper, Lucene, Nutch) • Cloudera, Inc. Get Cloudera's Distribution for Hadoop - it's free, and help you to optimize your configuration. We also provide commercial support and professional training for Hadoop. Basic training is online for free • Cloudify - assist organizations in integrating Cloud Computing into their IT and Business strategies and in building and managing scalable, next-generation infrastructure environments (Hadoop, Solr, AWS, distributed architectures) • Doculibre Inc. Open source and information management consulting. (Lucene, Nutch, Hadoop, Solr, Lius etc.) • ScaleUnlimited, Inc. Training and mentoring on large architectures. Hadoop Bootcamp now available • Tinvention -Ingegneria Informatica - Italian Consulting Company, offer support on open source architecture based on Java, including architectures based on Hadoop. http://wiki.apache.org/hadoop/Support October, 2009 8
    9. How to Interface your Application ? • Implement “Save As Cloud...” to Write to the cloud • Implement “Open From Cloud...” to Read from the cloud • Hadoop 0.20.0 http://hadoop.apache.org/common/docs/r0.20.0/index.html • Commands Guide http://hadoop.apache.org/common/docs/r0.20.0/commands_manual.html • Develop your application with Hadoop API : http://hadoop.apache.org/common/docs/current/api/ October, 2009 9
    10. OpenSolaris Live Hadoop • Avaliable on your Laptop • OpenSolaris • Hadoop Live CD based on 3 zones > Global Zone : NameNode, JobTracker > Local Zone 1 : DataNode, TaskTracker > Local Zone 2 : DataNode, TaskTracker • Hbase • JDK with Java compiler and related tools • http://opensolaris.org/os/project/livehadoop/ October, 2009 10
    11. OpenSolaris for Hadoop Network AutoMagic Automated Install Security D-Light Image Packaging Containers System ZFS DTrace Time Slider CIFS Predictive Self Healing Distribution Solaris Virtualization Constructor Technology Open Storage COMSTAR October, 2009 11
    12. Open Source Hadoop Subprojects Pig Jagl Hql Hive ZooKeeper • Platform for • Query language for • Hibernate Query • Data Warehouse • High-performance analyzing large JavaScript Object Language infrastructure coordination service data sets Notation (JSON) for distributed • Makes mapping • Tools to enable applications • High-level • Jaql designed to objects to be easy data language for flexibly read and persisted to summarization • Exposes common expressing data write data from a underlying services such as analysis variety of data database easier • Adhoc querying naming, programs, stores and formats configuration coupled with • Fully object- • Analysis of large management, infrastructure for • To model a wide oriented, datasets data synchronization, evaluating these spectrum of data, understanding stored in Hadoop and group services programs. ranging from notions like files in a simple interface homogenous flat inheritence, data to polymorphism • Sophisticated • Implement heterogeneous and association analysis consensus, group nested data • No Realtime management, querying leader election, and presence protocols October, 2009 12
    13. Hadoop Architecture Infrastructure as a Services General Purpose Storage Servers • Combine server with disk & networking • Specialized software enables general purpose systems designs to provide high performance data services Sun's Open Platform direction Data moves to the infrastructure Sun Fire x4xxx (Data Compute and Store) Sun Sparc Enterprise T5xxx (Data Compute and Store) Sun Storage 7xxx (Data Store) October, 2009 13
    14. Hadoop Architecture x86 Servers Components Low Cost Server & Storage : Sun Fire x4xxx Interface Sun Virtualization Technologie : Solaris Containers Performance • Hadoop • NFS • HTTP • ... Sun Fire x4540 ● 2 CPU Quad Core AMD Sun Fire x4240 ● Up to 64 GB RAM ● 2 CPU Quad Core AMD 48 x 1TB SATA II 7200 RPM Sun Fire x4140 ● ● Up to 128GB RAM ● 4 RU ● 2 CPU Quad Core AMD ● Up to 4,67 TB Disks ● Up to 128 GB RAM ● 16 x 300 GB SAS 10000 RPM ● Up to 2,34 TB Disks ● 2 RU ● 8 x 300 GB SAS 10000 RPM ● 1 RU Capacity October, 2009 14
    15. Hadoop Architecture CMT Servers Components Low Cost Server : Sun Sparc Enterprise T5xxx Interface Sun Virtualization Technologies : Solaris Containers, LDoms Performance • Hadoop • NFS Sun Enterprise T5240 • HTTP ● 2 CPU Octo Core UST2+ • ... ● Up to 256 GB RAM Sun Enterprise T5440 ● Up to 4,68 TB Disks ● 2 CPU Octo Ccore UST2+ ● 16 x 300 GB SAS 10000 RPM ● Up to 512GB RAM ● 2 RU Sun Blade T6340 ● Up to 1,17 TB Disks ● 2 CPU Octo Core UST2+ ● 4 x 300 GB SAS 10000 RPM ● Up to 256 GB RAM ● 4 RU ● Up to 600 GB Disk ● 2 x 300 GB 10000 RPM ● 10 RU Capacity October, 2009 15
    16. Hadoop Architecture Open Storage Components Low Cost Storage : Sun Storage 7xxx OpenStorage Interfaces Performance • NFS • CIFS • ISCSI • WebDAV • HTTP Sun Storage 7410 ● 16GB, 64GB and 128GB • FTP Sun Storage 7210 RAM options ● 32GB and 64GB RAM options ● Up to 288TB • D-Trace Sun Storage 7110 ● Up to 142TB ● Hybrid Storage Pool I/O Alalytics ● 8GB RAM ● Hybrid Storage Pool I/O Acceleration ● 14 x 300 GB SAS 10K Acceleration ● Read/Write Flash/SSD options RPM Drives ● Write Flash/SSD options for higher performance ● Up to 4, 2TB for higher performance Capacity October, 2009 16
    17. Hadoop Architecture HDFS & Map Reduce Software • Hadoop Distributed File System > A scalable, Fault tolerant, High performance distributed file system capable of running on x4xxx hardware > Hadoop cluster with 3 nodes minimum > Data divided into 64MB or 128MB blocks, each block replicated 3 times > No 15k RPM disks or RAID required > NameNode holds filesystem metadata > Files are broken up and spread over the DataNodes • Hadoop Map Reduce > Software framework for distributed computation > Input | Map() | Copy/Sort | Reduce() | Output > JobTracker schedules and manages jobs > TaskTracker executes individual map() and reduce() tasks on each cluster node October, 2009 17
    18. Hadoop Architecture Hadoop – Hive - HBase • Hadoop • Hive • HBase > Designed to store > Data Warehouse > Distributed and stream infrastructure > Column-Oriented extremely large > Tools to enable easy datasets in batch > Multi-Dimensional data summarization > No realtime > High-Availability > Adhoc querying querying (based on SQL) > High-Performance > Does not support > No realtime querying > Storage-System random access > Realtime querying > Analysis of large datasets data stored > Tables are stored in in Hadoop files regions (row range for > Sophisticated a table) in Hadoop files analysis October, 2009 18
    19. Hadoop Architecture HDFS Architecture DataNode October, 2009 19
    20. Hadoop Architecture MapReduce Architecture TaskTracker DataNode October, 2009 20
    21. Hadoop Architecture Hive Architecture • MetaStore > Table/Partitions properties Hive Users Hive CLI > Thrift API : Current clients in Php (Web Web UI Browsing, Querying, Interface), Python interface to Hive, Java DDL (Query Engine and CLI) > Metadata stored in any SQL backend • Hive Query Language Thrift API Hive QL > Basic SQL : Select, From, Join, Group By MetaStore Parser > Equi-Join, Multi-Table Insert, Multi-Group-By Execution Planner • Dealing with Strutured Data > Primitive Type > ObjectInspector interface for user-defined types Thrift, JSON... > Generic [De]Serialization Interface (SerDe) SerDe > Serialization families implement interface October, 2009 21
    22. Hadoop Architecture HBase Architecture • Master Server > Assigns regions to region servers > Scans ROOT region to find META regions,assigns Output them to region servers > Scans META regions to find user regions, assigns Hadoop HBase them to region servers > Assigns regions for load balancing or if region server Input fails > Tells client where ROOT region is located Intput Intput • Region Server > Main thread is master heart beat loop Output Output > Process long running operations resulting from master heart beat response > Check to see if regions need to be split, regions need compaction, cache needs to be flushed and log needs to be rolled > One thread per client request (leases) HDFS • Client API > Create new HTable object to open a table > Client requests are sent directly to region servers. Master is not involved > Administrative functions to manipulate tables and columns October, 2009 22
    23. Hadoop Architecture Sizing • Application Servers sizing > Number of concurrent low users > Number of concurrent medium users > Number of concurrent high users • Hadoop Servers sizing > Useable Data Volume : Customer data volume (customer need) > Number of replication blocks : 3 (is recommanded) > Block size : 64MB or 128MB > 2 CPU minimum / Node > Work Data Volume : for HDFS, MapReduce, Hive, HBase > Raw Data Volume : (Useable Data Volume * Nb of replication blocks) + Work Data Volume > Number of cluster nodes : Max (Number of replication blocks, (Raw Data Volume / Server capacity storage)) • No RAID factor, No HBA port October, 2009 23
    24. Hadoop Architecture Hadoop Cluster Overview Hadoop Cluster Master Node Slaves Nodes Single Point Of Failure 0 1 2 3 4 (SPOF) Hadoop Cluster with Solaris Cluster (HA) Master Node Slaves Nodes Active Passive Cluster 0 0' Cluster 1 2 3 4 Node Node FsImage + EditLog (Metadata) October, 2009 24
    25. Hadoop Architecture Hadoop High Availability 2 x HD 300 GB 2 x HD 300 GB Active Master Node Passive Master Node + NameNode + NameNode + JobTracker + JobTracker Solaris Cluster FsImage + EditLog Sun StorageTek Array (Metadata) Disks Mirroring RAID1 12 Disks SAS 300 GB October, 2009 25
    26. Hadoop Architecture x86 Reference Architecture Cloud Storage Model Sun Fire x4140 Application Servers ● 2 CPU Quad Core AMD ● 32 GB RAM SF x4140 SF x4140 ● 2 x 300 GB SAS 10000 RPM SF x4540 SF x4540 SF x4540 Sun Fire 4540 ● 2 CPU Quad Core AMD ● 32 GB RAM ● 48 x 1TB SATA II 7200 RPM Master Node Slave Node Slave Node Interfaces • Hadoop OpenSolaris on each server • NFS Hadoop on each cluster node • HTTP October, 2009 26
    27. Hadoop Architecture CMT Reference Architecture Cloud Storage Model Sun Enterprise T5240 Application Servers ● 2 CPU Octo Core UST2+ ● 32 GB RAM SE T5240 SE T5240 ● 2 x 300 GB SAS 10000 RPM SE T5240 SE T5240 SE T5240 Sun Enterprise T5240 ● 2 CPU Oc*to Core UST2+ Master Node Slave Node Slave Node ● 32 GB RAM ● 16 x 300 GB 10000 RPM Interfaces • Hadoop OpenSolaris on each server • NFS Hadoop on each cluster node • HTTP October, 2009 27
    28. Hadoop Architecture Open Storage Reference Architecture Cloud Storage Model Sun Fire x4140 Master Node Slave Node Slave Node ● 2 CPU Quad Core AMD SF x4140 SF x4140 SF x4140 ● 32 GB RAM ● 2 x 300 GB SAS 10000 RPM SS 7110 SS 7110 Sun Storage 7110 OpenStorage ● 8 GB RAM ● 14 x 300 GB SAS 10K RPM Interfaces OpenSolaris on each server • Hadoop Hadoop on each cluster node Performances analysis with D-Trace Analytics on Sun Storage 7xxx • NFS • HTTP • D-Trace Analytics October, 2009 28
    29. Hadoop Architecture Open Storage Reference Architecture SF x4140 Master Node Cloud Storage Model Slave Node OpenStorage Sun Fire x4140 Switch ● 2 CPU Quad Core AMD ● 32 GB RAM ● 4 x 300 GB SAS 10000 Sun Storage 7410 RPM ● 64 GB RAM ● 288 TB Interfaces Hybrid Storage Pool I/O Hadoop ● Acceleration • ● Read/Write Flash/SSD • NFS options for higher performance • HTTP • D-Trace Analytics OpenSolaris on each server Hadoop on each cluster node Performances analysis with D-Trace Analytics on Sun Storage 7xxx October, 2009 29
    30. Hadoop Architecture Configuration Management with Zookeeper • ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services • All servers store a copy of the data • A leader is elected at startup • Followers service clients, all updates go through leader • Update responses are sent when a majority of servers have persisted the change October, 2009 30
    31. Hadoop Architecture Pricing Model - Example • Utility pay-as-you-go pricing model, competitive to market • Based on capacity > Volume of data used > Volume of data transfer > Number of requests • Easy sign-up and self provisioning October, 2009 31
    32. Hadoop Infrastructure Sun Value Proposition Data processing in less • Mean processing time • TECHNOLOGY / METHODOLOGY CMT, AMD and INTEL Processors KEY PERFORMANCE INDICATORS CUSTOMER REFERENCE • Volume of data used than 36 hours • Multi OS : Linux, Solaris, Windows • Volume of data transfer • Open Solaris Objective : Archives Every newspaper • Number of requests from 1851 to 1922. 405,000 very large • Sun Cloud Services (compute, storage) TIFF images, 3.3 million articles in SGML • Number od Hadoop cluster nodes • Open Storage : Sun Storage 7000 Unified Storage and 405,000 xml files -> converted to a • Return On Investment System more web-friendly 810,000 PNG images and 405,000 JavaScript files • Total Cost of Ownership • Sun Fire x4540 (Hybrid Server with 48TB) Solution : Utilizes Amazon Web Services • Infratsructure SwaP ratio • Hadoop : Distributed applications with high density (public cloud) and Hadoop (OpenSolaris) • Use rate of equipment environment of data • Time to deploy a new service • OpenSolaris Configuration kit Live Hadopp Customer Benefit : this client was able to utilize hundreds of machines • Time to Market • HBase : Multi-dimensional database concurrently and process all the data in less than 36 hours. Efficiency • Hive : datawarehoing with Hadoop exemplified. No captial expense for • Zookeeper for configuration management equipment. Ability to deploy in hours not in months • Solaris Containers • VMWare, Microsoft Virtual Server • Sun xVM Infrastructure with Sun xVM Server ( LDom, Xen) and Sun xVM Ops Center • Solaris Cluster • Storage virtualization : Sun Virtual Tape Library, Solaris ZFS • Cloud Computing Workshop SERVICES • • Sun Virtual Desktop Infrastructure Software Sun Cloud Services (compute, storage) • Product Deployment Services • VirtualBox (Client virtualization) • Sun Learning Services • Sun Managed Services • Sun Support Services • Sun Global Financial Services Operation October, 2009 32
    33. Hadoop Architecture Key Links • http://developer.yahoo.com/hadoop • http://wiki.apache.org/hadoop/Support • http://wiki.apache.org/hadoop/FrontPage • http://hadoop.apache.org/core/docs/current/hdfs_design.html • http://opensolaris.org/os/project/livehadoop October, 2009 33
    34. > Philippe Julio philippe.julio@sun.com http://blogs.sun.com/philippejulio

    + Philippe JulioPhilippe Julio, 1 month ago

    custom

    487 views, 2 favs, 0 embeds more stats

    Hadoop, flexible and available architecture for lar more

    More info about this document

    © All Rights Reserved

    Go to text version

    • Total Views 487
      • 487 on SlideShare
      • 0 from embeds
    • Comments 0
    • Favorites 2
    • Downloads 43
    Most viewed embeds

    more

    All embeds

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories