Nov 22, 2011
Sophisticated data instrumentation and collection technologies are leading to unprecedented data growth. Data-driven organizations need to be able to scalably store data and perform complex processing on the collected data (i.e., not just "queries"). Given the unstructured nature of the source data, and the need to stay agile, organizations also need to be able to change their schemas dynamically (at read-time vs write-time). Apache Hadoop is an open-source distributed fault-tolerant system that leverages commodity hardware to achieve large-scale agile data storage and processing. In this presentation, Dr. Amr Awadallah will introduce the design principles behind Apache Hadoop and explain the architecture of its core sub-systems (the Hadoop Distributed File System and MapReduce). Amr will also contrast Hadoop with relational database systems and illustrate how they complement each other. Finally, Amr will cover the Hadoop ecosystem at large, which includes a number of projects that together form a cohesive Data Operating System for the modern data center.
11/16/2011, Stanford EE380 Computer Systems Colloquium
Introducing Apache Hadoop: The Modern Data Operating System
Dr. Amr Awadallah | Founder, CTO, VP of Engineering
aaa@cloudera.com, twitter: @awadallah
Limitations of Existing Data Analytics Architecture
[Diagram layers, bottom to top: Instrumentation; Collection; Storage Only Grid (original raw data), Mostly Append; ETL Compute Grid; RDBMS (aggregated data); BI Reports + Interactive Apps]
• Can’t Explore Original High Fidelity Raw Data
• Moving Data To Compute Doesn’t Scale
• Archiving = Premature Data Death
©2011 Cloudera, Inc. All Rights Reserved.
So What is Apache Hadoop?
• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license).
• Core Hadoop has two main systems:
 – Hadoop Distributed File System: self-healing high-bandwidth clustered storage.
 – MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.
The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
• Schema must be created before any data can be loaded.
• An explicit load operation has to take place which transforms data to DB internal structure.
• New columns must be added explicitly before new data for such columns can be loaded into the database.
• Pros: Read is Fast; Standards/Governance
Schema-on-Read (Hadoop):
• Data is simply copied to the file store, no transformation is needed.
• A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding).
• New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
• Pros: Load is Fast; Flexibility/Agility
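The schema-on-read idea can be sketched in a few lines of Python. This is only an illustration of late binding, not Hive's SerDe interface: the log format, the `RAW_LINES` sample, and the `deserialize_v1` / `deserialize_v2` / `read_table` names are all hypothetical.

```python
# Schema-on-read sketch: raw records are stored untouched, and a
# deserializer (a stand-in for a SerDe) is applied only at read time.
# The record format and every name here are assumptions for illustration.

RAW_LINES = [
    "2011-11-16 alice /home 200",
    "2011-11-16 bob /search 404 mobile",  # a newer record with an extra field
]

def deserialize_v1(line):
    """First schema: extract only date, user, and path."""
    date, user, path = line.split()[:3]
    return {"date": date, "user": user, "path": path}

def deserialize_v2(line):
    """Updated schema: also parse the status code and an optional device field."""
    fields = line.split()
    record = deserialize_v1(line)
    record["status"] = int(fields[3])
    record["device"] = fields[4] if len(fields) > 4 else None
    return record

def read_table(lines, deserializer):
    """Apply the deserializer when the records are read, not when they are stored."""
    return [deserializer(line) for line in lines]

old_view = read_table(RAW_LINES, deserialize_v1)
new_view = read_table(RAW_LINES, deserialize_v2)
```

Swapping `deserialize_v1` for `deserialize_v2` makes the status and device columns appear for records that were stored before the parser changed, without reloading anything.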
Innovation: Explore Original Raw Data
[Images labeled: Data Committee; Data Scientist]
Flexibility: Complex Data Processing
1. Java MapReduce: Most flexibility and performance, but tedious development cycle (the assembly language of Hadoop).
2. Streaming MapReduce (aka Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility than native Java MapReduce.
3. Crunch: A library for multi-stage MapReduce pipelines in Java (modeled after Google’s FlumeJava).
4. Pig Latin: A high-level language out of Yahoo, suitable for batch data flow workloads.
5. Hive: A SQL interpreter out of Facebook, also includes a metastore mapping files to their schemas and associated SerDes.
6. Oozie: A PDL XML workflow engine that enables creating a workflow of jobs composed of any of the above.
Scalability: Scalable Software Development
Grows without requiring developers to re-architect their algorithms/application.
[Diagram label: AUTO SCALE]
Scalability: Data Beats Algorithm
[Images labeled: Smarter Algos; More Data]
A. Halevy et al., “The Unreasonable Effectiveness of Data”, IEEE Intelligent Systems, March 2009
Scalability: Keep All Data Alive Forever
• Archive to Tape and Never See It Again
• Extract Value From All Your Data
Use The Right Tool For The Right Job
Relational Databases: Use when:
• Interactive OLAP Analytics (<1sec)
• Multistep ACID Transactions
• 100% SQL Compliance
Hadoop: Use when:
• Structured or Not (Flexibility)
• Scalability of Storage/Compute
• Complex Data Processing
HDFS: Hadoop Distributed File System
A given file is broken down into blocks (default=64MB), then blocks are replicated across the cluster (default=3).
Optimized for:
• Throughput
• Put/Get/Delete
• Appends
Block Replication for:
• Durability
• Availability
• Throughput
Block Replicas are distributed across servers and racks.
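The block and replica behavior described above can be sketched as follows. The block size and replication factor are the defaults quoted on the slide; the placement rule, the `cluster` layout, and the function names are simplified assumptions for illustration and do not reproduce HDFS's actual placement policy.

```python
# Sketch of HDFS-style block splitting and replica placement.
# BLOCK_SIZE and REPLICATION follow the slide; everything else is hypothetical.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB
REPLICATION = 3

def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold a file (the last block may be partial)."""
    return (file_size_bytes + block_size - 1) // block_size

def place_replicas(block_id, servers_by_rack, replication=REPLICATION):
    """Pick distinct (rack, server) pairs for one block, rotating across racks.

    Assumes at least two racks and enough servers per rack to avoid reuse.
    """
    racks = sorted(servers_by_rack)
    placement = []
    for i in range(replication):
        rack = racks[(block_id + i) % len(racks)]
        servers = servers_by_rack[rack]
        # When a rack is visited again, move on to its next server.
        server = servers[(block_id + i // len(racks)) % len(servers)]
        placement.append((rack, server))
    return placement

cluster = {"rack1": ["s1", "s2"], "rack2": ["s3", "s4"]}
blocks_for_200mb = block_count(200 * 1024 * 1024)
first_block_replicas = place_replicas(0, cluster)
```

With these inputs a 200 MB file occupies four blocks, and each block's three replicas land on three different servers spread over both racks, so losing a single server or a single rack still leaves a readable copy.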
MapReduce: Computational Framework
cat *.txt | mapper.pl | sort | reducer.pl > out.txt
[Diagram: Split 1..N (docid, text), e.g. “To Be Or Not To Be?” → Map 1..M (words, counts) → (sorted words, counts) → Shuffle → Reduce 1..R (sorted words, sum of counts) → Output File 1..R. Example: Be, 5; Be, 12; Be, 7; Be, 6 reduce to Be, 30.]
Map(in_key, in_value) => list of (out_key, intermediate_value)
Reduce(out_key, list of intermediate_values) => out_value(s)
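The Map and Reduce signatures above can be illustrated with a single-process word count. This is a minimal sketch of the programming model only: it runs in memory and leaves out distribution, splits, and fault tolerance, and the function names are assumptions rather than Hadoop APIs.

```python
# Word count as map, shuffle (sort and group), and reduce phases,
# mirroring the shell pipeline: cat | mapper | sort | reducer.
from collections import defaultdict
import re

def map_phase(doc_id, text):
    """Map(in_key, in_value) => list of (out_key, intermediate_value)."""
    return [(word.lower(), 1) for word in re.findall(r"[A-Za-z]+", text)]

def shuffle(pairs):
    """Group intermediate values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(word, counts):
    """Reduce(out_key, list of intermediate_values) => out_value."""
    return word, sum(counts)

documents = {"doc1": "To Be Or Not To Be?"}
intermediate = [pair
                for doc_id, text in documents.items()
                for pair in map_phase(doc_id, text)]
word_counts = [reduce_phase(word, counts) for word, counts in shuffle(intermediate)]
```

Because each call to `map_phase` and `reduce_phase` depends only on its own arguments, the calls can be spread over many machines, which is what the framework does on the programmer's behalf.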
MapReduce: Resource Manager / Scheduler
A given job is broken down into tasks, then tasks are scheduled to be as close to data as possible.
Three levels of data locality:
• Same server as data (local disk)
• Same rack as data (rack/leaf switch)
• Wherever there is a free slot (cross rack)
Optimized for:
• Batch Processing
• Failure Recovery
System detects laggard tasks and speculatively executes parallel tasks on the same slice of data.
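The three locality levels can be expressed as a simple preference order. The sketch below is an assumption-laden illustration, not the JobTracker's scheduling code: the `rack_of` map, the server names, and both functions are hypothetical.

```python
# Sketch of locality-aware task placement using the three levels on the slide.
# All names and the cluster layout are assumptions made for this example.

NODE_LOCAL, RACK_LOCAL, OFF_RACK = 0, 1, 2

def locality(candidate, replica_hosts, rack_of):
    """Classify a candidate server relative to the servers holding the data."""
    if candidate in replica_hosts:
        return NODE_LOCAL
    if rack_of[candidate] in {rack_of[host] for host in replica_hosts}:
        return RACK_LOCAL
    return OFF_RACK

def pick_server(free_servers, replica_hosts, rack_of):
    """Choose the free server closest to the data; ties are broken by name."""
    return min(free_servers,
               key=lambda server: (locality(server, replica_hosts, rack_of), server))

rack_of = {"s1": "rackA", "s2": "rackA", "s3": "rackB", "s4": "rackB", "s5": "rackC"}
replica_hosts = {"s1", "s2", "s3"}  # servers holding the block the task will read

choice_local = pick_server(["s5", "s3"], replica_hosts, rack_of)
choice_rack = pick_server(["s5", "s4"], replica_hosts, rack_of)
choice_any = pick_server(["s5"], replica_hosts, rack_of)
```

A server that already holds the data wins when one is free; otherwise the task falls back to a server on the same rack, and only then to any free slot.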
But Networks Are Faster Than Disks!
Yes, however, core and disk density per server are going up very quickly:
• 1 Hard Disk = 100MB/sec (~1Gbps)
• Server = 12 Hard Disks = 1.2GB/sec (~12Gbps)
• Rack = 20 Servers = 24GB/sec (~240Gbps)
• Avg. Cluster = 6 Racks = 144GB/sec (~1.4Tbps)
• Large Cluster = 200 Racks = 4.8TB/sec (~48Tbps)
• Scanning 4.8TB at 100MB/sec takes 13 hours.
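The aggregate-bandwidth figures above follow from straightforward multiplication, reproduced here as a check. The sketch assumes decimal units (1 GB = 1000 MB); the slide's bit-rate figures appear to round 100 MB/sec (0.8 Gbps) up to about 1 Gbps per disk.

```python
# Reproduces the aggregate disk-bandwidth arithmetic from the slide.
# Decimal units are assumed throughout (1 GB = 1000 MB, 1 TB = 1,000,000 MB).

DISK_MB_PER_SEC = 100
DISKS_PER_SERVER = 12
SERVERS_PER_RACK = 20

def cluster_mb_per_sec(racks):
    """Aggregate sequential disk bandwidth of a cluster with the given rack count."""
    return racks * SERVERS_PER_RACK * DISKS_PER_SERVER * DISK_MB_PER_SEC

server_gb = DISKS_PER_SERVER * DISK_MB_PER_SEC / 1000   # 1.2 GB/sec
rack_gb = cluster_mb_per_sec(1) / 1000                  # 24 GB/sec
avg_cluster_gb = cluster_mb_per_sec(6) / 1000           # 144 GB/sec
large_cluster_tb = cluster_mb_per_sec(200) / 1_000_000  # 4.8 TB/sec

# Scanning 4.8 TB through a single 100 MB/sec disk:
scan_seconds_one_disk = 4_800_000 / DISK_MB_PER_SEC     # 48,000 seconds
scan_hours_one_disk = scan_seconds_one_disk / 3600      # about 13.3 hours
```

The same 4.8 TB that takes roughly 13 hours on one disk can in principle be scanned in about a second when the read is spread over every disk in the 200-rack cluster, which is the argument for moving compute to the data.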
Hadoop High-Level Architecture
• Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs.
• Name Node: Maintains mapping of file names to blocks to data node slaves.
• Job Tracker: Tracks resources and schedules jobs across task tracker slaves.
• Data Node: Stores and serves blocks of data.
• Task Tracker: Runs tasks (work units) within a job.
• Data Node and Task Tracker share a physical node.
Changes for Better Availability/Scalability
• Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs.
• Name Node: Federation partitions out the name space, High Availability via an Active Standby.
• Job Tracker: Each job has its own Application Manager, Resource Manager is decoupled from MR.
• Data Node: Stores and serves blocks of data.
• Task Tracker: Runs tasks (work units) within a job.
• Data Node and Task Tracker share a physical node.
CDH: Cloudera’s Distribution Including Apache Hadoop
• Build/Test: APACHE BIGTOP
• File System Mount: FUSE-DFS
• UI Framework/SDK: HUE
• Data Mining: APACHE MAHOUT
• Workflow: APACHE OOZIE
• Scheduling: APACHE OOZIE
• Metadata: APACHE HIVE
• Languages / Compilers: APACHE PIG, APACHE HIVE
• Data Integration: APACHE FLUME, APACHE SQOOP
• Fast Read/Write Access: APACHE HBASE
• Coordination: APACHE ZOOKEEPER
• SCM Express (Installation Wizard for CDH)
Books
Conclusion
• The Key Benefits of Apache Hadoop:
 – Agility/Flexibility (Quickest Time to Insight).
 – Complex Data Processing (Any Language, Any Problem).
 – Scalability of Storage/Compute (Freedom to Grow).
 – Economical Storage (Keep All Your Data Alive Forever).
• The Key Systems for Apache Hadoop are:
 – Hadoop Distributed File System: self-healing high-bandwidth clustered storage.
 – MapReduce: distributed fault-tolerant resource management coupled with scalable data processing.
Appendix
BACKUP SLIDES
Unstructured Data is Exploding
[Chart series: Complex, Unstructured; Relational]
• 2,500 exabytes of new information in 2012 with Internet as primary driver
• Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year
Source: IDC White Paper - sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009.
Hadoop Creation History
• Fastest sort of a TB, 62 secs over 1,460 nodes
• Sorted a PB in 16.25 hours over 3,658 nodes
Hadoop in the Enterprise Data Stack
[Diagram. Users: Data Scientists, Analysts, Business Users, System Operators, Data Architects, Customers. Tools and interfaces: IDEs / Development Tools; BI, Analytics / Business Intelligence Tools; Enterprise Reporting; ODBC, JDBC, NFS, Native; Cloudera Mgmt Suite; ETL Tools; Enterprise Data Warehouse; Low-Latency Serving Systems; Web Application. Data sources: Logs, Files, Web Data, Relational Databases. Connectors: Flume, Sqoop.]
MapReduce Next Gen
Main idea is to split up the JobTracker functions:
• Cluster resource management (for tracking and allocating nodes)
• Application life-cycle management (for MapReduce scheduling and execution)
Enables:
• High Availability
• Better Scalability
• Efficient Slot Allocation
• Rolling Upgrades
• Non-MapReduce Apps
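Two Core Use Cases Common Across Many Industries
Use cases: ADVANCED ANALYTICS and DATA PROCESSING. Applications by industry:
• Web: Social Network Analysis (advanced analytics); Clickstream Sessionization (data processing)
• Media: Content Optimization (advanced analytics); Clickstream Sessionization (data processing)
• Telco: Network Analytics (advanced analytics); Mediation (data processing)
• Retail: Loyalty & Promotions (advanced analytics); Data Factory (data processing)
• Financial: Fraud Analysis (advanced analytics); Trade Reconciliation (data processing)
• Federal: Entity Analysis (advanced analytics); SIGINT (data processing)
• Bioinformatics: Sequencing Analysis (advanced analytics); Genome Mapping (data processing)
• Manufacturing: Product Quality (advanced analytics); Mfg Process Tracking (data processing)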
What is Cloudera Enterprise?
Cloudera Enterprise makes open source Apache Hadoop enterprise-easy:
• Simplify and Accelerate Hadoop Deployment
• Reduce Adoption Costs and Risks
• Lower the Cost of Administration
• Increase the Transparency & Control of Hadoop
• Leverage the Experience of Our Experts
CLOUDERA ENTERPRISE COMPONENTS
• Cloudera Management Suite: Comprehensive Toolset for Hadoop Administration
• Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs
• EFFECTIVENESS: Ensuring Repeatable Value from Apache Hadoop Deployments
• EFFICIENCY: Enabling Apache Hadoop to be Affordably Run in Production
3 of the top 5 telecommunications, mobile services, defense & intelligence, banking, media and retail organizations depend on Cloudera
Hive vs Pig Latin (count distinct values > 0)
• Hive Syntax:
 SELECT COUNT(DISTINCT col1) FROM mytable WHERE col1 > 0;
• Pig Latin Syntax:
 mytable = LOAD 'myfile' AS (col1, col2, col3);
 mytable = FOREACH mytable GENERATE col1;
 mytable = FILTER mytable BY col1 > 0;
 mytable = DISTINCT mytable;
 mytable = GROUP mytable ALL;
 mytable = FOREACH mytable GENERATE COUNT(mytable);
 DUMP mytable;
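For readers unfamiliar with either language, the same computation can be written as explicit data-flow steps in Python. This is only an illustration of what the query computes; the sample rows are made up and the comments map each step loosely onto the Pig Latin stages.

```python
# "Count distinct values > 0" as explicit data-flow steps.
# The sample rows are invented for the example: (col1, col2, col3).
rows = [(3, "a", "x"), (0, "b", "y"), (3, "c", "z"), (7, "d", "w"), (-1, "e", "v")]

projected = [row[0] for row in rows]                  # FOREACH ... GENERATE col1
filtered = [value for value in projected if value > 0]  # FILTER ... BY col1 > 0
distinct = set(filtered)                              # DISTINCT
count = len(distinct)                                 # GROUP + COUNT
```

The Hive statement declares the result in one line, while the Pig Latin script and this sketch spell out each intermediate relation, which is the contrast the slide draws.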
Apache Hive Key Features
• A subset of SQL covering the most common statements
• JDBC/ODBC support
• Agile data types: Array, Map, Struct, and JSON objects
• Pluggable SerDe system to work on unstructured files directly
• User Defined Functions and Aggregates
• Regular Expression support
• MapReduce support
• Partitions and Buckets (for performance optimization)
• MicroStrategy/Tableau Compatibility (through ODBC)
• In The Works: Indices, Columnar Storage, Views, Explode/Collect
• More details: http://wiki.apache.org/hadoop/Hive
Hive Agile Data Types
• STRUCTS:
 – SELECT mytable.mycolumn.myfield FROM …
• MAPS (Hashes):
 – SELECT mytable.mycolumn[mykey] FROM …
• ARRAYS:
 – SELECT mytable.mycolumn FROM …
• JSON:
 – SELECT get_json_object(mycolumn, objpath) FROM …