Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
 

Cloudera VP Engineering, Amr Awadallah explains "Apache Hadoop in the Enterprise."


Presentation Transcript

  • Apache Hadoop in the Enterprise
    Cloudera, Inc.
    Amr Awadallah, Founder, CTO, VP of Engineering
    aaa@cloudera.com, twitter: @awadallah
    Microstrategy World – January 2011 – Las Vegas
  • Unstructured Data Explosion
    (Chart: complex, unstructured data vs. relational data)
    • 2,500 exabytes of new information in 2012, with the Internet as the primary driver
    • The digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year
    Source: IDC White Paper, sponsored by EMC: “As the Economy Contracts, the Digital Universe Expands,” May 2009.
    Copyright © 2011, Cloudera, Inc. All Rights Reserved.
  • Dramatic Changes in Enterprise Data Needs
    • Data Explosion
      • Any Type of Data
      • From Many Sources
      • Instrument Everything
    • Hard Problems
      • Complex Analysis
      • At Lowest Granularity
      • Data Beats Algorithm
  • What is Hadoop?
    • A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
    • Core Hadoop has two main components
      • Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage
      • MapReduce: fault-tolerant distributed processing
    • Key business values
      • Flexible -> Store any data, run any analysis (Mine First, Govern Later)
      • Affordable -> Cost per TB at a fraction of traditional options
      • Broadly adopted -> A large and active ecosystem
      • Proven at scale -> Several petabyte deployments in production today
      • Open Source -> No Lock-In, low cost, large developer community
  • Cloudera’s Data Operating System (CDH)
    (Diagram components: Hue, Hue SDK, Oozie, Hive, Pig, Avro, Flume, Sqoop, HBase, Zookeeper)
    • Open Source – 100% Apache licensed
    • Simplified – Component versions & dependencies managed for you
    • Reliable – Predictable release schedules, patched with fixes to improve stability
    • Many Form Factors – Public Cloud, Private Cloud, Ubuntu, RHEL, 32 or 64bit, etc.
    • Integrated – All components & functions interoperate through standard APIs
    • Supported – Founders, committers, contributors across all projects
  • Benefit #1: Agility
    • Schema-on-Write (RDBMS):
      • Schema must be created before data is loaded
      • Explicit load operation has to take place which transforms data to database internal structure
      • New columns must be added explicitly before data for such columns can be loaded into the database
      • Read is Fast
      • Benefits: Standards/Governance
    • Schema-on-Read (Hadoop):
      • Data is simply copied to the file store, no special transformation is needed
      • A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns
      • New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse them
      • Load is Fast
      • Benefits: Evolving Schemas/Agility
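The schema-on-read idea on this slide can be illustrated with a small sketch. This is not Hive's SerDe API: `read_with_schema`, `RAW_LINES`, and the JSON record layout are hypothetical names and data chosen only for illustration. The point is that raw lines are stored untouched and the column list is applied only when reading.

```python
import json

# Raw records are kept exactly as they arrived; "loading" is a plain copy.
# (Hypothetical sample data, not from the presentation.)
RAW_LINES = [
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob", "clicks": 5, "country": "US"}',  # a new field starts flowing
]


def read_with_schema(lines, columns):
    """Hypothetical stand-in for a SerDe: parse each raw line at read time
    and project only the requested columns (missing fields become None)."""
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(col) for col in columns)


# The original reader only knows about two columns.
v1 = list(read_with_schema(RAW_LINES, ["user", "clicks"]))

# Updating the reader exposes "country" retroactively, without reloading data.
v2 = list(read_with_schema(RAW_LINES, ["user", "clicks", "country"]))
```

In a schema-on-write system the equivalent change would need an explicit column addition before the new field could be loaded at all.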
  • Benefit #2: Data Consolidation
    • Complex Data: Documents, SharePoint, Web feeds, Sensor data, System logs, EMB archives, Online forums, Images/Video
    • Structured Data (“relational”): CRM, Inventory, Financials, Sales records, Logistics, HR records, Data Marts, Web Profiles
    A single data system to enable processing across the universe of data types.
  • Benefit #3: Any Programming Language (Not Only SQL)
    1. Java MapReduce: Gives the most flexibility and performance, but potentially a long development cycle (the “assembly language” of Hadoop).
    2. Streaming MapReduce (also Pipes): Allows you to develop in any programming language of your choice, but with slightly lower performance and less flexibility.
    3. Cascading: A thin Java library that sits on top of MapReduce; it lets developers assemble complex processes.
    4. Pig: A high-level language out of Yahoo, suitable for batch data flow workloads.
    5. Hive: A SQL interpreter out of Facebook; also includes a metastore mapping files to their schemas and associated SerDe.
    6. Oozie: A PDL XML workflow server engine that enables creating a workflow of jobs composed of any of the above.
  • Benefit #4: Balancing Return on Investment (or Byte!)
    • Return on Byte (ROB) = value to be extracted from that byte divided by the cost of storing that byte
    • If ROB is < 1 then the data will be buried in the tape wasteland, thus we need more economical active storage.
    (Diagram: High ROB vs. Low ROB)
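As a worked example of the Return on Byte ratio defined on this slide, the sketch below compares the same data on two storage tiers. All dollar figures and the name `return_on_byte` are hypothetical, chosen only to show how lowering the storage cost moves ROB from below 1 to above 1.

```python
def return_on_byte(value_extracted, storage_cost):
    """ROB = value extracted from the data / cost of storing that data.
    Both arguments must use the same unit (here: dollars per TB)."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value_extracted / storage_cost


# Hypothetical numbers: one TB of logs is assumed to yield $200 of value.
VALUE_PER_TB = 200.0

rob_expensive = return_on_byte(VALUE_PER_TB, 1000.0)  # costly tier: ROB < 1
rob_cheap = return_on_byte(VALUE_PER_TB, 100.0)       # cheaper tier: ROB > 1

for label, rob in [("expensive tier", rob_expensive), ("cheap tier", rob_cheap)]:
    verdict = "keep active" if rob >= 1 else "archive to tape"
    print(f"{label}: ROB = {rob:.1f} -> {verdict}")
```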
  • Use The Right Tool For The Right Job
    • Relational Databases: Use when:
      • Interactive OLAP Analytics (<1sec)
      • Multistep ACID Transactions
      • SQL Compliance
    • Hadoop: Use when:
      • Structured or Not (Agility)
      • Scalable Storage/Compute
      • Complex Data Processing
  • Where Does Hadoop Fit in the Enterprise Data Stack?
    (Diagram labels: Data Scientists, Analysts, Business Users, System Administrators, Data Architects, Users; IDEs, BI/Analytics, Enterprise Reporting, Cloudera Mgmt Apps; Enterprise Data Warehouse, Low-Latency Serving Systems, Web Application; Logs, Files, Web Data, Relational Databases)
  • Apache Hive Features
    • A subset of SQL covering the most common statements
    • JDBC/ODBC support
    • Agile data types: Array, Map, Struct, and JSON objects
    • Pluggable SerDe system to work on unstructured files directly
    • User Defined Functions and Aggregates
    • Regular Expression support
    • MapReduce support
    • Partitions and Buckets (for performance optimization)
    • In The Works: Indices, Columnar Storage, Views, Microstrategy compatibility, Explode/Collect
    • More details: http://wiki.apache.org/hadoop/Hive
  • Broad Adoption in Key Verticals
    • Example Applications
      • Financial Services, Risk management: “Examine purchase behavior across debit and credit properties to better identify high-risk customers.”
      • Telecom, BSS: “Analyze calling patterns among users and current capacity to forecast traffic growth and locate new towers.”
      • Retail, Brand Equity: “Monitor customer and product data recorded across internal & external sources to trend brand valuation.”
      • Government, Traffic Analysis: “Use multimedia data from various sources to build an actionable graph of relationships among targets.”
    • Stakeholders: IT: Operations; IT: Data Engineering; Risk Analysts; Research; Insight Team; Intelligence
  • Customers
  • How are Customers Using Cloudera?
    Answering Questions that Were Impossible to Ask Before
    • Analyze search terms and subsequent user purchase decisions to tune search results, increase conversion rates
    • Digest long-term historical trade data to identify fraudulent activity and build real-time fraud prevention
    • Model site visitor behavior with analytics that deliver better recommendations for new purchases
    • Continually refine predictive models for advertising response rates to deliver more precisely targeted advertisements
    • Replace expensive legacy ETL system with more flexible, cheaper infrastructure that is 20 times faster
    • Correlate educational outcomes with programs and student histories to improve results
    • Big Bank: Examine customer behavior to improve loan risk scoring
    More: http://www.cloudera.com/company/press-center/hadoop-world-nyc/
  • Cloudera Offerings
    Facilitating enterprise adoption of Hadoop
    • Software
    • Services
    • Training
  • Cloudera Enterprise
    Enterprise Support and Management Applications
    • Improves conformance to important IT SLAs, policies and procedures
    • Lowers the cost of management and administration
    • Increases reliability and consistency of the platform
    • Certified integration with RDBMS, ETL, BI, Server, and Cloud Systems
  • Integrating with Existing IT Infrastructure
    (Partner categories: BI/Analytics, ETL, RDBMS, Cloud/OS, Hardware)
  • MicroStrategy (for interactive Dashboards)
  • Informatica (for Extract-Transform-Load, aka ETL)
  • Summary
    • Cloudera’s Data OS (CDH) enables:
      • Data Agility (Evolving Schemas)
      • Consolidation (Structured or Not)
      • Complex Data Processing (Any Language)
      • Economical Storage (Enable Return-on-Byte > 1)
    • Cloudera Enterprise enables:
      • Conformance to important IT SLAs, policies and procedures
      • Lower cost of management and administration
      • Increased reliability and consistency
      • Certified integration with existing IT infrastructure
  • Contact Information and Free Hadoop Book
    Amr Awadallah, CTO, Cloudera, Inc.
    aaa@cloudera.com
    650-644-3921
    twitter.com/awadallah
    twitter.com/cloudera
  • Appendix
  • Cloudera Overview
    • Hadoop…: Jeff Hammerbacher, Chief Scientist; Amr Awadallah, CTO, VP Engineering; Doug Cutting, Chief Architect
    • … meets enterprise: Mike Olson, CEO; Omer Trajman, VP, Customer Solutions; John Kreisa, VP, Marketing; Charles Zedlewski, VP, Product Management; Ed Albanese, Head of Business Development
    • Investors: Accel Partners, Greylock Partners, Meritech Capital Partners
    • Product category: Data Management
    • Business model: Cloudera offers Software, Support, Training, and Professional Services
    • Employees: 70+
    • Customers: 75+
    • Headquarters: Palo Alto, California
    • Elevator pitch: The leading provider of Apache Hadoop-based software and services for the enterprise
    • Vision: We enable organizations to profit from all of their data
  • Why CDH (Cloudera Distribution for Hadoop)?
    Features and their benefits:
    • It’s packaged: Much easier for users to install CDH than any other form of Hadoop.
    • It’s patched: This makes CDH more stable and secure than just downloading an Apache branch.
    • It’s proven: Thousands of organizations already use CDH today so risk is lower.
    • It’s highly functional: CDH will cover more use cases and users will be more productive than if they were just using core Hadoop.
    • It’s integrated: Save time (of piecing a system together yourself) and lower risk (of choosing the wrong combination of versions or patches).
    • It’s the accepted standard: More of your preexisting investments in RDBMS, ETL and BI work best with CDH.
    • It’s supported: CDH is one of only two distributions that has a commercial entity standing behind it.
    • It’s 100% Apache licensed: Investment in this technology is insured.
  • Hadoop Timeline (axis: 2002 to 2009)
    • Doug Cutting & Mike Cafarella started working on Nutch
    • Google publishes GFS & MapReduce papers
    • Cutting adds DFS & MapReduce support to Nutch
    • Yahoo! hires Cutting, Hadoop spins out of Nutch
    • NY Times converts 4TB of image archives over 100 EC2s
    • Web-scale deployments at Y!, Facebook, Last.fm
    • Fastest sort of a TB, 3.5mins over 910 nodes
    • Cloudera Founded
    • Fastest sort of a TB, 62secs over 1,460 nodes
    • Sorted a PB in 16.25hours over 3,658 nodes
    • Cloudera hires Cutting
    • Hadoop Summit 2009, 750 attendees
  • 10 Common Hadoop-able Problems
    1. Modeling true risk
    2. Customer churn analysis
    3. Recommendation engine
    4. Ad targeting
    5. PoS transaction analysis
    6. Analyzing network data to predict failure
    7. Threat analysis
    8. Trade surveillance
    9. Search quality
    10. Data “sandbox”
  • Case Studies: Hadoop World 2009
    • VISA: Large Scale Transaction Analysis
    • JP Morgan Chase: Data Processing for Financial Services
    • China Mobile: Data Mining Platform for Telecom Industry
    • Rackspace: Cross Data Center Log Processing
    • Booz Allen Hamilton: Protein Alignment using Hadoop
    • eHarmony: Matchmaking in the Hadoop Cloud
    • General Sentiment: Understanding Natural Language
    • Yahoo!: Social Graph Analysis
    • Visible Technologies: Real-Time Business Intelligence
    Slides and Videos: http://www.cloudera.com/hadoop-world-nyc
  • Case Studies: Hadoop World 2010
    • eBay: Hadoop at eBay
    • Twitter: The Hadoop Ecosystem at Twitter
    • Yale University: MapReduce and Parallel Database Systems
    • General Electric: Sentiment Analysis powered by Hadoop
    • Facebook: HBase in Production
    • AOL: AOL’s Data Layer
    • Raytheon: SHARD: Storing and Querying Large-Scale Data
    • StumbleUpon: Mixing Real-Time and Batch Processing
    More Info: http://www.cloudera.com/company/press-center/hadoop-world-nyc/
  • Hadoop Design Axioms
    1. System Shall Manage and Heal Itself
    2. Performance Shall Scale Linearly
    3. Compute Should Move to Data
    4. Simple Core, Modular and Extensible
  • HDFS: Hadoop Distributed File System
    • Block Size = 64MB
    • Replication Factor = 3
    • Cost/GB is a few ¢/month vs $/month
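To make the block size and replication factor on this slide concrete, here is a minimal arithmetic sketch. The function name `hdfs_footprint` and the 1 GB example file are assumptions for illustration; the 64 MB block size and replication factor of 3 come from the slide.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # replication factor quoted on the slide


def hdfs_footprint(file_size_mb):
    """Return (blocks, block_replicas, raw_storage_mb) for one file.

    The final block only occupies as much space as the data it holds,
    so raw storage is the file size times the replication factor.
    """
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    block_replicas = blocks * REPLICATION
    raw_storage_mb = file_size_mb * REPLICATION
    return blocks, block_replicas, raw_storage_mb


# Hypothetical example: a 1 GB (1024 MB) file.
blocks, replicas, raw_mb = hdfs_footprint(1024)
print(blocks, replicas, raw_mb)  # 16 blocks, 48 block replicas, 3072 MB raw
```

Tripling the raw footprint is the price of the self-healing behavior described earlier, which is why the low cost per GB matters.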
  • MapReduce: Distributed Processing
  • MapReduce Example for Word Count
    cat *.txt | mapper.pl | sort | reducer.pl > out.txt
    (Diagram: the input, e.g. “To Be Or Not To Be?”, is divided into Split 1 through Split N of (docid, text). Map 1 through Map M emit (words, counts), sorted locally into (sorted words, counts). The Shuffle routes each word to one of Reduce 1 through Reduce R, and each reducer writes an Output File of (sorted words, sum of counts). Example: partial counts Be, 5; Be, 12; Be, 7; Be, 6 are summed to Be, 30.)
    Map(in_key, in_value) => list of (out_key, intermediate_value)
    Reduce(out_key, list of intermediate_values) => out_value(s)
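The map, sort/shuffle, and reduce stages on this slide can be simulated in a single process. The sketch below is not Hadoop code and not the `mapper.pl`/`reducer.pl` scripts from the slide, which are not shown in the transcript; `mapper`, `reducer`, and `run_wordcount` are hypothetical names that follow the Map and Reduce signatures given above.

```python
from itertools import groupby
from operator import itemgetter


def mapper(docid, text):
    """Map(in_key, in_value) => list of (out_key, intermediate_value)."""
    for word in text.lower().split():
        yield word.strip('?.,!"'), 1


def reducer(word, counts):
    """Reduce(out_key, list of intermediate_values) => out_value."""
    return word, sum(counts)


def run_wordcount(documents):
    # Map phase: every document is processed independently.
    intermediate = [
        pair for docid, text in documents.items() for pair in mapper(docid, text)
    ]
    # Shuffle phase: sorting brings all pairs with the same key together.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct word.
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(intermediate, key=itemgetter(0))
    )


result = run_wordcount({"doc1": "To Be Or Not To Be?"})
print(result)
```

On a real cluster the map calls run on the nodes holding each split and the sort is distributed, but the data flow is the same as in the shell pipeline on the slide.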
  • Hadoop High-Level Architecture
    • Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs
    • Name Node: Maintains mapping of file blocks to data node slaves
    • Job Tracker: Schedules jobs across task tracker slaves
    • Data Node: Stores and serves blocks of data
    • Task Tracker: Runs tasks (work units) within a job
    • Data Node and Task Tracker share a physical node
  • Hive vs Pig Example (count distinct values > 0)
    • Hive syntax:
        SELECT COUNT(DISTINCT col1) FROM mytable WHERE col1 > 0;
    • Pig syntax:
        mytable = LOAD 'myfile' AS (col1, col2, col3);
        mytable = FOREACH mytable GENERATE col1;
        mytable = FILTER mytable BY col1 > 0;
        mytable = DISTINCT mytable;
        mytable = GROUP mytable ALL;
        mytable = FOREACH mytable GENERATE COUNT(mytable);
        DUMP mytable;
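For readers unfamiliar with either language, the result both queries on this slide are meant to compute can be stated in a few lines of Python. The sample `rows` and the function name `count_distinct_positive` are hypothetical; only the logic (filter values > 0, de-duplicate, count) comes from the slide.

```python
# Hypothetical rows standing in for mytable; col1 is the first field.
rows = [(3, "a"), (-1, "b"), (3, "c"), (0, "d"), (7, "e")]


def count_distinct_positive(rows, column=0):
    """Count the distinct values in one column that are greater than zero."""
    return len({row[column] for row in rows if row[column] > 0})


result = count_distinct_positive(rows)
print(result)  # distinct positive values are 3 and 7
```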
  • Hive Agile Data Types
    • STRUCTS:
      • SELECT mytable.mycolumn.myfield FROM …
    • MAPS (Hashes):
      • SELECT mytable.mycolumn[mykey] FROM …
    • ARRAYS:
      • SELECT mytable.mycolumn[5] FROM …
    • JSON:
      • SELECT get_json_object(mycolumn, objpath) FROM …