How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics - Strata Conf - Sept 2011



  • 1. How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics
    Strata Conference, Sept 22nd 2011, New York, NY
    Dr. Amr Awadallah, Founder, CTO, VP of, twitter: @awadallah
  • 2. Business Intelligence Before Adopting Apache Hadoop
    Diagram layers, top to bottom: BI Reports + Interactive Apps; RDBMS (processed data); ETL Compute Grid; Storage Only Grid (original raw data), mostly append; Collection; Instrumentation.
    Problems called out on the diagram:
    • Can’t Explore Original High Fidelity Raw Data
    • Moving Data To Compute Doesn’t Scale
    • Archiving = Premature Data Death
    Copyright © 2011, Cloudera, Inc. All Rights Reserved.
  • 3. Business Intelligence After Adopting Apache Hadoop
    Diagram layers, top to bottom: BI Reports + Interactive Apps; RDBMS; ETL and Aggregations; Hadoop: Storage + Compute Grid, mostly append; Collection; Instrumentation.
    Benefits called out on the diagram:
    • Data Exploration & Advanced Analytics
    • Complex Data Processing
    • Keep Data Alive For Ever
  • 4. So What is Apache Hadoop?
    • A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
    • Core Hadoop has two main components:
      – Hadoop Distributed File System: self-healing high-bandwidth clustered storage
      – MapReduce: fault-tolerant distributed processing
    • Key business values:
      – Flexible – Store any data, Run any analysis (Mine First, Govern Later)
      – Scalable – Start at 1TB/3-nodes then grow to petabytes/thousands of nodes
      – Affordable – Cost per TB at a fraction of traditional options
      – Open Source – No Lock-In, Rich Ecosystem, Large developer community
      – Broadly adopted – A large and active ecosystem, Proven to run at scale
  • 5. The Main Benefit: Agility/Flexibility
    Schema-on-Write (RDBMS):
    • Schema must be created before data is loaded
    • Explicit load operation has to take place which transforms data to database internal structure
    • New columns must be added explicitly before data for such columns can be loaded into the database
    • Benefits: Read is Fast; Standards/Governance
    Schema-on-Read (Hadoop):
    • Data is simply copied to the file store, no special transformation is needed
    • A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns
    • New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it
    • Benefits: Load is Fast; Flexibility/Agility
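The schema-on-read idea from slide 5 can be illustrated with a small sketch. It is not Hive's SerDe API, only an in-memory stand-in: `RAW_STORE`, `load`, `read`, `serde_v1`, `serde_v2`, and the tab-separated log format are all hypothetical names and assumptions made for illustration.

```python
# Schema-on-read sketch: raw lines are stored untouched and parsed only on read.
# All names and the tab-separated log format below are hypothetical.

RAW_STORE = []  # stands in for files copied as-is into the file store


def load(line):
    """Load is fast: append the raw line with no transformation."""
    RAW_STORE.append(line)


def serde_v1(line):
    """Read-time parser that extracts two columns."""
    parts = line.split("\t")
    return {"user": parts[0], "url": parts[1]}


def serde_v2(line):
    """Updated parser that also extracts an optional third column."""
    parts = line.split("\t")
    row = {"user": parts[0], "url": parts[1]}
    row["referrer"] = parts[2] if len(parts) > 2 else None
    return row


def read(serde):
    """Apply the chosen parser to every stored line at read time."""
    return [serde(line) for line in RAW_STORE]


load("alice\t/home\tgoogle.com")
load("bob\t/cart")

rows_v1 = read(serde_v1)  # the third field is stored but not yet visible
rows_v2 = read(serde_v2)  # same stored data, new column appears retroactively
```

Nothing is reloaded between the two reads; only the parser changes, which is the trade the slide describes: fast loads and flexibility in exchange for parsing work on every read.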
  • 6. What is Complex Data Processing?
    1. Java MapReduce: Gives the most flexibility and performance, but potentially long development cycle (the “assembly language” of Hadoop).
    2. Streaming MapReduce (also Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility.
    3. Pig: A high-level language out of Yahoo, suitable for batch data flow workloads.
    4. Hive: A SQL interpreter out of Facebook, also includes a metastore mapping files to their schemas and associated SerDe.
    5. Oozie: A PDL XML workflow server engine that enables creating a workflow of jobs composed of any of the above.
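As a rough illustration of item 2, here is a word count written in the style of a Streaming MapReduce job, where the mapper and reducer exchange tab-separated key/value text records and the reducer receives its input sorted by key. The function names and the `run_local` helper are hypothetical; `run_local` only simulates the sort-and-shuffle step in memory and is not how a cluster runs the job.

```python
# Word count in the style of Hadoop Streaming, simulated locally.
# mapper/reducer/run_local are illustrative names, not part of any Hadoop API.
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each key; the input must already be sorted by key."""
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


result = run_local(["Hadoop stores data", "Hadoop processes data"])
```

In a real streaming job the two functions would be separate scripts reading standard input and writing standard output, which is what lets them be written in any language.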
  • 7. What This Means For You: Agility
    Image labels: Up Front Design; Just in Time
  • 8. What This Means For You: Innovation
    Image labels: Data Committee; Data Scientist
  • 9. What This Means For You: Consolidation
    Image labels: Silos; Sharing
  • 10. What This Means For You: Extract Value from Latent Data
    Image labels: Archive to Tape; Keep Data Alive
  • 11. What This Means For You: Ability to Grow Fluidly
    Benefit #2: Scalability
  • 12. What This Means For You: Data Beats Algorithm
    Image labels: Smarter Algos; More Data
  • 13. Where Does Hadoop Fit in the Enterprise Data Stack?
    Diagram labels:
    • People: Data Scientists; Analysts; Business Users; System Operators; Data Architects; Customers
    • Tools and systems: IDEs (Development Tools); BI, Analytics (Business Intelligence Tools); Enterprise Reporting; Cloudera Mgmt Suite; ETL Tools; Enterprise Data Warehouse; Low-Latency Serving Systems; Web Application
    • Data sources: Logs; Files; Web Data; Relational Databases
  • 14. Use The Right Tool For The Right Job
    Relational Databases. Use when:
    • Interactive OLAP Analytics (<1sec)
    • Multistep ACID Transactions
    • 100% SQL Compliance
    Hadoop. Use when:
    • Structured or Not (Agility)
    • Scalability of Storage/Compute
    • Complex Data Processing
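  • 15. Two Core Use Cases Common Across Many Industries
    The two use cases are Advanced Analytics and Data Processing. Applications by industry:
    • Web: Social Network Analysis (advanced analytics); Clickstream Sessionization (data processing)
    • Media: Content Optimization (advanced analytics); Clickstream Sessionization (data processing)
    • Telco: Network Analytics (advanced analytics); Mediation (data processing)
    • Retail: Loyalty & Promotions (advanced analytics); Data Factory (data processing)
    • Financial: Fraud Analysis (advanced analytics); Trade Reconciliation (data processing)
    • Federal: Entity Analysis (advanced analytics); SIGINT (data processing)
    • Bioinformatics: Sequencing Analysis (advanced analytics); Genome Mapping (data processing)
    • Manufacturing: Product Quality (advanced analytics); Mfg Process Tracking (data processing)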
  • 16. CDH: Cloudera’s Distribution Including Apache Hadoop
    Stack diagram:
    • UI Framework: HUE
    • SDK: HUE SDK
    • Workflow: OOZIE
    • Scheduling: OOZIE
    • Metadata: HIVE
    • Languages / Compilers: PIG, HIVE
    • Data Integration: FLUME, SQOOP, ODBC
    • Fast Read/Write Access: HBASE
    • Coordination: ZOOKEEPER
    Key points:
    • Open Source – 100% Apache licensed, 100% Open Source, 100% Free.
    • Enterprise Ready – Predictable releases, Documentation, Hotfix Patches, Intensive QA
    • Integrated – All required component versions & dependencies are managed for you
    • Industry Standard – Existing RDBMS, ETL and BI systems work best with it
    • Many Form Factors – Public Cloud, Private Cloud, Ubuntu, RHEL, 32/64bit, etc
  • 17. SCM Express: Simplifies Installation and Configuration
    Service & Configuration Manager (SCM) Express takes the complexity out of deploying and configuring CDH.
    • Provision a complete Hadoop stack in minutes
    • Centrally manage system services through a user-friendly interface
    • Manages services for up to 50 nodes
    • FREE to download
    KEY FEATURES
    1. Automated, wizard-based installation of the complete Hadoop stack
    2. Central, real-time dashboard for configuration management
    3. Ability to configure the cluster while it’s running
    4. Incorporates comprehensive validation and error checking
    5. Automates the expansion of services to new nodes when they come online
  • 18. What is Cloudera Enterprise?
    Cloudera Enterprise makes open source Apache Hadoop enterprise-easy:
    • Simplify and Accelerate Hadoop Deployment
    • Reduce Adoption Costs and Risks
    • Lower the Cost of Administration
    • Increase the Transparency & Control of Hadoop
    • Leverage the Experience of Our Experts
    CLOUDERA ENTERPRISE COMPONENTS
    • Cloudera Management Suite: Comprehensive Toolset for Hadoop Administration
    • Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs
    3 of the top 5 telecommunications, mobile services, defense & intelligence, banking, media and retail organizations depend on Cloudera Enterprise
    • EFFECTIVENESS: Ensuring Repeatable Value from Apache Hadoop Deployments
    • EFFICIENCY: Enabling Apache Hadoop to be Affordably Run in Production
  • 19. Hadoop World 2011
    The largest gathering of Hadoop practitioners, developers, business executives, industry luminaries and innovative companies in the Hadoop ecosystem.
    November 8-9, Sheraton New York Hotel & Towers, NYC
    • 1400 attendees, 25+ sponsors
    • 60 sessions across 5 tracks for:
      – Business Decision Makers
      – Enterprise Architects
      – IT Operators
      – Data Scientists
      – Developers
    • Cloudera Training and Certification (November 7, 10, 11)
    Learn more and register at
    $50 discount for Strata attendees
  • 20. What I Would Like You To Remember:
    • The Key Benefits of the Apache Hadoop Data Platform:
      – Agility/Flexibility (Enables Innovation/Exploration).
      – Complex Data Processing (Any Language, Any Problem).
      – Scalability of Storage/Compute (Freedom to Grow).
      – Economical Active Archive (Keep All Your Data Alive).
    • Cloudera Enterprise enables:
      – Lower the Cost of Management and Administration.
      – Simplify and Accelerate Hadoop Deployment.
      – Increase the Transparency & Control of Hadoop.
      – Firm SLAs on Issue Resolution.
  • 21. Contact Information: Amr Awadallah 650-644-3921
  • 23. Appendix
  • 24. Hadoop Timeline
    Timeline axis: 2002 to 2009. Events shown, in chronological order:
    • Doug Cutting & Mike Cafarella started working on Nutch
    • Google publishes GFS & MapReduce papers
    • Doug Cutting adds DFS & MapReduce support to Nutch
    • Yahoo! hires Cutting, Hadoop spins out of Nutch
    • NY Times converts 4TB of image archives over 100 EC2s
    • Fastest sort of a TB, 3.5mins over 910 nodes
    • Facebook launches Hive: SQL Support for Hadoop
    • Cloudera Founded
    • Fastest sort of a TB, 62secs over 1,460 nodes
    • Sorted a PB in 16.25hours over 3,658 nodes
    • Doug Cutting joins Cloudera
    • Hadoop Summit 2009, 750 attendees
  • 25. Cloudera’s Track Record
    • Customers: Multiple customers with >1,000 Hadoop nodes under management
    • Supporting dozens of diverse production use cases including ones that are revenue critical with tight SLAs
    • Community: years of demonstrated leadership in the Apache Hadoop ecosystem. Cloudera employees are:
      – The largest contributor to the Hadoop ecosystem in patches
      – Founders of 70% of the projects in the Apache Hadoop ecosystem including Apache Hadoop itself
      – The first to build & integrate what is now the reference Hadoop stack
    • Industry: Multiple years of experience providing Hadoop solutions across industries:
      – 2 of the top 5 payments companies run Cloudera
      – 3 of the top 5 commercial banks run Cloudera
      – 2 of the top 4 online travel companies run Cloudera
  • 26. Cloudera Enterprise Management Suite
    • Activity Monitor
      – It Helps You: Consolidate all user activities into a real-time view; Diagnose user performance; Track activity metrics
      – So You Can: Improve performance; Improve conformance to SLAs; Improve QOS
      – It’s Like: MySQL Enterprise Monitor; Quest Foglight for Oracle / SQL Server
    • Service & Configuration Manager
      – It Helps You: Manage system services; Automate changes; Validate settings; 1-click security
      – So You Can: Lower cost of administration; Improve uptime
      – It’s Like: Red Hat Satellite Server; Microsoft System Center; Oracle Enterprise Manager
    • Resource Manager
      – It Helps You: Report on the usage of scarce resources; Plan for capacity expansion
      – So You Can: Improve quality of service; Extend the life of the cluster
      – It’s Like: VMware vCenter
    • Authorization Manager
      – It Helps You: Centralize management of all users, groups and privileges; Manage permissions via delegated administration
      – So You Can: Lower the costs of administration; Improve compliance
      – It’s Like: Teradata security administration
  • 27. CDH Integrates with Existing IT Infrastructure
    Integration categories shown: BI/Analytics; ETL; Databases; Cloud/OS; Hardware