Boston HUG - Cloudera presentation

Published in: Technology

  • Financial services companies are realizing that they need to understand the fundamental risk in their customer base. All of a bank's working capital originates with customers. Being able to better predict fluctuations can help them optimize how to put that capital to work.
  • Much of the discussion about brands today happens on social media. This not only affects the company's perception but can also directly influence relationships with customers and the ability to sell. Hadoop is a natural solution for gathering and contextualizing discussions about company brands and products.
  • Transcript

    • 1. PROFIT FROM ALL OF YOUR DATA
      • Hadoop in the Enterprise
      • February 2012
      • Adam Smieszny | Systems Engineer
    • 2. Agenda
      • Hadoop Overview
      • History of Hadoop
      • What is Hadoop
      • Hadoop in the Enterprise
      ©2011 Cloudera, Inc. All Rights Reserved.
    • 3. Existing Data Management
      • Current database solutions are designed for structured data.
      • Optimized to answer known questions quickly
      • Schemas dictate form/context
      • Difficult to adapt to new data types and new questions
      • Expensive at petabyte scale
      • [Chart: gigabytes of data created (in billions), scale 0 to 10,000, for 2005, 2010, and 2015; structured data vs. unstructured data; one label reads 10%]
    • 4. Why the Need for Hadoop?
      • 1.8 trillion gigabytes of data was created in 2011…
      • More than 90% is unstructured data
      • Approx. 500 quadrillion files
      • Quantity doubles every 2 years
      • Drivers: More Content, More Devices, New Sources, New & Better Info
      • [Chart: gigabytes of data created (in billions), scale 0 to 10,000, for 2005, 2010, and 2015; structured data vs. unstructured data]
      • Source: IDC 2011
    • 5. The Origins of Hadoop
      • [Timeline, 2002 to 2012]
      • Open source web crawler project created by Doug Cutting
      • Publishes MapReduce and GFS Paper
      • Open Source MapReduce and HDFS project created by Doug Cutting
      • Runs 4,000-node Hadoop cluster
      • Hadoop wins Terabyte sort benchmark
      • Launches SQL support for Hadoop
      • Releases CDH and Cloudera Enterprise
    • 6. What is Apache Hadoop?
      • Hadoop is a platform for data storage and processing that is…
        • Scalable
        • Fault tolerant
        • Open source
      • CORE HADOOP COMPONENTS
        • Hadoop Distributed File System (HDFS): File Sharing & Data Protection Across Physical Servers
        • MapReduce: Distributed Computing Across Physical Servers
      • Flexibility
        • A single repository for storing, processing & analyzing any type of data
        • Not bound by a single schema
      • Scalability
        • Scale-out architecture divides workloads across multiple nodes
        • Flexible file system eliminates ETL bottlenecks
      • Low Cost
        • Can be deployed on commodity hardware
        • Open source platform guards against vendor lock-in
    • 7. What is CDH?
      • Cloudera's Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is…
        • 100% Apache open source
        • Contains all components needed for deployment
        • Fully documented and supported
        • Released on a reliable schedule
      • Fastest Path to Success
        • No need to write your own scripts or do integration testing on different components
        • Works with a wide range of operating systems, hardware, databases and data warehouses
      • Stable and Reliable
        • Extensive Cloudera QA systems, software & processes
        • Tested & run in production at scale
        • Proven at scale in dozens of enterprise environments
      • Community Driven
        • Incorporates only main-line components from the Apache Hadoop ecosystem (no forks or proprietary underpinnings)
        • FREE
    • 8. CDH & Enterprise Ecosystem
      • File System Mount: FUSE-DFS
      • UI Framework: HUE
      • SDK: HUE SDK
      • Workflow: APACHE OOZIE
      • Scheduling: APACHE OOZIE
      • Metadata: APACHE HIVE
      • Languages / Compilers: APACHE PIG, APACHE HIVE
      • Data Integration: APACHE FLUME, APACHE SQOOP
      • Fast Read/Write Access: APACHE HBASE
      • Coordination: APACHE ZOOKEEPER
      • Annotations: Drivers, language enhancements, testing; Sqoop framework, more adapters coming…; Packaging, testing
    • 9. Hadoop / RDBMS Use Cases
      • Create context (classification, text mining); Analyze unstructured data
      • Parse, aggregate; Analyze, report semi-structured data
      • Active archival, Long running queries; Analyze, report structured data
      • Slide borrowed from Krishnan Parasuraman presentation at Enzee'11
    • 10. Hadoop in Production
      • How Apache Hadoop fits into your existing infrastructure.
      • Users: OPERATORS, ENGINEERS, ANALYSTS, BUSINESS USERS, CUSTOMERS
      • Tools: Management Tools, IDE's, BI / Analytics, Enterprise Reporting, Web Application
      • Connected systems: Enterprise Data Warehouse, Low-Latency Serving Systems
      • Data sources: Logs, Files, Web Data, Relational Data
    • 11. Hadoop Use Cases
      • Each industry is listed with its ADVANCED ANALYTICS application first and its DATA PROCESSING application second.
      • Web: Social Network Analysis; Clickstream Sessionization
      • Media: Content Optimization; Clickstream Sessionization
      • Telco: Network Analytics; Mediation
      • Retail: Loyalty & Promotions Analysis; Data Factory
      • Financial: Fraud Analysis; Trade Reconciliation
      • Federal: Entity Analysis; SIGINT
      • Bioinformatics: Sequencing Analysis; Genome Mapping
    • 12. Use Case: Customer Risk
      • Build a comprehensive data picture of customer-side risk
        • Publish a consolidated set of attributes for analysis
        • Map ratings across products
      • Parse and aggregate data from different sources
        • Credit and debit cards, product payments, deposits and savings
        • Banking activity, browsing behavior, call logs, e-mails and chats
      • Merge data into a single view
        • A "fuzzy join" among data sources
      • Structure and normalize attributes
        • Sentiment analysis, pattern recognition
    • 13. Use Case: Sentiment Analysis
      • Internet generates a lot of chatter about brands
        • Understanding what's being said is crucial to protecting brand value
        • Facebook, Twitter generate a lot of data for a global top brand
      • Capturing and processing direct feedback
        • Better engagement and alerting via Sentiment Analysis
        • Not yet ready for fully automated customer service
      • Hadoop handles the diverse data types and processing
        • Sources of data changing and semantics continuously evolving
        • Sophistication of algorithms is improving daily
    • 14. Journey of CDH Users
      • Discover the Benefits of Apache Hadoop
        • Gain the flexibility to store and mine all types of data
        • Leverage the scale-out architecture for complex data analysis
        • Easily scale to meet growing data requirements
        • Avoid vendor lock-in with an open source technology
      • Deploy CDH
        • The fastest, surest path to success with Apache Hadoop
        • Stable, reliable version of Apache Hadoop without the vendor lock-in imposed by proprietary vendors
        • Integrates with your other technology platforms ensuring investment protection
      • Subscribe to Cloudera Enterprise
        • Simplify and accelerate Apache Hadoop deployment
        • Reduce adoption costs and risks
        • More effectively manage cluster resources
        • Leverage the experience of our experts
    • 15. Get Hadoop cloudera
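
Slide 6 names MapReduce as the Hadoop component that distributes computing across physical servers. The sketch below is not from the presentation; it is a minimal, single-process illustration of the map, shuffle, and reduce steps using a word count. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample records are hypothetical, and nothing here uses the Hadoop API. On a real cluster the framework would run the map and reduce steps in parallel across nodes and perform the shuffle itself.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group emitted values by key (done by the framework on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over an iterable of text records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = ["Hadoop stores data", "Hadoop processes data"]
    print(word_count(sample))
```

Because each record is mapped independently and each key is reduced independently, both steps can in principle be split across many machines, which is the scale-out property the slides describe.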