vBACD July 2012 - Apache Hadoop, Now and Beyond

“Apache Hadoop, Now and Beyond”, Jim Walker, Director of Product Marketing, Hortonworks
Hadoop is an open source project that lets you gain insight from massive amounts of structured and unstructured data quickly and without significant investment. It is shifting the way many traditional organizations think about analytics and business models. While it is designed to take advantage of cheap commodity hardware, it also suits the cloud well because it is built to scale up or down without system interruption. In this presentation, Jim Walker provides an overview of Apache Hadoop and its current state of adoption in and out of the cloud.

    Presentation Transcript

    • Apache Hadoop & the Cloud
      Jim Walker, Dir. Product Marketing, Hortonworks
      Twitter @jaymce
      July 10, 2012
      © Hortonworks Inc. 2012
    • 1941 2012
    • Big data market segments
      Hardware
        • Storage
        • Servers
        • Networking
      Software Distributions
        • OSS Apache Hadoop
        • Enterprise Distributions
        • Non-Hadoop big data frameworks
      ETL & Mgmnt
        • Distributed file stores
        • NoSQL databases
        • Data integration
        • Data quality & governance
      Analytics
        • Analytic application development platforms
        • Advanced analytics applications
      Applications
        • Data visualization tools
        • Business intelligence applications
      Services
        • Consulting
        • Training
        • Tech support
        • Software maintenance
        • Hardware maintenance
        • Hosting
      Next Generation Data Warehouse
        • MPP columnar data warehouse appliances
        • In-memory analytics engines
        • Fast data loading
    • Big data market segments (build of the preceding slide, with "cloud" overlaid on four of the segments)
    • Analytics started with basic purchase history…
      Megabytes, ERP: Purchase detail, Purchase record, Payment record
      Axis: Increasing Data Variety and Complexity
      Source: Created in conjunction with Teradata, Inc.
    • then we added customer information…
      Gigabytes, CRM: Segmentation, Customer Touches, Support Contacts, Offer details
      Megabytes, ERP: Purchase detail, Purchase record, Payment record
      Axis: Increasing Data Variety and Complexity
      Source: Created in conjunction with Teradata, Inc.
    • and the web started to impact…
      Terabytes, WEB: Web logs, A/B testing, Behavioral Targeting, Dynamic Pricing, Search Marketing, Affiliate Networks, Dynamic Funnels, Offer history
      Gigabytes, CRM: Segmentation, Customer Touches, Support Contacts, Offer details
      Megabytes, ERP: Purchase detail, Purchase record, Payment record
      Axis: Increasing Data Variety and Complexity
      Source: Created in conjunction with Teradata, Inc.
    • Big data changes the game
      Transactions + Interactions + Observations = BIG DATA
      Petabytes, BIG DATA: Mobile Web, Sentiment, User Click Stream, SMS/MMS, Speech to Text, Social Interactions & Feeds, Spatial & GPS Coordinates, Sensors / RFID / Devices, Business Data Feeds, External Demographics, User Generated Content, HD Video, Audio, Images, Product/Service Logs
      Terabytes, WEB: Web logs, A/B testing, Behavioral Targeting, Dynamic Pricing, Search Marketing, Affiliate Networks, Dynamic Funnels, Offer history
      Gigabytes, CRM: Segmentation, Customer Touches, Support Contacts, Offer details
      Megabytes, ERP: Purchase detail, Purchase record, Payment record
      Axis: Increasing Data Variety and Complexity
      Source: Created in conjunction with Teradata, Inc.
    • Next-gen data architecture drivers
      Business Drivers
        • Enable new business models & drive faster growth (20%+)
        • Find insights for competitive advantage & optimal returns
      Technical Drivers
        • Data continues to grow exponentially
        • Data is increasingly everywhere and in many formats
        • Legacy solutions unfit for new requirements
      Financial Drivers
        • Cost of data systems, as % of IT spend, continues to grow
        • Cost advantages of commodity hardware & open source
      Slide callouts: growth, cloud
    • Apache Hadoop: Open Source Data Management Software
      One of the best examples of open source driving innovation and creating a market
        • Foundation for big data solutions
        • Enables a rational economics model
        • Powers data-driven business
        • Commodity hardware
        • Loosely coupled, ship early/ship often
        • Consists of many specialized sub-projects
    • Apache Hadoop & Cloud Makes Sense
        • Broader access of Hadoop to end users, IT professionals, and developers
        • Easy installation and configuration and simplified programming
        • Enterprise-ready distribution with greater security, performance, ease of management and options for Hybrid IT usage
        • Integrate with everything via RESTful API
        • Spin up a cluster on demand
        • Ease management
    • 5 Reasons for Hadoop in the Cloud
      People say "should you run Hadoop in the cloud?" I say "it depends".
      http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
    • 5 Reasons for Hadoop in the Cloud
        1. If your data is stored in a cloud, local analysis may make more sense… "work near the data"
        2. For periodic processing (nightly, etc…) it might make sense to just rent.
        3. No upfront capital expense, fund from success
        4. Easier to expand a cluster; no need to buy, just find
        5. Eliminate networking concerns
      http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
    • What is Apache Hadoop?
      1 PROCESSING: Map/Reduce
        • Splits a task across processors "near" the data & assembles results
        • 2004 white paper: MapReduce: Simplified Data Processing on Large Clusters
        • Base of much new tech
      2 STORAGE: Hadoop Distributed File System
        • Distributed across "nodes"
        • Natively redundant
        • Name node tracks locations
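The split-and-assemble idea on the Map/Reduce slide can be illustrated with a minimal single-process sketch. This is plain Python, not Hadoop's Java API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration, and a real cluster would run the map calls on separate nodes close to each data split.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(split: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input split."""
    for word in split.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group emitted values by key, the step that sits between map and reduce."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Assemble the final result for one key by summing its values."""
    return key, sum(values)


def word_count(splits: list[str]) -> dict[str, int]:
    """Run map over every split, shuffle the pairs, then reduce each key."""
    mapped = (pair for split in splits for pair in map_phase(split))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    # Two hypothetical splits, standing in for blocks stored on two nodes.
    print(word_count(["big data", "big cluster"]))
```

Each split is mapped independently, which is what lets the work be placed next to the data; only the grouped intermediate pairs have to move between machines.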
    • Apache Hadoop related projects
      3 Hive, 4 HBase, 5 HCatalog, 6 Pig, 7 Oozie, 8 Ambari, 9 Sqoop, 10 Zookeeper
      Hive: Apache Hive is a data warehouse infrastructure built on top of Hadoop (originally by Facebook) for providing data summarization, ad-hoc query, and analysis of large datasets. It provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL (HQL).
    • Apache Hadoop related projects
      HBase: HBase is a non-relational database. It is columnar and provides fault-tolerant storage and quick access to large quantities of sparse data. It also adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes.
    • Apache Hadoop related projects
      HCatalog: HCatalog is a metadata management service for Apache Hadoop. It opens up the platform and allows interoperability across data processing tools such as Pig, Map Reduce and Hive. It also provides a table abstraction so that users need not be concerned with where or how their data is stored. Aster SQL-H interfaces with HCatalog.
    • Apache Hadoop related projects
      Pig: Apache Pig allows you to write complex map reduce transformations using a simple scripting language. Pig Latin (the language) defines a set of transformations on a data set such as aggregate, join and sort among others. Pig Latin is sometimes extended using UDFs (User Defined Functions), which the user can write in Java and then call directly from the language.
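To make the three transformations named on the Pig slide concrete, here is a small sketch of a join, an aggregate and a sort chained into one data flow. It is written in plain Python rather than Pig Latin, so it only mirrors the shape of such a script; the relations `visits` and `users` and every function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical input relations: (user, page) and (user, country).
visits = [("alice", "/home"), ("bob", "/cart"), ("alice", "/cart")]
users = [("alice", "US"), ("bob", "DE")]


def join(left, right):
    """Inner join two lists of tuples on their first field."""
    index = defaultdict(list)
    for key, *rest in right:
        index[key].append(tuple(rest))
    return [
        (key, *l_rest, *r_rest)
        for key, *l_rest in left
        for r_rest in index.get(key, [])
    ]


def count_by(rows, field):
    """Aggregate: count rows per distinct value of the given field index."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[field]] += 1
    return dict(counts)


def visits_per_country(visit_rows, user_rows):
    """Join visits to users, count visits per country, sort by count descending."""
    joined = join(visit_rows, user_rows)   # rows of (user, page, country)
    counts = count_by(joined, 2)           # country is field 2 after the join
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    print(visits_per_country(visits, users))
```

In Pig each of these steps would be one statement over a named relation, and the framework would turn the chain into map reduce jobs.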
    • Apache Hadoop related projects
      Oozie: Oozie coordinates jobs written in multiple languages such as Map Reduce, Pig and Hive. It is a workflow system that links these jobs and allows specification of order and dependencies between them.
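The "order and dependencies" idea on the Oozie slide amounts to running jobs in a dependency-respecting order. The sketch below shows that with Python's standard `graphlib` module. It is not Oozie's workflow format or API, and the job names in `workflow` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "pig_clean_logs": {"mapreduce_ingest"},
    "hive_load_customers": set(),
    "hive_join": {"pig_clean_logs", "hive_load_customers"},
    "hive_daily_report": {"hive_join"},
}


def execution_order(jobs: dict[str, set[str]]) -> list[str]:
    """Return one valid run order in which every job follows its dependencies.

    TopologicalSorter raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(jobs).static_order())


if __name__ == "__main__":
    for step, job in enumerate(execution_order(workflow), start=1):
        print(step, job)
```

Jobs with no path between them, such as `pig_clean_logs` and `hive_load_customers` here, may come out in either order, and a scheduler would be free to run them in parallel.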
    • Apache Hadoop related projects
      Ambari: Apache Ambari operationalizes Hadoop. It provides a mechanism to monitor and manage a cluster. It also provisions nodes. Ambari is a monitoring, administration and lifecycle management project for Apache Hadoop clusters.
    • Apache Hadoop related projects
      Sqoop: Sqoop is a set of tools that allow Hadoop to exchange data with non-Hadoop data stores such as traditional relational databases and data warehouses.
    • Apache Hadoop related projects
      ZooKeeper: ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
    • Hadoop in Action
        1. Web Log files via WebHDFS APIs
        2. Customer & Order data via Talend & HCatalog for schema
        3. Pre-processes, refines, and joins data via Talend, Pig, & HCatalog
        4. Interfaces with HCatalog to analyze website visits by the type of end results
      Diagram labels: Website Interactions, Web Logs, Order DB Data, Customer DB Data, Big Data Refinery
    • Hortonworks Vision & Role
      We believe that by the end of 2015, more than half the world's data will be processed by Apache Hadoop.
        1. Be diligent stewards of the open source core
        2. Be tireless innovators beyond the core
        3. Provide robust data platform services & open APIs
        4. Enable the ecosystem at each layer of the stack
        5. Make the platform enterprise-ready & easy to use
    • Balancing Innovation & Stability
      Chart axes: relative % customers, time
      The CHASM
        • Innovators, technology enthusiasts
        • Early adopters, visionaries
        • Early majority, pragmatists
        • Late majority, conservatives
        • Laggards, skeptics
      Customers want technology & performance
      Customers want solutions & convenience
      Source: Geoffrey Moore - Crossing the Chasm
    • Enabling Hadoop as Enterprise Big Data Platform
      Applications, Business Tools, Development Tools, Open APIs and access, Data Movement & Integration, Data Management Systems, Systems Management
      Installation & Configuration, Administration, Monitoring, High Availability, Replication, Multi-tenancy, ..
      Hortonworks Data Platform
      DEVELOPER
      Data Platform Services & Open APIs: Metadata, Indexing, Search, Security, Management, Data Extract & Load, APIs
    • Hortonworks Data Platform
      The ONLY 100% open source data platform for Hadoop
        • Tightly aligned with core Apache code line
        • All code committed back to open source
        • Most complete Apache Hadoop platform
        • Comprehensive management and monitoring
        • Intuitive graphical data integration tools
        • Centralized metadata services for easy data sharing
    • Hortonworks Data Platform
        • Simplify deployment to get started quickly and easily
        • Monitor, manage any size cluster with familiar console and tools
        • Only platform to include data integration services to interact with any data source
        • Metadata services open the platform for integration with existing applications
        • Dependable high availability
      Hortonworks Data Platform delivers enterprise grade functionality on a proven Apache Hadoop distribution to ease management, simplify use and ease integration into the enterprise architecture
      The only 100% open source data platform for Apache Hadoop
    • Apache Distribution Stack
      Built on Hadoop 1.0 (a.k.a. 0.20.205)
        • Proven at large scale enterprise implementations
        • Most stable and reliable version of Hadoop to date
        • First Apache line supporting security, HBase, WebHDFS
        • Driven by core committers and architects at Hortonworks
      Includes necessary components already integrated and tested together
      Most stable versions of all components are chosen
      Hortonworks Distribution components: Zookeeper, HCatalog, Ambari, HBase, Talend, Sqoop, Oozie, Core, Hive, Pig
      Versions shown: 1.0.3, 0.4.0, 0.9.2, 0.9.0+, 0.92.1+, 0.9.0+, 3.1.3, 3.3.4, beta, 5.1.1
      Tested, Hardened & Proven Distribution Reduces Risk
    • Management & Monitoring Svcs
      Hortonworks Management Center
        • View the health of cluster operations, server utilization and performance levels
        • Customizable dashboards
        • APIs for integration into 3rd party monitoring tools
        • 100% open source management & monitoring, powered by Apache Ambari, Puppet, Nagios and Ganglia
        • Simple wizard-based installation, configuration & provisioning of any size Hadoop cluster
      Optimize performance for your Hadoop cluster
      Simplify installation and provisioning
    • Data Integration Services
        • Intuitive graphical data integration tools for HDFS, Hive, HBase, HCatalog and Pig
        • Oozie scheduling allows you to manage and stage jobs
        • Connectors for any database, business application or system
        • Integrated HCatalog storage
      Bridge the gap between legacy data & Hadoop
      Simplify and speed development
    • Which is best for the cloud? vs.
    • Metadata Services
      Apache HCatalog provides flexible metadata services across tools and external access
        • Consistency of metadata and data models across tools (MapReduce, Pig, HBase and Hive)
        • Accessibility: share data as tables in and out of HDFS
        • Availability: enables flexible, thin-client access via REST API
      Diagram: raw Hadoop data (inconsistent, unknown; tool specific access) passes through HCatalog shared table and schema management (table access, aligned metadata, REST API), which opens the platform
    • Services Integration
      Provides RESTful API as "front door" for Hadoop
        • Opens the door to languages other than Java
        • Thin clients via web services vs. fat-clients in gateway
        • Insulation from interface changes release to release
      Diagram layers: Existing & New Applications; RESTful Web Services (WebHDFS, HCatalog); MapReduce, Pig, Hive; HCatalog; HDFS, HBase, External Store
      Opens Hadoop to integration with existing and new applications
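As a rough illustration of the "front door" point, a thin client only needs to build an HTTP URL to address a file through WebHDFS. The sketch below assembles such URLs following the WebHDFS path convention `/webhdfs/v1/<path>?op=<OPERATION>`. The host name, file path and user are hypothetical, the helper `webhdfs_url` is not part of any Hadoop library, and 50070 is assumed as the NameNode HTTP port, which was the default in Hadoop 1.x. No request is sent.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, path: str, op: str,
                params: Optional[dict[str, str]] = None) -> str:
    """Build a WebHDFS REST URL for an absolute HDFS path and an operation."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **(params or {})})
    # quote() keeps "/" intact and percent-encodes characters unsafe in a URL path.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


if __name__ == "__main__":
    # Hypothetical NameNode host and log file.
    print(webhdfs_url("namenode.example.com", 50070,
                      "/logs/web/2012-07-10.log", "OPEN"))
    print(webhdfs_url("namenode.example.com", 50070,
                      "/logs/web", "LISTSTATUS", {"user.name": "jim"}))
```

Because the interface is plain HTTP, the same URLs can be used from any language or from a command-line HTTP client, which is the "languages other than Java" benefit the slide lists.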
    • Use cases: optimize outcomes at scale
        • Media: optimize Content
        • Intelligence: optimize Detection
        • Investment: optimize Algorithms
        • Advertising: optimize Performance
        • Fraud: optimize Prevention
        • Regulation: optimize Compliance
        • Retail / Wholesale: optimize Inventory turns
        • Manufacturing: optimize Supply chains
        • Healthcare: optimize Patient outcomes
        • Education: optimize Learning outcomes
        • Government: optimize Citizen services
      Source: Geoffrey Moore. Hadoop Summit 2012 keynote presentation.
    • Connecting Transactions + Interactions + Observations
      Data sources: Audio, Video, Images; Docs, Text, XML; Web Logs, Clicks; Social, Graph, Feeds; Sensors, Devices, RFID; Spatial, GPS; Events, Other
      Business Transactions & Interactions: Web, Mobile, CRM, ERP, SCM, …
      Big Data Refinery
      Business Intelligence & Analytics: Dashboards, Reports, Visualization, …
      Labeled flows:
        • Classic ETL processing
        • Store, aggregate, and transform multi-structured data to unlock value
        • Share refined data & runtime models
        • Data Discovery & Investigative Analytics
        • Interactive data exploration
        • Retain runtime models and historical data for ongoing refinement & analysis
        • Retain historical data to unlock additional value
    • 5 Reasons for Hadoop in the Cloud
        1. If your data is stored in a cloud, local analysis may make more sense… "work near the data"
        2. For periodic processing (nightly, etc…) it might make sense to just rent.
        3. No upfront capital expense, fund from success
        4. Easier to expand a cluster; no need to buy, just find
        5. Eliminate networking concerns
      http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
    • THANK YOU
      Jim Walker, jim@hortonworks.com, @jaymce
        1. Get Hortonworks Data Platform: hortonworks.com/download
        2. Use the getting started guide: hortonworks.com/get-started
        3. Learn more… get support: hortonworks.com/training, hortonworks.com/support