Hadoop & Hortonworks
Open Source Wild Fire
November 2012
OW2 Con




© Hortonworks Inc. 2012   Page 1
Big data changes the game

Transactions + Interactions + Observations = BIG DATA

[Figure: nested data tiers. Vertical axis: data volume, Megabytes to Petabytes. Horizontal axis: Increasing Data Variety and Complexity.]

• ERP (Megabytes): Purchase detail, Purchase record, Payment record
• CRM (Gigabytes): Segmentation, Customer Touches, Support Contacts, Offer details
• WEB (Terabytes): Web logs, A/B testing, Behavioral Targeting, Dynamic Pricing, Search Marketing, Affiliate Networks, Dynamic Funnels, Offer history
• BIG DATA (Petabytes): Mobile Web, Sentiment, User Click Stream, SMS/MMS, Speech to Text, Social Interactions & Feeds, Spatial & GPS Coordinates, Sensors / RFID / Devices, Business Data Feeds, External Demographics, User Generated Content, HD Video, Audio, Images, Product/Service Logs


Big Data: Optimize Outcomes at Scale
           Sports           optimize                       Championships
      Intelligence          optimize                       Detection
         Finance            optimize                       Algorithms
      Advertising           optimize                       Performance
            Fraud           optimize                       Prevention
Retail / Wholesale          optimize                       Inventory turns
   Manufacturing            optimize                       Supply chains
       Healthcare           optimize                       Patient outcomes
       Education            optimize                       Learning outcomes
     Government             optimize                       Citizen services
                                        Source: Geoffrey Moore. Hadoop Summit 2012 keynote presentation.

Apache Hadoop

Open Source data management with scale-out storage & distributed processing

Storage: HDFS
             •   Distributed across “nodes”
             •   Natively redundant
             •   Name node tracks locations

Processing: Map Reduce
             •   Splits a task across processors “near” the data & assembles results
             •   Self-Healing, High Bandwidth Clustered Storage

Key Characteristics
• Scalable
    – Efficiently store and process petabytes of data
    – Linear scale driven by additional processing and storage
• Reliable
    – Redundant storage
    – Failover across nodes and racks
• Flexible
    – Store all types of data in any format
    – Apply schema on analysis and sharing of the data
• Economical
    – Use commodity hardware
    – Open source software guards against vendor lock-in
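
The "splits a task across processors and assembles results" idea can be illustrated with a small, single-process sketch. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration, and real MapReduce runs the map and reduce steps on many nodes close to the data blocks.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: assemble one result per key by summing its values."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(splits):
    """Run map on each split independently, then shuffle and reduce."""
    mapped = chain.from_iterable(map_phase(split) for split in splits)
    return reduce_phase(shuffle(mapped))


# Hypothetical input: each inner list stands in for one block of a file.
splits = [
    ["big data changes the game"],
    ["hadoop stores big data", "hadoop processes data"],
]
counts = word_count(splits)
print(counts)
```

Because each split is mapped independently, the map step is the part a cluster can spread across nodes; only the shuffle needs to move data between them.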


What is a Hadoop “Distribution”

A complementary set of open source technologies that make up a complete data platform

Components shown: Talend, WebHDFS, Sqoop, Flume, HCatalog, HBase, Pig, Hive, MapReduce, HDFS, Ambari, Oozie, HA, ZooKeeper




• Tested and pre-packaged to ease installation and usage
• Collects the right versions of the components that all have different
  release cycles and ensures they work together




Hadoop in Enterprise Data Architectures

[Figure: Hadoop platform positioned between big data sources and consuming systems]

• Existing Business Infrastructure: IDE & Dev Tools, ODS & Datamarts, Applications & Spreadsheets, Visualization & Intelligence, Discovery Tools, EDW
• Web: Web Applications, Low Latency/NoSQL
• New Tech: Datameer, Tableau, Karmasphere, Splunk
• Operations: Custom, Existing
• Hadoop platform: Templeton, WebHDFS, Sqoop, Flume, HCatalog, HBase, Pig, Hive, MapReduce, HDFS, Ambari, Oozie, HA, ZooKeeper
• Big Data Sources (transactions, observations, interactions): CRM, ERP, financials, Social Media, Exhaust Data, logs, files


Apache Hadoop & Big Data Use Cases

                         Big Data
         Transactions, Interactions, Observations




         Refine             Explore         Enrich




              Business Case

Operational Data Refinery
Hadoop as platform for ETL modernization
                                                                          Refine   Explore   Enrich

[Figure: sources (Unstructured, Log files, DB data) feed the Refinery (Capture and archive, Parse & Cleanse, Structure and join, Upload), which feeds the Enterprise Data Warehouse]

Capture
• Capture new unstructured data along with log files, all alongside existing sources
• Retain inputs in raw form for audit and continuity purposes
Process
• Parse the data & cleanse
• Apply structure and definition
• Join datasets together across disparate data sources
Exchange
• Push to existing data warehouse for downstream consumption
• Feeds operational reporting and online systems
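
The capture, process and exchange steps above can be sketched in a few lines of plain Python. Everything here is hypothetical and chosen for illustration: the log format, the `CUSTOMERS` lookup standing in for DB data, and the `parse` and `refine` helpers. A real refinery would run equivalent logic in Pig, Hive or MapReduce over files in HDFS.

```python
# Hypothetical raw inputs: web log lines plus a small lookup standing in for DB data.
RAW_LOG_LINES = [
    "2012-11-01 u1 /home",
    "2012-11-01 u2 /pricing",
    "malformed line",
    "2012-11-02 u1 /pricing",
]
CUSTOMERS = {"u1": "Alice", "u2": "Bob"}


def parse(line):
    """Parse one log line into a structured record, or None if it is malformed."""
    parts = line.split()
    if len(parts) != 3:
        return None
    day, user_id, path = parts
    return {"day": day, "user_id": user_id, "path": path}


def refine(raw_lines, customers):
    """Capture raw input, parse and cleanse it, then join it with customer data."""
    archive = list(raw_lines)  # capture: keep the inputs in raw form
    parsed = [rec for rec in map(parse, archive) if rec is not None]  # parse & cleanse
    joined = [
        dict(rec, customer=customers[rec["user_id"]])  # structure and join
        for rec in parsed
        if rec["user_id"] in customers
    ]
    return archive, joined


archive, rows = refine(RAW_LOG_LINES, CUSTOMERS)
for row in rows:
    print(row)  # in the slide's flow, these rows would be uploaded to the warehouse
```

Keeping `archive` untouched mirrors the "retain inputs in raw form" bullet: the cleansing rules can change later and be replayed against the original data.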



Big Data Exploration & Visualization
  Hadoop as agile, ad-hoc data mart
                                                                                   Refine   Explore   Enrich

[Figure: sources (Unstructured, Log files, DB data) feed Explore (Capture and archive, Structure and join, Categorize into tables), which connects via upload and JDBC / ODBC to Visualization Tools and, optionally, an EDW / Datamart]

Capture
• Capture multi-structured data and retain inputs in raw form for iterative analysis
Process
• Parse the data into queryable format
• Explore & analyze using Hive, Pig, Mahout and other tools to discover value
• Label data and type information for compatibility and later discovery
• Pre-compute stats, groupings, patterns in data to accelerate analysis
Exchange
• Use visualization tools to facilitate exploration and find key insights
• Optionally move actionable insights into EDW or datamart
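
As a rough illustration of "parse into queryable format" and "pre-compute stats, groupings", the sketch below loads a few parsed records into a table and materializes a summary with SQL. It uses Python's built-in `sqlite3` module purely as a stand-in for a Hive table; the table names, columns and sample rows are assumptions, not part of the original deck.

```python
import sqlite3

# Hypothetical parsed records: (user_id, path, duration_ms).
records = [
    ("u1", "/home", 120),
    ("u2", "/pricing", 340),
    ("u1", "/pricing", 200),
    ("u3", "/home", 80),
]

conn = sqlite3.connect(":memory:")

# Categorize into a table with labeled, typed columns.
conn.execute(
    "CREATE TABLE page_views (user_id TEXT, path TEXT, duration_ms INTEGER)"
)
conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", records)

# Pre-compute a grouping once so later exploration reads the small summary table.
conn.execute(
    "CREATE TABLE path_stats AS "
    "SELECT path, COUNT(*) AS views, AVG(duration_ms) AS avg_ms "
    "FROM page_views GROUP BY path"
)

stats = {
    path: (views, avg_ms)
    for path, views, avg_ms in conn.execute(
        "SELECT path, views, avg_ms FROM path_stats ORDER BY path"
    )
}
print(stats)
```

A visualization tool connected over JDBC / ODBC would query a summary like `path_stats` rather than rescanning the raw data on every request.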
Application Enrichment
Deliver Hadoop analysis to online apps
                                                                                    Refine   Explore     Enrich

[Figure: sources (Unstructured, Log files, DB data) feed Enrich (Capture, Parse, Derive/Filter), which pushes results on a scheduled and near-real-time basis to a low-latency NoSQL store such as HBase, which serves Online Applications]

Capture
• Capture data that was once too bulky and unmanageable
Process
•   Uncover aggregate characteristics across data
•   Use Hive, Pig and MapReduce to identify patterns
•   Filter useful data from mass streams (Pig)
•   Micro or macro batch oriented schedules
Exchange
• Push results to HBase or other NoSQL alternative for real time delivery
• Use patterns to deliver right content/offer to the right person at the right time
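
The enrichment flow above (derive a pattern in batch, push it to a low-latency store, read it from the online application) can be sketched as follows. The `KeyValueStore` class is a hypothetical in-memory stand-in for HBase or another NoSQL store, not a real client API, and the event data and function names are likewise assumptions.

```python
from collections import Counter


class KeyValueStore:
    """Hypothetical in-memory stand-in for HBase or another low-latency store."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key, default=None):
        return self._rows.get(key, default)


def derive_top_category(events):
    """Batch step: find the most frequently viewed category for each user."""
    per_user = {}
    for user_id, category in events:
        per_user.setdefault(user_id, Counter())[category] += 1
    return {user: counts.most_common(1)[0][0] for user, counts in per_user.items()}


def run_batch(events, store):
    """Scheduled job: push the derived results to the serving store."""
    for user_id, category in derive_top_category(events).items():
        store.put(user_id, category)


def offer_for(user_id, store):
    """Online path: a single key lookup, with a fallback for unknown users."""
    return store.get(user_id, "generic")


# Hypothetical click events: (user_id, product category).
events = [("u1", "shoes"), ("u1", "shoes"), ("u1", "books"), ("u2", "books")]
store = KeyValueStore()
run_batch(events, store)
print(offer_for("u1", store), offer_for("u2", store), offer_for("u3", store))
```

The split matters: the heavy aggregation runs on a micro or macro batch schedule, while the online application only ever performs a cheap lookup.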




Balancing Innovation & Stability

• Hadoop is “pre-chasm”
• Ecosystem still evolving
• Enterprises endure 1-3 year adoption cycle

[Figure: technology adoption curve. Vertical axis: relative % customers. Horizontal axis: time.]

• Innovators, technology enthusiasts
• Early adopters, visionaries
• The CHASM
• Early majority, pragmatists
• Late majority, conservatives
• Laggards, skeptics

Before the chasm: Customers want technology & performance
After the chasm: Customers want solutions & convenience

Source: Geoffrey Moore - Crossing the Chasm



What Hortonworks does…

                              We believe that by the end of 2015,
                              more than half the world's data will be
                              processed by Apache Hadoop.

  Strategy: invest in Apache Hadoop to make it “The enterprise big data platform”


Distribution
• Hortonworks Data Platform (HDP)
• Enterprise Ready, Stable, Reliable, Tested
• 100% open source
• Built by the architects, builders and operators of Apache Hadoop

Ecosystem
• Enable an Ecosystem of Big Data Apps
• Our goal is to make sure all your tools work WITH Hadoop
• HDP is Hadoop for
      • Microsoft
      • Teradata

Support
• Deliver highest quality support and expertise
• Access to Apache Hadoop Experts
• Hadoop training and certification by the Hadoop experts (web, public, private)



AMSTERDAM
         March 20-21, 2013
         Enabling the Next Generation
         Enterprise Data Platform
           • LEARN: Dozens of Sessions
           • INTERACT: Community Focused Event




Register today! @ hadoopsummit.org

Hadoop's Role in the Big Data Architecture, OW2con'12, Paris
