Capturing Big Value in Big Data –
How Use Case Segmentation Drives
Solution Design and Technology
Selection at Deutsche Telekom
Jürgen Urbanski
Vice President Cloud & Big Data Architectures & Technologies, T-Systems
Cloud Leadership Team, Deutsche Telekom
Board Member, BITKOM Big Data & Analytics Working Group
Inserting Hadoop in your organization – value proposition by buying center / stakeholder

[Chart: potential value (low to high) vs. time to value (short to long), by buying center]
• IT Infrastructure – lower storage cost (lowest potential value, shortest time to value)
• IT Applications – lower enterprise data warehouse cost
• LOB – faster customer acquisition, better product development, better quality, lower churn, lower fraud, etc.
• CXO – new business models (highest potential value, longest time to value)
Waves of adoption – crossing the chasm

• Wave 1 – Batch Orientation
  – Adoption today: mainstream, 70% of organizations
  – Example use cases: enterprise log file analysis, ETL offload, active archive, fraud detection, clickstream analytics
  – Response time: hour(s)
  – Data characteristic: volume
  – Architectural characteristic: EDW / RDBMS talk to Hadoop
• Wave 2 – Interactive Orientation
  – Adoption today: early adopters, 20% of organizations
  – Example use cases: forensic analysis, analytic modeling, BI user focus
  – Response time: minutes
  – Architectural characteristic: analytic apps talk directly to Hadoop
• Wave 3 – Real-Time Orientation
  – Adoption today: bleeding edge, 10% of organizations
  – Example use cases: sensor analysis, “Twitterscraping”, telematics, process optimization
  – Response time: seconds
  – Data characteristic: velocity
  – Architectural characteristic: derived data also stored in Hadoop
Data warehouse and ETL offload are promising use cases with immediate ROI

• Data Warehouse Offload
  – Legacy data warehouses are costly, so typically only one year of data can be kept online
  – Older data is stored but “dark”: you cannot swim around in it and explore it
  – With HDFS you can explore it – an active archive
  – A “data refinery” for cases where the massively parallel processing (MPP) solution is saturated performance-wise
• ETL Offload
  – ETL may have more than a dozen steps
  – Many can be offloaded to a Hadoop cluster (see the sketch below)
• Mainframe Offload
  – May have potential
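To make the offload concrete, here is a minimal sketch of a single ETL step expressed as a map-only MapReduce job. The pipe-delimited five-field input schema, the paths and the class names are illustrative assumptions, not from the deck:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EtlCleanseStep {

    /** Map-only task: parse raw pipe-delimited records, drop malformed ones, emit TSV. */
    public static class CleanseMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\\|");
            if (fields.length != 5) {   // hypothetical schema: exactly five fields expected
                return;                 // silently drop malformed records
            }
            ctx.write(NullWritable.get(), new Text(String.join("\t", fields)));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "etl-cleanse-step");
        job.setJarByClass(EtlCleanseStep.class);
        job.setMapperClass(CleanseMapper.class);
        job.setNumReduceTasks(0);       // map-only: no shuffle, no reduce phase
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With zero reduce tasks there is no shuffle, so a cleansing step like this scales with the number of map slots – one reason ETL offload tends to show immediate ROI.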
Big Data is about new application landscapes

• New apps taking advantage of Big Data
  – Rapid app development
  – Bridges back to legacy systems (wrapping with an API, or data integration via federation or data transport)
• New data fabrics for a new IT
  – More data, more sources, more types
  – In ONE place
  – NOSQL databases
• Fast data
  – In real time
  – In context (what, when, who, where)
  – Telemetry / sensor based (serving humans or machines, where you need to reason over data as it comes in, in real time)

These three areas need to come together in a platform:
• Cloud abstraction (so it can run on any private or public cloud – no lock-in)
• Automated deployment and monitoring (rolling upgrades, no patching)
• Various deployment form factors (on-premise as software, on-premise as appliance, in the cloud)
Example application landscape

• Real-time streams (social, sensors) feed Real-Time Processing (S4, Storm, Spark)
• Machine Learning (Mahout, etc.)
• Data Visualization (Excel, Tableau)
• ETL (Informatica, Talend, Spring Integration)
• Real-Time Database (Shark, GemFire, HBase, Cassandra)
• Interactive Analytics (Impala, Greenplum, Aster Data, Netezza, …)
• Hive and Batch Processing (MapReduce)
• Structured and unstructured data (HDFS, MapR)
• Cloud infrastructure: compute, storage, networking

Source: VMware
Reference architecture – high-level view

Layers, top to bottom:
• Presentation
• Application
• Data Processing
• Data Management
• Infrastructure

Cross-cutting concerns: Data Integration, Security, Operations
Reference architecture – component view

• Data Integration: real-time ingestion, batch ingestion, data connectors, metadata services
• Presentation: data visualization and reporting, clients
• Application: analytics apps, transactional apps, analytics middleware
• Data Processing: batch processing, real-time/stream processing, search and indexing
• Data Management: distributed storage (HDFS; sketched below), distributed processing, non-relational DB, structured in-memory
• Infrastructure: virtualization; compute / storage / network
• Security (cross-cutting): data isolation, access management, data encryption
• Operations (cross-cutting): workflow and scheduling, management and monitoring
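As a concrete point of reference for the Data Management layer, a minimal sketch of writing to and reading from distributed storage through the Hadoop FileSystem API; the path is a hypothetical example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/landing/example/record.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello, distributed storage");      // blocks are replicated by HDFS
        }
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
    }
}
```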
Questions to ask in designing a solution for a particular business use case

• What physical infrastructure best fits your needs?
• What are your data placement requirements (service provider data centers or on-premise, jurisdiction)?

Innovation: cheaper storage – but not just storage. Illustrative acquisition cost, from most to least expensive:
• SAN storage: 3-5 €/GB (based on HDS SAN storage)
• NAS filers: 1-3 €/GB (based on NetApp FAS series)
• Enterprise-class Hadoop storage: ??? €/GB (based on NetApp E-Series, NOSH)
• White-box DAS 1): 0.50-1.00 €/GB (hardware can be self-assembled)
• Data cloud 1): 0.10-0.30 €/GB (based on large-scale object storage interfaces)

1) Hadoop offers storage + compute (incl. search). Data cloud offers Amazon S3 and native storage functions.
Questions to ask in designing a solution for a particular business use case

Four deployment profiles, positioned by compute power vs. storage capacity (source: NetApp):

• Enterprise-class Hadoop (high compute, moderate storage)
  – Packaged, ready-to-deploy modular compute/memory-intensive Hadoop cluster
  – Compute-intensive applications, e.g. tick data analysis
  – Extremely tight service level expectations
  – Severe financial consequences if the analytic run is late
• Enterprise-class Hadoop (high compute, high storage)
  – Packaged, ready-to-deploy modular Hadoop cluster
  – The data has intrinsic value $$$
  – Usable capacity must expand faster than compute
  – Higher storage performance
  – Real human consequences if the system fails (threats, treatments, financial losses)
  – System has to allow for asymmetric growth
• Enterprise-class Hadoop (bounded compute, high storage)
  – Bounded-compute-algorithm / memory-intensive Hadoop cluster
  – Compute-intensive applications where additional CPUs do not improve run time
  – Extremely tight service level expectations
  – Severe financial consequences if the analytic run is late
  – Need for deeper storage per datanode
• White-box Hadoop (low compute, low storage)
  – Values associated with early adopters of Hadoop
  – Social media space
  – Contributors to Apache
  – Strong bias to JBOD
  – Skeptical of ALL vendors
Questions to ask in designing a solution for a particular business use case

• Do you run your Hadoop cluster bare-metal or virtualized? Most run bare-metal today, but virtualization helps with…
  – Different failure domains
  – Different hardware pools
  – Development vs. production

Three big types of isolation are required for mixing workloads:
• Resource isolation
  – Control the greedy neighbor
  – Reserve resources to meet needs
• Version isolation (the reckless neighbor)
  – Allow concurrent OS, app and distro versions
  – For instance, test/dev vs. production, high performance vs. low cost
• Security isolation (the nosy neighbor)
  – Provide privacy between users/groups
  – Runtime and data privacy required

Adapted from: VMware; see Apache Hadoop on vSphere, http://www.vmware.com/de/hadoop/serengeti.html
Questions to ask in designing a solution for a particular business use case

• Which distribution is right for your needs today vs. tomorrow?
• Which distribution will ensure you stay on the main path of open source innovation, vs. trap you in proprietary forks?

• Cloudera
  – Widely adopted, mature distribution
  – GTM partners include Oracle, HP, Dell, IBM
• Hortonworks
  – Fully open source distribution (incl. management tools)
  – Reputation for cost-effective licensing
  – Strong developer ecosystem momentum
  – GTM partners include Microsoft, Teradata, Informatica, Talend
• MapR
  – More proprietary distribution, with features that appeal to some business-critical use cases
  – GTM partner AWS (M3 and M5 versions only)
• Pivotal HD
  – Just announced by EMC, very early stage
  – Differentiator is HAWQ – claims 600x query speed improvement, full SQL instruction set

Note: Distributions include more than just the Data Management layer but are discussed at this point in the presentation. Not shown: Intel, Fujitsu and other distributions.
Questions to ask in designing a solution for a particular business use case

• What data sources could be of value (internal vs. external, people- vs. machine-generated)? Follow data privacy rules for people-generated data.
• How much data volume do you have (entry barrier discussion), and of what type (structured, semi-structured, unstructured)?
• What are your data latency requirements (measured in minutes)?

Access interfaces:
• Hadoop APIs for Hadoop applications
• NFS for file-based applications
• REST APIs for internet access
• ODBC (JDBC) for SQL-based applications (see the JDBC sketch below)
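For the ODBC/JDBC path, a minimal sketch of querying Hadoop from a plain SQL-based Java application through HiveServer2's JDBC driver; the endpoint, user and the web_logs table are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcQuery {
    public static void main(String[] args) throws Exception {
        // Explicit driver load for older JDBC setups; hive-jdbc must be on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hive.example.com:10000/default"; // hypothetical endpoint
        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status")) {
            while (rs.next()) {
                System.out.println(rs.getString("status") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```

To the application this looks like any other SQL source; under the covers Hive compiles the query into batch jobs on the cluster, which is why decision latency (next slide) matters when choosing this interface.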
Questions to ask in designing a solution for a particular business use case

• What type of analytics is required (machine learning, statistical analysis)?
• How fast do decisions need to be made (decision latency)?
• Is multi-stage data processing a requirement (before data gets stored)?
• Do you need stream computing and complex event processing (CEP)? If so, do you have strict time-based SLAs? Is data loss acceptable?
• How often does data get updated and queried (real time vs. batch)?
• How tightly coupled are your Hadoop data with existing relational data sets?
• Which non-relational DB suits your needs? HBase and Cassandra work natively on HDFS, while Couchbase and MongoDB work on copies of the data (see the HBase sketch below)

Stay focused on what is possible quickly.
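For teams weighing the non-relational options, a minimal sketch of writing and reading one cell with the HBase Java client, using the current Connection/Table API; the table name, column family and row-key scheme are illustrative assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseEventStore {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for the ZooKeeper quorum etc.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customer_events"))) { // hypothetical table

            // Row-key design: entity id + timestamp keeps one customer's events contiguous.
            byte[] rowKey = Bytes.toBytes("cust42#20130601T120000");
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("type"), Bytes.toBytes("login"));
            table.put(put);

            Result result = table.get(new Get(rowKey));
            byte[] type = result.getValue(Bytes.toBytes("e"), Bytes.toBytes("type"));
            System.out.println("event type: " + Bytes.toString(type));
        }
    }
}
```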
Innovations: Store first, ask questions later

Parallel processing (scale out) moves you from legacy BI toward the “Hadoop” ecosystem:

• Legacy BI
  – Business problem: backward-looking analysis, using data out of business applications
  – Selected vendors: SAP BusinessObjects, IBM Cognos, MicroStrategy
  – Data type / scalability: structured; limited (2-3 TB in RAM)
• High-performance BI
  – Business problem: quasi-real-time analysis, using data out of business applications
  – Selected vendors: Oracle Exadata, SAP HANA
  – Data type / scalability: structured; limited (2-8 TB in RAM)
  – Legacy and high-performance BI together reflect the legacy vendor definition of big data
• “Hadoop” ecosystem
  – Business problem: forward-looking predictive analysis; questions defined in the moment, using data from many sources
  – Selected vendors: Hadoop distributions
  – Data type / scalability: structured or unstructured; unlimited (20-30 PB) – “true” big data
Questions to ask in designing a solution for a particular business use case

• Is backup and recovery critical (number of copies in the HDFS cluster)? A replication sketch follows this list.
• Do you need disaster recovery on the raw data?
• How do you optimize TCO over the lifetime of a cluster?
• How do you ensure the cluster remains balanced and performing well as the underlying hardware pool becomes heterogeneous?
• What are the implications of a migration between different distributions, or versions of one distribution? Can you do rolling upgrades to minimize disruption?
• What level of multi-tenancy do you implement? Even within the enterprise, one general-purpose Hadoop cluster might serve different legal entities / BUs.
• How do you bring along existing talent? E.g., train developers on Pig, database admins on Hive, IT operations on the platform.
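On the number-of-copies question, a minimal sketch of where the copy count lives in HDFS: the cluster-wide default (dfs.replication) plus a per-file override; the file path is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationPolicy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default copy count, set in hdfs-site.xml (dfs.replication, usually 3).
        System.out.println("default replication: " + conf.get("dfs.replication", "3"));

        FileSystem fs = FileSystem.get(conf);
        Path criticalFile = new Path("/data/raw/ledger.tsv"); // hypothetical path
        // Per-file override: keep an extra copy of data that is expensive to re-ingest.
        boolean ok = fs.setReplication(criticalFile, (short) 4);
        System.out.println("replication raised to 4: " + ok);
    }
}
```

Note that replication protects against node and disk failures within one cluster; disaster recovery on the raw data (the next bullet) still needs a copy outside the cluster.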
Navigating the broader BI and big data vendor
ecosystem can be confusing
Do you really need Hadoop?

• Is your data structured and less than 10 TB?
• Is your data structured, less than 100 TB, but tightly integrated with your existing data?
• Is your data structured, more than 100 TB, but processing has to occur in real time with less than a minute of latency?*

Then you could stay with legacy BI landscapes, including RDBMS, MPP DB and EDW.

Otherwise: come and join us on a journey into Hadoop-based solutions!

* Hadoop is making rapid progress in the real-time arena
Use Hadoop for VOLUME (illustrative, not exhaustive)

• You require parallel / complex data processing power and you can live with minutes or more of latency to derive reports
• You need data storage and indexing for analytic applications

Building blocks shown on the slide: Platform; Data Transformation; MapReduce. A minimal aggregation job follows.
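A minimal sketch of the kind of batch aggregation this implies: a MapReduce job that counts records per key, with a combiner to cut shuffle volume. The tab-separated input with the grouping key in the first column is an illustrative assumption:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

public class ReportAggregate {

    /** Emits (grouping key, 1) per record; the first TSV column is the hypothetical key. */
    public static class KeyMapper extends Mapper<Object, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(Object offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t", 2);
            ctx.write(new Text(fields[0]), ONE);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "report-aggregate");
        job.setJarByClass(ReportAggregate.class);
        job.setMapperClass(KeyMapper.class);
        job.setCombinerClass(LongSumReducer.class); // pre-aggregate map-side to cut shuffle volume
        job.setReducerClass(LongSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```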
Use Hadoop for VARIETY (illustrative, not exhaustive)

• Your data is multi-structured
• You want to derive reports in batch on full data sets
• You have complex data flows or multi-stage data pipelines (see the chained-job sketch after this slide)

Building blocks shown on the slide: Workflow Management; Data Transformation; MapReduce; Data Visualization and Reporting; Low-Latency Data Access*

* HBase and Cassandra work natively on HDFS, while Couchbase and MongoDB work on copies of the data
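A minimal sketch of a multi-stage pipeline as a driver that chains two MapReduce jobs, where stage 2 consumes the dataset stage 1 materializes in HDFS. The per-stage logic is a deliberately trivial placeholder; in practice a workflow manager such as Apache Oozie would express these dependencies declaratively:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

public class TwoStagePipeline {

    /** Stage 1 (map-only): normalize raw lines and drop empties. Placeholder logic. */
    public static class NormalizeMapper extends Mapper<Object, Text, NullWritable, Text> {
        @Override
        protected void map(Object k, Text line, Context ctx)
                throws IOException, InterruptedException {
            String s = line.toString().trim().toLowerCase();
            if (!s.isEmpty()) ctx.write(NullWritable.get(), new Text(s));
        }
    }

    /** Stage 2 mapper: count occurrences of each normalized record. */
    public static class CountMapper extends Mapper<Object, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(Object k, Text line, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(line), ONE);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path raw = new Path(args[0]), staged = new Path(args[1]), report = new Path(args[2]);

        Job stage1 = Job.getInstance(conf, "stage-1-normalize");
        stage1.setJarByClass(TwoStagePipeline.class);
        stage1.setMapperClass(NormalizeMapper.class);
        stage1.setNumReduceTasks(0);                       // map-only cleansing stage
        stage1.setOutputKeyClass(NullWritable.class);
        stage1.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(stage1, raw);
        FileOutputFormat.setOutputPath(stage1, staged);
        if (!stage1.waitForCompletion(true)) System.exit(1); // gate: stage 2 needs stage 1's output

        Job stage2 = Job.getInstance(conf, "stage-2-count");
        stage2.setJarByClass(TwoStagePipeline.class);
        stage2.setMapperClass(CountMapper.class);
        stage2.setReducerClass(LongSumReducer.class);
        stage2.setOutputKeyClass(Text.class);
        stage2.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(stage2, staged);      // consumes stage 1's HDFS output
        FileOutputFormat.setOutputPath(stage2, report);
        System.exit(stage2.waitForCompletion(true) ? 0 : 1);
    }
}
```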
Use Hadoop for VELOCITY (illustrative, not exhaustive)

• You are inundated with a flood of real-time data: numerous live feeds from multiple data sources like machines, business systems or internet sources
• You want to derive reports in (near) real time on a sample or full data sets

Building blocks shown on the slide: Data Ingestion (Apache Kafka – sketched below); Data Visualization and Reporting; Fast Analytics (Shark)*

* May also use an MPP database
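A minimal sketch of the ingestion side using Kafka's Java producer client; the broker address, topic name and JSON payload are hypothetical assumptions:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorFeedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1.example.com:9092"); // hypothetical broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by device id keeps each device's events ordered within one partition,
            // so downstream stream processors can reason over them in arrival order.
            producer.send(new ProducerRecord<>(
                    "sensor-events", "device-17", "{\"tempC\": 21.4}"));
        } // close() flushes any buffered records
    }
}
```

Downstream, a stream processor (Storm, Spark, S4 on the earlier landscape slide) consumes the topic, while the same feed can be landed in HDFS for full-data-set batch reports.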
Where to start inserting Hadoop in your company? A call to action…

For IT Infrastructure and IT Applications – accelerating implementation:
• Solution design driven by target use cases
• Reference architecture
• Technology selection and POC
• Implementation lessons learnt

For LOB and CXO – understanding Big Data:
• Definition
• Benefits over adjacent and legacy technologies
• Current mode vs. future mode for analytics
…and assessing the economic potential:
• Target use cases by function and industry
• Best approach to adoption

Puddles and pools – AVOID: systems separated by workload type due to contention.
Lakes and oceans – GOAL: a platform that natively supports mixed workloads as a shared service.

Don't be Hadooped when looking for Big Data ROI


Editor's Notes

  • #5 (Big Data is about new application landscapes) Automated deployment and monitoring: the cloud infrastructure has to provide ten “verbs” so that the apps don't have to know anything about the infrastructure. The philosophy is no patching and rolling upgrades: the platform constantly compares what the app needs with what the cloud provides.
  • #7 and #8 (reference architecture slides; the original note is duplicated verbatim for both) Key components by layer:
    – Ambari: Apache Ambari is a monitoring, administration and lifecycle management project for Apache Hadoop clusters. Hadoop clusters require many inter-related components that must be installed, configured and managed across the entire cluster.
    – ZooKeeper: a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. ZooKeeper is used heavily by many distributed applications such as HBase.
    – HBase: the distributed Hadoop database, scalable and able to collect and store big data volumes on HDFS. This class of database is often categorized as NoSQL (Not only SQL).
    – Pig: Apache Pig is a platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs coupled with infrastructure for evaluating those programs.
    – Hive: a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries and the analysis of large datasets stored in Hadoop-compatible file systems. Hive projects structure onto this data and queries it with a SQL-like language called HiveQL, while still letting traditional map/reduce programmers plug in custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.
    – HCatalog: Apache HCatalog is a table and storage management service for data created using Apache Hadoop; it provides deep integration into enterprise data warehouses (e.g. Teradata) and with data integration tools such as Talend.
    – MapReduce: Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
    – HDFS: the Hadoop Distributed File System is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid parallel computations.
    – Talend Open Studio for Big Data: a 100% open source graphical code generator for ETL (Extract Transform Load) and ELT (Extract Load Transform) data movement and cleansing in and out of Hadoop. HDP integrates Talend Open Studio for Big Data, the leading open source data integration platform for Apache Hadoop, including a visual development environment and hundreds of pre-built connectors to leading applications that let you connect to any data source without writing code. HDP also includes HCatalog for centralized metadata services, which simplifies data sharing both between Hadoop applications on the platform and between Hadoop and other enterprise data systems; HDP's open metadata infrastructure also enables deep integration with third-party tools.