Deutsche Telekom on Big Data



Extracting value from Big Data is not easy. The field of technologies and vendors is fragmented and rapidly evolving. End-to-end, general-purpose solutions that work out of the box don’t exist yet, and Hadoop is no exception. Most companies also lack Big Data specialists. The key to unlocking real value lies in thinking smart and hard about the business requirements for a Big Data solution. There is a long list of crucial questions to consider. Is Hadoop really the best solution for all Big Data needs? Should companies run a Hadoop cluster on expensive enterprise-grade storage, or on cheap commodity servers? Should the chosen infrastructure be bare metal or virtualized? The picture becomes even more confusing at the analysis and visualization layer. The answer to Big Data ROI lies somewhere between the herd and the nerd mentality. Thinking hard and being smart about each use case as early as possible avoids costly mistakes in choosing hardware and software. This talk illustrates how Deutsche Telekom follows this segmentation approach to ensure that every individual use case drives architecture design and the selection of technologies and vendors.

Published in: Technology, Business


  • Big Data = Transactions + Interactions + Observations. Transactions are pretty simple to understand: this is our ERP data, the data we maintain and track in our OLTP systems. It can be any record of any system-to-system or human-to-system interaction, or even a human-to-human interaction, as long as it is captured electronically. We use a lot of this data in our analytics today. Interactions are the points in time at which we relate with a system: a tweet, a Facebook post, an electronic or paper customer satisfaction survey, web logs and A/B tests. We have a lot of this data but typically no efficient way to understand or extract value from it. Observations are interesting because they represent a world of net-new data sources that we once never thought of analyzing. This is data that was once thought of as low-to-medium value, or even exhaust data that was too bulky and just too expensive to store: machine-generated data from sensors, web logs and clickstreams, even audio/video or largely unstructured content. Typically, we never even thought of this data before.
  • Layers: Presentation Layer; Application Layer; Data Processing Layer; Infrastructure Layer; Data Ingestion Layer; Security Layer; Management & Monitoring Layer.
Ambari: Apache Ambari is a monitoring, administration, and lifecycle-management project for Apache Hadoop clusters. Hadoop clusters require many inter-related components that must be installed, configured, and managed across the entire cluster.
ZooKeeper: ZooKeeper is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. ZooKeeper is utilized significantly by many distributed applications such as HBase.
HBase: HBase is the distributed Hadoop database, scalable and able to collect and store big data volumes on HDFS. This class of database is often categorized as NoSQL (Not only SQL).
Pig: Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data-analysis programs, coupled with infrastructure for evaluating these programs.
Hive: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. The same language also allows traditional map/reduce programmers to plug in custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.
HCatalog: Apache HCatalog is a table and storage management service for data created using Apache Hadoop; it provides deep integration with enterprise data warehouses (e.g. Teradata) and with data-integration tools such as Talend.
MapReduce: Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
HDFS: The Hadoop Distributed File System is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them across compute nodes throughout a cluster to enable reliable, extremely rapid parallel computations.
  • Talend Open Studio for Big Data: a 100% open-source code generator with a graphical user interface for ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) data movement and cleansing in and out of Hadoop.
Data Integration Services: HDP integrates Talend Open Studio for Big Data, the leading open-source data-integration platform for Apache Hadoop. Included is a visual development environment and hundreds of pre-built connectors to leading applications that allow you to connect to any data source without writing code.
Centralized Metadata Services: HDP includes HCatalog, a metadata and table management system that simplifies data sharing both between Hadoop applications running on the platform and between Hadoop and other enterprise data systems. HDP’s open metadata infrastructure also enables deep integration with third-party tools.
  • Line of Business: demands a 360° view of customer, employee, market, etc., but cannot be certain about what matters for analysis.
Business Analysts: need to incorporate more data into analysis while LOBs are not sure what matters; want to reuse existing skill sets.
Data Warehouse Owners: must efficiently store, process, organize, and deliver massive and growing data volume and variety while meeting SLAs.
IT Management: drive innovation, reduce costs, meet growing analytic demands of LOBs, mitigate the risk of adopting new technology.
System Administrators: ensure stability and reliability of systems.
Buyers: VP Analytics; VP/Director Business Intelligence; VP/Director Data Warehousing/Management; VP/Director Infrastructure; VP/Director Operations/IT Systems.
Benefits: faster customer acquisition, better product development, better quality, lower churn.
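The MapReduce model described in the notes above (mappers emitting key/value pairs, a shuffle/sort, then reducers aggregating per key) can be sketched in a few lines of Python. This is a local simulation in the style of a Hadoop Streaming word count, not Deutsche Telekom's code; on a real cluster the mapper and reducer would be separate scripts reading stdin.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: emit (word, 1) for every word in an input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Reduce phase: sum the counts for one word."""
    return (word, sum(counts))

def run_job(lines):
    """Simulate map -> shuffle/sort -> reduce locally."""
    mapped = [kv for line in lines for kv in mapper(line)]
    mapped.sort(key=itemgetter(0))          # the shuffle/sort phase
    return dict(reducer(key, (c for _, c in group))
                for key, group in groupby(mapped, key=itemgetter(0)))
```

For example, `run_job(["big data", "big clusters"])` yields a count of 2 for "big" and 1 each for "data" and "clusters". The same shape scales out because each mapper sees only its input split and each reducer sees only its key range.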

    1. Deutsche Telekom Perspective on HADOOP and Big Data Technologies. Gregory Smith, VP Solution Design and Emerging Technologies and Architectures, T-Systems North America
    2. Deutsche Telekom and T-Systems Key Stats  Deutsche Telekom is Europe’s largest telecom service provider – Revenue: $75 billion – Employees: 232,342  T-Systems is the enterprise division of Deutsche Telekom – Revenue: $13 billion – Employees: 52,742 – Services: data center, end user computing, networking, systems integration, cloud and big data
    3. Overwhelmed by new data types? Big Data = Transactions, Interactions, Observations: sentiment data, call detail records (CDRs), sensor-/machine-based data, clickstream data
    4. 80% of new data in 2015 will land on Hadoop! Hadoop is like a data warehouse, but it can store more data, more kinds of data, and perform more flexible analyses. Hadoop is open source and runs on industry-standard hardware, so it's 1-2 orders of magnitude more economical than conventional data warehouse solutions. Hadoop provides more cost-effective storage, processing, and analysis; some existing workloads run faster, cheaper, better. Hadoop can deliver a foundation for profitable growth: gain value from all your data by asking bigger questions
    5. Reference architecture view of Hadoop: Clients; Presentation (data visualization and reporting); Application (analytics apps, transactional apps, analytics middleware); Data Processing (batch processing, real-time/stream processing, search and indexing); Data Management (distributed processing / MapReduce, non-relational DB, structured, in-memory, distributed storage / HDFS); Data Integration (real-time ingestion, batch ingestion, data connectors, metadata services); Infrastructure (virtualization; compute / storage / network); Operations (workflow and scheduling, management and monitoring); Security (data isolation, access management, data encryption). Legend: Hadoop Core, Hadoop Projects, Adjacent Categories
    6. Example application landscape: Data Visualization (Excel, Tableau); Interactive Analytics (Impala, Greenplum, AsterData, Netezza…); Machine Learning (Mahout, etc.); Real-Time Database (Shark, GemFire, HBase, Cassandra); HIVE; Batch Processing (Map-Reduce); Real-Time Processing (S4, Storm, Spark); ETL (Informatica, Talend, Spring Integration); Real-Time Streams (social, sensors); Structured and Unstructured Data (HDFS, MapR); Cloud Infrastructure (compute, storage, networking). Source: VMware
    7. Disruptive innovations in Big Data. Traditional Database vs. HADOOP / NoSQL Database / MPP Analytics Data Warehouse:
Schema: pre-defined, fixed, required on write vs. required on read (store first, ask questions later).
Processing: no or limited data processing vs. processing coupled with data, parallel processing / scale out.
Data types: structured vs. any, including unstructured.
Physical infrastructure: enterprise grade, mission critical vs. commodity as an option, much cheaper storage.
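The schema-on-write vs. schema-on-read contrast above ("store first, ask questions later") can be made concrete with a small sketch. Plain Python stands in for a Hadoop job here, and the record fields (`user`, `ms`, `page`) are illustrative, not from the deck.

```python
import json

# Schema-on-write (traditional DB): rows must fit a fixed schema at load time.
# Schema-on-read (Hadoop): store raw records first, apply structure later,
# at query time, and only for the fields the question needs.

RAW_STORE = [                        # stored as-is, no upfront schema
    '{"user": "a", "ms": 120, "page": "/home"}',
    '{"user": "b", "ms": 340}',      # missing field -- still accepted at load
    'not even json',                 # malformed -- still accepted at load
]

def query_avg_latency(raw_records):
    """Apply structure on read: keep only records that parse and have 'ms'."""
    values = []
    for rec in raw_records:
        try:
            doc = json.loads(rec)
        except ValueError:
            continue                 # bad records are skipped at read time,
        if "ms" in doc:              # not rejected at load time
            values.append(doc["ms"])
    return sum(values) / len(values)
```

A traditional warehouse would have rejected the second and third records at load; here they cost nothing until a query actually needs them, which is exactly what makes it cheap to retain data of unknown future value.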
    8. Business problem, technology, and solution by segment:
Legacy BI: backward-looking analysis using data out of business applications. Selected vendors: SAP Business Objects, IBM Cognos, MicroStrategy. Data type/scalability: structured; limited (2–3 TB in RAM).
High Performance BI (legacy vendor definition of big data): quasi-real-time analysis using data out of business applications. Selected vendors: Oracle Exadata, SAP HANA. Data type/scalability: structured; limited (2–8 TB in RAM).
“Hadoop” Ecosystem (“true” big data): forward-looking predictive analysis; questions defined in the moment, using data from many sources. Hadoop distributions; no ACID transactions; limited SQL set (joins). Data type/scalability: structured or unstructured; unlimited (20–30 PB).
Innovations: Hadoop is 100x cheaper per TB than in-memory appliances like HANA and handles unstructured data as well
    9. Innovations: store first, ask questions later. Illustrative acquisition cost per GB:
SAN storage: 3–5 €/GB (based on HDS SAN storage).
NAS filers: 1–3 €/GB (based on NetApp FAS series).
White box DAS1): 0.50–1.00 €/GB (hardware can be self-assembled).
Data cloud1): 0.10–0.30 €/GB (based on large-scale object storage interfaces).
Enterprise-class Hadoop storage: ??? €/GB (based on NetApp E-Series (NOSH)).
1) Hadoop offers storage + compute (incl. search). Data Cloud offers Amazon S3 and native storage functions.
Much cheaper storage, but not just storage…
    10. Target use cases, by stakeholder (IT Infrastructure & Operations; Business Intelligence & Data Warehousing; Line of Business & Business Analysts; CXO), plotted from shorter to longer time to value and from lower to higher potential value:  Lower Cost Storage  Enterprise Data Lake  Enterprise Data Warehouse Offload  Enterprise Data Warehouse Archive  ETL Offload  Capacity Planning & Utilization  Customer Profiling & Revenue Analytics  Targeted Advertising Analytics  Service Renewal Implementation  CDR-based Data Analytics  Fraud Management  New Business Models. Cost-effective storage, processing, and analysis: a foundation for profitable growth
    11. Enterprise data warehouse offload use case. The Challenge:  Many EDWs are at capacity  Running out of budget before running out of relevant data  Older data archived “in the dark”, not available for exploration. The Solution:  Hadoop for data storage and processing: parse, cleanse, apply structure and transform  Free the EDW for valuable queries  Retain all data for analysis! Before: the data warehouse carries storage & processing – operational (44%), ETL processing (42%), analytics (11%). After: Hadoop takes over storage and processing at 1/10th the cost, leaving the data warehouse to operational (50%) and analytics (50%) workloads.
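The "parse, cleanse, apply structure and transform" step that the slide moves from the EDW to Hadoop might look like the sketch below. This is a local stand-in (in practice it would be a Pig, Hive, or MapReduce job); the semicolon-delimited sample data and field names are invented for illustration.

```python
import csv
import io

# Hypothetical raw feed: inconsistent whitespace and an unparseable value,
# exactly the kind of input an ETL-offload job must cleanse before loading.
RAW = """customer;revenue
 alice ;1200
bob;N/A
carol;800
"""

def parse_cleanse_transform(raw_text):
    """Parse semicolon-delimited text, drop rows with bad revenue,
    and normalize customer names -- structure applied in Hadoop,
    so only clean, queryable rows reach the warehouse."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=";")
    cleaned = []
    for row in reader:
        try:
            revenue = int(row["revenue"])
        except ValueError:
            continue                        # cleanse: drop unparseable rows
        cleaned.append({"customer": row["customer"].strip().title(),
                        "revenue": revenue})
    return cleaned
```

Running this over `RAW` keeps Alice and Carol, drops Bob's unparseable row, and the warehouse never spends cycles on the transformation, which is the 42% of EDW load the slide wants to offload.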
    12. From data puddles and ponds to lakes and oceans. GOAL: a platform that natively supports mixed workloads (batch, interactive, online) as a shared service, refining, exploring, and enriching Big Data (transactions, interactions, observations). AVOID: systems separated by workload type (Big Data BU1, BU2, BU3) due to contention
    13. Questions to ask in designing a solution for a particular business use case:  Which distribution is right for your needs today vs. tomorrow?  Which distribution will ensure you stay on the main path of open-source innovation, vs. trap you in proprietary forks? Note: distributions include more than just the Data Management layer but are discussed at this point in the presentation. Not shown: Intel, Fujitsu, and other distributions. Distribution profiles:  Widely adopted, mature distribution; GTM partners include Oracle, HP, Dell, IBM  Fully open-source distribution (incl. management tools); reputation for cost-effective licensing; strong developer-ecosystem momentum; GTM partners include Microsoft, Teradata, Informatica, Talend  More proprietary distribution with features that appeal to some business-critical use cases; GTM partner AWS (M3 and M5 versions only)  Just announced by EMC, very early stage; differentiator is HAWQ, which claims manifold query-speed improvement and a full SQL instruction set
    14. Common objections to Hadoop:  We don’t have big data problems  We don’t have petabytes of data  We can’t justify the budget for a new project  We don’t have the skills  We’re not sure Hadoop is mature/secure/enterprise-ready  We already have a scale-out strategy for our EDW/ETL
    15. MYTH: Big Data means “petabytes”.  Not just volume – remember variety and velocity  Plenty of issues at smaller scales – data processing, unstructured data  Often warehouse volumes are small because the technology is expensive, not because there is no relevant data  Scalability is about growing with the business, affordably and predictably. MYTH: Big Data means data science.  Hadoop solves existing problems faster, better, cheaper than conventional technology, e.g. – a landing zone for capturing and refining multi-structured data types with unknown future value – a cost-effective platform for retaining lots of data for long periods of time  Walk before you run  Big Data is a state of mind. Every organization has data problems! Hadoop can help…
    16. Waves of adoption – crossing the chasm:
Wave 1, batch orientation: mainstream, 70% of organizations today*. Example use case – refine: archival and transformation. Response time: hour(s). Data characteristic: volume. Architectural characteristic: EDW/RDBMS talks to Hadoop. Example technologies: MapReduce, Pig, Hive.
Wave 2, interactive orientation: early adopters, 20% of organizations. Example use case – explore: query and visualization. Response time: minutes. Architectural characteristic: analytic apps talk directly to Hadoop. Example technologies: ODBC/JDBC, Hive.
Wave 3, real-time orientation: bleeding edge, 10% of organizations. Example use case – enrich: real-time decisions. Response time: seconds. Data characteristic: velocity. Architectural characteristic: derived data also stored in Hadoop. Example technologies: HBase, NoSQL, SQL.
* Among organizations using Hadoop
    17. Hadoop in a nutshell.  The Hadoop open-source ecosystem delivers powerful innovation in storage, databases, and business intelligence, promising unprecedented price/performance compared to existing technologies.  Hadoop is becoming an enterprise-wide landing zone for large amounts of data. Increasingly it is also used to transform data.  Large enterprises have realized substantial cost reductions by offloading some enterprise data warehouse, ETL, and archiving workloads to a Hadoop cluster.
    18. Challenges in the Enterprise.  Use-case identification and cost justification  Cooperation and coordination across independent business units  As Hadoop increases its footprint in business-critical areas, the business will demand mature enterprise capabilities, e.g. DR, snapshots, etc.  Hadoop’s disruptive approach challenges entrenched legacy EDW people, processes, and technologies  Data harmonization is often a significant challenge  Fear of forking (think UNIX)  Proprietary absorption (getting “Borged”)  Audience: Hadoop addresses business problems, not IT problems  Fear of data complexity (“I hated statistics class!”)
    19. Questions?