Don't be Hadooped when looking for Big Data ROI

Extracting value from Big Data is not easy. The field of technologies and vendors is fragmented and rapidly evolving. End-to-end, general-purpose solutions that work out of the box don't exist yet, and Hadoop is no exception. Most companies also lack Big Data specialists. The key to unlocking real value lies in mapping the business requirements smartly against the emerging and imperfect ecosystem of technology and vendor choices.

There is a long list of crucial questions to think about. How fast is the data flying at you? Are your Big Data analyses tightly integrated with existing systems, or do they run as parallel, complex processing jobs? Can you tolerate a minute of latency? Can you accept data loss or generous SLAs? Is imperfect security good enough?
The answer to Big Data ROI lies somewhere between the herd and the nerd mentality. Thinking hard and being smart about each use case as early as possible avoids costly mistakes.

This talk will illustrate how Deutsche Telekom follows this segmentation approach to make sure every individual use case drives architecture design and technology selection.

  • Automated deployment and monitoring: the cloud infrastructure has to provide 10 “verbs” so that the apps don't have to know anything about the infrastructure. The philosophy is no patching and rolling upgrades, with the platform constantly comparing what the app needs with what the cloud provides.
  • Layers covered: Presentation, Application, Data Processing, Data Ingestion, Infrastructure, Security, and Management & Monitoring. The main components are described below (see also the code sketches after these notes):
    - Ambari: Apache Ambari is a monitoring, administration and lifecycle management project for Apache Hadoop clusters. Hadoop clusters require many inter-related components that must be installed, configured, and managed across the entire cluster.
    - ZooKeeper: ZooKeeper is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. ZooKeeper is used heavily by many distributed applications such as HBase.
    - HBase: HBase is the distributed Hadoop database, scalable and able to collect and store big data volumes on HDFS. This class of database is often categorized as NoSQL (Not only SQL).
    - Pig: Apache Pig is a platform for analyzing large data sets; it consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs.
    - Hive: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. At the same time, the language allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.
    - HCatalog: Apache HCatalog is a table and storage management service for data created using Apache Hadoop; it provides deep integration into enterprise data warehouses (e.g. Teradata) and with data integration tools such as Talend.
    - MapReduce: Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
    - HDFS: The Hadoop Distributed File System is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid parallel computations.
    - Talend Open Studio for Big Data: a 100% open source, graphical code generator used for ETL (extract, transform, load) and ELT (extract, load, transform) data movement and cleansing in and out of Hadoop. Data Integration Services: HDP integrates Talend Open Studio for Big Data, the leading open source data integration platform for Apache Hadoop; it includes a visual development environment and hundreds of pre-built connectors to leading applications that allow you to connect to any data source without writing code. Centralized Metadata Services: HDP includes HCatalog, a metadata and table management system that simplifies data sharing both between Hadoop applications running on the platform and between Hadoop and other enterprise data systems. HDP's open metadata infrastructure also enables deep integration with third-party tools.
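
The MapReduce programming model mentioned in these notes is easiest to see in a small example. The sketch below is not part of the deck: it is a minimal, hypothetical Java word-count job over files in HDFS, with made-up input and output paths, shown only to illustrate the map and reduce phases.

```java
// Minimal word-count job illustrating the MapReduce model: the map phase emits
// (word, 1) pairs, the reduce phase sums them. Input/output HDFS paths are hypothetical.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: tokenize each line of the input split and emit (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/data/raw/logs"));        // hypothetical HDFS input
    FileOutputFormat.setOutputPath(job, new Path("/data/out/wordcount")); // hypothetical HDFS output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The notes also describe Hive and its SQL-like language, HiveQL. As a further illustration only, the following sketch shows how an application might submit a HiveQL query through the Hive JDBC driver; the host, port, user, and `clickstream` table are hypothetical and assume a HiveServer2 endpoint is available.

```java
// Minimal sketch of SQL-style access to Hadoop data via Hive's JDBC driver.
// Connection details and the queried table are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hadoop-gateway.example.com:10000/default", "analyst", "");
         Statement stmt = conn.createStatement();
         // HiveQL looks like SQL but is compiled into jobs that run on the cluster.
         ResultSet rs = stmt.executeQuery(
             "SELECT customer_id, COUNT(*) AS events FROM clickstream GROUP BY customer_id LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}
```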

Don't be Hadooped when looking for Big Data ROI: Presentation Transcript

  • Capturing Big Value in Big Data – How Use Case Segmentation Drives Solution Design and Technology Selection at Deutsche Telekom. Jürgen Urbanski, Vice President Cloud & Big Data Architectures & Technologies, T-Systems; Cloud Leadership Team, Deutsche Telekom; Board Member, BITKOM Big Data & Analytics Working Group.
  • Inserting Hadoop in your organization – value proposition by buying center / stakeholder (IT Infrastructure, IT Applications, LOB, CXO). [Chart: potential value vs. time to value per stakeholder; benefits cited include lower storage cost, lower enterprise data warehouse cost, better data quality, better value, lower product development cost, lower churn, lower fraud, faster customer acquisition, and new business models, with potential rising and time to value lengthening as you move from IT toward LOB and CXO.]
  • Waves of adoption – crossing the chasm:
    - Wave 1, Batch Orientation: mainstream, 70% of organizations today. Example use cases: enterprise log file analysis, ETL offload, active archive. Response time: hour(s). Data characteristic: volume. Architectural characteristic: EDW / RDBMS talk to Hadoop.
    - Wave 2, Interactive Orientation: early adopters, 20% of organizations. Example use cases: forensic analysis, analytic modeling, BI user focus, process optimization. Response time: minutes. Architectural characteristic: analytic apps talk directly to Hadoop.
    - Wave 3, Real-Time Orientation: bleeding edge, 10% of organizations. Example use cases: sensor analysis, "Twitter scraping", telematics, fraud detection, clickstream analytics. Response time: seconds. Data characteristic: velocity. Architectural characteristic: derived data also stored in Hadoop.
  • Data warehouse and ETL offload are promising use cases with immediate ROI:
    - Data warehouse offload: a legacy data warehouse is costly, so it can only keep one year of data; older data is stored but "dark" – you cannot swim around in it and explore it; with HDFS you could explore it (active archive); a "data refinery" for cases where the massively parallel processing (MPP) solution is saturated performance-wise.
    - ETL offload: ETL may have more than a dozen steps, and many can be offloaded to a Hadoop cluster.
    - Mainframe offload: may have potential.
  • Big Data is about new application landscapes:
    - New apps taking advantage of Big Data: rapid app development; bridges back to legacy systems (wrapping with an API, or data integration via federation or data transport).
    - New data fabrics for a new IT. Fast data: in real time; in context (what, when, who, where); telemetry / sensor based; NoSQL databases (serving humans or machines, where you need to reason over data as it comes in, in real time). More data: more sources; more types; in ONE place.
    - These three areas need to come together in a platform: cloud abstraction (so it can run on any private or public cloud, no lock-in); automated deployment and monitoring (rolling upgrades, no patching); various deployment form factors (on-premise as software, on-premise as appliance, in the cloud).
  • Example application landscape (source: VMware): real-time streams (social, sensors) feed real-time processing (S4, Storm, Spark); machine learning (Mahout, etc.); data visualization (Excel, Tableau); ETL (Informatica, Talend, Spring Integration); real-time database (GemFire, HBase, Cassandra); interactive analytics (Impala, Shark, Greenplum, AsterData, Netezza); Hive; batch processing (MapReduce); structured and unstructured data (HDFS, MapR); all running on cloud infrastructure (compute, storage, networking).
  • Reference architecture – high-level view: Presentation, Application, Data Processing, Data Management, and Infrastructure layers, with Data Integration, Operations, and Security as cross-cutting concerns.
  • Reference architecture – component view:
    - Presentation: data visualization and reporting; clients.
    - Application: analytics apps; transactional apps; analytics middleware.
    - Data Integration: real-time ingestion; batch ingestion; connectors.
    - Data Processing: real-time / stream processing; batch processing; search and indexing.
    - Data Management: distributed storage (HDFS); distributed processing; non-relational DB; structured / in-memory; metadata services.
    - Infrastructure: virtualization; compute / storage / network.
    - Operations: workflow and scheduling; management and monitoring.
    - Security: data isolation; access management; data encryption.
  • Questions to ask in designing a solution for a particular business use case: What physical infrastructure best fits your needs? What are your data placement requirements (service provider data centers or on-premise, jurisdiction)?
    Innovation: cheaper storage, but not just storage. Illustrative acquisition cost: SAN storage 3-5 €/GB (based on HDS SAN storage); NAS filers 1-3 €/GB (based on NetApp FAS series); enterprise-class Hadoop storage 0.50-1.00 €/GB (based on NetApp E-Series, NOSH); white-box DAS 0.10-0.30 €/GB (hardware can be self-assembled); Data Cloud ???€/GB (based on large-scale object storage interfaces). Note: Hadoop offers storage + compute (incl. search); Data Cloud offers Amazon S3 and native storage functions.
  • Questions to ask in designing a solution for a particular business use case (continued): a quadrant chart maps Hadoop cluster profiles against compute power and storage capacity (source: NetApp). Enterprise-class Hadoop profiles include packaged, ready-to-deploy modular clusters and compute/memory-intensive clusters: compute-intensive applications such as tick data analysis; data with intrinsic value; usable capacity that must expand faster than compute; higher storage performance; extremely tight service-level expectations; severe financial or real human consequences if the analytic run is late or the system fails (threats, treatments, financial losses); a system that has to allow for asymmetric growth; and bounded-compute algorithms where additional CPUs do not improve run time. White-box Hadoop reflects values associated with early adopters of Hadoop: the social media space, contributors to Apache, a strong bias to JBOD, skepticism of all vendors, and a need for deeper storage per data node.
  • Questions to ask in designing a solution for a particular business use case (continued): Do you run your Hadoop cluster bare-metal or virtual? Most run bare-metal today, but virtualization helps with different failure domains, different hardware pools, and development vs. production. Three big types of isolation are required for mixing workloads:
    - Resource isolation: control the greedy or noisy neighbor; reserve resources to meet needs.
    - Version isolation: allow concurrent OS, app, and distro versions, for instance test/dev vs. production, or high performance vs. low cost.
    - Security isolation: provide privacy between users/groups; runtime and data privacy required.
    Adapted from VMware; see Apache Hadoop on vSphere, http://www.vmware.com/de/hadoop/serengeti.html
  • Questions to ask in designing a solution for a particular business use case (continued): Which distribution is right for your needs today vs. tomorrow? Which distribution will ensure you stay on the main path of open source innovation, vs. trap you in proprietary forks? Four distributions are compared (identified by logo on the slide):
    - Widely adopted, mature distribution; GTM partners include Oracle, HP, Dell, IBM.
    - Fully open source distribution (incl. management tools); reputation for cost-effective licensing; strong developer ecosystem momentum; GTM partners include Microsoft, Teradata, Informatica, Talend.
    - More proprietary distribution with features that appeal to some business-critical use cases; GTM partner AWS (M3 and M5 versions only).
    - Just announced by EMC, very early stage; differentiator is HAWQ, which claims 600x query speed improvement and a full SQL instruction set.
    Note: distributions include more than just the data management layer but are discussed at this point in the presentation. Not shown: Intel, Fujitsu and other distributions.
  • Questions to ask in designing a solution for a particular business use case (continued): What data sources could be of value (internal vs. external, people- vs. machine-generated)? Follow data privacy rules for people-generated data. How much data volume do you have (entry barrier discussion) and of what type (structured, semi-structured, unstructured)? What are the data latency requirements (measured in minutes)? Access options: Hadoop APIs for Hadoop applications; NFS for file-based applications; REST APIs for internet access; ODBC (JDBC) for SQL-based applications (see the HiveQL/JDBC sketch in the notes above).
  • Questions to ask in designing a solution for a particular business use case (continued): What type of analytics is required (machine learning, statistical analysis)? How fast do decisions need to be made (decision latency)? Is multi-stage data processing a requirement (before data gets stored)? Do you need stream computing and complex event processing (CEP)? If so, do you have strict time-based SLAs, and is data loss acceptable? How often does data get updated and queried (real time vs. batch)? How tightly coupled are your Hadoop data with existing relational data sets? Which non-relational DB suits your needs? HBase and Cassandra work natively on HDFS, while Couchbase and MongoDB work on copies of the data. Stay focused on what is possible quickly.
  • Innovations: store first, ask questions later – parallel processing (scale out):
    - Legacy BI: backward-looking analysis, using data out of business applications. Selected vendors/technology: SAP Business Objects, IBM Cognos, MicroStrategy. Data type/scalability: structured, limited (2-3 TB in RAM).
    - High-performance BI: quasi-real-time analysis, using data out of business applications. Selected vendors/technology: Oracle Exadata, SAP HANA. Data type/scalability: structured, limited (2-8 TB in RAM). This is the legacy vendor definition of big data.
    - "Hadoop" ecosystem: forward-looking predictive analysis; questions defined in the moment, using data from many sources. Selected vendors/technology: Hadoop distributions. Data type/scalability: structured or unstructured, unlimited (20-30 PB). "True" big data.
  • Questions to ask in designing a solution for a particular business use case (continued): Is backup and recovery critical (number of copies in the HDFS cluster)? Do you need disaster recovery on the raw data? How do you optimize TCO over the lifetime of a cluster? How do you ensure the cluster remains balanced and performing well as the underlying hardware pool becomes heterogeneous? What are the implications of a migration between different distributions, or between versions of one distribution? Can you do rolling upgrades to minimize disruption? What level of multi-tenancy do you implement? Even within the enterprise, one general-purpose Hadoop cluster might serve different legal entities / BUs. How do you bring along existing talent? E.g., train developers on Pig, database admins on Hive, and IT operations on the platform.
  • Navigating the broader BI and big data vendor ecosystem can be confusing.
  • Do you really need Hadoop? Is your data structured and less than 10 TB? Is your data structured, less than 100 TB, but tightly integrated with your existing data? Is your data structured, more than 100 TB, but processing has to occur in real time with less than a minute of latency?* Then you could stay with legacy BI landscapes, including RDBMS, MPP DB and EDW. Otherwise, come and join us on a journey into Hadoop-based solutions! (* Hadoop is making rapid progress in the real-time arena.)
  • Use Hadoop for VOLUME (illustrative, not exhaustive): You require parallel / complex data processing power and you can live with minutes or more of latency to derive reports. You need data storage and indexing for analytic applications. Relevant building blocks: platform; data transformation (MapReduce).
  • Use Hadoop for VARIETY (illustrative, not exhaustive): Your data is multi-structured. You want to derive reports in batch on full data sets. You have complex data flows or multi-stage data pipelines. Relevant building blocks: workflow management; data transformation (MapReduce); data visualization and reporting; low-latency data access*. (* HBase and Cassandra work natively on HDFS, while Couchbase and MongoDB work on copies of the data.)
  • Use Hadoop for VELOCITY (illustrative, not exhaustive): You are inundated with a flood of real-time data – numerous live feeds from multiple data sources such as machines, business systems or Internet sources. You want to derive reports in (near) real time on a sample or on full data sets. Relevant building blocks: data ingestion (Apache Kafka; see the ingestion sketch after this transcript); data visualization and reporting; fast analytics (Shark)*. (* May also use an MPP database.)
  • Where to start inserting Hadoop in your company? A call to action across IT Infrastructure, IT Applications, LOB and CXO:
    - Understanding Big Data: definition; benefits over adjacent and legacy technologies; current mode vs. future mode for analytics.
    - Assessing the economic potential: target use cases by function and industry; best approach to adoption.
    - Accelerating implementation: solution design driven by target use cases; reference architecture; technology selection and POC; implementation lessons learnt.
    AVOID puddles and pools – systems separated by workload type due to contention. GOAL: lakes and oceans – a platform that natively supports mixed workloads as a shared service.
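
The "velocity" slide above names Apache Kafka for data ingestion. As a hedged illustration that is not part of the deck, the sketch below shows a minimal Java producer pushing one event into a Kafka topic; the broker address, topic name, and payload are hypothetical, and the downstream consumers (a stream processor or a batch load into HDFS) are out of scope here.

```java
// Minimal Kafka producer sketch for the real-time ingestion path.
// Broker, topic, and message content are hypothetical.
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorEventProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka-broker.example.com:9092");  // hypothetical broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // One record per sensor reading; consumers decide whether results are
      // served in near real time or landed in HDFS for batch reporting.
      producer.send(new ProducerRecord<>("sensor-events", "sensor-42", "{\"temp\": 21.5}"));
    }
  }
}
```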