The Value of the Modern Data Architecture with Apache Hadoop and Teradata

This webinar discusses why Apache Hadoop is most typically the technology underpinning "Big Data," how it fits into a modern data architecture, and how it relates to the current landscape of databases and data warehouses already in use.

  • IDC study (http://cdn.idc.com/research/Predictions12/Main/downloads/IDCTOP10Predictions2012.pdf): IDC projects that the digital universe will reach 40 zettabytes (ZB) by 2020, a 50-fold growth from the beginning of 2010. According to the study, 2.8 ZB of data will have been created and replicated in 2012. Machine-generated data is a key driver in the growth of the world's data and is projected to increase 15x by 2020.
    McKinsey Global Institute report, "Game changers: Five opportunities for US growth and renewal," July 2013, by Susan Lund, James Manyika, Scott Nyquist, Lenny Mendonca, and Sreenivas Ramaswamy: http://www.mckinsey.com/insights/americas/us_game_changers
    "By 2015, organizations that build a modern information management system will outperform their peers financially by 20 percent." – Gartner, Mark Beyer, "Information Management in the 21st Century"
    By 2015, Gartner believes 65 percent of prepackaged analytic applications will have Hadoop already embedded. Gartner also sees a rising trend in "Hadoop-enabled database management systems" to help organizations deploy appliances and apps (virtual or physical) with Big Data capabilities baked in. (http://channelnomics.com/2013/01/28/gartner-predicts-big-data-explosion/)
    "Global data growth will outperform Moore's law over the next few years." – Forrester, http://blogs.forrester.com/holger_kisker/12-08-15-big_data_meets_cloud
  • Let's set some context before digging into the Modern Data Architecture. While overly simplistic, this graphic represents the traditional data architecture:
    - a set of data sources producing data;
    - a set of data systems to capture and store that data, most typically a mix of RDBMS and data warehouses;
    - a set of custom and packaged applications, as well as business analytics, that leverage the data stored in those data systems.
    Your environment is undoubtedly more complicated, but conceptually it is likely similar. This architecture is tuned to handle TRANSACTIONS and data that fits into a relational database.
    [CLICK] Fast-forward to recent years, and this traditional architecture has become PRESSURED with new sources of data that aren't handled well by existing data systems. So in the world of Big Data, we've got classic TRANSACTIONS plus new sources of data that come from what I refer to as INTERACTIONS and OBSERVATIONS.
    INTERACTIONS come from such things as web logs, user click streams, social interactions and feeds, and user-generated content including video, audio, and images.
    OBSERVATIONS tend to come from the "Internet of Things": sensors for heat, motion, and pressure, plus RFID and GPS chips within such things as mobile devices, ATMs, automobiles, and even farm tractors, are just some of the "things" that output observation data.
  • As the volume of data has exploded, Enterprise Hadoop has emerged as a peer to traditional data systems. The momentum for Hadoop is NOT about revolutionary replacement of traditional databases. Rather, it's about adding a data system uniquely capable of handling big data problems at scale, and doing so in a way that integrates easily with existing data systems, tools, and approaches. This means it must interoperate with every layer of the stack:
    - existing applications and BI tools;
    - existing databases and data warehouses, for loading data to and from the data warehouse;
    - development tools used for building custom applications;
    - operational tools for managing and monitoring.
    Mainstream enterprises want to get the benefits of new technologies in ways that leverage existing skills and integrate with existing systems.
  • It is for that reason that we focus on HDP interoperability across all of these categories:
    - Data systems: HDP is endorsed and embedded with SQL Server, Teradata, and more.
    - BI tools: HDP is certified for use with the packaged applications you already use, from Microsoft to Tableau, MicroStrategy, Business Objects, and more.
    - Development tools: for .NET developers, Visual Studio, used to build more than half the custom applications in the world, certifies with HDP to enable Microsoft app developers to build custom apps with Hadoop; for Java developers, Spring for Apache Hadoop enables quick and easy development of Hadoop-based applications with HDP.
    - Operational tools: integration with System Center and with Teradata Viewpoint.
  • Industry research shows the shift from a single system to an ecosystem where different technologies can unify and process data in the most efficient and specialized way to add the most value. Gartner calls this movement the "logical data warehouse," which is being driven by a "desire for high-value analytics." EMA and 9sight research shows that, on average, most companies tackle big data with three systems, including "analytic databases, discovery platforms, and NoSQL solutions" (more details below).
    When asked how many nodes (a node here refers to a separate system or database in their architecture) were part of their Big Data initiatives, the EMA/9sight survey respondents indicated that a wide number of Hybrid Data Ecosystem nodes were part of their plans. The most common answer among the 255 respondents was a total of three Hybrid Data Ecosystem nodes, showing that Big Data strategies are not limited to a single platform or solution. When the two-to-five-node responses are aggregated, over two thirds of respondents fall in this segment. This shows Big Data initiatives are focused on more than a single-platform (e.g., Hadoop) augmentation of the core operational platforms or the enterprise data warehouse. Rather, Big Data requirements are solved by a range of platforms including analytical databases, discovery platforms, and NoSQL solutions beyond Hadoop.
  • This is an example of meaningful enterprise-level integration which minimizes data replication and increases analyst productivity. It closes gaps in Hadoop that would otherwise take years to close. Leverage the scale and cost of Hadoop, but provide a proper SQL-compliant interface, better performance, and higher analytic value with pre-built analytic functions that solve specific business problems like marketing attribution.
  • We see common uses for Hadoop in capturing "dark data" such as email, call center IVR records, documents, and other "no schema" data that does not fit easily into a relational model without pre-processing. Hadoop provides a landing/staging/refining area to munge this data and make it available to join with other data. In some cases, the text can be parsed and "scored" for sentiment as a one-time batch job when interactivity isn't required (a minimal sketch of such a batch scoring job follows these notes).
  • From http://www.odbms.org/blog/2011/10/analytics-at-ebay-an-interview-with-tom-fastner/
    eBay is rapidly changing, and analytics is driving many key initiatives like buyer experience, search optimization, buyer protection, and mobile commerce. We are investing heavily in new technologies and approaches to leverage new data sources to drive innovation. We have three different platforms for analytics:
    A) EDW: dual systems for transactional (structured) data; Teradata 3.5 PB and 2.5 PB spinning disk; 10+ years of experience; very high concurrency; good accessibility; hundreds of applications.
    B) Singularity: deep Teradata system for semi-structured data; 36 PB spinning disk; lower concurrency than EDW, but can store more data; biggest use case is user behavior analysis; largest table is 1.2 PB with ~1.9 trillion rows.
    C) Hadoop: for unstructured/complex data; ~40 PB spinning disk; text analytics, machine learning; has the user behavior data and selected EDW tables; lower concurrency and utilization.
    When dealing with terabytes to petabytes of data, how do you ensure scalability and performance? Tom Fastner:
    EDW: We model for the unknown (close to 3rd NF) to provide a solid physical data model suitable for many applications, which limits the number of physical copies needed to satisfy specific application requirements. A lot of scalability and performance is built into the database, but like any shared resource it does require an excellent operations team to fully leverage the capabilities of the platform.
    Singularity: The platform is identical to EDW; the only exceptions are limitations in workload management due to configuration choices. But since we are leveraging the latest database release, we are exploring ways to adopt new storage and processing patterns. Some new data sources are stored in a denormalized form, significantly simplifying data modeling and ETL. On top of that, we developed functions to support the analysis of the semi-structured data. It also enables more sophisticated algorithms that would be very hard, inefficient, or impossible to implement with pure SQL; one example is the pathing of user sessions (a minimal sessionization sketch follows these notes). However, the size of the data requires us to focus more on best practices (develop on small subsets, use a 1% sample, process by day).
    Hadoop: The emphasis on Hadoop is on optimizing for access. The reusability of data structures (besides "raw" data) is very low. Unstructured data is handled on Hadoop only. The data is copied from the source systems into HDFS for further processing. We do not store any of that on the Singularity (Teradata) system.
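The "score text for sentiment as a one-time batch job" pattern from the dark-data note above can be as small as a map-only Hadoop Streaming script. The following is a minimal sketch, not anything from the webinar: the tab-separated record layout, the tiny word lists, and the paths are all assumptions.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper: assigns a crude sentiment score to free text.
# Assumes each input line looks like "<record_id>\t<text>"; adjust to your layout.
import sys

# Illustrative word lists; a real job would load a proper sentiment lexicon.
POSITIVE = {"good", "great", "happy", "love", "resolved"}
NEGATIVE = {"bad", "slow", "angry", "cancel", "broken"}

def score(text):
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

for line in sys.stdin:
    parts = line.rstrip("\n").split("\t", 1)
    if len(parts) != 2:
        continue  # skip malformed records rather than failing the whole job
    record_id, text = parts
    print("%s\t%d" % (record_id, score(text)))
```

Run with the distribution's streaming jar as a map-only job (for example `hadoop jar hadoop-streaming-*.jar -files score.py -mapper score.py -numReduceTasks 0 -input <raw text dir> -output <scores dir>`), then register the output in HCatalog/Hive so it can be joined with warehouse data.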
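The "pathing of user sessions" Fastner mentions is essentially sessionization: grouping a user's time-ordered events and collapsing them into a path that downstream tools (SQL-MapReduce functions, Hive jobs, and so on) can analyze. Here is a minimal sketch under assumed inputs of (user_id, epoch_seconds, page) and an assumed 30-minute inactivity timeout; it is illustrative, not eBay's implementation.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # assumed inactivity gap, in seconds

def sessionize(events):
    """events: iterable of (user_id, epoch_seconds, page) tuples.
    Returns a list of (user_id, [page, ...]) session paths."""
    by_user = defaultdict(list)
    for user_id, ts, page in events:
        by_user[user_id].append((ts, page))

    sessions = []
    for user_id, hits in by_user.items():
        hits.sort()  # time-order each user's clicks
        current, last_ts = [], None
        for ts, page in hits:
            if last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
                sessions.append((user_id, current))  # gap too long: close the session
                current = []
            current.append(page)
            last_ts = ts
        if current:
            sessions.append((user_id, current))
    return sessions

# Two clicks close together plus one much later become two sessions.
demo = [("u1", 0, "Home"), ("u1", 60, "Search"), ("u1", 10000, "Product")]
print(sessionize(demo))  # [('u1', ['Home', 'Search']), ('u1', ['Product'])]
```

At petabyte scale this grouping would run as a MapReduce or SQL-MapReduce job partitioned by user, but the per-user logic is the same.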

The Value of the Modern Data Architecture with Apache Hadoop and Teradata: Presentation Transcript

  • The Value of a Modern Data Architecture with Apache Hadoop and Teradata
  • Today's Topics: Introduction • Drivers for the Modern Data Architecture (MDA) • Apache Hadoop's role in the MDA • EDW's role in the MDA • Q&A
  • Existing Data Architecture (diagram): traditional sources (RDBMS, OLTP, OLAP) feed data systems (RDBMS, EDW, discovery platform), which in turn serve packaged and custom analytic applications.
  • Big Data Market Trends & Projections: 15x growth rate of machine-generated data by 2020 • 1 zettabyte (ZB) = 1 billion TB • the US has 1/3 of the world's data • Big Data is 1 of 5 US GDP game changers • $325 billion incremental annual GDP from big data analytics in retail and manufacturing by 2020 • organizations leveraging modern information management systems outperform peers by 20% by 2015.
  • Traditional Data Architecture Pressured (diagram): existing applications (custom applications, business analytics, packaged applications) and data systems (RDBMS, EDW, discovery platform) are pressured by new sources (sentiment, clickstream, geo, sensor, ...) alongside traditional sources (OLTP and POS systems; RDBMS, OLTP, OLAP). Source: IDC – 2.8 ZB of data in 2012, 85% from new data types; 15x machine data by 2020; 40 ZB by 2020.
  • Modern Data Architecture Enabled (diagram): Hadoop sits alongside the RDBMS, EDW, and discovery platform, fed by both traditional sources (RDBMS, OLTP, OLAP) and new sources (sentiment, clickstream, geo, sensor, ...), serving custom applications, business analytics, and packaged applications, with dev and data tools (build & test) and operational tools (manage & monitor).
  • Today's Topics: Introduction • Drivers for the Modern Data Architecture (MDA) • Apache Hadoop's role in the MDA • EDW's role in the MDA • Q&A
  • What Data is Being Stored in Hadoop? 1. Social: understand how your customers feel about your brand and products, right now. 2. Clickstream: capture and analyze website visitors' data trails and optimize your website. 3. Sensor/Machine: discover patterns in data streaming automatically from remote sensors and machines. 4. Geolocation: analyze location-based data to manage operations where they occur. 5. Server Logs: research logs to diagnose process failures and prevent security breaches. 6. Unstructured (text, video, pictures, etc.): understand patterns in text across millions of unstructured work products: web pages, emails, video, pictures, and documents.
  • Modern Data Architecture Applied (diagram): store all data in a shared "data lake" and build/enable applications on top of it; as organizations mature, they move toward the data lake as the goal for their Hadoop infrastructure, delivering broad value across the enterprise. Traditional sources (RDBMS, OLTP, OLAP) and new sources (sentiment, clickstream, geo, sensor, ...) feed the lake alongside the RDBMS, EDW, and discovery platform, which serve packaged and custom analytic applications.
  • Drivers for Hadoop Adoption. Modern Data Architecture (driving efficiency): Hadoop has a central role in next-generation data architectures while integrating with existing data systems. Business Applications (driving opportunity): use Hadoop to extract insights that enable new customer value and competitive edge. Big data sets span existing/traditional data (server logs, clickstream) and emerging data (sentiment/social, machine/sensor, geolocation).
  • 3 Requirements for Hadoop's Role in the Modern Data Architecture: Integrated (interoperable with existing data center investments); Key Services (platform, operational, and data services essential for the enterprise); Skills (leverage your existing development, operations, and analytics skills).
  • Interoperating With Your Tools (diagram): HDP interoperates with Microsoft applications, dev and data tools, and operational tools such as Teradata Viewpoint, across traditional sources (RDBMS, OLTP, OLAP) and new sources (sentiment, clickstream, geo, sensor, ...).
  • Today's Topics: Introduction • Drivers for the Modern Data Architecture (MDA) • Apache Hadoop's role in the MDA • EDW's role in the MDA • Q&A
  • Shift from a Single Platform to an Ecosystem: the "logical" data warehouse. "We will abandon the old models based on the desire to implement for high-value analytic applications." "Big Data requirements are solved by a range of platforms including analytical databases, discovery platforms, and NoSQL solutions beyond Hadoop." Source: "Big Data Comes of Age," EMA and 9sight Consulting, Nov 2012.
  • Unified Data Architecture (diagram): sources (ERP, SCM, CRM, images, audio and video, machine logs, text, web and social) flow through manage/move/access layers into the integrated data warehouse, data platform, and discovery platform; analytic tools (business intelligence, data mining, math and stats, languages) serve users such as marketing, executives, operational systems, frontline workers, customers, partners, engineers, data scientists, and business analysts.
  • Unified Data Architecture (diagram, with platform roles): the integrated data warehouse supports business intelligence, predictive analytics, and operational intelligence; the data platform handles fast loading, filtering and processing, and online archival; the discovery platform supports data discovery, path/graph/time-series analysis, and pattern detection. Sources, analytic tools, and users are as on the previous slide.
  • Teradata Unified Data Architecture (diagram): the same sources, integrated data warehouse / data platform / discovery platform layers, analytic tools, and users as the preceding slides, presented as the Teradata Unified Data Architecture.
  • Teradata Appliance for Hadoop: value-added software bringing Hadoop to the enterprise. Access: SQL-H™, Teradata Studio. Management: Viewpoint, TVI. Administration: Hadoop Builder, intelligent start/stop, DataNode swap, deferred drive replacement. High availability: NameNode HA, master machine failover. Refining, metadata, and entity resolution: HCatalog. Security and data access: Kerberos.
  • Modern Data Architecture Details (diagram): source data (sensor/log data, customer/inventory data, clickstream data, flat files, streaming data) is loaded into HDFS via services such as Sqoop, Flume, NFS, and WebHDFS; it is refined and structured with MapReduce, Hive, Pig, and custom jobs on YARN, with HCatalog providing metadata services and Knox and Ambari providing security and management; refined data is queried interactively through JDBC/ODBC-compliant tools or exported to the Aster Discovery Platform and the Teradata IDW via Teradata SQL-H, Sqoop/Hive, and TDCH. TVI provides proactive system monitoring (alerts, system health, node health, space usage, capacity heatmap, metrics analysis) tied to Teradata customer support via Viewpoint. (A minimal load-refine-export orchestration sketch appears after this transcript.)
  • Teradata Vital Infrastructure (TVI): proactive reliability, availability, and manageability. A 1U server virtualizes system and cabinet management software (Server Management VMS), including the Cabinet Management Interface Controller (CMIC) and Service Work Station (SWS), automatically installed on the base/first cabinet. VMS allows full-rack solutions without an additional cabinet for a traditional SWS, eliminating the need for expansion racks and reducing customers' floor space and energy costs. It supports Teradata hardware and Hadoop software; 62–70% of incidents are discovered through TVI.
  • Standard SQL Access to Hadoop Data: give business users on-the-fly access to data in Hadoop through Teradata SQL-H and Aster SQL-H. Fast: queries run on Teradata or Aster while data is accessed from Hadoop. Standard: 100% ANSI SQL access to Hadoop data. Trusted: use existing tools and skills and enable self-service BI with granular security. Efficient: intelligent data access and filtering leveraging the Hadoop HCatalog, alongside Hive, Pig, and MapReduce on HDFS. (A hedged example query appears after this transcript.)
  • Teradata Unified Data Architecture™ partners support many layers.
  • Teradata Aster Discovery Portfolio: accelerate time to insights. Some of the 80+ out-of-the-box analytical apps: Path Analysis (discover patterns in rows of sequential data); Text Analysis (derive patterns and extract features in textual data); Statistical Analysis (high-performance processing of common statistical calculations); Segmentation (discover natural groupings of data points); Marketing Analytics (analyze customer interactions to optimize marketing decisions); Data Transformation (transform data for more advanced analysis); Graph Analysis (graph analytics processing and visualization); SQL-MapReduce Visualization (graphing and visualization tools linked to key functions of the MapReduce analytics library).
  • More Accurate Customer Churn Prevention (diagram): Hadoop captures, stores, and transforms multi-structured raw data (social feeds, clickstream data, call center voice records, email) in a capture/retain/refine layer, producing sentiment scores; the Aster Discovery Platform performs path and pattern analysis on call data and dimensional/check data alongside the traditional ETL data flow; analytic results feed the Teradata integrated DW for analysis and marketing automation (customer retention campaigns).
  • MPP RDBMS + Hadoop customer successes.
  • Key Considerations for EDW and Hadoop (MPP RDBMS vs. Hadoop): stable schema vs. evolving schema; leverages structured data vs. structure agnostic; ANSI SQL vs. flexible programming; iterative analysis vs. batch analysis; fine-grained security vs. N/A; cleansed data vs. raw data; seeks vs. scans; updates/deletes vs. ingest; service level agreements vs. flexibility; core data vs. all data; complex joins vs. complex processing; efficient use of CPU/IO vs. low cost of storage.
  • Complete Consulting and Training Services, areas of focus: Teradata Analytic Architecture Services (scope, design, build, operate, and maintain an optimal UDA approach for Teradata, Aster, and Hadoop); Teradata DI Optimization (assess structured/non-structured data, discuss data loading techniques, determine the best platform, optimize load scripts/processes); Teradata Big Analytics (assess data value/cost of capture, identify sources of "exhaust" data, create a conceptual architecture, refine and enrich the data, implement initial analytics in Aster or the best-fit tool); Teradata Workshop for Hadoop (introductory workshop across all of UDA); Teradata Data Staging for Hadoop (load data into a landing area, set up the data exploration/refining area, scope architecture and analytics, set up the Hadoop repository, load sample data); Teradata Platform for Hadoop (installation guidance and mentoring for the Hadoop platform, do-it-yourself after installation); Teradata Managed Services for Hadoop (operations, management, administration, backup, security, and process control for Hadoop); Teradata Training Courses for Hadoop (two comprehensive, multi-day offerings: 1) Administration of Apache Hadoop and 2) Developing Solutions Using Apache Hadoop).
  • Discovering Deep Insights in Retail: transforming web walks into DNA sequences. Situation: a large retailer with 700M visits/year; 2M customers/day look at 1M products online. Problem: increase the ability of web content owners to self-serve insights. Solution: treat web walks like DNA sequences of simple patterns. Impact: logs loaded into Hortonworks; two months of raw data loaded in 1 hour vs. 1 day on the old system, and a day's log data can now be loaded in 60 seconds. Sessionize: create a sequence for each visit, e.g., boil 20 customer clicks down to one line: <Home - Search - Look at Product - Add to Basket - Pay - Exit>. Analyze: business analysts can now do path analysis. Act: segmentation by behavior can increase conversion rates by 5-10%, and web design changes can drive another 10-20% more visitors into the sales funnel. (A small path-frequency sketch appears after this transcript.)
  • Demo
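The "Modern Data Architecture Details" slide describes a load-refine-export loop: land raw data in HDFS, refine it with Hive, and push the refined result to Teradata. Below is a minimal orchestration sketch, assuming the sqoop and hive command-line clients are available; the connection strings, table names, and HDFS paths are placeholders, and a Teradata deployment would typically use TDCH or SQL-H rather than a plain Sqoop export.

```python
import subprocess

# Placeholder connection details and paths; replace with real values.
TD_JDBC = "jdbc:teradata://tdpid/DATABASE=analytics"
REFINED_TABLE = "refined_clicks"

def run(cmd):
    print("running:", " ".join(cmd))
    subprocess.check_call(cmd)

# 1. Load: pull a relational reference table into HDFS with Sqoop.
run(["sqoop", "import",
     "--connect", "jdbc:mysql://appdb/shop",
     "--table", "products",
     "--target-dir", "/landing/products",
     "--num-mappers", "4"])

# 2. Refine: aggregate raw clickstream into a refined Hive table.
run(["hive", "-e",
     "INSERT OVERWRITE TABLE %s "
     "SELECT product_id, COUNT(*) AS views "
     "FROM raw_clicks GROUP BY product_id" % REFINED_TABLE])

# 3. Export: push the refined table's files to the Teradata IDW.
run(["sqoop", "export",
     "--connect", TD_JDBC,
     "--table", "PRODUCT_VIEWS",
     "--export-dir", "/apps/hive/warehouse/%s" % REFINED_TABLE])
```

In practice this would run under a scheduler such as Oozie or cron, with credentials and the Hive warehouse path taken from the cluster's configuration rather than hard-coded.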
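The "Standard SQL Access to Hadoop Data" slide boils down to running an ordinary warehouse query against a table whose rows live in HDFS and are described in HCatalog. Here is a sketch of what that looks like from an analyst's tooling, using pyodbc; the DSN, credentials, and table names are assumptions, and whether web_clicks is surfaced through Teradata SQL-H or Aster SQL-H depends on the deployment.

```python
import pyodbc

# Assumed ODBC data source name configured for the Teradata or Aster system.
conn = pyodbc.connect("DSN=analytics_dw;UID=analyst;PWD=example")
cur = conn.cursor()

# Join a warehouse dimension table with Hadoop-resident clickstream data
# (web_clicks is assumed to already be exposed via SQL-H over HCatalog).
cur.execute("""
    SELECT c.customer_segment, COUNT(*) AS clicks
    FROM web_clicks w
    JOIN customer_dim c ON c.customer_id = w.customer_id
    GROUP BY c.customer_segment
""")

for segment, clicks in cur.fetchall():
    print(segment, clicks)

conn.close()
```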
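The retail case study turns each visit into a short path and then asks which paths are common, which is where the conversion-rate claims come from. The sketch below counts path frequencies and a naive conversion rate; the sample paths and the "Pay" success step are illustrative, not the retailer's data.

```python
from collections import Counter

# Assumed session paths (one tuple per visit), e.g. from a sessionization job.
paths = [
    ("Home", "Search", "Product", "Basket", "Pay"),
    ("Home", "Search", "Product", "Exit"),
    ("Home", "Product", "Basket", "Pay"),
    ("Home", "Search", "Product", "Exit"),
]

# Which complete paths occur most often?
for path, count in Counter(paths).most_common(3):
    print(" -> ".join(path), count)

# Naive conversion rate: the share of sessions that reach the "Pay" step.
converted = sum(1 for p in paths if "Pay" in p)
print("conversion rate: %.0f%%" % (100.0 * converted / len(paths)))
```

The same counting done over behavioral segments (rather than all visitors) is what lets analysts compare conversion rates by segment, which is the effect the slide quantifies.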