This webinar discusses why Apache Hadoop is most typically the technology underpinning "Big Data," how it fits into a modern data architecture, and how it relates to the databases and data warehouses already in use.
14. Shift from a Single Platform to an Ecosystem
The "Logical" Data Warehouse
"We will abandon the old models based on the desire to implement for high-value analytic applications."
"Big Data requirements are solved by a range of platforms including analytical databases, discovery platforms, and NoSQL solutions beyond Hadoop."
Source: "Big Data Comes of Age," EMA and 9sight Consulting, Nov 2012.
2/28/14
Teradata Confidential
16. UNIFIED DATA ARCHITECTURE
[Architecture diagram: data from sources (ERP, SCM, CRM, images, audio and video, machine logs, text, web and social) is managed, moved, and accessed across three platforms. The Data Platform handles fast loading, filtering and processing, and online archival. The Integrated Data Warehouse serves business intelligence, predictive analytics, operational intelligence, and data mining. The Discovery Platform supports data discovery; path, graph, and time-series analysis; and pattern detection with math, stats, and language functions. Users range from marketing, executives, and business analysts to data scientists, frontline workers, customers, partners, and engineers, reached through applications, business intelligence, and other analytic tools.]
17. TERADATA UNIFIED DATA ARCHITECTURE
[Same architecture diagram as the previous slide, mapped to Teradata products: sources (ERP, SCM, CRM, images, audio and video, machine logs, text, web and social) feed the Data Platform, the Integrated Data Warehouse, and the Discovery Platform, which in turn serve business intelligence, data mining, and math and stats tools for users from executives and business analysts to data scientists, frontline workers, customers, partners, and engineers.]
18. Teradata Appliance for Hadoop
Value-Added Software Bringing Hadoop to the Enterprise
• Access: SQL-H™, Teradata Studio
• Management: Viewpoint, TVI
• Administration: Hadoop Builder, intelligent start/stop, DataNode swap, deferred drive replacement
• High Availability: NameNode HA, master machine failover
• Refining, metadata, and entity resolution via HCatalog
• Security & data access: Kerberos
20. Teradata Vital Infrastructure (TVI)
PROACTIVE RELIABILITY, AVAILABILITY, AND MANAGEABILITY
• Server Management VMS: a 1U server virtualizes the system and cabinet management software
• Includes the Cabinet Management Interface Controller (CMIC) and Service Work Station (SWS); automatically installed on the base/first cabinet
• VMS allows full-rack solutions without an additional cabinet for the traditional SWS
• Eliminates the need for expansion racks, reducing customers' floor space and energy costs
• TVI support for Hadoop: supports both Teradata hardware and Hadoop software
• 62–70% of incidents are discovered through TVI
21. Standard SQL Access to Hadoop Data
Give business users on-the-fly access to data in Hadoop with Teradata SQL-H and Aster SQL-H.
• Fast: queries run on Teradata or Aster, with data accessed from Hadoop
• Standard: 100% ANSI SQL access to Hadoop data
• Trusted: use existing tools and skills, and enable self-service BI with granular security
• Efficient: intelligent data access leveraging the Hadoop HCatalog
[Diagram: Teradata SQL-H and Aster SQL-H push data filtering down to the Hadoop layer (HDFS), reading metadata through HCatalog alongside Hive, Pig, and MapReduce.]
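The "efficient" point rests on the connector consulting HCatalog-style metadata before touching HDFS, so only relevant partitions are ever scanned. Below is a minimal Python sketch of that partition-pruning idea; the catalog layout and the `prune_partitions` helper are illustrative assumptions, not the actual SQL-H API.

```python
# Illustrative sketch: partition pruning via catalog metadata, the kind
# of optimization a SQL-on-Hadoop connector can perform by reading
# HCatalog-style partition info before scanning any HDFS files.
# The catalog structure and paths below are hypothetical sample data.

catalog = {
    "weblogs": {
        "2014-02-26": ["/data/weblogs/2014-02-26/part-0000"],
        "2014-02-27": ["/data/weblogs/2014-02-27/part-0000"],
        "2014-02-28": ["/data/weblogs/2014-02-28/part-0000"],
    }
}

def prune_partitions(table, wanted_dates):
    """Return only the HDFS paths whose partition key matches the query
    predicate, so partitions outside the predicate are never read."""
    parts = catalog[table]
    return [path
            for date, paths in sorted(parts.items())
            if date in wanted_dates
            for path in paths]

paths = prune_partitions("weblogs", {"2014-02-28"})
print(paths)  # only the matching partition's files
```

A real engine would also push column projections and row filters down to the Hadoop side; this sketch shows only the metadata-driven pruning step.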
22. Teradata Unified Data Architecture™
Partners Support Many Layers
23. Teradata Aster Discovery Portfolio: Accelerate Time to Insights
Some of the 80+ out-of-the-box analytical apps:
• PATH ANALYSIS: discover patterns in rows of sequential data
• TEXT ANALYSIS: derive patterns and extract features in textual data
• STATISTICAL ANALYSIS: high-performance processing of common statistical calculations
• SEGMENTATION: discover natural groupings of data points
• MARKETING ANALYTICS: analyze customer interactions to optimize marketing decisions
• DATA TRANSFORMATION: transform data for more advanced analysis
• GRAPH ANALYSIS: graph analytics processing and visualization
• VISUALIZATION: graphing and visualization tools linked to key functions of the SQL-MapReduce analytics library
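To make the path-analysis category concrete: one common trick (used by sequence/path functions in this space) is to encode each session's events as single characters and match patterns with a regular expression. The sketch below is a toy Python version; the event codes and sample sessions are made up, and this is not the Aster function API.

```python
import re

# Toy path analysis: encode each session's events as one character each,
# then use a regex to find sessions where a search eventually leads to a
# payment. Event codes and sample sessions are illustrative only.

EVENT_CODES = {"home": "H", "search": "S", "product": "P",
               "basket": "B", "pay": "$", "exit": "X"}

sessions = [
    ["home", "search", "product", "basket", "pay", "exit"],
    ["home", "search", "exit"],
    ["home", "product", "exit"],
]

def matches_search_to_pay(session):
    """True if the session contains a search followed, eventually, by a pay."""
    encoded = "".join(EVENT_CODES[e] for e in session)
    return re.search(r"S.*\$", encoded) is not None

converting = [s for s in sessions if matches_search_to_pay(s)]
print(len(converting))  # 1: only the first session searches and then pays
```

Expressing the path as a string is what makes arbitrary "pattern detection" cheap: any regex over the event alphabet becomes a behavioral query.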
24. More Accurate Customer Churn Prevention
[Data flow diagram:]
• Data sources (social feeds, clickstream data, call center voice records, check data, email) feed multi-structured raw data into Hadoop.
• Hadoop captures, stores, and transforms the social, image, and call records in the capture, retain, and refine layer, producing call data and sentiment scores.
• The Aster Discovery Platform performs path and pattern analysis on that refined data.
• Dimensional data and analytic results flow through ETL tools into the Teradata Integrated DW along the traditional data flow.
• The integrated warehouse drives analysis plus marketing automation (the customer retention campaign).
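The "sentiment scores" step in the flow above can be as simple as a batch word-list scorer run over refined call-record text. Here is a minimal Python sketch of that idea; in practice this would run as a MapReduce or Hive job, and the word lists and records below are invented examples, not any product's implementation.

```python
# Minimal batch sentiment scoring sketch: score each call record by
# counting positive and negative words. Word lists and sample records
# are illustrative assumptions only.

POSITIVE = {"great", "helpful", "resolved", "thanks"}
NEGATIVE = {"cancel", "angry", "broken", "slow"}

def sentiment_score(text):
    """Score = (# positive words) - (# negative words)."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

records = {
    "call-001": "agent was great and resolved my issue thanks",
    "call-002": "service is slow and i want to cancel",
}
scores = {cid: sentiment_score(text) for cid, text in records.items()}
print(scores)  # {'call-001': 3, 'call-002': -2}
```

The per-call scores, not the raw audio transcripts, are what get joined with customer dimensions in the warehouse for the churn model.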
26. Key Considerations for EDW and Hadoop

MPP RDBMS                 | Hadoop
--------------------------|--------------------------
Stable Schema             | Evolving Schema
Leverages Structured Data | Structure Agnostic
ANSI SQL                  | Flexible Programming
Iterative Analysis        | Batch Analysis
Fine Grain Security       | N/A
Cleansed Data             | Raw Data
Seeks                     | Scans
Updates/Deletes           | Ingest
Service Level Agreements  | Flexibility
Core Data                 | All Data
Complex Joins             | Complex Processing
Efficient Use of CPU/IO   | Low Cost of Storage
27. Complete Consulting and Training Services
Areas of focus:
• Teradata Analytic Architecture Services: services to scope, design, build, operate, and maintain an optimal UDA approach for Teradata, Aster, and Hadoop
• Teradata DI Optimization: assess structured/non-structured data, discuss data loading techniques, determine the best platform, optimize load scripts/processes
• Teradata Big Analytics: assess data value/cost of capture, identify sources of "exhaust" data, create a conceptual architecture, refine and enrich the data, implement initial analytics in Aster or the best-fit tool
• Teradata Workshop for Hadoop: introductory workshop (across all of the UDA)
• Teradata Data Staging for Hadoop: load data into a landing area; set up a data exploration/refining area; scope architecture and analytics; set up the Hadoop repository; load sample data
• Teradata Platform for Hadoop: installation guidance and mentoring for the Hadoop platform, do-it-yourself after installation
• Teradata Managed Services for Hadoop: operations, management, administration, backup, security, and process control for Hadoop
• Teradata Training Courses for Hadoop: two comprehensive, multi-day training offerings: 1) Administration of Apache Hadoop and 2) Developing Solutions Using Apache Hadoop
28. Discovering Deep Insights in Retail
Transforming Web Walks into DNA Sequences

Situation: a large retailer with 700M visits/year; 2M customers/day look at 1M products online.
Problem: increase the ability of web content owners to self-serve insights.
Solution: treat web walks like DNA sequences of simple patterns.

Impact:
• Data: loaded logs into Hortonworks; loaded 2 months of raw data in 1 hour vs. 1 day on the old system; can now load a day's log data in 60 seconds.
• Sessionize: creates one sequence per visit, e.g., boils 20 customer clicks down to one line: <Home - Search - Look at Product - Add to Basket - Pay - Exit>
• Analyze: business analysts can now do path analysis themselves.
• Act: segmentation by behavior can increase conversion rates by 5-10%, and web design changes can drive another 10-20% more visitors into the sales funnel.
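The sessionize step described above can be sketched in a few lines of Python: collapse a visitor's timestamped clicks into one path string per visit, starting a new visit after a gap of inactivity. The 30-minute timeout and the sample clicks are illustrative assumptions, not the retailer's actual parameters.

```python
from datetime import datetime, timedelta

# Sessionization sketch: turn a stream of timestamped clicks into one
# path string per visit, opening a new visit after 30 minutes of
# inactivity. Timeout and sample data are assumptions for illustration.

SESSION_GAP = timedelta(minutes=30)

def sessionize(clicks):
    """clicks: list of (timestamp, page) tuples sorted by time, for one
    visitor. Returns a list of visit paths like 'Home-Search-Pay'."""
    visits, current, last_ts = [], [], None
    for ts, page in clicks:
        if last_ts is not None and ts - last_ts > SESSION_GAP:
            visits.append("-".join(current))  # close the previous visit
            current = []
        current.append(page)
        last_ts = ts
    if current:
        visits.append("-".join(current))
    return visits

t = datetime(2014, 2, 28, 9, 0)
clicks = [(t, "Home"), (t + timedelta(minutes=2), "Search"),
          (t + timedelta(minutes=5), "Pay"),
          (t + timedelta(hours=2), "Home")]  # long gap: second visit
print(sessionize(clicks))  # ['Home-Search-Pay', 'Home']
```

Once each visit is a single string, the "DNA sequence" analogy follows directly: path analysis becomes string pattern matching over those visit lines.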
IDC study: http://cdn.idc.com/research/Predictions12/Main/downloads/IDCTOP10Predictions2012.pdf
IDC projects that the digital universe will reach 40 zettabytes (ZB) by 2020, a 50-fold growth from the beginning of 2010. According to the study, 2.8 ZB of data will have been created and replicated in 2012. Machine-generated data is a key driver in the growth of the world's data, which is projected to increase 15x by 2020.

Report | McKinsey Global Institute: http://www.mckinsey.com/insights/americas/us_game_changers
"Game changers: Five opportunities for US growth and renewal," July 2013, by Susan Lund, James Manyika, Scott Nyquist, Lenny Mendonca, and Sreenivas Ramaswamy.

"By 2015, organizations that build a modern information management system will outperform their peers financially by 20 percent." - Gartner, Mark Beyer, "Information Management in the 21st Century"

By 2015, Gartner believes 65 percent of prepackaged analytic applications will have Hadoop already embedded. Gartner also sees a rising trend in "Hadoop-enabled database management systems" to help organizations deploy appliances and apps (virtual or physical) with Big Data capabilities baked in. - http://channelnomics.com/2013/01/28/gartner-predicts-big-data-explosion/

"Global data growth will outperform Moore's law over the next few years." - Forrester, http://blogs.forrester.com/holger_kisker/12-08-15-big_data_meets_cloud
Let's set some context before digging into the Modern Data Architecture. While overly simplistic, this graphic represents the traditional data architecture:
- A set of data sources producing data
- A set of data systems to capture and store that data: most typically a mix of RDBMS and data warehouses
- A set of custom and packaged applications, as well as business analytics, that leverage the data stored in those data systems

Your environment is undoubtedly more complicated, but conceptually it is likely similar. This architecture is tuned to handle TRANSACTIONS and data that fits into a relational database.

[CLICK] Fast-forward to recent years, and this traditional architecture has become PRESSURED by new sources of data that aren't handled well by existing data systems. So in the world of Big Data, we've got classic TRANSACTIONS plus new sources of data that come from what I refer to as INTERACTIONS and OBSERVATIONS. INTERACTIONS come from such things as web logs, user click streams, social interactions and feeds, and user-generated content including video, audio, and images. OBSERVATIONS tend to come from the "Internet of Things": sensors for heat, motion, and pressure, and RFID and GPS chips within such things as mobile devices, ATMs, automobiles, and even farm tractors are just some of the "things" that output observation data.
As the volume of data has exploded, Enterprise Hadoop has emerged as a peer to traditional data systems. The momentum for Hadoop is NOT about revolutionary replacement of traditional databases. Rather, it's about adding a data system uniquely capable of handling big data problems at scale, and doing so in a way that integrates easily with existing data systems, tools, and approaches. This means it must interoperate with every layer of the stack:
- Existing applications and BI tools
- Existing databases and data warehouses, for loading data to and from the data warehouse
- Development tools used for building custom applications
- Operational tools for managing and monitoring

Mainstream enterprises want to get the benefits of new technologies in ways that leverage existing skills and integrate with existing systems.
It is for that reason that we focus on HDP interoperability across all of these categories:
- Data systems: HDP is endorsed by and embedded with SQL Server, Teradata, and more.
- BI tools: HDP is certified for use with the packaged applications you already use, from Microsoft to Tableau, MicroStrategy, Business Objects, and more.
- Development tools: For .NET developers, Visual Studio, used to build more than half the custom applications in the world, is certified with HDP to enable Microsoft app developers to build custom apps with Hadoop. For Java developers, Spring for Apache Hadoop enables them to quickly and easily build Hadoop-based applications with HDP.
- Operational tools: integration with System Center and with Teradata Viewpoint.
Industry research shows the shift from a single system to an ecosystem where different technologies can unify and process data in the most efficient and specialized way to add the most value. Gartner calls this movement the "logical data warehouse," which is being driven by a "desire for high-value analytics." EMA and 9sight research shows that, on average, most companies tackle big data with three systems, including "analytic databases, discovery platforms, and NoSQL solutions" (more details below).

When asked how many nodes (a node here means a separate system/DB in their architecture) were part of their Big Data initiatives, the EMA/9sight survey respondents indicated that a wide number of Hybrid Data Ecosystem nodes were part of their plans. The most common answer among the 255 respondents was a total of three Hybrid Data Ecosystem nodes, showing that Big Data strategies are not limited to a single platform or solution. When the responses indicating two to five nodes are aggregated, over two thirds of respondents fall in this segment. This shows that Big Data initiatives involve more than just a single-platform (e.g., Hadoop) augmentation of the core operational platforms or the enterprise data warehouse. Rather, Big Data requirements are solved by a range of platforms including analytical databases, discovery platforms, and NoSQL solutions beyond Hadoop.
This is an example of meaningful enterprise-level integration that minimizes data replication and increases analyst productivity. It closes gaps in Hadoop that would otherwise take years and years to close. It leverages the scale and cost of Hadoop, but provides a proper SQL-compliant interface, strong performance, and higher analytic value through pre-built analytic functions that solve specific business problems like marketing attribution.
We see common uses for Hadoop in capturing “dark data” such as email, call center IVR records, documents, and other “no schema” data which does not fit easily into a relational model without pre-processing. Hadoop provides a landing/staging/refining area to munge this data and make it available to join with other data. In some cases, the text can be parsed and “scored” for sentiment as a one-time batch job when interactivity isn’t required.
From http://www.odbms.org/blog/2011/10/analytics-at-ebay-an-interview-with-tom-fastner/

eBay is rapidly changing, and analytics is driving many key initiatives like buyer experience, search optimization, buyer protection, and mobile commerce. We are investing heavily in new technologies and approaches to leverage new data sources to drive innovation. We have three different platforms for analytics:

A) EDW: dual systems for transactional (structured) data; Teradata, 3.5 PB and 2.5 PB of spinning disk; 10+ years of experience; very high concurrency; good accessibility; hundreds of applications.
B) Singularity: deep Teradata system for semi-structured data; 36 PB of spinning disk; lower concurrency than EDW, but can store more data; the biggest use case is user behavior analysis; the largest table is 1.2 PB with ~1.9 trillion rows.
C) Hadoop: for unstructured/complex data; ~40 PB of spinning disk; text analytics and machine learning; holds the user behavior data and selected EDW tables; lower concurrency and utilization.

When dealing with terabytes to petabytes of data, how do you ensure scalability and performance?

Tom Fastner: EDW: We model for the unknown (close to 3NF) to provide a solid physical data model suitable for many applications, which limits the number of physical copies needed to satisfy specific application requirements. A lot of scalability and performance is built into the database, but like any shared resource it requires an excellent operations team to fully leverage the capabilities of the platform.

Singularity: The platform is identical to EDW; the only exceptions are limitations in workload management due to configuration choices. But since we are leveraging the latest database release, we are exploring ways to adopt new storage and processing patterns. Some new data sources are stored in a denormalized form, significantly simplifying data modeling and ETL. On top of that, we developed functions to support the analysis of the semi-structured data.
It also enables more sophisticated algorithms that would be very hard, inefficient, or impossible to implement in pure SQL. One example is the pathing of user sessions. However, the size of the data requires us to focus more on best practices (develop on small subsets, use a 1% sample, process by day).

Hadoop: The emphasis on Hadoop is on optimizing for access. The reusability of data structures (besides "raw" data) is very low. Unstructured data is handled on Hadoop only. The data is copied from the source systems into HDFS for further processing. We do not store any of that on the Singularity (Teradata) system.
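The "develop on a 1% sample" practice mentioned above is usually done deterministically, by hashing a stable key such as the user ID, so the same users land in the sample on every run and in every table. A minimal Python sketch of that idea (the hash choice, bucket scheme, and user IDs are illustrative assumptions):

```python
import hashlib

# Deterministic 1% sampling sketch: hash each user ID into 100 buckets
# and keep the first pct buckets. Because the hash is stable, the same
# users are sampled across runs and across joined tables.
# The user IDs below are made-up sample data.

def in_sample(user_id, pct=1):
    """Keep a user if their hash falls in the first pct of 100 buckets."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return h % 100 < pct

users = [f"user-{i}" for i in range(10000)]
sample = [u for u in users if in_sample(u)]
print(len(sample))  # roughly 1% of 10,000 users
```

Hashing rather than random sampling matters for path analysis in particular: a user's entire session history is either fully in the sample or fully out, so paths are never truncated by the sampling step.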