Copyright © 2014, Oracle and/or its affiliates. All rights reserved.
Big Data Landscape and Implementation Strategy
Ben Duan
Director, Enterprise Architecture
Oracle
08/2014
Agenda
1. Big Data and Its Impact
2. Big Data Technology Landscape
3. Big Data Implementation Strategy
Big Data Definition
“Big data is a term that describes large volumes of high velocity,
complex and variable data that require advanced techniques and
technologies to enable the capture, storage, distribution,
management, and analysis of the information”
--TechAmerica
Big Data Characteristics
VOLUME, VELOCITY, VARIETY, VALUE
(Figure: example data sources include social media, blogs, and smart meters.)
Big Data Is About…
Tapping into diverse data sets
Finding and monetizing unknown relationships
Creating data-driven business decisions
Why Is Big Data Important?
US HEALTH CARE: Increase industry value per year by $300 B
MANUFACTURING: Decrease dev., assembly costs by 50%
GLOBAL PERSONAL LOCATION DATA: Increase service provider revenue by $100 B
US RETAIL: Increase net margin by 60+%
EUROPE PUBLIC SECTOR ADMIN: Increase industry value per year by €250 B
“In a big data world, a competitor that fails to sufficiently develop its capabilities will be left behind.”
--McKinsey Global Institute
Source: McKinsey Global Institute: Big Data – The next frontier for innovation, competition and productivity (May 2011)
Big Data Investments Are Happening Everywhere
Address Specific Industry Needs
MEDIA/ENTERTAINMENT: Viewers / advertising effectiveness
COMMUNICATIONS: Location-based advertising
EDUCATION & RESEARCH: Experiment sensor analysis
CONSUMER PACKAGED GOODS: Sentiment analysis of what’s hot, problems
HEALTH CARE: Patient sensors, monitoring, EHRs; quality of care
LIFE SCIENCES: Clinical trials; genomics
HIGH TECHNOLOGY / INDUSTRIAL MFG.: Mfg quality; warranty analysis
OIL & GAS: Drilling exploration sensor analysis
FINANCIAL SERVICES: Risk & portfolio analysis
AUTOMOTIVE: Auto sensors reporting location, problems
RETAIL: Consumer sentiment; optimized sales & marketing
LAW ENFORCEMENT & DEFENSE: Threat analysis - social media monitoring, photo analysis
TRAVEL & TRANSPORTATION: Sensor analysis for optimal traffic flows; customer sentiment
UTILITIES: Smart Meter analysis
ON-LINE SERVICES / SOCIAL MEDIA: People & career matching; web-site optimization
Challenged by: Data Volume, Velocity, Variety in finding Value
Big Data Analytics Sweet Spot
(Figure: chart comparing Traditional RDBMS, Big Data Analytics, and Generic Data Processing along eight axes: Table Join Complexity, Data Update Pattern (Append Only to Transactional), Schema Complexity (Structured to Unstructured), Total Data Volume (100 TB to 100 PB), Responsiveness (Interactive to Batch), Per Job Data Volume (100 TB to 100 PB), Processing Freedom (SQL to Batch), and Concurrent Jobs.)
Potential Use Cases for Big Data Analytics
(Figure: use cases plotted by Data Velocity, from Batch to Real Time, against Data Structure, from Structured through Semi-Structured to Unstructured.)
Use cases shown:
• Credit and Market Risk at Banks
• Fraud Detection (Credit Card) & Financial Crimes (AML) in Banks (including Social Network Analysis)
• Event-based Marketing in Financial Services and Telecoms
• Markdown Optimization in Retail
• Claims and Tax Fraud in Public Sector
• Predictive Maintenance in Aerospace
• Social Media Sentiment Analysis
• Demand Forecasting in Manufacturing
• Disease Analysis on Electronic Health Records
• Traditional Data Warehousing
• Text Mining
• Video Surveillance/Analysis
A Common Use Case: Predictive Ad and Content Generation
(Figure: data-flow diagram with a Low Latency path and a Batch path.)
Low Latency path: look up user profile in NoSQL DB (add user if not present); input into Expert System; real-time: determine best ad to place on page for this user
Batch path: web logs, actual ads served, and predictions on browsing; HDFS; high-scale data reductions; Profiles (NoSQL DB); BI and Analytics; Billing
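The low-latency path above (look up the profile, add the user if not present, pick an ad, record what was served) can be sketched roughly as follows. This is an illustrative sketch only: the in-memory dict stands in for the NoSQL profile store, the rule in choose_ad stands in for the expert system, and every name here (Profile, ADS, serve, apply_batch_update) is hypothetical rather than taken from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    user_id: str
    # category -> interest score; assumed to be refreshed by the batch path
    interests: dict = field(default_factory=dict)


# Hypothetical ad inventory: ad id -> target category (None = untargeted)
ADS = {"ad-sports": "sports", "ad-travel": "travel", "ad-generic": None}

profiles = {}    # stand-in for the NoSQL profile store
served_log = []  # stand-in for the "actual ads served" feed into batch


def lookup_or_add(user_id):
    """Look up the user profile; add the user if not present."""
    profile = profiles.get(user_id)
    if profile is None:
        profile = Profile(user_id)
        profiles[user_id] = profile
    return profile


def choose_ad(profile):
    """Toy stand-in for the expert system: match the top interest to an ad."""
    if not profile.interests:
        return "ad-generic"
    top = max(profile.interests, key=profile.interests.get)
    for ad_id, category in ADS.items():
        if category == top:
            return ad_id
    return "ad-generic"


def serve(user_id):
    """Low-latency path: profile lookup, ad decision, log the served ad."""
    profile = lookup_or_add(user_id)
    ad_id = choose_ad(profile)
    served_log.append((user_id, ad_id))
    return ad_id


def apply_batch_update(user_id, interests):
    """Batch path: write interest scores derived offline back to the profile."""
    lookup_or_add(user_id).interests.update(interests)
```

A first request for an unknown user creates the profile and falls back to the untargeted ad; once the batch path has written interest scores back, the same call returns a targeted ad.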
Agenda
1. Big Data and Its Impact
2. Big Data Technology Landscape
3. Big Data Implementation Strategy
Integrated Big Data Architecture Capabilities
(Figure: capability map across the stages Acquire/Store, Organize, Analyze, and Decide, with Management, Security, and Governance alongside, on a foundation of Servers, Storage, Network, and OS.)
Labels shown: Transaction Data; DBMS (OLTP); Master & Ref Data; Machine Generated; Social Media; Text, Image, Video, Audio; Structured; Semi-structured; Unstructured; NoSQL; HDFS; Hadoop (MapReduce); Message-Based; ETL/ELT; CDC; ODS; Streaming (CEP Engine); Real-Time; DW/DM; In-Database Analytics; Advanced Analytics; Text Analytics and Search; Visual Discovery; Reporting & Dashboards; Alerting; EPM; BI Applications
Hadoop Marketplace
• Apache Open Source
– Hadoop Common, HDFS, YARN, MapReduce, etc.
• Pure-play Hadoop Distribution Vendors
– Cloudera
– Hortonworks
– MapR
• Enterprise Software Vendors that Offer Hadoop Distribution
– Oracle with Cloudera
– Intel with Cloudera
– SAP with Hortonworks
– Microsoft with Hortonworks
– Teradata with Hortonworks
– HP with Hortonworks
– Amazon with MapR
Data Acquisition
Sqoop: Provides command-line tools to load and extract data between Hadoop and relational data sources
Flume: Continuously captures and ingests/streams data into Hadoop
Chukwa: Collects data from various sources and stores it in HDFS, leveraging MapReduce
Storm: Distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing
Oracle Data Integrator: High-performance ETL tool to load files from HDFS to MapReduce
Data Storage
HDFS: An Apache open source distributed file system. Combined with the Apache MapReduce engine, it stores and processes data sets at terabyte to petabyte scale in a highly distributed manner, with redundancy and performance.
NoSQL DB: Not Only SQL. It is specifically designed for read and append workloads on a distributed, fault-tolerant architecture. It can offer orders-of-magnitude performance improvement on reads over large volumes of data (terabytes), but it typically offers only Basically Available, Soft State, Eventually Consistent (BASE) guarantees on transactions. Major players include Oracle NoSQL DB, MarkLogic, Cassandra, MongoDB, Accumulo, etc.
In Memory DB: Data in an in-memory database resides primarily in memory. Vendors of in-memory databases have achieved breakthrough performance improvements by leveraging better CPU cache utilization, parallel execution on multi-core processors, and columnar data compression, among others. Major players include Oracle TimesTen, Oracle Database 12c, Microsoft SQL Server 2014, SAP HANA, IBM DB2 BLU Acceleration, etc.
Data Organization
MapReduce: A programming model and an associated implementation for processing large data sets. Data is first organized into key/value pairs. A map function processes the data set to produce a new set of intermediate key/value pairs. A reduce function then merges all intermediate values that share the same key.
YARN (Yet Another Resource Negotiator): Next-generation resource management and job scheduling layer that separates cluster resource management from the MapReduce processing engine. It is the foundation of Hadoop 2.0. It facilitates multiple workloads, runs multiple data engines, and supports multiple access patterns (batch, interactive, streaming, and real-time) in Apache Hadoop 2.
Hive: Provides ad hoc query and analysis for data on Hadoop using a SQL interface
Impala: Low-latency SQL query engine for data on Hadoop, positioned as a real-time alternative to Hive; claimed to be 3x-90x faster than Hive
Pig: Provides an interface and data flow language for processing data on Hadoop
ZooKeeper: Provides configuration, naming, and other coordination capabilities for Hadoop tasks
Oozie: Provides workflow and coordination services for jobs running on Hadoop
Oracle SQL Loader for Hadoop: High-speed data loader to load data from HDFS to Oracle DB at up to 15 TB/hour
Oracle Data Integrator: ELT tool to load data from HDFS to Oracle DB
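The MapReduce model described above can be illustrated with a small single-process sketch. This is plain Python, not the Hadoop API: run_mapreduce is a hypothetical helper that imitates the map, group-by-key, and reduce phases in memory, and word counting is used only as a familiar example.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


# Input records are (line number, line text) key/value pairs.
counts = run_mapreduce([(0, "big data"), (1, "Big Data big")], map_fn, reduce_fn)
```

In a real cluster the grouping step is the distributed shuffle between mappers and reducers; here it is just a dictionary of lists.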
Data Analysis and Decision
In Database R: R is open source data analytics software. Vendors including Oracle have enhanced it to run inside the database to improve performance and data processing capability. Vendors including Oracle have also developed adaptors that enable R to read data directly from HDFS.
Enterprise Search: Solr. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search.
Stream Analytics, Complex Event Processing (CEP): Stream analytics continuously processes real-time data and combines it with historical data to derive decisions. A window normally defines how much historical data to take into consideration, for example the last 10 minutes or the last 10,000 transactions. Vendors include Oracle, IBM, etc.
Business Intelligence: Dashboards, ad hoc queries, scorecards, etc. Products include OBIEE, Cognos, MicroStrategy, Business Objects, QlikView, Palantir, Tableau. Some of them provide Big Data connectors via Hive.
Advanced Analytics/Information Discovery: Text mining, predictive analytics, contextual search and analysis, statistical analysis, cluster analysis, spatial analysis. Products include SAS, SPSS, Oracle Endeca, QlikView, Palantir, Tableau, etc. Most of them provide Big Data connectors.
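The two window styles mentioned for stream analytics (last 10 minutes, last N transactions) can be sketched as follows. This is a minimal illustration and not any vendor's CEP engine: the class names, the is_anomalous rule, and its factor of 3 are assumptions made for the example.

```python
from collections import deque


class TimeWindow:
    """Keeps events from the last span_seconds (e.g. 600 for 10 minutes)."""

    def __init__(self, span_seconds):
        self.span = span_seconds
        self.events = deque()  # (timestamp, value), oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] <= timestamp - self.span:
            _, old_value = self.events.popleft()
            self.total -= old_value

    def mean(self):
        return self.total / len(self.events) if self.events else 0.0


class CountWindow:
    """Keeps the last `size` values (e.g. the last 10,000 transactions)."""

    def __init__(self, size):
        self.values = deque(maxlen=size)  # old values drop off automatically

    def add(self, value):
        self.values.append(value)

    def mean(self):
        return sum(self.values) / len(self.values) if self.values else 0.0


def is_anomalous(window, value, factor=3.0):
    """Hypothetical decision rule: flag values far above the window mean."""
    baseline = window.mean()
    return baseline > 0 and value > factor * baseline
```

Each incoming event is compared against the window before being added to it, so the decision uses only the history the window defines.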
Data Security
• Authentication
– Kerberos, AD, LDAP
• Authorization
– Apache Sentry
• Audit
– Cloudera Navigator
– Oracle Audit Vault
– Oracle Database Firewall
• Encryption
– Filesystem-level encryption (AES-256)
• Key Management
– Oracle Data Vault, Cloudera Enterprise Data Hub
System Management
• Cloudera Manager
– End-to-end administration for Cloudera's Distribution Including Apache Hadoop (CDH)
Infrastructure
• Public Cloud
– MapR Hadoop on Amazon AWS
– Microsoft HDInsight on Azure
• On Premise
– Oracle Big Data Appliance
– Teradata Appliance for Hadoop
• Acquire: low, predictable latency; high transaction count; flexible data structures
• Organize: high throughput; in-place preparation; all data sources/structures
• Analyze: deep analytics; agile development; massive scalability; real-time results
Oracle Big Data Appliance X4-2
• Software
– Oracle Linux 6.4
– JDK 7
– CDH 4.x and 5.x
– Cloudera Manager
– Oracle R Distribution
– Oracle NoSQL DB
– Oracle Enterprise Manager
– Oracle Big Data SQL
– Oracle Big Data Connectors
– Oracle Audit Vault and Database Firewall
– Oracle Data Integrator
• Hardware (Full Rack)
– 18 Servers
  • 288 Cores
  • 1 - 9 TB Memory
  • 864 TB Storage
– 40 Gb/sec InfiniBand
Pure Players: End-to-End Big Data Solution
• Hortonworks Data Platform
• Cloudera Enterprise Data Hub
Oracle Big Data Solution
Agenda
1. Big Data and Its Impact
2. Big Data Technology Landscape
3. Big Data Implementation Strategy
Gartner Big Data Hype Cycle
Organizations’ Top Big Data Challenges
Big Data and Analytics Development Process
• Development Process
– Start with a hypothesis, the big idea
– Gather available and relevant data sources
– Explore results through various data processing and integration methods
– Reduce ambiguity, apply statistical models, eliminate outliers, find concentrations, and make correlations
– Interpret the outcome, continuously refine models, and establish an improved hypothesis
• New Characteristics
– New process for high-volume and very detailed information that requires hardware and software parallelization and optimization
– New analytical methods: new analyses and methodologies need to take the massive volume of data into account
– New association: aligning new big data sets with existing data assets
Big Data Implementation Strategy
People: Data scientists, big data architects, developers, and admins work hand-in-hand to pursue agility and teamwork
Process: Adopt the proven big data architecture development process
Technology: Choose the right technologies that align with business and IT strategy on Big Data
People: Develop the Talent
Big Data Scientist
Big Data Developers
Big Data Administrators
Agility and Teamwork
Training, Knowledge Management, Center of Excellence
Process: Oracle Architecture Development Process for Information Architecture
Business Context
Drivers for Big Data Architecture (Sample)
• Business Drivers
– Better insight for better customer intimacy
– Product innovation
– Risk analysis
– Fraud detection
• IT Drivers
– Reduce storage cost
– Reduce data movement
– Faster time to market
– Standardized toolset
– Ease of management and operation
– Security and governance
Architecture Vision
Architecture Principles (Samples)
Principle: Horizontal scaling model
Rationale: To ensure sufficient scalability, the model must be extensible across a large number of nodes.
Implication: Virtually “unlimited” horizontal scalability of data across nodes. Data replication to ensure redundancy in case of failure.

Principle: Minimal data structures
Rationale: Minimal data structures in the initial stage of filtering/analysis allow the rapid loading of many different data types.
Implication: Enables flexibility in the programming layer to rapidly tailor to specific needs.

Principle: Alignment with the overall information architecture
Rationale: The value of big data is maximized when it can be correlated with existing enterprise information and business processes.
Implication: Appropriate technology, process, and people are needed to correlate and integrate big data analytic results with enterprise information, applications, and business processes.

Principle: Appropriate governance
Rationale: The governance model must be aligned with the organizational needs and phase of the Big Data initiative.
Implication: Data exploration requires a looser governance model to maximize agility. However, as valuable data is uncovered and becomes incorporated into standard business processes, the governance model on the data must be tightened.
Current State
• Sample Current State
• Maturity Assessment
Oracle Information Architecture Conceptual View
(Figure: layered conceptual diagram.)
Labels shown: Data Sources; Internal Systems; External Systems; Data Integration; SOA/ESB; ELT/ETL; CEP; Weblog Aggregation; Data Quality; Profile; De-dupe; Cleanse; Match; Data Staging; Master (Customer, Product, etc.); Reference; Historical (3NF); Analytical Dimensions (Cubes, Snowflake); Data Marts; Enterprise Data Warehouse; Big Data (Map/Reduce, NoSQL, Distributed File System); Content Management; Operational Systems; Federation and Virtualization; Enterprise Information Services; Reporting Applications; Operational Analytics; Predictive/Advanced Analytics; Knowledge Discovery; User Interface; Portals; Mashups; Search; Dashboards; Alerts; Visuals; Mobile; Shared Services; Security, Data Governance, Metadata Management; Infrastructure
Oracle Big Data Architecture Conceptual View
Oracle Big Data Architecture Logical View
Roadmap (Example)
Governance
• Goals
– Define, approve, and communicate Big Data strategies, policies, standards, architecture, procedures, and metrics
– Track and enforce regulatory compliance and conformance to Big Data policies, standards, architecture, and procedures
– Sponsor, track, and oversee the delivery of Big Data management projects and services to deliver the intended business outcome
– Manage and resolve Big Data–related quality issues
– Understand and promote the value of Big Data assets and related governance
Governance
• Approach
– Align with business strategy
– Establish Big Data governance and incorporate roles into existing Governance organization
– Be flexible to change existing governance policies and processes
– Embed governance process into Big Data projects
– Establish key data subject areas in the Big Data space that are linked to high-potential business outcomes
– Establish business as well as technical ownership of Big Data governance
– Establish and align Big Data architecture with the Enterprise Information Architecture
– Establish data quality, master data, and metadata that may be new and unique to Big Data
– Clearly establish Information Lifecycle Management components for Big Data governance
– Establish regular meetings between business stewards, data stewards, and data scientists on Big Data that is impacted by regulatory compliance and protection of sensitive data
Big Data Best Practices
• Align Big Data initiatives with specific business goals
• Ensure a centralized IT strategy for standards and governance
• Use a Center of Excellence to minimize training and risk
• Correlate Big Data with structured data
• Provide high-performance and scalable analytical sandboxes
• Reshape the IT operating model.
Questions
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |

Big Data

  • 1.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. |Copyright © 2014, Oracle and/or its affiliates. All rights reserved. | Big Data Landscape and Implementation Strategy Ben Duan Director, Enterprise Architecture Oracle 08/2014 1
  • 2.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | Agenda Big Data and Its Impact Big Data Technology Landscape Big Data Implementation Strategy 1 2 3 2
  • 3.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | Big Data Definition “Big data is a term that describes large volumes of high velocity, complex and variable data that require advanced techniques and technologies to enable the capture, storage, distribution, management, and analysis of the information” --TechAmerica 3
  • 4.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | VELOCITY VARIETY VALUE SOCIAL BLOG SMART METER VOLUME 4 Big Data Characteristics
  • 5.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | Big Data Is About… Tapping into diverse data sets Finding and monetizing unknown relationships Creating data driven business decisions 5
  • 6.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | Why Is Big Data Important? Source: * McKinsey Global Institute: Big Data – The next frontier for innovation, competition and productivity (May 2011) US HEALTH CARE MANUFACTURING GLOBAL PERSONAL LOCATION DATA $300 B –50% $100 B “In a big data world, a competitor that fails to sufficiently develop its capabilities will be left behind.” Increase industry value per year by Decrease dev., assembly costs by Increase service provider revenue by McKinsey Global Institute US RETAIL 60+% Increase net margin by EUROPE PUBLIC SECTOR ADMIN €250 B Increase industry value per year by
  • 7.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | Big Data Investment Are Happening Everywhere
  • 8.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | MEDIA/ ENTERTAINMENT Viewers / advertising effectiveness COMMUNICATIONS Location-based advertising EDUCATION & RESEARCH Experiment sensor analysis CONSUMER PACKAGED GOODS Sentiment analysis of what’s hot, problems HEALTH CARE Patient sensors, monitoring, EHRs Quality of care LIFE SCIENCES Clinical trials Genomics HIGH TECHNOLOGY / INDUSTRIAL MFG. Mfg quality Warranty analysis OIL & GAS Drilling exploration sensor analysis FINANCIAL SERVICES Risk & portfolio analysis AUTOMOTIVE Auto sensors reporting location, problems RETAIL Consumer sentiment Optimized sales & marketing LAW ENFORCEMENT & DEFENSE Threat analysis - social media monitoring, photo analysis TRAVEL & TRANSPORTATION Sensor analysis for optimal traffic flows Customer sentiment UTILITIES Smart Meter analysis Address Specific Industry Needs ON-LINE SERVICES / SOCIAL MEDIA People & career matching Web-site optimization Challenged by: Data Volume, Velocity, Variety in finding Value
  • 9.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | Table Join Complexity Data Update Pattern Schema Complexity Total Data Volume Responsiveness Per Job Data Volume Processing Freedom Concurrent Jobs Traditional RDBMS Big Data Analytics Generic Data Processing 1000 100PB 10PB 1PB StructuredAppend Only UnstructuredTransactional 100 Tables 100T B SQL Interactive Batch 100T B Batch 10PB 1PB 100PB 9 Big Data Analytics Sweet Spot
  • 10.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | Data Velocity Batch Real Time Data Structure Structured Semi-Structured Unstructured Credit and Market Risk at Banks Fraud Detection (Credit Card) & Financial Crimes (AML) in Banks (including Social Network Analysis) Event-based Marketing in Financial Services and Telecoms Markdown Optimization in Retail Claims and Tax Fraud in Public Sector Potential Use Cases for Big Data Analytics Predictive Maintenance in Aerospace Social Media Sentiment Analysis Demand Forecasting in Manufacturing Disease Analysis on Electronic Health Records Traditional Data Warehousing Text Mining Video Surveillance/Analysis 10
  • 11.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | A Common Use Case: Predictive Ad and Content Generation NoSQL DB Expert System Real-time: Determine best ad to place on page for this user Input into Lookup user profile Add user if not present Web logs HDFS Profiles NoSQL DB High scale data reductions BI and Analytics Billing Predictions on browsing Actual ads served Low Latency Batch 11
  • 12.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | Agenda Big Data and Its Impact Big Data Technology Landscape Big Data Implementation Strategy 2 1 3 12
  • 13.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | Integrated Big Data Architecture Capabilities Transaction Data Management Security,Governance Advanced Analytics Visual Discovery DBMS (OLTP) Master & Ref Data Structured DW/DM Text Analytics and Search Reporting & Dashboards Real-Time Machine Generated Social Media Text, Image Video, Audio NoSQL UnstructuredSemi- structured HDFS Alerting In-Database Analytics EPM BI Applications Message- Based ETL/ELT CDC ODS Streaming (CEP Engine) Acquire/Store Organize Analyze Decide Hadoop (MapReduce) Servers Storage Network OS
  • 14.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | • Apache Open Source – Hadoop Common, HDFS, YARN, MapReduce, etc. • Pure-play Hadoop Distribution Vendors – Cloudera – Hortonworks – MapR • Enterprise Software Vendors that Offer Hadoop Distribution – Oracle with Cloudera – Intel with Cloudera – SAP with Hortonworks – Microsoft with Hortonworks – Teradata with Hortonworks – HP with Hortonworks – Amazon with MapR Hadoop Marketplace
  • 15.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | Data Acquisition Sqoop Provides command-line tools to load and extract data from Hadoop and relational data sources Flume Continuously captures and ingests/streams data into Hadoop Chukwa Collects data from various sources and stores in HDFS, leveraging MapReduce Storm Distributed real time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real time processing what Hadoop did for batch processing Oracle Data Integrator High performance ETL tool to load file from HDFS to MapReduce 15
  • 16.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | Data Storage HDFS HDFS is an Apache open source distributed file system and combined with Apache Map Reduce engine to store and process large data size to terabyte to petabyte levels in a highly distributed manner with redundancy and performance. NoSQL DB Not Only SQL. It is specific designed for read and append based on distributed, fault-tolerant architecture. It offers many orders of magnitude performance improvement on read on large volume of data (terabytes). But it only offers Basically Available, Soft State, Eventually Consistent (BASE) guarantees on transaction. Major players include: Oracle NoSql DB, Marklogic, Cassandra, MongoDB, Accumulo, etc. In Memory DB Data of In-Memory database resides primarily in memory. Vendors of In-Memory database have achieved break-through performance improvement by leveraging better CPU cache utilization, parallel execution through multi-core processors, columnar data compression, among others. Major players include: Oracle TimesTen, Oracle DB 12C, Microsoft SQL Server 2014, SAP HANA, IBM DB2 BLU Acceleration, etc. 16
  • 17.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | Data Organization 17 MapReduce MapReduce is a programming model and an associated implementation for processing large data sets. Data is first organized into key/value pairs. Map function is used to process the data set to produce a new set of intermediate key/value pairs. Then reduce function is used to merge all intermediate values with the same key. YARN (Yet Another Resource Negotiator) Next generation of MapReduce for dramatic performance improvement. It is the foundation of Hadoop 2.0. It facilitates multiple workloads, runs multiple data engines, and supports multiple access patterns—batch, interactive, streaming, and real-time—in Apache Hadoop 2. Hive Provides ad hoc query and analysis for data on Hadoop using a SQL interface Impala Real time version of Hive, claimed 3x-90x faster than Hive Pig Provides interface and data flow language for processing data on Hadoop Zookeeper Provides configuration, naming, and other coordination capability for Hadoop tasks Oozie Provides workflows and coordination services for jobs that are running on Hadoop Oracle SQL Loader for Hadoop High speed data loader to load data from HDFS to Oracle DB up to 15TB/hour Oracle Data Integrator ELT tool to load data from HDFS to Oracle DB
  • 18.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | Data Analysis and Decision 18 In Database R Open Source data analytics software. Venders including Oracle enhanced it to reside in database to improve performance and data processing capability. Vendors including Oracle also developed adaptor to enable R to read data directly from HDFS. Enterprise Search Solr. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Stream Analytics, Complex Event Processing (CEP) Stream Analytics involves continuously process the real time data, and combines with the historic data, to derive decisions. There normally is a window to define how much historic data to take into consideration, for example, last 10 minutes, or last 10,000 transactions, etc. Vendors include Oracle, IBM, etc. Business Intelligence Dashboard, adhoc query, scorecard, etc. Products include OBIEE, Cognos, Microstrategy, Business Objects, Qlikview, Palantir, Tableau. Some of them provide Big Data connectors via Hive. Advanced Analytics/Information Discovery Text mining, predictive analytics, contextual search and analysis, statistical analysis, Cluster analysis, spatial analysis. Products include SAS, SPSS, Oracle Endeca, Qlikview, Palantir, Tableau, etc. Most of them provide Big Data connectors.
  • 19.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | • Authentication – Kerberos, AD, LDAP • Authorization – Apache Sentry • Audit – Cloudera Navigator – Oracle Audit Vault – Oracle Database Firewall • Encryption – Filesystem level encryption AES-256 • Key Management – Oracle Data Vault, Cloudera Enterprise Data Hub 19 Data Security
  • 20.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | • Cloudera Manager – End to End Administration for Cloudera Distributed Hadoop (CDH) 20 System Management
  • 21.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | • Public Cloud – MapR Hadoop on Amazon AWS – Microsoft HDInsight on Azure 21 Infrastructure • On Premise – Oracle Big Data Appliance – Teradata Appliance for Hadoop • Deep Analytics • Agile Development • Massive Scalability • Real Time Results • High Throughput • In-Place Preparation • All Data Sources/Structures • Low, predictable Latency • High Transaction Count • Flexible Data Structures Acquire Organize Analyze
  • 22.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | • Software – Oracle Linux 6.4 – JDK 7 – CDH 4.x and 5.x – Cloudera Manager – Oracle R Distribution – Oracle NoSQL DB – Oracle Enterprise Manager – Oracle Big Data SQL – Oracle Big Data Connectors – Oracle Audit Vault and Database Firewall – Oracle Data Integrator • Hardware (Full Rack) – 18 Servers • 288 Cores • 1 - 9 TB Memory • 864 TB Storage – 40 GB/sec Infiniband 22 Oracle Big Data Appliance X4-2
  • 23.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | • Hortonworks Data Platform • Cloudera Enterprise Data Hub 23 Pure Players: End-to-End Big Data Solution
  • 24.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | Oracle Big Data Solution 24
  • 25.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | Agenda Big Data and Its Impact Big Data Technology Landscape Big Data Implementation Strategy3 2 1 25
  • 26.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | 26 Gartner Big Data Hype Curve
  • 27.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | Organization’s Top Big Data Challenges 27
  • 28.
    Copyright © 2014,Oracle and/or its affiliates. All rights reserved. | • Development Process – Start with a hypothesis, the big idea – Gather available and relevant data sources – Explore results through various data processing and integration methods – Reduce ambiguity, apply statistical models, eliminate outliers, find concentrations, and make correlations. – Interpret the outcome, continuously refine models, and establish an improved hypothesis 28 • New Characteristics – New process for high volume and very detailed information that requires hardware and software parallelization and optimization. – New analytical methods. New analyses and methodologies need tp take into account of massive volume of data. – New association, aligning new big data sets with existing data assets. Big Data and Analytics Development Process
  • 29.
    Big Data Implementation Strategy
    • People: Data scientists, big data architects, developers, and admins work hand in hand to pursue agility and teamwork
    • Process: Adopt the proven big data architecture development process
    • Technology: Choose the right technologies that align with business and IT strategy on Big Data
  • 30.
    People: Develop the Talent
    • Big Data Scientist
    • Big Data Developers
    • Big Data Administrators
    • Agility and Teamwork
    • Training, Knowledge Management, Center of Excellence
  • 31.
    Process: Oracle Architecture Development Process for Information Architecture
  • 32.
    Business Context
    Drivers for Big Data Architecture (Sample)
    • Business Drivers
      – Better insight for better customer intimacy
      – Product innovation
      – Risk analysis
      – Fraud detection
    • IT Drivers
      – Reduce storage cost
      – Reduce data movement
      – Faster time to market
      – Standardized toolset
      – Ease of management and operation
      – Security and governance
  • 33.
    Architecture Vision
    Architecture Principles (Samples)
    • Horizontal scaling model
      – Rationale: To ensure sufficient scalability, the model must be extensible across a large number of nodes.
      – Implication: Virtually “unlimited” horizontal scalability of data across nodes. Data replication to ensure redundancy in case of failure.
    • Minimal data structures
      – Rationale: Minimal data structures in the initial stage of filtering/analysis allow the rapid loading of many different data types.
      – Implication: Enables flexibility in the programming layer to rapidly tailor to specific needs.
    • Alignment with the overall information architecture
      – Rationale: The value of big data is maximized when it can be correlated with existing enterprise information and business processes.
      – Implication: Appropriate technology, process, and people are needed to correlate and integrate big data analytic results with enterprise information, applications, and business processes.
    • Appropriate governance
      – Rationale: The governance model must be aligned with the organizational needs and phase of the Big Data initiative.
      – Implication: Data exploration requires a looser governance model to maximize agility. However, as valuable data is uncovered and becomes incorporated into standard business processes, the governance model on the data must be tightened.
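The horizontal scaling principle, with replication for redundancy, can be sketched as follows. This is a minimal illustration under stated assumptions: the node names, the key, and both function names are hypothetical, and simple hash-modulo placement stands in for the block placement or consistent hashing schemes that real distributed stores use.

```python
import hashlib


def replica_nodes(key, nodes, replication_factor=3):
    """Return the distinct nodes that hold copies of `key`."""
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication_factor must be between 1 and len(nodes)")
    # A stable hash keeps placement identical across runs and machines.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    # Copies go to consecutive nodes, wrapping around the end of the list.
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def readable_after_failure(key, nodes, failed, replication_factor=3):
    """True if at least one replica of `key` lives on a node not in `failed`."""
    return any(n not in failed for n in replica_nodes(key, nodes, replication_factor))


# Hypothetical 18-node cluster and data key.
nodes = [f"node-{i:02d}" for i in range(1, 19)]
placement = replica_nodes("weblog-2014-08-01", nodes)
print("replicas on:", placement)
print("survives loss of first replica:",
      readable_after_failure("weblog-2014-08-01", nodes, failed={placement[0]}))
```

Adding nodes to the list spreads keys over more machines, and each key stays readable as long as one of its replicas is on a healthy node. Note that plain modulo placement moves most keys when the node count changes, so it is suitable only as an illustration of the principle.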
  • 34.
    Current State
    • Sample Current State
    • Maturity Assessment
  • 35.
    Oracle Information Architecture Conceptual View
    [Diagram. Labels: Data Sources; Data Quality; Data Staging; Master (Customer, Product, etc.); Reference; Historical (3NF); Analytical Dimensions (Cubes, Snowflake); Data Marts; Enterprise Information Services; Reporting Applications; Operational Analytics; Predictive/Advanced Analytics; Knowledge Discovery; Content Management; Operational Systems; User Interface; Big Data; Map/Reduce, NoSQL; Distributed File System; Federation and Virtualization; Security, Data Governance, Metadata Management; Infrastructure; Enterprise Data Warehouse; SOA/ESB; ELT/ETL; CEP; Weblog Aggregation; Internal Systems; External Systems; Data Integration; Portals; Mashups; Search; Dashboards; Alerts; Visuals; Mobile; Profile; De-dupe; Cleanse; Match; Shared Services]
  • 36.
    Oracle Big Data Architecture Conceptual View
  • 37.
    Oracle Big Data Architecture Logical View
  • 38.
    Roadmap (Example)
  • 39.
    Governance
    • Goals
      – Define, approve, and communicate Big Data strategies, policies, standards, architecture, procedures, and metrics
      – Track and enforce regulatory compliance and conformance to Big Data policies, Big Data standards, Big Data architecture and procedures
      – Sponsor, track, and oversee the delivery of Big Data management projects and services to deliver the intended business outcome
      – Manage and resolve Big Data–related quality issues
      – Understand and promote the value of Big Data assets and related governance
  • 40.
    Governance
    • Approach
      – Align with business strategy
      – Establish Big Data governance and incorporate roles into existing Governance organization
      – Be flexible to change existing governance policies and processes
      – Embed governance process into Big Data projects
      – Establish key data subject areas in the Big Data space that are linked to high-potential business outcome
      – Establish business as well as technical ownership of Big Data governance
      – Establish and align Big Data architecture with the Enterprise Information Architecture
      – Establish data quality, master data, and metadata that may be new and unique to Big Data
      – Clearly establish Information Lifecycle Management components for Big Data governance
      – Establish regular meetings between business stewards, data stewards, and data scientists on Big Data that is impacted by regulatory compliance and protection of sensitive data
  • 41.
    Big Data Best Practices
    • Align Big Data initiatives with specific business goals
    • Ensure a centralized IT strategy for standards and governance
    • Use a Center of Excellence to minimize training and risk
    • Correlate Big Data with structured data
    • Provide high-performance and scalable analytical sandboxes
    • Reshape the IT operating model
  • 42.
    Questions