Evolution of Big Data at Intel - Crawl, Walk and Run Approach

Gomathy Bala | Director
Chandhu Yalla | Manager & Architect
Key Contributors: Sonja Sandeen, Seshu Edala, Nghia Ngo and Darin Watson
IT BI Big Data Team

Copyright © 2014, Intel Corporation. All rights reserved.
Legal Notices
This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.
The content in this presentation is being shared under NDA.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.

Agenda
• Intel IT Big Data Journey
• Enterprise DW architecture
• BI Big Data 3 yr Roadmap
• Big Data Ecosystem Architecture
• Platform Strategies & BKMs
• Summary

Intel IT Big Data Journey
2011 2012 2013 2014 2015
Big Data & Analytics Strategy
Production Online
Telmap: 1st Use Case
Preproduction Online
Hadoop Evaluation
IDH to CDH
Hadoop 2.0
$176M BV
Production: Security BI, Attribute Reduction System, ATM Ellipses Engine, IAH-Retail Analytics
6 Environments
CDH 5.3
4 Use Cases in Preproduction
12 POC Use Cases
6 Use Cases in Production
$290K investment
$948/TB
3 Use Cases in Production
Smart-What, Marketing-IAH, Incident Predictability
$6M BV
CDH 5.1
IAH – Cloud CRM In Production
Enterprise Standards, Guidance, Processes for Platform & Capabilities
15 Active Use Cases | $290K + 10.5 HC Investment | Delivered $182M BV
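The delivered total on this slide can be cross-checked against the two business value (BV) callouts on the timeline. The following is a minimal sketch, assuming the "$176M BV" and "$6M BV" callouts are the only contributions to the total; the variable names are illustrative, and the 10.5 headcount (HC) is left out of the ratio because the slide gives no dollar figure for it.

```python
# Cross-check of the BV total quoted on the slide (figures in millions of USD).
# Assumption: the two BV callouts on the timeline are the only components.
bv_callouts_musd = [176, 6]      # "$176M BV" and "$6M BV"
total_bv_musd = sum(bv_callouts_musd)

# "$290K investment"; headcount cost is not stated, so it is excluded here.
cash_investment_musd = 0.290
bv_per_invested_dollar = total_bv_musd / cash_investment_musd

print(f"Total BV: ${total_bv_musd}M")
print(f"BV per cash dollar invested (excluding HC): {bv_per_invested_dollar:.0f}x")
```

The sum reproduces the "$182M BV" headline. The ratio is only an upper bound on return, since staffing cost is not included.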

Big Data & Analytics Really Delivers!
From 2014 – 2015 Intel IT Business Review – Annual Edition
Kim's Video

Any Data Source
ERP
In Memory Real-Time Data Platform
CRM
SCM
SRM
ECC
BW
ECCW
Real-Time & Self Service
Analytics Platform
MDG
NW
Teradata Cloudera Hadoop Data Lake
Reporting Tools
Data Tiering
Hot-Cold data
Enterprise
Data Warehouse
Other Apps
Custom
Intel
…
NRT
Predictive
Analytics
BPC
BCS
Cloud
BI
SaaS
New
Apps.
Downstream
Applications
2014-2017 Vision: Real-Time Enterprise

FE Tools
CLS/Proxy
High speed data loader
BigData
• Machine Learning
• Log Processing
• Unstructured data
Use Cases
• High volume counter Analytics
• Text Parsing/Mining
• Strategic/Operational reporting
• Interactive Reporting
Use Cases
• High Concurrent user analytics - Supply/Order
• Mission critical analytics – Finance/HR
SQL on Hadoop
Enterprise Data Architecture with Hadoop and Other MPP DWH
Current & Future Strategy
Future Present
EDW
Mfg Data
A %ge of Traditional BI use cases
IMT

BI Big Data | 3-Year Roadmap
Big Data + AA
Big Data + SSAA + Traditional BI
Big Data + SSAA + Traditional BI
2015
2016
2017
Scalable and well designed Hadoop Platform
• Evolve IMT + Hadoop
• Data Lineage & Data Catalog
• Streaming Capabilities
• Advanced SQL on Hadoop
• ACID semantics
• Evolve Big Data + SSAA per ecosystem roadmaps
• BC/DR
• End to end enterprise features
• Enterprise ready: OLAP and Traditional DW
Hadoop is an open source framework designed for big data analytics.
It is evolving rapidly, but it will still take a couple of years to
mature enough to support “traditional BI” use cases.
Legend
Orange Text: Traditional BI Capabilities
Green Text: Big Data/AA Capabilities
• Security (RBAC, ITS/IRS)
• Data Governance
• Data Discovery
• Self Service AA Framework
• IMT + Hadoop
• AVP + Hadoop
• In-memory + Near real time capabilities
• SQL on Hadoop

Big Data Platform – Ecosystem Architecture & Maturity
Data Integration
NRT/Stream Processing
In-Memory Processing
Batch Processing
Processing Layer
Data Virtualization
Data Discovery
Adv. Analytics
Adv. Visualization
Data Management
Presentation Layer
End User
Data Steward
Business Analyst
Data Scientist
Developer
Auditor
User layer
Machine Learning
Statistical
Numerical
Time series
Textual/Log
Spatial
Graph
Analytical layer
Textual/Log DB
Hierarchy DB
Relational DB
Graph DB
Storage Model
Platform Virtualization
Infrastructure
Platform Management
Network Management
Systems Management
Data Ingestion
Continuous Integration
Dev Framework
Security
Source/Target APIs
3rd Party Drivers
Ent. Scheduler Srvs
Metadata Mgmt
Workload Mgmt
Middleware
Columnar DB
Data Egression
Other Vendors offered capabilities
Majority CDH offered capabilities
Data Consumption
Prescriptive Guidance
Change
Release
Governance
Engagement
Service Management
Training
Support
Processes

BI Big Data Platform
Hadoop Project Sandbox – CDH 5.3
Multiple Instances
Deployed on Intel Cloud & MyCloud
environments. TTM to business: 2-3 Days
Hadoop Pre-Production – CDH 5.3
10 data nodes | 399TB | 320 vcores
Use cases in Dev/POC: 14
Hadoop Production – CDH 5.3
22 data nodes | 658TB | 704 vcores
Use cases Live in prod: 7
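The node counts, storage and vcore figures quoted for the two clusters can be reduced to per-node averages for a quick comparison. This is a minimal sketch using only the numbers on the slide; the `Cluster` class and its method names are illustrative, and the slide does not say whether the TB figures are raw or usable capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Capacity figures for one Hadoop environment, as quoted on the slide."""
    name: str
    data_nodes: int
    storage_tb: float
    vcores: int

    def tb_per_node(self) -> float:
        # Average storage per data node (raw vs. usable is not stated).
        return self.storage_tb / self.data_nodes

    def vcores_per_node(self) -> float:
        # Average YARN vcores per data node.
        return self.vcores / self.data_nodes


clusters = [
    Cluster("pre-production", data_nodes=10, storage_tb=399, vcores=320),
    Cluster("production", data_nodes=22, storage_tb=658, vcores=704),
]

for c in clusters:
    print(f"{c.name}: {c.tb_per_node():.1f} TB/node, "
          f"{c.vcores_per_node():.0f} vcores/node")
```

Both environments average the same number of vcores per data node, while the pre-production nodes carry more storage per node on average.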
• Hadoop 2.0 architecture provides reliability, scalability & performance
• High availability and scalability design
• Well positioned to meet 2015 business use case requirements
• Repeatable architecture for faster builds
• Capacity additions: Add data node. White boxes, Waterfall equipment or HP servers
• TTM: Varies depending on HW (3 wks-2 months)
Job/Workflow Management
Data Node Data Node Data Node Data Node Data Node
Name Node
Resource Mgr
Name Node
Resource Mgr
heartbeat, balancing, replication
YARN
Scale to meet business needs
Gateway
Nodes
(NN hi-av)
Gateway nodes
Login (ssh) : AD authentication &
authorization, access cluster, run
HDFS commands, submit jobs, etc.
Management
Node
Source Data
DB Data
Visualization
Tools
Data Movement/ETL
EDW or Datamart
DB data
Unstructured Semi-structured

BKMs | Summary
• Plan for skills and resources, with time to ramp up.
• Starting small is OK. Focus on design and scalability for the platform.
• Technical product evaluation: stick with a distribution built on the core open source Hadoop stack rather than proprietary software.
• Security is a big deal to Intel; implementing Big Data security capabilities is a key focus.
• To understand the data, use an iterative discovery method involving technical, business and modeling teams.
• Intel IT Big Data Journey benefited heavily from the Cloudera partnership.
• Open source will play a big role in advancing Big Data capabilities and analytics.

BI Big Data IT@Intel Resource Info
BI Big Data IT@Intel Resource Links:
1. Hadoop Migration Success Story: How Intel IT Moved to Cloudera
2. Mining Big Data in the Enterprise for Better Business Intelligence
3. Enabling Big Data Platforms and Solutions with Centralized Data Management
4. Integrating Apache Hadoop* into Intel’s Big Data Environment
5. Using a Multiple Data Warehouse Strategy to Improve BI Analytics
To learn more: www.intel.com/bigdata

Q & A

Intel Confidential — Do Not Forward

Backup

Big Data Capability Catalog
Hive
HDFS MapReduce ZooKeeper
Pig Mahout
Network Servers Storage Security OS Hi-Av EAM / AD Integration
HDFS Compress
WHIRR
Hbase
Governance
Change
Release
Engagement
Service mgmt.
Prescriptive
Guidance
Training
SQOOP JDBC
Other DW
Infrastructure
Process
Cloudera* Distribution of Hadoop (CDH)
Storm
Hcatalog
ACCUMULO YARN
SPARK
Autosys
SecureGIT
Impala JDBC
HiveODBC
3rd Party SW/Connectors
Integration
HUE SOLR IMPALA
PARQUET DataFu
Impala ODBC
TDCH
Oozie
Kafka
Sqoop
DI
Gateway
Flume
SFTP
SMBClient
Data
Integration
Camel
Enabled: Avail. Now | WIP: 1-3 Months | Planned: 3-6+ Months
Cloudera Manager*
System Management
Cloudera Navigator*
Data Management
Audit
Access Control
Discovery Explore
Lineage Lifecycle
Deployment Monitoring Reporting Diagnostics
Alerting
Service
Management
Rolling
Upgrades
Config
Rollbacks
List includes only the capabilities planned for next 6 months.
Google Analytics
SFDC
Sentry

Migration to Cloudera – 6 BKMs
i. Find Differences with a Comparative Evaluation in a Sandbox Environment
ii. Define Your Strategy for the Cloudera Implementation
iii. Split the Hardware Environment
iv. Upgrade the Hadoop Version
v. Create a Preproduction-to-Production Pipeline
vi. Rebalance the Data
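The last BKM, rebalancing the data, can be illustrated with the criterion the HDFS balancer uses: a cluster counts as balanced when each DataNode's utilization is within a threshold (in percentage points, 10 by default for `hdfs balancer -threshold`) of the overall cluster utilization. The sketch below is an assumption-laden illustration of that check, not the procedure Intel followed; the function name, node names and capacity figures are all hypothetical.

```python
def unbalanced_nodes(nodes, threshold_pct=10.0):
    """Return nodes whose utilization deviates from the cluster average.

    nodes: mapping of node name -> (used_tb, capacity_tb).
    The result maps each out-of-range node to its deviation from the
    cluster-wide utilization, in percentage points.
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(cap for _, cap in nodes.values())
    cluster_pct = 100.0 * total_used / total_capacity

    deviations = {}
    for name, (used, cap) in nodes.items():
        node_pct = 100.0 * used / cap
        if abs(node_pct - cluster_pct) > threshold_pct:
            deviations[name] = node_pct - cluster_pct
    return deviations


# Hypothetical cluster: three existing data nodes plus one newly added node.
example_nodes = {
    "dn01": (18.0, 30.0),
    "dn02": (19.5, 30.0),
    "dn03": (16.5, 30.0),
    "dn04": (1.5, 30.0),   # freshly added, almost empty
}

result = unbalanced_nodes(example_nodes)
for name, deviation in sorted(result.items()):
    print(f"{name}: {deviation:+.2f} percentage points from cluster average")
```

In this made-up example the new node is far below the cluster average, which in turn pushes two of the older nodes above the threshold; that is the situation a rebalance after adding or splitting hardware is meant to correct.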

Building Block Strategy to Enterprise Security of Hadoop
Q1’15: Perimeter access with LDAP plus finer-grained controls with Sentry. This is the second building block towards an enterprise-grade security design.
Q2’15: Add Kerberos to enable more Hadoop components and further secure the platform.
2H’15: Exploration is starting; awaiting the product, with a target to adopt it in Production in 2H’15.
Now | Q2’15 | 2H’15
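The "finer grain controls with Sentry" building block refers to role-based access control: users belong to groups, groups are granted roles, and roles carry privileges on data objects. The sketch below is a toy model of that lookup chain and is not the Sentry API; every user, group, role and table name in it is hypothetical.

```python
# Toy model of role-based access control (RBAC). Not the Sentry API.
# All names below are hypothetical and exist only for illustration.
groups_of_user = {
    "alice": {"analysts"},
    "bob": {"etl"},
}
roles_of_group = {
    "analysts": {"read_sales"},
    "etl": {"load_sales"},
}
privileges_of_role = {
    "read_sales": {("sales_db.orders", "SELECT")},
    "load_sales": {("sales_db.orders", "SELECT"),
                   ("sales_db.orders", "INSERT")},
}


def is_allowed(user: str, obj: str, action: str) -> bool:
    """Resolve user -> groups -> roles -> privileges and check for a match."""
    for group in groups_of_user.get(user, ()):
        for role in roles_of_group.get(group, ()):
            if (obj, action) in privileges_of_role.get(role, ()):
                return True
    return False  # deny by default, including for unknown users


print(is_allowed("alice", "sales_db.orders", "SELECT"))
print(is_allowed("alice", "sales_db.orders", "INSERT"))
```

Granting privileges to roles rather than to individual users is what keeps the model manageable as use cases are added: onboarding a user only requires a group membership, which in the design described on the slide would come from LDAP.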

Hadoop Maturity & Evolution
MapReduce
(batch data processing, cluster
resource management)
HDFS 1.0
(redundant, reliable
data storage)
Hadoop 1.0
YARN
(cluster resource management)
HDFS 2.0
(redundant, reliable data storage)
Interactive
(Impala)
In-Memory
(Spark)
Batch
(MapReduce)
Online
(HBase)
Others
(Search, Storm etc.)
Graph
Applications Run Natively In Hadoop
+ Scalable data storage and processing platform
+ Positioned for Batch processing workloads for Map and Reduce only
+ Apache Hive offers SQL-like query language
- Lacks reliability and stability
- No support for low latency queries
• Apache YARN allows you to run multiple applications in Hadoop and provides reliability, scalability and performance
• Advanced Resource Management
• Apache Hive offers a 50x improvement in performance for queries
• Cloudera Impala to support low latency query requirements with SQL-92 and SQL-2000 support
• Data at Rest Encryption and Row Level/Cell Level Security planned
• Data Streaming and Search Capability
• GraphDB
• Expanded Data Governance
• IMT + Hadoop Integration
• Improved Front End tool integration/support
• Deeper Diagnostics for multiple components
2005 - 2012 2013 - 2014
Hadoop 2.0
HDFS
(redundant, reliable
data storage)
YARN
(cluster resource management)
Batch
(MapReduce)
Others
(data processing)
2015 - 2017

2014 Intel IT Vital Statistics
>6,300 IT employees
59 global IT sites
>98,000 Intel employees¹
168 Intel sites in 65 Countries
64 Data Centers
(91 Data Centers in 2010)
80% of servers virtualized
(42% virtualized in 2010, goal of 75%)
>147,000 devices
100% of laptops encrypted
100% of laptops with SSDs
>43,200 handheld devices
57 mobile applications developed
Source: Information provided by Intel IT as of Jan 2014
¹ Total employee count does not include wholly owned subsidiaries that Intel IT
does not directly support

Big Data in the Industry
Recommendation Engine
Fraud Detection
Sentiment Analytics
Behavioral Targeting
Customer Experience Analytics
Marketing Campaign Analytics

Learn more about Intel IT’s Initiatives at
www.intel.com/IT
Sharing Intel IT Best Practices
With the World
