How Cisco Migrated from MapReduce Jobs to Spark Jobs - StampedeCon 2015

Ken Owens
CTO Cisco Intercloud Services
07/15/15

Introduction

The Uber Trend: Exponential Rise in Connectivity
• 30M new devices connected every week
• 78% of workloads processed in cloud DCs by 2018
• 5TB+ of data per person by 2020
• 180B mobile apps downloaded in 2015
• 277X data created by IoE devices vs. end users
Source: IDC

Exponential Growth Drives Opportunities
[Figure: curves labeled "Exponential Trend" and "Linear Trend", with annotations "Disruptive Stress/Opportunity" and "Knee of Curve"]
Peter Diamandis: BOLD

When Products Become Cloud-enabled, They Become 10X More Valuable
[Figure: three product price pairs: $23.19 vs. $249.00, $18.01 vs. $199.00, $5.99 vs. $59.99]

A Broader Perspective than Hybrid Cloud Is Required…
[Figure labels: SaaS, PaaS, IaaS; Data Center, Cloud, Edge / IoT]

CIOs Need to Embrace Both Traditional and Hyperscale Application Deployment
Hyperscale applications serving several thousands of users very quickly
• IoE and increasing connectivity driving the need for such workloads
• Hadoop, Mobile back-ends, Gaming, Social
• Small (~10%), yet rapidly growing percentage of applications in the Cloud
Traditional enterprise applications
• ERP, CRM, applications that leverage traditional databases
• Majority of applications being run for/by enterprises today

Application Portability and Interoperability Is the Key
• Traditional Applications: ERP, Financial, Client/Server, CRM, email, …
• Cloud Native Applications: IoT, Big Data, Analytics, Gaming, ...
[Figure labels: SaaS, PaaS, IaaS; Data Center, Cloud, Edge / IoT]

Bimodal IT Is the New Normal
• 45% of CIOs currently have a second fast/agile mode of operation
• Traditional Mode: Requires Reliability (ITIL, CMMI, COBIT)
• Nonlinear Mode: Accept Instability (DevOps, automation, reusable)
[Figure labels: Systems of Record, Systems of Differentiation, Systems of Innovation; Change, Governance]
Source: Gartner, Lydia Leong

We Must Connect the Clouds
• Islands of Isolated PC LAN Networks (1990s): multiple LANs using a multitude of protocols
• The Internet: connected using industry-standard IP protocol (IP based, open standards)
• World of Isolated Clouds (2000s): individual custom-built clouds without consistent APIs
• The Intercloud: connected for application acceleration with open APIs (web-scale architecture; API-driven automation; open, secure, compliant; hybrid IT)

Use Case: Customer Interaction Analytics

Omni-Channel Customer Journeys
The customers’ interactions with Cisco across multiple touch points to get the desired business outcome.
• Channels: Server Logs, Social & Chat, Mobile, Event Streams, Call Center
• Interaction touch points: S/W Download, Open Trouble Ticket, Assign Engineer, Update Trouble Ticket, Close Trouble Ticket, Resolve Trouble Ticket, Read Support Documents, View Design Documents, View Tech Documents, New Registration, Bug Search, FAQs, Contract Details, Product Details, Device Coverage
• Journeys: Case Resolution, Software Upgrade

Customer Interaction Analytics
From Journey to Outcome…
Customer Journeys
• Software Upgrades
• Bug Inquiry
• Software Inquiry
• Trouble Ticket Lifecycle
• Device Troubleshooting
• New Registration
• Contract Renewal
Behavioral Insights
• Customer Interest Analytics
• Customer Experience Analytics
• Resource Forecasting
• Security and Compliance
Impact
• Boost Self Service
• Real-time Content Optimization & Recommendation
• Context Based Predictive Alerts
• Implicit Personalization

Customer Interaction Analytics Big Data Platform
Synthesize customer journey maps into behavioral insights.
• Data Sources: Server Logs, Call Center, Mobility, Social, Event Streams
• Data Ingestion: CiscoDV, Kafka, Redis, ETL
• Analytics (Real-time Processing, Batch Analytics): Model, Build Model, Activity Refinement, Activity Synthesis, Synthesized Insights
• Insight Services: CiscoDV, Interact
[Other figure labels: Impala, Hive, Pig, ES, Zoomdata, Platfora]

AWS and CIS Intercloud Solution

AWS Platform
Component            Cloud::Hadoop      Cloud::Queries         Cloud::Streams
                     (Batch Analytics)  (Interactive Queries)  (Near Real-time Analytics)
Virtual Machines     30                 6                      5
AWS Instance Sizing  m3.2xlarge         c3.xlarge              m3.xlarge
Virtual Cores        8/VM               4/VM                   4/VM
RAM                  30GB/VM            7.5GB/VM               15GB/VM
Disk                 1.5 TB/VM          1.5 TB/VM              1.5 TB/VM
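The per-VM figures in the AWS Platform table can be rolled up into cluster-wide totals. The sketch below is only an illustration of that arithmetic: the `Tier` class and `totals` function are hypothetical names, and the inputs are the values from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One column of the AWS Platform table (hypothetical helper type)."""
    name: str
    vms: int
    cores_per_vm: int
    ram_gb_per_vm: float
    disk_tb_per_vm: float


# Values copied from the AWS Platform table.
TIERS = [
    Tier("Cloud::Hadoop", 30, 8, 30.0, 1.5),
    Tier("Cloud::Queries", 6, 4, 7.5, 1.5),
    Tier("Cloud::Streams", 5, 4, 15.0, 1.5),
]


def totals(tiers):
    """Sum VM count, vCPUs, RAM (GB) and disk (TB) across all tiers."""
    return {
        "vms": sum(t.vms for t in tiers),
        "cores": sum(t.vms * t.cores_per_vm for t in tiers),
        "ram_gb": sum(t.vms * t.ram_gb_per_vm for t in tiers),
        "disk_tb": sum(t.vms * t.disk_tb_per_vm for t in tiers),
    }


if __name__ == "__main__":
    print(totals(TIERS))
```

A roll-up like this gives a baseline of the kind called for by the requirement that the CIS deployment be comparable to the current production system.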

Case for Cisco Intercloud Services for Analytics…
▪ Cisco Security and Compliance requirements
• Workloads that deal with personally identifiable data and Cisco confidential content cannot be uploaded to AWS. A Cisco internal cloud solution is a better fit.
▪ Customer journey beyond the enterprise
• Applications are hosted on AWS
• Partner systems are hosted on AWS and other cloud providers
• Presence in AWS and other cloud services is required to support these scenarios for end-to-end customer journey insights.
▪ Data virtualization integrated in the CIS Analytics Stack
• Connect data from multiple clouds and multiple big data platforms
▪ Integrated visualization toolset

CIS Analytics Platform

CIS Analytics Platform Requirements
• Infra Provisioning: Deploy a virtual private cloud (VPC) on CIS with compute, storage and memory requirements comparable to the current production system.
• OpenStack: Icehouse OpenStack with Neutron, Nova, and Swift installed.
• Big Data Ecosystem: Cloudera’s Hadoop distribution version CDH 5.1.3, ELK Stack, Apache Kafka and Apache Storm.
• Data virtualization & Cloud Integration: Access to data services and data stores via Cisco Data Virtualization.
• Runtime Services: Foundational PaaS capabilities including SLAs for uptime, performance, latency, data retention, issue escalation and support priorities, issue resolution, problem management, deployment process, patch management.
• API Services: Provide both fine-grained and coarse-grained access to all service layers of the CIS Analytics Platform. In the hybrid cloud model it must support interoperability across platform service providers and promote the cloud concepts of extensibility and flexibility.

AWS to CIS Migration – Success Criteria
▪ Successful synthesis of customer interaction data
▪ Successful automation of the end-to-end data process pipeline
▪ Build behavioral insight services
▪ Access to data and services via data discovery and visualization tools
▪ Meet the performance, scale and platform stability requirements
▪ Successful deployment of CiscoDV on CIS
▪ Connect HDFS and Hive DS with CiscoDV via Hive and Impala
▪ Build and expose insight services for consumption by limited users

AWS and CIS Data Node Sizing Comparison
Hadoop Cluster for Batch and Query Analytics
AWS Sizing
Node Service            AWS Instance Type  vCPU  Mem  Storage  Number of Data Nodes
Data Nodes/Node Master  m3.2xlarge         8     30   2x80 GB  30
Comments: Each Hadoop data node has 1500GB of EBS available for HDFS storage.
CCS Sizing
Node Service            CCS Instance Type  vCPU  Mem  Storage  Number of Data Nodes
Data Nodes/Node Master  GP-2XLarge         8     32   50       35
Comments: Each Hadoop data node has 1500GB of EBS available for HDFS storage. Less than AWS sizing (Storage).

Pilot Test Data
• Test performed on one day’s production data
• Total no. of records processed – 110,852,667
• Total data size – 32GB
• Total no. of M/R jobs in the data pipeline – 17
• Two test cycles
• Cycle 1: Heterogeneous CCS nodes (vCPUs, storage, memory)
• Cycle 2: Homogeneous CCS nodes

CIS Performance of Batch Analytics – Limited Test

Test Details by M/R job
Job Name                    CCS 12   CCS 18   CCS 24   CCS 30   CCS 18   CCS 24   CCS 30   CCS 35
                            nodes:   nodes:   nodes:   nodes:   nodes:   nodes:   nodes:   nodes:
                            cycle1   cycle1   cycle1   cycle1   cycle2   cycle2   cycle2   cycle2
New_cleanse                 249      176      143      117      82       67       55       51
Process_private_ip          27       14       11       10       7        5        6        6
join_web_and_ip_data        142      95       76       61       49       40       34       29
combine_ip_decorated_files  26       14       11       10       9        7        8        7
filterBotEntries            34       19       15       13       10       8        7        7
sessionize                  71       64       69       62       60       63       15       13
firstActivitiesFilter       26       15       13       10       9        8        6        6
allOtherActivitiesFilter    29       18       13       13       11       9        7        6
matchFirstActivities        21       13       11       13       13       11       8        8
buildActivities             27       15       12       10       7        6        9        9
filterBUG                   8        5        3        2        3        3        4        4
filterSEA                   8        5        3        2        3        3        4        4
filterTCO                   8        5        3        2        3        3        4        4
filterTDV                   8        5        3        2        3        3        4        4
filterWDV                   8        5        3        2        3        3        4        4
filterMOD                   8        5        3        2        3        3        4        4
filterTOOL                  8        5        3        2        3        3        4        4
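One way to read the per-job table is to compare how much a job's reported time drops against how many nodes were added. The sketch below computes a speedup and a scaling efficiency (speedup divided by the node ratio) for two rows of cycle 1. The function names are hypothetical, and because the slide does not state the time unit, only ratios are computed.

```python
# Cycle-1 values copied from the table: node count -> reported job time.
NEW_CLEANSE_CYCLE1 = {12: 249, 18: 176, 24: 143, 30: 117}
SESSIONIZE_CYCLE1 = {12: 71, 18: 64, 24: 69, 30: 62}


def speedup(times, base_nodes, nodes):
    """Ratio of the time at base_nodes to the time at nodes."""
    return times[base_nodes] / times[nodes]


def scaling_efficiency(times, base_nodes, nodes):
    """Speedup relative to the ideal linear speedup (nodes / base_nodes)."""
    return speedup(times, base_nodes, nodes) / (nodes / base_nodes)


if __name__ == "__main__":
    for name, times in [("New_cleanse", NEW_CLEANSE_CYCLE1),
                        ("sessionize", SESSIONIZE_CYCLE1)]:
        s = speedup(times, 12, 30)
        e = scaling_efficiency(times, 12, 30)
        print(f"{name}: speedup {s:.2f}x, efficiency {e:.2f}")
```

On these numbers New_cleanse scales much closer to linearly than sessionize, which barely improves as nodes are added in cycle 1.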

PoC: Analytics with Spark on CIS
Existing code
▪ Written in Ruby with Wukong to run on Hadoop
▪ A history of changes and modifications
▪ Script-based; steps communicate via intermediary files
Goal
▪ Revise, rethink and reimplement with Spark on CIS
▪ Open for advanced cloud analytics
▪ Improve maintainability by moving away from aging Ruby on Hadoop

Ruby Flow
[Figure: pipeline diagram with stage groups Cleanse and Sessionize. Box labels: logs, cleanse, cleansed, decorate (private, web), sessionize (cookie, time), sessioned, match 1st (IP, UA, time), build actions, merge, session PSV, add to hive; filter outputs: first, others, bots, onlyBots, bug, tool, 1..7. The sessionize step is marked "Main computation happens here". Annotations: global sort in time, global group by IP.]
▪ Pre-process log records (‘cleanse’)
▪ Extract HTTP sessions (‘sessionize’)
▪ Extract user actions, such as ‘search’, ‘download patch’, ‘open manual’, ‘open a bug’
Ruby: Scripts with temp files
▪ Each box on the figure is a script in a separate file
▪ They pipe gigabytes of data as input and output
▪ Random matching of nodes to data for sessionizing
▪ Lots of redundant shuffling

Spark Flow
[Figure: the same pipeline diagram as in the Ruby flow, annotated "global partition by IP" and "local sort in time". The sessionize step is marked "Main computation happens here".]
▪ Same flow, but each box is a Java or Scala function
No intermediate temp files
▪ Steps are chained by Spark, often without any need for intermediate data
▪ If still needed, the data is stored in memory and on local disk as much as possible
Local computation
▪ Cleansing is computed on nodes local to data blocks (same as Ruby)
▪ Sessions are built per IP
• On separate nodes, each handling a single IP range
• Once copied to the node during partitioning, the data remains local
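The "global partition by IP, local sort in time" idea can be sketched in plain Python without a cluster. This is an illustration only, not the code from the talk: the `Hit` record, the CRC32-based partitioner, and the 30-minute inactivity gap used to split sessions are all assumptions made for the example (the slides only say that sessions are built per IP and sorted in time).

```python
import zlib
from collections import defaultdict
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    """Hypothetical cleansed log record."""
    ip: str
    ts: int   # seconds since epoch
    url: str


SESSION_GAP_S = 30 * 60  # assumed inactivity timeout, not from the slides


def partition_by_ip(hits: Iterable[Hit], num_partitions: int):
    """Global step: route every hit to a partition chosen by its IP."""
    parts = defaultdict(list)
    for h in hits:
        # crc32 gives a hash that is stable across processes
        parts[zlib.crc32(h.ip.encode()) % num_partitions].append(h)
    return parts


def sessionize_partition(hits):
    """Local step: per-IP sort in time, then split on inactivity gaps."""
    by_ip = defaultdict(list)
    for h in hits:
        by_ip[h.ip].append(h)
    sessions = []
    for ip_hits in by_ip.values():
        ip_hits.sort(key=lambda h: h.ts)  # local sort, no cluster-wide sort
        current = [ip_hits[0]]
        for prev, cur in zip(ip_hits, ip_hits[1:]):
            if cur.ts - prev.ts > SESSION_GAP_S:
                sessions.append(current)
                current = []
            current.append(cur)
        sessions.append(current)
    return sessions


def sessionize(hits, num_partitions=4):
    """Partition once by IP, then sessionize each partition independently."""
    out = []
    for part in partition_by_ip(hits, num_partitions).values():
        out.extend(sessionize_partition(part))
    return out
```

Because every hit for a given IP lands in one partition, each partition can be processed independently. In Spark, a comparable effect is commonly obtained with operations such as `repartitionAndSortWithinPartitions`; the slides do not say which operations the team actually used.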

Runtime comparison
▪ Volumes
• Logs of a single day: 52 GB
• Total of 110 million records
• 53 million records are kept after pre-filtering
• Producing over 1 million user actions
▪ Cluster of 30 nodes
▪ Ruby
• Runtime 140 min
▪ Spark
• Runtime 7 min (20 times faster)
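The "20 times faster" figure follows directly from the two runtimes, and the same numbers give a rough records-per-second throughput. This is a small sketch of that arithmetic: the function names are hypothetical, and the record count is the rounded 110 million from the slide.

```python
RECORDS = 110_000_000   # rounded figure from the slide
RUBY_MINUTES = 140
SPARK_MINUTES = 7


def speedup(old_minutes, new_minutes):
    """How many times faster the new runtime is."""
    return old_minutes / new_minutes


def throughput(records, minutes):
    """Average records processed per second over the whole run."""
    return records / (minutes * 60)


if __name__ == "__main__":
    print(f"speedup: {speedup(RUBY_MINUTES, SPARK_MINUTES):.0f}x")
    print(f"Ruby:  {throughput(RECORDS, RUBY_MINUTES):,.0f} records/s")
    print(f"Spark: {throughput(RECORDS, SPARK_MINUTES):,.0f} records/s")
```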

Why?
▪ Extracting sessions means sorting in time and grouping by IP
▪ Ruby:
• Sorting in time and per-IP grouping are performed across the whole cluster (very bad, lots of IO)
▪ Spark is good at dealing with partitions:
• Per-IP groups are placed on different machines (partitions)
• The global sort in time is replaced by many local per-IP sorts, done on the machines responsible for extracting sessions for specific groups of IP addresses
▪ Other improvements
• Avoid redundant temp files and redundant (de)serialization of objects (comes with Java/Scala); stages keep data in memory when possible (comes with Spark)
• Cache results of user agent resolution, which is heavy on regular expressions
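Caching user agent resolution can be illustrated with memoization: the same user agent string appears in many log records, so the regular expressions only need to run once per distinct string. The sketch below uses Python's `functools.lru_cache`. The patterns, labels, and cache size are invented for the example and are not the rules used in the talk.

```python
import re
from functools import lru_cache

# Hypothetical classification rules; the real rule set is not in the slides.
_UA_PATTERNS = [
    ("bot", re.compile(r"bot|crawler|spider", re.IGNORECASE)),
    ("firefox", re.compile(r"Firefox/\d+")),
    ("chrome", re.compile(r"Chrome/\d+")),
]


@lru_cache(maxsize=100_000)  # assumed cache size
def resolve_user_agent(ua: str) -> str:
    """Return a coarse label for a user agent string.

    Results are memoized, so the regexes run once per distinct string.
    """
    for label, pattern in _UA_PATTERNS:
        if pattern.search(ua):
            return label
    return "other"


if __name__ == "__main__":
    for ua in ["Googlebot/2.1", "Googlebot/2.1", "Mozilla/5.0 Chrome/43.0"]:
        print(ua, "->", resolve_user_agent(ua))
    print(resolve_user_agent.cache_info())
```

The cache is per process, so on a cluster each worker would build its own copy, which still removes most repeated regex work when user agents recur often.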

Data Virtualization for Intercloud Analytics
Customer Benefits
▪ Discover data beyond the enterprise: Virtual integration that combines traditional enterprise data, Big Data stores on CIS and AWS, cloud data from SaaS providers, and Cisco customers and partners
▪ Seamless interoperability offers easy access to data across distributed data sources in the intercloud analytics platform
▪ Universal data governance maximizes enforcement of data security rules
▪ Analytics Data Hubs: Deployment flexibility to build hybrid/virtual sandboxes that enable nimble data discovery and rapid data analytics to support multiple LOBs
▪ Deliver data to any number of analytics tools

Use Case 1: Get Case Interactions
• Use Case Description: Number of cases opened by company X that are currently open. (Other variations would include cases by company, trends, etc.)
• CiscoDV Value: CiscoDV enforces data security rules to restrict access on the intercloud platform to customer-sensitive data.
• Data Sources: SalesForce
• Intercloud Solution: CIS CiscoDV service can access the “sanitized” version of CSOne data through JDBC from RIDES (SWTG CiscoDV) API.
• Connection Type: DV on hybrid cloud → Enterprise data store

Use Case 2: Get Customer Journey
• Use Case Description: Customer interactions on the web pertaining to bug search and case submission process. Foundational data can be used to explore trends and feed into content recommendation models.
• CiscoDV Value: Direct access to data on CIS Intercloud Analytics Platform
• Data Sources: SAS Analytics
• Intercloud Solution: By direct network access to the Impala Server, the CIS CiscoDV server connects to the Impala Service in Hadoop, also on CIS, as a Data Source. SQL queries configured in CiscoDV execute Impala queries.
• Connection Type: DV on hybrid cloud → VPC Big Data platform

Use Case 3: Get Bug Interactions
• Use Case Description: Another foundational data service that provides a breakdown of customer exposure or interest in bugs. The service can be refined further to look at trends specific to a company or a product for further analytics.
• CiscoDV Value: Real-time data federation that accesses extremely large data in CIS Intercloud Analytics platform and joins that with Bug Data accessed via departmental CiscoDV instance (RIDES)
• Data Sources: SASA Analytics and QDDTS via RIDES
• Intercloud Solution: By building on the access to the Impala Server, the DV server can join the Bug Data from the Enterprise Data Stores with the HDFS data to provide a federated view.
• Connection Type: DV on hybrid cloud → VPC Big Data platform and Enterprise data store

CiscoDV on Intercloud Analytics Platform (CIS)
• Scenario 1: CIS CiscoDV to Cisco Enterprise Data Store
• Scenario 2: CIS CiscoDV to Impala and Hive on CIS Intercloud Analytics Platform
• Scenario 3: CIS CiscoDV to Hive on AWS Big Data Cluster
