Data gravity is a reality when dealing with massive data volumes in globally distributed systems. Processing this data requires distributed analytics across the Intercloud. In this presentation we share our real-world experience storing, routing, and processing big data workloads on the Cisco Cloud Services and Amazon Web Services clouds.
4. Omni-Channel Customer Journeys
[Diagram: customer journey map]
Channels: Server Logs, Social & Chat, Mobile Event Streams, Call Center
Interaction touch points: S/W Download, Open Trouble Ticket, Assign Engineer, Update Trouble Ticket, Resolve Trouble Ticket, Close Trouble Ticket, Read Support Documents, View Design Documents, View Tech Documents, New Registration, Bug Search, FAQs, Contract Details, Product Details, Device Coverage
Journeys: Case Resolution, Software Upgrade
The customers' interaction with Cisco across multiple touch points to achieve the desired business outcome.
6. Customer Interaction Analytics Big Data Platform
Synthesize customer journey maps into behavioral insights.
[Diagram: platform architecture]
Data sources: Server Logs, Call Center, Mobility, Social, Event Streams
Data ingestion (real-time processing): CiscoDV, Kafka, Redis, ETL
Batch analytics: build analytics model, activity refinement, activity synthesis (Impala, Hive, Pig, ES)
Insight services: synthesized insights served via CiscoDV and Interact, visualized with Zoomdata and Platfora
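The batch side of this platform refines ingested interaction activity and then synthesizes it into insights. A minimal plain-Python sketch of that two-stage idea (the stage names follow the diagram; the event fields and logic are illustrative assumptions, not the production code):

```python
# Plain-Python sketch of the batch stages in the diagram: activity
# refinement followed by activity synthesis, modeled as pure functions
# over ingested interaction events. Event fields are assumptions.

def refine(events):
    """Activity refinement: drop malformed events, normalize the channel."""
    return [
        {**e, "channel": e["channel"].lower()}
        for e in events
        if "user" in e and "channel" in e and "action" in e
    ]

def synthesize(events):
    """Activity synthesis: roll refined events up into per-user action counts."""
    insights = {}
    for e in events:
        user_counts = insights.setdefault(e["user"], {})
        user_counts[e["action"]] = user_counts.get(e["action"], 0) + 1
    return insights

sample = [
    {"user": "u1", "channel": "Web", "action": "bug_search"},
    {"user": "u1", "channel": "Web", "action": "bug_search"},
    {"user": "u2", "channel": "Mobile", "action": "sw_download"},
    {"channel": "Web", "action": "orphan"},  # malformed: no user, dropped
]
insights = synthesize(refine(sample))
```

In the real platform these stages run as distributed jobs over the ingested stores rather than over in-memory lists, but the refine-then-synthesize shape is the same.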
9. Case for Cisco Intercloud Services for Analytics…
Cisco Security and Compliance requirements
• Workloads that deal with personally identifiable data and Cisco-confidential content cannot be uploaded to AWS; the Cisco internal cloud solution is a better fit.
Customer journey beyond the enterprise
• Applications are hosted on AWS
• Partner systems hosted on AWS and other cloud providers
Presence in AWS and other cloud services is required to support these scenarios for end-to-end customer journey insights.
Data virtualization integrated in the CIS Analytics Stack
• Connect data from multiple clouds and multiple big data platforms
Integrated visualization toolset
11. CIS Analytics Platform Requirements
Infra Provisioning
Deploy a virtual private cloud (VPC) on CIS with compute, storage and memory requirements comparable to the current
production system.
OpenStack
OpenStack Icehouse with Neutron, Nova, and Swift installed.
Big Data Ecosystem
Cloudera’s Hadoop distribution CDH 5.1.3, the ELK stack, Apache Kafka, and Apache Storm.
Data virtualization & Cloud Integration
Access to data services and data stores via Cisco Data Virtualization
Runtime Services
Foundational PaaS capabilities, including SLAs for uptime, performance, latency, data retention, issue escalation and support priorities, issue resolution, problem management, deployment process, and patch management.
API Services
Provide both fine-grained and coarse-grained access to all service layers of the CIS Analytics Platform. In the hybrid cloud model it must support interoperability across platform service providers and promote the cloud principles of extensibility and flexibility.
12. AWS to CIS Migration – Success Criteria
Successful synthesis of customer interaction data
Successful automation of the end-to-end data processing pipeline
Build behavioral insight services
Access to data and services via data discovery and visualization tools
Meet the performance, scale and platform stability requirements
Successful deployment of CiscoDV on CIS
Connect HDFS and Hive DS with CiscoDV via Hive and Impala
Build and expose insight services for consumption by a limited set of users
13. AWS and CIS Data Node Sizing Comparison
Hadoop Cluster for Batch and Query Analytics

AWS sizing
  Node service: Data Nodes / Master
  Instance type: m3.2xlarge | vCPU: 8 | Mem: 30 GB | Storage: 2x80 GB | Data nodes: 30
  Comment: Each Hadoop data node has 1500 GB of EBS available for HDFS storage

CCS sizing
  Node service: Data Nodes / Master
  Instance type: GP-2XLarge | vCPU: 8 | Mem: 32 GB | Storage: 50 GB | Data nodes: 35
  Comment: Each Hadoop data node has 1500 GB of EBS available for HDFS storage; less than the AWS sizing (storage)
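A quick capacity check on those tables: with 1500 GB of EBS per data node, raw HDFS capacity is 45 TB on AWS (30 nodes) and 52.5 TB on CCS (35 nodes). Assuming the default HDFS replication factor of 3 (an assumption; the deck does not state it), usable capacity comes to roughly 15 TB and 17.5 TB respectively:

```python
# Back-of-envelope HDFS capacity from the sizing tables above, assuming
# the default HDFS replication factor of 3 (an assumption -- the deck
# does not state the replication factor used).

def usable_hdfs_tb(data_nodes, gb_per_node, replication=3):
    raw_tb = data_nodes * gb_per_node / 1000  # raw capacity in TB
    return raw_tb / replication               # usable after replication

aws_usable = usable_hdfs_tb(30, 1500)  # AWS: 30 data nodes x 1500 GB -> 15.0 TB
ccs_usable = usable_hdfs_tb(35, 1500)  # CCS: 35 data nodes x 1500 GB -> 17.5 TB
```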
14. Pilot Test Data
• Test performed on one day’s production data
• Total no. of records processed – 110,852,667
• Total data size – 32GB
• Total no. of M/R jobs in the data pipeline – 17
• Two test cycles
• Cycle 1: Heterogeneous CCS nodes (vCPUs, storage, memory)
• Cycle 2: Homogeneous CCS nodes
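A rough sanity check on the pilot numbers (illustrative arithmetic only, assuming 1 GB = 2**30 bytes): 32 GB over ~110.85 million records works out to about 310 bytes per record:

```python
# Rough arithmetic from the pilot figures above. Illustrative only;
# assumes 1 GB = 2**30 bytes.

records = 110_852_667        # records processed in one day's data
total_bytes = 32 * 2**30     # 32 GB total data size
mr_jobs = 17                 # M/R jobs in the data pipeline

avg_record_bytes = total_bytes / records  # ~310 bytes per record
```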
17. PoC: Analytics with Spark on CIS
Existing code
Written in Ruby with Wukong to run on Hadoop
Carries a history of changes and modifications
Script-based; steps communicate via intermediary files
Goal
Revise, rethink, and reimplement with Spark on CIS
Open the platform up for advanced cloud analytics
Improve maintainability by moving away from aging Ruby on Hadoop
18. Sessionize (Ruby Flow)
[Diagram: cleanse → decorate → sessionize (cookie, time) → match 1st (IP, UA, time) → build actions → merge → session PSV → add to Hive; side outputs for private, first, others, and bot records feed the bug tool; the main computation happens in the sessionize and match stages]
Pre-process log records ('cleanse')
Extract HTTP sessions ('sessionize')
Extract user actions, such as 'search', 'download patch', 'open manual', 'open a bug'
Ruby: scripts with temp files
  Each box in the figure is a script in a separate file
  The scripts pipe gigabytes of data as input and output
  Random matching of nodes to data for sessionizing
  Lots of redundant shuffling
Ruby flow: global sort in time, then global group by IP
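The 'sessionize (cookie, time)' step splits each cookie's click stream into HTTP sessions wherever the gap between consecutive hits exceeds a timeout. A minimal Python sketch of that logic (the production pipeline is Ruby/Wukong; the 30-minute timeout is an assumed illustration):

```python
from itertools import groupby

# Minimal sketch of the 'sessionize (cookie, time)' step: group hits by
# cookie, walk each group in time order, and start a new session whenever
# the gap between consecutive hits exceeds a timeout. The 30-minute
# timeout is an assumption for illustration, not the production value.

SESSION_GAP = 30 * 60  # seconds

def sessionize(hits):
    """hits: iterable of (cookie, timestamp_seconds) -> list of sessions."""
    sessions = []
    by_cookie = sorted(hits)  # sort by (cookie, time)
    for _, group in groupby(by_cookie, key=lambda h: h[0]):
        current = []
        last_ts = None
        for cookie, ts in group:
            if last_ts is not None and ts - last_ts > SESSION_GAP:
                sessions.append(current)  # gap too large: close the session
                current = []
            current.append((cookie, ts))
            last_ts = ts
        sessions.append(current)  # close the cookie's final session
    return sessions
```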
19. Sessionize (Spark Flow)
[Diagram: same pipeline as slide 18, reimplemented in Spark]
Same flow, but each box is a Java or Scala function
  No intermediate temp files
  Steps are chained by Spark, often without any need for intermediate data
  Where intermediate data is still needed, it is kept in memory and on local disk as much as possible
Local computation
  Cleansing is computed on nodes local to the data blocks (same as Ruby)
  Sessions are built per IP, on separate nodes, each handling a single IP range
  Once copied to the node on partition, the data remains local
Spark flow: global partition by IP, then local sort in time
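The 'global partition by IP, local sort in time' idea can be sketched in plain Python by emulating what Spark does with a repartition by key followed by per-partition sorting; the function names and sample records here are illustrative:

```python
# Plain-Python sketch of the Spark flow above: hash-partition records by
# IP so each partition can sort its own records locally, instead of one
# global sort across the cluster. In real Spark this would be a
# repartition by key followed by a per-partition sort.

def partition_by_ip(records, num_partitions):
    """records: (ip, timestamp, action) tuples -> list of partitions."""
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        parts[hash(rec[0]) % num_partitions].append(rec)
    return parts

def local_sort_in_time(partition):
    """Each node sorts only its own partition by (ip, timestamp)."""
    return sorted(partition, key=lambda rec: (rec[0], rec[1]))

records = [
    ("10.0.0.1", 30, "search"),
    ("10.0.0.2", 10, "download patch"),
    ("10.0.0.1", 5, "open manual"),
]
parts = [local_sort_in_time(p) for p in partition_by_ip(records, 4)]
```

Because all records for one IP land in the same partition, session extraction for that IP never needs data from another node.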
20. Volumes and Runtime Comparison
Logs of a single day: 52 GB
  Total of 110 million records, of which 53 million are kept after pre-filtering
  Producing over 1 million user actions
Cluster of 30 nodes
  Ruby runtime: 140 min
  Spark runtime: 7 min (20 times faster)
21. Why? Extracting sessions means sorting in time and grouping by IP
Ruby:
  sorting in time and per-IP grouping are performed across the whole cluster (very bad: lots of IO)
Spark is good at dealing with partitions:
  per-IP groups are placed on different machines (partitions)
  the global sort in time is replaced by many local per-IP sorts, done on the machines responsible for extracting sessions for specific groups of IP addresses
Other improvements
  Avoid redundant temp files and redundant (de)serialization of objects (comes with Java/Scala); stages keep data in memory when possible (comes with Spark)
  Cache the results of user-agent resolution, which is heavy on regular expressions
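The last improvement, caching user-agent resolution, can be sketched with memoization: UA strings repeat constantly in web logs, so a cached classifier avoids rerunning the regular expressions. The patterns below are illustrative, not the production rules:

```python
import re
from functools import lru_cache

# Sketch of the "cache results of user-agent resolution" improvement.
# UA strings repeat heavily across log records, so memoizing the
# regex-heavy classifier avoids rerunning the same expressions.
# These patterns are illustrative, not the production rules.

_BOT = re.compile(r"bot|crawler|spider", re.IGNORECASE)
_MOBILE = re.compile(r"Mobile|Android|iPhone")

@lru_cache(maxsize=65536)
def resolve_user_agent(ua: str) -> str:
    if _BOT.search(ua):
        return "bot"
    if _MOBILE.search(ua):
        return "mobile"
    return "desktop"

resolve_user_agent("Googlebot/2.1")        # classified as 'bot'
resolve_user_agent("Googlebot/2.1")        # cache hit: no regex work
```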
23. Data Virtualization for Intercloud Analytics
Customer Benefits
Discover data beyond the enterprise: virtual integration that combines traditional
enterprise data, Big Data stores on CIS and AWS, cloud data from SaaS providers, and
data from Cisco customers and partners
Seamless interoperability offers easy access to data across distributed data sources
in the intercloud analytics platform
Universal data governance maximizes enforcement of data security rules
Analytics Data Hubs: Deployment flexibility to build hybrid/virtual sandboxes that
enable nimble data discovery and rapid data analytics to support multiple LOBs
Deliver data to any number of analytics tools.
24. Use Case 1: Get Case Interactions
Use case description: Number of cases opened by company X that are currently open (other variations include cases by company, trends, etc.)
CiscoDV value: CiscoDV enforces data security rules to restrict access to customer-sensitive data on the intercloud platform.
Data sources: SalesForce
Intercloud solution: The CIS CiscoDV service can access the "sanitized" version of CSOne data through JDBC from the RIDES (SWTG CiscoDV) API.
Connection type: DV on hybrid cloud; Enterprise data store
25. Use Case 2: Get Customer Journey
Use case description: Customer interactions on the web pertaining to the bug search and case submission process. This foundational data can be used to explore trends and feed into content recommendation models.
CiscoDV value: Direct access to data on the CIS Intercloud Analytics Platform
Data sources: SAS Analytics
Intercloud solution: With direct network access to the Impala server, the CIS CiscoDV server connects to the Impala service in Hadoop (also on CIS) as a data source. SQL queries configured in CiscoDV execute as Impala queries.
Connection type: DV on hybrid cloud; VPC Big Data platform
26. Use Case 3: Get Bug Interactions
Use case description: Another foundational data service that provides a breakdown of customer exposure to, or interest in, bugs. The service can be refined further to look at trends specific to a company or a product for further analytics.
CiscoDV value: Real-time data federation that accesses extremely large data sets on the CIS Intercloud Analytics platform and joins them with bug data accessed via a departmental CiscoDV instance (RIDES)
Data sources: SASA Analytics and QDDTS via RIDES
Intercloud solution: Building on the access to the Impala server, the DV server can join the bug data from the enterprise data stores with the HDFS data to provide a federated view.
Connection type: DV on hybrid cloud; VPC Big Data platform and Enterprise data store
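The federated view in Use Case 3 amounts to a join between an Impala-backed HDFS table and an enterprise bug table. A sketch of what such a CiscoDV-pushed query could look like (all table and column names here are invented for illustration; the actual view definitions are not shown in the deck):

```python
# Illustrative sketch of the Use Case 3 federated query: join HDFS-resident
# interaction data (via Impala on CIS) with bug data from an enterprise
# store (via the RIDES CiscoDV instance). Table and column names are
# invented for illustration.

def federated_bug_interactions(company: str) -> str:
    """Build the federated SQL that CiscoDV would execute across sources."""
    return (
        "SELECT b.bug_id, b.severity, COUNT(*) AS views "
        "FROM hdfs_interactions i "                        # Impala table on CIS
        "JOIN enterprise_bugs b ON i.bug_id = b.bug_id "   # enterprise store via RIDES
        f"WHERE i.company = '{company}' "
        "GROUP BY b.bug_id, b.severity"
    )

query = federated_bug_interactions("ACME")
```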
27. CiscoDV on Intercloud Analytics Platform (CIS)
Scenario 1: CIS CiscoDV to Cisco Enterprise Data Store
Scenario 2: CIS CiscoDV to Impala and Hive on the CIS Intercloud Analytics Platform
Scenario 3: CIS CiscoDV to Hive on AWS Big Data Cluster