Data gravity is a reality when dealing with massive data volumes in globally distributed systems. Processing this data requires distributed analytics across the Intercloud. In this presentation we share our real-world experience storing, routing, and processing big data workloads on the Cisco Cloud Services and Amazon Web Services clouds.
4. Omni-Channel Customer Journeys
[Diagram: customer journey map]
Channels: Server Logs, Social & Chat, Mobile Event Streams, Call Center
Interaction touch points: S/W Download, Open Trouble Ticket, Assign Engineer, Update Trouble Ticket, Resolve Trouble Ticket, Close Trouble Ticket, Read Support Documents, View Design Documents, View Tech Documents, New Registration, Bug Search, FAQs, Contract Details, Product Details, Device Coverage
Journeys: Case Resolution, Software Upgrade
The customers' interaction with Cisco across multiple touch points to achieve the desired business outcome.
6. Customer Interaction Analytics Big Data Platform
Synthesize customer journey maps into behavioral insights.
[Diagram: platform architecture]
Data sources: Server Logs, Call Center, Mobility, Social, Event Streams
Data ingestion (real-time processing): CiscoDV, Kafka, Redis, ETL
Batch analytics: build analytics model, activity refinement, activity synthesis (Impala, Hive, Pig, ES)
Insight services: synthesized insights served via CiscoDV and Interact, visualized with Zoomdata and Platfora
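The batch side of this platform refines ingested interaction activity and then synthesizes it into insights. A minimal plain-Python sketch of that two-stage idea (the stage names follow the diagram; the event fields and logic are illustrative assumptions, not the production code):

```python
# Plain-Python sketch of the batch stages in the diagram: activity
# refinement followed by activity synthesis, modeled as pure functions
# over ingested interaction events. Event fields are assumptions.

def refine(events):
    """Activity refinement: drop malformed events, normalize the channel."""
    return [
        {**e, "channel": e["channel"].lower()}
        for e in events
        if "user" in e and "channel" in e and "action" in e
    ]

def synthesize(events):
    """Activity synthesis: roll refined events up into per-user action counts."""
    insights = {}
    for e in events:
        user_counts = insights.setdefault(e["user"], {})
        user_counts[e["action"]] = user_counts.get(e["action"], 0) + 1
    return insights

sample = [
    {"user": "u1", "channel": "Web", "action": "bug_search"},
    {"user": "u1", "channel": "Web", "action": "bug_search"},
    {"user": "u2", "channel": "Mobile", "action": "sw_download"},
    {"channel": "Web", "action": "orphan"},  # malformed: no user, dropped
]
insights = synthesize(refine(sample))
```

In the real platform these stages run as distributed jobs over the ingested stores rather than over in-memory lists, but the refine-then-synthesize shape is the same.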
9. Case for Cisco Intercloud Services for Analytics…
Cisco Security and Compliance requirements
• Workloads that deal with personally identifiable data and Cisco-confidential content cannot be uploaded to AWS; the Cisco internal cloud solution is a better fit.
Customer journey beyond the enterprise
• Applications are hosted on AWS
• Partner systems hosted on AWS and other cloud providers
Presence in AWS and other cloud services is required to support these scenarios for end-to-end customer journey insights.
Data virtualization integrated in the CIS Analytics Stack
• Connect data from multiple clouds and multiple big data platforms
Integrated visualization toolset
11. CIS Analytics Platform Requirements
Infra Provisioning
Deploy a virtual private cloud (VPC) on CIS with compute, storage and memory requirements comparable to the current
production system.
OpenStack
OpenStack Icehouse with Neutron, Nova, and Swift installed.
Big Data Ecosystem
Cloudera’s Hadoop distribution CDH 5.1.3, the ELK stack, Apache Kafka, and Apache Storm.
Data virtualization & Cloud Integration
Access to data services and data stores via Cisco Data Virtualization
Runtime Services
Foundational PaaS capabilities, including SLAs for uptime, performance, latency, data retention, issue escalation and support priorities, issue resolution, problem management, deployment process, and patch management.
API Services
Provide both fine-grained and coarse-grained access to all service layers of the CIS Analytics Platform. In the hybrid cloud model it must support interoperability across platform service providers and promote the cloud principles of extensibility and flexibility.
12. AWS to CIS Migration – Success Criteria
Successful synthesis of customer interaction data
Successful automation of the end-to-end data processing pipeline
Build behavioral insight services
Access to data and services via data discovery and visualization tools
Meet the performance, scale and platform stability requirements
Successful deployment of CiscoDV on CIS
Connect HDFS and Hive DS with CiscoDV via Hive and Impala
Build and expose insight services for consumption by a limited set of users
13. AWS and CIS Data Node Sizing Comparison
Hadoop Cluster for Batch and Query Analytics

AWS sizing
  Node service: Data Nodes / Master
  Instance type: m3.2xlarge | vCPU: 8 | Mem: 30 GB | Storage: 2x80 GB | Data nodes: 30
  Comment: Each Hadoop data node has 1500 GB of EBS available for HDFS storage

CCS sizing
  Node service: Data Nodes / Master
  Instance type: GP-2XLarge | vCPU: 8 | Mem: 32 GB | Storage: 50 GB | Data nodes: 35
  Comment: Each Hadoop data node has 1500 GB of EBS available for HDFS storage; less than the AWS sizing (storage)
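A quick capacity check on those tables: with 1500 GB of EBS per data node, raw HDFS capacity is 45 TB on AWS (30 nodes) and 52.5 TB on CCS (35 nodes). Assuming the default HDFS replication factor of 3 (an assumption; the deck does not state it), usable capacity comes to roughly 15 TB and 17.5 TB respectively:

```python
# Back-of-envelope HDFS capacity from the sizing tables above, assuming
# the default HDFS replication factor of 3 (an assumption -- the deck
# does not state the replication factor used).

def usable_hdfs_tb(data_nodes, gb_per_node, replication=3):
    raw_tb = data_nodes * gb_per_node / 1000  # raw capacity in TB
    return raw_tb / replication               # usable after replication

aws_usable = usable_hdfs_tb(30, 1500)  # AWS: 30 data nodes x 1500 GB -> 15.0 TB
ccs_usable = usable_hdfs_tb(35, 1500)  # CCS: 35 data nodes x 1500 GB -> 17.5 TB
```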
14. Pilot Test Data
• Test performed on one day’s production data
• Total no. of records processed – 110,852,667
• Total data size – 32GB
• Total no. of M/R jobs in the data pipeline – 17
• Two test cycles
• Cycle 1: Heterogeneous CCS nodes (vCPUs, storage, memory)
• Cycle 2: Homogeneous CCS nodes
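A rough sanity check on the pilot numbers (illustrative arithmetic only, assuming 1 GB = 2**30 bytes): 32 GB over ~110.85 million records works out to about 310 bytes per record:

```python
# Rough arithmetic from the pilot figures above. Illustrative only;
# assumes 1 GB = 2**30 bytes.

records = 110_852_667        # records processed in one day's data
total_bytes = 32 * 2**30     # 32 GB total data size
mr_jobs = 17                 # M/R jobs in the data pipeline

avg_record_bytes = total_bytes / records  # ~310 bytes per record
```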
17. PoC: Analytics with Spark on CIS
Existing code
Written in Ruby with Wukong to run on Hadoop
Carries a history of changes and modifications
Script-based; steps communicate via intermediary files
Goal
Revise, rethink, and reimplement with Spark on CIS
Open the platform up for advanced cloud analytics
Improve maintainability by moving away from aging Ruby on Hadoop
18. Sessionize (Ruby Flow)
[Diagram: cleanse → decorate → sessionize (cookie, time) → match 1st (IP, UA, time) → build actions → merge → session PSV → add to Hive; side outputs for private, first, others, and bot records feed the bug tool; the main computation happens in the sessionize and match stages]
Pre-process log records ('cleanse')
Extract HTTP sessions ('sessionize')
Extract user actions, such as 'search', 'download patch', 'open manual', 'open a bug'
Ruby: scripts with temp files
  Each box in the figure is a script in a separate file
  The scripts pipe gigabytes of data as input and output
  Random matching of nodes to data for sessionizing
  Lots of redundant shuffling
Ruby flow: global sort in time, then global group by IP
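The 'sessionize (cookie, time)' step splits each cookie's click stream into HTTP sessions wherever the gap between consecutive hits exceeds a timeout. A minimal Python sketch of that logic (the production pipeline is Ruby/Wukong; the 30-minute timeout is an assumed illustration):

```python
from itertools import groupby

# Minimal sketch of the 'sessionize (cookie, time)' step: group hits by
# cookie, walk each group in time order, and start a new session whenever
# the gap between consecutive hits exceeds a timeout. The 30-minute
# timeout is an assumption for illustration, not the production value.

SESSION_GAP = 30 * 60  # seconds

def sessionize(hits):
    """hits: iterable of (cookie, timestamp_seconds) -> list of sessions."""
    sessions = []
    by_cookie = sorted(hits)  # sort by (cookie, time)
    for _, group in groupby(by_cookie, key=lambda h: h[0]):
        current = []
        last_ts = None
        for cookie, ts in group:
            if last_ts is not None and ts - last_ts > SESSION_GAP:
                sessions.append(current)  # gap too large: close the session
                current = []
            current.append((cookie, ts))
            last_ts = ts
        sessions.append(current)  # close the cookie's final session
    return sessions
```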
19. Sessionize (Spark Flow)
[Diagram: same pipeline as slide 18, reimplemented in Spark]
Same flow, but each box is a Java or Scala function
  No intermediate temp files
  Steps are chained by Spark, often without any need for intermediate data
  Where intermediate data is still needed, it is kept in memory and on local disk as much as possible
Local computation
  Cleansing is computed on nodes local to the data blocks (same as Ruby)
  Sessions are built per IP, on separate nodes, each handling a single IP range
  Once copied to the node on partition, the data remains local
Spark flow: global partition by IP, then local sort in time
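The 'global partition by IP, local sort in time' idea can be sketched in plain Python by emulating what Spark does with a repartition by key followed by per-partition sorting; the function names and sample records here are illustrative:

```python
# Plain-Python sketch of the Spark flow above: hash-partition records by
# IP so each partition can sort its own records locally, instead of one
# global sort across the cluster. In real Spark this would be a
# repartition by key followed by a per-partition sort.

def partition_by_ip(records, num_partitions):
    """records: (ip, timestamp, action) tuples -> list of partitions."""
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        parts[hash(rec[0]) % num_partitions].append(rec)
    return parts

def local_sort_in_time(partition):
    """Each node sorts only its own partition by (ip, timestamp)."""
    return sorted(partition, key=lambda rec: (rec[0], rec[1]))

records = [
    ("10.0.0.1", 30, "search"),
    ("10.0.0.2", 10, "download patch"),
    ("10.0.0.1", 5, "open manual"),
]
parts = [local_sort_in_time(p) for p in partition_by_ip(records, 4)]
```

Because all records for one IP land in the same partition, session extraction for that IP never needs data from another node.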
20. Volumes and Runtime Comparison
Logs of a single day: 52 GB
  Total of 110 million records, of which 53 million are kept after pre-filtering
  Producing over 1 million user actions
Cluster of 30 nodes
  Ruby runtime: 140 min
  Spark runtime: 7 min (20 times faster)
21. Why? Extracting sessions means sorting in time and grouping by IP
Ruby:
  sorting in time and per-IP grouping are performed across the whole cluster (very bad: lots of IO)
Spark is good at dealing with partitions:
  per-IP groups are placed on different machines (partitions)
  the global sort in time is replaced by many local per-IP sorts, done on the machines responsible for extracting sessions for specific groups of IP addresses
Other improvements
  Avoid redundant temp files and redundant (de)serialization of objects (comes with Java/Scala); stages keep data in memory when possible (comes with Spark)
  Cache the results of user-agent resolution, which is heavy on regular expressions
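The last improvement, caching user-agent resolution, can be sketched with memoization: UA strings repeat constantly in web logs, so a cached classifier avoids rerunning the regular expressions. The patterns below are illustrative, not the production rules:

```python
import re
from functools import lru_cache

# Sketch of the "cache results of user-agent resolution" improvement.
# UA strings repeat heavily across log records, so memoizing the
# regex-heavy classifier avoids rerunning the same expressions.
# These patterns are illustrative, not the production rules.

_BOT = re.compile(r"bot|crawler|spider", re.IGNORECASE)
_MOBILE = re.compile(r"Mobile|Android|iPhone")

@lru_cache(maxsize=65536)
def resolve_user_agent(ua: str) -> str:
    if _BOT.search(ua):
        return "bot"
    if _MOBILE.search(ua):
        return "mobile"
    return "desktop"

resolve_user_agent("Googlebot/2.1")        # classified as 'bot'
resolve_user_agent("Googlebot/2.1")        # cache hit: no regex work
```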
23. Data Virtualization for Intercloud Analytics
Customer Benefits
Discover data beyond the enterprise: virtual integration that combines traditional
enterprise data, Big Data stores on CIS and AWS, cloud data from SaaS providers, and
data from Cisco customers and partners
Seamless interoperability offers easy access to data across distributed data sources
in the intercloud analytics platform
Universal data governance maximizes enforcement of data security rules
Analytics Data Hubs: Deployment flexibility to build hybrid/virtual sandboxes that
enable nimble data discovery and rapid data analytics to support multiple LOBs
Deliver data to any number of analytics tools.
24. Use Case 1: Get Case Interactions
Use case description: Number of cases opened by company X that are currently open (other variations include cases by company, trends, etc.)
CiscoDV value: CiscoDV enforces data security rules to restrict access to customer-sensitive data on the intercloud platform.
Data sources: SalesForce
Intercloud solution: The CIS CiscoDV service can access the "sanitized" version of CSOne data through JDBC from the RIDES (SWTG CiscoDV) API.
Connection type: DV on hybrid cloud; Enterprise data store
25. Use Case 2: Get Customer Journey
Use case description: Customer interactions on the web pertaining to the bug search and case submission process. This foundational data can be used to explore trends and feed into content recommendation models.
CiscoDV value: Direct access to data on the CIS Intercloud Analytics Platform
Data sources: SAS Analytics
Intercloud solution: With direct network access to the Impala server, the CIS CiscoDV server connects to the Impala service in Hadoop (also on CIS) as a data source. SQL queries configured in CiscoDV execute as Impala queries.
Connection type: DV on hybrid cloud; VPC Big Data platform
26. Use Case 3: Get Bug Interactions
Use case description: Another foundational data service that provides a breakdown of customer exposure to, or interest in, bugs. The service can be refined further to look at trends specific to a company or a product for further analytics.
CiscoDV value: Real-time data federation that accesses extremely large data sets on the CIS Intercloud Analytics platform and joins them with bug data accessed via a departmental CiscoDV instance (RIDES)
Data sources: SASA Analytics and QDDTS via RIDES
Intercloud solution: Building on the access to the Impala server, the DV server can join the bug data from the enterprise data stores with the HDFS data to provide a federated view.
Connection type: DV on hybrid cloud; VPC Big Data platform and Enterprise data store
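The federated view in Use Case 3 amounts to a join between an Impala-backed HDFS table and an enterprise bug table. A sketch of what such a CiscoDV-pushed query could look like (all table and column names here are invented for illustration; the actual view definitions are not shown in the deck):

```python
# Illustrative sketch of the Use Case 3 federated query: join HDFS-resident
# interaction data (via Impala on CIS) with bug data from an enterprise
# store (via the RIDES CiscoDV instance). Table and column names are
# invented for illustration.

def federated_bug_interactions(company: str) -> str:
    """Build the federated SQL that CiscoDV would execute across sources."""
    return (
        "SELECT b.bug_id, b.severity, COUNT(*) AS views "
        "FROM hdfs_interactions i "                        # Impala table on CIS
        "JOIN enterprise_bugs b ON i.bug_id = b.bug_id "   # enterprise store via RIDES
        f"WHERE i.company = '{company}' "
        "GROUP BY b.bug_id, b.severity"
    )

query = federated_bug_interactions("ACME")
```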
27. CiscoDV on Intercloud Analytics Platform (CIS)
Scenario 1: CIS CiscoDV to Cisco Enterprise Data Store
Scenario 2: CIS CiscoDV to Impala and Hive on the CIS Intercloud Analytics Platform
Scenario 3: CIS CiscoDV to Hive on AWS Big Data Cluster