Ken Owens, the CTO of Cisco Intercloud Services, presented on Cisco's migration from MapReduce jobs to Spark jobs for processing customer interaction data. The document discussed Cisco's need to embrace both traditional and hyperscale application deployment across data centers, clouds, and edges. It also covered Cisco's analysis platform requirements, AWS and Cisco Intercloud sizing comparisons, and performance results from testing the migration of MapReduce jobs to Spark on the Cisco Intercloud.
1. Ken Owens
CTO Cisco Intercloud Services
07/15/15
How Cisco Migrated from MapReduce Jobs to Spark Jobs
2. Cisco and/or its affiliates. All rights reserved.Presentation_ID Cisco Public
Introduction
7.
The Uber Trend: Exponential Rise in Connectivity
• 30M new devices connected every week
• 78% of workloads processed in cloud DCs by 2018
• 5TB+ of data per person by 2020
• 180B mobile apps downloaded in 2015
• 277X data created by IoE devices vs. end users
Source: IDC
8.
Exponential Growth Drives Opportunities
[Figure: exponential trend vs. linear trend; labeled points: Knee of Curve, Disruptive Stress/Opportunity]
Peter Diamandis: BOLD
9.
When Products Become Cloud-enabled, They Become 10X More Valuable
$23.19
$249.00
$18.01
$199.00
$5.99
$59.99
10.
SaaS
PaaS IaaS
A Broader Perspective than Hybrid Cloud Is Required…
Data Center Cloud Edge / IoT
11.
CIOs Need to Embrace Both Traditional and Hyperscale Application Deployment
Hyperscale applications serving several thousands of users very quickly
• IoE and increasing connectivity driving the need for such workloads
• Hadoop, Mobile back-ends, Gaming, Social
• Small (~10%), yet rapidly growing percentage of applications in the Cloud
Traditional enterprise applications
• ERP, CRM, Applications that leverage traditional databases
• Majority of applications being run for/by Enterprises today
12.
SaaS
PaaS IaaS
Application Portability and Interoperability Is the Key
Traditional Applications: ERP, Financial, Client/Server, CRM, email, …
Cloud Native Applications: IoT, Big Data, Analytics, Gaming, ...
Data Center Cloud Edge / IoT
13.
Source: Gartner, Lydia Leong
45% of CIOs currently have a second fast/agile mode of operation
Traditional Mode: Requires Reliability (ITIL, CMMI, COBIT)
Nonlinear Mode: Accept Instability (DevOps, automation, reusable)
Systems of Differentiation, Systems of Innovation, Systems of Record
Change, Governance
Bimodal IT Is the New Normal
14.
Islands of Isolated PC LAN Networks (1990s): Multiple LANs using a multitude of protocols
The Internet: Connected using industry-standard IP protocol (IP Based, Open Standards)
World of Isolated Clouds (2000s): Individual custom-built clouds without consistent APIs
The Intercloud: Connected for application acceleration with Open APIs (Web-scale Architecture; API-Driven Automation; Open, Secure, Compliant, Hybrid IT)
We Must Connect the Clouds
16.
Omni-Channel Customer Journeys
Channels: Server Logs, Social & Chat, Mobile, Event Streams, Call Center
Interaction touch points: S/W Download, Open Trouble Ticket, Assign Engineer, Update Trouble Ticket, Close Trouble Ticket, Resolve Trouble Ticket, Read Support Documents, View Design Documents, View Tech Documents, New Registration, Bug Search, FAQs, Contract Details, Product Details, Device Coverage
Journeys: Case Resolution, Software Upgrade
The customers’ interaction with Cisco across multiple touch points to get the desired business
outcome.
17.
Customer Journeys
• Software Upgrades
• Bug Inquiry
• Software Inquiry
• Trouble Ticket Lifecycle
• Device Troubleshooting
• New Registration
• Contract Renewal
Behavioral Insights
• Customer Interest Analytics
• Customer Experience Analytics
• Resource Forecasting
• Security and Compliance
Impact
• Boost Self Service
• Real-time Content Optimization & Recommendation
• Context Based Predictive Alerts
• Implicit Personalization
Customer Interaction Analytics
From Journey to Outcome…
18.
Customer Interaction Analytics Big Data Platform
Synthesize customer journey maps into behavioral insights.
Data Sources: Server Logs, Call Center, Mobility, Social, Event Streams
Data Ingestion: CiscoDV, Kafka, Redis, ETL
Analytics: Model, Build Model, Activity Refinement, Activity Synthesis, Synthesized Insights
Real-time Processing, Batch Analytics, Insight Services
Tools shown: CiscoDV, Interact, Impala, Hive, Pig, ES, Zoomdata, Platfora
20.
AWS Platform
Component | Cloud::Hadoop (Batch Analytics) | Cloud::Queries (Interactive Queries) | Cloud::Streams (Near Real-time Analytics)
Virtual Machines | 30 | 6 | 5
AWS Instance Sizing | m3.2xlarge | c3.xlarge | m3.xlarge
Virtual Cores | 8/VM | 4/VM | 4/VM
RAM | 30GB/VM | 7.5GB/VM | 15GB/VM
Disk | 1.5 TB/VM | 1.5 TB/VM | 1.5 TB/VM
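The per-VM figures in the AWS Platform table can be rolled up into platform-wide totals. The short sketch below does only that arithmetic; the dictionary layout and key names are illustrative and not part of the original deck.

```python
# Per-VM figures copied from the AWS Platform table; key names are illustrative.
tiers = {
    "Hadoop (batch)":        {"vms": 30, "vcores": 8, "ram_gb": 30.0, "disk_tb": 1.5},
    "Queries (interactive)": {"vms": 6,  "vcores": 4, "ram_gb": 7.5,  "disk_tb": 1.5},
    "Streams (near real-time)": {"vms": 5, "vcores": 4, "ram_gb": 15.0, "disk_tb": 1.5},
}

# Total VM count across the three tiers.
total_vms = sum(t["vms"] for t in tiers.values())

# Multiply each per-VM figure by the VM count and sum across tiers.
totals = {
    key: sum(t["vms"] * t[key] for t in tiers.values())
    for key in ("vcores", "ram_gb", "disk_tb")
}

print(f"VMs: {total_vms}")
for key, value in totals.items():
    print(f"{key}: {value}")
```

Most of the capacity sits in the Hadoop tier, which is why the later sizing comparison focuses on the Hadoop data nodes.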
21.
Case for Cisco Intercloud Services for Analytics…
Cisco Security and Compliance requirements
• Workloads that deal with personally identifiable data and Cisco
confidential content cannot be uploaded to AWS. Cisco internal cloud
solution is a better fit.
Customer journey beyond the enterprise
• Applications are hosted on AWS
• Partner systems hosted on AWS and other cloud providers
Presence in AWS and other cloud services required to support these
scenarios for end-end customer journey insights.
Data virtualization integrated in the CIS Analytics Stack
• Connect data from multiple clouds and multiple big data platforms
Integrated visualization toolset
22.
CIS Analytics Platform
23.
CIS Analytics Platform Requirements
Infra Provisioning
Deploy a virtual private cloud (VPC) on CIS with compute, storage and memory requirements comparable to the current
production system.
OpenStack
Icehouse OpenStack with Neutron, Nova, and Swift installed.
Big Data Ecosystem
Cloudera’s Hadoop distribution version CDH 5.1.3, ELK Stack, Apache Kafka, and Apache Storm.
Data virtualization & Cloud Integration
Access to data services and data stores via Cisco Data Virtualization
Runtime Services
Foundational PaaS capabilities including SLAs for uptime, performance, latency, data retention, issue escalation and
support priorities, issue resolution, problem management, deployment process, patch management.
API Services
Provide both fine-grained and coarse-grained access to all service layers of the CIS Analytics Platform. In the hybrid cloud
model it must support interoperability across platform service providers and promote the cloud concepts of extensibility and
flexibility.
24.
AWS to CIS Migration – Success Criteria
Successful synthesis of customer interaction data
Successful automation of the end-end data process pipeline
Build behavioral insight services
Access to data and services via data discovery and visualization tools
Meet the performance, scale and platform stability requirements
Successful deployment of CiscoDV on CIS
Connect HDFS and Hive DS with CiscoDV via Hive and Impala
Build and expose insight services for consumption by limited users
25.
AWS and CIS Data Node Sizing Comparison
Hadoop Cluster for Batch and Query Analytics
AWS Sizing
Node Service | AWS Instance Type | vCPU | Mem | Storage | Number of Data Nodes | Comments
Data Nodes/Node Master | m3.2xlarge | 8 | 30 | 2x80 GB | 30 | Each Hadoop data node has 1500GB of EBS available for HDFS storage
CCS Sizing
Node Service | CCS Instance Type | vCPU | Mem | Storage | Number of Data Nodes | Comments
Data Nodes/Node Master | GP-2XLarge | 8 | 32 | 50 | 35 | Each Hadoop data node has 1500GB of EBS available for HDFS storage. Less than AWS sizing (Storage)
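To put the two data-node rows side by side, the sketch below multiplies the per-node figures by the node counts. It assumes the Mem column is in GB per node (the slide gives no unit), and it leaves storage out because the two Storage cells are not directly comparable as written.

```python
# Per-node figures from the sizing comparison; "mem_gb" assumes the Mem column is in GB.
AWS = {"nodes": 30, "vcpu": 8, "mem_gb": 30}
CCS = {"nodes": 35, "vcpu": 8, "mem_gb": 32}

def cluster_totals(spec):
    """Aggregate vCPU and memory across all data nodes of one cluster."""
    return {
        "vcpu": spec["nodes"] * spec["vcpu"],
        "mem_gb": spec["nodes"] * spec["mem_gb"],
    }

aws_totals = cluster_totals(AWS)
ccs_totals = cluster_totals(CCS)

# Print each aggregate and the CCS-to-AWS ratio.
for key in ("vcpu", "mem_gb"):
    ratio = ccs_totals[key] / aws_totals[key]
    print(f"{key}: AWS {aws_totals[key]}, CCS {ccs_totals[key]}, CCS/AWS {ratio:.2f}")
```

Under these assumptions the CCS cluster has somewhat more aggregate compute and memory than the AWS cluster, while the slide notes that its storage is smaller.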
26.
Pilot Test Data
• Test performed on one day’s production data
• Total no. of records processed – 110,852,667
• Total data size – 32GB
• Total no. of M/R jobs in the data pipeline – 17
• Two test cycles
• Cycle 1: Heterogeneous CCS nodes (vCPUs, storage, memory)
• Cycle 2: Homogeneous CCS nodes
27.
CIS Performance of Batch Analytics – Limited Test
29.
PoC: Analytics with Spark on CIS
Existing code
Made in Ruby with Wukong to run on Hadoop
A history of changes and modifications
Script-based, steps communicate via intermediary files
Goal
Revise, rethink and reimplement with Spark on CIS
Open for advanced cloud analytics
Improve maintainability by moving away from aging Ruby on Hadoop
30.
[Figure: log-processing flow in two stages, Cleanse and Sessionize. Box labels: logs, cleanse, private web, decorate, cleansed, sessionize (cookie, time), sessioned, match 1st (IP, UA, time), build actions, merge, session PSV, add to hive, bug tool. Outputs: first, others, bots (1..7), onlyBots, private. Annotation: "Main computation happens here".]
Pre-process log records (‘cleanse’)
Extract HTTP sessions (‘sessionize’)
Extract user actions, such as ‘search’, ‘download
patch’, ‘open manual’, ‘open a bug’
Ruby: Scripts with temp files
Each box on the figure is a script in a separate file
They pipe gigabytes of data as input and output
Random matching of nodes to data for sessionizing
Lots of redundant shuffling
Ruby Flow
global sort in time
global group by IP
31.
[Figure: the same Cleanse and Sessionize flow diagram as on the previous slide.]
Same flow, but each box is a Java or Scala function
No intermediate temp files
Steps are chained by Spark, often without any need for
intermediate data
If still needed, the data is stored in memory and local
disk as much as possible
Local computation
Cleansing is computed on nodes local to data blocks
(same as Ruby)
Sessions are built per IP
On separate nodes, each handling a single IP range
Once copied to its node during partitioning, the data remains local
Spark Flow
global partition by IP
local sort in time
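The "global partition by IP, local sort in time" idea can be sketched without a cluster. The example below is plain Python, not Spark code and not the code used in the PoC: the record layout `(ip, timestamp, url)`, the 30-minute inactivity gap, and the number of partitions are all assumptions made for illustration.

```python
from zlib import crc32

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout; not stated in the slides
NUM_PARTITIONS = 4             # assumed; stands in for the nodes of the cluster

def partition_by_ip(records, num_partitions=NUM_PARTITIONS):
    """Send every record of a given IP to the same partition (the only shuffle)."""
    partitions = [[] for _ in range(num_partitions)]
    for ip, ts, url in records:
        partitions[crc32(ip.encode()) % num_partitions].append((ip, ts, url))
    return partitions

def sessionize_partition(partition):
    """Sort locally by (ip, time) and start a new session after a long gap."""
    sessions = []
    last_ip, last_ts = None, None
    for ip, ts, url in sorted(partition):
        if ip != last_ip or ts - last_ts > SESSION_GAP_SECONDS:
            sessions.append((ip, []))
        sessions[-1][1].append((ts, url))
        last_ip, last_ts = ip, ts
    return sessions

# Tiny hypothetical log: timestamps are seconds, deliberately out of order.
records = [
    ("10.0.0.1", 100, "/search"),
    ("10.0.0.2", 50, "/bug/1"),
    ("10.0.0.1", 5000, "/manual"),
    ("10.0.0.1", 160, "/download"),
]

all_sessions = [
    session
    for partition in partition_by_ip(records)
    for session in sessionize_partition(partition)
]
for ip, events in all_sessions:
    print(ip, events)
```

Because all records of one IP land in the same partition, each partition can sort and cut sessions on its own, so no cluster-wide sort is needed.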
32.
Volumes
Logs of a single day: 52 GB
Total of 110 mil records
Where 53 mil records are kept after pre-filtering
Producing over 1 mil user actions
Cluster of 30 nodes
Ruby
Runtime 140 min
Spark
Runtime 7 min (20 times faster)
Runtime comparison
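The "20 times faster" figure follows directly from the two runtimes. The sketch below also derives an approximate records-per-second rate, using the rounded 110 million input records from this slide; the variable names are illustrative.

```python
# Figures from the runtime comparison slide.
ruby_minutes = 140
spark_minutes = 7
total_records = 110_000_000  # "110 mil records", rounded as on the slide

# Speedup is the ratio of the two wall-clock runtimes.
speedup = ruby_minutes / spark_minutes

# Approximate end-to-end throughput in input records per second.
ruby_rate = total_records / (ruby_minutes * 60)
spark_rate = total_records / (spark_minutes * 60)

print(f"speedup: {speedup:.0f}x")
print(f"Ruby:  {ruby_rate:,.0f} records/s")
print(f"Spark: {spark_rate:,.0f} records/s")
```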
33.
Extracting sessions means sort in time and group by IP
Ruby:
sorting in time and per-IP grouping is performed across the whole cluster (very bad, lots of IO)
Spark is good at dealing with partitions:
per-IP groups are placed on different machines (partitions)
global sort in time is replaced by many local per-IP sorts done on machines responsible for
extracting sessions for specific groups of IP addresses
Other improvements
Avoid redundant temp files and redundant (de)serialization of objects (comes with Java/Scala)
Stages keep data in memory when possible (comes with Spark)
Cache results of user agent resolution that are heavy on regular expressions
Why?
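Caching user agent resolution pays off because the same user agent strings repeat across many log records. The sketch below shows one way to memoize a regex-heavy resolver in Python; the patterns, labels, and the function name `resolve_user_agent` are hypothetical and are not taken from the Cisco code.

```python
import re
from functools import lru_cache

# Hypothetical patterns for illustration only; a real resolver would have many more.
UA_PATTERNS = [
    ("bot", re.compile(r"bot|crawler|spider", re.IGNORECASE)),
    ("firefox", re.compile(r"Firefox/\d+")),
    ("chrome", re.compile(r"Chrome/\d+")),
]

@lru_cache(maxsize=100_000)
def resolve_user_agent(ua):
    """Run the regex cascade once per distinct user agent string."""
    for label, pattern in UA_PATTERNS:
        if pattern.search(ua):
            return label
    return "other"

# Repeated strings hit the cache instead of re-running the regexes.
log_uas = ["Mozilla/5.0 Firefox/38.0", "Googlebot/2.1", "Mozilla/5.0 Firefox/38.0"]
labels = [resolve_user_agent(ua) for ua in log_uas]
print(labels)
print(resolve_user_agent.cache_info())
```

The cache size is an arbitrary choice here; in practice it would be sized to the number of distinct user agents seen in a day of logs.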
35.
Data Virtualization for Intercloud Analytics
Customer Benefits
Discover data beyond the enterprise: Virtual integration that combines traditional
enterprise data, Big Data stores on CIS and AWS, cloud data from SaaS providers, and
data from Cisco customers and partners
Seamless interoperability offers easy access to data across distributed data sources
in the intercloud analytics platform
Universal data governance maximizes enforcement of data security rules
Analytics Data Hubs: Deployment flexibility to build hybrid/virtual sandboxes that
enable nimble data discovery and rapid data analytics to support multiple LOBs
Deliver data to any number of analytics tools.
36.
Use Case 1: Get Case Interactions
Use Case Description: Number of cases opened by company X that are currently open (other variations would include cases by company, trends, etc.).
CiscoDV Value: CiscoDV enforces data security rules to restrict access on the intercloud platform to customer sensitive data.
Data Sources: SalesForce
Intercloud Solution: CIS CiscoDV service can access the “sanitized” version of CSOne data through JDBC from RIDES (SWTG CiscoDV) API.
Connection Type: DV on hybrid cloud Enterprise data store
37.
Use Case 2: Get Customer Journey
Use Case Description: Customer interactions on the web pertaining to bug search and case submission process. Foundational data can be used to explore trends and feed into content recommendation models.
CiscoDV Value: Direct access to data on CIS Intercloud Analytics Platform
Data Sources: SAS Analytics
Intercloud Solution: By direct network access to the Impala Server, the CIS CiscoDV server connects to the Impala Service in Hadoop, also on CIS, as a Data Source. SQL queries configured in CiscoDV execute Impala queries.
Connection Type: DV on hybrid cloud VPC Big Data platform
38.
Use Case 3: Get Bug Interactions
Use Case Description: Another foundational data service that provides a breakdown of customer exposure or interest in bugs. The service can be refined further to look at trends specific to a company or a product for further analytics.
CiscoDV Value: Real-time data federation that accesses extremely large data in CIS Intercloud Analytics platform and joins that with Bug Data accessed via departmental CiscoDV instance (RIDES)
Data Sources: SASA Analytics and QDDTS via RIDES
Intercloud Solution: By building on the access to the Impala Server, the DV server can join the Bug Data from the Enterprise Data Stores with the HDFS data to provide a federated view.
Connection Type: DV on hybrid cloud VPC Big Data platform and Enterprise data store
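A federated view of this kind amounts to a join across two separately stored data sets. The sketch below imitates that with two in-memory SQLite databases attached to one connection. It is only a stand-in: it does not use CiscoDV, Impala, or RIDES, and the table names, columns, and rows are invented for illustration.

```python
import sqlite3

# One connection, two separate databases: "main" stands in for the big data
# platform, "enterprise" for the enterprise data store. All names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS enterprise")

conn.execute("CREATE TABLE main.bug_views (company TEXT, bug_id TEXT)")
conn.execute("CREATE TABLE enterprise.bugs (bug_id TEXT, product TEXT)")

conn.executemany(
    "INSERT INTO main.bug_views VALUES (?, ?)",
    [("X", "B1"), ("X", "B1"), ("X", "B2"), ("Y", "B1")],
)
conn.executemany(
    "INSERT INTO enterprise.bugs VALUES (?, ?)",
    [("B1", "Router"), ("B2", "Switch")],
)

# Join interaction data with bug metadata across the two stores.
rows = conn.execute(
    """
    SELECT b.product, v.company, COUNT(*) AS views
    FROM main.bug_views AS v
    JOIN enterprise.bugs AS b ON b.bug_id = v.bug_id
    GROUP BY b.product, v.company
    ORDER BY b.product, v.company
    """
).fetchall()

for row in rows:
    print(row)
```

The consumer issues one query and sees one result set, while the data stays in its original stores, which is the point of the federated view described in this use case.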
39.
CiscoDV on Intercloud Analytics Platform (CIS)
Scenario 1: CIS CiscoDV to Cisco Enterprise Data Store
Scenario 2: CIS CiscoDV to Impala and Hive on CIS Intercloud Analytics Platform
Scenario 3: CIS CiscoDV to Hive on AWS Big Data Cluster
Editor's Notes
FABIO – a few items from Pankaj and Liz Monday:
Per the John Chambers slides I sent you Monday night, please be sure to fully address digitization in the opener, so Pankaj can connect to John’s opening remarks.
Set the stage here for what the digital transformation is and why it drives IoE and cloud. Explain where we came from, where we are today – exponential growth and a magnitude of changes still to come.
Please see new VNI, to see if there are any newer/better stats re the Data Center.
Pankaj feels the top 3 data points are ok in this slide, but perhaps we could find better ones for the bottom 2 data points? Maybe uplevel them a bit?
-------------------------------------------------------
The world is changing. The digital transformation is turning traditional business models on their heads. We are seeing unprecedented growth in the explosion of devices and mobile apps and in data utilization.
IoE – IoE devices create 277 times the data that the end user is creating. But only a fraction of it ever reaches the data center. A Boeing 787, for example, generates 40 TB of data for every hour of flight time. But only 0.5 TB is ultimately transmitted to the data center.
Mobility: In 2014, global mobile data traffic grew 1.7x or 69%… In 2014 alone, 77B+ mobile apps were downloaded… by 2015 180B apps (about 2.3x the 2014 figure)
Internet… IDC predicts by 2017, there will be 3.6 billion global Internet users… More than 1/2 the world population
Big Data… By 2020 there will be more than 5,000 GB of data for every person on Earth
These massive changes are putting tremendous stress on the data center. The traditional data center model has to evolve in order to meet demand today and into the future.
We know how to fix this
We’re going to do for cloud what we did for data. You couldn’t move data between the networks – they weren’t connected. Cisco unified those worlds
The world of cloud today is a world of isolated clouds. There’s no workload or data portability.
“Amazon is hotel California – you can never leave, and that data is staying there”
Our vision is to connect all these clouds together into the Intercloud, whether private, public, or hybrid, through technology and innovation.
Intercloud is going to connect these clouds together in the same way we connected data together.
No one cloud model or single cloud approach, such as the massively scalable clouds from Amazon, Google, or Microsoft, will win alone in this space.