Bitkom Cray presentation - on HPC affecting big data analytics in FS

HPC Meets Big Data in Financial Services
Philip Filleul – Global Lead FS

Agenda
● Who is Cray
● Cray Vision and Products
● Breakthrough analytic technologies
● For each – Spark, Graph, Machine Learning
● What is it
● FS Use cases
● Technology needs to successfully deliver
● How Cray enables
● Key takeaways

Cray: The Myth vs. Reality
Myths
• They are huge
• They are proprietary
• They are complex
• They are expensive
Vs. Reality!
• They can be – but they start at less than a rack
• No: Intel, Linux, open standards, Hadoop
• Simpler and more productive than a grid
• No – cost competitive, lower TCO, higher value
Three Focus Areas
• Computation
• Storage & Data Management
• Analytics

Continuing Financial Strength and R&D Investment
Year    Revenue    Gross R&D    Headcount
2011    $236M      $77M         860
2012    $421M      $86M         929
2013    $526M      $92M         1,042
2014    $562M      $105M        1,138
2015    $725M      >$130M*      1,200*
Copyright 2016 Cray Inc

Where does Cray add value in FS?
BUSINESS LINES
● Asset Management: Search for alpha; Strategy confidence
● Wealth Management: Roboadvice; Scaling mass affluent
● Securities: Massive regulation; Compliance burden; Stress tests; Commoditization
● Insurance: Telematics; Fraud
CROSS INDUSTRY
● Technology: Commoditization and open source; Cloud; Big data analytics; Cybersecurity
[Slide marks selected items with ✔; the mapping of check marks to items is not preserved in this text.]

Cray’s Vision:
The Fusion of Supercomputing and Big & Fast Data
Supercomputing, Big Data, Analytics
Modeling The World
Cray Supercomputers solving “grand challenges” in science, engineering and analytics
Compute, Store, Analyze
● Data-Intensive Processing: High throughput event processing & data capture from sensors, data feeds and instruments
● Math Models: Modeling and simulation augmented with data to provide the highest fidelity virtual reality results
● Data Models: Integration of datasets and math models for search, analysis, predictive modeling and knowledge discovery
High Performance Data Analytics (HPDA)

Cray Product Range and FS Applicability
• Aries Interconnect
• Single memory
• Scalability
• Package density
• Grid compatibility
• Upgradeability
• Integrated Stack
• Best in class power and cooling
• NVIDIA GPU density
• Proven at scale
• Integrated h/w and s/w stack
• Developer productivity
CS400 Cluster Supercomputer
XC40 Supercomputer
• Risk/Pricing
• CVA
• Machine Learning
• Superfast data sharing
• Specialist within grid
• Surprisingly Low TCO
• Risk/Pricing
• Options FFT
• Algo backtesting
• Deep Learning

• Lustre parallel file system
• Single POSIX namespace
• Modular scaling 7.5GB/s-1.7TB/s
• Integrated and preconfigured
• Reliability and availability at scale
• Multi tier single namespace archive
• Rule based policy migration
• Flexible integration with most OEM tape and disk
• Preconfigured and integrated
Archive
Lustre Parallel File System
• High throughput for algo analytics pipeline
• Converged storage across grid, analytics, Hadoop
• Data Lake archival
• Analytical data archival
• Market data archival
• Data no longer ‘deep sixed’

• Most scalable graph processor available
• Whole graph analytics possible
• Open RDF/SPARQL
• Single memory space and extreme threaded processor
• Cloudera 5.2/YARN
• Open to non CDH apps
• Dense compute and memory
• SSD layer for HDFS
• Lustre/POSIX for scale out storage
Urika-XA Extreme Analytics Platform
Urika-GD Graph Discovery Appliance
• Surveillance
• Cybersecurity
• Ontology based transaction compliance
• Spark optimized
• R/T streaming analytics converged with regular analytics
• Machine learning

Breakthrough Analytic Technologies
● Growing CPU capability, commoditization, memory size and IO bandwidth have made some new software technologies explode
● Spark
● Graph
● Machine Learning
● For each:
● What are they?
● Why are they important in FS?
● What technology attributes do they need to deliver on the promise?
● How does Cray enable?

Spark
● What is the technology
● General purpose, productive analytic technology
● Open source, target of much development work
● Memory first, shared data
● Base ecosystem for e.g. GraphX and MLlib
● FS Use Cases:
● Risk analytics
● Real-time alerting and dashboarding
● Web clickstream rapid ETL for CSRs

Spark Technology Needs
Diagram labels: Compute Node (×2), Memory, SSD, HDD
● Block shuffle over interconnect
● Intermediate results spill over from memory; SSD recommended for latency/size balance
● HDD for job input and output
● HDFS vs. parallel file system for high bandwidth and scaling disk separately from compute
Performance Recommendation
- Fast interconnect
- SSD per node
- Shared parallel filesystem
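
The block shuffle named on this slide can be sketched in a few lines. This is a minimal illustration in plain Python, not Spark code: the node count, the record layout and the `shuffle` helper are assumptions made for the example. Each node hash-partitions its local records by key, and every record destined for another node has to cross the interconnect, which is why interconnect speed matters for this step.

```python
from collections import defaultdict
from zlib import crc32


def shuffle(node_records, num_nodes):
    """Hash-partition (key, value) records so each key ends up on one node.

    node_records: one list of (key, value) pairs per node (hypothetical layout).
    Returns (records per destination node, number of records sent off-node).
    """
    destinations = defaultdict(list)
    crossed = 0
    for source, records in enumerate(node_records):
        for key, value in records:
            # crc32 gives a stable hash, unlike Python's salted str hash.
            target = crc32(key.encode("utf-8")) % num_nodes
            destinations[target].append((key, value))
            if target != source:
                crossed += 1  # this record travels over the interconnect
    return dict(destinations), crossed


# Two nodes, each holding trades for a mix of desks (made-up data).
nodes = [
    [("rates", 10), ("fx", 5), ("equities", 7)],
    [("fx", 3), ("rates", 2), ("equities", 1)],
]
placed, crossed = shuffle(nodes, num_nodes=2)

# After the shuffle every key lives on a single node, so it can be reduced locally.
totals = {}
for records in placed.values():
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
print(totals, "records moved between nodes:", crossed)
```

In this toy layout every key is present on both nodes, so half of the records have to move regardless of how the hash falls; with wider data the shuffle volume grows with the data set, not with the node count.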

What is Graph
A Traditional RDBMS is GOOD at:
- Rapid update
- Simple queries about items
But BAD at:
- Relationships between data items
- Patterns of relationships
- Interactions between many data items
- E.g. suspicious pattern of actions
Graph databases:
Operate entirely in memory
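
To illustrate the kind of relationship question that suits a graph better than a table lookup, here is a minimal sketch in plain Python. The entity names and edges are invented for the example, and the adjacency dictionary merely stands in for a graph database; it does not reflect how any particular product stores data. A breadth-first search finds the shortest chain of relationships linking two items.

```python
from collections import deque

# Undirected relationships between data items (hypothetical sample data).
edges = [
    ("trader:alice", "account:A1"),
    ("account:A1", "txn:T9"),
    ("txn:T9", "stock:XYZ"),
    ("stock:XYZ", "restricted_list:2016-03"),
    ("trader:bob", "account:B7"),
]

graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest list of nodes or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Is there any chain of relationships from a trader to a restricted list?
print(shortest_path(graph, "trader:alice", "restricted_list:2016-03"))
print(shortest_path(graph, "trader:bob", "restricted_list:2016-03"))
```

In a relational schema the same question typically needs one join per hop or a recursive query, and the number of hops is not known in advance.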



Discovering New Risk/Compliance events
● Goal: Find detection patterns and improve efficiency of the investigation process by reducing false positives
● Data sets: Accounts, Customer Transactions, 3rd party data feeds, Detection and Case Management systems
● Technical Challenges: Rigid detection system schemas and rules; Constantly degrading performance as new data comes in; Hard to tune performance with new data; Long data on-boarding timeframes; Manual disposition of benign alerts
● Users: Investigators, Analysts
● Usage model: Tune detection system models via data discovery; Enhance, improve and augment the alert investigations process
● Augmenting: Existing detection systems
Graph diagram labels: RestrictedTradingList, Trader, StockSymbol, LegalEmployee, DestIP, Port, Protocol, SourceIP, Type, DateTime, BadgeLogs, EntryTime, ExitTime, Location, SystemWithAdminRights, ITEmployee, PolicyViolations, Restriction, RestrictionStartDate, Department, CounterParty, TransactionDate, TransactionType, RestrictionDate, CommunicationEvent, RecordType, Time

Discovering Customer Churn drivers
● Goal: Identify correlations between service events (truck rolls, call escalations, customer service rep experience level, set-top box reliability…) and customer churn
● Data sets: Customer records, Historic billing records, IVR, HR/training records, customer surveys, Network Operations data, Work Orders…
● Technical Challenges: Volume, Variety and Velocity of data; Disconnected and disparate data sources from operational lines of business and 3rd party contractors
● Users: Customer Operations Analysts
● Usage model: Analyze relationships between service & related events and eventual customer contract outcomes
● Augmenting: Existing data warehouse appliances
Graph diagram labels: Customers, Call Center Events, Work Orders, Call Escalations, Truck Rolls, Set-Top Box feeds, Supervisor Intervention, 3rd Party Service Tech, AVR Failure, CSR Resolution, Cabinet Failure, Residential Accounts, Web Service, Commercial Accounts, Inexperienced CSR Event Resolutions

Mphasis NextAngles: A Disruptive Approach
Two silos: Regulations & Policies; Data & IT systems
Old World Solution: Inadequate
● Now: Sample audits connect the two silos
New World Solution: Knowledge models
● NextAngles: Bridges the two through Knowledge Models
1. Regulations are deconstructed to computer understandable rules
2. Rules are applied to Smart Data
3. This application is through knowledge model

NEXTANGLES
Massively scalable, “Living” model of the bank
How NextAngles works
Diagram labels: Customer’s Systems; Convert to “Smart Data”; Concept Model; Rules; Encoded Regulations & Policies; Encoded Banking Knowledge; Investigation Tools; Dashboards; Time → T1 T2 T3 T4 T5 T6 T7
Reference & Transaction Data (“Facts”)
• Parties
• Accounts / GL / positions
• Transactions & events
Context model
• Line of business
• Legal entities
• Geographies
• Customer segments
• Organization structure
• Processes
Inferences
• Potential violations
• Prohibited activities
• Operational risk measures
• Data problems

ENABLER #1: SMART DATA
Making the data computer intelligible
What is it?
• Data stored as computer intelligible “graphs”
• Diagram: Subject, predicate, Object; Class; Data → SMART Data
How is it enabled?
• Formal standards from the W3C and other bodies
• Over 12 years Semantic Web has evolved to a full ecosystem of products and practices
Value Proposition
• Order of magnitude reduction in handling real world data complexity
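
The subject, predicate, object structure on this slide can be illustrated with a minimal sketch. It is plain Python, not an RDF library, and the sample facts, the `ex:` names and the `match` helper are assumptions made for the example; a real deployment would hold the triples in an RDF store and query them with SPARQL. `None` acts as a wildcard, much like a variable in a triple pattern.

```python
# Each fact is a (subject, predicate, object) triple (made-up sample data).
triples = {
    ("acct:1001", "rdf:type", "ex:Account"),
    ("acct:1001", "ex:ownedBy", "party:acme"),
    ("party:acme", "rdf:type", "ex:LegalEntity"),
    ("party:acme", "ex:locatedIn", "geo:DE"),
    ("txn:77", "ex:debits", "acct:1001"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None matches anything."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Everything known about one party, without a fixed table schema.
print(match(triples, subject="party:acme"))
# Which accounts does that party own?
print([s for s, _, _ in match(triples, predicate="ex:ownedBy", obj="party:acme")])
```

Adding a new kind of fact only means adding triples with a new predicate; no schema change is needed, which is one way to read the slide's claim about handling real world data complexity.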

ENABLER #2: RULES & CURATED KNOWLEDGE
Reliable, consistent and predictable application of reasoning & complex rules
What is it?
• Built on the smart data model: helps computers reach the same conclusions as human knowledge workers
How is it enabled?
• Knowledge expressed as rules that are intrinsically part of the smart data ecosystem
Value Proposition
• Reduces the need for humans to intervene & define “how” to solve problems
Traditional Rules: need to be wired in
Rules in a Smart Data ecosystem: Fills in gaps
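
One way to picture rules that fill in gaps is forward chaining: rules derive new facts from existing ones until nothing more can be added. The sketch below is plain Python with invented facts and two hypothetical rules; it is not the NextAngles rule language, which the slides do not show.

```python
def infer(facts):
    """Apply two illustrative rules repeatedly until a fixed point is reached."""
    facts = set(facts)  # work on a copy so the caller's set is untouched
    while True:
        sanctioned = {s for s, p, o in facts
                      if (p, o) == ("hasStatus", "sanctioned")}
        new = set()
        for s, p, o in facts:
            # Rule 1: a party controlled by a sanctioned party is sanctioned.
            if p == "controlledBy" and o in sanctioned:
                new.add((s, "hasStatus", "sanctioned"))
            # Rule 2: an account owned by a sanctioned party is flagged.
            if p == "ownedBy" and o in sanctioned:
                new.add((s, "flaggedFor", "review"))
        if new <= facts:
            return facts
        facts |= new


facts = {
    ("party:parent", "hasStatus", "sanctioned"),
    ("party:shell", "controlledBy", "party:parent"),
    ("acct:9", "ownedBy", "party:shell"),
}
# Print only the facts the rules added.
for fact in sorted(infer(facts) - facts):
    print(fact)
```

Nothing in the input says that `acct:9` needs review; the flag appears only because the first rule's conclusion feeds the second, which is the sense in which rules here fill in gaps instead of being wired to specific records.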

ENABLER #3: WORKSPACES
A rethink of enterprise applications for knowledge workers
• A complete rethink of user interfaces around smart data & knowledge models
• Semantic + knowledge base driven “Noun-verb” paradigm
• “Workspaces” – context where users work through an enquiry
• 6 widgets: Faceted Search, View, List, Visualize, History, Forms / Wizards (Workspace diagram)
• Solves the “I need Excel” problem
• Solves the swivel chair problem
• Solves the vocabulary problem

ENABLER #4: LEARNING
Continuous improvement of efficiency and effectiveness through learning
• Learns from user behavior to help pre-populate workspaces
• Learns how users use tools to perform tasks
• Tries to proactively bring up the tools when it sees a similar situation
• Interim work products can be turned into future automation
• User behavior in a user interface is tracked in detail, and encoded into smart data
• Learning algorithms eliminate dead ends & build an optimum path to the answers
• Effort for manual tasks reduces over time
• Almost like “custom screens” for 1000’s of subtle variations
• NextAngles learns from users’ behavior
• Supervisors can short-circuit learning engine to “pre-configure” workspaces

Anti Money Laundering: Solutions to a Real Problem
Challenges
● Backlog of investigations due to large number of alerts
● Constantly changing AML rules and regulations
● Consolidation of data from various systems within and outside the bank
● Balancing the load with limited resources

Urika-GD: Purpose-built for data discovery
1,944 Times Faster!
“In the amount of time it takes to validate one hypothesis, we can now validate 1000 hypotheses – increasing our success rate significantly.” – Dr. Ilya Shmulevich
● Do not know the relationships in the data
● Do not know the desired insight or the right question to ask
● Do not know the paths/linkages to explore diverse data sets
● Access all data with uniform, low latency regardless of partitioning, layout or access pattern
● Investigate multiple, changing hypotheses in parallel without prefetching/caching
● Explore diverse data fused without upfront modeling and independent of linkage/traversal path
Components: Shared Memory Model; Memory Accelerator; In Memory, Graph Analytical Database
Approach                                               # Processors   Time
Traditional approaches after months of optimization    48             10.8 hours
Cray                                                   32             30 sec
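
The slide does not say how the 1,944 figure was derived. One reading that reproduces it exactly from the processor counts and times given is a ratio of processor-time (processors multiplied by elapsed time), not of elapsed time alone. The short check below works through that arithmetic; the interpretation is an assumption, and only the four input numbers come from the slide.

```python
# Figures from the slide.
traditional_processors, traditional_seconds = 48, 10.8 * 3600
cray_processors, cray_seconds = 32, 30

# Plain wall-clock ratio.
wall_clock_speedup = traditional_seconds / cray_seconds

# Ratio of processor-seconds consumed (assumed basis for the headline figure).
processor_time_ratio = (traditional_processors * traditional_seconds) / (
    cray_processors * cray_seconds
)

print(round(wall_clock_speedup), round(processor_time_ratio))
```

On this reading the wall-clock speedup is about 1,296 times, and normalizing for the smaller processor count gives 1,944 times.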

Machine Learning
Machine Learning is Different:
● All the data vs. a sample, messy data is OK
● Correlation vs. causation
● Algorithms fine tune themselves

Machine Learning use cases in FS
● Anomaly detection for compliance
● Rogue traders, Fat Fingers
● E.g. normal accuracy with decision trees: 70-75%
● Deep neural nets >90%, which can halve fraud costs
● Fraud, money laundering
● Trading Strategies
● Risk and reward prediction
● Structured and unstructured data sources
● Personnel and Customer Management
● Recruiting/Turnover prevention
● CRM for trading platforms

Supervised Machine Learning
● First label data: human judgments on historic data – e.g. fraud or not fraud
● Statistical analysis of training data
● Model finds correlations between input data and human applied labels
• 1000s of features: events, state, temporal, graph
• Millions of fraud patterns
• Copes with noisy data
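
As a toy illustration of the label, train, predict loop on this slide, the sketch below fits a one-feature decision stump in plain Python. The data, the single `amount` feature and the helper names are assumptions for the example; production fraud models use thousands of features and far richer learners, as the slide notes.

```python
def train_stump(samples):
    """Pick the amount threshold that best separates labelled fraud.

    samples: list of (amount, is_fraud) pairs labelled by a human reviewer.
    Returns (threshold, training accuracy); fraud is predicted when
    amount > threshold.
    """
    best = (None, -1.0)
    for threshold in sorted({amount for amount, _ in samples}):
        correct = sum((amount > threshold) == is_fraud
                      for amount, is_fraud in samples)
        accuracy = correct / len(samples)
        if accuracy > best[1]:
            best = (threshold, accuracy)
    return best


# Historic transactions with human-applied labels (made-up, slightly noisy).
labelled = [
    (20, False), (35, False), (50, False), (80, False), (900, False),
    (400, True), (650, True), (700, True), (1200, True), (45, True),
]
threshold, accuracy = train_stump(labelled)


def predict(amount):
    """Apply the learned threshold to an unseen transaction."""
    return amount > threshold


print(threshold, accuracy, predict(1000), predict(30))
```

Two of the ten labels contradict the simple "large amounts are fraud" pattern, so the stump cannot reach 100% on the training data; it still finds a usable threshold, which is a small-scale version of coping with noisy data.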

Deep Learning as the emerging Supervised Learning ML
● NVIDIA the thought (technology) leader in Deep Learning
● GPU technology well-suited
● Adopters like Google, Facebook, Microsoft
Especially successful for
- Pattern recognition
- Feature extraction
in speech, pictures, time-series

Technology Needs of Machine Learning
● Highly parallel any to any
● Dense compute, large memory, fast interconnect
● Deep learning: Dense GPUs depending on toolset
Cray XC for large single image memory scaling
Cray CS-Storm for dense GPUs for Deep Learning
A greater engineering challenge than you might think
Cray makes the world’s densest, most scalable and RELIABLE GPU systems

In Summary: New Analytics Technology Needs
Characteristic             Older Hadoop   Traditional HPC    Advanced Analytics
Interconnect               Slow           Fast/Intelligent   Fast/Intelligent
Single memory capability   No             Yes                Yes
High Bandwidth I/O         No             Yes                Yes
Node Local Storage         Yes            No                 Hybrid
Compute density            Low            High               High
GPUs                       No             Yes                Yes

Summary
● Game changing analytics technologies are arriving
● They have high ROI use cases in FS
● Their technology demands do not align with traditional Hadoop clusters
● Their technology needs are closer to HPC
● Cray has great heritage, experience and technology
● Cray is designing new age analytic products
