NetGuardians runs its Big Data Analytics Platform on top of three key Big Data components: ElasticSearch, Apache Mesos and Apache Spark. This is a presentation of the behaviour of this software stack.
3. About NetGuardians
• Top Fintech Europe Company
• Behavioural analysis based on risk models combining human actions relative to channels, technical layers and transactions
• Stay on top of new regulatory needs and anti-fraud patterns using profiling and analytics
• Our intelligence updates automatically deliver new controls
[Diagram: e-banking channels, IT layers and transactions]
4. The Problem
• Fraud costs the world $3 trillion per year, and 70% of it is internal (Certified Fraud Examiners, Report to the Nations, 2014)
• Projected cyber crime cost by 2021: $6 trillion (Cybersecurity Ventures, 2016)
• It takes 18 months on average to detect fraud; most remains undetected (Certified Fraud Examiners, Report to the Nations, 2014)
• $2.5 billion: the fine one single bank was slapped with due to inadequate internal controls and a slow documentation process (Bloomberg, April 2015)
5. All the caps you need
One single platform
Unique solution made for banks
10. Mesos is a distributed systems kernel. It runs on every machine and provides applications (…) with APIs for resource management and scheduling across entire datacenter and cloud environments.
Apache Spark is a fast and general engine for large-scale data processing. It provides programmers with an API centred on a working set for distributed programs, offering a versatile form of distributed shared memory.
ElasticSearch is a distributed, real-time, RESTful search and analytics document-oriented storage engine. It lets one perform and combine many types of searches - structured, unstructured, geo, metric - in real time.
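As a sketch of what "combining many types of searches" means, a hypothetical ES request mixing a full-text match (unstructured), a range filter (structured/metric) and a metric aggregation might look like this (index and field names are illustrative):

```
GET /transactions/_search
{
  "query": {
    "bool": {
      "must":   { "match": { "description": "wire transfer" } },
      "filter": { "range": { "amount": { "gte": 10000 } } }
    }
  },
  "aggs": {
    "avg_amount": { "avg": { "field": "amount" } }
  }
}
```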
Apache Mesos (v1.0 = July 2016, v1.3 = July 2017)
Apache Spark (v1.0 = May 2014, v2.2 = July 2017)
ElasticSearch (v1.0 = February 2014, v6.0 beta = July 2017)
11. ES-Hadoop: connects the massive data storage and deep processing power of Hadoop with the real-time search and analytics of Elasticsearch.
Interestingly, Spark can perfectly well use ES-Hadoop to load data from or store data to ElasticSearch outside of a Hadoop stack: the Spark connector from the ES-Hadoop library has no dependency on a Hadoop stack whatsoever.
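A minimal sketch of this Hadoop-free setup: submitting a PySpark job against ElasticSearch where the only extra piece is the connector jar. The host, port, jar version and script name below are placeholders, not the actual deployment.

```shell
# Sketch: PySpark reading/writing ElasticSearch with no Hadoop installed.
# Only the elasticsearch-spark connector jar is added to the classpath.
spark-submit \
  --packages org.elasticsearch:elasticsearch-spark-20_2.11:6.0.0 \
  --conf spark.es.nodes=es-host \
  --conf spark.es.port=9200 \
  my_es_job.py
```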
17. Analytics approach
Pattern Based Intelligence
• Fundamentally rule based
• Implemented as pyspark scripts
• Custom approach (no framework)
Profiling
• Statistical model
• Natively implemented using both ES and Spark statistics functions
• Custom approach (no framework)
Machine Learning
• Advanced algorithms
• Prototyped using Python scikit-learn
• Industrialized using Spark MLlib
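The profiling idea can be sketched in plain Python. This is a toy z-score model with illustrative numbers, not the platform's actual statistical model, which is implemented natively with ES and Spark statistics functions:

```python
from statistics import mean, stdev

def profile_score(history, amount):
    """Toy z-score profile: how unusual is `amount` compared to a
    customer's past transaction amounts? Illustrative only."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return 0.0
    return abs(amount - mu) / sigma

# Hypothetical history of one customer's transaction amounts
history = [120.0, 95.0, 130.0, 110.0, 105.0]

assert profile_score(history, 110.0) < 1.0    # usual amount, low score
assert profile_score(history, 2500.0) > 3.0   # outlier, high score
```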
20. ES-Hadoop and Data Locality
Data-locality enforcement works well:
• ES-Hadoop makes Spark understand the topology of the shards on ES
• Mesos / Spark respects locality requirements and creates as many partitions as shards
But this works only under nominal conditions. Several factors compromise data locality:
• Spark waits only spark.locality.wait=10s trying to get the processing executed on the Spark node co-located with an ES shard
• If ES on the co-located node is busy, ES can decide to answer from another node
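The nominal-case mechanics above can be sketched in a few lines of plain Python. The shard topology and host names are hypothetical; this only illustrates the "one partition per shard, preferred host = shard's node" mapping that the connector arranges:

```python
def partitions_for(shards):
    """Sketch of what ES-Hadoop arranges: one Spark partition per ES
    shard, with the shard's node as the partition's preferred location.
    Spark then waits up to spark.locality.wait for a slot on that node
    before giving up on locality."""
    return [{"partition": i, "preferred_host": s["host"]}
            for i, s in enumerate(shards)]

# Hypothetical 4-shard index spread over two ES nodes
shards = [{"shard": 0, "host": "node-a"},
          {"shard": 1, "host": "node-a"},
          {"shard": 2, "host": "node-b"},
          {"shard": 3, "host": "node-b"}]

parts = partitions_for(shards)
assert len(parts) == len(shards)  # as many partitions as shards
```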
21. Mesos / Spark Scheduling Mode
In coarse-grained scheduling mode, Mesos only knows about Spark executor processes.
• Mesos books as much of the cluster's resources as possible to allocate Spark executors for a job.
Historically, Spark on Mesos could use fine-grained scheduling mode, where Mesos schedules each and every individual Spark task.
• Kills performance!
• Deprecated: https://issues.apache.org/jira/browse/SPARK-11857
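In configuration terms this means leaving Mesos in coarse-grained mode; a spark-defaults.conf fragment might read:

```
# Coarse-grained is the default scheduling mode on recent Spark;
# fine-grained (spark.mesos.coarse=false) is deprecated per SPARK-11857.
spark.mesos.coarse  true
```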
22. Spark Static Resource Allocation vs. Dynamic Allocation (1/2)
Static Resource Allocation
• Mesos / Spark decides the allocated resources at job init time
• Allocated resources are kept until the job completes
• Two noteworthy consequences:
1. By default, a job running alone gets the whole cluster; a following job would need to wait.
2. Several jobs arriving together get the cluster fairly shared; if only one of them is long-lived, that job still has to complete its execution on its small portion.
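Both consequences can be made concrete with a toy model of the split. This is deliberately simplistic (equal integer shares, hypothetical job names and core count), not the actual Mesos allocation algorithm:

```python
def static_allocation(total_cores, jobs):
    """Toy model of static resource allocation: the cluster is split at
    job-init time and each job keeps its share until it completes."""
    return {job: total_cores // len(jobs) for job in jobs}

# A job arriving alone gets the whole cluster; the next job must wait.
assert static_allocation(32, ["job-A"]) == {"job-A": 32}

# Jobs arriving together share fairly -- and a long-lived job is stuck
# with its small portion even after the others finish.
assert static_allocation(32, ["job-A", "job-B"]) == {"job-A": 16, "job-B": 16}
```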
23. Spark Static Resource Allocation vs. Dynamic Allocation (2/2)
Dynamic Allocation
• Designed as a solution to the previous problems, and works out of the box
• But … Spark's Dynamic Allocation completely defeats the data-locality optimization:
• ES-Hadoop makes Spark request as many executors as shards, indicating the nodes owning the ES shards as preferred locations
• Dynamic Allocation bypasses this completely, ruining the data-locality optimization
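For reference, enabling it is a two-line spark-defaults.conf fragment; the external shuffle service is required so that executors can be released safely:

```
# Sketch: dynamic allocation on Mesos (trades away ES data locality)
spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled    true
```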
24. Other concerns
• Python latency
• Java and Scala jobs run natively in the Spark JVM.
• Pyspark launches “some tasks” in a separate process from the Spark JVM.
• DataFrame and RDD methods exposed to Python scripts are actually implemented in native Scala underneath.
• One noticeable exception: UDFs (User Defined Functions) implemented in Python!
• One can very well still use pyspark but write the UDFs in Scala.
• Repartitioning
• A redistribution of a dataset across the cluster is hard to achieve … and not necessarily desirable.
• Advanced ES queries
• The ES-Hadoop connector can only submit “simple” requests to ES, with filtering (now)
• Advanced features such as aggregation queries cannot be used
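A sketch of what "simple requests with filtering" looks like: a filter can be pushed down through the connector's read options (es.resource and es.query are the connector's setting names; the index and query body here are illustrative), while anything like an aggregation has to be sent to the ES REST API directly.

```
# ES-Hadoop read options: push a range filter down to ES
es.resource  transactions/doc
es.query     { "query": { "range": { "amount": { "gte": 10000 } } } }
```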
26. Why cool? (1/5)
Spark's API is brilliant for our use cases (NetGuardians)
Pattern Based Intelligence
• Implementing our rules in pyspark is straightforward
• We are now considering DRESS on Spark Streaming
Profiling
• Out of the box with Spark's statistics functions
• Here as well, we are considering Spark Streaming for event scoring
Machine Learning
• We prototype with Python scikit-learn
• Implementation on Spark is easy with Spark MLlib
27-30. Why cool? (2/5)
What do we want?
• Initial situation
• Working with a small subset of the data
• Working with a full month of data
• Working with the whole dataset
31. Why cool? (3/5)
Processing distribution scaling linearly with data distribution
Works out of the box with:
• Dynamic Allocation in Spark + Mesos
• ES-Hadoop / ES-Spark connector data-locality optimization
32. Why cool? (4/5)
Processing distribution scaling linearly with data distribution
ES / Spark / Mesos provide the basic building blocks to distribute and scale the processing exactly how we want:
• ES-Hadoop: data-locality optimization
• Mesos / Spark: the spark.cores.max=X configuration
• ElasticSearch: the search_shards API
Golden Rule: use spark.cores.max = number of shards
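The golden rule can be wired up directly from the search_shards API. The response below is an abbreviated, hypothetical shape of what GET /&lt;index&gt;/_search_shards returns (one entry per shard group, primary plus replicas); only the counting logic matters:

```python
# Abbreviated, illustrative shape of a _search_shards response
search_shards_response = {
    "shards": [
        [{"shard": 0, "primary": True}, {"shard": 0, "primary": False}],
        [{"shard": 1, "primary": True}, {"shard": 1, "primary": False}],
        [{"shard": 2, "primary": True}, {"shard": 2, "primary": False}],
    ]
}

def cores_max(response):
    """Golden rule from the slide: spark.cores.max = number of shards."""
    return len(response["shards"])

conf = {"spark.cores.max": str(cores_max(search_shards_response))}
assert conf["spark.cores.max"] == "3"
```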
33. Why cool? (5/5)
“One ring to rule them all ...”
• ES, Spark and Mesos are designed to run on large clusters
• But they also work very well on one single fat machine with tons of CPUs and RAM
• We deploy the same platform in tier 1 banks and in small banks.
Disclaimer
I am not going to give an introduction to Big Data; I assume the audience is familiar with it (moving the processing to the data nodes, distribution in the form of partitioning and replication, etc.).
I will focus on the specifics of the technology stack at NetGuardians.
Our intelligent software platform will give you a greater capability to detect the emerging insider fraud & risk threats, delivering ROI straight away.