Introduction to NetGuardians’ Big Data Software Stack
Jerome Kehrli, Head of R&D
Geneva, September 2017


NetGuardians runs its Big Data Analytics Platform on three key Big Data components: ElasticSearch, Apache Mesos and Apache Spark. This presentation describes the behaviour of this software stack.

Published in: Technology
  • slide 26: do you use any algorithm to find the rules or are these expert-based?

Introduction to NetGuardians' Big Data Software Stack

  1. Introduction to NetGuardians’ Big Data Software Stack
     Jerome Kehrli, Head of R&D
     Geneva, September 2017
  2. Agenda
     • Introducing NetGuardians
     • Software Stack
     • Typical Architecture
     • NetGuardians’ Use Cases
     • ElasticSearch / Spark / Mesos Constraints and Behaviour
  3. About NetGuardians
     • Top Fintech Europe Company
     • Behavioural analysis based on risk models combining human actions relative to channels, technical layers and transactions.
     • Stay on top of new regulatory needs and anti-fraud patterns using profiling and analytics
     • Our intelligence updates automatically deliver new controls
     [Figure labels: E-BANKING, IT layers, Transactions, Channels]
  4. The Problem
     • 70% is internal
     • $3 trillion: fraud costs the world $3 trillion per year (Certified Fraud Examiners, Report to the Nations, 2014)
     • $6 trillion: projected cyber crime cost by 2021 (Cyber Security Ventures, 2016)
     • It takes 18 months on average to detect fraud. Most remains undetected. (Certified Fraud Examiners, Report to the Nations, 2014)
     • $2.5 billion: the fine one single bank was slapped with due to inadequate internal controls and slow documentation process (Bloomberg, April 2015)
  5. All the caps you need. One single platform. Unique solution made for banks.
  6. All the caps you need. One single platform. Unique solution made for banks.
  7. References: Retail banking, Private banking
  8. Scalable Big Data Technology
  9. Analytics Platform Software Stack
  10. Mesos is a distributed systems kernel. It runs on every machine and provides applications (…) with APIs for resource management and scheduling across entire datacenter and cloud environments.
      Apache Spark is a fast and general engine for large-scale data processing. It provides programmers with an API functioning as a working set for distributed programs that offers a versatile form of distributed shared memory.
      ElasticSearch is a distributed, real-time, RESTful search and analytics document-oriented storage engine. It lets one perform and combine many types of searches (structured, unstructured, geo, metric) in real time.
      Releases:
      • Apache Mesos: V1.0 = July 2016, V1.3 = July 2017
      • Apache Spark: V1.0 = May 2014, V2.2 = July 2017
      • ElasticSearch: V1.0 = February 2014, V6.0b = July 2017
  11. ES-Hadoop: connect the massive data storage and deep processing power of Hadoop with the real-time search and analytics of Elasticsearch.
      Interestingly, Spark can use ES-Hadoop to load data from, or store data to, ElasticSearch without any Hadoop stack. The Spark connector from the ES-Hadoop library has no dependency on a Hadoop stack whatsoever.
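As an illustration of slide 11, here is a minimal sketch of how a pyspark job might be pointed at ElasticSearch through the ES-Hadoop Spark connector. The option keys (`es.nodes`, `es.port`, `es.resource`, `es.query`) are ES-Hadoop settings; the helper name `es_read_options`, the host names, the index name and the query are assumptions made up for this example and do not come from the presentation.

```python
def es_read_options(nodes, port, resource, query=None):
    """Build the option map for an ES-Hadoop read (hypothetical helper)."""
    opts = {
        "es.nodes": nodes,        # comma-separated ES hosts
        "es.port": str(port),     # REST port of the ES nodes
        "es.resource": resource,  # index (or index/type) to read
    }
    if query is not None:
        opts["es.query"] = query  # simple filter pushed down to ES
    return opts


# Hypothetical hosts, index and query, for illustration only.
opts = es_read_options("es-node-1,es-node-2", 9200, "transactions/doc",
                       query="?q=channel:ebanking")

# With pyspark and the elasticsearch-spark connector jar available
# (not executed here), the options would be used roughly like this:
#   df = (spark.read.format("org.elasticsearch.spark.sql")
#              .options(**opts)
#              .load())
```

No Hadoop component appears anywhere in this configuration, which is the point the slide makes.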
  12. ELK-MS Architecture
  13. ELK-MS - Technical Architecture
  14. ELK-MS - System Architecture
  15. ELK-MS - Typical Application Architecture
  16. NetGuardians Use Cases
  17. Analytics approach
      Pattern Based Intelligence
      • Fundamentally rule based
      • Implemented as pyspark scripts
      • Custom approach (no framework)
      Profiling
      • Statistical Model
      • Natively implemented using both ES and Spark statistics functions
      • Custom approach (no framework)
      Machine Learning
      • Advanced algorithms
      • Prototyped using Python scikit-learn
      • Industrialized using Spark MLlib
  18. Typical Data Flow
      Data-locality optimization is not optional for us!
  19. ES / Spark / Mesos Constraints and behaviour
  20. ES-Hadoop and Data Locality
      Data-locality enforcement works well.
      • ES-Hadoop makes Spark understand the topology of the shards on ES
      • Mesos / Spark respects locality requirements and creates as many partitions as shards.
      It works only under nominal conditions. Several factors compromise data-locality:
      • Spark waits only for spark.locality.wait=10s when trying to get the processing executed on the Spark node co-located with an ES shard
      • If ES on the co-located node is busy, ES can decide to answer from another node
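The locality-wait behaviour described on slide 20 can be pictured with a small toy model. This is only a sketch under simplifying assumptions (a single task, one slot per node, made-up node names and a hypothetical `place_task` helper); it is not Spark's actual scheduler code.

```python
def place_task(preferred, free_at, locality_wait=10.0):
    """Choose a node for one task, mimicking a locality-wait fallback.

    preferred     -- node co-located with the ES shard (hypothetical name)
    free_at       -- node name -> seconds until that node has a free slot
    locality_wait -- seconds the scheduler insists on the preferred node
    Returns (chosen node, start time in seconds).
    """
    if free_at[preferred] <= locality_wait:
        # The co-located node frees up in time: data locality is kept.
        return preferred, free_at[preferred]
    # The wait expired: any node is acceptable from locality_wait onwards.
    start_times = {node: max(t, locality_wait) for node, t in free_at.items()}
    # Earliest start wins; on a tie the preferred node is kept.
    node = min(start_times, key=lambda n: (start_times[n], n != preferred))
    return node, start_times[node]


local = place_task("node-a", {"node-a": 4.0, "node-b": 0.0})
remote = place_task("node-a", {"node-a": 30.0, "node-b": 0.0})
```

In the second call the co-located node stays busy longer than the wait, so the task runs on `node-b` and has to read its shard over the network.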
  21. Mesos / Spark Scheduling Mode
      In Coarse Grained scheduling mode, Mesos only knows Spark executor processes.
      • Mesos books as many cluster resources as possible to allocate Spark executors for a job.
      Historically, Spark on Mesos could use Fine Grained scheduling mode, where Mesos schedules each and every individual Spark task.
      • Kills performance!
      • Deprecated: https://issues.apache.org/jira/browse/SPARK-11857
  22. Spark Static Resource Allocation vs. Dynamic Allocation (1/2)
      Static Resource Allocation
      • Mesos / Spark decides allocated resources at job init time
      • Allocated resources are kept until the job completes
      • 2 noteworthy consequences:
        1. By default, a single job running alone gets the whole cluster. A following job has to wait.
        2. Several jobs arriving together each get a fair share of the cluster. If only one of them is long-lived, it still has to complete its execution on its small portion.
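The two consequences of static allocation on slide 22 can be checked with a little arithmetic. The sketch below assumes perfectly parallel jobs measured in core-seconds and a made-up 12-core cluster; the function names and workloads are hypothetical and serve only to illustrate the reasoning.

```python
def queued_finish_times(total_cores, work):
    """Consequence 1: each job takes the whole cluster, later jobs wait.

    work -- job name -> core-seconds, in arrival order.
    """
    clock, finish = 0.0, {}
    for job, core_seconds in work.items():
        clock += core_seconds / total_cores
        finish[job] = clock
    return finish


def shared_finish_times(total_cores, work):
    """Consequence 2: jobs arriving together split the cores evenly,
    and every job keeps its fixed share until it completes."""
    share = total_cores // len(work)
    return {job: core_seconds / share for job, core_seconds in work.items()}


queued = queued_finish_times(12, {"long": 2400, "short": 120})
shared = shared_finish_times(12, {"a": 120, "b": 120, "long": 2400})
```

In the shared case the two short jobs finish after 30 s, yet the long job keeps running on its 4 cores for 600 s while the other 8 cores sit idle.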
  23. Spark Static Resource Allocation vs. Dynamic Allocation (2/2)
      Dynamic Allocation
      • Designed as a solution to the previous problems
      • Works out of the box
      • But … Spark’s Dynamic Allocation messes up data locality optimization completely.
      • ES-Hadoop makes Spark request as many executors as shards and indicates the nodes owning the ES shards as preferred locations.
      • Dynamic allocation bypasses this completely and defeats data-locality optimization
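For reference, dynamic allocation is switched on through a handful of Spark properties. The sketch below assembles them into `spark-submit` arguments; the property names are standard Spark settings, while the helper names and the values (8 executors, 60 s idle timeout) are assumptions chosen for the example rather than settings taken from the presentation.

```python
def dynamic_allocation_conf(max_executors, idle_timeout="60s"):
    """Spark properties that enable dynamic allocation (example values)."""
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Dynamic allocation relies on the external shuffle service.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        "spark.dynamicAllocation.executorIdleTimeout": idle_timeout,
    }


def to_submit_args(conf):
    """Render a property map as a list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(dynamic_allocation_conf(8))
```

As the slide warns, enabling these settings trades away the data-locality hints provided by ES-Hadoop, so they are worth enabling only when that trade-off is acceptable.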
  24. Other concerns
      • Python latency
        • Java and Scala jobs run natively in the Spark JVM.
        • Pyspark launches “some tasks” in a process separate from the Spark JVM.
        • DataFrame or RDD methods exposed to Python scripts are actually implemented in native Scala underneath.
        • One noticeable exception: UDFs (User Defined Functions) implemented in Python!
        • One can very well still use pyspark but write UDFs in Scala.
      • Repartitioning
        • Redistributing a dataset across the cluster is hard to achieve … and not necessarily desirable.
      • Advanced ES queries
        • The ES-Hadoop connector can only submit “simple” requests to ES, with filtering (now)
        • Advanced features such as aggregation queries cannot be used
  25. ES / Spark / Mesos: Why is it cool ?
  26. Why cool ? (1/5)
      Spark’s API is brilliant for our use cases (NetGuardians)
      Pattern Based Intelligence
      • Implementing our rules in pyspark is straightforward
      • We are now considering DRESS on Spark Streaming
      Profiling
      • Out of the box with Spark’s statistics functions
      • Here as well we consider Spark Streaming for event scoring
      Machine Learning
      • We prototype with Python scikit-learn
      • Implementation on Spark is easy with Spark MLlib
  27. Why cool ? (2/5)
      What do we want ? Initial situation
  28. Why cool ? (2/5)
      What do we want ? Working with a small subset of the data
  29. Why cool ? (2/5)
      What do we want ? Working with a full month of data
  30. Why cool ? (2/5)
      What do we want ? Working with the whole dataset
  31. Why cool ? (3/5)
      Processing Distribution scaling linearly with Data Distribution
      Works out of the box with
      • Dynamic Allocation in Spark + Mesos
      • ES-Hadoop / ES-Spark connector data locality optimization
  32. Why cool ? (4/5)
      Processing Distribution scaling linearly with Data Distribution
      ES / Spark / Mesos provide the basic building blocks to distribute and scale the processing exactly how we want
      • ES-Hadoop : Data locality optimization
      • Mesos / Spark : spark.cores.max=X configuration
      • ElasticSearch : search_shards API
      Golden Rule : use spark.cores.max = Nbr Shards
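The golden rule on slide 32 can be sketched as follows: count the shards reported by the ElasticSearch `_search_shards` API and use that number for `spark.cores.max`. The sample response below is hand-written and heavily trimmed, the index and node names are invented, and the helper names are hypothetical; the sketch also assumes one core per Spark task (the Spark default).

```python
def count_shards(search_shards_response):
    """Count shards in a _search_shards response.

    The "shards" field is a list of shard groups; each group lists the
    copies (primary and replicas) of one shard.
    """
    return len(search_shards_response["shards"])


def golden_rule_conf(search_shards_response):
    """Golden rule: spark.cores.max = number of shards."""
    return {"spark.cores.max": str(count_shards(search_shards_response))}


# Hand-written, trimmed sample of what GET /transactions/_search_shards
# might return for a hypothetical 3-shard index.
sample_response = {
    "nodes": {},
    "shards": [
        [{"index": "transactions", "shard": 0, "primary": True, "node": "node-a"}],
        [{"index": "transactions", "shard": 1, "primary": True, "node": "node-b"}],
        [{"index": "transactions", "shard": 2, "primary": True, "node": "node-c"}],
    ],
}

conf = golden_rule_conf(sample_response)
```

With one core per shard, each Spark task can process exactly one shard, so processing scales with the way the data is distributed.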
  33. Why cool ? (5/5)
      “One ring to rule them all ...”
      • ES, Spark and Mesos are designed to run on large clusters
      • But they also work very well on one single fat machine with tons of CPUs and RAM
      • We deploy the same platform in tier 1 banks and small banks.
  34. THANK YOU!
      NetGuardians SA Headquarters
      Rue Galilée 6
      1400 Yverdon-les-Bains
      Switzerland
      Tel: +41 24 425 97 60
      Email: info@netguardians.ch
      www.netguardians.ch
      Linkedin.com/company/netguardians
      Facebook.com/NetGuardians
      @netguardians
