Copyright © 2014 BAE Systems. All Rights Reserved.
BAE Systems is a trade mark of BAE Systems plc 1
Applied Intelligence
Company Confidential
APACHE SPARK DATA FRAMES
FOR FRAUD DETECTION
Eddie Baggott
15th October 2015
INTRODUCTION
• Functional and Data Architect
• BAE Systems, Norkom
• Anti-fraud, AML, compliance, watch lists, cyber
• Big data?
• Beginner in Spark… no Java or Scala experience
• All my own opinion
OVERVIEW
DataFrames
WHAT ARE DATAFRAMES
• Distributed collection of rows organized into named columns
• An abstraction for selecting, filtering, aggregating and plotting structured data (as in R and pandas)
• Successor to SchemaRDD
• Named columns
• Spark SQL
• RDD-style methods such as map and flatMap are available
• Full Spark stack (MLlib, GraphX) available
• Can switch between RDD and DataFrame with .rdd and .toDF
• Many data sources and formats. Easy to read from:
• JSON, Parquet, Hive, CSV, HDFS, S3, JDBC, Cassandra
• Evaluated lazily
• Optimized by the query optimizer
USING DATAFRAMES
• Selecting, Filtering, Aggregating structured data
• Write less code
• Read less data
• Let the optimizer do the hard work
• Old style
val wordCount = data.flatMap(_.split(" ")).map(_ -> 1).reduceByKey(_ + _)
• SQL
sqlContext.sql("select count(*), AccountId FROM TRANSACTIONS GROUP BY AccountId")
• Programmatically
df_txn.groupBy("AccountId").count()
CREATE A DATAFRAME
// spark-shell already imports the implicits needed for .toDF;
// elsewhere add: import sqlContext.implicits._
case class Account_View(RunDate: String, txnBaseAmount: Double, TransactionType: String,
                        AccountId: String, CreditDebitCode: String)

// parse the pipe-delimited file into a DataFrame
val df_txn = sc.textFile("TRANSACTIONS.txt")
  .map(_.split('|'))
  .map(p => Account_View(p(0), p(4).trim.toDouble, p(7), p(8), p(5)))
  .toDF

// total amount per account for one transaction type
val df_cash = df_txn.filter("TransactionType = 'TT0090'").groupBy("AccountId").sum()

df_cash.printSchema
 |-- AccountId: string (nullable = true)
 |-- SUM(txnBaseAmount): double (nullable = true)

df_cash.show
|     AccountId|SUM(txnBaseAmount)|
|account8849-25|              54.0|
|account16705-2|              65.0|
SQL EXAMPLE
// register the DataFrame as a temporary table so it can be queried with SQL
df_txn.registerTempTable("TRANSACTIONS")
// run query (triple quotes allow the SQL to span several lines)
val df_agg = sqlContext.sql("""
  select RunDate, AccountId, CreditDebitCode,
         COUNT(TransactionId) AS COUNT_TXN,
         SUM(txnBaseAmount) AS TOTAL_AMOUNT
  FROM TRANSACTIONS
  GROUP BY RunDate, AccountId, CreditDebitCode""")
PROGRAMMATICALLY
// needed for sum and max below
import org.apache.spark.sql.functions._
// read from json or parquet
val df_json = sqlContext.read.json("txn_json")
// schema discovery
df_json.printSchema
// aggregate
df_json.filter("CreditDebitCode = 'Credit'")
  .groupBy("AccountId").agg(sum("txnBaseAmount"), max("txnBaseAmount"))
// join different data sources (both sides need an alias for a.* and b.* to resolve)
df_json.as("a").join(df_txn.as("b"), $"a.TransactionId" === $"b.TransactionId")
// sql
df_json.registerTempTable("TXN_JSON")
sqlContext.sql("""select distinct a.TransactionType from TRANSACTIONS a
  JOIN TXN_JSON b ON a.TransactionId = b.TransactionId""")
NEW FEATURES
• Window functions (new in 1.4)
• Analytic functions
• Ranking functions
.over(…)
.partitionBy, .orderBy, frame (rows or range)
• New in 1.5
• 90 functions
• Math/date/condition/string/etc.
• https://issues.apache.org/jira/browse/SPARK-8159
• Spark Packages http://spark-packages.org/
• spark-csv from Databricks
• https://github.com/lift/lift/tree/master/framework/lift-base/lift-json/
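The window-function bullets (.over, .partitionBy, .orderBy, row frames) can be sketched as below. This is an illustration only, written against the later SparkSession API (Spark 2.1 or newer is assumed); on Spark 1.4 and 1.5 window functions needed a HiveContext. The object name, the sample rows and the output column names amountRank and runningTotal are made up for the example; the input column names follow the transaction DataFrame used earlier in the deck.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank, sum}

object WindowSketch {
  // Hypothetical sample data, shaped like the deck's transactions.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("acc1", "2015-10-01", 50.0),
      ("acc1", "2015-10-02", 20.0),
      ("acc2", "2015-10-01", 70.0)
    ).toDF("AccountId", "RunDate", "txnBaseAmount")
  }

  // Adds a per-account rank by amount (ranking function) and a
  // per-account running total in RunDate order (aggregate over a row frame).
  def withWindows(txns: DataFrame): DataFrame = {
    val byAmount = Window.partitionBy("AccountId").orderBy(col("txnBaseAmount").desc)
    val running = Window.partitionBy("AccountId").orderBy("RunDate")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    txns
      .withColumn("amountRank", rank().over(byAmount))
      .withColumn("runningTotal", sum("txnBaseAmount").over(running))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("window-sketch").getOrCreate()
    withWindows(sample(spark)).show()
    spark.stop()
  }
}
```

A running total per account is the kind of aggregate a fraud profile might use, but that reading is an assumption rather than something the slide states.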
APPLICATIONS
• Profiling
• Build aggregates
• Detection with CEP
• Charting
• Crossfilter and Other BI tools
• Data Quality
• Need to understand the data from a new Client
• Finding Networks
• Fraud and ML rings
• GraphX and PageRank
• ETL
• Analytics
• Recommendations
PROFILING
//Data labels
{ "Name": "Credit", "Field": "CreditDebitCode", "Value": "Credit", "Operator": "=" }
{ "Name": "Cash", "Field": "INSTRUMENT", "Value": "Cash", "Operator": "=" }
//Profile Definition
{ "id": 2, "ProfileName": "CashCredit", "Operator": "Accumulator",
  "Data_Labels": [ { "Name": "Credit" }, { "Name": "Cash" } ] }
//CEP rules engine
If CashCredit.sum > 10,000 Then Alert
//File layout
RunDate|x|x|AccountId|AmountBaseCurrency|CreditDebitCode|TransactionID|TransactionType|UnusedField|…
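How the data labels, the accumulator profile and the alert rule fit together can be sketched in plain Scala, without Spark or a CEP engine. Everything here is hypothetical: the types Txn and Label, the functions matches, profileSum and alerts, and the assumption that a profile counts a transaction only when all of its labels match. Only the "=" operator from the slide is modelled.

```scala
object ProfileSketch {
  // One parsed transaction: account, amount, and the raw fields labels refer to.
  final case class Txn(accountId: String, amount: Double, fields: Map[String, String])

  // A data label: a named predicate of the form "Field Operator Value".
  final case class Label(name: String, field: String, operator: String, value: String)

  def matches(label: Label, txn: Txn): Boolean = label.operator match {
    case "=" => txn.fields.get(label.field).contains(label.value)
    case _   => false // other operators are not modelled in this sketch
  }

  // "Accumulator" profile: per account, sum the amounts of transactions
  // that satisfy every label (assumed AND semantics).
  def profileSum(labels: Seq[Label], txns: Seq[Txn]): Map[String, Double] =
    txns
      .filter(t => labels.forall(matches(_, t)))
      .groupBy(_.accountId)
      .map { case (account, ts) => account -> ts.map(_.amount).sum }

  // Rule from the slide: alert when the accumulated sum exceeds 10,000.
  def alerts(sums: Map[String, Double], threshold: Double = 10000.0): Set[String] =
    sums.collect { case (account, total) if total > threshold => account }.toSet
}
```

With DataFrames, the same filter-then-sum step would be a filter followed by groupBy("AccountId").sum(), as in the CREATE A DATAFRAME slide.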
CHARTING
CROSSFILTER CHARTS (1)
• Three open source JavaScript libraries
• D3.js
• Crossfilter.js
• DC.js
• https://dc-js.github.io/dc.js/docs/stock.html
• Data in JSON format from previous example
• Reduce 30 million transactions down to 40,000 rows
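One way the reduction to chart-sized JSON might look is sketched below. It is an assumption, not the deck's actual job: the grouping columns and the lowercase output names (rundate, creditordebit, transactiontype, count_txn, total_amount) are inferred from the fields the chart code reads, and the SparkSession API (Spark 2.x or newer) is used instead of the deck's sqlContext.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit, sum}

object ChartDataSketch {
  // Hypothetical sample data, shaped like the deck's transactions.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("2015-10-01", "Credit", "TT0090", 50.0),
      ("2015-10-01", "Credit", "TT0090", 20.0),
      ("2015-10-01", "Debit", "TT0001", 70.0)
    ).toDF("RunDate", "CreditDebitCode", "TransactionType", "txnBaseAmount")
  }

  // Collapse transactions to one row per (date, credit/debit code, type),
  // renaming columns to the lowercase keys the chart code expects.
  def chartRows(txns: DataFrame): DataFrame =
    txns
      .groupBy(
        col("RunDate").as("rundate"),
        col("CreditDebitCode").as("creditordebit"),
        col("TransactionType").as("transactiontype"))
      .agg(
        count(lit(1)).as("count_txn"),
        sum("txnBaseAmount").as("total_amount"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("chart-data-sketch").getOrCreate()
    // One JSON document per aggregated row, ready to load in the browser.
    chartRows(sample(spark)).toJSON.collect().foreach(println)
    spark.stop()
  }
}
```

Aggregating in Spark first means the browser only has to hold the grouped rows, which is what makes client-side filtering practical.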
CROSSFILTER CHARTS (2)
• Define dimensions (x-axis)
• var dateDim = ndx.dimension(function (d) { return d.rundate; });
• var creditordebitDim = ndx.dimension(function (d) { return d.creditordebit; });
• var txnTypeDim = ndx.dimension(function (d) { return d.transactiontype; });
• Define metrics (groups)
• var amount = dateDim.group().reduceSum(function (d) { return d.total_amount / 10000; });
• var txnCreditOrDebitCount = creditordebitDim.group().reduceSum(function (d) { return d.count_txn; });
• var txnTypeCount = txnTypeDim.group().reduceSum(function (d) { return d.count_txn; });
CROSSFILTER CHARTS (3)
• Create Charts
txnType_chart
.dimension(txnTypeDim)
.group(txnTypeCount)
.cap(10)
.width(500)
.height(600)
.colors(d3.scale.category10())
.ordering(function(d) { return -d.value })
.title(function (d) { return d.value; } )
.elasticX(true)
.xAxis().ticks(4).tickFormat(d3.format(".2s"));
DQ REPORT
• Early stages of new data from a client
• Need to understand the data to create good scenarios
• Need to be flexible, fast and easy to use
• Frequent values
.stat.freqItems
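A minimal sketch of the frequent-values check follows, assuming the SparkSession API (Spark 2.x or newer) and made-up sample data. freqItems is approximate: it returns every value whose frequency exceeds the support threshold, but it can also return some values that do not, so the result is a list of candidates rather than exact counts. The object and helper names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FreqItemsSketch {
  // Hypothetical sample: one transaction type dominates.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq("TT0090", "TT0090", "TT0090", "TT0001").toDF("TransactionType")
  }

  // freqItems returns a single row; the column "<name>_freqItems" holds
  // an array of candidate frequent values for the requested column.
  def frequentValues(df: DataFrame, column: String, support: Double): Seq[String] =
    df.stat.freqItems(Seq(column), support).first().getSeq[String](0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("freq-items-sketch").getOrCreate()
    println(frequentValues(sample(spark), "TransactionType", 0.4))
    spark.stop()
  }
}
```

For exact counts on a column with few distinct values, groupBy(column).count() is the straightforward alternative.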
OTHER APPLICATIONS
• Finding Networks
• Fraud and ML rings
• GraphX and PageRank
• ETL
• Identity Resolution
• Analytics
• MLlib
• Recommendations
RECAP
• DataFrames make Spark easy to use
• Can reuse existing SQL code and skills
• Fast to write code with Scala REPL
• Lots of business applications for Spark
• New ways of thinking about working with data to get information
• We’re hiring!
• http://www.baesystems.com/en/cybersecurity
• Thanks to Brendan and his team here
• Encourage anyone to give a presentation
• Any Questions?

Dublin Ireland Spark Meetup October 15, 2015

  • 1. Copyright © 2014 BAE Systems. All Rights Reserved. BAE Systems is a trade mark of BAE Systems plc 1 Applied Intelligence Company Confidential APACHE SPARK DATA FRAMES FOR FRAUD DETECTION Eddie Baggott 15th October 2015
  • 2. Copyright © 2014 BAE Systems. All Rights Reserved. BAE Systems is a trade mark of BAE Systems plc 2 Applied Intelligence Company Confidential INTRODUCTION • Functional and Data Architect • BAE Systems , Norkom • Anti Fraud ,AML , Compliance , Watch lists, Cyber • Big data? • Beginner in spark…No java or Scala experience • All my own opinion
  • 3. OVERVIEW DataFrames
  • 4. WHAT ARE DATAFRAMES
      • Distributed collection of rows organized into named columns
      • An abstraction for selecting, filtering, aggregating and plotting structured data (R, Pandas)
      • Schema RDD
      • Named columns
      • Spark SQL
      • RDD-style methods such as map and flatMap are available
      • Full Spark stack (MLlib, GraphX) available
      • Can switch between RDD and DF with .rdd and .toDF
      • Many data sources and formats. Easy to read from:
      • JSON, Parquet, Hive, CSV, HDFS, S3, JDBC, Cassandra
      • Evaluated lazily
      • Optimized
  • 5. USING DATAFRAMES
      • Selecting, filtering, aggregating structured data
      • Write less code
      • Read less data
      • Let the optimizer do the hard work
      • Old style
          val wordCount = data.flatMap(_.split(" ")).map(_ -> 1).reduceByKey(_ + _)
      • SQL
          sqlContext.sql("select count(*), AccountId FROM TRANSACTIONS GROUP BY AccountId")
      • Programmatically
          Txn_df.groupBy("AccountId").count()
  • 6. CREATE A DATAFRAME
      import sqlContext.implicits._   // provides .toDF (already imported in spark-shell)
      case class Account_View(
        RunDate: String,
        txnBaseAmount: Double,
        TransactionType: String,
        AccountId: String,
        CreditDebitCode: String)
      // parse the pipe-delimited file into typed rows
      val df_txn = sc.textFile("TRANSACTIONS.txt")
        .map(_.split('|'))
        .map(p => Account_View(p(0), p(4).trim.toDouble, p(7), p(8), p(5)))
        .toDF
      // total amount per account for one transaction type
      val df_cash = df_txn.filter("TransactionType = 'TT0090'").groupBy("AccountId").sum()
      df_cash.printSchema
       |-- AccountId: string (nullable = true)
       |-- SUM(txnBaseAmount): double (nullable = true)
      df_cash.show
      |     AccountId|SUM(txnBaseAmount)|
      |account8849-25|              54.0|
      |account16705-2|              65.0|
  • 7. SQL EXAMPLE
      // register the DataFrame as a temporary table so SQL can query it
      df_txn.registerTempTable("TRANSACTIONS")
      // run query
      val df_agg = sqlContext.sql("""
        select RunDate, AccountId, CreditDebitCode,
               COUNT(TransactionId) AS COUNT_TXN,
               SUM(txnBaseAmount) AS TOTAL_AMOUNT
        FROM TRANSACTIONS
        GROUP BY RunDate, AccountId, CreditDebitCode""")
  • 8. PROGRAMMATICALLY
      import org.apache.spark.sql.functions._   // sum, max
      // read from json or parquet
      val df_json = sqlContext.read.json("txn_json")
      // schema discovery
      df_json.printSchema
      // aggregate
      df_json.filter("CreditDebitCode = 'Credit'")
        .groupBy("AccountId").agg(sum("txnBaseAmount"), max("txnBaseAmount"))
      // join different data sources (both sides need an alias for the a./b. column references)
      df_json.as('a).join(df_txn.as('b), $"a.TransactionId" === $"b.TransactionId")
      // sql
      df_json.registerTempTable("TXN_JSON")
      sqlContext.sql("select distinct a.TransactionType from TRANSACTIONS a JOIN TXN_JSON b ON a.TransactionId = b.TransactionId")
  • 9. NEW FEATURES
      • Window functions (new in 1.4)
      • Analytic functions
      • Ranking functions
      • .over(…) with .partitionBy, .orderBy and a frame (rows or range)
      • New in 1.5
      • 90 functions
      • Math/date/condition/string/etc.
      • https://issues.apache.org/jira/browse/SPARK-8159
      • Spark Packages http://spark-packages.org/
      • spark-csv from Databricks
      • https://github.com/lift/lift/tree/master/framework/lift-base/lift-json/
  • 10. APPLICATIONS
      • Profiling
      • Build aggregates
      • Detection with CEP
      • Charting
      • Crossfilter and other BI tools
      • Data Quality
      • Need to understand the data from a new client
      • Finding Networks
      • Fraud and ML rings
      • GraphX and PageRank
      • ETL
      • Analytics
      • Recommendations
  • 11. PROFILING
      // Data labels
      { "Name": "Credit", "Field": "CreditDebitCode", "Value": "Credit", "Operator": "=" }
      { "Name": "Cash", "Field": "INSTRUMENT", "Value": "Cash", "Operator": "=" }
      // Profile definition
      { "id": 2, "ProfileName": "CashCredit", "Operator": "Accumulator", "Data_Labels": [ { "Name": "Credit" }, { "Name": "Cash" } ] }
      // CEP rules engine
      If CashCredit.sum > 10,000 Then Alert
      // File layout
      RunDate|x|x|AccountId|AmountBaseCurrency|CreditDebitCode|TransactionID|TransactionType|UnusedField|…
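One way to read the profiling slide (slide 11): each data label is a predicate on a single field, a profile accumulates the amounts of the transactions that match its labels, and the rule raises an alert once the accumulated sum passes a threshold. The sketch below is plain JavaScript, not Spark and not a real CEP engine. The helper names, the sample transactions, the use of `txnBaseAmount` as the summed field, and the reading that a transaction must match every label of the profile are all assumptions made for illustration.

```javascript
// Data labels, as on the slide: each one is a single-field predicate.
const dataLabels = {
  Credit: { Field: "CreditDebitCode", Value: "Credit", Operator: "=" },
  Cash: { Field: "INSTRUMENT", Value: "Cash", Operator: "=" },
};

// Profile definition, as on the slide.
const profile = {
  id: 2,
  ProfileName: "CashCredit",
  Operator: "Accumulator",
  Data_Labels: [{ Name: "Credit" }, { Name: "Cash" }],
};

// Hypothetical helper: only the "=" operator shown on the slide is supported.
function matchesLabel(txn, label) {
  if (label.Operator !== "=") {
    throw new Error("Unsupported operator: " + label.Operator);
  }
  return txn[label.Field] === label.Value;
}

// Hypothetical helper: sum txnBaseAmount per account over the transactions
// that match every label of the profile (assumed AND semantics).
function accumulate(profileDef, txns) {
  const labels = profileDef.Data_Labels.map((ref) => dataLabels[ref.Name]);
  const sums = new Map();
  for (const txn of txns) {
    if (labels.every((label) => matchesLabel(txn, label))) {
      sums.set(txn.AccountId, (sums.get(txn.AccountId) || 0) + txn.txnBaseAmount);
    }
  }
  return sums;
}

// Rule from the slide: if CashCredit.sum > 10,000 then alert.
function alerts(sums, threshold) {
  return [...sums].filter(([, sum]) => sum > threshold).map(([accountId]) => accountId);
}

// Made-up sample transactions.
const txns = [
  { AccountId: "A1", CreditDebitCode: "Credit", INSTRUMENT: "Cash", txnBaseAmount: 6000 },
  { AccountId: "A1", CreditDebitCode: "Credit", INSTRUMENT: "Cash", txnBaseAmount: 5000 },
  { AccountId: "A1", CreditDebitCode: "Debit", INSTRUMENT: "Cash", txnBaseAmount: 9000 },
  { AccountId: "A2", CreditDebitCode: "Credit", INSTRUMENT: "Wire", txnBaseAmount: 20000 },
  { AccountId: "A3", CreditDebitCode: "Credit", INSTRUMENT: "Cash", txnBaseAmount: 10000 },
];

const cashCredit = accumulate(profile, txns);
const alerted = alerts(cashCredit, 10000);
console.log(alerted);
```

Keeping the labels and the profile as data means a new profile is a new JSON document rather than new code, which seems to be the point of the layout on the slide.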
  • 12. CHARTING
  • 13. CROSSFILTER CHARTS (1)
      • Three open source JavaScript libraries
      • D3.js
      • Crossfilter.js
      • DC.js
      • https://dc-js.github.io/dc.js/docs/stock.html
      • Data in JSON format from previous example
      • Reduce 30m transactions down to 40k rows
  • 14. CROSSFILTER CHARTS (2)
      • Define Dimension (x-axis)
          var dateDim = ndx.dimension(function (d) { return d.rundate; });
          var creditordebitDim = ndx.dimension(function (d) { return d.creditordebit; });
          var txnTypeDim = ndx.dimension(function (d) { return d.transactiontype; });
      • Define Metrics (groups)
          var amount = dateDim.group().reduceSum(function (d) { return d.total_amount / 10000; });
          var txnCreditOrDebitCount = creditordebitDim.group().reduceSum(function (d) { return d.count_txn; });
          var txnTypeCount = txnTypeDim.group().reduceSum(function (d) { return d.count_txn; });
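To make the dimension and group calls on slide 14 concrete, the plain-JavaScript sketch below mimics what `dimension(...).group().reduceSum(...)` computes: rows are bucketed by the dimension's key and one value is summed per bucket. It does not use Crossfilter itself, which also keeps these sums up to date as filters on other dimensions change. `groupReduceSum` and the sample rows are hypothetical; the `{key, value}` output sorted by key follows the shape Crossfilter's `group.all()` returns.

```javascript
// Hypothetical stand-in for ndx.dimension(keyFn).group().reduceSum(valueFn).all()
function groupReduceSum(rows, keyFn, valueFn) {
  const sums = new Map();
  for (const row of rows) {
    const key = keyFn(row);
    sums.set(key, (sums.get(key) || 0) + valueFn(row));
  }
  // One {key, value} object per bucket, in ascending key order.
  return [...sums]
    .map(([key, value]) => ({ key, value }))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

// Made-up rows shaped like the aggregated JSON output.
const rows = [
  { rundate: "2015-10-01", creditordebit: "Credit", transactiontype: "TT0090", count_txn: 3, total_amount: 50000 },
  { rundate: "2015-10-01", creditordebit: "Debit", transactiontype: "TT0010", count_txn: 2, total_amount: 20000 },
  { rundate: "2015-10-02", creditordebit: "Credit", transactiontype: "TT0090", count_txn: 5, total_amount: 30000 },
];

// Same metrics as on the slide, computed without the library.
const amount = groupReduceSum(rows, (d) => d.rundate, (d) => d.total_amount / 10000);
const txnTypeCount = groupReduceSum(rows, (d) => d.transactiontype, (d) => d.count_txn);

console.log(amount);
console.log(txnTypeCount);
```

Because the rows are already aggregated per day, account and type, the browser only sums a few thousand pre-counted rows instead of raw transactions.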
  • 15. CROSSFILTER CHARTS (3)
      • Create Charts
          txnType_chart
            .dimension(txnTypeDim)
            .group(txnTypeCount)
            .cap(10)
            .width(500)
            .height(600)
            .colors(d3.scale.category10())
            .ordering(function (d) { return -d.value; })
            .title(function (d) { return d.value; })
            .elasticX(true)
            .xAxis().ticks(4).tickFormat(d3.format(".2s"));
  • 16. DQ REPORT
      • Early stages of new data from a client
      • Need to understand the data to create good scenarios
      • Need to be flexible, fast and easy to use
      • Frequent values: .stat.freqItems
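As a rough illustration of a frequent-values check for a data quality report, the sketch below counts how often each value of one column occurs and keeps the values at or above a minimum support. It is plain JavaScript with exact counting, not Spark's `.stat.freqItems`, which uses an approximate algorithm and can return false positives. `frequentValues`, its `support` parameter, and the sample rows are assumptions for illustration.

```javascript
// Hypothetical helper: exact frequent-value count for one column.
// support is the minimum fraction of rows a value must appear in.
function frequentValues(rows, column, support) {
  const counts = new Map();
  for (const row of rows) {
    const value = row[column];
    counts.set(value, (counts.get(value) || 0) + 1);
  }
  const minCount = support * rows.length;
  return [...counts]
    .filter(([, count]) => count >= minCount)
    .map(([value, count]) => ({ value, count }))
    .sort((a, b) => b.count - a.count);
}

// Made-up sample: one dominant type, one rare type, one empty value.
const sample = [
  { TransactionType: "TT0090" },
  { TransactionType: "TT0090" },
  { TransactionType: "TT0010" },
  { TransactionType: "TT0090" },
  { TransactionType: "" },
];

const frequent = frequentValues(sample, "TransactionType", 0.4);
console.log(frequent);
```

Running the same check with a very low support surfaces the rare and empty values instead, which is often the more interesting list when the data comes from a new client.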
  • 17. OTHER APPLICATIONS
      • Finding Networks
      • Fraud and ML rings
      • GraphX and PageRank
      • ETL
      • Identity Resolution
      • Analytics
      • MLlib
      • Recommendations
  • 18. RECAP
      • DataFrames make Spark easy to use
      • Can reuse existing SQL code and skills
      • Fast to write code with Scala REPL
      • Lots of business applications for Spark
      • New ways of thinking about working with data to get information
      • We’re hiring!
      • http://www.baesystems.com/en/cybersecurity
      • Thanks to Brendan and his team here
      • Encourage anyone to give a presentation
      • Any Questions?