Copyright © 2014 BAE Systems. All Rights Reserved.
BAE Systems is a trade mark of BAE Systems plc 1
Applied Intelligence
Company Confidential
APACHE SPARK DATA FRAMES
FOR FRAUD DETECTION
Eddie Baggott
15th October 2015
INTRODUCTION
• Functional and Data Architect
• BAE Systems, Norkom
• Anti-fraud, AML, compliance, watch lists, cyber
• Big data?
• Beginner in Spark… no Java or Scala experience
• All opinions are my own
OVERVIEW
DataFrames
WHAT ARE DATAFRAMES
• Distributed collection of rows organized into named columns
• An abstraction for selecting, filtering, aggregating and plotting structured
data (as in R and pandas)
• Formerly SchemaRDD
• Named columns
• Spark SQL
• RDD-style methods such as map and flatMap are available
• Full Spark stack (MLlib, GraphX) available
• Can switch between RDD and DataFrame with .rdd and .toDF
• Many data sources and formats. Easy to read from:
• JSON, Parquet, Hive, CSV, HDFS, S3, JDBC, Cassandra
• Evaluated lazily
• Optimized by the query optimizer
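As a rough illustration of the .rdd / .toDF round trip and of lazy evaluation, here is a minimal sketch. It assumes Spark 2.x or later (SparkSession) running in local mode, whereas the talk itself used the 1.5-era sqlContext. The sample rows and the names `rdd`, `df`, `large` and `accountIds` are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+ on the classpath; run as a script or pasted into spark-shell.
val spark = SparkSession.builder().master("local[*]").appName("df-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample rows: (AccountId, txnBaseAmount)
val rdd = spark.sparkContext.parallelize(
  Seq(("acc-1", 20.0), ("acc-2", 5.0), ("acc-1", 34.0)))

// RDD -> DataFrame: attach column names to the tuples
val df = rdd.toDF("AccountId", "txnBaseAmount")

// Nothing is computed here: filter only extends the logical plan
val large = df.filter($"txnBaseAmount" > 10.0)

// DataFrame -> RDD: back to RDD-style methods on Row objects.
// collect() is the action that finally triggers execution.
val accountIds = large.rdd.map(row => row.getString(0)).collect().sorted
```

Only the final collect() causes any work to be done, which is what gives the optimizer the chance to plan the whole chain at once.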
USING DATAFRAMES
• Selecting, filtering, aggregating structured data
• Write less code
• Read less data
• Let the optimizer do the hard work
• Old style
val wordCount = data.flatMap(_.split(" ")).map(_ -> 1).reduceByKey(_ + _)
• SQL
sqlContext.sql("select count(*), AccountId FROM TRANSACTIONS GROUP BY AccountId")
• Programmatically
df_txn.groupBy("AccountId").count()
CREATE A DATAFRAME
import sqlContext.implicits._ // brings .toDF into scope
case class Account_View(RunDate: String, txnBaseAmount: Double,
  TransactionType: String, AccountId: String, CreditDebitCode: String)
// Split each pipe-delimited line and map the fields of interest to the case class
val df_txn = sc.textFile("TRANSACTIONS.txt").map(_.split('|')).map(p =>
  Account_View(p(0), p(4).trim.toDouble, p(7), p(8), p(5))).toDF
// Total amount per account for one transaction type
val df_cash = df_txn.filter("TransactionType = 'TT0090'").groupBy("AccountId").sum()
df_cash.printSchema
 |-- AccountId: string (nullable = true)
 |-- SUM(txnBaseAmount): double (nullable = true)
df_cash.show
|     AccountId|SUM(txnBaseAmount)|
|account8849-25|              54.0|
|account16705-2|              65.0|
SQL EXAMPLE
//register the DataFrame as a temporary table so that SQL can query it
df_txn.registerTempTable("TRANSACTIONS")
//run query (assumes the table exposes a TransactionId column)
val df_agg = sqlContext.sql("""select RunDate, AccountId, CreditDebitCode
  , COUNT(TransactionId) AS COUNT_TXN, SUM(txnBaseAmount) AS TOTAL_AMOUNT
  FROM TRANSACTIONS
  GROUP BY RunDate, AccountId, CreditDebitCode""")
PROGRAMMATICALLY
import org.apache.spark.sql.functions.{max, sum}
import sqlContext.implicits._ // enables the $"column" syntax
// read from json or parquet
val df_json = sqlContext.read.json("txn_json")
//schema discovery
df_json.printSchema
//aggregate
df_json.filter("CreditDebitCode = 'Credit'")
  .groupBy("AccountId").agg(sum("txnBaseAmount"), max("txnBaseAmount"))
//join different data sources (both sides need a TransactionId column)
df_json.as("a").join(df_txn.as("b"), $"a.TransactionId" === $"b.TransactionId")
//sql
df_json.registerTempTable("TXN_JSON")
sqlContext.sql("""select distinct a.TransactionType from TRANSACTIONS a
  JOIN TXN_JSON b ON a.TransactionId = b.TransactionId""")
NEW FEATURES
• Window functions (new in 1.4)
• Analytic functions
• Ranking functions
.over(…)
.partitionBy, .orderBy, frame (rows or range)
• New in 1.5
• 90 functions
• Math/date/condition/string/etc.
• https://issues.apache.org/jira/browse/SPARK-8159
• Spark Packages http://spark-packages.org/
• spark-csv from Databricks
• lift-json: https://github.com/lift/lift/tree/master/framework/lift-base/lift-json/
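The window pieces listed above (.over, .partitionBy, .orderBy and a row frame) might fit together roughly as follows. This is a sketch only, assuming Spark 2.1 or later for SparkSession and the Window.unboundedPreceding / Window.currentRow constants. In the 1.4 and 1.5 releases discussed here, window functions needed a HiveContext. The sample data and the column names `txnRank` and `runningTotal` are invented.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, sum}

// Assumption: Spark 2.1+ in local mode
val spark = SparkSession.builder().master("local[*]").appName("window-sketch").getOrCreate()
import spark.implicits._

// Hypothetical transactions: (AccountId, RunDate, txnBaseAmount)
val txns = Seq(
  ("acc-1", "2015-10-01", 20.0),
  ("acc-1", "2015-10-02", 50.0),
  ("acc-2", "2015-10-01", 5.0)
).toDF("AccountId", "RunDate", "txnBaseAmount")

// Partition by account and order by date inside each partition
val byAccount = Window.partitionBy("AccountId").orderBy("RunDate")

// Row frame: from the first row of the partition up to the current row
val running = byAccount.rowsBetween(Window.unboundedPreceding, Window.currentRow)

val windowed = txns
  .withColumn("txnRank", rank().over(byAccount))                  // ranking function
  .withColumn("runningTotal", sum("txnBaseAmount").over(running)) // analytic aggregate

val rows = windowed.orderBy("AccountId", "RunDate").collect()
```

A running total per account is the kind of aggregate a fraud profile can be built on without collapsing the rows the way groupBy does.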
APPLICATIONS
• Profiling
• Build aggregates
• Detection with CEP
• Charting
• Crossfilter and other BI tools
• Data quality
• Need to understand the data from a new client
• Finding networks
• Fraud and money laundering (ML) rings
• GraphX and PageRank
• ETL
• Analytics
• Recommendations
PROFILING
//Data labels
{ "Name": "Credit", "Field": "CreditDebitCode", "Value": "Credit", "Operator": "=" }
{ "Name": "Cash", "Field": "INSTRUMENT", "Value": "Cash", "Operator": "=" }
//Profile Definition
{ "id": 2, "ProfileName": "CashCredit", "Operator": "Accumulator",
  "Data_Labels": [ { "Name": "Credit" }, { "Name": "Cash" } ] }
//CEP rules engine
If CashCredit.sum > 10,000 Then Alert
//File layout
RunDate|x|x|AccountId|AmountBaseCurrency|CreditDebitCode|TransactionID|TransactionType|UnusedField|…
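To make the flow from data labels to profile to alert concrete, here is a small plain-Scala sketch with no Spark dependency. It is not the production engine: the types `Txn` and `DataLabel`, the helper names, and the reading that a transaction must match every label in the profile are all assumptions made for illustration.

```scala
// Hypothetical types: a transaction carries its raw fields as a name -> value map
final case class Txn(accountId: String, amount: Double, fields: Map[String, String])
final case class DataLabel(name: String, field: String, value: String)

val credit = DataLabel("Credit", "CreditDebitCode", "Credit")
val cash   = DataLabel("Cash", "INSTRUMENT", "Cash")

// A label matches when the named field equals the expected value (the "=" operator)
def matches(label: DataLabel, txn: Txn): Boolean =
  txn.fields.get(label.field).exists(_.trim == label.value)

// "Accumulator" profile: per account, sum the amounts of transactions
// that match every label (assumed AND semantics)
def cashCreditSum(txns: Seq[Txn]): Map[String, Double] =
  txns
    .filter(t => Seq(credit, cash).forall(matches(_, t)))
    .groupBy(_.accountId)
    .map { case (account, ts) => account -> ts.map(_.amount).sum }

// CEP-style rule: alert when the accumulated value exceeds the threshold
def alerts(profile: Map[String, Double], threshold: Double = 10000.0): Set[String] =
  profile.collect { case (account, total) if total > threshold => account }.toSet

// Made-up sample data
val sample = Seq(
  Txn("acc-1", 6000.0, Map("CreditDebitCode" -> "Credit", "INSTRUMENT" -> "Cash")),
  Txn("acc-1", 5000.0, Map("CreditDebitCode" -> "Credit", "INSTRUMENT" -> "Cash")),
  Txn("acc-2", 20000.0, Map("CreditDebitCode" -> "Debit", "INSTRUMENT" -> "Cash"))
)
val flagged = alerts(cashCreditSum(sample))
```

Here acc-1 accumulates 11,000 in cash credits and is flagged, while acc-2 is ignored because its large transaction is a debit.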
CHARTING
CROSSFILTER CHARTS (1)
• Three open source JavaScript libraries
• D3.js
• Crossfilter.js
• DC.js
• https://dc-js.github.io/dc.js/docs/stock.html
• Data in JSON format from the previous example
• Aggregation reduces 30m transactions to 40k rows
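One way the aggregated rows could be handed to the browser as JSON is sketched below. It assumes Spark 2.x or later in local mode and that the aggregate is small enough to collect on the driver. The rows are made up, and the lower-case column names are an assumption based on the field names the chart code reads.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+ in local mode
val spark = SparkSession.builder().master("local[*]").appName("json-sketch").getOrCreate()
import spark.implicits._

// Hypothetical aggregated rows, one per (date, credit/debit, transaction type)
val agg = Seq(
  ("2015-10-01", "Credit", "TT0090", 3L, 540000.0),
  ("2015-10-01", "Debit", "TT0010", 1L, 12000.0)
).toDF("rundate", "creditordebit", "transactiontype", "count_txn", "total_amount")

// One JSON object per row
val jsonRows: Array[String] = agg.toJSON.collect()

// Wrap the objects in a JSON array that a page can load in one request
val payload: String = jsonRows.mkString("[", ",", "]")
```

For larger outputs, writing files with agg.write.json(...) would avoid pulling everything through the driver.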
CROSSFILTER CHARTS (2)
• Define dimensions (x-axis); ndx is the crossfilter instance built from the JSON rows
• var dateDim = ndx.dimension(function (d) { return d.rundate; });
• var creditordebitDim = ndx.dimension(function (d) { return d.creditordebit; });
• var txnTypeDim = ndx.dimension(function (d) { return d.transactiontype; });
• Define metrics (groups)
• var amount = dateDim.group().reduceSum(function (d) { return d.total_amount / 10000; });
• var txnCreditOrDebitCount = creditordebitDim.group().reduceSum(function (d) { return d.count_txn; });
• var txnTypeCount = txnTypeDim.group().reduceSum(function (d) { return d.count_txn; });
CROSSFILTER CHARTS (3)
• Create charts
txnType_chart
  .dimension(txnTypeDim)
  .group(txnTypeCount)
  .cap(10)
  .width(500)
  .height(600)
  .colors(d3.scale.category10())
  .ordering(function (d) { return -d.value; })
  .title(function (d) { return d.value; })
  .elasticX(true)
  .xAxis().ticks(4).tickFormat(d3.format(".2s"));
DQ REPORT
• Early stages of receiving new data from a client
• Need to understand the data to create good scenarios
• Need to be flexible, fast and easy to use
• Frequent values
.stat.freqItems
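A first-pass data quality check with .stat.freqItems could look roughly like this. The sketch assumes Spark 2.x or later in local mode. The sample rows and the 0.4 support threshold are invented, and describe is added here as a convenient companion rather than something the talk showed.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+ in local mode
val spark = SparkSession.builder().master("local[*]").appName("dq-sketch").getOrCreate()
import spark.implicits._

// Hypothetical transactions: (AccountId, TransactionType, txnBaseAmount)
val txns = Seq(
  ("acc-1", "TT0090", 20.0), ("acc-2", "TT0090", 5.0),
  ("acc-3", "TT0090", 7.5), ("acc-4", "TT0010", 12.0)
).toDF("AccountId", "TransactionType", "txnBaseAmount")

// Values of TransactionType appearing in more than 40% of rows.
// freqItems is approximate: it may also return some less frequent values.
val frequent = txns.stat.freqItems(Seq("TransactionType"), 0.4)
val frequentTypes = frequent.first().getSeq[String](0)

// Basic numeric profile of the amount column: count, mean, stddev, min, max
val summary = txns.describe("txnBaseAmount")
```

The result column is named after the input column with a _freqItems suffix, so the output is easy to join back into a report.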
OTHER APPLICATIONS
• Finding Networks
• Fraud and money laundering (ML) rings
• GraphX and PageRank
• ETL
• Identity Resolution
• Analytics
• MLlib
• Recommendations
RECAP
• DataFrames make Spark easy to use
• Can reuse existing SQL code and skills
• Fast to write code with Scala REPL
• Lots of business applications for Spark
• New ways of thinking about working with data to get information
• We’re hiring!
• http://www.baesystems.com/en/cybersecurity
• Thanks to Brendan and his team here
• I'd encourage anyone to give a presentation
• Any Questions?

Dublin Ireland Spark Meetup October 15, 2015

  • 1. Copyright © 2014 BAE Systems. All Rights Reserved. BAE Systems is a trade mark of BAE Systems plc 1 Applied Intelligence Company Confidential APACHE SPARK DATA FRAMES FOR FRAUD DETECTION Eddie Baggott 15th October 2015
  • 2. Copyright © 2014 BAE Systems. All Rights Reserved. BAE Systems is a trade mark of BAE Systems plc 2 Applied Intelligence Company Confidential INTRODUCTION • Functional and Data Architect • BAE Systems , Norkom • Anti Fraud ,AML , Compliance , Watch lists, Cyber • Big data? • Beginner in spark…No java or Scala experience • All my own opinion
  • 3. Copyright © 2014 BAE Systems. All Rights Reserved. BAE Systems is a trade mark of BAE Systems plc 3 Applied Intelligence Company Confidential OVERVIEW DataFrames
  • 4. Copyright © 2014 BAE Systems. All Rights Reserved. BAE Systems is a trade mark of BAE Systems plc 4 Applied Intelligence Company Confidential WHAT ARE DATAFRAMES • Distributed collection of rows organized into named columns • An abstraction for selecting, filtering, aggregating and plotting structured data (R,Pandas) • Schema RDD • Named columns • Spark SQL • RDD-style methods such as Map, flatMap are available • Full Spark Stack (MLLIb, GraphX) available • Can switch between RDD and DF with .rdd and .toDF • Many data sources and formats. Easy to read from: • JSON ,Parquet, Hive , CSV ,HDFS ,S3 ,JDBC, Cassandra • Evaluated lazily • optimized •
  • 5. USING DATAFRAMES
      • Selecting, filtering, aggregating structured data
      • Write less code
      • Read less data
      • Let the optimizer do the hard work
      • Old style
          val wordCount = data.flatMap(_.split(" ")).map(_ -> 1).reduceByKey(_ + _)
      • SQL
          sqlContext.sql("select count(*), AccountId FROM TRANSACTIONS GROUP BY AccountId")
      • Programmatically
          Txn_df.groupBy("AccountId").count()
  • 6. CREATE A DATAFRAME
      // needed for .toDF on an RDD of case classes
      import sqlContext.implicits._

      case class Account_View(
        RunDate: String,
        txnBaseAmount: Double,
        TransactionType: String,
        AccountId: String,
        CreditDebitCode: String)

      // parse the pipe-delimited file into a DataFrame
      val df_txn = sc.textFile("TRANSACTIONS.txt")
        .map(_.split('|'))
        .map(p => Account_View(p(0), p(4).trim.toDouble, p(7), p(8), p(5)))
        .toDF

      // sum the numeric columns per account for one transaction type
      val df_cash = df_txn.filter("TransactionType = 'TT0090'").groupBy("AccountId").sum()

      df_cash.printSchema
       |-- AccountId: string (nullable = true)
       |-- SUM(txnBaseAmount): double (nullable = true)

      df_cash.show
      |     AccountId|SUM(txnBaseAmount)|
      |account8849-25|              54.0|
      |account16705-2|              65.0|
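The parse-then-aggregate step on this slide can be tried without a cluster. The following is a minimal sketch in plain Scala collections, not the Spark API: the names `ParseSketch`, `AccountView`, `parseLine` and `sumByAccount` are hypothetical, the field positions are copied from the slide's `p(0), p(4), p(7), p(8), p(5)`, and skipping malformed lines is an assumption rather than something the talk specifies.

```scala
import scala.util.Try

// Plain-Scala sketch of the slide's parse + filter + groupBy + sum pipeline.
// All names here are illustrative; this does not use Spark.
object ParseSketch {
  final case class AccountView(
    runDate: String,
    txnBaseAmount: Double,
    transactionType: String,
    accountId: String,
    creditDebitCode: String)

  // Returns None for lines with too few fields or a non-numeric amount
  // (an assumption; the slide's code would throw instead).
  def parseLine(line: String): Option[AccountView] = {
    val p = line.split('|')
    if (p.length < 9) None
    else Try(p(4).trim.toDouble).toOption
      .map(amount => AccountView(p(0), amount, p(7), p(8), p(5)))
  }

  // Total amount per account for one transaction type.
  def sumByAccount(lines: Seq[String], txnType: String): Map[String, Double] =
    lines.flatMap(l => parseLine(l).toList)
      .filter(_.transactionType == txnType)
      .groupBy(_.accountId)
      .map { case (account, rows) => account -> rows.map(_.txnBaseAmount).sum }
}
```

Calling `ParseSketch.sumByAccount(lines, "TT0090")` on a few sample lines gives the same per-account totals that `df_cash.show` prints on the slide, which makes it a convenient way to check the field positions before running the Spark job.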
  • 7. SQL EXAMPLE
      // register the DataFrame as a temporary table so SQL can refer to it by name
      df_txn.registerTempTable("TRANSACTIONS")

      // run query (df_txn must contain a TransactionId column for the COUNT below)
      val df_agg = sqlContext.sql("""
        select RunDate, AccountId, CreditDebitCode,
               COUNT(TransactionId) AS COUNT_TXN,
               SUM(txnBaseAmount) AS TOTAL_AMOUNT
        FROM TRANSACTIONS
        GROUP BY RunDate, AccountId, CreditDebitCode""")
  • 8. PROGRAMMATICALLY
      import org.apache.spark.sql.functions.{sum, max}
      import sqlContext.implicits._   // for the $"..." column syntax

      // read from json or parquet
      val df_json = sqlContext.read.json("txn_json")

      // schema discovery
      df_json.printSchema

      // aggregate
      df_json.filter("CreditDebitCode = 'Credit'")
        .groupBy("AccountId")
        .agg(sum("txnBaseAmount"), max("txnBaseAmount"))

      // join different data sources (both sides need an alias)
      df_json.as('a).join(df_txn.as('b), $"a.TransactionId" === $"b.TransactionId")

      // sql
      df_json.registerTempTable("TXN_JSON")
      sqlContext.sql("select distinct a.TransactionType from TRANSACTIONS a JOIN TXN_JSON b ON a.TransactionId = b.TransactionId")
  • 9. NEW FEATURES
      • Window functions (new in 1.4)
      • Analytic functions
      • Ranking functions
      • .over(…) with partitionBy, orderBy and a frame (rows or range)
      • New in 1.5
      • 90 functions
      • Math/date/condition/string/etc.
      • https://issues.apache.org/jira/browse/SPARK-8159
      • Spark Packages: http://spark-packages.org/
      • spark-csv from Databricks
      • https://github.com/lift/lift/tree/master/framework/lift-base/lift-json/
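To make the ranking idea concrete, here is a minimal sketch of what a rank computed over a window partitioned by account and ordered by amount (descending) would produce. It uses plain Scala collections rather than the Spark window API, so it runs without a cluster; `WindowRankSketch`, `Txn` and `RankedTxn` are hypothetical names, and the choice of partition and ordering columns is an assumption for illustration.

```scala
// Plain-Scala illustration of rank() OVER (PARTITION BY accountId ORDER BY amount DESC).
// Names are illustrative; this is not the Spark API.
object WindowRankSketch {
  final case class Txn(accountId: String, amount: Double)
  final case class RankedTxn(accountId: String, amount: Double, rank: Int)

  def rankByAmount(txns: Seq[Txn]): Seq[RankedTxn] =
    txns.groupBy(_.accountId).toSeq.sortBy(_._1).flatMap { case (_, rows) =>
      // order each partition by amount, largest first
      val sorted = rows.sortBy(-_.amount)
      sorted.map { t =>
        // rank = 1 + number of rows in the partition with a strictly larger amount,
        // so ties share a rank and the next rank is skipped
        val rank = 1 + sorted.count(_.amount > t.amount)
        RankedTxn(t.accountId, t.amount, rank)
      }
    }
}
```

In a fraud setting, filtering the result to `rank == 1` would give each account's largest transaction, which is the kind of per-account feature a window function is typically used for.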
  • 10. APPLICATIONS
      • Profiling
      • Build aggregates
      • Detection with CEP
      • Charting
      • Crossfilter and other BI tools
      • Data Quality
      • Need to understand the data from a new client
      • Finding Networks
      • Fraud and ML rings
      • GraphX and PageRank
      • ETL
      • Analytics
      • Recommendations
  • 11. PROFILING
      //Data labels
      { "Name": "Credit", "Field": "CreditDebitCode", "Value": "Credit", "Operator": "=" }
      { "Name": "Cash", "Field": "INSTRUMENT", "Value": "Cash", "Operator": "=" }

      //Profile Definition
      { "id": 2, "ProfileName": "CashCredit", "Operator": "Accumulator",
        "Data_Labels": [ { "Name": "Credit" }, { "Name": "Cash" } ] }

      //CEP rules engine
      If CashCredit.sum > 10,000 Then Alert

      //File layout
      RunDate|x|x|AccountId|AmountBaseCurrency|CreditDebitCode|TransactionID|TransactionType|UnusedField|…
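The way data labels, a profile and a CEP rule fit together can be sketched as follows. This is a plain-Scala illustration under stated assumptions, not the product's implementation: all type and function names are hypothetical, only the `=` operator is modelled, a profile is assumed to accumulate the amounts of transactions that match all of its labels, and the sum is assumed to be kept per account.

```scala
// Plain-Scala sketch of label matching, profile accumulation and a threshold rule.
// Names and semantics are assumptions for illustration only.
object ProfilingSketch {
  // A data label matches when the named field equals the value (Operator "=").
  final case class DataLabel(name: String, field: String, value: String)
  final case class Profile(name: String, labels: Seq[DataLabel])
  final case class Txn(accountId: String, amount: Double, fields: Map[String, String])

  def matches(label: DataLabel, txn: Txn): Boolean =
    txn.fields.get(label.field).contains(label.value)

  // "Accumulator": sum the amounts of transactions matching every label, per account.
  def accumulate(profile: Profile, txns: Seq[Txn]): Map[String, Double] =
    txns.filter(t => profile.labels.forall(matches(_, t)))
      .groupBy(_.accountId)
      .map { case (account, rows) => account -> rows.map(_.amount).sum }

  // Rule: alert when the accumulated sum exceeds the threshold.
  def alerts(sums: Map[String, Double], threshold: Double): Set[String] =
    sums.collect { case (account, total) if total > threshold => account }.toSet
}
```

With the `CashCredit` profile from the slide and a threshold of 10,000, `alerts(accumulate(profile, txns), 10000.0)` returns the accounts whose cash credits add up to more than the limit.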
  • 12. CHARTING
  • 13. CROSSFILTER CHARTS (1)
      • Three open source JavaScript libraries
      • D3.js
      • Crossfilter.js
      • DC.js
      • https://dc-js.github.io/dc.js/docs/stock.html
      • Data in JSON format from the previous example
      • Reduce 30m transactions down to 40k rows
  • 14. CROSSFILTER CHARTS (2)
      • Define dimensions (x-axis)
          var dateDim = ndx.dimension(function (d) { return d.rundate; });
          var creditordebitDim = ndx.dimension(function (d) { return d.creditordebit; });
          var txnTypeDim = ndx.dimension(function (d) { return d.transactiontype; });
      • Define metrics (groups)
          var amount = dateDim.group().reduceSum(function (d) { return d.total_amount / 10000; });
          var txnCreditOrDebitCount = creditordebitDim.group().reduceSum(function (d) { return d.count_txn; });
          var txnTypeCount = txnTypeDim.group().reduceSum(function (d) { return d.count_txn; });
  • 15. CROSSFILTER CHARTS (3)
      • Create charts
          txnType_chart
            .dimension(txnTypeDim)
            .group(txnTypeCount)
            .cap(10)
            .width(500)
            .height(600)
            .colors(d3.scale.category10())
            .ordering(function (d) { return -d.value; })
            .title(function (d) { return d.value; })
            .elasticX(true)
            .xAxis().ticks(4).tickFormat(d3.format(".2s"));
  • 16. DQ REPORT
      • Early stages of new data from a client
      • Need to understand the data to create good scenarios
      • Need to be flexible, fast and easy to use
      • Frequent values: .stat.freqItems
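As an illustration of the frequent-values check, the sketch below counts which values of a column reach a minimum share of the rows. It is a plain-Scala, exact computation, whereas Spark's `freqItems` uses an approximate algorithm and may return false positives; `FrequentValuesSketch`, `frequentValues` and the `minSupport` parameter name are hypothetical.

```scala
// Plain-Scala sketch of a frequent-values data-quality check.
// Exact counting for illustration; not the Spark implementation.
object FrequentValuesSketch {
  // Returns each value whose share of all rows is at least minSupport, with its count.
  def frequentValues(values: Seq[String], minSupport: Double): Map[String, Int] = {
    val total = values.size
    values.groupBy(identity)
      .map { case (value, occurrences) => value -> occurrences.size }
      .filter { case (_, count) => count.toDouble / total >= minSupport }
  }
}
```

Run against a column such as `TransactionType` from a new client's file, a result dominated by an empty string or a single default code is an early hint that the field is not populated as expected.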
  • 17. OTHER APPLICATIONS
      • Finding Networks
      • Fraud and ML rings
      • GraphX and PageRank
      • ETL
      • Identity Resolution
      • Analytics
      • MLlib
      • Recommendations
  • 18. RECAP
      • DataFrames make Spark easy to use
      • Can reuse existing SQL code and skills
      • Fast to write code with the Scala REPL
      • Lots of business applications for Spark
      • New ways of thinking about working with data to get information
      • We're hiring!
      • http://www.baesystems.com/en/cybersecurity
      • Thanks to Brendan and his team here
      • I encourage anyone to give a presentation
      • Any questions?