Dublin Ireland Spark Meetup October 15, 2015
eddiebaggott
Apache Spark for Fraud Detection Dublin Spark Meetup
1.
Copyright © 2014 BAE Systems. All Rights Reserved. BAE Systems is a trade mark of BAE Systems plc. Applied Intelligence Company Confidential
APACHE SPARK DATA FRAMES FOR FRAUD DETECTION
Eddie Baggott
15th October 2015
2.
INTRODUCTION
• Functional and Data Architect
• BAE Systems, Norkom
• Anti-fraud, AML, compliance, watch lists, cyber
• Big data?
• Beginner in Spark; no Java or Scala experience
• All my own opinion
3.
OVERVIEW
DataFrames
4.
WHAT ARE DATAFRAMES
• Distributed collection of rows organized into named columns
• An abstraction for selecting, filtering, aggregating and plotting structured data (as in R and Pandas)
• Successor to SchemaRDD
• Named columns
• Spark SQL
• RDD-style methods such as map and flatMap are available
• Full Spark stack (MLlib, GraphX) available
• Can switch between RDD and DataFrame with .rdd and .toDF
• Many data sources and formats. Easy to read from:
  JSON, Parquet, Hive, CSV, HDFS, S3, JDBC, Cassandra
• Evaluated lazily
• Optimized
5.
USING DATAFRAMES
• Selecting, filtering, aggregating structured data
• Write less code
• Read less data
• Let the optimizer do the hard work
• Old style
  val wordCount = data.flatMap(_.split(" ")).map(_ -> 1).reduceByKey(_ + _)
• SQL
  sqlContext.sql("select count(*), AccountId FROM TRANSACTIONS GROUP BY AccountId")
• Programmatically
  Txn_df.groupBy("AccountId").count()
6.
CREATE A DATAFRAME
// needed for .toDF outside spark-shell
import sqlContext.implicits._

case class Account_View(
  RunDate: String,
  txnBaseAmount: Double,
  TransactionType: String,
  AccountId: String,
  CreditDebitCode: String
)

// parse the pipe-delimited file into typed rows, then convert to a DataFrame
val df_txn = sc.textFile("TRANSACTIONS.txt")
  .map(_.split('|'))
  .map(p => Account_View(p(0), p(4).trim.toDouble, p(7), p(8), p(5)))
  .toDF

// total amount per account for transaction type TT0090
val df_cash = df_txn.filter("TransactionType = 'TT0090'").groupBy("AccountId").sum()

df_cash.printSchema
 |-- AccountId: string (nullable = true)
 |-- SUM(txnBaseAmount): double (nullable = true)

df_cash.show
|     AccountId|SUM(txnBaseAmount)|
|account8849-25|              54.0|
|account16705-2|              65.0|
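To try the parse, filter and sum steps from this slide without a cluster, here is a minimal plain-Scala sketch on ordinary collections. It is not the Spark API. The sample lines, the `AccountView` class, the `parse` helper and the `cashByAccount` name are hypothetical; the field positions (0, 4, 7, 8, 5) are copied from the slide's `map` call, and the sample amounts were chosen so the totals match the `df_cash.show` output.

```scala
// Spark-free sketch: parse -> filter -> group -> sum on local collections.
case class AccountView(
  runDate: String,
  txnBaseAmount: Double,
  transactionType: String,
  accountId: String,
  creditDebitCode: String
)

// Hypothetical sample records; field positions follow the slide (0, 4, 7, 8, 5).
val sampleLines: Seq[String] = Seq(
  "2015-10-01|x|x|x| 30.0|Credit|t1|TT0090|account8849-25",
  "2015-10-01|x|x|x| 24.0|Credit|t2|TT0090|account8849-25",
  "2015-10-02|x|x|x| 65.0|Debit|t3|TT0090|account16705-2",
  "2015-10-02|x|x|x| 99.0|Debit|t4|TT0001|account16705-2"
)

// Split on the literal pipe character and build one typed record per line.
def parse(line: String): AccountView = {
  val p = line.split('|')
  AccountView(p(0), p(4).trim.toDouble, p(7), p(8), p(5))
}

val txns: Seq[AccountView] = sampleLines.map(parse)

// Mirrors filter("TransactionType = 'TT0090'").groupBy("AccountId").sum()
val cashByAccount: Map[String, Double] =
  txns
    .filter(_.transactionType == "TT0090")
    .groupBy(_.accountId)
    .map { case (account, rows) => account -> rows.map(_.txnBaseAmount).sum }
```

Pasting this into a Scala REPL leaves the per-account totals in `cashByAccount`; the TT0001 record is excluded by the filter.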
7.
SQL EXAMPLE
// register the DataFrame as a temporary table so SQL can query it
df_txn.registerTempTable("TRANSACTIONS")

// run query
val df_agg = sqlContext.sql("""
  select RunDate, AccountId, CreditDebitCode,
         COUNT(TransactionId) AS COUNT_TXN,
         SUM(txnBaseAmount) AS TOTAL_AMOUNT
  FROM TRANSACTIONS
  GROUP BY RunDate, AccountId, CreditDebitCode
""")
8.
PROGRAMMATICALLY
// sum and max come from the functions object
import org.apache.spark.sql.functions._

// read from JSON or Parquet
val df_json = sqlContext.read.json("txn_json")

// schema discovery
df_json.printSchema

// aggregate
df_json.filter("CreditDebitCode = 'Credit'")
  .groupBy("AccountId")
  .agg(sum("txnBaseAmount"), max("txnBaseAmount"))

// join different data sources (both sides need an alias for a.* and b.* to resolve)
df_json.as('a).join(df_txn.as('b), $"a.TransactionId" === $"b.TransactionId")

// SQL
df_json.registerTempTable("TXN_JSON")
sqlContext.sql("select distinct a.TransactionType from TRANSACTIONS a JOIN TXN_JSON b ON a.TransactionId = b.TransactionId")
9.
NEW FEATURES
• Window functions (new in 1.4)
• Analytic functions
• Ranking functions
  .over(…), .partitionBy, .orderBy, frame (rows or range)
• New in 1.5
• 90 functions
• Math/date/condition/string/etc.
• https://issues.apache.org/jira/browse/SPARK-8159
• Spark Packages http://spark-packages.org/
• spark-csv from Databricks
• https://github.com/lift/lift/tree/master/framework/lift-base/lift-json/
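A window function computes a value for each row from the other rows in the same partition, in a given order. As a rough illustration of that idea, here is a plain-Scala sketch (ordinary collections, not the Spark API) of a running total per account ordered by run date. The `Txn` and `TxnWithRunningTotal` classes, the sample rows and the `runningTotals` helper are all hypothetical.

```scala
// Spark-free sketch of what a window aggregate computes:
// a running total of amounts per account, ordered by run date.
case class Txn(accountId: String, runDate: String, amount: Double)
case class TxnWithRunningTotal(txn: Txn, runningTotal: Double)

// Hypothetical sample data, deliberately out of order.
val windowSample: Seq[Txn] = Seq(
  Txn("A", "2015-10-02", 20.0),
  Txn("A", "2015-10-01", 10.0),
  Txn("B", "2015-10-01", 5.0),
  Txn("A", "2015-10-03", 30.0)
)

// "Partition by" account, "order by" run date, then accumulate within each partition.
def runningTotals(txns: Seq[Txn]): Seq[TxnWithRunningTotal] =
  txns
    .groupBy(_.accountId)
    .toSeq
    .sortBy(_._1)
    .flatMap { case (_, rows) =>
      val ordered = rows.sortBy(_.runDate)
      // scanLeft yields the cumulative sums; tail drops the initial 0.0 seed.
      val totals = ordered.scanLeft(0.0)(_ + _.amount).tail
      ordered.zip(totals).map { case (t, total) => TxnWithRunningTotal(t, total) }
    }

val withTotals: Seq[TxnWithRunningTotal] = runningTotals(windowSample)
```

Unlike a grouped aggregate, every input row survives: account A keeps its three rows, each carrying the total up to that date.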
10.
APPLICATIONS
• Profiling
• Build aggregates
• Detection with CEP
• Charting
• Crossfilter and other BI tools
• Data Quality
• Need to understand the data from a new client
• Finding Networks
• Fraud and ML rings
• GraphX and PageRank
• ETL
• Analytics
• Recommendations
11.
PROFILING
//Data labels
{ "Name": "Credit", "Field": "CreditDebitCode", "Value": "Credit", "Operator": "=" }
{ "Name": "Cash", "Field": "INSTRUMENT", "Value": "Cash", "Operator": "=" }
//Profile Definition
{ "id": 2, "ProfileName": "CashCredit", "Operator": "Accumulator", "Data_Labels": [ { "Name": "Credit" }, { "Name": "Cash" } ] }
//CEP rules engine
If CashCredit.sum > 10,000 Then Alert
//File layout
RunDate|x|x|AccountId|AmountBaseCurrency|CreditDebitCode|TransactionID|TransactionType|UnusedField|…
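One way to read the data labels, profile definition and rule on this slide is sketched below in plain Scala (no Spark, no rules engine). Several things are assumptions rather than anything the slide states: that a profile matches a transaction only when all of its labels match, that the accumulator sums amounts per account, and that only the "=" operator needs handling. The case classes, helper names and sample transactions are hypothetical.

```scala
// Spark-free sketch: labels are predicates on a field, a profile accumulates
// amounts for transactions matching all its labels, a rule alerts on the sum.
case class DataLabel(name: String, field: String, value: String, operator: String)
case class Profile(id: Int, profileName: String, operator: String, dataLabels: Seq[String])
case class Transaction(accountId: String, amount: Double, fields: Map[String, String])

val labels: Map[String, DataLabel] = Seq(
  DataLabel("Credit", "CreditDebitCode", "Credit", "="),
  DataLabel("Cash", "INSTRUMENT", "Cash", "=")
).map(l => l.name -> l).toMap

val cashCredit = Profile(2, "CashCredit", "Accumulator", Seq("Credit", "Cash"))

// Assumption: only the "=" operator is handled; anything else never matches.
def matches(label: DataLabel, txn: Transaction): Boolean =
  label.operator == "=" && txn.fields.get(label.field).contains(label.value)

// Assumption: accumulate per account over transactions that satisfy every label.
def accumulate(profile: Profile, txns: Seq[Transaction]): Map[String, Double] =
  txns
    .filter(t => profile.dataLabels.forall(n => labels.get(n).exists(matches(_, t))))
    .groupBy(_.accountId)
    .map { case (account, rows) => account -> rows.map(_.amount).sum }

// Rule from the slide: alert when CashCredit.sum > 10,000.
def alerts(sums: Map[String, Double], threshold: Double = 10000.0): Set[String] =
  sums.collect { case (account, sum) if sum > threshold => account }.toSet

// Hypothetical sample transactions.
val profilingSample: Seq[Transaction] = Seq(
  Transaction("acc1", 6000.0, Map("CreditDebitCode" -> "Credit", "INSTRUMENT" -> "Cash")),
  Transaction("acc1", 5000.0, Map("CreditDebitCode" -> "Credit", "INSTRUMENT" -> "Cash")),
  Transaction("acc1", 9000.0, Map("CreditDebitCode" -> "Debit", "INSTRUMENT" -> "Cash")),
  Transaction("acc2", 9000.0, Map("CreditDebitCode" -> "Credit", "INSTRUMENT" -> "Cash")),
  Transaction("acc2", 5000.0, Map("CreditDebitCode" -> "Credit", "INSTRUMENT" -> "Wire"))
)

val cashCreditSums: Map[String, Double] = accumulate(cashCredit, profilingSample)
val alertedAccounts: Set[String] = alerts(cashCreditSums)
```

Keeping labels and profiles as data rather than code is what lets new profiles be added through configuration; here only acc1 crosses the threshold, because its debit and acc2's wire transfer do not match both labels.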
12.
CHARTING
13.
CROSSFILTER CHARTS (1)
• Three open source JavaScript libraries
• D3.js
• Crossfilter.js
• DC.js
• https://dc-js.github.io/dc.js/docs/stock.html
• Data in JSON format from previous example
• Reduce 30m transactions down to 40k rows
14.
CROSSFILTER CHARTS (2)
• Define Dimension (x-axis)
  var dateDim = ndx.dimension(function (d) { return d.rundate; });
  var creditordebitDim = ndx.dimension(function (d) { return d.creditordebit; });
  var txnTypeDim = ndx.dimension(function (d) { return d.transactiontype; });
• Define Metrics (groups)
  var amount = dateDim.group().reduceSum(function (d) { return d.total_amount / 10000; });
  var txnCreditOrDebitCount = creditordebitDim.group().reduceSum(function (d) { return d.count_txn; });
  var txnTypeCount = txnTypeDim.group().reduceSum(function (d) { return d.count_txn; });
15.
CROSSFILTER CHARTS (3)
• Create Charts
  txnType_chart
    .dimension(txnTypeDim)
    .group(txnTypeCount)
    .cap(10)
    .width(500)
    .height(600)
    .colors(d3.scale.category10())
    .ordering(function (d) { return -d.value; })
    .title(function (d) { return d.value; })
    .elasticX(true)
    .xAxis().ticks(4).tickFormat(d3.format(".2s"));
16.
DQ REPORT
• Early stages of new data from a client
• Need to understand the data to create good scenarios
• Need to be flexible, fast and easy to use
• Frequent values: .stat.freqItems
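To show the kind of answer a frequent-values check gives on a new client's data, here is a plain-Scala sketch using exact counts on a local collection. It is only an illustration of the idea: Spark's `freqItems` uses an approximate algorithm and takes a minimum support fraction, whereas this version counts everything. The sample values and the `frequentValues` helper are hypothetical.

```scala
// Spark-free sketch of a "frequent values" data-quality check using exact counts.
// Hypothetical sample column values, including blanks and inconsistent casing.
val transactionTypes: Seq[String] =
  Seq("TT0090", "TT0090", "TT0001", "", "TT0090", "tt0090", "", "TT0001")

// Return the values whose share of all rows is at least minSupport,
// most frequent first, ties broken alphabetically.
def frequentValues(values: Seq[String], minSupport: Double): Seq[(String, Int)] = {
  val counts = values.groupBy(identity).map { case (v, occurrences) => v -> occurrences.size }
  counts.toSeq
    .filter { case (_, n) => n.toDouble / values.size >= minSupport }
    .sortBy { case (v, n) => (-n, v) }
}

val frequent: Seq[(String, Int)] = frequentValues(transactionTypes, 0.25)
```

In this sample the blank value is as frequent as a real transaction type, which is exactly the sort of finding worth raising with the client before building scenarios.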
17.
OTHER APPLICATIONS
• Finding Networks
• Fraud and ML rings
• GraphX and PageRank
• ETL
• Identity Resolution
• Analytics
• MLlib
• Recommendations
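As a small illustration of why PageRank helps when looking for networks, the sketch below runs the textbook iterative formulation on a tiny, hypothetical payments graph in plain Scala. It is not GraphX and does not reproduce GraphX's implementation or its scaling of ranks; the edge list, the `pageRank` helper and its default parameters are assumptions made for the example.

```scala
// Spark-free sketch of PageRank on a tiny, hypothetical payments graph.
// An edge (a, b) means account a sent money to account b.
val paymentEdges: Seq[(String, String)] = Seq(
  "acc1" -> "hub", "acc2" -> "hub", "acc3" -> "hub", "hub" -> "acc1"
)

def pageRank(
    edges: Seq[(String, String)],
    damping: Double = 0.85,
    iterations: Int = 50
): Map[String, Double] = {
  val nodes = edges.flatMap { case (a, b) => Seq(a, b) }.distinct
  val n = nodes.size
  val outLinks: Map[String, Seq[String]] =
    edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }
  // Start from a uniform distribution.
  var current: Map[String, Double] = nodes.map(node => node -> (1.0 / n)).toMap

  for (_ <- 1 to iterations) {
    // Rank held by nodes without outgoing edges is spread evenly over all nodes.
    val danglingMass = nodes.filterNot(outLinks.contains).map(current).sum
    // Each node splits its rank equally among the accounts it pays.
    val contributions: Map[String, Double] =
      edges
        .map { case (src, dst) => dst -> (current(src) / outLinks(src).size) }
        .groupBy(_._1)
        .map { case (dst, cs) => dst -> cs.map(_._2).sum }
    current = nodes.map { node =>
      val incoming = contributions.getOrElse(node, 0.0) + danglingMass / n
      node -> ((1 - damping) / n + damping * incoming)
    }.toMap
  }
  current
}

val ranksByAccount: Map[String, Double] = pageRank(paymentEdges)
val topAccount: String = ranksByAccount.maxBy(_._2)._1
```

The account that many others pay into ends up with the highest rank, which is the property that makes the measure useful for spotting the centre of a ring.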
18.
RECAP
• DataFrames make Spark easy to use
• Can reuse existing SQL code and skills
• Fast to write code with Scala REPL
• Lots of business applications for Spark
• New ways of thinking about working with data to get information
• We’re hiring!
• http://www.baesystems.com/en/cybersecurity
• Thanks to Brendan and his team here
• Encourage anyone to give a presentation
• Any Questions?