© 2016 Pivotal Software, Inc. All rights reserved.
Large Scale Fraud Analytics
GemFire Greenplum Connector (G2C)
Background
•  Government fraud revenue retention program
•  Detecting & retaining ~$5B annually
    –  Primary focus on identity theft
    –  Processes up to 8 million cases per day
    –  Current & historic data size ~60 TB (compressed)
•  Modifying architecture to integrate GemFire for scalable Java-based business logic, web service integration, and event-driven design
Fraud Systems Simplified
Prepare
•  Ingest
•  Restructure (ETL)
Score
•  Model Evaluation
Disposition
•  Business Logic
•  Prioritization
Respond
•  Investigation
•  Stop Payments
Business Logic Engine
ETL
Reporting
In-db Analytics
Application Services
Case Study Architecture – Scaling Up
GemFire
Greenplum
Spring Boot App Services
Informatica w/ PWX (ETL)
Business Objects (Reporting)
Legacy Logic Implementation
Logic Engine
In-db Analytics
Greenplum
Prepare
•  Ingest
•  Restructure (ETL)
Score
•  Model Evaluation
Disposition
•  Business Logic
•  Prioritization
Respond
•  Investigation
•  Stop Payments
Pivotal Greenplum (GPDB)
•  Postgres Community OSS
    –  Originally forked from PostgreSQL 8.2.15
    –  Massively parallel processing database
•  Master coordinates queries across segment databases
•  Supports in-database model evaluation
    –  MADlib, PL/R, SAS
[Figure labels: GPDB Logical, GPDB Physical, GPDB Software, Master, Segments]
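As a rough mental model of a master coordinating a query across segment databases, the sketch below hash-distributes rows to a few in-memory "segments" and lets a "master" method combine their partial sums. It is a toy illustration in plain Java, not Greenplum code; the class name, the row layout, and the hashing rule are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of an MPP aggregate: rows are hash-distributed to segments,
// each segment computes a partial sum, and the master adds the partials.
// Illustration only; nothing here is Greenplum API.
class MppSketch {
    private final List<List<long[]>> segments = new ArrayList<>();

    MppSketch(int segmentCount) {
        for (int i = 0; i < segmentCount; i++) {
            segments.add(new ArrayList<>());
        }
    }

    // Place a (key, amount) row on a segment by hashing the distribution key.
    void insert(long key, long amount) {
        int seg = Math.floorMod(Long.hashCode(key), segments.size());
        segments.get(seg).add(new long[] {key, amount});
    }

    // "Master": evaluate the partial sum on every segment, then combine.
    long sumAmounts() {
        return segments.parallelStream()
            .mapToLong(rows -> rows.stream().mapToLong(r -> r[1]).sum())
            .sum();
    }

    int rowsOnSegment(int seg) {
        return segments.get(seg).size();
    }
}
```

The point of the sketch is that no single segment sees all of the data, yet the combined result matches what a single-node sum would return.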
Initial Implementation
•  Fraud model results evaluated by business logic engine
•  Flat file data extraction
    –  Significant custom code to construct required object model
    –  Table → CSV → POJO
•  Shared element in an otherwise distributed system
    –  Performance considerations
GPDB
Legacy Logic Implementation
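To make the cost of the Table → CSV → POJO path concrete, here is a minimal sketch of the kind of hand-written parsing it implies. The `Parent` class, its three columns, and their order are assumptions chosen for illustration; the case study's real objects are far wider.

```java
import java.math.BigDecimal;

// Illustration of a hand-written CSV-to-POJO step.
// The Parent class and the column order (id,name,income) are hypothetical.
class CsvToPojoSketch {
    static final class Parent {
        final long id;
        final String name;
        final BigDecimal income;

        Parent(long id, String name, BigDecimal income) {
            this.id = id;
            this.name = name;
            this.income = income;
        }
    }

    // Parse one CSV line. Every added column needs matching code here.
    // Naive split: quoted fields and embedded commas are not handled.
    static Parent parse(String csvLine) {
        String[] cols = csvLine.split(",", -1);
        if (cols.length != 3) {
            throw new IllegalArgumentException("expected 3 columns, got " + cols.length);
        }
        return new Parent(
            Long.parseLong(cols[0].trim()),
            cols[1].trim(),
            new BigDecimal(cols[2].trim()));
    }
}
```

With hundreds of fields per object, this per-column code is what a declarative row-to-object mapping is meant to replace.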
Architecture Adjustments
•  New requirements introduced external integrations
    –  Drives desire for web services
•  Desire to improve performance & simplify codebase
•  Expanding business logic
    –  Logic engine runs as a GemFire function
GemFire
GPDB
Legacy Logic Implementation
Spring Boot (App Services)
GemFire Greenplum Connector
Context
[Diagram labels]
Greenplum: ANSI SQL, Analytical, Parallel, Configurable Data Load
GemFire: Native API, REST / HTTP, Transactional
Custom Apps: App 1, App 2
Transactional data write-behind
Data Science, Analytics & ML
GemFire Greenplum Connector (G2C)
•  Extension package for GemFire
•  Provides simple import and export of data between GemFire regions & Greenplum tables
    –  Parallel data motion leveraging Greenplum’s external table interface
•  Simple mapping between table rows and PdxInstance
    –  Flat object-relational mapping
    –  Set of predefined type conversions
    –  Configurable GemFire data collocation
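A flat row-to-object mapping with per-field type conversion can be pictured roughly as follows. This sketch uses a plain `Map` as a stand-in for `PdxInstance` and a hypothetical converter table in place of the connector's predefined conversions; it does not call any GemFire or G2C API.

```java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Stand-in for a flat row-to-object mapping with per-field type conversion.
// A Map plays the role of PdxInstance; the converter table is an assumption.
class FlatMappingSketch {
    // field name -> converter from the column's text form
    private final Map<String, Function<String, Object>> fields = new LinkedHashMap<>();

    FlatMappingSketch field(String name, Function<String, Object> converter) {
        fields.put(name, converter);
        return this;
    }

    // Convert one row (column name -> text value) into a flat field map.
    // Missing columns become null fields; there is no nesting.
    Map<String, Object> toObject(Map<String, String> row) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Function<String, Object>> f : fields.entrySet()) {
            String raw = row.get(f.getKey());
            out.put(f.getKey(), raw == null ? null : f.getValue().apply(raw));
        }
        return out;
    }
}
```

A mapping for the `Parent` entity could then be declared as `new FlatMappingSketch().field("id", Long::valueOf).field("name", s -> s).field("income", BigDecimal::new)`.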
G2C Data Interfaces
[Diagram labels: Greenplum (Master, Segments); GemFire (Data Node, Data Node); JDBC / ODBC; Control Logic]
Sample Data Import / Export
GpdbService is the primary entry point for explicitly invoked data motion
1.  Import - loads the full table contents from Greenplum
2.  Export - sends region contents to Greenplum
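// region: a Region configured with a gpdb:store, obtained elsewhere
Cache cache = CacheFactory.getAnyInstance();
GpdbService gpdb = GpdbService.getInstance(cache);
long count;
// 1. Import: load the full table contents from Greenplum into the region
count = gpdb.importRegion(region);
// 2. Export: send the region contents to Greenplum
count = gpdb.exportRegion(region);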
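The import and export calls can be read as bulk copies in opposite directions that report how much was moved. The sketch below models that reading with two in-memory maps. `ImportExportSketch` is hypothetical and is not the G2C `GpdbService`; in particular, treating the returned `long` as a count of rows or entries moved is an assumption based on the variable name `count` in the sample.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// In-memory stand-in for explicit data motion between a table and a region.
// Both sides are plain maps; this is not the connector's implementation.
class ImportExportSketch {
    final Map<Long, String> table = new LinkedHashMap<>();   // plays the Greenplum table
    final Map<Long, String> region = new LinkedHashMap<>();  // plays the GemFire region

    // Import: copy the full table contents into the region; return rows copied.
    long importRegion() {
        region.putAll(table);
        return table.size();
    }

    // Export: send the region contents to the table; return entries sent.
    long exportRegion() {
        table.putAll(region);
        return region.size();
    }
}
```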
Basic Cache Configuration
Configured via GemFire extension framework
•  1) Each region maps to a JNDI data source backed by Greenplum
•  2) Link an entity type and table
•  3) Declare a field to be used as the key
    •  Compound keys supported
•  4) Define a mapping between the fields and the table columns
    •  Default auto-configuration
    •  Optional name and column attributes for naming convention changes
    •  Class used to control type conversion
    •  Set of built-in types
<region name="Parent">
  <region-attributes refid="PARTITION">
    <partition-attributes/>
  </region-attributes>
  <!-- 1) data source backed by Greenplum -->
  <gpdb:store datasource="datasource">
    <gpdb:types>
      <!-- 2) entity type linked to a table -->
      <gpdb:pdx name="io.pivotal...entity.Parent" table="parent">
        <!-- 3) key field -->
        <gpdb:id field="id" />
        <!-- 4) field-to-column mapping -->
        <gpdb:fields>
          <gpdb:field name="name" />
          <gpdb:field name="id" column="id" />
          <gpdb:field name="income" class="java.math.BigDecimal" />
        </gpdb:fields>
      </gpdb:pdx>
    </gpdb:types>
  </gpdb:store>
</region>
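The default and overridden field-to-column naming in the configuration above can be sketched as a small lookup: a field maps to the column of the same name unless a column is given explicitly. `ColumnMappingSketch` and the `annualIncome` / `annual_income` pair are hypothetical; how G2C resolves names internally is not shown in the slides.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of field-to-column resolution with a same-name default.
// Hypothetical helper for illustration, not connector code.
class ColumnMappingSketch {
    private final Map<String, String> columnByField = new LinkedHashMap<>();

    // Default: the column name equals the field name.
    ColumnMappingSketch field(String name) {
        return field(name, name);
    }

    // Override: use a different column name where naming conventions differ.
    ColumnMappingSketch field(String name, String column) {
        columnByField.put(name, column);
        return this;
    }

    String columnFor(String fieldName) {
        return columnByField.get(fieldName);
    }

    // Column list in declaration order, e.g. for building a SELECT.
    String selectList() {
        return String.join(", ", columnByField.values());
    }
}
```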
Configuring Collocation
Parent-child foreign key relationships supported through collocation
1.  Compound key configurations result in a HashMap-based key in GemFire
2.  Provided partition resolver works with compound keys
<region name="Child">
  <...>
    <!-- 2) provided partition resolver, routing on parentId -->
    <partition-resolver>
      <class-name>
        io.pivotal.gemfire.gpdb.IdPartitionResolver
      </class-name>
      <parameter name="field">
        <string>parentId</string>
      </parameter>
  </...>
  <!-- 1) compound key -->
  <gpdb:id>
    <gpdb:field ref="parentId" />
    <gpdb:field ref="id" />
  </gpdb:id>
  <gpdb:fields>
    <gpdb:field name="parentId"/>
    <gpdb:field name="id" />
  </...>
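The idea behind routing a compound key on one of its fields can be sketched in plain Java: if the bucket is chosen from `parentId` alone, every child lands in the same bucket as its parent. The bucket count, the hashing rule, and the class name are assumptions for illustration; this is not the `IdPartitionResolver` implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of collocating child entries with their parent by routing a
// map-based compound key on a single field. Illustrative only.
class CollocationSketch {
    static final int BUCKETS = 113; // bucket count chosen for the example

    // Compound key for a child: both fields together identify the entry.
    static Map<String, Object> childKey(long parentId, long id) {
        Map<String, Object> key = new HashMap<>();
        key.put("parentId", parentId);
        key.put("id", id);
        return key;
    }

    // Route on one named field of the compound key, not on the whole key.
    static int bucketFor(Map<String, Object> key, String field) {
        return Math.floorMod(key.get(field).hashCode(), BUCKETS);
    }

    // A parent keyed by its plain id routes on that id directly.
    static int bucketForParent(long id) {
        return Math.floorMod(Long.valueOf(id).hashCode(), BUCKETS);
    }
}
```

Because the child's `id` never enters the routing decision, two children of the same parent always share a bucket even though their full keys differ.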
Configuring Automatic Synchronization
•  Data exported to Greenplum via asynchronous eventing
    –  Time and batch size triggers available
•  Causes each GemFire member to independently interact with Greenplum
    –  Configure GPDB resource queues accordingly
<region name="Child">
  <...>
  <gpdb:store datasource="datasource">
    <gpdb:synchronize mode="automatic"
                      time-interval="3000"
                      persistent="false" />
    <gpdb:types>
      <...>
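Time and batch-size triggers for asynchronous export can be illustrated with a small batcher that flushes when either condition is met. This is a sketch under stated assumptions: the class name, its parameters, and the explicit clock argument (used to keep the example deterministic) are hypothetical, and it does not use GemFire's asynchronous event queue.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of write-behind batching with a size trigger and a time trigger.
// Each flushed batch stands in for one export to Greenplum.
class WriteBehindBatcher {
    private final int batchSize;
    private final long intervalMillis;
    private final List<String> pending = new ArrayList<>();
    private final List<List<String>> flushed = new ArrayList<>();
    private long lastFlushMillis;

    WriteBehindBatcher(int batchSize, long intervalMillis, long nowMillis) {
        this.batchSize = batchSize;
        this.intervalMillis = intervalMillis;
        this.lastFlushMillis = nowMillis;
    }

    // Queue an event; flush immediately once the batch is full.
    void onEvent(String event, long nowMillis) {
        pending.add(event);
        if (pending.size() >= batchSize) {
            flush(nowMillis);
        }
    }

    // Called periodically; flush if the interval elapsed and work is pending.
    void onTick(long nowMillis) {
        if (!pending.isEmpty() && nowMillis - lastFlushMillis >= intervalMillis) {
            flush(nowMillis);
        }
    }

    private void flush(long nowMillis) {
        flushed.add(new ArrayList<>(pending));
        pending.clear();
        lastFlushMillis = nowMillis;
    }

    List<List<String>> flushedBatches() {
        return flushed;
    }
}
```

Since each GemFire member would run its own batcher, the number of concurrent exports grows with cluster size, which is why the slide recommends sizing GPDB resource queues accordingly.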
Case Study G2C Configuration Details
•  Existing required domain objects
    –  Multiple many-to-one groupings
•  Wide tables / objects (500+ fields)
•  Data collocation configured on caseId
•  Source tables wrapped in views
[Class diagram: CaseWrapper, ModelScores, Documents, PriorHistory, OtherData…, and LogicResults, each with caseId and further fields (elided); association multiplicities shown are *, *, *, * and 1]
Simple Loading – Single Table per Object
[Sequence diagram] Participants: :LoadTrigger, :GPDBService, :Region, :AsyncEventListener, :LogicEngine, results:Region
Calls, in order: Import(), put(), processEvents(), process(), put()
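One way to read the single-table flow is as an event-driven chain: an import puts entries into a region, each put queues an event, and a listener drains the queue, runs the logic engine, and stores the result. The sketch below models that reading with plain collections. The class, the queue, and the scoring rule passed in as `logicEngine` are assumptions; no GemFire listener API is used.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Function;

// Sketch of the flow: import -> put -> processEvents -> process -> put.
// Plain collections stand in for the regions and the async event queue.
class SimpleLoadSketch {
    final Map<String, Integer> region = new LinkedHashMap<>();
    final Map<String, String> results = new LinkedHashMap<>();
    private final Queue<String> events = new ArrayDeque<>();
    private final Function<Integer, String> logicEngine;

    SimpleLoadSketch(Function<Integer, String> logicEngine) {
        this.logicEngine = logicEngine;
    }

    // Import: each loaded row is put into the region, which queues an event.
    long importRows(Map<String, Integer> tableRows) {
        tableRows.forEach((key, value) -> {
            region.put(key, value);
            events.add(key);
        });
        return tableRows.size();
    }

    // Listener: drain queued events, run the logic engine, store each result.
    int processEvents() {
        int processed = 0;
        String key;
        while ((key = events.poll()) != null) {
            results.put(key, logicEngine.apply(region.get(key)));
            processed++;
        }
        return processed;
    }
}
```

For example, `new SimpleLoadSketch(score -> score >= 80 ? "REVIEW" : "PASS")` routes high-scoring cases to review once `processEvents()` runs.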
Complex Loading – Multiple Tables per Object
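[Sequence diagram] Participants: :LoadTrigger, :MergeLoader, :GPDBService, :Region, :LogicEngine, results:Region
Calls shown: executeFunction(), Import(), put(), assemble(), process(), put(); the diagram includes a par (parallel) fragment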
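When one object draws on several tables, a plausible shape for the merge step is to start the table loads in parallel and then group the rows by `caseId` once they have all finished. The sketch below shows that shape with `CompletableFuture`. The loader suppliers, the table names, and the nested-map wrapper are assumptions; the actual MergeLoader is not described in the slides.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Sketch of a multi-table load: run the table imports in parallel ("par"),
// then assemble one wrapper per caseId. Illustrative only.
class MergeLoadSketch {
    // Result shape: caseId -> (table name -> rows for that case)
    static Map<Long, Map<String, List<String>>> loadAndAssemble(
            Map<String, Supplier<Map<Long, List<String>>>> tableLoaders) {
        // Start every table import concurrently.
        Map<String, CompletableFuture<Map<Long, List<String>>>> inFlight = new TreeMap<>();
        tableLoaders.forEach((table, loader) ->
            inFlight.put(table, CompletableFuture.supplyAsync(loader)));

        // Wait for each import, then group its rows under the owning caseId.
        Map<Long, Map<String, List<String>>> byCase = new TreeMap<>();
        inFlight.forEach((table, future) ->
            future.join().forEach((caseId, rows) ->
                byCase.computeIfAbsent(caseId, id -> new TreeMap<>()).put(table, rows)));
        return byCase;
    }
}
```

Collocating every source region on `caseId` is what would let this assembly run locally on each member, without fetching rows from other nodes.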
Impacts & Results
•  Simplified implementation & code reduction
•  Maintained or improved data motion rates
    –  Case study CPU bound
    –  Additional improvements in the backlog
•  Improved system throughput
Questions?
Join the Apache Geode Community!
•  Check out: http://geode.incubator.apache.org
•  Subscribe: user-subscribe@geode.incubator.apache.org
•  Download: http://geode.incubator.apache.org/releases/
#GeodeSummit - Large Scale Fraud Detection using GemFire Integrated with Greenplum
