© 2015 ligaDATA, Inc. All Rights Reserved.
 October 2015
Download, Forums, Docs, Events http://Kamanja.org 
Meet 80% of the Needs of a Pilot Project
With a CC Fraud Detection Example
By Greg Makowski
ACM Data Science Camp, Saturday 10/24/2015
http://www.sfbayacm.org/event/silicon-valley-data-science-camp-2015
http://kamanja.org/white-papers/
Summary
Question) How to help an OSS pilot evaluation go faster?

Answer)
Develop "design patterns" for applications
Pick a specific app (Credit Card Fraud Detection)
Get data (end up generating it)
Need to vary arch config (like performance testing)

Given requirements, generate a multi-node example pilot system, involving many OSS components

PMML can abstract the production step from model building
Problem
When evaluating any new data mining or big data software, companies want to "try it out" and see how it meets their requirements. A common step is a pilot project.

A pilot would commonly involve integration with related software systems.

Open Source Software (OSS) may come with examples.
Need an example "production system"

Q) What can be done to shorten the time to finish a Pilot?
Problem:
Questions to be answered from Pilot
How fast is it?
  It depends (yes, that is an annoying answer)
  Show example configs with performance results

How to configure the system with other OSS software?
  It depends (yes, that is an annoying answer)
  Consider different application "design patterns"

How will the system grow as complexity grows?
  The answer is specific per design pattern

How should DevOps monitor and manage?
Kamanja Platform
[Figure: Kamanja platform diagram. Labels: Data Sources (CDC, Logs, Apps; Databases; ESBs; Social; 3rd Party), Input Queues, Decision Engine (kamanja) with Decisioning and Admin Management, Storage (Data Store), Output Queues, Actions (Next Best Action; Batch Stores; Application Updates; Alerts & Notifications).]
See Kamanja.org, and github
Kamanja is used as an example.

The process in this talk is general and can be broadly applied to other OSS.

Kamanja is a big data continuous decisioning system
  Apache license, available on github
Application Design Pattern
Departmental Model Scoring Application
Scaling challenges
  transaction growth and type (quantity & speed)
  model complexity (hybrid systems)
  quantity of models: 10's to 10k's
    for most models, most fields,
    need to access the data store for preprocessing

[Figure: architecture diagram. Labels: sources (Financial, Log, Consumer, Business), Input queue, Model Scoring (PMML), Real time Output Queue, Cache + Data Store (Preprocessing & Scores), Management and Control System, Reporting, Analysis. Note: Lambda Architecture combines real time and batch.]
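The departmental scoring pattern can be sketched in a few lines. This is a minimal in-process illustration, not Kamanja code: Python's `queue.Queue` stands in for the input and output queues, a dict stands in for the cache + data store, and the names `CUSTOMER_STORE`, `preprocess`, `score`, and `run_pipeline` are hypothetical.

```python
import queue

# Hypothetical lookup data standing in for the "Cache + Data Store" box.
CUSTOMER_STORE = {"c1": {"avg_spend": 40.0}, "c2": {"avg_spend": 900.0}}


def preprocess(txn, store):
    """Join the incoming record with stored history (the data store lookup)."""
    history = store.get(txn["customer"], {"avg_spend": 0.0})
    base = history["avg_spend"]
    ratio = txn["amount"] / base if base else float("inf")
    return {**txn, "spend_ratio": ratio}


def score(features):
    """Toy stand-in for a deployed model: flag spend far above the baseline."""
    return 1.0 if features["spend_ratio"] > 5.0 else 0.0


def run_pipeline(input_q, output_q, store):
    """Drain the input queue and emit each record with a new score field."""
    while True:
        try:
            txn = input_q.get_nowait()
        except queue.Empty:
            break
        features = preprocess(txn, store)
        output_q.put({**txn, "score": score(features)})


input_q, output_q = queue.Queue(), queue.Queue()
input_q.put({"customer": "c1", "amount": 500.0})  # 12.5x the stored baseline
input_q.put({"customer": "c2", "amount": 450.0})  # 0.5x the stored baseline
run_pipeline(input_q, output_q, CUSTOMER_STORE)
results = [output_q.get() for _ in range(output_q.qsize())]
```

The point of the sketch is the shape: every record triggers a data store lookup before scoring, which is why the slide lists data store access as a scaling challenge.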
Application Design Pattern
Social Network Analysis
Scaling challenges
  transaction growth and source (quantity & speed)
  model: sentiment, graph
  quantity of models: a few
  data store lookup for base user info

[Figure: architecture diagram. Labels: sources (Twitter, Facebook, ...), Input queue, Model Scoring (Java, Scala), Real Time Charting, Alerting, Cache + Data Store (User baseline, Network), Management and Control System, Trend Analysis, Deep Dive.]
Application Design Pattern
Text Mining, Search
Scaling challenges
  transaction growth
  some projects: very heavy computing for NLP parsing
    quickly score on tagged results

[Figure: architecture diagram. Labels: sources (Pages, Documents, Posts, Tweets), Input queue, Model Scoring (Java, Stanford NLP), Output Queue, Cache + Data Store (Parse trees, Inverted indexes, Docs ↔ Topics), Management and Control System, Trending topics, Update Thesaurus.]
Details on Departmental Scoring:
Credit Card Fraud Detection System
How to develop an example system?
  There is no public data. Private won't be shared

Generate the data (then can also test BIG DATA)
  Focus on 5 use cases of "normal" and 5 "fraud"
  Configuring architecture can be used for
    1) Performance testing for different requirements
    2) Pilot system, example included w/ Kamanja

Train models, generate PMML for scoring
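Generating labeled data could look roughly like the sketch below. It is an illustration only, not the generator shipped with Kamanja: the function name `make_transactions`, the field names, the category lists, the amount ranges, and the 2% fraud rate are all assumptions chosen to echo the use cases in this talk.

```python
import random


def make_transactions(n, fraud_rate=0.02, seed=0):
    """Return n synthetic card transactions, each labeled 0 (normal) or 1 (fraud)."""
    rng = random.Random(seed)  # seeded so a test run is repeatable
    rows = []
    for i in range(n):
        is_fraud = rng.random() < fraud_rate
        if is_fraud:
            # Assumed fraud profile: resellable goods, larger amounts, card not present.
            category = rng.choice(["gift_card", "jewelry", "electronics"])
            amount = round(rng.uniform(200, 2000), 2)
            card_present = False
        else:
            # Assumed normal profile: everyday categories, smaller amounts.
            category = rng.choice(["grocery", "gas", "restaurant", "travel"])
            amount = round(rng.uniform(5, 150), 2)
            card_present = rng.random() < 0.8
        rows.append({
            "id": i,
            "amount": amount,
            "category": category,
            "card_present": card_present,
            "label": int(is_fraud),
        })
    return rows


sample = make_transactions(1000)
```

Because the record count is just a parameter, the same generator can be scaled up for volume testing.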
Credit Card Fraud Detection System
FRAUD Use Cases
Fraudster extracting value out of hacked card
  Likely a first "test" of CC info. iTunes or unmanned gas pump w/o camera
  Drain account up to CC limit in 15 min, up to 2-3 days
  Purchase things "easy to cash out or resell" – launder money
    gift cards, gems, jewelry, small electronics easy to sell, burner phones

F1) Elder abuse – either PII or CC info gets copied
  Fraudster opens first web or mobile account (surprising for grandmother)
  Higher credit limit, long time with no web/mobile
  Long time CC holder (high tenure), little spend variation
F2) Hacker bought PII (Personally Identifiable Information)
  Fraudster used PII to apply for a new account
    new account likely has a lower credit limit
    Over 1st month, slowly changes PII to the fraudster's so as not to alert the victim
    use in "card not present" situations
Credit Card Fraud Detection System
FRAUD Use Cases
F3) Physical clone
  Fraudster may have bought CC info online ($1/account) or copied mag strip from the victim in the store.
  Fraudster card use can be concurrent with normal consumer use – or very different place and time zone

F4) Rare Behavior (may be part of other use cases)
  Unusual time of day, geography, spending by type of goods / services

F5) Risky Behavior – fraudster may visit blacklisted web page
  Fraudster is engaging with
  Geography changes are not plausible (noon in San Jose, 1pm in Hong Kong)
  Relate to past labeled cases of CC fraud.
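The "geography changes are not plausible" signal from F5 can be sketched as a speed check between consecutive transactions. This is one possible feature, not the method used in the talk: the helper names, the 1000 km/h threshold, and the approximate city coordinates are assumptions, and the sketch assumes both timestamps are on the same clock (for example UTC seconds), so the two events are one hour apart.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def implausible_travel(prev, curr, max_kmh=1000.0):
    """True if moving from prev to curr would require exceeding max_kmh."""
    hours = (curr["t"] - prev["t"]) / 3600.0
    dist = haversine_km(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
    if hours <= 0:
        return dist > 0  # same instant, different place
    return dist / hours > max_kmh


# Approximate coordinates; "t" is seconds on a shared clock.
san_jose = {"lat": 37.34, "lon": -121.89, "t": 0}
hong_kong = {"lat": 22.32, "lon": 114.17, "t": 3600}
nearby = {"lat": 37.40, "lon": -121.95, "t": 3600}

flag_far = implausible_travel(san_jose, hong_kong)
flag_near = implausible_travel(san_jose, nearby)
```

A feature like this needs the previous transaction for the same card, which is another case of the data store lookup during preprocessing.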
Credit Card Fraud Detection System
NORMAL Use Cases
1) Steady State use – the CC use by these people is fairly consistent and stable. Can have a die vei
2) New Card, 1st month – this example is set up to make it difficult to compare with fraudulently opened new cards.
   Spending may max out
3) Young and starting singles or newly married.
   These people don't have much of a credit rating
   More likely to use web and mobile channels.
   More likely to wander to dangerous areas of the web.
   Likely to spend in a bigger array of categories
   Possibly many geographic locations
4) Normal Case, Family –
   Medium to higher income limit, many don't hit limit
   Low to moderate showing up in new geographies, or spending on new categories
5) Work Travel – Work in sales or consulting. New locations are no surprise.
   Higher spending limit and amounts, many flight, hotel, car rental, high mobile
Pilot Project & Performance Testing
Credit Card Fraud Detection
[Figure: two example deployments, each drawn as Input queue, Model Scoring, Real time Output. Small: 1 Kafka, 1 Kamanja, 1 Kafka. Large: ~3 Kafka, 16 Kamanja, ~3 Kafka, with a Cache + Data Store. Annotation: Add Preprocessing Logic and HBase table lookup.]
Performance Testing – Model Node
Credit Card Fraud Detection
Fields per record: testing network speed between nodes
  30, 120, 480 fields (yes, could go 10k, 100k)

Single model complexity: testing compute load
  Small, Medium & Large (100, 2k, 32.5k elements)

Preprocessing lookup tables: testing cache to HBase & network
  none, some

Ensemble Models per score: testing compute & network
  1, 5, 20

Number of Models in department: 1, 10, 100
Solution to Developer Questions
(How Fast, How to Configure?)
How many fields per record?          30, 120, 480       (SML)
What model complexity?               100, 2k, 32.5k     (SML)
Is data already preprocessed?        Yes, No            (YN)
Average models / ensemble?           1, 5, 20           (SML)
How many models in the department?   1, 10, 100         (SML)
What language?                       PMML, Java, Scala

(I want to create a table like…)
Requirements   → Then need configure         For speed rec/s
S,S,Y,M,S      1 Kaf, 1 Kam, 1 Kaf           1.1mm
M,L,Y,M,S      1 Kaf, 1 Kam, 1 Kaf           200K
L,L,N,L,L      3 Kaf, 16 Kam, 1 Kaf, 3HB     1.6mm
Generate Architecture and run an 80% relevant Pilot
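A table like the one above could be stored as a simple lookup from requirement codes to a suggested configuration. The sketch below copies the three illustrative rows from the slide. The names `CONFIG_TABLE` and `suggest_config` are hypothetical, reading "mm" as millions of records per second is an assumption, and so is taking the five letters in the order the questions are listed.

```python
# Keys: (fields, complexity, preprocessed, ensemble size, models in department).
# Values: node counts and throughput in records/second, copied from the slide.
CONFIG_TABLE = {
    ("S", "S", "Y", "M", "S"): {
        "kafka_in": 1, "kamanja": 1, "kafka_out": 1, "hbase": 0,
        "rec_per_s": 1_100_000,
    },
    ("M", "L", "Y", "M", "S"): {
        "kafka_in": 1, "kamanja": 1, "kafka_out": 1, "hbase": 0,
        "rec_per_s": 200_000,
    },
    ("L", "L", "N", "L", "L"): {
        "kafka_in": 3, "kamanja": 16, "kafka_out": 1, "hbase": 3,
        "rec_per_s": 1_600_000,
    },
}


def suggest_config(fields, complexity, preprocessed, ensemble, n_models):
    """Return the tabulated configuration, or None if that row was never measured."""
    return CONFIG_TABLE.get((fields, complexity, preprocessed, ensemble, n_models))


large = suggest_config("L", "L", "N", "L", "L")
unknown = suggest_config("S", "L", "N", "S", "S")
```

Returning `None` for untested combinations keeps the table honest: it reports only measured rows rather than interpolating.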
  
Social Network Analysis:
Example System Configuration
[Figure: numbered data flow (steps 1 to 13) among Text or Twitter API, Java 1 and GUI, Kafka, Kamanja, Java 2 for Features (Sentiment or Stanford NLP), Data Store, Java 3 for analysis, and a Tomcat web service. Flow labels as extracted:]
  Java calls API, and Kafka producer
  Tweets returned in JSON
  JSON tweets sent to Kafka
  Kafka JSON to Kamanja
  JSON with features saved in DB (Matched_tags_per_text table)
  JAVA: Every "time window", queries the DB to aggregate (i.e. count (tags) by (tags) by..)
  JSON returns the aggregate query results to JAVA
  JSON query results to Kafka
  results to Java 3 for scoring, with thresholds
  JSON results of rule scoring, alert text
  Save results to DB (Alerts table)
  JAVA 1: check for updates to the alerts table
  13 Tomcat web service displays data and charts
PMML Diagram
Predictive Model Markup Language
[Figure: two project phases. Training Project Phase: Training & test data (batch), Data Mining Tool as PMML Producer (18 available), "File, Save As PMML", PMML File (full model specification). Production Scoring Project Phase: PMML File plus Scoring data (real time streaming), Scoring Engine (Kamanja) as PMML Consumer, Output data has new score field.]
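To make the producer/consumer split concrete, here is a minimal sketch of the consumer side: it reads a small hand-written PMML regression model with Python's standard library and scores one record. The model, its field names, and its coefficients are hypothetical, and the code handles only a `RegressionTable` with `NumericPredictor` terms; a real consumer such as Kamanja covers far more of the standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy model, written by hand for illustration.
PMML_DOC = """
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="hypothetical toy fraud-risk model"/>
  <DataDictionary numberOfFields="3">
    <DataField name="amount" optype="continuous" dataType="double"/>
    <DataField name="foreign" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="amount"/>
      <MiningField name="foreign"/>
      <MiningField name="risk" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="0.1">
      <NumericPredictor name="amount" coefficient="0.002"/>
      <NumericPredictor name="foreign" coefficient="0.3"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"p": "http://www.dmg.org/PMML-4_2"}


def load_regression(pmml_text):
    """Extract the intercept and per-field coefficients from the PMML text."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefs = {
        node.get("name"): float(node.get("coefficient"))
        for node in table.findall("p:NumericPredictor", NS)
    }
    return intercept, coefs


def score(record, model):
    """Apply the linear model: intercept + sum(coefficient * field value)."""
    intercept, coefs = model
    return intercept + sum(c * record[name] for name, c in coefs.items())


model = load_regression(PMML_DOC)
risk = score({"amount": 100.0, "foreign": 1.0}, model)  # 0.1 + 0.2 + 0.3
```

The scoring code never sees the tool that trained the model, only the PMML file, which is the abstraction the diagram describes.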
Given industry fragmentation,
PMML is a solution for Data Mining scoring
PMML Producers (18 data mining packages)
•  R (Rattle, PMML)*
•  RapidMiner
•  KNIME*
•  Spark (MLlib)*
•  Weka*
•  SAS Enterprise Miner
PMML Consumers (12 co)
•  Zementis
•  IBM SPSS
•  KNIME
•  Microstrategy
•  SAS
•  Kamanja* (Open Source)
* = Open Source
PREDICTIVE
Naïve Bayes
Neural Net
Regression
Rules
Scorecard
Sequence
SVM
Time Series
Trees
DESCRIPTIVE / OTH
Association Rules
Cluster, K-Nearest Nb
Text Models
model ensembles &
composition
(e.g. Gradient Boosting)
Summary
Question) How to help an OSS pilot evaluation go faster?

Answer)
Develop "design patterns" for applications
Pick a specific app (Credit Card Fraud Detection)
Get data (end up generating it)
Need to vary arch config (like performance testing)

Given requirements, generate a multi-node example pilot system, involving many OSS components

PMML can abstract the production step from model building
Try out Kamanja
Download, Forums, Docs, Events http://Kamanja.org
http://kamanja.org/white-papers/
Kamanja: One Speed Analysis
Kamanja: 220k to 230k messages / second
CONFIGURATION:
•  16 core box, using Solid State Disc
•  Sample Tool to generate messages of size 1k (not being reduced)
•  Data Mining uses 100's to 100k fields – not 100 byte message
•  Kafka Queue
•  3 input queues, each queue has 8 partitions
•  Kamanja Engine
•  Using the remaining 12-13 cores
•  Not saving score results per record in this test
SO WHAT? COMPARISON:
•  Storm is currently the lowest latency Apache big data system
•  Storm integration, got up to 90k to 100k for same data
•  Kamanja is 2.4 times faster than Storm = (225k/95k) in this test
•  Spark streaming is with mini-batches, with higher latency than Storm or Kamanja
Why is Kamanja faster than Storm?
Storm reads the data from the input queue (spout) and passes that to Bolts. On each pass from spout to bolt, Storm serializes & deserializes the data. There is other overhead.
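The 2.4x figure comes from the midpoints of the two measured ranges. The arithmetic, plus the spread implied by the range endpoints, can be checked with a short sketch; the variable names are illustrative and the inputs are the message rates quoted on this slide.

```python
def midpoint(lo, hi):
    """Midpoint of a measured throughput range (messages per second)."""
    return (lo + hi) / 2


kamanja_lo, kamanja_hi = 220_000, 230_000
storm_lo, storm_hi = 90_000, 100_000

# Midpoint comparison, as quoted on the slide: 225k / 95k.
speedup = midpoint(kamanja_lo, kamanja_hi) / midpoint(storm_lo, storm_hi)

# Most and least favourable pairings of the range endpoints.
speedup_min = kamanja_lo / storm_hi
speedup_max = kamanja_hi / storm_lo
```

Pairing the endpoints gives roughly 2.2x to 2.6x, so "2.4 times" is a fair summary of this one test.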

More Related Content

What's hot

Transform Banking with Big Data and Automated Machine Learning 9.12.17
Transform Banking with Big Data and Automated Machine Learning 9.12.17Transform Banking with Big Data and Automated Machine Learning 9.12.17
Transform Banking with Big Data and Automated Machine Learning 9.12.17Cloudera, Inc.
 
Data Democratization at Nubank
 Data Democratization at Nubank Data Democratization at Nubank
Data Democratization at NubankDatabricks
 
Open Source Data Management for Industry 4.0
Open Source Data Management for Industry 4.0Open Source Data Management for Industry 4.0
Open Source Data Management for Industry 4.0DataWorks Summit
 
Predictive maintenance withsensors_in_utilities_
Predictive maintenance withsensors_in_utilities_Predictive maintenance withsensors_in_utilities_
Predictive maintenance withsensors_in_utilities_Tina Zhang
 
Dive into H2O: NYC
Dive into H2O: NYCDive into H2O: NYC
Dive into H2O: NYCSri Ambati
 
Data Science Driven Malware Detection
Data Science Driven Malware DetectionData Science Driven Malware Detection
Data Science Driven Malware DetectionVMware Tanzu
 
Agile, Automated, Aware: How to Model for Success
Agile, Automated, Aware: How to Model for SuccessAgile, Automated, Aware: How to Model for Success
Agile, Automated, Aware: How to Model for SuccessInside Analysis
 
Spark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWXSpark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWXKirk Haslbeck
 
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystemStrata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystemShirshanka Das
 
Scaling AI in production using PyTorch
Scaling AI in production using PyTorchScaling AI in production using PyTorch
Scaling AI in production using PyTorchgeetachauhan
 
Deep Credit Risk Ranking with LSTM with Kyle Grove
Deep Credit Risk Ranking with LSTM with Kyle GroveDeep Credit Risk Ranking with LSTM with Kyle Grove
Deep Credit Risk Ranking with LSTM with Kyle GroveDatabricks
 
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...TigerGraph
 
Using Graph Algorithms For Advanced Analytics - Part 4 Similarity 30 graph al...
Using Graph Algorithms For Advanced Analytics - Part 4 Similarity 30 graph al...Using Graph Algorithms For Advanced Analytics - Part 4 Similarity 30 graph al...
Using Graph Algorithms For Advanced Analytics - Part 4 Similarity 30 graph al...TigerGraph
 
Polymorphic Table Functions: The Best Way to Integrate SQL and Apache Spark
Polymorphic Table Functions: The Best Way to Integrate SQL and Apache SparkPolymorphic Table Functions: The Best Way to Integrate SQL and Apache Spark
Polymorphic Table Functions: The Best Way to Integrate SQL and Apache SparkDatabricks
 
Deep Learning Image Processing Applications in the Enterprise
Deep Learning Image Processing Applications in the EnterpriseDeep Learning Image Processing Applications in the Enterprise
Deep Learning Image Processing Applications in the EnterpriseGanesan Narayanasamy
 
Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013
Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013
Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013Publicis Sapient Engineering
 
Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012
Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012
Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012Preferred Networks
 
Flash session -goldengate--lht1053-lon
Flash session -goldengate--lht1053-lonFlash session -goldengate--lht1053-lon
Flash session -goldengate--lht1053-lonJeffrey T. Pollock
 

What's hot (20)

7 Predictive Analytics, Spark , Streaming use cases
7 Predictive Analytics, Spark , Streaming use cases7 Predictive Analytics, Spark , Streaming use cases
7 Predictive Analytics, Spark , Streaming use cases
 
Transform Banking with Big Data and Automated Machine Learning 9.12.17
Transform Banking with Big Data and Automated Machine Learning 9.12.17Transform Banking with Big Data and Automated Machine Learning 9.12.17
Transform Banking with Big Data and Automated Machine Learning 9.12.17
 
Data Democratization at Nubank
 Data Democratization at Nubank Data Democratization at Nubank
Data Democratization at Nubank
 
Msst 2019 v4
Msst 2019 v4Msst 2019 v4
Msst 2019 v4
 
Open Source Data Management for Industry 4.0
Open Source Data Management for Industry 4.0Open Source Data Management for Industry 4.0
Open Source Data Management for Industry 4.0
 
Predictive maintenance withsensors_in_utilities_
Predictive maintenance withsensors_in_utilities_Predictive maintenance withsensors_in_utilities_
Predictive maintenance withsensors_in_utilities_
 
Dive into H2O: NYC
Dive into H2O: NYCDive into H2O: NYC
Dive into H2O: NYC
 
Data Science Driven Malware Detection
Data Science Driven Malware DetectionData Science Driven Malware Detection
Data Science Driven Malware Detection
 
Agile, Automated, Aware: How to Model for Success
Agile, Automated, Aware: How to Model for SuccessAgile, Automated, Aware: How to Model for Success
Agile, Automated, Aware: How to Model for Success
 
Spark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWXSpark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWX
 
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystemStrata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem
 
Scaling AI in production using PyTorch
Scaling AI in production using PyTorchScaling AI in production using PyTorch
Scaling AI in production using PyTorch
 
Deep Credit Risk Ranking with LSTM with Kyle Grove
Deep Credit Risk Ranking with LSTM with Kyle GroveDeep Credit Risk Ranking with LSTM with Kyle Grove
Deep Credit Risk Ranking with LSTM with Kyle Grove
 
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra...
 
Using Graph Algorithms For Advanced Analytics - Part 4 Similarity 30 graph al...
Using Graph Algorithms For Advanced Analytics - Part 4 Similarity 30 graph al...Using Graph Algorithms For Advanced Analytics - Part 4 Similarity 30 graph al...
Using Graph Algorithms For Advanced Analytics - Part 4 Similarity 30 graph al...
 
Polymorphic Table Functions: The Best Way to Integrate SQL and Apache Spark
Polymorphic Table Functions: The Best Way to Integrate SQL and Apache SparkPolymorphic Table Functions: The Best Way to Integrate SQL and Apache Spark
Polymorphic Table Functions: The Best Way to Integrate SQL and Apache Spark
 
Deep Learning Image Processing Applications in the Enterprise
Deep Learning Image Processing Applications in the EnterpriseDeep Learning Image Processing Applications in the Enterprise
Deep Learning Image Processing Applications in the Enterprise
 
Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013
Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013
Ask Bigger Questions with Cloudera and Apache Hadoop - Big Data Day Paris 2013
 
Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012
Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012
Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012
 
Flash session -goldengate--lht1053-lon
Flash session -goldengate--lht1053-lonFlash session -goldengate--lht1053-lon
Flash session -goldengate--lht1053-lon
 

Viewers also liked

Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewGreat Wide Open
 
Linked In Slides 2009 02 24 B
Linked In Slides 2009 02 24 BLinked In Slides 2009 02 24 B
Linked In Slides 2009 02 24 BGreg Makowski
 
SFbayACM ACM Data Science Camp 2015 10 24
SFbayACM ACM Data Science Camp 2015 10 24SFbayACM ACM Data Science Camp 2015 10 24
SFbayACM ACM Data Science Camp 2015 10 24Greg Makowski
 
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Big Data Spain
 
The 360º Leader (Section 2 of 6)
The 360º Leader (Section 2 of 6)The 360º Leader (Section 2 of 6)
The 360º Leader (Section 2 of 6)Greg Makowski
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Sparkdatamantra
 
Using Deep Learning to do Real-Time Scoring in Practical Applications - 2015-...
Using Deep Learning to do Real-Time Scoring in Practical Applications - 2015-...Using Deep Learning to do Real-Time Scoring in Practical Applications - 2015-...
Using Deep Learning to do Real-Time Scoring in Practical Applications - 2015-...Greg Makowski
 
Big Data: tools and techniques for working with large data sets
Big Data: tools and techniques for working with large data setsBig Data: tools and techniques for working with large data sets
Big Data: tools and techniques for working with large data setsBoston Consulting Group
 
SC4 Workshop 1: Logistics and big data German herrero
SC4 Workshop 1: Logistics and big data  German herreroSC4 Workshop 1: Logistics and big data  German herrero
SC4 Workshop 1: Logistics and big data German herreroBigData_Europe
 
Introduction to Hive and HCatalog
Introduction to Hive and HCatalogIntroduction to Hive and HCatalog
Introduction to Hive and HCatalogmarkgrover
 
Heuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient searchHeuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient searchGreg Makowski
 
BigDataEurope - Big Data & Climate Change
BigDataEurope - Big Data & Climate ChangeBigDataEurope - Big Data & Climate Change
BigDataEurope - Big Data & Climate ChangeBigData_Europe
 
Beyond Big Data: Harnessing the Industrial Internet for Wind Power
Beyond Big Data: Harnessing the Industrial Internet for Wind PowerBeyond Big Data: Harnessing the Industrial Internet for Wind Power
Beyond Big Data: Harnessing the Industrial Internet for Wind PowerGE_India
 
The 360º Leader (Section 1 of 6)
The 360º Leader (Section 1 of 6)The 360º Leader (Section 1 of 6)
The 360º Leader (Section 1 of 6)Greg Makowski
 
Using Deep Learning to do Real-Time Scoring in Practical Applications
Using Deep Learning to do Real-Time Scoring in Practical ApplicationsUsing Deep Learning to do Real-Time Scoring in Practical Applications
Using Deep Learning to do Real-Time Scoring in Practical ApplicationsGreg Makowski
 
Three case studies deploying cluster analysis
Three case studies deploying cluster analysisThree case studies deploying cluster analysis
Three case studies deploying cluster analysisGreg Makowski
 

Viewers also liked (20)

Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An Overview
 
Linked In Slides 2009 02 24 B
Linked In Slides 2009 02 24 BLinked In Slides 2009 02 24 B
Linked In Slides 2009 02 24 B
 
SFbayACM ACM Data Science Camp 2015 10 24
SFbayACM ACM Data Science Camp 2015 10 24SFbayACM ACM Data Science Camp 2015 10 24
SFbayACM ACM Data Science Camp 2015 10 24
 
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
 
The 360º Leader (Section 2 of 6)
The 360º Leader (Section 2 of 6)The 360º Leader (Section 2 of 6)
The 360º Leader (Section 2 of 6)
 
How to Create 80% of a Big Data Pilot Project

  • 1. © 2015 ligaDATA, Inc. All Rights Reserved. October 2015
    Download, Forums, Docs, Events: http://Kamanja.org
    Meet 80% of the Needs of a Pilot Project, With a CC Fraud Detection Example
    By Greg Makowski
    ACM Data Science Camp, Saturday 10/24/2015
    http://www.sfbayacm.org/event/silicon-valley-data-science-camp-2015
    http://kamanja.org/white-papers/
  • 2. Summary
    Question) How to help an OSS pilot evaluation go faster?
    Answer)
      Develop "design patterns" for applications
      Pick a specific app (Credit Card Fraud Detection)
      Get data (end up generating it)
      Need to vary arch config (like performance testing)
    Given requirements, generate a multi-node example pilot system, involving many OSS components
    PMML can abstract the production step from model building
  • 3. Problem
    When evaluating any new data mining or big data software, companies want to "try it out" and see how it meets their requirements. A common step is a pilot project.
    A pilot would commonly involve integration with related software systems.
    Open Source Software (OSS) may come with examples. Need an example "production system".
    Q) What can be done to shorten the time to finish a Pilot?
  • 4. Problem: Questions to be answered from Pilot
    How fast is it?
      It depends (yes, that is an annoying answer)
    How to configure the system with other OSS software?
      It depends (yes, that is an annoying answer)
  • 5. Problem: Questions to be answered from Pilot
    How fast is it?
      It depends (yes, that is an annoying answer)
      Show example configs with performance results
    How to configure the system with other OSS software?
      It depends (yes, that is an annoying answer)
      Consider different application "design patterns"
    How will the system grow as complexity grows?
      The answer is specific per design pattern
    How should DevOps monitor and manage?
  • 6. Kamanja Platform
    [Diagram. Labels: Data Sources (CDC, Logs, Apps; Databases; ESBs; Social; 3rd Party); Input Queues; Decision Engine (kamanja); Admin Management; Storage; Data Store; Output Queues; Decisioning Actions (Next Best Action; Batch Stores; Application Updates; Alerts & Notifications)]
  • 7. See Kamanja.org, and github
    Kamanja is used as an example.
    The process in this talk is general and can be broadly applied to other OSS.
    Kamanja is a big data continuous decisioning system
      Apache license, available on github
  • 8. Application Design Pattern: Departmental Model Scoring Application
    Scaling challenges
      transaction growth and type (quantity & speed)
      model complexity (hybrid systems)
      quantity of models: 10's to 10k's
      for most models, most fields, need to access the data store for preprocessing
    [Diagram. Labels: Input queue; Model Scoring; Real time Output Queue; Cache + Data Store; Management and Control System; Financial; Log; Consumer; Business; Preprocessing & Scores; Reporting; Analysis; PMML. Note: Lambda Architecture combines real time and batch]
  • 9. Application Design Pattern: Social Network Analysis
    Scaling challenges
      transaction growth and source (quantity & speed)
      model: sentiment, graph
      quantity of models: a few
      data store lookup for base user info
    [Diagram. Labels: Input queue; Model Scoring; Real Time Charting, Alerting; Cache + Data Store; Management and Control System; Twitter; Facebook; User baseline; Network; Trend Analysis; Deep Dive; Java, Scala]
  • 10. Application Design Pattern: Text Mining, Search
    Scaling challenges
      transaction growth
      some projects: very heavy computing for NLP parsing
      quickly score on tagged results
    [Diagram. Labels: Input queue; Model Scoring; Output Queue; Cache + Data Store; Management and Control System; Pages; Documents; Posts; Tweets; Java, Stanford NLP; Parse trees; Inverted indexes; Trending topics; Update Thesaurus; Docs ↔ Topics]
  • 11. Details on Departmental Scoring: Credit Card Fraud Detection System
    How to develop an example system?
      There is no public data. Private won't be shared.
    Generate the data (then can also test BIG DATA)
      Focus on 5 use cases of "normal" and 5 "fraud"
      Configuring architecture can be used for
        1) Performance testing for different requirements
        2) Pilot system, example included w/ Kamanja
    Train models, generate PMML for scoring
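The "generate the data" step could be sketched roughly as below. This is a minimal illustration, not the generator used for the Kamanja example: the profile names, field names, and every numeric parameter are hypothetical placeholders, and only three of the ten use cases are stubbed in.

```python
import random

# Hypothetical per-use-case profiles. All numbers are illustrative assumptions,
# not values taken from the talk.
PROFILES = {
    "normal_steady_state": {"mean_amount": 40.0, "new_geo_prob": 0.02, "is_fraud": 0},
    "normal_work_travel": {"mean_amount": 180.0, "new_geo_prob": 0.60, "is_fraud": 0},
    "fraud_physical_clone": {"mean_amount": 300.0, "new_geo_prob": 0.80, "is_fraud": 1},
}


def generate_transactions(n, fraud_rate=0.01, seed=0):
    """Return n synthetic transaction records, each labeled with its use case."""
    rng = random.Random(seed)  # seeded so a test run can be repeated
    normal = [name for name, p in PROFILES.items() if p["is_fraud"] == 0]
    fraud = [name for name, p in PROFILES.items() if p["is_fraud"] == 1]
    rows = []
    for txn_id in range(n):
        pool = fraud if rng.random() < fraud_rate else normal
        use_case = rng.choice(pool)
        profile = PROFILES[use_case]
        rows.append({
            "txn_id": txn_id,
            "use_case": use_case,
            # Exponential spend with the profile's mean amount.
            "amount": round(rng.expovariate(1.0 / profile["mean_amount"]), 2),
            "new_geography": int(rng.random() < profile["new_geo_prob"]),
            "is_fraud": profile["is_fraud"],
        })
    return rows
```

Because the generator is seeded, the same call can feed both a small pilot and a much larger run by changing only `n`.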
  • 12. Credit Card Fraud Detection System: FRAUD Use Cases
    Fraudster extracting value out of hacked card
      Likely a first "test" of CC info: iTunes or unmanned gas pump w/o camera
      Drain account up to CC limit in 15 min, up to 2-3 days
      Purchase things "easy to cash out or resell" (launder money):
        gift cards, gems, jewelry, small electronics easy to sell, burner phones
    F1) Elder abuse: either PII or CC info gets copied
      Fraudster opens first web or mobile account (surprising for grandmother)
      Higher credit limit, long time with no web/mobile
      Long time CC holder (high tenure), little spend variation
    F2) Hacker bought PII (Personally Identifiable Information)
      Fraudster used PII to apply for a new account
        new account likely has a lower credit limit
      Over 1st month, slowly changes PII to fraudster's to not alert victim
        use in "card not present" situations
  • 13. Credit Card Fraud Detection System: FRAUD Use Cases
    F3) Physical clone
      Fraudster may have bought CC info online ($1/account) or copied mag strip from the victim in the store.
      Fraudster card use can be concurrent with normal consumer use, or in a very different place and time zone
    F4) Rare Behavior (may be part of other use cases)
      Unusual time of day, geography, spending by type of goods / services
    F5) Risky Behavior: fraudster may visit blacklisted web page
      Fraudster is engaging with
      Geography changes are not plausible (noon in San Jose, 1pm in Hong Kong)
      Relate to past labeled cases of CC fraud.
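The "geography changes are not plausible" signal can be illustrated with a speed check between consecutive transactions on the same card. The sketch below is an assumption-laden example: the record fields (`lat`, `lon`, `utc`), the 900 km/h speed ceiling, and the 50 km tolerance for simultaneous use are all hypothetical choices, and the timestamps are assumed to be normalized to UTC.

```python
import math
from datetime import datetime, timezone


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def implausible_travel(prev, curr, max_kmh=900.0):
    """True if getting from prev to curr would need more than max_kmh (assumed ceiling)."""
    hours = (curr["utc"] - prev["utc"]).total_seconds() / 3600.0
    dist = haversine_km(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
    if hours <= 0:
        # Concurrent use: flag only if the two locations are clearly apart
        # (50 km is an arbitrary tolerance for this sketch).
        return dist > 50.0
    return dist / hours > max_kmh


san_jose = {"lat": 37.34, "lon": -121.89,
            "utc": datetime(2015, 10, 24, 19, 0, tzinfo=timezone.utc)}
hong_kong = {"lat": 22.32, "lon": 114.17,
             "utc": datetime(2015, 10, 24, 20, 0, tzinfo=timezone.utc)}
san_francisco = {"lat": 37.77, "lon": -122.42,
                 "utc": datetime(2015, 10, 24, 20, 0, tzinfo=timezone.utc)}
```

A feature like this would normally be one input among many to a fraud model rather than a rule on its own, since the physical-clone case (F3) can also show concurrent use in nearby places.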
  • 14. Credit Card Fraud Detection System: NORMAL Use Cases
    1) Steady State use: the CC use by these people is fairly consistent and stable. Can have a die vei
    2) New Card, 1st month: this example is set up to make it difficult to compare with fraudulently opened new cards.
      Spending may max out
    3) Young and starting singles or newly married.
      These people don't have much of a credit rating
      More likely to use web and mobile channels.
      More likely to wander to dangerous areas of the web.
      Likely to spend in a bigger array of categories
      Possibly many geographic locations
    4) Normal Case, Family
      Medium to higher income limit, many don't hit limit
      Low to moderate showing up in new geographies, or spending on new categories
    5) Work Travel: work in sales or consulting. New locations are no surprise.
      Higher spending limit and amounts, many flight, hotel, car rental, high mobile
  • 15. Pilot Project & Performance Testing: Credit Card Fraud Detection
    [Diagram of two example configurations built from Input queue, Model Scoring, and Real time Output boxes, plus a Cache + Data Store:
      1 Kafka, 1 Kamanja, 1 Kafka
      ~3 Kafka, 16 Kamanja, ~3 Kafka
    Note: Add Preprocessing Logic and HBase table lookup]
  • 16. Performance Testing, Model Node: Credit Card Fraud Detection
    Fields per record: testing network speed between nodes
      30, 120, 480 fields (yes, could go 10k, 100k)
    Single model complexity: testing compute load
      Small, Medium & Large (100, 2k, 32.5k elements)
    Preprocessing lookup tables: testing cache to HBase & network
      none, some
    Ensemble Models per score: testing compute & network
      1, 5, 20
    Number of Models in department:
      1, 10, 100
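The five test dimensions above form a small grid, and enumerating it is one way to plan the performance runs. The following is a sketch only; the dictionary keys are hypothetical names chosen for this example, while the values are the levels listed on the slide.

```python
from itertools import product

# Levels are taken from the slide; the key names are made up for this sketch.
GRID = {
    "fields_per_record": [30, 120, 480],
    "model_elements": [100, 2_000, 32_500],
    "preprocessing_lookup": ["none", "some"],
    "models_per_ensemble": [1, 5, 20],
    "models_in_department": [1, 10, 100],
}


def enumerate_configs(grid=GRID):
    """Return one dict per combination of levels (full factorial design)."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


configs = enumerate_configs()
print(len(configs))  # 3 * 3 * 2 * 3 * 3 = 162 combinations
```

Running all 162 combinations may be more than a pilot needs; a subset that varies one dimension at a time from a baseline would be a cheaper starting point.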
  • 17. Solution to Developer Questions (How Fast, How to Configure?)
    How many fields per record?          30, 120, 480       (SML)
    What model complexity?               100, 2k, 32.5k     (SML)
    Is data already preprocessed?        Yes, No            (YN)
    Average models / ensemble?           1, 5, 20           (SML)
    How many models in the department?   1, 10, 100         (SML)
    What language?                       PMML, Java, Scala

    (I want to create a table like…)
    Requirements   →  Then need configure         →  For speed rec/s
    S,S,Y,M,S         1 Kaf, 1 Kam, 1 Kaf            1.1mm
    M,L,Y,M,S         1 Kaf, 1 Kam, 1 Kaf            200K
    L,L,N,L,L         3 Kaf, 16 Kam, 1 Kaf, 3HB      1.6mm

    Generate Architecture and run an 80% relevant Pilot
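A lookup from requirement answers to a suggested configuration could be sketched as follows. The three rows are the examples from the slide, which itself presents them as a table the author would like to build, so the throughput figures should be read as illustrative. The function names are hypothetical, and reading "mm" as million and "3HB" as three HBase nodes is an assumption.

```python
# Size codes per question, using the levels from the slide.
SML_FIELDS = {30: "S", 120: "M", 480: "L"}
SML_COMPLEXITY = {100: "S", 2_000: "M", 32_500: "L"}
SML_ENSEMBLE = {1: "S", 5: "M", 20: "L"}
SML_DEPARTMENT = {1: "S", 10: "M", 100: "L"}

# Example rows from the slide: (node layout, records per second).
# "mm" is read as million and "3HB" as 3 HBase nodes (assumptions).
EXAMPLE_ROWS = {
    "S,S,Y,M,S": ("1 Kafka, 1 Kamanja, 1 Kafka", 1_100_000),
    "M,L,Y,M,S": ("1 Kafka, 1 Kamanja, 1 Kafka", 200_000),
    "L,L,N,L,L": ("3 Kafka, 16 Kamanja, 1 Kafka, 3 HBase", 1_600_000),
}


def requirement_code(fields, elements, preprocessed, per_ensemble, in_department):
    """Encode the five answers as the comma-separated code used on the slide."""
    return ",".join([
        SML_FIELDS[fields],
        SML_COMPLEXITY[elements],
        "Y" if preprocessed else "N",
        SML_ENSEMBLE[per_ensemble],
        SML_DEPARTMENT[in_department],
    ])


def suggest_architecture(fields, elements, preprocessed, per_ensemble, in_department):
    """Return (layout, records_per_sec) for a known row, or None if not measured."""
    code = requirement_code(fields, elements, preprocessed, per_ensemble, in_department)
    return EXAMPLE_ROWS.get(code)
```

Returning `None` for combinations that have not been measured keeps the table honest: a missing row is a prompt to run that test rather than to extrapolate.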
  • 18. Social Network Analysis: Example System Configuration. [Diagram with steps numbered 1–13. Components: Text or Twitter API; Java 1 and GUI; Kafka; Kamanja; Java 2 for features (sentiment or Stanford NLP); Java 3 for analysis; data store; Tomcat web service. Flow labels: Java calls the API and acts as Kafka producer; tweets returned in JSON; JSON tweets sent to Kafka; Kafka JSON to Kamanja; JSON with features saved in DB; Java, every "time window", queries the DB to aggregate (i.e. count (tags) by (tags) by..); aggregate query results returned to Java as JSON; JSON query results to Kafka; Matched_tags_per_text table results to Java 3 for scoring, with thresholds; JSON results of rule scoring, alert text; save results to DB (alerts table); Java 1 checks for updates to the alerts table; (13) Tomcat web service displays data and charts.]
  • 19. PMML Diagram: Predictive Model Markup Language. [Diagram. Training project phase: training & test data (batch) → data mining tool (PMML producer, 18 available) → File, Save As PMML → PMML file (full model specification). Production scoring project phase: PMML file + scoring data (real-time streaming) → scoring engine (Kamanja), the PMML consumer → output data has a new score field.]
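To make the producer/consumer split concrete, here is a minimal sketch of a PMML consumer for one very small case: a linear `RegressionModel`. The PMML document, its field names, and its coefficients are invented for illustration, and the parsing uses only the Python standard library. It is not how Kamanja consumes PMML, which the slides do not describe.

```python
import xml.etree.ElementTree as ET

NS = {"pmml": "http://www.dmg.org/PMML-4_2"}

# Hypothetical model file, standing in for what a data mining tool would save.
PMML_DOC = """<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="hypothetical fraud score, for illustration only"/>
  <DataDictionary numberOfFields="3">
    <DataField name="amount" optype="continuous" dataType="double"/>
    <DataField name="txn_last_hour" optype="continuous" dataType="double"/>
    <DataField name="score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="amount"/>
      <MiningField name="txn_last_hour"/>
      <MiningField name="score" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="amount" coefficient="0.001"/>
      <NumericPredictor name="txn_last_hour" coefficient="0.25"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def load_regression(pmml_text):
    """Read the intercept and per-field coefficients from the PMML text."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(record, intercept, coefficients):
    """Apply the linear model to one incoming record."""
    return intercept + sum(c * record[name] for name, c in coefficients.items())


intercept, coefficients = load_regression(PMML_DOC)
result = score({"amount": 1000.0, "txn_last_hour": 2.0}, intercept, coefficients)
print(result)  # 0.5 + 0.001 * 1000 + 0.25 * 2 = 2.0
```

The scoring code never needs to know which tool trained the model, only how to read the file, which is the abstraction the diagram describes.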
  • 20. Given industry fragmentation, PMML is a solution for data mining scoring
      PMML producers (18 data mining packages): R (Rattle, PMML)*, RapidMiner, KNIME*, Weka*, SAS Enterprise Miner
      PMML consumers (12 companies): Zementis, IBM SPSS, KNIME, MicroStrategy, SAS, Kamanja* (open source), Spark (MLlib)*
      * = Open Source
      Predictive models: Naïve Bayes, Neural Net, Regression, Rules, Scorecard, Sequence, SVM, Time Series, Trees
      Descriptive / other models: Association Rules, Cluster, K-Nearest Neighbor, Text Models, model ensembles & composition (e.g. Gradient Boosting)
  • 21. Summary
      Question) How to help an OSS pilot evaluation go faster?
      Answer)
      Develop "design patterns" for applications
      Pick a specific app (Credit Card Fraud Detection)
      Get data (end up generating it)
      Need to vary arch config (like performance testing)
      Given requirements, generate a multi-node example pilot system, involving many OSS components
      PMML can abstract the production step from model building
  • 22. Try out Kamanja. Download, Forums, Docs, Events: http://Kamanja.org White papers: http://kamanja.org/white-papers/
  • 23. Kamanja: One Speed Analysis
      Kamanja: 220k to 230k messages / second
      CONFIGURATION:
      • 16-core box, using solid-state disk
      • Sample tool to generate messages of size 1k (not being reduced)
      • Data mining uses 100s to 100k fields, not a 100-byte message
      • Kafka queue: 3 input queues, each queue has 8 partitions
      • Kamanja engine: using the remaining 12-13 cores
      • Not saving score results per record in this test
      SO WHAT? COMPARISON:
      • Storm is currently the lowest-latency Apache big data system
      • Storm integration got up to 90k to 100k messages / second for the same data
      • Kamanja is 2.4 times faster than Storm (225k / 95k) in this test
      • Spark Streaming works in mini-batches, with higher latency than Storm or Kamanja
      Why is Kamanja faster than Storm? Storm reads the data from the input queue (spout) and passes it to bolts. On each pass from spout to bolt, the data is serialized and deserialized. There is other overhead as well.
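The 2.4x figure follows from taking the midpoint of each reported throughput range. A short sketch of that arithmetic, using only the numbers on the slide (the `midpoint` helper is just for illustration):

```python
def midpoint(low, high):
    """Midpoint of a reported throughput range."""
    return (low + high) / 2


kamanja_rate = midpoint(220_000, 230_000)  # 225,000 messages / second
storm_rate = midpoint(90_000, 100_000)     # 95,000 messages / second

speedup = kamanja_rate / storm_rate
print(round(speedup, 1))  # 2.4

# 3 input queues with 8 partitions each
total_partitions = 3 * 8
print(total_partitions)  # 24
```

The unrounded ratio is about 2.37, so the slide's 2.4 is a rounding of a single test on one configuration rather than a general multiplier.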