
How to Create 80% of a Big Data Pilot Project

When evaluating Open Source Software, or other software of a certain size or complexity, organizations frequently want to conduct a Pilot project, or Proof of Concept (POC). This talk describes a process to shorten the Pilot by reusing configurations from performance testing as the starting configurations for the POC.

Published in: Data & Analytics

  1. Meet 80% of the Needs of a Pilot Project, With a CC Fraud Detection Example
     By Greg Makowski
     ACM Data Science Camp, Saturday 10/24/2015
     http://www.sfbayacm.org/event/silicon-valley-data-science-camp-2015
     Download, Forums, Docs, Events: http://Kamanja.org
     http://kamanja.org/white-papers/
     © 2015 ligaDATA, Inc. All Rights Reserved. October 2015
  2. Summary
     Question) How to help an OSS pilot evaluation go faster?
     Answer)
       Develop "design patterns" for applications
       Pick a specific app (Credit Card Fraud Detection)
       Get data (end up generating it)
       Need to vary arch config (like performance testing)
       Given requirements, generate a multi-node example pilot system, involving many OSS components
       PMML can abstract the production step from model building
  3. Problem
     When evaluating any new data mining or big data software, companies want to "try it out" and see how it meets their requirements. A common step is a pilot project.
     A pilot would commonly involve integration with related software systems.
     Open Source Software (OSS) may come with examples. Need an example "production system".
     Q) What can be done to shorten the time to finish a Pilot?
  4. Problem: Questions to be answered from Pilot
     How fast is it?
       It depends (yes, that is an annoying answer)
     How to configure the system with other OSS software?
       It depends (yes, that is an annoying answer)
  5. Problem: Questions to be answered from Pilot
     How fast is it?
       It depends (yes, that is an annoying answer)
       Show example configs with performance results
     How to configure the system with other OSS software?
       It depends (yes, that is an annoying answer)
       Consider different application "design patterns"
     How will the system grow as complexity grows?
       The answer is specific per design pattern
     How should DevOps monitor and manage?
  6. Kamanja Platform
     [Diagram. Labels: Data Sources; CDC, Logs, Apps; Databases; ESBs; Social; 3rd Party; Input Queues; Decision Engine; Decisioning; Admin Management; Storage; Data Store; Output Queues; Actions; Next Best Action; Batch Stores; Application Updates; Alerts & Notifications]
  7. See Kamanja.org, and github
     Kamanja is used as an example.
     The process in this talk is general and can be broadly applied to other OSS.
     Kamanja is a big data continuous decisioning system
       Apache license, available on github
  8. Application Design Pattern: Departmental Model Scoring Application
     Scaling challenges
       transaction growth and type (quantity & speed)
       model complexity (hybrid systems)
       quantity of models: 10's to 10k's
       for most models, most fields, need to access the data store for preprocessing
     Lambda Architecture: Combines Real time And Batch
     [Diagram. Labels: Financial; Log; Consumer; Business; Input queue; Model Scoring; Real time Output Queue; Cache + Data Store; Management and Control System; Preprocessing & Scores; Reporting; Analysis; PMML]
  9. Application Design Pattern: Social Network Analysis
     Scaling challenges
       transaction growth and source (quantity & speed)
       model: sentiment, graph
       quantity of models: a few
       data store lookup for base user info
     [Diagram. Labels: Twitter, Facebook, ...; Input queue; Model Scoring; Real Time Charting, Alerting; Cache + Data Store; Management and Control System; User baseline; Network; Trend Analysis; Deep Dive; Java, Scala]
  10. Application Design Pattern: Text Mining, Search
     Scaling challenges
       transaction growth
       some projects: very heavy computing for NLP parsing
       quickly score on tagged results
     [Diagram. Labels: Pages; Documents; Posts; Tweets; Input queue; Model Scoring; Output Queue; Cache + Data Store; Management and Control System; Java, Stanford NLP; Parse trees; Inverted indexes; Trending topics; Update Thesaurus; Docs <-> Topics]
  11. Details on Departmental Scoring: Credit Card Fraud Detection System
     How to develop an example system?
       There is no public data. Private won't be shared.
     Generate the data (then can also test BIG DATA)
       Focus on 5 use cases of "normal" and 5 "fraud"
       Configuring architecture can be used for
         1) Performance testing for different requirements
         2) Pilot system, example included w/ Kamanja
     Train models, generate PMML for scoring
  12. Credit Card Fraud Detection System: FRAUD Use Cases
     Fraudster extracting value out of hacked card
       Likely a first "test" of CC info: iTunes or unmanned gas pump w/o camera
       Drain account up to CC limit in 15 min, up to 2-3 days
       Purchase things "easy to cash out or resell" (launder money): gift cards, gems, jewelry, small electronics easy to sell, burner phones
     F1) Elder abuse: either PII or CC info gets copied
       Fraudster opens first web or mobile account (surprising for grandmother)
       Higher credit limit, long time with no web/mobile
       Long time CC holder (high tenure), little spend variation
     F2) Hacker bought PII (Personally Identifiable Information)
       Fraudster used PII to apply for a new account
       New account likely has a lower credit limit
       Over 1st month, slowly changes PII to the fraudster's, to not alert victim
       Use in "card not present" situations
  13. Credit Card Fraud Detection System: FRAUD Use Cases
     F3) Physical clone
       Fraudster may have bought CC info online ($1/account) or copied mag strip from the victim in the store.
       Fraudster card use can be concurrent with normal consumer use, or in a very different place and time zone
     F4) Rare Behavior (may be part of other use cases)
       Unusual time of day, geography, spending by type of goods / services
     F5) Risky Behavior: fraudster may visit blacklisted web page
       Fraudster is engaging with
       Geography changes are not plausible (noon in San Jose, 1pm in Hong Kong)
       Relate to past labeled cases of CC fraud.
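The "geography changes are not plausible" signal in F5 can be sketched as a travel-speed check between two consecutive card uses. This is a minimal illustration, not the rule used in the talk: the `Swipe` record, the 1000 km/h threshold, and the city coordinates are all assumptions, and both timestamps are assumed to be on one clock (UTC), one hour apart.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 1000.0  # assumed threshold, roughly airliner cruise speed


@dataclass
class Swipe:
    """One card use: when (timezone-aware) and where (degrees)."""
    when: datetime
    lat: float
    lon: float


def haversine_km(a: Swipe, b: Swipe) -> float:
    """Great-circle distance between two swipes, in kilometers."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def is_implausible(prev: Swipe, cur: Swipe, max_kmh: float = MAX_PLAUSIBLE_KMH) -> bool:
    """True when the card would have had to travel faster than max_kmh."""
    hours = (cur.when - prev.when).total_seconds() / 3600.0
    km = haversine_km(prev, cur)
    if hours <= 0:
        # Same instant (or out-of-order events): flag any real change of place.
        return km > 1.0
    return km / hours > max_kmh


# Approximate coordinates, for illustration only.
san_jose = Swipe(datetime(2015, 10, 24, 19, 0, tzinfo=timezone.utc), 37.34, -121.89)
hong_kong = Swipe(datetime(2015, 10, 24, 20, 0, tzinfo=timezone.utc), 22.32, 114.17)
print(is_implausible(san_jose, hong_kong))  # True
```

A check like this needs the previous transaction for the same card, which is one reason the design pattern includes a cache and data store lookup during preprocessing.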
  14. Credit Card Fraud Detection System: NORMAL Use Cases
     1) Steady State use: the CC use by these people is fairly consistent and stable. Can have a die vei
     2) New Card, 1st month: this example is set up to make it difficult to compare with fraudulently opened new cards.
       Spending may max out
     3) Young and starting singles or newly married.
       These people don't have much of a credit rating
       More likely to use web and mobile channels.
       More likely to wander to dangerous areas of the web.
       Likely to spend in a bigger array of categories
       Possibly many geographic locations
     4) Normal Case, Family
       Medium to higher income limit, many don't hit limit
       Low to moderate showing up in new geographies, or spending on new categories
     5) Work Travel: work in sales or consulting. New locations are no surprise.
       Higher spending limit and amounts, many flight, hotel, car rental, high mobile
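Since no public data exists, the use cases above have to be turned into generated transactions. The following is a minimal sketch of that idea, not the generator from the talk: the profile names follow three of the use cases, but every numeric parameter, the field names, and the city list are hypothetical.

```python
import random

# Hypothetical parameters. The talk names the use cases but not these numbers.
PROFILES = {
    "normal_steady_state": {"is_fraud": 0, "txns": (20, 40), "amount": (5.0, 120.0),
                            "web_share": 0.2, "cities": 1},
    "normal_work_travel":  {"is_fraud": 0, "txns": (40, 80), "amount": (20.0, 900.0),
                            "web_share": 0.5, "cities": 8},
    "fraud_elder_abuse":   {"is_fraud": 1, "txns": (5, 15), "amount": (200.0, 2000.0),
                            "web_share": 0.9, "cities": 2},
}

CITIES = ["San Jose", "Oakland", "Seattle", "Denver",
          "Chicago", "Boston", "Austin", "Miami"]


def generate_account(profile_name, account_id, rng):
    """Generate one account's transactions according to its profile."""
    p = PROFILES[profile_name]
    n_txns = rng.randint(*p["txns"])
    usual_cities = rng.sample(CITIES, p["cities"])
    rows = []
    for _ in range(n_txns):
        rows.append({
            "account_id": account_id,
            "profile": profile_name,
            "amount": round(rng.uniform(*p["amount"]), 2),
            "channel": "web" if rng.random() < p["web_share"] else "in_store",
            "city": rng.choice(usual_cities),
            "is_fraud": p["is_fraud"],
        })
    return rows


def generate_dataset(accounts_per_profile, seed=0):
    """Generate labeled transactions for every profile, reproducibly."""
    rng = random.Random(seed)
    rows = []
    account_id = 0
    for profile_name in PROFILES:
        for _ in range(accounts_per_profile):
            rows.extend(generate_account(profile_name, account_id, rng))
            account_id += 1
    return rows


dataset = generate_dataset(accounts_per_profile=2, seed=1)
print(len(dataset), "transactions for", len({r["account_id"] for r in dataset}), "accounts")
```

Raising `accounts_per_profile` is also how the same generator could scale up to the volumes needed for performance testing.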
  15. Pilot Project & Performance Testing: Credit Card Fraud Detection
     [Diagram. Two configurations of Input queue, Model Scoring, and Real time Output nodes sharing a Cache + Data Store: "1 Kafka, 1 Kamanja, 1 Kafka" and "~3 Kafka, 16 Kamanja, ~3 Kafka". Annotation: Add Preprocessing Logic and HBase table lookup]
  16. Performance Testing: Model Node, Credit Card Fraud Detection
     Fields per record: testing network speed between nodes
       30, 120, 480 fields (yes, could go 10k, 100k)
     Single model complexity: testing compute load
       Small, Medium & Large (100, 2k, 32.5k elements)
     Preprocessing lookup tables: testing cache to HBase & network
       none, some
     Ensemble Models per score: testing compute & network
       1, 5, 20
     Number of Models in department:
       1, 10, 100
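The five test dimensions above multiply into a full test matrix. A small sketch of enumerating it follows; the dictionary keys are names chosen here for illustration, while the values are the ones listed on the slide.

```python
from itertools import product

# Values from the slide; key names are illustrative.
TEST_DIMENSIONS = {
    "fields_per_record": [30, 120, 480],
    "model_elements": [100, 2000, 32500],
    "preprocessing_lookup": ["none", "some"],
    "models_per_ensemble": [1, 5, 20],
    "models_in_department": [1, 10, 100],
}


def build_test_matrix(dimensions=TEST_DIMENSIONS):
    """Return one dict per combination of the test dimensions."""
    names = list(dimensions)
    return [dict(zip(names, combo)) for combo in product(*dimensions.values())]


matrix = build_test_matrix()
print(len(matrix))  # 3 * 3 * 2 * 3 * 3 = 162 combinations
```

In practice only a subset of these combinations would likely be run on each cluster configuration; the enumeration just makes the coverage explicit.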
  17. Solution to Developer Questions (How Fast, How to Configure?)
     How many fields per record?            30, 120, 480         (SML)
     What model complexity?                 100, 2k, 32.5k       (SML)
     Is data already preprocessed?          Yes, No              (YN)
     Average models / ensemble?             1, 5, 20             (SML)
     How many models in the department?     1, 10, 100           (SML)
     What language?                         PMML, Java, Scala

     (I want to create a table like…)
     Requirements    Then need to configure         For speed (rec/s)
     S,S,Y,M,S       1 Kaf, 1 Kam, 1 Kaf            1.1mm
     M,L,Y,M,S       1 Kaf, 1 Kam, 1 Kaf            200K
     L,L,N,L,L       3 Kaf, 16 Kam, 1 Kaf, 3 HB     1.6mm
     (Kaf = Kafka, Kam = Kamanja, HB = HBase)

     Generate Architecture and run an 80% relevant Pilot
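The requirements-to-configuration table can be sketched as a simple lookup. The three rows below are copied from the slide, which presents them as the kind of table the speaker wants to create, so they should be read as illustrative rather than as published benchmarks. The function and key names are assumptions made for this sketch.

```python
# Map raw answers to the S/M/L codes used on the slide.
LEVELS = {
    "fields_per_record": {30: "S", 120: "M", 480: "L"},
    "model_elements": {100: "S", 2000: "M", 32500: "L"},
    "models_per_ensemble": {1: "S", 5: "M", 20: "L"},
    "models_in_department": {1: "S", 10: "M", 100: "L"},
}

# Rows as they appear on the slide: requirements -> (configuration, speed in rec/s).
EXAMPLE_CONFIGS = {
    ("S", "S", "Y", "M", "S"): ("1 Kafka, 1 Kamanja, 1 Kafka", "1.1mm"),
    ("M", "L", "Y", "M", "S"): ("1 Kafka, 1 Kamanja, 1 Kafka", "200K"),
    ("L", "L", "N", "L", "L"): ("3 Kafka, 16 Kamanja, 1 Kafka, 3 HBase", "1.6mm"),
}


def encode_requirements(fields, elements, preprocessed, per_ensemble, in_department):
    """Turn the five answers into the slide's requirement code."""
    return (
        LEVELS["fields_per_record"][fields],
        LEVELS["model_elements"][elements],
        "Y" if preprocessed else "N",
        LEVELS["models_per_ensemble"][per_ensemble],
        LEVELS["models_in_department"][in_department],
    )


def suggest_config(code):
    """Return (configuration, speed) for a tested code, or None if untested."""
    return EXAMPLE_CONFIGS.get(code)


code = encode_requirements(30, 100, True, 5, 1)
print(code, "->", suggest_config(code))
```

A fuller version would fall back to the nearest tested configuration when an exact match is missing; returning `None` keeps this sketch honest about what has actually been measured.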
  18. Social Network Analysis: Example System Configuration
     [Diagram. Components: Text or Twitter API; Java 1 and GUI; Kafka; Kamanja; Java 2 for Features (Sentiment or Stanford NLP); Java 3 for analysis; Data Store; Alerts table; Tomcat web service. The arrows are numbered 1 to 13; the numbering cannot be matched to the labels in this transcript. Arrow labels, in transcript order:]
       Java calls API, and Kafka producer
       Tweets returned in JSON
       JSON tweets sent to Kafka
       Kafka JSON to Kamanja
       JSON with features saved in DB
       JAVA: Every "time window", queries the DB to aggregate (i.e. count (tags) by (tags) by..)
       JSON returns the aggregate query results to JAVA
       JSON query results to Kafka
       JSON results of rule scoring, alert text
       Tomcat web service displays data and charts
       Matched_tags_per_text table results to Java 3 for scoring, with thresholds
       Save results to DB
       JAVA 1: check for updates to the alerts table
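The "every time window, aggregate counts by tag, then score against thresholds" step in this configuration can be sketched in a few lines. The deck implements this in Java against a database; the Python below only illustrates the logic, and the JSON field names `ts` and `tags`, the 60 second window, and the threshold are assumptions.

```python
import json
from collections import Counter, defaultdict


def count_tags_by_window(json_messages, window_seconds=60):
    """Count tag occurrences per time window, keyed by window start time."""
    counts = defaultdict(Counter)
    for raw in json_messages:
        msg = json.loads(raw)
        window_start = int(msg["ts"] // window_seconds) * window_seconds
        counts[window_start].update(msg["tags"])
    return counts


def find_alerts(counts, threshold):
    """Return (window_start, tag, count) for every tag at or above the threshold."""
    return sorted(
        (window, tag, n)
        for window, tag_counts in counts.items()
        for tag, n in tag_counts.items()
        if n >= threshold
    )


messages = [
    '{"ts": 5, "tags": ["fraud", "bank"]}',
    '{"ts": 30, "tags": ["fraud"]}',
    '{"ts": 65, "tags": ["bank"]}',
]
window_counts = count_tags_by_window(messages)
print(find_alerts(window_counts, threshold=2))  # [(0, 'fraud', 2)]
```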
  19. PMML Diagram (Predictive Model Markup Language)
     Training Project Phase
       Training & test data (batch) -> Data Mining Tool, a PMML Producer (18 available) -> File, Save As PMML -> PMML File (full model specification)
     Production Scoring Project Phase
       PMML File + Scoring data (real time streaming) -> Scoring Engine (Kamanja), a PMML Consumer -> Output data has new score field
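The producer/consumer split can be illustrated with a toy PMML document and a few lines of scoring code. This is only a sketch of the idea: the model, its field names, and its coefficients are invented for illustration, and the reader handles just one intercept plus numeric predictors of a `RegressionModel`, far less than a real PMML consumer such as Kamanja supports.

```python
import xml.etree.ElementTree as ET

# A hand-written toy model standing in for a file exported by a data mining tool.
PMML_DOC = """<?xml version="1.0"?>
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="toy fraud score, illustrative coefficients"/>
  <DataDictionary numberOfFields="3">
    <DataField name="amount" optype="continuous" dataType="double"/>
    <DataField name="is_web" optype="continuous" dataType="double"/>
    <DataField name="score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="amount"/>
      <MiningField name="is_web"/>
      <MiningField name="score" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="0.1">
      <NumericPredictor name="amount" coefficient="0.001"/>
      <NumericPredictor name="is_web" coefficient="0.25"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

NS = {"pmml": "http://www.dmg.org/PMML-4_2"}


def load_regression(pmml_text):
    """Read the intercept and coefficients from the first RegressionTable."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(record, model):
    """Apply the linear model to one record and return the new score field."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_regression(PMML_DOC)
print(score({"amount": 400.0, "is_web": 1.0}, model))
```

The scoring side never sees the training tool or the training data, only the model file, which is the sense in which PMML separates the production step from model building.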
  20. Given industry fragmentation, PMML is a solution for Data Mining scoring
     PMML Producers (18 data mining packages)
       •  R (Rattle, PMML)*
       •  RapidMiner
       •  KNIME*
       •  Weka*
       •  SAS Enterprise Miner
     PMML Consumers (12 co)
       •  Zementis
       •  IBM SPSS
       •  KNIME
       •  MicroStrategy
       •  SAS
       •  Kamanja* (Open Source)
       •  Spark (MLlib)*
     * = Open Source
     PREDICTIVE: Naïve Bayes, Neural Net, Regression, Rules, Scorecard, Sequence, SVM, Time Series, Trees
     DESCRIPTIVE / OTH: Association Rules; Cluster, K-Nearest Nb; Text Models; model ensembles & composition (e.g. Gradient Boosting)
  21. Summary
     Question) How to help an OSS pilot evaluation go faster?
     Answer)
       Develop "design patterns" for applications
       Pick a specific app (Credit Card Fraud Detection)
       Get data (end up generating it)
       Need to vary arch config (like performance testing)
       Given requirements, generate a multi-node example pilot system, involving many OSS components
       PMML can abstract the production step from model building
  22. Try out Kamanja
     Download, Forums, Docs, Events: http://Kamanja.org
     http://kamanja.org/white-papers/
  23. Kamanja: One Speed Analysis
     Kamanja: 220k to 230k messages / second
     CONFIGURATION:
       •  16 core box, using Solid State Disc
       •  Sample Tool to generate messages of size 1k (not being reduced)
       •  Data Mining uses 100's to 100k fields, not a 100 byte message
       •  Kafka Queue
            •  3 input queues, each queue has 8 partitions
       •  Kamanja Engine
            •  Using the remaining 12-13 cores
            •  Not saving score results per record in this test
     SO WHAT? COMPARISON:
       •  Storm is currently the lowest latency Apache big data system
       •  Storm integration, got up to 90k to 100k for same data
       •  Kamanja is 2.4 times faster than Storm = (225k/95k) in this test
       •  Spark streaming is with mini-batches, with higher latency than Storm or Kamanja
     Why is Kamanja faster than Storm?
       Storm reads the data from the input queue (spout) and passes that to Bolts. On each pass from spout to bolt, the data is serialized and deserialized. There is other overhead.
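The 2.4x figure is the ratio of the midpoints of the two measured ranges. A short check of that arithmetic, which also shows the spread implied by the endpoints of the ranges:

```python
# Throughput ranges quoted on the slide, in messages per second.
kamanja_range = (220_000, 230_000)
storm_range = (90_000, 100_000)


def midpoint(value_range):
    """Middle of a (low, high) range."""
    return sum(value_range) / 2


speedup = midpoint(kamanja_range) / midpoint(storm_range)  # 225k / 95k
lowest = kamanja_range[0] / storm_range[1]                 # worst case for Kamanja
highest = kamanja_range[1] / storm_range[0]                # best case for Kamanja

print(f"{speedup:.2f}x (range {lowest:.2f}x to {highest:.2f}x)")
# 2.37x (range 2.20x to 2.56x)
```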
