Clinical Data Classification of Alzheimer's Disease


  1. Alzheimer's Disease - Clinical Data Classification
     By George Kalangi and Venkata Gopi
  2. Overview:
     • Introduction
     • Analysis of commonly used terms and explanation of the data sets
     • Overall programming process
     • Generating a merged file with CDGLOBAL
     • Generation of files for future status prediction
     • Data preprocessing
     • Classification algorithms used on the data
     • Analysis of the output data from WEKA
  3. Introduction
     • What is Alzheimer's disease?
       – A brain disorder, and the most common form of dementia
       – Dementia is a term for the loss of memory and other intellectual
         abilities serious enough to interfere with daily life
     • Clinical Dementia Rating (0, 0.5, 1, 2, 3):
       – 0 = Normal
       – 0.5 = Questionable dementia
       – 1.0 to 3.0 = Mild to severe dementia
  4. Datasets (60 files)
     • 56 comma separated files
     • 1 file – Data Dictionary (explains the terms used)
     • 1 file – Clinical Dementia Rating (has CDGLOBAL)
     • The rest: assessments, data definitions, and others such as visits,
       identified by abbreviations
  5. Environment Setup
     • Programming languages used for the project: PHP, MySQL, Java,
       PostgreSQL
     • Tools used: WEKA (Waikato Environment for Knowledge Analysis),
       MySQL Workbench, and NetBeans
     • Front end: PHP
     • Back end: MySQL
  6. Overall Programming Process
     • A selected dataset (FAQ) is given by the user.
     • At the back end, MySQL queries create the required tables and insert
       the required data into the corresponding tables.
     • Thereafter the required operations are performed on the tables.
     • Final output files are stored in .csv format.
  7. Generating a merged file with CDGLOBAL (for current status)
     • Given the input datasets (e.g. adni_faq_2011-01-20.csv and
       adni_cdr_2011-01-20.csv), the RIDs and VISCODEs of faq and cdr are
       compared, and on a match the CDGLOBAL column of the cdr file is
       merged into the faq file.
     • During the merge, rows with CDGLOBAL = -1 and with VISCODEs f, nv,
       and uns1 are trimmed off.
     • The result file is "Merged_dataset_file.csv".
  8. Query used for generating the merged file:

     SELECT f.ID, f.RID, f.VISCODE, f.EXAMDATE, f.FAQSOURCE, f.FAQFINAN,
            f.FAQFORM, f.FAQSHOP, f.FAQGAME, f.FAQBEVG, f.FAQMEAL,
            f.FAQEVENT, f.FAQTV, f.FAQREM, f.FAQTRAVL, f.FAQTOTAL,
            cdr.CDGLOBAL
     FROM cdr, faq f
     WHERE cdr.RID = f.RID
       AND cdr.VISCODE = f.VISCODE
       AND cdr.CDGLOBAL NOT IN (-1);
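Slide 28 notes that the files were first processed through JDBC and MySQL before the PHP front end was adopted. A minimal JDBC sketch of that first approach, running the merge query and writing the result to Merged_dataset_file.csv; the connection URL, credentials, and the shortened column list are assumptions for illustration:

```java
// Minimal JDBC sketch: run the merge query and export it as CSV.
// URL/credentials are assumed; the column list is shortened for brevity.
import java.io.FileWriter;
import java.io.PrintWriter;
import java.sql.*;

public class ExportMerged {
    public static void main(String[] args) throws Exception {
        String sql = "SELECT f.ID, f.RID, f.VISCODE, f.FAQTOTAL, cdr.CDGLOBAL "
                   + "FROM cdr, faq f "
                   + "WHERE cdr.RID = f.RID AND cdr.VISCODE = f.VISCODE "
                   + "AND cdr.CDGLOBAL NOT IN (-1)";
        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost/adni", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql);
             PrintWriter out =
                 new PrintWriter(new FileWriter("Merged_dataset_file.csv"))) {
            ResultSetMetaData md = rs.getMetaData();
            int cols = md.getColumnCount();
            // Header row, then one comma-separated line per result row.
            for (int i = 1; i <= cols; i++)
                out.print(md.getColumnLabel(i) + (i < cols ? "," : "\n"));
            while (rs.next())
                for (int i = 1; i <= cols; i++)
                    out.print(rs.getString(i) + (i < cols ? "," : "\n"));
        }
    }
}
```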
  9. Generation of files for future status prediction
     • The prediction dataset is generated by mapping the first visit to the
       6-month class, the 6-month visit to the 12-month class, and so on.
     • SQL queries are run on the merged file to separate the 6-month
       time-interval classes.
     • Files generated:
       – File_dataset_m06.csv
       – File_dataset_m12.csv, and so on
  10. Query used for generating the class files:

      SELECT v.ID AS ID, v.RID AS RID, v.VISCODE, v.EXAMDATE, v.FAQSOURCE,
             v.FAQFINAN, v.FAQFORM, v.FAQSHOP, v.FAQGAME, v.FAQBEVG,
             v.FAQMEAL, v.FAQEVENT, v.FAQTV, v.FAQREM, v.FAQTRAVL,
             v.FAQTOTAL, m12.CDGLOBAL
      FROM `table_adni_faq_2011-01-20_m06` v,
           `table_adni_faq_2011-01-20_m12` m12
      WHERE v.RID = m12.RID;
  11. Preprocessing
      • After we get the required .csv files, we use WEKA to preprocess the
        data.
      • Load the file into WEKA.
      • Apply the filter "weka.filters.unsupervised.attribute.Remove" to
        trim off the unused fields.
      • Apply "NumericToNominal" to convert all the values in the data to
        nominal before feeding them to a classifier algorithm.
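A minimal sketch of the same two preprocessing steps through the WEKA Java API rather than the GUI; the input file name, the removed attribute indices, and the choice of CDGLOBAL as the last attribute are assumptions:

```java
// Minimal WEKA preprocessing sketch: Remove + NumericToNominal.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToNominal;
import weka.filters.unsupervised.attribute.Remove;

public class Preprocess {
    public static void main(String[] args) throws Exception {
        // Load the merged CSV produced by the SQL step.
        Instances data = DataSource.read("Merged_dataset_file.csv");

        // Trim off unused fields, e.g. ID, RID, VISCODE, EXAMDATE
        // (indices assumed here).
        Remove remove = new Remove();
        remove.setAttributeIndices("1-4");
        remove.setInputFormat(data);
        data = Filter.useFilter(data, remove);

        // Convert all numeric attributes to nominal before classification.
        NumericToNominal toNominal = new NumericToNominal();
        toNominal.setAttributeIndices("first-last");
        toNominal.setInputFormat(data);
        data = Filter.useFilter(data, toNominal);

        // The last attribute (CDGLOBAL) is the class.
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println(data.toSummaryString());
    }
}
```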
  12. Classification Algorithms Used
      • The Classify panel enables the user to apply classification and
        regression algorithms (indiscriminately called classifiers in WEKA)
        to the dataset and to estimate the accuracy of the resulting
        predictive model.
      • J48, which uses the C4.5 algorithm (a successor of ID3)
      • The Naïve Bayesian classification algorithm
  13. What is classification?
      • Given a collection of records (the training set), each record
        contains a set of attributes, one of which is the class.
      • A test set is used to determine the accuracy of the model. Usually
        the given data set is divided into training and test sets, with the
        training set used to build the model and the test set used to
        validate it.
      • Example: if the items in a house are not classified, we cannot
        arrange them. We classify the items by their usage as cooking items,
        decoration items, etc., so that we can arrange them accordingly and
        use them in an efficient and easier way.
  14. Decision Tree Classification Task
      [Figure: a decision tree on the classic tax-cheat example, splitting
      on Refund (Yes/No), MarSt (Single, Divorced / Married), and TaxInc
      (< 80K / > 80K); applying it to the test record assigns Cheat = "No".]
  15. Decision Tree
      [Figure: the same decision tree, with the test data traced from the
      root to a leaf.]
  16. J48 uses the C4.5 algorithm
      • Decision trees represent a supervised approach to classification.
      • Decision trees are a classic way to represent information from a
        machine learning algorithm, and offer a fast and powerful way to
        express structures in data.
      • A decision tree is a simple structure where non-terminal nodes
        represent tests on one or more attributes and terminal nodes reflect
        decision outcomes.
      • The basic algorithm recursively splits until each leaf is pure,
        meaning that the data has been categorized as close to perfectly as
        possible. This process ensures maximum accuracy on the training
        data.
      • The latest public domain implementation of Quinlan's model is C4.5.
        The WEKA classifier package has its own version of C4.5 known as
        J48.
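A minimal sketch of building a J48 tree on the preprocessed data with the WEKA Java API; the file name and the last attribute as class are assumptions, and the -C/-M options shown are WEKA's defaults:

```java
// Minimal J48 (C4.5) training sketch.
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("Merged_dataset_file.csv");
        data.setClassIndex(data.numAttributes() - 1); // CDGLOBAL as class

        J48 tree = new J48();
        // Confidence factor and minimum instances per leaf (WEKA defaults).
        tree.setOptions(new String[] {"-C", "0.25", "-M", "2"});
        tree.buildClassifier(data);

        // Print the learned decision tree.
        System.out.println(tree);
    }
}
```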
  17. Why the decision tree algorithm?
      • Advantages:
        – Inexpensive to construct
        – Easy to interpret for small-sized trees
        – Accuracy is comparable to other classification techniques for
          many simple data sets
        – There can be more than one possible tree for the same data
      • Disadvantages:
        – Underfitting: when the model is too simple, both training and
          test errors are large
  18. All about Cross Validation
      • We perform cross validation when the amount of data is small and we
        need independent training and test sets drawn from it.
      • It is important that each class is represented in its actual
        proportions in the training and test sets: stratification.
      • An important technique is stratified 10-fold cross validation,
        where the instance set is divided into 10 folds.
      • We run 10 iterations, each time taking a different single fold for
        testing and the rest for training.
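A minimal sketch of stratified 10-fold cross validation with WEKA's Evaluation class; the file name and class index are assumptions (for a nominal class attribute, crossValidateModel stratifies the folds itself):

```java
// Minimal stratified 10-fold cross-validation sketch.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("Merged_dataset_file.csv");
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        // 10 iterations: each fold is used once for testing,
        // the other nine for training.
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString()); // confusion matrix
    }
}
```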
  19. Evaluation
      • Metrics for performance evaluation
        – How do we evaluate the performance of a model?
      • Methods for model comparison
        – How do we compare the relative performance of competing models?
  20. Metrics for Performance Evaluation: Confusion Matrix
      • A confusion matrix contains information about the actual and
        predicted classifications done by a classification system, and
        performance is commonly evaluated using the data in the matrix.
        For a two-class classifier it has the form:

                           Predicted positive    Predicted negative
        Actual positive    True Positives (TP)   False Negatives (FN)
        Actual negative    False Positives (FP)  True Negatives (TN)

      • We get a confusion matrix after supplying data to a classifier.
      • Based on the confusion matrix we can evaluate the classifier using
        measures such as precision, F-measure, accuracy, and recall.
  21. Example
      • Suppose there is a sample of 27 animals: 8 cats, 6 dogs, and 13
        rabbits.
      • Each column of the matrix represents the instances in a predicted
        class, while each row represents the instances in an actual class.
      • We can see from the matrix that the system in question has trouble
        distinguishing between cats and dogs, but can make the distinction
        between rabbits and other types of animals pretty well.
      • All correct guesses are located on the diagonal of the table, so it
        is easy to visually inspect the table for errors: they appear as
        non-zero values outside the diagonal.
  22. Limitation of Accuracy
      • Consider a 2-class problem:
        – Number of class 0 examples = 9990
        – Number of class 1 examples = 10
      • If the model predicts everything to be class 0, accuracy is
        9990/10000 = 99.9%. The accuracy is misleading because the model
        does not detect a single class 1 example.
      • Accuracy therefore has disadvantages as a performance estimate. If
        there were 95 cats and only 5 dogs in the data set, the classifier
        could easily be biased into classifying all samples as cats. The
        overall accuracy would be 95%, but in practice the classifier would
        have a 100% recognition rate for the cat class and a 0% recognition
        rate for the dog class, so you will probably want to look at the
        other numbers. ROC area, the area under the ROC curve, is also
        taken as a preferred measure.
  23. Metrics for Evaluation
      • Accuracy: the accuracy (AC) is the proportion of the total number
        of predictions that were correct, i.e. what percentage of people
        were correctly classified. It is determined using the equation:

            Accuracy = (TP + TN) / (TP + TN + FP + FN)

        where the denominator N = TP + TN + FP + FN is the total number of
        predictions.
      • Precision: precision (P) is the proportion of the predicted
        positive cases that were correct. Of all the people classified as
        demented, what percentage is actually demented? It is calculated
        using the equation:

            Precision = TP / (TP + FP)
  24. Evaluation
      • F-measure: the harmonic mean of precision and recall:

            F-measure = 2 * TP / (2 * TP + FP + FN)

      • Recall: recall is the ratio of the number of true positives to the
        sum of true positives and false negatives. It is calculated using
        the equation:

            Recall = TP / (TP + FN)
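A minimal sketch computing the four measures from confusion-matrix counts; the TP/FP/FN/TN values are made-up numbers for illustration:

```java
// Minimal metrics sketch from confusion-matrix counts.
public class Metrics {
    public static void main(String[] args) {
        double tp = 40, fp = 10, fn = 5, tn = 45; // hypothetical counts

        double accuracy  = (tp + tn) / (tp + tn + fp + fn);
        double precision = tp / (tp + fp);
        double recall    = tp / (tp + fn);
        // F-measure: harmonic mean of precision and recall.
        double fMeasure  = 2 * tp / (2 * tp + fp + fn);

        System.out.printf("Accuracy=%.3f Precision=%.3f Recall=%.3f F=%.3f%n",
                          accuracy, precision, recall, fMeasure);
    }
}
```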
  25. Methods for Model Comparison
      ROC (Receiver Operating Characteristic)
      • Developed in the 1950s in signal detection theory to analyze noisy
        signals; it characterizes the trade-off between positive hits and
        false alarms.
      • The ROC curve plots the TP rate (on the y-axis) against the FP rate
        (on the x-axis).
  26. Using ROC for Model Comparison
      [Figure: two ROC curves; M1 is better for small FPR, M2 is better for
      large FPR.]
      A rough guide for classifying the accuracy of a diagnostic test by
      the area under the ROC curve is the traditional academic point
      system:
      • 0.90–1.00 = excellent (A)
      • 0.80–0.90 = good (B)
      • 0.70–0.80 = fair (C)
      • 0.60–0.70 = poor (D)
      • 0.50–0.60 = fail (F)
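A minimal sketch of comparing two models by area under the ROC curve, using cross-validated WEKA Evaluations; the file name and class index are assumptions:

```java
// Minimal model-comparison sketch by AUC.
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareByAuc {
    // 10-fold cross-validated AUC, averaged over class values.
    static double auc(Classifier c, Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(c, data, 10, new Random(1));
        return eval.weightedAreaUnderROC();
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("Merged_dataset_file.csv");
        data.setClassIndex(data.numAttributes() - 1);

        System.out.printf("J48 AUC        = %.3f%n", auc(new J48(), data));
        System.out.printf("NaiveBayes AUC = %.3f%n",
                          auc(new NaiveBayes(), data));
    }
}
```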
  27. Naïve Bayes
      • A simple probabilistic classifier based on applying Bayes' theorem
        with independence assumptions: a naive Bayes classifier assumes
        that the presence (or absence) of a particular feature of a class
        is unrelated to the presence (or absence) of any other feature.
      • For example, a fruit may be considered to be an apple if it is red,
        round, and about 4" in diameter. Even if these features depend on
        each other or on the other features, a naive Bayes classifier
        considers all of these properties to contribute independently to
        the probability that the fruit is an apple.
      • An advantage of the naive Bayes classifier is that it requires only
        a small amount of training data to estimate the parameters (the
        means and variances of the variables) needed for classification.
        Because the variables are assumed independent, only the variances
        of the variables for each class need to be determined, not the
        entire covariance matrix. It is best suited to attributes that are
        independent, and it is very simple and very fast.
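A minimal sketch of training WEKA's NaiveBayes on the same data and inspecting its class-membership probabilities; the file name and class index are assumptions:

```java
// Minimal NaiveBayes training sketch.
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainNaiveBayes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("Merged_dataset_file.csv");
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data); // estimates per-class attribute distributions

        // Class-membership probabilities for the first instance.
        double[] dist = nb.distributionForInstance(data.instance(0));
        for (int i = 0; i < dist.length; i++) {
            System.out.printf("P(%s) = %.3f%n",
                              data.classAttribute().value(i), dist[i]);
        }
    }
}
```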
  28. Challenges Faced
      • Initially all the data files were processed using JDBC and MySQL,
        which proved cumbersome whenever a different dataset was used.
        Hence a PHP-based MySQL front end, generalized for all datasets, is
        used.
      • Tables were initially created by hand for loading the data; later
        this was done with file-operation functions.
      • All the MySQL commands were initially run sequentially; later this
        was enhanced using PHP as the front end.
      • Initially the J48 tree could not process the data because the
        values were numeric. This was solved by discretization
        (NumericToNominal) of the CDGLOBAL column.
  29. Preprocess Output (screenshot)
  30. Result file for current status, J48 (screenshot)
  31. Current status, Naïve Bayes (screenshot)
  32. Future status, J48 (screenshot)
  33. Future status, Naïve Bayes (screenshot)
  34. MMSE, J48 (screenshot)
  35. References:
      • http://kent.dl.sourceforge.net/project/weka/documentation/3.6.x/WekaManual-3-6-2.pdf
      • http://www.dfki.de/~kipp/seminar_ws0607/reports/RossenDimov.pdf
      • http://stackoverflow.com/questions/2903933/how-to-interpret-weka-classification
      • http://www.slideshare.net/dataminingtools/weka-credibility-evaluating-whats-been-learned
  36. Thank you
