Alzheimer's Disease - Clinical Data Classification

By George Kalangi and Venkata Gopi
Overview:
• Introduction
• Analysis of commonly used terms and explanation of data sets
• Overall programming process
• Generating a merged file with CDGLOBAL
• Generation of files for future status prediction
• Data preprocessing
• Classification algorithms used on the data
• Analysis of the output data from WEKA
Introduction
• What is Alzheimer's Disease?
• A brain disorder
• The most common form of dementia
– Dementia is a term for the loss of:
• Memory
• Other intellectual abilities
• Serious enough to interfere with daily life
• Clinical Dementia Rating (0, 0.5, 1, 2, 3)
– 0: Normal
– 0.5: Questionable dementia
– 1.0 to 3.0: Mild to severe dementia
Datasets (60 Files)
• 56 comma-separated files
• 1 file: Data Dictionary (explains the terms used)
• 1 file: Clinical Dementia Rating (has CDGLOBAL)
• The rest:
– Assessments
– Data definitions
– Other files, such as visits, which use abbreviations
Environment Setup
• Programming languages used for the project are PHP, MySQL, Java, and PostgreSQL
• Tools used are WEKA (Waikato Environment for Knowledge Analysis), MySQL Workbench, and NetBeans
– Front end: PHP
– Back end: MySQL
Overall Programming Process
• A selected dataset (FAQ) is given by the user.
• At the back end, MySQL queries are defined to create the required tables and to insert the required data into the corresponding tables.
• Thereafter, the required operations are performed on the tables.
• Final output files are stored in .csv format.
Generating a merged file with CDGLOBAL (for current status)
• For the given input dataset (e.g. adni_faq_2011-01-20.csv) and the adni_cdr_2011-01-20.csv file:
– the RIDs and VISCODEs of faq and cdr are compared, and based on that the CDGLOBAL column of the cdr file is merged into the faq file.
• During the merge, rows whose CDGLOBAL is -1 and rows with the VISCODEs f, nv, uns1 are trimmed off.
The result file is "Merged_dataset_file.csv".
Query used for generating the merged file:

Select f.cID, f.RID, f.VISCODE, f.EXAMDATE, f.FAQSOURCE, f.FAQFINAN, f.FAQFORM, f.FAQSHOP, f.FAQGAME, f.FAQBEVG, f.FAQMEAL, f.FAQEVENT, f.FAQTV, f.FAQREM, f.FAQTRAVL, f.FAQTOTAL, cdr.cdglobal
from cdr, faq f
where cdr.rid = f.rid
and cdr.VISCODE = f.VISCODE
and cdr.cdglobal not in (-1);
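The join logic behind this query can be sketched in plain Python. This is an illustrative sketch only, not the project's actual PHP/MySQL code: small in-memory row lists stand in for the real adni_faq and adni_cdr files, and the sample values are invented.

```python
# Sketch of the merge step, assuming toy in-memory rows in place of the
# real adni_faq / adni_cdr CSV files (values are illustrative).

faq_rows = [
    {"RID": 1, "VISCODE": "bl",  "FAQTOTAL": 3},
    {"RID": 1, "VISCODE": "m06", "FAQTOTAL": 5},
    {"RID": 2, "VISCODE": "bl",  "FAQTOTAL": 0},
]
cdr_rows = [
    {"RID": 1, "VISCODE": "bl",  "CDGLOBAL": 0.5},
    {"RID": 1, "VISCODE": "m06", "CDGLOBAL": -1},   # -1 is trimmed off
    {"RID": 2, "VISCODE": "bl",  "CDGLOBAL": 0.0},
]

# Index CDR by (RID, VISCODE), skipping the -1 sentinel values.
cdglobal = {(r["RID"], r["VISCODE"]): r["CDGLOBAL"]
            for r in cdr_rows if r["CDGLOBAL"] != -1}

# Inner join: keep only FAQ visits that have a usable CDGLOBAL.
merged = [dict(f, CDGLOBAL=cdglobal[(f["RID"], f["VISCODE"])])
          for f in faq_rows if (f["RID"], f["VISCODE"]) in cdglobal]
```

As in the SQL, a FAQ row survives only when a CDR row with the same RID and VISCODE exists and its CDGLOBAL is not -1.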
Generation of files for future status prediction
• The prediction dataset is generated by mapping the first visit to the 6-month visit's class, the 6-month visit to the 12-month visit's class, and so on.
• SQL query operations are performed on the merged file to separate the 6-month time-interval classes.
• The following files are generated:
– File_dataset_m06.csv
– File_dataset_m12.csv, and so on
Query used for generating class files:

Select v.ID as ID, v.RID as RID, v.VISCODE, v.EXAMDATE, v.FAQSOURCE, v.FAQFINAN, v.FAQFORM, v.FAQSHOP, v.FAQGAME, v.FAQBEVG, v.FAQMEAL, v.FAQEVENT, v.FAQTV, v.FAQREM, v.FAQTRAVL, v.FAQTOTAL, m12.cdrglobal
from `table_adni_faq_2011-01-20_m06` v, `table_adni_faq_2011-01-20_m12` m12
where v.rid = m12.rid
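The "future status" mapping can be illustrated with a minimal Python sketch. The visit codes, the NEXT_VISIT table, and the sample rows below are assumptions for illustration, not the project's real schema: each visit's features are paired with the CDGLOBAL of the same subject's next 6-month visit.

```python
# Sketch: pair each visit's features with the CDGLOBAL of the subject's
# next visit (bl -> m06, m06 -> m12, ...). Data are illustrative.

NEXT_VISIT = {"bl": "m06", "m06": "m12", "m12": "m18"}

rows = [
    {"RID": 1, "VISCODE": "bl",  "FAQTOTAL": 3, "CDGLOBAL": 0.5},
    {"RID": 1, "VISCODE": "m06", "FAQTOTAL": 6, "CDGLOBAL": 1.0},
    {"RID": 1, "VISCODE": "m12", "FAQTOTAL": 9, "CDGLOBAL": 1.0},
]

by_visit = {(r["RID"], r["VISCODE"]): r for r in rows}

future = []
for r in rows:
    nxt = by_visit.get((r["RID"], NEXT_VISIT.get(r["VISCODE"], "")))
    if nxt is not None:
        rec = {k: v for k, v in r.items() if k != "CDGLOBAL"}
        rec["FUTURE_CDGLOBAL"] = nxt["CDGLOBAL"]   # class from the next visit
        future.append(rec)
```

A visit with no recorded follow-up (here m12, since no m18 row exists) simply drops out of the prediction file.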
Preprocessing
• After we get the required .csv files, we use WEKA to preprocess the data.
• Load the file into WEKA.
• Apply the filter "weka.filters.unsupervised.attribute.Remove" to trim off the unused fields.
• Apply "NumericToNominal" to convert all the values in the data to nominal before feeding them to a classifier algorithm.
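The effect of the two WEKA filters can be mimicked in plain Python as a rough sketch. The column names and values below are assumed for illustration; the real work was done inside WEKA's Remove and NumericToNominal filters.

```python
# Rough plain-Python equivalent of the two WEKA filter steps:
# drop unused columns (like weka.filters.unsupervised.attribute.Remove)
# and turn the numeric class into nominal string labels
# (like NumericToNominal). Column names are illustrative.

rows = [
    {"ID": 10, "EXAMDATE": "2011-01-20", "FAQTOTAL": 3, "CDGLOBAL": 0.5},
    {"ID": 11, "EXAMDATE": "2011-02-17", "FAQTOTAL": 9, "CDGLOBAL": 1.0},
]

UNUSED = {"ID", "EXAMDATE"}        # fields trimmed off before classification

cleaned = []
for r in rows:
    r = {k: v for k, v in r.items() if k not in UNUSED}
    r["CDGLOBAL"] = str(r["CDGLOBAL"])   # numeric -> nominal (string) class
    cleaned.append(r)
```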
Classification Algorithms Used
• The Classify panel enables the user to apply classification and regression algorithms (indiscriminately called classifiers in WEKA) to the resulting dataset and to estimate the accuracy of the resulting predictive model.
• J48 uses the C4.5 algorithm (a successor of ID3)
• Naïve Bayesian classification algorithm
What is classification?
• Given a collection of records (a training set)
– Each record contains a set of attributes; one of the attributes is the class.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Example:
If the items in a house are not classified, we cannot arrange them. We classify the items depending on their usage (cooking items, decoration items, etc.) so that we can arrange them accordingly and use them in an efficient and easier way.
Decision Tree Classification Task
[Figure: a decision tree with nodes Refund (Yes / No), MarSt (Married / Single, Divorced), and TaxInc (< 80K / > 80K); applying it to the test record assigns Cheat = "No".]

Decision Tree
[Figure: the same decision tree traced for the test record.]
J48 uses the C4.5 Algorithm
• Decision trees represent a supervised approach to classification.
• Decision trees are a classic way to represent information from a machine learning algorithm, and offer a fast and powerful way to express structures in data.
• A decision tree is a simple structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
• The basic tree-building algorithm recursively classifies until each leaf is pure, meaning that the data has been categorized as close to perfectly as possible.
• The latest public-domain implementation of Quinlan's model is C4.5. The WEKA classifier package has its own version of C4.5, known as J48.
• This process ensures maximum accuracy on the training data.
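The quantity that C4.5/J48 optimizes at each node can be shown with a short sketch: the information gain of splitting a set of class labels by an attribute. The toy rows below reuse the Refund/Cheat names from the decision-tree figure and are assumptions for illustration, not the clinical data.

```python
import math

# Minimal sketch of the split criterion behind C4.5/J48:
# entropy of the class labels, and the information gain of a split.

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def info_gain(rows, attr, target):
    parent = entropy([r[target] for r in rows])
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return parent - remainder

rows = [                                   # toy data, not the ADNI set
    {"refund": "yes", "cheat": "no"},
    {"refund": "yes", "cheat": "no"},
    {"refund": "no",  "cheat": "yes"},
    {"refund": "no",  "cheat": "no"},
]
```

At each node the tree greedily picks the attribute with the highest gain; here splitting on "refund" reduces the class entropy, so it would be chosen over a split that leaves the labels as mixed as before. (Full C4.5 actually uses the gain ratio, a normalized variant of this quantity.)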
  
Why the decision tree algorithm?
• Advantages:
– Inexpensive to construct
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification techniques for many simple data sets
– There can be more than one possible tree for the same data
• Disadvantages:
– Underfitting: when the model is too simple, both training and test errors are large
All about Cross Validation
• We perform cross validation when the amount of data is small and we need independent training and test sets from it.
• It is important that each class is represented in its actual proportions in the training and test sets: stratification.
• An important cross validation technique is stratified 10-fold cross validation, where the instance set is divided into 10 folds.
• We run 10 iterations, each taking a different single fold for testing and the rest for training.
Evaluation
• Metrics for performance evaluation
– How to evaluate the performance of a model?
• Methods for model comparison
– How to compare the relative performance among competing models?
Metrics for Performance Evaluation: Confusion Matrix
• A confusion matrix contains information about actual and predicted classifications done by a classification system. Performance of systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two-class classifier.
• We get the confusion matrix after supplying data to a classifier.
• Based on the confusion matrix we can evaluate the model using measures like precision, F-measure, accuracy and recall.
Example
• Suppose there is a sample of 27 animals: 8 cats, 6 dogs, and 13 rabbits.
• Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
• We can see from the matrix that the system in question has trouble distinguishing between cats and dogs, but can make the distinction between rabbits and other types of animals pretty well.
• All correct guesses are located on the diagonal of the table, so it is easy to visually inspect the table for errors, as they will be represented by any non-zero values outside the diagonal.
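The slide's matrix itself is not reproduced here, so the sketch below assumes one plausible set of counts consistent with the description (cats and dogs confused, rabbits recognized well) and shows how the diagonal gives the correct guesses.

```python
# Illustrative version of the 27-animal example. Rows = actual class,
# columns = predicted class; the counts are assumed for this sketch,
# not taken from the slide's original matrix image.

classes = ["cat", "dog", "rabbit"]
matrix = [
    [5, 3, 0],    # 8 actual cats: 3 confused with dogs
    [2, 3, 1],    # 6 actual dogs: 2 confused with cats
    [0, 2, 11],   # 13 actual rabbits: mostly recognized
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(classes)))  # the diagonal
accuracy = correct / total
```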
Limitation of accuracy
• Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
– Accuracy has some disadvantages as a performance estimate. For example, if there were 95 cats and only 5 dogs in the data set, the classifier could easily be biased into classifying all the samples as cats. The overall accuracy would be 95%, but in practice the classifier would have a 100% recognition rate for the cat class and a 0% recognition rate for the dog class, so you will probably want to look at some of the other numbers. ROC area, the area under the ROC curve, is also taken as a preferred measure.
– Accuracy is misleading here because the model does not detect any class 1 example.
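The 2-class example above can be checked directly: a degenerate model that always predicts class 0 still scores 99.9% accuracy on this skewed data, while its recall on class 1 is zero.

```python
# The slide's 2-class example: 9990 Class 0 examples, 10 Class 1 examples,
# and a model that predicts class 0 for everything.

actual = [0] * 9990 + [1] * 10
predicted = [0] * 10000              # degenerate "always class 0" model

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Per-class recall exposes the failure: class 1 is never detected.
recall_1 = sum(a == p == 1 for a, p in zip(actual, predicted)) / 10
```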
Metrics for Evaluation
• Accuracy: The accuracy (AC) is the proportion of the total number of predictions that were correct, i.e. what percentage of people were correctly classified. It is determined using the equation:

Accuracy = (# True Positives + # True Negatives) / N
where N = total # of predictions, i.e. Accuracy = (TP + TN) / (TP + TN + FP + FN)

• Precision: Precision (P) is the proportion of the predicted positive cases that were correct. Of all the people that are classified as demented, what percentage of them is actually demented? It is calculated using the equation:

Precision = (# True Positives) / (# True Positives + # False Positives)
Evaluation
• F-measure:

F-measure = 2 × (# True Positives) / (2 × # True Positives + # False Positives + # False Negatives)

• Recall: Recall is the ratio of the number of true positives to the sum of true positives and false negatives. It is calculated using the equation:

Recall = (# True Positives) / (# True Positives + # False Negatives)
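The four formulas fit in a few lines of Python. The TP/TN/FP/FN counts below are invented example values (N = 100), not results from the clinical data.

```python
# The four evaluation formulas from the slides, applied to illustrative
# counts from a hypothetical binary (demented / not demented) matrix.

TP, TN, FP, FN = 40, 45, 5, 10    # assumed example counts, N = 100

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f_measure = 2 * TP / (2 * TP + FP + FN)   # equals 2*P*R / (P + R)
```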
Methods for Model Comparison
ROC (Receiver Operating Characteristic)
• Developed in the 1950s for signal detection theory, to analyze noisy signals
– Characterizes the trade-off between positive hits and false alarms
• The ROC curve plots TP (on the y-axis) against FP (on the x-axis)
Using ROC for Model Comparison
• M1 is better for small FPR
• M2 is better for large FPR
• A rough guide for classifying the accuracy of a diagnostic test by the area under the ROC curve (A) is the traditional academic point system:
– 0.90–1.00 = excellent (A)
– 0.80–0.90 = good (B)
– 0.70–0.80 = fair (C)
– 0.60–0.70 = poor (D)
– 0.50–0.60 = fail (F)
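Building an ROC curve amounts to sweeping a decision threshold over the classifier's scores and recording (FPR, TPR) at each step; the area under the resulting curve can be estimated with the trapezoid rule. The scores and labels below are made up for the sketch.

```python
# Sketch of an ROC curve: sweep a threshold over illustrative classifier
# scores, collect (FPR, TPR) points, and integrate with the trapezoid rule.

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]   # assumed classifier outputs
actual = [1,   1,   0,   1,   0,    0]     # 3 positives, 3 negatives

P = sum(actual)
N = len(actual) - P

points = []
for t in sorted(set(scores), reverse=True):
    tp = sum(1 for s, a in zip(scores, actual) if s >= t and a == 1)
    fp = sum(1 for s, a in zip(scores, actual) if s >= t and a == 0)
    points.append((fp / N, tp / P))        # (FPR, TPR) at this threshold

points = [(0.0, 0.0)] + points
auc = sum((x2 - x1) * (y1 + y2) / 2        # trapezoid rule
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

On these toy scores the area comes out around 0.89, which the slide's point system would grade as good (B).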
Naïve Bayes
• It is a simple probabilistic classifier based on applying Bayes' theorem with independence assumptions. A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature.
• For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.
• An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined, not the entire covariance matrix. It is best suited for attributes that are independent, and it is very simple and very fast.
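The apple example can be turned into a minimal categorical naive Bayes sketch: each feature contributes an independent likelihood term that multiplies the class prior. The fruit data and feature names are invented for illustration; this is not the WEKA implementation used in the project.

```python
from collections import Counter, defaultdict

# Minimal categorical naive Bayes on toy fruit data, in the spirit of the
# apple example above. Each feature is treated as independent given the class.

train = [
    ({"color": "red",    "shape": "round"}, "apple"),
    ({"color": "red",    "shape": "round"}, "apple"),
    ({"color": "yellow", "shape": "long"},  "banana"),
    ({"color": "yellow", "shape": "round"}, "apple"),
]

prior = Counter(label for _, label in train)          # class counts
counts = defaultdict(Counter)                         # (label, feature) -> value counts
for feats, label in train:
    for f, v in feats.items():
        counts[(label, f)][v] += 1

def predict(feats):
    def score(label):
        p = prior[label] / len(train)                 # P(class)
        for f, v in feats.items():
            p *= counts[(label, f)][v] / prior[label] # independence assumption
        return p
    return max(prior, key=score)
```

(A real implementation would smooth the zero counts; this bare version is enough to show the independent per-feature likelihood terms.)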
Challenges faced
• Initially all data files were processed using JDBC and MySQL, which later proved cumbersome whenever a different dataset was used. Hence PHP-based MySQL was adopted, which is generalized for all datasets.
• Tables were initially created by hand for loading the data; this was later done with file-operation functions.
• All the MySQL commands were initially run sequentially; this was later enhanced using PHP as the front end.
• Initially the J48 tree was not able to process the data because it consisted of numerical values. This was later handled by discretization (NumericToNominal) of the CDGLOBAL column.
Preprocess Output
Result file for current status (J48)
Current status (Naïve Bayes)
Future status (J48)
Future status (Naïve Bayes)
MMSE (J48)
References:
• WEKA Manual 3.6.2: http://kent.dl.sourceforge.net/project/weka/documentation/3.6.x/WekaManual-3-6-2.pdf
• http://www.dfki.de/~kipp/seminar_ws0607/reports/RossenDimov.pdf
• http://stackoverflow.com/questions/2903933/how-to-interpret-weka-classification
• http://www.slideshare.net/dataminingtools/weka-credibility-evaluating-whats-been-learned

Thank you

 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 

Clinical Data Classification of alzheimer's disease

  • 1. Alzheimer's Disease - Clinical Data Classification
By George Kalangi and Venkata Gopi
  • 2. Overview:
– Introduction
– Analysis of commonly used terms and explanation of the data sets
– Overall programming process
– Generating a merged file with CDGLOBAL
– Generation of files for future status prediction
– Data preprocessing
– Classification algorithms used on the data
– Analysis of the output data from WEKA
  • 3. Introduction
– What is Alzheimer's Disease?
– A brain disorder, and the most common form of dementia
– Dementia is a term for the loss of memory and other intellectual abilities serious enough to interfere with daily life
– Clinical Dementia Rating takes the values 0, 0.5, 1, 2, 3:
  0 = Normal; 0.5 = Questionable dementia; 1.0 to 3.0 = Mild to severe dementia
  • 4. Datasets (60 files)
– 56 comma-separated files
– 1 file: Data Dictionary (explains the terms used)
– 1 file: Clinical Dementia Rating (contains CDGLOBAL)
– The rest: assessments, data definitions, and others such as visit files containing abbreviations
  • 5. Environment Setup
– Programming languages used for the project: PHP, MySQL, Java, PostgreSQL
– Tools used: WEKA (Waikato Environment for Knowledge Analysis), MySQL Workbench, and NetBeans
– Front end: PHP
– Back end: MySQL
  • 6. Overall Programming Process
– A selected dataset (e.g., FAQ) is given by the user.
– At the back end, MySQL queries create the required tables and insert the required data into the corresponding tables.
– The required operations are then performed on the tables.
– The final output files are stored in .csv format.
  • 7. Generating a merged file with CDGLOBAL (for current status)
– For a given input dataset (e.g., adni_faq_2011-01-20.csv) and the adni_cdr_2011-01-20.csv file, the RIDs and VISCODEs of the faq and cdr files are compared, and on a match the CDGLOBAL column of the cdr file is merged into the faq file.
– During the merge, rows whose CDGLOBAL is -1 and rows with the VISCODEs f, nv, and uns1 are trimmed off.
– The result file is "Merged_dataset_file.csv".
  • 8. Query used for generating the merged file:
SELECT f.cID, f.RID, f.VISCODE, f.EXAMDATE, f.FAQSOURCE, f.FAQFINAN, f.FAQFORM, f.FAQSHOP, f.FAQGAME, f.FAQBEVG, f.FAQMEAL, f.FAQEVENT, f.FAQTV, f.FAQREM, f.FAQTRAVL, f.FAQTOTAL, cdr.cdglobal
FROM cdr, faq f
WHERE cdr.rid = f.rid AND cdr.VISCODE = f.VISCODE AND cdr.cdglobal NOT IN (-1);
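The join described on this slide can be reproduced in miniature with Python's built-in sqlite3 module. This is a sketch only: the toy rows and the reduced three-column schema are invented for illustration; only the table names, the RID/VISCODE join keys, and the CDGLOBAL = -1 filter come from the slides.

```python
import sqlite3

# Toy stand-ins for the faq and cdr CSV files; columns reduced to
# the join keys plus one payload column each.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE faq (rid INTEGER, viscode TEXT, faqtotal INTEGER);
CREATE TABLE cdr (rid INTEGER, viscode TEXT, cdglobal REAL);
INSERT INTO faq VALUES (1, 'bl', 3), (1, 'm06', 5), (2, 'bl', 0);
INSERT INTO cdr VALUES (1, 'bl', 0.5), (1, 'm06', 1.0), (2, 'bl', -1);
""")
# Join on RID and VISCODE, dropping rows where CDGLOBAL is -1,
# mirroring the slide's "cdglobal not in (-1)" condition.
rows = con.execute("""
    SELECT f.rid, f.viscode, f.faqtotal, c.cdglobal
    FROM faq f JOIN cdr c ON c.rid = f.rid AND c.viscode = f.viscode
    WHERE c.cdglobal <> -1
    ORDER BY f.rid, f.viscode
""").fetchall()
print(rows)  # [(1, 'bl', 3, 0.5), (1, 'm06', 5, 1.0)]
```

Subject 2's baseline row is excluded because its CDGLOBAL is -1, just as in the slide's query.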
  • 9. Generation of files for future status prediction
– The prediction dataset is generated by mapping the first visit to the 6-month class, the 6-month visit to the 12-month class, and so on.
– SQL query operations are performed on the merged file to separate the 6-month time-interval classes.
– The following files are generated: File_dataset_m06.csv, File_dataset_m12.csv, and so on.
  • 10. Query used for generating the class files:
SELECT v.ID AS ID, v.RID AS RID, v.VISCODE, v.EXAMDATE, v.FAQSOURCE, v.FAQFINAN, v.FAQFORM, v.FAQSHOP, v.FAQGAME, v.FAQBEVG, v.FAQMEAL, v.FAQEVENT, v.FAQTV, v.FAQREM, v.FAQTRAVL, v.FAQTOTAL, m12.cdrglobal
FROM `table_adni_faq_2011-01-20_m06` v, `table_adni_faq_2011-01-20_m12` m12
WHERE v.rid = m12.rid
  • 11. Preprocessing
– After we get the required .csv files, we use WEKA to preprocess the data.
– Load the file into WEKA.
– Apply the filter "weka.filters.unsupervised.attribute.Remove" to trim off the unused fields.
– Apply "NumericToNominal" to convert all the values in the data to nominal before feeding them to a classifier algorithm.
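As a rough illustration of what a numeric-to-nominal conversion does (each distinct numeric value becomes a nominal string label), here is a plain-Python sketch; the CDGLOBAL values are toy numbers, and this is not WEKA's actual implementation:

```python
# Each distinct numeric value in a column becomes one nominal label,
# so a classifier sees a small finite set of categories instead of
# a continuous numeric attribute.
def numeric_to_nominal(values):
    """Map every numeric value in a column to a string label."""
    return [str(v) for v in values]

cdglobal = [0, 0.5, 1, 0.5, 2]           # toy CDR values
labels = numeric_to_nominal(cdglobal)
print(labels)                            # ['0', '0.5', '1', '0.5', '2']
print(sorted(set(labels)))               # the nominal value set
```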
  • 12. Classification Algorithms Used
– The Classify panel enables the user to apply classification and regression algorithms (indiscriminately called classifiers in WEKA) to the dataset, and to estimate the accuracy of the resulting predictive model.
– J48 uses the C4.5 algorithm (a successor of ID3).
– Naïve Bayesian classification algorithm.
  • 13. What is classification?
– Given a collection of records (a training set), each record contains a set of attributes, one of which is the class.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
– Example: if the items in a house are not classified, we cannot arrange them. We classify items by their usage, as cooking items, decoration items, etc., so that we can arrange them accordingly and use them in an efficient and easier way.
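The training/test split described above can be sketched in a few lines of plain Python; the records here are invented (attributes, class) pairs, and the 70/30 ratio is just an illustrative choice:

```python
import random

# Toy (attributes, class) records standing in for the dataset rows.
records = [((i, i % 3), "yes" if i % 2 else "no") for i in range(10)]

rng = random.Random(42)           # fixed seed so the split is repeatable
shuffled = records[:]
rng.shuffle(shuffled)

split = int(0.7 * len(shuffled))  # 70% for training, 30% for testing
train, test = shuffled[:split], shuffled[split:]
print(len(train), len(test))      # 7 3
```

The model is then built on `train` only, and its accuracy is measured on the held-out `test` records.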
  • 14. Decision Tree Classification Task
[Figure: a decision tree on the classic Refund / Marital Status (MarSt) / Taxable Income (TaxInc) example, with branches Yes/No, Single or Divorced/Married, and < 80K / > 80K; following the branches for the test record assigns the class Cheat = "No".]
  • 15. Decision Tree
[Figure: the same Refund / MarSt / TaxInc decision tree traversed for the test data.]
  • 16. J48 uses the C4.5 algorithm
– Decision trees represent a supervised approach to classification.
– Decision trees are a classic way to represent information from a machine learning algorithm, and offer a fast and powerful way to express structures in data.
– A decision tree is a simple structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
– The basic algorithm recursively classifies until each leaf is pure, meaning that the data has been categorized as close to perfectly as possible.
– The latest public-domain implementation of Quinlan's model is C4.5. The WEKA classifier package has its own version of C4.5, known as J48.
– This process ensures maximum accuracy on the training data.
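C4.5 chooses which attribute to test at each node by information gain, which is built on entropy. A minimal sketch with toy class labels (not the project's data):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

labels = ["yes"] * 5 + ["no"] * 5
print(entropy(labels))  # 1.0 -- a maximally impure node

# Information gain of a candidate split = parent entropy minus the
# size-weighted entropy of the child nodes it produces.
left = ["yes"] * 4 + ["no"]        # one child after the split
right = ["yes"] + ["no"] * 4       # the other child
gain = entropy(labels) - (
    len(left) * entropy(left) + len(right) * entropy(right)
) / len(labels)
print(round(gain, 3))  # 0.278
```

A pure leaf (all labels identical) has entropy 0, which is the stopping condition the slide's "each leaf is pure" refers to.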
  • 17. Why the decision tree algorithm?
– Advantages:
  – Inexpensive to construct
  – Easy to interpret for small-sized trees
  – Accuracy is comparable to other classification techniques for many simple data sets
  – There can be more than one possible tree for the same data
– Disadvantages:
  – Underfitting: when the model is too simple, both training and test errors are large
  • 18. All about Cross-Validation
– We perform cross-validation when the amount of data is small and we need independent training and test sets from it.
– It is important that each class is represented in its actual proportions in the training and test sets: stratification.
– An important cross-validation technique is stratified 10-fold cross-validation, where the instance set is divided into 10 folds.
– We run 10 iterations, each taking a different single fold for testing and the rest for training.
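Fold construction for k-fold cross-validation can be sketched as follows (unstratified, with toy instance indices; a stratified variant would additionally preserve the class proportions within each fold):

```python
def k_folds(n_instances, k):
    """Partition instance indices 0..n-1 into k roughly equal folds."""
    indices = list(range(n_instances))
    return [indices[i::k] for i in range(k)]

folds = k_folds(10, 5)
print(folds)  # [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]

# One iteration of cross-validation: fold 0 is the test set,
# the remaining folds together form the training set.
test_fold = folds[0]
train_idx = [i for fold in folds[1:] for i in fold]
print(test_fold, sorted(train_idx))
```

Repeating this with each fold in turn as the test set, then averaging the accuracies, gives the cross-validation estimate.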
  • 19. Evaluation
– Metrics for performance evaluation: how do we evaluate the performance of a model?
– Methods for model comparison: how do we compare the relative performance of competing models?
  • 20. Metrics for Performance Evaluation: Confusion Matrix
– A confusion matrix contains information about the actual and predicted classifications made by a classification system. Performance is commonly evaluated using the data in this matrix. (The slide shows the confusion matrix for a two-class classifier.)
– We get the confusion matrix after supplying data to a classifier.
– Based on the confusion matrix we can compute measures such as precision, F-measure, accuracy, and recall.
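Building a two-class confusion matrix from lists of actual and predicted labels can be sketched as follows; the label lists are toy values, not the project's output:

```python
from collections import Counter

# Toy actual vs. predicted labels for a two-class problem.
actual    = ["pos", "pos", "neg", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "pos", "pos", "neg"]

# Count each (actual, predicted) pair; the four counts are the
# four cells of the two-class confusion matrix.
counts = Counter(zip(actual, predicted))
tp = counts[("pos", "pos")]   # true positives  (diagonal)
tn = counts[("neg", "neg")]   # true negatives  (diagonal)
fn = counts[("pos", "neg")]   # false negatives (off-diagonal)
fp = counts[("neg", "pos")]   # false positives (off-diagonal)
print(tp, fn, fp, tn)  # 2 1 1 2
```

The correct guesses (tp, tn) sit on the matrix diagonal; errors (fn, fp) are the off-diagonal cells, as slide 21 notes.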
  • 21. Example
– Suppose there is a sample of 27 animals: 8 cats, 6 dogs, and 13 rabbits.
– Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
– We can see from the matrix that the system in question has trouble distinguishing between cats and dogs, but can make the distinction between rabbits and other types of animals pretty well.
– All correct guesses are located on the diagonal of the table, so it is easy to visually inspect the table for errors: they appear as non-zero values outside the diagonal.
  • 22. Limitations of Accuracy
– Consider a 2-class problem with 9990 Class 0 examples and 10 Class 1 examples.
– If the model predicts everything to be Class 0, accuracy is 9990/10000 = 99.9%.
– Accuracy has drawbacks as a performance estimate. For example, if there were 95 cats and only 5 dogs in the data set, the classifier could easily be biased into classifying all the samples as cats. The overall accuracy would be 95%, but in practice the classifier would have a 100% recognition rate for the cat class and a 0% recognition rate for the dog class, so other numbers should also be examined. ROC Area, the area under the ROC curve, is also taken as a preferred measure.
– Accuracy is misleading here because the model does not detect a single Class 1 example.
  • 23. Metrics for Evaluation
– Accuracy: the accuracy (AC) is the proportion of the total number of predictions that were correct, i.e., what percentage of people were correctly classified. It is determined using the equation:
  Accuracy = (TP + TN) / (TP + TN + FP + FN)
  where TP, TN, FP, FN are the counts of true positives, true negatives, false positives, and false negatives, and the denominator N is the total number of predictions.
– Precision: precision (P) is the proportion of the predicted positive cases that were correct. Of all the people classified as demented, what percentage is actually demented? It is calculated using the equation:
  Precision = TP / (TP + FP)
  • 24. Evaluation
– F-measure: the harmonic mean of precision and recall, calculated as:
  F-measure = 2 * TP / (2 * TP + FP + FN)
– Recall: recall is the ratio of the number of true positives to the sum of true positives and false negatives. It is calculated using the equation:
  Recall = TP / (TP + FN)
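Taken together, the four measures follow directly from the confusion-matrix counts; the counts below are toy numbers, not the project's results:

```python
# Toy confusion-matrix counts for a two-class problem.
tp, tn, fp, fn = 40, 45, 5, 10

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # correct / all predictions
precision = tp / (tp + fp)                   # correct among predicted positives
recall    = tp / (tp + fn)                   # positives actually found
f_measure = 2 * tp / (2 * tp + fp + fn)      # harmonic mean of P and R

print(accuracy, round(precision, 3), recall, round(f_measure, 3))
# 0.85 0.889 0.8 0.842
```

Note how the 99.9%-accuracy pitfall from slide 22 shows up here: with tp = 0 the recall would be 0 no matter how high the accuracy is.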
  • 25. Methods for Model Comparison
– ROC (Receiver Operating Characteristic)
– Developed in the 1950s for signal detection theory, to analyze noisy signals; it characterizes the trade-off between positive hits and false alarms.
– The ROC curve plots the true positive rate (on the y-axis) against the false positive rate (on the x-axis).
  • 26. Using ROC for Model Comparison
– M1 is better for small FPR; M2 is better for large FPR.
– A rough guide for classifying the accuracy of a diagnostic test by the Area Under the ROC curve is the traditional academic point system: 0.90-1 = excellent (A); 0.80-0.90 = good (B); 0.70-0.80 = fair (C); 0.60-0.70 = poor (D); 0.50-0.60 = fail (F).
  • 27. Naïve Bayes
– Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem with independence assumptions. A naïve Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature.
– For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or on the existence of the other features, a naïve Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.
– An advantage of the naïve Bayes classifier is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because the variables are assumed independent, only the variances of the variables for each class need to be determined, not the entire covariance matrix. It is best suited for attributes that are independent, and it is very simple and very fast.
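A minimal categorical naïve Bayes sketch (no smoothing), using invented fruit-like training rows to show the independence assumption in action; this is an illustration, not WEKA's implementation:

```python
from collections import Counter, defaultdict

# Toy training data: (features, class) pairs, feature positions are
# (color, shape). Rows are invented for the example.
train = [
    (("red", "round"), "apple"),
    (("red", "round"), "apple"),
    (("green", "round"), "apple"),
    (("red", "long"), "not_apple"),
    (("green", "long"), "not_apple"),
]

class_counts = Counter(label for _, label in train)
feat_counts = defaultdict(Counter)  # (class, position) -> value counts
for feats, label in train:
    for pos, value in enumerate(feats):
        feat_counts[(label, pos)][value] += 1

def predict(feats):
    """Pick the class maximizing P(class) * product of P(feature|class)."""
    def score(label):
        p = class_counts[label] / len(train)        # prior P(class)
        for pos, value in enumerate(feats):         # independence assumption
            p *= feat_counts[(label, pos)][value] / class_counts[label]
        return p
    return max(class_counts, key=score)

print(predict(("red", "round")))  # apple
```

Each feature contributes its conditional probability independently, which is exactly the "unrelated features" assumption stated above.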
  • 28. Challenges Faced
– Initially all the data files were processed using JDBC and MySQL, which later proved cumbersome whenever a different dataset was used. Hence PHP-based MySQL processing, which generalizes to all datasets, was adopted.
– Tables were initially created manually for loading the data; this was later done with file-operation functions.
– All the MySQL commands were initially run sequentially; this was later enhanced using PHP as the front end.
– Initially the J48 tree could not process the data because it was numeric. This was solved by discretization (NumericToNominal) of the CDGLOBAL columns.
  • 30. Result file for current status (J48)
  • 31. Current status (Naïve Bayes)
  • 33. Future status (Naïve Bayes)
  • 36. Thank you