SlideShare a Scribd company logo
Prac%cal	
  Methods	
  for	
  Iden%fying	
  Anomalies	
  
That	
  Ma8er	
  in	
  Large	
  Datasets	
  
Robert	
  L.	
  Grossman	
  
University	
  of	
  Chicago	
  
and	
  
Open	
  Data	
  Group	
  
O’Reilly	
  Strata	
  Conference	
  
February	
  20,	
  2015	
  
rgrossman.com	
  
@bobgrossman	
  
•  Walter	
  A.	
  Shewhart	
  of	
  the	
  Bell	
  Telephone	
  Laboratories	
  wrote	
  a	
  
memorandum	
  on	
  May	
  16,	
  1924	
  that	
  included	
  a	
  diagram	
  of	
  a	
  control	
  
chart.	
  	
  
•  Upper	
  and	
  lower	
  control	
  limits	
  were	
  defined	
  by	
  k*standard	
  devia%on	
  
of	
  the	
  observa%ons	
  in	
  a	
  test	
  set.	
  
•  Using	
  exponen%ally	
  weighted	
  moving	
  averages	
  (EWMA)	
  made	
  the	
  
method	
  much	
  more	
  prac%cal.	
  
•  Methods,	
  such	
  as	
  CUSUM,	
  developed	
  for	
  quickest	
  detec%on	
  of	
  
changes.	
  
Source	
  of	
  figure:	
  h8p://www.itl.nist.gov/div898/handbook/mpc/sec%on2/mpc22.htm	
  
Originally	
  published	
  in	
  1939.	
  
Approach	
  1:	
  Sta%s%cal	
  
modeling	
  of	
  popula%ons	
  
•  In	
  this	
  toy	
  example,	
  blue	
  points	
  are	
  normal	
  behavior.	
  	
  	
  Red	
  
points	
  are	
  anomalies	
  and	
  are	
  determined	
  by	
  the	
  distances	
  
from	
  clusters.	
  
•  There	
  are	
  many,	
  many	
  variants:	
  
–  There	
  are	
  many	
  ways	
  to	
  measure	
  distance	
  of	
  candidate	
  
anomalies	
  to	
  clusters.	
  
–  There	
  are	
  many	
  different	
  clustering	
  algorithms.	
  
Cluster	
  1	
  
Cluster	
  2	
  
Cluster	
  3	
  
Approach	
  2:	
  Clusters	
  and	
  
their	
  distances	
  
•  Neighborhoods,	
  density	
  
and	
  context	
  ma8er**	
  	
  
•  Local	
  Outlier	
  Factor	
  (LOF)	
  	
  
–  P1	
  and	
  P2	
  outliers	
  
•  Nearest	
  Neighbor	
  	
  
–  P1	
  outlier	
  
Source:	
  *	
  V.	
  Chandola,	
  A.	
  Banerjee	
  and	
  V.	
  Kumar,	
  Anomaly	
  detec%on:	
  A	
  survey,	
  ACM	
  
Compu%ng	
  Surveys	
  Volume	
  41,	
  Number	
  3,	
  2009.	
  	
  
**	
  R.	
  Grossman	
  and	
  C.	
  Gupta,	
  Outlier	
  Detec%on	
  with	
  Streaming	
  Dyadic	
  Decomposi%on,	
  
Industrial	
  Conference	
  on	
  Data	
  Mining,	
  2007,	
  pages	
  	
  77-­‐91.	
  
P1	
  
P2	
  
P3	
  
*	
  
Approach	
  3:	
  Neighbors	
  and	
  
their	
  densi%es	
  
•  An	
  outlier	
  is	
  an	
  observa%on	
  that	
  deviates	
  so	
  much	
  
from	
  other	
  observa%ons	
  as	
  to	
  arouse	
  suspicion	
  that	
  
it	
  was	
  generated	
  by	
  a	
  different	
  mechanism.*	
  
•  Anomaly	
  detec%on	
  is	
  the	
  search	
  for	
  items	
  or	
  events	
  
which	
  do	
  not	
  conform	
  to	
  an	
  expected	
  pa8ern.**	
  	
  
•  As	
  a	
  simple	
  working	
  defini%on,	
  we	
  view	
  anomalies	
  as	
  
arising	
  from	
  a	
  different	
  popula%on	
  or	
  process.	
  
*D.	
  Hawkins.	
  Iden%fica%on	
  of	
  Outliers.	
  Chapman	
  and	
  Hall,	
  1980.	
  
	
  
**V.	
  Chandola,	
  A.	
  Banerjee	
  and	
  V.	
  Kumar,	
  Anomaly	
  detec%on:	
  A	
  survey,	
  ACM	
  Compu%ng	
  
Surveys	
  Volume	
  41,	
  Number	
  3,	
  2009.	
  
•  Three	
  common	
  cases:	
  
– Supervised:	
  We	
  have	
  two	
  classes	
  of	
  data,	
  one	
  
normal	
  and	
  one	
  consis%ng	
  of	
  anomalies.	
  	
  This	
  is	
  a	
  
standard	
  classifica%on	
  problem,	
  perhaps	
  with	
  
imbalanced	
  classes.	
  	
  Well	
  understood	
  problem.	
  
– Unsupervised:	
  We	
  have	
  no	
  labeled	
  data	
  
represen%ng	
  anomalies	
  or	
  normal	
  data.	
  
– Semi-­‐supervised:	
  We	
  have	
  some	
  labeled	
  data	
  
available	
  for	
  training,	
  but	
  not	
  very	
  much;	
  perhaps	
  
just	
  for	
  normal	
  data.	
  
The	
  core	
  of	
  most	
  large	
  scale	
  systems	
  that	
  process	
  
anomalies	
  is	
  a	
  ranking	
  and	
  packaging	
  of	
  candidate	
  
anomalies	
  so	
  that	
  they	
  may	
  be	
  carefully	
  inves%gated	
  by	
  
subject	
  ma8er	
  experts.	
  
Case	
  Study	
  1:	
  Ac%ve	
  Voxels	
  in	
  FMRI	
  
Source:	
  Salmo	
  salar,	
  (Atlan%c	
  Salmon),	
  wikipedia.org	
  	
  
Methods	
  
Subject.	
  One	
  mature	
  Atlan%c	
  Salmon	
  (Salmo	
  salar)	
  
par%cipated	
  in	
  the	
  fMRI	
  study.	
  The	
  salmon	
  was	
  
approximately	
  18	
  inches	
  long,	
  weighed	
  3.8	
  lbs,	
  and	
  was	
  not	
  
alive	
  at	
  the	
  %me	
  of	
  scanning.	
  	
  
Task.	
  The	
  task	
  administered	
  to	
  the	
  salmon	
  involved	
  
comple%ng	
  an	
  open-­‐ended	
  mentalizing	
  task.	
  The	
  salmon	
  
was	
  shown	
  a	
  series	
  of	
  photographs	
  depic%ng	
  human	
  
individuals	
  in	
  social	
  situa%ons	
  with	
  a	
  specified	
  emo%onal	
  
valence.	
  The	
  salmon	
  was	
  asked	
  to	
  determine	
  what	
  emo%on	
  
the	
  individual	
  in	
  the	
  photo	
  must	
  have	
  been	
  experiencing.	
  	
  
Design.	
  S%muli	
  were	
  presented	
  in	
  a	
  block	
  design	
  with	
  each	
  
photo	
  presented	
  for	
  10	
  seconds	
  followed	
  by	
  12	
  seconds	
  of	
  
rest.	
  A	
  total	
  of	
  15	
  photos	
  were	
  displayed.	
  Total	
  scan	
  %me	
  
was	
  5.5	
  minutes.	
  	
  
	
  Source:	
  Craig	
  M.	
  Benne8,	
  Abigail	
  A.	
  Baird,	
  Michael	
  B.	
  Miller,	
  and	
  George	
  L.	
  Wolford,	
  Neural	
  correlates	
  
of	
  interspecies	
  perspec%ve	
  taking	
  in	
  the	
  post-­‐mortem	
  Atlan%c	
  Salmon:	
  An	
  argument	
  for	
  mul%ple	
  
comparisons	
  correc%on,	
  retrieved	
  from	
  h8p://prefrontal.org/files/posters/Benne8-­‐Salmon-­‐2009.pdf.	
  
Several	
  ac%ve	
  voxels	
  were	
  discovered	
  in	
  a	
  cluster	
  located	
  
within	
  the	
  salmon’s	
  brain	
  cavity.	
  The	
  size	
  of	
  this	
  cluster	
  was	
  
81	
  mm3	
  with	
  a	
  cluster-­‐level	
  significance	
  of	
  p	
  =	
  0.001.	
  Due	
  to	
  
the	
  coarse	
  resolu%on	
  of	
  the	
  echo-­‐planar	
  image	
  acquisi%on	
  
and	
  the	
  rela%vely	
  small	
  size	
  of	
  the	
  salmon	
  brain	
  further	
  
discrimina%on	
  between	
  brain	
  regions	
  could	
  not	
  be	
  
completed.	
  Out	
  of	
  a	
  search	
  volume	
  of	
  8064	
  voxels	
  a	
  total	
  of	
  
16	
  voxels	
  were	
  significant.	
  	
  
	
  Source:	
  ibid.	
  
Rule	
  of	
  Thumb:	
  	
  
Beware	
  of	
  Mul%ple	
  Tests	
  
•  Beware	
  of	
  drawing	
  any	
  conclusions	
  from	
  many	
  
independent	
  tests	
  unless	
  the	
  analysis	
  is	
  properly	
  
done.	
  
•  In	
  the	
  case	
  of	
  understanding	
  the	
  cogni%ve	
  processing	
  
of	
  human	
  emo%ons	
  by	
  dead	
  salmon,	
  asking	
  whether	
  
each	
  voxel	
  is	
  “on”	
  is	
  an	
  independent	
  test.	
  
What	
  Happened:	
  Toy	
  Example	
  
Flipping	
  a	
  coin	
  10	
  %mes	
  
•  How	
  likely	
  is	
  it	
  if	
  you	
  flip	
  a	
  fair	
  coin	
  10	
  %mes	
  that	
  you	
  
will	
  get	
  9	
  or	
  10	
  heads?	
  
	
  
HHHHH	
  HHHHT	
   	
  HHHHT	
  HHHHH	
  
HHHHH	
  HHHTH	
   	
  HHHTH	
  HHHHH	
  
HHHHH	
  HHTHH	
   	
  HHTHH	
  HHHHH	
  
HHHHH	
  HTHHH	
   	
  HTHHH	
  HHHHH	
  
HHHHH	
  THHHH	
   	
  THHHH	
  HHHHH	
  
HHHHH	
  HHHHH	
  
	
  
	
  
Likelihood	
  
=	
  11	
  /	
  (1/2)10	
  
=	
  11	
  /	
  1024	
  
=	
  .01074	
  
=	
  1.074%	
  
Bonferroni	
  Correc%on	
  for	
  Mul%ple	
  Tests	
  
•  What	
  is	
  the	
  probability	
  if	
  a	
  repeat	
  this	
  experiment	
  125	
  
%mes	
  that	
  I	
  will	
  never	
  a	
  get	
  a	
  run	
  of	
  9	
  or	
  10	
  heads?	
  
•  The	
  probability	
  is	
  (1	
  –	
  0.01074)125	
  	
  =	
  25.9%	
  
•  So	
  almost	
  75%	
  of	
  the	
  %me	
  I	
  will	
  select	
  at	
  least	
  one	
  of	
  the	
  
coins	
  as	
  biased	
  when	
  in	
  fact	
  all	
  of	
  them	
  are	
  fair.	
  
•  To	
  be	
  more	
  conserva%ve,	
  divide	
  the	
  significance	
  of	
  
coming	
  up	
  with	
  9	
  or	
  10	
  heads	
  by	
  the	
  number	
  of	
  tests.	
  
•  The	
  probability	
  is	
  (1	
  –	
  0.01074/125)125	
  	
  =	
  98.2%	
  
•  So	
  less	
  than	
  2%	
  of	
  the	
  %me	
  will	
  I	
  select	
  at	
  least	
  one	
  of	
  the	
  
coins	
  as	
  biased	
  when	
  in	
  fact	
  all	
  of	
  them	
  are	
  fair.	
  
Bonferroni	
  Correc%on	
  
Case	
  Study	
  2:	
  Hyperspectral	
  Anomalies	
  from	
  
NASA’s	
  EO-­‐1	
  Satellite*	
  
*Joint	
  work	
  with	
  the	
  Project	
  Matsu	
  team.	
  	
  Maria	
  Pa8erson	
  is	
  the	
  Open	
  Cloud	
  Consor%um	
  
lead	
  for	
  the	
  project.	
  For	
  more	
  informa%on,	
  see	
  matsu.opensciencedatacloud.org.	
  	
  
NASA’s EO-1 Hyperion and ALI
Hyperspectral Instruments
Matsu “Wheel” Spectral Anomaly Detector
•  “Contours and Clusters” – looks for physical contours around
spectral clusters
o  PCA analysis applied to the set of reflectivity values (spectra)
for every pixel, and the top 5 components are extracted for
further analysis.
o  Pixels are clustered in the transformed 5-D spectral space
using a k-means clustering algorithm.
o  For each image, k = 50 spectral clusters are formed and
ranked from most to least extreme using the Mahalanobis
distance of the cluster from the spectral center.
o  For each spectral cluster, adjacent pixels are grouped together
into contiguous objects.
•  Returns geographic regions of spectral anomalies that are scored
as anomalous (0 least , 1000 most) compared to a set of “normal”
spectra, constructed for comparison over a baseline of time
Source:	
  Maria	
  T.	
  Pa8erson,	
  Nikolas	
  Anderson,	
  Collin	
  Benne8,	
  Jacob	
  Bruggemann,	
  Robert	
  L.	
  
Grossman,	
  Ma8hew	
  Handy,	
  Vuong	
  Ly,	
  Daniel	
  J.	
  Mandl,	
  James	
  Pivarski,	
  Ray	
  Powell,	
  Jonathan	
  
Spring,	
  and	
  Walt	
  Wells,	
  The	
  Project	
  Matsu	
  Wheel:	
  Cloud-­‐based	
  Analysis	
  of	
  Earth	
  Satellite	
  Imagery	
  
for	
  Disaster	
  Relief,	
  to	
  appear.	
  
Spectral anomaly detected:
Nishinoshima active volcano, Dec, 2014
Spectral anomaly detected:
Barren Island active volcano, Feb, 2014
Spectral anomaly detected:
North Sentinel Island fires, May, 2014
Spectral anomaly detected:
North Sentinel Island fires, May, 2014
Summary	
  of	
  Matsu	
  
Analy%c	
  
Infrastructure	
  
Hadoop	
  &	
  Accumulo	
  
Analy%c	
  
algorithms	
  
Clustering,	
  PCA,	
  custom	
  distance	
  func%on	
  
Analy%c	
  
opera%ons	
  
Daily	
  list	
  of	
  candidate	
  anomalies	
  
organized	
  so	
  that	
  it	
  can	
  be	
  scanned	
  
quickly	
  by	
  domain	
  experts	
  
Noteworthy	
  	
   Special	
  support	
  for	
  scanning	
  queries	
  
allowing	
  algorithms	
  to	
  be	
  loaded	
  and	
  run	
  
automa%cally	
  against	
  all	
  the	
  data	
  each	
  
day	
  (“Matsu	
  Wheel”)	
  
Case	
  Study	
  3:	
  Geospa%al	
  Varia%on	
  	
  
of	
  Disease	
  Incidence	
  
Source: Neighbor Based Bootstrap (NB2) applied to 99.1 million electronic million
medical records. Joint work with Maria Patterson, to appear. 	
  
Hypertensive Heart Disease
NB2
identifies
small
peaked
clusters
Summary	
  of	
  Disease	
  Incidence	
  Varia%on	
  
Analy%c	
  
Infrastructure	
  
Private	
  secure	
  OpenStack	
  cloud	
  designed	
  to	
  
hold	
  biomedical	
  data*	
  
Analy%c	
  
algorithms	
  
Neighbor	
  based	
  bootstrap	
  (NB2)	
  method	
  
with	
  significance	
  checks	
  using	
  
randomiza%on.	
  	
  Comparison	
  with	
  other	
  
methods	
  for	
  detec%ng	
  sta%s%cally	
  
significant	
  geospa%al	
  varia%on.	
  	
  
Analy%c	
  
opera%ons	
  
Visualiza%on	
  of	
  results.	
  	
  Currently	
  working	
  
out	
  best	
  way	
  to	
  review	
  with	
  subject	
  ma8er	
  
experts.	
  
*For	
  more	
  informa%on,	
  see	
  Allison	
  P.	
  Heath,	
  et.	
  al.	
  ,	
  Bionimbus:	
  A	
  Cloud	
  for	
  Managing,	
  Analyzing	
  
and	
  Sharing	
  Large	
  Genomics	
  Datasets,	
  Journal	
  of	
  the	
  American	
  Medical	
  Informa%cs	
  Associa%on,	
  
2014.	
  
Case	
  Study	
  4:	
  Millions	
  of	
  POS	
  Devices	
  	
  
Sending	
  Payment	
  Informa%on	
  
If POS Data Were Homogeneous, We Could
Use Single Change Detection Model
•  Sequence of events x[1], x[2], x[3], …
•  Question: is the observed distribution different than the baseline
distribution?
•  Use simple CUSUM & Generalized Likelihood Ratio (GLR) tests*
Observed
Model
Baseline
Model
β
*H.	
  Vincent	
  Poor	
  and	
  Olympia	
  Hadjiliadis,	
  Quickest	
  Detec%on,	
  Cambridge	
  University	
  
Press,	
  2008.	
  	
  
Build Thousands of Baseline Models,
One for Each Cell in Data Cube
•  Build	
  separate	
  model	
  for	
  each	
  
bank	
  (1000+)	
  
•  Build	
  separate	
  model	
  for	
  each	
  
geographical	
  region	
  (6	
  regions)	
  
•  Build	
  separate	
  model	
  for	
  each	
  
different	
  type	
  of	
  merchant	
  (c.	
  
800	
  types	
  of	
  merchants)	
  
•  For	
  each	
  dis%nct	
  cell	
  in	
  cube,	
  
establish	
  separate	
  baselines	
  for	
  
each	
  metric	
  of	
  interest	
  (declines,	
  
etc.)	
  
•  Detect	
  changes	
  from	
  baselines	
  	
  
Geospatial
region
Type of
Transaction
20,000+ separate
baselines
Bank
Source:	
  Chris	
  Curry,	
  Robert	
  L.	
  Grossman,	
  David	
  Locke,	
  Steve	
  Vejcik,	
  and	
  Joseph	
  Bugajski,	
  Detec%ng	
  
changes	
  in	
  large	
  data	
  sets	
  of	
  payment	
  card	
  data:	
  a	
  case	
  study,	
  Proceedings	
  of	
  the	
  13th	
  ACM	
  SIGKDD	
  
interna%onal	
  conference	
  on	
  Knowledge	
  discovery	
  and	
  data	
  mining	
  (KDD	
  '07).	
  	
  
Balance Meaningful vs Manageable
Number of Alerts
• Fewer alerts
• Alerts more
manageable
• To decrease alerts,
remove breakpoint,
order by number
of decreased alerts,
& select one or more
breakpoints to remove
• More alerts
• Alerts more
meaningful
• To increase alerts,
add breakpoint
to split cubes,
order by number
of new alerts, &
select one or more
new breakpoints
One model for each
cell in data cube
Breakpoint
Summary	
  of	
  Cubes	
  of	
  Baseline	
  Models	
  
Analy%c	
  
Infrastructure	
  
Models	
  were	
  built	
  in	
  data	
  warehouse	
  and	
  
exported	
  in	
  the	
  Predic%ve	
  Model	
  Markup	
  
Language	
  (PMML).	
  	
  	
  PMML	
  models	
  
imported	
  into	
  real	
  %me	
  scoring	
  engine.	
  	
  
Analy%c	
  
algorithms	
  
Many	
  segmented	
  baseline	
  models,	
  one	
  for	
  
each	
  cell	
  in	
  a	
  mul%-­‐dimensional	
  data	
  cube.	
  
Analy%c	
  
opera%ons	
  
Order	
  most	
  meaningful	
  alerts	
  each	
  day,	
  
create	
  custom	
  visual	
  report,	
  and	
  present	
  to	
  
subject	
  ma8er	
  experts	
  each	
  day	
  
Noteworthy	
  	
   There	
  is	
  a	
  successor	
  to	
  PMML	
  called	
  the	
  
Portable	
  Format	
  for	
  Analy%cs	
  (PFA)	
  
designed	
  for	
  today’s	
  big	
  data	
  
environments.	
  
Summary	
  
When	
  You	
  Have	
  No	
  Labeled	
  Data	
  	
  
and	
  Lots	
  of	
  It	
  –	
  Approach	
  1	
  
1.  Create	
  clusters	
  or	
  micro-­‐clusters	
  
2.  Manually	
  curate	
  some	
  of	
  the	
  clusters	
  with	
  tags	
  or	
  
labels.	
  	
  
3.  Produce	
  scores	
  from	
  0	
  to	
  1000	
  based	
  upon	
  
distances	
  to	
  interes%ng	
  clusters.	
  
4.  Visualize	
  	
  
5.  Discuss	
  weekly	
  with	
  subject	
  ma8er	
  experts	
  and	
  use	
  
these	
  sessions	
  to	
  improve	
  the	
  model.	
  
Case	
  Study	
  2:	
  Detec%ng	
  anomalies	
  in	
  NASA’s	
  EO-­‐1	
  
hyperspectral	
  data.	
  	
  	
  	
  
When	
  You	
  Have	
  No	
  Labeled	
  Data	
  	
  
and	
  Lots	
  of	
  It	
  –	
  Approach	
  2	
  
1.  Create	
  candidate	
  findings	
  with	
  a	
  neighbor-­‐based	
  
method,	
  cluster	
  based	
  method,	
  etc.	
  
2.  Rank	
  order	
  the	
  findings	
  from	
  most	
  significant	
  to	
  
least	
  significant.	
  
3.  Compare	
  your	
  findings	
  with	
  other	
  ranking	
  methods.	
  	
  
Enrich	
  finding(s)	
  with	
  covariates	
  as	
  available.	
  
4.  Visualize	
  	
  
5.  Discuss	
  weekly	
  with	
  subject	
  ma8er	
  experts	
  and	
  use	
  
these	
  sessions	
  to	
  improve	
  the	
  model.	
  
Case	
  Study	
  3:	
  Ranking	
  incidence	
  levels	
  of	
  diseases	
  by	
  
their	
  geo-­‐spa%al	
  varia%on.	
  	
  	
  	
  
When	
  You	
  Have	
  No	
  Labeled	
  Data	
  	
  
and	
  Lots	
  of	
  It	
  –	
  Approach	
  3	
  
1.  Build	
  baseline	
  models	
  for	
  each	
  en'ty	
  of	
  interest,	
  
even	
  if	
  thousands	
  of	
  millions	
  of	
  models.	
  
2.  Produce	
  scores	
  from	
  0	
  to	
  1000	
  based	
  upon	
  
devia%on	
  of	
  observed	
  behavior	
  from	
  baseline	
  
3.  Visualize	
  	
  
4.  Discuss	
  weekly	
  with	
  subject	
  ma8er	
  experts	
  and	
  use	
  
these	
  sessions	
  to	
  improve	
  the	
  model.	
  
Case	
  Study	
  4:	
  Detec%ng	
  anomalies	
  in	
  payment	
  data	
  
using	
  baseline	
  models	
  for	
  millions	
  POS	
  devices.	
  
Ques%ons?	
  
35	
  
rgrossman.com	
  
@bobgrossman	
  

More Related Content

Viewers also liked

Detecting Discontinuties in Large Scale Systems
Detecting  Discontinuties in Large Scale SystemsDetecting  Discontinuties in Large Scale Systems
Detecting Discontinuties in Large Scale Systems
haroonmalik786
 
Big Data - Lab A1 (SC 11 Tutorial)
Big Data - Lab A1 (SC 11 Tutorial)Big Data - Lab A1 (SC 11 Tutorial)
Big Data - Lab A1 (SC 11 Tutorial)
Robert Grossman
 
Análisis de modelos para diseños_instruccionales
Análisis de modelos para diseños_instruccionalesAnálisis de modelos para diseños_instruccionales
Análisis de modelos para diseños_instruccionales
luiscotorres1989
 
Poluicaosonora
PoluicaosonoraPoluicaosonora
Poluicaosonora
sibfv
 
Ensamble y desensamble de un pc de mesa
Ensamble y desensamble de un pc de mesaEnsamble y desensamble de un pc de mesa
Ensamble y desensamble de un pc de mesa
valenypaom
 
How DevOps is Transforming IT, and What it Can Do for Academia
How DevOps is Transforming IT, and What it Can Do for AcademiaHow DevOps is Transforming IT, and What it Can Do for Academia
How DevOps is Transforming IT, and What it Can Do for Academia
Nicole Forsgren
 

Viewers also liked (20)

Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data
 
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
 
Detecting Discontinuties in Large Scale Systems
Detecting  Discontinuties in Large Scale SystemsDetecting  Discontinuties in Large Scale Systems
Detecting Discontinuties in Large Scale Systems
 
What Are Science Clouds?
What Are Science Clouds?What Are Science Clouds?
What Are Science Clouds?
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
 
Big Data - Lab A1 (SC 11 Tutorial)
Big Data - Lab A1 (SC 11 Tutorial)Big Data - Lab A1 (SC 11 Tutorial)
Big Data - Lab A1 (SC 11 Tutorial)
 
Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)
 
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
 
Analytics for large-scale time series and event data
Analytics for large-scale time series and event dataAnalytics for large-scale time series and event data
Analytics for large-scale time series and event data
 
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
 
The Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of ScienceThe Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of Science
 
Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)
 
Análisis de modelos para diseños_instruccionales
Análisis de modelos para diseños_instruccionalesAnálisis de modelos para diseños_instruccionales
Análisis de modelos para diseños_instruccionales
 
Poluicaosonora
PoluicaosonoraPoluicaosonora
Poluicaosonora
 
Clase iii
Clase iiiClase iii
Clase iii
 
Didáctica de la matemática
Didáctica de la matemáticaDidáctica de la matemática
Didáctica de la matemática
 
Ensamble y desensamble de un pc de mesa
Ensamble y desensamble de un pc de mesaEnsamble y desensamble de un pc de mesa
Ensamble y desensamble de un pc de mesa
 
How DevOps is Transforming IT, and What it Can Do for Academia
How DevOps is Transforming IT, and What it Can Do for AcademiaHow DevOps is Transforming IT, and What it Can Do for Academia
How DevOps is Transforming IT, and What it Can Do for Academia
 
T01ppt david oteiza.ppt
T01ppt david oteiza.ppt T01ppt david oteiza.ppt
T01ppt david oteiza.ppt
 

Similar to Practical Methods for Identifying Anomalies That Matter in Large Datasets

Chapter-7-Sampling & sampling Distributions.pdf
Chapter-7-Sampling & sampling Distributions.pdfChapter-7-Sampling & sampling Distributions.pdf
Chapter-7-Sampling & sampling Distributions.pdf
Koteswari Kasireddy
 
Chapter 9 sampling and statistical tool
Chapter 9 sampling and statistical toolChapter 9 sampling and statistical tool
Chapter 9 sampling and statistical tool
Maria Theresa
 
Sheet1Student Name ________________________________10-coin trialO.docx
Sheet1Student Name  ________________________________10-coin trialO.docxSheet1Student Name  ________________________________10-coin trialO.docx
Sheet1Student Name ________________________________10-coin trialO.docx
bjohn46
 
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_ReportRodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report
​Iván Rodríguez
 

Similar to Practical Methods for Identifying Anomalies That Matter in Large Datasets (20)

Lect w3 probability
Lect w3 probabilityLect w3 probability
Lect w3 probability
 
Making your science powerful : an introduction to NGS experimental design
Making your science powerful : an introduction to NGS experimental designMaking your science powerful : an introduction to NGS experimental design
Making your science powerful : an introduction to NGS experimental design
 
Behavioral phenotyping of mouse models of neurodegeneration
Behavioral phenotyping of mouse models of neurodegeneration Behavioral phenotyping of mouse models of neurodegeneration
Behavioral phenotyping of mouse models of neurodegeneration
 
Chapter-7-Sampling & sampling Distributions.pdf
Chapter-7-Sampling & sampling Distributions.pdfChapter-7-Sampling & sampling Distributions.pdf
Chapter-7-Sampling & sampling Distributions.pdf
 
Sampling techniques
Sampling techniquesSampling techniques
Sampling techniques
 
bio presentation.pptx
bio presentation.pptxbio presentation.pptx
bio presentation.pptx
 
P
 Systems 
Model 
Optimisation 
by
 Means 
of 
Evolutionary 
Based 
Search
 ...
P
 Systems 
Model 
Optimisation 
by
 Means 
of 
Evolutionary 
Based 
Search
 ...P
 Systems 
Model 
Optimisation 
by
 Means 
of 
Evolutionary 
Based 
Search
 ...
P
 Systems 
Model 
Optimisation 
by
 Means 
of 
Evolutionary 
Based 
Search
 ...
 
Sampling in Research
Sampling in ResearchSampling in Research
Sampling in Research
 
Sampling
SamplingSampling
Sampling
 
Chapter 9 sampling and statistical tool
Chapter 9 sampling and statistical toolChapter 9 sampling and statistical tool
Chapter 9 sampling and statistical tool
 
Dissertation
DissertationDissertation
Dissertation
 
Sheet1Student Name ________________________________10-coin trialO.docx
Sheet1Student Name  ________________________________10-coin trialO.docxSheet1Student Name  ________________________________10-coin trialO.docx
Sheet1Student Name ________________________________10-coin trialO.docx
 
Temporal profiles of avalanches on networks
Temporal profiles of avalanches on networksTemporal profiles of avalanches on networks
Temporal profiles of avalanches on networks
 
Third Generation Sequencing
Third Generation Sequencing Third Generation Sequencing
Third Generation Sequencing
 
2014 naples
2014 naples2014 naples
2014 naples
 
Sample and sampling
Sample and samplingSample and sampling
Sample and sampling
 
Psychology Unit 1
Psychology Unit 1 Psychology Unit 1
Psychology Unit 1
 
Lecture 7 gwas full
Lecture 7 gwas fullLecture 7 gwas full
Lecture 7 gwas full
 
sampling and statiscal inference
sampling and statiscal inferencesampling and statiscal inference
sampling and statiscal inference
 
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_ReportRodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report
Rodriguez_Ullmayer_Rojo_RUSIS@UNR_REU_Technical_Report
 

More from Robert Grossman

More from Robert Grossman (12)

Some Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your CompanySome Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your Company
 
Some Proposed Principles for Interoperating Cloud Based Data Platforms
Some Proposed Principles for Interoperating Cloud Based Data PlatformsSome Proposed Principles for Interoperating Cloud Based Data Platforms
Some Proposed Principles for Interoperating Cloud Based Data Platforms
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 
A Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical ResearchA Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical Research
 
What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
 
Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)Open Science Data Cloud (IEEE Cloud 2011)
Open Science Data Cloud (IEEE Cloud 2011)
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11
 
Bionimbus - Northwestern CGI Workshop 4-21-2011
Bionimbus - Northwestern CGI Workshop 4-21-2011Bionimbus - Northwestern CGI Workshop 4-21-2011
Bionimbus - Northwestern CGI Workshop 4-21-2011
 

Recently uploaded

一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
MAQIB18
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Domenico Conte
 

Recently uploaded (20)

Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Using PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBUsing PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDB
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
 

Practical Methods for Identifying Anomalies That Matter in Large Datasets

  • 1. Prac%cal  Methods  for  Iden%fying  Anomalies   That  Ma8er  in  Large  Datasets   Robert  L.  Grossman   University  of  Chicago   and   Open  Data  Group   O’Reilly  Strata  Conference   February  20,  2015   rgrossman.com   @bobgrossman  
  • 2. •  Walter  A.  Shewhart  of  the  Bell  Telephone  Laboratories  wrote  a   memorandum  on  May  16,  1924  that  included  a  diagram  of  a  control   chart.     •  Upper  and  lower  control  limits  were  defined  by  k*standard  devia%on   of  the  observa%ons  in  a  test  set.   •  Using  exponen%ally  weighted  moving  averages  (EWMA)  made  the   method  much  more  prac%cal.   •  Methods,  such  as  CUSUM,  developed  for  quickest  detec%on  of   changes.   Source  of  figure:  h8p://www.itl.nist.gov/div898/handbook/mpc/sec%on2/mpc22.htm   Originally  published  in  1939.   Approach  1:  Sta%s%cal   modeling  of  popula%ons  
  • 3. •  In  this  toy  example,  blue  points  are  normal  behavior.      Red   points  are  anomalies  and  are  determined  by  the  distances   from  clusters.   •  There  are  many,  many  variants:   –  There  are  many  ways  to  measure  distance  of  candidate   anomalies  to  clusters.   –  There  are  many  different  clustering  algorithms.   Cluster  1   Cluster  2   Cluster  3   Approach  2:  Clusters  and   their  distances  
  • 4. •  Neighborhoods,  density   and  context  ma8er**     •  Local  Outlier  Factor  (LOF)     –  P1  and  P2  outliers   •  Nearest  Neighbor     –  P1  outlier   Source:  *  V.  Chandola,  A.  Banerjee  and  V.  Kumar,  Anomaly  detec%on:  A  survey,  ACM   Compu%ng  Surveys  Volume  41,  Number  3,  2009.     **  R.  Grossman  and  C.  Gupta,  Outlier  Detec%on  with  Streaming  Dyadic  Decomposi%on,   Industrial  Conference  on  Data  Mining,  2007,  pages    77-­‐91.   P1   P2   P3   *   Approach  3:  Neighbors  and   their  densi%es  
  • 5. •  An  outlier  is  an  observa%on  that  deviates  so  much   from  other  observa%ons  as  to  arouse  suspicion  that   it  was  generated  by  a  different  mechanism.*   •  Anomaly  detec%on  is  the  search  for  items  or  events   which  do  not  conform  to  an  expected  pa8ern.**     •  As  a  simple  working  defini%on,  we  view  anomalies  as   arising  from  a  different  popula%on  or  process.   *D.  Hawkins.  Iden%fica%on  of  Outliers.  Chapman  and  Hall,  1980.     **V.  Chandola,  A.  Banerjee  and  V.  Kumar,  Anomaly  detec%on:  A  survey,  ACM  Compu%ng   Surveys  Volume  41,  Number  3,  2009.  
  • 6. •  Three  common  cases:   – Supervised:  We  have  two  classes  of  data,  one   normal  and  one  consis%ng  of  anomalies.    This  is  a   standard  classifica%on  problem,  perhaps  with   imbalanced  classes.    Well  understood  problem.   – Unsupervised:  We  have  no  labeled  data   represen%ng  anomalies  or  normal  data.   – Semi-­‐supervised:  We  have  some  labeled  data   available  for  training,  but  not  very  much;  perhaps   just  for  normal  data.  
  • 7. The  core  of  most  large  scale  systems  that  process   anomalies  is  a  ranking  and  packaging  of  candidate   anomalies  so  that  they  may  be  carefully  inves%gated  by   subject  ma8er  experts.  
  • 8. Case  Study  1:  Ac%ve  Voxels  in  FMRI   Source:  Salmo  salar,  (Atlan%c  Salmon),  wikipedia.org    
  • 9. Methods   Subject.  One  mature  Atlan%c  Salmon  (Salmo  salar)   par%cipated  in  the  fMRI  study.  The  salmon  was   approximately  18  inches  long,  weighed  3.8  lbs,  and  was  not   alive  at  the  %me  of  scanning.     Task.  The  task  administered  to  the  salmon  involved   comple%ng  an  open-­‐ended  mentalizing  task.  The  salmon   was  shown  a  series  of  photographs  depic%ng  human   individuals  in  social  situa%ons  with  a  specified  emo%onal   valence.  The  salmon  was  asked  to  determine  what  emo%on   the  individual  in  the  photo  must  have  been  experiencing.     Design.  S%muli  were  presented  in  a  block  design  with  each   photo  presented  for  10  seconds  followed  by  12  seconds  of   rest.  A  total  of  15  photos  were  displayed.  Total  scan  %me   was  5.5  minutes.      Source:  Craig  M.  Benne8,  Abigail  A.  Baird,  Michael  B.  Miller,  and  George  L.  Wolford,  Neural  correlates   of  interspecies  perspec%ve  taking  in  the  post-­‐mortem  Atlan%c  Salmon:  An  argument  for  mul%ple   comparisons  correc%on,  retrieved  from  h8p://prefrontal.org/files/posters/Benne8-­‐Salmon-­‐2009.pdf.  
  • 10. Several  ac%ve  voxels  were  discovered  in  a  cluster  located   within  the  salmon’s  brain  cavity.  The  size  of  this  cluster  was   81  mm3  with  a  cluster-­‐level  significance  of  p  =  0.001.  Due  to   the  coarse  resolu%on  of  the  echo-­‐planar  image  acquisi%on   and  the  rela%vely  small  size  of  the  salmon  brain  further   discrimina%on  between  brain  regions  could  not  be   completed.  Out  of  a  search  volume  of  8064  voxels  a  total  of   16  voxels  were  significant.      Source:  ibid.  
  • 11. Rule  of  Thumb:     Beware  of  Mul%ple  Tests   •  Beware  of  drawing  any  conclusions  from  many   independent  tests  unless  the  analysis  is  properly   done.   •  In  the  case  of  understanding  the  cogni%ve  processing   of  human  emo%ons  by  dead  salmon,  asking  whether   each  voxel  is  “on”  is  an  independent  test.  
  • 12. What  Happened:  Toy  Example   Flipping  a  coin  10  %mes   •  How  likely  is  it  if  you  flip  a  fair  coin  10  %mes  that  you   will  get  9  or  10  heads?     HHHHH  HHHHT    HHHHT  HHHHH   HHHHH  HHHTH    HHHTH  HHHHH   HHHHH  HHTHH    HHTHH  HHHHH   HHHHH  HTHHH    HTHHH  HHHHH   HHHHH  THHHH    THHHH  HHHHH   HHHHH  HHHHH       Likelihood   =  11  /  (1/2)10   =  11  /  1024   =  .01074   =  1.074%  
  • 13. Bonferroni  Correc%on  for  Mul%ple  Tests   •  What  is  the  probability  if  a  repeat  this  experiment  125   %mes  that  I  will  never  a  get  a  run  of  9  or  10  heads?   •  The  probability  is  (1  –  0.01074)125    =  25.9%   •  So  almost  75%  of  the  %me  I  will  select  at  least  one  of  the   coins  as  biased  when  in  fact  all  of  them  are  fair.   •  To  be  more  conserva%ve,  divide  the  significance  of   coming  up  with  9  or  10  heads  by  the  number  of  tests.   •  The  probability  is  (1  –  0.01074/125)125    =  98.2%   •  So  less  than  2%  of  the  %me  will  I  select  at  least  one  of  the   coins  as  biased  when  in  fact  all  of  them  are  fair.   Bonferroni  Correc%on  
  • 14. Case  Study  2:  Hyperspectral  Anomalies  from   NASA’s  EO-­‐1  Satellite*   *Joint  work  with  the  Project  Matsu  team.    Maria  Pa8erson  is  the  Open  Cloud  Consor%um   lead  for  the  project.  For  more  informa%on,  see  matsu.opensciencedatacloud.org.    
  • 15. NASA’s EO-1 Hyperion and ALI Hyperspectral Instruments
  • 16. Matsu “Wheel” Spectral Anomaly Detector •  “Contours and Clusters” – looks for physical contours around spectral clusters o  PCA analysis applied to the set of reflectivity values (spectra) for every pixel, and the top 5 components are extracted for further analysis. o  Pixels are clustered in the transformed 5-D spectral space using a k-means clustering algorithm. o  For each image, k = 50 spectral clusters are formed and ranked from most to least extreme using the Mahalanobis distance of the cluster from the spectral center. o  For each spectral cluster, adjacent pixels are grouped together into contiguous objects. •  Returns geographic regions of spectral anomalies that are scored as anomalous (0 least , 1000 most) compared to a set of “normal” spectra, constructed for comparison over a baseline of time Source:  Maria  T.  Pa8erson,  Nikolas  Anderson,  Collin  Benne8,  Jacob  Bruggemann,  Robert  L.   Grossman,  Ma8hew  Handy,  Vuong  Ly,  Daniel  J.  Mandl,  James  Pivarski,  Ray  Powell,  Jonathan   Spring,  and  Walt  Wells,  The  Project  Matsu  Wheel:  Cloud-­‐based  Analysis  of  Earth  Satellite  Imagery   for  Disaster  Relief,  to  appear.  
  • 17. Spectral anomaly detected: Nishinoshima active volcano, Dec, 2014
  • 18. Spectral anomaly detected: Barren Island active volcano, Feb, 2014
  • 19. Spectral anomaly detected: North Sentinel Island fires, May, 2014
  • 20. Spectral anomaly detected: North Sentinel Island fires, May, 2014
  • 21. Summary  of  Matsu   Analy%c   Infrastructure   Hadoop  &  Accumulo   Analy%c   algorithms   Clustering,  PCA,  custom  distance  func%on   Analy%c   opera%ons   Daily  list  of  candidate  anomalies   organized  so  that  it  can  be  scanned   quickly  by  domain  experts   Noteworthy     Special  support  for  scanning  queries   allowing  algorithms  to  be  loaded  and  run   automa%cally  against  all  the  data  each   day  (“Matsu  Wheel”)  
  • 22. Case  Study  3:  Geospa%al  Varia%on     of  Disease  Incidence   Source: Neighbor Based Bootstrap (NB2) applied to 99.1 million electronic million medical records. Joint work with Maria Patterson, to appear.  
  • 25. Summary  of  Disease  Incidence  Varia%on   Analy%c   Infrastructure   Private  secure  OpenStack  cloud  designed  to   hold  biomedical  data*   Analy%c   algorithms   Neighbor  based  bootstrap  (NB2)  method   with  significance  checks  using   randomiza%on.    Comparison  with  other   methods  for  detec%ng  sta%s%cally   significant  geospa%al  varia%on.     Analy%c   opera%ons   Visualiza%on  of  results.    Currently  working   out  best  way  to  review  with  subject  ma8er   experts.   *For  more  informa%on,  see  Allison  P.  Heath,  et.  al.  ,  Bionimbus:  A  Cloud  for  Managing,  Analyzing   and  Sharing  Large  Genomics  Datasets,  Journal  of  the  American  Medical  Informa%cs  Associa%on,   2014.  
  • 26. Case  Study  4:  Millions  of  POS  Devices     Sending  Payment  Informa%on  
  • 27. If POS Data Were Homogeneous, We Could Use Single Change Detection Model •  Sequence of events x[1], x[2], x[3], … •  Question: is the observed distribution different than the baseline distribution? •  Use simple CUSUM & Generalized Likelihood Ratio (GLR) tests* Observed Model Baseline Model β *H.  Vincent  Poor  and  Olympia  Hadjiliadis,  Quickest  Detec%on,  Cambridge  University   Press,  2008.    
  • 28. Build Thousands of Baseline Models, One for Each Cell in Data Cube •  Build  separate  model  for  each   bank  (1000+)   •  Build  separate  model  for  each   geographical  region  (6  regions)   •  Build  separate  model  for  each   different  type  of  merchant  (c.   800  types  of  merchants)   •  For  each  dis%nct  cell  in  cube,   establish  separate  baselines  for   each  metric  of  interest  (declines,   etc.)   •  Detect  changes  from  baselines     Geospatial region Type of Transaction 20,000+ separate baselines Bank Source:  Chris  Curry,  Robert  L.  Grossman,  David  Locke,  Steve  Vejcik,  and  Joseph  Bugajski,  Detec%ng   changes  in  large  data  sets  of  payment  card  data:  a  case  study,  Proceedings  of  the  13th  ACM  SIGKDD   interna%onal  conference  on  Knowledge  discovery  and  data  mining  (KDD  '07).    
  • 29. Balance Meaningful vs Manageable Number of Alerts • Fewer alerts • Alerts more manageable • To decrease alerts, remove breakpoint, order by number of decreased alerts, & select one or more breakpoints to remove • More alerts • Alerts more meaningful • To increase alerts, add breakpoint to split cubes, order by number of new alerts, & select one or more new breakpoints One model for each cell in data cube Breakpoint
  • 30. Summary  of  Cubes  of  Baseline  Models   Analy%c   Infrastructure   Models  were  built  in  data  warehouse  and   exported  in  the  Predic%ve  Model  Markup   Language  (PMML).      PMML  models   imported  into  real  %me  scoring  engine.     Analy%c   algorithms   Many  segmented  baseline  models,  one  for   each  cell  in  a  mul%-­‐dimensional  data  cube.   Analy%c   opera%ons   Order  most  meaningful  alerts  each  day,   create  custom  visual  report,  and  present  to   subject  ma8er  experts  each  day   Noteworthy     There  is  a  successor  to  PMML  called  the   Portable  Format  for  Analy%cs  (PFA)   designed  for  today’s  big  data   environments.  
  • 32. When  You  Have  No  Labeled  Data     and  Lots  of  It  –  Approach  1   1.  Create  clusters  or  micro-­‐clusters   2.  Manually  curate  some  of  the  clusters  with  tags  or   labels.     3.  Produce  scores  from  0  to  1000  based  upon   distances  to  interes%ng  clusters.   4.  Visualize     5.  Discuss  weekly  with  subject  ma8er  experts  and  use   these  sessions  to  improve  the  model.   Case  Study  2:  Detec%ng  anomalies  in  NASA’s  EO-­‐1   hyperspectral  data.        
  • 33. When  You  Have  No  Labeled  Data     and  Lots  of  It  –  Approach  2   1.  Create  candidate  findings  with  a  neighbor-­‐based   method,  cluster  based  method,  etc.   2.  Rank  order  the  findings  from  most  significant  to   least  significant.   3.  Compare  your  findings  with  other  ranking  methods.     Enrich  finding(s)  with  covariates  as  available.   4.  Visualize     5.  Discuss  weekly  with  subject  ma8er  experts  and  use   these  sessions  to  improve  the  model.   Case  Study  3:  Ranking  incidence  levels  of  diseases  by   their  geo-­‐spa%al  varia%on.        
  • 34. When  You  Have  No  Labeled  Data     and  Lots  of  It  –  Approach  3   1.  Build  baseline  models  for  each  en'ty  of  interest,   even  if  thousands  of  millions  of  models.   2.  Produce  scores  from  0  to  1000  based  upon   devia%on  of  observed  behavior  from  baseline   3.  Visualize     4.  Discuss  weekly  with  subject  ma8er  experts  and  use   these  sessions  to  improve  the  model.   Case  Study  4:  Detec%ng  anomalies  in  payment  data   using  baseline  models  for  millions  POS  devices.  
  • 35. Ques%ons?   35   rgrossman.com   @bobgrossman