The Numbers Behind NUMB3RS: Solving Crime with Mathematics (malestrom)


J Gabriel Lima - http://jgabriellima.in

Published in: Technology, Education

  1. 1. THE NUMBERS BEHIND NUMB3RS: SOLVING CRIME WITH MATHEMATICS. KEITH DEVLIN, NPR's "Math Guy," and GARY LORDEN, the math consultant on NUMB3RS, the hit CBS television series
  2. 2. A COMPANION TO THE HIT CBS CRIME SERIES NUMB3RS PRESENTS THE FASCINATING WAYS MATHEMATICS IS USED TO FIGHT REAL-LIFE CRIME. Using the popular CBS prime-time TV crime series NUMB3RS as a springboard, Keith Devlin (known to millions of NPR listeners as "the Math Guy" on NPR's Weekend Edition with Scott Simon) and Gary Lorden (the math consultant to NUMB3RS) explain real-life mathematical techniques used by the FBI and other law enforcement agencies to catch and convict criminals. From forensics to counterterrorism, the Riemann hypothesis to image enhancement, solving murders to beating casino odds, Devlin and Lorden present compelling cases that illustrate how advanced mathematics can be used in state-of-the-art criminal investigations. Praise for the television series: "NUMB3RS LOOKS LIKE A WINN3R." (USA Today)
  3. 3. A PLUME BOOK. THE NUMBERS BEHIND NUMB3RS. DR. KEITH DEVLIN is executive director of Stanford University's Center for the Study of Language and Information and a consulting professor of mathematics at Stanford. Devlin has a B.Sc. degree in Mathematics from King's College London (1968) and a Ph.D. in Mathematics from the University of Bristol (1971). He is a fellow of the American Association for the Advancement of Science, a World Economic Forum fellow, and a former member of the Mathematical Sciences Education Board of the U.S. National Academy of Sciences. The author of twenty-five books, Devlin has been a regular contributor to National Public Radio's popular program Weekend Edition, where he is known as "the Math Guy" in his on-air conversations with host Scott Simon. His monthly column, "Devlin's Angle," appears on the Mathematical Association of America's web journal MAA Online. DR. GARY LORDEN is a professor in the mathematics department of the California Institute of Technology in Pasadena. He graduated from Caltech with a B.S. in mathematics in 1962, received his Ph.D. in mathematics from Cornell University in 1966, and taught at Northwestern University before returning to Caltech in 1968. A fellow of the Institute of Mathematical Statistics, Lorden has taught statistics, probability, and other mathematics at all levels from freshman to doctoral. Lorden has also been active as a consultant and expert witness in mathematics and statistics for government agencies and laboratories, private companies, and law firms. For many years he consulted for Caltech's Jet Propulsion Laboratory for their space exploration programs. He has participated in highly classified research projects aimed at enhancing the ability of government agencies (such as the NSA) to protect national security. Lorden is the chief mathematics consultant for the CBS TV series NUMB3RS.
  4. 4. THE NUMBERS BEHIND NUMB3RS: Solving Crime with Mathematics. Keith Devlin, Ph.D. and Gary Lorden, Ph.D. A PLUME BOOK
  5. 5. PLUME. Published by Penguin Group. Penguin Group (USA) Inc., 375 Hudson Street, New York, New York 10014, U.S.A. Penguin Group (Canada), 90 Eglinton Avenue East, Suite 700, Toronto, Ontario, Canada M4P 2Y3 (a division of Pearson Penguin Canada Inc.). Penguin Books Ltd., 80 Strand, London WC2R 0RL, England. Penguin Ireland, 25 St. Stephen's Green, Dublin 2, Ireland (a division of Penguin Books Ltd.). Penguin Group (Australia), 250 Camberwell Road, Camberwell, Victoria 3124, Australia (a division of Pearson Australia Group Pty. Ltd.). Penguin Books India Pvt. Ltd., 11 Community Centre, Panchsheel Park, New Delhi - 110 017, India. Penguin Books (NZ), 67 Apollo Drive, Rosedale, North Shore 0745, Auckland, New Zealand (a division of Pearson New Zealand Ltd.). Penguin Books (South Africa) (Pty.) Ltd., 24 Sturdee Avenue, Rosebank, Johannesburg 2196, South Africa. Penguin Books Ltd., Registered Offices: 80 Strand, London WC2R 0RL, England. First published by Plume, a member of Penguin Group (USA) Inc. First Printing, September 2007. 10 9 8 7 6 5 4 3 2 1. Copyright © Keith Devlin and Gary Lorden, 2007. All rights reserved. Illustration credits appear on page 244. REGISTERED TRADEMARK - MARCA REGISTRADA. LIBRARY OF CONGRESS CATALOGING-IN-PUBLICATION DATA: Devlin, Keith J. The numbers behind NUMB3RS: solving crime with mathematics / Keith Devlin, Gary Lorden. p. cm. ISBN 978-0-452-28857-7. 1. Criminal investigation. 2. Mathematical statistics. 3. Criminal investigation - Data processing. I. Title: Numbers behind numbers. II. Lorden, Gary. III. Title. HV8073.5.D485 2007. 363.25015195 - dc22. 2007018115. Printed in the United States of America. Set in Dante MT. Designed by Joseph Rutt. Without limiting the rights under copyright reserved above, no part of this publication may be reproduced, stored in or introduced into a retrieval system, or transmitted, in any form, or by any means (electronic, mechanical, photocopying, recording, or otherwise), without the prior written permission of both the copyright owner and the above publisher of this book. PUBLISHER'S NOTE: The scanning, uploading, and distribution of this book via the Internet or via any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions, and do not participate in or encourage electronic piracy of copyrighted materials. Your support of the author's rights is appreciated. BOOKS ARE AVAILABLE AT QUANTITY DISCOUNTS WHEN USED TO PROMOTE PRODUCTS OR SERVICES. FOR INFORMATION PLEASE WRITE TO PREMIUM MARKETING DIVISION, PENGUIN GROUP (USA) INC., 375 HUDSON STREET, NEW YORK, NEW YORK 10014.
  6. 6. Acknowledgments. The authors want to thank NUMB3RS creators Cheryl Heuton and Nick Falacci for creating Charlie Eppes, television's first mathematics superhero, and succeeding brilliantly in putting math on television in prime time. Their efforts have been joined by a stellar team of other writers, actors, producers, directors, and specialists whose work has inspired us to write this book. The gifted actor David Krumholtz has earned the undying love of mathematicians everywhere for bringing Charlie to life in a way that has led millions of people to see mathematics in a completely new light. Thanks also to NUMB3RS researchers Andy Black and Matt Kolokoff for being wonderful to work with in coming up with endless applications of mathematics to make the writers' dreams come true. We wish to express our particular thanks to mathematician Dr. Lenny Rudin of Cognitech, one of the world's foremost experts on image enhancement, for considerable help with Chapter 5 and for providing the images we show in that chapter. Finally, Ted Weinstein, our agent, found us an excellent publisher in David Cashion of Plume, and both worked tirelessly to turn a manuscript that we felt was as reader-friendly as possible, given that this is a math book, into one that, we have to acknowledge, is now a lot more so! Keith Devlin, Palo Alto, CA. Gary Lorden, Pasadena, CA.
  7. 7. Contents
       Introduction: The Hero Is a Mathematician?
       1 Finding the Hot Zone: Criminal Geographic Profiling
       2 Fighting Crime with Statistics 101
       3 Data Mining: Finding Meaningful Patterns in Masses of Information
       4 When Does the Writing First Appear on the Wall? Changepoint Detection
       5 Image Enhancement and Reconstruction
       6 Predicting the Future: Bayesian Inference
       7 DNA Profiling
       8 Secrets: Making and Breaking Codes
       9 How Reliable Is the Evidence? Doubts about Fingerprints
       10 Connecting the Dots: The Math of Networks
  8. 8. Contents
       11 The Prisoner's Dilemma, Risk Analysis, and Counterterrorism
       12 Mathematics in the Courtroom
       13 Crime in the Casino: Using Math to Beat the System
       Appendix: Mathematical Synopses of the Episodes in the First Three Seasons of NUMB3RS
       Index
  9. 9. INTRODUCTION: The Hero Is a Mathematician? On January 23, 2005, a new television crime series called NUMB3RS debuted. Created by the husband-and-wife team Nick Falacci and Cheryl Heuton, the series was produced by Paramount Network Television and acclaimed Hollywood veterans Ridley and Tony Scott, whose movie credits include Alien, Top Gun, and Gladiator. Throughout its run, NUMB3RS has regularly beaten out the competition to be the most watched series in its time slot on Friday nights. What has surprised many is that one of the show's two heroes is a mathematician, and much of the action revolves around mathematics, as professor Charlie Eppes uses his powerful skills to help his older brother, Don, an FBI agent, identify and catch criminals. Many viewers, and several critics, have commented that the stories are entertaining, but the basic premise is far-fetched: You simply can't use math to solve crimes, they say. As this book proves, they are wrong. You can use math to solve crimes, and law enforcement agencies do; not in every instance, to be sure, but often enough to make math a powerful weapon in the never-ending fight against crime. In fact, the very first episode of the series was closely based on a real-life case, as we will discuss in the next chapter. Our book sets out to describe, in a nontechnical fashion, some of the major mathematical techniques currently available to the police, CIA, and FBI. Most of these methods have been mentioned during episodes of NUMB3RS, and while we frequently link our explanations to what was depicted on the air, our focus is on the mathematical techniques and how they can be used in law enforcement. In addition, we describe
  10. 10. some real-life cases where mathematics played a role in solving a crime that have not been used in the TV series, at least not directly. In many ways, NUMB3RS is similar to good science fiction, which is based on correct physics or chemistry. Each week, NUMB3RS presents a dramatic story in which realistic mathematics plays a key role in the narrative. The producers of NUMB3RS go to great lengths to ensure that the mathematics used in the scripts is correct and that the applications shown are possible. Although some of the cases viewers see are fictional, they certainly could have happened, and in some cases very well may. Though the TV series takes some dramatic license, this book does not. In The Numbers Behind NUMB3RS, you will discover the mathematics that can be, and is, used in fighting real crime and catching actual criminals.
  11. 11. THE NUMBERS BEHIND NUMB3RS
  12. 12. CHAPTER 1: Finding the Hot Zone. Criminal Geographic Profiling. FBI Special Agent Don Eppes looks again at the large street map of Los Angeles spread across the dining-room table of his father's house. The crosses inked on the map show the locations where, over a period of several months, a brutal serial killer has struck, raping and then murdering a number of young women. Don's job is to catch the killer before he strikes again. But the investigation has stalled. Don is out of clues, and doesn't know what to do next. "Can I help?" The voice is that of Don's younger brother, Charlie, a brilliant young professor of mathematics at the nearby university CalSci. Don has always been in awe of his brother's incredible ability at math, and frankly would welcome any help he can get. But ... help from a mathematician? "This case isn't about numbers, Charlie." The edge in Don's voice is caused more by frustration than anger, but Charlie seems not to notice, and his reply is totally matter-of-fact but insistent: "Everything is numbers." Don is not convinced. Sure, he has often heard Charlie say that mathematics is all about patterns: identifying them, analyzing them, making predictions about them. But it didn't take a math genius to see that the crosses on the map were scattered haphazardly. There was no pattern, no way anyone could predict where the next cross would go, the exact location where the next young girl would be attacked. Maybe it would occur that very evening. If only there were some regularity to the arrangement of the crosses, a pattern that could be captured with a mathematical equation, the way Don remembers from his schooldays that the equation $x^2 + y^2 = 9$ describes a circle.
  13. 13. Looking at the map, even Charlie has to agree there is no way to use math to predict where the killer would strike next. He strolls over to the window and stares out across the garden, the silence of the evening broken only by the continual flick-flick-flick-flick of the automatic sprinkler watering the lawn. Charlie's eyes see the sprinkler but his mind is far away. He had to admit that Don was probably right. Mathematics could be used to do lots of things, far more than most people realized. But in order to use math, there had to be some sort of pattern. Flick-flick-flick-flick. The sprinkler continued to do its job. There was the brilliant mathematician in New York who used mathematics to study the way the heart works, helping doctors spot tiny irregularities in a heartbeat before the person has a heart attack. Flick-flick-flick-flick. There were all those mathematics-based computer programs the banks utilized to track credit card purchases, looking for a sudden change in the pattern that might indicate identity theft or a stolen card. Flick-flick-flick-flick. Without clever mathematical algorithms, the cell phone in Charlie's pocket would have been twice as big and a lot heavier. Flick-flick-flick-flick. In fact, there was scarcely any area of modern life that did not depend, often in a crucial way, on mathematics. But there had to be a pattern, otherwise the math can't get started. Flick-flick-flick-flick. For the first time, Charlie notices the sprinkler, and suddenly he knows what to do. He has his answer. He could help solve Don's case, and the solution has been staring him in the face all along. He just had not realized it.
He drags Don over to the window. "We've been asking the wrong question," he says. "From what you know, there's no way you can predict where the killer will strike next." He points to the sprinkler. "Just like, no matter how much you study where each drop of water hits the grass, there's no way you can predict where the next drop will land. There's too much uncertainty." He glances at Don to make sure his older brother is listening. "But suppose you could not see the sprinkler, and all you had to go on was the pattern of where all the drops landed. Then, using math, you could work out exactly where the sprinkler must be. You can't use the pattern of drops to predict forward to the next
  14. 14. drop, but you can use it to work backward to the source. It's the same with your killer." Don finds it difficult to accept what his brother seems to be suggesting. "Charlie, are you telling me you can figure out where the killer lives?" Charlie's answer is simple: "Yes." Don is still skeptical that Charlie's idea can really work, but he's impressed by his brother's confidence and passion, and so he agrees to let him assist with the investigation. Charlie's first step is to learn some basic facts from the science of criminology: First, how do serial killers behave? Here, his years of experience as a mathematician have taught him how to recognize the key factors and ignore all the others, so that a seemingly complex problem can be reduced to one with just a few key variables. Talking with Don and the other agents at the FBI office where his elder brother works, he learns, for instance, that violent serial criminals exhibit certain tendencies in selecting locations. They tend to strike close to their home, but not too close; they always set a "buffer zone" around their residence where they will not strike, an area that is too close for comfort; outside that comfort zone, the frequency of crime locations decreases as the distance from home increases. Then, back in his office in the CalSci mathematics department, Charlie gets to work in earnest, feverishly covering his blackboards with mathematical equations and formulas. His goal: to find the mathematical key to determine a "hot zone," an area on the map, derived from the crime locations, where the perpetrator is most likely to live. As always when he works on a difficult mathematical problem, the hours fly by as Charlie tries out many unsuccessful approaches.
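Then, finally, he has an idea he thinks should work. He erases his previous chalk scribbles one more time and writes this complicated-looking formula on the board:*
$$p_{i,j} = k \sum_{n=1}^{c} \left[ \frac{\phi}{(|x_i - x_n| + |y_j - y_n|)^{f}} + \frac{(1-\phi)\,B^{g-f}}{(2B - |x_i - x_n| - |y_j - y_n|)^{g}} \right]$$
*We'll take a closer look at this formula in a moment.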
  15. 15. "That should do the trick," he says to himself. The next step is to fine-tune his formula by checking it against examples of past serial crimes Don provides him with. When he inputs the crime locations from those previous cases into his formula, does it accurately predict where the criminals lived? This is the moment of truth, when Charlie will discover whether his mathematics reflects reality. Sometimes it doesn't, and he learns that when he first decided which factors to take into account and which to ignore, he must have got it wrong. But this time, after Charlie makes a few minor adjustments, the formula seems to work. The next day, bursting with energy and conviction, Charlie shows up at the FBI offices with a printout of the crime-location map with the "hot zone" prominently displayed. Just as the equation $x^2 + y^2 = 9$ that Don remembered from his schooldays describes a circle, so that when the equation is fed into a suitably programmed computer it will draw the circle, so too when Charlie fed his new equation into his computer, it also produced a picture. Not a circle this time: Charlie's equation is much more complicated. What it gave was a series of concentric colored regions drawn on Don's crime map of Los Angeles, regions that homed in on the hot zone where the killer lives. Having this map will still leave a lot of work for Don and his colleagues, but finding the killer is no longer like looking for a needle in a haystack. Thanks to Charlie's mathematics, the haystack has suddenly dwindled to a mere sackful of hay.
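The ideas in this scene (a buffer zone near home, fewer crimes as distance from home grows, and working backward from crime sites to a likely base) can be sketched in a few lines of Python. This is only an illustrative toy, not Charlie's or Rossmo's actual formula and not the Rigel software: the piecewise scoring rule, the city-block distance, the square grid, the function names `cell_score` and `hot_zone`, and the parameter values `buffer`, `f`, and `g` are all assumptions chosen for the example.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def cell_score(cell: Point, crimes: List[Point],
               buffer: float = 2.0, f: float = 1.2, g: float = 1.2) -> float:
    """Toy score for how plausible `cell` is as the offender's base.

    All parameter values are illustrative assumptions.
    """
    total = 0.0
    for cx, cy in crimes:
        # City-block (Manhattan) distance from the cell to this crime site.
        d = abs(cell[0] - cx) + abs(cell[1] - cy)
        if d > buffer:
            # Outside the buffer zone: contribution decays with distance.
            total += 1.0 / d ** f
        else:
            # Inside the buffer zone: contribution shrinks as the cell
            # gets closer to the crime site ("too close for comfort").
            total += buffer ** (g - f) / (2 * buffer - d) ** g
    return total


def hot_zone(crimes: List[Point], size: int) -> Point:
    """Return the highest-scoring cell on a size-by-size grid."""
    cells = [(x, y) for x in range(size) for y in range(size)]
    return max(cells, key=lambda c: cell_score(c, crimes))


if __name__ == "__main__":
    # Four hypothetical crime sites arranged symmetrically around (5, 5).
    crimes = [(5, 2), (5, 8), (2, 5), (8, 5)]
    print(hot_zone(crimes, 11))
```

With this symmetric, made-up layout the peak lands on the central cell, which no crime site touches: the score surface points back toward a source rather than forward to the next crime. Coloring every cell by its score, instead of reporting only the maximum, would give the kind of banded map described above.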
  16. 16. Charlie explains to Don and the other FBI agents working the case that the serial criminal has tried not to reveal where he lives, picking victims in what he thinks is a random pattern of locations, but that the mathematical formula nevertheless reveals the truth: a hot zone in which the criminal's residence is located, to a very high probability. Don and the team decide to investigate men within a certain range of ages, who live in the hot zone, and use surveillance and stealth tactics to obtain DNA evidence from the suspects' discarded cigarette butts, drinking straws, and the like, which can be matched with DNA from the crime-scene investigations. Within a few days, and a few heart-stopping moments, they have their man. The case is solved. Don tells his younger brother, "That's some formula you've got there, Charlie." FACT OR FICTION? Leaving out a few dramatic twists, the above is what the TV audience saw in the very first episode of NUMB3RS, broadcast on January 23, 2005. Many viewers could not believe that mathematics could help capture a criminal in this way. In fact, that entire first episode was based fairly closely on a real case in which a single mathematical equation was used to identify the hot zone where a criminal lived. It was the very equation, reproduced above, that viewers saw Charlie write on his blackboard. The real-life mathematician who produced that formula is named Kim Rossmo. The technique of using mathematics to predict where a serial criminal lives, which Rossmo helped to establish, is called geographic profiling. In the 1980s Rossmo was a young constable on the police force in Vancouver, Canada. What made him unusual for a police officer was his talent for mathematics.
Throughout school he had been a "math whiz," the kind of student who makes fellow students, and often teachers, a little nervous. The story is told that early in the twelfth grade, bored with the slow pace of his mathematics course, he asked to take the final exam in the second week of the semester. After scoring one hundred percent, he was excused from the remainder of the course. Similarly bored with the typical slow progress of police investigations involving violent serial criminals, Rossmo decided to go back to school,
  17. 17. ending up with a Ph.D. in criminology from Simon Fraser University, the first cop in Canada to get one. His thesis advisers, Paul and Patricia Brantingham, were pioneers in the development of mathematical models (essentially sets of equations that describe a situation) of criminal behavior, particularly those that describe where crimes are most likely to occur based on where a criminal lives, works, and plays. (It was the Brantinghams who noticed the location patterns of serial criminals that TV viewers saw Charlie learning about from Don and his FBI colleagues.) Rossmo's interest was a little different from the Brantinghams'. He did not want to study patterns of criminal behavior. As a police officer, he wanted to use actual data about the locations of crimes linked to a single unknown perpetrator as an investigative tool to help the police find the criminal. Rossmo had some initial successes in re-analyzing old cases, and after receiving his Ph.D. and being promoted to detective inspector, he pursued his interest in developing better mathematical methods to do what he came to call criminal geographic targeting (CGT). Others called the method "geographic profiling," since it complemented the well-known technique of "psychological profiling" used by investigators to find criminals based on their motivations and psychological characteristics. Geographic profiling attempts to locate a likely base of operation for a criminal by analyzing the locations of their crimes. Rossmo hit upon the key idea behind his seemingly magic formula while riding on a bullet train in Japan one day in 1991. Finding himself without a notepad to write on, he scribbled it on a napkin.
With later refinements, the formula became the principal element of a computer program Rossmo wrote, called Rigel (pronounced RYE-gel, and named after the star in the constellation Orion, the Hunter). Today, Rossmo sells Rigel, along with training and consultancy, to police and other investigative agencies around the world to help them find criminals. When Rossmo describes how Rigel works to a law enforcement agency interested in the program, he offers his favorite metaphor: that of determining the location of a rotating lawn sprinkler by analyzing the pattern of the water drops it sprays on the ground. When NUMB3RS
  18. 18. cocreators Cheryl Heuton and Nick Falacci were working on their pilot episode, they took Rossmo's own metaphor as the way Charlie would hit upon the formula and explain the idea to his brother. Rossmo had some early successes dealing with serial crime investigations in Canada, but what really made him a household name among law enforcement agencies all over North America was the case of the South Side Rapist in Lafayette, Louisiana. For more than ten years, an unknown assailant, his face wrapped bandit-style in a scarf, had been stalking women in the town and assaulting them. In 1998 the local police, snowed under by thousands of tips and a corresponding number of suspects, brought Rossmo in to help. Using Rigel, Rossmo analyzed the crime-location data and produced a map much like the one Charlie displayed in NUMB3RS, with bands of color indicating the hot zone and its increasingly hot interior rings. The map enabled police to narrow down the hunt to half a square mile and about a dozen suspects. Undercover officers combed the hot zone using the same techniques portrayed in NUMB3RS, to obtain DNA samples of all males of the right age range in the area. Frustration set in when each of the suspects in the hot zone was cleared by DNA evidence. But then they got lucky. The lead investigator, McCullan "Mac" Gallien, received an anonymous tip pointing to a very unlikely suspect: a sheriff's deputy from a nearby department. As just one more tip on top of the mountain he already had, Mac was inclined to just file it, but on a whim he decided to check the deputy's address. Not even close to the hot zone.
Still something niggled him, and he dug a little deeper. And then he hit the jackpot. The deputy had previously lived at another address, right in the hot zone! DNA evidence was collected from a cigarette butt, and it matched that taken from the crime scenes. The deputy was arrested, and Rossmo became an instant celebrity in the crime-fighting world. Interestingly, when Heuton and Falacci were writing the pilot episode of NUMB3RS, based on this real-life case, they could not resist incorporating the same dramatic twist at the end. When Charlie first applies his formula, no DNA matches are found among the suspects in the hot zone, as happened with Rossmo's formula in Lafayette. Charlie's belief in his mathematical analysis is so strong that when Don tells him
  19. 19. the search has drawn a blank, he initially refuses to accept this outcome. "You must have missed him," he says. Frustrated and upset, Charlie huddles with Don at their father Alan's house, and Alan says, "I know the problem can't be the math, Charlie. It must be something else." This remark spurs Don to realize that finding the killer's residence may be the wrong goal. "If you tried to find me where I live, you would probably fail because I'm almost never there," he notes. "I'm usually at work." Charlie seizes on this notion to pursue a different line of attack, modifying his calculations to look for two hot zones, one that might contain the killer's residence and the other his place of work. This time Charlie's math works. Don manages to identify and catch the criminal just before he kills another victim. These days, Rossmo's company ECRI (Environmental Criminology Research, Inc.) offers the patented computer package Rigel along with training in how to use it effectively to solve crimes. Rossmo himself travels around the world, to Asia, Africa, Europe, and the Middle East, assisting in criminal investigations and giving lectures to police and criminologists. Two years of training, by Rossmo or one of his assistants, is required to learn to adapt the use of the program to the idiosyncrasies of a particular criminal's behavior. Rigel does not score a big win every time. For example, Rossmo was called in on the notorious Beltway Sniper case when, during a three-week period in October 2002, ten people were killed and three others critically injured by what turned out to be a pair of serial killers operating in and around the Washington, D.C., area.
Rossmo concluded that the snipers' base was somewhere in the suburbs to the north of Washington, but it turned out that the two killers did not live in the area and moved too often to be located by geographic profiling. The fact that Rigel does not always work will not come as a surprise to anyone familiar with what happens when you try to apply mathematics to the messy real world of people. Many people come away from their high school experience with mathematics thinking that there is a right way and a wrong way to use math to solve a problem (in too many cases with the teacher's way being the right one and their own attempts being the wrong one). But this is rarely the case. Mathematics will always give you the correct answer (if you do the math right) when
  20. 20. you apply it to very well-defined physical situations, such as calculating how much fuel a jet needs to fly from Los Angeles to New York. (That is, the math will give you the right answer provided you start with accurate data about the total weight of the plane, passengers, and cargo, the prevailing winds, and so forth. Missing a key piece of input data to incorporate into the mathematical equations will almost always result in an inaccurate answer.) But when you apply math to a social problem, such as a crime, things are rarely so clear-cut. Setting up equations that capture elements of some real-life activity is called constructing a "mathematical model." In constructing a physical model of something, say an aircraft to study in a wind tunnel, the important thing is to get everything right, apart from the size and the materials used. In constructing a mathematical model, the idea is to get the appropriate behavior right. For example, to be useful, a mathematical model of the weather should predict rain for days when it rains and predict sunshine on sunny days. Constructing the model in the first place is usually the hard part. "Doing the math" with the model (i.e., solving the equations that make up the model) is generally much easier, especially when using computers. Mathematical models of the weather often fail because the weather is simply far too complicated (in everyday language, it's "too unpredictable") to be captured by mathematics with great accuracy. As we shall see in later chapters, there is usually no such thing as "one correct way" to use mathematics to solve problems in the real world, particularly problems involving people.
To try to meet the challenges that confront Charlie in NUMB3RS (locating criminals, tracing the spread of a disease or of counterfeit money, predicting the target selection of terrorists, and so on), a mathematician cannot merely write down an equation and solve it. There is a considerable art to the process of assembling information and data, selecting mathematical variables that describe a situation, and then modeling it with a set of equations. And once a mathematician has constructed a model, there is still the matter of solving it in some way, by approximations or calculations or computer simulations. Every step in the process requires judgment and creativity. No two mathematicians working independently, however brilliant, are likely to produce identical results, if indeed they can produce useful results at all.
  21. 21. It is not surprising, then, that in the field of geographic profiling, Rossmo has competitors. Dr. Grover M. Godwin of the Justice Center at the University of Alaska, author of the book Hunting Serial Predators, has developed a computer package called Predator that uses a branch of mathematical statistics called multivariate analysis to pinpoint a serial killer's home base by analyzing the locations of crimes, where the victims were last seen, and where the bodies were discovered. Ned Levine, a Houston-based urban planner, developed a program called Crimestat for the National Institute of Justice, a research branch of the U.S. Justice Department. It uses something called spatial statistics to analyze serial-crime data, and it can also be applied to help agents understand such things as patterns of auto accidents or disease outbreaks. And David Canter, a professor of psychology at the University of Liverpool in England, and the director of the Centre for Investigative Psychology there, has developed his own computer program, Dragnet, which he has sometimes offered free to researchers. Canter has pointed out that so far no one has performed a head-to-head comparison of the various math/computer systems for locating serial criminals based on applying them in the same cases, and he has claimed in interviews that in the long run, his program and others will prove to be at least as accurate as Rigel.

ROSSMO'S FORMULA

Finally, let's take a closer look at the formula Rossmo scribbled down on that paper napkin on the bullet train in Japan back in 1991.
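The formula, as it is commonly stated and in a form consistent with the description that follows, can be written as below. The normalizing constant $k$, the upper limit $c$ (the number of crimes), and the factor $(1-\phi)B^{g-f}$ in the second numerator are not spelled out in the text; they are assumptions taken from the usual published form.

```latex
p_{ij} = k \sum_{n=1}^{c} \left[
  \frac{\phi}{\left(|x_i - x_n| + |y_j - y_n|\right)^{f}}
  + \frac{(1-\phi)\,B^{\,g-f}}{\left(2B - |x_i - x_n| - |y_j - y_n|\right)^{g}}
\right]
```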
To understand what it means, imagine a grid of little squares superimposed on the map, each square having two numbers that locate it: what row it's in and what column it's in, "i" and "j". The probability, $p_{ij}$, that the killer's residence is in that square is written on the left side of
  22. 22. the equation, and the right side shows how to calculate it. The crime locations are represented by map coordinates, $(x_1, y_1)$ for the first crime, $(x_2, y_2)$ for the second crime, and so on. What the formula says is this:

To get the probability $p_{ij}$ for the square in row "i", column "j" of the grid, first calculate how far you have to go to get from the center point $(x_i, y_j)$ of that square to each crime location $(x_n, y_n)$. The little "n" here stands for any one of the crime locations: n = 1 means "first crime," n = 2 means "second crime," and so on. The answer to the question of how far you have to go is:

$$|x_i - x_n| + |y_j - y_n|$$

and this is used in two ways.

Reading from left to right in the formula, the first way is to put that distance in the denominator, with $\phi$ in the numerator. The distance is raised to the power $f$. The choice of what number to use for this $f$ will be based on what works best when the formula is checked against data on past crime patterns. (If you take $f = 2$, for example, then that part of the formula will resemble the "inverse square law" that describes the force of gravity.) This part of the formula expresses the idea that the probability of crime locations decreases as the distance increases, once outside of the buffer zone.

The second way the formula uses the "traveling distance" of each crime involves the buffer zone. In the second fraction, you subtract the distance from $2B$, where $B$ is a number that will be chosen to describe the size of the buffer zone, and you use that subtraction result in the second fraction.
The subtraction produces smaller answers as the distance increases, so that after raising those answers to another power, $g$, in the denominator of the second part of the formula, you get larger results.

Together, the first and second parts of the formula perform a sort of "balancing act," expressing the fact that as you move away from the criminal's base, the probability of crimes first increases (as you move through the buffer zone) and then decreases. The two parts of the formula are combined using a fancy mathematical notation, the Greek letter $\Sigma$ standing for "sum (add up) the contributions from each of the
  23. 23. crimes to the evaluation of the probability for the ij grid square." The Greek letter $\phi$ is used in the two parts as a way of placing more "weight" on one part or the other. A larger choice of $\phi$ puts more weight on the phenomenon of "decreasing probability as distance increases," whereas a smaller $\phi$ emphasizes the effect of the buffer zone.

Once the formula is used to calculate the probabilities, $p_{ij}$, of all of the little squares in the grid, it's easy to make a hot zone map. You just color the squares, with the highest probabilities bright yellow, slightly smaller probabilities orange, then red, and so on, leaving the squares with low probability uncolored.

Rossmo's formula is a good example of the art of using mathematics to describe incomplete knowledge of real-world phenomena. Unlike the law of gravity, which through careful measurements can be observed to operate the same way every time, descriptions of the behavior of individual human beings are at best approximate and uncertain. When Rossmo checked out his formula on past crimes, he had to find the best fit of his formula to those data by choosing different possible values of $f$ and $g$, and of $B$ and $\phi$. He then used those findings in analyzing future crime patterns, still allowing for further fine-tuning in each new investigation.

Rossmo's method is definitely not rocket science: space travel depends crucially on always getting the right answer with great accuracy. But it is nevertheless science. It does not work every time, and the answers it gives are probabilities. But in crime detection and other domains involving human behavior, knowing those probabilities can sometimes make all the difference.
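As a rough illustration of how a hot zone grid of this kind could be computed, here is a minimal Python sketch. It is not Rigel's implementation. It rests on several assumptions not stated in the text: grid square (i, j) is treated as having center coordinates (i, j); the first fraction is applied only outside the buffer zone (distance greater than B) and the second only inside it, a piecewise reading that also avoids division by zero; the second term carries the factor (1 - phi) * B**(g - f); and the parameter values and crime coordinates are invented for the example.

```python
def hot_zone(crimes, rows, cols, B=2.0, f=1.2, g=1.2, phi=0.5):
    """Return a rows x cols grid of scores, normalized so they sum to 1.

    crimes is a list of (x, y) crime locations in grid units.
    B, f, g and phi are illustrative values, not fitted ones.
    """
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            score = 0.0
            for xn, yn in crimes:
                # "Traveling distance" from the square's center to the crime.
                d = abs(i - xn) + abs(j - yn)
                if d > B:
                    # Outside the buffer zone: score decays with distance.
                    score += phi / d ** f
                else:
                    # Inside the buffer zone: score grows with distance.
                    score += (1 - phi) * B ** (g - f) / (2 * B - d) ** g
            row.append(score)
        grid.append(row)
    total = sum(sum(r) for r in grid)
    return [[s / total for s in r] for r in grid]


# Hypothetical crime locations on a 10 x 10 grid.
crimes = [(2, 3), (5, 6), (7, 2)]
probs = hot_zone(crimes, rows=10, cols=10)

# The hottest square would be colored bright yellow on the map.
best_p, best_cell = max(
    (p, (i, j)) for i, r in enumerate(probs) for j, p in enumerate(r)
)
print("hottest square:", best_cell, "probability:", round(best_p, 4))
```

In practice the values of B, f, g and phi would be chosen by fitting against past cases, as the text describes, rather than fixed in advance.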
  24. 24. CHAPTER 2

Fighting Crime with Statistics 101

THE ANGEL OF DEATH

By 1996, Kristen Gilbert, a thirty-three-year-old divorced mother of two sons, ages seven and ten, and a nurse in Ward C at the Veterans Affairs Medical Center in Northampton, Massachusetts, had built up quite a reputation among her colleagues at the hospital. On several occasions she was the first one to notice that a patient was going into cardiac arrest and to sound a "code blue" to bring the emergency resuscitation team. She always stayed calm, and was competent and efficient in administering to the patient. Sometimes she would give the patient an injection of the heart-stimulant drug epinephrine to attempt to restart the heart before the emergency team arrived, occasionally saving the patient's life in this way. The other nurses had given her the nickname "Angel of Death."

But that same year, three nurses approached the authorities to express their growing suspicions that something was not quite right. There had been just too many deaths from cardiac arrest in that particular ward, they felt. There had also been several unexplained shortages of epinephrine. The nurses were starting to fear that Gilbert was giving the patients large doses of the drug to bring on the heart attacks in the first place, so that she could play the heroic role of trying to save them. The "Angel of Death" nickname was beginning to sound more apt than they had first intended.

The hospital launched an investigation, but found nothing untoward. In particular, the number of cardiac deaths at the unit was broadly in line with the rates at other VA hospitals, they said. Despite the findings of the initial
  25. 25. investigation, however, the staff at the hospital remained suspicious, and eventually a second investigation was begun. This included bringing in a professional statistician, Stephen Gehlbach of the University of Massachusetts, to take a closer look at the unit's cardiac arrest and mortality figures. Largely as a result of Gehlbach's analysis, in 1998 the U.S. Attorney's Office decided to convene a grand jury to hear the evidence against Gilbert.

Part of the evidence was her alleged motivation. In addition to seeking the excitement of the code blue alarm and the resuscitation process, plus the recognition for having struggled valiantly to save the patient, it was suggested that she sought to impress her boyfriend, who also worked at the hospital. Moreover, she had access to the epinephrine. But since no one had seen her administer any fatal injections, the case against her, while suggestive, was purely circumstantial. Although the patients involved were mostly middle-aged men not regarded as potential heart attack victims, it was possible that their attacks had occurred naturally. What tipped the balance, and led to a decision to indict Gilbert for multiple murder, was Gehlbach's statistical analysis.

THE SCIENCE OF STATE

Statistics is widely used in law enforcement in many ways and for many purposes. In NUMB3RS, Charlie often carries out a statistical analysis, and the use of statistical techniques will appear in many chapters in this book, often without our making explicit mention of the fact. But what exactly does statistics entail? And why was the word in the singular in that last sentence?
The word "statistics" comes from the Latin term statisticum collegium, meaning "council of state," and the Italian word statista, meaning "statesman," which reflects the initial uses of the technique. The German word Statistik likewise originally meant the analysis of data about the state. Until the nineteenth century, the equivalent English term was "political arithmetic," after which the word "statistics" was introduced to refer to any collection and classification of data.

Today, "statistics" really has two connected meanings. The first is the collection and tabulation of data; the second is the use of mathematical and other methods to draw meaningful and useful conclusions from
  26. 26. tabulated data. Some statisticians refer to the former activity as "little-s statistics" and the latter activity as "big-S Statistics". Spelled with a lower-case s, the word is treated as plural when it refers to a collection of numbers. But it is singular when used to refer to the activity of collecting and tabulating those numbers. "Statistics" (with a capital S) refers to an activity, and hence is singular.

Though many sports fans and other kinds of people enjoy collecting and tabulating numerical data, the real value of little-s statistics is to provide the data for big-S Statistics. Many of the mathematical techniques used in big-S Statistics involve the branch of mathematics known as probability theory, which began in the sixteenth and seventeenth centuries as an attempt to understand the likely outcomes of games of chance, in order to increase the likelihood of winning. But whereas probability theory is a definite branch of mathematics, Statistics is essentially an applied science that uses mathematical methods.

While the law enforcement profession collects a large quantity of little-s statistics, it is the use of big-S Statistics as a tool in fighting crime that we shall focus on. (From now on we shall drop the "big S", "little s" terminology and use the word "statistics" the way statisticians do, to mean both, leaving the reader to determine the intended meaning from the context.)

Although some applications of statistics in law enforcement use sophisticated methods, the basic techniques covered in a first-semester college statistics course are often enough to crack a case.

This was certainly true for United States v. Kristen Gilbert.
In that case, a crucial question for the grand jury was whether there were significantly more deaths in the unit when Kristen Gilbert was on duty than at other times. The key word here is "significantly". One or two extra deaths on her watch could be coincidence. How many deaths would it take to reach the level of "significance" sufficient to indict Gilbert? This is a question that only statistics can answer. Accordingly, Stephen Gehlbach was asked to provide the grand jury with a summary of his findings.

HYPOTHESIS TESTING

Gehlbach's testimony was based on a fundamental statistical technique known as hypothesis testing. This method uses probability theory to
  27. 27. determine whether an observed outcome is so unusual that it is highly unlikely to have occurred naturally.

One of the first things Gehlbach did was plot the annual number of deaths at the hospital from 1988 through 1997, broken down by shifts: midnight to 8:00 AM, 8:00 AM to 4:00 PM, and 4:00 PM to midnight. The resulting graph is shown in Figure 1. Each vertical bar shows the total number of deaths in the year during that particular shift.

[Figure 1: bar chart of deaths per year, 1988 through 1997, with separate bars for the Night (12 A.M.-8 A.M.), Day (8 A.M.-4 P.M.), and Evening (4 P.M.-12 A.M.) shifts.]

Figure 1. Total deaths at the hospital, by shift and year.

The graph shows a definite pattern. For the first two years, there were around ten deaths per year on each shift. Then, for each of the years 1990 through 1995, one of the three shifts shows between 25 and 35 deaths per year. Finally, for the last two years, the figures drop back to roughly ten deaths on each of the three shifts. When the investigators examined Kristen Gilbert's work record, they discovered that she started work in Ward C in March 1990 and stopped working at the hospital in February 1996. Moreover, for each of the years she worked at the VA, the shift that showed the dramatically increased number of deaths was the one she worked. To a layperson, this might suggest that Gilbert was clearly responsible for the deaths, but on its own it would not be sufficient to secure a conviction; indeed, it might not be enough to justify even an indictment. The problem is that it may be just a coincidence. The job of the statistician
  28. 28. in this situation is to determine just how unlikely such a coincidence would be. If the answer is that the likelihood of such a coincidence is, say, 1 in 100, then Gilbert might well be innocent; and even 1 in 1,000 leaves some doubt as to her guilt; but with a likelihood of, say, 1 in 100,000, most people would find the evidence against her to be pretty compelling.

To see how hypothesis testing works, let's start with the simple example of tossing a coin. If the coin is perfectly balanced (i.e., unbiased or fair), then the probability of getting heads is 0.5.* Suppose we toss the coin ten times in a row to see if it is biased in favor of heads. Then we can get a range of different outcomes, and it is possible to compute the likelihood of different results. For example, the probability of getting at least six heads is about 0.38. (The calculation is straightforward but a bit intricate, because there are many possible ways you can get six or more heads in ten tosses, and you have to take account of all of them.) The figure of 0.38 puts a precise numerical value on the fact that, on an intuitive level, we would not be surprised if ten coin tosses gave six or more heads. For at least seven heads, the probability works out at 0.17, a figure that corresponds to our intuition that seven or more heads is somewhat unusual but certainly not a cause for suspicion that the coin was biased. What would surprise us is nine or ten heads, and for that the probability works out at about 0.01, or 1 in 100. The probability of getting ten heads is about 0.001, or 1 in 1,000, and if that happened we would definitely suspect an unfair coin.
Thus, by tossing the coin ten times, we can form a reliable, precise judgment, based on mathematics, of the hypothesis that the coin is unbiased.

In the case of the suspicious deaths at the Veterans Affairs Medical Center, the investigators wanted to know if the number of deaths that occurred when Kristen Gilbert was on duty was so unlikely that it could not be merely happenstance. The math is a bit more complicated than for the coin tossing, but the idea is the same. Table 1 gives the data the investigators had at their disposal. It gives numbers of shifts, classified in different ways, and covers the eighteen-month period ending in February

*Actually, this is not entirely accurate. Because of inertial properties of a physical coin, there is a slight tendency for it to resist turning, with the result that, if a perfectly balanced coin is given a random initial flip, the probability that it will land the same way up as it started is about 0.51. But we will ignore this caveat in what follows.
  29. 29. 1996, the month when the three nurses told their supervisor of their concerns, shortly after which Gilbert took a medical leave.

                       DEATH ON SHIFT
GILBERT PRESENT      YES       NO    TOTAL
YES                   40      217      257
NO                    34    1,350    1,384
TOTAL                 74    1,567    1,641

Table 1. The data for the statistical analysis in the Gilbert case.

Altogether, there were 74 deaths, spread over a total of 1,641 shifts. If the deaths are assumed to have occurred randomly, these figures suggest that the probability of a death on any one shift is about 74 out of 1,641, or 0.045. Focusing now on the shifts when Gilbert was on duty, there were 257 of them. If Gilbert was not killing any of the patients, we would expect there to be around 0.045 × 257 = 11.6 deaths on her shifts, i.e., around 11 or 12 deaths. In fact there were more: 40, to be precise. How likely is this? Using mathematical methods similar to those for the coin tosses, statistician Gehlbach calculated that the probability of having 40 or more of the 74 deaths occur on Gilbert's shifts was less than 1 in 100 million. In other words, it is unlikely in the extreme that Gilbert's shifts were merely "unlucky" for the patients.

The grand jury decided there was sufficient evidence to indict Gilbert (presumably the statistical analysis was the most compelling evidence, but we cannot know for sure, as a grand jury's deliberations are not public knowledge). She was accused of four specific murders and three attempted murders. Because the VA is a federal facility, the trial would be in a federal court rather than a state court, and subject to federal laws. A significant consequence of this fact for Gilbert was that although Massachusetts does not have a death penalty, federal law does, and that is what the prosecutor asked for.

STATISTICS IN THE COURTROOM?

An interesting feature of this case is that the federal trial judge ruled in pretrial deliberations that the statistical evidence should not be
  30. 30. presented in court. In making his ruling, the judge took note of a submission by a second statistician brought into the case, George Cobb of Mount Holyoke College.

Cobb and Gehlbach did not disagree on any of the statistical analysis. (In fact, they ended up writing a joint article about the case.) Rather, their roles were different, and they were addressing different issues. Gehlbach's task was to use statistics to determine if there were reasonable grounds to suspect Gilbert of multiple murder. More specifically, he carried out an analysis that showed that the increased numbers of deaths at the hospital during the shifts when Gilbert was on duty could not have arisen due to chance variation. That was sufficient to cast suspicion on Gilbert as the cause of the increase, but not at all enough to prove that she did cause the increase. What Cobb argued was that the establishment of a statistical relationship does not explain the cause of that relationship. The judge in the case accepted this argument, since the purpose of the trial was not to decide if there were grounds to make Gilbert a suspect; the grand jury and the state attorney's office had done that. Rather, the job before the court was to determine whether or not Gilbert caused the deaths in question. His reason for excluding the statistical evidence was that, as experiences in previous court cases had demonstrated, jurors not well versed in statistical reasoning (and that would be almost all jurors) typically have great difficulty appreciating why odds of 1 in 100 million against the suspicious deaths occurring by chance does not imply that the odds that Gilbert did not kill the patients are likewise 1 in 100 million. The original odds could be caused by something else.
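The probabilities quoted earlier for the coin tosses, and the order of magnitude of the figure for Table 1, can be checked with a few lines of Python. The coin-toss part is an exact binomial calculation. The second part is only a sketch under an assumed model: the 74 shifts with a death are treated as a random draw from the 1,641 shifts, which gives a hypergeometric tail. The text does not say which method Gehlbach actually used, and the function names here are illustrative.

```python
from fractions import Fraction
from math import comb


def coin_tail(k, n=10):
    """P(at least k heads in n tosses of a fair coin)."""
    return sum(comb(n, h) for h in range(k, n + 1)) / 2 ** n


def overlap_tail(k, total, marked, drawn):
    """P(at least k of `drawn` randomly chosen items are among the `marked`).

    Hypergeometric tail, summed exactly with integers before converting
    to a float. This is an assumed model, not Gehlbach's documented method.
    """
    ways = sum(
        comb(marked, x) * comb(total - marked, drawn - x)
        for x in range(k, min(marked, drawn) + 1)
    )
    return float(Fraction(ways, comb(total, drawn)))


for k in (6, 7, 9, 10):
    print(f"P(at least {k} heads in 10 tosses) = {coin_tail(k):.3f}")

# Table 1: 1,641 shifts, 257 with Gilbert present, 74 with a death,
# 40 with both.
expected = 74 / 1641 * 257
p_value = overlap_tail(40, total=1641, marked=257, drawn=74)
print(f"expected deaths on Gilbert's shifts: {expected:.1f}")
print(f"P(40 or more of the 74 on her shifts) = {p_value:.1e}")
```

Under this assumed model the tail probability comes out far below 1 in 100 million, in line with the bound quoted in the text. As Cobb's argument makes clear, such a number measures how implausible chance is as an explanation, not the probability of guilt.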
Cobb illustrated the distinction by means of a famous example from the long struggle physicians and scientists had in overcoming the powerful tobacco lobby to convince governments and the public that cigarette smoking causes lung cancer. Table 2 shows the mortality rates for three categories of people: nonsmokers, cigarette smokers, and cigar and pipe smokers.

Nonsmokers                20.2
Cigarette smokers         20.5
Cigar and pipe smokers    35.3

Table 2. Mortality rates per 1,000 people per year.
  31. 31. At first glance, the figures in Table 2 seem to indicate that cigarette smoking is not dangerous but pipe and cigar smoking are. However, this is not the case. There is a crucial variable lurking behind the data that the numbers themselves do not indicate: age. The average age of the nonsmokers was 54.9, the average age of the cigarette smokers was 50.5, and the average age of the cigar and pipe smokers was 65.9. Using statistical techniques to make allowance for the age differences, statisticians were able to adjust the figures to produce Table 3.

Nonsmokers                20.3
Cigarette smokers         28.3
Cigar and pipe smokers    21.2

Table 3. Mortality rates per 1,000 people per year, adjusted for age.

Now a very different pattern emerges, indicating that cigarette smoking is highly dangerous.

Whenever a calculation of probabilities is made based on observational data, the most that can generally be concluded is that there is a correlation between two or more factors. That can mean enough to spur further investigation, but on its own it does not establish causation. There is always the possibility of a hidden variable that lies behind the correlation.

When a study is made of, say, the effectiveness or safety of a new drug or medical procedure, statisticians handle the problem of hidden parameters not by relying on observational data but by conducting a randomized, double-blind trial. In such a study, the target population is divided into two groups by an entirely random procedure, with the group allocation unknown to both the experimental subjects and the caregivers administering the drug or treatment (hence the term "double-blind").
One group is given the new drug or treatment, the other is given a placebo or dummy treatment. With such an experiment, the random allocation into groups overrides the possible effect of hidden parameters, so that in this case a low probability that a positive result is simply chance variation can indeed be taken as conclusive evidence that the drug or treatment is what caused the result.
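The random allocation step can be sketched in a few lines of Python. This is a hypothetical illustration only: the subject IDs, the seed, and the function name are invented, and in a real double-blind trial the mapping from ID to group would be held by a third party so that neither subjects nor caregivers could see it.

```python
import random


def allocate(subject_ids, seed=None):
    """Randomly split subjects into a treatment group and a placebo group."""
    rng = random.Random(seed)  # a seed makes the allocation reproducible
    ids = list(subject_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {"treatment": sorted(ids[:half]), "placebo": sorted(ids[half:])}


# Twenty hypothetical subjects, numbered 1 to 20.
groups = allocate(range(1, 21), seed=42)
print(groups)
```

Because group membership depends only on the shuffle, any hidden variable such as age is, on average, spread evenly across the two groups.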
  32. 32. In trying to solve a crime, there is of course no choice but to work with the data available. Hence, use of the hypothesis-testing procedure, as in the Gilbert case, can be highly effective in the identification of a suspect, but other means are generally required to secure a conviction.

In United States v. Kristen Gilbert, the jury was not presented with Gehlbach's statistical analysis, but they did find sufficient evidence to convict her on three counts of first-degree murder, one count of second-degree murder, and two counts of attempted murder. Although the prosecution asked for the death sentence, the jury split 8-4 on that issue, and accordingly Gilbert was sentenced to life imprisonment with no possibility of parole.

POLICING THE POLICE

Another use of basic statistical techniques in law enforcement concerns the important matter of ensuring that the police themselves obey the law.

Law enforcement officers are given a considerable amount of power over their fellow citizens, and one of the duties of society is to make certain that they do not abuse that power. In particular, police officers are supposed to treat everyone equally and fairly, free of any bias based on gender, race, ethnicity, economic status, age, dress, or religion.

But determining bias is a tricky business and, as we saw in our previous discussion of cigarette smoking, a superficial glance at the statistics can sometimes lead to a completely false conclusion. This is illustrated in a particularly dramatic fashion by the following example, which, while not related to police activity, clearly indicates the need to approach statistics with some mathematical sophistication.
In the 1970s, somebody noticed that 44 percent of male applicants to the graduate school of the University of California at Berkeley were accepted, but only 35 percent of female applicants were accepted. On the face of it, this looked like a clear case of gender discrimination, and, not surprisingly (particularly at Berkeley, long acknowledged as home to many leading advocates for gender equality), there was a lawsuit over gender bias in admissions policies.
  33. 33. It turns out that Berkeley applicants do not apply to the graduate school, but to individual programs of study (such as engineering, physics, or English), so if there is any admissions bias, it will occur within one or more particular programs. Table 4 gives the admission data program by program:

Major   Male apps   % admit   Female apps   % admit
A             825        62           108        82
B             560        63            25        68
C             325        37           593        34
D             417        33           375        35
E             191        28           393        24
F             373         6           341         7

Table 4. Admission figures from the University of California at Berkeley on a program-by-program basis.

If you look at each program individually, however, there doesn't appear to be an advantage in admission for male applicants. Indeed, the percentage of female applicants admitted to heavily subscribed program A is considerably higher than for males, and in all other programs the percentages are fairly close. So how can there appear to be an advantage for male applicants overall?

To answer this question, you need to look at what programs males and females applied to. Males applied heavily to programs A and B, females applied primarily to programs C, D, E, and F. The programs that females applied to were more difficult to get into than those for males (the percentages admitted are low for both genders), and this is why it appears that males had an admission advantage when looking at the aggregate data.

There was indeed a gender factor at work here, but it had nothing to do with the university's admissions procedures. Rather, it was one of self-selection by the applying students, where female applicants avoided programs A and B.
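The reversal can be reproduced directly from Table 4. The short Python sketch below aggregates the six programs; the variable and function names are illustrative. Because Table 4 lists only six programs and rounded percentages, the aggregate rates it yields need not match the 44 and 35 percent figures quoted for the graduate school as a whole.

```python
# program: (male applicants, male % admitted,
#           female applicants, female % admitted), from Table 4
table4 = {
    "A": (825, 62, 108, 82),
    "B": (560, 63, 25, 68),
    "C": (325, 37, 593, 34),
    "D": (417, 33, 375, 35),
    "E": (191, 28, 393, 24),
    "F": (373, 6, 341, 7),
}


def overall_rate(apps_and_pcts):
    """Aggregate admission rate (in percent) over all programs."""
    admitted = sum(n * pct / 100 for n, pct in apps_and_pcts)
    applied = sum(n for n, _ in apps_and_pcts)
    return 100 * admitted / applied


male = overall_rate([(m, mp) for m, mp, _, _ in table4.values()])
female = overall_rate([(w, wp) for _, _, w, wp in table4.values()])

# Programs in which the female admission percentage is the higher one.
female_ahead = [k for k, (_, mp, _, wp) in table4.items() if wp > mp]

print(f"aggregate male rate:   {male:.1f}%")
print(f"aggregate female rate: {female:.1f}%")
print("programs where women were admitted at a higher rate:", female_ahead)
```

Women have the higher admission rate in four of the six programs, yet the aggregate male rate is clearly higher, because most male applications went to the two programs with high admission rates.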
  34. 34. The Berkeley case was an example of a phenomenon known as Simpson's paradox, named for E. H. Simpson, who studied this curious phenomenon in a famous 1951 paper.*

HOW DO YOU DETERMINE BIAS?

With the above cautionary example in mind, what should we make of the study carried out in Oakland, California, in 2003 (by the RAND Corporation, at the request of the Oakland Police Department's Racial Profiling Task Force), to determine if there was systematic racial bias in the way police stopped motorists?

The RAND researchers analyzed 7,607 vehicle stops recorded by Oakland police officers between June and December 2003, using various statistical tools to examine a number of variables to uncover any evidence that suggested racial profiling. One figure they found was that blacks were involved in 56 percent of all traffic stops studied, although they make up just 35 percent of Oakland's residential population. Does this finding indicate racial profiling? Well, it might, but as soon as you look more closely at what other factors could be reflected in those numbers, the issue is by no means clear cut.

For instance, like many inner cities, Oakland has some areas with much higher crime rates than others, and the police patrol those higher crime areas at a much greater rate than they do areas having less crime. As a result, they make more traffic stops in those areas. Since the higher crime areas typically have greater concentrations of minority groups, the higher rate of traffic stops in those areas manifests itself as a higher rate of traffic stops of minority drivers.

To overcome these uncertainties, the RAND researchers devised a particularly ingenious way to look for possible racial bias.
If racial profil­ing was occurring, they reasoned, stops of minority drivers w o u l d b ehigher w h e n the officers could d e t e r m i n e the drivers race prior t o mak­ing the stop. Therefore, they c o m p a r e d t h e stops m a d e d u r i n g a period * E . H. S i m p s o n . " T h e I n t e r p r e t a t i o n o f I n t e r a c t i o n in C o n t i n g e n c y T a b l e s , " Jour­nal of the Royal Statistical Society, Ser. B, 13 (1951) 2 3 8 - 2 4 1 .
  35. 35. 24 T H E NUMBERS B E H I N D NUMB3RSj u s t before nightfall w i t h those m a d e after d a r k — w h e n t h e officersw o u l d b e less likely t o b e able t o d e t e r m i n e t h e drivers race. T h e figuress h o w e d that 50 p e r c e n t of drivers stopped d u r i n g the daylight periodw e r e black, c o m p a r e d w i t h 54 p e r c e n t w h e n it was dark. Based o n thatfinding, t h e r e does n o t appear to b e systematic racial bias in trafficstops. But t h e researchers d u g a little further, and looked at the officerso w n reports as t o w h e t h e r they could d e t e r m i n e the drivers race priort o m a k i n g t h e stop. W h e n officers r e p o r t e d k n o w i n g the race in advanceof t h e stop, 6 6 p e r c e n t of drivers stopped w e r e black, c o m p a r e d w i t honly 44 percent w h e n t h e police r e p o r t e d n o t k n o w i n g the drivers racein advance. This is a fairly s t r o n g indicator of racial bias.* *Sadly, d e s p i t e m a n y efforts t o e l i m i n a t e t h e p r o b l e m , racial bias b y p o l i c es e e m s t o b e a p e r s i s t e n t issue t h r o u g h o u t t h e country. To cite just o n e recent r e p o r t ,A n Analysis of Traffic Stop Data in Riverside, California, b y Larry K. Gaines of t h eC a l i f o r n i a State University in San B e r n a r d i n o , p u b l i s h e d in Police Quarterly, 9, 2 ,J u n e 2 0 0 6 , p p . 2 1 0 - 2 3 3 : " T h e f i n d i n g s f r o m racial p r o f i l i n g or traffic s t o p studiesh a v e b e e n fairly c o n s i s t e n t : M i n o r i t i e s , especially African A m e r i c a n s , are s t o p p e d ,t i c k e t e d , a n d s e a r c h e d at a h i g h e r rate as c o m p a r e d t o W h i t e s . For e x a m p l e ,L a m b e r t h (cited in State v. 
Pedro Soto, 1996) f o u n d t h a t t h e M a r y l a n d State Polices t o p p e d a n d s e a r c h e d A f r i c a n A m e r i c a n s at a h i g h e r rate as c o m p a r e d t o theirrate o f s p e e d i n g v i o l a t i o n s . Harris (1999) e x a m i n e d c o u r t records in A k r o n , D a y t o n ,T o l e d o , a n d C o l u m b u s , O h i o , a n d f o u n d t h a t African A m e r i c a n s w e r e c i t e d at a ratet h a t surpassed t h e i r r e p r e s e n t a t i o n in t h e d r i v i n g p o p u l a t i o n . C o r d n e r , W i l l i a m s , a n dZ u n i g a (2000) a n d C o r d n e r , W i l l i a m s , a n d Velasco (2002) f o u n d similar t r e n d s in SanD i e g o , C a l i f o r n i a . Zingraff a n d his c o l l e a g u e s (2000) e x a m i n e d s t o p s b y t h e N o r t hCarolina H i g h w a y Patrol a n d f o u n d t h a t A f r i c a n A m e r i c a n s w e r e o v e r r e p r e s e n t e d ins t o p s a n d searches."
  36. 36. CHAPTER 3
Data Mining: Finding Meaningful Patterns in Masses of Information
BRUTUS
Charlie Eppes is sitting in front of a bank of computers and television monitors. He is testing a computer program he is developing to help police monitor large crowds, looking for unusual behavior that could indicate a pending criminal or terrorist act. His idea is to use standard mathematical equations that describe the flow of fluids: in rivers, lakes, oceans, tanks, pipes, even blood vessels.* He is trying out the new system at a fund-raising reception for one of the California state senators. Overhead cameras monitor the diners as they move around the room, and Charlie's computer program analyzes the "flow" of the people. Suddenly the test takes on an unexpected aspect. The FBI receives a telephone warning that a gunman is in the room, intending to kill the senator.
* The idea is based on several real-life projects to use the equations that describe fluid flows in order to analyze various kinds of crowd activity, including freeway traffic flow, spectators entering and leaving a large sports stadium, and emergency exits from burning buildings.
The software works, and Charlie is able to identify the gunman, but Don and his team are not able to get to the killer before he has shot the senator and then turned the gun on himself.
The dead assassin turns out to be a Vietnamese immigrant, a former Vietcong member, who, despite having been in prison in California, somehow managed to obtain U.S. citizenship and be the recipient of a regular pension from the U.S. Army.
  37. 37. He had also taken the illegal drug speed on the evening of the assassination. When Don makes some enquiries to find out just what is going on, he is visited by a CIA agent who asks for help in trying to prevent too much information about the case leaking out. Apparently the dead killer had been part of a covert CIA behavior modification project carried out in California prisons during the 1960s to turn inmates into trained assassins who, when activated, would carry out their assigned task before killing themselves. (Sadly, this idea is no less fanciful than that of Charlie using fluid flow equations to study crowd behavior.)
But why had this particular individual suddenly become active and murdered the state senator?
The picture becomes much clearer when a second murder occurs. The victim this time is a prominent psychiatrist, the killer a Cuban immigrant. The killer had also spent time in a California prison, and he too was the recipient of regular Army pension checks. But on this occasion, when the assassin tries to shoot himself after killing the victim, the gun fails to go off and he has to flee the scene. A fingerprint identification from the gun soon leads to his arrest.
When Don realizes that the dead senator had been urging a repeal of the statewide ban on the use of behavior modification techniques on prison inmates, and that the dead psychiatrist had been recommending the re-adoption of such techniques to overcome criminal tendencies, he quickly concludes that someone has started to turn the conditioned assassins on the very people who were pressing for the reuse of the techniques that had produced them. But who?
Don thinks his best line of investigation is to find out who supplied the guns that the two killers had used. He knows that the weapons originated with a dealer in Nevada. Charlie is able to provide the next step, which leads to the identification of the individual behind the two assassinations. He obtains data on all gun sales involving that particular dealer and analyzes the relationships among all sales that originated there. He explains that he is employing mathematical techniques similar to those used to analyze calling patterns on the telephone network, an approach used frequently in real-life law enforcement.
  38. 38. This is what viewers saw in the third-season episode of NUMB3RS called "Brutus" (the code name for the fictitious CIA conditioned-assassinator project), first aired on November 24, 2006. As usual, the mathematics Charlie uses in the show is based on real life.
The method Charlie uses to track the gun distribution is generally referred to as "link analysis," and is one among many that go under the collective heading of "data mining." Data mining obtains useful information among the mass of data that is available, often publicly, in modern society.
FINDING MEANING IN INFORMATION
Data mining was initially developed by the retail industry to detect customer purchasing patterns. (Ever wonder why supermarkets offer customers those loyalty cards, sometimes called "club" cards, in exchange for discounts? In part it's to encourage customers to keep shopping at the same store, but low prices would do that. The significant factor for the company is that it enables them to track detailed purchase patterns that they can link to customers' home zip codes, information that they can then analyze using data-mining techniques.)
Though much of the work in data mining is done by computers, for the most part those computers do not run autonomously. Human expertise also plays a significant role, and a typical data-mining investigation will involve a constant back-and-forth interplay between human expert and machine.
Many of the computer applications used in data mining fall under the general area known as artificial intelligence, although that term can be misleading, being suggestive of computers that think and act like people.
Although many people believed that was a possibility back in the 1950s when AI first began to be developed, it eventually became clear that this was not going to happen within the foreseeable future, and may well never be the case. But that realization did not prevent the development of many "automated reasoning" programs, some of which eventually found a powerful and important use in data mining, where the human expert often provides the "high-level intelligence" that guides the computer programs that do the bulk of the work. In this way, data mining provides an excellent example of the power that results when human brains team up with computers.
  39. 39. Among the more prominent methods and tools used in data mining are:
    • Link analysis: looking for associations and other forms of connection among, say, criminals or terrorists
    • Geometric clustering: a specific form of link analysis
    • Software agents: small, self-contained pieces of computer code that can monitor, retrieve, analyze, and act on information
    • Machine learning: algorithms that can extract profiles of criminals and graphical maps of crimes
    • Neural networks: special kinds of computer programs that can predict the probability of crimes and terrorist attacks.
We'll take a brief look at each of these topics in turn.
LINK ANALYSIS
Newspapers often refer to link analysis as "connecting the dots." It's the process of tracking connections between people, events, locations, and organizations. Those connections could be family ties, business relationships, criminal associations, financial transactions, in-person meetings, e-mail exchanges, and a host of others. Link analysis can be particularly powerful in fighting terrorism, organized crime, money laundering ("follow the money"), and telephone fraud.
Link analysis is primarily a human-expert driven process. Mathematics and technology are used to provide a human expert with powerful, flexible computer tools to uncover, examine, and track possible connections.
Those tools generally allow the analyst to represent linked data as a network, displayed and examined (in whole or in part) on the computer screen, with nodes representing the individuals or organizations or locations of interest and the links between those nodes representing relationships or transactions. The tools may also allow the analyst to investigate and record details about each link, and to discover new nodes that connect to existing ones or new links between existing nodes.
  40. 40. For example, in an investigation into a suspected crime ring, an investigator might carry out a link analysis of telephone calls a suspect has made or received, using telephone company call-log data, looking at factors such as number called, time and duration of each call, or number called next. The investigator might then decide to proceed further along the call network, looking at calls made to or from one or more of the individuals who had had phone conversations with the initial suspect. This process can bring to the investigator's attention individuals not previously known. Some may turn out to be totally innocent, but others could prove to be criminal collaborators.
Another line of investigation may be to track cash transactions to and from domestic and international bank accounts.
Still another line may be to examine the network of places and people visited by the suspect, using such data as train and airline ticket purchases, points of entry or departure in a given country, car rental records, credit card records of purchases, websites visited, and the like.
Given the difficulty nowadays of doing almost anything without leaving an electronic trace, the challenge in link analysis is usually not one of having insufficient data, but rather of deciding which of the megabytes of available data to select for further analysis. Link analysis works best when backed up by other kinds of information, such as tips from police informants or from neighbors of possible suspects.
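The call-log investigation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records in `CALL_LOG`, the people named in them, and the helper functions `build_network` and `contacts_within` are all hypothetical, and real link-analysis tools carry far more detail about each link.

```python
from collections import defaultdict, deque

# Hypothetical call-log records: (caller, callee, duration in seconds).
CALL_LOG = [
    ("suspect", "alice", 120),
    ("suspect", "bob", 45),
    ("bob", "carol", 300),
    ("carol", "dave", 60),
    ("alice", "bob", 30),
    ("erin", "frank", 90),
]

def build_network(call_log):
    """Undirected network: each person maps to the set of people they spoke with."""
    links = defaultdict(set)
    for caller, callee, _duration in call_log:
        links[caller].add(callee)
        links[callee].add(caller)
    return links

def contacts_within(links, start, max_hops):
    """Breadth-first search: people reachable from `start` in at most `max_hops` calls."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if hops[person] == max_hops:
            continue  # do not expand beyond the requested number of hops
        for other in links[person]:
            if other not in hops:
                hops[other] = hops[person] + 1
                queue.append(other)
    del hops[start]  # report only the contacts, not the starting node
    return hops

network = build_network(CALL_LOG)
# Direct contacts of the suspect, plus the people those contacts called.
print(contacts_within(network, "suspect", 2))
```

Raising `max_hops` corresponds to the investigator proceeding further along the call network; people with no path to the suspect (here "erin" and "frank") never appear.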
Once an initial link analysis has identified a possible criminal or terrorist network, it may be possible to determine who the key players are by examining which individuals have the most links to others in the network.
GEOMETRIC CLUSTERING
Because of resource limitations, law enforcement agencies generally focus most of their attention on major crime, with the result that minor offenses such as shoplifting or house burglaries get little attention. If, however, a single person or an organized gang commits many such crimes on a regular basis, the aggregate can constitute significant criminal activity that deserves greater police attention. The problem facing the authorities, then, is to identify, within the large numbers of minor crimes that take place every day, clusters that are the work of a single individual or gang.
  41. 41. One example of a "minor" crime that is often carried out on a regular basis by two (and occasionally three) individuals acting together is the so-called bogus official burglary (or distraction burglary). This is where two people turn up at the front door of a homeowner (elderly people are often the preferred targets) posing as some form of officials (perhaps telephone engineers, representatives of a utility company, or local government agents) and, while one person secures the attention of the homeowner, the other moves quickly through the house or apartment taking any cash or valuables that are easily accessible.
Victims of bogus official burglaries often file a report to the police, who will send an officer to the victim's home to take a statement. Since the victim will have spent considerable time with one of the perpetrators (the distracter), the statement will often include a fairly detailed description (gender, race, height, body type, approximate age, general facial appearance, eyes, hair color, hair length, hair style, accent, identifying physical marks, mannerisms, shoes, clothing, unusual jewelry, etc.) together with the number of accomplices and their genders. In principle, this wealth of information makes crimes of this nature ideal for data mining, and in particular for the technique known as geometric clustering, to identify groups of crimes carried out by a single gang. Application of the method is, however, fraught with difficulties, and to date the method appears to have been restricted to one or two experimental studies.
We'll look at one such study, both to show how the method works and to illustrate some of the problems often faced by the data-mining practitioner.
The following study was carried out in England in 2000 and 2001 by researchers at the University of Wolverhampton, together with the West Midlands Police.* The study looked at victim statements from bogus official burglaries in the police region over a three-year period. During that period, there were 800 such burglaries recorded, involving 1,292 offenders.
* R. Adderley and P. B. Musgrove, "General Review of Police Crime Recording and Investigation Systems," Policing: An International Journal of Police Strategies and Management, 24 (1), 2001, pp. 110-114.
  42. 42. This proved to be too great a number for the resources available for the study, so the analysis was restricted to those cases where the distracter was female, a group comprising 89 crimes and 105 offender descriptions.
The first problem encountered was that the descriptions of the perpetrators were for the most part in narrative form, as written by the investigating officer who took the statement from the victim. A data-mining technique known as text mining had to be used to put the descriptions into a structured form. Because of the limitations of the text-mining software available, human input was required to handle many of the entries; for instance, to cope with spelling mistakes, ad hoc or inconsistent abbreviations (e.g., "Bham" or "B'ham" for "Birmingham"), and the use of different ways of expressing the same thing (e.g., "Birmingham accent", "Bham accent", "local accent", "accent: local", etc.).
After some initial analysis, the researchers decided to focus on eight variables: age, height, hair color, hair length, build, accent, race, and number of accomplices.
Once the data had been processed into the appropriate structured format, the next step was to use geometric clustering to group the 105 offender descriptions into collections that were likely to refer to the same individual. To understand how this was done, let's first consider a method that at first sight might appear to be feasible, but which soon proves to have significant weaknesses. Then, by seeing how those weaknesses may be overcome, we will arrive at the method used in the British study.
First, you code each of the eight variables numerically. Age, often a guess, is likely to be recorded either as a single figure or a range; if it is a range, take the mean.
Gender (not considered in the British Midlands study because all the cases examined had a female distracter) can be coded as 1 for male, 0 for female. Height may be given as a number (inches), a range, or a term such as "tall", "medium", or "short"; again, some method has to be chosen to convert each of these to a single figure. Likewise, schemes have to be devised to represent each of the other variables as a number.
When the numerical coding has been completed, each perpetrator description is then represented by an eight-vector, the coordinates of a point in eight-dimensional geometric (Euclidean) space. The familiar distance measure of Euclidean geometry (the Pythagorean metric) can then be used to measure the geometric distance between each pair of points. This gives the distance between two vectors $(x_1, \ldots, x_8)$ and $(y_1, \ldots, y_8)$ as:
    $$\sqrt{(x_1 - y_1)^2 + \cdots + (x_8 - y_8)^2}$$
  43. 43. Points that are close together under this metric are likely to correspond to perpetrator descriptions that have several features in common; and the closer the points, the more features the descriptions are likely to have in common. (Remember, there are problems with this approach, which we'll get to momentarily. For the time being, however, let's suppose that things work more or less as just described.)
The challenge now is to identify clusters of points that are close together. If there were only two variables, this would be easy. All the points could be plotted on a single x,y-graph and visual inspection would indicate possible clusters. But human beings are totally unable to visualize eight-dimensional space, no matter what assistance the software system designers provide by way of data visualization tools. The way around this difficulty is to reduce the eight-dimensional array of points (descriptions) to a two-dimensional array (i.e., a matrix or table). The idea is to arrange the data points (that is, the vector representatives of the offender descriptions) in a two-dimensional grid in such a way that:
    1. pairs of points that are extremely close together in the eight-dimensional space are put into the same grid entry;
    2. pairs of points that are neighbors in the grid are close together in the eight-dimensional space; and
    3. points that are farther apart in the grid are farther apart in the space.
This can be done using a special kind of computer program known as a neural net, in particular, a Kohonen self-organizing map (or SOM).
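To make the grid idea concrete, here is a minimal self-organizing map written in plain Python. It is a sketch of the standard textbook algorithm (a Gaussian neighbourhood with a shrinking radius and learning rate), not the software used in the Wolverhampton study. The parameter values, the function names `train_som` and `best_cell`, and the two groups of already-coded descriptions are all assumptions made for illustration.

```python
import math
import random

def squared_distance(a, b):
    """Squared Pythagorean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def best_cell(weights, x):
    """Return (row, col) of the grid cell whose weight vector is closest to x."""
    cells = ((r, c) for r in range(len(weights)) for c in range(len(weights[0])))
    return min(cells, key=lambda rc: squared_distance(weights[rc[0]][rc[1]], x))

def train_som(data, rows, cols, iterations=3000, seed=0):
    """Train a rows x cols map; each grid cell holds one weight vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    sigma_start = max(rows, cols) / 2.0
    for t in range(iterations):
        frac = t / iterations
        rate = 0.5 * (1.0 - frac)                         # learning rate decays to 0
        sigma = sigma_start + (0.5 - sigma_start) * frac  # radius shrinks to 0.5
        x = rng.choice(data)
        best_r, best_c = best_cell(weights, x)
        for r in range(rows):
            for c in range(cols):
                grid_d2 = (r - best_r) ** 2 + (c - best_c) ** 2
                influence = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                w = weights[r][c]
                # Pull this cell toward the sample; nearby cells move the most.
                for i in range(dim):
                    w[i] += rate * influence * (x[i] - w[i])
    return weights

# Hypothetical offender descriptions, each coded as eight numbers in 0..1.
gang_a = [[0.10] * 8, [0.15] * 8, [0.12] * 8]
gang_b = [[0.90] * 8, [0.85] * 8, [0.88] * 8]

weights = train_som(gang_a + gang_b, rows=3, cols=3)
cells_a = {best_cell(weights, v) for v in gang_a}
cells_b = {best_cell(weights, v) for v in gang_b}
print("gang A cells:", cells_a)
print("gang B cells:", cells_b)
```

After training, each description is assigned to the grid cell whose weight vector it most resembles, so similar descriptions share a cell or land in neighbouring cells. In this toy data the two groups end up in separate parts of the grid.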
  44. 44. Neural nets (including SOMs) are described later in the chapter. For now, all we need to know is that these systems, which work iteratively, are extremely good at homing in (over the course of many iterations) on patterns, such as geometric clusters of the kind we are interested in, and thus can indeed take an eight-dimensional array of the kind described above and place the points appropriately in a two-dimensional grid. (Part of the skill required to use an SOM effectively in a case such as this is deciding in advance, or by some initial trial and error, what are the optimal dimensions of the final grid. The SOM needs that information in order to start work.)
Once the data has been put into the grid, law enforcement officers can examine grid squares that contain several entries, which are highly likely to come from a single gang responsible for a series of crimes, and can visually identify clusters on the grid, where there is also a likelihood that they represent gang activity. In either case, the officers can examine the corresponding original crime statement entries, looking for indications that those crimes are indeed the work of a single gang.
Now let's see what goes wrong with the method just described, and how to correct it.
The first problem is that the original encoding of entries as numbers is not systematic. This can lead to one variable dominating others when the entries are clustered using geometric distance (the Pythagorean metric) in eight-dimensional space. For example, a dimension that measures height (which could be anything between 60 inches and 76 inches) would dominate the entry for gender (0 or 1).
So the first step is to scale (in mathematical terminology, normalize) the eight numerical variables, so that each one varies between 0 and 1.
One way to do that would be to simply scale down each variable by a multiplicative scaling factor appropriate for that particular feature (height, age, etc.). But that will introduce further problems when the separation distances are calculated; for example, if gender and height are among the variables, then, all other variables being roughly the same, a very tall woman would come out close to a very short man (because female gives a 0 and male gives a 1, whereas tall comes out close to 1 and short close to 0). Thus, a more sophisticated normalization procedure has to be used.
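The domination problem, and the effect of a simple rescaling, can be seen with two variables. The sketch below uses min-max scaling (subtract the smallest observed value, then divide by the observed range), which is one common simple choice and a small variation on the purely multiplicative factor mentioned above. It is not the more sophisticated procedure the text goes on to call for, and the three descriptions are hypothetical.

```python
import math

# Hypothetical descriptions coded as (height in inches, gender: 1 male, 0 female).
short_woman = (60.0, 0.0)
tall_woman = (76.0, 0.0)
short_man = (61.0, 1.0)

def min_max_scale(vectors):
    """Rescale every coordinate to the range 0..1 using its observed min and max."""
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        tuple((value - low) / span if span else 0.0
              for value, low, span in zip(vector, lows, spans))
        for vector in vectors
    ]

raw = [short_woman, tall_woman, short_man]
scaled = min_max_scale(raw)

# Before scaling: the 16-inch height gap swamps the 0-or-1 gender difference.
print("raw:   ", math.dist(raw[0], raw[1]), math.dist(raw[0], raw[2]))
# After scaling: height and gender both range over 0..1.
print("scaled:", math.dist(scaled[0], scaled[1]), math.dist(scaled[0], scaled[2]))
```

On the raw figures, the two women are 16 units apart while the short woman and the short man are under 1.5 units apart, so gender barely registers. After scaling, a difference in gender counts for as much as the full range of heights.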
