Mass declassification sept 23 2010v2.1


My public presentation as delivered to the Public Interest Declassification Board (PIDB), which is trying to determine the best way to declassify and release over 400M classified documents.

Published in: Technology

  1. 1. Mass Declassification What If? Jeff Jonas, IBM Distinguished Engineer Chief Scientist, IBM Entity Analytics [email_address] September 23, 2010
  2. 2. The Ask <ul><li>What emerging technology or innovative approaches come to mind … which may have applicability to this task? </li></ul><ul><li>Use your imagination. What if? </li></ul><ul><li>Not talking about any specific products </li></ul><ul><li>Not focusing on the widely available COTS/GOTS technologies (OCR, document management, case management, workflow, etc.) </li></ul>
  3. 3. The Problem at Hand <ul><li>Volumes may be beyond human brute-force review (@5min/ea = 18,382 FTEs) </li></ul><ul><li>Necessitates some form of machine triage </li></ul><ul><ul><li>Red: A disclosure risk </li></ul></ul><ul><ul><li>Yellow: A possible disclosure risk </li></ul></ul><ul><ul><li>Green: No disclosure risk </li></ul></ul><ul><li>Reliable machine triage requires substantially better prediction systems </li></ul><ul><li>Even then, advanced means for humans to deal with the remaining large volumes of “possibles” are still required </li></ul>
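The 18,382 FTE figure on slide 3 can be reproduced with a back-of-the-envelope calculation. The sketch below assumes 450M documents (the count given later on slide 22), 2,040 working hours per reviewer per year, and a one-year deadline. The hours figure and the deadline are assumptions chosen because they reproduce the slide's number; the slides do not state them.

```python
# Back-of-the-envelope check of the brute-force review estimate.
# ASSUMPTIONS (not stated on the slide): 450M documents, 2,040 working
# hours per reviewer per year, and the backlog is cleared in one year.
DOCUMENTS = 450_000_000
MINUTES_PER_DOCUMENT = 5
HOURS_PER_FTE_YEAR = 2_040


def ftes_required(documents: int, minutes_each: float, hours_per_fte: float) -> float:
    """Return the number of full-time reviewers needed to finish in one year."""
    total_hours = documents * minutes_each / 60
    return total_hours / hours_per_fte


if __name__ == "__main__":
    # Prints 18382 under the assumptions above.
    print(round(ftes_required(DOCUMENTS, MINUTES_PER_DOCUMENT, HOURS_PER_FTE_YEAR)))
```

Changing the minutes per document or the deadline scales the headcount linearly, which is why the slide argues for machine triage instead of more reviewers.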
  4. 4. Background <ul><li>Early 80’s: Founded Systems Research & Development (SRD), a custom software consultancy </li></ul><ul><li>1989 – 2003: Built numerous systems for Las Vegas casinos including a technology known as Non-Obvious Relationship Awareness (NORA) </li></ul><ul><li>2001/2003: Funded by In-Q-Tel </li></ul><ul><li>2005: IBM acquires SRD </li></ul><ul><li>Cumulatively: I have had a hand in a number of systems with multi-billions of rows describing 100’s of millions of entities </li></ul><ul><li>Affiliations: </li></ul><ul><ul><li>Member, Markle Foundation Task Force on National Security in the Information Age </li></ul></ul><ul><ul><li>Senior Associate, Center for Strategic and International Studies (CSIS) </li></ul></ul><ul><ul><li>Distinguished Research Faculty (adjunct), Singapore Management University, School of Information Systems </li></ul></ul><ul><ul><li>Member, EPIC advisory board </li></ul></ul><ul><ul><li>Board Member, US Geospatial Intelligence Foundation (USGIF), the GEOINT organizing body </li></ul></ul>
  5. 5. In Today’s Session <ul><li>Intro to context accumulating systems </li></ul><ul><li>Predictions and data points needed for mass declassification </li></ul><ul><li>Strawman architecture </li></ul><ul><li>Challenges </li></ul><ul><li>Q&A </li></ul>
  6. 6. Context Accumulating Systems
  7. 7. From Pixels to Pictures to Insight Observations Context Relevance Consumer (An analyst, a system, the sensor itself, etc.) Contextualization
  8. 8. <ul><li>Context, definition of: </li></ul><ul><li>Better understanding something by taking into account the things around it. </li></ul>
  9. 9. Without Context [email_address]
  10. 10. Consequences <ul><li>Algorithms flat-lining (e.g., alert queues) </li></ul><ul><li>Enterprise amnesia on the rise </li></ul><ul><li>Overwhelmed by false positives and false negatives? You have seen nothing yet </li></ul><ul><li>Not enough humans to fix this with brute force </li></ul><ul><li>Risk assessment becomes the risk </li></ul>
  11. 11. Context Accumulation Trusted Supplier Job Applicant Stolen Identity Known Terrorist [email_address]
  12. 12. Puzzle Metaphor Primer <ul><li>Imagine an ever-growing pile of puzzle pieces of varying sizes, shapes and colors </li></ul><ul><li>What it represents is unknown – there is no picture on hand </li></ul><ul><li>Is it one puzzle, 15 puzzles, or 1,500 puzzles? </li></ul><ul><li>Some pieces are duplicates and some are missing </li></ul><ul><li>Some pieces are incomplete, low quality, or have been misinterpreted </li></ul><ul><li>Some pieces may even be professionally fabricated lies </li></ul><ul><li>Until you take the pieces to the table, you don’t know what you are dealing with </li></ul>
  13. 13. How Context Accumulates <ul><li>With each new observation … one of three assertions is made: 1) un-associated; 2) near like neighbors; or 3) connections </li></ul><ul><li>Asserted connections must favor the false negative </li></ul><ul><li>New observations sometimes reverse earlier assertions </li></ul><ul><li>Some observations produce novel discovery </li></ul><ul><li>As the working space expands, computational effort increases </li></ul><ul><li>The emerging picture helps focus collection interests </li></ul><ul><li>Given sufficient observations, there can come a tipping point </li></ul><ul><li>Thereafter, confidence improves while computational effort decreases!!!! </li></ul>
  14. 14. False Negatives Overstate The Universe Observations Unique Identities True Population
  15. 15. Counting Is Difficult Mark Smith 6/12/1978 443-43-0000 Mark R Smith (707) 433-0000 DL: 00001234 File 1 File 2
  16. 16. The Rise and Fall of a Population Observations Unique Identities True Population
  17. 17. Data Triangulation Mark Smith 6/12/1978 443-43-0000 Mark R Smith (707) 433-0000 DL: 00001234 File 1 File 2 Mark Randy Smith 443-43-0000 DL: 00001234 New Record
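The counting and triangulation slides can be illustrated with a toy entity-resolution loop. This is a minimal sketch, not the method behind NORA or any IBM product: it assumes an exact match on any one strong identifier (SSN, driver's license, or phone) is enough to assert a connection, and the field names are invented for the example. The three records are the ones shown on the slides.

```python
from dataclasses import dataclass, field

# Identifiers treated as strong enough to assert a connection.
# ASSUMPTION: exact match on any one key; a real system weighs many more
# features and favors the false negative when it is unsure.
STRONG_KEYS = ("ssn", "dl", "phone")


@dataclass
class Entity:
    records: list = field(default_factory=list)

    def identifiers(self) -> set:
        return {(k, r[k]) for r in self.records for k in STRONG_KEYS if k in r}


def accumulate(entities: list, record: dict) -> list:
    """Add one observation, merging every entity it connects to."""
    ids = {(k, record[k]) for k in STRONG_KEYS if k in record}
    matched = [e for e in entities if e.identifiers() & ids]
    unmatched = [e for e in entities if not (e.identifiers() & ids)]
    merged = Entity(records=[r for e in matched for r in e.records] + [record])
    return unmatched + [merged]


def identity_counts(observations: list) -> list:
    """Return the number of unique identities after each observation."""
    entities, counts = [], []
    for record in observations:
        entities = accumulate(entities, record)
        counts.append(len(entities))
    return counts


if __name__ == "__main__":
    observations = [
        {"name": "Mark Smith", "dob": "6/12/1978", "ssn": "443-43-0000"},      # File 1
        {"name": "Mark R Smith", "phone": "(707) 433-0000", "dl": "00001234"},  # File 2
        {"name": "Mark Randy Smith", "ssn": "443-43-0000", "dl": "00001234"},   # New Record
    ]
    print(identity_counts(observations))  # [1, 2, 1]
```

The count rises to two while File 1 and File 2 share nothing in common, then falls back to one when the new record connects them. That is the rise and fall of a population in miniature: the earlier false negative overstated the universe until a later observation corrected it.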
  18. 18. Increasing Accuracy and Performance Observations Unique Identities True Population
  19. 19. “Expert Counting” is Fundamental to Prediction <ul><li>Is it 5 people each with 1 account … or is it 1 person with 5 accounts? </li></ul><ul><li>If one cannot count … one cannot estimate vector or velocity (direction and speed). </li></ul><ul><li>Without vector and velocity … prediction is nearly impossible. </li></ul><ul><li>Therefore, if you can’t count, you can’t predict. </li></ul>
  20. 20. Mass Declassification Predictions
  21. 21. Mass Declassification Predictions <ul><li>Whose equity is it? </li></ul><ul><li>Machine triage – disposition </li></ul><ul><li>Queue prioritization </li></ul>
  22. 22. Using What Data Points? <ul><li>FOR EXAMPLE: </li></ul><ul><li>450M target documents </li></ul><ul><li>Dirty words </li></ul><ul><li>Previous declassifications </li></ul><ul><li>Previous declassification denials </li></ul><ul><li>FOIAs </li></ul><ul><li>Intellipedia </li></ul><ul><li>Wikipedia </li></ul><ul><li>WikiLeaks </li></ul><ul><li>Deceased persons </li></ul><ul><li>Publicly available accounts/facts </li></ul>
  23. 23.
  24. 24. Open Source Discovery/Scoring <ul><li>“Height of Pakistan’s Mufasa missile.” </li></ul><ul><ul><ul><li>What is 15.5 meters? </li></ul></ul></ul><ul><ul><ul><ul><li>New York Times, Sept 21, 2010, C3 </li></ul></ul></ul></ul><ul><ul><ul><ul><ul><li>“Pakistan unveils Mufasa 7 Warhead” </li></ul></ul></ul></ul></ul><ul><ul><ul><ul><li>Wikipedia: Mufasa_7_Warhead </li></ul></ul></ul></ul>
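One way to picture open source discovery is as a lookup: take a term from a candidate document and ask which public sources already mention it. The sketch below is illustrative only. The function name and the corpus structure are assumptions, the matching is a plain case-insensitive substring test, and the corpus text is placeholder content built around the slide's made-up "Mufasa" example.

```python
def open_source_hits(term: str, corpus: dict) -> list:
    """Return the names of open sources whose text mentions the term.

    ASSUMPTION: `corpus` maps a source name to its text; matching is a
    simple case-insensitive substring test, not real entity extraction.
    """
    needle = term.casefold()
    return sorted(name for name, text in corpus.items() if needle in text.casefold())


if __name__ == "__main__":
    # Placeholder corpus based on the slide's fictional example.
    corpus = {
        "New York Times, Sept 21, 2010, C3": "Pakistan unveils Mufasa 7 Warhead",
        "Wikipedia: Mufasa_7_Warhead": "Mufasa 7 Warhead (placeholder article text)",
        "Unrelated source": "Nothing relevant here",
    }
    hits = open_source_hits("mufasa 7 warhead", corpus)
    print(len(hits), hits)
```

The number of independent sources returned could serve as a crude open-source score, with the source names kept as evidence for the human reviewer.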
  25. 25. Context Accumulation FOIA March 2010 Open Source Reference Dirty Word Classified – Asserted Mufasa 7 Warhead
  26. 26. Context Accumulation + Statistics <ul><li>Document Element | Total | Declass | Class-Default | Class-Asserted </li></ul><ul><li>Author: “Billy K” | 4503 | 1600 | 403 | 0 </li></ul><ul><li>Codeword: “Tomatoe” | 4818 | 4600 | 218 | 0 </li></ul><ul><li>Classification: “SI/TK/001” | 23 | 22 | 1 | 0 </li></ul><ul><li>Actors: “Salam Ahmed” | 782 | 700 | 82 | 0 </li></ul>Declassification dispositions … becoming a force multiplier. The more human dispositions, the more automated dispositions. <ul><li>Human Triage | Auto Triage </li></ul><ul><li>5,000 | 20 </li></ul><ul><li>10,000 | 4,000 </li></ul><ul><li>100,000 | 65,000 </li></ul><ul><li>1,000,000 | 17,000,000 </li></ul>
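The per-element counts on this slide suggest how past human dispositions could drive automated triage. The sketch below uses the slide's four rows as history and applies a hypothetical rule: any element with a class-asserted history makes the document red, a minimum declassification rate at or above a threshold makes it green, and everything else is yellow. The rule, the 0.85 threshold, and all names in the code are assumptions for illustration, not something the presentation specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementStats:
    total: int
    declassified: int
    class_default: int
    class_asserted: int


def declass_rate(stats: ElementStats) -> float:
    """Share of past documents containing this element that were declassified."""
    return stats.declassified / stats.total if stats.total else 0.0


# Historical dispositions, copied from the slide.
HISTORY = {
    ("author", "Billy K"): ElementStats(4503, 1600, 403, 0),
    ("codeword", "Tomatoe"): ElementStats(4818, 4600, 218, 0),
    ("classification", "SI/TK/001"): ElementStats(23, 22, 1, 0),
    ("actor", "Salam Ahmed"): ElementStats(782, 700, 82, 0),
}


def triage(elements: list, history: dict, green_threshold: float = 0.85) -> str:
    """Return 'red', 'yellow', or 'green' for a document's extracted elements.

    ASSUMPTION: hypothetical rule and threshold, chosen only to show the idea.
    """
    stats = [history[e] for e in elements if e in history]
    if not stats:
        return "yellow"  # no history to lean on, so a human should look
    if any(s.class_asserted > 0 for s in stats):
        return "red"
    if min(declass_rate(s) for s in stats) >= green_threshold:
        return "green"
    return "yellow"


if __name__ == "__main__":
    print(triage([("codeword", "Tomatoe"), ("actor", "Salam Ahmed")], HISTORY))
    print(triage([("author", "Billy K"), ("codeword", "Tomatoe")], HISTORY))
```

Every additional human disposition updates the counts, so more documents clear the threshold without review. That is the force-multiplier effect the slide describes.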
  27. 27. Policy Questions <ul><li>What related information is already available in the public domain? </li></ul><ul><ul><li>Evidence: Exists in open source </li></ul></ul><ul><li>What damage might conceivably result from disclosure and what benefits might ensue? </li></ul><ul><ul><li>Evidence: Same text already released (by same equity holder) </li></ul></ul>
  28. 28. Strawman Architecture
  29. 29. Strawman Architecture 450M Docs Historical Dispositions DirtyWords Etc. Feature Extraction & Classification Context Accumulation Predictions(*) Workflow System (*) Recommendations: Equity of, Disposition, Priority Dispositions
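The boxes in the strawman architecture can be wired together in a few lines to show the data flow: feature extraction, context accumulation, predictions (equity, disposition, priority), and a prioritized work queue. This is a toy sketch under heavy assumptions. Feature extraction is reduced to dirty-word matching, context is a simple counter, equity is taken from a supplied originating-agency field, and the priority rule (fewest hits first) is invented for the example.

```python
import heapq
from collections import Counter
from dataclasses import dataclass, field


@dataclass(order=True)
class Recommendation:
    priority: int                          # lower value = reviewed sooner
    doc_id: str = field(compare=False)
    equity: str = field(compare=False)
    disposition: str = field(compare=False)


def extract_features(text: str, dirty_words: set) -> list:
    """Feature extraction: which (lowercase) dirty words appear in the text."""
    tokens = {t.strip(".,;:!?").casefold() for t in text.split()}
    return sorted(tokens & dirty_words)


def run_pipeline(docs: list, dirty_words: set):
    """docs: list of (doc_id, originating_agency, text) tuples.

    ASSUMPTION: hypothetical stages and rules, for illustration only.
    """
    context = Counter()  # context accumulation: how often each feature was seen
    queue = []
    for doc_id, agency, text in docs:
        hits = extract_features(text, dirty_words)
        context.update(hits)
        disposition = "yellow" if hits else "green"
        heapq.heappush(queue, Recommendation(len(hits), doc_id, agency, disposition))
    ordered = [heapq.heappop(queue) for _ in range(len(queue))]
    return ordered, context


if __name__ == "__main__":
    docs = [
        ("d1", "Agency A", "The codeword Tomatoe appears here."),
        ("d2", "Agency B", "Nothing sensitive."),
    ]
    ordered, context = run_pipeline(docs, {"tomatoe"})
    for rec in ordered:
        print(rec.doc_id, rec.equity, rec.disposition)
```

In a fuller design, the dispositions produced by the workflow system would feed back into the historical dispositions store, closing the loop shown on the slide.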
  30. 30. Another Idea: Crowd Sourcing <ul><li>Can you predict specific people with privileges and knowledge … to whom selected documents can be routed for evaluation? </li></ul><ul><li>Can you publish machine-triage recommendations to a wiki or other form of internal broadcast for community crowd sourcing? </li></ul>
  31. 31. Another Idea: Better Classification <ul><li>Using the overall declassification platform to assist in proper classification (real-time) </li></ul><ul><li>And, better pre-tagging to assist in future auto-declassification </li></ul>
  32. 32. Challenges
  33. 33. Challenges <ul><li>Entity extraction is imperfect </li></ul><ul><li>Predictions may still not be good enough, often enough </li></ul><ul><li>Some documents are not in English </li></ul><ul><li>The user work surface and its distribution </li></ul><ul><li>Consequences of an inappropriate release </li></ul><ul><li>With super access and super tools, this may call for stronger audit and insider-threat protections </li></ul><ul><li>Your contracting cycle and the creation of the system might take until mid-2011 or 2012 or 2013 </li></ul>
  34. 34. Closing Thoughts
  35. 35. Closing Thoughts <ul><li>Contextualization is essential to better prediction </li></ul><ul><li>There are not enough humans to ask every question every day </li></ul><ul><li>“Human attention directing” systems are critical to the mission </li></ul><ul><li>The data must find the data, the relevance must find the user </li></ul>
  36. 36. Worst Case Scenario <ul><li>Rich context enables better hints for users, resulting in faster dispositions </li></ul><ul><li>Rich context enables improved sequencing of the work </li></ul>
  37. 37. Related Blog Posts <ul><li>Smart Sensemaking Systems, First and Foremost, Must be Expert Counting Systems </li></ul><ul><li>Data Finds Data </li></ul><ul><li>Puzzling: How Observations Are Accumulated Into Context </li></ul><ul><li>The Fast Last Puzzle Piece </li></ul><ul><li>Algorithms At Dead-End: Cannot Squeeze Knowledge Out Of A Pixel </li></ul><ul><li>How to Use a Glue Gun to Catch a Liar </li></ul><ul><li>It Turns Out Both Bad Data and a Teaspoon of Dirt May Be Good For You </li></ul><ul><li>Smart Systems Flip-Flop </li></ul>
  38. 38. Blogging At: Information Management Privacy National Security and Triathlons Questions?
  39. 39. Mass Declassification What If? Jeff Jonas, IBM Distinguished Engineer Chief Scientist, IBM Entity Analytics [email_address] September 23, 2010