Tom Johnson Managing Director Inst. for Analytic Journalism Santa Fe, New Mexico USA t o m @ j t j o h n s o n . c o m    ...
It’s not the documents,  it’s the DATA! <ul><li>Presentation at </li></ul><ul><li>“ 2011 Open Government Academy”  March 2...
<ul><li>Important point </li></ul>1 Nothing is as important – and valuable – as a good theory!
Theory of Journalistic Process <ul><li>Data In   Analysis    Info Out </li></ul><ul><li>Data   = that which, upon Analys...
<ul><li>Important point </li></ul><ul><li>The document  is  not   the data. </li></ul>2
Bertillon system: Public Records DB  <ul><li>Early public records </li></ul><ul><li>Intricate data collection </li></ul><u...
Bertillon system: Public Records DB  <ul><li>By 1910… </li></ul><ul><li>Indexing system has improved </li></ul><ul><li>Typ...
Bertillon system: Public Records DB  <ul><li>Early public records </li></ul><ul><li>Intricate data collection </li></ul><u...
Bertillon system: Public Records DB  <ul><li>Early public records </li></ul><ul><li>Intricate data collection </li></ul><u...
Traditional Data In     Analysis      Info Out Data In     Analysis    Info Out <ul><li>Notes </li></ul><ul><li>Text <...
Digital Age Data In     Analysis      Info Out <ul><li>Notes </li></ul><ul><li>Text </li></ul><ul><li>Numeric </li></ul>...
Digital Age Data In     Analysis      Info Out <ul><li>Notes </li></ul><ul><li>Text </li></ul><ul><li>Numeric </li></ul>...
<ul><li>Important point </li></ul><ul><li>The document  is  not   the data . Without analysis, the data are  not   the sto...
Four stories <ul><li>Doig: Hurricane Andrew, Data (from documents) = Pulitizer Prize & bldg. inspectors in jail  </li></ul...
Journalism and GIS <ul><li>Steve Doig  [Miami Herald] </li></ul><ul><li>1992 </li></ul>Hurricane Andrew + damage reports +...
Doig: Hurricane Andrew
Four stories <ul><li>Doig: Hurricane Andrew, Data (from documents) = Pulitizer Prize & bldg. inspectors in jail  </li></ul...
Analysis with real data Search   Sort DB info
Four stories <ul><li>Doig: Hurricane Andrew, Data (from documents) = Pulitizer Prize & bldg. inspectors in jail  </li></ul...
Vanishing Wetlands
Four stories <ul><li>Doig: Hurricane Andrew, Data (from documents) = Pulitizer Prize & bldg. inspectors in jail  </li></ul...
UK MP’s expenses Solid search tools These are PDFs,  POST -search
Major questions? <ul><li>As participants in a liberal democracy…  </li></ul><ul><li>How do we get the necessary data? </li...
Files, Transparency, Ease of Analysis Easier Challenging
Files, Transparency, Ease of Analysis
Data In: Objectives/Requirements <ul><li>Move data from “out there” to analytic site/tools </li></ul><ul><li>Looking for c...
Data In: Objectives/Requirements <ul><li>Seeking  fine-grained  data, NOT aggregations </li></ul><ul><ul><li>Seek data in ...
Data In:  “Typical” problems with gov sites <ul><ul><li>Barriers data = barriers to analysis </li></ul></ul><ul><ul><ul><l...
Good NM sites Search! Español Feedback!
NM Legis. Bill Finder Could be better: no way to find what bills were introduced by X legislator Download bill in  TWO for...
Data In: Challenges <ul><ul><li>New site in New Mexico:  www.sunshineportalnm.com </li></ul></ul><ul><ul><li>“ Beta ,” but...
Data In: Challenges in SunshinePort <ul><ul><li>Comprehensive  Annual Financial  Reports </li></ul></ul><ul><ul><ul><li>Po...
Bottom line on SunshinePortalNM.com <ul><li>“ If the State of New Mexico takes the position that through this site it is d...
Bottom line on SunshinePortalNM.com <ul><li>“ If the State of New Mexico takes the position that through this site it is d...
Good data sites – Gov and NGO <ul><li>Data.gov  [A  beta  site]  www.data.gov/ </li></ul><ul><ul><li>Metrics  www.data.gov...
Common aspects? <ul><li>All have up-front search capabilities </li></ul><ul><li>All are written in “data-accessible” code ...
Challenge for Watchdogs? <ul><li>Failure on the part of planners/bureaucrats to simply…  </li></ul><ul><li>Give The People...
Tomorrow? Public Access to Original Data Impact Why not?
It’s not the documents, it’s the DATA! Tom Johnson Managing Director Inst. for Analytic Journalism Santa Fe, New Mexico US...
It’s not the documents,  it’s the DATA! <ul><li>Presentation at </li></ul><ul><li>“ 2011 Open Government Academy”  March 2...
FOI history <ul><li>The world’s rst reedom o inormation legislation was adopted by the Swedish parliament in 1766. Thi...
Early police data base: incomplete data Source: Jay, Ricky.  “Grifters, Bunco Artists & Flimflammen.”  Wired, Feb. 2011, p...
NM HB 406 <ul><li>“… information contained in information systems databases created or maintained by or on behalf of a pub...
Analytic Tools <ul><li>Text </li></ul><ul><ul><li>ThemeRiver -  http://infoviz.pnl.gov/research_themeriver.stm </li></ul><...
“ Analytic tools” also for story-telling <ul><li>Spreadsheets: </li></ul><ul><ul><li>Tables, charts, infographics </li></u...
FOIA b(3) Exemptions Original:  http://www.propublica.org/article/foia-exemptions-sunshine-law
Content Analysis
Content analysis of legis party  text
Positive example of gov’t data <ul><li>Positive example: NM Leg Bill Locator </li></ul><ul><li>http://www.nmlegis.gov/lcs/...
NM HB 406 <ul><li>Senate approved 39-0 on Feb. 9 http:// www.nmlegis.gov/Sessions/11%20Regular/bills/house/HB0406.html </l...
“ Data In” questions Data In     Analysis    Info Out <ul><li>Notes </li></ul><ul><li>Text </li></ul><ul><li>Numeric </l...
“ Data In” questions Data In     Analysis    Info Out <ul><li>Notes </li></ul><ul><li>Text </li></ul><ul><li>Numeric </l...
<ul><li>Data In </li></ul><ul><li> </li></ul><ul><li>Analysis </li></ul><ul><li> </li></ul><ul><li>Info Out </li></ul>
“ Analysis” phase Data In     Analysis    Info Out <ul><li>Notes </li></ul><ul><li>Text </li></ul><ul><li>Numeric </li><...
“ Analysis” phase Data In     Analysis    Info Out <ul><li>Notes </li></ul><ul><li>Text </li></ul><ul><li>Numeric </li><...
Data In     Analysis      Info Out Data In     Analysis    Info Out <ul><li>Notes </li></ul><ul><li>Text </li></ul><ul...
Data In     Analysis      Info Out Data In     Analysis     Info Out <ul><li>Notes </li></ul><ul><li>Text </li></ul><u...
Theory of Journalistic Process Copyright ©  J. T. Johnson Data  In <ul><li>Interviews </li></ul><ul><li>Text docs </li></u...

It's not the documents; it's the DATA


Presentation at the New Mexico Foundation for Open Government Academy, Univ. of New Mexico Law School, Albuquerque, NM 26 March 2011

  • Show of hands: How many journalists? How many from NGOs? How many lawyers (you can double count yourself)? How many students?
  • Theory of PROCESS: the same for journalists, lawyers and socially concerned citizens. It is the same up to the point of “Information Out”; there we might write and present somewhat differently for different audiences.
  • While sometimes an individual document can be important, today seeing the patterns of behaviors is usually just as important and insightful if not more so.
  • Early public records. Intricate data collection. Potential for error in data entry. Potential for error in filing. No machine retrieval or analysis. Even today, OCR would be impossible. http://cultureandcommunication.org/deadmedia/index.php/Bertillon_System This rare Bertillon Card (named after the inventor of Anthropometry)
Decline of Bertillonage: Fingerprint killed the Bertillon star. The complexity of the Bertillon system, the very thing that provided it with such accurate and reliable data, also proved to be its downfall: it was simply too cumbersome to replicate with sufficient accuracy. As soon as Bertillon’s procedures began to be disseminated outside of Paris there were problems; as Cole explains: “Learning the system from translated books, far from the exacting presence of Bertillon himself, identification clerks seldom replicated the rigor that characterized operations in Paris. Instead, they skimped on learning the morphological vocabulary, glossed over the precise movements in the measuring process, and contented themselves with sloppily recording a few measurements. Worse, most identification bureaus, too proud to simply adopt Bertillon’s system wholesale, took it upon themselves to modify various aspects of the system.” (Cole 2001, 52)
Bertillon anticipated these problems, writing a strongly-worded message in his instruction manual directed towards all those who would consider meddling with his finely tuned methods: “The arrangement of these instruments was the subject of many experiments and numberless improvements before they reached their present shape, which we consider as final. So we reject in advance every modification, every further change, however slight, either in their form or in their manner of using them. That is a great temptation for beginners, to whom numerous new ideas occur, but who are not aware that all these ideas, even those that they believe to be the most original, the most personal, have already been proposed by others, tried and finally rejected for divers reasons.” (Bertillon 1896, 19)
Alas, Bertillon’s warnings were not heeded, and the accuracy of anthropometric measurements, and the reputation of the system as a whole, suffered as a result. Even if the integrity of Bertillon’s system could be sustained outside of Paris, it was soon to be overtaken by another form of criminal identification. As Kaluszynski notes, “at the last moment before it seemed likely to dominate the future, anthropometry was to undergo a rude shock. Its success had barely been established and savored when its supremacy began to falter in the face of a new and infallible technique” (2001, 128). Of course, the new technique was fingerprinting, a much simpler process than Bertillonage. “A fingerprint is a physical sign that cannot be falsified or disguised, and the mathematical likelihood of two individuals having identical fingerprints is infinitely small” (128). Occam’s razor would dictate that fingerprinting soon supplant Bertillonage as the world-wide standard for criminal identification.
  • By 1910… Indexing system has improved. Typewriters instead of pen. Better haircuts. But still… Null fields. Subject to data entry errors; lost or misfiled cards/data. Limited large-scale analysis resources.
  • Early “hard drives,” data retrieval and data analysis of public records
  • A public record, but one of limited usage. A DOCUMENT, but no efficient, productive, insightful way to FIND the data. A DOCUMENT, but no efficient, productive, insightful way to EXTRACT the data. Sorta like a PDF.
  • All data today requires NEW tools for ANALYSIS and STORY-TELLING. Statutes are usually adequate; the CULTURES are the challenge: both the culture of politicians and bureaucrats AND the culture of traditional journalism, which reports the event, not the issue, and lacks contemporary analytic skills.
  • Waite: The government doesn't know how many acres of Florida wetlands have been destroyed in the past 15 years. No state or federal agency has kept track, not even the Army Corps of Engineers, which has the final say on protecting wetlands. Another federal agency, the National Wetlands Inventory, is supposed to track losses nationwide. The tiny agency, based in St. Petersburg, mapped Florida's wetlands 20 years ago, but hasn't updated its maps except for two of Florida's 67 counties. So the St. Petersburg Times examined satellite images of Florida to determine the loss of wetlands. Satellite images taken in the late 1980s were compared with those taken in 2003. Those were combined with data from the National Wetlands Inventory and the state Fish and Wildlife Conservation Commission. A random sample of 385 places on the resulting maps were checked against other data through property records, aerial images and site visits. No satellite image analysis can be 100 percent accurate, particularly one covering such a broad area. In this case, the accuracy was about 85 percent, the level required by the U.S. Geological Survey for similar satellite analysis. To filter out temporary changes from long-lasting ones, the analysis relied on a map of urbanization created by the state wildlife agency. That showed about 84,000 missing acres of wetlands. The methodology was reviewed by Barnali Dixon, professor of geography at the University of South Florida; Leonard G. Pearlstine, assistant scientist at the University of Florida's Fort Lauderdale Research and Education Center; and Tom Lillesand, professor of geography and director of the Environmental Remote Sensing Center at the University of Wisconsin-Madison. [Last modified December 14, 2006, 18:10:27]
UK: The Expenses Files -- MPs' Expenses. A year ago the High Court backed an earlier ruling by the Information Tribunal that full details of MPs' expenses, including receipts, should be made public. Since then MPs have been accused of dragging their feet and playing for time. Full details are slated to be published in July but with some crucial details – such as addresses of second homes – blacked out. An investigation by the Telegraph has uncovered the full files. The Guardian: Join us in digging through the documents of MPs' expenses to identify individual claims, or documents that you think merit further investigation. You can work through your own MP's expenses, or just hit the button below to start reviewing. (Update, Fri pm: we now have a virtually complete set of expenses documents so you should be able to find your MP's) “We have 458,832 pages of documents. 27,731 of you have reviewed 223,475 of them. Only 235,357 to go”
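The Guardian counter quoted above is itself a small piece of analysis, and its arithmetic can be checked in a few lines. This is only an illustrative sketch; the variable names are made up, and the three figures are the ones in the quote.

```python
# Figures taken from the Guardian progress counter quoted in the note.
total_pages = 458_832
reviewed = 223_475
reviewers = 27_731

remaining = total_pages - reviewed          # pages still unreviewed
share_reviewed = reviewed / total_pages     # fraction of the job done
pages_per_reviewer = reviewed / reviewers   # average effort per volunteer

print(remaining)                     # 235357
print(f"{share_reviewed:.1%}")       # 48.7%
print(f"{pages_per_reviewer:.1f}")   # 8.1
```

The "to go" figure matches the quote, and the average of about eight pages per volunteer gives a sense of how thinly the work was spread.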
  • http://www.flickr.com/photos/juggernautco/2844066535/in/photostream/
  • The reporter harvested the data, then cleaned and verified the data. A team produced the story for multiple delivery platforms. But it all started with the data, some of which probably never existed as an ink-on-paper DOCUMENT. Project homepage: http://www.azcentral.com/news/articles/2010/11/12/20101112arizona-pension-funds.html State pension fund tried suing 'Republic': http://www.azcentral.com/news/articles/arizona-pension-funds-records.html
  • Early, major example of crowd-sourced analysis. “Wetware” content analysis tool. Text data AND PDF, but connection to the PDF is -- SORTA -- the end result.
  • Range of file “states/form”. Range of the challenge in extracting and analyzing the data. “JSON is an important standard for ease of interaction across systems. It's becoming the preferred route over XML in many cases. And as geo-spatial data explodes, addressing the standards there might be helpful. I would include KML, GeoJSON and SHP files for vector and many options for raster: bil, netCDF, ECW, GeoTIFF, etc.” (Guerin)
  • And even these are NOT perfect; have to know some of the underlying assumptions inherent in these file types. That said, this is still the best point of departure when seeking to acquire files and their data. Just as an example, csv does not allow leading zeros in a numeric field, so my zip would collapse from 02151 to 2151. Or, the field would be represented as text, "02151" (surrounded by quote marks). Some translation programs do that automatically, but there is no standard. Same problem with phone numbers, some equations, etc. Csv also assumes field headers are on one line. They need to be in one cell in Excel to translate correctly that way. Often, they are not, or the Excel file has multiple levels of heads. XML is the general link format people want to use, but not all states have adopted it, or a standard schema. Yeah, csv standard does not even allow a blank row or a formatting row (like ---------) between the header and the live data table. The format row is usually read as a zero, not null, and that screws up averages, medians and so forth. Excel "cheats" on calculating medians, etc. (SSR) Should be ANSI standard CSV (SSR)
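The ZIP-code collapse described above (02151 becoming 2151) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the sample rows and the column names `name` and `zip` are made up. Python's standard `csv` module returns every field as text, so the leading zero survives until something converts the field to a number.

```python
import csv
import io

# Hypothetical sample: one quoted ZIP, one unquoted. Both are plain text in the file.
raw = 'name,zip\nAlice,"02151"\nBob,02151\n'

rows = list(csv.DictReader(io.StringIO(raw)))

# The csv module hands back strings, so the leading zero is intact.
zips_as_text = [row["zip"] for row in rows]

# Converting the field to a number is the step that destroys it.
zips_as_numbers = [int(row["zip"]) for row in rows]

print(zips_as_text)     # ['02151', '02151']
print(zips_as_numbers)  # [2151, 2151]
```

The practical point: treat identifiers such as ZIP codes and phone numbers as text from the moment the file is opened, whatever tool is used.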
  • Move data from “out there” to analytic site/tools Looking for connections; patterns
  • Seeking fine-grained data, NOT aggregations. Seek data in original form (i.e. NO PDFs). Get data in lowest common denominator format: comma-delimited files in ASCII or text. Who collected the data? Why? How? Who proofed/edited the data? Why? How? If from a database, first ask for the “record layout” or “code sheet” or “schema”: definitions of variables or fields. Constant or ???
  • Barriers to data = barriers to analysis. NO site search capability; no site map. Failure to use open-standard HTML; using closed-standard Adobe Flash/Shockwave environment. Page formats/layouts not consistent; too many drill-downs instead of search-driven generators. Jiggly roll-overs; too much effort spent on bling. Impossible to download or scrape data for analysis. Information available only in Adobe PDF files, which are notoriously unfriendly to data analysis.
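One reason to ask for the record layout first is that it lets you test a delivered file against what the agency says it contains. The sketch below assumes a made-up layout (`agency`, `fiscal_year`, `amount`) and made-up rows; a real code sheet would supply the real field names and types.

```python
import csv
import io

# Hypothetical record layout ("code sheet"): field name -> expected type.
RECORD_LAYOUT = {"agency": str, "fiscal_year": int, "amount": float}

# Made-up comma-delimited sample; the second data row has a bad amount.
raw = (
    "agency,fiscal_year,amount\n"
    "Parks,2010,1250.50\n"
    "Roads,2010,n/a\n"
)

def check_rows(text, layout):
    """Return (good_rows, problems) after checking each row against the layout."""
    good, problems = [], []
    reader = csv.DictReader(io.StringIO(text))
    # The header is line 1 of the file, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        try:
            good.append({name: cast(row[name]) for name, cast in layout.items()})
        except (KeyError, ValueError, TypeError) as err:
            problems.append((line_no, str(err)))
    return good, problems

good_rows, problems = check_rows(raw, RECORD_LAYOUT)
print(len(good_rows), "clean rows;", len(problems), "rows to question")
```

Rows that fail the check are not discarded; they become the list of questions to take back to whoever collected or edited the data.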
  • State of NM gov’t agency develops creditable web site: search engine, choice of Spanish, opportunity for feedback. Registered, i.e. OWNED, by the CITIZENS of New Mexico. Award: Digital Government Achievement Awards http://www.centerdigitalgov.com/survey/88/2010
  • Another relatively valuable NM state website: clean site design, search engine, quick links to actual document in Word format and PDF.
  • No search engine. No Spanish version. Head-scratching logic in the taxonomy of the silos. Why the roll-overs that don’t do much? If we drill down into “Capital Outlay” (top left menu) we end up with a 70-page PDF. Again.
  • Go to these sites and, if lucky, find the document we want… But they are all and only PDFs. PDFs can be retrieved and saved, one at a time, to your desktop. Apps are available that OCR what probably is the output of an Excel document, but that shows up on two partial pages, which just adds more time and effort. After extracting to Excel, the data must then be closely copy edited to make sure the extraction process read every zero as a zero and not a capital letter “Oh”. Another interesting problem: screwy, idiosyncratic fonts.
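Part of that copy-editing pass can be pointed in the right direction by a small script. The sketch below flags cells that would be valid numbers if look-alike letters were swapped for digits; the look-alike mapping and the sample column are illustrative assumptions, and the script only suggests corrections for a human to confirm.

```python
import re

# Characters that OCR commonly confuses with digits. The mapping is an
# illustrative assumption, not an exhaustive list.
LOOKALIKES = {"O": "0", "o": "0", "l": "1", "I": "1"}

def flag_suspect_numbers(cells):
    """Return (index, original, suggested) for cells that look numeric but contain letters."""
    suspects = []
    for i, cell in enumerate(cells):
        suggested = "".join(LOOKALIKES.get(ch, ch) for ch in cell)
        looks_numeric = re.fullmatch(r"[\d,]+(\.\d+)?", suggested) is not None
        if looks_numeric and suggested != cell:
            suspects.append((i, cell, suggested))
    return suspects

# Made-up values standing in for a column extracted from a PDF.
column = ["1,250", "3O7", "Total", "10O.5O"]
print(flag_suspect_numbers(column))
# [(1, '3O7', '307'), (3, '10O.5O', '100.50')]
```

Genuine text such as "Total" is left alone because it does not become a number after substitution.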
  • It is possible to build a good-looking site, integrating Flash technology if desired, while still making the underlying structured data directly available to users. A good example is our election day results files ( http://elections.nytimes.com/2010/results/house ). If you view the source markup for this page, which includes very sharp-looking Flash elements, you'll find an embedded URL: http://elections.nytimes.com/2010/results/house.tsv . That is a link to a tab-delimited file containing the data underlying the map. --Griff Palmer
From: James Jennings <[email_address]> Date: Monday, February 14, 2011 Subject: Ease of scraping this site? To: chris feola <[email_address]> 1-The entire site is in Flash. I might be able to pipe some of the search data to a csv but not everything is searchable. This is the best job of making public data as inaccessible as possible that I have ever seen. It is a masterwork. I would call and just ask them to send it all in a spreadsheet and see what happens. jj
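A tab-delimited file like the one Palmer describes is about as analysis-ready as public data gets. The sketch below shows why, using a made-up stand-in for such a file; the column names `district`, `candidate` and `votes` are hypothetical, since the notes do not show the real file's layout.

```python
import csv
import io

# Made-up stand-in for a downloaded tab-delimited file; the real file's
# columns are not shown in the notes, so these names are hypothetical.
tsv_text = "district\tcandidate\tvotes\nNM-1\tSmith\t1200\nNM-1\tJones\t950\n"

reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
rows = list(reader)

# Once the data are structured, a first analysis is a single line.
total_votes = sum(int(row["votes"]) for row in rows)
print(total_votes)  # 2150
```

Compare that with the Flash-only site in the email: there, the same question cannot even be asked until someone has rebuilt the table by hand.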
  • Examples of good GOVERNMENT sites.
  • What do we see on all these GOOD gov’t sites? All have up-front search capabilities. All are written in “data-accessible” code. All data can be downloaded with “relative” ease. Some have various languages available. ALL are run by GOVERNMENT; no commercial sites.
  • Failure on the part of planners/bureaucrats to simply… Give The People THEIR Data… In The Most Basic, Original, Straightforward Form… And Let Them Figure Out What Should Be Done With It! The governor agrees
  • See “The Public Online Information Act”: http://sunlightfoundation.com/policy/poia/ For multiple years the state has had, and has employed, the Fiscal Impact Report, which is required to be attached to EVERY bill introduced in the Legislature. In the 21st Century, if we are to have not just citizens participating in democracy in an informed manner but also economic growth, we should have a requirement for every bill introduced that asks: How will this bill advance the public’s access to the ORIGINAL FORM data related to this bill and topic?
  • https://secure.wikimedia.org/wikipedia/en/wiki/Public_records
Historic perspective: The concept of public records first emerged in western Europe in the late middle ages. Some of the first public records were census records, birth, burial, and marriage records such as the Domesday Book (1085-6) of William the Conqueror [2] and royalty marriage agreements, which were perceived as international treaties brokered by private parties. In the United Kingdom, the Public Record Office Act was passed in 1838. [3] Of particular significance was the evolution of the common law right "to access court records to inspect and to copy". The expectation inherent in the common law right to access court records is that any person may come to the office of the clerk of the court during business hours and request to inspect court records, with almost instantaneous access. Such right is a central safeguard for the integrity of the courts. Any decision to conceal court records requires a sealing order. The right to access court records is also central to liberty: There is no conceivable way to exercise the Habeas Corpus right, deemed by the late Justice Brennan [4] as "the cornerstone" of the United States Constitution, absent access to court records as public records. In the United States the common law right to "access court records to inspect and to copy" was re-affirmed by the US Supreme Court in Nixon v Warner Communications, Inc (1978), where the court found various parts of the right to access court records as inherent to the First, Fourth, Sixth, and Fourteenth Amendments. Therefore, in the United States, access to court records is governed by Civil Rights in the Amendments to the United States Constitution, not by the Freedom of Information Act.
Public records in the United States: Access to public records in the US at the federal level is guided by the Freedom of Information Act (FOIA). Requests for access to records pursuant to FOIA are often frustrated by federal agencies through the numerous exemptions found in the law, and through redaction of critical data. Each state has its own version of FOIA. For example, in Colorado there is the Colorado Open Records Act [5] (CORA) and in New Jersey the law is known as the Open Public Records Act [6] (OPRA). There are many degrees of accessibility to public records between states, with some making it fairly easy to request and receive documents, and others with many exemptions and restricted categories of documents. One state that is fairly responsive to public records requests is New York, which utilizes the Committee on Open Government [7] to assist citizens with their requests. A state that is fairly restrictive in how it responds to public records requests is Pennsylvania, where the law currently presumes that all documents are exempt from disclosure [8], unless they can be proven otherwise. The California Public Records Act (California Government Code §§6250-6276.48) covers the arrest and booking records of inmates in the State of California jails and prisons, which are not covered by First Amendment rights. Public access to arrest and booking records is seen as a critical safeguard of Liberty.
  • The ThemeRiver™ visualization helps users identify time-related patterns, trends, and relationships across a large collection of documents. The themes in the collection are represented by a "river" that flows left to right through time. The river widens or narrows to depict changes in the collective strength of selected themes in the underlying documents. Individual themes are represented as colored "currents" flowing within the river. The theme currents narrow or widen to indicate changes in individual theme strength at any point in time.
  • Original source: http://www.propublica.org/article/foia-exemptions-sunshine-law NB: And if one picks the CIA, for example, you get a “virtual” webpage, NOT an actual document. You can drill down into viewing the PDF, a secondary result of the search, not the primary result.
  • “DATA” upon analysis becomes information. “DATA” is sensual, qualitative and quantitative: smell of a forest fire, expression of an interviewee’s feelings, copy of the state budget or a bill marked up by committee. Quality of the “Information Out” can be no better than the DATA that goes in (and that means Research and Reporting) and the Analysis applied to that high-quality data. In the Infosphere, “Information” is often released back into the Infosphere to become “DATA” for some other species or colleagues’ use.
  • It's not the documents; it's the DATA

    1. 1. Tom Johnson Managing Director Inst. for Analytic Journalism Santa Fe, New Mexico USA t o m @ j t j o h n s o n . c o m It’s not the documents; it’s the DATA!
    2. 2. It’s not the documents, it’s the DATA! <ul><li>Presentation at </li></ul><ul><li>“ 2011 Open Government Academy” March 26, 2011 </li></ul><ul><li>Presented by the New Mexico Foundation for Open Government , </li></ul><ul><li>New Mexico Press Association and New Mexico Broadcasters Association </li></ul>This PowerPoint deck and Tipsheet posted at: http:// j o h n s o n – f o g . n o t l o n g . c o m        Licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License .
    3. 3. <ul><li>Important point </li></ul>1 Nothing is as important – and valuable – as a good theory!
    4. 4. Theory of Journalistic Process <ul><li>Data In → Analysis → Info Out </li></ul><ul><li>Data = that which, upon Analysis, yields Information. “Data” has many forms. </li></ul><ul><li>Analysis = Examination of data and facts to uncover and understand cause-effect and contextual relationships and patterns , thus providing basis for problem solving and decision making . </li></ul><ul><li>Information = that which aids in making decisions </li></ul>
    5. 5. <ul><li>Important point </li></ul><ul><li>The document is not the data. </li></ul>2
    6. 6. Bertillon system: Public Records DB <ul><li>Early public records </li></ul><ul><li>Intricate data collection </li></ul><ul><li>Potential for error in data entry </li></ul><ul><li>Potential for error in filing </li></ul><ul><li>No machine retrieval or analysis </li></ul><ul><li>Even today, OCR would be impossible </li></ul>
    7. 7. Bertillon system: Public Records DB <ul><li>By 1910… </li></ul><ul><li>Indexing system has improved </li></ul><ul><li>Typewriters instead of pen </li></ul><ul><li>Better haircuts </li></ul><ul><li>But still … </li></ul><ul><li>Null fields </li></ul><ul><li>Subject to data entry errors; lost or misfiled cards/data </li></ul><ul><li>Limited large-scale analysis resources </li></ul>
    8. 8. Bertillon system: Public Records DB <ul><li>Early public records </li></ul><ul><li>Intricate data collection </li></ul><ul><li>Data entry potential for error </li></ul><ul><li>Filing potential for error </li></ul><ul><li>No machine retrieval or analysis </li></ul><ul><li>Even today, no OCR </li></ul><ul><li>By 1910… </li></ul><ul><li>Indexing system has improved </li></ul><ul><li>Typewriters instead of pen </li></ul><ul><li>Better haircuts </li></ul><ul><li>But still … </li></ul><ul><li>Null fields </li></ul><ul><li>Subject to data entry errors; lost or misfiled cards/data </li></ul><ul><li>Limited large-scale analysis resources </li></ul>Early “hard drives,” data retrieval and data analysis of public records
    9. 9. Bertillon system: Public Records DB <ul><li>Early public records </li></ul><ul><li>Intricate data collection </li></ul><ul><li>Data entry potential for error </li></ul><ul><li>Filing potential for error </li></ul><ul><li>No machine retrieval or analysis </li></ul><ul><li>Even today, no OCR </li></ul><ul><li>By 1910… </li></ul><ul><li>Indexing system has improved </li></ul><ul><li>Typewriters instead of pen </li></ul><ul><li>Better haircuts </li></ul><ul><li>But still … </li></ul><ul><li>Null fields </li></ul><ul><li>Subject to data entry errors; lost or misfiled cards/data </li></ul><ul><li>Limited large-scale analysis resources </li></ul><ul><li>A public record, but one of limited usage </li></ul><ul><li>A DOCUMENT , but no efficient, productive, insightful way to FIND the data </li></ul><ul><li>A DOCUMENT , but no efficient, productive, insightful way to EXTRACT the data </li></ul><ul><li>Sorta like a PDF </li></ul>Early “hard drives,” data retrieval and data analysis of public records
    10. 10. Traditional Data In → Analysis → Info Out <ul><li>Notes </li></ul><ul><li>Text </li></ul><ul><li>Numeric </li></ul><ul><li>Images </li></ul><ul><li>Maps </li></ul><ul><li>How? Who? </li></ul>
    11. 11. Digital Age Data In → Analysis → Info Out <ul><li>Notes </li></ul><ul><li>Text </li></ul><ul><li>Numeric </li></ul><ul><li>Images </li></ul><ul><li>Charts/Graphs </li></ul><ul><li>Maps </li></ul><ul><li>Audio </li></ul><ul><li>Video </li></ul><ul><li>Atoms → Bits </li></ul><ul><li>How? Who? </li></ul><ul><li>New data is ubiquitous, shareable, scalable. </li></ul><ul><li>Retrieval, copying and storage costs are trivial </li></ul><ul><li>Can be validated and explored by individuals and applications </li></ul>
    12. 12. Digital Age Data In → Analysis → Info Out <ul><li>Notes </li></ul><ul><li>Text </li></ul><ul><li>Numeric </li></ul><ul><li>Images </li></ul><ul><li>Charts/Graphs </li></ul><ul><li>Maps </li></ul><ul><li>Audio </li></ul><ul><li>Video </li></ul><ul><li>Atoms → Bits </li></ul><ul><li>How? Who? </li></ul><ul><li>All data today requires NEW tools for ANALYSIS and STORY-TELLING </li></ul><ul><li>Statutes are usually adequate; the CULTURES are the challenge. </li></ul>
    13. 13. <ul><li>Important point </li></ul><ul><li>The document is not the data . Without analysis, the data are not the story. </li></ul>3
    14. 14. Four stories <ul><li>Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail </li></ul><ul><li>Craig Harris: “Arizona pension systems a soaring burden” </li></ul><ul><li>Waite: water, developers, land use = disappearing wetlands </li></ul><ul><li>UK: Investigate Your MPs Expenses “We have 458,832 pages of documents. 27,731 of you have reviewed 223,475 of them. Only 235,357 to go” MPs’ expense claims on Google spreadsheet </li></ul>
    15. 15. Journalism and GIS <ul><li>Steve Doig [Miami Herald] </li></ul><ul><li>1992 </li></ul>Hurricane Andrew + damage reports + building inspection = jail terms
    16. 16. Doig: Hurricane Andrew
    17. 17. Four stories <ul><li>Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail </li></ul><ul><li>Craig Harris: “Arizona pension systems a soaring burden” </li></ul>
    18. 18. Analysis with real data Search Sort DB info
    19. 19. Four stories <ul><li>Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail </li></ul><ul><li>Craig Harris: “Arizona pension systems a soaring burden” </li></ul><ul><li>Waite: water, developers, land use = “Vanishing Wetlands” </li></ul>
    20. 20. Vanishing Wetlands
    21. 21. Four stories <ul><li>Doig: Hurricane Andrew, Data (from documents) = Pulitzer Prize & bldg. inspectors in jail </li></ul><ul><li>Craig Harris: “Arizona pension systems a soaring burden” </li></ul><ul><li>Waite: water, developers, land use = disappearing wetlands </li></ul><ul><li>UK: Investigate Your MPs Expenses “We have 458,832 pages of documents. 27,731 of you have reviewed 223,475 of them. Only 235,357 to go” MPs’ expense claims on Google spreadsheet </li></ul><ul><ul><li>EFF Seeks Cooperating FOIA Reviewers </li></ul></ul>
    22. 22. UK MPs’ expenses Solid search tools These are PDFs, POST-search
    23. 23. Major questions? <ul><li>As participants in a liberal democracy… </li></ul><ul><li>How do we get the necessary data? </li></ul><ul><li>And from where? </li></ul><ul><li>And in appropriate forms? </li></ul>
    24. 24. Files, Transparency, Ease of Analysis Easier Challenging
    25. 25. Files, Transparency, Ease of Analysis
    26. 26. Data In: Objectives/Requirements <ul><li>Move data from “out there” to analytic site/tools </li></ul><ul><li>Looking for connections; patterns </li></ul>
    27. 27. Data In: Objectives/Requirements <ul><li>Seeking fine-grained data, NOT aggregations </li></ul><ul><ul><li>Seek data in original form (i.e. NO PDFs) </li></ul></ul><ul><ul><li>Get data in lowest common denominator format: - Comma-delimited files in ASCII or Text </li></ul></ul><ul><ul><li>Who collected the data? Why? How? </li></ul></ul><ul><ul><li>Who proofed/edited the data? Why? How? </li></ul></ul><ul><ul><li>If from data base, first ask for “record layout” or “code sheet” or “schema” </li></ul></ul><ul><ul><li>Definitions of variables or fields. Constant or ??? </li></ul></ul>
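    The requests on this slide (comma-delimited text plus a "record layout" or "code sheet") can be illustrated with a small sketch. It is only an illustration under stated assumptions: the field names, the sample rows and the `load_records` helper are hypothetical and do not come from any agency file mentioned in the deck. The idea is that once the layout is known, every row can be checked against it before any analysis begins.

```python
import csv
import io

# Hypothetical record layout ("code sheet"): field name -> converter.
# In practice this would come from the agency that supplied the data.
RECORD_LAYOUT = {
    "permit_id": str,
    "inspection_date": str,
    "damage_estimate": float,
}


def load_records(text, layout):
    """Parse comma-delimited text and check each row against the layout."""
    reader = csv.DictReader(io.StringIO(text))
    missing = set(layout) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"fields missing from file: {sorted(missing)}")

    records, problems = [], []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            records.append(
                {name: convert(row[name]) for name, convert in layout.items()}
            )
        except (TypeError, ValueError) as exc:
            # Null or malformed fields are logged, not silently dropped.
            problems.append((line_no, str(exc)))
    return records, problems


# Made-up sample data; the second row has a null damage estimate.
SAMPLE = """permit_id,inspection_date,damage_estimate
A-100,1992-08-30,12500.00
A-101,1992-09-02,
A-102,1992-09-05,840.50
"""

records, problems = load_records(SAMPLE, RECORD_LAYOUT)
print(len(records), "clean rows;", len(problems), "rows with problems")
```

    Keeping the problem rows in a separate list, with their line numbers, gives the reporter something concrete to take back to whoever collected or proofed the data.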
    28. 28. Data In: “Typical” problems with gov sites <ul><ul><li>Barriers to data = barriers to analysis </li></ul></ul><ul><ul><ul><li>NO site search capability; no site map </li></ul></ul></ul><ul><ul><ul><li>Failure to use open-standard HTML; using closed-standard Adobe Flash/Shockwave environment. </li></ul></ul></ul><ul><ul><ul><li>Page formats/layouts not consistent; too many drill-downs instead of search-driven generators </li></ul></ul></ul><ul><ul><ul><li>Jiggly roll-overs; too much effort spent on bling </li></ul></ul></ul><ul><ul><ul><li>Impossible to download or scrape data for analysis </li></ul></ul></ul><ul><ul><ul><li>Information available only in Adobe PDF files; notoriously unfriendly to data analysis. </li></ul></ul></ul>
    29. 29. Good NM sites Search! Español Feedback!
    30. 30. NM Legis. Bill Finder Could be better: no way to find what bills were introduced by X legislator Download bill in TWO formats
    31. 31. Data In: Challenges <ul><ul><li>New site in New Mexico: www.sunshineportalnm.com </li></ul></ul><ul><ul><li>“Beta,” but a facade for taxpayers; a secondary tax because of minimal utility; torture for journos </li></ul></ul>
    32. 32. Data In: Challenges in SunshinePort <ul><ul><li>Comprehensive Annual Financial Reports </li></ul></ul><ul><ul><ul><li>Possible to machine download, but laborious to format for analysis </li></ul></ul></ul><ul><ul><li>Investment Holdings reports are far worse </li></ul></ul><ul><ul><ul><li>They are poor-quality static image files, not machine-readable. </li></ul></ul></ul><ul><ul><ul><li>Tabular data roughly formatted; makes conversion for analysis an arduous, if not impossible task. </li></ul></ul></ul>
    33. 33. Bottom line on SunshinePortalNM.com <ul><li>“If the State of New Mexico takes the position that through this site it is discharging all of its disclosure obligations with respect to these particular records, open government is in trouble there.” </li></ul>“This is not even a web page, it’s a Flash application, so there’s not going to be much sunlight escaping from this portal.”
    34. 34. Bottom line on SunshinePortalNM.com <ul><li>“If the State of New Mexico takes the position that through this site it is discharging all of its disclosure obligations with respect to these particular records, open government is in trouble there.” </li></ul>“This is not even a web page, it’s a Flash application, so there’s not going to be much sunlight escaping from this portal.” “A perfect example of creating the appearance of transparency without actually being transparent.”
    35. 35. Good data sites – Gov and NGO <ul><li>Data.gov [A beta site] www.data.gov/ </li></ul><ul><ul><li>Metrics www.data.gov/metric </li></ul></ul><ul><li>DataSF - http://datasf.org/ a clearinghouse of datasets available from the City & County of San Francisco </li></ul><ul><li>San Francisco Enterprise GIS Program - http:// gispub02.sfgov.org/data.asp </li></ul><ul><li>Maplight.com – an example of how citizens can use data Nonprofit, nonpartisan research organization, provides citizens and journalists the transparency tools to shine a light on the influence of money on politics. </li></ul><ul><li>Prize-winning gov’t agency web sites: http:// www.centerdigitalgov.com/survey/88/2010 </li></ul>
    36. 36. Common aspects? <ul><li>All have up-front search capabilities </li></ul><ul><li>All are written in “data-accessible” code </li></ul><ul><li>All data can be downloaded with “relative” ease </li></ul><ul><li>Some have various languages available </li></ul><ul><li>ALL are run by GOVERNMENT; no commercial sites </li></ul>
    37. 37. Challenge for Watchdogs? <ul><li>Failure on the part of planners/bureaucrats to simply… </li></ul><ul><li>Give The People THEIR Data… </li></ul><ul><li>In The Most Basic, Original, Straightforward Form… </li></ul><ul><li>And Let Them Figure Out What Should Be Done With It! </li></ul><ul><li>The governor agrees </li></ul>
    38. 38. Tomorrow? Public Access to Original Data Impact Why not?
    39. 39. It’s not the documents, it’s the DATA! Tom Johnson Managing Director Inst. for Analytic Journalism Santa Fe, New Mexico USA t o m @ j t j o h n s o n . c o m Gracias a todos
    40. 40. It’s not the documents, it’s the DATA! <ul><li>Presentation at </li></ul><ul><li>“ 2011 Open Government Academy” March 26, 2011 </li></ul><ul><li>Presented by the New Mexico Foundation for Open Government , </li></ul><ul><li>New Mexico Press Association and New Mexico Broadcasters Association </li></ul>This PowerPoint deck and Tipsheet posted at: http://johnson-fog.notlong.com       
    41. 41. FOI history <ul><li>The world’s first freedom of information legislation was adopted by the Swedish parliament in 1766. This publication includes the English translation of this ordinance on freedom of writing and the press. The enlightenment thinker and politician Anders Chydenius (1729-1803), from the Finnish city of Kokkola, played a crucial role in creating the new law. As Professor Juha Manninen describes in his article, the key achievements of the 1766 Act were the abolishment of political censorship and the gaining of public access to government documents. Although the innovation was suspended from 1772-1809, the principle of publicity has since remained central in the Nordic countries. </li></ul><ul><li>http://www.scribd.com/doc/5885744/The-Worlds-First-Freedom-of-Information-Act-SwedenFinland-1766 </li></ul>
    42. 42. Early police data base: incomplete data Source: Jay, Ricky. “Grifters, Bunco Artists & Flimflammen.” Wired, Feb. 2011, p.88. http://rickyjay.com/
    43. 43. NM HB 406 <ul><li>“… information contained in information systems databases created or maintained by or on behalf of a public body … shall be subject to disclosure to any person requesting the information in the format requested. </li></ul><ul><li>“ The information shall be provided in the most effective and efficient manner available to the custodian , as defined in the Inspection of Public Records Act. </li></ul><ul><li>           B. The custodian may charge a reasonable fee for production of the information requested . The fee shall not exceed the cost of the materials and reasonable charges for the personnel required to retrieve and provide the information. </li></ul>But what if it wasn’t New Mexico state employees directly at fault?
    44. 44. Analytic Tools <ul><li>Text </li></ul><ul><ul><li>ThemeRiver - http://infoviz.pnl.gov/research_themeriver.stm </li></ul></ul>
    45. 45. “ Analytic tools” also for story-telling <ul><li>Spreadsheets: </li></ul><ul><ul><li>Tables, charts, infographics </li></ul></ul><ul><li>Data base programs </li></ul><ul><ul><li>Charts, graphs, data tables </li></ul></ul><ul><li>Stats programs ( SPSS or SAS or R ) </li></ul><ul><ul><li>Generate graphics </li></ul></ul><ul><li>Social network analytic graphics </li></ul><ul><li>GIS </li></ul>
    46. 46. FOIA b(3) Exemptions Original: http://www.propublica.org/article/foia-exemptions-sunshine-law
    47. 47. Content Analysis
    48. 48. Content analysis of legis party text
    49. 49. Positive example of gov’t data <ul><li>Positive example: NM Leg Bill Locator </li></ul><ul><li>http://www.nmlegis.gov/lcs/_session.aspx?chamber=H&legtype=B&legno=%20406&year=11 </li></ul>Same data available in two formats!
    50. 50. NM HB 406 <ul><li>Senate approved 39-0 on Feb. 9 http:// www.nmlegis.gov/Sessions/11%20Regular/bills/house/HB0406.html </li></ul><ul><li>“ An Act RELATING TO PUBLIC RECORDS; PROVIDING FOR THE INSPECTION OF ELECTRONIC RECORDS.” </li></ul>
    51. 51. “Data In” questions Data In → Analysis → Info Out <ul><li>Notes </li></ul><ul><li>Text </li></ul><ul><li>Numeric </li></ul><ul><li>Images </li></ul><ul><li>Charts/Graphs </li></ul><ul><li>Maps </li></ul><ul><li>Audio </li></ul><ul><li>Video </li></ul><ul><li>#1 – Keep a logbook (Try using Notesync.com) </li></ul><ul><li>Qualitative and/or Quantitative? </li></ul><ul><li>Objective: strive to get the data in the most fine-grained and original form. </li></ul><ul><ul><li>Online data is rarely complete or totally accurate </li></ul></ul><ul><li>Where is the data? In what format? I-o-P? Original digital file type(s)? </li></ul>
    52. 52. “Data In” questions Data In → Analysis → Info Out <ul><li>Notes </li></ul><ul><li>Text </li></ul><ul><li>Numeric </li></ul><ul><li>Images </li></ul><ul><li>Charts/Graphs </li></ul><ul><li>Maps </li></ul><ul><li>Audio </li></ul><ul><li>Video </li></ul><ul><li>#1 – Keep a logbook (Try using Notesync.com) </li></ul><ul><li>Who created the data? Why? How? Legal catalysts for creation? If so, what do they say? </li></ul><ul><li>Have definitions and collection process changed? </li></ul><ul><li>Who could review and edit the data? What was/is the vetting process to ensure accuracy? </li></ul><ul><li>Who has analyzed the data? For what purpose and with what methods? </li></ul>
    53. 53. <ul><li>Data In </li></ul><ul><li>→ </li></ul><ul><li>Analysis </li></ul><ul><li>→ </li></ul><ul><li>Info Out </li></ul>
    54. 54. “Analysis” phase Data In → Analysis → Info Out <ul><li>Notes </li></ul><ul><li>Text </li></ul><ul><li>Numeric </li></ul><ul><li>Images </li></ul><ul><li>Charts/Graphs </li></ul><ul><li>Maps </li></ul><ul><li>Audio </li></ul><ul><li>Video </li></ul><ul><li>Atoms → Bits How? Who? </li></ul><ul><li>What are we looking for? How can we be surprised? </li></ul><ul><li>Previous/parallel investigations? (Start with IRE site stories and tipsheets) </li></ul><ul><li>Context, i.e. past environment(s) and changes? Trends past and future? </li></ul><ul><li>Quantitative and Qualitative methods? </li></ul><ul><li>Data cleaning tools? </li></ul>
    55. 55. “Analysis” phase Data In → Analysis → Info Out <ul><li>Notes </li></ul><ul><li>Text </li></ul><ul><li>Numeric </li></ul><ul><li>Images </li></ul><ul><li>Charts/Graphs </li></ul><ul><li>Maps </li></ul><ul><li>Audio </li></ul><ul><li>Video </li></ul><ul><li>Atoms → Bits How? Who? </li></ul><ul><li>Measurement of phenomena </li></ul><ul><ul><li>Strength of relationships </li></ul></ul><ul><ul><li>Change </li></ul></ul><ul><li>Estimating </li></ul><ul><li>Counting </li></ul><ul><li>Statistical </li></ul><ul><li>Geostatistical </li></ul><ul><li>Social Network Analysis </li></ul><ul><li>Forensic accounting </li></ul><ul><li>Who’s your rabbi? </li></ul>
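    Three of the measurements listed on this slide (counting, change, and strength of relationships) can be sketched in a few lines. This is a minimal illustration, not the method used in any of the stories cited in the deck: the helper names and all of the numbers below are made up for the example, and Pearson's r is only one of several ways to express the strength of a relationship.

```python
from collections import Counter
from math import sqrt


def percent_change(old, new):
    """Change: how much a value moved, as a percentage of the old value."""
    return (new - old) / old * 100.0


def pearson_r(xs, ys):
    """Strength of relationship: Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Counting: how often each (hypothetical) agency appears in a list of records.
agencies = ["Roads", "Parks", "Roads", "Schools", "Roads"]
counts = Counter(agencies)

# Change: a made-up budget line moving from 200 to 250.
change = percent_change(200, 250)

# Relationship: two made-up series that rise together.
r = pearson_r([1, 2, 3], [2, 4, 6])

print(counts.most_common(1), change, r)
```

    A correlation near 1 or -1 only says the two series move together; it is a prompt for reporting, not a finding on its own.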
    56. 56. Data In → Analysis → Info Out <ul><li>Notes </li></ul><ul><li>Text </li></ul><ul><li>Numeric </li></ul><ul><li>Images </li></ul><ul><li>Charts/Graphs </li></ul><ul><li>Maps </li></ul><ul><li>Audio </li></ul><ul><li>Video </li></ul><ul><li>Atoms → Bits How? Who? </li></ul><ul><li>What are we looking for? How can we be surprised? </li></ul><ul><li>Source </li></ul><ul><li>Definition </li></ul><ul><li>Context </li></ul><ul><li>Estimating </li></ul><ul><li>Counting </li></ul><ul><li>Statistical </li></ul><ul><li>Geostatistical </li></ul><ul><li>Social Network Analysis </li></ul><ul><li>Forensic accounting </li></ul>
    57. 57. Data In → Analysis → Info Out <ul><li>Notes </li></ul><ul><li>Text </li></ul><ul><li>Numeric </li></ul><ul><li>Images </li></ul><ul><li>Charts/Graphs </li></ul><ul><li>Maps </li></ul><ul><li>Audio </li></ul><ul><li>Video </li></ul><ul><li>Atoms → Bits How? </li></ul><ul><li>What are we looking for? How can we be surprised? </li></ul><ul><li>Source </li></ul><ul><li>Definition </li></ul><ul><li>Context </li></ul><ul><li>Estimating </li></ul><ul><li>Counting </li></ul><ul><li>Statistical </li></ul><ul><li>Geostatistical </li></ul><ul><li>Social Network Analysis </li></ul><ul><li>Forensic accounting </li></ul><ul><li>Broadcast </li></ul><ul><li>Web </li></ul><ul><li>Audio </li></ul><ul><li>Video </li></ul><ul><li>Text </li></ul><ul><li>Data visualization </li></ul><ul><li>Maps </li></ul><ul><li>Dynamic databases </li></ul><ul><li>Archives </li></ul>
    58. 58. Theory of Journalistic Process Copyright © J. T. Johnson Data In <ul><li>Interviews </li></ul><ul><li>Text docs </li></ul><ul><li>Clips </li></ul><ul><li>Pictures </li></ul><ul><li>Infographics </li></ul>This is a headline DATELINE -- And the traditional text story starts here and goes on and on and on. Info Out Analysis