It's not the documents; it's the DATA
Tom Johnson, Managing Director, Institute for Analytic Journalism, Santa Fe, New Mexico, USA. tom@jtjohnson.com
It’s not the documents, it’s the DATA! Thank you all.
Early police database: incomplete data. Source: Jay, Ricky. “Grifters, Bunco Artists & Flimflammen.” Wired, Feb. 2011, p. 88. http://rickyjay.com/
Show of hands: How many journalists? How many from NGOs? How many lawyers (you can double count yourself)? How many students?
Theory of PROCESS: the same for journalists, lawyers and socially concerned citizens. It is the same up to the point of “Information Out”; there we might write and present somewhat differently for different audiences.
While sometimes an individual document can be important, today seeing the patterns of behaviors is usually just as important and insightful if not more so.
Early public records: intricate data collection; potential for error in data entry; potential for error in filing; no machine retrieval or analysis. Even today, OCR would be impossible.

This rare Bertillon card (named after the inventor of anthropometry). Source: http://cultureandcommunication.org/deadmedia/index.php/Bertillon_System

Decline of Bertillonage: fingerprint killed the Bertillon star

The complexity of the Bertillon system, the very thing that provided it with such accurate and reliable data, also proved to be its downfall: it was simply too cumbersome to replicate with sufficient accuracy. As soon as Bertillon’s procedures began to be disseminated outside of Paris there were problems; as Cole explains:

Learning the system from translated books, far from the exacting presence of Bertillon himself, identification clerks seldom replicated the rigor that characterized operations in Paris. Instead, they skimped on learning the morphological vocabulary, glossed over the precise movements in the measuring process, and contented themselves with sloppily recording a few measurements. Worse, most identification bureaus, too proud to simply adopt Bertillon’s system wholesale, took it upon themselves to modify various aspects of the system. (Cole 2001, 52)

Bertillon anticipated these problems, writing a strongly worded message in his instruction manual directed towards all those who would consider meddling with his finely tuned methods:

The arrangement of these instruments was the subject of many experiments and numberless improvements before they reached their present shape, which we consider as final. So we reject in advance every modification, every further change, however slight, either in their form or in their manner of using them. That is a great temptation for beginners, to whom numerous new ideas occur, but who are not aware that all these ideas, even those that they believe to be the most original, the most personal, have already been proposed by others, tried and finally rejected for divers reasons. (Bertillon 1896, 19)

Alas, Bertillon’s warnings were not heeded, and the accuracy of anthropometric measurements, and the reputation of the system as a whole, suffered as a result. Even if the integrity of Bertillon’s system could be sustained outside of Paris, it was soon to be overtaken by another form of criminal identification. As Kaluszynski notes, “at the last moment before it seemed likely to dominate the future, anthropometry was to undergo a rude shock. Its success had barely been established and savored when its supremacy began to falter in the face of a new and infallible technique” (2001, 128). Of course, the new technique was fingerprinting, a much simpler process than Bertillonage. “A fingerprint is a physical sign that cannot be falsified or disguised, and the mathematical likelihood of two individuals having identical fingerprints is infinitely small” (128). Occam’s razor would dictate that fingerprinting soon supplant Bertillonage as the world-wide standard for criminal identification.
By 1910: the indexing system has improved; typewriters instead of pens; better haircuts. But still: null fields; subject to data entry errors; lost or misfiled cards/data; limited resources for large-scale analysis.
Early “hard drives,” data retrieval and data analysis of public records
A public record, but one of limited usage. A DOCUMENT, but with no efficient, productive, insightful way to FIND the data. A DOCUMENT, but with no efficient, productive, insightful way to EXTRACT the data. Sorta like a PDF.
All data today requires NEW tools for ANALYSIS and STORY-TELLING. Statutes are usually adequate; the CULTURES are the challenge: both the culture of politicians and bureaucrats AND the culture of traditional journalism, which reports the event, not the issue, and lacks contemporary analytic skills.
Waite: The government doesn't know how many acres of Florida wetlands have been destroyed in the past 15 years. No state or federal agency has kept track, not even the Army Corps of Engineers, which has the final say on protecting wetlands. Another federal agency, the National Wetlands Inventory, is supposed to track losses nationwide. The tiny agency, based in St. Petersburg, mapped Florida's wetlands 20 years ago, but hasn't updated its maps except for two of Florida's 67 counties.

So the St. Petersburg Times examined satellite images of Florida to determine the loss of wetlands. Satellite images taken in the late 1980s were compared with those taken in 2003. Those were combined with data from the National Wetlands Inventory and the state Fish and Wildlife Conservation Commission. A random sample of 385 places on the resulting maps was checked against other data through property records, aerial images and site visits. No satellite image analysis can be 100 percent accurate, particularly one covering such a broad area. In this case, the accuracy was about 85 percent, the level required by the U.S. Geological Survey for similar satellite analysis. To filter out temporary changes from long-lasting ones, the analysis relied on a map of urbanization created by the state wildlife agency. That showed about 84,000 missing acres of wetlands.

The methodology was reviewed by Barnali Dixon, professor of geography at the University of South Florida; Leonard G. Pearlstine, assistant scientist at the University of Florida's Fort Lauderdale Research and Education Center; and Tom Lillesand, professor of geography and director of the Environmental Remote Sensing Center at the University of Wisconsin-Madison. [Last modified December 14, 2006, 18:10:27]

UK: The Expenses Files -- MPs' Expenses. A year ago the High Court backed an earlier ruling by the Information Tribunal that full details of MPs' expenses, including receipts, should be made public. Since then MPs have been accused of dragging their feet and playing for time. Full details are slated to be published in July but with some crucial details, such as addresses of second homes, blacked out. An investigation by the Telegraph has uncovered the full files.

The Guardian: Join us in digging through the documents of MPs' expenses to identify individual claims, or documents that you think merit further investigation. You can work through your own MP's expenses, or just hit the button below to start reviewing. (Update, Fri pm: we now have a virtually complete set of expenses documents so you should be able to find your MP's.) “We have 458,832 pages of documents. 27,731 of you have reviewed 223,475 of them. Only 235,357 to go”
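The Guardian's running tally is itself a small piece of data that can be checked. The sketch below simply re-derives the "to go" figure from the two numbers quoted above; the function names are illustrative and not part of any Guardian tool.

```python
# Figures quoted from the Guardian's MPs' expenses progress banner.
total_pages = 458_832
reviewed_pages = 223_475
stated_remaining = 235_357


def remaining(total, done):
    """Pages still waiting for a volunteer reviewer."""
    return total - done


def share_reviewed(total, done):
    """Fraction of all pages already reviewed (0.0 to 1.0)."""
    return done / total


if __name__ == "__main__":
    left = remaining(total_pages, reviewed_pages)
    print("remaining pages:", left)
    print("matches the published figure:", left == stated_remaining)
    print("share reviewed: {:.1%}".format(share_reviewed(total_pages, reviewed_pages)))
```

The same habit, recomputing a published total from its parts, applies to any budget table or tally before it is quoted in a story.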
The reporter harvested the data, then cleaned and verified it. A team produced the story for multiple delivery platforms. But it all started with the data, some of which probably never existed as an ink-on-paper DOCUMENT. Project homepage: http://www.azcentral.com/news/articles/2010/11/12/20101112arizona-pension-funds.html State pension fund tried suing 'Republic': http://www.azcentral.com/news/articles/arizona-pension-funds-records.html
Early, major example of crowd-sourced analysis. A “wetware” content analysis tool. Text data AND PDF, but the connection to the PDF is, SORTA, the end result.
Range of file “states/forms”. Range of the challenge in extracting and analyzing the data. “JSON is an important standard for ease of interaction across systems. It's becoming the preferred route over XML in many cases. And as geo-spatial data explodes, addressing the standards there might be helpful. I would include KML, GeoJSON and SHP files for vector and many options for raster: bil, netCDF, ECW, GeoTIFF, etc.” (Guerin)
And even these are NOT perfect; you have to know some of the underlying assumptions inherent in these file types. That said, this is still the best point of departure when seeking to acquire files and their data. Just as an example, a numeric field in a CSV file does not keep leading zeros, so my ZIP code would collapse from 02151 to 2151. Or the field would be represented as text, "02151" (surrounded by quote marks). Some translation programs do that automatically, but there is no standard. Same problem with phone numbers, some equations, etc. CSV also assumes field headers are on one line. They need to be in one cell in Excel to translate correctly that way. Often they are not, or the Excel file has multiple levels of heads. XML is the general link format people want to use, but not all states have adopted it or a standard schema. The CSV standard does not even allow a blank row or a formatting row (like ---------) between the header and the live data table. The format row is usually read as a zero, not null, and that screws up averages, medians and so forth. Excel "cheats" on calculating medians, etc. (SSR) Should be ANSI-standard CSV. (SSR)
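The ZIP code collapse described above (02151 becoming 2151) is easy to reproduce. What follows is a minimal sketch using Python's standard csv module; the sample row and the `restore_zip` helper are invented for illustration. The point is that the damage happens when a field is converted to a number, so identifier-like fields are safer kept as text.

```python
import csv
import io

# Hypothetical sample data: one ZIP code column with a leading zero.
RAW = "office,zip\nSample Office,02151\n"


def read_rows(text):
    """Parse CSV text; the csv module hands back every field as a string."""
    return list(csv.DictReader(io.StringIO(text)))


def restore_zip(value, width=5):
    """Pad a ZIP code that was stored as a number back to five digits."""
    return str(value).zfill(width)


rows = read_rows(RAW)
zip_as_text = rows[0]["zip"]     # kept as text, the leading zero survives
zip_as_number = int(zip_as_text)  # converted to a number, the zero is lost

if __name__ == "__main__":
    print(zip_as_text, zip_as_number, restore_zip(zip_as_number))
```

Padding with `restore_zip` only repairs fixed-width codes; phone numbers and other identifiers of varying length cannot be recovered this way, which is one more reason to ask for the data in its original form.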
Move data from “out there” to analytic site/tools Looking for connections; patterns
Seek fine-grained data, NOT aggregations. Seek data in its original form (i.e. NO PDFs). Get data in lowest-common-denominator format: comma-delimited files in ASCII or text. Who collected the data? Why? How? Who proofed/edited the data? Why? How? If it comes from a database, first ask for the “record layout” or “code sheet” or “schema”. Definitions of variables or fields: constant or ???
Barriers to data = barriers to analysis. NO site search capability; no site map. Failure to use open-standard HTML; using the closed-standard Adobe Flash/Shockwave environment. Page formats/layouts not consistent; too many drill-downs instead of search-driven generators. Jiggly roll-overs; too much effort spent on bling. Impossible to download or scrape data for analysis. Information available only in Adobe PDF files, which are notoriously unfriendly to data analysis.
State of NM gov’t agency develops creditable web site Search engine Choice of Spanish Opportunity for feedback Registered, i.e. OWNED, by the CITIZENS of New Mexico Award Digital Government Achievement Awards http://www.centerdigitalgov.com/survey/88/2010
Another relatively valuable NM state website Clean site design Search engine Quick links to actual document in Word format and PDF
No search engine No Spanish version Head-scratching logic in the taxonomy of the silos Why the roll-overs that don’t do much? If we drill down into “Capital Outlay” (top left menu) we end up with a 70-page PDF. Again.
Go to these sites and, if lucky, find the document we want. But they are all, and only, PDFs. PDFs can be retrieved and saved, one at a time, to your desktop. Apps are available that OCR what probably is the output of an Excel document, but that output shows up on two partial pages, which just adds more time and effort. After extracting to Excel, the result must be closely copy edited to make sure the extraction process read every zero as a zero and not as a capital letter “Oh”. Another interesting problem: screwy, idiosyncratic fonts.
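Part of that copy-editing pass can be automated. The sketch below is one possible approach, not a tool named in the talk: it flags cells that look numeric but contain a letter OCR often confuses with a digit. The lookalike table and the function names are assumptions for illustration, and a human still has to confirm each suggested fix.

```python
import re

# Letters that OCR commonly confuses with digits (illustrative, not exhaustive).
LOOKALIKES = {"O": "0", "o": "0", "l": "1", "I": "1"}

# A cell made up only of digits, number punctuation and lookalike letters.
NUMERIC_LIKE = re.compile(r"^[\d,.\-$OolI]+$")


def is_suspicious(cell):
    """True if the cell looks like a number but contains a lookalike letter."""
    if not NUMERIC_LIKE.match(cell):
        return False
    has_digit = any(ch.isdigit() for ch in cell)
    has_lookalike = any(ch in LOOKALIKES for ch in cell)
    return has_digit and has_lookalike


def suggest_fix(cell):
    """Replace each lookalike letter with the digit it probably stands for."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in cell)


if __name__ == "__main__":
    for cell in ["1,2O0", "1,200", "Oil", "$4l.50"]:
        if is_suspicious(cell):
            print(cell, "-> check against the PDF; possibly", suggest_fix(cell))
```

Cells with no digits at all, such as ordinary words, are left alone so that text columns do not flood the review list.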
It is possible to build a good-looking site, integrating Flash technology if desired, while still making the underlying structured data directly available to users. A good example is our election day results files ( http://elections.nytimes.com/2010/results/house ). If you view the source markup for this page, which includes very sharp-looking Flash elements, you'll find an embedded URL -- http://elections.nytimes.com/2010/results/house.tsv . That is a link to a tab-delimited file containing the data underlying the map. --Griff Palmer +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ From: James Jennings < [email_address] > Date: Monday, February 14, 2011 Subject: Ease of scraping this site? To: chris feola < [email_address] > 1-The entire site is in flash. I might be able to pipe some of the search data to a csv but not everything is searchable. This is the best job of making public data as inaccessible as possible that I have ever seen. It is a masterwork. I would call and just ask them to send it all in a spreadsheet and see what happens. jj
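Once a tab-delimited file like the one Palmer describes has been saved locally, reading it takes only a few lines. This is a sketch under stated assumptions: the sample text and its column names (`district`, `candidate`, `votes`) are invented stand-ins, not the actual layout of the New York Times file.

```python
import csv
import io

# Hypothetical stand-in for a downloaded .tsv file; the columns are invented.
SAMPLE_TSV = (
    "district\tcandidate\tvotes\n"
    "NM-1\tA. Example\t1200\n"
    "NM-1\tB. Sample\t950\n"
    "NM-2\tC. Placeholder\t800\n"
)


def parse_tsv(text):
    """Read tab-delimited text into a list of dictionaries keyed by header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def votes_by_district(rows):
    """Sum the votes column for each district."""
    totals = {}
    for row in rows:
        district = row["district"]
        totals[district] = totals.get(district, 0) + int(row["votes"])
    return totals


if __name__ == "__main__":
    print(votes_by_district(parse_tsv(SAMPLE_TSV)))
```

Contrast this with the Flash-only site in the Jennings email: when the structured file is published alongside the graphics, the analysis starts immediately instead of with a scraping project.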
Examples of good GOVERNMENT sites.
What do we see on all these GOOD gov’t sites? All have up-front search capabilities All are written in “data-accessible” code All data can be downloaded with “relative” ease Some have various languages available ALL are run by GOVERNMENT; no commercial sites
Failure on the part of planners/bureaucrats to simply… Give The People THEIR Data… In The Most Basic, Original, Straightforward Form… And Let Them Figure Out What Should Be Done With It! The governor agrees
See the “Public Online Information Act” (POIA): http://sunlightfoundation.com/policy/poia/ For multiple years the state has had, and has employed, the Fiscal Impact Report, which is required to be attached to EVERY bill introduced in the Legislature. In the 21st century, if we are to have not just citizens participating in democracy in an informed manner but also economic growth, we should have a requirement for every bill introduced that asks: How will this bill advance the public’s access to the ORIGINAL-FORM data related to this bill and topic?
Source: https://secure.wikimedia.org/wikipedia/en/wiki/Public_records

Historic perspective. The concept of public records first emerged in western Europe in the late Middle Ages. Some of the first public records were census records; birth, burial, and marriage records, such as the Domesday Book (1085-6) of William the Conqueror; and royalty marriage agreements, which were perceived as international treaties brokered by private parties. In the United Kingdom, the Public Record Office Act was passed in 1838. Of particular significance was the evolution of the common law right "to access court records to inspect and to copy". The expectation inherent in the common law right to access court records is that any person may come to the office of the clerk of the court during business hours and request to inspect court records, with almost instantaneous access. Such right is a central safeguard for the integrity of the courts. Any decision to conceal court records requires a sealing order. The right to access court records is also central to liberty: there is no conceivable way to exercise the Habeas Corpus right, deemed by the late Justice Brennan as "the cornerstone" of the United States Constitution, absent access to court records as public records. In the United States the common law right to "access court records to inspect and to copy" was re-affirmed by the US Supreme Court in Nixon v Warner Communications, Inc (1978), where the court found various parts of the right to access court records as inherent to the First, Fourth, Sixth, and Fourteenth Amendments. Therefore, in the United States, access to court records is governed by Civil Rights in the Amendments to the United States Constitution, not by the Freedom of Information Act.

Public records in the United States. Access to public records in the US at the federal level is guided by the Freedom of Information Act (FOIA). Requests for access to records pursuant to FOIA are often frustrated by federal agencies through the numerous exemptions found in the law, and through redaction of critical data. Each state has its own version of FOIA. For example, in Colorado there is the Colorado Open Records Act (CORA) and in New Jersey the law is known as the Open Public Records Act (OPRA). There are many degrees of accessibility to public records between states, with some making it fairly easy to request and receive documents, and others with many exemptions and restricted categories of documents. One state that is fairly responsive to public records requests is New York, which utilizes the Committee on Open Government to assist citizens with their requests. A state that is fairly restrictive in how it responds to public records requests is Pennsylvania, where the law currently presumes that all documents are exempt from disclosure, unless they can be proven otherwise. The California Public Records Act (California Government Code §§6250-6276.48) covers the arrest and booking records of inmates in the State of California jails and prisons, which are not covered by First Amendment rights. Public access to arrest and booking records is seen as a critical safeguard of liberty.
The ThemeRiver™ visualization helps users identify time-related patterns, trends, and relationships across a large collection of documents. The themes in the collection are represented by a "river" that flows left to right through time. The river widens or narrows to depict changes in the collective strength of selected themes in the underlying documents. Individual themes are represented as colored "currents" flowing within the river. The theme currents narrow or widen to indicate changes in individual theme strength at any point in time.
Original source: http://www.propublica.org/article/foia-exemptions-sunshine-law NB: If one picks the CIA, for example, you get a “virtual” webpage, NOT an actual document. You can drill down into viewing the PDF, which is a secondary result of the search, not the primary result.
“DATA” upon analysis becomes information. “DATA” is sensory, qualitative and quantitative: the smell of a forest fire, the expression of an interviewee’s feelings, a copy of the state budget or a bill marked up by committee. The quality of the “Information Out” can be no better than the DATA that goes in (and that means Research and Reporting) and the Analysis applied to that high-quality data. In the Infosphere, “Information” is often released back into the Infosphere to become “DATA” for some other species’ or colleagues’ use.