IRE "Better Watchdog" workshop presentation "Data: Now I've got it, what do I do with it?"  Albuquerque, NM Feb 12-13, 2011


Tom Johnson's presentation at IRE workshop, Feb. 2011

Usage Rights: CC Attribution-NonCommercial License

  • Data In: OK, it’s downloaded. Where ya gonna save it? Dropbox, SugarSync, Syncplicity ($$), Jungle Disk ($3 p/m), Zumodrive (2GB = $3 p/m), AeroFS, SpiderOak, MiMedia, Wuala, Quanp. Avoid MS Windows Live, SkyDrive and Mesh: more trouble than they are worth. Bookmarks: search on Tucows; Xmarks, Diigo. Goals for bookmarks: save on PC, in cloud, sync, export, share. Get data as fine-grained as possible, in a lowest-common-denominator format.
  • From: James Jennings <[email_address]>. To: chris feola <[email_address]>. Date: Monday, February 14, 2011. Subject: Ease of scraping this site? “The entire site is in Flash. I might be able to pipe some of the search data to a CSV, but not everything is searchable. This is the best job of making public data as inaccessible as possible that I have ever seen. It is a masterwork. I would call and just ask them to send it all in a spreadsheet and see what happens.” jj
  • Griff Palmer: “We can't talk about motive for doing the kind of stuff they're trying to do here. But we can talk about demonstrable effect, and we can talk about practical alternatives.” It is possible to build a good-looking site, integrating Flash technology if desired, while still making the underlying structured data directly available to users.   A good example is our election day results files ( http://elections.nytimes.com/2010/results/house ). If you view the source markup for this page, which includes very sharp-looking Flash elements, you'll find an embedded URL --   http://elections.nytimes.com/2010/results/house.tsv . That is a link to a tab-delimited file containing the data underlying the map. --Griff Palmer
  • Calculating Rates, e.g.: 2,000 murders ÷ 7,300,000 population = .00027397; .00027397 * 100,000 = a murder rate of about 27.4 per 100,000
  • http://wiki.answers.com/Q/How_do_you_calculate_inflation_rate
  • Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase . Freebase is an open, Creative Commons licensed repository of structured data of almost 20 million entities . An entity is a single person, place, or thing. Freebase connects entities together as a graph . Ways to use Freebase: Use Freebase's Ids to uniquely identify entities anywhere on the web Query Freebase's data using MQL Build applications using our API or Acre , our hosted development platform Freebase is also a community of thousands of data-lovers, working together to improve Freebase's data. Learn how to contribute , join our mailing list , or find out more on our community page . Fusion Tables is a service for managing large collections of tabular data in the cloud. You can upload tables of up to 100MB and share them with collaborators, or make them public. You can apply filters and aggregation to your data, visualize it on maps and other charts, merge data from multiple tables, and export it to the Web or csv files. You can also conduct discussions about the data at several levels of granularity, such as rows, columns and individual cells.

Presentation Transcript

  • DATA: Now I’ve got it; what do I do with it? Tom Johnson Managing Director Inst. for Analytic Journalism Santa Fe, New Mexico USA t o m @ j t j o h n s o n . c o m
  • DATA: Now I’ve Got It; what do I do with it?
    • Presentation at
    • IRE’s Albuquerque Watchdog Workshop
    • Feb. 12-13, 2011
    • Hosted by the University of New Mexico
    This PowerPoint deck and Tipsheet posted at: http://Johnson-IREwatchdog.notlong.com     
    • Important point 1
    • The document is not the data.
    • Important point 2
    • The document is not the data. Without analysis, the data are not the story.
    • Important point 3
    • Nothing is as important, and valuable, as a good theory!
  • A good, important THEORY
    • Theory of Journalistic Process
    Data In → Analysis → Info Out
  • DATA IN: Retrieve
    • Bookmark apps
      • Objectives:
        • Access via browser – but not standard equipment
        • Create/manage sub-folders, categories & keywords, annotations
        • Private and/or public sharing
        • Save and Export to backup system(s)
      • Examples:
        • Xmarks: www.xmarks.com/
        • Diigo: www.diigo.com/index
        • Freeware/shareware: search for it at www.tucows.com
  • DATA IN: Store & Share in the Cloud
    • OK, it’s downloaded. Where ya gonna save it?
    • Multiple back-up sites: desktop and…
      • Safer in Cloud than otherwise
        • Passwords, but share capabilities
      • Easier with “Cloud-sync” apps
      • Free to low-cost
  • DATA IN: Store & Share in the Cloud
    • OK, it’s downloaded. Where ya gonna save it?
        • Avoid MS Windows Live, SkyDrive and Mesh – more trouble than they are worth
    • Dropbox - www.dropbox.com
    Your Hard Drive
    Viewed in your browser: folders, subfolders, sub-subfolders, etc. Nearly instant syncing with/from your desktop.
  • DATA IN: Store & Share in the Cloud
    • OK, it’s downloaded. Where ya gonna save it?
        • Avoid MS Windows Live, SkyDrive and Mesh – more trouble than they are worth
    • Dropbox - www.dropbox.com
    • SugarSync - www.sugarsync.com
    • Syncplicity - www.syncplicity.com
    • Jungle Disk ($3p/m) - www.jungledisk.com
    • Zumodrive ($3p/m) - www.zumodrive.com
    • AeroFS - www.aerofs.com
    • SpiderOak - spideroak.com
    • MiMedia, Wuala, Quanp
  • Data In → Analysis → Info Out
    • Notes
    • Text
    • Numeric
    • Images
    • Charts/Graphs
    • Maps
    • Audio
    • Video
    • Atoms → Bits
    • How? Who?
  • Data In: Objectives
    • Move data from “out there” to analytic site/tools
    • Seeking fine-grained data, NOT aggregations
      • Seek data in original form (i.e. NO PDFs)
      • Who collected the data? Why? How?
      • Who proofed/edited the data? Why? How?
      • If from a database, first ask for the “record” or “code sheet” or “schema”
      • Definitions of variables or fields. Constant or ???
      • Get data in lowest common denominator format: Comma-delimited files in ASCII or Text
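A minimal sketch of what "lowest common denominator" can look like in practice, assuming Python and its standard `csv` module (neither is mentioned in the deck). The rows and field names below are made up purely for illustration; the point is only that plain comma-delimited text survives a round trip through almost any tool.

```python
import csv
import io

# Made-up sample rows, for illustration only (not real agency data).
rows = [
    {"agency": "Agency A", "year": 2010, "amount": 1250.50},
    {"agency": "Agency B", "year": 2010, "amount": 980.00},
]


def to_csv_text(records):
    """Write a list of dicts out as plain comma-delimited text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_csv_text(text):
    """Read comma-delimited text back into a list of dicts.

    Every value comes back as a string; converting to numbers is
    a separate, deliberate step during analysis.
    """
    return list(csv.DictReader(io.StringIO(text)))


csv_text = to_csv_text(rows)
records = from_csv_text(csv_text)
print(csv_text)
print(records[0]["agency"])
```

Because everything is read back as text, it helps to note in the logbook which columns were converted to numbers or dates, and how.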
  • Data In: Challenges
      • New site in New Mexico: www.sunshineportalnm.com
      • “Beta,” but looks to be a cruel joke on taxpayers; torture for journos
  • Data In: “Typical” problems with SunshineportalNM
      • Barriers to data = barriers to analysis
        • NO site search capability; no site map
        • Completely abandoned open-standard HTML, going for the closed-standard Adobe Flash/Shockwave environment.
        • Page formats/layouts not standard; too many drill-downs instead of search-driven generators
        • Jiggly roll-overs; too much effort spent on bling
        • Impossible to download or scrape data for analysis
        • State makes information available only in Adobe PDF files; notoriously unfriendly to data analysis.
  • Data In: Challenges in SunshinePort
      • Comprehensive Annual Financial Reports
        • Possible to machine download, but laborious to format for analysis
      • Investment Holdings reports are far worse
        • They are poor-quality static image files, not machine-readable.
        • Tabular data roughly formatted; makes conversion for analysis an arduous, if not impossible, task.
  • Bottom line on SunshinePortalNM.com
    • “If the State of New Mexico takes the position that through this site it is discharging all of its disclosure obligations with respect to these particular records, open government is in trouble there.”
    • “This is not even a web page, it’s a Flash application, so there’s not going to be much sunlight escaping from this portal.”
    • “A perfect example of creating the appearance of transparency without actually being transparent.”
  • Challenge for Watchdogs?
    • Failure on the part of planners/bureaucrats to simply…
    • Give The People THEIR Data…
    • In The Most Basic, Original, Straightforward Form…
    • And Let Them Figure Out What Should Be Done With It!
    HB406
  • Positive example of gov’t data
    • Positive example: NM Leg Bill Locator
    • http://www.nmlegis.gov/lcs/_session.aspx?chamber=H&legtype=B&legno=%20406&year=11
    Same data available in two formats!
  • NM HB 406
    • Senate approved 39-0 on Feb. 9 http://www.nmlegis.gov/Sessions/11%20Regular/bills/house/HB0406.html
    • “An Act RELATING TO PUBLIC RECORDS; PROVIDING FOR THE INSPECTION OF ELECTRONIC RECORDS.”
  • NM HB 406
    • “… information contained in information systems databases created or maintained by or on behalf of a public body … shall be subject to disclosure to any person requesting the information in the format requested.
    • “The information shall be provided in the most effective and efficient manner available to the custodian, as defined in the Inspection of Public Records Act.
    • B. The custodian may charge a reasonable fee for production of the information requested. The fee shall not exceed the cost of the materials and reasonable charges for the personnel required to retrieve and provide the information.
    But what if it wasn’t New Mexico state employees directly at fault?
  • Why is it sunshineportalNM.COM?
    • Domain Name: SUNSHINEPORTALNM.COM
    • Registrar: WILD WEST DOMAINS, INC
    • Referral URL: http://www.wildwestdomains.com
    • Name Server: ENESFOUR.SKS.COM
    • Name Server: ENESONE.SKS.COM
    • Name Server: ENESTHREE.SKS.COM
    • Name Server: ENESTWO.SKS.COM
    • Status: clientDeleteProhibited
    • Status: clientRenewProhibited
    • Status: clientTransferProhibited
    • Status: clientUpdateProhibited
    • Updated Date: 30-mar-2010
    • Creation Date: 30-mar-2010
    • Expiration Date: 30-mar-2011
  • Wild West Domains
    • Registrant:
    • Wild West Domains, Inc. 14455 N Hayden Rd Suite 219 Scottsdale, Arizona 85260 United States
    • Registered through: WWDomains.com
    • Domain Name: WILDWESTDOMAINS.COM
    • Created on: 22-Aug-00
    • Expires on: 22-Jul-19
    • Last Updated on: 08-Dec-09
    • Administrative Contact:
    • Wild West Domains, Inc. [email_address]
    • Wild West Domains, Inc.
    • 14455 N Hayden Rd Suite 219
    • Scottsdale, Arizona 85260 United States
    • +1.4805058800 Fax -- +1.4805058844
    • Technical Contact:
      • Wild West Domains, Inc. [email_address]
      • Wild West Domains, Inc.
      • 14455 N Hayden Rd Suite 219
      • Scottsdale, Arizona 85260 United States
      • +1.4805058800 Fax -- +1.4805058844
    • Domain servers in listed order:
    • CNS1.SECURESERVER.NET
    • CNS2.SECURESERVER.NET
  • Post-data recovery: Analytic DNA
    • Qualitative
    • Who
    • What
    • When
    • Why
    • Where
    • How
    • Quantitative
    • How many/much
    • What categories
    • What type of data and levels
    • What changes
    • What “timeline”
    • Geo-location
    • All stories have geography
    • People are interested in “How close is this to me?”
  • Data In → Analysis → Info Out
    • Notes
    • Text
    • Numeric
    • Images
    • Charts/Graphs
    • Maps
    • Audio
    • Video
    • Atoms → Bits. How? Who?
    • What are we looking for? How can we be surprised?
    • Source
    • Definition
    • Context
    • Estimating
    • Counting
    • Statistical
    • Geostatistical
    • Social Network Analysis
    • Forensic accounting
  • The “Fundamental Five” Statistics
    • Calculating percent of change
      • (New-Old) ÷ Old * 100
      • or
      • ((new/old) –1) * 100
    • Calculating proportion:
      • (# of parts ÷ TOTAL # of parts) * 100 = % of whole
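The two formulas on this slide can be sketched in a few lines of Python (a language choice that is an assumption here, not something the deck prescribes). The function names and the sample numbers are hypothetical.

```python
def percent_change(old, new):
    """Percent change from old to new: (new - old) / old * 100."""
    if old == 0:
        raise ValueError("percent change is undefined when the old value is 0")
    return (new - old) / old * 100


def percent_of_whole(part, total):
    """Proportion: (part / total) * 100 = percent of the whole."""
    if total == 0:
        raise ValueError("proportion is undefined when the total is 0")
    return part / total * 100


# Hypothetical figures, chosen only to show the arithmetic.
print(percent_change(80, 100))    # 25.0
print(percent_of_whole(30, 120))  # 25.0
```

Both versions of the percent-change formula on the slide, (New-Old) ÷ Old and (new/old) – 1, give the same answer; the first is usually easier to explain to readers.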
  • The “Fundamental Five” Statistics
    • Calculating Rates: (incidents ÷ population) * 10,000 (or 100,000)
    • Calculating Ratios:
      • Take first of two numbers being compared and divide by second.
      • 600 ÷ 30 = 20 [Ratio is 20-to-1; if fraction, round off]
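As a rough Python sketch of the rate and ratio formulas (function names are hypothetical), using the murder-rate figures from the speaker notes and the 600 ÷ 30 example from the slide:

```python
def rate_per(incidents, population, per=100_000):
    """Rate: (incidents / population) * per, e.g. per 10,000 or 100,000."""
    if population == 0:
        raise ValueError("rate is undefined for a population of 0")
    return incidents / population * per


def ratio(first, second):
    """Ratio: the first number divided by the second."""
    if second == 0:
        raise ValueError("ratio is undefined when the second number is 0")
    return first / second


# 2,000 murders in a population of 7,300,000
print(round(rate_per(2_000, 7_300_000), 2))  # 27.4
# 600 compared with 30 is a 20-to-1 ratio
print(ratio(600, 30))  # 20.0
```

Rates matter because raw counts mislead when populations differ; always state the base ("per 100,000 residents") alongside the number.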
  • The “Fundamental Five” Statistics
    • Calculating Inflation:
      • (CPI Now ÷ CPI Then) * Item Price Then = Item then in today’s $$$ [Tool: http://www.westegg.com/inflation/]
      • Calculating INFLATION RATE: if CPI in 2000 is 3,500 and CPI in 2001 is 4,500, what's the inflation rate? 4,500 - 3,500 = 1,000; 1,000 ÷ 3,500 = .2857...; .2857 * 100 = 28.57 percent is the INFLATION RATE
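A short Python sketch of both inflation calculations, using the slide's CPI figures (3,500 and 4,500). The function names and the $10.00 item price are hypothetical, added only to show the adjustment formula at work.

```python
def adjust_for_inflation(price_then, cpi_then, cpi_now):
    """(CPI now / CPI then) * price then = price in today's dollars."""
    if cpi_then == 0:
        raise ValueError("the earlier CPI must not be 0")
    return cpi_now / cpi_then * price_then


def inflation_rate(cpi_then, cpi_now):
    """Percent change in CPI between two periods."""
    if cpi_then == 0:
        raise ValueError("the earlier CPI must not be 0")
    return (cpi_now - cpi_then) / cpi_then * 100


# CPI figures from the slide
print(round(inflation_rate(3500, 4500), 2))  # 28.57
# Hypothetical $10.00 item priced in the earlier year
print(round(adjust_for_inflation(10.00, 3500, 4500), 2))  # 12.86
```

The inflation rate is simply the percent-of-change formula applied to the CPI, so the same checks apply: keep "then" and "now" in the right order.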
  • Data In → Analysis → Info Out
    • Online tools
      • Google Docs Spreadsheets
      • Google Refine
      • Freebase
      • Google Fusion Tables
    Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase . Fusion Tables: a service for managing large collections of tabular data in the cloud. You can upload tables of up to 100MB and share them with collaborators, or make them public. Freebase is an open, Creative Commons licensed repository of structured data of almost 20 million entities . An entity is a single person, place, or thing. Freebase connects entities together as a graph .
  • Data In → Analysis → Info Out
    • Notes
    • Text
    • Numeric
    • Images
    • Charts/Graphs
    • Maps
    • Audio
    • Video
    • Atoms → Bits. How?
    • What are we looking for? How can we be surprised?
    • Source
    • Definition
    • Context
    • Estimating
    • Counting
    • Statistical
    • Geostatistical
    • Social Network Analysis
    • Forensic accounting
    • Broadcast
    • Web
    • Audio
    • Video
    • Text
    • Data visualization
    • Maps
    • Dynamic databases
    • Archives
  • “Analytic tools” also for story-telling
    • Spreadsheets:
      • Tables, charts, infographics
    • Database programs
      • Charts, graphs, data tables
    • Stats programs ( SPSS or SAS or R )
      • Generate graphics
    • Social network analytic graphics
    • GIS
  • “Analytic tools” also for story-telling
      • Many Eyes: http://www-958.ibm.com/software/data/cognos/manyeyes/
      • Timelines:
        • Sarah Cohen's Timeflow https://github.com/FlowingMedia/TimeFlow/wiki/
        • xTimeline ( http://www.xtimeline.com/timeline/JTJ-Newspaper-History )
  • Tomorrow?
    • Our job is to “monitor the centres of power.” -- Amira Hass
    The document is not the data
  • DATA: Now I’ve got it; what do I do with it? Tom Johnson Managing Director Inst. for Analytic Journalism Santa Fe, New Mexico USA t o m @ j t j o h n s o n . c o m Thank you, everyone (Gracias a todos)
  • “Data In” questions (Data In → Analysis → Info Out)
    • Notes
    • Text
    • Numeric
    • Images
    • Charts/Graphs
    • Maps
    • Audio
    • Video
    • #1 – Keep a logbook (Try using Notesync.com)
    • Qualitative and/or Quantitative?
    • Objective: strive to get the data in the most fine-grained and original form.
      • Online data is rarely complete or totally accurate
    • Where is the data? In what format? I-o-P? Original digital file type(s)?
  • “Data In” questions (Data In → Analysis → Info Out)
    • Notes
    • Text
    • Numeric
    • Images
    • Charts/Graphs
    • Maps
    • Audio
    • Video
    • #1 – Keep a logbook (Try using Notesync.com)
    • Who created the data? Why? How? Legal catalysts for creation? If so, what do they say?
    • Have definitions and collection process changed?
    • Who could review and edit the data? What was/is the vetting process to ensure accuracy?
    • Who has analyzed the data? For what purpose and with what methods?
    • Data In
    • Analysis
    • Info Out
  • “Analysis” phase (Data In → Analysis → Info Out)
    • Notes
    • Text
    • Numeric
    • Images
    • Charts/Graphs
    • Maps
    • Audio
    • Video
    • Atoms → Bits. How? Who?
    • What are we looking for? How can we be surprised?
    • Previous/parallel investigations? (Start with IRE site stories and tipsheets)
    • Context, i.e. past environment(s) and changes? Trends past and future?
    • Quantitative and Qualitative methods?
    • Data cleaning tools?
  • “Analysis” phase (Data In → Analysis → Info Out)
    • Notes
    • Text
    • Numeric
    • Images
    • Charts/Graphs
    • Maps
    • Audio
    • Video
    • Atoms → Bits. How? Who?
    • Measurement of phenomena
      • Strength of relationships
      • Change
    • Estimating
    • Counting
    • Statistical
    • Geostatistical
    • Social Network Analysis
    • Forensic accounting
    • Who’s your rabbi?