Working With Data and Humans


Published in: Technology, Education
  • I’m Dan O’Neil, and I run the Smart Chicago Collaborative, an organization devoted to improving lives in Chicago through technology. Among other things, I work with Chicago city government, developers, and community groups to use civic data in new and useful ways. As a co-founder of EveryBlock, I’m also a previous Knight News Challenge grantee. I certainly wouldn’t be doing any of this today if it weren’t for the vision of the Knight Foundation.
  • Let’s do a level-set
  • Explain EveryBlock
  • Data has certain characteristics
  • People have certain characteristics
  • Something to keep in mind: this data is generated and maintained by humans.
  • And if you use the default search for crime records, you get this screen. It has records going back to 2005. You fill out the form and you get your answers back. Pretty typical experience.
  • What you wouldn’t be able to tell, unless you searched the Dallas Police Web site more deeply, is this. The Dallas Police publishes an amazing cache of crime data in flat files. All of it, with no search, no letters or emails, going back 12 years. Why anyone would make any FOIA request, or why the Dallas Police would want anyone to do that, is beyond me. And this data has some of the most amazing crime details (the police narrative) that you can find in crime data anywhere. This is hidden in plain sight.
  • Lastly, I highly recommend the Data Journalism Handbook, which was created, in part, by many people in this room. It’s a really excellent resource.
  • Data is often more structured than you think. Over the weekend I participated in the Knight-Mozilla-MIT "Story & Algorithm" Hack Day run by Dan Sinker. I met a couple of Boston developers and we built a project I’ve had in mind for about 7 years. Like many of you here, I’m not smart enough to actually make things, so I have to rely on the kindness of developers. What we made was “Condition of Anonymity”, a Web site that automatically pulls the reason that anonymity was granted to an anonymous source by a reporter for the New York Times. We often think about data as the stuff inside spreadsheets and published in flat files to FTP servers, but there is a whole world of semi-structured data like this hidden in plain sight, inside plain text. We used the NYT Search API to review every article in the NYT back to January 1, 2000 for the phrase “condition of anonymity”, then used a natural language processing toolkit to find what I call the “because clauses”. There’s some gold in there. It takes an abundance of data types to tell a story. This story feels like a Walt Whitman poem to me.
  • The analysis is where it’s at. The most amazing insight I can share is that data is boring. I’ve had a long time to consider why that is true, and I think I have the answer. The reason is that people are boring. We forget that data is made by people. And most people are boring most of the time. Every object should have a page on the Internet (so let’s get to work).
  • Here’s kind of a master example. I live near this building. It had been empty for a very long time. Then construction started. The construction was heralded by a building permit. But, of course, the building permit was boring. So I looked further.
  • I searched ten different databases and, lo and behold, more data made it less boring. Why? Because almost all people are interesting some of the time. So if you look hard enough, you’ll find those stories. I found a business license for a 3-day pop-up store. So this place has been empty for decades, but was open for three days. And I missed it. It used to be a bank, and I found out from the NYT archive (in PDF format, the hidden Web) that there was a bank run at this location in 1937. Again, not boring.
  • This machine can be described as a generic context engine, to evenly distribute information and tell me what the information means. I know: that sounds like a “reporter”. But people used to think that “search engine” sounded a lot like “librarian”, too. We need humans and machines.
  • Find dataset. Review dataset. Describe what the data means. Find another dataset. Describe what the other dataset means. Describe what the first dataset means in the context of the second dataset. Repeat. Let’s do this thing.
  • Here’s an example of two things: finding data in unstructured text and finding interesting data. This is an Advanced Search in Google for the word “jimmied” in the Dallas crime data published by EveryBlock. So that site becomes a public, searchable instance of a previously hidden data set. Apparently police have used the word “jimmied” to describe an action taken by suspected criminals 2,430 times. All sorts of things are jimmied, apparently. It’s not boring.
  • We’ve noticed that custom applications created with dedicated budgets are reliably updated. Getting more connectivity between these established projects and the newer, open data projects is key.
  • Hi.
  • Working With Data and Humans
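
The “Condition of Anonymity” note above describes finding the phrase in article text and pulling out the “because clauses”. A minimal sketch of that idea follows. It is not the project’s code: the talk says a natural language processing toolkit was used, while this sketch swaps in a plain regular expression, and the sample sentences and the function name because_clauses are made up for illustration.

```python
import re

# Hypothetical sample sentences, invented for this sketch.
# The real project pulled article text through the NYT Search API.
SAMPLES = [
    "The official spoke on condition of anonymity because he was not "
    "authorized to discuss the talks.",
    "A senior aide, speaking on the condition of anonymity because the plan "
    "is not final, confirmed the move.",
    "She asked for anonymity.",
]

# Capture the text that follows "condition of anonymity because",
# stopping at the next comma, semicolon, or period.
BECAUSE_CLAUSE = re.compile(
    r"condition of anonymity\s+because\s+([^,.;]+)",
    re.IGNORECASE,
)


def because_clauses(texts):
    """Return the stated reason for anonymity from each text that gives one."""
    reasons = []
    for text in texts:
        for match in BECAUSE_CLAUSE.finditer(text):
            reasons.append(match.group(1).strip())
    return reasons


if __name__ == "__main__":
    for reason in because_clauses(SAMPLES):
        print(reason)
```

A regular expression only catches reasons that directly follow the phrase; a sentence that states the reason some other way would need the kind of language toolkit the talk mentions.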

    1. Working With Data and Humans Daniel X. O’Neil @juggernautco
    2. Me • Daniel X. O’Neil • Co-founder of EveryBlock • 2007 Knight News Challenge • Executive Director of Smart Chicago Collaborative • 2012 Knight Community Information Challenge
    3. The Data Revolution • I know about some, but not all of it • Since about 2005 • Working with the Mayor’s Office in Chicago • Then at EveryBlock, where I was responsible for data acquisition
    4. The Data Revolution • 8 Principles of Open Government Data • Independent Government Observers Task Force • POTUS Executive Orders on Inauguration Day • Apps contests • Municipal ordinances • Socrata • Code for America
    5. There’s Data and There’s Humans • Talk to me about your data and your humans in your projects
    6. Data • Dense • Sits by itself • Not social • Not self-aware • Unable to contextualize itself • Does not have any problems, because it doesn’t care about anything
    7. People • Naturally social • Soft • Have problems • See everything in context • Prone to mistakes
    8. People make data
    9. (no slide text)
    10. (no slide text)
    11. (no slide text)
    12. (no slide text)
    13. Value from data • Know more than anyone • Surfacing from the hidden Web • Context, context, context • Even if it is just one data set mashed against another data set • Did it rain * Did property crime go up or down • Foreclosures * Retail stores • Also: the simple act of aggregation + text
    14. (no slide text)
    15. Ten Databases • Building permits • Business licenses • Historic preservation list • Sanborn maps (1929 and 1950) • County assessor • County recorder of deeds • Original photography • Google search for news coverage • New York Times archive • Walgreens surplus property
    16. We need a machine. • A generic context engine • To evenly distribute information • And tell me what the information means • I know: that sounds like a “reporter” • But people used to think that “search engine” sounded a lot like “librarian”, too • We need humans and machines
    17. It’s easy. • Find dataset • Review dataset • Describe what the data means • Find another dataset • Describe what the other dataset means • Describe what the first dataset means in the context of the second dataset • Repeat • Let’s do this thing.
    18. (no slide text)
    19. Dedicated databases work
    20. Call any time. • @juggernautco • (773) 960-6045
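
Slides 13 and 17 describe getting context by reading one data set against another, for example “Did it rain * Did property crime go up or down”. A minimal sketch of that kind of mashup follows. Everything in it is an assumption for illustration: the dates, the rain flags, and the crime records are invented sample data, not real figures, and crimes_per_day_by_weather is a hypothetical helper name.

```python
from collections import defaultdict

# Invented sample data for illustration only; not real weather or crime records.
RAIN_BY_DATE = {
    "2012-06-01": True,
    "2012-06-02": False,
    "2012-06-03": False,
    "2012-06-04": True,
}
CRIMES = [
    ("2012-06-01", "burglary"),
    ("2012-06-02", "burglary"),
    ("2012-06-02", "theft"),
    ("2012-06-03", "theft"),
    ("2012-06-03", "burglary"),
    ("2012-06-03", "theft"),
    ("2012-06-04", "theft"),
]


def crimes_per_day_by_weather(rain_by_date, crimes):
    """Average number of crime records per day, split into rainy and dry days."""
    # Step 1: describe the first data set on its own (crimes per date).
    counts = defaultdict(int)
    for date, _crime_type in crimes:
        counts[date] += 1

    # Step 2: read it in the context of the second data set (did it rain?).
    totals = {True: 0, False: 0}
    days = {True: 0, False: 0}
    for date, rained in rain_by_date.items():
        totals[rained] += counts.get(date, 0)
        days[rained] += 1

    return {
        ("rain" if rained else "dry"): totals[rained] / days[rained]
        for rained in (True, False)
        if days[rained]
    }


if __name__ == "__main__":
    print(crimes_per_day_by_weather(RAIN_BY_DATE, CRIMES))
```

The join key here is the date; the same shape works for the other pairing on the slide (foreclosures against retail stores) if both data sets share a key such as a block or ZIP code.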