Big, Ugly Datasets for Thumb-Fingered Journalists
@nclarkjudd, thumb-fingered journalist
We’re swimming in data
  • Open Graph
  • Social Media
  • Data Mining
  • Government Data
It’s not getting easier to use … With exceptions, like TimeFlow
This is where we come in
There’s an increasing need for journalists at all levels to be equipped to acquire and analyze big, ugly datasets. Without the resources of a New York Times or Washington Post, how do you do that?
What are you doing with data?
  • Exploring: looking for patterns, following hunches, finding context and background; looking to be surprised
  • Deducing: proving a hypothesis, pulling specific records; looking for something in particular
Know the right questions to ask
When you’re picking a dataset to use, understand its:
  • Provenance
  • Sampling Method
  • Quality
  • Completeness
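One way to get a first read on completeness is to count blank-looking cells per column before doing anything else. The sketch below is only an illustration: the sample data, the column names, and the list of tokens treated as "blank" are all assumptions, not part of the original slides.

```python
import csv
import io

def profile_completeness(rows, blank_tokens=("", "NA", "N/A", "NULL")):
    """Return (total_rows, {column: number of blank-looking cells})."""
    missing = {}
    total = 0
    for row in rows:
        total += 1
        for column, value in row.items():
            if column is None:
                # Extra cells beyond the header row; ignored in this sketch.
                continue
            is_blank = value is None or value.strip().upper() in blank_tokens
            missing[column] = missing.get(column, 0) + (1 if is_blank else 0)
    return total, missing

# Hypothetical sample standing in for a downloaded file.
SAMPLE = """agency,year,amount
Parks,2010,1200
Parks,,N/A
Transit,2010,
"""

total, missing = profile_completeness(csv.DictReader(io.StringIO(SAMPLE)))
for column, count in missing.items():
    print(f"{column}: {count} of {total} rows blank")
```

A column that is mostly blank is a question for whoever published the data, not something to fix silently.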
Data Workflow
  • Understand your needs
  • Acquire your data (Download, FOIL, Sources)
  • Clean your data
  • Load it into a Relational Database Management System (RDBMS)
  • Analyze what you’ve got
  • Output relevant segments for visualization
Cleaning Your Data
  • Use a script or a robust text editor like vi
  • It’s difficult. It takes a while. It gets done.
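As a rough idea of what a cleaning script might look like, the sketch below trims stray whitespace, strips currency formatting, and drops rows that become exact duplicates once cleaned. The column names (`agency`, `amount`) and the sample rows are hypothetical; real datasets will need their own rules.

```python
import csv
import io

def clean_rows(raw_text):
    """Yield cleaned rows: trimmed cells, plain-digit amounts, no exact duplicates."""
    seen = set()
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Normalize header names and collapse runs of whitespace inside cells.
        cleaned = {key.strip().lower(): " ".join(value.split())
                   for key, value in row.items()}
        # Assumed column: "amount" holds figures such as "$1,200".
        cleaned["amount"] = cleaned["amount"].replace("$", "").replace(",", "")
        fingerprint = tuple(cleaned.values())
        if fingerprint in seen:
            continue  # same record after cleaning, keep only the first copy
        seen.add(fingerprint)
        yield cleaned

# Hypothetical messy input.
RAW = '''Agency , Amount
"Parks  Dept","$1,200"
"Parks Dept","$1,200"
" Transit","300"
'''

cleaned = list(clean_rows(RAW))
for row in cleaned:
    print(row)
```

Keeping the raw file untouched and writing cleaned output elsewhere makes it possible to rerun the script when a rule turns out to be wrong.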
Load your data
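The slides do not name a particular RDBMS. As a minimal sketch, assuming SQLite (bundled with Python) stands in for it, loading a cleaned CSV could look like this. The table name, columns, and sample rows are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical cleaned data.
CSV_TEXT = """agency,year,amount
Parks,2010,1200
Parks,2011,1350
Transit,2010,300
"""

def load_csv(conn, csv_text):
    """Create the table and insert every CSV row; return how many rows were read."""
    conn.execute(
        "CREATE TABLE spending ("
        "agency TEXT NOT NULL, year INTEGER NOT NULL, amount INTEGER NOT NULL)"
    )
    rows = [
        (r["agency"], int(r["year"]), int(r["amount"]))  # int() fails loudly on dirty cells
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO spending VALUES (?, ?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
loaded = load_csv(conn, CSV_TEXT)
in_table = conn.execute("SELECT COUNT(*) FROM spending").fetchone()[0]
print(f"read {loaded} rows, table holds {in_table}")
```

Comparing the number of rows read with the number in the table is a cheap first check that nothing was dropped on the way in.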
Fail and Iterate
  • Again: It probably won’t work the first time.
  • It’s difficult. It takes a while. It gets done.
Analyze
  • Check your script. Did I write my query correctly?
  • Write queries multiple ways. Do the numbers add up the same when the RDBMS makes sums and when I do them?
  • Use checksums: Can I compare results from a segment of this data with previously published and vetted results? Are they the same?
  • Consult experts: Ask whether this means what I think it means. Do these results make sense?
  • Output smaller segments of your data to another tool such as Socrata or ManyEyes in order to generate graphs, tables, and visualizations
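The "write queries multiple ways" advice can be sketched as follows: let the database compute the sums, compute them again by hand from the raw rows, and confirm the two agree. SQLite and the `spending` table here are assumptions for illustration only.

```python
import sqlite3
from collections import defaultdict

# Hypothetical rows: (agency, year, amount).
ROWS = [("Parks", 2010, 1200), ("Parks", 2011, 1350), ("Transit", 2010, 300)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spending (agency TEXT, year INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO spending VALUES (?, ?, ?)", ROWS)

# Way 1: the RDBMS makes the sums.
sql_totals = dict(
    conn.execute("SELECT agency, SUM(amount) FROM spending GROUP BY agency")
)

# Way 2: pull the raw rows and add them up ourselves.
manual_totals = defaultdict(int)
for agency, amount in conn.execute("SELECT agency, amount FROM spending"):
    manual_totals[agency] += amount

# Way 3: a differently shaped query that should reach the same grand total.
grand_total = conn.execute("SELECT SUM(amount) FROM spending").fetchone()[0]

assert sql_totals == dict(manual_totals), "per-agency sums disagree"
assert grand_total == sum(sql_totals.values()), "grand total disagrees"
print(sql_totals, grand_total)
```

Agreement between the three only shows the queries are consistent with each other; comparing a segment against previously published figures is still the stronger test.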
Share
Photo: Britta Bohllinger / Flickr
IRE.org
HacksHackers.com

Assignment
You are an investigative team that does freelance work around the country and are working up a pitch for your next project.
  • Pick a subject matter you want to investigate
  • Identify a dataset or datasets that will help you formulate your story. For this exercise, only pick one available on the Web already, e.g. through Data.gov.
  • Plan:
      • What do you need to clean these data?
      • What schema will you make to house the dataset(s)?
      • What are you doing with this data: are you using it for exploratory or deductive reasoning?
      • What will your queries look like? Will you join multiple databases together? If so, how are you sure the results will be relevant?
      • How will you express the results of your inquiry?
      • What questions won’t the data answer that you want to address in your project? Who will you turn to as you start looking for these answers?
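On the question of whether joined results will be relevant, one useful habit is to count what the join leaves out. The sketch below, assuming SQLite and two hypothetical tables (`contracts` and `vendors`) linked by a made-up `vendor_id` key, lists the records that find no match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contracts (vendor_id INTEGER, amount INTEGER);
CREATE TABLE vendors (vendor_id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO contracts VALUES (1, 500), (2, 750), (9, 100);
INSERT INTO vendors VALUES (1, 'Acme'), (2, 'Globex');
""")

# Contracts whose vendor_id has no row in vendors (an anti-join).
unmatched = conn.execute("""
    SELECT c.vendor_id, c.amount
    FROM contracts c
    LEFT JOIN vendors v ON v.vendor_id = c.vendor_id
    WHERE v.vendor_id IS NULL
""").fetchall()

# Contracts that survive a plain inner join.
matched_count = conn.execute("""
    SELECT COUNT(*)
    FROM contracts c
    JOIN vendors v ON v.vendor_id = c.vendor_id
""").fetchone()[0]

print(f"{matched_count} contracts matched, unmatched: {unmatched}")
```

If a large share of records fall out of the join, the two datasets may not share a reliable key, and any totals built on the joined table would undercount.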