Data, data, data: Session III
I. What do we mean when we talk about data, and where does data come from?
II. What is data science?
III. Where do you find the data you need?
IV. Science Module: Niche Modelling
What do we mean when we talk about data? Session III
We ask many, many questions about the world around us.
Answering those questions accurately requires a body of data and a set of tools for analyses that lead us toward an answer.
“The goal is to transform data into information, and information into insight.” Carly Fiorina
 
Data are the raw facts about our world
 
 
Thomas Nylen & Andrew Fountain (PSU), NASA, NSF
A lot of this data is available for you to use
Where does the data come from?
Government: http://www.data.gov/ http://data.gov.uk/ http://www.census.gov/
Scientific research: http://www.ncbi.nlm.nih.gov/genbank/ http://www.gbif.org/ http://earthengine.googlelabs.com
Semi-automated and large scale collections: http://eospso.gsfc.nasa.gov/ http://www.airquality.co.uk/autoinfo.php http://www.statistics.gov.uk/
For profit: http://www.flickr.com/ http://www.google.com/trends http://www.facebook.com/data
Citizens: http://www.wikipedia.org/ http://protectedplanet.net/ http://www.openstreetmap.org/
Direct data
Photo by solarnu on Flickr
Indirect data
Source: Google Flu Trends (http://www.google.org/flutrends)
There has been an almost incomprehensible growth in digital data
 
5 Megabytes - a high-resolution photo
5 Gigabytes - just more than a common DVD stores
1 Terabyte - size of a common home computer hard drive
15 Petabytes - data produced at CERN each year
5 Exabytes ~ every word spoken by humans
1.2 Zettabytes ~ the Digital Universe in 2010
1,180,591,620,717,411,303,424 bytes, or 2^70
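The byte count above can be checked with a few lines of Python. This is a minimal sketch that assumes the binary convention the 2^70 figure implies, where each unit is 1024 times the one before it; the names `UNITS` and `to_bytes` are made up for illustration.

```python
# Binary convention: each step up multiplies by 2**10 = 1024.
UNITS = ["Bytes", "Kilobytes", "Megabytes", "Gigabytes",
         "Terabytes", "Petabytes", "Exabytes", "Zettabytes"]

def to_bytes(amount, unit):
    """Convert an amount in the given unit to a whole number of bytes."""
    exponent = 10 * UNITS.index(unit)  # Zettabytes -> 70
    return int(amount * 2 ** exponent)

one_zettabyte = to_bytes(1, "Zettabytes")
print(f"{one_zettabyte:,}")  # prints 1,180,591,620,717,411,303,424
```

Under the decimal convention (powers of 1000) a zettabyte is 10^21 bytes instead, so it is worth checking which convention a source uses.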
Okay, so there is a lot of data!
Is all this data free to use?
No
What you can do with data is largely dictated by license and copyright
Because of this, many organizations have begun advocating and practicing 'open data'
What is open data?
Free and unrestricted access to data
http://www.youtube.com/watch?v=3YcZ3Zqk0a8
Kepler Data
'...survey a portion of our region of the Milky Way galaxy to discover dozens of Earth-size planets in or near the habitable zone and determine how many of the billions of stars in our galaxy have such planets...'
Launch a telescope into space for orbit around the sun
Collect data from thousands of stars
Spend 3 months trying to detect the presence of planets in orbit around these stars
RELEASE THE DATA FOR USE BY ANYONE!!!
Eric.Nielsen.Photos on Flickr
 
Different but related ideas
Open Government: data.gov, data.gov.uk
Open Access: plos.org
Open Source: Apache, Firefox
What is data science? Part II
Data analysis is a body of methods that help to describe facts, detect patterns, develop explanations, and test hypotheses. It is used in all of the sciences. It is used in business, in administration, and in policy. Levine, 1997,  Introduction to Data Analysis: The Rules of Evidence
“The goal is to transform data into information, and information into insight.” Carly Fiorina
It is a set of skills practiced often, but not exclusively, by scientists
The availability of data on the internet is making data analysis accessible to anyone
http://www.youtube.com/watch?v=PnpGIgzNBJo&feature=player_embedded
Part III Where do you find the data you need?
My community has a particular set of data that we rely on very often
 
 
 
 
During our class, to find the data you need...
Search First
Open government data
Large published datasets such as Wikipedia, Flickr, or public Fusion Tables
Natural sciences resources
Weather, atmosphere, and geographic datasets
See the growing list of datasets our class will uncover: https://github.com/andrewxhill/DMID/wiki/Datasets
Now you have found it, how do you get it?
Maybe you can download it directly from the website. Pay attention to what formats you find; some will be easier for you to use than others.
Sometimes APIs (application programming interfaces) are available. These are probably for people with a bit more programming experience, but if you need the data, ask the instructors and we might be able to help.
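To give a feel for what using an API involves, here is a minimal sketch in Python. The endpoint `https://api.example.org/v1/records`, its `country` and `limit` parameters, and the shape of the JSON response are all hypothetical; a real API documents its own URL, parameters, and response format.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://api.example.org/v1/records"

def build_query_url(base_url, **params):
    """Return base_url with params appended as a URL-encoded query string."""
    return f"{base_url}?{urlencode(params)}"

def parse_records(payload):
    """Decode a JSON response body and return its list of records."""
    return json.loads(payload)["results"]

# Step 1: describe what you want as query parameters.
url = build_query_url(BASE_URL, country="BOL", limit=2)

# Step 2: the server answers with structured text, usually JSON.
# This string stands in for a real response; its shape is an assumption.
sample_payload = '{"results": [{"country": "BOL", "value": 1.34}]}'
records = parse_records(sample_payload)

print(url)
print(records)
```

In practice the URL would be fetched over the network (for example with `urllib.request.urlopen`) and the response body passed to `parse_records`.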
 
Scraping: generally the hardest method, as it means programmatically pulling data from sources that were not necessarily designed to have data pulled from them.
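As a rough illustration of scraping, the sketch below pulls the cells out of an HTML table using only Python's standard library. The HTML snippet and the class name `TableScraper` are made up for the example; real pages are messier and usually need more careful handling.

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # the row currently being read, if any
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

# A made-up page fragment standing in for a downloaded web page.
html = ("<table><tr><td>Afghanistan</td><td>3</td></tr>"
        "<tr><td>Bolivia</td><td>2</td></tr></table>")

scraper = TableScraper()
scraper.feed(html)
print(scraper.rows)
```

Always check a site's terms of use before scraping it; the license questions raised earlier apply here too.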
 
Linked data: an evolution of both scraping and APIs, where many web resources are now designed to be both human-readable and programmatically navigable.
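One common way linked data is published today is JSON-LD, where a record carries `@id` links to other resources. The sketch below is only an illustration: the document and its `example.org` URLs are invented, and `collect_links` is a hypothetical helper, not part of any library.

```python
import json

# An invented JSON-LD style record; the example.org URLs are placeholders.
document = json.loads("""
{
  "@context": "https://schema.org",
  "@type": "Country",
  "@id": "https://example.org/country/BOL",
  "name": "Bolivia",
  "containedInPlace": {"@id": "https://example.org/continent/SA"}
}
""")

def collect_links(node):
    """Walk a decoded JSON-LD document and return every @id it mentions."""
    links = []
    if isinstance(node, dict):
        if "@id" in node:
            links.append(node["@id"])
        for value in node.values():
            links.extend(collect_links(value))
    elif isinstance(node, list):
        for item in node:
            links.extend(collect_links(item))
    return links

print(collect_links(document))
```

Each `@id` is itself a URL a program could follow to find more data, which is what makes the resource navigable by machines as well as people.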
 
Remember that all of these data sources have different formats and potential sources of error. We will have a full session on data preparation, cleaning, and analysis.
Remember that data comes in many shapes and sizes; be aware of what you find.
General formats - png, jpg, xls, doc
Categorical and storage - csv, sql
Geographic - shp, tif, asc
XLS VS CSV
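One practical difference: CSV is plain text, so Python's standard library can read it directly, while XLS is a binary spreadsheet format that needs a spreadsheet program or a third-party library. A minimal sketch of reading CSV follows; the column names `country` and `value` are assumptions added for the example.

```python
import csv
import io

# CSV is just text: one record per line, fields separated by commas.
csv_text = "country,value\nAfghanistan,3\nBolivia,2\nGuyana,1\nPalau,4\n"

# io.StringIO lets the csv module read from a string as if it were a file.
reader = csv.DictReader(io.StringIO(csv_text))

# Every field arrives as text, so numbers must be converted explicitly.
rows = [{"country": r["country"], "value": int(r["value"])} for r in reader]

for row in rows:
    print(row["country"], row["value"])
```

With a real file, `open("data.csv", newline="")` would take the place of `io.StringIO(csv_text)`.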
How can we join datasets together?
Afghanistan 3
Bolivia 2
Guyana 1
Palau 4

GUY 1.03
BOL 1.34
PLW 0
AFG 19.03

Afghanistan 3
Bolivia 2
Guyana 1
Palau 4

GUY 1.03
BOL 1.34
PLW 0.20
AFG 1.93

Afghanistan AFG
Bolivia BOL
Guyana GUY
Palau PLW
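The join can be sketched in a few lines of Python: one table is keyed by country name, the other by three-letter code, and a lookup table connects the two. The values are taken from the slide, on the assumption that Palau's code is PLW; the variable and function names are made up for illustration.

```python
# First dataset, keyed by country name.
by_name = {"Afghanistan": 3, "Bolivia": 2, "Guyana": 1, "Palau": 4}

# Second dataset, keyed by three-letter country code.
by_code = {"GUY": 1.03, "BOL": 1.34, "PLW": 0.20, "AFG": 1.93}

# Lookup table that connects the two keys (Palau -> PLW is an assumption).
name_to_code = {"Afghanistan": "AFG", "Bolivia": "BOL",
                "Guyana": "GUY", "Palau": "PLW"}

def join_on_lookup(by_name, by_code, name_to_code):
    """Return (name, code, first value, second value) for every matching row."""
    joined = []
    for name, first_value in by_name.items():
        code = name_to_code.get(name)
        if code is None or code not in by_code:
            continue  # no match in the lookup or the second dataset: skip
        joined.append((name, code, first_value, by_code[code]))
    return joined

joined = join_on_lookup(by_name, by_code, name_to_code)
for row in joined:
    print(row)
```

A wrong entry in the lookup table silently attaches the wrong value to a country, so the lookup deserves as much checking as the data itself.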
If you start working with data, really interesting things can appear.
 
 
  GBIF.org
from Eric Fisher on Flickr
