Data data data Session III
I. What do we mean when we talk about data and where does data come from? II. What is data science? III. Where do you find...
What do we mean when we talk about data? Session III
We ask many many questions about the world around us.
To answer those questions accurately requires a body of data and a set of tools to perform analyses that will lead us towa...
“The goal is to transform data into information, and information into insight” Carly Fiorina
 
Data are the raw facts about our world
 
 
Thomas Nylen & Andrew Fountain (PSU), NASA, NSF
A lot of this data is available for you to use
Where does the data come from? Government http://www.data.gov/ http://data.gov.uk/ http://www.census.gov/ Scientific rese...
Direct data
Direct data Government http://www.data.gov/ http://data.gov.uk/ http://www.census.gov/ Scientific research http://www.nc...
photo by solarnu on Flickr
Indirect data
Indirect data Government http://www.data.gov/ http://data.gov.uk/ http://www.census.gov/ Scientific research http://www.n...
Source: Google Flu Trends (http://www.google.org/flutrends)
There has been an almost incomprehensible growth in digital data
5 Megabytes - a high resolution photo; 5 Gigabytes - just more than a common DVD stores ...
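The size ladder on this slide ends with 1,180,591,620,717,411,303,424 bytes, or 2^70. A minimal sketch of that arithmetic follows, assuming binary units where each step up is 1024 times the previous one; the function name and the photo comparison are illustrative only.

```python
# Binary unit ladder: each step is 2**10 = 1024 times the previous one.
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte"]


def bytes_in(unit):
    """Return the number of bytes in one binary unit (1 kilobyte = 2**10 bytes)."""
    power = 10 * (UNITS.index(unit) + 1)
    return 2 ** power


# The last line of the slide: one zettabyte written out in full.
zettabyte = bytes_in("zettabyte")
print(f"{zettabyte:,}")  # 1,180,591,620,717,411,303,424

# Illustrative comparison: whole 5 megabyte photos that fit on a 1 terabyte drive.
photos_per_terabyte = bytes_in("terabyte") // (5 * bytes_in("megabyte"))
print(photos_per_terabyte)
```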
Okay, so there is a lot of data!
Is all this data free to use?
No
What you can do with data is largely dictated by license and copyright
Because of this, many organizations have begun advocating and practicing 'open data'
What is open data?
Free and unrestricted access to data
http://www.youtube.com/watch?v=3YcZ3Zqk0a8
Kepler Data
'...survey a portion of our region of the Milky Way galaxy to discover dozens of Earth-size planets in or near the habitab...
Launch a telescope into space for orbit around the sun Collect data from thousands of stars Spend 3 months trying to detec...
Eric.Nielsen.Photos on Flickr
 
Different but related ideas: Open Government (data.gov, data.gov.uk); Open Access ...
What is data science? Part II
Data analysis is a body of methods that help to describe facts, detect patterns, develop explanations, and test hypotheses...
“The goal is to transform data into information, and information into insight” Carly Fiorina
It is a set of skills performed often but not exclusively by scientists
The availability of data on the internet is making data analysis accessible to anyone
http://www.youtube.com/watch?v=PnpGIgzNBJo&feature=player_embedded
Part III Where do you find the data you need?
My community has a particular set of data that we rely on very often
 
 
 
 
During our class, to find the data you need...
Search First
Open government data, Large published datasets such as Wikipedia, Flickr, or public FusionTables, Natural sciences resourc...
See the growing list of datasets our class will uncover https://github.com/andrewxhill/DMID/wiki/Datasets
Now you have found it, how do you get it?
Maybe you can download it directly from the website. Pay attention to what formats you find, some will be easier for you t...
Sometimes, APIs, or application programming interfaces are available. These are probably for people with a bit more progra...
 
Scraping Generally the hardest method, as it means programmatically pulling data from sources not necessarily designed to ...
 
Linked data This is an evolution of both scraping and APIs, where many web resources are now designed to be both human rea...
 
Remember that all of these data sources have different formats and potential sources of error, we will have a full session...
Remember that data comes in many shapes and sizes, be aware of what you find General formats - png, jpg, xls, doc Categori...
XLS VS CSV
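The speaker notes for this slide say CSV is compact and easily parsable outside Excel. A minimal sketch of what that looks like in Python follows. The sample rows reuse the country figures from the join slide, and the column names `country` and `score` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample data; a real file would be opened with open(path, newline="").
SAMPLE_CSV = """country,score
Afghanistan,3
Bolivia,2
Guyana,1
Palau,4
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(SAMPLE_CSV)

# Sort by score, the kind of step you might otherwise do by hand in Excel.
ranked = sorted(rows, key=lambda row: int(row["score"]), reverse=True)
for row in ranked:
    print(row["country"], row["score"])
```

Every value comes back as a string, so numeric columns need an explicit conversion such as `int(...)` before sorting or arithmetic.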
How can we join datasets together?
Afghanistan 3, Bolivia 2, Guyana 1, Palau 4; GUY 1.03, BOL 1.34, PLW 0, AFG 19.03
Afghanistan 3 Bolivia 2 Guyana 1 Palau 4 GUY 1.03 BOL 1.34 PLW 0.20 AFG 1.93 Afghanistan AFG Bolivia BOL Guyana GUY Palau ...
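One way to read these slides: one dataset is keyed by country name, the other by three-letter code, and a lookup table links the two keys. A rough sketch of that join follows, using the first set of values shown. The slides do not say what the numbers measure, so the field names `count` and `value` are placeholders, and the lookup table assumes ISO 3166-1 alpha-3 codes (PLW for Palau).

```python
# Values copied from the slide; the column meanings are not stated there,
# so "count" and "value" are placeholder names.
by_name = {"Afghanistan": 3, "Bolivia": 2, "Guyana": 1, "Palau": 4}
by_code = {"GUY": 1.03, "BOL": 1.34, "PLW": 0, "AFG": 19.03}

# Lookup table linking the two keys (assumed ISO 3166-1 alpha-3 codes).
name_to_code = {
    "Afghanistan": "AFG",
    "Bolivia": "BOL",
    "Guyana": "GUY",
    "Palau": "PLW",
}


def join(by_name, by_code, name_to_code):
    """Combine the two datasets into one row per country, skipping unmatched keys."""
    joined = []
    for name, count in by_name.items():
        code = name_to_code.get(name)
        if code is None or code not in by_code:
            continue  # no shared key, so this row cannot be joined
        joined.append(
            {"country": name, "code": code, "count": count, "value": by_code[code]}
        )
    return joined


for row in join(by_name, by_code, name_to_code):
    print(row)
```

Rows without a match are dropped here, which mirrors an inner join; keeping them with a missing value would be an equally reasonable choice.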
If you start working with data, really interesting things can appear.
 
 
  GBIF.org
from Eric Fisher on Flickr

Data, data, data


Session 3 for DMID

Published in: Education, Technology, Design
  • Humans are naturally curious about our world. In addition, we have social, economic, and personal motivations to understand how and why the world around us changes.
  • Carly Fiorina, former CEO of HP
  • In 1993, David Vaughan (British Antarctic Survey) predicted breaking in 30 years. In 2008 he conceded that his estimates had been too conservative.
  • Automated weather station, Lake Vida, Antarctica. 19 meters of ice, 2500 years.
  • Genbank
  • Not everyone has the means to take this data and study it in a sophisticated analysis, but a lot of people are interested in space, astronomy, and our universe. So how could Kepler ensure that these people could help them and have fun?
  • You will not likely encounter this anytime over the next couple of weeks, but it is best to be aware of it.
  • XLS might be easier for you to navigate: open it in Excel, sort columns, search for what you want. But CSV will almost always be easier to use anyplace other than Excel: smaller, more compact, and easily parsable.
  • This is not the linked data
  • ISO - International Organization for Standardization
  • Data, data, data

    1. 1. Data data data Session III
    2. 2. I. What do we mean when we talk about data and where does data come from? II. What is data science? III. Where do you find the data you need? IV. Science Module: Niche Modelling
    3. 3. What do we mean when we talk about data? Session III
    4. 4. We ask many many questions about the world around us.
    5. 5. To answer those questions accurately requires a body of data and a set of tools to perform analyses that will lead us toward an answer
    6. 6. “The goal is to transform data into information, and information into insight” Carly Fiorina
    7. 8. Data are the raw facts about our world
    8. 11. Thomas Nylen & Andrew Fountain (PSU), NASA, NSF
    9. 12. A lot of this data is available for you to use
    10. 13. Where does the data come from? Government http://www.data.gov/ http://data.gov.uk/ http://www.census.gov/ Scientific research http://www.ncbi.nlm.nih.gov/genbank/ http://www.gbif.org/ http://earthengine.googlelabs.com Semi-automated and large scale collections http://eospso.gsfc.nasa.gov/ http://www.airquality.co.uk/autoinfo.php http://www.statistics.gov.uk/ For profit http://www.flickr.com/ http://www.google.com/trends http://www.facebook.com/data Citizens http://www.wikipedia.org/ http://protectedplanet.net/ http://www.openstreetmap.org/
    11. 14. Direct data
    12. 15. Direct data Government http://www.data.gov/ http://data.gov.uk/ http://www.census.gov/ Scientific research http://www.ncbi.nlm.nih.gov/genbank/ http://www.gbif.org/ http://earthengine.googlelabs.com Semi-automated and large scale collections http://eospso.gsfc.nasa.gov/ http://www.airquality.co.uk/autoinfo.php http://www.statistics.gov.uk/ For profit http://www.flickr.com/ http://www.google.com/trends http://www.facebook.com/data Citizens http://www.wikipedia.org/ http://protectedplanet.net/ http://www.openstreetmap.org/
    13. 16. photo by solarnu on Flickr
    14. 17. Indirect data
    15. 18. Indirect data Government http://www.data.gov/ http://data.gov.uk/ http://www.census.gov/ Scientific research http://www.ncbi.nlm.nih.gov/genbank/ http://www.gbif.org/ http://earthengine.googlelabs.com Semi-automated and large scale collections http://eospso.gsfc.nasa.gov/ http://www.airquality.co.uk/autoinfo.php http://www.statistics.gov.uk/ For profit http://www.flickr.com/ http://www.google.com/trends http://www.facebook.com/data Citizens http://www.wikipedia.org/ http://protectedplanet.net/ http://www.openstreetmap.org/
    16. 19. Source: Google Flu Trends (http://www.google.org/flutrends)
    17. 20. There has been an almost incomprehensible growth in digital data
    18. 22. 5 Megabytes - a high resolution photo; 5 Gigabytes - just more than a common DVD stores; 1 Terabyte - size of a common home computer hard drive; 15 Petabytes - data produced at CERN each year; 5 Exabytes ~ every word spoken by humans; 1.2 Zettabytes ~ Digital Universe in 2010; 1,180,591,620,717,411,303,424 bytes, or 2^70
    19. 23. Okay, so there is a lot of data!
    20. 24. Is all this data free to use?
    21. 25. No
    22. 26. What you can do with data is largely dictated by license and copyright
    23. 27. Because of this, many organizations have begun advocating and practicing 'open data'
    24. 28. What is open data?
    25. 29. Free and unrestricted access to data
    26. 30. http://www.youtube.com/watch?v=3YcZ3Zqk0a8
    27. 31. Kepler Data
    28. 32. '...survey a portion of our region of the Milky Way galaxy to discover dozens of Earth-size planets in or near the habitable zone and determine how many of the billions of stars in our galaxy have such planets...'
    29. 33. Launch a telescope into space for orbit around the sun. Collect data from thousands of stars. Spend 3 months trying to detect the presence of planets in orbit around these stars. RELEASE THE DATA FOR USE BY ANYONE!!!
    30. 34. Eric.Nielsen.Photos on Flickr
    31. 36. Different but related ideas: Open Government (data.gov, data.gov.uk); Open Access (plos.org); Open Source (Apache, Firefox)
    32. 37. What is data science? Part II
    33. 38. Data analysis is a body of methods that help to describe facts, detect patterns, develop explanations, and test hypotheses. It is used in all of the sciences. It is used in business, in administration, and in policy. Levine, 1997,  Introduction to Data Analysis: The Rules of Evidence
    34. 39. “The goal is to transform data into information, and information into insight” Carly Fiorina
    35. 40. It is a set of skills performed often but not exclusively by scientists
    36. 41. The availability of data on the internet is making data analysis accessible to anyone
    37. 42. http://www.youtube.com/watch?v=PnpGIgzNBJo&feature=player_embedded
    38. 43. Part III Where do you find the data you need?
    39. 44. My community has a particular set of data that we rely on very often
    40. 49. During our class, to find the data you need...
    41. 50. Search First
    42. 51. Open government data, Large published datasets such as Wikipedia, Flickr, or public FusionTables, Natural sciences resources, Weather, Atmosphere and Geographic datasets
    43. 52. See the growing list of datasets our class will uncover https://github.com/andrewxhill/DMID/wiki/Datasets
    44. 53. Now you have found it, how do you get it?
    45. 54. Maybe you can download it directly from the website. Pay attention to what formats you find; some will be easier for you to use than others.
    46. 55. Sometimes APIs, or application programming interfaces, are available. These are probably for people with a bit more programming experience, but if you need the data, ask the instructors and we might be able to help.
    47. 57. Scraping: generally the hardest method, as it means programmatically pulling data from sources that were not necessarily designed to have data pulled from them.
    48. 59. Linked data: an evolution of both scraping and APIs, where many web resources are now designed to be both human readable and programmatically navigable.
    49. 61. Remember that all of these data sources have different formats and potential sources of error. We will have a full session on data preparation, cleaning, and analysis.
    50. 62. Remember that data comes in many shapes and sizes, so be aware of what you find. General formats: png, jpg, xls, doc. Categorical and storage: csv, sql. Geographic: shp, tif, asc.
    51. 63. XLS VS CSV
    52. 64. How can we join datasets together?
    53. 65. Afghanistan 3, Bolivia 2, Guyana 1, Palau 4; GUY 1.03, BOL 1.34, PLW 0, AFG 19.03
    54. 66. Afghanistan 3, Bolivia 2, Guyana 1, Palau 4; GUY 1.03, BOL 1.34, PLW 0.20, AFG 1.93; Afghanistan AFG, Bolivia BOL, Guyana GUY, Palau PLW
    55. 67. If you start working with data, really interesting things can appear.
    56. 70.   GBIF.org
    57. 71. from Eric Fisher on Flickr