Big data lab
BIOL2050
Challenges of Big Data
• Overwhelming
• Difficult to sort through to find something
meaningful
• Hard to manage
Examples of Big data
• http://www.coopercenter.org/demographics/Racial-Dot-Map
• http://internet-map.net/
Examples of Big data
www.google.com/trends/
- FIFA world cup
- Beyonce
- Potatoes
- VHS
Big Data: What is the Big deal?
Google grew from processing 100 TB of data a
day in 2004 to 20 PB a day in 2008
We are producing more data than we are able to
store or analyze
Economist, 2010
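
As a quick check on the scale of that growth, the sketch below works out the fold increase between the two figures. It assumes binary units (1 PB = 1024 TB); the variable names are illustrative only.

```python
# Daily data volumes quoted on the slide (Economist, 2010).
TB_PER_PB = 1024  # assumption: binary units, 1 PB = 1024 TB

daily_2004_tb = 100              # 100 TB per day in 2004
daily_2008_tb = 20 * TB_PER_PB   # 20 PB per day in 2008, expressed in TB

fold_increase = daily_2008_tb / daily_2004_tb
print(f"2008 volume: {daily_2008_tb} TB per day")
print(f"Fold increase 2004 -> 2008: {fold_increase:.1f}")
```

Under these assumptions the increase comes out at roughly 200-fold over four years.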
Big Data: What is the Big deal?
Far out software
Big Data: What is the Big deal?
“Focusing on one individual at a time, we can provide better
reminders, search results, and advertisements by considering
all the locations the person is likely to be close to in the future
(e.g., “Need a haircut? In 4 days, you will be within 100 meters
of a salon that will have a $5 special at that time.”)”
Big Data: What is the Big deal?
Enable scientific breakthroughs
- Large Hadron Collider
- Sloan Sky Survey
- Genomics
- Climate data
Hampton et al., 2013
Big data for ecology
• Ecologists produce large amounts of data, but it
needs to be compiled
• Ecologists must treat data as products, just
like publications
• Archive & share -> data repositories
Big data modeling
exercise
Big Data for climate
Many different climate projects
- WorldClim
- CalClimate Commons
- NOAA
- European Climate Data
- Climate Data WMO
Climate data and rasters
Point < Line < Raster
Climate data and rasters
Weather station 1 Weather station 2
Climate data and rasters
Weather station 1 Weather station 2
Interpolated values
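
One way to picture how a raster fills in values between stations is the sketch below. It uses inverse distance weighting, which is an assumed interpolation method chosen for illustration (the slides do not say which method the climate products use). The station coordinates and temperatures are hypothetical.

```python
import math

def idw_temperature(x, y, stations, power=2.0):
    """Inverse-distance-weighted temperature at grid cell (x, y).

    stations is a list of (x, y, temperature) tuples.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, temp in stations:
        dist = math.hypot(x - sx, y - sy)
        if dist == 0.0:
            return temp  # the cell sits exactly on a station
        weight = 1.0 / dist ** power  # nearer stations count more
        weighted_sum += weight * temp
        weight_total += weight
    return weighted_sum / weight_total

def build_raster(width, height, stations):
    """Return a height x width grid of interpolated temperatures."""
    return [[idw_temperature(col, row, stations) for col in range(width)]
            for row in range(height)]

# Hypothetical stations: station 1 is hot, station 2 is cold.
stations = [(0, 0, 30.0), (4, 0, 10.0)]
raster = build_raster(5, 3, stations)
for row in raster:
    print(" ".join(f"{value:5.1f}" for value in row))
```

Every cell in the grid now carries a temperature, even though only two cells were actually measured.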
Climate data and rasters
Climate data and rasters
Big data & species distributions
Desert native
Chaenactis fremontii
Invasive thistle
Centaurea solstitialis
Climate & species distributions
Example
Consortium of California herbaria – plant database
http://ucjeps.berkeley.edu/consortium/
CalAdapt – Climate commons
http://cal-adapt.org/data/tabular/
Plantago insularis
- Copy from internet
- Paste special, “as text”
- Delete everything except GPS and ID
- Re-label specimen to “id”
- Re-label “lat” and “lng”
- Copy and paste
- Click away from data area
- Check settings to match below
- Copy and paste
- Click away from data area
- Check settings to match below
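
The spreadsheet clean-up steps above can also be sketched in code. This is a minimal sketch: the column names in the sample text are hypothetical and the real herbarium export may label them differently. Dropping records with no coordinates is an added assumption, not a step from the slides.

```python
import csv
import io

# Hypothetical tab-separated export; real column names may differ.
raw_text = """specimen\tcollector\tlatitude\tlongitude\tcounty
SPEC-001\tExample Collector\t34.10\t-116.30\tExample County
SPEC-002\tExample Collector\t\t\tExample County
"""

# Old column name -> new label expected by the next step.
RENAME = {"specimen": "id", "latitude": "lat", "longitude": "lng"}

def clean_records(text):
    """Keep only the ID and GPS columns, relabelled as id, lat, lng."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    cleaned = []
    for record in reader:
        row = {new: record[old].strip() for old, new in RENAME.items()}
        # Assumption: records without coordinates cannot be mapped, so skip them.
        if row["lat"] and row["lng"]:
            cleaned.append(row)
    return cleaned

rows = clean_records(raw_text)
for row in rows:
    print(row["id"], row["lat"], row["lng"])
```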
Model climate change
• Pick one GPS point, remove all the others
• Set time interval for daily, CCSM3
• Download data
• Plot temperatures from 1950 – 2099
• Will your species go extinct?
• Try other points
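
Before plotting, daily projections are easier to read once averaged by year. The sketch below shows one way to do that, assuming the download has been reduced to (date, temperature) pairs with ISO-style dates. The sample values are made up for illustration and are not CalAdapt output.

```python
from collections import defaultdict

def yearly_means(daily_records):
    """Average daily temperatures by year.

    daily_records is an iterable of (date_text, temperature) pairs,
    where date_text starts with a four-digit year, e.g. "1950-01-01".
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for date_text, temp in daily_records:
        year = int(date_text[:4])
        totals[year] += temp
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}

# Made-up sample values, two days per year.
sample = [
    ("1950-01-01", 12.0), ("1950-07-01", 24.0),
    ("2099-01-01", 15.0), ("2099-07-01", 29.0),
]
means = yearly_means(sample)
for year, mean_temp in means.items():
    print(year, f"{mean_temp:.1f}")
```

The resulting year-to-mean mapping can be pasted into a spreadsheet or passed to any plotting tool.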

Editor's Notes

  • #3 In 2010, Google estimated that their search index holds 100 million gigabytes of data. Every minute, 48 hours of video are uploaded to YouTube, over 100,000 tweets are sent, Flickr users add 3,125 new photographs, and more than 570 new websites are created.
  • #4 Big data is great, but there are some associated challenges. It is overwhelming in terms of how much there is. Consequently, it is difficult to sort through. Thinking of the previous infographic, having 48 hours of YouTube video isn’t necessarily informative. How can we better sort this data into something that is manageable? This leads to the last challenge: it is difficult to manage. Even if you have a question and know which data would answer it, how would you go about managing that data?
  • #5 There is a field of science dedicated to organizing and processing extremely large amounts of data and conveying it in a simpler way. Here are two easy-to-understand visualizations that use exceptionally large amounts of data.
  • #6 An industry leader in processing data is Google. Google analyzes exceptional amounts of data every second, and one visualization of it is Google Trends. This website shows the popularity of a search term over time and provides other statistics, including events that contributed to the popularity and the associated countries. Compare how trends increase and decrease over time, and discuss what may push a trend in a certain direction. Relate this to how the data would need to be collected and perpetually updated. Let the students explore this on their own.
  • #7 This trend of analyzing data is increasing. In 2004 Google was processing 100 terabytes of data a day. By 2008 this had increased roughly 200-fold to 20 petabytes a day. Imagine the amount of data being processed today. Note: there are 1024 terabytes in a petabyte (PB).
  • #8 “Far out” software claims to be able to predict your location years into the future - even if you don't know where you'll be. 'Far Out' is the result of statistical research that looks at GPS data, learns your typical movements and then extrapolates to decide on your likely future location. The result, according to the team behind it, is a system that can make "highly accurate" predictions about where you'll be years down the line.
  • #9 Knowing where you are was 2008. Knowing where you were going to be was last year. Now companies not only want to know where you are going to be, but how to tailor what you are going to come across.
  • #10 Beyond advertising and industry, big data can help with scientific breakthroughs. Examples include the particle accelerator at CERN, the Sloan Sky Survey, and genomics.
  • #11 For ecology, some big data sets include long-term experimental research and crowd-sourced data sets from the public, such as the breeding bird survey. There are also climate measurements from weather stations and remote sensing from aerial photography.
  • #12 Big data in ecology isn’t always single long-term datasets. There is already a great deal of existing data that can be compiled to answer new questions. Similar experiments occurring in tandem globally can answer world challenges. Ecologists produce large volumes of data, but do not compile it.
  • #15 There are many different climate projects based on different areas.
  • #16 A raster is a plane of data. If you have a data point, it is a single spot in space. A line is two points with interpolated values in between. This means that along the entirety of that line, there are values. A raster is one step further in that it is a plane of data like a piece of paper.
  • #17 Imagine two weather stations in which one is hot and the other is cold. They both record temperatures continuously over time.
  • #18 A raster generates interpolated values along the entire area in between the two weather stations from hot to cold.
  • #19 Now, extending this to many more weather stations on a global scale.
  • #20 It generates this network of values based on the weather stations constantly recording.
  • #21 That becomes rasterized based on interpolated values. With this raster, there is a temperature value for every point within this area.
  • #22 Big data can also be used to map species distributions, and the data can be publicly generated. For instance, here is Calflora, where anyone can record the occurrence of a plant species at a location in California. This data is constantly uploaded and generates maps of where the species can be found. Compare the two distributions: the desert native plant species is found mostly in the Mojave region, while the invasive thistle dominates in the non-desert areas.
  • #23 These species distributions can then be mapped onto climate data for that area. With this information we can make inferences about the species and where it may be predicted.
  • #24 California is advanced in terms of managing and compiling data, including climate and species distributions. We are going to use the Consortium of California Herbaria, a publicly contributed database of plant occurrences for the last 50 years. This data is publicly available and contains a fair amount of information other than just the occurrence. CalAdapt is a climate database that uses past weather station records to predict future climate scenarios. Our exercise is to model the distribution of a plant species under future climate projections.
  • #36 The thermal niche of Plantago insularis: above 25 degrees and below 11 degrees, the likelihood of occurrence decreases significantly.
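
The thermal niche in the last note can be turned into a rough screening check against projected temperatures. The sketch below treats the 11 and 25 degree limits as hard cut-offs, which is a simplification: the note only says the likelihood of occurrence decreases outside that range. The projected values are hypothetical.

```python
# Niche limits taken from the note on Plantago insularis.
NICHE_LOW = 11.0
NICHE_HIGH = 25.0

def within_niche(temp):
    """True if a temperature falls inside the thermal niche (inclusive)."""
    return NICHE_LOW <= temp <= NICHE_HIGH

def fraction_within_niche(temps):
    """Share of temperature values that fall inside the niche."""
    if not temps:
        raise ValueError("need at least one temperature value")
    return sum(within_niche(t) for t in temps) / len(temps)

# Hypothetical projected mean temperatures for one GPS point.
projected = [18.0, 22.0, 26.5, 27.0]
share = fraction_within_niche(projected)
print(f"{share:.0%} of projected values fall inside the niche")
```

A falling share over time at a given point suggests that location is becoming less suitable, which is the pattern to look for when trying other points.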