3. Challenges of Big Data
• Overwhelming
• Difficult to sort through to find something
meaningful
• Hard to manage
4. Examples of Big data
• http://www.coopercenter.org/demographics/Racial-Dot-Map
• http://internet-map.net/
5. Examples of Big data
www.google.com/trends/
- FIFA world cup
- Beyonce
- Potatoes
- VHS
6. Big Data: What is the Big deal?
Google grew from processing 100 TB of data a
day in 2004 to 20 PB a day in 2008
We are producing more data than we are able to
store or analyze
Economist, 2010
8. Big Data: What is the Big deal?
“Focusing on one individual at a time, we can provide better
reminders, search results, and advertisements by considering
all the locations the person is likely to be close to in the future
(e.g., “Need a haircut? In 4 days, you will be within 100 meters
of a salon that will have a $5 special at that time.”)”
9. Big Data: What is the Big deal?
Enable scientific breakthroughs
- Large Hadron Collider
- Sloan Sky Survey
- Genomics
- Climate data
12. Big data for ecology
• Ecologists produce large amounts of data, but
it needs to be compiled
• Ecologists must treat data as products, just
like publications
• Archive & share -> data repositories
23. Example
Consortium of California herbaria – plant database
http://ucjeps.berkeley.edu/consortium/
CalAdapt – Climate commons
http://cal-adapt.org/data/tabular/
28. - Copy from internet
- Paste special, “as text”
- Delete everything except GPS and ID
- Re-label specimen to “id”
- Re-label “lat” and “lng”
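The same clean-up can be scripted instead of done by hand. The following is a minimal sketch, assuming the copied table is tab-separated and that its columns are named "specimen", "latitude" and "longitude". Those source column names, the sample rows, and the choice to drop rows without coordinates are all assumptions for illustration, not something the herbarium site guarantees.

```python
import csv
import io

# Hypothetical source column names mapped to the labels used in the exercise.
RENAME = {"specimen": "id", "latitude": "lat", "longitude": "lng"}

def clean_records(raw_text):
    """Keep only the ID and GPS columns, relabelled as id, lat and lng."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter="\t")
    cleaned = []
    for row in reader:
        record = {new: (row.get(old) or "").strip() for old, new in RENAME.items()}
        # Assumption: rows without coordinates cannot be mapped, so skip them.
        if record["lat"] and record["lng"]:
            cleaned.append(record)
    return cleaned

# Made-up sample rows, only to show the shape of the input.
sample = (
    "specimen\tcollector\tlatitude\tlongitude\n"
    "EX001\tA. Collector\t34.90\t-115.60\n"
    "EX002\tB. Collector\t\t\n"
)
rows = clean_records(sample)
print(rows)
```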
32. - Copy and paste
- Click away from data area
- Check settings to match below
33. - Copy and paste
- Click away from data area
- Check settings to match below
36. Model climate change
• Pick one GPS point, remove all the others
• Set time interval to daily and model to CCSM3
• Download data
• Plot temperatures from 1950 – 2099
• Will your species go extinct?
• Try other points
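Daily values over 150 years are hard to read on a plot, so it may help to average them by year first. The following is a minimal sketch, assuming the downloaded data has been read into (date, temperature) pairs with dates written as YYYY-MM-DD. That layout and the sample values are assumptions, not the actual download format.

```python
from collections import defaultdict

def annual_means(daily):
    """Average daily temperatures by year.

    daily: iterable of (date_string, temperature) pairs, where the
    date string starts with a four-digit year (assumed layout).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for date, temp in daily:
        year = int(date[:4])
        sums[year] += temp
        counts[year] += 1
    return {year: sums[year] / counts[year] for year in sorted(sums)}

# Made-up values, only to show the call.
daily = [("1950-01-01", 10.0), ("1950-01-02", 12.0), ("2099-12-31", 20.0)]
means = annual_means(daily)
print(means)  # {1950: 11.0, 2099: 20.0}
```

The resulting year-to-temperature mapping can then be plotted with any charting tool.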
Editor's Notes
In 2010, Google estimated that its search index held 100 million gigabytes of data. Every minute, 48 hours of video are uploaded to YouTube, over 100,000 tweets are sent, Flickr users add 3,125 new photographs, and more than 570 new websites are created.
Big data is great, but it comes with challenges. First, the sheer volume is overwhelming. Consequently, it is difficult to sort through. Thinking of the previous infographic, knowing that 48 hours of YouTube video are uploaded every minute isn't necessarily informative. How can we sort this data into something manageable? This leads to the last challenge: big data is hard to manage. Even if you have a question and know which data could answer it, how would you go about managing that data?
There is a field of science dedicated to organizing and processing extremely large amounts of data and conveying it in a simpler way. Here are two easy-to-understand visualizations built from exceptionally large amounts of data.
An industry leader in processing data is Google. Google analyzes exceptional amounts of data every second, and one visualization of it is Google Trends. This website shows the popularity of a search term over time and provides other statistics, including events that contributed to its popularity and the associated countries. Compare how trends increase and decrease over time, and discuss what may push a trend in a certain direction. Relate this to how the data would need to be collected and perpetually updated. Let the students explore this on their own.
This trend of analyzing data is increasing. In 2004, Google was processing 100 terabytes of data a day. By 2008 this had increased roughly 200-fold to 20 petabytes a day. Imagine the amount of data being processed today. Note: there are 1,024 terabytes in a petabyte (PB).
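As a check on the arithmetic, the growth from 100 TB a day to 20 PB a day can be computed directly. This is a minimal sketch; the function and constant names are illustrative.

```python
TB_PER_PB = 1024  # terabytes in a petabyte, as stated in the note

def fold_increase(start_tb, end_pb, tb_per_pb=TB_PER_PB):
    """Return how many times larger end_pb (petabytes) is than start_tb (terabytes)."""
    return end_pb * tb_per_pb / start_tb

growth = fold_increase(100, 20)
print(growth)  # 204.8
```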
The “Far Out” software claims to be able to predict your location years into the future, even if you don't know where you'll be. Far Out is the result of statistical research that looks at GPS data, learns your typical movements and then extrapolates to estimate your likely future location. The result, according to the team behind it, is a system that can make "highly accurate" predictions about where you'll be years down the line.
Knowing where you are was 2008. Knowing where you were going to be was last year. Now companies want to know not only where you are going to be, but also how to tailor what you will come across.
Beyond advertising and industry, big data can enable scientific breakthroughs. Examples include the Large Hadron Collider at CERN, the Sloan Sky Survey, genomics and climate data.
For ecology, big data sets include long-term experimental research and crowd-sourced data sets from the public, such as the Breeding Bird Survey. There are also climate measurements from weather stations and remote sensing from aerial photography.
Big data in ecology isn't always a single long-term dataset. There is already a great deal of existing data that can be compiled to answer new questions. Similar experiments running in parallel around the globe can address worldwide challenges. Ecologists produce large volumes of data, but do not compile it.
There are many different climate projects based on different areas.
A raster is a plane of data. A data point is a single spot in space. A line is two points with interpolated values in between, which means there are values along the entire length of the line. A raster goes one step further: it is a plane of data, like a piece of paper.
Imagine two weather stations in which one is hot and the other is cold. They both record temperatures continuously over time.
A raster generates interpolated values along the entire area in between the two weather stations from hot to cold.
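The two-station case can be sketched in a few lines. This is a minimal illustration that assumes simple linear interpolation and made-up temperatures; the notes do not say which interpolation method the climate products actually use.

```python
def interpolate_line(temp_a, temp_b, n_points):
    """Linearly interpolate temperatures at n_points evenly spaced
    positions from station A to station B, endpoints included."""
    if n_points < 2:
        raise ValueError("need at least the two station positions")
    step = (temp_b - temp_a) / (n_points - 1)
    return [temp_a + i * step for i in range(n_points)]

# Hypothetical readings: a hot station at 30.0 and a cold one at 10.0.
transect = interpolate_line(30.0, 10.0, 5)
print(transect)  # [30.0, 25.0, 20.0, 15.0, 10.0]
```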
Now extend this to many more weather stations on a global scale.
This generates a network of values from weather stations that are constantly recording.
That network is rasterized using interpolated values. With this raster, there is a temperature value for every point within the area.
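With many stations, one common way to fill in every cell of a grid is inverse distance weighting, where nearby stations count more than distant ones. The sketch below uses that method as an assumption; the notes do not name the interpolation method behind the climate rasters, and the station positions and temperatures are made up.

```python
import math

def idw(x, y, stations, power=2.0):
    """Estimate the temperature at (x, y) by inverse distance weighting.

    stations: list of (x, y, temperature) tuples.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, temp in stations:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return temp  # the cell sits exactly on a station
        weight = 1.0 / distance ** power
        weighted_sum += weight * temp
        weight_total += weight
    return weighted_sum / weight_total

def rasterize(stations, width, height):
    """Build a height x width grid with one interpolated value per cell."""
    return [[idw(col, row, stations) for col in range(width)]
            for row in range(height)]

# Hypothetical stations on a small 5 x 4 grid.
stations = [(0, 0, 30.0), (4, 0, 10.0), (2, 3, 20.0)]
raster = rasterize(stations, 5, 4)
for row in raster:
    print(" ".join(f"{t:5.1f}" for t in row))
```

Every cell now holds a value, which is what makes the raster behave like a continuous sheet rather than a set of isolated points.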
Big data can also be used to map species distributions, and the data can be publicly generated. For instance, here is Calflora, where anyone can record the occurrence of a plant species at a location in California. This data is constantly uploaded and generates maps of where each species can be found. Compare the two maps: the desert native plant species is found mostly in the Mojave region, while the invasive thistle dominates in the non-desert areas.
These species distributions can then be mapped onto climate data for the same area. With this information, we can make inferences about the species and predict where it may occur.
California is advanced in managing and compiling data, including climate data and species distributions. We are going to use the Consortium of California Herbaria, a publicly contributed database of plant occurrences over the last 50 years. This data is publicly available and contains a fair amount of information beyond the occurrence itself. CalAdapt is a climate database that uses past weather station records to predict future climate scenarios. Our exercise is to model the distribution of a plant species under future climate projections.
This shows the thermal niche of Plantago insularis: above 25 degrees and below 11 degrees, the likelihood of occurrence decreases significantly.
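The niche limits can be turned into a simple check against projected temperatures. This is a rough sketch: it treats 11 and 25 degrees as hard cut-offs, whereas the notes only say that the likelihood of occurrence drops beyond them, and the projected values are made up for illustration.

```python
NICHE_MIN = 11.0  # lower limit from the notes
NICHE_MAX = 25.0  # upper limit from the notes

def within_niche(temp, low=NICHE_MIN, high=NICHE_MAX):
    """True if the temperature falls inside the thermal niche."""
    return low <= temp <= high

def first_year_outside(annual_temps):
    """Return the earliest year whose temperature leaves the niche, or None."""
    for year in sorted(annual_temps):
        if not within_niche(annual_temps[year]):
            return year
    return None

# Hypothetical projection for one GPS point, not real model output.
projection = {1950: 18.0, 2000: 20.5, 2050: 23.9, 2099: 26.2}
print(first_year_outside(projection))  # 2099
```

A result of None would mean the point stays inside the niche for the whole period, which is one way to frame the question of whether the species persists at that location.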