Biodiversity data is extremely approachable – the concept of a specimen existing in time and place is clear to grasp and interesting to a wide range of people. I exploited this inherent feature of Biodiversity data to create an educational framework for teaching undergraduate Data Science. The project utilized Discovery Learning theory, based in the belief that it is best for learners to discover facts and relationships for themselves. Students were given a choice of databases and were mentored through an entire data analysis pipeline, including gathering, cleaning, analyzing, and visualization of the data. Their work culminates into a tutorial posted online (curiositydata.org) – instilling proper documentation, open science, and data management techniques. These tutorials can then be used and remixed as documentation for the databases, curriculum, and workshops detailing how to access and analyze the databases data. Increased documentation will overcome accessibility challenges that plague many Biodiversity databases, in an overarching aim to increase the usage and in turn value of these vital data resources. Computer Science, Statistics, and Biology undergraduates are increasingly “data literate”, and if mentored properly, we can foster a symbiotic relationship between the real-world Data Science education and the increase of usability of Biodiversity databases.
Designing a synergistic relationship between undergraduate Data Science education and usability of Biodiversity databases
1. Designing a Synergistic Relationship Between
Undergraduate Data Science Education and
Usability of Biodiversity Databases
Ciera Martinez, PhD
Berkeley Institute of Data Science
Mozilla Foundation
5. Learning to perform computational research is hard
- Attempting to gain the correct skills is frustrating
- They don’t teach the skills you need in classes
- PIs don’t teach these skills to their students
How I learned
- Using online tutorials and documentation
- Talking with peers, largely online
- Teaching others
Motivation
Now committed to programming education
6. Chitwood et al, 2014, Martinez et al, 2019, Martinez et al, 2016a, Martinez et al., 2016b,
Utilizing data made my research faster and better
Motivation
11. Ect
Yay!
So much cool data!
Nay! It’s hard to:
Navigate these databases
Understand what is in these databases
Access data within these databases
Motivation
14. Guided by mentors, students are given free rein to explore a
database of their choosing with the end goal of creating a
tutorial outlining their process.
Curiosity Data Project
To use biodiversity databases and inquiry based learning to teach
data literacy skills to undergraduate and graduate students.
Project Overview
Project set-up
16. 1.Got team together
• Recruited eight undergraduates from pool of 30
• Some programming knowledge (R or Python)
• Majors: CS, Stats, Biology
• Recruited mentor help: 1 PhD student and 1
industry data scientist
Project set-up
17. 2. The students were asked to pick one
database and document how they access,
clean, and analyze the data from that database
Project set-upProject set-up
18. 2. The students were asked to pick one
database and document how they access,
clean, and analyze the data from that database
List of prompt questions to begin exploration
• What features you find most useful?
• What kind of questions can we ask when using this database?
• What tools are available to use this database?
• What skills are required to use this database?
• How do you access the data?
• What tutorials help you access the database?
• What are the challenges to using this data?
• What are the rules for using this database?
Project set-upProject set-up
19. • We are giving back to the community by teaching others
• They get over their fears of posting their code online
• Expose them to open science philosophy and
computational reproducibility
• Focus on code readability
• Allow them to learn how to present a story
• Teaches them computational research notebook
management
• Regularly use Git and Github
• Start online presence and portfolio
3. Student end project goal is a tutorial that will
be posted online
Project set-up
Reasoning
Project set-up
20. 4. We met for 20 min individually each week
to discuss what they worked on, challenges,
and plan next weeks work. (~3 hours a
week)
Project set-upProject set-up
28. What worked?
What didn’t work?
What did the students like about the project?
What did the students dislike about the project?
Results of the project
29. - They all thrived in self directed inquiry
-These tutorials can be coopted for workshops AND be
remixed for lesson modules
-Feedback works if directly in contact with developer
- They all enjoyed it and stuck with it, I didn’t lose one
student
- Students became extremely interested in the data
What worked?
Results of the project
30. - The students skills did improve!
What worked?
Results of the project
31. -They sometimes floundered with so much freedom
-It was hard to assess how much work they did each week
-They could not create tools like I originally planned
-The limiting factor is me, hard to scale.
-Reviewing the code
-Dealing with notebook to website
-Individual meetings
Results of the project
What didn’t work?
32. - “Ability to take the project in whatever direction I
wanted”
- “I love manipulating data programmatically and making
plots with the data.”
-“I enjoyed learning new skills outside of the classroom and
actually being able to create a tutorial using data that I was
really interested in!”
-“Exploring different data visualization methods and tailor
them to fit the data set”
Results of the project
What did they like about the project?
33. Results of the project
What did they dislike about the project?
-“Debugging code and figuring out how to make an
interesting…story out of the data”
-“Downloading data using API and storing big data.”
-"Lack of structure”
-“Trying to figure out how to solve the problems like dealing
with missing and "bad" data”
-“That some visualization methods/packages are difficult to
install and use “
34. -“Debugging code and figuring out how to make an
interesting…story out of the data”
-“Downloading data using API and storing big data.”
-"Lack of structure”
-“Trying to figure out how to solve the problems like dealing
with missing and "bad" data”
-“That some visualization methods/packages are difficult to
install and use “
But everything they listed is normal frustrations about
working with code, so they learned what is common
frustrations with the code
Results of the project
What did they dislike about the project?
37. - Is there something here with gender and this data? Why? How do
we leverage for bridging gender gap in STEM?
- Can this scale as an academic course?
- Can this scale in the online community?
- Do you need to know how to program to run a similar project?
- Will people use these tutorials?
- Where is an appropriate place to house these tutorials?
End questions
38. Thank You!
Neotoma, PBDB, GloBI, Movebank, Google Earth,
iDigBio, BHL, Morphosource
All the museums, collections, and universities that hosted me to visit, and
provided insightful discussion and critical feedback
Berkeley Museum of Vertebrate and Zoology, New York Botanical Garden,
National Museum of Denmark, Kew Gardens, Natural History Museum of Utah,
American Museum of Natural History, Cooper-Hewitt Design Museum, UT
Austin Vertebrate Paleontology Laboratory Collection, Digimorph Facility, UT
Austin Ichthyology Collection, Lewis and Clark College, Natural History
Museum (London)
Special Thanks: Deb Paul
39. Learning Objectives
Link to exact document
- identify assumptions about data
- formulate research questions
- tell a story with data
- visualize data
- build data science portfolio
- write collaborative code
- employ version control
- apply data / file management
- think with reproducibility in mind
- handle and merge data
- clean messy data
40. Overlapping Data Literacy Skills
1. Manipulation of data frames
• subset
• merge
• transform long vs wide
• apply formulas to values
2. Map points onto a geographic map, understand assumptions
3. Clean taxonomic names, understand assumptions
1. Species concept
2. Spelling errors
3. Species synonyms
4. Hierarchical classification
5. Species databases available and differences between them
4. Time stamp data formats
1. Time collected vs time measured
2. Time formats
3. Visualization techniques for time and space
5. Look into biology for sanity checks
6. Understand the value of plotting data and which visualization is important
Minimum
Editor's Notes
So this is the outline of my talk.
Today I am going to talk about a project I that has been the culmination of over ten years of interests of mine. So to start, I am going to take a few minutes to explain my motivations for building this project and how my background led me here.
The short answer is evolution. I am obsessed with thinking about how organisms get their shape. The moment I learned that there are people who can devote their careers to asking such questions, is the exact moment I set a trajectory to become an evolutionary biologist. I vowed to attack this question, How do organism get their shape, to where ever it led me.
Where it led me, is where almost every modern scientist is led. To a lot of data and a lot of frustrations on how to use that data. Learning to program is hard. It took many years to figure out I was not alone with this frustration,
And now I am very committed to teaching programming and the pedagogy behind learning these programming and datasceince skills.
And overtime I stopped being so hard on myself. And as my programming skills improved, my research questions became bigger and I started uncovering surprising patterns in my research that would not have been possible without programming. Describe slide. Instead of being frustrated by large amounts of data I became craving it
Now I am hooked, I am currently a postdoc studying comparitive genomics at Berkeley. I consider myself a data scientist, just as much as I consider myself an evolutionary biologist and I began thinking about where the coolest data in the world is to answer my evolutionary questions, and it hit me, duh, Natural History Museums. I got my start in research at the Field Museum in Chicago (maybe you were there last week?),
but from working there I knew the sheer magnitude of specimens not open to the public and I knew Natural History Museum Collections and Botanical Gardens must have amazing data.
But lucky for me, I don’t have to literally go to these museums. I was thrilled to learn about the digitization efforts that this community in the past fifteen years. I could not believe how many databases their were. How did I not know about these before? So then I started devouring as much information I could about how to use this data. and realized THIS is hard. Navigating these databases is hard. Understanding what is in these databases is hard. Accessing the data is hard. But still, I could plainly see, the data from this community is extraordinary and immense, I just wish there was more information on how to use this data!
But lucky for me, I don’t have to literally go to these museums. I was thrilled to learn about the digitization efforts that this community in the past fifteen years. I could not believe how many databases their were. How did I not know about these before? So then I started devouring as much information I could about how to use this data. and realized THIS is hard. Navigating these databases is hard. Understanding what is in these databases is hard. Accessing the data is hard. But still, I could plainly see, the data from this community is extraordinary and immense, I just wish there was more information on how to use this data!
So this year, with the help of the Mozilla foundation, I decided to dive head first into my new obsession - biodiversity and natural history data, while also incorporating my second passion of teaching programming, data science, and computational reproducibility
I call the project The Curiosity Data Project, After Cabinet of Curiosities.
So I wanted to help not only undergraduates, but also increase the usability of these databases.
So while the databases provide the data and point of contact for questions, The students
I knew from working with undergraduates on my research that they are extremely well trained and hungry for projects that allows them to handle real data.
They all started the same list of questions about what to ask about the data.
This structure was built so that it will be useful for 1.Future users of the database and 2. Can be used as critical feedback for database maintainers.
These are the sorts of questions that every database should have somewhere on their site, but often take days to figure out.
Start online presence and portfolio (which is instrumental in them getting a job or get into graduate school)
Regularly use Git and Github if you have ever tried to teach or learn git, you know that you can only learn with regular practice. A short course or learning module is boring and nearly impossible way to learn git this way
- even with sometimes very little knowledge of biology. They do what we all do, Google.
- Feedback works if directly in contact with developer (for example GloBI maintainer Jorrit Polean worked directly with Yikang Lee discussing problems she had. Gtihub issues were created and database improved)
- Students became extremely interested in the data. Meeting with them was a delight, they would tell me all about dinosaurs, how fossils are collected, California fires, sea start diseases, whale migratory patterns. It was a delight.
I should have limited them to the scope of their questions early on.
because exploratory data analysis is not linear
Understanding and cleaning the data was full time.
Lastly I want to thank my team. Notice anything?
Gender: 6 female, 1 male, 1 non-binary gendered student
You might be thinking “oh, but maybe they were attracted to a female led data science team”. Not the case, I also have a team of undergraduates with my comparative genomics project: which is 5 male 2 females. The majority of applicants for this project was female - 24 out of the 30 applicant that applied were females.
This is something that I have noticed about this community in general, a lot of females are attracted to this type of data. Why? I have some theories, but overall, I think it is data itself,
So I end with a list of questions that I would love your opinion on.
Please come find me if you have any thoughts on these questions or data science education in general. I can take any questions not tag
They all started in the same place, and from there, they had to follow their own curiosity (the ultimate motivation).