Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service


Introduces the Global Unified Open Data Access (GUODA) collaboration between iDigBio, independent developers, and EOL, which aims to support processing of large biodiversity datasets using Apache Spark. A specific text-mining example is described. This presentation was given at the 31st Annual Meeting of the Society for the Preservation of Natural History Collections (SPNHC) in Berlin, Germany, in 2016.


  1. 1. iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Mining Whole Museum Collections Datasets for Expanding Understanding of Collections with the GUODA Service. Matthew Collins (iDigBio), Jorrit Poelen (independent), Alexander Thompson (iDigBio), Jennifer Hammock (EOL)
  2. 2. What We’re Interested In Computation with biodiversity data • Research at scale • Lowering barriers to access • Reproducibility Matthew Collins, Technical Operations Manager - iDigBio; Jorrit Poelen, Independent; Alexander Thompson, Software Products Lead - iDigBio; Jennifer Hammock, Marine Theme Coordinator - EOL
  3. 3. Quick Review of Ways That We Work With Datasets Focus here is on using large aggregated datasets to answer research questions
  4. 4. Working With Datasets - Web Portals Good: searching, visualizing location, browsing Less good: data characterization, modeling, analysis, graphing
  5. 5. Working With Data - Purpose-Built Applications Good: low barrier to entry, expert-built, documentation, peers Less good: limited scope, limited ability to change
  6. 6. Working With Data - APIs & Libraries Good: direct access to data, some simple analysis Less good: programming barrier, performance limits
  7. 7. Working With Data - Download & Code Good: ultimate flexibility, combine & merge Less good: data management barrier, you’re the sysadmin
  8. 8. Working With Data - GUODA Global Unified Open Data Access (If SPNHC can be Spinach, GUODA Gouda) An informal collaboration between technologists from organizations like EOL, ePANDDA, and iDigBio as well as independent biodiversity informaticists. We share data use cases, best practices, infrastructure, code, and ideas around the science that can be done by analyzing large open-access biodiversity datasets.
  9. 9. Working With Data - GUODA Continued Goals • Have technologists discuss the technical challenges and solution approaches in the biodiversity informatics domain • Provide on-ramp for those who might not think of themselves as “technologists” • Fast parallel computation infrastructure and practices (currently using Apache Spark) • Local copies of entire datasets already formatted, ready for computation at scale on provided infrastructure • Hosting for services that rely on above
  10. 10. What Questions Does GUODA Make Approachable? Can we create structured data from the unstructured text in iDigBio records? GUODA provides a platform to quickly start working on this problem. 1. No data download 2. Jupyter Notebooks 3. Parallel processing of entire dataset
  11. 11. Data Characterization Looking at the Darwin Core terms fieldNotes, occurrenceRemarks, and eventRemarks to see how many characters are in which fields
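  12. 12. The Code to Produce That Figure

          import numpy
          import seaborn as sns
          import pyspark.sql.functions as sql

          idbdf = sqlContext.read.parquet("../data/idigbio/occurrence.txt.parquet")
          # Not shown on the slide: the query below needs the DataFrame
          # registered under the table name it refers to.
          idbdf.registerTempTable("idbtable")

          # The three remarks columns are also selected under short aliases
          # (an addition to the slide's query) so their lengths can be computed below.
          notes = sqlContext.sql("""
              SELECT `http://portal.idigbio.org/terms/uuid` as uuid,
                     `http://rs.tdwg.org/dwc/terms/fieldNotes` as fieldNotes,
                     `http://rs.tdwg.org/dwc/terms/occurrenceRemarks` as occurrenceRemarks,
                     `http://rs.tdwg.org/dwc/terms/eventRemarks` as eventRemarks,
                     TRIM(CONCAT(`http://rs.tdwg.org/dwc/terms/occurrenceRemarks`, ' ',
                                 `http://rs.tdwg.org/dwc/terms/eventRemarks`, ' ',
                                 `http://rs.tdwg.org/dwc/terms/fieldNotes`)) as document
              FROM idbtable
              WHERE `http://rs.tdwg.org/dwc/terms/fieldNotes` != ''
                 OR `http://rs.tdwg.org/dwc/terms/occurrenceRemarks` != ''
                 OR `http://rs.tdwg.org/dwc/terms/eventRemarks` != ''
          """)

          # Character counts for the combined document and for each field.
          notes = notes.withColumn('document_len', sql.length(notes['document']))
          notes = notes.withColumn('fieldNotes_len', sql.length(notes['fieldNotes']))
          notes = notes.withColumn('eventRemarks_len', sql.length(notes['eventRemarks']))
          notes = notes.withColumn('occurrenceRemarks_len', sql.length(notes['occurrenceRemarks']))

          # sub_set is not defined on the slide; assumed here to be the length columns.
          sub_set = ['document_len', 'fieldNotes_len', 'eventRemarks_len', 'occurrenceRemarks_len']
          notes_pd = notes[sub_set].toPandas()

          # Overlay the log10 length distributions, skipping empty fields.
          sns.distplot(notes_pd['document_len'].dropna().apply(numpy.log10))
          sns.distplot(notes_pd['fieldNotes_len'].dropna()[notes_pd['fieldNotes_len'] > 0].apply(numpy.log10))
          sns.distplot(notes_pd['occurrenceRemarks_len'].dropna()[notes_pd['occurrenceRemarks_len'] > 0].apply(numpy.log10))
          ax = sns.distplot(notes_pd['eventRemarks_len'].dropna()[notes_pd['eventRemarks_len'] > 0].apply(numpy.log10))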
  13. 13. The Interface to Write The Code Notebooks “Literate Programming” Comments, code, and outputs all together in a readable document that describes what is being done
  14. 14. GUODA Notebook Architecture A look at interacting with the GUODA data service through Jupyter Notebooks
  15. 15. GUODA Data Service At Scale Python NLTK parsing and part-of-speech tagging of notes fields with noun-phrase assembly. Example phrases: • Intercept trap • Forest litters • Field notes • Field notebook • Fogging fungus covered log • Tropical forest • Flight intercept trap
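  16. 16. The Code - 6 minutes for 3.2M Records

          import pyspark.sql.functions as sql
          from pyspark.sql import types

          # c, p, and t are presumably the chunker, part-of-speech tagger, and
          # tokenizer; their construction is not shown on the slide.
          c.train(c.load_training_data("../data/chunker_training_50_fixed.json"))

          def pipeline(s):
              # tokenize -> part-of-speech tag -> chunk tag -> assemble phrases
              return c.assemble(c.tag(p.tag(t.tokenize(s))))

          # Each document yields an array of string-to-string maps,
          # read below through the "tag" and "phrase" keys.
          pipeline_udf = sql.udf(pipeline,
                                 types.ArrayType(
                                     types.MapType(types.StringType(), types.StringType())))

          # Parentheses let the method chain span several lines.
          phrases = (notes
              .withColumn("phrases", pipeline_udf(notes["document"]))
              .select(sql.explode(sql.col("phrases")).alias("text"))
              .filter(sql.col("text")["tag"] == "NP")  # keep noun phrases only
              .select(sql.lower(sql.col("text")["phrase"]).alias("phrase"))
              .groupBy(sql.col("phrase"))
              .count())

          phrases.write.parquet('../data/idigbio_phrases.parquet')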
  17. 17. What Else is GUODA Besides Notebooks? Remember “collaboration” and “infrastructure” to lower barriers • Twice monthly Google Hangouts • Hadoop HDFS data store with datasets: GBIF, iDigBio, BHL, TraitBank so far • Apache Spark cluster for computation • Backs Effechecka http://effechecka.org/ • Backs Fresh Data https://github.com/gimmefreshdata/ • ePANDDA (we’re sharing ideas) • iDigBio data quality workflows
  18. 18. Why is GUODA Important? Perform research at a faster pace by “outsourcing” some of the harder parts. Collect entire large datasets together in one place for cross-dataset exploration without data management barrier. Provides a foundation, both community and infrastructure, upon which to build purpose-built applications and APIs bigger and faster than before
  19. 19. How You Can Fit With GUODA • Make your data available • Data standards to make it relatable to other datasets • Making data available doesn’t end with handoff to the aggregator - where is your data used? • Support workforce development • Support next-wave things like ePANDDA • Collaborate with GUODA when starting your own research
  20. 20. www.idigbio.org facebook.com/iDigBio twitter.com/iDigBio vimeo.com/idigbio idigbio.org/rss-feed.xml webcal://www.idigbio.org/events-calendar/export.ics Thank you! http://guoda.bio
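
Slides 11 and 12 characterize the notes fields by how many characters each one holds, plotted on a log10 scale. As a minimal local sketch of that idea, assuming plain Python dictionaries in place of the Spark DataFrame, the following counts characters per field and buckets the counts by log10 length. The sample records, the helper names `field_lengths` and `log10_histogram`, and the 0.5 bin width are illustrative assumptions, not part of the presentation.

```python
import math
from collections import Counter

# Hypothetical sample records; the real input is the iDigBio occurrence table.
records = [
    {"fieldNotes": "Flight intercept trap", "occurrenceRemarks": "", "eventRemarks": None},
    {"fieldNotes": "", "occurrenceRemarks": "Fogging fungus covered log",
     "eventRemarks": "Tropical forest"},
    {"fieldNotes": None, "occurrenceRemarks": None, "eventRemarks": None},
]

# Short names for the three Darwin Core terms discussed on slide 11.
FIELDS = ("fieldNotes", "occurrenceRemarks", "eventRemarks")


def field_lengths(record):
    """Return {field: character count} for the non-empty text fields."""
    lengths = {}
    for name in FIELDS:
        text = (record.get(name) or "").strip()
        if text:
            lengths[name] = len(text)
    return lengths


def log10_histogram(records, bin_width=0.5):
    """Count field lengths per log10 bin, keyed by (field, lower bin edge)."""
    hist = Counter()
    for record in records:
        for name, n in field_lengths(record).items():
            bin_index = math.floor(math.log10(n) / bin_width)
            hist[(name, bin_index * bin_width)] += 1
    return hist


hist = log10_histogram(records)
for (name, edge), count in sorted(hist.items()):
    print(name, edge, count)
```

Binning on the logarithm keeps very short and very long notes visible in the same summary, which is presumably why the slide's plots apply `numpy.log10` before drawing the distributions.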

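Slides 15 and 16 describe tagging the notes text, assembling noun phrases, and counting how often each phrase occurs. The sketch below shows the shape of that assemble-then-count step in plain Python. It does not use the trained chunker from the presentation: a simple hand-written rule (maximal runs of modifier and noun tags that end on a noun) is swapped in, and the hand-tagged sample documents, the tag sets, and the function name `assemble_noun_phrases` are all assumptions made for illustration.

```python
from collections import Counter

# Penn Treebank style tags; which tags count as nouns or modifiers is an assumption.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
MODIFIER_TAGS = {"JJ", "VBG", "VBN"}


def assemble_noun_phrases(tagged):
    """Group maximal runs of modifier/noun tokens that end on a noun."""
    phrases = []
    run = []

    def flush():
        # Drop trailing modifiers so every phrase ends on a noun.
        while run and run[-1][1] not in NOUN_TAGS:
            run.pop()
        if run:
            phrases.append(" ".join(token for token, _ in run).lower())
        run.clear()

    for token, tag in tagged:
        if tag in NOUN_TAGS or tag in MODIFIER_TAGS:
            run.append((token, tag))
        else:
            flush()
    flush()
    return phrases


# Hypothetical, hand-tagged documents standing in for tagger output.
docs = [
    [("Collected", "VBN"), ("in", "IN"), ("flight", "NN"),
     ("intercept", "NN"), ("trap", "NN"), (".", ".")],
    [("Fogging", "VBG"), ("fungus", "NN"), ("covered", "VBN"), ("log", "NN"),
     ("in", "IN"), ("tropical", "JJ"), ("forest", "NN")],
    [("Flight", "NN"), ("intercept", "NN"), ("trap", "NN"),
     ("in", "IN"), ("tropical", "JJ"), ("forest", "NN")],
]

# Equivalent in spirit to explode + lower + groupBy + count on slide 16.
counts = Counter(phrase for doc in docs for phrase in assemble_noun_phrases(doc))
for phrase, n in counts.most_common():
    print(n, phrase)
```

A rule like this is far cruder than a trained chunker, but it makes the data flow visible: one document goes in, a list of phrases comes out, and the counting is a separate aggregation that a cluster can run in parallel.
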
