iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of
Biodiversity Collections Program (Cooperative Agreement EF-1115210). Any opinions, findings,
and conclusions or recommendations expressed in this material are those of the author(s) and
do not necessarily reflect the views of the National Science Foundation.
Mining Whole Museum Collections Datasets for
Expanding Understanding of Collections with
the GUODA Service
Matthew Collins (iDigBio)
Jorrit Poelen (independent)
Alexander Thompson (iDigBio)
Jennifer Hammock (EOL)
What We’re Interested In
Computation with biodiversity data
• Research at scale
• Lowering barriers to access
• Reproducibility
Matthew Collins
Technical Operations
Manager - iDigBio
Jorrit Poelen
Independent
Alexander Thompson
Software Products
Lead - iDigBio
Jennifer Hammock
Marine Theme
Coordinator - EOL
Quick Review of Ways That We Work With Datasets
Focus here is on using large aggregated datasets to answer
research questions
Working With Datasets - Web Portals
Good: searching, visualizing location, browsing
Less good: data characterization, modeling, analysis, graphing
Working With Data - Purpose-Built Applications
Good: low barrier to entry, expert-built, documentation, peers
Less good: limited scope, limited ability to change
Working With Data - APIs & Libraries
Good: direct access to data, some simple analysis
Less good: programming barrier, performance limits
Working With Data - Download & Code
Good: ultimate flexibility, combine & merge
Less good: data management barrier, you’re the sysadmin
Working With Data - GUODA
Global Unified Open Data Access
(If SPNHC can be Spinach, GUODA can be Gouda)
An informal collaboration between technologists
from organizations like EOL, ePANDDA, and iDigBio as well as
independent biodiversity informaticists. We share data use
cases, best practices, infrastructure, code, and ideas around
the science that can be done by analyzing large open-access
biodiversity datasets.
Working With Data - GUODA Continued
Goals
• Have technologists discuss the technical challenges and
solution approaches in the biodiversity informatics domain
• Provide on-ramp for those who might not think of
themselves as “technologists”
• Fast parallel computation infrastructure and practices
(currently using Apache Spark)
• Local copies of entire datasets already formatted, ready for
computation at scale on provided infrastructure
• Hosting for services that rely on above
What Questions Does GUODA Make Approachable?
Can we create structured data from the unstructured text in
iDigBio records?
GUODA provides a platform to quickly start working on this
problem.
1. No data download
2. Jupyter Notebooks
3. Parallel processing of entire dataset
Data Characterization
Looking at the Darwin
Core terms
fieldNotes,
occurrenceRemarks,
and eventRemarks to
see how many
characters are in
which fields
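As a small local illustration of this kind of characterization, the sketch below counts non-empty records and total characters per field. The three records are invented for illustration, and the short field names stand in for the full Darwin Core term URIs; it is a plain-Python stand-in, not the GUODA workflow itself.

```python
# Invented example records; real iDigBio columns use full Darwin Core term URIs.
records = [
    {"fieldNotes": "flight intercept trap", "occurrenceRemarks": "", "eventRemarks": ""},
    {"fieldNotes": "", "occurrenceRemarks": "tropical forest", "eventRemarks": "forest litter"},
    {"fieldNotes": None, "occurrenceRemarks": "", "eventRemarks": "fogging"},
]
FIELDS = ("fieldNotes", "occurrenceRemarks", "eventRemarks")


def characterize(rows, fields=FIELDS):
    """Count non-empty records and total characters for each field."""
    summary = {f: {"non_empty_records": 0, "total_characters": 0} for f in fields}
    for row in rows:
        for f in fields:
            n = len(row.get(f) or "")  # treat missing or None as empty text
            if n > 0:
                summary[f]["non_empty_records"] += 1
                summary[f]["total_characters"] += n
    return summary


summary = characterize(records)
for field, stats in summary.items():
    print(field, stats)
```

On the whole dataset the same counts would be computed in parallel rather than in a Python loop, but the per-field question being asked is the same.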
The Code to Produce That Figure
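import numpy
import seaborn as sns
from pyspark.sql import functions as sql

# sqlContext is assumed to be supplied by the notebook's Spark session.
idbdf = sqlContext.read.parquet("../data/idigbio/occurrence.txt.parquet")
idbdf.registerTempTable("idbtable")  # expose the DataFrame to SQL as idbtable
# Select each notes field under a short name, plus all three joined as one document.
notes = sqlContext.sql("""
SELECT
`http://portal.idigbio.org/terms/uuid` as uuid,
`http://rs.tdwg.org/dwc/terms/fieldNotes` as fieldNotes,
`http://rs.tdwg.org/dwc/terms/eventRemarks` as eventRemarks,
`http://rs.tdwg.org/dwc/terms/occurrenceRemarks` as occurrenceRemarks,
TRIM(CONCAT(`http://rs.tdwg.org/dwc/terms/occurrenceRemarks`, ' ',
`http://rs.tdwg.org/dwc/terms/eventRemarks`, ' ',
`http://rs.tdwg.org/dwc/terms/fieldNotes`)) as document
FROM idbtable WHERE
`http://rs.tdwg.org/dwc/terms/fieldNotes` != '' OR
`http://rs.tdwg.org/dwc/terms/occurrenceRemarks` != '' OR
`http://rs.tdwg.org/dwc/terms/eventRemarks` != ''
""")
# Add a character-count column for the combined document and for each field.
notes = notes.withColumn('document_len', sql.length(notes['document']))
notes = notes.withColumn('fieldNotes_len', sql.length(notes['fieldNotes']))
notes = notes.withColumn('eventRemarks_len', sql.length(notes['eventRemarks']))
notes = notes.withColumn('occurrenceRemarks_len', sql.length(notes['occurrenceRemarks']))
# sub_set is not defined on the slide; the four length columns are assumed here.
sub_set = ['document_len', 'fieldNotes_len', 'eventRemarks_len', 'occurrenceRemarks_len']
notes_pd = notes[sub_set].toPandas()
# Plot the distribution of log10(character count), skipping empty fields.
sns.distplot(notes_pd['document_len'].dropna().apply(numpy.log10))
sns.distplot(notes_pd['fieldNotes_len'].dropna()[notes_pd['fieldNotes_len'] > 0].apply(numpy.log10))
sns.distplot(notes_pd['occurrenceRemarks_len'].dropna()[notes_pd['occurrenceRemarks_len'] > 0].apply(numpy.log10))
ax = sns.distplot(notes_pd['eventRemarks_len'].dropna()[notes_pd['eventRemarks_len'] > 0].apply(numpy.log10))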
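The plotting calls above work on log10 of the character counts because text lengths can span several orders of magnitude. The sketch below shows that transformation and binning in plain Python; the sample lengths and the `bin_width` parameter are invented for illustration and are not taken from the iDigBio data.

```python
import math
from collections import Counter


def log10_histogram(lengths, bin_width=0.5):
    """Bin log10(length) values; zero and missing lengths are skipped."""
    bins = Counter()
    for n in lengths:
        if n is None or n <= 0:
            continue  # log10 is undefined for empty fields
        # Lower edge of the bin that log10(n) falls into.
        bins[math.floor(math.log10(n) / bin_width) * bin_width] += 1
    return dict(sorted(bins.items()))


# Invented character counts, including an empty and a missing field.
hist = log10_histogram([0, 5, 9, 21, 150, 4000, None])
print(hist)
```

Each key is the lower edge of a bin on the log10 scale, so a key of 2.0 covers lengths from 100 up to roughly 316 characters.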
The Interface to Write The Code
Notebooks
“Literate Programming”
Comments, code, and
outputs all together in a
readable document that
describes what is being
done
GUODA Notebook Architecture
A look at interacting with the GUODA data service through
Jupyter Notebooks
GUODA Data Service At Scale
Python NLTK parsing
and part-of-speech
tagging of notes fields
with noun-phrase
assembly.
Example phrases:
• Intercept trap
• Forest litters
• Field notes
• Field notebook
• Fogging fungus covered log
• Tropical forest
• Flight intercept trap
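To make "noun-phrase assembly" concrete, here is a deliberately simplified sketch that groups runs of adjective and noun tokens into phrases. It is a rule-based stand-in, not the trained chunker used in the talk; the input is assumed to be already tokenized and tagged with Penn Treebank part-of-speech tags, and the example sentence is invented.

```python
def noun_phrases(tagged_tokens):
    """Group consecutive adjective/noun tokens into phrases that end in a noun."""
    phrases, current = [], []

    def flush():
        # Drop trailing adjectives so every phrase ends in a noun.
        while current and not current[-1][1].startswith("NN"):
            current.pop()
        if current:
            phrases.append(" ".join(word for word, _ in current).lower())
        current.clear()

    for word, tag in tagged_tokens:
        if tag.startswith("NN") or tag.startswith("JJ"):
            current.append((word, tag))
        else:
            flush()  # any other tag ends the current run
    flush()
    return phrases


# Hand-tagged example tokens (assumed tags, for illustration only).
tagged = [
    ("Collected", "VBN"), ("in", "IN"),
    ("flight", "NN"), ("intercept", "NN"), ("trap", "NN"),
    ("in", "IN"), ("tropical", "JJ"), ("forest", "NN"),
]
result = noun_phrases(tagged)
print(result)
```

A trained chunker can learn phrase boundaries that fixed rules like these miss, which is presumably why the talk trains one on labeled examples.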
The Code - 6 minutes for 3.2M Records
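from pyspark.sql import functions as sql, types

# t, p, and c appear to be the tokenizer, part-of-speech tagger, and chunker;
# the slide does not show how they are constructed.
c.train(c.load_training_data("../data/chunker_training_50_fixed.json"))
def pipeline(s):
    # Tokenize, tag parts of speech, tag chunks, then assemble phrases.
    return c.assemble(c.tag(p.tag(t.tokenize(s))))
# Each record yields an array of string-to-string maps with "tag" and "phrase" keys.
pipeline_udf = sql.udf(pipeline, types.ArrayType(
    types.MapType(
        types.StringType(),
        types.StringType()
    )))
# Parentheses let the method chain span several lines.
phrases = (notes
    .withColumn("phrases", pipeline_udf(notes["document"]))
    .select(sql.explode(sql.col("phrases")).alias("text"))
    .filter(sql.col("text")["tag"] == "NP")
    .select(sql.lower(sql.col("text")["phrase"]).alias("phrase"))
    .groupBy(sql.col("phrase"))
    .count())
phrases.write.parquet('../data/idigbio_phrases.parquet')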
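For readers new to Spark, the explode, filter, lower, and group-and-count chain above can be mirrored on a single machine with a `Counter`. This is only a sketch of the logic: the chunk dictionaries are invented, and their `tag` and `phrase` keys follow the map type the UDF declares.

```python
from collections import Counter


def count_noun_phrases(documents_chunks):
    """Count lowercased NP phrases across all documents.

    Each document is a list of {"tag": ..., "phrase": ...} dicts.
    """
    counts = Counter()
    for chunks in documents_chunks:      # one list of chunks per record
        for chunk in chunks:             # equivalent of explode
            if chunk["tag"] == "NP":     # equivalent of filter
                counts[chunk["phrase"].lower()] += 1  # lower, then group and count
    return counts


# Invented chunker output for two records.
docs = [
    [{"tag": "NP", "phrase": "Flight intercept trap"}, {"tag": "VP", "phrase": "collected"}],
    [{"tag": "NP", "phrase": "flight intercept trap"}, {"tag": "NP", "phrase": "Tropical forest"}],
]
counts = count_noun_phrases(docs)
print(counts.most_common())
```

Spark runs the same steps across partitions of the data in parallel, which is how the full set of records is processed in minutes.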
What Else is GUODA Besides Notebooks?
Remember “collaboration” and “infrastructure” to lower
barriers
• Twice monthly Google Hangouts
• Hadoop HDFS data store with datasets: GBIF, iDigBio, BHL,
TraitBank so far
• Apache Spark cluster for computation
• Backs Effechecka http://effechecka.org/
• Backs Fresh Data https://github.com/gimmefreshdata/
• ePANDDA (we’re sharing ideas)
• iDigBio data quality workflows
Why is GUODA Important?
Perform research at a faster pace by “outsourcing” some of the
harder parts
Collect entire large datasets together in one place for
cross-dataset exploration without data management barrier
Provides a foundation, both community and infrastructure,
upon which to build purpose-built applications and APIs
bigger and faster than before
How You Can Fit With GUODA
• Make your data available
• Data standards to make it relatable to other datasets
• Making data available doesn’t end with handoff to the
aggregator - where is your data used?
• Support workforce development
• Support next-wave things like ePANDDA
• Collaborate with GUODA when starting your own research
www.idigbio.org
facebook.com/iDigBio
twitter.com/iDigBio
vimeo.com/idigbio
idigbio.org/rss-feed.xml
webcal://www.idigbio.org/events-calendar/export.ics
Thank you!
http://guoda.bio
