Talk given at GoOpen, Oslo, 2011 (originally in Norwegian)
NHST Media Group builds the websites of, among others, Dagens Næringsliv, Dagens IT and a number of English-language trade publications. System developer Hans Jørgen Hoel and search architect Jan Høydahl describe the process that followed the decision to replace the FAST search solution with the open-source Apache Solr. Among the questions we try to answer: What challenges did we run into as a result of differences between the two platforms? Why did we build our own search framework? Has the new search lived up to expectations?
See also www.goopen.no, www.cominvent.com and www.nhst.no, and the Twitter hashtag #GoOpen
3. Jan Høydahl
1995: Developer, telecom
1998: Java developer
2000: Search - FAST
2006: Lucene
2007: new Cominvent()
2009: Lucene/Solr
About 100 projects
4. Mission-critical search
Lucene/Solr and FAST
Domain knowledge & best practices!
Consulting, courses, support
(www.solrkurs.no)
5. Agenda
Background of the project
Architecture before the migration
Search ABC, intro to Solr
Project execution
Summary, Q&A
6. Background of the project
A large volume of articles, both in print and online
FAST ESP as the search platform since 2006
Apache Solr for the tax-record search
NHST relies heavily on Java and open-source software
When FAST was acquired, the entire solution had to be re-evaluated
Ended up going with Solr
Brought in Jan as a consultant
14. Challenges
FAST is a search platform, Solr is pure search
Processing of source data
Language support (see the processing-chain sketch below)
Entities (people, places, companies)
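Solr 3.5 later shipped a language identification update processor that covers part of this gap. A minimal sketch of how such enrichment can be configured as an update processor chain in solrconfig.xml; the chain name and the field names title, body and language are assumptions for illustration, not NHST's actual setup:

<updateRequestProcessorChain name="articles">
  <!-- Detect the document language from title and body and
       write the ISO code to the language field -->
  <processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">title,body</str>
    <str name="langid.langField">language</str>
    <str name="langid.fallback">no</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- Hand the enriched document on to normal indexing -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>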
15. FAST - Solr differences
FAST: one index, divided into collections
Solr: multiple indexes (cores), each with its own schema
FAST lemmatization: bil, biler, bilene => bil; billig, billigere => billig
Solr stemming: bil, biler, bilene => bil; billig => bil; billigere => billiger (see the schema sketch below)
(bil = car, biler = cars, bilene = the cars; billig = cheap, billigere = cheaper)
FAST: very good multilingual support
Solr: more limited; we built language support into our framework
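A minimal sketch of how the stemming column can be configured in Solr's schema.xml using the Norwegian Snowball stemmer; the field type name text_no is our own invention, not NHST's actual schema:

<fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Snowball stemming: biler, bilene => bil -->
    <filter class="solr.SnowballPorterFilterFactory" language="Norwegian"/>
  </analyzer>
</fieldType>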
17. Tuning for news search
What is the most important factor in news search?
Freshness!
immediate indexing
date boost at query time
Solr Function Query
recip(
ms(NOW,publishdate),
3.16e-11, 0.5, 0.5
)^4000.0
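The constant 3.16e-11 is roughly 1/(number of milliseconds in a year), and recip(x,m,a,b) computes a/(m*x + b). So an article published right now scores 0.5/0.5 = 1.0, a one-year-old article scores 0.5/(1.0 + 0.5) ≈ 0.33, and the ^4000.0 weight lifts the function into the same range as the text relevance score. A sketch of how such a function can be attached as a dismax boost function; the query and the qf fields are illustrative assumptions, only publishdate comes from the slide:

http://localhost:8983/solr/select?defType=dismax
  &q=oljepris
  &qf=title^2 body
  &bf=recip(ms(NOW,publishdate),3.16e-11,0.5,0.5)^4000.0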
19. Summary / gains
Solr is far less resource-hungry than FAST
Can even be run virtualized
Cleaner architecture, with separate cores and schemas
Gained a lot from a shared search middleware and presentation layer
Good opportunities for tuning
Some challenges, but all in all very satisfied