SlideShare a Scribd company logo
1 of 61
Download to read offline
The original vision of Nutch, 14 years later:
Building an open source search engine
Apache Big Data Europe 2016
sylvain@sylvainzimmer.com
@sylvinus
/usr/bin/whoami
• Jamendo (Founder & CTO, 2004-2011)
• TEDxParis (Co-founder, 2009-2012)
• dotConferences (Founder, 2012-)
• Pricing Assistant (Co-founder & CTO, 2012-)
"The original motivation for the Nutch project was to provide
a transparent alternative to the growing power of a
handful of private search services over most users’ view of
the Web.
CommerceNet Labs Technical Report, Nov 2004
However, as Nutch has been adopted with greater
enthusiasm by smaller organizations, the Nutch
Organization has de-emphasized operating a multi-
billion-page index in the public interest."
again?
transparency
reproducibility
https://uidemo.commonsearch.org
https://explain.commonsearch.org/?q=python&g=en
Agenda
• Values & tech choices
• Search engine components
• Challenges
• Opportunities
Values & tech choices
Radical transparency
• Open source (Apache License v2)
• Open data
• (Governance)
Privacy
• Results can be tailored by language/country, but
NOT by user/cookie/sessionid
• o/ Cache everything!
• Tor service: http://comsearchl2zlnre.onion
Participation & Pragmatism
• Use high-level languages as much as possible
(Python, Go)
• Embrace active communities (Spark, Elasticsearch)
• Use mainstream participation platforms, even if they
are nonfree (GitHub, Slack)
Search engines
http://infolab.stanford.edu/~backrub/google.html
The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
Crawler
Indexer
Database
SearcherRanker
Crawler
http://commoncrawl.org
Today at 3:30pm!
http://scrapy.org
http://github.com/cocrawler/cocrawler
Indexer
Specs
• HTML parsing & analysis
• Tokenization / NLP
• Static rankings
• Language detection
• I/O from crawls to databases
Common Search Pipeline
Doc sources
Common Crawl,
WARC files,
URLs ...
Filter
plugins
Document
parsing
Output
plugins
Data output
Database, file,
HDFS, S3, ...
HTML parsers
• BeautifulSoup & friends
• lxml
• html5lib
• Gumbo!
https://github.com/google/gumbo-parser
Gumbocy
• Use Cython instead of ctypes
• Smaller API
• Tree traversal on the Cython side with basic
boilerplate/visibility support
https://github.com/commonsearch/gumbocy
https://github.com/commonsearch/urlparse4
Database(s)
http://lucene.apache.org/
Ranker
Ranking formula
rank = f( static_score , dynamic_score( query ) )
Alexa
DMOZ
Blacklists
PageRank
...
ElasticSearch & Lucene
TF-IDF
BM25
...
https://about.commonsearch.org/developer/get-started
Today @ 4:30pm ;-)
Searcher / Frontend
Specs
• Send user query to databases
• Search-as-you-type
• HTML & JSON endpoints
• High performance
https://github.com/commonsearch/cosr-front
http://infolab.stanford.edu/~backrub/google.html
The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
Crawler
Parser
Index
SearcherRanker
Challenges
Funding / Scale
• Frugalism
• Caching
• In-kind services
• Individual donations / Foundation grants
• General economic incentives
Spam
• Email spam
• Wikipedia vandalism
• Algorithm complexity & scale
• Given enough eyeballs, all spam is shallow?
Relevance
• Exhaustivity
• Rescoring
• Evaluation
• More at 4:30pm ;-)
More search dimensions
• Realtime search
• Local search
• Universal search
Semantic search
• Wikidata
• YAGO
• Conversational / Voice search
Outreach
• Easy onboarding & docs
• Making people care believe
Opportunities
Decentralization
• YaCy
• Extremely high technical & social cost!
• Transparency?
Research
• More people should know how to build search
engines
• Spam, Relevance, Large-scale data processing
• We need more open datasets!
https://about.commonsearch.org/blog/
Make the Web a better place!
• SEO
• Transparency
• Influence of money
• Public service
Questions?
https://about.commonsearch.org/contributing
https://github.com/commonsearch
contact@commonsearch.org
slack.commonsearch.org

More Related Content

What's hot

NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...
South London Geek Nights
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Lucidworks
 

What's hot (14)

Search Intelligently - Liferay Symposium North America 2016, Chicago, USA
Search Intelligently - Liferay Symposium North America 2016, Chicago, USASearch Intelligently - Liferay Symposium North America 2016, Chicago, USA
Search Intelligently - Liferay Symposium North America 2016, Chicago, USA
 
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...
NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...
 
RDF Graph Data Management in Oracle Database and NoSQL Platforms
RDF Graph Data Management in Oracle Database and NoSQL PlatformsRDF Graph Data Management in Oracle Database and NoSQL Platforms
RDF Graph Data Management in Oracle Database and NoSQL Platforms
 
Graphs, Graphs everywhere - Lucene powered relation exploration
Graphs, Graphs everywhere - Lucene powered relation explorationGraphs, Graphs everywhere - Lucene powered relation exploration
Graphs, Graphs everywhere - Lucene powered relation exploration
 
NoSQL Roundup
NoSQL RoundupNoSQL Roundup
NoSQL Roundup
 
Deriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF DataDeriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF Data
 
Managing RDF data with graph databases
Managing RDF data with graph databasesManaging RDF data with graph databases
Managing RDF data with graph databases
 
Introduction to Graph Databases
Introduction to Graph DatabasesIntroduction to Graph Databases
Introduction to Graph Databases
 
How Graph Databases efficiently store, manage and query connected data at s...
How Graph Databases efficiently  store, manage and query  connected data at s...How Graph Databases efficiently  store, manage and query  connected data at s...
How Graph Databases efficiently store, manage and query connected data at s...
 
Riding the Semantic Web
Riding the Semantic WebRiding the Semantic Web
Riding the Semantic Web
 
Infrastructure crossroads... and the way we walked them in DKPro
Infrastructure crossroads... and the way we walked them in DKProInfrastructure crossroads... and the way we walked them in DKPro
Infrastructure crossroads... and the way we walked them in DKPro
 
Neo4j in Depth
Neo4j in DepthNeo4j in Depth
Neo4j in Depth
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
 
Webinar: Event Processing & Data Analytics with Lucidworks Fusion
Webinar: Event Processing & Data Analytics with Lucidworks FusionWebinar: Event Processing & Data Analytics with Lucidworks Fusion
Webinar: Event Processing & Data Analytics with Lucidworks Fusion
 

Similar to The original vision of Nutch, 14 years later: Building an open source search engine

Similar to The original vision of Nutch, 14 years later: Building an open source search engine (20)

Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy cabral   search marketing summit - scraping data-driven content (1)Jeremy cabral   search marketing summit - scraping data-driven content (1)
Jeremy cabral search marketing summit - scraping data-driven content (1)
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
Ordering the chaos: Creating websites with imperfect data
Ordering the chaos: Creating websites with imperfect dataOrdering the chaos: Creating websites with imperfect data
Ordering the chaos: Creating websites with imperfect data
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
I2 - SharePoint Hybrid Search Start to Finish - Thomas Vochten
I2 - SharePoint Hybrid Search Start to Finish - Thomas VochtenI2 - SharePoint Hybrid Search Start to Finish - Thomas Vochten
I2 - SharePoint Hybrid Search Start to Finish - Thomas Vochten
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache Stanbol
 
Resource sync overview and real-world use cases for discovery, harvesting, an...
Resource sync overview and real-world use cases for discovery, harvesting, an...Resource sync overview and real-world use cases for discovery, harvesting, an...
Resource sync overview and real-world use cases for discovery, harvesting, an...
 
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...
 
"A Toolkit for Digital Research" - CNI 2013
"A Toolkit for Digital Research" - CNI 2013"A Toolkit for Digital Research" - CNI 2013
"A Toolkit for Digital Research" - CNI 2013
 
2018 09-03 aOS Aachen - SharePoint demystified - Thomas Vochten
2018 09-03 aOS Aachen - SharePoint demystified - Thomas Vochten2018 09-03 aOS Aachen - SharePoint demystified - Thomas Vochten
2018 09-03 aOS Aachen - SharePoint demystified - Thomas Vochten
 
Maximizing the Impact of Institutional Knowledge Using DSpace
Maximizing the Impact of Institutional Knowledge Using DSpaceMaximizing the Impact of Institutional Knowledge Using DSpace
Maximizing the Impact of Institutional Knowledge Using DSpace
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data Pipeline
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Open Data and Web API
Open Data and Web APIOpen Data and Web API
Open Data and Web API
 
Fc3 integration strategies
Fc3 integration strategiesFc3 integration strategies
Fc3 integration strategies
 
What do we want computers to do for us?
What do we want computers to do for us? What do we want computers to do for us?
What do we want computers to do for us?
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
Fried dallas spug
Fried dallas spugFried dallas spug
Fried dallas spug
 
CKAN - the open source data portal platform
CKAN - the open source data portal platformCKAN - the open source data portal platform
CKAN - the open source data portal platform
 

More from Sylvain Zimmer

Joshfire Framework 0.9 Technical Overview
Joshfire Framework 0.9 Technical OverviewJoshfire Framework 0.9 Technical Overview
Joshfire Framework 0.9 Technical Overview
Sylvain Zimmer
 
Archicamp présentation
Archicamp présentationArchicamp présentation
Archicamp présentation
Sylvain Zimmer
 

More from Sylvain Zimmer (12)

Developer-friendly taskqueues: What you should ask yourself before choosing one
Developer-friendly taskqueues: What you should ask yourself before choosing oneDeveloper-friendly taskqueues: What you should ask yourself before choosing one
Developer-friendly taskqueues: What you should ask yourself before choosing one
 
PyCon FR 2016 - Et si on recodait Google en Python ?
PyCon FR 2016 - Et si on recodait Google en Python ?PyCon FR 2016 - Et si on recodait Google en Python ?
PyCon FR 2016 - Et si on recodait Google en Python ?
 
Why and how Pricing Assistant migrated from Celery to RQ - Paris.py #2
Why and how Pricing Assistant migrated from Celery to RQ - Paris.py #2Why and how Pricing Assistant migrated from Celery to RQ - Paris.py #2
Why and how Pricing Assistant migrated from Celery to RQ - Paris.py #2
 
[fr] Introduction et Live-code Backbone.js à DevoxxFR 2013
[fr] Introduction et Live-code Backbone.js à DevoxxFR 2013[fr] Introduction et Live-code Backbone.js à DevoxxFR 2013
[fr] Introduction et Live-code Backbone.js à DevoxxFR 2013
 
140byt.es - The Dark Side of Javascript
140byt.es - The Dark Side of Javascript140byt.es - The Dark Side of Javascript
140byt.es - The Dark Side of Javascript
 
Joshfire Framework 0.9 Technical Overview
Joshfire Framework 0.9 Technical OverviewJoshfire Framework 0.9 Technical Overview
Joshfire Framework 0.9 Technical Overview
 
Javascript Views, Client-side or Server-side with NodeJS
Javascript Views, Client-side or Server-side with NodeJSJavascript Views, Client-side or Server-side with NodeJS
Javascript Views, Client-side or Server-side with NodeJS
 
no.de quick presentation at #ParisJS 4
no.de quick presentation at #ParisJS 4no.de quick presentation at #ParisJS 4
no.de quick presentation at #ParisJS 4
 
Kinect + Javascript tech talk at #ParisJS Jan 2011
Kinect + Javascript tech talk at #ParisJS Jan 2011Kinect + Javascript tech talk at #ParisJS Jan 2011
Kinect + Javascript tech talk at #ParisJS Jan 2011
 
Web Crawling with NodeJS
Web Crawling with NodeJSWeb Crawling with NodeJS
Web Crawling with NodeJS
 
Archicamp présentation
Archicamp présentationArchicamp présentation
Archicamp présentation
 
Twisted presentation & Jamendo usecases
Twisted presentation & Jamendo usecasesTwisted presentation & Jamendo usecases
Twisted presentation & Jamendo usecases
 

Recently uploaded

Recently uploaded (20)

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

The original vision of Nutch, 14 years later: Building an open source search engine