Ranking the Web with Spark

•

3 likes•966 views

Sylvain Zimmer

This talk was presented at Apache BigData Europe 2016 in Sevilla, Spain on Nov 15th.

Technology

Ranking the Web with Spark
Apache Big Data Europe 2016
sylvain@sylvainzimmer.com
@sylvinus

/usr/bin/whoami
• Jamendo (Founder & CTO, 2004-2011)
• TEDxParis (Co-founder, 2009-2012)
• dotConferences (Founder, 2012-)
• Pricing Assistant (Co-founder & CTO, 2012-)

https://explain.commonsearch.org/?q=python&g=en

Disclaimer: IANASRE
(I Am Not A Search Relevance Engineer)

What's in a score
score = fn( doc, query, language, user, time )

What's in a score
score = fn( doc, query )

What's in a score
score = fn( static_score, dynamic_score ( query ))

Static features
• Scopes:
• Page: URL depth, markup stats, ...
• Domain: Age, page count, blacklists, ...
• WebGraph: PageRank, ...

http://infolab.stanford.edu/~backrub/google.html
The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
Crawler
Indexer
Database
SearcherRanker

Dynamic features
• Text match: TF-IDF, BM25, proximity, topic, ...
• Query-level: number of words, popularity, ...
• Usage: clicks, dwell time, reformulations, ...
• Time

Users
Database
Elasticsearch
Indexer
Python, Spark
Data sources
Common Crawl, Alexa top 1M, ...
words, static score
query top 10 docs, ﬁnal scores
Ofﬂine
Online
Searcher
Go

Issues with this architecture
• Static & dynamic scoring are in different
codebases
• No control over result diversity
• Hard to optimize
• Very dependent on Elasticsearch

Users
Database
Indexer
words, static score, features
query
Searcher
top 1k docs, features
Rescorer
ﬁnal 10 docs

Issues with rescoring
• Latency
• Pagination
• Harder to explain

LTR Model
• Features
• Training dataset
• Evaluation: NDCG, ERR, ...
• Algorithms: AdaRank, ListNet, LambdaMART, ...
• Learning with Spark!

The right questions
• What do users expect?
• What features?
• How to evaluate and ﬁne-tune in the real world?

https://github.com/commonsearch/cosr-back

Common Search Pipeline
Doc sources
Common Crawl,
WARC ﬁles,
URLs ...
Filter
plugins
Document
parsing
Output
plugins
Data output
Database, ﬁle,
HDFS, S3, ...

SparkSQL PageRank
https://github.com/commonsearch/cosr-back/blob/master/spark/jobs/pagerank.py

Tests
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
https://github.com/commonsearch/cosr-back/blob/master/tests/sparktests/test_pagerank.py

https://about.commonsearch.org/developer/get-started

Spamdexing
• Keyword stufﬁng, hidden text
• Scraper sites, Mirrors
• Link farms
• Splogs, Comment spam
• Domaining
• Cloaking
• Bombing

Questions?
https://about.commonsearch.org/contributing
https://github.com/commonsearch
contact@commonsearch.org
slack.commonsearch.org

What's hot

RDF Graph Data Management in Oracle Database and NoSQL PlatformsGraph-TA

Graph Databases for SQL Server ProfessionalsStéphane Fréchette

Graphs, Graphs everywhere - Lucene powered relation explorationZbyszko Papierski

Search Intelligently - Liferay Symposium North America 2016, Chicago, USAAndré Ricardo Barreto de Oliveira

NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...South London Geek Nights

DBpedia JapaneseFumihiro Kato

Deriving an Emergent Relational Schema from RDF DataGraph-TA

Managing RDF data with graph databasesGraph-TA

NoSQL RoundupDaniel Fields

How Graph Databases efficiently store, manage and query connected data at s...jexp

Elasticsearch @JBoss.org, 2014Lukas Vlcek

What's hot (11)

RDF Graph Data Management in Oracle Database and NoSQL Platforms

Graph Databases for SQL Server Professionals

Graphs, Graphs everywhere - Lucene powered relation exploration

Search Intelligently - Liferay Symposium North America 2016, Chicago, USA

NoSQL: what does it mean, how did we get here, and why should I care? - Hugo ...

DBpedia Japanese

Deriving an Emergent Relational Schema from RDF Data

Managing RDF data with graph databases

NoSQL Roundup

How Graph Databases efficiently store, manage and query connected data at s...

Elasticsearch @JBoss.org, 2014

Viewers also liked

Why and how Pricing Assistant migrated from Celery to RQ - Paris.py #2Sylvain Zimmer

PyCon FR 2016 - Et si on recodait Google en Python ?Sylvain Zimmer

Predictive modeling healthcareTaposh Roy

Building distributed processing system from scratch - Part 2datamantra

Introduction to Structured Streamingdatamantra

Keyboard covert channelsFreeman Zhang

AMP Camp 5 Introjeykottalam

Spark sqlFreeman Zhang

Introduction to datasetdatamantra

Evolution of apache sparkdatamantra

Anatomy of Spark SQL Catalyst - Part 2datamantra

Spark on yarndatamantra

Getting Started Running Apache Spark on Apache MesosPaco Nathan

Anatomy of in memory processing in Sparkdatamantra

Building a modern Application with DataFramesSpark Summit

Kafka and Spark Streamingdatamantra

Building Distributed Systems from Scratch - Part 1datamantra

Introduction to Structured Data Processing with Spark SQLdatamantra

Resilient Distributed DataSets - Apache SPARKTaposh Roy

Building Distributed Systems in ScalaAlex Payne

Viewers also liked (20)

Why and how Pricing Assistant migrated from Celery to RQ - Paris.py #2

PyCon FR 2016 - Et si on recodait Google en Python ?

Predictive modeling healthcare

Building distributed processing system from scratch - Part 2

Introduction to Structured Streaming

Keyboard covert channels

AMP Camp 5 Intro

Spark sql

Introduction to dataset

Evolution of apache spark

Anatomy of Spark SQL Catalyst - Part 2

Spark on yarn

Getting Started Running Apache Spark on Apache Mesos

Anatomy of in memory processing in Spark

Building a modern Application with DataFrames

Kafka and Spark Streaming

Building Distributed Systems from Scratch - Part 1

Introduction to Structured Data Processing with Spark SQL

Resilient Distributed DataSets - Apache SPARK

Building Distributed Systems in Scala

Similar to Ranking the Web with Spark

Information Architecture - Don't Make Me ThinkKerry Dirks MCPS MS

2015 Data Science Summit @ dato ReviewHang Li

Mark Tortoricci - Talent42 2015Talent42

SPLive Orlando - Beyond the Search Center - Application or Solution?Agnes Molnar

Feature driven agile oriented web applicationsRam G Athreya

Search Engine Optimization (SEO) 101pointit

SEO in the Age of Artificial Intelligence | How AI influences SearchPhilipp Klöckner

State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...Big Data Spain

Ordering the chaos: Creating websites with imperfect dataAndy Stretton

Arron daniels 1 pager researching the tech talent marketTalent42

Data Workflows for Machine Learning - SF Bay Area MLPaco Nathan

OSCON 2014: Data Workflows for Machine LearningPaco Nathan

Search Analytics for Fun and ProfitLouis Rosenfeld

The Human Side of ProductivityTathagat Varma

Tri vuong-resumeTri Vuong

Rapid Data Exploration With HadoopPeter Skomoroch

Semtech bizsemanticsearchtutorialBarbara Starr

Search Analytics: Powerful diagnostics for your siteLouis Rosenfeld

Data Driven Design: Using Web Analytics to Improve Information ArchitecturesAndrea Wiggins

A fresh new look into Information Gathering - OWASP SpainChristian Martorella

Similar to Ranking the Web with Spark (20)

Information Architecture - Don't Make Me Think

2015 Data Science Summit @ dato Review

Mark Tortoricci - Talent42 2015

SPLive Orlando - Beyond the Search Center - Application or Solution?

Feature driven agile oriented web applications

Search Engine Optimization (SEO) 101

SEO in the Age of Artificial Intelligence | How AI influences Search

State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...

Ordering the chaos: Creating websites with imperfect data

Arron daniels 1 pager researching the tech talent market

Data Workflows for Machine Learning - SF Bay Area ML

OSCON 2014: Data Workflows for Machine Learning

Search Analytics for Fun and Profit

The Human Side of Productivity

Tri vuong-resume

Rapid Data Exploration With Hadoop

Semtech bizsemanticsearchtutorial

Search Analytics: Powerful diagnostics for your site

Data Driven Design: Using Web Analytics to Improve Information Architectures

A fresh new look into Information Gathering - OWASP Spain

Recently uploaded

"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays

Pigging Solutions in Pet Food ManufacturingPigging Solutions

Bluetooth Controlled Car with Arduino.pdfngoud9212

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays

My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar

E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxnull - The Open Security Community

"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays

Artificial intelligence in the post-deep learning eraDeakin University

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos

Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst

Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi

My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer

Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar

Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm

New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada

Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge

Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55

AI as an Interface for Commercial BuildingsMemoori

Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software

Recently uploaded (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko

Pigging Solutions in Pet Food Manufacturing

Bluetooth Controlled Car with Arduino.pdf

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...

My Hashitalk Indonesia April 2024 Presentation

E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx

"Federated learning: out of reach no matter how close",Oleksandr Lapshyn

Artificial intelligence in the post-deep learning era

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)

Human Factors of XR: Using Human Factors to Design XR Systems

Vertex AI Gemini Prompt Engineering Tips

My INSURER PTE LTD - Insurtech Innovation Award 2024

Unleash Your Potential - Namagunga Girls Coding Club

Streamlining Python Development: A Guide to a Modern Project Setup

New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024

Designing IA for AI - Information Architecture Conference 2024

Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...

AI as an Interface for Commercial Buildings

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation

Ranking the Web with Spark

1. Ranking the Web with Spark Apache Big Data Europe 2016 sylvain@sylvainzimmer.com @sylvinus

2. /usr/bin/whoami • Jamendo (Founder & CTO, 2004-2011) • TEDxParis (Co-founder, 2009-2012) • dotConferences (Founder, 2012-) • Pricing Assistant (Co-founder & CTO, 2012-)

3. transparency reproducibility

5. https://uidemo.commonsearch.org

6. https://explain.commonsearch.org/?q=python&g=en

7. Ranking

8. Disclaimer: IANASRE (I Am Not A Search Relevance Engineer)

9. What's in a score score = fn( doc, query, language, user, time )

10. What's in a score score = fn( doc, query )

11. What's in a score score = fn( static_score, dynamic_score ( query ))

12. Static score

13. Static features • Scopes: • Page: URL depth, markup stats, ... • Domain: Age, page count, blacklists, ... • WebGraph: PageRank, ...

14. http://infolab.stanford.edu/~backrub/google.html The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Crawler Indexer Database SearcherRanker

15. Dynamic score

16. Dynamic features • Text match: TF-IDF, BM25, proximity, topic, ... • Query-level: number of words, popularity, ... • Usage: clicks, dwell time, reformulations, ... • Time

17. Scoring function

18. Users Database Elasticsearch Indexer Python, Spark Data sources Common Crawl, Alexa top 1M, ... words, static score query top 10 docs, ﬁnal scores Ofﬂine Online Searcher Go

19.

20. https://explain.commonsearch.org/?q=python&g=en

21. Issues with this architecture • Static & dynamic scoring are in different codebases • No control over result diversity • Hard to optimize • Very dependent on Elasticsearch

22. Rescoring

23. Users Database Indexer words, static score, features query Searcher top 1k docs, features Rescorer ﬁnal 10 docs

24. Issues with rescoring • Latency • Pagination • Harder to explain

25. Learning to rank

26. LTR Model • Features • Training dataset • Evaluation: NDCG, ERR, ... • Algorithms: AdaRank, ListNet, LambdaMART, ... • Learning with Spark!

27. The right questions • What do users expect? • What features? • How to evaluate and ﬁne-tune in the real world?

28. PageRank with Spark

29.

30. http://commoncrawl.org

31. https://github.com/commonsearch/cosr-back

32. Common Search Pipeline Doc sources Common Crawl, WARC ﬁles, URLs ... Filter plugins Document parsing Output plugins Data output Database, ﬁle, HDFS, S3, ...

33. Most popular Wikipedia pages

34. Dumping the web graph

35. Naive pyspark PageRank

36. GraphFrames

37. SparkSQL PageRank

38. SparkSQL PageRank https://github.com/commonsearch/cosr-back/blob/master/spark/jobs/pagerank.py

39. Tests http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm https://github.com/commonsearch/cosr-back/blob/master/tests/sparktests/test_pagerank.py

40. https://about.commonsearch.org/developer/get-started

41.

42. Top 10

43.

44. Spam