Search engines in the industry
Usage Rights

CC Attribution License


  • Search engines in the industry: a use case
  • Different interests ● researchers / engineers look for high precision and recall ● editors / writers are concerned about matching of queries and results ● marketers want to change / adapt results
  • Designing a search engine ● functional requirements ○ search ■ keywords, boolean retrieval, natural language ○ indexing ■ data sources ■ data types ○ administration ■ manage scoring / boosting functions
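As an illustration of the "boolean retrieval" requirement above, here is a minimal sketch of an inverted index with AND semantics. It is not how Solr or Lucene implement it; the function names `build_index` and `search_and` and the sample documents are assumptions made for this example only.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search_and(index, query):
    """Return ids of documents that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # .get() avoids inserting empty entries into the defaultdict.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample documents, for illustration only.
docs = {
    1: "solr is built on lucene",
    2: "lucene is io intensive",
    3: "nfs makes indexing slower",
}
index = build_index(docs)
```

Calling `search_and(index, "lucene is")` intersects the posting sets of both terms, so only documents containing both words are returned.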
  • Designing a search engine ● architectural requirements ○ resiliency ○ scalability ○ no downtime ○ work with existing infrastructure ○ platforms ○ migrating from legacy systems ○ talk to other systems
  • Designing a search engine ● performance requirements ○ search ■ queries per second ■ time per search request ○ index ■ documents per second ■ time per indexing request ○ SLA?
  • Designing a search engine ● search engine performance requirements ○ recall percentiles threshold ○ precision percentiles threshold ○ minimize empty results
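The quality requirements above can be made measurable. The sketch below computes precision and recall for one query from a set of returned ids and a set of ids judged relevant, plus the share of queries that returned nothing. The function names and the assumption that relevance judgments are available as plain id sets are ours, not the presentation's.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = relevant results returned / all results returned
    recall    = relevant results returned / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def empty_result_rate(results_per_query):
    """Fraction of queries whose result list is empty."""
    if not results_per_query:
        return 0.0
    empty = sum(1 for results in results_per_query if not results)
    return empty / len(results_per_query)
```

For example, returning documents 1 to 4 when documents 2, 4 and 5 are relevant gives a precision of 0.5 and a recall of 2/3.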
  • Data ● often mostly unknown ○ published vs unpublished / to be written documents ● almost always unmanageable ○ cannot decide when ■ it’ll be ready ■ it’ll have to be indexed ■ it’ll have to be searchable ● heterogeneous ○ different writers, languages, topics, styles, etc.
  • Process
  • Project ● ~50M heterogeneous documents ● Migrating from an old commercial solution to Apache Solr ● Google-like search ● Targeted search for different types of content
  • Advanced capabilities ● Smart understanding of queries ● Smart suggestion of queries ● Suggestion of similar / important content ● Automatic classification of content
  • Responsibilities ● architecture analysis and design ○ scaling under high load ● continuous definition of algorithms for indexing and searching ● system maintenance
  • Skills required ● basics of information retrieval ● a bit of distributed systems ● some natural language processing ● some machine learning
  • Architecture analysis and design ● Shape up a prototype architecture ○ separate machines for indexing and search ○ multiple load balanced machines for searching ○ define indexing and search algorithms ● Evaluate architecture ○ stress tests (performance) ○ quality tests (accuracy) ● Iterate
  • Architecture analysis and design ● analyze existing documents ○ avg size ○ language ○ topics, style, etc. ● analyze existing query logs ○ avg response time ○ avg length (how much does it take to specify a query?) ○ avg queries per second
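The query-log metrics listed above can be derived with a few lines of code. This is a minimal sketch that assumes each log entry is a hypothetical `(timestamp_seconds, query_text, response_ms)` tuple; real logs will need their own parsing step, and query length is measured here in whitespace-separated terms.

```python
from statistics import mean


def summarize_query_log(entries):
    """Summarize a query log given as (timestamp_s, query, response_ms) tuples."""
    if not entries:
        raise ValueError("query log is empty")
    timestamps = [entry[0] for entry in entries]
    span = max(timestamps) - min(timestamps)
    return {
        "avg_response_ms": mean(entry[2] for entry in entries),
        "avg_query_terms": mean(len(entry[1].split()) for entry in entries),
        # Fall back to the raw count when all entries share one timestamp.
        "avg_qps": len(entries) / span if span > 0 else float(len(entries)),
    }


# Hypothetical log entries, for illustration only.
log = [
    (0.0, "apache solr", 40),
    (1.0, "nfs", 60),
    (2.0, "lucene io performance", 80),
    (4.0, "search", 20),
]
summary = summarize_query_log(log)
```

Averages hide peaks, so a real analysis would likely also look at percentiles of response time and at the busiest intervals.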
  • Most time spent on ● testing how documents get indexed ● testing how user queries get transformed into platform-specific queries ● tweaking indexing algorithms ● tweaking search algorithms ● tweaking ranking ● platform optimization for scalability
  • Challenges ● Architecture constraints ● Performance ● Diverging stakeholder concerns ● Dynamically scaling search
  • Sample architecture constraint #1 ● Data storage has to be on NFS ● Lucene is IO intensive ● NFS makes it slower ● Concurrent reads and writes make it error-prone
  • Sample architecture constraint #2 ● Change search engine ● Systems talking to the SE need to switch API ● Only in the long run ● In the short run an adapter layer for old APIs on new APIs has to be developed
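The short-run adapter layer described above can be pictured as a thin wrapper that keeps the old call shape and delegates to the new engine. Everything in this sketch is hypothetical: `NewSearchClient` is a toy stand-in for the new engine's client, and `find(keywords, max_hits)` is an invented legacy signature, not the API of any real product.

```python
class NewSearchClient:
    """Toy stand-in for the new engine's client (hypothetical interface)."""

    def __init__(self, docs):
        self._docs = docs

    def query(self, q, rows=10):
        # Naive substring match on the title, just to make the sketch runnable.
        matches = [d for d in self._docs if q.lower() in d["title"].lower()]
        return {"numFound": len(matches), "docs": matches[:rows]}


class LegacySearchAdapter:
    """Exposes the old call shape on top of the new client."""

    def __init__(self, client):
        self._client = client

    def find(self, keywords, max_hits=10):
        # Translate the legacy arguments into a new-style request...
        response = self._client.query(" ".join(keywords), rows=max_hits)
        # ...and the new-style response back into the legacy result shape.
        return [(d["id"], d["title"]) for d in response["docs"]]


# Hypothetical documents, for illustration only.
client = NewSearchClient([
    {"id": 1, "title": "Apache Solr basics"},
    {"id": 2, "title": "Lucene internals"},
])
adapter = LegacySearchAdapter(client)
```

Existing callers keep using `adapter.find([...])` unchanged, while the translation in both directions is the extra work that can later become a bottleneck.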
  • Indexing performance ● Most of the indexing time is spent converting data from the old indexing format to the new indexing format ● The adapter layer between old and new API becomes the bottleneck ● Time to switch to the new API natively
  • Diverging concerns ● Article authors check that the search engine handles their writings exactly, wanting perfect recall and precision ○ so a lot of time is spent on adjusting ranking ● Marketers want to be able to override ranking and promote something they want to sell ○ ranking algorithm gets breached ● Need flexible algorithms
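One way to keep ranking flexible is to apply marketing promotions as a separate, bounded step after relevance ranking rather than inside the scoring function. The sketch below is an assumption about how that could look, not the approach used in the project; `apply_promotions` and its `max_promoted` cap are hypothetical names.

```python
def apply_promotions(ranked_ids, promoted_ids, max_promoted=1):
    """Place up to max_promoted promoted ids first, keeping the rest in rank order."""
    promoted = list(promoted_ids)[:max_promoted]
    # Drop promoted ids from the organic list so nothing appears twice.
    organic = [doc_id for doc_id in ranked_ids if doc_id not in promoted]
    return promoted + organic
```

For instance, `apply_promotions(["a", "b", "c"], ["c", "x"], max_promoted=1)` moves `"c"` to the top and leaves `"a"` and `"b"` in their original order. Capping the number of promoted slots limits how far the relevance ranking that authors care about can be overridden.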
  • Scale dynamically ● Search engine must not break even under high peaks of load ● Such peaks are often unpredictable ● Need a fast way to add more computing power
  • Takeaways ● small iterations (no waterfalls!) ○ analyze a portion of data / queries ○ change search / index algorithms ○ test, involve stakeholders ○ requires the ability to reindex quickly ● look at data (documents, query logs)