Naukri Search Team achievements, 2009-2010

Published in: Technology, Education

  1. Search Team Engineering Achievements
  2. Agenda
     • Challenges
     • Why a Platform?
     • Information Extraction
         • Need, Impact
         • Research / Evaluations
         • Approach / Implementation
     • Information Retrieval
         • Need, Impact
         • Research / Evaluations
         • Approach / Implementation
  3. Challenges
     • Job Alerts
         • Over 13 Million searches, 3 times a week
         • Complex Matching: Multiple Filters, Boosts, Sorts
     • Resdex
         • 130K active users daily
         • 470K searches daily
         • Over 220 million resumes and growing
     • Job Search
         • High QPS 112, 760K searches a day
         • Near Real-time Indexing
         • Jobs Refreshed 92 times daily
     • Product Demands
         • > 99.99% uptime, Stability, Scalability → User Experience
         • Varied Functional Requirements (Complexity)
         • NIRM, FN Suggestors, etc.
         • Turnaround Time
         • Over 17 applications and growing
         • About a week to deploy / configure a new one
  4. Why a Platform?
     • Technical Challenges
         • Code / Bug Duplication, Reusability
         • Agility
         • Product Requirements Drive Platform-Wide Features
         • SOA, Integration, Business Logic Separation
         • Comprehensive Documentation
         • Scalability
         • Development and QA Time/Cost Reduction
     • Product Challenges
         • Turnaround Time
         • Business Logic Implementation = Configuration
     • Miscellaneous
         • Maintenance Cost Reduction
         • Resource Optimization/Integration (...Cloud)
         • Standardized Reporting / Health Monitoring
  5. Information Extraction
     • Data/Information Acquisition
     • Structurize Raw Information
         • Training based Models for Class Inference
             • Functional Area Detection
         • Rule based Extraction
             • Nested Funnels/Filter Layers
             • Regular Expressions
     • Feedback Loop
         • Wisdom of Crowd/Collective Intelligence
             • SAP/SimCV: Capture User Response for Recommendations
         • Continuous Quality Improvement
  6. IE: Use Cases/Impact
     • Resume Parsing
         • Resman (unreg apply flow)
         • Accuracy ~75%
         • Dropouts reduced ~44%
     • Job Acquisition/Aggregation
         • Naukri India:
             • JobMail: 23 Sites, 8.8K Jobs
             • JobPosting: 16 Competition Sites, 472K Jobs
         • Naukri Gulf: JobAlerts, JobSearch
             • 21 Sites, 6.5K Jobs
     • Taxonomy Acquisition (Entity Extraction)
         • FN Institute Names
         • Contact Information
  7. IE: Research / Evaluation
     • Nutch
     • Selenium
     • Celerity
     • UIMA*
     • Heritrix
     • HttpUnit
     • HtmlUnit*
     • OpenNLP*
     • Net::LWP*
  8. IE: Architecture
  9. Crawler Framework
     • Crawler Propagation Capabilities
         • JavaScript/Ajax/Event Support
         • Follow JavaScript Links
         • URL Detection (Final URL Presentation)
         • URLs obtained via JavaScript Execution
         • Recursively Redirected URLs
         • Handle DOM Events: button, link, check-box, click, mouseOver, doubleClick
         • URLs obtained after form-submission
  10. Crawler Framework (contd.)
      • Browser Emulation
          • Built over Headless Browser
          • Human-Like Propagation Strategies
          • Handles Cookies
          • Handles POST/GET methods
      • Compliance (Obeys Robots.txt)
      • Configurable Stateful/Delta Crawling
      • Nested Multi-threaded Execution
      • Pause/Resume/Restart Capabilities at Site/Seed URL levels
      • Controllable Depth
  11. Crawler Framework (contd.)
      • Real-time Crawler Statistics
          • State Information
          • MISes
      • Crawl-Payload persistence strategies
          • Multiple, Combinable Persistence Modules
          • Multiple Output Format Support
              • Flat File, XML
              • JDBC-connectable data stores
              • Search Engine Index Formats (e.g. Lucene, Sphinx)
              • Archive Formats (bz, gz, rar, zip, ...)
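
Two of the crawler properties listed above, robots.txt compliance and controllable depth, can be illustrated with a minimal sketch. This is not the team's crawler: the in-memory `SITE` link graph, the `crawl` function, and the `demo-crawler` agent name are hypothetical stand-ins so the example runs without network access. Only `urllib.robotparser` from the Python standard library is real.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "site": URL -> outgoing links (replaces real fetching).
SITE = {
    "http://example.com/": ["http://example.com/jobs",
                            "http://example.com/private/admin"],
    "http://example.com/jobs": ["http://example.com/jobs/1",
                                "http://example.com/jobs/2"],
    "http://example.com/jobs/1": ["http://example.com/jobs/1/apply"],
    "http://example.com/jobs/2": [],
    "http://example.com/jobs/1/apply": [],
    "http://example.com/private/admin": [],
}

ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""


def crawl(seed, max_depth, robots, agent="demo-crawler"):
    """Breadth-first crawl that obeys robots.txt and stops at max_depth."""
    seen = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue  # compliance: skip disallowed URLs entirely
        order.append(url)
        if depth == max_depth:
            continue  # controllable depth: do not expand further
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())
visited = crawl("http://example.com/", max_depth=2, robots=robots)
print(visited)
```

With `max_depth=2` the `/jobs/1/apply` page is never reached, and the `/private/` URL is skipped because of the `Disallow` rule.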
  12. Extraction Strategies
      • Analysis Plugins
          • Entity Extraction
          • Composable Funnelling Filters: Sections, Subsections, ..., Entity
          • Regex-based Subpart matching
          • Corpus, NLP-based matching: UIMA, OpenNLP
          • Machine-Learning Approaches
              • Classification / Tagging (Bayesian, SVM)
              • Clustering
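
The idea of composable funnelling filters (section, then subsection, then entity) can be sketched as a chain of small functions, each narrowing the text before a regex pulls out the entity. The helper names (`section_filter`, `line_filter`, `entity_regex`, `funnel`) and the sample resume are illustrative assumptions, not the platform's actual plugin API.

```python
import re

RESUME = """\
SUMMARY
Search engineer, reachable at old.address@example.org on request.

CONTACT
Email: jane.doe@example.com
Phone: +91 98765 43210

EDUCATION
B.Tech, Example Institute of Technology
"""


def section_filter(name):
    """Narrow the text to the lines under heading `name`, up to a blank line."""
    pattern = re.compile(rf"^{re.escape(name)}\n(.*?)(?=^\s*$|\Z)", re.M | re.S)

    def apply(text):
        match = pattern.search(text)
        return match.group(1) if match else ""
    return apply


def line_filter(prefix):
    """Keep only lines starting with `prefix`, with the prefix stripped."""
    def apply(text):
        return "\n".join(line[len(prefix):].strip()
                         for line in text.splitlines()
                         if line.startswith(prefix))
    return apply


def entity_regex(pattern):
    """Final stage: return every regex match as an extracted entity."""
    compiled = re.compile(pattern)

    def apply(text):
        return compiled.findall(text)
    return apply


def funnel(*stages):
    """Compose stages so that each one only sees what the previous one kept."""
    def apply(text):
        for stage in stages:
            text = stage(text)
        return text
    return apply


extract_email = funnel(
    section_filter("CONTACT"),
    line_filter("Email:"),
    entity_regex(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
)
emails = extract_email(RESUME)
print(emails)
```

Because the section filter runs first, the address mentioned in the summary never reaches the entity regex, which is the point of funnelling: each layer cuts false positives for the next.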
  13. Information Retrieval
      • Custom/Controllable Relevance/Matching
      • Scalability of Search
          • Large Volumes
          • High Churn
          • QPS
      • Extraction/Acquisition Pipeline Pluggability
      • Results Post Processing
  14. IR: Research / Evaluations
      • MySQL Full-Text
      • FastESP
      • Solr
      • Sphinx
      • Lucene*
      • OpenNLP*
      • LingPipe
      • Mozilla Rhino (JavaScript)*
  15. IR: Use Cases/Impact
      • NSE on Resdex India
          • Relevance
  16. IR: Use Cases/Impact
      • Error Count: 91 the week before, 1 the week after
      • Availability (Before: 97.71% - 99.44%, After: 99.99%)
      • Performance
          • Slow Queries (10 secs): < 0.2%
          • Average Search Time: 0.55 secs
      • QA Quote: "There is an overall decrease in the page download time for Resdex Search Results page. In case the cache is cleared the page download time has decreased by 34% to 35%, while the page download time has drastically decreased, more than 73%, when checked without clearing cache."
      • NSE on Resdex FirstNaukri
          • PM Quote: "Hardly any bugs considering the complexity of project. Search results are also coming @ speed of thought."
  17. IR: Use Cases/Impact (contd.)
      • Improved Concurrency
  18. IR: Architecture
  19. IR: Platform Features
      • High Availability, Stability, Performance
          • Caching
              • Adaptive Caching of Hit Attributes
              • Caching of Expression Evaluations
              • Pre-configurable Caching Query Filters
          • Distributed Search
              • Search over Sharded Indexes
              • Auto Failover
              • Auto Healing
          • Search/Sort/Group Millions of results
              • Complex expressions
          • Miscellaneous
              • Status Reports, Performance Analytics
              • Suggestive Garbage Collection
              • Preload Indexes into RAM
              • Ease of Deployment
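
Search over sharded indexes with auto failover is commonly done as scatter-gather: query one live replica per shard, then merge the partial top-k lists. The sketch below assumes that pattern. The `Shard` class, the fixed per-document scores, and the replica layout are hypothetical simplifications and say nothing about how the platform itself is built.

```python
import heapq


class Shard:
    """Toy shard: doc_id -> (text, precomputed score). Scores are made up."""

    def __init__(self, name, docs, healthy=True):
        self.name = name
        self.docs = docs
        self.healthy = healthy

    def search(self, term, k):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        hits = [(score, doc_id)
                for doc_id, (text, score) in self.docs.items()
                if term in text]
        return heapq.nlargest(k, hits)


def search_with_failover(replicas, term, k):
    """Try each replica of one shard in order; fall through on failure."""
    for shard in replicas:
        try:
            return shard.search(term, k)
        except ConnectionError:
            continue
    return []  # every replica is down: degrade instead of failing the query


def distributed_search(shard_groups, term, k):
    """Scatter the query to every shard group, then merge the partial top-k."""
    partials = [search_with_failover(group, term, k) for group in shard_groups]
    return heapq.nlargest(k, (hit for part in partials for hit in part))


docs_a = {"r1": ("java search engineer", 0.9), "r2": ("php developer", 0.4)}
docs_b = {"r3": ("search relevance analyst", 0.7), "r4": ("java architect", 0.8)}
groups = [
    [Shard("a-primary", docs_a, healthy=False), Shard("a-replica", docs_a)],
    [Shard("b-primary", docs_b)],
]
top = distributed_search(groups, "search", 2)
print(top)
```

Each shard only returns its own top k, so the merge step handles at most k hits per shard regardless of index size.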
  20. IR: Platform Features: Text Transformations
      • Tokenization/Transformation/Tagging
          • Controlled, Combinable Stemming
              • Plural, Tenses, Noun-Forms, etc. [Relevance]
          • Inversion of Stem-roots
              • Highlighting/Did You Mean/Query Expansion
          • Phonetic Token Mapping/Augmentation
          • Custom Word Mapping/Synonyms (iMatch)
          • Linguistic Tagging
              • PoS, Entity Extraction
              • Match/Boost on Tags
          • Sentence Detection
          • Apply different analytics to different fields
          • Context Sensitive Spelling Correction
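
One plausible reading of "inversion of stem-roots" is a reverse map from each stem back to the surface forms that produced it, which is what highlighting and query expansion need. The sketch below assumes that reading and uses a deliberately crude suffix stripper. The `SUFFIXES` rules, `stem`, and `build_inverse` are illustrative only and are not the platform's stemmer.

```python
from collections import defaultdict

# Toy suffix rules, longest first; a real stemmer is far more careful.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_inverse(tokens):
    """Map every stem back to the set of surface forms seen in the corpus."""
    inverse = defaultdict(set)
    for token in tokens:
        inverse[stem(token)].add(token)
    return inverse


corpus = "tested testing tests test manager managers".split()
inverse = build_inverse(corpus)

# Query expansion: look up the stem of the query term, then fan out to
# every surface form that shares it (also the set of words to highlight).
expanded = sorted(inverse[stem("testing")])
print(expanded)
```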
  21. IR: Platform Features
      • Indexing
          • Dynamic Rule Based Sharding, Distributed Search
          • Multiple Data Source Type Support
          • (Near-)Real Time Indexing, Search
          • Generic Auxiliary Index Format
              • Fast Update/Retrieval
              • Realtime Per-User Filtering/Sorting
      • Matching/Filtering
          • Lucene Query Functionality
              • Phrase, Proximity, Fuzzy, Wildcard
          • FirstNaukri Suggestor Implementation
  22. IR: Platform Features
      • Result Grouping/Clustering
      • Expressions
          • Embedded JavaScript Support
          • Aggregate Functions (superset of SQL)
          • Sort/Group/Filter during indexing, search
      • Sorting
          • Dynamic/Stateful Sorting, e.g. for Ad Rotation
          • Quota-Based Resorting
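
The slides do not define quota-based resorting. One common interpretation, assumed here, is that each group (for example an employer) may occupy only a fixed number of leading positions, and results over the quota are pushed behind everyone else while keeping their relative order. `quota_resort` and the sample hits are hypothetical.

```python
from collections import Counter


def quota_resort(hits, key, quota):
    """Stable resort: at most `quota` hits per group stay in place,
    the overflow is appended afterwards in its original order."""
    counts = Counter()
    kept, overflow = [], []
    for hit in hits:
        group = key(hit)
        if counts[group] < quota:
            counts[group] += 1
            kept.append(hit)
        else:
            overflow.append(hit)
    return kept + overflow


# Hits arrive already sorted by relevance: (job id, employer).
hits = [
    ("job1", "acme"),
    ("job2", "acme"),
    ("job3", "acme"),
    ("job4", "globex"),
    ("job5", "initech"),
]
resorted = quota_resort(hits, key=lambda hit: hit[1], quota=2)
print([job for job, _ in resorted])
```

Here the third `acme` job drops behind the `globex` and `initech` jobs, so one employer cannot fill the top of the page.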
  23. IR: Platform Features
      • Scoring
          • Fully Controlled, Customizable Relevance Scores
          • More controllable/testable than Solr/Default Lucene Scoring
          • Named Query Parts usable in Expressions
          • Custom Scorer Variables: Vector Space, Query Boost, LCS, Numwords
      • Configurability, API
          • SQL-like client wrapper: Engine-App interactions look like SQL
          • XML configurability
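
To make "custom scorer variables usable in expressions" concrete, here is a small sketch in which a relevance expression is a plain function over named variables. It assumes LCS means the longest common subsequence of query words and field words, and Numwords the number of query words. Those definitions, the variable names, and the default expression are assumptions, not the platform's documented semantics.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def default_expression(v):
    """Hypothetical relevance expression over the scorer variables."""
    return v["boost"] * v["lcs"] / v["numwords"]


def score(query, field_text, boost=1.0, expression=default_expression):
    """Compute the scorer variables, then hand them to the expression."""
    query_words = query.lower().split()
    field_words = field_text.lower().split()
    if not query_words:
        return 0.0
    variables = {
        "lcs": lcs_length(query_words, field_words),
        "numwords": len(query_words),
        "boost": boost,
    }
    return expression(variables)


in_order = score("java developer", "Senior Java Developer")
shuffled = score("senior java developer", "Java Developer Senior Level")
print(in_order, shuffled)
```

Swapping `expression` for another function changes the ranking formula without touching how the variables are computed, which is the kind of separation that makes scoring easier to test.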
  24. Road Ahead
      "If you don't know where you are going, any road will get you there." - The Cheshire Cat, Alice in Wonderland.
  25. 25. Road Ahead Cloud → ` ` ` Semantic Relevance (Search is Dead!)  Contextual Information Extraction  NLP  Ontology Extraction
  26. Thanks!
