
DataScience Meeting II - Roman Kern - Building an open source based search solution - first steps


DataScience Talk by Roman Kern, Know Center - Graz University of Technology
Date: April 12th 2012
Graz, Austria

Published in: Technology

  1. Building an open-source based search solution – first steps. Roman Kern, Institute of Knowledge Management, Graz University of Technology, Know-Center Graz. Data Science Meetup, 2012-04-12
  2. Overview: Motivation; Background; Solr Ecosystem; Solr Features; Conclusions
  3. Motivation. Search: change in users' expectations; missing or sub-optimal search causes frustration. Science: information retrieval; success story; mostly focused on web search. Industry: enterprise search; heterogeneous data sources.
  4. Background of the Speaker
  5. Apache Lucene Umbrella Project. Components: search engine ⇒ Lucene; search server ⇒ Solr; web search engine ⇒ Nutch; lightweight crawler ⇒ Droids; file-format parsing ⇒ Tika; communication with CMS ⇒ ManifoldCF; distributed coordination ⇒ ZooKeeper; natural language processing ⇒ OpenNLP. Related projects: Hadoop, Mahout, Carrot2, ... Common aspects: Apache license, implemented in Java, community.
  6. Lucene. Search engine library: Java API; only for expert users. Search index: file-system index; in-memory index. Advanced features: incremental indexing; update while searching. Base for many projects: Solr, ir-lib, elasticsearch. LIA (Lucene in Action).
  7. Nutch. Web search engine: builds upon Solr; web crawler; link database, crawl database. Distributed: runs on Hadoop. Mode of operation: crawl a single domain; crawl the web with seed sites.
  8. Droids. Crawler component: lightweight crawler. Main features: throttling; multi-threaded; well behaved (robots.txt).
  9. Tika. Text extraction: text & meta-data. File formats: Office (Microsoft formats via Apache POI, OpenDocument); common text formats (PDF via PDFBox, HTML via tagsoup); non-text (images, sound).
  10. ManifoldCF. Content management system connectors: communicate with CMS/DMS. Connectors: FileNet P8 (IBM); Documentum (EMC); LiveLink (OpenText); Meridio (Autonomy); Windows shares (Microsoft); SharePoint (Microsoft); more: Alfresco, JDBC, ... Data is then stored and indexed, e.g. in Solr.
  11. ZooKeeper. Distributed coordination: orchestrate servers. Distributed: configuration; name lookup; synchronization.
  12. OpenNLP. Natural language processing: process plain text; maximum entropy classification with beam search. Models: sentence splitting; token splitting; part-of-speech (POS) tagging; named entity recognition; more: chunker, parser, co-reference resolution.
  13. Hadoop. Distributed computing: scale-out framework. Distributed file-system: data is partitioned; stored on multiple nodes. Map/Reduce paradigm: map your algorithms to mappers & reducers. Related projects: HBase, Pig, Hive, ...
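
The Map/Reduce paradigm from the Hadoop slide can be illustrated without a cluster. The following is a minimal single-process sketch in plain Python, not Hadoop's API; the function names (`mapper`, `shuffle`, `reducer`, `word_count`) and the sample documents are made up for this illustration.

```python
from collections import defaultdict
from itertools import chain


def mapper(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce step: sum all counts emitted for one word."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = chain.from_iterable(mapper(doc) for doc in documents)
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["solr builds on lucene", "nutch builds on solr"])
print(counts)
```

In a real deployment the mappers and reducers would run on different nodes, each over its own partition of the data; here everything runs in one process to keep the data flow visible.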
  14. Mahout. Distributed machine learning: scale-out framework. Machine learning: recommender systems; clustering; classification. Integration: standalone; Hadoop; Amazon EC2.
  15. Details
  16. Search Server. What Solr is: web service; full-text indexing & search; support to store arbitrary content. What Solr isn't: grep; a database (but somehow similar to NoSQL databases). Solr vs. IR-Lib: Solr is easy to use, easy to integrate, XML configuration; IR-Lib needs expert knowledge to use, Java configuration, fast.
  17. Index Structure. Inverted index: dictionary of words (terms); map from term to document. Document: list of fields; input fields are then mapped according to the schema. Field types: defined in the schema; type (string, boolean, date, number), internally mapped to string.
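
The inverted index on the slide above, a dictionary of terms mapping each term to the documents that contain it, can be sketched in a few lines. This is a toy illustration under simplifying assumptions (whitespace tokenization, a single text field, no positions or scores), not how Lucene stores its index; the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map every term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Return the ids of documents that contain every query term (AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Lucene is a search library",
    2: "Solr is a search server",
    3: "Tika extracts text",
}
index = build_inverted_index(docs)
print(search_all(index, "search server"))
```

Looking up a term is a dictionary access, so the cost of a query depends on the number of matching documents rather than on the size of the collection, which is the main reason the structure is used for full-text search.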
  18. Index Management. API: HTTP server; various formats (XML, binary, JavaScript, ...). Document life-cycle: there is no update; delete (done automatically by Solr); insert. Implications: a unique id is necessary; use batch updates; commit, rollback (and optimize).
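
As a sketch of the batch update described above, the snippet below builds one Solr XML `<add>` message for several documents; re-sending a document with an existing unique id replaces the old version, which is the delete-then-insert behaviour the slide mentions. The field names `id` and `title` are assumptions about a hypothetical schema, and the message is only built here, not sent to a server.

```python
import xml.etree.ElementTree as ET


def build_add_message(documents):
    """Serialize a batch of documents into a single Solr XML <add> message."""
    add = ET.Element("add")
    for document in documents:
        doc = ET.SubElement(add, "doc")
        for name, value in document.items():
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical documents; "id" plays the role of the unique key.
batch = [
    {"id": "doc-1", "title": "Lucene"},
    {"id": "doc-2", "title": "Solr"},
]
message = build_add_message(batch)

# Changes become visible to searches only after a commit message is sent.
COMMIT_MESSAGE = "<commit/>"
print(message)
```

Sending many documents in one message followed by a single commit is usually cheaper than committing after every document, which is why the slide recommends batch updates.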
  19. Input Handling. Different input formats: XML; CSV; JDBC (database) via DIH (data import handler), which supports incremental updates (via timestamps); Solr Cell for binary content, using Apache Tika to extract text content and metadata.
  20. Text Processing. Scope: during indexing & query. Tokenization: split text into tokens; lower-casing; stemming (e.g. ponies, pony ⇒ poni; triplicate ⇒ triplic; ...); synonyms (via thesaurus); stop-word filtering; multi-word splitting (e.g. Wi-Fi ⇒ Wi, Fi); n-grams, soundex, umlauts.
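
The processing chain above can be sketched as a small pipeline. This is an illustration only: the stop-word list is invented, and `toy_stem` is a deliberately crude suffix stripper that happens to reproduce the ponies/pony example, not the stemmer Solr ships with.

```python
import re

# Hypothetical stop-word list for this illustration.
STOP_WORDS = {"the", "a", "an", "and", "of", "on"}


def tokenize(text):
    """Split on anything that is not a letter or digit, so Wi-Fi yields Wi and Fi."""
    return [token for token in re.split(r"[^0-9A-Za-z]+", text) if token]


def toy_stem(token):
    """Strip a few English suffixes; a toy rule set, not a real stemming algorithm."""
    if token.endswith("ies"):
        return token[:-3] + "i"
    if token.endswith("y") and len(token) > 2:
        return token[:-1] + "i"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text):
    """Tokenize, lower-case, drop stop words, then stem."""
    tokens = [token.lower() for token in tokenize(text)]
    tokens = [token for token in tokens if token not in STOP_WORDS]
    return [toy_stem(token) for token in tokens]


print(analyze("The ponies and a pony on Wi-Fi"))
# → ['poni', 'poni', 'wi', 'fi']
```

The same chain has to run at indexing time and at query time, otherwise a query for "pony" would not match the indexed term "poni".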
  21. Query Processing. Query parsers. Lucene query parser (rich syntax): AND, OR, NOT, range queries, wildcards, fuzzy query, phrase query; boosting of individual parts; example: ((boltzmann OR schroedinger) NOT einstein). Dismax query parser: no query syntax; searches over multiple fields (separate boost for each field); configure the number of terms that must match; distance between terms is used for ranking (phrase boosting). Dismax is a good starting point, but may become expensive.
  22. Search Features. Query filter: additional query; no impact on ranking; results are cached. Boosting query: only in Dismax. Query elevation: fix certain queries. Request handler: pre-define clauses; invariants.
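
A Dismax request with a filter query, as described on the two slides above, might be assembled as follows. The request parameters (`q`, `defType`, `qf`, `mm`, `pf`, `fq`) are standard Solr parameters; the server URL, the field names `title`, `body` and `format`, and the helper `dismax_url` are assumptions made for this sketch. The URL is only built and parsed back, no request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed location of a local Solr select handler; adjust to the actual setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def dismax_url(user_query, field_boosts, min_match, phrase_fields=(), filters=()):
    """Build a select URL that uses the Dismax parser with per-field boosts."""
    params = [
        ("q", user_query),               # plain user input, no query syntax
        ("defType", "dismax"),           # select the Dismax query parser
        ("qf", " ".join(f"{name}^{boost}" for name, boost in field_boosts.items())),
        ("mm", min_match),               # how many terms must match
    ]
    if phrase_fields:
        params.append(("pf", " ".join(phrase_fields)))  # phrase boosting
    # Filter queries restrict the result set without affecting the ranking.
    params.extend(("fq", filter_query) for filter_query in filters)
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = dismax_url(
    "boltzmann schroedinger",
    {"title": 2.0, "body": 1.0},
    min_match="2",
    phrase_fields=["title"],
    filters=["format:pdf"],
)
params = parse_qs(urlsplit(url).query)
print(url)
```

Keeping restrictions such as `format:pdf` in `fq` rather than in `q` lets Solr cache the filter independently of the user's query text.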
  23. Search Result. Ranking: relevance; sort on field value (only single term per document). Available data & features: sequence of IDs & score; stored fields; snippets (plus highlighting). Facets: count the search hits; types: field value, dates, queries; sort, prefix, ...; could be used for term suggestion (aka query suggestion). Field collapsing (grouping). Spell checking (did-you-mean).
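
Field-value faceting amounts to counting the search hits per distinct field value. The sketch below does this over an in-memory list of hits to show the idea, including the prefix restriction that makes facets usable for term suggestion; it is not Solr's implementation, and the hit structure and the `format` field are hypothetical.

```python
from collections import Counter


def facet_counts(hits, field, prefix="", limit=None):
    """Count hits per value of one field, most frequent values first."""
    counts = Counter(
        hit[field]
        for hit in hits
        if field in hit and str(hit[field]).startswith(prefix)
    )
    return counts.most_common(limit)


# Hypothetical search hits, each with a single-valued "format" field.
hits = [
    {"id": 1, "format": "pdf"},
    {"id": 2, "format": "html"},
    {"id": 3, "format": "pdf"},
    {"id": 4, "format": "png"},
]
print(facet_counts(hits, "format"))
print(facet_counts(hits, "format", prefix="p"))
```

With a prefix taken from what the user has typed so far, the most frequent matching values can be offered as suggestions.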
  24. Additional Solr Features. Query by example: more like this. Stats: per field; min, max, sum, missing, ... Admin GUI: webapp to troubleshoot queries; browse schema. JMX: read properties & statistics; can be accessed remotely.
  25. Integration. Deployment: within a web application server; embedded. Monitor: log output. Access: various language bindings (Java, Ruby, JavaScript, PHP, ...).
  26. Multi-core. Multiple indices: each index has its own configuration. Operations: reload (when configuration has been changed); rename; swap; merge; create, status.
  27. Scale Solr. Replication: master and slave nodes; slaves poll the master. Dispatch search requests: load balancer.
  28. Sharding Indexes. Single index: index spread over multiple machines; search is done in parallel. Mapping: application has to provide a deterministic mapping document ⇒ index.
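
One common way to provide the deterministic document ⇒ index mapping is to hash the document's unique id and take the result modulo the number of shards. The sketch below assumes string ids and a fixed shard count; `shard_for` is a hypothetical helper, not part of Solr.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Map a document id to a shard number in the range 0..num_shards-1."""
    # A cryptographic digest is used only because it is stable across
    # processes and machines, unlike Python's built-in hash() for strings.
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


for doc_id in ["doc-1", "doc-2", "doc-3"]:
    print(doc_id, "goes to shard", shard_for(doc_id, 4))
```

The mapping has to stay the same for the lifetime of the index: a later version of a document must reach the shard holding the earlier one, so that it replaces it instead of creating a duplicate. Changing the number of shards therefore means re-indexing under this scheme.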
  29. Conclusions. Ecosystem: vibrant community; corporate backing. Solr: easy to get started; hard to optimize for specific requirements.
  30. The End. Thank you!