Solr

  •    SolrCloud is a distributed search solution based on Solr and ZooKeeper, and one of the core components of the in-development Solr 4.0. Its main idea is to use ZooKeeper as the central store for cluster configuration. Its key features: 1) centralized configuration; 2) automatic failover; 3) near-real-time search; 4) automatic load balancing of queries.

Transcript

  • 1. Solr. Hao Chen, 2012.04
  • 2. What is Solr? Solr is the popular open source enterprise search platform from the Apache Lucene project. Solr powers the search and navigation features of many of the world's largest internet sites.
  • 3. Lucene Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
  • 4. Lucene vs Solr Lucene is a search library built in Java. Solr is a web application built on top of Lucene; in short, Solr = Lucene + added features. A common question is when to choose Solr and when to choose Lucene: for more control, use Lucene; for faster development and an easier learning curve, choose Solr. http://www.findbestopensource.com/article-detail/lucene-vs-solr
  • 5. Why do we need Solr? Full-text search with MySQL "LIKE '%keyword%'" is too slow. And too weak!
  • 6. Major Features of Solr Advanced Full-Text Search Capabilities Optimized for High Volume Web Traffic Standards Based Open Interfaces - XML,JSON and HTTP Comprehensive HTML Administration Interfaces Server statistics exposed over JMX for monitoring Scalability - Efficient Replication to other Solr Search Servers Flexible and Adaptable with XML configuration Extensible Plugin Architecture http://lucene.apache.org/solr/
  • 7. Typical Application Architecture (diagram): http request -> Web Server -> Cache (memcached, Redis, etc.) and Database (MySQL) -> DIH -> Solr/Lucene. All the components can be distributed to make the architecture scalable.
  • 8. Lucene/Solr Architecture (diagram): request handlers (/select, /spell, /admin) with response writers (XML, JSON, CSV, binary); search components (query parsing, faceting, highlighting, spelling, statistics, debug, more-like-this, clustering, distributed search); update handlers and processors (XML, CSV, binary, Extracting Request Handler for PDF/Word via Apache Tika, Data Import Handler for SQL/RSS, signature/deduplication, logging, indexing); config and schema; caching, filtering, and index replication; all built on core Apache Lucene search and indexing (IndexReader/Searcher, IndexWriter, text analysis).
  • 9. Demo – A live website powered by Solr I’ll be showing you more later!
  • 10. Demo – The backend of the website
  • 11. Demo - Standard directory layout
  • 12. Demo - Multiple cores
  • 13. Demo – Run Solr! java -jar start.jar Production environment: – java -Xms200m -Xmx1400m -jar start.jar >>/home/web_logs/solr/solr$date.log 2>&1 & – tailf /home/web_logs/solr/solr20120423.log 2012-04-07 14:10:50.516::INFO: Started SocketConnector @ 0.0.0.0:8983
  • 14. Demo – Web Admin Interface http://localhost:8983/solr/admin
  • 15. Demo – Web Admin Interface http://localhost:8983/solr/admin • SCHEMA: This downloads the schema configuration file (XML) directly to the browser. • CONFIG: It is similar to the SCHEMA choice, but this is the main configuration file for Solr. • ANALYSIS: It is used for diagnosing potential query/indexing problems having to do with the text analysis. This is a somewhat advanced screen and will be discussed later. • SCHEMA BROWSER: This is a neat view of the schema reflecting various heuristics of the actual data in the index. We'll return here later. • STATISTICS: Here you will find stats such as timing and cache hit ratios. In Chapter 9, we will visit this screen to evaluate Solr's performance.
  • 16. Demo – Web Admin Interface http://localhost:8983/solr/admin • INFO: This lists static versioning information about internal components to Solr. Frankly, it's not very useful. • DISTRIBUTION: It contains Distributed/Replicated status information, only applicable for such configurations. • PING: Ignore this, although it can be used for a health-check in distributed mode. • LOGGING: This allows you to adjust the logging levels for different parts of Solr at runtime. For Jetty as we're running it, this output goes to the console and nowhere else.
  • 17. QueryIndexing
  • 18. Query INFO: [core1] webapp=/solr path=/admin/ping params={} status=0 QTime=2 Apr 23, 2012 5:42:46 PM org.apache.solr.core.SolrCore execute INFO: [core1] webapp=/solr path=/select params={wt=json&rows=100&json.nl=map&start=0&q=searchKey word:ipad2} hits=48 status=0 QTime=0
  • 19. Query INFO: [] webapp=/solr path=/select params={wt=json&rows=20&json.nl=map&start=0&sort =volume+desc&q=CId:50011744+AND+price: [100+TO+*]} hits=1547 status=0 QTime=41 q=CId:50011744+AND+price:[100+TO+*] sort=volume+desc start=0 rows=20 hits=1547 status=0 QTime=41
  • 20. Query q - 查询字符串,必需 fl - 指定返回那些字段内容,用逗号或空格分隔多个。 start - 返回第一条记录在完整找到结果中的偏移位置, 0 开始,一般分页用 。 rows - 指定返回结果最多有多少条记录,配合 start 来实现分页。 sort - 排序,格式: sort=<field name>+<desc|asc>[,<field name>+<desc|asc>]… 。示例:( inStock desc, price asc )表示先 “ inStock” 降序 , 再 “ price” 升序,默认是相关性降序。 wt - (writer type) 指定输出格式,可以有 xml, json, php, phps, 后面 solr 1.3 增加的,要用通知我们,因为默认没有打开。 fq - ( filter query )过滤查询,作用:在 q 查询符合结果中同时是 fq 查询 符合的,例如: q=mm&fq=date_time:[20081001 TO 20091031] ,找关键 字 mm ,并且 date_time 是 20081001 到 20091031 之间的。 More: http://wiki.apache.org/solr/CommonQueryParameters
  • 21. Demo – PHP Solr Client
  • 22. Query - Demo
  • 23. Indexing Data
  • 24. Indexing Data - Communicating with Solr – Direct HTTP or a convenient client API – Data streamed remotely or from Solr's filesystem
  • 25. Indexing Data - Data formats/sources – Solr-XML: Solr's native XML format for adding and updating documents. – Solr-binary: This is only supported by the SolrJ client API. – CSV: CSV is a character separated value format (often a comma). – Rich documents like PDF, XLS, DOC, PPT – Solr's DataImportHandler (DIH) contrib add-on is a powerful capability that can communicate with both databases and XML sources (for example: web services). It supports configurable relational and schema mapping options and supports custom transformation additions if needed. The DIH uniquely supports delta updates if the source data has modification dates.
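As a sketch of the Solr-XML format mentioned above (the field names here are illustrative, not from the slides):

```xml
<add>
  <doc>
    <field name="id">1001</field>
    <field name="name">iPad 2</field>
    <field name="price">399.0</field>
  </doc>
</add>
```

POSTing this body to /update, followed by a `<commit/>`, makes the document searchable.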
  • 26. Lucene/Solr Indexing (diagram): documents arrive via HTTP POST to /update (XML), /update/csv (CSV), or /update/extract (rich documents such as PDF/Word, via the Extracting Request Handler and Apache Tika), or are pulled by the Data Import Handler (SQL databases, RSS feeds); each handler feeds a per-handler Update Processor Chain (e.g. duplicate removal, custom transforms, logging), then the text analyzers, and finally the Lucene IndexWriter builds the Lucene index.
  • 27. Indexing Data - Schema schema.xml
  • 28. Advanced Chinese Word Segmentation ( 中文分词 ) DIH (Data Import Handler) Sharding Replication Performance Tuning
  • 29. Chinese Word Segmentation ( 中文分词 )
  • 30. Chinese Word Segmentation ( 中文分词 )
  • 31. Chinese Word Segmentation ( 中文分词 ) IKAnalyzer3.2.8.jar
  • 32. Chinese Word Segmentation ( 中文分词 ) For the underlying theory, see 《解密搜索引擎技术实战》 (Decoding Search Engine Technology in Practice).
  • 33. DIH (Data Import Handler) Most applications store data in relational databases or XML files, and searching over such data is a common use-case. The DataImportHandler is a Solr contrib that provides a configuration-driven way to import this data into Solr, both as "full builds" and using incremental delta imports. (diagram: MySQL -> JDBC/DIH -> Solr) • full-import • delta-import
  • 34. DIH (Data Import Handler)• Imports data from databases through JDBC (Java Database Connectivity)• Imports XML data from a URL (HTTP GET) or a file• Can combine data from different tables or sources in various ways• Extraction/Transformation of the data• Import of updated (delta) data from a database, assuming a last- updated date• A diagnostic/development web page• Extensible to support alternative data sources and transformation steps
  • 35. DIH (Data Import Handler) • curl 'http://localhost:8983/solr/dataimport' to verify the configuration. • curl 'http://localhost:8983/solr/dataimport?command=full-import' • curl 'http://localhost:8983/solr/dataimport?command=delta-import'
  • 36. DIH (Data Import Handler) - Full Import Example (full indexing) data-config.xml
  • 37. DIH (Data Import Handler) - Delta Import Example (incremental indexing) data-config.xml
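The slides show data-config.xml only as screenshots. A minimal sketch of such a configuration covering both cases, with the JDBC URL, table, and column names invented for illustration:

```xml
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/shop"
              user="solr" password="secret"/>
  <document>
    <!-- full-import runs "query"; delta-import runs "deltaQuery" to find
         changed ids, then "deltaImportQuery" to fetch each changed row -->
    <entity name="item"
            query="SELECT id, name, price FROM item"
            deltaQuery="SELECT id FROM item
                        WHERE last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, name, price FROM item
                              WHERE id = '${dataimporter.delta.id}'">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
      <field column="price" name="price"/>
    </entity>
  </document>
</dataConfig>
```

The delta path is what makes the incremental (增量) import on this slide possible: it assumes the source table carries a last-modified timestamp, as noted on slide 34.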
  • 38. DIH (Data Import Handler) - Demo Linux aaa 2.6.18-243.el5 #1 SMP Mon Feb 7 18:47:27 EST 2011 x86_64 x86_64 x86_64 GNU/Linux Intel(R) Xeon(R) CPU E5620 @ 2.40GHz cpu cores : 1 MemTotal: 2058400 kB 2 million rows imported in about 20 minutes.
  • 39. Sharding Sharding is the process of breaking a single logical index in a horizontal fashion across records versus breaking it up vertically by entities. S1 S2 S3 S4
  • 40. Sharding - Indexing
    SHARDS = ['http://server1:8983/solr/', 'http://server2:8983/solr/']
    unique_id = document[:id]
    if unique_id.hash % SHARDS.size == local_thread_id
      # index to this shard
    end
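The routing rule on this slide (hash the unique id, take it modulo the shard count) can be sketched as runnable code; the shard URLs are illustrative:

```python
# Pick a shard for a document by hashing its unique id. Every indexer
# applying the same rule routes a given document to the same shard.
import hashlib

SHARDS = ["http://server1:8983/solr/", "http://server2:8983/solr/"]

def shard_for(unique_id):
    # md5 gives a hash that is stable across processes and runs
    # (unlike Python's built-in hash(), which is randomized).
    digest = int(hashlib.md5(unique_id.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

print(shard_for("doc-1001"))
```

Note this simple modulo scheme means adding or removing a shard reassigns most documents, so the shard count should be chosen up front.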
  • 41. Sharding-QueryThe ability to search across shards is built into the query request handlers. You do not need to do any special configuration to activate it.
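Concretely, a distributed query is just an ordinary query with a shards parameter listing the shard hosts (host names here are illustrative); Solr fans the request out and merges the results:

```python
# Build a distributed query URL: the "shards" parameter lists shard
# addresses without the http:// prefix, separated by commas.
from urllib.parse import urlencode

params = {
    "q": "ipad",
    "shards": "server1:8983/solr,server2:8983/solr",
}
url = "http://server1:8983/solr/select?" + urlencode(params)
print(url)
```

Any node in the list can receive the request and act as the aggregator.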
  • 42. Replication Master Slaves
  • 43. Combining replication and sharding (diagram): masters M1, M2, M3 each hold one shard; replication copies each master's index to slaves S1, S2, S3 in two slave pools; queries are sent to the pools of slave shards.
  • 44. Combining replication and sharding http://wiki.apache.org/solr/SolrCloud http://zookeeper.apache.org/doc/r3.3.2/zookeeperOver.html
  • 45. Performance Tuning JVM http cache Solr Cache Better schema Better indexing strategy
  • 46. Solr Caching Caching is a key part of what makes Solr fast and scalable There are a number of different caches configured in solrconfig.xml: – filterCache – queryResultCache – documentCache
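The three caches named above live in the query section of solrconfig.xml; a sketch of that section, with sizes that are illustrative rather than recommendations:

```xml
<query>
  <!-- caches results of fq filter queries as document sets -->
  <filterCache class="solr.FastLRUCache"
               size="512" initialSize="512" autowarmCount="128"/>
  <!-- caches ordered lists of document ids for full queries -->
  <queryResultCache class="solr.LRUCache"
                    size="512" initialSize="512" autowarmCount="32"/>
  <!-- caches stored fields of documents fetched from the index -->
  <documentCache class="solr.LRUCache"
                 size="512" initialSize="512" autowarmCount="0"/>
</query>
```

autowarmCount controls how many entries are copied from the old cache into the new one when a commit opens a new searcher, trading warm-up time for fewer cold-cache queries.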
  • 47. More Info 《Solr 1.4 Enterprise Search Server》 http://wiki.apache.org/solr/ http://solr.pl/en/ 《解密搜索引擎技术实战》 (Decoding Search Engine Technology in Practice)
  • 48. Thank you!