3. History – Apache Lucene
– “Information retrieval software library”
– Free/open source
– Supported by the Apache Software Foundation
– Created by Doug Cutting
– Written in 1999
4. History – Elasticsearch
– “You know, for Search”
– Also Free & Open Source
– Built on top of Lucene
– Created by Shay Banon @kimchy
– Versions
• First public release, v0.4 in February 2010
– A rewrite of the earlier “Compass” project, now with scalability built in
from the very core
• Current stable version: 0.20.6
• Beta branch at 0.90 (working towards the 1.0 release)
– In Java, so inherently cross-platform
5. WHAT DOES IT ADD TO LUCENE?
– RESTful Service
• JSON API over HTTP
• Want to use it from PHP?
– cURL requests, just as you would send requests to the Facebook Graph API.
– High Availability & Performance
• Clustering
– Long-Term Persistence
• Write-through to a persistent storage system.
47. Data Terminology
DB                    Elasticsearch
Database              Index
Table                 Type
Row                   Document
Column                Field
Schema                Mapping
Index                 Everything is indexed
SQL                   Query DSL
SELECT * FROM table   GET http://…
UPDATE table SET …    PUT http://
55. Creating a Document
• curl -XPUT "localhost:9200/megacorp/employee/1?pretty" \
  -H "Content-Type: application/json" \
  -d '{"first_name":"John","last_name":"Smith","age":25,"about":"I love to go rock climbing","interest":["sports","music"]}'
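For readers who prefer a script to the command line, the same request can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: it assumes a node on localhost:9200 as in the curl command, the helper name build_index_request is made up for illustration, and the request is only built, not sent.

```python
import json
import urllib.request

# Assumption: a local node on the default port, as in the curl command.
URL = "http://localhost:9200/megacorp/employee/1?pretty"

document = {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interest": ["sports", "music"],
}


def build_index_request(url, doc):
    """Build (but do not send) the PUT request that indexes `doc`."""
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_index_request(URL, document)
print(request.get_method(), request.full_url)

# Sending it requires a running node, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the JSON body with json.dumps avoids the shell-quoting problems that come with embedding JSON in a command line.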
76. Elasticsearch Inverted Index: Example – Posting List
ID   Name   Age   Sex
1    Kate   24    Female
2    John   24    Male
3    Bill   29    Male

Term     Posting List
Kate     1
John     2
Bill     3

Term     Posting List
24       [1,2]
29       3

Term     Posting List
Female   1
Male     [2,3]
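The posting lists on this slide can be reproduced with a few lines of Python. This is only an illustration of the idea (one term-to-document-ID map per field); the function name build_inverted_index is hypothetical and the real Lucene data structures are far more compact.

```python
from collections import defaultdict

# The three documents from the slide.
rows = [
    {"id": 1, "name": "Kate", "age": 24, "sex": "Female"},
    {"id": 2, "name": "John", "age": 24, "sex": "Male"},
    {"id": 3, "name": "Bill", "age": 29, "sex": "Male"},
]


def build_inverted_index(rows, field):
    """Map each term of `field` to the sorted list of document IDs containing it."""
    postings = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["id"]):
        postings[row[field]].append(row["id"])
    return dict(postings)


for field in ("name", "age", "sex"):
    print(field, build_inverted_index(rows, field))
```

Keeping each posting list sorted by document ID is what makes it cheap to intersect lists, for example to answer "Age 24 AND Sex Male".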
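77. Elasticsearch Inverted Index: Example – Term Index
• This tree does not contain every term; it contains only some term prefixes. The
term index lets you quickly locate an offset in the term dictionary, and the
lookup then scans sequentially from that position.
• The term index therefore does not need to store every term, only the mapping
between some prefixes and the blocks of the term dictionary. Combined with
FST (Finite State Transducer) compression, this makes the term index small
enough to cache in memory. Once the term index has given the position of the
matching term dictionary block, the term is looked up on disk, which greatly
reduces the number of random disk reads.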
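The two-level lookup can be sketched as follows. This is a simplified model, not Lucene's implementation: a sorted list with binary search stands in for the FST, a "block" is just a small Python list standing in for a disk block, and the block size and all function names are assumptions chosen for illustration.

```python
import bisect


def build_blocks(terms, block_size=2):
    """Split the sorted term dictionary into fixed-size blocks (stand-ins for disk blocks)."""
    ordered = sorted(set(terms))
    return [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]


def shortest_separator(prev_last, first):
    """Shortest prefix of `first` that still sorts after `prev_last`."""
    for k in range(1, len(first) + 1):
        if first[:k] > prev_last:
            return first[:k]
    return first


def build_term_index(blocks):
    """In-memory index: one short prefix per block instead of every term."""
    prefixes = [""]  # the empty prefix sorts before every term
    for prev, block in zip(blocks, blocks[1:]):
        prefixes.append(shortest_separator(prev[-1], block[0]))
    return prefixes


def lookup(term, blocks, term_index):
    """Return the block number holding `term`, or None if it is absent."""
    if not blocks:
        return None
    # Step 1 (memory): find the last block whose prefix is <= term.
    block_no = bisect.bisect_right(term_index, term) - 1
    # Step 2 ("disk"): scan that single block sequentially.
    for candidate in blocks[block_no]:
        if candidate == term:
            return block_no
    return None


blocks = build_blocks(["Kate", "John", "Bill", "Female", "Male"])
term_index = build_term_index(blocks)
print(term_index)
print(lookup("Kate", blocks, term_index), lookup("Bob", blocks, term_index))
```

Only one block is ever scanned per lookup, which is the point of the slide: the small prefix index stays in memory and the larger dictionary is touched once.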
87. Mapping
• Mapping is the process of defining how a document, and the fields it contains, are
stored and indexed.
• A mapping is mainly used to define the Types under an Index
– Meta-fields
– Fields or properties
• Settings to prevent mapping explosion
– index.mapping.total_fields.limit: the maximum number of fields in an index, default
1000
– index.mapping.depth.limit: the maximum depth of a field, counted in inner objects, default 20
– index.mapping.nested_fields.limit: the maximum number of nested fields in an index, default 50
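As a rough illustration, the three limits could be collected into a JSON body for the index settings API. The defaults are the ones quoted on this slide; the helper name limits_body and the override value are hypothetical, and nothing is sent to a cluster here.

```python
import json

# Defaults as quoted on the slide.
DEFAULT_LIMITS = {
    "index.mapping.total_fields.limit": 1000,
    "index.mapping.depth.limit": 20,
    "index.mapping.nested_fields.limit": 50,
}


def limits_body(overrides=None):
    """Merge overrides into the defaults and reject unknown or non-positive values."""
    body = dict(DEFAULT_LIMITS)
    for key, value in (overrides or {}).items():
        if key not in DEFAULT_LIMITS:
            raise KeyError(f"unknown mapping limit: {key}")
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{key} must be a positive integer")
        body[key] = value
    return body


# Example: raise only the field-count limit (2000 is an arbitrary illustrative value).
body = limits_body({"index.mapping.total_fields.limit": 2000})
print(json.dumps(body, indent=2))
```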
88. Mapping – Field datatypes
• Simple types: String (no longer used as of ES 5; replaced by text and keyword), Numeric (long, integer,
short, byte, double, float), date, boolean, binary, Range (integer_range, float_range,
long_range, double_range, date_range), Text (for full-text search)
• Complex types: Array, Object (object for a single JSON object), Nested (nested for arrays of
JSON objects)
• Geo types: Geo-point (lat/lon points), Geo-shape (complex shapes such as polygons)
• Keyword: A field to index structured content such as email addresses, hostnames,
status codes, zip codes or tags.
• Specialized types: IP (IPv4/IPv6), Completion (provides auto-complete suggestions), Token
count (counts the number of tokens in a string), Mapper-murmur3 (computes hashes
of values at index time and stores them in the index), Attachment datatype,
Percolator datatype
• Multi-fields: it is often useful to index the same field in different ways for
different purposes.
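A small mapping fragment makes these datatypes concrete. This is a sketch under assumptions: the field names (including the sub-field name raw) are invented for illustration, only the "properties" part of a mapping is shown, and the helper field_paths is hypothetical; it simply lists every searchable path, including multi-field sub-fields.

```python
import json

# Hypothetical "properties" section using several of the datatypes listed above.
properties = {
    "first_name": {
        "type": "text",                            # analyzed, for full-text search
        "fields": {"raw": {"type": "keyword"}},    # multi-field: exact-value copy
    },
    "age": {"type": "integer"},
    "joined": {"type": "date"},
    "active": {"type": "boolean"},
    "client_ip": {"type": "ip"},
    "location": {"type": "geo_point"},
    "address": {                                   # object: a single JSON object
        "properties": {"city": {"type": "keyword"}},
    },
    "projects": {                                  # nested: an array of JSON objects
        "type": "nested",
        "properties": {"name": {"type": "keyword"}},
    },
}


def field_paths(props, prefix=""):
    """Flatten a properties tree into {dotted path: datatype}."""
    paths = {}
    for name, spec in props.items():
        path = prefix + name
        paths[path] = spec.get("type", "object")
        for sub_name, sub_spec in spec.get("fields", {}).items():
            paths[f"{path}.{sub_name}"] = sub_spec["type"]
        paths.update(field_paths(spec.get("properties", {}), path + "."))
    return paths


print(json.dumps(field_paths(properties), indent=2))
```

With this layout, first_name would serve full-text queries while first_name.raw would serve exact matching, sorting and aggregations.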
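91. Mapping Parameters
• analyzer: specifies the analyzer
• normalizer: specifies the normalizer
• boost: field weight; generally not used and can be ignored
• coerce: cleans up dirty data; for example, numeric strings are converted to numbers and floating-point values are truncated to integers
• copy_to: used to create a custom _all field
• doc_values: can be turned off for fields used only in structured search that need neither sorting nor aggregations; not supported on text fields
• dynamic: turns off dynamic mapping at the field level
• enabled: field level; the value is only stored, not indexed or searchable
• fielddata: allows aggregations on text fields
• format: the format (for date fields)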
• ignore_above : Strings longer than the ignore_above setting will not be indexed or stored.
• ignore_malformed: Trying to index the wrong datatype into a field throws an exception by
default, and rejects the whole document. The ignore_malformed parameter, if set to true,
allows the exception to be ignored. The malformed field is not indexed, but other fields in the
document are processed normally.
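92. Mapping Parameters
• include_in_all: related to the _all field
• index_options: controls what information is added to the inverted index
• index: controls whether the field is indexed
• fields: It is often useful to index the same field in different ways for different purposes
• norms: normalization factors
• null_value: null handling; by default a null value is not indexed and does not appear in the inverted index, so it cannot be searched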
• position_increment_gap
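• properties: the sub-fields of object and nested fields are called properties
• search_analyzer: the analyzer used at search time
• similarity: the similarity algorithm
• store: stores a field separately, in addition to _source, so that it can be returned in results
• term_vector: term vectors; specifies whether positions, offsets, etc. are kept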
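Several of the parameters from these two slides can be seen together in one mapping fragment. This is an illustrative sketch only: the field names and values are invented, just the "properties" part of a mapping is shown, and the helper parameters_used is hypothetical.

```python
import json

# Hypothetical "properties" section exercising some of the mapping parameters.
properties = {
    "about": {
        "type": "text",
        "analyzer": "standard",          # analyzer used at index time
        "search_analyzer": "standard",   # analyzer used at search time
        "copy_to": "full_text",          # feed a custom catch-all field
    },
    "full_text": {"type": "text"},
    "status": {
        "type": "keyword",
        "ignore_above": 256,             # longer strings are not indexed
        "null_value": "NULL",            # make explicit nulls searchable
        "doc_values": False,             # no sorting or aggregations needed
    },
    "age": {
        "type": "integer",
        "coerce": True,                  # accept "25" as 25
        "ignore_malformed": True,        # skip bad values, keep the document
    },
    "joined": {"type": "date", "format": "yyyy-MM-dd"},
    "session_data": {"type": "object", "enabled": False},  # kept in _source only
    "internal_id": {"type": "keyword", "index": False},    # not searchable
}


def parameters_used(props):
    """Collect the names of all mapping parameters that appear, other than type."""
    names = set()
    for spec in props.values():
        names.update(key for key in spec if key != "type")
    return sorted(names)


print(json.dumps(properties, indent=2))
print(parameters_used(properties))
```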
96. Mapping: Text – N-Gram Additional Field
• Indexed Document stays the same
• Additional index field title.prefix
• Can be queried like any field
• Requires additional storage
• Has a considerable impact on query performance
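To see why the extra field costs storage, here is a small sketch of what an additional prefix field might contain. It assumes the field holds edge n-grams (prefixes) of each lowercased, whitespace-separated token, which the name title.prefix suggests but the slide does not state; the gram lengths and function names are illustrative, not the analyzer's actual configuration.

```python
def edge_ngrams(token, min_gram=1, max_gram=5):
    """Return the prefixes of `token` with lengths from min_gram to max_gram."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_title(title):
    """Produce the terms for the original field and for the additional prefix field."""
    tokens = title.lower().split()
    return {
        "title": tokens,
        "title.prefix": [gram for token in tokens for gram in edge_ngrams(token)],
    }


terms = index_title("Elastic Search")
print(terms["title"])
print(terms["title.prefix"])
print(len(terms["title.prefix"]), "prefix terms for", len(terms["title"]), "tokens")
```

In this toy example two tokens expand into ten prefix terms, which gives a feel for the storage cost noted on the slide.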