Introduction to Elasticsearch

2014-12-02
Hosang Jeon
hosang.jeon@jobplanet.co.kr

Contents
1. The basic concepts of Elasticsearch
2. Analysis [Basic]
3. Indexing, Updating and Deleting
4. Searching [Basic]
5. Aggregations [Basic]

The basic concepts of Elasticsearch

Elasticsearch
1. Open source
2. Distributed search engine
3. Built on top of Apache Lucene™
4. Supports a REST API

Apache Lucene™
A high-performance, full-featured text search engine library written entirely in Java.
http://lucene.apache.org/core/

What is the most important thing in a search engine?
1. Doing it all quickly -> Performance
2. Returning relevant search results -> Relevancy
3. Returning some statistics -> Aggregations

How to search quickly?

Raw data
Company ID | Tags
1 | swimming pool, benefits
2 | gym
3 | startup, English
4 | benefits, startup
5 | swimming pool, gym

Inverted indexed data
Tag | Company IDs
benefits | 1, 4
swimming pool | 1, 5
startup | 3, 4
English | 3
gym | 2, 5

Inverted index
An index data structure that stores a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents.
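The raw-data-to-inverted-index step above can be sketched in a few lines of Python. This is only an illustration: the tags are English renderings of the slide's tags, and `build_inverted_index` is a hypothetical helper, not an Elasticsearch or Lucene API.

```python
from collections import defaultdict

# Raw data from the slide: company ID -> tags (tags rendered in English).
raw_data = {
    1: ["swimming pool", "benefits"],
    2: ["gym"],
    3: ["startup", "English"],
    4: ["benefits", "startup"],
    5: ["swimming pool", "gym"],
}


def build_inverted_index(docs):
    """Map each tag to the ascending list of company IDs that carry it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for tag in docs[doc_id]:
            index[tag].append(doc_id)
    return dict(index)


inverted = build_inverted_index(raw_data)

# A lookup is now a single dictionary access instead of a scan over all rows.
matched = inverted["startup"]
print(matched, len(matched))
```

The lookup returns both the matched documents and their count, which is the number the following slides use for relevancy.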

How to search quickly? Inverted index

The inverted index suits a search engine when it comes to relevance, too. Looking up "startup" in the inverted indexed data (Company IDs 3, 4) gives:
1. the matched documents
2. the # of matched documents

The "# of matched documents" is very important for relevancy. We will talk about this in the next slides.

How to make search smart?

Search terms: "Samsung Electronics hiring"

On http://www.samsung.com/sec/
"Samsung Electronics": trivial
"hiring": important

On http://www.jobkorea.co.kr/
"Samsung Electronics": important
"hiring": trivial

How to make search smart? TF-IDF

TF: Term Frequency
-> The frequency of a word within a single document. The higher it is, the higher the relevance.

IDF: Inverse Document Frequency
-> Based on how often the word appears across all documents (the "# of matched documents" seen earlier). The lower that document frequency is, the higher the relevance.

How to make search smart?

Search terms: "The best startup in Korea"
Which one is more relevant?

Result 1: "Food delivery startup in Korea devours $36 million funding from Goldman Sachs ... A South Korean startup is trying to make personal fire transfers between ..."
startup: 2, in: 1, Korea: 2

Result 2: "What started out as just a great excuse to see in my home country quickly turned in ... that I learned about the startup ecosystem in Korea:."
startup: 1, in: 3, Korea: 1

"in" is a very common word. It doesn't say much about how relevant a document is.

For a deeper understanding: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scoring-theory.html
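To make the comparison concrete, here is a rough Python sketch that scores the two results with a simplified TF-IDF. The term counts come from the slide; the corpus size and document frequencies are invented for illustration. The `tf` and `idf` shapes follow the scoring-theory chapter linked above as I understand it, and everything else Lucene does (field-length norms, query normalization and so on) is left out.

```python
import math

# Term counts taken from the two search results on the slide.
term_counts = {
    "result_1": {"startup": 2, "in": 1, "Korea": 2},
    "result_2": {"startup": 1, "in": 3, "Korea": 1},
}

# Assumed corpus statistics (hypothetical, not from the slides):
# total number of documents, and how many documents contain each term.
NUM_DOCS = 1000
DOC_FREQ = {"startup": 50, "in": 900, "Korea": 40}


def tf(frequency):
    """Term frequency: grows with the count, with diminishing returns."""
    return math.sqrt(frequency)


def idf(term):
    """Inverse document frequency: rare terms weigh more than common ones."""
    return 1.0 + math.log(NUM_DOCS / (DOC_FREQ[term] + 1))


def score(counts):
    """Simplified relevance: sum of tf * idf over the query terms."""
    return sum(tf(freq) * idf(term) for term, freq in counts.items())


scores = {name: score(counts) for name, counts in term_counts.items()}
for name, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(value, 3))
```

Under these assumptions the three occurrences of "in" in the second result add little, because its IDF is close to the minimum, while the extra "startup" and "Korea" hits lift the first result.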

How data is organized in Elasticsearch

Logical layout: the data structure as seen by applications or people.
Physical layout: the way Elasticsearch handles the data in the background.

How data is organized in Elasticsearch: Logical layout

RDBMS | Elasticsearch
Database | Index
Table | Type
Row | Document
Column | Field

Index + Type + document ID = Unique ID of a document

How data is organized in Elasticsearch: Physical layout

Node
• 1 node != 1 physical server
• 1 node == 1 Elasticsearch instance

Shard (the KEY POINT of a distributed system)
• A part of an index.
• A directory of files containing an inverted index.
• The smallest unit that Elasticsearch deals with.
• A shard = a Lucene index
• Default # of shards: 5

How data is organized in Elasticsearch: Physical layout

Shard: primary shard vs. replica shard
• A replica shard is a copy of a primary shard.
• Default # of replicas: 1

Note: A replica shard can never reside on the same Elasticsearch node as the primary shard that holds the same data.
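The placement rule in the note can be illustrated with a toy allocator. This is a sketch only: the round-robin policy and the `allocate` function are assumptions for illustration and do not reproduce Elasticsearch's real shard allocator.

```python
def allocate(num_shards, nodes):
    """Place each primary and its replica on different nodes (toy policy)."""
    layout = {node: [] for node in nodes}
    for shard in range(num_shards):
        primary_node = nodes[shard % len(nodes)]
        layout[primary_node].append((shard, "primary"))
        if len(nodes) > 1:
            # The replica goes to the next node, never to the primary's node.
            replica_node = nodes[(shard + 1) % len(nodes)]
            layout[replica_node].append((shard, "replica"))
        # With a single node there is no valid place for the replica,
        # so in this sketch it simply stays unassigned.
    return layout


two_nodes = allocate(5, ["node-1", "node-2"])
one_node = allocate(5, ["node-1"])

for node, copies in two_nodes.items():
    print(node, copies)
```

With two nodes every node ends up holding one copy of each of the 5 shards; with a single node only the primaries are placed.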

Analysis [Basic]

What is the meaning of "ANALYSIS" in Elasticsearch?

It is NOT a kind of data analytics.

Analysis is the work Elasticsearch performs on the body of a document before indexing it. Analysis is applied not only during indexing but also to the search terms at search time.

The four steps of analysis

• STEP 1. Character filtering
  - Recognizes specific characters and converts them into characters suitable for searching.
  - ex) mapping char filter, html strip char filter, pattern replace char filter
• STEP 2. Tokenizing
  - Splits a sentence into pieces (tokens) that can be used for searching.
  - ex) whitespace tokenizer, ngram tokenizer, keyword tokenizer
• STEP 3. Token filtering
  - Takes a single token as input and modifies or removes it, adding tokens where necessary.
  - ex) lowercase token filter, ascii folding token filter, length token filter
• STEP 4. Token indexing
  - Tokens that have passed through token filtering are finally indexed.
  - These tokens are the elements that make up the inverted index.

An Analyzer is the set that combines the character filters, tokenizer and token filters above; the tokens it emits are what get indexed.

The four steps of analysis - example

Indexing request: "I love u & me"

• STEP 1. Character filtering (mapping char filter)
  "I love you and me"
• STEP 2. Tokenizing (whitespace tokenizer)
  "I", "love", "you", "and", "me"
• STEP 3. Token filtering (lowercase + stop + synonym token filter)
  "i", "love", "you", "me", "like"
• STEP 4. Token indexing
  indexing ["i", "love", "you", "me", "like"]

At search time, the same analysis process may or may not be applied, depending on the type of query.

The four steps of analysis - example with YAML

Indexing request: "I love u & me"

index:
  analysis:
    analyzer:
      myAnalyzer:
        type: custom
        char_filter: myMappingFilter
        tokenizer: whitespace
        filter: [lowercase, stop, synonym]
    char_filter:
      myMappingFilter:
        type: mapping
        mappings: ["&=>and", "u=>you"]

Result: ["i", "love", "you", "me", "like"]
The final output of analysis is what becomes a term.
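The same pipeline can be mimicked in plain Python to see each step in isolation. This is a sketch under assumptions, not how Elasticsearch implements analyzers: the input is taken to be "I love u & me" (the "&" appears to have been lost in the transcript), the stop list and synonym rule are guesses chosen to reproduce the slide's output, and the character filter works on whole whitespace-separated chunks rather than on arbitrary substrings.

```python
# Hypothetical stand-ins for the components named in the configuration.
CHAR_MAPPINGS = {"&": "and", "u": "you"}  # mapping char filter rules
STOP_WORDS = {"and"}                      # assumed (partial) stop list
SYNONYMS = {"love": ["like"]}             # assumed synonym rule


def char_filter(text):
    """STEP 1: rewrite characters before tokenizing (chunk-level here)."""
    return " ".join(CHAR_MAPPINGS.get(chunk, chunk) for chunk in text.split(" "))


def tokenize(text):
    """STEP 2: whitespace tokenizer."""
    return text.split()


def token_filters(tokens):
    """STEP 3: lowercase, then drop stop words, then add synonyms."""
    lowered = [token.lower() for token in tokens]
    kept = [token for token in lowered if token not in STOP_WORDS]
    extra = [syn for token in kept for syn in SYNONYMS.get(token, [])]
    return kept + extra


def analyze(text):
    """Run the three stages; the result is what STEP 4 would index."""
    return token_filters(tokenize(char_filter(text)))


terms = analyze("I love u & me")
print(terms)
```

The order of the stages matters: the stop filter only sees "and" because the character filter has already turned "&" into a word.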

Stemming: the process of extracting the root word from a word

dogs -> dog
administrations -> administr
administrators -> administr

2 types of stemming
1. Algorithmic stemming
  - ex) snowball, porter stem, stem
2. Dictionary stemming
  - ex) hunspell
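The difference between the two types can be sketched as follows. The suffix rules and the dictionary are made up for this example; this is a toy, not the Snowball, Porter or Hunspell algorithms.

```python
# Hypothetical suffix rules, longest first. Real stemmers use far richer rules.
SUFFIXES = ["ations", "ators", "s"]

# Hypothetical dictionary of known word -> root pairs.
DICTIONARY = {"dogs": "dog", "mice": "mouse"}


def algorithmic_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def dictionary_stem(word):
    """Look the word up; unknown words are returned unchanged."""
    return DICTIONARY.get(word, word)


for word in ["dogs", "administrations", "administrators", "mice"]:
    print(word, algorithmic_stem(word), dictionary_stem(word))
```

The trade-off shows up even in this toy: the rule-based stemmer handles words it has never seen but produces non-words such as "administr" and misses irregular forms such as "mice", while the dictionary gets "mice" right but only knows the words it was given.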

Indexing, Updating and Deleting

Mapping: the definition of each FIELD in a TYPE
• What is the type of the field?
• How will the field be indexed?
• How can the field be searched?

Mapping example

$ curl -XGET 'localhost:9200/blog/_mapping?pretty'
{
  "blog" : {
    "mappings" : {
      "post" : {
        "properties" : {
          "body" : { "type" : "string" },
          "postDate" : { "type" : "date", "format" : "dateOptionalTime"
          [... rest omitted ...]

• Mappings are generated automatically from the values of the incoming data even if you do not define them yourself, so at first you may not need to pay much attention to them.
• However, it is very difficult to change the definition of a field once it has been created, so it is important to decide carefully from the start.
• Mapping information is what the Analysis stage covered earlier relies on.

Indexing documents

$ curl -XPUT 'localhost:9200/blog/post/12345678?pretty' -d '{
  "title": "How can use Elasticsearch?",
  "body": "Some text to be here ...",
  [... rest omitted ...]
}'

Note: If you do not specify an ID, Elasticsearch generates a random ID automatically. In that case you must use the HTTP POST method.

[Diagram: Node 1 holds shards 1 2 3 4 5; Node 2 holds shards 4 5 1 2 3]
1. Indexing request
2. Hashing the document ID (ex. 12345678 -> hash function -> 4) [Hash range == # of shards]
3. Move and index the document to the target primary shard
4. Index the document to the replica
5. Send response

The hashed value is called the "routing value", and assigning a document to a specific shard using this value is called "routing".
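Step 2 of the flow above boils down to a hash followed by a modulo. A minimal sketch, assuming CRC32 as a stand-in hash (it is not the hash function Elasticsearch uses) and a hypothetical `route` helper that returns a 0-based shard number:

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # default number of shards according to the slides


def route(doc_id, num_shards=NUM_PRIMARY_SHARDS):
    """Return the 0-based primary shard for a document ID."""
    # zlib.crc32 is only a stand-in; any stable hash illustrates the idea.
    hashed = zlib.crc32(doc_id.encode("utf-8"))
    # Reducing the hash modulo the shard count keeps it in the shard range.
    return hashed % num_shards


shard = route("12345678")
print(shard)
```

Because the target shard depends on the shard count, the same ID would generally land on a different shard if the number of primary shards changed, which is why the hash range has to equal the number of shards.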

Updating documents

$ curl -XPOST 'localhost:9200/blog/post/12345678/_update' -d '{
  "script": "ctx._source.title = new_title",
  "params": { "new_title": "This is a new title." }
}'

Update process
STEP 1. Retrieve the existing document
STEP 2. Modify the document
STEP 3. Index the changed document
STEP 4. Remove the original document

Three ways to update documents

1. Sending a partial document
= The simplest way
$ curl -XPOST 'localhost:9200/blog/post/12345678/_update' -d '{
  "doc": { "new_title": "This is a new title." }
}'

2. Using upsert
= Insert if the document does not exist. Upsert = Update + Insert
$ curl -XPOST 'localhost:9200/blog/post/12345678/_update' -d '{
  "doc": { "new_title": "This is a new title." },
  "upsert": { "name": "This is a new title.", "date": "2014-12-04T19:00" }
}'

3. Using script
= Can handle complex situations (default script language: MVEL)
$ curl -XPOST 'localhost:9200/blog/post/12345678/_update' -d '{
  "script": "ctx._source.title = new_title",
  "params": { "new_title": "This is a new title." }
}'
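The three variants share the same retrieve, modify, reindex cycle, which a small in-memory model makes visible. This is a sketch only: `store`, `index` and `update` are hypothetical names, the script is modelled as a plain Python callable, and the partial document is merged shallowly.

```python
# Hypothetical in-memory index: doc_id -> {"_version": int, "_source": dict}
store = {}


def index(doc_id, source):
    """Index a full document, replacing any earlier copy."""
    version = store[doc_id]["_version"] + 1 if doc_id in store else 1
    store[doc_id] = {"_version": version, "_source": dict(source)}


def update(doc_id, doc=None, upsert=None, script=None):
    """Apply a partial document and/or a script, or insert the upsert body."""
    if doc_id not in store:
        if upsert is None:
            raise KeyError(f"document {doc_id} does not exist")
        index(doc_id, upsert)                # upsert: insert when missing
        return
    source = dict(store[doc_id]["_source"])  # STEP 1: retrieve
    if doc is not None:
        source.update(doc)                   # STEP 2: merge the partial doc
    if script is not None:
        script(source)                       # STEP 2: let a script modify it
    index(doc_id, source)                    # STEP 3 and 4: reindex, replacing the original


def set_title(source):
    """Plays the role of the script: sets the title field."""
    source["title"] = "This is a new title."


update("12345678", doc={"title": "partial"}, upsert={"title": "created by upsert"})
update("12345678", doc={"title": "partial"})
update("12345678", script=set_title)
print(store["12345678"])
```

The first call inserts the upsert body because the document does not exist yet; the next two each produce a new version of the whole document, even though only one field changed.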

Deleting data: 3+1 ways of deleting data

1. Delete Document(s)
$ curl -XDELETE localhost:9200/blog/post/12345678
$ curl -XDELETE 'localhost:9200/blog/post/_query?q=elasticsearch'

2. Delete Type
$ curl -XDELETE localhost:9200/blog/post/_mapping

Methods 1 and 2 only mark documents so that they no longer show up in searches; the actual deletion is carried out asynchronously later.

3. Delete Index
$ curl -XDELETE localhost:9200/blog

This method deletes the files immediately.

4. Close/Open Index
$ curl -XPOST localhost:9200/blog/_close
$ curl -XPOST localhost:9200/blog/_open

Once an index is closed, its data can be neither read nor written. This approach is mainly used for things like log data management.

Searching [Basic]

Searching, like the indexing covered earlier, goes through the analysis process.
However, depending on the type of query, the search terms may or may not be analyzed.
• Queries that analyze the search terms: match query, multi_match query ...
• Queries that do not analyze the search terms: term query, terms query, prefix query ...

"I love u & me" -> analysis -> Searching

Like a query, a filter is used to narrow the full set of documents down to the content you want. So what is the difference between a query and a filter?

Query vs Filter
                   | Query | Filter
Affect the score   | Yes   | No
Caching the result | No    | Yes

Confusing? The term is the same as "query" in an RDBMS, which can cause confusion.

Query vs Filter

Query in an RDBMS: "Find the results that satisfy these conditions"
-> Actually closer to the concept of a filter in Elasticsearch

Query in Elasticsearch: "Find the results most relevant (by score) to the search terms"
-> A concept used only in search engines

Query vs Filter: Conclusion

Anything tied to the role unique to search -> use a Query
All other cases (where possible) -> use a Filter

-> Because a filter caches its results using a bitset, it is better to use a filter whenever it can give you the results you want. (Bitsets will be explained later.)
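As a preview of the bitset idea, here is a rough Python sketch: each filter result is stored as one integer whose bit i says whether document i matches, and cached results are combined with a bitwise AND. The documents, field names and the `filter_bitset` helper are invented for illustration and say nothing about Lucene's actual cache implementation.

```python
# Hypothetical documents, keyed by a dense integer doc ID.
docs = {
    0: {"industry": "IT", "city": "Seoul"},
    1: {"industry": "Finance", "city": "Seoul"},
    2: {"industry": "IT", "city": "Busan"},
    3: {"industry": "IT", "city": "Seoul"},
    4: {"industry": "Finance", "city": "Busan"},
}

filter_cache = {}  # (field, value) -> bitset stored as an int


def filter_bitset(field, value):
    """Return the cached bitset for a filter, computing it on first use."""
    key = (field, value)
    if key not in filter_cache:
        bits = 0
        for doc_id, doc in docs.items():
            if doc.get(field) == value:
                bits |= 1 << doc_id  # set bit doc_id: "this document matches"
        filter_cache[key] = bits
    return filter_cache[key]


def matching_ids(bits):
    """Decode a bitset back into the list of matching doc IDs."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Combining two filters is a single bitwise AND over the cached bitsets.
both = filter_bitset("industry", "IT") & filter_bitset("city", "Seoul")
print(matching_ids(both))
```

A filter only answers yes or no per document, which is what makes such a compact, reusable representation possible; a query's scores depend on the search terms and cannot be cached this way.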

Aggregations [Basic]

Why do we need statistics?
• Which words appear most often in the reviews of "XX Bank"?
• What is the distribution of top search rankings by age group?
• Which search terms surged this week compared to last week?
• Which words appear most often in the reviews of companies with an average satisfaction score of 3 or lower?

Facets?
Just FORGET about them! Aggregations go beyond facets: they can be nested, offer significant terms, etc.
Facets have been deprecated since Elasticsearch 1.x and will be removed in a future release.

Types of aggregations

Metrics
• Aggregate general statistical figures such as the maximum, minimum, average and standard deviation.
• Cannot have other aggregations nested beneath them.
• ex) average price per category of products in an online store, Unique Visitor statistics for a website

Buckets
• Classify the search results into multiple buckets according to a given criterion.
• Can have other aggregations nested beneath them.
• ex) salary statistics and growth trends by industry
• Single-bucket aggregations: ex) global, filter, missing aggregations
• Multi-bucket aggregations: ex) terms, significant terms, range, histogram aggregations
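How a metric nests under a bucket can be sketched in plain Python. The documents and the helper names `terms_buckets` and `stats_metric` are hypothetical; they only imitate the shape of a terms aggregation with a stats aggregation nested inside it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical search results.
hits = [
    {"industry": "IT", "salary": 5000},
    {"industry": "IT", "salary": 7000},
    {"industry": "Finance", "salary": 6000},
    {"industry": "Finance", "salary": 4000},
    {"industry": "Finance", "salary": 5000},
]


def terms_buckets(docs, field):
    """Bucket aggregation: group documents by the value of a field."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    return dict(buckets)


def stats_metric(docs, field):
    """Metric aggregation: reduce a set of documents to plain numbers."""
    values = [doc[field] for doc in docs]
    return {"min": min(values), "max": max(values), "avg": mean(values)}


# The metric is computed once per bucket, i.e. nested under the bucket.
result = {
    key: {"doc_count": len(bucket), "salary_stats": stats_metric(bucket, "salary")}
    for key, bucket in terms_buckets(hits, "industry").items()
}
for key, value in result.items():
    print(key, value)
```

A bucket still contains documents, so something can be nested under it; a metric returns only numbers, so nothing can be nested under it.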

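Things you should remember for aggregations

Aggregations are always performed on the results of the query.
  query -> search results
  query -> aggregations -> aggregation results

What if you add a filter or post_filter?
  query -> filter / post_filter -> search results
  query -> aggregations -> aggregation results
  (the filter narrows only the search results, not the aggregation results)

What if you want to filter what the aggregations run on?
  filtered query -> search results
  filtered query -> aggregations -> aggregation results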
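The difference between the two placements can be simulated in a few lines of Python. This is a sketch with made-up documents and a hypothetical `search` function; it only models the order in which query, aggregation and post_filter are applied.

```python
from collections import Counter

# Hypothetical documents.
docs = [
    {"text": "startup hiring", "city": "Seoul", "industry": "IT"},
    {"text": "startup funding", "city": "Busan", "industry": "IT"},
    {"text": "startup review", "city": "Seoul", "industry": "Finance"},
    {"text": "bank review", "city": "Seoul", "industry": "Finance"},
]


def search(docs, query, post_filter=None):
    """Return (hits, aggregation); the aggregation counts docs per industry."""
    matched = [doc for doc in docs if query(doc)]
    # Aggregations run on everything the query matched ...
    aggregation = dict(Counter(doc["industry"] for doc in matched))
    # ... while the post_filter narrows only the hits that are returned.
    hits = [doc for doc in matched if post_filter is None or post_filter(doc)]
    return hits, aggregation


def has_startup(doc):
    return "startup" in doc["text"].split()


def in_seoul(doc):
    return doc["city"] == "Seoul"


# Filter applied after the query: hits shrink, aggregation does not.
hits_pf, aggs_pf = search(docs, has_startup, post_filter=in_seoul)

# Filter folded into the query (filtered query): both shrink.
hits_fq, aggs_fq = search(docs, lambda doc: has_startup(doc) and in_seoul(doc))

print(len(hits_pf), aggs_pf)
print(len(hits_fq), aggs_fq)
```

Both calls return the same two hits, but only the filtered query changes the counts the aggregation reports.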

Use cases of Metrics aggregations

Use case | Aggregation
Minimum, maximum and average salary, etc. | stats (min, max, avg)
Variance and standard deviation of salary per age group (bucket) | extended_stats
Traffic statistics per user agent | percentiles
Number of unique IP addresses | cardinality

Use cases of Buckets aggregations

Use case | Aggregation
Terms that appear most often in each company's pros | terms
Search terms that surged this week | significant terms
Classifying data by date | date range
Salary distribution by age group | histogram

Future work
1. Searching [Detail]
2. Analysis [Detail]
3. Aggregations [Detail]
4. Relations of documents
5. Performance tuning
