Build a Scalable Search Engine With Amazon CloudSearch

Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Presented at AWS Product Series | Understanding Amazon CloudSearch (AWSプロダクトシリーズ|よくわかるAmazon CloudSearch) http://kokucheese.com/event/index/168838/

Published in: Technology, Design
Transcript of "Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler"

  1. Build a Scalable Search Engine With Amazon CloudSearch
  2. Agenda
     •  Introduction to Search
     •  Amazon CloudSearch
     •  Building with CloudSearch
  3. Introduction to Search
  4. Search Engines Connect Us To Data
  5. Documents
  6. Representation of a Document
     Field         Value
     id            tt0371746
     title         Iron Man
     description   When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.
     director      Jon Favreau
     actors        Robert Downey Jr., Gwyneth Paltrow, Terrence Howard ...
     rating        7.9
     release_date  2008-05-02T00:00:00Z
  7. Data Types: Doubles, Dates, Signed Integers, Text, Literal
  8. Geo
     •  Latlon data type
     •  Region search
     •  Distance sort
     •  Supports mobile
  9. Text Processing (Normalization)
     •  Tokenization (parsing)
     •  Downcasing
     •  Stemming
     •  Stopword removal
     •  Synonym addition
     Before: When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.
     After: when wealth industrial tony stark force build armor suit after life threaten incident ultimate decide use technology fight against evil
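The processing steps on this slide can be sketched in a few lines of Python. This is only a rough illustration, not how CloudSearch implements them: the stopword list and the suffix-stripping rules below are hypothetical and far cruder than a real analyzer, so the stems differ from the slide's output (for example "forc" rather than "force").

```python
import re

# Hypothetical stopword list and suffix rules, chosen only for this illustration.
STOPWORDS = {"is", "to", "an", "a", "he", "its", "the"}
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Tokenization plus downcasing: split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text):
    """Tokenize, drop stopwords, then stem what is left."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]


terms = normalize("Tony Stark is forced to build an armored suit")
print(terms)  # ['tony', 'stark', 'forc', 'build', 'armor', 'suit']
```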
  10. Indexing
      Term   Documents (Posting List)
      Iron   The Man in the Iron Mask; Iron Man 2; Iron Man; The Iron Giant; The Iron Lady; ...
      Man    Rain Man; The Man in the Moon; Iron Man 2; The Lawnmower Man; The Third Man; Iron Man; ...
  11. Matching
      Iron: The Man in the Iron Mask; Iron Man 2; Iron Man; The Iron Giant; The Iron Lady
      Man: Rain Man; The Man in the Moon; Iron Man 2; The Lawnmower Man; The Third Man; Iron Man
      In both lists: Iron Man 2; Iron Man
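A minimal sketch of the indexing and matching slides, assuming a plain whitespace tokenizer and in-memory sets as posting lists (the function names are illustrative, not part of any CloudSearch API). Each term maps to the set of documents containing it, and a multi-term query is answered by intersecting the posting lists.

```python
from collections import defaultdict

# Titles taken from the slides; list positions serve as document ids.
TITLES = [
    "The Man in the Iron Mask",
    "Iron Man 2",
    "Iron Man",
    "The Iron Giant",
    "The Iron Lady",
    "Rain Man",
    "The Man in the Moon",
    "The Lawnmower Man",
    "The Third Man",
]


def build_index(titles):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, title in enumerate(titles):
        for term in title.lower().split():
            index[term].add(doc_id)
    return index


def match_all(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)


index = build_index(TITLES)
hits = sorted(match_all(index, "iron man"))
print([TITLES[i] for i in hits])
# ['The Man in the Iron Mask', 'Iron Man 2', 'Iron Man']
```

Because this sketch indexes every word of every title, "The Man in the Iron Mask" also appears in the intersection; the slide's posting list for "Man" is abbreviated and does not show it.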
  12. Ranking and Relevance
      •  The meat of the search engine
      •  TF-IDF: uniqueness and presence
      •  Additional Criteria
         –  Measures of document value (e.g. rating)
         –  Observed user behavior
         –  Freshness
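To make "uniqueness and presence" concrete, here is a small TF-IDF sketch. It assumes the simplest textbook weighting (raw term count times log(N / document frequency)); the document ids and the tiny corpus are made up, and a production engine would use a more refined formula and combine it with the additional criteria listed above.

```python
import math
from collections import Counter

# Hypothetical mini-corpus; ids are placeholders, not real document ids.
DOCS = {
    "d1": "iron man",
    "d2": "iron man 2",
    "d3": "the iron giant",
    "d4": "rain man",
}


def tf_idf_scores(docs, query):
    """Score each document as the sum over query terms of tf * log(N / df)."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter()  # number of documents containing each term
    for terms in tokenized.values():
        df.update(set(terms))
    scores = {}
    for doc_id, terms in tokenized.items():
        tf = Counter(terms)  # presence: how often the term occurs in this document
        scores[doc_id] = sum(
            tf[t] * math.log(n_docs / df[t])  # uniqueness: rare terms weigh more
            for t in query.split()
            if df[t]
        )
    return scores


scores = tf_idf_scores(DOCS, "iron giant")
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranked[0][0])  # d3: "giant" is rare, so it dominates the score
```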
  13. Summary
      •  Search makes data accessible
      •  Search documents gather information about one search target
      •  Reverse (inverted) indices provide the basis of text-to-text matching
      •  Relevance brings the best matches
  14. Amazon CloudSearch
  15. Building a Search service
      •  Build your own
         –  Extend datastores and build custom relevance engine
      •  Open Source
         –  Apache Solr, ElasticSearch
      •  Enterprise Search
         –  FAST, Autonomy, Endeca
  16. Challenges with building a Search service
      •  COMPLEX: Requires extensive search expertise
      •  COSTLY: High upfront expenditure
      •  SLOW: Long time to market; slows innovation
      •  UNDIFFERENTIATED: Operational overhead that doesn't add value to core product
  17. Where CloudSearch fits in the picture
      Amazon CloudSearch is a fully managed search service in the cloud that makes it easy to set up, operate, and scale a search solution for your website or application.
      Similar benefits to other AWS Managed Services:
      •  Easy to set up and operate (Console, SDK, CLT)
      •  Pay as you go
      •  No need to guess capacity
      •  Experiment fast with low risk
      •  Go global in minutes
  18. Reference Architecture
  19. Automatic Scaling
      [Diagram: a grid of search instances, each holding one index partition copy (Index Partition 1..n, Copy 1..n). Scaling drivers shown: DATA (Document Quantity and Size) and TRAFFIC (Search Request Volume and Complexity).]
  20. Building With CloudSearch
  21. Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"
  22. Create a Domain
  23. Upload Data
  24. March 2014 CloudSearch Launch
      •  Support for 33 languages:
         Arabic, Armenian, Basque, Bulgarian, Catalan, Simplified Chinese, Traditional Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Thai, Turkish
  25. Loading data into CloudSearch (console, CSV)
      The generated SDF-format file can also be downloaded.
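  26. Japanese Text Processing
      •  Morphological Analysis (形態素解析)
         –  The task of splitting a sentence written in natural language into a sequence of morphemes and identifying the part of speech of each (http://ja.wikipedia.org/wiki/形態素解析)
            •  Unlike languages such as English, where words are separated by spaces, Japanese requires Japanese-specific parsing
         –  Example: 彼はエンジニアだ ("He is an engineer")
            •  彼 (noun, pronoun) / は (particle, binding particle) / エンジニア (noun, common) / だ (auxiliary verb)
            •  Extracting "エンジニア" and building the index from it makes a fast response possible when "エンジニア" is searched for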
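  27. Japanese Text Processing
      •  Normalization (正規化)
         –  Whether the search is for ｴﾝｼﾞﾆｱ (half-width kana) or エンジニア (full-width kana), we want a hit in both cases
         –  This is a feature supported by CloudSearch
         –  For deeper normalization, depending on requirements, it may be preferable to implement one of the following yourself
            •  NFD (Canonical Decomposition): Normalization Form D
            •  NFC (Canonical Composition): Normalization Form C
            •  NFKD (Compatibility Decomposition): Normalization Form KD
            •  NFKC (Compatibility Composition): Normalization Form KC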
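If you do implement normalization yourself, Python's standard `unicodedata` module covers the four forms listed above. The sketch below assumes NFKC is the form you want for kana width folding; whether that suits your data (NFKC also folds full-width Latin letters and digits, for example) is a requirements decision, and the helper name is illustrative.

```python
import unicodedata

HALF_WIDTH = "ｴﾝｼﾞﾆｱ"   # half-width katakana, with a separate voiced sound mark
FULL_WIDTH = "エンジニア"  # full-width katakana


def normalize_for_search(text):
    """Apply NFKC so that width variants of the same word compare equal."""
    return unicodedata.normalize("NFKC", text)


# Compatibility normalization folds the half-width form into the full-width one.
print(normalize_for_search(HALF_WIDTH) == FULL_WIDTH)  # True

# Canonical normalization alone (NFC) leaves the half-width form unchanged.
print(unicodedata.normalize("NFC", HALF_WIDTH) == HALF_WIDTH)  # True
```

Applying the same function to both the indexed text and the incoming query keeps the two sides consistent.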
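  28. Japanese Text Processing
      •  Stemming
         –  飲んだ → 飲ん (verb, independent; baseForm: 飲む) / だ (auxiliary verb) → 飲む
         –  Entries can be added to the stemming dictionary (also possible via API/SDK)
  29. Japanese Text Processing
      •  Stopword Removal
         –  Removes words that carry no meaning on their own, such as 「の」, 「は」, 「か」
         –  As with stemming, entries can be added to the stopword dictionary (also possible via API/SDK)
  30. Japanese Text Processing
      •  Synonym Addition
         –  Synonym = 同義語 (a word with the same meaning)
            •  「ベニス」「ベネチア」「ヴェネチア」 (all "Venice")
            •  「昨年」「去年」 (both "last year")
         –  Because they mean the same thing, a search for one should also hit the others
         –  Synonyms can be added in the same way as stopwords and stemming entries
  31. Japanese Text Processing
      •  Synonym Addition
         –  Entries can be added to the synonym dictionary (also possible via API/SDK)
      •  Alias
         –  Searching for pupil hits documents containing student
         –  Searching for student does not hit documents containing pupil
      •  Group
         –  Searching for any of 1st, first, one
         –  hits all documents containing 1st, first, or one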
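The difference between an alias and a group is easier to see in code. The sketch below only illustrates the matching semantics described on the slide by expanding a search term; it is not how CloudSearch stores or applies its synonym dictionary, and the `ALIASES`, `GROUPS`, and `expand` names are made up for the example.

```python
# One-directional aliases: a search for the key also matches the listed terms.
ALIASES = {"pupil": ["student"]}

# Groups: every member of a group matches every other member.
GROUPS = [{"1st", "first", "one"}]


def expand(term):
    """Return the set of terms a search for `term` should match."""
    expanded = {term}
    expanded.update(ALIASES.get(term, []))
    for group in GROUPS:
        if term in group:
            expanded |= group
    return expanded


print(sorted(expand("pupil")))    # ['pupil', 'student']
print(sorted(expand("student")))  # ['student']  (aliases are one-way)
print(sorted(expand("first")))    # ['1st', 'first', 'one']
```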
  32. Document Upload
      http(s)://<document service endpoint>/2013-01-01/documents/batch

      Accept: application/json
      Content-Length: 1176
      Content-Type: application/json
      Host: doc.imdb-movies-rr2f34ofg56xneuemujamut52i.us-east-1.cloudsearch.amazonaws.com

      [ { "type" : "add",
          "id" : "tt0371746",
          "fields" : {
            "directors" : [ "Jon Favreau" ],
            "release_date" : "2008-04-14T00:00:00Z",
            "rating" : 7.9,
            "genres" : [ "Action", "Adventure", "Sci-Fi" ],
            "image_url" : "http://ia.media-imdb.com/images/M/MV5BMTczNTI2ODUwOF5BMl5BanBnXkFtZTcwMTU0NTIzMw@@._V1_SX400_.jpg",
            "plot" : "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.",
            "title" : "Iron Man",
            "rank" : 171,
            "running_time_secs" : 7560,
            "actors" : [ "Robert Downey Jr.", "Gwyneth Paltrow", "Terrence Howard" ],
            "year" : 2008 } },
        { "type" : "delete",
          "id" : "tt0434409" } ]
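A batch like the one on this slide can be assembled with the standard `json` module. This is a sketch under the assumption that the 2013-01-01 batch format is a JSON array of operations keyed by `type`, `id`, and `fields`; the helper names are illustrative, only a few fields are included, and nothing is actually sent over the network.

```python
import json


def add_op(doc_id, fields):
    """Build an operation that adds (or replaces) one document."""
    return {"type": "add", "id": doc_id, "fields": fields}


def delete_op(doc_id):
    """Build an operation that deletes one document by id."""
    return {"type": "delete", "id": doc_id}


batch = [
    add_op("tt0371746", {"title": "Iron Man", "year": 2008, "rating": 7.9}),
    delete_op("tt0434409"),
]

# Serialize once so Content-Length matches the exact bytes that would be posted.
body = json.dumps(batch).encode("utf-8")
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Content-Length": str(len(body)),
}
print(headers["Content-Length"], "bytes in batch")
```

The body and headers would then be POSTed to the domain's document service endpoint at `/2013-01-01/documents/batch`.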
  33. Simple Queries
  34. Simple Queries
      http(s)://<search endpoint>/2013-01-01/search?q=iron+man
      {"status": {"rid": "oei6zt8oAgq5QOc=",
                  "time-ms": 4},
       "hits": {"found": 9, "start": 0,
                "hit": [
                  {"id": "tt1228705"},
                  {"id": "tt0120744"},
                  {"id": "tt0371746"},
                  {"id": "tt1866249"},
                  {"id": "tt0119558"},
                  {"id": "tt0402894"},
                  {"id": "tt1258972"},
                  {"id": "tt1300854"},
                  {"id": "tt0462465"} ] } }
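The request and response on this slide can be handled with the Python standard library alone. In this sketch the endpoint hostname is a placeholder, the response is a shortened copy of the slide's example rather than live output, and the function names are illustrative.

```python
import json
from urllib.parse import urlencode

# Placeholder hostname; a real domain has its own search endpoint.
SEARCH_ENDPOINT = "search-example.us-east-1.cloudsearch.amazonaws.com"

# Shortened copy of the response shown on the slide.
SAMPLE_RESPONSE = (
    '{"status": {"rid": "oei6zt8oAgq5QOc=", "time-ms": 4},'
    ' "hits": {"found": 9, "start": 0,'
    ' "hit": [{"id": "tt1228705"}, {"id": "tt0120744"}, {"id": "tt0371746"}]}}'
)


def search_url(endpoint, query):
    """Build a simple-query URL; urlencode turns the space into '+'."""
    return f"https://{endpoint}/2013-01-01/search?{urlencode({'q': query})}"


def hit_ids(response_text):
    """Extract the matching document ids from a search response body."""
    data = json.loads(response_text)
    return [hit["id"] for hit in data["hits"]["hit"]]


url = search_url(SEARCH_ENDPOINT, "iron man")
ids = hit_ids(SAMPLE_RESPONSE)
print(url)
print(ids)
```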
  35. Complex Queries
  36. Faceting
  37. Drilldown
  38. Adjustable Ranking
  39. Highlighting
  40. Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"
  41. Availability Options
  42. Scaling Options
  43. IAM Integration
      Configuration API Only
      {
        "Version": "2012-10-17",
        "Statement": [
          { "Effect": "Allow",
            "Action": ["cloudsearch:*"],
            "Resource": "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies" },
          { "Effect": "Deny",
            "Action": ["cloudsearch:DeleteDomain"],
            "Resource": "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies" }
        ]
      }
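The policy on this slide allows every CloudSearch action on the domain but explicitly denies `DeleteDomain`. The sketch below imitates that outcome with a deliberately simplified evaluator: it assumes exact resource matching, case-sensitive wildcard matching on actions, and "an explicit Deny always wins". Real IAM evaluation has many more rules, so treat this as an illustration of the policy's intent only.

```python
from fnmatch import fnmatchcase

DOMAIN_ARN = "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies"

POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["cloudsearch:*"], "Resource": DOMAIN_ARN},
        {"Effect": "Deny", "Action": ["cloudsearch:DeleteDomain"], "Resource": DOMAIN_ARN},
    ],
}


def is_allowed(policy, action, resource):
    """Simplified check: default deny, and any matching Deny overrides an Allow."""
    allowed = False
    for statement in policy["Statement"]:
        if statement["Resource"] != resource:
            continue
        if any(fnmatchcase(action, pattern) for pattern in statement["Action"]):
            if statement["Effect"] == "Deny":
                return False
            allowed = True
    return allowed


print(is_allowed(POLICY, "cloudsearch:DescribeDomains", DOMAIN_ARN))  # True
print(is_allowed(POLICY, "cloudsearch:DeleteDomain", DOMAIN_ARN))     # False
```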
  44. Closing Thoughts
      •  Content Discovery goes hand in hand with Content. Search is everywhere!
      •  Amazon CloudSearch is a fully managed, easy-to-use, cost-effective search service: easy to build, easy to scale
      •  Get the powerful search features found in open source engines (Apache Solr) combined with value-add AWS features (easy setup, on-demand pricing, auto scaling, Multi-AZ, global availability)
  45. Questions?
      Jon Handler (handler@amazon.com)
      Pravin Muthukumar (pravinm@amazon.com)