6. Representation of a Document
Field          Value
id             tt0371746
title          Iron Man
description    When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.
director       Jon Favreau
actors         Robert Downey Jr., Gwyneth Paltrow, Terrence Howard ...
rating         7.9
release_date   2008-05-02T00:00:00Z
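A search document like this one can be pictured as a flat set of typed fields. The sketch below is only an illustration in Python: the dictionary layout is an assumption rather than the service's upload format, and the actors list holds only the names visible on the slide, which truncates the rest.

```python
import json

# Hypothetical in-memory form of the search document shown on the slide.
# The slide truncates the actors list with "...", so only the visible names appear.
document = {
    "id": "tt0371746",
    "title": "Iron Man",
    "description": (
        "When wealthy industrialist Tony Stark is forced to build an armored "
        "suit after a life-threatening incident, he ultimately decides to use "
        "its technology to fight against evil."
    ),
    "director": "Jon Favreau",
    "actors": ["Robert Downey Jr.", "Gwyneth Paltrow", "Terrence Howard"],
    "rating": 7.9,
    "release_date": "2008-05-02T00:00:00Z",
}

# Serialize to JSON, a common interchange format for search documents.
serialized = json.dumps(document, indent=2)
print(serialized)
```

Text fields (title, description), a multi-valued field (actors), a number (rating), and a date (release_date) each support different kinds of queries, which is why the field types matter.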
8. Geo
• Latlon data type
• Region search
• Distance sort
• Supports mobile
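Region search and distance sort can be sketched with a bounding-box test and the haversine great-circle formula. This is a minimal illustration, not the service's implementation: the function names, the `latlon` field name, and the sample places are assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def in_region(point, south_west, north_east):
    """Bounding-box test; this sketch ignores boxes that cross the antimeridian."""
    lat, lon = point
    return south_west[0] <= lat <= north_east[0] and south_west[1] <= lon <= north_east[1]

def sort_by_distance(origin, places):
    """Return places ordered nearest-first from origin."""
    return sorted(places, key=lambda p: haversine_km(origin, p["latlon"]))

# Hypothetical sample data: each place carries a latlon field.
PLACES = [
    {"name": "San Francisco", "latlon": (37.7749, -122.4194)},
    {"name": "Seattle", "latlon": (47.6062, -122.3321)},
    {"name": "Portland", "latlon": (45.5152, -122.6784)},
]

origin = (47.2529, -122.4443)  # a user location, e.g. reported by a mobile device
nearest_first = [p["name"] for p in sort_by_distance(origin, PLACES)]
print(nearest_first)
```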
9. Text Processing (Normalization)
• Tokenization (parsing)
• Downcasing
• Stemming
• Stopword removal
• Synonym Addition
Original text:
When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.
After normalization:
when wealth industrial tony stark force build armor suit after life threaten incident ultimate decide use technology fight against evil
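The normalization steps can be sketched as a small pipeline. Everything here is an assumption for illustration: the stopword list, the synonym map, and especially the suffix-stripping stemmer, which is far cruder than a production stemmer and will not reproduce the slide's output exactly.

```python
import re

# Hypothetical resources; a real engine ships much larger, language-specific lists.
STOPWORDS = {"is", "to", "an", "a", "he", "its", "the"}
SYNONYMS = {"evil": ["wicked"]}
SUFFIXES = ("ing", "ed", "ly", "s")
MIN_STEM_LENGTH = 3

def stem(token):
    """Naive stemmer: strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM_LENGTH:
            return token[: -len(suffix)]
    return token

def normalize(text):
    # Tokenization and downcasing: split on anything that is not a letter or digit.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Stemming.
    tokens = [stem(t) for t in tokens]
    # Stopword removal.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Synonym addition: keep the original term and append its synonyms.
    expanded = []
    for t in tokens:
        expanded.append(t)
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

print(normalize("Tony Stark decides to build an armored suit"))
```

The same pipeline is normally applied to both documents at index time and queries at search time, so that "Decides" in a query matches "decide" in the index.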
10. Indexing
Term   Documents (Posting List)
Iron   The Man in the Iron Mask
       Iron Man 2
       Iron Man
       The Iron Giant
       The Iron Lady
       ...
Man    Rain Man
       The Man in the Moon
       Iron Man 2
       The Lawnmower Man
       The Third Man
       Iron Man
       ...
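An index of this shape maps each term to the list of documents containing it. A minimal sketch follows, assuming whitespace tokenization and using list positions as document IDs (both choices are illustrative, not the engine's actual behavior).

```python
from collections import defaultdict

TITLES = [
    "The Man in the Iron Mask",
    "Iron Man 2",
    "Iron Man",
    "The Iron Giant",
    "The Iron Lady",
    "Rain Man",
    "The Man in the Moon",
    "The Lawnmower Man",
    "The Third Man",
]

def build_index(titles):
    """Map each lowercased term to a sorted posting list of document IDs."""
    index = defaultdict(list)
    for doc_id, title in enumerate(titles):
        # set() ensures a document appears once per term even if the term repeats.
        for term in set(title.lower().split()):
            index[term].append(doc_id)
    return index

index = build_index(TITLES)
print([TITLES[i] for i in index["iron"]])
print([TITLES[i] for i in index["man"]])
```

Because documents are processed in ID order, every posting list comes out sorted, which is what makes fast list intersection possible at query time.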
11. Matching
Posting list for "Iron":
The Man in the Iron Mask
Iron Man 2
Iron Man
The Iron Giant
The Iron Lady
Posting list for "Man":
Rain Man
The Man in the Moon
Iron Man 2
The Lawnmower Man
The Third Man
Iron Man
Documents in both lists:
Iron Man 2
Iron Man
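Matching a two-term query amounts to intersecting the two posting lists. The sketch below uses the classic two-pointer merge over sorted lists of document IDs; the IDs themselves are made up for the example, and the lists are copied as they appear on the slides (which truncate them).

```python
# Hypothetical document IDs for the titles on the slide.
TITLES = {
    1: "The Man in the Iron Mask",
    2: "Iron Man 2",
    3: "Iron Man",
    4: "The Iron Giant",
    5: "The Iron Lady",
    6: "Rain Man",
    7: "The Man in the Moon",
    8: "The Lawnmower Man",
    9: "The Third Man",
}

# Posting lists as shown on the slides, sorted by document ID.
IRON = [1, 2, 3, 4, 5]
MAN = [2, 3, 6, 7, 8, 9]

def intersect(a, b):
    """Intersect two sorted posting lists in O(len(a) + len(b)) time."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

matches = intersect(IRON, MAN)
print([TITLES[doc_id] for doc_id in matches])
```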
12. Ranking and Relevance
• The meat of the search engine
• TF-IDF – uniqueness and presence
• Additional Criteria
– Measures of document value (e.g. rating)
– Observed user behavior
– Freshness
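TF-IDF rewards terms that are frequent in a document (presence) but rare across the collection (uniqueness). The sketch below uses one common variant, term frequency normalized by document length times the natural log of inverse document frequency; real engines use tuned variations, and the sample titles and function name are assumptions for the example.

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    """Score each document against the query with a basic TF-IDF sum."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = sum(
            (tf[t] / len(tokens)) * math.log(n / df[t])
            for t in query_terms
            if df[t]  # skip terms that occur in no document
        )
        scores.append(score)
    return scores

DOCS = ["Iron Man", "Iron Man 2", "The Iron Giant", "Rain Man", "The Third Man"]
scores = tf_idf_scores(["iron", "man"], DOCS)
ranked = sorted(zip(DOCS, scores), key=lambda pair: pair[1], reverse=True)
for title, score in ranked:
    print(f"{title}: {score:.3f}")
```

The additional criteria on the slide (rating, user behavior, freshness) would typically be blended into this text score; how they are weighted is a product decision rather than something the formula dictates.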
13. Summary
• Search makes data accessible
• Search documents gather information about one search target
• Inverted (reverse) indices provide the basis of text matching
• Relevance brings the best matches
15. Building a Search service
• Build your own
– Extend datastores and build custom relevance engine
• Open Source
– Apache Solr, ElasticSearch
• Enterprise Search
– FAST, Autonomy, Endeca
16. Challenges with building a Search service
• COMPLEX: Requires extensive search expertise
• COSTLY: High upfront expenditure
• SLOW: Long time to market. Slows innovation
• UNDIFFERENTIATED: Operational overhead that doesn’t add value to
core product
17. Where CloudSearch fits in the picture
Amazon CloudSearch is a fully managed search service in the cloud that
makes it easy to set up, operate, and scale a search solution for your
website or application.
Benefits similar to other AWS managed services:
• Easy to set up and operate (Console, SDK, CLI)
• Pay as you go
• No need to guess capacity
• Experiment fast with low risk
• Go Global in minutes
19. Automatic Scaling
[Diagram: a grid of search instances. One axis is labeled DATA (document quantity and size) and runs across Index Partitions 1, 2, ... n; the other is labeled TRAFFIC (search request volume and complexity) and runs across Copies 1, 2, ... n of each partition.]
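The grid of partitions and copies can be turned into simple arithmetic, assuming data volume determines the number of index partitions and query traffic determines the number of copies of each partition. The per-instance capacities below are hypothetical parameters for illustration, not published limits, and the service makes these decisions automatically.

```python
import math

def plan_instances(index_size_gb, queries_per_second,
                   partition_capacity_gb, copy_capacity_qps):
    """Estimate the instance grid: partitions scale with data, copies with traffic.

    All capacity figures are hypothetical inputs supplied by the caller.
    """
    partitions = max(1, math.ceil(index_size_gb / partition_capacity_gb))
    copies = max(1, math.ceil(queries_per_second / copy_capacity_qps))
    return partitions, copies, partitions * copies

# Example with made-up numbers: 25 GB of index, 900 queries per second.
partitions, copies, total = plan_instances(25, 900, 10, 400)
print(partitions, copies, total)
```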
43. IAM Integration
Configuration API Only
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow",
      "Action": ["cloudsearch:*"],
      "Resource": "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies" },
    { "Effect": "Deny",
      "Action": ["cloudsearch:DeleteDomain"],
      "Resource": "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies" }
  ]
}
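The effect of this policy is that every CloudSearch configuration action on the imdb-movies domain is allowed except DeleteDomain, because in IAM an explicit Deny overrides any Allow. The sketch below mimics that rule in a few lines; it is a deliberately simplified, hypothetical evaluator (exact resource match, shell-style wildcards for actions) and not how IAM itself is implemented.

```python
import json
from fnmatch import fnmatchcase

DOMAIN_ARN = "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["cloudsearch:*"], "Resource": DOMAIN_ARN},
        {"Effect": "Deny", "Action": ["cloudsearch:DeleteDomain"], "Resource": DOMAIN_ARN},
    ],
}

def is_allowed(policy, action, resource):
    """Simplified check: default deny, and an explicit Deny beats any Allow."""
    allowed = False
    for statement in policy["Statement"]:
        if statement["Resource"] != resource:
            continue  # simplification: resources must match exactly
        if any(fnmatchcase(action, pattern) for pattern in statement["Action"]):
            if statement["Effect"] == "Deny":
                return False
            allowed = True
    return allowed

print(is_allowed(policy, "cloudsearch:DescribeDomains", DOMAIN_ARN))
print(is_allowed(policy, "cloudsearch:DeleteDomain", DOMAIN_ARN))
print(json.dumps(policy, indent=2))
```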
44. Closing Thoughts
• Content Discovery goes hand in hand with Content. Search is
everywhere!
• Amazon CloudSearch is a fully managed, easy-to-use, cost-effective
search service: easy to build, easy to scale
• Get the powerful search features found in open source engines
(Apache Solr) combined with value-add AWS features (easy setup,
on-demand pricing, auto scaling, Multi-AZ, global availability)