by Darin Briskman, Technical Evangelist, AWS
DynamoDB queries enable consistent low latency at any scale, using the partition key, sort key, local secondary indexes, and global secondary indexes. Amazon Elasticsearch Service enables flexible search, including ranking and aggregation. Adding Elasticsearch to DynamoDB opens new capabilities that combine the power of query and search. Learn how Amazon.com uses this combination and how you can use it, too. Level: 200
by Kwesi Edwards, Business Development Manager, AWS
Database migration doesn’t need to be difficult or time-consuming. Learn how AWS Database Migration Service provides an easy, secure migration from on-premises and Amazon EC2 environments to Amazon RDS, Amazon Redshift, Amazon DynamoDB, and EC2 databases, with minimal downtime. We’ll also see how the AWS Schema Conversion Tool automatically converts your schema and a majority of the custom code, so you can get up and running in the cloud quickly and inexpensively. We’ll discuss alternative data migration strategies for special use cases. Level 200
AWS Learning Webinar: Spot Instances Benefits & Best Practices Explained
Deep Dive on Amazon EC2 Spot Instances
Amazon EC2 Spot instances allow you to bid on spare Amazon EC2 computing capacity, saving up to 90% over On-Demand EC2 pricing. In this session, we’ll dive deep into how Amazon EC2 Spot instances allow you to run and scale a variety of workloads, including containerized environments and applications such as stateless web services, image rendering, big data analytics, CI/CD pipelines, and massively parallel computations, for a fraction of the cost. This webinar will help architects, engineers, and developers from organizations of all sizes understand when and how to run your environment on EC2 Spot Instances.
Speaker: Chad Schmutzer, Solution Architect - Spot, Amazon Web Services
Serverless Applications at Global Scale with Multi-Regional Deployments - AWS...
Learning Objectives:
- Input and decision points when architecting a serverless multi-regional application
- Active-active Multi-Regional API with API Gateway and Lambda
- Replication with DynamoDB
DAT317: Migrating Databases and Data Warehouses to the Cloud
In this introductory session, we look at how to convert and migrate your commercial databases and data warehouses to the cloud and gain your database freedom. AWS Database Migration Service (AWS DMS) and AWS Schema Conversion Tool (AWS SCT) have been used to migrate tens of thousands of databases. These include Oracle and SQL Server to Amazon Aurora, Teradata and Netezza to Amazon Redshift, MongoDB to Amazon DynamoDB, and many other data source and target combinations. Learn how to easily and securely migrate your data and procedural code, enjoy flexibility and cost savings, and gain new opportunities.
Cloud storage provides education with a reliable, scalable, and secure alternative to on-premises storage systems. AWS offers eight different object, file, and block storage options supporting applications, archiving, and compliance needs. This webinar will provide an overview of the services and key education use cases ranging from data center replacement to video storage and file sharing. Services covered include Amazon S3, Amazon EFS, Amazon EBS, Amazon Glacier, Amazon Storage Gateway, and the Snowball family.
by Darin Briskman, Technical Evangelist, AWS
SQL is a powerful tool to query data, but it doesn't cover everything you might need. Sometimes, the precision of SQL is a limitation that can be overcome by using the flexibility and inherent ranking of search. Learn how to use AWS services to create fully managed solutions using Amazon Aurora and Amazon Elasticsearch Service to combine the power of query and search. Level: 200
MCL314: Unlocking Media Workflows Using Amazon Rekognition
Companies can have large amounts of image and video content in storage with little or no insight about what they have—effectively sitting on an untapped licensing and advertising goldmine. Learn how media companies are using Amazon Rekognition APIs for object or scene detection, facial analysis, facial recognition, or celebrity recognition to automatically generate metadata for images to provide new licensing and advertising revenue opportunities. Understand how to use Amazon Rekognition APIs to index faces into a collection at high scale, filter frames from a video source for processing, perform face matches that populate a person index in Elasticsearch, and use the Amazon Rekognition celebrity match feature to optimize the process for faster time to market and more accurate results.
Best Practices for Migrating Oracle Databases to the Cloud - AWS Online Tech ...
Learning Objectives:
- Learn how to migrate Oracle databases to the cloud
- Learn how to run additional components of the Oracle stack on AWS
- Get acquainted with other database options on AWS
by Joyjeet Banerjee, Enterprise Solutions Architect, AWS
Amazon Aurora is a MySQL- and PostgreSQL-compatible database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. In this deep dive session, we’ll discuss best practices and explore new features in areas like high availability, security, performance management and database cloning. Level 300
Use AWS DMS to Securely Migrate Your Oracle Database to Amazon Aurora with Mi...
Changing database engines is often daunting to customers. However, the value of a highly scalable, cost-effective, and fully managed service, such as Amazon Aurora, can make the challenge worth it. In this hands-on lab, we demonstrate how to take advantage of the AWS Schema Conversion Tool (SCT) and AWS Database Migration Service (DMS) to facilitate and simplify migrating an Oracle database to the Amazon Aurora PostgreSQL-compatible Edition. We connect to an Oracle (source) and a PostgreSQL (target) instance, and convert the Oracle database schema and code objects to PostgreSQL using AWS SCT. Then, we migrate and replicate the data using AWS DMS. AWS credits are provided. Bring your laptop, and have an active AWS account.
We have recently seen some convergence of different database technologies. Many customers are evaluating heterogeneous migrations as their database needs have evolved or changed. Evaluating the best database to use for a job isn’t as clear as it was ten years ago. In this session, we discuss the ideal use cases for relational and nonrelational data services, including Amazon ElastiCache for Redis, Amazon DynamoDB, Amazon Aurora, and Amazon Redshift. This session digs into how to evaluate a new workload for the best managed database option.
ABD201: Big Data Architectural Patterns and Best Practices on AWS
In this session, we simplify big data processing as a data bus comprising various stages: collect, store, process, analyze, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
Run Your CI/CD Pipeline at Scale for a Fraction of the Cost - AWS Online Tech...
Learning Objectives:
- Learn how Amazon EC2 Spot Instances can help run and scale your CI/CD pipeline for a fraction of the cost
- Learn how to deploy and configure the EC2 Spot Fleet plugin for Jenkins
- Leverage the full scale of the AWS cloud for faster results
Increasingly, valuable customer data sources are dispersed among on-premises data centers, SaaS providers, partners, third-party data providers, and public datasets. Building a data lake on AWS offers a foundation for storing on-premises, third-party, and public datasets cost effectively with high performance. This workshop introduces AWS tools and technologies you can use to analyze and extract value from petabyte-scale datasets, including Amazon Athena and Amazon Redshift Spectrum.
This presentation compares three modern architecture patterns that startups are building their businesses around. It includes a realistic analysis of cost, team management, and security implications of each approach. It covers AWS Elastic Beanstalk, Amazon ECS, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, and Amazon CloudFront. Attendees will also hear from venture capital investor Third Rock Ventures (TRV), which has launched 40+ biotech startups over the last 10 years. TRV will outline how it launches cloud native startups that turn bleeding edge science into new treatments across the spectrum of disease, with highlights drawn from Relay Therapeutics and Tango Therapeutics.
Design Patterns and Best Practices for Data Analytics with Amazon EMR (ABD305)
Amazon EMR is one of the largest Hadoop operators in the world, enabling customers to run ETL, machine learning, real-time processing, data science, and low-latency SQL at petabyte scale. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about lowering cost with Auto Scaling and Spot Instances, and security best practices for encryption and fine-grained access control. Finally, we dive into some of our recent launches to keep you current on our latest features.
AWS Database and Analytics State of the Union - 2017 - DAT201 - re:Invent 2017
In this session, we discuss the evolution of database and analytics services in AWS, the new database and analytics services and features we launched this year, and our vision for continued innovation in this space. We are witnessing an unprecedented growth in the amount of data collected, in many different forms. Storage, management, and analysis of this data require database services that scale and perform in ways not possible before. AWS offers a collection of database and other data services—including Amazon Aurora, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon ElastiCache, Amazon Kinesis, and Amazon EMR—to process, store, manage, and analyze data. In this session, we provide an overview of AWS database and analytics services and discuss how customers are using these services today.
STG309: Deep Dive Using Hybrid Storage with AWS Storage Gateway to Solve On-Pr...
Enterprises of all sizes face continuing data growth and persistent requirements to back up and recover application data. The pains of recurring storage hardware purchasing, management, and failures are still acute for many IT organizations. Some also need to integrate on-premises datasets with in-cloud workloads, such as big data processing and analytics. Learn how to use AWS Storage Gateway to connect on-premises applications to AWS storage services using standard storage protocols, such as NFS, iSCSI, and VTL. Storage Gateway enables hybrid cloud storage solutions for backup and disaster recovery, file sharing, in-cloud processing, or bulk ingest for migration. We discuss use cases with real-life customer stories, and offer best practices.
Moving your File Data to Amazon EFS - AWS Online Tech Talks
Learning Objectives:
- Recognize why and when to use Amazon EFS and its economic benefits versus other solutions
- Understand key elements to optimize the movement of your data to Amazon EFS
- See Amazon EFS in action with a live demo
Migrating Massive Databases and Data Warehouses to the Cloud - ENT327 - re:In...
Databases continue to grow to be multiple terabytes in size, but migrating to the cloud doesn't have to take days or create disruption for your business. To perform data migration at petabyte scale with minimal impact to your business, you can now use the new combination of AWS Database Migration Service replication agents and AWS Snowball. In this session, we discuss how to extract large-scale data from an on-premises Oracle database and migrate it to Amazon Aurora. We then outline a step-by-step process for converting your Oracle schema to a PostgreSQL-based schema.
DAT324: Expedia Flies with DynamoDB: Lightning-Fast Stream Processing for Trave...
Building rich, high-performance streaming data systems requires fast, on-demand access to reference data sets to implement complex business logic. In this talk, Expedia will discuss the architectural challenges the company faced, and how DAX + DynamoDB fits into the overall architecture and met their design requirements. Additionally, you will hear how DAX enabled Expedia to add caching to their existing applications in hours, where it previously took much longer. Session attendees will walk away with three key outputs: 1) Expedia’s overall architectural patterns for streaming data; 2) how they uniquely leverage DynamoDB, DAX, Apache Spark, and Apache Kafka to solve these problems; 3) the value that DAX provides and how it enabled them to improve their performance and throughput and reduce costs, all without having to write any new code.
Containers have revolutionized the way we build, package, deploy, and run applications. While containers initially only supported code and tooling for Linux applications, Docker now offers API and toolchain support for running Windows Servers in containers.
This webinar was held in March 2018 to an Australian and New Zealander audience.
ABD304-R: Best Practices for Data Warehousing with Amazon Redshift & Spectrum
Most companies are over-run with data, yet they lack critical insights to make timely and accurate business decisions. They are missing the opportunity to combine large amounts of new, unstructured big data that resides outside their data warehouse with trusted, structured data inside their data warehouse. In this session, we take an in-depth look at how modern data warehousing blends and analyzes all your data, inside and outside your data warehouse without moving the data, to give you deeper insights to run your business. We will cover best practices on how to design optimal schemas, load data efficiently, and optimize your queries to deliver high throughput and performance.
ABD327: Migrating Your Traditional Data Warehouse to a Modern Data Lake
In this session, we discuss the latest features of Amazon Redshift and Redshift Spectrum, and take a deep dive into its architecture and inner workings. We share many of the recent availability, performance, and management enhancements and how they improve your end user experience. You also hear from 21st Century Fox, who presents a case study of their fast migration from an on-premises data warehouse to Amazon Redshift. Learn how they are expanding their data warehouse to a data lake that encompasses multiple data sources and data formats. This architecture helps them tie together siloed business units and get actionable 360-degree insights across their consumer base.
AWS Commercial Management and Cost Optimisation - Dec 2017
Technical levers and strategic mechanisms for AWS Commercial Management and Cost Optimisation. Includes 2017 commercially relevant updates.
Speaker: Peter Shi, Commercial Architect, BD AWS APAC
In this session, you learn how to set up a crawler to automatically discover your data and build your AWS Glue Data Catalog. You then auto-generate an AWS Glue ETL script, download it, and interactively edit it using a Zeppelin notebook, connected to an AWS Glue development endpoint. After that, you upload this script to Amazon S3, reuse it across multiple jobs, and add trigger conditions to run the jobs. The resulting datasets automatically get registered in the AWS Glue Data Catalog and you can then query these new datasets from Amazon EMR and Amazon Athena. Prerequisites: Knowledge of Python and familiarity with big data applications is preferred but not required. Attendees must bring their own laptops.
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Gain in-depth knowledge and best practices for migrating commercial data warehouses to Amazon Redshift using AWS Database Migration Service (AWS DMS) and AWS Schema Conversion Tool (AWS SCT). We use an example based on an Oracle data warehouse, and we discuss approaches to migrate it to Amazon Redshift. We also discuss some of the common challenges, limitations, and workarounds, as well as the option of using AWS Snowball to migrate very large data warehouses to Amazon Redshift.
Organisations involved in Big Data and Analytics spend a lot of time preparing data for analysis, which often involves large-scale movement and transformation. In this session we will explore AWS Glue, a new service designed to assist with the process of cataloguing, transforming, and scheduling your data pipeline.
Speaker: Cassandra Bonner, Solutions Architect, Amazon Web Services
Data Analytics Week at the San Francisco Loft
Using Data Lakes
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
John Mallory - Principal Business Development Manager Storage (Object), AWS
Hemant Borole - Sr. Big Data Consultant, AWS
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Learning Objectives:
- Understand how to build a serverless big data solution quickly and easily
- Learn how to discover and prepare all your data for analytics
- Learn how to query and visualize analytics on all your data to create actionable insights
BDA402: Deep Dive: Log Analytics with Amazon Elasticsearch Service
Everything generates logs. Applications, infrastructure, security ... everything. Keeping track of the flood of log data is a big challenge, yet critical to your ability to understand your systems and troubleshoot (or prevent) issues. In this session, we will use both Amazon CloudWatch and application logs to show you how to build an end-to-end log analytics solution. First, we cover how to configure an Amazon Elasticsearch Service domain and ingest data into it using Amazon Kinesis Firehose, demonstrating how easy it is to transform data with Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data and configure a secure analytics environment. We demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we dive deep into the Elasticsearch query DSL and review approaches for generating custom, ad-hoc reports.
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Level: Intermediate
Speakers:
Tony Nguyen - Senior Consultant, ProServe, AWS
Hannah Marlowe - Consultant - Federal, AWS
AWS Databases
- Database models (SQL vs. NoSQL)
- Amazon Relational Database Service (RDS) concepts, including database instances, security groups, and parameter and option groups
- Amazon DynamoDB concepts, including data model and supported operations
A quick overview of Redshift and common use cases, followed by tools and links for performance tuning; how Redshift fits in the AWS data services; and a list of key new features since the last meetup in September 2016, including Redshift Spectrum, which allows one to run SQL directly on your data sitting on Amazon S3. It also covers the Redshift ecosystem of data integration, BI, consultancy, and data modelling partners.
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Amazon EMR is a managed Hadoop service that makes it easy for customers to use big data frameworks and applications like Hadoop, Spark, and Presto to analyze data stored in HDFS or on Amazon S3, Amazon’s highly scalable object storage service. In this webinar, we will introduce the latest release of Amazon EMR. With Amazon EMR release 5.0, customers can now launch the latest versions of popular open source frameworks including Apache Spark 2.0, Hive 2.1, Presto 0.151, Tez 0.8.4, and Apache Hadoop 2.7.2. We will walk through a demo to show you how to deploy a Hadoop environment within minutes. We will cover common use cases and best practices to lower costs using Amazon S3 as your data store and Amazon EC2 Spot Instances, which allow you to bid on spare Amazon EC2 computing capacity.
Learning Objectives:
• Describe the new features and updated frameworks in Amazon EMR 5.0
• Learn best practices and real-world applications for Amazon EMR
• Understand how to use EC2 Spot pricing to save costs
• Explain the advantages of decoupling storage and compute with Amazon S3 as storage layer for EMR workloads
AWS re:Invent 2016 was AWS’ largest event yet with over 32,000 attendees, 400 breakout sessions, and two keynotes of new product announcements. In this talk, we’ll explore the core themes of AWS re:Invent 2016 such as serverless and artificial intelligence. We will also drill down into several of the services and features unveiled including AWS Batch, AWS Shield, Aurora for Postgres, X-Ray, Polly, Lex, Rekognition, AWS Step Functions. Light appetizers and refreshments will be provided.
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
In this session, we discuss how Spark and Presto complement the Netflix big data platform stack that started with Hadoop, and the use cases that Spark and Presto address. Also, we discuss how we run Spark and Presto on top of the Amazon EMR infrastructure; specifically, how we use Amazon S3 as our data warehouse and how we leverage Amazon EMR as a generic framework for data-processing cluster management.
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
Neel Mitra - Solutions Architect, AWS
Roger Dahlstrom - Solutions Architect, AWS
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the San Francisco Loft
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL. AWS Glue generates the code to execute your data transformations and data loading processes.
Level: Intermediate
Speakers:
John Mallory - Principal Business Development Manager, Storage, AWS
Asim Kumar Sasmal - Big Data Consultant, AWS Professional Services
Customers are migrating their analytics, data processing (ETL), and data science workloads running on Apache Hadoop, Spark, and data warehouse appliances from on-premise deployments to AWS in order to save costs, increase availability, and improve performance. AWS offers a broad set of analytics services, including solutions for batch processing, stream processing, machine learning, data workflow orchestration, and data warehousing. This session will focus on identifying the components and workflows in your current environment; and providing the best practices to migrate these workloads to the right AWS data analytics product. We will cover services such as Amazon EMR, Amazon Athena, Amazon Redshift, Amazon Kinesis, and more. We will also feature Vanguard, an American investment management company based in Malvern, Pennsylvania with over $4.4 trillion in assets under management. Ritesh Shah, Sr. Program Manager for Cloud Analytics Program at Vanguard, will describe how they orchestrated their migration to AWS analytics services, including Hadoop and Spark workloads to Amazon EMR. Ritesh will highlight the technical challenges they faced and overcame along the way, as well as share common recommendations and tuning tips to accelerate the time to production.
How to Build Forecasting Services Using ML and Deep Learning Algorithms
Forecasting is an important process for a great many companies and is used in many areas to try to accurately predict the growth and distribution of a product, the resources needed on production lines, financial projections, and much more. Amazon uses advanced forecasting techniques, and some of these services have been made available to all AWS customers.
In this session we will show how to pre-process data that contains a temporal component and then use an algorithm that, based on the type of data analyzed, produces an accurate forecast.
Big Data for Startups: How to Build Big Data Applications in Serverless Mode
The variety and quantity of data created every day is accelerating ever faster and represents a unique opportunity to innovate and create new startups.
However, managing large amounts of data can seem complex: building large-scale Big Data clusters appears to be an investment accessible only to established companies. But the elasticity of the Cloud and, in particular, Serverless services allow us to break through these limits.
We will see how to develop Big Data applications rapidly, without worrying about infrastructure, dedicating all our resources to developing the ideas behind innovative products.
You can now use Amazon Elastic Kubernetes Service (EKS) to run Kubernetes pods on AWS Fargate, the serverless compute engine built for containers on AWS. This makes it easier than ever to build and run your Kubernetes applications in the AWS cloud. In this session we will present the main features of the service and how to deploy your application in a few steps.
Twenty years ago, Amazon went through a radical transformation aimed at increasing the pace of innovation. In that period we learned how changing our approach to application development allowed us to dramatically increase agility and release velocity and, ultimately, to build more reliable and scalable applications. In this session we will explain how we define modern applications and how building modern apps affects not only application architecture, but also organizational structure, development release pipelines, and even the operating model. We will also describe common approaches to modernization, including the approach used by Amazon.com itself.
How to Spend Up to 90% Less with Containers and Spot Instances
The use of containers keeps growing.
When properly designed, container-based applications are very often stateless and flexible.
AWS ECS, EKS, and Kubernetes on EC2 can take advantage of Spot Instances, yielding average savings of 70% compared to On-Demand instances. In this session we will look at the characteristics of Spot Instances and how easily they can be used on AWS. We will also learn how Spreaker uses Spot Instances to run applications of different kinds, in production, at a fraction of the on-demand cost!
In recent months, many customers have been asking us how to monetise Open APIs, simplify Fintech integrations, and accelerate adoption of various Open Banking business models. AWS and FinConecta would therefore like to invite you to the Open Finance marketplace presentation on October 20th.
Event Agenda :
Open banking so far (short recap)
• PSD2, OB UK, OB Australia, OB LATAM, OB Israel
Intro to Open Finance marketplace
• Scope
• Features
• Tech overview and Demo
The role of the Cloud
The Future of APIs
• Complying with regulation
• Monetizing data / APIs
• Business models
• Time to market
One platform for all: a Strategic approach
Q&A
Make Your Startup's Market Offering Unique with Machine Learning Services
To create value and build a differentiated, recognizable offering, successful startups know how to combine established technologies with innovative, purpose-built components.
AWS provides ready-to-use services and, at the same time, lets you customize and create the differentiating elements of your offering.
Focusing on Machine Learning technologies, we will see how to select the artificial intelligence services offered by AWS and, with the help of a demo, how to build custom Machine Learning models using SageMaker Studio.
OpsWorks Configuration Management: Automate the Management and Deployment of Your EC2 Instances
With the traditional approach to IT, implementing DevOps techniques was difficult for many years; until now they have often involved manual activities, occasionally leading to application downtime that interrupted users' work. With the advent of the cloud, DevOps techniques are now within everyone's reach at low cost for any kind of workload, guaranteeing greater system reliability and yielding significant improvements in business continuity.
AWS provides AWS OpsWorks as a Configuration Management tool that aims to automate and simplify the management and deployment of EC2 instances using Chef and Puppet.
Learn how to leverage AWS OpsWorks to guarantee the reliability of your application running on EC2 instances.
Microsoft Active Directory on AWS to Support Your Windows Workloads
Want to know your options for running Microsoft Active Directory on AWS? When moving Microsoft workloads to AWS, it is important to consider how to deploy Microsoft Active Directory to support group policy management, authentication, and authorization. In this session, we discuss the options for deploying Microsoft Active Directory on AWS, including AWS Directory Service for Microsoft Active Directory and deploying Active Directory on Windows on Amazon Elastic Compute Cloud (Amazon EC2). We cover topics such as integrating your on-premises Microsoft Active Directory environment into the cloud and using SaaS applications, such as Office 365, with AWS Single Sign-On.
From facial recognition to detecting fraud or manufacturing defects, image and video analysis powered by artificial intelligence techniques is evolving and being refined at a rapid pace. In this webinar we will explore what AWS services make possible when applying state-of-the-art computer vision techniques to real-world scenarios.
Amazon Web Services and VMware are holding a free virtual event next Wednesday, October 14th from 12:00 to 13:00 dedicated to VMware Cloud™ on AWS, the on-demand service that lets you run applications in cloud environments based on VMware vSphere® and access a wide range of AWS services, fully exploiting the potential of the AWS cloud while protecting existing VMware investments.
Create Your First Serverless Ledger-Based App with QLDB and NodeJS
Many companies today build applications with ledger-style functionality, for example to verify the history of credits and debits in banking transactions or to track the supply-chain flow of their products.
At the heart of these solutions are ledger databases, which provide a transparent, immutable, and cryptographically verifiable transaction log, but they are complex and costly tools to manage.
Amazon QLDB eliminates the need to build complex custom systems by providing a fully managed, serverless ledger database.
In this session we will see how to build a complete serverless application that uses QLDB's capabilities.
With the rise of microservice architectures and rich mobile and web applications, APIs are more important than ever for giving end users a great user experience. In this session we will learn how to tackle modern API design challenges with GraphQL, an open-source API query language used by Facebook, Amazon, and others, and how to use AWS AppSync, a managed serverless GraphQL service on AWS. We will dig into several scenarios, understanding how AppSync can help solve these use cases by building modern APIs with real-time and offline data-update capabilities.
We will also learn how Sky Italia uses AWS AppSync to deliver real-time sports updates to users of its web portal.
Oracle Databases and VMware Cloud™ on AWS: Myths to Debunk
Many organizations reap the benefits of the cloud by migrating their Oracle workloads, securing significant gains in agility and cost efficiency.
Migrating these workloads can create complexity during application modernization and refactoring, and performance risks can be introduced when moving applications out of on-premises data centers.
In these slides, AWS and VMware experts present simple, practical tips to ease and simplify the migration of Oracle workloads while accelerating the transformation to the cloud; they dig into the architecture and show how to fully exploit the potential of VMware Cloud™ on AWS.
Amazon Elastic Container Service (Amazon ECS) is a highly scalable container management service that simplifies managing Docker containers through an orchestration layer controlling deployment and lifecycle. In this session we will present the service's main characteristics, reference architectures for different workloads, and the simple steps needed to quickly migrate one or more of your containers.
2. AWS Data Services to Accelerate Your Move to the Cloud
(Diagram of the AWS data services portfolio:)
- Migration for DB Freedom: Database Migration, Schema Conversion
- Databases to Elevate your Apps (Relational, Non-Relational & In-Memory): RDS Open Source, RDS Commercial, Aurora, DynamoDB & DAX, ElastiCache
- Analytics to Engage your Data (Inline, Data Warehousing, Reporting, Data Lake): EMR, Amazon Redshift, Redshift Spectrum, Athena, Elasticsearch Service, QuickSight, Glue
- Amazon AI to Drive the Future: Lex, Polly, Rekognition, Machine Learning, Deep Learning (MXNet)
4. DynamoDB: Non-Relational Managed Database Service
- Schemaless data model
- Consistent low latency performance
- Predictable provisioned throughput
- Seamless scalability with no storage limits
- High durability & availability (replication across 3 facilities)
- Easy administration – we scale for you!
- Low cost
DynamoDB Accelerator (DAX) offers caching without coding for sub-millisecond read latency and up to 10x throughput. (Diagram: App → DAX → DynamoDB.)
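A minimal Python (boto3) sketch of the key-based access pattern described above; the CustomerOrders table and its attributes follow the illustration on the next slide, and standard AWS credentials are assumed:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("CustomerOrders")  # table name from the slide illustration

# Query by partition key; DynamoDB serves this with consistent low latency
# at any scale. To read through DAX instead, you would swap in a DAX client
# (e.g., the amazon-dax-client package) without changing the query itself.
resp = table.query(KeyConditionExpression=Key("CustomerId").eq(1))
for item in resp["Items"]:
    print(item)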
5. Highly available and durable
Data in the CustomerOrdersTable is always replicated to three Availability Zones (3-way replication). (Diagram: partitions A, B, and C of the table spread across hosts 1–9 in Availability Zones A, B, and C; the item OrderId: 1, CustomerId: 1, ASIN: [B00X4WHP5E] is placed on a partition via Hash(1) = 7B.)
7. Consistently fast at any scale
(Chart: latency in milliseconds stays consistently in the single digits as requests grow into the millions.)
10. Local secondary indexes
- Alternate sort key attribute
- Index is local to a partition key
- 10 GB max per partition key, i.e. LSIs limit the # of sort keys!
(Table: the base partition key A1 paired with alternate sort keys A3, A4, or A5; the remaining attributes A2–A5 are projected alongside.)
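As a rough illustration, the following Python (boto3) sketch defines a table with one local secondary index; all table, attribute, and index names here are hypothetical:

import boto3

client = boto3.client("dynamodb")
client.create_table(
    TableName="CustomerOrders",  # hypothetical
    AttributeDefinitions=[
        {"AttributeName": "CustomerId", "AttributeType": "N"},
        {"AttributeName": "OrderId", "AttributeType": "N"},
        {"AttributeName": "OrderDate", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "CustomerId", "KeyType": "HASH"},  # partition key
        {"AttributeName": "OrderId", "KeyType": "RANGE"},    # sort key
    ],
    LocalSecondaryIndexes=[{
        "IndexName": "ByOrderDate",  # hypothetical
        "KeySchema": [
            {"AttributeName": "CustomerId", "KeyType": "HASH"},  # same partition key
            {"AttributeName": "OrderDate", "KeyType": "RANGE"},  # alternate sort key
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)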
11. Global secondary indexes
- Alternate partition (+sort) key
- Sparse
- Can be added or removed anytime
- Reads and writes provisioned separately for GSIs
- Projections: KEYS_ONLY, INCLUDE (e.g. INCLUDE A2), or ALL
(Table: the index partition key A3 is stored with the table key A1; the attributes carried along depend on the projection type.)
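Since GSIs can be added anytime and are provisioned separately, a sketch of adding one to an existing table might look like this in Python (boto3); again, every name is hypothetical:

import boto3

client = boto3.client("dynamodb")
client.update_table(
    TableName="CustomerOrders",
    AttributeDefinitions=[
        {"AttributeName": "ASIN", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "ByASIN",  # hypothetical
            "KeySchema": [{"AttributeName": "ASIN", "KeyType": "HASH"}],  # alternate partition key
            "Projection": {"ProjectionType": "KEYS_ONLY"},
            # GSIs get their own read/write capacity, separate from the table
            "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
        }
    }],
)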
12. DynamoDB Streams
- Ordered stream of item changes
- Exactly once, strictly ordered by key
- Highly durable, scalable
- 24 hour retention
- Sub-second latency
- Compatible with Kinesis Client Library
(Diagram: updates to partitions A, B, and C of a DynamoDB table flow into DynamoDB stream shards 1, 2, and 3, which KCL workers in an application consume via GetRecords. Shards have a lineage and automatically close after time or when the associated DynamoDB partition splits.)
13. DynamoDB Streams and Triggers
- Implemented as AWS Lambda functions
- Scale automatically
- C#, Java, Node.js, Python
(Diagram: a trigger Lambda function fans stream changes out to targets such as Amazon SNS, Amazon ES, and Amazon ElastiCache.)
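A minimal sketch of such a trigger in Python, forwarding new and modified items to Amazon SNS; the topic ARN is hypothetical, and the event shape is the standard DynamoDB Streams Lambda event:

import json
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:item-changes"  # hypothetical

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            # NewImage is a DynamoDB attribute-value map, e.g. {"OrderId": {"N": "1"}};
            # it is present only if the stream is configured with NEW_IMAGE
            new_image = record["dynamodb"]["NewImage"]
            sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps(new_image))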
16. Integration with Amazon EMR
The Elasticsearch-Hadoop (ES-Hadoop) connector enables several Hadoop stack applications running on EMR or EC2 to power real-time search and analytics with Amazon Elasticsearch Service, as well as beautiful visualizations with Kibana. It seamlessly moves data between Hadoop and Elasticsearch, letting you index Hadoop data (HDFS/EMRFS) into, and query it from, Amazon Elasticsearch Service.
(Diagram: Amazon EMR ↔ ES-Hadoop ↔ Amazon ES.)
17. ES-Hadoop Connector – for Spark & Friends
(Diagram: Hadoop applications on EMR/EC2 – Spark, Storm, Hive, and friends – use the ES-Hadoop connector to index data to and query* data from an Amazon Elasticsearch cluster, which powers analyze, search, visualize, and discover use cases.)
* With Spark SQL, at runtime, Spark SQL translates to Query DSL. Data is filtered at source.
18. ES-Hadoop Connector – considerations
- Performance: Since Amazon Elasticsearch cluster nodes are not collocated on EMR cluster nodes, local discovery should be disabled so the ES-Hadoop Connector only connects through the declared es.nodes during all operations, including reads and writes: es.nodes.wan.only should be set to true. Since partition-to-partition parallelism cannot be achieved, performance may be impacted at scale, and ES-Hadoop connector tasks should be tested for bottlenecks.
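A sketch of what this looks like from PySpark, under assumed endpoint and index names (the elasticsearch-hadoop jar must already be on the cluster classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-hadoop-demo").getOrCreate()
df = spark.createDataFrame([(1, "guitar"), (2, "sports")], ["id", "interest"])

(df.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "search-mydomain.us-east-1.es.amazonaws.com")  # hypothetical endpoint
    .option("es.port", "443")
    .option("es.net.ssl", "true")
    .option("es.nodes.wan.only", "true")     # connect only via the declared es.nodes
    .option("es.resource", "interests/doc")  # hypothetical index/type
    .save())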
19. ES-Hadoop Connector – considerations (contd.)
- Security:
  - For an EMR cluster in a public subnet, use an IP-based access policy with Amazon Elasticsearch to whitelist the EMR IPs.
  - For an EMR cluster in a private subnet, use an identity-based access policy with Amazon Elasticsearch and install the AWS ES/Kibana proxy on EMR nodes via a bootstrap action.
20. Kinesis Firehose delivery architecture with transformations
(Diagram: a data source sends source records into a Firehose delivery stream; a data transformation function produces transformed records that are delivered to Amazon Elasticsearch Service; transformation failures and delivery failures are written to an S3 bucket.)
22. Elasticsearch works with structured JSON
{
  "name": {
    "first": "Jon",
    "last": "Smith"
  },
  "age": 26,
  "city": "palo alto",
  "years_employed": 4,
  "interests": [
    "guitar",
    "sports"
  ]
}
- Documents contain fields – name/value pairs
- Fields can nest
- Value types include text, numerics, dates, and geo objects
- Field values can be single or array
- When you send documents to Elasticsearch they should arrive as JSON*
*ES 5 can work with unstructured documents
23. If your data is not already in structured JSON, you must transform it, creating structured JSON that Elasticsearch "understands".
24. The most basic way to transform data
- Run a script in Amazon EC2, Lambda, etc. that reads data from your data source, creates JSON documents, and ships them to Amazon Elasticsearch Service directly
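For instance, a rough Python sketch of such a script, parsing Apache-style log lines (like the one on slide 36) into JSON documents and shipping them to the Amazon ES _bulk API; the endpoint and index are hypothetical, and a production version would sign its requests:

import json
import re
import urllib.request

ES_ENDPOINT = "https://search-mydomain.us-east-1.es.amazonaws.com"  # hypothetical
LINE = re.compile(r'(?P<host>\S+) (?P<ident>\S+) (?P<auth>\S+) '
                  r'\[(?P<timestamp>[^\]]+)\] '
                  r'"(?P<verb>\S+) (?P<request>\S+)[^"]*" '
                  r'(?P<status>\d+) (?P<size>\d+)')

# Build a newline-delimited _bulk body: one action line per document line
bulk_body = []
with open("access.log") as f:
    for line in f:
        m = LINE.match(line)
        if not m:
            continue
        bulk_body.append(json.dumps({"index": {"_index": "logs", "_type": "doc"}}))
        bulk_body.append(json.dumps(m.groupdict()))

req = urllib.request.Request(
    ES_ENDPOINT + "/_bulk",
    data=("\n".join(bulk_body) + "\n").encode("utf-8"),
    headers={"Content-Type": "application/x-ndjson"},
)
print(urllib.request.urlopen(req).read().decode())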
25. Logstash simplifies transformation
- Logstash is open-source ETL over streams. Run it colocated with your application, or read from your source
- Many input plugins and output plugins make it easy to connect to Logstash
- Grok pattern matching to pull out values and re-write
(Diagram: Logstash running alongside the application instance.)
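A minimal Logstash pipeline along these lines might look like the following sketch; the log path and endpoint are assumptions, and %{COMMONAPACHELOG} is Logstash's stock grok pattern for Apache access logs:

input  { file { path => "/var/log/httpd/access_log" } }  # assumed log location
filter { grok { match => { "message" => "%{COMMONAPACHELOG}" } } }
output {
  elasticsearch {
    hosts => ["https://search-mydomain.us-east-1.es.amazonaws.com:443"]  # hypothetical
    index => "logs"
  }
}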
26. Elasticsearch 5 ingest processors
When you index documents, you can specify a pipeline. The pipeline can have a series of processors that pre-process the data before indexing. Twenty processors are available; some are simple:
{ "append": { "field": "field1", "value": ["item2", "item3", "item4"] } }
Others are more complex, like the Grok processor for regex with aliased expressions.
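As a sketch of how such a pipeline is wired up (Python; the endpoint and names are hypothetical), you register it under _ingest/pipeline and then index with ?pipeline=:

import json
import urllib.request

ES = "https://search-mydomain.us-east-1.es.amazonaws.com"  # hypothetical endpoint

def put(path, body):
    # Small helper: HTTP PUT a JSON body to the cluster
    req = urllib.request.Request(ES + path, data=json.dumps(body).encode("utf-8"),
                                 method="PUT",
                                 headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req).read()

# Register a pipeline containing the append processor from the slide
put("/_ingest/pipeline/demo", {
    "description": "append demo",
    "processors": [
        {"append": {"field": "field1", "value": ["item2", "item3", "item4"]}}
    ],
})

# Index a document through the pipeline
put("/logs/doc/1?pipeline=demo", {"field1": ["item1"]})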
27. Firehose transformations add robust delivery
- Inline calls to Lambda for free-form changes to the underlying data
- Failed transforms tracked and delivered to S3
(Diagram: as on slide 20 – source records flow from the data source through the Firehose delivery stream and the data transformation function into Amazon Elasticsearch Service, with transformation and delivery failures landing in an S3 bucket.)
28. Firehose transformations add robust delivery
- Inline calls to Lambda for free-form changes to the underlying data
- Failed transforms tracked and delivered to S3
(Diagram: the same flow, with transformed records additionally staged in an intermediate Amazon S3 bucket and copied to a backup S3 bucket alongside delivery to Amazon Elasticsearch Service.)
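The transformation function itself follows Firehose's standard record contract (recordId / result / base64-encoded data); a minimal Python sketch, with the uppercase transform standing in for real logic:

import base64

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = payload.upper()  # placeholder for a real transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    # Records marked "ProcessingFailed" are what Firehose tracks and delivers to S3
    return {"records": output}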
30. Cluster is a collection of nodes
(Diagram: an Amazon ES cluster with dedicated master nodes plus data nodes Instance 1, Instance 2, and Instance 3; copies of shards 1, 2, and 3 are spread across the data nodes, which serve queries and updates.)
31. Data pattern
(Diagram: one index per day – logs_01.21.2017 through logs_01.27.2017. Each index has multiple shards (Shard 1, Shard 2, Shard 3); each shard contains a set of documents; each document contains a set of fields and values – host, ident, auth, timestamp, etc.)
32. Indices and Mappings
Index: product
- Type: cellphone – document addressed as http://hostname/product/cellphone/1; fields: make (keyword), inventory (int), location (geo point)
- Type: reviews – document addressed as http://hostname/product/reviews/1; fields: make (keyword), review (text), rating (float), date (date)
34. Shards
- Indexes are split into multiple shards
- Primary shards are defined at index creation
- Defaults to 5 primary and 1 replica shard
- Shards allow:
  - Horizontal scale
  - Distributing and parallelizing operations to increase throughput
  - Creating replicas to provide high availability in case of failures
35. Shards … contd
- A shard is a Lucene index
- The number of replica shards can be changed on the fly, but not the number of primary shards
- To change the number of primary shards, the index needs to be re-created
- Shards are automatically balanced when the cluster is re-sized
36. Field indexes
Example document: 199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
Its fields: host, ident, auth, timestamp, verb, request, status, size.
Elasticsearch creates an index for each field, containing the decomposed values of those fields. The host field index, for example, holds terms such as 199.72.81.55, unicomp6.unicomp.net, 199.120.110.21, burger.letters.com, 205.212.115.106, and d104.aa.net, each pointing to its postings – the list of matching document IDs (1, 4, 8, 12, 30, 42, 58, 100, ...).
37. host:199.72.81.55 AND verb:GET
- Look up each term's postings: host:199.72.81.55 → 1, 4, 8, 12, 30, 42, 58, 100, ...; verb:GET → 1, 4, 9, 50, 58, 75, 90, 103, ...
- AND-merge the two lists → 1, 4, 58
- Score the survivors → 1.2, 3.7, 0.4
- Sort by score → 4, 1, 58
The index data structures support fast retrieval and merging. Scoring and sorting support best-match retrieval.
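A toy Python sketch of that merge step – intersecting two sorted postings lists, then ordering the survivors by score (the scores are the made-up values from the slide, not a real relevance function):

def intersect(a, b):
    # Walk both sorted postings lists in step, keeping shared doc IDs
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

host = [1, 4, 8, 12, 30, 42, 58, 100]
verb = [1, 4, 9, 50, 58, 75, 90, 103]
hits = intersect(host, verb)        # [1, 4, 58]
scores = {1: 1.2, 4: 3.7, 58: 0.4}  # stand-in scores from the slide
print(sorted(hits, key=lambda d: scores[d], reverse=True))  # [4, 1, 58]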
38. Index and Document Command Examples
Create an index called product:
$ curl -XPUT 'http://hostname/product/'
Get the list of indices:
$ curl 'http://hostname/_cat/indices'
health status index   uuid    pri rep docs.count docs.deleted store.size pri.store.size
yellow open   product 95SQ4TS 5   1   0          0            260b       260b
40. What happens at Index Operation
HTTP PUT http://hostname/product/cellphone/1
1. The indexing operation arrives at a node in the Elasticsearch cluster
2. The target shard is determined by hashing the document ID
3. The current node forwards the document to the node holding the primary shard
4. The primary shard ensures all replica shards replay the same indexing operation
(Diagram: three instances; the request lands on one node, is routed to the primary shard, then fanned out to the replicas.)
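For step 2, Elasticsearch's routing rule boils down to the following (routing defaults to the document _id, and hash is Elasticsearch's internal Murmur3-based hash):

# Shard selection for an index operation
shard_num = hash(routing) % number_of_primary_shards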
41. Mappings
1. Mappings are used to define types of documents
2. Mappings define the various fields in a document
3. Mapping types:
   1. Core: text or keyword, numeric, date, boolean
   2. Arrays and multi-fields:
      - Arrays – "tags": ["blue", "red"]
      - Multi-fields – index the same data with different settings
   3. Pre-defined fields: _ttl, _size; _uid, _id, _type, _index; _all, _source
42. Mapping command examples
Create an index called product with mapping cellphone and field make as type text:
curl -XPUT 'http://hostname/product' -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "cellphone": {
      "properties": {
        "make": {
          "type": "text"
        }
      }
    }
  }
}'
43. Mapping command examples
Add a new mapping, reviews, with field review as text and field rating as integer, to the existing index product:
curl -XPUT 'http://hostname/product/_mapping/reviews' -H 'Content-Type: application/json' -d'
{
  "properties": {
    "review": {
      "type": "text"
    },
    "rating": {
      "type": "integer"
    }
  }
}'
44. Mapping command examples
Add a new field, inventory, as integer to the existing mapping cellphone in index product:
curl -XPUT 'http://hostname/product/_mapping/cellphone' -H 'Content-Type: application/json' -d'
{
  "properties": {
    "inventory": {
      "type": "integer"
    }
  }
}'