The talk illustrates a real-world example of how to collect data from your web, mobile, server, and cloud apps and then send it to third-party services and tools or load it into your data warehouse.
The data collection pipeline is built on multiple AWS services, such as Kinesis Firehose, Lambda functions, and Step Functions; Python is used to write each module. The data workflow is described in full, covering how to store backups correctly, manage conditional routing (to allow or discard data for specific services), implement a retry strategy on failure, and finally compare performance and costs for each module.
8. Amazon Kinesis Stream
What is Amazon Kinesis Stream?
• Collect and process large streams of data records in real time.
• Typical scenarios for using Streams:
• Manage multiple producers that push their data feed directly into a stream;
• Collect real-time analytics and metrics;
• Process application logs;
• Create pipelines with other AWS services (the consumers).
9. Amazon Kinesis Stream

from time import gmtime, strftime

import boto3

client = boto3.client(
    service_name="kinesis",
    region_name="us-east-1",
)

for i in range(300):
    print("sending event {}".format(i + 1))
    response = client.put_record(
        StreamName="data-collection-stream",
        Data='{"name":"event-%d","data":{"payload":%d}}' % (i, i),
        PartitionKey=strftime("PK-%Y%m%d-%H%M%S", gmtime()),
    )
    print("response for event {}: {}".format(i + 1, response))
10. Amazon Kinesis Stream - Tips
• Use API Gateway as the entry point for front-end and mobile clients.
• Start with a single shard and increase only when needed.
• Put events into the stream one by one to avoid data loss.
• Generate the PartitionKey using uuid (e.g., for testing purposes).
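The last tip can be sketched in a few lines. This is a minimal illustration (the helper name is ours, not from the talk): a random UUID spreads records evenly across shards, which is useful when load-testing, while the timestamp-based key used in the earlier snippet groups all records from the same second onto one shard.

```python
import uuid


def make_partition_key():
    # A random UUID distributes records evenly across shards,
    # handy when testing a multi-shard stream.
    return str(uuid.uuid4())
```

In production you would more likely derive the key from a stable attribute (e.g., a user id) so that records for the same entity keep their ordering within a shard.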
11. Amazon Lambda
What is AWS Lambda?
• It processes a single event in real time without managing servers.
• Highly scalable.
• Fallback strategy in case of errors.
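A Lambda Function consuming the stream receives Kinesis records base64-encoded. The handler below is a hedged sketch of that shape (the function and field names follow the AWS event format; the decoding logic is ours, not from the talk); with Batch size: 1 there is exactly one record per invocation.

```python
import base64
import json


def lambda_handler(event, context):
    # Kinesis delivers records base64-encoded under event['Records'];
    # decode each payload back into the original JSON event.
    processed = []
    for record in event.get('Records', []):
        payload = base64.b64decode(record['kinesis']['data'])
        processed.append(json.loads(payload))
    return {'processed': len(processed)}
```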
22. Amazon Lambda - Our tips
• Enable Kinesis Stream as a trigger for other AWS services.
• To preserve the priority, configure the trigger with Batch size: 1 and Starting position: Trim Horizon.
• An S3 file can be used to define the routing rules.
• Invoke Lambda Functions that work as connectors asynchronously.
• Always create aliases and versions for each Function.
• Use environment variables for configurations.
• Create a custom IAM role for each Function.
• Detect delays in stream processing by monitoring the IteratorAge metric
in the Lambda console's monitoring tab.
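The S3-based routing and asynchronous invocation tips can be sketched as follows. This is an assumption about how such a router could look, not the talk's actual code: the rules-file shape and both helper names (load_routing_rules, route_event) are ours, and the Lambda client is passed in to keep the routing logic testable.

```python
import json


def load_routing_rules(bucket, key):
    # Fetch the routing file from S3 (requires boto3 and AWS credentials);
    # we assume it maps event names to connector Function names, e.g.
    # {"signup": ["connector-a", "connector-b"]}
    import boto3
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket=bucket, Key=key)
    return json.loads(obj['Body'].read())


def route_event(event_name, payload, rules, lambda_client):
    # Invoke each connector Function asynchronously ('Event' invocation
    # type) so a slow third-party service never blocks stream processing.
    invoked = []
    for function_name in rules.get(event_name, []):
        lambda_client.invoke(
            FunctionName=function_name,
            InvocationType='Event',
            Payload=json.dumps(payload),
        )
        invoked.append(function_name)
    return invoked
```

Events whose name has no entry in the rules file are simply discarded, which matches the conditional routing described in the abstract.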
25. DLQ - Simple Queue Service (SQS)
What is AWS SQS?
• Lambda automatically retries failed executions for asynchronous invocations.
• Configure Lambda (advanced settings) to forward payloads that were not
processed to a dead-letter queue (an SQS queue or an SNS topic).
• We used an SQS queue.
26. Amazon Lambda - Events routing

import json

import boto3


def get_events_from_sqs(
        sqs_queue_name,
        region_name='us-west-2',
        purge_messages=False,
        backup_filename='backup.jsonl',
        visibility_timeout=60):
    """
    Create a JSON-lines backup file of all events in the SQS queue with
    the given 'sqs_queue_name'.

    :sqs_queue_name: the name of the AWS SQS queue to be read via boto3
    :region_name: the region name of the AWS SQS queue to be read via boto3
    :purge_messages: True if messages must be deleted after reading, False otherwise
    :backup_filename: the name of the file where to store all SQS messages
    :visibility_timeout: period of time in seconds (unique consumer window)
    :return: the number of processed batches of events
    """
    forwarded = 0
    counter = 0
    sqs = boto3.resource('sqs', region_name=region_name)
    dlq = sqs.get_queue_by_name(QueueName=sqs_queue_name)
    # continues on next slide ..
27. Amazon Lambda - Events routing
    # continues from previous slide ..
    with open(backup_filename, 'a') as filep:
        while True:
            batch_messages = dlq.receive_messages(
                MessageAttributeNames=['All'],
                MaxNumberOfMessages=10,
                WaitTimeSeconds=20,
                VisibilityTimeout=visibility_timeout,
            )
            if not batch_messages:
                # the queue is empty: stop polling
                break
            for msg in batch_messages:
                try:
                    line = "{}\n".format(json.dumps({
                        'attributes': msg.message_attributes,
                        'body': msg.body,
                    }))
                    print("Line: ", line)
                    filep.write(line)
                    if purge_messages:
                        print('Deleting message from the queue.')
                        msg.delete()
                    forwarded += 1
                except Exception as ex:
                    print("Error in processing message %s: %r" % (msg, ex))
            counter += 1
            print('Batch %d processed' % counter)
    return counter
28. DLQ - Our tips
• Set a DLQ on each Lambda Function that can fail.
• Re-process events sent to the DLQ with a custom script.
• Tune the DLQ config directly from the Lambda Function panel.
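The re-processing tip pairs naturally with the backup function shown on slides 26-27: once events are saved to the JSON-lines file, a small script can replay them into the stream. This is our own sketch of such a script, not code from the talk (the function name and the choice of PartitionKey are assumptions, and the Kinesis client is injected for testability).

```python
import json


def replay_backup(backup_filename, kinesis_client, stream_name):
    # Read the JSON-lines backup produced by get_events_from_sqs()
    # and push each saved event body back into the Kinesis stream.
    replayed = 0
    with open(backup_filename) as filep:
        for line in filep:
            record = json.loads(line)
            kinesis_client.put_record(
                StreamName=stream_name,
                Data=record['body'],
                PartitionKey=str(replayed),  # simple key for replay only
            )
            replayed += 1
    return replayed
```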
29. Conclusions
Why a serverless architecture?
• Scalability - data-loss prevention - full control over each step - costs.
Open points:
• Integrate a custom CloudWatch dashboard.
• Configure Firehose for a backup.
• Write a script that manages events sent to DLQs.
• Create a listener for anomaly detection with Kinesis Analytics.
• Amazon Step Functions.
31. Useful links
These slides:
Create a serverless architecture for data collection with Python and AWS
—> http://clda.co/pycon8-serverless-data-collection
Blog post with code snippets:
Building a serverless architecture for data collection with AWS Lambda
—> http://clda.co/pycon8-data-collection-blogpost
Serverless Learning Path:
Getting Started with Serverless Computing
—> http://clda.co/pycon8-serverless-LP