Understand how to structure and measure metrics gathered from microservices on a large scale and how to build useful ecosystems around the processed data.
Kinesis Logger Config
local logger_module = "api-gateway.logger.BufferedAsyncLogger"
local logger_opts = {
    flush_length = 500,          -- PutRecords accepts up to 500 records per request
    flush_interval = 5,          -- flush every 5 seconds, regardless of whether the buffer is full or not
    flush_concurrency = 16,      -- max parallel threads used for sending logs
    flush_throughput = 10000,    -- max logs / second that can be sent to the Kinesis backend
    sharedDict = "stats_kinesis" -- shared dict for caching the logs
}
- Batch layer
(architecture diagram) Kinesis → Spark Streaming (Docker / Marathon) → temporary storage as Parquet files in HDFS → S3 sync → hourly / daily Spark aggregation jobs (Docker / Chronos) → Elasticsearch aggregated index, served through the {API} alongside the real-time (speed) index and its consumers. Callouts: batch size, checkpointing in Kinesis / Spark, store in Parquet format, run the aggregation job hourly or daily.
OLAP Data Cube
(cube diagram, dimensions: time, consumer, service; measure: count)
The Elasticsearch aggregated index can be represented as a data cube.
The cube is actually a hypercube with more than 3 dimensions.
Users can apply filters, roll-ups or drill-downs.
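As an illustration only, a drill-down plus roll-up over that index could be expressed as an Elasticsearch aggregation; the sketch below uses the elasticsearch-py client, and the index name and field names (api-aggregated, service, consumer, timestamp, count) are placeholders, not the real mapping.

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed local endpoint

# Drill down to one consumer, then roll up counts per service per hour.
query = {
    "size": 0,
    "query": {"term": {"consumer": "some-app-key"}},  # filter (placeholder key)
    "aggs": {
        "per_service": {
            "terms": {"field": "service"},
            "aggs": {
                "per_hour": {
                    "date_histogram": {"field": "timestamp", "interval": "hour"},
                    "aggs": {"calls": {"sum": {"field": "count"}}}
                }
            }
        }
    }
}
result = es.search(index="api-aggregated", body=query)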
My name is Cosmin Stanciu. I work for Adobe, and today I'd like to encourage you to think big, big as in big data. I'll try to show you that, in a short period of time and leveraging today's technologies, you can very quickly develop solutions that could highly impact your business.
How many of you have heard about Creative Cloud, Document Cloud or Adobe I/O?
Besides applications, Adobe provides APIs that you can use with your own application.
As a developer you can incorporate Adobe technologies into your application using these APIs.
Adobe I/O provides documentation and developer tools for interacting with Adobe's APIs.
For example: if you develop mobile applications, use the Creative SDK to power your application,
or use the PDF services to create PDFs programmatically, just uploading the assets and retrieving the PDFs once the conversion is done.
Adobe I/O is the entry point / the gateway for all these APIs.
The API gateway is powered by NGINX.
- Authorizes applications and validates requests.
- Caching for faster responses.
- Service throttling / rate limiting (presented by Dragos).
- Tracking and monitoring.
We understand the power of the bazaar compared to the cathedral, so we open-sourced our modules.
The gateway is built on top of OpenResty, and so all our modules are written in Lua.
api-gateway - which is the core /
request validation - for validating requests /
api-gateway-aws - which allows integration with the AWS APIs /
api-gateway-async-logger - which allows sending logs and metrics in an async manner.
The gateway proxies around 600 million requests a day, and that number will rise considerably in the months to come.
Metadata about each request is sent to Kinesis.
We built an entire ecosystem around this data: debugging, analytics, notifications, and ideas for a new business model.
Adobe has evolved, driven by technology changes and a competitive market.
We migrated from a license-based model to a subscription-based model, and that has been incredibly beneficial for us.
But now we are talking about a new model, which is consumption / usage based.
Nowadays you have a lot of tools to process big data.
So it's a much easier problem to solve than it has been in the past.
But every tool comes with its own flavor and set of options; you need to choose those that best fit your needs.
To process NGINX data you must first invent the big data ecosystem.
The streaming application itself is easy, but making it production-ready presents some challenges: architecting the cluster, setting up the applications, and performance-tuning them is tricky.
In the next few slides I will describe what the entire architecture looks like, starting with NGINX and ending with fancy charts.
I don’t believe in perfect solutions.
There is no silver bullet for every requirement.
We decided on Kinesis because:
it's a managed service, relatively cheap, and deployed in all regions.
You need low latency between your producer and your message bus.
The total number of records that failed and were retried represents 0.0048% of the entire traffic.
Kinesis: PutRecords accepts up to 500 records per request; each shard takes up to 1,000 records per second.
Records are sent when: the buffer is full, enough seconds have passed since the last flush, a thread is available, and the throughput limit has not been exceeded.
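For illustration only, the same batching rule looks like this on the producer side in Python with boto3 (the stream name and partition keys are placeholders; the gateway itself does this from Lua):

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # assumed region

def flush(records, stream="api-gateway-metrics"):  # placeholder stream name
    # PutRecords accepts at most 500 records per call, so send in chunks.
    for start in range(0, len(records), 500):
        chunk = records[start:start + 500]
        response = kinesis.put_records(
            StreamName=stream,
            Records=[{"Data": rec, "PartitionKey": str(idx)}
                     for idx, rec in enumerate(chunk)],
        )
        # FailedRecordCount tells us how many entries must be retried.
        if response["FailedRecordCount"] > 0:
            pass  # collect the failed entries and retry with backoff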
You can't talk to Kinesis across accounts directly.
STS AssumeRole returns a set of temporary security credentials.
We have Kinesis in our insights account, and we can control which gateway collections can send us records.
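A rough sketch of that cross-account flow with boto3 (the role ARN and session name are placeholders):

import boto3

sts = boto3.client("sts")

# Assume a role defined in the insights account; STS returns temporary credentials.
assumed = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/kinesis-writer",  # placeholder ARN
    RoleSessionName="api-gateway-logger",                     # placeholder session name
)
creds = assumed["Credentials"]

# Use the temporary credentials to write to the stream owned by the other account.
kinesis = boto3.client(
    "kinesis",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)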
We run our services in AWS and we need flexibility in running our applications.
Be able to scale up / down on demand.
Run stateful / stateless applications.
Cell-OS cluster - an operating system for the datacenter, in AWS.
Explain the most important elements of the cluster.
It automatically creates the VPC, subnets, ELBs, Internet gateway and security groups.
Kinesis consumers that read data in real time and send that data to ES.
This can be a Spark application that sends data to ES but also does other tasks such as monitoring and anomaly detection.
Store every record in ES
Read with Spark and write to Parquet in HDFS.
Sync the Parquet files to S3.
Run Spark aggregation jobs.
Store the aggregated records into Elasticsearch.
The streaming job writes to HDFS in batches: it gathers all the requests and stores them temporarily in HDFS in Parquet format. Then we have jobs that sync the data to S3. You may wonder why we don't write directly to S3: library incompatibility (see email to Martin), and you can end up with data inconsistency when Spark writes multiple times to the same location.
One thing we've learned: checkpointing, checkpointing, checkpointing, both in Kinesis and in Spark.
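A rough sketch of that streaming step, assuming the spark-streaming-kinesis-asl connector (application name, stream name, region and paths are placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="gateway-metrics-stream")  # placeholder app name
ssc = StreamingContext(sc, batchDuration=60)         # batch size: 60-second micro-batches
ssc.checkpoint("hdfs:///checkpoints/gateway")        # Spark-side checkpointing

# The Kinesis receiver checkpoints its own position (keyed by the application
# name) in DynamoDB, so a restarted job resumes where it left off.
stream = KinesisUtils.createStream(
    ssc, "gateway-metrics-app", "api-gateway-metrics",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, checkpointInterval=60)

# Placeholder output so the context has a registered action; the real job
# parses each record and appends it to HDFS as Parquet, e.g.
# df.write.mode("append").parquet("hdfs:///data/gateway/parquet").
stream.count().pprint()

ssc.start()
ssc.awaitTermination()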
Why parquet format?
The problem with small files:
http://www.infoworld.com/article/3004460/application-development/5-things-we-hate-about-spark.html
Write bigger files to S3, you'll thank me later. Every small file becomes its own partition, and shuffling is expensive.
Having bigger files will help you manage resources better.
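One way to keep the output files big, sketched with PySpark (paths and the partition count are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-compaction").getOrCreate()

# Reading thousands of small files creates thousands of partitions; compact
# them into a handful of larger Parquet files before syncing to S3.
df = spark.read.parquet("hdfs:///data/gateway/parquet/2016-09-07/")
df.coalesce(16).write.mode("overwrite").parquet("hdfs:///data/gateway/compacted/2016-09-07/")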
The Spark batch job aggregates the results and sends them to ES.
(diagram with the cube and all the dimensions)
We run the batches every hour, but we'll upgrade the API to read from both the real-time and the aggregated index.
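A sketch of that hourly aggregation, assuming the elasticsearch-hadoop connector is on the classpath (bucket, index and field names are placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hourly-aggregation").getOrCreate()

requests = spark.read.parquet("s3a://gateway-metrics/parquet/")  # placeholder bucket

# Roll up request counts per service, consumer and hour.
hourly = (requests
          .withColumn("hour", F.date_format("timestamp", "yyyy-MM-dd HH:00"))
          .groupBy("service", "consumer", "hour")
          .agg(F.count("*").alias("count")))

# Write the aggregates into the Elasticsearch aggregated index.
(hourly.write
       .format("org.elasticsearch.spark.sql")
       .option("es.nodes", "elasticsearch:9200")     # placeholder ES nodes
       .option("es.resource", "api-aggregated/doc")  # placeholder index/type
       .mode("append")
       .save())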
Reading from Kinesis and storing the data to S3 in Parquet format:
the problem is that the hadoop-aws jar depends on aws-sdk 1.7.4, which is incompatible with the Kinesis library (the Kinesis library requires a more recent version of the aws-sdk, 1.10.x+). This means we can't use Kinesis and s3a together with hadoop-aws 2.7.2, which is currently the latest available version.
At the end of the batch layer we end up with our OLAP hypercube.
We can do all kinds of operations on it in order to provide data to users.
The API provides data to third-party systems.
Performance / Functional testing
Average throughput - 80,000 req/sec
Max throughput - 110,000 req/sec
Metrics delay < 1 sec
ES - 10-node cluster - r3.8xlarge
Spins up all modules locally as Docker containers, with Docker Compose.
Sends records to NGINX and expects the aggregation to be correct.
Canary traffic - sends records in all regions in production and expects the correct results.
The canary is a microservice; we test the tester with Runscope.
Don't put the canary app in the same cluster.
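A toy version of such an end-to-end check in Python (the endpoints, index and field names are placeholders; the real tests run against the Docker Compose environment):

import time
import requests

GATEWAY = "http://localhost:8080"  # placeholder local gateway endpoint
ES = "http://localhost:9200"       # placeholder local Elasticsearch

def test_aggregation_counts_one_request():
    # Send a request through the gateway so it gets logged into the pipeline.
    requests.get(GATEWAY + "/test-service/ping", headers={"x-api-key": "test-key"})

    # Give the streaming / aggregation path time to catch up.
    time.sleep(5)

    # Expect the aggregated index to contain at least one record for the service.
    hits = requests.get(ES + "/api-aggregated/_count",
                        params={"q": "service:test-service"}).json()
    assert hits["count"] >= 1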
- Debugging
Analytics for service providers / end users
Consumption billing reports at the end of the month - bill clients
Change the way we do business
At the end of the day we can provide debugging and analytics features to the service providers, and we can also give them usage / consumption numbers: how many calls have been made to a service by a given client. That changes the entire business model from subscription-based to consumption-based.
Users can see their usage and can track the requests they make in real time.
Publishers can troubleshoot by searching the logs in Kibana, the same way they would with Splunk or Sumo Logic.
Why build this solution?
We've been using, and still are using, Sumo and Splunk.
This is not just a debugging solution; it is primarily an analytics solution.
- We've tried many solutions.
We've been using Graphite (sending logs directly to Graphite).
Any component can die and you can still recover.
It's much more efficient, and you have control over the data.
No solution is perfect, it has to fit your needs.
Analyze traffic patterns - send notifications.
What is normal?
Forecasting methods - moving average (naive method).
More complex methods take seasonality and historical data into account and are implemented using machine learning.
Analyze traffic and send notifications when the pattern is different.
Forecasting methods: what is the expected number of requests? Or the expected proportion between successes and client or server errors?
We could use a naive method of forecasting these values by calculating a moving average / weighted moving average / exponential moving average.
Machine learning to understand what’s normal.
You need to take into consideration the historical data and seasonality.
Make a forecast and then calculate the deviation from the expected value.
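A naive version of that check, sketched in Python (the window size and deviation threshold are arbitrary):

from collections import deque

WINDOW = 60      # last 60 one-minute samples (arbitrary window)
THRESHOLD = 0.3  # alert when we deviate more than 30% from the forecast

history = deque(maxlen=WINDOW)

def check(requests_per_minute):
    # Forecast with a simple moving average and flag large deviations.
    if len(history) == WINDOW:
        forecast = sum(history) / float(WINDOW)
        if forecast > 0:
            deviation = abs(requests_per_minute - forecast) / forecast
            if deviation > THRESHOLD:
                print("anomaly: observed %d, expected %.0f"
                      % (requests_per_minute, forecast))
    history.append(requests_per_minute)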