2. • Technical Evangelist, Developer Advocate, Software Engineer
• Own bed in Finland
• Previously:
• Solutions Architect @AWS
• Lead Cloud Architect @Dreambroker
• Director of Engineering, Software Engineer, DevOps, Manager, ... @Hdm
• Researcher @Nokia Research Center
• and a bunch of other stuff.
• Climber; likes ginger shots.
3. What to expect from the session
1. What is the Well-Architected framework
2. Look at the different pillars
3. How to be Well-Architected
4. Conclusion
5. Customer Challenges
• Faster response to change in market
• Delivery time
• Change management
• Reduce human errors
• Faster recovery
• High availability
• Automation
• Scaling to demand
6. AWS Well-Architected Framework
A set of questions you can use to evaluate how well an architecture is aligned to AWS best practices.
Pillars:
• Operational excellence
• Security
• Reliability
• Performance efficiency
• Cost optimization
9. Design Principles for Security
Apply security at all layers
Enable traceability
Implement a principle of least privilege
Focus on securing your system
Automate security best practices
10. Credentials
• Enforce MFA for everyone from day 1.
• Use AWS IAM Users and Roles from day 1.
• Enforce strong passwords.
• Protect and rotate credentials.
• No access keys in code.
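The "MFA from day 1" bullet can be enforced in IAM itself. A minimal sketch of the widely used deny-unless-MFA policy pattern; the action allow-list here is trimmed for brevity and would need tuning for a real account:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyMostActionsWithoutMFA",
    "Effect": "Deny",
    "NotAction": [
      "iam:ChangePassword",
      "iam:CreateVirtualMFADevice",
      "iam:EnableMFADevice",
      "iam:ListMFADevices",
      "sts:GetSessionToken"
    ],
    "Resource": "*",
    "Condition": {
      "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
    }
  }]
}
```

Attached to a group, this denies everything except MFA self-enrolment until the user has authenticated with MFA.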
11. EC2 Role
1. Create EC2 role: create a role in the IAM service with a limited policy.
2. Launch EC2 instance: launch the instance with the role attached.
3. App retrieves credentials: using the AWS SDK, the application retrieves temporary credentials.
4. App accesses AWS resource(s): using the AWS SDK, the application uses the credentials to access the resource(s).
(Diagram: EC2 instance with an attached role accessing S3)
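Step 1 involves two policy documents: a trust policy letting EC2 assume the role, and a limited permissions policy. A sketch, shown together in one wrapper for brevity; the bucket name is hypothetical:

```json
{
  "TrustPolicy": {
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "ec2.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  },
  "PermissionsPolicy": {
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::example-app-bucket/*"
    }]
  }
}
```

The permissions policy is deliberately narrow (read-only, one bucket), in line with the least-privilege principle above.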
12. Layers with Security Groups
(Diagram: in Availability Zone A, a user reaches a web server in Web Subnet A, protected by the WEB security group; the web server talks to an RDS DB instance in DB Subnet A, protected by the DB security group.)
13. Bastion Host & Security Groups
(Diagram: a developer connects over port 22, with an IP restriction, to a bastion host in Public Subnet A, protected by the Bastion security group; from there they reach the web server, behind the WEB security group, and the RDS DB instance in Private Subnet A, behind the DB security group.)
> start_bastion
> ssh -A
> stop_bastion
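The hop through the bastion can be captured in an SSH client config. A sketch; host names and IPs are hypothetical. Note that `ProxyJump` tunnels through the bastion without ever placing your private key on it, which fits the warning below about keys on shared systems:

```
# ~/.ssh/config (sketch; hosts and addresses are made up)
Host bastion
    HostName 203.0.113.10        # bastion's public IP
    User ec2-user

Host web-private
    HostName 10.0.2.15           # private IP of the web server
    User ec2-user
    ProxyJump bastion            # hop through the bastion host
```

With this in place, `ssh web-private` transparently connects via the bastion.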
17. Design Principles for Reliability
Test recovery procedures
Automatically recover from failure
Scale horizontally to increase aggregate system availability
Stop guessing capacity
Manage change in automation
18. Multi-AZ Architecture
Available & redundant application
(Diagram: the user resolves the site via Amazon Route 53; static content is served through Amazon CloudFront from Amazon S3; a load balancer distributes traffic across web instances in two Availability Zones; an RDS DB instance runs active in one zone with a Multi-AZ standby in the other.)
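The Multi-AZ standby shown in the diagram is a single property on the database resource. A hypothetical CloudFormation fragment:

```yaml
# Sketch only; values are placeholders
Database:
  Type: AWS::RDS::DBInstance
  Properties:
    Engine: mysql
    DBInstanceClass: db.t3.medium
    AllocatedStorage: "20"
    MasterUsername: admin          # placeholder credentials
    MasterUserPassword: change-me  # never hard-code these in real templates
    MultiAZ: true                  # provisions a synchronous standby in another AZ
```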
21. Auto Scaling
• Maintain your Amazon EC2 instance availability
• Automatically scale your EC2 fleet out and in
• Scale based on CPU, memory or custom metrics (memory requires publishing a custom metric)
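Scaling on CPU can be expressed as a target-tracking policy. A hypothetical CloudFormation fragment; `WebAutoScalingGroup` is assumed to be defined elsewhere in the template:

```yaml
# Sketch: keep average CPU across the group near 50%
CpuScalingPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref WebAutoScalingGroup
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ASGAverageCPUUtilization
      TargetValue: 50.0
```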
22. Scaling the Multi-AZ Architecture
(Diagram: the user resolves the site via Amazon Route 53; static content is served through Amazon CloudFront from Amazon S3; an ELB distributes traffic to web instances in Auto Scaling groups across two Availability Zones, backed by ElastiCache; an RDS DB instance runs active in one zone with a Multi-AZ standby in the other.)
23. Backup and DR
• Define Objectives
• Backup Strategy
• Periodic Recovery Testing
• Automated Recovery
• Periodic Reviews
25. Design Principles for Performance Efficiency
Democratize advanced technologies
Go global in minutes
Use serverless architectures
Experiment more often
Mechanical sympathy
26. Amazon CloudFront (CDN)
• Cache content at the edge for faster delivery
• Lower load on origin
• Dynamic and static content
• Streaming video
• Custom SSL certificates
• Low TTLs
34. How Lambda works
Lambda functions are invoked in response to events, such as changes in data or changes in state.
Event sources include: S3 event notifications, DynamoDB Streams, Kinesis events, Cognito events, SNS events, CloudTrail events, CloudWatch events and custom events.
From a function you can access any service, including your own: DynamoDB, Kinesis, S3, Redshift, SNS, other Lambda functions, and anything custom.
35. Event-driven using Lambda
Users upload photos to an S3 source bucket. Each PUT triggers an AWS Lambda function that resizes the images and writes them to an S3 destination bucket.
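The event-handling side of that flow can be sketched in a few lines. Helper names here are hypothetical; a real handler would also call boto3 to download the object, resize it with an image library, and upload the result, all omitted so the event logic stands alone:

```python
import os

def parse_s3_put_event(event):
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    return [(r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
            for r in event.get("Records", [])]

def destination_key(key, suffix="-resized"):
    """Derive the destination object key, e.g. photo.jpg -> photo-resized.jpg."""
    root, ext = os.path.splitext(key)
    return root + suffix + ext

def handler(event, context):
    """Lambda entry point: for each uploaded object, compute where the
    resized copy would go (download/resize/upload calls omitted)."""
    return [(bucket, destination_key(key))
            for bucket, key in parse_s3_put_event(event)]
```

The `Records` list shape matches what S3 sends to Lambda for a PUT notification.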
40. Database specialization example: Redis
In-memory data structure store, used as a database, cache and message
broker.
Specialized in data structures such as
• string
• hashes
• lists
• sets
• sorted sets with range queries
• bitmaps
• hyperloglogs
• geospatial indexes with radius queries
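To make "sorted sets with range queries" concrete, here is a plain-Python stand-in for a Redis sorted set (no server involved, and nothing here is the redis client API): members carry scores and are queried by score range, which is ZADD / ZRANGEBYSCORE in real Redis:

```python
# Plain-Python sketch of a Redis sorted set: members with scores,
# queried by score range.
scores = {"alice": 120, "bob": 95, "carol": 140}

def zrangebyscore(zset, lo, hi):
    """Members whose score falls in [lo, hi], ordered by score."""
    return sorted((m for m, s in zset.items() if lo <= s <= hi),
                  key=lambda m: zset[m])
```

Against a real server the same query is a single round trip, which is why leaderboards and counters belong in Redis rather than the relational database.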
42. Design Principles for Cost Optimization
Adopt a consumption model
Benefit from economies of scale
Stop spending money on data center operations
Analyze and attribute expenditure
Use managed services to reduce cost of ownership
43. Manage Expenditure
• Tag Resources
• Track Project Lifecycle
• Profile Applications vs Cost
• Monitor Usage & Spend
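Tagging is what makes the attribution in these bullets possible. A minimal sketch with made-up line items; in practice the data would come from AWS Cost Explorer or the detailed billing report:

```python
from collections import defaultdict

# Hypothetical line items as (cost_in_usd, tags) pairs
line_items = [
    (120.0, {"project": "webapp", "env": "prod"}),
    (30.0,  {"project": "webapp", "env": "dev"}),
    (55.0,  {"project": "analytics", "env": "prod"}),
]

def spend_by_tag(items, tag_key):
    """Attribute spend back to owners by summing costs per tag value."""
    totals = defaultdict(float)
    for cost, tags in items:
        totals[tags.get(tag_key, "untagged")] += cost
    return dict(totals)
```

For example, `spend_by_tag(line_items, "project")` attributes the bill per project, and switching the key to `"env"` splits it by environment instead.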
45. Managed Services
• Let AWS do the heavy lifting.
• Databases, caches and big data solutions.
• Application-level services.
Examples: Amazon RDS, Amazon DynamoDB, Amazon Redshift, Amazon ElastiCache, AWS Elastic Beanstalk, Amazon Elasticsearch Service.
46. Auto Start/Shutdown of Instances
(Diagram: Amazon CloudWatch rules fire on a schedule; every day at 21h30 a sleep trigger, and at 6h15 a wakeup trigger, invoke AWS Lambda to stop and start the AWS resources (EC2 instances).)
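The two schedules in the diagram map directly onto CloudWatch Events rules. A hypothetical CloudFormation fragment; the two Lambda functions are assumed to exist elsewhere, and the cron expressions are evaluated in UTC:

```yaml
# Sketch: scheduled triggers for the sleep/wakeup Lambda functions
SleepRule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: cron(30 21 * * ? *)   # every day at 21:30
    Targets:
      - Arn: !GetAtt StopInstancesFunction.Arn
        Id: sleep-trigger
WakeupRule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: cron(15 6 * * ? *)    # every day at 06:15
    Targets:
      - Arn: !GetAtt StartInstancesFunction.Arn
        Id: wakeup-trigger
```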
48. Design Principles for Operational Excellence
Perform Operations with Code
Align Operations Processes to Business Objectives
Make Regular, Small, Incremental Changes
Test for Responses to Unexpected Events
Learn from Operational Events and Failures
Keep Operations Procedures Current
Let’s cast our minds back to how we would think about security in an on-premises environment.
You often only had security at the surface of the architecture: an eggshell model where you harden the edges, but once past these protections attackers can go anywhere.
Logging and audits were sporadic; some devices would not even offer the ability, and those that did all did it differently. It was very hard to get a holistic view of the whole environment.
It was hard to have tight controls on who could do what; security was often seen as a blocker, and overly permissive security was common.
You had to have people, or manage contracts, around how you or your provider would physically secure the data centre.
Security involved a lot of manual processes, so it was hard to be consistent or to drive improvement over time.
In the cloud these constraints have been removed, which allows us to adopt these design principles to build and operate cloud-native architectures:
with security across layers and tracing across usage and changes, you can trigger code to respond to events or combinations of events. You can use fine-grained access controls to say who can do what, and focus your time using the shared responsibility model.
And you can turn all of this into code, so it can be automatic, error-free, version-controlled and scalable.
The moment you create a new AWS account you need to secure it, as it is your virtual data centre. The Identity and Access Management (IAM) service provides a very powerful way of governing access to your resources. Even if you have only one user, set up an IAM user login and do not sign in with the root account (your email address). Besides login credentials you can also generate access keys, a programmatic way to gain access to your resources; these access keys should never, ever, be put into code or stored insecurely.
Adding to the building blocks I showed you earlier, you can see a single web instance and a single database instance, each with its own security group, and an end user accessing the web instance directly.
Use ssh -A for SSH agent forwarding.
You mustn't put personal SSH private keys on shared systems or servers. It is a pretty serious security risk: if anyone else has access to the system, they have access to your SSH private key, and may be able to use it to impersonate you.
Going to the instance level now: monitoring and alerting are key, as is understanding what a normal state looks like in your application. We also have some great services such as CloudTrail, which we highly recommend you enable, and VPC Flow Logs, which provide visibility of network traffic flow and help in troubleshooting. Finally, encryption everywhere is key; for our block storage service EBS, and for S3, it's simply a checkbox.
Let’s think about how we might approach reliability in a traditional environment.
We often test that things work normally, checking that they meet expectations, but we rarely test what happens after things fail. So the first time we test our recovery process is in the middle of a live reliability failure (not a great learning experience!). This is why you used to see lots of systemic failures: X failed and then Y failed (Y being the thing we never got to test).
When a failure occurs, we manually fix it; if it happens a lot, we write down the procedures for fixing it. A very manual process.
And we had to guess how much capacity we needed, so if we got that wrong we had long provisioning times, which could lead to outages.
And we made changes to our environments manually, which introduced the opportunity for human error and snowflake servers (perfectly individual).
In the cloud these constraints have been removed, which allows us to adopt these design principles to build and operate cloud-native architectures:
to test beyond destruction to make sure recovery procedures are automatic and successful;
to have multiple resources answering requests, such that when any single component fails, its siblings can step in and absorb the load;
to use horizontal scaling to meet demand;
and to make changes to our environment through code, applying the same best practices we would apply to application code.
Next up we need to address the lack of failover and redundancy in our infrastructure.
We’re going to do this by adding in another webapp instance, and enabling the Multi-AZ feature of RDS, which will give us a standby instance in a different AZ from the Primary.
We’re also going to replace our EIP with an Elastic Load Balancer to share the load between our two web instances
Now we have an app that is a bit more scalable and has some fault tolerance built in as well.
That’s a lot of potentially wasted infrastructure and cost: potentially 76% wasted, while only 24% on average for the month gets utilized. Traditionally this is how IT did things: you bought servers based on a 6-12 month vision of what growth might be. Since we can all agree this is bad, what is the solution?
Backup and disaster recovery is another key aspect of reliability. Using automation tools, with infrastructure driven by code, allows for scenarios where automatic recovery is possible (e.g. EBS snapshots, S3 versioning). You should of course always test your backups and review your strategy.
As we did before, let’s think about the kinds of constraints we had in a traditional environment when thinking about performance efficiency.
We tended to use the same tech for everything; when the only tool you have is a hammer, every problem looks like a nail. Generally, this is why you saw so many RDBMSs.
We stayed local, as global was too hard and too expensive; even the thought of negotiating a contract with a supplier in a different country, legal framework and language was enough to stop most conversations.
We used lots of servers that each did one thing, and we had to have people to manage all those servers.
It was hard to get the resources to experiment; it took a lot of time to set up, so it was not very common.
We tended to force technologies to do what we needed, and then hope we could get the performance we needed.
In the cloud these constraints have been removed, which allows us to adopt these design principles to build and operate cloud-native architectures:
skills such as machine learning and media transcoding are not evenly distributed across technologists, so having AWS set up and configure those services for you makes adoption easier;
deploying to global locations is a click of a button, not a legal process;
we can create solutions that are fully managed, so we can focus on the code that adds value;
experimentation is something we can do continuously;
and we have a bigger toolbox of techniques and can select the one that works best for what we are trying to do. For example, if you have relational information you would use a relational database, while if you needed internet-scale lookups you would use a NoSQL solution such as DynamoDB.
CloudFront allows you to cache static content at the CloudFront edge for faster delivery from a local point of presence (PoP) to the end user; in other words, your static content gets cached close to a user and then delivered locally, reducing download times for the website overall.
There are over 60 CloudFront edge locations around the world, as we mentioned earlier.
CloudFront helps lower load on your origin infrastructure.
You can front-end static content, as discussed, and dynamic content as well.
For dynamic content, CloudFront proxies and accelerates your connection back to your dynamic origin; you would set a TTL of 0 on your dynamic content so CloudFront always goes back to the origin to fetch it.
Writes and updates such as counters should not hit the database: put them in Redis!
Database federation is where we break up the database by function.
In our example, we have broken out the Forums DB, the User DB and the Products DB.
Of course, cross-functional queries are harder to do, and you may need to do your joins at the application layer for these types of queries.
This will reduce our database footprint for a while, and the great thing is that it puts off having to shard until much further down the line.
This isn’t going to help for single large tables; for those we will need to shard.
Sharding is where we break up that single large database into multiple DBs. We might need to do this because of database or table size, or potentially for high write IOPS as well.
Here is an example of breaking up a database with a large table into three databases. Above we show where each userID is located, but the easiest way to describe how this works is: all users A-H go into one DB, I-M go into another, and N-Z go into the third DB.
Typically this is done by key space, and your application has to be aware of where to read from, update and write to for a particular record. ORM support can help here.
This does create operational complexity, so if you can federate first, do that.
This can be done with SQL or NoSQL; DynamoDB does this for you under the covers as your data size and reads/writes per second scale.
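The key-space routing described above fits in a few lines of application code. A sketch; the shard names are hypothetical stand-ins for real database connections:

```python
# Route each user to a database by the first letter of their ID,
# matching the A-H / I-M / N-Z split described above.
RANGES = [("A", "H", "db1"), ("I", "M", "db2"), ("N", "Z", "db3")]

def shard_for(user_id: str) -> str:
    """Return the name of the shard holding this user's record."""
    first = user_id[0].upper()
    for lo, hi, shard in RANGES:
        if lo <= first <= hi:
            return shard
    raise ValueError(f"no shard for key {user_id!r}")
```

Every read, update and write must go through a router like this, which is exactly the operational complexity the notes warn about.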
Again, let’s think about how we would approach cost optimization in a traditional environment.
You had to invest CAPEX upfront for new infrastructure before you needed it.
Most companies are not large enough to benefit from economies of scale.
You spent time and money on the undifferentiated heavy lifting of building, maintaining, and racking and stacking data centers.
Often there were only centralized costs that couldn’t be attributed back to others, so no one was incentivized to review costs, and you had orphan systems.
You purchased and ran servers to provide services, often with low utilization as they were hard to share.
In the cloud these constraints have been removed, which allows us to adopt these design principles to build and operate cloud-native architectures:
you pay for computing resources as you consume them;
AWS can use its economies of scale to drive down infrastructure costs, and pass the savings on to its customers;
we do the heavy lifting of managing the physical bits for you, so you can focus on the value-adding bytes;
you can attribute costs back to business units and product owners, so they can drive them down;
and you can use managed services that have a lower cost and eliminate the time and expense of managing servers.
Management of expenditure is very important. If you have a finance person to answer to, they love the ability to itemise the bill based on workload or environment, and you can do this using tags or even multiple accounts. There are also a cost explorer and a budgeting tool built in that give great visibility.
Instances can be procured in a few different ways; the default is On-Demand.
Reserved Instances allow you to commit for one year or more, with varying options, to receive a discount if you are running them 24x7.
Spot Instances allow you to take advantage of the Spot market, where we auction off unused capacity at discounted prices; however, their suitability depends on the workload.
AWS managed services can save pain and massive amounts of time in provisioning, feeding and watering systems such as databases. For databases you can use the RDS service, or maybe even the Redshift data warehouse, where with a few clicks you can provision a database with your choice of engine, and backups, monitoring and failover are all taken care of for you. There are also a number of automation services, my favourite being CloudFormation, which allows you to define your entire environment as JSON-structured code (or use a DSL such as the popular Ruby one) and automatically build out or update your entire environment.
A simple, and most often neglected, case of cost optimization in AWS.
Most changes were made by human beings following runbooks that were often out of date.
It was easy to become very focused on technology metrics rather than business outcomes.
Because making changes was difficult and risky, we tended not to want to do it often, and therefore tended to batch changes into large releases.
We rarely simulated failures or events, as we were too busy fighting fires from real failures.
We were so busy reacting to situations that it was hard to take the time to extract learnings.
It was hard to keep information current as we were making changes to everything to fight fires; every server was a snowflake.
In the cloud, constraints of a traditional environment are removed, and you can use the design principles of the Operational Excellence pillar to make all changes by code with business metrics that you can measure your success against. By automating change and using code, you can move to making incremental changes and reduce risk. You can build organizational muscle memory by running game-days that simulate failures to test your recovery processes, and learn from these and other operational events to improve your responses. Finally, because infrastructure is now code, you can detect when documentation is out of date and even generate documentation.
The development process that you use for developing business logic can be the same as the one you use when writing CloudFormation templates.
You start off with your favourite IDE or text editor to write the code: Eclipse, Vim or Visual Studio.
You then commit the template to your source code repository using your usual branching strategy,
and then have the template reviewed as part of your typical code review process.
The template is then integrated and run as part of your CI and CD pipelines.
Being simply a JSON document, a template can even have unit tests written for it. When developing a CloudFormation template you can use all of your normal software engineering principles.
At the end of the day, it's all software: a template can be reused across applications, just like a code library, and a stack can be shared by multiple applications.
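To make "it's all software" concrete, here is a minimal, hypothetical CloudFormation template in the JSON form the notes describe: one parameter makes it reusable across applications, and the whole document can be linted, diffed and unit-tested like any other code:

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Minimal sketch: one S3 bucket, reusable via a parameter.",
  "Parameters": {
    "BucketSuffix": {
      "Type": "String",
      "Description": "Hypothetical per-application suffix"
    }
  },
  "Resources": {
    "AssetBucket": {
      "Type": "AWS::S3::Bucket",
      "Properties": {
        "BucketName": {"Fn::Sub": "my-app-assets-${BucketSuffix}"}
      }
    }
  }
}
```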
Over 100 AWS webinars are available on-demand. re:Invent slides and video are also available.
Solutions Architects are here to help. Go talk with them. Schedule some office hours to review your architecture and identify steps to make it even better.