2. • Technical Evangelist, Developer Advocate, … Software Engineer
• Own bed in Finland
• Previously:
• Solutions Architect @AWS
• Lead Cloud Architect @Dreambroker
• Director of Engineering, Software Engineer, DevOps, Manager, ... @Hdm
• Researcher @Nokia Research Center
• and a bunch of other stuff.
• Climber; likes ginger shots.
3. What to Expect from the Session
• What you need to know about S3 on AWS.
• Architectural design patterns with S3.
• Best practices & tips.
• Tools to help you.
5. Amazon S3 today
Amazon S3 holds trillions of objects and regularly peaks at
millions of requests per second.
Trillion: 1,000,000,000,000 (one million million; 10^12; SI prefix: tera-) in American and British English;
1,000,000,000,000,000,000 (one million million million; 10^18; SI prefix: exa-) in many non-English-speaking countries.
https://en.wikipedia.org/wiki/Trillion
6. Netflix delivers billions of hours of
content from Amazon S3.
SmugMug stores billions of photos
and images on Amazon S3.
Airbnb handles over 10PB of user
images on Amazon S3.
Soundcloud currently stores 2.5 PB
of data on Amazon Glacier.
Nasdaq uses Amazon S3 to support
years of historical tick data down to
the millisecond.
7. We currently log 20 terabytes of
new data each day, and have
around 10 petabytes of data in S3.
(2014)
FINRA stores over 700 TB of data
on Amazon S3 for low cost, durable,
scalable storage and uses Amazon
EMR for scalable compute workloads using
Hive, Presto, and Spark.
Sony moved over 1M hours of video from
magnetic tape to Glacier for digital
preservation.
8. Choice of storage classes on S3
• Active data: Standard
• Infrequently accessed data: Standard - Infrequent Access
• Archive data: Amazon Glacier
9. Choice of storage classes on S3
S3 Standard (active data)
• Big data analysis
• Content distribution
• Static website hosting
Standard - IA (infrequently accessed data)
• Backup & archive
• Disaster recovery
• File sync & share
• Long-retained data
Amazon Glacier (archive data)
• Long term archives
• Digital preservation
• Magnetic tape replacement
11. Back up data to Amazon S3
(Diagram: Amazon Route 53 resolves http://example.com to a traditional server in the on-premises infrastructure; data from that server is copied to an Amazon S3 bucket.)
12. Data collection into Amazon S3
• AWS Direct Connect
• AWS Snowball
• AWS Snowball Edge
• AWS Snowmobile
• AWS Storage Gateway
• Amazon Kinesis Firehose
• S3 Transfer Acceleration
• ISV Connectors
13. Fun fact
Since October 2015, AWS
Snowball has moved over
5 billion objects into Amazon S3,
and AWS Snowball appliances
have traveled a distance equal to
circling the world more than 100
times.
19. Amazon Storage Partner Solutions
aws.amazon.com/backup-recovery/partner-solutions/
Note: Represents a sample of storage partners
• Primary Storage: solutions that leverage file, block, object, and streamed data formats as an extension to on-premises storage
• Backup and Recovery: solutions that leverage Amazon S3 for durable data backup
• Archive: solutions that leverage Amazon Glacier for durable and cost-effective long-term data backup
22. Protect your data from the “oops”
Versioning
• Protects from:
• unintended user deletes
• application failures
• New version with every upload
• Easy retrieval
• Roll back to previous versions
• Three states: default, versioning-enabled, suspended
MFA (multi-factor authentication) Delete
• Requires additional authentication to:
• Change the versioning state of your bucket
• Permanently delete an object version
26. Amazon CloudFront (CDN)
• Cache content at the edge.
• Lower load on origin.
• Dynamic and static content.
• Custom SSL certificates.
• Low TTLs.
27. S3 Transfer Acceleration: faster upload over long distances
(Diagram: uploader → AWS edge location → optimized throughput → S3 bucket)
• Change your endpoint, not your code
• No firewall changes or client software
• Longer distance, larger files, more benefit
• Faster or free
• 82 global edge locations
Try it at S3speedtest.com
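A minimal boto3 sketch of what enabling and then using Transfer Acceleration can look like; the bucket and file names are placeholders, and the accelerated client simply switches to the accelerate endpoint via botocore's client configuration.

```python
# Sketch: enable S3 Transfer Acceleration on a bucket and upload through
# the accelerate endpoint. "my-bucket" and the file names are placeholders.
import boto3
from botocore.config import Config

s3 = boto3.client("s3")

# Enable Transfer Acceleration on the bucket (one-time configuration).
s3.put_bucket_accelerate_configuration(
    Bucket="my-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Create a client that routes requests through the accelerate endpoint
# (bucketname.s3-accelerate.amazonaws.com) instead of the regional endpoint.
s3_accelerated = boto3.client(
    "s3", config=Config(s3={"use_accelerate_endpoint": True})
)
s3_accelerated.upload_file("large-video.mp4", "my-bucket", "uploads/large-video.mp4")
```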
31. AWS SDKs
• Automatically switch to multipart transfers when a file is over a specific size threshold
• Upload/download a file in parallel
• Progress callbacks to monitor transfers
• Retries
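With the Python SDK, for example, the multipart threshold, parallelism, and a progress callback can be configured roughly like this sketch (bucket and file names are placeholders):

```python
# Sketch: high-level transfer with boto3. Files above multipart_threshold are
# automatically split into parts and uploaded in parallel, with retries.
import boto3
from boto3.s3.transfer import TransferConfig

config = TransferConfig(
    multipart_threshold=16 * 1024 * 1024,  # switch to multipart above 16 MB
    multipart_chunksize=8 * 1024 * 1024,   # 8 MB parts
    max_concurrency=10,                    # parallel part uploads
)

def progress(bytes_transferred):
    # Called with the number of bytes transferred for each chunk.
    print(f"transferred {bytes_transferred} bytes")

s3 = boto3.client("s3")
s3.upload_file("backup.tar", "my-bucket", "backups/backup.tar",
               Config=config, Callback=progress)
```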
33. Distributing key names
Add randomness to the beginning of the key name
– E.g. with a hash or reversed timestamp (ssmmhhddmmyy)
<my_bucket>/521335461-2013_11_13.jpg
<my_bucket>/465330151-2013_11_13.jpg
<my_bucket>/987331160-2013_11_13.jpg
<my_bucket>/465765461-2013_11_13.jpg
<my_bucket>/125631151-2013_11_13.jpg
<my_bucket>/934563160-2013_11_13.jpg
<my_bucket>/532132341-2013_11_13.jpg
<my_bucket>/565437681-2013_11_13.jpg
<my_bucket>/234567460-2013_11_13.jpg
<my_bucket>/456767561-2013_11_13.jpg
<my_bucket>/345565651-2013_11_13.jpg
<my_bucket>/431345660-2013_11_13.jpg
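A small sketch of one way to generate such keys, using a short hash prefix derived from the original object name (the naming scheme and prefix length are illustrative, not a prescribed format):

```python
# Sketch: prepend a short hash to spread keys across S3 index partitions.
import hashlib

def distributed_key(original_name: str, prefix_len: int = 4) -> str:
    # Derive a short, deterministic hash prefix from the original name.
    digest = hashlib.md5(original_name.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{original_name}"

print(distributed_key("2013_11_13.jpg"))  # e.g. "a1b2-2013_11_13.jpg"
```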
34. Organize your data with object tags
Manage data based on what it is, as opposed to where it's located
• Tag your objects with key-value pairs (up to 10 tags per object)
• Write policies once based on the type of data
• Put object with tags, or add tags to existing objects
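A hedged boto3 sketch of both ways to tag; the bucket, key, and tag names are illustrative:

```python
# Sketch: tag an object at upload time, or add tags to an existing object.
import boto3

s3 = boto3.client("s3")

# 1) Tag on PutObject (tags are passed as a URL-encoded query string).
s3.put_object(
    Bucket="my-bucket",
    Key="logs/app-2017-06-01.log",
    Body=b"example log contents",
    Tagging="classification=log&project=blue",
)

# 2) Add or replace tags on an existing object.
s3.put_object_tagging(
    Bucket="my-bucket",
    Key="logs/app-2017-06-01.log",
    Tagging={"TagSet": [
        {"Key": "classification", "Value": "log"},
        {"Key": "project", "Value": "blue"},
    ]},
)
```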
44. Amazon Athena: SQL Query on S3
• No loading of data
• Serverless
• Supports text, CSV, TSV, JSON, Avro
• Columnar formats: Apache ORC & Parquet
• Access via Console or JDBC driver
• $5 per TB scanned from S3
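Athena can also be driven from the SDK; a minimal sketch of running a query and waiting for it (the database, table, and output location are placeholders):

```python
# Sketch: run an Athena query from Python and poll for completion.
import time
import boto3

athena = boto3.client("athena")

start = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM access_logs GROUP BY status",
    QueryExecutionContext={"Database": "weblogs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/queries/"},
)
query_id = start["QueryExecutionId"]

# Poll until the query finishes; results land as CSV in the output location.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

print(state)
```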
49. Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
• Fast at exabyte scale
• Elastic & highly available
• On-demand, pay-per-query
• High concurrency: multiple clusters access the same data
• No ETL: query data in place using open file formats
• Full Amazon Redshift SQL support
50. S3 Inventory
Save time: daily or weekly delivery of a CSV file to an S3 bucket you choose
Output fields: Bucket, Key, Version Id, Is Latest, Delete Marker, Size, Last Modified, ETag, StorageClass, Multipart Uploaded, Replication Status
51. Indexing S3 content using Elasticsearch
(Diagram: source data lands in an S3 bucket; the ObjectCreate event invokes an AWS Lambda function, which indexes a record into Amazon Elasticsearch Service; clients then run search queries against the index. See the sketch below.)
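A rough sketch of the Lambda side of this pattern: parse the ObjectCreated event, look up object metadata, and hand a document to an indexing helper. The `index_document` function and the index name are hypothetical stand-ins for your Elasticsearch client code.

```python
# Sketch: Lambda handler that turns S3 ObjectCreated events into index records.
import urllib.parse
import boto3

s3 = boto3.client("s3")

def index_document(index_name, doc):
    # Hypothetical helper: send `doc` to your Amazon Elasticsearch Service
    # domain (e.g. with an Elasticsearch client or signed HTTP requests).
    raise NotImplementedError

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        head = s3.head_object(Bucket=bucket, Key=key)
        index_document("s3-objects", {
            "bucket": bucket,
            "key": key,
            "size": head["ContentLength"],
            "content_type": head.get("ContentType"),
            "last_modified": head["LastModified"].isoformat(),
        })
```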
54. Amazon S3 with event-driven workflow
(Diagram: source data is uploaded to Amazon S3; in response to events, S3 can invoke a Lambda function, publish to an SNS topic, or send a message to an SQS queue.)
55. Event-driven validation layer on Amazon S3
(Diagram: source data arrives in the data staging layer at /data/source-raw; an input validation and conversion layer built on AWS Lambda processes it and writes the results to the staging layer at /data/source-validated.)
56. Event-driven photo manipulation with Lambda
(Diagram: users upload photos to an S3 source bucket; an AWS Lambda function, triggered on PUTs, resizes the images and writes them to an S3 destination bucket. See the sketch below.)
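A minimal sketch of such a resize function, assuming the Pillow library is packaged with the Lambda deployment and that the destination bucket name is a placeholder:

```python
# Sketch: resize images uploaded to a source bucket and write thumbnails
# to a destination bucket. Assumes Pillow is bundled with the function.
import os
import urllib.parse
import boto3
from PIL import Image

s3 = boto3.client("s3")
DEST_BUCKET = "my-photos-thumbnails"  # placeholder destination bucket

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        local_in = "/tmp/" + os.path.basename(key)
        local_out = "/tmp/thumb-" + os.path.basename(key)

        s3.download_file(bucket, key, local_in)
        with Image.open(local_in) as img:
            img.thumbnail((256, 256))   # resize in place, keep aspect ratio
            img.save(local_out)
        s3.upload_file(local_out, DEST_BUCKET, "thumbnails/" + key)
```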
59. Cognito support for identity
(Diagram: a user signs in with username and password or through a SAML identity provider; Amazon Cognito, backed by Cognito User Pools, exchanges the sign-in for AWS credentials (step 2: get AWS credentials), which the app then uses to call API Gateway, Lambda, DynamoDB, and S3.)
60. Leverage Amazon S3 directly from the app
(Diagram: MyApp saves a picture; the app obtains credentials from Amazon Cognito and then puts the object directly into Amazon S3 (step 2: put object).)
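On the client side, once Cognito has vended temporary AWS credentials, the upload itself is just a PutObject. A hedged Python sketch (in a real mobile app you would use the platform SDK and its Cognito credentials provider; the credential values, bucket, and key are placeholders):

```python
# Sketch: put an object directly into S3 using temporary credentials that
# the app obtained from Amazon Cognito (all values shown are placeholders).
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="ASIA...",      # from Cognito Identity
    aws_secret_access_key="...",      # from Cognito Identity
    aws_session_token="...",          # from Cognito Identity
)

with open("pic.jpg", "rb") as f:
    s3.put_object(Bucket="myapp-user-uploads", Key="user-123/pic.jpg", Body=f)
```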
You have a lot to cover and you are happy to field questions after the talk.
A trillion is 1,000,000,000,000, also known as 10 to the 12th power, or one million million. It’s such a large number it’s hard to get your head around it, so sometimes trillion just means “wow, a lot.”
To easily process and analyze 50 Gigabytes of data daily, Airbnb uses Amazon Elastic MapReduce (Amazon EMR). Airbnb is also using Amazon Simple Storage Service (Amazon S3) to house backups and static files, including 10 terabytes of user pictures.
SoundCloud uses a combination of Amazon Simple Storage Service (Amazon S3) and Amazon Glacier as its storage solution. The audio files are placed in Amazon S3 and distributed from there via the SoundCloud website. All files are also copied to Amazon Glacier, to ensure that the data is available at all times, even in the event of a disaster. The company currently stores 2.5 PB of data on Amazon Glacier.
Early in the 2 ½ year migration of FINRA’s Market Regulation Portfolio to the AWS Cloud, FINRA developed a system on AWS to replace an on-premises solution that allowed analysts to query this trade activity. This solution provided fast random access across trillions of trade records, which would quickly grow to over 700 TB of data.
All 3 storage classes are highly durable, designed for 11 9s of durability.
Standard is designed for active data and hot workloads: high performance, designed for 4 9s of availability. It is the general-purpose storage class; for new, frequently accessed data, start with Standard. Starting at 2.3c/GB.
As data ages, it is accessed less. Standard-IA is designed for colder, less frequently accessed data. It offers the same high throughput and low latency as S3 Standard, with 3 9s of availability. Starting at 1.25c/GB, about 45% lower in storage cost; there is a $0.01/GB retrieval cost.
As the data ages further, no one actively interacts with it, but you keep it for record keeping. Glacier is designed for long-term archival storage: 4/10ths of a cent per GB, with 3 retrieval options ranging from minutes to hours; you choose depending on how quickly you need the data.
lifecycle
For data that is less frequently accessed, you can leverage Amazon S3 Standard-IA to save on cost while still benefiting from the same great durability and performance as S3 Standard.
In addition to transitioning your data to Standard-IA as its characteristics change, you can also use Standard-IA for new data that fits the bill for infrequently accessed data. For example, you can use the Standard-IA storage class to store detailed application logs that you analyse infrequently and save on storage cost.
If your storage is for big data analysis, content distribution, or static website hosting, consider S3 Standard, which is designed for active, hot analytics workloads (e.g. Netflix and FINRA).
For backup and archive and disaster recovery, you generally don't access the data until the rare occasion when you need it, and when you do, you need it right away; Standard-IA is designed for those use cases. Consider putting such data directly into Standard-IA.
Glacier is for long-term archives; Sony moved over 1M hours of video from magnetic tape to Glacier for digital preservation.
As data ages, you access the objects less.
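A lifecycle rule that follows this aging pattern might be configured like the following boto3 sketch; the bucket name, prefix, and day counts are illustrative:

```python
# Sketch: transition objects to Standard-IA after 30 days and to Glacier
# after 365 days, then expire them after roughly 7 years.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={"Rules": [{
        "ID": "age-out-logs",
        "Filter": {"Prefix": "logs/"},
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 365, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": 2555},
    }]},
)
```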
Asynchronous: Storage Gateway.
You may need a variety of tools to move data in and out of the cloud that suit the value, time, and workflow demands of the move.
To stream data into S3, Kinesis Firehose is a fully managed streaming service.
To connect your on-premises infrastructure to the AWS cloud, use AWS Storage Gateway; the added file gateway supports the NFS protocol so you can connect to your S3 bucket as an NFS mount point.
To migrate PBs of data and do some local processing as the data arrives, use Snowball Edge: 100 TB of storage capacity with processing power.
If you have to migrate hundreds of PB to exabytes of data quickly and securely onto the cloud, AWS Snowmobile can help: a secure data truck with 100 PB of capacity that can be dispatched to your site for high-speed data migration.
Snowball Edge offers a native S3-compatible endpoint on the appliance, a clustered mode where multiple Snowball Edges act as a single durable storage pool, the ability to run Lambda functions as data is copied to the appliance, a hosted File Storage Gateway to provide NAS capabilities, and integration with customers' VPCs to connect to AWS resources.
Export – you never want to leave, but if you do, we get your data out just as easily as we got it in.
For a dedicated network connection from your premises to AWS, consider AWS Direct Connect.
up to 100PB per Snowmobile
Snowmobile uses multiple layers of security designed to protect your data including dedicated security personnel, GPS tracking, alarm monitoring, 24/7 video surveillance, and an optional escort security vehicle while in transit.
All data is encrypted with 256-bit encryption keys managed by KMS
Visualize, so you can take a data-driven approach to managing your storage.
You can configure a storage class analysis policy to monitor an entire bucket, a prefix, or an object tag. Once an infrequent access pattern is observed, you can easily create a new lifecycle age policy based on the results.
Storage class analysis observes data access patterns to help you determine when to transition less frequently accessed STANDARD storage to the STANDARD_IA (IA, for infrequent access) storage class.
You can export the storage class analysis data to use it in the tools of your choice, such as Amazon QuickSight.
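Configuring a storage class analysis filter with export could look roughly like this sketch; the bucket names, id, and prefix are placeholders:

```python
# Sketch: analyze access patterns under a prefix and export daily CSV results
# to another bucket for use in tools such as Amazon QuickSight.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_analytics_configuration(
    Bucket="my-bucket",
    Id="logs-analysis",
    AnalyticsConfiguration={
        "Id": "logs-analysis",
        "Filter": {"Prefix": "logs/"},
        "StorageClassAnalysis": {
            "DataExport": {
                "OutputSchemaVersion": "V_1",
                "Destination": {"S3BucketDestination": {
                    "Format": "CSV",
                    "Bucket": "arn:aws:s3:::my-analysis-results",
                    "Prefix": "storage-class-analysis/",
                }},
            }
        },
    },
)
```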
If you have data on-premises, you probably have software from one of these industry leaders. Some of these can help you migrate workloads into AWS so your applications can run on AWS or in a hybrid architecture model. They make it easy for you to migrate or scale your growth, leveraging S3 and Glacier as cloud storage targets.
Set policies by bucket, prefix, or tags
Set policies for current version or non-current versions
Actions can be combined
Protects from accidental overwrites and deletes
Ability to recover from unintended user deletes or application logic failures, with no performance penalty.
Keeps all versions; new uploads are stored separately. On delete, the latest version is maintained and a delete marker is added.
You can retrieve deleted objects or roll back to previous versions.
3 states: default (no versions saved; deleted objects cannot be retrieved), versioning-enabled (as discussed, saves versions of overwritten or deleted objects), and suspended (all previously saved versions are maintained, but new versions are not created).
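A boto3 sketch of enabling versioning; MFA Delete additionally requires the bucket owner's root credentials and a current MFA token, so it is shown only as a commented variant with placeholder values:

```python
# Sketch: enable versioning on a bucket; optionally require MFA for
# permanent version deletes and versioning-state changes.
import boto3

s3 = boto3.client("s3")

# Plain versioning.
s3.put_bucket_versioning(
    Bucket="my-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# MFA Delete (must be issued with the root account's credentials; the MFA
# string is "<device-arn> <current-code>" — values below are placeholders).
# s3.put_bucket_versioning(
#     Bucket="my-bucket",
#     MFA="arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456",
#     VersioningConfiguration={"Status": "Enabled", "MFADelete": "Enabled"},
# )
```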
New Region coming soon
Hong Kong
Ningxia
Paris
Stockholm
Use cases:
Compliance - store data hundreds of miles apart
Lower latency - distribute data to regional customers
Security - create remote replicas managed by separate AWS accounts
After enabling CRR, every object uploaded to an S3 bucket is automatically replicated to a destination bucket in a different AWS region. You choose the destination region, replicate the whole bucket or a prefix, and optionally specify the storage class.
How it works:
Only replicates new PUTs. Once configured, all new uploads into the source bucket will be replicated.
Entire bucket or prefix based
1:1 replication between any 2 regions
Versioning required
Deletes and lifecycle actions are not replicated
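Given that versioning is already enabled on both buckets, a cross-region replication rule can be set up roughly like this sketch; the role ARN, bucket names, and prefix are placeholders:

```python
# Sketch: replicate new objects under a prefix to a bucket in another region.
# Versioning must already be enabled on both source and destination buckets.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_replication(
    Bucket="my-source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",  # placeholder
        "Rules": [{
            "ID": "replicate-documents",
            "Prefix": "documents/",   # an empty string replicates the whole bucket
            "Status": "Enabled",
            "Destination": {
                "Bucket": "arn:aws:s3:::my-destination-bucket",
                "StorageClass": "STANDARD_IA",  # optional storage class override
            },
        }],
    },
)
```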
CloudFront allows you to cache static content at the CloudFront edge for faster delivery from a local POP to the end user; in other words, your static content gets cached close to the user and is then delivered locally, reducing download times for the website overall.
There are over 60 CloudFront cache POPs around the world, as mentioned earlier.
CloudFront helps lower the load on your origin infrastructure.
You can front-end static content, as discussed, and dynamic content as well.
For dynamic content, CloudFront proxies and accelerates your connection back to your dynamic origin; you would set a 0 TTL on your dynamic content so CloudFront always goes back to the origin to fetch it.
User-generated content use cases: you have customers that upload to a centralized bucket from all over the world.
You transfer data on a regular basis across continents, with frequent uploads from distributed locations.
S3 Transfer Acceleration leverages the AWS edge network: when you enable it, your data is routed to the closest edge location, so it travels a shorter distance on the public internet and the majority of the distance on an optimized network on the Amazon backbone. There are 68 edge locations.
S3 Transfer Acceleration uses standard TCP and HTTP, so it does not require any firewall exceptions or custom software installation.
Setup is easy: enable Transfer Acceleration on the bucket in the console and update the endpoint.
Faster or free: there is no cost for using Transfer Acceleration if the upload is not faster. If the accelerated transfer is no faster than a normal upload, you don't pay for it.
Although the primary use is upload/ingestion to S3, we have seen customers use it for downloads, such as sharing video on an individual basis where very few people pull the video. Downloading through Transfer Acceleration is a fast path.
Use Transfer Acceleration to improve upload availability over spotty internet connections: uploading files from across the globe over poor internet connectivity increases upload availability. (FS)
Users associate performance with availability: if an upload or download takes a long time, they assume it's not working properly, may not be patient, and cancel. Finishing faster helps.
Downloading files from S3 when the files are not pulled down frequently (single user versus many): customers saw a benefit in pulling down files from S3 in the quickest way possible by leveraging Transfer Acceleration. For files that are frequently accessed, CloudFront, which caches your data at the edge, is recommended.
This is what the flow of a request transferred through S3 Transfer Acceleration looks like:
The client's request hits Route 53, which resolves the accelerate endpoint to the best POP latency-wise. From there, S3 Transfer Acceleration selects the fastest path and sends data over persistent HTTPS connections to an EC2 proxy fleet in the same AWS region as the S3 bucket. We maximize the send and receive windows here to maximize the customer's utilization of the available bandwidth. From there, the request is finally sent to S3.
The service achieves acceleration thanks to:
Routing optimized to keep data on the Amazon network as much as possible
TCP optimizations along the path to maximize data transfer
Persistent connections to minimize connection setup and maximize connection reuse
Frame.io is a video editing and collaboration company. They allow globally distributed professionals to work collaboratively and edit videos in the cloud. The ability to get the contents uploaded to the cloud as quickly as possible shortens the time it takes for collaboration to happen.
Hudl is a video analytics company used by many professional and amateur sports teams to create, edit, and share videos. Coaches use it to improve game play and for recruiting purposes. The ability to get these videos uploaded quickly allows coaches and their assistants to analyze the footage earlier and make quicker adjustments.
Viocorp is a cloud-based video platform provider headquartered in Sydney. It provides corporates and brands with an easy-to-use suite of solutions to reach and engage their audiences with online video.
Multipart upload allows you to upload a large object in a set of parts.
Upload parts in parallel or in any order; pause and resume.
When you complete the multipart upload, S3 puts the parts together and creates the object for you.
+ When you are uploading large objects over a stable connection with significant bandwidth, you can parallelize uploading parts to maximize your network throughput (multi-threaded performance).
+ When uploading objects over a spotty network (mobile), you only need to retry the parts that were interrupted, increasing resiliency to network errors.
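For completeness, the low-level API behind this looks roughly like the following sketch (the SDK transfer managers shown earlier normally do this for you; bucket, key, and file names are placeholders):

```python
# Sketch: low-level multipart upload — initiate, upload parts, complete.
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "backups/big-archive.bin"  # placeholders

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
with open("big-archive.bin", "rb") as f:
    part_number = 1
    while True:
        chunk = f.read(8 * 1024 * 1024)  # 8 MB parts (min 5 MB except the last)
        if not chunk:
            break
        resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                              UploadId=mpu["UploadId"], Body=chunk)
        parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
        part_number += 1

# S3 assembles the parts into the final object.
s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                             MultipartUpload={"Parts": parts})
```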
The third performance topic is getting a higher request rate out of S3 by distributing key names. You need to consider your key naming scheme if you plan to spike above 100 requests per second on a bucket. S3 routinely scales up to millions of requests per second. We spread key names into different partitions, and we partition automatically over time as your request rate grows. The indexing layer is sorted in alphanumeric order, so if you name your objects starting with a date, all of those keys are stored in the same partition; when you make hundreds or thousands of requests per second against that set of keys, you may get throttled.
Maximize the distribution of your keys.
We recommend that you place your hash immediately after the bucket name to maximize the distribution of your keys.
That way your transactions can be distributed across partitions and you can scale up your request rate.
You do not want the bucket name as part of your hash.
Prepend the key name with a short hash.
Alternatively, store the object under a hash of its name and keep the object name in the metadata.
As you store data on S3, you are used to organizing it by location (bucket and prefix). You can now organize your data based on what it is with object tags. S3 tags are key-value pairs; you can apply as many as 10 tags to an object. With tags, you can create IAM policies to manage access, set up S3 lifecycle policies to transition or expire objects, and customize storage metrics and analytics.
There are 2 ways to tag your objects: on put object or with put object tagging. Tags can be created, updated, or deleted at any time during the lifetime of the object.
Tags are key-value pairs and are case sensitive.
Maximum 10 tags per object
Maximum key length—128 Unicode characters
Maximum value length—256 Unicode characters
Narrative: So how much is this data worth? Well, it depends…
Recent data is highly valuable
If you act on it in time
Perishable Insights (M. Gualtieri, Forrester)
Old + Recent data is more valuable
If you have the means to combine them
Narrative: Processing real-time data as it arrives can let you make decisions much faster and get the most value from your data. But building your own custom applications to process streaming data is complicated and resource intensive. You need to train or hire developers with the right skillsets, then wait for months for the applications to be built and fine-tuned, and then operate and scale the application as the business grows.
All of this takes lots of time and money, and, at the end of the day, lots of companies just never get there, settle for the status-quo, and live with information that is hours or days old.
A foundation of highly durable data storage and streaming of any type of data
A metadata index and workflow which helps us categorise and govern data stored in the data lake
A search index and workflow which enables data discovery
A robust set of security controls – governance through technology, not policy
An API and user interface that expose these features to internal and external users
Access via AWS Management Console.
Log into the Console
Create a table
Type in a Hive DDL Statement
Use the console Add Table wizard
Athena uses Hive DDL to define tables.
When you create a table for Athena, you are essentially just creating metadata and, as you run queries, the schema is applied to the data.
Athena is a read-only query service on S3; your S3 data won't change by using Athena.
Run your query.
The console will display the results; results are also written as CSV back to S3 to a location you specify.
For context, Athena uses Presto with full standard SQL support, it can handle complex analysis, including large joins, window functions, and arrays.
Amazon S3 inventory is one of the tools Amazon S3 provides to help manage your storage. You can simplify and speed up business workflows and big data jobs using the Amazon S3 inventory, which provides a scheduled alternative to the Amazon S3 synchronous List API operation. Amazon S3 inventory provides a comma-separated values (CSV) flat-file output of your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (that is, objects that have names that begin with a common string).
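Setting up such a scheduled inventory could look roughly like this sketch; the configuration id, bucket names, prefix, and optional fields are placeholders:

```python
# Sketch: deliver a weekly CSV inventory of a bucket to a destination bucket.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_inventory_configuration(
    Bucket="my-bucket",
    Id="weekly-inventory",
    InventoryConfiguration={
        "Id": "weekly-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "All",
        "Schedule": {"Frequency": "Weekly"},
        "Destination": {"S3BucketDestination": {
            "Bucket": "arn:aws:s3:::my-inventory-reports",
            "Format": "CSV",
            "Prefix": "inventory/",
        }},
        "OptionalFields": ["Size", "LastModifiedDate", "StorageClass",
                           "ETag", "ReplicationStatus"],
    },
)
```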
Notifications can be published to different targets
SNS topic: for broadcasting events to a large number of clients (push, email, mobile alerts).
SQS queue: for processing S3 events asynchronously in a flexible manner; a good choice for triggering workflows that pull from the queue.
Lambda: automatically executes code when an S3 event occurs, without provisioning servers or managing instances.
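Wiring a bucket to one of these targets, for example a Lambda function invoked for new .jpg uploads, could look like this sketch (the function ARN, bucket name, and suffix filter are placeholders):

```python
# Sketch: invoke a Lambda function whenever a .jpg object is created.
# The Lambda function must already permit invocation by S3.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="my-photos",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": "arn:aws:lambda:eu-west-1:123456789012:function:resize",
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [
                {"Name": "suffix", "Value": ".jpg"},
            ]}},
        }],
    },
)
```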
Apply custom logic to process content being uploaded into S3.
Watermarking / thumbnailing
Transcoding
Indexing and deduplication
Aggregation and filtering
Pre processing
Content validation
WAF updates