Simple Storage Service (S3)

  • S3 is object-based: you can upload files (objects) from 0 bytes to 5 TB in size
  • Files are stored in Buckets

Basics of S3

  1. S3 uses a universal namespace, which means bucket names must be globally unique.
    1. for example, in https://s3 ........ amazonaws.com/<bucket_name>, the <bucket_name> must be unique
  2. A successful upload returns HTTP status code 200

What are objects in S3?

An object in S3 consists of the following parts.

  1. Key. This is the FULL path
    1. e.g. <bucket_url>/somePath/someFolder/filename: here somePath/someFolder/filename is the Key, not just the filename
    2. there is no concept of "directories" in S3; the key can simply contain slashes /
  2. Value. The content of the body
    1. max size is 5 TB; a single PUT can upload at most 5 GB, so larger objects must use “multi-part upload”
  3. VersionID. Used for version control, if enabled
  4. MetaData. List of text key / value pairs - system or user data
  5. Tags. Unicode key / value pairs (up to 10), useful for security or lifecycle
  6. Subresources:
    1. Access Control List
    2. Torrent

Consistency Model in S3

Basically, changes to an object can take a little time to propagate.

Read After Write

If you write a new file and read it immediately afterwards, you will be able to view that file

Eventually Consistent

If you update AN EXISTING file or DELETE a file, then read it immediately, you may get the older version or you may not. But if you wait for a minute, you will get the latest version.

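PUTs of new objects have read-after-write consistency. DELETEs and overwrite PUTs have eventual consistency across S3. Note that this is the model S3 used until December 2020; since then, S3 delivers strong read-after-write consistency for all PUT and DELETE requests.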

S3 Features

  1. Tiered Storage Available
  2. Lifecycle Management
  3. Versioning
  4. Encryption
  5. MFA Delete
  6. Secure your data using
    1. Access Control List
    2. Bucket Policy

S3 Storage Classes

https://aws.amazon.com/s3/storage-classes/

There are several classes for S3 storage

| Storage Tier | Description |
|---|---|
| S3 Standard | Provides 99.99% availability and 99.999999999% durability (there are 11 9s LOL) |
| S3 - IA (Infrequently Accessed) | For data that is accessed less frequently but requires rapid access. Lower storage fee than S3 Standard, but it charges a retrieval fee |
| S3 One Zone - IA | For infrequently accessed data that does not require multiple Availability Zone resilience |
| S3 - Intelligent Tiering | Utilizes Machine Learning to move objects to the most cost-effective access tier automatically |
| S3 - Glacier | For data archiving |
| S3 - Glacier Deep Archive | Slowest and cheapest. File retrieval time ~ 12 hours |

  • Durability is the same across all storage classes: 11 9’s
  • Availability for Standard is 99.99%, for S3-IA is 99.9%

Glacier & Glacier Deep Archive

Glacier has 3 options to retrieve objects

  1. Expedited: 1 ~ 5 minutes
  2. Standard: 3 ~ 5 hours
  3. Bulk: 5 ~ 12 hours
  4. Minimum storage duration of 90 days

Glacier Deep Archive - for long-term storage, even cheaper

  1. Standard: 12 hours
  2. Bulk: 48 hours
  3. Minimum storage duration of 180 days

First byte latency

  • milliseconds for S3 Standard, S3-IA, S3-Intelligent Tiering
  • select minutes or hours for S3 Glacier
  • select hours for S3 Glacier Deep Archive

S3 Security

  • User based
    • IAM policies - which API calls should be allowed for a specific user from IAM console
  • Resource based
    • Bucket policies - bucket wide rules from the S3 console - allows cross account
    • Object Access Control List - Finer Grain
    • Bucket Access Control List - Less Common

S3 Bucket Policies

JSON based policies

  • Resources: buckets and objects
  • Actions: set of API to allow or deny
  • Effect: allow or deny
  • Principal: the account or user to apply the policy to

Use an S3 bucket policy to:

  • grant public access to the bucket
  • force objects to be encrypted at upload
  • grant access to another account (Cross-Account)
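
As a rough illustration of the four policy elements above, the sketch below builds two policy documents as Python dicts: one that grants public read access and one that rejects uploads without the SSE-S3 header. The bucket name `example-bucket`, the `Sid` values and the function names are placeholders I picked; treat the documents as a starting point under those assumptions, not as a vetted production policy.

```python
import json


def public_read_policy(bucket_name):
    """Policy document that lets anyone GET objects in the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicRead",                              # hypothetical label
                "Effect": "Allow",                                # Effect
                "Principal": "*",                                 # Principal: everyone
                "Action": "s3:GetObject",                         # Action
                "Resource": f"arn:aws:s3:::{bucket_name}/*",      # Resource: all objects
            }
        ],
    }


def require_sse_s3_policy(bucket_name):
    """Policy document that denies uploads not requesting SSE-S3 encryption."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedUploads",                  # hypothetical label
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "AES256"
                    }
                },
            }
        ],
    }


# Bucket policies are attached as JSON text.
policy_json = json.dumps(public_read_policy("example-bucket"), indent=2)
```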

S3 Security - Other

Networking

Supports VPC Endpoints (for instances in a VPC without internet access)

Logging and Audit

  • S3 access logs can be stored in another S3 bucket
  • API calls can be logged in AWS CloudTrail

User Security

  • MFA
  • Signed Urls: URLs that are valid only for a limited time (eg: premium video service for logged in users)

S3 Encryption

All of the new files that are uploaded to S3 buckets are PRIVATE

There are three encryption methods:

  1. Encryption In Transit
    1. SSL / TLS
  2. Encryption At Rest
    1. S3 Managed Keys : SSE-S3
    2. AWS Key Management Service, Managed keys : SSE-KMS
    3. Server-side Encryption with Customer Provided keys : SSE-C
  3. Client Side Encryption, like gpg, etc.

SSE-S3

Encryption uses keys handled & managed by AWS S3; the files are encrypted server-side.

SSE-S3 uses AES-256 as the encryption type, and "x-amz-server-side-encryption":"AES256" must be set in the request header

[Diagram: the three-step encryption process when using SSE-S3]

[Diagram: the four-step decryption process when using SSE-S3]

SSE-KMS

Encryption uses keys handled & managed by AWS KMS; the files are encrypted server-side.

The request header must set "x-amz-server-side-encryption":"aws:kms"

advantages:

  • user control
  • audit trail

SSE-C

  • Server-side encryption using data keys fully managed by the customer outside of AWS
  • S3 does not store encryption key
  • HTTPS must be used

Encryption In Transit

AWS S3 exposes:

  • HTTP endpoint: non-encrypted
  • HTTPS endpoint: encryption in flight (recommended)

HTTPS is mandatory for SSE-C
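
To compare the three server-side options side by side, here is a small sketch that builds the request headers each mode expects on upload. The helper name `sse_headers` and the all-zero demo key are my own; the header names are the standard `x-amz-server-side-encryption*` ones as far as I know. It only builds a dict, it does not send anything to S3.

```python
import base64
import hashlib


def sse_headers(mode, kms_key_id=None, customer_key=None):
    """Return the upload request headers for one server-side encryption mode."""
    if mode == "SSE-S3":
        return {"x-amz-server-side-encryption": "AES256"}
    if mode == "SSE-KMS":
        headers = {"x-amz-server-side-encryption": "aws:kms"}
        if kms_key_id:
            # Optional: name a specific KMS key instead of the default one.
            headers["x-amz-server-side-encryption-aws-kms-key-id"] = kms_key_id
        return headers
    if mode == "SSE-C":
        if customer_key is None or len(customer_key) != 32:
            raise ValueError("SSE-C needs a 32-byte (256-bit) key")
        # The key travels with every request (hence HTTPS only); S3 does not store it.
        return {
            "x-amz-server-side-encryption-customer-algorithm": "AES256",
            "x-amz-server-side-encryption-customer-key":
                base64.b64encode(customer_key).decode(),
            "x-amz-server-side-encryption-customer-key-MD5":
                base64.b64encode(hashlib.md5(customer_key).digest()).decode(),
        }
    raise ValueError(f"unknown mode: {mode}")


demo_key = bytes(32)  # all-zero key, for illustration only
sse_c = sse_headers("SSE-C", customer_key=demo_key)
```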

S3 Versioning

S3 also supports version control, which is enabled at the bucket level

  1. Stores all versions of an object, including all writes, even if you delete the object. A delete operation only adds a delete marker, so you can still see the file when the version listing is toggled on (reason see below)
  2. Works as a backup tool
  3. Once enabled, versioning cannot be disabled, only suspended
  4. Integrates with Lifecycle rules
  5. Supports MFA Delete

Any file that was uploaded before versioning was enabled will have version "null"

Delete Marker

https://docs.aws.amazon.com/AmazonS3/latest/dev/DeleteMarker.html

S3 creates a delete marker whenever you send a DELETE Object request on an object in a versioning-enabled or suspended bucket. The object named in the DELETE request is not actually deleted. Instead, the delete marker becomes the current version of the object. (The object’s key name (or key) becomes the key of the delete marker.)

If you try to get an object and its current version is a delete marker, Amazon S3 responds with:

  • A 404 (Object not found) error
  • A response header, x-amz-delete-marker: true

The response header tells you that the object accessed was a delete marker. This response header never returns false; if the value is false, Amazon S3 does not include this response header in the response.

The following figure shows how a simple GET on an object, whose current version is a delete marker, returns a 404 No Object Found error.

The only way to list delete markers (and other versions of an object) is by using the versions subresource in a GET Bucket versions request. A simple GET does not retrieve delete marker objects. The following figure shows that a GET Bucket request does not return objects whose current version is a delete marker.
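
The behavior described above can be modeled with a tiny in-memory class. This is a toy sketch, not the S3 API: the class name `VersionedBucket`, the `v1`, `v2` style version IDs and the tuple return shape are all invented for illustration, and a GET that names a delete marker's own version ID is not modeled faithfully.

```python
import itertools


class VersionedBucket:
    """Toy in-memory model of a versioning-enabled bucket."""

    def __init__(self):
        self._versions = {}              # key -> [(version_id, body or None)], newest last
        self._ids = itertools.count(1)

    def put(self, key, body):
        version_id = f"v{next(self._ids)}"
        self._versions.setdefault(key, []).append((version_id, body))
        return version_id

    def delete(self, key):
        # A simple DELETE removes nothing: it stacks a delete marker (body None) on top.
        version_id = f"v{next(self._ids)}"
        self._versions.setdefault(key, []).append((version_id, None))
        return version_id

    def get(self, key, version_id=None):
        """Return (status, headers, body)."""
        versions = self._versions.get(key, [])
        if version_id is None:
            if not versions:
                return 404, {}, None
            _, body = versions[-1]
            if body is None:
                # Current version is a delete marker.
                return 404, {"x-amz-delete-marker": "true"}, None
            return 200, {}, body
        for vid, body in versions:
            if vid == version_id and body is not None:
                return 200, {}, body
        return 404, {}, None

    def list_versions(self, key):
        """Return [(version_id, is_delete_marker)], oldest first."""
        return [(vid, body is None) for vid, body in self._versions.get(key, [])]


bucket = VersionedBucket()
first = bucket.put("photo.jpg", b"original")
marker = bucket.delete("photo.jpg")
simple_get = bucket.get("photo.jpg")                      # hits the delete marker
versioned_get = bucket.get("photo.jpg", version_id=first)  # older data is still there
```

The simple GET fails because the marker is the current version, while the versioned GET still returns the original bytes, which is why versioning doubles as a backup tool.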

S3 CORS

If a website served from one origin (for example, one S3 bucket) requests data from another S3 bucket, you need to enable CORS on the bucket being requested

Cross Origin Resource Sharing (CORS) allows you to limit which websites can request your files in S3, which also limits your costs

Lifecycle Management with S3

S3 uses lifecycle rules to manage objects

  1. Automates moving your objects between different storage tiers
  2. It can be used in conjunction with versioning
  3. It can be applied to current versions and previous versions

S3 Lifecycle Rules

There are two types of actions in S3 Lifecycle Rules

  • Transition actions: define when objects are transitioned to another storage class
    • move objects to the Standard-IA class 60 days after creation
    • move to Glacier for archiving after 6 months
  • Expiration actions: configure objects to expire (be deleted) after some time
    • access log files can be set to delete after 365 days
    • delete old versions of files
    • delete incomplete multi-part uploads

The rules can be applied for a certain prefix (s3://<bucket_name>/mp3/*) or certain object tags
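
Put together, the examples above could look roughly like the following lifecycle configuration, written as a Python dict in the shape the S3 lifecycle API accepts (as far as I recall the field names). The rule ID, the 30-day noncurrent version expiry and the 7-day multipart cleanup are values I made up; `check_rule` is a hypothetical sanity check, not part of any SDK.

```python
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-mp3",                      # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "mp3/"},             # only keys under mp3/
            "Transitions": [
                {"Days": 60, "StorageClass": "STANDARD_IA"},   # 60 days after creation
                {"Days": 180, "StorageClass": "GLACIER"},      # roughly 6 months
            ],
            "Expiration": {"Days": 365},              # delete current versions after a year
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},       # assumed value
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},  # assumed value
        }
    ]
}


def check_rule(rule):
    """Sanity check: transitions happen in increasing order, expiration comes last."""
    days = [t["Days"] for t in rule.get("Transitions", [])]
    ordered = all(a < b for a, b in zip(days, days[1:]))
    expiry = rule.get("Expiration", {}).get("Days")
    expires_last = expiry is None or not days or expiry > days[-1]
    return ordered and expires_last


rule_ok = check_rule(lifecycle_configuration["Rules"][0])
```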

Cross Region Replication in S3

Cross Region Replication in S3 requires enabling versioning for both source bucket and destination bucket

The buckets (source and destination) must be in different AWS Regions. It can cross different accounts. The copy is async.

Keypoints of Cross Region Replication:

  • if a delete marker is added in the source bucket, the delete marker is not replicated to the CRR bucket
  • the files already in the source bucket will not be replicated automatically

S3 MFA-Delete

MFA forces the user to generate a code on a device (usually a mobile phone or a hardware token) before doing important operations on S3

In order to use the MFA-Delete feature, you need to enable Versioning on the S3 bucket. After that, you will need MFA to:

  • permanently delete an object version
  • suspend Versioning on the bucket

Only the bucket owner (root account) can enable/disable the MFA-Delete feature, and it currently can only be enabled using the CLI

S3 Access Logs

You can log all access to an S3 bucket. Any request made to the bucket, from any account, whether authorized or denied, will be logged into another S3 bucket (so there are two S3 buckets here: the bucket whose requests we want to monitor, and the bucket where the logs are stored)

We can use Athena to analyze the logs

S3 Pre-signed Urls

S3 Pre-signed Urls can be generated via SDK or CLI

  • downloading: we can use CLI
  • uploading: we must use SDK
# set the signature version to s3v4; pre-signed URLs for encrypted files need it, otherwise you will have issues
$ aws configure set default.s3.signature_version s3v4

# create a pre-signed url which lasts for 5 minutes. don't forget to add the region in the command
# (the region is e.g. us-east-1; us-east-1a would be an Availability Zone, not a region)
# --expires-in option default value is 3600 seconds
$ aws s3 presign s3://<bucket_name>/<filename> --expires-in 300 --region us-east-1

Users given a pre-signed url will inherit the permissions of the person who generated the url for GET / PUT

Example use cases:

  • only logged-in user can download a premium video on your S3 bucket
  • allow an ever changing list of users to download files by generating URLs dynamically
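
To get a feel for why such a URL can expire and cannot be tampered with, here is a toy signing scheme in Python. It is NOT the real S3 algorithm (S3 uses Signature Version 4, which is more involved); the function names, the `expires` / `signature` query parameters and the demo secret are all invented for this sketch.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse


def _signature(base_url, expires_at, secret):
    # The signature covers both the URL and the expiry time.
    message = f"{base_url}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(base_url, secret, now, expires_in=3600):
    """Append an expiry timestamp and an HMAC over (url, expiry)."""
    expires_at = now + expires_in
    query = urlencode({
        "expires": expires_at,
        "signature": _signature(base_url, expires_at, secret),
    })
    return f"{base_url}?{query}"


def verify_url(url, secret, now):
    """Accept the URL only if it is unexpired and the signature matches."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        given = params["signature"][0]
    except (KeyError, ValueError):
        return False
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    expected = _signature(base_url, expires_at, secret)
    return now <= expires_at and hmac.compare_digest(expected, given)


secret = b"demo-secret"  # stands in for the signer's credentials
url = sign_url("https://example.com/videos/premium.mp4", secret, now=1_000, expires_in=300)
```

Whoever holds the URL can use it until it expires, without having credentials of their own, which is the same idea as inheriting the signer's permissions.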

S3 Performance Baseline

S3 supports at least

  • 3500 PUT / COPY / POST / DELETE
  • 5500 GET / HEAD

requests per second per prefix in a bucket. There are no limits to the number of prefixes in a bucket.

For example, if you spread reads evenly across four prefixes, you can achieve 22000 requests per second for GET and HEAD
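
The arithmetic behind that number is just the per-prefix baseline times the number of prefixes. A minimal sketch (the constant and function names are mine):

```python
WRITE_RPS_PER_PREFIX = 3500   # PUT / COPY / POST / DELETE
READ_RPS_PER_PREFIX = 5500    # GET / HEAD


def aggregate_rps(prefix_count, per_prefix_rps):
    """Baseline requests per second when load is spread evenly over the prefixes."""
    return prefix_count * per_prefix_rps


reads_over_four_prefixes = aggregate_rps(4, READ_RPS_PER_PREFIX)
writes_over_four_prefixes = aggregate_rps(4, WRITE_RPS_PER_PREFIX)
```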

S3 - KMS Limitation

If you use SSE-KMS, you may be impacted by the KMS limits, since

  • when you upload a file, S3 calls the GenerateDataKey KMS API
  • when you download an SSE-KMS encrypted file, S3 calls the Decrypt KMS API

These KMS API calls count towards a per-second request quota, so once you hit it your S3 performance will be impacted

Improve S3 Performance

Multi-part upload

Multipart upload is a three-step process: You initiate the upload, you upload the object parts, and after you have uploaded all the parts, you complete the multipart upload. Upon receiving the complete multipart upload request, Amazon S3 constructs the object from the uploaded parts, and you can then access the object just as you would any other object in your bucket.
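
The "upload the object parts" step needs a plan of which bytes go into which part. Below is a small sketch of that planning step only; it does not talk to S3. The function name `plan_parts` is made up, and the limits used (parts of at least 5 MiB except the last one, at most 10,000 parts) are the commonly documented ones.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # every part except the last must be at least this big
MAX_PARTS = 10_000


def plan_parts(object_size, part_size):
    """Return [(part_number, offset, length)] covering the whole object."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size must be at least 5 MiB")
    part_count = -(-object_size // part_size)   # ceiling division
    if part_count > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    parts = []
    for number in range(1, part_count + 1):     # part numbers start at 1
        offset = (number - 1) * part_size
        length = min(part_size, object_size - offset)
        parts.append((number, offset, length))
    return parts


parts = plan_parts(object_size=20 * MIB + 1, part_size=8 * MIB)
```

Each tuple can then be uploaded independently (and in parallel), and the final "complete" request stitches the parts together in part-number order.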

S3 Transfer Acceleration (upload only)

This utilizes the CloudFront Edge Networks to accelerate your uploads to S3. Instead of uploading directly to your S3 bucket, you can use a distinct url to upload directly to an Edge Location, which will then transfer to S3 using Amazon Backbone Network

This is compatible with multi-part upload

S3 Performance - S3 Byte-Range Fetches

Parallelize GETs by requesting specific byte ranges; this also gives better resilience in case of failures

You can fetch a byte-range from an object, transferring only the specified portion.

Typical sizes for byte-range requests are 8 MB or 16 MB. If objects are PUT using a multipart upload, it’s a good practice to GET them in the same part sizes (or at least aligned to part boundaries) for best performance. GET requests can directly address individual parts; for example, GET ?partNumber=N.
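
As a sketch of how the ranges could be computed, the helper below splits an object into fixed-size chunks and produces the value of the HTTP `Range` header for each one (HTTP byte ranges are inclusive on both ends). The name `range_headers` and the 20 MiB object size are just for illustration.

```python
MIB = 1024 * 1024


def range_headers(object_size, chunk_size):
    """Return the Range header values needed to fetch the object in chunks."""
    headers = []
    for start in range(0, object_size, chunk_size):
        end = min(start + chunk_size, object_size) - 1   # inclusive last byte
        headers.append(f"bytes={start}-{end}")
    return headers


# A 20 MiB object fetched in 8 MiB ranges; each range can be its own parallel GET.
ranges = range_headers(20 * MIB, 8 * MIB)
```

If one range fails, only that range needs to be retried, not the whole download.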

S3 Select & Glacier Select

With S3 Select or Glacier Select, you can retrieve less data by using SQL to perform server-side filtering; you can filter by rows / columns.

S3 Object Lock & Glacier Vault Lock

| S3 Object Lock | Glacier Vault Lock |
|---|---|
| Adopt a WORM (Write-once, Read-many) model | Adopt a WORM (Write-once, Read-many) model |
| Block an object version deletion for a specific amount of time | Lock the policy for future edits (can no longer be changed) |
| | Helpful for compliance and data retention |

AWS Athena

  • Athena is a Serverless service to perform analytics directly against S3 files.
  • We can use SQL to do queries, it also has JDBC / ODBC driver
  • It is charged per query, based on the amount of data scanned

Amazon CloudFront

CloudFront is a Content Delivery Network (CDN), which is:

a system of distributed servers / network that delivers webpages and other web content to a user based on the geographic location of the user, the origin of the webpage and the content delivery server.

  • it improves read performance, content is cached at Edge Locations
  • it has DDoS protection, integration with AWS Shield

Key Terminology of CloudFront

Edge Location

This is the location where content will be cached. It is separate from an AWS Region / AZ.

Origin

This is the origin of all the files that the CDN will distribute. This can be:

  1. S3 Bucket
    1. for distributing files and caching them at the edge locations
    2. enhanced security with CloudFront Origin Access Identity (OAI)
    3. CloudFront can be used as an ingress (to upload files to S3)
  2. S3 Website
    1. must first enable the bucket as a static S3 website
  3. Custom Origin (HTTP) - must be publicly accessible
    1. EC2 Instance
    2. Elastic Load Balancer
  4. Route53

Distribution

This is the name given to the CDN, which consists of a collection of Edge Locations

There are two types of distribution

  1. Web Distribution: for website
  2. RTMP: for Media Streaming

CloudFront Signed URL & Signed Cookies

This is suitable if you want to distribute paid shared content to premium users all over the world. With a CloudFront Signed URL / Cookie, we attach a policy with:

  • URL expiration
  • IP ranges to be allowed to access the data
  • trusted signers - which AWS account can create signed urls

Signed URL : access to individual files (one signed URL per file)

Signed Cookies : access to multiple files (one signed cookie for many files)

How long should the URL be valid for ?

  • shared content (movie & music): make it short
  • private content (private to users): you can make it last for years

CloudFront Signed URL vs S3 Pre-Signed URL

| CloudFront Signed URL | S3 Pre-Signed URL |
|---|---|
| allows access to a path, no matter the origin | issues a request as the person who pre-signed the URL |
| account-wide key-pair, which only the root user can manage | uses the IAM key of the signing IAM principal |
| can be filtered by IP, path, date, expiration | limited lifetime |
| can leverage caching features | |

CloudFront Geo Restriction

  • You can restrict who can access your distribution
    • Allowlist: allow your users to access your content only if they are in one of the areas on a list of approved areas
    • Blocklist: prevent your users from accessing your content if they are in ….

A good use case is when copyright laws require you to control access to content

CloudFront vs S3 cross Region Replication

| CloudFront | S3 Cross Region Replication |
|---|---|
| Global Edge Network | must be set up for each Region you want replication to happen |
| Files are cached for a TTL (maybe a day) | Files are updated in near real-time |
| Great for static content that must be available everywhere | Read Only |
| | Great for dynamic content that needs to be available at low-latency in few regions |

AWS Global Accelerator

This utilizes the AWS internal network to route to your application. Two Anycast IPs are created for your app

the traffic flow will be:

Anycast IP -> Edge Location -> Your application

What is Unicast IP and Anycast IP ?

Unicast IP : one server holds one IP address

Anycast IP : all servers hold the same IP address and the client is routed to the nearest one

AWS Global Accelerator works with Elastic IP, EC2 Instances, ALB, NLB (public or private)

It has:

  • Consistent Performance
    • Intelligent routing to lowest latency and fast regional failover
    • No issue with client cache (the IPs don't change)
    • Internal AWS Network - fast
  • Health Checks
    • Global Accelerator performs a health check of your applications
    • helps make the app global (failover in less than 1 minute for unhealthy endpoints)
    • disaster recovery
  • Security
    • only 2 external IPs need to be allowlisted
    • DDoS protection <- AWS Shield

AWS Global Accelerator vs CloudFront

They both use the AWS global network and edge locations to accelerate traffic, and they both integrate with AWS Shield for DDoS Protection

However,

CloudFront

  • improves performance for cacheable content (such as images and videos)
  • also improves performance for dynamic content (such as API acceleration and dynamic site delivery)
  • content is served at the edge location

Global Accelerator

  • improves performance for a wide range of applications over TCP or UDP
  • proxies packets at the edge location to applications running in one or more AWS Regions
  • good fit for non-HTTP use cases
  • also good for HTTP use cases that require static IP addresses or deterministic, fast regional failover

Performance across the S3 Storage Classes

| | S3 Standard | S3 Intelligent-Tiering* | S3 Standard-IA | S3 One Zone-IA† | S3 Glacier | S3 Glacier Deep Archive |
|---|---|---|---|---|---|---|
| Designed for durability | 99.999999999% (11 9’s) | 99.999999999% (11 9’s) | 99.999999999% (11 9’s) | 99.999999999% (11 9’s) | 99.999999999% (11 9’s) | 99.999999999% (11 9’s) |
| Designed for availability | 99.99% | 99.9% | 99.9% | 99.5% | 99.99% | 99.99% |
| Availability SLA | 99.9% | 99% | 99% | 99% | 99.9% | 99.9% |
| Availability Zones | ≥3 | ≥3 | ≥3 | 1 | ≥3 | ≥3 |
| Minimum capacity charge per object | N/A | N/A | 128KB | 128KB | 40KB | 40KB |
| Minimum storage duration charge | N/A | 30 days | 30 days | 30 days | 90 days | 180 days |
| Retrieval fee | N/A | N/A | per GB retrieved | per GB retrieved | per GB retrieved | per GB retrieved |
| First byte latency | milliseconds | milliseconds | milliseconds | milliseconds | select minutes or hours | select hours |
| Storage type | Object | Object | Object | Object | Object | Object |
| Lifecycle transitions | Yes | Yes | Yes | Yes | Yes | Yes |
Lifecycle transitionsYesYesYesYesYesYes

† Because S3 One Zone-IA stores data in a single AWS Availability Zone, data stored in this storage class will be lost in the event of Availability Zone destruction.

* S3 Intelligent-Tiering charges a small tiering fee and has a minimum eligible object size of 128KB for auto-tiering. Smaller objects may be stored but will always be charged at the Frequent Access tier rates. See the Amazon S3 Pricing for more information.

** Standard retrievals in archive access tier and deep archive access tier are free. Using the S3 console, you can pay for expedited retrievals if you need faster access to your data from the archive access tiers.

*** S3 Intelligent-Tiering first byte latency for frequent and infrequent access tier is milliseconds access time, and the archive access and deep archive access tiers first byte latency is minutes or hours.

This post is licensed under CC BY 4.0 by the author.