A technology reference on object storage — what it is, how it works at the byte level, where the design space splits, when to pick which variant. Use cases (static asset hosting, video/media catalogs, data lakes, ML model and training data stores, backups, user-uploaded files) appear throughout as illustrations of the same class of technology bent to fit different workloads.
§1. What Object Storage Is
An object store is a flat key-to-opaque-blob (Binary Large OBject) store at planet scale. The contract is small: give it a bucket and a key string, hand it bytes (up to 5 TB), and it persists them with eleven nines of durability and serves them back to anyone with read permission, anywhere, indefinitely. It is the system underneath anything where the unit of access is a file-sized blob rather than a row or a record.
The contract is also defined by what it does not do:
- No mid-object writes. A GET returns the whole object (or a byte range); a PUT writes the whole object. No in-place edit.
- No semantics over content. A 5 GB Apache Parquet file, a 2 KB JPEG, a 200 GB Iceberg snapshot, a 4 MB XGBoost (eXtreme Gradient Boosting) model pickle: all are opaque bytes. Schema lives elsewhere (Hive Metastore, Iceberg catalog, a database row).
- No POSIX (Portable Operating System Interface) semantics. No fsync, no open() returning a file descriptor, no directory inodes, no hard links, no rename(). Console "folders" are a fiction over key prefixes.
- No multi-object transactions. Two PUTs are two independent operations; no BEGIN; PUT a; PUT b; COMMIT;.
Object storage sits at one corner of a four-way design space among bytes-at-rest technologies:
- Block storage — AWS EBS (Elastic Block Store), GCE Persistent Disk, Azure Managed Disks. Raw disk to one host; random reads/writes at the sector (4 KB) level; tied to one VM. Filesystems run on top. Latency ~100µs–1ms; tens of TB per volume. Right for DB data files, OS root volumes, Redis AOF (Append-Only File) logs. Wrong for "share this video with 30M viewers."
- File storage — NFS (Network File System), AWS EFS (Elastic File System), Azure Files, GCP Filestore. POSIX over the network: directories, file handles, byte-level I/O, locks. Latency a few ms; PB range. Right for HPC (High-Performance Computing) scratch, legacy apps, shared config. Wrong for billions of small objects — EFS costs ~10x S3 per GB.
- Object storage — S3, Google Cloud Storage (GCS), Azure Blob, R2 (Cloudflare's S3-compatible offering), MinIO, Ceph RADOS Gateway. The subject of this doc.
- Databases / structured storage — schemas, transactions, queries, secondary indexes. Wrong for 2 MB image blobs (they ruin the buffer pool; see 01_databases.md); wrong for 80 GB model artifacts; wrong for a petabyte of Parquet.
Mental model: block storage is "a fast local disk for one server"; file storage is "a shared disk for many servers"; databases are "structured tables for fast lookup"; object storage is "infinite cheap durable space for everything that doesn't fit the other three."
What object storage is NOT good for:
- Low-latency point reads. S3 GET is ~30–80ms p50 from inside the same region, ~150–300ms p99. For "render this page in 50ms," put a CDN (Content Delivery Network) or in-memory cache in front.
- Transactional updates. No multi-object atomicity. If your invariant is "balance moves between two accounts," do that in a database.
- Querying inside objects. S3 Select and Athena scan whole objects and bill by bytes scanned. If you need "select * where user_id = 42," put data in a database or columnar warehouse.
- Append-heavy small writes. 100-byte log lines at 100k/sec = 100k separate PUTs per second = request charges (~$5/million at $0.005 per 1k PUTs) + 100k metadata slots per second. Buffer locally, write 64 MB batches.
- Filesystem semantics. No rename(). "Renaming" a 10 TB folder is N COPY + N DELETE calls, the most-burned data-lake anti-pattern.
- Random read/write inside a 5 GB file. Range GET reads fine; mutation requires re-uploading the whole thing.
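The "buffer locally, write 64 MB batches" advice can be sketched as below. This is a minimal illustration, not a production writer: `BatchingWriter` and its `upload` callback are hypothetical names, and `upload` stands in for whatever call performs one PUT.

```python
from typing import Callable, List


class BatchingWriter:
    """Accumulate small records and flush them as one large object."""

    def __init__(self, upload: Callable[[bytes], None],
                 batch_bytes: int = 64 * 1024 * 1024):
        self._upload = upload            # hypothetical: performs one PUT per call
        self._batch_bytes = batch_bytes  # flush threshold (64 MB by default)
        self._buf: List[bytes] = []
        self._size = 0

    def write(self, record: bytes) -> None:
        """Buffer one record; flush once the threshold is reached."""
        self._buf.append(record)
        self._size += len(record)
        if self._size >= self._batch_bytes:
            self.flush()

    def flush(self) -> None:
        """Emit everything buffered so far as a single upload."""
        if self._buf:
            self._upload(b"".join(self._buf))
            self._buf, self._size = [], 0
```

With a 64 MB threshold, 100-byte records turn roughly 670,000 would-be PUTs into one. A real writer would also flush on a timer so a quiet stream does not hold data in memory indefinitely.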
The canonical role: object storage is the passive durable layer at the bottom of the stack. The data lake, the model registry, the image-CDN origin, the database's nightly backups, the user-uploaded files — all live here. The repeated pattern: the database holds metadata pointing at the object; the object holds the bytes.
§2. Inherent Guarantees
What the technology provides by design, and what must still be layered above it.
Provided by design
- Strong read-after-write consistency for all operations. For S3 since December 2020; before that, S3 had eventual consistency for overwrites and deletes, forcing application-level workarounds (sentinel keys, immutable naming). Today all S3 ops (PUT, GET, LIST, DELETE, HEAD) are strongly consistent. GCS, Azure Blob, R2, and modern MinIO have been strongly consistent since launch. The historical "list-after-write" weirdness is gone.
- Eleven nines of durability (99.999999999%). Expected annual loss ~1 object per 100 billion stored. Mechanism: replication, erasure coding, AZ (Availability Zone) separation.
- Effectively unlimited capacity. No bucket size limit. AWS reports S3 holds over 280 trillion objects.
- Per-object availability at the SLA tier (Service Level Agreement) — S3 Standard 99.99% monthly; One Zone IA (Infrequent Access) 99.5%; Glacier tiers a separate SLA on retrievals.
- Regional or multi-region scope. A bucket lives in one region across AZs; cross-region copies are explicit (S3 CRR — Cross-Region Replication, GCS dual-region buckets).
- Versioning as opt-in: each PUT to the same key creates a new version; old versions retained by policy.
Must be layered above
- Low-latency reads. 50–150ms p50, hundreds of ms p99 from outside the region. For sub-50ms reads, put a CDN in front (CloudFront, Cloudflare, Fastly) or cache in memory.
- Cross-object transactions. "Update object A and B atomically" is the application layer's job. The pattern (Apache Iceberg, Delta Lake, Apache Hudi): write new files with unique names, then atomically flip a metadata pointer in a catalog.
- Query inside objects. Iceberg / Delta / Hudi catalogs, Athena / BigQuery / Snowflake, or pull into a database. Object storage doesn't index.
- Rich metadata. S3 metadata is a flat KV map (x-amz-meta-*). Anything beyond key/size/last-modified (owner, parent post, moderation status) goes in a real database.
- Quota and rate management. S3 doesn't refuse writes when you've spent too much money; it bills you. Cost guardrails live in IAM (Identity and Access Management) or the application.
- Encryption with customer-managed keys. Default SSE-S3 (Server-Side Encryption with S3-managed keys) is automatic; SSE-KMS (Key Management Service), SSE-C (Customer-provided), and client-side require explicit work (§13).
- Multi-region read/write semantics. S3 MRAP (Multi-Region Access Point) is eventually consistent across regions. Active-active is its own design problem (§7.3).
- Lifecycle and tiering decisions. Provider gives tools (lifecycle policies, storage classes); deciding what to keep where for how long is the designer's call (§12).
- Permissions hygiene. IAM policies, bucket policies, ACLs (Access Control Lists, legacy), pre-signed URLs, Public Access Blocks. "S3 bucket leak" is the most common cloud-misconfiguration breach class.
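The "write new files with unique names, then atomically flip a metadata pointer" pattern from the cross-object transactions item can be sketched with a toy in-memory model. Everything here is an assumption for illustration: `Catalog`, `commit`, the key layout, and the dict standing in for a bucket are hypothetical, and real table formats such as Iceberg use far richer manifests.

```python
import uuid


class Catalog:
    """Toy catalog: one pointer per table, updated by compare-and-swap."""

    def __init__(self):
        self._current = {}

    def current(self, table):
        return self._current.get(table)

    def swap(self, table, expected, new):
        """Flip the pointer only if nobody else committed in between."""
        if self._current.get(table) != expected:
            return False
        self._current[table] = new
        return True


def commit(store, catalog, table, files):
    """Write data files under unique keys, then publish them with one pointer flip."""
    base = catalog.current(table)
    keys = []
    for body in files:
        key = f"{table}/data/{uuid.uuid4().hex}.parquet"
        store[key] = body               # independent PUTs, invisible to readers so far
        keys.append(key)
    previous = store[base]["files"] if base else []
    manifest_key = f"{table}/metadata/{uuid.uuid4().hex}.json"
    store[manifest_key] = {"files": previous + keys}
    # The swap is the only atomic step; on failure the new files are orphaned garbage.
    return manifest_key if catalog.swap(table, base, manifest_key) else None
```

Readers always resolve the table through the catalog pointer, so they see either the old file set or the new one, never a partial mix.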
Synthesis: the technology guarantees bytes you wrote come back, intact, forever with strong consistency. Everything else — fast reads, transactional semantics, query, rich metadata, encryption, cost — is the system designer's problem.
§3. The Design Space
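Variants differ in pricing, ecosystem, and a few protocol details, but they share one conceptual model, and most speak either the AWS-original S3 REST API or a re-implementation of it. Azure Blob is the main exception, with its own REST API; GCS has a native API plus an S3-interoperable XML API.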
Hyperscaler managed
- AWS S3 (Simple Storage Service) — launched March 2006. The protocol everyone else copies. Eleven nines durability; 99.99% Standard availability; six storage classes. Trillions of objects, exabytes. Strongly consistent since Dec 2020. The benchmark.
- Google Cloud Storage (GCS) — same conceptual model. Distinctive: dual-region/multi-region buckets with transparent replication (no CRR setup); bucket-wide versioning; tiers are Standard, Nearline (30-day min), Coldline (90-day min), Archive (365-day min). Strong consistency from day one.
- Azure Blob Storage — three blob types: Block (S3-equivalent), Append (optimized for log appends), Page (EBS-equivalent, backs VM disks). Hot / Cool / Cold / Archive tiers. Strong consistency. Tight integration with Azure Active Directory RBAC (Role-Based Access Control), SQL Server, Office 365.
S3-compatible alternatives
- Cloudflare R2 — launched 2022. Defining feature: zero egress fees. S3 charges ~$0.09/GB egress to the internet; R2 charges $0. For public-asset and large-dataset workloads, that's a 10–100x cost reduction. Storage ~$0.015/GB/mo. S3-API-compatible. Catch: limited tiering, smaller ecosystem.
- Backblaze B2 — launched 2015. Storage ~$0.006/GB/mo, egress $0.01/GB. S3 API. Bandwidth alliance with Cloudflare gives effectively zero egress to the Cloudflare network. Well-regarded for backup.
- Wasabi — single tier, $0.0059/GB/mo, no egress fees, no API call fees. "All-in" predictable pricing. Trade-off: no tiered options, smaller global footprint.
- DigitalOcean Spaces — S3-compatible at $5/mo for 250 GB + 1 TB egress. Small workloads.
Self-hosted
- MinIO — open-source, S3-API-compatible, single Go binary. Scales from a home lab to multi-PB. Common in air-gapped environments, on-prem data lakes, local-dev S3 stand-in. Erasure coding configurable. AGPL (Affero General Public License) v3 community edition.
- Ceph + RADOS Gateway — open-source distributed storage with object, block, and file interfaces. RADOS (Reliable Autonomic Distributed Object Store) is the underlying layer; radosgw puts an S3 REST API on top. Used at CERN (exabyte scale), DigitalOcean Spaces (Ceph-backed for years), many universities. Higher operational complexity than MinIO; more flexible.
- OpenStack Swift — original (2010), historically eventual consistency. Used at Rackspace, eBay. Declining as the field consolidates around S3-API-compatible MinIO/Ceph.
Comparison table
| Dimension | AWS S3 | GCS | Azure Blob | R2 | MinIO (self-hosted) |
|---|---|---|---|---|---|
| Storage cost (Standard, USD/GB/mo) | $0.023 | $0.020 | $0.018 | $0.015 | hardware cost |
| Egress cost (USD/GB to internet) | $0.09 | $0.12 | $0.087 | $0.00 | bandwidth cost |
| PUT cost (per 1k requests) | $0.005 | $0.005 | $0.0055 | $4.50/M class A | none |
| GET cost (per 1k requests) | $0.0004 | $0.0004 | $0.0004 | $0.36/M class B | none |
| Durability | 11 9s | 11 9s | 11 9s (LRS); 12 9s (ZRS); 16 9s (GRS) | 11 9s | depends on EC config |
| Strong consistency | Yes (since 2020) | Yes | Yes | Yes | Yes |
| Max object size | 5 TB | 5 TB | ~4.77 TB (block blob) | 5 TB | configurable |
| Storage tiers | 6 (Standard → Deep Archive) | 4 (Standard → Archive) | 4 (Hot → Archive) | 1 | 1 (tiers via lifecycle) |
| Multi-region replication | CRR (configurable) | Dual/multi-region buckets | GRS / RA-GRS | Future | manual |
| Versioning | Yes (opt-in) | Yes (bucket-level) | Yes | Yes | Yes |
| Object Lock (WORM) | Yes | Bucket Lock | Immutability policies | Yes | Yes |
| Typical use | General | Google ecosystem, ML | Microsoft ecosystem | Public asset serving | On-prem, air-gapped |
The pattern: storage cost is converging at $0.015–$0.023/GB/mo across providers; egress cost is where the actual money is. A 100 TB content library serving 1 PB/month of egress costs $90,000/month in egress on S3 and $0 on R2. That gap dwarfs the ~$800/month storage difference ($2,300 vs $1,500). R2's pricing model is a deliberate disruption of "egress as a customer-retention moat."
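The arithmetic behind that comparison can be reproduced with a few lines, using the list prices from the table above. `monthly_cost` is an illustrative helper rather than any provider's API, and it deliberately ignores request charges, free tiers, and volume discounts.

```python
def monthly_cost(stored_gb: float, egress_gb: float,
                 storage_per_gb: float, egress_per_gb: float) -> float:
    """Storage plus internet egress for one month, in USD."""
    return stored_gb * storage_per_gb + egress_gb * egress_per_gb


STORED_GB = 100_000      # 100 TB library
EGRESS_GB = 1_000_000    # 1 PB served per month

s3_total = monthly_cost(STORED_GB, EGRESS_GB, storage_per_gb=0.023, egress_per_gb=0.09)
r2_total = monthly_cost(STORED_GB, EGRESS_GB, storage_per_gb=0.015, egress_per_gb=0.00)

print(f"S3: ${s3_total:,.0f}/mo   R2: ${r2_total:,.0f}/mo")
```

Changing `EGRESS_GB` shows how quickly the egress term dominates: at low egress the two providers differ only by the per-GB storage price.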
§4. Byte-Level: How Object Storage Actually Works Internally
The depth section. We zoom into the storage layer and explain the actual mechanics — erasure coding, the prefix sharding model that drives the throughput limits, multipart upload byte choreography, pre-signed URL HMAC (Hash-based Message Authentication Code) math, versioning, and a single PUT byte-by-byte.
4.1 What's actually in a bucket: keys and a flat namespace
The "folders and files" mental model is fiction. A bucket is a flat KV map from string keys to opaque bytes plus a tiny metadata blob. Forward slashes are just bytes; users/42/avatar.jpg is one key, not three nested directories. The S3 console parses slashes to render a tree, but the storage layer doesn't know.
Consequences:
- No rename(). "Moving" users/42/ to users/v2/42/ is N copies + N deletes. For a billion-object prefix, a multi-day job.
- Console "renames" secretly do the copy-delete dance.
- No folder-level permissions. Bucket policies use prefix string matching (arn:aws:s3:::my-bucket/users/42/*).
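To make the cost of a "rename" concrete, here is a toy model in which a plain dict stands in for the bucket. `rename_prefix` is a hypothetical helper; against a real store each loop iteration would be one copy request followed by one delete request.

```python
def rename_prefix(store: dict, old: str, new: str) -> int:
    """Move every key under `old` to `new`; return the number of requests issued."""
    requests = 0
    # Snapshot the matching keys first, since the loop mutates the mapping.
    for key in [k for k in store if k.startswith(old)]:
        store[new + key[len(old):]] = store[key]   # COPY
        del store[key]                             # DELETE
        requests += 2
    return requests
```

The request count grows linearly with the number of keys under the prefix, and nothing makes the whole move atomic: a reader can observe a half-moved "folder" at any point.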
A key is up to 1024 bytes (S3) or 1024 characters (GCS). Per-object metadata: key, version-id, size, content-type (MIME — Multipurpose Internet Mail Extensions), last-modified, etag (MD5 of content or multipart-derived hash), storage class, encryption headers, user x-amz-meta-* headers (a few KB).
4.2 Erasure coding: how 11 nines actually happen
Object storage advertises "eleven nines of durability." Where do the nines come from? The answer is erasure coding — Reed-Solomon — applied across many disks and AZs, replacing or supplementing classical replication.
Naive approach: triple replication. Store three copies. To lose data, all three disks must die simultaneously. Disk failure rate ~2%/year on commodity SATA (Serial ATA); 3-of-3 simultaneous failure is small. Cost: 3x storage overhead. At exabyte scale this matters a lot.
Erasure coding does better. Reed-Solomon takes a data block of size D and produces D + P fragments; any D of D+P suffice to reconstruct. Common deployed configurations:
- (10, 4) — 10 data + 4 parity. Tolerates 4 simultaneous losses. Overhead 1.4x. Stronger than 3x replication and 2x cheaper.
- (6, 3) — MinIO default. 1.5x overhead.
- (17, 3) — used in Backblaze Vaults (Meta's f4 warm storage uses (10, 4)). 1.18x overhead, tolerates 3 losses.
The math sketch. Reed-Solomon is finite-field linear algebra over GF(256) (a Galois field of 256 elements). The D data bytes are polynomial coefficients; D + P fragments are evaluations at D + P points; any D points let you Lagrange-interpolate the polynomial and recover coefficients. Arithmetic is XOR (eXclusive OR) and GF multiplication, ~GB/sec per core with AVX (Advanced Vector Extensions).
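The "evaluate a polynomial, recover it from any D points" idea can be demonstrated with a toy code. One loud assumption: this sketch works in the prime field GF(257) with ordinary modular arithmetic, not GF(256) as production codecs do, because that keeps the code short. A consequence is that a fragment value can be 256 and so would not fit in a byte, which is one reason real systems use GF(256). The function names are illustrative.

```python
P_MOD = 257  # prime modulus; a simplified stand-in for GF(256)


def encode(data, n_parity):
    """Treat the D data values as polynomial coefficients; evaluate at x = 1..D+P."""
    fragments = []
    for x in range(1, len(data) + n_parity + 1):
        y = 0
        for coeff in reversed(data):          # Horner's rule
            y = (y * x + coeff) % P_MOD
        fragments.append((x, y))
    return fragments


def decode(fragments, d):
    """Recover the d coefficients from any d fragments (Gauss-Jordan mod P_MOD)."""
    if len(fragments) < d:
        raise ValueError("not enough fragments to reconstruct")
    # Augmented Vandermonde rows: [1, x, x^2, ..., x^(d-1) | y]
    rows = [[pow(x, j, P_MOD) for j in range(d)] + [y] for x, y in fragments[:d]]
    for col in range(d):
        pivot = next(r for r in range(col, d) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], -1, P_MOD)  # modular inverse (Python 3.8+)
        rows[col] = [(v * inv) % P_MOD for v in rows[col]]
        for r in range(d):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(a - f * b) % P_MOD for a, b in zip(rows[r], rows[col])]
    return [rows[i][d] for i in range(d)]
```

For example, encoding five values with three parity fragments yields eight fragments, and `decode` accepts any five of them in any order. Solving the Vandermonde system is equivalent to the Lagrange interpolation described above.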
Spreading across failure domains. Erasure coding is necessary but not sufficient — fragments must spread across independent failure domains (different disks, racks, power circuits, AZs). S3 stripes a single object's fragments across multiple AZs in a region. The Dec 2017 US-East-1 outage (§8) demonstrated this works: surviving AZs still served GETs.
Concrete numbers for an example (9, 6) scheme across 3 AZs:
Object O is 100 MB. Encode into 15 fragments of ~11.1 MB each.
AZ-1: fragments 1..5
AZ-2: fragments 6..10
AZ-3: fragments 11..15
A GET reconstructs from any 9 of 15 → typically reads from 2 of 3 AZs.
Failure scenarios:
Lose 1 disk → 14 of 15 left. Fine.
Lose 1 rack → fine if rack held <7 fragments.
Lose 1 AZ → 10 of 15 left. Still readable (need 9). Repair re-encodes lost fragments elsewhere.
Lose 2 AZs → 5 left, below threshold. UNREADABLE in this region. Recovery from another region's copy if CRR is configured.
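The failure scenarios above reduce to counting surviving fragments against the data-fragment threshold. A small sketch, with hypothetical helper names and the same (9, 6) placement of five fragments per AZ:

```python
def overhead(d: int, p: int) -> float:
    """Raw bytes stored per logical byte for a (d, p) scheme."""
    return (d + p) / d


def fragments_left(placement: dict, lost: set) -> int:
    """Count fragments in failure domains that are still up."""
    return sum(n for domain, n in placement.items() if domain not in lost)


def readable(d: int, placement: dict, lost: set = frozenset()) -> bool:
    """An object is readable while at least d fragments survive."""
    return fragments_left(placement, lost) >= d


PLACEMENT = {"az-1": 5, "az-2": 5, "az-3": 5}   # (9, 6): 15 fragments in total
```

`readable(9, PLACEMENT, {"az-1"})` holds because 10 fragments remain, while losing two AZs leaves only 5. The same check works for racks or disks if `placement` is keyed by those domains instead.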
Durability math. With 2%/year disk failure rate, repair time in hours, and (10, 4) across independent failure domains, expected per-object annual loss probability is ~10⁻¹¹ — the eleven-nines number.
Hot vs cold trade-off. EC has a cost: reading an object requires reading D fragments instead of one. For hot reads, this is bandwidth amplification. S3 Standard uses lower-overhead schemes or replication for the active tier; Glacier Deep Archive uses extreme EC (high D, very long repair budgets) on different media (tape libraries, SMR — Shingled Magnetic Recording — drives) optimized for cost.
4.3 The prefix sharding model: where 3500 PUT / 5500 GET per prefix comes from
S3's published limit is 5,500 GET / HEAD per second per prefix and 3,500 PUT / POST / COPY / DELETE per second per prefix. What's a "prefix" and where do the numbers come from?
Internal architecture (publicly described in AWS re:Invent talks): S3 partitions the keyspace by key prefix. A bucket's index is divided into many shards (or "index partitions"); each owns a contiguous, lexicographically ordered range of keys and runs on one set of metadata + storage servers. The 3500/5500 number is one shard's envelope.
Shards auto-split — S3 detects sustained hot prefixes and splits the range so load spreads. The split is gradual (minutes to hours, longer for bursty wide loads), and during the lag you throttle.
Concrete failure mode. Common data-lake anti-pattern:
s3://my-lake/year=2026/month=05/day=22/hour=14/file-000.parquet
s3://my-lake/year=2026/month=05/day=22/hour=14/file-001.parquet
... 10,000 files written per hour, all under the same prefix
Every key shares the same leading characters (year=2026/month=05/day=22/hour=14/), so all files land in one shard. When a write burst exceeds the per-prefix limit, you get 503 SlowDown responses and must back off. The job "slows randomly" because S3 is throttling.
Fix: hash-prefix the key.
s3://my-lake/a3f7/year=2026/month=05/day=22/hour=14/file-000.parquet
s3://my-lake/b9e2/year=2026/month=05/day=22/hour=14/file-001.parquet
Prepending a 4-char hex hash from MD5(filename)[:4] spreads keys across 65k prefix buckets; each shard sees ~1/65k of the traffic. Bucket throughput becomes effectively unbounded.
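A minimal sketch of the key transformation just described, using only the standard library. `hash_prefixed_key` is an illustrative name, and hashing the final path segment is one reading of "MD5(filename)"; hashing the full key would work equally well.

```python
import hashlib


def hash_prefixed_key(key: str, width: int = 4) -> str:
    """Prepend the first `width` hex chars of MD5(filename) to the key."""
    filename = key.rsplit("/", 1)[-1]
    prefix = hashlib.md5(filename.encode("utf-8")).hexdigest()[:width]
    return f"{prefix}/{key}"


key = hash_prefixed_key("year=2026/month=05/day=22/hour=14/file-000.parquet")
```

The prefix is deterministic, so a reader that knows the logical key can recompute the physical one. The trade-off is that listing "everything for hour=14" now requires consulting an index or fanning out across all prefixes.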
Modern table formats (Iceberg, Delta, Hudi) write paths like s3://lake/db/table/data/00000-12-abc123/file.parquet where a hash segment is built in. Hive-style partition tables need explicit guidance.
Bucket-wide envelope. No documented upper bound — "unlimited, limited by distinct-prefix count." Netflix, Lyft, Uber have published >100,000 GET/sec sustained against a single well-distributed bucket.
4.4 Multipart upload mechanics: the 5 GB single-PUT limit
A single PUT Object request maxes at 5 GB. To upload an object larger than 5 GB (up to the 5 TB object limit), you use multipart upload. Even for objects between 100 MB and 5 GB, multipart is recommended for parallelism.
The flow.
1. Client → S3: CreateMultipartUpload(bucket, key)
← UploadId (a server-side handle)
2. Client splits the file into N parts, each 5 MB–5 GB:
part 1: bytes 0 .. P1
part 2: bytes P1 .. P2
...
part N: bytes Pn-1 .. EOF
3. Client uploads each part independently (in parallel, with retries):
UploadPart(bucket, key, UploadId, partNumber=1, body=<part 1>)
← ETag1
UploadPart(bucket, key, UploadId, partNumber=2, body=<part 2>)
← ETag2
...
UploadPart(bucket, key, UploadId, partNumber=N, body=<part N>)
← ETagN
4. Client → S3: CompleteMultipartUpload(bucket, key, UploadId, [(1, ETag1), (2, ETag2), ..., (N, ETagN)])
← S3 atomically assembles the parts in order, stores the
resulting object, returns the multipart ETag and version-id.
ETag for multipart = MD5 of concatenated part MD5s, with -N suffix.
Parts can be uploaded in any order, retried independently, paused and resumed across days. Until CompleteMultipartUpload, parts are "staging" — visible only via ListParts(uploadId), not GET. Complete is the atomic switch: before, no object; after, full object visible.
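Step 2 of the flow, splitting a file into byte ranges, can be sketched as follows. The 5 MB and 5 GB bounds come from the text; the 10,000-part cap is S3's documented limit and is not stated above, so treat it as an added assumption. `plan_parts` is a hypothetical helper that only computes offsets and performs no I/O.

```python
MIN_PART = 5 * 1024**2    # 5 MB minimum for non-final parts
MAX_PART = 5 * 1024**3    # 5 GB maximum per part
MAX_PARTS = 10_000        # assumption: S3's documented cap on parts per upload


def plan_parts(size: int, part_size: int = 100 * 1024**2):
    """Return [(part_number, start, end)] with half-open byte ranges covering the file."""
    if not MIN_PART <= part_size <= MAX_PART:
        raise ValueError("part_size must be between 5 MB and 5 GB")
    # Grow the part size if the requested one would need more than MAX_PARTS parts.
    part_size = max(part_size, -(-size // MAX_PARTS))
    if part_size > MAX_PART:
        raise ValueError("object too large for a single multipart upload")
    parts, start, number = [], 0, 1
    while start < size:
        end = min(start + part_size, size)
        parts.append((number, start, end))
        start, number = end, number + 1
    return parts
```

Each tuple maps directly onto one UploadPart call, and because the ranges are independent they can be uploaded in parallel or retried individually.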
Why this exists. Three reasons:
- The 5 GB single-PUT limit. Large HTTP bodies are awkward — TCP connections drop, proxies impose size limits, no resume mechanism. Multipart turns a 5 TB upload into 1000 × 5 GB chunks.
- Parallelism. One TCP connection saturates ~50 MB/sec due to congestion control and HTTP/1.1 single-stream limits. Ten 100 MB parts in parallel reach ~500 MB/sec on a fast link.
- Resumability. A 100 GB upload (in 5 GB parts) that fails at 80 GB resumes from part 17, not zero.
5 MB minimum, etag math. Non-final parts must be ≥ 5 MB. The multipart etag is:
multipart_etag = MD5(MD5(part1) || MD5(part2) || ... || MD5(partN)) || "-N"
Not the MD5 of the assembled object — common source of "client-side checksum doesn't match etag" bugs.
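The formula above translates directly into a few lines of Python. This is a sketch for client-side verification; `multipart_etag` is an illustrative name, and it assumes you know the exact part boundaries the uploader used, since different part sizes give different etags for the same bytes.

```python
import hashlib


def multipart_etag(parts) -> str:
    """MD5 over the concatenated binary MD5 digests of each part, plus '-N'."""
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"


parts = [b"a" * 10, b"b" * 5]
etag = multipart_etag(parts)
whole_object_md5 = hashlib.md5(b"".join(parts)).hexdigest()
```

Note that the inner digests are concatenated as raw 16-byte values, not as hex strings. `etag` and `whole_object_md5` differ, which is exactly the mismatch described above.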
Garbage from failed uploads. Start a multipart, never Complete or Abort → parts stage forever, you pay storage. Common cost leak: a buggy retry that opens a new multipart instead of resuming. Detect with ListMultipartUploads; lifecycle policy to auto-abort >N days.
4.5 Pre-signed URLs: HMAC-signed query parameters for time-limited access
A pre-signed URL is an S3 URL with cryptographic-signature query parameters. Anyone holding it can perform the operation it's signed for (GET, PUT, DELETE) without their own AWS credentials, until expiry.
Mechanism (S3 v4 signature, simplified).
1. Application has (access_key, secret_key) and decides:
op=PUT, bucket=uploads, key=user-42/avatar.jpg,
expires=900s, headers=Content-Type: image/jpeg
2. Build canonical_request from (method, URI, query params, headers, payload-hash).
3. Build string_to_sign = "AWS4-HMAC-SHA256\n<date>\n<scope>\nSHA256(canonical_request)"
4. Derive signing key by HMAC-chain:
date_key = HMAC-SHA256("AWS4"+secret_key, "20260522")
region_key = HMAC-SHA256(date_key, "us-east-1")
service_key = HMAC-SHA256(region_key, "s3")
signing_key = HMAC-SHA256(service_key, "aws4_request")
5. signature = HMAC-SHA256(signing_key, string_to_sign)
6. Build URL:
https://uploads.s3.amazonaws.com/user-42/avatar.jpg
?X-Amz-Algorithm=AWS4-HMAC-SHA256
&X-Amz-Credential=AKIA.../20260522/us-east-1/s3/aws4_request
&X-Amz-Date=20260522T143000Z
&X-Amz-Expires=900
&X-Amz-SignedHeaders=content-type;host
&X-Amz-Signature=<hex>
7. Client PUTs directly to S3, never touching the app server.
S3 verifies: same canonical request, same string-to-sign, same signing key (S3 has access to the secret_key derivation). Signature matches + not expired → allow.
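The signing steps can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: it signs only the host header (the walkthrough above also signs content-type), uses the UNSIGNED-PAYLOAD convention for the payload hash, and takes the timestamp as an argument so the result is reproducible. `signing_key` and `presign` are hypothetical helper names, and the credentials are placeholders; in practice an SDK does this for you.

```python
import hashlib
import hmac
from urllib.parse import quote


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def signing_key(secret_key: str, date: str, region: str, service: str = "s3") -> bytes:
    """The HMAC chain: date -> region -> service -> 'aws4_request'."""
    k = _hmac(("AWS4" + secret_key).encode("utf-8"), date)
    k = _hmac(k, region)
    k = _hmac(k, service)
    return _hmac(k, "aws4_request")


def presign(method, host, key, access_key, secret_key, region, amz_date, expires):
    """Build a query-string-signed URL; amz_date looks like '20260522T143000Z'."""
    date = amz_date[:8]
    scope = f"{date}/{region}/s3/aws4_request"
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    # Canonical query string: sorted by name, every reserved character escaped.
    query = "&".join(f"{quote(k, safe='')}={quote(v, safe='')}"
                     for k, v in sorted(params.items()))
    path = "/" + quote(key, safe="/")
    canonical_request = "\n".join(
        [method, path, query, f"host:{host}\n", "host", "UNSIGNED-PAYLOAD"])
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256", amz_date, scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest()])
    key_bytes = signing_key(secret_key, date, region)
    signature = hmac.new(key_bytes, string_to_sign.encode("utf-8"),
                         hashlib.sha256).hexdigest()
    return f"https://{host}{path}?{query}&X-Amz-Signature={signature}"
```

Because the HTTP method is the first line of the canonical request, signing the same key for GET and for PUT yields different signatures, which is why a GET-signed URL cannot be replayed as a PUT.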
Why this is so useful. Without pre-signed URLs, every upload passes through your app server (client → app → S3); 10 MB × 10k concurrent uploads saturates app server bandwidth. With pre-signed URLs, client → app (200-byte signed URL) → client → S3 directly. This is how Notion, Slack, Discord, and every modern web app handle large uploads.
Security properties. The URL is the credential — anyone holding it has the rights until expiry. Deliver over HTTPS only; never email. Keep expiry short (15 min upload, 1 hr download). Signature includes HTTP method — a GET-signed URL can't PUT.
4.6 The "list operations are eventually consistent" history
Before December 2020, S3 was famously confusing:
- PUT of a new object → strongly consistent.
- PUT overwriting → eventually consistent. GET could return old or new for seconds.
- DELETE → eventually consistent. GET could still find a deleted object.
- LIST → eventually consistent. New objects might miss a listing for seconds.
This caused a bug class: write → list to confirm → see empty list → retry → duplicate writes. Hadoop/Spark spent years building workarounds (S3Guard, EMRFS consistent view) because "I wrote a file, I should see it" was violated.
December 2020: AWS shipped strong read-after-write for all operations. GCS and Azure Blob have been strongly consistent since launch (~2010). Older code may carry S3-specific workarounds you can now remove.
4.7 Object versioning internals
Versioning is opt-in per bucket in S3. When enabled, each PUT to the same key creates a new version with a unique server-assigned version-id. GET returns the current version; older versions accessible via ?versionId=....
Bucket: my-bucket / Key: user-42/avatar.jpg
v_abc123 ← current (2026-05-22, 12 KB)
v_xyz789 (2026-04-15, 11 KB)
v_def456 (2026-03-01, 10 KB)
All three versions stored as independent objects, each at full storage cost.
DELETE under versioning. A normal DELETE (no version-id) doesn't remove anything — it adds a delete marker as the new current version. GET returns 404; older versions readable explicitly. Permanent purge needs DELETE per version-id or a lifecycle rule.
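The delete-marker behaviour is easy to misread, so here is a toy in-memory model of it. `VersionedBucket` is purely illustrative: version ids are simple counters rather than opaque server-assigned strings, and a KeyError stands in for the 404 response.

```python
import itertools


class VersionedBucket:
    """Toy model: every PUT appends a version; DELETE appends a delete marker."""

    _DELETE_MARKER = object()

    def __init__(self):
        self._versions = {}               # key -> [(version_id, body)], newest last
        self._ids = itertools.count(1)

    def put(self, key, body):
        version_id = f"v{next(self._ids)}"
        self._versions.setdefault(key, []).append((version_id, body))
        return version_id

    def delete(self, key):
        """Plain DELETE: nothing is removed, a marker becomes the current version."""
        return self.put(key, self._DELETE_MARKER)

    def get(self, key, version_id=None):
        versions = self._versions.get(key, [])
        if version_id is not None:
            for vid, body in versions:
                if vid == version_id and body is not self._DELETE_MARKER:
                    return body
            raise KeyError(key)
        if not versions or versions[-1][1] is self._DELETE_MARKER:
            raise KeyError(key)           # the store would answer 404 here
        return versions[-1][1]

    def purge(self, key, version_id):
        """DELETE with an explicit version id: the only call that frees storage."""
        self._versions[key] = [(v, b) for v, b in self._versions.get(key, [])
                               if v != version_id]
```

After `delete`, a plain `get` fails while `get(key, version_id)` still returns the old bytes, and storage is only reclaimed by `purge` (or, in a real bucket, a lifecycle rule).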
Why versioning matters. Three reasons:
- Accidental delete/overwrite recovery. Wrong-source cp → without versioning, file gone; with versioning, GET --version-id rolls back.
- Audit and provenance. "What was this object at 2026-04-12T14:00?" Look up the version active then.
- Concurrency safety. Etag-conditional writes (PutObject with If-Match: <etag>) prevent lost updates when writers race.
Cost implication. A 1 TB bucket with versioning and frequent updates can accumulate 5 TB of old versions in a year. Lifecycle rules purging non-current versions after 30 / 90 / 365 days are standard.
4.8 One concrete PUT operation, byte by byte
Walk a single PUT Object end to end. The example: uploading a 250 MB user video. Same machinery underlies any blob upload (ML model to a registry, Parquet to a data lake, database backup).
1. App server accepts upload, validates auth/size/content-type. Picks multipart (10 × 25 MB) for parallelism.
2. App → S3: CreateMultipartUpload(bucket=videos, key=user-42/clip-987.mp4, ContentType=video/mp4) → UploadId=abc-123. S3 allocates a metadata slot keyed by (bucket, key, upload-id).
3. App slices 250 MB into 10 × 25 MB parts.
4. For each part i in 1..10 (in parallel):
   - UploadPart: HTTP PUT to .../clip-987.mp4?partNumber=i&uploadId=abc-123. Compute Content-MD5; S3 verifies.
   - Over TLS (Transport Layer Security), through S3's frontend (re:Invent says "200+ microservices"), via load balancer to a regional frontend server.
   - Frontend erasure-codes the 25 MB into ~14 fragments of ~2.5 MB each (assuming a (10, 4) scheme: 10 data + 4 parity); distributes across storage nodes in 3 AZs; each fsyncs to NVMe (Non-Volatile Memory Express) or HDD (Hard Disk Drive, for colder tiers).
   - Frontend writes (upload-id, part-number, etag, size, fragment-locations) to the metadata service.
   - 200 OK + ETag: "<hex MD5>". App stores (part-number, etag).
5. App → S3: CompleteMultipartUpload(... Parts=[(1, etag1), ..., (10, etag10)]).
6. S3 atomically assembles. This is a metadata operation, not a data copy: fragments are already durable, and Complete writes an index entry pointing at the ordered parts:
   Index entry:
   bucket=videos, key=user-42/clip-987.mp4, version-id=v_<...>,
   parts=[part-1-loc, ..., part-10-loc], size=250MB,
   etag="<multipart-etag>-10", last-modified=<now>
7. Index write is the durability point. S3 replicates the index entry to a quorum of metadata replicas (consensus internally, similar in spirit to Paxos/Raft). Once acked, success.
8. App receives 200 OK with etag and version-id. Subsequent GET assembles fragments and streams them.
Crash scenarios.
- Network drop on one part. Client retries UploadPart (idempotent per (upload-id, part-number)); new upload replaces staging part.
- Crash between UploadPart 7 and Complete. Restart → ListParts(uploadId) returns parts 1–7 uploaded; resume from 8.
- Crash after step 6 before client ack. Client retries Complete; idempotent if parts list identical.
- Abandoned upload. Parts stay staged until aborted; a lifecycle rule "abort >7 days" auto-cleans.
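The resume logic in the second scenario amounts to a set difference between the planned parts and what the store reports as already staged. A minimal sketch, where `parts_to_upload` is a hypothetical helper and `uploaded_part_numbers` stands for the part numbers returned by a ListParts call:

```python
def parts_to_upload(total_parts: int, uploaded_part_numbers) -> list:
    """Part numbers still missing, in ascending order."""
    done = set(uploaded_part_numbers)
    return [n for n in range(1, total_parts + 1) if n not in done]


remaining = parts_to_upload(10, [1, 2, 3, 4, 5, 6, 7])
```

Using a set rather than "highest part seen" matters because parts upload in parallel and can finish out of order, leaving gaps.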
Durability invariant. 200 OK from Complete means two things: the metadata index entry is replicated to a quorum of metadata replicas, and the data fragments are fsync'd to a quorum of storage nodes across 3 AZs. The 11 nines come from EC + AZ distribution + metadata quorum combined.
§5. Capacity Envelope
What this technology can do, illustrated across very different scales.
Startup — 100 GB. A SaaS with 10k users × 10 MB = 100 GB. ~$2.30/mo on S3 Standard. Direct uploads via pre-signed URL. A few PUTs/min, single-digit GETs/sec. No CDN needed initially. "Set up a bucket, hand out signed URLs, forget about it." Next bottleneck: nothing for years.
Small-to-mid — TB scale, CDN-fronted. A growing image catalog — Instagram's first year, Pinterest's early pins. ~10 TB of images, billions of objects, served via CloudFront / Cloudflare at >95% cache hit. Origin sees thousands of GETs/sec; CDN serves the rest. ~$230/mo storage + bandwidth. Bottleneck: viral event explodes cache miss rate, S3 throttles hot prefixes — fix by hash-prefixing keys.
Mid — Dropbox, Notion, Discord at PB. Tens of TB to multi-PB. Hundreds of PUTs/sec sustained, thousands of GETs/sec to origin. Multiple buckets by region or tenant. Heavy multipart, pre-signed URLs for browser uploads, CDN for reads. $10k-100k/mo.
Large — Netflix, YouTube, Instagram catalogs. Netflix's catalog: single-digit PB masters + tens of PB encoded variants. YouTube stores exabytes. Instagram reported 60 billion photos by 2017 (>200 billion now); Pinterest >100 billion pins. Multi-EB on S3/GCS/proprietary. Millions of GETs/sec aggregate. Custom CDNs (Netflix Open Connect) absorb most reads before they reach the object store.
Hyperscaler — S3 itself. Publicly: over 280 trillion objects and 100 million+ requests/sec in aggregate (2023). Exabytes of data across many regions, multiple AZs each, thousands of storage nodes per AZ.
The range, from 100 GB at a few dollars a month to EB at millions of QPS, spans seven orders of magnitude in stored bytes. The same API and storage class serve a hobby project and the Netflix catalog. What changes with scale: prefix spread, CDN posture, lifecycle strategy. The protocol does not.
§6. Architecture in Context
The canonical pattern for object storage in a production system. Not "the X system" — the shape that recurs across image hosting, video streaming, data lakes, ML pipelines, backup systems.
┌──────────────────────────────┐
│ Browser / mobile client │
│ uploads files, fetches │
│ media, downloads exports │
└──────────┬─────────┬─────────┘
│ │
(1) request signed URL (4) PUT or GET directly
│ │ to object storage
▼ │
┌──────────────────────────────┐
│ Application server │
│ (stateless, horizontal) │
│ - authenticate user │
│ - validate request │
│ - return pre-signed URL │
└────────┬────────────┬────────┘
│ (2) │ (3) write
│ │ metadata row
│ ▼
│ ┌────────────────────────┐
│ │ Transactional DB │
│ │ (Postgres / MySQL) │
│ │ - object_key │
│ │ - uploaded_by_user │
│ │ - mime_type, size │
│ │ - status, moderation │
│ └────────────────────────┘
│
▼ (5) hand out signed URL
┌──────────────────────────────┐
│ CDN (CloudFront, Cloudflare,│
│ Fastly) — pull-through cache│
│ for public/anonymous reads │
└──────────┬───────────────────┘
│ (cache miss)
▼
┌──────────────────────────────┐
│ Object storage │
│ (S3 / GCS / Azure Blob /R2) │
│ │
│ bucket: user-uploads-prod │
│ ├─ a3f7/2026-05-22/... │
│ ├─ b9e2/2026-05-22/... │
│ └─ c1d8/2026-05-22/... │
│ │
│ Lifecycle: │
│ 30d → IA tier │
│ 365d → Glacier │
│ Versioning: ON │
│ Encryption: SSE-KMS │
│ Public access: BLOCKED │
└──────────────────────────────┘
│
▼ (async; for derived workloads)
┌──────────────────────────────┐
│ CDC / event notifications │
│ S3 → SQS / SNS / EventBridge│
│ triggers post-processing: │
│ - virus scan │
│ - thumbnail generation │
│ - transcoding │
│ - data lake ingest │
└──────────────────────────────┘
The object store (right side) holds the bytes. A transactional database holds metadata with the object key as the foreign-key pointer. Around it cluster:
- A CDN in front for public read traffic. Cache hit rates of 90-99% mean origin sees a fraction of total reads.
- A pre-signed URL flow for client-direct uploads and downloads, offloading bandwidth from the application server.
- An event-notification path (S3 Event Notifications → SNS — Simple Notification Service — or Lambda or SQS — Simple Queue Service — or EventBridge) that triggers downstream processing: virus scanning, thumbnail generation, image moderation, transcoding, data lake ingestion.
- A lifecycle policy that ages data into cheaper tiers automatically.
- A versioning + access-control posture that protects against accidental deletion and accidental public exposure.
This shape recurs across:
- A user-generated content app (Instagram, Notion, Slack).
- A data lake (Iceberg / Delta / Hudi tables on S3, queried by Athena / Trino / Spark).
- An ML pipeline (training data in S3, model artifacts in S3, fine-tuning checkpoints in S3).
- A backup system (database snapshots, log archives, encrypted backups).
The metadata-DB-plus-object-storage pattern is the universal answer to "I have lots of big things and want to find them by key and serve them efficiently."
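A minimal sketch of the metadata side of this pattern, using SQLite from the standard library as a stand-in for Postgres or MySQL. The column names follow the diagram; the table name, helper functions, and status values are assumptions made for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE uploads (
        object_key       TEXT PRIMARY KEY,   -- pointer into the bucket
        uploaded_by_user INTEGER NOT NULL,
        mime_type        TEXT NOT NULL,
        size             INTEGER NOT NULL,
        status           TEXT NOT NULL DEFAULT 'pending'
    )
""")


def record_upload(object_key, user_id, mime_type, size):
    """Called when the app hands out a pre-signed upload URL."""
    db.execute(
        "INSERT INTO uploads (object_key, uploaded_by_user, mime_type, size) "
        "VALUES (?, ?, ?, ?)",
        (object_key, user_id, mime_type, size))


def mark_ready(object_key):
    """Called from the event-notification path once the bytes have landed."""
    db.execute("UPDATE uploads SET status = 'ready' WHERE object_key = ?",
               (object_key,))


def ready_keys_for_user(user_id):
    """'Find my files' is a database query, never a bucket LIST."""
    rows = db.execute(
        "SELECT object_key FROM uploads "
        "WHERE uploaded_by_user = ? AND status = 'ready' ORDER BY object_key",
        (user_id,))
    return [row[0] for row in rows]
```

The database answers every "which objects?" question, and the object store is only ever asked for bytes by exact key.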
§7. Hard Problems Inherent to Object Storage
Six fundamental challenges. Each shows up regardless of use case; illustrating examples come from across domains.
7.1 Prefix hot-spotting
One line. Writes concentrate under one prefix, saturating one internal shard and getting throttled.
Where it shows up. A Hive-style data lake writes hourly partitions: year=2026/month=05/day=22/hour=14/part-000-of-10000.parquet. A logging pipeline: logs/2026/05/22/14/<host>.json.gz. A clickstream: events/yyyy/mm/dd/hh/<batch>.avro.
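Why it breaks. Every key starts with the same ~20 characters. S3 partitions the keyspace by key prefix, so these keys all land on one internal shard. At 6000 writes/sec, above the per-prefix write limit, S3 returns 503 SlowDown and the Spark job slows 5x. AWS support says "spread your prefix."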
Fix: hash-prefix. Prepend MD5(filename)[:4]:
s3://lake/a3f7/year=2026/month=05/day=22/hour=14/part-000.parquet
s3://lake/b9e2/year=2026/month=05/day=22/hour=14/part-001.parquet
Four hex chars → 65,536 shards → effectively unbounded bucket throughput. Modern table formats (Iceberg, Delta, Hudi) build hash segments into their paths automatically. S3 also auto-splits long-lived hot prefixes over minutes-to-hours; for burst workloads, hash-prefixing is the only fix.
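The hash-prefix scheme above can be sketched in a few lines. This is an illustration under assumptions: the helper name `hash_prefixed_key` is made up here, and hashing only the filename (rather than the full key) follows the `MD5(filename)[:4]` wording above.

```python
import hashlib

def hash_prefixed_key(key: str, width: int = 4) -> str:
    """Prepend the first `width` hex chars of MD5(filename) to the key."""
    filename = key.rsplit("/", 1)[-1]
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return f"{digest[:width]}/{key}"

key = "year=2026/month=05/day=22/hour=14/part-000.parquet"
print(hash_prefixed_key(key))
```

Because the prefix is derived from the filename, a reader that knows the logical key can recompute the physical key without a lookup table.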
7.2 Cost explosion from listings
One line. Naive ls over a huge tree is slow, expensive, and often the bottleneck.
Where it shows up. A backup-rotation job runs aws s3 ls s3://bucket/backups/ --recursive. A data-lake planner lists all files in a partition. A monitoring script does s3 ls every minute. ListObjectsV2 returns ~1000 keys per call; for 100M keys, 100,000 calls.
Why it breaks. LIST costs $0.005 per 1000 calls → $0.50 per full scan, which is $720/day (about $21,600/mo) if run every minute. Wall time is bounded by call rate: at 100 calls/sec, a scan takes ~17 minutes. Anything waiting on it blocks.
Fix: sideboard index, not LIST.
- DB index. Your transactional DB already has rows for each object key. Query it.
- S3 Inventory. S3 emits daily/weekly CSV/Parquet inventories at ~$0.0025/M objects. Read the inventory, not the live listing.
- Iceberg / Delta / Hudi manifests. Table formats maintain their own file lists.
Principle: LIST is for ad-hoc exploration, not the hot path. If you LIST more than a few times an hour over a non-trivial prefix, you're doing it wrong.
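The per-scan arithmetic above is easy to rerun for other bucket sizes. A small sketch, using the page size, price, and call rate quoted in this section as default parameters (the function name is illustrative):

```python
import math

def list_scan_estimate(num_keys: int,
                       keys_per_call: int = 1000,
                       usd_per_1000_calls: float = 0.005,
                       calls_per_sec: float = 100.0) -> dict:
    """Estimate calls, cost, and wall time for one full LIST scan."""
    calls = math.ceil(num_keys / keys_per_call)
    return {
        "calls": calls,
        "usd": calls / 1000 * usd_per_1000_calls,
        "seconds": calls / calls_per_sec,
    }

est = list_scan_estimate(100_000_000)
print(est["calls"], round(est["usd"], 2), round(est["seconds"] / 60, 1))
```

For 100M keys this reproduces the 100,000 calls, $0.50, and roughly 17 minutes quoted above.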
7.3 Multi-region consistency and latency
One line. Cross-region replication is async, eventually consistent, and adds tens to hundreds of ms.
Where it shows up. A global company stores uploads in us-east-1, serves Europe from a CDN; cache misses pay ~80ms RTT. DR setup: writes in us-east-1 must show up in us-west-2 fast. Compliance copies a data lake across regions.
Why it breaks. S3 CRR is async — after a PUT in us-east-1, us-west-2 lags minutes (15-min SLA with S3 Replication Time Control / RTC). A GET in us-west-2 right after a PUT in us-east-1 may 404. No native multi-master.
Fix patterns.
- Single source of truth, CDN fanout. Writes to one region; CloudFront caches reads globally. Cache misses pay cross-region latency. Default for static assets.
- Active-active with conflict-free naming. Each region writes to its own bucket/prefix; metadata DB knows which region owns each object. Failover replays the pointer.
- S3 MRAP (Multi-Region Access Point). Single endpoint over multi-region replicas; eventual cross-region, last-write-wins for writes. Read-heavy global workloads.
- GCS dual/multi-region buckets. Natively replicated; either region reads/writes with strong consistency, paying ~50-100ms cross-region quorum on writes.
There's no perfect answer; pick which region pays the tax.
7.4 Lifecycle management — getting tiers right
One line. Aging to cheaper tiers is a recurring cost problem; wrong policy means money lost or fast access gone.
Where it shows up. A SaaS keeps everything in Standard at $0.023/GB; 100 TB at $2300/mo, but 90% untouched in 6 months. Backup retains 30 dailies in Standard; only the latest is ever read. ML dataset is hot for 2 weeks, then a cold reference.
Why it breaks. Two ways:
- "Keep everything hot." 1 PB in Standard costs $23k/mo vs ~$1k/mo in Deep Archive, a difference of ~$264k/yr.
- "Move everything to Glacier." A Deep Archive restore takes up to 12 hours; standard retrieval at $0.02/GB means ~$20 to restore 1 TB.
Fix: tiered policy.
0-30 days: S3 Standard ($0.023/GB)
30-90 days: S3 Standard-IA ($0.0125/GB)
90-365 days: S3 Glacier Instant Retrieval
365 days+: S3 Glacier Deep Archive
Intelligent-Tiering automates the decision: S3 monitors access and moves objects between tiers. Small monitoring fee ($0.0025/1000 objects/mo; objects under 128 KB are not monitored, pay no fee, and stay in the frequent-access tier).
Edge cases. Minimum storage durations (IA 30d, Glacier Instant 90d, Deep Archive 180d) are billed even if the object leaves the tier sooner, so putting a 2-day-old object in Glacier can cost more than leaving it in Standard. Lifecycle rules apply by prefix or tag, not per object. Versioning interacts: old versions are purged only by rules that target noncurrent versions.
7.5 Permissions and the layered access model
One line. IAM policies (identity-based), bucket policies (resource-based), ACLs (legacy), and pre-signed URLs (delegated), all layered. Misconfiguration is the most common cloud-data leak.
Where it shows up. A bucket made public for a build script also holds PII (Personally Identifiable Information). A migration grants s3:* to an IAM role; the role is later attached to a vulnerable service. A pre-signed URL leaks into a browser history. The 2019 Capital One breach — SSRF (Server-Side Request Forgery) → EC2 instance role with s3:GetObject → exfiltration.
Why it breaks. Four layers interact:
- IAM policy. Attached to a principal (user, role, group).
- Bucket policy. Attached to the bucket (covers cross-account access too).
- ACLs. Legacy; new buckets default to "Bucket Owner Enforced", which disables them.
- Pre-signed URLs. Grant access to whoever holds the URL.
Effective permissions = union, with deny winning. Public access happens through Principal: * policy, an ACL granting "All Users", or a leaked signed URL.
Fix.
- Account-level Public Access Block. Safety net against human error; overrides any policy/ACL that would grant public access.
- IAM for services, bucket policies for cross-account, no ACLs.
- Short-lived pre-signed URLs. 15 min upload, 1 hr download.
- Least-privilege IAM. Read bucket/users/<user-id>/*, not the whole bucket.
- AWS Access Analyzer. Weekly review of policies granting external access.
- CloudTrail + S3 Access Logs. Alert on large downloads, new IP ranges.
7.6 Egress cost as a tax
One line. Cloud providers charge $0.05-$0.12/GB for bytes leaving — often dwarfing storage cost.
Where it shows up. A media company serves 500 TB/mo of video. Storage for 200 TB catalog: $4,600/mo. Egress at $0.09/GB: $45,000/mo — 10x storage. A SaaS exports 10 TB/mo to on-prem: $900/mo. An ML team copies 100 TB cross-region (us-east-1 → us-west-2): $2,000 per copy at $0.02/GB.
Why it breaks. Egress is the lock-in mechanism. Migrating 1 PB from S3 to GCS = ~$90,000. Asymmetry (free ingress, expensive egress) keeps data in place.
Fix.
- CDN in front. Transfer from S3 to CloudFront is free; CloudFront egress runs ~$0.085/GB before volume discounts. A 95% cache hit ratio cuts origin fetches from S3 20x.
- R2 / Backblaze / Cloudflare bandwidth alliance. If your business is bandwidth (public video, dataset distribution, AI model hosting), R2's $0/GB egress is a 10-100x cost reduction.
- VPC endpoints / private link. Same-region EC2 → S3 via VPC Endpoint doesn't traverse the internet; egress free.
- Compute close to data. Query / train / analyze in the bucket's region.
- Negotiate. Large customers (>$100k/mo egress) get custom discounts.
The egress tax is structural; awareness changes which provider, region, and CDN you pick.
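The egress numbers in this section can be rerun with a pair of small helpers. This is a sketch: the function names are illustrative, and the per-GB rate is passed in as a parameter because it varies by provider, region, and volume.

```python
def egress_cost_usd(gb: float, usd_per_gb: float) -> float:
    """Cost of moving `gb` gigabytes out at a flat per-GB rate."""
    return gb * usd_per_gb

def origin_gb(total_gb: float, cache_hit_ratio: float) -> float:
    """Gigabytes still fetched from the bucket after CDN caching."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return total_gb * (1.0 - cache_hit_ratio)

monthly_gb = 500_000  # the 500 TB/mo media example above
print(round(egress_cost_usd(monthly_gb, 0.09)))  # direct egress at $0.09/GB
print(round(origin_gb(monthly_gb, 0.95)))        # origin traffic at a 95% hit ratio
```

At a 95% hit ratio only 25 TB of the 500 TB reaches the bucket, which is the 20x reduction in origin traffic described above.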
§8. Failure Mode Walkthrough
Object storage fails. The mechanisms differ from a database failure, but the recovery story is similar in shape: durability points, replay from logs, rollback to a prior version.
8.1 Region outage
Scenario. The Feb 28, 2017 S3 us-east-1 outage. An engineer running a routine debug command typed a parameter wrong; an unintended subset of S3 subsystems was taken offline and had to be restarted; cascading dependencies brought down the S3 metadata service in us-east-1 for ~4 hours. Many AWS services (which themselves depended on us-east-1 S3 for their own data) also went down, including the AWS service health dashboard, which couldn't update because it depended on us-east-1 S3.
What happens. Every GET and PUT to S3 us-east-1 returns 5xx. Anything in your application that depends on S3 fails. CDN cache hits keep serving (cache survives origin loss); cache misses fail.
Recovery.
- For reads: CDN cache reduces blast radius. If your assets are warmed in a CDN, most users see no impact.
- For writes: Buffer writes locally; queue for retry when S3 returns. Application servers stash uploads on local EBS / EFS / Redis and replay when S3 is healthy.
- Multi-region preemptively: Cross-Region Replication or dual-region buckets, with the application capable of failing over. The 2017 outage motivated many teams to add multi-region replication.
Durability point. Even during the outage, no data was lost — the 11-nines durability is per object, independent of API availability. After the outage, all objects were readable again. The outage was an availability event, not a durability event.
8.2 Prefix hot-spot throttling
Scenario. A nightly batch job writes 50,000 Parquet files per hour into s3://lake/year=2026/month=05/day=22/. Tasks commit in bursts; at a peak of 8000 writes/sec, S3 returns 503 SlowDown on ~30% of the writes.
What happens. The job's exception count grows; the Spark / Glue / Hadoop client backs off, retries, eventually completes 2x slower or fails partway.
Recovery during the run. Exponential backoff with jitter. The S3 SDK's default retry strategy handles this — but the job's wall time grows. You hit the lateness SLA and the downstream analytics jobs starve.
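Exponential backoff with jitter, as mentioned above, looks roughly like this when written by hand. This is a sketch of the "full jitter" variant, not the SDK's actual retry code; `SlowDown` is a hypothetical exception standing in for a 503 SlowDown response, and `sleep` and `rng` are injectable only to make the logic testable.

```python
import random
import time

class SlowDown(Exception):
    """Stand-in for a 503 SlowDown response (hypothetical exception type)."""

def retry_with_jitter(op, max_attempts=6, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call op(); on SlowDown, wait a random time up to a doubling ceiling."""
    for attempt in range(max_attempts):
        try:
            return op()
        except SlowDown:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttle to the caller
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng() * ceiling)  # full jitter: uniform in [0, ceiling)

# Demo: an operation that is throttled twice, then succeeds.
attempts = {"n": 0}

def flaky_put():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SlowDown()
    return "ok"

print(retry_with_jitter(flaky_put, sleep=lambda s: None))
```

The jitter matters because thousands of Spark tasks retrying on the same schedule would otherwise hit the throttled prefix again in lockstep.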
Real fix. Hash-prefix the keys (§7.1). For an in-progress job, work-around: split the write into multiple prefixes manually (bucket/p1/..., bucket/p2/..., ...). For ongoing fix: refactor the partitioning scheme.
8.3 Lifecycle policy that deleted data
Scenario. An engineer writes a lifecycle policy intended to clean up temporary upload staging: "delete objects in tmp/ prefix after 30 days." A misconfiguration applies the rule to the whole bucket or to a wrong prefix.
What happens. Thirty days after the policy is applied, S3 silently deletes all matching objects. No warning. The discovery is usually "users complaining their data is gone" days or weeks later.
Recovery.
- If versioning was on: Each "delete" was a delete-marker; the underlying object versions are intact. Restoration: enumerate the delete markers, delete them (which restores the previous version as current). Tedious for many keys but recoverable.
- If versioning was off: Data is gone. Restoration only possible from backups (cross-account, cross-bucket, or off-cloud).
- MFA Delete on bucket: If enabled, no lifecycle policy can permanently delete versions; an MFA-authenticated user must approve. Strong protection against this category.
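The "enumerate the delete markers, delete them" step can be sketched as a pure function over one page of a ListObjectVersions response. Assumptions: the input dict is shaped like the response boto3's `list_object_versions` returns (`DeleteMarkers` entries with `Key`, `VersionId`, `IsLatest`), and the output is shaped for the `Objects` list of a DeleteObjects request. No network call is made here, and the helper name is illustrative.

```python
def delete_marker_removals(versions_page: dict) -> list[dict]:
    """Pick the delete markers whose removal restores the previous version.

    Only markers that are the latest entry for their key are returned;
    older markers sit behind a newer version and hide nothing.
    """
    return [
        {"Key": marker["Key"], "VersionId": marker["VersionId"]}
        for marker in versions_page.get("DeleteMarkers", [])
        if marker.get("IsLatest")
    ]

# Hypothetical page with one live delete marker and one stale one.
page = {
    "Versions": [{"Key": "a.txt", "VersionId": "v1", "IsLatest": False}],
    "DeleteMarkers": [
        {"Key": "a.txt", "VersionId": "m1", "IsLatest": True},
        {"Key": "b.txt", "VersionId": "m0", "IsLatest": False},
    ],
}
print(delete_marker_removals(page))
```

In a real restore the result would be batched (DeleteObjects accepts up to 1000 keys per request) and run page by page, ideally after a review of the list.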
Prevention.
- Always enable versioning on important buckets. The 5-50% storage cost increase pales next to one accidental delete event.
- Lifecycle policies in IaC (Infrastructure as Code) with code review. No console-edited lifecycle rules.
- Test lifecycle rules in dry-run / staging buckets first.
8.4 Accidental public bucket (the "S3 leak" archetype)
Scenario. A bucket containing customer PII is misconfigured with public read. Discovered by a security researcher running an S3 bucket enumerator; reported publicly; brand damage.
What happens. Anyone on the internet can GET any object in the bucket. Personally Identifiable Information, credentials, internal documents — whatever's in the bucket. Detection of who's accessed it is via S3 Access Logs (if enabled) or CloudTrail Data Events (if enabled), often after the fact.
Recovery.
- Immediate remediation: Disable public access on the bucket. Verify with aws s3api get-bucket-policy-status and Public Access Block.
- Rotate any credentials that may have been exposed.
- Notify affected users per GDPR / CCPA / state breach laws (jurisdiction-specific timelines).
- Audit logs for what was accessed; the access logs (if enabled) are critical evidence.
Prevention.
- Account-level Public Access Block. Blocks any bucket-level public access regardless of policy. A safety net for human error.
- Periodic audits (AWS Access Analyzer, third-party tools like Spectral / Wiz / Lacework) for buckets with public access.
- Tagging buckets with data-classification tags (Tier=PII, Tier=Public). Alerts on PII buckets becoming public.
8.5 Object overwrite without versioning
Scenario. A user uploads a new version of their avatar; the application PUTs to the same key. Two minutes later the user reports the upload failed and asks for the previous avatar back. Without versioning, the old image is gone.
Recovery.
- With versioning: aws s3api list-object-versions --prefix users/<user-id>/avatar.jpg returns both versions; restore the previous one.
- Without versioning: Backup or nothing.
Prevention. Enable versioning on user-content buckets.
8.6 ETag mismatch on in-flight uploads
Scenario. A large file is being uploaded via multipart when the application crashes mid-upload. On restart, a different client (maybe a retry by the same user) starts a new multipart upload for the same key. If both uploads are eventually completed, the visible object is whichever completed last. If the first is never completed or aborted, its staged parts pile up indefinitely and keep accruing storage charges.
Recovery.
- Detect orphan multiparts: ListMultipartUploads per bucket. Anything older than a few days that's not actively progressing is orphan.
- Lifecycle rule: "Abort multipart uploads incomplete after 7 days." Standard hygiene.
Prevention. Track the S3-issued UploadId per user session and per file version; don't share it across clients. Use S3 Object Lock if accidental overwrites are a high-risk concern (financial records, compliance documents): Object Lock makes an object version immutable for a retention period.
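Orphan detection reduces to a filter over the ListMultipartUploads response. A sketch under assumptions: the input dict mirrors the shape boto3's `list_multipart_uploads` returns (`Uploads` entries with `Key`, `UploadId`, and a timezone-aware `Initiated` datetime), and the 7-day threshold matches the lifecycle rule above. The helper name is illustrative.

```python
from datetime import datetime, timedelta, timezone

def stale_uploads(uploads_page: dict, now: datetime,
                  max_age: timedelta = timedelta(days=7)) -> list[tuple[str, str]]:
    """Return (key, upload_id) for multipart uploads started more than max_age ago."""
    return [
        (upload["Key"], upload["UploadId"])
        for upload in uploads_page.get("Uploads", [])
        if now - upload["Initiated"] > max_age
    ]

# Hypothetical page: one upload from three weeks ago, one from an hour ago.
now = datetime(2026, 5, 22, 12, 0, tzinfo=timezone.utc)
page = {
    "Uploads": [
        {"Key": "video/a.mp4", "UploadId": "old-1",
         "Initiated": now - timedelta(days=21)},
        {"Key": "video/b.mp4", "UploadId": "new-1",
         "Initiated": now - timedelta(hours=1)},
    ]
}
print(stale_uploads(page, now))
```

Each returned pair is a candidate for AbortMultipartUpload. The lifecycle rule does the same thing without code, so this is mainly useful for a one-off audit.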
§9. Why Not Just Store Files on a Filesystem
The obvious naive replacement for object storage is "a Linux box with a big disk." Let's walk through why this breaks.
Setup. A startup with user uploads. They set up an EC2 instance with a 10 TB EBS volume, mount it at /data, and cp uploaded files there. The application serves files via nginx static file serving.
Failure mode 1: single point of failure. The EC2 instance dies. The EBS volume is intact (separate failure domain) but no machine is serving it. Recovery: launch a new instance, attach the volume, restart nginx. Downtime: 5-30 minutes if you have automation, hours if you don't.
Failure mode 2: durability. EBS durability depends on the volume type: io2 is designed for 99.999% annual durability (five nines), gp3 for 99.8-99.9%. Even five nines is 1,000,000x worse than S3's eleven nines. Running 1000 gp3 volumes for 10 years at a 0.1-0.2% annual failure rate, expect to lose 10-20 volumes. Lost volume = lost data (unless you have backups, which you must set up yourself).
Failure mode 3: capacity. Disk fills up. Adding more EBS requires expanding the volume (no downtime, modern EBS) and growing the filesystem, but eventually you hit the per-volume limit (~16 TB per gp3 volume; ~64 TB per io2). At that point you need multiple volumes, and now you're sharding by hand — assigning users to volumes, knowing which volume holds which file, balancing load.
Failure mode 4: replication. EBS is single-AZ. To survive an AZ failure, you need cross-AZ replication. To survive a region failure, cross-region. Building this yourself = rsync cronjob to a different AZ; lag is whatever cron interval you set; consistency is "whatever rsync did." S3 gives you this by default, RPO=0 across AZs.
Failure mode 5: scaling reads. Serving 1000 reads/sec from one nginx box on EBS works. 100,000 reads/sec needs many boxes, all reading the same files. Now you need a shared filesystem (NFS, EFS) or copy files everywhere. Either way, latency goes up, complexity goes up, cost goes up.
Failure mode 6: scaling writes. Writes go to one node (the owner of the EBS volume). To scale, you shard — assign files to volumes. Now your application logic includes "look up which volume owns this file," and rebalancing volumes is an operational nightmare.
Failure mode 7: cost. EBS gp3 is $0.08/GB/mo. S3 Standard is $0.023/GB/mo. For 100 TB, EBS costs $8,000/mo vs S3 $2,300/mo — 3.5x. And EBS doesn't auto-tier; cold data costs the same as hot.
Failure mode 8: serving the world. Once your customers are global, you need a CDN. The CDN expects to read from a public-internet endpoint with proper headers. Wiring an internal EBS-backed nginx to a CDN involves an SSL cert, a public IP, scaling, and rate-limiting — work S3 + CloudFront does for free.
The cumulative case: a filesystem on a server works for ~10 GB and one host. It does not scale to a serious workload. Every team that tried (Dropbox is the famous example) eventually built their own object storage anyway. Magic Pocket, Dropbox's internal object storage, replaced their use of S3 (~600 PB at the time of public talks) with a custom 11-nines distributed object store — they didn't "go back to filesystems," they built S3 themselves.
§10. Scaling Axes
How object storage scales.
Type 1: more objects, more clients — auto-scales
The growth pattern. A startup at 100 GB, growing to 100 TB over five years; user base from 10k to 10M; request rate from 100 QPS to 100k QPS.
How object storage handles it. Without action. S3 / GCS / Azure Blob auto-scale: more objects → more internal shards (S3 auto-splits prefixes); more requests → load balanced across more frontend nodes; more storage → more underlying storage capacity allocated. The user doesn't see the seams.
What might bite anyway.
- Per-prefix throughput limits. Documented limits (3500 PUT/COPY/POST/DELETE and 5500 GET/HEAD per second) apply per prefix, not per bucket. Sustained, well-distributed workloads can hit 100k+ QPS bucket-wide.
- List operations. As object count grows, listing time grows linearly. Switch to inventory or table-format manifests.
- Cost-of-LIST and cost-of-PUT. At billions of operations, the per-call costs add up. Batch operations where possible.
Inflection points. None for storage capacity (effectively unbounded). For QPS, the inflection is "we sustained 5000 QPS on one prefix and got throttled" — hash-prefix the keys. For listings, "the LIST is taking longer than the actual processing" — switch to a sideboard index.
Type 2: one prefix hot — hash prefix
The growth pattern. A logging pipeline writes app logs to logs/2026/05/22/14/. The hour-of-day boundary is the hot prefix; every hour, 100% of writes hit one prefix; the rest of the bucket is cold.
How object storage tries to handle it. S3's auto-split kicks in for sustained hot prefixes, but lag is measured in minutes-to-hours. For bursty workloads (every hour-boundary creates a new prefix), auto-split never has time to catch up.
Real fix. Hash-prefix. Discussed in §4.3 and §7.1.
Inflection point. When sustained write rate per prefix exceeds ~3000/sec, hash-prefixing becomes mandatory. Write bursts above the documented per-prefix limit get throttled.
Type 3: from a single bucket to many
The growth pattern. Your application has one bucket; lifecycle and access patterns are mixed (hot user uploads, cold backups, multi-tenant data). Bucket policies and lifecycle rules become unwieldy.
Fix. Split into many buckets by access pattern: user-uploads-prod, backups-prod, analytics-data-lake-prod, model-artifacts-prod. Each bucket gets its own lifecycle, encryption settings, access controls, tags. Cleaner cost attribution (tags per bucket); cleaner audit; easier permissions.
Limit. The long-standing AWS default was 100 buckets per account (raisable to 1000); AWS raised the default quota to 10,000 in late 2024. Names are globally unique.
Type 4: from one region to many
The growth pattern. Latency-sensitive global users; compliance (data residency); disaster recovery.
Fix options.
- CRR (Cross-Region Replication). Async copy of objects from source to one or more destination buckets. Standard for DR. Latency from source to destination: minutes (RTC tier promises 15-minute SLA).
- Multi-Region Access Point (MRAP). Single endpoint over multiple regional buckets. Used for read-heavy globally-distributed access patterns.
- GCS dual-region / multi-region buckets. Natively replicated; bucket lives in two regions or a multi-region geography. Reads/writes either region with strong consistency at the cost of cross-region latency on writes.
Inflection point. When cross-region GET latency from CDN cache miss > acceptable user latency (50-100ms typical for media-rich apps); or when compliance requires data residency.
§11. Decision Matrix: Object Storage vs Adjacent Categories
Side-by-side comparison along named dimensions.
| Dimension | Object storage | Block storage (EBS) | File storage (EFS / NFS) | CDN edge cache | Database BLOB column |
|---|---|---|---|---|---|
| Access unit | Whole object | 4 KB block | Bytes within file | URL (with HTTP semantics) | Row/column |
| Random read inside item | Range GET (slow) | Native (~100µs) | Native | Range GET | Native (slow at scale) |
| Random write inside item | No (re-upload) | Native | Native | N/A (read-only) | Native |
| Concurrency | Strong consistency | Single attacher | Multi-host with locks | Read-only | ACID transactions |
| Latency p50 (intra-region) | 30-80ms | 100-500µs | 1-5ms | 5-20ms (edge POP) | 1-10ms |
| Throughput per item | Multi-Gbps | 1-10 Gbps | 10s of Gbps | High (per-edge) | Limited by row I/O |
| Durability | 11 9s | 3-5 9s (gp3 99.8-99.9%, io2 99.999%) | 11 9s | none (ephemeral) | depends on DB |
| Capacity per item | 5 TB | 64 TB (volume) | unbounded (per file: huge) | N/A | typically <1 GB/cell |
| Total capacity | Unbounded | Per-volume cap | PB | TB-PB per POP | TB-PB per shard |
| Cost (USD/GB/mo) | $0.023 | $0.08 | $0.30 | varies | DB-storage rate (>$0.10) |
| Egress | Expensive | Inter-AZ free | Inter-AZ free | CDN egress | inside-DB free |
| Typical use | Media, lake, model | DB data files, OS disk | Shared config, HPC scratch | Public asset serving | Tiny embedded blobs |
When to pick which:
- Object storage: the artifact is big (>10 KB), accessed as a whole (not by partial mutation), durability matters, and you can tolerate 50-150ms latency. Default choice for: user uploads, video / media, ML models, training data, data lake files, backups, log archives.
- Block storage: the artifact is small (4 KB blocks), accessed with random reads and writes, attached to one host. Default for: database data files, OS root volumes, application caches that need fsync.
- File storage: you need POSIX semantics (open / read / write / lock) shared across many hosts. Default for: HPC clusters, legacy applications, shared configuration that updates frequently.
- CDN edge cache: you serve the same large item to many readers worldwide. Pair with object storage as origin.
- Database BLOB column: the blob is tiny (<10 KB) and lives logically inside a row that you query frequently. Even then, prefer storing a URL in the row and the blob in object storage; database buffer pool is too valuable for blobs.
Concrete thresholds.
- 5 KB blob attached to 1M user rows: BLOB column might be ok.
- 50 KB image, served 1B times/yr: object storage + CDN.
- 5 MB document, attached to a user: object storage.
- 5 GB ML model file: object storage, definitely.
- 10 GB / hour log stream from one server: object storage with batching (don't write one log line per PUT).
- 50 KB random reads at 100k QPS: a cache (Redis / Memcached / DragonflyDB), not object storage.
§12. Storage Tiers and Economics
Object storage charges by storage class. Picking the right one across an object's lifecycle is one of the biggest cost levers.
S3 storage classes (US-East-1 reference prices, 2024)
| Class | $/GB/mo | Retrieval cost | Min storage | First-byte latency | Use case |
|---|---|---|---|---|---|
| S3 Standard | $0.023 | none (pay per GET) | none | ms | Hot, frequently accessed |
| S3 Intelligent-Tiering | varies | none (auto-moves) | none | ms | Unknown / changing access patterns |
| S3 Standard-IA (Infreq Access) | $0.0125 | $0.01/GB | 30 days | ms | Backup, less-frequent access |
| S3 One Zone-IA | $0.01 | $0.01/GB | 30 days | ms | Recreatable, single-AZ ok |
| S3 Glacier Instant Retrieval | $0.004 | $0.03/GB | 90 days | ms | Quarterly accessed |
| S3 Glacier Flexible Retrieval | $0.0036 | $0.01-$0.10/GB | 90 days | min-hours | Annual backups, deep archive |
| S3 Glacier Deep Archive | $0.00099 | $0.0025-$0.02/GB | 180 days | 12-48 hours | True cold archive, compliance |
The economics.
- S3 Standard at $0.023/GB/mo is the default hot tier. Used for anything actively served.
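- Standard-IA at $0.0125 roughly halves storage cost but charges $0.01 per GB retrieved. The raw break-even is about one full read per GB per month ($0.0105 saved vs $0.01 per GB retrieved); staying under ~0.5x leaves margin for request fees and the 30-day minimum. Higher and the retrieval fees eat the savings.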
- One Zone-IA at $0.01 is 20% cheaper than IA, single-AZ. Loses an AZ → data gone. Only use for recreatable data (transcoded variants from a master, derived analytics) where loss is recoverable.
- Glacier Instant at $0.004 is ~6x cheaper than Standard, with millisecond latency. Designed for "I need fast access but only rarely." Quarterly compliance reports, archived user data with rare access.
- Glacier Flexible at $0.0036 has minutes-to-hours retrieval. ~6x cheaper than Standard. For monthly-or-rarer access patterns. Often used for backups.
- Glacier Deep Archive at $0.00099 is 23x cheaper than Standard but 12-hour retrieval. For pure archive, compliance retention (7-year financial records, decades-long medical), data you must keep but really hope not to read.
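The Standard vs Standard-IA trade-off above comes down to a break-even read rate. A sketch using the per-GB prices from the table; it deliberately ignores per-request fees and minimum-duration charges, so treat the result as an upper bound. Function names are illustrative.

```python
def monthly_cost_per_gb(storage_usd: float, retrieval_usd: float,
                        reads_per_month: float) -> float:
    """Storage plus retrieval cost for one GB read `reads_per_month` times."""
    return storage_usd + retrieval_usd * reads_per_month

def break_even_reads(hot_storage: float, cold_storage: float,
                     cold_retrieval: float) -> float:
    """Full reads per GB per month at which the cold tier stops being cheaper."""
    return (hot_storage - cold_storage) / cold_retrieval

# Standard ($0.023, no retrieval fee) vs Standard-IA ($0.0125 + $0.01/GB retrieved)
print(round(break_even_reads(0.023, 0.0125, 0.01), 2))
print(monthly_cost_per_gb(0.0125, 0.01, 0.5) < monthly_cost_per_gb(0.023, 0.0, 0.5))
```

The raw break-even lands near one full read per GB per month, so a working threshold of half that leaves headroom for the fees this sketch leaves out.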
Lifecycle policies
A lifecycle policy is a declarative description of when to move objects between tiers and when to delete them. Example:
{
"Rules": [{
"ID": "user-uploads-aging",
"Filter": { "Prefix": "user-uploads/" },
"Status": "Enabled",
"Transitions": [
{ "Days": 30, "StorageClass": "STANDARD_IA" },
{ "Days": 90, "StorageClass": "GLACIER_IR" },
{ "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
],
"Expiration": { "Days": 2555 }
}]
}
This rule: objects in the user-uploads/ prefix move to IA after 30 days, Glacier Instant after 90, Deep Archive after 365, deleted after 7 years.
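To sanity-check a rule like this before applying it, one option is to simulate which class an object of a given age would be in. A minimal sketch, assuming objects start in STANDARD and that transitions take effect exactly at the day threshold (S3 applies lifecycle actions asynchronously, so real timing can lag). `storage_class_at` and the `"EXPIRED"` label are illustrative names.

```python
import json

RULE_JSON = """
{
  "ID": "user-uploads-aging",
  "Filter": { "Prefix": "user-uploads/" },
  "Status": "Enabled",
  "Transitions": [
    { "Days": 30, "StorageClass": "STANDARD_IA" },
    { "Days": 90, "StorageClass": "GLACIER_IR" },
    { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
  ],
  "Expiration": { "Days": 2555 }
}
"""

def storage_class_at(rule: dict, age_days: int) -> str:
    """Return the storage class an object of this age would have under the rule."""
    expiration = rule.get("Expiration", {}).get("Days")
    if expiration is not None and age_days >= expiration:
        return "EXPIRED"
    current = "STANDARD"  # assumed starting class
    for transition in sorted(rule.get("Transitions", []), key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current

rule = json.loads(RULE_JSON)
for age in (10, 45, 120, 400, 3000):
    print(age, storage_class_at(rule, age))
```

Running a few representative ages through the rule in code review is a cheap guard against the misapplied-rule failure described in §8.3.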
Caveats.
- Transitions cost money. Each transition is a per-object operation; for 1B objects, transition cost is non-trivial.
- Min storage duration. Moving an object that's only 10 days old to IA charges the 30-day minimum anyway.
- Lifecycle is per bucket, on prefixes or tags. Granular control requires bucket-or-tag layout.
The "we keep everything hot" cost disaster
A common pattern: a startup grows from 100 GB to 1 PB over five years. Everything sits in S3 Standard because "we might need it." Storage cost: 1,000,000 GB × $0.023 = $23,000/mo = $276k/yr.
If 90% of it is accessed less than once a year:
100 TB hot in Standard: 100,000 × $0.023 = $2,300/mo
900 TB in Glacier Flexible: 900,000 × $0.0036 = $3,240/mo
Total: $5,540/mo (down from $23,000/mo)
Savings: $17,460/mo = $209k/yr
A few hours of lifecycle policy authoring, $200k/year saved. This is the single biggest object-storage cost lever.
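The arithmetic above generalizes to any split across classes. A short sketch with the two prices used in this example; the dictionary keys are labels chosen here, not API identifiers.

```python
def monthly_storage_usd(gb_by_class: dict, usd_per_gb: dict) -> float:
    """Sum the monthly storage bill across storage classes."""
    return sum(gb * usd_per_gb[cls] for cls, gb in gb_by_class.items())

PRICES = {"standard": 0.023, "glacier_flexible": 0.0036}

all_hot = monthly_storage_usd({"standard": 1_000_000}, PRICES)
tiered = monthly_storage_usd(
    {"standard": 100_000, "glacier_flexible": 900_000}, PRICES
)
print(round(all_hot), round(tiered), round(all_hot - tiered), round((all_hot - tiered) * 12))
```

Retrieval and transition fees are left out, so this is the storage line item only.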
Intelligent-Tiering when access patterns are unknown
S3 Intelligent-Tiering automates the tier decision: S3 monitors access per object and auto-moves between Frequent → Infrequent → Archive Instant → Archive → Deep Archive Access tiers. Monitoring fee: $0.0025 per 1000 objects per month (objects under 128 KB are not monitored, pay no fee, and stay in the Frequent Access tier).
When to use: you have many objects with unpredictable access (user uploads where some go viral and others sit forever; a data lake where most files are read once on ingest and never again, but a few become "reference" files). Not worth it: you know your access pattern (manual policy is cheaper).
§13. Security
Object storage has been the highest-profile data leak vector in the cloud era. Getting security right requires understanding multiple overlapping mechanisms.
13.1 IAM, bucket policies, ACLs, pre-signed URLs
The four layers, recapped from §7.5:
- IAM (Identity and Access Management). Identity-based. "This user/role can do these actions on these resources." Lives in the IAM service, attached to users/roles/groups.
- Bucket policy. Resource-based. "These principals (identified by IAM ARN or AWS account ID) can do these actions on this bucket." Lives on the bucket.
- ACLs (Access Control Lists). Legacy per-object or per-bucket grants. Deprecated by default in new buckets (Bucket Owner Enforced).
- Pre-signed URLs. Time-limited, signed grants embedded in a URL.
Evaluation logic. A request is allowed if any of (IAM policy, bucket policy, ACL) explicitly allows AND no policy explicitly denies, with deny taking precedence. Public access is also gated by the Block Public Access settings (account + bucket level).
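The evaluation rule can be written as a three-line function. This is a deliberately simplified model of the logic described above: it covers only a same-account request and ignores Block Public Access, service control policies, and permission boundaries. The `"Allow"` / `"Deny"` / `None` encoding is an assumption made for the sketch.

```python
from typing import Iterable, Optional

def is_allowed(decisions: Iterable[Optional[str]]) -> bool:
    """Combine per-layer results: explicit deny wins, then any allow, else deny.

    Each element is "Allow", "Deny", or None (the layer says nothing).
    """
    results = list(decisions)
    if "Deny" in results:
        return False           # an explicit deny overrides every allow
    return "Allow" in results  # otherwise one allow is enough; default is deny

# IAM policy allows, bucket policy is silent, ACL is silent
print(is_allowed(["Allow", None, None]))
# IAM policy allows, bucket policy explicitly denies
print(is_allowed(["Allow", "Deny", None]))
# nobody says anything
print(is_allowed([None, None, None]))
```

The third case is the default-deny stance: silence from every layer is a rejection.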
Best practice posture.
- Default-deny stance. Nothing is allowed unless explicitly granted.
- Use IAM for service-to-service auth. Application roles attached to EC2 / ECS (Elastic Container Service) / Lambda; IAM policies grant exactly the actions needed.
- Use bucket policies for cross-account access. Sharing data between AWS accounts; specify accounts in the bucket policy.
- Don't use ACLs. Disable them via Bucket Owner Enforced.
- Account-level Block Public Access. Default on; prevents any policy from making a bucket public.
13.2 Pre-signed URLs and offloading auth
Pre-signed URLs (§4.5) are the workhorse for client-direct upload/download. The application server authenticates the user, decides "yes, this user can upload to users/<id>/avatar.jpg," generates the signed URL, returns it. The client PUTs directly to S3. Auth happens once at URL generation; S3 verifies the signature on receipt.
Properties.
- The URL is the credential. Anyone with the URL can perform the action until expiry.
- Time-limited. Typical: 15 minutes for upload, 1 hour for download. Don't make it longer than necessary.
- Specific to operation and resource. A GET-signed URL can't be used for PUT; a URL for key1 can't access key2.
- Headers can be pinned. Pin Content-Type, Content-MD5 to prevent the client from uploading something else.
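These properties fall out of how a signed URL is built: the signature covers the method, the key, and the expiry. The toy scheme below illustrates that idea with a plain HMAC. It is not AWS Signature Version 4, and the secret, URL layout, and function names are all made up for the sketch.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"demo-secret"  # hypothetical server-side signing key

def sign(method: str, key: str, expires_at: int, secret: bytes = SECRET) -> str:
    """HMAC over method, object key, and expiry: change any one and it breaks."""
    message = f"{method}\n{key}\n{expires_at}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

def presign(method: str, key: str, ttl_seconds: int, now=None,
            secret: bytes = SECRET) -> str:
    """Build a URL path carrying the expiry and signature as query parameters."""
    now = int(time.time()) if now is None else now
    expires_at = now + ttl_seconds
    query = urlencode({"expires": expires_at,
                       "sig": sign(method, key, expires_at, secret)})
    return f"/{key}?{query}"

def verify(method: str, key: str, expires_at: int, sig: str, now=None,
           secret: bytes = SECRET) -> bool:
    """Server-side check: not expired, and signature matches this exact request."""
    now = int(time.time()) if now is None else now
    if now > expires_at:
        return False
    return hmac.compare_digest(sig, sign(method, key, expires_at, secret))

print(presign("GET", "users/42/avatar.jpg", ttl_seconds=3600, now=1_000))
```

A signature minted for GET on one key fails verification for PUT, for any other key, and for any time after expiry, which is why the URL can be handed to an untrusted client.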
13.3 Server-side encryption
Three variants of S3 server-side encryption:
- SSE-S3 (default). S3 manages keys. Each object is encrypted at-rest with a unique data key, which is encrypted by an S3-managed master key. Transparent: client just PUTs, S3 encrypts on receipt, decrypts on GET. No code change needed. Sufficient for "we want at-rest encryption for compliance" without managing keys.
- SSE-KMS (KMS-managed). S3 uses customer-controlled CMKs (Customer Master Keys) in AWS KMS (Key Management Service). Each object encrypted with a unique data key, the data key encrypted by the KMS CMK. Properties: audit trail in CloudTrail (every encryption/decryption logged), KMS access policies (revoke key access to lock out reads), key rotation. Trade-off: KMS API calls cost (~$0.03/10000 calls), so heavy-read workloads pay a noticeable surcharge.
- SSE-C (Customer-provided). Client provides the encryption key with each request. S3 uses it to encrypt at receipt and discards. To read, client provides the same key. Properties: keys never stored by AWS (they're discarded after use), but client must manage them — losing the key loses the data permanently. Niche; rarely used.
13.4 Client-side encryption
The strictest posture: encrypt before uploading. The provider sees only ciphertext.
Client: data → AES-256(data, client_key) → ciphertext
PUT ciphertext to S3
S3: stores ciphertext (which is also SSE-encrypted at rest)
Client: GET ciphertext from S3
decrypt with client_key
When to use. When the provider must not be able to read the data — extreme regulatory environments (some healthcare, some finance), or paranoia. Backup vendors (Borg, Restic, Tarsnap) do this for customer backups going to S3. The provider sees encrypted blobs only.
Trade-off. No server-side processing (S3 Select, Athena, etc.) on encrypted data. Key management is on you; losing the key loses everything.
13.5 VPC endpoints / Private Link
By default, S3 traffic from EC2 goes to S3's public endpoints, which requires an internet gateway or NAT (over TLS, but still public-routed). With a VPC Endpoint for S3, traffic takes a private route inside AWS's backbone:
EC2 in VPC → VPC Gateway Endpoint for S3 → S3 in same region
(no internet gateway, no public IP, no NAT)
Properties.
- Network isolation. Even if an attacker had your IAM creds, they couldn't reach S3 from outside your VPC (assuming the bucket policy restricts to your VPC endpoint).
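- No egress data transfer fees. S3 access via a Gateway Endpoint is free within the same region and avoids NAT gateway processing charges; Interface Endpoints (PrivateLink) bill hourly and per GB.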
- Lower latency (~5-15ms saved by not going through internet routing).
- Compliance for environments that mandate "no public internet traffic."
VPC endpoints are essentially mandatory in production for security-conscious deployments.
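The network-isolation property depends on the bucket policy actually restricting access to the endpoint. A minimal sketch of such a policy follows, built as a Python dict and printed as JSON. The bucket name and endpoint ID are placeholders, and the helper name is hypothetical; the aws:SourceVpce condition key is the one AWS documents for this purpose.

```python
import json


def vpce_only_policy(bucket: str, vpce_id: str) -> dict:
    """Build a bucket policy that denies any request not arriving via one VPC endpoint.

    Caution: a blanket Deny like this also blocks console and other access
    from outside the VPC, so test it on a non-critical bucket first.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAccessOutsideVpce",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # Requests that did not come through this endpoint are denied.
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder bucket name and endpoint ID.
    print(json.dumps(vpce_only_policy("example-bucket", "vpce-1a2b3c4d"), indent=2))
```

With a policy of this shape in place, leaked IAM credentials used from outside the VPC are rejected at the bucket even though the credentials themselves are valid.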
13.6 Object Lock and WORM (Write-Once-Read-Many) compliance
Some regulations require data immutability: financial records (SEC Rule 17a-4), legal-hold documents, healthcare audit logs.
Object Lock. Once an object version is locked, it cannot be deleted or overwritten until the retention period expires. In compliance mode this applies to everyone, including the bucket owner and the root account.
Modes.
- Governance mode. Locked, but special IAM principals can override.
- Compliance mode. Locked, no override, period. The retention period must elapse.
Use. Bank ledgers, audit logs subject to retention rules, healthcare records, anything where "we promise we will not modify this for N years" is a legal obligation.
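The difference between the two modes can be sketched as a small decision function. This is a simplified model of the semantics described above, not S3's implementation; the function and enum names are hypothetical, and the bypass flag stands in for the special IAM permission that governance mode allows.

```python
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class LockMode(Enum):
    GOVERNANCE = "GOVERNANCE"
    COMPLIANCE = "COMPLIANCE"


def can_delete_version(
    mode: Optional[LockMode],
    retain_until: Optional[datetime],
    now: datetime,
    has_bypass_permission: bool = False,
) -> bool:
    """Return True if a locked object version may be deleted at time `now`."""
    # No lock, or the retention period has already elapsed: deletion is allowed.
    if mode is None or retain_until is None or now >= retain_until:
        return True
    # Governance mode: only principals with the bypass permission may override.
    if mode is LockMode.GOVERNANCE:
        return has_bypass_permission
    # Compliance mode: no override until the retention period elapses.
    return False


if __name__ == "__main__":
    now = datetime(2026, 1, 1, tzinfo=timezone.utc)
    until = datetime(2033, 1, 1, tzinfo=timezone.utc)
    print(can_delete_version(LockMode.GOVERNANCE, until, now, has_bypass_permission=True))
    print(can_delete_version(LockMode.COMPLIANCE, until, now, has_bypass_permission=True))
```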
13.7 Threat model recap
- Accidental public exposure. Mitigated by Block Public Access.
- Credential leak. Mitigated by short-lived IAM credentials (STS — Security Token Service), MFA, and CloudTrail monitoring.
- Pre-signed URL leak. Mitigated by short expirations.
- Insider deletion. Mitigated by versioning + MFA Delete + Object Lock for compliance data.
- Provider compromise. Mitigated by client-side encryption.
- Ransomware on bucket. Mitigated by versioning (the attacker can't actually delete versions if MFA Delete is on) + cross-account backup (separate "vault" account where the prod account has no access).
§14. Use Case Gallery
Six archetypal applications of object storage across very different domains.
14.1 Static asset hosting (websites, images, JS bundles)
Pattern. Consumer web app — Shopify storefronts, Substack publications, SaaS marketing sites — serves static assets (images, CSS, JS bundles, fonts) globally.
Architecture. Object storage as origin; CDN in front; assets versioned by hash in filename (main-a8f29c.js) so they cache forever. Build uploads new versions; old URLs never invalidate because old hashes never reappear.
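A minimal sketch of hash-in-filename versioning, assuming SHA-256 and a six-character digest prefix to mirror the main-a8f29c.js example; real build tools choose their own hash function and length, and the function name here is illustrative.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_asset_name(filename: str, content: bytes, digest_chars: int = 6) -> str:
    """Insert a content-hash prefix before the file extension.

    The same bytes always produce the same name, so a URL can be cached
    forever; changed bytes produce a new name and therefore a new URL.
    """
    digest = hashlib.sha256(content).hexdigest()[:digest_chars]
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}-{digest}{path.suffix}"))


if __name__ == "__main__":
    print(hashed_asset_name("main.js", b"console.log('v1');"))
    print(hashed_asset_name("main.js", b"console.log('v2');"))
```

Because the object key changes whenever the content changes, the CDN never needs an explicit invalidation: the HTML simply references the new key.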
Why it fits. Bytes uploaded once, served billions of times. Origin latency barely matters because the cache hit rate is typically 98% or higher. Cost is CDN egress, not S3 storage. URL-hash versioning makes cache invalidation a non-problem.
Numbers. Medium site: 500 GB assets, 50M req/mo to CDN, 1M req/mo to origin (2% miss). S3: $11.50 storage + minimal egress. CloudFront: ~$4,250 egress. Without S3+CDN, nginx on EC2 needs multiple servers and breaks under viral traffic.
14.2 Video and media storage (Netflix-class, YouTube-class)
Pattern. Streaming service stores master video files plus transcoded variants (resolutions × codecs × bitrates). Netflix: ~17,000 titles × dozens of encodings each; YouTube: billions of videos.
Architecture. Masters cold-tier; variants hotter, proportional to popularity; HLS (HTTP Live Streaming) or DASH (Dynamic Adaptive Streaming over HTTP) chunks at 2-10 sec / few MB each. Client requests chunks sequentially; CDN edge caches hot chunks.
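The "few MB each" chunk size follows from bitrate times duration. A quick sketch of that arithmetic, where the 5 Mbps bitrate, 6-second segments, and 120-minute runtime are illustrative assumptions rather than figures from any particular service:

```python
import math


def segment_size_mb(bitrate_mbps: float, segment_seconds: float) -> float:
    """Approximate size of one media segment in megabytes (8 bits per byte)."""
    return bitrate_mbps * segment_seconds / 8


def segments_per_title(runtime_minutes: float, segment_seconds: float) -> int:
    """Number of segments (objects) needed for one encoding of one title."""
    return math.ceil(runtime_minutes * 60 / segment_seconds)


if __name__ == "__main__":
    size = segment_size_mb(bitrate_mbps=5, segment_seconds=6)
    count = segments_per_title(runtime_minutes=120, segment_seconds=6)
    print(f"{size} MB per segment, {count} segments per encoding")
```

Under these assumptions one encoding of a two-hour title is 1,200 objects of 3.75 MB each; multiply by the number of resolution, codec, and bitrate variants to see why object counts grow quickly.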
Why it fits. Big bytes, "fetch whole chunk in one read" access, durability matters (masters expensive to regenerate), tiering essential (a 2010 film sits in Glacier; a new release in Standard).
Numbers. Netflix Open Connect has been estimated to carry roughly 30% of downstream internet traffic at peak in some regions. Backing catalog 10-100 PB. Pareto-distributed popularity means a small fraction of titles dominates; rest sits cold.
14.3 Data lake (Iceberg / Delta / Hudi tables on object storage)
Pattern. Analytics / ML org stores PB of structured data (events, transactions, logs) as Parquet in object storage. Apache Iceberg, Delta Lake, Apache Hudi provide ACID-like semantics over immutable Parquet using manifest files.
Architecture. s3://lake/<db>/<table>/data/<hash>/file-<id>.parquet. Catalog metadata s3://lake/<db>/<table>/metadata/v123.metadata.json points atomically at the active file set. Writes append + rewrite catalog atomically. Compaction merges small files.
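The "append, then swap the catalog pointer atomically" step can be sketched with an in-memory stand-in. This is a simplified model in the spirit of Iceberg-style commits, not any engine's actual code: the dict plays the role of the object store, the Catalog class plays the role of a catalog service with compare-and-swap, and all names are hypothetical.

```python
import json
import threading
import uuid
from typing import Dict, List, Optional


class Catalog:
    """Holds one pointer per table to its current metadata file."""

    def __init__(self) -> None:
        self._pointers: Dict[str, str] = {}
        self._lock = threading.Lock()

    def current(self, table: str) -> Optional[str]:
        return self._pointers.get(table)

    def compare_and_swap(self, table: str, expected: Optional[str], new: str) -> bool:
        """Move the pointer only if nobody else committed in the meantime."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False
            self._pointers[table] = new
            return True


def commit_append(store: Dict[str, str], catalog: Catalog, table: str,
                  new_data_keys: List[str]) -> str:
    """Write a new immutable metadata file, then atomically repoint the catalog."""
    while True:
        base_key = catalog.current(table)
        base = json.loads(store[base_key]) if base_key else {"version": 0, "files": []}
        version = base["version"] + 1
        # Unique file name so a losing writer never overwrites a committed file.
        new_key = f"{table}/metadata/v{version}-{uuid.uuid4().hex}.metadata.json"
        store[new_key] = json.dumps(
            {"version": version, "files": base["files"] + new_data_keys}
        )
        if catalog.compare_and_swap(table, base_key, new_key):
            return new_key
        # Lost the race: re-read the latest metadata and retry.
```

Readers always follow the catalog pointer, so they see either the old file set or the new one, never a half-written mix; the data files themselves are never modified in place.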
Why it fits. Multi-PB volumes; read-mostly with appends; Trino / Spark / Flink / DuckDB / BigQuery / Snowflake all read Parquet-on-S3 natively; cost orders of magnitude lower than warehouse storage. Iceberg catalog + Parquet data = the canonical metadata-in-DB + bytes-in-object-storage pattern.
Numbers. Uber's lake: ~250 PB on HDFS + S3, hundreds of thousands of tables. Netflix: PB-scale Iceberg on S3, all analytics built on it. Apple, Airbnb, Stripe sit on this pattern.
14.4 ML model storage and training data
Pattern. ML teams store training datasets (TB-PB of images, text, audio, video) and model artifacts (checkpoints, final models, fine-tuned variants) in object storage. Training jobs read from S3/GCS, write checkpoints back. Serving loads from S3/GCS.
Architecture. Datasets in prefixes (s3://datasets/laion-5b/v2/shards/). Versioned model registry (s3://models/recommender/v23/model.pt). Training reads in parallel through an S3-aware data loader, such as the Amazon S3 Connector for PyTorch (s3torchconnector) or TensorFlow's tf.data with S3 URLs. LAION-5B, Common Crawl: multi-PB entirely in object storage.
Why it fits. Big infrequently-mutated bytes; reads parallelize (1000 files from many GPUs); PB-scale cost discipline; versioning matches experiment cycle.
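Parallel reads work because each worker can take a disjoint slice of the shard keys. A minimal sketch of one common assignment scheme (round-robin by rank), with hypothetical key names; real data loaders add shuffling and per-epoch reseeding on top of this.

```python
from typing import List, Sequence


def shards_for_worker(shard_keys: Sequence[str], rank: int, world_size: int) -> List[str]:
    """Return the shard keys that worker `rank` of `world_size` should read.

    Keys are sorted first so every worker computes the same global order
    and the slices are disjoint and together cover every shard.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return sorted(shard_keys)[rank::world_size]


if __name__ == "__main__":
    # Hypothetical shard keys under a dataset prefix.
    keys = [f"laion-5b/v2/shards/{i:05d}.tar" for i in range(10)]
    for rank in range(4):
        print(rank, shards_for_worker(keys, rank, world_size=4))
```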
Numbers. OpenAI GPT training data: multi-PB. Meta LLaMA: multi-PB datasets, multi-TB checkpoints. Hugging Face hub: tens of thousands of multi-GB checkpoints on S3, served to anyone running from transformers import AutoModel.
14.5 Backup and archival
Pattern. Database backups, application logs, system snapshots stream to object storage continuously. Mostly never read; rare DR event reads a tiny slice.
Architecture. Encrypted dumps to s3://backups-prod/postgres/db-name/2026-05-22/. Lifecycle sends old to Glacier after 30 days. Paranoid: cross-account or cross-cloud backup destinations (separate AWS account, or even Backblaze B2 outside AWS, so prod compromise can't reach backups).
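The lifecycle rule described here can be expressed as a configuration document. The sketch below builds one in the shape the S3 lifecycle API accepts (Rules, Filter, Transitions, Expiration); the rule ID, prefix, and seven-year expiration are illustrative assumptions, and the helper name is hypothetical.

```python
import json


def backup_lifecycle(prefix: str, glacier_after_days: int, retention_years: int) -> dict:
    """Build a lifecycle configuration: move to Glacier, then expire."""
    return {
        "Rules": [
            {
                "ID": "backups-to-glacier",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Move objects to the archive tier once they are old enough.
                "Transitions": [
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"}
                ],
                # Delete objects after the retention window (365-day years).
                "Expiration": {"Days": retention_years * 365},
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(backup_lifecycle("postgres/", 30, 7), indent=2))
```

A document like this is applied once per bucket; after that, tiering and expiry happen without any application code running.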
Why it fits. Cheap, durable, retain for years, easy to encrypt, lifecycle to ultra-cheap tiers. Restore latency tolerable for DR.
Numbers. 1 TB DB with daily fulls + hourly incrementals, 7-year retention: ~$60-100/mo with lifecycle vs $1500-3000/mo without.
14.6 Application file uploads (Dropbox, Notion, Slack)
Pattern. Users upload files (docs, attachments, photos, videos). App stores bytes in object storage, metadata (owner, sharing, filename, mime) in a database, serves via pre-signed URL.
Architecture. The canonical diagram in §6.
Why it fits. User-controlled, unpredictable sizes; multi-tenant; CDN for shared/public files, pre-signed URLs for private; lifecycle for trash.
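The idea behind a pre-signed URL is a signature over the object key and an expiry time, checked by the storage side without a database lookup. The sketch below illustrates that idea with a plain HMAC; it is deliberately not the AWS Signature Version 4 scheme that S3 actually uses, and the host name, secret, and function names are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, quote, unquote, urlencode, urlparse


def _signature(key: str, expires_at: int, secret: bytes) -> str:
    message = f"GET\n{key}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(base_url: str, key: str, expires_at: int, secret: bytes) -> str:
    """Return a URL granting read access to `key` until `expires_at` (Unix seconds)."""
    query = urlencode({"expires": expires_at, "signature": _signature(key, expires_at, secret)})
    return f"{base_url}/{quote(key)}?{query}"


def verify_url(url: str, secret: bytes, now: int) -> bool:
    """Check expiry and signature; any tampering with key or expiry fails."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    key = unquote(parsed.path.lstrip("/"))
    try:
        expires_at = int(params["expires"][0])
        provided = params["signature"][0]
    except (KeyError, IndexError, ValueError):
        return False
    if now > expires_at:
        return False
    expected = _signature(key, expires_at, secret)
    # Constant-time comparison avoids leaking signature bytes via timing.
    return hmac.compare_digest(provided.encode(), expected.encode())
```

The application server signs after checking ownership and sharing rules in its database; the browser then fetches the bytes directly from storage, so large files never pass through the application tier.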
Numbers. Notion: hundreds of TB. Slack: PB-scale. Dropbox: EB-scale (Magic Pocket replaced S3 for cost). All on the same metadata-in-DB + bytes-in-object-storage pattern.
§15. Real-World Implementations with Numbers
Named systems shipping object storage at scale.
AWS S3. The reference implementation. Public numbers: 280+ trillion objects stored; 100+ million requests per second in aggregate; exabytes of data. Spans dozens of regions and multiple AZs per region. The benchmark for "what object storage can do."
Netflix Open Connect + S3 origin. Netflix uses S3 as origin for video files; Open Connect (their custom CDN, deploying servers inside ISPs worldwide) serves most of the bytes. Catalog estimated at 10-100 PB across master + encoded variants. Peak throughput has been estimated at ~30% of downstream internet bandwidth during evening hours in some regions.
Dropbox Magic Pocket. Dropbox started on AWS S3, hit ~500-600 PB by 2016, then built their own object storage system ("Magic Pocket") specifically to reduce cost. Currently EB-scale, all on commodity hardware (SMR drives, erasure coding, custom controllers). They didn't migrate "back to filesystems"; they built their own S3 from scratch. The public talks on Magic Pocket are an excellent read for systems architecture at this scale.
Pinterest images. ~300 billion pins as of recent reports, each pin has multiple image resolutions stored in S3. PB-scale; high read rate (billions of GETs per day) served largely from CloudFront and an internal cache tier.
GitHub Large File Storage (LFS). Large files in Git repositories (binaries, datasets, models) stored in S3 with the Git history holding only pointers. Tens of PB across hundreds of millions of repositories.
Cloudflare R2. Launched 2022 explicitly to disrupt S3's egress pricing. Reported tens of PB stored within a year; thousands of customers. The technology is similar (erasure coding, S3-compatible API); the business innovation is "we don't charge for egress because our broader CDN business already pays for the bandwidth."
Facebook Haystack and f4. Meta's photo storage. Haystack is the original (~2010-era), a custom blob store for billions of photos that packs many photos into one large file to avoid per-photo filesystem metadata lookups. f4 is the warm-storage tier, using layered erasure coding (e.g. Reed-Solomon (10, 4) within a data center plus XOR coding across data centers) to store warm-but-rarely-read photos at lower cost. ~100 PB scale.
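The cost advantage of erasure coding over replication is simple arithmetic. The sketch below assumes a (10, 4) code (10 data blocks plus 4 parity blocks) and a second layer that stores one XOR parity copy for every two data copies, following the published f4 design; treat the layering as an assumption and the function name as illustrative.

```python
def storage_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per logical byte for a (data, parity) erasure code."""
    return (data_blocks + parity_blocks) / data_blocks


if __name__ == "__main__":
    inner = storage_overhead(10, 4)  # (10, 4) code: 1.4x
    outer = storage_overhead(2, 1)   # one XOR parity per two copies: 1.5x
    print(f"inner layer: {inner:.1f}x")
    print(f"both layers: {inner * outer:.1f}x")
    print("triple replication: 3.0x")
```

Under these assumptions the two layers together cost about 2.1 raw bytes per logical byte, compared with 3.0 for plain triple replication, while still tolerating multiple simultaneous failures.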
Microsoft Azure Storage. Azure Blob Storage backs everything from Azure VMs (Page blobs as VHDs) to Office 365 (huge document storage) to consumer OneDrive. Many exabytes.
OpenAI / Anthropic / Hugging Face model hosting. Foundation model weights (multi-GB to TB per model) sit in S3 / GCS / Azure Blob for serving and distribution. Hugging Face's CDN-fronted model hub serves billions of model downloads.
Snowflake / Databricks data lakes. Snowflake's storage layer is built on cloud object storage (S3, GCS, Azure Blob). Databricks Lakehouse architecture: Delta Lake on S3. PB-scale data per customer.
The range is staggering. From a side-project blog storing 10 GB in S3 to Meta's f4 holding 100 PB, the same conceptual primitive — key, bytes, durability — scales.
§16. Summary
Object storage is the planet-scale dumping ground of opaque bytes: a flat key-to-blob map with eleven-nines durability, strong read-after-write, effectively infinite capacity, but 50-150ms latency and no semantics richer than GET and PUT. The variants (S3, GCS, Azure Blob, R2, MinIO, Ceph) agree on the API and disagree on pricing: egress, not storage, is the structural cost. Internally, erasure coding spreads each object across availability zones; prefix sharding caps single-prefix throughput at ~5,500 GETs / 3,500 PUTs per second; multipart upload exists because single PUT caps at 5 GB and parallelism wins. The canonical architectural pattern is small-metadata-in-a-database pointing at big-bytes-in-object-storage, fronted by a CDN for public reads, gated by IAM and bucket policies and pre-signed URLs for private access, and aged through lifecycle policies into cheaper tiers. Get the boundaries right (what goes here vs in a database vs in block storage) and the rest is plumbing.