Haystack vs GFS
• GFS
o Targets video files. Each video file is divided into “chunks” (64MB each)
o Each chunk is replicated several times (three by default)
o File->Chunk mapping maintained by a central master
o All metadata kept in memory
• Haystack
o Several photos are combined into a single large file known as a logical volume (100GB)
o Each logical volume corresponds to several physical volumes (replicas)
o photoID->(logicalVolume, fileOffset, fileSize) mapping maintained by a separate “Haystack Directory” component
o All metadata kept in memory
Why another Custom Storage System?
• One-photo-per-file is wasteful: much of the per-file metadata is not
needed for photos, e.g., file permissions
• Photos are written once and never modified: only need to support
Three core components:
• Haystack Store
• Haystack Directory
• Haystack Cache: functions as an internal CDN
o Browser can be directed to either the CDN or the Cache
o Not confirmed, but this “Haystack Cache” may refer to Facebook’s many “Edge Caches” located at Internet Points of Presence
• Multiple (millions of) photos are stored in a single large file called a physical volume
• E.g. a 10TB server = 100 physical volumes x 100GB per volume
• Physical volumes from different machines are grouped into logical volumes
o Storing a photo on a logical volume = writing it to all corresponding physical volumes
• Conceptually, a physical volume is simply a very large file (100GB) saved as
‘/hay/haystack_<logical volume id>’.
• A Store machine can retrieve a photo using <LogicalVolumeID, FileOffset,
Size>. The trick is to find out the logical volume ID, the file offset and size,
without disk operations
• Each physical volume starts with a superblock
o File offset 0 is therefore not a valid photo offset
• Followed by a sequence of needles, one for each photo.
• Each Store machine maintains, for each physical volume, the in-memory mapping photo ID => <FileOffset, NeedleSize, Flags>
o File offset == 0 means deleted
• Cookie is not stored in memory: it is only checked after reading a needle from disk
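The lookup described above can be sketched as follows. This is a minimal illustration, not Haystack's actual on-disk format: the needle header layout, the `SUPERBLOCK` placeholder, and the `Volume` class are all assumptions made for the example. It shows how one in-memory lookup plus one read at a known offset retrieves a photo, with the cookie checked only after the needle is read.

```python
import io
import struct

# Hypothetical needle header: cookie (8 bytes), key (8 bytes),
# flags (1 byte), data size (4 bytes). The photo bytes follow.
HEADER = struct.Struct("<QQBI")
SUPERBLOCK = b"HAYSTACK"  # placeholder; guarantees no needle lives at offset 0


class Volume:
    def __init__(self):
        self.file = io.BytesIO()   # stands in for the large volume file
        self.file.write(SUPERBLOCK)
        self.index = {}            # in-memory: key -> (offset, needle size)

    def append(self, key, cookie, data):
        offset = self.file.seek(0, io.SEEK_END)
        self.file.write(HEADER.pack(cookie, key, 0, len(data)) + data)
        self.index[key] = (offset, HEADER.size + len(data))
        return offset

    def read(self, key, cookie):
        offset, size = self.index.get(key, (0, 0))
        if offset == 0:            # unknown or deleted: no disk access needed
            return None
        self.file.seek(offset)
        needle = self.file.read(size)
        stored_cookie, stored_key, flags, length = HEADER.unpack_from(needle)
        # The cookie and the deleted flag are only visible after the read.
        if stored_cookie != cookie or stored_key != key or flags & 1:
            return None
        return needle[HEADER.size:HEADER.size + length]
```

A request with the wrong cookie still costs one disk read in this sketch, because the cookie lives only in the needle.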
• In theory a machine can reconstruct its in-memory
mappings by scanning all physical volumes. Doing so is time-consuming.
• Store machines maintain an index file for each of their
volumes. The index file is a checkpoint of the in-memory
metadata. The index file shortens restart time.
• An index file’s layout is similar to a volume file’s,
containing a superblock followed by a sequence of index
records, one for each needle in the superblock, in the
same order as the needles appear in the volume file.
• Index files are updated asynchronously => may represent stale checkpoints
• Adding a new photo: needle appended to the volume file.
Index record appended asynchronously to the index file
(so it may get lost). If the index record is lost, these orphan
needles always appear at the end of the volume file =>
quick to identify after a reboot.
• Deleting a photo: needle flagged in the volume file. Index
record not updated. Upon reading the photo, the
machine inspects the deleted flag and then updates its in-memory mapping.
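The restart procedure can be sketched as below, assuming a simplified needle header (key, flags, data size) and index records of the form `(key, offset, size)`. Both layouts and the names `recover_mapping` and `needle` are hypothetical. The point is that only the tail of the volume file past the last index record needs scanning to pick up orphan needles.

```python
import struct

NEEDLE = struct.Struct("<QBI")   # hypothetical header: key, flags, data size
SUPERBLOCK_SIZE = 8              # hypothetical; keeps offset 0 free


def needle(key, data, flags=0):
    """Serialize one needle in the hypothetical layout."""
    return NEEDLE.pack(key, flags, len(data)) + data


def recover_mapping(volume: bytes, index_records):
    """Rebuild key -> (offset, data size) from the index checkpoint,
    then scan only the volume tail for needles the index missed."""
    mapping = {}
    end = SUPERBLOCK_SIZE
    for key, offset, size in index_records:   # same order as the needles
        mapping[key] = (offset, size)
        end = offset + NEEDLE.size + size
    # Anything past the last indexed needle is an orphan.
    while end < len(volume):
        key, _flags, size = NEEDLE.unpack_from(volume, end)
        mapping[key] = (end, size)  # deleted flag is checked lazily on read
        end += NEEDLE.size + size
    return mapping
```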
Write-Enabled vs Read-Only
• When new machines are added to the Haystack Store, they are write-enabled
• Only write-enabled machines receive uploads
• Over time the available capacity on these machines decreases. When
a machine exhausts its capacity, it is marked as read-only (by the Haystack Directory)
• Haystack Directory
o Maintains the logical volume=>physical volumes mapping. Web server uses this mapping when uploading photos and also (FIXME: how?) when constructing photo URLs
o Records the logical volume where each photo resides
o Load balances: which logical volume receives a write request; which physical volume (machineID) receives a read request
o Determines whether a read request should be handled by the CDN or by the Cache
o Identifies logical volumes that are read-only and marks them as such
• The Directory stores its information in a replicated database accessed
via a PHP interface that leverages memcache to reduce latency
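A minimal in-process sketch of the Directory's bookkeeping follows. The real Directory is a replicated database, so the `Directory` class, its method names, and the uniform random load balancing are assumptions made purely for illustration.

```python
import random


class Directory:
    def __init__(self):
        self.replicas = {}     # logical volume id -> list of machine ids
        self.writable = set()  # logical volumes still accepting uploads

    def add_volume(self, logical_id, machines):
        self.replicas[logical_id] = list(machines)
        self.writable.add(logical_id)

    def mark_read_only(self, logical_id):
        self.writable.discard(logical_id)

    def pick_for_write(self):
        # A write goes to one write-enabled logical volume and must reach
        # every physical volume (machine) behind it.
        logical_id = random.choice(sorted(self.writable))
        return logical_id, self.replicas[logical_id]

    def pick_for_read(self, logical_id):
        # A read can be served by any single replica.
        return random.choice(self.replicas[logical_id])
```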
• Haystack Cache: organized as a distributed hash table (DHT) where photo ID is the key
• It caches a photo only if
oThe request comes directly from a user and not the CDN. Post-CDN caching is
ineffective as it is unlikely that a request that misses in the CDN would hit in
our internal cache
oAnd, the photo is fetched from a write-enabled Store machine. This is in order
to shelter write-enabled Store machines from read requests. Photos are most
heavily accessed soon after they are uploaded. Filesystems perform better
when doing either reads or writes but not both.
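The two admission conditions above reduce to a single predicate. The function and parameter names are hypothetical.

```python
def should_cache(request_from_cdn: bool, store_is_write_enabled: bool) -> bool:
    """Cache a photo only for direct user requests served by a
    write-enabled Store machine."""
    return (not request_from_cdn) and store_is_write_enabled
```

Under this rule, a photo fetched from a read-only Store machine is never admitted, even on a direct user request.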
Haystack: Read Operation
2+3. Web Server uses the Haystack Directory to construct a URL for each photo.
…<Logical volume, PhotoID>
• photo ID includes
o a 64-bit key
o a 32-bit alternate key which
identifies the photo’s type
o and a cookie.
• The Directory determines whether a read request should be handled by the CDN or by the Cache
6. (Browser may skip this step) CDN looks up
the photo using <Logical volume, PhotoID> in
the URL. If CDN cannot find the photo in its
cache, it strips the CDN part from the URL
and passes to the Haystack Cache.
7. Cache does a similar lookup and, on a
miss, strips the Cache part from the URL and
passes to the Haystack Store
8. The Store machine looks up the photo ID
from the in-memory metadata to get
<FileOffset, NeedleSize, Flags>. If the flag
indicates that the photo has been deleted,
errors are returned. Otherwise the Store
machine reads the corresponding needle.
8. Verifies the cookie and the integrity of the
data. If all checks pass, the photo is returned
to the Haystack Cache.
• Cookie is not stored in memory: it is only checked
after reading a needle from disk. The cookie
stored in the volume file is compared against the
cookie in the URL.
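The fall-through from CDN to Cache to Store can be sketched with plain dictionaries standing in for each tier. The tuple form of the URL, the `fetch` function, and the returned tier label are assumptions made for illustration. Cookie verification at the Store is omitted here.

```python
def fetch(url_parts, cdn_cache, haystack_cache, store):
    """url_parts is a hypothetical parsed URL:
    (cdn, cache, machine_id, logical_volume, photo_id).
    cdn is None when the browser skips the CDN."""
    cdn, _cache, machine_id, logical_volume, photo_id = url_parts
    key = (logical_volume, photo_id)
    if cdn is not None and key in cdn_cache:
        return cdn_cache[key], "cdn"
    # CDN miss (or CDN skipped): the CDN part is dropped, ask the Cache.
    if key in haystack_cache:
        return haystack_cache[key], "cache"
    # Cache miss: the Cache part is dropped, ask the Store machine.
    return store[machine_id][key], "store"
```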
Haystack: Write Operation
2. Server requests a write-enabled logical
volume from the directory. A unique 64-bit key
is assigned to the photo (FIXME: by the
2. Directory picks a random cookie value and
stores it with the photo. The cookie effectively
eliminates attacks aimed at guessing valid URLs
4. Web server provides the logical volume ID,
the photo ID, and the photo data to each of the
physical volumes mapped to the assigned logical volume.
4. Each Store machine synchronously appends
a needle to its physical volume file.
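The fan-out in step 4 can be sketched as follows, with a `bytearray` per machine standing in for each physical volume file. The function name, the 8-byte placeholder superblock, and the key/cookie encoding are assumptions made for the example.

```python
def write_photo(stores, machines, logical_id, key, cookie, data):
    """Append one needle to the physical volume of every replica machine.
    stores: machine id -> {logical volume id -> bytearray}."""
    offsets = {}
    for machine in machines:
        volume = stores[machine].setdefault(logical_id, bytearray(b"SUPERBLK"))
        offsets[machine] = len(volume)   # needle lands at the current end
        volume += key.to_bytes(8, "little") + cookie.to_bytes(8, "little") + data
    return offsets
```

Because every replica receives the same appends in the same order, the replicas of a logical volume stay byte-identical in this sketch.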
Redundant Space in Physical Volume
• Deleting a photo = setting the deleted flag in both the Store machine’s in-memory mapping
and synchronously in the physical volume file
• Haystack disallows overwriting needles
o for example, rotating a photo will cause an updated needle to be appended with the same key and alternate key
o There may exist multiple needles with the same photo ID. How to identify the newest version?
• If the new needle is in a different logical volume, the Directory updates its
photoID=>logicalVolume mapping so that future requests will never fetch the older version
• If the new needle is in the same logical volume, which implies the same Store machines
and same physical volumes, the needle with the highest file offset is the newest version.
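The same-volume rule can be sketched as a single pass over the needles that keeps the highest offset per photo ID. The `(offset, photo_id)` tuples are a hypothetical stand-in for parsed needles.

```python
def newest_offsets(needles):
    """needles: iterable of (offset, photo_id), in any order.
    Returns photo_id -> offset of its newest needle."""
    newest = {}
    for offset, photo_id in needles:
        if offset > newest.get(photo_id, 0):  # offset 0 is never a needle
            newest[photo_id] = offset
    return newest
```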
• Pitchfork
o periodically checks the health of each Store machine, e.g., remotely tests the connection to each machine, checks the availability of each volume file, and attempts to read data from the Store machine.
o If pitchfork finds a machine unhealthy, all logical volumes on that machine are marked as read-only.
o The underlying causes of the failed checks are addressed manually offline.
• Bulk sync: recover the contents of a failed machine from a replica
• Compaction
o reclaims space used by deleted and duplicated needles
o copies a volume file into a new file, skipping any duplicate or deleted entries
o if a deletion needs to happen in the middle of a compaction, the deletion is marked
in both the source and destination volume files.
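Compaction can be sketched as a filter over the needles in file order. This keeps a needle only if it is the last one for its photo ID and is not flagged deleted. The `(photo_id, deleted, data)` tuples are a hypothetical stand-in for parsed needles, and deletions arriving mid-compaction are not modeled.

```python
def compact(needles):
    """needles: list of (photo_id, deleted, data) in file order.
    Returns the needles to write to the new volume file."""
    # Index of the last (newest) needle for each photo ID.
    last = {photo_id: i for i, (photo_id, _, _) in enumerate(needles)}
    return [
        n for i, n in enumerate(needles)
        if last[n[0]] == i and not n[1]
    ]
```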
• 32 bytes per photo (see table below), plus approximately 2 bytes per image due to hash-table overheads => 40 bytes per photo
• In comparison, an xfs_inode_t structure in Linux is 536 bytes.
64-bit photo key
1st scaled image 32-bit offset + 16-bit size
2nd scaled image 32-bit offset + 16-bit size
3rd scaled image 32-bit offset + 16-bit size
4th scaled image 32-bit offset + 16-bit size
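The per-photo figures can be checked from the field widths in the table. The variable names are illustrative. The 2-byte overhead and the 536-byte inode size are the figures quoted above.

```python
import struct

KEY_BYTES = struct.calcsize("<Q")         # 64-bit photo key: 8 bytes
PER_IMAGE_BYTES = struct.calcsize("<IH")  # 32-bit offset + 16-bit size: 6 bytes
SCALED_IMAGES = 4
HASH_OVERHEAD_PER_IMAGE = 2               # approximate, as quoted above
XFS_INODE_BYTES = 536                     # as quoted above

per_photo = KEY_BYTES + SCALED_IMAGES * PER_IMAGE_BYTES
with_overhead = per_photo + SCALED_IMAGES * HASH_OVERHEAD_PER_IMAGE

print(per_photo, with_overhead, XFS_INODE_BYTES / with_overhead)
```

The in-memory record comes out more than an order of magnitude smaller than one inode per photo.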
In practice, the flag is not stored in the in-memory metadata. If the in-memory metadata has FileOffset==0, it means the file is deleted. Otherwise, the needle will be read, and the flag stored in the needle (i.e., in the volume file) will be checked again to see whether the file is deleted.