@Indeedeng: RAD - How We Replicate Terabytes of Data Around the World Every Day
Link to video: https://youtu.be/lDXdf5q8Yw8

At Indeed, we use massive amounts of data to build our products and services. At first, we relied on rsync to distribute these data to our servers. This rsync system lasted for ten years before we started to encounter scaling challenges. So we built a new system on top of BitTorrent to improve latency, reliability, and throughput. Today, terabytes of data flow around the world every day between our servers. In this talk, we will describe what we needed, what we created, and the lessons we learned building a system at this scale.


  1. 1. RAD How We Replicate Terabytes of Data Around the World Every Day
  2. 2. Jason Koppe System Administrator
  3. 3. Indeed is the #1 external source of hire. 64% of US job searchers search on Indeed each month. [Chart: unique visitors in millions, 2009-2015, reaching 180M] 180 million unique users, 80.2M unique US visitors per month, 16M jobs, 50+ countries, 28 languages
  4. 4. How We Build Systems fast simple resilient scalable
  5. 5. fast
  6. 6. Fast
  7. 7. Job Search Browser Rendering: median ~0.5 seconds [Chart: Feb 24 - Mar 8, render time in milliseconds]
  8. 8. simple
  9. 9. 2004 launch: a few servers, 1.8m US jobs
  10. 10. 2004 Aggregation MySQL Job Search Every job on the web
  11. 11. relational database, accessed across the network
  12. 12. NOT fast at full text search NOT a search engine
  13. 13. 2004 Indeed 1999 Lucene
  14. 14. Lucene™ a high-performance, full-featured text search engine library
  15. 15. Lucene™ NOT a remote database, files must be on local disk
  16. 16. MySQL Database Server Lucene Index Server Index Builder /data/jobindex
  17. 17. Index Builder Index Builder Index Builder Index Builder /data/jobindex /data/jobindex /data/jobindex /data/jobindex MySQL
  18. 18. MySQL Database Server Indexer Server Index Builder /data/jobindex Search Engine /data/jobindex 4 Search Servers
  19. 19. any combination of data, not just lucene
  20. 20. lucene + model
  21. 21. lucene + model bitset
  22. 22. lucene + model bitset lucene + custom binary
  23. 23. lucene + model bitset lucene + custom binary json + csv
  24. 24. MySQL Database Server Index Builder Producer Artifact Artifact Consumers Search Engine
  25. 25. MySQL Database Server Index Builder Producer Artifact Artifact Consumers Search Engine Artifact is read-optimized data stored in a directory on the file system
  26. 26. Producer creates and updates a data artifact Database Server Index Builder Producer Artifact Artifact Consumers Search Engine MySQL
  27. 27. Consumer reads a data artifact Database Server Index Builder Producer Artifact Artifact Consumers Search Engine MySQL
  28. 28. produce once, consume many times
  29. 29. MySQL Database Server Index Builder Producer Artifact Artifact Consumers Search Engine Benefit: minimize database access
  30. 30. MySQL Database Server Index Builder Producer Artifact Artifact Consumers Search Engine Benefit: compute artifact once
  31. 31. MySQL Database Server Index Builder Producer Artifact Artifact Consumers Search Engine Benefit: scale consumers independently
  32. 32. MySQL Expensive Index Builder Producer Artifact Artifact Commodity Search Engine Benefit: scale consumers independently
  33. 33. MySQL Database Server Index Builder Producer Artifact Artifact Consumers Search Engine Benefit: separate code deployables
  34. 34. fast resilient scalable
  35. 35. Producer artifact Search Engine Consumers artifact Index Builder
  36. 36. Producer artifact Search Engine Consumers artifact Index Builder
  37. 37. rsync efficient point-to-point file transfer utility
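  As a rough illustration of how a producer might push one artifact version to a consumer with rsync (the exact invocation isn't in the deck; the host name and paths here are made up):
    $ rsync -a --delete /data/jobindex.4/ consumer1:/data/jobindex.4/
    # -a preserves permissions and timestamps; --delete removes files that no longer exist in the source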
  38. 38. 1 consumers should reload data regularly
  39. 39. 1 consumers should reload data regularly 2 roll back
  40. 40. consumers should reload data regularly 2 roll back 3 data reload should not interrupt requests 1
  41. 41. artifact versioning
  42. 42. $ ls -d jobindex.* jobindex.1 jobindex.2 jobindex.3 new directory for new version
  43. 43. $ ls -d jobindex.* jobindex.1 jobindex.2 jobindex.3 jobindex.latest -> jobindex.3 symlink to know current version
  44. 44. $ ls -d jobindex.* jobindex.1 jobindex.2 jobindex.3 jobindex.4 jobindex.latest -> jobindex.4 load new data
  45. 45. $ ls -d jobindex.* jobindex.1 jobindex.2 jobindex.3 jobindex.4 jobindex.latest -> jobindex.3 roll back
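  A minimal sketch of the symlink swap behind slides 42-45, assuming the jobindex.N layout shown above. Repointing through a temporary link and a rename keeps the switch atomic, so readers always see either the old version or the new one:
    $ ln -s jobindex.4 jobindex.latest.tmp && mv -T jobindex.latest.tmp jobindex.latest   # load new data
    $ ln -s jobindex.3 jobindex.latest.tmp && mv -T jobindex.latest.tmp jobindex.latest   # roll back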
  46. 46. each new version takes disk space & time
  47. 47. versions total bytes on disk normal disk copy
  48. 48. versions disk latency total bytes on disk normal disk copy
  49. 49. versions version create time disk latency total bytes on disk normal disk copy
  50. 50. 1.8m jobs, change <2% per hour
  51. 51. all jobs 00:00 AM
  52. 52. all jobs 00:00 AM all jobs 04:00 AM new jobs changed jobs
  53. 53. all jobs 00:00 AM all jobs 04:00 AM new jobs changed jobs unchanged
  54. 54. incremental updates
  55. 55. save disk space & time
  56. 56. share data between versions
  57. 57. file1.bin file2.bin file3.bin 3GB jobindex.1
  58. 58. file1.bin file2.bin file3.bin 3GB jobindex.1 file1.bin file2.bin file3.bin jobindex.2
  59. 59. file1.bin file2.bin file3.bin 3GB jobindex.1 file1.bin file2.bin file3.bin file4.bin 4GB jobindex.2
  60. 60. file1.bin file2.bin file3.bin 3GB jobindex.1 file1.bin file2.bin file3.bin file4.bin 4GB jobindex.2 file1.bin file2.bin file3.bin file4.bin file5.bin 5GB jobindex.3
  61. 61. file1.bin file2.bin file3.bin 3GB jobindex.1 file1.bin file2.bin file3.bin file4.bin 4GB jobindex.2 file1.bin file2.bin file3.bin file4.bin file5.bin 5GB jobindex.3 = 12GB+ +
  62. 62. 5GB file1.bin file2.bin file3.bin 3GB jobindex.1 file1.bin file2.bin file3.bin file4.bin 1GB jobindex.2 file1.bin file2.bin file3.bin file4.bin file5.bin 1GB jobindex.3 =+ +
  63. 63. file1.bin file2.bin file3.bin file4.bin jobindex.2 file1.bin file2.bin file3.bin file5.bin jobindex.3 deleted 1GB 1GB = 5GB+ 2GB file4.bin
  64. 64. remove referenced file of symlink, data is gone
  65. 65. hardlink additional name for an existing file
  66. 66. hardlink != symlink
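  The difference in one sketch (hypothetical file names): a symlink is a pointer to a path and dangles when its target goes away, while a hardlink is another directory entry for the same underlying data:
    $ ln -s file1.bin link.sym     # symlink: a name that points at the path "file1.bin"
    $ ln file1.bin link.hard       # hardlink: a second name for the same data blocks
    $ rm file1.bin                 # link.sym now dangles; link.hard still reads the data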
  67. 67. file1.bin file2.bin file3.bin 3GB jobindex.1 file1.bin file2.bin file3.bin file4.bin 1GB jobindex.2 file1.bin file2.bin file3.bin file4.bin file5.bin 1GB jobindex.3 = 5GB+ +
  68. 68. file1.bin file2.bin file3.bin file4.bin 4GB jobindex.2 file1.bin file2.bin file3.bin file4.bin file5.bin 1GB jobindex.3 = 5GB+
  69. 69. file1.bin file2.bin file3.bin file4.bin file5.bin 5GB jobindex.3 = 5GB
  70. 70. remove last hardlink, data is gone
  71. 71. artifact versions: symlinks + hardlinks + rsync
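  One common way to combine these pieces, sketched as an assumption rather than Indeed's exact tooling: rsync's --link-dest hardlinks any file that is unchanged from the previous version, so only new or modified files cost transfer and disk:
    $ rsync -a --link-dest=/data/jobindex.3 producer:/data/jobindex.4/ /data/jobindex.4/
    # unchanged files become hardlinks into jobindex.3; changed or new files are copied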
  72. 72. scale: single producer, many consumers
  73. 73. Job Search Browser Rendering: median ~0.5 seconds [Chart: Feb 24 - Mar 8, render time in milliseconds]
  74. 74. fast simple resilient scalable How We Build Systems
  75. 75. 2004 Indeed 1999 Lucene 2008 6 countries
  76. 76. 2004 Indeed 1999 Lucene 2008 6 countries 2009 23 countries
  77. 77. Jobs added or modified each month: 1.8M (2004), 4.0M (2005), 5.2M (2006), 7.1M (2008), 22.5M (2009)
  78. 78. 2004 Indeed 1999 Lucene 2008 6 countries 2009 23 countries 2nd datacenter
  79. 79. Producer Consumers artifacts DC1 Staging Consumers artifacts DC2 multi-dc rsync Staging Consumers artifacts DC3
  80. 80. Producer Consumers artifacts DC1 Staging Consumers artifacts DC2 Staging Consumers artifacts DC3 minimize Internet bandwidth
  81. 81. 2011 52 countries 4 datacenters 2004 Indeed 1999 Lucene 2008 6 countries 2009 23 countries
  82. 82. Jobs added or modified each month: 1.8M (2004), 4.0M (2005), 5.2M (2006), 7.1M (2008), 22.5M (2009), 32.5M (2011)
  83. 83. rsync system growing pains
  84. 84. Simple: serially copy one artifact at a time DC1 Producer Artifacts DC2 Staging Artifacts
  85. 85. Problem: serially can cause delays Producer Staging New New New Old DC1 DC2
  86. 86. Workaround: copy separately in “streams” (e.g. small, large1, large2) DC1 DC2 Staging Producer
  87. 87. Simple: point-to-point datacenter rsync paths DC4 DC3 DC2 DC1
  88. 88. Problem: Internet, why did you do that? Down DC4 DC3 DC2 DC1
  89. 89. Workaround: shift replication path DC4 DC3 DC2 DC1
  90. 90. Scale: few consumers with rsync Producer Artifacts Consumers
  91. 91. Consumers Producer Grow: many consumers with rsync Artifacts Consumers
  92. 92. Consumers Producer Problem: too many consumers with rsync Artifacts Consumers network 100% used
  93. 93. Workaround: add more network bandwidth Consumers Producer Artifacts Consumers
  94. 94. Workaround: add staging tiers Consumers Producer Artifacts Staging Artifacts Artifacts Staging Artifacts Staging Artifacts Consumers Consumers Consumers Consumers Consumers Consumers Consumers Staging
  95. 95. rsync growth required sysad intervention
  96. 96. 2011 52 countries 2004 Indeed 1999 Lucene 2008 6 countries 2009 23 countries 2014 rsync growth
  97. 97. 100 artifacts, adding +1 producer each month
  98. 98. producing 1,761 TB per month
  99. 99. over 200 consumers, +2 each month
  100. 100. replicating over 21,931 TB per month
  101. 101. staging tiers or network bandwidth, quarterly
  102. 102. modify replication path, monthly
  103. 103. requiring too much intervention from system administrators
  104. 104. [Chart: January-December 2014 - sysad +50%, dev +100%]
  105. 105. 2011 52 countries 2004 Indeed 1999 Lucene 2008 6 countries 2009 23 countries 2014 rsync limits
  106. 106. Julie Scully Software Engineer
  107. 107. Jobsearch backend team produces a lot of data
  108. 108. RAD “Resilient Artifact Distribution”
  109. 109. Design Goals 1 Minimize network bottlenecks 2 Loose coupling 3 Automatic recovery 4 Developer empowerment 5 System-wide visibility
  110. 110. Design Goals 1 Minimize network bottlenecks 2 Loose coupling 3 Automatic recovery 4 Developer empowerment 5 System-wide visibility
  111. 111. Design Goals 1 Minimize network bottlenecks 2 Loose coupling 3 Automatic recovery 4 Developer empowerment 5 System-wide visibility
  112. 112. Design Goals 1 Minimize network bottlenecks 2 Loose coupling 3 Automatic recovery 4 Developer empowerment 5 System-wide visibility
  113. 113. Design Goals 1 Minimize network bottlenecks 2 Loose coupling 3 Automatic recovery 4 Developer empowerment 5 System-wide visibility
  114. 114. Design Goals 1 Minimize network bottlenecks 2 Loose coupling 3 Automatic recovery 4 Developer empowerment 5 System-wide visibility
  115. 115. No more point-to-point
  116. 116. Measure time and network traffic Bittorrent: Would it work? Sample replication to 3 consumers https://github.com/shevek/ttorrent
  117. 117. Network Test - total MB received + transmitted for a 700MB artifact (machine / rsync): Producer 2,240; Consumer 1 746; Consumer 2 747; Consumer 3 747
  118. 118. Network Test - total MB received + transmitted for a 700MB artifact (machine / rsync / bittorrent): Producer 2,240 / 782; Consumer 1 746 / 1,226; Consumer 2 747 / 1,225; Consumer 3 747 / 1,245
  119. 119. Network Test - total MB received + transmitted for a 700MB artifact (machine / rsync / bittorrent): Producer 2,240 / 782; Consumer 1 746 / 1,226; Consumer 2 747 / 1,225; Consumer 3 747 / 1,245; Total 4,481 / 4,480
  120. 120. Timing Test: rsync 24 minutes, bittorrent 5.5 minutes
  121. 121. How does bittorrent work?
  122. 122. Data split into small pieces of equal size
  123. 123. Hash computed for each piece
  124. 124. File3.bin (50MB) File1.bin (100MB) File2.bin (200MB) jobindex.1
  125. 125. File3.bin (50MB) File1.bin (100MB) File2.bin (200MB) jobindex.1 Piece 1: 75 MB Piece 2: 75 MB Piece 3: 75 MB Piece 4: 75 MB Piece 5: 25 MB
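  A quick way to see this piece hashing for yourself, using the piece size and file names from the example above: concatenate the files in torrent order, cut the stream into fixed-size pieces, and hash each piece. Because pieces span file boundaries, shifting any byte shifts every later hash, which is exactly the problem slides 140-148 run into:
    $ cat file1.bin file2.bin file3.bin | split --bytes=75M - /tmp/piece.
    $ sha1sum /tmp/piece.*        # one hash per 75MB piece; the final piece is short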
  126. 126. torrent metadata file
  127. 127. { files:file1.bin,100MB; file2.bin,200MB; file3.bin,50MB } { piecelength:75MB } { infohash:XSDJSK;JDISJLD;DJKJDB;KDJB OP;FJEIODK; } .torrent metadata file:
  128. 128. { files:file1.bin,100MB; file2.bin,200MB; file3.bin,50MB } { piecelength:75MB } { infohash:XSDJSK;JDISJLD;DJKJDB;KDJB OP;FJEIODK; } .torrent metadata file:
  129. 129. { files:file1.bin,100MB; file2.bin,200MB; file3.bin,50MB } { piecelength:75MB } { infohash:XSDJSK;JDISJLD;DJKJDB;KDJB OP;FJEIODK; } .torrent metadata file:
  130. 130. Tracker Coordinator of the download
  131. 131. Seeder Any client providing data
  132. 132. Seeder Data I have pieces for info hash Tracker .torrent Info Hash File manifest
  133. 133. Data .torrent Info Hash File manifest Seeder Tracker Info hash peer Map Ok! I have pieces for info hash
  134. 134. Consumer Any client downloading data
  135. 135. Peers for infohash Consumer Tracker .torrent Info Hash File manifest Tracker URL Map Info hash peer How a consumer gets the first piece
  136. 136. Peers for infohash Peerlist Consumer Tracker .torrent Info Hash File manifest Tracker URL Map Info hash peer How a consumer gets the first piece
  137. 137. Data .torrent Info Hash File manifest Consumer/ Seeder I have pieces for infohash Tracker Info hash peer Map It is also a seeder
  138. 138. Consumer 1 Seeding as it downloads Consumer 2 Seeding as it downloads Consumer 3 Seeding as it downloads Seeder SWARM
  139. 139. Didn’t quite meet our needs
  140. 140. Piece 1: HASH1 Piece 2: HASH2 Piece 3: HASH3 Piece 4: HASH4 Piece 5: HASH5 File3.bin (50MB) File1.bin (100MB) File2.bin (200MB) jobindex.1
  141. 141. jobindex.2 File4.bin (50MB) File3.bin (50MB) File1.bin (100MB) File2.bin (200MB) File3.bin (50MB) File1.bin (100MB) File2.bin (200MB) jobindex.1
  142. 142. jobindex.2 File4.bin (50MB) File3.bin (50MB) File1.bin (100MB) File2.bin (200MB) Piece 1: HASH1 Piece 2: HASH2 Piece 3: HASH3 Piece 4: HASH4 Piece 5: HASH6 Piece 1: HASH1 Piece 2: HASH2 Piece 3: HASH3 Piece 4: HASH4 Piece 5: HASH5 Piece 6: HASH7 File3.bin (50MB) File1.bin (100MB) File2.bin (200MB) jobindex.1
  143. 143. jobindex.2 File4.bin (50MB) File3.bin (50MB) File1.bin (100MB) File2.bin (200MB) Piece 1: HASH1 Piece 2: HASH2 Piece 3: HASH3 Piece 4: HASH4 Piece 5: HASH6 Piece 1: HASH1 Piece 2: HASH2 Piece 3: HASH3 Piece 4: HASH4 Piece 5: HASH5 Piece 6: HASH7 File3.bin (50MB) File1.bin (100MB) File2.bin (200MB) jobindex.1
  144. 144. File3.bin (50MB) File1.bin (100MB) File2.bin (200MB) jobindex.1 jobindex.2 File4.bin (50MB) File3.bin (50MB) File1.bin (100MB) File2.bin (200MB) jobindex.2 File0.bin (50MB) File3.bin (50MB) File1.bin (100MB) File2.bin (200MB)
  145. 145. Piece 1: HASH6 Piece 2: HASH7 Piece 3: HASH8 Piece 4: HASH9 Piece 5: HASH10 Piece 1: HASH1 Piece 2: HASH2 Piece 3: HASH3 Piece 4: HASH4 Piece 5: HASH5 Piece 6: HASH11 File3.bin (50MB) File1.bin (100MB) File2.bin (200MB) jobindex.1 jobindex.2 File4.bin (50MB) File3.bin (50MB) File1.bin (100MB) File2.bin (200MB) jobindex.2 File0.bin (50MB) File3.bin (50MB) File1.bin (100MB) File2.bin (200MB)
  146. 146. Control sort order?
  147. 147. jobindex.2 File3.bin (50MB) File1.bin (150MB) File2.bin (200MB) Piece 1: HASH6 Piece 2: HASH7 Piece 3: HASH8 Piece 4: HASH9 Piece 5: HASH10 Piece 1: HASH1 Piece 2: HASH2 Piece 3: HASH3 Piece 4: HASH4 Piece 5: HASH5 Piece 6: HASH11 File3.bin (50MB) File1.bin (100MB) File2.bin (200MB) jobindex.1
  148. 148. File3.bin (50MB) File1.bin (100MB) File2.bin (200MB) jobindex.1 Piece 1: HASH6 Piece 2: HASH7 Piece 3: HASH8 Piece 4: HASH9 Piece 5: HASH10 Piece 1: HASH1 Piece 2: HASH2 Piece 3: HASH3 Piece 4: HASH4 Piece 5: HASH5 Piece 6: HASH11 File3.bin (50MB) File1.bin (150MB) File2.bin (200MB) jobindex.2
  149. 149. hash each file?
  150. 150. Compare files not pieces
  151. 151. { files:file1.bin,100MB,DATETIME; file2.bin,200MB,DATETIME; file3.bin,50MB,DATETIME } { piecelength:75MB } ... .torrent metadata file contents:
  152. 152. File1.bin (100MB) File2.bin (200MB) File3.bin (50MB) jobindex.1 Piece 1: File 0, File1 Piece 2: File 1 Piece 3: File 1, File 2 Piece 4: File 2 Piece 5: File 2, File 3 Piece 6: File 3 File1.bin (100MB) File2.bin (200MB) File3.bin (50MB) jobindex.2 File0.bin (50MB)
  153. 153. File1.bin (100MB) File2.bin (200MB) File3.bin (50MB) jobindex.1 File1.bin (100MB) File2.bin (200MB) File3.bin (50MB) jobindex.2 File0.bin (50MB) Piece 1: File 0, File1 Piece 2: File 1 Piece 3: File 1, File 2 Piece 4: File 2 Piece 5: File 2, File 3 Piece 6: File 3
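  A sketch of the file-level comparison the modified metadata enables, assuming the manifest carries each file's size and timestamp as on slide 151 (read_manifest_entry is a hypothetical helper, not part of RAD's real API): any file whose size and mtime match the previous version is hardlinked instead of re-downloaded:
    $ old=/data/jobindex.1; new=/data/jobindex.2; mkdir -p "$new"
    $ for f in "$old"/*.bin; do
    >   name=$(basename "$f")
    >   # stat %s = size in bytes, %Y = mtime, matching the size+DATETIME fields in the manifest
    >   if [ "$(stat -c '%s %Y' "$f")" = "$(read_manifest_entry "$name")" ]; then
    >     ln "$f" "$new/$name"    # unchanged: reuse via hardlink, skip the download
    >   fi
    > done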
  154. 154. Bittorrent Evaluation Result: substantially faster; drastically reduces network load on the producer machine; horizontally scalable
  155. 155. Design Goals 1 Minimize network bottlenecks 2 Loose coupling 3 Automatic recovery 4 Developer empowerment 5 System-wide visibility
  156. 156. Service-oriented architecture
  157. 157. Headwater The beginning of a river
  158. 158. Headwater Host Data Producer Data Publish my data
  159. 159. Headwater takes ownership of the data (hardlink + read-only)
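  The mechanics named on this slide, sketched with made-up paths: hardlink the producer's files into a directory Headwater controls (no copy, no extra disk) and drop write permission so a published version can never be mutated:
    $ cp -al /data/jobindex.4 /rad/headwater/jobindex.4    # -l: hardlink instead of copying data
    $ chmod -R a-w /rad/headwater/jobindex.4               # published versions are read-only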
  160. 160. Headwater Host Data Producer Data Publish my data Will do!
  161. 161. Headwater Host Data Producer Data
  162. 162. create the .torrent metadata file
  163. 163. Headwater The beginning of a river River Course the water carves across the landscape
  164. 164. [Diagram: three Rhone nodes + Zookeeper] Rhone: multi-master coordinator service
  165. 165. Rhone Headwater Host Data Producer Data
  166. 166. Rhone Headwater Host Data Producer Data data.version torrent metadata
  167. 167. Rhone Headwater Host Data Producer Data data.version torrent metadata
  168. 168. Rhone Headwater Host Data Producer Data Tracker .torrent metadata can be retrieved data.version torrent metadata
  169. 169. Headwater The beginning of a river River Course the water carves across the landscape Delta The end of the river
  170. 170. Subscribe to data! Delta Host Data Consumer
  171. 171. Make all subscribed artifacts available
  172. 172. RhoneDelta Host Data Consumer Headwater Host Data Producer Data
  173. 173. Delta Data Consumer Rhone Host
  174. 174. Tracker Delta Host Data Consumer Data /rad/data
  175. 175. Delta Host Data Consumer Data Where’s the latest data? /rad/data
  176. 176. It’s at /rad/data Delta Host Data Consumer Data Where’s the latest data? /rad/data
  177. 177. Delta Host Data Consumer Data /rad/data
  178. 178. Keep all subscribed artifacts current
  179. 179. Delta Data Consumer Rhone Host
  180. 180. Rhone Data Host Artifact Availability Flow Delta Headwater Host Data Consumer Data Producer Data
  181. 181. Design Goals 1 Minimize network bottlenecks 2 Loose coupling 3 Automatic recovery 4 Developer empowerment 5 System-wide visibility
  182. 182. Rhone Headwater Host Data Producer Data Crash!
  183. 183. Rhone Headwater Data Producer Data data.version torrent metadata Tracker Crash! Host
  184. 184. Development philosophy: Make recovery the common case
  185. 185. Durable state with atomic filesystem operations
  186. 186. All service calls are idempotent
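  A sketch of what slide 185's "durable state with atomic filesystem operations" can look like in practice (the state file path is assumed for illustration): write to a temporary file, then rename it into place. After a crash the state file is always either the old version or the new one, never half-written, so the whole step can be retried safely, which is what makes the idempotent service calls on slide 186 cheap to re-run:
    $ printf 'current=jobindex.4\n' > /rad/state/jobindex.tmp
    $ mv -T /rad/state/jobindex.tmp /rad/state/jobindex    # rename is atomic within a filesystem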
  187. 187. RAD handles network recovery
  188. 188. DC4 DC3 DC2 DC1 rsync is point-to-point
  189. 189. DC1 DC4 DC3 DC2 bittorrent peer-to-peer
  190. 190. Down DC1 DC4 DC3 DC2 No problem with bittorrent swarm
  191. 191. RAD treats each artifact independently
  192. 192. Design Goals 1 Minimize network bottlenecks 2 Loose coupling 3 Automatic recovery 4 Developer empowerment 5 System-wide visibility
  193. 193. Adding a new artifact in the rsync system
  194. 194. Ask System Administrators
  195. 195. Adding a new artifact in the RAD system
  196. 196. Declare it in the code
  197. 197. REST API is language agnostic
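  For flavor only: the deck doesn't show the actual endpoints, so the URLs and payloads below are invented, but any language that can speak HTTP could publish or subscribe along these lines:
    $ curl -X POST http://headwater.example:8080/artifacts/jobindex/versions -d @manifest.json   # producer publishes (hypothetical endpoint)
    $ curl -X PUT http://delta.example:8080/subscriptions/jobindex                               # consumer subscribes (hypothetical endpoint)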
  198. 198. Design Goals 1 Minimize network bottlenecks 2 Loose coupling 3 Automatic recovery 4 Developer empowerment 5 System-wide visibility
  199. 199. Rhone already knows all artifacts
  200. 200. Rhone stores a list of versions by artifact: artifactA - versions 4, 5, 6; artifactB - versions 221, 226, 227, 228; artifactC - version 1
  201. 201. Heartbeats from Delta and Headwater
  202. 202. Rhone has system-wide view
  203. 203. RADAR: Developers can easily see where their data is
  204. 204. RADAR: Developers can easily see where their data is
  205. 205. RADAR: Developers can easily see where their data is
  206. 206. RADAR: Developers can easily see where their data is
  207. 207. start simple and iterate
  208. 208. 2011 52 countries 2004 Indeed 2008 6 countries 2009 23 countries 2014 rsync limits 1st artifact migrated to RAD
  209. 209. Lesson learned: prevent people from using the system incorrectly
  210. 210. We made configuration TOO easy
  211. 211. New Requirement: protect the disks
  212. 212. Delta Prevent downloading artifacts that will fill the disk (and alarm)
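  A rough sketch of the kind of guard Delta needs here, assuming the artifact's total size is known from its metadata (the variable names are illustrative):
    $ needed_kb=$(( artifact_bytes / 1024 ))
    $ free_kb=$(df --output=avail -k /rad/data | tail -n 1)
    $ if [ "$free_kb" -lt "$needed_kb" ]; then echo "refusing download: would fill disk" >&2; exit 1; fi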
  213. 213. 2011 52 countries 2004 Indeed 2008 6 countries 2009 23 countries 2014 rsync limits 1st artifact migrated to RAD 2015 critical artifacts migrated
  214. 214. 2011 52 countries 2004 Indeed 2008 6 countries 2009 23 countries 2014 rsync limits 1st artifact migrated to RAD 2015 critical artifacts migrated 2016 80 RAD artifacts
  215. 215. 2011 52 countries 2004 Indeed 2008 6 countries 2009 23 countries 2014 rsync limits 1st artifact migrated to RAD 2015 critical artifacts migrated 2016 80 RAD artifacts 100 artifacts in 10 years
  216. 216. 100 artifacts in 10 years 2011 52 countries 2004 Indeed 2008 6 countries 2009 23 countries 2014 rsync limits 1st artifact migrated to RAD 2015 critical artifacts migrated 2016 80 RAD artifacts 80 new artifacts in 1 year
  217. 217. RAD Stats (March 23, 2016): 56 unique producers, 7,666 versions published; 670 unique consumers, 52,357 versions downloaded
  218. 218. Duration of JobIndex replication in RAD vs. rsync [Chart: Jan 18 - Jan 19, replication time for RAD and rsync]
  219. 219. replicating over 65,193 TB per month
  220. 220. Learn More Engineering blog & talks http://indeed.tech Open Source http://opensource.indeedeng.io Careers http://indeed.jobs Twitter @IndeedEng
