4. Personal profiles in family trees
Family trees: a complex network of people, each with personal info, life events, and connections to relatives.
6. Personal profiles in family trees – Sharding MySQL
Family trees: a complex network of people, each with personal info, life events, and connections to relatives. Many interconnected MySQL tables. Millions of daily updates.
[Diagram: one family site (Site A) and its interlinked MySQL tables: Individual, Family, ChildInFamily, Event, Tags, Photos.]
7. Personal profiles in family trees – Sharding MySQL
Good response time for single-family-site access, using MySQL database sharding.
Over 650 shards on >20 physical hosts, and growing.
[Diagram: Shard 1 … Shard 650, each shard hosting many family sites, each site with its own Individual, Family, ChildInFamily, Event, Tags, Photos tables.]
8. The issue with RDBMS sharding
Sharding becomes problematic when multiple shards are needed at once, for example to display search results and profile matches coming from many family trees. It is also costly to scale for more readers.
Options:
• Build a custom parallel-fetch aggregator service (sketched below)
• NoSQL
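For a sense of the first option, here is a minimal sketch of a parallel-fetch aggregator (ShardClient and Profile are hypothetical stand-ins, not MyHeritage code): the query fans out to every shard concurrently and the results are merged, so latency is bounded by the slowest shard, and every extra reader multiplies load on all shards.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

interface ShardClient { List<Profile> queryMatches(int userId); }
class Profile { /* fields omitted */ }

public class ShardAggregator {
    private final ExecutorService pool = Executors.newFixedThreadPool(32);

    public List<Profile> fetchFromShards(List<ShardClient> shards, int userId)
            throws InterruptedException, ExecutionException {
        List<Future<List<Profile>>> futures = new ArrayList<>();
        for (ShardClient shard : shards) {
            futures.add(pool.submit(() -> shard.queryMatches(userId)));  // fan out
        }
        List<Profile> merged = new ArrayList<>();
        for (Future<List<Profile>> f : futures) {
            merged.addAll(f.get());  // blocks until the slowest shard answers
        }
        return merged;
    }
}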
9. Cassandra to the rescue
Cassandra recap:
• Key-value store
• Ring-based consistent-hashing cluster
• Support for clusters split between data centers
• Data redundancy and consistency at a user-controlled level (example below)
• Append-only, high write throughput
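The "user-controlled level" point deserves a concrete illustration. A minimal sketch with the DataStax Java driver (the contact point and the demo.kv table are placeholders): each statement picks its own consistency level, trading latency against safety.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class ConsistencyDemo {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // With RF=3, QUORUM needs 2 of 3 replicas: reads and writes stay
        // consistent while the cluster survives a single node failure.
        Statement write = new SimpleStatement(
                "UPDATE demo.kv SET value = 'v' WHERE key = 'k'")
                .setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(write);

        // A cheaper read that tolerates momentary staleness.
        Statement read = new SimpleStatement(
                "SELECT value FROM demo.kv WHERE key = 'k'")
                .setConsistencyLevel(ConsistencyLevel.ONE);
        System.out.println(session.execute(read).one().getString("value"));
        cluster.close();
    }
}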
11. PeopleStore: Overview
• Store 2.6 billion profiles (growing by over a million a day)
• Provide very fast read access
• Shadow the MySQL source of truth (at least for the foreseeable future)
• Data consistency is critical
• Store each person as one aggregated record in Cassandra, including ALL info for typical uses, to minimize nested/follow-up queries: get all information needed at once (see the fetch sketch below)
• Decision point: replicate relatives, or point to their records?
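A sketch of what the aggregated-record design buys (column names follow the schema on the next slides; the IDs are made up): one partition read returns the profile and its JSON blobs, with no follow-up queries.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class PersonFetch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("peoplestore");

        PreparedStatement ps = session.prepare(
                "SELECT name, gender, is_alive, events, photos, relatives "
                + "FROM people WHERE site_id = ? AND tree_id = ? AND individual_id = ?");
        Row row = session.execute(ps.bind(42, 1, 1000)).one();
        if (row != null) {
            // One round trip: events/photos/relatives arrive as JSON text.
            System.out.println(row.getString("name") + " " + row.getString("events"));
        }
        cluster.close();
    }
}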
13. PeopleStore: Architecture
[Diagram: Web Servers (PHP) issue multi-item fetches to a pool of PeopleStore microservices, which read from the Cassandra cluster. MySQL, a highly sharded RDBMS, remains the source of truth and feeds synchronous updates into Cassandra. A Hadoop cluster performs mass loading for the batch first load / reload; everything else runs as online flows.]
14. PeopleStore: Schema
CREATE TABLE peoplestore.people (
  site_id int,
  tree_id int,
  individual_id int,
  adopted_child_in_family_id int,
  child_in_family_id int,
  foster_child_in_family_id int,
  gender text,
  is_alive boolean,
  privacy_level int,
  last_update int,
  loading_mode int,
  loading_time timestamp,
  thumbnail text,
  name text,
  events text,
  photos text,
  relatives text,
  PRIMARY KEY (site_id, tree_id, individual_id)
) WITH …
  compaction = {'class': '...LeveledCompactionStrategy'};
6 hosts, RF=3
Callouts: the key columns form the ID; the scalar columns are metadata; name, events, photos, and relatives are JSON blobs.
• JSON: flexibility of structure (stored as text, not the native 2.2 JSON support)
• Split fields: flexibility to fetch only the fields needed
• Not using a Collection for plural fields, due to a Cassandra limitation on using an IN clause on a table with Collection fields (a non-issue for us; see the sketch below)
• Future: use User Defined Types
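Since a single IN query is the reason Collections were avoided, here is a sketch of that multi-item fetch (the IDs are made up; binding a list to "IN ?" works on the last clustering column):

import java.util.Arrays;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class MultiFetch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("peoplestore");

        PreparedStatement ps = session.prepare(
                "SELECT individual_id, name, relatives FROM people "
                + "WHERE site_id = ? AND tree_id = ? AND individual_id IN ?");
        // One query fetches several people from the same tree at once.
        ResultSet rs = session.execute(ps.bind(42, 1, Arrays.asList(10, 11, 12)));
        for (Row row : rs) {
            System.out.println(row.getInt("individual_id") + " -> " + row.getString("name"));
        }
        cluster.close();
    }
}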
15. PeopleStore: Schema
Only minimal relatives info is stored: ID + name. Fetching full relative data requires another fetch.
16. PeopleStore: Schema
Started with Size-Tiered Compaction; it generated thousands of SSTables and slowed query time. Moving to Leveled Compaction solved the issue (see below).
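The switch itself is a one-line schema change; a sketch (the sstable_size_in_mb value is illustrative, not from the talk):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SwitchCompaction {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        // Existing SSTables are recompacted into levels in the background.
        session.execute("ALTER TABLE peoplestore.people WITH compaction = "
                + "{'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160}");
        cluster.close();
    }
}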
17. PeopleStore: microservice
Clients
• Control exposure, read/write per flow
• Discover services by listing DNS SRV records (sketched after this slide)
• Clients do round-robin on these services
Services
• A Spring Boot Java REST server
• Deployed as a Docker container managed by Mesos & Marathon
• Mesos manages DNS entries
• Mesos monitors service health
• Metrics sent to JMX
Failure recovery needed despite redundancy
• In write for consistency; in read for availability
[Diagram: Web Servers (PHP) round-robin across PeopleStore microservice (Java) instances, with a write-failure recovery path; Mesos+Marathon maintain the DNS entries.]
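A sketch of the SRV-based discovery plus client-side round-robin (the service name is hypothetical; Mesos-DNS publishes SRV records of the form _<service>._tcp.marathon.mesos):

import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

public class SrvDiscovery {
    private final List<String> endpoints = new ArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    public SrvDiscovery(String srvName) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
        DirContext ctx = new InitialDirContext(env);
        Attribute records = ctx.getAttributes(srvName, new String[]{"SRV"}).get("SRV");
        for (int i = 0; i < records.size(); i++) {
            String[] f = records.get(i).toString().split(" ");  // "priority weight port target"
            endpoints.add(f[3] + ":" + f[2]);
        }
    }

    // Each client rotates through the discovered instances.
    public String nextEndpoint() {
        return endpoints.get(Math.floorMod(next.getAndIncrement(), endpoints.size()));
    }

    public static void main(String[] args) throws NamingException {
        System.out.println(new SrvDiscovery("_peoplestore._tcp.marathon.mesos").nextEndpoint());
    }
}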
18. PeopleStore: Mass Loading
To bootstrap the system, and in case of major schema/logic changes, we had to load 2.2 billion person profiles at once.
Evaluated:
• Cassandra's sstableloader tool
• hdfs2cass from Spotify
Cons:
• Use SSTableSimpleWriter and Cassandra streaming
• Very sensitive to the C* version
Selected: Hadoop + online Cassandra updates
19. PeopleStore: Mass Loading with Hadoop
[Diagram: MySQL → extract and aggregate (MySQL extractor + Pig flow, producing Avro) on the Hadoop cluster → load (Crunch + Cassandra driver) into Cassandra.]
• Tested logged/unlogged BATCH writes; they do NOT help performance
• Had to implement write retries to reach 0 failures (sketched below)
• Collect stats into Hadoop counters
• Load time: 2.2 billion items, 6 Hadoop nodes, 6 C* nodes; ~30k writes per second; ~17 hours of loading plus hours of compaction time; impact on read latency very reasonable
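A sketch of the retry idea (the retry count and backoff are illustrative; the talk only says retries brought failures to 0):

import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.DriverException;

public class RetryingWriter {
    private static final int MAX_RETRIES = 5;  // assumed value

    public static void writeWithRetry(Session session, Statement stmt)
            throws InterruptedException {
        for (int attempt = 0; ; attempt++) {
            try {
                session.execute(stmt);
                return;  // success
            } catch (DriverException e) {
                if (attempt >= MAX_RETRIES) throw e;  // give up, fail the task
                Thread.sleep(100L << attempt);        // exponential backoff
                // In the real job a Hadoop counter would record the retry.
            }
        }
    }
}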
20. PeopleStore: Mass loading with online updates
Mass loading takes time; in the meantime, online updates keep arriving. The batch load must not overwrite newer online updates.
Tested: lightweight transactions:
INSERT ... IF NOT EXISTS / UPDATE ... IF update_time < <value>
Result: major slowdown, due to massive read-before-write.
Solution: an updated_people table: a small table indicating only the people that changed online while batch loading is running. Read-before-write is viable because the table is small, and >99% of queries return an empty set: insignificant slowdown. (A sketch follows.)
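A sketch of that guard (the updated_people columns are assumed; the talk names only the table): the loader consults the small table first and skips any profile that changed online.

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class GuardedBatchWrite {
    static boolean changedOnline(Session session, int siteId, int treeId, int individualId) {
        Row row = session.execute(
                "SELECT individual_id FROM peoplestore.updated_people "
                + "WHERE site_id = ? AND tree_id = ? AND individual_id = ?",
                siteId, treeId, individualId).one();
        return row != null;  // >99% of lookups come back empty, so this is cheap
    }

    static void loadProfile(Session session, BoundStatement insert,
                            int siteId, int treeId, int individualId) {
        if (changedOnline(session, siteId, treeId, individualId)) {
            return;  // a newer online update exists; do not overwrite it
        }
        session.execute(insert);
    }
}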
[Diagram: during the mass load, both the Hadoop cluster and the online PeopleStore microservices (Java), fed from MySQL, write into Cassandra.]
21. PeopleStore: JVM Tuning
Experienced long GC pauses in the Cassandra nodes.
• Upgraded from Java 1.7 to 1.8.0_65
• Switched from CMS to the G1 garbage collector
Major improvement. This is the default in Cassandra 3.0.
Tune JVM params (/etc/cassandra/conf/cassandra-env.sh); see https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
# highlights:
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -Xms16G -Xmx16G"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=16"
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=16"
22. PeopleStore: Other issues
Experienced unexplained missing rows on read (CASSANDRA-10801). We upgraded the Cassandra nodes from 2.1.11 to 2.1.12 and the Java driver from 2.1.5 to 2.1.9, which solved the issue.
Cassandra driver: Spring @Query annotations cannot handle "IN" queries. Instead, we used CassandraTemplate to build a native query (sketched below).
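A sketch of that workaround, assuming a mapped Person entity (the entity and IDs are placeholders; spring-data-cassandra 1.x CassandraOperations.select(String, Class) runs a raw CQL string):

import java.util.List;
import org.springframework.data.cassandra.core.CassandraOperations;

public class PersonDao {
    private final CassandraOperations template;

    public PersonDao(CassandraOperations template) { this.template = template; }

    public List<Person> findPeople(int siteId, int treeId, List<Integer> ids) {
        // Assemble the IN list ourselves, since @Query could not bind it.
        StringBuilder in = new StringBuilder();
        for (Integer id : ids) {
            if (in.length() > 0) in.append(',');
            in.append(id);  // ints only, so no injection concern here
        }
        String cql = "SELECT * FROM people WHERE site_id = " + siteId
                + " AND tree_id = " + treeId + " AND individual_id IN (" + in + ")";
        return template.select(cql, Person.class);
    }
}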
23. PeopleStore: Results
Reduced latency:
• Matches page: over 50% reduction of load time
• Search results page: 40% reduction of load time
• 90% of microservice calls < 100ms
Reduced load on the MySQL databases:
• From hundreds of queries per page to just a few
25. AccountStore needs: fast user properties and counters
EVERY page on myheritage.com needs access to:
• Summarized user (account) information from multiple sources: used for marketing tracking, affiliate programs, and retargeting; includes properties and counters coming from various sources
• A/B test data: participation and variant selection, for guests and registered members
Requirements:
• Latency: less than a 10ms slowdown for any page
• Data must be fresh
• Storing also for guests: lots of data
• Make the data available to BI systems
Aggregating the data at runtime is too slow, so we must maintain live aggregated data, at a high update rate.
Example:
var gtmDataLayer = [{
  "site_plan": "premium-plus",
  "data_subscription": "no-data-subscription",
  "active_paying": "not-actively-paying",
  "site_visits": 3509,
  "last_mobile_sighting": "2016-02-07 11:10:25",
  ...
}];
26. AccountStore: Overview
Use Cassandra to “store it as you read it”: updated aggregate information and counters on users and guests.
Event subscribers update the aggregate data online as it changes, in two tables: data and counters (a C* limitation; see the sketch below). For example, num_individuals_in_trees changes online as a family tree is modified, and subscription_expiration_date changes as a user becomes a paying subscriber.
A separate Cassandra table maps guests to users as they convert and register.
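A sketch of an event subscriber touching both tables (the columns are the ones named above; the method names are made up). Counters must live in their own table, hence the split update paths:

import java.util.Date;
import java.util.UUID;
import com.datastax.driver.core.Session;

public class AccountEventSubscriber {
    // Counter table: increments only, no read needed.
    public void onIndividualsAdded(Session session, UUID accountUid, long added) {
        session.execute(
                "UPDATE accounts.account_store_counters "
                + "SET num_individuals_in_all_trees = num_individuals_in_all_trees + ? "
                + "WHERE account_uid = ?",
                added, accountUid);
    }

    // Data table: plain column overwrite.
    public void onSubscriptionPurchased(Session session, UUID accountUid, Date expiry) {
        session.execute(
                "UPDATE accounts.account_store_data SET subscription_expiration_date = ? "
                + "WHERE account_uid = ?",
                expiry, accountUid);
    }
}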
27. AccountStore and A/B test cluster topology
Requirement: allow BI systems to collect data, without putting BI load on the production cluster.
Solution: create a fictitious data center in the cluster; both logical data centers live in the same physical datacenter (see the keyspace sketch below).
[Diagram: application clients talk to the App Cassandra data center; the BI system reads from the BI Cassandra data center.]
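A sketch of the keyspace definition behind this (the DC names and replica counts are assumptions): NetworkTopologyStrategy places replicas per logical data center, and app clients pin to their DC with a DC-aware policy so BI reads never hit them.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

public class TopologySetup {
    public static void main(String[] args) {
        // App clients route only to APP_DC; the BI system connects to BI_DC.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("APP_DC"))
                .build();
        Session session = cluster.connect();
        session.execute("CREATE KEYSPACE IF NOT EXISTS accounts WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'APP_DC': 3, 'BI_DC': 1}");
        cluster.close();
    }
}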
28. AccountStore: Schema
Using secondary indexes for non-typical flows. Converted guests keep their UUID, plus a mapping to/from account_id (lookup sketched below).
CREATE TABLE accounts.account_store_data (
  account_uid uuid PRIMARY KEY,
  creation_time timestamp,
  device_types set<text>, -- element type lost in extraction; text is assumed
  highest_site_plan int,
  last_visit timestamp,
  . . .
) WITH ...;
CREATE TABLE accounts.account_id_guest_id (
  account_id int,
  guest_id ascii,
  guest_creation_time timestamp,
  updated_at timestamp,
  uuid uuid,
  PRIMARY KEY ((account_id, guest_id))
) WITH ...;
CREATE INDEX account_id_guest_id_updated_at_idx ON accounts.account_id_guest_id (updated_at);
CREATE INDEX account_id_guest_id_uuid_idx ON accounts.account_id_guest_id (uuid);
CREATE TABLE accounts.account_store_counters (
  account_uid uuid PRIMARY KEY,
  num_individuals_in_all_trees counter,
  num_visits counter,
  . . .
) WITH ...;
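A sketch of the non-typical lookup these indexes enable: resolving a converted guest's mapping row by UUID rather than by the partition key (the UUID value is a placeholder).

import java.util.UUID;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class GuestLookup {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("accounts");
        // Served by account_id_guest_id_uuid_idx instead of the partition key.
        Row row = session.execute(
                "SELECT account_id, guest_id, updated_at FROM account_id_guest_id "
                + "WHERE uuid = ?",
                UUID.fromString("00000000-0000-0000-0000-000000000000")).one();
        if (row != null) {
            System.out.println(row.getInt("account_id") + " / " + row.getString("guest_id"));
        }
        cluster.close();
    }
}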
29. A/B tests
Scale: millions of active users, hundreds of active experiments: billions of rows.
Latency: must not slow down the application; many pages have multiple experiments active on them.
Must allow time-based collection into BI systems.
Classic implementation: sharded MySQL. We already have a cluster sharded by Family Site ID, and we do not want another MySQL cluster sharded by User ID.
Decision: a natural addition to the AccountStore Cassandra cluster.
30. AccountStore: A/B tests schema
CREATE TABLE ab_test.member_to_experiment_ts (
  uuid_bucket int,
  day int,
  hour int,
  experiment_id int,
  uuid uuid,
  created_at timestamp,
  variant_id int,
  PRIMARY KEY ((uuid_bucket, day, hour), experiment_id, uuid)
) WITH ...;
CREATE TABLE ab_test.member_to_experiment (
  account_uid uuid,
  experiment_id int,
  created_at timestamp,
  created_at_ts bigint,
  variant_id int,
  PRIMARY KEY (account_uid, experiment_id)
) WITH ...;
CREATE INDEX member_to_experiment_experiment_id_idx ON ab_test.member_to_experiment (experiment_id);
• member_to_experiment: simple lookup of the experiment variant for a user; the experiment_id index allows a secondary lookup by experiment
• member_to_experiment_ts prevents hotspots in the time-based data: uuid bucketing ensures partitioning, but reading requires going over all buckets (sketched below)
• Full dump: using sstable2json
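A sketch of that bucket walk for one day/hour (NUM_BUCKETS is an assumption; the talk does not say how many buckets are used):

import java.util.ArrayList;
import java.util.List;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class HourlyCollector {
    private static final int NUM_BUCKETS = 100;  // assumed bucket count

    public static List<Row> collect(Session session, int day, int hour) {
        PreparedStatement ps = session.prepare(
                "SELECT experiment_id, uuid, variant_id, created_at "
                + "FROM ab_test.member_to_experiment_ts "
                + "WHERE uuid_bucket = ? AND day = ? AND hour = ?");
        List<Row> rows = new ArrayList<>();
        // One partition per bucket: no hotspot on write, a fan-out on read.
        for (int bucket = 0; bucket < NUM_BUCKETS; bucket++) {
            rows.addAll(session.execute(ps.bind(bucket, day, hour)).all());
        }
        return rows;
    }
}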