Building a Real Time Analytics API
at Scale
DataXDay, May 17th 2018
Sylvain Friquet
@sylvainfriquet
Software Engineer
Algolia: Search as a Service
As-you-type Speed
Results in milliseconds
at every keystroke.
Relevance
Finding the best content
for every intent.
User Experience
Delightfully engaging,
impressively intuitive.
@DataXDay
Algolia: Search as a Service
@DataXDay
Algolia by the numbers
16 Regions
55 Data centers
40B Searches /mo
150B API calls /mo
Founded 2012
200 Employees
4500 Customers
$74M Funding
Algolia Search Analytics
@DataXDay
Where we started
> 4-year-old project
> Elasticsearch
> Self-hosted
> 500M to 40B searches/month
> Upgrading the ES cluster became too tedious
@DataXDay
What we wanted
> Sub-second API response time
> Low latency
> Large retention
> Billions of events per day
> Scale with us
> Hosted solution
@DataXDay
The big picture
@DataXDay
Datastore we considered
> BigQuery
> RedShift
> Citus
@DataXDay
Citus
> Postgres Extension
> Distributed
> Multi Tenant
> Near real time analytics
> Scale Out
@DataXDay
Sub-second analytics query
> Ingesting raw events
> Rolling them up
> Sub-second API queries
@DataXDay
Rollup
> Aggregation
> Coarse-grained analysis
> Pre-determined queries
@DataXDay
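The rollup idea can be sketched outside SQL. A minimal Python sketch, assuming events arrive as `(timestamp, app_id, query)` tuples; a plain dict stands in for the rollup table that the talk builds in Citus:

```python
from collections import defaultdict
from datetime import datetime, timedelta


def rollup_5min(events):
    """Aggregate raw (timestamp, app_id, query) events into 5-minute buckets."""
    buckets = defaultdict(int)
    for ts, app_id, query in events:
        # Truncate the timestamp down to the start of its 5-minute window.
        bucket = ts - timedelta(minutes=ts.minute % 5,
                                seconds=ts.second,
                                microseconds=ts.microsecond)
        buckets[(app_id, bucket)] += 1
    return dict(buckets)


events = [
    (datetime(2018, 5, 17, 10, 1, 30), "app1", "shoes"),
    (datetime(2018, 5, 17, 10, 3, 5),  "app1", "bags"),
    (datetime(2018, 5, 17, 10, 7, 0),  "app1", "shoes"),
]
counts = rollup_5min(events)
# Two events fall in the 10:00 window, one in the 10:05 window.
```

Pre-determined queries ("queries per 5 minutes for app X") then read the small bucket table instead of scanning raw events.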
Ingesting raw events
> Simple schema
> Shard by tenant
> Batch insert (parallel COPY)
> Up to 7M rows/s

CREATE TABLE queries (
    app_id text,
    timestamp timestamp,
    query text,
    user_id text,
    created_at timestamp default now()
);

SELECT create_distributed_table('queries', 'app_id');

@DataXDay
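Sharding by tenant means every row for an `app_id` lives on one worker. A hypothetical Python sketch of the routing idea — Citus uses its own hash function on the distribution column, not MD5, and `shard_for` is an illustrative name:

```python
import hashlib

NUM_SHARDS = 64  # the cluster size quoted later in the talk


def shard_for(app_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a tenant key onto a shard.

    Illustrative stand-in for Citus's hash distribution: the point is
    that the mapping is deterministic, so every row for one tenant
    lands on the same shard.
    """
    digest = hashlib.md5(app_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards
```

Because API queries are always filtered by `app_id`, each one can then be routed to a single shard instead of fanning out across the whole cluster.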
Rollup table
> Aggregate metrics per time window
> TOPN and HLL extensions
> Co-located with raw event tables

CREATE TABLE rollups_5min (
    timestamp timestamp,
    app_id text,
    query_count bigint,
    user_count hll,
    top_queries jsonb
);

SELECT create_distributed_table('rollups_5min', 'app_id');

@DataXDay
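The reason the rollup table stores HLL sketches rather than plain distinct counts: sketches merge across windows. A sketch of that property, using exact Python sets as a stand-in for the postgresql-hll type — sets, unlike HLL, grow with cardinality, but they merge the same way:

```python
# Each 5-minute rollup stores a sketch of the user ids seen in that window.
# Plain sets stand in for the hll column type: like HLL, sets can be merged
# (union), so distinct counts compose across windows; unlike HLL, sets are
# exact but take space proportional to cardinality.
window_a = {"u1", "u2", "u3"}   # users seen 10:00-10:05
window_b = {"u2", "u4"}         # users seen 10:05-10:10

merged = window_a | window_b    # HLL union ~ set union
distinct_users = len(merged)    # hll_cardinality ~ len
# Summing per-window counts (3 + 2 = 5) would double-count user u2;
# merging the sketches gives the correct answer, 4.
```

Plain `bigint` distinct counts could not be combined this way, which is why the sketch, not the number, is what gets rolled up.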
Rollup query
> Periodic rollup
> Concurrently executed across workers
> Handles out-of-order events
> Incremental or idempotent

INSERT INTO rollups_5min
SELECT
    date_trunc('seconds', …) AS minute,
    app_id,
    count(*) AS query_count,
    hll_add_agg(hll_hash_bigint(user_id)) AS user_count,
    topn_add_agg(query) AS top_queries
FROM queries
WHERE created_at >= $1 AND created_at <= $2
GROUP BY app_id, minute
ON CONFLICT (app_id, minute)
DO UPDATE SET
    query_count = query_count + EXCLUDED.query_count,
    user_count = user_count + EXCLUDED.user_count,
    top_queries = top_queries + EXCLUDED.top_queries;

@DataXDay
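The ON CONFLICT clause is what makes the rollup incremental and safe for out-of-order events. A minimal Python model of the merge, with a dict standing in for the distributed table and sets for the HLL column (`upsert_rollup` is an illustrative name):

```python
def upsert_rollup(rollups: dict, key, query_count: int, users: set):
    """Mimic INSERT ... ON CONFLICT DO UPDATE with a dict.

    A new (app_id, minute) key inserts a fresh row; an existing key
    merges the increment into it, so a late event simply folds into
    the bucket it belongs to on a later rollup run.
    """
    if key in rollups:
        old_count, old_users = rollups[key]
        rollups[key] = (old_count + query_count, old_users | users)
    else:
        rollups[key] = (query_count, users)


rollups = {}
upsert_rollup(rollups, ("app1", "10:00"), 2, {"u1", "u2"})
# A late (out-of-order) event for the same window arrives in the next run:
upsert_rollup(rollups, ("app1", "10:00"), 1, {"u2"})
```

Because each run only adds the delta for its time window, re-running a window or receiving stragglers never loses data; it just merges more into the same bucket.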
Rolling up the rollups
> Further aggregate 5-minute rollups into 1-day rollups
> 200x to 50,000x compression ratio in our case
@DataXDay
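The compression ratio is simply raw rows divided by the number of buckets they collapse into. A worked example with a hypothetical tenant's numbers — the 200x to 50,000x range from the slide depends on each tenant's event volume:

```python
# One rollup row per (app_id, 5-minute window): 288 windows per day.
buckets_per_day = 24 * 12                # = 288
raw_events_per_day = 1_000_000           # hypothetical single tenant
compression = raw_events_per_day / buckets_per_day   # ~3,472x
```

Rolling the 5-minute rows up again into 1-day rows multiplies the ratio by another factor of up to 288, which is how the high end of the range is reached.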
API queries
Sample queries
> Count
SELECT sum(query_count) FROM … rollups_1day UNION ALL rollups_5min … WHERE …
> Approximate Distinct Count
SELECT hll_cardinality(sum(user_count))::bigint FROM ...
> TopN
SELECT (topn(topn_union_agg(top_queries), 10)).* FROM ...
@DataXDay
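The TopN query works for the same reason the HLL one does: per-window sketches add. A sketch with `collections.Counter` standing in for the TopN type — Counter is exact, while TopN keeps approximate counts in bounded space, but both support union followed by top-k:

```python
from collections import Counter

# Per-window "top queries" sketches; Counters stand in for the TopN type.
window_a = Counter({"shoes": 50, "bags": 30, "hats": 5})
window_b = Counter({"shoes": 20, "hats": 40})

merged = window_a + window_b     # topn_union_agg ~ Counter addition
top2 = merged.most_common(2)     # topn(..., 2)
# -> [("shoes", 70), ("hats", 45)]
```

Note that "hats" is only second globally after merging; ranking each window separately and combining the ranks would have missed that, which is why the sketches, not the rankings, are stored.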
Some numbers
> Every 5 min: ~20s to roll up 20M rows
> 64 shards, 48 vCPUs, 366 GB RAM
> API latency: p99 < 800ms, p95 < 500ms
@DataXDay
Conclusion
> Rollup approach working at scale
> Citus becoming the foundation for several new
products (Click Analytics...)
@DataXDay
Thank you
The video of this presentation
will soon be available at dataxday.fr
Thanks to our sponsors
Stay tuned by following @DataXDay
