Building an analytics API
Apidays Paris - December 2021
David Wobrock
Who am I?
- French/German living in Paris
- Senior Lead API Engineer @ Botify
- French Tech Start-up in Paris
- Has worked a bit around APIs
https://twitter.com/davidwobrock
https://github.com/David-Wobrock
https://www.linkedin.com/in/david-wobrock
Plan
1. Let’s define analytics
2. Understanding our use case
3. Architecture
4. Lessons learned from the journey
1. Definitions
Relational vs. analytics data
Relational data
- Structured
- Normalized
- Constrained
- Linked
Transactional DB
Online transaction processing - OLTP
- Real-time transactions
- CRUD in a transactional model
Ex: efficient to access specific rows
Analytical data
- Less structured
- Less constrained
Analytical DB
Online analytical processing - OLAP
- Analyse huge data sets
Ex: efficient aggregation of millions of rows
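
To make the contrast concrete, here is a minimal sketch of the two access patterns in Python; the pages table and its columns are invented for illustration only.

import sqlite3

# A tiny, hypothetical "pages" table, just to contrast the two access patterns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, depth INTEGER, load_time_ms INTEGER)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [("/home", 0, 120), ("/products/a", 2, 340), ("/products/b", 2, 410)],
)

# OLTP-style access: fetch one specific row, typically served by an index.
row = conn.execute(
    "SELECT load_time_ms FROM pages WHERE url = ?", ("/products/a",)
).fetchone()

# OLAP-style access: scan and aggregate many rows, grouped by a dimension.
aggregates = conn.execute(
    "SELECT depth, COUNT(*), AVG(load_time_ms) FROM pages GROUP BY depth"
).fetchall()

print(row, aggregates)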
REST API
When talking about APIs, we think a lot about REST APIs.
- Identify your resource (often linked to a relational table)
- Hyperlinks to fetch related resources (often following a foreign key)
- Methods to alter resources (often generating an SQL statement for the CRUD operation)
In modern web applications, data often comes from a relational database.
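
As a rough sketch of these conventions (the resource, table and column names below are hypothetical, not any specific framework's API), the mapping from HTTP to SQL typically looks like this:

# Minimal sketch: a resource maps to a table, HTTP methods map to SQL statements,
# and hyperlinks to related resources follow foreign keys.
def to_sql(method: str, resource: str, resource_id: int | None = None) -> str:
    table = resource  # e.g. the /projects resource maps to a "projects" table
    if method == "GET" and resource_id is not None:
        return f"SELECT * FROM {table} WHERE id = {resource_id}"
    if method == "POST":
        return f"INSERT INTO {table} (...) VALUES (...)"
    if method == "PUT":
        return f"UPDATE {table} SET ... WHERE id = {resource_id}"
    if method == "DELETE":
        return f"DELETE FROM {table} WHERE id = {resource_id}"
    raise ValueError(f"Unsupported method: {method}")

# A related resource is exposed as a hyperlink that follows a foreign key:
# GET /projects/42/analyses  ->  SELECT * FROM analyses WHERE project_id = 42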
An API for analytics
- Data querying can also be done through a REST API
- Insertion, update and deletion will often happen from other sources
How to define an analytics REST API?
You don’t have resources or foreign key constraints, and the use cases are numerous
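
One way to picture the difference, with hypothetical paths and fields rather than a real API:

# REST style: the URL identifies a resource, the HTTP method implies the operation.
rest_request = ("GET", "/projects/42/pages/1337", None)

# Analytics style: a single generic endpoint, where the body describes the computation.
analytics_request = (
    "POST",
    "/projects/42/query",
    {"metrics": ["count_urls"], "dimensions": ["http_code"], "filters": None},
)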
2. Our use case
Context about Botify
Botify is a Search Engine Optimisation (SEO) platform.
In short, and in technical terms, we
- ingest SEO data from various sources
- filter, join and aggregate data
- provide insights and automations
Our analytic use case
Understanding an example website:
- Crawl and analyse 50 million pages
- Ingest 250 million rows of Apache logs
- Get 100 million events from Google Analytics
- 70 million keywords on Google
Need to be able to express any business use case:
- Which product pages lost the most traffic since the website migration?
- Which new keywords gained the most traffic compared to last week?
- What is the loading time of the pages that are in my sitemap and that Google crawled?
=> All of these require multiple data sources, different aggregations and dimensions
Enable data consumption
Analytical data can be consumed in many ways
- in an Application through predefined or custom charts
- through a flexible data explorer
- in BI and dashboarding tools
- through an interactive API
- through an export batch system
- in any other format automatically or manually
We built Botify Query Language (BQL)
- Define metrics aggregated by dimensions
- What sources of data are needed?
- On what timeframe?
- Allow filtering and sorting
https://developers.botify.com/
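
As an illustration only, and not the exact production syntax (the real reference lives at https://developers.botify.com/), a query in the spirit of BQL could look like the following; the collection, field and metric names are invented:

# A hypothetical BQL-like query covering the four points above:
# data sources, timeframe, metrics by dimensions, filtering and sorting.
query = {
    "collections": ["crawl", "search_console"],            # data sources
    "periods": [["2021-11-01", "2021-11-30"]],             # timeframe
    "query": {
        "dimensions": ["pagetype"],                        # group by
        "metrics": ["crawl.count_urls",
                    "search_console.count_clicks"],        # aggregated metrics
        "filters": {"field": "crawl.http_code",
                    "predicate": "eq", "value": 200},      # filtering
        "sort": [{"field": "search_console.count_clicks",
                  "order": "desc"}],                       # sorting
    },
}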
3. Architecture
Overview
Consumers, all going through BQL:
- Botify Application
- REST API
- BI Tools (Data Studio, Tableau, Looker…)
- Data exports
Backends queried by BQL:
- PostgreSQL
- Google BigQuery
- Athena
- S3
- Redis
Business requirement
- Build an interface that allows expressing any use case
- That can query data from any backend
- And respond in any format
Business requirement meeting Tech
- Build an interface that allows expressing any use case
=> flexible and understandable input format
- That can query data from any backend
=> multiple connectors
=> fetching data from the most efficient backend
- And respond in any format
=> generic and adaptable exporters
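
A hedged sketch of how these three requirements can translate into interfaces, assuming a parsed Query object; the class and method names are illustrative, not Botify's actual implementation.

from dataclasses import dataclass
from typing import Iterable, Protocol

@dataclass
class Query:
    sources: list[str]
    dimensions: list[str]
    metrics: list[str]
    filters: dict | None = None

class Backend(Protocol):
    """One connector per storage system (BigQuery, PostgreSQL, S3, Redis…)."""
    def can_execute(self, query: Query) -> bool: ...
    def execute(self, query: Query) -> Iterable[dict]: ...

class Exporter(Protocol):
    """One exporter per output format (JSON, CSV, chart payload…)."""
    def export(self, rows: Iterable[dict]) -> bytes: ...

def run(query: Query, backends: list[Backend], exporter: Exporter) -> bytes:
    # Pick the first backend able to answer; a real router would also weigh
    # cost and expected latency.
    backend = next(b for b in backends if b.can_execute(query))
    return exporter.export(backend.execute(query))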
The right tool for the right job: backends
- Query data warehouse software like BigQuery, Athena, Snowflake…
But not only!
The flexible input format means these databases are not efficient for all cases:
- Compute an aggregation on hundreds of GB (column-oriented DB)
- Fetch basic information from rows (key-value DB)
- Fetch a large piece of information for one item (object storage)
And sometimes a combination of multiple backends is needed
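
A hypothetical routing heuristic matching these cases; the backend names and query fields are invented for illustration.

def choose_backend(query: dict) -> str:
    """Pick a backend from the shape of the query."""
    if query.get("metrics") and query.get("dimensions"):
        return "column_store"    # heavy aggregations over hundreds of GB
    if query.get("lookup_keys"):
        return "key_value"       # basic information for a few known rows
    if query.get("full_object_of"):
        return "object_storage"  # one large blob for a single item
    return "relational"          # general-purpose default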
BQL internals
Query pipeline, from the incoming connection over the Internet to the results:
BQL JSON Schemas → Query Parser → Parsing Infos → Backend(s) Choice → Backend Transformer(s) → Queries → Results
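
A sketch of the first two steps of such a pipeline (validation against a JSON Schema, then parsing); the schema below is a much-reduced stand-in for the real BQL JSON Schemas and the helper names are illustrative.

from jsonschema import validate  # third-party: pip install jsonschema

BQL_SCHEMA = {
    "type": "object",
    "required": ["dimensions", "metrics"],
    "properties": {
        "dimensions": {"type": "array", "items": {"type": "string"}},
        "metrics": {"type": "array", "items": {"type": "string"}},
        "filters": {"type": "object"},
        "sort": {"type": "array"},
    },
    "additionalProperties": False,
}

def parse_query(raw_query: dict) -> dict:
    # 1. Validate the incoming JSON against the schema, rejecting bad input early.
    validate(instance=raw_query, schema=BQL_SCHEMA)
    # 2. Extract the parsing info that the backend choice needs downstream.
    return {
        "needs_aggregation": bool(raw_query["metrics"]),
        "dimensions": raw_query["dimensions"],
        "metrics": raw_query["metrics"],
    }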
Adaptable export capabilities
Export pipeline, from the raw BQL response to delivery over the Internet:
Raw BQL response → Result Transformer → Formatters (JSON, CSV, XML) → Compressors (ZSTD, GZ, ZIP) → Backends (GCS, S3, BQ) → Split files or one file
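
A sketch of composing such an export pipeline, formatting and then compressing a result; upload to the storage backends (GCS, S3, BQ) and file splitting are left out, and the function names are illustrative.

import csv
import gzip
import io
import json

def format_json(rows: list[dict]) -> bytes:
    return json.dumps(rows).encode("utf-8")

def format_csv(rows: list[dict]) -> bytes:
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")

def compress_gzip(data: bytes) -> bytes:
    return gzip.compress(data)

def export(rows: list[dict], formatter, compressor) -> bytes:
    # Each step is pluggable, so any formatter can be combined with any compressor.
    return compressor(formatter(rows))

# Example: a gzipped CSV export of a (tiny) raw BQL response.
payload = export([{"url": "/home", "count_clicks": 42}], format_csv, compress_gzip)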
4. Lessons
Developer Relations and learning curve
Onboarding is no easy task.
A custom DSL with multiple specific concepts implies a steep learning curve.
Mitigation:
- internal and customer training sessions, also for and by developers
- documentation with many examples
- a standard dialect to query the API (a subset of SQL?)
Monitoring and tooling are key
A single API endpoint whose behaviour changes with respect to its body.
- Who makes what calls?
- Understand patterns of slow and/or expensive access
- Breaking API changes have to be avoided
Fine-tuned and specific monitoring is required to understand your system.
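
One hedged way to get that visibility is to log a structured event per query, tagging it with the shape of the body; the field names and logger setup below are illustrative.

import logging
import time

logger = logging.getLogger("analytics_api")

def log_query(user_id: str, query: dict, backend: str, started_at: float) -> None:
    # Since the endpoint is always the same, the useful monitoring dimensions
    # come from the request body and from how the query was served.
    logger.info(
        "bql_query",
        extra={
            "user_id": user_id,                          # who makes what calls
            "sources": query.get("collections", []),     # which data is hit
            "n_metrics": len(query.get("metrics", [])),  # how heavy the query is
            "backend": backend,                          # where it was routed
            "duration_ms": int((time.time() - started_at) * 1000),
        },
    )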
Something similar, but open source
https://cube.dev/
Cube - The Analytics API for Building Data Applications
- Similar Query Language
- Supports many backends
- Defines schemas
Very similar!
But it answers a more generic need than BQL
Conclusion
How to build an analytics API?
- Identify and understand your use case
- Build and optimize around it, since it can be costly and performance at such a scale can be hard to achieve
Thanks for listening!