Data Infrastructure in Kumparan
Content
- Overview of the data infrastructure
- ETL: data source → data lake → data warehouse
  - Data lake: GCS
  - Data warehouse: BigQuery
  - Real-time data sources
  - Batch data sources
- Processing the data
  - Processing engines: BigQuery, Python, Dataproc (Spark)
  - Schedulers: Airflow, Kubernetes CronJob
Content
- Serving
  - Database
    - Elasticsearch
    - Bigtable
    - MySQL
    - Redis
  - API → Python server on GKE
  - Visuals and reports → DOMO
- Use case examples:
  - Visualization and reporting system
  - Recommender system
  - Spam detector
  - Trend predictor
What is Kumparan?
Kumparan is a startup that focuses on media and technology. It focused heavily on media during its first year (2017) and started focusing more on technology in 2018.
Data Infrastructure Overview
Transporter: Data Source → GCS
- Real-time data:
  - Goes through Pub/Sub
  - A Pub/Sub consumer does mini-batching and writes the batches into GCS (sketched below)
- Batch data:
  - Scheduled hourly or daily using Airflow
  - Uses Python scripts
- The data in GCS is stored in Avro format
- A Pub/Sub event is published every time a file is generated in GCS
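A minimal sketch of the real-time path, assuming hypothetical project, subscription, topic, and bucket names and a one-field Avro schema; a production consumer would also guard the buffer with a lock and flush on a timer:

```python
# Mini-batch Pub/Sub messages, write each batch to GCS as an Avro file,
# then publish a "file ready" event for the loader. All names are hypothetical.
import io
import json
import time
import uuid

import fastavro
from google.cloud import pubsub_v1, storage

PROJECT = "my-project"           # assumption
SUBSCRIPTION = "raw-events-sub"  # assumption
NOTIFY_TOPIC = "gcs-files"       # assumption
BUCKET = "data-lake"             # assumption
BATCH_SIZE = 1000

SCHEMA = fastavro.parse_schema({
    "name": "Event", "type": "record",
    "fields": [{"name": "payload", "type": "string"}],
})

storage_client = storage.Client()
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
buffer = []

def flush():
    """Write buffered records to GCS as one Avro file and notify the loader."""
    global buffer
    if not buffer:
        return
    out = io.BytesIO()
    fastavro.writer(out, SCHEMA, buffer)
    path = f"events/{int(time.time())}-{uuid.uuid4().hex}.avro"
    storage_client.bucket(BUCKET).blob(path).upload_from_string(out.getvalue())
    publisher.publish(
        publisher.topic_path(PROJECT, NOTIFY_TOPIC),
        json.dumps({"bucket": BUCKET, "path": path}).encode(),
    )
    buffer = []

def callback(message):
    buffer.append({"payload": message.data.decode()})
    message.ack()
    if len(buffer) >= BATCH_SIZE:
        flush()

future = subscriber.subscribe(
    subscriber.subscription_path(PROJECT, SUBSCRIPTION), callback
)
future.result()  # block and consume until interrupted
```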
Loader: GCS → BigQuery
- Implemented as a Pub/Sub consumer
- Receives an event from the transporter and loads the file into BigQuery (see the sketch below)
- The consumer is auto-scaled based on the Pub/Sub queue size
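A minimal sketch of such a loader, assuming the event format from the transporter sketch above and a hypothetical destination table:

```python
# Consume the transporter's "file ready" events and load each Avro file
# from GCS into BigQuery. Subscription and table names are hypothetical.
import json

from google.cloud import bigquery, pubsub_v1

PROJECT = "my-project"           # assumption
SUBSCRIPTION = "gcs-files-sub"   # assumption
TABLE = "my-project.raw.events"  # assumption

bq = bigquery.Client()
subscriber = pubsub_v1.SubscriberClient()

def callback(message):
    event = json.loads(message.data)
    uri = f"gs://{event['bucket']}/{event['path']}"
    job = bq.load_table_from_uri(
        uri,
        TABLE,
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.AVRO,
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        ),
    )
    # Ack only after the load succeeds, so a failed load is redelivered.
    job.result()
    message.ack()

future = subscriber.subscribe(
    subscriber.subscription_path(PROJECT, SUBSCRIPTION), callback
)
future.result()
```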
Processing Engine -- BigQuery
- The data is processed with a query and the result is saved into another table in BigQuery (see the sketch below)
- Query execution is managed by a system called derived-automation, which manages the schedule and dependencies of the queries
- Scheduled with Airflow, with Django as the dashboard
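A minimal sketch of one such derived step, with hypothetical table names; derived-automation would run many of these in dependency order:

```python
# Run a query and save the result into another BigQuery table.
# Source and destination table names are hypothetical.
from google.cloud import bigquery

bq = bigquery.Client()

job = bq.query(
    "SELECT story_id, COUNT(*) AS views "
    "FROM `my-project.raw.pageviews` "
    "GROUP BY story_id",
    job_config=bigquery.QueryJobConfig(
        destination="my-project.derived.story_views",
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    ),
)
job.result()  # wait for the query job to finish
```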
Processing Engine -- Python
- Processes data to build models that solve problems (a sketch of such a job follows)
- The Python script is executed as a Kubernetes CronJob on GKE (for better resource management than running it inside Airflow)
- Usually the input data has already been processed by BigQuery, so it is not too big and Python can handle it
- The input usually comes from BigQuery, but it can also come from GCS, Pub/Sub, or the data source directly (depending on how quickly we want to update)
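A minimal sketch of a job like this, assuming a hypothetical pre-aggregated table and bucket; the script would be baked into a container image and triggered by a Kubernetes CronJob:

```python
# Read a pre-aggregated BigQuery table, build a small artifact in memory,
# and store it in GCS for serving. All names are hypothetical.
import pickle

from google.cloud import bigquery, storage

bq = bigquery.Client()

# BigQuery already did the heavy aggregation, so this fits in memory.
df = bq.query(
    "SELECT user_id, story_id, score "
    "FROM `my-project.derived.user_story_scores`"
).to_dataframe()

# Example artifact: top stories per user as a plain dict.
artifact = (
    df.sort_values("score", ascending=False)
      .groupby("user_id")["story_id"]
      .apply(list)
      .to_dict()
)

storage.Client().bucket("models").blob("recs/latest.pkl") \
    .upload_from_string(pickle.dumps(artifact))
```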
Processing Engine -- Dataproc
- Processes the data using Spark on Dataproc (mostly PySpark; sketched below)
- Mostly used when the input data is too big for plain Python
- Scheduled using Airflow
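A minimal PySpark sketch, assuming hypothetical GCS paths and columns, and that the spark-avro package is available on the cluster:

```python
# Dataproc job: read Avro event files from GCS, aggregate daily views,
# and write the result back to GCS. Paths and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-views").getOrCreate()

# Dataproc clusters read gs:// paths directly through the GCS connector.
events = spark.read.format("avro").load("gs://data-lake/events/*.avro")

daily = (
    events
    .groupBy(F.to_date("event_time").alias("day"), "story_id")
    .agg(F.count("*").alias("views"))
)

daily.write.mode("overwrite").parquet("gs://data-lake/derived/daily_views")
```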
Serving -- Database
For serving we use various databases, depending on the use case:
- Elasticsearch:
  - The most commonly used in Kumparan so far (see the query sketch after this list)
  - Able to serve frequent requests
  - Able to filter or search the data
  - Able to do simple aggregations
  - For data that does not need complex join operations
- Bigtable (HBase-like):
  - For heavy read/write
  - Able to do simple filtering
  - Great for key-value access
  - Not for full-text search or join operations
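A minimal sketch of the kind of query Elasticsearch handles well here (full-text search, exact filtering, and a simple aggregation), using the 8.x Python client; the index and field names are hypothetical:

```python
# Full-text search + filter + simple aggregation with the Elasticsearch
# Python client (8.x style). Index and field names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption

resp = es.search(
    index="stories",
    query={
        "bool": {
            "must": [{"match": {"title": "pemilu"}}],   # full-text search
            "filter": [{"term": {"channel": "news"}}],  # exact filter
        }
    },
    aggs={"by_author": {"terms": {"field": "author_id"}}},
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["title"])
```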
Serving -- Database
- Redis:
  - For caching data that has heavy read traffic
  - Great for key-value access
  - Has various data structures such as hash, set, sorted set, etc.
  - Often used together with MySQL as a cache (the pattern is sketched after this list)
  - Cannot do joins and cannot handle very big data, since it lives in memory
- MySQL:
  - RDBMS
  - Can handle complex queries and joins
  - Used for relatively small data (< 10 GB) that needs relatively complex logic
  - To handle heavy read traffic, we can use it together with Redis
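A minimal sketch of the Redis-in-front-of-MySQL (cache-aside) pattern; connection details, table, and key layout are hypothetical:

```python
# Cache-aside: try Redis first, fall back to MySQL on a miss, then
# populate the cache with a TTL. All names here are hypothetical.
import json

import pymysql
import redis

r = redis.Redis(host="localhost", port=6379)                  # assumption
db = pymysql.connect(host="localhost", user="app",
                     password="CHANGE_ME", database="content")  # assumption

def get_story(story_id: int, ttl: int = 300):
    key = f"story:{story_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit
    with db.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SELECT id, title, author FROM stories WHERE id = %s",
                    (story_id,))
        row = cur.fetchone()
    if row is not None:
        r.set(key, json.dumps(row), ex=ttl)  # cache miss: populate
    return row
```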
Serving -- API
- The API is served with Python (Flask, Falcon, or Sanic)
- Deployed as a Deployment in Kubernetes on GKE
- Acts as the interface (see the sketch below):
  - Gets the data from the serving database
  - Or loads a model from GCS and computes the result on the fly from the model
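A minimal Flask sketch of the first variant, fetching precomputed results from the serving database (here Elasticsearch); the endpoint, index, and field names are hypothetical:

```python
# Serving API: return precomputed recommendations from Elasticsearch.
# Endpoint, index, and field names are hypothetical.
from elasticsearch import Elasticsearch
from flask import Flask, jsonify

app = Flask(__name__)
es = Elasticsearch("http://localhost:9200")  # assumption

@app.route("/recommendations/<user_id>")
def recommendations(user_id):
    resp = es.search(
        index="recommendations",
        query={"term": {"user_id": user_id}},
        size=10,
    )
    return jsonify([hit["_source"] for hit in resp["hits"]["hits"]])

if __name__ == "__main__":
    app.run(port=8080)
```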
Serving -- Visuals and Reports
Use Case -- Visualization and Reporting System
- Automates the reporting and visualizes it
- People can monitor the metrics they care about
- The BI team doesn't need to do the same task again and again
- Infra:
  - Data source → BigQuery → BigQuery (derived tables) → DOMO
Use Case -- Recommender System
- Gives people the content they care about, which can increase pages/session, CTR, etc.
- Gives insights to internal teams (marketing, editorial, etc.)
- Infra:
  - Data sources: tracker, core app data
  - Processing: BigQuery, Python, Spark
  - Serving DB: Elasticsearch
  - Serving API: Python on GKE
Use Case -- Spam Detector
- Detects spam to make sure the content is credible (maintains the quality of the content)
- Training flow (sketched after this list):
  - Create labelled data on Google Drive and connect it to BigQuery
  - Experiment using Jupyter notebooks
  - Create a classification model using Python and store the model in GCS
- Serving flow:
  - The API server loads the model from GCS during initialization
  - Uses the model to predict for every request
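A minimal sketch of the training flow, assuming a hypothetical labelled table in BigQuery and a scikit-learn text classifier (the slides don't name the actual model):

```python
# Spam detector training: pull labelled text from BigQuery, fit a
# classifier, and store it in GCS. Table/bucket/column names are hypothetical.
import joblib
from google.cloud import bigquery, storage
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

df = bigquery.Client().query(
    "SELECT text, is_spam FROM `my-project.labels.spam`"
).to_dataframe()

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(df["text"], df["is_spam"])

joblib.dump(model, "/tmp/spam_model.joblib")
storage.Client().bucket("models").blob("spam/latest.joblib") \
    .upload_from_filename("/tmp/spam_model.joblib")
```

On the serving side, the API process would download spam/latest.joblib from GCS once at startup, load it with joblib.load, and call model.predict([text]) for each incoming request.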
Use Case -- Social Media Trend Detector
- Detects phrases that will be trending on social media
- Process flow (Twitter):
  - A crawler gets the Twitter data, puts it into GCS, then loads it into BigQuery
  - The raw tweets are cleaned and put into Elasticsearch
  - A keyword extractor in Python runs hourly, extracts phrases from the tweets, and puts them into BigQuery
  - The active keywords are monitored and their metrics are aggregated periodically; the aggregated data is stored in BigQuery
  - A spike detector processes the aggregated data, detects spikes/trends, and stores the trending keywords in Elasticsearch (the serving database); one simple spike test is sketched below
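The slides don't say how the spike detector works; as an illustration only, a common approach is to flag a keyword whose latest hourly count sits far above its recent mean (the threshold and window size here are hypothetical choices):

```python
# Flag a keyword as trending when its latest hourly count is several
# standard deviations above its recent history.
from statistics import mean, stdev

def is_spiking(hourly_counts: list[int], z_threshold: float = 3.0) -> bool:
    """hourly_counts: counts per hour, oldest first, latest last."""
    history, latest = hourly_counts[:-1], hourly_counts[-1]
    if len(history) < 2:
        return False
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest > mu  # flat history: any increase counts as a spike
    return (latest - mu) / sigma > z_threshold

# A keyword mentioned ~10x/hour that suddenly hits 80 mentions:
print(is_spiking([9, 11, 10, 12, 8, 10, 80]))  # True
```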
We are hiring!
1. Software Engineer (Frontend, Backend & Mobile Application)
2. Site Reliability Engineer
3. QA Engineer
4. Data Engineer
Email us at joindev@kumparan.com
THANK YOU!
