Data Infrastructure in Kumparan
Content
- Overview of the data infrastructure
- ETL: data source → data lake → data warehouse
  - Data lake: GCS
  - Data warehouse: BigQuery
  - Real-time data sources
  - Batch data sources
- Processing the data
  - Processing engines: BigQuery, Python, Dataproc (Spark)
  - Schedulers: Airflow, Kubernetes CronJob
Content
- Serving
  - Database
    - Elasticsearch
    - Bigtable
    - MySQL
    - Redis
  - API → Python server on GKE
  - Visuals and reports → DOMO
- Use case examples:
  - Visualization and reporting system
  - Recommender system
  - Spam detector
  - Trend predictor
What is Kumparan?
Kumparan is a startup that focuses on media and technology. It focused heavily on media during its first year (2017) and started focusing more on technology in 2018.
Data Infrastructure Overview
Transporter: Data Source → GCS
- Real-time data:
  - Goes through Pub/Sub
  - A Pub/Sub consumer does mini-batching and writes the batches into GCS (sketched below)
- Batch data:
  - Scheduled hourly or daily using Airflow
  - Uses Python scripts
- The data in GCS is stored in Avro format
- A Pub/Sub event is published every time a file is generated in GCS
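A minimal sketch of the real-time path, assuming hypothetical project, subscription, topic, and bucket names and a one-field Avro schema; a production consumer would also guard the buffer with a lock and flush on a timer:

```python
# Mini-batch Pub/Sub messages, write each batch to GCS as an Avro file,
# then publish a "file ready" event for the loader. All names are hypothetical.
import io
import json
import time
import uuid

import fastavro
from google.cloud import pubsub_v1, storage

PROJECT = "my-project"           # assumption
SUBSCRIPTION = "raw-events-sub"  # assumption
NOTIFY_TOPIC = "gcs-files"       # assumption
BUCKET = "data-lake"             # assumption
BATCH_SIZE = 1000

SCHEMA = fastavro.parse_schema({
    "name": "Event", "type": "record",
    "fields": [{"name": "payload", "type": "string"}],
})

storage_client = storage.Client()
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
buffer = []

def flush():
    """Write buffered records to GCS as one Avro file and notify the loader."""
    global buffer
    if not buffer:
        return
    out = io.BytesIO()
    fastavro.writer(out, SCHEMA, buffer)
    path = f"events/{int(time.time())}-{uuid.uuid4().hex}.avro"
    storage_client.bucket(BUCKET).blob(path).upload_from_string(out.getvalue())
    publisher.publish(
        publisher.topic_path(PROJECT, NOTIFY_TOPIC),
        json.dumps({"bucket": BUCKET, "path": path}).encode(),
    )
    buffer = []

def callback(message):
    buffer.append({"payload": message.data.decode()})
    message.ack()
    if len(buffer) >= BATCH_SIZE:
        flush()

future = subscriber.subscribe(
    subscriber.subscription_path(PROJECT, SUBSCRIPTION), callback
)
future.result()  # block and consume until interrupted
```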
Loader: GCS → BigQuery
- Implemented as a Pub/Sub consumer
- Receives an event from the transporter and loads the file into BigQuery (see the sketch below)
- The consumer is auto-scaled based on the Pub/Sub queue size
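A minimal sketch of such a loader, assuming the event format from the transporter sketch above and a hypothetical destination table:

```python
# Consume the transporter's "file ready" events and load each Avro file
# from GCS into BigQuery. Subscription and table names are hypothetical.
import json

from google.cloud import bigquery, pubsub_v1

PROJECT = "my-project"           # assumption
SUBSCRIPTION = "gcs-files-sub"   # assumption
TABLE = "my-project.raw.events"  # assumption

bq = bigquery.Client()
subscriber = pubsub_v1.SubscriberClient()

def callback(message):
    event = json.loads(message.data)
    uri = f"gs://{event['bucket']}/{event['path']}"
    job = bq.load_table_from_uri(
        uri,
        TABLE,
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.AVRO,
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        ),
    )
    # Ack only after the load succeeds, so a failed load is redelivered.
    job.result()
    message.ack()

future = subscriber.subscribe(
    subscriber.subscription_path(PROJECT, SUBSCRIPTION), callback
)
future.result()
```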
Processing Engine -- BigQuery
- The data is processed with a query and the result is saved into another table in BigQuery (see the sketch below)
- Query execution is managed by a system called derived-automation, which manages the schedule and dependencies of the queries
- Scheduled with Airflow, with Django as the dashboard
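A minimal sketch of one such derived step, with hypothetical table names; derived-automation would run many of these in dependency order:

```python
# Run a query and save the result into another BigQuery table.
# Source and destination table names are hypothetical.
from google.cloud import bigquery

bq = bigquery.Client()

job = bq.query(
    "SELECT story_id, COUNT(*) AS views "
    "FROM `my-project.raw.pageviews` "
    "GROUP BY story_id",
    job_config=bigquery.QueryJobConfig(
        destination="my-project.derived.story_views",
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    ),
)
job.result()  # wait for the query job to finish
```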
Processing Engine -- Python
- Processes data to build models that solve problems (a sketch of such a job follows)
- The Python script is executed as a Kubernetes CronJob on GKE (for better resource management than running it inside Airflow)
- Usually the input data has already been processed by BigQuery, so it is not too big and Python can handle it
- The input usually comes from BigQuery, but it can also come from GCS, Pub/Sub, or the data source directly (depending on how quickly we want to update)
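A minimal sketch of a job like this, assuming a hypothetical pre-aggregated table and bucket; the script would be baked into a container image and triggered by a Kubernetes CronJob:

```python
# Read a pre-aggregated BigQuery table, build a small artifact in memory,
# and store it in GCS for serving. All names are hypothetical.
import pickle

from google.cloud import bigquery, storage

bq = bigquery.Client()

# BigQuery already did the heavy aggregation, so this fits in memory.
df = bq.query(
    "SELECT user_id, story_id, score "
    "FROM `my-project.derived.user_story_scores`"
).to_dataframe()

# Example artifact: top stories per user as a plain dict.
artifact = (
    df.sort_values("score", ascending=False)
      .groupby("user_id")["story_id"]
      .apply(list)
      .to_dict()
)

storage.Client().bucket("models").blob("recs/latest.pkl") \
    .upload_from_string(pickle.dumps(artifact))
```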
Processing Engine -- Dataproc
- Processes the data using Spark on Dataproc (mostly PySpark; sketched below)
- Mostly used when the input data is too big for plain Python
- Scheduled using Airflow
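A minimal PySpark sketch, assuming hypothetical GCS paths and columns, and that the spark-avro package is available on the cluster:

```python
# Dataproc job: read Avro event files from GCS, aggregate daily views,
# and write the result back to GCS. Paths and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-views").getOrCreate()

# Dataproc clusters read gs:// paths directly through the GCS connector.
events = spark.read.format("avro").load("gs://data-lake/events/*.avro")

daily = (
    events
    .groupBy(F.to_date("event_time").alias("day"), "story_id")
    .agg(F.count("*").alias("views"))
)

daily.write.mode("overwrite").parquet("gs://data-lake/derived/daily_views")
```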
Serving -- Database
For serving we use various databases, depending on the use case:
- Elasticsearch:
  - The most commonly used in Kumparan so far (see the query sketch after this list)
  - Able to serve frequent requests
  - Able to filter or search the data
  - Able to do simple aggregations
  - For data that does not need complex join operations
- Bigtable (HBase-like):
  - For heavy read/write
  - Able to do simple filtering
  - Great for key-value access
  - Not for full-text search or join operations
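A minimal sketch of the kind of query Elasticsearch handles well here (full-text search, exact filtering, and a simple aggregation), using the 8.x Python client; the index and field names are hypothetical:

```python
# Full-text search + filter + simple aggregation with the Elasticsearch
# Python client (8.x style). Index and field names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption

resp = es.search(
    index="stories",
    query={
        "bool": {
            "must": [{"match": {"title": "pemilu"}}],   # full-text search
            "filter": [{"term": {"channel": "news"}}],  # exact filter
        }
    },
    aggs={"by_author": {"terms": {"field": "author_id"}}},
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["title"])
```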
Serving -- Database
- Redis:
  - For caching data that has heavy read traffic
  - Great for key-value access
  - Has various data structures such as hash, set, sorted set, etc.
  - Often used together with MySQL as a cache (the pattern is sketched after this list)
  - Cannot do joins and cannot handle very big data, since it lives in memory
- MySQL:
  - RDBMS
  - Can handle complex queries and joins
  - Used for relatively small data (< 10 GB) that needs relatively complex logic
  - To handle heavy read traffic, we can use it together with Redis
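A minimal sketch of the Redis-in-front-of-MySQL (cache-aside) pattern; connection details, table, and key layout are hypothetical:

```python
# Cache-aside: try Redis first, fall back to MySQL on a miss, then
# populate the cache with a TTL. All names here are hypothetical.
import json

import pymysql
import redis

r = redis.Redis(host="localhost", port=6379)                  # assumption
db = pymysql.connect(host="localhost", user="app",
                     password="CHANGE_ME", database="content")  # assumption

def get_story(story_id: int, ttl: int = 300):
    key = f"story:{story_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit
    with db.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SELECT id, title, author FROM stories WHERE id = %s",
                    (story_id,))
        row = cur.fetchone()
    if row is not None:
        r.set(key, json.dumps(row), ex=ttl)  # cache miss: populate
    return row
```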
Serving -- API
- The API is served with Python (Flask, Falcon, or Sanic)
- Deployed as a Deployment in Kubernetes on GKE
- Acts as the interface (see the sketch below):
  - Gets the data from the serving database
  - Or loads a model from GCS and computes the result on the fly from the model
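A minimal Flask sketch of the first variant, fetching precomputed results from the serving database (here Elasticsearch); the endpoint, index, and field names are hypothetical:

```python
# Serving API: return precomputed recommendations from Elasticsearch.
# Endpoint, index, and field names are hypothetical.
from elasticsearch import Elasticsearch
from flask import Flask, jsonify

app = Flask(__name__)
es = Elasticsearch("http://localhost:9200")  # assumption

@app.route("/recommendations/<user_id>")
def recommendations(user_id):
    resp = es.search(
        index="recommendations",
        query={"term": {"user_id": user_id}},
        size=10,
    )
    return jsonify([hit["_source"] for hit in resp["hits"]["hits"]])

if __name__ == "__main__":
    app.run(port=8080)
```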
Serving -- Visuals and Reports
Use Case -- Visualization and Reporting System
- Automates the reporting and visualizes it
- People can monitor the metrics they care about
- The BI team doesn't need to do the same task again and again
- Infra:
  - Data source → BigQuery → BigQuery (derived tables) → DOMO
Use Case -- Recommender System
- Gives people the content they care about, which can increase pages/session, CTR, etc.
- Gives insights to internal teams (marketing, editorial, etc.)
- Infra:
  - Data sources: tracker, core app data
  - Processing: BigQuery, Python, Spark
  - Serving DB: Elasticsearch
  - Serving API: Python on GKE
Use Case -- Spam Detector
- Detects spam to make sure the content is credible (maintains the quality of the content)
- Training flow (sketched after this list):
  - Create labelled data on Google Drive and connect it to BigQuery
  - Experiment using Jupyter notebooks
  - Create a classification model using Python and store the model in GCS
- Serving flow:
  - The API server loads the model from GCS during initialization
  - Uses the model to predict for every request
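A minimal sketch of the training flow, assuming a hypothetical labelled table in BigQuery and a scikit-learn text classifier (the slides don't name the actual model):

```python
# Spam detector training: pull labelled text from BigQuery, fit a
# classifier, and store it in GCS. Table/bucket/column names are hypothetical.
import joblib
from google.cloud import bigquery, storage
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

df = bigquery.Client().query(
    "SELECT text, is_spam FROM `my-project.labels.spam`"
).to_dataframe()

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(df["text"], df["is_spam"])

joblib.dump(model, "/tmp/spam_model.joblib")
storage.Client().bucket("models").blob("spam/latest.joblib") \
    .upload_from_filename("/tmp/spam_model.joblib")
```

On the serving side, the API process would download spam/latest.joblib from GCS once at startup, load it with joblib.load, and call model.predict([text]) for each incoming request.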
Use Case -- Social Media Trend Detector
- Detects phrases that will be trending on social media
- Process flow (Twitter):
  - A crawler gets the Twitter data, puts it into GCS, then loads it into BigQuery
  - The raw tweets are cleaned and put into Elasticsearch
  - A keyword extractor in Python runs hourly, extracts phrases from the tweets, and puts them into BigQuery
  - The active keywords are monitored and their metrics are aggregated periodically; the aggregated data is stored in BigQuery
  - A spike detector processes the aggregated data, detects spikes/trends, and stores the trending keywords in Elasticsearch (the serving database); one simple spike test is sketched below
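The slides don't say how the spike detector works; as an illustration only, a common approach is to flag a keyword whose latest hourly count sits far above its recent mean (the threshold and window size here are hypothetical choices):

```python
# Flag a keyword as trending when its latest hourly count is several
# standard deviations above its recent history.
from statistics import mean, stdev

def is_spiking(hourly_counts: list[int], z_threshold: float = 3.0) -> bool:
    """hourly_counts: counts per hour, oldest first, latest last."""
    history, latest = hourly_counts[:-1], hourly_counts[-1]
    if len(history) < 2:
        return False
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest > mu  # flat history: any increase counts as a spike
    return (latest - mu) / sigma > z_threshold

# A keyword mentioned ~10x/hour that suddenly hits 80 mentions:
print(is_spiking([9, 11, 10, 12, 8, 10, 80]))  # True
```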
We are hiring!
1. Software Engineer (Frontend, Backend & Mobile Application)
2. Site Reliability Engineer
3. QA Engineer
4. Data Engineer
Email us at joindev@kumparan.com
THANK YOU!
