Horizontally Scalable
Compute Infrastructure
Yosua Michael Maranatha
Content
- What?
- Why?
- Problem Definition
- Example Problem: Word Counting
- Using a single machine
- Proposed HSCI
- Using the proposed HSCI
- Other use cases: Crawler
- Questions?
What is Horizontally Scalable?
- A horizontally scalable system is a system whose
capacity can be increased by adding more
machines
- For example, if one machine can handle a load
of 100 requests per second (rps), then ten identical
machines can handle roughly 1000 rps
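The arithmetic behind that example can be sketched as follows. This is a minimal illustration that assumes ideal linear scaling (every machine contributes its full capacity); the function name `machines_needed` is hypothetical and not from the slides.

```python
import math


def machines_needed(target_rps: int, per_machine_rps: int) -> int:
    """Return how many identical machines are needed for a target load.

    Assumes ideal linear scaling, which real systems only approximate.
    """
    if per_machine_rps <= 0:
        raise ValueError("per_machine_rps must be positive")
    # Round up: a fraction of a machine still means one more machine.
    return math.ceil(target_rps / per_machine_rps)


print(machines_needed(1000, 100))
```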
What is Horizontally Scalable?
Reference: image_source
What is a Horizontally Scalable
Compute Infrastructure?
- A compute infrastructure is an infrastructure
designed for computation or processing
- A Horizontally Scalable Compute Infrastructure
(HSCI) is able to compute or process more
by adding more machines
Why do we need HSCI?
● Scaling up computation vertically quickly hits a
ceiling, since CPU computation speed
grows relatively slowly
● Vertical scaling is also inflexible and
usually causes downtime while we
scale up or down
HSCI Design Problem Definition
● We have a lot of independent tasks
● We have a bunch of machines
● We want to process those tasks with the machines
● We need a way to distribute tasks to the machines
nicely (balanced and robust)
● We should be able to easily add machines to speed up the
overall process if needed
Example Problem: Word Counting
Suppose we have more than a hundred million
text files in GCS.
We want to count the frequency of each word
across all the files.
Using a Single Machine
● Using a single machine, we can loop over each file in
GCS
● For each file, we pre-process it by converting it to
lower case and splitting it on the word separators
(space, tab, comma, etc.)
● Then we store and update the count for each
word using a hash map or dictionary
● This method will work, but it will be very
time consuming . . .
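The single-machine loop above could look roughly like this. It is a minimal sketch: the documents are passed in as in-memory strings instead of being read from GCS, and the separator set and the names `tokenize` and `count_words` are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumed separators: whitespace (space, tab, newline) and common punctuation.
SEPARATORS = re.compile(r"[\s,.;:!?]+")


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it on the word separators."""
    return [word for word in SEPARATORS.split(text.lower()) if word]


def count_words(documents: Iterable[str]) -> Counter:
    """Loop over every document and update a single in-memory counter."""
    counts: Counter = Counter()
    for text in documents:
        counts.update(tokenize(text))
    return counts


counts = count_words(["The cat, the dog", "A dog\tbarks"])
print(counts["the"], counts["dog"])
```

Everything runs in one sequential loop, so the total time grows with the number of files, which is the bottleneck the slides point out.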
Proposed HSCI
Our proposed HSCI:
- Break the problem down into independent
tasks
- Put the tasks into a message queue (we use
Google Pub/Sub)
- The processors get tasks from the queue
and process them accordingly
- If a task is multi-layered, the processor
puts the next task back into the queue
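The queue-and-processors pattern above can be sketched in a single process. This is only an illustration: Python's standard `queue.Queue` and threads stand in for Google Pub/Sub and the processor machines, and the task kinds `"split"` and `"process"` as well as the function `run_pipeline` are hypothetical names, not part of the original design.

```python
import queue
import threading
from typing import List, Tuple


def run_pipeline(jobs: List[str], num_workers: int = 4) -> List[str]:
    """Process comma-separated jobs with identical, stateless workers."""
    tasks: "queue.Queue[Tuple[str, str]]" = queue.Queue()  # stand-in for Pub/Sub
    results: List[str] = []
    results_lock = threading.Lock()

    def handle(kind: str, payload: str) -> None:
        if kind == "split":
            # Layer 1: break the job into independent tasks and queue them again.
            for item in payload.split(","):
                tasks.put(("process", item))
        elif kind == "process":
            # Layer 2: the actual work (upper-casing is a placeholder).
            with results_lock:
                results.append(payload.upper())

    def worker() -> None:
        while True:
            kind, payload = tasks.get()
            try:
                handle(kind, payload)
            finally:
                # Mark the task done only after any follow-up tasks are queued.
                tasks.task_done()

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
    for job in jobs:
        tasks.put(("split", job))
    tasks.join()  # returns once every task, including re-queued ones, is done
    return sorted(results)


print(run_pipeline(["a,b", "c"], num_workers=3))
```

Because the workers share nothing except the queue, raising `num_workers` (or, in the real system, adding machines) is the only change needed to add capacity.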
Using proposed HSCI
● First, to count the words in each
file, we put each file path into Pub/Sub
● A processor engine gets the file, counts the
word frequencies, and increments the frequency of
each word in memcached
● The final results end up in memcached :)
● We can add more processor engines as needed
(all of them are stateless and identical to each
other)
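The word-counting flow above can be sketched as follows. This is a hedged, single-process illustration: a `queue.Queue` stands in for Pub/Sub, the `CounterStore` class is an in-memory stand-in for memcached (it is not a memcached client), and the `FILES` dictionary is hypothetical data standing in for files in GCS.

```python
import queue
import threading
from collections import Counter
from typing import Dict, List

# Hypothetical file contents standing in for text files in GCS.
FILES = {"a.txt": "the cat the dog", "b.txt": "a dog barks"}


class CounterStore:
    """In-memory stand-in for a shared counter store such as memcached."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()
        self._lock = threading.Lock()

    def incr(self, key: str, delta: int = 1) -> None:
        with self._lock:  # keep each increment atomic across workers
            self._counts[key] += delta

    def snapshot(self) -> Dict[str, int]:
        with self._lock:
            return dict(self._counts)


def process_file(path: str, store: CounterStore) -> None:
    """Count the words of one file locally, then push the increments."""
    local = Counter(FILES[path].lower().split())
    for word, n in local.items():
        store.incr(word, n)


def count_all(paths: List[str], num_workers: int = 2) -> Dict[str, int]:
    tasks: "queue.Queue[str]" = queue.Queue()  # stand-in for Pub/Sub
    store = CounterStore()

    def worker() -> None:
        while True:
            path = tasks.get()
            try:
                process_file(path, store)
            finally:
                tasks.task_done()

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
    for path in paths:
        tasks.put(path)
    tasks.join()
    return store.snapshot()


totals = count_all(["a.txt", "b.txt"], num_workers=3)
print(totals["the"], totals["dog"])
```

Each worker keeps no state between files; the only shared state lives in the counter store, which is why more workers can be added freely.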
Other use cases: Crawler
We are hiring!
1. Data Engineer
2. BI Engineer
3. Data Scientist
4. Software Engineer (Frontend, Backend & Mobile
Application)
Email us at joindev@kumparan.com
THANK YOU!
QUESTIONS ?
