Presentation for the talk Engineering Scalability at Block71 Jakarta, 20 September 2018. It proposes a design for a horizontally scalable compute infrastructure.
2. Content
- What?
- Why?
- Problem Definition
- Example Problem: Word Counting
- Using a single machine
- Proposed HSCI
- Using proposed HSCI
- Other use cases: Crawler
- Questions?
3. What is Horizontally Scalable?
- A horizontally scalable system is a system that
can gain capacity by adding more machines
- For example, if one machine can handle a load
of 100 rps, then ten identical machines can
handle about 1000 rps
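The arithmetic behind that example can be written down as a small sketch. It assumes capacity scales linearly with the number of machines, which is an idealization; the function name `machines_needed` is hypothetical.

```python
import math

def machines_needed(target_rps, rps_per_machine):
    """Number of identical machines needed, assuming linear scaling."""
    return math.ceil(target_rps / rps_per_machine)

# The slide's example: 100 rps per machine, 1000 rps target.
example = machines_needed(1000, 100)
```

In practice, load balancing and shared dependencies add overhead, so the real number is usually somewhat higher.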
5. What is Horizontally Scalable
Compute Infrastructure?
- A compute infrastructure is an infrastructure
designed for computation or processing
- A Horizontally Scalable Compute Infrastructure
(HSCI) can compute or process more work
by adding more machines
6. Why do we need HSCI?
● Scaling computation vertically quickly hits a
ceiling, since CPU speed grows relatively
slowly
● Vertical scaling is also inflexible and usually
causes downtime while we scale up or down
7. HSCI Design Problem Definitions
● We have a lot of independent tasks
● We have a bunch of machines
● We want to process those tasks with the machines
● We need a way to distribute tasks to the machines
nicely (balanced and robust)
● We want to be able to add machines easily to speed
up the overall process if needed
8. Example Problem: Word Counting
Suppose we have more than a hundred million
text files in GCS (Google Cloud Storage).
We want to count the frequency of each word
across all the files.
9. Using a Single Machine
● On a single machine, we can loop over each file in
GCS
● For each file, we pre-process it by converting it to
lower case and splitting it on word separators
(space, tab, comma, etc.)
● Then we store and update the count for each
word in a hash map or dictionary
● This method works, but it is very
time-consuming . . .
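A minimal sketch of the single-machine approach, assuming the file contents have already been read into memory. The two in-memory strings stand in for files in GCS, and the exact separator set is an assumption, since the slide only lists "space, tab, comma, etc.".

```python
import re
from collections import Counter

# Word separators: whitespace plus some punctuation (an assumed set).
SEPARATORS = re.compile(r"[\s,.;:!?]+")

def tokenize(text):
    """Lower-case the text and split it on the separators."""
    return [w for w in SEPARATORS.split(text.lower()) if w]

def count_words(texts):
    """Loop over every file's content and update one shared counter."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

# Stand-in for the files in GCS: two tiny in-memory "files".
files = ["Hello world, hello", "World\tof words"]
totals = count_words(files)
```

Everything runs in one loop on one machine, so the total time grows linearly with the number of files.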
10. Proposed HSCI
Our proposed HSCI is:
- Break the problem down into independent
tasks
- Put the tasks into a message queue (we use
Google Pub/Sub)
- The processors get tasks from the queue
and process them accordingly
- If a task is multi-layered, the processor
puts the next task back into the queue
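The flow above can be sketched locally. This is only an illustration under stated assumptions: Python's `queue.Queue` stands in for Google Pub/Sub, threads stand in for processor machines, and the two-layer "split" and "square" task types are hypothetical.

```python
import queue
import threading

task_queue = queue.Queue()        # local stand-in for a Pub/Sub topic
results = []
results_lock = threading.Lock()

def process(task):
    """Hypothetical two-layer task: 'split' fans out into 'square' tasks."""
    kind, payload = task
    if kind == "split":
        # First layer: put the next-layer tasks back into the queue.
        for n in payload:
            task_queue.put(("square", n))
    elif kind == "square":
        with results_lock:
            results.append(payload * payload)

def worker():
    """Stateless processor: pull a task, process it, mark it done."""
    while True:
        task = task_queue.get()
        if task is None:          # shutdown signal
            task_queue.task_done()
            break
        try:
            process(task)
        finally:
            task_queue.task_done()

def run(num_workers, initial_tasks):
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for task in initial_tasks:
        task_queue.put(task)
    task_queue.join()             # waits for follow-up tasks as well
    for _ in threads:
        task_queue.put(None)      # one shutdown signal per worker
    for t in threads:
        t.join()

run(3, [("split", [1, 2, 3]), ("split", [4, 5])])
```

Because the workers hold no state of their own, changing `num_workers` changes throughput but not the result.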
12. Using proposed HSCI
● First, to count the words in each file,
we put each file path into Pub/Sub
● A processor engine gets the file, counts the
word frequencies, and increments the count of
each word in memcached
● The final results end up in memcached :)
● We can add more processor engines as needed
(all of them are stateless and identical to each
other)
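A minimal local sketch of these steps, not the production setup. The assumptions: `queue.Queue` stands in for Pub/Sub, the `FAKE_BUCKET` dictionary (with hypothetical paths) stands in for GCS, and the `CounterStore` class stands in for memcached and its atomic increment.

```python
import queue
import threading
from collections import Counter

# Stand-in for GCS: file path -> file content (hypothetical paths).
FAKE_BUCKET = {
    "docs/a.txt": "the cat and the dog",
    "docs/b.txt": "the dog barks",
    "docs/c.txt": "a cat sleeps",
}

class CounterStore:
    """In-memory stand-in for memcached with an atomic increment."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def incr(self, key, delta):
        with self._lock:
            self._data[key] = self._data.get(key, 0) + delta

    def get(self, key):
        with self._lock:
            return self._data.get(key, 0)

def process_file(path, store):
    """One task: fetch the file, count its words, push the counts."""
    words = FAKE_BUCKET[path].lower().split()
    for word, n in Counter(words).items():
        store.incr(word, n)

def processor(tasks, store):
    """Stateless engine: pull file paths until the queue is empty."""
    while True:
        try:
            path = tasks.get_nowait()
        except queue.Empty:
            return
        process_file(path, store)

def run(num_engines):
    tasks = queue.Queue()
    for path in FAKE_BUCKET:
        tasks.put(path)
    store = CounterStore()
    engines = [threading.Thread(target=processor, args=(tasks, store))
               for _ in range(num_engines)]
    for e in engines:
        e.start()
    for e in engines:
        e.join()
    return store

store = run(num_engines=2)
```

Each engine counts words per file locally and sends only the per-word totals to the shared store, which keeps the number of increments lower than one per word occurrence.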
14. We are hiring!
1. Data Engineer
2. BI Engineer
3. Data Scientist
4. Software Engineer (Frontend, Backend & Mobile
Application)
Email us at joindev@kumparan.com