Horizontally Scalable Compute Infrastructure


Presentation for the talk "Engineering Scalability" at Block71 Jakarta, 20 September 2018. It proposes a horizontally scalable compute infrastructure.


Horizontally Scalable
Compute Infrastructure
Yosua Michael Maranatha

Content
- What?
- Why?
- Problem Definition
- Example Problem: Word Counting
- Using single machines
- Proposed HSCI
- Using proposed HSCI
- Other use cases: Crawler
- Questions?

What is Horizontally Scalable?
- A horizontally scalable system is a system that
can gain capacity by adding more machines
- For example, if one machine can handle a load
of 100 rps (requests per second), then we can use
ten identical machines to handle 1000 rps
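
The arithmetic in this example can be sketched in a few lines. This is only an illustration that assumes capacity adds up perfectly linearly, which real systems rarely achieve exactly; `machines_needed` is a hypothetical helper name, not something from the talk.

```python
import math


def machines_needed(target_rps: float, per_machine_rps: float) -> int:
    """Return how many identical machines are needed for a target load.

    Assumption: total capacity is simply the sum of per-machine capacity.
    """
    if per_machine_rps <= 0:
        raise ValueError("per_machine_rps must be positive")
    # Round up: a fraction of a machine still means one more machine.
    return math.ceil(target_rps / per_machine_rps)


# The example from the slide: 100 rps per machine, 1000 rps target.
print(machines_needed(1000, 100))  # 10
```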

What is Horizontally Scalable?
Reference: image_source

What is Horizontally Scalable
Compute Infrastructure?
- Compute infrastructure is infrastructure that is
designed for computation or processing
- A Horizontally Scalable Compute Infrastructure
(HSCI) can compute or process more work
by adding more machines

Why do we need HSCI?
● Scaling computation vertically easily hits a
ceiling, since CPU computation speed grows
relatively slowly
● Also, scaling vertically is not flexible and
usually causes downtime while we
scale up or down

HSCI Design Problem Definition
● We have a lot of independent tasks
● We have a bunch of machines
● We want to process those tasks with the machines
● We need a way to distribute tasks to the machines
well (balanced and robust)
● We should be able to easily add machines to speed up the
overall process if needed

Example Problem: Word Counting
Suppose we have more than a hundred million
text files in GCS (Google Cloud Storage).
We want to count the frequency of each word
across all the files.

Using a Single Machine
● Using a single machine, we can loop over each file in
GCS
● For each file, we pre-process it by converting it to
lower case and splitting it on word separators
(space, tab, comma, etc.)
● Then we store and update the count for each
word using a hash map or dictionary
● This method will work, but it will be very
time-consuming . . .
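
The single-machine steps above can be sketched as follows. This is a minimal illustration, not the speaker's code: the in-memory strings stand in for file contents that would be read from GCS, and the separator set in `SEPARATORS` is an assumption (the slide only says "space, tab, comma, etc.").

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumed separators: any whitespace plus common punctuation.
SEPARATORS = re.compile(r"[\s,.;:!?]+")


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it on the assumed separators."""
    return [word for word in SEPARATORS.split(text.lower()) if word]


def count_words(documents: Iterable[str]) -> Counter:
    """Loop over every document and update one dictionary of counts."""
    totals: Counter = Counter()
    for text in documents:
        totals.update(tokenize(text))
    return totals


# Two tiny "files" standing in for objects read from GCS.
docs = ["The quick fox, the lazy dog", "the end"]
print(count_words(docs).most_common(1))  # [('the', 3)]
```

Everything runs in one loop on one machine, so the total time grows with the number of files; that is the bottleneck the rest of the talk addresses.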

Proposed HSCI
Our proposed HSCI is:
- Break the problem down into independent
tasks
- Put the tasks into message queues (we use
Google Pub/Sub)
- The processors get tasks from the queue
and process them accordingly
- If a task is multi-layered, the processor
puts the next task into the queue again
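
The pattern above can be sketched on one machine with Python's standard library. This is a stand-in under stated assumptions: `queue.Queue` plays the role of the message queue (the talk uses Google Pub/Sub), threads play the role of processor machines, and `run_workers` and `handler` are hypothetical names. It has no retry or error handling, which a real deployment would need.

```python
import queue
import threading


def run_workers(tasks, handler, num_workers=4):
    """Process tasks with identical workers pulling from one shared queue.

    handler(task) must return a list of follow-up tasks (empty if none).
    """
    q = queue.Queue()
    for task in tasks:
        q.put(task)

    def worker():
        while True:
            task = q.get()
            if task is None:  # sentinel: no more work, shut down
                q.task_done()
                return
            try:
                # Multi-layered tasks: put the next stage back on the queue.
                for next_task in handler(task):
                    q.put(next_task)
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    q.join()  # returns once every task, including follow-ups, is done
    for _ in threads:
        q.put(None)
    for thread in threads:
        thread.join()


# Two-layer example: a "split" task fans out into one "length" task per word.
results = []
results_lock = threading.Lock()


def handler(task):
    kind, payload = task
    if kind == "split":
        return [("length", word) for word in payload.split()]
    with results_lock:
        results.append(len(payload))
    return []


run_workers([("split", "a bb ccc"), ("split", "dddd")], handler, num_workers=3)
print(sorted(results))  # [1, 2, 3, 4]
```

Because the workers share nothing except the queue, adding capacity in this sketch is just a larger `num_workers`; in the proposed design the equivalent step is starting more processor machines that subscribe to the same queue.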

Using proposed HSCI
● First, to count the words in each
file, we put the file paths into Pub/Sub
● The processing engine gets the file, counts the
word frequencies, and increments the count of
each word in memcached
● The final results end up in memcached :)
● We can add more processing engines as needed
(all of them are stateless and identical to each
other)
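
The word-counting flow above might look roughly like this. It is a local sketch, not the production setup: `InMemoryCounterStore` is a hypothetical stand-in for memcached, the `FILES` dictionary and its `gs://example-bucket/...` paths are made-up stand-ins for GCS, and a `queue.Queue` of file paths stands in for Pub/Sub. Tokenizing is simplified to whitespace splitting.

```python
import queue
import threading
from collections import Counter


class InMemoryCounterStore:
    """Hypothetical stand-in for memcached: a shared store with atomic incr."""

    def __init__(self):
        self._counts = Counter()
        self._lock = threading.Lock()

    def incr(self, key, delta=1):
        with self._lock:
            self._counts[key] += delta

    def snapshot(self):
        with self._lock:
            return dict(self._counts)


# Made-up file paths and contents standing in for objects in GCS.
FILES = {
    "gs://example-bucket/a.txt": "the cat the dog",
    "gs://example-bucket/b.txt": "The end",
}


def process_file(path, files, store):
    """One task: fetch a file, count its words, push the counts to the store."""
    local_counts = Counter(files[path].lower().split())
    for word, n in local_counts.items():
        store.incr(word, n)  # one increment per distinct word in the file


def count_all(files, num_workers=3):
    paths = queue.Queue()
    for path in files:
        paths.put(path)  # the task is just the file path
    store = InMemoryCounterStore()

    def worker():
        # Workers keep no state of their own: everything lives in the
        # queue and the store, so all workers are interchangeable.
        while True:
            try:
                path = paths.get_nowait()
            except queue.Empty:
                return
            process_file(path, files, store)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return store.snapshot()


print(count_all(FILES)["the"])  # 3
```

Counting locally per file and sending one increment per distinct word keeps the number of calls to the shared store down. With a real memcached server, `incr` on a key that does not exist yet fails, so the key would have to be created first (for example with `add`); the stand-in above skips that detail.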

We are hiring!
1. Data Engineer
2. BI Engineer
3. Data Scientist
4. Software Engineer (Frontend, Backend & Mobile
Application)
Email us at joindev@kumparan.com


