SlideShare a Scribd company logo
Part 0 – Answer important argument
                          Art of Distributed
                       ‘Why Distributed Computing’
                          Haytham ElFadeel - Hfadeel@acm.org




Abstract
Day after day distributed systems tend to appear in mainstream systems. In spite of ma-
turity of distributed frameworks and tools, a lot of distributed technologies still ambi-
guous for a lot of people, especially if we talks about complicated or new stuff.
Lately I have been getting a lot of questions about distributed computing, especially dis-
tributed computing models, and MapReduce, such as: What is MapReduce? Can Ma-
pReduce fit in all situations? How can we compares it with other technologies such as
Grid Computing? And what is the best solution to our situation? So I decided to write
about distributed computing article into two parts. The first, about distributed compu-
ting models and the differences between them. The second part, I will discuss reliability
of the distrusted system and distributed storage systems. So let’s start…

Introduction:
After publish the part 1 of this article that entitled ‘Rethinking about the Distributed
Computing’, I received a few message doubting about Distributed Computing. So I de-
cide to go back and write Part 0 – Answer important argument ‘Why Distributed Com-
puting’.

Why Distributed Computing:
A log of challenges require involving of huge amount of Data, Image if you want to
crawling the whole Web, perform Monte Carlo simulation, or render movie animation,
etc. Such challenges and many others cannot perform in one PC even.
Usually we resort to Distributed Computing when we want deal with huge amount of
data. So the question is what is the big data?




Art of Distributed - Part 0: Why Distributed Computing.
Haytham ElFadeel - Hfadeel@acm.org
The Big Data - Experience with single machine:
Today computers have several Gigabytes or a few Terabytes of storage system capacity.
But what is ‘big data’ anyway? Gigabytes? Terabytes? Petabytes?
A database on the order of 100 GB would not be considered trivially small even today,
although hard drives capable of storing 10 times as much can be had for less than $100
at any computer store. The U.S. Census database included many different datasets of
varying sizes, but let’s simplify a bit: 100 gigabytes is enough to store at least the basic
demographic information: age, sex, income, ethnicity, language, religion, housing status,
and location, packed in a 128-bit record—for every living human being on this planet.
This would create a table of 6.75 billion rows and maybe 10 columns. Should that still be
considered ‘big data’? It depends, on what you’re trying to do with it. More important-
ly, any competent programmer could in a few hours write a simple, unoptimized appli-
cation on a $500 desktop PC with minimal CPU and RAM that could crunch through that
dataset and return answers to simple aggregation queries such as ‘what is the median
age by sex for each country?’ with perfectly reasonable performance.
To demonstrate this, I tried it with fake data, a file consisting of 6.75 billion 16-byte (i.e.
128-bit) records containing uniformly distributed random data. Since a seven-bit age
field allows a maximum of 128 possible values, one bit for sex allows only two, and eight
bits for country allows up to 256 option, we can calculate the median age by using a
counting strategy: simply create 65,536 buckets-one for each combination of age, sex,
and country-and count how many records fall into each. We find the median age by de-
termining, for each sex and country group, the cumulative count over the 128 age buck-
ets: the median is the bucket where the count reaches half of the total. [See figure 1].




 In my tests, this algorithm was limited primarily by the speed of disk fetching: In aver-
age 90-megabyte-per-second sustained read speed, shamefully underutilizing the CPU
the whole time.



Art of Distributed - Part 0: Why Distributed Computing.
Haytham ElFadeel - Hfadeel@acm.org
In fact, our table will fit in the memory of a single, $15,000 Dell server with 128-GB
RAM. Running off in-memory data, my simple median-age-by-sex-and-country program
completed in less than a minute. By such measures, I would hesitate to call this ‘big da-
ta’. This application will need in average between 725-second, and 1000-second to per-
form the above quire. But what if the data bigger than this, you want perform more
complex quires, or what if this data stored into Database system which means every
record will take only 128-bit??? Actually if you add more data or make the query more
complex the time will likely to get exponential or at least multiply!
So the question answer is: It’s depends on what you’re trying to do with the Data.

Hard Limit:
Hardware and software failures are important limit, there is no way to predict or protect
your system from the failures. Even if you use a High-end Server or expensive hardware
the failure is likely to occur. So in single machine you likely are going to lose your whole
work! But if you distribute the work across many machines you will lose one part of the
work and you can resume it into another machine. Actually the modern distributed
strategies show its ability and reliable to handle the hardware and software failures,
Systems and Computing Model such as: MapReduce, Dryad, Hadoop, Dynamo, Hadoop
all designed to handle the unexpected failures.
Of course, hardware—chiefly memory and CPU limitations—is often a major factor in
software limits on dataset size. Many applications are designed to read entire datasets
into memory and work with them there; a good example of this is the popular statistical
computing environment R.7 Memory-bound applications naturally exhibit higher per-
formance than disk-bound ones (at least insofar as the data-crunching they carry out
advances beyond single-pass, purely sequential processing), but requiring all data to fit
in memory means that if you have a dataset larger than your installed RAM, you’re out
of luck. So in this case do you will wait until the next generation of the hardware come?!
On most hardware platforms, there’s a much harder limit on memory expansion than
disk expansion: the motherboard has only so many slots to fill.
The problem often goes further than this, however. Like most other aspects of comput-
er hardware, maximum memory capacities increase with time; 32 GB is no longer a rare
configuration for a desktop workstation, and servers are frequently configured with far
more than that. There is no guarantee, however, that a memory-bound application will
be able to use all installed RAM. Even under modern 64-bit operating systems, many




Art of Distributed - Part 0: Why Distributed Computing.
Haytham ElFadeel - Hfadeel@acm.org
applications today (e.g., R under Windows) have only 32-bit executables and are limited
to 4-GB address spaces—this often translates into a 2- or 3-GB working set limitation.
Finally, even where a 64-bit binary is available—removing the absolute address space
limitation—all too often relics from the age of 32-bit code still pervade software, partic-
ularly in the use of 32-bit integers to index array elements. Thus, for example, 64-bit
versions of R (available for Linux and Mac) use signed 32-bit integers to represent
lengths, limiting data frames to at most 231-1, or about 2 billion rows. Even on a 64-bit
system with sufficient RAM to hold the data, therefore, a 6.75-billion-row dataset such
as the earlier world census example ends up being too big for R to handle.




Art of Distributed - Part 0: Why Distributed Computing.
Haytham ElFadeel - Hfadeel@acm.org

More Related Content

Viewers also liked

P0 n 02 re 01 recepción de materias primas y productos semielaborados (2) (1)
P0 n 02 re 01 recepción de materias primas y productos semielaborados (2) (1)P0 n 02 re 01 recepción de materias primas y productos semielaborados (2) (1)
P0 n 02 re 01 recepción de materias primas y productos semielaborados (2) (1)
transportemultisolucione
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar text
George Ang
 
Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...
George Ang
 
Huffman coding
Huffman codingHuffman coding
Huffman coding
George Ang
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarization
George Ang
 
Resultados de las elecciones en Bolivia
Resultados de las elecciones en BoliviaResultados de las elecciones en Bolivia
Resultados de las elecciones en Bolivia
Jesús Alanoca
 

Viewers also liked (6)

P0 n 02 re 01 recepción de materias primas y productos semielaborados (2) (1)
P0 n 02 re 01 recepción de materias primas y productos semielaborados (2) (1)P0 n 02 re 01 recepción de materias primas y productos semielaborados (2) (1)
P0 n 02 re 01 recepción de materias primas y productos semielaborados (2) (1)
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar text
 
Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...
 
Huffman coding
Huffman codingHuffman coding
Huffman coding
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarization
 
Resultados de las elecciones en Bolivia
Resultados de las elecciones en BoliviaResultados de las elecciones en Bolivia
Resultados de las elecciones en Bolivia
 

Similar to Art Of Distributed P0

Seminar
SeminarSeminar
Seminar
hpmizuki
 
Big Data
Big DataBig Data
Big Data
NGDATA
 
Distributed Computing & MapReduce
Distributed Computing & MapReduceDistributed Computing & MapReduce
Distributed Computing & MapReduce
coolmirza143
 
Big data business case
Big data   business caseBig data   business case
Big data business case
Karthik Padmanabhan ( MLE℠)
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data Platform
GeekNightHyderabad
 
Waters Grid & HPC Course
Waters Grid & HPC CourseWaters Grid & HPC Course
Waters Grid & HPC Course
jimliddle
 
Development of resource-intensive applications in Visual C++
Development of resource-intensive applications in Visual C++Development of resource-intensive applications in Visual C++
Development of resource-intensive applications in Visual C++
Andrey Karpov
 
Grid computing dis
Grid computing disGrid computing dis
Grid computing dis
gopishna09
 
Development of resource-intensive applications in Visual C++
Development of resource-intensive applications in Visual C++Development of resource-intensive applications in Visual C++
Development of resource-intensive applications in Visual C++
PVS-Studio
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
Arvind Kalyan
 
Vectorization whitepaper
Vectorization whitepaperVectorization whitepaper
Vectorization whitepaper
VIVEKSINGH634333
 
Databases benoitg 2009-03-10
Databases benoitg 2009-03-10Databases benoitg 2009-03-10
Databases benoitg 2009-03-10
benoitg
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
Varun Narang
 
Computer architecture
Computer architectureComputer architecture
Computer architecture
Dr. Vardhan choubey
 
bigdata 2.pptx
bigdata 2.pptxbigdata 2.pptx
bigdata 2.pptx
AjayAgarwal107
 
Blades for HPTC
Blades for HPTCBlades for HPTC
Blades for HPTC
Guy Coates
 
hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training report
Sarvesh Meena
 
http://www.hfadeel.com/Blog/?p=151
http://www.hfadeel.com/Blog/?p=151http://www.hfadeel.com/Blog/?p=151
http://www.hfadeel.com/Blog/?p=151
xlight
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
sudhakara st
 
bigdata.pptx
bigdata.pptxbigdata.pptx
bigdata.pptx
VIJAYAPRABAP
 

Similar to Art Of Distributed P0 (20)

Seminar
SeminarSeminar
Seminar
 
Big Data
Big DataBig Data
Big Data
 
Distributed Computing & MapReduce
Distributed Computing & MapReduceDistributed Computing & MapReduce
Distributed Computing & MapReduce
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data Platform
 
Waters Grid & HPC Course
Waters Grid & HPC CourseWaters Grid & HPC Course
Waters Grid & HPC Course
 
Development of resource-intensive applications in Visual C++
Development of resource-intensive applications in Visual C++Development of resource-intensive applications in Visual C++
Development of resource-intensive applications in Visual C++
 
Grid computing dis
Grid computing disGrid computing dis
Grid computing dis
 
Development of resource-intensive applications in Visual C++
Development of resource-intensive applications in Visual C++Development of resource-intensive applications in Visual C++
Development of resource-intensive applications in Visual C++
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
 
Vectorization whitepaper
Vectorization whitepaperVectorization whitepaper
Vectorization whitepaper
 
Databases benoitg 2009-03-10
Databases benoitg 2009-03-10Databases benoitg 2009-03-10
Databases benoitg 2009-03-10
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Computer architecture
Computer architectureComputer architecture
Computer architecture
 
bigdata 2.pptx
bigdata 2.pptxbigdata 2.pptx
bigdata 2.pptx
 
Blades for HPTC
Blades for HPTCBlades for HPTC
Blades for HPTC
 
hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training report
 
http://www.hfadeel.com/Blog/?p=151
http://www.hfadeel.com/Blog/?p=151http://www.hfadeel.com/Blog/?p=151
http://www.hfadeel.com/Blog/?p=151
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
bigdata.pptx
bigdata.pptxbigdata.pptx
bigdata.pptx
 

More from George Ang

腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道George Ang
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化George Ang
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间George Ang
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨George Ang
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站
George Ang
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程George Ang
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagementGeorge Ang
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享George Ang
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍George Ang
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享George Ang
 
腾讯大讲堂17 性能优化不是仅局限于后台(qzone)
腾讯大讲堂17 性能优化不是仅局限于后台(qzone)腾讯大讲堂17 性能优化不是仅局限于后台(qzone)
腾讯大讲堂17 性能优化不是仅局限于后台(qzone)George Ang
 
腾讯大讲堂18 让我们戴上有色眼镜--qzone前台架构的优化分享
腾讯大讲堂18 让我们戴上有色眼镜--qzone前台架构的优化分享腾讯大讲堂18 让我们戴上有色眼镜--qzone前台架构的优化分享
腾讯大讲堂18 让我们戴上有色眼镜--qzone前台架构的优化分享George Ang
 
腾讯大讲堂19 系统优化的方向
腾讯大讲堂19 系统优化的方向腾讯大讲堂19 系统优化的方向
腾讯大讲堂19 系统优化的方向George Ang
 
腾讯大讲堂13 soso访问速度优化
腾讯大讲堂13 soso访问速度优化腾讯大讲堂13 soso访问速度优化
腾讯大讲堂13 soso访问速度优化George Ang
 
腾讯大讲堂21 搜索引擎优化(seo)简介
腾讯大讲堂21 搜索引擎优化(seo)简介腾讯大讲堂21 搜索引擎优化(seo)简介
腾讯大讲堂21 搜索引擎优化(seo)简介George Ang
 
腾讯大讲堂24 qq show2.0重构历程
腾讯大讲堂24 qq show2.0重构历程腾讯大讲堂24 qq show2.0重构历程
腾讯大讲堂24 qq show2.0重构历程George Ang
 
腾讯大讲堂25 企业级搜索托管平台介绍
腾讯大讲堂25 企业级搜索托管平台介绍腾讯大讲堂25 企业级搜索托管平台介绍
腾讯大讲堂25 企业级搜索托管平台介绍George Ang
 
腾讯大讲堂26 带宽优化之道
腾讯大讲堂26 带宽优化之道腾讯大讲堂26 带宽优化之道
腾讯大讲堂26 带宽优化之道George Ang
 

More from George Ang (20)

腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享
 
腾讯大讲堂17 性能优化不是仅局限于后台(qzone)
腾讯大讲堂17 性能优化不是仅局限于后台(qzone)腾讯大讲堂17 性能优化不是仅局限于后台(qzone)
腾讯大讲堂17 性能优化不是仅局限于后台(qzone)
 
腾讯大讲堂18 让我们戴上有色眼镜--qzone前台架构的优化分享
腾讯大讲堂18 让我们戴上有色眼镜--qzone前台架构的优化分享腾讯大讲堂18 让我们戴上有色眼镜--qzone前台架构的优化分享
腾讯大讲堂18 让我们戴上有色眼镜--qzone前台架构的优化分享
 
腾讯大讲堂19 系统优化的方向
腾讯大讲堂19 系统优化的方向腾讯大讲堂19 系统优化的方向
腾讯大讲堂19 系统优化的方向
 
腾讯大讲堂13 soso访问速度优化
腾讯大讲堂13 soso访问速度优化腾讯大讲堂13 soso访问速度优化
腾讯大讲堂13 soso访问速度优化
 
腾讯大讲堂21 搜索引擎优化(seo)简介
腾讯大讲堂21 搜索引擎优化(seo)简介腾讯大讲堂21 搜索引擎优化(seo)简介
腾讯大讲堂21 搜索引擎优化(seo)简介
 
腾讯大讲堂24 qq show2.0重构历程
腾讯大讲堂24 qq show2.0重构历程腾讯大讲堂24 qq show2.0重构历程
腾讯大讲堂24 qq show2.0重构历程
 
腾讯大讲堂25 企业级搜索托管平台介绍
腾讯大讲堂25 企业级搜索托管平台介绍腾讯大讲堂25 企业级搜索托管平台介绍
腾讯大讲堂25 企业级搜索托管平台介绍
 
腾讯大讲堂26 带宽优化之道
腾讯大讲堂26 带宽优化之道腾讯大讲堂26 带宽优化之道
腾讯大讲堂26 带宽优化之道
 

Recently uploaded

Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 

Recently uploaded (20)

Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 

Art Of Distributed P0

  • 1. Part 0 – Answer important argument Art of Distributed ‘Why Distributed Computing’ Haytham ElFadeel - Hfadeel@acm.org Abstract Day after day distributed systems tend to appear in mainstream systems. In spite of ma- turity of distributed frameworks and tools, a lot of distributed technologies still ambi- guous for a lot of people, especially if we talks about complicated or new stuff. Lately I have been getting a lot of questions about distributed computing, especially dis- tributed computing models, and MapReduce, such as: What is MapReduce? Can Ma- pReduce fit in all situations? How can we compares it with other technologies such as Grid Computing? And what is the best solution to our situation? So I decided to write about distributed computing article into two parts. The first, about distributed compu- ting models and the differences between them. The second part, I will discuss reliability of the distrusted system and distributed storage systems. So let’s start… Introduction: After publish the part 1 of this article that entitled ‘Rethinking about the Distributed Computing’, I received a few message doubting about Distributed Computing. So I de- cide to go back and write Part 0 – Answer important argument ‘Why Distributed Com- puting’. Why Distributed Computing: A log of challenges require involving of huge amount of Data, Image if you want to crawling the whole Web, perform Monte Carlo simulation, or render movie animation, etc. Such challenges and many others cannot perform in one PC even. Usually we resort to Distributed Computing when we want deal with huge amount of data. So the question is what is the big data? Art of Distributed - Part 0: Why Distributed Computing. Haytham ElFadeel - Hfadeel@acm.org
  • 2. The Big Data - Experience with single machine: Today computers have several Gigabytes or a few Terabytes of storage system capacity. But what is ‘big data’ anyway? Gigabytes? Terabytes? Petabytes? A database on the order of 100 GB would not be considered trivially small even today, although hard drives capable of storing 10 times as much can be had for less than $100 at any computer store. The U.S. Census database included many different datasets of varying sizes, but let’s simplify a bit: 100 gigabytes is enough to store at least the basic demographic information: age, sex, income, ethnicity, language, religion, housing status, and location, packed in a 128-bit record—for every living human being on this planet. This would create a table of 6.75 billion rows and maybe 10 columns. Should that still be considered ‘big data’? It depends, on what you’re trying to do with it. More important- ly, any competent programmer could in a few hours write a simple, unoptimized appli- cation on a $500 desktop PC with minimal CPU and RAM that could crunch through that dataset and return answers to simple aggregation queries such as ‘what is the median age by sex for each country?’ with perfectly reasonable performance. To demonstrate this, I tried it with fake data, a file consisting of 6.75 billion 16-byte (i.e. 128-bit) records containing uniformly distributed random data. Since a seven-bit age field allows a maximum of 128 possible values, one bit for sex allows only two, and eight bits for country allows up to 256 option, we can calculate the median age by using a counting strategy: simply create 65,536 buckets-one for each combination of age, sex, and country-and count how many records fall into each. We find the median age by de- termining, for each sex and country group, the cumulative count over the 128 age buck- ets: the median is the bucket where the count reaches half of the total. [See figure 1]. In my tests, this algorithm was limited primarily by the speed of disk fetching: In aver- age 90-megabyte-per-second sustained read speed, shamefully underutilizing the CPU the whole time. Art of Distributed - Part 0: Why Distributed Computing. Haytham ElFadeel - Hfadeel@acm.org
  • 3. In fact, our table will fit in the memory of a single, $15,000 Dell server with 128-GB RAM. Running off in-memory data, my simple median-age-by-sex-and-country program completed in less than a minute. By such measures, I would hesitate to call this ‘big da- ta’. This application will need in average between 725-second, and 1000-second to per- form the above quire. But what if the data bigger than this, you want perform more complex quires, or what if this data stored into Database system which means every record will take only 128-bit??? Actually if you add more data or make the query more complex the time will likely to get exponential or at least multiply! So the question answer is: It’s depends on what you’re trying to do with the Data. Hard Limit: Hardware and software failures are important limit, there is no way to predict or protect your system from the failures. Even if you use a High-end Server or expensive hardware the failure is likely to occur. So in single machine you likely are going to lose your whole work! But if you distribute the work across many machines you will lose one part of the work and you can resume it into another machine. Actually the modern distributed strategies show its ability and reliable to handle the hardware and software failures, Systems and Computing Model such as: MapReduce, Dryad, Hadoop, Dynamo, Hadoop all designed to handle the unexpected failures. Of course, hardware—chiefly memory and CPU limitations—is often a major factor in software limits on dataset size. Many applications are designed to read entire datasets into memory and work with them there; a good example of this is the popular statistical computing environment R.7 Memory-bound applications naturally exhibit higher per- formance than disk-bound ones (at least insofar as the data-crunching they carry out advances beyond single-pass, purely sequential processing), but requiring all data to fit in memory means that if you have a dataset larger than your installed RAM, you’re out of luck. So in this case do you will wait until the next generation of the hardware come?! On most hardware platforms, there’s a much harder limit on memory expansion than disk expansion: the motherboard has only so many slots to fill. The problem often goes further than this, however. Like most other aspects of comput- er hardware, maximum memory capacities increase with time; 32 GB is no longer a rare configuration for a desktop workstation, and servers are frequently configured with far more than that. There is no guarantee, however, that a memory-bound application will be able to use all installed RAM. Even under modern 64-bit operating systems, many Art of Distributed - Part 0: Why Distributed Computing. Haytham ElFadeel - Hfadeel@acm.org
  • 4. applications today (e.g., R under Windows) have only 32-bit executables and are limited to 4-GB address spaces—this often translates into a 2- or 3-GB working set limitation. Finally, even where a 64-bit binary is available—removing the absolute address space limitation—all too often relics from the age of 32-bit code still pervade software, partic- ularly in the use of 32-bit integers to index array elements. Thus, for example, 64-bit versions of R (available for Linux and Mac) use signed 32-bit integers to represent lengths, limiting data frames to at most 231-1, or about 2 billion rows. Even on a 64-bit system with sufficient RAM to hold the data, therefore, a 6.75-billion-row dataset such as the earlier world census example ends up being too big for R to handle. Art of Distributed - Part 0: Why Distributed Computing. Haytham ElFadeel - Hfadeel@acm.org