The Importance of using Small Solutions to solve Big Problems
How to move a mountain (of data)
Christopher Gallo
Technology Evangelist
SoftLayer, an IBM Company
Houston, USA
cgallo@us.ibm.com
Abstract—Designing applications that can produce meaningful results out of large-scale data sets is a challenging and often problematic undertaking. The difficulties in these projects are often compounded by designers using an improper tool, or worse, designing a new tool that is inadequate for the task. In the current state of cloud computing there exists a myriad of services and software to handle even the most daunting tasks; however, discovering these tools is often a challenge in and of itself. This paper
presents a case study concerning the design of an application
that uses minimal code to solve a large-data problem as an
exercise in choosing the proper tools and creating a quickly
scalable application in a cloud environment. The study will
take every registered Internet Domain Name and determine if
it is hosted by a specific hosting provider (in this case
SoftLayer, an IBM Company). While the case may seem
simple, the technical challenges presented are both interesting
to solve, and general enough to apply to a wide variety of
similar problems. This case study shows the benefits provided
by Infrastructure as a Service (IaaS), queues as a form of task
distribution, configuration management tools for rapid
scalability, and the importance of leveraging threads for
maximum performance.
Keywords—Infrastructure as a Service; Cloud Scaling; Large-Scale Application Design
I. INTRODUCTION
"The Cloud" is defined by The National Institute of
Standards and Technology as a model for enabling
ubiquitous, convenient, on-demand network access to a
shared pool of configurable computing resources (e.g.,
networks, servers, storage, applications, and services) that
can be rapidly provisioned and released with minimal
management effort or service provider interaction [1].
Creating an application that is not only capable of operating in "The Cloud" but optimized for it is challenging, in part due to the highly distributed and dynamic nature of "The Cloud", and to the rapidly changing array of tools that must be employed. This case study will solve the same problem with two different methods: a traditional single-node approach, and a cloud-based approach. While many of the techniques required can, and will, be used for the single-node approach, only when we apply these techniques to "The Cloud" will we see their optimal value.
The problem starts off fairly simply. We are tasked with
iterating through every registered domain name, and
assessing whether it is hosted in a SoftLayer[2] datacenter or
not. The scale of the problem becomes clear when we consider how many domains there could be. The only limitation on a domain name is that each label be at most 63 ASCII characters, usually only A-Z and the "-" character [3]. This gives us on the order of 26^63 possible combinations per Top Level Domain (TLD), of which there are now over 800 [4]. To make our task somewhat easier, various registrars
allow access to their list of registered domain names, so we
will restrict our search to only domains we know to exist,
and will not attempt to search every possible domain name
combination, as that would take an eternity. The registrars behind the most popular TLDs, .COM, .NET, and .ORG all provide access, which together comprises about 80% of the total registered domains, or around 150,000,000 domains in total [5].
We will need to be content with that number, as obtaining
access to 100% of domains is cost prohibitive for this case
study.
This paper will present the case study by first elaborating
on some of the background technical challenges presented by
iterating through one hundred and fifty million records and
how we plan to solve them, along with the methodology we
plan to use for the two cases. Then we will discuss the Base Case, a traditional single-node solution to this problem, and some of the lessons learned. Next we will
study the Cloud Case, and how it compares to the Base Case.
Finally, we will close with some thoughts on what could have been done better, along with some other concluding remarks.
II. BACKGROUND
It might seem unusual that a large IaaS provider like
SoftLayer does not have ready access to the information on
which domains are being hosted on their infrastructure, but
while SoftLayer keeps track of how many servers are online
and the number of IP addresses that are being leased out,
SoftLayer does not keep track of anything that runs on the
server once access is handed over to a customer. This leaves SoftLayer in the position of having to determine the number of domains hosted the hard way: by checking each and every registered domain.
Since there are around 150,000,000 domains to check, a monolithic program where each domain is processed fully before proceeding to the next is simply going to take too long; each task must be broken down and parallelized as much as possible. Multi-threaded programming is generally significantly more challenging than single-threaded programming, to such a degree that many programmers avoid it altogether [18]. Yet here multi-threading is going to be a must in order to get meaningful results in a reasonable amount of time. While multi-threaded programming has not gotten easier since Bridges et al. [18] was published in 2007, there are now many new tools, explored here, that help make the task easier.
Even on a single machine, being able to take advantage of every core is paramount to maximizing the performance of an application [19], and the easiest way for this application is to split every task into its own program that can run simultaneously and independently of the others. The tasks are broken down as listed below.
a. Domain Parser
This is the script that is responsible for taking the files provided by the various registrars and adding them to the RabbitMQ server. These zone files are downloaded ahead of time, since they can be fairly large, and are stored on the system running the Domain Parser. To help minimize queue transactions, domains are packaged into groups of 25. Each package is a simple array of objects, encoded as JSON. The logic for this code is shown in Fig. 1.
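A minimal sketch of the parser logic, assuming the pika client [20] (the queue name and zone file path are illustrative, not the ones actually used):

```python
import json
import pika  # AMQP client for Python [20]

BATCH_SIZE = 25  # domains per queue message, as described above

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='domains', durable=True)  # hypothetical queue name

batch = []
with open('com.zone') as zone_file:  # zone file downloaded ahead of time
    for line in zone_file:
        # real zone-file parsing is more involved; the domain is simply
        # assumed to be the first field of each record here
        batch.append({'domain': line.split()[0]})
        if len(batch) == BATCH_SIZE:
            channel.basic_publish(exchange='', routing_key='domains',
                                  body=json.dumps(batch))
            batch = []
if batch:  # flush any partial final batch
    channel.basic_publish(exchange='', routing_key='domains',
                          body=json.dumps(batch))
connection.close()
```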
b. Domain Resolver
This script takes a packet of domains from the queue,
attempts to resolve each one in a thread, and then adds an
updated packet of domains to a final queue, adding in some
new information about the domain. This section is where
multi-threading will really shine. The average time to resolve
a domain successfully for this project was 0.306 seconds.
However, even with optimizations to Unbound, the time to
unsuccessfully resolve a domain was 2.051 seconds, which is
a very long time for a CPU to wait for a result. Thankfully, threads allow us to continue attempting to resolve other domains while we wait on a response from the upstream DNS server. The logic for this code is shown in Fig. 2.
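A minimal sketch of the resolver logic described above; the thread-per-domain approach shown here is one plausible implementation, using only the standard library for the lookups (queue names are the same illustrative ones as before):

```python
import json
import socket
import threading
import pika

def resolve(entry):
    # gethostbyname() goes through the local Unbound resolver; a failed
    # lookup is recorded as None rather than raised, so the batch completes
    try:
        entry['ip'] = socket.gethostbyname(entry['domain'])
    except socket.error:
        entry['ip'] = None

def on_packet(ch, method, properties, body):
    packet = json.loads(body)
    # one thread per domain, so slow (especially failing) lookups overlap
    # instead of serializing behind each other
    threads = [threading.Thread(target=resolve, args=(entry,))
               for entry in packet]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    ch.basic_publish(exchange='', routing_key='resolved',
                     body=json.dumps(packet))
    ch.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='domains', durable=True)
channel.queue_declare(queue='resolved', durable=True)  # the "final queue"
channel.basic_consume(on_packet, queue='domains')  # pika 0.10 signature [20]
channel.start_consuming()
```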
DNS lookups are going to be the biggest bottleneck for
this study, especially since it is expected that about 25% of
the lookups will result in a failure [6], which will
significantly slow down the rate at which we can query
domains. To mitigate this, a local DNS resolver service
(Unbound DNS [7]) will be required so that control can be
exercised over how long to wait on slow DNS servers, and to
limit caching to save on resource utilization. Each domain will only be queried once, so there should be no need for caching at all in this project.
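The exact Unbound settings used are not listed in the paper; a plausible unbound.conf for this workload, disabling the caches and raising concurrency, might look like the following (all values here are illustrative):

```
server:
    interface: 127.0.0.1
    access-control: 127.0.0.0/8 allow
    num-threads: 2                   # roughly one per available core
    outgoing-range: 8192             # allow many simultaneous upstream queries
    msg-cache-size: 0                # each domain is queried exactly once,
    rrset-cache-size: 0              # so caching only wastes RAM
    unknown-server-time-limit: 752   # bound waits on slow upstream servers (msec)
```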
c. Domain Checker
This script takes a packet of domains from the final queue, and checks against our database of IP addresses to see whether the IP address of each domain is a SoftLayer IP address or not. Once the check is complete, the domain object is updated with that information and finally saved to ElasticSearch. The logic is shown in Fig. 3.
Fig. 1. Domain Parser Logic
Fig. 2. Domain Resolver Logic
To control the even distribution of domains to the processes of each program, a message queue needs to be added. For this project, an Advanced Message Queuing Protocol (AMQP) compatible queue was chosen because it is an open standard supported by a wide variety of client and service applications [8]. The AMQP protocol is designed to be usable from different programming environments, operating systems, and hardware devices, and to make high-performance implementations possible on various network transports, including TCP, SCTP (Stream Control Transmission Protocol), and InfiniBand [9].
Fig. 3. Domain Checker Logic
Specifically, RabbitMQ was chosen for this project due to its ease of setup and its support for the Python programming language [20]; however, any AMQP-compatible service would likely have worked just as well.
Although the WHOIS [22] database is a great resource for looking up which organization owns an IP address, it will not be used here, as SoftLayer has provided a database containing all of their IP address information. To make querying this database as fast as possible, the IP information will be converted from the common dotted-quad format into its decimal representation using the netaddr python library. These decimal numbers will be stored in an indexed MySQL database to facilitate fast queries [23].
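As an illustration of this decimal-conversion approach, the hypothetical is_softlayer_ip() helper from the checker sketch might look like this (the table and column names are assumptions):

```python
import MySQLdb   # MySQL driver for Python 2
import netaddr   # dotted quad <-> integer conversions

db = MySQLdb.connect(host='localhost', user='domains',
                     passwd='secret', db='domains')

def is_softlayer_ip(ip):
    # e.g. '10.0.0.1' -> 167772161
    ip_int = int(netaddr.IPAddress(ip))
    cursor = db.cursor()
    # ip_start and ip_end hold the integer bounds of each SoftLayer
    # subnet; an index on (ip_start, ip_end) keeps this lookup fast [23]
    cursor.execute(
        "SELECT 1 FROM softlayer_subnets"
        " WHERE ip_start <= %s AND ip_end >= %s LIMIT 1",
        (ip_int, ip_int))
    return cursor.fetchone() is not None
```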
Storing the data is the most important technical challenge to solve, since up until this point all of the work done has been in memory, and would be lost if the services were shut down. NoSQL is defined as a collection of next-generation databases mostly addressing some of these points: being non-relational, distributed, open-source and horizontally scalable [10], which address precisely the problems we are likely to encounter. There is a wide variety of NoSQL implementations, and for this project a Document Store style offering is the best fit for how the data will be used after it is stored. Out of the huge variety of NoSQL applications that could possibly work for this project, ElasticSearch was chosen for three main reasons.
• Storing data is fast, and as simple as forming an HTTP PUT request [21] (a small example follows this list).
• Searching through the data is the main purpose of ElasticSearch, which will be useful for doing post-mortem data analysis.
• Most importantly, Kibana [11] is a fantastic tool to
visualize data stored in ElasticSearch, and was used to
create many of the graphs in this case study.
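For instance, the hypothetical save_to_elasticsearch() helper from the checker sketch can be as little as a single PUT, here via the requests library (the index and type names are assumptions):

```python
import json
import requests

def save_to_elasticsearch(entry):
    # ElasticSearch indexes the document at /index/type/id [21]
    url = 'http://localhost:9200/domains/domain/%s' % entry['domain']
    requests.put(url, data=json.dumps(entry))
```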
Finally, all of this will run on the Debian “jessie/sid” operating system, with most of the custom code written in python 2.7. The operating system and programming language are just personal preferences, however; similar results should be expected with different choices made here.
III. METHODOLOGY
The end goal of this project is to determine with some accuracy the exact number of domains that resolve to a SoftLayer-owned IP address. Yet there are three important milestones that will be observed in trying to reach this goal.
1. The proof of concept. During this phase, the core components of the project are put together, tested, and checked for consistency. This is critically important for any software project.
2. The Base Case. The first full run through the data set, which will serve as a benchmark for what performance looks like given a single-server approach.
3. The Cloud Case. Here we will attempt to leverage as many resources as possible to answer our question in the shortest time possible, and the result will be compared against the Base Case.
While finding the answer to our question may be interesting to some, especially SoftLayer, we have set up this study to help answer some questions that might be more relevant to the community, specifically those who lack extensive experience working with cloud technologies and distributed workloads. We hope to address the following general concerns with this case study.
Concern 1
What are the difficulties in solving a large-data problem with
a monolithic approach?
Concern 2
How much time and effort can be saved with a cloud based
approach compared to a monolithic approach?
These concerns are important because they mirror the concerns of many newcomers to the cloud computing space, and addressing them will hopefully alleviate some of the hesitancy to adopt cloud computing.
IV. PROOF OF CONCEPT
Creating a proof of concept version is critical to the success of any application. It is during this phase that we try to answer the most basic question: "can this plan actually work?". Even with most of the technology stack already chosen, creating a proof of concept is important to prove that all the technology works well together before effort is wasted on a solution that is impossible. This stage brought to light a collection of issues that had not previously been apparent.
As mentioned earlier, multi-threaded programming is
inherently difficult, and working out these difficulties is
much easier in the proof of concept phase than in a full
production run. This phase also uncovered an interesting problem: the domain files were being parsed entirely too quickly, which crashed the RabbitMQ server almost instantly by exhausting the available RAM. Thankfully this issue was discovered early; with some fine-tuning of the RabbitMQ settings, and some rate limiting on the parsing program, everything ran very smoothly afterwards.
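The paper does not list the exact settings changed, but RabbitMQ's memory behavior is governed by knobs along these lines in rabbitmq.config (the values shown are illustrative, not the ones used):

```erlang
[
  {rabbit, [
    %% block publishers once RAM use crosses this fraction of total memory
    {vm_memory_high_watermark, 0.4},
    %% start paging queue contents to disk well before the watermark is hit
    {vm_memory_high_watermark_paging_ratio, 0.5}
  ]}
].
```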
Aside from those major issues, this proof of concept phase helped illuminate which areas of the program were likely to break, and where best to put logging messages to ensure any errors were being properly reported and handled. The data structure used to pass domain information between processes was finalized here, along with the final document that would eventually be stored in ElasticSearch.
V. BASE CASE
With the proof of concept finished, it is time to move on to actually running everything together at full speed. This involves ordering a new server, installing the required libraries and packages, configuring everything, and then setting all the programs running.
A. New Problems
Going from a proof of concept to a full run is generally bound to uncover new problems, and this transition was no exception. The first unexpected hurdle turned out to be turning a python program into a background service, which was surprisingly complicated, at least for someone not intimately familiar with how Debian manages startup scripts. Secondly, while DNS lookups were expected to be fairly CPU expensive, they turned out to be the main limiting factor in how many processes could be launched at once. Since none of the DNS lookups being performed would already be cached, the resolver needed to query the root name servers, then the zone name servers, and finally the authoritative name servers for each domain.
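The paper does not show how the daemonization was finally solved; on a systemd-based Debian, one modern approach is a minimal unit file such as the following (paths and names are hypothetical):

```ini
[Unit]
Description=Domain Resolver worker
After=network.target unbound.service

[Service]
ExecStart=/usr/bin/python /opt/domains/resolver.py
Restart=on-failure
User=domains

[Install]
WantedBy=multi-user.target
```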
Passing messages between processes with RabbitMQ was incredibly easy, but slightly error prone. The biggest issue was that the connector would occasionally hit a timeout, which caused the resolver program to exit. Once logic was added to the programs interacting with RabbitMQ to handle that exception and keep going, everything ran smoothly.
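That recovery logic amounts to wrapping the consume loop so that a dropped connection is retried instead of killing the process; roughly, reusing the on_packet callback from the resolver sketch (exception classes per pika [20]):

```python
import time
import pika
import pika.exceptions

while True:
    try:
        connection = pika.BlockingConnection(
            pika.ConnectionParameters('localhost'))
        channel = connection.channel()
        channel.basic_consume(on_packet, queue='domains')
        channel.start_consuming()
    except pika.exceptions.AMQPConnectionError:
        time.sleep(5)  # back off briefly, then reconnect and resume
```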
B. The Hardware
The power of a bare metal server has been well documented [12], so for this single-server case a single bare metal server will be used to get optimal performance. This server is something that could easily be found in any datacenter, or at least something very similar.

Fig. 4. Base Case Domains Per Hour
The server will be an Intel Xeon E3-1270 (4 cores @ 3.40GHz) with 2 hard drives and 8 GB of RAM, costing $0.368/hour [13]. This server was chosen for its fast clock speed, cheap hourly rate, and enough RAM to hold all our data.
C. Results
Below is a breakdown of the average CPU percentage each part of our solution took up. These numbers are approximate averages, meant to give a good sense of where most of the time was spent. As noted earlier, Unbound (the DNS resolver) takes up nearly 50% of the CPU time. RabbitMQ and ElasticSearch are both fairly low on this chart, which was a little unexpected; however, it goes a long way to show how powerful and well made these tools are. It should be no surprise that the code written specifically for this study performed worse than tools written by industry experts.
TABLE I. CPU USAGE BREAKDOWN

Process                 CPU %
Unbound                   45%
Domain Resolver x 40      25%
Domain Parser              1%
Domain Checker             1%
RabbitMQ                  15%
ElasticSearch             10%
Operating System           3%

Fig. 5. RabbitMQ Network Utilization
Overall, the whole system took about 300 hours to run, for a grand total of $102.672, averaging between 100 and 200 domains a second. A bargain, considering that the Intel Xeon E3-1270 v3 CPU alone costs $373.11 [24]. Increasing the number of cores would help reduce the runtime, however there are only so many cores you can fit inside a single machine. The biggest hourly server SoftLayer provides is the Intel Xeon E5-2690 v3 (12 cores, 2.60 GHz) at $2.226/hour [14]. Since this server has three times as many cores as our original, it can be generously assumed this process would have taken a third of the time (100 hours). However, 100 hours @ $2.226/hour is significantly more expensive, at $222.60.
Overall, once all of the programs were set running, the Base Case performed admirably without supervision. There are still some performance improvements that could have been made to the code and the configuration of services, but that would take a significant amount of intimate knowledge about each service and some of the inner workings of the python libraries involved; to get our runtime and overall cost down, it is easier to simply spread everything out into a cloud deployment.
VI. CLOUD CASE
One of the many benefits of Cloud Computing is a smoother scalability path: Cloud Computing empowers any application with architectures that were designed to scale easily with added hardware and infrastructure resources [15]. This path to smoother scalability is exactly what this case will study. The simplest way to start scaling is to split each service onto its own bare metal or virtual server. The RabbitMQ service will get a virtual server with plenty of RAM, while the ElasticSearch service, MySQL, and the Domain Parser will share a bare metal server with plenty of disk space and ample disk speed. Unbound and the Domain Resolver will be paired together on a series of virtual servers to maximize cores while minimizing costs; each of these virtual servers needs at least two cores, one to run Unbound, and the other to work through all of the Domain Resolver threads. The Domain Checker service will also get a series of virtual servers, as it too depends only on CPU time, with very little disk or RAM usage.
A. New Problems
The first major problem in adopting a cloud computing deployment is the network. In the Base Case, data was transferred between services via the loopback interface, which is incredibly fast since the data never actually has to go over the wire. In the Cloud Case, however, it quickly became apparent that the default 100Mbps data transfer rate was entirely too slow for our application. Thankfully, it is a simple matter to upgrade to a 1Gbps connection in a cloud environment, which provided plenty of bandwidth, with our application maxing out at around 250Mbps. Due to the amount of data being transferred over the network, bandwidth costs also became a big concern. Luckily, SoftLayer does not meter traffic over their private network, even across regions [25]. Provided all network traffic is kept to the private network, there are no additional costs for splitting out the infrastructure.
1. Network traffic handled by RabbitMQ
Configuration management starts to become a real problem in cloud environments due to the ever-increasing number of nodes requiring configuration. Setting up a single server is a fairly trivial task for any seasoned administrator, but managing dozens of nodes that all need to be provisioned simultaneously becomes a bit of a nightmare. Thankfully, there are a myriad of configuration management tools [16] that help manage cloud deployments, and for this project SaltStack [17] was selected for its ability to easily provision servers on the SoftLayer platform. Once SaltStack has been fleshed out with the details of the application and its deployment structure, creating the thirty-six servers required for the Cloud Case comes down to one simple command, and it takes about fifteen minutes for all nodes to be provisioned, configured, and running the programs they were told to run.
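As an illustration, the provisioning side of that setup could be a salt-cloud map file like the following (profile and node names are hypothetical; salt-cloud is the provisioning companion to SaltStack [17]):

```yaml
# /etc/salt/cloud.map -- hypothetical profiles defined against the
# SoftLayer salt-cloud driver
resolver_node:
  - resolver01
  - resolver02
  # ... through resolver25
checker_node:
  - checker01
  # ... through checker10
rabbit_node:
  - rabbit01
```

Running `salt-cloud -m /etc/salt/cloud.map -P` then creates all of the mapped nodes in parallel, after which the normal Salt states configure and start each service.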
B. The Hardware
a. Domain Master - Hourly Bare Metal - 4 cores @ 3.50GHz, 32 GB RAM @ $0.595/hour
This server is responsible for being the master for the SaltStack configuration management, along with running the ElasticSearch, Kibana, MySQL, and Domain Parser services. This is the only bare metal server, since it is the only node where data is actually written to or read from a disk.
b. Rabbit Node - Virtual Server - 4 cores, 48 GB RAM, 1Gbps network @ $0.606/hour
Responsible for the RabbitMQ service. 48 GB of RAM is a significant increase from the Base Case, due to the rate at which domains enter the queue. In the Base Case we limited the rate of the Domain Parser to keep pace with the Domain Resolver; here that rate limit has been removed, since the Domain Resolver will be scaled up significantly and the hardware provisioned here can hold the entirety of the data being worked with. This makes the network a limiting factor where it was not previously, hence the 1Gbps network connection.
c. 25 Resolver Nodes - Virtual Server - 2 cores, 1 GB RAM @ $0.060/hour
Responsible for Unbound and the Domain Resolver script. Each node can run about 40 Domain Resolver scripts before maxing out the CPU. Because the Domain Resolver depends so heavily on Unbound, keeping the two together worked out very well.
d. 10 Checking Nodes - Virtual Server - 2 cores, 1 GB RAM @ $0.060/hour
Responsible for running the Domain Checker script. Each node can run about 80 Domain Checker scripts before maxing out the CPU. The amount of work required by the Domain Checker is significantly less than by the Domain Resolver, which is why the same number of domains could be processed with ten nodes instead of the twenty-five used for the Domain Resolver.
Separating out the services in this manner has the very significant advantage of being able to use more CPUs and RAM than can fit into a single server. Each server, aside from the Domain Master, was managed entirely by SaltStack, from the ordering step all the way to the final provisioning and running of the needed services, without ever having to log in to the server itself.
Overall, the server count here was a bit on the conservative side; even so, this setup completely exceeded expectations without hitting any cloud bottlenecks. With 78 cores working, the Cloud Case managed to progress through between 6,000 and 7,000 domains a second, a huge increase over the Base Case.
C. Results
From the point domains started being added to RabbitMQ, this project took a little under 6 hours to fully complete the assigned task, and could have been even shorter had more Resolver Nodes been added. The project was left to run overnight, given the extreme length of time the Base Case took; the run had already completed before anyone noticed how fast it was going, which is why no additional Resolver Nodes were added.
Despite the significantly higher CPU and RAM count used in the Cloud Case, the end cost was only $26.998, roughly a quarter of the Base Case cost. This should make it clear how powerful cloud architectures can be, in both time and money savings.
Since everything is also specified in SaltStack, redeploying this environment is a trivial process, which is another huge benefit of using a cloud computing model for solving problems.
VII. CONCLUSION
In the face of increasingly vast and complicated workloads, traditional programming techniques are quickly becoming inadequate and time consuming. Distributing tasks across a wide array of discrete nodes is going to be a critical aspect of any large-data project, and mastering the plethora of services that assist programmers in this space is a must for any developer entering the Cloud Era.
Fig. 6. Cloud Case Domains Per Hour
Message queues as a tool for task distribution, along with NoSQL data stores, are going to play some of the biggest roles in these architectures. Hopefully this paper helped shed some light on how all these services can work together to build a successful application, even without a significant amount of prior knowledge of the products involved.
Finally, we can fully address our concerns from earlier.
Concern 1
The difficulties in solving large-data problems with a monolithic approach tend to be the limitations imposed by physical restrictions. Even though the Cloud Case and the Base Case used a similar software architecture, for the Base Case we simply couldn't get a server big enough to go through the data in even a reasonable fraction of the time the Cloud Case required. And even though a myriad of unfamiliar technology was employed here, generally the only information required was how to get each service installed, and how to get data into or out of it. While the inner workings remain a mystery, the services themselves perform well with intelligently designed defaults.
Concern 2
With the Cloud Case clocking in at around 6 hours and $27, it greatly surpassed the Base Case in both time and cost, as the Base Case took around 300 hours and $103. Although it is counterintuitive, using more computing power can actually be cheaper if it reduces the required computation time of a program. Getting the Cloud Case set up in SaltStack was certainly challenging and time consuming; however, now that the work has been done, redeploying the Cloud Case takes no time at all, whereas redeploying the Base Case would still take a few hours of configuration by hand to get everything working.
In conclusion, it should hopefully be clear that expertise in cloud computing is not required to take advantage of the power it offers. Nor should distributed or parallelized programming techniques be avoided because they are difficult to understand; the performance improvements they allow are too great to ignore. Work is constantly being done to make these techniques easier to understand, and there are already a great many tools and concepts, such as queues for message transfer between programs, that allow even an inexperienced developer to make good choices in how to solve difficult problems.
VIII. ACKNOWLEDGMENTS
This work was sponsored by SoftLayer, which is why they were the IaaS vendor of choice in this paper. While the pricing and servers are specific to SoftLayer, we expect the findings in this paper to be replicable with any other IaaS vendor. The developers at SaltStack were also incredibly helpful in sorting out issues relating to some of the more complicated configurations in the deployment.
REFERENCES
[1] http://faculty.winthrop.edu/domanm/csci411/Handouts/NIST.pdf
[2] https://softlayer.com
[3] https://tools.ietf.org/html/rfc1035, section 3.1
[4] https://ntldstats.com/
[5] http://www.registrarstats.com/TLDDomainCounts.aspx
[6] J. Jung, E. Sit, H. Balakrishnan, and R. Morris, "DNS Performance and the Effectiveness of Caching," IEEE/ACM Transactions on Networking, vol. 10, no. 5, October 2002.
[7] https://www.unbound.net/
[8] https://en.wikipedia.org/wiki/Advanced_Message_Queuing_Protocol
[9] J. O'Hara, "Toward a commodity enterprise middleware," ACM Queue, vol. 5, no. 4, pp. 48-55, 2007.
[10] http://nosql-database.org/
[11] https://www.elastic.co/products/kibana
[12] J. Ekanayake and G. Fox, "High performance parallel computing with clouds and cloud technologies," in Cloud Computing, Springer Berlin Heidelberg, 2010, pp. 294-308.
[13] https://www.softlayer.com/Store/orderHourlyBareMetalInstance/37276/64
[14] https://www.softlayer.com/Store/orderHourlyBareMetalInstance/165559/103
[15] M. Creeger, "Cloud Computing: An Overview," ACM Queue, vol. 7, no. 5, 2009.
[16] https://en.wikipedia.org/wiki/Configuration_management
[17] http://saltstack.com/
[18] M. Bridges et al., "Revisiting the sequential programming model for multi-core," in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, IEEE Computer Society, 2007.
[19] J. Dean and S. Ghemawat, "Distributed programming with Mapreduce," in Beautiful Code, Sebastopol: O'Reilly Media, Inc., 2007, p. 384.
[20] https://pika.readthedocs.org/en/0.10.0/
[21] https://www.elastic.co/guide/en/elasticsearch/guide/current/create-doc.html
[22] https://whois.icann.org/en/about-whois
[23] B. Schwartz, P. Zaitsev, and V. Tkachenko, High Performance MySQL: Optimization, Backups, and Replication, O'Reilly Media, Inc., 2012, pp. 115-130.
[24] http://amzn.com/B00D697QRM
[25] http://blog.softlayer.com/tag/private-network

More Related Content

What's hot

Using Graph Databases in Real-Time to Solve Resource Authorization at Telenor...
Using Graph Databases in Real-Time to Solve Resource Authorization at Telenor...Using Graph Databases in Real-Time to Solve Resource Authorization at Telenor...
Using Graph Databases in Real-Time to Solve Resource Authorization at Telenor...
Sebastian Verheughe
 
Architecting for the cloud elasticity security
Architecting for the cloud elasticity securityArchitecting for the cloud elasticity security
Architecting for the cloud elasticity security
Len Bass
 
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Kevin Mao
 
Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...
Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...
Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...
confluent
 
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
confluent
 
Scheduling in cloud computing
Scheduling in cloud computingScheduling in cloud computing
Scheduling in cloud computing
ijccsa
 
Serverless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the EnterpriseServerless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the Enterprise
Arun Kejariwal
 
Architecting for the cloud cloud providers
Architecting for the cloud cloud providersArchitecting for the cloud cloud providers
Architecting for the cloud cloud providers
Len Bass
 
How NOSQL Paid off for Telenor
How NOSQL Paid off for TelenorHow NOSQL Paid off for Telenor
How NOSQL Paid off for Telenor
Sebastian Verheughe
 
COST-MINIMIZING DYNAMIC MIGRATION OF CONTENT DISTRIBUTION SERVICES INTO HYBR...
 COST-MINIMIZING DYNAMIC MIGRATION OF CONTENT DISTRIBUTION SERVICES INTO HYBR... COST-MINIMIZING DYNAMIC MIGRATION OF CONTENT DISTRIBUTION SERVICES INTO HYBR...
COST-MINIMIZING DYNAMIC MIGRATION OF CONTENT DISTRIBUTION SERVICES INTO HYBR...
Nexgen Technology
 
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
csandit
 
Redis For Distributed & Fault Tolerant Data Plumbing Infrastructure
Redis For Distributed & Fault Tolerant Data Plumbing Infrastructure Redis For Distributed & Fault Tolerant Data Plumbing Infrastructure
Redis For Distributed & Fault Tolerant Data Plumbing Infrastructure
Redis Labs
 
System to generate speech to text in real time
System to generate speech to text in real timeSystem to generate speech to text in real time
System to generate speech to text in real time
Saptarshi Chatterjee
 
zenoh: zero overhead pub/sub store/query compute
zenoh: zero overhead pub/sub store/query computezenoh: zero overhead pub/sub store/query compute
zenoh: zero overhead pub/sub store/query compute
Angelo Corsaro
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Microservices, Kafka Streams and KafkaEsque
Microservices, Kafka Streams and KafkaEsqueMicroservices, Kafka Streams and KafkaEsque
Microservices, Kafka Streams and KafkaEsque
confluent
 
Dynamic Cloud Partitioning and Load Balancing in Cloud
Dynamic Cloud Partitioning and Load Balancing in Cloud Dynamic Cloud Partitioning and Load Balancing in Cloud
Dynamic Cloud Partitioning and Load Balancing in Cloud
Shyam Hajare
 

What's hot (18)

Using Graph Databases in Real-Time to Solve Resource Authorization at Telenor...
Using Graph Databases in Real-Time to Solve Resource Authorization at Telenor...Using Graph Databases in Real-Time to Solve Resource Authorization at Telenor...
Using Graph Databases in Real-Time to Solve Resource Authorization at Telenor...
 
Architecting for the cloud elasticity security
Architecting for the cloud elasticity securityArchitecting for the cloud elasticity security
Architecting for the cloud elasticity security
 
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
 
HADRFINAL13112016
HADRFINAL13112016HADRFINAL13112016
HADRFINAL13112016
 
Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...
Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...
Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...
 
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
 
Scheduling in cloud computing
Scheduling in cloud computingScheduling in cloud computing
Scheduling in cloud computing
 
Serverless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the EnterpriseServerless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the Enterprise
 
Architecting for the cloud cloud providers
Architecting for the cloud cloud providersArchitecting for the cloud cloud providers
Architecting for the cloud cloud providers
 
How NOSQL Paid off for Telenor
How NOSQL Paid off for TelenorHow NOSQL Paid off for Telenor
How NOSQL Paid off for Telenor
 
COST-MINIMIZING DYNAMIC MIGRATION OF CONTENT DISTRIBUTION SERVICES INTO HYBR...
 COST-MINIMIZING DYNAMIC MIGRATION OF CONTENT DISTRIBUTION SERVICES INTO HYBR... COST-MINIMIZING DYNAMIC MIGRATION OF CONTENT DISTRIBUTION SERVICES INTO HYBR...
COST-MINIMIZING DYNAMIC MIGRATION OF CONTENT DISTRIBUTION SERVICES INTO HYBR...
 
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
 
Redis For Distributed & Fault Tolerant Data Plumbing Infrastructure
Redis For Distributed & Fault Tolerant Data Plumbing Infrastructure Redis For Distributed & Fault Tolerant Data Plumbing Infrastructure
Redis For Distributed & Fault Tolerant Data Plumbing Infrastructure
 
System to generate speech to text in real time
System to generate speech to text in real timeSystem to generate speech to text in real time
System to generate speech to text in real time
 
zenoh: zero overhead pub/sub store/query compute
zenoh: zero overhead pub/sub store/query computezenoh: zero overhead pub/sub store/query compute
zenoh: zero overhead pub/sub store/query compute
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Microservices, Kafka Streams and KafkaEsque
Microservices, Kafka Streams and KafkaEsqueMicroservices, Kafka Streams and KafkaEsque
Microservices, Kafka Streams and KafkaEsque
 
Dynamic Cloud Partitioning and Load Balancing in Cloud
Dynamic Cloud Partitioning and Load Balancing in Cloud Dynamic Cloud Partitioning and Load Balancing in Cloud
Dynamic Cloud Partitioning and Load Balancing in Cloud
 

Viewers also liked

Actividad economica 2
Actividad economica   2Actividad economica   2
Actividad economica 2
Jose Quiroz
 
The victorian at victoria park naples florida.text.marked
The victorian at victoria park naples florida.text.markedThe victorian at victoria park naples florida.text.marked
The victorian at victoria park naples florida.text.markedVineyards Naples
 
Ies maria aurèlia capmany anna i júlia
Ies maria aurèlia capmany anna i júliaIes maria aurèlia capmany anna i júlia
Ies maria aurèlia capmany anna i júliajuliapons89
 
Lon chaney
Lon chaneyLon chaney
Lon chaney
bosekto
 
Ativ 3 giselavc
Ativ 3 giselavcAtiv 3 giselavc
Ativ 3 giselavc
giselavieiradacostasilvei
 
Unit a at seawatch ii bayside naples florida.text.marked
Unit a at seawatch ii bayside naples florida.text.markedUnit a at seawatch ii bayside naples florida.text.marked
Unit a at seawatch ii bayside naples florida.text.markedVineyards Naples
 
Trabalho Alexandre5ºC
Trabalho Alexandre5ºCTrabalho Alexandre5ºC
Trabalho Alexandre5ºCtuchav
 
Diogo e avô Vitor
Diogo e avô VitorDiogo e avô Vitor
Diogo e avô Vitortuchav
 
Royal at maxson homes naples florida
Royal at maxson homes naples floridaRoyal at maxson homes naples florida
Royal at maxson homes naples floridaVineyards Naples
 
Linfomas
Linfomas Linfomas
Diferenciacion celular
Diferenciacion celularDiferenciacion celular
Diferenciacion celular
Emerson Fabri
 
LordJeshuaInheritanceNovember2016.7007
LordJeshuaInheritanceNovember2016.7007LordJeshuaInheritanceNovember2016.7007
LordJeshuaInheritanceNovember2016.7007
Lord Jesus Christ
 

Viewers also liked (20)

Jp2001207245
Jp2001207245Jp2001207245
Jp2001207245
 
Diiapo
DiiapoDiiapo
Diiapo
 
Actividad economica 2
Actividad economica   2Actividad economica   2
Actividad economica 2
 
Folleto 06
Folleto 06Folleto 06
Folleto 06
 
The victorian at victoria park naples florida.text.marked
The victorian at victoria park naples florida.text.markedThe victorian at victoria park naples florida.text.marked
The victorian at victoria park naples florida.text.marked
 
Ies maria aurèlia capmany anna i júlia
Ies maria aurèlia capmany anna i júliaIes maria aurèlia capmany anna i júlia
Ies maria aurèlia capmany anna i júlia
 
8
88
8
 
Plantas
PlantasPlantas
Plantas
 
Ll 14 3
Ll 14   3Ll 14   3
Ll 14 3
 
Lon chaney
Lon chaneyLon chaney
Lon chaney
 
3
33
3
 
Ativ 3 giselavc
Ativ 3 giselavcAtiv 3 giselavc
Ativ 3 giselavc
 
Unit a at seawatch ii bayside naples florida.text.marked
Unit a at seawatch ii bayside naples florida.text.markedUnit a at seawatch ii bayside naples florida.text.marked
Unit a at seawatch ii bayside naples florida.text.marked
 
Trabalho Alexandre5ºC
Trabalho Alexandre5ºCTrabalho Alexandre5ºC
Trabalho Alexandre5ºC
 
Hamlin Knight Brochure
Hamlin Knight BrochureHamlin Knight Brochure
Hamlin Knight Brochure
 
Diogo e avô Vitor
Diogo e avô VitorDiogo e avô Vitor
Diogo e avô Vitor
 
Royal at maxson homes naples florida
Royal at maxson homes naples floridaRoyal at maxson homes naples florida
Royal at maxson homes naples florida
 
Linfomas
Linfomas Linfomas
Linfomas
 
Diferenciacion celular
Diferenciacion celularDiferenciacion celular
Diferenciacion celular
 
LordJeshuaInheritanceNovember2016.7007
LordJeshuaInheritanceNovember2016.7007LordJeshuaInheritanceNovember2016.7007
LordJeshuaInheritanceNovember2016.7007
 

Similar to moveMountainIEEE

2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole
2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole
2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole
Vasu S
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
SN Chakraborty
 
Fast Synchronization In IVR Using REST API For HTML5 And AJAX
Fast Synchronization In IVR Using REST API For HTML5 And AJAXFast Synchronization In IVR Using REST API For HTML5 And AJAX
Fast Synchronization In IVR Using REST API For HTML5 And AJAX
IJERA Editor
 
introduction to distributed computing.pptx
introduction to distributed computing.pptxintroduction to distributed computing.pptx
introduction to distributed computing.pptx
ApthiriSurekha
 
D017212027
D017212027D017212027
D017212027
IOSR Journals
 
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
IOSR Journals
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
Kashyap Parmar
 
Vps server 19
Vps server 19Vps server 19
Vps server 19
GilberteFarnsworth31
 
Scaling Databricks to Run Data and ML Workloads on Millions of VMs
Scaling Databricks to Run Data and ML Workloads on Millions of VMsScaling Databricks to Run Data and ML Workloads on Millions of VMs
Scaling Databricks to Run Data and ML Workloads on Millions of VMs
Matei Zaharia
 
Performance and Cost Analysis of Modern Public Cloud Services
Performance and Cost Analysis of Modern Public Cloud ServicesPerformance and Cost Analysis of Modern Public Cloud Services
Performance and Cost Analysis of Modern Public Cloud ServicesMd.Saiedur Rahaman
 
Real time service oriented cloud computing
Real time service oriented cloud computingReal time service oriented cloud computing
Real time service oriented cloud computing
www.pixelsolutionbd.com
 
1Running head WINDOWS SERVER DEPLOYMENT PROPOSAL2WINDOWS SE.docx
1Running head WINDOWS SERVER DEPLOYMENT PROPOSAL2WINDOWS SE.docx1Running head WINDOWS SERVER DEPLOYMENT PROPOSAL2WINDOWS SE.docx
1Running head WINDOWS SERVER DEPLOYMENT PROPOSAL2WINDOWS SE.docx
aulasnilda
 
Understanding Cloud Computing by BS Infotech
Understanding Cloud Computing by BS InfotechUnderstanding Cloud Computing by BS Infotech
Understanding Cloud Computing by BS Infotech
ranapoonam1
 
Unit 3
Unit 3Unit 3
H017144148
H017144148H017144148
H017144148
IOSR Journals
 
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
IOSR Journals
 
Performance comparison on java technologies a practical approach
Performance comparison on java technologies   a practical approachPerformance comparison on java technologies   a practical approach
Performance comparison on java technologies a practical approach
csandit
 
PERFORMANCE COMPARISON ON JAVA TECHNOLOGIES - A PRACTICAL APPROACH
PERFORMANCE COMPARISON ON JAVA TECHNOLOGIES - A PRACTICAL APPROACHPERFORMANCE COMPARISON ON JAVA TECHNOLOGIES - A PRACTICAL APPROACH
PERFORMANCE COMPARISON ON JAVA TECHNOLOGIES - A PRACTICAL APPROACH
cscpconf
 
Scaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, GoalsScaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, Goals
kamaelian
 

Similar to moveMountainIEEE (20)

Tombolo
TomboloTombolo
Tombolo
 
2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole
2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole
2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
Fast Synchronization In IVR Using REST API For HTML5 And AJAX
Fast Synchronization In IVR Using REST API For HTML5 And AJAXFast Synchronization In IVR Using REST API For HTML5 And AJAX
Fast Synchronization In IVR Using REST API For HTML5 And AJAX
 
introduction to distributed computing.pptx
introduction to distributed computing.pptxintroduction to distributed computing.pptx
introduction to distributed computing.pptx
 
D017212027
D017212027D017212027
D017212027
 
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
Vps server 19
Vps server 19Vps server 19
Vps server 19
 
Scaling Databricks to Run Data and ML Workloads on Millions of VMs
Scaling Databricks to Run Data and ML Workloads on Millions of VMsScaling Databricks to Run Data and ML Workloads on Millions of VMs
Scaling Databricks to Run Data and ML Workloads on Millions of VMs
 
Performance and Cost Analysis of Modern Public Cloud Services
Performance and Cost Analysis of Modern Public Cloud ServicesPerformance and Cost Analysis of Modern Public Cloud Services
Performance and Cost Analysis of Modern Public Cloud Services
 
Real time service oriented cloud computing
Real time service oriented cloud computingReal time service oriented cloud computing
Real time service oriented cloud computing
 
1Running head WINDOWS SERVER DEPLOYMENT PROPOSAL2WINDOWS SE.docx
1Running head WINDOWS SERVER DEPLOYMENT PROPOSAL2WINDOWS SE.docx1Running head WINDOWS SERVER DEPLOYMENT PROPOSAL2WINDOWS SE.docx
1Running head WINDOWS SERVER DEPLOYMENT PROPOSAL2WINDOWS SE.docx
 
Understanding Cloud Computing by BS Infotech
Understanding Cloud Computing by BS InfotechUnderstanding Cloud Computing by BS Infotech
Understanding Cloud Computing by BS Infotech
 
Unit 3
Unit 3Unit 3
Unit 3
 
H017144148
H017144148H017144148
H017144148
 
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
 
Performance comparison on java technologies a practical approach
Performance comparison on java technologies   a practical approachPerformance comparison on java technologies   a practical approach
Performance comparison on java technologies a practical approach
 
PERFORMANCE COMPARISON ON JAVA TECHNOLOGIES - A PRACTICAL APPROACH
PERFORMANCE COMPARISON ON JAVA TECHNOLOGIES - A PRACTICAL APPROACHPERFORMANCE COMPARISON ON JAVA TECHNOLOGIES - A PRACTICAL APPROACH
PERFORMANCE COMPARISON ON JAVA TECHNOLOGIES - A PRACTICAL APPROACH
 
Scaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, GoalsScaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, Goals
 

moveMountainIEEE

  • 1. The Importance of using Small Solutions to solve Big Problems How to move a mountain (of data) Christopher Gallo Technology Evangelist SoftLayer, an IBM Company Houston, USA cgallo@us.ibm.com Abstract— Abstract- Designing applications that can produce meaningful results out of large-scale data sets is a challenging and often problematic undertaking. The difficulties in these projects are often compounded by designers using the improper tool, or worse, designing a new tool that is inadequate for the task. In the current state of cloud computing, there exists a myriad of services and software to handle even the most daunting tasks, however discovering these tools is often a challenge in and of itself. This paper presents a case study concerning the design of an application that uses minimal code to solve a large-data problem as an exercise in choosing the proper tools and creating a quickly scalable application in a cloud environment. The study will take every registered Internet Domain Name and determine if it is hosted by a specific hosting provider (in this case SoftLayer, an IBM Company). While the case may seem simple, the technical challenges presented are both interesting to solve, and general enough to apply to a wide variety of similar problems. This case study shows the benefits provided by Infrastructure as a Service (IaaS), queues as a form of task distribution, configuration management tools for rapid scalability, and the importance of leveraging threads for maximum performance. Keywords-component; Infrastructure as a Service; Cloud Scaling; Large-Scale Application Design; I. INTRODUCTION "The Cloud" is defined by The National Institute of Standards and Technology as a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. [1] Creating an application that is not only capable, but optimized, for operating in "The Cloud" is challenging in part due to the very distributed and dynamic nature of "The Cloud", and to the rapidly changing array of tools that need to be employed. This case study will solve the same problem with two different methods, one a traditional single node approach, and the other a cloud based approach. While many of the techniques required can, and will be used for the single node approach, only when we apply these techniques to "The Cloud" will we see their optimal value. The problem starts off fairly simply. We are tasked with iterating through every registered domain name, and assessing whether it is hosted in a SoftLayer[2] datacenter or not. The scale of the problem becomes clear when we discover how many domains there could be. The only limitation on a domain name is that each label be less than 63 ASCII characters, usually only A-Z and the "-" character [3]. This give us a grand total of 63^26 possible combinations per Top Level Domain (TLD), of which there are now over 800 [4]. To make our task somewhat easier, various registrars allow access to their list of registered domain names, so we will restrict our search to only domains we know to exist, and will not attempt to search every possible domain name combination, as that would take an eternity. 
The registrars behind the most popular TLDs, .COM, .NET, and .ORG all give out access, which comprises about 80% of the total registered domains, or around 150,000,000 domains total [5]. We will need to be content with that number, as obtaining access to 100% of domains is cost prohibitive for this case study. This paper will present the case study by first elaborating on some of the background technical challenges presented by iterating through one hundred and fifty million records and how we plan to solve them, along with the methodology we plan to use for the two cases. Then we will discuss the Base Case, which would be a traditional single node solution to this problem, and some of the lessons learned. Next we will study the Cloud Case, and how it compares to the Base Case. Finally we will close with some thoughts on what could have been done better along with some other concluding remarks. II. BACKGROUND It might seem unusual that a large IaaS provider like SoftLayer does not have ready access to the information on which domains are being hosted on their infrastructure, but while SoftLayer keeps track of how many servers are online and the number of IP addresses that are being leased out, SoftLayer does not keep track of anything that runs on the server once access is handed over to a customer. So this leaves SoftLayer in a position of having to determine the number of domains hosted the hard way, by checking each and every registered domain.
  • 2. Since there are around 150,000,000 domains to check, using a monolithic program where each domain is processed fully before proceeding to the next is simply going to take too long, each task must be broken down and parallelized as much as possible. Multi-threaded programming is generally significantly more challenging than single-treaded programming, to such a degree that many programmers avoid it altogether [18]. Yet here multi-threading is going to be a must in order to get meaningful results in a reasonable amount of time. While multi-threaded programming has not gotten easier since the paper by Bridges, Matthew, et al was published in 2007, there are now many new tools which will be explored here to help make the task easier. Even on a single machine, being able to take advantage of every core is paramount to maximizing performance of an application [19], and the easiest way for this application will be to split every task into its own program that can run simultaneously and independently of each other. The tasks will be broken down as listed below. a. Domain Parser This is the script that is responsible for taking the files provided by the various registrars and adding them to the RabbitMQ server. These zone file are downloaded ahead of time since they can be fairly large and are located on the system running the Domain Parser. To help minimize queue transactions, each domain is packaged into groups of 25. The package is a simple array of objects, encoded as JSON. The logic for this code is in Fig. 1: b. Domain Resolver This script takes a packet of domains from the queue, attempts to resolve each one in a thread, and then adds an updated packet of domains to a final queue, adding in some new information about the domain. This section is where multi-threading will really shine. The average time to resolve a domain successfully for this project was 0.306 seconds. However, even with optimizations to Unbound, the time to unsuccessfully resolve a domain was 2.051 seconds, which is a very long time for a CPU to wait for a result. Thankfully threads allow us the ability to continue to attempt to resolve domains while we wait on a response from the upstream DNS server. The logic for this code is contained in Fig. 2. DNS lookups are going to be the biggest bottleneck for this study, especially since it is expected that about 25% of the lookups will result in a failure [6], which will significantly slow down the rate at which we can query domains. To mitigate this, a local DNS resolver service (Unbound DNS [7]) will be required so that control can be exercised over how long to wait on slow DNS servers, and to limit caching to save on resource utilization. Each domain will be only queried once, so there should be no need for caching at all in this project. c. Domain Checker This script takes a packet of domains from the final queue, and checks against our database of IP addresses to see if the IP address of the domain is a SoftLayer IP address or not. Once the check is complete, the domain object is updated with that information and finally saved to Elastic Search. The logic is in Fig. 3. 1. Domain Parser Logic 2. Domain Resolver Logic To control the even distribution of domains to processes between each program, a message queue will need to be added. For this project an Advanced Message Queuing Protocol (AMQP) compatible queue was chosen because it is an open standard supported by a wide variety of client and service applications [8]. 
the AMQP protocol is designed to be usable from different programming environments, operating systems, and hardware devices, as well as making high-performance implementations possible on various network transports including TCP, SCTP (Stream Control Transmission Protocol), and InfiniBand [9].
c. Domain Checker
This script takes a packet of domains from the final queue and checks each one against our database of IP addresses to see whether the domain's IP address belongs to SoftLayer. Once the check is complete, the domain object is updated with that information and saved to ElasticSearch. The logic is in Fig. 3, and a sketch appears at the end of this section, once the supporting data stores have been introduced.

[Fig. 3: Domain Checker Logic]

To control the even distribution of domains between the processes of each program, a message queue is needed. For this project an Advanced Message Queuing Protocol (AMQP) compatible queue was chosen because it is an open standard supported by a wide variety of client and service applications [8]. The AMQP protocol is designed to be usable from different programming environments, operating systems, and hardware devices, as well as to make high-performance implementations possible on various network transports including TCP, SCTP (Stream Control Transmission Protocol), and InfiniBand [9]. Specifically, RabbitMQ was chosen for this project due to its ease of setup and support for the Python programming language [20]; however, any AMQP-compatible service would likely have worked just as well.

Although the WHOIS [22] database is a great resource for looking up which organization owns an IP address, it is not used here, as SoftLayer has provided a database containing all of their IP address information. To make querying this database as fast as possible, each IP address is converted from the common dotted-quad format into its decimal representation using the netaddr Python library, and these decimal numbers are stored in an indexed MySQL database to facilitate fast queries [23].

Storing the data is the most important technical challenge to solve, since up until this point all of the work has been in memory and would be lost if the services were shut down. NoSQL is defined as a collection of next-generation databases mostly addressing some of these points: being non-relational, distributed, open source, and horizontally scalable [10], which are precisely the problems this project is likely to encounter. There is a wide variety of NoSQL implementations, and for this project a Document Store is the best fit for how the data will be used after it is stored. Out of the huge number of NoSQL applications that could possibly work for this project, ElasticSearch was chosen for three main reasons:
• Storing data is fast, and as simple as forming an HTTP PUT request [21].
• Searching through the data is the main purpose of ElasticSearch, which is useful for post-mortem data analysis.
• Most importantly, Kibana [11] is a fantastic tool for visualizing data stored in ElasticSearch, and was used to create many of the graphs in this case study.
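Pulling those pieces together, below is a hedged sketch of the Domain Checker core (Fig. 3): netaddr converts the dotted quad to its decimal form, MySQL answers the range lookup, and the finished document is PUT into ElasticSearch. The subnets table schema, the index and type names, and the credentials are all hypothetical stand-ins, and the queue-consuming loop, which mirrors the resolver sketch above, is omitted.

    import json
    import MySQLdb
    import requests
    from netaddr import IPAddress

    db = MySQLdb.connect(host='localhost', user='checker',
                         passwd='secret', db='softlayer_ips')
    cursor = db.cursor()

    # Hypothetical schema: one row per SoftLayer subnet, with the range
    # bounds stored as indexed integers for fast lookups.
    RANGE_QUERY = "SELECT COUNT(*) FROM subnets WHERE %s BETWEEN ip_start AND ip_end"

    def check_and_store(entry):
        if entry.get('ip'):
            decimal_ip = int(IPAddress(entry['ip']))  # dotted quad -> decimal
            cursor.execute(RANGE_QUERY, (decimal_ip,))
            entry['softlayer'] = cursor.fetchone()[0] > 0
        else:
            entry['softlayer'] = False  # the domain never resolved
        # Creating an ElasticSearch document is a single HTTP PUT [21].
        url = 'http://localhost:9200/domains/domain/%s' % entry['domain']
        requests.put(url, data=json.dumps(entry))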
Finally, all of this runs on the Debian "jessie/sid" operating system, with most of the custom code written in Python 2.7. The operating system and programming language are simply personal preferences; similar results should be expected with different choices here.

III. Methodology
The end goal of this project is to determine with some accuracy the exact number of domains that resolve to a SoftLayer-owned IP address. Yet there are three important milestones on the way to that goal.
1. The Proof of Concept. Here the core components of the project are put together, tested, and checked for consistency, a critically important step for any software project.
2. The Base Case. The first full run through the data set, which will serve as a benchmark for what performance looks like with a single-server approach.
3. The Cloud Case. Here as many resources as possible are leveraged to answer the question in the shortest time possible, to be compared against the Base Case.

While the answer itself may be interesting to some, especially SoftLayer, this study is set up to answer questions more relevant to the community, specifically to those who lack extensive experience with cloud technologies and distributed workloads. This case study addresses the following general concerns.
Concern 1: What are the difficulties in solving a large-data problem with a monolithic approach?
Concern 2: How much time and effort can be saved with a cloud-based approach compared to a monolithic approach?
These concerns are important because they mirror many of the concerns newcomers to the cloud computing space encounter, and addressing them will hopefully alleviate some of the hesitancy to adopt cloud computing.

IV. Proof of Concept
Creating a proof of concept is critical to the success of any application. It is during this phase that the most basic question gets answered: "can this plan actually work?". Even with most of the technology stack chosen beforehand, a proof of concept is important to show that all of the technology works well together, before effort is wasted on a solution that turns out to be impossible. This stage brought to light a collection of issues that had not previously been apparent. As mentioned earlier, multi-threaded programming is inherently difficult, and working out those difficulties is much easier in the proof of concept phase than in a full production run. This phase also uncovered an interesting problem: the domain files were being parsed entirely too quickly, which crashed the RabbitMQ server almost instantly by exhausting the available RAM. Thankfully the issue was discovered early, and with some fine-tuning of the RabbitMQ settings and some rate limiting on the parsing program, everything ran smoothly afterwards.
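One simple way to implement that rate limiting, sketched here under the same assumptions as the earlier snippets, is to have the parser poll the queue depth with a passive declare and back off whenever too many messages are waiting; the ceiling and sleep values are illustrative, not the exact tuning used.

    import time
    import pika

    MAX_QUEUE_DEPTH = 50000  # illustrative ceiling, tuned to the broker's RAM

    connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
    channel = connection.channel()

    def wait_for_queue_room(queue='domains'):
        # A passive declare does not create the queue; it only returns the
        # queue's current statistics, including the waiting message count.
        while True:
            frame = channel.queue_declare(queue=queue, passive=True)
            if frame.method.message_count < MAX_QUEUE_DEPTH:
                return
            time.sleep(5)  # back off while the resolvers drain the queue

The parser would then call wait_for_queue_room() before each publish_batch() call.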
Aside from those major issues, the proof of concept phase helped illuminate which areas of the program were likely to break, and where best to put logging messages to ensure any errors were properly reported and handled. The data structure used to pass domain information between processes was finalized here, along with the final document that would eventually be stored in ElasticSearch.

V. Base Case
With the proof of concept finished, it is time to run everything together at full speed. This involves ordering a new server, installing the required libraries and packages, configuring everything, and then setting all of the programs running.

A. New Problems
Going from a proof of concept to a full run is bound to uncover new problems, and this transition was no exception. The first unexpected hurdle was turning a Python program into a background service, which was surprisingly complicated, at least for someone not intimately familiar with how Debian manages startup scripts. Secondly, while DNS lookups were expected to be fairly CPU-expensive, they turned out to be the limiting factor in how many processes could be launched at once: since none of the lookups would already be cached, the resolver had to query the root name servers, then the zone name servers, and finally the authoritative name servers for each domain. Passing messages between processes with RabbitMQ was incredibly easy, but slightly error-prone. The biggest issue was that the connector would occasionally hit a timeout, causing the resolver program to exit. Once logic was added to the programs interacting with RabbitMQ to handle that exception and carry on, everything ran smoothly.

B. The Hardware
The power of a bare metal server has been well documented [12], so for this single-server case a single bare metal server is used to get the most optimal performance. The server is something easily found in any datacenter, or at least something very similar: an Intel Xeon E3-1270, 4 cores @ 3.40 GHz, 2 hard drives, and 8 GB of RAM, costing $0.368/hour [13]. This server was chosen for its fast clock speed, cheap hourly rate, and enough RAM to hold all of the data.

C. Results
Table I is a breakdown of the approximate average CPU percentage each part of the solution consumed, to give a sense of where most of the time was spent. As noted earlier, Unbound (the DNS resolver) takes up nearly half of the CPU time. RabbitMQ and ElasticSearch both sit fairly low on the chart, which was a little unexpected, but it goes a long way toward showing how powerful and well made those tools are; it should be no surprise that the code written specifically for this study performed worse than tools written by industry experts.

TABLE I. CPU USAGE BREAKDOWN
Process                 CPU %
Unbound                 45%
Domain Resolver x 40    25%
Domain Parser           1%
Domain Checker          1%
RabbitMQ                15%
ElasticSearch           10%
Operating System        3%

[Fig. 4: Base Case Domains Per Hour]
[Fig. 5: RabbitMQ Network Utilization]
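Given how much of the CPU budget Unbound consumes, its configuration is worth showing. The snippet below is a hedged sketch of the kind of tuning described earlier, using standard unbound.conf options; the values are illustrative rather than the exact production settings.

    # /etc/unbound/unbound.conf (illustrative values)
    server:
        num-threads: 4          # one worker per core
        msg-cache-size: 4m      # shrink the caches, since every domain
        rrset-cache-size: 4m    # is looked up exactly once
        cache-max-ttl: 60       # expire anything cached almost immediately
        outgoing-range: 8192    # allow many simultaneous upstream queries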
Overall, the whole system took about 300 hours to run, for a grand total of $102.672, averaging between 100 and 200 domains a second. That is a bargain considering the Intel Xeon E3-1270 v3 CPU alone retails for $373.11 [24]. Increasing the number of cores would help reduce the runtime, but there are only so many cores that fit inside a single machine. The biggest hourly server SoftLayer provides is the Intel Xeon E5-2690 v3 (12 cores, 2.60 GHz) at $2.226/hour [14]. Since that server has three times as many cores as the original, it can be generously assumed the process would have taken a third of the time (100 hours); however, 100 hours at $2.226/hour is significantly more expensive, at $222.60. Overall, once all of the programs were set running, the Base Case performed admirably without supervision. Some performance improvements could still have been made to the code and service configuration, but that would require intimate knowledge of each service and of the inner workings of the Python libraries involved; to get the runtime and overall cost down, it is easier to simply spread everything out into a cloud deployment.

VI. Cloud Case
One of the many benefits of cloud computing is a smoother scalability path: cloud computing empowers any application with architectures that were designed to easily scale with added hardware and infrastructure resources [15]. This path to smoother scalability is exactly what this case will study. The simplest way to start scaling is to split each service onto its own bare metal or virtual server. The RabbitMQ service gets a virtual server with plenty of RAM, while the ElasticSearch service, MySQL, and the Domain Parser share a bare metal server with ample disk space and speed. Unbound and the Domain Resolver are paired together on a series of virtual servers to maximize cores while minimizing cost; each virtual server needs at least two cores, one to run Unbound and the other to work through the Domain Resolver threads. The Domain Checker service also gets a series of virtual servers, as it too depends only on CPU time, with very little disk or RAM usage.

A. New Problems
The first major problem in adopting a cloud deployment is the network. In the Base Case, data moved between services over the loopback interface, which is incredibly fast since the data never actually goes over the wire. In the Cloud Case, it quickly became apparent that the default 100 Mbps data transfer rate was entirely too slow for this application. Thankfully, upgrading to a 1 Gbps connection is a simple matter in a cloud environment, and that provided plenty of bandwidth, with the application maxing out at around 250 Mbps. Given the amount of data crossing the network, bandwidth costs also become a big concern. Luckily, SoftLayer does not meter traffic on its private network, even across regions [25], so provided all traffic is kept to the private network, there are no additional costs for splitting out the infrastructure.

[Figure: Network traffic handled by RabbitMQ]

Configuration management becomes a real problem in cloud environments due to the ever-increasing number of nodes requiring configuration. Setting up a single server is a fairly trivial task for any seasoned administrator, but managing dozens of nodes that all need to be provisioned simultaneously becomes a bit of a nightmare.
Thankfully there is a myriad of configuration management tools [16] to help manage cloud deployments, and for this project SaltStack [17] was selected for its ability to easily provision servers on the SoftLayer platform. Once SaltStack has been fleshed out with the details of the application and its deployment structure, creating the thirty-six servers required for the Cloud Case comes down to one simple command (see the sketch after the hardware list below), and it takes about fifteen minutes for all nodes to be provisioned, configured, and running their assigned programs.

B. The Hardware
a. Domain Master - Hourly Bare Metal - 4 cores @ 3.50 GHz, 32 GB RAM - $0.595/hour
This server is responsible for being the SaltStack configuration management master as well as running the ElasticSearch, Kibana, MySQL, and Domain Parser services. It is the only bare metal server, since it is the only node where data is actually written to or read from disk.
b. Rabbit Node - Virtual Server - 4 cores, 48 GB RAM, 1 Gbps network - $0.606/hour
Responsible for the RabbitMQ service. 48 GB of RAM is a significant increase from the Base Case, owing to the rate at which domains now enter the queue: the Base Case rate-limited the Domain Parser to keep pace with the Domain Resolver, but here that limit has been removed, since the Domain Resolver is scaled up significantly and this hardware can hold the entire working data set. That makes the network a limiting factor where it was not previously, hence the 1 Gbps connection.
c. 25 Resolver Nodes - Virtual Server - 2 cores, 1 GB RAM - $0.060/hour
Responsible for Unbound and the Domain Resolver script. Each node can run about 40 Domain Resolver scripts before maxing out the CPU. Given how tightly coupled Unbound and the Domain Resolver are, keeping them together worked out very well.
d. 10 Checking Nodes - Virtual Server - 2 cores, 1 GB RAM - $0.060/hour
Responsible for running the Domain Checker script. Each node can run about 80 Domain Checker scripts before maxing out the CPU. The Domain Checker requires significantly less work than the Domain Resolver, which is why the same volume of domains could be processed with ten nodes instead of twenty-five.
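The "one simple command" above can be illustrated with salt-cloud, which includes a SoftLayer driver. The map below is a hypothetical sketch mirroring the node breakdown; the profile and host names are invented, and each profile would be defined separately against a SoftLayer provider configuration.

    # /etc/salt/cloud.map -- hypothetical map of the Cloud Case fleet
    rabbit_node_profile:
      - rabbit01
    resolver_node_profile:
      - resolver01
      - resolver02
      # ... through resolver25
    checker_node_profile:
      - checker01
      # ... through checker10

Provisioning the whole fleet in parallel is then a single invocation:

    salt-cloud -m /etc/salt/cloud.map -P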
Separating out the services in this manner has the very significant advantage of using more CPUs and RAM than could ever fit into a single server. Every server aside from the Domain Master was managed entirely by SaltStack, from the ordering step through final provisioning and starting the needed services, without ever logging in to the server itself. Overall the server count here was a bit conservative, yet the setup still completely exceeded expectations without hitting any cloud bottlenecks. With 78 cores working, the Cloud Case progressed through between 6,000 and 7,000 domains a second, a huge increase over the Base Case.

C. Results
From the point domains started being added to RabbitMQ, the project took a little under 6 hours to fully complete the assigned task, and could have been even shorter had more Resolver Nodes been added. Given the extreme length of the Base Case run, this run was left to go overnight, which is why no additional Resolver Nodes were added: the project finished before anyone noticed how fast it was going. Despite the significantly higher CPU and RAM counts used in the Cloud Case, the end cost was only $26.998, roughly a quarter of the Base Case cost. This should make clear how powerful cloud architectures can be in both time and money savings. Since everything is also specified in SaltStack, redeploying this environment is trivial, which is another huge benefit of using a cloud computing model for solving problems.

[Fig. 6: Cloud Case Domains Per Hour]

VII. Conclusion
In the face of increasingly vast and complicated workloads, traditional programming techniques are quickly becoming inadequate and time-consuming. Distributing tasks across a wide array of discrete nodes is going to be a critical aspect of any large-data project, and mastering the plethora of services that assist programmers in this space is a must for any developer entering the Cloud Era. The message queue as a tool for task distribution, along with NoSQL data stores, will play some of the biggest roles in these architectures. Hopefully this paper has shed some light on how all of these services can work together to build a successful application, even without significant prior knowledge of the products involved. Finally, the concerns raised earlier can be fully addressed.

Concern 1: The difficulties in solving large-data problems with a monolithic approach tend to be the limitations imposed by physical restrictions. Even though the Cloud Case and the Base Case used a similar software architecture, the Base Case simply could not get a server big enough to go through the data in even a reasonable fraction of the Cloud Case's time. And although a myriad of unfamiliar technology was employed here, generally the only knowledge required was how to install each service and how to get data into and out of it; while the inner workings remain a mystery, the services themselves perform well with intelligently designed defaults.

Concern 2: Clocking in at around 6 hours and $27, the Cloud Case greatly surpassed the Base Case in both time and cost, as the Base Case took around 300 hours and $103.
Although it is counterintuitive, using more computing power can actually be cheaper if it reduces the required computation time. Getting the Cloud Case set up in SaltStack was certainly challenging and time-consuming, but now that the work has been done, redeploying the Cloud Case takes no time at all, whereas redeploying the Base Case would still take a few hours of configuration by hand. In conclusion, it should be clear that expertise in cloud computing is not required to take advantage of the power it offers. Nor should distributed or parallelized programming techniques be avoided because they are difficult to understand; the performance improvements they allow are too great to ignore. Work is constantly being done to make these techniques easier, and there are already a great many tools and concepts, such as queues for message transfer between programs, that allow even an inexperienced developer to make good choices in how to solve difficult problems.

VIII. Acknowledgments
This work was sponsored by SoftLayer, which is why they were the IaaS vendor of choice in this paper. While the pricing and servers are specific to SoftLayer, the findings should be replicable with any other IaaS vendor. The developers at SaltStack were also incredibly helpful in sorting out issues with some of the more complicated configurations in the deployment.
REFERENCES
[1] http://faculty.winthrop.edu/domanm/csci411/Handouts/NIST.pdf
[2] https://softlayer.com
[3] RFC 1035, section 3.1, https://tools.ietf.org/html/rfc1035
[4] https://ntldstats.com/
[5] http://www.registrarstats.com/TLDDomainCounts.aspx
[6] J. Jung, E. Sit, H. Balakrishnan, and R. Morris, "DNS Performance and the Effectiveness of Caching," IEEE/ACM Transactions on Networking, vol. 10, no. 5, October 2002.
[7] https://www.unbound.net/
[8] https://en.wikipedia.org/wiki/Advanced_Message_Queuing_Protocol
[9] J. O'Hara, "Toward a Commodity Enterprise Middleware," ACM Queue 5(4): 48-55, 2007.
[10] http://nosql-database.org/
[11] https://www.elastic.co/products/kibana
[12] J. Ekanayake and G. Fox, "High Performance Parallel Computing with Clouds and Cloud Technologies," Cloud Computing, Springer Berlin Heidelberg, 2010, pp. 294-308.
[13] https://www.softlayer.com/Store/orderHourlyBareMetalInstance/37276/64
[14] https://www.softlayer.com/Store/orderHourlyBareMetalInstance/165559/103
[15] M. Creeger, "Cloud Computing: An Overview," ACM Queue 7.5 (2009): 2.
[16] https://en.wikipedia.org/wiki/Configuration_management
[17] http://saltstack.com/
[18] M. Bridges et al., "Revisiting the Sequential Programming Model for Multi-Core," Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, IEEE Computer Society, 2007.
[19] J. Dean and S. Ghemawat, "Distributed Programming with MapReduce," Beautiful Code, Sebastopol: O'Reilly Media, Inc., 384 (2007).
[20] https://pika.readthedocs.org/en/0.10.0/
[21] https://www.elastic.co/guide/en/elasticsearch/guide/current/create-doc.html
[22] https://whois.icann.org/en/about-whois
[23] B. Schwartz, P. Zaitsev, and V. Tkachenko, High Performance MySQL: Optimization, Backups, and Replication, O'Reilly Media, Inc., 2012, pp. 115-130.
[24] http://amzn.com/B00D697QRM
[25] http://blog.softlayer.com/tag/private-network