The Importance of using Small Solutions to solve Big Problems
How to move a mountain (of data)
Christopher Gallo
Technology Evangelist
SoftLayer, an IBM Company
Houston, USA
cgallo@us.ibm.com
Abstract—Designing applications that can produce meaningful results out of large-scale data sets is a challenging and often problematic undertaking. The difficulties in these projects are often compounded by designers using an improper tool, or worse, designing a new tool that is inadequate for the task. In the current state of cloud computing there exists a myriad of services and software to handle even the most daunting tasks; however, discovering these tools is often a challenge in and of itself. This paper
presents a case study concerning the design of an application
that uses minimal code to solve a large-data problem as an
exercise in choosing the proper tools and creating a quickly
scalable application in a cloud environment. The study will
take every registered Internet Domain Name and determine if
it is hosted by a specific hosting provider (in this case
SoftLayer, an IBM Company). While the case may seem
simple, the technical challenges presented are both interesting
to solve, and general enough to apply to a wide variety of
similar problems. This case study shows the benefits provided
by Infrastructure as a Service (IaaS), queues as a form of task
distribution, configuration management tools for rapid
scalability, and the importance of leveraging threads for
maximum performance.
Keywords—Infrastructure as a Service; Cloud Scaling; Large-Scale Application Design
I. INTRODUCTION
"The Cloud" is defined by The National Institute of
Standards and Technology as a model for enabling
ubiquitous, convenient, on-demand network access to a
shared pool of configurable computing resources (e.g.,
networks, servers, storage, applications, and services) that
can be rapidly provisioned and released with minimal
management effort or service provider interaction [1].
Creating an application that is not only capable of operating in "The Cloud" but optimized for it is challenging, in part due to the highly distributed and dynamic nature of "The Cloud", and to the rapidly changing array of tools that must be employed. This case study will solve the same problem with two different methods: a traditional single-node approach, and a cloud-based approach. While many of the techniques required can, and will, be used for the single-node approach, only when we apply these techniques to "The Cloud" will we see their optimal value.
The problem starts off fairly simply. We are tasked with
iterating through every registered domain name, and
assessing whether it is hosted in a SoftLayer[2] datacenter or
not. The scale of the problem becomes clear when we consider how many domains there could be. The only limitation on a domain name is that each label be at most 63 ASCII characters, usually only A-Z and the "-" character [3]. This gives us on the order of 26^63 possible combinations per Top Level Domain (TLD), of which there are now over 800 [4]. To make our task somewhat easier, various registrars
allow access to their list of registered domain names, so we
will restrict our search to only domains we know to exist,
and will not attempt to search every possible domain name
combination, as that would take an eternity. The registrars behind the most popular TLDs, .COM, .NET, and .ORG all provide access, which together comprises about 80% of the total registered domains, or around 150,000,000 domains in total [5].
We will need to be content with that number, as obtaining
access to 100% of domains is cost prohibitive for this case
study.
This paper will present the case study by first elaborating
on some of the background technical challenges presented by
iterating through one hundred and fifty million records and
how we plan to solve them, along with the methodology we
plan to use for the two cases. Then we will discuss the Base Case, a traditional single-node solution to this problem, and some of the lessons learned. Next we will
study the Cloud Case, and how it compares to the Base Case.
Finally, we will close with some thoughts on what could have been done better, along with some other concluding remarks.
II. BACKGROUND
It might seem unusual that a large IaaS provider like
SoftLayer does not have ready access to the information on
which domains are being hosted on their infrastructure, but
while SoftLayer keeps track of how many servers are online
and the number of IP addresses that are being leased out,
SoftLayer does not keep track of anything that runs on the
server once access is handed over to a customer. This leaves SoftLayer in the position of having to determine the number of domains hosted the hard way: by checking each and every registered domain.
Since there are around 150,000,000 domains to check, a monolithic program where each domain is processed fully before proceeding to the next is simply going to take too long; each task must be broken down and parallelized as much as possible. Multi-threaded programming is generally significantly more challenging than single-threaded programming, to such a degree that many programmers avoid it altogether [18]. Yet here multi-threading is going to be a must in order to get meaningful results in a reasonable amount of time. While multi-threaded programming has not gotten easier since Bridges et al. [18] was published in 2007, there are now many new tools, explored here, that help make the task easier.
Even on a single machine, being able to take advantage of every core is paramount to maximizing the performance of an application [19], and the easiest way for this application is to split every task into its own program that can run simultaneously and independently of the others. The tasks are broken down as listed below.
a. Domain Parser
This is the script that is responsible for taking the files provided by the various registrars and adding them to the RabbitMQ server. These zone files are downloaded ahead of time, since they can be fairly large, and are stored on the system running the Domain Parser. To help minimize queue transactions, domains are packaged into groups of 25. Each package is a simple array of objects, encoded as JSON. The logic for this code is shown in Fig. 1.
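A minimal sketch of the parser logic, assuming the pika client [20] (the queue name and zone file path are illustrative, not the ones actually used):

```python
import json
import pika  # AMQP client for Python [20]

BATCH_SIZE = 25  # domains per queue message, as described above

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='domains', durable=True)  # hypothetical queue name

batch = []
with open('com.zone') as zone_file:  # zone file downloaded ahead of time
    for line in zone_file:
        # real zone-file parsing is more involved; the domain is simply
        # assumed to be the first field of each record here
        batch.append({'domain': line.split()[0]})
        if len(batch) == BATCH_SIZE:
            channel.basic_publish(exchange='', routing_key='domains',
                                  body=json.dumps(batch))
            batch = []
if batch:  # flush any partial final batch
    channel.basic_publish(exchange='', routing_key='domains',
                          body=json.dumps(batch))
connection.close()
```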
b. Domain Resolver
This script takes a packet of domains from the queue,
attempts to resolve each one in a thread, and then adds an
updated packet of domains to a final queue, adding in some
new information about the domain. This section is where
multi-threading will really shine. The average time to resolve
a domain successfully for this project was 0.306 seconds.
However, even with optimizations to Unbound, the time to
unsuccessfully resolve a domain was 2.051 seconds, which is
a very long time for a CPU to wait for a result. Thankfully, threads allow us to continue attempting to resolve other domains while we wait on a response from the upstream DNS server. The logic for this code is shown in Fig. 2.
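A minimal sketch of the resolver logic described above; the thread-per-domain approach shown here is one plausible implementation, using only the standard library for the lookups (queue names are the same illustrative ones as before):

```python
import json
import socket
import threading
import pika

def resolve(entry):
    # gethostbyname() goes through the local Unbound resolver; a failed
    # lookup is recorded as None rather than raised, so the batch completes
    try:
        entry['ip'] = socket.gethostbyname(entry['domain'])
    except socket.error:
        entry['ip'] = None

def on_packet(ch, method, properties, body):
    packet = json.loads(body)
    # one thread per domain, so slow (especially failing) lookups overlap
    # instead of serializing behind each other
    threads = [threading.Thread(target=resolve, args=(entry,))
               for entry in packet]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    ch.basic_publish(exchange='', routing_key='resolved',
                     body=json.dumps(packet))
    ch.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='domains', durable=True)
channel.queue_declare(queue='resolved', durable=True)  # the "final queue"
channel.basic_consume(on_packet, queue='domains')  # pika 0.10 signature [20]
channel.start_consuming()
```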
DNS lookups are going to be the biggest bottleneck for
this study, especially since it is expected that about 25% of
the lookups will result in a failure [6], which will
significantly slow down the rate at which we can query
domains. To mitigate this, a local DNS resolver service
(Unbound DNS [7]) will be required so that control can be
exercised over how long to wait on slow DNS servers, and to
limit caching to save on resource utilization. Each domain will only be queried once, so there should be no need for caching at all in this project.
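The exact Unbound settings used are not listed in the paper; a plausible unbound.conf for this workload, disabling the caches and raising concurrency, might look like the following (all values here are illustrative):

```
server:
    interface: 127.0.0.1
    access-control: 127.0.0.0/8 allow
    num-threads: 2                   # roughly one per available core
    outgoing-range: 8192             # allow many simultaneous upstream queries
    msg-cache-size: 0                # each domain is queried exactly once,
    rrset-cache-size: 0              # so caching only wastes RAM
    unknown-server-time-limit: 752   # bound waits on slow upstream servers (msec)
```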
c. Domain Checker
This script takes a packet of domains from the final queue, and checks against our database of IP addresses to see whether the IP address of each domain is a SoftLayer IP address or not. Once the check is complete, the domain object is updated with that information and finally saved to ElasticSearch. The logic is shown in Fig. 3.
Fig. 1. Domain Parser Logic
Fig. 2. Domain Resolver Logic
To control the even distribution of domains to the processes of each program, a message queue needs to be added. For this project, an Advanced Message Queuing Protocol (AMQP) compatible queue was chosen because it is an open standard supported by a wide variety of client and service applications [8]. The AMQP protocol is designed to be usable from different programming environments, operating systems, and hardware devices, and to make high-performance implementations possible on various network transports, including TCP, SCTP (Stream Control Transmission Protocol), and InfiniBand [9].
Fig. 3. Domain Checker Logic
Specifically, RabbitMQ was chosen for this project due to its ease of setup and its support for the Python programming language [20]; however, any AMQP-compatible service would likely have worked just as well.
Although the WHOIS [22] database is a great resource for looking up which organization owns an IP address, it will not be used here, as SoftLayer has provided a database containing all of their IP address information. To make querying this database as fast as possible, the IP information will be converted from the common dotted-quad format into its decimal representation using the netaddr python library. These decimal numbers will be stored in an indexed MySQL database to facilitate fast queries [23].
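As an illustration of this decimal-conversion approach, the hypothetical is_softlayer_ip() helper from the checker sketch might look like this (the table and column names are assumptions):

```python
import MySQLdb   # MySQL driver for Python 2
import netaddr   # dotted quad <-> integer conversions

db = MySQLdb.connect(host='localhost', user='domains',
                     passwd='secret', db='domains')

def is_softlayer_ip(ip):
    # e.g. '10.0.0.1' -> 167772161
    ip_int = int(netaddr.IPAddress(ip))
    cursor = db.cursor()
    # ip_start and ip_end hold the integer bounds of each SoftLayer
    # subnet; an index on (ip_start, ip_end) keeps this lookup fast [23]
    cursor.execute(
        "SELECT 1 FROM softlayer_subnets"
        " WHERE ip_start <= %s AND ip_end >= %s LIMIT 1",
        (ip_int, ip_int))
    return cursor.fetchone() is not None
```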
Storing the data is the most important technical challenge to solve, since up until this point all of the work done has been in memory, and would be lost if the services were shut down. NoSQL is defined as a collection of next-generation databases mostly addressing some of these points: being non-relational, distributed, open-source and horizontally scalable [10], which address precisely the problems we are likely to encounter. There is a wide variety of NoSQL implementations, and for this project a Document Store style offering is the best fit for how the data will be used after it is stored. Out of the huge variety of NoSQL applications that could possibly work for this project, ElasticSearch was chosen for three main reasons.
• Storing data is fast, and as simple as forming an HTTP PUT request [21] (a small example follows this list).
• Searching through the data is the main purpose of ElasticSearch, which will be useful for doing post-mortem data analysis.
• Most importantly, Kibana [11] is a fantastic tool to
visualize data stored in ElasticSearch, and was used to
create many of the graphs in this case study.
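For instance, the hypothetical save_to_elasticsearch() helper from the checker sketch can be as little as a single PUT, here via the requests library (the index and type names are assumptions):

```python
import json
import requests

def save_to_elasticsearch(entry):
    # ElasticSearch indexes the document at /index/type/id [21]
    url = 'http://localhost:9200/domains/domain/%s' % entry['domain']
    requests.put(url, data=json.dumps(entry))
```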
Finally, all of this will run on the Debian “jessie/sid” operating system, with most of the custom code written in python 2.7. The operating system and programming language are just personal preferences, however; similar results should be expected with different choices made here.
III. METHODOLOGY
The end goal of this project is to determine with some accuracy the exact number of domains that resolve to a SoftLayer-owned IP address. Yet there are three important milestones that will be observed in trying to reach this goal.
1. The proof of concept. During this phase, the core components of the project are put together, tested, and checked for consistency. This is critically important for any software project.
2. The Base Case. The first full run through the data set, which will serve as a benchmark for what performance looks like given a single-server approach.
3. The Cloud Case. Here we will attempt to leverage as many resources as possible to answer our question in the shortest time possible, and the result will be compared against the Base Case.
While finding the answer to our question may be interesting to some, especially SoftLayer, we have set up this study to help answer some questions that might be more relevant to the community, specifically those who lack extensive experience working with cloud technologies and distributed workloads. We hope to address the following general concerns with this case study.
Concern 1
What are the difficulties in solving a large-data problem with
a monolithic approach?
Concern 2
How much time and effort can be saved with a cloud based
approach compared to a monolithic approach?
These concerns are important because they mirror the concerns of many newcomers to the cloud computing space, and addressing them will hopefully alleviate some of the hesitancy to adopt cloud computing.
IV. PROOF OF CONCEPT
Creating a proof of concept version is critical to the success of any application. It is during this phase that we try to answer the most basic question: "can this plan actually work?". Even with most of the technology stack already chosen, creating a proof of concept is important to prove that all the technology works well together before effort is wasted on a solution that is impossible. This stage brought to light a collection of issues that had not previously been apparent.
As mentioned earlier, multi-threaded programming is
inherently difficult, and working out these difficulties is
much easier in the proof of concept phase than in a full
production run. This phase also uncovered an interesting problem: the domain files were being parsed entirely too quickly, which crashed the RabbitMQ server almost instantly by exhausting the available RAM. Thankfully this issue was discovered early; with some fine-tuning of the RabbitMQ settings, and some rate limiting on the parsing program, everything ran very smoothly afterwards.
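The paper does not list the exact settings changed, but RabbitMQ's memory behavior is governed by knobs along these lines in rabbitmq.config (the values shown are illustrative, not the ones used):

```erlang
[
  {rabbit, [
    %% block publishers once RAM use crosses this fraction of total memory
    {vm_memory_high_watermark, 0.4},
    %% start paging queue contents to disk well before the watermark is hit
    {vm_memory_high_watermark_paging_ratio, 0.5}
  ]}
].
```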
Aside from those major issues, this proof of concept phase helped illuminate which areas of the program were likely to break, and where best to put logging messages to ensure any errors were being properly reported and handled. The data structure used to pass domain information between processes was finalized here, along with the final document that would eventually be stored in ElasticSearch.
V. BASE CASE
With the proof of concept finished, it is time to move on to actually running everything together at full speed. This involves ordering a new server, installing the required libraries and packages, configuring everything, and then setting all the programs running.
A. New Problems
Going from a proof of concept to a full run is generally bound to uncover new problems, and this transition was no exception. The first unexpected hurdle turned out to be turning a python program into a background service, which was surprisingly complicated, at least for someone not intimately familiar with how Debian manages startup scripts. Secondly, while DNS lookups were expected to be fairly CPU expensive, they turned out to be the main limiting factor in how many processes could be launched at once. Since none of the DNS lookups being performed would already be cached, the resolver needed to query the root name servers, then the zone name servers, and finally the authoritative name servers for each domain.
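The paper does not show how the daemonization was finally solved; on a systemd-based Debian, one modern approach is a minimal unit file such as the following (paths and names are hypothetical):

```ini
[Unit]
Description=Domain Resolver worker
After=network.target unbound.service

[Service]
ExecStart=/usr/bin/python /opt/domains/resolver.py
Restart=on-failure
User=domains

[Install]
WantedBy=multi-user.target
```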
Passing messages between processes with RabbitMQ was incredibly easy, but slightly error prone. The biggest issue was that the connector would occasionally hit a timeout, which caused the resolver program to exit. Once logic was added to the programs interacting with RabbitMQ to handle that exception and keep going, everything ran smoothly.
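That recovery logic amounts to wrapping the consume loop so that a dropped connection is retried instead of killing the process; roughly, reusing the on_packet callback from the resolver sketch (exception classes per pika [20]):

```python
import time
import pika
import pika.exceptions

while True:
    try:
        connection = pika.BlockingConnection(
            pika.ConnectionParameters('localhost'))
        channel = connection.channel()
        channel.basic_consume(on_packet, queue='domains')
        channel.start_consuming()
    except pika.exceptions.AMQPConnectionError:
        time.sleep(5)  # back off briefly, then reconnect and resume
```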
B. The Hardware
The power of a bare metal server has been well documented [12], so for this single-server case a single bare metal server will be used to get optimal performance. This server is something that could easily be found in any datacenter, or at least something very similar.

Fig. 4. Base Case Domains Per Hour
The server will be an Intel Xeon E3-1270 (4 cores @ 3.40GHz) with 2 hard drives and 8 GB of RAM, costing $0.368/hour [13]. This server was chosen for its fast clock speed, cheap hourly rate, and enough RAM to hold all our data.
C. Results
Below is a breakdown of the average CPU percentage each part of our solution took up. These numbers are approximate averages, meant to give a good sense of where most of the time was spent. As noted earlier, Unbound (the DNS resolver) takes up nearly 50% of the CPU time. RabbitMQ and ElasticSearch are both fairly low on this chart, which was a little unexpected; however, it goes a long way to show how powerful and well made these tools are. It should be no surprise that the code written specifically for this study performed worse than tools written by industry experts.
TABLE I. CPU USAGE BREAKDOWN

Process                 CPU %
Unbound                   45%
Domain Resolver x 40      25%
Domain Parser              1%
Domain Checker             1%
RabbitMQ                  15%
ElasticSearch             10%
Operating System           3%

Fig. 5. RabbitMQ Network Utilization
Overall, the whole system took about 300 hours to run, for a grand total of $102.672, averaging between 100 and 200 domains a second. A bargain, considering that the Intel Xeon E3-1270 v3 CPU alone costs $373.11 [24]. Increasing the number of cores would help reduce the runtime, however there are only so many cores you can fit inside a single machine. The biggest hourly server SoftLayer provides is the Intel Xeon E5-2690 v3 (12 cores, 2.60 GHz) at $2.226/hour [14]. Since this server has three times as many cores as our original, it can be generously assumed this process would have taken a third of the time (100 hours). However, 100 hours @ $2.226/hour is significantly more expensive, at $222.60.
Overall, once all of the programs were set running, the Base Case performed admirably without supervision. There are still some performance improvements that could have been made to the code and the configuration of services, but that would take a significant amount of intimate knowledge about each service and some of the inner workings of the python libraries involved; to get our runtime and overall cost down, it is easier to simply spread everything out into a cloud deployment.
VI. CLOUD CASE
One of the many benefits of Cloud Computing is a smoother scalability path: Cloud Computing empowers any application with architectures that were designed to scale easily with added hardware and infrastructure resources [15]. This path to smoother scalability is exactly what this case will study. The simplest way to start scaling is to split each service onto its own bare metal or virtual server. The RabbitMQ service will get a virtual server with plenty of RAM, while the ElasticSearch service, MySQL, and the Domain Parser will share a bare metal server with plenty of disk space and ample disk speed. Unbound and the Domain Resolver will be paired together on a series of virtual servers to maximize cores while minimizing costs; each of these virtual servers needs at least two cores, one to run Unbound, and the other to work through all of the Domain Resolver threads. The Domain Checker service will also get a series of virtual servers, as it too depends only on CPU time, with very little disk or RAM usage.
A. New Problems
The first major problem in adopting a cloud computing deployment is the network. In the Base Case, data was transferred between services via the loopback interface, which is incredibly fast since the data never actually has to go over the wire. In the Cloud Case, however, it quickly became apparent that the default 100Mbps data transfer rate was entirely too slow for our application. Thankfully, it is a simple matter to upgrade to a 1Gbps connection in a cloud environment, which provided plenty of bandwidth, with our application maxing out at around 250Mbps. Due to the amount of data being transferred over the network, bandwidth costs also became a big concern. Luckily, SoftLayer does not meter traffic over their private network, even across regions [25]. Provided all network traffic is kept to the private network, there are no additional costs for splitting out the infrastructure.
1. Network traffic handled by RabbitMQ
Configuration management starts to become a real problem in cloud environments due to the ever-increasing number of nodes requiring configuration. Setting up a single server is a fairly trivial task for any seasoned administrator, but managing dozens of nodes that all need to be provisioned simultaneously becomes a bit of a nightmare. Thankfully, there are a myriad of configuration management tools [16] that help manage cloud deployments, and for this project SaltStack [17] was selected for its ability to easily provision servers on the SoftLayer platform. Once SaltStack has been fleshed out with the details of the application and its deployment structure, creating the thirty-six servers required for the Cloud Case comes down to one simple command, and it takes about fifteen minutes for all nodes to be provisioned, configured, and running the programs they were told to run.
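As an illustration, the provisioning side of that setup could be a salt-cloud map file like the following (profile and node names are hypothetical; salt-cloud is the provisioning companion to SaltStack [17]):

```yaml
# /etc/salt/cloud.map -- hypothetical profiles defined against the
# SoftLayer salt-cloud driver
resolver_node:
  - resolver01
  - resolver02
  # ... through resolver25
checker_node:
  - checker01
  # ... through checker10
rabbit_node:
  - rabbit01
```

Running `salt-cloud -m /etc/salt/cloud.map -P` then creates all of the mapped nodes in parallel, after which the normal Salt states configure and start each service.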
B. The Hardware
a. Domain Master - Hourly Bare Metal - 4 cores @ 3.50GHz, 32 GB RAM @ $0.595/hour
This server is responsible for being the master for the SaltStack configuration management, along with running the ElasticSearch, Kibana, MySQL, and Domain Parser services. This is the only bare metal server, since it is the only node where data is actually written to or read from a disk.
b. Rabbit Node - Virtual Server - 4 cores, 48 GB RAM, 1Gbps network @ $0.606/hour
Responsible for the RabbitMQ service. 48 GB of RAM is a significant increase from the Base Case, due to the rate at which domains enter the queue. In the Base Case we limited the rate of the Domain Parser to keep pace with the Domain Resolver; here that rate limit has been removed, since the Domain Resolver will be scaled up significantly and the hardware provisioned here can hold the entirety of the data being worked with. This makes the network a limiting factor where it was not previously, hence the 1Gbps network connection.
c. 25 Resolver Nodes - Virtual Server - 2 cores, 1 GB RAM @ $0.060/hour
Responsible for Unbound and the Domain Resolver script. Each node can run about 40 Domain Resolver scripts before maxing out the CPU. Because the Domain Resolver depends so heavily on Unbound, keeping the two together worked out very well.
d. 10 Checking Nodes - Virtual Server - 2 cores, 1 GB RAM @ $0.060/hour
Responsible for running the Domain Checker script. Each node can run about 80 Domain Checker scripts before maxing out the CPU. The amount of work required by the Domain Checker is significantly less than by the Domain Resolver, which is why the same number of domains could be processed with ten nodes instead of the twenty-five used for the Domain Resolver.
Separating out the services in this manner has the very significant advantage of being able to use more CPUs and RAM than can fit into a single server. Each server, aside from the Domain Master, was managed entirely by SaltStack, from the ordering step all the way to the final provisioning and running of the needed services, without ever having to log in to the server itself.
Overall, the server count here was a bit on the conservative side; even so, this setup completely exceeded expectations without hitting any cloud bottlenecks. With 78 cores working, the Cloud Case managed to progress through between 6,000 and 7,000 domains a second, a huge increase over the Base Case.
C. Results
From the point domains started being added to RabbitMQ, this project took a little under 6 hours to fully complete the assigned task, and could have been even shorter had more Resolver Nodes been added. The project was left to run overnight, given the extreme length of time the Base Case took; the run had already completed before anyone noticed how fast it was going, which is why no additional Resolver Nodes were added.
Despite the significantly higher CPU and RAM count used in the Cloud Case, the end cost was only $26.998, roughly a quarter of the Base Case cost. This should make it clear how powerful cloud architectures can be, in both time and money savings.
Since everything is also specified in SaltStack, redeploying this environment is a trivial process, which is another huge benefit of using a cloud computing model for solving problems.
VII. CONCLUSION
In the face of increasingly vast and complicated workloads, traditional programming techniques are quickly becoming inadequate and time consuming. Distributing tasks across a wide array of discrete nodes is going to be a critical aspect of any large-data project, and mastering the plethora of services that assist programmers in this space is a must for any developer entering the Cloud Era.
Fig. 6. Cloud Case Domains Per Hour
Message queues as a tool for task distribution, along with NoSQL data stores, are going to play some of the biggest roles in these architectures. Hopefully this paper helped shed some light on how all these services can work together to build a successful application, even without a significant amount of prior knowledge of the products involved.
Finally, we can fully address our concerns from earlier.
Concern 1
The difficulties in solving large-data problems with a monolithic approach tend to be the limitations imposed by physical restrictions. Even though the Cloud Case and the Base Case used a similar software architecture, for the Base Case we simply couldn't get a server big enough to go through the data in even a reasonable fraction of the time the Cloud Case required. And even though a myriad of unfamiliar technology was employed here, generally the only information required was how to get each service installed, and how to get data into or out of it. While the inner workings remain a mystery, the services themselves perform well with intelligently designed defaults.
Concern 2
With the Cloud Case clocking in at around 6 hours and $27, it greatly surpassed the Base Case in both time and cost, as the Base Case took around 300 hours and $103. Although it is counterintuitive, using more computing power can actually be cheaper if it reduces the required computation time of a program. Getting the Cloud Case set up in SaltStack was certainly challenging and time consuming; however, now that the work has been done, redeploying the Cloud Case takes no time at all, whereas redeploying the Base Case would still take a few hours of configuration by hand to get everything working.
In conclusion, it should hopefully be clear that expertise in cloud computing is not required to take advantage of the power it offers. Nor should distributed or parallelized programming techniques be avoided because they are difficult to understand; the performance improvements they allow are too great to ignore. Work is constantly being done to make these techniques easier to understand, and there are already a great many tools and concepts, such as queues for message transfer between programs, that allow even an inexperienced developer to make good choices in how to solve difficult problems.
VIII. ACKNOWLEDGMENTS
This work was sponsored by SoftLayer, which is why they were the IaaS vendor of choice in this paper. While the pricing and servers are specific to SoftLayer, we expect the findings in this paper to be replicable with any other IaaS vendor. The developers at SaltStack were also incredibly helpful in sorting out issues relating to some of the more complicated configurations in the deployment.
REFERENCES
[1] http://faculty.winthrop.edu/domanm/csci411/Handouts/NIST.pdf
[2] https://softlayer.com
[3] https://tools.ietf.org/html/rfc1035, section 3.1
[4] https://ntldstats.com/
[5] http://www.registrarstats.com/TLDDomainCounts.aspx
[6] J. Jung, E. Sit, H. Balakrishnan, and R. Morris, "DNS Performance and the Effectiveness of Caching," IEEE/ACM Transactions on Networking, vol. 10, no. 5, October 2002.
[7] https://www.unbound.net/
[8] https://en.wikipedia.org/wiki/Advanced_Message_Queuing_Protocol
[9] J. O'Hara, "Toward a commodity enterprise middleware," ACM Queue, vol. 5, no. 4, pp. 48-55, 2007.
[10] http://nosql-database.org/
[11] https://www.elastic.co/products/kibana
[12] J. Ekanayake and G. Fox, "High performance parallel computing with clouds and cloud technologies," in Cloud Computing, Springer Berlin Heidelberg, 2010, pp. 294-308.
[13] https://www.softlayer.com/Store/orderHourlyBareMetalInstance/37276/64
[14] https://www.softlayer.com/Store/orderHourlyBareMetalInstance/165559/103
[15] M. Creeger, "Cloud Computing: An Overview," ACM Queue, vol. 7, no. 5, 2009.
[16] https://en.wikipedia.org/wiki/Configuration_management
[17] http://saltstack.com/
[18] M. Bridges et al., "Revisiting the sequential programming model for multi-core," in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, IEEE Computer Society, 2007.
[19] J. Dean and S. Ghemawat, "Distributed programming with Mapreduce," in Beautiful Code, Sebastopol: O'Reilly Media, Inc., 2007, p. 384.
[20] https://pika.readthedocs.org/en/0.10.0/
[21] https://www.elastic.co/guide/en/elasticsearch/guide/current/create-doc.html
[22] https://whois.icann.org/en/about-whois
[23] B. Schwartz, P. Zaitsev, and V. Tkachenko, High Performance MySQL: Optimization, Backups, and Replication, O'Reilly Media, Inc., 2012, pp. 115-130.
[24] http://amzn.com/B00D697QRM
[25] http://blog.softlayer.com/tag/private-network

More Related Content

What's hot

Using Graph Databases in Real-Time to Solve Resource Authorization at Telenor...
Using Graph Databases in Real-Time to Solve Resource Authorization at Telenor...Using Graph Databases in Real-Time to Solve Resource Authorization at Telenor...
Using Graph Databases in Real-Time to Solve Resource Authorization at Telenor...
Sebastian Verheughe
 
Architecting for the cloud elasticity security
Architecting for the cloud elasticity securityArchitecting for the cloud elasticity security
Architecting for the cloud elasticity security
Len Bass
 
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Kevin Mao
 
Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...
Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...
Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...
confluent
 
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
confluent
 
Scheduling in cloud computing
Scheduling in cloud computingScheduling in cloud computing
Scheduling in cloud computing
ijccsa
 
Serverless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the EnterpriseServerless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the Enterprise
Arun Kejariwal
 
Architecting for the cloud cloud providers
Architecting for the cloud cloud providersArchitecting for the cloud cloud providers
Architecting for the cloud cloud providers
Len Bass
 
How NOSQL Paid off for Telenor
How NOSQL Paid off for TelenorHow NOSQL Paid off for Telenor
How NOSQL Paid off for Telenor
Sebastian Verheughe
 
COST-MINIMIZING DYNAMIC MIGRATION OF CONTENT DISTRIBUTION SERVICES INTO HYBR...
 COST-MINIMIZING DYNAMIC MIGRATION OF CONTENT DISTRIBUTION SERVICES INTO HYBR... COST-MINIMIZING DYNAMIC MIGRATION OF CONTENT DISTRIBUTION SERVICES INTO HYBR...
COST-MINIMIZING DYNAMIC MIGRATION OF CONTENT DISTRIBUTION SERVICES INTO HYBR...
Nexgen Technology
 
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
csandit
 
Redis For Distributed & Fault Tolerant Data Plumbing Infrastructure
Redis For Distributed & Fault Tolerant Data Plumbing Infrastructure Redis For Distributed & Fault Tolerant Data Plumbing Infrastructure
Redis For Distributed & Fault Tolerant Data Plumbing Infrastructure
Redis Labs
 
System to generate speech to text in real time
System to generate speech to text in real timeSystem to generate speech to text in real time
System to generate speech to text in real time
Saptarshi Chatterjee
 
zenoh: zero overhead pub/sub store/query compute
zenoh: zero overhead pub/sub store/query computezenoh: zero overhead pub/sub store/query compute
zenoh: zero overhead pub/sub store/query compute
Angelo Corsaro
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Microservices, Kafka Streams and KafkaEsque
Microservices, Kafka Streams and KafkaEsqueMicroservices, Kafka Streams and KafkaEsque
Microservices, Kafka Streams and KafkaEsque
confluent
 
Dynamic Cloud Partitioning and Load Balancing in Cloud
Dynamic Cloud Partitioning and Load Balancing in Cloud Dynamic Cloud Partitioning and Load Balancing in Cloud
Dynamic Cloud Partitioning and Load Balancing in Cloud
Shyam Hajare
 

What's hot (18)

Using Graph Databases in Real-Time to Solve Resource Authorization at Telenor...
Using Graph Databases in Real-Time to Solve Resource Authorization at Telenor...Using Graph Databases in Real-Time to Solve Resource Authorization at Telenor...
Using Graph Databases in Real-Time to Solve Resource Authorization at Telenor...
 
Architecting for the cloud elasticity security
Architecting for the cloud elasticity securityArchitecting for the cloud elasticity security
Architecting for the cloud elasticity security
 
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
 
HADRFINAL13112016
HADRFINAL13112016HADRFINAL13112016
HADRFINAL13112016
 
Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...
Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...
Bulletproof Kafka with Fault Tree Analysis (Andrey Falko, Lyft) Kafka Summit ...
 
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
 
Scheduling in cloud computing
Scheduling in cloud computingScheduling in cloud computing
Scheduling in cloud computing
 
Serverless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the EnterpriseServerless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the Enterprise
 
Architecting for the cloud cloud providers
Architecting for the cloud cloud providersArchitecting for the cloud cloud providers
Architecting for the cloud cloud providers
 
How NOSQL Paid off for Telenor
How NOSQL Paid off for TelenorHow NOSQL Paid off for Telenor
How NOSQL Paid off for Telenor
 
COST-MINIMIZING DYNAMIC MIGRATION OF CONTENT DISTRIBUTION SERVICES INTO HYBR...
 COST-MINIMIZING DYNAMIC MIGRATION OF CONTENT DISTRIBUTION SERVICES INTO HYBR... COST-MINIMIZING DYNAMIC MIGRATION OF CONTENT DISTRIBUTION SERVICES INTO HYBR...
COST-MINIMIZING DYNAMIC MIGRATION OF CONTENT DISTRIBUTION SERVICES INTO HYBR...
 
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
 
Redis For Distributed & Fault Tolerant Data Plumbing Infrastructure
Redis For Distributed & Fault Tolerant Data Plumbing Infrastructure Redis For Distributed & Fault Tolerant Data Plumbing Infrastructure
Redis For Distributed & Fault Tolerant Data Plumbing Infrastructure
 
System to generate speech to text in real time
System to generate speech to text in real timeSystem to generate speech to text in real time
System to generate speech to text in real time
 
zenoh: zero overhead pub/sub store/query compute
zenoh: zero overhead pub/sub store/query computezenoh: zero overhead pub/sub store/query compute
zenoh: zero overhead pub/sub store/query compute
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Microservices, Kafka Streams and KafkaEsque
Microservices, Kafka Streams and KafkaEsqueMicroservices, Kafka Streams and KafkaEsque
Microservices, Kafka Streams and KafkaEsque
 
Dynamic Cloud Partitioning and Load Balancing in Cloud
Dynamic Cloud Partitioning and Load Balancing in Cloud Dynamic Cloud Partitioning and Load Balancing in Cloud
Dynamic Cloud Partitioning and Load Balancing in Cloud
 

Viewers also liked

Actividad economica 2
Actividad economica   2Actividad economica   2
Actividad economica 2
Jose Quiroz
 
The victorian at victoria park naples florida.text.marked
The victorian at victoria park naples florida.text.markedThe victorian at victoria park naples florida.text.marked
The victorian at victoria park naples florida.text.markedVineyards Naples
 
Ies maria aurèlia capmany anna i júlia
Ies maria aurèlia capmany anna i júliaIes maria aurèlia capmany anna i júlia
Ies maria aurèlia capmany anna i júliajuliapons89
 
Lon chaney
Lon chaneyLon chaney
Lon chaney
bosekto
 
Ativ 3 giselavc
Ativ 3 giselavcAtiv 3 giselavc
Ativ 3 giselavc
giselavieiradacostasilvei
 
Unit a at seawatch ii bayside naples florida.text.marked
Unit a at seawatch ii bayside naples florida.text.markedUnit a at seawatch ii bayside naples florida.text.marked
Unit a at seawatch ii bayside naples florida.text.markedVineyards Naples
 
Trabalho Alexandre5ºC
Trabalho Alexandre5ºCTrabalho Alexandre5ºC
Trabalho Alexandre5ºCtuchav
 
Diogo e avô Vitor
Diogo e avô VitorDiogo e avô Vitor
Diogo e avô Vitortuchav
 
Royal at maxson homes naples florida
Royal at maxson homes naples floridaRoyal at maxson homes naples florida
Royal at maxson homes naples floridaVineyards Naples
 
Linfomas
Linfomas Linfomas
Diferenciacion celular
Diferenciacion celularDiferenciacion celular
Diferenciacion celular
Emerson Fabri
 
LordJeshuaInheritanceNovember2016.7007
LordJeshuaInheritanceNovember2016.7007LordJeshuaInheritanceNovember2016.7007
LordJeshuaInheritanceNovember2016.7007
Lord Jesus Christ
 

Viewers also liked (20)

Jp2001207245
Jp2001207245Jp2001207245
Jp2001207245
 
Diiapo
DiiapoDiiapo
Diiapo
 
Actividad economica 2
Actividad economica   2Actividad economica   2
Actividad economica 2
 
Folleto 06
Folleto 06Folleto 06
Folleto 06
 
The victorian at victoria park naples florida.text.marked
The victorian at victoria park naples florida.text.markedThe victorian at victoria park naples florida.text.marked
The victorian at victoria park naples florida.text.marked
 
Ies maria aurèlia capmany anna i júlia
Ies maria aurèlia capmany anna i júliaIes maria aurèlia capmany anna i júlia
Ies maria aurèlia capmany anna i júlia
 
8
88
8
 
Plantas
PlantasPlantas
Plantas
 
Ll 14 3
Ll 14   3Ll 14   3
Ll 14 3
 
Lon chaney
Lon chaneyLon chaney
Lon chaney
 
3
33
3
 
Ativ 3 giselavc
Ativ 3 giselavcAtiv 3 giselavc
Ativ 3 giselavc
 
Unit a at seawatch ii bayside naples florida.text.marked
Unit a at seawatch ii bayside naples florida.text.markedUnit a at seawatch ii bayside naples florida.text.marked
Unit a at seawatch ii bayside naples florida.text.marked
 
Trabalho Alexandre5ºC
Trabalho Alexandre5ºCTrabalho Alexandre5ºC
Trabalho Alexandre5ºC
 
Hamlin Knight Brochure
Hamlin Knight BrochureHamlin Knight Brochure
Hamlin Knight Brochure
 
Diogo e avô Vitor
Diogo e avô VitorDiogo e avô Vitor
Diogo e avô Vitor
 
Royal at maxson homes naples florida
Royal at maxson homes naples floridaRoyal at maxson homes naples florida
Royal at maxson homes naples florida
 
Linfomas
Linfomas Linfomas
Linfomas
 
Diferenciacion celular
Diferenciacion celularDiferenciacion celular
Diferenciacion celular
 
LordJeshuaInheritanceNovember2016.7007
LordJeshuaInheritanceNovember2016.7007LordJeshuaInheritanceNovember2016.7007
LordJeshuaInheritanceNovember2016.7007
 

Similar to moveMountainIEEE

2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole
2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole
2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole
Vasu S
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
SN Chakraborty
 
Fast Synchronization In IVR Using REST API For HTML5 And AJAX
Fast Synchronization In IVR Using REST API For HTML5 And AJAXFast Synchronization In IVR Using REST API For HTML5 And AJAX
Fast Synchronization In IVR Using REST API For HTML5 And AJAX
IJERA Editor
 
introduction to distributed computing.pptx
introduction to distributed computing.pptxintroduction to distributed computing.pptx
introduction to distributed computing.pptx
ApthiriSurekha
 
D017212027
D017212027D017212027
D017212027
IOSR Journals
 
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
IOSR Journals
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
Kashyap Parmar
 
Vps server 19
Vps server 19Vps server 19
Vps server 19
GilberteFarnsworth31
 
Scaling Databricks to Run Data and ML Workloads on Millions of VMs
Scaling Databricks to Run Data and ML Workloads on Millions of VMsScaling Databricks to Run Data and ML Workloads on Millions of VMs
Scaling Databricks to Run Data and ML Workloads on Millions of VMs
Matei Zaharia
 
Performance and Cost Analysis of Modern Public Cloud Services
Performance and Cost Analysis of Modern Public Cloud ServicesPerformance and Cost Analysis of Modern Public Cloud Services
Performance and Cost Analysis of Modern Public Cloud ServicesMd.Saiedur Rahaman
 
Real time service oriented cloud computing
Real time service oriented cloud computingReal time service oriented cloud computing
Real time service oriented cloud computing
www.pixelsolutionbd.com
 
1Running head WINDOWS SERVER DEPLOYMENT PROPOSAL2WINDOWS SE.docx
1Running head WINDOWS SERVER DEPLOYMENT PROPOSAL2WINDOWS SE.docx1Running head WINDOWS SERVER DEPLOYMENT PROPOSAL2WINDOWS SE.docx
1Running head WINDOWS SERVER DEPLOYMENT PROPOSAL2WINDOWS SE.docx
aulasnilda
 
Understanding Cloud Computing by BS Infotech
Understanding Cloud Computing by BS InfotechUnderstanding Cloud Computing by BS Infotech
Understanding Cloud Computing by BS Infotech
ranapoonam1
 
Unit 3
Unit 3Unit 3
H017144148
H017144148H017144148
H017144148
IOSR Journals
 
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
IOSR Journals
 
Performance comparison on java technologies a practical approach
Performance comparison on java technologies   a practical approachPerformance comparison on java technologies   a practical approach
Performance comparison on java technologies a practical approach
csandit
 
PERFORMANCE COMPARISON ON JAVA TECHNOLOGIES - A PRACTICAL APPROACH
PERFORMANCE COMPARISON ON JAVA TECHNOLOGIES - A PRACTICAL APPROACHPERFORMANCE COMPARISON ON JAVA TECHNOLOGIES - A PRACTICAL APPROACH
PERFORMANCE COMPARISON ON JAVA TECHNOLOGIES - A PRACTICAL APPROACH
cscpconf
 
Scaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, GoalsScaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, Goals
kamaelian
 

Similar to moveMountainIEEE (20)

Tombolo
TomboloTombolo
Tombolo
 
2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole
2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole
2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
Fast Synchronization In IVR Using REST API For HTML5 And AJAX
Fast Synchronization In IVR Using REST API For HTML5 And AJAXFast Synchronization In IVR Using REST API For HTML5 And AJAX
Fast Synchronization In IVR Using REST API For HTML5 And AJAX
 
introduction to distributed computing.pptx
introduction to distributed computing.pptxintroduction to distributed computing.pptx
introduction to distributed computing.pptx
 
D017212027
D017212027D017212027
D017212027
 
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
Vps server 19
Vps server 19Vps server 19
Vps server 19
 
Scaling Databricks to Run Data and ML Workloads on Millions of VMs
Scaling Databricks to Run Data and ML Workloads on Millions of VMsScaling Databricks to Run Data and ML Workloads on Millions of VMs
Scaling Databricks to Run Data and ML Workloads on Millions of VMs
 
Performance and Cost Analysis of Modern Public Cloud Services
Performance and Cost Analysis of Modern Public Cloud ServicesPerformance and Cost Analysis of Modern Public Cloud Services
Performance and Cost Analysis of Modern Public Cloud Services
 
Real time service oriented cloud computing
Real time service oriented cloud computingReal time service oriented cloud computing
Real time service oriented cloud computing
 
1Running head WINDOWS SERVER DEPLOYMENT PROPOSAL2WINDOWS SE.docx
1Running head WINDOWS SERVER DEPLOYMENT PROPOSAL2WINDOWS SE.docx1Running head WINDOWS SERVER DEPLOYMENT PROPOSAL2WINDOWS SE.docx
1Running head WINDOWS SERVER DEPLOYMENT PROPOSAL2WINDOWS SE.docx
 
Understanding Cloud Computing by BS Infotech
Understanding Cloud Computing by BS InfotechUnderstanding Cloud Computing by BS Infotech
Understanding Cloud Computing by BS Infotech
 
Unit 3
Unit 3Unit 3
Unit 3
 
H017144148
H017144148H017144148
H017144148
 
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
 
Performance comparison on java technologies a practical approach
Performance comparison on java technologies   a practical approachPerformance comparison on java technologies   a practical approach
Performance comparison on java technologies a practical approach
 
PERFORMANCE COMPARISON ON JAVA TECHNOLOGIES - A PRACTICAL APPROACH
PERFORMANCE COMPARISON ON JAVA TECHNOLOGIES - A PRACTICAL APPROACHPERFORMANCE COMPARISON ON JAVA TECHNOLOGIES - A PRACTICAL APPROACH
PERFORMANCE COMPARISON ON JAVA TECHNOLOGIES - A PRACTICAL APPROACH
 
Scaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, GoalsScaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, Goals
 

moveMountainIEEE

  • 1. The Importance of using Small Solutions to solve Big Problems How to move a mountain (of data) Christopher Gallo Technology Evangelist SoftLayer, an IBM Company Houston, USA cgallo@us.ibm.com Abstract— Abstract- Designing applications that can produce meaningful results out of large-scale data sets is a challenging and often problematic undertaking. The difficulties in these projects are often compounded by designers using the improper tool, or worse, designing a new tool that is inadequate for the task. In the current state of cloud computing, there exists a myriad of services and software to handle even the most daunting tasks, however discovering these tools is often a challenge in and of itself. This paper presents a case study concerning the design of an application that uses minimal code to solve a large-data problem as an exercise in choosing the proper tools and creating a quickly scalable application in a cloud environment. The study will take every registered Internet Domain Name and determine if it is hosted by a specific hosting provider (in this case SoftLayer, an IBM Company). While the case may seem simple, the technical challenges presented are both interesting to solve, and general enough to apply to a wide variety of similar problems. This case study shows the benefits provided by Infrastructure as a Service (IaaS), queues as a form of task distribution, configuration management tools for rapid scalability, and the importance of leveraging threads for maximum performance. Keywords-component; Infrastructure as a Service; Cloud Scaling; Large-Scale Application Design; I. INTRODUCTION "The Cloud" is defined by The National Institute of Standards and Technology as a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. [1] Creating an application that is not only capable, but optimized, for operating in "The Cloud" is challenging in part due to the very distributed and dynamic nature of "The Cloud", and to the rapidly changing array of tools that need to be employed. This case study will solve the same problem with two different methods, one a traditional single node approach, and the other a cloud based approach. While many of the techniques required can, and will be used for the single node approach, only when we apply these techniques to "The Cloud" will we see their optimal value. The problem starts off fairly simply. We are tasked with iterating through every registered domain name, and assessing whether it is hosted in a SoftLayer[2] datacenter or not. The scale of the problem becomes clear when we discover how many domains there could be. The only limitation on a domain name is that each label be less than 63 ASCII characters, usually only A-Z and the "-" character [3]. This give us a grand total of 63^26 possible combinations per Top Level Domain (TLD), of which there are now over 800 [4]. To make our task somewhat easier, various registrars allow access to their list of registered domain names, so we will restrict our search to only domains we know to exist, and will not attempt to search every possible domain name combination, as that would take an eternity. 
The registrars behind the most popular TLDs, .COM, .NET, and .ORG all give out access, which comprises about 80% of the total registered domains, or around 150,000,000 domains total [5]. We will need to be content with that number, as obtaining access to 100% of domains is cost prohibitive for this case study. This paper will present the case study by first elaborating on some of the background technical challenges presented by iterating through one hundred and fifty million records and how we plan to solve them, along with the methodology we plan to use for the two cases. Then we will discuss the Base Case, which would be a traditional single node solution to this problem, and some of the lessons learned. Next we will study the Cloud Case, and how it compares to the Base Case. Finally we will close with some thoughts on what could have been done better along with some other concluding remarks. II. BACKGROUND It might seem unusual that a large IaaS provider like SoftLayer does not have ready access to the information on which domains are being hosted on their infrastructure, but while SoftLayer keeps track of how many servers are online and the number of IP addresses that are being leased out, SoftLayer does not keep track of anything that runs on the server once access is handed over to a customer. So this leaves SoftLayer in a position of having to determine the number of domains hosted the hard way, by checking each and every registered domain.
  • 2. Since there are around 150,000,000 domains to check, using a monolithic program where each domain is processed fully before proceeding to the next is simply going to take too long, each task must be broken down and parallelized as much as possible. Multi-threaded programming is generally significantly more challenging than single-treaded programming, to such a degree that many programmers avoid it altogether [18]. Yet here multi-threading is going to be a must in order to get meaningful results in a reasonable amount of time. While multi-threaded programming has not gotten easier since the paper by Bridges, Matthew, et al was published in 2007, there are now many new tools which will be explored here to help make the task easier. Even on a single machine, being able to take advantage of every core is paramount to maximizing performance of an application [19], and the easiest way for this application will be to split every task into its own program that can run simultaneously and independently of each other. The tasks will be broken down as listed below. a. Domain Parser This is the script that is responsible for taking the files provided by the various registrars and adding them to the RabbitMQ server. These zone file are downloaded ahead of time since they can be fairly large and are located on the system running the Domain Parser. To help minimize queue transactions, each domain is packaged into groups of 25. The package is a simple array of objects, encoded as JSON. The logic for this code is in Fig. 1: b. Domain Resolver This script takes a packet of domains from the queue, attempts to resolve each one in a thread, and then adds an updated packet of domains to a final queue, adding in some new information about the domain. This section is where multi-threading will really shine. The average time to resolve a domain successfully for this project was 0.306 seconds. However, even with optimizations to Unbound, the time to unsuccessfully resolve a domain was 2.051 seconds, which is a very long time for a CPU to wait for a result. Thankfully threads allow us the ability to continue to attempt to resolve domains while we wait on a response from the upstream DNS server. The logic for this code is contained in Fig. 2. DNS lookups are going to be the biggest bottleneck for this study, especially since it is expected that about 25% of the lookups will result in a failure [6], which will significantly slow down the rate at which we can query domains. To mitigate this, a local DNS resolver service (Unbound DNS [7]) will be required so that control can be exercised over how long to wait on slow DNS servers, and to limit caching to save on resource utilization. Each domain will be only queried once, so there should be no need for caching at all in this project. c. Domain Checker This script takes a packet of domains from the final queue, and checks against our database of IP addresses to see if the IP address of the domain is a SoftLayer IP address or not. Once the check is complete, the domain object is updated with that information and finally saved to Elastic Search. The logic is in Fig. 3. 1. Domain Parser Logic 2. Domain Resolver Logic To control the even distribution of domains to processes between each program, a message queue will need to be added. For this project an Advanced Message Queuing Protocol (AMQP) compatible queue was chosen because it is an open standard supported by a wide variety of client and service applications [8]. 
the AMQP protocol is designed to be usable from different programming environments, operating systems, and hardware devices, as well as making high-performance implementations possible on various network transports including TCP, SCTP (Stream Control Transmission Protocol), and InfiniBand [9].
c. Domain Checker
This script takes a packet of domains from the final queue and checks each one against our database of IP addresses to see whether the domain's IP address belongs to SoftLayer. Once the check is complete, the domain object is updated with that information and saved to ElasticSearch. The logic is in Fig. 3, and a sketch appears at the end of this section, once the supporting data stores have been introduced.

[Fig. 3: Domain Checker Logic]

To control the even distribution of domains between the processes of each program, a message queue is needed. For this project an Advanced Message Queuing Protocol (AMQP) compatible queue was chosen because it is an open standard supported by a wide variety of client and service applications [8]. The AMQP protocol is designed to be usable from different programming environments, operating systems, and hardware devices, as well as to make high-performance implementations possible on various network transports including TCP, SCTP (Stream Control Transmission Protocol), and InfiniBand [9]. Specifically, RabbitMQ was chosen for this project due to its ease of setup and support for the Python programming language [20]; however, any AMQP-compatible service would likely have worked just as well.

Although the WHOIS [22] database is a great resource for looking up which organization owns an IP address, it is not used here, as SoftLayer has provided a database containing all of their IP address information. To make querying this database as fast as possible, each IP address is converted from the common dotted-quad format into its decimal representation using the netaddr Python library, and these decimal numbers are stored in an indexed MySQL database to facilitate fast queries [23].

Storing the data is the most important technical challenge to solve, since up until this point all of the work has been in memory and would be lost if the services were shut down. NoSQL is defined as a collection of next-generation databases mostly addressing some of these points: being non-relational, distributed, open source, and horizontally scalable [10], which are precisely the problems this project is likely to encounter. There is a wide variety of NoSQL implementations, and for this project a Document Store is the best fit for how the data will be used after it is stored. Out of the huge number of NoSQL applications that could possibly work for this project, ElasticSearch was chosen for three main reasons:
• Storing data is fast, and as simple as forming an HTTP PUT request [21].
• Searching through the data is the main purpose of ElasticSearch, which is useful for post-mortem data analysis.
• Most importantly, Kibana [11] is a fantastic tool for visualizing data stored in ElasticSearch, and was used to create many of the graphs in this case study.
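Pulling those pieces together, below is a hedged sketch of the Domain Checker core (Fig. 3): netaddr converts the dotted quad to its decimal form, MySQL answers the range lookup, and the finished document is PUT into ElasticSearch. The subnets table schema, the index and type names, and the credentials are all hypothetical stand-ins, and the queue-consuming loop, which mirrors the resolver sketch above, is omitted.

    import json
    import MySQLdb
    import requests
    from netaddr import IPAddress

    db = MySQLdb.connect(host='localhost', user='checker',
                         passwd='secret', db='softlayer_ips')
    cursor = db.cursor()

    # Hypothetical schema: one row per SoftLayer subnet, with the range
    # bounds stored as indexed integers for fast lookups.
    RANGE_QUERY = "SELECT COUNT(*) FROM subnets WHERE %s BETWEEN ip_start AND ip_end"

    def check_and_store(entry):
        if entry.get('ip'):
            decimal_ip = int(IPAddress(entry['ip']))  # dotted quad -> decimal
            cursor.execute(RANGE_QUERY, (decimal_ip,))
            entry['softlayer'] = cursor.fetchone()[0] > 0
        else:
            entry['softlayer'] = False  # the domain never resolved
        # Creating an ElasticSearch document is a single HTTP PUT [21].
        url = 'http://localhost:9200/domains/domain/%s' % entry['domain']
        requests.put(url, data=json.dumps(entry))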
Finally, all of this runs on the Debian "jessie/sid" operating system, with most of the custom code written in Python 2.7. The operating system and programming language are simply personal preferences; similar results should be expected with different choices here.

III. Methodology
The end goal of this project is to determine with some accuracy the exact number of domains that resolve to a SoftLayer-owned IP address. Yet there are three important milestones on the way to that goal.
1. The Proof of Concept. Here the core components of the project are put together, tested, and checked for consistency, a critically important step for any software project.
2. The Base Case. The first full run through the data set, which will serve as a benchmark for what performance looks like with a single-server approach.
3. The Cloud Case. Here as many resources as possible are leveraged to answer the question in the shortest time possible, to be compared against the Base Case.

While the answer itself may be interesting to some, especially SoftLayer, this study is set up to answer questions more relevant to the community, specifically to those who lack extensive experience with cloud technologies and distributed workloads. This case study addresses the following general concerns.
Concern 1: What are the difficulties in solving a large-data problem with a monolithic approach?
Concern 2: How much time and effort can be saved with a cloud-based approach compared to a monolithic approach?
These concerns are important because they mirror many of the concerns newcomers to the cloud computing space encounter, and addressing them will hopefully alleviate some of the hesitancy to adopt cloud computing.

IV. Proof of Concept
Creating a proof of concept is critical to the success of any application. It is during this phase that the most basic question gets answered: "can this plan actually work?". Even with most of the technology stack chosen beforehand, a proof of concept is important to show that all of the technology works well together, before effort is wasted on a solution that turns out to be impossible. This stage brought to light a collection of issues that had not previously been apparent. As mentioned earlier, multi-threaded programming is inherently difficult, and working out those difficulties is much easier in the proof of concept phase than in a full production run. This phase also uncovered an interesting problem: the domain files were being parsed entirely too quickly, which crashed the RabbitMQ server almost instantly by exhausting the available RAM. Thankfully the issue was discovered early, and with some fine-tuning of the RabbitMQ settings and some rate limiting on the parsing program, everything ran smoothly afterwards.
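One simple way to implement that rate limiting, sketched here under the same assumptions as the earlier snippets, is to have the parser poll the queue depth with a passive declare and back off whenever too many messages are waiting; the ceiling and sleep values are illustrative, not the exact tuning used.

    import time
    import pika

    MAX_QUEUE_DEPTH = 50000  # illustrative ceiling, tuned to the broker's RAM

    connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
    channel = connection.channel()

    def wait_for_queue_room(queue='domains'):
        # A passive declare does not create the queue; it only returns the
        # queue's current statistics, including the waiting message count.
        while True:
            frame = channel.queue_declare(queue=queue, passive=True)
            if frame.method.message_count < MAX_QUEUE_DEPTH:
                return
            time.sleep(5)  # back off while the resolvers drain the queue

The parser would then call wait_for_queue_room() before each publish_batch() call.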
Aside from those major issues, the proof of concept phase helped illuminate which areas of the program were likely to break, and where best to put logging messages to ensure any errors were properly reported and handled. The data structure used to pass domain information between processes was finalized here, along with the final document that would eventually be stored in ElasticSearch.

V. Base Case
With the proof of concept finished, it is time to run everything together at full speed. This involves ordering a new server, installing the required libraries and packages, configuring everything, and then setting all of the programs running.

A. New Problems
Going from a proof of concept to a full run is bound to uncover new problems, and this transition was no exception. The first unexpected hurdle was turning a Python program into a background service, which was surprisingly complicated, at least for someone not intimately familiar with how Debian manages startup scripts. Secondly, while DNS lookups were expected to be fairly CPU-expensive, they turned out to be the limiting factor in how many processes could be launched at once: since none of the lookups would already be cached, the resolver had to query the root name servers, then the zone name servers, and finally the authoritative name servers for each domain. Passing messages between processes with RabbitMQ was incredibly easy, but slightly error-prone. The biggest issue was that the connector would occasionally hit a timeout, causing the resolver program to exit. Once logic was added to the programs interacting with RabbitMQ to handle that exception and carry on, everything ran smoothly.

B. The Hardware
The power of a bare metal server has been well documented [12], so for this single-server case a single bare metal server is used to get the most optimal performance. The server is something easily found in any datacenter, or at least something very similar: an Intel Xeon E3-1270, 4 cores @ 3.40 GHz, 2 hard drives, and 8 GB of RAM, costing $0.368/hour [13]. This server was chosen for its fast clock speed, cheap hourly rate, and enough RAM to hold all of the data.

C. Results
Table I is a breakdown of the approximate average CPU percentage each part of the solution consumed, to give a sense of where most of the time was spent. As noted earlier, Unbound (the DNS resolver) takes up nearly half of the CPU time. RabbitMQ and ElasticSearch both sit fairly low on the chart, which was a little unexpected, but it goes a long way toward showing how powerful and well made those tools are; it should be no surprise that the code written specifically for this study performed worse than tools written by industry experts.

TABLE I. CPU USAGE BREAKDOWN
Process                 CPU %
Unbound                 45%
Domain Resolver x 40    25%
Domain Parser           1%
Domain Checker          1%
RabbitMQ                15%
ElasticSearch           10%
Operating System        3%

[Fig. 4: Base Case Domains Per Hour]
[Fig. 5: RabbitMQ Network Utilization]
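Given how much of the CPU budget Unbound consumes, its configuration is worth showing. The snippet below is a hedged sketch of the kind of tuning described earlier, using standard unbound.conf options; the values are illustrative rather than the exact production settings.

    # /etc/unbound/unbound.conf (illustrative values)
    server:
        num-threads: 4          # one worker per core
        msg-cache-size: 4m      # shrink the caches, since every domain
        rrset-cache-size: 4m    # is looked up exactly once
        cache-max-ttl: 60       # expire anything cached almost immediately
        outgoing-range: 8192    # allow many simultaneous upstream queries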
Overall, the whole system took about 300 hours to run, for a grand total of $102.672, averaging between 100 and 200 domains a second. That is a bargain considering the Intel Xeon E3-1270 v3 CPU alone retails for $373.11 [24]. Increasing the number of cores would help reduce the runtime, but there are only so many cores that fit inside a single machine. The biggest hourly server SoftLayer provides is the Intel Xeon E5-2690 v3 (12 cores, 2.60 GHz) at $2.226/hour [14]. Since that server has three times as many cores as the original, it can be generously assumed the process would have taken a third of the time (100 hours); however, 100 hours at $2.226/hour is significantly more expensive, at $222.60. Overall, once all of the programs were set running, the Base Case performed admirably without supervision. Some performance improvements could still have been made to the code and service configuration, but that would require intimate knowledge of each service and of the inner workings of the Python libraries involved; to get the runtime and overall cost down, it is easier to simply spread everything out into a cloud deployment.

VI. Cloud Case
One of the many benefits of cloud computing is a smoother scalability path: cloud computing empowers any application with architectures that were designed to easily scale with added hardware and infrastructure resources [15]. This path to smoother scalability is exactly what this case will study. The simplest way to start scaling is to split each service onto its own bare metal or virtual server. The RabbitMQ service gets a virtual server with plenty of RAM, while the ElasticSearch service, MySQL, and the Domain Parser share a bare metal server with ample disk space and speed. Unbound and the Domain Resolver are paired together on a series of virtual servers to maximize cores while minimizing cost; each virtual server needs at least two cores, one to run Unbound and the other to work through the Domain Resolver threads. The Domain Checker service also gets a series of virtual servers, as it too depends only on CPU time, with very little disk or RAM usage.

A. New Problems
The first major problem in adopting a cloud deployment is the network. In the Base Case, data moved between services over the loopback interface, which is incredibly fast since the data never actually goes over the wire. In the Cloud Case, it quickly became apparent that the default 100 Mbps data transfer rate was entirely too slow for this application. Thankfully, upgrading to a 1 Gbps connection is a simple matter in a cloud environment, and that provided plenty of bandwidth, with the application maxing out at around 250 Mbps. Given the amount of data crossing the network, bandwidth costs also become a big concern. Luckily, SoftLayer does not meter traffic on its private network, even across regions [25], so provided all traffic is kept to the private network, there are no additional costs for splitting out the infrastructure.

[Figure: Network traffic handled by RabbitMQ]

Configuration management becomes a real problem in cloud environments due to the ever-increasing number of nodes requiring configuration. Setting up a single server is a fairly trivial task for any seasoned administrator, but managing dozens of nodes that all need to be provisioned simultaneously becomes a bit of a nightmare.
Thankfully there is a myriad of configuration management tools [16] to help manage cloud deployments, and for this project SaltStack [17] was selected for its ability to easily provision servers on the SoftLayer platform. Once SaltStack has been fleshed out with the details of the application and its deployment structure, creating the thirty-six servers required for the Cloud Case comes down to one simple command (see the sketch after the hardware list below), and it takes about fifteen minutes for all nodes to be provisioned, configured, and running their assigned programs.

B. The Hardware
a. Domain Master - Hourly Bare Metal - 4 cores @ 3.50 GHz, 32 GB RAM - $0.595/hour
This server is responsible for being the SaltStack configuration management master as well as running the ElasticSearch, Kibana, MySQL, and Domain Parser services. It is the only bare metal server, since it is the only node where data is actually written to or read from disk.
b. Rabbit Node - Virtual Server - 4 cores, 48 GB RAM, 1 Gbps network - $0.606/hour
Responsible for the RabbitMQ service. 48 GB of RAM is a significant increase from the Base Case, owing to the rate at which domains now enter the queue: the Base Case rate-limited the Domain Parser to keep pace with the Domain Resolver, but here that limit has been removed, since the Domain Resolver is scaled up significantly and this hardware can hold the entire working data set. That makes the network a limiting factor where it was not previously, hence the 1 Gbps connection.
c. 25 Resolver Nodes - Virtual Server - 2 cores, 1 GB RAM - $0.060/hour
Responsible for Unbound and the Domain Resolver script. Each node can run about 40 Domain Resolver scripts before maxing out the CPU. Given how tightly coupled Unbound and the Domain Resolver are, keeping them together worked out very well.
d. 10 Checking Nodes - Virtual Server - 2 cores, 1 GB RAM - $0.060/hour
Responsible for running the Domain Checker script. Each node can run about 80 Domain Checker scripts before maxing out the CPU. The Domain Checker requires significantly less work than the Domain Resolver, which is why the same volume of domains could be processed with ten nodes instead of twenty-five.
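The "one simple command" above can be illustrated with salt-cloud, which includes a SoftLayer driver. The map below is a hypothetical sketch mirroring the node breakdown; the profile and host names are invented, and each profile would be defined separately against a SoftLayer provider configuration.

    # /etc/salt/cloud.map -- hypothetical map of the Cloud Case fleet
    rabbit_node_profile:
      - rabbit01
    resolver_node_profile:
      - resolver01
      - resolver02
      # ... through resolver25
    checker_node_profile:
      - checker01
      # ... through checker10

Provisioning the whole fleet in parallel is then a single invocation:

    salt-cloud -m /etc/salt/cloud.map -P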
Separating out the services in this manner has the very significant advantage of using more CPUs and RAM than could ever fit into a single server. Every server aside from the Domain Master was managed entirely by SaltStack, from the ordering step through final provisioning and starting the needed services, without ever logging in to the server itself. Overall the server count here was a bit conservative, yet the setup still completely exceeded expectations without hitting any cloud bottlenecks. With 78 cores working, the Cloud Case progressed through between 6,000 and 7,000 domains a second, a huge increase over the Base Case.

C. Results
From the point domains started being added to RabbitMQ, the project took a little under 6 hours to fully complete the assigned task, and could have been even shorter had more Resolver Nodes been added. Given the extreme length of the Base Case run, this run was left to go overnight, which is why no additional Resolver Nodes were added: the project finished before anyone noticed how fast it was going. Despite the significantly higher CPU and RAM counts used in the Cloud Case, the end cost was only $26.998, roughly a quarter of the Base Case cost. This should make clear how powerful cloud architectures can be in both time and money savings. Since everything is also specified in SaltStack, redeploying this environment is trivial, which is another huge benefit of using a cloud computing model for solving problems.

[Fig. 6: Cloud Case Domains Per Hour]

VII. Conclusion
In the face of increasingly vast and complicated workloads, traditional programming techniques are quickly becoming inadequate and time-consuming. Distributing tasks across a wide array of discrete nodes is going to be a critical aspect of any large-data project, and mastering the plethora of services that assist programmers in this space is a must for any developer entering the Cloud Era. The message queue as a tool for task distribution, along with NoSQL data stores, will play some of the biggest roles in these architectures. Hopefully this paper has shed some light on how all of these services can work together to build a successful application, even without significant prior knowledge of the products involved. Finally, the concerns raised earlier can be fully addressed.

Concern 1: The difficulties in solving large-data problems with a monolithic approach tend to be the limitations imposed by physical restrictions. Even though the Cloud Case and the Base Case used a similar software architecture, the Base Case simply could not get a server big enough to go through the data in even a reasonable fraction of the Cloud Case's time. And although a myriad of unfamiliar technology was employed here, generally the only knowledge required was how to install each service and how to get data into and out of it; while the inner workings remain a mystery, the services themselves perform well with intelligently designed defaults.

Concern 2: Clocking in at around 6 hours and $27, the Cloud Case greatly surpassed the Base Case in both time and cost, as the Base Case took around 300 hours and $103.
Although it is counterintuitive, using more computing power can actually be cheaper if it reduces the required computation time. Getting the Cloud Case set up in SaltStack was certainly challenging and time-consuming, but now that the work has been done, redeploying the Cloud Case takes no time at all, whereas redeploying the Base Case would still take a few hours of configuration by hand. In conclusion, it should be clear that expertise in cloud computing is not required to take advantage of the power it offers. Nor should distributed or parallelized programming techniques be avoided because they are difficult to understand; the performance improvements they allow are too great to ignore. Work is constantly being done to make these techniques easier, and there are already a great many tools and concepts, such as queues for message transfer between programs, that allow even an inexperienced developer to make good choices in how to solve difficult problems.

VIII. Acknowledgments
This work was sponsored by SoftLayer, which is why they were the IaaS vendor of choice in this paper. While the pricing and servers are specific to SoftLayer, the findings should be replicable with any other IaaS vendor. The developers at SaltStack were also incredibly helpful in sorting out issues with some of the more complicated configurations in the deployment.
REFERENCES
[1] http://faculty.winthrop.edu/domanm/csci411/Handouts/NIST.pdf
[2] https://softlayer.com
[3] RFC 1035, section 3.1, https://tools.ietf.org/html/rfc1035
[4] https://ntldstats.com/
[5] http://www.registrarstats.com/TLDDomainCounts.aspx
[6] J. Jung, E. Sit, H. Balakrishnan, and R. Morris, "DNS Performance and the Effectiveness of Caching," IEEE/ACM Transactions on Networking, vol. 10, no. 5, October 2002.
[7] https://www.unbound.net/
[8] https://en.wikipedia.org/wiki/Advanced_Message_Queuing_Protocol
[9] J. O'Hara, "Toward a Commodity Enterprise Middleware," ACM Queue 5(4): 48-55, 2007.
[10] http://nosql-database.org/
[11] https://www.elastic.co/products/kibana
[12] J. Ekanayake and G. Fox, "High Performance Parallel Computing with Clouds and Cloud Technologies," Cloud Computing, Springer Berlin Heidelberg, 2010, pp. 294-308.
[13] https://www.softlayer.com/Store/orderHourlyBareMetalInstance/37276/64
[14] https://www.softlayer.com/Store/orderHourlyBareMetalInstance/165559/103
[15] M. Creeger, "Cloud Computing: An Overview," ACM Queue 7.5 (2009): 2.
[16] https://en.wikipedia.org/wiki/Configuration_management
[17] http://saltstack.com/
[18] M. Bridges et al., "Revisiting the Sequential Programming Model for Multi-Core," Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, IEEE Computer Society, 2007.
[19] J. Dean and S. Ghemawat, "Distributed Programming with MapReduce," Beautiful Code, Sebastopol: O'Reilly Media, Inc., 384 (2007).
[20] https://pika.readthedocs.org/en/0.10.0/
[21] https://www.elastic.co/guide/en/elasticsearch/guide/current/create-doc.html
[22] https://whois.icann.org/en/about-whois
[23] B. Schwartz, P. Zaitsev, and V. Tkachenko, High Performance MySQL: Optimization, Backups, and Replication, O'Reilly Media, Inc., 2012, pp. 115-130.
[24] http://amzn.com/B00D697QRM
[25] http://blog.softlayer.com/tag/private-network