Abstract: With the development of information technology, the scale of data is increasing quickly. This massive data poses a great challenge for data processing and classification. Several algorithms have been proposed to cluster data efficiently; one of them is the random forest algorithm, which is used for feature subset selection. Feature selection involves identifying a subset of the most useful features that produces results compatible with the original entire set of features, and it is achieved by classifying the given data. Efficiency is measured by the time required to find a subset of features, while effectiveness relates to the quality of that subset. The existing system deals with a fast clustering-based feature selection algorithm, which has proven to be powerful, but when the size of the dataset increases rapidly, the current algorithm becomes less efficient because clustering the datasets takes considerably more time. Hence a new method of implementation is proposed in this project to cluster the data efficiently and persist it on the back-end database so as to reduce that time. It is achieved by a scalable random forest algorithm, implemented using MapReduce programming (an implementation of Big Data) to cluster the data efficiently. It works in two phases: the first step deals with gathering the datasets and persisting them in the datastore, and the second step deals with the clustering and classification of the data. This process is completely implemented using Google App Engine's Hadoop platform, a widely used open-source implementation of Google's distributed file system and MapReduce framework for scalable distributed computing or cloud computing. The MapReduce programming model provides an efficient framework for processing large datasets in an extremely parallel manner, and it has become the most popular parallel model for data processing on cloud computing platforms. Designing traditional machine learning algorithms within the MapReduce programming framework is therefore very necessary when dealing with massive datasets.
Keywords: Data mining, Hadoop, MapReduce, Clustering Tree.
Title: Big Data on Implementation of Many to Many Clustering
Author: Ravi. R, Michael. G
ISSN 2350-1022
International Journal of Recent Research in Mathematics Computer Science and Information Technology
Paper Publications
Data Partitioning in MongoDB with Cloud (IJAAS Team)
Cloud computing offers various useful services such as IaaS, PaaS and SaaS for deploying applications at low cost, making them available anytime and anywhere, with the expectation that they be scalable and consistent. One technique to improve scalability is data partitioning. The existing techniques in use are not capable of tracking data access patterns well. This paper implements a scalable workload-driven technique for improving the scalability of web applications. The experiments are carried out over the cloud using the NoSQL data store MongoDB to scale out. This approach offers low response time, high throughput and fewer distributed transactions. The partitioning technique is evaluated using the TPC-C benchmark.
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING (ijiert bestjournal)
Unstructured data poses challenges for storage. Experts estimate that 80 to 90 percent of the data in any organization is unstructured, and the amount of unstructured data in enterprises is growing significantly, often many times faster than structured databases are growing. Structured data exists in table format with a proper schema, whereas unstructured data is schema-less, which directly signifies the importance of the NoSQL storage model and the MapReduce platform. In the existing system, unstructured data is processed with a Cassandra dataset. In the present system, MongoDB is implemented alongside the Cassandra dataset, as MongoDB provides a flexible data model and a large number of options for querying unstructured data, whereas Cassandra models its data so as to minimize the total number of queries through more careful planning and denormalization. Cassandra offers basic secondary indexes, but for the best performance it is recommended to model the data so that they are used infrequently. So to process
Design of a lightweight set of data pipelines to scrub PII information.
Scrubbing PII information from data makes data easier to share.
It also helps organisations confidently push data outside the organisation for large-scale analytics in the cloud.
Big data is a popular term used to describe the large volume of data that includes structured, semi-structured and unstructured data. Nowadays, unstructured data is growing at an explosive speed with the development of the Internet and social networks like Twitter, Facebook and Yahoo. In order to process such a colossal amount of data, software is required that does this efficiently, and this is where Hadoop steps in. Hadoop has become one of the most used frameworks when dealing with big data; it is used to analyze and process big data. In this paper, Apache Flume is configured and integrated with Spark Streaming to stream data from the Twitter application. The streamed data is stored in Apache Cassandra. After retrieving the data, it is analyzed using Apache Zeppelin, and the result is displayed on a dashboard; the dashboard result is also analyzed and validated using JSON.
The design and implementation of modern column-oriented databases (Tilak Patidar)
An attempt to break down the paper on the design of column-oriented databases into simpler terms.
https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf
https://blog.acolyer.org/2018/09/26/the-design-and-implementation-of-modern-column-oriented-database-systems/
Data mining model for the data retrieval from central server configuration (IJCSIT)
A server that has to keep track of heavy document traffic is unable to filter the documents that are most relevant and up to date for continuous text search queries. This paper focuses on handling continuous text extraction while sustaining high document traffic. The main objective is to retrieve recently updated documents that are most relevant to the query by applying a sliding-window technique. Our solution indexes the streamed documents in main memory with a structure based on the principles of the inverted file, and processes document arrival and expiration events with an incremental threshold-based method. It also ensures the elimination of duplicate document retrieval using unsupervised duplicate detection. The documents are ranked based on user feedback and given higher priority for retrieval.
A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics (Flurry, Inc.)
We present Burst, an analytic query system with a scalable and flexible approach to performing low-latency ad hoc analysis over large, complex datasets. The architecture consists of hardware-efficient scan techniques and a language facility that transforms an extensible set of ad hoc declarative queries into imperative physical scan plans. These plans are multicast across all nodes/cores of a two-level sharded/distributed ingestion, storage and execution topology and executed. The first release of this system is the query engine behind the Flurry Explorer product. Here we explore the design details of that system as well as the incremental ingestion pipeline enhancement currently being implemented for the next major release.
International Journal of Engineering and Science Invention (IJESI) (inventionjournals)
The International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of engineering, science and technology, including new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Published papers are selected through double peer review to ensure originality, relevance and readability. The articles published in our journal can be accessed online.
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop (IJTET Journal)
Abstract— In data mining, association rule mining is one of the major techniques for discovering meaningful patterns from large collections of data. Discovering frequent item sets plays an important role in mining association rules, sequence rules, web logs and many other interesting patterns hidden in complex data. Frequent item set mining is one of the classical data mining problems found in most data mining applications. Apache Hadoop has been a major innovation in the IT marketplace over the last decade; from modest beginnings it has seen worldwide adoption in data centers, bringing parallel processing into the hands of the average programmer. This paper presents a literature analysis of different techniques for mining frequent item sets, including mining them on Hadoop.
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey (IJECEIAES)
In the modern era, workflows are adopted as a powerful and attractive paradigm for expressing and solving a variety of applications, such as scientific computing, data-intensive computing, and big data applications like MapReduce and Hadoop. These complex applications are described using high-level representations in workflow methods. With the emerging model of cloud computing, scheduling in the cloud has become an important research topic. Consequently, the workflow scheduling problem has been studied extensively over the past few years, from homogeneous clusters and grids to the most recent paradigm, cloud computing. The challenges that need to be addressed lie in task-resource mapping, QoS requirements, resource provisioning, performance fluctuation, failure handling, resource scheduling, and data storage. This work focuses on a complete study of the resource provisioning and scheduling algorithms in the cloud environment, focusing on Infrastructure as a Service (IaaS). We provide a comprehensive understanding of existing scheduling techniques and an insight into research challenges that are possible future directions for researchers.
Big data analytics: Technology's bleeding edge (Bhavya Gulati)
There can be data without information, but there cannot be information without data.
Companies without big data analytics are deaf and dumb, mere wanderers on the web.
In this paper we describe NoSQL, a series of non-relational database technologies and products developed to address the current problems RDBMS systems are facing: lack of true scalability, poor performance on high data volumes and low availability. Some of these products are already used in production and perform very well: Amazon's Dynamo, Google's Bigtable, Cassandra, etc. We also provide a view on how these systems influence application development in the social and semantic Web sphere.
Key aspects of big data storage and its architecture (Rahul Chaturvedi)
This paper helps in understanding the tools and technologies involved in a classic big data setting. Readers, especially enterprise architects, will find it helpful when choosing among big data database technologies in a Hadoop architecture.
One Size Doesn't Fit All: The New Database Revolution (Mark Madsen)
Slides from a webcast for the database revolution research report (report will be available at http://www.databaserevolution.com)
Choosing the right database has never been more challenging, or potentially rewarding. The options available now span a wide spectrum of architectures, each of which caters to a particular workload. The range of pricing is also vast, with a variety of free and low-cost solutions now challenging the long-standing titans of the industry. How can you determine the optimal solution for your particular workload and budget? Register for this Webcast to find out!
Robin Bloor, Ph.D. Chief Analyst of the Bloor Group, and Mark Madsen of Third Nature, Inc. will present the findings of their three-month research project focused on the evolution of database technology. They will offer practical advice for the best way to approach the evaluation, procurement and use of today’s database management systems. Bloor and Madsen will clarify market terminology and provide a buyer-focused, usage-oriented model of available technologies.
Webcast video and audio will be available on the report download site as well.
For the past several decades the rising tide of technology -- especially the increasing speed of single processors -- has allowed the same data analysis code to run faster and on bigger data sets. That happy era is ending. The size of data sets is increasing much more rapidly than the speed of single cores, of I/O, and of RAM. To deal with this, we need software that can use multiple cores, multiple hard drives, and multiple computers.
That is, we need scalable data analysis software. It needs to scale from small data sets to huge ones, from using one core and one hard drive on one computer to using many cores and many hard drives on many computers, and from using local hardware to using remote clouds.
R is the ideal platform for scalable data analysis software. It is easy to add new functionality in the R environment, and easy to integrate it into existing functionality. R is also powerful, flexible and forgiving.
I will discuss the approach to scalability we have taken at Revolution Analytics with our package RevoScaleR. A key part of this approach is to efficiently operate on "chunks" of data -- sets of rows of data for selected columns. I will discuss this approach from the point of view of:
- Storing data on disk
- Importing data from other sources
- Reading and writing of chunks of data
- Handling data in memory
- Using multiple cores on single computers
- Using multiple computers
- Automatically parallelizing "external memory" algorithms
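As a rough illustration of the chunking idea described above (this is not RevoScaleR itself, which is an R package, nor its API), here is a minimal Java sketch that computes the mean of one numeric column by streaming a file in fixed-size chunks; the file name, chunk size and one-value-per-line format are arbitrary assumptions for the example:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative only: computes the mean of one numeric column (one value per line)
// by streaming the file in fixed-size chunks, so the whole data set never has to
// fit in RAM. The per-chunk partial results (sum, count) could equally be produced
// on different cores or machines and combined afterwards.
public class ChunkedMean {
    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "values.csv"; // hypothetical input file
        int chunkSize = 100_000;                                 // rows per chunk

        double totalSum = 0.0;
        long totalCount = 0;

        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
            String line;
            double chunkSum = 0.0;
            long chunkCount = 0;
            while ((line = reader.readLine()) != null) {
                chunkSum += Double.parseDouble(line.trim());
                chunkCount++;
                if (chunkCount == chunkSize) {
                    // Fold this chunk's partial result into the total and start a new chunk.
                    totalSum += chunkSum;
                    totalCount += chunkCount;
                    chunkSum = 0.0;
                    chunkCount = 0;
                }
            }
            totalSum += chunkSum;      // remainder of the last, partially filled chunk
            totalCount += chunkCount;
        }
        System.out.println("mean = " + (totalSum / totalCount));
    }
}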
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try to reduce the work per iteration, and the other is to try to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps reduce duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be calculated easily. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
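As a rough illustration of one of the techniques mentioned above, here is a minimal, single-threaded Java sketch of power-iteration PageRank that skips recomputation for vertices whose rank has already converged; the toy graph, tolerance and damping factor are arbitrary, and the other optimizations (in-identical vertices, chains, per-SCC ordering, STICD itself) are not implemented:

import java.util.Arrays;

// Minimal pull-based PageRank that skips recomputation for vertices whose rank
// has stopped changing ("skip computation on vertices which have already
// converged"). The graph is given as in-edge lists plus out-degrees; dangling
// nodes and the other optimizations from the text are not handled in this sketch.
public class PageRankSkipConverged {
    public static double[] pageRank(int[][] inEdges, int[] outDeg,
                                    double damping, double tol, int maxIter) {
        int n = inEdges.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        boolean[] converged = new boolean[n];

        for (int iter = 0; iter < maxIter; iter++) {
            double[] next = new double[n];
            boolean allConverged = true;
            for (int v = 0; v < n; v++) {
                if (converged[v]) { next[v] = rank[v]; continue; } // work skipped: saves iteration time
                double s = 0.0;
                for (int u : inEdges[v]) s += rank[u] / outDeg[u];
                next[v] = (1.0 - damping) / n + damping * s;
                if (Math.abs(next[v] - rank[v]) < tol) converged[v] = true;
                else allConverged = false;
            }
            rank = next;
            if (allConverged) break;                               // fewer iterations overall
        }
        return rank;
    }

    public static void main(String[] args) {
        // Toy 3-node graph: 0->1, 1->2, 2->0, 2->1 (given as in-edges and out-degrees).
        int[][] in = { {2}, {0, 2}, {1} };
        int[] outDeg = { 1, 1, 2 };
        System.out.println(Arrays.toString(pageRank(in, outDeg, 0.85, 1e-9, 100)));
    }
}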
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Experimenting with Big Data
Thomas Vanhove, Gregory Van Seghbroeck, Tim Wauters, Bruno Volckaert, Filip De Turck
Big Data and its Problems
For us the big data domain can be divided into three main categories:
big data analysis,
big data management, and
querying big data.
This distinction is mainly a research convenience; big data applications always require a combination of these three topics.
Big data analysis can in its turn be subdivided into two tracks: analysis on big data sets and analysis on large streams of incoming data. The latter is often called stream processing or complex event processing. In stream processing there is often a real-time requirement, i.e. the result of processing the incoming data is needed immediately, whereas analysis on big data sets calls for a more thorough, offline treatment.
The main research track in big data management is data distribution, or data partitioning. Think of it this way: you have a set of servers available; where will you place the data? Data distribution is in most cases handled by the data store or data storage system you have selected. It has an impact on many features, e.g. robustness of the system, data redundancy, performance of reads and writes, data consistency, availability, etc. A wide variety of data stores and data storage systems already exist that excel in certain features; however, it is not possible to support all features simultaneously. Deciding in the huge jungle of possible data storage systems is not a trivial job. But help is on its way!
Querying as a research topic is also largely handled by the data stores themselves, and it has a lot of common ground with data management. We see at least three major ways of querying data in the big data context. First of all there are simple reads. This has everything to do with how and where the big data is stored, and it has many similarities to data management, especially with indexing, data partitioning and data redundancy. Another querying method involves range queries, a form of query where a set (sometimes in a specific order) of results is returned; how this set is constructed is the topic of this research. The third way of querying is full-text search, where you want to search large chunks of plain or structured text for occurrences of specific words or concepts.
For companies it is difficult to choose a specific strategy, especially in this relatively new and volatile domain. Would it not be interesting to be able to experiment with all these different technologies? What about your applications: are they big data proof? Do they scale? Can they handle increasing loads? How many and which resources are needed? What if I told you there is already a platform where you can do your big data experiments? Ready to use, without the hassle of integration and configuration.
Tengu: Big Data Applications Platform
The Tengu platform allows customers to experiment with many aspects of big data. If you want to simply try new big data stores (e.g. Cassandra or ElasticSearch), you can easily set up a Tengu environment with these components already configured. You can also try different types of big data analysis methodologies with Tengu; for example, a clean Tengu instance comes with three different types of big data analysis: stream processing, batch analysis and the Lambda architecture. If you want to experiment with your existing application and see, for instance, how well it performs in a big data context, Tengu can also be used for this purpose.
In what follows we will go deeper into all the technologies and software components that currently make up Tengu, with specific attention to these components' function in Tengu and how you, as an experimenter, can use them. The last subsection provides a small tutorial on how you can use Tengu to set up a big data environment.
Technologies
Big data analysis
Batch analysis – Apache Hadoop MapReduce
For batch analysis Tengu relies on Apache Hadoop MapReduce. MapReduce is a parallel programming concept first coined by Google. Due to the way MapReduce works, it targets a specific type of big data batch analysis job. MapReduce, as the name suggests, consists of two phases: a map phase and a reduce phase. What typically happens is that an extremely large data set is chopped into smaller, manageable parts (i.e. parts that can be handled by simple PCs). On these smaller parts a map function is executed; this is where the actual analysis happens. The map phase is ideally executed on as many nodes as there are smaller parts. The result of the map function should always be some kind of key-value set. These different key-value sets are then aggregated into one large key-value set in the reduce phase. It is important to point out a very significant shortcoming of the MapReduce programming concept: the analysis cannot have dependencies on the entire data set or on parts of it beyond the chunk at hand. Does this mean that you cannot have any dependencies between your data or the analysis of this data? Not at all, since it is possible to chain several MapReduce jobs to perform very complex analysis jobs.
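To make the map and reduce phases concrete, here is the canonical Hadoop word count in Java (a generic example, not a Tengu-specific job): the map phase emits (word, 1) pairs for each input split, and the reduce phase aggregates all pairs sharing the same key. Input and output paths are passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Canonical word count: the map phase turns each split of the input into
// (word, 1) pairs; the reduce phase aggregates all pairs with the same key.
public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);            // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum)); // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}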
Stream processing – Apache Storm
Tengu uses the Apache Storm project for its stream processing. This open source project, initially developed by Nathan Marz for Twitter, is capable of processing large streams of data. It is advertised as real-time processing, but this of course depends on the type of processing you want to do. The idea behind Storm is very similar to MapReduce, with the difference that in Storm we do not chop up the data, but the analysis job. By dividing a complex analysis job into different small, reusable parts, the processing can be heavily parallelized and distributed over the available worker nodes (a worker node is simply a server designated for a computational task). In this way it is possible to achieve a very high throughput, depending of course on the number of worker nodes. In Apache Storm lingo such a small processing part is called a bolt. These bolts are chained together in what is called a topology, which in its entirety performs the complex analysis job.
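A minimal Storm topology sketch in Java may help picture this (a generic example, not taken from Tengu): the data source feeding a topology is called a spout, and each bolt is a small, reusable processing step; setBolt's parallelism hint controls how many copies run across the worker nodes. Class, stream and field names are invented for the example, and the exact API details can vary between Storm versions.

import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

// A tiny topology: a spout emits sentences and a bolt splits them into words.
// Further bolts (e.g. a counter) could be chained in the same way.
public class TinyTopology {

    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values("the quick brown fox"));   // stand-in for a real stream
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    public static class SplitBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            for (String word : input.getStringByField("sentence").split(" ")) {
                collector.emit(new Values(word));                 // one tuple per word
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("split", new SplitBolt(), 4)              // 4 parallel copies of the bolt
               .shuffleGrouping("sentences");

        LocalCluster cluster = new LocalCluster();                // in-process cluster for testing
        cluster.submitTopology("tiny", new Config(), builder.createTopology());
    }
}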
Lambda Architecture
The Lambda architecture, also coined by Nathan Marz, combines the two previous approaches. He created this specific analysis approach because he saw the need for real-time analysis (similar to stream processing) that also includes information on historical data, which can potentially grow to extreme sizes. Without going into too much detail, this is what happens in the Lambda architecture: a batch analysis job runs over the historical data; in the case of Tengu, this batch layer is handled by Apache Hadoop MapReduce, and the result of this processing is stored in what Nathan Marz calls the Batch View. While this batch analysis job is running, the newly arriving data is processed in a sub-optimal way, sub-optimal because it has to be handled in real time; Tengu uses Apache Storm for this, so a stream processing analysis system. When the client queries the system for information, the Batch View and the results from the stream processing data analysis framework are aggregated, and thus return information that contains both recent info and information derived from the historical data. When an iteration of the batch analysis job is finished, the recently received data (the data that was processed by Apache Storm) is combined with the historical data. When this transition is finished, the batch analysis job starts again, now over more data. During this new iteration the system receives new incoming data, which is in turn handled by the stream processor, until the batch job is finished again, which initiates the move of all newly received data to the batch analysis job's source data, and so on. This is a continuous process, always providing a view on the historical data and a view on the real-time data. The Lambda architecture is thus a specific hybrid approach for big data analysis, leveraging the computing power of batch processing in a batch layer with the responsiveness of a real-time computing system in the stream processor (which is called the speed layer in the Lambda architecture).
Tengu provides all the necessary building blocks to set up such a Lambda architecture. Control of the batch analysis job and of the movement of the recently arrived data is handled by the Enterprise Service Bus controlling and managing Tengu.
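The query-time aggregation described above can be pictured with a small, deliberately simplified Java sketch (not Tengu code): the batch view holds counts computed over historical data, the real-time view holds counts for data that arrived since the batch run started, and a query merges both. The maps and key names are stand-ins for whatever the real views contain.

import java.util.HashMap;
import java.util.Map;

// Illustrates only the query-time step of the Lambda architecture: a result
// served to the client is the aggregation of the precomputed Batch View
// (historical data) and the real-time view maintained by the stream processor.
public class LambdaQuery {

    /** Per-key counts from the last completed batch run (e.g. written by MapReduce). */
    static Map<String, Long> batchView = new HashMap<>();

    /** Per-key counts for data that arrived after the batch run started (e.g. kept by Storm). */
    static Map<String, Long> realtimeView = new HashMap<>();

    /** A query merges both views, so the answer covers historical and recent data. */
    static long query(String key) {
        return batchView.getOrDefault(key, 0L) + realtimeView.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        batchView.put("page:/home", 1_000_000L);   // computed over historical data
        realtimeView.put("page:/home", 42L);       // seen since the batch run started
        System.out.println(query("page:/home"));   // 1000042
    }
}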
Data Stores
Tengu already supports several data stores out of the box, including the relational database MySQL. The other data stores are three NoSQL data stores with very distinct usage and features. Now, what does Tengu provide to you as an experimenter? It does all the deployment and configuration for you, so the data stores are ready to be used by your applications.
Cassandra
Cassandra is what they call a key-value based column store. This means that every value is uniquely identified by its key and that the value is a combination of different columns. Cassandra is a good starting point for people coming from the RDBMS world who want a taste of what NoSQL is about, because it still has a concept of tables. Next to this, Cassandra has a query language (called CQL) that is very similar to the well-known SQL language. Cassandra of course has many NoSQL features, e.g. decentralization (i.e. no single point of failure), data replication and fault tolerance (both to network errors and server downtime).
A big difference between Cassandra and regular RDBMS systems is that it does not have the concept of joins. So it is not possible to join multiple keys into new sets of columns, as you do with RDBMS tables; joining or other types of cross-references have to be handled by the application. This is a consequence of the data model chosen for Cassandra. However, this data model (the key-value based column store) has many advantages, e.g. large capacity and extremely fast writes and reads.
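A small sketch with the DataStax Java driver shows the flavour of working with Cassandra (keyspace, table and column names are invented, and driver details may differ between versions): data is written and read by key through CQL, and any join-like logic stays in the application.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

// Connects to a local Cassandra node and runs a few CQL statements. Note that
// the query goes by primary key; there are no joins, exactly as described above.
public class CassandraExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.users ("
                    + "user_id text PRIMARY KEY, name text, email text)");

            // Writes are fast; every value is addressed by its key.
            session.execute("INSERT INTO demo.users (user_id, name, email) "
                    + "VALUES ('u42', 'Ada', 'ada@example.org')");

            // Reads also go by key; cross-table joins must be done in the application.
            ResultSet rs = session.execute(
                    "SELECT name, email FROM demo.users WHERE user_id = 'u42'");
            Row row = rs.one();
            System.out.println(row.getString("name") + " <" + row.getString("email") + ">");
        }
    }
}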
MongoDB
MongoDB is a key-value based document store. You will typically store entire documents in a MongoDB data store. A document in MongoDB is a set of properties; think of it as a big JSON file. A nice feature of MongoDB is that it does not require a data schema to be known in advance: it learns the data format on the fly as you insert new documents. MongoDB also indexes your data as you insert it, making for lightning-fast reads. Creating new documents and updating existing documents in MongoDB is a bit more tedious (still very fast and highly scalable), because of the automatic data format recognition and the automatic indexing mechanism.
Although it is possible to query parts of a document, MongoDB is typically used to retrieve the entire document at once. A very important feature of MongoDB, and definitely one of its strengths, is its range query capabilities. A range query, for example, is when you need all documents created between two specific dates (ranges do not always have to be over dates; any property of the document can be used).
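A small sketch with the MongoDB Java driver illustrates the schema-less insert and a date range query of the kind described above (database, collection and field names are invented for the example):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.time.Instant;
import java.util.Date;

// Stores schema-less documents and retrieves them with a range query over a date field.
public class MongoExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> posts =
                    client.getDatabase("demo").getCollection("posts");

            // No schema needs to be declared up front; just insert a document.
            posts.insertOne(new Document("author", "ada")
                    .append("title", "Hello Tengu")
                    .append("created", Date.from(Instant.parse("2015-03-11T08:00:00Z"))));

            // Range query: all documents created between two dates.
            Date from = Date.from(Instant.parse("2015-03-01T00:00:00Z"));
            Date to   = Date.from(Instant.parse("2015-03-31T23:59:59Z"));
            for (Document d : posts.find(Filters.and(
                    Filters.gte("created", from), Filters.lte("created", to)))) {
                System.out.println(d.toJson());
            }
        }
    }
}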
ElasticSearch
ElasticSearch can also be considered a document store; however, it is much more. ElasticSearch's main advantage is its full-text search capabilities, for which it relies heavily on Apache Lucene. ElasticSearch is actually a feature-rich service layer on top of Apache's incredible indexing system, Lucene. It provides an easy-to-use search API and filtering API, with a lot of customization possibilities.
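A small sketch with the Elasticsearch high-level REST client for Java shows a Lucene-backed full-text query (index and field names are invented, and the client API differs between Elasticsearch versions):

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Full-text search against a local Elasticsearch node.
public class ElasticsearchExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Lucene-backed full-text query: all documents whose body matches the words.
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(QueryBuilders.matchQuery("body", "stream processing"));
            SearchRequest request = new SearchRequest("articles").source(source);

            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            for (SearchHit hit : response.getHits()) {
                System.out.println(hit.getSourceAsString());
            }
        }
    }
}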
Other data storage systems
Thanks to some central components that are an integral part of Tengu, there are some extra data storage systems that you can experiment with. A very important one is the Apache Hadoop Distributed File System (HDFS), the distributed file system that comes with Apache Hadoop MapReduce. HDFS can be used as a regular file system, but with all the features of the Hadoop system: high availability, scalability, redundancy, fault tolerance, being network-partition-proof, etc.
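A small sketch with the Hadoop FileSystem API shows HDFS being used like a regular file system (the NameNode address and the paths are placeholders for the example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

// Uses HDFS through the Hadoop FileSystem API: create a directory, write a file,
// and copy a local file into the cluster (e.g. as input for a MapReduce job).
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // hypothetical NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path dir = new Path("/user/experimenter/input");
            fs.mkdirs(dir);

            // Write a small file; HDFS transparently replicates its blocks across DataNodes.
            try (FSDataOutputStream out = fs.create(new Path(dir, "hello.txt"))) {
                out.write("hello tengu\n".getBytes(StandardCharsets.UTF_8));
            }

            // Copy a local file into the cluster.
            fs.copyFromLocalFile(new Path("/tmp/local-data.csv"),
                                 new Path(dir, "local-data.csv"));
        }
    }
}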
Another component that can be used to store data is Apache Kafka. Apache Kafka is actually a scalable distributed message broker, but it is also capable of persisting large amounts of data. In Tengu we use Apache Kafka as the message store for our Tengu Lambda architecture implementation. It integrates tightly with Apache Storm and Apache HDFS. There are generally two types of message brokers, queue-based and topic-based systems; Apache Kafka adopts a form of the topic-based message broker.
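A small Java sketch of a Kafka producer shows how messages are published to a topic, which can then feed both the stream processor and the batch layer in a Lambda-style setup (broker address and topic name are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Publishes a few messages to a Kafka topic.
public class KafkaProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // Messages with the same key end up in the same partition, preserving their order.
                producer.send(new ProducerRecord<>("incoming-events", "sensor-1", "reading " + i));
            }
        }
    }
}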
Resource management and configuration
Fed4FIRE
With Tengu it is possible to set up big data experimentation environments with an easy RESTful API. With a simple POST request you can create a new environment. What happens in the background is that this POST request is translated into specific calls to one of the Fed4FIRE (http://www.fed4fire.eu/) testbeds (with the API you can actually decide yourself on which testbed you want the Tengu experimentation environment to be deployed). Fed4FIRE is a large European Integration Project under the 7th Framework Program enabling experiments that combine and federate facilities from the different FIRE research communities. IBCN is one of the main driving forces in Fed4FIRE, not only by opening up IBCN's Virtual Wall to other Fed4FIRE partners, but also as the main developer of several federation and client tools used in Fed4FIRE. One of these tools is JFed (http://jfed.iminds.be/), which allows an experimenter to set up any type of server topology on the Fed4FIRE testbeds. It is actually this tool that is being used by Tengu to allocate and deploy the necessary resources.
Configuration – Chef
The JFed client tool only provides the servers to be used in the Tengu experiment environment. The necessary software components and the configuration of these components are handled by Chef (https://www.chef.io/chef/), a configuration management framework. Currently we provide a particular set of predefined cookbooks and recipes that are used in Tengu. Cookbooks and recipes contain the necessary information and dependencies to deploy, configure and integrate specific pieces of software. However, we also open up Chef to our Tengu experimenters, allowing them to deploy their own specific set of tools and software components as well. In the Chef Supermarket (https://supermarket.chef.io/cookbooks-directory) you can find lots of cookbooks for most of the commonly used tools. If you need tools that are not available in the Supermarket, you can always create the cookbooks and recipes yourself or, if this is too much hassle, you can still use the tools available in Ubuntu to deploy and install your software components manually. We, however, strongly advise using Chef, as it is very straightforward to move from the experimentation environment to your company's production environment.
Cloud virtualization – OpenStack
Tengu will set up a big data experimentation environment for you with a fixed set of servers. The size of some of the clusters (such as the Apache Hadoop cluster and the Storm cluster) can be defined by the experimenter, but the other servers are fixed. It is possible, especially when experimenting with existing applications, that this fixed set of servers is not sufficient. Tengu answers this need by also configuring an OpenStack private cloud, whose size can likewise be defined by the experimenter. In this OpenStack private cloud an experimenter can create many virtual machines that can be used to deploy the software components necessary for the experimenter's application. Deployment and configuration of these components can – and this is actually again the preferred way – be handled by Chef.
Platform Usage
In what follows we will give a small introduction on how you can use Tengu to set up your own big
data experimentation environment. This is achieved by calling Tengu’s RESTful API. For more advanced
usage of the RESTful API and more in-depth information about Tengu, we refer to the documentation
on the Tengu website.
Step 1: Prerequisites
The only thing you currently need to get started with Tengu is a valid Fed4FIRE account. Documentation concerning Fed4FIRE, and especially how to obtain such an account, can be found on the Fed4FIRE documentation site (http://doc.fed4fire.eu/getanaccount.html).
The RESTful API is a combination of GET and POST HTTP requests. For the GET requests it suffices to have a standard browser, but for the POST requests this is not enough. Some browsers have extensions to do RESTful requests, but we advise using a tool such as cURL (http://curl.haxx.se/) for the RESTful API requests. All examples provided here are shown using cURL.
Step 2: Deploy your first Tengu core setup
Setting up a Tengu big data environment is as easy as doing the following HTTP POST.

$ curl -k -i -X POST "http://[2001:6a8:1d80:23::141]:8280/tengu/core?hnodes=3&snodes=2&testbed=urn:publicid:IDN+wall2.ilabt.iminds.be+authority+cm"

This will set up a Tengu big data environment with an Apache Hadoop cluster of size 3 and an Apache Storm cluster of size 2. Let us break down the URL.
The Tengu RESTful API can be reached via the IPv6 address 2001:6a8:1d80:23::141 and port 8280. By using the path /tengu/core we tell the API to set up a Tengu core platform. The Tengu core platform is configured using several parameters, provided as key-value pairs in the query string:
The testbed we want this Tengu core setup to be deployed on is set by the testbed parameter. Its value is the Fed4FIRE uuid of the testbed (here urn:publicid:IDN+wall2.ilabt.iminds.be+authority+cm).
The size of the Apache Hadoop cluster is set via the hnodes parameter.
Similarly, the size of the Apache Storm cluster is set via snodes.
A response similar to the one below is returned. It includes the uuid (here urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0) that can be used to get information about the Tengu core setup.

HTTP/1.1 202 OK
Date: Wed, 11 Mar 2015 07:54:55 GMT
Content-Type: application/xml; charset=utf-8
Connection: close
Transfer-Encoding: chunked

<?xml version="1.0" encoding="UTF-8"?>
<ten:tengu xmlns:ten="http://tengu.ibcn.ugent.be/0/1/core"
           xmlns:lnk="http://www.w3.org/1999/xhtml">
  <ten:platform>
    <ten:id>urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0</ten:id>
    <lnk:link method="get" href="/tengu/urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0" />
  </ten:platform>
</ten:tengu>

Step 3: Get information about your deployed setup
Retrieving information (e.g. state information) about your deployed big data environment is done via the following HTTP GET.

$ curl -k -i "http://[2001:6a8:1d80:23::141]:8280/tengu/urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0"

Depending on the current state of the deployment, the response will also include links to the interfaces of important components. In the case of a Tengu core setup, these will be the Hadoop administration front end, the web UI of the HDFS NameNode, the web UI of the Storm cluster, the OpenStack Horizon UI and the web UI of the WSO2 ESB. The format of the response is as follows.

HTTP/1.1 200 OK
Date: Wed, 11 Mar 2015 08:22:03 GMT
Content-Type: application/xml; charset=utf-8
Connection: close
Transfer-Encoding: chunked

<?xml version="1.0" encoding="UTF-8"?>
<ten:tengu xmlns:ten="http://tengu.ibcn.ugent.be/0/1/core"
           xmlns:lnk="http://www.w3.org/1999/xhtml">
  <ten:platform>
    <ten:id>{uuid}</ten:id>
    <ten:status>{UNKNOWN|READY|FAILED}</ten:status>
    <lnk:link method="..." rel="..." href="..." /> *
  </ten:platform>
</ten:tengu>
What's next for Tengu?
The Tengu platform currently has a lot of building blocks that are already integrated and configured automatically. What Tengu does not yet do automatically is deploy an experimenter's application. Although it is possible to do everything by hand, we would like to help experimenters even more by making the process of deploying an application as generic, configurable and automated as possible. This requires an abstract view on what an application is, especially focusing on big data applications and cloud applications with all their different layers.
Somewhat connected to this is easing cloud and big data adoption by current applications. At the moment, if someone wants to experiment with how an existing application would react when using, for example, a NoSQL data store, the application has to be changed drastically. We want to make these changes to the application obsolete by automatically transforming the requests coming from the application into requests that can be interpreted by the new data store, but without changing the semantic meaning of the original request. This last part is extremely important. Current solutions with middleware abstraction layers are already capable of letting the application talk to different data stores in a generic, abstracted way, but it is not guaranteed that the behavior of the original requests is maintained.
Another research and development topic we are currently investigating is monitoring in a big data environment. This research is very challenging, not only because of the highly distributed nature of these big data environments, but also because of the heterogeneity of the different components involved in them. We believe that if we want to make Tengu the go-to platform for experimentation in big data and cloud contexts, we have to offer a deeply integrated monitoring framework as well. This monitoring framework should not only look at the environment's resource usage, but should also monitor the behavior of the application.
Projects Tengu
DMS² – http://www.iminds.be/en/projects/2014/03/06/dms2
The DMS² (Decentralized Data Management and Migration for SaaS) project aims for the creation of
a strategic and practical framework to deal with data management challenges for (potentially new and
currently active) SaaS providers in the cloud.
The outcome includes:
A reference model for requirements engineering and architectural trade-off analysis, specific for data management and migration in SaaS solutions. This is an essential element for customer acquisition projects.
Middleware for data management, supporting interoperability, data protection measures, tactics and solutions for federated data storage and data processing.
The project outcome is driven and validated by four industry case studies, from UP-nxt, Verizon Terremark, Agfa and Luciad, and results in a demonstrator.
AMiCA – http://www.amicaproject.be
The AMiCA (“Automatic Monitoring for Cyberspace Applications”) project aims to mine relevant social
media (blogs, chat rooms, and social networking sites) and collect, analyse, and integrate large
amounts of information using text and image analysis. The ultimate goal is to trace harmful content,
contact, or conduct in an automatic way. Essentially, we take a cross-media mining approach that
allows us to detect risks “on-the-fly”. When critical situations are detected (e.g. a very violent
communication), alerts can be issued to moderators of the social networking sites. When used on
aggregated data, the same technology can be used for incident collection and monitoring at the scale
of individual social networking sites. In addition, the technology can provide accurate quantitative
data to support providers, science, and government in decision-making processes with respect to child
safety online.
Period: 01/01/2013 - 31/12/2016
Sponsor: IWT - Agentschap voor Innovatie door Wetenschap en Technologie (Agency for
Innovation by Science and Technology)
SEQUOIA – http://www.iminds.be/en/projects/2015/03/11/sequoia
The SEQUOIA (Safe Query Applications for Cloud-Based SaaS Applications) project aims to create a
security framework for advanced queries and reporting in SaaS environments. These solutions will be
combined with intricate security rules at the application level. As a result, SaaS providers will be able
to further optimize their offerings and strengthen confidence in SaaS services.
iFest – http://www.iminds.be/en/projects/2015/03/11/ifest
The iFest project aims to develop a new generation of festival wristbands that ensure a richer festival
experience based on built-in communication and sensor functions. iFest is also focusing on a software
platform that allows organizers to manage the wristbands and analyze the data obtained.
PROVIDENCE – http://www.iminds.be/en/projects/2014/06/28/providence
The PROVIDENCE (Predicting the Online Virality of Entertainment and News Content) research project aims to optimize online news publication strategies by anticipating the predicted viral nature of news on social media.
Social networks are increasingly popular for distributing news content. It's a medium where the users themselves decide which topics become 'viral'. The main goal of the PROVIDENCE project is to optimize online news publication strategies by proactively making use of the predicted virality of news on social media platforms. Providence will tackle the technological and research challenges needed to build a virality-driven production flow into a commercial online news environment. These research challenges encompass the large-scale monitoring, analysis and prediction of news consumption and news sharing behavior by users and specific user segments.
Fed4FIRE – http://www.fed4fire.eu
Fed4FIRE is an Integrating Project under the European Union’s Seventh Framework Program (FP7)
addressing the work program topic Future Internet Research and Experimentation. It started in
October 2012 and will run for 48 months, until the end of September 2016.
Experimentally driven research is considered to be a key factor for growing the European Internet
industry. In order to enable this type of RTD activities, a number of projects for building a European
facility for Future Internet Research and Experimentation (FIRE) have been launched, each project
targeting a specific community within the Future Internet ecosystem. Through the federation of these
infrastructures, innovative experiments become possible that break the boundaries of these domains.
Besides, infrastructure developers can utilize common tools of the federation, allowing them to focus
on their core testbed activities.
Recent projects have already successfully demonstrated the advantages of federation within a
community. The Fed4FIRE project intends to implement the next step in these activities by successfully
federating across the community borders and offering openness for future extensions.
IBCN’s Tengu Team
Thomas Vanhove
Thomas obtained his master's degree in Computer Science from Ghent University, Belgium, in July 2012. In August 2012, he started his PhD at the IBCN (Intec Broadband Communication Networks) research group, researching data management solutions in cloud environments. Tengu originated in the first years of his research in this domain and has since become the main focus of his PhD.
Dr. Gregory Van Seghbroeck
Gregory Van Seghbroeck graduated at Ghent University in 2005. After a brief stop as an IT consultant,
he joined the Department of Information Technology (INTEC) at Ghent University. On the 1st of
January, 2007, he received a PhD grant from IWT, Institute for the Support of Innovation through
Science and Technology, to work on theoretical aspects of advanced validation mechanisms for
distributed interaction protocols and service choreographies. In 2011 he received his Ph.D. in
Computer Science Engineering. Since July 2012, he has been active as a post-doctoral researcher at
Ghent University, where he has been involved in several national and European projects, including the
FP7 project BonFIRE and the award-winning ITEA2 project SODA. His main research interests include
complex distributed processes, cloud computing, service engineering, and service oriented
architectures. As an author or co-author his work has been published in international journals and
conference proceedings.
Dr. ir. Tim Wauters
Tim received his M.Sc. degree in electro-technical engineering in June 2001 from Ghent University,
Belgium. In January 2007, he obtained the Ph.D. degree in electro-technical engineering at the same
university. Since September 2001, he has been working in the Department of Information Technology
(INTEC) at Ghent University, and is now active as a post-doctoral fellow of the F.W.O.-V. His main
research interests focus on network and service architectures and management solutions for scalable
multimedia delivery services. His work has been published in about 50 scientific publications in
international journals and in the proceedings of international conferences.
Dr. Bruno Volckaert
Bruno Volckaert graduated in 2001 from Ghent University and obtained his PhD, entitled "Architectures and Algorithms for network and service aware Grid resource management", in 2006. Since then he has been responsible for over 20 research projects (ICON, EU FP6, ITEA, SBO). He was research lead of the TRACK and RAILS projects, both dealing with advances in distributed software for railway transportation, and is currently research lead of the Elastic Media Distribution project, dealing with cloud provisioning for professional media cooperation platforms. His main focus is on distributed systems, more specifically intelligent cloud resource provisioning methods and transportation.
Prof. Dr. ir. Filip De Turck
Filip received his M.Sc. degree in Electronic Engineering from the Ghent University, Belgium, in June
1997. In May 2002, he obtained the Ph.D. degree in Electronic Engineering from the same university.
During his Ph.D. research he was funded by the F.W.O.-V., the Fund for Scientific Research Flanders.
From October 2002 until September 2008, he was a post-doctoral fellow of the F.W.O.-V. and part
time professor, affiliated with the Department of Information Technology of the Ghent University. At
the moment, he is a full-time professor affiliated with the Department of Information Technology of
the Ghent University and the IBBT (Interdisciplinary Institute of Broadband Technology Flanders) in
the area of telecommunication and software engineering. Filip De Turck is author or co-author of
approximately 250 papers published in international journals or in the proceedings of international
conferences. His main research interests include scalable software architectures for
telecommunication network and service management, performance evaluation and design of new
telecommunication and eHealth services.