Experimenting with Big Data
Thomas Vanhove, Gregory Van Seghbroeck, Tim Wauters, Bruno Volckaert, Filip De Turck
Big Data and its Problems
The big data domain can be divided into three main categories:
 big data analysis,
 big data management, and
 querying big data.
This distinction is purely a matter of research perspective: big data applications always require a
combination of these three topics.
Big data analysis can in turn be subdivided into two tracks: analysis of large data sets and analysis of
large streams of incoming data. The latter is often called stream processing or complex event
processing. Stream processing usually comes with a real-time requirement, i.e. we need the result of
processing the incoming data immediately. Analysis of big data sets, in contrast, sacrifices this
immediacy for a more thorough analysis.
The main research track in big data management is data distribution, or data partitioning. Think of it
as follows: given a set of available servers, where do you place the data? Data distribution is in most
cases handled by the data store or data storage system you have selected. It has an impact on many
features, e.g. robustness of the system, data redundancy, read and write performance, data
consistency, availability, etc. A wide variety of data stores and data storage systems already exists,
each excelling at certain features; it is, however, not possible to support all features simultaneously.
Choosing from this huge jungle of possible data storage systems is not a trivial job. But help is on its way!
Querying as a research topic is also largely handled by the data stores themselves and has a lot of
common ground with data management. We see at least three major ways of querying data in the big
data context. First of all there are simple reads. These depend entirely on how and where the big data
is stored, and thus have a lot of similarities with data management, especially indexing, data
partitioning and data redundancy. A second querying method involves range queries: queries that
return a set of results, sometimes in a specific order. How this set is constructed is the topic of this
research. The third way of querying is full-text search, where you search large chunks of plain or
structured text for occurrences of specific words or concepts.
For companies it is difficult to choose a specific strategy, especially in this relatively new and volatile
domain. Would it not be interesting to be able to experiment with all these different technologies?
What about your applications: are they big data proof? Do they scale? Can they handle increasing
loads? How many and which resources are needed? What if we told you there is already a platform
where you can do your big data experiments, ready to use, without the hassle of integration and
configuration?
Tengu: Big Data Applications Platform
The Tengu platform allows customers to experiment with many aspects of big data. If you want to
simply try out new big data stores (e.g. Cassandra or ElasticSearch), you can easily set up a Tengu
environment with these components already configured. You can also try different types of big data
analysis methodologies with Tengu: a clean Tengu instance comes with three different types of big
data analysis: stream processing, batch analysis and the Lambda Architecture. And if you want to
experiment with an existing application, to see for instance how well it performs in a big data context,
Tengu can be used for that purpose as well.
In what follows we go deeper into the technologies and software components that currently make up
Tengu, with specific attention to each component's function in Tengu and how you, as an
experimenter, can use it. The last subsection provides a small tutorial on how to use Tengu to set up
a big data environment.
Technologies
Big data analysis
Batch analysis – Apache Hadoop MapReduce
For batch analysis Tengu relies on Apache Hadoop MapReduce. MapReduce is a parallel
programming concept first coined by Google. Because of the way MapReduce works, it targets a
specific type of big data batch analysis jobs. MapReduce, as the name suggests, consists of two
phases: a map phase and a reduce phase. What typically happens is that an extremely large data set
is chopped into smaller, manageable parts (i.e. parts that can be handled by simple PCs). On each of
these smaller parts a map function is executed; this is where the actual analysis happens. The map
phase is ideally executed on as many nodes as there are parts. The result of the map function should
always be some kind of key-value set. These different key-value sets are then aggregated into one
large key-value set in the reduce phase. It is important to point out a significant limitation of the
MapReduce programming concept: the analysis of one part cannot depend on the entire data set or
on other parts of it. Does this mean that you cannot have any dependencies between your data or the
analysis of this data? Not at all, since it is possible to chain several MapReduce jobs to perform very
complex analysis jobs.
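The two phases described above can be sketched in plain Python. Word counting is the classic illustrative job; this is only a conceptual sketch, not the actual Hadoop API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    """Map: emit a (word, 1) key-value pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce: aggregate all key-value pairs into one large key-value set."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# An extremely large data set, chopped into smaller manageable parts.
chunks = ["big data big analysis", "big data management"]
# Each map call is independent, so ideally each runs on its own node.
mapped = chain.from_iterable(map_phase(c) for c in chunks)
result = reduce_phase(mapped)
print(result)  # {'big': 3, 'data': 2, 'analysis': 1, 'management': 1}
```

Note that each `map_phase` call sees only its own chunk, which is exactly the limitation discussed above: no dependencies on the rest of the data set.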
Stream processing – Apache Storm
Tengu uses the Apache Storm project for its stream processing. This open source project, initially
developed by Nathan Marz for Twitter, is capable of processing large streams of data. It is advertised
as doing this processing in real time, but that of course depends on the type of processing you want
to do. The idea behind Storm is very similar to MapReduce, with the difference that Storm does not
chop up the data, but chops up the analysis job. By dividing a complex analysis job into small, reusable
parts, the processing can be heavily parallelized and distributed over the available worker nodes (a
worker node is simply a server designated for a computational task). In this way it is possible to
achieve a very high throughput, depending of course on the number of worker nodes. In Apache
Storm lingo such a small processing part is called a bolt. These bolts are chained together in what is
called a topology, which in its entirety performs the complex analysis job.
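The bolt-and-topology idea can be illustrated with a chain of small, reusable processing steps. This is a conceptual Python sketch only; the real Storm API is Java-based and distributes bolts over worker nodes:

```python
# Each "bolt" is one small, reusable processing step; chaining them
# together forms the "topology" that performs the full analysis job.

def split_bolt(stream):
    # Split each incoming sentence into words.
    for sentence in stream:
        yield from sentence.split()

def lowercase_bolt(words):
    # Normalize words to lowercase.
    for w in words:
        yield w.lower()

def count_bolt(words):
    # Aggregate a running word count.
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

stream = ["Big Data", "big streams"]
topology = count_bolt(lowercase_bolt(split_bolt(stream)))
print(topology)  # {'big': 2, 'data': 1, 'streams': 1}
```

Because each bolt is independent, Storm can run many copies of the same bolt in parallel, which is where the high throughput comes from.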
Lambda Architecture
The Lambda architecture, also coined by Nathan Marz, combines the two previous approaches. He
created this specific analysis approach because he saw the need for real-time analysis (similar to
stream processing) that also incorporates information derived from historical data, which can
potentially grow to extreme sizes. Without going into too much detail, this is what happens in the
Lambda architecture: a batch analysis job runs continuously over the historical data. In the case of
Tengu, this batch layer is handled by Apache Hadoop MapReduce, and its result is stored in what
Nathan Marz calls the Batch View. In parallel with this batch analysis job, the recently arrived data is
processed in a sub-optimal way; sub-optimal, because it has to be handled in real time. Tengu uses
Apache Storm, a stream processing analysis system, for this. When a client queries the system, the
Batch View and the results from the stream processing framework are aggregated, so the returned
information contains both recent data and information derived from the historical data. When an
iteration of the batch analysis job finishes, the recently received data (the data that was processed by
Apache Storm) is merged into the historical data. After this transition, the batch analysis job starts
again, now over more data, while the system receives new incoming data, which is in turn handled by
the stream processor, until the batch job finishes again and the newly received data is once more
moved to the batch analysis job's source data... and so on. This is a continuous process, always
providing a view on the historical data and a view on the real-time data. The Lambda architecture is
thus a hybrid approach for big data analysis, leveraging the computing power of batch processing in
the batch layer combined with the responsiveness of a real-time computing system in the stream
processor (called the speed layer in the Lambda architecture).
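The query-time aggregation described above can be sketched as follows. The per-key counts and view contents are made up for illustration; in Tengu the batch view would come from Hadoop MapReduce and the real-time view from Storm:

```python
# Lambda architecture sketch: a query combines the batch view (historical
# data, recomputed each batch iteration) with the speed layer's real-time
# view (only the data received since the last batch iteration).

batch_view = {"page_a": 1000, "page_b": 500}   # produced by the batch layer
realtime_view = {"page_a": 3, "page_c": 7}     # produced by the speed layer

def query(key):
    # The answer contains both historical and recent information.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("page_a"))  # 1003: historical plus recent data
print(query("page_c"))  # 7: only seen since the last batch iteration
```

When a batch iteration finishes, the recent data is folded into the batch view and the real-time view is reset, exactly as in the continuous process described above.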
Tengu provides all the necessary building blocks to set up such a Lambda architecture. Control of the
batch analysis job and of the movement of the recently arrived data is handled by the Enterprise
Service Bus that controls and manages Tengu.
Data Stores
Tengu already supports several data stores out of the box, including the relational database MySQL.
The other data stores are three NoSQL data stores with very distinct usages and feature sets. Now,
what does Tengu provide to you as an experimenter? It does all the deployment and configuration for
you, so the data stores are ready to be used by your applications.
Cassandra
Cassandra is what is called a key-value based column store. This means that every value is uniquely
identified by its key and that the value is a combination of different columns. Cassandra is a good
starting point for people coming from the RDBMS world who want a taste of what NoSQL is about,
because it still has a concept of tables. Next to this, Cassandra has a query language (called CQL) that
is very similar to the well-known SQL language. Cassandra of course has many NoSQL features, e.g.
decentralization (i.e. no single point of failure), data replication and fault tolerance (both against
network errors and server downtime).
A big difference between Cassandra and regular RDBMS systems is that it does not have the concept
of joins. It is not possible to join multiple keys into new sets of columns, as you would with RDBMS
tables; joins and other types of cross-references have to be handled by the application. This is a
consequence of the data model chosen for Cassandra. However, this data model (the key-value based
column store) has many advantages, e.g. large capacity and extremely fast writes and reads.
MongoDB
MongoDB is a key-value based document store. You will typically store entire documents in a
MongoDB data store. A document in MongoDB is a set of properties; think of it as a big JSON file. A
nice feature of MongoDB is that it does not require a data schema to be known in advance: it learns
the data format on the fly as you insert new documents. MongoDB also indexes your data as you
insert it, making for lightning-fast reads. Creating new documents and updating existing documents in
MongoDB is a bit more expensive (though still very fast and highly scalable), because of the automatic
data format recognition and the automatic indexing mechanism.
Although it is possible to query parts of a document, MongoDB is typically used to retrieve the entire
document at once. A very important feature of MongoDB, and definitely one of its strengths, is its
range query capability. A range query is, for example, when you need all documents created between
two specific dates (ranges do not have to be over dates; any property of the document can be used).
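The date-range example above can be sketched as follows. In MongoDB itself this would be expressed with the query operators `$gte` and `$lt` (e.g. `db.docs.find({"created": {"$gte": start, "$lt": end}})`); here plain dicts are filtered to illustrate the concept, with made-up documents:

```python
from datetime import date

docs = [
    {"title": "a", "created": date(2015, 1, 5)},
    {"title": "b", "created": date(2015, 2, 20)},
    {"title": "c", "created": date(2015, 4, 1)},
]

def range_query(docs, start, end):
    # Select every document whose "created" property falls in [start, end).
    return [d["title"] for d in docs if start <= d["created"] < end]

print(range_query(docs, date(2015, 1, 1), date(2015, 3, 1)))  # ['a', 'b']
```

An index on the queried property is what makes such range scans fast at scale.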
ElasticSearch
ElasticSearch can also be considered a document store, but it is much more than that. ElasticSearch's
main strength lies in its full-text search capabilities, for which it heavily relies on Apache Lucene.
ElasticSearch is in fact a feature-rich service layer on top of Apache's incredible indexing system,
Lucene. It provides an easy-to-use search API and filtering API, with a lot of customization
possibilities.
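A minimal full-text query in the ElasticSearch query DSL looks like the following; the index field name (`body`) is made up for illustration, and the JSON body would be sent to an index's `_search` endpoint:

```python
import json

# ElasticSearch "match" query: full-text search for documents whose "body"
# field contains the given words (analyzed and scored by Lucene).
query = {
    "query": {
        "match": {"body": "big data experiments"}
    },
    "size": 10,  # return at most 10 hits
}
print(json.dumps(query))
```

The DSL offers many more query types (phrase, boolean, fuzzy) and filters, which is what makes the service layer on top of Lucene so convenient.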
Other data storage systems
Thanks to some central components that are an integral part of Tengu, there are some extra data
storage systems you can experiment with. A very important one is the Apache Hadoop Distributed File
System (HDFS), the distributed file system that comes with Apache Hadoop MapReduce. HDFS can be
used as a regular file system, but with all the features of the Hadoop system: high availability,
scalability, redundancy, fault tolerance, resilience to network partitions, etc.
Another component that can be used to store data is Apache Kafka. Apache Kafka is actually a scalable
distributed message broker, but it is also capable of persisting large amounts of data. In Tengu we use
Apache Kafka as the message store for our Lambda architecture implementation; it integrates tightly
with Apache Storm and HDFS. There are generally two types of message brokers: queue-based and
topic-based systems. Apache Kafka adopts a form of the topic-based model.
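The topic-based model can be sketched as follows: producers publish to a named topic, messages are retained in order, and consumers read from any offset. This is an illustrative toy broker only, not the Kafka API:

```python
from collections import defaultdict

class TopicBroker:
    """Toy topic-based broker: one ordered, persisted log per topic."""

    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> ordered message log

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def read(self, topic, offset=0):
        # Messages are retained, so consumers choose where to start reading.
        return self.topics[topic][offset:]

broker = TopicBroker()
broker.publish("clicks", {"user": "u1"})
broker.publish("clicks", {"user": "u2"})
print(broker.read("clicks", offset=1))  # [{'user': 'u2'}]
```

It is this retained, replayable log that lets Kafka double as a message store in the Lambda architecture.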
Resource management and configuration
Fed4FIRE
With Tengu it is possible to set up big data experimentation environments through an easy RESTful
API. With a simple POST request you can create a new environment. In the background, this POST
request is translated into specific calls to one of the Fed4FIRE testbeds (http://www.fed4fire.eu/);
through the API you can actually decide yourself on which testbed the Tengu experimentation
environment should be deployed. Fed4FIRE is a large European Integration Project under the 7th
Framework Programme, enabling experiments that combine and federate facilities from the different
FIRE research communities. IBCN is one of the main driving forces in Fed4FIRE, not only by opening up
IBCN's Virtual Wall to other Fed4FIRE partners, but also as the main developer of several federation
and client tools used in Fed4FIRE. One of these tools is JFed (http://jfed.iminds.be/), which allows an
experimenter to set up any type of server topology on the Fed4FIRE testbeds. It is this tool that Tengu
uses to allocate and deploy the necessary resources.
Configuration – Chef
The JFed client tool only provides the servers to be used in the Tengu experimentation environment.
The necessary software components and their configuration are handled by Chef
(https://www.chef.io/chef/), a configuration management framework. Currently we provide a
particular set of predefined cookbooks and recipes that are used in Tengu. Cookbooks and recipes
contain the necessary information and dependencies to deploy, configure and integrate specific
pieces of software. We open up Chef to our Tengu experimenters as well, allowing them to deploy
their own specific set of tools and software components. In the Chef Supermarket
(https://supermarket.chef.io/cookbooks-directory) you can find cookbooks for most commonly used
tools. If you need tools that are not available in the Supermarket, you can always create the cookbooks
and recipes yourself, or, if that is too much hassle, you can still use the tools available in Ubuntu to
deploy and install your software components manually. We, however, strongly advise using Chef, as it
makes it very straightforward to move from the experimentation environment to your company's
production environment.
Cloud virtualization – OpenStack
Tengu sets up a big data experimentation environment with a fixed set of servers. The size of some of
the clusters (such as the Apache Hadoop cluster and the Storm cluster) can be defined by the
experimenter, but the other servers are fixed. It is possible, especially when experimenting with
existing applications, that this fixed set of servers is not sufficient. Tengu answers this need by also
configuring an OpenStack private cloud, whose size can likewise be defined by the experimenter. In
this OpenStack private cloud an experimenter can create many virtual machines to deploy the
software components needed by the experimenter's application. Deployment and configuration of
these components can, and this is again the preferred way, be handled by Chef.
Platform Usage
In what follows we give a small introduction on how you can use Tengu to set up your own big data
experimentation environment. This is achieved by calling Tengu's RESTful API. For more advanced
usage of the RESTful API and more in-depth information about Tengu, we refer to the documentation
on the Tengu website.
Step 1: Prerequisites
The only thing you currently need to get started with Tengu is a valid Fed4FIRE account.
Documentation concerning Fed4FIRE, and especially how to obtain such an account, can be found on
the Fed4FIRE documentation site (http://doc.fed4fire.eu/getanaccount.html).
The RESTful API is a combination of GET and POST HTTP requests. For the GET requests a standard
browser suffices, but for the POST requests this is not enough. Some browsers have extensions to
perform RESTful requests, but we advise using a tool such as cURL (http://curl.haxx.se/) for the RESTful
API requests. All examples provided here use cURL.
Step 2: Deploy your first Tengu core setup
Setting up a Tengu big data environment is as easy as issuing the following HTTP POST:

$ curl -k -i "http://[2001:6a8:1d80:23::141]:8280/tengu/core?hnodes=3&snodes=2&testbed=urn:publicid:IDN+wall2.ilabt.iminds.be+authority+cm" -X POST

This sets up a Tengu big data environment with an Apache Hadoop cluster of size 3 and an Apache
Storm cluster of size 2. Let us break down the URL.
The Tengu RESTful API can be reached at the IPv6 address 2001:6a8:1d80:23::141 on port
8280. The path /tengu/core tells the API to set up a Tengu core platform. The Tengu core
platform is configured using several parameters, provided as key-value pairs in the query string:
 The testbed this Tengu core setup should be deployed to is set by the testbed parameter.
Its value is the Fed4FIRE uuid of the testbed (here
urn:publicid:IDN+wall2.ilabt.iminds.be+authority+cm).
 The size of the Apache Hadoop cluster is set via the hnodes parameter.
 Similarly, snodes sets the size of the Apache Storm cluster.
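The same query string can be assembled programmatically; a sketch using Python's standard library (note that urlencode percent-encodes the reserved characters in the testbed urn, which the server is assumed to decode accordingly):

```python
from urllib.parse import urlencode

# Build the Tengu core request URL from its parameters. The host, port and
# parameter names are taken from the curl example above.
base = "http://[2001:6a8:1d80:23::141]:8280/tengu/core"
params = {
    "hnodes": 3,  # size of the Apache Hadoop cluster
    "snodes": 2,  # size of the Apache Storm cluster
    "testbed": "urn:publicid:IDN+wall2.ilabt.iminds.be+authority+cm",
}
url = base + "?" + urlencode(params)
print(url)
```

The resulting URL can then be POSTed with cURL or any HTTP client.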
A response similar to the one below will be returned. It includes the uuid (here
urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0) that can be used to retrieve
information about the Tengu core setup.

HTTP/1.1 202 OK
Date: Wed, 11 Mar 2015 07:54:55 GMT
Content-Type: application/xml; charset=utf-8
Connection: close
Transfer-Encoding: chunked

<?xml version="1.0" encoding="UTF-8"?>
<ten:tengu xmlns:ten="http://tengu.ibcn.ugent.be/0/1/core"
    xmlns:lnk="http://www.w3.org/1999/xhtml">
  <ten:platform>
    <ten:id>urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0</ten:id>
    <lnk:link method="get" href="/tengu/urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0" />
  </ten:platform>
</ten:tengu>

Step 3: Get information about your deployed setup
Retrieving information (e.g. state information) about your deployed big data environment is done via
the following HTTP GET:

$ curl -k -i "http://[2001:6a8:1d80:23::141]:8280/tengu/urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0"

Depending on the current state of the deployment, the response will also include links to the
interfaces of important components. In the case of a Tengu core setup, these are the Hadoop
administration front end, the web UI of the HDFS NameNode, the web UI of the Storm cluster, the
OpenStack Horizon UI and the web UI of the WSO2 ESB. The format of the response is as follows:

HTTP/1.1 200 OK
Date: Wed, 11 Mar 2015 08:22:03 GMT
Content-Type: application/xml; charset=utf-8
Connection: close
Transfer-Encoding: chunked

<?xml version="1.0" encoding="UTF-8"?>
<ten:tengu xmlns:ten="http://tengu.ibcn.ugent.be/0/1/core"
    xmlns:lnk="http://www.w3.org/1999/xhtml">
  <ten:platform>
    <ten:id>{uuid}</ten:id>
    <ten:status>{UNKNOWN|READY|FAILED}</ten:status>
    <lnk:link method="..." rel="..." href="..." /> *
  </ten:platform>
</ten:tengu>
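Your own tooling can extract the platform id from such a response with standard XML parsing; a sketch using Python's ElementTree, with the namespace URI and id taken from the POST response shown earlier:

```python
import xml.etree.ElementTree as ET

# XML body of a Tengu core response (HTTP headers stripped).
response = """<ten:tengu xmlns:ten="http://tengu.ibcn.ugent.be/0/1/core"
    xmlns:lnk="http://www.w3.org/1999/xhtml">
  <ten:platform>
    <ten:id>urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0</ten:id>
    <lnk:link method="get" href="/tengu/urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0" />
  </ten:platform>
</ten:tengu>"""

# Map the "ten" prefix to its namespace URI before looking up elements.
ns = {"ten": "http://tengu.ibcn.ugent.be/0/1/core"}
root = ET.fromstring(response)
uuid = root.find("ten:platform/ten:id", ns).text
print(uuid)  # urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0
```

The extracted uuid is exactly what the subsequent GET request needs in its path.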
What’s next for Tengu?
The Tengu platform currently integrates and configures a lot of building blocks automatically. What
Tengu does not yet do automatically is deploy an experimenter's application. Although it is possible to
do everything by hand, we would like to help experimenters even more by making the process of
deploying an application as generic, configurable and automated as possible. This requires an abstract
view on what an application is, focusing especially on big data applications and cloud applications with
all their different layers.
Somewhat connected to this is easing cloud and big data adoption for existing applications. At the
moment, if someone wants to see how an existing application would react when using, for example, a
NoSQL data store, the application has to be changed drastically. We want to make these changes to
the application obsolete by automatically transforming the requests coming from the application into
requests that can be interpreted by the new data store, without changing the semantics of the original
request. This last part is extremely important: current solutions with middleware abstraction layers
are already capable of letting the application talk to different data stores in a generic, abstracted way,
but they do not guarantee that the behavior of the original requests is maintained.
Another research and development topic we are currently investigating is monitoring in a big data
environment. This research is very challenging, not only because of the highly distributed nature of
big data environments, but also because of the heterogeneity of the different components involved.
We believe that if we want to make Tengu the go-to platform for experimentation in big data and
cloud contexts, we have to offer a deeply integrated monitoring framework as well. This monitoring
framework should not only look at the environment's resource usage, but should also monitor the
behavior of the application.
Tengu Projects
DMS² – http://www.iminds.be/en/projects/2014/03/06/dms2
The DMS² (Decentralized Data Management and Migration for SaaS) project aims for the creation of
a strategic and practical framework to deal with data management challenges for (potentially new and
currently active) SaaS providers in the cloud.
The outcome includes:
 A reference model for requirements engineering and architectural trade-off analysis, specific
for data management and migration in SaaS solutions. This is an essential element for
customer acquisition projects.
 Middleware for data management, supporting interoperability, data protection measures,
tactics and solutions for federated data storage and data processing.
The project outcome is driven and validated by four industry case studies, from UP-nxt, Verizon
Terremark, Agfa and Luciad, and results in a demonstrator.
AMiCA – http://www.amicaproject.be
The AMiCA (“Automatic Monitoring for Cyberspace Applications”) project aims to mine relevant social
media (blogs, chat rooms, and social networking sites) and collect, analyse, and integrate large
amounts of information using text and image analysis. The ultimate goal is to trace harmful content,
contact, or conduct in an automatic way. Essentially, we take a cross-media mining approach that
allows us to detect risks “on-the-fly”. When critical situations are detected (e.g. a very violent
communication), alerts can be issued to moderators of the social networking sites. When used on
aggregated data, the same technology can be used for incident collection and monitoring at the scale
of individual social networking sites. In addition, the technology can provide accurate quantitative
data to support providers, science, and government in decision-making processes with respect to child
safety online.
 Period: 01/01/2013 - 31/12/2016
 Sponsor: IWT - Agentschap voor Innovatie door Wetenschap en Technologie (Agency for
Innovation by Science and Technology)
SEQUOIA – http://www.iminds.be/en/projects/2015/03/11/sequoia
The SEQUOIA (Safe Query Applications for Cloud-Based SaaS Applications) project aims to create a
security framework for advanced queries and reporting in SaaS environments. These solutions will be
combined with intricate security rules at the application level. As a result, SaaS providers will be able
to further optimize their offerings and strengthen confidence in SaaS services.
iFest – http://www.iminds.be/en/projects/2015/03/11/ifest
The iFest project aims to develop a new generation of festival wristbands that ensure a richer festival
experience based on built-in communication and sensor functions. iFest is also focusing on a software
platform that allows organizers to manage the wristbands and analyze the data obtained.
PROVIDENCE – http://www.iminds.be/en/projects/2014/06/28/providence
The PROVIDENCE (Predicting the Online Virality of Entertainment and News Content) research project
aims to optimize online news publication strategies by anticipating the predicted viral nature of news
on social media.
Social networks are increasingly popular for distributing news content. It’s a medium where the users
themselves decide which topics become ‘viral’. The main goal of the PROVIDENCE project is to
optimize online news publication strategies by proactively making use of the predicted virality of
news on social media platforms. PROVIDENCE will tackle the technological and research challenges of
building a virality-driven production flow into a commercial online news environment. These
research challenges encompass the large-scale monitoring, analysis and prediction of news
consumption and news sharing behavior by users and specific user segments.
Fed4FIRE – http://www.fed4fire.eu
Fed4FIRE is an Integrating Project under the European Union’s Seventh Framework Program (FP7)
addressing the work program topic Future Internet Research and Experimentation. It started in
October 2012 and will run for 48 months, until the end of September 2016.
Experimentally driven research is considered to be a key factor for growing the European Internet
industry. In order to enable this type of RTD activities, a number of projects for building a European
facility for Future Internet Research and Experimentation (FIRE) have been launched, each project
targeting a specific community within the Future Internet ecosystem. Through the federation of these
infrastructures, innovative experiments become possible that break the boundaries of these domains.
Besides, infrastructure developers can utilize common tools of the federation, allowing them to focus
on their core testbed activities.
Recent projects have already successfully demonstrated the advantages of federation within a
community. The Fed4FIRE project intends to implement the next step in these activities by successfully
federating across the community borders and offering openness for future extensions.
IBCN’s Tengu Team
Thomas Vanhove
Thomas obtained his master’s degree in Computer Science from Ghent University, Belgium in July
2012. In August 2012, he started his PhD at the IBCN (Intec Broadband Communication Networks)
research group, researching data management solutions in cloud environments. Tengu originated in
the first years of his research in this domain and has since become the main focus of his PhD.
Dr. Gregory Van Seghbroeck
Gregory Van Seghbroeck graduated from Ghent University in 2005. After a brief stint as an IT consultant,
he joined the Department of Information Technology (INTEC) at Ghent University. On the 1st of
January, 2007, he received a PhD grant from IWT, Institute for the Support of Innovation through
Science and Technology, to work on theoretical aspects of advanced validation mechanism for
distributed interaction protocols and service choreographies. In 2011 he received his Ph.D. in
Computer Science Engineering. Since July 2012, he has been active as a post-doctoral researcher at
Ghent University, where he has been involved in several national and European projects, including the
FP7 project BonFIRE and the award-winning ITEA2 project SODA. His main research interests include
complex distributed processes, cloud computing, service engineering, and service oriented
architectures. As an author or co-author his work has been published in international journals and
conference proceedings.
Dr. ir. Tim Wauters
Tim received his M.Sc. degree in electro-technical engineering in June 2001 from Ghent University,
Belgium. In January 2007, he obtained the Ph.D. degree in electro-technical engineering at the same
university. Since September 2001, he has been working in the Department of Information Technology
(INTEC) at Ghent University, and is now active as a post-doctoral fellow of the F.W.O.-V. His main
research interests focus on network and service architectures and management solutions for scalable
multimedia delivery services. His work has been published in about 50 scientific publications in
international journals and in the proceedings of international conferences.
Dr. Bruno Volckaert
Bruno Volckaert graduated in 2001 from Ghent University and obtained his PhD entitled
“Architectures and Algorithms for network and service aware Grid resource management” in 2006.
Since then he has been responsible for over 20 research projects (ICON, EU FP6, ITEA, SBO). He was
research lead of the TRACK and RAILS projects, both dealing with advances in distributed software for
railway transportation and is currently research lead of the Elastic Media Distribution project dealing
with Cloud provisioning for professional media cooperation platforms. His main focus is on distributed
systems, more specifically dealing with intelligent cloud resource provisioning methods and
transportation.
Prof. Dr. ir. Filip De Turck
Filip received his M.Sc. degree in Electronic Engineering from Ghent University, Belgium, in June
1997. In May 2002, he obtained the Ph.D. degree in Electronic Engineering from the same university.
During his Ph.D. research he was funded by the F.W.O.-V., the Fund for Scientific Research Flanders.
From October 2002 until September 2008, he was a post-doctoral fellow of the F.W.O.-V. and a part-time
professor affiliated with the Department of Information Technology of Ghent University. At the
moment, he is a full-time professor affiliated with the Department of Information Technology of Ghent
University and IBBT (the Interdisciplinary Institute for Broadband Technology, Flanders), working in
the area of telecommunication and software engineering. Filip De Turck is author or co-author of
approximately 250 papers published in international journals or in the proceedings of international
conferences. His main research interests include scalable software architectures for
telecommunication network and service management, performance evaluation and design of new
telecommunication and eHealth services.

Experimenting with Big Data

Thomas Vanhove, Gregory Van Seghbroeck, Tim Wauters, Bruno Volckaert, Filip De Turck

Big Data and its Problems

For us the big data domain can be divided into three main categories:
 big data analysis,
 big data management, and
 querying big data.

This distinction is mainly a research convenience; big data applications almost always require a combination of these three topics.

Big data analysis can in turn be subdivided into two tracks: analysis of big data sets and analysis of large streams of incoming data. The latter is often called stream processing or complex event processing. Stream processing usually comes with a real-time requirement, i.e. the result of processing the incoming data is needed immediately, whereas analysis of big data sets calls for a more thorough analysis.

The main research track in big data management is data distribution, or data partitioning: given a set of available servers, where will you place the data? Data distribution is in most cases handled by the data store or data storage system you have selected, and it has an impact on many features, e.g. robustness of the system, data redundancy, read and write performance, data consistency and availability. A wide variety of data stores and data storage systems already exists, each excelling at certain features; however, it is not possible to support all features simultaneously. Choosing from this huge jungle of possible data storage systems is not a trivial job. But help is on its way!

Querying as a research topic is also largely handled by the data stores themselves, and it has a lot of common ground with data management. We see at least three major ways of querying data in the big data context. First of all there are simple reads. These have everything to do with how and where the big data is stored, and show a lot of similarities to data management, especially indexing, data partitioning and data redundancy.
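The data-placement question above ("you have a set of servers available, where will you place the data?") can be illustrated with a toy hash-partitioning scheme. The server names below are made up, and real data stores typically use more robust schemes such as consistent hashing:

```python
# Toy hash partitioning: every key is deterministically assigned to one
# server, so a simple read only has to contact that single server.
# (Illustration only; production stores prefer consistent hashing.)
from zlib import crc32

SERVERS = ["server-0", "server-1", "server-2"]  # hypothetical cluster

def server_for(key: str) -> str:
    """Pick the server responsible for storing a given key."""
    return SERVERS[crc32(key.encode("utf-8")) % len(SERVERS)]

# Reads and writes for the same key always end up on the same server.
assert server_for("user:42") == server_for("user:42")
```

Note that with plain modulo hashing, adding a server reassigns most keys to a different server; that is precisely why production systems favour consistent hashing.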
Another querying method involves range queries, where a set of results (sometimes in a specific order) is returned; how this set is constructed is the topic of this research. The third way of querying is full-text search, where large chunks of plain or structured text are searched for occurrences of specific words or concepts.

For companies it is difficult to choose a specific strategy, especially in this relatively new and volatile domain. Would it not be interesting to be able to experiment with all these different technologies? And what about your applications: are they big data proof? Do they scale? Can they handle increasing loads? How many and which resources are needed? What if I told you there is already a platform where you can do your big data experiments, ready to use, without the hassle of integration and configuration?

Tengu: Big Data Applications Platform

The Tengu platform allows customers to experiment with many aspects of big data. If you simply want to try new big data stores (e.g. Cassandra or ElasticSearch), you can easily set up a Tengu environment with these components already configured. You can also try different big data analysis methodologies with Tengu. For example, a clean Tengu instance comes with three different types of big data analysis: stream processing, batch analysis and the Lambda architecture. If you want to experiment with an existing application, for instance to see how well it performs in a big data context, Tengu can be used for that purpose as well.

In what follows we go deeper into the technologies and software components that currently make up Tengu, with specific attention to each component's function in Tengu and how you, as an experimenter, can use it. The last subsection provides a small tutorial on how to use Tengu to set up a big data environment.

Technologies

Big data analysis

Batch analysis – Apache Hadoop MapReduce

For batch analysis Tengu relies on Apache Hadoop MapReduce. MapReduce is a parallel programming concept first coined by Google, and due to the way it works it targets a specific type of big data batch analysis job. MapReduce, as the name suggests, consists of two phases: a map phase and a reduce phase. Typically, an extremely large data set is chopped into smaller, manageable parts (i.e. parts that can be handled by simple PCs). On these smaller parts a map function is executed; this is where the actual analysis happens. The map phase is ideally executed on as many nodes as there are parts. The result of the map function should always be some kind of key-value set. These key-value sets are then aggregated into one large key-value set in the reduce phase. It is important to point out a significant shortcoming of the MapReduce programming concept: the analysis cannot have dependencies on the entire data set or on parts of it. Does this mean that you cannot have any dependencies between your data or the analysis of this data? Not at all, since it is possible to chain several MapReduce jobs to perform very complex analysis jobs.

Stream processing – Apache Storm

Tengu uses the Apache Storm project for its stream processing.
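The map and reduce phases described in the batch analysis section can be simulated in a few lines of plain Python. This is a local illustration of the concept, not the Hadoop API, using the classic word-count job:

```python
# Local simulation of the MapReduce concept: the input is chopped into
# parts, a map function turns each part into key-value pairs, and a
# reduce function aggregates the values per key.
from collections import defaultdict
from itertools import chain

def map_phase(part: str):
    """Map: emit a (word, 1) pair for every word in this part of the data."""
    return [(word, 1) for word in part.split()]

def reduce_phase(pairs):
    """Reduce: aggregate the emitted values per key."""
    result = defaultdict(int)
    for key, value in pairs:
        result[key] += value
    return dict(result)

parts = ["big data is big", "data is data"]  # the chopped-up data set
mapped = chain.from_iterable(map_phase(p) for p in parts)
print(reduce_phase(mapped))  # {'big': 2, 'data': 3, 'is': 2}
```

In a real cluster, each `map_phase` call would run on a different node near its part of the data, and the shuffle between the two phases would group the pairs by key before reduction.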
This open source project, initially developed by Nathan Marz for Twitter, is capable of processing large streams of data. It is advertised as real-time processing, but this of course depends on the type of processing you want to do. The idea behind Storm is very similar to MapReduce, with the difference that Storm does not chop up the data, but the analysis job. By dividing a complex analysis job into small, reusable parts, the processing can be heavily parallelized and distributed over the available worker nodes [1], making a very high throughput possible (depending, of course, on the number of worker nodes). In Apache Storm lingo such a small processing part is called a bolt. Bolts are chained together in what is called a topology, which in its entirety performs the complex analysis job.

[1] A worker node is simply a server designated for a computational task.

Lambda Architecture

The Lambda architecture, also coined by Nathan Marz, combines the two previous approaches. He created this analysis approach because he saw the need for real-time analysis (similar to stream processing) that also includes information on historical data, which can potentially grow to extreme sizes. Without going into too much detail, this is what happens in the Lambda architecture: a batch analysis job runs over the historical data; in Tengu this batch layer is handled by Apache Hadoop MapReduce. The result of this processing is stored in what Nathan Marz calls the Batch View. While this batch job is running, the newly incoming data receives a sub-optimal processing; sub-optimal, because it needs to be handled in real-time. Tengu uses Apache Storm, a stream processing analysis system, for this. When a client queries the system for information, the Batch View and the results from the stream processing framework are aggregated, so the returned information combines recent info with information derived from the historical data. When an iteration of the batch analysis job finishes, the recently received data (the data processed by Apache Storm) is merged into the historical data. Once this transition is finished, the batch analysis job starts again, now over more data. During this new iteration, newly incoming data is again handled by the stream processor, until the batch job finishes once more and the newly received data is moved to the batch analysis job's source data, and so on. This is a continuous process, always providing both a view on the historical data and a view on the real-time data. The Lambda architecture is thus a specific hybrid approach for big data analysis, leveraging the computing power of batch processing in a batch layer and the responsiveness of a real-time computing system in the stream processor (called the speed layer in the Lambda architecture). Tengu provides all the necessary building blocks to set up such a Lambda architecture. Control of the batch analysis job and of the movement of the recently arrived data is handled by the Enterprise Service Bus controlling and managing Tengu.

Data Stores

Tengu already supports several data stores out of the box, including the relational database MySQL. The others are three NoSQL data stores with very distinct usage and features. Now, what does Tengu provide to you as an experimenter? It handles all the deployment and configuration for you, so the data stores are ready to be used by your applications.

Cassandra

Cassandra is what is called a key-value based column store.
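As a mental model of such a column store (this only illustrates the data model, not Cassandra's API or CQL; the keys and column names are invented for the example):

```python
# Mental model of a key-value based column store: each key uniquely
# identifies a value, and that value is a combination of named columns.
# (Data-model illustration only, not Cassandra itself.)
store = {
    "user:1": {"name": "Alice", "city": "Ghent"},
    "user:2": {"name": "Bob", "city": "Antwerp", "age": 34},  # columns may differ per key
}

def read_column(key: str, column: str):
    """Fetch one column of the value identified by key (None if absent)."""
    return store[key].get(column)

print(read_column("user:2", "city"))  # Antwerp
```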
This means that every value is uniquely identified by its key and that the value is a combination of different columns. Cassandra is a good starting point for people coming from the RDBMS world who want a taste of what NoSQL is about, because it still has a concept of tables. In addition, Cassandra has a query language (called CQL) that is very similar to the well-known SQL language. Cassandra of course has many NoSQL features, e.g. decentralization (i.e. no single point of failure), data replication and fault tolerance (both to network errors and server downtime). A big difference between Cassandra and regular RDBMS systems is that it does not have the concept of joins: it is not possible to join multiple keys into new sets of columns, as you would with RDBMS tables. Joins and other types of cross-references have to be handled by the application. This is a consequence of the data model chosen for Cassandra. However, this data model (the key-value based column store) has many advantages, e.g. large capacity and extremely fast writes and reads.

MongoDB

MongoDB is a key-value based document store: you typically store entire documents in a MongoDB data store. A document in MongoDB is a set of properties; think of it as a big JSON file. A nice feature of MongoDB is that it does not require you to know a data schema in advance: it learns the data format on the fly as you insert new documents. MongoDB also indexes your data as you insert it, making for lightning-fast reads. Creating new documents and updating existing documents in MongoDB is a bit more tedious (though still very fast and highly scalable), because of the automatic data format recognition and the automatic indexing mechanism. Although it is possible to query parts of a document, MongoDB is typically used to retrieve the entire document at once. A very important feature of MongoDB, and definitely one of its strengths, is its range query capabilities. A range query is, for example, when you need all documents created between two specific dates (ranges do not always have to be over dates; any property of the document can be used).

ElasticSearch

ElasticSearch can also be considered a document store, but it is much more. ElasticSearch's main advantage is its full-text search capabilities, for which it heavily relies on Apache Lucene. ElasticSearch is actually a feature-rich service layer on top of Apache's incredible indexing system, Lucene. It provides an easy-to-use search API and filtering API, with a lot of customization possibilities.

Other data storage systems

Thanks to some central components that are an integral part of Tengu, there are some extra data storage systems you can experiment with. A very important one is the Apache Hadoop Distributed File System (HDFS), the distributed file system that comes with Apache Hadoop MapReduce. HDFS can be used as a regular file system, but with all the features of the Hadoop system: high availability, scalability, redundancy, fault tolerance, tolerance to network partitions, etc. Another component that can be used to store data is Apache Kafka. Apache Kafka is actually a scalable distributed message broker, but it is also capable of persisting large amounts of data. In Tengu we use Apache Kafka as the message store for our Lambda architecture implementation; it integrates tightly with Apache Storm and HDFS. There are generally two types of message brokers: queue based and topic based systems. Apache Kafka adopts some form of the topic based message broker.

Resource management and configuration

Fed4FIRE

With Tengu it is possible to set up big data experimentation environments through an easy RESTful API. With a simple POST request you can create a new environment.
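Such a POST can be sent from any HTTP client. As a sketch using only Python's standard library, the request below mirrors the cURL example given at the end of this article; the parameter values are illustrative:

```python
# Build the Tengu "create environment" POST request. The endpoint and the
# hnodes/snodes/testbed parameters mirror the cURL example in this article.
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://[2001:6a8:1d80:23::141]:8280/tengu/core"

params = {
    "hnodes": 3,  # size of the Apache Hadoop cluster
    "snodes": 2,  # size of the Apache Storm cluster
    "testbed": "urn:publicid:IDN+wall2.ilabt.iminds.be+authority+cm",
}

req = Request(BASE + "?" + urlencode(params), method="POST")
print(req.get_method())  # POST
# urllib.request.urlopen(req) would actually issue the request.
```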
What happens in the background, is that this POST request is translated into specific calls to one of the Fed4FIRE2 testbeds (with the API you can actually decide yourself on which testbed you want the Tengu experimentation environment to be deployed). Fed4FIRE is a large European Integration Project under the 7th Framework Program enabling experiments that combine and federate facilities from the different FIRE research communities. IBCN is one of the main driving forces in Fed4FIRE, not only with opening up IBCN’s Virtual Wall to other Fed4FIRE partners, but also as main developer of several federation and client tools used in Fed4FIRE. One of these tools is JFed3 , which allows an experimenter to setup any type of server topology on the Fed4FIRE testbeds. It is actually this tool that is being used by Tengu to allocate and deploy the necessary resources. Configuration – Chef The JFed client tool only provides the servers to be used in the Tengu experiment environment. The necessary software components and the configuration of these components is handled by Chef4 , a configuration management framework. Currently we provide a particular set of predefined cookbooks and recipes that are being used in Tengu. Cookbooks and recipes contain the necessary information and dependencies to deploy, configure and integrate specific pieces of software. However we open up Chef to our Tengu experimenters, allowing them to deploy their own specific set of tools and 2 http://www.fed4fire.eu/ 3 http://jfed.iminds.be/ 4 https://www.chef.io/chef/
software components as well. In the Chef Supermarket [5] you can find cookbooks for most of the commonly used tools. If you need tools that are not available in the Supermarket, you can always create the cookbooks and recipes yourself or, if this is too much hassle, you can still use the tools available in Ubuntu to deploy and install your software components manually. We strongly advise using Chef, however, as it makes it very straightforward to move from the experimentation environment to your company's production environment.

Cloud virtualization – OpenStack
Tengu will set up a big data experimentation environment for you with a fixed set of servers. The size of some of the clusters (such as the Apache Hadoop cluster and the Storm cluster) can be defined by the experimenter, but the other servers are fixed. It is possible, especially when experimenting with existing applications, that this fixed set of servers is not sufficient. Tengu answers this need by also configuring an OpenStack private cloud. The size of this cloud can also be defined by the experimenter. In this OpenStack private cloud an experimenter can create virtual machines on which to deploy the software components needed by the experimenter's application. Deployment and configuration of these components can be handled by Chef, which is again the preferred way.

Platform Usage
In what follows we give a short introduction on how to use Tengu to set up your own big data experimentation environment. This is achieved by calling Tengu's RESTful API. For more advanced usage of the RESTful API and more in-depth information about Tengu, we refer to the documentation on the Tengu website.

Step 1: Prerequisites
The only thing you currently need to get started with Tengu is a valid Fed4FIRE account. Documentation concerning Fed4FIRE, and especially how to obtain such an account, can be found on the Fed4FIRE documentation site [6].
The RESTful API is a combination of GET and POST HTTP requests. For the GET requests a standard browser suffices, but for the POST requests this is not enough. Some browsers have extensions to perform RESTful requests, but we advise using a tool such as cURL [7] for the RESTful API requests. All examples provided here use cURL.

Step 2: Deploy your first Tengu core setup
Setting up a Tengu big data environment is as easy as issuing the following HTTP POST. This will set up a Tengu big data environment with an Apache Hadoop cluster of size 3 and an Apache Storm cluster of size 2.

$ curl -k -i "http://[2001:6a8:1d80:23::141]:8280/tengu/core?hnodes=3&snodes=2&testbed=urn:publicid:IDN+wall2.ilabt.iminds.be+authority+cm" -X POST

Let us break down the URL.

[5] https://supermarket.chef.io/cookbooks-directory
[6] http://doc.fed4fire.eu/getanaccount.html
[7] http://curl.haxx.se/
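The same request URL can also be constructed programmatically. The following Python sketch only rebuilds the URL from the parameters in the example above; the endpoint address and parameter names come from the text, while the helper function is our own and issuing the actual POST (e.g. with an HTTP client library) is left out.

```python
from urllib.parse import urlencode

# IPv6 address and port of the Tengu RESTful API, as given in the text.
TENGU_API = "http://[2001:6a8:1d80:23::141]:8280/tengu/core"

def build_core_url(hnodes, snodes, testbed):
    """Build the URL for the POST request that creates a Tengu core setup.

    This helper is a sketch of ours, not part of Tengu itself.
    """
    query = urlencode({
        "hnodes": hnodes,    # size of the Apache Hadoop cluster
        "snodes": snodes,    # size of the Apache Storm cluster
        "testbed": testbed,  # Fed4FIRE uuid of the target testbed
    })
    return TENGU_API + "?" + query

url = build_core_url(3, 2, "urn:publicid:IDN+wall2.ilabt.iminds.be+authority+cm")
# `url` is now equivalent to the cURL example above (modulo percent-encoding
# of the ':' and '+' characters in the testbed uuid).
```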
The Tengu RESTful API can be reached at the IPv6 address 2001:6a8:1d80:23::141 on port 8280. By using the path /tengu/core we tell the API to set up a Tengu core platform. The Tengu core platform is configured through several parameters, provided as key-value pairs in the query string:
• The testbed on which this Tengu core setup should be deployed is set by the testbed parameter. Its value is the Fed4FIRE uuid of the testbed (here urn:publicid:IDN+wall2.ilabt.iminds.be+authority+cm).
• The size of the Apache Hadoop cluster is set via the hnodes parameter.
• The size of the Apache Storm cluster is set similarly via the snodes parameter.

A response similar to the one below is returned. It includes the uuid (here urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0) that can be used to retrieve information about the Tengu core setup.

HTTP/1.1 202 OK
Date: Wed, 11 Mar 2015 07:54:55 GMT
Content-Type: application/xml; charset=utf-8
Connection: close
Transfer-Encoding: chunked

<?xml version="1.0" encoding="UTF-8"?>
<ten:tengu xmlns:ten="http://tengu.ibcn.ugent.be/0/1/core"
           xmlns:lnk="http://www.w3.org/1999/xhtml">
  <ten:platform>
    <ten:id>urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0</ten:id>
    <lnk:link method="get" href="/tengu/urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0" />
  </ten:platform>
</ten:tengu>

Step 3: Get information about your deployed setup
Retrieving information (e.g. state information) about your deployed big data environment is done via the following HTTP GET.

$ curl -k -i "http://[2001:6a8:1d80:23::141]:8280/tengu/urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0"

Depending on the current state of the deployment, the response will also include links to the interfaces of important components. In the case of a Tengu core setup, these are the Hadoop administration front end, the web UI of the HDFS NameNode, the web UI of the Storm cluster, the OpenStack Horizon UI and the web UI of the WSO2 ESB. The format of the response will be as follows.
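Extracting the uuid from the POST response can be done with an ordinary XML parser. The sketch below uses Python's standard library; the XML body is the sample response from the text (with the quoting normalised), and the namespace URIs are the ones the Tengu API declares in that sample.

```python
import xml.etree.ElementTree as ET

# Sample 202 response body from the text, quoting normalised.
RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<ten:tengu xmlns:ten="http://tengu.ibcn.ugent.be/0/1/core"
           xmlns:lnk="http://www.w3.org/1999/xhtml">
  <ten:platform>
    <ten:id>urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0</ten:id>
    <lnk:link method="get" href="/tengu/urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0" />
  </ten:platform>
</ten:tengu>"""

NS = {"ten": "http://tengu.ibcn.ugent.be/0/1/core"}

root = ET.fromstring(RESPONSE)
uuid = root.find("ten:platform/ten:id", NS).text
# `uuid` can now be appended to the API base URL for the follow-up GET
# request that retrieves state information about the deployed setup.
```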
What's next for Tengu?
The Tengu platform currently has a lot of building blocks that are already integrated and configured automatically. What Tengu does not yet do automatically is deploy an experimenter's application. Although it is possible to do everything by hand, we would like to help experimenters even more by making the process of deploying an application as generic, configurable and automated as possible. This requires an abstract view on what an application is, especially focusing on big data applications and cloud applications with all their different layers.

Somewhat connected to this is easing cloud and big data adoption for existing applications. At the moment, if someone wants to experiment with how an existing application would react when using, for example, a NoSQL data store, the application has to be changed drastically. We want to make these changes to the application obsolete by automatically transforming the requests coming from the application into requests that can be interpreted by the new data store, without changing the semantics of the original request. This last part is extremely important. Current solutions with middleware abstraction layers are already capable of letting the application talk to different data stores in a generic, abstracted way, but it is not guaranteed that the behavior of the original requests is maintained.

Another research and development topic we are currently investigating is monitoring in a big data environment. This research is very challenging, not only because of the highly distributed nature of these big data environments, but also because of the heterogeneity of the different components involved. We believe that if we want to make Tengu the go-to platform for experimentation in big data and cloud contexts, we have to offer a deeply integrated monitoring framework as well.
This monitoring framework should not only look at the environment's resource usage, but should also monitor the behavior of the application.

For completeness, the format of the status response described in Step 3:

HTTP/1.1 200 OK
Date: Wed, 11 Mar 2015 08:22:03 GMT
Content-Type: application/xml; charset=utf-8
Connection: close
Transfer-Encoding: chunked

<?xml version="1.0" encoding="UTF-8"?>
<ten:tengu xmlns:ten="http://tengu.ibcn.ugent.be/0/1/core"
           xmlns:lnk="http://www.w3.org/1999/xhtml">
  <ten:platform>
    <ten:id>{uuid}</ten:id>
    <ten:status>{UNKNOWN|READY|FAILED}</ten:status>
    <lnk:link method="..." rel="..." href="..." /> *
  </ten:platform>
</ten:tengu>

Projects

DMS² – http://www.iminds.be/en/projects/2014/03/06/dms2
The DMS² (Decentralized Data Management and Migration for SaaS) project aims for the creation of a strategic and practical framework to deal with data management challenges for (potentially new and currently active) SaaS providers in the cloud.
The outcome includes:
• A reference model for requirements engineering and architectural trade-off analysis, specific to data management and migration in SaaS solutions. This is an essential element for customer acquisition projects.
• Middleware for data management, supporting interoperability, data protection measures, and tactics and solutions for federated data storage and data processing.
The project outcome is driven and validated by four industry case studies, from UP-nxt, Verizon Terremark, Agfa and Luciad, and results in a demonstrator.

AMiCA – http://www.amicaproject.be
The AMiCA ("Automatic Monitoring for Cyberspace Applications") project aims to mine relevant social media (blogs, chat rooms, and social networking sites) and to collect, analyse, and integrate large amounts of information using text and image analysis. The ultimate goal is to trace harmful content, contact, or conduct in an automatic way. Essentially, we take a cross-media mining approach that allows us to detect risks "on-the-fly". When critical situations are detected (e.g. a very violent communication), alerts can be issued to the moderators of the social networking sites. When used on aggregated data, the same technology can be used for incident collection and monitoring at the scale of individual social networking sites. In addition, the technology can provide accurate quantitative data to support providers, science, and government in decision-making processes with respect to child safety online.
• Period: 01/01/2013 - 31/12/2016
• Sponsor: IWT - Agentschap voor Innovatie door Wetenschap en Technologie (Agency for Innovation by Science and Technology)

SEQUOIA – http://www.iminds.be/en/projects/2015/03/11/sequoia
The SEQUOIA (Safe Query Applications for Cloud-Based SaaS Applications) project aims to create a security framework for advanced queries and reporting in SaaS environments. These solutions will be combined with intricate security rules at the application level.
As a result, SaaS providers will be able to further optimize their offerings and strengthen confidence in SaaS services.

iFest – http://www.iminds.be/en/projects/2015/03/11/ifest
The iFest project aims to develop a new generation of festival wristbands that ensure a richer festival experience based on built-in communication and sensor functions. iFest also focuses on a software platform that allows organizers to manage the wristbands and analyze the data obtained.

PROVIDENCE – http://www.iminds.be/en/projects/2014/06/28/providence
The PROVIDENCE (Predicting the Online Virality of Entertainment and News Content) research project aims to optimize online news publication strategies by anticipating the predicted viral nature of news on social media. Social networks are increasingly popular for distributing news content; it is a medium where the users themselves decide which topics become 'viral'. The main goal of the PROVIDENCE project is to optimize online news publication strategies by proactively making use of the predicted virality of news on social media platforms. PROVIDENCE will tackle the technological and research challenges of building a virality-driven production flow into a commercial online news environment. These research challenges encompass the large-scale monitoring, analysis and prediction of news consumption and news-sharing behavior by users and specific user segments.
Fed4FIRE – http://www.fed4fire.eu
Fed4FIRE is an Integrating Project under the European Union's Seventh Framework Programme (FP7), addressing the work programme topic Future Internet Research and Experimentation. It started in October 2012 and will run for 48 months, until the end of September 2016. Experimentally driven research is considered a key factor for growing the European Internet industry. In order to enable this type of RTD activities, a number of projects for building a European facility for Future Internet Research and Experimentation (FIRE) have been launched, each targeting a specific community within the Future Internet ecosystem. Through the federation of these infrastructures, innovative experiments become possible that break the boundaries of these domains. In addition, infrastructure developers can use the common tools of the federation, allowing them to focus on their core testbed activities. Recent projects have already successfully demonstrated the advantages of federation within a community. The Fed4FIRE project intends to implement the next step in these activities by federating across community borders and offering openness for future extensions.

IBCN's Tengu Team

Thomas Vanhove
Thomas obtained his master's degree in Computer Science from Ghent University, Belgium, in July 2012. In August 2012 he started his PhD at the IBCN (Intec Broadband Communication Networks) research group, researching data management solutions in cloud environments. Tengu originated in the first years of his research in this domain and has since become the main focus of his PhD.

Dr. Gregory Van Seghbroeck
Gregory Van Seghbroeck graduated from Ghent University in 2005. After a brief stop as an IT consultant, he joined the Department of Information Technology (INTEC) at Ghent University.
On the 1st of January 2007, he received a PhD grant from the IWT (Institute for the Support of Innovation through Science and Technology) to work on theoretical aspects of advanced validation mechanisms for distributed interaction protocols and service choreographies. In 2011 he received his Ph.D. in Computer Science Engineering. Since July 2012 he has been active as a post-doctoral researcher at Ghent University, where he has been involved in several national and European projects, including the FP7 project BonFIRE and the award-winning ITEA2 project SODA. His main research interests include complex distributed processes, cloud computing, service engineering, and service-oriented architectures. His work as author or co-author has been published in international journals and conference proceedings.

Dr. ir. Tim Wauters
Tim received his M.Sc. degree in electro-technical engineering in June 2001 from Ghent University, Belgium. In January 2007 he obtained the Ph.D. degree in electro-technical engineering at the same university. Since September 2001 he has been working in the Department of Information Technology (INTEC) at Ghent University, and he is now active as a post-doctoral fellow of the F.W.O.-V. His main research interests focus on network and service architectures and management solutions for scalable multimedia delivery services. His work has been published in about 50 scientific publications in international journals and in the proceedings of international conferences.

Dr. Bruno Volckaert
Bruno Volckaert graduated in 2001 from Ghent University and obtained his PhD, entitled "Architectures and Algorithms for network and service aware Grid resource management", in 2006.
Since then he has been responsible for over 20 research projects (ICON, EU FP6, ITEA, SBO). He was research lead of the TRACK and RAILS projects, both dealing with advances in distributed software for railway transportation, and is currently research lead of the Elastic Media Distribution project, which deals with cloud provisioning for professional media cooperation platforms. His main focus is on distributed systems, more specifically intelligent cloud resource provisioning methods and transportation.

Prof. Dr. ir. Filip De Turck
Filip received his M.Sc. degree in Electronic Engineering from Ghent University, Belgium, in June 1997. In May 2002 he obtained the Ph.D. degree in Electronic Engineering from the same university. During his Ph.D. research he was funded by the F.W.O.-V., the Fund for Scientific Research Flanders. From October 2002 until September 2008 he was a post-doctoral fellow of the F.W.O.-V. and a part-time professor, affiliated with the Department of Information Technology of Ghent University. At the moment he is a full-time professor affiliated with the Department of Information Technology of Ghent University and the IBBT (Interdisciplinary Institute of Broadband Technology Flanders) in the area of telecommunication and software engineering. Filip De Turck is author or co-author of approximately 250 papers published in international journals or in the proceedings of international conferences. His main research interests include scalable software architectures for telecommunication network and service management, performance evaluation, and the design of new telecommunication and eHealth services.