Abstract: With the development of information technology, the scale of data is increasing quickly. This massive data poses a great challenge for data processing and classification. Several algorithms have been proposed to cluster data efficiently; one of them is the random forest algorithm, which is used for feature subset selection. Feature selection involves identifying a subset of the most useful features that produces results compatible with the original entire set of features, and it is achieved by classifying the given data. Efficiency is measured by the time required to find a subset of features, while effectiveness relates to the quality of that subset. The existing system deals with a fast clustering-based feature selection algorithm, which has proven to be powerful, but when the size of the dataset increases rapidly, the current algorithm becomes less efficient because clustering the datasets takes considerably more time. Hence a new method of implementation is proposed in this project to cluster the data efficiently and persist it on the back-end database so as to reduce that time. It is achieved by a scalable random forest algorithm, implemented using MapReduce programming (an implementation of Big Data) to cluster the data efficiently. It works in two phases: the first step deals with gathering the datasets and persisting them in the datastore, and the second step deals with the clustering and classification of the data. This process is completely implemented using Google App Engine's Hadoop platform, a widely used open-source implementation of Google's distributed file system and MapReduce framework for scalable distributed computing or cloud computing. The MapReduce programming model provides an efficient framework for processing large datasets in an extremely parallel manner, and it has become the most popular parallel model for data processing on cloud computing platforms. Designing traditional machine learning algorithms within the MapReduce programming framework is therefore very necessary when dealing with massive datasets.
Keywords: Data mining, Hadoop, MapReduce, Clustering Tree.
Title: Big Data on Implementation of Many to Many Clustering
Author: Ravi. R, Michael. G
ISSN 2350-1022
International Journal of Recent Research in Mathematics Computer Science and Information Technology
Paper Publications
Data Partitioning in MongoDB with Cloud (IJAAS Team)
Cloud computing offers various useful services such as IaaS, PaaS and SaaS for deploying applications at low cost, making them available anytime and anywhere, with the expectation that they be scalable and consistent. One technique to improve scalability is data partitioning. The existing techniques in use are not capable of tracking data access patterns well. This paper implements a scalable workload-driven technique for improving the scalability of web applications. The experiments are carried out over the cloud using the NoSQL data store MongoDB to scale out. This approach offers low response time, high throughput and fewer distributed transactions. The partitioning technique is evaluated using the TPC-C benchmark.
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING (ijiert bestjournal)
Unstructured data poses challenges for storage. Experts estimate that 80 to 90 percent of the data in any organization is unstructured, and the amount of unstructured data in enterprises is growing significantly, often many times faster than structured databases are growing. Structured data exists in table format with a proper schema, whereas unstructured data is schema-less, which directly signifies the importance of the NoSQL storage model and the MapReduce platform. In the existing system, unstructured data is processed with a Cassandra dataset. In the present system, MongoDB is implemented alongside the Cassandra dataset, as MongoDB provides a flexible data model and a large number of options for querying unstructured data, whereas Cassandra models its data so as to minimize the total number of queries through more careful planning and denormalization. Cassandra offers basic secondary indexes, but for the best performance it is recommended to model the data so that they are used infrequently. So to process
Design of a lightweight set of data pipelines to scrub PII information.
Scrubbing PII information from data makes data easier to share.
It also helps organisations confidently push data outside the organisation for large-scale analytics in the cloud.
Big data is a popular term used to describe the large volume of data that includes structured, semi-structured and unstructured data. Nowadays, unstructured data is growing at an explosive speed with the development of the Internet and social networks like Twitter, Facebook and Yahoo. In order to process such a colossal amount of data, software is required that does this efficiently, and this is where Hadoop steps in. Hadoop has become one of the most used frameworks when dealing with big data; it is used to analyze and process big data. In this paper, Apache Flume is configured and integrated with Spark Streaming to stream data from the Twitter application. The streamed data is stored in Apache Cassandra. After retrieving the data, it is analyzed using Apache Zeppelin, and the result is displayed on a dashboard; the dashboard result is also analyzed and validated using JSON.
The design and implementation of modern column-oriented databases (Tilak Patidar)
An attempt to break down the paper on the design of column-oriented databases into simpler terms.
https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf
https://blog.acolyer.org/2018/09/26/the-design-and-implementation-of-modern-column-oriented-database-systems/
Data mining model for the data retrieval from central server configuration (IJCSIT)
A server that has to keep track of heavy document traffic is unable to filter the documents that are most relevant and up to date for continuous text search queries. This paper focuses on handling continuous text extraction while sustaining high document traffic. The main objective is to retrieve recently updated documents that are most relevant to the query by applying a sliding-window technique. Our solution indexes the streamed documents in main memory with a structure based on the principles of the inverted file, and processes document arrival and expiration events with an incremental threshold-based method. It also ensures the elimination of duplicate document retrieval using unsupervised duplicate detection. The documents are ranked based on user feedback and given higher priority for retrieval.
A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics (Flurry, Inc.)
We present Burst, an analytic query system with a scalable and flexible approach to performing low-latency ad hoc analysis over large, complex datasets. The architecture consists of hardware-efficient scan techniques and a language facility that transforms an extensible set of ad hoc declarative queries into imperative physical scan plans. These plans are multicast across all nodes/cores of a two-level sharded/distributed ingestion, storage and execution topology and executed. The first release of this system is the query engine behind the Flurry Explorer product. Here we explore the design details of that system as well as the incremental ingestion pipeline enhancement currently being implemented for the next major release.
International Journal of Engineering and Science Invention (IJESI) (inventionjournals)
The International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of engineering, science and technology, including new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Published papers are selected through double peer review to ensure originality, relevance and readability. The articles published in our journal can be accessed online.
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop (IJTET Journal)
Abstract— In data mining, association rule mining is one of the major techniques for discovering meaningful patterns from large collections of data. Discovering frequent item sets plays an important role in mining association rules, sequence rules, web logs and many other interesting patterns hidden in complex data. Frequent item set mining is one of the classical data mining problems found in most data mining applications. Apache Hadoop has been a major innovation in the IT marketplace over the last decade; from modest beginnings it has seen worldwide adoption in data centers, bringing parallel processing into the hands of the average programmer. This paper presents a literature analysis of different techniques for mining frequent item sets, including mining them on Hadoop.
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey (IJECEIAES)
In the modern era, workflows are adopted as a powerful and attractive paradigm for expressing and solving a variety of applications, such as scientific computing, data-intensive computing, and big data applications like MapReduce and Hadoop. These complex applications are described using high-level representations in workflow methods. With the emerging model of cloud computing, scheduling in the cloud has become an important research topic. Consequently, the workflow scheduling problem has been studied extensively over the past few years, from homogeneous clusters and grids to the most recent paradigm, cloud computing. The challenges that need to be addressed lie in task-resource mapping, QoS requirements, resource provisioning, performance fluctuation, failure handling, resource scheduling, and data storage. This work focuses on a complete study of the resource provisioning and scheduling algorithms in the cloud environment, focusing on Infrastructure as a Service (IaaS). We provide a comprehensive understanding of existing scheduling techniques and an insight into research challenges that are possible future directions for researchers.
Big data analytics: Technology's bleeding edge (Bhavya Gulati)
There can be data without information, but there cannot be information without data.
Companies without big data analytics are deaf and dumb, mere wanderers on the web.
In this paper we describe NoSQL, a series of non-relational database technologies and products developed to address the current problems RDBMS systems are facing: lack of true scalability, poor performance on high data volumes and low availability. Some of these products are already used in production and perform very well: Amazon's Dynamo, Google's Bigtable, Cassandra, etc. We also provide a view on how these systems influence application development in the social and semantic Web sphere.
Key aspects of big data storage and its architecture (Rahul Chaturvedi)
This paper helps in understanding the tools and technologies involved in a classic big data setting. Readers, especially enterprise architects, will find it helpful when choosing among big data database technologies in a Hadoop architecture.
One Size Doesn't Fit All: The New Database Revolution (Mark Madsen)
Slides from a webcast for the database revolution research report (report will be available at http://www.databaserevolution.com)
Choosing the right database has never been more challenging, or potentially rewarding. The options available now span a wide spectrum of architectures, each of which caters to a particular workload. The range of pricing is also vast, with a variety of free and low-cost solutions now challenging the long-standing titans of the industry. How can you determine the optimal solution for your particular workload and budget? Register for this Webcast to find out!
Robin Bloor, Ph.D. Chief Analyst of the Bloor Group, and Mark Madsen of Third Nature, Inc. will present the findings of their three-month research project focused on the evolution of database technology. They will offer practical advice for the best way to approach the evaluation, procurement and use of today’s database management systems. Bloor and Madsen will clarify market terminology and provide a buyer-focused, usage-oriented model of available technologies.
Webcast video and audio will be available on the report download site as well.
For the past several decades the rising tide of technology -- especially the increasing speed of single processors -- has allowed the same data analysis code to run faster and on bigger data sets. That happy era is ending. The size of data sets is increasing much more rapidly than the speed of single cores, of I/O, and of RAM. To deal with this, we need software that can use multiple cores, multiple hard drives, and multiple computers.
That is, we need scalable data analysis software. It needs to scale from small data sets to huge ones, from using one core and one hard drive on one computer to using many cores and many hard drives on many computers, and from using local hardware to using remote clouds.
R is the ideal platform for scalable data analysis software. It is easy to add new functionality in the R environment, and easy to integrate it into existing functionality. R is also powerful, flexible and forgiving.
I will discuss the approach to scalability we have taken at Revolution Analytics with our package RevoScaleR. A key part of this approach is to efficiently operate on "chunks" of data -- sets of rows of data for selected columns. I will discuss this approach from the point of view of:
- Storing data on disk
- Importing data from other sources
- Reading and writing of chunks of data
- Handling data in memory
- Using multiple cores on single computers
- Using multiple computers
- Automatically parallelizing "external memory" algorithms
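As a rough illustration of the chunking idea described above (this is not RevoScaleR itself, which is an R package, nor its API), here is a minimal Java sketch that computes the mean of one numeric column by streaming a file in fixed-size chunks; the file name, chunk size and one-value-per-line format are arbitrary assumptions for the example:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative only: computes the mean of one numeric column (one value per line)
// by streaming the file in fixed-size chunks, so the whole data set never has to
// fit in RAM. The per-chunk partial results (sum, count) could equally be produced
// on different cores or machines and combined afterwards.
public class ChunkedMean {
    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "values.csv"; // hypothetical input file
        int chunkSize = 100_000;                                 // rows per chunk

        double totalSum = 0.0;
        long totalCount = 0;

        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
            String line;
            double chunkSum = 0.0;
            long chunkCount = 0;
            while ((line = reader.readLine()) != null) {
                chunkSum += Double.parseDouble(line.trim());
                chunkCount++;
                if (chunkCount == chunkSize) {
                    // Fold this chunk's partial result into the total and start a new chunk.
                    totalSum += chunkSum;
                    totalCount += chunkCount;
                    chunkSum = 0.0;
                    chunkCount = 0;
                }
            }
            totalSum += chunkSum;      // remainder of the last, partially filled chunk
            totalCount += chunkCount;
        }
        System.out.println("mean = " + (totalSum / totalCount));
    }
}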
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try to reduce the work per iteration, and the other is to try to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps reduce duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be calculated easily. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
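As a rough illustration of one of the techniques mentioned above, here is a minimal, single-threaded Java sketch of power-iteration PageRank that skips recomputation for vertices whose rank has already converged; the toy graph, tolerance and damping factor are arbitrary, and the other optimizations (in-identical vertices, chains, per-SCC ordering, STICD itself) are not implemented:

import java.util.Arrays;

// Minimal pull-based PageRank that skips recomputation for vertices whose rank
// has stopped changing ("skip computation on vertices which have already
// converged"). The graph is given as in-edge lists plus out-degrees; dangling
// nodes and the other optimizations from the text are not handled in this sketch.
public class PageRankSkipConverged {
    public static double[] pageRank(int[][] inEdges, int[] outDeg,
                                    double damping, double tol, int maxIter) {
        int n = inEdges.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        boolean[] converged = new boolean[n];

        for (int iter = 0; iter < maxIter; iter++) {
            double[] next = new double[n];
            boolean allConverged = true;
            for (int v = 0; v < n; v++) {
                if (converged[v]) { next[v] = rank[v]; continue; } // work skipped: saves iteration time
                double s = 0.0;
                for (int u : inEdges[v]) s += rank[u] / outDeg[u];
                next[v] = (1.0 - damping) / n + damping * s;
                if (Math.abs(next[v] - rank[v]) < tol) converged[v] = true;
                else allConverged = false;
            }
            rank = next;
            if (allConverged) break;                               // fewer iterations overall
        }
        return rank;
    }

    public static void main(String[] args) {
        // Toy 3-node graph: 0->1, 1->2, 2->0, 2->1 (given as in-edges and out-degrees).
        int[][] in = { {2}, {0, 2}, {1} };
        int[] outDeg = { 1, 1, 2 };
        System.out.println(Arrays.toString(pageRank(in, outDeg, 0.85, 1e-9, 100)));
    }
}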
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Experimenting with Big Data
Thomas Vanhove, Gregory Van Seghbroeck, Tim Wauters, Bruno Volckaert, Filip De Turck
Big Data and its Problems
For us the big data domain can be divided into three main categories:
big data analysis,
big data management, and
querying big data.
This distinction is mainly a research convenience; big data applications always require a combination of these three topics.
Big data analysis can in its turn be subdivided into two tracks: analysis on big data sets and analysis on large streams of incoming data. The latter is often called stream processing or complex event processing. In stream processing there is often a real-time requirement, i.e. the result of processing the incoming data is needed immediately, whereas analysis on big data sets calls for a more thorough, offline treatment.
The main research track in big data management is data distribution, or data partitioning. Think of it this way: you have a set of servers available; where will you place the data? Data distribution is in most cases handled by the data store or data storage system you have selected. It has an impact on many features, e.g. robustness of the system, data redundancy, performance of reads and writes, data consistency, availability, etc. A wide variety of data stores and data storage systems already exist that excel in certain features; however, it is not possible to support all features simultaneously. Deciding in the huge jungle of possible data storage systems is not a trivial job. But help is on its way!
Querying as a research topic is also largely handled by the data stores themselves, and it has a lot of common ground with data management. We see at least three major ways of querying data in the big data context. First of all there are simple reads. This has everything to do with how and where the big data is stored, and it has many similarities to data management, especially with indexing, data partitioning and data redundancy. Another querying method involves range queries, a form of query where a set (sometimes in a specific order) of results is returned; how this set is constructed is the topic of this research. The third way of querying is full-text search, where you want to search large chunks of plain or structured text for occurrences of specific words or concepts.
For companies it is difficult to choose a specific strategy, especially in this relatively new and volatile domain. Would it not be interesting to be able to experiment with all these different technologies? What about your applications: are they big data proof? Do they scale? Can they handle increasing loads? How many and which resources are needed? What if I told you there is already a platform where you can do your big data experiments? Ready to use, without the hassle of integration and configuration.
Tengu: Big Data Applications Platform
The Tengu platform allows customers to experiment with many aspects of big data. If you want to simply try new big data stores (e.g. Cassandra or ElasticSearch), you can easily set up a Tengu environment with these components already configured. You can also try different types of big data analysis methodologies with Tengu; for example, a clean Tengu instance comes with three different types of big data analysis: stream processing, batch analysis and the Lambda architecture. If you want to experiment with your existing application and see, for instance, how well it performs in a big data context, Tengu can also be used for this purpose.
In what follows we will go deeper into all the technologies and software components that currently make up Tengu, with specific attention to these components' function in Tengu and how you, as an experimenter, can use them. The last subsection provides a small tutorial on how you can use Tengu to set up a big data environment.
Technologies
Big data analysis
Batch analysis – Apache Hadoop MapReduce
For batch analysis Tengu relies on Apache Hadoop MapReduce. MapReduce is a parallel programming concept first coined by Google. Due to the way MapReduce works, it targets a specific type of big data batch analysis job. MapReduce, as the name suggests, consists of two phases: a map phase and a reduce phase. What typically happens is that an extremely large data set is chopped into smaller, manageable parts (i.e. parts that can be handled by simple PCs). On these smaller parts a map function is executed; this is where the actual analysis happens. The map phase is ideally executed on as many nodes as there are smaller parts. The result of the map function should always be some kind of key-value set. These different key-value sets are then aggregated into one large key-value set in the reduce phase. It is important to point out a very significant shortcoming of the MapReduce programming concept: the analysis cannot have dependencies on the entire data set or on parts of it beyond the chunk at hand. Does this mean that you cannot have any dependencies between your data or the analysis of this data? Not at all, since it is possible to chain several MapReduce jobs to perform very complex analysis jobs.
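To make the map and reduce phases concrete, here is the canonical Hadoop word count in Java (a generic example, not a Tengu-specific job): the map phase emits (word, 1) pairs for each input split, and the reduce phase aggregates all pairs sharing the same key. Input and output paths are passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Canonical word count: the map phase turns each split of the input into
// (word, 1) pairs; the reduce phase aggregates all pairs with the same key.
public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);            // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum)); // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}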
Stream processing – Apache Storm
Tengu uses the Apache Storm project for its stream processing. This open source project, initially developed by Nathan Marz for Twitter, is capable of processing large streams of data. It is advertised as real-time processing, but this of course depends on the type of processing you want to do. The idea behind Storm is very similar to MapReduce, with the difference that in Storm we do not chop up the data, but the analysis job. By dividing a complex analysis job into different small, reusable parts, the processing can be heavily parallelized and distributed over the available worker nodes (a worker node is simply a server designated for a computational task). In this way it is possible to achieve a very high throughput, depending of course on the number of worker nodes. In Apache Storm lingo such a small processing part is called a bolt. These bolts are chained together in what is called a topology, which in its entirety performs the complex analysis job.
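A minimal Storm topology sketch in Java may help picture this (a generic example, not taken from Tengu): the data source feeding a topology is called a spout, and each bolt is a small, reusable processing step; setBolt's parallelism hint controls how many copies run across the worker nodes. Class, stream and field names are invented for the example, and the exact API details can vary between Storm versions.

import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

// A tiny topology: a spout emits sentences and a bolt splits them into words.
// Further bolts (e.g. a counter) could be chained in the same way.
public class TinyTopology {

    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values("the quick brown fox"));   // stand-in for a real stream
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    public static class SplitBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            for (String word : input.getStringByField("sentence").split(" ")) {
                collector.emit(new Values(word));                 // one tuple per word
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("split", new SplitBolt(), 4)              // 4 parallel copies of the bolt
               .shuffleGrouping("sentences");

        LocalCluster cluster = new LocalCluster();                // in-process cluster for testing
        cluster.submitTopology("tiny", new Config(), builder.createTopology());
    }
}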
Lambda Architecture
The Lambda architecture, also coined by Nathan Marz, combines the two previous approaches. He created this specific analysis approach because he saw the need for real-time analysis (similar to stream processing) that also includes information on historical data, which can potentially grow to extreme sizes. Without going into too much detail, this is what happens in the Lambda architecture: a batch analysis job runs over the historical data; in the case of Tengu, this batch layer is handled by Apache Hadoop MapReduce, and the result of this processing is stored in what Nathan Marz calls the Batch View. While this batch analysis job is running, the newly arriving data is processed in a sub-optimal way, sub-optimal because it has to be handled in real time; Tengu uses Apache Storm for this, so a stream processing analysis system. When the client queries the system for information, the Batch View and the results from the stream processing data analysis framework are aggregated, and thus return information that contains both recent info and information derived from the historical data. When an iteration of the batch analysis job is finished, the recently received data (the data that was processed by Apache Storm) is combined with the historical data. When this transition is finished, the batch analysis job starts again, now over more data. During this new iteration the system receives new incoming data, which is in turn handled by the stream processor, until the batch job is finished again, which initiates the move of all newly received data to the batch analysis job's source data, and so on. This is a continuous process, always providing a view on the historical data and a view on the real-time data. The Lambda architecture is thus a specific hybrid approach for big data analysis, leveraging the computing power of batch processing in a batch layer with the responsiveness of a real-time computing system in the stream processor (which is called the speed layer in the Lambda architecture).
Tengu provides all the necessary building blocks to set up such a Lambda architecture. Control of the batch analysis job and of the movement of the recently arrived data is handled by the Enterprise Service Bus controlling and managing Tengu.
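The query-time aggregation described above can be pictured with a small, deliberately simplified Java sketch (not Tengu code): the batch view holds counts computed over historical data, the real-time view holds counts for data that arrived since the batch run started, and a query merges both. The maps and key names are stand-ins for whatever the real views contain.

import java.util.HashMap;
import java.util.Map;

// Illustrates only the query-time step of the Lambda architecture: a result
// served to the client is the aggregation of the precomputed Batch View
// (historical data) and the real-time view maintained by the stream processor.
public class LambdaQuery {

    /** Per-key counts from the last completed batch run (e.g. written by MapReduce). */
    static Map<String, Long> batchView = new HashMap<>();

    /** Per-key counts for data that arrived after the batch run started (e.g. kept by Storm). */
    static Map<String, Long> realtimeView = new HashMap<>();

    /** A query merges both views, so the answer covers historical and recent data. */
    static long query(String key) {
        return batchView.getOrDefault(key, 0L) + realtimeView.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        batchView.put("page:/home", 1_000_000L);   // computed over historical data
        realtimeView.put("page:/home", 42L);       // seen since the batch run started
        System.out.println(query("page:/home"));   // 1000042
    }
}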
Data Stores
Tengu already supports several data stores out of the box, including the relational database MySQL. The other data stores are three NoSQL data stores with very distinct usage and features. Now, what does Tengu provide to you as an experimenter? It does all the deployment and configuration for you, so the data stores are ready to be used by your applications.
Cassandra
Cassandra is what they call a key-value based column store. This means that every value is uniquely identified by its key and that the value is a combination of different columns. Cassandra is a good starting point for people coming from the RDBMS world who want a taste of what NoSQL is about, because it still has a concept of tables. Next to this, Cassandra has a query language (called CQL) that is very similar to the well-known SQL language. Cassandra of course has many NoSQL features, e.g. decentralization (i.e. no single point of failure), data replication and fault tolerance (both to network errors and server downtime).
A big difference between Cassandra and regular RDBMS systems is that it does not have the concept of joins. So it is not possible to join multiple keys into new sets of columns, as you do with RDBMS tables; joining or other types of cross-references have to be handled by the application. This is a consequence of the data model chosen for Cassandra. However, this data model (the key-value based column store) has many advantages, e.g. large capacity and extremely fast writes and reads.
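A small sketch with the DataStax Java driver shows the flavour of working with Cassandra (keyspace, table and column names are invented, and driver details may differ between versions): data is written and read by key through CQL, and any join-like logic stays in the application.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

// Connects to a local Cassandra node and runs a few CQL statements. Note that
// the query goes by primary key; there are no joins, exactly as described above.
public class CassandraExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.users ("
                    + "user_id text PRIMARY KEY, name text, email text)");

            // Writes are fast; every value is addressed by its key.
            session.execute("INSERT INTO demo.users (user_id, name, email) "
                    + "VALUES ('u42', 'Ada', 'ada@example.org')");

            // Reads also go by key; cross-table joins must be done in the application.
            ResultSet rs = session.execute(
                    "SELECT name, email FROM demo.users WHERE user_id = 'u42'");
            Row row = rs.one();
            System.out.println(row.getString("name") + " <" + row.getString("email") + ">");
        }
    }
}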
MongoDB
MongoDB is a key-value based document store. You will typically store entire documents in a MongoDB data store. A document in MongoDB is a set of properties; think of it as a big JSON file. A nice feature of MongoDB is that it does not require a data schema to be known in advance: it learns the data format on the fly as you insert new documents. MongoDB also indexes your data as you insert it, making for lightning-fast reads. Creating new documents and updating existing documents in MongoDB is a bit more tedious (still very fast and highly scalable), because of the automatic data format recognition and the automatic indexing mechanism.
Although it is possible to query parts of a document, MongoDB is typically used to retrieve the entire document at once. A very important feature of MongoDB, and definitely one of its strengths, is its range query capabilities. A range query, for example, is when you need all documents created between two specific dates (ranges do not always have to be over dates; any property of the document can be used).
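A small sketch with the MongoDB Java driver illustrates the schema-less insert and a date range query of the kind described above (database, collection and field names are invented for the example):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.time.Instant;
import java.util.Date;

// Stores schema-less documents and retrieves them with a range query over a date field.
public class MongoExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> posts =
                    client.getDatabase("demo").getCollection("posts");

            // No schema needs to be declared up front; just insert a document.
            posts.insertOne(new Document("author", "ada")
                    .append("title", "Hello Tengu")
                    .append("created", Date.from(Instant.parse("2015-03-11T08:00:00Z"))));

            // Range query: all documents created between two dates.
            Date from = Date.from(Instant.parse("2015-03-01T00:00:00Z"));
            Date to   = Date.from(Instant.parse("2015-03-31T23:59:59Z"));
            for (Document d : posts.find(Filters.and(
                    Filters.gte("created", from), Filters.lte("created", to)))) {
                System.out.println(d.toJson());
            }
        }
    }
}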
ElasticSearch
ElasticSearch can also be considered a document store; however, it is much more. ElasticSearch's main advantage is its full-text search capabilities, for which it relies heavily on Apache Lucene. ElasticSearch is actually a feature-rich service layer on top of Apache's incredible indexing system, Lucene. It provides an easy-to-use search API and filtering API, with a lot of customization possibilities.
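A small sketch with the Elasticsearch high-level REST client for Java shows a Lucene-backed full-text query (index and field names are invented, and the client API differs between Elasticsearch versions):

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Full-text search against a local Elasticsearch node.
public class ElasticsearchExample {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Lucene-backed full-text query: all documents whose body matches the words.
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(QueryBuilders.matchQuery("body", "stream processing"));
            SearchRequest request = new SearchRequest("articles").source(source);

            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            for (SearchHit hit : response.getHits()) {
                System.out.println(hit.getSourceAsString());
            }
        }
    }
}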
Other data storage systems
Thanks to some central components that are an integral part of Tengu, there are some extra data storage systems that you can experiment with. A very important one is the Apache Hadoop Distributed File System (HDFS), the distributed file system that comes with Apache Hadoop MapReduce. HDFS can be used as a regular file system, but with all the features of the Hadoop system: high availability, scalability, redundancy, fault tolerance, being network-partition-proof, etc.
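A small sketch with the Hadoop FileSystem API shows HDFS being used like a regular file system (the NameNode address and the paths are placeholders for the example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

// Uses HDFS through the Hadoop FileSystem API: create a directory, write a file,
// and copy a local file into the cluster (e.g. as input for a MapReduce job).
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // hypothetical NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path dir = new Path("/user/experimenter/input");
            fs.mkdirs(dir);

            // Write a small file; HDFS transparently replicates its blocks across DataNodes.
            try (FSDataOutputStream out = fs.create(new Path(dir, "hello.txt"))) {
                out.write("hello tengu\n".getBytes(StandardCharsets.UTF_8));
            }

            // Copy a local file into the cluster.
            fs.copyFromLocalFile(new Path("/tmp/local-data.csv"),
                                 new Path(dir, "local-data.csv"));
        }
    }
}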
Another component that can be used to store data is Apache Kafka. Apache Kafka is actually a scalable distributed message broker, but it is also capable of persisting large amounts of data. In Tengu we use Apache Kafka as the message store for our Tengu Lambda architecture implementation. It integrates tightly with Apache Storm and Apache HDFS. There are generally two types of message brokers, queue-based and topic-based systems; Apache Kafka adopts a form of the topic-based message broker.
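A small Java sketch of a Kafka producer shows how messages are published to a topic, which can then feed both the stream processor and the batch layer in a Lambda-style setup (broker address and topic name are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Publishes a few messages to a Kafka topic.
public class KafkaProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // Messages with the same key end up in the same partition, preserving their order.
                producer.send(new ProducerRecord<>("incoming-events", "sensor-1", "reading " + i));
            }
        }
    }
}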
Resource management and configuration
Fed4FIRE
With Tengu it is possible to set up big data experimentation environments with an easy RESTful API. With a simple POST request you can create a new environment. What happens in the background is that this POST request is translated into specific calls to one of the Fed4FIRE (http://www.fed4fire.eu/) testbeds (with the API you can actually decide yourself on which testbed you want the Tengu experimentation environment to be deployed). Fed4FIRE is a large European Integration Project under the 7th Framework Program enabling experiments that combine and federate facilities from the different FIRE research communities. IBCN is one of the main driving forces in Fed4FIRE, not only by opening up IBCN's Virtual Wall to other Fed4FIRE partners, but also as the main developer of several federation and client tools used in Fed4FIRE. One of these tools is JFed (http://jfed.iminds.be/), which allows an experimenter to set up any type of server topology on the Fed4FIRE testbeds. It is actually this tool that is being used by Tengu to allocate and deploy the necessary resources.
Configuration – Chef
The JFed client tool only provides the servers to be used in the Tengu experiment environment. The necessary software components and the configuration of these components are handled by Chef (https://www.chef.io/chef/), a configuration management framework. Currently we provide a particular set of predefined cookbooks and recipes that are used in Tengu. Cookbooks and recipes contain the necessary information and dependencies to deploy, configure and integrate specific pieces of software. However, we also open up Chef to our Tengu experimenters, allowing them to deploy their own specific set of tools and software components as well. In the Chef Supermarket (https://supermarket.chef.io/cookbooks-directory) you can find lots of cookbooks for most of the commonly used tools. If you need tools that are not available in the Supermarket, you can always create the cookbooks and recipes yourself or, if this is too much hassle, you can still use the tools available in Ubuntu to deploy and install your software components manually. We, however, strongly advise using Chef, as it is very straightforward to move from the experimentation environment to your company's production environment.
Cloud virtualization – OpenStack
Tengu will set up a big data experimentation environment for you with a fixed set of servers. The size of some of the clusters (such as the Apache Hadoop cluster and the Storm cluster) can be defined by the experimenter, but the other servers are fixed. It is possible, especially when experimenting with existing applications, that this fixed set of servers is not sufficient. Tengu answers this need by also configuring an OpenStack private cloud, whose size can likewise be defined by the experimenter. In this OpenStack private cloud an experimenter can create many virtual machines that can be used to deploy the software components necessary for the experimenter's application. Deployment and configuration of these components can – and this is actually again the preferred way – be handled by Chef.
Platform Usage
In what follows we will give a small introduction on how you can use Tengu to set up your own big
data experimentation environment. This is achieved by calling Tengu’s RESTful API. For more advanced
usage of the RESTful API and more in-depth information about Tengu, we refer to the documentation
on the Tengu website.
Step 1: Prerequisites
The only thing you currently need to get started with Tengu is a valid Fed4FIRE account. Documentation concerning Fed4FIRE, and especially how to obtain such an account, can be found on the Fed4FIRE documentation site (http://doc.fed4fire.eu/getanaccount.html).
The RESTful API is a combination of GET and POST HTTP requests. For the GET requests it suffices to have a standard browser, but for the POST requests this is not enough. Some browsers have extensions to do RESTful requests, but we advise using a tool such as cURL (http://curl.haxx.se/) for the RESTful API requests. All examples provided here are shown using cURL.
Step 2: Deploy your first Tengu core setup
Setting up a Tengu big data environment is as easy as doing the following HTTP POST.

$ curl -k -i -X POST "http://[2001:6a8:1d80:23::141]:8280/tengu/core?hnodes=3&snodes=2&testbed=urn:publicid:IDN+wall2.ilabt.iminds.be+authority+cm"

This will set up a Tengu big data environment with an Apache Hadoop cluster of size 3 and an Apache Storm cluster of size 2. Let us break down the URL.
The Tengu RESTful API can be reached via the IPv6 address 2001:6a8:1d80:23::141 and port 8280. By using the path /tengu/core we tell the API to set up a Tengu core platform. The Tengu core platform is configured using several parameters, provided as key-value pairs in the query string:
The testbed we want this Tengu core setup to be deployed on is set by the testbed parameter. Its value is the Fed4FIRE uuid of the testbed (here urn:publicid:IDN+wall2.ilabt.iminds.be+authority+cm).
The size of the Apache Hadoop cluster is set via the hnodes parameter.
Similarly, the size of the Apache Storm cluster is set via snodes.
A response similar to the one below is returned. It includes the uuid (here urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0) that can be used to get information about the Tengu core setup.

HTTP/1.1 202 OK
Date: Wed, 11 Mar 2015 07:54:55 GMT
Content-Type: application/xml; charset=utf-8
Connection: close
Transfer-Encoding: chunked

<?xml version="1.0" encoding="UTF-8"?>
<ten:tengu xmlns:ten="http://tengu.ibcn.ugent.be/0/1/core"
           xmlns:lnk="http://www.w3.org/1999/xhtml">
  <ten:platform>
    <ten:id>urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0</ten:id>
    <lnk:link method="get" href="/tengu/urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0" />
  </ten:platform>
</ten:tengu>

Step 3: Get information about your deployed setup
Retrieving information (e.g. state information) about your deployed big data environment is done via the following HTTP GET.

$ curl -k -i "http://[2001:6a8:1d80:23::141]:8280/tengu/urn:uuid:74584f5b-cc26-46cd-8ab9-5b42f53e95c0"

Depending on the current state of the deployment, the response will also include links to the interfaces of important components. In the case of a Tengu core setup, these will be the Hadoop administration front end, the web UI of the HDFS NameNode, the web UI of the Storm cluster, the OpenStack Horizon UI and the web UI of the WSO2 ESB. The format of the response is as follows.

HTTP/1.1 200 OK
Date: Wed, 11 Mar 2015 08:22:03 GMT
Content-Type: application/xml; charset=utf-8
Connection: close
Transfer-Encoding: chunked

<?xml version="1.0" encoding="UTF-8"?>
<ten:tengu xmlns:ten="http://tengu.ibcn.ugent.be/0/1/core"
           xmlns:lnk="http://www.w3.org/1999/xhtml">
  <ten:platform>
    <ten:id>{uuid}</ten:id>
    <ten:status>{UNKNOWN|READY|FAILED}</ten:status>
    <lnk:link method="..." rel="..." href="..." /> *
  </ten:platform>
</ten:tengu>
What's next for Tengu?
The Tengu platform currently has a lot of building blocks that are already integrated and configured automatically. What Tengu does not yet do automatically is deploy an experimenter's application. Although it is possible to do everything by hand, we would like to help experimenters even more by making the process of deploying an application as generic, configurable and automated as possible. This requires an abstract view on what an application is, especially focusing on big data applications and cloud applications with all their different layers.
Somewhat connected to this is easing cloud and big data adoption by current applications. At the moment, if someone wants to experiment with how an existing application would react when using, for example, a NoSQL data store, the application has to be changed drastically. We want to make these changes to the application obsolete by automatically transforming the requests coming from the application into requests that can be interpreted by the new data store, but without changing the semantic meaning of the original request. This last part is extremely important. Current solutions with middleware abstraction layers are already capable of letting the application talk to different data stores in a generic, abstracted way, but it is not guaranteed that the behavior of the original requests is maintained.
Another research and development topic we are currently investigating is monitoring in a big data environment. This research is very challenging, not only because of the highly distributed nature of these big data environments, but also because of the heterogeneity of the different components involved in them. We believe that if we want to make Tengu the go-to platform for experimentation in big data and cloud contexts, we have to offer a deeply integrated monitoring framework as well. This monitoring framework should not only look at the environment's resource usage, but should also monitor the behavior of the application.
Projects Tengu
DMS² – http://www.iminds.be/en/projects/2014/03/06/dms2
The DMS² (Decentralized Data Management and Migration for SaaS) project aims for the creation of
a strategic and practical framework to deal with data management challenges for (potentially new and
currently active) SaaS providers in the cloud.
The outcome includes:
A reference model for requirements engineering and architectural trade-off analysis, specific for data management and migration in SaaS solutions. This is an essential element for customer acquisition projects.
Middleware for data management, supporting interoperability, data protection measures, tactics and solutions for federated data storage and data processing.
The project outcome is driven and validated by four industry case studies, from UP-nxt, Verizon Terremark, Agfa and Luciad, and results in a demonstrator.
AMiCA – http://www.amicaproject.be
The AMiCA (“Automatic Monitoring for Cyberspace Applications”) project aims to mine relevant social
media (blogs, chat rooms, and social networking sites) and collect, analyse, and integrate large
amounts of information using text and image analysis. The ultimate goal is to trace harmful content,
contact, or conduct in an automatic way. Essentially, we take a cross-media mining approach that
allows us to detect risks “on-the-fly”. When critical situations are detected (e.g. a very violent
communication), alerts can be issued to moderators of the social networking sites. When used on
aggregated data, the same technology can be used for incident collection and monitoring at the scale
of individual social networking sites. In addition, the technology can provide accurate quantitative
data to support providers, science, and government in decision-making processes with respect to child
safety online.
Period: 01/01/2013 - 31/12/2016
Sponsor: IWT - Agentschap voor Innovatie door Wetenschap en Technologie (Agency for
Innovation by Science and Technology)
SEQUOIA – http://www.iminds.be/en/projects/2015/03/11/sequoia
The SEQUOIA (Safe Query Applications for Cloud-Based SaaS Applications) project aims to create a
security framework for advanced queries and reporting in SaaS environments. These solutions will be
combined with intricate security rules at the application level. As a result, SaaS providers will be able
to further optimize their offerings and strengthen confidence in SaaS services.
iFest – http://www.iminds.be/en/projects/2015/03/11/ifest
The iFest project aims to develop a new generation of festival wristbands that ensure a richer festival
experience based on built-in communication and sensor functions. iFest is also focusing on a software
platform that allows organizers to manage the wristbands and analyze the data obtained.
PROVIDENCE – http://www.iminds.be/en/projects/2014/06/28/providence
The PROVIDENCE (Predicting the Online Virality of Entertainment and News Content) research project aims to optimize online news publication strategies by anticipating the predicted viral nature of news on social media.
Social networks are increasingly popular for distributing news content. It's a medium where the users themselves decide which topics become 'viral'. The main goal of the PROVIDENCE project is to optimize online news publication strategies by proactively making use of the predicted virality of news on social media platforms. Providence will tackle the technological and research challenges needed to build a virality-driven production flow into a commercial online news environment. These research challenges encompass the large-scale monitoring, analysis and prediction of news consumption and news sharing behavior by users and specific user segments.
Fed4FIRE – http://www.fed4fire.eu
Fed4FIRE is an Integrating Project under the European Union’s Seventh Framework Program (FP7)
addressing the work program topic Future Internet Research and Experimentation. It started in
October 2012 and will run for 48 months, until the end of September 2016.
Experimentally driven research is considered to be a key factor for growing the European Internet
industry. In order to enable this type of RTD activities, a number of projects for building a European
facility for Future Internet Research and Experimentation (FIRE) have been launched, each project
targeting a specific community within the Future Internet ecosystem. Through the federation of these
infrastructures, innovative experiments become possible that break the boundaries of these domains.
Besides, infrastructure developers can utilize common tools of the federation, allowing them to focus
on their core testbed activities.
Recent projects have already successfully demonstrated the advantages of federation within a
community. The Fed4FIRE project intends to implement the next step in these activities by successfully
federating across the community borders and offering openness for future extensions.
IBCN’s Tengu Team
Thomas Vanhove
Thomas obtained his master's degree in Computer Science from Ghent University, Belgium, in July 2012. In August 2012, he started his PhD at the IBCN (Intec Broadband Communication Networks) research group, researching data management solutions in cloud environments. Tengu originated in the first years of his research in this domain and has since become the main focus of his PhD.
Dr. Gregory Van Seghbroeck
Gregory Van Seghbroeck graduated at Ghent University in 2005. After a brief stop as an IT consultant,
he joined the Department of Information Technology (INTEC) at Ghent University. On the 1st of
January, 2007, he received a PhD grant from IWT, Institute for the Support of Innovation through
Science and Technology, to work on theoretical aspects of advanced validation mechanisms for
distributed interaction protocols and service choreographies. In 2011 he received his Ph.D. in
Computer Science Engineering. Since July 2012, he has been active as a post-doctoral researcher at
Ghent University, where he has been involved in several national and European projects, including the
FP7 project BonFIRE and the award-winning ITEA2 project SODA. His main research interests include
complex distributed processes, cloud computing, service engineering, and service oriented
architectures. As an author or co-author his work has been published in international journals and
conference proceedings.
Dr. ir. Tim Wauters
Tim received his M.Sc. degree in electro-technical engineering in June 2001 from Ghent University,
Belgium. In January 2007, he obtained the Ph.D. degree in electro-technical engineering at the same
university. Since September 2001, he has been working in the Department of Information Technology
(INTEC) at Ghent University, and is now active as a post-doctoral fellow of the F.W.O.-V. His main
research interests focus on network and service architectures and management solutions for scalable
multimedia delivery services. His work has been published in about 50 scientific publications in
international journals and in the proceedings of international conferences.
Dr. Bruno Volckaert
Bruno Volckaert graduated in 2001 from Ghent University and obtained his PhD, entitled "Architectures and Algorithms for network and service aware Grid resource management", in 2006. Since then he has been responsible for over 20 research projects (ICON, EU FP6, ITEA, SBO). He was research lead of the TRACK and RAILS projects, both dealing with advances in distributed software for railway transportation, and is currently research lead of the Elastic Media Distribution project, dealing with cloud provisioning for professional media cooperation platforms. His main focus is on distributed systems, more specifically intelligent cloud resource provisioning methods and transportation.
Prof. Dr. ir. Filip De Turck
Filip received his M.Sc. degree in Electronic Engineering from the Ghent University, Belgium, in June
1997. In May 2002, he obtained the Ph.D. degree in Electronic Engineering from the same university.
During his Ph.D. research he was funded by the F.W.O.-V., the Fund for Scientific Research Flanders.
From October 2002 until September 2008, he was a post-doctoral fellow of the F.W.O.-V. and part
time professor, affiliated with the Department of Information Technology of the Ghent University. At
the moment, he is a full-time professor affiliated with the Department of Information Technology of
the Ghent University and the IBBT (Interdisciplinary Institute of Broadband Technology Flanders) in
the area of telecommunication and software engineering. Filip De Turck is author or co-author of
approximately 250 papers published in international journals or in the proceedings of international
conferences. His main research interests include scalable software architectures for
telecommunication network and service management, performance evaluation and design of new
telecommunication and eHealth services.