Indian Institute of Technology, Patna
Large Scale Graph Processing: Neo4j Vs Apache Giraph Vs Hadoop-MapReduce
(Survey Report)
Nishant M Gandhi
M.Tech. CSE
IIT Patna
Contents
1. Introduction
2. Graph Processing Platforms
   a. Hadoop-MapReduce
   b. Giraph
   c. Neo4j
3. Analysis of Platforms
   a. Hadoop-MapReduce
   b. Giraph
   c. Neo4j
4. Conclusion
5. References
1. Introduction

Today we are living in the era of big data. Everything from social media to scientific experiments, from computers to mobile devices, generates huge amounts of data every day. Storing and processing this data is itself a challenge nowadays. There are many real-life problems that can be solved with the use of this data, and many of them can be mapped to graph problems.

Many solutions have been created to process large-scale data. One of the most popular is [4] Hadoop with its [2] MapReduce programming platform. The lack of a programming model dedicated to graphs was addressed by Google with [3] Pregel, which uses the Bulk Synchronous Parallel model for graph processing. The open-source version of Pregel is [1] Giraph. Another platform is [6] Neo4j, which is a graph database. In this document, we will try to understand these three platforms and their pros and cons.
2. Graph Processing Platforms
There are many platforms available for large-scale graph processing. However, we consider only these three because each of them uses a very different programming model to process graphs.

[6] Neo4j: desktop platform, NoSQL graph database
[2] Hadoop-MapReduce: cluster platform, generic large-scale data processing platform
[1] Giraph: cluster platform, specialized large-scale graph processing platform
a. Hadoop-MapReduce
[2]Hadoop is an open-source platform for storing and computing huge amounts of data, and it has been widely used in many data-analytics applications. It uses the MapReduce programming model, which is inspired by the map and reduce functions of functional programming. The model processes input data as key/value pairs: a map function transforms input pairs into intermediate pairs, and a reduce function aggregates all intermediate values that share the same key.
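The two functions can be illustrated with the canonical word-count example. The sketch below is a single-machine illustration, not the Hadoop API: the `run_mapreduce` helper is a hypothetical stand-in for the framework's parallel shuffle-and-sort.

```python
from collections import defaultdict

def map_fn(_, line):
    # Map: emit an intermediate (key, value) pair for every word in the line.
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    # Reduce: aggregate all intermediate values that share the same key.
    yield key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle: group intermediate pairs by key before the reduce phase.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return dict(kv for k, vs in sorted(groups.items())
                for kv in reduce_fn(k, vs))

counts = run_mapreduce([(0, "big data big graphs"), (1, "big clusters")],
                       map_fn, reduce_fn)
print(counts)  # {'big': 3, 'clusters': 1, 'data': 1, 'graphs': 1}
```

In the real framework, many mapper and reducer instances execute these same two functions in parallel across a cluster, with the shuffle phase handled by Hadoop itself.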
Data used by [4] Hadoop is stored in the Hadoop Distributed File System (HDFS). HDFS is a component distinct from the MapReduce engine, but the platform relies on it and will not work without it. Datasets stored in HDFS are divided into blocks of similar size, and each block is used as the input for a mapper.
[8]Hadoop's programming model has low performance and high resource consumption for iterative graph algorithms, because such algorithms require multiple map-reduce cycles. For example, for iterative graph-traversal algorithms Hadoop often needs to store and load the entire graph structure during each iteration, to transfer data between the map and reduce processes through the disk-intensive HDFS, and to run a convergence-checking iteration as an additional job.
b. Giraph
[1]Giraph is an open-source, graph-specific distributed platform. Giraph uses the Pregel programming model, a vertex-centric programming abstraction that adapts the Bulk Synchronous Parallel (BSP) model. A BSP computation proceeds in a series of global supersteps. Within each superstep, active vertices execute the same user-defined compute function and create and deliver inter-vertex messages. Barriers ensure synchronization between vertex computations. The computation terminates once there are no messages left to process and all vertices have voted to halt.
[8]Giraph builds on the design of Hadoop, from which it leverages only the map phase. The single biggest difference between Hadoop and Giraph is that Giraph keeps the graph in memory, which speeds up job execution. For fault tolerance, Giraph uses periodic checkpoints, and to coordinate superstep execution it uses [5]ZooKeeper.
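The superstep loop described above can be sketched in a few lines of Python. This is a toy, single-process illustration of the BSP model using maximum-value propagation (a standard Pregel example), not Giraph's actual API; all names are illustrative.

```python
def max_value_bsp(neighbours, values):
    # Superstep 0: every vertex messages its current value to its neighbours.
    inbox = {v: [] for v in values}
    for v in values:
        for n in neighbours[v]:
            inbox[n].append(values[v])
    active = set(values)
    while active:
        # Compute phase: a vertex that receives a larger value updates its
        # state and messages its neighbours; otherwise it votes to halt.
        new_inbox = {v: [] for v in values}
        next_active = set()
        for v in values:
            msgs = inbox[v]
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                for n in neighbours[v]:
                    new_inbox[n].append(values[v])
                next_active.add(v)
        # Barrier: all messages are delivered before the next superstep.
        inbox, active = new_inbox, next_active
    return values

print(max_value_bsp({1: [2], 2: [1, 3], 3: [2]}, {1: 5, 2: 1, 3: 9}))
# {1: 9, 2: 9, 3: 9}
```

In Giraph the same pattern is expressed as a user-defined compute function executed per vertex, with ZooKeeper coordinating the barriers between supersteps across workers.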
c. Neo4j
Neo4j is a popular open-source NoSQL graph database implemented in Java. It stores data as graphs rather than tables: every stored graph consists of vertices and relationships annotated with properties. Neo4j can execute graph-processing algorithms efficiently on a single machine because of optimization techniques that favor response time. [8]Neo4j uses a two-level, main-memory caching mechanism to improve its performance. The file buffer caches the storage-file data in the same format as it is stored on the durable storage media, while the object buffer caches vertices and relationships in a format optimized for high traversal speeds and transactional writes.
Neo4j processes graphs by traversing vertices using either the BFS or DFS
traversal algorithm. To start a graph traversal, a program has to define a special reference
vertex. This vertex is not part of the original graph; it is an additional artificial vertex
which is added to the graph structure and acts as the starting point of the traversal. All
graph operations are performed as ACID transactions.
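The reference-vertex traversal pattern described above can be sketched in plain Python (this is not Neo4j's actual Traversal API; the function name and the in-memory adjacency representation are illustrative assumptions):

```python
# BFS from an artificial reference vertex, mirroring the traversal pattern
# described for Neo4j (plain-Python sketch, not Neo4j's Traversal API).
from collections import deque

def traverse_bfs(adjacency, start_vertices):
    REF = object()                       # artificial reference vertex, not in the graph
    adj = dict(adjacency)
    adj[REF] = list(start_vertices)      # reference vertex links to the starting points
    order, seen = [], {REF}
    queue = deque([REF])
    while queue:
        v = queue.popleft()
        if v is not REF:
            order.append(v)              # the reference vertex itself is not reported
        for n in adj.get(v, []):
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return order

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(traverse_bfs(graph, ["a"]))        # ['a', 'b', 'c', 'd']
```

In the real database each step of such a traversal would run inside an ACID transaction and benefit from the object cache when the touched vertices are already hot; the sketch only shows the traversal order.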
3. Analysis of Platforms
The performance of these platforms has been analyzed several times; this section draws
on two sources and their results. The first is the 2013 M.S. thesis of Marcin Biczak
from Delft University of Technology [7]. The second is the report titled
[8] “How well do graph-processing platforms perform?”
The important findings from these two sources are listed below.
[8]Overall, the performance of all platforms is stable, with the largest variance around 10%.
a. Hadoop-MapReduce
i. [7]Hadoop-MapReduce performs worse than the other platforms on every
graph algorithm.
ii. [7]Multi-iteration algorithms suffer from additional performance
penalties.
b. Giraph
i. [8]Giraph processes graphs in memory and implements a dynamic
computation mechanism by which only selected vertices are processed in
each iteration of an algorithm, which reduces computation time.
ii. [7]For large volumes of messages or big datasets, Giraph can crash due
to lack of memory.
c. Neo4j
i. [8]Limited by the resources of a single machine, the performance of Neo4j
becomes significantly worse when the graph exceeds the memory
capacity.
ii. [7]Neo4j was designed as a single-machine database. To scale across
multiple machines, users of Neo4j have to implement the communication
between those machines themselves, as well as manage partitioning,
consistency, etc. This requires a significant amount of additional work
beside the application implementation.
iii. [7]The two-level cache allows Neo4j to achieve excellent hot-cache execution
times, especially when the graph data accessed by the algorithm fits in the cache.
iv. [8]The data-ingestion time of Neo4j closely matches the characteristics of
the graph. Overall, data ingestion takes much longer for Neo4j than for
HDFS.
4. Conclusion
Based on this survey, we can reach the following conclusions.
Modern computers can handle most smaller or sparser graph datasets. However,
once the dataset grows significantly or the graph is dense, execution time
increases sharply. For this reason, single-machine graph-processing platforms
cannot compete with distributed systems.
We have considered two graph-processing frameworks (Giraph, Neo4j) and a generic
data-processing platform (Hadoop). Platforms that focus on processing graph
datasets achieve significant performance advantages over the generic platform in most
cases. However, Hadoop does not maintain the relations between data items and treats
every vertex as disjoint, something the other platforms must do and pay a performance
penalty for. Thus, for certain datasets, Hadoop can achieve better performance than the
graph-processing platforms.
There are two significant factors for large-scale graph-processing platforms: the
programming model and the platform design. Pregel performs much better than
MapReduce for iterative algorithms. Giraph has the limitation that it computes in
memory, a limitation that does not apply to Hadoop-MapReduce. [7]Neo4j achieves
good performance for smaller or sparser datasets on a single system; it also has very good
documentation and is hence easy to learn. [7]Hadoop-MapReduce is considered the slowest
of all evaluated platforms, but Neo4j’s performance on large or dense
datasets is even lower than that of Hadoop-MapReduce. [7]Giraph, which
represents the distributed large-scale graph-processing platforms, was the fastest platform
in all the experiments made by Marcin Biczak.
5. References
1. Apache Software Foundation, “Giraph.” http://giraph.apache.org
2. J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,”
Comm. ACM, vol. 51, no. 1,2008, pp. 107–112.
3. G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and
G. Czajkowski, “Pregel: A System for Large-Scale Graph Processing,” in Proceedings of
the 28th ACM Symposium on Principles of Distributed Computing (PODC ’09), pp. 6–6,
New York, NY, USA, 2009. ACM.
4. Apache Software Foundation, “Hadoop.” Website, 2011. http://hadoop.apache.org
5. Apache Software Foundation, “ZooKeeper.” Website, 2010. http://zookeeper.apache.org
6. Neo Technology, “Neo4j.” http://www.neo4j.org
7. M. Biczak, “LudoGraph: A Sampling Capable Cloud-Based System for Large-Scale Graph
Processing Based on the Pregel Programming Model,” Master of Science Thesis, Delft
University of Technology, 2013.
8. Y. Guo, M. Biczak, A. Varbanescu, A. Iosup, C. Martella, and T. Willke, “How well do
graph-processing platforms perform? An empirical performance evaluation and analysis:
Extended report,” tech. rep., Delft University of Technology, 2013.