This document summarizes a lecture on challenges and opportunities in parallel graph processing for big data. Graphs are ubiquitous, but processing large graphs at scale is difficult because of their huge and growing size, the complex correlations between data entities, and skewed degree distributions. Current computation models suffer from ghost vertices, excessive interaction between partitions, and poor support for iterative graph algorithms; new frameworks are needed that scale with a low memory footprint and balanced computation and communication.
1. Ling Liu, School of Computer Science, College of Computing
Part II: Distributed Graph Processing
2. Big Data Trends
Big Data: Volume, Velocity, Variety
- 1 zettabyte = a trillion gigabytes (10^21 bytes) (CISCO, 2012)
- 500 million Tweets per day
- 100 hours of video are uploaded every minute
3. Why Graphs?
Graphs are everywhere: social network graphs, road networks, national security, business analytics, biological graphs.
- Friendship Graph (Facebook Engineering, 2010)
- Brain Network (The Journal of Neuroscience, 2011)
- US Road Network (www.pavementinteractive.org)
- Web Security Graph (McAfee, 2013)
- Intelligence Data Model (NSA, 2013)
4. How Big?
Social scale (Twitter graph from the Gephi dataset, http://www.gephi.org)
- 1 billion vertices, 100 billion edges
- 111 PB adjacency matrix
- 2.92 TB adjacency list
- 2.92 TB edge list
Web scale (Internet graph from the Opte Project, http://www.opte.org/maps; web graph from the SNAP database, http://snap.stanford.edu/data)
- 50 billion vertices, 1 trillion edges
- 271 EB adjacency matrix
- 29.5 TB adjacency list
- 29.1 TB edge list
Brain scale (human connectome; Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011)
- 100 billion vertices, 100 trillion edges
- 1.1 ZB (2.08 mNA bytes, "molar bytes") adjacency matrix
- 2.84 PB adjacency list
- 2.84 PB edge list
Source: Paul Burkhardt and Chris Waring, An NSA Big Graph experiment (NSA-RD-2013-056001v1)
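The adjacency-matrix figures follow directly from one bit per vertex pair. A minimal sketch reproducing them (binary-prefix units assumed; the adjacency-list and edge-list sizes depend on id width and per-vertex overhead, so they are not recomputed here):

```python
# Back-of-the-envelope adjacency-matrix sizes for the three scales above,
# assuming a dense bit matrix: one bit per (u, v) vertex pair.
def adjacency_matrix_bytes(num_vertices: int) -> float:
    return num_vertices ** 2 / 8

for name, v in [("social", 10**9), ("web", 50 * 10**9), ("brain", 100 * 10**9)]:
    size = adjacency_matrix_bytes(v)
    print(f"{name}: {size / 2**50:,.0f} PiB ({size / 2**60:,.2f} EiB)")
# social: 111 PiB; web: ~271 EiB; brain: ~1,084 EiB (~1.1 ZiB) -- matching the slide
```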
5. Big Graph Data: Technical Challenges
Huge and growing size
- Requires massive storage capacities
- Graph analytics usually requires much bigger computing and storage resources
Complicated correlations among data entities (vertices)
- Make it hard to parallelize graph processing (hard to partition)
- Most existing big data systems are not designed to handle such complexity
Skewed distribution (i.e., high-degree vertices)
- Makes it hard to ensure load balancing
6. Parallel Graph Processing: Challenges
- Structure-driven computation: storage and data transfer issues
- Irregular graph structure and computation model: storage and data/computation partitioning issues; partitioning vs. load/resource balancing
8. Build New Graph Frameworks: Key Requirements/Challenges
- Less pre-processing
- Low and load-balanced computation
- Low and load-balanced communication
- Low memory footprint
- Scalable with respect to cluster size and graph size
- General graph processing framework for large collections of graph computation algorithms and applications
9. Graph Operations: Two Distinct Classes
Iterative graph algorithms
- Each execution consists of a set of iterations
- In each iteration, vertex (or edge) values are updated
- All (or most) vertices participate in the execution
- Examples: PageRank, shortest paths (SSSP), connected components
- Systems: Pregel, GraphLab, GraphChi, X-Stream, GraphX, Pregelix
Graph pattern queries
- Subgraph matching problem
- Requires fast query response time
- Explores a small fraction of the entire graph
- Examples: friends-of-friends, triangle patterns
- Systems: RDF-3X, TripleBit, SHAPE (VLDB 2014, IEEE SC 2015)
11. What Are Iterative Graph Algorithms?
- Each execution consists of a set of iterations
- In each iteration, vertex (or edge) values are updated
- All (or most) vertices participate in the operations
- Examples: PageRank, shortest paths (SSSP), connected components
- Systems: Google's Pregel, GraphLab, GraphChi, X-Stream, GraphX, Pregelix
[Figures: SSSP, Connected Components; source: AMPLab]
12. Why Is Iterative Graph Processing So Difficult?
Huge and growing size of graph data
- Makes it hard to store and handle the data on a single machine
Poor locality (many random accesses)
- Each vertex depends on its neighboring vertices, recursively
Huge size of intermediate data for each iteration
- Requires additional computing and storage resources
Heterogeneous graph algorithms
- Different algorithms have different computation and access patterns
High-degree vertices
- Make it hard to ensure load balancing
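The "poor locality" point is easiest to see in code. Below is a minimal vertex-centric PageRank step in Python (the graph representation and damping factor are illustrative assumptions, not from the slides): every update reads an arbitrary set of neighbor values, so at scale these accesses scatter across memory or disk, and each iteration also materializes a full copy of the rank vector as intermediate data.

```python
# Minimal vertex-centric PageRank step (illustrative sketch).
# 'graph' maps each vertex to its out-neighbors; 'rank' is the mutable state.
def pagerank_step(graph, rank, damping=0.85):
    n = len(graph)
    incoming = {v: 0.0 for v in graph}
    for u, neighbors in graph.items():
        share = rank[u] / len(neighbors) if neighbors else 0.0
        for v in neighbors:       # neighbor ids are arbitrary, so these
            incoming[v] += share  # writes jump all over memory/disk
    return {v: (1 - damping) / n + damping * incoming[v] for v in graph}

graph = {1: [2], 2: [1, 3, 4], 3: [2, 4], 4: [2, 3]}
rank = {v: 1 / len(graph) for v in graph}
for _ in range(10):                        # ten iterations, ten full passes
    rank = pagerank_step(graph, rank)
```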
14. The Problems of Current Computation Models
- "Ghost" vertices maintain adjacency structure and replicate remote data
- Too much interaction among partitions
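A sketch of the ghost-vertex pattern the slide criticizes (class and method names are illustrative assumptions): each partition caches read-only copies of remote neighbors, and those copies must be re-synchronized every iteration, which is exactly the cross-partition interaction that grows with the number of cut edges.

```python
# Illustrative ghost-vertex bookkeeping in one partition (hypothetical names).
class Partition:
    def __init__(self, local_edges, local_values):
        self.edges = local_edges    # adjacency of locally owned vertices
        self.values = local_values  # mutable state of owned vertices
        self.ghosts = {}            # cached values of remote neighbors

    def remote_neighbors(self):
        owned = self.values.keys()
        return {v for nbrs in self.edges.values() for v in nbrs if v not in owned}

    def refresh_ghosts(self, fetch):
        # 'fetch' stands in for messages to the owning partitions; with many
        # cut edges this synchronization dominates each iteration
        self.ghosts = {v: fetch(v) for v in self.remote_neighbors()}

p = Partition(local_edges={1: [2, 7]}, local_values={1: 0.5, 2: 0.3})
p.refresh_ghosts(fetch=lambda v: 0.0)  # vertex 7 is remote, so it becomes a ghost
```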
16. Why Don't We Use MapReduce?
Of course, we can use MapReduce! The first iteration of Connected Components for this graph would be:

Map (each vertex sends its current label across every incident edge):
(2,1); (1,2), (3,2), (4,2); (2,4), (3,4); (2,3), (4,3)

Reduce (each vertex keeps the minimum of its own label and the received values):

K | Values  | Min
1 | 2       | 1
2 | 1, 3, 4 | 1
3 | 2, 4    | 2
4 | 2, 3    | 2
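A runnable sketch of this map/reduce round in Python (the grouping helper is hypothetical; the slide's graph has edges 1-2, 2-3, 2-4, 3-4):

```python
from collections import defaultdict

# One MapReduce-style round of min-label connected components (sketch).
edges = [(1, 2), (2, 3), (2, 4), (3, 4)]
label = {v: v for v in {u for e in edges for u in e}}  # initial label = own id

def map_phase(edges, label):
    for u, v in edges:        # each vertex sends its label across the edge
        yield v, label[u]
        yield u, label[v]

def reduce_phase(pairs, label):
    grouped = defaultdict(list)            # the shuffle: group values by key
    for k, val in pairs:
        grouped[k].append(val)
    return {k: min([label[k]] + vals) for k, vals in grouped.items()}

label = reduce_phase(map_phase(edges, label), label)
print(label)  # {1: 1, 2: 1, 3: 2, 4: 2} -- the slide's first-iteration result
```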
17. Why We Shouldn't Use MapReduce
But... in a typical MapReduce job, disk I/Os are performed in four places (figure source: http://arasan-blog.blogspot.com/). So 10 iterations mean disk I/Os in 40 places.
18. Related Work
Distributed memory-based systems
- Messaging-based: Google Pregel, Apache Giraph, Apache Hama
- Vertex mirroring: GraphLab, PowerGraph, GraphX
- Dynamic load balancing: Mizan, GPS
- Graph-centric view: Giraph++
Disk-based systems using a single machine
- Vertex-centric model: GraphChi
- Edge-centric model: X-Stream
- Vertex-edge centric: GraphLego
With external memory
- Out-of-core capabilities (Apache Giraph, Apache Hama, GraphX)
- Not optimized for graph computations
- Users need to configure several parameters
19. Two Research Directions: Iterative Graph Processing Systems
Disk-based systems on a single machine
- Load a part of the input graph in memory
- Include a set of data structures and techniques to efficiently load graph data from disk
- GraphChi, X-Stream, ...
- Disadvantages: 1) relatively slow, 2) resource limitations of a single machine
Distributed memory-based systems on a cluster
- Load the whole input graph in memory
- Load all intermediate results and messages in memory
- Pregel, Giraph, Hama, GraphLab, GraphX, ...
- Disadvantages: 1) very high memory requirement, 2) coordination of distributed machines
20. Main Features
Develop GraphMap
- Distributed iterative graph computation framework that effectively utilizes secondary storage
- Goal: reduce the memory requirement of iterative graph computations while ensuring competitive (or better) performance
Main contributions
- Clear separation between mutable and read-only data
- Two-level partitioning technique for locality-optimized data placement
- Dynamic access methods based on the workloads of the current iteration
21. Clear Data Separation
Graph data
- Vertices and their data (mutable)
- Edges and their data (read-only)
Read edge data for each iteration!
22
Locality-‐Based
Data
Placement
on
Disk
Edge
Access
Locality
! All
edges
(out-‐edges,
in-‐edges
or
bi-‐edges)
of
a
vertex
are
accessed
together
to
update
its
vertex
value
è
We
place
all
connected
edges
of
a
vertex
together
on
disk
Vertex
Access
Locality
! All
ver2ces
in
a
par22on
are
accessed
by
the
same
worker
(processor)
in
every
itera2on
è
We
store
all
ver2ces,
in
a
par22on,
and
their
edges
into
con2guous
disk
blocks
to
u2lize
sequenEal
disk
accesses
How
can
you
access
disk
efficiently
for
each
iteraEon?
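Since the GraphMap prototype stores graph data in HBase (slide 24), one way to get both locality properties is a composite row key sorted by partition first, then vertex, so a worker's whole partition sits in contiguous blocks. This layout is an assumption for illustration, not the paper's exact schema:

```python
# Locality-preserving key layout (an assumed sketch, not GraphMap's schema):
# sorting rows by (partition, vertex) keeps each partition's vertices and
# their edge lists in contiguous disk blocks, enabling sequential scans.
def row_key(partition_id: int, vertex_id: int) -> bytes:
    # fixed-width big-endian encoding so byte order matches numeric order
    return partition_id.to_bytes(4, "big") + vertex_id.to_bytes(8, "big")

rows = {row_key(p, v): nbrs
        for (p, v), nbrs in {(0, 1): [2], (0, 2): [1, 3], (1, 3): [2]}.items()}
for key in sorted(rows):        # partition 0's vertices come out together
    print(key.hex(), rows[key])
```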
23. Dynamic Access Methods
Various workloads: is the current workload larger than the threshold?
- YES → sequential disk accesses
- NO → random disk accesses
The threshold is dynamically configured based on actual access times for each iteration and for each worker.
[Figure: number of active vertices (x1000) per iteration for PageRank, CC and SSSP]
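A sketch of that choice (the break-even rule is an illustrative assumption consistent with the slide, not the paper's exact formula): scan the whole partition sequentially when many vertices are active, otherwise seek to just the active ones.

```python
# Illustrative dynamic access-method selection.
def choose_access_method(num_active: int, threshold: float) -> str:
    return "sequential" if num_active > threshold else "random"

def update_threshold(seq_time_per_vertex, rand_time_per_vertex, partition_size):
    # break-even point: random seeks for the active set cost as much as one
    # full sequential scan; times are measured per iteration and per worker
    return partition_size * seq_time_per_vertex / rand_time_per_vertex

threshold = update_threshold(0.001, 0.01, 50_000)   # -> 5,000 active vertices
print(choose_access_method(12_000, threshold))      # "sequential"
```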
24. Experiments
First prototype of GraphMap
- BSP engine & messaging engine: utilize Apache Hama
- Disk storage: utilize Apache HBase (a two-dimensional key-value store)
Settings
- Cluster of 21 machines on Emulab
- 12GB RAM, Xeon E5530, 500GB and 250GB SATA disks
- Connected via a 1 GigE network
- HBase (ver. 0.96) on HDFS of Hadoop (ver. 1.0.4)
- Hama (ver. 0.6.3)
Iterative graph algorithms
- 1) PageRank (10 iterations), 2) SSSP, 3) CC
25
ExecuEon
Time
Analysis
! Hama
fails
for
large
graphs
with
more
than
900M
edges
while
GraphMap
s2ll
works
! Note
that,
in
all
the
cases,
GraphMap
is
faster
(up
to
6
2mes)
than
Hama,
which
is
the
in-‐memory
system
26. 26
26
Breakdown
of
GraphMap
ExecuEon
Time
PageRank on uk-2005 SSSP on uk-2005
CC on uk-2005
Analysis
! For
PageRank,
all
itera2ons
have
similar
results
except
the
first
and
last
! For
SSSP,
itera2on
5
–
15
u2lize
sequen2al
disk
accesses
based
on
our
dynamic
selec2on
! For
CC,
random
disk
accesses
are
selected
from
itera2on
24
27. 27
27
Effects
of
Dynamic
Access
Methods
Analysis
! GraphMap
chooses
the
op2mal
access
method
in
most
of
the
itera2ons
! Possible
further
improvement
through
fine-‐tuning
in
itera2ons
5
and
15
! For
cit-‐Patents,
GraphMap
always
chooses
random
accesses
because
only
3.3%
ver2ces
are
reachable
from
the
start
vertex
and
thus
the
number
of
ac2ve
ver2ces
is
always
small
0
5
10
15
20
25
30
35
0 5 10 15 20 25 30 35 40
ComputationTime(sec)
Iteration
Sequential
Random
Dynamic
28. Comparing GraphMap with Other Systems
- Existing distributed graph systems are all in-memory systems. In addition to Hama, we give a relative comparison with a few other representative systems.
- GraphMap: 12GB DRAM per node on a cluster of 21 nodes, i.e., 252GB of distributed shared memory; the compared systems use up to 5x DRAM per node
29. Social: LiveJournal (LJ) Graph Dataset
- Vertices: members; edges: friendship
Graph datasets (stored in HDFS)
- cit-Patents (raw size: 268MB): 3.8M vertices, 16.5M edges
- soc-LiveJournal1 (raw size: 1.1GB): 4.8M vertices, 69M edges
30. Our Initial Experience with Spark / GraphX
Cluster setting
- 6 machines (1 master & 5 slaves)
Spark setting
- Spark shell (i.e., did NOT implement any Spark application yet)
- Built-in PageRank function of GraphX
- All 40 cores (= 8 cores x 5 slaves)
- Portion of memory for RDD storage: 0.52 (by default)
- If we assign 512MB for each executor, about 265MB is dedicated for RDD storage
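A quick check of the slide's memory arithmetic (the 0.52 fraction is the deck's number; the exact accounting of storage memory differs across Spark versions):

```python
# RDD storage memory per executor, using the fraction quoted on the slide.
executor_memory_mb = 512
rdd_storage_fraction = 0.52
print(executor_memory_mb * rdd_storage_fraction)  # ~266 MB, the slide's "about 265MB"
```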
32. Spark/GraphX Experience and Messaging Cost
Our initial experience with Spark
- Spark performs well with large per-node memory (>= 68GB DRAM), as reported in the Spark/GraphX paper
- It does not perform well for clusters whose nodes have smaller DRAM
Messaging overheads
- Distributed graph processing systems do not scale as the number of nodes increases, due to the messaging cost among compute nodes in the cluster to synchronize the computation in each iteration round
33. Summary
GraphMap
- Distributed iterative graph computation framework that effectively utilizes secondary storage
- Clear separation between mutable and read-only data
- Locality-based data placement on disk
- Dynamic access methods based on the workloads of the current iteration
Ongoing research
- Disk and worker colocation to improve the disk access performance
- Efficient and lightweight partitioning techniques, incorporating our work on GraphLego for single-PC graph processing [ACM HPDC 2015]
- Comparing with Spark/GraphX on a larger-DRAM cluster
34. General Purpose Distributed Graph System
Existing state of the art
- Separate efforts for the two representative graph operations
- Separate efforts for the scale-up and scale-out systems
Challenges for developing a general purpose graph processing system
- Different data access patterns / graph computation models
- Different inter-node communication effects
Possible directions
- Graph summarization techniques
- Lightweight graph partitioning techniques
- Optimized data storage systems and access methods