Home
Assefaw Gebremedhin, Papers on Parallel Computing
Title :
Algorithms for Balanced Graph Colorings with Applications in
Parallel Computing
Authors:
H.Lu, M. Halappanavar, D. Chavarri-a-Miranda, A.H. Gebremedhin, A. Panyala and
A. Kalyanaraman
Status: IEEE Transactions on Parallel and Distributed Systems 28(5), 1240--1256, 2017.
Abstract
Graph coloring--in a generic sense--is used to identify subsets of independent tasks in parallel scientific computing applications. Traditional coloring heuristics aim to reduce the number of colors used as that number also corresponds to the number of parallel steps in the application. However, if the color classes produced have a skew in their sizes, utilization of hardware resources becomes inefficient, especially for the smaller color classes. Equitable coloring is a theoretical formulation of coloring that guarantees a perfect balance among color classes, and its practical relaxation is referred to here as balanced coloring. In this paper, we consider balanced coloring models in the context of parallel computing. The goal is to achieve a balanced coloring of an input graph without increasing the number of colors that an algorithm oblivious to balance would have used. We propose and study multiple heuristics that aim to achieve such a balanced coloring for two variants of coloring problem, distance-1 coloring (the standard coloring problem) and partial distance-2 coloring (defined on a bipartite graph).
We present parallelization approaches for multi-core and manycore architectures and cross-evaluate their effectiveness with respect to the quality of balance achieved and performance. Furthermore, we study the impact of the proposed balanced coloring heuristics on a concrete application - viz. parallel community detection, which is an example of an irregular application.
In addition, we propose several extensions to our basic balancing schemes and evaluate their balancing efficacy and performance characteristics.
The thorough treatment of balanced coloring presented in this paper from algorithms to application
is expected to serve as a valuable resource to parallel application developers who seek to improve parallel performance of their applications using coloring.
Download paper in PDF
Title :
Parallelization of Bin Packing on Multicore Systems
Authors:
S. Ghosh and A.H. Gebremedhin
Status: 2016 IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'16).
Abstract
We study effective parallelization of approximation algorithms for the one-dimensional
bin packing problem on a multicore platform.
Bin packing is a classic combinatorial optimization problem that aims to pack a given sequence of items
into a minimum number of equal-sized bins.
The problem potentially serves as a model for a wide variety of applications.
Examples include:
packing data into chunks in a memory hierarchy in a given system to increase application performance,
loading vehicles subject to weight limitations, and
packing TV commercials into station breaks.
Bin packing has long served as a proving ground for
the analysis of approximation algorithms and played a crucial role in the development of much
of the theory of approximation algorithms.
Its parallelization, however, has received comparatively much less attention.
In this work, we develop multipleparallel versions of an effective approximation algorithm
(First Fit Decreasing) for the problem and investigate the trade-off between solution quality
and execution time. We use OpenMP and Cilk Plus as mechanisms for achieving the
parallelization. The new parallel algorithms obtain a
speedup of more than 10x (on 32 cores) for moderate to large input sequences
without sacrificing much on the quality of solution produced by the sequential algorithm --- in particular,
we see only about 3 to 30% increase in the number of bins compared to
the sequential version. In turn, the solution obtained by the sequential First Fit Decreasing algorithm
is provably almost optimal (the approximation ratio is less than 1.3).
Download paper in PDF
Title :
One-sided Interface for Matrix Operations using MPI-3 RMA: A Case Study with Elemental
Authors:
S. Ghosh, J.R. Hammond, A.J. Pena, P. Balaji, A.H. Gebremedhin and B. Chapman
Status: Proceedings of ICPP 2016.
Abstract
A one-sided programming model separates communication from synchronization,
and is the driving principle behind partitioned global address space (PGAS)
libraries such as Global Arrays (GA) and SHMEM.
PGAS models expose a rich set of functionality that a developer
needs in order to implement mathematical algorithms that require frequent
multidimensional array accesses.
However, use of existing PGAS libraries in application codes
often requires significant development effort in
order to fully exploit
these programming models. On the other hand, a vast majority of scientific codes
use MPI either directly or indirectly via third-party scientific computation libraries,
and need features to support application-specific communication requirements
(e.g., asynchronous update of distributed sparse
matrices, commonly arising in machine learning workloads).
For such codes
it is often impractical to completely shift programming models in favor
of special one-sided communication middleware. Instead, an elegant
and productive solution is to exploit the one-sided functionality already offered
by MPI-3 RMA (Remote Memory Access).
We designed a general one-sided interface using the MPI-3 passive RMA model for
remote matrix operations in the linear algebra library Elemental; we call
the interface we designed RMAInterface.
Elemental is an open source library for distributed-memory dense and sparse
linear algebra and optimization.
Download paper in PDF
Title :
Balanced Coloring for Parallel Computing Applications
Authors:
H. Lu, M. Halappanavar, D. Charri-a-Miranda, A.H. Gebremedhin and
A. Kalyanaraman
Status: Proceedings of IPDPS 2015.
Abstract
Graph coloring is used to identify subsets of independent tasks in parallel scientific computing applications. Traditional coloring heuristics aim to reduce the number of colors used as that number also corresponds to the number of parallel steps in the application. However, if the color classes produced have a skew in their sizes, utilization of hardware resources becomes inefficient, especially for the smaller color classes. Equitable coloring is a theoretical formulation of coloring that guarantees a perfect balance among color classes, and its practical relaxation is referred to as balanced coloring. In this paper, we revisit the problem of balanced coloring in the context of parallel computing. The goal is to achieve a balanced coloring of an input graph without increasing the number of colors that an algorithm oblivious to balance would have used. We propose and study multiple heuristics that aim to achieve such a balanced coloring, present parallelization approaches for multi-core and manycore architectures, and cross-evaluate their effectiveness with respect to the quality of balance achieved and performance. Furthermore, we study the impact of the proposed balanced coloring heuristics on a concrete application - viz. parallel community detection, which is an example of an irregular application.
The thorough treatment of balanced coloring presented in this paper from algorithms to application
is expected to serve as a valuable resource to parallel application developers who seek to improve parallel performance of their applications using coloring.
Download paper in PDF
Title:
Parallel Maximum Clique Algorithms with Applications to Network Analysis
Authors:
R.A. Rossi, D.F. Gleich, A.H. Gebremedhin
Status: SIAM Journal on Scientific Computing, Vol 37, Issue 5, pages C589-C618, 2015.
Abstract
We present a fast, parallel maximum clique algorithm for large sparse graphs that is designed to exploit characteristics of social and information networks. The method exhibits a roughly linear runtime scaling over real-world networks ranging from a thousand to a hundred million nodes. In a test on a social network with 1.8 billion edges, the algorithm finds the largest clique in about 20 minutes. At its heart the algorithm employs a branch-and-bound strategy with novel and aggressive pruning techniques.
The pruning techniques include the combined use of core numbers of vertices along with a good initial heuristic solution to remove the vast majority of the search space. In addition, the exploration of the search tree is parallelized. During the search, processes immediately communicate changes to upper and lower bounds on the size of maximum clique. This occasionally results in a super-linear speedup because tasks with large search spaces can be pruned by other processes. We demonstrate the impact of the algorithm on applications using two different network analysis problems: computation of temporal strong components in dynamic networks and determination of compress-friendly ordering of nodes of massive networks.
Download paper in PDF
Title:
Parallel Maximum Clique Algorithms with Applications to
Network Analysis and Storage
Authors:
R.A. Rossi, D.F. Gleich, A.H. Gebremedhin, M.M.A. Patwary
Status: Proceedings of WWW2014.
Abstract
We propose a fast, parallel maximum clique algorithm for large sparse graphs that is designed to exploit characteristics of social and information networks. Despite clique's status as an NP-hard problem with poor approximation guarantees, our method exhibits nearly linear runtime scaling over real-world networks ranging from 1000 to 100 million nodes. In a test on a social network with 1.8 billion edges, the algorithm finds the largest clique in about 20 minutes. Key to the efficiency of our algorithm are an initial heuristic procedure that finds a large clique quickly and a parallelized branch and bound strategy with aggressive pruning tnd ordering echniques.
We use the algorithm to compute the largest temporal strong components of temporal contact networks.
Download paper in PDF
Title:
Graph Coloring Algorithms for Multi-core and Massively Multithreaded Architectures
Authors:
U. Catalyurek, J. Feo, A.H. Gebremedhin, M. Halappanavar, A. Pothen
Status:
Parallel Computing 38 (2012), 576-594.
Abstract
We explore the interplay between architectures and
algorithm design in the context of shared-memory platforms
and a specific graph problem of
central importance in scientific and high-performance computing,
distance-1 graph coloring.
We introduce two different kinds of multithreaded heuristic algorithms
for the stated, NP-hard, problem.
The first algorithm relies on speculation and iteration,
and is suitable for any shared-memory system.
The second algorithm uses dataflow principles, and is targeted at
the non-conventional, massively multithreaded Cray XMT system.
We study the performance of the algorithms on three different
platforms---Cray XMT, Sun Niagara 2, and Intel Nehalem---representing
varying degrees of multithreading capabilities.
As testbed, we use synthetically generated massive graphs carefully
designed to cover a wide spectrum of input types.
The results show that the algorithms have scalable runtime performance
and use nearly the same number of colors as the underlying
serial algorithm, which in turn is effective in practice.
The study provides insight into the design of high performance algorithms
for irregular problems on many-core architectures.
Download paper in PDF
Title:
New Multithreaded Ordering and Coloring Algorithms for Multicore Architectures
Authors:
M.M.A. Patwary, A.H. Gebremedhin and A. Pothen
Status:
In E. Jeannot, R. Namyst and J. Roman, editors, Euro-Par 2011, volume 6853 of Lecture Notes in Computer Science,
pages 250 -- 262. Springer, 2011.
Abstract
We present new multithreaded vertex ordering and distance-k graph coloring
algorithms that are well-suited for the emerging and rapidly growing multicore platforms.
The vertex ordering techniques rely on various notions of ``degree", are known to be effective
in reducing the number of colors used by a greedy coloring algorithm, and are generic enough
to be applicable to contexts other than coloring.
We employ approximate degree computation in the ordering algorithms and
speculation and iteration in the coloring algorithms as our primary remedies
for breaking sequentiality and achieving effective parallelization.
The algorithms have been implemented using OpenMP, and experiments run on
Intel Nehalem and other multi-core machines using a set of carefully designed synthetic graphs
and real-world graphs attest that the algorithms provide
scalable runtime performance. The number of colors the algorithms use is nearly the same as
in the serial case, which in turn is often very close to optimal.
Download paper in PDF
Title:
Distributed-memory Parallel Algorithms for Matching and Coloring
Authors:
U. Catalyurek, F. Dobrian, A.H. Gebremedhin, M. Halappanavar,
A. Pothen
Status:
Proceedings of IEEE International Parallel and Distributed
Processing Symposium, Workshops and PhD Forums (IPDPSW),
Workshop on Parallel Computing and Optimization (PCO'11), pages 1966--1975, 2011
Abstract
We discuss the design and
implementation of new highly-scalable distributed-memory parallel algorithms
for two prototypical graph problems,
edge-weighted matching and distance-1 vertex coloring.
Graph algorithms in general have low concurrency, poor data locality, and high ratio of
data access to computation costs,
making it challenging to achieve scalability on massively parallel machines.
We overcome this challenge by employing a variety of techniques, including
speculation and iteration, optimized communication, and randomization.
We present preliminary results on weak and strong scalability studies
conducted on an IBM Blue Gene/P machine employing up to tens of
thousands of processors.
The results show that the algorithms hold strong potential for
computing at petascale.
Download paper in PDF
Title:
Distributed-memory Parallel Algorithms for Distance-2 Coloring and
Related Problems in Derivative Computation
Authors:
D. Bozdag, U. Catalyurek, A. Gebremedhin, F. Manne, E. Boman and F. Ozgunner
Status:
SIAM Journal on Scientific Computing, Vol 32, Issue 4, pp 2418--2446, 2010.
Abstract
The distance-2 graph coloring problem aims at partitioning the vertex
set of a graph into the fewest sets consisting of vertices pairwise at
distance greater than two from each other. Its applications include
derivative computation in numerical optimization and channel
assignment in radio networks. We present efficient,
distributed-memory, parallel heuristic algorithms for this NP-hard
problem as well as for two related problems used in the computation of
Jacobians and Hessians. Parallel speedup is achieved through graph
partitioning, speculative (iterative) coloring, and a BSP-like
organization of parallel computation. Results from experiments
conducted on a PC cluster employing up to 96 processors and using
large-size real-world as well as synthetically generated test graphs
show that the algorithms are scalable. In terms of quality of
solution, the algorithms perform remarkably well---the number of
colors used by the parallel algorithms was observed to be very close
to the number used by the sequential counterparts, which in
turn are quite often near optimal. Moreover, the experimental results
show that the parallel distance-2 coloring algorithm compares
favorably with the alternative approach of solving the distance-2
coloring problem on a graph $G$ by first constructing the square
graph $G^2$ and then applying a parallel distance-1 coloring
algorithm on $G^2$. Implementations of the algorithms are made
available via the Zoltan load-balancing library.
Download paper in PDF
Title:
A framework
for Scalable Greedy Coloring on Distributed Memory Parallel Computers
Authors:
D. Bozdag, A. Gebremedhin, F. Manne, E. Boman and U. Catalyurek
Status:
Journal of Parallel and Distributed Computing Vol 68, No 4, pp 515--535, 2008.
Abstract
We present a scalable framework for parallelizing greedy graph
coloring algorithms on distributed-memory computers. The framework
unifies several existing algorithms and blends a variety of techniques
for creating or facilitating concurrency. The latter techniques
include exploiting features of the initial data distribution,
the use of speculative coloring and randomization, and a BSP-style
organization of computation and communication. We experimentally
study the performance of several specialized algorithms designed using
the framework and implemented using MPI. The experiments are
conducted on two different platforms and the test cases include
large-size synthetic graphs as well as real graphs drawn from various
application areas. Computational results show that implementations
that yield good speedup while at the same time using about the same
number of colors as a sequential greedy algorithm can be achieved by
setting parameters of the framework in accordance with the size and
structure of the graph being colored. Our implementation is freely
available as part of the Zoltan parallel data management and
load-balancing library.
Download paper in PDF
Title:
A Parallel Distance-2 Graph Coloring Algorithm for Distributed
Memory Computers
Authors:
D. Bozdag, U. Catalyurek, A.H. Gebremedhin, F. Manne, E. G. Boman and F.
Ozguner
Status:
Lecture Notes in Computer Science, vol 3726,
2005, pages 796 - 806,Springer. Proc. of HPCC 2005, Sept 21 - 25, 2005,
Sorrento, Italy.
Abstract
The distance-2 graph coloring problem aims at partitioning the vertex
set of a graph into the fewest sets consisting of vertices pairwise at
distance greater than two from each other. Application examples
include numerical optimization and channel assignment. We present the
first distributed-memory heuristic algorithm for this NP-hard
problem. Parallel speedup is achieved through graph partitioning,
speculative (iterative) coloring, and a BSP-like organization of
computation. Experimental results show that the algorithm is scalable,
and compares favorably with an alternative approach---solving the
problem on a graph G by first constructing the square graph
G^2 and then applying a parallel distance-1 coloring
algorithm on G^2 .
Download paper in PDF
Title:
A Scalable Parallel Graph Coloring Algorithm for Distributed Memory
Computers
Authors:
E.G. Boman, D. Bozdag, U. Catalyurek, A.H. Gebremedhin and F. Manne
Status:
Lecture Notes in Computer Science, vol 3648 , 2005, pages 241 - 251,Springer.
Proc. of EuroPar 2005, 30 Aug - 2 Sept, 2005, Lisboa, Portugal.
Abstract
In large-scale parallel applications a
graph coloring is often carried out to schedule computational
tasks. In this paper, we describe a new distributed-memory algorithm
for doing the coloring itself in parallel. The algorithm operates in
an iterati ve fashion; in each round vertices are speculatively
colored based on limited information, and then a set of incorrectly
colored vertices, to be recolored in the next round, is identified.
Parallel speedup is achieved in part by reducing the frequency of
communication among processors. Experimental results on a PC cluster
using up to 16 processors show that the algorithm is scalable.
Download paper in PDF
Title:
Speeding up Parallel Graph Coloring
Authors:
A.H. Gebremedhin, F.Manne and T. Woods
Status:
Lecture Notes in Computer Science, vol 3732, pp 1079-1088, 2005,
Springer. Proc. of Para 2004, June 20 - 23, 2004, Lyngby, Denmark.
Abstract
This paper presents new efficient parallel
algorithms for finding approximate solutions to graph coloring
problems. We consider an existing shared memory parallel graph
coloring algorithm and suggest several enhancements both in terms of
ordering the vertices so as to minimize cache misses, and performing
vertex-to- processor assignments based on graph partitioning instead
of random allocation.
We report experimental results that demonstrate the performance of our
algorithms on an IBM Regatta supercomputer when up to 12 processors
are used. Our implementations use OpenMP for parallelization and
Metis for graph partitioning. The experiments show that we get up
to a 70 % reduction in runtime compared to the previous algorithm.
Download paper in PDF
Title
Graph Coloring on Coarse Grained Multicomputers
Authors:
A. Gebremedhin, I.Guerrin-Lassous, J. Gustedt and J.A. Telle
Status:
Discrete Applied Mathematics, Vol 131, No 1, pp 179--198, 2003
Abstract
We present an efficient and scalable Coarse Grained Multicomputer (CGM)
coloring algorithm that colors a graph G with at most
D + 1 colors where D is the maximum degree in G .
This algorithm is given in two variants: a randomized and
a deterministic .
We show that on a p-processor CGM model
the proposed algorithms require
a parallel time of
O(|G|/p) and a total work and overall
communication cost of O(|G|) .
These bounds correspond to the average
case for the randomized version and to the worst case for the
deterministic variant.
Download paper in PDF
Title:
Parallel Distance-k Coloring
Algorithms for Numerical Optimization
Authors:
A.H. Gebremedhin, F. Manne and A. Pothen
Status:
In B. Monien and R. Feldmann
(Eds.): EuroPar 2002, Lecture Notes in Computer Science 2400, pp. 912-921,
Springer-Verlag 2002.
Abstract
Matrix partitioning problems that arise in the efficient estimation of
sparse Jacobians and Hessians can be modeled using variants of graph
coloring problems. In a previous work, we argue that
distance-2 coloring and distance-3/2 coloring [we now call this
star coloring ]
are robust and flexible formulations of the respective matrix
estimation problems. The problem size in large-scale optimization
contexts makes the matrix estimation phase an expensive part of the
entire computation both in terms of execution time and memory
space. Hence, there is a need for both shared- and distributed-memory
parallel algorithms for the stated graph coloring problems. In the
current work, we present the first practical shared address space
parallel algorithms for these problems. The main idea in our
algorithms is to randomly partition the vertex set equally
among the available processors, let each processor
speculatively color its vertices using information about
already colored vertices, detect eventual conflicts in parallel, and
finally re-color conflicting vertices sequentially.
Randomization is also used in the coloring phases
to further reduce conflicts. Our PRAM-analysis shows that the
algorithms should give almost linear speedup for sparse graphs that
are large relative to the number of processors. Experimental results
from our OpenMP implementations on a Cray Origin2000 using various
large graphs show that the algorithms indeed yield reasonable speedup
for modest numbers of processors.
Download paper in PDF
Title:
Scalable Parallel Graph Coloring Algorithms
Authors:
A. Gebremedhin and F. Manne
Status:
Concurrency: Practice and Expereince,
Vol 12, pp1131--1146, 2000.
Abstract
Finding a good graph coloring quickly is
often a crucial phase in the development of efficient, parallel
algorithms for many scientific and engineering applications. In this
paper we consider the problem of solving the graph coloring problem
itself in parallel. We present a simple and fast parallel graph
coloring heuristic that is well suited for shared memory programming
and yields an almost linear speedup on the PRAM model. We also
present a second heuristic that improves on the number of colors used.
The heuristics have been implemented using OpenMP. Experiments
conducted on an SGI Cray Origin 2000 super computer using very large
graphs from finite element methods and eigenvalue computations
validate the theoretical run-time analysis.
Download paper in PDF
Title:
Graph Coloring on a Coarse Grained Multiprocessor
Authors:
A.H. Gebremedhin, I.G. Lassous, J. Gustedt and J.A. Telle
Status:
In Brandes, Ulrik, Wagner
and Dorothea (Eds.): WG 2000, Lecture Notes in Computer Science 1928, pp.
184-195, 2000, Springer-Verlag.
Abstract
We present the first efficient parallel coloring algorithm for
the Coarse Grain Multicomputer model. The algorithm uses at
most D+1 colors, where D is the maximum degree
in the graph.
Download paper in PDF
Title:
Parallel Graph Coloring Algorithms using
OpenMP
Authors:
A.H. Gebremedhin and F. Manne
Status:
In Proc. of EWOMP'99, First European Workshop on OpenMP,
Sept.30 - Oct. 1, 1999, Lund, Sweden.
Download Extended Abstract in PDF
Go to
Research Areas
Coloring and Automatic Differentiation
Parallel Algorithms
Parallel Computation Models