Home

Assefaw Gebremedhin, Papers on Parallel Computing



Title : Algorithms for Balanced Graph Colorings with Applications in Parallel Computing
Authors: H.Lu, M. Halappanavar, D. Chavarri-a-Miranda, A.H. Gebremedhin, A. Panyala and A. Kalyanaraman
Status: IEEE Transactions on Parallel and Distributed Systems 28(5), 1240--1256, 2017.

Abstract

Graph coloring--in a generic sense--is used to identify subsets of independent tasks in parallel scientific computing applications. Traditional coloring heuristics aim to reduce the number of colors used as that number also corresponds to the number of parallel steps in the application. However, if the color classes produced have a skew in their sizes, utilization of hardware resources becomes inefficient, especially for the smaller color classes. Equitable coloring is a theoretical formulation of coloring that guarantees a perfect balance among color classes, and its practical relaxation is referred to here as balanced coloring. In this paper, we consider balanced coloring models in the context of parallel computing. The goal is to achieve a balanced coloring of an input graph without increasing the number of colors that an algorithm oblivious to balance would have used. We propose and study multiple heuristics that aim to achieve such a balanced coloring for two variants of coloring problem, distance-1 coloring (the standard coloring problem) and partial distance-2 coloring (defined on a bipartite graph). We present parallelization approaches for multi-core and manycore architectures and cross-evaluate their effectiveness with respect to the quality of balance achieved and performance. Furthermore, we study the impact of the proposed balanced coloring heuristics on a concrete application - viz. parallel community detection, which is an example of an irregular application. In addition, we propose several extensions to our basic balancing schemes and evaluate their balancing efficacy and performance characteristics. The thorough treatment of balanced coloring presented in this paper from algorithms to application is expected to serve as a valuable resource to parallel application developers who seek to improve parallel performance of their applications using coloring.

Download paper in PDF

Title : Parallelization of Bin Packing on Multicore Systems
Authors: S. Ghosh and A.H. Gebremedhin
Status: 2016 IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'16).

Abstract

We study effective parallelization of approximation algorithms for the one-dimensional bin packing problem on a multicore platform. Bin packing is a classic combinatorial optimization problem that aims to pack a given sequence of items into a minimum number of equal-sized bins. The problem potentially serves as a model for a wide variety of applications. Examples include: packing data into chunks in a memory hierarchy in a given system to increase application performance, loading vehicles subject to weight limitations, and packing TV commercials into station breaks. Bin packing has long served as a proving ground for the analysis of approximation algorithms and played a crucial role in the development of much of the theory of approximation algorithms. Its parallelization, however, has received comparatively much less attention. In this work, we develop multipleparallel versions of an effective approximation algorithm (First Fit Decreasing) for the problem and investigate the trade-off between solution quality and execution time. We use OpenMP and Cilk Plus as mechanisms for achieving the parallelization. The new parallel algorithms obtain a speedup of more than 10x (on 32 cores) for moderate to large input sequences without sacrificing much on the quality of solution produced by the sequential algorithm --- in particular, we see only about 3 to 30% increase in the number of bins compared to the sequential version. In turn, the solution obtained by the sequential First Fit Decreasing algorithm is provably almost optimal (the approximation ratio is less than 1.3).

Download paper in PDF

Title : One-sided Interface for Matrix Operations using MPI-3 RMA: A Case Study with Elemental
Authors: S. Ghosh, J.R. Hammond, A.J. Pena, P. Balaji, A.H. Gebremedhin and B. Chapman
Status: Proceedings of ICPP 2016.

Abstract

A one-sided programming model separates communication from synchronization, and is the driving principle behind partitioned global address space (PGAS) libraries such as Global Arrays (GA) and SHMEM. PGAS models expose a rich set of functionality that a developer needs in order to implement mathematical algorithms that require frequent multidimensional array accesses. However, use of existing PGAS libraries in application codes often requires significant development effort in order to fully exploit these programming models. On the other hand, a vast majority of scientific codes use MPI either directly or indirectly via third-party scientific computation libraries, and need features to support application-specific communication requirements (e.g., asynchronous update of distributed sparse matrices, commonly arising in machine learning workloads). For such codes it is often impractical to completely shift programming models in favor of special one-sided communication middleware. Instead, an elegant and productive solution is to exploit the one-sided functionality already offered by MPI-3 RMA (Remote Memory Access). We designed a general one-sided interface using the MPI-3 passive RMA model for remote matrix operations in the linear algebra library Elemental; we call the interface we designed RMAInterface. Elemental is an open source library for distributed-memory dense and sparse linear algebra and optimization.

Download paper in PDF

Title : Balanced Coloring for Parallel Computing Applications
Authors: H. Lu, M. Halappanavar, D. Charri-a-Miranda, A.H. Gebremedhin and A. Kalyanaraman
Status: Proceedings of IPDPS 2015.

Abstract

Graph coloring is used to identify subsets of independent tasks in parallel scientific computing applications. Traditional coloring heuristics aim to reduce the number of colors used as that number also corresponds to the number of parallel steps in the application. However, if the color classes produced have a skew in their sizes, utilization of hardware resources becomes inefficient, especially for the smaller color classes. Equitable coloring is a theoretical formulation of coloring that guarantees a perfect balance among color classes, and its practical relaxation is referred to as balanced coloring. In this paper, we revisit the problem of balanced coloring in the context of parallel computing. The goal is to achieve a balanced coloring of an input graph without increasing the number of colors that an algorithm oblivious to balance would have used. We propose and study multiple heuristics that aim to achieve such a balanced coloring, present parallelization approaches for multi-core and manycore architectures, and cross-evaluate their effectiveness with respect to the quality of balance achieved and performance. Furthermore, we study the impact of the proposed balanced coloring heuristics on a concrete application - viz. parallel community detection, which is an example of an irregular application. The thorough treatment of balanced coloring presented in this paper from algorithms to application is expected to serve as a valuable resource to parallel application developers who seek to improve parallel performance of their applications using coloring.

Download paper in PDF

Title: Parallel Maximum Clique Algorithms with Applications to Network Analysis
Authors: R.A. Rossi, D.F. Gleich, A.H. Gebremedhin
Status: SIAM Journal on Scientific Computing, Vol 37, Issue 5, pages C589-C618, 2015.

Abstract

We present a fast, parallel maximum clique algorithm for large sparse graphs that is designed to exploit characteristics of social and information networks. The method exhibits a roughly linear runtime scaling over real-world networks ranging from a thousand to a hundred million nodes. In a test on a social network with 1.8 billion edges, the algorithm finds the largest clique in about 20 minutes. At its heart the algorithm employs a branch-and-bound strategy with novel and aggressive pruning techniques. The pruning techniques include the combined use of core numbers of vertices along with a good initial heuristic solution to remove the vast majority of the search space. In addition, the exploration of the search tree is parallelized. During the search, processes immediately communicate changes to upper and lower bounds on the size of maximum clique. This occasionally results in a super-linear speedup because tasks with large search spaces can be pruned by other processes. We demonstrate the impact of the algorithm on applications using two different network analysis problems: computation of temporal strong components in dynamic networks and determination of compress-friendly ordering of nodes of massive networks.

Download paper in PDF

Title: Parallel Maximum Clique Algorithms with Applications to Network Analysis and Storage
Authors: R.A. Rossi, D.F. Gleich, A.H. Gebremedhin, M.M.A. Patwary
Status: Proceedings of WWW2014.

Abstract

We propose a fast, parallel maximum clique algorithm for large sparse graphs that is designed to exploit characteristics of social and information networks. Despite clique's status as an NP-hard problem with poor approximation guarantees, our method exhibits nearly linear runtime scaling over real-world networks ranging from 1000 to 100 million nodes. In a test on a social network with 1.8 billion edges, the algorithm finds the largest clique in about 20 minutes. Key to the efficiency of our algorithm are an initial heuristic procedure that finds a large clique quickly and a parallelized branch and bound strategy with aggressive pruning tnd ordering echniques. We use the algorithm to compute the largest temporal strong components of temporal contact networks.

Download paper in PDF

Title: Graph Coloring Algorithms for Multi-core and Massively Multithreaded Architectures
Authors: U. Catalyurek, J. Feo, A.H. Gebremedhin, M. Halappanavar, A. Pothen
Status: Parallel Computing 38 (2012), 576-594.

Abstract

We explore the interplay between architectures and algorithm design in the context of shared-memory platforms and a specific graph problem of central importance in scientific and high-performance computing, distance-1 graph coloring. We introduce two different kinds of multithreaded heuristic algorithms for the stated, NP-hard, problem. The first algorithm relies on speculation and iteration, and is suitable for any shared-memory system. The second algorithm uses dataflow principles, and is targeted at the non-conventional, massively multithreaded Cray XMT system. We study the performance of the algorithms on three different platforms---Cray XMT, Sun Niagara 2, and Intel Nehalem---representing varying degrees of multithreading capabilities. As testbed, we use synthetically generated massive graphs carefully designed to cover a wide spectrum of input types. The results show that the algorithms have scalable runtime performance and use nearly the same number of colors as the underlying serial algorithm, which in turn is effective in practice. The study provides insight into the design of high performance algorithms for irregular problems on many-core architectures.

Download paper in PDF

Title: New Multithreaded Ordering and Coloring Algorithms for Multicore Architectures
Authors: M.M.A. Patwary, A.H. Gebremedhin and A. Pothen
Status: In E. Jeannot, R. Namyst and J. Roman, editors, Euro-Par 2011, volume 6853 of Lecture Notes in Computer Science, pages 250 -- 262. Springer, 2011.

Abstract

We present new multithreaded vertex ordering and distance-k graph coloring algorithms that are well-suited for the emerging and rapidly growing multicore platforms. The vertex ordering techniques rely on various notions of ``degree", are known to be effective in reducing the number of colors used by a greedy coloring algorithm, and are generic enough to be applicable to contexts other than coloring. We employ approximate degree computation in the ordering algorithms and speculation and iteration in the coloring algorithms as our primary remedies for breaking sequentiality and achieving effective parallelization. The algorithms have been implemented using OpenMP, and experiments run on Intel Nehalem and other multi-core machines using a set of carefully designed synthetic graphs and real-world graphs attest that the algorithms provide scalable runtime performance. The number of colors the algorithms use is nearly the same as in the serial case, which in turn is often very close to optimal.

Download paper in PDF

Title: Distributed-memory Parallel Algorithms for Matching and Coloring
Authors: U. Catalyurek, F. Dobrian, A.H. Gebremedhin, M. Halappanavar, A. Pothen
Status: Proceedings of IEEE International Parallel and Distributed Processing Symposium, Workshops and PhD Forums (IPDPSW), Workshop on Parallel Computing and Optimization (PCO'11), pages 1966--1975, 2011

Abstract

We discuss the design and implementation of new highly-scalable distributed-memory parallel algorithms for two prototypical graph problems, edge-weighted matching and distance-1 vertex coloring. Graph algorithms in general have low concurrency, poor data locality, and high ratio of data access to computation costs, making it challenging to achieve scalability on massively parallel machines. We overcome this challenge by employing a variety of techniques, including speculation and iteration, optimized communication, and randomization. We present preliminary results on weak and strong scalability studies conducted on an IBM Blue Gene/P machine employing up to tens of thousands of processors. The results show that the algorithms hold strong potential for computing at petascale.

Download paper in PDF

Title: Distributed-memory Parallel Algorithms for Distance-2 Coloring and Related Problems in Derivative Computation
Authors: D. Bozdag, U. Catalyurek, A. Gebremedhin, F. Manne, E. Boman and F. Ozgunner
Status: SIAM Journal on Scientific Computing, Vol 32, Issue 4, pp 2418--2446, 2010.

Abstract

The distance-2 graph coloring problem aims at partitioning the vertex set of a graph into the fewest sets consisting of vertices pairwise at distance greater than two from each other. Its applications include derivative computation in numerical optimization and channel assignment in radio networks. We present efficient, distributed-memory, parallel heuristic algorithms for this NP-hard problem as well as for two related problems used in the computation of Jacobians and Hessians. Parallel speedup is achieved through graph partitioning, speculative (iterative) coloring, and a BSP-like organization of parallel computation. Results from experiments conducted on a PC cluster employing up to 96 processors and using large-size real-world as well as synthetically generated test graphs show that the algorithms are scalable. In terms of quality of solution, the algorithms perform remarkably well---the number of colors used by the parallel algorithms was observed to be very close to the number used by the sequential counterparts, which in turn are quite often near optimal. Moreover, the experimental results show that the parallel distance-2 coloring algorithm compares favorably with the alternative approach of solving the distance-2 coloring problem on a graph $G$ by first constructing the square graph $G^2$ and then applying a parallel distance-1 coloring algorithm on $G^2$. Implementations of the algorithms are made available via the Zoltan load-balancing library.

Download paper in PDF

Title: A framework for Scalable Greedy Coloring on Distributed Memory Parallel Computers
Authors: D. Bozdag, A. Gebremedhin, F. Manne, E. Boman and U. Catalyurek
Status: Journal of Parallel and Distributed Computing Vol 68, No 4, pp 515--535, 2008.

Abstract

We present a scalable framework for parallelizing greedy graph coloring algorithms on distributed-memory computers. The framework unifies several existing algorithms and blends a variety of techniques for creating or facilitating concurrency. The latter techniques include exploiting features of the initial data distribution, the use of speculative coloring and randomization, and a BSP-style organization of computation and communication. We experimentally study the performance of several specialized algorithms designed using the framework and implemented using MPI. The experiments are conducted on two different platforms and the test cases include large-size synthetic graphs as well as real graphs drawn from various application areas. Computational results show that implementations that yield good speedup while at the same time using about the same number of colors as a sequential greedy algorithm can be achieved by setting parameters of the framework in accordance with the size and structure of the graph being colored. Our implementation is freely available as part of the Zoltan parallel data management and load-balancing library.

Download paper in PDF

Title: A Parallel Distance-2 Graph Coloring Algorithm for Distributed Memory Computers
Authors: D. Bozdag, U. Catalyurek, A.H. Gebremedhin, F. Manne, E. G. Boman and F. Ozguner
Status: Lecture Notes in Computer Science, vol 3726, 2005, pages 796 - 806,Springer. Proc. of HPCC 2005, Sept 21 - 25, 2005, Sorrento, Italy.

Abstract

The distance-2 graph coloring problem aims at partitioning the vertex set of a graph into the fewest sets consisting of vertices pairwise at distance greater than two from each other. Application examples include numerical optimization and channel assignment. We present the first distributed-memory heuristic algorithm for this NP-hard problem. Parallel speedup is achieved through graph partitioning, speculative (iterative) coloring, and a BSP-like organization of computation. Experimental results show that the algorithm is scalable, and compares favorably with an alternative approach---solving the problem on a graph G by first constructing the square graph G^2 and then applying a parallel distance-1 coloring algorithm on G^2 .

Download paper in PDF

Title: A Scalable Parallel Graph Coloring Algorithm for Distributed Memory Computers
Authors: E.G. Boman, D. Bozdag, U. Catalyurek, A.H. Gebremedhin and F. Manne
Status: Lecture Notes in Computer Science, vol 3648 , 2005, pages 241 - 251,Springer. Proc. of EuroPar 2005, 30 Aug - 2 Sept, 2005, Lisboa, Portugal.

Abstract

In large-scale parallel applications a graph coloring is often carried out to schedule computational tasks. In this paper, we describe a new distributed-memory algorithm for doing the coloring itself in parallel. The algorithm operates in an iterati ve fashion; in each round vertices are speculatively colored based on limited information, and then a set of incorrectly colored vertices, to be recolored in the next round, is identified. Parallel speedup is achieved in part by reducing the frequency of communication among processors. Experimental results on a PC cluster using up to 16 processors show that the algorithm is scalable.

Download paper in PDF

Title: Speeding up Parallel Graph Coloring
Authors: A.H. Gebremedhin, F.Manne and T. Woods
Status: Lecture Notes in Computer Science, vol 3732, pp 1079-1088, 2005, Springer. Proc. of Para 2004, June 20 - 23, 2004, Lyngby, Denmark.

Abstract

This paper presents new efficient parallel algorithms for finding approximate solutions to graph coloring problems. We consider an existing shared memory parallel graph coloring algorithm and suggest several enhancements both in terms of ordering the vertices so as to minimize cache misses, and performing vertex-to- processor assignments based on graph partitioning instead of random allocation. We report experimental results that demonstrate the performance of our algorithms on an IBM Regatta supercomputer when up to 12 processors are used. Our implementations use OpenMP for parallelization and Metis for graph partitioning. The experiments show that we get up to a 70 % reduction in runtime compared to the previous algorithm.

Download paper in PDF

Title Graph Coloring on Coarse Grained Multicomputers
Authors: A. Gebremedhin, I.Guerrin-Lassous, J. Gustedt and J.A. Telle
Status: Discrete Applied Mathematics, Vol 131, No 1, pp 179--198, 2003

Abstract

We present an efficient and scalable Coarse Grained Multicomputer (CGM) coloring algorithm that colors a graph G with at most D + 1 colors where D is the maximum degree in G . This algorithm is given in two variants: a randomized and a deterministic . We show that on a p-processor CGM model the proposed algorithms require a parallel time of O(|G|/p) and a total work and overall communication cost of O(|G|) . These bounds correspond to the average case for the randomized version and to the worst case for the deterministic variant.

Download paper in PDF

Title: Parallel Distance-k Coloring Algorithms for Numerical Optimization
Authors: A.H. Gebremedhin, F. Manne and A. Pothen
Status: In B. Monien and R. Feldmann (Eds.): EuroPar 2002, Lecture Notes in Computer Science 2400, pp. 912-921, Springer-Verlag 2002.

Abstract

Matrix partitioning problems that arise in the efficient estimation of sparse Jacobians and Hessians can be modeled using variants of graph coloring problems. In a previous work, we argue that distance-2 coloring and distance-3/2 coloring [we now call this star coloring ] are robust and flexible formulations of the respective matrix estimation problems. The problem size in large-scale optimization contexts makes the matrix estimation phase an expensive part of the entire computation both in terms of execution time and memory space. Hence, there is a need for both shared- and distributed-memory parallel algorithms for the stated graph coloring problems. In the current work, we present the first practical shared address space parallel algorithms for these problems. The main idea in our algorithms is to randomly partition the vertex set equally among the available processors, let each processor speculatively color its vertices using information about already colored vertices, detect eventual conflicts in parallel, and finally re-color conflicting vertices sequentially. Randomization is also used in the coloring phases to further reduce conflicts. Our PRAM-analysis shows that the algorithms should give almost linear speedup for sparse graphs that are large relative to the number of processors. Experimental results from our OpenMP implementations on a Cray Origin2000 using various large graphs show that the algorithms indeed yield reasonable speedup for modest numbers of processors.

Download paper in PDF

Title: Scalable Parallel Graph Coloring Algorithms
Authors: A. Gebremedhin and F. Manne
Status: Concurrency: Practice and Expereince, Vol 12, pp1131--1146, 2000.

Abstract

Finding a good graph coloring quickly is often a crucial phase in the development of efficient, parallel algorithms for many scientific and engineering applications. In this paper we consider the problem of solving the graph coloring problem itself in parallel. We present a simple and fast parallel graph coloring heuristic that is well suited for shared memory programming and yields an almost linear speedup on the PRAM model. We also present a second heuristic that improves on the number of colors used. The heuristics have been implemented using OpenMP. Experiments conducted on an SGI Cray Origin 2000 super computer using very large graphs from finite element methods and eigenvalue computations validate the theoretical run-time analysis.

Download paper in PDF

Title: Graph Coloring on a Coarse Grained Multiprocessor
Authors: A.H. Gebremedhin, I.G. Lassous, J. Gustedt and J.A. Telle
Status: In Brandes, Ulrik, Wagner and Dorothea (Eds.): WG 2000, Lecture Notes in Computer Science 1928, pp. 184-195, 2000, Springer-Verlag.

Abstract

We present the first efficient parallel coloring algorithm for the Coarse Grain Multicomputer model. The algorithm uses at most D+1 colors, where D is the maximum degree in the graph.

Download paper in PDF

Title: Parallel Graph Coloring Algorithms using OpenMP
Authors: A.H. Gebremedhin and F. Manne
Status: In Proc. of EWOMP'99, First European Workshop on OpenMP, Sept.30 - Oct. 1, 1999, Lund, Sweden.

Download Extended Abstract in PDF
Go to
  • Research Areas
  • Coloring and Automatic Differentiation
  • Parallel Algorithms
  • Parallel Computation Models