
Pi-Yueh Chuang at GTC 2016


Pi-Yueh Chuang gets a photo opportunity with NVIDIA CEO Jen-Hsun Huang.

At this year's NVIDIA GPU Technology Conference, PhD student Pi-Yueh Chuang presented the work "Using AmgX to Accelerate PETSc Codes." AmgX is an NVIDIA library that provides sparse linear solvers, smoothers, and preconditioners on GPU devices. He described his work coupling AmgX with our fluid-flow solver, PetIBM. PetIBM solves the Navier-Stokes equations on Cartesian grids with the immersed boundary method; it relies on the PETSc library for all distributed vectors and matrices, and for solving the sparse linear systems arising from the implicit CFD formulation.
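
For readers unfamiliar with this workflow, the sketch below shows the PETSc pattern that PetIBM builds on: assemble a distributed sparse matrix, then solve Ax = b with a Krylov solver (KSP). This is a self-contained toy example, not PetIBM code; the 1D Laplacian here merely stands in for the pressure Poisson operator.

```cpp
// Minimal PETSc workflow: assemble a distributed matrix, solve Ax = b
// with a Krylov solver (KSP). The 1D Laplacian is a stand-in for the
// pressure Poisson operator; PetIBM's actual assembly is more involved.
#include <petscksp.h>

int main(int argc, char **argv)
{
    PetscInitialize(&argc, &argv, nullptr, nullptr);

    const PetscInt n = 100;  // global problem size
    Mat A;  Vec x, b;  KSP ksp;

    // Distributed tridiagonal matrix: -1, 2, -1 (1D Poisson).
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);
    MatSetUp(A);
    PetscInt lo, hi;
    MatGetOwnershipRange(A, &lo, &hi);
    for (PetscInt i = lo; i < hi; ++i) {
        if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
        if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
        MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);              // arbitrary right-hand side

    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetFromOptions(ksp);      // solver/preconditioner chosen at run time
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
    PetscFinalize();
    return 0;
}
```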

We asked ourselves if we could accelerate PetIBM by offloading the solution of the pressure Poisson equation to AmgX, while leaving the rest of the calculation as a PETSc application. AmgX and PETSc each have their own vector and matrix objects, with unique APIs. To tackle this, Pi-Yueh wrote a wrapper code that makes it easy to call AmgX from a PETSc application. The wrapper also handles the common situation where the PETSc-based program launches with more MPI processes than there are GPUs: we can use all available CPU cores for the parts of the application that remain on the CPU side, and all available GPUs to accelerate the linear solvers with AmgX. The wrapper is open source and free for unrestricted use under the MIT License.
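
To give a flavor of what calling AmgX from a PETSc application looks like, here is a minimal sketch using the wrapper. The AmgXSolver class and its initialize/setA/solve/finalize methods follow the wrapper's interface, but treat the exact signatures, the "dDDI" mode string, and the configuration file name as illustrative; the wrapper's repository and examples are the authoritative reference.

```cpp
// Sketch of replacing a PETSc KSP solve with an AmgX solve through the
// wrapper. Names follow the wrapper's AmgXSolver interface; check the
// repository for exact signatures and configuration options.
#include <petscksp.h>
#include "AmgXSolver.hpp"

void solvePoisson(const Mat &A, Vec &x, Vec &b)
{
    AmgXSolver solver;

    // "dDDI": double precision, data and solve on the GPU device.
    // The JSON file holds the AmgX solver/preconditioner configuration
    // (file name is a placeholder here).
    solver.initialize(PETSC_COMM_WORLD, "dDDI", "amgx_config.json");

    solver.setA(A);      // hand the assembled PETSc matrix to AmgX
    solver.solve(x, b);  // overwrite x with the solution of Ax = b

    solver.finalize();
}
```

In a code like PetIBM, a call like this replaces the corresponding KSPSolve for the pressure Poisson system, leaving the rest of the PETSc application untouched.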

The GTC session summary reads:

Learn to accelerate existing PETSc applications using AmgX, NVIDIA's library of multi-GPU linear solvers and multigrid preconditioners. We developed wrapper code to couple AmgX and PETSc, allowing programmers to use it with fewer than 10 additional lines of code. Using PetIBM, our PETSc-based, immersed-boundary CFD solver, we show how AmgX can speed up an application with little programming effort. AmgX can thus bring multi-GPU capability to large-scale 3D CFD simulations, reducing execution time and lowering hardware costs. As an example, we estimate the potential cost savings using Amazon Elastic Compute Cloud (EC2). We also present performance benchmarks of AmgX, and tips for optimizing GPU multigrid preconditioners for CFD. This presentation is co-authored with Professor Lorena A. Barba.

Performance results

On benchmarks solving standard Poisson problems with 25M unknowns, we obtained a 5x speedup in the 2D case on four K20 GPUs, compared to one 16-core CPU node. In the 3D case, we obtained a 13.2x speedup with eight K20s running AmgX, compared to one 16-core CPU node; the minimum number of GPUs here is dictated by our memory needs.

On larger Poisson benchmarks, with 100M unknowns in 2D and 50M unknowns in 3D, we prefer to report the size of the CPU cluster that would deliver the same performance as a GPU cluster: about 400 CPU cores would be needed to match a 32-GPU cluster, which achieves speedups of 17.6x in 2D and 20.8x in 3D. On a real application (flying-snake simulations), with almost 3M mesh points, we obtained a 21x speedup on one K20 GPU, compared to one 12-core CPU node.

Warning!

We talk about a "speedup" obtained when using GPU computing. Note that this refers to the application speedup from using the solution methods of the AmgX library on GPUs, compared with obtaining the same solution with the PETSc library on CPUs.

The present work is not a code port, so you should not look at these speedups and wonder how they are compatible with the peak-to-peak bandwidth specifications of the hardware.

To be clear: we are not comparing the same code on CPU and GPU, and we make no claim of an "apples-to-apples" comparison. The point of this work is not to show our coding prowess in moving an algorithm to GPU hardware; it is to assess the reduction in runtime that a user may experience when calling the NVIDIA AmgX GPU library from an application code written with PETSc.

Bear in mind that PETSc is a two-decade-old, actively developed library, used in hundreds of research codes around the world.

Links to presentation and supplementary materials
