# Multi-GPU parallelization for 3D tomographic reconstruction

Algorithm Architecture Adequacy for solving inverse problems

Nicolas GAC (Ass. Prof. Université Paris-sud) Groupe Problèmes Inverses (GPI) L2S (CentraleSupélec/CNRS/Univ Paris Sud) actuellement en délégation CNRS (6 mois) au laboratoire Lagrange

Travaux sur le projet SKA avec André Ferrari

Séminaire du laboratoire Lagrange, site Valrose, 28 novembre 2017







### The beauty, reconstuction algorithms for SKA



### ...and the beast, multi GPU servers



3/78

GPU (Graphic Processing Units) : hardware and software Solving (ill-posed) inverse Problems with big dataset [Tomo3D] Parallelization on the many cores of each GPU board

GPU (Graphic Processing Units) : hardware and software

- GPU (re)designed as a many core architecure
- Programming in CUDA
- A toy example : acceleration of matrix multiplication



Solving (ill-posed) inverse Problems with big dataset

- Iterative (bavesian) algorithm
- Applications



[Tomo3D] Parallelization on the many cores of each GPU board

- Hardware acceleration of Hf and H<sup>t</sup> operators
- Projection on GPU
- Backprojection on GPU

[Tomo3D] Parallelization on the GPU boards of the server

- multi-GPU Parallelization
- CUDA Streams
- CUDA Half float
- Distribution/Centralization of Data

GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

**1** GPU (Graphic Processing Units) : hardware and software

- GPU (re)designed as a many core architecure
- Programming in CUDA
- A toy example : acceleration of matrix multiplication
- 2 Solving (ill-posed) inverse Problems with big dataset
- [] [Tomo3D] Parallelization on the many cores of each GPU board
- [4] [Tomo3D] Parallelization on the GPU boards of the server

GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

### Calcul haute performance

#### High Performance Computing (HPC)

- Parallélisation sur machines multi-processeurs
   C Efficace sur machine à mémoire distribuée
- Noeuds de calculs performants
  - ➔ processeurs multi-core, many-core ou FPGA/ASIC



GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

### GPU : Graphic Processing Unit

#### Evolution vers une architecture many core

- A l'origine, architecture dédiée pour le rendu de volume
   D'ipeline graphique (prog. en OpenGL/Cg)
- Depuis 2006, architecture adaptée à la parallélisation de divers calculs scientifiques

⊃ CUDA : Common Unified Device Architecure (prog. en C)





GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

### Avant CUDA : pipeline graphique





Calcul sur les Pixels

Shader

8/78

#### GPU (Graphic Processing Units) : hardware and software

Solving (ill-posed) inverse Problems with big dataset [Tomo3D] Parallelization on the many cores of each GPU board [Tomo3D] Parallelization on the GPU boards of the server GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

#### Puissance de calcul



#### GPU (Graphic Processing Units) : hardware and software

Solving (ill-posed) inverse Problems with big dataset [Tomo3D] Parallelization on the many cores of each GPU board [Tomo3D] Parallelization on the GPU boards of the server GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

### Débit mémoire



GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

### Découpage en threads

|                 | Matériel                         | Logiciel                                      | Exécution                   |     |
|-----------------|----------------------------------|-----------------------------------------------|-----------------------------|-----|
|                 | un Streaming Processor (SP)      | un thread                                     | séquentielle                | (a) |
|                 | un Streaming MultiProcessor (SM) | <b>un bloc de thread</b><br>(plusieurs warps) | s parallèle (SIMT)          | (b) |
| Mémoire globale | une carte GPU (device)           | <b>une grille de thre</b><br>(kernel)         | <b>ads</b> parallèle (MIMT) | (c) |

GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

#### Un id par thread et un id par bloc de threads



#### GPU (Graphic Processing Units) : hardware and software

Solving (ill-posed) inverse Problems with big dataset [Tomo3D] Parallelization on the many cores of each GPU board [Tomo3D] Parallelization on the GPU boards of the server GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

#### Hiérarchie mémoire



GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

### PC hote et carte graphique



GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

### Supercalculateur personnel



15/78

GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

### Découpage en threads et en grilles (kernels)



GPU (re)designed as a many core architecure **Programming in CUDA** A toy example : acceleration of matrix multiplication

### Flot de développement logiciel



### Programmation GPU

#### 1 Parallélisation de l'algorithme

D nourrir en threads (plus ou moins indépendants) le GPU

| n coeurs (1 Ghz)     |         |  |  |
|----------------------|---------|--|--|
| VS                   |         |  |  |
| 1 coeur (3 Ghz)      |         |  |  |
| taux de accélération |         |  |  |
| parallélisation      | GTX 200 |  |  |
| (240 coeurs)         |         |  |  |
| 100 %                | 80      |  |  |
| 99 %                 | 24      |  |  |
| 95 %                 | 6       |  |  |
| 90 %                 | 3       |  |  |
|                      |         |  |  |



18/78

GPU (re)designed as a many core architecure **Programming in CUDA** A toy example : acceleration of matrix multiplication

### **Programmation GPU**

GPU (re)designed as a many core architecure **Programming in CUDA** A toy example : acceleration of matrix multiplication

#### 1 Parallélisation de l'algorithme

⊃ nourrir en threads (plus ou moins indépendants) le GPU

#### **2** Implémentation GPU

Selon l'intensité arithmétique du code (puissance de calcul exploitée / débit des données), l'execution sera soit *memory bound* soit *computation bound* (ex : calcul  $X^k$  [?])

⊃ optimisation du code portera alors soit sur les **accès mémoire** ou soit sur la **complexité arithmétique** 

GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

### Parallélisation du calcul matriciel



GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

### Découpage en blocs de threads



GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

#### kernel = code des threads executés sur le GPU

```
__global__ void matrixMul_kernel( float* C, float* A, float* B,int matrix_size) {
```

```
float C_{sum};

int i_{first}, j_{first};

int i,j;

i_{first}=blockldx.x*BLOCK\_SIZE;

j_{first}=blockldx.y*BLOCK\_SIZE;

i=i_{first}+threadldx.x;

j=j_{first}+threadldx.y;

for (k = 0; k < matrix_size; k++)

C_{sum} += A[i][k] * B[k][j];

C[i][j] = C_{sum};

}
```

GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

#### Lancement du kernel depuis le PC hôte

```
#define BLOCK_SIZE 16
```

```
void matrixMul_host(int N) {
```

```
//setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(N /BLOCK_SIZE , N /BLOCK_SIZE );
```

```
//execute the kernel
matrixMul_kernel<<< grid, threads >>>(C_device, A_device, B_device, N);
```

```
•••
```

}

GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

### Gestion de la mémoire GPU via le PC hôte

#define BLOCK\_SIZE 16

void matrixMul\_host(int N) {

// allocate host memory int mem\_size= $N^2$ \*sizeof(float); float\* A\_host = (float\*) malloc(mem\_size); float\* B\_host = (float\*) malloc(mem\_size); float\* C\_host = (float\*) malloc(mem\_size);

// allocate device memory
float\* A\_device,B\_device,C\_device;
cudaMalloc((void\*\*) &A\_device, mem\_size);
cudaMalloc((void\*\*) &B\_device, mem\_size);
cudaMalloc((void\*\*) &C\_device, mem\_size);

// copy host memory to device cudaMemcpy(A\_device, A\_host, mem\_size,cudaMemcpyHostToDevice); cudaMemcpy(B\_device, B\_host, mem\_size,cudaMemcpyHostToDevice);

//setup execution parameters
dim3 threads(BLOCK\_SIZE, BLOCK\_SIZE);
dim3 grid(N /BLOCK\_SIZE , N /BLOCK\_SIZE );

```
// execute the kernel matrixMul_kernel <<< grid, threads >>>(C_device, A_device, B_device, N);
```

// copy result from device to host
cudaMemcpy(C\_host, C\_device, mem\_size,cudaMemcpyDeviceToHost);

```
}
```

#### GPU (Graphic Processing Units) : hardware and software

Solving (ill-posed) inverse Problems with big dataset [Tomo3D] Parallelization on the many cores of each GPU board [Tomo3D] Parallelization on the GPU boards of the server

### Temps GPU

GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

#### Matrices de taille 1024.1024

|              | Processeur      | Temps d'exécution    | Transfert mémoire |
|--------------|-----------------|----------------------|-------------------|
| C            | Xeon Quad core  | 9.35 s               |                   |
| non optimisé | 2.7 Ghz         |                      |                   |
| Cuda         | Testla C1060    | 1.35 s <b>(*6,9)</b> | < 1%              |
|              | 240 PE @1,3 Ghz |                      |                   |
| -            |                 |                      |                   |

GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

#### Accès séquentiels à la mémoire globale



GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

### Temps GPU avec accès mémoire sequentiels

#### Matrices de taille 1024.1024

|              | Processeur      | Temps d'exécution      | Transfert mémoire |
|--------------|-----------------|------------------------|-------------------|
| C            | Xeon Quad core  | 9.35 s                 |                   |
| non optimisé | 2.7 Ghz         |                        |                   |
| Cuda         | Testla C1060    | 1.35 s <b>(*6,9)</b>   | < 1%              |
|              | 240 PE @1,3 Ghz |                        |                   |
| Cuda         | Testla C1060    | 124 m s <b>(*10,9)</b> | 5%                |
| acces seq.   | 240 PE @1,3 Ghz |                        |                   |
| -            |                 |                        |                   |

GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

### Optimisation des accès mémoire



GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

### Optimisation des accès mémoire



GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

### Variable type qualifiers

#### \_\_device\_\_

- en mémoire globale
- durée de vie de l'application
- accessible par tous les threads de la grille et par le hôte via la librairie runtime

#### \_\_constant\_\_

- en mémoire globale (accès via cache constante)
- durée de vie de l'application
- accessible par tous les threads de la grille et par le hôte via la librairie runtime

#### \_\_shared\_\_

- en mémoire shared (locale à un coeur SIMT)
- durée de vie du bloc de threads
- seulement accessible par les threads d'un même bloc

### Temps GPU optimisé

GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

#### Matrices de taille 1024.1024

|              | Processeur      | Temps d'exécution      | Transfert mémoire |
|--------------|-----------------|------------------------|-------------------|
| C            | Xeon Quad core  | 9.35 s                 |                   |
| non optimisé | 2.7 Ghz         |                        |                   |
| Cuda         | Testla C1060    | 1.35 s <b>(*6,9)</b>   | < 1%              |
|              | 240 PE @1,3 Ghz |                        |                   |
| Cuda         | Testla C1060    | 124 m s <b>(*10,9)</b> | 5%                |
| acces seq.   | 240 PE @1,3 Ghz |                        |                   |
| Cuda         | Testla C1060    | 17,5 m s <b>(*7,1)</b> | 34%               |
| shared mem.  | 240 PE @1,3 Ghz |                        |                   |
|              |                 |                        |                   |

GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

## Librairie CUBLAS : CUda Basic Linear Algebra Subprograms

 $\begin{array}{l} \# include < \!\! cublas.h \!\! > \\ \# include < \!\! cutil.h \!\! > \end{array}$ 

 $\begin{array}{l} \mbox{int main(void)} \mbox{ } \{ \\ \mbox{float alpha} = 1.0 \mbox{f, beta} = 0.0 \mbox{f;} \\ \mbox{int N} = 1024 \mbox{;} \\ \mbox{int mem_size} = 1024 \mbox{*}1024 \mbox{*}sizeof(\mbox{float}) \mbox{;} \\ \end{array}$ 

// Allocate host memory
float\* A\_host = (float\*) malloc(mem\_size);
float\* B\_host = (float\*) malloc(mem\_size);
float\* C\_host = (float\*) malloc(mem\_size);

cublasInit();

//Allocate device memory
float\* A\_device,B\_device,C\_device;
cublasAlloc(N\*N, sizeof(float), (void \*\*)&A\_device);
cublasAlloc(N\*N, sizeof(float), (void \*\*)&B\_device);
cublasAlloc(N\*N, sizeof(float), (void \*\*)&C\_device);

// copy host memory to device cublasSetMatrix(N,N, sizeof(float), A\_host, N, A\_device, N); cublasSetMatrix(N,N, sizeof(float), B\_host, N, B\_device, N);

//Calcul matriciel sur le GPU
cublasSgemm('n', 'n', N, N, N, alpha, A\_device, N,B\_device, N, beta, C\_device, N);

 $\label{eq:linear} $$ //Récupération du résultat sur le PC hôte cublasGetMatrix(N,N, sizeof(float), C_device,N, C_host, N); $$ (a) $$ (A) $$ (b) $$$ 

#### GPU (Graphic Processing Units) : hardware and software Solving (ill-posed) inverse Problems with big dataset

Solving (ill-posed) inverse Problems with big dataset [Tomo3D] Parallelization on the many cores of each GPU board [Tomo3D] Parallelization on the GPU boards of the server

### Temps CUBLAS

GPU (re)designed as a many core architecure Programming in CUDA A toy example : acceleration of matrix multiplication

#### Matrices de taille 1024.1024

|              | Processeur      | Temps d'exécution      | Transfert mémoire |
|--------------|-----------------|------------------------|-------------------|
| C            | Xeon Quad core  | 9.35 s                 |                   |
| non optimisé | 2.7 Ghz         |                        |                   |
| Cuda         | Testla C1060    | 1.35 s <b>(*6,9)</b>   | < 1%              |
|              | 240 PE @1,3 Ghz |                        |                   |
| Cuda         | Testla C1060    | 124 m s <b>(*10,9)</b> | 5%                |
| acces seq.   | 240 PE @1,3 Ghz |                        |                   |
| Cuda         | Testla C1060    | 17,5 m s <b>(*7,1)</b> | 34%               |
| shared mem.  | 240 PE @1,3 Ghz |                        |                   |
| CUBLAS       | Testla C1060    | 12,8 m s <b>(*1,4)</b> | 43%               |
|              | 240 PE @1,3 Ghz |                        |                   |
|              |                 |                        |                   |

lterative algorithm : Mean square + quadratic reg Applications

#### GPU (Graphic Processing Units) : hardware and software

- Solving (ill-posed) inverse Problems with big dataset
  - Iterative (bayesian) algorithm
  - Applications
- 3 [Tomo3D] Parallelization on the many cores of each GPU board
- [] [Tomo3D] Parallelization on the GPU boards of the server

Iterative algorithm : Mean square + quadratic reg Applications

### Without bayesian regularisation



#### $\mathbf{g} = \mathbf{H}\mathbf{f} + \epsilon$

- f : volume
- g : tomograph data
- H : acquisition model
- $\epsilon$  : noise

#### Criterion : Mean Square

$$J(f) = ||g - Hf||^2$$
  

$$f^{n+1} = f^n - \alpha \cdot \nabla J(f^n)$$
  

$$\nabla J(f) = -2 \cdot H^t(g - Hf)$$

Iterative algorithm : Mean square + quadratic reg Applications

### Without bayesian regularisation



 $f^n: {\rm Estim{\acute e}}$  du volume

#### $\mathbf{g} = \mathbf{H}\mathbf{f} + \boldsymbol{\epsilon}$

- f : volume
- g : tomograph data
- H : acquisition model
- $\epsilon$  : noise

#### Criterion : Mean Square

$$J(f) = ||g - Hf||^2$$
  

$$f^{n+1} = f^n - \alpha \cdot \nabla J(f^n)$$
  

$$\nabla J(f) = -2 \cdot H^t(g - Hf)$$

Iterative algorithm : Mean square + quadratic reg  $\ensuremath{\mathsf{Applications}}$ 

### Without bayesian regularisation



 $\hat{g}$  : Estimée des données

#### $\mathbf{g} = \mathbf{H}\mathbf{f} + \boldsymbol{\epsilon}$

- f : volume
- g : tomograph data
- H : acquisition model
- $\epsilon$  : noise

$$J(f) = ||g - Hf||^2$$
  

$$f^{n+1} = f^n - \alpha \cdot \nabla J(f^n)$$
  

$$\nabla J(f) = -2 \cdot H^t(g - Hf)$$

Iterative algorithm : Mean square + quadratic reg Applications

### Without bayesian regularisation



 $\delta g$  : Correction des données

#### $\mathbf{g} = \mathbf{H}\mathbf{f} + \epsilon$

- f : volume
- g : tomograph data
- H : acquisition model
- $\epsilon$  : noise

$$J(f) = ||g - Hf||^2$$
  

$$f^{n+1} = f^n - \alpha \cdot \nabla J(f^n)$$
  

$$\nabla J(f) = -2 \cdot H^t(g - Hf)$$

Iterative algorithm : Mean square + quadratic reg Applications

### Without bayesian regularisation



 $\delta f$  : Correction du volume

#### $\mathbf{g} = \mathbf{H}\mathbf{f} + \epsilon$

- f : volume
- g : tomograph data
- H : acquisition model
- $\epsilon$  : noise

$$J(f) = ||g - Hf||^2$$
  

$$f^{n+1} = f^n - \alpha \cdot \nabla J(f^n)$$
  

$$\nabla J(f) = -2 \cdot H^t(g - Hf)$$

Iterative algorithm : Mean square + quadratic reg Applications

### Without baysian regularisation



 $f^{n+1}: {\it Nouvelle}$  estimée du volume

#### $\mathbf{g} = \mathbf{H}\mathbf{f} + \epsilon$

- f : volume
- g : tomograph data
- H : acquisition model
- $\epsilon$  : noise

$$J(f) = ||g - Hf||^2$$
  

$$f^{n+1} = f^n - \alpha \cdot \nabla J(f^n)$$
  

$$\nabla J(f) = -2 \cdot H^t(g - Hf)$$

Iterative algorithm : Mean square + quadratic reg Applications

### With bayesian regularisation



#### $\mathbf{g} = \mathbf{H}\mathbf{f} + \boldsymbol{\epsilon}$

- f : volume
- g : tomograph data
- H: acquisition model
- $\epsilon$  : noise

Criterion : Mean Square + Quadratic Regularisation (MSQR)

$$\begin{split} J(f) &= J_1(f) + J_2(f) \\ J_1(f) &= ||g - Hf||^2 \\ J_2(f) &= \lambda ||Df||^2 \\ f^{n+1} &= f^n - \alpha \cdot (\nabla J_1(f^n) + \nabla J_2(f^n)) \end{split}$$

Iterative algorithm : Mean square + quadratic reg Applications

### [Planéto] Correction de vibrations mécaniques

### Collaboration avec l'IDES de l'Univ. Paris-Sud (F. Schmidt)





Instrument PFS (Planetary Fourier Spectrum) de la mission MARS EXPRESS



42/78

Iterative algorithm : Mean square + quadratic reg Applications

# [Planéto] Correction de vibrations mécaniques

#### Instrument modélisé par une convolution 1D



#### Taille gigantesque des données

Des années d'enregistrements de la mission MARS EXPRESS (2003) donc potentiellement 1 milliard de spectres (de 8192 échantillons) !

43/78

Iterative algorithm : Mean square + quadratic reg Applications

### [Astro] Méthode de reconstruction en astronomie



Iterative algorithm : Mean square + quadratic reg Applications

# [Tomo3D] Algorithmes de reconstruction tomographique



# $1 {\cal K}^3$ volume from 1K projections with $1 {\cal K}^2$ pixels (SAFRAN data set)

Work done in collaboration with SAFRAN (Post-doc Thomas Boulay)



Hardware acceleration of *Hf* and *H<sup>t</sup>* operators Projection on GPU Backprojection on GPU

I GPU (Graphic Processing Units) : hardware and software

- 2 Solving (ill-posed) inverse Problems with big dataset
- [Tomo3D] Parallelization on the many cores of each GPU board
  - Hardware acceleration of *Hf* and *H<sup>t</sup>* operators
  - Projection on GPU
  - Backprojection on GPU

[] [Tomo3D] Parallelization on the GPU boards of the server

Hardware acceleration of Hf and  $H^t$  operators Projection on GPU Backprojection on GPU

# *Hf* and $H^t \delta g$ computation

### Matrix multiplication

⊃ reading  $h_{ij}$  coefficients in SDRAM memory △volume 2048<sup>3</sup> -> matrix H = 1 Exa Bytes !

Hardware acceleration of *Hf* and *H<sup>t</sup>* operators Projection on GPU Backprojection on GPU

# *Hf* and $H^t \delta g$ computation

### Matrix multiplication

⊃ reading  $h_{ij}$  coefficients in SDRAM memory  $\triangle$ volume 2048<sup>3</sup> − > matrix H = 1 Exa Bytes !

#### **2** Geometric operators

 $\supset$  on line computation of  $h_{ij}$  coefficients

Paire de projection/rétroprojection en tomographie à émission (géométrie parallèle)



Hardware acceleration of *Hf* and *H<sup>t</sup>* operators Projection on GPU Backprojection on GPU

Thèse soutenue en 2008 : "Adéquation Algorithme Architecture pour la reconstruction 3D en imagerie médicale TEP" (Gipsa-lab, Grenoble-INP sous la direction de M. Desvignes et S. Mancini)





GPU (carte graphique)



SoPC (prototypage)

Exploration architecturale 49/78

Hardware acceleration of Hf and  $H^t$  operators Projection on GPU Backprojection on GPU

50/78

### Thesis conclusions



### CPU/GPU/FPGA comparaison

|            | CPU                       | GPU                        | FPGA                     |
|------------|---------------------------|----------------------------|--------------------------|
| Time       | 3 <sup>eme</sup> (*4 P4)  | 1 <sup>er</sup> (*50 P4)   | 2 <sup>eme</sup> (*5 P4) |
| Efficacity | 2 <sup>eme</sup> (7 C/op) | 1 <sup>eme</sup> (14 C/Op) | 1 <sup>er</sup> (2 C/Op) |

- GPU is the hardware accelerator the most performant
- FPGA is the hardawre accelerator the most effcient in term of cycles/op (thanks to our cache 3D)

Hardware acceleration of Hf and  $H^t$  operators Projection on GPU Backprojection on GPU

# GPU quickly adopted by the tomography community

#### Publications in Fully 3D

- 2007 : 1st Workshop HPIR (High Performance Image Reconstruction)
- 2011 : Keyword Multi GPU first appeared

|                         | 2003 | 2005 | 2007 | 2009 | 2011 | 2013 | 2015 | 2017 |
|-------------------------|------|------|------|------|------|------|------|------|
| Cluster (MPI/Open MP)   | 2    | 3    | 5    | 6    | 3    | 2    | 1    | 2    |
| GPU (NVIDIA)            |      |      | 10   | 14   | 17   | 7    | 10   | 20   |
| GPU (AMD)               |      |      |      | 1    | 1    |      | 1    |      |
| Cell (IBM)              |      |      | 3    |      |      |      |      |      |
| FPGA                    |      |      | 4    |      | 1    |      | 1    |      |
| DSP                     |      |      | 2    | 1    |      |      |      |      |
| Intel(Larabee,Xeon phi) |      |      |      | 2    |      | 2    | 2    |      |

Hardware acceleration of *Hf* and *H<sup>t</sup>* operators **Projection on GPU** Backprojection on GPU

# 2D projector "regular sampling"



for (un, phi) in Projection do
 for xn = 0 to xn<sub>max</sub> - 1 do
 // coordinates computation
 yn(xn, un, phi) = ...
 // bi-linear interpolation
 f<sub>interp</sub> = ...
 // accumulation
 g\*(un, phi)+ = f<sub>interp</sub>
 end for
end for

Hardware acceleration of *Hf* and *H<sup>t</sup>* operators Projection on GPU Backprojection on GPU

## 2D backprojection : algorithm

#### CALCUL DES COORDONNEES



for (xn, yn) in Volume do
 for phi = 0 to phi<sub>max</sub> - 1 do
 // coordinates computation
 u(phi, xn, yn) = ...
 // accumulation
 f\*(xn, yn)+ = g(u, φ)
 end for
end for

Hardware acceleration of *Hf* and *H<sup>t</sup>* operators Projection on GPU Backprojection on GPU

### 2D backprojection : linear interpolation

#### CALCUL DES COORDONNEES



for (xn, yn) in Volume do for phi = 0 to  $phi_{max} - 1$  do // coordinates computation  $u(phi, xn, yn) = \dots$ // linear interpolation  $g_{interp} = (1 - \epsilon_{\mu}) \cdot g(\text{phi}, u_e) +$  $\epsilon_{\mu} \cdot g(\text{phi}, u_e + 1)$ // accumulation  $f^*(xn, yn) + = g_{intern}$ end for end for

Hardware acceleration of *Hf* and *H<sup>t</sup>* operators Projection on GPU Backprojection on GPU

### 2D backprojection : scattered data access

#### CALCUL DES COORDONNEES







for (xn, yn) in Volume do for phi = 0 to  $phi_{max} - 1$  do // coordinates computation  $u(phi, xn, yn) = \dots$ // linear Interpolation  $g_{interp} = (1 - \epsilon_u) \cdot g(\text{phi}, u_e) +$  $\epsilon_{\mu} \cdot g(\text{phi}, u_e + 1)$ // accumulation  $f^*(xn, yn) + = g_{interp}$ end for end for

Hardware acceleration of *Hf* and *H<sup>t</sup>* operators Projection on GPU Backprojection on GPU

### 2D backprojection : scattered data access

#### CALCUL DES COORDONNEES



ACCES AUX DONNEES  $g(\phi, u)$ 



for (xn, yn) in Volume do for phi = 0 to  $phi_{max} - 1$  do // coordinates computation  $u(phi, xn, yn) = \dots$ // linear interpolation  $g_{interp} = (1 - \epsilon_{\mu}) \cdot g(\text{phi}, u_e) +$  $\epsilon_{\mu} \cdot g(\text{phi}, u_e + 1)$ // accumulation  $f^*(xn, yn) + = g_{interp}$ end for end for

Hardware acceleration of *Hf* and *H<sup>t</sup>* operators Projection on GPU Backprojection on GPU

### 2D backprojection : scattered data access

#### CALCUL DES COORDONNEES



ACCES AUX DONNEES  $g(\phi, u)$ 



for (xn, yn) in Volume do for phi = 0 to  $phi_{max} - 1$  do // coordinates computation  $u(phi, xn, yn) = \dots$ // linear interpolation  $g_{interp} = (1 - \epsilon_{\mu}) \cdot g(\text{phi}, u_e) +$  $\epsilon_{\mu} \cdot g(\text{phi}, u_e + 1)$ // accumulation  $f^*(xn, yn) + = g_{interp}$ end for end for

Hardware acceleration of Hf and  $H^t$  operators Projection on GPU Backprojection on GPU

## 2D backprojection by blocks : localized data access

#### CALCUL DES COORDONNEES



ACCES AUX DONNEES  $g(\phi, u)$ 



for (Bx, By) in Volume do for phi = 0 to  $phi_{max} - 1$  do for (xn, yn) in Bloc do // coordinates computation  $u(phi, xn, yn) = \dots$ // linear interpolation  $g_{interp} = (1 - \epsilon_u) \cdot$  $g(\text{phi}, u_e) + \epsilon_u \cdot g(\text{phi}, u_e + 1)$ // accumulation  $f^*(xn, yn) + = g_{interp}$ end for end for end for

Hardware acceleration of *Hf* and *H<sup>t</sup>* operators Projection on GPU Backprojection on GPU

# 3D backprojection parallelization



(a) Sequential computation on processor elementLoop on z

• Loop on  $\phi$ 

(b) Parallel computation on a block of processors (SIMT)

• Loop on (x,y)

(c) Parallel computation on one card

Hardware acceleration of *Hf* and *H<sup>t</sup>* operators Projection on GPU Backprojection on GPU

# 3D backprojection parallelization



(a) Sequential computation on processor elementLoop on z

• Loop on z

• Loop on  $\phi$ 

(b) Parallel computation on a block of processors (SIMT)

• Loop on (x,y)

(c) Parallel computation on one card

Hardware acceleration of *Hf* and *H<sup>t</sup>* operators Projection on GPU Backprojection on GPU

# 3D backprojection parallelization



(a) Sequential computation on processor element

Loop on z

• Loop on  $\phi$ 

(b) Parallel computation on a block of processors (SIMT)

• Loop on (x,y)

(c) Parallel computation on one card

Hardware acceleration of *Hf* and *H<sup>t</sup>* operators Projection on GPU Backprojection on GPU

# 3D backprojection parallelization



(a) Sequential computation on processor element

Loop on z

• Loop on  $\phi$ 

(b) Parallel computation on a block of processors (SIMT)

• Loop on (x,y)

(c) Parallel computation on one card

multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

### Multi GPUs server (Carri Systems)



multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

### 3D backprojection multi GPU parallelization



multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

## 3D projection multi-GPU parallelization



multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

### Multi-GPU reconstruction time

# Volume $1K^3$ (float) with 1024 projections on 1 to 8 Titans X (3072 cores at 1,075 Ghz)

|           | 1 GPU | 2 GPUs                  | 4 GPUs                  | 8 GPUs                 |
|-----------|-------|-------------------------|-------------------------|------------------------|
| Proj (ms) | 14416 | 8183 1,76               | 4610 <b>3,13</b>        | 2659 <mark>5,42</mark> |
| Back (ms) | 7604  | 5181 <mark>1,47</mark>  | 3027 <mark>2,51</mark>  | 1929 <mark>3,94</mark> |
| Conv (ms) | 3062  | 2987 1, <mark>02</mark> | 2438 1, <mark>26</mark> | 1668 <mark>1,84</mark> |

multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

# Goal of streams : hide PC/GPU memory transfer



#### 64/78

multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

# CUDA streams for mono GPU backprojection (1024 angles 1024<sup>2</sup> plan



multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

# CUDA streams for mono GPU backprojection (1024 angles 1024<sup>2</sup> plan



multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

## single GPU time with streams

#### 1K<sup>3</sup> Volume (float) with 1024 projections on 1 Titan X (3072 cores at 1,075 Ghz)

|            | compute | upload | download | w/o stream | w/ streams | Acc. |
|------------|---------|--------|----------|------------|------------|------|
| Proj (ms)  | 88 %    | 6 %    | 6 %      | 14416      | 11551      | 1,25 |
| Rétro (ms) | 71,1 %  | 16,9 % | 12,1 %   | 7604       | 5358       | 1,42 |
| Conv (ms)  | 5 %     | 28,1 % | 66,9 %   | 3062       | 3072       | 0,99 |

multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

### multi-GPU time with streams

#### $1K^3$ volume (float) with 1024 projections on 1 to 8 Titans X (3072 cores at 1,075 Ghz)

|                       | 1 GPU | 2 GPUs                  | 4 GPUs                 | 8 GPUs    |
|-----------------------|-------|-------------------------|------------------------|-----------|
| Proj (ms) w/o streams | 14416 | 8183 1,76               | 4610 <mark>3,13</mark> | 2659 5,42 |
| Proj (ms) w/ streams  | 11551 | 5783 2,0                | 3142 3,68              | 1756 6,58 |
| Back (ms) w/o streams | 7604  | 5181 1,47               | 3027 2,51              | 1929 3,94 |
| Back (ms) w/ streams  | 5358  | 2609 2,0                | 1672 3,20              | 1731 3,10 |
| Conv (ms) w/o streams | 3062  | 2987 1, <mark>02</mark> | 2438 1,26              | 1668 1,84 |
| Conv (ms) w/ streams  | 3072  | 2482 1,24               | 2340 1,31              | 1674 1,83 |

#### Limitations due to PCI express gen2 bandwith (2 to 4 GB/s)

### Half float data storage

multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

#### CUDA 7.5 allows half float storage of data on GPU memory

- 16 bits format : sign (1bit), exponant (5bits), mantissa (10bits)
- assembler instructions allow the conversion half/float and float/half in CUDA kernels
- Advantage (i) : reduction of data volume to store on the GPU board
- Advantage (ii) : reduction of memory transfer
- Advantage (iii) : reduction of SDRAM GPU memory access by the GPU cores

multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

### single GPU time with streams and half-float storage

#### $1K^3$ volume (float) with 1024 projections on 1 Titan X (3072 cores at 1,075 Ghz)

|           | float | half float | Acc. |
|-----------|-------|------------|------|
| Proj (ms) | 11551 | 8970       | 1,29 |
| Back (ms) | 5358  | 4252       | 1,26 |
| Conv (ms) | 3072  | 1608       | 1,91 |

Additional acceleration with half float storage for projection and backprojection

-> Reduction of SDRAM GPU memory access time by the GPU cores

multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

## multi-GPU Time with streams and half-float storage

| $1K^3$ volume (float) with 1024 projections on 1 to 8 Titans X (3072) |
|-----------------------------------------------------------------------|
| cores at 1,075 Ghz)                                                   |

|              | 1 GPU | 2 GPUs                | 4 GPUs                   | 8 GPUs                 |
|--------------|-------|-----------------------|--------------------------|------------------------|
| Proj (ms) f  | 11551 | 5783 <mark>2,0</mark> | 3142 <mark>3,68</mark>   | 1756 <mark>6,58</mark> |
| Proj (ms) hf | 8970  | 4620 1,94             | 2357 <mark>3,80</mark>   | 1265 7,09              |
| Back (ms) f  | 5358  | 2609 <mark>2,0</mark> | 1672 <mark>3,20</mark>   | 1731 <mark>3,10</mark> |
| Back (ms) hf | 4252  | 2164 1,96             | 1229 3,46                | 876 <mark>4,83</mark>  |
| Conv (ms) f  | 3072  | 2482 1,24             | 2340 1, <mark>3</mark> 1 | 1674 <mark>1,83</mark> |
| Conv (ms) hf | 1608  | 1267 1,27             | 1171 1,37                | 843 1,91               |

Limitations due to PCI express gen2 bandwith (2 to 4 GB/s)

multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

# Data storage during the iterative loop

#### CPU centralisation

All the data ( $f^n$  and  $f^{n+1}$  volume, real g and estimated  $\hat{g}$  sinograms...) could not stay on the GPU board (true from  $1K^3$ ) volumes)

Because of the cone beam geometry, data could not easily cut in independant block of data

- > Data need to be backed up on the CPU at least one time after each iteration

#### (single)GPU centralization

All the data ( $f^n$  and  $f^{n+1}$  volume, real g and estimated  $\hat{g}$  sinograms...) could stay on the GPU board (true up to 512<sup>3</sup> volumes)

- > All the iterative loop could be done on the GPU

#### (multi)GPU centralisation

All the data (n and n+1 volume, real and estimate sinograms...) could be distributed on the diffrenets GPU boards (true up to  $2K^3$  volumes)

-> All the iterative loop could be done without data storage on the CPU

### **CPU** centralization

multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

#### Current strategy : result of each operator (proj, back, conv) is backed up on the CPU

- Advantage : operators (proj, back, conv) are independants (usefull for utilization with Matlab and mex function)
- Disadvantage : several synchronizations CPU/GPU and memory transfer time cost

# Solutions to avoid these multiples synchronizations and its impact on reconstruction time

- Use of only one synchronization per iteration by merging operators working on subblock of data (need of a reduction step)
- Hide memory transfer time thanks to streams and half float data storage.

multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

# Reconstruction time (per iteration with computation of the optimized gradient step) with CPU centralization

#### $1K^3$ volume (float) with 1024 projections on Titans X (3072 cores at 1,075 Ghz)

|        | proj (*2) | retro  | conv(*3) | autres | total  | Acc. |
|--------|-----------|--------|----------|--------|--------|------|
| 1 GPU  | 49,6 %    | 20,2 % | 30,1 %   | 28,1%  | 47,1 s |      |
| 2 GPUs | 36,6 %    | 7,5%   | 14,9 %   | 40,9%  | 32,4 s | 1,45 |
| 4 GPUs | 23,6 %    | 7,5 %  | 21,6 %   | 47,2 % | 27,9 s | 1,69 |
| 8 GPUs | 15,9 %    | 6,6%   | 21,6%    | 55,9 % | 23,6 s | 1,99 |

#### $2K^3$ volume (float) with 2048 projections on Titans X (3072 cores at 1,075 Ghz)

|   |        | proj (*2) | retro   | conv(*3) | autres  | total  | Acc. |
|---|--------|-----------|---------|----------|---------|--------|------|
| Γ | 4 GPUs | 36,27 %   | 20,65%  | 10,38 %  | 32,69%  | 5,4 mn |      |
| Γ | 8 GPUs | 26,38 %   | 13,31 % | 15,22 %  | 45,09 % | 3,8 mn | 1,4  |

multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

# Reconstruction time (per iteration with computation of the optimized gradient step) with CPU centralization

#### Limitations of this CPU centralization

-> The "little" operations (norm L2, substraction...) are becoming preponderants.... Solutions :

- Parallelization on the CPU cores (the minimum to do....)
- Merge the operators (break the frontier between each operators)
- Use of half float storage to get a GPU centralization (code 100% GPU)

multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

# Reconstruction time (per iteration with computation of the optimized gradient step) with **GPU centralization**

| $1 {\cal K}^3$ volume (float) with 1024 projections on one Titan X (3072 cores at 1,075 GH |                                                      |               |             |          |        |        |      |  |
|--------------------------------------------------------------------------------------------|------------------------------------------------------|---------------|-------------|----------|--------|--------|------|--|
|                                                                                            |                                                      | proj (*2)     | back        | conv(*3) | others | total  | Acc. |  |
|                                                                                            | CPU cer                                              | tralization   |             |          |        |        |      |  |
|                                                                                            | 1 GPU                                                | 49,7 %        | 9,8 %       | 12,7 %   | 27,0 % | 43,9 s |      |  |
|                                                                                            | GPU cer                                              | tralization a | nd half flo | at       |        |        |      |  |
|                                                                                            | 1 GPU   78,3 %   18,3 %   2,2 %   1,2 %   21,9 s   2 |               |             |          |        |        |      |  |
|                                                                                            |                                                      |               |             |          |        |        |      |  |
|                                                                                            |                                                      |               |             |          |        |        |      |  |

multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

### Conclusions

#### Towards an efficient computation on GPU for each operator

- Local and spatial memory locality
- Threads/Blocks "optimal" definition (thread parallelism)
- Unrolling loop (instruction parallelism)
- Incremetal computation

#### Use of streams to hide CPU/GPU memory transfer time

#### Half-float data storage on GPU

- Reduction of CPU/GPU memory transfer
- Reduction of SDRAM GPU/coeurs GPU memory transfer
- Reduction of storage on SDRAM GPU

 $->{\rm A}$  significant accelaration factor (1.2/1.3) on a single GPU and a more efficient multi-GPU parallelization

-> A 100 % GPU code for  $1K^3$  volume is becoming possible

iterative reconstruction of  $2K^3$  volume

multi-GPU Parallelization CUDA Streams CUDA Half float Distribution/Centralization of Data

### Perspective for SKA project

#### Short term perspectives

- Acceleration of the convolution (H and Ht)
- Multi-GPU parallelization with CPU Centralisation of data

#### Median/long term perspectives

- Multi-GPU parallelization with multi-GPU distribution of data
- Use of FPGA Architecture with HLS (High Level Synthesis) tools