Hybrid CPU + Xeon Phi implementation of the Particle-in-Cell method for plasma simulation

This paper presents experimental results of Particle-in-Cell plasma simulation on a hybrid system with CPUs and Intel Xeon Phi coprocessors. We consider simulation of two relevant laser-driven particle acceleration regimes using the Particle-in-Cell code PICADOR. On a node of a cluster with 2 CPUs and 2 Xeon Phi coprocessors the hybrid CPU + Xeon Phi conﬁguration allows to fully utilize the computational resources of the node. It outperforms both CPU-only and Xeon Phi-only conﬁgurations with the speedups between 1.36 x and 1.68 x.


Introduction
Numerical simulation of plasmas is an important area of computational physics with numerous applications.The Particle-in-Cell method [1][2][3] is widely used for plasma simulation in ultrahigh fields.Large-scale 3D simulation requires the use of supercomputers.Currently, there are a number of Particle-in-Cell implementations capable of that, including OSIRIS [4], PIConGPU [5], VPIC [6], VLPL [7], WARP [8].
During the recent years there has been a continuous progress in utilization of accelerators, including GPUs [5,9,10] and Intel Xeon Phi coprocessors [11,12].Our code PICADOR has been previously ported and optimized for Xeon Phi coprocessors that led to 1.6-1.8x speedup compared to an 8-core CPU on a benchmark problem [13].Naturally, it is more challenging to efficiently utilize Xeon Phi on real applications that may have large output, significant imbalance, costly diagnostics, and other performance hindering factors.However, even obtaining the same performance on Xeon Phi as on a multicore CPU is beneficial in terms of hybrid CPU + Xeon Phi computing.It allows to fully utilize computational resources of heterogeneous machines with several CPUs and Xeon Phi coprocessors per node and theoretically may significantly outperform CPU-only and Xeon Phi-only configurations.
This paper presents our experience of solving two problems by Particle-in-Cell simulation using PICADOR on a hybrid CPU + Xeon Phi machine.These two problems have been selected as they are relevant for laser-plasma physics and are part of our current research involving PICADOR as a tool for numerical simulation.Thus, the problems considered present a realistic workload and, secondly, any speedup obtained due to Xeon Phi or hybrid CPU + Xeon Phi computing is beneficial for the current research done using the code.The paper is organized as follows.We briefly describe the method and our code PICADOR in section 1.Sections 2 and 3 are devoted to the physical problems we consider.Summary and conclusions are given in section 3.

Particle-in-Cell code PICADOR 1.Particle-in-Cell method
In this subsection we briefly describe the Particle-in-Cell method, a detailed description is given in [3].
The Particle-in-Cell method operates on field data and particle data.Values of electric and magnetic fields are defined on a spatial grid.A plasma is represented as an ensemble of particles, each with a charge, mass, position and momentum.Each particle used in simulation is in fact a macroparticle that represents a cloud of real particles.A particle form factor defines particle distribution inside the cloud.In this paper we use the cloud-in-cell form factor that corresponds to the uniform distribution in the cloud that has the same size as a cell of the spatial grid.A notable feature of the method is that particles do not interact with each other directly, instead each particle interacts with a set of nearby grid values, depending on the form factor.
The basic computational loop of the Particle-in-Cell method consists of the following stages.For each particle the Lorenz force is computed using interpolated values of the electromagnetic field and the particle momentum and position are updated.Grid values of the current created by particle movement are computed.Field interpolation and current deposition depend on the particle form factor being used.Finally, field values are updated by solving Maxwell's equations.

PICADOR code
PICADOR [13,14] is a tool for 3D plasma simulation based on the Particle-in-Cell method.The code supports a rather standard set of numerical schemes for laser-plasma simulation: several widely used particle form factors, Boris particle pusher [15], Yee grid [16] and finitedifference time-domain field solver [17], charge-conserving current deposition [18,19].
Parallel computing on cluster systems is organized using MPI.On the internode level we use a 3D rectilinear domain decomposition, with each MPI process handling a part of the simulation area.Each MPI process stores particles and grid values in the corresponding sub-area with guard cells.Data exchanges occurs only between neighbour sub-areas.On shared memory OpenMP is used to parallelize loops over particles and cells.Processing of particles in a cell is partially vectorized using a combination of compiler auto-vectorization and manually coded intrinsic functions, details are given in [13].The basic computational scheme is given in fig. 1. PICADOR has been previously ported and optimized for Intel Xeon Phi coprocessors [13].Applying standard optimization techniques such as improving memory locality, enhancing scaling efficiency and vectorization resulted in 1.6-1.8x speedup on Xeon Phi 5110P relative to an 8-core Intel Xeon E5-2660 CPU on a benchmark problem with a uniform distribution of particles between threads and no MPI communication.

Hybrid simulation of wakefield laser pulse self-compression
In this section we consider simulation of a laser pulse self-compression due to wakefield excitation in the relativistic regime.A gas jet is irradiated by a femtosecond high intensity laser pulse.In the result of the interaction a generated wakefield may lead to laser pulse self-compression.This is a promising mechanism for generating ultra-short high-intensity laser pulses [20].The simulation was performed using the following parameters: 256 × 128 × 128 grid, 78.5 million particles, 7 674 time steps, cloud-in-cell particle form factor, charge-conserving Villasenor-Buneman current deposition scheme [18], double precision floating point arithmetic.
Computational experiments were performed on a node of the MVS-10P supercomputer at the Joint Supercomputer Center of RAS with 2 x Intel Xeon E5-2690 CPU (8 cores, 2.9 GHz, Hyperthreading enabled) and 2 x Intel Xeon Phi 7110 (61 cores, 1.053 GHz), Intel C++ Compiler 14.0.1,Intel MPI 4.1.We used a single MPI process per CPU and Xeon Phi, 2 OpenMP threads per core on CPU and 4 threads per core on Xeon Phi.Our previous results [13] show that this configuration of processes and threads is the best for PICADOR.This is probably due to the fact that during the most computationally intensive stages of the Particle-in-Cell method there are lots of independent subproblems that can be solved in parallel, even with 4 threads per core on Xeon Phi.Meanwhile, running several MPI processes on shared memory requires some additional communication, which could hinder overall performance.Run time on a node of MVS-10P in three configurations: CPU-only, Xeon Phi-only, and hybrid CPU + Xeon Phi is given in tab. 1. Run time of the computational core on Xeon Phi is close to that of the CPU.There is some advantage in the particle push stage which is partially vectorized on Xeon Phi, but it is compensated by the non-vectorized current deposition stage.MPI communication on Xeon Phi is slower because some related operations are sequential and single-core performance of the coprosessor is inferior to the CPU.Overall, Xeon Phi is 1.16 x slower compared to the CPU.Nevertheless, utilizing the coprocessors allows to accelerate the simulation compared to the CPU by means of the hybrid CPU + Xeon Phi configuration.It yields 1.44 x and 1.68 x speedup compared to the CPU-only and Xeon Phi-only configurations, respectively.

Hybrid simulation of target normal sheath acceleration
In this section we consider a laser ion acceleration in the target normal sheath acceleration regime, which is a widely used approach to laser-driven particle acceleration [21].A high-intensity laser pulse heats electrons at the front surface of a target forming a sheath at its rear side.Ions of the target are accelerated from the rear side by the electron sheath.Surface grating is used on the irradiated side of the target to increase efficiency [22].The hardware and numerical schemes were the same as in the previous section.Run time on a node of MVS-10P in CPU-only, Xeon Phi-only and CPU + Xeon Phi configurations is given in tab. 2. The simulation area was divided between MPI processes along the direction of the laser pulse to reduce the number of particles going from one process to another.Nevertheless, due to the complex dynamics occurring throughout the simulation, there is a massive migration of particles between processes.Therefore, a large amount of time is spent on MPI communication: 8.8% of the total run time on the CPU and 43.4% on the coprocessor.Due to this fact Xeon Phi is 1.14 x slower compared to the CPU, although the computational core is 1.41 x faster.The issue of the slow particle migration is partially alleviated in the hybrid configuration because of the lower amount of particles per process.The CPU + Xeon Phi combination outperforms the CPU-only and Xeon Phi-only runs by 1.36 x and 1.55 x, respectively.

Conclusions
This paper presents results of simulation of two problems on a node of the MVS-10P cluster using three configurations: 2 x CPUs, 2 x Xeon Phi coprocessors, 2 x CPUs + 2 x Xeon Phi.Both problems considered are not ideally suited for Xeon Phi, with the CPU slightly outperforming Xeon Phi by factors of 1.16 x and 1.14 x because of MPI communication time.Even in this case, it is possible to use the Xeon Phi coprocessors to accelerate simulation by fully utilizing the computational resources of a node in the hybrid CPU + Xeon Phi mode.The speedup of the hybrid configuration compared to the CPU-only and Xeon Phi-only modes is between 1.36 x and 1.68 x.Further progress can be made by reducing MPI communication time, particularly on Xeon Phi.A possible approach is to first process near-boundary particles and then overlap asynchronous particle transfers and processing the remaining particles.Our very first implementation of this idea shows some speedup of data transfers on Xeon Phi for the problem considered in section 2: 1.46 x on 2 x Xeon Phi and 1.35 x in the hybrid mode.However, a tradeoff is some increase in the particle push time, so that the overall speedup is 1.05 x.Further improvement of data transfers is a direction of future work.
The problems considered in this paper have a sufficiently uniform distribution of particles in the simulation area.Thus, they are rather suitable for the manycore architecture of Xeon Phi with the standard OpenMP thread scheduling policies.However, there are important regimes involving electron-positron cascades [23,24] with a small number of cells having more particles than the rest of the simulation area.Our first experiments show that in this case the utilization of standard approach results in significant thread workload imbalance on Xeon Phi that causes underwhelming performance.A parallelization scheme for such problems is under development.
The recently introduced Intel Xeon Phi products of the KNL generation look very promising due to a significant increase in the overall performance as well as in the single-core performance.Optimization of the code for KNL is a direction of future work.
This study was supported by the RFBR, project No. 15-37-21015.This paper is distributed under the terms of the Creative Commons Attribution-Non Commercial 3.0 License which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is properly cited.

Figure 1 .
Figure 1.Computational scheme of the Particle-in-Cell method

Table 1 .
Run time of simulation of wakefield laser pulse self-compression on CPU, Xeon Phi, and CPU + Xeon Phi.Time is given in seconds.For the computational core a split into current deposition, particle push, and other operations is given

Table 2 .
Run time of simulation of target normal sheath acceleration on CPU, Xeon Phi, and CPU + Xeon Phi.Time is given in seconds.For the computational core a split into current deposition, particle push, and other operations is given