River Routing in the INM RAS-MSU Land Surface Model: Numerical Scheme and Parallel Implementation on Hybrid Supercomputers

The land surface model (LSM) is a necessary component of any numerical weather forecast system or Earth system model. This paper presents a new version of the INM RAS-MSU land surface model in which the river hydrodynamic and thermodynamic scheme is embedded into the parallel execution framework using MPI and OpenMP. Numerical experiments have been performed for the East European domain with resolution 0.5° × 0.5°. The soil model parallel efficiency at 1–144 MPI cores was 0.52–0.79, limited by the presence of ocean area and by the imbalance of computational load between soil columns. The acceleration of the river model at the MPI level was defined by the size of the largest river basin in the domain. At the OpenMP level, the potential acceleration of large river basin simulation is shown to be close to the number of threads used, based on the fractal properties of river networks. This acceleration was hindered in our numerical experiments by the reduced river orders at the coarse land surface model resolution, so that the optimal speedup for the Volga river basin, 2.5–3 times, was attained at 4–6 threads. This performance is projected to improve with refinement of the LSM spatial resolution.


Introduction
The land surface scheme (or land surface model, LSM) is a necessary component of any numerical weather forecast system or Earth system model. It reproduces the thermodynamic, hydrophysical and ecological state of the land active layer, as well as momentum, mass and energy fluxes between the Earth surface and the atmosphere. Accurate reproduction of these fluxes is important for reliable weather forecasts at a broad range of time scales, from subdiurnal to seasonal. In particular, soil moisture is widely recognized as a critical parameter for seasonal weather dynamics and prediction [6,19,32], as it defines the structure of the soil surface heat balance.
Rivers are an integrating component of the land water cycle, as they gather soil moisture from large terrestrial areas and transport it to the ocean. The significance of rivers in the Earth system stems from their notable contribution to the ocean freshwater budget [14,23], the transfer of methane, carbon dioxide [17,21,29] and dissolved organic carbon [2,3] from land to ocean, and the emission of greenhouse gases to the atmosphere. In addition, the river discharge at large timescales is close to the accumulated "precipitation minus evaporation" over the river basin, and thus serves as a useful proxy for this difference during land surface model validation. These considerations led to the introduction of so-called river routing schemes, which represent river flows in a simplified manner, into land surface models [1,11,15,22,31].
Land surface models have long consisted of a large number of independent vertical 1D soil problems, enabling a straightforward parallel implementation using MPI and the longitude-latitude domain decomposition technique. However, embedding a river module introduces horizontal dependencies between river variables, so the existing longitude-latitude decomposition is not optimal for the LSM river module. This calls for the development of new parallel implementation approaches.

The INM RAS-MSU Land Surface Scheme
The land surface scheme used in this study is jointly developed by the Institute of Numerical Mathematics (INM), Russian Academy of Sciences, and Lomonosov Moscow State University (MSU) (Fig. 1). It is a part of the INM-CM5 Earth system model [27] and of the SL-AV numerical weather forecast system [10]. Here, we consider it in a standalone version, where the atmospheric forcing is prescribed. It reproduces the thermodynamic and hydrophysical state of soil, lakes and rivers, as well as momentum, mass and energy fluxes between the Earth surface and the atmosphere. The effects of surface vegetation on surface-atmosphere exchange are represented via modifications of the soil-air exchange laws (based mostly on Monin-Obukhov similarity) and the introduction of a liquid water sink in soil due to root suction. This means that no vegetation-layer storage of heat or mass is considered. Subgrid-scale variability of the land surface is simulated by a tile approach, where the contribution of each surface type in a cell to the cell-averaged fluxes of heat, radiation, water vapor and carbon is proportional to its areal fraction. Carbon cycle processes accounted for in the model include photosynthesis, organic matter decay in soils with CO2 release, and organics degradation in wetlands with CH4 formation. The most computationally expensive parts of the INM RAS-MSU land model are the soil, lake and river modules. The soil and river modules are described in more detail below. The lake model is similar in its information structure to the soil model, as both comprise a number of independent 1D problems.

Soil Model
The basic numerical kernel of any land surface scheme is a solver for an equation system describing heat and water transport in soil and snow, including phase transitions. In the INM RAS-MSU model, this system includes the following equations. The heat equation is:

$$\rho_d c \frac{\partial T}{\partial t} = \frac{\partial}{\partial z}\left(\lambda_T \frac{\partial T}{\partial z}\right) - \rho_d \sum_{k=i,v} L_k F_k, \qquad (1)$$

where ρ_d c is the volumetric heat capacity of soil, T is temperature, z is a coordinate directed along the acceleration due to gravity, t is time, λ_T is the coefficient of heat conductivity, ρ_d is the dry soil bulk density, and L_k and F_k are the specific heat and rate of water freezing/melting (k = i) and evaporation/condensation (k = v), respectively. The heat conductivity λ_T is a function of liquid water content W and dry soil conductivity. An equation for the liquid water content W describes vertical transport (diffusion due to capillary and sorption forces and gravitational infiltration), freezing/melting and evaporation/sublimation, root uptake and horizontal discharge [16]:

$$\frac{\partial W}{\partial t} = \frac{\partial}{\partial z}\left(\lambda_W \frac{\partial W}{\partial z} + \lambda_I \frac{\partial I}{\partial z} - \gamma\right) - F_i - F_v - S_r - S_l. \qquad (2)$$

Here, W is the ratio of liquid water mass to solid soil matrix mass, S_r is root suction, and S_l is a sink due to lateral water flow. The dependence of the soil moisture potential Ψ (or water retention curve, WRC) and of the hydraulic conductivity γ on the moisture W and ice content I is an important feature of the system, defining the coefficients λ_W, λ_I, γ as functions of the solution (λ_I is neglected in the current version of the model). The liquid water diffusivity λ_W(W) and hydraulic conductivity γ(W) are related functions:

$$\lambda_W(W) = \gamma(W)\,\frac{\partial \Psi}{\partial W}. \qquad (3)$$

At least 22 semi-empirical forms have been proposed for the WRC [9], fitting different sets of empirical data with different performance. The WRC function explicitly enters the hydraulic conductivity function γ. For instance, choosing the Mualem approach for hydraulic conductivity quantification [20], one gets:

$$\gamma(W) = \gamma_{max}\,\dot{W}^{1/2}\left[\int_0^{\dot{W}} \frac{dW'}{\Psi(W')} \bigg/ \int_0^1 \frac{dW'}{\Psi(W')}\right]^2, \qquad (4)$$

where Ẇ := (W − W_min)/(W_max − W_min) is the degree of soil moisture saturation. Thus, introducing different forms of the WRC in (4) and (3) yields the corresponding pairs of functions (λ_W, γ), based on the Mualem equation.
In INM RAS-MSU model, the two most widespread sets of (λ W , γ) are implemented, namely those by Brooks-Corey [4,5] and Mualem-van Genuchten [12,20].
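As a numerical illustration of the Mualem approach, eq. (4) can be checked against the Brooks-Corey case: combining the Brooks-Corey retention curve Ψ ∝ Ẇ^(−1/λ_BC) with the Mualem integral gives the known closed form γ/γ_max = Ẇ^(2.5 + 2/λ_BC). A minimal sketch (Python; the pore-size index value is illustrative, and the code is not part of the Fortran model):

```python
def mualem_kr(s, psi_inv, n=20000):
    """Relative hydraulic conductivity from the Mualem integral (cf. eq. (4)):
    sqrt(s) * (int_0^s psi_inv / int_0^1 psi_inv)**2, midpoint quadrature.
    psi_inv(x) is 1/Psi(x) up to a constant factor, which cancels in the ratio."""
    def integral(upper):
        h = upper / n
        return h * sum(psi_inv((i + 0.5) * h) for i in range(n))
    return s ** 0.5 * (integral(s) / integral(1.0)) ** 2

lam = 0.4                                        # illustrative Brooks-Corey index
brooks_corey_inv = lambda x: x ** (1.0 / lam)    # 1/Psi for Psi ~ x**(-1/lam)
numeric = mualem_kr(0.7, brooks_corey_inv)
closed = 0.7 ** (2.5 + 2.0 / lam)                # closed form for this WRC
```

The quadrature reproduces the closed form to quadrature accuracy, confirming that a new WRC inserted into (4) immediately yields a consistent conductivity function.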
The content of water vapor (V, expressed as a mass ratio, similar to W) is governed by a diffusion equation and phase transitions:

$$\frac{\partial V}{\partial t} = \frac{\partial}{\partial z}\left(\lambda_V \frac{\partial V}{\partial z}\right) + F_v, \qquad (5)$$

where λ_V is the diffusivity of water vapor, while the dynamics of the ice content I is defined by phase transitions only:

$$\frac{\partial I}{\partial t} = F_i. \qquad (6)$$

The system of four equations above is supplemented by boundary conditions representing heat and water mass balance at the soil-air interface z = 0 and at the bottom of the soil column z = H. At the surface, the heat and water balance equations include net radiation and the modification of fluxes by the vegetation canopy, while at the bottom a zero diffusive flux condition is imposed. In the cold season, the snow depth dynamics is simulated, as well as the vertical distribution of temperature and snow moisture, by a system of equations analogous to the soil equations, where the subscript "sn" denotes the thermodynamic properties of snow [24,28]. This system is coupled to the soil equation set via continuity of fluxes and temperature at the soil-snow interface. The system is solved by a numerical scheme implicit in time with central differences in space, using 23 levels in soil down to 10 m depth, 4 levels in snow (if present) and a 1-hour time step for every cell of a regular latitude-longitude grid (0.5° × 0.5° in this study). In order to ensure the heat balance equation at the soil surface, an iteration procedure with respect to the surface temperature is implemented.
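The implicit central-difference scheme reduces, for each soil column and time step, to tridiagonal linear systems solved by direct factorization. A minimal sketch of one backward-Euler step of the heat diffusion part (Python, with constant diffusivity and grid spacing assumed purely for illustration; the model itself is written in Fortran):

```python
def thomas(a, b, c, d):
    """Solve the tridiagonal system a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def implicit_heat_step(T, kappa, dz, dt):
    """One backward-Euler step of dT/dt = kappa * d2T/dz2, zero-flux boundaries."""
    n = len(T)
    r = kappa * dt / dz ** 2
    a = [-r] * n             # sub-diagonal
    b = [1.0 + 2.0 * r] * n  # main diagonal
    c = [-r] * n             # super-diagonal
    a[0], c[-1] = 0.0, 0.0
    b[0] = b[-1] = 1.0 + r   # zero-flux: the missing neighbor is mirrored
    return thomas(a, b, c, list(T))

# 23 soil levels with a warm upper part and a cold lower part
T0 = [10.0] * 11 + [0.0] * 12
T1 = implicit_heat_step(T0, kappa=1e-6, dz=0.5, dt=3600.0)
```

The scheme conserves the column heat content under zero-flux boundaries and smooths the temperature jump, which is the behavior required of the soil solver kernel.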
The code is written in Fortran and uses external libraries for I/O in the NetCDF format, MPI exchanges and OpenMP threading.

The Model for River Hydrodynamics and Thermodynamics
The model for river hydrodynamics and thermodynamics is based on the diffusive wave approximation of the 1D (i.e., averaged over the vertical cross-section of a stream) Saint-Venant system [24]. Under this approach, the pressure gradient force is balanced by quadratic bottom friction, which delivers a closed set of equations: the continuity equation (9) for the vertical cross-section area S and the heat transport equation (10) for the temperature T, supplemented with boundary conditions at the river origin x = 0 and the river mouth x = L, where x is the along-river coordinate. In these equations, E_r is the water volume inflow rate from soil and tributaries per unit river length, u_tr T_tr h_tr is the heat inflow per unit river length from soil and inlets, b_s is the width of the river surface, F is the net kinematic heat and radiation flux at the water surface, and U_*, k_S and k_ST are functions of channel slope and geometry. The net radiation includes shortwave and longwave components; the net heat flux includes sensible and latent heat fluxes calculated by the same Monin-Obukhov similarity as for the soil model (with water-specific roughness lengths). The ice cover is not explicitly simulated in the current version, and the water temperature is allowed to decrease below 0°C. The system is solved for each river by the Preissmann scheme, semi-implicit in time, with 10 mesh nodes per model cell (Fig. 2) and Vreman conservative filtering [30] to suppress two-grid-point oscillations.

Information Structure of the Land Surface Model with River Routing
The soil and river models are one-way coupled: the solution of the soil problem affects the solution of the river model (via groundwater runoff), but not vice versa. The total water volume source in the stream continuity equation (9) is E_r = E_r,s + E_r,r, where E_r,s is the soil runoff and E_r,r is the water input from tributaries. The soil water runoff to the rivers of a model cell, E_r,s, is obtained from the cell-integrated runoff using the cell area A, the total river length in the cell L_A, and the reference water density ρ_w0. Thus, an update of the soil variables in a cell during a time step provides E_r,s over this time step to be used in the river model as a longitude-latitude field. The water input from tributaries, E_r,r, is composed of the discharges of the tributary(ies) of the river, if any are present in the cell, denoted by the subscript "tr". The heat source in (10) is computed analogously, with H_a standing for the active soil layer depth, assumed to be 1 m in the current model version. These relations mean that, in order to advance the river variables at each time step, the soil runoff and the discharge of tributaries are necessary. Thus, the sequence of model execution at each time step is as follows (omitting other parts of the algorithm, such as lakes, I/O, etc.):
• soil model (time step i),
• river model (time step i):
  - rivers of the 1st order,
  - rivers of the 2nd order,
  - ...
  - rivers of the maximal order,
• soil model (time step i + 1),
• river model (time step i + 1):
  - rivers of the 1st order,
  - ...
Here, we introduced the Strahler orders of rivers: by definition, rivers of order n have tributaries of order not larger than n − 1, and streams having no inlets are of order n = 1. This is a strict sequence that cannot be changed under the selected numerical scheme. The parts of the algorithm that are independent in terms of information exchange are:
• time step updates of different soil columns,
• time step updates of different river basins,
• time step updates of different rivers of the same order.
Here, we use the term "basin" to denote a river network whose highest-order river flows into the ocean or into a lake with no outlet. The serial and parallel parts of the algorithm indicated above determine the levels of parallel implementation described in the following section.
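The Strahler order introduced above can be computed recursively over the river network tree; the sketch below (Python, with a toy nested-list tree that is not the model's data structure) illustrates the definition:

```python
def strahler(tributaries):
    """Strahler order of a stream given its direct tributaries (each itself
    a list of tributaries); a stream with no inlets has order 1."""
    if not tributaries:
        return 1
    orders = sorted((strahler(t) for t in tributaries), reverse=True)
    # the order grows only where two tributaries of equal maximal order meet
    if len(orders) > 1 and orders[0] == orders[1]:
        return orders[0] + 1
    return orders[0]

# toy network: two order-2 streams (each formed by two headwaters) merge
headwater = []
network = [[headwater, headwater], [headwater, headwater]]
```

Because a river's order depends only on its tributaries' orders, all rivers of one order can be updated independently once all lower orders are done, which is exactly the serial-over-orders, parallel-within-order structure listed above.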
The INM RAS-MSU land surface model with the river routing scheme described above was shown to successfully reproduce the annual cycle of runoff of two North Eurasian rivers: the Severnaya Dvina and the Kolyma [18,24].

Levels of Parallelism and Analytical Estimates for Model Speedup
The soil model is parallelized using a standard decomposition of the longitude-latitude domain into M_λ M_φ subdomains of the same dimensions, where M_λ and M_φ are the numbers of MPI processes along those two coordinates, respectively. Given that no data exchanges are needed between any two MPI processes for the soil simulations, this part of the model should speed up with efficiency ≈ 1 under an even computational load of the processes. The river model parallel implementation consists of two levels. The top level comprises the distribution of river basins between MPI processes, realized as follows. Let us assume there are N_b basins in the model domain, and let n_i, i = 1, ..., N_b, be the decreasing sequence of the numbers of land model cells constituting these basins. Then, the MPI process of rank 0 simulates the largest basin (i = 1), whereas the k-th process computes the dynamics of rivers in the basins labeled i = i_{1,k}, ..., i_{m_k,k}, so that the total number of cells in those basins, Σ_i n_i, does not exceed n_1 and is as close to n_1 as possible. Such a scheme ensures a distribution of basins among a small number of MPI processes which is even in terms of the total number of land model cells, but not necessarily in the number of basins.
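The distribution rule can be sketched as a greedy packing of basins, sorted by cell count, into per-process bins of capacity n_1; the actual packing in the model may differ in detail, and the function name is illustrative:

```python
def distribute_basins(cell_counts, n_procs):
    """Greedy packing: basins sorted by cell count (decreasing); rank 0 gets
    the largest basin, each further rank accumulates basins while the total
    stays within n_1 cells; the last rank absorbs any remainder."""
    sizes = sorted(cell_counts, reverse=True)
    cap = sizes[0]                        # n_1, cells of the largest basin
    bins = [[] for _ in range(n_procs)]
    loads = [0] * n_procs
    rank = 0
    for s in sizes:
        if loads[rank] and loads[rank] + s > cap and rank < n_procs - 1:
            rank += 1                     # open the next process's bin
        bins[rank].append(s)
        loads[rank] += s
    return bins

assignment = distribute_basins([100, 60, 50, 30, 20, 10], 3)
```

With these (hypothetical) basin sizes, rank 0 holds only the 100-cell basin, and no rank except possibly the last exceeds n_1 cells, matching the even-by-cells (not even-by-basins) property described above.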
To supply the river model with input data (meteorological variables, inflow of water volume and heat from soil to streams), each MPI process should have access to the 2D (lat-lon) arrays of those variables covering all river basins associated with this process. However, these 2D arrays are originally split between MPI processes according to the 2D domain decomposition of the soil model (see the first paragraph of this section, and Fig. 5 in Section 4). This issue is solved by allocating an auxiliary 2D array covering the whole simulation domain, identical on each MPI process, which is filled with the above-mentioned variables by an ALLREDUCE operation. This operation is performed once per time step, after the call of the soil update (PBLFLX subroutine) and before the river update (RIVWAT subroutine).
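The ALLREDUCE-based gathering can be emulated without MPI: each process contributes a domain-sized array that is zero outside its own subdomain, so the element-wise sum (which is what ALLREDUCE with the SUM operation returns on every rank) reconstructs the complete field everywhere. An illustrative sketch with toy array sizes:

```python
def local_contribution(nx, ny, sub, field):
    """A domain-sized array holding values only inside this process's
    subdomain sub = (i0, i1, j0, j1); zeros elsewhere avoid double counting."""
    out = [[0.0] * ny for _ in range(nx)]
    i0, i1, j0, j1 = sub
    for i in range(i0, i1):
        for j in range(j0, j1):
            out[i][j] = field[i][j]
    return out

def allreduce_sum(arrays):
    """Element-wise sum over all processes' arrays, as ALLREDUCE(SUM)
    would deliver to every rank."""
    nx, ny = len(arrays[0]), len(arrays[0][0])
    return [[sum(a[i][j] for a in arrays) for j in range(ny)] for i in range(nx)]

# a 4x4 "global" field split between two processes along the first axis
field = [[float(4 * i + j) for j in range(4)] for i in range(4)]
parts = [local_contribution(4, 4, (0, 2, 0, 4), field),
         local_contribution(4, 4, (2, 4, 0, 4), field)]
full = allreduce_sum(parts)
```

Zero-padding outside each subdomain is what makes a single SUM reduction sufficient: the subdomains do not overlap, so every cell receives exactly one nonzero contribution.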
The second level of parallelism of the river model is realized within each MPI process, and uses the informational independence of the solutions of the river equations for rivers of the same order. This level is implemented using OpenMP instructions. Below we estimate the maximal speedup of a single river basin simulation following this approach, attainable under no overhead costs.
Let ω be the Strahler order of a river. We introduce the values n_ω (the total number of watercourses of order ω in a given basin) and L_ω (the average length of watercourses of order ω in the basin). According to Horton's laws, which can be derived from the approximation that the river network is a fractal [8,25], the following relations are statistically valid:

$$n_\omega = n_1 C_n^{-(\omega-1)}, \qquad L_\omega = L_1 C_L^{-(\omega-1)},$$

where C_n > 1 and C_L < 1 are constants of the river basin. The wall-clock time of solving the one-dimensional problem for one river at one time step can be expressed as O(L_ω^k), where k is defined by the numerical scheme; e.g., k = 1 for explicit schemes and for implicit schemes where the linear systems are solved by a direct factorization method (the case of our model). Now we estimate the time of sequential processing T_s, i.e., the time needed for the sequential solution of the one-dimensional problems for all rivers of the river basin. Let ω_max be the maximal order of rivers in the basin. Then

$$T_s = \sum_{\omega=1}^{\omega_{max}} n_\omega\, O(L_\omega^k).$$

Since 1 = n_{ω_max} = n_1 C_n^{−(ω_max−1)}, and introducing C*_L := C_L^{−1} > 1, the expression for T_s reads:

$$T_s = O(L_1^k) \sum_{\omega=1}^{\omega_{max}} C_n^{\,\omega_{max}-\omega}\, C_L^{*\,k(\omega-1)}.$$

If all rivers of the same order are processed simultaneously by different cores, then the update of all rivers of order ω takes the time O(L_ω^k), rather than n_ω O(L_ω^k) as in the sequential version, so that the total processing time of the whole river network T_p is estimated as:

$$T_p = O(L_1^k) \sum_{\omega=1}^{\omega_{max}} C_L^{*\,k(\omega-1)},$$

and the speedup of the whole river basin simulation is T_s/T_p.

Supercomputing Frontiers and Innovations
If only m cores are available for the river basin processing, the above estimate transforms to:

$$\frac{T_s}{T_p(m)} = \frac{\sum_{\omega=1}^{\omega_{max}} n_\omega\, C_L^{*\,k(\omega-1)}}{\sum_{\omega=1}^{\omega_{max}} \left\lceil n_\omega/m \right\rceil C_L^{*\,k(\omega-1)}}, \qquad n_\omega = C_n^{\,\omega_{max}-\omega}. \qquad (24)$$

An example of river model speedup for single-basin processing according to (24) is shown in Fig. 3. One can see that for the largest river basins (with maximal river order exceeding 6–8) the speedup approaches the number of cores used. This is important, as the largest basins cause a bottleneck at the MPI level of the river model implementation, and their effective speedup at the OpenMP level may significantly reduce the execution time of the whole river module.
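The m-core estimate (24) is straightforward to evaluate; the sketch below assumes k = 1 and the reference Horton values C_n ≈ 4, C*_L ≈ 2 (defaults are illustrative, as in Fig. 3):

```python
from math import ceil

def basin_speedup(omega_max, m, c_n=4.0, c_l_star=2.0, k=1):
    """T_s / T_p(m) for one basin, all rivers of one order updated m at a time;
    the number of order-w rivers is c_n**(omega_max - w) (so exactly one river
    has the maximal order), and their per-step cost is c_l_star**(k*(w-1))."""
    t_s = t_p = 0.0
    for w in range(1, omega_max + 1):
        n_w = c_n ** (omega_max - w)
        cost = c_l_star ** (k * (w - 1))
        t_s += n_w * cost                 # rivers processed one by one
        t_p += ceil(n_w / m) * cost       # m rivers of order w at once
    return t_s / t_p
```

For a basin of maximal order 8 and 14 cores this gives a speedup close to 14, while a small basin of order 3 saturates far below the core count, consistent with the behavior described for Fig. 3.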

Configuration of Numerical Experiments
Numerical experiments with the INM RAS-MSU land surface model have been conducted for the domain 40°–80°N, 20°–59.5°E with horizontal resolution 0.5°×0.5° and for a period of 31 days. Atmospheric forcing (surface air temperature, humidity, wind speed, downwelling longwave and shortwave radiation, pressure, precipitation) was read from GFDL-ESM2M piControl simulation data, 1–31 Jan 1661, bias-corrected against the EWEMBI dataset of the ISIMIP2b project (https://www.isimip.org). The atmospheric data are on the same grid as the model; their temporal resolution is daily, and the model time step is 1 hour. The data for river flow directions and riverbed slopes are taken from the database of the GWSP-WATCH project based on DDM30 data [7], made accessible through the ISIMIP project as well. The resulting river network of the Volga basin is shown in Fig. 5. On the model mesh, the Volga river has order 6, which is much less than that defined for the river network obtained at much finer resolution (the Volga order is 9 at 200–500 m resolution [13]). Model simulations have been conducted on the Lomonosov-2 supercomputer [26], using nodes with Intel Haswell-EP E5-2697v3 processors (2.6 GHz, 14 cores) and Infiniband FDR interconnect. The executable of the land surface model was assembled with the ifort compiler using -O3 optimization. In the MPI series of runs, the model was launched as:

sbatch --ntasks=<number of MPI processes> --bind-to none <executable>

whereas for the OpenMP series:

sbatch --ntasks=4 --ntasks-per-node=1 --bind-to none <executable>

The latter ensures that each MPI process runs on a separate node, and all OpenMP threads of a given MPI process are located on the same node.

[Fig. 5 caption fragment: ... Fig. 4); each arrow denotes a river course segment containing the mesh nodes depicted in Fig. 2, black thin dotted lines show model cells, thick dashed black lines delineate MPI subdomains, and color denotes the river Strahler order.]

Results and Discussion
The wall-clock time and parallel efficiency (defined as the speedup of the parallel version with respect to the serial code version, divided by the number of MPI processes) for the experiments of the MPI series are shown in Fig. 6a and 6b, respectively.
The soil model accelerates with a gradual decrease of efficiency from 0.79 at 4 cores to 0.52 at 144 cores, despite the absence of MPI exchanges in this part of the code. The reason is the load imbalance between MPI processes, taking into account that the wall-clock time of the soil model at any timestep is defined by the maximal wall-clock time among the processes. At 4 cores, all MPI subdomains include ocean cells and different numbers of land cells, which define the distribution of the workload between processes. At large values of M_φ M_λ there are subdomains with all cells residing over the ocean, which means the corresponding MPI processes are idle. The fraction of such processes is estimated as the fraction of ocean cells, which is 30%. The efficiency may be corrected for this factor by taking into account only the number of MPI processes that process land cells; e.g., for 144 cores the corrected efficiency is 0.52/(1 − 0.3) = 0.74. The remaining loss of efficiency, 1 − 0.74 = 0.26, is mostly explained by the increase of the time for one-timestep processing of a single soil column, averaged over the subdomain of the most loaded MPI process, as the number of cores rises (from 3.9 × 10^−5 s at 1 MPI process to 4.9 × 10^−5 s at 144 MPI processes). This means that even fully terrestrial MPI subdomains have different workloads, depending on the presence of snow cover and on the different numbers of iterations for the surface temperature needed to perform a timestep under the local meteorological conditions.
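The correction amounts to dividing the measured efficiency by the fraction of MPI ranks that actually process land; a trivial check of the arithmetic above:

```python
def corrected_efficiency(measured_eff, ocean_fraction):
    """Parallel efficiency recomputed over land-processing MPI ranks only,
    assuming the idle-rank fraction equals the ocean cell fraction."""
    return measured_eff / (1.0 - ocean_fraction)

eff_144 = corrected_efficiency(0.52, 0.30)   # the 144-core case from the text
```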
The regular longitude-latitude decomposition of the LSM computational domain between MPI processes is inherited from the fully coupled ESM, where this decomposition matches that of the atmospheric model. However, as shown above, in standalone mode it becomes non-optimal if the rectangular domain in (λ, φ) contains ocean cells. In this case, the current MPI decomposition scheme can be optimized so that the MPI subdomains contain only land cells, or a minimal number of ocean cells. This model improvement is left for the future.
Remarkably, the river model is accelerated almost 4 times when using 4 MPI processes, whereas for a larger number of cores the wall-clock time remains almost constant. This is due to the fact that, at the MPI level, the largest river basin in the domain (the Volga basin) is processed by a single process, and the wall-clock time of this process is the minimum possible for the river model in general. As a result, the river model contribution to the whole land surface model runtime, negligible in serial mode, starts to dominate over the soil model above ∼100 MPI processes (3.49 s for the river model vs. 1.75 s for the soil model).
The time for the ALLREDUCE operation gathering the data for the river model input gradually increases from 0.22 s at M_φ M_λ = 4 to 0.66 s at M_φ M_λ = 144, indicating the increasing overhead of this collective operation, while still remaining much less than the river model computation time (3.49 s).
The wall-clock time and parallel efficiency of Volga basin processing in the experiments of the OpenMP series are shown in Fig. 7a and 7b, respectively. The efficiency is defined here with respect to the model configuration with 4 MPI processes and 1 OpenMP thread.
Simulation of the rivers of smallest orders demonstrated the highest speedup and efficiency, whereas the rivers of orders 5 and 6 are simulated with no acceleration. The overall decrease of the Volga simulation time when using 14 threads is 4.11 times, which is much smaller than the ≈ 12 times estimated from the theoretical formula (24). To address this difference, consider the Volga basin scale-similarity parameters (Horton metrics) computed on the model mesh (Fig. 5) and presented in Tab. 1. The river length parameter C*_{L,ω} is close to the typical value of 2, excluding the case ω = 2, because the lengths of low-order rivers are not explicitly reproduced on the model mesh (e.g., the majority of first-order rivers have the length of one model cell). On the contrary, the river number parameter C_{n,ω} is biased from the typical value of ∼4 for the largest orders, 5 and 6. The value C_{n,6} = 1 means that there is one river of order 5 and one river of order 6, which are both simulated sequentially by one OpenMP thread (hence no speedup for those rivers in Fig. 7a). There are two rivers of order 4, so the acceleration of their simulation ceases when the number of threads exceeds 2. The gradual decrease of the parallel efficiency of the simulation of rivers of orders 1 and 2 with an increasing number of threads is caused by the growing contribution to the total wall-clock time of the overhead of initializing the OpenMP PARALLEL section and DO loop at each timestep.
To check the reasoning of the previous paragraph, formula (24) can be extended to account for the variability of the Horton coefficients with river order ω:

$$\frac{T_s}{T_p} = \frac{\sum_{\omega=1}^{\omega_{max}} n_\omega\, c_{L,\omega}}{\sum_{\omega=1}^{\omega_{max}} \left\lceil n_\omega/m \right\rceil c_{L,\omega}}, \qquad n_\omega = n_1\, c_{n,\omega}, \qquad (25)$$

where c_{n,ω} := Π_{i=2}^{ω} C_{n,i}^{−1} and c_{L,ω} := Π_{i=2}^{ω} C*_{L,i}^{k}. Substituting into (25) the values from Tab. 1 gives T_s/T_p = 3.28 for 14 OpenMP threads, which is much closer to the measured acceleration of 4.11 than the estimate given by formula (24) with the reference Horton metrics (≈ 12 times). The real acceleration appears to be faster, as expression (25) does not take into account the overheads of processing each river, related to invoking subroutines, allocating memory, etc. In serial mode, these overheads especially increase the total wall-clock time of the solution of the model equations for small-order rivers (due to their large number), and since the parallelization of these river orders is the most efficient, this improves the parallel performance of the whole basin.
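The order-dependent estimate (25) can be evaluated in the same way as (24); the coefficient lists below are illustrative placeholders, not the actual Tab. 1 values:

```python
from math import ceil, prod

def basin_speedup_var(c_n, c_l_star, m, k=1):
    """Speedup with order-dependent Horton coefficients: c_n[i] and c_l_star[i]
    give C_{n,w} and C*_{L,w} for orders w = 2..omega_max; n_1 is chosen so
    that exactly one river of the maximal order exists."""
    omega_max = len(c_n) + 1
    n1 = prod(c_n)                        # ensures n_{omega_max} = 1
    t_s = t_p = 0.0
    for w in range(1, omega_max + 1):
        n_w = n1 / prod(c_n[: w - 1])     # prod of an empty slice is 1
        cost = prod(x ** k for x in c_l_star[: w - 1])
        t_s += n_w * cost
        t_p += ceil(n_w / m) * cost
    return t_s / t_p
```

With constant coefficients C_n = 4 and C*_L = 2 the function reduces to the reference case of (24), while coefficients of 1 at the highest orders (one river per order, as on the coarse Volga mesh) pull the estimate down toward the measured values.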
To summarize, processing of the Volga basin is significantly accelerated up to 4–6 OpenMP threads, this limit arising from the serial processing of the longest rivers of orders 5 and 6. However, with the finer land surface model meshes expected in the future, the orders of resolved large river basins are to increase, and thus the efficiency of the OpenMP-based approach presented here is anticipated to improve, according to the estimates presented in Section 3.

Conclusion
This paper presents a new version of the INM RAS-MSU land surface scheme in which the river hydrodynamic and thermodynamic model is embedded into the parallel execution framework using two levels of parallelism: the first is MPI-based independent processing of river basins, and the second uses the OpenMP technique to parallelize the simulation of rivers of the same Strahler order. Numerical experiments have been performed for the East European domain with resolution 0.5° × 0.5°. The MPI implementation of the soil model is based on a conventional even longitude-latitude decomposition of the model domain, inherited from the atmospheric model. The soil model parallel efficiency at 1–144 cores was shown to be 0.52–0.79, limited by the presence of ocean area and by the imbalance of computational load between soil columns, which depends on the presence of snow cover and on the number of iterations for the surface temperature needed to advance the soil profiles. The acceleration of the river model at the MPI level (not exceeding 4 times) is defined by the size of the largest river basin in the domain (the Volga), whereas at the OpenMP level the potential acceleration of large river basin simulation is shown to be close to the number of threads used. The OpenMP-level speedup was hindered in our numerical experiments by the underestimation of river orders at the coarse land surface model resolution (the best performance for the Volga basin was attained at 4–6 threads, with 2.5–3 times acceleration).
The future development of the parallel code includes an MPI+OpenMP implementation of the soil model, optimization of the MPI domain decomposition of the soil model in the presence of ocean surface, and further tuning of the MPI+OpenMP configuration of the river model.