Multicore Platform Efficiency Across Remote Sensing Applications

A wide range of modern system architectures and platforms targeting different algorithms and application areas is now available. Even general-purpose systems have advantages in some computation areas and bottlenecks in others. Scientific applications in specific areas, in turn, have different requirements for CPU performance, scalability and power consumption. The current best practice is an algorithm/architecture co-exploration approach, in which the requirements of the scientific problem influence the hardware configuration while the algorithm implementation is refactored and optimized for the architectural features of the platform. In this research, two typical modules used for multispectral nighttime satellite image processing are studied: measurement of local perceived sharpness in the visible band using the Fourier transform, and cross-correlation in a moving window between the visible and infrared bands. Both modules are optimized and studied on a wide range of up-to-date testbeds based on different architectures. Our testbeds include computational nodes based on Intel Xeon E5-2697A v4, Intel Xeon Phi, Texas Instruments Sitara AM5728 (dual-core ARM Cortex-A15), and NVIDIA Jetson TX2. The study includes performance testing and energy consumption measurements. The results can be used to assess the suitability of each platform for multispectral nighttime satellite image processing by two key parameters: execution time and energy consumption.


Introduction
This paper describes a cross-platform analysis of multispectral image processing algorithms for nighttime remote sensing.
The timeliness and relevance of nighttime remote sensing have been reaffirmed by studies such as the correlation of electric lighting on the Earth's surface with socioeconomic trends [1], monitoring of night fishing boat lights [2], detection and characterization of combustion sources [3], and a global survey of natural gas flaring [4].
The first step in designing a suitable HPC system for processing remote sensing data is to analyze the applicability of modern platforms to the typical algorithms used in multispectral image processing.
A wide range of modern system architectures and platforms targeting different algorithms and application areas is now available. Even general-purpose systems have advantages in some computation areas and bottlenecks in others.
Following the current trends in system architecture, all the experiments were conducted on different platforms:
• Modern Intel® architectures: Intel® Xeon® E5 (2697A v4).
Scientific applications used in specific areas have different requirements for CPU performance, scalability and power consumption. The co-design approach is widely considered best practice for designing an effective and economical system. This approach implies that the requirements of the scientific problem influence the hardware configuration, while the algorithm implementation is refactored and optimized in accordance with the architectural features of the platform.
Two typical algorithms used in nighttime image processing were therefore selected for cross-platform examination:
• Correlation Module: inter-channel image cross-correlation in a moving window.
• Sharpness Module: spectral and spatial measure of local perceived sharpness [9].
The study was conducted using multispectral images from the VIIRS radiometer onboard the Suomi National Polar-orbiting Partnership (SNPP) satellite.
The image processing modules have been implemented and optimized for the different hardware architectures.
According to the latest trends, the key research issue remains providing a holistic solution that can collectively minimize the energy consumption of an HPC facility [5]. An energy consumption study is therefore required, in addition to the performance analysis, to choose the most suitable architecture for a future HPC system. The main contribution of this paper is the characterization of selected HPC architectures in terms of energy consumption and execution time while running remote sensing data processing tasks. To perform this study, different implementations and architecture-dependent optimizations of the source code were developed, and both execution time and energy consumption were measured and analyzed on the chosen architectures.
The rest of the paper is structured as follows. Section 1 reviews earlier research on performance and energy consumption in the hyperspectral imaging field. Section 2 provides a detailed specification of the compared architectures. Section 3 describes the selected benchmarking algorithms used in nighttime image processing. Section 4 describes the parallel implementation and architecture-specific optimizations of these algorithms on the selected architectures. Section 5 describes the software and hardware used for measurements. Section 6 provides tables with the measured results as well as the testing protocol. Finally, the last section concludes the paper with a discussion of the obtained results.

Related Work
Advances in sensor technologies have resulted in a substantial increase in the spatial, spectral and temporal resolution of satellite imagers. For example, the Visible Infrared Imaging Radiometer Suite (VIIRS) onboard the Suomi NPP satellite generates 3 TB of multispectral images for every month of nighttime observations. Both the re-processing of the full 6-year archive of nighttime images and the recent addition of a second satellite with the same imager require an upgrade in the energy efficiency and computing performance of the current processing environments.
There are some research efforts aimed at energy-efficient processing of hyperspectral image data. For example, in 2013 Remon et al. [6] presented a detailed assessment of the performance and energy consumption of hyperspectral unmixing algorithms on a multi-core platform equipped with 4 AMD Opteron 6172 processors.
Another study, entitled "Energy consumption characterization of a Massively Parallel Processor Array (MPPA) platform running a hyperspectral SVM classifier" [7], examines the power dissipation and energy consumption of the MPPA-256-N while running an SVM hyperspectral classifier. That paper also includes a comparison with a GTX 780 Ti GPU.

Hardware
In Tab. 1, codenames and specifications of the studied testbeds are listed.

Sharpness
The Sharpness module is the most computationally intensive part of the automatic system for detecting fishing boat lights in nighttime images from the VIIRS multispectral radiometer [2].
The VIIRS Boat Detector (VBD) considers all isolated bright spikes that are sharply visible on the night sea surface as candidates for boats. Under moonlight, interference from clouds and lunar glint is taken into account as well. The Sharpness Module processes visible images from the VIIRS Day/Night Band (DNB). If part of the image appears blurry according to the Sharpness Module result, it is discarded from the search for isolated electric lights from boats.
The flow graph of the module is shown in Fig. 1.
Figure 1. Sharpness Module Flow Graph
The Sharpness module reads its input from the VIIRS DNB image stored in HDF5 format. Output data are stored in binary ENVI format. Data processing includes the following steps:
• Logarithmic transformation of the brightness histogram (stretch).
• Computing the Sharpness Index (SI) [9] in a moving window of Block Size × Block Size.
The Direct Fourier Transform routines and the routines for solving overdetermined real linear systems are performed repeatedly during this step.
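The two processing steps can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the source does not specify how the Sharpness Index of [9] is assembled, so we assume here that the overdetermined system is a least-squares line fit of log-magnitude against log-frequency within each window. All function names are hypothetical, and the naive transform stands in for the optimized library routines used in the study.

```python
import cmath
import math

def log_stretch(image):
    """Logarithmic stretch: map each non-negative pixel v to log(1 + v)."""
    return [[math.log1p(v) for v in row] for row in image]

def dft2_magnitude(block):
    """Magnitude of the direct 2-D DFT of a square block (naive O(n^4) loop)."""
    n = len(block)
    mag = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            acc = 0j
            for y in range(n):
                for x in range(n):
                    phase = -2.0 * math.pi * (u * y + v * x) / n
                    acc += block[y][x] * cmath.exp(1j * phase)
            mag[u][v] = abs(acc)
    return mag

def fit_line(xs, ys):
    """Least-squares solution of the overdetermined system ys ~ slope*xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, mean_y  # all abscissae equal: slope is undefined, report 0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def spectral_slope(block):
    """Assumed per-window measure: slope of log-magnitude versus log-frequency."""
    n = len(block)
    mag = dft2_magnitude(block)
    xs, ys = [], []
    for u in range(n):
        for v in range(n):
            fu, fv = min(u, n - u), min(v, n - v)  # fold indices to signed frequency
            radius = math.hypot(fu, fv)
            if radius > 0 and mag[u][v] > 0:  # skip the DC bin and empty bins
                xs.append(math.log(radius))
                ys.append(math.log(mag[u][v]))
    if len(xs) < 2:
        return 0.0
    return fit_line(xs, ys)[0]

def sliding_slopes(image, block_size, step):
    """Stretch the image once, then compute one value per window position."""
    stretched = log_stretch(image)
    rows, cols = len(stretched), len(stretched[0])
    out = []
    for r in range(0, rows - block_size + 1, step):
        out_row = []
        for c in range(0, cols - block_size + 1, step):
            block = [row[c:c + block_size] for row in stretched[r:r + block_size]]
            out_row.append(spectral_slope(block))
        out.append(out_row)
    return out
```

Each window is processed independently of the others, which is the property that makes the step suitable for thread-level parallelization.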

Cross-Correlation
The Cross-correlation module calculates correlations between two spectral bands, visible and infrared. The main idea of the algorithm is to validate the detected sources in different spectral bands under moonlit conditions.
The validation is carried out by computing the linear Pearson correlation between corresponding moving windows in the two spectral images. If the visible and infrared images are locally well correlated, the signal in the visible image is coming from moonlit clouds. If the local correlation is weak, the visible signal is coming from the sea surface.
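A minimal sketch of the windowed Pearson correlation follows, assuming both bands are given as equally sized lists of rows. The function names, the `step` parameter and the convention of returning 0 for a flat window are our assumptions, not details taken from the studied module.

```python
import math

def pearson(a, b):
    """Linear Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a flat window carries no correlation information
    return cov / math.sqrt(var_a * var_b)

def windowed_correlation(visible, infrared, window, step=1):
    """Correlate corresponding window x window patches of the two bands."""
    rows, cols = len(visible), len(visible[0])
    out = []
    for r in range(0, rows - window + 1, step):
        out_row = []
        for c in range(0, cols - window + 1, step):
            a = [visible[r + i][c + j] for i in range(window) for j in range(window)]
            b = [infrared[r + i][c + j] for i in range(window) for j in range(window)]
            out_row.append(pearson(a, b))
        out.append(out_row)
    return out
```

The result for each window position depends only on the two input patches, so window positions can be distributed freely over threads.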

Implementation Details
The original versions of both algorithms were implemented in the MATLAB programming language.
We re-implemented the studied modules in C++. The source code was refactored to reach the maximal level of compiler-assisted optimization. The final C++ version of the code was written in a straight-line manner: every repeatedly executed loop has a single entry and a single exit that does not depend on the data.
Input and output data details for both algorithms are presented in Tab. 2. HDF5-1.8.19 was used for parsing and reading HDF5 data.

Intel Version
To achieve the best performance on the Intel testbeds, vectorization features were used. In this context, vectorization means using the Intel SSE instruction set, which is an extension to the x86 architecture [10].
Efficient memory access was ensured by aligning data to 32-byte boundaries (for Intel Advanced Vector Extensions (Intel AVX)) and 64-byte boundaries (for Intel AVX-512). Intel compiler pragmas were used to inform the compiler where it can safely ignore data dependencies and that the data is aligned.
Repeated operations on data arrays were implemented in a consecutive manner so that data can be loaded directly from memory by a single SSE instruction.
Moreover, in the Cross-Correlation module the typical trip count of the loop, based on the typical image size, is passed to the compiler as a hint.
Mathematical calculations such as vector logarithm computation, Direct Fourier Transforms and solving overdetermined real linear systems were performed using the Intel MKL Library (2017).
Processor-specific options of the form -ipo -O3 -xMIC-AVX512 (for KNL) or -xCORE-AVX2 (for BRW) were used to generate optimized and specialized code for each processor.
A hybrid (MPI + OpenMP) parallelization scheme was used in this implementation to make efficient use of the multicore architectures. Each MPI rank processes its own images independently, so communication between the processes is minimal. OpenMP threads were used in the Sharpness Index computation step, where they process independent data at different positions of the moving window.
This version was tested on the Intel testbeds with the codenames Broadwell and KNL (Tab. 1).

ARM Version
The ARM-optimized version of the open-source FFTW3 library was used for Direct Fourier Transforms. The LAPACK library (3.7.1-4) was used for solving the overdetermined real linear systems. Processor-specific options were used for compilation.
The current implementation uses only the ARM Cortex-A15 cores; the GPU cores are idle during the computation. The ARM testbed therefore still has room for code optimization to achieve the maximum possible performance.
A simple MPI-only parallelization scheme was used for this implementation, in which each MPI rank processes its own images independently. An additional OpenMP parallelization layer is not required in this case because the cores do not support hyperthreading. We used the MPICH MPI implementation optimized for ARM.
This version was tested on the ARM testbed (Tab. 1).

CUDA Version
The Sharpness algorithm is optimized for the Jetson testbed according to the logical structure of the algorithm described in subsection 3.1. The preparation steps, such as data input, stretch, applying the Wiener filter, and computing the Spike Median Index (SMI), are performed on the CPU. The most computationally intensive step (computing the Sharpness Index) is implemented using CUDA (V8.0.62). This step is performed using GPU cores only.
The Cross-Correlation algorithm is also implemented using CUDA. The number of threads in each block is set according to the image width; the number of blocks is set according to the image height.
CUDA threads use the GPU resources effectively, so the MPI layer is not used in this implementation.
This version was tested on the Jetson testbed (Tab. 1).

Measuring Equipment
Running Average Power Limit (RAPL) energy sensors, available in recent Intel CPUs, were used to measure energy consumption on the Intel testbeds (Broadwell and KNL). According to the Intel research [11], RAPL software power readings closely follow actual power measurements. RAPL reports various energy readings, including the energy consumption of the processor packages and of the DRAM packages.
The PAPI library [12] was used on the Intel testbeds as an interface to the RAPL energy consumption measurements [13]. PAPI provides uniform access to performance counters as well as to RAPL data, so it opens up opportunities for enhanced measurements in the future.
A Hantek DSO2000 Series USB oscilloscope [14] was used for power measurements on the ARM and NVIDIA testbeds. Electric current was measured in amperes at every second of testing. Voltage was measured before the execution of each test series.
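The conversion of such readings into energy can be sketched as follows, assuming the voltage stays constant during a run and each current sample is representative of its one-second interval. The function names and the sample values in the usage note are hypothetical and are not measurements from this study.

```python
def energy_joules(current_samples_a, voltage_v, sample_period_s=1.0):
    """Energy as E = V * sum(I_k) * dt for current samples taken every dt seconds."""
    return voltage_v * sum(current_samples_a) * sample_period_s

def mean_power_watts(current_samples_a, voltage_v):
    """Average power over the run: V times the mean sampled current."""
    return voltage_v * sum(current_samples_a) / len(current_samples_a)
```

For example, three one-second samples of 0.5 A, 0.5 A and 1.0 A at 12 V correspond to 24 J of consumed energy and a mean power of 8 W.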
The execution time was measured using the PAPI library (PAPI_get_real_nsec() function) on all testbeds.

Performance and Energy Consumption Study
This section presents the execution time and energy consumption results for the modules described in Section 3, measured using the equipment described in Section 5 on the testbeds listed in Section 2.
The testing procedure consisted of energy consumption and execution time measurements. It included a series of 10 executions for each combination of input data set and input feature set.
Appropriate preparatory steps were performed prior to each execution, in particular removing the results of previous computations and clearing the caches and swap.
The aggregate result is calculated as the median of the measured results. The median is used to capture the central tendency of the benchmarking results and to filter out values that skew the results (for example, abnormally large values caused by transient system processes).
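The effect of this choice can be illustrated with a short sketch using Python's standard `statistics` module. The ten timings below are hypothetical values chosen to include one outlier; they are not results from this study.

```python
import statistics

def aggregate(measurements):
    """Aggregate a series of repeated measurements by their median."""
    return statistics.median(measurements)

# Hypothetical execution times (seconds) of 10 runs, one disturbed by a system process.
runs = [10.2, 10.1, 10.3, 10.2, 10.1, 10.2, 10.4, 10.2, 25.0, 10.3]
median_time = aggregate(runs)
mean_time = statistics.mean(runs)
```

Here the median stays at the typical run time, whereas the mean is pulled upward by the single disturbed run.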
Input data for parallel processing was duplicated, so each MPI rank processes a separate copy of the input data. This mirrors the real case of archive processing, in which each MPI rank processes a separate image. The numbers of MPI processes and CUDA threads were carefully adjusted according to the available number of cores and the implementation for each architecture. It is important to note that the measuring tools used in this research (see Section 5) observe the global energy consumption of the system, not just the energy consumed by the module under study. The energy consumption results listed in Tab. 3 and Tab. 4 therefore refer to the total consumption of each testbed during execution of the studied module.
As stated above, the number of processes and threads was selected according to the architecture requirements and implementation details. For the Intel architectures in particular, the optimal number of MPI processes equals the number of physical cores, and the number of OpenMP threads equals the number of hyperthreads per core (2 OpenMP threads per MPI process for the Broadwell testbed and 4 OpenMP threads per MPI process for the KNL testbed). For the ARM testbed, only MPI processes were used. Finally, only CUDA threads were used for the Jetson testbed. Thus, the number of images processed in parallel differs for each testbed, and the execution time and consumed energy also vary over a wide range. It is therefore difficult to identify the most appropriate testbed for these modules.
In this context, it is important to point out that rapid technological progress in multispectral imaging brings new methods and challenges into the analysis and interpretation of hyperspectral data sets. This, in turn, leads to re-processing of the data collected over the previous years, so the re-processing procedure is carried out systematically.
According to the current data, one Visible Infrared Imaging Radiometer Suite (VIIRS) day/night band (DNB) image corresponds to 5 minutes of observation data. Therefore, the observation data archived for 1 year contains approximately 52560 DNB images. Table 5 shows the estimated time and energy required to process a 1-year archive using the Sharpness module, based on the experimental results mentioned above. As shown in Tab. 5, the lowest energy consumption (13560 kJ) is reported for the Jetson testbed. The Broadwell testbed follows as a close second, with an estimated energy consumption of 16337 kJ to process a 1-year archive. However, the execution time is much longer on the Jetson testbed (481 hours, or about 20 days), while the Broadwell testbed should complete processing of the archive within 18 hours.
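The arithmetic behind these estimates can be checked with a short sketch. The totals are the ones quoted above from Tab. 5. The source does not state how the figure of 52560 images is obtained; the `night_fraction` of 0.5 is our assumption, being one reading that reproduces it from 5-minute images. All function names are hypothetical.

```python
def images_per_year(minutes_per_image=5, night_fraction=0.5):
    """Images per year, assuming only a fraction of the 5-minute images is nighttime."""
    return int(365 * 24 * 60 / minutes_per_image * night_fraction)

def average_power_watts(total_energy_kj, total_hours):
    """Average power draw implied by an archive-level energy and time estimate."""
    return total_energy_kj * 1000.0 / (total_hours * 3600.0)

def energy_per_image_joules(total_energy_kj, n_images):
    """Energy attributed to one image when the whole archive is processed."""
    return total_energy_kj * 1000.0 / n_images

n_images = images_per_year()
jetson_power = average_power_watts(13560, 481)     # Jetson totals quoted from Tab. 5
broadwell_power = average_power_watts(16337, 18)   # Broadwell totals quoted from Tab. 5
jetson_per_image = energy_per_image_joules(13560, n_images)
broadwell_per_image = energy_per_image_joules(16337, n_images)
```

Under these assumptions the quoted totals imply an average draw of roughly 8 W for the Jetson testbed and roughly 250 W for the Broadwell testbed, which is consistent with similar total energy being spent over very different processing times.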
As an alternative solution, a GPU cluster can be constructed to reduce the computation time and increase the performance. While this approach could shorten the processing time, energy consumption would increase due to communication costs. Moreover, it is worth noting that increasing the number of components affects the resilience of the solution.

Conclusion
This paper presents a study of execution time and energy consumption on different testbeds while running multispectral image processing modules.
Intel® Xeon® E5 and NVIDIA Jetson TX2 demonstrated the most efficient results with respect to the computation performance and energy consumption criteria.
As a result, Intel® Xeon® E5 can be recommended for periodic re-processing of large archives of multispectral images in a reasonable time (days or weeks of full HPC cluster load). NVIDIA Jetson TX2 could be used for near real-time image processing, for example at a direct receiving station, because it shows good results in per-image processing.
However, the ARM testbed must be studied further to fully exploit its potential. We intend to continue this work in the following directions:
• In the near future, we plan to study other types of architectures, including Russian VLIW Elbrus CPUs and Intel® Xeon® Scalable Processors.
• We plan to carry out a more detailed analysis of correlations between energy consumption and other performance metrics, including cache misses, the number of cycles and executed instructions, and so on.
Designing an energy-efficient system for processing multispectral observation data is a complex task that introduces new programming and optimization challenges. However, the results listed in this paper could be helpful in selecting the most appropriate architectures.
Testing results for the Sharpness module are listed in Tab. 3; results for the Cross-Correlation module are shown in Tab. 4.

Table 2. Data Specifications

Table 3. Sharpness Module Execution Statistics

Table 4. Cross-Correlation Module Execution Statistics

Table 5. Sharpness Module's Estimated Time To Process a 1-year Archive