State of the Art and Future Trends in Data Reduction for High-Performance Computing

Authors

  • Kira Duwe, University of Hamburg
  • Jakob Lüttgau
  • Georgiana Mania
  • Jannek Squar
  • Anna Fuchs
  • Michael Kuhn
  • Eugen Betke
  • Thomas Ludwig

DOI: https://doi.org/10.14529/jsfi200101

Abstract

Research into data reduction techniques has gained popularity in recent years as storage capacity and performance have become growing concerns. This survey paper provides an overview of leverage points found in high-performance computing (HPC) systems and of suitable mechanisms to reduce data volumes. We present the underlying theories and their application throughout the HPC stack, and also discuss related hardware acceleration and reduction approaches. After introducing relevant use cases, we give an overview of modern lossless and lossy compression algorithms and their respective usage at the application and file system layers. In anticipation of their increasing relevance for adaptive and in situ approaches, dimensionality reduction techniques are summarized with a focus on non-linear feature extraction. Adaptive approaches and in situ compression algorithms and frameworks follow. The key stages of deduplication and new opportunities for it are covered next. Finally, we discuss recomputation, an unconventional but promising method. We conclude the survey with an outlook on future developments.

References

Abdelfattah, M.S., Hagiescu, A., Singh, D.: Gzip on a chip: high performance lossless data compression on FPGAs using OpenCL. In: McIntosh-Smith, S., Bergen, B. (eds.) Proceedings of the International Workshop on OpenCL, IWOCL 2013 & 2014, 13-14 May 2013, Georgia Tech, Atlanta, GA, USA / 12-13 May 2014 Bristol, UK. pp. 4:1–4:9. ACM (2014), DOI: 10.1145/2664666.2664670

Ahrens, J.P., Geveci, B., Law, C.C.: ParaView: An End-User Tool for Large-Data Visualization. In: Hansen, C.D., Johnson, C.R. (eds.) The Visualization Handbook, pp. 717–731. Academic Press / Elsevier (2005), DOI: 10.1016/b978-012387582-2/50038-1

Ainsworth, M., Tugluk, O., Whitney, B., et al.: Multilevel techniques for compression and reduction of scientific data - the univariate case. Computat. and Visualiz. in Science 19(5-6), 65–76 (2018), DOI: 10.1007/s00791-018-00303-9

Ainsworth, M., Tugluk, O., Whitney, B., et al.: Multilevel Techniques for Compression and Reduction of Scientific Data - The Multivariate Case. SIAM J. Scientific Computing 41(2), A1278–A1303 (2019), DOI: 10.1137/18M1166651

Ainsworth, M., Tugluk, O., Whitney, B., et al.: Multilevel Techniques for Compression and Reduction of Scientific Data-Quantitative Control of Accuracy in Derived Quantities. SIAM J. Scientific Computing 41(4), A2146–A2171 (2019), DOI: 10.1137/18M1208885

Ajdari, M., Park, P., Kim, J., et al.: CIDR: A cost-effective in-line data reduction system for terabit-per-second scale SSD arrays. In: 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019, 16-20 Feb. 2019, Washington, DC, USA. pp. 28–41. IEEE (2019), DOI: 10.1109/HPCA.2019.00025

Akenine-M¨oller, T., Str¨om, J.: Graphics Processing Units for Handhelds. Proceedings of the IEEE 96(5), 779–789 (2008), DOI: 10.1109/JPROC.2008.917719

Alakuijala, J., Farruggia, A., Ferragina, P., et al.: Brotli: A general-purpose data compressor. ACM Trans. Inf. Syst. 37(1), 4:1–4:30 (2019), DOI: 10.1145/3231935

Alameldeen, A.R., Wood, D.A.: Adaptive Cache Compression for High-Performance Processors. In: 31st International Symposium on Computer Architecture, ISCA 2004, 19-23 June 2004, Munich, Germany. pp. 212–223. IEEE Computer Society (2004), DOI: 10.1109/ISCA.2004.1310776

Alforov, Y., Ludwig, T., Novikova, A., et al.: Towards Green Scientific Data Compression Through High-Level I/O Interfaces. In: 30th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2018, 24-27 Sept. 2018, Lyon, France. pp. 209–216. IEEE (2018), DOI: 10.1109/CAHPC.2018.8645921

Alted, F.: Blosc2-Meets-Rome. https://blosc.org (2019), accessed: 2020-02-17

Alvarez, D., Cais, A.O., Geimer, M., et al.: Scientific Software Management in Real Life: Deployment of EasyBuild on a Large Scale System. In: 2016 Third International Workshop on HPC User Support Tools, HUST@SC 2016, 13 Nov. 2016, Salt Lake City, UT, USA. pp. 31–40. IEEE Computer Society (2016), DOI: 10.1109/HUST.2016.009

Amlekar, S.: Compression support in Spectrum Scale 5.0.0. https://developer.ibm.com/storage/2018/01/11/compression-support-spectrum-scale-5-0-0/ (2018), accessed: 2020-02-20

Ayachit, U., Bauer, A.C., Geveci, B., et al.: ParaView Catalyst: Enabling In Situ Data Analysis and Visualization. In: Weber, G.H. (ed.) Proceedings of the First Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization, ISAV 2015, 15-20 Nov. 2015, Austin, TX, USA. pp. 25–29. ACM (2015), DOI: 10.1145/2828612.2828624

Azzurri, P.: Track Reconstruction Performance in CMS. Nuclear Physics B - Proceedings Supplements 197(1), 275–278 (2009), DOI: 10.1016/j.nuclphysbps.2009.10.084

Baker, A.H., Hammerling, D., Turton, T.L.: Evaluating image quality measures to assess the impact of lossy data compression applied to climate simulation data. Comput. Graph. Forum 38(3), 517–528 (2019), DOI: 10.1111/cgf.13707

Balkenhol, B., Kurtz, S.: Universal Data Compression Based on the Burrows-Wheeler Transformation: Theory and Practice. IEEE Trans. Computers 49(10), 1043–1053 (2000), DOI: 10.1109/12.888040

Balle, J., Laparra, V., Simoncelli, E.P.: End-to-end Optimized Image Compression. CoRR abs/1611.01704 (2016), http://arxiv.org/abs/1611.01704

Ballester-Ripoll, R., Lindstrom, P., Pajarola, R.: TTHRESH: Tensor Compression for Multidimensional Visual Data. CoRR abs/1806.05952 (2018), http://arxiv.org/abs/1806.05952

Barbay, J.: Optimal Prefix Free Codes with Partial Sorting. Algorithms 13(1), 12 (2020), DOI: 10.3390/a13010012

Barr, K.C., Asanovic, K.: Energy-aware lossless data compression. ACM Trans. Comput. Syst. 24(3), 250–291 (2006), DOI: 10.1145/1151690.1151692

Baudat, G., Anouar, F.: Generalized Discriminant Analysis Using a Kernel Approach. Neural Computation 12(10), 2385–2404 (2000), DOI: 10.1162/089976600300014980

Bellman, R., Lee, E.: History and development of dynamic programming. IEEE Control Systems Magazine 4(4), 24–28 (1984), DOI: 10.1109/MCS.1984.1104824

Bogaardt, L., Goncalves, R., Zurita-Milla, R., et al.: Dataset Reduction Techniques to Speed Up SVD Analyses on Big Geo-Datasets. ISPRS Int. J. Geo-Information 8(2), 55 (2019), DOI: 10.3390/ijgi8020055

Bookstein, A., Klein, S.T.: Is Huffman coding dead? Computing 50(4), 279–296 (1993), DOI: 10.1007/BF02243872

Boyuka II, D.A., Lakshminarasimhan, S., Zou, X., et al.: Transparent in Situ Data Transformations in ADIOS. In: 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2014, 26-29 May 2014, Chicago, IL, USA. pp. 256–266. IEEE Computer Society (2014), DOI: 10.1109/CCGrid.2014.73

Bricman, P.A., Ionescu, R.T.: CocoNet: A deep neural network for mapping pixel coordinates to color values. CoRR abs/1805.11357 (2018), http://arxiv.org/abs/1805.11357

Brinckman, A., Chard, K., Gaffney, N., et al.: Computing environments for reproducibility: Capturing the “Whole Tale”. Future Generation Comp. Syst. 94, 854–867 (2019), DOI: 10.1016/j.future.2017.12.029

Canal, R., Gonzalez, A., Smith, J.E.: Very low power pipelines using significance compression. In: Wolfe, A., Schlansker, M.S. (eds.) Proc. of the 33rd Annual IEEE/ACM Int. Symposium on Microarchitecture, MICRO 33, 10-13 Dec. 2000, Monterey, California, USA. pp. 181–190. ACM/IEEE Computer Society (2000), DOI: 10.1109/MICRO.2000.898069

Cappello, F., Di, S., Li, S., et al.: Use cases of lossy compression for floating-point data in scientific data sets. IJHPCA 33(6) (2019), DOI: 10.1177/1094342019853336

Chao, G., Luo, Y., Ding, W.: Recent Advances in Supervised Dimension Reduction: A Survey. Machine Learning and Knowledge Extraction 1(1), 341–358 (2019), DOI: 10.3390/make1010020

Chen, K., Ramabadran, T.V.: Near-lossless compression of medical images through entropy-coded DPCM. IEEE Trans. Med. Imaging 13(3), 538–548 (1994), DOI: 10.1109/42.310885

Chen, X., Yang, L., Dick, R.P., et al.: C-Pack: A High-Performance Microprocessor Cache Compression Algorithm. IEEE Trans. VLSI Syst. 18(8), 1196–1208 (2010), DOI: 10.1109/TVLSI.2009.2020989

Chen, X., Benson, J., Peterson, M., et al.: KeyBin2: Distributed Clustering for Scalable and In-Situ Analysis. In: Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018, 13-16 Aug. 2018, Eugene, OR, USA. pp. 34:1–34:10. ACM (2018), DOI: 10.1145/3225058.3225149

Chen, Z., Son, S.W., Hendrix, W., et al.: NUMARCK: machine learning algorithm for resiliency and checkpointing. In: Damkroger, T., Dongarra, J.J. (eds.) International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2014, 16-21 Nov. 2014, New Orleans, LA, USA. pp. 733–744. IEEE Computer Society (2014), DOI: 10.1109/SC.2014.65

Childs, H., Brugger, E., Whitlock, B., et al.: Visit. In: Bethel, E.W., Childs, H., Hansen, C.D. (eds.) High Performance Visualization - Enabling Extreme-Scale Scientific Insight. Chapman and Hall / CRC computational science series, CRC Press (2012), DOI: 10.1201/b12985-21

Cliff, A., Romero, J., Kainer, D., et al.: A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks. Genes 10(12), 996 (2019), DOI: 10.3390/genes10120996

Critchlow, T., van Dam, K.K.: Data-Intensive Science. CRC Press (2013)

Cunningham, J.P., Ghahramani, Z.: Linear dimensionality reduction: survey, insights, and generalizations. J. Mach. Learn. Res. 16, 2859–2900 (2015), http://dl.acm.org/citation.cfm?id=2912091

Delaunay, X., Courtois, A., Gouillon, F.: Evaluation of lossless and lossy algorithms for the compression of scientific datasets in netCDF-4 or HDF5 files. Geoscientific Model Development 12(9), 4099–4113 (2019), DOI: 10.5194/gmd-12-4099-2019

Di, S., Cappello, F.: Fast Error-Bounded Lossy HPC Data Compression with SZ. In: 2016 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2016, 23-27 May 2016, Chicago, IL, USA. pp. 730–739. IEEE Computer Society (2016), DOI: 10.1109/IPDPS.2016.11

Diederich, M., Doerk, T., Muehge, T., et al.: Decision-based data compression by means of deep learning technologies (2018), https://patentswarm.com/patents/US20190221192A1, application US 20180277068 A1

Dorier, M., Sisneros, R., Peterka, T., et al.: Damaris/Viz: A nonintrusive, adaptable and user-friendly in situ visualization framework. In: Geveci, B., Pfister, H., Vishwanath, V. (eds.) IEEE Symposium on Large-Scale Data Analysis and Visualization, LDAV 2013, 13-14 Oct. 2013, Atlanta, Georgia, USA. pp. 67–75. IEEE Computer Society (2013), DOI: 10.1109/LDAV.2013.6675160

Duque, E.P., Hiepler, D.E., Haimes, R., et al.: EPIC - An Extract Plug-In Components Toolkit for In-Situ Data Extracts Architecture. DOI: 10.2514/6.2015-3410

Filgueira, R., Singh, D.E., Pichel, J.C., et al.: Exploiting data compression in collective I/O techniques. In: Proceedings of the 2008 IEEE International Conference on Cluster Computing, 29 Sept.-1 Oct. 2008, Tsukuba, Japan. pp. 479–485. IEEE Computer Society (2008), DOI: 10.1109/CLUSTR.2008.4663811

Fogal, T., Proch, F., Schiewe, A., et al.: Freeprocessing: Transparent in situ Visualization via Data Interception. In: Amor, M., Hadwiger, M. (eds.) Eurographics Symposium on Parallel Graphics and Visualization, Swansea, Wales, UK. pp. 49–56. Eurographics Association (2014), DOI: 10.2312/pgv.20141084

Fournier, Q., Aloise, D.: Empirical Comparison between Autoencoders and Traditional Dimensionality Reduction Methods. In: 2nd IEEE International Conference on Artificial Intelligence and Knowledge Engineering, AIKE 2019, 3-5 June 2019, Sardinia, Italy. pp. 211–214. IEEE (2019), DOI: 10.1109/AIKE.2019.00044

Fowers, J., Kim, J., Burger, D., et al.: A Scalable High-Bandwidth Architecture for Lossless Compression on FPGAs. In: 23rd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2015, 2-6 May 2015, Vancouver, BC, Canada. pp. 52–59. IEEE Computer Society (2015), DOI: 10.1109/FCCM.2015.46

Fukunaga, K., Olsen, D.R.: An Algorithm for Finding Intrinsic Dimensionality of Data. IEEE Trans. Computers 20(2), 176–183 (1971), DOI: 10.1109/T-C.1971.223208

Gamblin, T., LeGendre, M.P., Collette, M.R., et al.: The Spack package manager: bringing order to HPC software chaos. In: Kern, J., Vetter, J.S. (eds.) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015, 15-20 Nov. 2015, Austin, TX, USA. pp. 40:1–40:12. ACM (2015), DOI: 10.1145/2807591.2807623

Geist, A., Reed, D.A.: A survey of high-performance computing scaling challenges. IJHPCA 31(1), 104–113 (2017), DOI: 10.1177/1094342015597083

Geist II, G.A., Kohl, J.A., Papadopoulos, P.M.: Cumulvs: Providing Fault Tolerance, Visualization, and Steering of Parallel Applications. IJHPCA 11(3), 224–235 (1997), DOI: 10.1177/109434209701100305

Gilchrist, J.: Parallel data compression with bzip2. In: Proc. of the 16th IASTED int. conf. on parallel and distributed computing and systems. vol. 16, pp. 559–564 (2004)

Godlove, D.: Singularity: Simple, secure containers for compute-driven workloads. In: Furlani, T.R. (ed.) Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning), PEARC 2019, 28 July-1 Aug. 2019, Chicago, IL, USA. pp. 24:1–24:4. ACM (2019), DOI: 10.1145/3332186.3332192

Goyal, M., Tatwawadi, K., Chandak, S., et al.: DeepZip: Lossless Data Compression using Recurrent Neural Networks. CoRR abs/1811.08162 (2018), http://arxiv.org/abs/1811.08162

Gupta, A., G¨unther, U., Incardona, P., et al.: A Proposed Framework for Interactive Virtual Reality In Situ Visualization of Parallel Numerical Simulations. CoRR abs/1909.02986 (2019), http://arxiv.org/abs/1909.02986

Guyon, I., Elisseeff, A.: An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 3, 1157–1182 (2003), http://jmlr.org/papers/v3/guyon03a.html

Hadjidoukas, P.E., Wermelinger, F.: A Parallel Data Compression Framework for Large Scale 3D Scientific Data. CoRR abs/1903.07761 (2019), http://arxiv.org/abs/1903.07761

Halevi, S., Harnik, D., Pinkas, B., et al.: Proofs of ownership in remote storage systems. In: Chen, Y., Danezis, G., Shmatikov, V. (eds.) Proceedings of the 18th ACM Conference on Computer and Communications Security, CCS 2011, 17-21 Oct. 2011, Chicago, Illinois, USA. pp. 491–500. ACM (2011), DOI: 10.1145/2046707.2046765

Halkiadakis, E.: Proceedings for TASI 2009 Summer School on “Physics of the Large and the Small”: Introduction to the LHC experiments (2010)

Higgins, J., Holmes, V., Venters, C.C.: Orchestrating Docker Containers in the HPC Environment. In: Kunkel, J.M., Ludwig, T. (eds.) High Performance Computing - 30th Int. Conf., 12-16 July 2015, Frankfurt, Germany. Lecture Notes in Computer Science, vol. 9137, pp. 506–513. Springer (2015), DOI: 10.1007/978-3-319-20119-1_36

Hu, X., Wang, F., Li, W., et al.: QZFS: QAT Accelerated Compression in File System for Application Agnostic and Cost Efficient Data Storage. In: Malkhi, D., Tsafrir, D. (eds.) 2019 USENIX Annual Technical Conference, USENIX ATC 2019, 10-12 July 2019, Renton, WA, USA. pp. 163–176. USENIX Association (2019), https://www.usenix.org/conference/atc19/presentation/hu-xiaokang

Huang, C., Harris, R.W.: A comparison of several vector quantization codebook generation approaches. IEEE Trans. Image Processing 2(1), 108–112 (1993), DOI: 10.1109/83.210871

Ibarria, L., Lindstrom, P., Rossignac, J., et al.: Out-of-core Compression and Decompression of Large n-dimensional Scalar Fields. Comput. Graph. Forum 22(3), 343–348 (2003), DOI: 10.1111/1467-8659.00681

Iturbide, M., Bedia, J., Garcia, S.H., et al.: The R-based climate4R open framework for reproducible climate data access and post-processing. Environmental Modelling and Software 111, 42–54 (2019), DOI: 10.1016/j.envsoft.2018.09.009

Jimenez, I., Sevilla, M., Watkins, N., et al.: The Popper Convention: Making Reproducible Systems Evaluation Practical. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPS Workshops 2017, 29 May-2 June 2017, Orlando / Buena Vista, FL, USA. pp. 1561–1570. IEEE Computer Society (2017), DOI: 10.1109/IPDPSW.2017.157

Jin, S., Di, S., Liang, X., et al.: DeepSZ: A Novel Framework to Compress Deep Neural Networks by Using Error-Bounded Lossy Compression. CoRR abs/1901.09124 (2019), http://arxiv.org/abs/1901.09124

Kaiser, J., Gad, R., Suß, T., et al.: Deduplication Potential of HPC Applications’ Checkpoints. In: 2016 IEEE International Conference on Cluster Computing, CLUSTER 2016, 12-16 Sept. 2016, Taipei, Taiwan. pp. 413–422. IEEE Computer Society, DOI: 10.1109/CLUSTER.2016.32

Kane, J., Yang, Q.: Compression Speed Enhancements to LZO for Multi-core Systems. In: Panetta, J., Moreira, J.E., Padua, D.A., et al. (eds.) IEEE 24th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2012, 24-26 Oct. 2012, New York, NY, USA. pp. 108–115. IEEE Computer Society (2012), DOI: 10.1109/SBAC-PAD.2012.29

Kraska, T., Beutel, A., Chi, E.H., et al.: The Case for Learned Index Structures. In: Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, 10-15 June 2018, Houston, TX, USA. pp. 489–504 (2018), DOI: 10.1145/3183713.3196909

Kress, J.: In Situ Visualization Techniques for High Performance Computing. http://www.cs.uoregon.edu/Reports/AREA-201703-Kress.pdf (2017), accessed: 2020-01-23

Kuhn, M., Kunkel, J., Ludwig, T.: Data Compression for Climate Data. Supercomputing Frontiers and Innovations 3(1), 75–94 (2016), DOI: 10.14529/jsfi160105

Kumar, A., Zhu, X., Tu, Y., et al.: Compression in Molecular Simulation Datasets. In: Sun, C., Fang, F., Zhou, Z., et al. (eds.) Intelligence Science and Big Data Engineering - 4th International Conference, IScIDE 2013, 31 July-2 Aug. 2013, Beijing, China, Revised Selected Papers. Lecture Notes in Computer Science, vol. 8261, pp. 22–29. Springer (2013), DOI: 10.1007/978-3-642-42057-3_4

Kunkel, J., Novikova, A., Betke, E.: Towards Decoupling the Selection of Compression Algorithms from Quality Constraints An Investigation of Lossy Compression Efficiency. Supercomputing Frontiers and Innovations 4(4) (2017), DOI: 10.14529/jsfi170402

Kurtzer, G.M., Sochat, V., Bauer, M.W.: Singularity: Scientific containers for mobility of compute. PLOS ONE 12(5), 1–20 (2017), DOI: 10.1371/journal.pone.0177459

Lakhani, G.: Reducing coding redundancy in LZW. Inf. Sci. 176(10), 1417–1434 (2006), DOI: 10.1016/j.ins.2005.03.007

Lakshminarasimhan, S., Shah, N., Ethier, S., et al.: Compressing the Incompressible with ISABELA: In-situ Reduction of Spatio-temporal Data. In: Jeannot, E., Namyst, R., Roman, J. (eds.) Euro-Par 2011 Parallel Processing - 17th International Conference, Euro-Par 2011, 29 Aug.-2 Sept. 2011, Bordeaux, France, Proceedings, Part I. Lecture Notes in Computer Science, vol. 6852, pp. 366–379. Springer (2011), DOI: 10.1007/978-3-642-23400-2_34

Lakshminarasimhan, S., Shah, N., Ethier, S., et al.: ISABELA for effective in situ compression of scientific data. Concurrency and Computation: Practice and Experience 25(4), 524–540 (2013), DOI: 10.1002/cpe.2887

Larsen, M., Brugger, E., Childs, H., et al.: Strawman: A Batch In Situ Visualization and Analysis Infrastructure for Multi-Physics Simulation Codes. In: Weber, G.H. (ed.) Proceedings of the First Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization, ISAV 2015, 15-20 Nov. 2015, Austin, TX, USA. pp. 30–35. ACM (2015), DOI: 10.1145/2828612.2828625

Lee, S.M., Jang, J.H., Oh, J., et al.: Design of hardware accelerator for Lempel-Ziv 4 (LZ4) compression. IEICE Electronic Express 14(11), 20170399 (2017), DOI: 10.1587/elex.14.20170399

Li, B., Zhang, L., Shang, Z., et al.: Implementation of LZMA compression algorithm on FPGA. Electronics Letters 50(21), 1522–1524 (2014), DOI: 10.1049/el.2014.1734

Li, S., Marsaglia, N., Garth, C., et al.: Data Reduction Techniques for Simulation, Visualization and Data Analysis. Comput. Graph. Forum 37(6), 422–447 (2018), DOI: 10.1111/cgf.13336

Li, W., Yao, Y.: Accelerate Data Compression in File System. In: Bilgin, A., Marcellin, M.W., Serra-Sagrist`a, J., et al. (eds.) 2016 Data Compression Conference, DCC 2016, 30 March-1 April 2016, Snowbird, UT, USA. p. 615. IEEE (2016), DOI: 10.1109/DCC.2016.24

Liang, X., Di, S., Tao, D., et al.: Error-Controlled Lossy Compression Optimized for High Compression Ratios of Scientific Datasets. In: Abe, N., Liu, H., Pu, C., et al. (eds.) IEEE International Conference on Big Data, Big Data 2018, 10-13 Dec. 2018, Seattle, WA, USA. pp. 438–447. IEEE (2018), DOI: 10.1109/BigData.2018.8622520

Lin, J., Hu, Y., Liu, D.: Deep Learning-Based Video Coding (DLVC). http://dlvc.bitahub.com/ (2020), accessed: 2020-02-20

Lindstrom, P.: Fixed-Rate Compressed Floating-Point Arrays. IEEE Trans. Vis. Comput. Graph. 20(12), 2674–2683 (2014), DOI: 10.1109/TVCG.2014.2346458

Lindstrom, P., Isenburg, M.: Fast and Efficient Compression of Floating-Point Data. IEEE Trans. Vis. Comput. Graph. 12(5), 1245–1250 (2006), DOI: 10.1109/TVCG.2006.143

Liu, D., Li, Y., Lin, J., et al.: Deep Learning-Based Video Coding: A Review and A Case Study. CoRR abs/1904.12462 (2019), http://arxiv.org/abs/1904.12462

Liu, Q., Hazarika, S., Patchett, J.M., et al.: Deep Learning-Based Feature-Aware Data Modeling for Complex Physics Simulations. CoRR abs/1912.03587 (2019), http://arxiv.org/abs/1912.03587

Liu, W., Mei, F., Wang, C., et al.: Data Compression Device Based on Modified LZ4 Algorithm. IEEE Trans. Consumer Electronics 64(1), 110–117 (2018), DOI: 10.1109/TCE.2018.2810480

Liu, Y., Wang, Y., Deng, L., et al.: A novel in situ compression method for CFD data based on generative adversarial network. J. Visualization 22(1), 95–108 (2019), DOI: 10.1007/s12650-018-0519-x

Lofstead, J.F., Baker, J., Younge, A.: Data Pallets: Containerizing Storage for Reproducibility and Traceability. In: Weiland, M., Juckeland, G., Alam, S.R., et al. (eds.) High Performance Computing - ISC High Performance 2019 InternationalWorkshops, 16-20 June 2019, Frankfurt, Germany, Revised Selected Papers. Lecture Notes in Computer Science, vol. 11887, pp. 36–45. Springer (2019), DOI: 10.1007/978-3-030-34356-9_4

Lu, T., Liu, Q., He, X., et al.: Understanding and Modeling Lossy Compression Schemes on HPC Scientific Data. In: 2018 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018, 21-25 May 2018, Vancouver, BC, Canada. pp. 348–357. IEEE Computer Society (2018), DOI: 10.1109/IPDPS.2018.00044

Lu, Z.M., Guo, S.Z.: Chapter 1 - Introduction. In: Lu, Z.M., Guo, S.Z. (eds.) Lossless Information Hiding in Images, pp. 1–68. Syngress (2017), DOI: 10.1016/B978-0-12-812006-4.00001-2

Lundborg, M., Apostolov, R., Spangberg, D., et al.: An efficient and extensible format, library, and API for binary trajectory data from molecular simulations. Journal of Computational Chemistry 35(3), 260–269 (2014), DOI: 10.1002/jcc.23495

Ma, C., Jung, J., Kim, S., et al.: Random projection-based partial feature extraction for robust face recognition. Neurocomputing 149, 1232–1244 (2015), DOI: 10.1016/j.neucom.2014.09.004

van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (2008)

van der Maaten, L., Postma, E., van den Herik, J.: Dimensionality reduction: a comparative review. Journal of Machine Learning Research 10(66-71), 13 (2009)

Magenheimer, D.: In-kernel memory compression. https://lwn.net/Articles/545244/ (2013), accessed: 2020-02-20

Mahoney, M.: Data Compression Explained. http://mattmahoney.net/dc/dce.html#Section_524 (2013), accessed: 2020-02-20

Marsaglia, N., Li, S., Belcher, K., et al.: Dynamic I/O Budget Reallocation For In Situ Wavelet Compression. In: Childs, H., Frey, S. (eds.) Eurographics Symposium on Parallel Graphics and Visualization, EGPGV 2019, 3-4 June 2019, Porto, Portugal. pp. 1–5. Eurographics Association (2019), DOI: 10.2312/pgv.20191104

Martel, E., Lazcano, R., Lopez, J.F., et al.: Implementation of the Principal Component Analysis onto High-Performance Computer Facilities for Hyperspectral Dimensionality Reduction: Results and Comparisons. Remote Sensing 10(6), 864 (2018), DOI: 10.3390/rs10060864

Martinez, A.M., Kak, A.C.: PCA versus LDA. IEEE Trans. Pattern Anal. Mach. Intell. 23(2), 228–233 (2001), DOI: 10.1109/34.908974

Masek, P., Stusek, M., Krejci, J., et al.: Unleashing Full Potential of Ansible Framework: University Labs Administration. In: 22nd Conference of Open Innovations Association, FRUCT 2018, 15-18 May 2018, Jyvaskyla, Finland. pp. 144–150. IEEE (2018), DOI: 10.23919/FRUCT.2018.8468270

Matthes, A., Huebl, A., Widera, R., et al.: In situ, steerable, hardware-independent and data-structure agnostic visualization with ISAAC. Supercomputing Frontiers and Innovations 3(4), 30–48 (2016), DOI: 10.14529/jsfi160403

McInnes, L., Healy, J., Melville, J.: Umap: Uniform manifold approximation and projection for dimension reduction. CoRR abs/1802.03426 (2018), https://arxiv.org/abs/1802.03426

Mecum, B.D., Jones, M.B., Vieglais, D., et al.: Preserving Reproducibility: Provenance and Executable Containers in DataONE Data Packages. In: 14th IEEE International Conference on e-Science, e-Science 2018, 29 Oct.-1 Nov. 2018, Amsterdam, The Netherlands. pp. 45–49. IEEE Computer Society (2018), DOI: 10.1109/eScience.2018.00019

Meister, D., Kaiser, J., Brinkmann, A., et al.: A study on data deduplication in HPC storage systems. In: Hollingsworth, J.K. (ed.) SC Conference on High Performance Computing Networking, Storage and Analysis, SC ’12, 11-15 Nov. 2012, Salt Lake City, UT, USA. p. 7. IEEE/ACM (2012), DOI: 10.1109/SC.2012.14

Menegidio, F.B., Jabes, D.L., de Oliveira, R.C., et al.: Dugong: a Docker image, based on Ubuntu Linux, focused on reproducibility and replicability for bioinformatics analyses. Bioinformatics 34(3), 514–515 (2018), DOI: 10.1093/bioinformatics/btx554

Mentzer, F., Agustsson, E., Tschannen, M., et al.: Practical Full Resolution Learned Lossless Image Compression. CoRR abs/1811.12817 (2018), http://arxiv.org/abs/1811.12817

Moffat, A.: Huffman Coding. ACM Comput. Surv. 52(4), 85:1–85:35 (2019), DOI: 10.1145/3342555

Muthitacharoen, A., Chen, B., Mazieres, D.: A Low-Bandwidth Network File System. In: Marzullo, K., Satyanarayanan, M. (eds.) Proceedings of the 18th ACM Symposium on Operating System Principles, SOSP 2001, 21-24 Oct. 2001, Chateau Lake Louise, Banff, Alberta, Canada. pp. 174–187. ACM (2001), DOI: 10.1145/502034.502052

Norton, A., Clyne, J.P.: The VAPOR Visualization Application. In: Bethel, E.W., Childs, H., Hansen, C.D. (eds.) High Performance Visualization - Enabling Extreme-Scale Scientific Insight. Chapman and Hall / CRC computational science series, CRC Press (2012), DOI: 10.1201/b12985-25

Ohtani, H., Hagita, K., Ito, A.M., et al.: Irreversible data compression concepts with polynomial fitting in time-order of particle trajectory for visualization of huge particle system. Journal of Physics: Conference Series 454, 012078 (2013), DOI: 10.1088/1742-6596/454/1/012078

Park, J., Park, H., Choi, Y.: Data compression and prediction using machine learning for industrial IoT. In: 2018 International Conference on Information Networking, ICOIN 2018, 10-12 Jan. 2018, Chiang Mai, Thailand. pp. 818–820 (2018), DOI: 10.1109/ICOIN.2018.8343232

Plugariu, O., Gegiu, A.D., Petrica, L.: FPGA systolic array GZIP compressor. In: 2017 9th International Conference on Electronics, Computers and Artificial Intelligence (ECAI). pp. 1–6. IEEE (2017), DOI: 10.1109/ECAI.2017.8166387

Portner, A., Hoffmann, M., Zug, S., et al.: SwarmRob: A Docker-Based Toolkit for Reproducibility and Sharing of Experimental Artifacts in Robotics Research. In: IEEE International Conference on Systems, Man, and Cybernetics, SMC 2018, 7-10 Oct. 2018, Miyazaki, Japan. pp. 325–332. IEEE (2018), DOI: 10.1109/SMC.2018.00065

Qiao, Y., Fang, J., Hofstee, H.P.: An FPGA-based Snappy Decompressor-Filter (2018), DOI: 10.13140/RG.2.2.30215.44962

Qin, Z., Wang, J., Liu, Q., et al.: Estimating Lossy Compressibility of Scientific Data Using Deep Neural Networks. IEEE Letters of the Computer Society 3(1), 5–8 (2020), DOI: 10.1109/LOCS.2020.2971940

Rattanaopas, K., Kaewkeeree, S.: Improving Hadoop MapReduce performance with data compression: A study using wordcount job. In: 2017 14th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, ECTI-CON, 27-30 June 2017, Phuket, Thailand. pp. 564–567 (2017), DOI: 10.1109/ECTICon.2017.8096300

Rippel, O., Bourdev, L.D.: Real-Time Adaptive Image Compression. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, 6-11 Aug. 2017, Sydney, NSW, Australia. pp. 2922–2930 (2017), http://proceedings.mlr.press/v70/rippel17a.html

Rivia, M., Caloria, L., Muscianisia, G., et al.: In-situ Visualization: State-of-theart and Some Use Cases. http://www.prace-ri.eu/IMG/pdf/In-situ_Visualization_State-of-the-art_and_Some_Use_Cases-2.pdf (2012), accessed: 2020-02-20

Rober, N., Engels, J.F.: In-Situ Processing in Climate Science. In: Weiland, M., Juckeland, G., Alam, S.R., et al. (eds.) High Performance Computing - ISC High Performance 2019 International Workshops, 16-20 June 2019, Frankfurt, Germany, Revised Selected Papers. Lecture Notes in Computer Science, vol. 11887, pp. 612–622. Springer (2019), DOI: 10.1007/978-3-030-34356-9_46

Rougier, N.P., Hinsen, K., Alexandre, F., et al.: Sustainable computational science: the ReScience initiative. PeerJ Computer Science 3, e142 (2017), DOI: 10.7717/peerj-cs.142

Sahinalp, S.C., Rajpoot, N.M.: Chapter 6 - Dictionary-Based Data Compression: An Algorithmic Perspective. In: Sayood, K. (ed.) Lossless Compression Handbook, pp. 153–167. Communications, Networking and Multimedia, Academic Press, San Diego (2003), DOI: 10.1016/B978-012620861-0/50007-3

Salomon, D.: Data compression - The Complete Reference, 4th Edition. Springer (2007)

Samanta, R., Mahapatra, R.: An Enhanced CAM Architecture to Accelerate LZW Compression Algorithm. In: 20th International Conference on VLSI Design held jointly with 6th International Conference on Embedded Systems, VLSID’07, 6-10 Jan. 2007, Bangalore, India. pp. 824–829. IEEE (2007), DOI: 10.1109/VLSID.2007.34

Sasaki, N., Sato, K., Endo, T., et al.: Exploration of Lossy Compression for Application-Level Checkpoint/Restart. In: 2015 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2015, 25-29 May 2015, Hyderabad, India. pp. 914–922. IEEE Computer Society (2015), DOI: 10.1109/IPDPS.2015.67

Schendel, E.R., Jin, Y., Shah, N., et al.: ISOBAR Preconditioner for Effective and Highthroughput Lossless Data Compression. In: Kementsietsidis, A., Salles, M.A.V. (eds.) IEEE 28th International Conference on Data Engineering, ICDE 2012, 1-5 April 2012, Washington, DC, USA. pp. 138–149. IEEE Computer Society (2012), DOI: 10.1109/ICDE.2012.114

Setia, A., Ahlawat, P.: Enhanced LZW Algorithm with Less Compression Ratio. In: Proceedings of Int. Conf. on Advances in Computing. pp. 347–351. Springer India, New Delhi (2012), DOI: 10.1007/978-81-322-0740-5_41

Shadura, O., Bockelman, B.P.: ROOT I/O compression algorithms and their performance impact within run 3. CoRR abs/1906.04624 (2019), http://arxiv.org/abs/1906.04624

Shanmugasundaram, S., Lourdusamy, R.: A Comparative Study Of Text Compression Algorithms. ICTACT Journal on Communication Technology 1(3), 68–76 (2011), DOI: 10.21917/ijct.2011.0062

Shehabi, A., Smith, S., Sartor, D., et al.: United States Data Center Energy Usage Report (2016), DOI: 10.2172/1372902

Shibata, Y., Kida, T., Fukamachi, S., et al.: Byte Pair Encoding: A Text Compression Scheme That Accelerates Pattern Matching (1999), https://pdfs.semanticscholar.org/1e94/41bbad598e181896349757b82af42b6a6902.pdf

Shudler, S., Ferrier, N.J., Insley, J.A., et al.: Spack meets singularity: creating movable in-situ analysis stacks with ease. In: Moreland, K., Garth, C., Bethel, E.W., et al. (eds.) Proceedings of the Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization, ISAV@SC 2019, 18 Nov. 2019, Denver, Colorado, USA. pp. 34–38. ACM (2019), DOI: 10.1145/3364228.3364682

Silver, J., Zender, C.: The compression–error trade-off for large gridded data sets. Geoscientific Model Development 10, 413–423 (2017), DOI: 10.5194/gmd-10-413-2017

Simone, S.D.: Apple Open-Sources its New Compression Algorithm LZFSE (2016), https://www.infoq.com/news/2016/07/apple-lzfse-lossless-opensource/, accessed: 2020-02-20

Singhal, S., Sussman, A.: Adaptive Compression to Improve I/O Performance for Climate Simulations (2017), https://web.njit.edu/~qliu/assets/adaptive-compression-scheme(acomps).pdf, accessed: 2020-02-17

Srinivasan, R., Rao, K.R.: Predictive Coding Based on Efficient Motion Estimation. IEEE Trans. Communications 33(8), 888–896 (1985), DOI: 10.1109/TCOM.1985.1096398

Szorc, G.: Better Compression with Zstandard (2017), https://gregoryszorc.com/blog/2017/03/07/better-compression-with-zstandard, accessed: 2020-02-17

Tahghighi, M., Mousavi, M., Khadivi, P.: Hardware implementation of a novel adaptive version of Deflate compression algorithm. In: 2010 18th Iranian Conference on Electrical Engineering, 11-13 May 2010, Isfahan, Iran. pp. 566–569. IEEE (2010), DOI: 10.1109/IRANIANCEE.2010.5507007

Tajul, T.K., Bhuiyan, S.R., Habib, A.: Enhancement of LZAP (Lempel Ziv All Prefixes) Compression Algorithm. In: 2018 4th International Conference on Electrical Engineering and Information Communication Technology, iCEEiCT. pp. 69–73 (2018), DOI: 10.1109/CEEICT.2018.8628148

Tao, D., Di, S., Chen, Z., et al.: Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization. CoRR abs/1706.03791 (2017), http://arxiv.org/abs/1706.03791

Tao, D., Di, S., Guo, H., et al.: Z-checker: A framework for assessing lossy compression of scientific data. IJHPCA 33(2) (2019), DOI: 10.1177/1094342017737147

Tao, D., Di, S., Liang, X., et al.: Optimizing Lossy Compression Rate-Distortion from Automatic Online Selection between SZ and ZFP. IEEE Trans. Parallel Distrib. Syst. 30(8), 1857–1871 (2019), DOI: 10.1109/TPDS.2019.2894404

Tenenbaum, J.B., Silva, V.D., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000), https://science.sciencemag.org/content/sci/290/5500/2319.full.pdf

Toderici, G., Vincent, D., Johnston, N., et al.: Full Resolution Image Compression with Recurrent Neural Networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 21-26 July 2017, Honolulu, HI, USA. pp. 5435–5443 (2017), DOI: 10.1109/CVPR.2017.577

Underwood, R., Di, S., Calhoun, J.C., et al.: FRaZ: A Generic High-Fidelity Fixed-Ratio Lossy Compression Framework for Scientific Floating-point Data. CoRR abs/2001.06139 (2020), https://arxiv.org/abs/2001.06139

Vetterli, M., Kovacevic, J.: Wavelets and Subband Coding. Prentice Hall Signal Processing Series, Prentice Hall (1995)

Visualization and Analysis Software Team: VAPOR product roadmap. Tech. rep., NCAR (2017), https://ncar.github.io/vapor2website/sites/default/files/VAPORRoadmap.pdf

Welch, T.A.: A Technique for High-Performance Data Compression. IEEE Computer 17(6), 8–19 (1984), DOI: 10.1109/MC.1984.1659158

Welton, B., Kimpe, D., Cope, J., et al.: Improving I/O Forwarding Throughput with Data Compression. In: 2011 IEEE International Conference on Cluster Computing, CLUSTER, 26-30 Sept. 2011, Austin, TX, USA. pp. 438–445. IEEE Computer Society (2011), DOI: 10.1109/CLUSTER.2011.80

Whitlock, B., Favre, J.M., Meredith, J.S.: Parallel In Situ Coupling of Simulation with a Fully Featured Visualization System. In: Kuhlen, T.W., Pajarola, R., Zhou, K. (eds.) Eurographics Symposium on Parallel Graphics and Visualization, EGPGV 2011, Llandudno, Wales, UK. Proceedings. pp. 101–109. Eurographics Association (2011), DOI: 10.2312/EGPGV/EGPGV11/101-109

Widianto, E.D., Prasetijo, A.B., Ghufroni, A.: On the implementation of ZFS (Zettabyte File System) storage system. In: 2016 3rd International Conference on Information Technology, Computer, and Electrical Engineering, ICITACEE. pp. 408–413 (2016), DOI: 10.1109/ICITACEE.2016.7892481

Williams, R.N.: An Extremely Fast Ziv-Lempel Data Compression Algorithm. In: Storer, J.A., Reif, J.H. (eds.) Proceedings of the IEEE Data Compression Conference, DCC 1991, 8-11 April 1991, Snowbird, Utah, USA. pp. 362–371. IEEE Computer Society (1991), DOI: 10.1109/DCC.1991.213344

Xia, W., Jiang, H., Feng, D., et al.: A Comprehensive Study of the Past, Present, and Future of Data Deduplication. Proceedings of the IEEE 104(9), 1681–1710 (2016), DOI: 10.1109/JPROC.2016.2571298

Xie, H., Li, J., Xue, H.: A survey of dimensionality reduction techniques based on random projection. CoRR abs/1706.04371 (2017), http://arxiv.org/abs/1706.04371

Yamada, M., Jitkrittum, W., Sigal, L., et al.: High-Dimensional Feature Selection by Feature-Wise Kernelized Lasso. Neural Computation 26(1), 185–207 (2014), DOI: 10.1162/NECO_a_00537

Zender, C.S.: Bit Grooming: statistically accurate precision-preserving quantization with compression, evaluated in the netCDF Operators (NCO, v4.4.8+). Geoscientific Model Development 9(9), 3199–3211 (2016), DOI: 10.5194/gmd-9-3199-2016

Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Information Theory 23(3), 337–343 (1977), DOI: 10.1109/TIT.1977.1055714

Published

2020-04-14

How to Cite

Duwe, K., Lüttgau, J., Mania, G., Squar, J., Fuchs, A., Kuhn, M., Betke, E., & Ludwig, T. (2020). State of the Art and Future Trends in Data Reduction for High-Performance Computing. Supercomputing Frontiers and Innovations, 7(1), 4–36. https://doi.org/10.14529/jsfi200101