Analyzing Data Properties using Statistical Sampling – Illustrated on Scientific File Formats

Julian Martin Kunkel


Understanding the characteristics of data stored in data centers helps computer scientists identify the most suitable storage infrastructure for these workloads. For example, knowing the relevance of file formats allows optimizing for the relevant formats and also helps define benchmarks that cover these formats during procurement. Existing studies that investigate performance improvements and data reduction techniques such as deduplication and compression operate on a subset of the data. Some of those studies claim the selected data is representative and extrapolate their results to the scale of the data center. One hurdle of running novel schemes on the complete data is the vast amount of data stored and, thus, the resources required to analyze the complete data set. Even if this were feasible, the costs of running many such experiments must be justified.
This paper investigates stochastic sampling methods to compute and analyze quantities of interest, both in terms of the number of files and in terms of the occupied storage space. It demonstrates that, on our production system, scanning 1% of the files and of the data volume is sufficient to draw conclusions. This speeds up the analysis process and significantly reduces the costs of such studies.
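As a rough illustration of the idea, the following Python sketch estimates the share of one file format on a synthetic file population, once by file count and once by occupied volume, from a 1% sample. Everything here is an assumption made for illustration: the format names, the size distributions, and the helper functions are hypothetical, and the volume estimate uses sampling with probability proportional to file size, which is one common technique and not necessarily the exact procedure of the paper.

```python
import random


def make_population(n_files, seed=0):
    """Build synthetic (format, size_in_bytes) records; illustrative data only."""
    rng = random.Random(seed)
    formats = ["netcdf", "grib", "text", "other"]
    weights = [0.2, 0.1, 0.3, 0.4]  # assumed share of each format by file count
    mean_size = {"netcdf": 50e6, "grib": 20e6, "text": 1e5, "other": 1e6}
    files = []
    for _ in range(n_files):
        fmt = rng.choices(formats, weights=weights, k=1)[0]
        size = rng.expovariate(1.0 / mean_size[fmt])  # assumed size distribution
        files.append((fmt, size))
    return files


def share_by_count(files, fmt):
    """Exact fraction of files with the given format (full scan)."""
    return sum(1 for f, _ in files if f == fmt) / len(files)


def share_by_volume(files, fmt):
    """Exact fraction of bytes occupied by the given format (full scan)."""
    total = sum(s for _, s in files)
    return sum(s for f, s in files if f == fmt) / total


def estimate_share_by_count(files, fmt, fraction, rng):
    """Estimate the count share from a simple random sample without replacement."""
    k = max(1, int(len(files) * fraction))
    sample = rng.sample(files, k)
    return sum(1 for f, _ in sample if f == fmt) / k


def estimate_share_by_volume(files, fmt, fraction, rng):
    """Estimate the volume share by drawing files with probability
    proportional to their size (with replacement); the fraction of draws
    that hit the format then estimates its share of the occupied bytes."""
    k = max(1, int(len(files) * fraction))
    sizes = [s for _, s in files]
    sample = rng.choices(files, weights=sizes, k=k)
    return sum(1 for f, _ in sample if f == fmt) / k


if __name__ == "__main__":
    population = make_population(100_000)
    rng = random.Random(42)
    print("count share, exact:    ", share_by_count(population, "netcdf"))
    print("count share, 1% sample:",
          estimate_share_by_count(population, "netcdf", 0.01, rng))
    print("volume share, exact:    ", share_by_volume(population, "netcdf"))
    print("volume share, 1% sample:",
          estimate_share_by_volume(population, "netcdf", 0.01, rng))
```

In this sketch the sampled estimates can be compared directly with the full-scan values, which is only possible because the population is synthetic; on a real system the full scan is exactly the cost that sampling is meant to avoid.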

Publishing Center of South Ural State University (454080, Lenin prospekt, 76, Chelyabinsk, Russia)