Analyzing Data Properties using Statistical Sampling – Illustrated on Scientific File Formats
Abstract
This paper investigates stochastic sampling methods for computing and analyzing quantities of interest, both with respect to the number of files and with respect to the occupied storage space. We demonstrate that, on our production system, scanning 1% of the files and of the data volume is sufficient to draw conclusions. This speeds up the analysis and significantly reduces the cost of such studies.
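The distinction between estimating by file count and by occupied storage can be illustrated with a minimal sketch. It is not taken from the paper: the function names, the synthetic file list, and the choice of size-proportional sampling with replacement for the storage estimate are assumptions made for illustration only.

```python
import random

# Hypothetical inventory: (file name, size in bytes). The values are synthetic.
FILES = [(f"small_{i}.txt", 1_000) for i in range(9_000)] + \
        [(f"large_{i}.nc", 1_000_000) for i in range(1_000)]


def is_netcdf(entry):
    """Illustrative property of interest: does the file name end in .nc?"""
    name, _size = entry
    return name.endswith(".nc")


def estimate_by_count(files, predicate, k, rng):
    """Estimate the fraction of files with the property.

    Draws k files uniformly at random without replacement.
    """
    sample = rng.sample(files, k)
    return sum(1 for entry in sample if predicate(entry)) / k


def estimate_by_size(files, predicate, k, rng):
    """Estimate the fraction of occupied storage held by files with the property.

    Draws k files with replacement, each with probability proportional to
    its size, so large files are picked more often (an assumed scheme).
    """
    weights = [size for _name, size in files]
    sample = rng.choices(files, weights=weights, k=k)
    return sum(1 for entry in sample if predicate(entry)) / k


rng = random.Random(42)
k = len(FILES) // 100  # scan 1% of the files
by_count = estimate_by_count(FILES, is_netcdf, k, rng)
by_size = estimate_by_size(FILES, is_netcdf, k, rng)
print(f"share of files: {by_count:.2f}, share of storage: {by_size:.2f}")
```

In this synthetic inventory the two quantities differ widely (10% of the files hold about 99% of the bytes), which is why a sample drawn uniformly over files cannot simply be reused to describe the occupied storage.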