Extreme Big Data (EBD): Next Generation Big Data Infrastructure Technologies Towards Yottabyte/Year

Satoshi Matsuoka, Hitoshi Sato, Osamu Tatebe, Michihiro Koibuchi, Ikki Fujiwara, Shuji Suzuki, Masanori Kakuta, Takashi Ishida, Yutaka Akiyama, Toyotaro Suzumura, Koji Ueno, Hiroki Kanezashi, Takemasa Miyoshi


Our claim is that so-called "Big Data" will evolve into a new era as data proliferate from multiple sources: massive numbers of sensors whose resolution is increasing exponentially, high-resolution simulations that generate huge data results, and the evolution of social infrastructures that allow for "opening up of data silos", i.e., data sources becoming abundant across the world instead of being confined within an institution, much as scientific data are handled in the modern era as a common asset openly accessible within and across disciplines. Such a situation would create the need not only for petabytes to zettabytes of capacity and beyond, but also for extreme-scale computing power. Our new project, sponsored under the Japanese JST-CREST program, is called "Extreme Big Data" and aims to achieve the convergence of extreme supercomputing and big data in order to cope with this explosion of data. The project consists of six teams: three deal with defining the future EBD convergent SW/HW architecture and system, and the other three with EBD co-design applications that represent different facets of big data, in metagenomics, social simulation, and climate simulation with real-time data assimilation. Although the project is still early in its lifetime, having started in Oct. 2013, we have already achieved several notable results, including becoming world #1 on the Green Graph 500, a benchmark that measures the power efficiency of the graph processing that appears in typical big data scenarios.


Publishing Center of South Ural State University (454080, Lenin prospekt, 76, Chelyabinsk, Russia)