Zhao Zhang

HPC Reading List

Edited by: Zhao Zhang. Reviewed by: Jack Dongarra, William D. Gropp. Other reviews are pending.

Interconnect

Scott, Steve, Dennis Abts, John Kim, and William J. Dally. “The BlackWidow high-radix Clos network.” ACM SIGARCH Computer Architecture News 34, no. 2 (2006): 16-28.
Leiserson, Charles E., Zahi S. Abuhamdeh, David C. Douglas, Carl R. Feynman, Mahesh N. Ganmukhi, Jeffrey V. Hill, Daniel Hillis et al. “The network architecture of the Connection Machine CM-5.” In Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures, pp. 272-285. 1992.
Kim, John, William J. Dally, Steve Scott, and Dennis Abts. “Technology-driven, highly-scalable dragonfly topology.” ACM SIGARCH Computer Architecture News 36, no. 3 (2008): 77-88.
Adiga, Narasimha R., Matthias A. Blumrich, Dong Chen, Paul Coteus, Alan Gara, Mark E. Giampapa, Philip Heidelberger et al. “Blue Gene/L torus interconnection network.” IBM Journal of Research and Development 49, no. 2.3 (2005): 265-276.
Besta, Maciej, and Torsten Hoefler. “Slim Fly: A cost effective low-diameter network topology.” In SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 348-359. IEEE, 2014.

Programming Model

Walker, D.W. and Dongarra, J.J., 1996. “MPI: a standard message passing interface.” Supercomputer, 12, pp.56-68.
Zheng, Yili, Amir Kamil, Michael B. Driscoll, Hongzhang Shan, and Katherine Yelick. “UPC++: a PGAS extension for C++.” In 2014 IEEE 28th international parallel and distributed processing symposium, pp. 1105-1114. IEEE, 2014.
Jiang, Weihang, Jiuxing Liu, Hyun-Wook Jin, Dhabaleswar K. Panda, William Gropp, and Rajeev Thakur. “High performance MPI-2 one-sided communication over InfiniBand.” In IEEE International Symposium on Cluster Computing and the Grid, 2004. CCGrid 2004., pp. 531-538. IEEE, 2004.
Kale, Laxmikant V., and Sanjeev Krishnan. “Charm++: a portable concurrent object oriented system based on C++.” In Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications, pp. 91-108. 1993.

Collective Communication

Thakur, Rajeev, Rolf Rabenseifner, and William Gropp. “Optimization of collective communication operations in MPICH.” The International Journal of High Performance Computing Applications 19, no. 1 (2005): 49-66.
Chan, Ernie, Marcel Heimlich, Avi Purkayastha, and Robert Van De Geijn. “Collective communication: theory, practice, and experience.” Concurrency and Computation: Practice and Experience 19, no. 13 (2007): 1749-1783.
Pješivac-Grbović, Jelena, Thara Angskun, George Bosilca, Graham E. Fagg, Edgar Gabriel, and Jack J. Dongarra. “Performance analysis of MPI collective operations.” Cluster Computing 10 (2007): 127-143.
Hoefler, Torsten, James Dinan, Darius Buntinas, Pavan Balaji, Brian W. Barrett, Ron Brightwell, William Gropp, Vivek Kale, and Rajeev Thakur. “Leveraging MPI’s one-sided communication interface for shared-memory programming.” In Recent Advances in the Message Passing Interface: 19th European MPI Users’ Group Meeting, EuroMPI 2012, Vienna, Austria, September 23-26, 2012. Proceedings 19, pp. 132-141. Springer Berlin Heidelberg, 2012.

Math Library

Balay, Satish, William D. Gropp, Lois Curfman McInnes, and Barry F. Smith. “Efficient management of parallelism in object-oriented numerical software libraries.” In Modern software tools for scientific computing, pp. 163-202. Boston, MA: Birkhäuser Boston, 1997.
Agullo, Emmanuel, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. “Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects.” In Journal of Physics: Conference Series, vol. 180, no. 1, p. 012037. IOP Publishing, 2009.
Choi, Jaeyoung, Jack J. Dongarra, Roldan Pozo, and David W. Walker. “ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers.” In The Fourth Symposium on the Frontiers of Massively Parallel Computation, pp. 120-121. IEEE Computer Society, 1992.
Dongarra, Jack J., Piotr Luszczek, and Antoine Petitet. “The LINPACK benchmark: past, present and future.” Concurrency and Computation: practice and experience 15, no. 9 (2003): 803-820.
Williams, Samuel, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. “Optimization of sparse matrix-vector multiplication on emerging multicore platforms.” In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pp. 1-12. 2007.

I/O and Storage

Thakur, Rajeev, William Gropp, and Ewing Lusk. “On implementing MPI-IO portably and with high performance.” In Proceedings of the sixth workshop on I/O in parallel and distributed systems, pp. 23-32. 1999.
Folk, Mike, Gerd Heber, Quincey Koziol, Elena Pourmal, and Dana Robinson. “An overview of the HDF5 technology suite and its applications.” In Proceedings of the EDBT/ICDT 2011 workshop on array databases, pp. 36-47. 2011.
Chen, Peter M., Edward K. Lee, Garth A. Gibson, Randy H. Katz, and David A. Patterson. “RAID: High-performance, reliable secondary storage.” ACM Computing Surveys (CSUR) 26, no. 2 (1994): 145-185.
Patil, Swapnil V., Garth A. Gibson, Sam Lang, and Milo Polte. “GIGA+ scalable directories for shared file systems.” In Proceedings of the 2nd international workshop on Petascale data storage: held in conjunction with Supercomputing’07, pp. 26-29. 2007.
Carns, Philip H., Walter B. Ligon III, Robert B. Ross, and Rajeev Thakur. “PVFS: A parallel file system for Linux clusters.” In 4th Annual Linux Showcase & Conference (ALS 2000). 2000.

Performance Modeling

Gibbons, Phillip B. “A more practical PRAM model.” In Proceedings of the first annual ACM symposium on Parallel algorithms and architectures, pp. 158-168. 1989.
Culler, David, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten Von Eicken. “LogP: Towards a realistic model of parallel computation.” In Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming, pp. 1-12. 1993.
Valiant, Leslie G. “A bridging model for multi-core computing.” Journal of Computer and System Sciences 77, no. 1 (2011): 154-166.
Williams, Samuel, Andrew Waterman, and David Patterson. “Roofline: an insightful visual performance model for multicore architectures.” Communications of the ACM 52, no. 4 (2009): 65-76.
Gropp, William, Luke N. Olson, and Philipp Samfass. “Modeling MPI communication performance on SMP nodes: Is it time to retire the ping pong test.” In Proceedings of the 23rd European MPI Users’ Group Meeting, pp. 41-50. 2016.
Hoefler, Torsten, Timo Schneider, and Andrew Lumsdaine. “Characterizing the influence of system noise on large-scale applications by simulation.” In SC’10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-11. IEEE, 2010.
Hoefler, Torsten, and Roberto Belli. “Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results.” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. 2015.
Hoefler, Torsten, William Gropp, William Kramer, and Marc Snir. “Performance modeling for systematic performance tuning.” In State of the Practice Reports, pp. 1-12. 2011.
Taylor, Valerie, Xingfu Wu, and Rick Stevens. “Prophesy: an infrastructure for performance analysis and modeling of parallel and grid applications.” ACM SIGMETRICS Performance Evaluation Review 30, no. 4 (2003): 13-18.

Connecting Supercomputers

Foster, Ian, and Carl Kesselman. “Globus: A metacomputing infrastructure toolkit.” The International Journal of Supercomputer Applications and High Performance Computing 11, no. 2 (1997): 115-128.

Applications

Solomonik, Edgar, and James Demmel. “Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms.” In European Conference on Parallel Processing, pp. 90-109. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011.
Agarwal, Ramesh C., Susanne M. Balle, Fred G. Gustavson, Mahesh Joshi, and Prasad Palkar. “A three-dimensional approach to parallel matrix multiplication.” IBM Journal of Research and Development 39, no. 5 (1995): 575-582.
Datta, Kaushik, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, and Katherine Yelick. “Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures.” In SC’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pp. 1-12. IEEE, 2008.
Warren, Michael S., and John K. Salmon. “Astrophysical N-body simulations using hierarchical tree data structures.” SC 92 (1992): 570-576.
Sengupta, Shubhabrata, Mark Harris, Yao Zhang, and John D. Owens. “Scan primitives for GPU computing.” (2007).
Buluç, Aydin, and Kamesh Madduri. “Parallel breadth-first search on distributed memory systems.” In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. 2011.
Holst, Terry L. “Supercomputer applications in computational fluid dynamics.” In Supercomputing’88: Proceedings of the 1988 ACM/IEEE Conference on Supercomputing, Vol. II Science and Applications, pp. 51-60. IEEE, 1988.
Dubey, Abhimanyu, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur et al. “The Llama 3 herd of models.” arXiv preprint arXiv:2407.21783 (2024).
Ben-Nun, Tal, and Torsten Hoefler. “Demystifying parallel and distributed deep learning: An in-depth concurrency analysis.” ACM Computing Surveys (CSUR) 52, no. 4 (2019): 1-43.
You, Yang, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. “Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.” In International Conference on Learning Representations. 2020.