I am an assistant professor in the Department of Electrical and Computer Engineering at Rutgers University.
Before joining Rutgers, I was a computer scientist in the Data Intensive Computing Group at the Texas Advanced Computing Center (TACC). From 2014 to 2016, I spent two years as a postdoctoral researcher in the AMPLab and the Berkeley Institute for Data Science (BIDS) at the University of California, Berkeley, working with Prof. Michael J. Franklin. I received my Ph.D. from the Department of Computer Science at the University of Chicago in June 2014 under the supervision of Prof. Ian Foster.
Research Interests
My research interests span distributed computing, high-performance computing, applied machine learning, and the application of these computing techniques to big data problems. I am also interested in data management systems for domain science research and discovery. My recent research focuses on scalable deep learning on supercomputers. My CV is here.
Research Projects
- CAREER: Efficient and Scalable Large Foundational Model Training on Supercomputers for Science
- Diamond: Democratizing Large Neural Network Model Training for Science
- Fortuna: Characterizing and Harnessing Performance Variability in Accelerator-rich Clusters
- ICICLE AI Institute
- New Optimization Methods
Publications
My full publication list can be found on Google Scholar.
Highlighted Recent Papers
- [NeurIPS’23] M. Mozaffari, S. Li, Z. Zhang, M. Mehri Dehnavi. “MKOR: Momentum-Enabled Kronecker-Factor-Based Optimizer Using Rank-1 Updates” to appear in NeurIPS’23.
- [SC’23-1] Q. Ding, P. Zheng, S. Kudari, S. Venkataraman, Z. Zhang. “Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning” to appear in SC’23.
- [SC’23-2] E. Karrels, L. Huang, Y. Kan, I. Arora, Y. Wang, D. S. Katz, B. D. Gropp, Z. Zhang. “Fine-grained Policy-driven I/O Sharing for Burst Buffers” to appear in SC’23.
- [TPDS’22] J. G. Pauloski, L. Huang, W. Xu, I. T. Foster, Z. Zhang. “Deep Neural Network Training with Distributed K-FAC” in IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2022.3161187.
- [SC’21] J. G. Pauloski, Q. Huang, L. Huang, S. Venkataraman, K. Chard, I. T. Foster, Z. Zhang. “KAISA: An Adaptive Second-order Optimizer Framework for Deep Neural Networks” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14. 2021.
- [Nature Methods’21] L. Fang, F. Monroe, S. W. Novak, L. Kirk, C. R. Schiavon, S. B. Yu, T. Zhang, et al. “Deep learning-based point-scanning super-resolution imaging.” Nature Methods 18, no. 4 (2021): 406-416.
- [SC’20] J. G. Pauloski, Z. Zhang, L. Huang, W. Xu, I. T. Foster. “Convolutional Neural Network Training with Distributed K-FAC” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14. 2020.
- [IPDPS’20] Z. Zhang, L. Huang, J. G. Pauloski, I. T. Foster. “Efficient I/O for Neural Network Training with Compressed Data” In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 409-418. IEEE, 2020.
- [CLUSTER’19] Z. Zhang, L. Huang, R. Huang, W. Xu, D. S. Katz. “Quantifying the Impact of Memory Errors in Deep Learning.” In 2019 IEEE International Conference on Cluster Computing (CLUSTER), p. 1. IEEE, 2019.
- [TPDS’19] Y. You, Z. Zhang, J. Demmel, K. Keutzer, C. Hsieh. “Fast Deep Neural Network Training on Distributed Systems and Cloud TPUs.” IEEE Transactions on Parallel and Distributed Systems 30, no. 11 (2019): 2449-2462.
Other Papers
- [ISC’21] R. T. Evans, M. Cawood, S. L. Harrell, L. Huang, S. Liu, C.-Y. Lu, A. Ruhela, Y. Wang, Z. Zhang. “Optimizing GPU-Enhanced HPC System and Cloud Procurements for Scientific Workloads.” In International Conference on High Performance Computing, pp. 313-331. Springer, Cham, 2021.
- [ICPP’18] Y. You, Z. Zhang, J. Demmel, K. Keutzer, C. Hsieh. “ImageNet Training in Minutes.” In Proceedings of the 47th International Conference on Parallel Processing, p. 1. ACM, 2018. Best Paper Award.
- [HPDC’17] Z. Zhang, E. R. Sparks, and M. J. Franklin. “Diagnosing machine learning pipelines with fine-grained lineage.” In Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, pp. 143-153. ACM, 2017.
- [FGCS’16] D. S. Katz, A. Merzky, Z. Zhang, S. Jha. “Application Skeletons: Construction and Use in eScience.” Future Generation Computer Systems 59 (2016): 114-124.
- [IPDPS’16] M. Turilli, F. Liu, Z. Zhang, A. Merzky, M. Wilde, J. Weissman, D. S. Katz, and S. Jha. “Integrating abstractions to enhance the execution of distributed applications.” In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 953-962. IEEE, 2016.
- [BigData’16] Z. Zhang, K. Barbary, F. A. Nothaft, E. R. Sparks, O. Zahn, M. J. Franklin, D. A. Patterson, and S. Perlmutter. “Kira: Processing astronomy imagery using big data technology.” IEEE Transactions on Big Data (2016).
- [SIGMOD’15] F. A. Nothaft, M. Massie, T. Danford, Z. Zhang, U. Laserson, C. Yeksigian, J. Kattalam, A. Ahuja, J. Hammerbacher, M. Linderman, M. J. Franklin, A. D. Joseph, D. A. Patterson. “Rethinking Data-Intensive Science Using Scalable Analytics Systems.” In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 631-646. ACM, 2015.
- [CIDR’15] D. Crankshaw, P. Bailis, J. E. Gonzalez, H. Li, Z. Zhang, M. J. Franklin, A. Ghodsi, M. I. Jordan. “The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox.” In 7th Biennial Conference on Innovative Data Systems Research (CIDR), 2015.
- [eScience’14] Z. Zhang and D. S. Katz. “Using Application Skeletons to Improve eScience Infrastructure.” In 2014 IEEE 10th International Conference on e-Science, vol. 1, pp. 111-118. IEEE, 2014.
- [SC’13] Z. Zhang, D. S. Katz, T. G. Armstrong, J. Wozniak, I. Foster. “Parallelizing the Execution of Sequential Scripts.” In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p. 31. ACM, 2013.
- [HPDC’13] Z. Zhang, D. S. Katz, M. Wilde, J. Wozniak, I. Foster. “MTC Envelope: Defining the Capability of Large Scale Computers in the Context of Parallel Scripting Applications.” In Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, pp. 37-48. ACM, 2013.
- [IPDPS’13] T. Li, X. Zhou, K. Brandstatter, D. Zhao, K. Wang, A. Rajendran, Z. Zhang, I. Raicu. “ZHT: A Light-weight Reliable Persistent Dynamic Scalable Zero-hop Distributed Hash Table.” In 2013 IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS), pp. 775-787. IEEE, 2013.
- [SC’12] Z. Zhang, D. S. Katz, J. Wozniak, A. Espinosa, I. Foster. “Design and Analysis of Data Management in Scalable Parallel Scripting.” In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p. 85. IEEE, 2012.
- [CLUSTER’10] I. Raicu, I. Foster, M. Wilde, Z. Zhang, Y. Zhao, A. Szalay, P. Beckman, K. Iskra, P. Little, C. Moretti, A. Chaudhary, D. Thain. “Middleware Support for Many-Task Computing.” Cluster Computing 13, no. 3 (2010): 291-314.
- [Computer’09] M. Wilde, I. Foster, K. Iskra, P. Beckman, Z. Zhang, A. Espinosa, M. Hategan, B. Clifford, I. Raicu. “Parallel Scripting for Applications at the Petascale and Beyond.” Computer 42, no. 11 (2009).
- [SC’08] I. Raicu, Z. Zhang, M. Wilde, I. Foster, P. Beckman, K. Iskra, B. Clifford. “Toward Loosely Coupled Programming on Petascale Systems.” In SC’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1-12. IEEE, 2008.
Services
- Program Committee Member, SC’20-SC’22, SC’24
- Program Committee Member, IPDPS’24
- Program Committee Member, ICPP’23, ICPP’24
- Program Committee Member, HPDC’23
- IEEE TPDS Review Board, 2022-2023
- Proceedings Vice-chair (Main Conference), IPDPS’20, New Orleans LA, USA
- Co-chair, Deep Learning on Supercomputers Workshop at SC’18, SC’19, and SC’20
- Co-chair, Deep Learning for Science Workshop at ISC’19 and ISC’21, Frankfurt, Germany
Current Students
- Shuyuan Fan
- Jingxin Wang
- Haotian Xie
- Mingkai Zheng
Graduated Students
- Yue Shangguan, MS in UT ECE 2023, Software Engineer at Bloomberg.
- Yuhong Kan, MS in UT CS 2023, Software Engineer at Bloomberg.
- Qiyang Ding, MS in UT ECE 2023, Hardware Engineer at Apple.
- Sikan Li, MS in UT ECE 2022, Software Engineer at TACC.
- Ishank Arora, MS in UT CS 2022, Software Engineer at Apple.
- Shreyas Kudari, BS in UT ECE 2022, Software Engineer at Meta.
- Qi Huang, MS in UT CS 2021, Software Engineer at Bloomberg.
- J. Greg Pauloski, BS in UT CS 2020, Ph.D. student in Computer Science at the University of Chicago.
Contact Information
94 Brett Rd, Piscataway, NJ 08854
zhao.zhang (at) rutgers.edu