Distributed Deep Learning System

Cross-listed course, 16:332:579:08, 2023

This cross-listed course focuses on the systems perspective of distributed deep learning. Its goal is to develop a comprehensive, deep understanding of the internals of deep learning systems, to inspire and foster students’ future research directions. The course covers a wide range of topics, including neural network architectures, optimization methods, parallel training paradigms, high-performance computing architecture, and communication algorithms. It connects the principles of distributed and parallel system design with state-of-the-art progress in deep learning.

Prerequisites:

  • Prior Python and C programming experience is required
  • Basic knowledge of linear algebra (at the level of 01:640:250 - Introductory Linear Algebra)
  • Basic knowledge of calculus (at the level of 01:640:251 - Multivariable Calculus)
  • Basic knowledge of parallel/distributed programming is recommended
  • Basic knowledge of high-performance computing architecture is recommended
  • Prior CUDA knowledge is recommended

Topics:

  • Deep Learning Overview
  • Distributed Training Paradigms
  • Large-scale Optimization – Algorithms
  • Large-scale Optimization – Systems
  • Performance Profiling and Modeling
  • High-performance Computing Architecture
  • Sparsification in Deep Learning
  • I/O in Deep Learning