Pushing MPI into the Deep Learning Training Stack
We have written much about large-scale deep learning implementations over the last couple of years, but one question is being posed with increasing frequency: how will these workloads, training in particular, scale to many nodes? While some companies, Baidu among them, have managed to scale their deep learning training clusters across many GPU-laden nodes, for non-hyperscale companies with their own development teams this scalability remains a sticking point.
The answer to deep learning framework scalability can be found in the world of supercomputing. For the many nodes required for large-scale jobs, the de facto …
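As the article's title suggests, the supercomputing answer here is MPI-style collective communication: the gradient-averaging step at the heart of data-parallel training maps onto an allreduce. Below is a minimal, single-process sketch of the ring allreduce schedule often associated with Baidu's GPU-cluster work; it is illustrative only (real code would call `MPI_Allreduce` or a framework wrapper rather than simulate the ring), and all names are the author's own, not from any particular library.

```python
def ring_allreduce(worker_grads):
    """Average equally-shaped gradient vectors across simulated workers
    using the ring allreduce communication schedule."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    # Split every worker's gradient into n roughly equal chunks.
    bounds = [(i * length) // n for i in range(n + 1)]
    grads = [list(g) for g in worker_grads]

    # Phase 1, scatter-reduce: over n-1 steps, each worker passes one chunk
    # to its right-hand neighbour, which adds it into its own copy.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r - step) % n                  # chunk this worker forwards
            lo, hi = bounds[c], bounds[c + 1]
            sends.append(((r + 1) % n, lo, grads[r][lo:hi]))
        for dst, lo, data in sends:
            for i, v in enumerate(data):
                grads[dst][lo + i] += v

    # After scatter-reduce, worker r holds the fully summed chunk (r+1) % n.
    # Phase 2, allgather (done by direct copy for brevity; a real ring
    # allgather circulates the finished chunks the same way):
    for r in range(n):
        c = (r + 1) % n
        lo, hi = bounds[c], bounds[c + 1]
        done = [v / n for v in grads[r][lo:hi]]  # average, not just sum
        for w in range(n):
            grads[w][lo:hi] = done
    return grads
```

The appeal of the ring schedule is that each worker sends and receives only its share of the gradient per step, so per-node bandwidth stays constant as the node count grows.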
Pushing MPI into the Deep Learning Training Stack was written by Nicole Hemsoth at The Next Platform.