Greedy approach to maximizing gradient diversity for minibatch SGD

How do we scale distributed gradient descent to a large batch size?