ISSN 0253-2778

CN 34-1054/N

Open Access JUSTC Original Paper

Group stochastic gradient descent: A tradeoff between straggler and staleness

Cite this: https://doi.org/10.3969/j.issn.0253-2778.2020.08.016
  • Received Date: 01 July 2020
  • Accepted Date: 28 July 2020
  • Revised Date: 28 July 2020
  • Publish Date: 31 August 2020
Distributed stochastic gradient descent (DSGD) is widely used for large-scale distributed machine learning. Two typical implementations of DSGD are synchronous SGD (SSGD) and asynchronous SGD (ASGD). In SSGD, all workers must wait for each other, so the training speed is slowed down to that of the straggler. In ASGD, stale gradients can result in a poorly trained model. To address these problems, a new variant of distributed SGD named group SGD (GSGD) is proposed, which places workers with similar computation and communication performance in the same group and thereby divides the cluster into several groups. Workers in the same group operate in a synchronous manner, while different groups operate asynchronously. The proposed method mitigates the straggler problem because workers in the same group spend little time waiting for each other, and its staleness remains small because the number of groups is much smaller than the number of workers. The convergence of the method is proved through theoretical analysis. Simulation results show that the method converges faster than SSGD and ASGD on a heterogeneous cluster.
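The grouping scheme described in the abstract can be illustrated with a small, self-contained simulation. The Python sketch below is not the paper's implementation; it assumes a toy quadratic objective and simulates each group as a thread, and names such as run_group and noisy_grad are illustrative. Within a group, all workers compute gradients on the same parameter snapshot and the gradients are averaged (the synchronous part); each group then applies its averaged gradient to the shared parameters without waiting for the other groups (the asynchronous part).

```python
import threading
import time

import numpy as np

# Toy objective: f(w) = 0.5 * ||w - w_star||^2, so grad f(w) = w - w_star.
np.random.seed(0)
DIM = 10
w_star = np.random.randn(DIM)      # optimum of the toy objective
w_shared = np.random.randn(DIM)    # shared model parameters (parameter server)
lock = threading.Lock()            # protects w_shared
LR = 0.1
STEPS_PER_GROUP = 50


def noisy_grad(w):
    """Stochastic gradient of the toy quadratic objective."""
    return (w - w_star) + 0.01 * np.random.randn(DIM)


def run_group(workers_in_group, delay):
    """One group: synchronous averaging inside, asynchronous updates outside."""
    global w_shared
    for _ in range(STEPS_PER_GROUP):
        with lock:
            # All workers in the group read the same parameter snapshot.
            w_snapshot = w_shared.copy()
        # Synchronous phase: every worker in the group computes a gradient on
        # the same snapshot; the group waits for all of them (simulated here by
        # a sequential loop) and averages the results.
        grads = [noisy_grad(w_snapshot) for _ in range(workers_in_group)]
        avg_grad = np.mean(grads, axis=0)
        # Simulate heterogeneous speed: slower groups update less often, so
        # their gradients are staler when they are finally applied.
        time.sleep(delay)
        # Asynchronous phase: apply the (possibly stale) group gradient without
        # waiting for the other groups.
        with lock:
            w_shared -= LR * avg_grad


# Two groups with different speeds (e.g., fast GPUs vs. slower CPUs).
groups = [threading.Thread(target=run_group, args=(4, 0.001)),
          threading.Thread(target=run_group, args=(4, 0.005))]
for g in groups:
    g.start()
for g in groups:
    g.join()

print("distance to optimum:", np.linalg.norm(w_shared - w_star))
```

In this sketch the slower group applies its update after several of the fast group's updates have already landed, so its gradient is stale; but because staleness is bounded by the number of groups rather than the number of workers, it stays small, which is the tradeoff the abstract describes.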
