DOI: 10.5555/2969442.2969505
Article

Training very deep networks

Published: 07 December 2015

Abstract

Theoretical and empirical evidence indicates that the depth of neural networks is crucial for their success. However, training becomes more difficult as depth increases, and training of very deep networks remains an open problem. Here we introduce a new architecture designed to overcome this difficulty. Our so-called highway networks allow unimpeded information flow across many layers on information highways. They are inspired by Long Short-Term Memory recurrent networks and use adaptive gating units to regulate the information flow. Even with hundreds of layers, highway networks can be trained directly through simple gradient descent. This enables the study of extremely deep and efficient architectures.
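The gating mechanism sketched in the abstract computes, per layer, y = H(x) * T(x) + x * (1 - T(x)), where H is an ordinary non-linear transform and T is a sigmoid "transform gate" that decides how much of the input is carried through unchanged (see the highway networks report, reference [31]). The short PyTorch sketch below only illustrates that idea under assumed choices (a fully connected layer of width 64, ReLU for H, a gate bias of -2.0); it is not the authors' exact configuration.

import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """One fully connected highway layer: y = H(x)*T(x) + x*(1 - T(x))."""

    def __init__(self, dim, gate_bias=-2.0):
        super().__init__()
        self.plain = nn.Linear(dim, dim)  # H(x, W_H): ordinary transform
        self.gate = nn.Linear(dim, dim)   # T(x, W_T): transform gate
        # A negative gate bias makes each layer start close to the identity
        # ("carry" behaviour), which is what eases training at large depth.
        nn.init.constant_(self.gate.bias, gate_bias)

    def forward(self, x):
        h = torch.relu(self.plain(x))     # candidate transformation
        t = torch.sigmoid(self.gate(x))   # gate values in (0, 1)
        return h * t + x * (1.0 - t)      # carry gate coupled as 1 - T(x)

# Stacking many such layers keeps a gated identity path through the whole
# stack, so gradients can reach early layers even at hundreds of layers.
net = nn.Sequential(*[HighwayLayer(64) for _ in range(50)])
y = net(torch.randn(8, 64))              # output shape: (8, 64)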

References

[1]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[2]
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv:1409.4842 [cs], September 2014.
[3]
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 [cs], September 2014.
[4]
Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In IJCAI, 2011.
[5]
Dan Ciresan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[6]
Dong Yu, Michael L. Seltzer, Jinyu Li, Jui-Ting Huang, and Frank Seide. Feature learning in deep neural networks - studies on speech recognition tasks. arXiv preprint arXiv:1301.3605, 2013.
[7]
Sepp Hochreiter and Jürgen Schmidhuber. Bridging long time lags by weight guessing and "long short-term memory". Spatiotemporal models in biological and artificial systems, 37:65-72, 1996.
[8]
Johan Håstad. Computational limitations of small-depth circuits. MIT press, 1987.
[9]
Johan Håstad and Mikael Goldmann. On the power of small-depth threshold circuits. Computational Complexity, 1(2):113-129, 1991.
[10]
Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks, 2014.
[11]
Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems. 2014.
[12]
James Martens and Venkatesh Medabalimi. On the expressive efficiency of sum product networks. arXiv:1411.7717 [cs, stat], November 2014.
[13]
James Martens and Ilya Sutskever. Training deep and recurrent networks with hessian-free optimization. Neural Networks: Tricks of the Trade, pages 1-58, 2012.
[14]
Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. pages 1139-1147, 2013.
[15]
Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems 27, pages 2933-2941. 2014.
[16]
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249-256, 2010.
[17]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv:1502.01852 [cs], February 2015.
[18]
David Sussillo and L. F. Abbott. Random walk initialization for training very deep feedforward networks. arXiv:1412.6558 [cs, stat], December 2014.
[19]
Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120 [cond-mat, q-bio, stat], December 2013.
[20]
Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. arXiv:1302.4389 [cs, stat], February 2013.
[21]
Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, and Jürgen Schmidhuber. Compete to compute. In Advances in Neural Information Processing Systems, pages 2310-2318, 2013.
[22]
Tapani Raiko, Harri Valpola, and Yann LeCun. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics, pages 924-932, 2012.
[23]
Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
[24]
Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. pages 562-570, 2015.
[25]
Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. arXiv:1412.6550 [cs], December 2014.
[26]
Jürgen Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, March 1992.
[27]
Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527-1554, 2006.
[28]
Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Master's thesis, Technische Universität München, München, 1991.
[29]
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, November 1997.
[30]
Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. In ICANN, volume 2, pages 850-855, 1999.
[31]
Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv:1505.00387 [cs], May 2015.
[32]
Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. arXiv:1507.01526 [cs], July 2015.
[33]
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093 [cs], 2014.
[34]
Benjamin Graham. Spatially-sparse convolutional neural networks. arXiv:1409.6070, September 2014.
[35]
Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv:1312.4400, 2014.
[36]
Marijn F Stollenga, Jonathan Masci, Faustino Gomez, and Jürgen Schmidhuber. Deep networks with internal selective attention through feedback connections. In NIPS. 2014.
[37]
Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv:1412.6806 [cs], December 2014.
[38]
Rupesh Kumar Srivastava, Jonathan Masci, Faustino Gomez, and Jürgen Schmidhuber. Understanding locally competitive networks. In International Conference on Learning Representations, 2015.


Published In

NIPS'15: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2
December 2015
3626 pages

Publisher

MIT Press

Cambridge, MA, United States

Publication History

Published: 07 December 2015

Qualifiers

  • Article

Cited By

  • (2023) Language modelling for speaker diarization in telephonic interviews. Computer Speech and Language, 78:C. DOI: 10.1016/j.csl.2022.101441. Online publication date: 1-Mar-2023.
  • (2022) Rendered Image Superresolution Reconstruction with Multichannel Feature Network. Scientific Programming, 2022. DOI: 10.1155/2022/9393589. Online publication date: 1-Jan-2022.
  • (2022) Deep Multi-Scale Residual Connected Neural Network Model for Intelligent Athlete Balance Control Ability Evaluation. Computational Intelligence and Neuroscience, 2022. DOI: 10.1155/2022/9012709. Online publication date: 1-Jan-2022.
  • (2022) Using DSCB: A Depthwise Separable Convolution Block Rebuild MTCNN for Face Detection. Proceedings of the 2022 5th International Conference on Image and Graphics Processing, pages 1-8. DOI: 10.1145/3512388.3512389. Online publication date: 7-Jan-2022.
  • (2022) Neural Network Pruning by Recurrent Weights for Finance Market. ACM Transactions on Internet Technology, 22(3):1-23. DOI: 10.1145/3433547. Online publication date: 22-Jan-2022.
  • (2022) Ensemble deep learning. Engineering Applications of Artificial Intelligence, 115:C. DOI: 10.1016/j.engappai.2022.105151. Online publication date: 1-Oct-2022.
  • (2021) Alignment attention by matching key and query distributions. Proceedings of the 35th International Conference on Neural Information Processing Systems, pages 13444-13457. DOI: 10.5555/3540261.3541291. Online publication date: 6-Dec-2021.
  • (2021) Memorizing All for Implicit Discourse Relation Recognition. ACM Transactions on Asian and Low-Resource Language Information Processing, 21(3):1-20. DOI: 10.1145/3485016. Online publication date: 13-Dec-2021.
  • (2021) Parallel Connected LSTM for Matrix Sequence Prediction with Elusive Correlations. ACM Transactions on Intelligent Systems and Technology, 12(4):1-16. DOI: 10.1145/3469437. Online publication date: 12-Aug-2021.
  • (2021) Learning Syllables Using Conv-LSTM Model for Swahili Word Representation and Part-of-speech Tagging. ACM Transactions on Asian and Low-Resource Language Information Processing, 20(4):1-25. DOI: 10.1145/3445975. Online publication date: 26-May-2021.
