DOI: 10.5555/3295222.3295349
Article
Free access

Attention is all you need

Published: 04 December 2017
Abstract

    The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
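
    The attention mechanism the abstract refers to is, at its core, the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Below is a minimal NumPy sketch of that one operation; the function name, toy shapes, and random inputs are illustrative assumptions, not the authors' reference implementation, which additionally uses learned multi-head projections, masking, dropout, position-wise feed-forward layers, and positional encodings.

    ```python
    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Compute softmax(Q K^T / sqrt(d_k)) V over the last two axes."""
        d_k = Q.shape[-1]
        # Similarity of every query against every key, scaled by sqrt(d_k)
        # to keep the softmax out of its low-gradient saturation region.
        scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (seq_q, seq_k)
        scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
        return weights @ V                                # weighted sum of values

    # Toy usage: 4 query positions, 6 key/value positions, d_k = d_v = 8.
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(4, 8))
    K = rng.normal(size=(6, 8))
    V = rng.normal(size=(6, 8))
    out = scaled_dot_product_attention(Q, K, V)           # shape (4, 8)
    ```

    Because every output position is a weighted sum over all input positions computed with matrix products, the whole sequence can be processed in parallel; this is what removes the sequential bottleneck of recurrent encoders and decoders that the abstract highlights.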




    Information

    Published In

    NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems
    December 2017
    7104 pages

    Publisher

    Curran Associates Inc.

    Red Hook, NY, United States

    Publication History

    Published: 04 December 2017

    Qualifiers

    • Article

    Contributors

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin

    Article Metrics

    • Downloads (Last 12 months): 12,588
    • Downloads (Last 6 weeks): 1,755
    Reflects downloads up to 14 Aug 2024


    Cited By

    • (2024) Baleen. In Proceedings of the 22nd USENIX Conference on File and Storage Technologies, pp. 347–372. DOI: 10.5555/3650697.3650718. Online publication date: 27-Feb-2024.
    • (2024) Distance-Aware Attentive Framework for Multi-Agent Collaborative Perception in Presence of Pose Error. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pp. 2606–2608. DOI: 10.5555/3635637.3663242. Online publication date: 6-May-2024.
    • (2024) Solving Offline 3D Bin Packing Problem with Large-sized Bin via Two-stage Deep Reinforcement Learning. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pp. 2576–2578. DOI: 10.5555/3635637.3663232. Online publication date: 6-May-2024.
    • (2024) From Explicit Communication to Tacit Cooperation: A Novel Paradigm for Cooperative MARL. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pp. 2360–2362. DOI: 10.5555/3635637.3663160. Online publication date: 6-May-2024.
    • (2024) Electric Vehicle Routing for Emergency Power Supply with Deep Reinforcement Learning. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pp. 2336–2338. DOI: 10.5555/3635637.3663152. Online publication date: 6-May-2024.
    • (2024) Behaviour Modelling of Social Animals via Causal Structure Discovery and Graph Neural Networks. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pp. 2276–2278. DOI: 10.5555/3635637.3663132. Online publication date: 6-May-2024.
    • (2024) Attention-based Priority Learning for Limited Time Multi-Agent Path Finding. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pp. 1993–2001. DOI: 10.5555/3635637.3663063. Online publication date: 6-May-2024.
    • (2024) Bootstrapping Linear Models for Fast Online Adaptation in Human-Agent Collaboration. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pp. 1463–1472. DOI: 10.5555/3635637.3663006. Online publication date: 6-May-2024.
    • (2024) EPSSNet. International Journal on Semantic Web & Information Systems, 20(1):1–22. DOI: 10.4018/IJSWIS.342087. Online publication date: 9-Apr-2024.
    • (2024) Refactoring Index Tuning Process with Benefit Estimation. Proceedings of the VLDB Endowment, 17(7):1528–1541. DOI: 10.14778/3654621.3654622. Online publication date: 1-Mar-2024.
