# Summary and Analysis of "Attention Is All You Need" (Vaswani et al., 2017)

## Core Concept

This paper introduces the **Transformer**, a novel network architecture based solely on attention mechanisms, dispensing entirely with recurrence (RNNs) and convolutions. The authors show that this design allows for significantly more parallelization and reaches a new state of the art (SOTA) in machine translation quality. The paper's title reflects its central claim: attention mechanisms alone are sufficient.

## Model Architecture

The Transformer follows a standard **encoder-decoder structure**. Both the encoder and the decoder are composed of a stack of identical layers.

### 1. Encoder

* **Composition:** The encoder is a stack of N = 6 identical layers.
* **Layer Structure:** Each layer has two sub-layers:
    1. A **multi-head self-attention mechanism**.
    2. A simple, position-wise **fully connected feed-forward network**.
* **Connections:** A residual connection is employed around each of the two sub-layers, followed by layer normalization.

### 2. Decoder

* **Composition:** The decoder is also a stack of N = 6 identical layers.
* **Layer Structure:** Each layer has three sub-layers:
    1. A **masked multi-head self-attention mechanism**. The mask ensures that predictions for position `i` can depend only on the known outputs at positions less than `i`.
    2. A **multi-head attention mechanism** that attends over the output of the encoder stack. This allows the decoder to "look at" the input sequence.
    3. A simple, position-wise **fully connected feed-forward network**.
* **Connections:** As in the encoder, residual connections and layer normalization are applied around each sub-layer.

### Key Mechanism: Multi-Head Attention

Instead of performing a single attention function, the authors found it beneficial to project the queries, keys, and values `h` times with different, learned linear projections. This is called "multi-head attention."

* **Model Dimensions:** The paper uses $d_{\text{model}} = 512$.
* **Number of Heads:** The paper uses $h = 8$ parallel attention heads.
* **Process:** Each head computes attention independently, producing an output vector. These 8 output vectors are concatenated and projected once more, yielding the final values.
* **Benefit:** This allows the model to jointly attend to information from different representation subspaces at different positions.

### Scaled Dot-Product Attention

The specific attention function used is "Scaled Dot-Product Attention."

* **Formula:** $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
* **Components:** Queries (Q), Keys (K), and Values (V).
* **Scaling Factor:** Dividing by $\sqrt{d_k}$ (where $d_k$ is the dimension of the keys) is crucial: for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients. Scaling by $\frac{1}{\sqrt{d_k}}$ counteracts this.
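To make the two attention operations above concrete, here is a minimal NumPy sketch of scaled dot-product attention and a multi-head wrapper. The function names, array shapes, and the boolean `mask` convention are illustrative assumptions of this summary, not code from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (..., len_q, d_k), K: (..., len_k, d_k), V: (..., len_k, d_v).
    mask: optional boolean array, True where attention is NOT allowed
          (e.g. future positions in the decoder's masked self-attention).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # (..., len_q, len_k)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)        # block disallowed positions
    weights = softmax(scores, axis=-1)
    return weights @ V

def multi_head_attention(x, W_q, W_k, W_v, W_o, h=8):
    """Self-attention with h heads: project, split, attend, concatenate, project.

    W_q, W_k, W_v, W_o are (d_model, d_model) projection matrices; in the real
    model they are learned, here they are supplied by the caller for illustration.
    """
    seq_len, d_model = x.shape
    d_head = d_model // h

    def project_and_split(W):
        # (seq_len, d_model) -> (h, seq_len, d_head)
        return (x @ W).reshape(seq_len, h, d_head).transpose(1, 0, 2)

    Q, K, V = map(project_and_split, (W_q, W_k, W_v))
    heads = scaled_dot_product_attention(Q, K, V)              # (h, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Toy usage: a sequence of 4 tokens with d_model = 512 and h = 8 heads.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512))
W = [rng.normal(scale=0.02, size=(512, 512)) for _ in range(4)]
out = multi_head_attention(x, *W)
print(out.shape)  # (4, 512)
```

Splitting $d_{\text{model}} = 512$ into 8 heads of dimension 64 keeps the total computational cost close to that of a single full-width attention, which is part of why the authors can afford multiple heads.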
### Positional Encoding

Since the model contains no recurrence or convolution, it has no inherent sense of word order. To inject positional information, the authors add "positional encodings" to the input embeddings.

* **Method:** Sine and cosine functions of different frequencies are used.
* **Formula:**
    * $PE_{(\text{pos}, 2i)} = \sin(\text{pos} / 10000^{2i / d_{\text{model}}})$
    * $PE_{(\text{pos}, 2i+1)} = \cos(\text{pos} / 10000^{2i / d_{\text{model}}})$
* **Reasoning:** This allows the model to easily learn to attend by relative positions, since for any fixed offset `k`, $PE_{\text{pos}+k}$ can be represented as a linear function of $PE_{\text{pos}}$. (A short code sketch of these encodings appears at the end of this summary.)

## Training and Results

* **Task:** Machine translation.
* **Datasets:** WMT 2014 English-to-German and WMT 2014 English-to-French.
* **Hardware:** Trained on 8 NVIDIA P100 GPUs.
* **Training Time:** The "big" model trained for 3.5 days (300,000 steps).
* **Optimizer:** The Adam optimizer was used.
* **Regularization:** Residual dropout and label smoothing were applied.

### Key Results

* **WMT 2014 English-to-German:** A new SOTA **BLEU score of 28.4**, more than 2.0 BLEU points better than the previous best results, including ensembles (such as the recurrent GNMT).
* **WMT 2014 English-to-French:** A new single-model SOTA **BLEU score of 41.8**.
* **Training Cost:** The Transformer (base model) trained in only about 12 hours, a fraction of the training cost of previous SOTA models.

## Conclusion

The authors conclude that a model based entirely on attention can outperform models based on recurrence or convolution in both translation quality (BLEU score) and training cost, thanks to greater parallelization. This work laid the foundation for subsequent models such as BERT, GPT, and T5.
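As a small appendix, here is a minimal NumPy sketch of the sinusoidal positional encodings described earlier. The function name and the `(max_len, d_model)` layout are illustrative choices of this summary, not code from the paper.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model=512):
    """PE[pos, 2i]     = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i + 1] = cos(pos / 10000^(2i / d_model))

    Returns an array of shape (max_len, d_model).
    """
    positions = np.arange(max_len)[:, None]                   # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]             # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                              # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                              # odd dimensions: cosine
    return pe

# The encodings are simply added to the token embeddings before the first
# encoder/decoder layer.
pe = sinusoidal_positional_encoding(max_len=50)
print(pe.shape)  # (50, 512)
```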