Lightning AI
lightning.ai › docs › pytorch › stable › notebooks › course_UvA-DL › 05-transformers-and-MH-attention.html
Tutorial 5: Transformers and Multi-Head Attention — PyTorch Lightning 2.6.1 documentation
May 1, 2025 - ) in the diagram above represents the optional masking of specific entries in the attention matrix. This is for instance used if we stack multiple sequences with different lengths into a batch. To still benefit from parallelization in PyTorch, we pad the sentences to the same length and mask out the padding tokens during the calculation of the attention values.
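The padding-and-masking idea in this snippet can be illustrated with a small, dependency-free sketch. It is not the tutorial's implementation: the function name `masked_attention` and the argument `key_is_padding` are hypothetical, and the sketch assumes every sequence has at least one non-padding token.

```python
import math


def masked_attention(q, k, v, key_is_padding):
    """Scaled dot-product attention for one sequence, ignoring padded keys.

    q, k, v: lists of vectors (lists of floats), one vector per token.
    key_is_padding: list of bools, True where the key/value token is padding.
    Assumes at least one key is not padding.
    """
    d_k = len(k[0])
    outputs = []
    for q_i in q:
        # Attention logits, scaled by 1/sqrt(d_k).
        logits = [
            sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in k
        ]
        # Padded positions get -inf, so their softmax weight is exactly 0.
        logits = [
            -math.inf if pad else s for s, pad in zip(logits, key_is_padding)
        ]
        # Numerically stable softmax over the key positions.
        top = max(logits)
        exps = [math.exp(s - top) for s in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w * v_j[c] for w, v_j in zip(weights, v)) for c in range(len(v[0]))]
        )
    return outputs
```

With the mask applied, the outputs for the real tokens of a padded sequence match the outputs computed on the unpadded sequence, which is what makes batching sequences of different lengths safe.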
GitHub
github.com › ksopyla › seq2seq-attention-pytorch-lightning
GitHub - ksopyla/seq2seq-attention-pytorch-lightning: Pytorch-Lightning Seq2seq model with the use of recurrent neural network · GitHub
Pytorch-Lightning Seq2seq model with the use of recurrent neural network - ksopyla/seq2seq-attention-pytorch-lightning
Starred by 10 users
Forked by 4 users
Languages Python
arXiv
arxiv.org › pdf › 2401.04658 pdf
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths
Lightning Attention-2 integrated. Our implementation utilizes the Metaseq framework (Zhang et al., 2022), a PyTorch-
UvA DL Notebooks
uvadlc-notebooks.readthedocs.io › en › latest › tutorial_notebooks › tutorial6 › Transformers_and_MHAttention.html
Tutorial 6: Transformers and Multi-Head Attention — UvA DL Notebooks v1.2 documentation
Thus, we focus here on what makes the Transformer and self-attention so powerful in general. In Tutorial 15, we will discuss the application of Transformers in Computer Vision. Below, we import our standard libraries. Similarly as in Tutorial 5, we will use PyTorch Lightning as an additional ...
Meta-pytorch
meta-pytorch.org › torchtune › 0.4 › _modules › torchtune › modules › attention.html
torchtune.modules.attention — torchtune 0.4 documentation
Multi-Query Attention is an extreme version where we have a single key and value head shared by all query heads. Following is an example of MHA, GQA and MQA with num_heads = 4 (credit for the documentation: `litgpt.Config <https://github.com/Lightning-AI/litgpt/blob/eda1aaaf391fd689664f954...
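As a rough illustration of how MHA, GQA and MQA differ for num_heads = 4, the sketch below maps each query head to the key/value head it reads from. The helper name `kv_head_for_query_head` is hypothetical and not part of torchtune or litgpt. It assumes consecutive query heads share a key/value head and that `num_heads` is divisible by `num_kv_heads`.

```python
def kv_head_for_query_head(num_heads, num_kv_heads):
    """Return, for each query head, the index of the key/value head it uses.

    Assumes consecutive query heads are grouped onto the same key/value head.
    """
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    group_size = num_heads // num_kv_heads
    return [head // group_size for head in range(num_heads)]


# MHA: every query head has its own key/value head.
mha = kv_head_for_query_head(num_heads=4, num_kv_heads=4)  # [0, 1, 2, 3]
# GQA: query heads are grouped in pairs onto two key/value heads.
gqa = kv_head_for_query_head(num_heads=4, num_kv_heads=2)  # [0, 0, 1, 1]
# MQA: a single key/value head is shared by all query heads.
mqa = kv_head_for_query_head(num_heads=4, num_kv_heads=1)  # [0, 0, 0, 0]
```

Fewer key/value heads mean fewer key and value projections to compute and cache, with the query heads in a group sharing them.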
PyTorch Lightning
pytorch-lightning.readthedocs.io › en › 1.6.5 › ecosystem › transformers.html
Lightning Transformers — PyTorch Lightning 1.6.5 documentation
Lightning Transformers offers a flexible interface for training and fine-tuning SOTA Transformer models using the PyTorch Lightning Trainer.
PyTorch
docs.pytorch.org › docs › stable › generated › torch.nn.MultiheadAttention.html
Lightning AI
lightning.ai › docs › pytorch › stable › notebooks › course_UvA-DL › 05-transformers-and-MH-attention.ipynb
05-transformers-and-MH-attention.ipynb
One aspect we haven't discussed yet is the scaling factor of $1/\sqrt{d_k}$. This scaling factor is crucial to maintain an appropriate variance of attention values after initialization. Remember that we initialize our layers with the intention of having equal variance throughout the model, and hence, $Q$ and $K$ might also have a variance close to $1$. However, performing a dot product over two vectors with a variance $\sigma$ results in a scalar having $d_k$-times higher variance: $q_i \sim \mathcal{N}(0,\sigma), k_i \sim \mathcal{N}(0,$ ...
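The effect of the $1/\sqrt{d_k}$ factor can be checked empirically with a short simulation. This is only a sketch under stated assumptions: query and key entries are drawn independently from a standard normal distribution (unit variance), and the function name `dot_product_variance` is hypothetical.

```python
import math
import random
import statistics


def dot_product_variance(d_k, num_samples=4000, seed=0):
    """Estimate the variance of q.k with and without 1/sqrt(d_k) scaling.

    Entries of q and k are sampled independently from N(0, 1).
    Returns (variance of raw dot products, variance of scaled dot products).
    """
    rng = random.Random(seed)
    raw, scaled = [], []
    for _ in range(num_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        raw.append(dot)
        scaled.append(dot / math.sqrt(d_k))
    return statistics.pvariance(raw), statistics.pvariance(scaled)


raw_var, scaled_var = dot_product_variance(d_k=64)
print(f"unscaled variance ~ {raw_var:.1f}, scaled variance ~ {scaled_var:.2f}")
```

Under these assumptions the unscaled variance comes out close to `d_k`, while the scaled variance stays close to 1, which keeps the softmax inputs in a reasonable range after initialization.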
GitHub
github.com › pier-maker92 › pytorch-lightning-Transformer
GitHub - pier-maker92/pytorch-lightning-Transformer: Pytorch implementation of Transformer wrapped with Pytorch Lightning
Pytorch implementation of Transformer wrapped with Pytorch Lightning - pier-maker92/pytorch-lightning-Transformer
Author pier-maker92
Lightning AI
lightning.ai › docs › pytorch › stable › tutorials.html
PyTorch Lightning Tutorials — PyTorch Lightning 2.6.1 documentation
In this tutorial, we will discuss one of the most impactful architectures of the last 2 years: the Transformer model. Since the paper Attention Is All You Need by Vaswani et...
PyTorch Lightning
pytorch-lightning.readthedocs.io › en › 1.6.5
Welcome to PyTorch Lightning — PyTorch Lightning 1.6.5 documentation
From PyTorch to PyTorch Lightning [Video] Tutorial 1: Introduction to PyTorch · Tutorial 2: Activation Functions · Tutorial 3: Initialization and Optimization · Tutorial 4: Inception, ResNet and DenseNet · Tutorial 5: Transformers and Multi-Head Attention ·
OpenReview
openreview.net › pdf pdf
Efficient Language Modeling with Lightning Attention
Promoting openness in scientific communication and the peer-review process
Lightning AI
lightning.ai › docs › pytorch › stable › notebooks › course_UvA-DL › 11-vision-transformer.html
Tutorial 11: Vision Transformers — PyTorch Lightning 2.6.1 documentation
May 1, 2025 -
        Args:
            embed_dim: Dimensionality of input and attention feature vectors
            hidden_dim: Dimensionality of hidden layer in feed-forward network (usually 2-4x larger than embed_dim)
            num_heads: Number of heads to use in the Multi-Head Attention block
            dropout: Amount of dropout to apply in the feed-forward network
        """
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads)
        self.layer_norm_2 = nn.LayerNorm(embed_dim)
        self.linear = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        inp_x = self.layer_norm_1(x)
        x = x + self.attn(inp_x, inp_x, inp_x)[0]
        x = x + self.linear(self.layer_norm_2(x))
        return x
Lightning AI
lightning.ai › docs › pytorch › stable › notebooks › course_UvA-DL › 06-graph-neural-networks.html
Tutorial 6: Basics of Graph Neural Networks — PyTorch Lightning 2.6.1 documentation
May 1, 2025 - Attention describes a weighted average of multiple elements with the weights dynamically computed based on an input query and elements’ keys (if you don’t know what attention is, it is recommended to at least go through the very first section called What is Attention?).
PyTorch Lightning
pytorch-lightning.readthedocs.io › en › 1.7.7 › visualize › logging_advanced.html
Track and Visualize Experiments (advanced) — PyTorch Lightning 1.7.7 documentation
from pytorch_lightning.callbacks import TQDMProgressBar


class CustomProgressBar(TQDMProgressBar):
    def get_metrics(self, *args, **kwargs):
        # don't show the version number
        items = super().get_metrics(*args, **kwargs)
        items.pop("v_num", None)
        return items
Lightning AI
lightning.ai › docs › pytorch › 1.5.9 › advanced › training_tricks.html
Training Tricks — PyTorch Lightning 1.5.9 documentation
For a more detailed explanation of SWA and how it works, read this post by the PyTorch team.
GitHub
github.com › Lightning-AI › pytorch-lightning
GitHub - Lightning-AI/pytorch-lightning: Pretrain, finetune ANY AI model of ANY size on 1 or 10,000+ GPUs with zero code changes. · GitHub
Run on any device at any scale with expert-level control over PyTorch training loop and scaling strategy. You can even write your own Trainer. Fabric is designed for the most complex models like foundation model scaling, LLMs, diffusion, transformers, reinforcement learning, active learning. Of any size. ...
+ import lightning as L
  import torch; import torchvision as tv

  dataset = tv.datasets.CIFAR10("data", download=True, train=True, transform=tv.transforms.ToTensor())

+ fabric = L.Fabric()
+ fabric.launch()

  model = tv.models.resnet18()
  optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
Starred by 31.2K users
Forked by 3.8K users
Languages Python
PyTorch Forums
discuss.pytorch.org › t › flash-attention › 174955
Flash Attention - PyTorch Forums
March 16, 2023 - Hi @ptrblck, I just wanted to confirm what is the best way to ensure that only the new Flash Attention in PyTorch 2.0 is being used for scaled dot product attention: For example:
# pytorch 2.0 flash attn: q, k, v, mask, dropout, causal, softmax_scale
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(
        q, k, v, attn_mask = mask, dropout_p = flash_attn_dropout, ...