a-PyTorch-Tutorial-to-Transformers

A detailed tutorial and practical guide to implementing the Transformer model in PyTorch

This project provides a tutorial on implementing the Transformer model with PyTorch. It explains the Transformer's core concepts in depth, such as the multi-head attention mechanism and the encoder-decoder architecture, and demonstrates them with machine translation as the example application. The tutorial covers model implementation, training, inference, and evaluation, and is suited to learners who want to understand and apply Transformer technology in depth.


This is a PyTorch Tutorial to Transformers.

While we will apply the transformer to a specific task – machine translation – in this tutorial, this is still a tutorial on transformers and how they work. You've come to the right place, regardless of your intended task, application, or domain – natural language processing (NLP) or computer vision (CV).

This is the sixth in a series of tutorials I'm writing about implementing cool models on your own with the amazing PyTorch library.

Basic knowledge of PyTorch is assumed.

If you're new to PyTorch, first read Deep Learning with PyTorch: A 60 Minute Blitz and Learning PyTorch with Examples.

Questions, suggestions, or corrections can be posted as issues.

I'm using PyTorch 1.4 in Python 3.6.

Contents

Objective

Concepts

Overview

Implementation

Training

Inference

Evaluation

Frequently Asked Questions

Objective

Broadly...

To build a transformer model.


More precisely...

To build a model that can generate an output sequence given an input sequence.

We choose an application that represents one of the most complex uses of a transformer – a sequence-to-sequence problem.

But once you understand the transformer, you can just as easily apply it to any task, application, or even domain (NLP or CV) of your choosing. After all, an image is also a sequence, but over two-dimensional space instead of time.


Even more precisely...

To build a model that can translate from one language to another.

Um ein Modell zu erstellen, das von einer Sprache in eine andere übersetzen kann.

We will be implementing the pioneering research paper 'Attention Is All You Need', which introduced the Transformer network to the world. A watershed moment for cutting-edge Natural Language Processing.

Wir werden das wegweisende Forschungspapier "Attention Is All You Need" umsetzen, das das Transformer-Netzwerk in die Welt eingeführt hat. Ein Wendepunkt für die hochmoderne Natural Language Processing.

Specifically, we are going to be translating from English to German. And yes, everything written here in German is straight from the horse's mouth! (The horse, of course, being the model.)

Konkret werden wir vom Englischen ins Deutsche übersetzen. Und ja, alles, was hier in deutscher Sprache geschrieben wird, ist direkt aus dem Mund des Pferdes! (Das Pferd ist natürlich das Modell.)

Concepts

  • Machine Translation. duh.

  • Transformer Network. We have all but retired recurrent neural networks (RNNs) in favour of transformers, a new type of sequence model that possesses an unparalleled ability for representation and abstraction – all while being simpler, more efficient, and significantly more parallelizable. Today, the application of transformers is near universal, as their resounding success in NLP has also led to increasing adoption in computer vision tasks.

  • Multi-Head Scaled Dot-Product Attention. At the heart of the transformer is the attention mechanism, specifically this flavour of attention. It allows the transformer to interpret and encode a sequence in a multitude of contexts and with an unprecedented level of nuance. (A minimal sketch of the core computation follows this list.)

  • Encoder-Decoder Architecture. Similar to RNNs, transformer models for sequence transduction typically consist of an encoder that encodes an input sequence, and a decoder that decodes it, token by token, into the output sequence.

  • Positional Embeddings. Unlike RNNs, transformers do not innately account for the sequential nature of text – they instead view such a sequence as a bag of tokens, or pieces of text, that can be freely mixed and matched with tokens from the same or different bag. The coordinates of tokens in a sequence are therefore manually injected into the transformer as one-dimensional vectors or embeddings, allowing the transformer to incorporate their relative positions into its calculations.

  • Byte Pair Encoding. Language models are both enabled and constrained by their vocabularies. Machine translation, especially, is an open-vocabulary problem. Byte Pair Encoding is a way to construct a vocabulary of moderate size that is still able to represent nearly any word, whether it is common, rare, or entirely unseen.

  • Beam Search. As an alternative to simply choosing the highest-scoring token at each step of the generative process, we consider multiple candidates, reserving judgement until we see what they give rise to in subsequent steps – before finally picking the best overall output sequence.
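Since attention does the heavy lifting, here is a minimal sketch of scaled dot-product attention – softmax(QKᵀ / √d_k) V, as defined in the paper. Multi-head attention simply runs this computation several times over different learned linear projections of the queries, keys, and values, then concatenates the results. This is an illustration only; the tutorial's full implementation appears in the Implementation section.

```python
import math

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    q, k, v have shape (..., seq_len, d_k); an optional boolean mask
    marks positions that may NOT be attended to (True = blocked).
    """
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # pairwise similarities
    if mask is not None:
        scores = scores.masked_fill(mask, float('-inf'))      # block before softmax
    weights = F.softmax(scores, dim=-1)                       # attention distribution
    return weights @ v                                        # weighted sum of values
```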

Overview

In this section, I will present an overview of the transformer. If you're already familiar with it, you can skip straight to the Implementation section or the commented code.

Transformers have completely changed the deep learning landscape. They've replaced recurrent neural networks (RNNs) as the workhorse of modern NLP. They have caused a seismic shift in our sequence modeling capabilities, not only with their amazing representational ability, but also with their capacity for transfer learning after self-supervised pre-training on large amounts of data. You have no doubt heard of these models in one form or another – BERT, GPT, etc.

Today, they are also increasingly being used in computer vision applications as an alternative to, or in combination with, convolutional neural networks (CNNs).

As in the original transformer paper, the context presented here is an NLP task – specifically the sequence transduction problem of machine translation. If you want to apply transformers to images, this tutorial is still a good place to learn about how they work.

Better than RNNs, but how?

Recurrent neural networks (RNNs) have long been a staple among NLP practitioners. They operate upon a sequence – you guessed it – sequentially. This isn't weird at all; it's even intuitive because that's how we do it too – we read text from one side to another.

<p align="center"> <img src="./img/rnn_sequential.PNG"> </p>

On the other hand, our ability to train deep neural networks is somewhat predicated on our ability to perform efficient computation. The sequential nature of RNNs precludes parallelization – you cannot move to the next token in the sequence without processing the previous one. Training networks on large amounts of data can therefore be significantly time-consuming.

Transformers can process elements in a sequence in parallel.

<p align="center"> <img src="./img/transformer_parallel.PNG"> </p>
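To make the contrast concrete, here is a toy comparison (a sketch, not code from this tutorial): the RNN below must consume its input step by step internally, whereas a transformer encoder layer processes every position in one batched operation.

```python
import torch
import torch.nn as nn

x = torch.randn(10, 1, 512)  # (seq_len, batch, d_model)

# RNN: the hidden state at step t cannot be computed before step t-1,
# so the 10 positions are necessarily processed one after another.
rnn = nn.GRU(input_size=512, hidden_size=512)
out_rnn, _ = rnn(x)

# Transformer encoder layer: all 10 positions attend to one another in
# a single set of batched matrix multiplications – no sequential chain.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
out_tr = encoder_layer(x)
```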

The sequential processing of text in an RNN introduces another problem. The output of the RNN at a given position is conditioned directly on the output at the previous position, which in turn is conditioned on its previous position, and so on. However, logical dependencies in text can occur across longer distances. It is often the case that you need access to information from a dozen positions ago.

No doubt some of this information can persist across moderate distances, but much of it can decay in the daisy-chain of computation that defines the RNN. It's easy to see why – we're relying on the output at each position to encode not only that position's own information but also anything that may (or may not) be useful ten steps down the line, with the outputs of each intervening step having to do the same.

There have been various modifications to the original RNN cell over the years to alleviate this problem, the most notable of which is probably the Long Short-Term Memory (LSTM) cell, which introduces an additional pathway known as the "cell state" for the sequential flow of information across cells, thereby reducing the burden on the cell outputs to encode all of this information. While this allows for modeling longer dependencies, the fundamental problem still exists – an RNN can access other positions only through intervening positions and not directly.

Transformers allow direct access to other positions.

<p align="center"> <img src="./img/access.PNG"> </p>

This means that each position can use information directly from other positions in a sequence, producing a highly context-aware and nuanced output. This, along with other design choices we will see later, makes way for transformers' unprecedented representational ability.

The "direct access" we are speaking of occurs through an attention mechanism, something not completely unlike attention mechanisms you may have encountered earlier – for example, between an RNN encoder and decoder. If the concept is completely unfamiliar to you, it doesn't matter as we will go over it in quite some detail very soon.

I would like to point out here that RNNs need not be unidirectional. Since important textual context can occur both before and after a certain position in the sequence, we often use bidirectional RNNs, where we operate two different RNN cells in opposite directions and combine their outputs. On the other hand, a transformer layer is inherently bidirectional – unless we choose to constrain access to a particular direction, as we will see later.
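That directional constraint is applied with a mask. As a small illustration (not the tutorial's exact code), a "causal" mask lets position i attend only to positions at or before i by blocking everything in the future:

```python
import torch

seq_len = 5
# True marks pairs (i, j) that must be blocked, i.e. j > i (the future).
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
print(causal_mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False, False,  True],
#         [False, False, False, False, False]])
# Blocked entries are set to -inf in the attention scores before the
# softmax, so they receive zero attention weight.
```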

Also note that in RNNs, information about the positions of individual elements in the sequence is inherently encoded by the fact that we process them in a specific order. Since a transformer processes all elements at once, we need another way of introducing this positional information. We will explore this later on.
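We will indeed explore this later, but as a preview, the paper's fixed sinusoidal encodings can be computed as below. (A standard sketch, assuming an even d_model; learned positional embeddings are a common alternative.)

```python
import math

import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))            # (d_model/2,)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe  # (max_len, d_model); added to the token embeddings

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
```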

Another thing to keep in mind is that, during inference, the part of a trained transformer network that deals with the generation of a new sequence still operates autoregressively, like in an RNN, where each element of the sequence is produced from previously generated elements. During training, everything is parallelized. Encoding a known sequence, such as the input sequence, is always parallelized.

Wait. A sequence of what?

Words? Not necessarily.

In fact, over the years, we've tried just about everything – characters, words, and... subwords.

Subwords can be characters, words, or anything in between. They are a nice trade-off between a compact character vocabulary, whose individually meaningless units produce extremely long sequences, and a monstrously large vocabulary of full words that would still be completely tripped up by a new word. Subwords allow for encoding almost any word, even unseen ones, with a relatively compact vocabulary.

To create an optimal vocabulary of subwords, we will be using a technique called Byte Pair Encoding, a form of subword tokenization. We will study how this works later. For now, I only wanted to give you a heads up in case you're wondering why the sequences in our examples are not always split into full words.
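To give a flavour of what's coming, here is a toy sketch of the BPE merge loop from Sennrich et al. (2016) – not the tutorial's actual tokenizer. Starting from characters, the most frequent adjacent pair of symbols is repeatedly merged into a new vocabulary unit:

```python
import re
from collections import Counter

def pair_counts(vocab):
    # vocab maps a space-separated symbol sequence (one word) to its corpus frequency
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge(pair, vocab):
    # Fuse every occurrence of the pair into a single new symbol
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words split into characters, with an end-of-word marker
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(3):
    counts = pair_counts(vocab)
    best = max(counts, key=counts.get)
    vocab = merge(best, vocab)
    print(best)  # ('e', 's'), then ('es', 't'), then ('est', '</w>')
```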

As an example, let's consider the following –

The most beautiful thing we can experience is the mysterious. It is the fundamental emotion which stands at the cradle of true art and true science.

This is a quotation from Albert Einstein, but only a translation because he originally said it in his native German –

Das Schönste, was wir erleben können, ist das Geheimnisvolle. Es ist das Grundgefühl, das an der Wiege von wahrer Kunst und Wissenschaft steht.

The model implemented in this tutorial translates this common English version back to German, as –

Das Schönste, was wir erleben können, ist das Geheimnis: Es ist die grundlegende Emotion die an der Wiege der wahren Kunst und der wahren Wissenschaft steht.

The model does a fairly good job, as far as I can tell. It opts for the noun Geheimnis (instead of the adjective Geheimnisvolle), which is primarily used to mean something akin to "secret" rather than "mystery", but the context is still clear. Also, it presents the output as a single sentence, which is not surprising because it was trained predominantly on single sentences.

The English quote above is tokenized as follows –

["_The", "_most", "_beautiful", "_thing", "_we", "_can", "_experience", "_is", "_the", "_myster", "ious", ".", "_It", "_is", "_the", "_fundamental", "_emotion", "_which", "_stands", "_at", "_the", "_c", "rad", "le", "_of", "_true", "_art", "_and", "_true", "_science", "."]

The German translation by the model is tokenized as –

["_Das", "_Schön", "ste,", "_was", "_wir", "_erleben", "_können,", "_ist", "_das", "_Geheim", "nis", ":", "_Es", "_ist", "_die", "_grundlegende", "_Em", "otion", "_die", "_an", "_der", "_Wie", "ge", "_der", "_wahren", "_Kunst", "_und", "_der", "_wahren", "_Wissenschaft", "_steht."]

We will use this English-German input-output pair as an example frequently throughout this tutorial, although there may be other examples as well.

You can see in the tokenization that common words are retained as full words, but others are split into multiple parts. The "_" (underscore) character signifies the beginning of a word in the original sentence.

From this point on, I will simply refer to the units in a sequence as tokens.

A familiar form

While a transformer is quite different in its inner workings from an RNN, it takes on a structure that you may already be familiar with – an encoder and a decoder.

The goal of the encoder is to encode the input sequence into a deep, "learned" representation. Like in an RNN, this involves encoding each token in the sequence.

<p align="center"> <img src="./img/encoder.PNG"> </p>

The goal of the decoder is to use this encoded representation of the input sequence to produce the output sequence. Like in an RNN, during training, we use something called teacher forcing – we use every word from the true output sequence to learn to generate the next true word in the sequence. But while an RNN still processes the input sequence and generates sequentially, a transformer does it in parallel.

<p align="center"> <img src="./img/decoder_training.PNG"> </p>
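In code, teacher forcing amounts to a simple shift: the decoder reads the gold sequence starting at <BOS>, and its prediction targets are that same sequence shifted one position left, ending at <EOS>. (A toy illustration with made-up token ids; the tutorial's real training loop comes later.)

```python
import torch

BOS, EOS = 1, 2                                # hypothetical special-token ids
gold = torch.tensor([[BOS, 5, 17, 42, EOS]])   # one gold output sequence

decoder_input = gold[:, :-1]   # <BOS>  5  17  42    – what the decoder reads
labels        = gold[:, 1:]    #  5  17  42   <EOS>  – what it must predict

# With a causal mask, position t sees only decoder_input[:, :t + 1], so the
# next-token prediction at every position is trained in one parallel pass.
```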

During inference, the decoder operates autoregressively, which means that it generates one word at a time, each of which is used to generate the next word.

The first generation is prompted with a <BOS> token, which implies "beginning of sequence – start generating".

<p align="center"> <img src="./img/decoder_inference_1.PNG"> </p>

In the next step, the second token is generated from the first generation.

<p align="center"> <img src="./img/decoder_inference_2.PNG"> </p>

And then, another.

<p align="center"> <img src="./img/decoder_inference_3.PNG"> </p>

You get the drift. The generative process is terminated upon the generation of an <EOS> token which signifies "end of sequence – I'm done".
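Greedy decoding – always taking the highest-scoring token – makes this loop easy to sketch. (A hedged illustration: model(source, decoded) is a hypothetical encoder-decoder returning logits of shape (1, current_length, vocab_size); beam search, previewed in the Concepts section, would keep several candidates instead.)

```python
import torch

@torch.no_grad()
def greedy_decode(model, source, bos_id, eos_id, max_len=100):
    # Start the output with <BOS> and grow it one token at a time.
    decoded = torch.tensor([[bos_id]])
    for _ in range(max_len):
        logits = model(source, decoded)            # (1, cur_len, vocab_size)
        next_token = logits[:, -1].argmax(dim=-1)  # highest-scoring next token
        decoded = torch.cat([decoded, next_token.unsqueeze(1)], dim=1)
        if next_token.item() == eos_id:            # <EOS> ends the generation
            break
    return decoded
```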

Depending on the task at hand, you need only the encoder, only the decoder, or both:

  • A sequence classification or labeling task requires only the encoder. Popular transformer models like BERT are encoder-only.

  • A sequence-to-sequence task (such as machine translation in this tutorial) conventionally uses both the encoder and decoder in the set-up we just described. Popular transformer models like T5 and BART are encoder-decoder formulations.

  • Sequence generation can also be accomplished by a decoder-only model, where the input sequence or prompt can be used as the first bunch of tokens in an autoregressive decoder. The popular GPT family of transformer models are decoder-only.

In fact, after this tutorial, it will be easy for you to read and understand the research papers for these and other popular transformer models because they adopt, for the most part, the same transformer architecture with some modifications.

Now, without further delay, let's dive into what makes a transformer... a transformer.

Queries, Keys, and Values

Consider a search problem.

<p align="center"> <img src="./img/queries_keys_values_1.PNG"> </p>

The goal is, given a query, to find a value that is most closely matched to it. This will be accomplished by comparing the query to a key associated with each candidate value.

<p align="center"> <img src="./img/queries_keys_values_2.PNG"> </p>

But in real life, there is rarely a single relevant match – relevancy is a spectrum!

<p align="center"> <img src="./img/queries_keys_values_3.PNG"> </p>

Would it then make sense to return a weighted average of the values instead as the result?

<p align="center"> <img src="./img/queries_keys_values_4.PNG"> </p>

Yes, it would absolutely make sense, especially if the values are vectors or embeddings, where the different dimensions numerically encode various (and often abstract) qualities of the information they represent – a weighted average then blends these qualities in proportion to each value's relevance to the query.
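Here is that idea in toy numbers (an illustration, not the tutorial's code): relevance is measured with a dot product between the query and each key, turned into a distribution with a softmax, and then used to weight the values.

```python
import torch
import torch.nn.functional as F

query  = torch.tensor([[1.0, 0.0]])              # (1, d): what we are looking for
keys   = torch.tensor([[1.0, 0.0],
                       [0.7, 0.7],
                       [0.0, 1.0]])              # (3, d): one key per candidate value
values = torch.tensor([[10.0], [20.0], [30.0]])  # (3, d_v): the candidate values

scores  = query @ keys.T             # dot-product relevance: [[1.0, 0.7, 0.0]]
weights = F.softmax(scores, dim=-1)  # a spectrum of relevance: [[0.47, 0.35, 0.17]]
result  = weights @ values           # weighted average of values: [[17.0]] (approx.)
```

This is precisely the computation at the heart of the transformer's attention mechanism, as we will see next.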
