a-PyTorch-Tutorial-to-Transformers


A detailed tutorial and practical guide to implementing the Transformer model in PyTorch

This project provides a PyTorch-based tutorial on implementing the Transformer model. It explains the Transformer's core concepts in depth, such as the multi-head attention mechanism and the encoder-decoder architecture, and demonstrates them on a machine translation application. The content covers model implementation, training, inference, and evaluation, and is aimed at learners who want to thoroughly understand and apply Transformer technology.


This is a PyTorch Tutorial to Transformers.

While we will apply the transformer to a specific task – machine translation – in this tutorial, this is still a tutorial on transformers and how they work. You've come to the right place, regardless of your intended task, application, or domain – natural language processing (NLP) or computer vision (CV).

This is the sixth in a series of tutorials I'm writing about implementing cool models on your own with the amazing PyTorch library.

Basic knowledge of PyTorch is assumed.

If you're new to PyTorch, first read Deep Learning with PyTorch: A 60 Minute Blitz and Learning PyTorch with Examples.

Questions, suggestions, or corrections can be posted as issues.

I'm using PyTorch 1.4 in Python 3.6.

Contents

Objective

Concepts

Overview

Implementation

Training

Inference

Evaluation

Frequently Asked Questions

Objective

Broadly...

To build a transformer model.


More precisely...

To build a model that can generate an output sequence given an input sequence.

We choose an application that represents one of the most complex uses of a transformer – a sequence-to-sequence problem.

But once you understand the transformer, you can just as easily apply it to any task, application, or even domain (NLP or CV) of your choosing. After all, an image is also a sequence, but over two-dimensional space instead of time.


Even more precisely...

To build a model that can translate from one language to another.

Um ein Modell zu erstellen, das von einer Sprache in eine andere übersetzen kann.

We will be implementing the pioneering research paper 'Attention Is All You Need', which introduced the Transformer network to the world. A watershed moment for cutting-edge Natural Language Processing.

Wir werden das wegweisende Forschungspapier "Attention Is All You Need" umsetzen, das das Transformer-Netzwerk in die Welt eingeführt hat. Ein Wendepunkt für die hochmoderne Natural Language Processing.

Specifically, we are going to be translating from English to German. And yes, everything written here in German is straight from the horse's mouth! (The horse, of course, being the model.)

Konkret werden wir vom Englischen ins Deutsche übersetzen. Und ja, alles, was hier in deutscher Sprache geschrieben wird, ist direkt aus dem Mund des Pferdes! (Das Pferd ist natürlich das Modell.)

Concepts

  • Machine Translation. duh.

  • Transformer Network. We have all but retired recurrent neural networks (RNNs) in favour of transformers, a new type of sequence model that possesses an unparalleled ability for representation and abstraction – all while being simpler, more efficient, and significantly more parallelizable. Today, the application of transformers is near universal, as their resounding success in NLP has also led to increasing adoption in computer vision tasks.

  • Multi-Head Scaled Dot-Product Attention. At the heart of the transformer is the attention mechanism, specifically this flavour of attention. It allows the transformer to interpret and encode a sequence in a multitude of contexts and with an unprecedented level of nuance.

  • Encoder-Decoder Architecture. Similar to RNNs, transformer models for sequence transduction typically consist of an encoder that encodes an input sequence, and a decoder that decodes it, token by token, into the output sequence.

  • Positional Embeddings. Unlike RNNs, transformers do not innately account for the sequential nature of text – they instead view such a sequence as a bag of tokens, or pieces of text, that can be freely mixed and matched with tokens from the same or different bag. The coordinates of tokens in a sequence are therefore manually injected into the transformer as one-dimensional vectors or embeddings, allowing the transformer to incorporate their relative positions into its calculations.

  • Byte Pair Encoding. Language models are both enabled and constrained by their vocabularies. Machine translation, especially, is an open-vocabulary problem. Byte Pair Encoding is a way to construct a vocabulary of moderate size that can still represent nearly any word, whether it is common, rare, or entirely unseen.

  • Beam Search. As an alternative to simply choosing the highest-scoring token at each step of the generative process, we consider multiple candidates, reserving judgement until we see what they give rise to in subsequent steps – before finally picking the best overall output sequence.
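To make this last concept concrete before we dive in, here is a minimal, hedged sketch of beam search. The `log_probs_next` function is a hypothetical stand-in for a trained decoder step that returns `(token, log-probability)` pairs for the next token given a partial output sequence; the version implemented later in this tutorial works with the trained model and tensors rather than a toy scoring function.

```python
def beam_search(log_probs_next, bos, eos, beam_size=4, max_len=50):
    """Minimal beam search sketch.

    log_probs_next(prefix) is assumed (hypothetical) to return a list of
    (token, log_prob) pairs for the next token given a partial sequence.
    """
    # Each hypothesis is (sequence, cumulative log-probability).
    beams = [([bos], 0.0)]
    finished = []

    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, lp in log_probs_next(seq):
                candidates.append((seq + [token], score + lp))
        # Keep only the best `beam_size` partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq, score))   # hypothesis is complete
            else:
                beams.append((seq, score))      # keep expanding it
        if not beams:
            break

    finished = finished or beams
    # Return the highest-scoring complete sequence.
    return max(finished, key=lambda c: c[1])[0]
```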

Overview

In this section, I will present an overview of the transformer. If you're already familiar with it, you can skip straight to the Implementation section or the commented code.

Transformers have completely changed the deep learning landscape. They've replaced recurrent neural networks (RNNs) as the workhorse of modern NLP. They have caused a seismic shift in our sequence modeling capabilities, not only with their amazing representational ability, but also with their capacity for transfer learning after self-supervised pre-training on large amounts of data. You have no doubt heard of these models in one form or another – BERT, GPT, etc.

Today, they are also increasingly being used in computer vision applications as an alternative to, or in combination with, convolutional neural networks (CNNs).

As in the original transformer paper, the context presented here is an NLP task – specifically the sequence transduction problem of machine translation. If you want to apply transformers to images, this tutorial is still a good place to learn about how they work.

Better than RNNs, but how?

Recurrent neural networks (RNNs) have long been a staple among NLP practitioners. They operate upon a sequence – you guessed it – sequentially. This isn't weird at all; it's even intuitive because that's how we do it too – we read text from one side to another.

<p align="center"> <img src="./img/rnn_sequential.PNG"> </p>

On the other hand, our ability to train deep neural networks is somewhat predicated on our ability to perform efficient computation. The sequential nature of RNNs precludes parallelization – you cannot move to the next token in the sequence without processing the previous one. Training networks on large amounts of data can be significantly time-consuming.

Transformers can process elements in a sequence in parallel.

<p align="center"> <img src="./img/transformer_parallel.PNG"> </p>

The sequential processing of text in an RNN introduces another problem. The output of the RNN at a given position is conditioned directly on the output at the previous position, which in turn is conditioned on its previous position, and so on. However, logical dependencies in text can occur across longer distances. It is often the case that you need access to information from a dozen positions ago.

No doubt some of this information can persist across moderate distances, but a lot of it could have decayed in the daisy-chain of computation that defines the RNN. It's easy to see why – we're relying on the output at each position to encode not only the output at that position but also other information that may (or may not) be useful ten steps down the line, with the outputs of each intervening step also having to encode their own information and pass on this possibly relevant information.

There have been various modifications to the original RNN cell over the years to alleviate this problem, the most notable of which is probably the Long Short-Term Memory (LSTM) cell, which introduces an additional pathway known as the "cell state" for the sequential flow of information across cells, thereby reducing the burden on the cell outputs to encode all of this information. While this allows for modeling longer dependencies, the fundamental problem still exists – an RNN can access other positions only through intervening positions and not directly.

Transformers allow direct access to other positions.

<p align="center"> <img src="./img/access.PNG"> </p>

This means that each position can use information directly from other positions in a sequence, producing a highly context-aware and nuanced output. This, along with other design choices we will see later, makes way for transformers' unprecedented representational ability.

The "direct access" we are speaking of occurs through an attention mechanism, something not completely unlike attention mechanisms you may have encountered earlier – for example, between an RNN encoder and decoder. If the concept is completely unfamiliar to you, it doesn't matter as we will go over it in quite some detail very soon.

I would like to point out here that RNNs need not be unidirectional. Since important textual context can occur both before and after a certain position in the sequence, we often use bidirectional RNNs, where we operate two different RNN cells in opposite directions and combine their outputs. On the other hand, a transformer layer is inherently bidirectional – unless we choose to constrain access to a particular direction, as we will see later.

Also note that in RNNs, information about the positions of individual elements in the sequence is inherently encoded by the fact that we process them in a specific order. Since a transformer processes them all at once, we need another way of introducing this positional information. We will explore this later on.
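To give a concrete picture of one standard way of doing this, here is a minimal sketch of the fixed sinusoidal positional encoding from "Attention Is All You Need": even dimensions get sine waves and odd dimensions get cosine waves of geometrically increasing wavelength, and the result is simply added to the token embeddings. (Learned positional embeddings are a common alternative.)

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) tensor of fixed positional encodings,
    following the sinusoidal scheme in 'Attention Is All You Need'."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)       # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))                 # (d_model / 2,)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

# These encodings are simply added to the token embeddings:
# embeddings = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```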

Another thing to keep in mind is that, during inference, the part of a trained transformer network that deals with the generation of a new sequence still operates autoregressively, like in an RNN, where each element of the sequence is produced from previously generated elements. During training, everything is parallelized. Encoding a known sequence, such as the input sequence, is always parallelized.

Wait. A sequence of what?

Words? Not necessarily.

In fact, over the years, we've tried just about everything – characters, words, and... subwords.

Subwords can be characters, words, or anything in between. They are a nice trade-off between a compact character vocabulary, whose units aren't meaningful by themselves and which produces extremely long sequences, and a monstrously large vocabulary of full words that would still be completely tripped up by a new word. Subwords allow almost any word, even an unseen one, to be encoded with a relatively compact vocabulary.

To create an optimal vocabulary of subwords, we will be using a technique called Byte Pair Encoding, a form of subword tokenization. We will study how this works later. For now, I only wanted to give you a heads up in case you're wondering why the sequences in our examples are not always split into full words.
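As a rough, toy illustration of the idea only (this is not the actual tokenizer used later in the tutorial), the core of BPE training is: repeatedly find the most frequent adjacent pair of symbols in the corpus and merge it into a new, longer symbol.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE trainer. `words` maps each word, given as a tuple of symbols
    (e.g. characters), to its corpus frequency. Returns the learned merges."""
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with a merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Frequent pairs are merged first, gradually building subwords such as "th", "the", "ing".
print(bpe_merges({("t", "h", "e"): 5, ("t", "h", "i", "s"): 3}, num_merges=2))
```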

As an example, let's consider the following –

The most beautiful thing we can experience is the mysterious. It is the fundamental emotion which stands at the cradle of true art and true science.

This is a quotation from Albert Einstein, but only a translation because he originally said it in his native German –

Das Schönste, was wir erleben können, ist das Geheimnisvolle. Es ist das Grundgefühl, das an der Wiege von wahrer Kunst und Wissenschaft steht.

The model implemented in this tutorial back-translates this common English translation back to German, as –

Das Schönste, was wir erleben können, ist das Geheimnis: Es ist die grundlegende Emotion die an der Wiege der wahren Kunst und der wahren Wissenschaft steht.

The model does a fairly good job, as far as I can tell. It opts for the noun Geheimnis (instead of the adjective Geheimnisvolle) which is primarily used to mean something akin to "secret" rather than "mystery", but the context is still clear. Also, it presents the output as a single sentence which is not surprising because it was trained predominantly on single sentences.

The English quote above is tokenized as follows –

["_The", "_most", "_beautiful", "_thing", "_we", "_can", "_experience", "_is", "_the", "_myster", "ious", ".", "_It", "_is", "_the", "_fundamental", "_emotion", "_which", "_stands", "_at", "_the", "_c", "rad", "le", "_of", "_true", "_art", "_and", "_true", "_science", "."]

The German translation by the model is tokenized as –

["_Das", "_Schön", "ste,", "_was", "_wir", "_erleben", "_können,", "_ist", "_das", "_Geheim", "nis", ":", "_Es", "_ist", "_die", "_grundlegende", "_Em", "otion", "_die", "_an", "_der", "_Wie", "ge", "_der", "_wahren", "_Kunst", "_und", "_der", "_wahren", "_Wissenschaft", "_steht."]

We will use this English-German input-output pair as an example frequently throughout this tutorial, although there may be other examples as well.

You can see in the tokenization that common words are retained as full words, but others are split into multiple parts. The "_" (underscore) character signifies the beginning of a word in the original sentence.

From this point on, I will simply refer to the units in a sequence as tokens.

A familiar form

While a transformer is quite different in its inner working compared to an RNN, they do take on a structure that you may already be familiar with – an encoder and a decoder.

The goal of the encoder is to encode the input sequence into a deep, "learned" representation. Like in an RNN, this involves encoding each token in the sequence.

<p align="center"> <img src="./img/encoder.PNG"> </p>

The goal of the decoder is to use this encoded representation of the input sequence to produce the output sequence. Like in an RNN, during training, we use something called teacher forcing – we use every word from the true output sequence to learn to generate the next true word in the sequence. But while an RNN still processes the input sequence and generates sequentially, a transformer does it in parallel.
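A quick sketch of how this parallel training is made safe: the decoder's self-attention is given a "subsequent" mask so that position i can attend only to positions up to i, and, with teacher forcing, position i is trained to predict token i+1. This is a minimal illustration; conventions vary, and some implementations use an additive mask of negative infinity instead of a boolean one.

```python
import torch

def subsequent_mask(size):
    """(size, size) boolean mask that is True where attention is allowed:
    position i may attend only to positions 0..i."""
    return torch.tril(torch.ones(size, size)).bool()

# For teacher forcing, the decoder's input is the target sequence shifted right
# by one (starting with <BOS>), and the training labels are the unshifted target
# sequence (ending with <EOS>). With the mask above, position i predicts token
# i+1 using only tokens 0..i, so all positions can be trained in parallel.
print(subsequent_mask(4))
```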

<p align="center"> <img src="./img/decoder_training.PNG"> </p>

During inference, the decoder operates autoregressively, which means that it generates one word at a time, each of which is used to generate the next word.

The first generation is prompted with a <BOS> token, which implies "beginning of sequence – start generating".

<p align="center"> <img src="./img/decoder_inference_1.PNG"> </p>

In the next step, the second token is generated from the first generation.

<p align="center"> <img src="./img/decoder_inference_2.PNG"> </p>

And then, another.

<p align="center"> <img src="./img/decoder_inference_3.PNG"> </p>

You get the drift. The generative process is terminated upon the generation of an <EOS> token which signifies "end of sequence – I'm done".
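As a sketch of this autoregressive loop in its simplest, greedy form: the `decoder_step` function below is a hypothetical stand-in for the model's decoder, returning next-token logits given the encoded input and the tokens generated so far. The tutorial itself will decode with beam search rather than this purely greedy version.

```python
import torch

def greedy_decode(decoder_step, source_encoding, bos_id, eos_id, max_len=100):
    """Greedy autoregressive decoding sketch.

    decoder_step(source_encoding, generated) is assumed (hypothetical) to
    return next-token logits of shape (vocab_size,) given the encoder's
    output and the list of token ids generated so far.
    """
    generated = [bos_id]                         # start with <BOS>
    for _ in range(max_len):
        logits = decoder_step(source_encoding, generated)
        next_token = int(torch.argmax(logits))   # pick the highest-scoring token
        generated.append(next_token)
        if next_token == eos_id:                 # stop at <EOS>
            break
    return generated
```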

Depending on the task at hand, you need only the encoder, only the decoder, or both:

  • A sequence classification or labeling task requires only the encoder. Popular transformer models like BERT are encoder-only.

  • A sequence-to-sequence task (such as machine translation in this tutorial) conventionally uses both the encoder and decoder in the set-up we just described. Popular transformer models like T5 and BART are encoder-decoder formulations.

  • Sequence generation can also be accomplished by a decoder-only model, where the input sequence or prompt can be used as the first bunch of tokens in an autoregressive decoder. The popular GPT family of transformer models are decoder-only.

In fact, after this tutorial, it will be easy for you to read and understand the research papers for these and other popular transformer models because they adopt, for the most part, the same transformer architecture with some modifications.

Now, without further delay, let's dive into what makes a transformer... a transformer.

Queries, Keys, and Values

Consider a search problem.

<p align="center"> <img src="./img/queries_keys_values_1.PNG"> </p>

The goal is, given a query, to find a value that is most closely matched to it. This will be accomplished by comparing the query to a key associated with each candidate value.

<p align="center"> <img src="./img/queries_keys_values_2.PNG"> </p>

But in real life, there is rarely a single relevant match – relevancy is a spectrum!

<p align="center"> <img src="./img/queries_keys_values_3.PNG"> </p>

Would it then make sense to return a weighted average of the values instead as the result?

<p align="center"> <img src="./img/queries_keys_values_4.PNG"> </p>

Yes, it would absolutely make sense, especially if the values are vectors or embeddings, where the different dimensions numerically encode various (and often abstract) qualities of whatever the values represent. The weighted average then blends these qualities in proportion to each value's relevance to the query.
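This weighted average, with weights derived from how well the query matches each key, is exactly what the transformer's scaled dot-product attention computes. Here is a minimal single-head sketch, with no masking and with the dot products scaled by the square root of the key dimension before the softmax:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(queries, keys, values):
    """queries: (n_q, d_k), keys: (n_k, d_k), values: (n_k, d_v).
    Returns one weighted average of the values per query: (n_q, d_v)."""
    d_k = queries.size(-1)
    # Similarity of every query to every key, scaled to keep the softmax well-behaved.
    scores = queries @ keys.transpose(-2, -1) / math.sqrt(d_k)   # (n_q, n_k)
    weights = F.softmax(scores, dim=-1)                          # relevance weights, each row sums to 1
    return weights @ values                                      # weighted average of the values
```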
