
Attention Is All You Need

geum 2022. 2. 23. 14:49

์ด๋ฒˆ์ฃผ๋ถ€ํ„ฐ ํ•œ ์ฃผ์— ํ•˜๋‚˜์˜ ๋…ผ๋ฌธ์„ ์ฝ์–ด๋ณด๋ ค๊ณ  ํ•œ๋‹ค. ๋‚˜ ์ž˜ํ•  ์ˆ˜ ์žˆ๊ฒ ์ง€ ? ^_^

 

๐Ÿ’ฌ ๋…ผ๋ฌธ ๋‚ด์šฉ๊ณผ ์ด ๊ธ€์— ๋Œ€ํ•œ ์˜๊ฒฌ ๊ณต์œ , ์˜คํƒˆ์ž ์ง€์  ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค. ํŽธํ•˜๊ฒŒ ๋Œ“๊ธ€ ๋‚จ๊ฒจ์ฃผ์„ธ์š” !


์›๋ฌธ : https://arxiv.org/pdf/1706.03762.pdf

 

Abstract

dominantํ•œ sequence transduction ๋ชจ๋ธ๋“ค์€ ๋ณต์žกํ•œ RNN/CNN ๊ตฌ์กฐ

→ Attention ๋งค์ปค๋‹ˆ์ฆ˜๋งŒ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” ์ƒˆ๋กญ๊ณ  ๊ฐ„๋‹จํ•œ ๊ตฌ์กฐ์˜ Transformer ์ œ์•ˆ

 

2022. 3. 4 ์ถ”๊ฐ€

 

Transformer ์š”์•ฝ : ํ•™์Šต๊ณผ ๋ณ‘๋ ฌํ™”๊ฐ€ ์‰ฝ๊ณ  attention ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์†๋„๋ฅผ ๋†’์ธ ๋ชจ๋ธ

 

Introduction

Attention ๋งค์ปค๋‹ˆ์ฆ˜์€ ์ž…๋ ฅ, ์ถœ๋ ฅ ๊ฐ„ ๊ฑฐ๋ฆฌ์— ์ƒ๊ด€์—†์ด modeling์„ ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•œ๋‹ค๋Š” ์ ์—์„œ sequence modeling๊ณผ transduction ๋ชจ๋ธ์˜ ํ•ต์‹ฌ์ ์ธ ๋ถ€๋ถ„์ด์ง€๋งŒ ์—ฌ์ „ํžˆ recurrent network์™€ ํ•จ๊ป˜ ์‚ฌ์šฉ๋˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ์ข…์ข… ์žˆ์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ๋…ผ๋ฌธ ์ €์ž๋“ค์€ recurrent network ์—†์ด attention ๋งค์ปค๋‹ˆ์ฆ˜์—๋งŒ ์˜์กดํ•˜๋Š” ๋ชจ๋ธ ๊ตฌ์กฐ๋ฅผ ์ œ์•ˆํ•œ๋‹ค.

 

Background

Self-attention์ด๋ž€ ์‹œํ€€์Šค์˜ representation์„ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•ด ํ•˜๋‚˜์˜ ์‹œํ€€์Šค์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ์œ„์น˜์— ์žˆ๋Š” ์ •๋ณด๋ฅผ ๊ฐ€์ง€๊ณ  attention์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ณผ์ •์ด๋‹ค. (์„œ๋กœ ๋‹ค๋ฅธ ๋ฌธ์žฅ์— attention์„ ์ ์šฉํ•˜๋Š” ๊ฒŒ ์•„๋‹ˆ๋‹ค ์ด๋Ÿฐ ์˜๋ฏธ๋กœ ํ•ด์„ํ•จ)   

 

Model Architecture

๊ฒฝ์Ÿ๋ ฅ ์žˆ๋Š” neural sequence transduction ๋ชจ๋ธ๋“ค์€ ๋Œ€๋ถ€๋ถ„ ์ธ์ฝ”๋”-๋””์ฝ”๋” ๊ตฌ์กฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค.

 

- ์ธ์ฝ”๋” : ์ž…๋ ฅ์œผ๋กœ ๋“ค์–ด์˜ค๋Š” symbol representations sequence (x1, …, xn)์„ ์—ฐ์†์ ์ธ representations sequence (z1, …, zn)์— ๋งคํ•‘ํ•˜๋Š” ์—ญํ•  (1~n์€ ์•„๋ž˜์ฒจ์ž)

- ๋””์ฝ”๋” : z๊ฐ€ ์ฃผ์–ด์ง€๋ฉด symbol์˜ output sequence (y1, …, yn)์„ ์ถœ๋ ฅ

 

** ๋””์ฝ”๋” ๋ถ€๋ถ„ ๋‚ด์šฉ์— the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. ์ด๋ผ๋Š” ๋ฌธ์žฅ์ด ์žˆ๋Š”๋ฐ one element at a time์˜ ์˜๋ฏธ๋ฅผ ์•„์ง ๋ชป ์ดํ•ดํ–ˆ๋‹ค ใ… 

 

Transformer๋Š” ์ธ์ฝ”๋”-๋””์ฝ”๋” ๊ตฌ์กฐ๋ฅผ ์ „๋ฐ˜์ ์œผ๋กœ ๋”ฐ๋ฅด๋Š”๋ฐ ์ธ์ฝ”๋”์™€ ๋””์ฝ”๋”์—์„œ ๋ชจ๋‘ stacked self-attention, point-wise fully connected layers๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.

 

Transformer ๋ชจ๋ธ ๊ตฌ์กฐ

 

1. Transformer encoder & Transformer decoder

 

1) Transformer encoder

- ๋‘ ๊ฐœ์˜ sub-layers(multi-head self-attention, wise-fully connected feed forward)๋กœ ๊ตฌ์„ฑ๋œ ๋ ˆ์ด์–ด(๊ทธ๋ฆผ ํšŒ์ƒ‰ ๋ถ€๋ถ„) 6๊ฐœ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๋‹ค.

- ๋‘ ๊ฐœ์˜ sub-layer์— residual connection๊ณผ layer normalization์„ ์ ์šฉํ–ˆ๋‹ค.

- ๊ฐ sub-layer์˜ ์ถœ๋ ฅ์€ LayerNorm(x+Sublayer(x))์ด๊ณ  Sublayer(x)๋Š” sub-layer ์ž์ฒด์— ์˜ํ•ด ๊ตฌํ˜„๋œ ํ•จ์ˆ˜์ด๋‹ค.

- residual connection์„ ์šฉ์ดํ•˜๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•ด embedding layers๋ฅผ ํฌํ•จํ•œ ๋ชจ๋ธ์˜ ๋ชจ๋“  sub-layer์˜ ์ถœ๋ ฅ ์ฐจ์›์„ 512๋กœ ์„ค์ •ํ–ˆ๋‹ค.  

 

2) Transformer decoder

- ์„ธ ๊ฐœ์˜ sub-layersub-layers(masked multi-head attention, multi-head self-attention, wise-fully connected feed forward)๋กœ ๊ตฌ์„ฑ๋œ ๋ ˆ์ด์–ด 6๊ฐœ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๋‹ค.

- ์ดํ›„์˜ ์œ„์น˜์— ์ ‘๊ทผํ•˜์ง€ ๋ชปํ•˜๊ฒŒ self-attention layer๋ฅผ ์ˆ˜์ •ํ–ˆ๋‹ค. → position i๋ฅผ ์˜ˆ์ธกํ•  ๋•Œ i๋ณด๋‹ค ์ž‘์€ ์œ„์น˜์— ์žˆ๋Š” ๊ฒƒ์—๋งŒ ์˜์กดํ•˜๋„๋ก

** i๊ฐ€ ์ฒ ์ž ํ•˜๋‚˜ํ•˜๋‚˜์˜ ์ธ๋ฑ์Šค๋ฅผ ์˜๋ฏธํ•˜๋Š” ๊ฒŒ ๋งž๋‚˜? ๐Ÿ™„

 

2. Attention

๋…ผ๋ฌธ์—์„œ ์‚ฌ์šฉํ•˜๋Š” Attention์˜ ๊ตฌ์กฐ

1) Scaled Dot-Product Attention

- dx ์ฐจ์›์˜ key, dv ์ฐจ์›์˜ value, ์ฟผ๋ฆฌ๊ฐ€ ์ž…๋ ฅ์œผ๋กœ ๋“ค์–ด์˜จ๋‹ค. 

- ํ‚ค๋ฅผ ์ด์šฉํ•ด ์ฟผ๋ฆฌ์˜ dot products๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ๊ฐ๊ฐ์„ sqrt(dk)๋กœ ๋‚˜๋ˆˆ ๋‹ค์Œ, softmax ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•˜์—ฌ values์— ๋Œ€ํ•œ ๊ฐ€์ค‘์น˜๋ฅผ ๊ตฌํ•œ๋‹ค.

- output matrix๋Š” ์•„๋ž˜์™€ ๊ฐ™์ด ๊ณ„์‚ฐํ•œ๋‹ค.

 

2022. 3. 8 ์ถ”๊ฐ€

 

$sqrt(d_{k})$์˜ ์—ญํ•  : $d_{k}$์˜ ๊ฐ’์ด ์ž‘์„ ๋•Œ๋Š” additive attention๊ณผ dot-product attention์˜ ์„ฑ๋Šฅ์ด ๋น„์Šทํ•˜์ง€๋งŒ $d_{k}$๊ฐ€ ํฐ ๊ฐ’์ผ ๋•Œ๋Š” dot-product attention์˜ ์„ฑ๋Šฅ์ด ํ›จ์”ฌ ๋–จ์–ด์ง€๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ’ ํฌ๊ธฐ ์กฐ์ • ๋ชฉ์ 

 

๋…ผ๋ฌธ์—์„œ ์œ„์— ๋Œ€ํ•œ ์„ค๋ช…์ด ์ ํ˜€์žˆ๋Š” ๋ถ€๋ถ„

 

2) Multi-Head Attention

- d_model ์ฐจ์›์˜ keys, values, queries๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋Œ€์‹  dk ์ฐจ์›, dv ์ฐจ์›์— ๋Œ€ํ•ด ํ•™์Šต๋œ ์„œ๋กœ ๋‹ค๋ฅธ linear projections์„ ์‚ฌ์šฉํ•˜์—ฌ queries, keys, values๋ฅผ linearํ•˜๊ฒŒ hํšŒ ํˆฌ์˜ํ•˜๋Š” ๊ฒƒ์ด ์œ ์ตํ•˜๋‹ค๋Š” ๊ฒƒ์„ ๋ฐœ๊ฒฌํ–ˆ๋‹ค.

- queries, keys, values์˜ ํˆฌ์˜ ๋ฒ„์ „์—์„œ self-attention์„ ๋ณ‘๋ ฌ๋กœ ์ˆ˜ํ–‰ํ•˜์—ฌ dv ์ฐจ์›์˜ ์ถœ๋ ฅ ๊ฐ’์„ ์‚ฐ์ถœํ•˜๋Š”๋ฐ ์ตœ์ข…์ ์œผ๋กœ ์‚ฐ์ถœ๋˜๋Š” ๊ฐ’์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค. 

 

 

3) Applications of Attention in our Model

-  Transformer๋Š” multi-head attention์„ ์„ธ ๊ฐ€์ง€ ๋ฐฉ์‹์œผ๋กœ ์‚ฌ์šฉํ•œ๋‹ค.

> ์ธ์ฝ”๋”-๋””์ฝ”๋” attention ๋ ˆ์ด์–ด์—์„œ queries๋Š” ์ด์ „ ๋””์ฝ”๋” ๋ ˆ์ด์–ด์—์„œ ๋‚˜์˜ค๊ณ  memory keys์™€ values๋Š” ์ธ์ฝ”๋”์˜ ์ถœ๋ ฅ์œผ๋กœ๋ถ€ํ„ฐ ๋‚˜์˜ค๋Š”๋ฐ ์ด ๊ณผ์ •์—์„œ ๋””์ฝ”๋”์˜ ๋ชจ๋“  ์œ„์น˜๊ฐ€ ์ž…๋ ฅ ์‹œํ€€์Šค์˜ ๋ชจ๋“  ์œ„์น˜์— ๋Œ€ํ•ด ๋ฐฐ์น˜๋œ๋‹ค. → sequence-to-sequence ๋ชจ๋ธ์—์„œ ์ผ๋ฐ˜์ ์ธ ์ธ์ฝ”๋”-๋””์ฝ”๋” attention ๋งค์ปค๋‹ˆ์ฆ˜์„ ๋ชจ๋ฐฉ

>  ์ธ์ฝ”๋”์— self-attention ๋ ˆ์ด์–ด๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ธ์ฝ”๋”์˜ ๊ฐ ์œ„์น˜๋Š” ์ธ์ฝ”๋” ์ด์ „ ๋ ˆ์ด์–ด์— ์žˆ๋Š” ๋ชจ๋“  ์œ„์น˜๋ฅผ ๊ด€๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค. ** ??

** ๋งˆ์ง€๋ง‰ ๋‚ด์šฉ์€ ๋””์ฝ”๋” ๊ด€๋ จ์ธ๋ฐ ๋ฌธ์žฅ ์ฒ˜์Œ๋ถ€ํ„ฐ ์ดํ•ด๋ถˆ๊ฐ€์—ฌ์„œ ์ƒ๋žตํ•˜์˜€์Œ ..

 

3. Position-wise Feed-Forward Networks

 

attention sub-layer ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์ธ์ฝ”๋”์™€ ๋””์ฝ”๋”์˜ ๊ฐ ๋ ˆ์ด์–ด๋“ค์€ ๊ฐ ์œ„์น˜์— ๊ฐœ๋ณ„์ ์œผ๋กœ ์ ์šฉ๋œ fully connected feed-forward network๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค.

 

 

4. Embedding and Softmax

 

๋‹ค๋ฅธ ์‹œํ€€์Šค ๋ชจ๋ธ๊ณผ ์œ ์‚ฌํ•˜๊ฒŒ Transformer๋„ ํ•™์Šต๋œ ์ž„๋ฒ ๋”ฉ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ž…๋ ฅ ํ† ํฐ๊ณผ ์ถœ๋ ฅ ํ† ํฐ์„ d_model ์ฐจ์›์„ ๊ฐ–๋Š” ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค. ๋˜ํ•œ ๋””์ฝ”๋” ์ถœ๋ ฅ์„ ์˜ˆ์ธก๋˜๋Š” ๋‹ค์Œ ํ† ํฐ์˜ ํ™•๋ฅ ๋กœ ๋ณ€ํ™˜ํ•˜๊ธฐ ์œ„ํ•ด usual learned linear transformation๊ณผ softmax ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.

 

5. Positional Encoding

 

Transformer๋Š” recurrence, convolution์ด ์—†๊ธฐ ๋•Œ๋ฌธ์— ๋ชจ๋ธ์ด ์‹œํ€€์Šค ์ˆœ์„œ๋ฅผ ์‚ฌ์šฉํ•˜๋ ค๋ฉด ์‹œํ€€์Šค์— ํ† ํฐ์˜ ์ƒ๋Œ€์ ์ธ ์œ„์น˜(๋˜๋Š” ์ ˆ๋Œ€์ ์ธ ์œ„์น˜)์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์ฃผ์ž…ํ•ด์•ผ ํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ์ธ์ฝ”๋”์™€ ๋””์ฝ”๋”์˜ ํ•˜๋‹จ์— ์œ„์น˜ํ•œ ์ž…๋ ฅ ์ž„๋ฒ ๋”ฉ์— d_model ์ฐจ์›์„ ๊ฐ–๋Š” positional encodings์„ ์ถ”๊ฐ€ํ•œ๋‹ค.

 

The encodings used are (pos: position, i: dimension):
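$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$
$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$

A minimal sketch of these sinusoidal encodings (the function name and max_len are my own):

```python
import math
import torch

def positional_encoding(max_len, d_model=512):
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)      # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))                # 1 / 10000^(2i/d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

pe = positional_encoding(max_len=50)
# this matrix is added to the input embeddings at the bottom of both stacks
print(pe.shape)   # torch.Size([50, 512])
```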

 

Why Self-Attention

์ด ๋ถ€๋ถ„์€ ์ข€ ๋” ๊ผผ๊ผผํžˆ ๋ณธ ํ›„์— ๋‚ด์šฉ์„ ์ถ”๊ฐ€ํ•  ์˜ˆ์ •์ด๋‹ค.

 

2022. 2. 28 ์ถ”๊ฐ€

 

self-attention layer์™€ recurrent and convolution layer๋ฅผ ๋น„๊ตํ•˜๊ณ  ์™œ self-Attention์„ ์‚ฌ์šฉํ–ˆ๋Š”์ง€์— ๋Œ€ํ•œ ๋‚ด์šฉ์ด๋‹ค. ๋…ผ๋ฌธ ์ €์ž๋“ค์€ ์•„๋ž˜ ์„ธ ๊ฐ€์ง€ ์ด์œ ๋Š” ๊ณ ๋ คํ•˜์—ฌ self-Attention์„ ์‚ฌ์šฉํ–ˆ๋‹ค.

 

โ‘  ๋ ˆ์ด์–ด ๋‹น ์ด ์—ฐ์‚ฐ ๋ณต์žก๋„

โ‘ก ํ•„์š”ํ•œ ์ตœ์†Œํ•œ์˜ ์ˆœ์ฐจ ์—ฐ์‚ฐ ์ˆ˜๋กœ ์ธก์ •ํ•  ๋•Œ ๋ณ‘๋ ฌํ™”ํ•  ์ˆ˜ ์žˆ๋Š” ๊ณ„์‚ฐ์˜ ์–‘ **ํ•ต์‹ฌ์€ '๋ณ‘๋ ฌํ™”'๋ผ๊ณ  ์ƒ๊ฐํ•œ๋‹ค.

โ‘ข ๋„คํŠธ์›Œํฌ์˜ long-dependencies(์žฅ๊ธฐ ์˜์กด์„ฑ) ์‚ฌ์ด ๊ฒฝ๋กœ์˜ ๊ธธ์ด

 

๐Ÿ” โ‘ข ๊ด€๋ จ ๋ถ€์—ฐ ์„ค๋ช…(๋…ผ๋ฌธ์— ์žˆ๋Š” ๋‚ด์šฉ)

- long-range dependencies๋Š” sequence ๋ณ€ํ™˜ ๊ณผ์ œ์˜ ํ•ต์‹ฌ ๋ฌธ์ œ์ด๋‹ค.

- ๋„คํŠธ์›Œํฌ์—์„œ forward&backward ์‹ ํ˜ธ๊ฐ€ ์ด๋™ํ•ด์•ผ ํ•˜๋Š” ๊ฒฝ๋กœ์˜ ๊ธธ์ด๋Š” ์˜์กด์„ฑ ํ•™์Šต ๋Šฅ๋ ฅ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ์š”์ธ ์ค‘ ํ•˜๋‚˜๋‹ค.

- ์ž…๋ ฅ ๋ฐ ์ถœ๋ ฅ ์‹œํ€€์Šค์—์„œ ์œ„์น˜ ์กฐํ•ฉ ์‚ฌ์ด ๊ฒฝ๋กœ๊ฐ€ ์งง์„์ˆ˜๋ก long-range dependencies ํ•™์Šต์ด ์‰ฝ๋‹ค.

 

**์œ„์น˜ ์กฐํ•ฉ์ด ์ •ํ™•ํžˆ ๋ญ˜๊นŒ

 

Self-Attention์˜ ์„ฑ๋Šฅ

 

์œ„์˜ ๋น„๊ต ํ‘œ๋ฅผ ๋ณด๋ฉด ๋ชจ๋“  ๋ฉด์—์„œ self-Attention์˜ ์„ฑ๋Šฅ์ด ๊ฐ€์žฅ ์ข‹์€ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

 

Training & Results & Conclusion

์ƒ๋žต

 

References

https://youtu.be/Yk1tV_cXMMU