Archive

[Translation] Attention: Sequence 2 Sequence model with Attention Mechanism

geum 2022. 3. 16. 15:57

์—ฌ๋Ÿฌ ์‚ฌ์ดํŠธ์— ํฉ์–ด์ ธ ์žˆ๋Š” ๊ธ€์„ ๋ชจ์œผ๋ฉด ์–ด๋ ค์šด ๊ฐœ๋…๋“ค์„ ์™„๋ฒฝํžˆ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋‹ค๊ณ  ๋ฏฟ์œผ๋ฉด์„œ ์‹œ์ž‘ํ•œ ์ž์ฒด ์ฝ˜ํ…์ธ  'Medium ๋ฒˆ์—ญ' 

 

 

💬 I tried to keep the translation as smooth as possible, but some sentences may still read awkwardly. Feedback is always welcome 🙂


Original article: https://towardsdatascience.com/sequence-2-sequence-model-with-attention-mechanism-9e9ca2a613a

 


 

์ด ๊ธ€์„ ํ†ตํ•ด ๋‹น์‹ ์ด ๋ฐฐ์šฐ๊ฒŒ ๋  ๋‚ด์šฉ

 

  • seq2seq ๋ชจ๋ธ์— ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์ด ํ•„์š”ํ•œ ์ด์œ 
  • Bahdanau ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์˜ ์ž‘๋™ ๋ฐฉ์‹
  • Luong ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์˜ ์ž‘๋™ ๋ฐฉ์‹
  • Bahdanau ์–ดํ…์…˜๊ณผ Luong ์–ดํ…์…˜์˜ ํ•ต์‹ฌ ์ฐจ์ด์ 

 

attention์ด๋ž€ ๋ฌด์—‡์ด๋ฉฐ, seq2seq ๋ชจ๋ธ์— ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์ด ํ•„์š”ํ•œ ์ด์œ 

 

Consider two scenarios: in one, you are reading a news article about current events; in the other, you are preparing for an exam. Is your level of focus the same in both situations, or different?

๋‹น์‹ ์€ ์•„๋งˆ ๋‰ด์Šค ๊ธฐ์‚ฌ๋ฅผ ์ฝ์„ ๋•Œ๋ณด๋‹ค ์‹œํ—˜์„ ์ค€๋น„ํ•  ๋•Œ ๋” ์ฃผ์˜ ๊นŠ๊ฒŒ ์ฝ์„ ๊ฒƒ์ด๋‹ค. ์‹œํ—˜ ๊ณต๋ถ€๋ฅผ ํ•˜๋Š” ๋™์•ˆ ๋‹จ์ˆœํ•˜๊ฑฐ๋‚˜ ๋ณต์žกํ•œ ๋‚ด์šฉ์„ ๊ธฐ์–ตํ•˜๋Š” ๋ฐ์— ๋„์›€์ด ๋˜๋Š” ํ‚ค์›Œ๋“œ์— ๋” ์ง‘์ค‘ํ•ด ํ•™์Šตํ•  ๊ฒƒ์ด๋‹ค. ์ด๋Ÿฌํ•œ ๋ฐฉ์‹์€ ํŠน์ • ๊ด€์‹ฌ ๋ถ„์•ผ์— ์ง‘์ค‘ํ•˜๊ณ ์ž ํ•˜๋Š” ๋”ฅ๋Ÿฌ๋‹ ๊ณผ์ œ์—๋„ ์ ์šฉ๋œ๋‹ค.

 

A Sequence-to-Sequence (seq2seq) model uses an encoder-decoder architecture.

 

Seq2Seq ๋ชจ๋ธ์€ ์†Œ์Šค ๋ฌธ์žฅ์„ ํƒ€๊ฒŸ ๋ฌธ์žฅ์— ๋งคํ•‘ํ•œ๋‹ค. ๊ธฐ๊ณ„ ๋ฒˆ์—ญ์˜ ๊ฒฝ์šฐ ์†Œ์Šค ๋ฌธ์žฅ์€ ์˜์–ด, ํƒ€๊ฒŸ ๋ฌธ์žฅ์€ ํžŒ๋‘์–ด๊ฐ€ ๋  ์ˆ˜ ์žˆ๋‹ค. (์˜์–ด๋ฅผ ํžŒ๋‘์–ด๋กœ ๋ฒˆ์—ญํ•œ๋‹ค๋Š” ์˜๋ฏธ)

 

์˜์–ด๋กœ ๋œ ์†Œ์Šค ๋ฌธ์žฅ์„ ์ธ์ฝ”๋”์— ์ „๋‹ฌํ•œ๋‹ค. ์ธ์ฝ”๋”๋Š” ์†Œ์Šค ๋ฌธ์žฅ์˜ ์ „์ฒด ์ •๋ณด๋ฅผ context vector๋ผ๊ณ  ํ•˜๋Š” ํ•˜๋‚˜์˜ ์‹ค์ˆ˜๊ฐ’ ๋ฒกํ„ฐ๋กœ ์ธ์ฝ”๋”ฉํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  context vector๋Š” ๋””์ฝ”๋”๋กœ ์ „๋‹ฌ๋˜์–ด ํƒ€๊ฒŸ ์–ธ์–ด๋กœ ๋œ ์ถœ๋ ฅ ๋ฌธ์žฅ์„ ์ƒ์„ฑํ•œ๋‹ค. Context vector๋Š” ์ž…๋ ฅ ๋ฌธ์žฅ ์ „์ฒด๋ฅผ ํ•˜๋‚˜์˜ ๋ฒกํ„ฐ๋กœ ์š”์•ฝํ•ด์•ผ ํ•œ๋‹ค.

 

์ž…๋ ฅ ๋ฌธ์žฅ์ด ๊ธธ๋ฉด ๋””์ฝ”๋”์— ์ œ๊ณตํ•  ๋ชจ๋“  ์ •๋ณด๋“ค์„ ์ธ์ฝ”๋”์˜ ๋‹จ์ผ ๋ฒกํ„ฐ์— ํฌํ•จ์‹œํ‚ฌ ์ˆ˜ ์žˆ์„๊นŒ?

 

When predicting a word, could we focus on a few relevant words in the sentence instead of a single vector carrying the information of the entire sentence?

 

์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์ด ์ด ๋ฌธ์ œ๋“ค(์œ„์˜ ๋‘ ๊ฐ€์ง€ ์‚ฌํ•ญ)์„ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ค€๋‹ค.

 

At every decoding step, the decoder uses a set of attention weights that tell it how much "attention" to pay to each input word. The attention weights provide the decoder with contextual information for the translation.

 

The Bahdanau Attention Mechanism

 

Bahdanau์™€ ์—ฐ๊ตฌ์ง„์€ align๊ณผ translate๋ฅผ ๋™์‹œ์— ํ•™์Šตํ•˜๋Š” ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ์ œ์•ˆํ–ˆ๋‹ค. ์ธ์ฝ”๋” state์™€ ๋””์ฝ”๋” state์˜ ์„ ํ˜• ์กฐํ•ฉ์„ ์ˆ˜ํ–‰ํ•˜๊ธฐ ๋•Œ๋ฌธ์— Additive attention์ด๋ผ๊ณ  ์•Œ๋ ค์ ธ์žˆ๊ธฐ๋„ ํ•˜๋‹ค.

 

Let's walk through the attention mechanism Bahdanau proposed.

 

  • seq2seq์—์„œ ์ธ์ฝ”๋”์˜ ๋งˆ์ง€๋ง‰ hidden state๋Š” ์–ดํ…์…˜์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š๋Š” ๊ฒƒ๊ณผ ๋‹ฌ๋ฆฌ, Bahdanau ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์€ context vector๋ฅผ ์ƒ์„ฑํ•˜๊ธฐ ์œ„ํ•ด ์ธ์ฝ”๋”(forward&backward)์™€ ๋””์ฝ”๋”์˜ ๋ชจ๋“  hidden states๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.
  • ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์€ feed-forward ๋„คํŠธ์›Œํฌ์— ์˜ํ•ด ํŒŒ๋ผ๋ฏธํ„ฐํ™”๋œ alignment score์— ๋”ฐ๋ผ ์ž…๋ ฅ ์‹œํ€€์Šค์™€ ์ถœ๋ ฅ ์‹œํ€€์Šค๋ฅผ ์ •๋ ฌํ•œ๋‹ค. ์ ์ˆ˜์— ๋”ฐ๋ผ ์ •๋ ฌํ•˜๋Š” ๋ฐฉ์‹์€ ์†Œ์Šค ๋ฌธ์žฅ์—์„œ ๊ฐ€์žฅ ๊ด€๋ จ์„ฑ ์žˆ๋Š” ์ •๋ณด์— ์ง‘์ค‘ํ•˜๋Š” ๋ฐ์— ๋„์›€์ด ๋œ๋‹ค.
  • ๋ชจ๋ธ์€ ์†Œ์Šค ์œ„์น˜์™€ ๊ด€๋ จ๋œ context vector์™€ ์ด์ „์— ์ƒ์„ฑ๋œ ํƒ€๊ฒŸ ๋‹จ์–ด์— ๊ธฐ๋ฐ˜ํ•ด ํƒ€๊ฒŸ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•œ๋‹ค.

 

 

์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์ด ์ ์šฉ๋œ Seq2Seq ๋ชจ๋ธ์€ ์ธ์ฝ”๋”, ๋””์ฝ”๋”, ์–ดํ…์…˜ ๋ ˆ์ด์–ด๋กœ ๊ตฌ์„ฑ๋œ๋‹ค.

 

์–ดํ…์…˜ ๋ ˆ์ด์–ด์˜ ๊ตฌ์„ฑ ์š”์†Œ๋Š”

 

  • Alignment layer
  • Attention weights
  • Context vector

 

Alignment score

 

The alignment score measures how well the inputs around position $j$ and the output at position $i$ match. The score is based on the decoder hidden state $s_{i-1}$ (just before predicting the target word) and the hidden state $h_{j}$ of the input sentence.

 

$$e_{ij}=a(s_{i-1}, h_{j})$$

 

๋””์ฝ”๋”๋Š” ์†Œ์Šค ๋ฌธ์žฅ์˜ ๋ชจ๋“  ์ •๋ณด๋ฅผ ๊ณ ์ • ๊ธธ์ด ๋ฒกํ„ฐ์— ๋‹ด๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ์†Œ์Šค ๋ฌธ์žฅ์˜ ์–ด๋–ค ๋ถ€๋ถ„์— ์ง‘์ค‘ํ•ด์•ผํ•˜๋Š”์ง€ ๊ฒฐ์ •ํ•œ๋‹ค.

 

The alignment vector has the same length as the source sentence and is computed at every decoder time step.
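As a rough sketch, the additive score $e_{ij}=a(s_{i-1}, h_{j})$ can be parameterized by a small feed-forward network: project both states, combine them with a tanh, and reduce to one scalar per source position. The weight matrices and toy dimensions below are illustrative assumptions, not values from the article:

```python
import numpy as np

def bahdanau_score(s_prev, H, W_a, U_a, v_a):
    """Additive (Bahdanau) alignment scores.

    s_prev : (d,)    previous decoder hidden state s_{i-1}
    H      : (T, d)  encoder hidden states h_1 .. h_T
    Returns (T,) scores e_{i1} .. e_{iT}.
    """
    # Project decoder and encoder states, combine with tanh,
    # then reduce each source position to a scalar score.
    return np.tanh(s_prev @ W_a + H @ U_a) @ v_a

rng = np.random.default_rng(0)
d, T, attn = 4, 5, 3                 # toy sizes (assumptions)
s_prev = rng.normal(size=d)
H = rng.normal(size=(T, d))
W_a = rng.normal(size=(d, attn))
U_a = rng.normal(size=(d, attn))
v_a = rng.normal(size=attn)

e = bahdanau_score(s_prev, H, W_a, U_a, v_a)
print(e.shape)  # (5,) — one score per source position
```

The scores are recomputed at every decoder time step, which is what makes the alignment vector dynamic.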

 

 

Attention weights

 

To obtain the attention weights, a softmax activation function is applied to the alignment scores.

 

→ $a_{ij}$์˜ ์˜๋ฏธ : ๋””์ฝ”๋”์˜ i๋ฒˆ์งธ time step์—์„œ j๋ฒˆ์งธ ์ธ์ฝ”๋” ์ถœ๋ ฅ์˜ ๊ฐ€์ค‘์น˜(ํ•ธ์ฆˆ์˜จ ๋จธ์‹ ๋Ÿฌ๋‹ ๋ฐœ์ทŒ)

 

The softmax activation yields probabilities that sum to 1, which serve as the weight of each input word. The higher an input word's attention weight, the more influence it has on predicting the target word.
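A minimal sketch of this step (the score values are made up for illustration):

```python
import numpy as np

def attention_weights(e):
    """Softmax over alignment scores: weights are positive and sum to 1."""
    e = e - e.max()          # subtract max for numerical stability
    exp_e = np.exp(e)
    return exp_e / exp_e.sum()

a = attention_weights(np.array([2.0, 1.0, 0.1]))
print(a)  # the position with the highest score gets the largest weight
```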

 

 

Context vector

 

The context vector is used to determine the decoder's final output. The context vector $c_{i}$ is the weighted sum of the attention weights and the encoder hidden states ($h_{1}$, $h_{2}$, ..., $h_{T_x}$).
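The weighted sum can be sketched in a few lines (the hidden states and weights below are toy values):

```python
import numpy as np

def context_vector(a, H):
    """c_i = sum_j a_{ij} * h_j : weighted sum of encoder hidden states.

    a : (T,)   attention weights, summing to 1
    H : (T, d) encoder hidden states
    """
    return a @ H

H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
a = np.array([0.5, 0.3, 0.2])
print(context_vector(a, H))  # approximately [0.7, 0.5]
```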

 

 

Predicting the target word

 

To predict the target word, the decoder uses

 

  • The context vector $c_{i}$
  • The decoder output from the previous time step, $y_{i-1}$
  • The decoder's previous hidden state $s_{i-1}$

These determine the decoder's hidden state at time step $i$.
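A hedged sketch of how these three inputs might be combined into the new decoder state $s_i$; a plain tanh cell stands in here for the GRU used in the Bahdanau paper, and all sizes and weights are illustrative assumptions:

```python
import numpy as np

def decoder_step(s_prev, y_prev, c, W_s, W_y, W_c):
    """One decoder update s_i = f(s_{i-1}, y_{i-1}, c_i).

    A plain tanh cell stands in for the GRU used in the paper.
    """
    return np.tanh(s_prev @ W_s + y_prev @ W_y + c @ W_c)

rng = np.random.default_rng(1)
d, emb = 4, 3                         # toy sizes (assumptions)
s_prev = rng.normal(size=d)           # previous decoder state s_{i-1}
y_prev = rng.normal(size=emb)         # embedding of previous output y_{i-1}
c = rng.normal(size=d)                # context vector c_i
W_s = rng.normal(size=(d, d))
W_y = rng.normal(size=(emb, d))
W_c = rng.normal(size=(d, d))

s_i = decoder_step(s_prev, y_prev, c, W_s, W_y, W_c)
print(s_i.shape)  # (4,) — the new hidden state
```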

 

The Luong Attention Mechanism

 

Luong ์–ดํ…์…˜์„ Multiplicative attention์ด๋ผ๊ณ ๋„ ํ•œ๋‹ค. ๊ฐ„๋‹จํ•œ ํ–‰๋ ฌ ๊ณฑ์„ ์ด์šฉํ•ด ์ธ์ฝ”๋” state์™€ ๋””์ฝ”๋” state๋กœ attention score๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค. ๋‹จ์ˆœ ํ–‰๋ ฌ ๊ณฑ์€ ๋น ๋ฅด๊ณ  ๊ณต๊ฐ„ํšจ์œจ์ ์ด๋‹ค.

 

Luong์€ ์†Œ์Šค ๋ฌธ์žฅ์—์„œ์˜ attention ์œ„์น˜์— ๋”ฐ๋ผ ๋‘ ๊ฐ€์ง€์˜ ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ์ œ์‹œํ•œ๋‹ค.

 

1. ์†Œ์Šค ๋ฌธ์žฅ์˜ ๋ชจ๋“  ์œ„์น˜๋ฅผ attentionํ•˜๋Š” Global attention

2. ์†Œ์Šค ๋ฌธ์žฅ์˜ ํƒ€๊ฒŸ ๋‹จ์–ด์—์„œ ์ผ๋ถ€ ์œ„์น˜์—๋งŒ attentionํ•˜๋Š” Local attention

 

 

Global attention๊ณผ Local attention์˜ ๊ณตํ†ต์ 

 

  • ๋””์ฝ”๋”ฉ ๋‹จ๊ณ„์—์„œ time step t๋งˆ๋‹ค stacking LSTM ์ตœ์ƒ์œ„์ธต์˜ hidden state $h_{\theta}$๋ฅผ ์ž…๋ ฅ์œผ๋กœ ์‚ฌ์šฉ
  • ๋‘ ์ ‘๊ทผ๋ฒ•์˜ ๋ชฉํ‘œ๋Š” ์†Œ์Šค์™€ ๊ด€๋ จ์ด ์žˆ๋Š” ์ •๋ณด๋ฅผ ํฌ์ฐฉํ•˜์—ฌ ํ˜„์žฌ ํƒ€๊ฒŸ ๋‹จ์–ด์ธ $y_{t}$๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฐ์— ๋„์›€์ด ๋˜๋Š” context vector $c_{t}$๋ฅผ ๋„์ถœํ•˜๋Š” ๊ฒƒ์ด๋‹ค.
  • Attention vector๋Š” ์ด์ „์˜ alignment decision์„ ์•Œ๋ฆฌ๊ธฐ ์œ„ํ•ด ๋‹ค์Œ time step์—์„œ ๋ชจ๋ธ์˜ ์ž…๋ ฅ์œผ๋กœ ๋“ค์–ด๊ฐ„๋‹ค.

 

 

Global attention๊ณผ Local attention์€ context vector $c_{t}$๋ฅผ ๊ตฌํ•˜๋Š” ๋ฐฉ์‹์ด ๋‹ค๋ฅด๋‹ค.

 

Global attention๊ณผ Local attention์„ ๋ณด๊ธฐ ์ „์—, Luong ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์—์„œ ์ฃผ์–ด์ง„ ์‹œ๊ฐ„ t์— ๋Œ€ํ•ด ์‚ฌ์šฉ๋˜๋Š” ๋ฐฉ์‹์„ ์•Œ์•„๋ณด์ž. (ํ‘œ๊ธฐ๋ฒ•๊ณผ ํ‘œ๊ธฐ๊ฐ€ ๋‚˜ํƒ€๋‚ด๋Š” ์˜๋ฏธ์— ๋Œ€ํ•ด)

 

 

Global Attention

 

  • Global attention ๋ชจ๋ธ์€ context vector $c_{t}$๋ฅผ ๊ณ„์‚ฐํ•  ๋•Œ ์ธ์ฝ”๋”์˜ ๋ชจ๋“  hidden state๋ฅผ ๊ณ ๋ ค
  • Global context vector $c_{t}$๋Š” ๋ชจ๋“  ์†Œ์Šค hidden state $h_{s}$์˜ alignment vector $a_{t}$์— ๋”ฐ๋ผ weighted average๋กœ ๊ณ„์‚ฐ๋œ๋‹ค.

 

์†Œ์Šค ๋ฌธ์žฅ์ด ๊ธด ๋ฌธ๋‹จ์ด๊ฑฐ๋‚˜ ํฐ ๋ฌธ์„œ๋ผ๋ฉด ์–ด๋–ป๊ฒŒ ๋ ๊นŒ?

 

Global attention ๋ชจ๋ธ์€ ํƒ€๊ฒŸ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๊ธฐ ์œ„ํ•ด ์†Œ์Šค ๋ฌธ์žฅ์˜ ๋ชจ๋“  ๋‹จ์–ด๋ฅผ ๊ณ ๋ คํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ณ„์‚ฐ ๋น„์šฉ์ด ๋งŽ์ด ๋“ค๊ณ  ๊ธด ๋ฌธ์žฅ์˜ ๋ฒˆ์—ญ์ด ์–ด๋ ค์šธ ์ˆ˜ ์žˆ๋‹ค.

 

Local attention์„ ์‚ฌ์šฉํ•˜๋ฉด Global attention ๋ชจ๋ธ์˜ ๋‹จ์ ์„ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋‹ค.

 

Local Attention

 

  • Local attention์€ ํƒ€๊ฒŸ ๋‹จ์–ด๋งˆ๋‹ค ์†Œ์Šค ์œ„์น˜์˜ ์ผ๋ถ€์—๋งŒ ์ดˆ์ ์„ ๋งž์ถ”๊ธฐ ๋•Œ๋ฌธ์— global attention๋ณด๋‹ค ๊ณ„์‚ฐ ๋น„์šฉ์ด ๋‚ฎ๋‹ค.
  • Local attention ๋ชจ๋ธ์€ ์‹œ๊ฐ„ t์—์„œ ๊ฐ ํƒ€๊ฒŸ ๋‹จ์–ด์— ๋Œ€ํ•ด ์ •๋ ฌ๋œ ์œ„์น˜ $P_{t}$๋ฅผ ๊ตฌํ•œ๋‹ค.
  • ์ •๋ ฌ๋œ ์œ„์น˜๋Š” ๋‹จ์ˆœํ•˜๊ฒŒ(Monotonic alignment) ๋˜๋Š” ์˜ˆ์ธก์ ์œผ๋กœ(Predictive alignment) ์„ ํƒ๋  ์ˆ˜ ์žˆ๋‹ค.

 

Computing attention in the Bahdanau and Luong mechanisms

 

Bahdanau์™€ ์—ฐ๊ตฌ์ง„๋“ค์€ ์–‘๋ฐฉํ–ฅ ์ธ์ฝ”๋”์— ์žˆ๋Š” ์ˆœ๋ฐฉํ–ฅ ๋ฐ ์—ญ๋ฐฉํ–ฅ hidden state์™€ non-stacking ๋‹จ๋ฐฉํ–ฅ ๋””์ฝ”๋”์˜ ์ด์ „ ํƒ€๊ฒŸ ๋‹จ์–ด hidden state๋ฅผ ํ•ฉ์น˜๋Š”(concatenation) ๋ฐฉ๋ฒ•์„ ์ผ๋‹ค.

 

Luong๊ณผ ์—ฐ๊ตฌ์ง„๋“ค์€ ์ธ์ฝ”๋”์™€ ๋””์ฝ”๋” ๋ชจ๋‘ ์ตœ์ƒ์œ„ LSTM ๋ ˆ์ด์–ด์˜ hidden state๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค.

 

Luong ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์€ alignment vector๋ฅผ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•ด ๋””์ฝ”๋”์˜ ํ˜„์žฌ hidden state๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ฐ˜๋ฉด, Bahdanau ๋ฉ”์ปค๋‹ˆ์ฆ˜์€ ์ด์ „ time step์˜ ์ถœ๋ ฅ์„ ์‚ฌ์šฉํ•œ๋‹ค.

 

์ฐธ๊ณ  ์ž๋ฃŒ | References

- Dzmitry Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate (2014)

- Minh-Thang Luong et al., Effective Approaches to Attention-based Neural Machine Translation (2015)

- Aurélien Géron, Hands-On Machine Learning