
Sequence to Sequence Learning with Neural Networks

geum 2022. 3. 21. 20:52

There are a lot of papers and concepts to go through to properly understand the Transformer. I plan to work through them one by one and then revisit the Transformer.

 

💬 Opinions on the paper or this post, and typo corrections, are all welcome. Feel free to leave a comment!


Original paper: https://arxiv.org/pdf/1409.3215.pdf

 

Abstract

- DNN์€ speech recognition๊ณผ ๊ฐ™์€ ์–ด๋ ค์šด ํ•™์Šต ํƒœ์Šคํฌ์—์„œ ์šฐ์ˆ˜ํ•œ ์„ฑ๊ณผ๋ฅผ ๋‹ฌ์„ฑํ•œ ๋ชจ๋ธ์ด์ง€๋งŒ ๊ณ ์ • ์ฐจ์›์„ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ž…์ถœ๋ ฅ ๊ธธ์ด๊ฐ€ ๋‹ค๋ฅธ ์‹œํ€€์Šค(๋ฌธ์žฅ)๋ฅผ ๋‹ค๋ฃจ๋Š” ๋ฌธ์ œ์—๋Š” ์ ํ•ฉํ•˜์ง€ ์•Š์•˜๋‹ค. 

- This paper proposes using multilayer LSTMs as an encoder-decoder to produce a variable-length output sequence corresponding to the meaning of the input sequence.

- They found that reversing the word order of the input sequence (e.g., feeding "you love I" instead of "I love you") greatly improves LSTM performance.

 

Introduction

- One LSTM reads the input sequence one timestep at a time to obtain a context vector (a fixed-dimensional vector), and a second LSTM (the decoder) then produces the output sequence.

- The second LSTM is conditioned on the input sequence, but it is essentially a recurrent neural network language model.

 

 

cf. EOS (End of Sentence/Sequence)

A special token used to mark the end of a sentence; the model stops generating predictions once it produces the <EOS> token.
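As a toy illustration (not code from the paper), a greedy decoding loop might stop on <EOS> like this; `decoder_step` and the special-token ids are hypothetical:

```python
# Toy sketch of <EOS>-terminated generation; `decoder_step` is a
# hypothetical function returning (next_token_id, new_state).
EOS_ID, BOS_ID = 2, 1  # assumed special-token ids

def greedy_decode(decoder_step, state, max_len=50):
    tokens, prev = [], BOS_ID
    for _ in range(max_len):
        prev, state = decoder_step(prev, state)
        if prev == EOS_ID:   # <EOS> produced: stop generating
            break
        tokens.append(prev)
    return tokens
```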

 

The model

1) RNN & LSTM

 

- A recurrent neural network (RNN) is a natural generalization of a feedforward neural network to sequences. Given an input sequence ($x_{1}$, ..., $x_{T}$), the RNN computes the output sequence ($y_{1}$, ..., $y_{T}$) by iterating the equations below.

 

$$h_{t} = \mathrm{sigm}(W^{hx}x_{t}+W^{hh}h_{t-1})$$

$$y_{t} = W^{yh}h_{t}$$
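As a quick sanity check, here is a minimal NumPy sketch of this recurrence; the dimensions are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Minimal sketch of the RNN recurrence above with made-up dimensions.
T, input_dim, hidden_dim, output_dim = 5, 3, 4, 2
rng = np.random.default_rng(0)

W_hx = rng.normal(size=(hidden_dim, input_dim))   # W^{hx}
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # W^{hh}
W_yh = rng.normal(size=(output_dim, hidden_dim))  # W^{yh}

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

h = np.zeros(hidden_dim)                      # h_0
ys = []
for x_t in rng.normal(size=(T, input_dim)):   # input sequence (x_1, ..., x_T)
    h = sigm(W_hx @ x_t + W_hh @ h)           # h_t = sigm(W^{hx} x_t + W^{hh} h_{t-1})
    ys.append(W_yh @ h)                       # y_t = W^{yh} h_t
```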

 

- In principle, the simplest sequence learning approach is to map the input sequence to a fixed-size vector with one RNN and then map that vector to the target sequence with another RNN.

- RNN์„ ํ•™์Šต์‹œํ‚ฌ ๋•Œ long term dependencies ๋ฌธ์ œ๋กœ ์ธํ•ด ์–ด๋ ค์šด ์ ์ด ์žˆ์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ์žฅ๊ธฐ ์˜์กด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋Š” LSTM์€ ์„ฑ๊ณต์ ์ด์—ˆ๋‹ค.

- LSTM์˜ ๋ชฉํ‘œ๋Š” ์ž…๋ ฅ ์‹œํ€€์Šค ($x_{1}$, ..., $x_{T}$)์™€ ์ž…๋ ฅ ์‹œํ€€์Šค์— ๋Œ€์‘ํ•˜๋Š” ์ถœ๋ ฅ ์‹œํ€€์Šค ($y_{1}$, ..., $y_{T'}$)์˜ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ  p($y_{1}$, ..., $y_{T'}$|$x_{1}$, ..., $x_{T}$)๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด๋‹ค. (T'≠T)

- LSTM์€ ๋งˆ์ง€๋ง‰ hidden state์— ์˜ํ•ด ์ฃผ์–ด์ง„ ์ž…๋ ฅ ์‹œํ€€์Šค ($x_{1}$, ..., $x_{T}$)์˜ ๊ณ ์ • ์ฐจ์› ํ‘œํ˜„์ธ $v$๋ฅผ ์–ป์€ ํ›„ ์ดˆ๊ธฐ hidden state๊ฐ€ $v$์ธ LSTM-LM ๊ณต์‹์„ ๊ณ„์‚ฐํ•ด ($y_{1}$, ..., $y_{T'}$)์˜ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ์„ ๊ตฌํ•œ๋‹ค.

 

 

2) The model actually used in the paper

→ It differs from the approach described above in three ways (a code sketch follows the list).

 

โ‘  ์ธ์ฝ”๋” ์ชฝ LSTM๊ณผ ๋””์ฝ”๋” ์ชฝ LSTM๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ฐ€์ง„๋‹ค.

② Deep LSTMs outperform shallow ones, so an LSTM with four layers was used.

③ The order of the words in the input sentence was reversed.
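A minimal PyTorch sketch of these three points; the sizes here are illustrative assumptions (the paper uses 4 layers of 1000 cells and much larger vocabularies):

```python
import torch
import torch.nn as nn

# Illustrative sizes only, not the paper's configuration.
src_vocab, tgt_vocab, emb_dim, hidden_dim, num_layers = 100, 100, 32, 64, 4

src_embed = nn.Embedding(src_vocab, emb_dim)
tgt_embed = nn.Embedding(tgt_vocab, emb_dim)
encoder = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)  # ② deep (4-layer) LSTM
decoder = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)  # ① separate parameters

src = torch.randint(0, src_vocab, (1, 7))   # dummy source sentence (batch of 1)
src = torch.flip(src, dims=[1])             # ③ reverse the input word order

_, (h, c) = encoder(src_embed(src))         # last states act as the representation v
tgt = torch.randint(0, tgt_vocab, (1, 5))   # dummy target prefix
out, _ = decoder(tgt_embed(tgt), (h, c))    # decoder initialized with v
```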

 

Experiments

1. Dataset details: omitted

 

2. Decoding and Rescoring

 

- The core of the experiments is training a large, deep LSTM on many sentence pairs.

- Given a source sentence $S$, training maximizes the log probability of its correct translation $T$.
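In formula form, the paper's training objective maximizes the average log probability over the training set $\mathcal{S}$ of (translation $T$, source $S$) pairs:

$$\frac{1}{|\mathcal{S}|}\sum_{(T,S)\in \mathcal{S}} \log p(T|S)$$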

 

 

- Once training is complete, a beam search decoder is used to find the most likely translation.
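A toy beam search sketch, not the paper's implementation; `log_prob_next` is a hypothetical stand-in for the decoder that, given a partial hypothesis, returns a {token: log-probability} dict for the next token:

```python
def beam_search(log_prob_next, bos, eos, beam_size=2, max_len=20):
    beams = [([bos], 0.0)]          # (partial hypothesis, total log prob)
    finished = []
    for _ in range(max_len):
        # extend every partial hypothesis by every possible next token
        candidates = []
        for seq, score in beams:
            for tok, lp in log_prob_next(seq).items():
                candidates.append((seq + [tok], score + lp))
        # keep only the beam_size most probable partial hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            # hypotheses ending in <EOS> leave the beam as complete translations
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    best = max(finished + beams, key=lambda c: c[1])
    return best[0]
```

The paper reports that even a beam size of 1 produces good translations, and a beam size of 2 already provides most of the benefit of beam search.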

 

 

3. Reversing the Source Sentences

 

- LSTM์€ ์žฅ๊ธฐ ์˜์กด์„ฑ ๋ฌธ์ œ ํ•ด๊ฒฐ์— ์ ํ•ฉํ•˜์ง€๋งŒ ๋…ผ๋ฌธ ์ €์ž๋“ค์€ ์†Œ์Šค ๋ฌธ์žฅ์˜ ์ˆœ์„œ๋ฅผ ๋ฐ”๊พธ๋ฉด ํ•™์Šต์ด ๋” ์ž˜๋œ๋‹ค๋Š” ๊ฒƒ์„ ๋ฐœ๊ฒฌํ–ˆ๋‹ค.

 

The parts on experimental settings and result analysis are omitted 🙂