Artificial Intelligence/Paper

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

geum 2022. 9. 21. 13:57

It's been a very long time since I last read a paper. I was trying to use a BERT-based pre-trained model, but since I didn't know any of the underlying concepts (what goes into the model's input, how the data should be formatted, and so on), too much of it made no sense, so I'm reading the paper for myself.

 

💬 Opinions on the paper or this post, and typo corrections, are all welcome. Feel free to leave a comment!


Original paper: https://arxiv.org/pdf/1810.04805.pdf

 

■ : parts I don't fully understand yet

Introduction

1. There are two existing strategies for applying pre-trained language representations to downstream tasks

 

1) Feature-based

- Uses task-specific architectures that include the pre-trained representations as additional features

- Example: ELMo

 

2) Fine-tuning

- Introduces minimal task-specific parameters and applies the model to downstream tasks by fine-tuning the pre-trained parameters

- Example: GPT

 

2. (As of the paper's publication) existing methods do not fully exploit the power of pre-trained representations

- The major limitation of standard language models is that they are unidirectional, which restricts the architectures that can be used during pre-training

- This is especially harmful for tasks where understanding context from both directions is crucial

 

3. To overcome the limitations of unidirectional language models, BERT, which uses a bidirectional encoder, is proposed

 

BERT

The pre-training stage and the fine-tuning stage (Figure 1 in the paper)

* Model summary

 

1) Architecture

- Multi-layer bidirectional Transformer encoder

- Two model sizes: $ BERT_{BASE} $ (L=12, H=768, A=12, 110M parameters) and $ BERT_{LARGE} $ (L=24, H=1024, A=16, 340M parameters)

 

2) ์ž…๋ ฅ/์ถœ๋ ฅ ํ‘œํ˜„

- A single token sequence can represent either a single sentence or a pair of sentences, so that BERT can be applied to a variety of downstream tasks

- 'Sentence' here does not mean a linguistic sentence (subject + verb + ... in English) but an arbitrary span of contiguous text

- 'Sequence' refers to BERT's input unit, which may be a single sentence or two sentences packed together

- Every sequence starts with the [CLS] token, and the final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks

- A sentence pair is packed into a single sequence, and the two sentences are distinguished in two steps

 

① Separate the two sentences with the special token [SEP]

② Add a learned embedding to every token indicating whether it belongs to sentence A or sentence B
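The two-step packing scheme above can be sketched in plain Python. This is a toy illustration with made-up tokens and no real tokenizer; the `pack_pair` helper and its names are mine, not from the paper:

```python
# Sketch of packing a sentence pair into one BERT input sequence:
# [CLS] A-tokens [SEP] B-tokens [SEP], plus a segment id per token
# (0 for everything up to and including the first [SEP], 1 afterwards).

def pack_pair(tokens_a, tokens_b):
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # segment (token-type) ids: sentence A span gets 0, sentence B span gets 1
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segs = pack_pair(["the", "dog", "barked"], ["it", "was", "loud"])
print(tokens)
# → ['[CLS]', 'the', 'dog', 'barked', '[SEP]', 'it', 'was', 'loud', '[SEP]']
print(segs)
# → [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

In a real pipeline the segment id selects the learned A/B embedding that is added to each token embedding, alongside the position embedding.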

 

1. Pre-training BERT

 

* Unlike conventional unidirectional (LTR or RTL) language models, BERT is pre-trained with two unsupervised tasks

 

1) Task #1: Masked LM

- Intuitively a deep bidirectional model is more powerful than an LTR or RTL model, but conventional language models can only be trained in one direction

- To train a deep bidirectional model, a random fraction of the input tokens (15% in the paper) is masked → the Masked LM (MLM) procedure

- The final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary

- Only the masked words are predicted, rather than reconstructing the entire input

- Downside: since the [MASK] token never appears during fine-tuning, a mismatch arises between pre-training and fine-tuning

- To mitigate this, a selected token is not always replaced with [MASK]: 80% of the time it becomes [MASK], 10% of the time a random token, and 10% of the time the selected i-th token is kept unchanged
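The 15% selection and 80/10/10 replacement rule can be sketched as follows. This is a hedged toy version: the vocabulary and tokens are stand-ins, and a real implementation would work on token IDs and skip special tokens like [CLS] and [SEP]:

```python
import random

def mask_tokens(tokens, vocab, rng, mask_rate=0.15):
    corrupted = list(tokens)
    labels = [None] * len(tokens)  # original token at each selected position
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue                          # position not selected for prediction
        labels[i] = tok                       # the model must recover this token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random vocabulary token
        # else: 10%, keep the original token unchanged (but still predict it)
    return corrupted, labels

rng = random.Random(0)
corrupted, labels = mask_tokens(["the", "cat", "sat", "on", "the", "mat"],
                                vocab=["dog", "ran"], rng=rng)
```

The key point the sketch shows: the loss is computed only at positions where `labels` is set, not over the whole reconstructed input.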

 

2) Task #2: Next Sentence Prediction (NSP)

- Question Answering (QA) and Natural Language Inference (NLI) depend on understanding the relationship between two sentences

- To teach the model sentence-level relationships, a binarized NSP task is used, which can be generated from any monolingual corpus

- When choosing sentences A and B for pre-training, 50% of the time B is the sentence that actually follows A, and 50% of the time B is a random sentence from the corpus
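Constructing one binarized NSP example from a corpus can be sketched like this. The toy "document" (an ordered list of sentences) and the helper name are illustrative, not from the paper:

```python
import random

def make_nsp_example(doc, i, rng):
    """Return (sentence_a, sentence_b, is_next) for the sentence at index i."""
    sent_a = doc[i]
    if rng.random() < 0.5 and i + 1 < len(doc):
        return sent_a, doc[i + 1], True   # IsNext: the actual next sentence
    # NotNext: a random sentence from the corpus (a real pipeline would also
    # guard against accidentally picking the true next sentence here)
    return sent_a, rng.choice(doc), False

rng = random.Random(1)
doc = ["He went to the store.", "He bought milk.", "Penguins are flightless."]
a, b, is_next = make_nsp_example(doc, 0, rng)
```

The resulting pair is then packed into one sequence (as in the input-representation section) and the `is_next` label is predicted from the [CLS] hidden state.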

 

2. Fine-tuning BERT

 

* Fine-tuning is conceptually straightforward, because the Transformer's self-attention mechanism allows BERT to model many downstream tasks by swapping in the appropriate inputs and outputs (...although it doesn't feel straightforward to me)

 

- For each task, the task-specific inputs and outputs are plugged into BERT and all parameters are fine-tuned end-to-end

- At the output, the token representations are fed into an output layer for token-level tasks (sequence tagging, QA)

- The [CLS] representation is fed into an output layer for classification (entailment, sentiment analysis)

- Compared to pre-training, fine-tuning is relatively inexpensive
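The two kinds of output layer above can be sketched in plain Python (made-up shapes, no real model; the function names are mine): a classification head that reads only the [CLS] hidden state, and a token-level head that reads every token's hidden state.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(vec, weights, bias):
    # weights: one row of hidden_size values per output label
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def classify_sequence(hidden_states, weights, bias):
    # sentence-level task: only the [CLS] (first) hidden state is used
    return softmax(linear(hidden_states[0], weights, bias))

def tag_tokens(hidden_states, weights, bias):
    # token-level task (tagging, QA spans): one prediction per token
    return [softmax(linear(h, weights, bias)) for h in hidden_states]
```

This is also why fine-tuning is cheap: the only new parameters are the small `weights`/`bias` of the output layer, while everything else is initialized from the pre-trained model.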

 

Experiments & Ablation Studies

์ƒ๋žต