Artificial Intelligence/NLP 7

[NLP Basics] Embedding

Concept: Representing each word in the vocabulary (vocab) as a dense vector of real numbers.
Method (※ based on PyTorch)
1) Create an embedding layer: use nn.Embedding
2) Pre-trained word embedding: load and use pre-trained word embeddings (Word2Vec, GloVe, etc.)
Example
1) Creating an embedding layer
① Implementing it directly, without an nn.Embedding layer (the cells run in order)
    import torch
    train_data = 'I want to be a AI engineer'
    # Build the word set (remove duplicates)
    word_set = set(train_data.split())
    # Assign a unique integer to each word
    vocab =..
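To make the "word to integer to dense vector" idea concrete, here is a minimal sketch in plain Python. It is not the post's PyTorch code: the randomly filled `embedding_table`, the `embedding_dim` of 3, and the `embed` helper are assumptions made for illustration. In nn.Embedding the table is a learnable weight matrix rather than a fixed list.

```python
import random

train_data = 'I want to be a AI engineer'

# Build the word set and give each word a unique integer (sorted for reproducibility).
word_set = sorted(set(train_data.split()))
vocab = {word: idx for idx, word in enumerate(word_set)}

# Hypothetical embedding table: one dense vector of real numbers per vocab entry.
embedding_dim = 3
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(len(vocab))
]

def embed(sentence):
    """Look up the dense vector of every word in the sentence."""
    return [embedding_table[vocab[word]] for word in sentence.split()]

vectors = embed(train_data)
print(len(vectors), len(vectors[0]))
```

The lookup is just row indexing, so the same word always maps to the same vector wherever it appears in the sentence.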

[NLP Advanced] encode() / encode_plus()

Admittedly, this is not all that advanced. When you load a pre-trained model from Hugging Face, you can get tokenized data with encode(), but while going through various examples I noticed that some use encode() and others use encode_plus(). Let's check the difference with one-line snippets.
tokenizer.encode()
    # A sentence taken from the dataset of Dacon's '청와대 청원 분류 대회' (Blue House petition classification competition)
    tokenizer.encode('신혼부부위한 주택정책 보다 보육시설 늘려주세요')
Result
◽ Performs tokenizer.tokenize(SENTENCE) and tokenizer.convert_tokens_to_ids(TOKENIZED_SENTENCE) in one call
◽ Using the values in the vocab, each token is mapped to a vocab inde..
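The "two steps in one call" point can be sketched with a toy stand-in. Everything below is hypothetical: `toy_vocab` and the three functions only borrow the names of the tokenizer methods and do not use the Hugging Face library. The `add_special_tokens` flag mirrors my understanding that the real encode() also wraps the ids in special tokens by default, which the two separate calls do not.

```python
# Hypothetical toy vocabulary; a real pre-trained tokenizer ships its own.
toy_vocab = {'[UNK]': 0, '[CLS]': 1, '[SEP]': 2, 'more': 3, 'childcare': 4, 'please': 5}

def tokenize(sentence):
    """Step 1: split text into tokens (whitespace split stands in for subword tokenization)."""
    return sentence.lower().split()

def convert_tokens_to_ids(tokens):
    """Step 2: map each token to its vocab index, falling back to [UNK]."""
    return [toy_vocab.get(token, toy_vocab['[UNK]']) for token in tokens]

def encode(sentence, add_special_tokens=True):
    """Both steps in one call, optionally adding [CLS] and [SEP] around the ids."""
    ids = convert_tokens_to_ids(tokenize(sentence))
    if add_special_tokens:
        ids = [toy_vocab['[CLS]']] + ids + [toy_vocab['[SEP]']]
    return ids

text = 'More childcare facilities please'
print(tokenize(text))
print(convert_tokens_to_ids(tokenize(text)))
print(encode(text))
```

The word 'facilities' is not in the toy vocab, so it maps to the [UNK] index; this is the role the vocab plays in the second step.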

[NLP Basics] Vocab

The 'vocab creation' step shows up without fail whenever you work on a natural language processing task, but I had only been using it out of habit without understanding what it is for. After working with Transformer models I felt I was starting to get it, so I want to write down my understanding of the role a vocab plays in NLP tasks. ⭐ If anything here is wrong, feel free to leave a comment!
Creation steps
※ The 'data creation' that follows the vocab creation step means creating the model's input data.
Tokenization-to-vocab creation process
Tokenization
⭐ For more on tokenization, follow the link on the right. Here!
My guess is that we tokenize so that the model can learn general representations. Taking a sentence as input and using the whole sentence for training without a tokenization step..
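The tokenization, vocab creation, and data creation steps described above could look roughly like this. The two-sentence corpus, the `[PAD]`/`[UNK]` special tokens, and the `to_ids` helper are assumptions for illustration, not the post's own code.

```python
from collections import Counter

# Hypothetical tiny corpus.
corpus = ['the cat sat', 'the dog sat down']

# 1) Tokenization: whitespace split as the simplest possible tokenizer.
tokenized = [sentence.split() for sentence in corpus]

# 2) Vocab creation: special tokens first, then tokens by descending frequency.
counts = Counter(token for sentence in tokenized for token in sentence)
vocab = {'[PAD]': 0, '[UNK]': 1}
for token, _ in counts.most_common():
    vocab[token] = len(vocab)

# 3) Data creation: turn a sentence into the integer ids a model takes as input.
def to_ids(sentence):
    return [vocab.get(token, vocab['[UNK]']) for token in sentence.split()]

print(vocab)
print(to_ids('the bird sat'))
```

A token that never appeared in the corpus ('bird' here) falls back to the `[UNK]` id. This is one reason to build the vocab from tokens rather than whole sentences: an unseen sentence can still be expressed mostly with known ids.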

[NLP Basics] Co-occurrence Matrix

Concept: A matrix built from how often words appear within a certain distance of the current word (the center word). When googling or reading books, you often find explanations that leave out the distance and focus only on 'frequency of occurrence', but personally I think the notion of distance matters more than the frequency. If you try to understand it through frequency alone, you end up looking at the matrix and wondering, 'why is this the count for this word?' That's what happened to me 🙂..
Example
◽ Sentence
Data handled in NLP will hardly ever consist of a single sentence, but since this is just an example, let's keep it simple.
a hundred bad days made a hundred good stories. (AJR - 100 Bad Days)
The co-occurrence matrix built from this sentence would look like the one below, and the words'..
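To show why the distance matters, here is a small sketch that builds the matrix for the example sentence with two different window sizes. The `cooccurrence_matrix` function and the window values 1 and 2 are assumptions; the post's own window size is not visible in this excerpt.

```python
def cooccurrence_matrix(tokens, window):
    """Count, for each center word, the words that appear within `window` positions of it."""
    words = list(dict.fromkeys(tokens))  # unique words in first-seen order
    index = {word: i for i, word in enumerate(words)}
    matrix = [[0] * len(words) for _ in words]
    for pos, center in enumerate(tokens):
        start = max(0, pos - window)
        end = min(len(tokens), pos + window + 1)
        for ctx in range(start, end):
            if ctx != pos:
                matrix[index[center]][index[tokens[ctx]]] += 1
    return words, index, matrix

sentence = 'a hundred bad days made a hundred good stories.'
tokens = sentence.rstrip('.').split()

words, index, m1 = cooccurrence_matrix(tokens, window=1)
_, _, m2 = cooccurrence_matrix(tokens, window=2)

print(words)
for word, row in zip(words, m1):
    print(f'{word:>8}', row)
```

With a window of 1 the pair ('a', 'bad') is never counted, because 'bad' sits two positions away from 'a'. Widening the window to 2 makes it count once. The counts only make sense once the distance is fixed.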

[NLP Basics] Tokenization (Tokenizing)

Concept: The process of splitting text into units according to some criterion; it goes by several names, such as tokenization and tokenizing. A token can be a sentence or a word, and usually a meaningful unit is defined as the token.
Example
◽ Paragraph
An abundance of emotion for reasons unknown, and in the end he filled in the period. And alone, he empties out the stories he never finished telling. But the protagonist of his story is still her. I decided to call this romance.
빅나티 - 낭만이라고 부르기로 하였다 (Narr. 김기현)
◽ Sentence-level tokenization
Tokenizing at the sentence level splits on the period (.), so the paragraph is divided into 4 sentences in total.
◽ Word-level tokenization
Either tokenize on whitespace, as Python's split() does, without treating punctuation as separate tokens, or treat each punctuation mark as a single..
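The sentence-level and word-level splits can be tried out with the standard library alone. The paragraph below is an English rendering of a four-sentence example, and both regular expressions are assumptions chosen for this sketch; real tokenizers handle many more cases (abbreviations, decimals, and so on).

```python
import re

paragraph = (
    'An abundance of emotion for reasons unknown, and in the end he filled in the period. '
    'And alone, he empties out the stories he never finished telling. '
    'But the protagonist of his story is still her. '
    'I decided to call this romance.'
)

# Sentence-level tokenization: split on the whitespace that follows a period.
sentences = re.split(r'(?<=\.)\s+', paragraph)

# Word-level tokenization, variant 1: whitespace only (punctuation stays attached).
words_whitespace = sentences[-1].split()

# Word-level tokenization, variant 2: punctuation marks become their own tokens.
words_punct = re.findall(r'\w+|[^\w\s]', sentences[-1])

print(len(sentences))
print(words_whitespace)
print(words_punct)
```

In variant 1 the last token keeps its period ('romance.'), while variant 2 yields 'romance' and '.' as two tokens.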

[NLP Basics] BoW (Bag of Words)

Concept: A method that counts how many times each word appears in a sentence and vectorizes the document based on those counts.
Example
A BoW model vectorizes by referring to a word dictionary. Suppose we have a document made up of the 4 sentences below, and let's represent it with a BoW model.
◽ Document: ["It was the best of times", "It was the worst of times", "It was the age of wisdom", "It was the age of foolishness"]
◽ Word dictionary built from the document: ['It', 'was', 'the', 'best', 'of', 'times', 'worst', 'age', 'wisdom', 'foolishness']
◽ Vector representation of the first sentence (the rest ..
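As a rough sketch of the counting step, the snippet below rebuilds the word dictionary from the four sentences and turns each sentence into a count vector. The `bow_vector` helper and the first-seen ordering of the dictionary are assumptions; libraries such as scikit-learn order and normalize the vocabulary differently.

```python
document = [
    'It was the best of times',
    'It was the worst of times',
    'It was the age of wisdom',
    'It was the age of foolishness',
]

# Word dictionary: unique words in the order they first appear.
word_dict = list(dict.fromkeys(word for sentence in document for word in sentence.split()))

def bow_vector(sentence):
    """Count how many times each dictionary word occurs in the sentence."""
    tokens = sentence.split()
    return [tokens.count(word) for word in word_dict]

vectors = [bow_vector(sentence) for sentence in document]

print(word_dict)
for sentence, vector in zip(document, vectors):
    print(vector, sentence)
```

Each vector has one slot per dictionary word, so all four sentences end up with the same length regardless of which words they contain. Word order within the sentence is discarded.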

[Transformer Series] 01. Positional Encoding

Why it is used
- Unlike an RNN, the input does not arrive in order, so Positional Encoding is used to let the model know where each word sits in the sentence.
- Adding the generated unique Positional Encoding to the word embedding vector lets the model identify the word's absolute position.
How it works
- The N-th Positional Encoding is added to the N-th word embedding of each sentence.
- The authors of the paper use the sin and cos functions → pos: position of the embedding vector within the sentence, i: position within the embedding vector
- Because the Positional Encoding has to be added to the word embedding vector, $d_{\text{positional encoding}} = d_{\text{embedding vector}}$
🧐 sin function, cos func..
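The sinusoidal formulation from the original Transformer paper, as I recall it, is $PE_{(pos,\,2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos,\,2i+1)} = \cos(pos / 10000^{2i/d_{model}})$, so even dimensions use sin and odd dimensions use cos. Below is a minimal plain-Python sketch under that assumption; the function names, `max_len`, `d_model`, and the dummy embeddings are all hypothetical.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal positional encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # `even` plays the role of 2i in the formula above.
        for even in range(0, d_model, 2):
            angle = pos / (10000 ** (even / d_model))
            pe[pos][even] = math.sin(angle)
            if even + 1 < d_model:
                pe[pos][even + 1] = math.cos(angle)
    return pe

def add_positional_encoding(embeddings, pe):
    """Add the N-th encoding to the N-th word embedding, element by element."""
    return [[e + p for e, p in zip(emb_row, pe_row)]
            for emb_row, pe_row in zip(embeddings, pe)]

d_model = 4
pe = positional_encoding(max_len=3, d_model=d_model)

# Dummy embeddings for a 3-word sentence; real ones come from an embedding layer.
embeddings = [[0.5] * d_model for _ in range(3)]
encoded = add_positional_encoding(embeddings, pe)

print(pe[0])
print(encoded[1])
```

The element-wise addition only works because both tables have `d_model` columns, which is the dimension constraint stated in the last bullet.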