
[NLP Basics] Co-occurrence Matrix

geum 2022. 6. 30. 09:19

Concept

A co-occurrence matrix records, for each current word (center word), how often other words appear within a given distance of it. When googling or reading books, you sometimes find explanations that leave out the distance and focus only on "frequency of occurrence", but personally I think the notion of distance matters more than the frequency.

 

If you try to understand it through frequency alone, you end up looking at the matrix and wondering "why is this word's count this number?". That was me 🙂..

 

Example

◽ Sentence

 

Data you actually handle in NLP will never consist of a single sentence, but since this is just an example, let's keep it simple.

 

a hundred bad days made a hundred good stories.

 

AJR - 100 Days

 

The co-occurrence matrix built from this sentence has one row and one column per unique word, and the cells will be filled in with the word counts.
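As a rough sketch of that starting point, the snippet below builds the vocabulary (unique tokens in order of first appearance) and an all-zero square matrix over it. The names `tokens`, `vocab`, and `empty_mat` are my own, and treating the final period as a separate token is an assumption taken from the implementation further down in this post.

```python
# Tokenize the example sentence; the period becomes its own token (assumption).
tokens = 'a hundred bad days made a hundred good stories.'.replace('.', ' .').split()

# Unique tokens in first-appearance order (dict keys keep insertion order).
vocab = list(dict.fromkeys(tokens))

# One row and one column per vocabulary word, every count starting at 0.
empty_mat = [[0] * len(vocab) for _ in vocab]

print(vocab)
for word, row in zip(vocab, empty_mat):
    print(f'{word:>8}', row)
```

The sentence has 10 tokens but only 8 unique ones, so the matrix is 8 x 8.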

โ” ํŠน์ • ๊ฑฐ๋ฆฌ ๋ฒ”์œ„

๊ฑฐ๋ฆฌ๋ฅผ Window๋ผ๊ณ  ๋ถ€๋ฅด๊ณ  ์ž๊ธฐ ์ž์‹ ์„ ๋งŒ๋‚ฌ์„ ๋•Œ๋Š” ํšŸ์ˆ˜ ์นด์šดํŠธ์— ํฌํ•จํ•˜์ง€ ์•Š๋Š”๋‹ค. ์ด์ œ ๋‘ ๊ฐ€์ง€ ๊ฒฝ์šฐ๋ฅผ ์ƒ๊ฐํ•ด๋ณด์ž.

 

1) Current word: made / Window size: 1

Because the distance is 1, only the words immediately before and after the current word made are counted. In "... days made a ...", those are days and a, so only these two have the value 1 and the rest are 0.

 

2) Current word: made / Window size: 2

The words within a distance of 2 from made are bad, days, a, and hundred; only these have a count of 1, and the rest are 0.
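The two cases can be checked with a small sketch. It assumes neighbors are read from the sentence in its original order, so the word right after made is a, and it skips the center position itself. The helper `neighbor_counts` is a hypothetical name of mine, not something from the linked repository.

```python
from collections import Counter

def neighbor_counts(tokens, center, window):
    """Count tokens within `window` positions of every occurrence of `center`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != center:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center position itself
                counts[tokens[j]] += 1
    return counts

tokens = 'a hundred bad days made a hundred good stories.'.replace('.', ' .').split()

print(dict(neighbor_counts(tokens, 'made', 1)))  # {'days': 1, 'a': 1}
print(dict(neighbor_counts(tokens, 'made', 2)))  # {'bad': 1, 'days': 1, 'a': 1, 'hundred': 1}
```

Widening the window only adds words that sit farther from made in the sentence, which is why distance explains the counts better than raw frequency does.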

 

Implementation

This is a simple implementation. I focused on implementing it myself and looking at the result rather than on writing clean code, so the code quality is quite low.

 

# Tokenize the sentence, splitting the final period off as its own token
sentence = 'a hundred bad days made a hundred good stories.'.replace('.', ' .').split()

# Vocabulary: unique tokens in order of first appearance
# ['a', 'hundred', 'bad', 'days', 'made', 'good', 'stories', '.']
sorted_sentence = []
for s in sentence:
    if s not in sorted_sentence:
        sorted_sentence.append(s)

WINDOW_SIZE = 1

# Collect each word's neighbors from the original token sequence,
# not from the de-duplicated list, so repeated words keep their real context
# 'hundred': ['a', 'bad', 'a', 'good']
window_size_dict = {k: [] for k in sorted_sentence}

for i, word in enumerate(sentence):
    start = max(0, i - WINDOW_SIZE)
    end = min(len(sentence), i + WINDOW_SIZE + 1)
    for j in range(start, end):
        if j != i:  # skip the center position itself
            window_size_dict[word].append(sentence[j])

cooccurrence_mat = []

# Turn each word's neighbor list into one row of counts in vocabulary order
for value in window_size_dict.values():
    count_dict = {k: 0 for k in sorted_sentence}

    for v in value:
        count_dict[v] += 1

    print(count_dict)
    cooccurrence_mat.append(list(count_dict.values()))

 

 

The final co-occurrence matrix is stored in cooccurrence_mat; printing it shows one row of counts per unique word, in vocabulary order.
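A list of bare rows is hard to read, so here is a self-contained sketch that prints the matrix with word labels. It assumes counts are taken from the token sequence in sentence order with a window of 1; the function name `cooccurrence` and the print layout are my own choices for illustration.

```python
def cooccurrence(tokens, window):
    """Return (vocab, matrix) with counts taken over the token sequence."""
    vocab = list(dict.fromkeys(tokens))           # unique tokens, first-appearance order
    index = {w: k for k, w in enumerate(vocab)}   # word -> row/column position
    mat = [[0] * len(vocab) for _ in vocab]
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center position itself
                mat[index[word]][index[tokens[j]]] += 1
    return vocab, mat

tokens = 'a hundred bad days made a hundred good stories.'.replace('.', ' .').split()
vocab, mat = cooccurrence(tokens, window=1)

# Header row of column labels, then one labeled row per word.
print(' ' * 8 + ' '.join(f'{w:>7}' for w in vocab))
for w, row in zip(vocab, mat):
    print(f'{w:>8}' + ' '.join(f'{c:>7}' for c in row))
```

With a symmetric window the matrix is symmetric too, which is a quick sanity check on any implementation.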

 

 

※ Full code and execution results: https://github.com/nsbg/NLP/tree/main/basic