[NLP Basics] BoW (Bag of Words)

geum 2022. 6. 27. 12:40

Concept

A method that counts how many times each word appears in a sentence and vectorizes the document based on those counts.

 

Example

A BoW model vectorizes text by referring to a vocabulary (word dictionary). Suppose we have a document made up of the four sentences below, and let's represent it with a BoW model.

 

◽ Document: ["It was the best of times", "It was the worst of times", "It was the age of wisdom", "It was the age of foolishness"]

◽ Vocabulary built from the document: ['It', 'was', 'the', 'best', 'of', 'times', 'worst', 'age', 'wisdom', 'foolishness']

◽ Vector representation of the first sentence (the other sentences work the same way)

1 1 1 1 1 1 0 0 0 0

First, the length of each sentence vector equals the number of entries in the vocabulary. The first sentence consists of the words It, was, the, best, of, and times, and each of them appears exactly once in that sentence, so those positions are filled with 1. The positions for worst, age, wisdom, and foolishness, which do not appear in the first sentence, are 0.

 

Implementation

1) Manual implementation

import numpy as np

docs = ["It was the best of times", "It was the worst of times", "It was the age of wisdom", "It was the age of foolishness"]

# Build the vocabulary (unique words, in order of first appearance)
word_dict = []

for sentences in docs:
    word_list = sentences.split()
    
    for word in word_list:
        if word not in word_dict:
            word_dict.append(word)

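# Build one count vector per sentence
sentences_vector = []

for sentences in docs:
    # Split into tokens so whole words are counted, not substrings
    tokens = sentences.split()
    word_count = {key: 0 for key in word_dict}
    
    for i in word_dict:
        word_count[i] = tokens.count(i)

    sentences_vector.append(list(word_count.values()))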

 

⭐ Output
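The manual implementation can be cross-checked with a compact sketch. This is an assumed equivalent, not the code from the notebook: `to_bow` is a hypothetical helper, and it counts whitespace-separated tokens with `collections.Counter`.

```python
from collections import Counter

docs = ["It was the best of times", "It was the worst of times",
        "It was the age of wisdom", "It was the age of foolishness"]

# Vocabulary in order of first appearance (dict keys keep insertion order)
vocab = list(dict.fromkeys(word for doc in docs for word in doc.split()))

def to_bow(sentence, vocab):
    """Hypothetical helper: count whole tokens and align them to the vocabulary."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

vectors = [to_bow(doc, vocab) for doc in docs]

for doc, vec in zip(docs, vectors):
    print(vec, doc)
# [1, 1, 1, 1, 1, 1, 0, 0, 0, 0] It was the best of times
# [1, 1, 1, 0, 1, 1, 1, 0, 0, 0] It was the worst of times
# [1, 1, 1, 0, 1, 0, 0, 1, 1, 0] It was the age of wisdom
# [1, 1, 1, 0, 1, 0, 0, 1, 0, 1] It was the age of foolishness
```

Each row has ten entries, one per vocabulary word, and the first row matches the vector shown in the example.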

 

2) Using CountVectorizer()

The CountVectorizer() provided by scikit-learn takes raw text as input and handles both building the BoW vocabulary and converting the text into vectors.

 

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
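# fit_transform() learns the vocabulary and vectorizes in one step;
# calling transform() on an unfitted vectorizer raises NotFittedError
bow = count_vectorizer.fit_transform(docs)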

 

⭐ Output

The word order differs from the result of the manual implementation. I wanted to see how CountVectorizer() orders the words, but I couldn't find a way to do it 🤪

 

Limitations

BoW has the advantage of being a very simple method, but because it only counts occurrences, the resulting numeric representation cannot capture context. In addition, frequently used words that carry no particular meaning (particles, demonstrative pronouns, and so on) receive high counts even though they are essentially meaningless, which can skew the results.
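Both limitations can be seen in a small sketch. The two sentences and the `to_bow` helper below are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def to_bow(sentence, vocab):
    """Hypothetical helper: count whole tokens and align them to the vocabulary."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

a = "the dog bit the man"
b = "the man bit the dog"

# Shared vocabulary, sorted alphabetically
vocab = sorted(set(a.split()) | set(b.split()))

vec_a = to_bow(a, vocab)
vec_b = to_bow(b, vocab)

print(vocab)           # ['bit', 'dog', 'man', 'the']
print(vec_a, vec_b)    # [1, 1, 1, 2] [1, 1, 1, 2]
print(vec_a == vec_b)  # True: word order, and with it the meaning, is lost

# The highest count belongs to a word that carries little meaning
top_word = vocab[vec_a.index(max(vec_a))]
print(top_word)        # the
```

The two sentences describe opposite events yet map to the same vector, and the largest value in that vector comes from "the".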

 

 

※ Full code: https://github.com/nsbg/NLP/blob/main/basic/bag-of-words.ipynb