[Translation] Introduction to Stemming and Lemmatization

geum 2022. 4. 1. 16:35

💬 I tried to translate this as smoothly as possible, but some sentences may still read awkwardly. Feedback is always welcome 🙂


Original article: https://medium.com/geekculture/introduction-to-stemming-and-lemmatization-nlp-3b7617d84e65

 

Natural Language Processing (NLP)

 

Text data can come from many different sources. Our goal is to extract plain text that is free of any source-specific markup or constructs that are irrelevant to the task.

 

Some features of language, such as punctuation and capitalization, and common words such as "a", "of", and "the", help give a document its structure but carry little meaning. It is therefore a good idea to remove them before feeding text data into a natural language processing pipeline.
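
The cleanup described above can be sketched in a few lines of plain Python. This is only a minimal illustration: the `STOPWORDS` set and the `clean` function are hypothetical names made up for this sketch, and the stopword list is deliberately tiny rather than a real one.

```python
import string

# Hypothetical, deliberately tiny stopword list (an assumption for illustration).
STOPWORDS = {"a", "an", "of", "the", "is", "to"}

def clean(text):
    """Lowercase the text, strip ASCII punctuation, and drop stopwords."""
    lowered = text.lower()
    # str.translate with a deletion table removes every punctuation character.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Whitespace tokenization is crude but enough for this sketch.
    return [tok for tok in no_punct.split() if tok not in STOPWORDS]

tokens = clean("The Matrix is a part of the story, definitely!")
print(tokens)
```

Splitting on whitespace is the simplest possible tokenizer; a real pipeline would more likely use a proper tokenizer and a full stopword list.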

 

What is stemming?

 

Stemming is the process of reducing a word to its root or stem form. Consider the three words "branched", "branching", and "branches". All of them can be reduced to the same word, "branch". After all, all three convey the same idea: something splitting off into branches. In other words, reducing them to a common form helps lower complexity while preserving the essence of the meaning the three words carry.

 

Stemming is also a fast, crude operation that works by applying very simple search-and-replace rules.

 

For example, the suffixes "ing" and "ed" can be dropped, and "ies" can be replaced with "y". This approach can produce stems that are not complete words, but that is fine, because every form of a given word in the corpus is reduced to the same stem.
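
To make the search-and-replace idea concrete, here is a toy stemmer built only from the three suffix rules mentioned above. It is a sketch for illustration, not the Porter algorithm: `RULES`, `toy_stem`, and the minimum-length check are all assumptions introduced here.

```python
# Ordered (suffix, replacement) rules taken from the paragraph above.
RULES = [("ies", "y"), ("ing", ""), ("ed", "")]

def toy_stem(word):
    """Apply the first matching suffix rule; leave other words untouched."""
    for suffix, replacement in RULES:
        # Require a few leading characters so short words like "red" survive.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

for w in ["branching", "branched", "started", "studies", "red", "boring"]:
    print(w, "->", toy_stem(w))
```

"branching" and "branched" both collapse to "branch", while "boring" becomes the incomplete stem "bor". That is harmless as long as every form of the word ends up at the same stem.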

 

words = ['first', 'time', 'see', 'second', 'renaissance', 'may', 
'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', 
'2', 'change', 'view', 'matrix', 'human', 'people', 'ones', 'started', 'war', 'ai', 'bad', 'thing']

 

NLTK, the Natural Language Toolkit, offers several stemmers to choose from, such as PorterStemmer and SnowballStemmer.

 

Here, let's import PorterStemmer for a simple stemming implementation.

 

from nltk.stem.porter import PorterStemmer

 

To use the stemmer, simply pass it the words of the corpus one at a time. In this example, the stopwords have already been removed.

 

⭐ Stopword removal code (added during translation)

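import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# The words list is used exactly as given in the article
stop_words = set(stopwords.words('english'))

filtered = []

for w in words:
    if w not in stop_words:
        filtered.append(w)

words = filtered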

 

stemmed_words = [PorterStemmer().stem(w) for w in words]
print(stemmed_words)

 

Running the code above prints the stemmed words.

 

 

In the output we can see some fairly good conversions: "started" is reduced to "start", "people" loses its final "e", and "ones" is reduced to "one". ** Translator's note: why is dropping the "e" from "people" listed as an example of a good conversion?

 

Introduction to lemmatization

 

Lemmatization is another technique for reducing words to a normalized form. In lemmatization, the transformation uses a dictionary to map the variants of a word back to its root form. This makes it possible to map even variants such as "is", "was", and "were" back to the root "be".
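
The dictionary lookup can be pictured with a hand-written table. This is a minimal sketch under obvious assumptions: `LEMMAS` and `toy_lemmatize` are hypothetical names, and a real lemmatizer relies on a full lexical database rather than a handful of entries.

```python
# Hypothetical hand-written lookup table (an assumption for illustration).
LEMMAS = {"is": "be", "was": "be", "were": "be", "ones": "one"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)

print([toy_lemmatize(w) for w in ["is", "was", "were", "ones", "war"]])
```

No suffix rule could turn "was" into "be"; only a lookup can. Words that are missing from the table, such as "war", simply pass through unchanged.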

 

For the example below, let's import WordNetLemmatizer.

 

from nltk.stem.wordnet import WordNetLemmatizer

# Needs the WordNet data, e.g. via nltk.download('wordnet')
lemmed_words = [WordNetLemmatizer().lemmatize(w) for w in words]

print(lemmed_words)

 

NLTK's default lemmatizer uses the WordNet database to map words to their root forms. Let's look at the output of the lemmatization step.

It looks as though only "ones" was changed, to "one", and everything else was left alone. Looking closely at the input, "ones" is the only plural noun. That is why only "ones" changed!

 

Lemmatization with PoS

 

A lemmatizer needs to know, or guess, the part of speech (PoS) of each word it is trying to reduce to its root form. WordNetLemmatizer defaults to noun, but this can be changed: specifying the "pos" parameter overrides the default. Let's pass "v" for verbs.
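
Why the part of speech matters can be sketched with a lookup keyed by both the word and its PoS tag. This is a toy model, not how WordNet is implemented: `LEMMAS_BY_POS` and `toy_lemmatize_pos` are hypothetical, and the three entries are chosen only to mirror the words discussed in this post.

```python
# Hypothetical table keyed by (word, part of speech); "n" = noun, "v" = verb.
LEMMAS_BY_POS = {
    ("ones", "n"): "one",
    ("started", "v"): "start",
    ("boring", "v"): "bore",
}

def toy_lemmatize_pos(word, pos="n"):
    """Look the word up under the given part of speech, defaulting to noun."""
    return LEMMAS_BY_POS.get((word, pos), word)

print(toy_lemmatize_pos("started"))           # looked up as a noun: unchanged
print(toy_lemmatize_pos("started", pos="v"))  # looked up as a verb
print(toy_lemmatize_pos("ones", pos="v"))     # noun entry is not consulted
```

With the noun default, "started" finds no entry and comes back unchanged. Only when it is looked up as a verb does it map to "start".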

 

lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in words]
print(lemmed)

 

 

This time, the two verbs "boring" and "started" were changed.

 

Conclusion

 

As the examples above show, stemming sometimes produces stems that are not complete words. This is not a problem, though, because all forms of a word are reduced to the same stem.

 

Lemmatization is similar to stemming, except that the final form is also a meaningful word. Unlike lemmatization, stemming does not need a dictionary. Depending on your constraints, it can therefore be the less memory-intensive option.