👩‍💻

[DACON] Code Analysis - MNIST: Digit Image Classification

geum 2021. 5. 25. 15:16

ಥ_ಥ

This is the result of the answers I submitted on DACON, the site I picked for practice, full of ambition to solve MNIST all on my own.

After the first attempt I thought "??" and submitted a second file, but an accuracy of less than 10% made my head spin.

But! After wrestling with it for 3 days I managed to raise the accuracy to 0.981, so I'm going to walk through the code in a happy mood.

 

I'm also going to do my best to analyze what caused the dismal accuracy, in the hope that this post is useful to someone!

 


🔎 Checking the data

① train.csv: pixel values and the digit each image shows

② test.csv: pixel values

③ sample_submission.csv: an example of the submission file

 

✨ Goal

Predict the label (digit) for each index from the pixel values in test.csv

 

👀 Code analysis

import tensorflow as tf
import numpy as np
import pandas as pd
#import matplotlib.pyplot as plt

from tensorflow import keras
#from sklearn.model_selection import train_test_split

The commented-out modules can be enabled whenever you need them.

 

from google.colab import drive
drive.mount('/gdrive', force_remount=True)

When I uploaded the data files to the Colab folder, they were only available while the runtime stayed connected, so I put them on Google Drive and mounted it to keep using them.

 

train = pd.read_csv("/gdrive/My Drive/train.csv").iloc[:, 1:]
test = pd.read_csv("/gdrive/My Drive/test.csv").iloc[:, 1:]
submission = pd.read_csv("/gdrive/My Drive/sample_submission.csv")

๋ฐ์ดํ„ฐ ์ฝ์–ด์˜ค๊ธฐ

 

print(train.shape)
print(test.shape)

# (60000, 786): 60000 rows, 786 columns
# (10000, 785): 10000 rows, 785 columns

Before preprocessing, I checked how the data is laid out.

 

x_train = train.drop(["label"], axis=1)
x_test = test
y = train["label"]

train ๋ฐ์ดํ„ฐ๋Š” ์œ„์—์„œ ํ™•์ธํ•œ ๊ฒƒ์ฒ˜๋Ÿผ ์—ด์ด 786๊ฐœ์ด์ง€๋งŒ ๋ชจ๋“  ์—ด์„ ์‚ฌ์šฉํ•  ๊ฒƒ์€ ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์— ์˜ˆ์ธกํ•ด์•ผ ํ•  ๊ฐ’์ธ label ์—ด์€ ์‚ญ์ œํ•˜๊ณ , test ๋ฐ์ดํ„ฐ๋Š” ์ด๋ฏธ ํ”ฝ์…€ ๊ฐ’๋งŒ ์ €์žฅ๋˜์–ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ถ”๊ฐ€์ ์ธ ๊ฐ€๊ณต ์—†์ด ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉํ•œ๋‹ค.

 

# ๊ฐœ์ธ์ ์œผ๋กœ ๊ฐ€์žฅ ํ—ค๋งธ๋˜ ๋ถ€๋ถ„

x_train = x_train[0:50000]
x_val = x_train[50000:60000]

y_train = y[0:50000].to_numpy()
y_val = y[50000:60000].to_numpy()

ํ•™์Šต ๋ฐ์ดํ„ฐ๋Š” 50000๊ฐœ, ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ๋Š” 10000๊ฐœ๋ฅผ ์‚ฌ์šฉํ•˜๊ฒ ๋‹ค๋Š” ์˜๋ฏธ์ด๋‹ค.

reshape๋‚˜ ๋ช‡๋ช‡ ํ•จ์ˆ˜๋“ค์ด ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ํ˜•ํƒœ์—์„œ๋Š” ์‚ฌ์šฉํ•  ์ˆ˜ ์—†๊ธฐ ๋•Œ๋ฌธ์— to_numpy ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•ด numpy ํ˜•ํƒœ๋กœ ๋ฐ”๊ฟ”์ค€๋‹ค. to_numpy ํ•จ์ˆ˜๋ฅผ ์“ฐ์ง€ ์•Š์•„๋„ ์ฝ”๋“œ๊ฐ€ ์ž˜ ๋Œ์•„๊ฐ„๋‹ค๋ฉด ๊ตณ์ด ์“ธ ํ•„์š”๋Š” ์—†๋Š”๋ฐ ๋‚˜๋Š” 'ndarray.~๋Š” ํ•จ์ˆ˜๋ช… ์š”์†Œ๊ฐ€ ์—†๋‹ค'๋Š” ์—๋Ÿฌ๊ฐ€ ๋– ์„œ ๋ฐ”๊ฟ”์คฌ๋‹ค.
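The order of the slices is easy to get wrong: if the training frame is truncated first, slicing rows 50000:60000 out of it gives nothing back. Here is a minimal sketch of that pitfall on a made-up 6-row frame. The toy sizes, the `p0`..`p3` column names, and the `split` variable are assumptions for illustration only, not the real DACON data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv: 6 rows, a label column and 4 "pixel" columns (hypothetical).
toy = pd.DataFrame(np.arange(30).reshape(6, 5),
                   columns=["label", "p0", "p1", "p2", "p3"])

features = toy.drop(["label"], axis=1)
labels = toy["label"]

split = 4  # first 4 rows for training, remaining 2 for validation

# Slice both parts from the full frame, validation first.
x_val = features[split:].to_numpy()
x_train = features[:split].to_numpy()
y_val = labels[split:].to_numpy()
y_train = labels[:split].to_numpy()

# What goes wrong if the frame is truncated before the validation slice is taken:
truncated = features[:split]
empty_val = truncated[split:]

print(x_train.shape, x_val.shape, empty_val.shape)
```

The last shape printed has zero rows, which is exactly what a validation set looks like when it is sliced from an already shortened frame.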

 

# Normalization
x_train = x_train.astype('float32') / 255
x_val = x_val.astype('float32') / 255
x_test = x_test.astype('float32') / 255

csv ํŒŒ์ผ์— ์ €์žฅ๋œ ๋ชจ๋“  ๋ฐ์ดํ„ฐ๋“ค์€ 0~255์˜ ์ƒ‰์ƒ๊ฐ’ ํ˜•ํƒœ๋กœ ์ €์žฅ๋˜์–ด ์žˆ๋Š”๋ฐ ํ•™์Šต์„ ์œ„ํ•ด 0~1 ์‚ฌ์ด์˜ ๊ฐ’์œผ๋กœ ์ •๊ทœํ™”์‹œ์ผœ์ค€๋‹ค.

 

y_train = tf.keras.utils.to_categorical(y_train, num_classes=10)
y_val = tf.keras.utils.to_categorical(y_val, num_classes=10)

y_train and y_val are one-hot encoded with the to_categorical function. num_classes is 10 because the digit labels run from 0 to 9, 10 classes in total.
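For intuition, one-hot encoding can be reproduced with plain NumPy. This is only a sketch of the idea, not how to_categorical is implemented, and the three labels are made up:

```python
import numpy as np

labels = np.array([3, 0, 9])   # made-up digit labels
num_classes = 10

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing the identity matrix with the labels encodes all of them at once.
one_hot = np.eye(num_classes, dtype="float32")[labels]

print(one_hot.shape)
print(one_hot[0])
```

Each row has a single 1 at the position of its label and 0 everywhere else, which is the target shape that categorical_crossentropy expects.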

 

model = keras.Sequential([ 
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

๋ชจ๋ธ์„ ๋งŒ๋“œ๋Š” ๋‹จ๊ณ„๋กœ ๊ต‰์žฅํžˆ ๊ฐ„๋‹จํ•œ ์„ ํ˜• ๋ชจ๋ธ์„ ๋งŒ๋“ค์—ˆ๋‹ค. Flatten ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•ด์„œ ์ž…๋ ฅ์ด 28*28์ธ ์ด๋ฏธ์ง€๋ฅผ 1์ฐจ์› ํ˜•ํƒœ๋กœ ๋งŒ๋“ค์–ด์ฃผ๊ณ  Dense๋ฅผ ์ด์šฉํ•ด ์ž…์ถœ๋ ฅ์„ ์—ฐ๊ฒฐํ•ด์ฃผ๋Š”๋ฐ ์ด ๋•Œ Dense layer์˜ ์ˆ˜๋Š” ์ž์œ ๋กญ๊ฒŒ, activation์€ ๊ผญ relu์™€ softmax๋ฅผ ์‚ฌ์šฉํ•  ํ•„์š”๋Š” ์—†๋‹ค.

 

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

์˜ตํ‹ฐ๋งˆ์ด์ €์™€ ์†์‹ค ํ•จ์ˆ˜, ํ‰๊ฐ€ ์ง€ํ‘œ์— ๋Œ€ํ•ด ์ง€์ •ํ•ด์ค€๋‹ค.

 

model.fit(x_train, y_train, epochs=10)

Training starts, running for 10 epochs.

 

The training accuracy comes out pretty decent!

 

y_pred = np.argmax(model.predict(x_test), axis=1)

test.csv์˜ ๊ฐ’์œผ๋กœ ์ˆซ์ž ๋ ˆ์ด๋ธ”์„ ์˜ˆ์ธกํ•˜๊ธฐ ์œ„ํ•ด predict ํ•จ์ˆ˜์˜ input์œผ๋กœ x_test๋ฅผ ๋„ฃ์–ด์ค€๋‹ค.

์šฐ๋ฆฌ๊ฐ€ ํ•„์š”๋กœ ํ•˜๋Š” ๊ฐ’์€ 0~9 ์ธ๋ฑ์Šค์— ํ•ด๋‹นํ•˜๋Š” ๊ฐ’์ด ์•„๋‹ˆ๋ผ ์ธ๋ฑ์Šค ๊ทธ ์ž์ฒด์ด๊ธฐ ๋•Œ๋ฌธ์— ์ตœ๋Œ€๊ฐ’์˜ ์ธ๋ฑ์Šค๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋Š” np.argmax ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค.
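A small sketch of that step, with made-up softmax outputs standing in for what predict returns (the probabilities are assumptions for illustration):

```python
import numpy as np

# Made-up softmax outputs for two images; each row has 10 values that sum to 1.
probs = np.array([
    [0.01, 0.02, 0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01],
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.60, 0.03, 0.02],
])

# axis=1 picks, for every row, the column index holding the largest probability.
y_pred = np.argmax(probs, axis=1)

print(y_pred)  # [2 7]
```

Without the argmax, each prediction is a row of 10 probabilities rather than a single digit, which is not what the label column of the submission file expects.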

 

# ์˜ˆ์ธก๊ฐ’์„ submisson ํŒŒ์ผ์˜ label ์—ด์— ํ• ๋‹น
submission['label'] = y_pred

# ์ตœ์ข… ์ œ์ถœ ํŒŒ์ผ ์ƒ์„ฑ
submission.to_csv("/content/submission.csv", index=False)
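Before uploading, it can help to check what the csv will look like. The sketch below writes to an in-memory buffer instead of a file. The `id` column name and the three predictions are assumptions, since the real sample_submission.csv layout isn't shown here.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical sample_submission: an id column plus a placeholder label column.
submission = pd.DataFrame({"id": [0, 1, 2], "label": [0, 0, 0]})
y_pred = np.array([7, 2, 1])   # made-up predictions

# The prediction array needs exactly one entry per row of the submission frame.
assert len(y_pred) == len(submission)
submission["label"] = y_pred

# index=False keeps pandas from writing its own row index as an extra column.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Leaving out index=False would add an unnamed first column, which is an easy way to end up with a file that doesn't match the expected format.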

 

😥 What was hard

1. Creating the train, test, and val data

With the MNIST data that Keras provides, load_data hands you x_train, x_test / y_train, y_test on its own, so having to build the data myself from csv files was really hard. The lectures I watched while studying theory and most of the posts I found by googling also used the Keras-provided MNIST data, so I had no feel at all for the step where the data has to be split into 50000 train rows and 10000 val rows.

 

2. Applying to_categorical() to the y data

I only knew that to_categorical takes care of one-hot encoding. When I actually tried to use it, I didn't really understand why it is used or when, so I think I put it everywhere that looked like a possible spot.

 

3. Using np.argmax

Up through my eighth attempt I didn't use np.argmax. Because the arrays held the pixel values divided by 255 for normalization, y_pred kept being filled with values that had nothing to do with the index I was looking for.

 

👏 Results and improvements

The result of submitting the code above:

From 0.0065 to 0.9778, a huge leap! I wanted to push the accuracy a bit higher, so I tried adding batch normalization and dropout.

 

1) Batch normalization & dropout

model = keras.Sequential([ 
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(10, activation='softmax')
])

์ •ํ™•๋„๋Š” 0.981์ด์—ˆ๋‹ค.

 

2) Dropout only

model = keras.Sequential([ 
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(10, activation='softmax')
])

์ •ํ™•๋„๋Š” 0.9718์ด์—ˆ๋‹ค.

 

I had only ever read that using batch normalization and dropout together is good for training, and when I tried it myself, in my runs it really was. Being able to see what I had studied with my own eyes was a very valuable experience, and that wraps up MNIST!