👩‍💻

[kaggle] Intro to Machine Learning

geum 2021. 4. 26. 17:20

 

This post summarizes the Intro to Machine Learning course on kaggle (www.kaggle.com/).

I studied from my own translation of the course, so some parts may not convey the original meaning accurately.

If you spot anything wrong, anything you would like to discuss, or any typos, feel free to leave a comment =)

 

 


 

1. How Models Work

 

Suppose the predicted price of a house can be split into two branches depending on the answer to the question 'Does it have 2 bedrooms?'. We then need data to decide which houses go into each branch. Within each resulting group, the predicted price is again determined from patterns in the data (conditions such as the number of bedrooms).

๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ ํŒจํ„ด์„ ์ฐพ์•„๋‚ด๋Š” ๊ณผ์ •์€ ๋ชจ๋ธ์˜ fitting ํ˜น์€ training, ์ด ๊ณผ์ •์—์„œ ์‚ฌ์šฉ๋˜๋Š” ๋ฐ์ดํ„ฐ๋Š” training data๋ผ๊ณ  ๋ถ€๋ฅธ๋‹ค.

 

 

Improving the Decision Tree

 

Of the two decision trees, the one on the left is the more realistic. However, the number of bedrooms is not the only factor that affects a house's price, so we can add more splits to build a deeper tree like the one in the picture below.

 

deeper trees
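As a rough sketch of what a deeper tree looks like in practice, the snippet below fits a one-level and a two-level tree with scikit-learn's `DecisionTreeRegressor` and prints the splits of the deeper one. The data and the column names `Rooms` and `Landsize` are made up for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Made-up homes; values and column names are assumptions for illustration.
homes = pd.DataFrame({
    "Rooms":    [1, 2, 2, 3, 3, 4, 4, 5],
    "Landsize": [100, 150, 600, 200, 700, 300, 800, 900],
    "Price":    [150_000, 180_000, 260_000, 240_000,
                 330_000, 300_000, 400_000, 450_000],
})
X = homes[["Rooms", "Landsize"]]
y = homes["Price"]

# max_depth limits how many questions are asked on the way to a prediction.
shallow = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)
deeper = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

print(shallow.get_n_leaves(), deeper.get_n_leaves())
print(export_text(deeper, feature_names=["Rooms", "Landsize"]))
```

Each extra level of splits lets the tree sort homes into more groups, each with its own predicted price.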

 

2. Basic Data Exploration

๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค๋ฃจ๋Š” ๋ฐ์— ๊ผญ ํ•„์š”ํ•œ ํˆด์ธ Pandas์—์„œ๋Š” DataFrame์„ ์ œ๊ณตํ•ด์ค€๋‹ค.

(Pandas.DataFrame์— ๊ด€ํ•œ ์ž์„ธํ•œ ์ •๋ณด๋Š” → โญ)

 

The data can be read as shown below, and the describe() method shows simple summary statistics for the resulting DataFrame.

import pandas as pd

# save filepath to variable for easier access
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'

# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path) 

# print a summary of the data in Melbourne data
melbourne_data.describe()

 

3. Your First Machine Learning Model

Selecting Data for Modeling

๋ฐ์ดํ„ฐ์…‹์€ ๋„ˆ๋ฌด ๋งŽ์€ ์ •๋ณด๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ํ•„์š”ํ•œ ์ •๋ณด๋งŒ์„ ๊ณจ๋ผ์„œ ์‚ฌ์šฉํ•˜๋Š” ๊ฒŒ ํ•„์š”ํ•˜๋‹ค.

์œ„ ์ด๋ฏธ์ง€๋Š” melbourne_data๋ผ๋Š” ๋ฐ์ดํ„ฐ์…‹์—์„œ column(์—ด) ํ•ญ๋ชฉ๋งŒ์„ ๋ณด์—ฌ์ฃผ๋Š” ์ฝ”๋“œ์™€ ๊ทธ ๊ฒฐ๊ณผ์ด๋‹ค.

 

์•„์ง๋„ ์ข…์ข… ํ—ท๊ฐˆ๋ฆฌ๋Š” column๊ณผ row,,, (์ถœ์ฒ˜: https://docs.devexpress.com/)

 

Selecting The Prediction Target

To select the column we want to predict, we access it with dot-notation. The resulting variable y is called the prediction target.
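A minimal sketch of dot-notation, assuming the target column is named `Price` and using a made-up DataFrame so the snippet runs on its own:

```python
import pandas as pd

# Made-up stand-in for melbourne_data.
melbourne_data = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [1_000_000.0, 1_500_000.0, 1_800_000.0],
})

# Dot-notation pulls out a single column as a pandas Series.
y = melbourne_data.Price
print(y)
```

Dot-notation only works when the column name is a valid Python identifier; `melbourne_data["Price"]` selects the same column.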

 

 

Choosing "Features"

๋ชจ๋ธ์— ์ž…๋ ฅ๋˜์–ด ์˜ˆ์ธก์„ ์ƒ์„ฑํ•˜๋Š” ๋ฐ์— ์‚ฌ์šฉ๋˜๋Š” ์—ด์„ Features๋ผ๊ณ  ํ•œ๋‹ค. Features๋Š” ๋ฆฌ์ŠคํŠธ์˜ ํ˜•ํƒœ๋กœ ์—ฌ๋Ÿฌ ๊ฐœ๋ฅผ ์„ ํƒํ•  ์ˆ˜๋„ ์žˆ์œผ๋ฉฐ ์ด ๊ฒฝ์šฐ์—๋Š” ๊ฐ๊ฐ์˜ ์•„์ดํ…œ๋“ค์ด string ํ˜•ํƒœ์—ฌ์•ผ๋งŒ ํ•œ๋‹ค.

 

 

Building Your Model - the steps for building and using a model

  • Define : specify the type of model
  • Fit : build the decision tree model from the data
  • Predict : make predictions
  • Evaluate : assess how accurate the model's predictions are
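The four steps above can be sketched with scikit-learn's `DecisionTreeRegressor`. The data is made up, and the choice of `Rooms` and `Landsize` as features is an assumption for illustration.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Made-up homes; every row has a distinct combination of feature values.
homes = pd.DataFrame({
    "Rooms":    [1, 2, 2, 3, 4, 5],
    "Landsize": [100, 150, 600, 200, 300, 900],
    "Price":    [150_000, 180_000, 260_000, 240_000, 300_000, 450_000],
})
X = homes[["Rooms", "Landsize"]]
y = homes.Price

# Define: pick the model type (random_state makes the result reproducible).
model = DecisionTreeRegressor(random_state=1)

# Fit: capture patterns from the data.
model.fit(X, y)

# Predict: here, for the same homes the model was fitted on.
predictions = model.predict(X)

# Evaluate: compare the predictions with the actual prices.
mae = mean_absolute_error(y, predictions)
print(predictions)
print(mae)
```

Because the model is evaluated on the very rows it was fitted on, the error it reports is overly optimistic.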

 

4. Model Validation

What is Model Validation

๋ชจ๋ธ ํ€„๋ฆฌํ‹ฐ ํ‰๊ฐ€๋ฅผ ์œ„ํ•ด ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” ์ฒ™๋„๋Š” MAE(Mean Absolute Error)๋กœ error๋Š” ์‹ค์ œ ๊ฐ’๊ณผ ์˜ˆ์ธก๊ฐ’์˜ ์ฐจ์ด๋กœ ๊ตฌํ•œ๋‹ค. MAE๋ฅผ ๊ตฌํ•˜๋ ค๋ฉด ๋ชจ๋ธ์ด ์ƒ์„ฑ๋˜์–ด์•ผ ํ•˜๊ณ  ์‚ฌ์ดํ‚ท๋Ÿฐ์—์„œ ์ œ๊ณตํ•˜๋Š” mean_absolute_error๋ฅผ ์ด์šฉํ•ด ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๋‹ค.

The error looks very small, but there is one big problem here.

 

์ด ์ฝ”๋“œ์—์„œ predicted_home_prices๋ฅผ ๋งŒ๋“œ๋Š” ๋ฐ์— ์‚ฌ์šฉ๋œ X๋Š” ๋ชจ๋ธ ํ•™์Šต์— ์‚ฌ์šฉ๋œ ๊ฐ’์ด๋‹ค. ์ฆ‰ ์ด๋ฏธ ํ•™์Šต๋œ ๊ฐ’์„ ๋‹ค์‹œ ๋ชจ๋ธ์— ๋„ฃ์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ์˜ค์ฐจ๊ฐ€ ๋‚ฎ๊ฒŒ ๋‚˜์˜จ ๊ฒƒ์ด๋‹ค. 

์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋ชจ๋ธ ์ƒ์„ฑ์— ์‚ฌ์šฉ๋˜์ง€ ์•Š์€ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•  ๊ฒƒ์ด๊ณ  ์ด ๋ฐ์ดํ„ฐ๋Š” validation data๋ผ๊ณ  ๋ถ€๋ฅธ๋‹ค.

 

Example of using train_test_split

์‚ฌ์ดํ‚ท๋Ÿฐ์—์„œ ์ œ๊ณตํ•˜๋Š” train_test_split ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•ด ๋ฐ์ดํ„ฐ๋ฅผ training/validation ๋‘ ๊ฐ€์ง€ ํ˜•ํƒœ๋กœ ๋‚˜๋ˆ„์–ด์ค€๋‹ค.

You can think of the validation data as an interim check on the performance of a model that has finished training on the train data. The final performance evaluation uses test data. I will write a separate post about these three kinds of data: train, validation, and test.
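The sketch below compares the error on the training rows with the error on held-out validation rows. The data is synthetic (generated with a fixed seed), and the column names and the price formula are assumptions for illustration only.

```python
import random

import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic homes: price depends on rooms and land size, plus random noise.
rng = random.Random(0)
rows = []
for i in range(40):
    rooms = rng.randint(1, 5)
    landsize = 100 + 20 * i  # distinct for every row
    price = 100_000 * rooms + 300 * landsize + rng.gauss(0, 20_000)
    rows.append({"Rooms": rooms, "Landsize": landsize, "Price": price})
homes = pd.DataFrame(rows)

X = homes[["Rooms", "Landsize"]]
y = homes.Price

# Split into training and validation parts (25% is held out by default).
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

# Error on rows the model has seen vs. rows it has never seen.
train_mae = mean_absolute_error(train_y, model.predict(train_X))
val_mae = mean_absolute_error(val_y, model.predict(val_X))
print(f"train MAE: {train_mae:,.0f}")
print(f"validation MAE: {val_mae:,.0f}")
```

An unrestricted tree can memorize the training rows, so only the validation MAE says anything about how the model handles new homes.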

 

5. Underfitting and Overfitting

  • Underfitting : the model fails to learn the train data properly, so its performance is poor even on the train data
  • Overfitting : the model is highly accurate on the train data, but its accuracy drops on completely new data
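One way to see both problems is to vary the size of a decision tree and compare training and validation error. The sketch below uses scikit-learn's `max_leaf_nodes` parameter to control tree size. That parameter is not mentioned in this post, and the synthetic data, the column names, and the `get_mae` helper are all assumptions for illustration.

```python
import random

import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic homes generated with a fixed seed so the run is repeatable.
rng = random.Random(0)
rows = []
for i in range(40):
    rooms = rng.randint(1, 5)
    landsize = 100 + 20 * i  # distinct for every row
    price = 100_000 * rooms + 300 * landsize + rng.gauss(0, 20_000)
    rows.append({"Rooms": rooms, "Landsize": landsize, "Price": price})
homes = pd.DataFrame(rows)

train_X, val_X, train_y, val_y = train_test_split(
    homes[["Rooms", "Landsize"]], homes.Price, random_state=0
)


def get_mae(max_leaf_nodes):
    """Fit a tree with the given leaf limit; return (train MAE, validation MAE)."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    return train_mae, val_mae


# 2 leaves: a very small tree; None: no limit, so the tree can memorize the data.
results = {leaves: get_mae(leaves) for leaves in (2, 5, None)}
for leaves, (train_mae, val_mae) in results.items():
    print(f"{str(leaves):>5}  train: {train_mae:>10,.0f}  validation: {val_mae:>10,.0f}")
```

A tree that is too small has a high error even on the train data (underfitting), while a tree with no limit drives the train error to zero without any guarantee on new data (overfitting).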

 

6. Random Forests

๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ์— ๋Œ€ํ•ด์„œ๋Š” ์•„์˜ˆ ํ•˜๋‚˜์˜ ๊ธ€๋กœ ์“ฐ๊ณ  ์žˆ์–ด์„œ ๋‹ค ์ž‘์„ฑํ•˜๋ฉด ๋งํฌ๋ฅผ ์ถ”๊ฐ€ํ•ด๋†“๋Š”๊ฑธ๋กœ!