
[Summary] Splitting a dataset with train_test_split

geum 2022. 3. 17. 16:36

While working through a Bagging exercise, I ran into an error message caused by the order of the dataset split (a whole two months ago), and I am only now getting around to writing it up.

 

 

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

import numpy as np

dataset = load_breast_cancer()  # Wisconsin breast cancer diagnostic dataset

 

I loaded the same data I used two months ago and imported only the modules I needed.

 

 

I confirmed that the Wisconsin breast cancer diagnostic dataset contains 569 samples in total.
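As a quick check of the sample count mentioned above, here is a minimal sketch. It assumes the `dataset` object is the one returned by `load_breast_cancer()`; the names `n_samples` and `n_features` are introduced here only for illustration.

```python
from sklearn.datasets import load_breast_cancer

# Load the Wisconsin breast cancer diagnostic dataset bundled with scikit-learn.
dataset = load_breast_cancer()

# data is the feature matrix, target holds one label per sample.
n_samples, n_features = dataset.data.shape
print(dataset.data.shape)
print(dataset.target.shape)
```

The first dimension of both arrays is the number of samples, which is the number that matters for the split below.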

 

X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target)

 

I split the dataset with train_test_split. Since this exercise is only meant to show why the order matters, I did not set any optional arguments and passed in only the data to split.

 

⭐ test_size 값을 μ •ν•˜μ§€ μ•Šμ„ 경우 default인 0.25κ°€ 적용됨

 

Results

1) X_train, X_test, y_train, y_test

 

- The default test_size is 0.25, which means 25% of the whole dataset is used as test data.

- 569*0.25 = 142.25, rounded up → 143 test samples for both X and y (the remaining 426 are training samples)
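The arithmetic above can be checked directly. This is a minimal sketch rather than a definitive recipe: it assumes the default arguments of train_test_split, and `expected_test` and `expected_train` are names made up for this example.

```python
import math

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer()

# Correct unpacking order: train and test part of X, then train and test part of y.
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target)

# With a fractional test_size, the test count is rounded up.
n_total = len(dataset.data)
expected_test = math.ceil(n_total * 0.25)
expected_train = n_total - expected_test

print(expected_train, expected_test)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

The split is shuffled, so the rows differ from run to run unless random_state is fixed, but the sizes stay the same.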

 

2) X_train, y_train, X_test, y_test

 

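- train_test_split returns its outputs in a fixed order: for each input array, the train part followed by the test part, i.e. X_train, X_test, y_train, y_test. The results are assigned to variables by position, so if you group the names as train with train and test with test, the variable called y_train actually receives the test part of X and the variable called X_test receives the train part of y. Features and labels no longer line up.

- The dataset is still split, but, just like past me, you can run into an error saying that the numbers of samples do not match.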

 

Conclusion

When using train_test_split, assign the results in train, test order for each array: X_train, X_test, y_train, y_test!

 

Reference

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html