
[Hugging Face] ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive n..

geum 2023. 8. 24. 13:55

μ΄μ€€λ²”λ‹˜κ»˜μ„œ μ˜¬λ €μ£Όμ‹  QLoRA+Polyglot-Ko-12.8B ν•™μŠ΅ 예제λ₯Ό 보고 λ”°λΌν•˜κ³  μžˆμ—ˆλŠ”λ° 원본 μ½”λ“œμ—μ„œλŠ” λ‚˜μ˜€μ§€ μ•Šλ˜ μ—λŸ¬κ°€ λ°œμƒν–ˆλ‹€.

 

κ΅¬κΈ€λ§ν•΄μ„œ μ°Ύμ•˜λ˜ 해결법(tokenizer 인자둜 padding=True/λ˜λŠ” 'max_length', truncation=True/λ˜λŠ” 'max_length' μΆ”κ°€)이 ν•˜λ‚˜λ„ 먹지 μ•Šμ•„μ„œ λ„ˆλ¬΄ λ‹΅λ‹΅ν–ˆμ—ˆλŠ”λ° μ•„λž˜μ™€ 같은 λ°©λ²•μœΌλ‘œ ν•΄κ²°ν•  수 μžˆμ—ˆλ‹€. 핡심은 remove_columns!

 

βœ… Solution

dataset = dataset.map(
    lambda samples: tokenizer(samples["text"], padding=True, truncation=True, max_length=128),
    batched=True,
    remove_columns=['inputs', 'labels'],  # drop the original columns so only the tokenizer outputs remain
)
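
If you want to reproduce the fix in isolation, here is a minimal sketch. The tokenizer name is the Polyglot-Ko-12.8B one from the title, but the tiny in-memory dataset and its row values are made up purely for illustration; only the column names ('text', 'inputs', 'labels') and the map call mirror the line above.

# Minimal sketch of the fix, assuming a dataset whose column layout matches the
# snippet above ("text" plus the original "inputs"/"labels" columns).
# The toy rows below are invented; the real example uses a different dataset.
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/polyglot-ko-12.8b")
if tokenizer.pad_token is None:            # make sure padding=True has a pad token to use
    tokenizer.pad_token = tokenizer.eos_token

dataset = Dataset.from_dict({
    "text":   ["첫 번째 예제 λ¬Έμž₯", "두 번째 예제 λ¬Έμž₯μž…λ‹ˆλ‹€"],
    "inputs": ["첫 번째", "두 번째"],
    "labels": ["예제 λ¬Έμž₯", "예제 λ¬Έμž₯μž…λ‹ˆλ‹€"],
})

dataset = dataset.map(
    lambda samples: tokenizer(samples["text"], padding=True, truncation=True, max_length=128),
    batched=True,
    remove_columns=["inputs", "labels"],   # drop the original ragged columns
)

print(dataset.column_names)  # 'inputs'/'labels' are gone; 'text' and the tokenizer outputs remain

As far as I can tell, this is also why padding/truncation alone never helped: those options only shape the tokenizer's own outputs, while the original `labels` column keeps riding along in the mapped dataset, and that ragged column is what the data collator cannot pack into same-length tensors. Dropping it with remove_columns leaves only the uniformly padded input_ids and attention_mask, so the batching error disappears.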