[Paper Summary] EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

neulvo 2022. 11. 17. 17:30

Paper link:

EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

We present EDA: easy data augmentation techniques for boosting performance on text classification tasks. EDA consists of four simple but powerful operations: synonym replacement, random insertion, random swap, and random deletion. On five text classificati

arxiv.org

Code link:

GitHub - jasonwei20/eda_nlp: Data augmentation for NLP, presented at EMNLP 2019

Data augmentation for NLP, presented at EMNLP 2019 - GitHub - jasonwei20/eda_nlp: Data augmentation for NLP, presented at EMNLP 2019

github.com

Korean implementation code link:

GitHub - catSirup/KorEDA: Adds WordNet so that EDA can also be used on Korean data

Adds WordNet so that EDA can also be used on Korean data. Contribute to catSirup/KorEDA development by creating an account on GitHub.

github.com

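Summary:

How the operations work

SR (synonym replacement): randomly choose n words that are not stop words and replace each with a random synonym
RI (random insertion): find a random synonym of a random non-stop word in the sentence and insert it at a random position in the sentence; repeat n times
RS (random swap): randomly choose two words in the sentence and swap their positions; repeat n times
RD (random deletion): remove each word in the sentence independently with probability p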
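The four operations above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `SYNONYMS` table and `STOP_WORDS` set are tiny hypothetical stand-ins (the original implementation looks synonyms up in WordNet), and the function names are chosen here for readability.

```python
import random

# Hypothetical toy data; the original EDA code uses WordNet and a full stop-word list.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
    "movie": ["film"],
}
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of", "to"}


def synonym_replacement(words, n, rng):
    """SR: replace up to n distinct non-stop words with a random synonym."""
    new_words = list(words)
    # dict.fromkeys removes duplicates while keeping a stable order
    candidates = [w for w in dict.fromkeys(words)
                  if w not in STOP_WORDS and w in SYNONYMS]
    rng.shuffle(candidates)
    for word in candidates[:n]:
        synonym = rng.choice(SYNONYMS[word])
        new_words = [synonym if w == word else w for w in new_words]
    return new_words


def random_insertion(words, n, rng):
    """RI: n times, insert a synonym of a random non-stop word at a random position."""
    new_words = list(words)
    candidates = [w for w in words if w not in STOP_WORDS and w in SYNONYMS]
    if not candidates:
        return new_words
    for _ in range(n):
        synonym = rng.choice(SYNONYMS[rng.choice(candidates)])
        new_words.insert(rng.randint(0, len(new_words)), synonym)
    return new_words


def random_swap(words, n, rng):
    """RS: n times, pick two positions at random and swap the words there."""
    new_words = list(words)
    if len(new_words) < 2:
        return new_words
    for _ in range(n):
        i, j = rng.sample(range(len(new_words)), 2)
        new_words[i], new_words[j] = new_words[j], new_words[i]
    return new_words


def random_deletion(words, p, rng):
    """RD: drop each word independently with probability p."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    # If every word was dropped, fall back to a single random word
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)
sentence = "the quick movie was a happy surprise".split()
print("SR:", synonym_replacement(sentence, 2, rng))
print("RI:", random_insertion(sentence, 2, rng))
print("RS:", random_swap(sentence, 2, rng))
print("RD:", random_deletion(sentence, 0.2, rng))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; for real data the toy synonym table would need to be swapped for a proper lexical resource.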

Examples

SR, RI, RS, RD

Performance with and without EDA (Easy Data Augmentation)

Average improvement of 0.8% on the full datasets, and 3.0% when N=500.

Latent space visualization of original and augmented sentences

The augmented sentences lie very close to their original sentences.

Performance as a function of alpha

alpha = 0.1 is the sweet spot
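For context, the paper defines alpha as the fraction of words in a sentence that each operation changes: for SR, RI, and RS the count is n = alpha * l for a sentence of length l, and RD uses p = alpha. A minimal sketch of that mapping follows; the function name `ops_per_sentence` is hypothetical, and the floor of one operation for short sentences is an assumption meant to mirror the reference code.

```python
def ops_per_sentence(num_words, alpha):
    """Number of SR/RI/RS operations for a sentence with num_words words."""
    # Assumed floor of 1 so that short sentences still get one change
    return max(1, int(alpha * num_words))


for length in (5, 20, 50):
    print(length, ops_per_sentence(length, alpha=0.1))
```

With alpha = 0.1, roughly one word in ten is touched per operation, which fits the observation that augmented sentences stay close to their originals.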

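Performance as a function of n (number of augmented sentences generated per original sentence)

The parameters in Table 3 of the paper are recommended.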

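Limitations of EDA

When data is sufficient, the performance gain is small.
When a pre-trained model is used, EDA may not yield a meaningful performance gain.
Fair comparison with related work is highly non-trivial (that is, difficult), because other studies use different models and datasets.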


License: Attribution, NonCommercial, NoDerivatives
