How to overcome these binary text classification problems?
1) Highly imbalanced dataset (less than 10% positive samples)
2) Positive class subjectivity
To overcome problem 1, one could use more refined classification systems (ensemble methods such as boosting or bagging, e.g., AdaBoost for starters). A very large dataset, even if imbalanced, could also help ease the problem. This complicates the situation "a little bit", though: to collect enough positive samples one has to annotate tons of instances.
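To make the boosting idea concrete, here is a minimal sketch of discrete AdaBoost over one-word decision stumps, written in plain Python. Everything in it is an assumption made for illustration: the toy corpus (1 positive out of 12 documents, so about 8%), the whitespace tokenizer, and the names `DOCS`, `LABELS`, `train_adaboost` and `predict`. It is not a tuned recipe for a real dataset.

```python
import math

# Hypothetical toy corpus: 1 positive out of 12 documents (about 8%).
DOCS = [
    "urgent offer",  # the only positive sample
    "urgent meeting",
    "special offer",
    "lunch at noon",
    "project status report",
    "weekly team notes",
    "invoice attached",
    "holiday schedule",
    "server maintenance tonight",
    "welcome aboard",
    "quarterly budget review",
    "parking permit renewal",
]
LABELS = [1] + [-1] * 11  # +1 = positive class, -1 = negative class


def tokenize(text):
    """Lowercase whitespace tokenizer returning the set of words."""
    return set(text.lower().split())


def stump_predict(word, polarity, tokens):
    """A stump votes `polarity` if the word is present, else `-polarity`."""
    return polarity if word in tokens else -polarity


def train_adaboost(docs, labels, rounds=3):
    """Discrete AdaBoost; returns a list of (alpha, word, polarity) stumps."""
    token_sets = [tokenize(d) for d in docs]
    vocab = sorted(set().union(*token_sets))
    weights = [1.0 / len(docs)] * len(docs)
    ensemble = []
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error.
        best = None
        for word in vocab:
            for polarity in (1, -1):
                err = sum(
                    w
                    for w, toks, y in zip(weights, token_sets, labels)
                    if stump_predict(word, polarity, toks) != y
                )
                if best is None or err < best[0]:
                    best = (err, word, polarity)
        err, word, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, word, polarity))
        # Increase the weight of misclassified documents, then renormalize.
        weights = [
            w * math.exp(-alpha * y * stump_predict(word, polarity, toks))
            for w, toks, y in zip(weights, token_sets, labels)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, text):
    """Sign of the alpha-weighted vote of all stumps."""
    tokens = tokenize(text)
    score = sum(a * stump_predict(w, p, tokens) for a, w, p in ensemble)
    return 1 if score > 0 else -1


model = train_adaboost(DOCS, LABELS, rounds=3)
for alpha, word, polarity in model:
    print(f"{word!r:12} polarity={polarity:+d} alpha={alpha:.3f}")
```

No single word separates the lone positive from the negatives here ("urgent" and "offer" each also occur in a negative document), so one stump alone is not enough. Each round reweights the documents the previous stumps got wrong, which is the property that makes boosting worth trying when positives are rare. On real data one would need many more rounds, a held-out set, and a metric such as precision/recall rather than accuracy.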
To overcome problem 2, one could find a god to pray to.