Tuesday, December 1, 2015

Classification of imbalanced datasets

How can one overcome these two problems in binary text classification?
1) Highly imbalanced dataset (less than 10% positive samples)
2) Positive class subjectivity

To overcome problem 1, one could use more refined classifiers, such as ensemble methods based on boosting or bagging (AdaBoost, for starters). A very large dataset, even an imbalanced one, would also help ease the problem, but this complicates the situation "a little bit": when fewer than 10% of the samples are positive, collecting enough positive samples means annotating tons of instances.
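To make the boosting idea a bit more concrete, here is a minimal sketch of discrete AdaBoost with one-feature threshold stumps, written in plain Python. Everything in it is an assumption made for illustration: the toy data (two hypothetical keyword counts per document, 2 positives out of 30), the function names, and in particular the class-balanced initial weights, which are one common tweak for imbalanced data and not part of standard AdaBoost.

```python
import math

def stump(x, feature, threshold, polarity):
    """Weak learner: vote `polarity` if x[feature] >= threshold, else the opposite."""
    return polarity if x[feature] >= threshold else -polarity

def fit_adaboost(X, y, n_rounds=5):
    """Discrete AdaBoost over threshold stumps; labels must be -1 or +1."""
    # Assumption: start with class-balanced weights (each class sums to 0.5)
    # instead of the usual uniform 1/n, so the rare class is not ignored.
    counts = {label: y.count(label) for label in set(y)}
    w = [0.5 / counts[label] for label in y]
    ensemble = []  # list of (alpha, feature, threshold, polarity)
    for _ in range(n_rounds):
        # Exhaustively pick the stump with the lowest weighted error.
        best = None
        for f in range(len(X[0])):
            for thr in sorted({x[f] for x in X}):
                for pol in (1, -1):
                    err = sum(wi for xi, yi, wi in zip(X, y, w)
                              if stump(xi, f, thr, pol) != yi)
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol)
        err, f, thr, pol = best
        # Clip the error so alpha stays finite when a stump is perfect.
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, thr, pol))
        # Up-weight misclassified samples, down-weight correct ones, renormalize.
        w = [wi * math.exp(-alpha * yi * stump(xi, f, thr, pol))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump(x, f, t, p) for alpha, f, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy, hypothetical data: [count of keyword A, count of keyword B] per document.
negatives = [[i % 2, i % 3] for i in range(28)]
positives = [[3, 1], [4, 0]]
X = negatives + positives
y = [-1] * len(negatives) + [1] * len(positives)

model = fit_adaboost(X, y)
recall = sum(predict(model, x) == 1 for x in positives) / len(positives)
print("positive-class recall on the training set:", recall)
```

On real text the features would come from something like a bag-of-words representation, and recall should be measured on held-out data, since accuracy alone is misleading when negatives dominate.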

To overcome problem 2, one could find a god to pray to.
