giovedì 10 dicembre 2015

Levenshtein distance, AKA "how to compute similarity between strings"

The Levenshtein distance is a useful distance metrics that you can use to state whether two strings are similar between each other. It computes the number of different characters of a pair of strings.
The Wikipedia page proposes two alternatives:

  1. the first one is recursive
  2. the second one is iterative
Although they are fully working, and the recursive one is elegant to be read, when I found myself in implementing it in JavaScript to compare pairs of tweet texts, the recursive one wouldn't come to a result. It took ages and did not finish the computation for the pair of provided strings!
Thus, I found this nice test case which compares the two versions (in JavaScript):

[Oh, well, why JavaScript? Tweets are collected in JSON format and stored in MongoDB, and when you are required to analyze in a quick-and-dirty manner your dataset, you can JavaScript functions directly via the MongoDB console]

martedì 1 dicembre 2015

Classification of imbalanced datasets

How to overcome these binary text classification problems?
1) Highly imbalanced dataset (less than 10% positive samples)
2) Positive class subjectivity

To overcome problem 1, one could use more refined classification systems (boosting, bagging, e.g., AdaBoost for starters), although a very large dataset (even if imbalanced) could help ease the problem. This complicates the situation "a little bit": to collect enough positive samples one has to annotate tons of instances.

To overcome problem 2, one could find a god to pray.