sabato 7 novembre 2015

A crash course in Python - Lesson 3: Dictionaries

Data structure: the map

A fundamental data structure in computer science is the map. A map associates a key (usually a string) with a value (which usually can be any object).

Here is an example of a map construction and usage.
Suppose we would like to count the number of words in a document, in order to build a histogram of frequencies. Let's assume we use the following map to store the histogram:
"word1" -> frequency1
"word2" -> frequency2
....
"wordN" -> frequencyN
On the left side, I listed the words of the documents, without duplicates. These represent the keys of the map. Whenever a key is provided (say, word1), the corresponding value (in our example, frequency1) is retrieved.
In order to build the aforementioned map, we can use the following algorithm (written in pseudo-code):
for (aWord in document)
    map[aWord] = map[aWord] + 1
In this pseudo-code, map[aWord] retrieves the value associated with the word aWord. In this way, every time the word aWord is encountered in the document, the frequency associated with that word and stored in the map is incremented.
Where is the pro in all this?

  • Accessing the map given a word takes O(1)
  • With an algorithm of time complexity O(N) we built the histogram of frequencies

Almost every programming language has its own implementation of such data structure (e.g., HashMap is the Java implementation).
Moreover, there are many products based on it. If you are fond of NodeJS, for instance, you could know Redis, an in-memory data structure stored, which can be used as a cache, and is fully integrated with NodeJS. In this case, keys and values are both stored as strings.

From map to Python's dictionary

Python has its own "map-like" data structure, called dictionary. A dictionary associates values with keys, and is declared as follows:
empty_dict1 = {}
empty_dict2 = dict()
grades = {"Joel": 80, "Tim": 95}

Properties of keys

Dictionary keys must be immutable. In particular, you cannot use lists as keys! If you need a multipart key, then you should use a tuple, or figure out a way to turn the key into a string.

Retrieving a value

You can look up the value for a specific key using square brackets:
joel_grade = grades["Joel"]

Obviously, it could be the case that a key does not exist in the map. Thus, you may want to cast an exception that cover such case:
try:
    kates_grade = grades["Kate"]
except KeyError:
    print "No grade for Kate"

Instead of casting an exception, one may want to check if a key exists by doing the following operation:
joel_has_grade = "Joel" in grades   # True
In this case, we are using the operator in to check if the specified value is contained in the dictionary. This reminds lists and the membership check on an element... (check my previous Python lesson on lists).

However, to avoid the whole fuss and muss that could be created by the absence of a key, you can use the get() method, which:

  • in case the key exists, returns the associated value
  • in case the key does not exist, returns a default value
Here are some examples:
joel_grade = grades.get("Joel",0)  # equals 80, default is 0
no_one_grade = grades.get("No one")  # default is None

Adding a new value

It is possible to assign key-value pairs using the square brackets notation:
grades["Tim"] = 99

Going back to our word histogram example

We are going to create a dictionary where the keys are words and the values are word counts.
word_counts = {}
for word in document:
    if word in word_counts:
        word_counts [word] += 1
    else:
        word_counts[word] = 1

A second (and equivalent) approach is the following, which uses exceptions to handle the special case in which the word is not in the dictionary:
word_counts = {}
for word in document:
    try:
        word_counts [word] += 1
    except KeyError:
        word_counts[word] = 1

A third approach uses the get() method:
word_counts = {}
for word in document:
    previous_count = word_counts.get(word, 0)
    word_counts[word] = previous_count + 1

defaultdict

A defaultdict (i.e., "default dictionary") is a regular dictionary that handles particularly the case in which someone tries to look up a key it does not contain. In this case, the dictionary adds a value for it, using a zero-argument function you provide when you create it.

To use a default dictionary, you have to import it:
from collections import defaultdict

Going back to our example, we can use default dictionaries to store the histogram of words:
word_counts = defaultdict(int)    # int() produces 0
for word in document:
    word_counts[word] += 1

You can use a default dictionary to initialize several objects:
def_list_dict = defaultdict(list)  # list() produces an empty list
def_list_dict[2]. append(1)  # {2:[1]}

default_dict = defaultdict(dict) # dict() produces an empty dict
default_dict["Joel"]["City"] = "Seattle"  # {"Joel":{"City":"Seattle"}}

default_pair = defaultdict(lambda: [0,0])
default_pair[2][1] = 1  # {2:[0,1]}

Counter

A counter turns a sequence of values into a defaultdict(int)-like object, mapping keys to counts. It can be used for histograms.
from collections import Counter
c = Counter([0, 1, 2, 0])  # {0:2, 1:1, 2:1}
The output shows that:

  • the number "0" appeared 2 times in the sequence
  • the number "1" appeared 1 time in the sequence
  • the number "2" appeared 1 time in the sequence

We can use a Counter to solve the problem of counting words in a document, too:

word_counts = Counter(document)

In the Counter object there is the possibility of using the most_common method:
# print the 10 most common words and their counts
for word, count in word_counts.most_common(10)
    print word, count

Nessun commento:

Posta un commento