Cherish Idempotency

An operation is idempotent if running it multiple times produces the same final result as running it once. Try to write idempotent code: it will make your life much easier and prevent errors you didn't know you had!
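To make the definition concrete, here is a minimal sketch with two toy functions. The names normalize and add_prefix are made up for illustration and do not appear anywhere else in this note.

```python
def normalize(text):
    # Idempotent: stripping and lowercasing an already
    # stripped, lowercased string changes nothing
    return text.strip().lower()

def add_prefix(text):
    # Not idempotent: every call adds another prefix
    return "> " + text

once = normalize("  Climate  ")
twice = normalize(once)
print(once == twice)

quoted_once = add_prefix("climate")
quoted_twice = add_prefix(quoted_once)
print(quoted_once == quoted_twice)
```

The first comparison holds, the second does not: applying add_prefix to its own output keeps changing the data.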


Imagine that you collected a bunch of comments from a social media site. You now want to figure out how many of those contain references to "climate". Simple enough, right?

Let's say the raw data looks like this:

texts = ["As a climate-saving cloud hugger, I love guerilla gardening!"]

Now you want to split the text into individual words to gather counts. You also want to remove irrelevant characters such as punctuation. You write the following code:

import re
# Regular expression to find punctuation
p_rx = re.compile('[,.-]')

def split_words(strings):
    output = []
    for string in strings:
        # Split text string into individual words
        words = string.split(' ')
        # Apply regular expression to replace
        # punctuation with a space
        cleaned = [p_rx.sub(' ', w) for w in words]
        # Collect the cleaned words
        output.extend(cleaned)
    return output

Now what's the problem? You might have spotted it already: the function split_words is not idempotent, meaning that feeding its output back into it changes the result. The first run gives:

>>> result = split_words(texts)
>>> result
['As', 'a', 'climate saving', 'cloud', 'hugger ', 'I', 'love', 'guerilla', 'gardening!']

But if we pass that result into the same function once more, we get:

>>> split_words(result)
['As', 'a', 'climate', 'saving', 'cloud', 'hugger', '', 'I', 'love', 'guerilla', 'gardening!']

In the first run, 'climate-saving' counts as a single word and is not split. But then we replace the dash with a space! In the second run, that space gets picked up by split(' ') and the word is separated. The problem might seem trivial, but it is quite frequent in data cleaning: if we want to count all appearances of the word "climate", the first run yields 0 cases (because 'climate saving' is still a single item), while the second run yields 1 case.
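The changing count can be reproduced with a short sketch. It assumes that split_words collects the cleaned words in output, which is what the results shown above imply; the variable names first and second are only for illustration.

```python
import re

p_rx = re.compile('[,.-]')

def split_words(strings):
    # Assumed behavior: cleaned words are collected in output
    output = []
    for string in strings:
        words = string.split(' ')
        output.extend(p_rx.sub(' ', w) for w in words)
    return output

texts = ["As a climate-saving cloud hugger, I love guerilla gardening!"]

first = split_words(texts)
second = split_words(first)

# Count exact matches of "climate" after each run
print(first.count("climate"))   # 0
print(second.count("climate"))  # 1
```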

Doing it right

We can fix our example by swapping the operations: first replace, then split. Or we could combine them: split on either spaces or dashes.
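One possible way to write the combined version is sketched below. The names sep_rx and split_words_fixed are hypothetical. As an assumption beyond the text, the pattern also treats commas and periods as separators, so that it covers the same punctuation as the original regular expression.

```python
import re

# Split on runs of whitespace, commas, periods, or dashes in one step
sep_rx = re.compile(r'[\s,.-]+')

def split_words_fixed(strings):
    output = []
    for string in strings:
        # Drop empty strings left over at the edges of a text
        output.extend(w for w in sep_rx.split(string) if w)
    return output

texts = ["As a climate-saving cloud hugger, I love guerilla gardening!"]

first = split_words_fixed(texts)
second = split_words_fixed(first)

print(first == second)
print(first.count("climate"))
```

Because splitting and cleaning now happen in a single operation, no separator is left behind for a later run to pick up, and a second pass returns the same list.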

In real work, interdependencies are often less obvious, for example if patterns span multiple characters. When the data processing pipelines become very long, it may be unfeasible to have individually idempotent functions in them. In that case, the best strategy is to have encapsulated code sections that perform complete end-to-end analyses, beginning from the raw data. This is one of the main advantages of the R Tidyverse: It encourages writing end-to-end pipes instead of many intermediate steps (which would leave intermediate variables lying around and invite multiple transformations).

library(tidyverse)

# Instead of intermediate variables:
selection <- data %>% select('text')
cleaned <- selection %>% mutate(
    text=str_replace(text, '-', '')
)

# Better put everything in a single pipe
cleaned <- data %>%
    select('text') %>%
    mutate(text=str_replace(text, '-', ''))
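The same idea can be sketched in Python for the comment-counting example: one function that always starts from the raw texts and returns the final number. The function name count_mentions and the separator pattern are assumptions made for illustration.

```python
import re

def count_mentions(raw_texts, term):
    """Run the whole analysis, from raw text to final count, in one call."""
    total = 0
    for text in raw_texts:
        # Split on whitespace, commas, periods, or dashes in one step
        words = re.split(r'[\s,.-]+', text)
        total += words.count(term)
    return total

raw_texts = ["As a climate-saving cloud hugger, I love guerilla gardening!"]

print(count_mentions(raw_texts, "climate"))
print(count_mentions(raw_texts, "climate"))
```

Since the function never modifies raw_texts and keeps no intermediate state, calling it again gives the same count.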

Data collection

Idempotency evidently gets more complicated when we gather external data, especially if that data changes over time: Even if our code is clean, we may still get different results every time it runs. That's a topic for a future lab note!