7 Working with data

7.1 Elements of data processing

In the previous chapter, we dealt with functions as reusable pieces of code. There is another common notion of what a function is: a function processes data. The basic model of data processing is: input → processing → output.

So, what is data? As psychologists, we are used to thinking of data as observations from empirical studies that we process with statistical procedures. That precisely matches the above model: the observations are the input, a statistical procedure (say, a t-test) does the processing, and the resulting statistic is the output.

In Python, as a general programming language, there exist libraries that implement tables of observations, as well as the t-test (and more modern statistical procedures, luckily). However, we don’t want to bother you with statistics for the moment. Instead, take another look at a function that we introduced in Chapter 6:

def summation(alist):
    # accumulate the values of the list one by one
    sum = 0.0
    for i in range(len(alist)):
        sum += alist[i]
    print("Sum of the list is:", sum)
    return sum

summation([1,2,3,4])
## Sum of the list is: 10.0
## 10.0
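
As an aside, Python’s built-in function sum computes the same result without an explicit loop:

print(sum([1, 2, 3, 4]))
## 10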

The function summation takes a list of values and computes their sum. Again, that satisfies the basic model of data processing.

It is easy to see how input and output are implemented with Python functions:

  • function arguments provide the input
  • the function body does the processing
  • the return statement gives the output

Consequently, every function that takes arguments and has a return statement can be considered data processing. Furthermore, data processing can take place in chains. Let’s take the variance of a list of numbers as an example. The formula for variance,

$$\mathrm{Var}(x) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

reveals that the variance statistic is produced in three steps: first, the residuals $(x_i - \bar{x})$ are computed, then these are squared, and finally the mean value of the squares is taken.
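
To see the chain at work, take the list [1, 8, 2, 9]: its mean is 5, the residuals are [-4, 3, -3, 4], the squared residuals are [16, 9, 9, 16], and their mean is 50 / 4 = 12.5. This is exactly the value that the code below computes.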

Let’s see how this is implemented in Python. For the residuals and the squaring, we create our own functions, but resort to the function mean from the numpy library. Note how the two defined functions collect their results in a dedicated output list, rather than modifying the original one. Recall that lists are passed by reference, which means that the original list would otherwise be altered by the function.
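
To see why this matters, here is a small demonstration (square_in_place is our own, purely illustrative function) of what happens when a function writes into its argument list instead:

def square_in_place(lon):
    # writes into the caller's list instead of a fresh one
    for i in range(len(lon)):
        lon[i] = lon[i]**2

LON = [1, 2, 3]
square_in_place(LON)
print(LON)
## [1, 4, 9]

The caller’s variable LON has silently changed. To avoid such surprises, our chain functions build a fresh output list: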

from numpy import mean

def residuals(lon):
    # subtract the mean from every value
    avg = mean(lon)
    output = []
    for i in lon:
        output.append(i - avg)
    return output

def square(lon):
    # square every value
    output = []
    for i in lon:
        output.append(i**2)
    return output

LON = [1, 8, 2, 9]
RES = residuals(LON)
SQR = square(RES)
VAR = mean(SQR)

print(VAR)
## 12.5

The above code represents the processing chain in such a way that the intermediate results are stored in dedicated variables (RES, SQR). While this makes the chain explicit in the code, for longer chains there is the downside of length and of cluttering the program with variables that are used just once. A more compact way to write the chain is by nesting the function calls:

# residuals() and square() as defined above

LON = [1, 8, 2, 9]
VAR = mean(square(residuals(LON)))
print(VAR)
## 12.5

Nested functions are evaluated inside-out. Consequently, when setting up such a chain, one begins with the first processing step and builds the subsequent ones “around” it. While this is conveniently compact, it lacks the visual intuition of a data processing chain. Therefore, in many situations it is best to encapsulate the whole chain as a dedicated function:

# residuals(), square() and mean as defined above

def variance(lon):
    return mean(square(residuals(lon)))
  
LON = [1, 8, 2, 9]
print(variance(LON))
## 12.5

This simple example highlights another important aspect of the data processing model: models can be nested. The function variance is itself a data processing chain, but it can take a role in a more global data processing chain. For example, when you prepare a summary statistics table for your report, you use the variance function as one part of it. You care about its required input and know what it will produce, but you no longer worry about the processing chain inside it. That is called a black box.
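
To illustrate, the standard deviation is simply the square root of the variance, so we can build on variance without ever looking inside it again:

from math import sqrt

def standard_deviation(lon):
    # variance() is used as a black box: we only rely on
    # what goes in (a list) and what comes out (a number)
    return sqrt(variance(lon))

print(standard_deviation([1, 8, 2, 9]))
## 3.5355339059327378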

7.2 Data tables

The purpose of psychological experiments is to collect data from participants under various conditions. Imagine you wanted to analyze the data from a single participant in the Stroop task, perhaps as a factorial linear model (or ANOVA, if you like). Your data analysis program most likely expects the data in the following form:

Condition     Correctness   RT
congruent     TRUE          800.9212
congruent     FALSE         696.8643
congruent     TRUE          765.2435
congruent     FALSE         814.3323
incongruent   TRUE          630.5570
incongruent   FALSE         686.0606
incongruent   TRUE          693.3339
incongruent   TRUE          731.7975

How would you create such a table in Python? The simplest way is to regard every column as a separate list. You initialize the data gathering by creating three empty lists, which the main part of the program then populates in parallel, one element per observation:

Condition = []
Correctness = []
RT = []

# after every observation (this_condition, this_correctness and
# this_reaction_time come from the trial that was just run):
Condition.append(this_condition)
Correctness.append(this_correctness)
RT.append(this_reaction_time)
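
To make the pattern concrete, here is a minimal sketch of how the lists fill up over the course of an experiment. The function run_trial() is a hypothetical stand-in for whatever your program does to present one stimulus and collect one response:

import random

def run_trial():
    # hypothetical placeholder: draws a random observation instead
    # of presenting a stimulus and recording a response
    condition = random.choice(["congruent", "incongruent"])
    correctness = random.choice([True, False])
    reaction_time = random.uniform(600, 850)
    return condition, correctness, reaction_time

Condition = []
Correctness = []
RT = []

for trial in range(8):
    this_condition, this_correctness, this_reaction_time = run_trial()
    Condition.append(this_condition)
    Correctness.append(this_correctness)
    RT.append(this_reaction_time)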

7.3 Down the sink: Writing data files

In the case of psychological experiments, the main part of the data processing is most likely done in other programs, such as R, SPSS or Excel. These programs can make little use of data that is stored in Python variables. Rather, data is read in from files. While all programs have their own data format for tables (and specialized functions exist for some), the lingua franca of file-based data exchange is the comma-separated values (CSV) format. It is almost as simple as it sounds:

  • every row of data is represented as a row of text
  • all values in a row are separated by commas

The example data set above would look like the following in CSV:

"","Condition","Correctness","RT"
"1","congruent",TRUE,800.921185693852
"2","congruent",FALSE,696.864295047379
"3","congruent",TRUE,765.243482711174
"4","congruent",FALSE,814.332269635055
"5","incongruent",TRUE,630.556964944383
"6","incongruent",FALSE,686.060561659131
"7","incongruent",TRUE,693.333933180317
"8","incongruent",TRUE,731.797519903504

In order to write such a file, we could use a loop together with string concatenation. Fortunately, the csv library that comes with Python implements CSV export right away. However, it is not as simple as issuing just one command. Dealing with files is an issue of its own. Unlike variables, which are linked to a running program, files are just dumb sinks on the hard disk of your computer. What must not happen is that, for example, two programs modify the same file simultaneously, as this would create a mess. Therefore, the operating system takes special care of files and ensures that only one program can write to a file at a time (simultaneous reading is less of a problem). To signal the operating system that a file is being written, some decoration is necessary. More specifically, the programmer must first create a file handle on the file, which is what the command open does.

The csv library commands then operate on this file handle. First, the function csv.writer creates a writer object on the open file. Then, its method writerow adds the rows of the table, one by one. That is slightly inconvenient, because our data table is arranged by columns (the three lists) in the first place. For that reason, the for loop first gathers every row of the table as a list this_row. Finally, the close command signals the operating system that the file is no longer in use.

Condition = ["incongruent", "incongruent", "congruent"]
Correctness = [False, False, True]
RT = [800, 696, 765]

import csv

# open the file for writing; newline = "" prevents spurious blank
# rows on Windows, as recommended in the csv module documentation
myfile = open("Stroop.csv", mode = "w", newline = "")
writer = csv.writer(myfile)
for i in range(0, len(Condition)):
    # gather the i-th element of every column into one row
    this_row = [Condition[i], Correctness[i], RT[i]]
    writer.writerow(this_row)
myfile.close()

As a general rule, it is good to keep the period between open and close short. In particular, one should not open the file at the start of the program and close it only when the program quits. Programs in development (and programs written by apprentices) tend to crash, and a crash with an open file means data loss and potential file corruption. The safest way is probably to open the file after an observation has been gathered, append the observation, and close the file immediately.
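
A convenient way to enforce this discipline is Python’s with statement, which closes the file automatically as soon as the indented block is left, even when the program crashes inside it. A minimal sketch:

import csv

# the file is closed automatically at the end of the with block,
# even if an exception occurs inside it
with open("Stroop.csv", mode = "w", newline = "") as myfile:
    writer = csv.writer(myfile)
    writer.writerow(["congruent", True, 800])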

So how do we append rows to an existing file? A closer look at the open function reveals that it takes a second parameter, the mode, which is of type string. Three major modes exist (plus a bunch of fine-tuning options):

mode   meaning
r      reading
w      writing
a      appending

Hence, the code for appending observations one by one could be:

import csv

# this_condition, this_correctness and this_reaction_time hold
# the observation that was just gathered
myfile = open("Stroop.csv", mode = "a", newline = "")
writer = csv.writer(myfile)
writer.writerow([this_condition, this_correctness, this_reaction_time])
myfile.close()

7.4 May the source …: Reading data files

A processing chain starts with a data source, no matter what. The source can be one of three kinds:

  1. a function call, for example one that produces random numbers
  2. user input, such as the response time, which is derived from a button press
  3. data files

The first two options we have already covered in this course. In the following, we will see how to read data from files. Files are typically produced by other programs; for tables, that could be Excel (or a programming editor, if you like). As we have seen, CSV is a good choice.

In the Stroop program, we currently use only one external source of data, the user responses. All other data, like the words and colors, are just coded as lists, tuples and dictionaries. Reading data from files seems an unnecessary exercise at first, but it can in fact make the program more flexible when it is used by non-programmers. A possible scenario for the Stroop task is that different experimenters use it, but with their own sets of colors. Here is how one could do it. First, we prepare a table that assigns color words to their respective RGB codes, and save it as a CSV file. Any spreadsheet program is suited for the job:

Color,R,G,B
violet,180,0,180
green,0,150,150
orange,255,150,0

Let’s assume the above table has been created in Excel and saved as Colors.csv. The following Python code opens the CSV file and creates a reader (analogous to the writer above). Then, four lists are created that will capture the table columns. The for loop walks the reader through the CSV file line by line. As the first line of the file (with index 0) contains the column names, it is omitted. The result is four lists of the same length, which together make a table. By drawing two random integers, we can select a color word and an ink color, from which we can produce the stimulus.

import csv
import random

csv_file   = open("Colors.csv")
csv_reader = csv.reader(csv_file)

Color = []
R = []
G = []
B = []

row_number = 0

for row in csv_reader:
    # skip the first line (index 0), which holds the column names
    if row_number > 0:
        Color.append(row[0])
        R.append(row[1])   # note: csv.reader returns all values as strings
        G.append(row[2])
        B.append(row[3])
    row_number += 1

csv_file.close()


this_word  = random.randint(0, 2)
this_color = random.randint(0, 2)

print("Presenting",
      Color[this_word], "in color", Color[this_color], "(",
      R[this_color],
      G[this_color],
      B[this_color], ")")
## Presenting violet in color orange ( 255 150 0 )
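
If the RGB values are to be used as numbers later on (for example, as color codes in a graphics library), keep in mind that csv.reader returns every field as a string. A short sketch of the conversion, using list comprehensions:

# convert the color channels from strings to integers
R = [int(x) for x in R]
G = [int(x) for x in G]
B = [int(x) for x in B]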

7.5 Exercises

7.5.1 Exercise 1

Write a simple image presentation program that takes a table of the following form as input and advances to the next picture on key press.

File
Image01.jpg
Image02.jpg
Image03.jpg

When that works, make the program run automatically: add a column to the table that gives the presentation time of each image in seconds.

7.5.2 Exercise 2

With the Python library openpyxl you can read and write Excel files directly. Work through the tutorial at https://automatetheboringstuff.com/chapter12/. Then modify your presentation program to use an Excel file as input.

7.6 Think further

If you work with data tables a lot, it may be worth taking a look at the pandas library. The primary contribution of pandas is the object class DataFrame, which captures a whole table at once and provides a plethora of useful methods for filtering and manipulating data. There are also commands to read CSV files directly into a data frame, and to write them back out.

import pandas

Colors = pandas.read_csv("Colors.csv")

print(Colors)
##     Color    R    G    B
## 0  violet  180    0  180
## 1   green    0  150  150
## 2  orange  255  150    0
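
Writing works just as directly: the DataFrame method to_csv saves the table back to a file (the argument index = False suppresses the extra column of row numbers that pandas would otherwise add). For example:

# write the data frame back to a CSV file, without the row index
Colors.to_csv("Colors_copy.csv", index = False)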
