Documentation for fxl

Intro to R and Graphing with the fxl Package

Written by Shawn P. Gilroy (Last Updated: 2024-05-25)

Rationale for fxl rather than Base R graphics or ggplot

The purpose of this post is to provide a brief but useful overview of some of the conventions and practices I have found helpful when working with R to build and display results from single-case design research. This content is intended to be a primer for those who may be new to R, and is not intended to be exhaustive or comprehensive. Reviewing this content should help readers understand the practices and patterns reflected across the R code used in posts related to the fxl R package.

Again, this is not a comprehensive introduction to R, but it is intended to provide a basic understanding of some of the conventions and practices that will be reflected across the fxl library. This content is likely to change as features are added to, or changed in, the plotting library.

I made the fxl library because certain conventions that are common and expected in some domains of scientific research (i.e., single-case design research) are either not easily performed or not possible at all in existing graphics libraries. For example, single-case design figures often convey information in ways that are seldom used in other domains (e.g., conveying information across multiple plots with some shared feature, such as a dogleg phase change line). This creates a situation where the figures researchers need to present are not easily produced with tools built around more typical figure designs.

To be fair, you can likely accomplish 90% of what you need in single-case design using either base R graphics or the ggplot2 library; the fxl library was created to get at the remaining 10% of features not easily accomplished in those libraries.

Review of Relevant Terms and Concepts

These posts are made to provide a pragmatic (i.e., limited) introduction to R for readers who may typically do most or all of their work in spreadsheet software. A number of basic terms and concepts (e.g., code management) are presented to assist those new to R in understanding common/recommended conventions and practices as they relate to later use of the fxl library. Seasoned R users are, of course, welcome to skip ahead.

As a minor but relevant note, there is no generally accepted 'right' way to do most operations in R. What is presented here is primarily based on my anecdotal experiences and preferences after working with R for some time.

Variables in R

The term variable in R is used to refer to some named object that can store data (Note: in many languages, variables more typically correspond to the location of information in memory, but R variables cannot be used as pointers, so this is beside the point). In R, variables are created using one of the two assignment operators ‘<-’ or ‘=’. The value on the right-hand side of the operator is assigned to the variable on the left-hand side.

Most users of R will default to ‘<-’ as the assignment operator, but ‘=’ is also used in many cases and is my personal preference. For example, when I see ‘x<-10’ in code that I did not personally write, I cannot always say for sure whether the writer assigned a value of 10 to a variable or intended a logical statement (e.g., testing whether variable x is less than -10). Not critical, just my 0.02 on the practice, and the two operators are often functionally the same in typical contexts (Note: scoped assignment is a special case not relevant at this stage).

See some relevant usage below:


value_1 <- 3
value_1
## [1] 3

value_2 = 3
value_2
## [1] 3
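
To make the ambiguity described above concrete, the short snippet below (an added illustration; the variable name ‘x’ is arbitrary) shows how the same characters are read differently depending on spacing:

# Without spaces, R parses this as an assignment: x is given the value 10
x<-10
x
## [1] 10

# With a space before the minus sign, this is a logical comparison: is x less than -10?
x < -10
## [1] FALSE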

Functions and Methods in R

Functions and methods are used to perform operations on data (e.g., mutate data, display data, etc.). Both functions and methods are pieces of code that can be reused to accomplish specific purposes. Most users of R will use these labels/terms interchangeably, because the distinction only becomes clear in the context of object-oriented programming (R can get a bit murky in this regard, since we flip-flop between functional and object-oriented practices; e.g., R6 objects in R).

There is more that can be said about the distinction between the two terms, but as a casual user of R, you will mostly be creating functions of your own (and maybe using methods in objects designed by others).

An example of the simplest type of function is presented below:


test_function <- function() {
  print("Hello, World!")
}

test_function()
## [1] "Hello, World!"

Data Types in R

There is a fairly straightforward set of simple (primitive) data types that are commonly used in R (Note: complex numbers, factors, and others are not touched on here). These are pretty basic and include numeric, character, logical, and integer types.

Each of these data types, assigned to a variable, is illustrated below:


# Note: the class function outputs the type for the variable

# Assigning a whole-number value to a variable (stored as numeric by default)
new_variable <- 100
class(new_variable)
## [1] "numeric"

# Assigning a decimal (floating-point) value to a variable
new_variable <- 100.01
class(new_variable)
## [1] "numeric"

# Assigning a string value to a variable
new_variable <- "Assigned a string"
class(new_variable)
## [1] "character"

# Assigning a logical value to a variable (TRUE/FALSE)
new_variable <- TRUE
class(new_variable)
## [1] "logical"

Data Structures in R

Whereas data types in R are straightforward, data structures in R are not as straightforward. This area can get pretty deep (i.e., almost all things derive from vectors), but for the sake of this introduction, we will limit the content to the three structures most relevant for graphing with the fxl package. These are vectors, data frames, and lists.

Vectors

A vector is simply a series or array of similarly typed values (e.g., 1, 2, 3, 4). The key feature here is that the values are all of the same type (i.e., they shouldn’t be a mix of numeric and character values; e.g., not 1, 2, and “3”).

A short demonstration of vector variables is provided below:


new_variable <- c(1, 2, 3)
new_variable
## [1] 1 2 3

new_variable <- c("1", "2", "3")
new_variable
## [1] "1" "2" "3"

new_variable <- c(TRUE, FALSE, TRUE)
new_variable
## [1]  TRUE FALSE  TRUE
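
Two quick additions to the examples above: individual elements of a vector can be accessed by their position using square brackets, and mixing numeric and character values in one vector causes R to silently coerce everything to a common type (which is why mixed vectors should be avoided):

new_variable <- c(1, 2, 3)

# Accessing the first element of the vector by its position
new_variable[1]
## [1] 1

# Mixing numeric and character values coerces everything to character
mixed_variable <- c(1, 2, "3")
mixed_variable
## [1] "1" "2" "3"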

Data Frames

A data frame is, in lay terms, a ‘table’ of data with rows and columns (Note: data tables are related, but also distinct). The data frame is like an array of vectors (i.e., columns), where each column has a specific type (e.g., numeric, character, logical) and each row abides by those types.

A short demonstration of the data frame is provided below:


# Note: the str function outputs the structure of a variable
new_variable <- data.frame(
  Numbers = c(1, 2, 3),
  Strings = c("1", "2", "3")
)

str(new_variable)
## 'data.frame':    3 obs. of  2 variables:
##  $ Numbers: num  1 2 3
##  $ Strings: chr  "1" "2" "3"

new_variable
##   Numbers Strings
## 1       1       1
## 2       2       2
## 3       3       3
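
As an added note, individual columns of a data frame can be pulled out as vectors, and subsets of rows can be selected with bracket notation (using the data frame created above):

# Accessing a single column (returned as a vector) using the '$' operator
new_variable$Numbers
## [1] 1 2 3

# Accessing a subset of rows using bracket notation: [rows, columns]
new_variable[1:2, ]
##   Numbers Strings
## 1       1       1
## 2       2       2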

Lists

Lists are the most flexible structures that we’ll cover here. In terms of data structure types, they fall somewhere in between a vector and a data frame. Specifically, lists have keys (i.e., names) that are associated with values, but those values do not need to be of the same type or length, as is required for the columns of a data frame.

Many of the more complex situations for the fxl package will involve lists (e.g., keying in styles specific to a participant/condition), so it is important to understand how to access data in them.

A short demonstration of the list data structure is provided below:


new_variable <- list(
  'key1' = c(1, 2, 3),
  'key2' = c("1", "2", "3")
)

str(new_variable)
## List of 2
##  $ key1: num [1:3] 1 2 3
##  $ key2: chr [1:3] "1" "2" "3"

new_variable
## $key1
## [1] 1 2 3
## 
## $key2
## [1] "1" "2" "3"

Some Common/Recommended Programming Practices

This section presents a few conventions (i.e., practices) that I suggest using if you are starting to learn R. These conventions are not exhaustive, nor are they meant to be prescriptive, but they are hopefully useful in (1) building a repertoire of programming habits that are easy for you (and others) to understand and (2) understanding how the various code samples and examples provided here are written and structured.

Variable Naming (i.e., Pick a Strategy and Stick with It)

A consistent convention for naming variables, functions, and files is important because it makes code easier to read and understand. This becomes especially critical as projects grow in size and complexity. For example, in large projects that involve multiple scripts and similarly named variables/functions, it can become increasingly difficult to discern which objects are variables, which are functions, and so on.

For variables specifically, some examples of naming conventions include camelCase (e.g., ‘newVariable’), snake_case (e.g., ‘new_variable’), and dot.case (e.g., ‘new.variable’). Purely as a matter of personal preference, I prefer snake_case because I find it easier to read (i.e., as if it were an expressive name/sentence).

An example of this is provided below:


character_vector = c("a", "b", "c")

logical_vector = c(TRUE, FALSE, TRUE)

number_vector = c(1, 2, 3)

example_data_frame = data.frame(
  Characters = character_vector,
  Logicals = logical_vector,
  Numbers = number_vector
)

example_data_frame
##   Characters Logicals Numbers
## 1          a     TRUE       1
## 2          b    FALSE       2
## 3          c     TRUE       3

As an added detail, it is generally recommended to use expressive and informative names (e.g., ‘raw_data_all_participants’) over less informative and more terse ones (e.g., ‘dat1’), for the same reasons listed above.

Proprietary Formats and Compatibility Issues

Many new users of R are likely accustomed to working with data in proprietary formats (e.g., Excel, SPSS, etc.). The term proprietary simply means that a third party maintains its own format (and may or may not support the use of that format outside of direct interactions with its product). This is in contrast to open formats, which have standards designed to be compatible across programs.

Although it is true that R can read/write data in proprietary formats (e.g., Excel, SPSS), it is often most straightforward to work with formats that are open by default (e.g., .csv, .tsv, .rds, etc.). For example, both xls and xlsx files feature metadata that may introduce issues when reading cell data into R (e.g., added characters or values that impact what is eventually read by R).

The .csv file format is easily read and written in R and is my personal choice for working between R and spreadsheet software. The .csv file format is a good choice for archiving and working with data because it is essentially a simple text file; it is read by many different programs with relatively few (if any) issues across applications. It is also unlikely to ever change in any substantial way, which cannot be said for proprietary formats (e.g., the .xls vs. the newer .xlsx file format).

A simple example of reading a CSV file, and plotting the relevant data, is provided below:


csv_data <- read.csv("data/0/data.csv")

csv_data
##   Session Phase Responses
## 1       1     A         2
## 2       2     A         4
## 3       3     A         3
## 4       4     B        10
## 5       5     B        12
## 6       6     B        14
## 7       7     B        16
## 8       8     B        18

### Visualizing data from variable

plot(csv_data$Session, csv_data$Responses,
     xlab = "Session",
     ylab = "Responses",
     main = "Example Plot from CSV File")

Note: Scripts and data for each post are made available at the top of the page for each guide/tutorial.

Let Raw Data Stay Raw Data

It is a generally accepted good practice to never modify the raw source data. This is often counter to common practice for spreadsheet users, who may modify aspects of the original data to assist with visualizing it (i.e., the raw data functions as both the model and the view for figures). This may include practices such as adding, deleting, or modifying (e.g., rounding) certain values, and this is problematic for many reasons (e.g., unrecoverable data, lack of transparency in final calculations).

The functional nature of programming in R supports practices that prepare data for analysis/visualization and never mutate the original source data. That is, data is read into a variable, the variable containing the data is mutated/prepared, and that variable is then used in some analysis or other function (e.g., to plot). This is a highly desirable pattern to maintain because (1) it supports transparency in how an analyst gets from the raw/source data to the final product and (2) no data is ever lost or permanently modified throughout the process. For this reason, I recommend and maintain this as standard practice throughout the following examples.

This practice is demonstrated in the example below using the pipe operator (i.e., ‘%>%’) and the tidyverse package:


# Note: we will use the tidyverse library for a few convenience tools/functions
library(tidyverse)

read.csv("data/0/data.csv")
##   Session Phase Responses
## 1       1     A         2
## 2       2     A         4
## 3       3     A         3
## 4       4     B        10
## 5       5     B        12
## 6       6     B        14
## 7       7     B        16
## 8       8     B        18

# Note: The '%>%' operator 'pipes' the results from one function to another, creating a chain of operations 

# Note: The 'rename' function receives the piped-in data and renames specific columns (e.g., to a more informative name)

# Note: The 'mutate' function receives the piped-in data and is used to change the imported data in some meaningful way--without modifying the original data (e.g., the CSV file)

# The raw data is read as the initial part of the chain, but will be mutated for easier analysis
csv_data <- read.csv("data/0/data.csv") %>%
  # We rename columns here, rather than in CSV file
  rename(PhaseName = Phase,
         ResponseCount = Responses) %>%
  # We remap Phase codes to human-readable alternatives (e.g., from "A" to "Baseline")
  mutate(PhaseName = recode(PhaseName,
                            "A" = "Baseline",
                            "B" = "Intervention"))

csv_data
##   Session    PhaseName ResponseCount
## 1       1     Baseline             2
## 2       2     Baseline             4
## 3       3     Baseline             3
## 4       4 Intervention            10
## 5       5 Intervention            12
## 6       6 Intervention            14
## 7       7 Intervention            16
## 8       8 Intervention            18

# Note: See that the original data is unchanged
read.csv("data/0/data.csv")
##   Session Phase Responses
## 1       1     A         2
## 2       2     A         4
## 3       3     A         3
## 4       4     B        10
## 5       5     B        12
## 6       6     B        14
## 7       7     B        16
## 8       8     B        18
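
If a prepared/cleaned copy of the data needs to be shared or archived, it can be written out to a new file so the raw source file is never overwritten. A minimal sketch is shown below (the output file name here is hypothetical):

# Write the prepared data to a *new* file (hypothetical name), leaving the raw CSV untouched
write.csv(csv_data, "data/0/data_prepared.csv", row.names = FALSE)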

Organization of Data Frames: Wide vs. Long Formats

Two ways of organizing information in data frames are distinguished here: wide and long data.

Working with Wide Data

Most users coming from spreadsheets are likely more accustomed to working with data in wide format. In this approach, each column represents a single variable, and the numerous columns result in a wider table.

An example where many columns are included in the data.frame is provided below:


# Wide format: Columns as distinct variables

# Note: the rep function creates a vector of something n times (e.g., rep(1, 10) creates a vector of 10 values of 1)

# Note: the code 1:10 is shorthand for creating a vector with values ranging from 1 to 10 (e.g., 1, 2, 3, ... 10)

data_example_wide = data.frame(
  Participant = rep(1, 10),
  Session = 1:10,
  Responses = 1:10,
  ReinforcersDelivered = 1:10,
  ProblematicBehavior = 1:10
)

data_example_wide
##    Participant Session Responses ReinforcersDelivered ProblematicBehavior
## 1            1       1         1                    1                   1
## 2            1       2         2                    2                   2
## 3            1       3         3                    3                   3
## 4            1       4         4                    4                   4
## 5            1       5         5                    5                   5
## 6            1       6         6                    6                   6
## 7            1       7         7                    7                   7
## 8            1       8         8                    8                   8
## 9            1       9         9                    9                   9
## 10           1      10        10                   10                  10

Working with Long Data

Whereas wide structures have many columns with fewer rows, long structures have fewer columns but more rows. The number of columns is reduced by converting the information into key-value pairs (i.e., the ‘key’ is the original column name, the ‘value’ its original contents). This has the added benefit of keeping related information organized together (e.g., related data across different groups/conditions) and is often the preferred arrangement for data analysis and visualization tools in R (e.g., ggplot2).

A short demonstration of converting from wide to long data is provided below:


# Note: We will use the tidyverse library for a few convenience tools/functions
library(tidyverse)

# Note: We will use the same data frame from before, since it is already in the wide format

# Note: The gather function in the tidyverse package 'gathers' columns to re-organize them into the key/value pair specified

# Note: In the gather function, we can use the minus character to exclude a column from the values to be 'gathered' together (i.e., leave it as its own column)

# Note: The long-format result is stored under a new name, keeping the wide data frame intact
data_example_long <- data_example_wide %>%
  # Note: Key = VariableType, Value = Value
  gather(VariableType, Value,
         # but we tell it not to do it for Participant/Session columns
         -Participant, -Session)

data_example_long
##    Participant Session         VariableType Value
## 1            1       1            Responses     1
## 2            1       2            Responses     2
## 3            1       3            Responses     3
## 4            1       4            Responses     4
## 5            1       5            Responses     5
## 6            1       6            Responses     6
## 7            1       7            Responses     7
## 8            1       8            Responses     8
## 9            1       9            Responses     9
## 10           1      10            Responses    10
## 11           1       1 ReinforcersDelivered     1
## 12           1       2 ReinforcersDelivered     2
## 13           1       3 ReinforcersDelivered     3
## 14           1       4 ReinforcersDelivered     4
## 15           1       5 ReinforcersDelivered     5
## 16           1       6 ReinforcersDelivered     6
## 17           1       7 ReinforcersDelivered     7
## 18           1       8 ReinforcersDelivered     8
## 19           1       9 ReinforcersDelivered     9
## 20           1      10 ReinforcersDelivered    10
## 21           1       1  ProblematicBehavior     1
## 22           1       2  ProblematicBehavior     2
## 23           1       3  ProblematicBehavior     3
## 24           1       4  ProblematicBehavior     4
## 25           1       5  ProblematicBehavior     5
## 26           1       6  ProblematicBehavior     6
## 27           1       7  ProblematicBehavior     7
## 28           1       8  ProblematicBehavior     8
## 29           1       9  ProblematicBehavior     9
## 30           1      10  ProblematicBehavior    10
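
As a side note, newer releases of the tidyverse favor the pivot_longer function over gather for this kind of reshaping. The sketch below (an alternative not used elsewhere in these posts; the variable name data_example_long_alt is arbitrary) produces an equivalent long-format result from the wide data frame:

# pivot_longer is the newer tidyverse equivalent of gather (tidyverse is already loaded above)
# names_to/values_to play the same roles as the key/value arguments of gather
data_example_long_alt <- data_example_wide %>%
  pivot_longer(cols = c(-Participant, -Session),
               names_to = "VariableType",
               values_to = "Value")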

Some Other Final Notes

The content provided here was designed to be brief, but also relevant for users new to R who are currently learning about data management and R programming practices. That said, this is not even close to a representative introduction to R, nor is it an exhaustive listing of recommended practices or features included in R. Rather, the content provided here was designed to provide early support for shaping the skills necessary to follow along with the posts that follow in this series.