semanticprimeR Package Functions
Erin Buchanan
2024-11-10
package_functions.Rmd
Libraries
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(semanticprimeR)
#>
#> Attaching package: 'semanticprimeR'
#> The following object is masked from 'package:dplyr':
#>
#> top_n
library(udpipe)
library(stopwords)
library(tibble)
library(rio)
Power Functions
During the process of this project, we proposed a new way to calculate “power” for adaptive sampling of items using accuracy in parameter estimation and bootstrapping/simulation methods. You can review the preprint here: https://osf.io/preprints/osf/e3afx
This package includes vignettes for many different types of data. Use vignette(package = "semanticprimeR") to review what is available and, for example, vignette(topic = "montefinese_vignette", package = "semanticprimeR") to open a specific vignette.
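For example, you can run these calls (taken from the sentence above) directly in your R console:
# list all vignettes shipped with the package
vignette(package = "semanticprimeR")
# open one specific vignette
vignette(topic = "montefinese_vignette", package = "semanticprimeR")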
The information presented here is a shortened version of the paper to show off the functionality of the package.
Simulate Population
Let’s say we want to run a priming study but do not have previous data. We can use simulate_population() to create sample data for estimating sample size for adaptive testing and stopping rules.
df <- simulate_population(mu = 25, # mean priming in ms
mu_sigma = 5, # standard deviation of the item means
sigma = 10, # population standard deviation for items
sigma_sigma = 3, # standard deviation of the standard deviation of items
number_items = 75, # number of priming items
number_scores = 100, # a population of values to simulate
smallest_sigma = 1, # smallest possible item standard deviation
min_score = -25, # min ms priming
max_score = 100, # max ms priming
digits = 3)
head(df)
#> item score
#> 1 1 22.966
#> 2 2 27.720
#> 3 3 1.809
#> 4 4 19.104
#> 5 5 38.348
#> 6 6 41.431
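As a quick sanity check, you can summarize the simulated data by item with dplyr (a minimal sketch using the item and score columns above); the per-item means should scatter around mu = 25 and the per-item standard deviations around sigma = 10:
# per-item means and SDs should roughly track mu and sigma from the simulation
df %>%
  group_by(item) %>%
  summarize(mean_score = mean(score),
            sd_score = sd(score)) %>%
  head()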
Calculate Cutoff
From this dataframe, we can calculate the standard error of each item, which is used to determine whether the item has reached the stopping rule (i.e., once the standard error drops to a defined value, the item is considered accurately measured, and we can stop sampling it). To get that defined value, we use the 40th percentile of the standard errors of the items.
cutoff <- calculate_cutoff(population = df, # pilot data or simulated data
grouping_items = "item", # name of the item indicator column
score = "score", # name of the dependent variable column
minimum = min(df$score), # minimum possible/found score
maximum = max(df$score)) # maximum possible/found score
cutoff$se_items # all standard errors of items
#> [1] 1.1914848 1.2944957 0.9411572 1.0181927 0.9976637 1.6398903 1.4534350
#> [8] 1.1954318 1.1967030 0.7680782 1.4009475 0.2155584 0.7733205 1.1256345
#> [15] 0.8887702 1.1431167 1.1658235 1.2875266 0.9747519 1.0861600 1.4798068
#> [22] 0.9016232 1.2245109 1.4814803 0.8933068 0.9018008 0.7951396 0.7464761
#> [29] 0.7540108 0.4469756 0.7262989 0.7128257 0.8571588 1.1398729 1.3525854
#> [36] 0.9395962 0.5548498 1.0024380 0.8636102 1.2956495 1.5543834 1.2468959
#> [43] 0.4499860 0.9400761 1.3068053 1.6198171 1.1612173 1.2668006 1.0598592
#> [50] 1.0157997 1.0114304 1.1258111 0.9879748 0.8066764 1.0789258 0.5931047
#> [57] 1.3059619 0.9923325 0.4971521 0.7544609 1.1140443 0.7171717 0.9382714
#> [64] 1.1860594 1.0760170 0.9469807 0.7591402 1.4144226 1.1881549 1.2362692
#> [71] 1.3670759 0.6974420 1.1948369 0.4762515 1.5065986
cutoff$sd_items # standard deviation of the standard errors
#> [1] 0.2981085
cutoff$cutoff # 40th percentile score
#> 40%
#> 0.9636434
cutoff$prop_var # proportion of possible variance
#> [1] 0.006049832
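As a small sketch of what the cutoff means (using only the values above), the proportion of items in the pilot/simulated data already at or below the cutoff standard error should be roughly 40% by construction:
# proportion of items already at or below the cutoff standard error
mean(cutoff$se_items <= cutoff$cutoff)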
Simulate Samples
Next, we use simulation to generate bootstrapped samples from a starting size (we suggest 20) up to the maximum size we are willing to collect. These samples are then used to count how many items achieve our stopping rule at each sample size, which tells us how many participants are needed to accurately measure most items.
samples <- simulate_samples(start = 20, # starting sample size
stop = 400, # stopping sample size
increase = 5, # increase bootstrapped samples by this amount
population = df, # population or pilot data
replace = TRUE, # bootstrap with replacement?
nsim = 500, # number of simulations to run
grouping_items = "item") # item column label
save(samples, file = "data/simulatePriming.RData")
# since that's a slow function, we wrote out the data and read it back in
load(paste0(filelocation,"/vignettes/data/simulatePriming.Rdata"))
head(samples[[1]])
#> # A tibble: 6 × 2
#> # Groups: item [1]
#> item score
#> <int> <dbl>
#> 1 1 15.8
#> 2 1 19.6
#> 3 1 17.7
#> 4 1 25.6
#> 5 1 31.0
#> 6 1 5.53
Calculate Proportions of Items Measured
From those samples and our estimated cutoff score, we can calculate the proportion of items below our suggested stopping rule at each sample size.
proportion_summary <- calculate_proportion(samples = samples, # samples list
cutoff = cutoff$cutoff, # cutoff score
grouping_items = "item", # item column name
score = "score") # dependent variable column name
head(proportion_summary)
#> # A tibble: 6 × 2
#> sample_size percent_below
#> <dbl> <dbl>
#> 1 20 0.04
#> 2 25 0.0667
#> 3 30 0.08
#> 4 35 0.0933
#> 5 40 0.12
#> 6 45 0.147
Final Sample Size
As noted in our paper, this simulation procedure needs a correction to approximate traditional interpretations of power. You can use “power” levels like 80 percent, 90 percent, and so on similarly to traditional power: higher numbers are more stringent about ensuring all items are “well measured”, and correspondingly they yield larger estimated sample sizes. We suggest using the 80% value as a minimum sample size, using the cutoff stopping rule while running the study if you use adaptive sampling (or simply to review the data), and using a higher value such as 90% as a maximum sample size to collect.
corrected_summary <- calculate_correction(
proportion_summary = proportion_summary, # prop from above
pilot_sample_size = 100, # number of participants in the pilot data
proportion_variability = cutoff$prop_var, # proportion variance from cutoff scores
power_levels = c(80, 85, 90, 95)) # what levels of power to calculate
corrected_summary
#> # A tibble: 4 × 3
#> percent_below sample_size corrected_sample_size
#> <dbl> <dbl> <dbl>
#> 1 80 170 92.4
#> 2 88 210 115.
#> 3 90.7 235 129.
#> 4 96 275 150.
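For example, if you follow the 80%/90% guidance above, you could pull the suggested minimum and maximum sample sizes straight out of this table (a small sketch using the columns shown in the output):
# smallest corrected sample sizes reaching at least 80% and at least 90% of items measured
min_n <- min(corrected_summary$corrected_sample_size[corrected_summary$percent_below >= 80])
max_n <- min(corrected_summary$corrected_sample_size[corrected_summary$percent_below >= 90])
ceiling(c(minimum = min_n, maximum = max_n))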
Create Stimuli Functions
Once you’ve determined your sample sizes, you will want to create stimuli. These functions show how we created stimuli from the OpenSubtitles project using the subs2vec project.
- OpenSubtitles: https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles
- subs2vec: https://github.com/jvparidon/subs2vec
Get Models
You can review the available models from subs2vec, merged with the available data from OpenSubtitles, by loading the subsData dataset with data("subsData"). Use ?subsData to view information about the columns and the dataset.
data("subsData")
head(tibble(subsData))
#> # A tibble: 6 × 10
#> language_code subs_vec subs_count wiki_vec wiki_count files tokens sentences
#> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 hi https://h… https://h… https:/… https://h… 102 1.0M 0.1M
#> 2 gl https://h… https://h… https:/… https://h… 449 2.3M 0.3M
#> 3 ca https://h… https://h… https:/… https://h… 832 4.6M 0.6M
#> 4 bn https://h… https://h… https:/… https://h… 542 3.7M 0.7M
#> 5 br https://h… https://h… https:/… https://h… 32 0.2M 23.1k
#> 6 bs https://h… https://h… https:/… https://h… 37309 215.4M 34.1M
#> # ℹ 2 more variables: language <chr>, udpipe_model <chr>
We only worked with languages in which we could use a part of speech
tagger. We recommend udpipe
as a great package that has
many taggers. The language model necessary is shown in the
udpipe_model
column.
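For example, the model used later for Afrikaans can be looked up directly from subsData:
# find the udpipe model that matches the Afrikaans data
subsData$udpipe_model[subsData$language_code == "af"]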
In this example, let’s use Afrikaans because it is a smaller dataset. Be warned that these datasets can be very large to download and use on your computer. Use the import_subs() function to download and import the files you are interested in. For language, use the two-letter language code from the language_code column of subsData. You then need to pick what to download:
- subs_vec: The subtitle embeddings from a fastText model.
- subs_count: The frequency of tokens found in the subs_vec model.
- wiki_vec: The Wikipedia embeddings from a fastText model.
- wiki_count: The frequency of tokens found in the wiki_vec model.
You may see some warnings based on file formatting.
af_freq <- import_subs(
language = "af",
what = "subs_count"
)
head(af_freq)
#> unigram unigram_freq
#> 1 die 12673
#> 2 nie 11788
#> 3 ek 11605
#> 4 is 9147
#> 5 het 8109
#> 6 jy 7425
We then used udpipe
to filter our possible options. You
may have other criteria, but here’s an example of how we tagged concepts
(for their main part of speech, given no sentence context here). When
you use this function, it will download the model necessary for
tagging.
# tag with udpipe
af_tagged <- udpipe(af_freq$unigram,
object = subsData$udpipe_model[subsData$language_code == "af"],
parser = "none")
head(tibble(af_tagged))
#> # A tibble: 6 × 17
#> doc_id paragraph_id sentence_id sentence start end term_id token_id token
#> <chr> <int> <int> <chr> <int> <int> <int> <chr> <chr>
#> 1 doc1 1 1 die 1 3 1 1 die
#> 2 doc2 1 1 nie 1 3 1 1 nie
#> 3 doc3 1 1 ek 1 2 1 1 ek
#> 4 doc4 1 1 is 1 2 1 1 is
#> 5 doc5 1 1 het 1 3 1 1 het
#> 6 doc6 1 1 jy 1 2 1 1 jy
#> # ℹ 8 more variables: lemma <chr>, upos <chr>, xpos <chr>, feats <chr>,
#> # head_token_id <chr>, dep_rel <chr>, deps <chr>, misc <chr>
We then:
- Lower cased
- Removed anything less than three characters when appropriate
- Picked words only in nouns, verbs, adjectives, and adverbs
- Took out stopwords
- Took out words with numbers
We only used the top 10,000 words for the next section, but this selection will depend on your use case as well.
# word_choice
word_choice <- c("NOUN", "VERB", "ADJ", "ADV")
# lower case
af_tagged$lemma <- tolower(af_tagged$lemma)
# three characters
af_tagged <- subset(af_tagged, nchar(af_tagged$lemma) >= 3)
# keep only nouns, verbs, adjectives, and adverbs
af_tagged <- subset(af_tagged, upos %in% word_choice)
# removed stop words just in case they were incorrectly tagged
af_tagged <- subset(af_tagged, !(lemma %in% stopwords(language = "af", source = "stopwords-iso")))
# removed things with numbers
af_tagged <- subset(af_tagged, !(grepl("[0-9]", af_tagged$sentence)))
# merge frequency back into tagged list
# merge by sentence so one to one match
colnames(af_freq) <- c("sentence", "freq")
af_final <- merge(af_tagged, af_freq, by = "sentence", all.x = T)
head(tibble(af_final))
#> # A tibble: 6 × 18
#> sentence doc_id paragraph_id sentence_id start end term_id token_id token
#> <chr> <chr> <int> <int> <int> <int> <int> <chr> <chr>
#> 1 aanbeveling doc17… 1 1 1 11 1 1 aanb…
#> 2 aanbeweeg doc81… 1 1 1 9 1 1 aanb…
#> 3 aanbied doc13… 1 1 1 7 1 1 aanb…
#> 4 aanbreek doc82… 1 1 1 8 1 1 aanb…
#> 5 aanbring doc14… 1 1 1 8 1 1 aanb…
#> 6 aandadig doc57… 1 1 1 8 1 1 aand…
#> # ℹ 9 more variables: lemma <chr>, upos <chr>, xpos <chr>, feats <chr>,
#> # head_token_id <chr>, dep_rel <chr>, deps <chr>, misc <chr>, freq <int>
# eliminate duplicates by lemma
af_final <- af_final[order(af_final$freq, decreasing = TRUE) , ]
af_final <- af_final[!duplicated(af_final$lemma), ]
# grab top 10K
af_top <- af_final[1:10000 , ]
Next, we used import_subs() again to import the embeddings for the subtitles.
af_dims <- import_subs(
language = "af",
what = "subs_vec"
)
head(tibble(af_dims))
#> # A tibble: 6 × 301
#> V1 V2 V3 V4 V5 V6 V7 V8 V9
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 </s> -0.236 0.0544 0.0202 -0.0531 0.146 0.000971 0.0146 0.0423
#> 2 nie -0.000683 0.0787 0.0382 0.0379 0.295 0.225 -0.171 0.132
#> 3 die -0.0233 -0.000478 0.191 -0.103 0.176 -0.0884 -0.0828 -0.00594
#> 4 is -0.410 0.144 0.133 -0.104 0.295 -0.0391 -0.00618 0.146
#> 5 het 0.344 -0.0163 -0.198 0.259 -0.579 -0.154 0.131 -0.0615
#> 6 n 0.0615 -0.139 -0.0674 -0.0658 -0.105 -0.212 -0.115 -0.0611
#> # ℹ 292 more variables: V10 <dbl>, V11 <dbl>, V12 <dbl>, V13 <dbl>, V14 <dbl>,
#> # V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>, V20 <dbl>,
#> # V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>, V26 <dbl>,
#> # V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>, V32 <dbl>,
#> # V33 <dbl>, V34 <dbl>, V35 <dbl>, V36 <dbl>, V37 <dbl>, V38 <dbl>,
#> # V39 <dbl>, V40 <dbl>, V41 <dbl>, V42 <dbl>, V43 <dbl>, V44 <dbl>,
#> # V45 <dbl>, V46 <dbl>, V47 <dbl>, V48 <dbl>, V49 <dbl>, V50 <dbl>, …
In our case, we want to use the tokens as row names, so we move the first column into the row names and then delete it, leaving a tokens-by-300-dimensions matrix.
# lower case
af_dims$V1 <- tolower(af_dims[ , 1]) # first column is always the tokens
# eliminate duplicates
af_dims <- subset(af_dims, !duplicated(af_dims[ , 1]))
# make row names
rownames(af_dims) <- af_dims[ , 1]
af_dims <- af_dims[ , -1]
head(tibble(af_dims))
#> # A tibble: 6 × 300
#> V2 V3 V4 V5 V6 V7 V8 V9 V10
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.236 0.0544 0.0202 -0.0531 0.146 9.71e-4 0.0146 0.0423 0.00999
#> 2 -0.000683 0.0787 0.0382 0.0379 0.295 2.25e-1 -0.171 0.132 0.216
#> 3 -0.0233 -0.000478 0.191 -0.103 0.176 -8.84e-2 -0.0828 -0.00594 0.0445
#> 4 -0.410 0.144 0.133 -0.104 0.295 -3.91e-2 -0.00618 0.146 -0.0680
#> 5 0.344 -0.0163 -0.198 0.259 -0.579 -1.54e-1 0.131 -0.0615 -0.0159
#> 6 0.0615 -0.139 -0.0674 -0.0658 -0.105 -2.12e-1 -0.115 -0.0611 -0.0669
#> # ℹ 291 more variables: V11 <dbl>, V12 <dbl>, V13 <dbl>, V14 <dbl>, V15 <dbl>,
#> # V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>, V20 <dbl>, V21 <dbl>,
#> # V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>, V26 <dbl>, V27 <dbl>,
#> # V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>, V32 <dbl>, V33 <dbl>,
#> # V34 <dbl>, V35 <dbl>, V36 <dbl>, V37 <dbl>, V38 <dbl>, V39 <dbl>,
#> # V40 <dbl>, V41 <dbl>, V42 <dbl>, V43 <dbl>, V44 <dbl>, V45 <dbl>,
#> # V46 <dbl>, V47 <dbl>, V48 <dbl>, V49 <dbl>, V50 <dbl>, V51 <dbl>, …
Calculate Similarity
We can then use the calculate_similarity() function to get the similarity values for all words based on the dimension matrix. The underlying metric is the cosine between the two words’ dimension vectors.
af_cosine <-
calculate_similarity(
words = af_final$sentence, # the tokens you want to filter
dimensions = af_dims, # the matrix of items
by = 1 # 1 for rows, 2 for columns
)
The top_n() function can be used to return the top n cosine values for each token in the similarity matrix. Please note: it will always return the token-token combination with a cosine of 1 (the token related to itself), so you should ask for n + 1 cosines and then filter out the token-token combinations. Big thanks to Brenton Wiernik, who figured out how to make this computationally efficient.
# get the top 5 related words
af_top_sim <- semanticprimeR::top_n(af_cosine, 6)
af_top_sim <- subset(af_top_sim, cue!=target)
head(af_top_sim)
#> cue target cosine
#> 2 nou als 0.8654161
#> 3 nou nitto 0.8476698
#> 4 nou juis 0.8434761
#> 5 nou nicolas 0.8419336
#> 6 nou jakkals 0.8278382
#> 8 hier bier 0.8086171
Create Pseudowords
We originally set up a function to create nonwords by replacing characters based on the bigrams in the token. We recommend the Wuggy-based function described below, but you can also do this simple replacement.
af_top_sim$fake_cue <- fake_simple(af_top_sim$cue)
# you'd want to also do this based on target depending on your study
head(af_top_sim)
#> cue target cosine fake_cue
#> 2 nou als 0.8654161 non
#> 3 nou nitto 0.8476698 not
#> 4 nou juis 0.8434761 nof
#> 5 nou nicolas 0.8419336 nau
#> 6 nou jakkals 0.8278382 noo
#> 8 hier bier 0.8086171 hiër
You can also use the Wuggy algorithm via fake_Wuggy(). This function is not fast, and it gets slower as the word list it draws from gets larger. It returns a dataframe of pseudoword options with the following columns:
- word_id: Number id for each unique word.
- first: First syllable in pairs of syllables.
- original_pair: Pair of syllables together.
- second: Second syllable in the pairs of syllables.
- syll: Number of syllables in the token.
- original_freq: Frequency of the syllable pair.
- replacement_pair: Replacement option wherein one of the syllables has been changed.
- replacement_syll: The replacement syllable.
- replacement_freq: The frequency of the replacement syllable pair.
- freq_diff: The difference in frequency of the transition pair.
- char_diff: Number of characters difference in the original pair and the replacement pair.
- letter_diff: Number of letters difference in the original pair and the replacement pair. If the replacement includes the same letters, the difference would be zero; these values are excluded from being options.
- original_word: The original token.
- replacement_word: The final replacement token.
af_wuggy <- fake_Wuggy(
  wordlist = af_final$sentence, # full valid options in language
  language_hyp = paste0(filelocation,"/inst/latex/hyph-af.tex"), # path to hyphenation.tex
  lang = "af", # two letter language code
  replacewords = unique(af_top_sim$cue[1:20]) # words you want to create pseudowords for
)
head(tibble(af_wuggy))
#> # A tibble: 4 × 14
#> word_id first original_pair second syll original_freq replacement_pair
#> <int> <chr> <chr> <chr> <dbl> <dbl> <chr>
#> 1 2 first_blank first_blank-hi hi 1 5 first_blank-tu
#> 2 3 ne ne-t t 1 1 oi-t
#> 3 1 first_blank first_blank-no no 1 22 first_blank-va
#> 4 4 first_blank first_blank-we we 1 23 first_blank-bo
#> # ℹ 7 more variables: replacement_syll <chr>, replacement_freq <int>,
#> # freq_diff <dbl>, char_diff <int>, letter_diff <dbl>, original_word <chr>,
#> # replacement_word <chr>
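If you want the pseudowords next to the cue-target pairs, one option (a sketch assuming you keep one replacement_word per original_word) is to merge the Wuggy output back into the similarity table:
# attach the Wuggy replacement words back onto the cue-target pairs
af_top_sim_wuggy <- merge(af_top_sim,
                          af_wuggy[ , c("original_word", "replacement_word")],
                          by.x = "cue", by.y = "original_word",
                          all.x = TRUE)
head(af_top_sim_wuggy)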
Get Priming Data
You can load any one of the many files included in the SPAML release; use data("primeData") to see what we have available. The datasets are broken into a few types:
- procedure_stimuli: The stimuli from the study. Each dataset includes the ~5000 trials used in the study, listed as cue-target pairs with their cue_type/target_type (word/nonword) and trial type (related, unrelated, nonword). The cosine values from the subs2vec models are included when available for word pairs. If the value is blank or NA, you can assume one of the words did not exist in the subs2vec model or could not be matched; the subs2vec models were often filtered to only the top X words, and some selected stimuli may have been infrequent.
- matched_stimuli: The matched stimuli datasets fall into two types: “matched”, which matches the original language to English, and “unique”, which includes the word pair combination found in the datasets that makes each trial unique. Some targets were repeated due to translation, so the unique datasets allow you to unambiguously match things together. The matched_stimuli.csv file has these all matched together if you want all languages at once. The missing data are the Arabic pairs we were asked to remove due to their taboo nature in that culture.
Each of the following files has a codebook found at: https://github.com/SemanticPriming/SPAML/tree/master/05_Data/codebooks
- participant_data: Information on the participants who completed each language.
- full_data: The “raw” data with only identifiers removed.
- trial_data: The trial-level data showing only the trial blocks (i.e., excluding the other lines that indicate the timing and inter-trial interval).
- item_data: The average results for each token/item, ignoring the condition presented.
- priming_data: The priming data in either _trials format (meaning these have been matched and labeled for trial type) or _summary format (meaning averages/summaries of the target trials matched by related and unrelated to create a priming score).
Load Available Data
data("primeData")
head(primeData)
#> type filename language
#> 1 item_data ar_answered_item_data.csv ar
#> 2 priming_data ar_answered_prime_summary_no2.5.csv ar
#> 3 priming_data ar_answered_prime_summary_no3.0.csv ar
#> 4 priming_data ar_answered_prime_summary.csv ar
#> 5 priming_data ar_answered_prime_trials.csv ar
#> 6 full_data ar_full_data.csv.gz ar
#> location
#> 1 https://github.com/SemanticPriming/SPAML/releases/download/v1.0.1/ar_answered_item_data.csv
#> 2 https://github.com/SemanticPriming/SPAML/releases/download/v1.0.1/ar_answered_prime_summary_no2.5.csv
#> 3 https://github.com/SemanticPriming/SPAML/releases/download/v1.0.1/ar_answered_prime_summary_no3.0.csv
#> 4 https://github.com/SemanticPriming/SPAML/releases/download/v1.0.1/ar_answered_prime_summary.csv
#> 5 https://github.com/SemanticPriming/SPAML/releases/download/v1.0.1/ar_answered_prime_trials.csv
#> 6 https://github.com/SemanticPriming/SPAML/releases/download/v1.0.1/ar_full_data.csv.gz
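For example, you can filter primeData by the columns shown above to see what exists for one language or one data type (a small sketch assuming “es” appears in the language column, as the Spanish files used below suggest):
# all files released for Spanish
subset(primeData, language == "es")
# or only the priming data files across languages
head(subset(primeData, type == "priming_data"))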
Once you decide which file you would like to download and import, you can use import_prime() to import that file. Note that some of the full_data datasets are quite large and may take a while to download and/or import directly. You can also use the direct links in the primeData file to download them. Some files are heavily compressed in .gz format; I recommend 7-Zip if you aren’t familiar with unzipping these from the command line: https://www.wikihow.com/Extract-a-Gz-File. You can also import them directly into R with the rio package (which is what this function does, but it downloads the file each time, so I’d recommend downloading once and then putting the import into your code directly with rio::import("filepath")).
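As a sketch of that download-once workflow (using the filename and location columns of primeData, and assuming es_words.csv appears there, as the import below suggests; the destination path is up to you):
# download the file once to your working directory ...
download.file(primeData$location[primeData$filename == "es_words.csv"],
              destfile = "es_words.csv", mode = "wb")
# ... then import the local copy directly with rio
es_words <- rio::import("es_words.csv")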
Import Specific Data
In this example, we import the stimuli dataset for Spanish, which includes the trials, type of trial information, and the cosine calculated from subs2vec.
es_words <- import_prime("es_words.csv")
head(es_words)
#> es_cue es_target type cue_type target_type es_cosine
#> 1 lenguado abadejo related word word 0.5712210
#> 2 dejar abandonar related word word 0.5418134
#> 3 espalda abdomen related word word 0.4251472
#> 4 beso abrazo related word word 0.7530440
#> 5 ridículo absurdo related word word 0.7036410
#> 6 abuelo abuela related word word 0.7450651
Match to LAB Data
Load Available Data
To review the available data from the Linguistic Annotated
Bibliography, you can use data("labData")
, which includes
information about available datasets overall and which are included in
our LAB data release for merging.
data("labData")
head(tibble(labData))
#> # A tibble: 6 × 73
#> included bibtex author year ref_title ref_journal ref_volume ref_page ref_doi
#> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 no Adelm… Adelm… 2014 "A behav… Behavior R… "46" "1052--… "10.37…
#> 2 no Aguil… Aguil… 2017 "Develop… 2017 IEEE … "" "" "10.11…
#> 3 no Akini… Akini… 2014 "Russian… Behavior R… "47" "691--7… "10.37…
#> 4 no Al-Su… Al-Su… 2006 "The Des… Internatio… "11" "135--1… "10.10…
#> 5 no Alame… Alame… 1995 "Diccion… Servicio d… "" "" ""
#> 6 yes Alari… Alari… 1999 "A set o… Behavior R… "31" "531--5… "10.37…
#> # ℹ 64 more variables: no1 <chr>, no2 <int>, type1 <chr>, ref1 <chr>,
#> # type2 <chr>, ref2 <chr>, notes_stim <chr>, data_name <chr>, nonling <int>,
#> # language <chr>, notes_lang <chr>, language_glotto <chr>,
#> # notes_glotto <chr>, population <chr>, notes_var <chr>, accuracy <int>,
#> # ambiguity <int>, aoa <int>, arousal <int>, assoc <int>, category <int>,
#> # cloze <int>, complex <int>, concrete <int>, confusion <int>, context <int>,
#> # dist <int>, dominate <int>, easelearn <int>, familiar <int>, freq <int>, …
# import_lab() also loads this dataset
# ?labData # use this to learn about the dataset
Load Filtered Metadata
If you want to find specific types of LAB data, you can filter with the language and/or variables arguments.
saved <- import_lab(language = "English", variables = c("aoa", "freq"))
# possible datasets that are English, aoa, and frequency
head(tibble(saved))
#> # A tibble: 1 × 1
#> saved
#> <named list>
#> 1 <df [3 × 74]>
saved <- import_lab(language = "Spanish", variables = c("aoa"))
head(tibble(saved))
#> # A tibble: 1 × 1
#> saved
#> <named list>
#> 1 <df [8 × 74]>
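To see which datasets those are (a sketch assuming the returned list element is the filtered labData table, as the data frame dimensions above suggest), you can peek at their Bibtex IDs and pass one to import_lab() in the next step:
# Bibtex IDs for the matching Spanish age-of-acquisition datasets
saved[[1]]$bibtex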
Load Specific Data
es_aos <- import_lab(bibtexID = "Alonso2015", citation = TRUE)
es_aos$citation
#> [1] "Alonso, Fernandez, & Diez. (2014). Subjective age-of-acquisition norms for 7039 Spanish words. Behavior Research Methods, 47, 268--274. doi: 10.3758/s13428-014-0454-2"
head(tibble(es_aos$loaded_data))
#> # A tibble: 6 × 13
#> word_spanish aoa_M aoa_SD aoa_min aoa_max aoa_zscore oral_freq_log_M
#> <chr> <dbl> <dbl> <int> <int> <dbl> <dbl>
#> 1 a 2.28 1.44 1 6 -1.65 4.85
#> 2 abajo 2.96 1.37 1 6 -1.36 2.62
#> 3 abandonado 6.06 1.66 2 10 -0.584 1.62
#> 4 abandonar 7.58 1.66 4 11 -0.02 1.81
#> 5 abandono 7.22 1.94 3 11 0.04 1.60
#> 6 abatimiento 10.0 1.49 5 11 1.06 0.602
#> # ℹ 6 more variables: written_freq_log_SUBTLEXESP_M <dbl>,
#> # written_freq_log_LEXESP_M <dbl>, written_freq_log_espal_M <dbl>,
#> # lem_cat_espl_max <chr>, lem_max_code <chr>, syllable_N <int>
es_sim <- import_lab(bibtexID = "Cabana2024_R1", citation = TRUE)
es_sim$citation
#> [1] "Cabana, Zugarramurdi, Valle-Lisboa, & De Deyne. (2024). The \xd2Small World of Words\xd3 free association norms for Rioplatense Spanish. Behavior Research Methods, 56, 968--985. doi: 10.3758/s13428-023-02070-z"
head(tibble(es_sim$loaded_data))
#> # A tibble: 6 × 5
#> cue response R1 N R1.Strength
#> <chr> <chr> <int> <int> <dbl>
#> 1 ? pregunta 26 66 0.394
#> 2 ? que 8 66 0.121
#> 3 ? duda 6 66 0.0909
#> 4 ? incógnita 4 66 0.0606
#> 5 ? interrogación 3 66 0.0455
#> 6 ? no sé 2 66 0.0303
Match To Prime Data
es_words_merged <- es_words %>%
# merge with the cue word (will be .x variables)
left_join(es_aos$loaded_data,
by = c("es_cue" = "word_spanish")) %>%
# merge with the target word (will be .y variables)
left_join(es_aos$loaded_data,
by = c("es_target" = "word_spanish")) %>%
# merge with free association similarity
left_join(es_sim$loaded_data,
by = c("es_cue" = "cue",
"es_target" = "response"))
head(tibble(es_words_merged))
#> # A tibble: 6 × 33
#> es_cue es_target type cue_type target_type es_cosine aoa_M.x aoa_SD.x
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 lenguado abadejo related word word 0.571 7.68 2.38
#> 2 dejar abandonar related word word 0.542 5.5 1.83
#> 3 espalda abdomen related word word 0.425 3.9 1.75
#> 4 beso abrazo related word word 0.753 2.5 1.59
#> 5 ridículo absurdo related word word 0.704 NA NA
#> 6 abuelo abuela related word word 0.745 2.32 1.28
#> # ℹ 25 more variables: aoa_min.x <int>, aoa_max.x <int>, aoa_zscore.x <dbl>,
#> # oral_freq_log_M.x <dbl>, written_freq_log_SUBTLEXESP_M.x <dbl>,
#> # written_freq_log_LEXESP_M.x <dbl>, written_freq_log_espal_M.x <dbl>,
#> # lem_cat_espl_max.x <chr>, lem_max_code.x <chr>, syllable_N.x <int>,
#> # aoa_M.y <dbl>, aoa_SD.y <dbl>, aoa_min.y <int>, aoa_max.y <int>,
#> # aoa_zscore.y <dbl>, oral_freq_log_M.y <dbl>,
#> # written_freq_log_SUBTLEXESP_M.y <dbl>, written_freq_log_LEXESP_M.y <dbl>, …
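With everything merged, you can start asking simple questions of the combined data. For example (a sketch using only the merged columns above), you might check how the subs2vec cosine tracks free-association strength for these pairs:
# correlation between embedding similarity and free-association strength,
# using only pairs where both values are present
cor(es_words_merged$es_cosine, es_words_merged$R1.Strength,
    use = "pairwise.complete.obs")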
Other Cool Stuff
We used labjs for this project. The datasets you get from labjs come in a SQLite file, which is not super fun to process, so they wrote a function to do that. We included that function here as processData(), and you can see that we used it in our data processing files. It’s here if you want to use it yourself on labjs projects.
df <- processData("data.sqlite")
- Check out the text package for how to merge word embeddings in R: https://osf.io/preprints/psyarxiv/293kt
- https://cran.r-project.org/web/packages/text/vignettes/huggingface_in_r.html