semanticprimeR Package Functions
Erin Buchanan
2024-11-10
package_functions.Rmd
Libraries
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(semanticprimeR)
#>
#> Attaching package: 'semanticprimeR'
#> The following object is masked from 'package:dplyr':
#>
#> top_n
library(udpipe)
library(stopwords)
library(tibble)
library(rio)
Power Functions
During the process of this project, we proposed a new way to calculate “power” for adaptive sampling of items using accuracy in parameter estimation and bootstrapping/simulation methods. You can review the preprint here: https://osf.io/preprints/osf/e3afx
This package includes vignettes for many different types of data. Use vignette(package = "semanticprimeR") to review what is available and, for example, vignette(topic = "montefinese_vignette", package = "semanticprimeR") to open a specific vignette.
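For example, you can run these calls (taken from the sentence above) directly in your R console:
# list all vignettes shipped with the package
vignette(package = "semanticprimeR")
# open one specific vignette
vignette(topic = "montefinese_vignette", package = "semanticprimeR")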
The information presented here is a shortened version of the paper to show off the functionality of the package.
Simulate Population
Let’s say we want to run a priming study but do not have previous data. We can use simulate_population() to create sample data for estimating sample size for adaptive testing and stopping rules.
df <- simulate_population(mu = 25, # mean priming in ms
mu_sigma = 5, # standard deviation of the item means
sigma = 10, # population standard deviation for items
sigma_sigma = 3, # standard deviation of the standard deviation of items
number_items = 75, # number of priming items
number_scores = 100, # a population of values to simulate
smallest_sigma = 1, # smallest possible item standard deviation
min_score = -25, # min ms priming
max_score = 100, # max ms priming
digits = 3)
head(df)
#> item score
#> 1 1 22.966
#> 2 2 27.720
#> 3 3 1.809
#> 4 4 19.104
#> 5 5 38.348
#> 6 6 41.431
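As a quick sanity check, you can summarize the simulated data by item with dplyr (a minimal sketch using the item and score columns above); the per-item means should scatter around mu = 25 and the per-item standard deviations around sigma = 10:
# per-item means and SDs should roughly track mu and sigma from the simulation
df %>%
  group_by(item) %>%
  summarize(mean_score = mean(score),
            sd_score = sd(score)) %>%
  head()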
Calculate Cutoff
From this dataframe, we can calculate the standard error of each item, which is used to determine whether the item has reached the stopping rule (i.e., once the standard error drops to a defined value, the item is considered accurately measured, and we can stop sampling it). To get that defined value, we use the 40th percentile of the standard errors of the items.
cutoff <- calculate_cutoff(population = df, # pilot data or simulated data
grouping_items = "item", # name of the item indicator column
score = "score", # name of the dependent variable column
minimum = min(df$score), # minimum possible/found score
maximum = max(df$score)) # maximum possible/found score
cutoff$se_items # all standard errors of items
#> [1] 1.1914848 1.2944957 0.9411572 1.0181927 0.9976637 1.6398903 1.4534350
#> [8] 1.1954318 1.1967030 0.7680782 1.4009475 0.2155584 0.7733205 1.1256345
#> [15] 0.8887702 1.1431167 1.1658235 1.2875266 0.9747519 1.0861600 1.4798068
#> [22] 0.9016232 1.2245109 1.4814803 0.8933068 0.9018008 0.7951396 0.7464761
#> [29] 0.7540108 0.4469756 0.7262989 0.7128257 0.8571588 1.1398729 1.3525854
#> [36] 0.9395962 0.5548498 1.0024380 0.8636102 1.2956495 1.5543834 1.2468959
#> [43] 0.4499860 0.9400761 1.3068053 1.6198171 1.1612173 1.2668006 1.0598592
#> [50] 1.0157997 1.0114304 1.1258111 0.9879748 0.8066764 1.0789258 0.5931047
#> [57] 1.3059619 0.9923325 0.4971521 0.7544609 1.1140443 0.7171717 0.9382714
#> [64] 1.1860594 1.0760170 0.9469807 0.7591402 1.4144226 1.1881549 1.2362692
#> [71] 1.3670759 0.6974420 1.1948369 0.4762515 1.5065986
cutoff$sd_items # standard deviation of the standard errors
#> [1] 0.2981085
cutoff$cutoff # 40th percentile score
#> 40%
#> 0.9636434
cutoff$prop_var # proportion of possible variance
#> [1] 0.006049832
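As a small sketch of what the cutoff means (using only the values above), the proportion of items in the pilot/simulated data already at or below the cutoff standard error should be roughly 40% by construction:
# proportion of items already at or below the cutoff standard error
mean(cutoff$se_items <= cutoff$cutoff)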
Simulate Samples
Next, we use simulation to generate bootstrapped samples from a starting size (we suggest 20) up to the maximum size we are willing to collect. These samples are then used to count how many items achieve our stopping rule at each sample size, which tells us how many participants are needed to accurately measure most items.
samples <- simulate_samples(start = 20, # starting sample size
stop = 400, # stopping sample size
increase = 5, # increase bootstrapped samples by this amount
population = df, # population or pilot data
replace = TRUE, # bootstrap with replacement?
nsim = 500, # number of simulations to run
grouping_items = "item") # item column label
save(samples, file = "data/simulatePriming.RData")
# since that's a slow function, we wrote out the data and read it back in
load(paste0(filelocation,"/vignettes/data/simulatePriming.Rdata"))
head(samples[[1]])
#> # A tibble: 6 × 2
#> # Groups: item [1]
#> item score
#> <int> <dbl>
#> 1 1 15.8
#> 2 1 19.6
#> 3 1 17.7
#> 4 1 25.6
#> 5 1 31.0
#> 6 1 5.53
Calculate Proportions of Items Measured
From those samples and our estimated cutoff score, we can calculate the proportion of items below our suggested stopping rule at each sample size.
proportion_summary <- calculate_proportion(samples = samples, # samples list
cutoff = cutoff$cutoff, # cutoff score
grouping_items = "item", # item column name
score = "score") # dependent variable column name
head(proportion_summary)
#> # A tibble: 6 × 2
#> sample_size percent_below
#> <dbl> <dbl>
#> 1 20 0.04
#> 2 25 0.0667
#> 3 30 0.08
#> 4 35 0.0933
#> 5 40 0.12
#> 6 45 0.147
Final Sample Size
As noted in our paper, this simulation procedure needs a correction to approximate traditional interpretations of power. You can use “power” levels like 80 percent, 90 percent, and so on similarly to traditional power: higher numbers are more stringent about ensuring all items are “well measured”, and correspondingly they yield larger estimated sample sizes. We suggest using the 80% value as a minimum sample size, using the cutoff stopping rule while running the study if you use adaptive sampling (or simply to review the data), and using a higher value such as 90% as a maximum sample size to collect.
corrected_summary <- calculate_correction(
proportion_summary = proportion_summary, # prop from above
pilot_sample_size = 100, # number of participants in the pilot data
proportion_variability = cutoff$prop_var, # proportion variance from cutoff scores
power_levels = c(80, 85, 90, 95)) # what levels of power to calculate
corrected_summary
#> # A tibble: 4 × 3
#> percent_below sample_size corrected_sample_size
#> <dbl> <dbl> <dbl>
#> 1 80 170 92.4
#> 2 88 210 115.
#> 3 90.7 235 129.
#> 4 96 275 150.
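For example, if you follow the 80%/90% guidance above, you could pull the suggested minimum and maximum sample sizes straight out of this table (a small sketch using the columns shown in the output):
# smallest corrected sample sizes reaching at least 80% and at least 90% of items measured
min_n <- min(corrected_summary$corrected_sample_size[corrected_summary$percent_below >= 80])
max_n <- min(corrected_summary$corrected_sample_size[corrected_summary$percent_below >= 90])
ceiling(c(minimum = min_n, maximum = max_n))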
Create Stimuli Functions
Once you’ve determined your sample sizes, you will want to create stimuli. These functions show how we created stimuli from the OpenSubtitles project using the subs2vec project.
- OpenSubtitles: https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles
- subs2vec: https://github.com/jvparidon/subs2vec
Get Models
You can review the available models from subs2vec, merged with the available data from OpenSubtitles, by loading the subsData dataset with data("subsData"). Use ?subsData to view information about the columns and the dataset.
data("subsData")
head(tibble(subsData))
#> # A tibble: 6 × 10
#> language_code subs_vec subs_count wiki_vec wiki_count files tokens sentences
#> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 hi https://h… https://h… https:/… https://h… 102 1.0M 0.1M
#> 2 gl https://h… https://h… https:/… https://h… 449 2.3M 0.3M
#> 3 ca https://h… https://h… https:/… https://h… 832 4.6M 0.6M
#> 4 bn https://h… https://h… https:/… https://h… 542 3.7M 0.7M
#> 5 br https://h… https://h… https:/… https://h… 32 0.2M 23.1k
#> 6 bs https://h… https://h… https:/… https://h… 37309 215.4M 34.1M
#> # ℹ 2 more variables: language <chr>, udpipe_model <chr>
We only worked with languages in which we could use a part of speech
tagger. We recommend udpipe
as a great package that has
many taggers. The language model necessary is shown in the
udpipe_model
column.
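For example, the model used later for Afrikaans can be looked up directly from subsData:
# find the udpipe model that matches the Afrikaans data
subsData$udpipe_model[subsData$language_code == "af"]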
In this example, let’s use Afrikaans because it is a smaller dataset. Be warned that these datasets can be very large to download and use on your computer. Use the import_subs() function to download and import the files you are interested in. For language, use the two-letter language code from the language_code column of subsData. You then need to pick what to download:
- subs_vec: The subtitle embeddings from a fastText model.
- subs_count: The frequency of tokens found in the subs_vec model.
- wiki_vec: The Wikipedia embeddings from a fastText model.
- wiki_count: The frequency of tokens found in the wiki_vec model.
You may see some warnings based on file formatting.
af_freq <- import_subs(
language = "af",
what = "subs_count"
)
head(af_freq)
#> unigram unigram_freq
#> 1 die 12673
#> 2 nie 11788
#> 3 ek 11605
#> 4 is 9147
#> 5 het 8109
#> 6 jy 7425
We then used udpipe
to filter our possible options. You
may have other criteria, but here’s an example of how we tagged concepts
(for their main part of speech, given no sentence context here). When
you use this function, it will download the model necessary for
tagging.
# tag with udpipe
af_tagged <- udpipe(af_freq$unigram,
object = subsData$udpipe_model[subsData$language_code == "af"],
parser = "none")
head(tibble(af_tagged))
#> # A tibble: 6 × 17
#> doc_id paragraph_id sentence_id sentence start end term_id token_id token
#> <chr> <int> <int> <chr> <int> <int> <int> <chr> <chr>
#> 1 doc1 1 1 die 1 3 1 1 die
#> 2 doc2 1 1 nie 1 3 1 1 nie
#> 3 doc3 1 1 ek 1 2 1 1 ek
#> 4 doc4 1 1 is 1 2 1 1 is
#> 5 doc5 1 1 het 1 3 1 1 het
#> 6 doc6 1 1 jy 1 2 1 1 jy
#> # ℹ 8 more variables: lemma <chr>, upos <chr>, xpos <chr>, feats <chr>,
#> # head_token_id <chr>, dep_rel <chr>, deps <chr>, misc <chr>
We then:
- Lower cased
- Removed anything less than three characters when appropriate
- Picked words only in nouns, verbs, adjectives, and adverbs
- Took out stopwords
- Took out words with numbers
We only used the top 10,000 words for the next section, but this selection will depend on your use case as well.
# word_choice
word_choice <- c("NOUN", "VERB", "ADJ", "ADV")
# lower case
af_tagged$lemma <- tolower(af_tagged$lemma)
# three characters
af_tagged <- subset(af_tagged, nchar(af_tagged$lemma) >= 3)
# keep only nouns, verbs, adjectives, and adverbs
af_tagged <- subset(af_tagged, upos %in% word_choice)
# removed stop words just in case they were incorrectly tagged
af_tagged <- subset(af_tagged, !(lemma %in% stopwords(language = "af", source = "stopwords-iso")))
# removed things with numbers
af_tagged <- subset(af_tagged, !(grepl("[0-9]", af_tagged$sentence)))
# merge frequency back into tagged list
# merge by sentence so one to one match
colnames(af_freq) <- c("sentence", "freq")
af_final <- merge(af_tagged, af_freq, by = "sentence", all.x = T)
head(tibble(af_final))
#> # A tibble: 6 × 18
#> sentence doc_id paragraph_id sentence_id start end term_id token_id token
#> <chr> <chr> <int> <int> <int> <int> <int> <chr> <chr>
#> 1 aanbeveling doc17… 1 1 1 11 1 1 aanb…
#> 2 aanbeweeg doc81… 1 1 1 9 1 1 aanb…
#> 3 aanbied doc13… 1 1 1 7 1 1 aanb…
#> 4 aanbreek doc82… 1 1 1 8 1 1 aanb…
#> 5 aanbring doc14… 1 1 1 8 1 1 aanb…
#> 6 aandadig doc57… 1 1 1 8 1 1 aand…
#> # ℹ 9 more variables: lemma <chr>, upos <chr>, xpos <chr>, feats <chr>,
#> # head_token_id <chr>, dep_rel <chr>, deps <chr>, misc <chr>, freq <int>
# eliminate duplicates by lemma
af_final <- af_final[order(af_final$freq, decreasing = TRUE) , ]
af_final <- af_final[!duplicated(af_final$lemma), ]
# grab top 10K
af_top <- af_final[1:10000 , ]
Next, we used import_subs() again to import the embeddings for the subtitles.
af_dims <- import_subs(
language = "af",
what = "subs_vec"
)
head(tibble(af_dims))
#> # A tibble: 6 × 301
#> V1 V2 V3 V4 V5 V6 V7 V8 V9
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 </s> -0.236 0.0544 0.0202 -0.0531 0.146 0.000971 0.0146 0.0423
#> 2 nie -0.000683 0.0787 0.0382 0.0379 0.295 0.225 -0.171 0.132
#> 3 die -0.0233 -0.000478 0.191 -0.103 0.176 -0.0884 -0.0828 -0.00594
#> 4 is -0.410 0.144 0.133 -0.104 0.295 -0.0391 -0.00618 0.146
#> 5 het 0.344 -0.0163 -0.198 0.259 -0.579 -0.154 0.131 -0.0615
#> 6 n 0.0615 -0.139 -0.0674 -0.0658 -0.105 -0.212 -0.115 -0.0611
#> # ℹ 292 more variables: V10 <dbl>, V11 <dbl>, V12 <dbl>, V13 <dbl>, V14 <dbl>,
#> # V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>, V20 <dbl>,
#> # V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>, V26 <dbl>,
#> # V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>, V32 <dbl>,
#> # V33 <dbl>, V34 <dbl>, V35 <dbl>, V36 <dbl>, V37 <dbl>, V38 <dbl>,
#> # V39 <dbl>, V40 <dbl>, V41 <dbl>, V42 <dbl>, V43 <dbl>, V44 <dbl>,
#> # V45 <dbl>, V46 <dbl>, V47 <dbl>, V48 <dbl>, V49 <dbl>, V50 <dbl>, …
In our case, we want to use the tokens as row names, so we move the first column into the row names and then delete it, leaving a tokens-by-300-dimensions matrix.
# lower case
af_dims$V1 <- tolower(af_dims[ , 1]) # first column is always the tokens
# eliminate duplicates
af_dims <- subset(af_dims, !duplicated(af_dims[ , 1]))
# make row names
rownames(af_dims) <- af_dims[ , 1]
af_dims <- af_dims[ , -1]
head(tibble(af_dims))
#> # A tibble: 6 × 300
#> V2 V3 V4 V5 V6 V7 V8 V9 V10
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.236 0.0544 0.0202 -0.0531 0.146 9.71e-4 0.0146 0.0423 0.00999
#> 2 -0.000683 0.0787 0.0382 0.0379 0.295 2.25e-1 -0.171 0.132 0.216
#> 3 -0.0233 -0.000478 0.191 -0.103 0.176 -8.84e-2 -0.0828 -0.00594 0.0445
#> 4 -0.410 0.144 0.133 -0.104 0.295 -3.91e-2 -0.00618 0.146 -0.0680
#> 5 0.344 -0.0163 -0.198 0.259 -0.579 -1.54e-1 0.131 -0.0615 -0.0159
#> 6 0.0615 -0.139 -0.0674 -0.0658 -0.105 -2.12e-1 -0.115 -0.0611 -0.0669
#> # ℹ 291 more variables: V11 <dbl>, V12 <dbl>, V13 <dbl>, V14 <dbl>, V15 <dbl>,
#> # V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>, V20 <dbl>, V21 <dbl>,
#> # V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>, V26 <dbl>, V27 <dbl>,
#> # V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>, V32 <dbl>, V33 <dbl>,
#> # V34 <dbl>, V35 <dbl>, V36 <dbl>, V37 <dbl>, V38 <dbl>, V39 <dbl>,
#> # V40 <dbl>, V41 <dbl>, V42 <dbl>, V43 <dbl>, V44 <dbl>, V45 <dbl>,
#> # V46 <dbl>, V47 <dbl>, V48 <dbl>, V49 <dbl>, V50 <dbl>, V51 <dbl>, …
Calculate Similarity
We can then use the calculate_similarity() function to get the similarity values for all words based on the dimension matrix. The underlying metric is the cosine between the two words’ dimension vectors.
af_cosine <-
calculate_similarity(
words = af_final$sentence, # the tokens you want to filter
dimensions = af_dims, # the matrix of items
by = 1 # 1 for rows, 2 for columns
)
The top_n() function can be used to return the top n cosine values for each token in the similarity matrix. Please note: it will always return the token-token combination with a cosine of 1 (the token related to itself), so you should ask for n + 1 cosines and then filter out the token-token combinations. Big thanks to Brenton Wiernik, who figured out how to make this computationally efficient.
# get the top 5 related words
af_top_sim <- semanticprimeR::top_n(af_cosine, 6)
af_top_sim <- subset(af_top_sim, cue!=target)
head(af_top_sim)
#> cue target cosine
#> 2 nou als 0.8654161
#> 3 nou nitto 0.8476698
#> 4 nou juis 0.8434761
#> 5 nou nicolas 0.8419336
#> 6 nou jakkals 0.8278382
#> 8 hier bier 0.8086171
Create Pseudowords
We originally set up a function to create nonwords by replacing characters based on the bigrams in the token. We recommend the Wuggy-based function described below, but you can also do this simple replacement.
af_top_sim$fake_cue <- fake_simple(af_top_sim$cue)
# you'd want to also do this based on target depending on your study
head(af_top_sim)
#> cue target cosine fake_cue
#> 2 nou als 0.8654161 non
#> 3 nou nitto 0.8476698 not
#> 4 nou juis 0.8434761 nof
#> 5 nou nicolas 0.8419336 nau
#> 6 nou jakkals 0.8278382 noo
#> 8 hier bier 0.8086171 hiër
You can also use the Wuggy algorithm via fake_Wuggy(). This function is not fast, and it gets slower as the word list it draws from gets larger. It returns a dataframe of pseudoword options with the following columns:
- word_id: Number id for each unique word.
- first: First syllable in pairs of syllables.
- original_pair: Pair of syllables together.
- second: Second syllable in the pairs of syllables.
- syll: Number of syllables in the token.
- original_freq: Frequency of the syllable pair.
- replacement_pair: Replacement option wherein one of the syllables has been changed.
- replacement_syll: The replacement syllable.
- replacement_freq: The frequency of the replacement syllable pair.
- freq_diff: The difference in frequency of the transition pair.
- char_diff: Number of characters difference in the original pair and the replacement pair.
- letter_diff: Number of letters difference in the original pair and the replacement pair. If the replacement includes the same letters, the difference would be zero; these values are excluded from being options.
- original_word: The original token.
- replacement_word: The final replacement token.
af_wuggy <- fake_Wuggy(
  wordlist = af_final$sentence, # full valid options in language
  language_hyp = paste0(filelocation,"/inst/latex/hyph-af.tex"), # path to hyphenation.tex
  lang = "af", # two letter language code
  replacewords = unique(af_top_sim$cue[1:20]) # words you want to create pseudowords for
)
head(tibble(af_wuggy))
#> # A tibble: 4 × 14
#> word_id first original_pair second syll original_freq replacement_pair
#> <int> <chr> <chr> <chr> <dbl> <dbl> <chr>
#> 1 2 first_blank first_blank-hi hi 1 5 first_blank-tu
#> 2 3 ne ne-t t 1 1 oi-t
#> 3 1 first_blank first_blank-no no 1 22 first_blank-va
#> 4 4 first_blank first_blank-we we 1 23 first_blank-bo
#> # ℹ 7 more variables: replacement_syll <chr>, replacement_freq <int>,
#> # freq_diff <dbl>, char_diff <int>, letter_diff <dbl>, original_word <chr>,
#> # replacement_word <chr>
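If you want the pseudowords next to the cue-target pairs, one option (a sketch assuming you keep one replacement_word per original_word) is to merge the Wuggy output back into the similarity table:
# attach the Wuggy replacement words back onto the cue-target pairs
af_top_sim_wuggy <- merge(af_top_sim,
                          af_wuggy[ , c("original_word", "replacement_word")],
                          by.x = "cue", by.y = "original_word",
                          all.x = TRUE)
head(af_top_sim_wuggy)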
Get Priming Data
You can load any one of the many files included in the SPAML release; use data("primeData") to see what we have available. The datasets are broken into a few types:
- procedure_stimuli: The stimuli from the study. Each dataset includes the ~5000 trials used in the study, listed as cue-target pairs with their cue_type/target_type (word/nonword) and trial type (related, unrelated, nonword). The cosine values from the subs2vec models are included when available for word pairs. If the value is blank or NA, you can assume one of the words did not exist in the subs2vec model or could not be matched; the subs2vec models were often filtered to only the top X words, and some selected stimuli may have been infrequent.
- matched_stimuli: The matched stimuli datasets fall into two types: “matched”, which matches the original language to English, and “unique”, which includes the word pair combination found in the datasets that makes each trial unique. Some targets were repeated due to translation, so the unique datasets allow you to unambiguously match things together. The matched_stimuli.csv file has these all matched together if you want all languages at once. The missing data are the Arabic pairs we were asked to remove due to their taboo nature in that culture.
Each of the following files has a codebook found at: https://github.com/SemanticPriming/SPAML/tree/master/05_Data/codebooks
- participant_data: Information on the participants who completed each language.
- full_data: The “raw” data with only identifiers removed.
- trial_data: The trial-level data showing only the trial blocks (i.e., excluding the other lines that indicate the timing and inter-trial interval).
- item_data: The average results for each token/item, ignoring the condition presented.
- priming_data: The priming data in either _trials format (meaning these have been matched and labeled for trial type) or _summary format (meaning averages/summaries of the target trials matched by related and unrelated to create a priming score).
Load Available Data
data("primeData")
head(primeData)
#> type filename language
#> 1 item_data ar_answered_item_data.csv ar
#> 2 priming_data ar_answered_prime_summary_no2.5.csv ar
#> 3 priming_data ar_answered_prime_summary_no3.0.csv ar
#> 4 priming_data ar_answered_prime_summary.csv ar
#> 5 priming_data ar_answered_prime_trials.csv ar
#> 6 full_data ar_full_data.csv.gz ar
#> location
#> 1 https://github.com/SemanticPriming/SPAML/releases/download/v1.0.1/ar_answered_item_data.csv
#> 2 https://github.com/SemanticPriming/SPAML/releases/download/v1.0.1/ar_answered_prime_summary_no2.5.csv
#> 3 https://github.com/SemanticPriming/SPAML/releases/download/v1.0.1/ar_answered_prime_summary_no3.0.csv
#> 4 https://github.com/SemanticPriming/SPAML/releases/download/v1.0.1/ar_answered_prime_summary.csv
#> 5 https://github.com/SemanticPriming/SPAML/releases/download/v1.0.1/ar_answered_prime_trials.csv
#> 6 https://github.com/SemanticPriming/SPAML/releases/download/v1.0.1/ar_full_data.csv.gz
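For example, you can filter primeData by the columns shown above to see what exists for one language or one data type (a small sketch assuming “es” appears in the language column, as the Spanish files used below suggest):
# all files released for Spanish
subset(primeData, language == "es")
# or only the priming data files across languages
head(subset(primeData, type == "priming_data"))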
Once you decide which file you would like to download and import, you can use import_prime() to import that file. Note that some of the full_data datasets are quite large and may take a while to download and/or import directly. You can also use the direct links in the primeData file to download them. Some files are heavily compressed in .gz format; I recommend 7-Zip if you aren’t familiar with unzipping these from the command line: https://www.wikihow.com/Extract-a-Gz-File. You can also import them directly into R with the rio package (which is what this function does, but it downloads the file each time, so I’d recommend downloading once and then putting the import into your code directly with rio::import("filepath")).
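As a sketch of that download-once workflow (using the filename and location columns of primeData, and assuming es_words.csv appears there, as the import below suggests; the destination path is up to you):
# download the file once to your working directory ...
download.file(primeData$location[primeData$filename == "es_words.csv"],
              destfile = "es_words.csv", mode = "wb")
# ... then import the local copy directly with rio
es_words <- rio::import("es_words.csv")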
Import Specific Data
In this example, we import the stimuli dataset for Spanish, which includes the trials, type of trial information, and the cosine calculated from subs2vec.
es_words <- import_prime("es_words.csv")
head(es_words)
#> es_cue es_target type cue_type target_type es_cosine
#> 1 lenguado abadejo related word word 0.5712210
#> 2 dejar abandonar related word word 0.5418134
#> 3 espalda abdomen related word word 0.4251472
#> 4 beso abrazo related word word 0.7530440
#> 5 ridículo absurdo related word word 0.7036410
#> 6 abuelo abuela related word word 0.7450651
Match to LAB Data
Load Available Data
To review the available data from the Linguistic Annotated
Bibliography, you can use data("labData")
, which includes
information about available datasets overall and which are included in
our LAB data release for merging.
data("labData")
head(tibble(labData))
#> # A tibble: 6 × 73
#> included bibtex author year ref_title ref_journal ref_volume ref_page ref_doi
#> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 no Adelm… Adelm… 2014 "A behav… Behavior R… "46" "1052--… "10.37…
#> 2 no Aguil… Aguil… 2017 "Develop… 2017 IEEE … "" "" "10.11…
#> 3 no Akini… Akini… 2014 "Russian… Behavior R… "47" "691--7… "10.37…
#> 4 no Al-Su… Al-Su… 2006 "The Des… Internatio… "11" "135--1… "10.10…
#> 5 no Alame… Alame… 1995 "Diccion… Servicio d… "" "" ""
#> 6 yes Alari… Alari… 1999 "A set o… Behavior R… "31" "531--5… "10.37…
#> # ℹ 64 more variables: no1 <chr>, no2 <int>, type1 <chr>, ref1 <chr>,
#> # type2 <chr>, ref2 <chr>, notes_stim <chr>, data_name <chr>, nonling <int>,
#> # language <chr>, notes_lang <chr>, language_glotto <chr>,
#> # notes_glotto <chr>, population <chr>, notes_var <chr>, accuracy <int>,
#> # ambiguity <int>, aoa <int>, arousal <int>, assoc <int>, category <int>,
#> # cloze <int>, complex <int>, concrete <int>, confusion <int>, context <int>,
#> # dist <int>, dominate <int>, easelearn <int>, familiar <int>, freq <int>, …
# import_lab() also loads this dataset
# ?labData # use this to learn about the dataset
Load Filtered Metadata
If you want to find specific types of LAB data, you can filter with the language and/or variables arguments.
saved <- import_lab(language = "English", variables = c("aoa", "freq"))
# possible datasets that are English, aoa, and frequency
head(tibble(saved))
#> # A tibble: 1 × 1
#> saved
#> <named list>
#> 1 <df [3 × 74]>
saved <- import_lab(language = "Spanish", variables = c("aoa"))
head(tibble(saved))
#> # A tibble: 1 × 1
#> saved
#> <named list>
#> 1 <df [8 × 74]>
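To see which datasets those are (a sketch assuming the returned list element is the filtered labData table, as the data frame dimensions above suggest), you can peek at their Bibtex IDs and pass one to import_lab() in the next step:
# Bibtex IDs for the matching Spanish age-of-acquisition datasets
saved[[1]]$bibtex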
Load Specific Data
es_aos <- import_lab(bibtexID = "Alonso2015", citation = TRUE)
es_aos$citation
#> [1] "Alonso, Fernandez, & Diez. (2014). Subjective age-of-acquisition norms for 7039 Spanish words. Behavior Research Methods, 47, 268--274. doi: 10.3758/s13428-014-0454-2"
head(tibble(es_aos$loaded_data))
#> # A tibble: 6 × 13
#> word_spanish aoa_M aoa_SD aoa_min aoa_max aoa_zscore oral_freq_log_M
#> <chr> <dbl> <dbl> <int> <int> <dbl> <dbl>
#> 1 a 2.28 1.44 1 6 -1.65 4.85
#> 2 abajo 2.96 1.37 1 6 -1.36 2.62
#> 3 abandonado 6.06 1.66 2 10 -0.584 1.62
#> 4 abandonar 7.58 1.66 4 11 -0.02 1.81
#> 5 abandono 7.22 1.94 3 11 0.04 1.60
#> 6 abatimiento 10.0 1.49 5 11 1.06 0.602
#> # ℹ 6 more variables: written_freq_log_SUBTLEXESP_M <dbl>,
#> # written_freq_log_LEXESP_M <dbl>, written_freq_log_espal_M <dbl>,
#> # lem_cat_espl_max <chr>, lem_max_code <chr>, syllable_N <int>
es_sim <- import_lab(bibtexID = "Cabana2024_R1", citation = TRUE)
es_sim$citation
#> [1] "Cabana, Zugarramurdi, Valle-Lisboa, & De Deyne. (2024). The \xd2Small World of Words\xd3 free association norms for Rioplatense Spanish. Behavior Research Methods, 56, 968--985. doi: 10.3758/s13428-023-02070-z"
head(tibble(es_sim$loaded_data))
#> # A tibble: 6 × 5
#> cue response R1 N R1.Strength
#> <chr> <chr> <int> <int> <dbl>
#> 1 ? pregunta 26 66 0.394
#> 2 ? que 8 66 0.121
#> 3 ? duda 6 66 0.0909
#> 4 ? incógnita 4 66 0.0606
#> 5 ? interrogación 3 66 0.0455
#> 6 ? no sé 2 66 0.0303
Match To Prime Data
es_words_merged <- es_words %>%
# merge with the cue word (will be .x variables)
left_join(es_aos$loaded_data,
by = c("es_cue" = "word_spanish")) %>%
# merge with the target word (will be .y variables)
left_join(es_aos$loaded_data,
by = c("es_target" = "word_spanish")) %>%
# merge with free association similarity
left_join(es_sim$loaded_data,
by = c("es_cue" = "cue",
"es_target" = "response"))
head(tibble(es_words_merged))
#> # A tibble: 6 × 33
#> es_cue es_target type cue_type target_type es_cosine aoa_M.x aoa_SD.x
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 lenguado abadejo related word word 0.571 7.68 2.38
#> 2 dejar abandonar related word word 0.542 5.5 1.83
#> 3 espalda abdomen related word word 0.425 3.9 1.75
#> 4 beso abrazo related word word 0.753 2.5 1.59
#> 5 ridículo absurdo related word word 0.704 NA NA
#> 6 abuelo abuela related word word 0.745 2.32 1.28
#> # ℹ 25 more variables: aoa_min.x <int>, aoa_max.x <int>, aoa_zscore.x <dbl>,
#> # oral_freq_log_M.x <dbl>, written_freq_log_SUBTLEXESP_M.x <dbl>,
#> # written_freq_log_LEXESP_M.x <dbl>, written_freq_log_espal_M.x <dbl>,
#> # lem_cat_espl_max.x <chr>, lem_max_code.x <chr>, syllable_N.x <int>,
#> # aoa_M.y <dbl>, aoa_SD.y <dbl>, aoa_min.y <int>, aoa_max.y <int>,
#> # aoa_zscore.y <dbl>, oral_freq_log_M.y <dbl>,
#> # written_freq_log_SUBTLEXESP_M.y <dbl>, written_freq_log_LEXESP_M.y <dbl>, …
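With everything merged, you can start asking simple questions of the combined data. For example (a sketch using only the merged columns above), you might check how the subs2vec cosine tracks free-association strength for these pairs:
# correlation between embedding similarity and free-association strength,
# using only pairs where both values are present
cor(es_words_merged$es_cosine, es_words_merged$R1.Strength,
    use = "pairwise.complete.obs")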
Other Cool Stuff
We used labjs for this project. The datasets you get from labjs come in a SQLite file, which is not super fun to process, so they wrote a function to do that. We included that function here as processData(), and you can see that we used it in our data processing files. It’s here if you want to use it yourself on labjs projects.
df <- processData("data.sqlite")
- Check out the text package for how to merge word embeddings in R: https://osf.io/preprints/psyarxiv/293kt
- https://cran.r-project.org/web/packages/text/vignettes/huggingface_in_r.html