Power and Sample Size Simulation: Online search trends and word-related emotional response during COVID-19 lockdown in Italy
Mahmoud M. Elsherif
2024-11-10
montefinese_vignette.Rmd
Project/Data Title:
Online search trends and word-related emotional response during COVID-19 lockdown in Italy
Data provided by: Maria Montefinese
Project/Data Description:
The strong and long lockdown adopted by the Italian government to limit the spread of COVID-19 represents the first threat-related mass isolation in history that scientists can study in depth to understand individuals' emotional response to a pandemic. Perception of a pandemic threat through invasive media communication, such as that related to COVID-19, can induce fear-related emotions (Van Bavel et al., 2020). The dimensional theory of emotions (Osgood & Suci, 1955) assumes that emotive space is defined along three dimensions: valence (indicating the way an individual judges a stimulus; from unpleasant to pleasant), arousal (indicating the degree of activation an individual feels towards a stimulus; from calm to excited), and dominance (indicating the degree of control an individual feels over a given stimulus; from out of control to in control). Fear is characterized as a negatively valenced emotion, accompanied by a high level of arousal (Witte, 1992; Witte, 1998) and low dominance (Stevenson, Mikels & James, 2007). This is generally in line with previous results showing that participants judged stimuli related to the most feared medical conditions as the most negative, the most anxiety-provoking, and the least controllable (Warriner, Kuperman & Brysbaert, 2013). Fear is also characterized by extreme levels of emotional avoidance of specific stimuli (Perin et al., 2015) and may be considered a unidirectional precursor to psychopathological responses within the current context (Ahorsu et al., 2020). However, dealing with fear in a pandemic situation could be easier for some people than for others. Indeed, individual differences have been associated with behavioral responses to pandemic status (Carvalho, Pianowski & Gonçalves, 2020).
To mitigate the effects of COVID-19 on individuals' mental health, it is imperative to evaluate their emotional response to this emergency. Internet searches are a direct tool for addressing this problem. In fact, COVID-19 has been reported to affect the content that people explore online (Effenberger et al., 2020), and online media and platforms offer essential channels where people express their feelings and emotions and seek health-related information (Kalichman et al., 2003; Reeves, 2001). In particular, Google Trends is a readily available data source of real-time internet search patterns, which has been shown to be a valid indicator of people's desires and intentions (Payne, Brown-Iannuzzi & Hannay, 2017; Pelham et al., 2018). Therefore, the volume of internet searches related to COVID-19, as revealed by Google Trends, is an indicator of how people feel about concepts related to the COVID-19 pandemic. A change in online search trends reflects a change in people's interests in and attitudes towards a specific topic. Based on the topic, the context (that is, the reasons for this change), and this changed interest itself, it is possible to predict people's behavior and affective response to the topic in question. In this study, our aim was to understand how emotional reactions and online search behavior changed in response to the COVID-19 lockdown in the Italian population.
Methods Description:
Data were collected between 4 May and 17 May 2020, the last day of complete lockdown in Italy, from 71 native adult Italian speakers (56 females and 13 males; mean age (SD) = 26.2 (7.9) years; mean education (SD) = 15.3 (3.2) years). There were no other specific eligibility criteria. An online survey was conducted using Google Forms to collect affective ratings during the lockdown caused by the COVID-19 epidemic in Italy. In particular, we asked participants to complete the Positive and Negative Affect Schedule (PANAS; Terracciano, McCrae & Costa, 2003) and the Fear of COVID-19 Scale (FCV-19S; Ahorsu et al., 2020) and to judge the valence, arousal, and dominance (on a 9-point self-assessment manikin; Montefinese et al., 2014) of words related or unrelated to COVID-19, as identified by Google search trends. The word stimuli consisted of 3 groups of 20 words each. The first group (REL+) consisted of the words showing the largest positive relation between their search trends and the search trend for COVID-related terms. Conversely, the second group (REL-) consisted of the words showing the largest negative relation between their search trends and the search trend for COVID-related terms. In other words, the COVID-19 epidemic in Italy and the consequent increase in interest in COVID-related terms was accompanied by a similar increase in interest in the REL+ words and a decrease in interest in the REL- words. The third group (UNREL) consisted of the words whose search trends were unrelated to the search trend for the COVID-related terms.
Data Location:
# packages used throughout this vignette
# (calculate_cutoff() and calculate_correction() are assumed to come from the semanticprimeR package)
library(rio)
library(dplyr)
library(tidyr)
library(flextable)
library(semanticprimeR)

DF <- import("data/montefinese_data.csv")

# make syntactically valid, unique column names
names(DF) <- make.names(names(DF), unique = TRUE)
names(DF)[names(DF) == 'ITEM..ITA.'] <- "item"

DF <- DF %>%
  filter(StimType != "") %>%                        # drop rows without a word condition
  filter(Measure == "Valence") %>%                  # only look at valence scores
  arrange(item) %>%                                 # order the rows by item
  group_by(item) %>%                                # group by item
  transform(items = as.numeric(factor(item))) %>%   # add a numeric item code
  select(items, item, everything())                 # put the item code and item name first

head(DF)
#> items item ssID Gender Age Education Measure StimType Response
#> 1 1 affogare 1 F 36 21 Valence UNREL 2
#> 2 1 affogare 2 M 40 21 Valence UNREL 2
#> 3 1 affogare 3 F 29 21 Valence UNREL 1
#> 4 1 affogare 4 M 39 13 Valence UNREL 1
#> 5 1 affogare 5 F 27 16 Valence UNREL 1
#> 6 1 affogare 6 M 33 18 Valence UNREL 1
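Before moving on, a quick sanity check (not part of the original analysis) can confirm the number of participants and word stimuli in the filtered valence data; this sketch assumes the `ssID`, `item`, and `StimType` columns shown in the preview above.

# quick sanity check (illustrative, not part of the original analysis)
length(unique(DF$ssID))   # number of participants
length(unique(DF$item))   # number of word stimuli
# number of unique words per condition
table(unique(DF[, c("item", "StimType")])$StimType)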
Dataset Citation:
Montefinese M, Ambrosini E, Angrilli A. 2021. Online search trends and word-related emotional response during COVID-19 lockdown in Italy: a cross-sectional online study. PeerJ 9:e11858 https://doi.org/10.7717/peerj.11858
Column Metadata:
| Variable Name | Variable Description | Type (numeric, character, logical, etc.) |
|---|---|---|
| ssID | Participant code | Numeric |
| Gender | Participants’ gender | Character |
| Age | Participants’ age | Numeric |
| Education | Participants’ years of education | Numeric |
| Measure | Questionnaires and ratings (PANAS, COVID-19 fear, valence, arousal, dominance) | Character |
| ITEM (ITA) | Test items and word stimuli | Character |
| Stim Type | Word condition (REL+, REL-, UNREL) | Character |
| Response | Participants’ scores to the questionnaires and ratings | Numeric |
AIPE Analysis:
In this dataset, there are REL+, REL-, and UNREL conditions. In the REL+ condition, the words showed the largest positive relation between their search trends and the search trend for the COVID-related terms. In the REL- condition, the words showed the largest negative relation between their search trends and the search trend for the COVID-related terms. The third group (UNREL) consisted of the words whose search trends were unrelated to the search trend for the COVID-related terms.
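As a purely descriptive sketch (not part of the original analysis), the valence ratings can be summarized by condition before turning to the stopping rule; this assumes the `StimType` and `Response` columns shown in the data preview above.

# descriptive summary of valence ratings by word condition (illustrative only)
DF %>%
  dplyr::group_by(StimType) %>%
  dplyr::summarize(mean_valence = mean(Response),
                   sd_valence = sd(Response),
                   n_ratings = dplyr::n())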
Stopping Rule
What is the usual standard error for these data that could be considered for our stopping rule, using the 40th percentile? Given potential differences between conditions, we subset the data by condition to estimate the standard errors separately.
### create subset for REL+
DF_RELpos <- subset(DF, StimType == "REL+")
### create subset for REL-
DF_RELneg <- subset(DF, StimType == "REL-")
### create subset for UNREL
DF_UNREL <- subset(DF, StimType == "UNREL")
# individual SEs for REL+ condition
cutoff_relpos <- calculate_cutoff(population = DF_RELpos,
                                  grouping_items = "item",
                                  score = "Response",
                                  minimum = min(DF_RELpos$Response),
                                  maximum = max(DF_RELpos$Response))
SE1 <- tapply(DF_RELpos$Response, DF_RELpos$item, function (x) { sd(x)/sqrt(length(x)) })
SE1
#> burro casa cioccolato computer corona famiglia febbre
#> 0.1557687 0.1766870 0.1537913 0.1998951 0.2176631 0.1481984 0.1481984
#> lavare libertà mondo muffin notizie peste salute
#> 0.1860313 0.1916883 0.2017999 0.1846399 0.1636991 0.1621336 0.1569829
#> salvare sole tempo termometro torta vaiolo
#> 0.1431403 0.1483131 0.1952478 0.1441268 0.1393278 0.1585099
cutoff_relpos$cutoff
#> 40%
#> 0.1564972
# individual SEs for REL- condition
cutoff_relneg <- calculate_cutoff(population = DF_RELneg,
                                  grouping_items = "item",
                                  score = "Response",
                                  minimum = min(DF_RELneg$Response),
                                  maximum = max(DF_RELneg$Response))
SE2 <- tapply(DF_RELneg$Response, DF_RELneg$item, function (x) { sd(x)/sqrt(length(x)) })
SE2
#> autobus costa dormire giacca hotel mangiare matrimonio
#> 0.17715152 0.18453238 0.16521553 0.15728951 0.17384120 0.14909443 0.20475544
#> motore palazzo pantalone piazza profumo ristorante sposa
#> 0.15085177 0.17073238 0.17481656 0.20640961 0.16138025 0.20118108 0.21208427
#> tram tumore uomo vacanza viaggio villaggio
#> 0.16872880 0.07001896 0.16715951 0.13096103 0.18557374 0.16663313
cutoff_relneg$cutoff
#> 40%
#> 0.166949
# individual SEs for UNREL condition
cutoff_unrel <- calculate_cutoff(population = DF_UNREL,
                                 grouping_items = "item",
                                 score = "Response",
                                 minimum = min(DF_UNREL$Response),
                                 maximum = max(DF_UNREL$Response))
SE3 <- tapply(DF_UNREL$Response, DF_UNREL$item, function (x) { sd(x)/sqrt(length(x)) })
SE3
#> affogare baco cannone cappio corridore dipendente
#> 0.1312204 0.1622734 0.2131240 0.1750919 0.1781088 0.1796297
#> disturbare fetore firmamento funerale gusto ladro
#> 0.1535516 0.1507766 0.1973986 0.1121214 0.1526631 0.1473738
#> malevolenza mestolo nettare oceano offendersi orgasmo
#> 0.1379994 0.1218824 0.1755445 0.1511708 0.1652155 0.1458857
#> perfezione tradire
#> 0.2210864 0.1158751
cutoff_unrel$cutoff
#> 40%
#> 0.1510131
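For convenience (an illustrative addition, not part of the original analysis), the three 40th-percentile cutoffs can be collected into a single named vector for comparison before the simulations are run.

# collect the three stopping-rule cutoffs for comparison (illustrative)
c(RELpos = unname(cutoff_relpos$cutoff),
  RELneg = unname(cutoff_relneg$cutoff),
  UNREL  = unname(cutoff_unrel$cutoff))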
# sequence of sample sizes to try
nsim <- 10 # small for cran
samplesize_values <- seq(25, 300, 5)
# create a blank table for us to save the values in: positive ----
sim_table <- matrix(NA,
                    nrow = length(samplesize_values) * nsim,
                    ncol = length(unique(DF_RELpos$item)))
# make it a data frame
sim_table <- as.data.frame(sim_table)
# add a place for sample size values
sim_table$sample_size <- NA
sim_table$var <- "Response"

# make a second table for negative -----
sim_table2 <- matrix(NA,
                     nrow = length(samplesize_values) * nsim,
                     ncol = length(unique(DF_RELneg$item)))
# make it a data frame
sim_table2 <- as.data.frame(sim_table2)
# add a place for sample size values
sim_table2$sample_size <- NA
sim_table2$var <- "Response"

# make a third table for unrelated -----
sim_table3 <- matrix(NA,
                     nrow = length(samplesize_values) * nsim,
                     ncol = length(unique(DF_UNREL$item)))
# make it a data frame
sim_table3 <- as.data.frame(sim_table3)
# add a place for sample size values
sim_table3$sample_size <- NA
sim_table3$var <- "Response"
iterate <- 1
for (p in 1:nsim){
  # loop over sample size
  for (i in 1:length(samplesize_values)){

    # related positive temp variables ----
    temp_RELpos <- DF_RELpos %>%
      dplyr::group_by(item) %>%
      dplyr::sample_n(samplesize_values[i], replace = T) %>%
      dplyr::summarize(se1 = sd(Response)/sqrt(length(Response)))

    # put in table
    colnames(sim_table)[1:length(unique(DF_RELpos$item))] <- temp_RELpos$item
    sim_table[iterate, 1:length(unique(DF_RELpos$item))] <- temp_RELpos$se1
    sim_table[iterate, "sample_size"] <- samplesize_values[i]
    sim_table[iterate, "nsim"] <- p

    # related negative temp variables ----
    temp_RELneg <- DF_RELneg %>%
      dplyr::group_by(item) %>%
      dplyr::sample_n(samplesize_values[i], replace = T) %>%
      dplyr::summarize(se2 = sd(Response)/sqrt(length(Response)))

    # put in table
    colnames(sim_table2)[1:length(unique(DF_RELneg$item))] <- temp_RELneg$item
    sim_table2[iterate, 1:length(unique(DF_RELneg$item))] <- temp_RELneg$se2
    sim_table2[iterate, "sample_size"] <- samplesize_values[i]
    sim_table2[iterate, "nsim"] <- p

    # unrelated temp variables ----
    temp_UNREL <- DF_UNREL %>%
      dplyr::group_by(item) %>%
      dplyr::sample_n(samplesize_values[i], replace = T) %>%
      dplyr::summarize(se3 = sd(Response)/sqrt(length(Response)))

    # put in table
    colnames(sim_table3)[1:length(unique(DF_UNREL$item))] <- temp_UNREL$item
    sim_table3[iterate, 1:length(unique(DF_UNREL$item))] <- temp_UNREL$se3
    sim_table3[iterate, "sample_size"] <- samplesize_values[i]
    sim_table3[iterate, "nsim"] <- p

    iterate <- iterate + 1
  }
}
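As an illustrative check (not part of the original analysis), the simulated standard errors can be averaged across items and iterations to see how they shrink as the sample size grows; the REL+ table is shown, and the same pattern applies to `sim_table2` and `sim_table3`.

# illustrative check: average simulated SE at each sample size (REL+ condition)
sim_table %>%
  tidyr::pivot_longer(cols = -c(sample_size, var, nsim)) %>%
  dplyr::group_by(sample_size) %>%
  dplyr::summarize(mean_se = mean(value)) %>%
  head()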
Minimum Sample Size
Suggestions for REL+ Condition:
# calculate the stopping-rule cutoff: the 40th percentile of the item-level SEs
cutoff <- quantile(SE1, probs = .4)
final_sample <- sim_table %>%
  pivot_longer(cols = -c(sample_size, var, nsim)) %>%
  dplyr::rename(item = name, se = value) %>%
  dplyr::group_by(sample_size, var, nsim) %>%
  dplyr::summarize(percent_below = sum(se <= cutoff)/length(unique(DF_RELpos$item))) %>%
  ungroup() %>%
  # then summarize all down averaging percents
  dplyr::group_by(sample_size, var) %>%
  summarize(percent_below = mean(percent_below)) %>%
  dplyr::arrange(percent_below) %>%
  ungroup()
#> `summarise()` has grouped output by 'sample_size', 'var'. You can override
#> using the `.groups` argument.
#> `summarise()` has grouped output by 'sample_size'. You can override using the
#> `.groups` argument.
flextable(final_sample %>% head()) %>% autofit()
| sample_size | var | percent_below |
|---|---|---|
| 35 | Response | 0.000 |
| 25 | Response | 0.005 |
| 30 | Response | 0.010 |
| 40 | Response | 0.015 |
| 50 | Response | 0.050 |
| 45 | Response | 0.055 |
final_table_pos <- calculate_correction(
  proportion_summary = final_sample,
  pilot_sample_size = length(unique(DF_RELpos$ssID)),
  proportion_variability = cutoff_relpos$prop_var
)

flextable(final_table_pos) %>%
  autofit()
| percent_below | sample_size | corrected_sample_size |
|---|---|---|
| 81.0 | 105 | 70.33792 |
| 88.5 | 115 | 77.34137 |
| 91.0 | 125 | 84.39584 |
| 95.5 | 130 | 88.08384 |
Suggestions for REL- Condition:
cutoff <- quantile(SE2, probs = .4)
final_sample2 <- sim_table2 %>%
  pivot_longer(cols = -c(sample_size, var, nsim)) %>%
  dplyr::rename(item = name, se = value) %>%
  dplyr::group_by(sample_size, var, nsim) %>%
  dplyr::summarize(percent_below = sum(se <= cutoff)/length(unique(DF_RELneg$item))) %>%
  ungroup() %>%
  # then summarize all down averaging percents
  dplyr::group_by(sample_size, var) %>%
  summarize(percent_below = mean(percent_below)) %>%
  dplyr::arrange(percent_below) %>%
  ungroup()
#> `summarise()` has grouped output by 'sample_size', 'var'. You can override
#> using the `.groups` argument.
#> `summarise()` has grouped output by 'sample_size'. You can override using the
#> `.groups` argument.
flextable(final_sample2 %>% head()) %>% autofit()
| sample_size | var | percent_below |
|---|---|---|
| 25 | Response | 0.065 |
| 30 | Response | 0.065 |
| 35 | Response | 0.075 |
| 45 | Response | 0.105 |
| 40 | Response | 0.110 |
| 50 | Response | 0.130 |
final_table_neg <- calculate_correction(
  proportion_summary = final_sample2,
  pilot_sample_size = length(unique(DF_RELneg$ssID)),
  proportion_variability = cutoff_relneg$prop_var
)

flextable(final_table_neg) %>%
  autofit()
| percent_below | sample_size | corrected_sample_size |
|---|---|---|
| 82.5 | 95 | 62.03046 |
| 86.0 | 105 | 69.52984 |
| 90.0 | 110 | 73.23150 |
| 98.0 | 120 | 80.94565 |
Suggestions for UNREL Condition:
cutoff <- quantile(SE3, probs = .4)
final_sample3 <- sim_table3 %>%
  pivot_longer(cols = -c(sample_size, var, nsim)) %>%
  dplyr::rename(item = name, se = value) %>%
  dplyr::group_by(sample_size, var, nsim) %>%
  dplyr::summarize(percent_below = sum(se <= cutoff)/length(unique(DF_UNREL$item))) %>%
  ungroup() %>%
  # then summarize all down averaging percents
  dplyr::group_by(sample_size, var) %>%
  summarize(percent_below = mean(percent_below)) %>%
  dplyr::arrange(percent_below) %>%
  ungroup()
#> `summarise()` has grouped output by 'sample_size', 'var'. You can override
#> using the `.groups` argument.
#> `summarise()` has grouped output by 'sample_size'. You can override using the
#> `.groups` argument.
flextable(final_sample3 %>% head()) %>% autofit()
| sample_size | var | percent_below |
|---|---|---|
| 25 | Response | 0.030 |
| 35 | Response | 0.055 |
| 30 | Response | 0.060 |
| 40 | Response | 0.120 |
| 45 | Response | 0.125 |
| 50 | Response | 0.200 |
final_table_unrel <- calculate_correction(
  proportion_summary = final_sample3,
  pilot_sample_size = length(unique(DF_UNREL$ssID)),
  proportion_variability = cutoff_unrel$prop_var
)

flextable(final_table_unrel) %>%
  autofit()
| percent_below | sample_size | corrected_sample_size |
|---|---|---|
| 81.0 | 105 | 69.89133 |
| 86.5 | 115 | 76.90026 |
| 90.0 | 125 | 83.91580 |
| 96.5 | 150 | 100.77698 |
Based on these simulations, we can decide our minimum sample size by examining all three conditions at the point where 80% of items fall below the criterion: approximately n = 70 for REL+, n = 62 for REL-, and n = 70 for UNREL. These estimates are all very similar, and we should select the largest one (n = 70).
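As a final illustrative sketch (not part of the original analysis), this recommendation can also be pulled directly from the three correction tables; the helper `min_n()` below is hypothetical and simply takes the smallest corrected sample size that reaches at least 80% of items below the cutoff in each condition.

# illustrative: smallest corrected sample size reaching at least 80% of items below the cutoff
min_n <- function(tbl) {
  round(min(tbl$corrected_sample_size[tbl$percent_below >= 80]))
}
max(min_n(final_table_pos), min_n(final_table_neg), min_n(final_table_unrel))
# based on the tables above, this gives approximately 70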