Computational History - HIST2025

Assignment: Counting words

This assignment explores the State of the Union Addresses that the president of the United States has delivered annually since 1790. Each of these speeches constitutes an important source of information about the US political agenda and the wider socio-economic and cultural context surrounding it.

As an illustration, please read the text included in the link below that records the address that Woodrow Wilson gave in December of 1913:

State of the Union Presidential Address - Woodrow Wilson - December 2, 1913

The full corpus contains 235 texts (the speeches delivered between 1790 and 2024, totalling almost 1.7 million words). This information has been gathered into a .csv file and can be downloaded here. Comma-separated (.csv) files separate the different pieces of information using commas as delimiters: the name of the president delivering the speech (president), the year the speech was delivered (year) and the (whole) text of the speech itself (text). The first row gives the names of these variables and each remaining row corresponds to one observation (speech) in the dataset.
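As an illustration, the first lines of such a file look roughly like this (content abbreviated):

```
president,year,text
Joseph R. Biden,2024,"[Before speaking, the President presented..."
Joseph R. Biden,2023,"The President. Mr. Speaker..."
```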

We will use computational text analysis to shed light on the contents of these speeches, including how they have changed over time or how they differ between Democrat and Republican presidents.

Extract the information requested below and interpret your results. Present your analysis as a PDF file using Quarto and submit it via Blackboard (deadline: March 12).

  1. Explore what the data set looks like. How many observations does it contain? What is the unit of analysis? What kind of information does it report about each observation? How many speeches do we have each year? Get familiar with the data set by using count() or its visual equivalents: ggplot() plus geom_bar() for qualitative dimensions or geom_histogram() for numerical variables.

  2. Let’s imagine that we are interested in exploring the importance of terms referring to women in these speeches. Has this changed over time? Does the pattern change once we take into account that the length of the speeches may also have changed over time? Tip: some years have 2 speeches, so you will need to decide how to group them using group_by() and summarise().

  3. Do the same with “education” as a topic. To do so, construct a dictionary of words related to education, such as “education”, “school”, “student” and “teacher”, plus any other word you think might also be important.

  4. Let’s now explore the role of political affiliation in the patterns we have observed so far: who mentions these terms more often, Democrat or Republican presidents? The problem is that our dataset does not have a column identifying which political party these presidents belonged to, so you have to construct it yourself. Tip: ask an AI tool for a list of presidents / party and merge it with our data set using full_join().

  5. Going back to the topic of the presence (or absence) of women in the speeches, what is the context in which terms referring to women appear?

  6. Which locations are mentioned more often? Does the geographical scope of the speeches change over time? What about political affiliation?

Warning: Some commands that work well when interacting directly with Quarto may cause trouble when rendering, especially to PDF. For instance, the command View(data) opens the RStudio data viewer in a separate tab, which is very useful for exploring what the data frame looks like, but it may conflict with rendering since this additional tab cannot be rendered. If you want to show the reader what the data looks like, just type the name of the object in a code chunk (potentially in combination with print() and perhaps also select() to only look into particular fields). Printing very long textual fields might also be problematic, so you may want to restrict the number of characters you print using, for instance, str_sub(1, 200).
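For instance, a render-safe inspection might look like the sketch below, which uses a hypothetical toy data frame (named toy) rather than the real data:

```r
library(tidyverse)

# Toy stand-in for the speeches data (hypothetical content):
toy <- tibble(president = c("A", "B"),
              year = c(1800, 1801),
              text = c(strrep("word ", 200), "A short speech."))

# Render-safe inspection: no View(); truncate the long text field instead
toy |>
  mutate(text = str_sub(text, 1, 200)) |>
  print()
```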


Solution

Let’s start by importing the data set and exploring what it looks like (as well as clearing the environment and loading the tidyverse package). Quarto documents do not need to set the working directory: it is automatically set to the folder where the .qmd file is located (the root folder). Given that I have the file named sotu-texts.csv in the folder data-assign/sotu, the code below reads it into R.1 I also type the name of the newly created object data to inspect its contents.2

1 The function read_csv() forms part of the tidyverse, so it becomes available when you load the tidyverse. Defining the locale helps make sure the characters are properly encoded and recognised.

2 The symbol <- (called assignment operator) takes the object that read_csv() imports and stores it in the environment with the name you indicate (data in this case).

Show code
rm(list=ls())
library(tidyverse)
data <- read_csv("data-assign/sotu/sotu-texts.csv",
                 locale = locale(encoding = "UTF-8"))
data
# A tibble: 235 × 3
   president        year text                                                   
   <chr>           <dbl> <chr>                                                  
 1 Joseph R. Biden  2024 "\n[Before speaking, the President presented his prepa…
 2 Joseph R. Biden  2023 "\nThe President. Mr. Speaker——\n[At this point, the P…
 3 Joseph R. Biden  2022 "\nThe President. Thank you all very, very much. Thank…
 4 Joseph R. Biden  2021 "\nThe President. Thank you. Thank you. Thank you. Goo…
 5 Donald J. Trump  2020 "\nThe President. Thank you very much. Thank you. Than…
 6 Donald J. Trump  2019 "\nThe President. Madam Speaker, Mr. Vice President, M…
 7 Donald J. Trump  2018 "\nThe President. Mr. Speaker, Mr. Vice President, Mem…
 8 Donald J. Trump  2017 "\nThank you very much. Mr. Speaker, Mr. Vice Presiden…
 9 Barack Obama     2016 "\nThank you. Mr. Speaker, Mr. Vice President, Members…
10 Barack Obama     2015 "\nThe President. Mr. Speaker, Mr. Vice President, Mem…
# ℹ 225 more rows

This data frame contains 235 rows (observations, referring to individual speeches) and 3 columns containing information about those speeches: the year and the president delivering each speech (year and president) and the text of the speech itself (text). The unit of analysis is therefore the speech. We could read the text of those speeches ourselves. The following, for instance, prints the first 200 characters from the speech from 2014,3 a relatively recent one. You could read a sample of them to get a sense of their contents and better ground your (computational) analysis.4

3 From the package stringr, the function str_sub() extracts a part of a string (called a substring). The numbers in the arguments indicate to extract from the first to the 200th character.

4 Note that the sequence “\n” is repeated throughout the text. These are newline characters that mark line breaks in the text, something that is encoded this way when text is stored in some data formats such as .csv files.

Show code
speech_2014 <- data |> filter(year==2014)

speech_2014$text |>
  str_sub(1,200)
[1] "\nThe President. Mr. Speaker, Mr. Vice President, Members of Congress, my fellow Americans: Today in America, a teacher spent extra time with a student who needed it and did her part to lift America's "

Using count() on the field president gives us the number of speeches each president has delivered. These speeches take place annually, so the number we obtain also reflects the number of years those presidents occupied the White House. Note also that the result of this function is a data frame (a tibble) with 43 rows. This is the number of unique categories contained in the field president, so it is the total number of (unique) presidents in the data set (each of them delivering a varying number of speeches).

Show code
data |>
  count(president)
# A tibble: 43 × 2
   president                 n
   <chr>                 <int>
 1 Abraham Lincoln           4
 2 Andrew Jackson            8
 3 Andrew Johnson            4
 4 Barack Obama              8
 5 Benjamin Harrison         4
 6 Calvin Coolidge           6
 7 Chester A. Arthur         4
 8 Donald J. Trump           4
 9 Dwight D. Eisenhower      9
10 Franklin D. Roosevelt    11
# ℹ 33 more rows

We could also inspect the variable year. Let’s do it using a histogram. Note that binwidth is set to 1, so the width of the columns is exactly 1 year.5 Our expectation is to have one speech per year but, as shown below, there are years with two speeches. These are quite rare: we only have three years with 2 speeches. It also seems that there are a few blank years, that is, years with no speech at all.

5 The function scale_y_continuous() formats the ticks in the y-axis, so they make more sense (we don’t have 0.5 or 1.5 speeches in a year).

Show code
data |>
  ggplot(aes(x = year)) + geom_histogram(binwidth = 1) +
  scale_y_continuous(breaks = seq(0, 2, 1))

What’s going on here? Let’s look at those years in more detail. First, we see some blanks in the plot above, which means that there were probably no speeches in those years. In practice, the data set does not have a row referring to those years. We could visually inspect a list of years to identify the missing ones (especially knowing that they are probably between 1930 and 1980). However, it might be useful to have one row for every possible year even if there is no speech that year, which implies leaving the other fields (president and text) as missing values, NA. This is done with the function complete(), indicating the column we are looking into and the range of years we are considering (the minimum and the maximum year of our period of study). We can then identify which years have missing information in the accompanying fields using filter(): 1933, 1945 and 1973. This might be because there was no speech that year or because this particular data set missed it (or misclassified it) for whatever reason. A bit of further research would clarify this (if there were speeches in those years, we should add them to the data set, but let’s assume everything is fine).

Show code
data <- data |>
  complete(year = 1790:2024) 

data |> filter(is.na(president))
# A tibble: 3 × 3
   year president text 
  <dbl> <chr>     <chr>
1  1933 <NA>      <NA> 
2  1945 <NA>      <NA> 
3  1973 <NA>      <NA> 

Let’s now identify the years with more than one speech. We can do so by grouping speeches by year and then summing up how many we have each year. Identifying the years with 2 (or more) speeches is then straightforward. It seems that, occasionally, outgoing and incoming presidents have given successive State of the Union addresses within the same year, for instance Truman’s final message and Eisenhower’s first message in 1953.

Show code
data |>
  group_by(year) |>
  summarise(obs = n()) |>
  filter(obs>=2)
# A tibble: 3 × 2
   year   obs
  <dbl> <int>
1  1953     2
2  1961     2
3  1981     2

Having years with multiple speeches will force us to make decisions when conducting analyses looking at change over time. We will talk about that below.


Let’s now explore the importance of terms referring to women in these speeches. The function str_count() computes the number of times that a particular pattern (sequence of characters) is found in the texts, so we can use it in combination with mutate() to create a column doing exactly that. The code below uses some basic regular expressions to detect both upper- and lower-case letters at the beginning of the term and an “a” or an “e” later on to capture both “woman” and “women”.6 Inspecting the revised data frame shows a column named women with these results.

6 This analysis could also be done by tokenising the data first, lemmatising it, and then identifying the lemmas that refer to women.

Show code
data <- data |>
  mutate(women = str_count(text, "[Ww]om[ae]n"))
data
# A tibble: 238 × 4
    year president         text                                            women
   <dbl> <chr>             <chr>                                           <int>
 1  1790 George Washington "\nFellow-Citizens of the Senate and House of …     0
 2  1791 George Washington "\nFellow-Citizens of the Senate and House of …     0
 3  1792 George Washington "\nFellow-Citizens of the Senate and House of …     0
 4  1793 George Washington "\nFellow-Citizens of the Senate and House of …     0
 5  1794 George Washington "\nFellow-Citizens of the Senate and House of …     0
 6  1795 George Washington "\nFellow-Citizens of the Senate and House of …     1
 7  1796 George Washington "\nFellow-Citizens of the Senate and House of …     0
 8  1797 John Adams        "\nGentlemen of the Senate and Gentlemen of th…     0
 9  1798 John Adams        "\nGentlemen of the Senate and Gentlemen of th…     0
10  1799 John Adams        "\nGentlemen of the Senate and Gentlemen of th…     0
# ℹ 228 more rows

Note that the pattern used above, “[Ww]om[ae]n”, also captures things like “womanhood” and the like, which may align well with what we are doing here. We could nonetheless expand this analysis to include other terms explicitly referring to women, such as female/females, lady/ladies or girl/girls, or to their family positions, such as mother/mothers, wife/wives or daughter/daughters. Another option is to focus on pronouns (she, her, hers, herself),7 which would capture indirect references to women.8 You can either construct different columns and then sum them up or use the “or” operator (|) within str_count().9 We do the former because having different columns for each term may provide distinct information about the processes at play (it is not the same to refer to women as “women” as it is to refer to them as “girls” or “mothers”). For simplicity, we only focus on three terms, but the same pipeline could easily be extended to include others. Note that we also compute the total by simply summing up those three columns.

7 Be careful though, because the pattern “her” will capture many other things, such as hereditary, herring, etc.

8 These references, however, could refer to particular women, not to women in general. In this regard, we could also look for female names using named-entity recognition tools.

9 Something like this: str_count(text, "[Ww]om[ae]n|[Ff]emale|[Gg]irl").

Show code
data <- data |>
  mutate(women = str_count(text, "[Ww]om[ae]n"),
         female = str_count(text, "[Ff]emale"),
         girl = str_count(text, "[Gg]irl"),
         total = women + female + girl)
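As a self-contained toy check of the single-column alternative mentioned above (combining the patterns with the “or” operator), consider this hypothetical sentence:

```r
library(stringr)

# Each of the four target terms appears once in this toy string:
txt <- "Women and girls; a female teacher; one woman."
n_matches <- str_count(txt, "[Ww]om[ae]n|[Ff]emale|[Gg]irl")
n_matches  # 4: "Women", "girls", "female" and "woman" each match once
```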

We are now in a position to start tracing changes over time. In order not to over-complicate things, we are only going to plot the information contained in the column women; depicting the other columns would be equivalent, so feel free to do it yourself.10 We should however take into account that some years have 2 speeches, so we need to decide how to treat those years, since we need to somehow group them into one single observation. We do so using group_by() and summarise() to compute the average number of times women are mentioned in the speeches delivered that year,11 but another solution would have been to add up the mentions in both speeches (so we would type women = sum(women) instead).

10 In fact, the lines for women and total track each other well, so it seems that the terms captured by “[Ww]om[ae]n” make up most of the mentions of women.

11 That is, if one speech mentions women twice and the other mentions women four times, the average would be 3 times.

Show code
data |>
  group_by(year) |>
  summarise(women = mean(women)) |>
  ggplot(aes(x = year, y = women)) +
  geom_point() + geom_line()   

The pattern observed above broadly corresponds to what the evolution of women’s status in society may look like: there are hardly any mentions of women in the speeches during most of the 19th century. Although the term becomes a bit more visible during the 20th century, it is only during the last 20/25 years that we can see a clear increasing trend. There are a few outliers, for instance in the early 20th century, the 1940s and the early 1980s, when particular speeches seem to have mentioned women more often than usual. It would be interesting to identify them and explore in which ways they were different.

Importantly, the number of times a particular term shows up in a speech might be related to the length of the text itself: longer speeches offer more opportunities to refer to women. It is therefore important to relativise the frequency measure by the length of the speech. The code below creates a new column that counts the number of words in a speech using a particular regular expression.12 We also plot the results to get a better idea of what is going on with this dimension: speeches in fact grew longer throughout the 19th century. Their length, however, was dramatically reduced (roughly) after WWI, with some remarkable exceptions, something which again would probably deserve further attention.

12 The pattern searches for groups of alphabetic characters without a space, punctuation or line break in between. An alternative would be “\\w+”, which searches for a sequence of one or more word characters.

Show code
data <- data |>
  mutate(word_count = str_count(text, "[:alpha:]+"))
data |>
  ggplot(aes(x = year, y = word_count)) + geom_line()

We can now construct a relativised measure of the frequency of women in these speeches. Do not forget to first take into account that some years have 2 speeches. Notice that our relative frequency is computed by dividing the absolute frequency (mentions of women) by the length of the speech (in words) and multiplying by 1,000, which means that our indicator measures the number of times women are mentioned per 1,000 words. It is useful to include this information in the graph itself. As well as dots and lines, we also depict the underlying trend behind the data using geom_smooth().13 The general pattern has not changed that much, but the relative importance of particular years sometimes has.

13 We set the argument se (standard errors) to FALSE to avoid plotting the confidence intervals, something we haven’t covered yet. Notice that the degree of smoothing can be adjusted with the argument span, assigning values between 0 and 1: lower values produce wigglier lines, thus better capturing short-term, instead of smoother, long-term, patterns. Set it, for instance, to span = 0.25 to see what happens.

Show code
data |>
  group_by(year) |>
  summarise(women = mean(women),
            word_count = mean(word_count)) |>
  mutate(women_rel = 1000*women/word_count) |>
  ggplot(aes(x = year, y = women_rel)) +
  geom_point() + geom_line() + geom_smooth(se = FALSE) +
  ylab("Frequency (per 1,000 words)")


We are now going to conduct a similar analysis and trace the temporal evolution of the importance of “education” as a topic. Given that there are many terms that can refer to education, we construct a dictionary of words, such as “school”, “student” or “teacher”, as well as “education” itself (feel free to add others you think might also be important). To do so, we create a data frame, a tibble, using the function tibble(). This data frame will have two columns, one named word and another named dictionary. While the values of the first column are the words we are interested in (given in the form of a list),14 the values of the second column are all the same (“educ”), meaning that these words refer to the topic of education. We could have compared this to a different topic, such as health, by creating a dictionary with terms relating to health: doctor, nurse, medicine, etc. Always make sure to check that what you have done looks as it should.

14 That is, a sequence of values combined into a vector with c() (note that, strictly speaking, R reserves the word “list” for a different data structure).

Show code
educ_dict <- tibble(
  word = c("education", "school", "student", "teacher"),
  dictionary = "educ")
educ_dict
# A tibble: 4 × 2
  word      dictionary
  <chr>     <chr>     
1 education educ      
2 school    educ      
3 student   educ      
4 teacher   educ      

We can now count the number of times these terms show up in the speeches. Doing it directly, however, would miss terms like “teachers” or “schooling”, or those starting with capital letters (“Education”). We could expand the list to cover all these possibilities, but an easier solution is to tokenise the speeches first using unnest_tokens() from the tidytext package. As well as removing all punctuation and splitting the text into individual words, this function converts everything to lowercase, thus addressing one of the issues above. In addition, having the terms structured this way allows us to easily lemmatise them using lemmatize_words() from the package textstem, so the other issues are also mitigated (“schooling” becomes “school”). The resulting data frame contains almost 1.9 million rows, that is, all the tokens that make up the corpus. Crucially, the metadata about the original speeches (year and president) is preserved, so we know which year and which president each of those words belongs to.

Show code
library(tidytext)
library(textstem)

data_token <- data |>
  select(year, president, text) |>
  unnest_tokens(input = text, output = word) |>
  mutate(lemma = lemmatize_words(word))

We can now join the two objects together, the tokenised version of the speeches and the dictionary with the words referring to the topic of education, using full_join().15 Note that the columns we want to join on have the same name in both objects, which facilitates the matching because R recognises that it is the only column shared by both objects.16

15 Using inner_join() instead would remove any terms not present in the dictionary, which prevents computing relative frequencies (based on the total number of tokens in each speech) unless you compute that beforehand.

16 When the names are different, you can indicate the matching columns manually using the by argument: by = c("col1" = "col2").

Show code
data_token |>
  full_join(educ_dict)
# A tibble: 1,899,259 × 5
    year president         word            lemma          dictionary
   <dbl> <chr>             <chr>           <chr>          <chr>     
 1  1790 George Washington fellow          fellow         <NA>      
 2  1790 George Washington citizens        citizen        <NA>      
 3  1790 George Washington of              of             <NA>      
 4  1790 George Washington the             the            <NA>      
 5  1790 George Washington senate          senate         <NA>      
 6  1790 George Washington and             and            <NA>      
 7  1790 George Washington house           house          <NA>      
 8  1790 George Washington of              of             <NA>      
 9  1790 George Washington representatives representative <NA>      
10  1790 George Washington i               i              <NA>      
# ℹ 1,899,249 more rows

As evident above, this procedure adds the other columns of the object we are merging to the original data frame. In this case, the other column in the object educ_dict was dictionary. R goes through all the terms in the tokenised version of the speeches and assigns the value “educ” if the terms match and NA (missing) otherwise. To work with this more easily, it is better to create a dummy variable assigning 1s or 0s depending on whether the term was in the dictionary or not. Once that’s done, computing relative frequencies simply relies on what we already know to compute the number of occurrences per thousand words.17 We implement the full process below. Feel free to check what the code is doing step by step:

17 We compute the number of tokens (rows) per year and the number of terms referring to education, so we can put these two pieces of information together. Note that, given that the column educ is composed of 0s and 1s, this is equivalent to computing the average times 1,000: educ = mean(educ)*1000.

Show code
data_token |>
  full_join(educ_dict) |>
  mutate(educ = if_else(is.na(dictionary), 0, 1)) |>
  group_by(year) |>
  summarise(total = n(),
            educ = sum(educ),
            educ_rel = 1000*educ/total) |>
  ggplot(aes(x = year, y = educ_rel)) +
  geom_point() + geom_line() + 
  geom_smooth(se = FALSE, span = 0.25)

The results are interesting and correspond to the idea that the importance that governments place on educating the population has increased over time, although the last years show a declining trend. We should also bear in mind that schooling was funded at the local and state level for a long time, and it was only during the 20th century that the federal government started to intervene in education. Also, bear in mind that we have used only a few words as a dictionary for the topic of education. This could be expanded to cover the topic more thoroughly. We should be careful when working with historical texts, since other terms might have been used in the past to refer to education (“pupil” instead of “student”, for instance), and we may have missed others that are used more often in recent times (such as “university”, “college”, “human capital”, “kindergarten”, etc.). Integrating these terms would probably make the increasing trend even more pronounced. Likewise, you might be wondering whether the words in the object educ_dict should also be lemmatised: in our case they don’t need to be because they already constitute basic lemmas (check it yourself by applying the function lemmatize_words() to the words in the dictionary), but it might be necessary in other settings.
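To verify that the dictionary entries are indeed already lemmas, we can apply lemmatize_words() to them directly (a quick check, assuming the textstem package is installed):

```r
library(textstem)

# The dictionary terms should come back unchanged if they are base lemmas:
dict_words <- c("education", "school", "student", "teacher")
lemmatize_words(dict_words)
```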


Let’s now explore the role of political affiliation in the patterns we have observed so far: who mentions these terms more often, Democrat or Republican presidents? The problem is that our data set does not have a column identifying which political party these presidents belonged to, so we have to construct it ourselves. The idea is to have a data frame containing the names of the presidents and then attach information on their political affiliation. We could do the research and fill in the gaps ourselves. To save time, we can outsource this task to a research assistant or, nowadays, to an AI tool. Let’s first create a list with the presidents:

Show code
presidents <- data |>
  count(president) |> select(president)
presidents
# A tibble: 44 × 1
   president            
   <chr>                
 1 Abraham Lincoln      
 2 Andrew Jackson       
 3 Andrew Johnson       
 4 Barack Obama         
 5 Benjamin Harrison    
 6 Calvin Coolidge      
 7 Chester A. Arthur    
 8 Donald J. Trump      
 9 Dwight D. Eisenhower 
10 Franklin D. Roosevelt
# ℹ 34 more rows

We can either export this object to a .csv file (which we will store in the folder data) that we can feed to the AI, or copy and paste the names directly into the AI agent.

Show code
write_csv(presidents, "data/presidents.csv")

The AI has done its part. We now have either a revised file containing two columns, president and party (which we need to read into R), or the information itself, so we can create the object ourselves by copying and pasting, as we have done below using the input from the AI. The agent also explained that it included the following party labels: Federalist, Democratic-Republican, Democratic, Republican, Whig and None (for George Washington, who was not formally affiliated with a party). We should double-check that this classification is historically appropriate (and correct it if necessary).

Show code
presidents <- tibble(
  president = c(
    "Abraham Lincoln","Andrew Jackson","Andrew Johnson","Barack Obama",
    "Benjamin Harrison","Calvin Coolidge","Chester A. Arthur","Donald J. Trump",
    "Dwight D. Eisenhower","Franklin D. Roosevelt","Franklin Pierce",
    "George Bush","George W. Bush","George Washington","Gerald R. Ford",
    "Grover Cleveland","Harry S Truman","Herbert Hoover","James Buchanan",
    "James K. Polk","James Madison","James Monroe","Jimmy Carter",
    "John Adams","John F. Kennedy","John Quincy Adams","John Tyler",
    "Joseph R. Biden","Lyndon B. Johnson","Martin Van Buren","Millard Fillmore",
    "Richard M. Nixon","Ronald Reagan","Rutherford B. Hayes",
    "Theodore Roosevelt","Thomas Jefferson","Ulysses S. Grant",
    "Warren G. Harding","William Howard Taft","William J. Clinton",
    "William McKinley","Woodrow Wilson","Zachary Taylor"),
  party = c(
    "Republican","Democratic","Democratic","Democratic",
    "Republican","Republican","Republican","Republican",
    "Republican","Democratic","Democratic",
    "Republican","Republican","None","Republican",
    "Democratic","Democratic","Republican","Democratic",
    "Democratic","Democratic-Republican","Democratic-Republican","Democratic",
    "Federalist","Democratic","Democratic-Republican","Whig",
    "Democratic","Democratic","Democratic","Whig",
    "Republican","Republican","Republican",
    "Republican","Democratic-Republican","Republican",
    "Republican","Republican","Democratic",
    "Republican","Democratic","Whig"))

presidents
# A tibble: 43 × 2
   president             party     
   <chr>                 <chr>     
 1 Abraham Lincoln       Republican
 2 Andrew Jackson        Democratic
 3 Andrew Johnson        Democratic
 4 Barack Obama          Democratic
 5 Benjamin Harrison     Republican
 6 Calvin Coolidge       Republican
 7 Chester A. Arthur     Republican
 8 Donald J. Trump       Republican
 9 Dwight D. Eisenhower  Republican
10 Franklin D. Roosevelt Democratic
# ℹ 33 more rows

I am going to assume that the list above is accurate. The next step is then to join this object to the original data using full_join(). The objects data and presidents both have a column named president, so the matching is straightforward:

Show code
data <- data |>
  full_join(presidents)

data |> select(year, president, party, text)
# A tibble: 238 × 4
    year president         party      text                                      
   <dbl> <chr>             <chr>      <chr>                                     
 1  1790 George Washington None       "\nFellow-Citizens of the Senate and Hous…
 2  1791 George Washington None       "\nFellow-Citizens of the Senate and Hous…
 3  1792 George Washington None       "\nFellow-Citizens of the Senate and Hous…
 4  1793 George Washington None       "\nFellow-Citizens of the Senate and Hous…
 5  1794 George Washington None       "\nFellow-Citizens of the Senate and Hous…
 6  1795 George Washington None       "\nFellow-Citizens of the Senate and Hous…
 7  1796 George Washington None       "\nFellow-Citizens of the Senate and Hous…
 8  1797 John Adams        Federalist "\nGentlemen of the Senate and Gentlemen …
 9  1798 John Adams        Federalist "\nGentlemen of the Senate and Gentlemen …
10  1799 John Adams        Federalist "\nGentlemen of the Senate and Gentlemen …
# ℹ 228 more rows

We can now proceed and implement the analysis, recycling some of the code above. We can, for instance, redo the analysis of the importance of topics related to education. Instead of grouping by year, we now group by party. To keep things simple, we also restrict the analysis to presidents from the Republican or Democratic parties.

Show code
data |>
  unnest_tokens(input = text, output = word) |>
  mutate(lemma = lemmatize_words(word)) |>
  full_join(educ_dict) |>
  mutate(educ = if_else(is.na(dictionary), 0, 1)) |>
  filter(party %in% c("Republican", "Democratic")) |>
  group_by(party) |>
  summarise(educ = mean(educ)*1000)
# A tibble: 2 × 2
  party       educ
  <chr>      <dbl>
1 Democratic 0.629
2 Republican 0.590

Based on the terms chosen above, it seems that the presidents belonging to the Democratic party talk more about education than their peers in the Republican party, but the difference is not very large. We could refine our analysis by checking whether these differences have always been there or are a relatively recent phenomenon (replicating the analysis but focusing, for instance, on the post-1945 period or even more recent periods).
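Restricting the comparison to a given period only requires adding a condition on year inside filter(). The sketch below reproduces the shape of the pipeline on a hypothetical toy corpus (three invented one-line "speeches"; lemmatisation is skipped here for brevity) rather than the real data:

```r
library(tibble)
library(dplyr)
library(tidytext)

# Toy stand-in (hypothetical speeches):
toy <- tibble(party = c("Democratic", "Republican", "Democratic"),
              year  = c(1900, 1950, 1960),
              text  = c("tariffs and trade dominate",
                        "education matters a lot",
                        "support our farmers today"))
toy_dict <- tibble(word = c("education", "school", "student", "teacher"),
                   dictionary = "educ")

post45 <- toy |>
  unnest_tokens(input = text, output = word) |>
  full_join(toy_dict) |>
  mutate(educ = if_else(is.na(dictionary), 0, 1)) |>
  filter(party %in% c("Republican", "Democratic"),
         year >= 1945) |>              # the only new condition
  group_by(party) |>
  summarise(educ = mean(educ) * 1000)
post45
```

The pre-1945 speech drops out before grouping, so the party averages reflect only the post-war period.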

We could do something similar for the terms woman/women. The lemma for both is woman, so filtering on the lemma in the code below captures both forms; we then plot the results.

Show code
data |>
  unnest_tokens(input = text, output = word) |>
  mutate(lemma = lemmatize_words(word)) |>
  mutate(women = if_else(lemma=="woman", 1, 0)) |>
  filter(party %in% c("Republican", "Democratic")) |>
  group_by(party, year) |>
  summarise(women = mean(women)*1000) |>
  ggplot(aes(x = year, y = women, col = party)) +
  geom_point(alpha = 0.5) + geom_line(alpha = 0.5) +
  geom_smooth(span = 0.5, se = FALSE)

The most striking result is the increase in the number of times Republicans have used the term “woman/women” from the 1950s onwards, clearly overtaking the Democrats. Why? What is behind these trends?


The next question goes back to the presence (or absence) of women in the speeches and asks about the context surrounding the terms referring to women. This is a very interesting issue because it allows us to learn more about how the speeches actually use these terms and therefore to qualify the previous analysis, which was based purely on term frequencies. The graphs above, for instance, show that Republican presidents are (increasingly) mentioning women more often than Democrats. But knowing the context in which they use those terms is crucial, right?

Instead of tokenising into individual tokens (unigrams), this procedure tokenises into (overlapping) n-grams.18 In the example below we use a window of 5 tokens, defined by the argument n = 5 (we have chosen the name window for the new column containing each 5-gram chunk). A wider window could obviously be more informative, but bear in mind that the further a term sits from the term of interest, the less likely it is to be informative about it. In any case, you, as the researcher, are the one to make these decisions.

18 This could also be done using the package quanteda as explained in the keywords-in-context (kwic) section in the “counting words” PDF.

Show code
data_5grams <- data |>
  select(year, president, party, text) |>
  unnest_tokens(window, text, token = "ngrams", n = 5)
data_5grams
# A tibble: 1,898,319 × 4
    year president         party window                              
   <dbl> <chr>             <chr> <chr>                               
 1  1790 George Washington None  fellow citizens of the senate       
 2  1790 George Washington None  citizens of the senate and          
 3  1790 George Washington None  of the senate and house             
 4  1790 George Washington None  the senate and house of             
 5  1790 George Washington None  senate and house of representatives 
 6  1790 George Washington None  and house of representatives i      
 7  1790 George Washington None  house of representatives i embrace  
 8  1790 George Washington None  of representatives i embrace with   
 9  1790 George Washington None  representatives i embrace with great
10  1790 George Washington None  i embrace with great satisfaction   
# ℹ 1,898,309 more rows

Once this is in place, we can identify whether those windows contain the term “[Ww]om[ae]n” (or any other pattern) or not. If they do, we can then easily list the accompanying terms. In order to do so, it is useful to first put the terms into separate columns using the function separate_wider_delim().

Show code
data_5grams <- data_5grams |>
  separate_wider_delim(window, delim = " ", names = c("g1", "g2", "g3", "g4", "g5"))
data_5grams
# A tibble: 1,898,319 × 8
    year president         party g1              g2            g3    g4    g5   
   <dbl> <chr>             <chr> <chr>           <chr>         <chr> <chr> <chr>
 1  1790 George Washington None  fellow          citizens      of    the   sena…
 2  1790 George Washington None  citizens        of            the   sena… and  
 3  1790 George Washington None  of              the           sena… and   house
 4  1790 George Washington None  the             senate        and   house of   
 5  1790 George Washington None  senate          and           house of    repr…
 6  1790 George Washington None  and             house         of    repr… i    
 7  1790 George Washington None  house           of            repr… i     embr…
 8  1790 George Washington None  of              representati… i     embr… with 
 9  1790 George Washington None  representatives i             embr… with  great
10  1790 George Washington None  i               embrace       with  great sati…
# ℹ 1,898,309 more rows

Given that the 5-gram is a moving window, we can just focus on whether the term in question shows up in the middle gram (g3) to avoid duplicates. This gives us the chance to read the context in which the term shows up, which is extremely informative. Note that the data frame is reduced to 391 rows, the number of times the terms “women” or “woman” are mentioned (out of almost 1.9 million tokens originally).

Show code
women <- data_5grams |>
  filter(g3=="women" | g3 =="woman")
women
# A tibble: 391 × 8
    year president         party      g1               g2      g3    g4    g5   
   <dbl> <chr>             <chr>      <chr>            <chr>   <chr> <chr> <chr>
 1  1795 George Washington None       of               innoce… women and   chil…
 2  1862 Abraham Lincoln   Republican indiscriminately men     women and   chil…
 3  1871 Ulysses S. Grant  Republican of               men     women and   chil…
 4  1873 Ulysses S. Grant  Republican of               americ… women marr… fore…
 5  1874 Ulysses S. Grant  Republican to               chinese women hard… a    
 6  1874 Ulysses S. Grant  Republican of               americ… women who   have 
 7  1875 Ulysses S. Grant  Republican of               americ… women who   may  
 8  1875 Ulysses S. Grant  Republican of               chinese women but   few  
 9  1875 Ulysses S. Grant  Republican importation      of      women for   ille…
10  1876 Ulysses S. Grant  Republican of               americ… women who   may  
# ℹ 381 more rows

We can now transform the data into long form to facilitate the analysis. The column g3 is actually redundant because its value is always a form of “women” (see the code above), so we can drop it. We then gather the values in the fields g1, g2, g4 and g5 (the terms placed next to woman/women) into a single column, recording in a column named order (or whatever you want) whether each term comes from g1, g2, g4 or g5, which can be useful if we want to take into account how close to g3 the terms are. The terms themselves are placed in a column automatically named value, which for convenience we rename to word.19 Lastly, we remove stop words, which are not very informative.

19 Having the data in a long format (instead of a wide format) is more flexible because we can then easily structure our analyses around the different categories. The reshaping is done with pivot_longer(), which takes the columns indicated in cols and stores their names in a new column named order (or whatever name you prefer, via names_to); the corresponding values are stored in a new column named value. Feel free to run this step on its own, before the rest of the pipeline, to understand what is happening to the original data frame.

Show code
women <- women |>
  select(-g3) |>
  pivot_longer(cols = c("g1", "g2", "g4", "g5"), 
               names_to = "order") |>
  rename(word = value) |> 
  anti_join(stop_words) 
women
# A tibble: 419 × 5
    year president         party      order word            
   <dbl> <chr>             <chr>      <chr> <chr>           
 1  1795 George Washington None       g2    innocent        
 2  1795 George Washington None       g5    children        
 3  1862 Abraham Lincoln   Republican g1    indiscriminately
 4  1862 Abraham Lincoln   Republican g5    children        
 5  1871 Ulysses S. Grant  Republican g5    children        
 6  1873 Ulysses S. Grant  Republican g2    american        
 7  1873 Ulysses S. Grant  Republican g4    marrying        
 8  1873 Ulysses S. Grant  Republican g5    foreigners      
 9  1874 Ulysses S. Grant  Republican g2    chinese         
10  1874 Ulysses S. Grant  Republican g2    american        
# ℹ 409 more rows

We can now find out which words are most common when talking about women simply by counting them. The list is very informative. Most of the time, women are mentioned in the presidential speeches in relation to terms denoting their role as mothers: children, pregnant, child, infants, etc. (you can use print(n = Inf) to see the whole list). Alongside references to the nation itself (american, america, america’s), there are also terms referring to work (employment, industry) and other topics.

Show code
women |>
  count(word, sort=TRUE)
# A tibble: 283 × 2
   word         n
   <chr>    <int>
 1 children    25
 2 uniform     17
 3 american     8
 4 pregnant     7
 5 child        6
 6 act          5
 7 america      5
 8 law          5
 9 serving      5
10 violence     5
# ℹ 273 more rows

You can also refine the analysis and perhaps focus only on the terms that are right next to women and see whether the same pattern shows up or not (having a wider window of terms would allow further analyses along these lines). Doing so, for instance, also highlights that women were often mentioned in the context of minorities (a word that also appeared relatively high in the previous list), such as “chinese” or “indian” women.

Show code
women |>
  filter(order=="g2" | order=="g4") |>
  count(word, sort=TRUE)
# A tibble: 96 × 2
   word         n
   <chr>    <int>
 1 american     7
 2 pregnant     6
 3 act          5
 4 named        4
 5 serving      4
 6 chinese      2
 7 elderly      2
 8 filled       2
 9 indian       2
10 infants      2
# ℹ 86 more rows

Further analyses could look into how these patterns have changed over time and whether they are characteristic of particular presidents or political parties. Obviously, this is only a complementary tool to close reading. Directly reading the context in which these terms were employed in the speeches would allow us to qualify these analyses, but simply counting already helps reveal patterns in the evolution of the perceived role of women in the mainstream political agenda and in society more broadly, as well as the potential influence of particular historical processes (e.g. the suffrage movement, the equal rights movements from the 1960s, etc.).
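As a sketch of that kind of breakdown, splitting the context-word counts by party is a one-line change to the count() call. The toy tibble below stands in for the long-format women data frame built above (its values are invented for illustration); with the real data you would simply run `women |> count(party, word, sort = TRUE)`.

```r
library(dplyr)

# Toy stand-in for the long-format `women` data frame (invented values)
women_toy <- tibble(
  party = c("Republican", "Republican", "Democratic", "Democratic"),
  year  = c(1950, 1990, 1950, 1990),
  word  = c("children", "serving", "children", "pregnant")
)

# Most common context words around woman/women, split by party
women_toy |>
  count(party, word, sort = TRUE)
```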

Note also that this assignment has so far focused on exploring particular topics (e.g. women, education). A more agnostic approach would search for the most common words in particular periods using top frequencies to discover other (perhaps) unnoticed patterns. This would involve unnesting tokens, removing stop words, lemmatizing and counting which tokens are most common.
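A minimal sketch of that agnostic pipeline, run on an invented one-sentence corpus (with the real data you would start from data and its text column; lemmatize_words() is assumed to come from the textstem package, as in the code above):

```r
library(dplyr)
library(tidytext)
library(textstem)  # assumed provider of lemmatize_words()

# Invented one-sentence corpus, for illustration only
toy <- tibble(text = "The senators debated the school budgets and the schools")

toy |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word") |>   # drop uninformative stop words
  mutate(lemma = lemmatize_words(word)) |>
  count(lemma, sort = TRUE)               # most common lemmas first
```

On the real corpus, adding a filter on year (or a period variable) before counting would surface the top words of each era.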


Let’s finish this assignment by studying the locations that are mentioned in the speeches. As part of a more general set of tagging tools, the package entity includes functions for named entity recognition. One of them is location_entity(), which identifies locations in a particular textual field (in our case the column text).20 Unfortunately, these tools do not work well with very long texts unless you change their default options. The code below fixes that by forcing the argument element.chunks to be at least equal to 1. Excluding those years without a speech also ensures that missing values are not an issue. These functions are computationally intensive, so the code takes a while to run (be patient, and I hope your computer does not crash).21

20 Other functions allow identifying persons (person_entity()) or organisations (organization_entity()).

21 If it crashes, you can always perform the analysis on a smaller subset of the corpus, filtering for instance to the last 25 years or something like that: filter(year>2000).

Show code
library(entity)
locations <- data |>
  select(year, president, party, text) |>
  filter(!is.na(text)) |>
  mutate(locations = location_entity(text, element.chunks = 1))

locations
# A tibble: 235 × 5
    year president         party      text                             locations
   <dbl> <chr>             <chr>      <chr>                            <entity> 
 1  1790 George Washington None       "\nFellow-Citizens of the Senat… <chr [7]>
 2  1791 George Washington None       "\nFellow-Citizens of the Senat… <chr>    
 3  1792 George Washington None       "\nFellow-Citizens of the Senat… <chr>    
 4  1793 George Washington None       "\nFellow-Citizens of the Senat… <chr>    
 5  1794 George Washington None       "\nFellow-Citizens of the Senat… <chr>    
 6  1795 George Washington None       "\nFellow-Citizens of the Senat… <chr>    
 7  1796 George Washington None       "\nFellow-Citizens of the Senat… <chr>    
 8  1797 John Adams        Federalist "\nGentlemen of the Senate and … <chr>    
 9  1798 John Adams        Federalist "\nGentlemen of the Senate and … <chr>    
10  1799 John Adams        Federalist "\nGentlemen of the Senate and … <chr>    
# ℹ 225 more rows

As you can see, the column locations is a list-column: each of its elements is a character vector listing whatever locations the algorithm has recognised, something like c("Washington", "Ukraine"). You can extract that information using pull() and pluck().22 The code below does so for the year 1805:

22 pull() extracts the list and pluck(1) extracts the vector inside the list.

Show code
locations |>
  filter(year==1805) |>
  pull(locations) |>
  pluck(1)
 [1] "United States" "Europe"        "His"           "New"          
 [5] "Spain"         "Louisiana"     "Mississippi"   "Spain"        
 [9] "Tripoli"       "Mediterranean" "Connecticut"   "Tennessee"    
[13] "Georgia"       "Missouri"      "Mississippi"   "France"       

In order to work with the full corpus, we can use the function unnest() to structure that information as a regular tibble. This function expands a list-column (locations in our case) so that each element of the vectors inside it (each location) becomes its own row, preserving the associated metadata (year, president, party, etc.).

Show code
locations_long <- locations |>
  unnest(locations)

locations_long |> select(year, president, party, locations)
# A tibble: 18,698 × 4
    year president         party locations     
   <dbl> <chr>             <chr> <chr>         
 1  1790 George Washington None  North Carolina
 2  1790 George Washington None  United States 
 3  1790 George Washington None  Providence    
 4  1790 George Washington None  Virginia      
 5  1790 George Washington None  United States 
 6  1790 George Washington None  United States 
 7  1790 George Washington None  United States 
 8  1791 George Washington None  United States 
 9  1791 George Washington None  United States 
10  1791 George Washington None  United States 
# ℹ 18,688 more rows

We can now analyse this information. Let’s first explore how the geographical scope of the speeches has changed over time. To make things simpler, we create three periods: 1790-1913, 1914-1944 and 1945-2024. We can then count the number of times each location is mentioned within each period and keep only the 15 most common using slice_max(). The code for the graph is a bit involved because we want to present the locations ordered by popularity within each period; I would advise you to check what the code is doing step by step. The results show how the geographical scope of the US political agenda has shifted as the US evolved from being especially focused on internal affairs to becoming a global power. A more fine-grained analysis (by decade, for instance) would provide further insights.

Show code
library(tidytext)

locations_long |>
  mutate(period = case_when(
    year<1914 ~ "1790-1913", 
    year>=1914 & year<1945 ~ "1914-1944",
    year>=1945 ~ "1945-2024")) |>
  group_by(period) |>
  count(locations, sort=TRUE) |>
  slice_max(order_by = n, n = 15) |>
  ggplot(aes(x = reorder_within(locations, n, period), y = n)) +
  geom_col() +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~ period, nrow = 1, scales = "free") +
  labs(x = "Locations", y = "Frequency") + 
  theme_bw()

We can do something similar with political affiliation. The logic is the same, but instead of grouping by year (or period) we group by party. To narrow down the temporal scope, we focus on the period after 1945 and present the results as a table. The results are very interesting. One of the most notable is that “Europe” ranks much higher for the Democrats. The Republicans, in turn, seem more concerned with the Middle East (including Iraq, Iran and Afghanistan).

Show code
locations_long |>
  filter(year>1945) |>
  group_by(party) |>
  count(locations, sort=TRUE) |>
  slice_max(order_by = n, n = 15) |>
  mutate(rank = row_number()) |>
  ungroup() |>
  select(party, rank, locations, n) |>
  pivot_wider(names_from = party,
              values_from = c(locations, n)) |>
  arrange(rank) |>
  select(rank, locations_Democratic, n_Democratic, locations_Republican, n_Republican) |>
  print(n = 15)
# A tibble: 15 × 5
    rank locations_Democratic     n_Democratic locations_Republican n_Republican
   <int> <chr>                           <int> <chr>                       <int>
 1     1 America                           304 America                       288
 2     2 United States                     278 United States                 189
 3     3 Europe                            110 Iraq                           96
 4     4 Soviet Union                       89 Middle East                    54
 5     5 China                              76 Washington                     51
 6     6 Asia                               64 Afghanistan                    48
 7     7 Afghanistan                        57 Soviet Union                   36
 8     8 Vietnam                            57 Europe                         34
 9     9 Iran                               48 China                          28
10    10 Middle East                        48 Iran                           27
11    11 Washington                         46 United States of Am…           24
12    12 Korea                              43 Earth                          21
13    13 Africa                             42 Korea                          21
14    14 United States of America           41 Asia                           19
15    15 Israel                             40 Africa                         18

This is just a first taste of how to extract and analyse the importance of geographical locations in textual corpora. Note, though, that we have not taken into account differences in the length of the speeches, whether by period or by political affiliation, something that should be done before attaching too much significance to these findings. Also, we have focused on the most common locations, but it might also be interesting to look at which ones are less common, or even at the variety of locations that are mentioned (the number of unique locations can be computed using n_distinct()). Crucially, here we have reported the information about locations using graphs and tables. However, the best way to present spatial information is to draw maps, and we will learn how to do that next session.
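For instance, the variety of locations per party can be sketched as follows. The toy tibble stands in for locations_long (its values are invented for illustration); with the real data you would group locations_long in the same way.

```r
library(dplyr)

# Toy stand-in for locations_long (invented values)
loc_toy <- tibble(
  party     = c("Democratic", "Democratic", "Republican", "Republican"),
  locations = c("Europe", "China", "Europe", "Europe")
)

# Number of distinct locations mentioned by each party
loc_toy |>
  group_by(party) |>
  summarise(n_unique = n_distinct(locations))
```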


Well, this was quite a journey. Many more things could be done to enhance the analyses above, but this is more than enough. Congratulations if you have managed to work through these different exercises. As mentioned above, this way of approaching texts does not replace close reading, but it helps present evidence in a different way and find patterns that are not always visible, as well as making it possible to deal with huge corpora. Using these methods, and being creative about the way you treat the data, can indeed be a big methodological advantage for computationally-oriented historians.