Other historical datasets

Below you can find a curated set of varied historical datasets that you can use for the class assignments and your own research projects. Please make sure to read the original article describing each dataset. Digitising historical sources constitutes a huge effort, so we should especially thank (and properly acknowledge) those authors who make their datasets public.

International data

  • Hansard dataset: UK Parliamentary speeches between 1806 and 1911 (over 1 million parliamentary speeches). This is a huge dataset that might be computationally too demanding on some machines. Find here a random sample of 10,000 speeches. See Blaxill (2020) and Guldi (2023) for illustrations using this information.

  • Population censuses. IPUMS International holds individual-level information from a variety of historical population censuses (Ruggles et al. 2025). The same or even more extensive datasets can be requested from the respective national agencies.

  • Civil and parish registers. The European Historical Samples Network provides micro-data on individuals (families and households) taken from birth (and baptism), marriage and death records (and the like). This information allows following individuals through their life courses and studying topics such as fertility, mortality, age at marriage and partner choice, social mobility and migration, among others. See Alter and Mandemakers (2014) and Quaranta (2021).

  • The Proceedings of the Old Bailey (Hitchcock et al. 2023). This dataset contains the texts of 197,752 trials held at London’s central criminal court (The Old Bailey) between 1674 and 1913. The texts have also been labelled, which makes it possible to identify defendants, offences, victims, verdicts and sentences. See, for instance, Hitchcock and Turkel (2016) for an application.

  • VOC Dataset (Petram et al. 2024): This dataset stores the pay ledgers of the Dutch East India Company (VOC), primarily from the eighteenth century. It contains almost 800,000 records, each with the crew member’s name, place of origin, rank, wage, etc. The raw information has been carefully curated and stored in several .csv files that can be merged using the corresponding IDs. Read more about this source here.

  • Tudor Networks of Power (Ahnert et al. 2023). This dataset contains all (surviving) items of correspondence in the Tudor State Papers (1509-1603), which are the official government records of the Tudor period in England. As explained by the authors (Ahnert and Ahnert 2023), data cleaning and curation constituted a significant effort. As well as more traditional quantitative methods, this dataset is well suited to network analysis.

  • Homicide in Chicago, 1870-1930. This dataset consists of 11,000 homicide reports filed by the Chicago Police Department during the period of study, thus allowing the study of homicide, crime, urban development, and the police themselves. See Bienen and Rottinghaus (2003) (available here) for a description of the project and the dataset.

  • African Names Dataset (Altis 2021). This dataset contains information on 91,491 Africans taken from captured slave ships or from African slave trading sites after 1807 as a result of the British navy’s attempt to suppress the slave trade. It includes names, stature, sex, age, country of origin, the vessel involved, year of arrival, and ports of embarkation and disembarkation.

  • Tattoos dataset (Project, n.d.): Almost 60,000 British convict records from 1793 to 1925. As well as other personal information (age, gender, occupation, religion), these records contain physical descriptions of convict bodies, including their tattoos and other marks (e.g. scars). See additional info here and Alker and Shoemaker (2022).

  • London Lives (Hitchcock et al., n.d.): 240,000 manuscripts from a wide range of primary sources between 1690 and 1800 that allow studying the ordinary life of Londoners (crime, poverty, illness, apprenticeship, work, politics, etc.). See also Hitchcock and Shoemaker (2020) (available here).

  • Petitioning in Early Modern England (Waddell and Howard 2022). This dataset consists of 2,847 petitions filed in England between 1573 and 1799. As well as the text itself, it includes information on date, petitioners, topic, administrative responses, etc. Petitions were a crucial mode of communication between the ‘rulers’ and the ‘ruled’, so they provide a vital source for illuminating the concerns of the people, from noblemen to paupers. The data is hosted in this repository.

  • The Google Books Project has digitised millions of books published from the late 17th century onwards (coverage is uneven), and the Google Books Ngram Viewer allows counting the number of times that a particular term or terms appear in the corpus (Michel et al. 2011). The R package ngramr mimics the functionalities of the latter but directly from within the R environment. It extracts data from the Google corpus and provides it in the form of an R data frame, which can subsequently be treated with the tools you are familiar with. This corpus should, however, be used with caution. See Pechenick, Danforth, and Dodds (2015) and Schmidt, Piantadosi, and Mahowald (2021) for its biases and limitations for studying socio-cultural and linguistic evolution.

  • The HathiTrust Library also contains millions of digitised texts and an online tool for single-word queries (bookworm). The underlying data can be downloaded on request. The hathiTools R package (Marquez and Schmidt 2022) allows interacting directly with these resources.
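Some of the datasets above, such as the VOC pay ledgers, are distributed as several .csv files linked by ID columns. A minimal sketch of such a merge in base R could look as follows. The two toy tables and their column names (person_id, name, rank, wage) are hypothetical stand-ins for illustration, not the actual layout of the VOC files, so check the documentation of each dataset for the real IDs.

```r
# Toy stand-ins for two linked tables. In practice, each one would be
# loaded with read.csv(). All column names and values are made up.
persons <- data.frame(
  person_id = c(1, 2, 3),
  name      = c("Jan", "Pieter", "Hendrik")
)
contracts <- data.frame(
  person_id = c(1, 1, 3),
  rank      = c("sailor", "boatswain", "soldier"),
  wage      = c(9, 14, 9)
)

# Left join on the shared ID: keep every person, including those
# without any contract record (their rank and wage become NA)
merged <- merge(persons, contracts, by = "person_id", all.x = TRUE)
print(merged)
```

Setting all.x = TRUE keeps unmatched rows from the first table, which makes it easy to spot IDs that do not link to anything before you start the analysis.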

Norwegian data

The National Library of Norway (Nasjonalbiblioteket) has millions of digitised resources (books, manuscripts, newspapers, letters, etc.). As well as hosting an online norsk n-gram for querying the corpus, the DH-LAB is also creating its own tools for accessing and analysing its collections (including R packages). In any case, if you are interested in a particular period, you can contact them directly and request a particular set of texts.

Statistics Norway has a collection of historical statistics on population, health, education, income, prices, manufacturing, transportation and communication and the environment, among many other dimensions. These usually refer to government reports scanned in PDF form, although in many cases the sources have also been properly digitised.

The Kommunedatabasen has also digitised a huge amount of historical information on municipalities (kommuner).

The HistLab, hosted at the University of Tromsø, stores individual-level information extracted from population censuses (1801-1920), parish registers (baptisms, marriages and deaths) and land registers (1838, 1886). Contact them directly to request the digitised records.

Additional sources can be found below:

  • The Norwegian Parliamentary Debates Dataset (Fiva, Nedregård, and Øien 2025). This dataset includes all speeches delivered in the Norwegian Parliament between December 1945 and June 2024 (almost one million speeches). As well as the text itself, it includes information on date, speaker, political and regional affiliation, etc. Given the size of this dataset, we include here a 5 per cent random sample. You can find the whole dataset here.

  • Norwegian parliamentary elections from 1906 to 2013: candidate-level observations for all candidates from all parties since the 1906 election (2024 version; Fiva and Smith 2017): here in .dta format (requires read_dta() from the package haven).

Miscellaneous

Those students with other research interests can choose their dataset on their own. The possibilities are endless. Here are just a few examples:

  • Friends Dataset (Hvitfeldt 2020). The complete transcripts from the famous TV show (1994-2004). This dataset is available by installing and loading the R package friends. Once the package is loaded, the object “friends” does not appear in your environment, but you can still access it just by typing “friends”. You can in any case create an object with the data yourself (data <- friends). The dataset does not have an explicit “time” variable, but you can easily create your own using the information on “season” (and perhaps “episode”). More info here or here.
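One possible way to build such a “time” variable is to number the episodes consecutively across seasons. The sketch below uses a small toy data frame rather than the real transcripts, and it assumes that “season” and “episode” are stored as integers. The names toy, episodes and time are illustrative only.

```r
# Toy stand-in for the transcript data: one row per line of dialogue.
# Assumption: "season" and "episode" are integer columns.
toy <- data.frame(
  season  = c(1L, 1L, 2L, 2L, 2L),
  episode = c(1L, 2L, 1L, 1L, 2L),
  text    = c("a", "b", "c", "d", "e")
)

# List the distinct (season, episode) pairs in broadcast order
episodes <- unique(toy[, c("season", "episode")])
episodes <- episodes[order(episodes$season, episodes$episode), ]

# Running episode index that keeps increasing across seasons
episodes$time <- seq_len(nrow(episodes))

# Attach the index to every line of dialogue
toy <- merge(toy, episodes, by = c("season", "episode"))
print(toy)
```

A running index of this kind treats every episode as one time step. If you would rather work at the season level, the “season” column can be used directly as the time variable.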

As mentioned above, I encourage you to find your own dataset.

References

Ahnert, Ruth, and Sebastian E. Ahnert. 2023. Tudor Networks of Power. Oxford University Press.
Ahnert, Ruth, Sebastian E. Ahnert, Jose Cree, and Lotte Fikkers. 2023. “Tudor Networks of Power - Correspondence Network Dataset.” Cliodynamics. Apollo - University of Cambridge Repository. https://doi.org/10.17863/CAM.99562.
Alker, Zoe, and Robert Shoemaker. 2022. “Convicts and the Cultural Significance of Tattooing in Nineteenth-Century Britain.” Journal of British Studies 61 (4): 835–62.
Alter, George, and Kees Mandemakers. 2014. “The Intermediate Data Structure (IDS) for Longitudinal Historical Microdata, Version 4.” Historical Life Course Studies 1.
Altis, David. 2021. “The Trans-Atlantic Slave Trade Database: Origins, Development, Content.” Journal of Slavery and Data Preservation 2 (3).
Bienen, Leigh B., and Brandon Rottinghaus. 2003. “Learning from the Past, Living in the Present: Understanding Homicide in Chicago, 1870-1930.” The Journal of Criminal Law 92 (3).
Blaxill, Luke. 2020. “The War of Words: The Language of British Elections, 1880-1914.”
Fiva, Jon H., Oda Nedregård, and Henning Øien. 2025. “The Norwegian Parliamentary Debates Dataset.” Scientific Data 12 (4).
Fiva, Jon H., and Daniel M. Smith. 2017. “Norwegian Parliamentary Elections, 1906–2013: Representation and Turnout Across Four Electoral Systems.” West European Politics 40 (6): 1373–91.
Guldi, Jo. 2023. The Dangerous Art of Text Mining. Cambridge University Press.
Hitchcock, Tim, and Robert Shoemaker. 2020. London Lives: Poverty, Crime and the Making of a Modern City, 1690-1800. Digital Humanities Institute, University of Sheffield.
Hitchcock, Tim, Robert Shoemaker, Clive Emsley, Sharon Howard, and Jamie McLaughlin. 2023. “The Old Bailey Proceedings Online, 1674-1913.” www.oldbaileyonline.org.
Hitchcock, Tim, Robert Shoemaker, Sharon Howard, and Jamie McLaughlin. n.d. “London Lives, 1690-1800.” www.londonlives.org.
Hitchcock, Tim, and William J. Turkel. 2016. “The Old Bailey Proceedings, 1674–1913: Text Mining for Evidence of Court Behavior.” Law and History Review 34 (4): 929–55. https://doi.org/10.2752/147800413X13515292098070.
Hvitfeldt, Emil. 2020. “The Entire Transcript from Friends in Tidy Format.” https://github.com/EmilHvitfeldt/friends.
Marquez, Xavier, and Ben Schmidt. 2022. “hathiTools: Access the Hathi Trust Bookworm and Extracted Features Files from R.” https://github.com/xmarquez/hathiTools.
Michel, Jean-Baptiste, Yuan K. Shen, Aviva P. Aiden, and The Google Books Team. 2011. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331 (6014): 176–82.
Pechenick, Eitan A., Christopher M. Danforth, and Peter S. Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLOS ONE 10.
Petram, Lodewijk, Marijn Koolen, Melvin Wevers, and Jelle van Lottum. 2024. “Charting Lives and Careers: Enriched Data about the Dutch East India Company’s Eighteenth-Century European Workforce.” Journal of Open Humanities Data 10. https://doi.org/10.5334/johd.210.
Project, The Digital Panopticon. n.d. “Tracing London Convicts in Britain and Australia, 1780-1925.” www.digitalpanopticon.org.
Quaranta, Luciana. 2021. “Reflections on the Use of the Intermediate Data Structure (IDS) in Historical Demographic Research.” Historical Life Course Studies 10.
Ruggles, Steven, Lara Cleveland, Rodrigo Lovaton, Sula Sarkar, Matthew Sobek, Derek Burk, Dan Ehrlich, Quinn Heimann, Jane Lee, and Nate Merrill. 2025. “Integrated Public Use Microdata Series (IPUMS).” Minneapolis, MN. https://doi.org/10.18128/D020.V7.6.
Schmidt, Benjamin, Steven T. Piantadosi, and Kyle Mahowald. 2021. “Uncontrolled Corpus Composition Drives an Apparent Surge in Cognitive Distortions.” Proceedings of the National Academy of Sciences 118 (45): e2115010118.
Waddell, Brodie, and Sharon Howard. 2022. “The Power of Petitioning in Early Modern England, 1573-1799.” Zenodo. https://zenodo.org/records/7027693.