Text mining and analysis

A curated list of licensed and open text mining resources and tools.

Open access sources

These sources offer textual data that can be downloaded free of charge for research purposes, subject to the provider's terms of use. Researchers should comply with copyright restrictions and familiarise themselves with the relevant site's access or download procedures including, where recommended, use of an application programming interface (API). This list is not exhaustive. If you need advice on how to search for a certain type of textual data, please contact the Library.

Data source

Description

Australian Data Archive (ADA) The Australian Data Archive (ADA) provides a national service for the collection and preservation of digital research data. ADA disseminates this data for secondary analysis by academic researchers and other users. Text data include survey and interview transcripts and audio/video transcriptions. There is no cost for data sets unless specified in special conditions and restrictions. Access to data is mediated and thus requires a login and agreement to terms of use. Accessing ADA data.
Australian Text Analytics Platform Access to two datasets with notebooks: Farms to Freeways, data collected in 1991-1992 by the Western Sydney Women's Oral History Project, which analysed the experiences of women who had lived in the Blacktown and Penrith areas since the early 1950s; and the Corpus of Oz Early English (CoOEE), approximately 2 million tokens of material produced in Australia between 1788 and 1900, divided into four time periods and four registers.

arXiv Bulk Data Access Open access to 2 million e-prints, and growing, in the areas of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Access information.
bioRxiv bioRxiv is a preprint server for biology. Provides free and unrestricted access to all articles posted on the server. View bioRxiv's machine access and text/data mining resources.
BMC (BioMed Central) Research articles from over 300 peer-reviewed open access journals across 18 disciplines. View a selection of BMC open access research and data articles. BMC is part of the Springer Nature publishing group. The Springer Nature Open Access API provides metadata and full-text content where open access is available.
Crossref The API provided by Crossref allows researchers to harvest full-text documents from participating members who deliver content, regardless of whether the content is open access or subscription. REST API information.
Digital Public Library of America Scholarly researchers use DPLA to find open access sources from archives across the USA through a single portal. Data can be downloaded as zipped JSON files. Users familiar with working with APIs should refer to the documentation for the DPLA API and resources.
English Corpora A list of the most widely used online corpora, which teachers and researchers at universities throughout the world use for a range of purposes.
Europeana A digital library that makes available heritage material from over 3700 different institutions. Its partners collect, carefully check and enrich the data with additional information such as geo-location or links to other material or datasets through associated people, places or topics. APIs allow the building of applications that use Europeana material for research, education and the creative industries.
Facebook Graph API The Graph API allows a user to get data into and out of the social media platform Facebook. See Using the Graph API.
Google Books Currently one of the world's largest digitised book collections. Use Advanced Search and select the search filter "Full view only" to find full text books. Download individual titles in PDF, ePUB or plain text. Access to the whole corpus requires going through a request process.
HathiTrust HathiTrust makes the text data of public domain works in its collection available to researchers for bulk download, for non-commercial research purposes. Accessing a dataset requires going through an approval process described in the Datasets section.
HuNI - Humanities Networked Infrastructure Combines data from many Australian cultural websites into the biggest humanities and creative arts database ever assembled in Australia. HuNI provides discovery tools for casual users from the wider community, but more sophisticated functionality is available to researchers who register for an account in the virtual laboratory. Registered researchers have their own personal workspace within HuNI. The Help pages provide information on how to search and create/edit/export your own collections.
Internet Archive The Internet Archive offers over 20,000,000 freely downloadable books and texts in many formats. See instructions for downloading in bulk.
OTA - Oxford Text Archive OTA is like a library for digital texts, as well as for some other types of literary and linguistic data. It’s an open, online location where people can search for texts which can be downloaded easily, or stored safely and shared with others. To sign in, click Login, search for Griffith University and use your Griffith credentials. See About OTA for information about relevant policies and the Terms of Service. View the FAQs for how to use OTA and more.
PLOS Publishes open access research from over 200 STEM disciplines. PLOS text and data mining.
PMC Article Datasets PubMed Central articles in machine readable format. PMC and the NCBI Bookshelf include several large datasets of articles and other scientific publications made available for download under license terms that generally allow for more liberal redistribution and reuse than a traditional copyrighted work (e.g., Creative Commons licenses).
Project Gutenberg Project Gutenberg is a library of over 60,000 free eBooks, downloadable in plain text and HTML (the two master formats), as well as in ePub and MOBI. Refer to the site's Help pages for information on permissions and licensing, policies, how to get the ebook files in bulk and more.
Trove Trove aggregates content from Australian libraries. The Trove API allows users to capture datasets for research and analysis and to build and create applications, tools and interfaces. See Using the Trove API. The Trove newspaper and gazette harvester is an example of a custom-built interface that was developed using the Trove API and shared for use by others.
Tweetsets A collection of Twitter datasets for research and archiving from George Washington University. Create your own Twitter dataset from existing datasets. Conforms with Twitter policies.
Twitter Social media platform Twitter's tweets, trends, lists, and other elements can be mined for research. Read about eligibility criteria for Twitter's Academic Research product track and Tools & Guides for researchers. View Twitter API documentation. Fees apply.
Twitter datasets via DocNow Catalog The DocNow Catalog is a collectively curated listing of Twitter datasets. Public datasets are shared as Tweet IDs, which can be 'hydrated' back into full datasets using DocNow's Hydrator desktop application.
Wikidata Wikidata is a free and open knowledge base which also acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others. Content of Wikidata is available under a free license, exported using standard formats, and can be interlinked to other open data sets on the linked data web. Wikidata: Verifiability is a guideline on verifiability of Wikidata content. Wikidata: Data Access is the starting point page for learning how to get data out of Wikidata.
WordHoard In its current release WordHoard contains the entire canon of Early Greek epic in the original and in translation, as well as all of Chaucer, Shakespeare, and Spenser. The section on Provenance, Copyrights, and Licenses provides detailed information about the texts.
OpenSubtitles.org Massive multi-language subtitle database, with subtitle files corresponding to television shows, movies and games content from the years 1950 to 2017. Each subtitle file is mapped to a unique IMDb title.
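
Several of the sources listed above are accessed through an API. As a rough illustration for the Crossref entry, the following Python sketch builds a search URL for the Crossref REST API and picks out full-text links from a response. It rests on some assumptions that should be checked against Crossref's own REST API information before use: the `has-full-text:true` filter, the optional `mailto` parameter, and the `link` / `intended-application` fields in each record are taken from Crossref's public documentation as understood here, and the function names (`build_query_url`, `full_text_links`, `fetch`) are hypothetical helpers, not part of any library. Whether a given link can actually be downloaded still depends on the publisher's terms and any subscription held.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Base endpoint of the Crossref REST API for works (articles, chapters, etc.).
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(query, rows=5, mailto=None):
    """Build a /works search URL limited to records that list full-text links.

    Assumption: the 'has-full-text:true' filter and the 'mailto' parameter
    behave as described in Crossref's REST API documentation.
    """
    params = {"query": query, "rows": rows, "filter": "has-full-text:true"}
    if mailto:
        params["mailto"] = mailto  # contact address, optional
    return CROSSREF_WORKS + "?" + urlencode(params)


def full_text_links(response):
    """Return (DOI, URL) pairs for links flagged for text mining.

    Assumption: each item may carry a 'link' list whose entries have
    'URL' and 'intended-application' keys.
    """
    pairs = []
    for item in response.get("message", {}).get("items", []):
        for link in item.get("link", []):
            if link.get("intended-application") == "text-mining":
                pairs.append((item.get("DOI"), link.get("URL")))
    return pairs


def fetch(url, timeout=30):
    """Download and decode one JSON response (requires network access)."""
    request = Request(url, headers={"User-Agent": "text-mining-sketch/0.1"})
    with urlopen(request, timeout=timeout) as handle:
        return json.load(handle)
```

A typical use would be `full_text_links(fetch(build_query_url("oral history", rows=10)))`, followed by a check of each provider's terms of use before downloading anything. Keeping URL construction and response parsing separate from the network call means both can be tried on saved sample responses first.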