Library guides: Text mining and analysis: Get started

Get started

What is text mining?

Text mining, also referred to as text analytics, is the process of mining or extracting implicit knowledge from textual data. The process usually involves extracting or using texts from licensed library databases, open text sources or researcher-generated textual data, then analysing this content once it has been prepared.

Preparing or cleaning texts is needed to remove noise and create structure in unstructured textual data. The words then often need to be turned into numbers so that statistical methods can be applied.
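The two steps described above (cleaning, then turning words into numbers) can be sketched in Python. This is a minimal illustration only, not a recommended pipeline: the stop-word list is a tiny made-up sample, and the function names clean_and_tokenise and document_term_matrix are hypothetical names chosen for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption for this sketch;
# real projects use longer lists, which should be cited).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}


def clean_and_tokenise(text):
    """Lowercase the text, keep alphabetic words only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def document_term_matrix(documents):
    """Turn documents into numbers: one row per document, one column per term."""
    counts = [Counter(clean_and_tokenise(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix


if __name__ == "__main__":
    vocabulary, matrix = document_term_matrix(
        ["The cat sat on the mat.", "The dog sat!"]
    )
    print(vocabulary)  # ['cat', 'dog', 'mat', 'sat']
    print(matrix)      # [[1, 0, 1, 1], [0, 1, 0, 1]]
```

Each row of the resulting matrix is a document expressed as word counts, which is the kind of numeric structure that statistical methods can then work with.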

Text mining is a special type of data mining, the umbrella term for the process of preparing data, whether structured or unstructured, and then using statistical methods to identify trends and draw useful insights.

Looking for data rather than text to analyse? Check out the Research Data Guide, a curated list of external research data repositories, registries and platforms of most relevance to Griffith researchers. Web scraping, or web mining, is another type of data mining method and is covered in detail in Web Scraping with Python at Griffith University.

Ethics, copyright, licensing

Ethics

At Griffith, all human research requires ethical review and approval. This applies to text and data mining research as well, regardless of whether the data comes from licensed or open sources. According to Griffith's Ethics Booklet on Information Technology and Online Research, whether online content analysis is viewed as 'human research' depends on:

the degree to which the material is 'on the public record'; and
whether it is contentious or likely to be of concern to the subject.

If you are unsure, it is always best to discuss your research plans with Griffith's Human Research Ethics Team.

Where your textual data comes from your own study participants (such as from interviews or surveys), you must be certain that they have given their informed consent and that you do not inadvertently identify them in your published analysis. The Australian Code for the Responsible Conduct of Research describes researchers' obligations regarding the management of data, including managing confidential and sensitive information.

Copyright and access

Copyright refers to whether you have permission to (re)publish the data you have obtained (including where it has been transformed through analysis). Permission can be obtained by agreeing to a set of predefined terms (e.g. on a website), or by negotiation with the owner of the data.

Note that downloading data and publishing it are different, and permission to do one does not imply permission to do the other.

Specific sites have specific terms explaining how researchers may access their data and what they are permitted to do with it. Some are very generous with their access provisions, while some do not allow re-use at all. Many sites allow their data to be used for research but prohibit commercial re-use.

If you have created the textual data yourself (e.g. by conducting interviews or surveys), then you own the copyright and you do not need to seek copyright clearance, provided you do not substantially reproduce your subjects' exact words (note that the ethical considerations listed above still apply). The same applies for material that is out of copyright or is in the public domain.

A quick checklist might be:

How will the data be downloaded? Directly from the web, or via an API?
To what degree is the data set in the public domain? Who is it owned by, if not?
Is there an agreement or terms of use that you agreed to? What does it say about allowed or excluded uses?
Do you have any plans to commercialise your research? Is that allowed by your access agreement?
Will your published data be 'substantially similar' to the mined dataset?
Will you be mixing different data sources? You might need to comply with both sets of terms, if they exist.

If you are unsure about any of these questions, it is better to get advice from the Information Policy Officer before you start than to be forced to retract your paper later.

Licensing

Not all library-licensed sources allow text mining, and you may breach the university's licence agreement with the publisher if you do not check the licensing permissions before conducting any text and data mining activities.

It is also important to be considerate of other users when downloading large amounts of data from licensed sources, as this can slow response times or, in some cases, trigger automatic lockouts that prevent further access. Contact the Library if you intend to download content in bulk, and for further information on text and data mining permissions.
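One common way to reduce the load that a bulk download places on a source is to pause between requests. The sketch below is a hedged illustration only: fetch_politely is a hypothetical helper, the fetch argument stands in for whatever download function your source's terms permit, and the two-second default delay is an arbitrary assumption, not a value recommended by any publisher. It does not replace checking permissions with the Library first.

```python
import time


def fetch_politely(items, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(item) for each item in turn, pausing between requests.

    `fetch` is a caller-supplied function (hypothetical here), for example
    one that retrieves a single document through a permitted API.
    `sleep` can be swapped out so the pause can be checked without waiting.
    """
    results = []
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(item))
    return results


if __name__ == "__main__":
    # Stand-in fetch function that only upper-cases its input (no network use).
    print(fetch_politely(["doc-1", "doc-2"], str.upper, delay_seconds=0.1))
```

Requests are made one at a time rather than in parallel, so the source never receives more than one request from the script at once.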

Use it Cite it

Published textual data is cited in the same way as other scholarly outputs, with variation in styles and formats. Raw data for text analysis, stop word lists, algorithms, visualisations and other textual data transformation methods borrowed from others should be cited.

Griffith University Library Referencing Guides illustrate how to cite data according to common styles.

Support and training

Griffith University digital texts and data workshops for researchers via RED
Training includes locating digital texts and tools for HASS, managing data, data cleaning and processing, data analysis, visualisation methods and tools, survey tools and more.

eResearch Services
Specialist IT services for researchers including high performance computing, research data storage, data collection tools, and programming workshops.

Hacky Hour
Get online help from eResearch & Library staff with OpenRefine, R, Python, SQL and Bash coding. Practice your new programming skills in a supportive environment. Learn about HPC or virtual machines, what is available and how to use them, and catch up with other researchers learning to wrangle their data. Access the online sessions each Thursday via the calendar.

External training
Programming Historian - novice-friendly, peer-reviewed online tutorials that help humanists learn a wide range of digital tools, techniques, and workflows to facilitate research and teaching. Available in English, Spanish and French.

Digital Observatory (QUT) - offers assistance for researchers with determining the data and analytical requirements for their projects. The Digital Observatory hosts Office Hours, regular Zoom sessions open to all Australian researchers working with linguistics, text analytics, digital and computational methods, social media and web archives. Tap into the expertise of ARDC research infrastructure project members during Office Hours sessions.