Accessing corpora for text data mining

How to get the APIs and data sets you need to accomplish this work

Text and data mining methods require a corpus of material to analyze. This page focuses on obtaining access to, and permission to use, the necessary APIs, environments, and data sets rather than on the analytical methods themselves.

What is text and data mining?

According to Springer Nature, Text and Data Mining (TDM) is the "automated process of selecting and analyzing large amounts of text or data resources for purposes such as searching, finding patterns, discovering relationships, semantic analysis and learning how content relates to ideas and needs in a way that can provide valuable information needed for studies, research, etc". These techniques are increasingly applied across a variety of disciplines and allow for findings that are not possible through traditional methods of analysis. This page provides information on the corpora available for TDM to the UH community and on how to request access to corpora that are not yet available. If you’re looking for other research-related services, please contact UH Libraries’ research services.

Mining responsibly

There are a variety of ways to approach this kind of work. Some researchers write their own scripts to systematically download content and encounter no problems; others ask permission before doing so, or request a selected data download from a provider. Still others work with APIs or dedicated environments, which may or may not require a fee to access. Environments, access points, formats, required permissions, restrictions, and fees can combine in many ways, and each can affect how you obtain material for these methods.

While UH Libraries are always striving to expand access to data resources, most of our current licenses do not allow unrestricted systematic downloading of content for TDM purposes. Commercial vendors actively monitor database activity to detect when users are systematically downloading large amounts of text or data. When UH users are detected doing so, it can be treated as a “breach” of our license terms and lead the vendor to suspend access for the entire University community. It is even possible, albeit rare, for vendors to sue those who systematically download content without permission.

Apart from the exceptions mentioned below, projects that hope to use licensed library resources for TDM require special arrangements, often involving additional fees, either for dedicated API access to the data or for a download of the data for limited purposes. The safest option is to ask permission, and the Libraries are happy to try to broker these permissions when possible.