
Data Handling and Storage

This page covers both the handling of files and the storage of files during research.

The two topics appear in separate boxes, each with its own tabbed subtopics:

Data Handling includes: File Naming, Version Control, and Workflows
Data Storage includes: Basic Guidance, Choosing Storage, Large Scale Options, and Back-up Plans

Please contact Santi Thompson with any further questions.

Data Handling

Naming and Organizing Files

Organizing and structuring files at the beginning of a project will ease the research process and prevent losses and mix-ups.

Tips for creating file naming conventions

Choose more than one distinctive descriptor for the name.

Distinctive descriptors:

Experiment name
Location/spatial coordinates
Researcher name/initials
Date, time, date range
Type of data
Conditions
Version number
Be consistent; you can use batch renaming software later as needed.
Practice simple version control: v1, v2 … (Don’t use Final because it probably isn’t!)
Use international standards for dates: YYYY-MM-DD
Stick with letters, numbers, hyphens (-), and underscores (_). (Avoid other characters such as ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' " and |.)
Avoid spaces
Keep things short and concise
Create a master document (see template below) that describes your convention and folder contents.
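
The tips above can be sketched in a few lines of Python. This is only an illustration, not a required convention: the function name build_file_name, the underscore separator, and the sample descriptors are all assumptions chosen for the example.

```python
import re
from datetime import date


def build_file_name(descriptors, on_date, version, extension):
    """Join descriptors, an ISO date, and a version number into one file name."""
    cleaned = []
    for part in descriptors:
        # Replace spaces with hyphens, then drop anything that is not a
        # letter, number, hyphen, or underscore.
        part = part.strip().replace(" ", "-")
        part = re.sub(r"[^A-Za-z0-9_-]", "", part)
        if part:
            cleaned.append(part)
    if len(cleaned) < 2:
        raise ValueError("use more than one distinctive descriptor")
    # date.isoformat() yields YYYY-MM-DD, the international standard format.
    stem = "_".join(cleaned + [on_date.isoformat(), f"v{version}"])
    return f"{stem}.{extension.lstrip('.')}"


# Hypothetical descriptors: data type, location, researcher initials.
print(build_file_name(["soil temp", "plotA", "JD"], date(2019, 4, 3), 2, "csv"))
# prints: soil-temp_plotA_JD_2019-04-03_v2.csv
```

Whatever pattern you settle on, record it in the master document so collaborators can read and reproduce the names.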

For a list of best practices: Stanford University’s File-naming best practices

Visual Diagram example: Sample File-Naming Convention Visual

Research Project File Organization and Naming Scheme
A downloadable template to use in creation of file organization and naming scheme documentation.

Defining Version Control

Version control is the strategy used to keep track of changes to files over time.

During collaborative work, versioning is essential and more complex.

Data Management Plans will often include methods for managing versions of data.

More about version control from Git

Implementing Version Control

Manually - For basic small scale needs:

Add an ISO standard date (YYYY-MM-DD) or a version number (v1, v2 …) to the end of each file name, and save a new version each time you make changes.
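
For those who prefer to script the manual approach, the following is a minimal sketch, assuming versions are kept as numbered copies in a separate folder. The function name save_new_version, the folder name "versions", and the sample file are hypothetical.

```python
import re
import shutil
import tempfile
from pathlib import Path


def save_new_version(working_file, archive_dir):
    """Copy working_file into archive_dir as <stem>_vN<suffix>, using the next free N."""
    working_file = Path(working_file)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    # Match names such as survey_results_v3.csv and capture the number.
    pattern = re.compile(
        re.escape(working_file.stem) + r"_v(\d+)" + re.escape(working_file.suffix)
    )
    existing = [
        int(m.group(1))
        for p in archive_dir.iterdir()
        if (m := pattern.fullmatch(p.name))
    ]
    next_version = max(existing, default=0) + 1
    target = archive_dir / f"{working_file.stem}_v{next_version}{working_file.suffix}"
    shutil.copy2(working_file, target)  # copy2 also preserves file timestamps
    return target


# Demonstration in a temporary folder so nothing on disk is touched.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "survey_results.csv"
    data.write_text("id,value\n1,10\n")
    first = save_new_version(data, Path(tmp) / "versions")
    second = save_new_version(data, Path(tmp) / "versions")
    print(first.name, second.name)
    # prints: survey_results_v1.csv survey_results_v2.csv
```

Earlier versions are never overwritten, which is the point of the v1, v2 … habit.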

Tools for Version Control:

These are better suited to groups, larger projects, and work that involves models, code, and similar materials.

Git and GitHub - GitHub Guides
Mercurial
Subversion
Bitbucket
Open Science Framework
Cloud Storage options: Google Drive, Dropbox, Box, and Microsoft OneDrive do this to varying degrees (Choose with caution and test before relying on it completely.)

What is a Workflow?

Workflows are the steps you take to move from start to finish in your research activities.

Things to consider:

Basic elements of work:
- Individual data collection
- Data aggregation
- Analysis processes
Parts of workflows may be computational processes automated via the use of scripts.
Environments and circumstances contextualize decisions and processes.

Documenting

Documenting the workflow aids your ability to pick up where you left off, and to communicate effectively with collaborators.

Three key practices - Justin Kitzes, The Basic Reproducible Workflow Template

Useful Tools

Jupyter Notebooks - an open-source application allowing researchers to generate and collaborate on documents containing live code, equations, visualizations and text.
Electronic Lab Notebooks (ELNs) - See the Harvard ELN matrix for information
Docker - containers for computational environments

Data Storage

Storing and Backing up Data

Storage is where data and research materials reside during the process of collection and analysis.

Back-up strategies are ways to ensure that files are intact and up to date.

Guidance for Storage and Back-up of standard data files*

Use the "3-2-1" Rule: 3 copies, 2 different media, 1 copy off-site.
Don’t rely solely on the cloud; make sure you keep a local back-up.
Designate one copy as the working copy and sync or update at designated intervals.
Automate back-up whenever possible.
Test your backups periodically.
Document these locations and who is responsible.
Pay special attention to raw data files; they are the most valuable.
Keep a sharp eye out for vulnerabilities, both internal and external.

*Data without sensitive content such as personal identifiers or proprietary information.
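
One way to test back-ups periodically is to compare checksums of each copy against the working copy. The sketch below assumes the copies are meant to be identical to the working copy at the time of the check. The function names and file names are hypothetical, and the demonstration runs in a temporary folder.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 checksum of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copies(working_copy, backup_copies):
    """Map each back-up path to True (identical) or False (missing or different)."""
    expected = sha256_of(working_copy)
    results = {}
    for backup in backup_copies:
        backup = Path(backup)
        results[str(backup)] = backup.is_file() and sha256_of(backup) == expected
    return results


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    working = root / "working.csv"
    working.write_text("id,value\n1,10\n")
    good = root / "copy2.csv"
    good.write_text("id,value\n1,10\n")
    damaged = root / "copy3.csv"
    damaged.write_text("id,value\n1,1\n")  # simulated corruption
    missing = root / "copy4.csv"           # never created

    for path, ok in verify_copies(working, [good, damaged, missing]).items():
        print(Path(path).name, "OK" if ok else "PROBLEM")
```

A check like this only confirms that the copies match; it does not replace opening a restored file now and then to confirm it is usable.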

Sample Storage and Back-up Table
Location	URL/filepath	Description	Responsible Party
Department Server	M:/user/somefile...	Working Copy	Jane Dough - Dept IT
UH MS OneDrive	http://sharepoint/file...	Copy 2	John Smyth - Post doc
My External Hard Drive	E:/somefile/file….	Copy 3	Sal E. Mander - PI

For additional storage and back-up tips: Ways to Avoid a Data-Storage Disaster by Jeffrey Perkel, Nature 568, 131-132 (2019)

Storage Choices

UH researchers will want to reach out early and often to department IT with any specific questions and needs related to departmental storage.

Storage within the UH network ensures compliance. Information security adheres to specific protocols designed to keep university systems secure.
- UH UIT provides Microsoft OneDrive for all researchers.
Consider choosing an additional trusted cloud option for one of your storage solutions. (Do not rely on this as your sole storage.)
- Free options you might consider: Box, Dropbox, Google Drive

Large Scale Data

The growing scale of data is one of the biggest challenges we face in research and data services.

First and foremost you will want to seek the advice of your department IT and others in your field who are encountering similar challenges.

Potential options include

Network attached storage (NAS)

These devices contain storage and associated management software, much like a small computer with a large amount of storage capacity. They are internet accessible, which allows you to centralize data collected in multiple ways and then access files for analysis in one spot. Most models contain multiple hard drives and are set up with RAID to protect against data loss in case of a hard drive failure. (Costs vary; approximately $300-500.)

Cloud Storage Services

Beyond the free and institutional storage, cloud storage services are available at varying levels, some with additional back-up features.

Amazon Web Services is one of the most common choices, but there may be other options more suitable to your needs and budget.

Back-up Plans

We advise keeping a document that lays out the following:

A list of data files, average size, and format
Three storage locations
Medium of storage
Responsible party
Methods of back-up (Manual, automatic, software used, etc.)
Timing - Daily, weekly, or monthly, depending on your output
Log of back-up dates (Verify that back-up is complete)
For Groups: Contingency plans should someone leave
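
The log of back-up dates can be as simple as a CSV file with one row per back-up. The sketch below is one possible layout, not a prescribed format: the column names mirror the items listed above, and the sample entry is invented for illustration.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from datetime import date


@dataclass
class BackupLogEntry:
    backup_date: str        # ISO date, YYYY-MM-DD
    location: str           # where this copy lives
    medium: str             # e.g. server, cloud, external drive
    responsible_party: str
    method: str             # manual, automatic, software used
    verified: bool          # was the back-up checked as complete?


def append_entry(log_file, entry, write_header=False):
    """Write one entry as a CSV row to an open, writable text file."""
    writer = csv.DictWriter(
        log_file, fieldnames=[f.name for f in fields(BackupLogEntry)]
    )
    if write_header:
        writer.writeheader()
    writer.writerow(asdict(entry))


# In-memory demonstration; in practice, open a real file in append mode
# with newline="" as the csv module recommends.
log = io.StringIO()
entry = BackupLogEntry(
    backup_date=date(2019, 4, 3).isoformat(),
    location="External hard drive",   # hypothetical values
    medium="HDD",
    responsible_party="PI",
    method="manual",
    verified=True,
)
append_entry(log, entry, write_header=True)
print(log.getvalue())
```

Keeping the log next to the back-up plan makes it easy for someone else to take over if a group member leaves.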

Back-up Plan Template
A downloadable template to be used as a guide to create a simple back-up plan for research data.