How do I organize a data archive?

Question

What are the required contents of a data archive?

Answer

Data should be stored in a single .zip archive per publication (an MSc thesis, a PhD thesis or a paper in a scientific journal) that includes:

the final version of the document,
all primary (raw) data underlying the document. Preferably, primary data include scanned PDFs of the original data sheets, field books, etc.; if this is not possible, the raw data should at least be included in electronic form, e.g. in a table, spreadsheet or text file,
all secondary (processed) data used in the preparation of the document,
all program code and scripts used to produce final results such as figures, tables, statistical analyses etc.,
relevant metadata: one or more text files describing the data sources in relation to (corresponding sections of) the document,
for PhD theses: the above four elements per chapter.
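
As an illustration of the layout listed above, the following is a minimal Python sketch that creates an empty folder skeleton and packs it into a single .zip archive. The folder names (document, primary_data, secondary_data, code) and the function names are hypothetical; this guideline does not prescribe them. Placing the data folders inside one folder per chapter for a PhD thesis is also only one possible reading of the per-chapter requirement.

```python
import zipfile
from pathlib import Path

README = "read_me_first.txt"
# Hypothetical folder names; the guideline itself does not prescribe them.
DATA_FOLDERS = ("primary_data", "secondary_data", "code")


def create_skeleton(root, chapters=()):
    """Create the folder layout with a placeholder read_me_first.txt in every folder."""
    root = Path(root)
    folders = [root, root / "document"]
    # Without chapters, the data folders sit directly under the main folder.
    parents = [root / chapter for chapter in chapters] or [root]
    for parent in parents:
        if parent != root:
            folders.append(parent)
        folders.extend(parent / name for name in DATA_FOLDERS)
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)
        readme = folder / README
        if not readme.exists():
            readme.write_text(
                "TODO: describe the contents of this folder and how they "
                "relate to the document.\n",
                encoding="utf-8",
            )
    return folders


def zip_folder(root, target):
    """Pack everything under root into one .zip file (target must lie outside root)."""
    root, target = Path(root), Path(target)
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                arcname = (Path(root.name) / path.relative_to(root)).as_posix()
                archive.write(path, arcname)
    return target
```

For example, create_skeleton("thesis_archive", chapters=["chapter_1", "chapter_2"]) followed by zip_folder("thesis_archive", "thesis_archive.zip") would produce one archive with a document folder and three data folders per chapter. The placeholder text in each read_me_first.txt still has to be replaced by a real description.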

Document:

final version of the MSc thesis, both as a text file (rtf, doc, docx) and as a pdf file,
final version of the PhD thesis, both as a text file (rtf, doc, docx) and as a pdf file,
final version of journal paper as a pdf file,
all supplementary information referred to in a journal paper as a text file (rtf, doc, docx), spreadsheet file (csv, xls, xlsx) or pdf file,
optional, for archiving and follow-up purposes: a RIS or BibTeX file of all literature references used in the project.

Primary data include all sources of raw data:

scanned field logs, lab journals, score forms,
pictures of gels, microscopic observations,
output from data loggers,
video and audio recordings,
webcam/photo identification files: only when the resulting IDs are NOT included in other primary data sets, e.g. in field journals or data files,
satellite/aerial imagery: only new data; for data re-used from public repositories, add the link plus description in the metadata,
sequencing and genotyping data,
microarray and high-throughput data: only if NOT stored in a public database on publication,
simulated datasets: only if NOT possible to reproduce from program code.

Secondary data include all processed data files used in the preparation of the document:

spreadsheets,
databases,
output from statistical packages,
graphics,
output from geographic information systems.

Program code and scripts:

program code in C/C++/NetLogo, Matlab, Maple, Mathematica etc. of all programs developed to produce the published results,
all associated parameter files,
R-scripts (statistical analysis, graphs, etc.),
Python or other batch scripts used for data processing,
specific program code/scripts, e.g. Z-Tree.

Metadata:

A read_me_first.txt file must be included in each folder in the zip archive. This file should describe the folder's contents and how they relate to the document, e.g. which data files were used to plot a particular figure, or which parameters were used to generate the simulated datasets used for the publication. The read_me_first.txt file in the main folder should contain a formatted description of the publication. For journal papers and book chapters, use the reference format of the Annual Review of Ecology, Evolution and Systematics for the bibliographic reference. Do not abbreviate the author list to "et al.", and include the DOI of the paper (which can be found through Web of Science).
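
The requirement of one read_me_first.txt per folder can be checked automatically before an archive is handed in. The sketch below is one possible way to do this with Python's standard zipfile module; the function names are hypothetical, and the check only confirms that the file exists, not that its description is adequate. It also assumes the file name is spelled exactly read_me_first.txt.

```python
import zipfile
from pathlib import PurePosixPath

README = "read_me_first.txt"
ROOT = PurePosixPath(".")  # top level of the zip file itself


def folders_missing_readme(names):
    """Return, sorted, every folder among the entry names that lacks a read_me_first.txt."""
    folders, documented = set(), set()
    for name in names:
        path = PurePosixPath(name)
        is_dir = name.endswith("/")
        # Every ancestor of an entry is a folder; a directory entry is one itself.
        ancestors = set(path.parents)
        if is_dir:
            ancestors.add(path)
        # The zip's top level only counts as a folder if files sit directly in it.
        if is_dir or path.parent != ROOT:
            ancestors.discard(ROOT)
        folders |= ancestors
        if not is_dir and path.name == README:
            documented.add(path.parent)
    return sorted(str(folder) for folder in folders - documented)


def check_archive(zip_path):
    """Open a .zip archive and list the folders without a read_me_first.txt."""
    with zipfile.ZipFile(zip_path) as archive:
        return folders_missing_readme(archive.namelist())
```

An empty list returned by check_archive("my_archive.zip") (an example file name) means that every folder in the archive contains a read_me_first.txt.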

Last modified:

12 February 2025 5.00 p.m.