Data Management Plan
Kinds of Data
This project relies on various data sources, including interviews, bibliographic resources and structured datasets.
Interviews generate audio, video and textual resources, formatted according to the WAV, MP4 and markdown specifications, respectively.
While reviewing extant literature, I compile numerous published PDF and HTML documents. I maintain a bibliographic database using Zotero and continually export core information to BibLaTeX format using the Better BibTeX add-on.
I prioritize writing all documents using plaintext file formats. I rely on Quarto, an open-source scientific and technical publishing system, to generate HTML, PDF and DOCX files from singular markdown files via the pandoc conversion egnine and the LaTeX typesetting system.
When collecting structured data, I opt to use open text formats including CSV, JSON and XML, however I may also use XLSX format to maintain interoperability with collaborators.
I embed all original code, including database retrieval queries and API requests, in or alongside quarto documents, where I annotate code with detailed and contextualizing comments.
I use qc for my qualitative data analysis, which stores data across a series of text files and a barebones SQLite database.
Data Collection
Interviews may generate audio, video and textual records.
When given consent to record audio, I primarily rely on a SONY ICD-UX560 audio recorder, which contains 4GB of internal storage supplemented with a 32GB microSD card.1 I record using the lossless 16bit 44.1 kHz Linear PCM wav format.
1 See https://weloty.com/sony-icd-ux560-review and https://weloty.com/using-the-sony-icd-ux560-the-4-how-tos for more information on this audio recording device.
Video is recorded using a GoPro Hero 4 Silver action camera equipped with a 64GB microSD card.
I maintain typed notes before, during and after each interview, which I may edit and re-organize to enhance clarity.
Immediately after each interview, I copy the data off the recording devices and onto a dedicated project drive. I organize and rename files using a semantic naming scheme and then mirror all files onto physical and a cloud-based backup drives.
Data Processing
I use FFmpeg and Audacity to cut, concatenate, and clean the audio and video files, if necessary.
I generate transcripts using noScribe, which I then edit and verify manually.
Setting | Value | Reason |
---|---|---|
Language | English | This is the language spoken during the interview. |
Quality | Precise | I have the hardware to support this. |
Mark pause | 1sec + | Corresponds with common convention for interview transcripts. |
Speaker detection | 2 | This will vary depending on how many people are present during the interview. |
Overlapping speech | True | It is useful to detect this. |
Disfluencies | True | It is useful to detect this. |
Timestamps | False | Distracting information that provides little value, and potentially incompatible with qc. |
Storage and Backups
All data are stored and tracked in a private repository on the MCHI GitLab instance. I maintain a backup on an encrypted portable solid state drive.
I maintain this website where I share documentation that supports this project and reflect on the work as it progresses. It is hosted using GitHub Pages and is backed up using Dropbox, however no sensitive research data passes through these services.
Publishing and Archiving
I make significant effort to document all the data as I collect them. Whenever possible, I record metata within the documents themselves, and if necessary I record relevant metadata in a separate and related file. This will facilitate access and reuse by my future self and by others.
I will explore my options for making the data accessible. This will depend on compliance with ethical concerns (see Hertz 2021). The project website will likely serve as the primary vehicle for sharing public data, and I will deposit its contents in a professional archive as well.