Research Protocol
Overview
This study investigates the social, technical, administrative, and epistemic factors that scaffold data-sharing initiatives in epidemiological research. It takes to heart the notions that data are media that facilitate communication across different research contexts, that data are created with specific intent, and that data are bounded by the social, practical, and material circumstances of their creation. In light of these premises, the study approaches data-sharing as a means of reconciling the varied circumstances of datasets’ creation, both among themselves and in relation to contexts of reuse. It therefore frames data-sharing as an effort to foster a series of collaborative ties beyond a project’s original intended scope.
The project addresses the following research questions:
- What are the objectives of data-sharing initiatives, how were they established, and what progress has been made to achieve them?
- What strategies do data-sharing initiatives employ to ensure they are able to meet their objectives, and how effective are they?
- What values underlie these strategies, and can they be linked with effective outcomes, such as the production of harmonized datasets and research deriving therefrom?
The intent is to ascertain what actions specific strategies entail, the circumstances in which each is adopted, the value that they bring, and the trade-offs involved. In other words, the study articulates the collaborative experiences that underlie data-sharing activities as a series of converging situated perspectives.
Approach
This study is informed by a set of theoretical and methodological frameworks formed within the interdisciplinary “science studies” tradition, which contribute to a sociological outlook on science as cultural practice (cf. Pickering (1992)). In practical terms, the study documents the social and collaborative experiences involved in various research practices, which ultimately bind the many ways in which scientists do science.
The study focuses specifically on how people contribute to and extract from information commons, which comprise both formal documents and mutually-held, information-laden situated experiences. This involves examining the ways in which participation in disciplinary or even more specialized communities of practice fosters mutual understanding about the potential and limitations pertaining to other people’s data, and how this communally-held knowledge is accessed and reproduced. This approach aligns with the situated cognition methodological framework for examining the improvised, contingent, and embodied experiences of human activity, including science (cf. Suchman (2007); Knorr Cetina (2001)).
The situated cognition framework prioritizes subjects’ outlooks, which are contextualized by their prior experiences, and enables scholars to trace how people make sense of their environments and work with the physical and conceptual tools available to them to resolve immediate challenges. Situated cognition therefore lends itself to investigating rather fluid, open-ended, and affect-oriented actions, and is geared towards understanding how actors draw from their prior experiences to navigate unique situations.[1]
[1] I expand on this in an extended note on efforts to frame the plurality of research experiences as a continuum of practice.
Situated cognition is especially salient in explorations of how people learn to work in new and possibly unfamiliar ways, and in this sense is closely related to Lave and Wenger’s (1991) theory of situated learning (or ‘communities of practice’ approach), which focuses on how individuals acquire professional skills in relation to their social environments. In such situations, situated cognition enables observers to examine how people align their perspectives as work progresses, and to understand better how people’s general outlooks may have changed under the guidance of more experienced mentors. In other words, situated cognition enables researchers of scientific practices to account for discursive aspects of work, including perceived relationships, distinctions, or intersections between practices that professional or research communities deem acceptable and unacceptable, and the cultural or community-driven aspects of decisions that underlie particular actions.
In taking on this theoretical framework, the study frames epidemiology as a collective endeavour to derive a coherent understanding of population-level health trends, which involves the use of already established knowledge in the validation of newly formed ideas, and which relies on systems designed to carry information obtained with different chains of inference. These systems have both technical and social elements. The technical elements are the means through which information becomes encoded onto information objects so that they may form the basis for further inference. The social elements constitute a series of norms or expectations that facilitate the delegation of roles and responsibilities among agents who contribute their time, effort, and accumulated knowledge to communal goals.
The methodological framework situating this study within the grounded theory tradition, and the theoretical choices regarding human agency and the relationship between realist and constructivist perspectives, are elaborated in a separate document.
Study Site
This study draws from interviews with individuals affiliated with a single case: the COVID-19 Immunity Task Force (CITF) Databank. The CITF was an initiative whose mandate was to catalyze, support, fund, and harmonize knowledge on SARS-CoV-2 immunity for federal, provincial, and territorial decision-makers in their efforts to protect Canadians and minimize the impact of the COVID-19 pandemic. The CITF Databank provides continued support to the research community by centralizing research data provided by CITF-funded studies and by making the data accessible for extended research.
In case study research, cases represent discrete instances of a phenomenon that inform the researcher about it. The cases are not the subjects of inquiry; rather, they represent unique sets of circumstances that frame or contextualize the phenomenon of interest (Stake 2006, 4–7). The power of case study research derives from identifying consistencies that relate cases to each other, while simultaneously highlighting how their unique and distinguishing facets contribute to their representativeness of the underlying phenomenon. Case study research therefore plays on the tensions between cases and the phenomenon that they are called upon to represent (Ragin 1999, 1139–40).
The case was selected partially out of convenience, but this does not discount the analytical value that it affords. As a community-led initiative, the CITF Databank presents an opportunity to investigate how epidemiologists balance the values deriving from their own epistemic culture with the challenges of coordinating collective efforts. It has established an explicit governance structure, which provides terms and concepts to examine through a critical lens; it comprises people working in a multitude of roles, many of whom are locally available; and it is a venue where multiple relevant epistemic conflicts and challenges manifest themselves, which are rich sources from which a qualitative researcher may ascertain competing and complementary value regimes.
The case provides adequate breadth of perspective, which is of greater concern than sample size under the constructivist grounded theory and situational analysis methodological frameworks. The goal is to articulate the series of inter-woven factors that impact how epidemiological researchers coordinate and participate in data-sharing initiatives while explicitly accounting for and drawing from the unique and situational contexts that circumscribe their perspectives — not to define causal relationships or to derive findings that may be generalized across the whole field of epidemiology. As such, statistical representativeness is not an objective of this research.
I adhere to a theoretical sampling strategy, which effectively entails developing the sample in response to ongoing theory-building. Theoretical sampling focuses on finding new data sources that can best address specific theoretical facets of an emerging analysis. The goal is to eventually reach a point of saturation, when new data fit into the emergent model with ease (Charmaz 2000, 519–20). This sampling strategy enables me to access new dimensions on a topic that arise during reflexive inquiry (Morse and Clark 2019, 146).
Data Collection
The study draws from approximately ten semi-structured interviews with individuals who lead, support, or participate in the CITF Databank’s operations, including professional researchers, research trainees, and administrative and technical support staff. Interviews are oriented by the study’s goal to document processes of reconciling different stakeholders’ interests as they converge in the formation of a common data resource. Specifically, interviews focus on participants’ motives, the challenges they experience, how they envision success and failure, their perceptions of their own roles and the roles of other team members and stakeholders, the values that inform their decisions, how the technological apparatus they set up enables them to realize their goals and values, and ways in which they believe the work could be improved.
Interviews were held either in person or via video conference, in quiet and comfortable environments such as office spaces or conference rooms. In-person interviews were recorded using a Sony ICD-UX560 audio recorder in lossless 16-bit 44.1 kHz Linear PCM WAV format; some were also captured on video using a GoPro Hero 4 Silver action camera. Remote interviews were recorded using the video conferencing software’s built-in recording tools, in compliance with McGill’s Cloud Directive. In all cases, typed notes were maintained during and immediately after each session. Following each interview, data were copied off the recording devices onto a dedicated project drive, organized and renamed according to a semantic naming scheme, and mirrored onto physical and cloud-based backup drives.
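As an illustration, the sketch below outlines this post-interview routine in Python, assuming a hypothetical naming scheme and directory layout; the actual scheme and storage locations are documented in the data management plan.

```python
from datetime import date
from pathlib import Path
import shutil

# Hypothetical layout: raw recordings are copied off the device, renamed to a
# semantic scheme (project, participant ID, date, medium), and mirrored to
# physical and cloud-synced backup locations.
PROJECT_DRIVE = Path("citf-databank-study/interviews")
BACKUPS = [Path("/mnt/backup/interviews"), Path("~/CloudSync/interviews").expanduser()]

def ingest_recording(raw_file: Path, participant_id: str, medium: str) -> Path:
    """Rename a raw recording semantically and mirror it to each backup drive."""
    new_name = f"citf_{participant_id}_{date.today():%Y%m%d}_{medium}{raw_file.suffix.lower()}"
    destination = PROJECT_DRIVE / new_name
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(raw_file, destination)               # copy onto the project drive
    for backup in BACKUPS:
        backup.mkdir(parents=True, exist_ok=True)
        shutil.copy2(destination, backup / new_name)  # mirror to each backup
    return destination

# e.g. ingest_recording(Path("staging/REC001.WAV"), participant_id="P07", medium="audio")
```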
Interviews are semi-structured, following a broad pattern of inquiry that traces (1) participants’ goals and perspectives; (2) their project’s missions, purposes, and motivations; and (3) the practices, procedures, and relationships that enable them to operationalize their personal and collective goals. Questions were tailored to each individual so that participants could respond both in general terms and in relation to their own specific experiences. The full interview guide is available here.
Transcripts were generated using a locally-hosted Whisper model via noScribe and then manually edited to correct errors, primarily misspellings of proper names and acronyms, misassigned speaker labels, and minor misinterpretations of accented speech. All data are collected and curated in full compliance with the ethics protocol. Further details on storage, file formats, and data processing procedures are documented in the data management plan.
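For reference, the sketch below shows the kind of local transcription step that noScribe wraps, using the openai-whisper Python package directly; the model size and file names are illustrative assumptions, and the sketch omits the speaker diarization that noScribe adds.

```python
import whisper  # openai-whisper; transcription runs entirely on the local machine

# Load a Whisper model locally so that no audio leaves the project machine.
model = whisper.load_model("medium")

# Transcribe a (hypothetical) interview recording.
result = model.transcribe("citf_P07_20250415_audio.wav", language="en")

# Write the raw transcript out for manual correction of proper names,
# acronyms, and speaker labels, as described above.
with open("citf_P07_20250415_transcript_raw.txt", "w", encoding="utf-8") as f:
    f.write(result["text"])
```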
Analysis
The study implements qualitative data analysis (QDA) methods to highlight collaborative aspects of data-sharing in epidemiology, as elicited in the corpus of transcribed interviews. QDA involves encoding the primary sources of evidence in ways that enable a researcher to draw cohesive theoretical accounts or explanations. This is done by tagging segments of a document using codes, and by embedding open-ended interpretive memos directly alongside the data. Through these methods, a researcher is able to articulate theories based on empirical evidence that reflect the informants’ diverse experiences.
Coding — which involves defining what specific elicitations are about in terms that are relevant to the theoretical frameworks that inform the research — entails rendering instances within a text as interpreted abstractions called codes (Charmaz 2014, 43). Codes can exist at various levels of abstraction. For instance, an analyst may apply descriptive codes to characterize literal facets of an instance within a text, and theoretical codes to represent more interpretive concepts that correspond with aspects of particular theoretical frameworks. In other words, coding involves applying a precise language to segments of transcribed interviews that serves to bridge the gap between what participants said and the theoretical frameworks that the analyst applies to explore them as epistemic activities, interfaces, and values (cf. Charmaz (2014); Saldaña (2011), 95-98). Further details on the code system and specific coding procedures are documented in the QDA protocol.
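To make the distinction concrete, the following is a schematic and entirely hypothetical illustration of how a single excerpt might carry both kinds of codes; the excerpt and code names are invented for illustration and do not reflect the storage format of the coding software described below.

```python
# Hypothetical coded segment: the descriptive codes name what the excerpt
# literally describes, while the theoretical code ties it to the study's
# analytical framework (epistemic activities, interfaces, and values).
coded_segment = {
    "interview": "P07",
    "lines": (112, 118),
    "excerpt": "We had to rename half the variables before the merge would even run.",
    "codes": {
        "descriptive": ["data-cleaning", "variable-renaming"],
        "theoretical": ["harmonization-as-epistemic-activity"],
    },
    "memo": "Links friction in merging to divergent local naming conventions.",
}
```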
Memoing entails more open-ended exploration and reflection upon latent ideas in order to crystallize them into new avenues to pursue (Charmaz 2014, 72). Constructing memos is a relatively flexible way of engaging with data and serves as fertile ground for honing new ideas. Memoing is especially crucial while articulating sensitizing concepts, which Charmaz (2003, 259) refers to as the “points of departure from which to study the data”. Memoing allows the researcher to take initial notions that lack specification of well-defined attributes, and gradually refine them into more cohesive, definitive concepts (Blumer 1954, 7; Bowen 2006). Memoing is also very important in the process of drawing out more coherent meaning from coded data (cf. Charmaz (2014), 181, 290-93). By creating memos pertaining to the intersections of various codes and drawing comparisons across similarly coded instances, an analyst is able to form more robust and generalizable arguments about the phenomena of interest and relate them to alternative perspectives expressed by others.
Throughout the analysis, I follow the approach advocated by Nicolini (2009) and Maryl et al. (2020), which entails “zooming in to a granular study of particular research activities and operations and zooming out to considering broader sociotechnical and cultural factors.” This involves “magnifying or blowing up the details of practice, switching theoretical lenses, and selective re-positioning so that certain aspects are foregrounded and others are temporarily sent to the background” (Nicolini 2009, 1412). This approach leverages the fact that participants’ actions and interactions are informed by broader communities of practice, which imbue them with professional norms, expectations, and value regimes. When asking participants about their particular work experiences in the context of the case, I also ask them to relate their own unique circumstances to the field as a whole, thereby enhancing the potential for this study to speak to more general phenomena.
This work is performed using qc, a command-line qualitative coding tool that stores data across plain text files and a SQLite database, which enables retrieval of coded segments and identification of patterned distributions of codes across the entire corpus. I have also developed qc-atelier, a companion local server that provides browser-based interfaces for codebook management, reflective analysis, and LLM-assisted code alignment layered on top of qc. Querying the dataset in this way enables the analyst to articulate elaborated accounts of specific kinds of activities, decisions, values, and sentiments that cut across various informants’ perspectives. Further details on the code system, memoing guidelines, and specific QDA procedures are documented in the QDA protocol.
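The sketch below illustrates the kind of corpus-wide retrieval this arrangement supports. It is written against a simplified, hypothetical table of coded segments held in an in-memory SQLite database with invented rows, not qc’s actual schema or data.

```python
import sqlite3

# Hypothetical, simplified schema: one row per code application.
# qc's actual tables differ; this only illustrates the kind of query involved.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE coded_segments (document TEXT, code TEXT, line_start INT, line_end INT)")
conn.executemany(
    "INSERT INTO coded_segments VALUES (?, ?, ?, ?)",
    [
        ("P01.txt", "data-harmonization", 112, 118),
        ("P01.txt", "governance", 114, 120),
        ("P03.txt", "data-harmonization", 40, 44),
        ("P03.txt", "trust", 41, 46),
    ],
)

# Which codes co-occur with 'data-harmonization' in the same document, and how often?
rows = conn.execute(
    """
    SELECT b.code, COUNT(*) AS n
    FROM coded_segments AS a
    JOIN coded_segments AS b
      ON a.document = b.document AND a.code != b.code
    WHERE a.code = 'data-harmonization'
    GROUP BY b.code
    ORDER BY n DESC
    """
).fetchall()

for code, n in rows:
    print(f"{code}: {n}")
```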
Statistical methods play a limited role in this study. Basic summary statistics (e.g. cross-tabulation) may be used to represent the distribution of codings across individual interviews or groups of interviews, which may help to identify trends and associations within that limited scope. Such statistics may support theory-building but are not used to infer generalizable causal relationships.
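As an example, the minimal pandas sketch below shows the kind of cross-tabulation intended here, assuming a simple table with one row per coded segment; the column names and values are illustrative.

```python
import pandas as pd

# Hypothetical codings table: one row per coded segment.
codings = pd.DataFrame({
    "interview": ["P01", "P01", "P02", "P03", "P03", "P03"],
    "code": ["data-harmonization", "governance", "governance",
             "data-harmonization", "trust", "governance"],
})

# Cross-tabulate code frequencies by interview: rows are codes,
# columns are interviews, cells are counts of coded segments.
table = pd.crosstab(codings["code"], codings["interview"])
print(table)
```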
Outcomes
This project contributes insights regarding the practical benefits and challenges involved in epidemiological data-sharing. It identifies how relevant stakeholders engage with the systems that scaffold data-sharing initiatives, which may differ from modelled behaviours specified in aspirational plans and procedural documents. In effect, by articulating how these systems succeed or fail to account for their users’ practical needs and disciplinary values, this study provides constructive feedback that will inform their further development.
Findings will be disseminated through peer-reviewed publications and conference presentations as opportunities arise.
Ethics
The study has been conducted according to ethical principles stated in the Declaration of Helsinki (World Medical Association 2013). Ethics approval was obtained from McGill University’s Research Ethics Board (Protocol 25-01-057, approved 2025-03-03) before initiating interviews. Consent forms are designed with respect for participants’ well-being, free will, and privacy. The practices undertaken to ensure adherence to these principles are described in the ethics protocol.