Maelstrom reading notes

reading
Notes pertaining to works published about Maelstrom and the harmonization initiatives it has supported.
Published

January 7, 2025

Modified

January 28, 2025

Doiron et al. (2013)

  • Initial overview of data harmonization procedures, using the Healthy Obesity Project (HOP) as an illustrative case.
  • Outlines the technical apparatus, especially for DataShield, but also broadly describes the discursive process of arriving at a DataSchema that is both functional and flexible.
    • This description is quite broad and abstract; it seems somewhat idealized and aspirational.
  • Describes reliance on international standards, such as the International Labour Organization’s International Standard Classification of Occupations.
    • It seems like these are used as black boxes that encapsulate a series of tensions which epidemiologists are unconcerned with; in effect, they reduce the need to stretch collaborative ties even further than they are already extended, and they mark such matters as out of scope for deeper discursive engagement.
  • It is notable that they emphasize that it’s easy to set up and use DataShield and Maelstrom toolkits independently of university IT, and that they can be run using RStudio installed on a basic laptop.
    • Maybe look into the historical context (2013) and the evolving role of university IT in software selection.
  • The conclusion states that the HOP project was successful in its harmonization efforts, but does not go as far as to state that it produced meaningful findings as a result of harmonization.

Doiron et al. (2017)

  • An overview of the key software that facilitates data harmonization practices under Maelstrom, also briefly touched upon in Doiron et al. (2013).
  • Page 1373 refers to graphical and programmatic interfaces and assumes certain roles and tasks associated with each.
  • Briefly describes its use by the Canadian Longitudinal Study on Aging (CLSA), the Canadian Partnership for Tomorrow Project (CPTP) and InterConnect, primarily by describing the range and quantity of data that these systems manage in each case.

Opal provides a centralized web-based data management system allowing study coordinators and data managers to securely import/export a variety of data types (e.g. text, numerical, geolocation, images, videos) and formats (e.g. SPSS, CSV) using a point-and-click interface. Opal then converts, stores and displays these data under a standardized model.

Mica is used to create websites and metadata portals for individual epidemiological studies or multi-study consortia, with a specific focus on supporting observational cohort studies. The Mica application helps data custodians and study or network coordinators to efficiently organize and disseminate information about their studies and networks without significant technical effort.

Fortier et al. (2010)

  • A very grandiose paper presenting the grand vision for DataSHaPER, which would eventually become Maelstrom.
    • Lots of co-authors!
  • Invokes the pan-European EPIC project (European Prospective Investigation into Cancer and Nutrition), which faced numerous data synthesis challenges despite its proactive effort to coordinate work across numerous research centres.

Two complementary approaches may be adopted to support effective data synthesis. The first one principally targets ‘what’ is to be synthesized, whereas the other one focuses on ‘how’ to collect the required information. Thus: (i) core sets of information may be identified to serve as the foundation for a flexible approach to harmonization; or (ii) standard collection devices (questionnaires and standard operating procedures) may be suggested as a required basis for collection of information.

  • DataSHaPER is an acronym for DataSchema and Harmonization Platform for Epidemiological Research.

In an ideal world, information would be ‘prospectively harmonized’: emerging studies would make use, where possible, of harmonized questionnaires and standard operating procedures. This enhances the potential for future pooling but entails significant challenges - ahead of time - in developing and agreeing to common assessment protocols. However, at the same time, it is important to increase the utility of existing studies by ‘retrospectively harmonizing’ data that have already been collected, to optimize the subset of information that may legitimately be pooled. Here, the quantity and quality of information that can be pooled is limited by the heterogeneity intrinsic to the pre-existing differences in study design and conduct.

Compares prospective and retrospective harmonization, with the former presented as ideal and the latter as a pragmatic reconciliation acknowledging that the former is essentially impossible to achieve.

  • DataSHaPER is strikingly similar to OCHRE:
    • XML-based data structures
    • Genesis of a generic and ultimately optional base-level schema that illustrates the kind of data that the data structure may hold in ways that are immediately recognizable to all practitioners (at OCHRE it was associations between contexts and finds)
    • Separate harmonization platform where users can edit and manipulate records and associations between them

The question ‘What would constitute the ultimate proof of success or failure of the DataSHaPER approach’ needs to be addressed. Such proof will necessarily accumulate over time, and will involve two fundamental elements: (i) ratification of the basic DataSHaPER approach; and (ii) confirmation of the quality of each individual DataSHaPER as they are developed and/or extended. An important indication of the former would be provided by the widespread use of our tools. However, the ultimate proof of principle will necessarily be based on the generation of replicable scientific findings by researchers using the approach. But, for such evidence to accumulate it will be essential to assure the quality of each individual DataSHaPER. Even if the fundamental approach is sound, its success will depend critically on how individual DataSHaPERs are constructed and used. It seems likely that if consistency and quality are to be assured in the global development of the approach, it will be necessary for new DataSHaPERs to be formally endorsed by a central advisory team.

Fortier et al. (2011)

This paper responds to Hamilton et al. (2011), which presents an effort to devise a standardized nomenclature. The response is basically to advocate for a more flexible approach rather than the stringent one promoted by Hamilton et al. (2011). It draws extensively on concepts published in the foundational paper by Fortier et al. (2010).

Two complementary approaches to harmonization may be adopted to support effective data synthesis or comparison across studies. The first approach makes use of identical data collection tools and procedures as a basis for harmonization and synthesis. Here we refer to this as the “stringent” approach to harmonization. The second approach is considered “flexible” harmonization. Critically, the second approach does not demand the use of identical data collection tools and procedures for harmonization and synthesis. Rather, it has to be based on sound methodology to ensure inferential equivalence of the information to be harmonized. Here, standardization is considered equivalent to stringent harmonization. It should, however, be noted that the term standard is occasionally employed to refer to common concepts or comparable classification schemes but does not necessarily involve the use of identical data collection tools and procedures (12, 13).

This directly parallels the distinction made in Fortier et al. (2010) between “ideal” prospective and more pragmatic retrospective approaches to data harmonization.

Synthesis of data using a flexible harmonization approach may be either prospective or retrospective. To achieve flexible prospective harmonization, investigators from several studies will agree on a core set of variables (or measures), compatible sets of data collection tools, and standard operating procedures but will allow a certain level of flexibility in the specific tools and procedures used in each study (16, 17). Retrospective harmonization targets synthesis of information already collected by existing legacy studies (15, 18, 19). As an illustrative example, using retrospective harmonization, researchers will define a core set of variables (e.g., body mass index, global level of physical activity) and, making use of formal pairing rules, assess the potential for each participating study to create each variable (15). The ability to retrospectively harmonize data from existing studies facilitates the rapid generation of new scientific knowledge.

I wonder why there is no example provided for prospective data harmonization. Is it because it is ideal and not realistic? I’d argue that it is simply what occurs within individual projects.
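
To make the retrospective, “flexible” approach concrete, here is a minimal sketch of the kind of processing the pairing-rules idea implies: two hypothetical studies that collected height and weight in different units are each mapped onto a common DataSchema variable (BMI). The data, units, and function names are invented for illustration; this is not Maelstrom’s actual tooling.

```r
# Hypothetical source data from two studies using different collection formats.
study_a <- data.frame(height_cm = c(162, 178), weight_kg = c(61, 85))
study_b <- data.frame(height_in = c(64, 70),  weight_lb = c(130, 190))

# Study-specific processing rules mapping source variables onto the
# DataSchema variable `bmi` (kg/m^2).
bmi_a <- function(d) with(d, weight_kg / (height_cm / 100)^2)
bmi_b <- function(d) with(d, (weight_lb * 0.4536) / (height_in * 0.0254)^2)

harmonized <- rbind(
  data.frame(study = "A", bmi = bmi_a(study_a)),
  data.frame(study = "B", bmi = bmi_b(study_b))
)
harmonized
```

A study that collected neither height nor weight would simply be flagged as unable to generate the variable, which is the “assess the potential for each participating study” step described in the excerpt.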

Fortier et al. (2017)

Explicit statement regarding the rationale and presumed benefits of harmonization right in the first paragraph:

The rationales underpinning such an approach include ensuring: sufficient statistical power; more refined subgroup analysis; increased exposure heterogeneity; enhanced generalizability and a capacity to undertake comparison, cross validation or replication across datasets. Integrative agendas also help maximizing the use of available data resources and increase cost-efficiency of research programmes.

Summarized in bullet points:

  • ensuring sufficient statistical power
  • more refined subgroup analysis
  • increased exposure heterogeneity
  • enhanced generalizability
  • a capacity to undertake comparison, cross validation or replication across datasets.
  • maximizing the use of available data resources
  • increase cost-efficiency of research programmes

Clearly defines harmonization and its benefits:

Essentially, data harmonization achieves or improves comparability (inferential equivalence) of similar measures collected by separate studies.

Adds an additional argument for retrospective harmonization on top of prior discussion of retrospective/prospective approaches (cf. Fortier et al. (2010); Fortier et al. (2011)):

Repeating identical protocols is not necessarily viewed as providing evidence as strong as that obtained by exploring the same topic but using different designs and measures.

Also relates retrospective harmonization to systematic meta-reviews. In fact, the paper basically responds to calls for more structured guidelines for data harmonization, similar to those produced to support structured meta-reviews in the years prior to this publication. The authors identify several earlier papers offering guidelines or reports on harmonization practices, which they claim are too broad.

This paper reports the findings of a systematic inquiry into data harmonization initiatives, whose data comprise responses to a questionnaire. The findings indicate that procedures were more attentively followed during earlier stages, such as when matching and aligning available data with the project’s designated scope. However, procedures were less sound with regard to documenting procedures, validating the results of data processing, and dissemination strategy. There is a notable division between work that occurs before and after people actually begin handling the data, which points to a tension between aspirational plans and tangible experiences.

Respondents were asked to delineate the specific procedures or steps undertaken to generate the harmonized data requested. Sound procedures were generally described; however, the terminologies, sequence and technical and methodological approaches to these procedures varied considerably. Most of the procedures mentioned were related to defining the research questions, identifying and selecting the participating studies (generally not through a systematic approach), identifying the targeted variables to be generated and processing data into the harmonized variables. These procedures were reported by at least 75% of the respondents. On the other hand, few reported steps related to validation of the harmonized data (N=4; 11.8%), documentation of the harmonization process (N=5; 14.7%) and dissemination of the harmonized data outputs (N=2; 5.9%).

The paper summarizes some specific “potential pitfalls” reported by respondents to their survey:

  • ensuring timely access to data;
  • handling dissimilar restrictions and procedures related to individual participant data access;
  • managing diversity across the rules for authorship and recognition of input from study-specific investigators;
  • mobilizing sufficient time and resources to conduct the harmonization project;
  • gathering information and guidance on harmonization approaches, resources and techniques;
  • obtaining comprehensive and coherent information on study-specific designs, standard operating procedures, data collection devices, data format and data content;
  • understanding content and quality of study-specific data;
  • defining the realistic, but scientifically acceptable, level of heterogeneity (or content equivalence) to be obtained;
  • generating effective study-specific and harmonized datasets, infrastructures and computing capacities;
  • processing data under a harmonized format taking into account diversity of: study designs and content, study population, synchronicity of measures (events measured at different point in time or at different intervals when repeated) etc;
  • ensuring proper documentation of the process and decisions undertaken throughout harmonization to ensure transparency and reproducibility of the harmonized datasets;
  • maintaining long-term capacities supporting dissemination of the harmonized datasets to users.

It’s not made clear how these responses were distributed among respondents.

The authors then identify several absolutely essential requirements for achieving success:

  • Collaborative framework: a collaborative environment needs to be implemented to ensure the success of any harmonization project. Investigators involved should be open to sharing information and knowledge, and investing time and resources to ensure the successful implementation of a data-sharing infrastructure and achievement of the harmonization process.
  • Expert input: adequate input and oversight by experts should be ensured. Expertise is often necessary in: the scientific domain of interest (to ensure harmonized variables permit addressing the scientific question with minimal bias); data harmonization methods (to support achievement of the harmonization procedures); and ethics and law (to address data access and integration issues).
  • Valid data input: study-specific data should only be harmonized and integrated if the original data items collected by each study are of acceptable quality.
  • Valid data output: transparency and rigour should be maintained throughout the harmonization process to ensure validity and reproducibility of the harmonization results and to guarantee quality of data output. The common variables generated necessarily need to be of acceptable quality.
  • Rigorous documentation: publication of results generated making use of harmonized data must provide the information required to estimate the quality of the process and presence of potential bias. This includes a description of the: criteria used to select studies; process achieved to select and define variables to be harmonized; procedures used to process data; and characteristics of the study-specific and harmonized dataset(s) (e.g. attribute of the populations).
  • Respect for stakeholders: all study-specific as well as network-specific ethical and legal components need to be respected. This includes respect of the rights, intellectual property interests and integrity of study participants, investigators and stakeholders.

The authors describe how they arrived at guidelines following the results of this study:

A consensus approach was used to assemble information about pitfalls faced during the harmonization process, establish guiding principles and develop the guidelines. The iterative process (informed by workshops and case studies) permitted to refine and formalize the guidelines. The only substantive structural change to the initial version proposed was the addition of specific steps relating to the validation, and dissemination and archiving of harmonized outputs. These steps were felt essential to emphasize the critical nature of these particular issues.

The paper outlines a checklist of stages that data harmonization initiatives need to go through to produce ideal outcomes. For each task, they describe a scenario in which the task can be said to be complete, which resembles an ideal outcome. This is described in the paper, summarized in a table, and more comprehensively documented in the supplementary materials.

Also worth noting, this paper includes a list of harmonization initiatives that I may consult when selecting cases. I’m not quite sure how useful it will be since the findings don’t really break down the distribution of responses in any detail, but maybe the authors have done this analysis and not published it.

Bergeron et al. (2018)

The authors reference the drive for efficiency as a motivating factor behind open data:

However, many cohort databases remain under-exploited. To address this issue and speed up discovery, it is essential to offer timely access to cohort data and samples.

However, the paper is actually about the need for better and more publicly accessible documentation about data.

The authors state that catalogues exist to promote discoverability of data and samples and to answer the data documentation needs of individual studies.

They draw attention to the importance of catalogues in research networks (analyzing data across studies), which establish portals that document “summary statistics on study subjects, such as the number of participants presenting specific characteristics (e.g. diseases or exposures)”.

The authors outline several challenges that inhibit or limit the potential value of catalogues:

The quality of a catalogue directly depends on the quality and comprehensiveness of the study-specific information documented. But, maintaining and providing access to understandable and comprehensive documentation to external users can be challenging for cohort investigators, and require resources not always available, particularly for the very small or long-established studies. In addition, the technical work required to build and maintain a catalogue is particularly demanding. For example, gathering comprehensive and comparable information on study designs necessitates the implementation of rigorous procedures and working in close collaboration with study investigators. Manual classification of variables is also a long and a tedious process prone to human error. Moreover, the information collected needs to be regularly revised to update metadata with new data collections. These challenges, among others, can lead to the creation of catalogues with partial or disparate information across studies, documenting limited subsets of variables (e.g. only information collected at baseline) or including only studies with data dictionaries available in a specific language or format.

Bullet point summary:

  • A catalogue’s quality depends on the quality and comprehensiveness of documentation provided by individual studies
  • Cohort investigators, i.e. leaders of individual studies, are under-equipped to provide such comprehensive documentation
    • Do they just need material support? Or also guidance on how to do it, what factors to account for, etc?
  • Technical work for building and maintaining a catalogue is demanding
    • I’m not sure the example they provide to illustrate these challenges aligns with what I would call “technical work”; they refer to precise and detailed documentation in direct consultation with individual study maintainers, and I suppose the discussion about and documentation of methodological details is technical in that it corresponds with the work that was actually done on the ground, using data collection and processing instruments.
  • Classification of variables is a long and tedious process (a sketch of what a single variable’s catalogue record might involve follows this list)
    • What makes it long and tedious? This isn’t really specified
    • They note that this is prone to human error, but I wonder what successful or error-ful (?) outcomes would look like and how they would differ
  • The information needs to be regularly revised and updated
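
To give a sense of what “documentation” and “classification of variables” amount to at the smallest grain, here is a hypothetical, simplified variable-level record of the kind a catalogue must hold and keep current for every variable of every study. The field names are invented and do not reproduce the actual Opal/Maelstrom data-dictionary format.

```r
# One hypothetical variable-level metadata record. A real catalogue needs one
# of these, kept up to date, for every variable in every participating study.
dictionary_entry <- data.frame(
  study      = "Cohort X",
  table      = "baseline_questionnaire",
  name       = "smk_status",
  label      = "Current smoking status, self-reported",
  valueType  = "integer",
  categories = "0=never; 1=former; 2=current",
  domain     = "Lifestyle and behaviours"  # classification used for browsing
)
dictionary_entry
```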

The authors’ recommendations for resolving these concerns:

However, to truly optimize usage of available data and leverage scientific discovery, implementation of high quality metadata catalogues is essential. It is thus important to establish rigorous standard operating procedures when developing a catalogue, obtain sufficient financial support to implement and maintain it over time, and where possible, ensure compatibility with other existing catalogues.

Bullet point summary:

  • establish rigorous standard operating procedures when developing a catalogue
  • obtain sufficient financial support to implement and maintain it over time
  • where possible, ensure compatibility with other existing catalogues

Bergeron et al. (2021)

Identifies several registries of relevant cohorts, but notes that they face challenges getting the data together: institutional data-sharing policies, lack of open access to cohort data and to documentation about the data, the complexity of the data (which makes harmonization across studies difficult), and lack of access to funding, secure data environments, and specialized expertise and resources.

The Research Advancement through Cohort Cataloguing and Harmonization (ReACH) initiative was established in collaboration with Maelstrom to overcome some of these barriers in the context of Developmental Origins of Health and Disease (DOHaD) research.

The authors briefly summarize some projects that rely on ReACH data, and provide a more comprehensive table of ongoing and unpublished work.

In the supplementary materials, the authors also include an illustrative example of the specific tasks, decisions, and actions that one might go through when using ReACH data. It is a broad but fairly sober account of how one would navigate the catalogue and engage with collaborators.

Wey and Fortier (2024)

An overview of the harmonization procedures applied in CanPath and MINDMAP. Authored by two members of the Maelstrom team, but no one from these initiatives.

At first, it comes across as another broad-level overview of processes. But, as the conclusion makes explicit, the paper highlights some subtly divergent approaches, some of which are relevant to my project and which I pick up here.

Interesting bit about the choice between centralized systems and systems that are more conducive to end-users’ individual workflows.

In both illustrative projects, information about participating studies and data collected was gathered and made available on the central data server. In the CanPath BL-HRFQ, harmonization documentation and processing scripts were held centrally and only updated by Maelstrom Research. In MINDMAP, because multiple groups simultaneously worked on generating harmonization processing scripts for different areas of information, working versions of the DataSchema and processing scripts were held on a GitHub, allowing for better version control and dynamic updating of scripts by multiple remote data harmonizers.

More about the balance being struck at MINDMAP between institutional and popular tech and documentation platforms:

In the MINDMAP project, R markdown documents with applied harmonization scripts (Figure 13.1b) and any comments were preserved in the GitHub repository, which is also publicly accessible. Furthermore, summary statistics for MINDMAP variables are only available to approved data users through the secure server, as harmonized datasets are only intended for use within the MINDMAP network. Overviews of the harmonization process and outputs have also been published as open-access, peer-reviewed articles for both projects (Fortier et al. 2019; Wey et al. 2021).

Bit about existing collaborative ties making things much easier:

the prospective coordination and unusually high ability for the harmonization team (Maelstrom Research) to work directly with study data managers on recently collected and documented study data resulted in high standardization and ability to resolve questions.

Data curation as an explicitly creative task, involving impactful decisions. Interesting that this was unexpected enough to warrant inclusion as a key challenge:

In MINDMAP, these differences were clearly documented in comments about harmonization for each source dataset to allow researchers to decide how to use the harmonized variable. In-depth exploration of the best statistical methods to harmonize these types of measures to maintain integrity of content while minimizing loss of information and methodological bias are important and active areas of research (Griffith et al. 2016; Van den Heuvel and Griffith 2016). The approach taken in this case was to apply simpler harmonization methods (i.e. rescaling) and document the transformation, leaving investigators the flexibility to further explore and analyze the harmonized datasets as appropriate for their research questions.
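
A minimal sketch of what the “simpler harmonization methods (i.e. rescaling)” mentioned in this excerpt might look like in practice, with invented scales and values rather than MINDMAP’s actual scripts:

```r
# Hypothetical: two studies measured depressive symptoms on instruments with
# different ranges (0-27 and 0-60). Rescale both to 0-1 and record the
# transformation so analysts can judge whether the harmonized variable suits
# their question.
rescale <- function(x, min_src, max_src) (x - min_src) / (max_src - min_src)

study_a <- data.frame(dep_raw = c(3, 14, 22))   # 0-27 instrument
study_b <- data.frame(dep_raw = c(10, 35, 58))  # 0-60 instrument

harmonized <- rbind(
  data.frame(study = "A", dep_score = rescale(study_a$dep_raw, 0, 27),
             note = "rescaled from 0-27 instrument"),
  data.frame(study = "B", dep_score = rescale(study_b$dep_raw, 0, 60),
             note = "rescaled from 0-60 instrument")
)
harmonized
```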

As in the rescaling excerpt above, documentation was considered a viable mode of resolving or accounting for significant discrepancies:

The papers describing the harmonization projects attempt to highlight these considerations, for example, providing some comparison of demographics of the harmonized populations against the general populations from which they were drawn (Dummer et al. 2018; Fortier et al. 2019; Wey et al. 2021), listing sources of study-specific heterogeneity in the harmonized datasets to consider (Fortier et al. 2019), and pointing users to individual study documentation where more information on weights to use for analysis should be considered (e.g. Wey et al. 2021).

The use of GitHub was rationalized as a way of facilitating collaboration across institutional boundaries. They refer to R and R Markdown being open standards as the main reason, but I wonder if part of it is that GitHub, being a non-institutional platform, was easier to use from an onboarding perspective:

In MINDMAP, due to the need for multiple international groups to work simultaneously and flexibly on the harmonization processing and frequently evolving versions of study-specific harmonization scripts, scripts for harmonization processing were written and applied entirely through R markdown in an RStudio interface, and the DataSchema and R markdown versions were maintained and frequently updated in a GitHub repository.

Perhaps ironically, the private company and platform may have been used due to the strength of pan-European collaborative ties that institutions may not be able to keep up with. Whereas in Canada, with centralized project-oriented work, it may be much easier to enforce adoption of centralized tooling. This is just speculation.

Doiron, Raina, and Fortier (2013)

This paper summarizes what was discussed at a workshop bringing together stakeholders who would contribute to two large data harmonization initiatives: the Canadian Longitudinal Study on Aging (CLSA) and the Canadian Partnership for Tomorrow Project (CPTP). It is therefore representative of plans and challenges that were held at an early stage when collaborations were being established.

The authors identify a series of reasons for linking data, which I summarize here:

  1. Maximizing potential of disparate information resources
  • enriching study datasets with additional data not being collected directly from study participants
  • offer vital information on health outcomes of participants
  • validate self-reported information
  2. Drawing maximum value from data produced from public expenditure
  • offers a cost-effective means to maximize the use of existing publicly funded data collections
  3. Develop interdisciplinary collaborative networks
  • by combining a wide range of risk factors, disease endpoints, and relevant socio-economic and biological measurements at a population level, linkage lays the groundwork for multidisciplinary health-research initiatives, which allow the exploration of new hypotheses not foreseeable using independent datasets
  4. Establish long-lasting infrastructure and instill a collaborative culture
  • a coordinated pan-Canadian cohort-to-administrative linked database would establish legacy research infrastructures that will better equip the next generation of researchers across the country

The authors use the term “data linkage”:

Data linkage is “the bringing together from two or more different sources, data that relates to the same individual, family, place or event”. When linking data at the individual level, a common identifier (or a combination of identifiers) such as a personal health number, date of birth, place of residence, or sex, is used to combine data related to the same person but found in separate databases. Data linkage has been used in a number of research fields but is an especially valuable tool for health research given the large amount of relevant information collected by institutions such as governments, hospitals, clinics, health authorities, and research groups that can then be matched to data collected directly from consenting individuals participating in health research.

This is distinct from harmonization in that it is not meant to combine data with similar scope and schematic structure, but rather to relate information collected under various domains so that it can be more easily queried in tandem. I imagine this as reminiscent of establishing links between tables in a relational database.
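
Staying with the relational-database analogy, individual-level linkage is essentially a join on the shared identifier. A minimal sketch with made-up records (the identifier and variables are hypothetical):

```r
# Hypothetical cohort data collected directly from consenting participants.
cohort <- data.frame(phn = c("A01", "A02", "A03"),
                     smoker = c(TRUE, FALSE, TRUE))

# Hypothetical administrative health data held by another institution.
admin <- data.frame(phn = c("A01", "A03", "A04"),
                    hospitalized = c(FALSE, TRUE, TRUE))

# Deterministic linkage on the common identifier (e.g. a personal health number).
linked <- merge(cohort, admin, by = "phn", all.x = TRUE)
linked
```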

The authors identify the open-endedness of the linked data as a unique challenge, without elaborating on this point:

CLSA/CPTP-to-AHD linkage also poses unique challenges in that, in contrast to more traditional requests to link data to answer one-off research questions, it aims to establish a rich data repository that will allow investigators to answer a multitude of research questions over time.

The workshop participants established a 5-point plan:

  1. build strong collaborative relationships between stakeholders involved in data sharing (e.g., researchers, data custodians, and privacy commissioners);
  2. identify an entity which could provide overall leadership as well as individual “champions” within each province;
  3. find adequate and long-term resources and funding;
  4. clarify data linkage and data-sharing models and develop a common framework within which the data linkage process takes place; and
  5. develop a pilot project making use of a limited number of linked variables from participating provinces

The second point, about identifying “champions”, is kind of interesting, and I’d like to know more about what qualities these people were expected to have, their domains of expertise, their collaborative/soft or technical skills, and how this plays into access to funds and general governance structures.

Need to look at Roos, Menec, and Currie (2004), which they cite in the conclusion, specifically with reference to the aspiration to develop “information rich environments”. Seems like it is a primary source for the background on linked data in Manitoba and Australia.

Fortier et al. (2023)

Relates harmonization to the FAIR principles, which have not really featured much in arguments for harmonization in prior Maelstrom papers. Specifically, this paper frames harmonization as a necessary additional condition that makes FAIR data useful; FAIR alone is deemed not enough.

In the following paper, we aim to provide an overview of the logistics and key elements to be considered from the inception to the end of collaborative epidemiologic projects requiring harmonizing existing data.

Interesting acronym/framework for defining research questions:

The research questions addressed and research plan proposed by harmonization initiatives need to be Feasible, Interesting, Novel, Ethical, and Relevant (FINER). (Cummings, Browner, and Hulley 2013)

Table 1 lists examples of questions that could be addressed to help delineate analytical approach, practical requirements, and operations of a harmonization initiative. These are all questions that look toward or project specific goals, which then inform the strategies through which they may be achieved.

The supplementary materials include very detailed information about specific practices and objectives pertaining to the ReACH initiative. However, it’s unclear how well this reflects any specific challenges experienced. In other words, I was hoping for something more candid.

Moreover, this paper is also based on survey responses from 20 harmonization initiatives, but neither the findings resulting from the analysis nor the data are referenced or included. Is this the same survey that informed Fortier et al. (2017)?

Looming in the background of this paper is the DCC lifecycle model. However, they do not cite the DCC, or the field of digital curation in general. The DCC lifecycle model has always been presented as a guideline, sort of divorced from practical experience, or at least that’s how I’ve always understood it. Basically, and literally, a model for expected behaviours and interstitial outcomes. I think it would be interesting to explore (a) why the authors perceived a need to present a model and (b) how they arrived at the phases and principles included in it. Is this tacit or common-sense thinking? Or was it directly informed by more concrete thinking from the field of digital curation?

I should brush up on more recent work regarding the DCC, and specifically, critiques of it. Through a quick search, a few papers seem directly relevant:

  • Cox and Tam (2018)
  • Choudhury, Huang, and Palmer (2020)
  • Rhee (2024)

I think Cox and Tam (2018) may be especially relevant as a critique of the lifecycle metaphor in a general sense. From the abstract, it seems like they identify various downsides pertaining to lifecycle models, specifically that they “mask various aspects of the complexity of research, constructing it as highly purposive, serial, uni-directional and occurring in a somewhat closed system.” I’m not sure how explicit the connection is, but I sense this ties into the general paradigm shift toward greater recognition of curation “in the wild”, as per Dallas (2016).

DataShield

Gaye et al. (2014)

Introduces DataShield.

Frames DataShield as a technical fix to administrative problems:

Many technical and policy measures can be enacted to render data sharing more secure from a governance perspective and less likely to result in loss of intellectual property. For example, data owners might restrict data release to aggregate statistics alone, or may limit the number of variables that individual researchers might access for specified purposes. Alternatively, secure analysis centres, such as the ESRC Secure Data Service and SAIL represent major informatics infrastructures that can provide a safe haven for remote or local analysis/linkage of data from selected sources while preventing researchers from downloading the original data themselves. However, to complement pre-existing solutions to the important challenges now faced, the DataSHIELD consortium has developed a flexible new way to comprehensively analyse individual-level data collected across several studies or sources while keeping the original data strictly secure. As a technology, DataSHIELD uses distributed computing and parallelized analysis to enable full joint analysis of individual-level data from several sources, e.g. research projects or health or administrative data, without the need for those data to move, or even be seen, outside the study where they usually reside. Crucially, because it does not require underpinning by a major informatics infrastructure and because it is based on non-commercial open source software, it is both locally implementable and very cost effective.

Adds a social/collaborative element to earlier arguments about the challenges inherent in prospective harmonization, highlighting a need for engagement with individual studies (through either direct or peripheral participation) to conduct research that was not initially planned for:

Unfortunately, both [study-level meta-analysis] SLMA and [individual-level meta-analysis] ILMA present significant problems. Because SLMA combines analytical results (e.g. means, odds ratios, regression coefficients) produced ahead of time by the contributing studies, it can be very inflexible: only the pre-planned analyses undertaken by all the studies can be converted into joint results across all studies combined. Any additional analyses must be requested post hoc. This hinders exploratory analysis, for example the investigation of sub-groups, or interactions between key variables.

Provides a detailed overview of how DataShield was implemented for HOP (the Healthy Obesity Project), including the code used to generate specific figures and analyses. However, it does not really describe or reflect upon the processes through which the code was developed.

The authors highlight the fact that certain analytical approaches are not possible using DataShield, especially analyses that visualize individual data points. It’s unclear how they enforce this, or whether it’s an implicit limitation based on the data that DataShield participants provide.

Because in DataSHIELD potentially disclosive commands are not allowed, some analyses that are possible in standard R are not enabled. In essence, there are two classes of limitation on potential DataSHIELD functionality: (i) absolute limitations which require an analysis that can only be undertaken by enabling one of the functionalities (e.g. visualizing individual data points) that is explicitly blocked as a fundamental element of the DataSHIELD philosophy. For example, this would be the case for a standard scatter plot. Such limitations can never be circumvented and so alternatives (e.g. contour and heat map plots) are enabled which convey similar information but without disclosing individual data points; (ii) current limitations which are functions or models that we believe are implementable but we have not, as yet, undertaken or completed the development work required. As examples, these latter include generalized linear mixed model (including multi-level modelling) and Cox regression.

The authors list numerous other limitations and challenges. Some have to do with what kinds of data DataShield can handle (something about horizontal and vertical that I do not yet fully understand). Other challenges include the need for data to be harmonized, and having to deal with governance concerns.

Notably, the first challenge mentioned seems to contradict the statement earlier on (also made by Doiron et al. (2013)) that this is relatively easy to set up. The authors acknowledge that coding for analysis using DataShield has a steep learning curve and requires some pre-planning to enable results from satellite computers to be properly combined. Their mitigation is to black-box these concerns by implementing simpler client-side functions that mask the more complex behaviours (and presumably translate error messages in ways that users can understand and act upon!).

Despite its potential utility, implementation of DataSHIELD involves significant challenges. First, although set-up is fundamentally straightforward, application involves a relatively steep learning curve because the command structure is complex: it demands specification of the analysis to be undertaken, the studies to use and how to combine the results. In mitigation, most complex server-side functions are now called using simpler client-side functions and we are working on a menu-driven implementation.
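
For orientation, a client-side DataSHIELD session might look roughly like the sketch below. This is from my memory of the DSI, DSOpal, and dsBaseClient packages; the URLs and credentials are placeholders, and the exact function names and signatures should be checked against current documentation rather than taken from here.

```r
# Rough, unverified sketch of a DataSHIELD client session.
library(DSI)
library(DSOpal)        # Opal driver for DSI (assumed)
library(dsBaseClient)

builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal.study1.example",
               user = "analyst", password = "secret",
               table = "project.harmonized", driver = "OpalDriver")
builder$append(server = "study2", url = "https://opal.study2.example",
               user = "analyst", password = "secret",
               table = "project.harmonized", driver = "OpalDriver")

# Log in to each study's server and assign its table to the symbol D there.
conns <- DSI::datashield.login(logins = builder$build(), assign = TRUE, symbol = "D")

# Simple client-side calls; only non-disclosive aggregates come back.
ds.mean("D$bmi", datasources = conns)           # per-study and/or pooled means
ds.histogram(x = "D$bmi", datasources = conns)  # anonymized histogram, not raw points

DSI::datashield.logout(conns)
```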

Also interesting that they note how there may be unanticipated problems, either accidental or malicious, and their way of mitigating against this is to log all commands:

Fifth, despite the care taken to set up DataSHIELD so that it works properly and is non-disclosive, it is possible that unanticipated problems (accidental or malicious) may arise. In order to identify, describe and rectify any errors or loopholes that emerge and in order to identify deliberate miscreants, all commands issued on the client server and enacted on each data server are permanently logged.

This is even more interesting in light of their repeated references to “care.data”, which they do not address in depth, but which seems to have been a scandal involving unauthorized release of personal health data used in research.

The authors add an additional caveat concerning the need to ensure that the data are cleaned in advance.

But, to be pragmatic, many of the routinely collected healthcare and administrative databases will have to undergo substantial evolution before their quality and consistency are such that they can directly be used in high-quality research without extensive preparatory work. By its very nature, such preparation (which typically includes data cleaning and data harmonization) cannot usually be undertaken in DataSHIELD, because it involves investigating discrepancies and/or extreme results in individual data subjects: the precise functionality that DataSHIELD is designed to block. Such work must therefore be undertaken ahead of time by the data generators themselves, and this is demanding of time, resources and expertise that, at present, many administrative data providers may well be unwilling and/or unable to provide. That said, if the widespread usability of such data is viewed as being of high priority, the required resources could be forthcoming.

This corresponds with another limitation identified earlier, namely with regard to identifying duplicate individual records across jurisdictional boundaries (which involves assumptions regarding nationality and identity – one of those weird myths that programmers can’t seem to let go of!):

So far DataSHIELD has been applied in settings where individual participants in different studies are from different countries or from different regions so it is unlikely that any one person will appear in more than one source. However, going forward, that cannot always be assumed. We have therefore been considering approaches to identify and correct this problem based on probabilistic record linkage. In the genetic setting [48] the BioPIN provides an alternative solution. Ongoing work is required.

Note the last line of the prior block quote regarding data cleaning:

That said, if the widespread usability of such data is viewed as being of high priority, the required resources could be forthcoming.

This seems like a thread worth tugging at!

Wolfson et al. (2010)

x

References

Bennett, Siiri N., Neil Caporaso, Annette L. Fitzpatrick, Arpana Agrawal, Kathleen Barnes, Heather A. Boyd, Marilyn C. Cornelis, et al. 2011. “Phenotype Harmonization and Cross-Study Collaboration in GWAS Consortia: The GENEVA Experience.” Genetic Epidemiology 35 (3): 159–73. https://doi.org/10.1002/gepi.20564.
Bergeron, Julie, Dany Doiron, Yannick Marcon, Vincent Ferretti, and Isabel Fortier. 2018. “Fostering Population-Based Cohort Data Discovery: The Maelstrom Research Cataloguing Toolkit.” PLOS ONE 13 (7): e0200926. https://doi.org/10.1371/journal.pone.0200926.
Bergeron, Julie, Rachel Massicotte, Stephanie Atkinson, Alan Bocking, William Fraser, Isabel Fortier, and the ReACH member cohorts’ principal investigators. 2021. “Cohort Profile: Research Advancement Through Cohort Cataloguing and Harmonization (ReACH).” International Journal of Epidemiology 50 (2): 396–97. https://doi.org/10.1093/ije/dyaa207.
Choudhury, Sayeed, Caihong Huang, and Carole L. Palmer. 2020. “Updating the DCC Curation Lifecycle Model.” International Journal of Digital Curation 15 (1): 12. https://doi.org/10.2218/ijdc.v15i1.721.
Cox, Andrew Martin, and Winnie Wan Ting Tam. 2018. “A Critical Analysis of Lifecycle Models of the Research Process and Research Data Management.” Aslib Journal of Information Management 70 (2): 142–57. https://doi.org/10.1108/AJIM-11-2017-0251.
Cummings, Steven R., Warren S. Browner, and Stephen B. Hulley. 2013. “Conceiving the Research Question and Developing the Study Plan.” In Designing Clinical Research, edited by Stephen B. Hulley, Stephen R. Cummings, Warren S. Browner, Deborah S. Grady, and Thomas S. Newman, 4:14–22. Philadelphia: Wolters Kluwer.
Dallas, Costis. 2016. “Digital Curation Beyond the ‘Wild Frontier’: A Pragmatic Approach.” Archival Science 16 (4): 421–57. https://doi.org/10.1007/s10502-015-9252-6.
Doiron, Dany, Paul Burton, Yannick Marcon, Amadou Gaye, Bruce H. R. Wolffenbuttel, Markus Perola, Ronald P. Stolk, et al. 2013. “Data Harmonization and Federated Analysis of Population-Based Studies: The BioSHaRE Project.” Emerging Themes in Epidemiology 10 (1): 12. https://doi.org/10.1186/1742-7622-10-12.
Doiron, Dany, Yannick Marcon, Isabel Fortier, Paul Burton, and Vincent Ferretti. 2017. “Software Application Profile: Opal and Mica: Open-Source Software Solutions for Epidemiological Data Management, Harmonization and Dissemination.” International Journal of Epidemiology 46 (5): 1372–78. https://doi.org/10.1093/ije/dyx180.
Doiron, Dany, Parminder Raina, and Isabel Fortier. 2013. “Linking Canadian Population Health Data: Maximizing the Potential of Cohort and Administrative Data.” Canadian Journal of Public Health 104 (3): e258–61. https://doi.org/10.17269/cjph.104.3775.
Fortier, Isabel, Paul R Burton, Paula J Robson, Vincent Ferretti, Julian Little, Francois L’Heureux, Mylène Deschênes, et al. 2010. “Quality, Quantity and Harmony: The DataSHaPER Approach to Integrating Data Across Bioclinical Studies.” International Journal of Epidemiology 39 (5): 1383–93. https://doi.org/10.1093/ije/dyq139.
Fortier, Isabel, Dany Doiron, Paul Burton, and Parminder Raina. 2011. “Invited Commentary: Consolidating Data Harmonization – How to Obtain Quality and Applicability?” American Journal of Epidemiology 174 (3): 261–64. https://doi.org/10.1093/aje/kwr194.
Fortier, Isabel, Parminder Raina, Edwin R Van den Heuvel, Lauren E Griffith, Camille Craig, Matilda Saliba, Dany Doiron, et al. 2017. “Maelstrom Research Guidelines for Rigorous Retrospective Data Harmonization.” International Journal of Epidemiology 46 (1): 103–5. https://doi.org/10.1093/ije/dyw075.
Fortier, Isabel, Tina W. Wey, Julie Bergeron, Angela Pinot de Moira, Anne-Marie Nybo-Andersen, Tom Bishop, Madeleine J. Murtagh, et al. 2023. “Life Course of Retrospective Harmonization Initiatives: Key Elements to Consider.” Journal of Developmental Origins of Health and Disease 14 (2): 190–98. https://doi.org/10.1017/S2040174422000460.
Gaye, Amadou, Yannick Marcon, Julia Isaeva, Philippe LaFlamme, Andrew Turner, Elinor M Jones, Joel Minion, et al. 2014. “DataSHIELD: Taking the Analysis to the Data, Not the Data to the Analysis.” International Journal of Epidemiology 43 (6): 1929–44. https://doi.org/10.1093/ije/dyu188.
Hamilton, Carol M., Lisa C. Strader, Joseph G. Pratt, Deborah Maiese, Tabitha Hendershot, Richard K. Kwok, Jane A. Hammond, et al. 2011. “The PhenX Toolkit: Get the Most From Your Measures.” American Journal of Epidemiology 174 (3): 253–60. https://doi.org/10.1093/aje/kwr193.
Hohmann, Cynthia, E Govarts, A Bergström, S Cordier, M Eggesbo, M Guxens, J Heinrich, et al. 2012. “Joint Data Analyses of European Birth Cohorts: Two Different Approaches.” WebmedCentral EPIDEMIOLOGY 3 (12): WMC003869.
Rhee, Hea Lim. 2024. “A New Lifecycle Model Enabling Optimal Digital Curation.” Journal of Librarianship and Information Science 56 (1): 241–66. https://doi.org/10.1177/09610006221125956.
Rolland, Betsy, Suzanna Reid, Deanna Stelling, Greg Warnick, Mark Thornquist, Ziding Feng, and John D. Potter. 2015. “Toward Rigorous Data Harmonization in Cancer Epidemiology Research: One Approach.” American Journal of Epidemiology 182 (12): 1033–38. https://doi.org/10.1093/aje/kwv133.
Roos, Leslie L, Verena Menec, and R. J Currie. 2004. “Policy Analysis in an Information-Rich Environment.” Social Science & Medicine 58 (11): 2231–41. https://doi.org/10.1016/j.socscimed.2003.08.008.
Schaap, Laura A., Geeske MEE Peeters, Elaine M. Dennison, Sabina Zambon, Thorsten Nikolaus, Mercedes Sanchez-Martinez, Estella Musacchio, Natasja M. van Schoor, Dorly JH Deeg, and The EPOSA research group. 2011. “European Project on Osteoarthritis (EPOSA): Methodological Challenges in Harmonization of Existing Data from Five European Population-Based Cohorts on Aging.” BMC Musculoskeletal Disorders 12 (1): 272. https://doi.org/10.1186/1471-2474-12-272.
Wey, Tina W., and Isabel Fortier. 2024. “Maelstrom Research Approaches to Retrospective Harmonization of Cohort Data for Epidemiological Research.” In Survey Data Harmonization in the Social Sciences, 227–47. John Wiley & Sons, Ltd. https://doi.org/10.1002/9781119712206.ch13.
Wolfson, Michael, Susan E. Wallace, Nicholas Masca, Geoff Rowe, Nuala A. Sheehan, Vincent Ferretti, Philippe LaFlamme, et al. 2010. “DataSHIELD: Resolving a Conflict in Contemporary Bioscience – Performing a Pooled Analysis of Individual-Level Data Without Sharing the Data.” International Journal of Epidemiology 39 (5): 1372–82. https://doi.org/10.1093/ije/dyq111.