BioData Catalyst Implementation Plan

V2.0 - 20200403


Document Status

Signatures presented below denote review and approval of the BioData Catalyst Implementation Plan. These approvals are given based on the understanding that the Implementation Plan, and the information herein, will be revised at regular periods over the course of the program. It is the responsibility of the Principal Investigator (PI) of each funded team and select NHLBI program staff to add their name(s) in the indicated space below.

PI Approvals:

PI                                            Approval Date
Robert L. Grossman (University of Chicago)    3/31/2020
Anthony Philippakis (Broad Institute)         3/30/2020
Benedict Paten (UCSC)                         3/30/2020
Paul Avillach                                 3/31/2020
Ashok Krishnamurthy                           3/31/2020
Brandi Davis-Dusenbery                        3/31/2020

Teams: Carbon, Helium, Xenon

NIH Approvals:

NIH                            NHLBI BioData Catalyst Role    Approval Date
Jonathan Kaltman               Program Manager                4/3/2020
Alastair Thomson, NHLBI CIO    Information Security           4/3/2020

Document Owner

Responsible Person:
Next Review Date:


Revision History

Version Number / Revision / Reviewed/Approved By:
V.0.1 Stan Ahalt
Paul Avillach (Carbon team)
Rebecca Boyles
Marcie Rathbun

Brief Description of Change:
● Draft document created
● Content update: all sections
● Added i2b2/tranSMART platform and PIC-SURE metaAPI as the “gold master” for clinical data in DataSTAGE
● Changes accepted and comments addressed
● At the end of section 3.2, added link to Operationalization document: NHLBI DataSTAGE 60 Day o16n Plan v1-2
● V1.0 reviewed and approved by NHLBI
● Links, graphics, & editing updates [Marcie]
● Changes based on annual consortium review:
  - re-branded as BioData Catalyst & updated relevant graphics
  - replaced the Ambassadors Program section with a section on the Fellows Program
  - small edits to the Beta-user training section to make it more general and add the Help Desk initiative
  - updated the org. chart & WG/TT list


OVERVIEW
KEY TERMS AND CONCEPTS
PROGRAM ORGANIZATION
    Consortium Groups
    Collaboration Groups
    Coordination of Activities
BIODATA CATALYST PLATFORM OVERVIEW
SYSTEM DESCRIPTION
SYSTEM DEVELOPMENT
    User Narratives
    2019-2021 Features, Epics, and User Stories
    Cross-Team Development Coordination

The BioData Catalyst Implementation Plan describes the process by which the BioData Catalyst Consortium will incrementally progress towards the vision of the program described in the BioData Catalyst Strategic Framework​. The Implementation Plan will enable the teams to decompose the strategic vision into concrete steps and define measures of completion for each step. Additionally, the Implementation Plan and Strategic Framework will focus the Consortium on the steps necessary to execute on the BioData Catalyst strategy.
This document outlines how the various elements from the planning phase of the BioData Catalyst project will come together to form a concrete, operationalized BioData Catalyst platform. The platform will offer a diverse set of users the ability to perform novel science and access an unprecedented array of data. In doing so, it will advance groundbreaking research and support significant advances in medicine.
The Implementation Plan, coupled with the Project Management Plan, establishes priorities and accountabilities for resource use. The ​BioData Catalyst Project Management Plan​ describes how BioData Catalyst will execute, monitor, and control work towards deliverables within the program. To create project priorities and transparency, the Implementation Plan uses the following guiding principles:
● Maximizing availability of resources;
● Positive impact on the NHLBI mission;
● Responsible stewardship of funds;
● Utilization of technologies that maximize data security and integrity;
● Implementation of cost-effective solutions; and
● Consistency with the BioData Catalyst Consortium Charter and values.
The BioData Catalyst Consortium (BDCC) is a collection of teams and stakeholders working to deliver on the common goals of integrated and advanced cyberinfrastructure, leading-edge data management and analysis tools, FAIR data, and HLBS researcher engagement. The BDCC takes an Agile development approach towards implementation to be flexible and responsive to user needs and react to user feedback. Accordingly, work within the Consortium is described at various levels of detail with a focus on collaboration with the user community.
The organization of the program will be coordinated according to objectives that are captured in User Narratives, which are further decomposed into Features, Epics, and User Stories.

Coordination along this framework will support the BioData Catalyst teams working in a coordinated manner towards common goals. User Narratives are an orthogonal construct to Work Streams, and are critical in the integration of user participation into the development cycle.



Project milestones are defined and tracked via the delivery of User Narratives, Features, and Epics.
In the BioData Catalyst program, the following key terms are used:
● User Narrative: A description of a user interaction experience within the system from the perspective of a particular persona. Example: An experienced bioinformatician wants to search TOPMed studies for a qualitative trait to be used in a GWAS study.
● Feature: A functionality at the system level that fulfills a meaningful stakeholder need. Example: Search TOPMed datasets using the PIC-SURE platform.
● Epic: A (very) large User Story described at the program level that can be broken into executable stories. Example: PIC-SURE is accessible on BioData Catalyst.
● User Story: An item that describes a requirement or functionality for a user. Example: A user can access PIC-SURE through an icon on BioData Catalyst to initiate a search.
● Work Stream: A collection of related features; orthogonal to a User Narrative. Example: Work Streams impacted by the above User Narrative include production system, data analysis, data access, and data management.
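The nesting of these terms can be illustrated with a small data model. The following is a minimal sketch only: the class names, field names, and the `stories()` helper are assumptions made for illustration and are not part of any BioData Catalyst system. The example values are the ones given in the definitions above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UserStory:
    description: str


@dataclass
class Epic:
    title: str
    stories: List[UserStory] = field(default_factory=list)


@dataclass
class Feature:
    title: str
    epics: List[Epic] = field(default_factory=list)


@dataclass
class UserNarrative:
    persona: str
    goal: str
    features: List[Feature] = field(default_factory=list)

    def stories(self) -> List[UserStory]:
        """Flatten every User Story reachable from this narrative."""
        return [s for f in self.features for e in f.epics for s in e.stories]


# Hypothetical instances built from the examples in the key terms list.
story = UserStory(
    "A user can access PIC-SURE through an icon on BioData Catalyst "
    "to initiate a search."
)
epic = Epic("PIC-SURE is accessible on BioData Catalyst.", [story])
feature = Feature("Search TOPMed datasets using the PIC-SURE platform.", [epic])
narrative = UserNarrative(
    persona="Experienced bioinformatician",
    goal="Search TOPMed studies for a qualitative trait to be used in a GWAS study.",
    features=[feature],
)
```

Calling `narrative.stories()` walks the Feature and Epic levels and returns the User Stories beneath them, which mirrors how a narrative is decomposed into executable work.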
Consortium Groups
The BioData Catalyst program is composed of several groups who each bring various resources towards executing the vision of BioData Catalyst. The organizational chart below displays the teams and their responsibilities.



The BioData Catalyst Coordinating Center (BDC3), in collaboration with NHLBI, develops and maintains the Strategic Framework, Implementation, and Project Management Plans. All members of the Consortium are periodically invited to provide feedback on these plans to the BDC3, with a particular focus on integrating feedback from the Data Stewards (TOPMed) and users. The draft documents and any significant changes are reviewed by the Steering Committee as well as the External Expert Panel. The teams are responsible for collaborating to deliver Features, Epics, and User Stories and advance the BioData Catalyst ecosystem. Additional details on the membership, roles, and responsibilities of each group can be found in the ​Project Management Plan​.

Additionally, BioData Catalyst has created multiple boards, working groups, and tiger teams that develop standards, protocols, and best practices to ensure ecosystem development. Boards are decision-making entities; working groups make recommendations for standards and protocols; and tiger teams are working groups that operate for a short period of time. The current groups are listed below.

Collaboration Groups

● Change Control Board: Evaluating requests for changes that impact project risk, cost, scope, or schedule of the NHLBI-approved workplan and requests for intra-team changes.
● Data Release Management (Ladwa, Culotti): Making recommendations around prioritization and organization of data, metadata schemas, and data ingestion and release.
● Integration Testing Tiger Team (Osborn): Designing and building an integration testing framework with tests to regularly verify the basic functionality of the ecosystem.
● Data Access (Bradford, Lyons): Identifying, outlining, and developing policies and procedures to guide accessing data on the platform.
● Data Harmonization (Carroll, Heavner): Supporting phenotype harmonization across the platform to reduce duplication and maximize expertise/efficiency; defining requirements for search and analytics across TOPMed.
● Tools & Applications (Cox, O’Connor): Maintaining a list of tools, workflows, or “apps” feasible for inclusion in the environment.
● User Engagement (Krishnamurthy, DiGiovanna): Coordinating the platform recruitment of TOPMed investigators; documenting steps to operationalize data/tools/computational workflows needed; supporting training.


Coordination of Activities
BioData Catalyst is initially executed by four teams. Meeting its goals requires intense and ongoing collaboration to create cyberinfrastructure, tools, processes, and a community of practice. The software development teams within the BioData Catalyst Consortium will be largely self-organizing around Epics, which the BDC3 will coordinate to ensure synchronization across shared Features. Multiple Features will commonly compose a User Narrative. Successful completion of work will be measured against the ability of a user to complete the work outlined in a User Narrative.
The ability of a user to complete a User Narrative on the BioData Catalyst system will indicate meaningful progress towards completion.
Independently, Features, Epics, and the more granular User Stories can be mapped to Work Streams, which are useful for reporting on aggregations of specific types of work, as shown in the figure below.
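As an illustration of this kind of reporting, the sketch below tags work items with one or more Work Streams and aggregates them per stream. It is a minimal sketch under stated assumptions: the `WorkItem` type, the `summarize_by_work_stream` function, the Work Stream assignments, and the done/not-done flags are all hypothetical and chosen only for the example.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class WorkItem(NamedTuple):
    kind: str                 # "Feature", "Epic", or "User Story"
    title: str
    work_streams: List[str]   # an item may map to several Work Streams
    done: bool


def summarize_by_work_stream(items: List[WorkItem]) -> Dict[str, Dict[str, int]]:
    """Count total and completed work items for each Work Stream."""
    summary: Dict[str, Dict[str, int]] = defaultdict(lambda: {"total": 0, "done": 0})
    for item in items:
        for stream in item.work_streams:
            summary[stream]["total"] += 1
            if item.done:
                summary[stream]["done"] += 1
    return dict(summary)


# Hypothetical mapping of the example items to Work Streams.
items = [
    WorkItem("Feature", "Search TOPMed datasets using the PIC-SURE platform.",
             ["data access", "data analysis"], done=False),
    WorkItem("Epic", "PIC-SURE is accessible on BioData Catalyst.",
             ["production system"], done=True),
    WorkItem("User Story", "A user can access PIC-SURE through an icon.",
             ["production system", "data access"], done=True),
]

report = summarize_by_work_stream(items)
```

Because an item can carry several Work Stream tags, the same Feature or Epic is counted once in each stream it touches. This is what makes Work Streams a reporting view that cuts across User Narratives rather than a second hierarchy.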

BioData Catalyst maintains a Consortium glossary of terms that is regularly updated in the BDCatalyst-RFC-2_BioData Catalyst_Strategic_Planning_Nomenclature​.
A cyclical evaluation and revision of the BioData Catalyst User Narratives will be critical to the execution of the BioData Catalyst Agile program. Regular collection of user feedback and needs will feed into the development process and be represented in new or revised User Narratives that will be prioritized in coordination with NHLBI. This continued and organized refinement of priorities for Consortium development work will support close coordination and ground the BioData Catalyst program in the needs of the user community.


Inherent in the approach to the system design is the recognition that the current state of data and computational resources places onerous limits on the HLBS research community. Examples of limitations include an inability to execute arbitrary code; an inability to access and work on very large data (e.g., TOPMed CRAMs) due to technical constraints; an inability to search on one platform and execute on another; difficulties for groups of investigators in sharing controlled-access data and working together in a common workspace; and a laborious, months-long process for a researcher to gain access to data.
The BioData Catalyst architecture provides an early cyberinfrastructure to researchers as quickly and responsibly as possible with an eye towards addressing the above limitations. BioData Catalyst will balance early delivery with ambitious goals by extending functionality through phased rollouts.
To accomplish this goal, the Consortium will abide by the design principles below:
● Meet user needs and incorporate feedback
● Leverage existing tools and infrastructure, when feasible
● Duplicate functionality only when intentional and reasonable
● Architect interoperability with relevant systems
● Provide a seamless experience, regardless of underlying components
● Leverage cost-advantageous cloud resources
● Support scalability and extension of functionality
● Have an early impact on computationally driven HLBS science
● Enable easy access to applications and tools for users across BioData Catalyst
● Provide systems security for hosting identifiable data
● Implement rigorous testing and Quality Assurance measures for components and data
Applying these design principles, our initial architecture of the BioData Catalyst platform is pictured in the figure below. The teams will leverage the Data Commons Framework Services (DCFS) of Gen3 to provide critical infrastructure, common security, data access services, and the genomic data gold master. The DCFS is a set of software services designed specifically to support this kind of Data Commons platform. The DCFS is powered by the Gen3 platform and was initially developed to support the National Cancer Institute’s (NCI) Genomic Data Commons (Grossman, 2018). The PIC-SURE platform will be the clinical data gold master database, leveraging its metaAPI. These data services will make use of the NIH STRIDES partnerships that offer NIH investigators cloud services and storage at discount pricing to support research (NIH, 2018).
The DCFS will include authentication and authorization services and digital object globally unique IDs for indexing. The current Terra, Seven Bridges, and PIC-SURE platforms will establish appropriate memos for interoperation with the DCFS. These memos are the means through which groups will formalize cooperation with one another to develop interoperability solutions that meet functional, technical, and security/compliance requirements.
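The idea of indexing digital objects by globally unique IDs behind an authorization check can be sketched in a few lines. This is an in-memory illustration only and is not the Gen3 or DCFS API: the `ObjectIndex` class, its methods, the storage URL, and the authorization label are all hypothetical, and a UUID stands in for whatever identifier scheme the services actually use.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class IndexedObject:
    guid: str
    urls: Tuple[str, ...]   # storage locations of the same digital object
    required_auth: str      # authorization a user must hold to resolve it


class ObjectIndex:
    """Hypothetical index mapping globally unique IDs to object locations."""

    def __init__(self) -> None:
        self._records: Dict[str, IndexedObject] = {}

    def register(self, urls: List[str], required_auth: str) -> str:
        """Assign a new unique ID to an object and record its locations."""
        guid = str(uuid.uuid4())
        self._records[guid] = IndexedObject(guid, tuple(urls), required_auth)
        return guid

    def resolve(self, guid: str, user_auths: Set[str]) -> List[str]:
        """Return the object's locations if the user holds the required authorization."""
        record = self._records[guid]
        if record.required_auth not in user_auths:
            raise PermissionError(f"not authorized to access {guid}")
        return list(record.urls)


index = ObjectIndex()
guid = index.register(["s3://example-bucket/sample.cram"], "example-consent-group")
urls = index.resolve(guid, {"example-consent-group"})
```

Keeping the identifier separate from the storage location is the point of the design: a platform can hold on to the ID while the underlying files move, and every platform that resolves the ID passes through the same authorization check.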
BioData Catalyst will be extended through the integration of third-party applications. There are a number of possible models in which a third-party application can operate within the BioData Catalyst platform. The terms of operation for these applications are being developed collaboratively between the Tools and Applications Working Group and the Operationalization Tiger Team.

