The Cancer Genome Atlas Project (TCGA) is a National Cancer Institute effort to profile at least 500 cases of 20 different tumor types using genomic platforms. TCGA data currently exceed 1.2 petabytes in size and include whole genome sequence (WGS), whole exome sequence, methylation, RNA expression, proteomic, and clinical datasets. Publicly accessible TCGA data are released through public portals, but navigating these sites and using the data obtained from them pose many challenges.
TCGA Expedition software consists of a set of scripts written in Python and Java that download, extract, and store all TCGA data and metadata. TCGA Expedition generates a versioned, participant- and sample-centered, local TCGA data directory with metadata structures that directly reference the local data files as well as the original data files. The software supports flexible searches of the data via a web portal, user-centric data tracking tools, and data provenance tools.
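As a rough illustration of what a versioned, participant- and sample-centered directory with metadata pointing at both the local copy and the original file could look like, here is a minimal Python sketch. It is not TCGA Expedition's actual schema or code: the `FileRecord` fields, the directory layout, the helper names, the identifiers, and the `example.org` URLs are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class FileRecord:
    """One metadata entry linking a local copy to its original source.

    Hypothetical schema, for illustration only.
    """
    disease: str
    participant_id: str
    sample_id: str
    data_type: str
    version: int
    original_url: str  # where the file was downloaded from
    local_path: str    # where the local copy is stored


def build_record(root, disease, participant_id, sample_id,
                 data_type, version, original_url):
    """Derive a participant/sample-centered local path and bundle it
    with a reference to the original file (assumed layout)."""
    filename = original_url.rsplit("/", 1)[-1]
    local = (PurePosixPath(root) / disease / participant_id / sample_id
             / data_type / f"v{version}" / filename)
    return FileRecord(disease, participant_id, sample_id, data_type,
                      version, original_url, str(local))


def latest_for_participant(records, participant_id):
    """Return the highest-version record per (sample, data type)
    for one participant."""
    latest = {}
    for rec in records:
        if rec.participant_id != participant_id:
            continue
        key = (rec.sample_id, rec.data_type)
        if key not in latest or rec.version > latest[key].version:
            latest[key] = rec
    return list(latest.values())


# Placeholder identifiers and URLs; none of these refer to real TCGA files.
records = [
    build_record("/data/tcga", "BRCA", "PARTICIPANT-0001", "SAMPLE-01",
                 "rna_expression", 1, "https://example.org/original/expr.txt"),
    build_record("/data/tcga", "BRCA", "PARTICIPANT-0001", "SAMPLE-01",
                 "rna_expression", 2, "https://example.org/original/expr.txt"),
    build_record("/data/tcga", "BRCA", "PARTICIPANT-0002", "SAMPLE-01",
                 "rna_expression", 1, "https://example.org/original/other.txt"),
]
current = latest_for_participant(records, "PARTICIPANT-0001")
```

Keeping the version as a path component means an updated release never overwrites an earlier one, and each record retains its original URL, which is one simple way to preserve provenance in a layout like this.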
We used TCGA Expedition to develop the Pittsburgh Genome Resource Repository (PGRR), which received the 2015 Reader's Choice Award for "Best Use of High-Performance Data Analytics" from HPCwire, an online publication that covers high-performance and data-intensive computing.
The Pittsburgh Genome Resource Repository, in partnership with the Simulation and Modeling Center, Pittsburgh Supercomputing Center, and UPMC, provides the framework through which researchers can access and analyze large national datasets. For UPMC patients who consented to have their clinical data re-linked with research data, the framework also links to complete patient data. Currently, PGRR mirrors The Cancer Genome Atlas (TCGA), with de-identified clinical data from UPMC for patients whose tumors were contributed to TCGA. Additional large omic datasets will be managed in the same way. Investigators interested in gaining access to these datasets and computing infrastructure must request an account from PGRR and must be listed on any relevant Data Use Agreement.
TCGA Expedition, the open-source software that IPM developed to create the PGRR, is freely available for download.
Rebecca Crowley Jacobson, MD, MS