The Cancer Registry (CR) maintained at Hillman Cancer Center is an extraordinary resource. Some 15,000 to 20,000 new patients are added to the registry each year, and decades of data are already in place for hundreds of thousands of patients across essentially all cancer body sites. The data is curated to a high degree of completeness and maintained according to North American Association of Central Cancer Registries (NAACCR) standards across essentially all UPMC cancer treatment sites. CR curation focuses on detailed diagnosis and outcome data; as with most cancer registries, only general treatment data is recorded (whether the patient received chemotherapy, radiation, and/or surgery). This creates an extraordinary opportunity to build uniquely powerful research collections by combining CR data with other UPMC/Pitt resources that focus on fine-grained pathology reports (text) and transactional treatment data (specific drugs administered). Specifically, the TIES Cancer Research Network (TCRN) has assembled pathology report data across UPMC. The TIES system primarily uses natural language processing (NLP) to process unstructured pathology and radiology clinical documents, making them searchable assets. The software has been deployed at the University of Pittsburgh and UPMC for over ten years and currently encompasses ~30M de-identified pathology and radiology clinical documents, along with other data sources including ~35K de-identified whole slide images. At this volume, it gives researchers a tremendous resource for identifying patient cohorts and their associated specimens. Alongside the NLP functionality, we have developed, within TIES, the ability to import structured data from any other data source, such as clinical EMR data. This data can be organized into multiple datasets and associated with existing patient-level or report-level data processed by the NLP pipeline.
With the structured data import functionality, it is possible to leverage the highly structured longitudinal data of the CR. Through an established working agreement with our local Cancer Registry, we developed a set of common data elements for data extraction. For our initial scope, we extracted a set of 54,000 breast cancer patients and established an extract, transform, and load (ETL) pipeline from the registry's data repository (i.e., METRIQ software). This data was imported into the TIES system using the structured data importer, which established the linkage to the patient data. This functionality makes the data more widely available to the University of Pittsburgh and UPMC cancer research community and creates new opportunities for research programs that currently rely on manual integration of CR data. In addition, we will use the TIES Cancer Research Network to establish a data-sharing consortium for cancer registry data.
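As an illustration only, the transform step of such an ETL pipeline might look like the sketch below. It assumes the registry export is a CSV file; the column names, the target element names, and the functions `extract` and `transform` are all hypothetical and do not reflect the actual METRIQ export schema or the TIES structured data importer.

```python
import csv
import io
from typing import Dict, List, Optional

# Hypothetical mapping from registry export columns to common data elements.
# The real METRIQ column names and the agreed element set are not shown here.
COMMON_DATA_ELEMENTS: Dict[str, str] = {
    "PatientID": "patient_id",
    "PrimarySite": "site",
    "StageGroup": "stage",
    "Grade": "grade",
    "VitalStatus": "outcome",
}


def extract(csv_text: str) -> List[Dict[str, str]]:
    """Read a registry export (assumed to be CSV) into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows: List[Dict[str, str]]) -> List[Dict[str, Optional[str]]]:
    """Keep only the common data elements and drop rows with no patient ID."""
    records = []
    for row in rows:
        patient_id = (row.get("PatientID") or "").strip()
        if not patient_id:
            continue  # a row without an ID cannot be linked to a patient
        records.append(
            {
                target: ((row.get(source) or "").strip() or None)
                for source, target in COMMON_DATA_ELEMENTS.items()
            }
        )
    return records
```

Rows without a patient identifier are dropped in this sketch because the import depends on linkage to existing patient data; empty values are kept as `None` rather than guessed.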

We propose to create two distinct enabling research resources, which together make up Cancer Registry Records for Research (CR3), each delivered as part of the R3 service.

1: We will develop a new browser-based application, CR3, suitable for researcher self-service exploration and visualization of CR data. Using an Agile approach that improves features and expands data in two-week “sprints”, we will build and deploy faceted search, exploration, survival curves, and other interactive visualizations in JavaScript and D3 (Data-Driven Documents, d3js.org). These tools will initially operate over the core data (Site, Stage, Grade/Morphology, Outcome) of the following six cancer types within calendar year 2018: Breast, Colorectal, Head and Neck, Lung, Melanoma, and Ovarian.

2: We will integrate the TCRN system and its associated text data features with the CR data and CR3 visualization, expanding the capabilities of both. The prior proof-of-concept project by this team demonstrated import of discrete CR data into the TIES/TCRN system. However, it required a tedious, encyclopedic review of hundreds of fields. This experience taught us that, to be most effective, we need to ingest a light, usable cut of data across multiple cancers into TCRN rather than a deep cut of one. We will complete this for all six cancer types in calendar year 2018.
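
The survival curves named in the first resource could be computed from registry outcome data in many ways; this proposal does not specify a method. A minimal sketch, assuming the Kaplan-Meier product-limit estimator and assuming each patient contributes a follow-up duration plus a flag for whether death was observed, is shown here. The function name `kaplan_meier` and its inputs are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def kaplan_meier(
    durations: Sequence[float], events: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (time, survival probability) steps of the Kaplan-Meier estimate.

    durations: follow-up time per patient (e.g. months since diagnosis).
    events: True if death was observed at that time, False if censored.
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")

    deaths = Counter(t for t, observed in zip(durations, events) if observed)
    exits = Counter(durations)  # everyone leaving the risk set at time t

    at_risk = len(durations)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving past t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # remove deaths and censored patients alike
    return curve
```

For example, `kaplan_meier([5, 8, 8, 12, 15], [True, True, False, True, False])` yields steps at times 5, 8, and 12 with survival of approximately 0.8, 0.6, and 0.3. A list of steps in this form could then be passed to a browser-side charting layer for rendering, with faceted filters applied to the patient set before the estimate is computed.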