GWAS Explorer: an open-source tool to explore, visualize, and access GWAS summary statistics in the PLCO Atlas (2024)

The PLCO trial

Study participants were from the NCI PLCO Cancer Screening Trial, a large randomized trial designed to evaluate whether screening for prostate, lung, colorectal, and ovarian cancers leads to reduced mortality from these diseases [9,10,11]. Almost 155,000 men and women aged 55 to 74 years were enrolled from 1993 to 2001 at 10 screening centers across the United States (Birmingham, Alabama; Denver, Colorado; Detroit, Michigan; Honolulu, Hawaii; Marshfield, Wisconsin; Minneapolis, Minnesota; Pittsburgh, Pennsylvania; Salt Lake City, Utah; St. Louis, Missouri; Washington, DC). Approximately half of the participants were randomized to the intervention arm and underwent cancer screening, while the other half were assigned to the control arm and received standard medical care. Participants completed several self-administered questionnaires at baseline and during follow-up, which collected information on demographics, medical history, family history, and various lifestyle and dietary risk factors. Information from these questionnaires has been aggregated and harmonized to produce the traits and covariates used in the PLCO Atlas genetic association tests. Blood was collected from screening-arm participants at baseline and at each annual screening visit for up to 5 additional years. In addition, buccal cells were collected from control-arm participants from 2000 to 2003 and again in 2018 from participants in both arms. Cancer incidence and mortality outcomes have been tracked longitudinally, with a median follow-up of >18 years for cancer incidence (approximately 44,000 cancers through 2017) and >19 years for deaths (approximately 57,000 deaths through 2018). All cancer diagnoses were confirmed by medical record review and/or linkage to cancer registries, as previously described [12,13]. All participants provided written informed consent, and the study was approved by the Institutional Review Boards at the National Cancer Institute and the 10 screening centers.
Additional information about the cohort can be found at https://cdas.cancer.gov/learn/plco/home/.

GWAS data

The PLCO Atlas genotyping project sought to genotype all PLCO participants with genetic consent and available DNA or source vial (N = 117,551) (Fig. 1). These participants were from the screening arm (N = 64,367), with blood and buccal source material, and the control arm (N = 53,184), with only buccal source material. The Atlas project combined genotyping data previously generated on high-density arrays for 25,831 participants (OncoArray, Omni2.5M, and OmniExpress) as part of prior GWAS scans [1,2,3,4,5,6,7,8] with a new round of genotyping on the Illumina Global Screening Array (GSA) for 84,731 participants who had low-density genotype data (n = 5,233) or no prior genotyping (n = 79,498).

Fig. 1: PLCO participants with genotyping data.

Samples from a total of 91,720 participants were processed for GSA genotyping. DNA extraction used chemistry appropriate to the source material type and was automated on the KingFisher Flex Purification System. Extraction followed standard operating procedures developed internally in the NCI Division of Cancer Epidemiology and Genetics Cancer Genomics Research (CGR) Laboratory. The predominant DNA sample source was buccal cells (48.4%), followed by buffy coat (39.8%), whole blood (4.2%), and buffy coat and red blood cells (1.4%), as well as previously extracted DNA (2.6%). Of the 91,720 participants whose samples were processed, 3,360 (3.6%) individuals were not genotyped on GSA because of insufficient extracted DNA (N = 2,313) or insufficient material from previously extracted DNA (N = 1,047). In addition, a total of 3,629 (4.0%) individuals were excluded from the final dataset due to the quality control failures described below and summarized in Fig. 2a,b, resulting in a total of 84,731 PLCO participants successfully genotyped by GSA.

GSA genotyping was performed at the NCI Division of Cancer Epidemiology and Genetics CGR Laboratory according to Illumina protocols and internal standard operating procedures. The CGR has extensive experience with high-throughput Illumina bead-based genotyping, having previously genotyped hundreds of thousands of samples. Initial GSA genotyping resulted in an overall failure rate of 1.5% for blood-derived DNA and 13% for buccal-derived DNA. After additional DNA extraction and genotyping to recover sample failures, genotyping was completed for a total of 84,731 individuals on the GSA.

Extensive quality control filtering was performed for each array to ensure a set of high-quality genotype data for subsequent imputation and association analyses. Detailed quality control steps, and the reasons for and numbers of exclusions, for the GSA platform are described below and summarized in Fig. 2a and Fig. 2b, respectively. First, 275 subjects genotyped on GSA failed to produce valid output files (.idat and/or .gtc files) during array processing and were excluded from the study. Next, 2,787 subjects were removed by a two-stage completion rate filter: a threshold of 0.8 for samples and 0.8 for loci, followed by a threshold of 0.95 for samples and 0.95 for loci. Sample contamination was checked using VerifyIDintensity, and 85 subjects with greater than 20% estimated contamination were removed. Pairwise genotype concordance was assessed for all subjects to identify unexpected replicates; subjects with genotype concordance greater than 95% across a set of LD-pruned SNPs were considered replicates. After reviewing the concordance results against the enrolled phenotype data, a total of 128 subjects were removed. Sex was verified by comparing reported sex with the sex inferred from the X chromosome method-of-moments F coefficient from PLINK. The F coefficient is expected to be close to 1.0 for males and 0.0 for females, and we used a threshold of 0.5 to separate the two populations. Samples that failed the sex concordance check were further screened for sex chromosome aneuploidies by STR profiling using the AmpFLSTR Identifiler assay, resulting in a total of 25 subjects excluded due to sex discordance. Further, 5 subjects were identified as having abnormal heterozygosity, defined as an absolute PLINK method-of-moments F coefficient greater than 0.2.
Pairwise genotype concordance for all subjects from different datasets within the same platform and across different platforms was also assessed to identify cross-dataset and cross-platform discordant expected duplicates (n = 3) and unexpected replicates (n = 49). Additionally, relatedness was examined using PLINK IBS/IBD tests. A total of 272 subjects from the GSA platform were identified with genetic relationships at the pi_hat threshold (0.1) and were removed. In total, 3,629 individuals were excluded from the GSA dataset due to quality control failures (Fig. 2b).
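The two-stage completion rate filter described above can be illustrated with a minimal Python sketch. It is not the CGR pipeline: the function names, the in-memory matrix layout (`None` marks a missing call), the choice to filter samples before loci within each stage, and the treatment of the threshold as inclusive are all assumptions made for illustration.

```python
def apply_stage(matrix, sample_ids, threshold):
    """One filter stage: drop low-completion samples, then low-completion loci.

    matrix is a samples x loci list of lists; None marks a missing genotype call.
    Filtering samples before loci is an assumption, not stated in the text.
    """
    if not matrix or not matrix[0]:
        return matrix, sample_ids
    n_loci = len(matrix[0])
    kept = [
        (sid, row)
        for sid, row in zip(sample_ids, matrix)
        if sum(g is not None for g in row) / n_loci >= threshold
    ]
    if not kept:
        return [], []
    ids = [sid for sid, _ in kept]
    rows = [row for _, row in kept]
    # Locus completion is computed only over the samples that survived.
    keep_loci = [
        j for j in range(n_loci)
        if sum(r[j] is not None for r in rows) / len(rows) >= threshold
    ]
    return [[r[j] for j in keep_loci] for r in rows], ids


def two_stage_filter(matrix, sample_ids, thresholds=(0.8, 0.95)):
    """Apply the lenient stage first, then the stricter stage."""
    for t in thresholds:
        matrix, sample_ids = apply_stage(matrix, sample_ids, t)
    return matrix, sample_ids


# Toy data: s3 is missing half its calls; locus 0 is missing in s1.
matrix = [[None] + [0] * 19, [1] * 20, [None] * 10 + [2] * 10]
rows, kept_ids = two_stage_filter(matrix, ["s1", "s2", "s3"])
print(kept_ids, len(rows[0]))
```

Running the lenient stage first keeps a handful of very poor samples from dragging otherwise good loci below the stricter threshold, and vice versa.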

Cumulatively, across all platforms, samples were removed for abnormal levels of heterozygosity (N = 12), sex discordance (N = 31), within-dataset unexpected duplicates (N = 130), discordant expected duplicates (N = 14), cross-dataset and cross-platform unexpected duplicates (N = 126), and relatedness (N = 291). A total of 47 subjects with sex-chromosome abnormalities were retained in the dataset for downstream imputation.

After applying QC exclusions to each array, a total of 112,065 DNA samples from 110,562 unique individuals genotyped on a modern, high-density Illumina genotyping array remained (Table 1). For participants genotyped on multiple arrays (N = 1,192), genotype data from only one array were included in the Atlas project, following the prioritization Global Screening Array (GSA) > OncoArray > Omni2.5M > OmniExpress (OmniX), to ensure non-redundant subject-level genotyping data. The predominant genotyping array was the GSA (N = 84,731), followed by the OncoArray (N = 16,893), Omni2.5M (N = 7,211), and OmniX (N = 1,727).
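The array prioritization rule can be expressed in a few lines of Python. This is a sketch only: the record layout (pairs of participant ID and array name) and the function name `select_one_array` are hypothetical, and only the priority order itself comes from the text.

```python
# Priority order stated in the text: GSA > OncoArray > Omni2.5M > OmniX.
ARRAY_PRIORITY = ["GSA", "OncoArray", "Omni2.5M", "OmniX"]
RANK = {name: i for i, name in enumerate(ARRAY_PRIORITY)}


def select_one_array(records):
    """Keep the highest-priority array per participant.

    records: iterable of (participant_id, array_name) pairs (hypothetical layout).
    Returns a dict mapping participant_id to the single retained array.
    """
    chosen = {}
    for pid, array in records:
        if pid not in chosen or RANK[array] < RANK[chosen[pid]]:
            chosen[pid] = array
    return chosen


records = [("P1", "OmniX"), ("P1", "GSA"), ("P2", "Omni2.5M"), ("P2", "OncoArray")]
print(select_one_array(records))
```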


Genetic ancestry for PLCO Atlas participants was determined using GRAF (https://github.com/ncbi/graf) on a set of 10,000 pre-selected fingerprinting variants. GRAF assigned individuals to the following 9 ancestral groups: “African”, “African American”, “East Asian”, “European”, “Hispanic1”, “Hispanic2”, “Other”, “Other Asian”, and “South Asian”. Hispanic1 included individuals of Dominican or Puerto Rican ancestry, whereas Hispanic2 included individuals of Mexican or Latin American ancestry. For parsimony and to facilitate downstream analyses, we merged “African” and “African American” into an “African American (Combined)” group, and “East Asian” and “Other Asian” into an “East Asian (Combined)” group. The largest ancestral sets in the PLCO Atlas were European (N = 100,448), African American (Combined) (N = 4,576), and East Asian (Combined) (N = 3,528).

For genotype imputation, we used the TOPMed reference panel on the Michigan Imputation Server (MIS, https://imputationserver.sph.umich.edu), which is accessible to all TOPMed collaborators. To prepare the imputation input, we removed variants with minor allele frequency ≤ 0.01, variant-level missingness ≥ 0.05, or Hardy-Weinberg equilibrium exact test p-value ≤ 1 × 10⁻⁶. Data from each genotyping platform were then checked using a community-recommended script for aligning data to reference datasets (HRC-1000G-check-bim.pl, from https://www.well.ox.ac.uk/~wrayner/tools/). The script was modified to support TOPMed 5b as a reference panel, using a pre-existing test imputation of 1000 Genomes Project subjects against the TOPMed 5b reference panel. Data were uploaded to the MIS in GRCh37/hg19 and lifted over by the MIS. Pre-phasing against phased reference data from TOPMed release 5b was conducted using EAGLE 2.4. Imputation was conducted against the same reference panel using minimac4 (https://genome.sph.umich.edu/wiki/Minimac4). The “Population” option was set to “EUR” for GSA batches 1–4, which included European ancestry samples, and to “Other/Mixed” for all other imputations, which consisted of non-European samples or samples of uncharacterized ancestry. Imputation was run in several rounds over a period of months. In total, 110,562 subjects were successfully imputed to the TOPMed 5b reference panel.
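As an illustration of the pre-imputation variant filter, the following Python sketch applies the three stated thresholds to per-variant summary values. The `VariantStats` container and the function name are hypothetical; in practice these statistics would come from a tool such as PLINK rather than being entered by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantStats:
    """Per-variant QC statistics (hypothetical container for illustration)."""
    variant_id: str
    maf: float          # minor allele frequency
    missingness: float  # fraction of samples with a missing call
    hwe_p: float        # Hardy-Weinberg equilibrium exact test p-value


def passes_preimputation_qc(v, maf_min=0.01, miss_max=0.05, hwe_min=1e-6):
    """Variants with MAF <= 0.01, missingness >= 0.05, or HWE p <= 1e-6 are
    removed, so a variant must strictly clear all three thresholds to pass."""
    return v.maf > maf_min and v.missingness < miss_max and v.hwe_p > hwe_min


variants = [
    VariantStats("v1", maf=0.20, missingness=0.01, hwe_p=0.5),
    VariantStats("v2", maf=0.01, missingness=0.01, hwe_p=0.5),   # fails MAF
    VariantStats("v3", maf=0.20, missingness=0.05, hwe_p=0.5),   # fails missingness
    VariantStats("v4", maf=0.20, missingness=0.01, hwe_p=1e-6),  # fails HWE
]
print([v.variant_id for v in variants if passes_preimputation_qc(v)])
```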

Following MIS imputation, raw imputation data were partitioned into subsets according to predicted GRAF genetic ancestry groups to estimate ancestry-specific imputation quality. Ancestry and chip combinations with fewer than 100 individuals were deemed to have insufficient sample sizes for association testing and were removed. After partitioning by ancestry and recomputing imputation quality (Rsq) values, each platform and ancestry pair was cleaned according to the filtering method described by Kowalski et al. [14]. Briefly, all variants with Rsq < 0.3 were removed, consistent with traditional quality filters. Remaining variants were then partitioned into minor allele frequency (MAF) bins (<0.05%, 0.05–0.2%, 0.2–0.5%, 0.5–1%, 1–3%, 3–5%, and >5%), and within each bin variants were removed, starting with the lowest Rsq, until the average Rsq of the remaining variants in that bin was at least 0.9. In total, more than 78,000,000 high-quality imputed variants were available for association testing. In addition, we observed high concordance between high-quality imputed SNPs from the GSA and genotyped variants present on the OmniExpress arrays, with a median correlation of 1.00 and a mean correlation of 0.984.
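The bin-wise Rsq filter can be sketched as follows. This is an illustrative reading of the procedure, not the code used for the PLCO Atlas: the input layout (tuples of variant ID, MAF as a fraction, and Rsq), the function name, and the handling of values that fall exactly on a bin edge are assumptions.

```python
from bisect import bisect_right
from collections import defaultdict

# MAF bin edges as fractions: <0.05%, 0.05-0.2%, 0.2-0.5%, 0.5-1%, 1-3%, 3-5%, >5%.
MAF_BIN_EDGES = [0.0005, 0.002, 0.005, 0.01, 0.03, 0.05]


def filter_by_rsq(variants, min_rsq=0.3, target_mean=0.9):
    """Return the set of variant IDs kept after the two-step Rsq filter.

    variants: iterable of (variant_id, maf, rsq) tuples (hypothetical layout).
    """
    bins = defaultdict(list)
    for vid, maf, rsq in variants:
        if rsq >= min_rsq:  # step 1: hard floor on imputation quality
            bins[bisect_right(MAF_BIN_EDGES, maf)].append((rsq, vid))

    kept = set()
    for members in bins.values():
        members.sort()  # ascending Rsq, so the worst variants come first
        total = sum(rsq for rsq, _ in members)
        start = 0
        # Step 2: drop the lowest-Rsq variants until the bin mean reaches the target.
        while start < len(members) and total / (len(members) - start) < target_mean:
            total -= members[start][0]
            start += 1
        kept.update(vid for _, vid in members[start:])
    return kept


variants = [
    ("a", 0.10, 0.20),   # removed by the Rsq floor
    ("b", 0.10, 0.50),   # removed to lift the >5% bin mean
    ("c", 0.10, 0.90),
    ("d", 0.10, 1.00),
    ("e", 0.001, 0.95),  # alone in the 0.05-0.2% bin
]
print(sorted(filter_by_rsq(variants)))
```

Because the filter works on the bin mean rather than a fixed per-variant cutoff, the effective Rsq threshold differs by MAF bin and tends to be stricter for rare variants, where imputation is harder.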

Filtered imputed data for each platform and ancestry were then converted to bgen format (v1.2) for compatibility with BOLT-LMM and SAIGE for association testing. The resulting final imputed PLCO Atlas Project dataset for association analyses is detailed in Table 2.


Association analysis

Association analyses on the autosomes and X chromosome were carried out using the PLCO pipeline hosted on GitHub (https://github.com/NCI-CGR/plco-analysis). Variants in non-PAR regions of the X chromosome in males were coded as 0/2. Quantitative phenotypes with a sample size of at least 3,000 subjects were analyzed with BOLT-LMM v2.3.4 [15], using linear mixed models on variants with MAF > 0.01. The top 20 principal components (generated separately by ancestry) were included as adjustment variables, along with participant age, sex, and study center. Healthy subjects free of any cancer diagnosis throughout the follow-up period were treated as controls for all cancer analyses. Binary and categorical phenotypes were analyzed with SAIGE 0.43.2 [16]. We required more than 1,000 subjects and at least 50 cases for each SAIGE phenotype tested. At the variant level, a minimum variant count of 5 and a MAF > 0.01 were required for testing. Association analyses were run separately for every GRAF-defined ancestry group, genotyping array, and imputation group. Ancestry-specific results were aggregated by meta-analysis to create overall summary results as well as sex-specific summary result files. Quantile-quantile (Q-Q) plots were generated, and lambda values were calculated for each phenotype by linkage disequilibrium score (LDSC) regression [17].
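The phenotype eligibility rules above amount to a small decision function. The Python sketch below is an illustration only and is not taken from the plco-analysis repository; the function name, argument names, and the `None` return for ineligible phenotypes are assumptions.

```python
def choose_association_tool(kind, n_subjects, n_cases=None):
    """Return the tool a phenotype would be routed to, or None if ineligible.

    kind: "quantitative", "binary", or "categorical" (labels are hypothetical).
    """
    if kind == "quantitative":
        # Quantitative traits need at least 3,000 subjects for BOLT-LMM.
        return "BOLT-LMM" if n_subjects >= 3000 else None
    if kind in ("binary", "categorical"):
        # SAIGE phenotypes need more than 1,000 subjects and at least 50 cases.
        enough_cases = n_cases is not None and n_cases >= 50
        return "SAIGE" if n_subjects > 1000 and enough_cases else None
    raise ValueError(f"unknown phenotype kind: {kind!r}")


print(choose_association_tool("quantitative", 3000))
print(choose_association_tool("binary", 5000, n_cases=49))
```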

After association analyses using BOLT-LMM or SAIGE, the SNP column of the GWAS summary files was annotated by a custom tool (https://github.com/NCI-CGR/annotate_rsids_from_linker.git) in the format rsid:otherAllele:testedAllele (or chr:pos:otherAllele:testedAllele if there was no matching rsid). Population-specific data from the 1000 Genomes Project, imputed with the TOPMed 5b panel, were used to annotate allele frequencies for each tested variant in the GWAS summary statistics using the annotate_frequency program (https://github.com/NCI-CGR/annotate_frequency). While the genotyping arrays and imputation procedures we implemented in the PLCO Atlas capture trait and disease associations with common variants shared across ancestries, associations with population-specific variants and with variants that differ in allele frequency between ancestries may not be well captured by this approach.
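The variant label format can be shown with a short Python helper. This is not the annotate_rsids_from_linker tool itself; the function name and arguments are hypothetical, and whether the chromosome field carries a "chr" prefix in the real files is not stated in the text, so the caller's value is passed through unchanged.

```python
def format_variant_label(chrom, pos, other_allele, tested_allele, rsid=None):
    """Build rsid:otherAllele:testedAllele, falling back to
    chr:pos:otherAllele:testedAllele when no rsid is available."""
    prefix = rsid if rsid else f"{chrom}:{pos}"
    return f"{prefix}:{other_allele}:{tested_allele}"


print(format_variant_label("1", 12345, "A", "G", rsid="rs123"))
print(format_variant_label("1", 12345, "A", "G"))
```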

Currently, the PLCO Atlas project hosts association results for 90 diseases and traits, including a comprehensive list of cancer types and subtypes defined by organ site, etiology, and pathology (Table 3). For example, in addition to overall female breast cancer, we’ve included invasive, in situ, ductal, lobular, tubular, ER positive, ER negative, PR positive, PR negative, ER positive or PR positive, ER negative and PR negative, HER2 positive, HER2 negative, ER, PR, and HER2 triple-negative, Grade III or Grade IV, Grade II, and Grade I breast cancer. By etiology, we’ve performed GWAS analyses for smoking-related, alcohol-related, obesity-related, height-related, physical activity-related, diabetes-related, and infection-related cancers (overall, and HPV- or H. pylori-related). For smoking-related cancers, for example, we’ve considered cancers of the bladder, ureter, kidney, lip, oral cavity, oropharynx, nasopharynx, hypopharynx, larynx, nasal cavity, paranasal sinuses, colorectum, esophagus, stomach, liver (excluding intrahepatic bile duct cancer), lung, ovary (mucinous), pancreas, and uterine cervix, as well as myeloid leukemia. By pathology, we’ve organized cancers into solid tumors (e.g., carcinomas, sarcomas, or urothelial cancers) and hematologic cancers (e.g., lymphoid or myeloid). Within carcinomas, we further distinguished adenocarcinomas (excluding mixed adenocarcinoma), endocrine or neuroendocrine cancers, and squamous cell cancers.


We also include GWAS association results for key cancer risk factors measured at baseline, such as body mass index, height, cigarette smoking for ≥6 months (never, ever), cigarette smoking category (never, former, current), caffeine consumption from diet, and male pattern baldness at age 45, as well as baseline serum PSA and serum CA-125 levels. These initial traits were selected based on the availability of previous data and represent binary, categorical, and continuous traits for the purpose of analytical pipeline development and validity checking. Analyses of additional traits are in progress, and association results will be publicly posted as they become available.

Summary statistics

After association testing and annotation, summary statistic data were imported into a primary MySQL instance by an import script run on the National Institutes of Health (NIH) High-Performance Computing (HPC) Biowulf cluster (https://hpc.nih.gov/); the script imported and aggregated participant phenotype metadata and variant association data. Using several parallel processes, each phenotype’s variant association data were aggregated and then indexed. Data for specific plot views, such as the single-chromosome summary view in the Manhattan plots and the Q-Q plots, were generated during this import and indexing process. The results were then pooled into the primary MySQL instance, and a snapshot was created on Biowulf using the Percona XtraBackup tool. The snapshot was uploaded to an Amazon Web Services (AWS) Simple Storage Service (S3) bucket, from which it was restored to the AWS Relational Database Service (RDS).

All PLCO Atlas summary statistic data are publicly posted on the GWAS Explorer (Fig. 3). The GWAS Explorer is hosted on AWS. It consists of two AWS EC2 servers, an AWS RDS instance, an AWS ElastiCache instance, and an NCI on-premises download server. The website and the API are each served by their own dedicated AWS EC2 instance. All PLCO data are hosted in a single AWS RDS MySQL instance, which can be scaled up or duplicated if needed. The GWAS Explorer backend runs on Fastify, a Node.js web application framework similar to the popular Express framework but optimized for faster API performance. All API routes and database queries are defined in the backend and use MySQL query logic. Website (internal) requests are routed to a dedicated web server, and public API requests are routed to a separate dedicated API server to reduce load on the web server during periods of high usage. Public API routes are documented with Swagger UI (see Data Records). Download requests are routed to a dedicated local NCI download server to reduce egress costs. Additionally, a cache layer is configured using Redis and AWS ElastiCache to reduce server load and speed up popular requests.
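The role of the cache layer can be illustrated with a generic cache-aside sketch in Python. It does not reproduce the GWAS Explorer backend, which runs on Node.js with Redis: here an in-memory dictionary stands in for Redis, and the class name, the `fetch` callback, and the example key are all hypothetical.

```python
class CacheAside:
    """Minimal cache-aside wrapper: check the cache, fall back to the database."""

    def __init__(self, fetch):
        self._fetch = fetch  # stand-in for an expensive database query
        self._cache = {}     # stand-in for Redis / ElastiCache
        self.misses = 0

    def get(self, key):
        if key not in self._cache:
            # Cache miss: query the backing store once and remember the result.
            self.misses += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]


def slow_query(key):
    """Hypothetical placeholder for a MySQL query against the summary statistics."""
    return f"result-for-{key}"


lookup = CacheAside(slow_query)
lookup.get("manhattan:phenotype-1")
lookup.get("manhattan:phenotype-1")  # served from the cache
print(lookup.misses)
```

Popular requests, such as the default plots for frequently viewed phenotypes, benefit most from this pattern because the database is queried only on the first request.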

Fig. 3: GWAS Explorer data pipeline and website hosting schematic.

The GWAS Explorer frontend website is built with React (Node.js). All user interface components are developed with Bootstrap. Plots of participant descriptive characteristics and association data are built with Plotly.js and with custom solutions; for example, Manhattan plots and gene tracks are drawn on a custom canvas. The quantile-quantile (Q-Q), principal component (PC), and frequency plots are generated using Plotly.js, and the bubble charts in the Browse Phenotypes section are produced using custom D3.js code. The Apache service handles and serves all incoming web requests.

