Creating the annotated files

Annotated variants

To run the GWAS analysis, each variant in the study must be annotated with its LD score and assigned to a gene. To create the annotated SNPs dataset, use the baghera-tool create-files command.

baghera_tool.preprocess.create_files(ldscore_folder: folder with LD score as in 1kG = 'data/eur_w_ld_chr/', annotation_file: gtf file for the annotation = 'data/gencode.v31lift37.basic.annotation.gtf', snps_output: annotated ld-score snps file = 'data/ld_annotated_gencode_v31.csv', genes_output: genes table with clustered genes = 'data/genes_gencode_v31.csv', chrom_list: list of chromosomes used by HTSeq, if None chr<no> is used = None)

Builds the annotated set of SNPs required for downstream analysis. It requires a genome annotation in GTF format (preferably Gencode) and the LD score folders with $CHR.l2.ldscore files.

Parameters:
  • ldscore_folder – folder with the LD scores, as in 1kG
  • annotation_file – GTF file for the annotation
  • snps_output – ldscore snps annotated to genes
  • genes_output – genes table, the same used to annotate the snps
  • chrom_list – list of chromosomes used by HTSeq, if None chr<no> is used

We use precomputed LD scores from the set of variants for the European population of 1000 Genomes (unzip the LD score files inside the downloaded folder) and the genes in the Gencode v31 annotation, restricted to protein-coding genes. To cope with overlapping genes, we clustered them, obtaining a dataset of 15000 non-overlapping genes. For the annotation, we use a 50 kb window. The resulting dataset of annotated variants has around 1.3 million SNPs, 55% of which are annotated with a gene.
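The clustering and window-based annotation described above can be sketched as follows. This is a minimal illustration and not the BAGHERA implementation: the function names, the tuple layout, the ";" separator for cluster names, and the "first matching cluster wins" rule are all assumptions made for the example.

```python
WINDOW = 50_000  # 50 kb annotation window, as stated in the text


def cluster_genes(genes):
    """Merge overlapping (chrom, start, stop, name) records into clusters.

    Hypothetical helper: overlapping genes on the same chromosome are
    collapsed into one interval whose name joins the member names.
    """
    clusters = []
    for chrom, start, stop, name in sorted(genes):
        last = clusters[-1] if clusters else None
        if last and last[0] == chrom and start <= last[2]:
            # Overlaps the previous cluster: extend it and join the names.
            clusters[-1] = (chrom, last[1], max(last[2], stop), last[3] + ";" + name)
        else:
            clusters.append((chrom, start, stop, name))
    return clusters


def annotate(chrom, position, clusters, window=WINDOW):
    """Return the first cluster within `window` bp of the SNP, or None."""
    for c_chrom, start, stop, name in clusters:
        if c_chrom == chrom and start - window <= position <= stop + window:
            return name
    return None
```

A SNP that falls outside every window stays unannotated (None here), which is consistent with only part of the variants being assigned to a gene.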

Please note that this file has already been created; to process the data, skip to the next section.

It is possible to annotate a different set of variants, for example another reference panel, using the create-files function. For the moment, it only supports .gtf files for the gene annotation and an LD-score folder with the structure used in https://github.com/bulik/ldsc

The annotated LD scores table, returned as output, has the following structure:

chr  rs_id       position  cm     maf    l      gene
9    rs10123646  108998    0.090  0.499  3.155  FOXD4

The gene table looks like the one below.

chrom  start  stop   name
1      65418  71585  OR4F5

Create the dataset

The BAGHERA core analysis uses a table like the one below

chr  rs_id       position  cm     maf    l      gene   sample_size  z
9    rs10123646  108998    0.090  0.499  3.155  FOXD4  361194       1.4

This file is generated by merging a summary statistics file with the annotated LD file.

To create the SNPs dataset, use the baghera-tool generate-SNPs-file command.

baghera_tool.preprocess.generate_snp_file(stats_input_file: SNPs file = 'data/c50_breast_snps.csv', input_type: ldsc or ukbb, position or position_ukbb = 'position_ukbb', annotated_ld_file: previously generated annotated LD file = 'data/ld_annotated_gencode_v31.csv', output_file: output filename (.csv) = 'data/c50_snps.csv')
Annotates the input summary statistics file using the annotation generated by BAGHERA create-files.
Parameters:
  • stats_input_file – SNPs table with summary statistics
  • input_type – type of the input, one of ldsc, ukbb, position, position_ukbb
  • annotated_ld_file – LD-score SNPs annotated to genes; can be generated with create-files
  • output_file – output SNPs file with summary statistics, LD score and gene annotation

The function uses a tsv table as input and merges it with the annotated ld score table.
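As a rough illustration of the merge, the sketch below joins one summary statistics row with one annotated LD row on chromosome and position, using only the standard library. It is not the generate_snp_file code: the helper name merge_on_position is hypothetical, and the mapping of nCompleteSamples to sample_size and of tstat to z is an assumption inferred from the column names shown in this page, as is the comma separator of the annotated LD file.

```python
import csv
import io


def merge_on_position(stats_rows, ld_rows):
    """Inner-join summary statistics with annotated LD rows on (chromosome, position)."""
    stats = {(r["chrom"], r["pos"]): r for r in stats_rows}
    merged = []
    for ld in ld_rows:
        hit = stats.get((ld["chr"], ld["position"]))
        if hit is None:
            continue  # SNP has no summary statistic: drop it
        # Assumed column mapping: nCompleteSamples -> sample_size, tstat -> z.
        merged.append({**ld, "sample_size": hit["nCompleteSamples"], "z": hit["tstat"]})
    return merged


# Example rows taken from the tables in this page.
stats_tsv = "chrom\tpos\tnCompleteSamples\ttstat\n9\t108998\t361194\t1.4\n"
ld_csv = "chr,rs_id,position,cm,maf,l,gene\n9,rs10123646,108998,0.090,0.499,3.155,FOXD4\n"

stats_rows = list(csv.DictReader(io.StringIO(stats_tsv), delimiter="\t"))
ld_rows = list(csv.DictReader(io.StringIO(ld_csv)))
merged = merge_on_position(stats_rows, ld_rows)
```

Values stay as strings here because csv.DictReader does not convert types; a real pipeline would likely cast positions and statistics to numbers before analysis.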

SNPs input types

The code handles several input types, selected with the -i <type> parameter.

We recommend the position option; however, we also provide functions to directly process the data we used for this project.

Merge according to position

Once the user has made sure that the genome build is consistent across all files, merging SNPs and genes according to their position is the safest option. This way no rsID is taken into consideration, which avoids the risk of inconsistent naming.

When the flag -i position is specified, BAGHERA expects to find the following columns in the input SNP file:

  • chrom: chromosome
  • pos: base-pair position (BP)
  • nCompleteSamples: number of samples
  • tstat: beta/se stat
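Before running the tool with the position option, it can help to check that the input header carries the four columns listed above. The sketch below is a small pre-flight check, not part of BAGHERA; the helper names missing_columns and t_statistic are hypothetical.

```python
REQUIRED_POSITION_COLUMNS = {"chrom", "pos", "nCompleteSamples", "tstat"}


def missing_columns(header, required=REQUIRED_POSITION_COLUMNS):
    """Return the required column names absent from a header, sorted."""
    return sorted(required - set(header))


def t_statistic(beta, se):
    """tstat as described in the list: effect size divided by its standard error."""
    return beta / se
```

An empty list from missing_columns means the file has every column expected for the position input type.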

UKBB format

Since we processed all the cancer data from the UKBB GWAS study (available in the round two results), we provide an off-the-shelf flag to directly process these data. Using the flag -i position_ukbb, the tool automatically extracts the position from the variant field in the table. This function directly splits the variant column; if the expected fields are not found, an exception is raised. The tool expects the following fields:

  • variant: variant identifier, from which the position is extracted
  • nCompleteSamples: number of samples
  • tstat: beta/se stat
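Splitting the variant field could look like the sketch below. It assumes the variant string has the form chrom:pos:ref:alt, which is not stated in this page, and the function name split_variant and the choice of ValueError are hypothetical; the actual BAGHERA code may parse the field and report errors differently.

```python
def split_variant(variant):
    """Split a 'chrom:pos:ref:alt' string into (chrom, pos).

    Raises ValueError when the string does not match the assumed layout.
    """
    parts = variant.split(":")
    if len(parts) != 4 or not parts[1].isdigit():
        raise ValueError(f"unexpected variant format: {variant!r}")
    return parts[0], int(parts[1])
```

For example, split_variant("9:108998:A:G") yields the chromosome as a string and the position as an integer (the alleles in this string are made up for illustration), while an rsID such as "rs10123646" is rejected.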

Other input formats

The LD-score project provides some summary statistics that it has already processed; use -i ldsc to process these sumstats files (.sumstats.txt).

Use -i ukbb to process the old UKBB files (.assoc.tsv).