Creating the annotated files¶
Annotated variants¶
To run the GWAS analysis, the variants in the study need to be annotated with their LD score and assigned to a gene. To create the annotated variants file, use the baghera-tool create-files command.
We use precomputed LD scores from the set of variants for the European population of 1000 Genomes (unzip the LD score files inside the downloaded folder) and the genes in the Gencode v31 annotations, restricted to the protein-coding ones. To cope with overlapping genes, we clustered them, obtaining a dataset of 15000 non-overlapping genes. For the annotation, we use a 50 kb window. The resulting dataset of annotated variants has around 1.3 million SNPs, 55% of which are annotated with a gene.
Please note that this file has already been created; to process the data, skip to the next section.
It is possible to annotate a different set of variants, for example another reference panel, using the create-files command. For the moment, it only supports .gtf files for the gene annotation and an LD-score folder with the structure used in https://github.com/bulik/ldsc
The annotated LD score table, returned as output, has the following structure:
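The clustering of overlapping genes and the window-based annotation described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used by BAGHERA: the function names, the way cluster names are joined, the choice of the first matching cluster, and the genes GENE_A and GENE_B are all hypothetical.

```python
from typing import List, Optional, Tuple

# (start, stop, name) for genes on a single chromosome
Gene = Tuple[int, int, str]

WINDOW = 50_000  # 50 kb window around each gene, as stated in the text


def cluster_overlapping(genes: List[Gene]) -> List[Gene]:
    """Merge genes whose intervals overlap into a single cluster."""
    clusters: List[Gene] = []
    for start, stop, name in sorted(genes):
        if clusters and start <= clusters[-1][1]:
            # Overlaps the previous cluster: extend it and join the names
            # (joining names with ';' is an assumption made for this sketch).
            prev_start, prev_stop, prev_name = clusters[-1]
            clusters[-1] = (prev_start, max(prev_stop, stop), f"{prev_name};{name}")
        else:
            clusters.append((start, stop, name))
    return clusters


def annotate(position: int, clusters: List[Gene], window: int = WINDOW) -> Optional[str]:
    """Return the first cluster whose window-extended interval contains the position."""
    for start, stop, name in clusters:
        if start - window <= position <= stop + window:
            return name
    return None  # the variant is not annotated with any gene


# OR4F5 comes from the gene table in this page; GENE_A and GENE_B are hypothetical.
genes = [(65418, 71585, "OR4F5"), (70000, 80000, "GENE_A"), (500000, 510000, "GENE_B")]
clusters = cluster_overlapping(genes)
```

With this toy input, the two overlapping genes collapse into one cluster, a variant at position 120000 falls inside its 50 kb window, and a variant at position 300000 stays unannotated.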
chr | rs_id | position | cm | maf | l | gene |
---|---|---|---|---|---|---|
9 | rs10123646 | 108998 | 0.090 | 0.499 | 3.155 | FOXD4 |
The gene table looks like the one below:
chrom | start | stop | name |
---|---|---|---|
1 | 65418 | 71585 | OR4F5 |
Create the dataset¶
The BAGHERA core analysis uses a table like the one below:
chr | rs_id | position | cm | maf | l | gene | sample_size | z |
---|---|---|---|---|---|---|---|---|
9 | rs10123646 | 108998 | 0.090 | 0.499 | 3.155 | FOXD4 | 361194 | 1.4 |
This file is generated by merging a summary statistics file with the annotated LD score file.
To create the SNPs dataset, use the baghera-tool generate-SNPs-file command.
The command takes a TSV table as input and merges it with the annotated LD score table.
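To make the merge step concrete, here is a minimal sketch using only the Python standard library. It assumes the summary statistics have already been reduced to chr, position, sample_size and z columns and that the merge keeps only variants present in both tables; both are assumptions for illustration, as are the helper names and the second summary statistics row.

```python
import csv
import io

# Toy annotated LD score table (the row shown earlier on this page).
LD_TSV = (
    "chr\trs_id\tposition\tcm\tmaf\tl\tgene\n"
    "9\trs10123646\t108998\t0.090\t0.499\t3.155\tFOXD4\n"
)
# Toy summary statistics; the second row is hypothetical and has no match.
STATS_TSV = (
    "chr\tposition\tsample_size\tz\n"
    "9\t108998\t361194\t1.4\n"
    "9\t999999\t361194\t0.2\n"
)


def read_tsv(text):
    """Parse a tab-separated table into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def merge_on_position(ld_rows, stat_rows):
    """Keep LD rows that have a summary statistic at the same (chr, position)."""
    stats = {(r["chr"], r["position"]): r for r in stat_rows}
    merged = []
    for row in ld_rows:
        hit = stats.get((row["chr"], row["position"]))
        if hit is not None:
            merged.append({**row, "sample_size": hit["sample_size"], "z": hit["z"]})
    return merged


merged = merge_on_position(read_tsv(LD_TSV), read_tsv(STATS_TSV))
```

The single merged row carries the same nine columns as the core analysis table above.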
SNPs input types¶
The tool handles several input types, selected with the -i <type> parameter.
We recommend the position option; however, we also provide functions to directly process the data formats that we have been using for this project.
Merge according to position¶
Once the user has made sure that the genome build in use is consistent across all files, merging SNPs and genes according to their position is the safest option. This way, rsIDs are not taken into consideration, which avoids the risk of inconsistent naming.
When the -i position flag is specified, BAGHERA expects to find the following columns in the input SNP file:
- chrom: chromosome
- pos: base-pair position (BP)
- nCompleteSamples: number of samples
- tstat: test statistic (beta/se)
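A hedged sketch of how an input file with these columns could be checked and loaded is shown below. The function name load_position_input is hypothetical, and the mapping of nCompleteSamples to sample_size and of tstat to z is an assumption based on the columns of the core analysis table; the actual tool may behave differently.

```python
import csv
import io

# Columns listed above for the position input type.
REQUIRED = ("chrom", "pos", "nCompleteSamples", "tstat")


def load_position_input(handle):
    """Read a TSV file and return rows using the core table column names."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    rows = []
    for rec in reader:
        rows.append({
            "chr": rec["chrom"],
            "position": int(rec["pos"]),
            # Assumed mapping onto the core table columns.
            "sample_size": int(rec["nCompleteSamples"]),
            "z": float(rec["tstat"]),
        })
    return rows


example = io.StringIO("chrom\tpos\tnCompleteSamples\ttstat\n9\t108998\t361194\t1.4\n")
rows = load_position_input(example)
```

Failing early on a missing column keeps a malformed file from silently producing an empty merge later on.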
UKBB format¶
Since we processed all the cancer data from the UKBB GWAS study available here (in the round two results), we provide an off-the-shelf flag to directly process these data. With the flag -i position_ukbb, the tool automatically extracts the position from the variant field in the table by splitting the variant column. If the expected fields are not found, an exception is raised. The tool expects the following fields:
- variant: the variant identifier, from which the position is extracted
- nCompleteSamples: number of samples
- tstat: test statistic (beta/se)
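The splitting of the variant field can be illustrated with the sketch below. It assumes that the field has the form chrom:pos:ref:alt; this layout is an assumption and should be checked against the actual files. The function name split_variant and the alleles in the example are hypothetical.

```python
def split_variant(variant: str):
    """Split an identifier assumed to look like 'chrom:pos:ref:alt'."""
    parts = variant.split(":")
    if len(parts) != 4:
        # Mirrors the behaviour described above: malformed input raises.
        raise ValueError(f"unexpected variant format: {variant!r}")
    chrom, pos, _ref, _alt = parts
    # int() also raises ValueError if the position is not numeric.
    return chrom, int(pos)


chrom, pos = split_variant("9:108998:A:G")  # alleles are made up for the example
```

Only the chromosome and position are kept, since these are the fields used when merging by position.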
Other input formats¶
The LD-score project provides some summary statistics that they have already processed; use -i ldsc to process these sumstats files (.sumstats.txt).
Use -i ukbb to process the old UKBB files (.assoc.tsv).