Creating the annotated files

Annotated variants

To run the GWAS analysis, each variant in the study must be annotated with its LD score and with a gene. To create the annotated variants file, use the baghera-tool create-files command.

We use precomputed LD scores from the set of variants for the European population of 1000 Genomes (unzip the LD-score files inside the downloaded folder) and the protein-coding genes in the Gencode v31 annotations. To cope with overlapping genes, we clustered them, obtaining a dataset of 15000 non-overlapping genes. For the annotation, we use a 50 kb window. The resulting dataset of annotated variants has around 1.3 million SNPs, 55% of which are annotated with a gene.
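The clustering of overlapping genes can be illustrated with a simple interval merge. This is a minimal sketch, not the BAGHERA implementation: the text does not say how clusters are built or named, so the merge rule (any overlap on the same chromosome), the `;`-joined cluster name, and all gene names and coordinates below are hypothetical.

```python
from typing import List, Tuple

Gene = Tuple[str, int, int, str]  # (chrom, start, stop, name)


def cluster_overlapping_genes(genes: List[Gene]) -> List[Gene]:
    """Merge overlapping genes on the same chromosome into single clusters."""
    clusters: List[Gene] = []
    # Sort by chromosome, then start, so overlapping genes are adjacent.
    for chrom, start, stop, name in sorted(genes, key=lambda g: (g[0], g[1])):
        if clusters and clusters[-1][0] == chrom and start <= clusters[-1][2]:
            # Overlaps the previous cluster: extend it and join the names.
            p_chrom, p_start, p_stop, p_name = clusters[-1]
            clusters[-1] = (p_chrom, p_start, max(p_stop, stop), p_name + ";" + name)
        else:
            clusters.append((chrom, start, stop, name))
    return clusters


# Hypothetical genes, for illustration only.
genes = [
    ("1", 100, 500, "GENE_A"),
    ("1", 400, 900, "GENE_B"),
    ("1", 1500, 2000, "GENE_C"),
    ("2", 100, 300, "GENE_D"),
]
clustered = cluster_overlapping_genes(genes)
for row in clustered:
    print(row)
```

Here GENE_A and GENE_B overlap and collapse into one cluster, while the other two genes are left unchanged.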

Please note that this file has already been created; to process your data, skip to the next section.

It is possible to annotate a different set of variants, for example another reference panel, using the create-files function. For the moment, it only supports .gtf files for the gene annotation and an LD-score folder with the structure used in https://github.com/bulik/ldsc.

The annotated LD-score table, returned as output, has the following structure:

chr	rs_id	position	cm	maf	l	gene
9	rs10123646	108998	0.090	0.499	3.155	FOXD4

The gene table looks like the one below:

chrom	start	stop	name
1	65418	71585	OR4F5
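As an illustration of how a SNP could be assigned to a gene using the 50 kb window, here is a minimal sketch that works on rows shaped like the gene table above. It assumes the window extends symmetrically on both sides of the gene and that the first matching gene wins. Neither detail is stated in the text, and `annotate_snp` is a hypothetical helper, not a BAGHERA function.

```python
from typing import List, Optional, Tuple

WINDOW = 50_000  # 50 kb window, as stated in the text

Gene = Tuple[str, int, int, str]  # (chrom, start, stop, name)


def annotate_snp(chrom: str, position: int, genes: List[Gene],
                 window: int = WINDOW) -> Optional[str]:
    """Return the first gene whose window-extended interval contains the SNP."""
    for g_chrom, start, stop, name in genes:
        if g_chrom == chrom and start - window <= position <= stop + window:
            return name
    return None  # the SNP is not within the window of any gene


# Gene row taken from the table above; the SNP positions are hypothetical.
genes = [("1", 65418, 71585, "OR4F5")]
print(annotate_snp("1", 100000, genes))  # inside the 50 kb window
print(annotate_snp("1", 200000, genes))  # outside the window
```

A linear scan is enough for a sketch; with millions of SNPs, a sorted or indexed lookup would be the natural choice.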

Create the dataset

The BAGHERA core analysis uses a table like the one below:

chr	rs_id	position	cm	maf	l	gene	sample_size	z
9	rs10123646	108998	0.090	0.499	3.155	FOXD4	361194	1.4

This file is generated by merging a summary statistics file with the annotated LD-score file.

To create the SNPs dataset, use the baghera-tool generate-SNPs-file command.

The function takes a TSV table as input and merges it with the annotated LD-score table.

SNPs input types

The code handles several input types; select one with the -i <type> parameter.

We recommend the position option; however, we also provide functions to directly process the data that we used for this project.

Merge according to position

Once the user has made sure that the genome build is consistent across all files, merging SNPs and genes according to their position is the safest option. This way rsIDs are not used at all, which avoids mismatches caused by different naming.

When the -i position flag is specified, BAGHERA expects to find the following columns in the input SNP file:

chrom: chromosome

pos: base-pair position

nCompleteSamples: number of samples

tstat: test statistic (beta/se)
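The position-based merge can be sketched with the standard library alone. This is an illustrative sketch, not the BAGHERA code: it assumes an inner join on (chromosome, position) and that nCompleteSamples and tstat end up in the sample_size and z columns of the final table. `merge_by_position` and the second summary-statistics row are hypothetical.

```python
import csv
import io

# Annotated LD-score table (row taken from the example above).
ANNOTATED = (
    "chr\trs_id\tposition\tcm\tmaf\tl\tgene\n"
    "9\trs10123646\t108998\t0.090\t0.499\t3.155\tFOXD4\n"
)
# Summary statistics with the columns expected by the position option.
SUMSTATS = (
    "chrom\tpos\tnCompleteSamples\ttstat\n"
    "9\t108998\t361194\t1.4\n"
    "9\t999999\t361194\t-0.3\n"  # hypothetical SNP absent from the annotated table
)


def merge_by_position(annotated_tsv: str, sumstats_tsv: str) -> list:
    """Inner-join the two tables on (chromosome, position)."""
    stats = {
        (row["chrom"], row["pos"]): row
        for row in csv.DictReader(io.StringIO(sumstats_tsv), delimiter="\t")
    }
    merged = []
    for row in csv.DictReader(io.StringIO(annotated_tsv), delimiter="\t"):
        match = stats.get((row["chr"], row["position"]))
        if match is not None:
            # Assumed mapping: nCompleteSamples -> sample_size, tstat -> z.
            merged.append({**row,
                           "sample_size": match["nCompleteSamples"],
                           "z": match["tstat"]})
    return merged


merged = merge_by_position(ANNOTATED, SUMSTATS)
print(merged[0])
```

Because the join key is built only from chromosome and position, the rs_id column plays no role in the merge, which is the point of this option.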

UKBB format

Since we processed all the cancer data from the UKBB GWAS study (available in the round two results), we provide an off-the-shelf flag to process these data directly. With the flag -i position_ukbb, the tool automatically extracts the position from the variant field in the table by splitting the variant column. If the expected fields are not found, an exception is raised. The tool expects the following fields:

variant: variant identifier, from which the position is extracted
nCompleteSamples: number of samples
tstat: test statistic (beta/se)
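Splitting the variant field might look like the sketch below. The text does not spell out the layout of this field, so the colon-separated chromosome:position:ref:alt form is an assumption, as are the alleles in the example. `split_variant` is a hypothetical helper, not the function used by the tool.

```python
from typing import Tuple


def split_variant(variant: str) -> Tuple[str, int]:
    """Extract (chromosome, position) from an assumed 'chrom:pos:ref:alt' string."""
    fields = variant.split(":")
    if len(fields) != 4:
        # Mirrors the behaviour described above: fail loudly on unexpected input.
        raise ValueError(f"unexpected variant format: {variant!r}")
    chrom, pos = fields[0], fields[1]
    return chrom, int(pos)  # int() also raises ValueError on a non-numeric position


# Hypothetical variant string built from the example position used earlier.
print(split_variant("9:108998:A:G"))
```

Raising an exception on malformed input, rather than skipping the row, keeps a wrong -i choice from silently producing an empty dataset.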

Other input formats

The LD-score project provides some summary statistics that they have already processed; use -i ldsc to process these sumstats files (.sumstats.txt).

Use -i ukbb to process the old UKBB files (.assoc.tsv).
