生信分析进阶5 - 全外显子组变异检测和ANNOVAR注释Snakemake分析流程

基于yaml或ini配置文件,配置文件包含例如样本名称、参考基因组版本、exon capture bed文件路径、参考基因组路径和ANNOVAR注释文件等信息。

基于该流程可以实现全外显测序的fastq文件输入到得到最终变异VCF文件。

1. Snakemake分析流程基础软件安装

# conda安装
conda install -c bioconda snakemake -y
conda install -c bioconda fastqc -y
conda install -c bioconda trim-galore -y
conda install -c bioconda bwa -y
conda install -c bioconda samtools -y

# GATK 4.2.1版本
conda install -c bioconda gatk -y

ANNOVAR注释软件安装和使用参考文章:
https://www.jianshu.com/p/461f2cd47564

ANNOVAR注释全外显子有些坑和经常会报错的问题,
本人后续有空会写下对全外显子注释的ANNOVAR软件安装和使用

2. AgilentSureSelect 外显子序列捕获流程图

外显子捕获基本原理参考下图。
AgilentSureSelect

3. GATK SNP和INDEL分析基本流程

GATK

4. Snakemake全外显子分析流程

其中regions_bed配置为外显子捕获所采用试剂盒对应的bed文件路径。

##################### parser configfile #########################
import os
configfile: "config.yaml"
# configfile: "config.ini"

sample = config["sample"]
genome = config["genome"]
#################################################################
PROJECT_DATA_DIR = config["project_data_dir"]
PROJECT_RUN_DIR = config["project_run_dir"]

os.system("mkdir -p {0}".format(PROJECT_RUN_DIR))

######################### Output #################################
ALL_FASTQC_RAW_ZIP1 = PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_1_fastqc.zip"
ALL_FASTQC_RAW_HTML1 = PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_1_fastqc.html"
ALL_FASTQC_RAW_ZIP2 = PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_2_fastqc.zip"
ALL_FASTQC_RAW_HTML2 = PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_2_fastqc.html"

ALL_CELAN_FASTQ1 = PROJECT_RUN_DIR + "/fastqc/clean_data/" + sample + "_1_val_1.fq.gz"
ALL_CELAN_FASTQ2 = PROJECT_RUN_DIR + "/fastqc/clean_data/" + sample + "_2_val_2.fq.gz"

ALL_SORTED_BAM = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam"
ALL_SORTED_BAM_BAI1 = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam.bai"
ALL_BAM_FLAGSTAT = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam.flagstat"

ALL_EXON_INTERVAL_BED = PROJECT_RUN_DIR + "/gatk/" + sample + ".Exon.Interval.bed"
ALL_STAT_TXT = PROJECT_RUN_DIR + "/gatk/" + sample + ".stat.txt"
ALL_MKDUP_BAM = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam"
ALL_METRICS = PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.bam.metrics"
ALL_SORTED_BAM_BAI2 = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam.bai"

ALL_INSERT_SIZE_HISTOGRAM_pdf = PROJECT_RUN_DIR + "/gatk/" + sample + ".insert_size_histogram.pdf"

ALL_RECAL_DATA_TABLE = PROJECT_RUN_DIR + "/gatk/" + sample + ".recal_data.table"
ALL_BQSR_SORTED_BAM = PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.mkdup.BQSR.bam"
ALL_depthOfcoverage = PROJECT_RUN_DIR + "/gatk/depthOfcoverage"

ALL_RAW_VCF = PROJECT_RUN_DIR + "/vcf/" + sample + ".raw.vcf"
All_SNP_SELECT = PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.vcf"
ALL_SNP_FILTER_VCF = PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.vcf"
ALL_SNP_FILTER_VCF_STAT = PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.filter.vcf_stat.csv"

All_INDEL_SELECT = PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.vcf"
ALL_INDEL_FILTER_VCF = PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.vcf"

ALL_SNP_AVINPUT = PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.avinput"
ALL_INDEL_AVINPUT = PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.avinput"

ALL_MERG_FILTERE_VCF = PROJECT_RUN_DIR + "/vcf/" + sample + ".merge.filter.vcf"

ALL_SNPANNO = PROJECT_RUN_DIR + "/vcf/" + sample + "_snpanno." + genome + "_multianno.csv"
ALL_INDELANNO = PROJECT_RUN_DIR + "/vcf/" + sample + "_indelanno." + genome + "_multianno.csv"
ALL_ALLANNO = PROJECT_RUN_DIR + "/vcf/" + sample + "_allanno." + genome + "_multianno.csv"

QUALIMAP_REPORT = PROJECT_RUN_DIR + "/gatk/bam_QC/" + sample + "_bamQC_report.pdf"


######################## Workflow #################################
rule all:
    input:
        ALL_FASTQC_RAW_ZIP1,
        ALL_FASTQC_RAW_HTML1,
        ALL_FASTQC_RAW_ZIP2,
        ALL_FASTQC_RAW_HTML2,
        ALL_SORTED_BAM,
        ALL_SORTED_BAM_BAI1,
        ALL_BAM_FLAGSTAT,
        ALL_EXON_INTERVAL_BED,
        ALL_STAT_TXT,
        ALL_MKDUP_BAM,
        ALL_METRICS,
        ALL_SORTED_BAM_BAI2,
        ALL_RECAL_DATA_TABLE,
        ALL_BQSR_SORTED_BAM,
        ALL_RAW_VCF,
        All_SNP_SELECT,
        All_INDEL_SELECT,
        ALL_SNP_FILTER_VCF,
        ALL_INDEL_FILTER_VCF,
        ALL_MERG_FILTERE_VCF,
        ALL_SNPANNO,
        ALL_INDELANNO,
        ALL_ALLANNO


rule fastqc_raw:
    input:
        r1 = PROJECT_DATA_DIR + "/" + sample + "_1.fq.gz",
        r2 = PROJECT_DATA_DIR + "/" + sample + "_2.fq.gz"
    output:
        PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_1_fastqc.zip",
        PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_2_fastqc.zip",
        PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_1_fastqc.html",
        PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_2_fastqc.html"
    threads: 8
    params:
        raw_data_dir = PROJECT_RUN_DIR + "/fastqc/raw_data",
        clean_data_dir = PROJECT_RUN_DIR + "/fastqc/clean_data",	
        align_dir = PROJECT_RUN_DIR + "/align", 
        gatk_dir = PROJECT_RUN_DIR + "/gatk",
        vcf_dir = PROJECT_RUN_DIR + "/vcf",
        tmp_dir = PROJECT_RUN_DIR + "/tmp",
	fastqc_report_dir = PROJECT_RUN_DIR + "/report/fastqc_report",
	align_report_dir = PROJECT_RUN_DIR + "/report/align_report"
    shell:
        """
        mkdir -p {params.raw_data_dir}
        mkdir -p {params.clean_data_dir}
        mkdir -p {params.align_dir}
        mkdir -p {params.gatk_dir}
        mkdir -p {params.vcf_dir}
        mkdir -p {params.tmp_dir}
        fastqc -f fastq --extract -t {threads} -o {params.raw_data_dir} {input.r1} {input.r2}        
	"""

rule fastqc_clean:
    input:
        r1 = PROJECT_DATA_DIR + "/" + sample + "_1.fq.gz",
        r2 = PROJECT_DATA_DIR + "/" + sample + "_2.fq.gz"
    output:
        temp(PROJECT_RUN_DIR + "/fastqc/clean_data/" + sample + "_1_val_1.fq.gz"),
        temp(PROJECT_RUN_DIR + "/fastqc/clean_data/" + sample + "_2_val_2.fq.gz")
    threads: 8
    params:
        quality = 20,
        length = 20,
	    clean_data_dir = PROJECT_RUN_DIR + "/fastqc/clean_data"
    shell:
        """
        trim_galore --paired --quality {params.quality} --length {params.length} -o {params.clean_data_dir} {input.r1} {input.r2}
	    """

# BWA mem比对
rule bwa_mem:
    input:
        reads1 = PROJECT_RUN_DIR + "/fastqc/clean_data/" + sample + "_1_val_1.fq.gz",
        reads2 = PROJECT_RUN_DIR + "/fastqc/clean_data/" + sample + "_2_val_2.fq.gz"
    output:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam"
    threads:8
    params:
        hg = config["hg_fa"]
    shell:
        """
        bwa mem -t {threads} -M -R "@RG\\tID:{sample}\\tPL:ILLUMINA\\tLB:{sample}\\tSM:{sample}" {params.hg} {input.reads1} {input.reads2}|samtools sort -@ {threads} -o {output}
        """

rule bam_index_1:
    input:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam"
    output:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam.bai"
    threads: 2
    shell:
        "samtools index {input} > {output}"

rule flagstat:
    input:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam"
    output:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam.flagstat"
    threads: 1
    shell:
        "samtools flagstat {input} > {output}"

# 创建reference fasta 字典,GATK需要用到
# rule CreateSequenceDictionary:
#     input:
#         "/reference/hg19/ucsc.hg19.fa"
#     output:
#         PROJECT_RUN_DIR + "/" + sample + "/gatk/" + sample + ".hg19.dict"
#         # /GATK_bundle/hg19/ucsc.hg19.dict
#     threads: 1
#     params:
#         gatk4 = "/public/software/gatk-4.0.6.0/gatk",
#         gatk_dir = PROJECT_RUN_DIR + "/" + sample + "/gatk"
#     shell:
#         """
#         mkdir -p {params.gatk_dir}
#         gatk CreateSequenceDictionary -R {input} -O {ouptut}
#         """ 

rule CreateExonIntervalBed:
    input:
        regions_bed = config["regions_bed"],
        hg_dict = config["hg_dict"]
    output:
        Exon_Interval_bed = PROJECT_RUN_DIR + "/gatk/" + sample + ".Exon.Interval.bed"
    threads: 1
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk"
    shell:
        """
        gatk BedToIntervalList -I {input.regions_bed} -O {output.Exon_Interval_bed} -SD {input.hg_dict}
        """ 

# 外显子区域覆盖度
rule exon_region:
    input:
        bam = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam",
        Exon_Interval_bed = PROJECT_RUN_DIR + "/gatk/" + sample + ".Exon.Interval.bed"
    output:
        stat_txt = PROJECT_RUN_DIR + "/gatk/" + sample + ".stat.txt"
    threads: 2
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk"
    shell:
        """
        gatk CollectHsMetrics -BI {input.Exon_Interval_bed} -TI {input.Exon_Interval_bed} -I {input.bam} -O {output.stat_txt}
        """ 

# 标记PCR重复序列
rule mark_duplicates:
    input:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam"
    output:
        mkdup_bam = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam",
        metrics = PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.bam.metrics"
    threads: 4
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk"
    shell:
        """
        gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" MarkDuplicates -I {input} -O {output.mkdup_bam} -M {output.metrics}
        """

rule bam_index_2:
    input:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam"
    output:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam.bai"
    threads: 2
    shell:
        "samtools index {input} > {output}"

# 插入片段分布统计
#rule InsertSizeStat:
#    input:
#        PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.mkdup.bam"
#    output:
#        insert_size_metrics_txt = temp(PROJECT_RUN_DIR + "/gatk/" + sample + ".insert_size_metrics.txt"),
#        insert_size_histogram_pdf = PROJECT_RUN_DIR + "/gatk/" + sample + ".insert_size_histogram.pdf"
#    threads:1
#    params:
#        picard = "/public/software/picard.jar",
#        M = 0.5
#    shell:
#        """
#       java -jar {params.picard} CollectInsertSizeMetrics \
#        I={input} \
#        O={output.insert_size_metrics_txt} \
#        H={output.insert_size_histogram_pdf} \
#        M=0.5
#        """

# 重新校正碱基质量值
# 计算出所有需要进行重校正的read和特征值,输出为一份校准表文件.recal_data.table
rule BaseRecalibrator:
    input:
        mkdup_bam = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam",
        mkdup_bam_index = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam.bai",
        vcf_1000G_phase1_indels_sites = config["vcf_1000G_phase1_indels_sites"],
        vcf_Mills_and_1000G_gold_standard_indels_sites = config["vcf_Mills_and_1000G_gold_standard_indels_sites"],
        vcf_dbsnp = config["vcf_dbsnp"]
    output:
        PROJECT_RUN_DIR + "/gatk/" + sample + ".recal_data.table"
    threads: 8
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk",
        hg_fa = config["hg_fa"],
        regions_bed = config["regions_bed"]
    shell:
        """
        gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" BaseRecalibrator \
        -R {params.hg_fa} \
        -I {input.mkdup_bam} \
        --known-sites {input.vcf_1000G_phase1_indels_sites} \
        --known-sites {input.vcf_Mills_and_1000G_gold_standard_indels_sites} \
        --known-sites {input.vcf_dbsnp} \
        -L {params.regions_bed} \
        -O {output}
        """
        
# gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" BaseRecalibrator -R genome/hg19.fa -I ./TKQX.sorted.mkdup.bam --known-sites /GATK_bundle/hg19/1000G_phase1.indels.hg19.sites.vcf --known-sites /GATK_bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf --known-sites /GATK_bundle/hg19/dbsnp_138.hg19.vcf -L /GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed -O ./TKQX.recal_data.table
# java -jar CreateSequenceDictionary.jar R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict

# 利用校准表文件重新调整原来BAM文件中的碱基质量值,使用这个新的质量值重新输出新的BAM文件
rule ApplyBQSR:
    input:
        mkdup_bam = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam",
        recal_data = PROJECT_RUN_DIR + "/gatk/" + sample + ".recal_data.table",
        regions_bed = config["regions_bed"]
        #regions_bed = "/GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed"
    output:
        PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.mkdup.BQSR.bam"
    threads: 4
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk",
        hg_fa = config["hg_fa"]
        #hg_fa = "genome/hg19.fa"
    shell:
        """
        gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" ApplyBQSR \
        -R {params.hg_fa} \
        -I {input.mkdup_bam} \
        -bqsr {input.recal_data} \
        -L {input.regions_bed} \
        -O {output}
        """
# gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" ApplyBQSR -R genome/hg19.fa -I ./test_sample.sorted.mkdup.bam -bqsr ./test_sample.recal_data.table -L /GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed -O ./test_sample.sorted.mkdup.BQSR.bam

# # 校正碱基图
# rule recal_data_plot:
#     input:
#         PROJECT_RUN_DIR + "/" + sample + "/gatk/" + sample + ".recal_data.table"
#     output:
#         PROJECT_RUN_DIR + "/" + sample + "/gatk/" + sample + ".recal_data.table.plot"
#     threads: 1
#     params:
#         gatk4 = "/public/software/gatk-4.0.6.0/gatk",
#         gatk = "/public/software/anaconda3/envs/PGS_1M/bin/gatk"
#     shell:
#         """
#         gatk AnalyzeCovariates -bqsr {input} -plots {output}
#         """
# gatk AnalyzeCovariates -bqsr ./test_sample.recal_data.table -plots ./test_sample.recal_data.table.plot

# 统计测序深度和覆盖度
#rule depthOfcoverage:
#    input:
#        BQSR_mkdup_bam = PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.mkdup.BQSR.bam"
#    output:
#        PROJECT_RUN_DIR + "/gatk/depthOfcoverage"
#    threads:1
#    params:
#        #hg19_fa = "genome/hg19.fa",
#        hg_fa = config["hg_fa"],
#        bamdst = "/public/software/bamdst/bamdst",
#        regions_bed = config["regions_bed"]
#        #regions_bed = "/GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed"
#    shell:
#        """
#        {params.bamdst} -p {params.regions_bed} -o {output} {input.BQSR_mkdup_bam}
#        """
# """
# gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" DepthOfCoverage  \
# -R {params.hg19} \
# -o {output} \
# -I {input.BQSR_mkdup_bam} \
# -L {input.regions_bed} \
# --omitDepthOutputAtEachBase --omitIntervalStatistics \
# -ct 1 -ct 5 -ct 10 -ct 20 -ct 30 -ct 50 -ct 100
# """
#gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" DepthOfCoverage -R genome/hg19.fa -o ./TKQX -I ./TKQX.sorted.mkdup.BQSR.bam -L /GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed --omitDepthOutputAtEachBase --omitIntervalStatistics -ct 1 -ct 5 -ct 10 -ct 20 -ct 30 -ct 50 -ct 100

# /public/software/bamdst/bamdst -p /GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed -o ./depth_coverage ./TKQX.sorted.mkdup.BQSR.bam

# # 单样本VCF calling
rule vcf_calling:
    input:
        BQSR_mkdup_bam = PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.mkdup.BQSR.bam"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".raw.vcf"
    threads: 4
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk",
        hg_fa = config["hg_fa"],
        vcf_dbsnp = config["vcf_dbsnp"],
        regions_bed = config["regions_bed"]
    shell:
        """
        gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" HaplotypeCaller \
        -R {params.hg_fa} \
        -I {input.BQSR_mkdup_bam} \
        -D {params.vcf_dbsnp} \
        -L {params.regions_bed} \
        -O {output}
        """
# gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" HaplotypeCaller -R genome/hg19.fa -I ./TKQX.sorted.mkdup.BQSR.bam -D /GATK_bundle/hg19/dbsnp_138.hg19.vcf -L /GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed -O ./TKQX.raw.vcf

# 变异质控和过滤,对raw data进行质控,剔除假阳性的标记
rule vcf_select_SNP:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".raw.vcf"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.vcf"
    threads: 2
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk"
    shell:
        """
        gatk SelectVariants -select-type SNP -V {input} -O {output}
        """

rule vcf_filter_SNP:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.vcf"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.filter.vcf"
    threads: 2
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk",
        QD = 2.0,
        MQ = 40.0,
        FS = 60.0,
        SOR = 3.0,
        MQRankSum = -12.5,
        ReadPosRankSum = -8.0
    shell:
        """
        gatk VariantFiltration \
        -V {input} \
        --filter-expression "QD < {params.QD} || MQ < {params.MQ} || FS > {params.FS} || SOR > {params.SOR} || MQRankSum < {params.MQRankSum} || ReadPosRankSum < {params.ReadPosRankSum}" --filter-name "PASS" \
        -O {output}
        """
# --filter-expression "QD < 2.0|| MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name "PASS"‘
# gatk VariantFiltration -V TKQX.snp.vcf --filter-expression "QD < 2.0|| MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name "PASS" -O TKQX.snp.filter.vcf

# rule snp_frequency_stat:
#     input:
#         PROJECT_RUN_DIR + "/" + sample + "/vcf/" + sample + ".snp.filter.vcf"
#     output:
#         PROJECT_RUN_DIR + "/" + sample + "/vcf/" + sample + ".snp.filter.vcf_stat.csv"
#     threads: 1
#     params:
#         gatk4 = "/public/software/gatk-4.0.6.0/gatk",
#         snp_frequency = "/public/analysis/pipeline/WES/scripts/snp_frequency.py"
#     shell:
#         """
#         python {params.snp_frequency} {input}
#         """

rule vcf_select_InDel:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".raw.vcf"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.vcf"
    threads: 2
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk"
    shell:
        """
        gatk SelectVariants -select-type INDEL -V {input} -O {output}
        """

rule vcf_filter_InDel:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.vcf"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.filter.vcf"
    threads: 2
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk",
        QD = 2.0,
        FS = 200.0,
        SOR = 10.0,
        MQRankSum = -12.5,
        ReadPosRankSum = -8.0
    shell:
        """
        gatk VariantFiltration \
        -V {input} \
        --filter-expression "QD < {params.QD} || FS > {params.FS} || SOR > {params.SOR} || MQRankSum < {params.MQRankSum} || ReadPosRankSum < {params.ReadPosRankSum}" --filter-name "PASS" \
        -O {output}
        """
# --filter-expression "QD < 2.0 || FS > 200.0 || SOR > 10.0|| MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name "PASS"

rule merge_vcf:
    input:
        snp_filter_vcf = PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.filter.vcf",
        indel_filter_vcf = PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.filter.vcf"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".merge.filter.vcf"
    threads: 1
    params:
        vcf_dir = PROJECT_RUN_DIR + "/vcf"
    shell:
        """
        cd {params.vcf_dir}
        gatk MergeVcfs \
        -I {input.snp_filter_vcf} \
        -I {input.indel_filter_vcf} \
        -O {output}
        """

rule transfer_avinput:
    input:
        snp = PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.filter.vcf",
        indel= PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.filter.vcf",
        merge = PROJECT_RUN_DIR + "/vcf/" + sample + ".merge.filter.vcf"
    output:
        snp = temp(PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.avinput"),
        indel = temp(PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.avinput"),
        merge = temp(PROJECT_RUN_DIR + "/vcf/" + sample + ".merge.avinput")
    threads: 4
    params:
        convert2annovar = "/public/software/annovar/convert2annovar.pl"
    shell:
        """
        perl {params.convert2annovar} -format vcf4 {input.snp} > {output.snp}
        perl {params.convert2annovar} -format vcf4 {input.indel} > {output.indel}
        perl {params.convert2annovar} -format vcf4 {input.merge} > {output.merge}
        """
# SNP注释
rule snp_annotate:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.avinput"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + "_snpanno." + genome + "_multianno.csv"
    threads: 2
    params:
        table_annovar = "/public/software/annovar/table_annovar.pl",
        humandb = "/public/software/annovar/humandb/",
        prefix = sample + "_snpanno",
        genome = config["genome"],
        vcf_dir = PROJECT_RUN_DIR + "/vcf"
    shell:
        """
        cd {params.vcf_dir}
        perl {params.table_annovar} {input} {params.humandb} -buildver {params.genome} -out {params.prefix} \
        -remove -protocol refGene,cytoBand,clinvar_20221231,avsnp147,exac03,gnomad211_exome,SAS.sites.2015_08,esp6500siv2_all,intervar_20180118,dbnsfp42c \
        -operation g,r,f,f,f,f,f,f,f,f -nastring . -csvout
        """
        
# INDEL注释
rule indel_annotate:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.avinput"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + "_indelanno." + genome + "_multianno.csv"
    threads: 2
    params:
        table_annovar = "/public/software/annovar/table_annovar.pl",
        humandb = "/public/software/annovar/humandb/",
        prefix = sample + "_indelanno",
        genome = config["genome"],
        vcf_dir = PROJECT_RUN_DIR + "/vcf",
        perl = "/home/miniconda3/pkgs/perl-5.26.2-h36c2ea0_1008/bin/perl"
    shell:
        """
        cd {params.vcf_dir}
        perl {params.table_annovar} {input} {params.humandb} -buildver {params.genome} -out {params.prefix} \
        -remove -protocol refGene,cytoBand,clinvar_20221231,gnomad211_exome,exac03,esp6500siv2_all,ALL.sites.2015_08 \
        -operation g,r,f,f,f,f,f -nastring . -csvout
        """

#
rule merge_annotate:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".merge.avinput"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + "_allanno." + genome + "_multianno.csv"
    threads: 2
    params:
        table_annovar = "/public/software/annovar/table_annovar.pl",
        humandb = "/public/software/annovar/humandb/",
        prefix = sample + "_allanno",
        genome = config["genome"],
        vcf_dir = PROJECT_RUN_DIR + "/vcf"
    shell:
        """
        cd {params.vcf_dir}
        perl {params.table_annovar} {input} {params.humandb} -buildver {params.genome} -out {params.prefix} \
        -remove -protocol refGene,cytoBand,clinvar_20221231,avsnp147,exac03,gnomad211_exome,SAS.sites.2015_08,esp6500siv2_all,intervar_20180118,dbnsfp42c \
        -operation g,r,f,f,f,f,f,f,f,f -nastring . -csvout
        """

生信分析进阶文章推荐

生信分析进阶1 - HLA分析的HLA区域reads提取及bam转换fastq

生信分析进阶2 - 利用GC含量的Loess回归矫正reads数量

生信分析进阶3 - pysam操作bam文件统计unique reads和mapped reads高级技巧合辑

生信分析进阶4 - 比对结果的FLAG和CIGAR信息含义与BAM文件指定区域提取

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mfbz.cn/a/715022.html

如若内容造成侵权/违法违规/事实不符,请联系我们进行投诉反馈qq邮箱809451989@qq.com,一经查实,立即删除!

相关文章

面试题 17.17. 多次搜索

链接&#xff1a;. - 力扣&#xff08;LeetCode&#xff09; 题解&#xff1a; class Solution { private:struct Trie {Trie() {end false;index -1;next.resize(26);}bool end;int index;std::vector<std::unique_ptr<Trie>> next;};void insert_trie(int in…

C++编程:vector容器的简单模拟实现

前言&#xff1a; 在C标准库&#xff08;STL&#xff09;中&#xff0c;vector容器是最常见使用的动态数组。它结合了链表与数组的优点&#xff0c;提供了灵活的大小调整与高效的随机访问。本文将简单的对vector容器进行介绍并且对vector容器简单的模拟实现。 一、vector的文…

web前端:作业三

1.回到顶部案例(固定定位) <!DOCTYPE html> <html><head><meta charset"utf-8"><title></title><style>#container{height: 5000px;border: 1px solid blue;}#back-button{width: 100px;height: 100px;border: 1px solid…

Redis分布式锁的实现、优化与Redlock算法探讨

Redis分布式锁最简单的实现 要实现分布式锁,首先需要Redis具备“互斥”能力,这可以通过SETNX命令实现。SETNX表示SET if Not Exists,即如果key不存在,才会设置它的值,否则什么也不做。利用这一点,不同客户端就能实现互斥,从而实现一个分布式锁。 举例: 客户端1申请加…

比亚迪智驾技术震撼登场!L3级自动驾驶领跑全国,无图导航、夜间挑战轻松应对!

作为新能源汽车领域的翘楚&#xff0c;比亚迪在电池技术与智能驾驶方面都有着卓越的表现。近日&#xff0c;比亚迪凭借其领先的智驾技术&#xff0c;成功入选全国首批L3级自动驾驶上路及行驶试点名单&#xff0c;这无疑将推动智驾技术的普及速度。 你知道吗&#xff1f;比亚迪智…

Elasticsearch 认证模拟题 - 22

一、题目 索引 task 索引中文档的 fielda 字段内容包括了 hello & world&#xff0c;索引后&#xff0c;要求使用 match_phrase query 查询 hello & world 或者 hello and world 都能匹配该文档 1.1 考点 分词器 1.2 答案 # 创建符合条件的 task 索引&#xff0c;…

SM2加密算法的公私钥和密文格式以及不同编程语言之间无法互相解密问题分析

1 文章介绍 本文章主要介绍了SM2加密算法的公钥、私钥和密文格式,以及对于不同编程语言之间无法互相解密问题进行了分析和处理。2 SM2加密算法格式 SM2在线加解密测试2.1 公钥格式 SM2公钥是SM2曲线上的一个点,由横、纵坐标两个分量来表示,简记为Q,每个分量长度为256位,即…

Apipost模拟HTTP客户端

模拟HTTP客户端的软件有很多&#xff0c;其中比较著名的就有API-FOX、POSTMAN。 相信很多小伙伴都使用POSTMAN。这篇博客主要介绍Apipost的原因是&#xff0c;Apipost无需下载&#xff0c;具有网页版。 APIFOX的站内下载&#xff1a; Api-Fox&#xff0c;类似于PostMan的软件…

[力扣二叉树]本地调试环境指导手册

以236. 二叉树的最近公共祖先为例子 本地编译软件为Viusal Studio 2022 写代码 项目里文件位置 CreateTree.h #pragma once #ifndef CLIONPROJECT_LEETCODECREATETREE_H #define CLIONPROJECT_LEETCODECREATETREE_H #include<vector> #include<queue> using na…

Java语法和基本结构介绍

Java语法和基本结构是Java编程的基础&#xff0c;它决定了Java代码的书写方式和程序的结构。以下是Java语法和基本结构的一些关键点&#xff1a; 1.标识符和关键字&#xff1a;Java中的标识符是用来标识变量、函数、类或其他用户自定义元素的名称。关键字是预留的标识符&#x…

【Netty】ByteBuffer原理与使用

Buffer则用来缓冲读写数据&#xff0c;常见的buffer有&#xff1a; ByteBuffer MappedByBuffer DirectByteBuffer HeapByteBuffer hortBuffer IntBuffer LongBuffer FloatBuffer DoubleBuffer CharBuffer 有一个普通文本文件data.txt,内容为&#xff1a; 1234567890a…

Spring Boot实战:图书信息网站

实战概述&#xff1a;Spring Boot图书信息网站开发 项目背景 随着数字化时代的到来&#xff0c;图书信息网站为用户提供了一个便捷的在线浏览和购买图书的平台。本实战项目旨在通过Spring Boot框架开发一个图书信息网站&#xff0c;实现图书展示、用户登录和管理等功能。 项…

MySQL数据库(三)

一.MySQL数据库学习(三) (一).数据表的约束 为防止错误的数据被插入到数据表&#xff0c;MySQL中定义了一些维护数据库完整性的规则&#xff1b;这些规则常称为表的约束。 约束条件说明PRIMARY KEY主键约束用于唯一标识对应的记录FOREIGN KEY外键约束NOT NULL非空约束UNIQUE…

牛客 第二十届西南科技大学ACM程序设计竞赛(同步赛):祖玛

题目描述 wzy 在玩一种很新的祖玛。 给定一个仅包含 小写字母 的字符串 sss , sss 由 mmm 个不同的小写字母组成&#xff0c;每个字母代表一种小球&#xff0c;在消去时会获得 相应 的分数&#xff1a; 两个及以上 相同的小球相碰就会消失&#xff08;在发射小球前因为无相碰&…

Excel中多条件判断公式怎么写?

在Excel里&#xff0c;这种情况下的公式怎么写呢&#xff1f; 本题有两个判断条件&#xff0c;按照题设&#xff0c;用IF函数就可以了&#xff0c;这样查看公式时逻辑比较直观&#xff1a; IF(A2>80%, 4, IF(A2>30%, 8*(A2-30%),0)) 用IF函数写公式&#xff0c;特别是当…

Blossom:支持私有部署的云端双链笔记软件分享

Blossom 是一款支持私有部署的云端双链笔记软件&#xff0c;能够帮助用户将笔记、图片和个人计划安排保存在自己的服务器中&#xff0c;并在任意设备之间实时同步。同时&#xff0c;它还可以作为一个动态博客使用。本文将详细介绍 Blossom 的特点和使用方法。 一、Blossom 的特…

使用百度的长文本转语音API时无法下载.MP3文件

今天是学生们交作业的时候&#xff0c;结果是我最忙碌的一天&#xff0c;各种改bug。 有个学生来问&#xff1a; 我在百度提供的API代码(长文本转语音)的基础上添加了下载生成的.MP3文件的代码&#xff0c;运行之后成功建成了.MP3文件&#xff0c;但是文件的内容确实以下的报错…

Git管理(Linux版本)

在Linux中我们如何把自己的代码上传到gitee中呢&#xff0c;本期将为大家讲解详细的步骤。 目录 查看Linux环境是否存在git工具 在gitee上创建代码仓库 复制仓库的HTTP路径到Linux中 代码上传 在仓库下创建文件或者将文件移动到仓库下 使用三板斧进行文件的上传 add …

部署大模型LLM

在autodl上部署大模型 windows运行太麻烦&#xff0c;环境是最大问题。 选择云上服务器【西北B区 / 514机】 cpp (c c plus plus) 纯 C/C 实现&#xff0c;无需外部依赖。针对使用 ARM NEON、Accelerate 和 Metal 框架的 Apple 芯片进行了优化。支持适用于 x86 架构的 AVX、…

FineReport简单介绍(2)

一、报表类型 模板设计是 FineReport 学习过程中的主要难题所在&#xff0c;FineReport 模板设计主要包括普通报表、聚合报表、决策报表三种设计类型。 报表类型简介- FineReport帮助文档 - 全面的报表使用教程和学习资料 二、聚合报表 2-1 介绍 聚合报表指一个报表中包含多个…