ATAC-seq analysis

Pre-processing

First things first. We noticed that Trimmomatic is not suitable for the new batch of ATAC-seq sequencing file. After trimming, short fragments(i.e. shorter than 142 bp) will be lost. It turns out that current mainstream trimming softwares are not compatible with Nova-seq and Next-seq. And we have run our sequencing on a Nova-seq. After arduous searching and debugging, we found that Fastp was designed for these two new Illumina sequencing platforms and can handle what we've encountered.

Common usage of fastp to process ATAC-seq fastq file:

fastp -i R1.fq.gz -I R2.fq.gz -o ./R1_clean.fq.gz -O ./R2_clean.fq.gz -L -g -l 15 -p  -w 16

Quality control

As for ATAC-seq quality control, I'm currently implementing ENCODE ATAC-seq standard. As it happens, threre is an R package encodeChIPqc for calculating all sorts of parameters mentioned in ENCODE for estimating seq quality. It should be noted that calculating PCR bottleneck coefficient should be done before removing duplicates which will remarkably increase the original PBC value.

Removing and calculating duplicates is done by picard:

picard MarkDuplicates I=in.bam O=out_dedup.bam M=out_dups.txt REMOVE_DUPLICATES=true

Then removing removing non-primary chromosome (GRCh38 only):

samtools idxstats in.dedup.bam | cut -f1 | grep -v KI | grep -v GL | grep -v MT | xargs samtools view --threads 10 -b sample.bam > out.clean.bam

If the aligner did not put chr suffix infront of chromosome number, it is also neccessary to add them manually:

samtools view -H test.bam | sed -e 's/SN:1/SN:chr1/' | sed -e 's/SN:2/SN:chr2/' | sed -e 's/SN:3/SN:chr3/' | sed -e 's/SN:4/SN:chr4/' | sed -e 's/SN:5/SN:chr5/' | sed -e 's/SN:6/SN:chr6/' | sed -e 's/SN:7/SN:chr7/' | sed -e 's/SN:8/SN:chr8/' | sed -e 's/SN:9/SN:chr9/' | sed -e 's/SN:10/SN:chr10/' | sed -e 's/SN:11/SN:chr11/' | sed -e 's/SN:12/SN:chr12/' | sed -e 's/SN:13/SN:chr13/' | sed -e 's/SN:14/SN:chr14/' | sed -e 's/SN:15/SN:chr15/' | sed -e 's/SN:16/SN:chr16/' | sed -e 's/SN:17/SN:chr17/' | sed -e 's/SN:18/SN:chr18/' | sed -e 's/SN:19/SN:chr19/' | sed -e 's/SN:20/SN:chr20/' | sed -e 's/SN:21/SN:chr21/' | sed -e 's/SN:22/SN:chr22/' | sed -e 's/SN:X/SN:chrX/' | sed -e 's/SN:Y/SN:chrY/' | sed -e 's/SN:MT/SN:chrM/' | samtools reheader - test.bam > test_chr.bam

shift coordinates, this is for ATAC-seq only:

alignmentSieve -b sample.bam -o sample.filtered.bam --minMappingQuality 10 --filterMetrics log.txt --ATACshift

Analysis

I mainly refer to Harvard ATAC-seq Guidelines.

Generating signal track

For generating signal track using Deeptools, there are a few things that should be taken into consideration:

  1. since we have previously removed duplicates, we should now use --ignoreDuplicates
  2. low complexity regions should be excluded by --blackListFileName
  3. using --maxFragmentLength 100 to plot NFR regions
    1. NFR: < 100
    2. Mono-: 180 ~ 240
    3. Di-: 315 ~ 437

Peak calling

MACS2 is an old package but it work just fine. The Harvard guideline recommends a newly developed package Genrich, but I couldn't make it call peak properly.

The MACS2 parameters are:

macs2 callpeak --treatment treat.dedup.bam --name treat -g hs --bdg -q 0.05 --keep-dup all -f BAMPE --nomodel --shift -100 --extsize 200

Software and pipeline list:

Leave a Reply

Your email address will not be published. Required fields are marked *