PCR duplicates in sequencing, and removing PCR duplicate reads with samtools rmdup
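Library preparation includes a step in which the adapter-ligated DNA fragments are amplified by PCR.
Ideally, each fragment of the sheared genomic DNA is sequenced once and only once.
But this step runs 6 amplification cycles, so each DNA fragment ends up with 2^6 = 64 copies. When all of the amplified product is "spread" onto the flowcell, two copies derived from the same DNA fragment may anchor to two different beads, and the two reads sequenced from them are PCR duplicates.
In general, if the PCR duplication rate is too high, the same total number of reads provides far less information about the genome.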
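The arithmetic above can be sketched in a few lines of shell. This is only an illustration: the 64 copies assume perfect doubling in every cycle, and `total_reads` and `dup_rate_percent` are made-up numbers, not measurements from any dataset.

```shell
# Copies of each fragment after PCR, assuming perfect doubling per cycle.
cycles=6
copies=$((1 << cycles))   # 2^6
echo "copies per fragment after $cycles cycles: $copies"

# Hypothetical run: how many reads remain informative at a given duplication rate.
total_reads=1000000       # hypothetical total read count
dup_rate_percent=20       # hypothetical PCR duplication rate, in percent
unique_reads=$((total_reads * (100 - dup_rate_percent) / 100))
echo "reads left after removing duplicates: $unique_reads"
```

With these assumed numbers, one fifth of the sequencing effort adds no new information about the genome.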
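How samtools rmdup removes PCR duplicate reads
Sequencing of randomly sheared libraries requires removing PCR duplicate reads; target-specific capture does not.
The official documentation for samtools rmdup is at: http://www.htslib.org/doc/samtools.html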
Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, only retain the pair with highest mapping quality. In the paired-end mode, this command ONLY works with FR orientation and requires ISIZE is correctly set. It does not work for unpaired reads (e.g. two ends mapped to different chromosomes or orphan reads).
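The rule in that description can be mimicked on a toy table. The sketch below is not how samtools is implemented; it only applies the stated rule (same external coordinates, keep the pair with the highest mapping quality) to three hypothetical read pairs given as chromosome, start, end, mapping quality and name.

```shell
# Toy input: chrom, external start, external end, MAPQ, pair name (all hypothetical).
pairs=$(printf 'chr10\t100\t350\t30\tpairB\nchr10\t100\t350\t60\tpairA\nchr10\t200\t480\t42\tpairC\n')

# Sort by coordinates, then by MAPQ descending, and keep the first pair per coordinate set.
kept=$(printf '%s\n' "$pairs" \
    | sort -t "$(printf '\t')" -k1,1 -k2,2n -k3,3n -k4,4nr \
    | awk -F'\t' '!seen[$1 FS $2 FS $3]++ { print $5 }')

echo "$kept"
```

Here pairA and pairB share the same external coordinates, so only pairA (the higher mapping quality) survives alongside pairC.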
Test it on a small paired-end dataset:
samtools rmdup tmp.sorted.bam tmp.rmdup.bam
[bam_rmdup_core] processing reference chr10...
[bam_rmdup_core] 2 / 12 = 0.1667 in library
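The last log line reports a ratio. Reading it as "2 removed out of 12 examined" is an assumption about what the two counts mean, but the arithmetic itself is easy to check:

```shell
# Counts copied from the log line above; their interpretation is an assumption.
removed=2
checked=12
rate=$(awk -v r="$removed" -v c="$checked" 'BEGIN { printf "%.4f", r / c }')
echo "duplicate fraction: $rate"
```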
samtools rmdup performs poorly on paired-end data, and many people recommend the MarkDuplicates function of the Picard tools instead.
Removing PCR duplicates with samtools
ref: the samtools manual
samtools markdup [-l length] [-r] [-s] [-T] [-S] in.algsort.bam out.bam
-l INT      Expected maximum read length of INT bases. [300]
-r          Remove duplicate reads.
-s          Print some basic stats.
-T PREFIX   Write temporary files to PREFIX.samtools.nnnn.mmmm.tmp
-S          Mark supplementary reads of duplicates as duplicates.
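Without -r, markdup keeps duplicate reads in the output and only marks them. In the SAM format the duplicate mark is bit 0x400 (decimal 1024) of the FLAG field, so a FLAG value can be tested with plain shell arithmetic. The helper name `is_duplicate` and the two FLAG values below are illustrative only.

```shell
# Succeeds if the SAM FLAG value passed as $1 has the duplicate bit (0x400 = 1024) set.
is_duplicate() {
    [ $(( $1 & 1024 )) -ne 0 ]
}

# 99 = paired, proper pair, mate reverse, first in pair; 1123 = 99 + 1024.
for flag in 99 1123; do
    if is_duplicate "$flag"; then
        echo "FLAG $flag: marked as duplicate"
    else
        echo "FLAG $flag: not marked as duplicate"
    fi
done
```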
Four steps are needed:
samtools sort -n -o xxx.sort.bam xxx.bam
samtools fixmate -m xxx.sort.bam xxx.fixmate.bam  # note: fixmate in samtools 1.2 has no -m option
samtools sort -o xxx.positionsort.bam xxx.fixmate.bam
samtools markdup -r xxx.positionsort.bam xxx.markdup.bam  # note: in samtools 1.2 the deduplication command is rmdup, and version 1.2 reported an error; rmdup from 1.3 was used in practice
All in one:
samtools sort -n xxx.bam | samtools fixmate -m - - | samtools sort - | samtools markdup -r - xxx.markdup.bam
At the SAM/BAM level:
picard
ref site: Picard Tools - By Broad Institute
Usage:
java -jar picard.jar MarkDuplicates \
    I=xxx.sorted.bam \
    O=xxx.sorted.markdup.bam \
    M=xxx.markdup.txt
To remove the duplicates directly:
java -jar picard.jar MarkDuplicates \
    REMOVE_DUPLICATES=true \
    I=xxx.sorted.bam \
    O=xxx.sorted.markdup.bam \
    M=xxx.markdup.txt
Sources:
https://www.jianshu.com/p/73483070379b
http://www.bio-info-trainee.com/2003.html
https://www.jianshu.com/p/879c5e9ed56e