15. July 2020

On sequencing data quality control

Collected learning resources

1.3 Interpreting QC results

From the official website of the QC tool FastQC: FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. The main functions of FastQC are:

Import of data from BAM, SAM or FastQ files (any variant)
Providing a quick overview to tell you in which areas there may be problems
Summary graphs and tables to quickly assess your data
Export of results to an HTML based permanent report
Offline operation to allow automated generation of reports without running the interactive application

On the FastQC website, the item that deserves even more attention is Training in the navigation bar. That section holds a lot of bioinformatics material, such as the principles of data analysis and how to use programming tools. The site describes it as follows. Training Courses: As part of its work with the Babraham Institute, the Bioinformatics group runs a regular series of training courses on many aspects of bioinformatics. These courses are run regularly on the Babraham site but we are also able to come out and present them on other sites. You can see the list of current Babraham dates which are available, and you can contact us to discuss options for running courses on your site. You can also sign up to our mailing list to get the latest training news delivered direct to your inbox every couple of months. Where possible we also aim to make the material from our courses publicly available so that anyone who wants to can download them for their own use. Below is a list of the courses we currently run. Where they are available there is a link to the training manual and course exercises.

Removing duplicate reads: https://www.360kuai.com/pc/99268fdce3e17207b?cota=4&tj_url=so_rec&sign=360_57c3bbd1&refer_scene=so_1
Mapping Quality: http://blog.sina.com.cn/s/blog_670445240101iyjs.html
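Causes of a high duplication rate: Duplication refers to fragments whose start and end positions are exactly identical. The main cause is the PCR step in sequencing: when products amplified from the same DNA fragment are sequenced more than once, duplicates result. A secondary cause is that two inserts happen to have exactly the same start and end positions. Possible reasons for this include: a. the species has a small genome, so fragment diversity is inherently low, and when a large amount of data is sequenced many reads repeat; b. the starting input for library construction is low, so fragment diversity is low, and under the same PCR conditions the total library yield is low and the dup rate of the resulting data is high; c. fragmentation or adapter ligation is biased, so library diversity is poor. The dup rate is mainly calculated in two ways. One is computed during data QC from the read sequences: reads count as duplicates only if their sequences are identical, and the rate is the number of duplicate reads divided by the total number of reads. The other is computed during alignment analysis from the positions where reads map to the genome: reads mapped to the same position count as duplicates, usually with a 2 bp tolerance. (http://www.360doc.com/content/19/0605/15/56135722_840581765.shtml)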
How to use IGV (https://www.jianshu.com/p/e5338858dd82)
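
To make the two dup-rate calculations from the duplication note above concrete, here is a minimal Python sketch. It is only an illustration, not how FastQC or any particular deduplication tool works internally. The function names, the `(chrom, start, end)` tuple layout, and the choice to count every copy after the first as a duplicate are my own assumptions. The 2 bp tolerance is taken from the note, but the greedy way it is applied here is also an assumption.

```python
from collections import Counter


def sequence_dup_rate(reads):
    """Sequence-based dup rate: duplicate reads / total reads.

    Assumption: the first copy of each sequence is kept as the original,
    and every additional identical copy counts as one duplicate.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicates = total - len(counts)
    return duplicates / total


def position_dup_rate(alignments, tolerance=2):
    """Position-based dup rate with a small tolerance in bp.

    `alignments` is an iterable of hypothetical (chrom, start, end) tuples.
    A read counts as a duplicate if an earlier, non-duplicate read on the
    same chromosome has start and end both within `tolerance` bp.
    """
    representatives = {}  # chrom -> list of (start, end) already kept
    total = 0
    duplicates = 0
    for chrom, start, end in alignments:
        total += 1
        seen = representatives.setdefault(chrom, [])
        is_dup = any(
            abs(start - s) <= tolerance and abs(end - e) <= tolerance
            for s, e in seen
        )
        if is_dup:
            duplicates += 1
        else:
            seen.append((start, end))
    return duplicates / total if total else 0.0


if __name__ == "__main__":
    reads = ["ACGT", "ACGT", "TTTT", "ACGT"]
    print(sequence_dup_rate(reads))

    alignments = [("chr1", 100, 250), ("chr1", 101, 251), ("chr1", 500, 650)]
    print(position_dup_rate(alignments))
```

The linear scan over earlier reads is fine for a toy example but far too slow for real data. Real tools work on sorted alignments and also take strand and read pairing into account, which this sketch ignores.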
