Bioinformatics Lab Course – Draft Structure

University of Bologna

Genomics Course

Bioinformatics Lab

Teacher: Federico M. Giorgi

Duration: 60 hours (15 modules of ~4 hours + optional extras)

Exam: oral

The course aims at giving a practical overview of all the useful tools, approaches and techniques necessary for a competitive bioinformatician in 2019.

Module 1: Introduction to and testing of the working environment

  • Virtual Box
  • Linux Refreshment
  • Playing with a FASTA file: wc, grep, htop, regex, sed
  • Projects and Exercise structure

Module 2: Phylogenetic Sequence Analysis

  • Sequence databases: how to download sequences from NCBI
  • Building a phylogenetic multifasta (MYC family)
  • Multiple Sequence Alignment (Muscle, ClustalW, TCoffee)
  • Building a Phylogenetyc Tree (PHYLIP)
  • Phylogenetic GUI: MEGA

Module 3: Remote Homology Detection

  • BLAST introduction
  • Create, format and index a sequence database (BLAST formatdb)
  • BLASTN/BLASTP/TBLASTX with various options
  • Discover the organism of a mysterious sequence
  • PSI-BLAST

Module 4: Introduction to Next Generation Sequencing

  • FASTA vs FASTQ, PHRED score
  • FASTQ library, single ended and paired
  • FASTQC

Module 5: NGS Alignment

  • Aligners: Bowtie, BWA, HiSAT
  • BAM files
  • Samtools: process and visualize BAM files
  • Integrated Genome Viewer: visualize alignments

Module 6: Calling Mutations

  • Exercise: generate BAMs
  • Using Varscan
  • Visualizing mutations and indels with IGV
  • Larger mutations: CNVs and translocation
  • GATK
  • Kiss&Splice: calling mutations from RNA reads

Module 7: RNA-Seq

  • Spliced aligners (TopHat, STAR, HISAT)
  • Finding new transcripts (Cufflinks)
  • Converting bams to counts (GFF, HTSEQ-Counts)
  • Finding contaminants in human rnaseq vs. other genomes (unaligned vs H. Pylori)

Module 8: ChIP-Seq

  • Exercise: align reads again
  • Input reads
  • Call Peaks (MACS)
  • Find enriched motifs (HOMER)

Module 9 (Short): Assembly

  • Assembling a small bacterial genome with DNA reads
  • Assembling RNA-Seq transcripts with Trinity

*** end of R-free course ***

Module 10 (Long): (re)introduction to R

  • Basic commands up to sapply
  • RStudio
  • Scatterplots, Boxplots, Violin Plots, Heatmaps
  • RCircos
  • Bioconductor
  • Gene ID conversion
  • Genomic Ranges

Module 11: Differential Expression Analysis

  • Loading counts
  • Normalization: RPM vs RPKM vs TPM vs Size Factors vs voom
  • edgeR vs DESeq2
  • Comparing two datasets
  • Complex Designs
  • Confounding Variables (cancer vs. normal with age difference)

Module 12: Microarrays

  • Concept
  • Three steps: BG correction, normalization, summarization
  • RMA vs. MAS5
  • Differential Expression with LIMMA
  • Comparing microarrays with RNA-Seq

Module 13: Single Cell RNA-Seq

  • Dropout effects and biases
  • Clustering
  • Seurat pipeline
  • Cell Cycle bias removal
  • Differential Expression and comparison with bulk RNA

Module 14: Differential Binding Analysis

  • Estrogen treatment with DiffBind package
  • How to assign peaks to promoters to genes (Granges)
  • VULCAN package?

Module 15: Pathway Enrichment Analysis

  • Databases: Gene Ontology, MSIGDB, Reactome, Biocarta, KEGG, Mapman
  • Discrete enrichments: TopGO package
  • Continuous enrichments: GSEA
  • External resources: DAVID, Gorilla

*** Extra Modules ***

Module 16: Coexpression Analysis in R

  • Correlation: Pearson, Spearman, Kendall
  • Mutual Information
  • Partial Correlation (A,B,C)
  • Overlap with ENCODE and MSIGDB data
  • ARACNe

Module 17: Alternative Transcript Counters

  • Salmon
  • Kallisto

Module 18: Detect gene Fusions

  • RNA: Tophat fusions
  • DNA: big translocation finders?

Module 19: Simple Machine Learning

  • Predicting Mutations with Gene Expression
  • Glmnet, lasso, gradient boost modeling, caret package

Module 20: Survival Analsyis

  • Kaplan Meier Curves
  • Tests
  • Multiple groups
  • Comparing datasets

Module 21: Building an R plot with lattice

  • Canvas
  • Axes
  • Objects

Module 22: Clustering analysis in R

  • Hierarchical clustering (hclust and pvclust)
  • Treecut and dynamic treecut
  • Kmeans
  • Principal Component Analysis and TSNE
Advertisements

Combining P-values

TFisher

We have come a long way from the original simple p-value integration methods of Fisher and Stouffer. Hong Zhang, a talented grad student from the Worcester Polytechnic
Institute, and his colleagues have developed a novel method, called TFisher, for dealing with p-value integration in a wide range of test scenarios.

I quote from their abstract, available here: https://arxiv.org/abs/1801.04309

For testing a group of hypotheses, tremendous p-value combination methods have been developed and widely applied since 1930’s. Some methods (e.g., the minimal p-value) are optimal for sparse signals, and some others (e.g., Fisher’s combination) are optimal for dense signals. To address a wide spectrum of signal patterns, this paper proposes a unifying family of statistics, called TFisher, with general p-value truncation and weighting schemes. Analytical calculations for the p-value and the statistical power of TFisher under general hypotheses are given. Optimal truncation and weighting parameters are studied based on Bahadur Efficiency (BE) and the proposed Asymptotic Power Efficiency (APE), which is superior to BE for studying the signal detection problem. A soft-thresholding scheme is shown to be optimal for signal detection in a large space of signal patterns. When prior information of signal pattern is unavailable, an omnibus test, oTFisher, can adapt to the given data. Simulations evidenced the accuracy of calculations and validated the theoretical properties. The TFisher tests were applied to analyzing a whole exome sequencing data of amyotrophic lateral sclerosis. Relevant tests and calculations have been implemented into an R package TFisher and published on the CRAN.

The methods are implemented in R and available on CRAN:

https://cran.r-project.org/web/packages/TFisher/index.html

RNASeq aligners

books aligned.jpgI would say the match has now four competitors:

Tophat2:

  • Pros: the classic, the first universally used, still widely adopted in pipelines all over the World, basically people keep using it so their new results are comparable to the old ones
  • Cons: slow (several CPU hours per alignment on a human genome with 10M reads), limited to 4Gbases genomes (so, no complex metatranscriptomics for him) and on their very website they say to use HISAT2

STAR:

  • Pros: super, wicked fast, the standard used by ENCODE and the big RNASeq projects
  • Cons: uses a LOT of RAM, like really a lot (64GB for a human index)

HISAT2:

  • Pros: fast and low RAM requirements. If you start from scratch, this is the aligner to pick
  • Cons: it’s still new and so many people don’t trust it yet

Salmon/Kallisto:

These are actually not strictly aligners, but rather transcript counters. I put them together for simplicity, but they are different softwares

  • Pros: high speed and low RAM requirements. Ideal for quick RNA-Seq gene expression measurements
  • Cons: they cannot do de novo transcript detection, sad. They don’t produce counts, which are the expected input for many downstream analysis tools. However, some tools are starting to accept Salmon/Kallisto outputs (in R you can use the transcript abundance import package tximport)

So… USE HISAT! 😀

Quantifying RNA-Seq Transcripts

About ten years ago, when RNA-Seq was young, we struggled to make sense of the huge quantity of data that came out of Next-Generation Sequencers. The RNA-Seq pipelines were founded on the simple scheme:

Reads -> Alignments -> Quantification

The most popular RNA-Seq alignment tool, Tophat (now Tophat2) was actually built on the Bowtie aligner to focus on transcribed genomic regions (the Transcriptome), with the optional feature of aligning reads in the whole Genome, for de-novo transcript discovery.

Continue reading “Quantifying RNA-Seq Transcripts”