corto – the Correlation Tool

We developed corto (Correlation Tool), a simple package to infer gene regulatory networks from gene expression data using DPI (Data Processing Inequality) and bootstrapping to recover edges.
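To give a feel for the DPI step, here is a minimal base R sketch. It is not corto's actual code: the function name dpi_prune and the toy correlation values are made up for illustration, and the bootstrapping step is left out entirely. The idea is that in every fully connected gene triplet, the weakest edge is treated as indirect and dropped.

```r
# Toy DPI pruning on an absolute-correlation matrix (illustrative only).
# For every triplet of genes where all three edges exist, the edge with
# the lowest absolute correlation is marked for removal.
dpi_prune <- function(cormat) {
  w <- abs(cormat)
  diag(w) <- 0                      # ignore self-edges
  keep <- w > 0                     # logical adjacency matrix to be pruned
  n <- nrow(w)
  if (n < 3) return(keep)
  trip <- combn(n, 3)               # all gene triplets, one per column
  for (t in seq_len(ncol(trip))) {
    i <- trip[1, t]; j <- trip[2, t]; k <- trip[3, t]
    edges <- rbind(c(i, j), c(i, k), c(j, k))
    vals <- w[edges]                # weights of the three edges
    if (all(vals > 0)) {
      weakest <- edges[which.min(vals), ]
      keep[weakest[1], weakest[2]] <- FALSE
      keep[weakest[2], weakest[1]] <- FALSE
    }
  }
  keep
}

# Hypothetical example: TF-A and A-B are strong, TF-B is the weak edge
genes <- c("TF", "A", "B")
cm <- matrix(c(1.0, 0.9, 0.5,
               0.9, 1.0, 0.8,
               0.5, 0.8, 1.0),
             nrow = 3, dimnames = list(genes, genes))
net <- dpi_prune(cm)
```

In this toy case the TF-B edge is removed and the other two are kept. Triplets are always judged on the original weights, so the result does not depend on the order in which they are visited.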

Supplementary Material containing all gene networks generated during corto benchmarking:
https://www.dropbox.com/sh/qzl8vjeoa7mqxfp/AACEfLQpAzUz7rqqEHEjMrhQa?dl=0

CRAN stable package: https://cran.r-project.org/package=corto

GitHub development version: https://github.com/federicogiorgi/corto

Progress bars and parallelization in R

Since SNOW is being discontinued, today I worked a bit on finding new solutions to have a progress bar in R for jobs running in parallel. In this example, I run a simple function that calculates logarithms 10,000 times, using 2 threads and monitoring the progress of the 10,000 calculations.

Set up the parameters

The following are the three parameters needed for any parallel job: number of threads, number of replicates (jobs) and a function:

nthreads<-2
nreps<-10000
funrep<-function(i){
c(log2(i),log10(i))
}

SNOW solution

This was my old solution in SNOW, but CRAN is flagging all packages using SNOW with a warning “superseded packages” so we have to change it:

library(doSNOW)
cl<-makeCluster(nthreads)
registerDoSNOW(cl)
pb<-txtProgressBar(0,nreps,style=3)
progress<-function(n){
setTxtProgressBar(pb,n)
}
opts<-list(progress=progress)
i<-0
output<-foreach(i=icount(nreps),.combine=c,.options.snow=opts) %dopar% {
s<-funrep(i)
return(s)
}
close(pb)
stopCluster(cl)

Parallel solution (not working)

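Unfortunately, doParallel doesn’t accept a progress option like .options.snow in foreach, and running it like this won’t work: the progress bar is updated inside the worker processes, so nothing is drawn on the master console, and the loop body returns the value of setTxtProgressBar() instead of funrep(i):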

library(doParallel)
cl<-makeCluster(nthreads)
registerDoParallel(cl)
pb<-txtProgressBar(0,nreps,style=3)
output<-foreach(i=icount(nreps),.combine=c) %dopar% {
funrep(i)
setTxtProgressBar(pb,i)
}
stopCluster(cl)

Another parallel solution

After many tears, I finally found a solution that could work. Essentially, instead of c() I am running a progcombine() that contains c() and also updates a progress bar. Luckily, it works on both Windows and Linux:

library(doParallel)
progcombine<-function(){
pb <- txtProgressBar(min=1, max=nreps-1,style=3)
count <- 0
function(...) {
count <<- count + length(list(...)) - 1
setTxtProgressBar(pb,count)
flush.console()
c(...)
}
}
cl <- makeCluster(nthreads)
registerDoParallel(cl)
output<-foreach(i = icount(nreps),.combine=progcombine()) %dopar% {
funrep(i)
}
stopCluster(cl)
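Why the "- 1" in the counter and max=nreps-1 in the progress bar? A minimal sketch, assuming foreach folds the results pairwise (accumulated result first, new result second), which Reduce() mimics here without any cluster. The names counting_combine and seen are hypothetical, and the progress bar is replaced by a plain counter so the behaviour is easy to inspect:

```r
# Same closure trick as progcombine(), minus the progress bar
counting_combine <- function() {
  count <- 0
  combine <- function(...) {
    # one of the arguments is the accumulated result, the rest are new
    count <<- count + length(list(...)) - 1
    c(...)
  }
  list(combine = combine, seen = function() count)
}

cc <- counting_combine()
# Reduce calls combine(r1, r2), then combine(acc, r3), and so on
acc <- Reduce(cc$combine, as.list(1:5))
ncalls <- cc$seen()
```

With 5 results the combine function is called 4 times, so the counter tops out at one less than the number of jobs.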

The working solution: pblapply

library(pbapply)
cl<-parallel::makeCluster(nthreads)
invisible(parallel::clusterExport(cl=cl,varlist=c("nreps")))
invisible(parallel::clusterEvalQ(cl=cl,library(utils)))
result<-pblapply(cl=cl,
X=1:nreps,
FUN=funrep)
parallel::stopCluster(cl)
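One difference from the foreach versions: pblapply returns a list with one element per job, not a flat vector. A small sketch of how the results can be reshaped, using plain lapply as a stand-in so it runs without a cluster or pbapply installed (8 jobs instead of 10,000, and the column names are my own choice):

```r
funrep <- function(i) c(log2(i), log10(i))

# Stand-in for pblapply(cl=cl, X=1:8, FUN=funrep): same return shape
result <- lapply(1:8, funrep)

# One row per job, one column per logarithm
mat <- do.call(rbind, result)
colnames(mat) <- c("log2", "log10")

# Flat vector, equivalent to what foreach(.combine=c) returns
flat <- unlist(result)
```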

7-way nested Venn Diagrams

3:21am, fifth coffee. Late hour inspired a nested 7-way Venn Diagram, a blob of shared miRNAs targeting E2F genes. A thing of terrible beauty, inside every human cell. Code below, a simple list rendered with the venn package.

library(venn)
mirnas<-list(mirnas_e2f1,mirnas_e2f2,mirnas_e2f3,mirnas_e2f6,mirnas_e2f7,mirnas_e2f8,mirnas_mycn)
names(mirnas) <- c("E2F1","E2F2","E2F3","E2F6","E2F7","E2F8","MYCN")
png("plots/075_mirnas_7way.png",w=1000,h=1000,p=30)
venn(mirnas,ilab=TRUE,zcolor="style")
dev.off()
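The central region of the diagram (miRNAs shared by all sets) can also be pulled out as a plain vector. A minimal base R sketch on a made-up three-set list, since the real mirnas_* vectors are not shown here; all miRNA names below are placeholders:

```r
# Hypothetical toy list with the same structure as mirnas
mirnas_toy <- list(
  E2F1 = c("miR-a", "miR-b", "miR-c"),
  E2F2 = c("miR-a", "miR-c", "miR-d"),
  MYCN = c("miR-a", "miR-c", "miR-e")
)

# miRNAs present in every set (the innermost Venn region)
shared <- Reduce(intersect, mirnas_toy)

# In how many sets does each miRNA appear?
membership <- table(unlist(mirnas_toy))
```

The same two lines should work unchanged on the 7-element list used for the plot.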

Bioinformatics Lab Course – Draft Structure

University of Bologna

Genomics Course

Bioinformatics Lab

Teacher: Prof. Federico M. Giorgi

Teaching Assistant: Dr. Chiara Cabrelle

Duration: 60 hours (15 modules of ~4 hours + optional extras)

Exam: Oral

The course aims at giving a practical overview of all the useful tools, approaches and techniques necessary for a competitive bioinformatician in 2019.

Module 1: Introduction to and testing of the working environment

  • VirtualBox
  • Linux Refresher
  • Playing with a FASTA file: wc, grep, htop, regex, sed
  • EMBOSS suite
  • Remove/install programs using apt (htop)
  • Projects and Exercise structure

Module 2: Phylogenetic Sequence Analysis

  • Sequence databases: how to download sequences from NCBI
  • Building a phylogenetic multifasta (MYC family)
  • Multiple Sequence Alignment (Muscle, ClustalW, TCoffee)
  • Building a Phylogenetic Tree (PHYLIP)
  • Phylogenetic GUI: MEGA

Module 3: Remote Homology Detection

  • BLAST introduction
  • Create, format and index a sequence database (BLAST formatdb)
  • BLASTN/BLASTP/TBLASTX with various options
  • Discover the organism of a mysterious sequence
  • PSI-BLAST

Module 4: Introduction to Next Generation Sequencing

  • FASTA vs FASTQ, PHRED score
  • FASTQ library, single ended and paired
  • FastQC

Module 5: NGS Alignment

  • Aligners: Bowtie, BWA, HISAT
  • BAM files
  • Samtools: process and visualize BAM files
  • Integrative Genomics Viewer (IGV): visualize alignments

Module 6: Calling Mutations

  • Exercise: generate BAMs
  • Using Varscan
  • Visualizing mutations and indels with IGV
  • Larger mutations: CNVs and translocations
  • GATK
  • KisSplice: calling mutations from RNA reads

Module 7: RNA-Seq

  • Spliced aligners (TopHat, STAR, HISAT)
  • Finding new transcripts (Cufflinks)
  • Converting BAMs to counts (GFF, HTSeq-count)
  • Finding contaminants in human RNA-Seq vs. other genomes (unaligned vs H. pylori)

Module 8: ChIP-Seq

  • Exercise: align reads again
  • Input reads
  • Call Peaks (MACS)
  • Find enriched motifs (HOMER)
  • Upload Custom ENCODE tracks on Genome Browser

Module 9 (Short): Assembly

  • Assembling a small bacterial genome with DNA reads with MIRA
  • Classic DNA Assembly with ABySS or Velvet
  • Assembling E2F3 gene with long DNA reads with Canu
  • Assembling RNA-Seq transcripts with Trinity

*** end of R-free course ***

Module 10 (Long): (re)introduction to R

  • Basic commands up to sapply
  • RStudio
  • Scatterplots, Boxplots, Violin Plots, Heatmaps
  • RCircos
  • Bioconductor
  • Gene ID conversion
  • Genomic Ranges

Module 11: Differential Expression Analysis

  • Loading counts
  • Normalization: RPM vs RPKM vs TPM vs Size Factors vs voom
  • edgeR vs DESeq2
  • Comparing two datasets
  • Complex Designs
  • Confounding Variables (cancer vs. normal with age difference)

Module 12: Microarrays

  • Concept
  • Three steps: BG correction, normalization, summarization
  • RMA vs. MAS5
  • Differential Expression with limma
  • Comparing microarrays with RNA-Seq

Module 13: Single Cell RNA-Seq

  • Dropout effects and biases
  • Clustering
  • Seurat pipeline
  • Cell Cycle bias removal
  • Differential Expression and comparison with bulk RNA

Module 14: Differential Binding Analysis

  • Estrogen treatment with DiffBind package
  • How to assign peaks to promoters to genes (GRanges)
  • VULCAN package?

Module 15: Pathway Enrichment Analysis

  • Databases: Gene Ontology, MSigDB, Reactome, BioCarta, KEGG, MapMan
  • Discrete enrichments: topGO package
  • Continuous enrichments: GSEA
  • External resources: DAVID, GOrilla

*** Extra Modules ***

Module 16: Coexpression Analysis in R

  • Correlation: Pearson, Spearman, Kendall
  • Mutual Information
  • Partial Correlation (A,B,C)
  • Overlap with ENCODE and MSigDB data
  • ARACNe

Module 17: Alternative Transcript Counters

  • Salmon
  • Kallisto

Module 18: Detect Gene Fusions

  • RNA: TopHat-Fusion
  • DNA: big translocation finders?

Module 19: Simple Machine Learning

  • Predicting Mutations with Gene Expression
  • glmnet, lasso, gradient boosting models, caret package

Module 20: Survival Analysis

  • Kaplan Meier Curves
  • Tests
  • Multiple groups
  • Comparing datasets

Module 21: Building an R plot with lattice

  • Canvas
  • Axes
  • Objects

Module 22: Clustering analysis in R

  • Hierarchical clustering (hclust and pvclust)
  • Treecut and dynamic treecut
  • Kmeans
  • Principal Component Analysis and t-SNE

Module 23: DNA shape prediction in R

  • The DNA shape properties (MGW, HelT, PropT, Roll, EP)
  • DNAshapeR package
  • Show the shape of similar promoters (H. pylori project)

Combining P-values

TFisher

We have come a long way from the original simple p-value integration methods of Fisher and Stouffer. Hong Zhang, a talented grad student from the Worcester Polytechnic Institute, and his colleagues have developed a novel method, called TFisher, for dealing with p-value integration in a wide range of test scenarios.

I quote from their abstract, available here: https://arxiv.org/abs/1801.04309

For testing a group of hypotheses, tremendous p-value combination methods have been developed and widely applied since 1930’s. Some methods (e.g., the minimal p-value) are optimal for sparse signals, and some others (e.g., Fisher’s combination) are optimal for dense signals. To address a wide spectrum of signal patterns, this paper proposes a unifying family of statistics, called TFisher, with general p-value truncation and weighting schemes. Analytical calculations for the p-value and the statistical power of TFisher under general hypotheses are given. Optimal truncation and weighting parameters are studied based on Bahadur Efficiency (BE) and the proposed Asymptotic Power Efficiency (APE), which is superior to BE for studying the signal detection problem. A soft-thresholding scheme is shown to be optimal for signal detection in a large space of signal patterns. When prior information of signal pattern is unavailable, an omnibus test, oTFisher, can adapt to the given data. Simulations evidenced the accuracy of calculations and validated the theoretical properties. The TFisher tests were applied to analyzing a whole exome sequencing data of amyotrophic lateral sclerosis. Relevant tests and calculations have been implemented into an R package TFisher and published on the CRAN.

The methods are implemented in R and available on CRAN:

https://cran.r-project.org/web/packages/TFisher/index.html
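For comparison, the two classic methods mentioned at the top (Fisher's and Stouffer's) fit in a few lines of base R. This is a minimal unweighted sketch assuming independent one-sided p-values; it does not use or reproduce the TFisher package, and the function names fisher_combine and stouffer_combine, as well as the example p-values, are my own:

```r
# Fisher: -2 * sum(log(p)) follows a chi-square with 2k degrees of freedom
fisher_combine <- function(p) {
  stat <- -2 * sum(log(p))
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

# Stouffer: sum of z-scores divided by sqrt(k) follows a standard normal
stouffer_combine <- function(p) {
  z <- sum(qnorm(p, lower.tail = FALSE)) / sqrt(length(p))
  pnorm(z, lower.tail = FALSE)
}

# Made-up p-values from four hypothetical tests
pvals <- c(0.01, 0.20, 0.03, 0.50)
p_fisher <- fisher_combine(pvals)
p_stouffer <- stouffer_combine(pvals)
```

Two quick sanity checks: combining a single p-value with Fisher's method returns that same p-value, and Stouffer's method applied to p-values that are all 0.5 returns 0.5.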