Author Archives: fedgiorgi

corto – the Correlation Tool

We developed corto (Correlation Tool), a simple package to infer gene regulatory networks from gene expression data using DPI (Data Processing Inequality) and bootstrapping to recover edges.
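To make the DPI step concrete, here is a minimal sketch in base R. It is not corto's implementation: the toy data, the variable names and the use of absolute Pearson correlation as the edge weight are all assumptions made for illustration. The idea is that in every fully connected gene triplet, the weakest edge is treated as indirect and dropped.

```r
# Toy chain A -> B -> C: the A-C association is purely indirect
set.seed(1)
n  <- 1000
a  <- rnorm(n)
b  <- a + rnorm(n)
c3 <- b + rnorm(n)
expr <- rbind(A = a, B = b, C = c3)   # genes in rows, samples in columns

# Edge weights: absolute Pearson correlation between genes (an assumption)
w <- abs(cor(t(expr)))
diag(w) <- 0

# DPI: in each triplet, flag the weakest of the three edges for removal
keep <- w > 0
triplets <- combn(rownames(w), 3)
for (k in seq_len(ncol(triplets))) {
  pairs <- t(combn(triplets[, k], 2))     # 3 x 2 matrix of gene name pairs
  weakest <- pairs[which.min(w[pairs]), ]  # pair with the lowest weight
  keep[weakest[1], weakest[2]] <- FALSE
  keep[weakest[2], weakest[1]] <- FALSE
}
keep   # logical adjacency matrix after DPI
```

In this toy chain the A-C edge should be the one removed, while A-B and B-C survive. Bootstrapping would repeat the same procedure on resampled columns and count how often each edge is kept.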

Supplementary Material containing all gene networks generated during corto benchmarking:

CRAN stable package:

Github developmental version:

Progress bars and parallelization in R

Since SNOW is being discontinued, today I worked a bit on finding new solutions to have a progress bar in R for jobs running in parallel. In this example, I run a simple function that calculates logarithms 10,000 times, using 2 threads, while monitoring the progress of the 10,000 calculations.

Set up the parameters

The following are the three parameters needed for any parallel job: number of threads, number of replicates (jobs) and a function:
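A minimal sketch of these three parameters, using the values from the example above (2 threads, 10,000 replicates, a logarithm). The names nthreads and nreps appear in the code further down; funrep is a hypothetical name chosen here for the job function.

```r
nthreads <- 2             # number of threads
nreps    <- 10000         # number of replicates (jobs)
funrep   <- function(i) { # hypothetical name: the job run for each replicate
  log(i)
}
```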


SNOW solution

This was my old solution in SNOW, but CRAN is flagging all packages using SNOW with a warning “superseded packages” so we have to change it:

output<-foreach(i=icount(nreps),.combine=c,.options.snow=opts) %dopar% {

Parallel solution (not working)

Unfortunately, Parallel doesn’t have a .options in foreach, and running it like this won’t work, as the combine function is run only at the end:

output<-foreach(i=icount(nreps),.combine=c) %dopar% {

Another parallel solution

After many tears, I finally found a solution that could work. Essentially, instead of c() I am running a progcombine() that contains c() and also updates a progress bar. Luckily, it works on both Windows and Linux:

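progcombine <- function() {
  pb <- txtProgressBar(min=1, max=nreps-1, style=3)
  count <- 0
  function(...) {
    count <<- count + length(list(...)) - 1
    setTxtProgressBar(pb, count)   # update the progress bar
    c(...)                         # then combine the results as c() would
  }
}
cl <- makeCluster(nthreads)
registerDoParallel(cl)             # from the doParallel package
output <- foreach(i = icount(nreps), .combine=progcombine()) %dopar% {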
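To see why a combining closure can track progress, the sketch below exercises one without starting a cluster. It assumes that foreach hands a custom .combine function two arguments at a time (the accumulated result and the next one), which Reduce() mimics here. The factory name make_progcombine and its nreps argument are hypothetical, and the bar starts at 0 rather than 1.

```r
nreps <- 100

# Factory returning a combiner that counts calls and updates a text progress bar
make_progcombine <- function(nreps) {
  pb <- txtProgressBar(min = 0, max = nreps - 1, style = 3)
  count <- 0
  function(...) {
    # With two arguments per call, this adds 1 for every new result combined
    count <<- count + length(list(...)) - 1
    setTxtProgressBar(pb, count)
    c(...)
  }
}

results <- lapply(seq_len(nreps), log)            # stand-in for the parallel jobs
output  <- Reduce(make_progcombine(nreps), results)
```

Each pairwise call bumps the counter by one, so after nreps - 1 calls the bar is full and output holds all the logarithms in order.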

The working solution: pblapply
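A minimal sketch of how pblapply can drive the same job, assuming the pbapply package is installed. The job function funrep is a hypothetical name, and the fallback branch (plain parLapply, no progress bar) is only there so the sketch still runs without pbapply.

```r
library(parallel)

nthreads <- 2
nreps    <- 10000
funrep   <- function(i) log(i)   # hypothetical job function

cl <- makeCluster(nthreads)
if (requireNamespace("pbapply", quietly = TRUE)) {
  # pblapply() receives the cluster through its cl argument and draws the bar
  output <- pbapply::pblapply(seq_len(nreps), funrep, cl = cl)
} else {
  # Fallback without a progress bar when pbapply is not installed
  output <- parLapply(cl, seq_len(nreps), funrep)
}
stopCluster(cl)
output <- unlist(output)
```

Since the cluster comes from the parallel package, the same code should run on both Windows and Linux.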


7-way nested Venn Diagrams

3:21am, fifth coffee. Late hour inspired a nested 7-way Venn Diagram, a blob of shared miRNAs targeting E2F genes. A thing of terrible beauty, inside every human cell. Code below, a simple list rendered with the venn package.

names(mirnas) <- c("E2F1","E2F2","E2F3","E2F6","E2F7","E2F8","MYCN")
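A minimal sketch of how such a list could be built and drawn. The miRNA identifiers below are random placeholders, not real target predictions, and the output file name is an assumption. The venn() call only runs if the venn package is installed.

```r
# Placeholder data: random toy IDs, NOT real miRNA-target predictions
set.seed(1)
universe <- paste0("miR-toy-", 1:60)
mirnas <- replicate(7, sample(universe, 25), simplify = FALSE)
names(mirnas) <- c("E2F1","E2F2","E2F3","E2F6","E2F7","E2F8","MYCN")

# miRNAs shared by all seven sets (the centre of the diagram)
shared <- Reduce(intersect, mirnas)

# Draw the 7-way diagram to a PDF if the venn package is available
if (requireNamespace("venn", quietly = TRUE)) {
  out <- file.path(tempdir(), "venn7.pdf")   # hypothetical output path
  pdf(out)
  venn::venn(mirnas)
  dev.off()
}
```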

Bioinformatics Lab Course – Draft Structure

University of Bologna

Genomics Course

Bioinformatics Lab

Teacher: Prof. Federico M. Giorgi

Teaching Assistant: Dr. Chiara Cabrelle

Duration: 60 hours (15 modules of ~4 hours + optional extras)

Exam: Oral

The course aims at giving a practical overview of all the useful tools, approaches and techniques necessary for a competitive bioinformatician in 2019.

Module 1: Introduction to and testing of the working environment

  • Virtual Box
  • Linux Refresher
  • Playing with a FASTA file: wc, grep, htop, regex, sed
  • EMBOSS suite
  • Remove/install programs using apt (htop)
  • Projects and Exercise structure

Module 2: Phylogenetic Sequence Analysis

  • Sequence databases: how to download sequences from NCBI
  • Building a phylogenetic multifasta (MYC family)
  • Multiple Sequence Alignment (Muscle, ClustalW, TCoffee)
  • Building a Phylogenetic Tree (PHYLIP)
  • Phylogenetic GUI: MEGA

Module 3: Remote Homology Detection

  • BLAST introduction
  • Create, format and index a sequence database (BLAST formatdb)
  • BLASTN/BLASTP/TBLASTX with various options
  • Discover the organism of a mysterious sequence

Module 4: Introduction to Next Generation Sequencing

  • FASTA vs FASTQ, PHRED score
  • FASTQ library, single-end and paired-end

Module 5: NGS Alignment

  • Aligners: Bowtie, BWA, HISAT
  • BAM files
  • Samtools: process and visualize BAM files
  • Integrative Genomics Viewer (IGV): visualize alignments

Module 6: Calling Mutations

  • Exercise: generate BAMs
  • Using Varscan
  • Visualizing mutations and indels with IGV
  • Larger mutations: CNVs and translocation
  • GATK
  • KisSplice: calling mutations from RNA reads

Module 7: RNA-Seq

  • Spliced aligners (TopHat, STAR, HISAT)
  • Finding new transcripts (Cufflinks)
  • Converting BAMs to counts (GFF, htseq-count)
  • Finding contaminants in human RNA-Seq vs. other genomes (unaligned vs H. pylori)

Module 8: ChIP-Seq

  • Exercise: align reads again
  • Input reads
  • Call Peaks (MACS)
  • Find enriched motifs (HOMER)
  • Upload Custom ENCODE tracks on Genome Browser

Module 9 (Short): Assembly

  • Assembling a small bacterial genome with DNA reads with MIRA
  • Classic DNA Assembly with ABySS or Velvet
  • Assembling E2F3 gene with long DNA reads with Canu
  • Assembling RNA-Seq transcripts with Trinity

*** end of R-free course ***

Module 10 (Long): (re)introduction to R

  • Basic commands up to sapply
  • RStudio
  • Scatterplots, Boxplots, Violin Plots, Heatmaps
  • RCircos
  • Bioconductor
  • Gene ID conversion
  • Genomic Ranges

Module 11: Differential Expression Analysis

  • Loading counts
  • Normalization: RPM vs RPKM vs TPM vs Size Factors vs voom
  • edgeR vs DESeq2
  • Comparing two datasets
  • Complex Designs
  • Confounding Variables (cancer vs. normal with age difference)

Module 12: Microarrays

  • Concept
  • Three steps: BG correction, normalization, summarization
  • RMA vs. MAS5
  • Differential Expression with LIMMA
  • Comparing microarrays with RNA-Seq

Module 13: Single Cell RNA-Seq

  • Dropout effects and biases
  • Clustering
  • Seurat pipeline
  • Cell Cycle bias removal
  • Differential Expression and comparison with bulk RNA

Module 14: Differential Binding Analysis

  • Estrogen treatment with DiffBind package
  • How to assign peaks to promoters to genes (GRanges)
  • VULCAN package?

Module 15: Pathway Enrichment Analysis

  • Databases: Gene Ontology, MSigDB, Reactome, BioCarta, KEGG, MapMan
  • Discrete enrichments: topGO package
  • Continuous enrichments: GSEA
  • External resources: DAVID, GOrilla

*** Extra Modules ***

Module 16: Coexpression Analysis in R

  • Correlation: Pearson, Spearman, Kendall
  • Mutual Information
  • Partial Correlation (A,B,C)
  • Overlap with ENCODE and MSigDB data
  • ARACNe

Module 17: Alternative Transcript Counters

  • Salmon
  • Kallisto

Module 18: Detect gene Fusions

  • RNA: TopHat-Fusion
  • DNA: big translocation finders?

Module 19: Simple Machine Learning

  • Predicting Mutations with Gene Expression
  • glmnet, lasso, gradient boosting, caret package

Module 20: Survival Analysis

  • Kaplan Meier Curves
  • Tests
  • Multiple groups
  • Comparing datasets

Module 21: Building an R plot with lattice

  • Canvas
  • Axes
  • Objects

Module 22: Clustering analysis in R

  • Hierarchical clustering (hclust and pvclust)
  • Treecut and dynamic treecut
  • Kmeans
  • Principal Component Analysis and t-SNE

Module 23: DNA shape prediction in R

  • The DNA shape properties (MGW, HelT, PropT, Roll, EP)
  • DNAshapeR package
  • Show the shape of similar promoters (H. pylori project)

Combining P-values


We have come a long way from the original simple p-value integration methods of Fisher and Stouffer. Hong Zhang, a talented grad student from the Worcester Polytechnic Institute, and his colleagues have developed a novel method, called TFisher, for dealing with p-value integration in a wide range of test scenarios.
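For reference, the two classic methods fit in a few lines of base R. This is a minimal sketch assuming independent tests and one-sided p-values. The function names and the example p-values are made up for illustration, and nothing here uses the TFisher package.

```r
# Fisher: -2 * sum(log(p)) follows a chi-squared distribution with 2k degrees of freedom
fisher_combine <- function(p) {
  stat <- -2 * sum(log(p))
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

# Stouffer: the sum of the z-scores divided by sqrt(k) is standard normal
stouffer_combine <- function(p) {
  z <- sum(qnorm(p, lower.tail = FALSE)) / sqrt(length(p))
  pnorm(z, lower.tail = FALSE)
}

p <- c(0.01, 0.04, 0.20)   # made-up p-values for illustration
fisher_combine(p)
stouffer_combine(p)
```

With a single p-value both functions return that p-value unchanged, which is a quick sanity check on the formulas.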

I quote from their abstract, available here:

For testing a group of hypotheses, tremendous p-value combination methods have been developed and widely applied since 1930’s. Some methods (e.g., the minimal p-value) are optimal for sparse signals, and some others (e.g., Fisher’s combination) are optimal for dense signals. To address a wide spectrum of signal patterns, this paper proposes a unifying family of statistics, called TFisher, with general p-value truncation and weighting schemes. Analytical calculations for the p-value and the statistical power of TFisher under general hypotheses are given. Optimal truncation and weighting parameters are studied based on Bahadur Efficiency (BE) and the proposed Asymptotic Power Efficiency (APE), which is superior to BE for studying the signal detection problem. A soft-thresholding scheme is shown to be optimal for signal detection in a large space of signal patterns. When prior information of signal pattern is unavailable, an omnibus test, oTFisher, can adapt to the given data. Simulations evidenced the accuracy of calculations and validated the theoretical properties. The TFisher tests were applied to analyzing a whole exome sequencing data of amyotrophic lateral sclerosis. Relevant tests and calculations have been implemented into an R package TFisher and published on the CRAN.

The methods are implemented in R and available on CRAN:

RNASeq aligners

I would say the match now has four competitors:


  • Pros: the classic, the first universally used, still widely adopted in pipelines all over the world. Basically, people keep using it so that their new results are comparable to the old ones
  • Cons: slow (several CPU hours per alignment on a human genome with 10M reads), limited to genomes of up to 4 Gbases (so, no complex metatranscriptomics for it), and on their very website they say to use HISAT2


  • Pros: super, wicked fast, the standard used by ENCODE and the big RNASeq projects
  • Cons: uses a LOT of RAM, like really a lot (64GB for a human index)


  • Pros: fast and low RAM requirements. If you start from scratch, this is the aligner to pick
  • Cons: it’s still new and so many people don’t trust it yet


These are actually not strictly aligners, but rather transcript counters. I put them together for simplicity, but they are different programs.

  • Pros: high speed and low RAM requirements. Ideal for quick RNA-Seq gene expression measurements
  • Cons: they cannot do de novo transcript detection, sad. They don’t produce counts, which are the expected input for many downstream analysis tools. However, some tools are starting to accept Salmon/Kallisto outputs (in R you can use the transcript abundance import package tximport)


Quantifying RNA-Seq Transcripts

About ten years ago, when RNA-Seq was young, we struggled to make sense of the huge quantity of data that came out of Next-Generation Sequencers. The RNA-Seq pipelines were founded on the simple scheme:

Reads -> Alignments -> Quantification

The most popular RNA-Seq alignment tool, Tophat (now Tophat2), was actually built on the Bowtie aligner to focus on transcribed genomic regions (the Transcriptome), with the optional feature of aligning reads to the whole Genome for de novo transcript discovery.
