Online Tools to fight the COVID-19 pandemic (updated 07-Sep-2020)

| Tool | Link | Main Institution | Nation | Architecture | Tags | Pros | Cons |
|---|---|---|---|---|---|---|---|
| JHU COVID-19 Dashboard |  | Hopkins University | USA | Python | Dashboard, Interactive Map, Trend Assessment, Worldwide | Frequently Updated, Quick Assessment, Worldwide Analysis |  |
| DSCovR |  | University | USA | Shiny/R | Dashboard, Interactive Map, Trend Assessment | Comparative Region Analysis, Demographics Included | Slow to load, Focused on USA |
| WHO Dashboard | http://covid19.who.int | WHO | Worldwide | JavaScript | Dashboard, Interactive Map, Worldwide | Comparative Region Analysis, Easy to Use, Frequently Updated, Quick Assessment, Worldwide Analysis |  |
| Worldometers |  |  |  |  | Worldwide | Easy to Use, Frequently Updated, Quick Assessment, Worldwide Analysis |  |
| COVID-19 Scenarios | http://covid19-scenarios.org | University of Basel | Switzerland | JavaScript | Interactive Simulator, Worldwide | Demographics Included, High Number of Parameters | Non-trivial to tailor the simulation for specific regions |
| Harvard COVID-19 Simulator | http://covid19sim.org | Harvard Medical School | USA | R | Interactive Simulator | Frequently Updated | Focused on USA |
| CovidSIM | http://covidsim.eu | ExploSYS GmbH | Germany | JavaScript | Interactive Simulator | High Number of Parameters | Non-trivial to tailor the simulation for specific regions |
| COVID-19 Trajectory viewer |  | of Leipzig | Germany | Shiny/R | Interactive Simulator | Comparative Region Analysis |  |
| COVID-19 Exit Strategies |  | versus Corona initiative | Worldwide | Shiny/R | Interactive Simulator | Comparison of Several Exit Strategies | Tunable Parameters are Few |
| Greifswald COVID-19 Simulator |  | of Greifswald | Germany | Shiny/R | Interactive Simulator | Predict Effect of Social Contact Reduction | Focused on specific countries and German regions |
| COVID19-Tracker |  | Biomedical Research Institute | Spain | Shiny/R | Case Number Visualizer and Predictor | Frequently Updated | Focused on Spain |
| GISAID | http://gisaid.org | GISAID | Worldwide | CMS TYPO3 | Data Repository, Worldwide | Database Fully Downloadable, Frequently Updated, Precomputed Multiple Sequence Alignment |  |
| Nextstrain |  | of Basel | Switzerland | Python | Dashboard, Nucleotide Mutation Analysis, Phylogenesis, Worldwide | Frequently Updated, Simulation of Mutation Spread over Time, Worldwide | Difficult to zoom into specific regions of the interactive phylogenetic tree |
| Covidex |  | of Luján | Argentina | Shiny/R | Phylogenetic Categorization | Allows User-provided Data, Intuitive Tutorial | Works exclusively with User-provided Data |
| Coronapp |  | of Bologna | Italy | Shiny/R | Amino Acid Mutation Analysis, Nucleotide Mutation Analysis, Frequency of Mutations over Time | Allows User-provided Data, Nucleotide and Protein Mutations, Worldwide | Slow to load |
| COVID-19 Genotyping Tool | http://covidgenotyper.app | University of Toronto | Canada | Shiny/R | Phylogenetic Categorization via 2D clustering | Allows User-provided Data | Analysis is very slow, Maximum number of sequences is only 10 |
| Pangolin | http://pangolin.cog-uk.io | Centre for Genomic Pathogen Surveillance | United Kingdom | Python | Phylogenetic Categorization, Lineage Assigner | Allows User-provided Data, Intuitive Assignment of Lineage | Analysis is slow |
| SARS-CoV-2 Alignment Screen |  | College London | United Kingdom | Shiny/R | Nucleotide Mutation Analysis | Mutation Analysis can be Focused on specific Genomic Regions or Genes | Not frequently updated |
| CoV-GLUE |  | of Glasgow | United Kingdom | JavaScript | Amino Acid Mutation Analysis, Nucleotide Mutation Analysis, Spreadsheet | Mutation Analysis can be Focused on specific Genomic Regions or Genes, Mutations Categorized as Replacements/Insertions/Deletions |  |
| Coronavirus3D | http://coronavirus3d.org | University of California Riverside | USA | JavaScript | Amino Acid Mutation Analysis, 3D Structure | Allows to project mutations on viral protein structures from PDB, Frequently Updated |  |
| CoVex |  | University of Munich | Germany | JavaScript | Interactome Visualizer | Allows to identify Known Drugs for selected Target Proteins |  |
| VirHostNet 2.0 | http://virhostnet.prabi.fr | University of Lyon | France | Cytoscape web | Interactome Visualizer | Prediction of novel interactions on user-provided protein sequences | Analysis is slow |
| P-HIPSTer | http://phipster.org | Columbia University | USA | JavaScript | Interaction List | Prediction of novel interactions using sequence- and structure-based machine learning | Not focused on SARS-CoV-2 |
| COVID-19 Gene/Drug Set Library |  | School of Medicine Mount Sinai | USA | JavaScript | Curated Lists of Genes and Drugs | Lists can be Searched, New Sets can be Proposed | No link with external databases |
| canSAR |  | Cancer Therapeutics Unit | United Kingdom | JavaScript | Database of Clinical Trials, Drugs and Druggable Targets | Intuitive Visualization of Druggable Interactome, Drug Prediction |  |
| CORDITE | http://cordite.mathematik.uni-marburg.de | University of Marburg | Germany | JavaScript | Database of Clinical Trials, Drugs and Druggable Targets | Quick Search | Not Frequently Updated |
| COVID-19 Disease Map |  | of Luxemburg | Luxemburg | JavaScript | Database of Drugs and Pathways | Search for relevant interactions between viral proteins and human pathways | Interactome Labels are hard to read, Not Frequently Updated, No Examples provided, Not focused on SARS-CoV-2 |
| CoV-Hipathia |  | for Progress and Health | Spain | Web Components | Analysis of Druggable Pathways affected by Gene Expression Changes | Allows User-provided Data | Analysis is slow |
| Chemical Checker |  | for Research in Biomedicine | Spain | JavaScript | Database of Drugs | Drugs Ranked by Evidence Quality and Quantity, Frequently Updated |  |
| Clinical Trials |  |  | States | JavaScript | Database of Clinical Trials | Frequently Updated, Fully Comprehensive | Not categorized by Drugs |

corto – the Correlation Tool

We developed corto (Correlation Tool), a simple package to infer gene regulatory networks from gene expression data using DPI (Data Processing Inequality) and bootstrapping to recover edges.
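As a rough illustration of the DPI idea only (this is not corto's actual code), the base R sketch below takes a thresholded matrix of absolute correlations and, for every fully connected triplet of genes, drops the weakest of the three edges, which is the edge most likely to be indirect. The helper name `dpi_prune`, the simulated data and the 0.5 threshold are all assumptions made for the example.

```r
# Hypothetical helper, NOT part of the corto package: toy DPI pruning.
# w is a symmetric matrix of edge weights, with 0 meaning "no edge".
dpi_prune <- function(w) {
  n <- nrow(w)
  drop <- matrix(FALSE, n, n)
  for (i in seq_len(n - 2)) {
    for (j in (i + 1):(n - 1)) {
      for (k in (j + 1):n) {
        edges <- c(w[i, j], w[i, k], w[j, k])
        if (all(edges > 0)) {
          # In a closed triplet, mark the weakest edge for removal
          pair <- list(c(i, j), c(i, k), c(j, k))[[which.min(edges)]]
          drop[pair[1], pair[2]] <- TRUE
          drop[pair[2], pair[1]] <- TRUE
        }
      }
    }
  }
  w[drop] <- 0
  w
}

# Simulated chain A -> B -> C, so the A-C correlation is indirect
set.seed(1)
a  <- rnorm(1000)
b  <- a + rnorm(1000, sd = 0.5)
c3 <- b + rnorm(1000, sd = 0.5)
expr <- rbind(A = a, B = b, C = c3)   # genes in rows, samples in columns

w <- abs(cor(t(expr)))
diag(w) <- 0
w[w < 0.5] <- 0                       # arbitrary threshold for the example
pruned <- dpi_prune(w)
print(round(pruned, 2))
```

All removals are decided against the original network and applied at the end, so the result does not depend on the order in which triplets are visited.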

Supplementary Material containing all gene networks generated during corto benchmarking:

CRAN stable package:

GitHub development version:

Progress bars and parallelization in R

Since SNOW is being discontinued, today I worked a bit on finding new solutions to have a progress bar in R for jobs running in parallel. In this example, I run 10,000 times a simple function to calculate logarithms, using 2 threads and monitoring the progress of the 10,000 calculations.

Set up the parameters

The following are the three parameters needed for any parallel job: number of threads, number of replicates (jobs) and a function:
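A minimal sketch of such a setup, using the values mentioned in the introduction (2 threads, 10,000 replicates, a logarithm). `nthreads` and `nreps` are the names used by the snippets in this post; `funrep` is an assumed name for the job function.

```r
nthreads <- 2                  # number of parallel workers
nreps <- 10000                 # number of replicates (jobs)
funrep <- function(i) log(i)   # assumed name: the simple job run for each replicate
```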


SNOW solution

This was my old solution in SNOW, but CRAN is flagging all packages using SNOW with a warning “superseded packages” so we have to change it:

output <- foreach(i = icount(nreps), .combine = c, .options.snow = opts) %dopar% log(i)

Parallel solution (not working)

Unfortunately, Parallel doesn’t have a .options in foreach, and running it like this won’t work, as the combine function is run only at the end:

output <- foreach(i = icount(nreps), .combine = c) %dopar% log(i)

Another parallel solution

After many tears, I finally found a solution that could work. Essentially, instead of c() I am running a progcombine() that contains c() and also updates a progress bar. Luckily, it works on both Windows and Linux:

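library(doParallel)  # also attaches foreach, iterators and parallel
progcombine <- function() {
  pb <- txtProgressBar(min = 1, max = nreps - 1, style = 3)
  count <- 0
  function(...) {
    count <<- count + length(list(...)) - 1  # results merged in by this call
    setTxtProgressBar(pb, count)  # redraw the bar, then combine as c() would
    c(...)
  }
}
cl <- makeCluster(nthreads); registerDoParallel(cl)
output <- foreach(i = icount(nreps), .combine = progcombine()) %dopar% log(i)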

The working solution: pblapply
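A minimal sketch of this approach, assuming the pbapply package is installed: `pblapply` accepts a `cl` argument, so the same cluster object created by `makeCluster` can be reused and the progress bar is updated as chunks of jobs come back. The job function `funrep` and the parameter values are assumptions restating the setup described earlier; the fallback to `parLapply` is there only so the snippet still runs without pbapply.

```r
library(parallel)

nthreads <- 2
nreps <- 10000
funrep <- function(i) log(i)   # assumed stand-in for the real job

cl <- makeCluster(nthreads)
if (requireNamespace("pbapply", quietly = TRUE)) {
  # Parallel lapply with a text progress bar
  output <- pbapply::pblapply(seq_len(nreps), funrep, cl = cl)
} else {
  # Same result, but without a progress bar
  output <- parLapply(cl, seq_len(nreps), funrep)
}
stopCluster(cl)

output <- unlist(output)       # both branches return a list
```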


7-way nested Venn Diagrams

3:21am, fifth coffee. Late hour inspired a nested 7-way Venn Diagram, a blob of shared miRNAs targeting E2F genes. A thing of terrible beauty, inside every human cell. Code below, a simple list rendered with the venn package.

names(mirnas) <- c("E2F1","E2F2","E2F3","E2F6","E2F7","E2F8","MYCN")
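A hedged sketch of how such a list can be drawn, assuming the venn package is installed. The miRNA identifiers below are made-up placeholders, since the real target lists are not shown here; only the seven set names come from the post.

```r
# Placeholder data: random, invented miRNA IDs for each of the 7 sets
set.seed(7)
pool <- paste0("miR-", 1:60)
mirnas <- lapply(1:7, function(k) sample(pool, 25))
names(mirnas) <- c("E2F1", "E2F2", "E2F3", "E2F6", "E2F7", "E2F8", "MYCN")

# venn() accepts a named list and supports up to 7 sets
if (requireNamespace("venn", quietly = TRUE)) {
  venn::venn(mirnas, zcolor = "style")
}

# Elements shared by all seven sets (the innermost region of the diagram)
shared <- Reduce(intersect, mirnas)
```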

Bioinformatics Lab Course – Draft Structure

University of Bologna

Genomics Course

Bioinformatics Lab

Teacher: Prof. Federico M. Giorgi

Teaching Assistant: Dr. Chiara Cabrelle

Duration: 60 hours (15 modules of ~4 hours + optional extras)

Exam: Oral

The course aims at giving a practical overview of all the useful tools, approaches and techniques necessary for a competitive bioinformatician in 2019.

Module 1: Introduction to and testing of the working environment

  • VirtualBox
  • Linux Refreshment
  • Playing with a FASTA file: wc, grep, htop, regex, sed
  • EMBOSS suite
  • Remove/install programs using apt (htop)
  • Projects and Exercise structure

Module 2: Phylogenetic Sequence Analysis

  • Sequence databases: how to download sequences from NCBI
  • Building a phylogenetic multifasta (MYC family)
  • Multiple Sequence Alignment (MUSCLE, ClustalW, T-Coffee)
  • Building a Phylogenetic Tree (PHYLIP)
  • Phylogenetic GUI: MEGA

Module 3: Remote Homology Detection

  • BLAST introduction
  • Create, format and index a sequence database (BLAST formatdb)
  • BLASTN/BLASTP/TBLASTX with various options
  • Discover the organism of a mysterious sequence

Module 4: Introduction to Next Generation Sequencing

  • FASTA vs FASTQ, PHRED score
  • FASTQ library, single ended and paired

Module 5: NGS Alignment

  • Aligners: Bowtie, BWA, HISAT
  • BAM files
  • Samtools: process and visualize BAM files
  • Integrative Genomics Viewer (IGV): visualize alignments

Module 6: Calling Mutations

  • Exercise: generate BAMs
  • Using Varscan
  • Visualizing mutations and indels with IGV
  • Larger mutations: CNVs and translocations
  • GATK
  • KisSplice: calling mutations from RNA reads

Module 7: RNA-Seq

  • Spliced aligners (TopHat, STAR, HISAT)
  • Finding new transcripts (Cufflinks)
  • Converting BAMs to counts (GFF, htseq-count)
  • Finding contaminants in human RNA-Seq vs. other genomes (unaligned reads vs. H. pylori)

Module 8: ChIP-Seq

  • Exercise: align reads again
  • Input reads
  • Call Peaks (MACS)
  • Find enriched motifs (HOMER)
  • Upload Custom ENCODE tracks on Genome Browser

Module 9 (Short): Assembly

  • Assembling a small bacterial genome with DNA reads with MIRA
  • Classic DNA Assembly with ABySS or Velvet
  • Assembling E2F3 gene with long DNA reads with Canu
  • Assembling RNA-Seq transcripts with Trinity

*** end of R-free course ***

Module 10 (Long): (re)introduction to R

  • Basic commands up to sapply
  • RStudio
  • Scatterplots, Boxplots, Violin Plots, Heatmaps
  • RCircos
  • Bioconductor
  • Gene ID conversion
  • Genomic Ranges

Module 11: Differential Expression Analysis

  • Loading counts
  • Normalization: RPM vs RPKM vs TPM vs Size Factors vs voom
  • edgeR vs DESeq2
  • Comparing two datasets
  • Complex Designs
  • Confounding Variables (cancer vs. normal with age difference)

Module 12: Microarrays

  • Concept
  • Three steps: BG correction, normalization, summarization
  • RMA vs. MAS5
  • Differential Expression with LIMMA
  • Comparing microarrays with RNA-Seq

Module 13: Single Cell RNA-Seq

  • Dropout effects and biases
  • Clustering
  • Seurat pipeline
  • Cell Cycle bias removal
  • Differential Expression and comparison with bulk RNA

Module 14: Differential Binding Analysis

  • Estrogen treatment with DiffBind package
  • How to assign peaks to promoters to genes (GRanges)
  • VULCAN package?

Module 15: Pathway Enrichment Analysis

  • Databases: Gene Ontology, MSigDB, Reactome, BioCarta, KEGG, MapMan
  • Discrete enrichments: topGO package
  • Continuous enrichments: GSEA
  • External resources: DAVID, GOrilla

*** Extra Modules ***

Module 16: Coexpression Analysis in R

  • Correlation: Pearson, Spearman, Kendall
  • Mutual Information
  • Partial Correlation (A,B,C)
  • Overlap with ENCODE and MSigDB data
  • ARACNe

Module 17: Alternative Transcript Counters

  • Salmon
  • Kallisto

Module 18: Detect gene Fusions

  • RNA: Tophat fusions
  • DNA: big translocation finders?

Module 19: Simple Machine Learning

  • Predicting Mutations with Gene Expression
  • glmnet, lasso, gradient boosting models, caret package

Module 20: Survival Analysis

  • Kaplan Meier Curves
  • Tests
  • Multiple groups
  • Comparing datasets

Module 21: Building an R plot with lattice

  • Canvas
  • Axes
  • Objects

Module 22: Clustering analysis in R

  • Hierarchical clustering (hclust and pvclust)
  • Treecut and dynamic treecut
  • Kmeans
  • Principal Component Analysis and t-SNE

Module 23: DNA shape prediction in R

  • The DNA shape properties (MGW, HelT, PropT, Roll, EP)
  • DNAshapeR package
  • Show the shape of similar promoters (H. pylori project)

Combining P-values


We have come a long way from the original simple p-value integration methods of Fisher and Stouffer. Hong Zhang, a talented grad student from the Worcester Polytechnic Institute, and his colleagues have developed a novel method, called TFisher, for dealing with p-value integration in a wide range of test scenarios.
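For reference, the two classic methods mentioned above can be sketched in a few lines of base R. This is only an illustration of Fisher's and Stouffer's combinations (it does not use or reproduce TFisher); the function names and the example p-values are made up for the sketch, and both functions assume independent, one-sided p-values.

```r
# Fisher: -2 * sum(log(p)) follows a chi-squared with 2k degrees of freedom
fisher_combine <- function(p) {
  stat <- -2 * sum(log(p))
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

# Stouffer: sum of z-scores divided by sqrt(k) follows a standard normal
stouffer_combine <- function(p) {
  z <- sum(qnorm(p, lower.tail = FALSE)) / sqrt(length(p))
  pnorm(z, lower.tail = FALSE)
}

p <- c(0.04, 0.10, 0.20, 0.03)   # made-up p-values from four tests
print(fisher_combine(p))
print(stouffer_combine(p))
```

With a single p-value both functions return that p-value unchanged, which is a quick sanity check on the formulas.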

I quote from their abstract, available here:

For testing a group of hypotheses, tremendous p-value combination methods have been developed and widely applied since 1930’s. Some methods (e.g., the minimal p-value) are optimal for sparse signals, and some others (e.g., Fisher’s combination) are optimal for dense signals. To address a wide spectrum of signal patterns, this paper proposes a unifying family of statistics, called TFisher, with general p-value truncation and weighting schemes. Analytical calculations for the p-value and the statistical power of TFisher under general hypotheses are given. Optimal truncation and weighting parameters are studied based on Bahadur Efficiency (BE) and the proposed Asymptotic Power Efficiency (APE), which is superior to BE for studying the signal detection problem. A soft-thresholding scheme is shown to be optimal for signal detection in a large space of signal patterns. When prior information of signal pattern is unavailable, an omnibus test, oTFisher, can adapt to the given data. Simulations evidenced the accuracy of calculations and validated the theoretical properties. The TFisher tests were applied to analyzing a whole exome sequencing data of amyotrophic lateral sclerosis. Relevant tests and calculations have been implemented into an R package TFisher and published on the CRAN.

The methods are implemented in R and available on CRAN:

RNASeq aligners

I would say the match now has four competitors:


TopHat

  • Pros: the classic, the first universally used, still widely adopted in pipelines all over the world; basically, people keep using it so that their new results are comparable to the old ones
  • Cons: slow (several CPU hours per alignment on a human genome with 10M reads), limited to 4 Gbase genomes (so, no complex metatranscriptomics for it), and on their very website they say to use HISAT2


STAR

  • Pros: super, wicked fast, the standard used by ENCODE and the big RNASeq projects
  • Cons: uses a LOT of RAM, like really a lot (64GB for a human index)


HISAT

  • Pros: fast and low RAM requirements. If you start from scratch, this is the aligner to pick
  • Cons: it’s still new and so many people don’t trust it yet


Salmon and Kallisto

These are actually not strictly aligners, but rather transcript counters. I put them together for simplicity, but they are different tools

  • Pros: high speed and low RAM requirements. Ideal for quick RNA-Seq gene expression measurements
  • Cons: they cannot do de novo transcript detection, sad. They don’t produce counts, which are the expected input for many downstream analysis tools. However, some tools are starting to accept Salmon/Kallisto outputs (in R you can use the transcript abundance import package tximport)


Quantifying RNA-Seq Transcripts

About ten years ago, when RNA-Seq was young, we struggled to make sense of the huge quantity of data that came out of Next-Generation Sequencers. The RNA-Seq pipelines were founded on the simple scheme:

Reads -> Alignments -> Quantification

The most popular RNA-Seq alignment tool, TopHat (now TopHat2), was actually built on the Bowtie aligner to focus on transcribed genomic regions (the Transcriptome), with the optional feature of aligning reads to the whole Genome, for de novo transcript discovery.
