I started a course in bioinformatics today and, to be frank, I am not super enthusiastic. Like other courses I have attended in the past, it starts by declaring all the wonderful things bioinformatics can do for you, but it fails on its promise to make them CLEAR to you. The teachers throw exercises at you without going through the essential basics of sequencing and how the data are produced in the first place, so how are we supposed to understand what the hell we are browsing?
I’d suggest a crash course in sequencing techniques right on day 1, ideally just before the practical session.
Regardless, I’ll keep a daily record of what I understood, from a very wet-lab, forgive-me-I-am-a-biologist perspective.
Today was all about Genome Browsing (GB). At least I understood what it is. There are different programs, online or as downloadable versions, that help you get your hands on the enormous amount of data that labs around the world are generating about the same sequence.
It’s like the Cochrane Collaboration of systems biology: let’s get all the evidence in the same place. I really like the concept, but am I able to access and make sense of all those accumulated and nicely displayed data?
All genome browsers work on one genome at a time (this is a revolutionary concept for me). They have a graphical visualization part and a collection of data and hyperlinks that connect different data sources to the specific region of the genome you are looking at.
A GB can also be used to look at your own primary data.
Wanna browse a genome? First: get the most up-to-date version of it! New genome assemblies come out every now and then. The latest human one is from 2009. Be sure you always work on the most recent version. Different versions may have different reference sequences, and your gene may have moved!
Choose the right GB
We went through these:
- UCSC: cool and with a lot of data to look at, but very slow and not too user-friendly. It runs on an online server.
- Ensembl: the EU version of UCSC. It’s not clear to me what its best features are compared with UCSC. In Ensembl we looked at gene orthologues among species. I guess that’s what it is useful for?
- BioMart: part of Ensembl, and it is not clear at all why a browser would have another browser within it. Here you can filter a genome to list, for example, only the genes present on chromosome 1. In case you have always wondered, the first gene on the first chromosome is… a microRNA. Not very exciting, but that was the first thing I went to look at.
- IGV: custom data visualization. It runs on Java (7 on Mac) and it looks like Chimera for protein visualization. It fetches the sequence or other data from the web and visualizes them like UCSC, but it’s 100 times faster! The downside is that it does all the math on your computer, so it makes the computer very slow.
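To make the BioMart-style filtering a bit more concrete, here is a minimal sketch of what "give me only the genes on chromosome 1, in order" amounts to. This is not the BioMart API: the `Gene` record, the `genes_on_chromosome` helper and the toy gene list are all made up for illustration.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical gene record; the fields are illustrative, not a BioMart schema."""
    name: str
    chrom: str
    start: int
    end: int
    biotype: str


def genes_on_chromosome(genes, chrom):
    """Return the genes located on `chrom`, sorted by start position."""
    return sorted((g for g in genes if g.chrom == chrom), key=lambda g: g.start)


# Made-up records, just to exercise the filter.
toy_genes = [
    Gene("geneC", "1", 50_000, 60_000, "protein_coding"),
    Gene("geneA", "1", 1_000, 1_100, "miRNA"),
    Gene("geneB", "2", 500, 900, "protein_coding"),
]

chr1 = genes_on_chromosome(toy_genes, "1")
first_gene = chr1[0]
print([g.name for g in chr1], first_gene.biotype)
```

With real data the same two steps apply (filter by chromosome, sort by start); only the source of the records changes.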
Every browser works with tracks. Every track contains info derived from a different database or annotation. There is a track for DNase-sensitive sites, a track for H1 binding, etc.
Every track can be divided into two parts: at the bottom there are all the single sequences that align to that region of the genome (from a ChIP analysis, for instance), and on top there is a summary of all these reads in a histogram-like format. The higher the histogram, the more sequences match that region. This visualization is useful to spot regions of hypermethylation or DNase hypersensitivity.
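The relation between the single reads at the bottom and the histogram on top can be sketched in a few lines. This is only an illustration of the idea (count how many reads overlap each position), not how any particular browser computes it; the `coverage` function and the read coordinates are invented, and intervals are assumed to be half-open (start included, end excluded).

```python
def coverage(reads, region_start, region_end):
    """Count how many reads overlap each position of [region_start, region_end)."""
    depth = [0] * (region_end - region_start)
    for read_start, read_end in reads:
        # Clip the read to the displayed region; reads outside contribute nothing.
        lo = max(read_start, region_start)
        hi = min(read_end, region_end)
        for pos in range(lo, hi):
            depth[pos - region_start] += 1
    return depth


# Made-up aligned reads as (start, end) pairs.
toy_reads = [(100, 105), (102, 108), (103, 106), (120, 123)]
depth = coverage(toy_reads, 100, 110)

# Crude text version of the histogram part of a track.
for offset, d in enumerate(depth):
    print(100 + offset, "#" * d)
```

The tallest bars land where the reads pile up, which is exactly what you look for when hunting for a hypersensitive or hypermethylated region.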
The hierarchy of the mess
As people keep throwing sequencing data into different databases, most aligned sequences have not been manually corrected and a lot of raw data just hangs there. This implies that whatever you wanna call a “gene” may have been annotated in so many different formats, and can start and finish in so many different locations, that… it’s a mess.
To help out, there is a hierarchy in this mess. On one side, there are manual annotations, where people (very, very kindly) wrote down a version of the gene (where it starts and ends, and all the exons) through a combination of only the most reliable sequences out there. This is the UCSC annotation system. A more stringent gene annotation is RefSeq (and I suspect that for most biologists it is the best). Another good one is MGC (Mammalian Gene Collection), where the mRNA comes from selected and standardized cell clones (though if these clones are not representative of the cells you work with, or are so fucked up that they have nothing more to do with normal human tissue… yes, you get it).
Going down in accuracy but up in breadth (catching different transcript variants) are the Human mRNA and EST tracks. As I understood it, the latter (expressed sequence tags) come from the first technique used to sequence RNA, and apparently each one covers only 200 bases. It’s a bit of a mess when your “real” mRNA comes from a big gene and these ESTs really don’t cover it from 5′ to 3′. The Human mRNAs are all the mRNAs from GenBank that align to that particular region of the genome. As I figured, these mRNAs are better and longer because GenBank has a higher quality-control standard.
Somewhere between these gene labels there are also the UCSC Known Genes, a manual annotation that takes into account many sources, like ENCODE raw data, UniProt, etc.
- RefSeq: alignments between the RefSeq database (NCBI) and the selected genome.
- Human mRNA: alignments between human GenBank sequences and the selected genome.
- UCSC: gene annotation that encompasses most data. The most reliable and sufficiently variable.
Both Human mRNA and EST are useful to understand which exons are real exons and which aren’t. Sometimes RNA sequence variations are just artifacts, so you’ll find them in only a few reads. These two tracks are also good for looking at splicing variants of genes.
Regardless of which track you are looking at, the thin line (no RNA detected –> introns) has arrows that indicate the direction of transcription (>>> or <<<). Then there are medium-thick boxes and full-thick boxes. The first are non-coding parts of the RNA (UTRs, untranslated regions), which MUST be at the 5′ or 3′ end; the second are the coding exons. In some genes I found UTR boxes in the middle of the region; these are likely automatic annotation mistakes.
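The thick-versus-medium box logic boils down to comparing each exon with the start and end of the coding region, which is similar in spirit to the `thickStart`/`thickEnd` fields of the BED format. A minimal sketch under that assumption follows; `split_exon` and the coordinates are hypothetical, intervals are half-open, and whether a UTR piece is the 5′ or the 3′ one depends on the strand, which is ignored here.

```python
def split_exon(exon_start, exon_end, cds_start, cds_end):
    """Split one exon into (start, end, kind) pieces, where kind is 'UTR' or 'CDS'."""
    pieces = []
    # Part of the exon lying before the coding region: medium-thick box.
    if exon_start < cds_start:
        pieces.append((exon_start, min(exon_end, cds_start), "UTR"))
    # Part of the exon inside the coding region: full-thick box.
    lo, hi = max(exon_start, cds_start), min(exon_end, cds_end)
    if lo < hi:
        pieces.append((lo, hi, "CDS"))
    # Part of the exon lying after the coding region: medium-thick box.
    if exon_end > cds_end:
        pieces.append((max(exon_start, cds_end), exon_end, "UTR"))
    return pieces


# Made-up transcript: three exons, coding region from 150 to 550.
toy_exons = [(100, 200), (300, 400), (500, 600)]
boxes = [piece for exon in toy_exons for piece in split_exon(*exon, 150, 550)]
for start, end, kind in boxes:
    print(start, end, kind)
```

In this toy transcript the UTR pieces can only fall at the two ends, which is why a UTR box in the middle of a gene looks suspicious.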
A very cool track is the Conservation track. It shows, as a histogram-like plot, the conservation of a genomic region between the reference genome and other species. Usually, the peaks are high in exons. The peaks may also go below the 0 line, which biologically means that the region underwent a sort of positive selection in human, as it has REMARKABLY more variants than the “baseline noise”. Such noise (the zero line) is the level of variation that you encounter when comparing the two whole genomes.
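If you had the per-base conservation scores as plain numbers, picking out the peaks above the zero line and the dips below it could look roughly like this. It is a sketch only: the scores and both thresholds are invented, and `runs_above` is a hypothetical helper, not something the browser exposes.

```python
def runs_above(scores, threshold):
    """Return half-open (start, end) index ranges where scores stay above threshold."""
    runs, start = [], None
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i                      # a run begins here
        elif s <= threshold and start is not None:
            runs.append((start, i))        # the run ended at the previous position
            start = None
    if start is not None:                  # run reaching the end of the region
        runs.append((start, len(scores)))
    return runs


# Made-up per-base scores; 0 is the "baseline noise" level.
toy_scores = [0.1, -0.2, 2.5, 3.0, 2.8, 0.0, -1.5, -2.0, 0.3, 2.2]

peaks = runs_above(toy_scores, 2.0)                    # strongly conserved stretches
dips = runs_above([-s for s in toy_scores], 1.0)       # stretches well below zero
print(peaks, dips)
```

Flipping the sign of the scores lets the same function find the regions that vary more than the baseline.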
Other things I learned
Orthologue genes can be annotated as (1:1) or (1:many). When looking for orthologues of your favorite gene, it may be found exactly once in the other organism (1:1), or it may have been duplicated for evolutionary reasons (1:many). Cool!
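The (1:1) versus (1:many) labels are really just counting. Assuming you have a list of orthologue pairs between two species, a minimal sketch of the classification could be the following; the function name, the extra "many:1" and "many:many" labels, and the gene names are my own illustration, not Ensembl output.

```python
from collections import defaultdict


def classify_orthologues(pairs):
    """Label each query gene '1:1', '1:many', 'many:1' or 'many:many'."""
    targets = defaultdict(set)   # query gene  -> its orthologues in the other species
    queries = defaultdict(set)   # target gene -> the query genes pointing at it
    for query, target in pairs:
        targets[query].add(target)
        queries[target].add(query)

    labels = {}
    for query, found in targets.items():
        left = "many" if any(len(queries[t]) > 1 for t in found) else "1"
        right = "many" if len(found) > 1 else "1"
        labels[query] = f"{left}:{right}"
    return labels


# Made-up pairs: (gene in my species, orthologue in the other species).
toy_pairs = [("geneA", "a1"), ("geneB", "b1"), ("geneB", "b2")]
labels = classify_orthologues(toy_pairs)
print(labels)
```

Here `geneA` has a single counterpart, while `geneB` has been duplicated in the other species.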
If in a sequence track like RefSeq there are vertical colored lines, those are mismatches with the reference (possible SNPs). If these mismatches are found in many reads, they may end up in the summary (histogram chart), indicating an official SNP. Look at the publication track to check for published SNPs too.
Mission accomplished for today.