Hi. In this video, we're going to be talking about bioinformatics. Bioinformatics is the study of the information found within the genome. So what kind of information, if you took a guess, does the genome hold? Right. It holds information about genes, about RNAs, binding sites, non-coding RNAs, regulation sites, all this genetic information. Annotation is the process of marking these functional elements in the genome. Generally, this is done through online software; there are big databases that scientists have developed that you can explore to look at a specific sequence of DNA and figure out if there are regulatory sites here, if this is a gene, where proteins bind, etc.
You can see that there are these contigs here, so we know these sequences. We can say that there's something at this region, potentially a coding region, and all of these different colors represent something different about the gene. You don't need to know what these mean, but if you were to go on one of these databases, you'd see something like this, where there are different markers and different colors, all representing what is in this region right here of the genome. Using that, bioinformatics is a great tool to figure out which parts of the genome are functional and what each part is used for.
Bioinformatics can be used to determine where protein-encoding genes are. The collection of where the protein-coding genes are, and what the protein-coding genes do, gives you the proteome, which is an inventory of all proteins that are encoded by an organism's genome. How it does this is by trying to identify open reading frames. What are open reading frames? Strictly, an open reading frame is a stretch of codons that runs from a start codon to a stop codon; more loosely, these are just sequences that have characteristics of genes. What are some characteristics of genes that you're probably familiar with already? Right. They have 5' ends, they have 3' sequences. Genes have introns, they have exons, and all of these characteristics, such as splice sites, can be put into a software program, and that software program can read a whole genome and identify the potential open reading frames that contain all these different characteristics.
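To make the idea of scanning a genome for open reading frames concrete, here is a minimal Python sketch. It is a simplification, not how real annotation software works: it only uses the strict start-codon-to-stop-codon definition, it only reads the three frames of the forward strand, and it ignores introns and splice sites entirely. The function name `find_orfs`, the `min_codons` cutoff, and the example sequence are all made up for illustration.

```python
# Stop codons in the standard genetic code, written as DNA.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) for each ATG...stop stretch.

    Only the three forward-strand reading frames are scanned.
    `end` is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, dna[start:end]))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


# Toy sequence (hypothetical): one short ORF starting at index 2.
print(find_orfs("CCATGGCTTGCTAAGG"))
```

A candidate found this way is only a prediction; it still has to be confirmed experimentally.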
Another thing bioinformatics can do is identify an organism's codon bias. So what is codon bias? So far, we've told you that different codons can code for the same amino acid. But actually, they're not used equally, and some organisms prefer to use one codon over another. For example, fruit flies, when they code for the amino acid cysteine, have two choices; they can use UGC or UGU. But generally, they prefer UGC: 73% of all cysteines are coded by UGC and not UGU, and that's called codon bias. So in regions of the genome that show this codon bias, we start to say, okay, this is probably a protein-coding region because these codons aren't distributed equally.
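The cysteine example can be sketched as a small counting exercise. The Python below is only an illustration: the function name `codon_fractions` and the toy mRNA are invented, the sequence is assumed to already be in the correct reading frame, and the numbers it produces come from the toy sequence, not from fruit fly data.

```python
from collections import Counter

# The two codons for cysteine in the standard genetic code.
CYSTEINE_CODONS = ("UGC", "UGU")


def codon_fractions(mrna, codons=CYSTEINE_CODONS):
    """Fraction of each synonymous codon among the in-frame codons of `mrna`."""
    mrna = mrna.upper().replace("T", "U")  # accept DNA or RNA letters
    counts = Counter(mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3))
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}


# Toy mRNA (hypothetical): AUG UGC UGC UGU UGC UAA
print(codon_fractions("AUGUGCUGCUGUUGCUAA"))
```

Run over all the coding sequences of an organism, the same counting would give a usage table; a region whose codon usage matches that table looks more like a protein-coding region.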
So when you have an open reading frame, you think it's an open reading frame, but you don't actually know until you do further study. How you can confirm whether or not you have an open reading frame is through cDNA sequences. What is cDNA? cDNA comes from mRNA. Remember, this is messenger RNA, which is going to be translated into a protein. This is a sequence with all the introns removed. It's coding for a protein, which means that unless something happens between the mRNA being made and its translation into a protein, this is going to be expressed, meaning that it's really a gene. If you can isolate the mRNA, just remove everything else, remove the protein, remove all the DNA, and get a solution of just the mRNA expressed in a cell, you can actually reverse transcribe it into DNA. You do that with reverse transcriptase, an enzyme you might be familiar with, which takes RNA and turns it into DNA. When you take mRNA, messenger RNA that's going to be made into a protein, and you reverse transcribe it into DNA, that is called cDNA. cDNA has unique characteristics compared to normal DNA because the introns are removed, so it gives you the exact coding sequence of the gene.
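What reverse transcriptase does to the sequence can be sketched in a few lines of Python. This is an in-silico illustration under simple assumptions: the input is a clean, already spliced mRNA containing only A, U, G, and C, and the function names `first_strand_cdna` and `coding_strand_cdna` are hypothetical.

```python
# Base pairing used when copying RNA into DNA.
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """DNA strand complementary to the mRNA, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))


def coding_strand_cdna(mrna):
    """DNA strand with the same sequence as the mRNA (U replaced by T)."""
    return mrna.upper().replace("U", "T")


mrna = "AUGGCUUGCUAA"  # toy spliced mRNA (hypothetical)
print(first_strand_cdna(mrna))   # strand the enzyme synthesizes
print(coding_strand_cdna(mrna))  # strand you would compare to a predicted ORF
```

Because the mRNA was already spliced, neither strand contains introns, which is why the cDNA can be lined up against a predicted open reading frame.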
So if you can isolate a cDNA sequence, then you know that the ORF that you've found, the open reading frame, is actually a gene encoding a protein. There are these huge collections of sequences called expressed sequence tags. These are short cDNA sequences, and there are large data sets of them. Usually, you collect a ton at a time: you take every mRNA that's in the cell at a given time, turn it into cDNA, and you get these expressed sequence tags that say these are the genes being expressed at this certain time. You can determine what genes are being expressed and where their boundaries are, and that's super important for confirming whether or not those open reading frames are in fact genes. Here's an example. There's not a lot about bioinformatics that's easy to visualize, but here's an example of an open reading frame. There's a start codon here, there's a transcription start site, eventually there's going to be a stop codon way down here, and all of these different characteristics will tell the computer this is likely an open reading frame.
Bioinformatics can do other things too, like predicting DNA binding sites or protein-DNA binding sites. Again, through computer software, it'll search through a genome and look for predicted sequences. Promoters, for example, sometimes have similar sequences. It'll look for splice sites, etc., and say, okay, these sequences are consensus sequences, or they're conserved, so they're likely a promoter, or a transcription start site, or an enhancer, or a splice site, or whatever you're looking for.
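Searching for a consensus sequence can be sketched with a regular expression. The Python below is a rough illustration, not a real promoter predictor: the function name `find_motif` is invented, only a handful of IUPAC ambiguity codes are supported, and the TATA-box-like pattern `TATAWAW` (where W means A or T) is used purely as an example motif.

```python
import re

# A few IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}


def find_motif(dna, consensus):
    """Return the start index of every match of `consensus` in `dna`."""
    pattern = "".join(IUPAC[ch] for ch in consensus.upper())
    # The lookahead lets overlapping matches all be reported.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", dna.upper())]


# Toy sequence (hypothetical) with one match starting at index 3.
print(find_motif("GGCTATAAAAGGC", "TATAWAW"))
```

A hit only says the sequence looks like the consensus; whether a protein actually binds there is a separate question.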
Finally, bioinformatics can also be used to study evolution and DNA similarity. The search that's done is called a BLAST search, and this is on the NCBI website; you can just Google NCBI BLAST and it'll come up. If you have a sequence and you say, I have no idea what this sequence is, let me BLAST it, you can BLAST a nucleotide or protein sequence. It'll spit out all these different organisms with similar sequences and all these different proteins with similar sequences to give you an idea of which organism it comes from and what the function of that specific gene sequence is. Here is an example: You have this human sequence, it's here, and this is protein. I know it's protein because these are the short codes for each amino acid. And you can see that it lists mice, dove, falcon, worms, sea urchins, etc., and it shows how similar these are between organisms. To be honest, between a sea urchin and a human, this is probably a very conserved gene just because there's not a ton of changes, but there are changes there, and you can use these big searches to look at how these genes are similar between different organisms. So that's bioinformatics: looking at the information content of genes, where the genes are, which ones are protein coding, where things are binding, how everything is involved. So with that, let's now move on.