Hi. In this video, we're going to be talking about the human genome and medicine. So, the Human Genome Project was the first major scientific undertaking to sequence a complete human genome. And this is super important because without this sequence, we'd have no real idea how many genes we have, and so we couldn't determine whether mutations in those genes are causing disease. So sequencing the human genome is really the foundation for a lot of the gene therapies and the personalized medicine that you're hearing about today. And so the Human Genome Project's main goal was to identify what was in the human genome. So what were the major components of the human genome? And what they found was actually extremely surprising.
They were expecting the majority of the genome to be made up of protein-coding regions, regions containing genes that produce proteins. But instead, what they found is that those regions make up only about 2% of the genome, meaning about 98%, so the overwhelming majority of the human genome, does not code for proteins, and that was really surprising to the scientists who were doing it. Now, that doesn't mean that there's a small number of genes. There are about 20,000 to 25,000, depending on the textbook you're using. Either number is fine, but it's about 20 to 25k genes. Now, each gene, though, can produce more than one protein. Right? So we have protein isoforms that are made through alternative splicing. And so, although there are only 20 to 25 thousand different genes, they can produce a lot more than 20 to 25k proteins. But still, the overarching concept I want you to grasp here is that the majority of our genome is not composed of genes but is instead composed of other things. And we'll talk a little bit about those things in a second.
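To put that 2% in perspective, here is a rough back-of-the-envelope sketch in Python. The 2% comes from the lecture; the genome size of about 3.1 billion base pairs is an assumption added here as a commonly cited approximation, and the function name `split_genome` is just made up for illustration.

```python
# Assumption (not from the lecture): a haploid human genome of
# roughly 3.1 billion base pairs, a commonly cited approximation.
GENOME_SIZE_BP = 3_100_000_000

# From the lecture: about 2% of the genome is protein-coding.
CODING_FRACTION = 0.02


def split_genome(genome_size_bp: int, coding_fraction: float) -> tuple[int, int]:
    """Return (coding_bp, noncoding_bp) for a genome size and coding fraction."""
    coding_bp = round(genome_size_bp * coding_fraction)
    noncoding_bp = genome_size_bp - coding_bp
    return coding_bp, noncoding_bp


coding_bp, noncoding_bp = split_genome(GENOME_SIZE_BP, CODING_FRACTION)
print(f"Protein-coding: about {coding_bp / 1e6:.0f} million bp")
print(f"Non-coding:     about {noncoding_bp / 1e9:.2f} billion bp")
```

Under those assumptions, the coding part works out to roughly 62 million base pairs, with about 3 billion base pairs left over for everything else.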
Now, the next question was, "Well, are the genes evenly distributed, or are they clustered in particular parts of the genome?" And they actually found that there are gene-rich regions, which are areas with a high concentration of genes, and then gene deserts, which are regions without any genes. And that was another interesting finding. And then finally, comparing individuals, there's about a 99% similarity between individuals, and many sources put it closer to 99.9%. And we know our genomes aren't perfectly identical because if they were, we would all look exactly the same, and we don't. So there are two major types of variation that make the genome different between individuals.
So the first one is copy number variations. These are variations in the number of copies of a gene. So either the gene has been deleted, or it's been duplicated, so different people can carry different numbers of copies. Copy number variation is actually a big source of variation between identical twins because these copy number changes can occur very early in development. And so, even though identical twins have mostly identical genomes, they actually can have differences, and a major difference is copy number variation. And so if it differs between twins, then you can imagine how much it differs between me and you; it's a lot.
Then the second type is single nucleotide polymorphisms, or SNPs for short. These are single nucleotide differences between individuals, so between me and you. And there are actually thousands of these, and really millions across the whole genome, that exist between me and you. They're a major source of variation in the human genome.
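If it helps to see what a SNP looks like concretely, here is a minimal sketch in Python that compares two short sequences position by position. The sequences are invented toy data, the function name `find_snps` is hypothetical, and it assumes the two sequences are already aligned and the same length. Real SNP detection works on sequencing reads and is far more involved, so treat this only as an illustration of the idea.

```python
def find_snps(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (position, base_a, base_b) for every single-base mismatch.

    Assumes both sequences are already aligned and the same length.
    Positions are 0-based.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Hypothetical toy sequences, invented for illustration only.
person_1 = "ATGCCGTAAGCT"
person_2 = "ATGCTGTAAGAT"

snps = find_snps(person_1, person_2)
for position, base_1, base_2 in snps:
    print(f"position {position}: {base_1} in person 1, {base_2} in person 2")
```

Here the two toy sequences differ at two positions, and each of those single-base differences is the kind of thing we mean by a SNP.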
Now, if we look at the human genome composition, you can see that the protein-coding regions here are this dark green area, proteins, and you can see that it's very small and that the overwhelming majority of the entire human genome is made up of other things. And so these are things like transposons, which we've either seen a video about or will see a video about in the future, depending on the order of your textbook. These are kinds of jumping genes that jump around the genome. We have introns, which make up a huge portion. These are non-coding regions between the exons of the protein-coding genes. And these here are, again, transposons as well. We have duplications and heterochromatin here. So these would be gene deserts, right, because heterochromatin is not going to be expressed, so any genes in there aren't going to be expressed. And then this whole 12% of unique sequences, which is an interesting category that can include a lot of things, some of which are still unknown to this day. And so obviously, the human genome is this diverse collection of different sequences that do things other than just encode proteins.
Now, the important thing that came out of the Human Genome Project, and that was a shock at the time, was that these non-coding regions of the genome are just as important as the coding regions. The coding regions are only 2%. Right? So 98% of the genome is non-coding, and these regions turn out to be extremely important. And so there have been a few different projects that have attempted to classify them since the Human Genome Project. One of these is called the ENCODE Project. It stands for the Encyclopedia of DNA Elements. And the ENCODE Project is looking for enhancers, promoters, and pretty much anything that would be a regulatory region in the genome. And so this is a huge undertaking because if we can understand how the genes are regulated, then we may be able to understand what's going wrong in disease. And then, another thing is that a big component of the genome is pseudogenes. So these are sequences that were genes at some time; they resemble genes in a lot of different ways, but they are non-functional or inactive now. That could be due to some type of mutation, a transposon insertion, or a viral genome insertion. Like I said, lots of different ways, but essentially, they were genes, and when you look at the genome, you think, oh, that looks like a gene, but it's not quite because it's not functional anymore. So, with that, let's now turn the page.