Hi. In this video, we're going to be talking about sequencing the genome. So, before we can study the genome and figure out what the functions of all these different DNA pieces are, we have to know its sequence. And so I'm going to give a brief overview of how sequencing works. Obviously, there are different techniques with different, you know, minute details, but this is just a general overview.
Sequencing genomes uses a few main steps that are common to many of the different sequencing methods. So you start with genomic DNA, or at least the majority of it. You have a bunch of DNA and you want to sequence it. The first thing you have to do is process it, and how you process it is that you chop it up into a bunch of pieces. These pieces have to be random and they have to be overlapping, which means that not every piece is unique: some pieces share part of their sequence with other pieces. So if you have this little piece of DNA, another piece will overlap it here, maybe another piece will overlap here, and something else will overlap here, here, and here. All these different pieces have to be overlapping, and we'll figure out why in a minute. But first, how do you chop up the DNA? You can chop up the DNA using a special type of protein called a restriction enzyme. There are a bunch of them, and each one has a specific sequence or two where it will cut the DNA. So you can use combinations of restriction enzymes to get these overlapping segments and chop the entire DNA into small fragments. These fragments are given a special name: a read. Depending on how you chop up the DNA and which restriction enzymes you use, reads can vary between about 100 and 5,000 base pairs long, just generally on average. Reads are super important. We'll talk about them more in a second, but those are the overlapping fragments.
So, here we have DNA. It's blue and pink. You can see there's a sequence here. This is a restriction enzyme that comes in, and it chops here, and it chops here. So now we have these fragments of DNA. They do have these overlapping segments, which can be useful, but mainly what you need to know is that this is how you generate fragments of DNA. You do this for the entire genome, and you generate millions of fragments, if not more. So then you have all these fragments, and you have to sequence them.
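To make the chopping step concrete, here is a minimal Python sketch of one restriction enzyme cutting a stretch of DNA. It assumes EcoRI as the example enzyme (it recognizes GAATTC and cuts after the G), looks at one strand only, and uses a made-up toy sequence; the function name `digest` and its parameters are just for illustration.

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Cut dna at every occurrence of site.

    cut_offset is how many bases into the site the cut falls
    (1 for EcoRI: G^AATTC). Only one strand is modeled here.
    """
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[start:cut])  # piece up to the cut point
        start = cut
        pos = dna.find(site, pos + 1)     # look for the next site
    fragments.append(dna[start:])         # whatever is left after the last cut
    return fragments


toy_dna = "TTGAATTCAGGCCGAATTCTTA"  # hypothetical sequence with two EcoRI sites
fragments = digest(toy_dna)
print(fragments)
```

Running `digest` on separate copies of the same DNA with a different `site` each time would give fragments whose ends fall in different places, which is one way to picture where the overlaps come from.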
So there are many different ways that this happens. One that I'm going to talk about that's mentioned in your book is called pyrosequencing. What happens in pyrosequencing is you take each read, you attach it to a bead, and you amplify it, which means that you make multiple copies of that read. So you have multiple copies of that sequence. When you have multiple copies, you have enough to actually be able to detect the signal from it, because if you just have one copy, whatever signal you're using is going to be really faint. So if you have multiple copies, you can really amplify that signal. The signal that is used in pyrosequencing is actually light. How this is done is you have a machine, and like I said, you've attached the sequence to a bead. It's sitting on a plate in the machine, and this machine will take each of the nucleotides, A, T, C, and G, and run them individually, one at a time, across this plate where all your sequences are. Now, these nucleotides carry extra phosphate groups. So when a nucleotide binds and gets added to the growing complementary strand, it releases a molecule called pyrophosphate. When that's released, it interacts with other chemicals in there, and that converts it to a light signal. So let's say you have a sequence and it's all T's here. Right? I mean, this is not really going to happen, but if it did, and the machine puts in an A, that A will bind, because it's complementary, and when it does, it releases a molecule that gives off a light signal. And because you're doing this in a machine, there's a camera, and that camera detects the light signal. Because these nucleotides are passing by one at a time, it knows which nucleotide caused the light signal, and it will say, okay, well, this is the complementary sequence. So here's an example of what a printout of this might look like.
Each of these peaks represents a light signal. So you can see there are lots of G's. They've been running some nucleotides over and over, and I realize that, you know, it's more complicated than what I'm making it, which is why the x axis isn't just A, T, C, and G. But generally, you can see that right here, this G resulted in a light signal, and so that G is going to be complementary to the actual sequence. So we know the sequence here is C. And you can do this over and over and over again, throughout the whole sequence, however long it is, 100 base pairs, 5,000 base pairs, and eventually the computer will spit out what the sequence is.
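The one-nucleotide-at-a-time idea can be sketched in a few lines of Python. This is a simplified simulation, not how a real instrument's software works: it assumes a fixed flow order of A, T, C, G, ignores strand direction, and treats the light signal as simply the number of nucleotides added in that flow. The names `pyrosequence` and `COMPLEMENT` are made up for this example.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def pyrosequence(template, flow_order="ATCG", max_cycles=100):
    """Flow one nucleotide at a time over the template.

    Returns (flows, synthesized), where flows is a list of
    (nucleotide, signal) pairs and signal counts how many bases
    were added during that flow (0 means no light).
    """
    pos = 0           # next template position waiting for a partner
    flows = []
    synthesized = []  # the complementary strand as it grows
    for _ in range(max_cycles):
        for nt in flow_order:
            if pos >= len(template):
                break
            count = 0
            # The flowed nucleotide keeps binding while it is complementary.
            while pos < len(template) and COMPLEMENT[template[pos]] == nt:
                count += 1
                pos += 1
            flows.append((nt, count))
            synthesized.append(nt * count)
        if pos >= len(template):
            break
    return flows, "".join(synthesized)


flows, synthesized = pyrosequence("TTCA")
# Complementing the synthesized strand gives back the template sequence.
recovered = "".join(COMPLEMENT[b] for b in synthesized)
print(flows, synthesized, recovered)
```

In this toy run, the first A flow gives a signal of 2 because the template starts with two T's, which is the same reason a real readout can show peaks of different heights.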
Now like I said, this is one way of doing this. There are a lot of different ways, shotgun sequencing and other, newer techniques, but this is really the one that's highlighted in your book. So now you know the sequence of each of these reads, and remember, you probably have millions of these reads. Next, you use computer software to overlap the sequences. Remember, when we originally designed the sequencing step, we made overlapping reads. So you use software to figure out where the overlapping sections are for every single read. What the computer does is find those overlapping segments and say, okay, this is the sequence here, and this is the sequence on either side of it. The software keeps going through each read until it finally connects all the overlapping segments, and this is called sequence assembly. So this is slowly taking each individual read, finding where it overlaps with all the rest of the reads, and forming it into one sequence, which is a consensus sequence. Now, we may have gone over consensus sequences before. It's different from a conserved sequence. A conserved sequence is something that is exact between species, but a consensus sequence doesn't have to be exact, which is important. Right? Because if we're sequencing, for instance, the human genome, and we take my genetic material to sequence, that's not necessarily going to be completely representative of the human genome. Not everyone is a clone of me. Not everyone has blonde hair, has my eye color, has my height. So there are individual differences between where you get the genetic material, say from me, and what the genetic material of other members of the species might look like. So it has to be a consensus sequence, because it's close, but there might be, you know, single nucleotides that are different between me and you and other people.
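Here is a rough Python sketch of the assembly step described above: find the pair of reads with the biggest overlap, merge them, and repeat. This is a simple greedy approach chosen for illustration, not the algorithm any particular assembler uses, and it assumes error-free reads from a single strand. The function names, the `min_len` parameter, and the toy reads are all hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b.

    Returns 0 if the best match is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # nothing overlaps any more; leave separate pieces
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)
    return reads


toy_reads = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]  # made-up overlapping reads
assembled = greedy_assemble(toy_reads)
print(assembled)
```

With millions of real reads, comparing every read against every other read like this would be far too slow, so real software relies on smarter indexing, but the overlap-and-merge idea is the same.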
And so the individual differences prevent a single sequence, say, my sequence, from truly representing the entire human genome. Another thing that this requires is generally multiple reads of each base pair. So an example of this is, say, if you read that there's been tenfold coverage of the genome, that means that each base pair is represented, on average, in about 10 individual reads, so 10 individual fragments. That makes the number of reads even bigger, because you have to have so many reads covering the whole genome. So this looks like this. It actually looks a lot more complicated than this because you're dealing with millions and billions of reads, but this is the basic idea. So let's see what this is. The red parts are things that we know, these are kind of overlapping regions, and the blue part is the unknown sequence. So you get these red parts here, and you find out, okay, well, where are these overlapping? We can say, well, this represents this part of the genome, and now we have this whole sequence that we can compare and create a consensus sequence. So when we finally get the full genome, which is this, we know that each base pair has been represented multiple times, we know that these overlapping segments are located in the proper locations, and we can construct the entire genome based on all of these reads. So, like I said, that's an overview of sequencing. There are many different technologies that do this in slightly different ways, but that's generally how they all do it, even though some of those minor details might be different.
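Coverage and consensus can be shown with a small Python sketch. It assumes the reads have already been placed at their positions (each read is given as a start position plus its sequence), and it calls each base by simple majority vote across the reads that cover it. The function name `consensus`, the toy reads, and the deliberate single-base disagreement in one read are all made up for illustration.

```python
from collections import Counter


def consensus(aligned_reads, length):
    """Majority-vote consensus from reads with known start positions.

    aligned_reads is a list of (start, sequence) pairs.
    Returns (consensus_string, depth), where depth[i] is how many
    reads cover position i. Uncovered positions are called "N".
    """
    columns = [Counter() for _ in range(length)]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    depth = [sum(col.values()) for col in columns]
    calls = "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )
    return calls, depth


toy_alignment = [
    (0, "ATGCGT"),
    (0, "ATGCG"),
    (2, "GCGTAC"),
    (3, "CGTTCGT"),  # disagrees with the other reads at position 6
    (4, "GTACGT"),
]
calls, depth = consensus(toy_alignment, 10)
mean_coverage = sum(depth) / len(depth)
print(calls, depth, mean_coverage)
```

Here the one read that disagrees at position 6 is outvoted by the two other reads covering that position, which is why having several reads over every base pair matters.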
So with that, let's now turn the page.