15. Genomes and Genomics

Sequencing the Genome

Topic summary

Created using AI

Genome sequencing involves processing genomic DNA into overlapping fragments called reads, which are then sequenced using methods like pyrosequencing. This technique amplifies DNA fragments and detects nucleotide binding through light signals. Traditional whole genome sequencing uses plasmids in bacteria for amplification, while next-generation sequencing automates the process in smaller volumes. Challenges in genome assembly arise from repetitive sequences, addressed by paired-end reads that help align known sequences with unknown ones. Sanger sequencing, an early method, utilizes dideoxynucleotides to generate variable strand lengths for sequence determination.

concept

Sequencing Overview

Video transcript

Hi. In this video, we're going to be talking about sequencing the genome. So, before we can study the genome and what all the different functions of all these DNA pieces are, we have to be able to know the sequence of it. And so I'm going to go over a brief overview of just how sequencing works. Obviously, there are different techniques with different, you know, minute details, but this is just a general overview.

Sequencing genomes uses a few main steps that are common to many of the different sequencing methods. So the first thing is that you have genomic DNA, or at least the majority of the genomic DNA. You have a bunch of DNA, and you want to sequence it. The first thing you have to do is process it, and how you process it is that you chop it up into a bunch of pieces. These pieces have to be random, and they have to be overlapping, which means that not every piece is unique. Some pieces share part of their sequence with other pieces. So if you have this little piece of DNA, some of it will overlap here. Maybe another piece will overlap here. Something else will go like this and overlap here, here, and here. All these different pieces have to be overlapping, and we'll figure out why in a minute. But first, how do you chop up the DNA? You can chop up the DNA using a special type of protein called a restriction enzyme. There are a bunch of them, and each one has a specific sequence or two where it will chop the DNA. So you can use combinations of restriction enzymes to get these overlapping segments and chop the entire DNA into small fragments. These fragments are given a special name: a read. Depending on how you chop up the DNA and which restriction enzymes you use, reads can vary between roughly 100 and 5,000 base pairs long, just generally on average. Reads are super important. We'll talk about them more in a second, but those are the overlapping fragments.

So, here we have DNA. It's blue and pink. You can see there's a sequence here. This is a restriction enzyme that comes in, and it chops here, and it chops here. So now, we have these fragments of DNA that exist. They do have these overlapping segments, which can be useful, but, mainly what you need to know is that then you generate fragments of DNA. And you do this for the entire genome and you generate millions, if not even more than that, fragments. So then you have all these fragments, you have to sequence them.
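
To make the chopping step concrete, here is a minimal Python sketch of a restriction digest. It assumes one well-known enzyme, EcoRI, which recognizes GAATTC and cuts between the G and the first A. The function name `digest`, its parameters, and the toy sequence are made up for illustration and are not from the video.

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Cut `dna` at every occurrence of `site`.

    `cut_offset` is how many bases into the site the cut falls
    (1 for EcoRI, which cuts G^AATTC). Only one strand is modeled.
    """
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[start:cut])  # piece up to the cut point
        start = cut
        pos = dna.find(site, pos + 1)     # look for the next site
    fragments.append(dna[start:])         # whatever is left after the last cut
    return fragments


# Hypothetical toy sequence with two EcoRI sites.
genome = "ATGGAATTCCGTTAGAATTCAA"
fragments = digest(genome)
print(fragments)  # ['ATGG', 'AATTCCGTTAG', 'AATTCAA']
```

Digesting separate copies of the same DNA with enzymes that cut at different sites would give fragment sets whose pieces overlap one another, which is the property the assembly step relies on.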

So there are many different ways that this happens. One that I'm going to talk about that's mentioned in your book is called pyrosequencing. What happens in pyrosequencing is you take each read, you attach it to a bead, and you amplify it, which means that you make multiple copies of that read. When you have multiple copies, you have enough to actually be able to detect the signal from it, because if you just have one copy, whatever signal you're using is going to be really faint. With multiple copies, you can really amplify that signal. The signal that is used in pyrosequencing is light. How this is done is you have a machine, and like I said, you've attached the sequence to a bead, which sits on a plate in the machine. This machine will take each of the nucleotides, A, T, C, and G, and run them individually, one at a time, across this plate where all your sequences are. Now these are special nucleotides, and they carry a special molecule. When a nucleotide binds, it releases that molecule, which is called a pyrophosphate. When it's released, it interacts with other chemicals in there, and that converts it to a light signal. So let's say you have a sequence and it's all T's here. Right? I mean, this is not really going to happen, but if it did, and the machine puts in an A, that A will bind, because it's complementary, and when it does, it releases a molecule that gives off a light signal. Because you're doing this in a machine, there's a camera, and that camera detects the light signal. And because these nucleotides are passing by one at a time, it knows which nucleotide caused the light signal, and it will say, okay, this is the complementary sequence. So here's an example of what a printout of this might look like.
Each of these peaks represents a light signal. You can see there are lots of G's. They've been running some nucleotides over and over, and I realize that it's more complicated than what I'm making it, which is why the x axis isn't just A, T, C, and G. But, generally, you can see that right here, this G resulted in a light signal, and so that G is going to be complementary to the actual sequence. So we know the sequence here is C. And you can do this over and over and over again, throughout the whole sequence, however long it is, 100 base pairs or 5,000 base pairs, and eventually the computer will spit out what the sequence is.
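
The one-nucleotide-at-a-time logic can be sketched in a few lines of Python. This is a simplified model, not how any real instrument is programmed: it assumes a fixed flow order (A, T, C, G), assumes the light signal is exactly proportional to the number of bases added in that flow, and ignores strand direction. The names `pyrosequence` and `call_bases` are invented for this example.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def pyrosequence(template, flow_order="ATCG"):
    """Simulate flowing one nucleotide at a time over a template.

    Returns a list of (nucleotide, signal) pairs, where signal is the
    number of bases incorporated during that flow (0 means no light).
    """
    flowgram = []
    pos = 0
    while pos < len(template):
        for nucleotide in flow_order:
            signal = 0
            # The flowed nucleotide binds as long as it is complementary
            # to the next unread template base.
            while pos < len(template) and COMPLEMENT[template[pos]] == nucleotide:
                signal += 1
                pos += 1
            flowgram.append((nucleotide, signal))
            if pos == len(template):
                break
    return flowgram


def call_bases(flowgram):
    """Turn the light signals back into the synthesized (complementary) strand."""
    return "".join(nucleotide * signal for nucleotide, signal in flowgram)


flowgram = pyrosequence("TTCAG")
print(flowgram)              # starts with ('A', 2): two T's in a row bind two A's
print(call_bases(flowgram))  # AAGTC
```

The called bases are the complement of the template, which matches the point in the transcript: a light signal on G tells you the original sequence has a C at that position.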

Now like I said, this is one way of doing this. There are a lot of different ways, shotgun sequencing and other newer techniques, that do this, but this is really the one that's highlighted in your book. So now you know the sequence of each of these reads, and remember, you probably have millions of these reads. Next you use computer software to overlap the sequences. Remember, when we originally designed the sequencing step, we made overlapping reads. So you use software to figure out where the overlapping sections are for every single read. What the computer does is it finds those overlapping segments, and it says, okay, this is the sequence here, and this is the sequence on either side of it. That software continues to go through and read each segment until it finally connects all the overlapping segments, and this is called sequence assembly. So this is slowly taking each individual read, finding where it overlaps with all the rest of the reads, and forming it into one sequence, which is a consensus sequence. Now, we may have gone over consensus sequences before. A consensus sequence is different from a conserved sequence. A conserved sequence is something that is exact between species, but a consensus sequence doesn't have to be exact, which is important. Right? Because if we're sequencing, for instance, the human genome, and we take my genetic material to sequence, that's not necessarily going to be completely representative of the human genome. Not everyone is a clone of me. Not everyone has blonde hair, has my eye color, has my height. So there are individual differences between where you get the genetic material, say from me, and what other members of the species and their genetic material might look like. So it has to be a consensus sequence, because this is close, but there might be single nucleotides that are different between me and you and other people.
And so the individual differences prevent a single sequence, say, my sequence, from truly representing the entire human genome. Another thing that this requires is generally multiple reads of each base pair. So, for example, if you read that there's been tenfold coverage of the genome, that means that each base pair is represented, on average, in about 10 individual reads, so 10 individual fragments. That makes the number of reads even larger, because you have to have so many reads covering the whole genome. So this looks like this. It actually looks a lot more complicated than this, because you're dealing with millions and billions of reads, but this is basically what it looks like. So let's see what this is. The red parts are things that we know, these are kind of the overlapping regions, and the blue part is the unknown sequence. So you get these red parts here, and you find out, okay, where are these overlapping? We can say, well, this represents this part of the genome, and now we have this whole sequence that we can compare to create a consensus sequence. So when we finally get the full genome, which is this, we know that each base pair has been represented multiple times. We know that these overlapping segments are located in the proper locations, and that we can construct the entire genome based on these reads. So that, like I said, is an overview of sequencing. There are many different technologies that do this in slightly different ways, but that's generally how they all do it, even though some of those minor details might be different.
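
As a rough illustration of the overlap step, here is a small Python sketch that repeatedly merges the two reads with the longest suffix-to-prefix overlap. This greedy merge is a toy stand-in chosen for brevity; it is not the algorithm the video's software uses, and real assemblers handle sequencing errors, both strands, and repeats. All function names, the `min_len` threshold, and the reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair of reads until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more; remaining pieces stay separate
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


def average_coverage(reads, genome_length):
    """Average number of reads covering each base: total read bases / genome length."""
    return sum(len(r) for r in reads) / genome_length


reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
contigs = greedy_assemble(reads)
print(contigs)                       # ['ATGCGTACGTTAGC']
print(average_coverage(reads, 14))   # about 1.57-fold for this toy example
```

With tenfold coverage, the same calculation would give roughly 10, meaning each position is backed by about 10 reads that can be compared when building the consensus.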

So with that, let's now turn the page.

concept

Traditional vs. Next-Gen

Video transcript

Okay. So now let's talk about traditional versus next-generation sequencing. Of course, since the time that sequencing the genome became possible, many different types and methods have been developed over the years that have improved upon this technology. What your book calls traditional whole genome sequencing, or traditional WGS, requires the use of cells. And this is the earlier method, right? It's traditional, so it's going to be the earlier way that the genome was sequenced. How this happens is you generate DNA fragments, like we said before, and to sequence them you put them into plasmids. Remember, plasmids are bacterial DNA. We give these plasmids a special name, vectors, because we're putting the genetic information into them and then putting them into bacteria. So they're vectors for the genetic material that we're putting in. So we generate the DNA fragments, we put them in vectors, these plasmids, and we put them into bacteria and grow up the bacteria. That's how you get multiple copies of that small read: as the bacteria replicate themselves, they replicate that DNA, making multiple copies of the fragment that you put in. After you get enough bacteria, you have a ton of copies of this, and you can take that DNA back out of the bacteria, extract it, and begin to read the sequence through whatever sequencing method you want to use, shotgun sequencing, pyrosequencing, whatever. Then you tally the reads, and you again use computer software to overlap them and connect them. In this case, we call the results sequence contigs, because they're the contiguous sequences that the overlapped reads are arranged into. So, that was exactly like the picture I showed in the previous video of all those different reads being overlapped.

Next-generation genome sequencing is very similar. Right? I mean, we went over the basic sequencing steps, but this one does not use cells. You don't need cells to amplify that DNA. Instead, you use cell-free reactions, using various laboratory techniques, mainly PCR, if you're familiar with this. If you're not, don't worry about it, but if you are, PCR is a good way to amplify that DNA. And then you can use sequencing software to read the sequence. With the traditional method you had to grow bacteria, and bacteria take up a lot of room. You have to grow a lot of them, and it's not very easy, if you have 10,000,000,000 reads, to grow 10,000,000,000 flasks of bacteria. But next-generation whole genome sequencing uses very small reaction volumes and is generally automated through the use of a robot, so you can potentially run billions of wells this way.

So this is an example of traditional whole genome sequencing. You start with DNA, this is the genome. You extract it, you fragment it, and you put it into these vectors. Remember, vectors are circular. These are plasmids, circular bacterial DNA, and the green sequence here is the sequence you're interested in. You put them into bacteria, and the bacteria grow, divide, and replicate, creating many copies. You can isolate and extract the DNA, then you sequence the vector itself, and then you have a bunch of fragments, represented by these arrows, which you overlap to determine the actual sequence. So those are the two main types, traditional whole genome sequencing and next-gen. Traditional requires a lot more work, a lot more material, and growing in live cells, whereas next-gen is much more automated and can be done in a very small setting, with small reaction volumes in a machine, without cells. So, with that, let's now move on.

concept

Sequencing Difficulties

Video transcript

Hello, everyone. In this lesson, we are going to be learning about the difficulties that come along with entire genome assembly, or whole genome assembly. Okay. So the entire genome can be particularly difficult to sequence, and that's because the genome has some characteristics that are difficult to track. For example, a large share of our genome, a large share of our DNA, is composed of repetitive sequences that are just a t, a t, a t, a t for thousands of base pairs. They don't particularly code for anything, but if we're trying to sequence the entire genome, we're gonna need to know those sequences and where they belong and how they align with the other complementary sequences. This can pose a problem because it's difficult to know where a repetitive sequence of DNA begins, where it ends, and where it overlaps with other repetitive sequences of DNA. So these are gonna be some of the genome characteristics that make a genome very difficult to assemble. Repetitive DNA sequences are generally much longer than the actual known sequences of DNA, or the reads, and they're generally much longer than the coding genome. Repetitive sequences are very common in our genome, and that can make it hard to determine where the overlaps begin, where they end, and where this entire giant string of a's and t's, or g's and c's, actually came from. The way that we're going to combat this issue is we're going to use paired-end reads. Paired-end reads are a technique that we utilize to put these repetitive sequences in the correct location and in the correct alignment. Alignment is very important, and I'll explain that in just a second. So, paired-end reads are pairs of sequences that are read from opposite ends of the genomic inserts. Basically, we have this giant repetitive sequence, and then we have these known sequences of DNA on either end of that giant repetitive sequence.
And paired-end reads may span the gap and help determine the sequence between the two contigs.

So, if we look at this particular diagram here, what I want you guys to know is that the known sequence is going to be represented by the arrows. The unknown sequence, which is usually the repetitive one, is going to be represented by the line. And as you guys can see in our key here, it says roughly known length but not known sequence. So, we have a general understanding that maybe there's a thousand base pairs between these two known sequences, but we don't know the exact sequence because it's probably repetitive, and we don't particularly need to know that. So, each of these, wherever it has two arrows and an unknown piece in between, is gonna be a fragment. And, we're going to know the sequence that is represented by the arrow.

Now, let me show you how it can be useful to know this information. It's especially useful when you're trying to align the DNA. So let's have a look, and I'll give an example. Let's say that we have this sequence of DNA here, and it's got the two arrows, and it's got the unknown piece in the middle. If I put in the complementary strand, we know that these two match up perfectly. They align correctly. They have complementary known reads, represented by the arrows, and they have complementary unknown sequences, which are probably repetitive. So that is going to be matching. This is normal. This is complementary. But you can also see when things have been deleted, or inserted, or inverted, or duplicated. So, paired-end reads are very helpful for determining how the chromosome has changed over time, whether it's had a sequence insertion, deletion, inversion, rearrangement, duplication, anything like that.

So, if you guys see a deletion, this is what it's probably going to look like. So, if this is the deletion, you're going to have your known sequences. But then, you guys can see that the unknown sequence, some of it has been deleted because it's not as long anymore. So we can see that the top strand has a sequence of that unknown area that has been deleted. So, there has been a deletion in here when it used to be this particular size. Okay, guys?

Now, you can also see when there's been an inversion. Because if there's been an inversion, you're going to see the known sequences change their direction. So, if you know the read on one end reads in this particular direction, and then that sequence is completely flipped, there has been an inversion. This is what an inversion is gonna look like. You're gonna have the normal sequence here, and then you're gonna have the normal read on one end, the unknown sequence, and then this known sequence going in the incorrect direction. An inversion has happened on this read. So you guys can see that an inversion has happened in this chromosome in this particular area. We don't know where it might have started in here. We're not particularly sure, but we know that it has happened.

And since the unknown region is repetitive, we may never really know where in the DNA it was inverted. Now, you can also see things like duplication. So let me scroll down a little so we have some more room. You can also see things like duplication. So let's say that this is normal right here. And then, you have this. And it's much longer. It looks similar to the deletion. This could be a duplication or a deletion. You have to know what the normal length of this particular sequence is to know if part of it was deleted, or if part of it was duplicated. So we would say that a duplication of the unknown sequence happened in this particular strand of DNA because those two known sequences got farther and farther apart. For some reason, now there's more unknown sequence in between them. So some sort of duplication happened here.

Now you can also see things like repeat insertions. I'll just draw that for you guys really quick so you know what it might look like. So, this would be the normal one, then you have what looks like a normal one again, but then, what if you have this sequence added on? Let me get out of the way so you guys can see. What if you have another sequence? So, we have our unknown region with our reads, and then, wait, another unknown region, and another read. This could be a repeat insertion, where that sequence of DNA was duplicated and then inserted into the DNA. So, basically, paired-end reads are utilized to better understand DNA alignment and what may have happened to the DNA at any point in time: a deletion, inversion, duplication, or insertion. They're also utilized to understand regions of the genome that may have really repetitive unknown sequences of DNA but are flanked by known sequences of DNA. So we're simply going to sequence up to the point where we hit the repetitive DNA, and then we're probably not going to sequence any further, but we know the general length of that particular sequence. And this is gonna help us combat some of the difficulties that come along with sequencing the entire genome, which is made up of a lot of repetitive sequences.
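
The diagrams in this lesson boil down to two checks on each pair: is the stretch between the two known reads about the expected length, and do the reads point the expected way? Here is a minimal Python sketch of that reasoning. The `ReadPair` class, the `classify_pair` function, the 1,000 base pair expected gap, and the 100 base pair tolerance are all assumptions made up for illustration; real variant callers work from mapped coordinates and are considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    gap: int              # observed length between the two known end reads, in bp
    orientation_ok: bool  # False if one known read points the wrong way


def classify_pair(pair, expected_gap, tolerance=100):
    """Label a read pair following the cases drawn in the lesson."""
    if not pair.orientation_ok:
        return "inversion"                 # a known read flipped direction
    if pair.gap < expected_gap - tolerance:
        return "deletion"                  # unknown region is shorter than normal
    if pair.gap > expected_gap + tolerance:
        return "duplication or insertion"  # unknown region is longer than normal
    return "normal"


pairs = [
    ReadPair(gap=1000, orientation_ok=True),
    ReadPair(gap=400, orientation_ok=True),
    ReadPair(gap=1900, orientation_ok=True),
    ReadPair(gap=1000, orientation_ok=False),
]
labels = [classify_pair(p, expected_gap=1000) for p in pairs]
print(labels)  # ['normal', 'deletion', 'duplication or insertion', 'inversion']
```

As in the lesson, telling a deletion from a duplication depends entirely on knowing the normal length, which is why `expected_gap` has to be supplied.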

Okay, everyone. Let's go on to our next lesson.

concept

Sanger Sequencing

Video transcript

Okay. So now I want to talk about a method of sequencing, or it's not really a new method, it's actually a very old method called Sanger Sequencing. Sanger Sequencing was one of the first methods used to sequence DNA.

So, how Sanger Sequencing worked is it took advantage of special nucleotides. These are the bases, essentially, A, T, C, and G, but they were specially made. These specially made nucleotides were called dideoxynucleotides, or ddNTPs for short. Now, these were made so that if one was incorporated into a new strand, replication would stop. So if an A is needed and a ddATP is added instead of a normal A, the polymerase would fall off and replication would not continue.

And so, how you did this is you had four separate reactions, each with a normal amount of everything you would need to replicate the DNA. But then in each reaction, you placed a really small amount of one ddNTP, a different one for each reaction. So you had one reaction with a little bit of ddATP, one reaction with ddTTP, another with a small amount of ddCTP, and another with ddGTP.

Four reactions, and they had small amounts of these. And the reason that you had a small amount of them, right, is because you want the majority of replication to take place normally, and then you want, just every once in a while, the incorporation of one of these ddNTPs. That way, if it's a rare incorporation, you get a variety of different strands. But if it was a common incorporation, replication would barely get started, because it would stop almost immediately. And you don't want that. You want this variety of different strands created.

And I'll show you an example of that in a second if that's not clear. So, because the incorporation of a ddNTP stops elongation of the new DNA strand, you're going to generate a variety of different strand lengths within each reaction, and the strand lengths in each reaction will differ from those in the others.

And so, then, you take all of these reactions, and now you have a bunch of different strand lengths, depending on where replication was stopped and which ddNTP was used. Then you take this variety of fragments and run them in a way that separates them by size. Because you split things into four different reactions, you know that one of those will always stop at A’s, one always stops at T’s, one always stops at C’s, and one will always stop at G’s.

And that is how you figure out, you know, which position has a G, because that nucleotide caused the stop. So, here's an example of this. You have these four reactions, right? The pink is T, the green is A, the blue is G, and the red is C. And you have this sequence of DNA here, and you want to know the sequence. Well, you put it into the different reactions and you generate all these different lengths, and so what you can say is, okay, this stopped here, and that was in the T reaction, so that must be a T. This one stopped here, and that was in the A reaction, so this is going to be an A. This one stops here, and this was in the T reaction, so it's going to be a T, and so on and so forth for the G's and C's, all the way to the end, until you get this entire sequence here. All these fragments have stopped because they've incorporated the appropriate ddNTP, which caused the stop and created these different fragments, and that allows you to figure out what this sequence is.
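
The read-out logic of the four reactions can be sketched in Python. This is an idealized model with made-up names (`sanger_reactions`, `read_gel`): it assumes every possible stopping point shows up in its reaction, treats the template in the order the polymerase reads it, and ignores strand direction and primers.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sanger_reactions(template):
    """Return, for each ddNTP reaction, the lengths of the terminated strands.

    A strand of length n ends up in the reaction for base X when the
    n-th base of the newly made strand is X (its ddNTP stopped elongation).
    """
    new_strand = "".join(COMPLEMENT[base] for base in template)
    reactions = {base: [] for base in "ATCG"}
    for length, base in enumerate(new_strand, start=1):
        reactions[base].append(length)
    return reactions


def read_gel(reactions):
    """Sort all fragments by size and read off which reaction each came from."""
    by_length = sorted(
        (length, base)
        for base, lengths in reactions.items()
        for length in lengths
    )
    return "".join(base for _, base in by_length)


reactions = sanger_reactions("TACGGT")
print(reactions)            # {'A': [1, 6], 'T': [2], 'C': [4, 5], 'G': [3]}
print(read_gel(reactions))  # ATGCCA
```

Reading the fragments from shortest to longest gives the newly synthesized strand; complementing it base by base recovers the template that was being sequenced.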

So that is how Sanger sequencing was done and how some of the very earliest forms of DNA sequencing were performed. So with that, let's now move on.

Problem

Restriction enzymes are proteins responsible for what?

Labeling DNA with molecular probes

Chopping the DNA at specific sequences

Amplifying a short DNA sequence

Compiling paired end reads

Problem

What is the name of a short sequenced DNA fragment?

Read

Contig

Consensus Sequence

Overlaps

Problem

The purpose of a sequence assembly is to what?

Use reads to build a conserved sequence

Use reads to build a consensus sequence

Use reads to form a vector

Use reads to form a labeled sequence

Problem

Which of the following sequence techniques requires the use of vectors?

Pyrosequencing

Traditional whole genome sequencing

Next generation whole genome sequencing

Sanger sequencing

Problem

Dideoxy nucleotides (ddNTPs) are used in Sanger sequencing because they have what function?

ddNTPs add fluorescence to the DNA sequence

ddNTPs speed up DNA amplification

ddNTPs stop elongation once they are incorporated into a growing sequencing reaction

ddNTPs prevent stalling of DNA sequencing reactions

Here’s what students ask on this topic:

What are the main steps involved in genome sequencing?

Genome sequencing involves several key steps. First, the genomic DNA is fragmented into overlapping pieces called reads. These fragments are then sequenced using various methods, such as pyrosequencing, which involves attaching the DNA fragments to beads, amplifying them, and detecting nucleotide binding through light signals. The sequences of these reads are then determined. Finally, computer software is used to assemble the reads into a complete sequence by finding overlapping segments and creating a consensus sequence. This process ensures that the entire genome is accurately represented.

How does pyrosequencing work in genome sequencing?

Pyrosequencing is a method used in genome sequencing where DNA fragments, or reads, are attached to beads and amplified. The amplified reads are then placed in a machine that sequentially adds nucleotides (A, T, C, G) one at a time. Each nucleotide has a special molecule that releases pyrophosphate when it binds to the DNA, producing a light signal. A camera detects these light signals, and the machine determines which nucleotide caused the signal. By repeating this process, the sequence of the DNA fragment is determined. The sequences are then assembled into a complete genome using computer software.

What are the differences between traditional and next-generation genome sequencing?

Traditional genome sequencing involves inserting DNA fragments into plasmids, which are then introduced into bacteria. The bacteria replicate, amplifying the DNA fragments. The DNA is then extracted and sequenced. This method is labor-intensive and requires growing large amounts of bacteria. Next-generation sequencing, on the other hand, uses cell-free reactions, often involving PCR, to amplify DNA. It is highly automated, using small reaction volumes and robots, allowing for the sequencing of billions of reads simultaneously. This makes next-generation sequencing faster, more efficient, and less labor-intensive compared to traditional methods.

What challenges are associated with whole genome assembly?

Whole genome assembly faces several challenges, primarily due to repetitive sequences in the genome. These repetitive sequences, which can be thousands of base pairs long, make it difficult to determine where they begin and end, and how they align with other sequences. To address this, paired-end reads are used. These are sequences read from opposite ends of genomic inserts, helping to span gaps and align repetitive sequences correctly. This technique helps in accurately assembling the genome by providing information on the relative positions of known sequences flanking the repetitive regions.

How does Sanger sequencing work?

Sanger sequencing, one of the earliest DNA sequencing methods, uses dideoxynucleotides (ddNTPs) to terminate DNA synthesis. In this method, four separate reactions are set up, each containing a small amount of one type of ddNTP (A, T, C, or G). These ddNTPs cause the DNA polymerase to stop replication when incorporated. This results in a mixture of DNA fragments of varying lengths. The fragments are then separated by size using gel electrophoresis. By analyzing the pattern of terminated fragments, the DNA sequence can be determined. Each fragment's length corresponds to the position of the ddNTP, revealing the sequence of the original DNA strand.

Your Genetics tutor

Kylia Goodner

Genetics and Cell Biology lead instructor
