Exon shuffling can lead to a common domain being found in a variety of proteins.

If a LINE has a weak poly(A) signal, transcription sometimes continues into an adjacent 3′ gene (eventually terminating at that gene’s strong poly(A) signal). ORF2 then reverse-transcribes the combined RNA transcript of the LINE and gene, eventually inserting the gene at a new location along with the LINE in a phenomenon known as exon shuffling.

Exon shuffling can also occur via double crossover between interspersed repeats.


Interspersed repeats, or moderately repeated sequences, are the most common type of repetitive DNA. Each is present as a single copy at a very large number of loci and can move or jump to new locations. Interspersed repeats account for almost half of human DNA and do not occur in tandem arrays. Individual copies of the same, or nearly the same, sequence, ~100 bp to ~10 kb long, are found at tens of thousands to more than 1 million different positions dispersed all over the genome. This dispersion is the result of repeated insertions of transposons into new sites during evolution. The interspersed elements are either transposons themselves or are derived from other genomic sequences acted on by transposon enzymes.
| Class | Length | Copy # | Genome | Overview | ||
| DNA Transposons | 2 – 3 kbp | ~300,000 | 3% | Transposons move either by direct excision and reinsertion of one DNA element, or by insertion of a reverse-transcribed RNA product. DNA transposons can jump during S phase from a daughter strand into unreplicated DNA, thus increasing their copy number. A DNA transposon integrates with a staggered cleavage of the target DNA followed by ligation of the target 5’ ends to the transposon and filling of the gaps. | ||
| LTR Retrotransposons | 6 – 11 kbp | ~440,000 | 8% |
Retrotransposons are transcribed normally, then reverse-transcribed to form a DNA copy that is inserted into a new site. LTR retrotransposons are retroviruses that have lost the ability to exit and reinfect a cell. The upstream LTR acts as a promoter and the downstream LTR contains a poly(A) site, so transcripts are produced from the integrated element. Between the LTRs is a coding region; its transcripts both encode the proteins needed for transposition and serve as the template for making the DNA copy.
The retroviral genomic RNA is copied into DNA via priming, extension, jumping and repriming steps.
The double-stranded DNA of an LTR retrotransposon integrates by the same mechanism as a DNA transposon, with a staggered cleavage of the target DNA followed by ligation of the target 5’ ends to the transposon and filling of the gaps. |
| Non-LTR Retrotransposons | ||||||
| LINEs | 6 – 8 kbp | ~860,000 | 21% | Non-LTR retrotransposons also encode a reverse transcriptase, but use a different mechanism for insertion. The two proteins encoded by LINEs are ORF1, an RNA-binding protein, and ORF2, a reverse transcriptase and endonuclease. The ORF2 protein makes a nick in an A/T-rich region of the target DNA to allow priming on the poly(A) tail of the LINE RNA. Reverse transcriptase (RT) uses the 3’ end of the nicked target to extend into the LINE RNA, making a DNA copy. RT reaches the end of the LINE RNA and continues into the target at the staggered cleavage. Insertion is completed by cellular enzymes that copy the second strand, degrade the RNA and ligate the fragments together. Most LINE elements are truncated due to incomplete copying of the element during insertion. This makes them inactive for transposition, but they can still be mutagenic upon insertion and can still induce aberrant recombination events. LINEs can result in exon shuffling. | ||
| SINEs | 100 – 300 bp | ~1,600,000 | 13% |
Short Interspersed Elements (SINEs) do not encode proteins but transpose by the same mechanism as LINEs, presumably using the LINE proteins. SINEs carry an internal promoter for RNA Pol III, allowing new RNA copies to be made. The most common SINE is the Alu element, named for the AluI restriction site it contains. Alu elements were originally derived from 7SL RNA, a cytoplasmic small RNA involved in protein secretion. Single Alu elements can evolve into new exons, and Alu elements can result in exon shuffling. |
||
| Processed Pseudogenes | Variable | 1 – ~100 | ~0.4% | |||
Also, there is unclassified spacer DNA that accounts for ~25% of the genome.
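The copy numbers and genome shares in the table can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes a haploid genome of about 3 billion bp and uses the rounded table figures, and the function name is hypothetical. It estimates the average length per copy implied by each class. For LINEs the implied average comes out well under 1 kb, far shorter than a full-length 6 – 8 kbp element, which is consistent with most LINE copies being truncated.

```python
# Rounded figures taken from the table; the genome size is an assumption
# (~3 billion bp for the haploid human genome).
GENOME_BP = 3_000_000_000

# class name -> (approximate copy number, fraction of genome)
REPEAT_CLASSES = {
    "DNA transposons": (300_000, 0.03),
    "LTR retrotransposons": (440_000, 0.08),
    "LINEs": (860_000, 0.21),
    "SINEs": (1_600_000, 0.13),
}


def implied_mean_copy_length(copies: int, genome_fraction: float,
                             genome_bp: int = GENOME_BP) -> float:
    """Average length in bp per copy implied by copy number and genome share."""
    if copies <= 0:
        raise ValueError("copies must be positive")
    return genome_fraction * genome_bp / copies


if __name__ == "__main__":
    for name, (copies, fraction) in REPEAT_CLASSES.items():
        mean_bp = implied_mean_copy_length(copies, fraction)
        print(f"{name}: ~{mean_bp:.0f} bp per copy on average")
```

Because the inputs are rounded, the outputs are order-of-magnitude estimates rather than measured element lengths.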
| Class | Length | Copy # | Genome | Overview | ||
| Solitary Genes | Variable | 1 | ~15% | |||
| Gene Families | Variable | 2 – ~1,000 | ~15% |
A significant percentage of human genes are members of gene families. In some cases, the multiple copies allow increased production of identical gene products (for example, rRNA). In other cases, the different family members have different but related functions (for example, the β-globins). About half of all human genes are solitary genes, like the SUR2 gene. This means that there is only one gene of similar sequence and function in the haploid genome. About half of all human protein-coding genes are duplicated, or members of a gene family with >2 closely related genes. For example, the globin genes are members of a gene family. The β-globin genes on chromosome 11 have exons that are >90% identical. They are also >80% identical to the β-globin genes on another chromosome. The different β-globins have evolved different oxygen affinities and transport properties and are adapted for use in different situations. For example, ε-globin is expressed early in development, where it helps absorb oxygen from maternal hemoglobin across the placenta. How did the duplicated genes arise? Gene duplication occurs by unequal crossing-over between homologous repeats during homologous recombination in meiosis. Duplicated genes do not necessarily remain linked at the same chromosomal locus. Later events can move them to other locations in the genome. |
| Class | Length | Copy # | Genome | Overview | ||
| Tandem Repeats | Variable | 20 – 300 | 0.3% | Encode rRNAs, tRNAs, snRNAs and histones. In some cases, the multiple copies allow increased production of identical gene products, such as rRNA and histones. In these cases, the genes usually exist as tandem arrays. This allows production of millions of copies of the gene product per cell division, as needed for ribosomes, snRNPs and histones. | ||
| Simple-Sequence DNA | 1 – 500 bp | Variable | 3% |
Commonly called “satellite DNA”. Sheared DNA has a buoyant density that depends on its base content. Total DNA gives rise to a main band of average base content. Certain overrepresented simple-sequence repeats give rise to satellite bands because of their skewed base content.
Microsatellite DNA is defined as having very short repeat units of 1-15 nt, such as CAGCAGCAG, repeated 50 or more times. Many human diseases are caused by triplet (especially CAG) repeat expansion mutations. These expansions are thought to arise during rare mistakes in DNA synthesis, when the nascent daughter strand slips backward along the template strand so that additional bases are inserted into the daughter strand. |
| Minisatellite DNA has a longer repeat unit of about 15-100 nt, with tandem array lengths of 500 bp to 20 kb. Differences between individuals in the number of repeats of a minisatellite sequence arise through unequal crossing over between chromosomes during meiosis. Some minisatellite sequences are highly variable in repeat number between individuals. This is the basis of DNA fingerprinting. | ||||||
| Most simple-sequence DNA consists of 14-500 bp units tandemly repeated in long stretches of 20-100 kb. Most of these very long simple-sequence DNAs are either at the centromeres of chromosomes, where they may affect chromosome segregation, or serve as telomeres at the chromosome ends. | ||||||
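Counting repeat units is the basic measurement behind both triplet-repeat expansion and minisatellite-based fingerprinting. The following is a minimal sketch, not a genotyping tool: the function name and the toy sequence are hypothetical, and it only finds perfect, uninterrupted copies of a given unit.

```python
import re


def longest_tandem_run(sequence: str, unit: str) -> int:
    """Return the largest number of back-to-back perfect copies of `unit`."""
    if not unit:
        raise ValueError("unit must be a non-empty string")
    # One or more consecutive copies of the repeat unit.
    pattern = re.compile(f"(?:{re.escape(unit.upper())})+")
    runs = [len(match.group(0)) // len(unit)
            for match in pattern.finditer(sequence.upper())]
    return max(runs, default=0)


if __name__ == "__main__":
    # Toy sequence (made up): a 12-copy CAG tract and a separate 3-copy tract.
    toy = "ATG" + "CAG" * 12 + "TTC" + "CAG" * 3
    print(longest_tandem_run(toy, "CAG"))
```

Comparing the count for the same locus in two samples gives the kind of repeat-number difference the table describes; real data would also need to tolerate interrupted or imperfect repeats.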
Introns have sequences that direct the splicing apparatus during RNA splicing, part of RNA processing:
| Introns begin and end with splice sites that conform to consensus sequences. |
| Introns nearly always begin with a GU encompassed within a larger 5’ splice site consensus. |
| Introns nearly always end with the branch point sequence, several pyrimidines and an AG. |
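The boundary rules in the table can be expressed as a simple check. This is a rough sketch under stated assumptions: the function name, the 10-nt window and the 60% pyrimidine threshold are all hypothetical choices for illustration, the branch point is not tested, and the example sequences are made up.

```python
def looks_like_intron(rna: str, window: int = 10,
                      min_pyrimidine_fraction: float = 0.6) -> bool:
    """Rough test: starts with GU, ends with AG, pyrimidine-rich before the AG."""
    rna = rna.upper().replace("T", "U")  # accept DNA or RNA spelling
    if len(rna) < window + 4:
        return False
    if not rna.startswith("GU") or not rna.endswith("AG"):
        return False
    # The `window` bases immediately upstream of the terminal AG.
    upstream = rna[-(window + 2):-2]
    pyrimidines = sum(base in "CU" for base in upstream)
    return pyrimidines / window >= min_pyrimidine_fraction


if __name__ == "__main__":
    candidate = "GUAAGUACUAACUUUCUUUCCUAG"  # toy intron, not a real sequence
    print(looks_like_intron(candidate))
```

A real splice-site predictor would score the full consensus sequences rather than apply hard cut-offs.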
Transposable elements were first discovered by Barbara McClintock in kernels of corn, where certain mutations caused loss and reinstatement of purple pigment (due to gain and loss of an insertion element that activated pigment genes). The human genome contains ~300,000 DNA transposons, which have extensively accelerated evolution because exons and regulatory regions are modular. Transposition is one avenue for exon shuffling, whereby an exon and two flanking transposons are all excised and reinserted elsewhere as a single element (potentially adding a new exon to a gene). There are conservative (cut + paste) and replicative (copy + paste) mechanisms for transposition. However, transposition is potentially mutagenic and over-transposition is very deleterious; thus, it remains a rare event that occurs in about 1 in 10^5 to 1 in 10^7 cells per generation. Transposable elements occur in both eukaryotes (transposons; retrotransposons) and prokaryotes (insertion elements).
| Next Steps | Transposons and retrotransposons are discussed in the article about eukaryotic chromosomes. |
|---|
A gene is the nucleic acid sequence needed to synthesize a particular gene product. A gene includes more than just the coding region that encodes an RNA transcript; there are also control regions that govern synthesis, processing and translation of the RNA transcript. In prokaryotes, the entire coding region encodes a continuous polypeptide sequence. In eukaryotes, coding regions contain exons (50-250 nucleotides) that encode polypeptide sequences and introns (500-50,000 nucleotides, removed during RNA processing) that do not. Higher eukaryotes have not only introns within genes, but also large intergenic regions. For example, a ~80 kb region in Saccharomyces cerevisiae (baker’s yeast) contains 40 genes; the ~80 kb region encompassing the human β-globin cluster contains only 5 genes. Much of this extra DNA consists of the repeats described here. Exons often encode modular units that are included or excluded via RNA processing. Exons are usually highly conserved while introns are barely conserved. For example, SUR2 exons are 90% identical between mice and humans while SUR2 introns are less than 10% identical between mice and humans. A lack of inter-species sequence conservation suggests a lack of function.
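Conservation figures such as "90% identical" are percent-identity values computed over aligned sequences. The sketch below shows the calculation in its simplest form, assuming the two sequences are already aligned, equal in length and free of gaps; the function name and the example sequences are hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of aligned positions at which two equal-length sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    # Made-up 10-nt fragments that differ at one position.
    print(percent_identity("ACGTACGTAC", "ACGTACGTAT"))
```

Real comparisons first require an alignment step that places insertions and deletions, which this sketch leaves out.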

| Monocistronic | Most eukaryotic genes are monocistronic, meaning their mRNAs encode a single protein. Often, a eukaryotic primary transcript forms a single mRNA that encodes a single protein. Most eukaryotic mRNAs have a 5′ cap structure that directs ribosome binding, with translation beginning only at the closest AUG codon. |
| Polycistronic | Prokaryotic genes are mostly polycistronic, with one mRNA encoding multiple proteins involved in a biological process. Along the mRNA, there is a ribosome binding site near each coding region’s start site. Translation can initiate at any of these sites, allowing production of different proteins from one mRNA. |
A transcription unit is a region of DNA that is transcribed under the control of a particular promoter. While a gene and a transcription unit (like the lac operon) are distinguishable in prokaryotes, the two terms are used interchangeably in eukaryotes. There are simple and complex eukaryotic transcription units. The RNA transcript of a simple transcription unit is processed to yield a single mRNA encoding a single protein. Complex transcription units, which are more common, encode an RNA transcript that is processed to form different monocistronic mRNAs, each encoding a different protein. A single transcript can be processed along different pathways via:
| Alternative Splicing | mRNAs have the same 5′ and 3′ exons but different internal exons. |
| Alternative Poly(A) Sites | mRNAs have the same 5′ exons but different 3′ exons. |
| Alternative Promoters | mRNAs have different 5′ exons but share 3′ exons. |
| Next Steps | Study about the eukaryotic chromosome. |
|---|
The total mass of histones associated with DNA in chromatin is about equal to that of the DNA. Interphase chromatin and metaphase chromosomes also contain small amounts of a complex set of other proteins. For instance, a growing list of DNA-binding transcription factors has been identified in association with interphase chromatin. These critical nonhistone proteins help regulate transcription. Other low-abundance nonhistone proteins associated with chromatin regulate DNA replication during the eukaryotic cell cycle.
A few other nonhistone DNA-binding proteins are present in much larger amounts than the transcription or replication factors. Some exhibit high mobility during electrophoretic separation and have thus been designated high-mobility group (HMG) proteins. When genes encoding the most abundant HMG proteins are deleted from yeast cells, normal transcription is disturbed in most other genes examined. Some HMG proteins have been found to bind DNA cooperatively with transcription factors that bind specific DNA sequences, stabilizing multiprotein complexes that regulate transcription of a neighboring gene.
A mammalian chromosome is massive, with tens to hundreds of megabases of DNA. The entire haploid human genome contains about 3 billion base pairs of DNA. Only about 5% of this encodes functional RNAs or proteins or controls their production. Eukaryotic chromosomes are bound to structural proteins to form chromatin. During metaphase, chromatin is highly condensed into the recognizable metaphase chromosome structure. During interphase, chromatin is highly decondensed so that regulatory proteins can access the DNA.
| Cell Stage | Status of Chromosome |
|---|---|
| Interphase | Chromosomes are highly decondensed in most regions, allowing access of regulatory proteins for transcription and replication. Within the nucleus, individual chromosomes are found within diffuse but non-overlapping domains. |
| M Phase | Duplicated chromosomes condense into defined sister chromatids to allow their segregation at cytokinesis. After chromosome condensation, the nuclear envelope breaks down in a process controlled by the nuclear lamina so that the chromosomes can segregate to opposite ends. At metaphase, chromosomes are aligned along the metaphase plate and sister chromatids are split at the centromere to segregate to opposite poles of the dividing cell. |
During evolution, large rearrangements can occur in the size and number of chromosomes. A syntenic region contains genes that are found in the same order in different species, although not always on the same chromosome. For example, the Indian Muntjac has three large chromosomes and a tiny X chromosome; the very similar Reeves Muntjac has just as much DNA, often in the same sequence, but divided among 23 chromosomes. These chromosomal rearrangements are rare, but are extremely important for speciation because they make productive mating impossible. The number, sizes and shapes of metaphase chromosomes constitute the karyotype (distinctive for each species). During metaphase, chromosomes are distinguished by banding patterns and chromosome painting.
The region of the chromosome where the sister chromatids are held together is called the centromere. This assembles a structure called the kinetochore that is required for attachment to microtubules during alignment at the metaphase plate, splitting of the sister chromatids, and movement to the spindle poles. Because of the nature of DNA replication, a linear chromosome requires special sequences at the ends called Telomeres. DNA replication requires an RNA primer to initiate synthesis, which is degraded after priming. The loss of these primers on the lagging strand of the chromosome ends will result in a loss of information with each round of replication. Telomerase is a special enzyme that uses its own RNA template to add telomeric repetitive DNA to chromosome ends. Chromosomes require replication origins (ARS), centromeres, and telomeres for proper replication, mitotic segregation, and maintenance. These sequences were first identified in yeast. Adding telomeric DNA to a DNA containing an ARS and Centromere allows its maintenance as a linear chromosome. Yeast artificial chromosomes containing ARS, Cen, and Tel elements allowed the cloning of large fragments of human chromosomes.
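The end-replication problem described above amounts to a fixed loss at each chromosome end per round of replication, offset by whatever telomerase adds back. The sketch below works through that arithmetic. All numbers in it are placeholders chosen for illustration, not measured values, and the function name and its parameters are hypothetical.

```python
from typing import Optional


def divisions_until_critical(start_bp: int, loss_per_division_bp: int,
                             critical_bp: int,
                             telomerase_gain_bp: int = 0) -> Optional[int]:
    """Rounds of replication until telomere length falls to `critical_bp`.

    Returns None when telomerase adds back at least as much as is lost,
    so the telomere never shortens in this simple model.
    """
    net_loss = loss_per_division_bp - telomerase_gain_bp
    if net_loss <= 0:
        return None
    if start_bp <= critical_bp:
        return 0
    # Ceiling division: the last partial step still crosses the threshold.
    return -(-(start_bp - critical_bp) // net_loss)


if __name__ == "__main__":
    # Placeholder values: 10 kb telomere, 100 bp lost per round, 5 kb threshold.
    print(divisions_until_critical(10_000, 100, 5_000))
    # With telomerase fully compensating for the loss, no limit is reached.
    print(divisions_until_critical(10_000, 100, 5_000, telomerase_gain_bp=100))
```

The model treats loss and gain as constant per round, which is a simplification of how telomere length actually varies.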
| Plasmid | Recipient | Growth of Progeny | Mitotic Segregation in Progeny | Observation |
|---|---|---|---|---|
| LEU+ Circular | LEU- Yeast | None | n/a | A LEU+ plasmid alone does not restore the LEU+ phenotype to a LEU- cell. |
| LEU+ ARS+ Circular | LEU- Yeast | Some | Poor | Replication occurs, but poor segregation means only ~10% of progeny carry the plasmid. |
| LEU+ ARS+ CEN+ Circular | LEU- Yeast | Yes | Good | A centromeric (CEN) genome fragment is needed for strong segregation. |
| LEU+ ARS+ CEN+ Linear | LEU- Yeast | None | n/a | Linearization (via restriction enzymes) of a TEL- circular plasmid makes it unstable. |
| LEU+ ARS+ CEN+ TEL+ Linear | LEU- Yeast | Yes | Good | Linear plasmids must carry a telomeric (TEL) fragment at each end to remain stable in progeny cells. |
Some specialized eukaryotic cells increase cell volume via endomitosis, where DNA synthesis is repeated without cell division and a normal chromosome develops into a giant polytene chromosome (first observed in Drosophila melanogaster larval salivary glands). Centromeres and telomeres endoreplicate poorly, leading to a bundle of duplicate chromatids (as many as 1,000) held together at the chromocenter; thus, the ploidy of the cell remains constant.

Chromosome puffs are diffuse uncoiled regions of the polytene chromosomes where RNA transcription occurs; a large chromosome puff is a Balbiani ring. In addition to increased nuclear and cellular volume, polytene cells have metabolic advantages since multiple gene copies facilitate high levels of gene expression (which would be particularly useful in larval cells). Polytene chromosomes have very distinctive banding patterns that are useful for mapping the location of genes and observing their transcriptional activation.
Below are examples of different types of DNA-binding proteins. The most common and best studied DNA-binding proteins are the Zinc finger proteins, the Helix-turn-helix proteins, and the Leucine zipper proteins.
| Domain | Description |
|---|---|
| TATA Box Binding Protein | The TATA box binding protein (TBP) is a subunit of the eukaryotic transcription factor TFIID. This protein is somewhat unusual in that its DNA-binding domain binds to the minor groove of DNA. |
| Zinc finger domain | This domain is common in eukaryotic DNA-binding proteins. It was first noticed in the eukaryotic transcription factor TFIIIA. TFIIIA contains 9 repeated modules, each of which contains two cysteine and two histidine residues. These four residues chelate one Zn²⁺ ion. Each finger binds in the major groove of B-DNA. |
| Helix-turn-helix domain | This motif was first noticed as a feature of the crystal structure of the bacteriophage λ Cro protein. The structure of this small regulatory protein contained two α-helices separated by 34 Å, the pitch of a DNA double helix. Model-building studies showed that these two α-helices would fit into two successive major grooves. As the structures of a number of other bacterial regulatory proteins (the CRP protein and the bacteriophage λ cI repressor) were solved, the same structural motif, called a helix-turn-helix, was observed. It consists of two α-helices separated by a short turn (it is not a β-turn). One helix binds to recognition elements within the major groove of DNA; the other helps to keep the binding helix properly positioned with respect to the rest of the molecule. This motif, common in bacterial DNA-binding proteins, also occurs in the eukaryotic homeobox proteins. |
| Leucine Zipper domains | This domain is an important feature of many eukaryotic regulatory proteins. Leucine is a hydrophobic amino acid. When it occurs at every seventh position of an α-helix, the aliphatic side chains are all oriented on the same side of the helix and can interact with another such helix to form a coiled-coil structure. The GCN4 transcription activator in yeast is a dimer in which the leucine zipper region helps to position the two basic regions that bind to the DNA recognition sequence. |
| Helix-Loop-Helix binding motif | A variation of the leucine zipper in which the basic DNA-binding helices are connected to the dimerization helices by a short loop. |
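The "leucine at every seventh position" rule for leucine zippers lends itself to a simple sequence scan. The sketch below is a toy illustration only: the function name, the default of four consecutive heptads and the example peptide are hypothetical, and a real predictor would also consider helix propensity and the other heptad positions.

```python
def leucine_heptad_starts(protein: str, min_repeats: int = 4) -> list[int]:
    """0-based positions that begin `min_repeats` leucines spaced 7 residues apart."""
    if min_repeats < 1:
        raise ValueError("min_repeats must be at least 1")
    protein = protein.upper()
    span = 7 * (min_repeats - 1)  # distance from the first to the last leucine
    return [
        start
        for start in range(len(protein) - span)
        if all(protein[start + 7 * k] == "L" for k in range(min_repeats))
    ]


if __name__ == "__main__":
    # Made-up peptide with leucines at positions 2, 9, 16 and 23.
    toy = "MK" + "LAAAAAA" * 4 + "GG"
    print(leucine_heptad_starts(toy))
```

A hit only shows that the spacing is compatible with a zipper; it does not show that the region actually forms a coiled coil.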