Biology, asked by gunlady, 1 year ago

3. Transposable elements make up more than 40% of the human genome and are inserted more or less randomly throughout the genome. These elements are rare at the four homeobox gene clusters (HoxA, B, C, and D) along with an equivalent region of chromosome 22, which lacks a Hox cluster. Each cluster is about 100 kb in length and contains 9-11 genes. Why would transposable elements be so rare in these clusters?


Answered by Anuj20Kr07Maurya

Eukaryotic genomes contain millions of copies of transposable elements (TEs) and other repetitive sequences. Indeed, approximately half of the sequence content of typical mammalian genomes tends to be annotated as TEs and simple repeats by conventional annotation methods. By contrast, only about 5–10% of mammalian and vertebrate genome sequences comprise genes and known functional elements [1], [2], [3]. The remaining 40–45% of the genome is essentially of unknown function, and is sometimes referred to as the ‘dark matter’ of the human genome. The origins of this ‘dark matter’ fraction of the genome have presumably been obscured, in part, by extensive rearrangement and sequence divergence over deep evolutionary time. Understanding the content and origins of this huge uncharacterized component of the genome represents an important step towards completely deciphering the organization and function of the human genome sequence [4], [5], [6].

The dominant repeat annotation paradigm focuses on the identification of repeat element sequences via alignment to consensus TE sequences, as in the widely used RepeatMasker (RM) approach [7]. Such approaches rely on well-curated libraries of known repeat family consensus sequences, which are usually provided by Repbase [8]. Thus, methods like RM can be described as not masking repeats, per se, but rather masking sequences with clear similarity to repeat consensus library sequences. Ultimately, alignment-based approaches are designed, and tuned, to conservatively mask regions that are clearly identifiable as TEs. Such approaches are therefore expected to be most effective for well-studied genomes with long histories of repeat library curation [9], [10], [11]. Even when TE databases are well-curated, however, there are plausible circumstances where such methods might be expected, a priori, to have poor sensitivity. Consensus sequences may not align well to old and highly diverged TE family members, for example, and alignment-based approaches may have trouble identifying short segments [17].

If half of the human genome can readily be identified as belonging to known TE families, it would seem reasonable to assume that much of the unannotated genomic dark matter may also be derived from TEs [18], even if the precise origins of such sequences are difficult to identify. TEs have long been active in vertebrate genomes, and different families have diversified to varying degrees and at different times along the lineages leading to present-day species [19]. As a result, we expect that hundreds of millions of years of vertebrate evolution would have heavily altered substantial amounts of TE-derived sequence. Insertion, deletion, and sequence divergence would make many such elements quite difficult to identify. We postulated, however, that mutations in these ancient TEs will have produced a great deal of related but diverged sequences, and that the relations among these large clusters of sequences may make them detectable, even when individual sequences are not.

Motivated by these arguments, we developed a novel approach to identify and demarcate likely repetitive regions in large genomes [13], [20], [21]. This approach first identifies short oligos that are highly repeated (similar to some other de novo repeat-finding methods; [22]), but then groups closely-related oligos that occur, as a group, more often than predicted by chance (Figure 1). These ‘P-clouds’ are then used to demarcate regions of the genome that are of putatively repetitive origin. Identification of putative repeat-derived regions using P-clouds is far more rapid than consensus-based alignment identification of TEs, and analysis of the human genome can be accomplished on a modest desktop computer in well under a day [20].
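The two steps described above (count short oligos, then group closely related ones and use the groups to mark regions) can be illustrated with a small Python sketch. This is not the published P-clouds implementation: the function names, the fixed count thresholds `core_min` and `outer_min`, and the "one substitution away" definition of relatedness are all assumptions chosen for illustration, and simple count cutoffs stand in for the statistical test of "more often than predicted by chance".

```python
from collections import Counter


def count_oligos(seq, k):
    """Count every overlapping oligo (k-mer) of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def neighbors(oligo):
    """Yield every oligo that differs from `oligo` by exactly one substitution."""
    for i, base in enumerate(oligo):
        for alt in "ACGT":
            if alt != base:
                yield oligo[:i] + alt + oligo[i + 1:]


def build_clouds(counts, core_min, outer_min):
    """Group related oligos into clouds (simplified, hypothetical thresholds).

    An oligo seen at least `core_min` times seeds a cloud; unassigned
    one-substitution neighbours seen at least `outer_min` times join it.
    """
    clouds, assigned = [], set()
    for oligo, n in counts.most_common():
        if n < core_min:
            break  # remaining oligos are too rare to seed a cloud
        if oligo in assigned:
            continue
        cloud = {oligo}
        assigned.add(oligo)
        for nb in neighbors(oligo):
            if nb not in assigned and counts.get(nb, 0) >= outer_min:
                cloud.add(nb)
                assigned.add(nb)
        clouds.append(cloud)
    return clouds


def mask(seq, k, clouds):
    """Return one boolean per base: True if covered by any cloud oligo."""
    members = set().union(*clouds) if clouds else set()
    covered = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in members:
            for j in range(i, i + k):
                covered[j] = True
    return covered
```

In this sketch a diverged copy such as `ACGA` is too rare to seed a cloud on its own, but it is still marked as putatively repeat-derived because it sits one substitution away from the abundant oligo `ACGT`. That mirrors the idea that related but diverged sequences can be detected as a group even when individual sequences cannot.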
