
Genome Architecture Mapping



Genome Architecture Mapping (GAM) is a cryosectioning-based method in molecular biology for mapping colocalized DNA regions within a cell without a ligation step. This distinguishes it from established methods such as Chromosome conformation capture (3C), which rely on digestion and subsequent ligation to identify interacting DNA segments, a dependence that imposes limitations GAM circumvents. GAM is the first genome-wide approach capable of capturing three-dimensional proximities between any number of genomic loci without ligation.

The fundamental output of the cryosectioning process is a set of “nuclear profiles.” These profiles offer insight into the spatial organization of the genome through their coverage across the genomic landscape: a dataset of values derived from the nuclear profiles can represent the strength of their presence within a genome. From the extent and distribution of this coverage, researchers can infer chromatin interactions, the position of each nuclear profile within the nucleus at the time of sectioning, and overall levels of chromatin compaction.

To translate the raw data generated by GAM into visually interpretable formats, a suite of analytical methods can be employed. The initial data is typically a table indicating the presence (represented by a ‘1’) or absence (represented by a ‘0’) of nuclear profiles within defined genomic windows, organized by chromosome. Subsets of this binary data can be extracted and visualized through various graphical means, including standard graphs, charts, and notably, heatmaps. These visualization techniques transform the abstract binary data into discernible patterns, revealing interactions that might otherwise remain obscured.

The interpretation of these visualizations is multifaceted. Bar graphs can illustrate the distribution of nuclear profiles by their radial position within the nucleus and by their chromatin compaction levels, and can be categorized to give a generalized overview of how frequently nuclear profiles are detected within specific genomic windows. Radar charts, a type of circular graph, excel at representing the percentage distribution of a variable across multiple categories. For genomic data, radar charts can depict the prevalence of genomic windows within specific “features” of the genome, highlighting regional characteristics, and they are also useful for comparing groups of nuclear profiles, graphically illustrating variations in their occurrence patterns within these genomic features. Heatmaps offer yet another powerful visualization tool: individual data points from a table are represented by colored cells, with color intensity corresponding to the value of the data point, allowing rapid identification of trends through the clustering of similar colors or the contrast of dissimilar ones.

The accompanying heatmap, for example, visually represents the similarity between nuclear profiles. The intensity of the color in each cell corresponds to a calculated Jaccard Index, a metric ranging from 0 to 1 that signifies the degree of similarity between two nuclear profiles. This visualization aids in pinpointing regions within the genome where specific clusters of nuclear profiles tend to congregate. The diagonal line of white cells is a predictable outcome, as it represents the self-comparison of each nuclear profile, inherently resulting in maximum similarity (a value of 1). Beyond this diagonal, the presence of other lightly colored cells, particularly in the bottom right quadrant, indicates groups of nuclear profiles that exhibit a high degree of similarity, suggesting that these profiles are detected across a greater number of genomic windows than others.
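The Jaccard Index behind such a heatmap is straightforward to compute directly from the binary segregation data. The sketch below uses two hypothetical nuclear-profile rows (the 0/1 vectors are invented for illustration) and counts windows detected in both profiles versus windows detected in either:

```python
def jaccard(np_a, np_b):
    """Jaccard index between two nuclear profiles given as 0/1 window vectors."""
    both = sum(1 for a, b in zip(np_a, np_b) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(np_a, np_b) if a == 1 or b == 1)
    return both / either if either else 0.0

# Toy segregation rows: 1 = window detected in the NP, 0 = not detected.
np1 = [1, 0, 1, 1, 0]
np2 = [1, 0, 0, 1, 0]

print(jaccard(np1, np1))  # self-comparison -> 1.0, as on the heatmap diagonal
print(jaccard(np1, np2))  # 2 shared windows out of 3 detected in either -> ~0.667
```

Applying `jaccard` to every pair of profiles yields the symmetric matrix that the heatmap colors cell by cell.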

Similarly, the bar graph presented serves to illustrate the distribution of radial positions among nuclear profiles within a specific cluster. The radial position is categorized on a scale from 1 (strongly apical, closer to the nuclear periphery) to 5 (strongly equatorial, closer to the nuclear center). The cluster itself is derived using a k-means clustering algorithm, a method that groups data points based on their similarity. The process begins with the random selection of initial “centers” for the clusters, followed by the assignment of all other nuclear profiles to the nearest center. The centers are then recalculated to better represent the data within each cluster, and this iterative process continues until the cluster centers stabilize, indicating that the algorithm has converged on optimal groupings. Within each identified cluster, nuclear profiles are assigned a radial position value, which is then aggregated and visualized in the bar graph, showcasing the proportional representation of each radial category.

The radar chart provides a comparative overview of three distinct clusters of nuclear profiles, detailing their percentage of occurrence within various features of the mouse genome. These clusters, identified through the aforementioned k-means clustering, are visualized to highlight their differential presence across specific genomic regions. A cluster’s representation within a feature is determined by assessing how frequently its constituent nuclear profiles are detected within genomic windows that are also annotated as part of that feature. The radar chart then graphically displays these percentages, allowing for direct comparison between clusters and their respective genomic associations.

Cryosection and Laser Microdissection

The meticulous process of generating cryosections, foundational to GAM, adheres to methods such as the Tokuyasu technique. This approach involves rigorous fixation protocols designed to preserve the delicate architecture of the nucleus and cellular structures. Following fixation, cryoprotection is applied using a sucrose-PBS solution, after which the sample is rapidly frozen, typically in liquid nitrogen. In the context of Genome Architecture Mapping, this cryosectioning is an indispensable preliminary step for the comprehensive exploration of the genome’s three-dimensional topology. Subsequently, laser microdissection is employed to precisely isolate each individual nuclear profile. This isolated material then undergoes DNA extraction and subsequent sequencing, yielding the data that forms the basis of GAM analysis.

Data Analysis - Bioinformatic Tools

GAMtools

GAMtools represents a comprehensive suite of software utilities specifically developed for the analysis of Genome Architecture Mapping data, curated by Robert Beagrie. To effectively utilize GAMtools, the Bowtie2 alignment tool is a prerequisite. The input data for GAMtools processing is typically provided in FASTQ format. This software package is equipped with a diverse array of functionalities, and the precise commands required will vary depending on the specific analytical objective. However, a common initial step for most users involves generating a segregation table. This typically entails downloading or preparing the input data and then performing sequence mapping. The output of this mapping process is the crucial segregation table, which then serves as the input for a multitude of subsequent operations. For detailed guidance on specific commands and their applications, consulting the official GAMtools documentation is highly recommended.

Flowchart

The analytical pipeline for GAM data often begins with mapping the sequencing data. This is accomplished using the process_nps command within GAMtools, which maps the raw sequence reads originating from the nuclear profiles. GAMtools also incorporates a feature for performing quality control checks on these nuclear profiles. This quality control can be activated by including the -c or --do-qc flag in the process_nps command. When this flag is enabled, GAMtools actively attempts to identify and exclude nuclear profiles that exhibit poor quality, thereby ensuring the integrity of the subsequent analyses.

The command structure for this initial mapping and quality control step is as follows:

gamtools process_nps --do-qc -g <GENOME_FILE> <FASTQ_FILE> [<FASTQ_FILE> ...]

Windows Calling and Segregation Table

Following the mapping of sequencing data, the process_nps command in GAMtools proceeds to count the number of reads from each nuclear profile that overlap with predefined genomic windows. By default, these genomic windows are set at a size of 50 kilobases (kb). This crucial step culminates in the generation of a segregation table. This table serves as a fundamental data structure, meticulously detailing the presence or absence of each genomic window across all analyzed nuclear profiles.
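The structure of a segregation table can be illustrated with a toy calling step. In the sketch below, the per-window read counts and the fixed cutoff are invented for illustration; GAMtools itself derives its thresholds from a statistical noise model rather than a constant:

```python
# Hypothetical read counts per 50 kb window for three NPs (all values invented).
read_counts = {
    "NP1": [120, 3, 85, 0],
    "NP2": [0, 44, 2, 97],
    "NP3": [15, 0, 0, 210],
}

THRESHOLD = 10  # assumed cutoff; GAMtools fits a noise model per NP instead

# A window is "present" (1) in an NP when its reads exceed the cutoff.
segregation = {np_id: [1 if c > THRESHOLD else 0 for c in counts]
               for np_id, counts in read_counts.items()}

print(segregation["NP1"])  # [1, 0, 1, 0]
```

Each row of the resulting table records the presence or absence of every genomic window in one nuclear profile, which is exactly the structure consumed by the downstream `matrix`, `compaction`, and `radial_pos` commands.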

Producing Proximity Matrices

The matrix command within GAMtools is employed to generate proximity matrices. The segregation table, previously computed during the windows calling phase, serves as the input for this command. GAMtools calculates these matrices by leveraging normalized linkage disequilibrium. This normalization process involves assessing the frequency with which pairs of genomic windows are detected within the same nuclear profile, and then adjusting these counts based on the overall detection frequency of each individual window across all nuclear profiles. The accompanying figure provides a visual representation of a proximity matrix heatmap, generated using GAMtools.

The GAMtools command for generating proximity matrices is:

gamtools matrix [OPTIONS] -s <SEGREGATION_FILE> -r <REGION> [<REGION> ...]

Calculating Chromatin Compaction

The compaction command in GAMtools is instrumental in estimating chromatin compaction levels. Compaction is quantified as a value assigned to a specific gene or genomic region, reflecting its physical size or volume. Notably, the level of chromatin compaction is inversely proportional to the locus volume; thus, genomic loci with a smaller volume are considered to have a higher degree of compaction, while those with a larger volume exhibit lower compaction. As depicted in the accompanying figure, genomic loci with a lower compaction level are statistically more likely to be intersected by the cryosection slices used in GAM. GAMtools utilizes this principle to assign a compaction value to each locus, derived from its detection frequency across numerous nuclear profiles. It is important to note that the compaction rate of these loci is not static; it is a dynamic property that can change throughout the life of a cell. Genomic loci are hypothesized to become de-compacted when they are actively being transcribed. This phenomenon allows researchers to infer which genes are currently active within a cell by analyzing the results obtained from GAMtools data. A locus characterized by low compaction is often correlated with high transcriptional activity. The computational time complexity of the compaction command is O(m × n), where ‘m’ represents the number of genomic windows and ‘n’ signifies the number of nuclear profiles.
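The underlying principle — lower detection frequency implies smaller volume and hence higher compaction — can be sketched in a few lines. The table and the simple `1 - frequency` score below are invented for illustration and are not GAMtools’ actual statistic:

```python
def detection_frequency(column):
    """Fraction of nuclear profiles in which a locus (window) is detected."""
    return sum(column) / len(column)

# Rows = nuclear profiles, columns = two loci (hypothetical 0/1 data).
table = [
    [1, 0],
    [1, 0],
    [1, 1],
    [0, 0],
]

# Toy score: lower detection frequency -> smaller volume -> higher compaction.
compaction_scores = []
for locus in range(2):
    col = [row[locus] for row in table]
    compaction_scores.append(1.0 - detection_frequency(col))

print(compaction_scores)  # locus 1 is detected less often, so it scores higher
```

A single pass over an m-window by n-profile table visits each cell once, matching the O(m × n) complexity quoted above.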

The GAMtools command for this calculation is:

gamtools compaction [OPTIONS] -s <SEGREGATION_FILE> -o <OUTPUT_FILE>

Calculating Radial Position

GAMtools also facilitates the calculation of the radial position of nuclear profiles (NPs). The radial position is a metric that quantifies how close or far a given NP is from the equatorial plane or the center of the nucleus. NPs situated near the nuclear center are categorized as equatorial, while those closer to the nuclear periphery are designated as apical. The specific GAMtools command for calculating radial positioning is radial_pos. This command necessitates the prior generation of a segregation table. The estimation of radial position is derived from the average size of the NPs that encompass a particular chromatin region. Chromatin regions located closer to the nuclear periphery are typically intersected by smaller, more apical NPs, whereas centrally located chromatin is generally intersected by larger, more equatorial NPs.

To estimate the size of each NP, GAMtools analyzes the number of genomic windows each NP intersects. NPs that intersect a greater number of windows are inferred to possess a larger physical volume. This methodology bears a strong resemblance to the approach used for estimating chromatin compaction. The accompanying figure illustrates how GAMtools assesses the detection rate of each NP to estimate its volume, which in turn informs the calculation of either compaction or radial position. For instance, if the first NP intersects all three windows, it can be inferred as one of the largest NPs. The second NP, intersecting two out of three windows, would be estimated as smaller than the first. The third NP, intersecting only one window, would be considered the smallest. With these size estimations in hand, the radial position can then be inferred. Assuming that larger NPs are more equatorial, the first NP would be deemed the most equatorial, the second moderately so, and the third the most apical.

The GAMtools command for calculating radial position is:

gamtools radial_pos [OPTIONS] -s <SEGREGATION_FILE> -o <OUTPUT_FILE>

The provided pseudocode offers a conceptual illustration of how one might calculate the radial position for a list of NPs:

// Suppose we have a 2D matrix called data where the rows correspond to the NPs
// and the columns correspond to the windows, so if data[1][2] is 1, then that
// means NP 1 contains window 2

// Use this variable to keep track of the largest number of windows detected by a single NP
LET MAXWINDOW = 0
// Use this array to keep track of the number of windows detected by each NP,
// so we can later determine the radial position
LET RADIAL_POS = []

// Loop through all NPs
FOR NP FROM 1 TO NUM_NPS:
    LET WINCOUNT = 0

    // Count the number of windows the current NP saw
    FOR WIN FROM 1 TO NUM_WINDOWS:
        IF ( data[NP][WIN] == 1 )
            WINCOUNT = WINCOUNT + 1

    // See if the current NP has seen the most windows
    IF WINCOUNT > MAXWINDOW:
        MAXWINDOW = WINCOUNT

    // Add the count for the current NP to the array
    RADIAL_POS.APPEND( WINCOUNT )

// Divide the number of windows each NP saw by the largest number of windows
// any NP saw to get an estimate of the radial position
FOR NP FROM 1 TO NUM_NPS:
    RADIAL_POS[NP] = RADIAL_POS[NP] / MAXWINDOW

This pseudocode generates a list of radial positions ranging from 0 to 1, providing an estimation where 1 represents the most equatorial position and 0 signifies the most apical. The time complexity of this pseudocode is O(n × m), where ’n’ is the number of NPs and ’m’ is the number of windows. The initial for loop iterates ’n’ times, and its nested inner loop iterates ’m’ times, resulting in a time complexity of O(n × m) for that segment. The subsequent for loop iterates ’n’ times, contributing a time complexity of O(n). Therefore, the overall time complexity of this code is O(n × m + n), which simplifies to O(n × m).
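The pseudocode above translates directly into Python. The three-NP matrix below mirrors the worked example from the preceding section (the data values are invented):

```python
def radial_positions(data):
    """Estimate radial position for each NP from a binary NP x window matrix.

    Larger NPs (those detecting more windows) are assumed to be more
    equatorial, so scores run from near 0 (apical) to 1 (equatorial).
    """
    window_counts = [sum(row) for row in data]   # windows seen by each NP
    max_windows = max(window_counts)             # largest count over all NPs
    return [count / max_windows for count in window_counts]

# Three hypothetical NPs over three windows.
data = [
    [1, 1, 1],  # intersects every window -> largest, most equatorial
    [1, 1, 0],
    [0, 0, 1],  # intersects one window -> smallest, most apical
]
positions = radial_positions(data)
print(positions)  # first NP scores 1.0; the others scale down proportionally
```

As in the pseudocode, one pass over the n × m matrix plus a final n-length division keeps the cost at O(n × m).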

Data Analysis Methods

Overview

The provided flowchart outlines a general workflow for analyzing data generated by Genome Architecture Mapping. Processes are depicted as circles, while data elements are represented by squares.

The initial phase of GAM analysis involves the cryosectioning and subsequent examination of cells. This process yields a collection of nuclear slices, or nuclear profiles (NPs), each containing fragments of genomic DNA represented as genomic windows. These nuclear profiles are then meticulously analyzed to construct a segregation table. Segregation tables are the bedrock of GAM analysis, containing crucial information that delineates which genomic loci are present within each individual nuclear profile.

Beyond the specific methods detailed below, other analytical approaches are also applicable. For instance, clustering techniques, such as k-means clustering, can be employed to group nuclear profiles that exhibit similar genomic loci content. K-means clustering is well-suited to this task because it groups nuclear profiles by a similarity measure, but it does come with drawbacks. Its time complexity is O(tknd), where ‘t’ represents the number of iterations, ‘k’ is the number of cluster centers (means), ‘n’ is the number of data points (nuclear profiles), and ‘d’ is the dimensionality of each data point. Moreover, finding an optimal k-means clustering is NP-hard in general; the iterative heuristic used in practice only approximates it, and its cost grows quickly with dataset size, so it is often more appropriate for analyzing subsets of data.
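A minimal sketch of the iterative (Lloyd’s) heuristic on binary NP vectors follows. The four toy profiles are invented, and the initial centers are picked deterministically for reproducibility; real implementations use random restarts:

```python
def kmeans(points, k, iters=20):
    """Minimal Lloyd's algorithm over binary NP vectors.

    Initial centers are chosen deterministically (evenly spaced points)
    so the toy run is reproducible; production k-means randomizes this.
    """
    centers = [list(points[i * len(points) // k]) for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each NP joins its nearest center (squared Euclidean).
        assign = [min(range(k),
                      key=lambda c, p=p: sum((x - y) ** 2
                                             for x, y in zip(p, centers[c])))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Four hypothetical NPs forming two obvious groups of similar window content.
nps = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
labels = kmeans(nps, k=2)
print(labels)  # the first two NPs share a cluster, as do the last two
```

Each iteration touches every point–center–dimension triple, which is where the O(tknd) cost quoted above comes from.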

For more in-depth analysis, the GAMtools software suite can be utilized. GAMtools provides a collection of tools designed to extract valuable information from the segregation table. Some of the key analytical outcomes derived from GAMtools are discussed in subsequent sections.

Cosegregation, also referred to as linkage, can be ascertained by observing the frequency with which two genomic loci appear together within the same nuclear profile. This data is instrumental in identifying loci that are physically proximate in three-dimensional space and those that interact with regularity, thereby providing insights into the mechanisms of DNA transcription.

SLICE, a method for predicting specific interactions among genomic loci, leverages statistical data derived from cosegregation analysis. It operates by estimating the proportion of specific interactions for each pair of loci at a given time, employing a likelihood-based approach.

Finally, graph analysis can be applied to the segregation table to identify distinct “communities” within the data. Communities can be defined through various means, such as by identifying cliques. However, within the context of GAM analysis, community detection often focuses on centrality measures. In a social network analogy, centrality-based communities can be likened to celebrities and their followers. While the followers might not interact extensively with each other, they all interact with the central celebrity figure.

Several types of centrality exist, including degree centrality, eigenvector centrality, and betweenness centrality, each capable of defining communities in distinct ways. It is worth noting that in our social network analogy, eigenvector centrality might not always be the most accurate measure if, for example, one individual follows many celebrities but holds no influence over them. In such directed graph scenarios, eigenvector centrality might be less informative. However, in GAM analysis, the graph is generally assumed to be undirected, making eigenvector centrality a valid approach. It is important to recognize that both clique and centrality calculations can be computationally intensive and, similar to clustering, may not scale well for extremely large datasets.
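Eigenvector centrality on an undirected graph can be sketched with plain power iteration. The adjacency matrix below is an invented co-segregation graph in which one window acts as a hub; iterating on A + I (which has the same eigenvectors as A) avoids oscillation on bipartite structures like this star:

```python
def eigenvector_centrality(adj, iters=100):
    """Eigenvector centrality via power iteration on (A + I).

    Adding the identity shifts the spectrum without changing eigenvectors,
    so the iteration converges even on bipartite graphs.
    """
    n = len(adj)
    v = [1.0] * n
    for _ in range(iters):
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = max(w)                      # scale so the largest entry is 1
        v = [x / norm for x in w]
    return v

# Hypothetical undirected co-segregation graph: window 0 links to all others.
adj = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
centrality = eigenvector_centrality(adj)
print(centrality)  # the hub (window 0) receives the highest score
```

In the celebrity analogy, the hub row plays the celebrity: its score dominates because every other node’s score flows into it.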

SLICE

SLICE, an acronym for StatisticaL Inference of Co-sEgregation, plays a pivotal role in the analysis of GAM data. Developed within the laboratory of Mario Nicodemi, SLICE provides a mathematical framework for identifying the most specific interactions among genomic loci based on GAM cosegregation data. Its core function is to estimate the proportion of specific interactions for each pair of loci within a given timeframe, employing a likelihood-based methodology. The initial step in the SLICE process involves defining a function that describes the expected proportion of nuclear profiles containing specific loci. Subsequently, the optimal probability outcome is determined to accurately explain the experimentally derived data.

Flow chart of SLICE

The flowchart illustrates the sequential steps involved in the SLICE analysis.

SLICE Model

The SLICE model is predicated on the hypothesis that the probability of non-interacting loci appearing within the same nuclear profile is predictable and is influenced by the physical distance between these loci.

The SLICE model categorizes a pair of loci into two types: those that interact and those that do not. According to the underlying hypothesis, the proportions of nuclear profiles in different states can be mathematically predicted. By deriving a function representing the interaction probability, GAM data can be utilized to identify significant interactions and assess the sensitivity of the GAM methodology.

Calculate distribution in a single nuclear profile

The SLICE model considers a pair of loci, designated as A and B, which can exist in one of two states: either they do not interact with each other, or they do. The initial challenge is to determine the probability of finding a single locus within a nuclear profile.

The mathematical expression for this is the pair of single-locus probabilities $v_0$ and $v_1$, where:

  • $v_1$ represents the probability that the locus is found within a nuclear profile.
  • $v_0 = 1 - v_1$ represents the probability that the locus is not found within a nuclear profile.
  • $v_1 \approx V_{NP} / V_{nucleus}$, the ratio of the average nuclear profile volume to the nuclear volume.

Estimation of average nuclear radius

As indicated by the preceding equation, the volume of the nucleus is a necessary parameter for the calculation. The radii of the nuclear profiles can be utilized to estimate the overall nuclear radius, and SLICE’s predictions of the nuclear radius align with findings from Monte Carlo simulations. With the estimated nuclear radius, it becomes possible to estimate both the probability of two loci being in a non-interacting state and the probability of them being in an interacting state.

The mathematical expression for the non-interacting state is as follows. Let $u_i$, $i = 0, 1, 2$, be the probability of finding 0, 1, or 2 loci from a pair of non-interacting loci in a nuclear profile:

$\langle u_0 \rangle = \langle v_0^2 \rangle, \quad \langle u_1 \rangle = \langle 2 v_1 v_0 \rangle, \quad \langle u_2 \rangle = \langle v_1^2 \rangle$

For two loci in an interacting state, let $t_i$, $i = 0, 1, 2$, be the corresponding probability:

$\langle t_2 \rangle \approx \langle v_1 \rangle, \quad \langle t_1 \rangle \approx 0, \quad \langle t_0 \rangle \approx \langle v_0 \rangle = 1 - \langle v_1 \rangle$
Calculate probability of pairs of loci in single nuclear profile

Utilizing the results from the preceding calculations, the probability of a pair of loci occurring within a single nuclear profile can be determined through statistical methods. A pair of loci can exist in one of three distinct states, each associated with a probability $P_i$, $i = 0, 1, 2$:

  • $P_2$: the probability that both loci in a pair are interacting.
  • $P_1$: the probability that one locus interacts while the other does not.
  • $P_0$: the probability that neither locus interacts.

SLICE Statistical Analysis
$\frac{N_{0,0}}{N} = \langle t_0^2 \rangle P_2 + \langle t_0 u_0 \rangle P_1 + \langle u_0^2 \rangle P_0$

$\frac{N_{2,0}}{N} = \frac{N_{0,2}}{N} = \langle t_1^2 \rangle P_2 + \langle t_1 u_1 \rangle P_1 + \langle u_1^2 \rangle P_0$

where $N_{i,j}$ represents the number of nuclear profiles in the state where $i$ refers to locus A and $j$ refers to locus B, with $i$ and $j$ being 0, 1, or 2, and $N$ is the total number of nuclear profiles.

Detection efficiency

In Genome Architecture Mapping (GAM), detection efficiency quantifies the likelihood that a specific genomic locus will be observed within a nuclear profile (NP). This probability is influenced by several factors, including the overall geometry of the nucleus and the degree of chromatin compaction. Genomic regions situated near the nuclear periphery or those that are highly condensed are less likely to be intersected by the randomly oriented slices employed in GAM. Conversely, loci located more centrally or existing in a decondensed state are more readily detected. Recognizing that not all loci present within a nuclear slice are reliably observed, the SLICE (Statistical Inference of Co-segregation) model incorporates the concept of detection efficiency. This accounts for inherent limitations such as incomplete slicing or potential DNA loss during sample preparation. By doing so, it becomes possible to differentiate between a genuine absence of a signal and a failure to detect an existing signal.

The flowchart visually explains how detection efficiency is integrated into the Genome Architecture Mapping (GAM) workflow.

To empirically assess detection efficiency, researchers have conducted studies, for instance, on mouse embryonic stem cells (mESCs). These studies involved generating genome-wide contact maps from over 400 high-quality nuclear profiles, examining detection at various resolutions, such as 30 kb. The findings indicated that approximately 400,000 uniquely mapped reads per NP were necessary to detect more than 80% of the positive windows. On average, each NP captured about 4 to 6% of the genome, which aligns with theoretical expectations based on nuclear volume. Further validation using FISH (fluorescence in situ hybridization) confirmed that regions as small as 40 kb could be effectively detected. To enhance the accuracy of the data, statistical normalization techniques were applied. These methods aim to mitigate biases arising from factors such as GC content, sequence mappability, and variations in detection rates, resulting in GAM matrices that exhibit fewer artifacts compared to those generated by traditional Hi-C data.

To definitively identify which genomic windows truly represent biological signals, sequencing reads were aggregated across windows of varying sizes, ranging from 10 kb to 1 Mb. Researchers then modeled the read counts per NP using a combination of negative binomial and lognormal distributions. Based on this statistical modeling, a specific threshold was established for each NP. Windows were subsequently classified as “positive” if the number of mapped reads significantly exceeded the expected count attributable to random sequencing noise alone. This rigorous statistical approach, coupled with the correction for detection efficiency within the SLICE framework, ultimately leads to more accurate and biologically meaningful interpretations of GAM data.
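The per-NP thresholding idea can be illustrated with a much simpler noise model. The sketch below substitutes a median-plus-MAD cutoff for the paper’s negative binomial/lognormal fit; the read counts and the multiplier `k` are invented:

```python
def robust_threshold(counts, k=10.0):
    """Toy stand-in for the published noise model: per-NP threshold at
    median + k * median absolute deviation (MAD) of the window counts."""
    s = sorted(counts)
    n = len(s)
    median = (s[n // 2 - 1] + s[n // 2]) / 2 if n % 2 == 0 else s[n // 2]
    devs = sorted(abs(c - median) for c in counts)
    mad = (devs[n // 2 - 1] + devs[n // 2]) / 2 if n % 2 == 0 else devs[n // 2]
    return median + k * mad

# One NP's read counts: two windows stand far above the sequencing noise.
np_counts = [2, 0, 1, 3, 250, 1, 0, 180]
t = robust_threshold(np_counts)
positive = [1 if c > t else 0 for c in np_counts]
print(positive)  # only the two high-count windows are called positive
```

The median/MAD pair is robust to the very outliers being called, which is why it (like the paper’s fitted distributions) separates signal windows from background noise even when a few counts are extreme.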

Figure 3 in the original publication provides a detailed illustration of this modeling process, demonstrating the interplay between detection efficiency and co-segregation frequency among genomic windows.

Estimating interaction probabilities of pairs

Drawing upon the concept of detection efficiency and the previously defined probabilities $u_0$, $u_1$, and $u_2$, SLICE proceeds to estimate the likelihood that a pair of genomic loci are indeed interacting. These probabilities represent the likelihood of detecting zero, one, or both loci within a nuclear profile, under the assumption that the loci are not interacting:

$u_0 = v_0^2$: probability that neither locus is detected.
$u_1 = 2 v_1 v_0$: probability that only one locus is detected.
$u_2 = v_1^2$: probability that both loci are detected.

By juxtaposing these expected probabilities derived from the non-interacting model with the observed co-segregation data, SLICE is able to infer the interaction probability for each pair of loci. This statistical inference process rigorously accounts for detection efficiency, enabling researchers to reliably distinguish true chromatin contacts from coincidental co-detections that may arise due to the nature of nuclear slicing.
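A small numerical check makes the non-interacting baseline concrete. The value of `v1` and the observed co-detection rate below are arbitrary toy numbers, not measured quantities:

```python
# Assumed chance that a single locus lands in a given nuclear profile.
v1 = 0.2
v0 = 1 - v1

u0 = v0 ** 2      # neither locus detected
u1 = 2 * v1 * v0  # exactly one locus detected
u2 = v1 ** 2      # both loci detected

# The three outcomes are exhaustive, so the probabilities sum to one.
print(u0 + u1 + u2)

# Toy observed co-detection rate; an excess over u2 hints at interaction.
observed_both = 0.15
excess = observed_both - u2
print(excess)  # positive here, since 0.15 exceeds the expected 0.04
```

In SLICE proper, this comparison is carried out through a likelihood fit rather than a simple subtraction, but the direction of the inference is the same.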

Co-segregation and normalized linkage

When undertaking genome mapping, one can analyze the co-segregation patterns across various genomic windows and nuclear profiles (NPs) within a genome. Nuclear profiles are derived from tissue slices and samples, with segments of the genome represented as windows. Co-segregation in this context refers to the identification of linkages between specific windows in a genome, encompassing concepts like linkage disequilibrium and normalized linkage disequilibrium. A fundamental step in calculating co-segregation and linkage involves determining the detection frequency for each window: the number of NPs in which the window is present divided by the total number of NPs. Each of these calculated values provides crucial statistical metrics for genome analysis. Normalized linkage disequilibrium is the final calculation that quantifies the actual linkage between genomic windows; once all intermediate values are computed, they are used to determine the normalized linkage disequilibrium for each specified window. The normalized linkage value ranges between -1.0 and 1.0, where 1.0 indicates a strong linkage between two windows, values near zero indicate no association beyond chance, and negative values indicate windows detected together less often than expected. Compiling the normalized linkage values for each window into a chart or matrix allows for comprehensive genome mapping and analysis, often visualized using a heatmap or other graphical representations. The co-segregation and normalized linkage values also serve as inputs for further analytical computations, such as centrality and community detection, as discussed earlier.

To ascertain the co-segregation and linkages between genomic windows, the following calculations are essential: Detection Frequency, Co-segregation, Linkage, and Normalized Linkage.
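These four calculations chain together naturally. The sketch below runs them on two invented window columns; the D′-style scaling used here is the classic population-genetics normalization, and GAMtools’ own normalization may differ in detail:

```python
def detection_freq(col):
    """Fraction of NPs in which a window is detected."""
    return sum(col) / len(col)

def cosegregation(col_a, col_b):
    """Fraction of NPs in which both windows are detected together."""
    return sum(1 for a, b in zip(col_a, col_b) if a and b) / len(col_a)

def normalized_linkage(col_a, col_b):
    """Linkage D = co-segregation minus the product of detection
    frequencies, scaled by its theoretical maximum (D'-style)."""
    fa, fb = detection_freq(col_a), detection_freq(col_b)
    d = cosegregation(col_a, col_b) - fa * fb
    if d == 0:
        return 0.0
    if d > 0:
        d_max = min(fa * (1 - fb), fb * (1 - fa))
    else:
        d_max = min(fa * fb, (1 - fa) * (1 - fb))
    return d / d_max

# Two toy windows that always co-occur across four NPs -> strong linkage.
win_a = [1, 1, 0, 0]
win_b = [1, 1, 0, 0]
print(normalized_linkage(win_a, win_b))  # maximal positive linkage
```

Running the same function on an anti-correlated column such as `[0, 0, 1, 1]` drives the score toward -1.0, the other end of the range described above.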

Calculating linkage and frequencies

Each calculation step outlined above is detailed and explained in the following table.

Formulas and Steps for Calculating Co-segregation and Linkage

| Calculations | Formulas [12] | Explanation