Abstract
Alzheimer's disease (AD) is a progressive neurodegenerative disorder with profound global impact. While genome-wide association studies (GWAS) have revealed genomic variants linked to AD, their translational impact has been limited due to challenges in interpreting the identified genetic associations. To address this challenge, we have devised a novel approach termed transcription factor-wide association studies (TF-WAS). By integrating the GWAS, expression quantitative trait loci, and transcriptome analyses, we selected 30 AD single nucleotide polymorphisms (SNPs) in noncoding regions that are likely to be functional. Using human transcription factor (TF) microarrays, we have identified 90 allele-specific TF interactions with 53 unique TFs. We then focused on several interactions involving SMAD4 and further validated them using electrophoretic mobility shift assay, luciferase, and chromatin immunoprecipitation on engineered genetic backgrounds (female cells). This approach holds promise for unraveling the intricacies of not just AD, but any complex disease with available GWAS data, providing insight into underlying molecular mechanisms and clues toward potential therapeutic targets.
Significance Statement
We introduce a powerful platform for better understanding the genetic contribution of Alzheimer's disease (AD) and other complex diseases. Through genome-wide association studies (GWAS), many statistically significant single nucleotide polymorphisms (SNPs) associated with AD have been identified, but their functionality remains unknown. By screening >85% of human proteome transcription factors and cofactors for allele-specific binding preferences with GWAS SNPs, we can comprehensively elucidate the functionality of these SNPs in disease etiology. Using this strategy, we have identified and validated several allele-specific interactions with AD-associated GWAS SNPs that have potential implications in processes relevant to AD. By leveraging available GWAS data, we can identify functional SNPs not just in AD but in essentially all other complex diseases.
Introduction
Alzheimer's disease (AD) is a devastating neurodegenerative condition responsible for 60–70% of dementia cases worldwide (Song et al., 2019). Despite its profound impact, there are currently no treatments available to halt or reverse its progression, primarily because our understanding of its intricate molecular mechanisms remains incomplete. Genome-wide association studies (GWAS) and familial linkage analysis have successfully identified genetic loci that confer risk for AD, including APP, APOE, PSEN1, and PSEN2, among many others (Strittmatter et al., 1993; Bekris et al., 2010). This information has provided critical insight into some of the biological processes involved in the disease, such as cholesterol and lipid metabolism, immune responses, and endosomal vesicle cycling (Van Cauwenberghe et al., 2016). Despite this progress, the functional basis for many of these associations remains elusive.
In the post-GWAS era, identifying causal variants among numerous significant variants remains challenging, particularly in linkage disequilibrium regions. Expression quantitative trait loci (eQTLs) have been useful in this pursuit by associating genetic variants with variation in gene expression levels. For example, based on whole-genome single nucleotide polymorphism (SNP) genotyping and whole-transcriptome expression profiling in cortical samples, many significant associations between inherited variants and transcripts expressed in the brain were identified (Myers et al., 2007; Webster et al., 2009). While certainly successful in identifying SNP-transcript connections, eQTL analysis is limited because it must be performed in physiologically relevant cells and tissues, and it neglects RNA-level processes beyond gene expression, such as RNA splicing, degradation, and transport.
Another challenge has been uncovering the mechanisms through which these variants are able to confer disease risk and influence pathology. This is underscored by the fact that many GWAS variants are SNPs located in noncoding genomic regions, further obfuscating their functionality. This phenomenon suggests functional SNPs are located within cis-regulatory elements, thus affecting phenotype on the level of transcriptional regulation. There is substantial evidence in the literature that SNPs within regulatory regions can alter canonical transcription factor binding and consequently impact gene expression (Maurano et al., 2012; Li et al., 2019).
To this end, ChIP-seq is a useful tool for connecting GWAS SNPs with transcription factor binding sites (Landt et al., 2012; Reddy et al., 2012), but this method requires a priori knowledge of relevant TFs and an extensive collection of ChIP-grade antibodies. Quantitative mass spectrometry has also been utilized to identify differential transcription factor binding to GWAS SNPs (Butter et al., 2012); however this approach is difficult to scale up and often suffers from false negatives due to the low binding affinity of typical TF–DNA interactions. Others have predicted allele-specific binding sites using genomic footprints obtained via mapping DNase I hypersensitivity sites or ATAC-seq (Neph et al., 2012; Buenrostro et al., 2013), but this approach can only be utilized for a limited number of TFs with known consensus motifs, and not all TFs cause genomic footprints due to weak or transient binding interactions. SNP-SELEX represents an alternative strategy for efficiently detecting protein–DNA interactions in a high-throughput manner. Nonetheless, it has limited coverage of SNPs within the human genome, and the method exhibits a subtle bias toward risk-associated loci (Yan et al., 2021).
To overcome these challenges, we introduce transcription factor-wide association studies (TF-WAS). In this method, we employ human transcription factor protein microarrays (TF arrays) containing ∼1,700 full-length purified transcription factors (TFs) and cofactors (1,265 unique factors, greater than 85% coverage of those in the human proteome) spotted in duplicate to screen GWAS SNPs for differential TF binding (Hu et al., 2009). In doing so, we can discern allele-specific binding interactions in an unbiased and high-throughput manner. This approach, coupled with bioinformatic analyses and orthogonal validation assays, can shed light on SNP functionality and TFs involved, thus providing clues to underlying mechanisms in AD. By harnessing available GWAS data, TF-WAS can enable valuable insights into any complex disease or trait, offering a comprehensive understanding of their genetic underpinnings and potential therapeutic targets.
Materials and Methods
Bioinformatic SNP selection
We obtained 2,750 AD-associated SNPs from GWAS (v1.0.2) database (Sollis et al., 2023). We have two strategies to enrich the SNPs that are likely to cause gene expression changes. We selected (1) the SNPs located in the enhancers and (2) the SNPs that were tested to affect gene expression through expression quantitative trait locus (eQTL) studies. Approximately 400,000 enhancers were obtained from EnhancerAtlas (Gao and Qian 2020) and SEA 3.0 (Chen et al., 2020) databases. Meanwhile, ∼290,000 brain-/nerve-related eQTLs were identified from the eQTL data portal GTEx (v8; Strober et al., 2020). By the overlapping analysis between AD-associated SNPs and these enhancers/eQTLs, we obtained a total of 418 AD-associated SNPs that are likely to affect gene expression.
Protein microarray fabrication
Protein microarrays were fabricated as described previously (Hu et al., 2009). From our previously generated collection of ∼21,000 full-length human ORFs expressible as N-terminal GST-His6 fusion proteins (Jeong et al., 2012), ∼1,700 transcription factor and cofactor proteins were selected as a subcollection for printing on the TF arrays. Each protein was expressed in 8 ml of yeast culture in a 96-well format, with protein expression induced for 6 h by addition of galactose in glucose-free media. Yeast cells were lysed mechanically in lysis buffer (50 mM Tris-Cl at pH 7.5, containing 100 mM NaCl, 1 mM EGTA, 1 mM PMSF, 10% glycerol, 0.1% Triton X-100, 0.1% beta-mercaptoethanol, and Roche protease inhibitor tablet). Protein was purified from the lysates though binding with glutathione sepharose beads (GE HealthCare, GE17-0756-04) overnight at 4°C. After incubation overnight, the beads were washed three times with wash buffer I (50 mM Tris-Cl at pH 7.5, containing 500 mM NaCl, 1 mM EGTA, 1 mM PMSF, 10% glycerol, 0.1% Triton X-100, and 0.1% beta-mercaptoethanol) and three times with wash buffer II (50 mM HEPES at pH 8.0, containing 500 mM NaCl, 1 mM EGTA, 1 mM PMSF, 10% glycerol, and 0.1% beta-mercaptoethanol) to remove any nonspecifically bound proteins from the beads and equilibrate to the elution buffer, respectively. Proteins were eluted from the beads using 80 µl of elution buffer (50 mM HEPES at pH 8.0, containing 100 mM NaCl, 40 mM reduced glutathione, pH 8.0, 30% glycerol, and 0.1% beta-mercaptoethanol). Purified proteins were rearrayed into a 384-well format. Proteins were printed in duplicate at 200 pL per spot on PATH Protein Microarray slides (Grace Bio-Labs, 805020). Quality was ensured by probing with Anti-GST antibody to verify adequate protein loading.
Fluorescent and biotinylated DNA probe generation
Each of the selected SNPs for this study were synthesized (Integrated DNA Technologies) with 15 bp flanking contextual sequences on either side, and an additional common modified T7 priming sequence (5′-ACCCTATAGTGAGTGCTATTA – 3′) at the 3′ end. Cy3, Cy5, and biotin primers complementary to the T7 priming site were also synthesized (Integrated DNA Technologies). Fluorescent/biotin primers were incubated at a 1:1 molar ratio in 1× NEB II Buffer (NEB). Mixtures were boiled at 95°C for 10 min and then cooled slowly to room temperature to allow for annealing to occur. Once the mixture was fully cooled, 3 U of Klenow Large Fragment 3′-5′ exo- (NEB M0210) and dNTPs (final concentration 1.5 mM) was added to each reaction. Reactions were incubated at 37°C for 20 min to generate double-stranded probes.
Dye-swap protein microarray screening
SNPs are probed to the TF arrays in pairs as a competition assay, with one allele labeled with Cy3 and the other labeled with Cy5. For each pair, the arrays are performed in duplicate, with the allele colors swapped. For example, on one array the Cy3 risk and Cy5 nonrisk are probed, while the Cy5 risk and Cy3 nonrisk are probed to another array. Prior to the competition assay, TF arrays are blocked with blocking buffer (25 mM HEPES at pH 7.5, containing 50 mM potassium glutamate, 8 mM magnesium acetate, 3 mM DTT, 10% glycerol, 0.1% Triton X-100, and 3% BSA) for 3 h at 4°C. After blocking, Cy5 and Cy3 alleles are mixed in 1× hybridization buffer (10 mM Tris-Cl at pH 8.0, containing 50 mM potassium chloride, 1 mM magnesium chloride, 1 mM DTT, 5% glycerol, 10 mM zinc chloride, and 3 mg/ml BSA) to a final concentration of 40 nM of each allele. Blocking buffer is removed from the arrays, and then the mixed hybridization reaction is added and incubated overnight at 4°C. After overnight incubation, arrays are washed once with 3 ml of ice-cold TBST for 5 min, briefly rinsed with water, and then dried via centrifugation. Arrays are scanned in both the Cy5 (635 nm) and Cy3 (532 nm) channels separately at 1,000 PMT using the GenePix 4000B scanner (Molecular Devices). GenePix Pro 7 software determines the foreground and local background intensities for each detected fluorescent spot at every location on the corresponding alignment grid (GAL file). For each image, a .TIFF file and a .GPR file are generated and saved for analysis.
Protein microarray analysis
For each spot on the alignment grid, the background and foreground intensities were determined by the GenePix Pro 7 software. Using these values, ratiometric binding analysis was performed as follows for each spot on the TF arrays using RStudio:
Electrophoretic mobility shift assay
Cy5 probes for the alleles of each SNP were generated for detection of interactions with purified SMAD4 protein. We mixed 10 nM Cy5 SNP allele, 1 µM cold unlabeled competitor allele, and purified SMAD4 protein in 1× hybridization buffer (10 mM Tris-Cl at pH 8.0, containing 50 mM KCl, 1 mM MgCl2, 1 mM DTT, 5% glycerol, 10 µM ZnCl2, and 3 mg/ml BSA). Reactions were incubated for 1 h at room temperature and then at 4°C overnight. After incubation, reactions mixtures were analyzed using gel electrophoresis with a 5% TBE PAGE gel in cold 1× TBE running buffer at 100 V for 1 h. Gels were visualized on Odyssey CLx (LI-COR Biosciences) using the Cy5 channel.
OCTET
Kinetic measurements were obtained using OCTET QK (Molecular Devices) with High Precision Streptavidin (SAX) biosensors. Biotinylated probes were generated for each of the SNP alleles to be tested. Prior to kinetic analysis, the biosensors were incubated in 1× hybridization buffer (10 mM Tris-Cl at pH 8.0, containing 50 mM KCl, 1 mM MgCl2, 1 mM DTT, 5% glycerol, 10 µM ZnCl2, 3 mg/ml BSA, and 0.1 mg/ml salmon sperm DNA) for 10 min to hydrate the sensors and equilibrate to the buffer. A baseline measurement was taken in 1× hybridization buffer for 120 s, and then biosensors were moved to a well containing 500 nM biotinylated DNA in 1× hybridization buffer for 600 s to allow for DNA loading. After DNA was loaded onto the biosensors, another baseline measurement was taken for 120 s in 1× hybridization buffer. Next, the biosensors were immersed in solutions containing a range of concentrations of purified SMAD4 protein (195, 172, 144, 115, 86, and 0 nM) and allowed to incubate until a binding equilibrium was reached. Finally, the biosensors were immersed back into hybridization buffer to allow for the dissociation of SMAD4 from the sensors. Binding curves and kinetic values were generated by ForteBio Data Analysis software.
Luciferase assay
Luciferase constructs were generated for each of the selected SNP allele pairs. Each allele was synthesized (Integrated DNA Technologies) with four repeats of the SNP (with 7 bp flanking contextual sequence on either side) and NheI and HinDIII restriction sites on the ends. These sequences were cloned upstream of the luc2P luciferase reporter gene in pGL4.32 (Promega), replacing the NF-kB response element which served to drive luciferase reporter gene expression. A clone of the transcription factor SMAD4 (IOH3638) within a pDONR221 entry vector was obtained from the ChemCORE at Johns Hopkins University and gateway cloned into the CMV driven pcDNA DEST40 to act as an overexpression vector.
For the luciferase assay, HEK293t cells (female) were plated in 24-well dishes and allowed to grow until ∼70–90% confluent. Cells were then transfected with 62.5 ng of SMAD4 overexpression vector, 62.5 ng of SNP allele luciferase vector, and 6.25 ng of Renilla control vector using Lipofectamine 3000 (Thermo Fisher Scientific) according to manufacturer's instructions. Luciferase assays were performed as described by the Dual Luciferase Reporter Assay system from Promega (E1910). Following 2 d of incubation, cells were passively lysed using Passive Lysis Buffer at room temperature for 15 min. Lysates were transferred to a 96-well plate to facilitate luminescence signal reading. LARII reagent was added to each of the wells, and then luminescence measurements were recorded for the luciferase signals for each well. Following this measurement, Renilla substrate was added to each well, and luminescence readings were recorded again to determine the Renilla signal for each well. Raw luciferase signals for each sample were normalized to Renilla control signals to account for well-to-well variability in cell count. Sample signals were further normalized to no TF control wells to account for any background interaction with the SNP luciferase vectors. Error is reported as the standard error across at least two experiments, each experiment consisting of three replicates.
CRISPR/Cas9 generation of PDE1A and CACNA2D3 point mutations in HEK293 cells
CRISPR/Cas9 system was used to generate point mutations at the loci of PDE1A and CACNA2D3 in HEK 293 cell lines (female) as previously described (Ran et al., 2013). The donor templates for PDE1A or CACNA2D3 point mutations were designed to harbor point mutation in defined loci and synthesized as single-strand oligo donors (ssODNs) at IDT (Integrated DNA Technologies) without cloning. To achieve high HDR (homologous DNA repair) efficiencies, ssODNs contain flanking sequences of 40 bp on each side that are homologous to the target region. For the gRNA design, we utilized an online CRISPR Design Tool (https://benchling.com) and selected the 20 nt guide sequence within or near by the point mutation sites (Table 1). Then the designed gRNAs were cloned into the gRNA Cloning Vector (Addgene, plasmid #41824). The functionality and efficacy of designed gRNAs were assessed by SURVEYOR nuclease assay.
HEK 293 cells were plated in 100 mm dishes 1 d before transfection and transfected with Cas9 expression vector, pSpCas9(BB)-2A-Puro (px459; Addgene plasmid # 48139), cloned gRNA expression vector, and ssODN using Lipofectamine 2000 according to manufacturer's instruction (Thermo Fisher Scientific, catalog #11668019). Transfected cells were incubated for 48 h after transfection then treated with puromycin (3 μg/ml) for 7 d. Surviving colonies were picked from 96-well plates and expanded until confluent. Then 10% of cells were cultured for further use and 90% of cells were lysed with DirectPCR Lysis Reagent (cell) (Viagen Biotech, catalog #301-C) for 6 h at 56°C for PCR screening with designed screening primer sets (Table 1). For primary screening, PCR products of PDE1A and CACNA2D3 were cut with AccI and HinfI, respectively, and then selected clones were performed with Sanger sequencing or MiSeq for final confirmation.
Results
Bioinformatic selection of Alzheimer’s disease SNPs for probing to the TF arrays
In this study, we first procured a dataset of 1,046 SNPs associated with AD from the GWAS database (v1.0.2; Sollis et al., 2023). To refine our selection of GWAS-identified SNPs to those that have the potential to influence gene expression, particularly in AD, we employed overlapping analysis with two additional datasets, as visually depicted in Figure 1A,B. The first dataset contains a selection of ∼290,000 expression quantitative trait loci (eQTLs) relevant to the brain and nervous system, sourced from the GTEx Portal (v8; Strober et al., 2020). These loci have demonstrated the ability to modulate gene expression in relevant cell types, and thus have increased potential to play a role in the etiology of AD. The second dataset consists of ∼400,000 enhancer elements obtained from EnhancerAtlas (Gao and Qian 2020) and SEA 3.0 (Chen et al., 2020) databases. As mentioned previously, functional SNPs have been hypothesized to reside within cis-regulatory elements, where they can exert regulatory control over nearby genes by influencing transcription or other essential cellular processes. Using multimodality analysis with these three datasets, we were able to achieve a list of 418 prioritized SNPs associated with AD with increased potential for functionality (Table 2). In this set of 418 SNPs, 200 were annotated to be within enhancers, 155 were associated with changes on transcription of a target gene expressed in the brain or nervous system via eQTL, and 63 had a combination of the two annotations (Fig. 1A). The identified SNPs span various positions relative to the coding regions, including intronic, intergenic, downstream, upstream, and 3′-UTR regions, among others (Fig. 1A).
Figure 1-1
SNP locations relative to target gene Top – Illustration of different genomic locations relative to the gene. Middle – Distribution of SNPs across all 418 (left) and the 30 selected SNPs for probing to the arrays (right) according to their location relative to their mapped target gene. Bottom – Distribution of SNPs across all 418 (left) and the 30 selected SNPs for probing to the arrays (right) according to their chromosomal location. Download Figure 1-1, TIF file.
Identification of allele-specific SNP–TF interactions using human TF arrays
From the list of 418 SNPs, a set of 30 SNPs were selected in this study for screens on the TF arrays (Table 2, bold entries). The selected SNPs were either located in an enhancer region, implicated in transcription of a target gene through eQTL, or had a combination of these annotations. SNPs were strategically chosen to cover diverse regions in relation to the target genes, while also covering a spectrum of chromosomal positions (Extended Data Fig. 1-1). To facilitate observation of differential binding interactions with AD-associated SNPs and TF proteins and to avoid potential bias caused by the fluorophores used to end-label each allele probe, we employed a “dye-swap” approach to simultaneously survey a pair of risk and nonrisk alleles on the TF arrays as depicted in Figure 2A. If the TF protein array-based assay is sensitive enough to distinguish single base-pair changes, we would expect to observe three possible modes of allele-specific TF interactions, as described in Figure 2B: (1) loss-of-function, when the introduction of a risk allele weakens or prevents binding with a canonical TF; (2) enhanced function, in which the presence of the risk allele increases the binding affinity of the canonical TF and preserves its function; and (3) gain-of-function, meaning the risk allele introduces an alternative canonical binding site for a new TF.
Figure 2-1
Expression of differential binding proteins in brain regions Heat map illustrating the expression of identified differential binding proteins across different regions of the brain, with darker shades of green indicating higher expression based on RNA-seq data from the GTex Portal(Strober et al. 2020). Download Figure 2-1, TIF file.
To identify allele-specific SNP–TF interactions, each allele of a given SNP, accompanied by 15 nt flanking contextual sequences on both sides, was synthesized as 31 nt DNA oligo attached with a common reverse T7 primer sequence at the 3′-end (Table 3). This primer sequence enables addition of either Cy3 or Cy5 fluorophore to the end of the T7 sequences used to convert them into double-stranded allele probes and allowed for detection of binding events with transcription factors on the arrays. After the probes are generated, each pair of the risk and nonrisk alleles with different colors were mixed at equal molar ratios and probed to the TF arrays in pairs as a competition assay. For example, on one TF array an equimolar mixture of the Cy3-labeled risk and Cy5-labeled nonrisk sequences was probed, while the opposite pairing was probed to a separate array. This approach serves the purpose of eliminating potential inherent bias from the dyes, as well as acting as an experimental replicate. It also enables a more accurate and sensitive quantitative measurement of binding ratios between a nonrisk and risk allele. A preferred allele is expected to show a stronger signal on a protein spot in both labels than its counterpart. For example, SMAD4 showed a strong preference for the nonrisk allele of rs17366218, as a strong Cy3 signal was observed on the first array and an equally strong Cy5 signal was observed on the second array, with little to no signal for the corresponding risk allele (Fig. 2C, left). Nonpreferential binding manifests as detectable signals in both colors on both arrays (Fig. 2C, right). As we expected, many of the TFs (1,407) did not show any detectable binding signals, presumably due to the small number of SNPs tested. Of those that did produce binding signals (205), 154 did not show differential binding activity between the risk and nonrisk alleles of a SNP. However, 51 TF proteins did show a very notable distinction. To analyze the binding patterns for each protein spot on the arrays, ratiometric binding analysis is conducted using the formula shown in Equation 1 (i.e., the R* value; see Materials and Methods; Dudoit et al., 2002). Proteins with highly positive or strongly negative binding ratios demonstrate a clear preference for either the nonrisk or risk alleles, respectively.⇓⇓⇓⇓
Our results clearly showed that the TF protein array-based allele-binding assay was sensitive enough to distinguish single base-pair changes, as exemplified in Figure 2C. Furthermore, we observed all three modes of action as predicted (Fig. 2B). For example, the interaction of the TF SMAD4 with rs17366218 illustrates the loss-of-function binding modality, evidenced by a robust interaction with the nonrisk allele that is effectively abolished upon the introduction of the risk allele (Fig. 2C). This G-to-A mutation disrupts the canonical binding motif for SMAD4, providing plausible explanation for this loss of binding (Fig. 2C). In the case of the SNP rs7431992, SMAD4 exhibits an enhanced binding mode of interaction, binding with both alleles but displaying increased affinity for the risk allele. The T-to-A mutation in this sequence context improves the alignment with the consensus binding motif of SMAD4, likely contributing to the increased affinity for the risk allele (Fig. 2C). A gain-of-function interaction can be observed with the SNP rs429358. The TF TBX2 showed preference for the nonrisk allele of SNP rs429358, with binding reduced with the introduction of the risk allele (Fig. 2C). This matches with the consensus binding motif of TBX2, as T is slightly preferred to C at the SNP position. At the same time, this T-to-C mutation in the risk allele introduces the binding motif for another TF NRF1, resulting in minimal to no binding with the nonrisk allele and strong binding with the risk allele (Fig. 2C). Finally, the interaction between the TF TFAP2E and SNP rs116530595 serves as an example of nondifferential binding, as there is similarly strong signal in both channels for both alleles, resulting in an R* value close to zero (0.08). This nonpreferential binding event is explained by the fact that the known binding consensus sequence for TFAP2E lies outside of the SNP site. This kind of sensitivity agrees with our previous observation that methylation-dependent DNA-TF interactions could also be readily detected on the TF arrays (Hu et al., 2013). We visualized the results for each SNP using an MA plot, which displays the log2 ratio of intensities (M) versus the average log intensity (A), allowing for the identification of differential interactions between the SNP alleles and TFs (Dudoit et al., 2002). In this study, differential binding hits were characterized as R* values above or below the threshold of ±1, representing two-fold difference in binding (Fig. 2D). Identified differential binding events were filtered to consist of interactions with proteins expressed in the brain and nervous system (GTEx Portal; Extended Data Fig. 2-1).
In vitro validation of identified allele-specific SNP–TF interactions
Through the described analysis of array assays conducted on the selected 30 AD-associated SNPs, a total of 90 differential interactions were identified across 51 unique TF proteins (Fig. 3A, Table 4). In addition to these differential interactions, we also observed a total of 794 nondifferential binding interactions across each of the SNPs tested, encompassing 154 distinct TF proteins (Extended Data Fig. 3-1). Such nondifferential interactions by the 154 TFs could be mostly explained by finding their corresponding consensus sequences in the probe sequences, suggesting that they might interact with the TFs showing preferential binding activity. Given the length of the DNA probes that we utilized in these experiments (31 bp), it is possible to observe sequence-specific binding of 2–3 TF proteins with the same sequence, as the average TF consensus motif is between 5 and 20 bases long (Pachkov et al., 2007; Pratt et al., 2022). Identification of heterodimers between differentially bound and nondifferentially bound proteins can provide further evidence of pathways involved in the molecular mechanisms underlying these variants. Therefore, we decided to determine whether there are any known interaction networks formed out of the proteins shown to bind each of the different SNPs, including differential and nondifferential binders. Using the STRING analysis, we mapped the known protein–protein interactions among the identified TFs and found several SNPs with significantly interconnected protein–protein interactions networks (Szklarczyk et al., 2023). We pinpointed three SNPs, rs17366218 (PDE1A), rs7431992 (CACNA2D3), and rs769449 (APOE) exhibiting a range of binding preferences with the TF SMAD4, and notably, SMAD4 emerges as a central node within the interaction networks of the proteins associated with each of these SNPs. Associated GO terminology with these networks also shows relevancy to processes in the brain, suggesting that a role in AD is possible (Extended Data Fig. 4; Thomas et al., 2022). SMAD4 is a signal transduction protein that plays a role in the TGFβ signaling pathway that is critical toward proper neural development and function (Meyers and Kessler 2017). Modulation of the activities of SMAD family proteins in the TGFβ pathway have been shown to impact neurogenesis (Hiew et al., 2021). Additionally, SMAD4 plays a significant role in the BMP signaling pathway, which is vital for neurogenesis in the adult hippocampus—a region notably affected in AD (Zhou et al., 2022). Considering SMAD4's apparent centrality within potential interaction networks for these SNPs, its pivotal role in neural processes, and its potential relevance to AD pathology, we were motivated to pursue further investigation into these interactions.
Figure 3-1
Heat map for all binding interactions observed on the TF arrays A heat map showing all differential and non-differential interactions observed through the array experiments. The x-axis shows each TF, and the y-axis shows each SNP. Interactions showing preference towards the non-risk allele are shown in shades of blue, preferences towards the Risk allele are shown in shades of red, and non-differential interactions are shown in yellow. Grey boxes indicate no binding between the TF and SNP. Across all the arrays, a total of 90 differential and 794 non-differential interactions were observed, encompassing 154 unique TF proteins. Download Figure 3-1, TIF file.
Figure 3-2
STRING and GO analysis for rs17366218, rs7431992, and rs769449 Top – STRING analysis for all binding proteins (differential and non-differential) for rs17366218, rs7431992, and rs76944927. Proteins that demonstrated a Risk preference on the arrays are shown in red, non-risk preference is shown in blue, and no preference is shown in grey. Disconnected nodes were hidden from the network. The thickness of the lines connecting nodes indicates the strength of data support for that interaction. SMAD4 is circled in red in each interaction network. Each network showed significant enrichment, with p-values of 8.99e-15, > 4.31e-10, and 1.29e-07, respectively. Bottom – Gene ontology for the interaction networks of rs17366218 (PDE1A), rs7431992 (CACNA2D3), and rs769449 (APOE) as determined by the STRING database28. Terms shown were selected based upon relevancy to processes in the brain. Analysis Type: PANTHER Overrepresentation Test (Released 20240226); Annotation Version and Release Date: GO Ontology database DOI: 10.5281/zenodo.10536401 Released 2024-01-17; Reference List: Homo sapiens (all genes in database); Annotation Data Set: GO biological process complete; Test Type: Fisher’s Exact; Correction: Calculate False Discovery Rate. Download Figure 3-2, TIF file.
To begin to validate the findings we observed on the arrays, we focused on the interaction of these three SNPs with SMAD4 (Fig. 3A, highlighted in yellow). Our first goal was to replicate the binding interactions we observed on the arrays using an orthogonal assay. To achieve this, we employed an electrophoretic mobility shift assay (EMSA). For each of the selected SNPs, we generated Cy5 fluorescent probes for each allele and subsequently incubated each of these probes separately with purified SMAD4 protein. The resulting reactions were then analyzed using gel electrophoresis to assess any mobility shifts of the DNA probes in comparison with the protein-free control reactions. Ultimately, the EMSA results precisely reproduced each of the binding preferences observed on the TF arrays. Specifically, SMAD4 exhibited a preference for the risk allele of both rs7431992 and rs769449, while favoring the nonrisk allele of rs17366218 (Fig. 3B).
Next, we determined the affinity values for each of these interactions, by utilizing the OCTET system, a real-time, label-free kinetics instrument. This instrument utilizes biolayer interferometry to determine the kon and koff values so that the KD values can be deduced for a given interaction (Barrows and Van Dyke 2022; Pluta et al., 2022). For this experiment, each of the alleles for all three SNPs were end-labeled with biotin and loaded onto OCTET biosensors coated with streptavidin. After a blocking step, each of the biosensors was immersed into purified SMAD4 protein at various concentrations, allowing us to observe binding with the immobilized probe sequences in real time. From this experiment, the fitted on- and off-curves were generated based on the raw data (Fig. 3C). The binding activities were consistent with our previous results, with SMAD4 exclusively binding the expected alleles and KD values ranging from 63 to 146 nM, aligning closely with values previously reported in the literature for TF–DNA interactions.
Examining the impacts of identified allele-specific SNP–TF interactions
It has been well established that many TF proteins can act both as transcription activators and repressors, depending on its binding partners and/or the surrounding chromatin context (Bylino et al., 2020; Weidemüller et al., 2021). It is also possible that a TF-binding event does not translate to changes in downstream gene transcription when it sits on a poised enhancer (Spivakov 2014; Banks et al., 2016). Therefore, we next sought to determine if these binding interactions could have a direct impact on transcriptional regulation with a cell-based luciferase assay. First, SNP allele sequences were cloned into luciferase reporter vector pGL4.32 to replace the NF-kB response element present to drive the luciferase reporter gene (Table 5). Additionally, an overexpression vector for SMAD4 was generated, driven by a CMV promoter. Cells were cotransfected with the luciferase reporter carrying either the risk or nonrisk allele, the SMAD4 overexpression vector, and a Renilla control vector. Following 2 d of incubation, cell lysates were processed through a dual luciferase reporter assay system, measuring the luminescence generated after successive treatment with Firefly luciferase and Renilla substrates. To account for well-to-well variability in cell count, Firefly luciferase signals were normalized to Renilla control signals. Signals were further normalized to control wells containing SNP luciferase vector with no overexpression vector added to account for background interaction with endogenous proteins. After quantification of this assay, we observed significantly increased induction of luciferase expression with each of the alleles with which SMAD4 had previously shown preference, suggesting that SMAD4 is likely to act as a transcription activator via preferential binding activity to the identified alleles (Fig. 3D).
Cell-based validation of SMAD4 differential interactions
Finally, we wanted to see if the differential binding interactions observed on the arrays could persist on the expected genetic backgrounds in cells. To achieve this, the homozygous and heterozygous HEK293t cell lines carrying the risk and nonrisk alleles of CACNA2D3 and PDE1A were generated using a CRISPR-based genetic engineering method as illustrated in Figure 4A–C and as described previously (Ran et al., 2013). Due to challenges with cell line generation, we were unable to produce cells with the corresponding nonrisk and risk alleles at the APOE locus. SMAD4 expression construct was transiently transfected to these cell lines under the control of the CMV promoter (Fig. 4C). After 48 h of incubation, cells were harvested and anti-SMAD4 antibodies were used to perform chromatin immunoprecipitation (ChIP). The SMAD4 occupancy at different loci was determined via qPCR using primer pairs specific for each SNP containing region (Table 6). For the PDE1A and CACNA2D3 experiments, only these loci were altered, with all others tested being wild type. mRNA expression of PDE1A and CACNA2D3 in the same cell lines after SMAD4 overexpression were also determined in parallel (Table 7).
As a result of these experiments, SMAD4 showed a significantly higher occupancy to the homozygous nonrisk allele than the risk allele at the PDE1A locus, with attenuated binding observed in the heterozygous line compared with the nonrisk (Fig. 4D, left panel). No significant changes in occupancy were observed in the other wild-type loci tested. Following SMAD4 overexpression in the same cell lines, the expression level of PDE1A was observed to be elevated in the nonrisk line compared with the risk and heterozygous lines (Fig. 4E, left panel). This supports the notion that the SNP located at rs17366218 interferes with SMAD4's engagement with the DNA at this locus, potentially resulting in reduced PDE1A expression in the context of AD. In the same experiment using the CACNA2D3 cell lines, SMAD4 was found to occupy the homozygous risk allele to a higher degree than the nonrisk or heterozygous lines, with the heterozygous line showing intermediate occupation (Fig. 4D, right panel). Again, no significant impact in allele occupation could be observed in the other wild-type loci tested. However, no allele-dependent effects can be observed in the expression of CACNA2D3 after SMAD4 overexpression (Fig. 4E, right panel). This contrasts with the result we observed in the luciferase assay, in which the binding interaction between SMAD4 and the SNP in CACNA2D3 impacted the transcription of the luciferase gene. One potential explanation for this discrepancy is that the luciferase expression vector utilized in the assay contained three copies of the SNP, potentially amplifying the effect of TF binding compared with the single copy present in the native genomic context. Additionally, in a real cellular context, compensation by other endogenous factors might mitigate the impact of decreased SMAD4 binding at that locus. The absence of expression changes could also reflect regulatory redundancy, where multiple regulatory elements can compensate for each other to maintain proper gene expression levels when individual elements are disrupted. Another important consideration to note is the choice of cells for this experiment. We used HEK293t cells, which differ from brain cells where CACNA2D3 is primarily active. Therefore, the results we obtained might not accurately represent what happens in the brain. CACNA2D3 is known to have its highest expression in brain tissue, and it is possible that a different cellular environment could lead to changes in the gene's expression. This suggests the importance of considering the specific cellular context when interpreting these results.
Discussion
Despite its widespread prevalence, the precise molecular mechanisms underlying AD remain poorly understood. GWAS have begun to uncover variations in the human genome that are associated with AD (Han et al., 2010; Shen et al., 2010; Schott et al., 2016; Jun et al., 2017); however, the causal nature of many of these variants has yet to be determined in a systemic way. Most variations identified via GWAS are SNPs; however, the challenge is that many of the SNPs are located within linkage disequilibrium and/or noncoding regions of the genome, making it challenging to establish potential functionality. In this study, we aimed to generate a high-throughput TF-WAS pipeline to facilitate the identification of functional GWAS SNPs and associated TFs that may play a role in the etiology of AD. Using multimodality bioinformatics analysis, we were able to narrow down a list of AD SNPs that (1) are associated with AD through GWAS, (2) have demonstrated an impact on gene expression in the brain and nervous system via eQTL, and (3) are located within noncoding regions of the genome (as annotated by EnhancerAtlas and SEA 3.0). This allowed us to focus on SNPs that are likely to be functional in cell types relevant to disease pathology. By screening SNPs from this list on the human TF protein arrays, we were able to detect differential binding interactions with transcription factors, potentially providing molecular insights into how these genetic variants may convey dysregulation of transcription and therefore, influencing AD susceptibility. Using several orthogonal in vitro methods, we could validate all the allele-specific SNP–TF interactions observed on the TF array and thus paving the way to identify functional SNPs in other complex diseases.
Of the notable findings from our study were the allele-specific binding preferences of SMAD4 with several AD-associated SNPs (rs17366218, rs7431992, and rs769449), with SMAD4 showing a range of binding preferences with these SNPs. SMAD4 is highly conserved from yeast to humans and is also widely expressed in various human tissues and organs, including the human brain (Human Protein Atlas; De Caestecker et al., 1997; Uhlén et al., 2015). This protein acts as a signal transducer in the TGFβ and BMP signaling pathways and plays numerous roles in various human diseases, particularly in cancer, and has also been implicated in AD (Das and Golde 2006; Wan et al., 2021; Wang et al., 2021; Kapoor and Chinnathambi 2023). The role of SMAD4 in the BMP signaling pathway is especially interesting, as this pathway is critical to neurogenesis in the adult hippocampus, a region of the brain that is notably affected in AD (Zhou et al., 2022).
One target of interest for SMAD4 that we identified using the TF arrays is rs17366218, an intron variant mapped to the PDE1A gene. PDE1A is a part of a family of cyclic nucleotide phosphodiesterases (PDEs) and functions in the degradation of cyclic nucleotide second messengers, showing preference toward the degradation of cyclic guanidine monophosphate (cGMP; Azevedo et al., 2014). PDE1A is specifically expressed at high levels in the brain, kidney, and thyroid (Michibata et al., 2001; Fidock et al., 2002; Lefièvre et al., 2002). In one study, an association was identified between decreased cGMP levels in the cerebral spinal fluid of patients, but not cAMP, and the severity of dementia in AD patients (Hesse et al., 2017). PDE1A has also been shown to have a close connection to aging in rodent models (Kelly et al., 2014).
Another interesting candidate target we discovered is rs7431992, an intron in the gene CACNA2D3. CACNA2D3 codes for a subunit of the voltage-gated calcium channel (VGCC) complex and is expressed most abundantly in the brain (Human Protein Atlas; Uhlén et al., 2015; Ablinger et al., 2020). Elevated levels of calcium have been shown to be associated with and exacerbate the symptoms of neurological disorders such as AD (Guan et al., 2021; Ge et al., 2022). More specifically, elevated levels of cellular calcium have been shown to accelerate beta-amyloid peptide aggregation, contributing to the onset of amyloid-associated pathology (Isaacs et al., 2006).
The third target we identified as a potential target for SMAD4 is rs769449, a SNP that falls between exons 2 and 3 of APOE. Variations in the APOE gene have been determined to be significant genetic risk factors for late-onset AD, and APOE gene is one of the most widely studied genes associated with the disease (Yamazaki et al., 2019). APOE is abundantly expressed in the central nervous system, serving the primary function of mediating lipid transport in the brain (Raulin et al., 2022). A variety of studies have identified an association of rs769449 with cognitive decline, as well as levels of tau and Aβ42 (Cruchaga et al., 2013; Zhang and Pierce 2014).
Using EMSA and real-time kinetics measurements, we validated these interactions with SMAD4 and determined their binding affinity values. These experiments confirmed the results that we observed on the arrays. Furthermore, our cell-based luciferase assay provided additional evidence of the functional impact of these allele-specific interactions, suggesting that they may have a direct influence on transcription. While these traditional luciferase assays provided valuable insights, future studies could benefit from implementing massively parallel reporter assays (MPRAs), which offer increased statistical power through simultaneous testing of thousands of sequences and could provide more comprehensive characterization of allele-specific effects (McAfee et al., 2022).
In the final phase of our investigation, we extended our analysis to cellular contexts by generating homozygous and heterozygous cell lines for specific SNPs. We observed that SMAD4's binding preference in these cells was consistent with our in vitro findings. Moreover, the expression of PDE1A exhibited allele-dependent changes in expression, providing further evidence of the potential functional consequences of this SNP–TF interaction in a cellular context. Nonetheless, our study also revealed the importance of considering the specific cellular context when interpreting these results. The choice of HEK293t cells, which differ from brain cells in which certain genes are primarily active, underscores the need for further research in physiologically relevant cell types to better understand the consequences of these allele-specific interactions. Future directions for this study will include validation in more physiologically relevant cell types, such as neurons or microglia. Variant screening strategies can also be strengthened by incorporating knockdown studies, particularly through advanced methods like highly multiplexed CRISPRi screening, to evaluate the functional impact of candidate variants prior to cell line generation (Gasperini et al., 2019). Additionally, chromosome conformation capture techniques can reveal the three-dimensional interactions between regulatory elements, providing crucial spatial context. These complementary approaches can help differentiate between direct eQTL effects and those stemming from genetic linkage, improving our ability to identify causal variants.
Our study highlights the power of using TF-WAS to explore the functional implications of GWAS-identified SNPs in AD. By combining bioinformatics, in vitro validation, and cellular experiments, we were able to shed light on some of the intricate molecular mechanisms that may underlie AD susceptibility. Further research in this direction, including screening and validation of a more comprehensive list of SNPs, will undoubtedly contribute to the ongoing efforts to combat this devastating neurodegenerative condition. TF-WAS can further be expanded to proteome-wide association studies (PWAS) using human proteome microarrays (Jeong et al., 2012), allowing for systematic identification of protein–DNA interactions and potential regulatory networks beyond just transcription factors, thus providing a more comprehensive understanding of genetic regulatory mechanisms at the proteome level. This methodology holds great promise for advancing our understanding of not just AD, but many other complex diseases and traits, potentially leading to the discovery of new therapeutic targets and strategies for the treatment of AD and beyond. A key strength of this methodology is its broad applicability to essentially any complex disease or trait with available GWAS datasets. Our future pursuits will include the profiling of diverse complex diseases, with the aim of generating data that can provide guidance toward clinical interventions across a spectrum of health conditions.
Footnotes
This work was supported by National Institutes of Health grants R01AG061852 (H.Z., H.S., and J.Q.), R01MH122451 (H.Z.), R01EY030475 (H.Z. and J.Q.), P50AA027054 (H.Z.), R35NS097370 (G.M.), R35NS116843, and RF1AG079557 (H.S.) and by the Dr. Miriam and Sheldon G. Adelson Medical Research Foundation (G.M.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
↵*J.D., C.M., and N.-S.K. contributed equally to this work.
The authors declare no competing financial interests.
- Correspondence should be addressed to Heng Zhu at hzhu4{at}jhmi.edu, Hongjun Song at shongjun{at}pennmedicine.upenn.edu, Yijing Su at yijingsu{at}upenn.edu, or Jiang Qian at qianjiang007{at}gmail.com.