A working refresher on chromatin, methylation, gene regulatory networks — and the new generation of sequence-to-function and single-cell foundation models that are quietly redrawing how we read the regulatory genome.
Two things changed in the last 24 months. First, "non-mutational epigenetic reprogramming" and "unlocking phenotypic plasticity" were formally added to the hallmarks of cancer — pulling chromatin out of supporting-actor status. Second, the sequence-to-function modelling field crossed a quiet threshold: AlphaGenome, Borzoi, and a wave of single-cell foundation models now make in-silico epigenome prediction tractable enough to plan experiments around.
If your work sits at the intersection of cancer genetics and pathology — and especially if you are interested in tissue-specific oncogene behaviour — the practical question is no longer whether epigenetics matters. It's how to read it computationally with enough resolution to be useful at the bedside or the bench.
This guide is built around three intertwined goals: (i) refresh first principles — methylation, histone marks, GRN architecture — with enough rigour that the language of recent papers feels native again; (ii) connect those principles to the concept of cell identity and plasticity, because that is where current cancer epigenetics is concentrated; and (iii) survey the predictive-model landscape so you can pick the right tool for the right question.
It is meant to be read, not just consulted. The order matters. Each section assumes the previous one.
Every nucleated cell in your body carries the same genome. The reason a hepatocyte and a memory T-cell behave differently is encoded one layer up — in chemical modifications to DNA and to the histone proteins it wraps around, and in the three-dimensional folding of the resulting fibre.
It is worth getting the vocabulary precise, because the field is sloppy with it. Modern reviews tend to organise the epigenome along four axes:
These four are not independent; they form a single system. A change at any layer feeds back into the others. Most of the methodological progress in the last five years has been in measuring them jointly at single-cell resolution, and most of the modelling progress has been in learning how to translate sequence into all of them simultaneously.
A cis-regulatory element (CRE) is a non-coding stretch of DNA (promoter, enhancer, silencer, insulator) that controls the transcription of nearby genes through transcription factor binding and chromatin context. Roughly 98% of the human genome is non-coding, and most disease-associated GWAS hits fall in non-coding regions, many of them within CREs. Predicting how a variant alters CRE function is the core task of every model in section VII.
Quick mental map of the assays you'll see referenced everywhere:
Methylation gets taught as a binary on/off mark on CpG islands. That model is correct enough for an undergraduate exam and dangerously wrong for research. The current view is far richer.
5-methylcytosine (5mC) is written by DNMT3A/B de novo and maintained by DNMT1 at replication forks. It is actively erased through stepwise oxidation by TET1/2/3 — 5mC → 5hmC → 5fC → 5caC — followed by base excision repair. 5hmC is not just an intermediate; it accumulates at active enhancers and gene bodies of transcribed genes, and is functionally distinct from 5mC.
Genomic context dictates effect:
Promoter CpG-island hypermethylation of tumour-suppressor genes (MLH1, CDKN2A, BRCA1) is a textbook oncogenic mechanism.
For a pathologist this is the headline: methylation patterns are cell-type-specific and stable, which is why DNA-methylation-based tumour classifiers (Capper et al. on CNS tumours, the sarcoma classifier, the various pan-cancer iterations) work as well as they do. The same property underlies methylation clocks (Horvath, GrimAge) and recent single-cell methylation atlases. When you read a paper that uses methylation to call cell-of-origin or tumour subtype, what's actually being exploited is the developmental memory written into the methylome.
Standard bisulfite sequencing cannot distinguish 5mC from 5hmC; both read as C. Use oxidative bisulfite (oxBS-seq), TAB-seq, or ACE-seq if 5hmC matters, and in active tissues like brain it really does.
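The arithmetic behind paired BS/oxBS profiling can be sketched in a few lines. Conventional bisulfite reports 5mC + 5hmC together, while oxidative bisulfite reports 5mC alone, so a per-CpG 5hmC estimate is the difference between the two. This is a minimal illustration with made-up β-values; the function name and the floor-at-zero handling of noise are assumptions for the example, and real pipelines model the subtraction statistically.

```python
def estimate_5hmc(beta_bs, beta_oxbs):
    """Per-CpG 5hmC estimate: (5mC + 5hmC from BS) minus (5mC from oxBS).

    Negative differences arise from measurement noise and are floored at 0.
    """
    if len(beta_bs) != len(beta_oxbs):
        raise ValueError("BS and oxBS vectors must cover the same CpGs")
    return [max(0.0, bs - ox) for bs, ox in zip(beta_bs, beta_oxbs)]


# Illustrative values only (not real data): three CpGs.
beta_bs = [0.80, 0.50, 0.10]    # fraction read as methylated by BS-seq
beta_oxbs = [0.60, 0.50, 0.15]  # fraction read as methylated by oxBS-seq

hmc = estimate_5hmc(beta_bs, beta_oxbs)
print([round(x, 2) for x in hmc])
```

The first CpG carries about 20% 5hmC, the second none, and the third shows how noise can push the raw difference below zero.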
There are too many histone modifications to memorise. There are about a dozen you genuinely need to know, because they recur in nearly every chromatin paper and they are what foundation models are trained to predict.
The mental model: histone tails carry a combinatorial alphabet of marks, written and read by enzyme complexes, that cooperatively define functional states — active promoter, poised promoter, active enhancer, primed enhancer, transcribed gene body, facultative heterochromatin, constitutive heterochromatin. The ChromHMM and Segway frameworks formalise this into discrete states; modern models predict the underlying tracks directly.
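To make the combinatorial logic concrete, here is a deliberately simplified rule-of-thumb caller that maps a set of marks at one locus to one of the states named above. It is not ChromHMM or Segway (those learn states probabilistically from binned signal); the rules and the function name are assumptions chosen to mirror the conventional textbook associations.

```python
def call_state(marks):
    """Map a set of histone marks present at a locus to a coarse state label.

    Rules are checked in priority order; this is a teaching simplification.
    """
    marks = set(marks)
    if {"H3K4me3", "H3K27me3"} <= marks:
        return "poised (bivalent) promoter"
    if "H3K4me3" in marks:
        # Simplification: H3K4me3 without H3K27me3 is treated as active.
        return "active promoter"
    if {"H3K4me1", "H3K27ac"} <= marks:
        return "active enhancer"
    if "H3K4me1" in marks:
        return "primed enhancer"
    if "H3K36me3" in marks:
        return "transcribed gene body"
    if "H3K27me3" in marks:
        return "facultative heterochromatin"
    if "H3K9me3" in marks:
        return "constitutive heterochromatin"
    return "no assigned state"


for locus_marks in (
    {"H3K4me3", "H3K27ac"},
    {"H3K4me1", "H3K27ac"},
    {"H3K4me1"},
    {"H3K9me3"},
):
    print(sorted(locus_marks), "->", call_state(locus_marks))
```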
The KMT2 (MLL) family — KMT2A, KMT2C, KMT2D in particular — writes H3K4 methylation at enhancers and promoters. Their loss-of-function mutations are recurrent across solid tumours, and (as you've seen in your own SRC carcinoma cohort) they often define molecular subgroups. The functional consequence is enhancer dysregulation: when you can't deposit H3K4me1/me3 properly, lineage-defining enhancer landscapes erode, and cells drift toward less differentiated, more plastic states.
Stem cells and many tumour cells carry bivalent domains — H3K4me3 and H3K27me3 at the same promoter, simultaneously activating and repressing. These mark genes ready for rapid resolution in either direction depending on differentiation cues, and they are mechanistically central to plasticity.
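In practice, candidate bivalent promoters are often nominated by intersecting H3K4me3 and H3K27me3 peak calls with promoter windows. The sketch below does that with plain interval overlap; gene names, coordinates, and function names are hypothetical, and intervals are assumed to be half-open (chrom, start, end) tuples.

```python
def overlaps(a, b):
    """True if two half-open (chrom, start, end) intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def bivalent_genes(promoters, k4me3_peaks, k27me3_peaks):
    """Return genes whose promoter window overlaps both mark types."""
    hits = []
    for gene, window in promoters.items():
        has_k4 = any(overlaps(window, p) for p in k4me3_peaks)
        has_k27 = any(overlaps(window, p) for p in k27me3_peaks)
        if has_k4 and has_k27:
            hits.append(gene)
    return sorted(hits)


# Hypothetical promoter windows and peak calls.
promoters = {
    "GENE_A": ("chr1", 1000, 3000),
    "GENE_B": ("chr1", 8000, 10000),
    "GENE_C": ("chr2", 500, 2500),
}
k4me3_peaks = [("chr1", 1500, 2500), ("chr1", 8200, 9000), ("chr2", 5000, 6000)]
k27me3_peaks = [("chr1", 500, 4000), ("chr2", 0, 3000)]

print(bivalent_genes(promoters, k4me3_peaks, k27me3_peaks))
```

One caveat worth keeping in mind: co-occurrence in bulk ChIP-seq can reflect a mixture of cells carrying one mark each, so true bivalency is usually confirmed with sequential ChIP or single-cell methods.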
"Non-mutational epigenetic reprogramming" and "unlocking phenotypic plasticity" are now formally listed alongside the classical hallmarks of cancer.
A cell's identity is not a list of expressed genes. It is a self-reinforcing circuit of transcription factors that mutually activate each other and lock down alternative fates. The epigenome is the substrate that stores and stabilises that circuit between cell divisions.
The dominant framework, articulated most cleanly by Young, Bradner and others, is the core regulatory circuit (CRC): a small set of lineage-defining transcription factors that bind their own enhancers and each other's, forming a positive-feedback module. Knock one out and the circuit collapses; force-express one in another lineage and it can drive transdifferentiation. Examples worth knowing — OCT4/SOX2/NANOG in pluripotency; PU.1/CEBPA in myeloid; MITF in melanocyte/melanoma; HNF4A in hepatocyte; CDX2 in intestinal epithelium; ATOH1 in secretory lineage.
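The CRC definition translates directly into a graph search: keep transcription factors that bind their own enhancers, then look for the largest group in which every member also binds every other member's enhancers. The sketch below is a brute-force toy under those assumptions; the binding map is illustrative rather than measured, `TF_X` and `TF_Y` are hypothetical, and published CRC callers work from super-enhancer and motif data instead.

```python
from itertools import combinations


def find_crc(binds):
    """Largest set of autoregulated TFs that all bind each other's enhancers.

    `binds` maps each TF to the set of TFs whose enhancers it occupies.
    Brute force over subsets, so only suitable for small candidate lists.
    """
    autoregulated = sorted(tf for tf, targets in binds.items() if tf in targets)
    for size in range(len(autoregulated), 1, -1):
        for group in combinations(autoregulated, size):
            if all(b in binds[a] for a in group for b in group):
                return set(group)
    return set()


# Illustrative binding map (not measured data).
binds = {
    "OCT4": {"OCT4", "SOX2", "NANOG"},
    "SOX2": {"OCT4", "SOX2", "NANOG"},
    "NANOG": {"OCT4", "SOX2", "NANOG", "TF_X"},
    "TF_X": {"TF_X"},   # autoregulated, but does not bind the others
    "TF_Y": {"OCT4"},   # no self-loop, so never a CRC member
}

print(sorted(find_crc(binds)))
```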
This is also where the bridge to your KRAS work lives: the mucosecretory transcriptional program you have characterised — TFF1/2/3, MUC5AC, AGR2, with ATOH1 upstream — is essentially a CRC. The reason it shows up across pancreas, colon, lung and stomach KRAS-mutant tumours is that it is a permissive epigenetic state, not a coincidence of mutation.
Three classes of methods, in rough order of historical appearance:
For your kind of work — characterising why a particular oncogenic driver behaves the way it does in a particular tissue — SCENIC+ on paired scRNA + scATAC is currently the highest-yield off-the-shelf approach, but the sequence-to-function generation will overtake it within a year or two for variant-level questions.
A differentiated cell is supposed to stay differentiated. Cancer's most dangerous trick is breaking that constraint — not by mutating it away, but by letting the chromatin state slide back toward an embryonic, multipotent or invasive configuration.
Three overlapping but distinguishable phenomena get called "plasticity":
All three share a chromatin signature: increased accessibility at developmental enhancers, loss of lineage-specific H3K4me1/me3 patterns, redistribution of H3K27me3, and often global hypomethylation with focal hypermethylation of differentiation regulators.
The cancer-stem-cell field spent twenty years arguing about whether CSCs are a fixed subpopulation or a transient state. The current synthesis: they are a dynamic equilibrium, maintained by epigenetic plasticity, in which any cell with the right chromatin permissiveness can occupy the stem-like state under the right microenvironmental cues. This is why surface-marker-defined CSCs are unstable in culture, and why epigenetic therapies have re-emerged as a way to "lock in" differentiated states.
The morphological features pathologists already use — differentiation grade, mucin production, signet-ring morphology, sarcomatoid change, neuroendocrine differentiation — are all visible manifestations of underlying chromatin states. Connecting morphology to chromatin computationally (CellViT-style segmentation paired with spatial epigenomics) is one of the highest-value research directions opening up right now.
For roughly a decade, sequence-to-function modelling progressed in modest increments — DeepSEA, then Basset, Basenji, Basenji2. In the last 18 months it has accelerated sharply. Three families dominate the current landscape, with characteristic strengths and trade-offs.
These take genomic sequence as input and predict thousands of functional tracks — histone marks, accessibility, TF binding, expression — in parallel.
Enformer is the model that proved transformers could integrate distal regulatory information up to ~100 kb to predict cell-type-specific expression and chromatin signal. It is still the reference baseline; many newer models report performance relative to it.
Borzoi extends Enformer's lineage to ~524 kb of context and, crucially, predicts RNA-seq coverage directly, so splicing and polyadenylation effects fall out of the same model. It is currently the strongest open-weights option for variant-effect prediction across multiple regulatory layers. Flashzoi, a 2025 reimplementation with rotary embeddings and FlashAttention-2, is ~3× faster with comparable or better performance.
AlphaGenome is the 2025–2026 inflection point. It takes 1 Mb of input and predicts thousands of tracks across modalities (expression, transcription initiation, accessibility, histone modifications, TF binding, contact maps, splice site usage, splice junction strength) at up to base-pair resolution in a single unified model. It is reported to match or exceed prior best-in-class on 25 of 26 variant-effect benchmarks. It is available via an API for non-commercial research and is not yet validated for clinical use. It reads like Enformer's logical conclusion plus a splicing model bolted on.
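Whichever of these models is used, variant-effect scoring follows the same pattern: predict tracks for the reference sequence, predict again with the alternate allele substituted, and take the per-track difference. The sketch below shows only that pattern. `toy_predict` is a stand-in that counts a motif and a dinucleotide, not the Borzoi or AlphaGenome interface, and every name here is an assumption for illustration.

```python
def apply_snv(seq, pos, ref, alt):
    """Return `seq` with the base at 0-based `pos` changed from ref to alt."""
    if seq[pos] != ref:
        raise ValueError(f"reference mismatch at {pos}: found {seq[pos]!r}")
    return seq[:pos] + alt + seq[pos + 1:]


def toy_predict(seq):
    """Stand-in for a sequence-to-function model: sequence in, named tracks out."""
    return {
        "motif_hits": float(seq.count("TATA")),
        "gc_dinucleotide_fraction": seq.count("GC") / len(seq),
    }


def variant_effect(seq, pos, ref, alt, predict):
    """Per-track difference (alt minus ref) under any `predict` callable."""
    ref_pred = predict(seq)
    alt_pred = predict(apply_snv(seq, pos, ref, alt))
    return {track: alt_pred[track] - ref_pred[track] for track in ref_pred}


sequence = "GGCTATAAGGC"          # hypothetical 11-bp window
effect = variant_effect(sequence, 5, "T", "G", toy_predict)
print(effect)
```

Swapping `toy_predict` for a real model wrapper keeps the rest unchanged, which is why comparing two models on the same variant is mostly a matter of aligning their track names and input window sizes.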
A complementary line of work focuses specifically on predicting chromatin tracks from existing chromatin tracks, rather than from sequence:
Attention-based imputation of missing epigenomic tracks, conditioned on observed ones. Useful when you have partial coverage of a sample (a few histone marks + accessibility) and want to reconstruct the rest. Particularly valuable for individual-specific epigenomics where exhaustive profiling is not feasible.
EPInformer (2025) is worth flagging because it takes a different architectural bet: rather than scaling raw sequence context, it integrates pre-computed Hi-C contact information with cCRE annotations, and reportedly outperforms Enformer and Borzoi on enhancer-promoter prediction in K562 and GM12878 with two orders of magnitude fewer parameters. The lesson is that for some tasks, prior structural information beats raw scale.
A rough decision rubric for your kind of work:
The first wave of single-cell foundation models — Geneformer, scGPT, scBERT, scFoundation — were exclusively transcriptomic. The second wave is bringing in epigenetic modalities, because gene expression alone is a lagging indicator of cell state.
RNA tells you what a cell is doing right now. Chromatin accessibility and methylation tell you what it can do — which programs are licensed, which are silenced, which are poised. For questions about plasticity, lineage commitment, or transdifferentiation, the epigenetic state is the more informative variable. A cell can be transcriptionally quiescent and chromatin-primed for switching; that's invisible to scRNA-seq.
EpiFoundation is the first foundation model purpose-built for scATAC-seq, with a peak-to-gene cross-modality pre-training objective that addresses the inherent sparsity of chromatin accessibility data. It is used for cell-type annotation, batch correction, and gene-expression prediction directly from accessibility, and was state-of-the-art on multiple benchmarks at release.
Models pseudobulk scATAC signals with explicit transcription-factor information to predict cell-type-specific regulation. Trades single-cell resolution for stronger regulatory inference; useful when your question is about programs rather than individual cells.
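Pseudobulking itself is simple: sum the sparse per-cell peak counts within each cluster or cell type so that each group yields one denser accessibility profile. A minimal sketch, assuming a small dense cell-by-peak matrix and hypothetical cluster labels (real data would use sparse matrices and a library such as scanpy or SnapATAC2):

```python
def pseudobulk(counts, labels):
    """Sum cell-by-peak counts within each label.

    counts: list of per-cell lists, one integer per peak.
    labels: one cluster or cell-type label per cell.
    """
    if len(counts) != len(labels):
        raise ValueError("need exactly one label per cell")
    aggregated = {}
    for row, label in zip(counts, labels):
        profile = aggregated.setdefault(label, [0] * len(row))
        for i, value in enumerate(row):
            profile[i] += value
    return aggregated


# Three cells, three peaks; labels are hypothetical clusters.
counts = [
    [1, 0, 2],
    [0, 1, 1],
    [3, 0, 0],
]
labels = ["cluster_A", "cluster_A", "cluster_B"]

profiles = pseudobulk(counts, labels)
print(profiles)
```

The trade-off described above is visible even here: three cells collapse to two profiles, so within-cluster heterogeneity is no longer recoverable.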
Integrates bulk epigenetic data with paired scRNA + scATAC for downstream tasks: master-regulator identification, enhancer prediction, functional-variant scoring. Worth knowing as a paired-modality reference point.
For a researcher coming from molecular pathology with scRNA / scATAC / methylation data, the realistic 2026 workflow looks something like this: use a transcriptomic foundation model (scGPT / Geneformer) for cell-type annotation, embedding, and zero-shot transfer; pair it with an scATAC-specific model (EpiFoundation, GET) for accessibility-derived cell-state and regulatory inference; and run sequence-to-function models (Borzoi, AlphaGenome) on specific loci of interest for variant interpretation and in-silico perturbation. None of these replace each other.
Compressed into evening-and-weekend pace. The week numbers assume one substantive paper plus one practical exercise per week. Adjust to your reality.
Allis & Jenuwein Nat Rev Genet review on chromatin states. Build a one-page mental map of the marks in §IV. Exercise: pull a public ChIP-seq track for H3K27ac in a tissue you know well; eyeball it against gene expression.
Schübeler 2015 Nature review (still the cleanest synthesis); then Capper et al. 2018 on methylation-based CNS tumour classification. Exercise: download an Illumina methylation-array dataset from TCGA (TCGA methylation data are predominantly 450K rather than EPIC) and run a quick UMAP on β-values across two tumour types.
Read the SCENIC+ paper (Bravo González-Blas et al. Nat Methods 2023). Exercise: run SCENIC+ on a published paired scRNA + scATAC dataset; identify the core regulatory circuit of a lineage you care about.
Hanahan 2022 hallmarks update; one of the 2024–2025 plasticity reviews from §X. Exercise: take one of your own SRC subgroups (S1/S2/S3) and ask which chromatin program would distinguish them — sketch the experiment.
Read Enformer (2021), then Borzoi (Nat Genet 2025), then AlphaGenome (Nature 2026). Exercise: pick one non-coding variant from your data; score it through Borzoi (open weights) and AlphaGenome (API); compare predictions across modalities.
Read EpiFoundation; skim the recent Nature Methods review of scFMs. Exercise: load one of your own scRNA datasets into Geneformer or scGPT for zero-shot annotation, and compare to your existing labels.
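For the methylation exercise in the plan above (UMAP on β-values), two optional preprocessing steps are common before any embedding: restricting to the most variable probes, and converting β-values to M-values, M = log2(β / (1 − β)), which spreads out the compressed extremes near 0 and 1. The sketch below shows both with the standard library only; probe IDs and values are hypothetical, and the clipping constant is an assumption to avoid taking the log of zero.

```python
import math
from statistics import pvariance


def beta_to_m(beta, eps=1e-6):
    """Convert a beta-value in [0, 1] to an M-value, clipping away 0 and 1."""
    b = min(max(beta, eps), 1 - eps)
    return math.log2(b / (1 - b))


def top_variable_probes(beta_matrix, k):
    """Return the k probe IDs with the highest variance across samples."""
    ranked = sorted(
        beta_matrix,
        key=lambda probe: pvariance(beta_matrix[probe]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical probes measured in four samples.
beta_matrix = {
    "probe_a": [0.1, 0.9, 0.1, 0.9],
    "probe_b": [0.5, 0.5, 0.5, 0.5],
    "probe_c": [0.2, 0.4, 0.2, 0.4],
}

selected = top_variable_probes(beta_matrix, k=2)
m_values = {p: [beta_to_m(b) for b in beta_matrix[p]] for p in selected}
print(selected)
```

The resulting `m_values` table (or the filtered β-values, if you prefer to stay on that scale) is what would be handed to UMAP.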
Curated, not exhaustive. If you read only what is on this list and do the exercises in §IX, you will be fluent in the current literature within two months.