A Reading & Reference Guide For V. Ljungström 2026 / Vol. I

Cell Identity, Plasticity, and the Predictive Epigenome

A working refresher on chromatin, methylation, gene regulatory networks — and the new generation of sequence-to-function and single-cell foundation models that are quietly redrawing how we read the regulatory genome.

I. Prologue

Why a refresher, now?

Two things changed in the last few years. First, "non-mutational epigenetic reprogramming" and "unlocking phenotypic plasticity" were formally added to the hallmarks of cancer (Hanahan, 2022), pulling chromatin out of supporting-actor status. Second, the sequence-to-function modelling field crossed a quiet threshold: AlphaGenome, Borzoi, and a wave of single-cell foundation models now make in-silico epigenome prediction tractable enough to plan experiments around.

If your work sits at the intersection of cancer genetics and pathology — and especially if you are interested in tissue-specific oncogene behaviour — the practical question is no longer whether epigenetics matters. It's how to read it computationally with enough resolution to be useful at the bedside or the bench.

This guide is built around three intertwined goals: (i) refresh first principles — methylation, histone marks, GRN architecture — with enough rigour that the language of recent papers feels native again; (ii) connect those principles to the concept of cell identity and plasticity, because that is where current cancer epigenetics is concentrated; and (iii) survey the predictive-model landscape so you can pick the right tool for the right question.

It is meant to be read, not just consulted. The order matters. Each section assumes the previous one.

◆ ◆ ◆
II. Foundations

The epigenome as regulatory layer

Every nucleated cell in your body carries the same genome. The reason a hepatocyte and a memory T-cell behave differently is encoded one layer up — in chemical modifications to DNA and to the histone proteins it wraps around, and in the three-dimensional folding of the resulting fibre.

The four substrates

It is worth getting the vocabulary precise, because the field is sloppy with it. Modern reviews tend to organise the epigenome along four axes:

  1. DNA methylation — covalent addition of a methyl group, predominantly to cytosine in CpG dinucleotides, written by DNMTs and erased (oxidatively) by TET enzymes.
  2. Histone post-translational modifications — methylation, acetylation, phosphorylation, ubiquitination and dozens more on the tails of H2A, H2B, H3, H4. Written, read, and erased by dedicated enzyme families.
  3. Chromatin accessibility & nucleosome positioning — whether transcription factors can physically reach their motifs. Measured by ATAC-seq, DNase-seq, MNase-seq.
  4. 3D genome architecture — topologically associating domains (TADs), loops, compartments. Measured by Hi-C, Micro-C and their single-cell variants.

These four are not independent; they form a single system. A change at any layer feeds back into the others. Most of the methodological progress in the last five years has been in measuring them jointly at single-cell resolution, and most of the modelling progress has been in learning how to translate sequence into all of them simultaneously.

Anchor concept
cis-regulatory element (CRE)

A non-coding stretch of DNA (promoter, enhancer, silencer or insulator) that controls the transcription of nearby genes through transcription factor binding and chromatin context. Roughly 98% of the human genome is non-coding, and most disease-associated GWAS hits fall in non-coding sequence, many of them within CREs. Predicting how a variant alters CRE function is the core task of the sequence-to-function models in section VII.

What ChIP, ATAC and the rest actually measure

Quick mental map of the assays you'll see referenced everywhere:

III. DNA methylation

Methylation, in earnest

Methylation gets taught as a binary on/off mark on CpG islands. That model is correct enough for an undergraduate exam and dangerously wrong for research. The current view is far richer.

The basic chemistry

5-methylcytosine (5mC) is written by DNMT3A/B de novo and maintained by DNMT1 at replication forks. It is actively erased through stepwise oxidation by TET1/2/3 — 5mC → 5hmC → 5fC → 5caC — followed by base excision repair. 5hmC is not just an intermediate; it accumulates at active enhancers and gene bodies of transcribed genes, and is functionally distinct from 5mC.

Where it lives, what it does

Genomic context dictates effect:

Methylation as cell identity barcode

For a pathologist this is the headline: methylation patterns are cell-type-specific and stable, which is why DNA-methylation-based tumour classifiers (Capper et al. on CNS tumours; Sarcoma classifier; the various pan-cancer iterations) work as well as they do. The same property underlies methylation clocks (Horvath, GrimAge) and recent single-cell methylation atlases. When you read a paper that uses methylation to call cell-of-origin or tumour subtype, what's actually being exploited is the developmental memory written into the methylome.
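The "barcode" idea can be made concrete with a toy nearest-reference classifier. This is a minimal sketch under loud assumptions: published classifiers are trained machine-learning models over very large probe sets, whereas this swaps in a plain Euclidean nearest-centroid rule, and the class labels, CpG count and β-values below are invented purely for illustration.

```python
from math import sqrt

def classify_by_methylation(sample: list[float], references: dict[str, list[float]]) -> str:
    """Return the reference class whose mean beta-value profile is closest to the sample."""
    def distance(profile: list[float]) -> float:
        if len(profile) != len(sample):
            raise ValueError("sample and reference profiles must cover the same CpGs")
        # Euclidean distance across CpGs; smaller means more similar methylomes.
        return sqrt(sum((s - r) ** 2 for s, r in zip(sample, profile)))

    return min(references, key=lambda label: distance(references[label]))

# Hypothetical mean beta-values at four CpGs for two made-up reference classes.
reference_profiles = {
    "class_A": [0.90, 0.10, 0.80, 0.20],
    "class_B": [0.10, 0.90, 0.20, 0.80],
}
unknown_sample = [0.80, 0.20, 0.70, 0.30]
predicted = classify_by_methylation(unknown_sample, reference_profiles)
print(predicted)
```

The point of the sketch is only that a stable, cell-type-specific methylation profile lets a sample be assigned to its closest reference; real classifiers add feature selection, calibration and a "no match" outcome.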

Don't confuse
5mC vs 5hmC

Standard bisulfite sequencing cannot distinguish them; both are protected from conversion and read as C. Use oxidative bisulfite (oxBS-seq), TAB-seq, or ACE-seq if 5hmC matters, and in 5hmC-rich tissues such as brain it really does.
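The arithmetic behind paired bisulfite / oxidative-bisulfite profiling can be sketched as follows: standard bisulfite reports 5mC + 5hmC, oxidative bisulfite reports 5mC alone, so 5hmC is estimated by subtraction. The function name and example numbers are illustrative assumptions, and the naive subtraction ignores sampling noise (dedicated pipelines use statistical estimators), so negative differences are simply clipped to zero here.

```python
def estimate_5hmc(bs_fraction: float, oxbs_fraction: float) -> dict:
    """Split one CpG's modification level into 5mC, 5hmC and unmodified C.

    bs_fraction   -- fraction of reads called C after standard bisulfite (5mC + 5hmC)
    oxbs_fraction -- fraction of reads called C after oxidative bisulfite (5mC only)
    """
    for name, value in (("bs_fraction", bs_fraction), ("oxbs_fraction", oxbs_fraction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    five_mc = oxbs_fraction
    # Naive subtraction; clip at zero because noise can push oxBS above BS.
    five_hmc = max(bs_fraction - oxbs_fraction, 0.0)
    unmodified = 1.0 - max(bs_fraction, oxbs_fraction)
    return {"5mC": five_mc, "5hmC": five_hmc, "unmodified": unmodified}

result = estimate_5hmc(bs_fraction=0.75, oxbs_fraction=0.5)
print(result)
```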

◆ ◆ ◆
IV. Histone marks

Histones & the chromatin code

There are too many histone modifications to memorise. There are about a dozen you genuinely need to know, because they recur in nearly every chromatin paper and they are what foundation models are trained to predict.

The mental model: histone tails carry a combinatorial alphabet of marks, written and read by enzyme complexes, that cooperatively define functional states — active promoter, poised promoter, active enhancer, primed enhancer, transcribed gene body, facultative heterochromatin, constitutive heterochromatin. The ChromHMM and Segway frameworks formalise this into discrete states; modern models predict the underlying tracks directly.

The marks worth knowing cold

  H3K4me3: Active promoters. Sharp peaks at the TSS of transcribed genes. Written by the MLL/KMT2 family.
  H3K4me1: Enhancers (primed or active). Broad, distal. Combine with H3K27ac to call active.
  H3K27ac: Active enhancers & promoters. Written by p300/CBP. The standard "is this enhancer on?" mark.
  H3K27me3: Polycomb-repressed. Written by PRC2 (EZH2). Facultative, developmentally important silencing.
  H3K9me3: Constitutive heterochromatin. Written by SUV39H, SETDB1. Pericentromeres, repeats, lamina-associated.
  H3K36me3: Transcribed gene bodies. Written co-transcriptionally by SETD2. Marks elongation.
  H3K9ac: Active transcription. Promoter-proximal; correlates with H3K4me3.
  H2AK119ub: Polycomb (PRC1). Cooperates with H3K27me3 to maintain repression.
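The combinatorial reading of these marks can be sketched as a small rule table. This is a deliberately simplified illustration, not ChromHMM or Segway (which learn states probabilistically from signal tracks); the function name, the rule order and the treatment of each mark as simply present or absent are all assumptions.

```python
def call_state(marks: set[str]) -> str:
    """Map the set of histone marks present at a region to a coarse functional state."""
    if "H3K4me3" in marks and "H3K27me3" in marks:
        return "bivalent / poised promoter"
    if "H3K4me3" in marks:
        return "active promoter"
    if "H3K4me1" in marks and "H3K27ac" in marks:
        return "active enhancer"
    if "H3K4me1" in marks:
        return "primed enhancer"
    if "H3K36me3" in marks:
        return "transcribed gene body"
    if "H3K27me3" in marks or "H2AK119ub" in marks:
        return "facultative heterochromatin (Polycomb)"
    if "H3K9me3" in marks:
        return "constitutive heterochromatin"
    return "unmarked"

for region_marks in ({"H3K4me3", "H3K9ac"}, {"H3K4me1", "H3K27ac"}, {"H3K4me3", "H3K27me3"}):
    print(sorted(region_marks), "->", call_state(region_marks))
```

Rule order matters: the bivalent check has to come before the plain H3K4me3 check, which is a small reminder that states are defined by combinations rather than by single marks.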

The KMT2 family — and why it shows up in your data

The KMT2 (MLL) family (KMT2A, KMT2C and KMT2D in particular) writes H3K4 methylation at enhancers and promoters. Loss-of-function mutations in these genes are recurrent across solid tumours, and (as you've seen in your own SRC carcinoma cohort) they often define molecular subgroups. The functional consequence is enhancer dysregulation: when a cell cannot deposit H3K4me1/me3 properly, lineage-defining enhancer landscapes erode, and cells drift toward less differentiated, more plastic states.

Bivalent chromatin and the "poised" state

Stem cells and many tumour cells carry bivalent domains: H3K4me3 and H3K27me3 at the same promoter, so that an activating and a repressive mark coexist. These domains mark genes ready for rapid resolution in either direction depending on differentiation cues, and they are mechanistically central to plasticity.

V. GRNs

Gene regulatory networks & the logic of identity

A cell's identity is not a list of expressed genes. It is a self-reinforcing circuit of transcription factors that mutually activate each other and lock down alternative fates. The epigenome is the substrate that stores and stabilises that circuit between cell divisions.

Master regulators & core regulatory circuits

The dominant framework, articulated most cleanly by Young, Bradner and others, is the core regulatory circuit (CRC): a small set of lineage-defining transcription factors that bind their own enhancers and each other's, forming a positive-feedback module. Knock one out and the circuit collapses; force-express one in another lineage and it can drive transdifferentiation. Examples worth knowing — OCT4/SOX2/NANOG in pluripotency; PU.1/CEBPA in myeloid; MITF in melanocyte/melanoma; HNF4A in hepatocyte; CDX2 in intestinal epithelium; ATOH1 in secretory lineage.
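The defining topology of a CRC (every member regulates itself and every other member) can be sketched on a toy directed graph. The edge list below is invented for illustration, the placeholder names TF_A, TF_B, GENE_X and GENE_Y are hypothetical, and the brute-force search is only sensible for a handful of candidates; it is not how published CRC-calling tools work.

```python
from itertools import combinations

def find_core_circuit(edges: dict[str, set[str]]) -> set[str]:
    """Return the largest set of TFs that each regulate themselves and every other member."""
    # Step 1: keep only autoregulated TFs (a TF that targets itself).
    autoregulated = sorted(tf for tf, targets in edges.items() if tf in targets)
    # Step 2: search from the largest candidate group down to pairs.
    for size in range(len(autoregulated), 1, -1):
        for group in combinations(autoregulated, size):
            if all(b in edges[a] for a in group for b in group):
                return set(group)
    return set()

# Toy regulator -> targets map (illustrative, not curated binding data).
toy_edges = {
    "OCT4": {"OCT4", "SOX2", "NANOG", "GENE_X"},
    "SOX2": {"OCT4", "SOX2", "NANOG"},
    "NANOG": {"OCT4", "SOX2", "NANOG", "GENE_Y"},
    "TF_A": {"TF_A", "GENE_X"},  # autoregulated but not interconnected
    "TF_B": {"OCT4"},            # no self-loop
}
circuit = find_core_circuit(toy_edges)
print(sorted(circuit))
```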

This is also where the bridge to your KRAS work lives: the mucosecretory transcriptional program you have characterised — TFF1/2/3, MUC5AC, AGR2, with ATOH1 upstream — is essentially a CRC. The reason it shows up across pancreas, colon, lung and stomach KRAS-mutant tumours is that it is a permissive epigenetic state, not a coincidence of mutation.

Inferring GRNs from data

Three classes of methods, in rough order of historical appearance:

  1. Co-expression based (WGCNA, GENIE3, GRNBoost2) — build networks from RNA correlation. Cheap, ubiquitous, but cannot distinguish causal from consequential edges.
  2. Motif + accessibility based (SCENIC, SCENIC+, Pando) — combine TF motif scanning in accessible regions (ATAC) with target gene expression. Currently the standard for single-cell GRN inference and the most useful for cell-identity work.
  3. Sequence-to-function based (GET, CREformer, AlphaGenome-derived approaches) — learn the regulatory grammar directly from sequence, then in-silico perturb. The frontier.
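The first class in the list above can be illustrated with a bare-bones co-expression network: correlate every gene pair and keep the strong ones. This is a minimal sketch, not WGCNA, GENIE3 or GRNBoost2; the gene names, expression values and the 0.8 threshold are made up. Because Pearson correlation is symmetric, the resulting edges carry no direction, which is exactly the causal ambiguity noted for this class of method.

```python
from itertools import combinations
from math import sqrt

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant gene carries no co-expression information
    return cov / (sd_x * sd_y)

def coexpression_edges(expr: dict[str, list[float]], threshold: float = 0.8) -> list[tuple]:
    """Return undirected edges (gene1, gene2, r) with |r| at or above the threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, round(r, 3)))
    return edges

# Hypothetical expression of three genes across five samples.
toy_expr = {
    "TF_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_1": [2.1, 3.9, 6.2, 8.0, 9.8],  # tracks TF_A closely
    "GENE_2": [5.0, 1.0, 4.0, 2.0, 3.0],  # unrelated pattern
}
edges = coexpression_edges(toy_expr)
print(edges)
```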

For your kind of work — characterising why a particular oncogenic driver behaves the way it does in a particular tissue — SCENIC+ on paired scRNA + scATAC is currently the highest-yield off-the-shelf approach, but the sequence-to-function generation will overtake it within a year or two for variant-level questions.

◆ ◆ ◆
VI. Plasticity

Plasticity, dedifferentiation, and the cancer phenotype

A differentiated cell is supposed to stay differentiated. Cancer's most dangerous trick is breaking that constraint — not by mutating it away, but by letting the chromatin state slide back toward an embryonic, multipotent or invasive configuration.

What plasticity actually means, mechanistically

Three overlapping but distinguishable phenomena get called "plasticity":

All three share a chromatin signature: increased accessibility at developmental enhancers, loss of lineage-specific H3K4me1/me3 patterns, redistribution of H3K27me3, and often global hypomethylation with focal hypermethylation of differentiation regulators.

Cancer stem cells, finally clarified

The cancer-stem-cell field spent twenty years arguing about whether CSCs are a fixed subpopulation or a transient state. The current synthesis: they are a dynamic equilibrium, maintained by epigenetic plasticity, in which any cell with the right chromatin permissiveness can occupy the stem-like state under the right microenvironmental cues. This is why surface-marker-defined CSCs are unstable in culture, and why epigenetic therapies have re-emerged as a way to "lock in" differentiated states.

Why this matters for pathology
Histology as plasticity readout

The morphological features pathologists already use — differentiation grade, mucin production, signet-ring morphology, sarcomatoid change, neuroendocrine differentiation — are all visible manifestations of underlying chromatin states. Connecting morphology to chromatin computationally (CellViT-style segmentation paired with spatial epigenomics) is one of the highest-value research directions opening up right now.

VII. Predictive models

The new generation of predictive epigenome models

For roughly a decade, sequence-to-function modelling progressed in modest increments — DeepSEA, then Basset, Basenji, Basenji2. In the last 18 months it has accelerated sharply. Three families dominate the current landscape, with characteristic strengths and trade-offs.

Family 1 — Sequence-to-function (the long-context predictors)

These take genomic sequence as input and predict thousands of functional tracks — histone marks, accessibility, TF binding, expression — in parallel.

Enformer
Avsec et al. 2021 / DeepMind & Calico

The model that proved transformers could integrate distal regulatory information from up to ~100 kb away to predict cell-type-specific expression and chromatin signal. Still the reference baseline; many newer models report performance relative to it.

Context: ~196 kb · Resolution: 128 bp · Use it for: Baseline; well-supported tooling

Borzoi
Linder et al. Nat Genet 2025 / Calico

Extends Enformer's lineage to ~524 kb context and, crucially, predicts RNA-seq coverage directly — meaning splicing and polyadenylation effects fall out of the same model. Currently the strongest open-weights option for variant-effect prediction across multiple regulatory layers. Flashzoi, a 2025 reimplementation with rotary embeddings and FlashAttention-2, is ~3× faster with comparable or better performance.

Context: ~524 kb · Resolution: 32 bp · Use it for: Cis-regulatory variant scoring; splicing

AlphaGenome
Avsec et al. Nature 2026 / Google DeepMind

The 2025–2026 inflection point. Takes 1 Mb of input and predicts thousands of tracks across modalities — expression, transcription initiation, accessibility, histone modifications, TF binding, contact maps, splice site usage, splice junction strength — at base-pair resolution in a single unified model. Reported to match or exceed prior best-in-class on 25 of 26 variant-effect benchmarks. Available via an API for non-commercial research; not yet validated for clinical use. Reads like Enformer's logical conclusion plus a splicing model bolted on.

Context: 1 Mb · Resolution: 1 bp · Use it for: Frontier variant interpretation; multi-modal in-silico perturbation
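Whatever the model, variant scoring with this family follows the same pattern: predict on the reference sequence, predict on the sequence carrying the alternate allele, and take the difference. The sketch below shows only that pattern. `toy_predict` is a made-up stand-in (it counts a hypothetical motif) and does not reflect the interface of Enformer, Borzoi or AlphaGenome; the sequence, motif and coordinates are likewise invented.

```python
from typing import Callable

def apply_snv(sequence: str, position: int, ref: str, alt: str) -> str:
    """Return the sequence with a single-nucleotide variant applied (0-based position)."""
    if sequence[position] != ref:
        raise ValueError(
            f"reference mismatch at {position}: expected {ref}, found {sequence[position]}"
        )
    return sequence[:position] + alt + sequence[position + 1:]

def variant_effect(predict: Callable[[str], float], sequence: str,
                   position: int, ref: str, alt: str) -> float:
    """Alt-minus-ref difference in the predicted signal for one track."""
    return predict(apply_snv(sequence, position, ref, alt)) - predict(sequence)

def toy_predict(sequence: str) -> float:
    """Stand-in for a trained model: counts occurrences of a hypothetical motif."""
    return float(sequence.count("GATAAG"))

window = "ACGTGATAAGCCTA"
delta = variant_effect(toy_predict, window, position=6, ref="T", alt="C")
print(delta)  # negative: the variant destroys the only motif match
```

With a real model the scalar would be replaced by a vector of per-track, per-bin predictions, and the difference would usually be aggregated over a window around the variant or the target gene.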

Family 2 — Chromatin / histone-mark predictors

A complementary line of work focuses specifically on predicting chromatin tracks from existing chromatin tracks, rather than from sequence:

eDICE & descendants
Epigenome imputation

Attention-based imputation of missing epigenomic tracks, conditioned on observed ones. Useful when you have partial coverage of a sample (a few histone marks + accessibility) and want to reconstruct the rest. Particularly valuable for individual-specific epigenomics where exhaustive profiling is not feasible.

Family 3 — Structure-aware regulatory models

EPInformer (2025) is worth flagging because it takes a different architectural bet: rather than scaling raw sequence context, it integrates pre-computed Hi-C contact information with cCRE annotations, and reportedly outperforms Enformer and Borzoi on enhancer-promoter prediction in K562 and GM12878 with two orders of magnitude fewer parameters. The lesson is that for some tasks, prior structural information beats raw scale.

How to choose

A rough decision rubric for your kind of work:

◆ ◆ ◆
VIII. Single-cell foundation models

Single-cell foundation models meet the epigenome

The first wave of single-cell foundation models (Geneformer, scGPT, scBERT, scFoundation) was exclusively transcriptomic. The second wave is bringing in epigenetic modalities, because gene expression alone is a lagging indicator of cell state.

Why expression isn't enough

RNA tells you what a cell is doing right now. Chromatin accessibility and methylation tell you what it can do — which programs are licensed, which are silenced, which are poised. For questions about plasticity, lineage commitment, or transdifferentiation, the epigenetic state is the more informative variable. A cell can be transcriptionally quiescent and chromatin-primed for switching; that's invisible to scRNA-seq.

The current models worth knowing

EpiFoundation
Wu et al. bioRxiv 2025 — scATAC-seq

The first foundation model purpose-built for scATAC-seq, with a peak-to-gene cross-modality pre-training objective that addresses the inherent sparsity of chromatin accessibility data. Used for cell-type annotation, batch correction, and gene-expression prediction directly from accessibility. State-of-the-art on multiple benchmarks at release.

GET (General Expression Transformer)
Fu et al. 2025

Models pseudobulk scATAC signals with explicit transcription-factor information to predict cell-type-specific regulation. Trades single-cell resolution for stronger regulatory inference; useful when your question is about programs rather than individual cells.

CREformer
Yang et al. 2024

Integrates bulk epigenetic data with paired scRNA + scATAC for downstream tasks: master-regulator identification, enhancer prediction, functional-variant scoring. Worth knowing as a paired-modality reference point.

The practical question for you

For a researcher coming from molecular pathology with scRNA / scATAC / methylation data, the realistic 2026 workflow looks something like this: use a transcriptomic foundation model (scGPT / Geneformer) for cell-type annotation, embedding, and zero-shot transfer; pair it with an scATAC-specific model (EpiFoundation, GET) for accessibility-derived cell-state and regulatory inference; and run sequence-to-function models (Borzoi, AlphaGenome) on specific loci of interest for variant interpretation and in-silico perturbation. None of these replace each other.

IX. Study plan

A six-week study plan

Compressed into evening-and-weekend pace. The week numbers assume one substantive paper plus one practical exercise per week. Adjust to your reality.

Week 01

Foundations refresh

Allis & Jenuwein Nat Rev Genet review on chromatin states. Build a one-page mental map of the marks in §IV. Exercise: pull a public ChIP-seq track for H3K27ac in a tissue you know well; eyeball it against gene expression.

Week 02

Methylation deep dive

Schübeler 2015 Nature review (still the cleanest synthesis); then Capper et al. 2018 on methylation-based CNS tumour classification. Exercise: download an Illumina methylation-array dataset from TCGA (mostly 450K rather than EPIC) and run a quick UMAP on β-values across two tumour types.
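For orientation before the exercise, the β-values on a methylation array come from the methylated and unmethylated probe intensities. A minimal sketch of that arithmetic follows, assuming the common convention of a stabilising offset of 100, together with the log-ratio M-value that is often preferred for statistical testing. Function names and the example intensities are illustrative.

```python
from math import log2

def beta_value(meth: float, unmeth: float, offset: float = 100.0) -> float:
    """Beta = M / (M + U + offset); bounded between 0 and 1."""
    return meth / (meth + unmeth + offset)

def m_value(beta: float, eps: float = 1e-6) -> float:
    """Logit-style transform of beta: log2(beta / (1 - beta))."""
    # Clamp away from 0 and 1 so the logarithm stays finite.
    clamped = min(max(beta, eps), 1.0 - eps)
    return log2(clamped / (1.0 - clamped))

beta = beta_value(meth=4500.0, unmeth=400.0)
print(round(beta, 3), round(m_value(beta), 3))
```

β-values are easier to interpret (roughly, the fraction of methylated copies), while M-values behave better at the extremes; either can feed a UMAP, but the choice is worth recording.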

Week 03

GRN inference, hands-on

Read the SCENIC+ paper (Bravo González-Blas et al. Nat Methods 2023). Exercise: run SCENIC+ on a published paired scRNA + scATAC dataset; identify the core regulatory circuit of a lineage you care about.

Week 04

Plasticity & cancer

Hanahan 2022 hallmarks update; one of the 2024–2025 plasticity reviews from §X. Exercise: take one of your own SRC subgroups (S1/S2/S3) and ask which chromatin program would distinguish them — sketch the experiment.

Week 05

Sequence-to-function modelling

Read Enformer (2021), then Borzoi (Nat Genet 2025), then AlphaGenome (Nature 2026). Exercise: pick one non-coding variant from your data; score it through Borzoi (open weights) and AlphaGenome (API); compare predictions across modalities.

Week 06

Single-cell foundation models

Read EpiFoundation; skim the recent Nature Methods review of scFMs. Exercise: load one of your own scRNA datasets into Geneformer or scGPT for zero-shot annotation, and compare to your existing labels.

X. Reading list

The essential reading list

Curated, not exhaustive. If you read only what is on this list and do the exercises in §IX, you will be fluent in the current literature within two months.

— Foundations & reviews —

  1. Hanahan, D. Hallmarks of Cancer: New Dimensions. Cancer Discovery, 2022. — The hallmarks update that formalised epigenetic reprogramming and plasticity.
  2. Yamada et al. In Vivo Reprogramming Highlights Epigenetic Regulation That Shapes Cancer Hallmarks. Cancer Science, 2025.
  3. Allis & Jenuwein. The molecular hallmarks of epigenetic control. Nat Rev Genet, 2016. — Still the clearest single-paper introduction to the chromatin code.
  4. Schübeler, D. Function and information content of DNA methylation. Nature, 2015. — Standard reference for methylation biology.

— Plasticity & cell identity —

  1. Bhat et al. Cancer cell plasticity: from cellular, molecular, and genetic mechanisms to tumor heterogeneity and drug resistance. Cancer Metastasis Rev, 2024.
  2. Tumor cell plasticity reviews in Adv Sci and Discover Oncology (2025) covering EMT, CSCs, and chromatin reprogramming as a unified framework.
  3. Capper et al. DNA methylation-based classification of central nervous system tumours. Nature, 2018. — The proof that methylation is identity.

— Predictive models (sequence-to-function) —

  1. Avsec et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods, 2021. — Enformer.
  2. Linder et al. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Nat Genet, 2025. — Borzoi.
  3. Hingerl et al. Flashzoi: an enhanced Borzoi for accelerated genomic analysis. Bioinformatics, 2025.
  4. Avsec et al. Advancing regulatory variant effect prediction with AlphaGenome. Nature, 2026 (preprint bioRxiv 2025).
  5. Karollus et al. Predicting gene expression from histone marks using chromatin deep learning models. NAR, 2024.

— Single-cell foundation models —

  1. Wu et al. EpiFoundation: A Foundation Model for Single-Cell ATAC-seq via Peak-to-Gene Alignment. bioRxiv, 2025.
  2. The Nat Methods 2025 review of single-cell foundation models — for the broad map of where the field is heading, including epigenomic extensions.
  3. Fu et al. A foundation model of transcription across human cell types. Nature, 2025. GET (General Expression Transformer).
  4. Bravo González-Blas et al. SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks. Nat Methods, 2023.

— Ancillary, but worth having on hand —

  1. Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature, 2015.
  2. ENCODE Project Consortium. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature, 2020.
  3. Teschendorff et al. Interpretable deep learning of single-cell and epigenetic data reveals novel molecular insights in aging. Sci Reports, 2025. — Useful XAI framework worth borrowing for tumour epigenomes.