Thematic Course:
Practical Machine Learning for Transcriptomics in Cancer Research

Date
6-10 July 2026

Organizer
Zhan Yinxiu (IEO)

Location
IEO Campus, Via Adamello 16 - Milan
About the Course
The course focuses on the process of building a trustworthy predictive model from biological data. The goal is to train computational biologists who can build, evaluate, interpret and critically review predictive models derived from transcriptomic data.
The entire course is built around a single running case study, so that every concept is learned on one real, evolving problem rather than in isolation.
Total number of credits: 5
Lessons Location: Ivory+Silver Room, bld. 13
From Biological Question to Machine Learning Problem
9:00 – 11:00
Theoretical session
Turning a vague clinical question into a well-posed prediction task. Transcriptomic data structure (features and labels); high-dimensional biology (p≫n); batch effects; class imbalance; data leakage; train/validation/test splitting.
11:00 – 13:00
Practical session
Explore a real transcriptomic dataset, audit clinical metadata, define a clean recurrence label, find a batch effect with PCA, and build a leakage-aware, patient-level, stratified split.
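To give a flavour of the last step, here is a minimal sketch of a patient-level, stratified split. It is an illustration only, not the course material: the toy cohort, the function name `patient_level_stratified_split` and the `(sample_id, patient_id, label)` tuple layout are all hypothetical, and the label is assumed to be constant within a patient. Only the Python standard library is used.

```python
import random
from collections import defaultdict

def patient_level_stratified_split(samples, test_fraction=0.25, seed=0):
    """Split samples into train/test so that no patient spans both sets.

    `samples` is a list of (sample_id, patient_id, label) tuples; the label is
    assumed to be constant within a patient (e.g. recurrence yes/no).
    """
    # Collapse to one label per patient, then group patients by label.
    patient_label = {}
    for _, patient_id, label in samples:
        patient_label[patient_id] = label
    patients_by_label = defaultdict(list)
    for patient_id, label in sorted(patient_label.items()):
        patients_by_label[label].append(patient_id)

    # Within each label stratum, send a fixed fraction of patients to test.
    rng = random.Random(seed)
    test_patients = set()
    for label in sorted(patients_by_label):
        stratum = patients_by_label[label]
        rng.shuffle(stratum)
        n_test = max(1, round(len(stratum) * test_fraction))
        test_patients.update(stratum[:n_test])

    # Assign every sample according to its patient, never individually.
    train = [s for s in samples if s[1] not in test_patients]
    test = [s for s in samples if s[1] in test_patients]
    return train, test

# Hypothetical toy cohort: 20 patients, two samples each, 5 patients recur.
samples = []
for p in range(20):
    label = 1 if p < 5 else 0
    for replicate in ("a", "b"):
        samples.append((f"S{p:02d}{replicate}", f"P{p:02d}", label))

train, test = patient_level_stratified_split(samples)
print(len(train), "training samples;", len(test), "test samples")
```

Splitting on patients rather than samples is what keeps replicate or follow-up samples from the same person out of both sets. In scikit-learn, `StratifiedGroupKFold` covers a similar need.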
Lessons Location: Ivory+Silver Room, bld. 13
Building the First Predictive Model
13:00 – 15:00
Theoretical session
Transcriptomic preprocessing; logistic regression and regularisation (LASSO, Elastic Net); cross-validation; classification metrics under class imbalance (precision/recall, ROC-AUC vs PR-AUC); overfitting. Emphasis on model behaviour over algorithmic complexity.
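As an illustration of the ROC-AUC vs PR-AUC point, the sketch below scores simulated positives and negatives from fixed Gaussian distributions and changes only the class ratio. Everything here is an assumption made for the example: the score distributions, the sample sizes, and the use of average precision as the PR-AUC summary. It is not drawn from the course dataset.

```python
import random

def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

def simulate(n_pos, n_neg, seed=0):
    """Positives score ~N(1, 1), negatives ~N(0, 1): same separation at any class ratio."""
    rng = random.Random(seed)
    labels = [1] * n_pos + [0] * n_neg
    scores = [rng.gauss(1.0, 1.0) for _ in range(n_pos)]
    scores += [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
    return labels, scores

results = {}
for name, (y, s) in [("balanced", simulate(200, 200)),
                     ("5% positives", simulate(200, 3800))]:
    results[name] = (roc_auc(y, s), average_precision(y, s))
    print(f"{name}: ROC-AUC={results[name][0]:.2f}  AP={results[name][1]:.2f}")
```

Because the classifier's separation is identical in both cohorts, ROC-AUC barely moves, while average precision falls sharply once negatives dominate. scikit-learn offers `roc_auc_score` and `average_precision_score` for the same quantities.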
15:00 – 17:00
Practical session
Train a regularised logistic-regression baseline inside a leakage-safe pipeline; evaluate honestly with cross-validation; watch overfitting happen on a regularisation sweep.
Lessons Location: Gold+Platinum Room, bld. 13
Feature Engineering and Biological Representation
13:00 – 15:00
Theoretical session
Representation often matters more than the algorithm. Gene-level features (variance filtering, differential expression); biological signatures (proliferation, ER signalling, immune, stromal); pathway-level features (gene sets, GSVA/ssGSEA); feature selection; and feature-selection leakage, the single most common silent error in omics ML.
15:00 – 17:00
Practical session
Hold the model fixed and compare representations (all genes → filtered → signatures → pathways); demonstrate how leaky selection inflates performance, even on permuted labels.
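The permuted-label demonstration can be sketched on pure noise. In the example below the "expression matrix" is random Gaussian data and the labels are shuffled, so no model should beat chance. The mean-difference gene ranking and the nearest-centroid classifier are simple stand-ins chosen to keep the example dependency-free; they are assumptions of this sketch, not the methods used in the practical.

```python
import random

def top_k_features(X, y, rows, k):
    """Rank features by absolute difference in class means, using only `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    def gap(j):
        return abs(sum(X[i][j] for i in pos) / len(pos)
                   - sum(X[i][j] for i in neg) / len(neg))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def nearest_centroid_accuracy(X, y, train, test, features):
    """Fit class centroids on `train`, classify `test` by the closer centroid."""
    centroids = {}
    for c in (0, 1):
        members = [i for i in train if y[i] == c]
        centroids[c] = [sum(X[i][j] for i in members) / len(members) for j in features]
    correct = 0
    for i in test:
        dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(features, centroids[c]))
                for c in (0, 1)}
        correct += (min(dist, key=dist.get) == y[i])
    return correct / len(test)

def cross_validate(X, y, k_features, n_folds=5, leaky=False):
    n = len(y)
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    # Leaky variant: features are chosen once, using the labels of every sample.
    global_features = top_k_features(X, y, range(n), k_features)
    accuracies = []
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        # Honest variant: selection is repeated inside each training fold.
        features = global_features if leaky else top_k_features(X, y, train, k_features)
        accuracies.append(nearest_centroid_accuracy(X, y, train, test, features))
    return sum(accuracies) / n_folds

rng = random.Random(0)
n_samples, n_genes = 60, 1000
X = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_samples)]
y = [0, 1] * (n_samples // 2)
rng.shuffle(y)  # labels carry no information about X

leaky_acc = cross_validate(X, y, k_features=20, leaky=True)
honest_acc = cross_validate(X, y, k_features=20, leaky=False)
print(f"leaky selection: {leaky_acc:.2f}   selection inside CV: {honest_acc:.2f}")
```

The only difference between the two runs is where feature selection happens. Selecting genes before cross-validation lets the held-out labels shape the feature set, which is enough to produce an apparently strong classifier from noise.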
Lessons Location: Ivory+Silver Room, bld. 13
Improving and Validating Models
13:00 – 15:00
Theoretical session
Alternative model families (decision trees, random forests, gradient boosting); hyperparameter optimisation and nested cross-validation; batch correction (ComBat-style, and its pitfalls); model interpretation (feature importance, stability); reproducibility (pipelines, seeds, experiment tracking).
15:00 – 17:00
Practical session
Compare model families fairly, tune with nested CV, run a batch-correction experiment on a simulated cross-platform cohort, and interpret the best model.
Lessons Location: Ivory+Silver Room, bld. 13
Research-Grade Machine Learning (Capstone)
13:00 – 15:00
Theoretical session
What separates a publishable study from an exploratory analysis: external validation and transportability; distribution shift; reproducibility; statistical significance vs clinical utility; the move from binary labels to time-to-event (survival) modelling; and the common reviewer criticisms.
15:00 – 17:00
Practical session (review-diagnose-repair)
You are handed a deliberately flawed “submitted analysis”. Reproduce its headline result, identify and demonstrate its flaws (permuted-label leakage check), and rebuild it as a corrected, reproducible pipeline with genuine external validation on the real GSE6532 cohort. The exercise shows the flawed raw-gene model collapsing across platforms while a robust engineered-feature model holds, and ends by weighing that model's incremental value over standard clinical variables.