K Data Model

An opinionated structuring framework designed to bridge the gap between bioinformatics outputs and high-performance analytical queries.

Overview

The K Data Model transforms raw files (e.g. h5ad, csv, zarr) into a unified, SQL-queryable Parquet structure so that information can be retrieved efficiently across multiple biological scales. It supports cross-modality analysis, from longitudinal clinical trends down to single-cell gene expression and spatial spots, with every data point mapped to a common coordinate system and ontology.

Pipeline overview

Processing & standardization

These steps transform heterogeneous data into a unified, high-performance schema.

| Step | What happens | Why |
| --- | --- | --- |
| Multi-level structuring | Data is organized into Patient, Sample, Slide, and Cell/Spot levels. | Optimizes query performance for different scales of analysis. |
| Schema harmonization | Conversion of raw formats to Parquet files. | Enables universal SQL querying and high-speed data loading. |
| Ontology mapping | Gene names and disease labels are matched to a single reference. | Ensures consistency and enables cross-dataset comparisons. |
| Normalization | Counts for bulk, single-cell, and spatial data are normalized. | Removes technical variation to allow biological interpretation. |
| Feature enrichment | Computation of UMAPs (bulk & single-cell) and pseudobulk. | Provides instant, high-quality visualizations without on-the-fly lag. |
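
To make the normalization and pseudobulk steps concrete, here is a minimal sketch in plain Python. It is illustrative only: the page does not say which normalization method the K Data Model applies, so the library-size scaling followed by a log1p transform shown here is an assumption (it resembles the common Scanpy workflow of `normalize_total` followed by `log1p`). The field names `sample_id`, `cell_type`, and `counts`, and the `target_sum` default, are hypothetical.

```python
import math
from collections import defaultdict


def normalize_counts(counts, target_sum=10_000):
    """Scale one cell's raw counts to a fixed library size, then apply log1p."""
    total = sum(counts.values())
    if total == 0:
        # An empty cell has no signal to rescale.
        return {gene: 0.0 for gene in counts}
    return {gene: math.log1p(c / total * target_sum) for gene, c in counts.items()}


def pseudobulk(cells):
    """Sum raw counts over all cells that share a (sample_id, cell_type) pair."""
    totals = defaultdict(lambda: defaultdict(int))
    for cell in cells:
        key = (cell["sample_id"], cell["cell_type"])
        for gene, c in cell["counts"].items():
            totals[key][gene] += c
    return {key: dict(genes) for key, genes in totals.items()}


# Hypothetical toy data: three cells from one sample.
cells = [
    {"sample_id": "S1", "cell_type": "T cell", "counts": {"CD3E": 8, "MS4A1": 0, "ACTB": 92}},
    {"sample_id": "S1", "cell_type": "T cell", "counts": {"CD3E": 4, "MS4A1": 0, "ACTB": 46}},
    {"sample_id": "S1", "cell_type": "B cell", "counts": {"CD3E": 0, "MS4A1": 5, "ACTB": 45}},
]

normalized = [normalize_counts(c["counts"]) for c in cells]
bulk = pseudobulk(cells)
```

The first two cells differ in sequencing depth but have the same relative composition, so their normalized values match. This is the kind of technical variation the normalization step is meant to remove. The pseudobulk is built from raw counts rather than normalized values, which is a design choice of this sketch and not something the page specifies.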

Data filtering & quality control

The K Data Model applies specific rules to ensure data integrity and visualization speed.

| Filter / QC step | Criteria | Impact |
| --- | --- | --- |
| Relational mapping | Uses common keys to link patients, samples, and cells. | Prevents "orphaned" data; ensures clinical context is always available. |
| Pre-computed metadata | Ranges, completeness, and cohort counts are calculated upfront. | Accelerates UI responsiveness for filtering and exploration. |
| Visual optimization | Data is partitioned/saved specifically for heatmaps and dotplots. | Delivers near-instant rendering of complex large-scale matrices. |
| Longitudinal alignment | Chronological ordering of treatments and events. | Enables accurate survival and treatment-response analysis. |
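
The relational, metadata, and longitudinal checks above can be sketched with ordinary SQL. The example below uses Python's built-in `sqlite3` as a stand-in for a SQL engine over Parquet files. All table and column names (`patient`, `sample`, `cell`, `event`, `patient_age_summary`) are hypothetical and do not reflect the actual K Data Model schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE patient (patient_id TEXT PRIMARY KEY, age INTEGER);
CREATE TABLE sample  (sample_id TEXT PRIMARY KEY, patient_id TEXT);
CREATE TABLE cell    (cell_id TEXT PRIMARY KEY, sample_id TEXT);
CREATE TABLE event   (patient_id TEXT, event_date TEXT, event TEXT);

INSERT INTO patient VALUES ('P1', 61), ('P2', NULL), ('P3', 47);
INSERT INTO sample  VALUES ('S1', 'P1'), ('S2', 'P2'), ('S3', 'P9');
INSERT INTO cell    VALUES ('C1', 'S1'), ('C2', 'S1'), ('C3', 'S2');
INSERT INTO event   VALUES
    ('P1', '2021-06-01', 'progression'),
    ('P1', '2021-01-15', 'treatment start');
""")

# Relational mapping: find samples whose patient key matches no patient row.
orphans = [row[0] for row in con.execute("""
    SELECT s.sample_id
    FROM sample AS s
    LEFT JOIN patient AS p ON p.patient_id = s.patient_id
    WHERE p.patient_id IS NULL
    ORDER BY s.sample_id
""")]

# Pre-computed metadata: store range, completeness, and cohort size for `age`
# once, so a UI can read one small row instead of scanning the patient table.
con.execute("""
    CREATE TABLE patient_age_summary AS
    SELECT MIN(age)                    AS age_min,
           MAX(age)                    AS age_max,
           COUNT(age) * 1.0 / COUNT(*) AS completeness,
           COUNT(*)                    AS n_patients
    FROM patient
""")
summary = con.execute("SELECT * FROM patient_age_summary").fetchone()

# Longitudinal alignment: ISO 8601 date strings sort chronologically.
timeline = [row[0] for row in con.execute(
    "SELECT event FROM event WHERE patient_id = 'P1' ORDER BY event_date"
)]
```

Here sample `S3` points at a patient that does not exist, so it is reported as an orphan. How such rows are handled in practice (rejected, flagged, or repaired) is not stated on this page.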

Technologies used

| Technology | Purpose |
| --- | --- |
| AnnData / Scanpy | Handling single-cell and spatial transcriptomics objects. |
| NumPy / Pandas | Core matrix manipulation and numerical processing. |
| Parquet | Columnar storage format for efficient, large-scale data access. |
| AWS SDK (or other vendor) | Cloud-native data management and storage interface. |
| SQL | Universal query language for all standardized data levels. |

What this means for your analyses:

  • Granular access: Query clinical data (e.g., all patients on immunotherapy) and drill down to single-cell gene expression or immune cell density.

  • Ready-to-view: Pre-computed visualizations like UMAPs and heatmaps reduce wait times.

  • Ontological truth: Synonymous gene names or mismatched disease labels are aligned automatically.
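
As an illustration of granular access combined with gene-name alignment, the sketch below resolves a gene synonym to its canonical symbol and then drills down from a clinical filter to cell-level expression. It uses Python's built-in `sqlite3` in place of a SQL engine over Parquet. The tables (`patient`, `sample`, `cell_expression`), their columns, the `GENE_ALIASES` dictionary, and both helper functions are hypothetical names chosen for this example, not part of the K Data Model.

```python
import sqlite3

# Hypothetical alias map; a real pipeline would load this from a gene reference.
GENE_ALIASES = {"PD-1": "PDCD1", "PD1": "PDCD1", "CD279": "PDCD1"}


def canonical_gene(name):
    """Return the canonical symbol for a gene name, or the upper-cased name if unknown."""
    key = name.strip().upper()
    return GENE_ALIASES.get(key, key)


con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE patient (patient_id TEXT PRIMARY KEY, treatment TEXT);
CREATE TABLE sample  (sample_id TEXT PRIMARY KEY, patient_id TEXT);
CREATE TABLE cell_expression (cell_id TEXT, sample_id TEXT, gene TEXT, value REAL);

INSERT INTO patient VALUES ('P1', 'immunotherapy'), ('P2', 'chemotherapy');
INSERT INTO sample  VALUES ('S1', 'P1'), ('S2', 'P2');
INSERT INTO cell_expression VALUES
    ('C1', 'S1', 'PDCD1', 2.0),
    ('C2', 'S1', 'PDCD1', 4.0),
    ('C3', 'S2', 'PDCD1', 9.0),
    ('C1', 'S1', 'CD8A',  1.0);
""")


def mean_expression(con, treatment, gene_name):
    """Average cell-level expression of one gene across patients on a given treatment."""
    row = con.execute(
        """
        SELECT AVG(e.value)
        FROM patient AS p
        JOIN sample AS s ON s.patient_id = p.patient_id
        JOIN cell_expression AS e ON e.sample_id = s.sample_id
        WHERE p.treatment = ? AND e.gene = ?
        """,
        (treatment, canonical_gene(gene_name)),
    ).fetchone()
    return row[0]


# The caller can use a synonym; it is resolved before the query runs.
result = mean_expression(con, "immunotherapy", "PD-1")
```

The query joins the patient, sample, and cell levels on shared keys, so the clinical filter (`treatment`) and the molecular filter (`gene`) are applied in a single statement.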
