# K data model

### Overview

The K Data Model transforms raw files (e.g., h5ad, CSV, Zarr) into a unified, SQL-queryable Parquet structure so that information can be retrieved efficiently across multiple biological scales. It supports cross-modality analysis, from longitudinal clinical trends down to single-cell gene expression and spatial spots, with every data point mapped to a common coordinate system and ontology.

### Pipeline overview

```mermaid
flowchart LR
    A("Raw Bioinformatics Files<br/>(h5ad, CSV, Zarr)") --> B(K Data Model Transformation)
    B --> C(Standardized Parquet Store)
    C --> D(SQL-Ready K Pro Platform)
```

### Processing & standardization

These steps transform heterogeneous data into a unified, high-performance schema.

| Step                        | What happens                                                         | Why                                                                   |
| --------------------------- | -------------------------------------------------------------------- | --------------------------------------------------------------------- |
| **Multi-level structuring** | Data is organized into Patient, Sample, Slide, and Cell/Spot levels. | Optimizes query performance for different scales of analysis.         |
| **Schema harmonization**    | Raw formats are converted to **Parquet** files.                      | Enables universal SQL querying and high-speed data loading.           |
| **Ontology mapping**        | Gene names and disease labels are matched to a single reference.     | Ensures consistency and enables cross-dataset comparisons.            |
| **Normalization**           | Counts for bulk, single-cell, and spatial data are normalized.       | Removes technical variation to allow biological interpretation.       |
| **Feature enrichment**      | UMAPs (bulk & single-cell) and pseudobulk profiles are pre-computed. | Visualizations load from stored results, not on-the-fly computation.  |
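
As a rough illustration of the normalization and pseudobulk steps in the table above, here is a minimal pure-Python sketch. The method shown (library-size scaling to 10,000 counts followed by `log1p`) is a common single-cell convention and is an assumption here; the K Data Model's actual normalization is not specified in this page. The function names, field names, and toy counts are hypothetical.

```python
import math
from collections import defaultdict


def normalize_counts(counts, scale=1e4):
    """Scale one cell's raw counts to a fixed library size, then apply log1p."""
    total = sum(counts.values())
    if total == 0:
        # An empty cell has no signal to scale; return zeros.
        return {gene: 0.0 for gene in counts}
    return {gene: math.log1p(c / total * scale) for gene, c in counts.items()}


def pseudobulk(cells):
    """Sum raw counts over all cells sharing a (sample_id, cell_type) pair."""
    totals = defaultdict(lambda: defaultdict(int))
    for cell in cells:
        key = (cell["sample_id"], cell["cell_type"])
        for gene, c in cell["counts"].items():
            totals[key][gene] += c
    return {key: dict(genes) for key, genes in totals.items()}


# Toy input: three cells from one sample (values are made up).
cells = [
    {"sample_id": "S1", "cell_type": "T cell",
     "counts": {"CD3E": 8, "MS4A1": 0, "ACTB": 12}},
    {"sample_id": "S1", "cell_type": "T cell",
     "counts": {"CD3E": 5, "MS4A1": 1, "ACTB": 14}},
    {"sample_id": "S1", "cell_type": "B cell",
     "counts": {"CD3E": 0, "MS4A1": 9, "ACTB": 11}},
]

normalized = [normalize_counts(cell["counts"]) for cell in cells]
bulk = pseudobulk(cells)
```

Pseudobulk is computed from raw counts rather than normalized values in this sketch, so the aggregated profile can be normalized separately at the group level.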

#### Data filtering & quality control

The K Data Model applies specific rules to ensure data integrity and visualization speed.

| Filter / QC Step           | Criteria                                                          | Impact                                                                  |
| -------------------------- | ----------------------------------------------------------------- | ----------------------------------------------------------------------- |
| **Relational mapping**     | Uses common keys to link patients, samples, and cells.            | Prevents "orphaned" data; ensures clinical context is always available. |
| **Pre-computed metadata**  | Ranges, completeness, and cohort counts are calculated upfront.   | Accelerates UI responsiveness for filtering and exploration.            |
| **Visual optimization**    | Data is partitioned/saved specifically for heatmaps and dotplots. | Delivers near-instant rendering of complex large-scale matrices.        |
| **Longitudinal alignment** | Chronological ordering of treatments and events.                  | Enables accurate survival and treatment-response analysis.              |
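
The relational mapping idea can be sketched with a few joined tables. The example below uses Python's built-in `sqlite3` as an in-memory stand-in for the Parquet store; the table names, column names, and rows are hypothetical and are not the K Data Model's actual schema. It shows how common keys allow both an orphan check and a drill-down from clinical data to cell-level data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_id TEXT PRIMARY KEY, treatment TEXT);
CREATE TABLE sample  (sample_id  TEXT PRIMARY KEY, patient_id TEXT);
CREATE TABLE cell    (cell_id    TEXT PRIMARY KEY, sample_id TEXT, cell_type TEXT);

INSERT INTO patient VALUES ('P1', 'immunotherapy'), ('P2', 'chemotherapy');
-- S3 points at a patient that does not exist: an "orphaned" sample.
INSERT INTO sample VALUES ('S1', 'P1'), ('S2', 'P2'), ('S3', 'P9');
INSERT INTO cell VALUES
    ('C1', 'S1', 'T cell'), ('C2', 'S1', 'B cell'), ('C3', 'S2', 'T cell');
""")

# QC check: samples whose patient key has no match in the patient table.
orphans = conn.execute("""
    SELECT s.sample_id
    FROM sample AS s
    LEFT JOIN patient AS p ON s.patient_id = p.patient_id
    WHERE p.patient_id IS NULL
""").fetchall()

# Drill-down: cell-type counts for patients on immunotherapy.
drill_down = conn.execute("""
    SELECT c.cell_type, COUNT(*) AS n_cells
    FROM patient AS p
    JOIN sample AS s ON s.patient_id = p.patient_id
    JOIN cell   AS c ON c.sample_id  = s.sample_id
    WHERE p.treatment = 'immunotherapy'
    GROUP BY c.cell_type
    ORDER BY c.cell_type
""").fetchall()

print(orphans)
print(drill_down)
```

In this toy data the orphan check flags sample `S3`, which is the kind of record a relational mapping step would need to reject or repair before loading.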

#### Technologies used

| Technology                    | Purpose                                                         |
| ----------------------------- | --------------------------------------------------------------- |
| **AnnData / Scanpy**          | Handling single-cell and spatial transcriptomics objects.       |
| **NumPy / Pandas**            | Core matrix manipulation and numerical processing.              |
| **Parquet**                   | Columnar storage format for efficient, large-scale data access. |
| **AWS SDK (or other vendor)** | Cloud-native data management and storage interface.             |
| **SQL**                       | Universal query language for all standardized data levels.      |

What this means for your analyses:

* **Granular access:** Query clinical data (e.g., all patients on immunotherapy) and drill down to single-cell gene expression or immune cell density.
* **Ready-to-view:** Pre-computed visualizations like UMAPs and heatmaps reduce wait times.
* **Consistent vocabulary:** Synonymous gene names and mismatched disease labels are aligned to a single reference automatically.
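
To make the alignment of synonymous gene names concrete, here is a minimal sketch of an alias lookup. The alias table is a small hand-written example (HER2 and NEU are widely used aliases of ERBB2; PD-L1 and B7-H1 are aliases of CD274); it is not the reference the K Data Model uses, and `harmonize_gene` is a hypothetical helper name.

```python
# Canonical symbol -> known aliases (toy table, assumed for illustration).
CANONICAL = {
    "ERBB2": ["HER2", "NEU"],
    "CD274": ["PD-L1", "PDL1", "B7-H1"],
}

# Flatten into a case-insensitive lookup that also maps each
# canonical symbol to itself.
LOOKUP = {
    alias.upper(): symbol
    for symbol, aliases in CANONICAL.items()
    for alias in [symbol, *aliases]
}


def harmonize_gene(name):
    """Return the canonical symbol for a gene name, or the cleaned input if unknown."""
    cleaned = name.strip()
    return LOOKUP.get(cleaned.upper(), cleaned)


raw_names = ["her2", "PD-L1", "ERBB2", " Cd274 ", "TP53"]
harmonized = [harmonize_gene(n) for n in raw_names]
print(harmonized)
```

Unknown names such as `TP53` pass through unchanged in this sketch; a production mapping would more likely flag them for review against the full reference.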
