# About Owkin-managed data

### Overview

The K Data Model transforms raw files (e.g. h5ad, CSV, Zarr) into a unified, SQL-queryable Parquet structure, optimizing information retrieval across multiple biological scales. It supports cross-modality analysis, from clinical longitudinal trends down to single-cell gene expression and spatial spots, and maps every data point to a common coordinate system and ontology.

### Pipeline overview

```mermaid
flowchart LR
    A("Raw Bioinformatics Files<br/>(h5ad, CSV, Zarr)") --> B(K Data Model Transformation)
    B --> C(Standardized Parquet Store)
    C --> D(SQL-Ready K Pro Platform)
```

### Processing & standardization

These steps transform heterogeneous data into a unified, high-performance schema.

| Step                        | What happens                                                         | Why                                                                   |
| --------------------------- | -------------------------------------------------------------------- | --------------------------------------------------------------------- |
| **Multi-level structuring** | Data is organized into Patient, Sample, Slide, and Cell/Spot levels. | Optimizes query performance for different scales of analysis.         |
| **Schema harmonization**    | Conversion of raw formats to **Parquet** files.                      | Enables universal SQL querying and high-speed data loading.           |
| **Ontology mapping**        | Gene names and disease labels are matched to a single reference.     | Ensures consistency and enables cross-dataset comparisons.            |
| **Normalization**           | Counts for bulk, single-cell, and spatial data are normalized.       | Removes technical variation to allow biological interpretation.       |
| **Feature enrichment**      | Computation of UMAPs (bulk & single-cell) and pseudobulk.            | Provides instant, high-quality visualizations without on-the-fly lag. |
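
To make the ontology mapping, normalization, and pseudobulk steps more concrete, here is a minimal sketch on toy data. It is not the K Data Model implementation: the synonym table, the function names, and the choice of library-size normalization followed by `log1p` (a common convention for count data) are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical synonym table (alias -> reference symbol); illustrative only,
# not the ontology used by the platform.
GENE_SYNONYMS = {"PDL1": "CD274", "PD-L1": "CD274", "HER2": "ERBB2"}


def harmonize_genes(genes):
    """Map each gene name to a reference symbol; unknown names pass through."""
    return [GENE_SYNONYMS.get(g, g) for g in genes]


def normalize_counts(counts, target_sum=1e4):
    """Scale each row (cell or sample) to target_sum, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty rows
    return np.log1p(counts / totals * target_sum)


def pseudobulk(counts, sample_ids):
    """Sum raw counts over all cells that share a sample ID."""
    counts = np.asarray(counts)
    samples, inverse = np.unique(sample_ids, return_inverse=True)
    summed = np.zeros((len(samples), counts.shape[1]), dtype=counts.dtype)
    np.add.at(summed, inverse, counts)  # accumulate rows into their sample
    return samples, summed


# Toy input: 3 cells x 2 genes, drawn from two samples.
genes = harmonize_genes(["PDL1", "HER2"])
raw = np.array([[1, 3], [0, 4], [2, 2]])
cell_samples = np.array(["S1", "S1", "S2"])

normalized = normalize_counts(raw)
sample_names, bulk = pseudobulk(raw, cell_samples)
```

Pseudobulk is computed here from raw counts rather than normalized values, which is one common choice; the source does not state which the pipeline uses.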

#### Data filtering & quality control

The K Data Model applies specific rules to ensure data integrity and visualization speed.

| Filter / QC Step           | Criteria                                                          | Impact                                                                  |
| -------------------------- | ----------------------------------------------------------------- | ----------------------------------------------------------------------- |
| **Relational mapping**     | Uses common keys to link patients, samples, and cells.            | Prevents "orphaned" data; ensures clinical context is always available. |
| **Pre-computed metadata**  | Ranges, completeness, and cohort counts are calculated upfront.   | Accelerates UI responsiveness for filtering and exploration.            |
| **Visual optimization**    | Data is partitioned/saved specifically for heatmaps and dotplots. | Delivers near-instant rendering of complex large-scale matrices.        |
| **Longitudinal alignment** | Chronological ordering of treatments and events.                  | Enables accurate survival and treatment-response analysis.              |
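
The checks in this table can be sketched with pandas. The example below is a hedged illustration only: the table names, column names, and values are hypothetical, and the platform's actual rules may differ. It flags samples with no matching patient, computes completeness and range metadata, and orders events chronologically per patient.

```python
import pandas as pd

# Hypothetical tables and column names, for illustration only.
patients = pd.DataFrame({"patient_id": ["P1", "P2"], "age": [61.0, float("nan")]})
samples = pd.DataFrame(
    {"sample_id": ["S1", "S2", "S3"], "patient_id": ["P1", "P2", "P9"]}
)
events = pd.DataFrame(
    {
        "patient_id": ["P1", "P1", "P2"],
        "event": ["progression", "treatment_start", "treatment_start"],
        "date": pd.to_datetime(["2021-06-01", "2021-01-15", "2020-11-03"]),
    }
)

# Relational mapping: flag samples whose patient_id is missing from patients.
linked = samples.merge(
    patients[["patient_id"]], on="patient_id", how="left", indicator=True
)
orphans = linked.loc[linked["_merge"] == "left_only", "sample_id"].tolist()

# Pre-computed metadata: fraction of non-missing values per column, and a range.
completeness = patients.notna().mean()
age_range = (patients["age"].min(), patients["age"].max())

# Longitudinal alignment: order events chronologically within each patient.
timeline = events.sort_values(["patient_id", "date"]).reset_index(drop=True)
```

In this toy data, sample `S3` references a patient that does not exist, so it would be reported as orphaned rather than silently losing its clinical context.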

#### Technologies used

| Technology                    | Purpose                                                         |
| ----------------------------- | --------------------------------------------------------------- |
| **AnnData / Scanpy**          | Handling single-cell and spatial transcriptomics objects.       |
| **NumPy / Pandas**            | Core matrix manipulation and numerical processing.              |
| **Parquet**                   | Columnar storage format for efficient, large-scale data access. |
| **AWS SDK (or other vendor)** | Cloud-native data management and storage interface.             |
| **SQL**                       | Universal query language for all standardized data levels.      |

What this means for your analyses:

* **Granular access:** Query clinical data (e.g., all patients on immunotherapy) and drill down to single-cell gene expression or immune cell density.
* **Ready-to-view:** Pre-computed visualizations like UMAPs and heatmaps reduce wait times.
* **Ontological truth:** Synonymous gene names or mismatched disease labels are aligned automatically.
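
As an illustration of the granular access described above, the sketch below filters patients on a clinical attribute and drills down to cell-level expression in a single SQL query. An in-memory SQLite database stands in for the platform's SQL engine, and every table and column name is an assumption, not the actual K Data Model schema.

```python
import sqlite3

# In-memory SQLite stands in for the platform's SQL engine; all table and
# column names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patient (patient_id TEXT PRIMARY KEY, treatment TEXT);
    CREATE TABLE sample (sample_id TEXT PRIMARY KEY, patient_id TEXT);
    CREATE TABLE cell_expression (
        cell_id TEXT, sample_id TEXT, gene TEXT, value REAL
    );

    INSERT INTO patient VALUES ('P1', 'immunotherapy'), ('P2', 'chemotherapy');
    INSERT INTO sample VALUES ('S1', 'P1'), ('S2', 'P2');
    INSERT INTO cell_expression VALUES
        ('C1', 'S1', 'CD274', 2.0),
        ('C2', 'S1', 'CD274', 4.0),
        ('C3', 'S2', 'CD274', 9.0);
    """
)

# Patient-level filter joined down to cell-level values via shared keys.
query = """
    SELECT p.patient_id, AVG(c.value) AS mean_expression
    FROM patient AS p
    JOIN sample AS s ON s.patient_id = p.patient_id
    JOIN cell_expression AS c ON c.sample_id = s.sample_id
    WHERE p.treatment = ? AND c.gene = ?
    GROUP BY p.patient_id
"""
rows = conn.execute(query, ("immunotherapy", "CD274")).fetchall()
conn.close()
```

The joins rely on the same kind of common keys that link patients, samples, and cells in the relational mapping step.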

