
Understand your data (QC, methods)

For your data to be fully exploitable by K Pro's AI agents, it needs to meet a defined set of structural and quality requirements. This section explains how Owkin thinks about AI data readiness, which biological modalities and file formats K Pro currently supports, and the naming conventions and ontologies data must conform to. Whether you are preparing your own dataset for integration or evaluating a third-party dataset, this section is the technical reference you need.


K Pro uses a multi-layer quality and provenance framework for public data.

Data quality. Public datasets are transformed into a unified K Data Model with schema harmonization, ontology mapping, normalization, and multi-level structuring. Quality controls include relational mapping to avoid orphaned records, pre-computed metadata, and other filtering steps for integrity and fast exploration.
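As an illustration of the relational-mapping control mentioned above, the sketch below flags child records whose parent reference cannot be resolved. It is a minimal example under stated assumptions, not K Pro's actual pipeline: the function name `find_orphans`, the `patient_id` and `sample_id` fields, and the toy records are all hypothetical.

```python
from typing import Any, Dict, Iterable, List


def find_orphans(
    child_rows: Iterable[Dict[str, Any]],
    parent_ids: Iterable[Any],
    parent_key: str = "patient_id",  # hypothetical foreign-key field name
) -> List[Dict[str, Any]]:
    """Return the child rows whose parent_key value is missing from parent_ids."""
    known = set(parent_ids)
    # A row is orphaned if its key is absent, None, or points to an unknown parent.
    return [row for row in child_rows if row.get(parent_key) not in known]


# Toy data, invented for illustration only.
patients = [{"patient_id": "P1"}, {"patient_id": "P2"}]
samples = [
    {"sample_id": "S1", "patient_id": "P1"},  # resolves to P1
    {"sample_id": "S2", "patient_id": "P3"},  # P3 is not a known patient
    {"sample_id": "S3", "patient_id": None},  # no parent reference at all
]

orphans = find_orphans(samples, (p["patient_id"] for p in patients))
orphan_ids = [row["sample_id"] for row in orphans]
```

Running a check like this on your own tables before integration can surface records that would otherwise be dropped or left unlinked during harmonization.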

Provenance tracking. Every analysis is anchored in authoritative sources and validated biological knowledge bases. The platform keeps complete provenance for outputs and logs data sources, model decisions, and reasoning steps. For literature-backed content, K Pro uses retrieval-augmented generation (RAG) to verify cited PubMed articles.

Version control / traceability. The AI-Readiness Maturity Model defines version history at Level 2 and full data lineage at Level 4. Level 5 adds reproducibility (code and environment) and detailed audits. Public dataset pages also expose dataset versions where available; for example, the TCGA entry lists a version in the dataset catalog.
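The level-to-capability relationship described above can be sketched as a small lookup. This is only an illustration: it encodes just the three levels named in this section, and it assumes that levels run from 1 to 5 and that capabilities are cumulative (a dataset at a given level also has everything from the lower levels). The names `TRACEABILITY_BY_LEVEL` and `traceability_at` are hypothetical.

```python
from typing import Dict, List

# Only the levels described in this section; Levels 1 and 3 are not covered here.
TRACEABILITY_BY_LEVEL: Dict[int, str] = {
    2: "version history",
    4: "full data lineage",
    5: "reproducibility (code and environment) and detailed audits",
}


def traceability_at(level: int) -> List[str]:
    """List the traceability capabilities reached at a given maturity level.

    Assumes levels are numbered 1 to 5 and that capabilities accumulate.
    """
    if not 1 <= level <= 5:
        raise ValueError("maturity level must be between 1 and 5")
    return [
        capability
        for lvl, capability in sorted(TRACEABILITY_BY_LEVEL.items())
        if lvl <= level
    ]


level_3 = traceability_at(3)
level_5 = traceability_at(5)
```

Under these assumptions, a Level 3 dataset would have version history but not yet full data lineage.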

Proprietary data (MOSAIC and similar). Owkin's computational biology teams have developed data processing pipelines that follow gold-standard practices and, where necessary, have optimized them for the specific dataset. Biomedical experts were included in the development loop to run confirmatory analyses on the data, annotate single-cell clusters, and perform similar checks, which further helps ensure high-quality data for discovery and other typical uses.

The AI-maturity model

Supported modalities

Preferred ontologies and nomenclatures
