# Prompting guide & prompt library

Learn how to get the best results from K-Pro. This guide covers the principles behind effective prompts and provides a ready-to-use library organized by use case.

### Start here

New to K-Pro? Try this prompt right now:

> **Provide clinical characteristics of the TCGA-BRCA cohort. Include the distribution of tumor stage and age at diagnosis.**

You'll get summary tables and distributions showing the clinical breakdown of the cohort. From there, try a follow-up:

> **Compute PAM50-like subtype scores using key marker genes from bulk RNA-seq and stratify survival accordingly**

This is the core loop in K-Pro: ask a question, get a result, refine with a follow-up. Every prompt in this guide works this way.

### Part 1 — How to prompt K-Pro effectively

#### The A-S-R-T-C Framework

Every high-performing prompt follows a five-part structure. This framework was built from an analysis of 500+ real K-Pro interactions: prompts that follow it succeed consistently, while prompts that skip components hit predictable failure modes.

**\[Action] + \[Subject] + \[Resolution] + \[Tool] + \[Context]**

| Component          | What it is                          | Why it matters                                                                    | Keywords & examples                                             |
| ------------------ | ----------------------------------- | --------------------------------------------------------------------------------- | --------------------------------------------------------------- |
| **A — Action**     | The analytical intent               | Prevents the AI from just "describing" — forces it to calculate                   | `Analyze`, `Compare`, `Correlate`, `Perform DEG`, `Generate KM` |
| **S — Subject**    | Gene symbols + aliases (HGNC)       | Avoids "Column Not Found" errors                                                  | `TNFRSF9 (CD137)`, `CD274 (PD-L1)`, `VSIR (VISTA)`              |
| **R — Resolution** | Data granularity                    | Directs the AI to the correct single-cell or bulk layer                           | `Level 3 (refined types)`, `Level 4 (subsets)`, `Bulk RNA`      |
| **T — Tool**       | Visualization type                  | Bypasses tool-specific failure modes (e.g., heatmap crashes on large gene sets)   | `dotplot`, `Violin Plot`, `Kaplan-Meier`, `Volcano plot`        |
| **C — Context**    | Biological or cohort stratification | Filters out irrelevant data (e.g., GBM cells appearing in a bladder cancer query) | `Epithelioid vs. Biphasic`, `TP53 Mut vs. WT`, `TCGA-BRCA`      |

> **Not every prompt needs all five components.** Literature searches typically need only A + S + C. Data exploration needs at least A + S + C. Visualization and spatial queries benefit from the full A-S-R-T-C.
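
The five components can also be assembled programmatically when you prepare many prompts at once. The sketch below is only an illustration: `ASRTCPrompt` and `render` are hypothetical names invented here, not part of K-Pro, and the sentence pattern is one assumed way of ordering the components.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ASRTCPrompt:
    """Hypothetical container for the five components; not a K-Pro API."""
    action: str                       # A: analytical intent, e.g. "Analyze"
    subject: str                      # S: HGNC symbol plus alias
    context: str                      # C: cohort or biological stratification
    resolution: Optional[str] = None  # R: data granularity (optional)
    tool: Optional[str] = None        # T: visualization type (optional)

    def render(self) -> str:
        """Join the supplied components into one sentence, skipping empty ones."""
        parts = [f"{self.action} {self.subject}"]
        if self.resolution:
            parts.append(f"at {self.resolution}")
        if self.tool:
            parts.append(f"using a {self.tool}")
        parts.append(f"in {self.context}")
        return " ".join(parts) + "."


# Full A-S-R-T-C prompt, using values that appear in this guide.
full = ASRTCPrompt(
    action="Analyze",
    subject="TNFRSF9 (CD137)",
    resolution="Single-cell Level 3",
    tool="dotplot",
    context="Epithelioid vs. Biphasic Mesothelioma",
)

# Literature-style prompt that only needs A + S + C.
minimal = ASRTCPrompt(
    action="Summarize publications on",
    subject="TP53",
    context="non-small cell lung cancer",
)

print(full.render())
print(minimal.render())
```

Keeping Resolution and Tool optional mirrors the note above: a literature search can omit them, while a visualization query would normally set both.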

#### From weak to effective — see the difference

Example: Gene expression in mesothelioma

**Weak prompt:**

> "Show me CD137 in mesothelioma."

*What goes wrong:* The AI might pick the wrong data resolution, use a basic bar chart, or fail to find the gene name "CD137" in the database columns.

**Better:**

> "Analyze TNFRSF9 at Level 2 in mesothelioma."

*Improved:* Uses the official gene symbol and specifies resolution. But still lacks the visualization tool and comparison context.

**A-S-R-T-C prompt:**

> "**Analyze** \[A] **TNFRSF9 (CD137)** \[S] at **Single-cell Level 3** \[R] using a **dotplot** \[T] in **Epithelioid vs. Biphasic Mesothelioma** \[C]."

*Why it works:* The agent knows exactly what to calculate, where to find the gene, what resolution to use, which chart to produce, and how to stratify the data.

#### Common failure modes and how to fix them

These are the four most frequent errors observed in real K-Pro sessions. Each one maps to a missing A-S-R-T-C component.

**1. The "Complexity Crash" — too many variables**

|                     |                                                                                                                                                    |
| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Failed prompt**   | "Plot expression for AREG, EREG, HBEGF, BTC, TGFA, EGFR, KRAS, BRAF, MET, and HER3 in a heatmap for all bladder cancer types."                     |
| **What went wrong** | Overloaded the heatmap tool with too many genes across too many categories → timeout error                                                         |
| **Fix**             | "**Analyze** the EGFR-ligand and bypass pathway genes at **Bulk RNA level** using a **dotplot** stratified by **Bladder Histological Subgroups**." |
| **Why it works**    | Switching the **Tool** (T) to a dotplot handles high-dimensional data more efficiently than a heatmap                                              |

**2. The "Schema Error" — gene alias not found**

|                     |                                                                                                                                                        |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Failed prompt**   | "Show me VISTA expression in mesothelioma."                                                                                                            |
| **What went wrong** | The agent couldn't find "VISTA" in column headers → "Column Not Found" error                                                                           |
| **Fix**             | "**Compare** **VSIR (VISTA)** expression at **Single-cell Level 3** using a **Violin Plot** across **Epithelioid and Biphasic mesothelioma subtypes**." |
| **Why it works**    | Using the official HGNC symbol as **Subject** (S) with the alias in parentheses ensures correct mapping                                                |
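
If you write prompts from a list of familiar gene names, a small lookup can apply the `SYMBOL (ALIAS)` convention for you. This is a minimal sketch: the `ALIAS_TO_HGNC` table and the `format_subject` helper are hypothetical, and the table contains only the three pairs shown in this guide. A real workflow would consult the HGNC database rather than a hand-written dictionary.

```python
# Illustrative alias table built only from the pairs shown in this guide.
# Assumption: a real workflow would look symbols up in HGNC instead.
ALIAS_TO_HGNC = {
    "CD137": "TNFRSF9",
    "PD-L1": "CD274",
    "VISTA": "VSIR",
}


def format_subject(name: str) -> str:
    """Return 'SYMBOL (ALIAS)' for a known alias, else the name unchanged."""
    symbol = ALIAS_TO_HGNC.get(name)
    return f"{symbol} ({name})" if symbol else name


print(format_subject("VISTA"))    # known alias: official symbol comes first
print(format_subject("NECTIN4"))  # not in the table: passed through as-is
```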

**3. The "Ambiguity Error" — wrong indication pulled in**

|                     |                                                                                                                    |
| ------------------- | ------------------------------------------------------------------------------------------------------------------ |
| **Failed prompt**   | "What are the top cell types expressing CD137?"                                                                    |
| **What went wrong** | The AI pulled Glioblastoma data (Astrocytes/Microglia) because no cancer type was specified                        |
| **Fix**             | "Analyze TNFRSF9 (CD137) distribution at Single-cell Level 3 specifically within the Bladder Cancer MOSAIC cohort" |
| **Why it works**    | Providing the **Context** (C) filters out non-relevant indications                                                 |

**4. The "Spatial Resolution" failure — too vague for spatial data**

|                     |                                                                                                                           |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------- |
| **Failed prompt**   | "Where is FAP expressed in the slides?"                                                                                   |
| **What went wrong** | Too vague — produced a generic slide view with no quantitative insight                                                    |
| **Fix**             | "Calculate the spatial abundance of FAP within the Invasive Front vs. Tumor Core across ovarian and bladder cancer patients." |
| **Why it works**    | Forces a calculation (A) on specific spatial niches (C) rather than a simple visual search                                |

#### The "Chain of Thought" sequence

Your most productive sessions will follow a logical drill-down path. Don't start with a complex differential expression query — start simple and build up.

| Step            | What to ask                                         | Example                                                                                                            |
| --------------- | --------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| **1. Survey**   | Broad overview of a gene across indications         | "Show expression of NECTIN4 across all MOSAIC cancer types."                                                       |
| **2. Focus**    | Zoom into one indication at higher resolution       | "Zoom into MOSAIC-BLCA at Level 3 resolution."                                                                     |
| **3. Spatial**  | Check co-localization in the tumor microenvironment | "Do NECTIN4+ tumor cells co-localize with CD8+ T cells in the invasive front in MOSAIC bladder cancer patients?"   |
| **4. Clinical** | Link findings to patient outcomes                   | "Does NECTIN4 overexpression correlate with overall survival in TCGA-BLCA?"                                        |

Each step builds on the context from the previous one — K-Pro maintains conversation history within a session.

#### Quick-reference prompt templates

Copy these templates and fill in the bracketed values.

**Survival Analysis**

```
Generate Kaplan-Meier curves for [OS/PFS/rwPFS] in bladder cancer
patients from TCGA, stratified by FGFR3 expression
(high vs. low using median cutoff).
```
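
To make the request in this template concrete, the sketch below shows what "high vs. low using median cutoff" and a Kaplan-Meier estimate mean computationally. It is an illustration in plain Python on made-up numbers, not a description of how K-Pro computes its curves. The function names and the convention that values strictly above the median count as "high" are assumptions.

```python
from statistics import median


def median_split(expression):
    """Label each sample 'high' if strictly above the cohort median, else 'low'.

    Assumption: ties with the median fall into the 'low' group.
    """
    cutoff = median(expression.values())
    return {pid: ("high" if value > cutoff else "low")
            for pid, value in expression.items()}


def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each observed event time.

    times: follow-up durations; events: 1 = event observed, 0 = censored.
    """
    curve, survival = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        # Everyone still under observation just before time t is at risk.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Made-up expression values and follow-up data, for illustration only.
toy_expression = {"P1": 1.0, "P2": 4.0, "P3": 2.5, "P4": 7.0}
groups = median_split(toy_expression)

toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 1, 0, 1, 0]
curve = kaplan_meier(toy_times, toy_events)

print(groups)
print(curve)
```

In a stratified analysis the estimator would be run once per group, and the resulting curves compared with a log-rank test.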

**Differential Expression**

```
Perform a DEG analysis comparing immunotherapy responders (complete + partial responders)
and non-responders (progressive and stable disease) in MOSAIC NSCLC patients,
excluding unknown/not evaluable patients. Adjust for [age, sex].
Show the top 50 DEGs as a volcano plot with |log2FC| > 1 and padj < 0.05 highlighted.
```
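
The two thresholds in this template define which genes are highlighted on the volcano plot. As a rough sketch of that rule, a gene must pass both the fold-change and the adjusted p-value cutoff. The helper name, the gene labels, and the values below are all made up for illustration.

```python
def is_highlighted(log2fc: float, padj: float,
                   fc_cutoff: float = 1.0, padj_cutoff: float = 0.05) -> bool:
    """True when a gene passes both volcano-plot thresholds."""
    return abs(log2fc) > fc_cutoff and padj < padj_cutoff


# Made-up results: gene label -> (log2 fold change, adjusted p-value).
toy_results = {
    "GENE_A": (2.3, 0.001),   # large effect, significant
    "GENE_B": (-1.7, 0.03),   # large negative effect, significant
    "GENE_C": (0.4, 0.0001),  # significant but effect too small
    "GENE_D": (3.1, 0.20),    # large effect but not significant
}

highlighted = [gene for gene, (fc, p) in toy_results.items()
               if is_highlighted(fc, p)]
print(highlighted)
```

Taking the absolute value of the fold change keeps both up-regulated and down-regulated genes.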

**Spatial Analysis**

```
Visualize FGFR3 and NECTIN4 spatial co-expression in
bladder cancer from MOSAIC, showing correlation within
tumor regions vs. invasive front.
```

**Cell Type Profiling**

```
Compare CD274 (PD-L1) expression across Level 4 immune cell types in
scRNA-seq data from bladder cancer.
Display as a dot plot with expression level and percentage of cells expressing.
```
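
The dot plot requested in this template encodes two numbers per cell type: an expression level and the percentage of cells expressing the gene. The sketch below shows one plausible way to derive them from per-cell values. It assumes that "expressing" means a value above zero and that the expression level is the mean over all cells in the group. K-Pro's actual definitions may differ, and the cell-type labels and values are made up.

```python
def dotplot_stats(cells):
    """Summarize one gene per cell type: mean expression and % expressing.

    cells: iterable of (cell_type, expression) pairs.
    Assumption: a cell 'expresses' the gene when its value is above zero.
    """
    grouped = {}
    for cell_type, value in cells:
        grouped.setdefault(cell_type, []).append(value)
    return {
        cell_type: {
            "mean_expression": sum(values) / len(values),
            "pct_expressing": 100 * sum(v > 0 for v in values) / len(values),
        }
        for cell_type, values in grouped.items()
    }


# Made-up per-cell values for a single gene.
toy_cells = [
    ("CD8 T", 0.0), ("CD8 T", 2.0), ("CD8 T", 4.0), ("CD8 T", 0.0),
    ("Macrophage", 1.0), ("Macrophage", 3.0),
]
stats = dotplot_stats(toy_cells)
print(stats)
```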

**Target Discovery**

```
Identify differentially expressed genes when 
the primary target TP53 is inhibited in Bladder cancer
```

***

### Part 2 — Prompt Library

Ready-to-use prompts organized by use case. Each prompt has been verified against K-Pro documentation.

#### Literature Review & Biomedical Knowledge

**Summarize publications on a specific gene or pathway**

**Prompt:**

> Summarize the latest PubMed publications on the role of TP53 mutations in non-small cell lung cancer. Focus on findings from the last 3 years and highlight any consensus on prognostic significance.

**Expected result:** A structured summary of key publications with citations, organized by main findings and areas of consensus/controversy.

**Follow-up:**

> Are there any conflicting findings across these studies regarding TP53's role as a predictive biomarker for immunotherapy response in NSCLC?

**Tip:** Be specific about the gene, indication, and timeframe. Vague prompts like "tell me about TP53" return overly broad results.

**Search for publications about a drug target**

**Prompt:**

> Find publications investigating TROP2 as a therapeutic target in triple-negative breast cancer. Include any data on TROP2 expression levels and their correlation with clinical outcomes.

**Expected result:** A curated list of relevant publications with key findings, expression data, and clinical correlations.

**Follow-up:**

> Based on these publications, what is the evidence for using TROP2 expression as a patient selection biomarker for ADC therapies?

**Tip:** Combine the target name with a specific indication and the type of evidence you need (expression, outcomes, mechanisms).

#### Data Exploration & Cohort Analysis

**Explore TCGA cohort demographics**

**Prompt:**

> Provide clinical characteristics of the TCGA-BRCA cohort. Include the distribution of tumor stage and age at diagnosis.

**Expected result:** Summary tables and/or distributions showing the clinical breakdown of the TCGA-BRCA cohort.

**Tip:** Always specify the exact TCGA cohort code (e.g., TCGA-BRCA, TCGA-LUAD) rather than the disease name alone.

**A-S-R-T-C breakdown:** Show \[A] clinical characteristics \[S: cohort variables] at bulk level \[R] as summary tables \[T] in TCGA-BRCA \[C].

**Explore MOSAIC Window multiomics data**

**Prompt:**

> Provide the number of MOSAIC bladder cancer patients who have data for all available modalities (RNA-seq, single-cell, spatial, H\&E, clinical).

**Expected result:** A summary of MOSAIC Window data availability for the requested indication, broken down by modality.

**Follow-up:**

> For ovarian cancer in MOSAIC Window, show the distribution of HRD status.

**Note:** There are no CRC samples in MOSAIC Window.

**Tip:** MOSAIC Window is Owkin's proprietary multimodal dataset. Specify the indication to scope the data landscape before diving into analysis.

**Characterize a patient cohort**

**Prompt:**

> Characterize the TCGA-LUAD cohort: show me the distribution of KRAS mutation status, smoking history, tumor stage, and median overall survival for KRAS wild-type and KRAS-mutated patients.

**Expected result:** A multi-variable cohort characterization with summary statistics per subgroup.

**Follow-up:**

> Which of these subgroups has the worst overall survival, and what are their distinguishing molecular features?

**Tip:** List the specific variables you want characterized upfront — K-Pro works best when it knows exactly what you're looking for.

**Assess gene expression across tissues (target prioritization)**

**Prompt:**

> Compare the expression of NECTIN4 across all TCGA cancer types. Show me a pan-cancer overview ranked by median expression level.

**Expected result:** A ranked table or visualization of NECTIN4 expression across TCGA indications.

**Follow-up:**

> For the top 3 indications with highest NECTIN4 overexpression, show me the correlation between NECTIN4 expression and overall survival.

**A-S-R-T-C breakdown:** Compare \[A] NECTIN4 \[S] at Bulk RNA level \[R] as a ranked table \[T] across all TCGA cancer types \[C].

**Multiomics integration**

**Prompt:**

> In the TCGA-BRCA cohort, integrate RNA-seq gene expression and mutation data. Identify genes that are both differentially expressed and frequently mutated in basal-like versus luminal A patients.

**Expected result:** A multi-omics integration result showing genes that appear significant across both data modalities.

**Follow-up:**

> Perform an individual gene expression comparison of MSigDB DNA repair, oxidative stress, and EMT pathway signatures at Bulk RNA level using a violin plot with p-values comparing Basal-like and Luminal A TCGA-BRCA patients.

**Tip:** Multiomics queries are computationally intensive. Start with a specific comparison (two subgroups) rather than "analyze everything."

#### Visualization

**Agent:** Data Explorer (with visualization capabilities)

**Generate a Kaplan-Meier survival curve**

**Prompt:**

> Create a Kaplan-Meier survival curve for TCGA-BRCA patients stratified by TP53 mutation status (mutated vs. wild-type). Include the p-value from a log-rank test and the number of patients at risk.

**Expected result:** A Kaplan-Meier plot with two curves, log-rank p-value, and at-risk table.

**Tip:** K-Pro supports real-time plot iteration. Ask for specific formatting changes (colors, labels, font size) in follow-up prompts rather than trying to specify everything at once.

**A-S-R-T-C breakdown:** Create \[A] TP53 mutation survival analysis \[S] at bulk level \[R] as a KM curve with log-rank test \[T] in TCGA-BRCA, mutated vs. wild-type \[C].

**Create comparative visualizations**

**Prompt:**

> Create a box plot comparing the expression of CD274 (PD-L1) across the five molecular subtypes in TCGA-BRCA. Add individual data points and mark statistically significant pairwise comparisons.

**Expected result:** A box plot with overlaid data points and significance brackets between groups.

**Follow-up:**

> Now create a heatmap of the top 50 differentially expressed genes between these molecular subtypes.

**Tip:** You can request multiple plot types in sequence. Each builds on the data context from the previous query.

#### Statistical Analysis

**Compute survival statistics**

**Prompt:**

> In the TCGA-STAD cohort, test whether there is a statistically significant difference in overall survival between microsatellite-instable (MSI-H) and microsatellite-stable (MSS) patients. Report the hazard ratio, 95% confidence interval, and log-rank p-value.

**Expected result:** Statistical test results with HR, CI, and p-value, plus a supporting KM curve.

**Follow-up:**

> Run a multivariate Cox regression adjusting for age, stage, and MSI status to confirm whether MSI status is an independent prognostic factor.

**Tip:** Always specify the statistical test you want (log-rank, Cox, t-test) for precise results. K-Pro backs every analysis with p-values and population-level data.

#### Cross-Dataset Discovery

**Link datasets to find novel biological mechanisms**

**Prompt:**

> In the TCGA-KIRC cohort, identify genes whose expression is significantly associated with both VHL mutation status and response to immune checkpoint inhibitors. Cross-reference findings with published literature on VHL-related immune evasion mechanisms.

**Expected result:** A list of candidate genes with statistical associations, linked to supporting literature evidence.

**Follow-up:**

> For the top 3 candidate genes, show me their spatial expression patterns in the tumor microenvironment using MOSAIC data if available.

**Tip:** This is a multi-step, multi-agent use case. Be explicit about both the data analysis you want AND the literature cross-reference.

### Do's and don'ts

#### Do

* **Use official gene symbols** with common aliases in parentheses: `VSIR (VISTA)`, not just `VISTA`
* **Specify the dataset** using the cohort name and the indication.
* **Name the statistical test** you want: log-rank, Cox regression, t-test
* **Iterate in follow-ups** — refine plots, add filters, drill deeper
* **Start simple, then drill down** — follow the Survey → Focus → Spatial → Clinical sequence
* **Specify the visualization type** — dotplot, violin, KM curve — to avoid tool-selection errors

#### Don't

* **Don't write "search engine" prompts** — "Tell me about TP53" is too vague
* **Don't overload a single prompt** — stacking multiple plots or analyses into one request is more likely to fail than running a single complex analysis. Break the work into sequential steps, the way you would during a scientific deep-dive.
* **Don't skip the indication/cohort** — without context, the AI may pull irrelevant data from other cancer types
* **Don't try to specify everything in one prompt** — ask for the analysis first, then adjust formatting in follow-ups

