Multi-Search Workflow
Run several searches with different phrasings to maximise recall, then merge and deduplicate before analysis.
Why Multiple Searches?
Semantic search matches conceptually similar content, but different phrasings cast different nets. A single search for “budget concerns” might miss documents about “cost overruns” or “financial pressure”. Running multiple searches with varied phrasing catches more relevant results.
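The payoff is easy to see with plain sets of document IDs (a toy illustration; the IDs and overlaps below are invented, not real search output):

```python
# Result IDs that three hypothetical phrasings might return.
# Real searches return scored documents; here we only track IDs.
delays = {"doc-01", "doc-07", "doc-12"}    # "project delays timeline concerns"
budget = {"doc-07", "doc-19", "doc-23"}    # "budget overrun cost escalation"
feedback = {"doc-02", "doc-12", "doc-19"}  # "stakeholder feedback complaints"

# Any single phrasing finds 3 documents; the union of all three finds 6.
combined = delays | budget | feedback
print(len(delays), len(combined))  # 3 6
```

Each phrasing contributes documents the others miss, which is why the workflow merges all the result sets before analysis.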
Step-by-Step
1. Create an Evidence Directory
```shell
mkdir -p evidence
```

2. Run Multiple Semantic Searches
Use different phrasings for the same concept:
```shell
uv run python query.py \
  --semantic "project delays timeline concerns" \
  --top-k 200 --export-json evidence/delays.json

uv run python query.py \
  --semantic "budget overrun cost escalation" \
  --top-k 200 --export-json evidence/budget.json

uv run python query.py \
  --semantic "stakeholder feedback complaints" \
  --top-k 200 --export-json evidence/feedback.json
```

3. Combine Metadata Filters
Add sender, date, or source-type filters to narrow results:
```shell
uv run python query.py \
  --semantic "project risk" \
  --sender "[email protected]" \
  --date-range 2025-01-01 2025-06-30 \
  --top-k 100 --export-json evidence/pm-risks.json
```

4. Merge and Deduplicate
The merge tool removes duplicate documents found by multiple searches:
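Under the hood, first-seen deduplication over the exported JSON files might look like the sketch below (illustrative only, not merge.py's actual code; the `id` field name is an assumption about the export format):

```python
import json
from pathlib import Path

def merge_results(paths, id_field="id"):
    """Merge search exports, keeping the first copy of each document."""
    seen, merged = set(), []
    for path in paths:
        for result in json.loads(Path(path).read_text()):
            if result[id_field] not in seen:
                seen.add(result[id_field])
                merged.append(result)
    return merged
```

Keeping the first occurrence preserves each document's metadata from the search that found it first, while later duplicates are dropped.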
```shell
uv run python merge.py evidence/*.json --output evidence/merged.json
```

5. Triage Separately
Run triage as a standalone step so results are saved to disk. This lets you re-run deep analysis at different thresholds without re-triaging:
```shell
# Recommended: gemini-flash for triage (cheapest, fastest)
uv run python analyze.py evidence/merged.json \
  --triage \
  --model gemini-flash \
  --truncate 500 \
  --concurrency 5 \
  --context "Key project risks and stakeholder concerns" \
  --output evidence/triaged.json \
  --dry-run

# Free, private, slow (local Mistral 7B):
uv run python analyze.py evidence/merged.json \
  --triage \
  --local \
  --context "Key project risks and stakeholder concerns" \
  --output evidence/triaged.json
```

Use `gemini-flash` for triage (cheapest, fastest), with `--truncate 500` and `--concurrency 5` for speed.
Checkpoints save every wave — if interrupted, re-run the same command to resume.
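The resume pattern can be sketched as follows (an illustration of the technique, not analyze.py's actual implementation; the document `id` field and the `triage_fn` callback are assumptions):

```python
import json
from pathlib import Path

def triage_with_checkpoints(docs, checkpoint_path, triage_fn, wave_size=10):
    """Triage docs in waves, persisting results after each wave so an
    interrupted run can resume by skipping already-triaged document IDs."""
    path = Path(checkpoint_path)
    done = json.loads(path.read_text()) if path.exists() else {}
    pending = [d for d in docs if d["id"] not in done]
    for i in range(0, len(pending), wave_size):
        for doc in pending[i:i + wave_size]:
            done[doc["id"]] = triage_fn(doc)
        path.write_text(json.dumps(done))  # checkpoint after every wave
    return done
```

Because the checkpoint is reloaded on startup, re-running the same command only processes documents that have not yet been triaged.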
6. Deep Analysis (on triaged results)
Use --deep-only to skip triage and run deep analysis on already-triaged data:
```shell
uv run python analyze.py evidence/triaged.json \
  --deep-only \
  --min-relevance 5 \
  --context "Key project risks and stakeholder concerns" \
  --model deepseek \
  --dry-run
```

Review the cost estimate, then run without `--dry-run`. If the output is too noisy, re-run at `--min-relevance 7` (no re-triage needed).
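Raising the threshold needs no re-triage because the relevance scores are already saved to disk; tightening the cut is just a filter over the triaged records. A minimal sketch (the `relevance` field name is an assumption about the triaged JSON):

```python
def filter_by_relevance(triaged, threshold):
    """Keep only records at or above a relevance threshold."""
    return [r for r in triaged if r["relevance"] >= threshold]

# The same triaged data can be cut at different thresholds with no extra API calls:
triaged = [{"id": "a", "relevance": 9},
           {"id": "b", "relevance": 6},
           {"id": "c", "relevance": 3}]
print(len(filter_by_relevance(triaged, 5)))  # 2
print(len(filter_by_relevance(triaged, 7)))  # 1
```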
Best Practices
- Start broad, then narrow: cast a wide net first, then filter
- Use 3-5 different phrasings per topic for good coverage
- `--top-k 200` is a good default; adjust based on corpus size
- Always triage separately with `--triage --output` so results are saved
- Start with `--min-relevance 5` to avoid missing borderline evidence
- Always `--dry-run` before paid analysis to check costs
- Use `gemini-flash` for triage (cheapest, fastest), `deepseek` for deep analysis (best reasoning)
- Use `--truncate 500` and `--concurrency 5` for fast triage
- Use `--retry-failed` to re-triage failed batches without re-running everything
- Triage checkpoints save progress; re-run to resume if interrupted