Internal Investigations

Cross-reference communications between multiple parties to establish who knew what and when.

The Problem

Internal investigations require tracing information flow across multiple people and communication channels. You need to build a timeline showing when key decisions were made, who was informed, and what was discussed.

Workflow

1. Ingest All Relevant Sources

uv run python dedup.py
uv run python ingest.py --reset

2. Map the Communication Landscape

Use explore.py to understand who communicated and when:

# Full corpus overview
uv run python explore.py
 
# Focus on a specific year
uv run python explore.py --year 2025
 
# Check specific individuals
uv run python explore.py --sender [email protected]
uv run python explore.py --sender [email protected]

3. Track Specific Communications

Search for communications between key individuals:

mkdir -p evidence
 
uv run python query.py \
  --sender "[email protected],[email protected]" \
  --date-range 2025-01-01 2025-06-30 \
  --export-json evidence/alice-bob.json
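
To answer "who knew what and when", one useful first pass over an export is to find the earliest date each person appears in the thread. The sketch below is illustrative only: it assumes the export is a JSON list of objects with `sender`, `recipients`, and ISO-formatted `date` fields, which may not match what query.py actually writes. The sample records and addresses are hypothetical.

```python
from datetime import datetime

def first_contact(records):
    """Return the earliest timestamp at which each address sends or receives a message."""
    earliest = {}
    for rec in records:
        when = datetime.fromisoformat(rec["date"])
        # Treat the sender and every recipient as "aware" of this message.
        for person in [rec["sender"], *rec.get("recipients", [])]:
            if person not in earliest or when < earliest[person]:
                earliest[person] = when
    return earliest

# Hypothetical records standing in for the contents of an exported file.
sample = [
    {"sender": "a@example.com", "recipients": ["b@example.com"], "date": "2025-02-03T09:15:00"},
    {"sender": "b@example.com", "recipients": ["c@example.com"], "date": "2025-02-04T11:00:00"},
    {"sender": "a@example.com", "recipients": ["c@example.com"], "date": "2025-01-20T08:00:00"},
]

result = first_contact(sample)
for person, when in sorted(result.items(), key=lambda kv: kv[1]):
    print(when.date(), person)
```

With a real export, `sample` would be replaced by `json.load(open(...))` on the exported file, after checking the field names against the actual output.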

4. Search for Key Topics

uv run python query.py \
  --semantic "approval authorisation sign-off decision" \
  --date-range 2025-01-01 2025-06-30 \
  --top-k 200 --export-json evidence/approvals.json
 
uv run python query.py \
  --semantic "risk warning concern escalation" \
  --date-range 2025-01-01 2025-06-30 \
  --top-k 200 --export-json evidence/warnings.json
 
uv run python query.py \
  --semantic "meeting minutes discussion agreed actions" \
  --source-type meeting_note \
  --top-k 200 --export-json evidence/meetings.json

5. Merge Results

uv run python merge.py evidence/*.json --output evidence/merged.json
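
The same message can match several of the searches above, so a merged file is most useful when each message appears once. Whether merge.py deduplicates, and on which key, is not stated here; the sketch below only illustrates the idea, assuming each record carries a unique `id` field (a hypothetical name).

```python
def merge_records(*result_sets, key="id"):
    """Concatenate result sets, keeping the first occurrence of each record key."""
    seen = set()
    merged = []
    for results in result_sets:
        for rec in results:
            if rec[key] in seen:
                continue  # already collected from an earlier result set
            seen.add(rec[key])
            merged.append(rec)
    return merged

# Hypothetical hits from two searches that overlap on message "m2".
approvals = [{"id": "m1", "subject": "Sign-off"}, {"id": "m2", "subject": "Approval"}]
warnings = [{"id": "m2", "subject": "Approval"}, {"id": "m3", "subject": "Concern"}]

merged = merge_records(approvals, warnings)
print([rec["id"] for rec in merged])
```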

6. Triage Separately (recommended)

Run triage as a standalone step so results are saved to disk:

# Recommended: gemini-flash for triage (cheapest, fastest)
uv run python analyze.py evidence/merged.json \
  --triage \
  --model gemini-flash \
  --truncate 500 \
  --concurrency 5 \
  --context "Key decisions, who was involved, what information was available" \
  --output evidence/triaged.json \
  --dry-run
 
# Free, private, slow (local Mistral 7B):
uv run python analyze.py evidence/merged.json \
  --triage \
  --local \
  --context "Key decisions, who was involved, what information was available" \
  --output evidence/triaged.json

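Use gemini-flash for triage (cheapest, fastest). Use --truncate 500 and --concurrency 5 for speed. The first command above includes --dry-run: review the cost estimate, then run it again without --dry-run so that evidence/triaged.json is written. Checkpoints save after every wave; if a run is interrupted, re-run the same command to resume.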
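
The wave-and-checkpoint behaviour can be pictured with a small sketch. This is not analyze.py's implementation: the checkpoint file format, the `id` field, and the `fake_score` stand-in for a model call are all hypothetical. It shows items being scored in waves of `concurrency`, progress being written after each wave, and a second run skipping whatever the checkpoint already holds.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def triage_in_waves(items, score, checkpoint, concurrency=5):
    """Score items in waves of `concurrency`, saving progress after each wave."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    pending = [item for item in items if item["id"] not in done]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for start in range(0, len(pending), concurrency):
            wave = pending[start:start + concurrency]
            for item, relevance in zip(wave, pool.map(score, wave)):
                done[item["id"]] = relevance
            checkpoint.write_text(json.dumps(done))  # one checkpoint per wave
    return done

calls = []

def fake_score(item):
    """Hypothetical stand-in for a model call; records which items were scored."""
    calls.append(item["id"])
    return min(10, len(item["text"]) // 10)

items = [{"id": f"m{i}", "text": "x" * (i * 10)} for i in range(1, 8)]
checkpoint = Path(tempfile.mkdtemp()) / "triage.checkpoint.json"

first = triage_in_waves(items[:5], fake_score, checkpoint)  # simulates an interrupted run
second = triage_in_waves(items, fake_score, checkpoint)     # resumes: only m6 and m7 are scored
print(len(calls), sorted(second))
```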

7. Deep Analysis (on triaged results)

Use --deep-only to skip triage and run deep analysis on already-triaged data. Start with --min-relevance 5 to avoid missing borderline evidence:

uv run python analyze.py evidence/triaged.json \
  --deep-only \
  --min-relevance 5 \
  --context "Build a chronological timeline of key decisions, who was involved, and what information was available at each decision point" \
  --model deepseek \
  --dry-run

Review the cost estimate, then run without --dry-run. Tighten to --min-relevance 7 if output is noisy (no re-triage needed).
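
Tightening the threshold needs no re-triage because the scores are already stored with each record; only the filter changes. A minimal sketch, assuming each triaged record has an integer `relevance` field and that the threshold is inclusive (both are assumptions about the file format, not confirmed behaviour):

```python
def filter_by_relevance(triaged, min_relevance):
    """Keep records whose stored relevance score meets the threshold."""
    return [rec for rec in triaged if rec["relevance"] >= min_relevance]

# Hypothetical triaged records.
triaged = [
    {"id": "m1", "relevance": 9},
    {"id": "m2", "relevance": 5},
    {"id": "m3", "relevance": 6},
    {"id": "m4", "relevance": 3},
]

broad = filter_by_relevance(triaged, 5)  # first pass: keeps borderline items
tight = filter_by_relevance(triaged, 7)  # noisy output: raise the bar, same scores
print(len(broad), len(tight))
```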

8. Export Timeline

uv run python export.py analysis_output.md --output investigation-timeline.md
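
For a sense of what a chronological timeline involves, the sketch below sorts events by date and groups them under one heading per day. It is an illustration rather than export.py's logic; the `date`, `actor`, and `summary` fields and the sample events are hypothetical.

```python
from itertools import groupby

def render_timeline(events):
    """Render events as text, one heading per day, in chronological order."""
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    ordered = sorted(events, key=lambda e: e["date"])
    lines = []
    for day, group in groupby(ordered, key=lambda e: e["date"]):
        lines.append(f"## {day}")
        for event in group:
            lines.append(f"- {event['actor']}: {event['summary']}")
    return "\n".join(lines)

# Hypothetical events extracted from an analysis.
events = [
    {"date": "2025-03-10", "actor": "Person A", "summary": "approved the change"},
    {"date": "2025-02-14", "actor": "Person B", "summary": "raised a concern"},
    {"date": "2025-03-10", "actor": "Person C", "summary": "was informed of the approval"},
]

timeline = render_timeline(events)
print(timeline)
```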

Tips

Always triage separately with --triage --output so results are saved
Start with --min-relevance 5 to avoid missing borderline evidence
Use --date-range to narrow the investigation window
Combine --sender filters with --semantic to find specific topics discussed by specific people
Meeting notes often contain the clearest record of decisions — filter with --source-type meeting_note
For maximum privacy during investigation, use --local to keep all analysis on-machine
Use gemini-flash for triage (cheapest, fastest), deepseek for deep analysis (best reasoning)
Use --truncate 500 and --concurrency 5 for fast triage
Use --retry-failed to re-triage failed batches without re-running everything
Use --read N to inspect individual results in full before exporting
Triage checkpoints save progress — re-run to resume if interrupted
