Retrieval-augmented system for mapping unstructured input to a controlled vocabulary (demo: image tags).
Typical runtime: up to ~20 seconds.

Prompt Squirrel RAG: System Overview

This document explains what Prompt Squirrel does, why it is structured this way, and how data moves through the system.

Purpose

Prompt Squirrel converts a rough natural-language prompt into a structured, editable tag list from a fixed image-tag vocabulary, then lets the user refine that list interactively.

Design goals:

  • Keep generation grounded in a closed tag vocabulary.
  • Balance recall (find good candidates) with precision (avoid bad tags).
  • Keep UI editable so users remain in control.
  • Run reliably in a Hugging Face Space with constrained resources.

Architecture At A Glance

Architecture diagram

What Each Step Does

  • Rewrite: Turns the user prompt into short, tag-like pseudo-phrases that are easier to match in vector retrieval; these phrases serve as the search queries for candidate lookup.
  • Structural Inference: Runs an LLM call over a fixed set of high-level structure tags (for example character count, body type, gender, clothing state, gaze/text). It outputs only the structural tags it believes are supported.
  • Probe Inference: Runs a separate LLM call over a small, curated set of informative tags. This is a targeted check for tags that are often useful for reranking and final selection.
  • Retrieval Candidates: Uses the rewrite phrases (plus structural/probe context) to fetch candidate tags from the fixed vocabulary, prioritizing recall.
  • Closed-Set Selection: Runs an LLM call that can only choose from the retrieved candidate list. It cannot invent new tags.
  • Implication Expansion: Adds parent/related tags implied by selected tags according to the implication graph.
  • Ranked Rows: Groups and orders suggested tags into row categories for editing.
  • Toggle UI and Suggested Prompt: Lets the user turn tags on/off and see the resulting prompt text update immediately.
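The steps above can be sketched as one end-to-end flow. This is a toy, self-contained illustration of the data movement only: every function body here is a trivial stand-in (substring matching instead of embedding retrieval, string lookup instead of LLM calls), and none of the names are Prompt Squirrel's actual API.

```python
VOCAB = {"red fox", "fox", "forest", "running", "canine"}   # closed vocabulary
IMPLICATIONS = {"red fox": ["fox", "canine"]}               # implication graph

def rewrite_to_phrases(prompt):
    # Stand-in for the LLM rewrite step: short, tag-like pseudo-phrases.
    return [p.strip() for p in prompt.lower().split(",")]

def retrieve_candidates(phrases):
    # Stand-in for recall-first retrieval from the closed vocabulary.
    return {tag for tag in VOCAB for p in phrases if p in tag or tag in p}

def closed_set_select(prompt, candidates):
    # Stand-in for the constrained LLM call: it may only keep retrieved tags,
    # never invent new ones.
    return {tag for tag in candidates if tag in prompt.lower()}

def expand_implications(selected):
    # Add tags directly implied by selected tags via the implication graph.
    expanded = set(selected)
    for tag in selected:
        expanded.update(IMPLICATIONS.get(tag, []))
    return expanded

prompt = "a red fox, running"
tags = expand_implications(
    closed_set_select(prompt, retrieve_candidates(rewrite_to_phrases(prompt)))
)
print(sorted(tags))  # ['canine', 'fox', 'red fox', 'running']
```

Note how "canine" appears in the output even though it is never mentioned in the prompt: it enters only through implication expansion, exactly as in the real pipeline.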

Design Rationale

  • Rewrite and retrieval are separate so search phrase generation stays flexible while candidate generation stays deterministic.
  • Retrieval and closed-set selection are separate to keep high recall first, then apply higher-precision filtering.
  • Structural and probe inference run in parallel with rewrite so they can add context without adding much latency.
  • Users control the final prompt by toggling suggested tags on/off; the prompt text is generated from those toggle states.
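The "run in parallel" point above can be made concrete with a thread pool, since the three LLM calls are independent network requests. The call functions below are hypothetical stand-ins (a `sleep` models LLM latency); the point is that total latency tracks the slowest call, not the sum.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def rewrite(prompt):
    time.sleep(0.2)          # stand-in for one LLM round trip
    return ["red fox", "forest"]

def structural(prompt):
    time.sleep(0.2)
    return ["solo"]

def probe(prompt):
    time.sleep(0.2)
    return ["outside"]

prompt = "a red fox in a forest"
start = time.monotonic()
with ThreadPoolExecutor(max_workers=3) as pool:
    f_rw, f_st, f_pr = (pool.submit(fn, prompt) for fn in (rewrite, structural, probe))
    phrases, structural_tags, probe_tags = f_rw.result(), f_st.result(), f_pr.result()
elapsed = time.monotonic() - start
# elapsed is ~0.2 s (the slowest call), not ~0.6 s (the sum of all three)
```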

Generalization to Other Domains

Prompt Squirrel RAG is implemented here for image-tag prompt construction, but the system design applies more broadly to domains where unstructured input must be mapped to a fixed vocabulary, taxonomy, or controlled label set.

The reusable pattern is:

  1. maintain a closed vocabulary with aliases, metadata, and optional definitions
  2. retrieve candidate terms using lexical and embedding-based search
  3. use structured metadata and retrieved context to rank, group, or filter candidates
  4. use constrained LLM selection so the model can choose only from valid retrieved terms
  5. keep the final output editable by the user
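Step 4 of the pattern deserves emphasis: even when the prompt instructs the model to choose only from the candidate list, a robust implementation validates the model's output against that list and silently drops anything invented. A minimal sketch, where `raw_model_output` stands in for an LLM response:

```python
def constrained_select(candidates, raw_model_output):
    """Keep only model-chosen terms that exist in the retrieved candidate set."""
    allowed = {c.lower() for c in candidates}
    chosen = [t.strip().lower() for t in raw_model_output.split(",")]
    return [t for t in chosen if t in allowed]

candidates = ["red fox", "forest", "running"]
# The model "hallucinates" a term that was never retrieved:
raw = "red fox, forest, majestic vista"
print(constrained_select(candidates, raw))  # ['red fox', 'forest']
```

This is what makes the output a closed-set selection rather than free generation: the vocabulary, not the model, has the final say on which terms are valid.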

Examples of analogous domains include:

  • developer tools: mapping programming questions to Stack Overflow-style tags
  • healthcare administration: mapping notes or claim descriptions to candidate diagnosis or billing codes
  • e-commerce: mapping product descriptions or search queries to catalog categories and attributes
  • enterprise search: mapping user requests to internal taxonomy labels or document categories

The current Space does not claim to solve these other domains directly. Porting the system would require replacing the tag vocabulary, aliases, metadata, co-occurrence statistics, implication rules, evaluation data, and UI labels. The transferable part is the retrieval-plus-constrained-selection architecture.

Data Inputs (Broad)

  • Tag vocabulary and alias mappings
  • Tag counts (frequency)
  • Tag implications graph
  • Group/category mappings for row display
  • Optional wiki definitions (used for hover help)
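Of these inputs, the implications graph is the one with nontrivial runtime behavior: implications can chain, so expansion should be transitive. A sketch, assuming the graph is a mapping from each tag to its directly implied tags (the actual file format is not shown in this document):

```python
def expand_implications(selected, graph):
    """Transitively add every tag implied by a selected tag."""
    result, stack = set(selected), list(selected)
    while stack:
        for implied in graph.get(stack.pop(), []):
            if implied not in result:   # guard against cycles and repeats
                result.add(implied)
                stack.append(implied)
    return result

graph = {"red fox": ["fox"], "fox": ["canine"], "canine": ["mammal"]}
print(sorted(expand_implications({"red fox"}, graph)))
# ['canine', 'fox', 'mammal', 'red fox'] -- the full implication chain
```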

Technologies Used

  • FastText embeddings for semantic tag retrieval.
  • HNSW approximate nearest-neighbor indexes for efficient retrieval at runtime.
  • Reduced TF-IDF vectors for context-aware ranking and row scoring.
  • OpenRouter-served instruction LLMs for rewrite, structural inference, probe inference, and closed-set selection. Default model: mistralai/mistral-small-24b-instruct-2501, chosen empirically from internal caption-evident test-set comparisons (with model choice remaining configurable).
  • Gradio for the interactive web UI (tag toggles, ranked rows, and suggested prompt text).
  • Python pipeline orchestration with CSV/JSON data sources and implication-graph expansion.
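The retrieval step can be illustrated without the real dependencies. Below, small hand-made vectors stand in for FastText embeddings, and brute-force cosine similarity stands in for the HNSW index; HNSW approximates this exact ranking efficiently at vocabulary scale.

```python
import math

# Hand-made 3-d vectors standing in for FastText tag embeddings.
tag_vecs = {
    "red fox": [1.0, 0.0, 0.2],
    "forest":  [0.0, 1.0, 0.1],
    "running": [0.2, 0.1, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query_vec, k=2):
    # Brute-force nearest neighbors; an HNSW index gives the same top results
    # without scoring every tag.
    ranked = sorted(tag_vecs, key=lambda t: cosine(query_vec, tag_vecs[t]), reverse=True)
    return ranked[:k]

print(top_k([0.9, 0.1, 0.1]))  # ['red fox', 'running']
```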

Evaluation (Broad)

The current evaluation compares the pipeline's selected tags against ground-truth tags on caption-evident samples.

Primary metrics:

  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)
  • F1: harmonic mean of precision and recall, 2 × precision × recall / (precision + recall)
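A worked example of these metrics on one hypothetical sample, comparing a selected tag set against its ground truth:

```python
predicted = {"red fox", "forest", "running", "night"}
truth = {"red fox", "forest", "running", "canine"}

tp = len(predicted & truth)   # 3 correct tags
fp = len(predicted - truth)   # 1 wrong tag ("night")
fn = len(truth - predicted)   # 1 missed tag ("canine")

precision = tp / (tp + fp)    # 3/4 = 0.75
recall = tp / (tp + fn)       # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75
```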

The evaluation focus is practical:

  • Is the returned tag set useful and mostly correct?
  • Does it miss important prompt-evident tags?
  • Does UI ranking surface likely-correct tags early?

Evaluation Dataset Snapshot

  • File: data/eval_samples/e621_sfw_sample_1000_seed123_buffer10000_caption_evident_n30.jsonl
  • Construction: a manually curated caption-evident subset, in which ground-truth tags are intended to be directly supported by the caption text.
  • Size: 30 images
  • Total ground-truth tag assignments: 440
  • Unique tags represented: 205
  • Average tags per image: 14.67
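The snapshot figures are internally consistent, which is worth checking whenever the dataset file is regenerated:

```python
total_assignments = 440
num_images = 30
avg = total_assignments / num_images
print(round(avg, 2))  # 14.67, matching the stated average tags per image
```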