Prompt Squirrel RAG: System Overview

This document explains what Prompt Squirrel does, why it is structured this way, and how data moves through the system.

Purpose

Prompt Squirrel converts a rough natural-language prompt into a structured, editable tag list from a fixed image-tag vocabulary, then lets the user refine that list interactively.

Design goals:

  • Keep generation grounded in a closed tag vocabulary.
  • Balance recall (find good candidates) with precision (avoid bad tags).
  • Keep the UI editable so users remain in control.
  • Run reliably in a Hugging Face Space with constrained resources.

Architecture At A Glance

[Architecture diagram]

What Each Step Does

  • Rewrite: Turns the user prompt into short, tag-like pseudo-phrases that are easier to match in vector retrieval. These phrases are optimized as search queries for candidate lookup.
  • Structural Inference: Runs an LLM call over a fixed set of high-level structure tags (for example character count, body type, gender, clothing state, gaze/text). It outputs only the structural tags it believes are supported.
  • Probe Inference: Runs a separate LLM call over a small, curated set of informative tags. This is a targeted check for tags that are often useful for reranking and final selection.
  • Retrieval Candidates: Uses the rewrite phrases (plus structural/probe context) to fetch candidate tags from the fixed vocabulary, prioritizing recall.
  • Closed-Set Selection: Runs an LLM call that can only choose from the retrieved candidate list. It cannot invent new tags.
  • Implication Expansion: Adds parent/related tags implied by selected tags according to the implication graph.
  • Ranked Rows: Groups and orders suggested tags into row categories for editing.
  • Toggle UI and Suggested Prompt: Lets the user turn tags on/off and see the resulting prompt text update immediately.
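The flow above can be sketched end to end. Every function and data value here is an illustrative stand-in (the real steps use LLM calls and vector retrieval), not the actual Prompt Squirrel code:

```python
# Minimal sketch of the pipeline steps; all names and data are
# illustrative stand-ins, not the actual implementation.

def rewrite(prompt: str) -> list[str]:
    # Stand-in for the LLM rewrite step: split into short pseudo-phrases.
    return [p.strip() for p in prompt.split(",") if p.strip()]

def retrieve_candidates(phrases, vocabulary):
    # Recall-oriented lookup: keep any vocabulary tag sharing a word
    # with a rewrite phrase (the real system uses vector retrieval).
    words = {w for p in phrases for w in p.lower().split()}
    return [t for t in vocabulary if words & set(t.split("_"))]

def closed_set_select(candidates):
    # Stand-in for the closed-set LLM call: it may only keep
    # candidates from the list; it never invents new tags.
    return candidates

def expand_implications(tags, implications):
    # Add parent tags implied by selected tags, following the graph.
    result, stack = set(tags), list(tags)
    while stack:
        for parent in implications.get(stack.pop(), []):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result

vocabulary = ["red_scarf", "scarf", "forest", "clothing"]
implications = {"red_scarf": ["scarf"], "scarf": ["clothing"]}

phrases = rewrite("red scarf, forest background")
selected = closed_set_select(retrieve_candidates(phrases, vocabulary))
final = expand_implications(selected, implications)
# "clothing" enters only via implication expansion
```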

Design Rationale

  • Rewrite and retrieval are separate so search phrase generation stays flexible while candidate generation stays deterministic.
  • Retrieval and closed-set selection are separate to keep high recall first, then apply higher-precision filtering.
  • Structural and probe inference run in parallel with rewrite so they can add context without adding much latency.
  • Users control the final prompt by toggling suggested tags on/off; the prompt text is generated from those toggle states.
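The last point can be illustrated with a minimal sketch; the function name and data shape are assumptions, not the actual UI code:

```python
# Sketch: the suggested prompt is derived purely from toggle states.
def suggested_prompt(rows):
    # rows: ordered (tag, enabled) pairs from the toggle UI.
    return ", ".join(tag for tag, enabled in rows if enabled)

rows = [("anthro", True), ("red_scarf", True), ("forest", False)]
assert suggested_prompt(rows) == "anthro, red_scarf"
```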

Data Inputs (Broad)

  • Tag vocabulary and alias mappings
  • Tag counts (frequency)
  • Tag implications graph
  • Group/category mappings for row display
  • Optional wiki definitions (used for hover help)

Technologies Used

  • FastText embeddings for semantic tag retrieval.
  • HNSW approximate nearest-neighbor indexes for efficient retrieval at runtime.
  • Reduced TF-IDF vectors for context-aware ranking and row scoring.
  • OpenRouter-served instruction LLMs for rewrite, structural inference, probe inference, and closed-set selection. Default model: mistralai/mistral-small-24b-instruct-2501, chosen empirically via internal comparisons on a caption-evident test set; the model choice remains configurable.
  • Gradio for the interactive web UI (tag toggles, ranked rows, and suggested prompt text).
  • Python pipeline orchestration with CSV/JSON data sources and implication-graph expansion.
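As a rough illustration of the retrieval layer, the sketch below scores tags by cosine similarity over embeddings. Random vectors stand in for FastText embeddings, and a brute-force scan stands in for the HNSW index, which returns approximately the same neighbors in sublinear time:

```python
# Sketch of semantic candidate retrieval via cosine similarity.
# Random vectors stand in for FastText tag embeddings; a brute-force
# scan stands in for the HNSW approximate nearest-neighbor index.
import numpy as np

rng = np.random.default_rng(0)
dim = 32
tags = ["scarf", "forest", "smile", "night_sky"]
emb = rng.standard_normal((len(tags), dim)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize

def top_k(query_vec, k=2):
    # Rank all tags by cosine similarity to the query embedding.
    q = query_vec / np.linalg.norm(query_vec)
    sims = emb @ q
    order = np.argsort(-sims)[:k]
    return [(tags[i], float(sims[i])) for i in order]

# Querying with a tag's own embedding returns that tag first.
assert top_k(emb[1])[0][0] == "forest"
```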

Evaluation (Broad)

The current evaluation compares selected tags against ground-truth tags on caption-evident samples.

Primary metrics:

  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)
  • F1: harmonic mean of precision and recall
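These set-based metrics can be written directly (a sketch; per-image tag sets are assumed):

```python
# Precision, recall, and F1 over predicted vs. ground-truth tag sets.
def prf1(predicted: set, truth: set):
    tp = len(predicted & truth)   # correct tags
    fp = len(predicted - truth)   # spurious tags
    fn = len(truth - predicted)   # missed tags
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = prf1({"scarf", "forest", "smile"}, {"scarf", "forest", "tail"})
# tp=2, fp=1, fn=1, so all three metrics equal 2/3 here
```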

The evaluation focus is practical:

  • Is the returned tag set useful and mostly correct?
  • Does it miss important prompt-evident tags?
  • Does UI ranking surface likely-correct tags early?

Evaluation Dataset Snapshot

  • File: data/eval_samples/e621_sfw_sample_1000_seed123_buffer10000_caption_evident_n30.jsonl
  • Construction: manually curated caption-evident subset, where ground-truth tags are intended to be directly supported by the caption text.
  • Size: 30 images
  • Total ground-truth tag assignments: 440
  • Unique tags represented: 205
  • Average tags per image: 14.67
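A snapshot like this could be computed from the JSONL file roughly as follows; the "tags" field name and record shape are assumptions about the file format:

```python
# Sketch: compute dataset statistics from a JSONL of tag lists.
# The "tags" field name is an assumed record shape, not the real schema.
import json, io

sample = "\n".join(json.dumps({"tags": t}) for t in
                   [["scarf", "forest"], ["scarf"], ["smile", "tail", "forest"]])

assignments, tag_set, n = 0, set(), 0
for line in io.StringIO(sample):
    tags = json.loads(line)["tags"]
    assignments += len(tags)   # total ground-truth tag assignments
    tag_set.update(tags)       # unique tags represented
    n += 1                     # image count

avg = assignments / n  # for the snapshot above: 440 / 30 = 14.67
```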