Stirling-PDF

mirror of https://github.com/Frooodle/Stirling-PDF.git synced 2026-05-01 23:16:31 +02:00

Commit Graph

Author	SHA1	Message	Date
James Brunton	5541dd666c	Flesh out RAG system (#6197)	2026-05-01 14:11:54 +01:00
Anthony Stirling	f779085d75	setup RAG (#6146)	2026-04-21 12:42:33 +01:00


James Brunton

5541dd666c

Flesh out RAG system (#6197)

# Description of Changes
Flesh out the RAG system and connect it to the PDF Question Agent so it
can answer questions about extremely large PDFs.

I expect the RAG system will need a lot more work before it is really
what we need, but this should be a reasonable start: it lets us connect
RAG to tools and handles ingestion mostly automatically. I'm leaving file
deletion and proper file ID management for a future PR. We also need to
decide whether all tools should retrieve content exclusively via RAG, or
whether it's beneficial for tools to sometimes fetch the content directly
and other times fetch it from RAG.

A diagram of the expected interaction is as follows:

```mermaid
sequenceDiagram
    autonumber
    actor U as User
    participant FE as Frontend<br/>(ChatPanel)
    participant J as Java<br/>(AiWorkflowService)
    participant O as Engine:<br/>OrchestratorAgent
    participant QA as Engine:<br/>PdfQuestionAgent
    participant RAG as Engine:<br/>RagService + SqliteVecStore
    participant V as VoyageAI<br/>(embeddings)
    participant L as LLM<br/>(Claude / etc.)

    U->>FE: types "Summarise this PDF"<br/>(PDF already uploaded)
    FE->>J: POST /api/v1/ai/orchestrate/stream<br/>multipart: fileInputs[], userMessage
    Note over J: ByteHashFileIdStrategy<br/>id = sha256(bytes)[:16]
    J->>O: POST /api/v1/orchestrator<br/>{ files:[{id,name}], userMessage }

    O->>L: route via fast model
    L-->>O: delegate_pdf_question
    O->>QA: PdfQuestionRequest

    loop for each file
        QA->>RAG: has_collection(file.id)
        RAG-->>QA: false
    end
    QA-->>O: NeedIngestResponse(files_to_ingest)
    O-->>J: { outcome:"need_ingest", filesToIngest:[...] }

    Note over J: onNeedIngest
    loop per file
        J->>J: PDFBox: extract page text
        J->>O: POST /api/v1/rag/documents<br/>(long-running timeout)
        O->>RAG: chunk + stage documents
        O->>V: embed_documents (batches of 256)
        V-->>O: embeddings
        O->>RAG: add_documents
        O-->>J: { chunks_indexed: N }
    end

    Note over J: retry with resumeWith=pdf_question
    J->>O: POST /api/v1/orchestrator
    Note over O: fast-path to PdfQuestionAgent

    O->>QA: PdfQuestionRequest
    Note over QA: build RagCapability<br/>pinned to file IDs
    QA->>L: run(prompt) with search_knowledge tool

    loop up to max_searches
        L->>QA: search_knowledge(query)
        QA->>V: embed_query
        V-->>QA: query vector
        QA->>RAG: search(vector, collections=[file.id])
        RAG-->>QA: top-k chunks
        QA-->>L: formatted chunks
    end

    Note over QA: once budget spent,<br/>prepare() hides the tool
    L-->>QA: PdfQuestionAnswerResponse
    QA-->>O: answer
    O-->>J: { outcome:"answer", answer, evidence }
    J-->>FE: SSE "result"
    FE->>U: assistant bubble
```
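
The ingest-then-retry handshake in the diagram can be sketched in a few lines of Python. This is an illustration only, not the code from this PR: `InMemoryStore`, `ask`, `ingest` and `orchestrate` are hypothetical names, the store is a plain dict rather than a vector store, no embeddings are computed, and text lines stand in for extracted PDF pages. It also assumes that `sha256(bytes)[:16]` means the first 16 hex characters of the digest, which the diagram does not state.

```python
import hashlib
from typing import Iterator, Sequence

BATCH_SIZE = 256  # batch size named in the diagram


def file_id(data: bytes) -> str:
    """Content-derived ID. Assumes [:16] means the first 16 hex characters."""
    return hashlib.sha256(data).hexdigest()[:16]


def batches(items: Sequence, size: int = BATCH_SIZE) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


class InMemoryStore:
    """Hypothetical stand-in for the vector store: collection ID -> chunks."""

    def __init__(self) -> None:
        self.collections: dict[str, list[str]] = {}

    def has_collection(self, collection_id: str) -> bool:
        return collection_id in self.collections

    def add_documents(self, collection_id: str, chunks: Sequence[str]) -> int:
        self.collections.setdefault(collection_id, []).extend(chunks)
        return len(chunks)


def ask(store: InMemoryStore, ids: Sequence[str]) -> dict:
    """Return need_ingest if any file is unindexed, else a placeholder answer."""
    missing = [i for i in ids if not store.has_collection(i)]
    if missing:
        return {"outcome": "need_ingest", "filesToIngest": missing}
    return {"outcome": "answer", "answer": "(placeholder)"}


def ingest(store: InMemoryStore, collection_id: str, chunks: Sequence[str]) -> int:
    """Store chunks batch by batch and return how many were indexed."""
    indexed = 0
    for batch in batches(chunks):
        # A real engine would embed each batch here before storing it.
        indexed += store.add_documents(collection_id, batch)
    return indexed


def orchestrate(store: InMemoryStore, files: dict[str, bytes]) -> dict:
    """Caller side: ask, ingest whatever is missing, then retry once."""
    by_id = {file_id(data): data for data in files.values()}
    response = ask(store, list(by_id))
    if response["outcome"] == "need_ingest":
        for fid in response["filesToIngest"]:
            # Stand-in for page-text extraction: one chunk per line.
            chunks = by_id[fid].decode("utf-8").splitlines()
            ingest(store, fid, chunks)
        response = ask(store, list(by_id))
    return response
```

Because the ID is derived from the file bytes, re-sending an unchanged file in this sketch finds its existing collection and skips ingestion on the second request.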

2026-05-01 14:11:54 +01:00

Anthony Stirling

f779085d75

setup RAG (#6146)

2026-04-21 12:42:33 +01:00

2 Commits