End-to-end data flow from IBM FileNet documents to banker chat responses
| From | To | Protocol / API | What is exchanged | When |
|---|---|---|---|---|
| Python Ingestor | 🔵 AWS S3 | HTTPS · boto3 GetObject, HeadObject | PDF binary bytes + S3 object metadata tags | Ingestion batch / incremental |
| Python Ingestor | 🔵 AWS Textract | HTTPS · boto3 detect_document_text() | PDF pages → extracted text blocks | Ingestion (scanned PDFs only) |
| Python Ingestor | 🔵 Bedrock Titan | HTTPS · bedrock-runtime.invoke_model() | Text chunk (≤8K tokens) → 256-dim float array | Ingestion, once per chunk |
| Python Ingestor | 🟣 ChromaDB | Local gRPC · collection.upsert() | Chunk text + 256-dim embedding + metadata dict (USD amounts) | Ingestion, dual-write step 5a, once per chunk |
| Python Ingestor | 🟢 ClickHouse Cloud | HTTPS / Native · clickhouse_connect.insert() | Full metadata row: 32 columns, USD Float64 amounts, typed Dates; written in the same ingest.py run as ChromaDB | Ingestion, dual-write step 5b, once per doc · ReplacingMergeTree deduplicates |
| Streamlit UI | 🟣 LangChain RAG Chain | In-process · Python function call | Banker query string → result dict (answer + sources) | Every question |
| LangChain RAG Chain | 🔵 Bedrock Nova Lite (NL→SQL) | HTTPS · bedrock-runtime.invoke_model() | Banker question + schema → valid ClickHouse SQL query | Aggregation questions only (counts, trends, breakdowns) |
| LangChain RAG Chain | 🟢 ClickHouse Cloud | HTTPS / Native · clickhouse_connect.query() | Generated SQL → result rows (full dataset, no row limit) | Aggregation questions only |
| LangChain RAG Chain | 🔵 Bedrock Nova Lite (format) | HTTPS · bedrock-runtime.invoke_model() | Raw ClickHouse result rows → formatted markdown answer + insights | Aggregation questions only |
| LangChain RAG Chain | 🔵 Bedrock Titan | HTTPS · bedrock-runtime.invoke_model() | Query text → 256-dim embedding vector | Content questions only |
| LangChain RAG Chain | 🟣 ChromaDB | Local gRPC · collection.query() | Query vector + metadata filters → top-20 chunks + scores | Content questions only |
| LangChain RAG Chain | 🔵 Bedrock Nova Lite (RAG) | HTTPS · bedrock-runtime.invoke_model() | System prompt + retrieved chunks + question → answer text with doc citations | Content questions only |
| Streamlit UI | 👤 Banker Browser | HTTP / WebSocket · Streamlit server | Rendered chat UI, answer text, source cards, S3 PDF links | Continuous session |
| Banker Browser | 🔵 AWS S3 | HTTPS · Pre-signed URL | PDF binary → rendered in browser tab | When banker clicks "View PDF" |
| IBM FileNet (Prod) | 🐍 Python Ingestor | REST / CMIS · FileNet Content Engine REST API | PDF binary + FileNet object properties | Production (replaces the S3 boto3 calls) |
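
The query-time rows above split into two paths: aggregation questions go NL→SQL, then ClickHouse, then a formatting call, while content questions go embedding, then ChromaDB top-20 retrieval, then a RAG call. The sketch below illustrates that routing under stated assumptions. The table does not say how a question is classified, so the keyword pattern here is a hypothetical stand-in, and `Backends`, `classify`, `answer`, `make_stub_backends`, and the `documents` table name are illustrative names rather than the project's actual code. The service calls are injected as plain callables so the flow can be run without AWS, ChromaDB, or ClickHouse credentials.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence

# Hypothetical classifier: word-boundary match so "account" does not match "count".
AGGREGATION_PATTERN = re.compile(
    r"\b(?:how many|count|total|average|trend|breakdown)s?\b"
)
TOP_K = 20  # the table lists "top-20 chunks" for the ChromaDB query


@dataclass
class Backends:
    """Injected callables standing in for the services named in the table."""
    nl_to_sql: Callable[[str], str]                  # Bedrock Nova Lite (NL→SQL)
    run_sql: Callable[[str], List[tuple]]            # ClickHouse Cloud
    format_rows: Callable[[str, List[tuple]], str]   # Bedrock Nova Lite (format)
    embed: Callable[[str], Sequence[float]]          # Bedrock Titan
    search: Callable[[Sequence[float], int], List[Dict[str, Any]]]  # ChromaDB
    rag_answer: Callable[[str, List[Dict[str, Any]]], str]  # Bedrock Nova Lite (RAG)


def classify(question: str) -> str:
    """Return "aggregation" or "content" using the hypothetical keyword pattern."""
    return "aggregation" if AGGREGATION_PATTERN.search(question.lower()) else "content"


def answer(question: str, b: Backends) -> Dict[str, Any]:
    """Run one banker question through the path that matches its type."""
    if classify(question) == "aggregation":
        sql = b.nl_to_sql(question)   # question + schema -> SQL
        rows = b.run_sql(sql)         # SQL -> result rows
        return {
            "route": "aggregation",
            "answer": b.format_rows(question, rows),
            "sources": [],
            "sql": sql,
        }
    vector = b.embed(question)        # query text -> embedding vector
    chunks = b.search(vector, TOP_K)  # vector -> top-k chunks
    return {
        "route": "content",
        "answer": b.rag_answer(question, chunks),
        "sources": chunks,
    }


def make_stub_backends() -> Backends:
    """Stub services for a dry run; every return value here is made up."""
    return Backends(
        nl_to_sql=lambda q: "SELECT count() FROM documents",  # hypothetical table
        run_sql=lambda sql: [(42,)],
        format_rows=lambda q, rows: f"Result: {rows[0][0]}",
        embed=lambda text: [0.0] * 256,
        search=lambda vec, k: [{"doc_id": "doc-1", "text": "stub chunk"}][:k],
        rag_answer=lambda q, chunks: f"Answer based on {len(chunks)} chunk(s)",
    )
```

With the stubs, `answer("How many documents were ingested last month?", make_stub_backends())` takes the aggregation path, and a question such as "What does the loan agreement say about collateral?" takes the content path. In a real deployment each callable would wrap the corresponding client call from the table, and the classifier could be replaced by whatever mechanism the chain actually uses.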