Knowledge Management Architecture: Building Enterprise RAG Systems for AI-First Organizations
The Knowledge Bottleneck
Every AI-first organisation eventually confronts the same challenge: its AI systems are only as good as the knowledge they can access.
Large language models arrive with impressive general capabilities, but they know nothing about your organisation’s policies, processes, products, or people. They hallucinate confidently when asked about internal matters. They can’t answer the questions that actually matter to employees and customers—the specific, contextual questions that require organisational knowledge.
Retrieval-Augmented Generation (RAG) addresses this by grounding AI responses in your organisation’s actual documents. The concept is straightforward: retrieve relevant context from your knowledge base, inject it into the AI prompt, generate responses based on that context.

Implementation is considerably harder.
Enterprise RAG systems fail not because the technology doesn’t work, but because knowledge management at enterprise scale involves challenges that simple prototypes don’t reveal. Document sprawl across dozens of systems. Content that contradicts itself across versions. Access controls that vary by document, section, and user. Quality that ranges from carefully edited policies to hastily written email threads.
This post examines the architectural patterns that separate production RAG systems from prototypes that never leave the lab.
Architecture Overview
A production RAG system comprises five major subsystems:
[Content Sources]    [Ingestion Pipeline]    [Vector Store]
        |                     |                    |
        v                     v                    v
   Confluence             Processing           Embeddings
   SharePoint             Chunking             Metadata
   Documents              Enrichment           Indices
   Databases                  |                    |
        |                     |                    |
        +---------------------+--------------------+
                              |
                              v
                     [Query Processing]
                              |
                              v
                    [Response Generation]
Each subsystem presents distinct challenges at enterprise scale.
Content Ingestion Architecture
The Source Integration Problem
Enterprise knowledge lives everywhere:
- Document management (SharePoint, Google Drive, Box)
- Wikis and collaboration (Confluence, Notion, Teams)
- Ticketing systems (ServiceNow, Jira, Zendesk)
- Databases and data warehouses
- Email archives
- Chat history
- Code repositories
Each source has different:
- APIs and authentication mechanisms
- Content formats and structures
- Update patterns and change detection
- Access control models
Connector Architecture
Build a modular connector framework:
[Connector Interface]
|
+---> SharePointConnector
+---> ConfluenceConnector
+---> DatabaseConnector
+---> FileSystemConnector
Each connector implements:
from datetime import datetime
from typing import Iterator, Protocol

# ContentItem, ContentDocument, and PermissionSet are your own data classes
class ContentConnector(Protocol):
    def list_items(self, since: datetime) -> Iterator[ContentItem]:
        """List items modified since timestamp."""
        ...

    def get_content(self, item_id: str) -> ContentDocument:
        """Retrieve full content for item."""
        ...

    def get_permissions(self, item_id: str) -> PermissionSet:
        """Retrieve access permissions."""
        ...

    def get_metadata(self, item_id: str) -> dict:
        """Retrieve item metadata."""
        ...
Incremental Processing
Full reprocessing of all content is impractical at scale. Implement change detection:
Timestamp-Based: Track last successful sync per source; query for changes since that timestamp.
Content Hashing: Hash document content; reprocess only when the hash changes (sketched below).
Event-Driven: Subscribe to source system webhooks for real-time updates.
Hybrid Approach: Combine event-driven for high-priority sources with scheduled incremental scans.
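To make the hashing approach concrete, here is a minimal sketch. It assumes the ContentConnector protocol above, a dict-like hash_store mapping item IDs to digests, and a process_document() helper that performs chunking, embedding, and indexing; the item.id and doc.text attributes are assumptions about your content types.

import hashlib

def sync_source(connector, hash_store, since):
    """Reprocess only items whose content has changed since the last sync."""
    for item in connector.list_items(since=since):
        doc = connector.get_content(item.id)  # item.id assumed on ContentItem
        digest = hashlib.sha256(doc.text.encode("utf-8")).hexdigest()  # doc.text assumed
        if hash_store.get(item.id) == digest:
            continue  # content unchanged; skip reprocessing
        process_document(doc)         # chunk, embed, index (assumed helper)
        hash_store[item.id] = digest  # record the new digest for the next sync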
Content Extraction
Raw documents require preprocessing:
Text Extraction:
- PDF: Extract with layout awareness (tables, headers, columns)
- Office Documents: Handle embedded objects and formatting
- HTML: Clean markup while preserving structure
- Images: OCR where text is embedded in images
Metadata Extraction:
- Document title, author, creation date
- Classification and tags from source system
- Source URL and identifiers
- Relationship to other documents
Quality Signals:
- Document age and freshness
- Edit frequency
- Author authority
- Explicit ratings or endorsements
Chunking Strategy
How you split documents into chunks determines retrieval quality more than any other factor.
The Chunking Trade-off
Small chunks (256-512 tokens):
- More precise retrieval
- Better embedding quality
- May lose context spanning chunk boundaries
- More chunks to manage
Large chunks (1024-2048 tokens):
- Preserve more context
- Fewer chunks to manage
- Diluted relevance (irrelevant text in chunk reduces match quality)
- May exceed context windows when multiple chunks are retrieved
Chunking Strategies
Fixed-Size Chunking: Split at fixed token counts with overlap:
def fixed_chunk(text: str, size: int = 512, overlap: int = 50):
    tokens = tokenize(text)  # any tokenizer whose counts match your embedding model
    chunks = []
    for i in range(0, len(tokens), size - overlap):  # slide by size - overlap so adjacent chunks share context
        chunks.append(tokens[i:i + size])
    return chunks
Simple but breaks semantic units (sentences, paragraphs).
Semantic Chunking: Respect document structure:
def semantic_chunk(text: str, max_size: int = 512):
    sections = extract_sections(text)  # split on headers and paragraphs
    chunks = []
    current = []
    current_size = 0
    for section in sections:
        section_size = count_tokens(section)
        if current and current_size + section_size > max_size:
            chunks.append(merge(current))  # flush the accumulated sections
            current = [section]
            current_size = section_size
        else:
            current.append(section)
            current_size += section_size
    if current:
        chunks.append(merge(current))
    return chunks
Preserves meaning but creates variable-size chunks.
Hierarchical Chunking: Create chunks at multiple levels:
Document
  |
  +---> Section chunks (large context)
          |
          +---> Paragraph chunks (medium context)
                  |
                  +---> Sentence chunks (fine retrieval)
Enables multi-level retrieval with context expansion.
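As a sketch of how multi-level retrieval might work in practice (the vector_db client, chunk_store lookup, and parent_id metadata field are assumptions, not any specific product's API):

def retrieve_with_expansion(query: str, k: int = 5):
    """Search fine-grained chunks, then expand each hit to its parent section
    so the generator sees surrounding context."""
    hits = vector_db.search(query, index="sentence_chunks", k=k)
    sections = []
    for hit in hits:
        parent = chunk_store.get(hit.metadata["parent_id"])  # look up the enclosing section chunk
        if parent not in sections:
            sections.append(parent)  # de-duplicate sections shared by multiple hits
    return sections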
Chunk Enrichment
Add context that improves retrieval:
Contextual Prefixes: Prepend document title and section headers:
"[From: Employee Handbook > Leave Policies > Annual Leave]
Employees are entitled to 20 days of annual leave..."
Summary Generation: Generate summaries for each chunk using LLM:
{
  "content": "...",
  "summary": "Annual leave entitlement and approval process",
  "keywords": ["annual leave", "PTO", "vacation approval"]
}
Hypothetical Questions: Generate questions this chunk might answer:
{
  "content": "...",
  "questions": [
    "How many days of annual leave do employees get?",
    "Who approves leave requests?",
    "Can I carry over unused leave?"
  ]
}
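A minimal enrichment pass might look like the following sketch, assuming a complete_json() helper that wraps whichever LLM client you use and returns its output parsed as a dict:

ENRICH_PROMPT = """Summarise the passage in one sentence, list 3-5 keywords,
and write three questions the passage answers. Return JSON with keys
"summary", "keywords", and "questions".

Passage:
{content}"""

def enrich_chunk(chunk: dict) -> dict:
    """Attach an LLM-generated summary, keywords, and hypothetical questions to a chunk."""
    result = complete_json(ENRICH_PROMPT.format(content=chunk["content"]))  # assumed helper returning a dict
    chunk["summary"] = result["summary"]
    chunk["keywords"] = result["keywords"]
    chunk["questions"] = result["questions"]
    return chunk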
Vector Store Architecture
Embedding Model Selection
Embedding models convert text to vectors for similarity search.
Considerations:
- Dimension size (higher = more expressive, more storage)
- Multilingual support
- Domain adaptation capabilities
- Inference cost and latency
- Open vs. proprietary
Current Options:
- OpenAI text-embedding-3 (proprietary, excellent quality)
- Cohere embed (proprietary, multilingual)
- BGE (open, competitive quality)
- E5 (open, good multilingual)
Recommendation: Start with high-quality proprietary embeddings; evaluate open alternatives for cost optimisation once baseline is established.
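For reference, a minimal embedding call using the OpenAI Python SDK; the model name is one option among those listed above, and any provider with a batch embedding endpoint follows the same pattern:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of chunk texts; returns one vector per input text."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in response.data]

Batching inputs during ingestion keeps throughput reasonable and costs predictable.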
Vector Database Selection
Vector databases store embeddings and enable similarity search.
Purpose-Built Options:
- Pinecone: Managed, excellent performance, limited filtering
- Weaviate: Open source, hybrid search, good metadata filtering
- Qdrant: Open source, excellent filtering, efficient storage
- Milvus: Open source, high scale, Kubernetes-native
Embedded Options:
- Chroma: Simple, Python-native, good for prototypes
- pgvector: PostgreSQL extension, familiar tooling, moderate scale
Selection Criteria:
- Scale requirements (millions vs. billions of vectors)
- Filtering complexity (simple vs. multi-dimensional)
- Operational preferences (managed vs. self-hosted)
- Cost structure (per-vector vs. infrastructure-based)
Index Architecture
Structure indices for your query patterns:
Single Index (Simple): All content in one index. Works for small-scale deployments. Becomes slow with complex filtering at large scale.
Multi-Index (Segmented): Separate indices by content type or access level:
[Public Content Index] <-- Queries from unauthenticated users
[Internal Index] <-- All authenticated users
[Confidential Index] <-- Restricted users only
Hierarchical Index: Parent-child relationships for document structure:
Document Index (summaries)
  |
  +---> Section Index (details)
          |
          +---> Paragraph Index (fine-grained)
Query starts broad; drill down for specificity.
Query Processing
Query Understanding
Raw user queries often need preprocessing:
Query Expansion: Add synonyms and related terms:
User: "vacation policy"
Expanded: "vacation policy annual leave PTO time off holiday entitlement"
Query Decomposition: Split complex queries into sub-queries:
User: "What's our vacation policy and how do I request time off?"
Sub-queries:
1. "vacation policy entitlement"
2. "request time off process approval"
Intent Classification: Route queries based on intent:
User: "Who should I contact about benefits?"
Intent: contact_lookup (route to HR directory)
User: "What are our health insurance options?"
Intent: policy_retrieval (route to knowledge base)
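A sketch of intent routing, assuming a complete() LLM helper that returns plain text; keyword rules or a small classifier can serve as a cheaper first pass:

INTENT_PROMPT = """Classify the query into one of: contact_lookup, policy_retrieval, other.
Query: {query}
Intent:"""

def route_query(query: str) -> str:
    """Return the handler for a query based on its classified intent."""
    intent = complete(INTENT_PROMPT.format(query=query)).strip()  # assumed LLM helper
    return {
        "contact_lookup": "hr_directory",
        "policy_retrieval": "knowledge_base",
    }.get(intent, "knowledge_base")  # default to knowledge base retrieval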
Retrieval Strategy
Dense Retrieval: Vector similarity search on embeddings. Captures semantic similarity but may miss exact keyword matches.
Sparse Retrieval: BM25 or TF-IDF on text. Captures keyword matches but misses semantic equivalents.
Hybrid Retrieval: Combine dense and sparse:
def hybrid_search(query: str, k: int = 10):
    dense_results = vector_search(query, k=k)  # embedding similarity
    sparse_results = bm25_search(query, k=k)   # keyword match
    # Reciprocal rank fusion
    combined = reciprocal_rank_fusion(dense_results, sparse_results)
    return combined[:k]
Hybrid typically outperforms either approach alone.
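The reciprocal_rank_fusion step above can be as simple as the following sketch; each result is assumed to expose a stable id attribute, and 60 is the constant commonly used for RRF:

def reciprocal_rank_fusion(*result_lists, k: int = 60):
    """Merge ranked lists: each item scores the sum of 1 / (k + rank) across lists."""
    scores = {}
    items = {}
    for results in result_lists:
        for rank, item in enumerate(results, start=1):
            scores[item.id] = scores.get(item.id, 0.0) + 1.0 / (k + rank)
            items[item.id] = item
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [items[i] for i in ranked_ids]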
Re-Ranking
Initial retrieval returns candidates; re-ranking improves precision:
Cross-Encoder Re-Ranking: Score (query, chunk) pairs with a cross-encoder model:
from typing import List

def rerank(query: str, chunks: List[str], top_k: int = 5):
    pairs = [(query, chunk) for chunk in chunks]
    scores = cross_encoder.predict(pairs)  # cross-encoder scores each (query, chunk) pair
    ranked = sorted(zip(chunks, scores), key=lambda x: -x[1])
    return ranked[:top_k]
Cross-encoders are slow but accurate. Use for top-N candidates only.
LLM-Based Re-Ranking: Ask LLM to rank candidates by relevance:
Given the query: "{query}"
Rank these documents from most to least relevant:
1. {doc1}
2. {doc2}
...
Expensive but effective for complex relevance judgments.
Response Generation
Context Construction
How you present retrieved context matters:
Simple Concatenation:
Context:
{chunk1}
{chunk2}
{chunk3}
Question: {query}
Answer based on the context above.
Structured Context:
# Retrieved Information
## From: Employee Handbook (Updated: 2024-01-15)
{chunk1}
## From: HR FAQ (Updated: 2023-08-20)
{chunk2}
# Your Question
{query}
# Instructions
Answer the question using only the information above. Cite your sources.
Relevance-Weighted: Order context by relevance score; place most relevant first (or last, depending on model).
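A sketch of assembling the structured variant, assuming each retrieved chunk carries source_title and updated metadata, a text attribute, and arrives ordered most-relevant first:

def build_context(query: str, chunks: list) -> str:
    """Assemble a structured, source-attributed prompt from retrieved chunks."""
    sections = []
    for chunk in chunks:
        header = f"## From: {chunk.metadata['source_title']} (Updated: {chunk.metadata['updated']})"
        sections.append(f"{header}\n{chunk.text}")  # chunk.text assumed to hold the chunk body
    return (
        "# Retrieved Information\n\n"
        + "\n\n".join(sections)
        + f"\n\n# Your Question\n{query}\n\n# Instructions\n"
        "Answer the question using only the information above. Cite your sources."
    )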
Grounding and Attribution
Production systems must ground responses in retrieved content:
Citation Requirements:
Answer the question and cite the source documents using [1], [2], etc.
At the end, list your sources.
Verification Prompting:
For each claim in your answer, quote the specific text that supports it.
If you cannot find support for a claim, do not make it.
Uncertainty Expression:
If the retrieved context doesn't contain enough information to answer
confidently, say "I don't have enough information about X" rather
than guessing.
Hallucination Mitigation
LLMs hallucinate. RAG reduces but doesn’t eliminate this.
Detection Strategies:
- Claim extraction: Parse response into claims; verify each against sources
- Entailment checking: use an NLI model to check whether claims are entailed by the retrieved context (sketched after this list)
- Citation verification: Check that citations actually exist and support claims
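A sketch of the entailment check, assuming a hypothetical nli_model exposing an entailment_score() method for a (premise, hypothesis) pair; any NLI model can play this role:

def unsupported_claims(claims: list[str], context: str, threshold: float = 0.7) -> list[str]:
    """Return the claims that the retrieved context does not appear to entail."""
    flagged = []
    for claim in claims:
        score = nli_model.entailment_score(premise=context, hypothesis=claim)  # assumed interface
        if score < threshold:
            flagged.append(claim)  # below-threshold claims are flagged for review or removal
    return flagged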
Prevention Strategies:
- Conservative prompting: Instruct model to decline rather than guess
- Temperature reduction: Lower temperature reduces creative fabrication
- Context filtering: Remove low-relevance chunks that might confuse
Access Control Architecture
Enterprise RAG must respect document permissions.
Permission Models
Document-Level: Each document has a permission set; users see results only from accessible documents.
Section-Level: Within documents, sections may have different permissions (redacted clauses, confidential appendices).
User-Contextual: Permissions depend on user context (role, department, project membership).
Implementation Patterns
Filter at Query Time: Include user permissions in query:
def secure_search(query: str, user: User):
    # Build permission filter
    access_filter = {
        "OR": [
            {"visibility": "public"},
            {"department": user.department},
            {"allowed_users": {"contains": user.id}},
        ]
    }
    return vector_db.search(
        query=query,
        filter=access_filter
    )
Permission-Indexed Documents: Include permission metadata in document embeddings; filter during retrieval.
Separate Indices: Maintain separate indices for different permission levels; query appropriate indices per user.
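Routing by permission level can be a thin layer over the vector store; a sketch, with a hypothetical mapping from user clearance to index names and a generic vector_db client:

INDICES_BY_CLEARANCE = {
    "public": ["public_content"],
    "internal": ["public_content", "internal"],
    "confidential": ["public_content", "internal", "confidential"],
}

def search_for_user(query: str, user, k: int = 10):
    """Query only the indices the user is cleared to see, then merge by score."""
    results = []
    for index in INDICES_BY_CLEARANCE[user.clearance]:  # user.clearance is an assumed attribute
        results.extend(vector_db.search(query=query, index=index, k=k))
    return sorted(results, key=lambda r: r.score, reverse=True)[:k]  # r.score assumed on results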
Audit and Compliance
Log all queries and retrievals:
{
  "timestamp": "2024-02-13T10:30:00Z",
  "user_id": "user123",
  "query": "compensation guidelines",
  "retrieved_documents": ["doc_a", "doc_b"],
  "response_generated": true,
  "sources_cited": ["doc_a"]
}
Enable:
- Access pattern analysis
- Compliance reporting
- Anomaly detection
- Quality improvement
Evaluation and Improvement
Evaluation Metrics
Retrieval Metrics:
- Recall@K: Fraction of relevant documents in top K
- MRR: Mean reciprocal rank of first relevant result
- NDCG: Normalized discounted cumulative gain
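Recall@K and MRR are straightforward to compute once queries are labelled with their relevant documents; a minimal sketch:

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average 1/rank of the first relevant document across (retrieved, relevant) pairs."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0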
Response Metrics:
- Faithfulness: Does response align with retrieved context?
- Answer correctness: Is the answer factually correct?
- Completeness: Does answer address all aspects of the query?
Evaluation Infrastructure
Build evaluation datasets:
{
  "query": "What is our parental leave policy?",
  "relevant_docs": ["hr_handbook_v3", "benefits_faq"],
  "expected_answer": "16 weeks paid leave for primary caregivers...",
  "expected_citations": ["hr_handbook_v3"]
}
Run evaluation regularly:
- After model changes
- After significant content updates
- On scheduled basis for drift detection
Continuous Improvement
User Feedback:
- Thumbs up/down on responses
- “This didn’t answer my question”
- Correction submissions
Retrieval Analysis:
- Queries with no relevant results
- Queries with low-confidence responses
- Frequently asked questions without content coverage
Content Gap Identification:
- Questions that consistently fail to find answers
- Topics with outdated content
- Areas where users frequently escalate to humans
Scaling Considerations
Enterprise RAG systems grow with usage. Plan for scale:
Content Scale:
- Millions of documents across sources
- Billions of chunks in vector store
- Continuous ingestion from hundreds of sources
Query Scale:
- Thousands of queries per minute
- Sub-second latency requirements
- Concurrent users across time zones
Model Scale:
- Multiple embedding models for different content types
- Large context models for complex queries
- Ensemble approaches for quality
Architectural Responses
Caching:
- Query result caching for repeated questions
- Embedding caching for frequently accessed documents
- LLM response caching for common queries
Tiered Processing:
- Simple queries: fast, cheap models
- Complex queries: powerful, expensive models
- Route based on query classification (see the sketch after this list)
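A sketch of that routing step, assuming a classify_complexity() helper and two placeholder model names:

def answer(query: str, context: str) -> str:
    """Send simple queries to a fast, cheap model and complex ones to a stronger model."""
    tier = classify_complexity(query)  # assumed classifier returning "simple" or "complex"
    model = "fast-small-model" if tier == "simple" else "large-reasoning-model"  # placeholder names
    return generate(model=model, query=query, context=context)  # assumed generation helper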
Distributed Infrastructure:
- Sharded vector stores
- Replicated query services
- Regional deployments for latency
The Knowledge-First Organisation
RAG systems are means to an end. The end is enabling every employee and every AI system to access organisational knowledge effectively.
This requires more than technology:
- Content governance ensuring knowledge is current and accurate
- Contribution incentives making knowledge sharing rewarding
- Quality processes maintaining standards across sources
- Feedback loops connecting user needs to content creation
The organisations that win with AI will be those with the best knowledge infrastructure. RAG is the interface between that knowledge and AI capabilities.
Build it to last.
Ash Ganda advises enterprise technology leaders on AI systems, knowledge management, and digital transformation strategy. Connect on LinkedIn for ongoing insights.