Confluence Document Extraction Pipeline
Scrapy-based extraction pipeline for enterprise Confluence documentation — produces Markdown with embedded context headers and dual-format metadata sidecars for AWS Bedrock Knowledge Base and platform-agnostic embedding workflows.
This pipeline extracts enterprise Confluence documentation on a monthly cadence, converting HTML pages to clean Markdown with embedded context headers and producing dual-format metadata sidecars: an AWS Bedrock-native format and a portable flat-JSON format compatible with LangChain, LlamaIndex, Haystack, and Chroma. All metadata fields are derived deterministically — no ML inference, no heuristic scoring.
This page covers the design decisions, operational constraints, and scope boundaries of the extraction pipeline — not the implementation details.
Six-stage pipeline: REST API extraction → metadata validation → Markdown conversion → JSON sidecar export → local staging and optional S3 upload → cleanup.
Background
Enterprise documentation in Confluence required structured extraction for downstream RAG and embedding workflows. Earlier ad-hoc extraction produced inconsistent metadata and source attribution, making chunk-level retrieval unreliable. A Scrapy-based pipeline with six ordered stages (REST API extraction → metadata validation → Markdown conversion → JSON sidecar export → local staging with optional S3 upload → cleanup) addressed these requirements while keeping operational complexity low for monthly refresh cycles. The pipeline targets AWS Bedrock Knowledge Base as the primary consumer while producing portable sidecars for alternative platforms.
Design Decisions
Scrapy was chosen as the ETL orchestrator for its built-in parallel HTTP request handling, configurable concurrency, and explicit pipeline stage model. Each of the six stages processes items sequentially with clear input/output contracts — no shared state, no implicit handoffs between stages. This makes the data flow auditable and each stage independently testable.
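As a rough sketch, the stage ordering might be expressed through Scrapy's ITEM_PIPELINES setting. The module path, class names, and the priority values outside the 300-450 range described below are assumptions, not the pipeline's actual identifiers; lower numbers run first, so each item passes through the stages in the order listed.

```python
# settings.py (sketch): hypothetical module path and class names.
# Priorities 300-450 cover extraction through metadata export;
# the upload and cleanup stages run after them.
ITEM_PIPELINES = {
    "confluence_extract.pipelines.RestApiExtractionPipeline": 300,
    "confluence_extract.pipelines.MetadataValidationPipeline": 350,
    "confluence_extract.pipelines.MarkdownConversionPipeline": 400,
    "confluence_extract.pipelines.SidecarExportPipeline": 450,
    "confluence_extract.pipelines.S3UploadPipeline": 500,
    "confluence_extract.pipelines.CleanupPipeline": 600,
}
```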
The pipeline produces dual-format sidecar output per page: a platform-agnostic portable sidecar (flat JSON with 12 fields) and an AWS Bedrock-native sidecar (metadataAttributes wrapper format). Both land in a flat staging directory alongside the Markdown file. This decouples the pipeline from any single downstream platform — Bedrock, LangChain, LlamaIndex, and Chroma can all consume from the same run without reprocessing.
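A minimal sketch of the dual sidecar write, assuming a flat metadata dict per page; the helper name and the portable sidecar's file naming are illustrative rather than the pipeline's actual conventions.

```python
import json
from pathlib import Path


def write_sidecars(meta: dict, md_path: Path) -> None:
    """Write both sidecar formats into the same staging directory as the Markdown file."""
    # Portable flat-JSON sidecar: the 12 metadata fields as a flat object.
    portable_path = md_path.with_suffix(".meta.json")  # naming is an assumption
    portable_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")

    # Bedrock-native sidecar: the same fields wrapped in "metadataAttributes",
    # named <file>.metadata.json so Bedrock associates it with the document.
    bedrock_path = md_path.with_name(md_path.name + ".metadata.json")
    bedrock_path.write_text(
        json.dumps({"metadataAttributes": meta}, indent=2), encoding="utf-8"
    )
```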
Each Markdown file begins with an embedded context header — title as H1 followed by document type, Confluence space, and parent page in bold. This ensures any downstream chunk retains source attribution regardless of how Bedrock or a local chunker slices the document. This is particularly relevant for hierarchical chunking where child chunks may not include the document start.
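The header itself is only a few lines of Markdown. A sketch of how it might be rendered follows; the label wording and the sample values in the docstring are illustrative assumptions.

```python
def build_context_header(title: str, doc_type: str, space: str, parent: str) -> str:
    """Render the embedded context header prepended to every Markdown file.

    Produces an H1 title followed by bold attribution lines, e.g.:

        # Deployment Runbook
        **Document type:** runbook
        **Confluence space:** PLATFORM
        **Parent page:** Operations Handbook
    """
    return (
        f"# {title}\n\n"
        f"**Document type:** {doc_type}\n"
        f"**Confluence space:** {space}\n"
        f"**Parent page:** {parent}\n\n"
    )
```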
Stop-on-error behavior was explicitly chosen for pipeline stages 300–450 (extraction through metadata export). These stages raise exceptions on failure, halting the pipeline immediately rather than allowing partial runs with inconsistent metadata to propagate silently to downstream consumers. The S3 upload stage logs failures and continues, which is consistent with monthly-batch semantics where a missed upload is preferable to a corrupted knowledge base.
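A sketch of the two failure policies, assuming dict-style Scrapy items and illustrative field names; wiring the raised exception to an actual crawl stop is left out of this sketch.

```python
import logging

import boto3
from botocore.exceptions import BotoCoreError, ClientError

logger = logging.getLogger(__name__)


class MetadataValidationPipeline:
    """Stages in the 300-450 range fail fast rather than pass bad items along."""

    REQUIRED_FIELDS = ("page_id", "title", "space_key")  # illustrative field names

    def process_item(self, item, spider):
        missing = [f for f in self.REQUIRED_FIELDS if not item.get(f)]
        if missing:
            # Any exception here is treated as fatal for the run (stop-on-error),
            # so inconsistent metadata never reaches downstream consumers.
            raise ValueError(f"page {item.get('page_id')} is missing {missing}")
        return item


class S3UploadPipeline:
    """The upload stage logs failures and keeps going, per monthly-batch semantics."""

    def open_spider(self, spider):
        self.s3 = boto3.client("s3")

    def process_item(self, item, spider):
        try:
            self.s3.upload_file(item["local_path"], item["bucket"], item["s3_key"])
        except (BotoCoreError, ClientError):
            logger.exception("S3 upload failed for %s; continuing", item["s3_key"])
        return item
```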
Operational Considerations
Monthly batch cadence with an explicit page list eliminates the need for change detection, incremental state management, or distributed scheduling. The sources file defines all target pages and descendant trees. Each run is idempotent — files are overwritten. Deleted pages require manual removal from S3 followed by a Bedrock sync; this constraint was accepted deliberately to keep the system a single-process Python batch job with no infrastructure dependencies beyond Confluence and optional AWS credentials.
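As an illustration, the sources file could be as simple as a JSON list of page entries; the layout and loader shown here are assumptions, not the file's actual schema.

```python
import json
from pathlib import Path


def load_target_pages(sources_path: Path) -> list[dict]:
    """Load the explicit page list that drives a monthly run.

    Assumed layout, for illustration only:
        [{"page_id": "123456", "include_descendants": true}, ...]
    """
    return json.loads(sources_path.read_text(encoding="utf-8"))
```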
Each page carries a content_status quality signal (ok / minimal / empty) so downstream operators can filter before S3 upload without rerunning the pipeline; a page is marked ok once its Markdown body reaches 200 characters. The run manifest summarizes page counts by status and document type, providing sufficient operational visibility for monthly review without dedicated monitoring tooling.
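A sketch of the status assignment; only the 200-character ok threshold comes from the description above, and treating a blank body as the boundary for empty is an assumption.

```python
def classify_content_status(markdown_body: str) -> str:
    """Assign the per-page content quality signal recorded in the metadata."""
    body = markdown_body.strip()
    if not body:
        return "empty"
    if len(body) < 200:  # threshold for "ok" per the description above
        return "minimal"
    return "ok"
```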
Dry-run mode mirrors the full S3 upload path — including sidecar placement — to a local directory with an identical key structure. This validated the Bedrock integration requirement (sidecar must share the same S3 prefix as its Markdown file) in an environment without AWS credentials before the first production run.
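A sketch of the dry-run branch, with function and parameter names invented for illustration; the point is that a Markdown file and its sidecars land under the same key prefix whether the destination is S3 or a local directory.

```python
import shutil
from pathlib import Path


def stage_file(local_file: Path, s3_key: str, dry_run_root: Path | None = None) -> None:
    """Copy a staged file to the dry-run mirror using its would-be S3 key."""
    if dry_run_root is None:
        return  # real boto3 upload path omitted in this sketch
    target = dry_run_root / s3_key
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(local_file, target)
```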
The pipeline is scoped to controlled monthly extraction against a known Confluence instance with a defined page list. It does not handle versioned document history, real-time sync, or dynamic page discovery. PDF export via Scroll PDF API is available as an optional audit artifact but is not part of the primary embedding workflow. These scope constraints were accepted to maintain operational simplicity over architectural generality.