
Document Processing

MachinaOs provides a complete RAG (Retrieval-Augmented Generation) pipeline for processing documents, generating embeddings, and storing vectors for semantic search.

Pipeline Overview

[HTTP Scraper] --> [File Downloader] --> [Document Parser] --> [Text Chunker] --> [Embedding Generator] --> [Vector Store]
| Stage | Node | Purpose |
| --- | --- | --- |
| 1. Collect | HTTP Scraper | Scrape URLs from web pages |
| 2. Download | File Downloader | Download files in parallel |
| 3. Parse | Document Parser | Extract text from documents |
| 4. Chunk | Text Chunker | Split text into overlapping chunks |
| 5. Embed | Embedding Generator | Generate vector embeddings |
| 6. Store | Vector Store | Store and query vectors |

HTTP Scraper

Scrapes links from web pages with support for pagination and date ranges.

Modes

| Mode | Description | Use Case |
| --- | --- | --- |
| Single | One request to the URL | Simple page scraping |
| Date Range | Iterate through dates | News archives, dated content |
| Pagination | Follow page numbers | Multi-page listings |

Parameters

- `url` (string, required): URL to scrape. Use `{{date}}` or `{{page}}` placeholders for iteration.
- `mode` (select, default: `single`): Scraping mode: `single`, `date_range`, or `pagination`.
- Link Selector: CSS selector for links to extract.
- `startDate` (string): Start date for `date_range` mode (`YYYY-MM-DD`).
- `endDate` (string): End date for `date_range` mode (`YYYY-MM-DD`).
- `startPage` (number, default: `1`): Start page for `pagination` mode.
- `endPage` (number, default: `10`): End page for `pagination` mode.

Output

```json
{
  "items": [
    {"url": "https://example.com/article1.pdf"},
    {"url": "https://example.com/article2.pdf"}
  ],
  "count": 2
}
```

Example: Scrape News Archive

```
URL: https://news.site/archive/{{date}}
Mode: date_range
Start Date: 2025-01-01
End Date: 2025-01-31
Link Selector: a.article-link
```
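MachinaOs performs this expansion internally; as a rough sketch of what `date_range` mode does, the `{{date}}` placeholder is substituted once per day across the inclusive range. The function name below is illustrative, not part of the MachinaOs API:

```python
from datetime import date, timedelta

def expand_date_urls(template: str, start: str, end: str) -> list[str]:
    """Expand a {{date}} placeholder into one URL per day (inclusive)."""
    d = date.fromisoformat(start)
    stop = date.fromisoformat(end)
    urls = []
    while d <= stop:
        urls.append(template.replace("{{date}}", d.isoformat()))
        d += timedelta(days=1)
    return urls

# expand_date_urls("https://news.site/archive/{{date}}",
#                  "2025-01-01", "2025-01-03")
# yields one URL each for Jan 1, 2, and 3
```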

File Downloader

Downloads files from URLs in parallel using semaphore-based concurrency.

Parameters

- `outputDirectory` (string, required): Directory to save downloaded files.
- `maxWorkers` (number, default: `5`): Maximum parallel downloads (1-32).
- `skipExisting` (boolean, default: `true`): Skip files that already exist.
- `timeout` (number, default: `30000`): Download timeout in milliseconds.

Input

Expects an array of items with a `url` field (from HTTP Scraper):

```json
{
  "items": [
    {"url": "https://example.com/file1.pdf"},
    {"url": "https://example.com/file2.pdf"}
  ]
}
```

Output

```json
{
  "downloaded": [
    {"url": "...", "path": "/output/file1.pdf", "size": 1024000},
    {"url": "...", "path": "/output/file2.pdf", "size": 2048000}
  ],
  "failed": [],
  "skipped": 0,
  "total": 2
}
```
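The node's implementation isn't shown here, but the semaphore-style bounded parallelism can be sketched with Python's `ThreadPoolExecutor`, which caps in-flight downloads at `max_workers`. The `download_all` helper and its injectable `fetch` argument are illustrative, not the actual MachinaOs API:

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

def download_all(items, output_dir, max_workers=5, skip_existing=True, fetch=None):
    """Download each item["url"] into output_dir with bounded parallelism."""
    fetch = fetch or (lambda url: urlopen(url, timeout=30).read())
    os.makedirs(output_dir, exist_ok=True)
    result = {"downloaded": [], "failed": [], "skipped": 0}

    def worker(url):
        path = os.path.join(output_dir, os.path.basename(url))
        if skip_existing and os.path.exists(path):
            return ("skipped", url, path)
        data = fetch(url)
        with open(path, "wb") as f:
            f.write(data)
        return ("downloaded", url, path)

    # The pool acts as the concurrency semaphore: at most max_workers
    # downloads run at once; the rest wait in the queue.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, it["url"]): it["url"] for it in items}
        for fut in as_completed(futures):
            try:
                status, url, path = fut.result()
            except Exception as exc:
                result["failed"].append({"url": futures[fut], "error": str(exc)})
                continue
            if status == "skipped":
                result["skipped"] += 1
            else:
                result["downloaded"].append(
                    {"url": url, "path": path, "size": os.path.getsize(path)})
    result["total"] = len(items)
    return result
```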

Document Parser

Parses documents to extract text content using configurable parsers.

Parsers

| Parser | Description | Supported Formats |
| --- | --- | --- |
| PyPDF | Fast PDF parsing | PDF |
| Marker | GPU-accelerated OCR | PDF (scanned documents) |
| Unstructured | Multi-format parsing | PDF, DOCX, HTML, TXT, MD |
| BeautifulSoup | HTML parsing | HTML |

Parameters

- `parser` (select, default: `pypdf`): Parser to use: `pypdf`, `marker`, `unstructured`, or `beautifulsoup`.
- `inputDirectory` (string, required): Directory containing files to parse (or paths from File Downloader).

Input

Accepts paths from File Downloader or a directory:
```json
{
  "downloaded": [
    {"path": "/output/file1.pdf"},
    {"path": "/output/file2.pdf"}
  ]
}
```

Output

```json
{
  "documents": [
    {
      "path": "/output/file1.pdf",
      "text": "Extracted text content...",
      "pages": 10,
      "parser": "pypdf"
    }
  ],
  "total": 2,
  "failed": 0
}
```

Parser Comparison

| Feature | PyPDF | Marker | Unstructured | BeautifulSoup |
| --- | --- | --- | --- | --- |
| Speed | Fast | Slow | Medium | Fast |
| Accuracy | Good | Excellent | Good | Good |
| OCR | No | Yes (GPU) | Optional | No |
| Formats | PDF | PDF | Many | HTML |
| GPU Required | No | Yes | No | No |
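As a rough illustration of what an HTML parser such as BeautifulSoup does during this stage (extract visible text, drop markup and scripts), here is a minimal text extractor built on Python's standard-library `html.parser`. The class and function names are illustrative, not MachinaOs internals:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-whitespace text outside skipped elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

A dedicated library handles malformed markup and entity edge cases far better; this only shows the shape of the operation.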

Text Chunker

Splits text into overlapping chunks for embedding generation.

Strategies

| Strategy | Description | Best For |
| --- | --- | --- |
| Recursive | Split by paragraphs, sentences, words | Most documents |
| Markdown | Preserve markdown structure | Markdown files |
| Token | Split by token count | Consistent chunk sizes |

Parameters

- `strategy` (select, default: `recursive`): Chunking strategy: `recursive`, `markdown`, or `token`.
- `chunkSize` (number, default: `1000`): Target chunk size in characters (100-8000).
- `chunkOverlap` (number, default: `200`): Overlap between chunks in characters (0-1000).

Input

```json
{
  "documents": [
    {"path": "...", "text": "Long document text..."}
  ]
}
```

Output

```json
{
  "chunks": [
    {
      "text": "Chunk 1 text...",
      "metadata": {
        "source": "/output/file1.pdf",
        "chunk_index": 0,
        "start_char": 0,
        "end_char": 1000
      }
    }
  ],
  "total_chunks": 50
}
```

Choosing Chunk Size

| Document Type | Recommended Size (characters) | Overlap (characters) |
| --- | --- | --- |
| Technical docs | 500-1000 | 100-200 |
| Articles | 1000-1500 | 200-300 |
| Books | 1500-2000 | 300-400 |
| Code | 500-800 | 100-150 |
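The `recursive` and `markdown` strategies split on semantic boundaries, but the core idea of overlapping windows can be sketched in a few lines. This is character-based slicing only, with illustrative names; real splitters prefer paragraph and sentence breaks:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200):
    """Split text into fixed-size character windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({"text": text[start:end],
                       "metadata": {"start_char": start, "end_char": end}})
        if end == len(text):
            break
        # Each new window starts `overlap` characters before the previous end,
        # so context spanning a boundary appears in both chunks.
        start = end - overlap
    return chunks
```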

Embedding Generator

Generates vector embeddings from text chunks using various providers.

Providers

| Provider | Model | Dimensions | Cost |
| --- | --- | --- | --- |
| HuggingFace | BAAI/bge-small-en-v1.5 | 384 | Free (local) |
| HuggingFace | BAAI/bge-large-en-v1.5 | 1024 | Free (local) |
| OpenAI | text-embedding-3-small | 1536 | Paid |
| OpenAI | text-embedding-3-large | 3072 | Paid |
| Ollama | nomic-embed-text | 768 | Free (local) |

Parameters

- `provider` (select, default: `huggingface`): Embedding provider: `huggingface`, `openai`, or `ollama`.
- `model` (string, default: `BAAI/bge-small-en-v1.5`): Model name or path for embeddings.
- `batchSize` (number, default: `32`): Batch size for embedding generation.

Input

```json
{
  "chunks": [
    {"text": "Chunk text...", "metadata": {...}}
  ]
}
```

Output

```json
{
  "embeddings": [
    {
      "text": "Chunk text...",
      "embedding": [0.123, -0.456, ...],
      "metadata": {...}
    }
  ],
  "model": "BAAI/bge-small-en-v1.5",
  "dimensions": 384,
  "total": 50
}
```
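Assuming the provider call is abstracted behind an `embed_fn` (for example, a HuggingFace model or OpenAI client that maps a list of strings to a list of vectors), the role of `batchSize` can be sketched as follows. All names here are illustrative:

```python
def batched(items, batch_size=32):
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def embed_chunks(chunks, embed_fn, batch_size=32):
    """Call embed_fn once per batch; embed_fn maps list[str] -> list[vector]."""
    out = []
    for batch in batched(chunks, batch_size):
        vectors = embed_fn([c["text"] for c in batch])
        for chunk, vector in zip(batch, vectors):
            out.append({"text": chunk["text"],
                        "embedding": vector,
                        "metadata": chunk.get("metadata", {})})
    return out
```

Batching trades memory for throughput: larger batches mean fewer provider round-trips, but each call must fit in the model's memory budget.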

Vector Store

Stores and queries vector embeddings using various backends.

Backends

| Backend | Description | Best For |
| --- | --- | --- |
| ChromaDB | Local, SQLite-based | Development, small datasets |
| Qdrant | High-performance | Production, large datasets |
| Pinecone | Cloud-hosted | Serverless, managed service |

Operations

| Operation | Description |
| --- | --- |
| store | Add embeddings to the store |
| query | Search for similar documents |
| delete | Remove documents by ID or filter |

Parameters

- `backend` (select, default: `chroma`): Vector store backend: `chroma`, `qdrant`, or `pinecone`.
- `operation` (select, default: `store`): Operation: `store`, `query`, or `delete`.
- `collectionName` (string, required): Collection/index name.
- `topK` (number, default: `5`): Number of results for the `query` operation.

Store Operation

Input (from Embedding Generator):

```json
{
  "embeddings": [
    {"text": "...", "embedding": [...], "metadata": {...}}
  ]
}
```

Output:

```json
{
  "stored": 50,
  "collection": "my-docs",
  "backend": "chroma"
}
```

Query Operation

Input:

```json
{
  "query": "What is the main topic?",
  "top_k": 5
}
```

Output:

```json
{
  "results": [
    {
      "text": "Relevant chunk text...",
      "score": 0.89,
      "metadata": {"source": "file1.pdf", "chunk_index": 5}
    }
  ],
  "query": "What is the main topic?"
}
```
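Production backends use approximate nearest-neighbor indexes, but semantically a query scores the query vector against every stored vector and returns the `topK` best matches. A brute-force sketch, assuming cosine similarity and illustrative names:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def query(store, query_embedding, top_k=5):
    """Rank stored entries by similarity to the query vector."""
    scored = [{"text": e["text"],
               "metadata": e.get("metadata", {}),
               "score": cosine(query_embedding, e["embedding"])}
              for e in store]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:top_k]
```

Brute force is O(n) per query, which is why ChromaDB, Qdrant, and Pinecone build indexes (e.g., HNSW) to answer the same question approximately but much faster.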

Complete RAG Pipeline Example

Workflow

[HTTP Scraper] --> [File Downloader] --> [Document Parser] --> [Text Chunker] --> [Embedding Generator] --> [Vector Store]

Configuration

  1. HTTP Scraper
    • URL: https://docs.example.com/api/{{page}}
    • Mode: pagination
    • Link Selector: a.pdf-link
  2. File Downloader
    • Output: ./downloads
    • Max Workers: 10
  3. Document Parser
    • Parser: pypdf
  4. Text Chunker
    • Strategy: recursive
    • Chunk Size: 1000
    • Overlap: 200
  5. Embedding Generator
    • Provider: huggingface
    • Model: BAAI/bge-small-en-v1.5
  6. Vector Store
    • Backend: chroma
    • Operation: store
    • Collection: api-docs

Querying the Pipeline

Create a separate workflow for querying:
[Webhook Trigger] --> [Embedding Generator] --> [Vector Store (query)] --> [AI Agent] --> [Webhook Response]
The AI Agent receives relevant context from the vector store to answer questions.

Tips

- Start with ChromaDB for development: no setup required.
- Use HuggingFace embeddings for free local processing without API keys.
- Adjust chunk size based on your embedding model's context window.
- Include meaningful metadata (source, page, section) for better retrieval context.
- A GPU is required for the Marker parser (OCR). Use PyPDF for non-scanned documents.