Document Processing
MachinaOs provides a complete RAG (Retrieval-Augmented Generation) pipeline for processing documents, generating embeddings, and storing vectors for semantic search.
Pipeline Overview
| Stage | Node | Purpose |
|---|---|---|
| 1. Collect | HTTP Scraper | Scrape URLs from web pages |
| 2. Download | File Downloader | Download files in parallel |
| 3. Parse | Document Parser | Extract text from documents |
| 4. Chunk | Text Chunker | Split text into overlapping chunks |
| 5. Embed | Embedding Generator | Generate vector embeddings |
| 6. Store | Vector Store | Store and query vectors |
HTTP Scraper
Scrapes links from web pages with support for pagination and date ranges.
Modes
| Mode | Description | Use Case |
|---|---|---|
| Single | One request to URL | Simple page scraping |
| Date Range | Iterate through dates | News archives, dated content |
| Pagination | Follow page numbers | Multi-page listings |
Parameters
URL to scrape. Use {{date}} or {{page}} placeholders for iteration.
Scraping mode: single, date_range, or pagination
CSS selector for links to extract
Start date for date_range mode (YYYY-MM-DD)
End date for date_range mode (YYYY-MM-DD)
Start page for pagination mode
End page for pagination mode
Output
Example: Scrape News Archive
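The placeholder iteration described above can be sketched as follows. This is a minimal illustration, not the node's actual implementation: the function name `expand_urls`, its argument names, the inclusive end bounds, and the host `news.example.com` are all assumptions made for the example.

```python
from datetime import date, timedelta


def expand_urls(url, mode="single", start_date=None, end_date=None,
                start_page=1, end_page=1):
    """Return the concrete URLs a scraper would request for the given mode."""
    if mode == "single":
        return [url]
    if mode == "date_range":
        first = date.fromisoformat(start_date)  # expects YYYY-MM-DD
        last = date.fromisoformat(end_date)
        days = (last - first).days
        # Assumption: both the start and end dates are included.
        return [
            url.replace("{{date}}", (first + timedelta(days=i)).isoformat())
            for i in range(days + 1)
        ]
    if mode == "pagination":
        # Assumption: the end page is included.
        return [
            url.replace("{{page}}", str(page))
            for page in range(start_page, end_page + 1)
        ]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical news archive with one page per day.
urls = expand_urls(
    "https://news.example.com/archive/{{date}}",
    mode="date_range",
    start_date="2024-01-30",
    end_date="2024-02-01",
)
```

Each resulting URL would then be fetched and filtered with the configured link selector.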
File Downloader
Downloads files from URLs in parallel using semaphore-based concurrency.
Parameters
Directory to save downloaded files
Maximum parallel downloads (1-32)
Skip files that already exist
Download timeout in milliseconds
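Semaphore-based concurrency of this kind can be sketched with `asyncio`. The sketch below is an assumption about the general technique, not the node's source: `download_all` and the injected `fetch` coroutine are hypothetical names, and a stand-in fetcher replaces real network access so the concurrency limit can be observed.

```python
import asyncio


async def download_all(items, fetch, max_workers=8):
    """Run fetch(url) for every item, with at most max_workers in flight."""
    if not 1 <= max_workers <= 32:
        raise ValueError("max_workers must be between 1 and 32")
    semaphore = asyncio.Semaphore(max_workers)

    async def worker(item):
        async with semaphore:  # waits while max_workers fetches are running
            return await fetch(item["url"])

    # gather returns results in the same order as the input items
    return await asyncio.gather(*(worker(item) for item in items))


# Stand-in fetcher that records peak concurrency instead of using the network.
stats = {"active": 0, "peak": 0}


async def fake_fetch(url):
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)  # simulate transfer time
    stats["active"] -= 1
    return url.rsplit("/", 1)[-1]  # pretend the file name is the result


items = [{"url": f"https://example.com/files/doc{i}.pdf"} for i in range(10)]
results = asyncio.run(download_all(items, fake_fetch, max_workers=3))
```

The semaphore caps how many downloads run at once, while all tasks are still scheduled up front, so a slow file does not hold up the rest of the queue.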
Input
Expects an array of items, each with a url field (as produced by HTTP Scraper).
Output
Document Parser
Parses documents to extract text content using configurable parsers.
Parsers
| Parser | Description | Supported Formats |
|---|---|---|
| PyPDF | Fast PDF parsing | PDF |
| Marker | GPU-accelerated OCR | PDF (scanned documents) |
| Unstructured | Multi-format parsing | PDF, DOCX, HTML, TXT, MD |
| BeautifulSoup | HTML parsing | HTML |
Parameters
Parser to use: pypdf, marker, unstructured, beautifulsoup
Directory containing files to parse (or from File Downloader)
Input
Accepts paths from File Downloader or a directory.
Output
Parser Comparison
| Feature | PyPDF | Marker | Unstructured | BeautifulSoup |
|---|---|---|---|---|
| Speed | Fast | Slow | Medium | Fast |
| Accuracy | Good | Excellent | Good | Good |
| OCR | No | Yes (GPU) | Optional | No |
| Formats | PDF | PDF | Many | HTML |
| GPU Required | No | Yes | No | No |
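One way to read the comparison table is as a selection rule. The helper below is a hypothetical heuristic derived only from the two tables above. `choose_parser` and its `scanned` and `gpu_available` flags are assumptions for illustration and are not part of MachinaOs.

```python
from pathlib import Path


def choose_parser(path, scanned=False, gpu_available=False):
    """Pick a parser name based on file type and OCR needs (illustrative only)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        if scanned and gpu_available:
            return "marker"        # OCR, but requires a GPU
        if scanned:
            return "unstructured"  # OCR is listed as optional, no GPU needed
        return "pypdf"             # fast path for PDFs with a text layer
    if suffix in {".html", ".htm"}:
        return "beautifulsoup"
    if suffix in {".docx", ".txt", ".md"}:
        return "unstructured"
    raise ValueError(f"no parser listed for {suffix or 'files without an extension'}")


parser = choose_parser("downloads/report.pdf", scanned=True, gpu_available=True)
```

The returned string matches the values accepted by the parser parameter.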
Text Chunker
Splits text into overlapping chunks for embedding generation.
Strategies
| Strategy | Description | Best For |
|---|---|---|
| Recursive | Split by paragraphs, sentences, words | Most documents |
| Markdown | Preserve markdown structure | Markdown files |
| Token | Split by token count | Consistent chunk sizes |
Parameters
Chunking strategy: recursive, markdown, or token
Target chunk size in characters (100-8000)
Overlap between chunks (0-1000)
Input
Output
Choosing Chunk Size
| Document Type | Recommended Size | Overlap |
|---|---|---|
| Technical docs | 500-1000 | 100-200 |
| Articles | 1000-1500 | 200-300 |
| Books | 1500-2000 | 300-400 |
| Code | 500-800 | 100-150 |
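To make the relationship between chunk size and overlap concrete, here is a minimal sliding-window sketch. It is simpler than the recursive, markdown, or token strategies listed above (it cuts at fixed character offsets), and `chunk_text` is a hypothetical name; only the parameter ranges come from this page.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 100 <= chunk_size <= 8000:
        raise ValueError("chunk_size must be between 100 and 8000")
    if not 0 <= overlap <= 1000 or overlap >= chunk_size:
        raise ValueError("overlap must be 0-1000 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample, chunk_size=1000, overlap=200)
```

With a 1000-character window and 200 characters of overlap, each chunk starts 800 characters after the previous one, so the tail of one chunk is repeated at the head of the next.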
Embedding Generator
Generates vector embeddings from text chunks using various providers.
Providers
| Provider | Model | Dimensions | Cost |
|---|---|---|---|
| HuggingFace | BAAI/bge-small-en-v1.5 | 384 | Free (local) |
| HuggingFace | BAAI/bge-large-en-v1.5 | 1024 | Free (local) |
| OpenAI | text-embedding-3-small | 1536 | Paid |
| OpenAI | text-embedding-3-large | 3072 | Paid |
| Ollama | nomic-embed-text | 768 | Free (local) |
Parameters
Embedding provider: huggingface, openai, or ollama
Model name/path for embeddings
Batch size for embedding generation
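Batching can be sketched independently of the provider. In the example below, `embed_in_batches`, the injected `embed_batch` callable, the default batch size of 32, and the `expected_dim` check are all assumptions; a stand-in embedder is used so the sketch runs without downloading a model or calling a paid API.

```python
def embed_in_batches(chunks, embed_batch, batch_size=32, expected_dim=None):
    """Call embed_batch on slices of `chunks` and collect one vector per chunk."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        batch_vectors = embed_batch(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError("embedder returned a different number of vectors")
        for vector in batch_vectors:
            # Optional guard against mixing models with different dimensions.
            if expected_dim is not None and len(vector) != expected_dim:
                raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
            vectors.append(list(vector))
    return vectors


# Stand-in embedder: 384 dimensions, matching bge-small-en-v1.5 in the table.
batch_sizes_seen = []


def fake_embed(texts):
    batch_sizes_seen.append(len(texts))
    return [[float(len(t))] * 384 for t in texts]


texts = [f"chunk {i}" for i in range(70)]
vectors = embed_in_batches(texts, fake_embed, batch_size=32, expected_dim=384)
```

A real `embed_batch` would wrap whichever provider is configured. Checking the dimension early helps because a vector store collection typically expects every vector to have the same length.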
Input
Output
Vector Store
Stores and queries vector embeddings using various backends.
Backends
| Backend | Description | Best For |
|---|---|---|
| ChromaDB | Local SQLite-based | Development, small datasets |
| Qdrant | High-performance | Production, large datasets |
| Pinecone | Cloud-hosted | Serverless, managed service |
Operations
| Operation | Description |
|---|---|
| store | Add embeddings to the store |
| query | Search for similar documents |
| delete | Remove documents by ID or filter |
Parameters
Vector store backend: chroma, qdrant, or pinecone
Operation: store, query, or delete
Collection/index name
Number of results for query operation
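The three operations can be illustrated with a small in-memory store. This is a teaching sketch under stated assumptions, not how ChromaDB, Qdrant, or Pinecone work internally: the class `InMemoryVectorStore`, its method signatures, the `top_k` name, and the use of cosine similarity are all hypothetical, and deletion is shown by ID only.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    def __init__(self):
        self._collections = {}  # collection name -> {id: (vector, document)}

    def store(self, collection, ids, vectors, documents):
        bucket = self._collections.setdefault(collection, {})
        for doc_id, vector, document in zip(ids, vectors, documents):
            bucket[doc_id] = (list(vector), document)

    def query(self, collection, vector, top_k=5):
        bucket = self._collections.get(collection, {})
        scored = [
            (cosine(vector, stored), doc_id, document)
            for doc_id, (stored, document) in bucket.items()
        ]
        scored.sort(key=lambda row: row[0], reverse=True)  # most similar first
        return [
            {"id": doc_id, "score": score, "document": document}
            for score, doc_id, document in scored[:top_k]
        ]

    def delete(self, collection, ids):
        bucket = self._collections.get(collection, {})
        for doc_id in ids:
            bucket.pop(doc_id, None)  # ignore IDs that are not present


store = InMemoryVectorStore()
store.store("api-docs", ["a", "b", "c"],
            [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
            ["first", "second", "third"])
hits = store.query("api-docs", [1.0, 0.1], top_k=2)
```

A production backend replaces the linear scan in `query` with an approximate nearest-neighbor index, which is what makes large collections practical.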
Store Operation
Input (from Embedding Generator):
Query Operation
Input:
Complete RAG Pipeline Example
Workflow
Configuration
- HTTP Scraper
  - URL: https://docs.example.com/api/{{page}}
  - Mode: pagination
  - Link Selector: a.pdf-link
- File Downloader
  - Output: ./downloads
  - Max Workers: 10
- Document Parser
  - Parser: pypdf
- Text Chunker
  - Strategy: recursive
  - Chunk Size: 1000
  - Overlap: 200
- Embedding Generator
  - Provider: huggingface
  - Model: BAAI/bge-small-en-v1.5
- Vector Store
  - Backend: chroma
  - Operation: store
  - Collection: api-docs
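The configuration above can be written down as data and checked against the value lists and ranges given in each node's Parameters section. The dictionary keys (`http_scraper`, `link_selector`, `output_dir`, and so on) and the `validate` helper are hypothetical names chosen for this sketch. Only the values and limits come from this page.

```python
import copy

PIPELINE_CONFIG = {
    "http_scraper": {"url": "https://docs.example.com/api/{{page}}",
                     "mode": "pagination", "link_selector": "a.pdf-link"},
    "file_downloader": {"output_dir": "./downloads", "max_workers": 10},
    "document_parser": {"parser": "pypdf"},
    "text_chunker": {"strategy": "recursive", "chunk_size": 1000, "overlap": 200},
    "embedding_generator": {"provider": "huggingface",
                            "model": "BAAI/bge-small-en-v1.5"},
    "vector_store": {"backend": "chroma", "operation": "store",
                     "collection": "api-docs"},
}

# Allowed values and numeric ranges as documented for each node.
ALLOWED = {
    ("http_scraper", "mode"): {"single", "date_range", "pagination"},
    ("document_parser", "parser"): {"pypdf", "marker", "unstructured", "beautifulsoup"},
    ("text_chunker", "strategy"): {"recursive", "markdown", "token"},
    ("embedding_generator", "provider"): {"huggingface", "openai", "ollama"},
    ("vector_store", "backend"): {"chroma", "qdrant", "pinecone"},
    ("vector_store", "operation"): {"store", "query", "delete"},
}
RANGES = {
    ("file_downloader", "max_workers"): (1, 32),
    ("text_chunker", "chunk_size"): (100, 8000),
    ("text_chunker", "overlap"): (0, 1000),
}


def validate(config):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for (node, key), allowed in ALLOWED.items():
        value = config.get(node, {}).get(key)
        if value not in allowed:
            errors.append(f"{node}.{key}: {value!r} is not one of {sorted(allowed)}")
    for (node, key), (low, high) in RANGES.items():
        value = config.get(node, {}).get(key)
        if not isinstance(value, int) or not low <= value <= high:
            errors.append(f"{node}.{key}: {value!r} is outside {low}-{high}")
    return errors


errors = validate(PIPELINE_CONFIG)

# A deliberately invalid variant: more workers than the documented maximum.
bad_config = copy.deepcopy(PIPELINE_CONFIG)
bad_config["file_downloader"]["max_workers"] = 64
bad_errors = validate(bad_config)
```

Validating before the workflow runs surfaces a mistyped mode or an out-of-range worker count immediately, rather than partway through a long scrape.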