# TEI Pipeline

Extracts text from historical document images with OCR and marks it up as TEI P5 XML, running locally on macOS with GPU acceleration.

## Features

- **VLM-powered OCR** using Qwen2.5-VL (7B or 3B) for structural text extraction
- **Automatic metadata inference** — title, author, date, publisher, place, language detected from the document
- **Editable metadata** — review and correct inferred values; XML regenerates instantly
- **Custom tags** — define your own structural elements (e.g., PERSON, PLACE) before processing
- **Language auto-detection** — identifies the primary language of the main text
- **TEI P5 compliant** output with genre-specific markup (prose, poetry, drama, manuscript)
- **Apple Silicon optimized** — MPS GPU acceleration with aggressive memory management
- **Self-contained .app bundle** — everything inside one folder, no external dependencies

## Setup

```bash
cd tei-pipeline
python3 setup_app.py
```

This creates `/Applications/TEI Pipeline.app` with an embedded Python environment and all dependencies.

## Usage

1. Launch **TEI Pipeline** from Applications
2. Upload a PDF or image
3. Optionally set genre, language, model size, and custom tags
4. Click **Process Document**
5. Review inferred metadata and edit as needed
6. Click **Apply & Regenerate XML** to update the TEI header
7. Save the XML file
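
Step 6 rebuilds the TEI header from the edited metadata. The following is a minimal sketch of what such a step might look like, using only the Python standard library. `build_header` and the metadata keys are hypothetical names chosen for illustration and may not match `core/tei_generator.py`; the element names themselves are standard TEI P5.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def q(tag):
    """Return the namespace-qualified name for a TEI element."""
    return "{%s}%s" % (TEI_NS, tag)


def build_header(meta):
    """Build a <teiHeader> element from a metadata dict (hypothetical keys)."""
    header = ET.Element(q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))

    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = meta.get("title", "")
    ET.SubElement(title_stmt, q("author")).text = meta.get("author", "")

    pub_stmt = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(pub_stmt, q("publisher")).text = meta.get("publisher", "")
    ET.SubElement(pub_stmt, q("pubPlace")).text = meta.get("place", "")
    ET.SubElement(pub_stmt, q("date")).text = meta.get("date", "")

    source_desc = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(source_desc, q("p")).text = "Transcribed from document images."

    profile = ET.SubElement(header, q("profileDesc"))
    lang_usage = ET.SubElement(profile, q("langUsage"))
    # "und" is the BCP 47 code for an undetermined language.
    ET.SubElement(lang_usage, q("language"), ident=meta.get("language", "und"))
    return header


if __name__ == "__main__":
    example = {"title": "Example Title", "author": "Example Author",
               "date": "1800", "language": "en"}
    print(ET.tostring(build_header(example), encoding="unicode"))
```

Because the header is rebuilt from a plain dict, editing a field and regenerating requires no new OCR pass, which would be consistent with the instant regeneration described in Features.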

## Memory Management (Apple Silicon)

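The 7B model needs about 14 GB for its weights alone in float16, before activations and image tensors are counted. On Macs with 32 GB of unified memory, large multi-page documents can still cause MPS out-of-memory (OOM) errors. This version builds in the following mitigations: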

- **Image downscaling**: On MPS, images are capped at 1280 px per side (vs. 2048 px on CUDA)
- **Aggressive cache clearing**: `torch.mps.empty_cache()` + `gc.collect()` between every page
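- **Watermark ratio disabled**: `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` removes PyTorch's upper limit on MPS allocations, so inference can use all available memory (at the cost of possible system-wide memory pressure)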
- **Tensor cleanup**: Input/output tensors deleted immediately after each inference
- **Smaller model option**: Select the 3B model for tighter memory environments

If you still hit OOM on very large documents, try the 3B model.
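
The mitigations above could be wired together roughly as follows. This is a minimal sketch, not the code in `core/ocr_engine.py`: `capped_size`, `release_memory`, and the `MAX_SIDE` table are hypothetical names. Only the 1280/2048 limits, the environment variable, and the `torch.mps.empty_cache()` + `gc.collect()` calls come from the list above. The sketch assumes aspect ratio is preserved when downscaling, and sets the variable before importing PyTorch as the safe ordering.

```python
import gc
import os

# Set before torch is imported so the MPS allocator sees it at startup.
os.environ.setdefault("PYTORCH_MPS_HIGH_WATERMARK_RATIO", "0.0")

try:
    import torch
except ImportError:  # lets the sizing helper run without PyTorch installed
    torch = None

# Per-device cap on image side length, in pixels (values from the list above).
MAX_SIDE = {"mps": 1280, "cuda": 2048}


def capped_size(width, height, device):
    """Return (width, height) scaled so neither side exceeds the device cap."""
    limit = MAX_SIDE.get(device)
    longest = max(width, height)
    if limit is None or longest <= limit:
        return width, height
    scale = limit / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def release_memory():
    """Free cached memory between pages.

    Call this after dropping the page's tensors, e.g. `del inputs, outputs`.
    """
    gc.collect()
    if torch is not None:
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            torch.mps.empty_cache()


if __name__ == "__main__":
    print(capped_size(2560, 1920, "mps"))   # (1280, 960)
    print(capped_size(2560, 1920, "cuda"))  # (2048, 1536)
    release_memory()
```

Running `gc.collect()` before `torch.mps.empty_cache()` gives Python a chance to drop unreachable tensors first, so the cache clear has more to return.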

## Custom Tags

Under "Custom Tags (advanced)" you can define new elements. For example:

| Tag Name | Description |
|----------|-------------|
| PERSON | Personal names of individuals |
| PLACE | Geographic place names |
| WORK_TITLE | Titles of referenced works |

Each tag is included in the VLM prompt as a `[TAG_NAME]...[/TAG_NAME]` marker pair and appears in the TEI output as a `<seg type="tagname">` element.
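
The marker-to-`<seg>` conversion can be sketched with a regular expression. This is an illustrative sketch, not the logic in `core/tei_generator.py`: `markers_to_seg` is a hypothetical helper, and it assumes that the `type` value is the lowercased tag name with underscores kept, and that markers are not nested.

```python
import re
from xml.sax.saxutils import escape

# Matches [TAG]...[/TAG] where the closing name equals the opening name.
MARKER = re.compile(r"\[([A-Z][A-Z0-9_]*)\](.*?)\[/\1\]", re.DOTALL)


def markers_to_seg(text):
    """Replace [TAG]...[/TAG] markers with <seg type="tag"> elements.

    Text outside and inside the markers is XML-escaped.
    Nested markers are not handled in this sketch.
    """
    parts = []
    pos = 0
    for match in MARKER.finditer(text):
        parts.append(escape(text[pos:match.start()]))
        tag_type = match.group(1).lower()
        parts.append('<seg type="%s">%s</seg>' % (tag_type, escape(match.group(2))))
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)


if __name__ == "__main__":
    print(markers_to_seg("Letter from [PERSON]Jane Doe[/PERSON] to [PLACE]Bath[/PLACE]"))
```

Markers whose closing name does not match the opening name are left as literal text, which makes malformed model output easy to spot in the result.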

## File Structure

```
tei-pipeline/
├── setup_app.py          # Creates /Applications/TEI Pipeline.app
├── server.py             # Flask server (UI + API)
├── __main__.py           # python -m server
├── icon.png              # App icon source
├── icon.iconset/         # macOS icon sizes
├── templates/
│   └── index.html        # Web UI
└── core/
    ├── device.py          # GPU detection
    ├── ocr_engine.py      # VLM model + inference
    ├── tei_generator.py   # Structural markers → TEI
    ├── tei_schema.py      # TEI P5 skeleton/validation
    ├── image_loader.py    # PDF/image loading
    └── pipeline.py        # Workflow orchestration
```
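
For orientation, GPU detection of the kind `core/device.py` is responsible for might look like the sketch below. `pick_device` is a hypothetical name and the CUDA, then MPS, then CPU preference order is an assumption; the actual module may differ.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can use."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed: nothing to accelerate
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent in older PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```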

## Uninstall

Delete `/Applications/TEI Pipeline.app`. That's it — nothing else is installed system-wide.

Model weights are cached at `~/.cache/huggingface/hub/` and can be deleted separately if desired.
