
NLP for Document Classification: Practical Implementation

abemon | 12 min read | Written by practitioners

The problem everyone has and few automate

Every organization classifies documents. Invoices go to accounting. Contracts go to legal. Delivery notes go to logistics. Complaints go to customer service. Court notifications go to the right lawyer based on subject matter.

At most companies, this process is manual: someone opens the document, reads it (or at least scans it), decides what type it is, and routes it accordingly. That works at 10 documents a day. It breaks at 200. And it becomes a bottleneck at 1,000.

Automatic document classification with NLP (Natural Language Processing) solves this problem. It’s not science fiction and doesn’t require a team of PhDs. With current tools, a competent engineering team can have a classifier in production within 4-8 weeks.

But the gap between a prototype working in a notebook and a system classifying real documents in production at 95% accuracy is considerable. This article covers the full path.

Choosing the approach: three options in 2025

Option 1: LLM via API (fast, expensive at scale)

The quickest path: send the document to GPT-4, Claude, or Gemini with a prompt saying “classify this document into one of these categories: [list]” and parse the response.

It works surprisingly well for prototypes and low volumes (< 100 documents/day). Out-of-the-box accuracy typically falls between 85-92% for well-defined categories. You need no training data, no ML infrastructure, and implementation takes a day.

The problem: cost and latency. Classifying a 3-page document with GPT-4 costs approximately $0.02-0.05 (depending on length). For 1,000 daily documents, that’s $600-1,500 per month for classification alone. Latency runs 2-5 seconds per document, which may be unacceptable for real-time flows. And you depend on an external service for a critical function.

When it makes sense: rapid prototype, low volume, available budget, or when categories change frequently and you don’t want to retrain models.
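
The prompt-and-parse flow above can be sketched in a few lines. The provider call itself is left as a placeholder (`call_llm` is hypothetical, not a real SDK function); the category list and the defensive parsing are the parts that carry over to any provider:

```python
CATEGORIES = ["invoice", "contract", "delivery-note", "complaint", "other"]

def build_prompt(document_text: str) -> str:
    """Build a zero-shot classification prompt for an LLM API."""
    return (
        "Classify this document into exactly one of these categories: "
        + ", ".join(CATEGORIES)
        + ".\nAnswer with the category name only.\n\n"
        + document_text[:8000]  # truncate to keep token cost predictable
    )

def parse_response(raw: str) -> str:
    """Map the model's free-text answer back to a known category."""
    answer = raw.strip().lower()
    for category in CATEGORIES:
        if category in answer:
            return category
    return "other"  # fall back instead of crashing on an unexpected answer

# response = call_llm(build_prompt(text))   # provider-specific API call
# label = parse_response(response)
```

The fallback to "other" matters: LLMs occasionally answer with a sentence instead of a bare label, and a parser that raises on that will take down the whole pipeline.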

Option 2: Pre-trained model with fine-tuning (balanced)

Take a pre-trained language model (BERT, RoBERTa, DeBERTa, or their multilingual variants) and fine-tune it with your labeled data to classify your specific categories.

The process: collect 200-500 labeled documents per category (more if categories are ambiguous), fine-tune the model for 3-5 epochs, evaluate on a test set, iterate. The result is a model running on your infrastructure, classifying a document in 50-200ms, at practically zero per-inference cost.

Typical accuracy with 300+ examples per category: 93-97% for well-defined categories. Superior to the LLM API option for your domain-specific categories because the model has seen your actual documents.

For multilingual documents, XLM-RoBERTa is the recommended starting point. For English-only, DeBERTa-v3 offers the best classification benchmark performance.

Option 3: Classic ML model (lightweight, sufficient for many cases)

If your documents are reasonably distinct across categories (an invoice looks nothing like a contract), a classic model with TF-IDF + linear classifier (Logistic Regression, SVM) can reach 90-94% accuracy with far less complexity and cost than a transformer.
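
A minimal sketch of this pipeline with scikit-learn, on a toy corpus (the example documents are hypothetical stand-ins for real labeled data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled corpus: two obviously distinct categories
train_texts = [
    "invoice number 4711 total amount due VAT payment terms 30 days",
    "invoice total payable tax amount remittance bank account",
    "this agreement is entered into by the parties hereinafter contract",
    "the parties agree to the following terms and conditions of this contract",
]
train_labels = ["invoice", "invoice", "contract", "contract"]

# TF-IDF features + a linear classifier: fast to train, milliseconds at inference
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

print(model.predict(["amount due on this invoice: payment within 30 days"])[0])
```

The whole model serializes to a few megabytes with `joblib` and runs anywhere Python runs, which is a large part of its maintainability argument.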

This approach makes sense when: labeled document volume is very high (> 10,000), inference speed is critical (1-5ms per document), or the team lacks deep learning experience and needs something maintainable.

Don’t dismiss it as “old.” We’ve seen projects where an SVM with well-selected features outperforms a poorly configured fine-tuned BERT. The right tool is the one your team can maintain in production.

The labeling workflow: where you win or lose

The model is only as good as its training data. And training data for document classification needs human labels. The labeling workflow is, in practice, the most critical and most underestimated phase of the project.

Defining categories

Seems obvious. It’s not. Categories must be:

  • Mutually exclusive. A document belongs to one and only one category. If “vendor-invoice” and “invoice” are both categories, the model will confuse them because the definitions overlap.
  • Exhaustive. Every document entering the system must fit a category. If an unexpected document type arrives, you need an “other” category or an outlier detection mechanism.
  • Balanced (ideally). If you have 2,000 invoices and 50 court judgments, the model will bias toward invoices. Solutions: oversampling minority categories, undersampling majority, or adjusting class weights during training.
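
Class weights are the cheapest of those three fixes. A sketch of the standard inverse-frequency formula (the same one scikit-learn uses for `class_weight="balanced"`):

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights: rare classes get larger weights,
    computed as total / (n_classes * class_count)."""
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return {c: total / (n_classes * n) for c, n in counts.items()}

# 2,000 invoices vs 50 judgments: the judgment weight ends up 40x larger,
# so each judgment contributes 40x more to the loss during training
labels = ["invoice"] * 2000 + ["judgment"] * 50
weights = class_weights(labels)
```

These weights plug directly into the loss function (e.g. the `weight` argument of a cross-entropy loss), so no resampling of the dataset is needed.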

In our deployments for law firms, typical categories are: complaint, answer, judgment, order, ruling, notification, subpoena, and other. For logistics companies: invoice, delivery note, packing list, bill of lading, certificate of origin, customs declaration, and other.

Labeling tools

Don’t label in a spreadsheet. Use an annotation tool:

Label Studio (open source) is our choice. It supports text, images, and documents. It handles multiple annotators, inter-annotator agreement calculation, and direct export to training formats. Deploys in 10 minutes with Docker.

Prodigy (commercial, from the spaCy creators) is more efficient for individual annotators with its active annotation interface. More expensive, but reduces labeling time 30-50% thanks to its active learning algorithm that prioritizes the most informative documents.

How many documents to label

Rule of thumb we’ve validated:

Categories    Documents/category    Minimum total
3-5           200-300               600-1,500
6-10          300-500               1,800-5,000
11-20         500+                  5,500+

These numbers are for transformer fine-tuning. For classic models (TF-IDF + SVM), you need 2-3x more. For LLMs via API, you need 0 (zero-shot) or 5-10 per category (few-shot).

Quality matters more than quantity. 200 perfectly labeled documents beat 1,000 with noisy labels. Invest in clear category definitions and label review.

Fine-tuning step by step

The technical flow for transformer fine-tuning (the most common case in our projects):

1. Preprocessing. Extract text from documents (PDF to text with PyMuPDF or pdfplumber; images with OCR via Tesseract or a service like AWS Textract). Clean the text: remove repeated headers/footers, normalize whitespace, truncate to the model’s maximum length (512 tokens for BERT, 1024 for newer models).
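
The header/footer removal is worth spelling out, because a repeated letterhead on every page is pure noise for the classifier. A sketch that drops any line appearing on most pages, then normalizes whitespace and truncates (character truncation here is a rough proxy for the token limit mentioned above):

```python
import re
from collections import Counter

def clean_pages(pages, max_chars=4000):
    """Clean extracted page texts: drop lines that repeat across most pages
    (likely headers/footers), normalize whitespace, truncate."""
    line_counts = Counter(
        line.strip() for page in pages for line in page.splitlines() if line.strip()
    )
    threshold = max(2, int(0.8 * len(pages)))  # "appears on 80%+ of pages"
    kept = []
    for page in pages:
        for line in page.splitlines():
            stripped = line.strip()
            if stripped and line_counts[stripped] < threshold:
                kept.append(stripped)
    text = re.sub(r"\s+", " ", " ".join(kept))
    return text[:max_chars]
```

In production you would truncate by token count with the model's own tokenizer rather than by characters, but the cleaning logic is identical.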

2. Train/val/test split. 70% train, 15% validation, 15% test. Stratified by category to maintain proportions. The test set is never touched during training; it’s your truth metric.
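
The 70/15/15 stratified split is two chained calls to scikit-learn's `train_test_split`:

```python
from sklearn.model_selection import train_test_split

def stratified_split(texts, labels, seed=42):
    """70/15/15 train/val/test split, stratified by category.
    The test set is held out and never touched during training."""
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        texts, labels, test_size=0.30, stratify=labels, random_state=seed
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```

Fixing `random_state` makes the split reproducible, which matters when you compare model versions over successive retraining cycles.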

3. Training. We use Hugging Face Transformers + PyTorch. Initial hyperparameters: learning rate 2e-5, batch size 16, 3-5 epochs, 10% warmup. Early stopping based on F1-score on the validation set. For BERT-base, fine-tuning on an A10 GPU (available on Railway or Lambda Labs at $0.73/hour) takes 15-30 minutes for a 2,000-document dataset.
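
The early-stopping logic is framework-agnostic (Hugging Face ships it as a Trainer callback; this shows what it does). A minimal sketch that stops when validation F1 stalls:

```python
class EarlyStopping:
    """Stop training when validation F1 hasn't improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_f1):
        """Record this epoch's validation F1; return True to stop training."""
        if val_f1 > self.best + self.min_delta:
            self.best = val_f1
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
for val_f1 in [0.88, 0.91, 0.93, 0.92, 0.93]:
    if stopper.step(val_f1):
        break  # the last two epochs didn't beat 0.93, so stop
```

The checkpoint saved at the best epoch (here, the one scoring 0.93) is the model you ship, not the final one.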

4. Evaluation. Weighted F1-score on the test set. Confusion matrix to identify where the model errs. If two categories are systematically confused, the problem usually lies in category definitions, not the model.
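
Both metrics come straight from scikit-learn. A toy example showing how the confusion matrix exposes where the errors cluster:

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical test-set labels and predictions
y_true = ["invoice", "invoice", "contract", "judgment", "judgment", "judgment"]
y_pred = ["invoice", "invoice", "contract", "judgment", "invoice", "judgment"]

labels = ["invoice", "contract", "judgment"]
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
cm = confusion_matrix(y_true, y_pred, labels=labels)
# Rows are true classes, columns are predicted classes:
# cm[2][0] counts judgments misclassified as invoices
```

Here one judgment leaks into the invoice column; on a real matrix, a consistently off-diagonal cell between two categories is the signal that their definitions overlap.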

5. Confidence calibration. The model produces probabilities, not certainties. We calibrate a confidence threshold: if the predicted category’s probability is below 85%, the document is flagged for human review. This creates a human-in-the-loop flow where the model automatically classifies 80-90% of documents and escalates ambiguous ones to an operator.
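
The routing decision is a softmax over the model's logits plus a threshold check. A self-contained sketch (the logits here are illustrative):

```python
import math

def softmax(logits):
    """Convert raw model logits into probabilities."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(logits, categories, threshold=0.85):
    """Auto-classify above the threshold, escalate to a human below it."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= threshold:
        return {"category": categories[best], "confidence": probs[best], "review": False}
    return {"category": None, "confidence": probs[best], "review": True}

cats = ["invoice", "contract", "judgment"]
```

One caveat: raw softmax probabilities from a fine-tuned transformer tend to be overconfident, so the threshold should be tuned on the validation set (or the probabilities calibrated, e.g. with temperature scaling) rather than taken at face value.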

Production deployment

A model in a Jupyter notebook is not a product. Production deployment requires:

Inference service. The model packaged as a REST API. We use FastAPI + uvicorn with the model loaded in memory at startup. For transformers, ONNX Runtime reduces inference latency 2-3x versus native PyTorch. For high volumes, TorchServe or Triton Inference Server offer batching and multiple workers.
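
The core of such an endpoint, with the model stubbed out so the routing logic is visible; in the real service the stub is replaced by the transformer (or ONNX session) loaded once at startup, and a FastAPI route simply wraps the handler:

```python
def make_classify_handler(predict, threshold=0.85):
    """Wrap a predict(text) -> (category, confidence) callable into the
    JSON-shaped handler a REST route would expose."""
    def handler(payload: dict) -> dict:
        text = payload.get("text", "")
        if not text.strip():
            return {"error": "empty document"}
        category, confidence = predict(text)
        return {
            "category": category if confidence >= threshold else None,
            "confidence": round(confidence, 4),
            "needs_review": confidence < threshold,
        }
    return handler

# Stub model standing in for the real one loaded at startup
handler = make_classify_handler(lambda text: ("invoice", 0.97))
```

Keeping the model behind a plain callable like this also makes the service testable without loading the weights.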

Document pipeline. A flow that receives the document (via API, email, or shared folder), extracts text, calls the inference service, and executes the corresponding action (move to folder, create task in CRM, route to the right person). Airflow or Prefect for orchestration.

Monitoring. Model accuracy degrades over time (data drift). We monitor the model’s confidence distribution weekly. If the percentage of low-confidence documents (< 85%) rises significantly, something has changed: new document types, format changes, or vocabulary shifts. That triggers a relabeling and retraining cycle.
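
The weekly check reduces to tracking one number. A sketch, where the baseline rate and alert factor are illustrative values you would tune to your own traffic:

```python
def low_confidence_rate(confidences, threshold=0.85):
    """Fraction of the week's documents classified below the threshold."""
    if not confidences:
        return 0.0
    return sum(1 for c in confidences if c < threshold) / len(confidences)

def drift_alert(weekly_rates, baseline=0.10, factor=1.5):
    """Flag any week whose low-confidence rate exceeds factor x baseline."""
    return [rate > baseline * factor for rate in weekly_rates]
```

A flagged week doesn't automatically mean retraining; it means pulling a sample of the low-confidence documents and looking at what changed.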

Retraining. Documents the model escalates to human review become new training data. The operator reviewing the document confirms or corrects the category, and that data joins the training dataset. We retrain monthly or when accuracy drops below the acceptable threshold (95% in our case).

Real case: court notification classification

A law firm with 15 attorneys and 4 practice areas (civil, commercial, labor, criminal) received 80-120 court notifications daily via electronic notification systems. Each notification needed classification by type (complaint, judgment, order, ruling, subpoena) and assignment to the responsible attorney based on subject matter and case number.

An administrative assistant spent 3 hours daily on this triage. Human error (assigning a notification to the wrong attorney) occurred 2-3 times per week, with potentially serious consequences (missed deadlines).

We implemented a classifier based on a fine-tuned multilingual BERT with 1,800 labeled notifications (6 categories). The model achieves 96.2% F1-score on the test set. Documents with confidence below 90% (approximately 8% of total) are escalated to human review.

Result: automatic triage reduced classification time from 3 hours to 20 minutes daily (reviewing escalated items only). Assignment errors dropped from 2-3 weekly to 0-1. ROI was reached in 5 weeks.

Real case: shipping document classification

A logistics company processed 500+ daily documents associated with international shipments: commercial invoices, packing lists, bills of lading, certificates of origin, customs declarations, phytosanitary certificates. Each document needed linking to the correct shipment file and classification for customs processing.

The classifier combines OCR + NLP + business rules. OCR extracts text from PDFs (many are poor-quality scans; we use AWS Textract which handles this reasonably well). The NLP model (XLM-RoBERTa fine-tuned with 3,200 documents) classifies the document type. Business rules extract the shipment reference number and link the document to the file.

Classification accuracy: 94.8%. Unclassified low-confidence documents are escalated to an operator who reviews them in a web interface and corrects the classification. Each correction feeds the retraining cycle.

Mistakes to avoid

Don’t measure only accuracy. A classifier that predicts “invoice” for everything has 70% accuracy if 70% of your documents are invoices. Use weighted F1-score, precision and recall by category. The confusion matrix is your best friend.

Don’t ignore text extraction. If OCR produces garbage, the NLP classifier can’t work magic. Invest in quality OCR (Textract, Google Document AI) and image preprocessing (binarization, rotation correction) before worrying about the classification model.

Don’t launch without human-in-the-loop. No model has 100% accuracy. You need a flow for documents the model can’t classify with confidence. If you deploy without this flow, you’ll have misclassified documents that nobody catches until it’s too late.

Don’t train on imbalanced data without compensation. If your dataset has 2,000 invoices and 50 judgments, the model will learn to ignore judgments. Class weights, oversampling (SMOTE for tabular features, duplication + augmentation for text), or focal loss solve the problem.

Document classification with NLP is not the most spectacular AI project, but it is arguably the one with the best effort-to-value ratio for most organizations. Document volume grows every year. The capacity to process them manually does not. For a deeper look at how legal document management evolves from OCR to semantic understanding, see our article on legal document management with AI. And to get these models to production with the right metrics, our whitepaper on MLOps covers the full process.

About the author


abemon engineering

Engineering team

Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.