How to Automatically Split PDFs by HAWB Number Using Python — A Complete Step-by-Step Guide
🧭 Introduction
In logistics, shipping, and freight forwarding, large PDF documents often contain multiple House Air Waybill (HAWB) sets — each consisting of two or more pages. Manually splitting these into individual files can be tedious and error-prone.
This guide shows you how to automate PDF splitting by HAWB number using a simple Python script. You’ll learn how to extract text, detect lines such as `HAWB Number: 9221038752`, and save each set as a separate, correctly named PDF in seconds.
By the end of this tutorial, you’ll have a fully automated workflow for handling multi-page logistics documents efficiently.
🚀 Why Automate PDF Splitting by HAWB?
If you work in air cargo, freight forwarding, or documentation processing, you’ve probably faced this issue:
- A single PDF contains multiple shipment sets.
- Each set is two pages: an invoice and an air waybill.
- Every set has a unique HAWB number.
- Manually splitting and renaming 20 or more sets is slow and repetitive.
With Python, you can split, detect, and rename all sets in one go — accurately and automatically.
🧰 Tools and Requirements
To complete this tutorial, you’ll need:
| Tool | Description |
|---|---|
| 🐍 Python 3.x | Programming language used for automation |
| 📄 PyPDF2 | Python library for reading and writing PDF files (its maintained successor is published as pypdf) |
| 🧮 Regular Expressions (re) | Used to find HAWB numbers in text |
| 💻 A Searchable PDF | PDF with selectable text (not scanned) |
⚙️ Step-by-Step Guide
Step 1: Install Python and PyPDF2
If you don’t already have Python, download it from python.org/downloads and install it. Then open your terminal or Command Prompt and install the required library:
pip install PyPDF2
Step 2: Prepare Your PDF File
Place your PDF in a known folder. Example path:
C:\Users\Admin\Desktop\Data Entry\Trick\Temp work\New folder\pdf\20pagespdf.pdf
Each HAWB set contains 2 pages, and each set includes a line such as:
HAWB Number: 9221038752
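The fixed two-page grouping assumed here can be sketched as plain index arithmetic, before any PDF library is involved. The helper name `set_ranges` is hypothetical and only illustrates how a page count is divided into sets:

```python
def set_ranges(total_pages, pages_per_set=2):
    """Return (start, end) page index pairs, end exclusive, one per set."""
    return [
        (start, min(start + pages_per_set, total_pages))
        for start in range(0, total_pages, pages_per_set)
    ]


print(set_ranges(6))  # [(0, 2), (2, 4), (4, 6)]
print(set_ranges(5))  # [(0, 2), (2, 4), (4, 5)]
```

Note the second call: if the page count is not a multiple of the set size, the last set comes out short. That usually means a page is missing or one set has a different length, so it is worth checking the page count before splitting.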
Step 3: Create the Python Script
Open Notepad, VS Code, or any code editor and paste this script. Save it as split_hawb_sets.py:
import os
import re
from PyPDF2 import PdfReader, PdfWriter

# --- CONFIG ---
input_pdf = r"C:\Users\Admin\Desktop\Data Entry\Trick\Temp work\New folder\pdf\20pagespdf.pdf"
pages_per_set = 2  # each HAWB set = 2 pages
output_folder = os.path.join(os.path.dirname(input_pdf), "output_sets")

# --- Create output folder ---
os.makedirs(output_folder, exist_ok=True)

# --- Read the PDF ---
reader = PdfReader(input_pdf)
total_pages = len(reader.pages)
print(f"Total pages: {total_pages}")

# --- Process each 2-page set ---
for i in range(0, total_pages, pages_per_set):
    writer = PdfWriter()
    set_pages = reader.pages[i:i + pages_per_set]
    text = ""

    # Extract text from both pages
    for page in set_pages:
        text += page.extract_text() or ""

    # Find HAWB number using regex
    match = re.search(r"HAWB\s*Number:\s*(\d+)", text)
    if match:
        hawb_number = match.group(1)
        filename = f"HAWB_{hawb_number}.pdf"
    else:
        set_num = (i // pages_per_set) + 1
        filename = f"Set_{set_num:02d}.pdf"

    # Write the set
    for page in set_pages:
        writer.add_page(page)
    output_path = os.path.join(output_folder, filename)
    with open(output_path, "wb") as f:
        writer.write(f)
    print(f"✅ Saved: {output_path}")

print("\nAll sets split successfully!")
print(f"Output folder: {output_folder}")
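One edge case worth guarding against: because each file is opened in `"wb"` mode, two sets that carry the same HAWB number would write to the same filename, and the second would silently replace the first. A minimal sketch of one way to avoid that is below. The helper `unique_path` and its `_2`, `_3` suffix scheme are assumptions for illustration, not part of the original script:

```python
import os
import tempfile


def unique_path(folder, filename):
    """Return a path inside folder that does not exist yet.

    If HAWB_123.pdf is taken, try HAWB_123_2.pdf, HAWB_123_3.pdf, and so on.
    """
    stem, ext = os.path.splitext(filename)
    candidate = os.path.join(folder, filename)
    counter = 2
    while os.path.exists(candidate):
        candidate = os.path.join(folder, f"{stem}_{counter}{ext}")
        counter += 1
    return candidate


# Small demonstration in a throwaway folder
with tempfile.TemporaryDirectory() as folder:
    first = unique_path(folder, "HAWB_9221038752.pdf")
    open(first, "wb").close()  # pretend the first set was already saved
    second = unique_path(folder, "HAWB_9221038752.pdf")
    print(os.path.basename(first))   # HAWB_9221038752.pdf
    print(os.path.basename(second))  # HAWB_9221038752_2.pdf
```

In the script, this would replace the `os.path.join(output_folder, filename)` line with `output_path = unique_path(output_folder, filename)`.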
Step 4: Run the Script
Open your Command Prompt and run:
python "C:\Users\Admin\Desktop\Data Entry\Trick\Temp work\New folder\pdf\split_hawb_sets.py"
When the script finishes, you’ll find a new folder named:
output_sets
containing your separated and renamed files, such as:
- HAWB_9221038752.pdf
- HAWB_9221038753.pdf
- HAWB_9221038754.pdf
- ...
🧩 How It Works — Step-by-Step Explanation
- Read and Count PDF Pages: The script uses `PdfReader` to open your PDF and determine how many pages it contains.
- Divide into 2-Page Sets: Using a simple loop, it processes every two pages as one shipment set.
- Extract Text: It extracts text from those two pages using `extract_text()`.
- Find HAWB Number: The line `HAWB Number: 9221038752` is detected using a regular expression (`re.search()`), which looks for a number pattern following the phrase “HAWB Number”.
- Rename and Save: Each set is then saved as a new PDF file named after the detected HAWB number. If no number is found, it assigns a generic name like `Set_01.pdf`.
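The detection and naming steps can be tried in isolation on plain strings, which is a quick way to check the pattern against text copied from your own PDF. This is a small sketch using the same regular expression as the script. The function names `find_hawb` and `output_name` are hypothetical, and the sample text is made up:

```python
import re

# Same pattern as in the script: the label, optional whitespace, then digits
HAWB_PATTERN = re.compile(r"HAWB\s*Number:\s*(\d+)")


def find_hawb(text):
    """Return the first HAWB number found in text, or None."""
    match = HAWB_PATTERN.search(text)
    return match.group(1) if match else None


def output_name(text, set_index):
    """Build the output filename for a set (set_index starts at 0)."""
    hawb = find_hawb(text)
    if hawb:
        return f"HAWB_{hawb}.pdf"
    return f"Set_{set_index + 1:02d}.pdf"


print(output_name("Invoice\nHAWB Number: 9221038752\nConsignee", 0))
# HAWB_9221038752.pdf
print(output_name("A page with no number", 2))
# Set_03.pdf
```

If your documents print the label differently (for example in lowercase, or without the colon), the pattern will not match and the set falls back to the generic name, so adjust the expression to the exact text your PDFs contain.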
🎯 Benefits of This Automation
- ✅ Saves Time: Split and rename 20 or 200 pages in seconds.
- ✅ Fewer Errors: No more manual typing or page selection mistakes, as long as the HAWB line is extracted correctly.
- ✅ Consistent Naming: All PDFs are named in a clean, uniform format.
- ✅ Reusable: You can modify it for other patterns (like Invoice No, Shipment ID, etc.).
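As a rough illustration of reusing the approach for other identifiers, the single pattern can be replaced by a small table of patterns tried in order. The `Invoice No` pattern, the `INV` prefix, and the function name `name_for` are assumptions for this sketch. Check the exact label wording in your own documents before relying on them:

```python
import re

# Hypothetical label patterns, tried in this order
PATTERNS = {
    "HAWB": re.compile(r"HAWB\s*Number:\s*(\d+)"),
    "INV": re.compile(r"Invoice\s*No\.?:?\s*([A-Za-z0-9-]+)"),
}


def name_for(text, fallback):
    """Return '<prefix>_<value>.pdf' for the first pattern that matches."""
    for prefix, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            return f"{prefix}_{match.group(1)}.pdf"
    return fallback


print(name_for("HAWB Number: 9221038752", "Set_01.pdf"))
# HAWB_9221038752.pdf
print(name_for("Invoice No: INV-2024-001", "Set_01.pdf"))
# INV_INV-2024-001.pdf
print(name_for("nothing useful", "Set_01.pdf"))
# Set_01.pdf
```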
🧠 Pro Tips
- If your sets have a different fixed length, change `pages_per_set = 2` to another number.
- If you want a CSV report of extracted HAWB numbers and filenames, you can add one with a few lines using Python’s `csv` module.
- Works best on text-based PDFs (not scanned images). For scanned PDFs, add OCR using `pytesseract`.
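The CSV report mentioned in the tips could look roughly like this. It is a sketch only: the column names, the `write_report` helper, and the sample rows are assumptions. In the real script you would append one row per set inside the loop and write them out at the end:

```python
import csv
import io


def write_report(rows, stream):
    """Write (set_number, hawb_number, filename) rows as CSV to stream."""
    writer = csv.writer(stream)
    writer.writerow(["set", "hawb_number", "filename"])
    writer.writerows(rows)


# Hypothetical rows, as they might be collected inside the splitting loop
rows = [
    (1, "9221038752", "HAWB_9221038752.pdf"),
    (2, "", "Set_02.pdf"),  # no HAWB number was found for this set
]

buffer = io.StringIO()  # stands in for a real file in this demonstration
write_report(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, open the target with `open(report_path, "w", newline="", encoding="utf-8")` and pass that file object as `stream`. The `newline=""` argument is what the `csv` module documentation recommends to avoid blank lines between rows on Windows.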
🏁 Conclusion
Automating document splitting can save hours of manual work in logistics and data processing. Using a simple Python script and the PyPDF2 library, you can extract, split, and rename multi-page PDFs by HAWB Number in one automated workflow.
Whether you handle 10 or 1,000 sets, this method gives you fast, consistently named output, provided the PDF has selectable text and every set has the same number of pages.
Keywords: HAWB PDF Split Python, Split PDF by HAWB Number, Python PDF automation, Air Waybill PDF Processing, PyPDF2 tutorial, Logistics document automation, Extract text from PDF Python, Rename PDFs automatically Python
