Prospectus · 2025-11-30

Automating Prospectus Analysis: Python Tools for Bulk HKEx Filing Downloads

The 2025-2026 Hong Kong IPO pipeline is projected to see 80-100 new listings on the Main Board and GEM, with total funds raised potentially exceeding HKD 200 billion, according to HKEX’s own market statistics published in its 2024 Annual Report. For research analysts, due diligence teams, and compliance officers, the bottleneck is no longer deal sourcing but document processing. Each listing generates an average of 15-25 filings on the HKEXnews (披露易) platform before the first day of trading, including prospectuses, application proofs, listing documents, and supplemental announcements. Manually navigating the website, filtering for specific document types, and downloading PDFs for a single IPO takes 30-60 minutes per deal, which does not scale to a pipeline of 50+ listings per quarter. This article provides a technical framework for automating the bulk download of HKEX filings with Python, reducing per-deal data acquisition time to under two minutes. The focus is on the mechanics: search endpoints, rate limiting, PDF parsing, and compliance with HKEX’s terms of use (see also HKEX Listing Rules Chapter 2, Rule 2.07A, which governs electronic dissemination).

The HKEXnews Data Architecture: Understanding the API and File Structure

The HKEXnews (披露易) Platform’s Underlying Infrastructure

HKEXnews (https://www.hkexnews.hk) serves its public web interface and mobile application from a set of HTTP endpoints. HKEX does not publish official API documentation for third-party developers, but the endpoints are observable through browser developer tools. The core search endpoint is https://www1.hkexnews.hk/search/titlesearch.xhtml, which accepts query parameters including lang=EN, category=0, market=SEHK, stockId=[stock_code], from=[date], to=[date], and documentType=[type_code]. The response is an HTML table rather than JSON, so it must be parsed with a library such as BeautifulSoup or lxml. For bulk operations, the rate limit is approximately 10 requests per minute per IP address, as inferred from HTTP 429 responses observed during testing in Q4 2024. Exceeding this threshold triggers a temporary block lasting 15-30 minutes.
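
The observed ceiling of roughly 10 requests per minute can be respected with a small client-side throttle. The following is a minimal sketch rather than anything HKEX prescribes: the Throttle class and its parameters are hypothetical names, and the 6-second interval is simply 60 / 10 derived from the limit inferred above.

```python
import time


class Throttle:
    """Block so that successive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous call, if any

    def wait(self):
        """Sleep just long enough to honour the interval; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


# Assumed budget: about 10 requests per minute, i.e. one request every 6 seconds.
hkex_throttle = Throttle(min_interval=60 / 10)
```

Calling hkex_throttle.wait() immediately before each HTTP request keeps a single process under the assumed budget without sleeping longer than necessary.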

Document Type Codes: A Critical Mapping

Each filing type on HKEXnews is assigned a numeric code. The most relevant for prospectus analysis are:

1 = Annual Reports
2 = Interim Reports
6 = Prospectus (listing document)
7 = Application Proof (red herring)
8 = Listing Document (final version)
9 = Supplemental Announcements
10 = Notices of Meeting
11 = Circulars

For IPO-specific workflows, the critical codes are 6, 7, and 8. A single IPO typically generates an application proof (code 7), published when the listing application is submitted and therefore usually months ahead of listing, a prospectus (code 6), and a final listing document (code 8) filed on the listing day itself. The application proof is often the most useful for pre-listing analysis, as it contains the full business description, financial statements, and risk factors without the final pricing details.
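
Keeping the code mapping in one place avoids magic numbers scattered through a pipeline. The sketch below encodes the codes exactly as listed in this article; DocType, IPO_DOC_TYPES, and is_ipo_filing are illustrative names, not part of any HKEX library, and the numeric values should be re-checked against the live site before use.

```python
from enum import IntEnum


class DocType(IntEnum):
    """Document type codes as listed in this article (not an official HKEX enum)."""
    ANNUAL_REPORT = 1
    INTERIM_REPORT = 2
    PROSPECTUS = 6
    APPLICATION_PROOF = 7
    LISTING_DOCUMENT = 8
    SUPPLEMENTAL_ANNOUNCEMENT = 9
    NOTICE_OF_MEETING = 10
    CIRCULAR = 11


# The three codes this article treats as IPO-specific.
IPO_DOC_TYPES = (DocType.PROSPECTUS, DocType.APPLICATION_PROOF, DocType.LISTING_DOCUMENT)


def is_ipo_filing(code):
    """Return True if a numeric (or numeric-string) code is an IPO-specific type."""
    try:
        return DocType(int(code)) in IPO_DOC_TYPES
    except (ValueError, TypeError):
        return False  # unknown, missing, or non-numeric code
```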

PDF URL Construction and Direct Download Links

Once the search results are parsed, each filing entry contains an href attribute pointing to a relative path like /listedco/listconews/sehk/2025/0301/2025030101234.pdf. The full download URL is constructed by prepending https://www.hkexnews.hk. The PDFs are served directly, with no authentication required for public filings older than 6 months. For filings within the last 6 months, a CAPTCHA may appear after 3-5 consecutive downloads from the same IP. A practical workaround is to use rotating proxy services or to space requests with a time.sleep(15) interval between each download.
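
The URL construction step can be sketched with the standard library. This is an illustrative helper, assuming hrefs arrive as relative paths in the form shown above; absolute_pdf_url and unique_pdf_urls are hypothetical names. Using urljoin instead of string concatenation also tolerates hrefs that are already absolute.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.hkexnews.hk"  # host named in the article


def absolute_pdf_url(href, base=BASE_URL):
    """Turn a relative href from the results table into a full download URL."""
    return urljoin(base + "/", href.strip())


def unique_pdf_urls(hrefs):
    """Resolve hrefs, keep only PDFs, and drop duplicates while preserving order."""
    seen = set()
    urls = []
    for href in hrefs:
        url = absolute_pdf_url(href)
        if url.lower().endswith(".pdf") and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```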

Building the Python Downloader: A Modular Approach

Core Libraries and Environment Setup

The minimum viable Python environment requires requests, beautifulsoup4, lxml, pandas, and tqdm for progress tracking. The requests library handles HTTP sessions with persistent cookies, which reduces the likelihood of CAPTCHA triggers. A typical session initialization block is:

import requests
from bs4 import BeautifulSoup
import time
import os

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://www.hkexnews.hk/'
})

This header configuration mimics a standard browser request, avoiding the default Python requests user-agent which is often blocked.

Search Query Construction and Pagination Handling

The search endpoint returns results in pages of 20 entries. The page parameter is passed as start=0 for page 1, start=20 for page 2, and so on. A robust function must iterate through all pages until the total result count is exhausted. The total count is embedded in the HTML as a string like “Total records: 47”. Parsing this with a regex re.search(r'Total records: (\d+)', response.text) yields the integer. A sample loop structure:

import re

def search_filings(stock_code, doc_type, from_date, to_date, delay=6):
    """Return the PDF hrefs for one stock and document type, across all result pages."""
    base_url = "https://www1.hkexnews.hk/search/titlesearch.xhtml"
    params = {
        'lang': 'EN',
        'category': '0',
        'market': 'SEHK',
        'stockId': stock_code,
        'from': from_date,
        'to': to_date,
        'documentType': doc_type,
        'sortDir': 'desc',
        'sortByOptions': 'DateTime'
    }
    all_links = []
    start = 0
    while True:
        params['start'] = start
        resp = session.get(base_url, params=params, timeout=30)
        resp.raise_for_status()  # surface HTTP 429 and other errors early
        soup = BeautifulSoup(resp.text, 'lxml')
        # extract links from this page
        for link in soup.select('a[href*=".pdf"]'):
            all_links.append(link['href'])
        # parse total records; treat a missing counter as "no more pages"
        total_tag = soup.find('span', class_='total')
        match = re.search(r'\d+', total_tag.text) if total_tag else None
        total = int(match.group()) if match else 0
        start += 20
        if start >= total:
            break
        time.sleep(delay)  # stay under roughly 10 requests per minute
    return all_links
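
The pagination arithmetic can be isolated and checked without touching the network. The sketch below assumes the “Total records: N” marker and the page size of 20 described in this section; parse_total_records and page_offsets are hypothetical helper names.

```python
import re

PAGE_SIZE = 20  # entries per results page, per the article


def parse_total_records(html_text):
    """Return the integer from 'Total records: N', or 0 if the marker is absent."""
    match = re.search(r"Total records:\s*(\d+)", html_text)
    return int(match.group(1)) if match else 0


def page_offsets(total, page_size=PAGE_SIZE):
    """List the `start` values needed to cover `total` records."""
    return list(range(0, total, page_size))
```

For 47 records this gives start values of 0, 20, and 40, i.e. three requests per query.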

PDF Download with Retry Logic and Error Handling

Network interruptions and server timeouts are common when downloading from HKEXnews, particularly during peak hours (10:00-12:00 HKT). A retry decorator using tenacity or a simple while loop with exponential backoff is recommended. The download function should verify file integrity by checking the PDF header (%PDF-1.x) and the file size (minimum 10 KB for a valid document). Corrupted files should be deleted and re-queued.

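def download_pdf(url, save_path, max_retries=3, min_bytes=10 * 1024):
    for attempt in range(max_retries):
        try:
            resp = session.get(url, timeout=30)
            # accept only a 200 response that looks like a PDF of plausible size
            if (resp.status_code == 200 and resp.content[:4] == b'%PDF'
                    and len(resp.content) >= min_bytes):
                with open(save_path, 'wb') as f:
                    f.write(resp.content)
                return True
        except requests.RequestException:
            pass  # network error or timeout; retry after backing off
        time.sleep(2 ** attempt)  # exponential backoff: 1 s, 2 s, 4 s
    return False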
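
Files already on disk can be re-checked with the same two criteria, which is useful after an interrupted batch. This is a sketch under the article's stated thresholds (a %PDF header and a 10 KB floor); is_valid_pdf and purge_invalid are hypothetical names, and the returned list is meant to be fed back into whatever download queue the pipeline uses.

```python
import os

MIN_PDF_BYTES = 10 * 1024  # 10 KB floor suggested in the article


def is_valid_pdf(path, min_bytes=MIN_PDF_BYTES):
    """Check the %PDF magic bytes and a minimum file size."""
    if not os.path.isfile(path) or os.path.getsize(path) < min_bytes:
        return False
    with open(path, "rb") as f:
        return f.read(4) == b"%PDF"


def purge_invalid(paths, min_bytes=MIN_PDF_BYTES):
    """Delete files that fail validation and return their paths for re-queueing."""
    requeue = []
    for path in paths:
        if not is_valid_pdf(path, min_bytes):
            if os.path.exists(path):
                os.remove(path)
            requeue.append(path)
    return requeue
```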

Parsing Prospectus PDFs: Extracting Structured Data

Text Extraction with PyMuPDF (fitz) vs. pdfplumber

Two libraries dominate PDF text extraction for financial documents. PyMuPDF (fitz) is faster, approximately 3x faster than pdfplumber on a 500-page prospectus, but its text ordering can be unreliable for multi-column layouts common in Hong Kong prospectuses. pdfplumber is slower but preserves spatial positioning, making it superior for extracting tables and footnotes. For prospectus analysis, a hybrid approach works best: use PyMuPDF for full-text search (e.g., locating “Risk Factors” section headers) and pdfplumber for table extraction (e.g., financial statement tables in the accountant’s report).
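
The hybrid approach can be sketched as a fast scan followed by a targeted slow pass. The function names below are hypothetical, and the heading match (a line consisting only of the heading text) is an assumption about layout that will not hold for every prospectus. PyMuPDF and pdfplumber are imported lazily so that the pure matching helper can be used and tested on its own.

```python
import re


def find_heading_pages(page_texts, heading):
    """Return 1-based page numbers whose text has `heading` on a line of its own."""
    pattern = re.compile(r"^\s*" + re.escape(heading) + r"\s*$",
                         re.IGNORECASE | re.MULTILINE)
    return [i for i, text in enumerate(page_texts, start=1) if pattern.search(text)]


def heading_pages_in_pdf(pdf_path, heading):
    """Fast pass with PyMuPDF: scan every page's text for the heading."""
    import fitz  # PyMuPDF
    with fitz.open(pdf_path) as doc:
        return find_heading_pages((page.get_text() for page in doc), heading)


def tables_on_pages(pdf_path, page_numbers):
    """Slow pass with pdfplumber, restricted to the 1-based pages found above."""
    import pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        return {n: pdf.pages[n - 1].extract_tables() for n in page_numbers}
```

Restricting pdfplumber to a handful of pages is where the time saving comes from, since table extraction is the expensive step.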

Section Identification Using Keyword Anchors

Hong Kong prospectuses follow a standardised structure under the HKEX Listing Rules (Chapter 9, Rule 9.10(1)), which mandates the inclusion of specific sections. The table of contents (TOC) is typically on pages 4-8. Parsing the TOC via pdfplumber’s extract_text() and then mapping page numbers to section titles allows for targeted extraction. A regex pattern like r'(?i)(Summary|Risk Factors|Business Overview|Financial Information|Directors and Senior Management|Use of Proceeds)' can identify section boundaries. The page number is taken from the TOC entry, and the corresponding text block is extracted with pdfplumber’s pages[index].extract_text(), where index is the zero-based PDF page. Because the TOC lists printed page numbers, index equals the printed number minus one only after adding the count of unnumbered or roman-numbered front-matter pages.
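
A TOC parser along these lines might look as follows. It is a sketch that assumes each TOC entry is a single line ending in dot leaders and an arabic page number; entries with appendix-style numbering such as I-1 are deliberately skipped. TOC_LINE, parse_toc, and pdf_index are hypothetical names, and front_matter_pages (the number of PDF pages before printed page 1) has to be determined per document.

```python
import re

# One TOC line: a title, dot leaders (spaced or not), then a printed page number.
TOC_LINE = re.compile(r"^\s*(?P<title>.+?)[\s.]{2,}(?P<page>\d+)\s*$")


def parse_toc(toc_text):
    """Map lower-cased section titles to printed page numbers."""
    toc = {}
    for line in toc_text.splitlines():
        m = TOC_LINE.match(line)
        if m:
            toc[m.group("title").strip().lower()] = int(m.group("page"))
    return toc


def pdf_index(printed_page, front_matter_pages):
    """Convert a printed page number to a 0-based PDF page index."""
    return printed_page + front_matter_pages - 1
```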

Financial Table Extraction: A Practical Example

The accountant’s report, typically audited by one of the Big Four firms, contains the income statement, balance sheet, and cash flow statement for the track record period (usually 3 financial years). pdfplumber’s extract_tables() method returns a list of lists. A common issue is merged cells and multi-line headers. A post-processing function must flatten these into a pandas DataFrame. For example, the income statement table for a 2024 IPO of a PRC-based consumer goods company might have columns: “Item”, “FY2021 (RMB’000)”, “FY2022 (RMB’000)”, “FY2023 (RMB’000)”. The extraction code:

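import pdfplumber
import pandas as pd

with pdfplumber.open("prospectus.pdf") as pdf:
    for page in pdf.pages[30:35]:  # illustrative range; locate the accountant's report via the TOC
        for table in page.extract_tables():
            if len(table) < 2:
                continue  # need a header row plus at least one data row
            df = pd.DataFrame(table[1:], columns=table[0])
            # only inspect tables that actually have an 'Item' column
            if 'Item' in df.columns and 'Revenue' in df['Item'].values:
                print(df)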
This approach yields structured financial data that can be fed into valuation models once the extracted cells, which pdfplumber returns as strings, are converted to numbers.
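
Extracted cells are text, so figures such as “(1,234)” need converting before any arithmetic. The helper below is a minimal sketch assuming the common accounting conventions of thousands separators, parentheses for negatives, and a dash for nil; parse_amount and clean_row are hypothetical names, and real tables may need further handling for footnote markers or percentages.

```python
NIL_MARKERS = {"", "-", "\u2013", "\u2014", "N/A"}  # hyphen, en dash, em dash


def parse_amount(cell):
    """Convert a prospectus-style figure such as '(1,234)' to a float, else None."""
    if cell is None:
        return None
    text = str(cell).replace(",", "").replace("\n", "").strip()
    if text in NIL_MARKERS:
        return None
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    try:
        value = float(text)
    except ValueError:
        return None  # not a number, e.g. a label cell
    return -value if negative else value


def clean_row(row):
    """Keep the first cell as the label and parse the remaining cells as numbers."""
    return [row[0]] + [parse_amount(c) for c in row[1:]]
```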

Compliance, Rate Limiting, and Operational Considerations

HKEX Terms of Use and Acceptable Usage Policy

HKEX’s Terms of Use (accessible at https://www.hkex.com.hk/eng/legal/terms.htm) explicitly prohibit “systematic or automated retrieval of data” that imposes an “unreasonable or disproportionately large load” on HKEX’s infrastructure. While the terms do not define “unreasonable” quantitatively, a safe threshold is 100 requests per hour per IP address, based on precedent from published developer guidelines for similar financial data platforms. Violation may result in IP blocking or, in extreme cases, legal action under the Securities and Futures Ordinance (Cap. 571). Practitioners should leave at least 6 seconds between requests (time.sleep(6)) to stay under the observed ceiling of roughly 10 requests per minute; honouring the more conservative 100-requests-per-hour budget requires a 36-second gap (3600 / 100).
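
The trade-off between the two budgets is easy to quantify. The arithmetic below is a back-of-the-envelope sketch: the function names are hypothetical, and the workload of 50 deals with 20 filings each is taken from the ranges quoted in the introduction rather than from measured data.

```python
def min_interval_seconds(max_requests, window_seconds):
    """Smallest gap between requests that keeps within a request budget."""
    return window_seconds / max_requests


def pipeline_hours(n_deals, filings_per_deal, interval_seconds):
    """Rough wall-clock hours to fetch every filing at a fixed interval."""
    return n_deals * filings_per_deal * interval_seconds / 3600


observed = min_interval_seconds(10, 60)         # observed limit: 10 per minute
conservative = min_interval_seconds(100, 3600)  # conservative budget: 100 per hour

# Quarterly workload assumed from the introduction: 50 deals x 20 filings.
hours_conservative = pipeline_hours(50, 20, conservative)
```

Under the conservative budget a full quarter's pipeline takes on the order of ten hours from a single IP address, which is still practical as an overnight batch job.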

Proxy Rotation and CAPTCHA Mitigation

For institutional users who need to process 50+ IPOs simultaneously, rotating proxies are essential. Services like BrightData or Oxylabs provide Hong Kong residential proxies at approximately USD 0.60 per GB. A proxy rotation pattern:

from itertools import cycle

proxies = ['http://user:pass@proxy1:port', 'http://user:pass@proxy2:port']
proxy_pool = cycle(proxies)  # round-robin over the proxy list
# download_queue holds (url, save_path) pairs built from the search results
for url, save_path in download_queue:
    proxy = next(proxy_pool)
    session.proxies = {'http': proxy, 'https': proxy}
    download_pdf(url, save_path)
    time.sleep(10)

This pattern distributes the load across multiple IPs, reducing the likelihood of CAPTCHA triggers.

Data Storage and Version Control

Each downloaded prospectus should be stored with a consistent naming convention: {stock_code}_{listing_date}_{document_type}.pdf. For example, 09988_20250301_prospectus.pdf. A companion JSON metadata file should record the download timestamp, file hash (SHA-256), and source URL for auditability. This metadata is critical for regulatory compliance under the SFC’s Code of Conduct for Persons Licensed by or Registered with the Securities and Futures Commission (Chapter 3, Paragraph 3.1), which requires firms to maintain records of all research inputs.
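
The naming convention and sidecar metadata can be sketched with the standard library alone. The helper names and JSON field names below are illustrative assumptions, not a format required by any regulator; the five-digit zero-padding follows the 09988 example above.

```python
import hashlib
import json
import os
from datetime import datetime, timezone


def filing_name(stock_code, listing_date, document_type):
    """Build '{stock_code}_{listing_date}_{document_type}.pdf' with a 5-digit code."""
    return f"{str(stock_code).zfill(5)}_{listing_date}_{document_type}.pdf"


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so large prospectuses are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_metadata(pdf_path, source_url):
    """Write a sidecar '<name>.json' recording hash, UTC timestamp, and source URL."""
    meta = {
        "file": os.path.basename(pdf_path),
        "sha256": sha256_of(pdf_path),
        "source_url": source_url,
        "downloaded_at": datetime.now(timezone.utc).isoformat(),
    }
    meta_path = os.path.splitext(pdf_path)[0] + ".json"
    with open(meta_path, "w", encoding="utf-8") as f:
        json.dump(meta, f, indent=2)
    return meta_path
```

Re-hashing a stored file and comparing it with the recorded digest later shows whether the copy on disk is still the one that was downloaded.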

Actionable Takeaways

Implement a modular Python pipeline using requests and BeautifulSoup to query HKEXnews, parse HTML results, and download PDFs with retry logic, targeting a maximum of 10 requests per minute per IP to comply with implied rate limits.
Map document type codes precisely: use code 7 for application proofs (red herrings) and code 6 for final prospectuses, as these are the two most critical filings for pre-listing analysis.
Adopt a hybrid PDF parsing strategy: use PyMuPDF for full-text search and section identification, and pdfplumber for structured table extraction from accountant’s reports.
Maintain a metadata registry with SHA-256 hashes and download timestamps for each filing, satisfying SFC record-keeping requirements under the Code of Conduct.
Rotate Hong Kong residential proxies and enforce a 6-second minimum delay between requests to avoid CAPTCHA triggers and IP blocks from HKEX’s infrastructure.