Prospectus · 2026-01-21
Building an Industry Research Database from Prospectus Data Points: A Systematic Approach
The 2025-2026 pipeline for Hong Kong initial public offerings is projected to include over 80 new applicants, spanning sectors from advanced manufacturing to biotechnology, according to HKEX’s December 2024 market statistics. For institutional investors and sponsors conducting due diligence, the challenge is no longer a scarcity of data but the fragmentation of it. Each prospectus—whether a Main Board application or a GEM listing document—contains hundreds of discrete data points: revenue breakdowns by geography, customer concentration ratios, patent expiry schedules, and working capital projections. Yet these are typically read, annotated, and then archived as static PDFs. The institutional advantage accrues to those who can systematise this extraction, transforming unstructured narrative into a queryable industry research database. This article outlines a repeatable methodology for building such a database, grounded in Hong Kong’s regulatory framework and the specific disclosure requirements of the SFC’s Code of Conduct and HKEX Listing Rules.
The Regulatory Mandate for Standardised Data Extraction
Disclosure Density Under the Listing Rules
HKEX Listing Rules Chapter 11 (Equity Securities) and Chapter 18A (Biotechnology Companies) mandate granular financial and operational disclosures that create a natural taxonomy for database construction. For a Main Board applicant, the prospectus must contain three years of audited financials under HKFRS, segment reporting under HKFRS 8 (Operating Segments, which superseded HKAS 14), and a business section that includes key performance indicators such as average selling price trends, capacity utilisation rates, and order book visibility. A systematic database captures these as structured fields: for a manufacturing company, one might record “capacity (units)” for FY2023, FY2024, and FY2025E, cross-referenced to the utilisation percentage disclosed in the “Business” section.
The 2023 amendments to the SFC’s Code of Conduct (paragraph 17.6) introduced enhanced requirements for sponsor due diligence on revenue recognition and customer verification. This has led to prospectuses now including detailed breakdowns of top-10 customer lists, contract durations, and renewal rates—data points that, when aggregated across multiple applicants in the same industry, reveal pricing power and switching costs. A database capturing “customer concentration (top 3 as % of revenue)” across five peer companies in the Chinese EV supply chain, for example, immediately highlights which firms are exposed to single-customer risk versus those with diversified bases.
GEM vs. Main Board: Different Data Granularity
GEM Listing Rules (Chapter 16) require less historical financial data—typically two years versus three for the Main Board—but mandate more detailed business descriptions for smaller-cap companies. A GEM applicant in the software-as-a-service space must disclose monthly recurring revenue (MRR) and churn rates if these are key performance indicators. A systematic database must therefore include a “listing board” field and a “disclosure tier” flag to normalise comparisons. Without this, an analyst comparing a GEM company’s 24-month revenue trajectory to a Main Board company’s 36-month history would draw false conclusions.
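One way to apply the “listing board” field is to restrict any growth comparison to the fiscal years both companies disclose. The sketch below is illustrative only: the `RevenueHistory` class, the `disclosure_tier` labels, and the figures are assumptions made for this example, not values from any prospectus.

```python
from dataclasses import dataclass


@dataclass
class RevenueHistory:
    company: str
    listing_board: str       # "GEM" or "Main Board"
    disclosure_tier: str     # hypothetical label, e.g. "2-year" or "3-year"
    revenue_by_year: dict    # fiscal year (int) -> revenue in HKD million


def common_window(a, b):
    """Return the sorted fiscal years that both companies disclose."""
    years = sorted(set(a.revenue_by_year) & set(b.revenue_by_year))
    if len(years) < 2:
        raise ValueError("need at least two overlapping fiscal years to compare")
    return years


def growth_over_window(history, years):
    """Revenue growth from the first to the last year of the shared window."""
    first, last = years[0], years[-1]
    return history.revenue_by_year[last] / history.revenue_by_year[first] - 1


# Illustrative figures only, not taken from any real prospectus.
gem_co = RevenueHistory("GemCo", "GEM", "2-year", {2023: 100.0, 2024: 130.0})
main_co = RevenueHistory("MainCo", "Main Board", "3-year",
                         {2022: 400.0, 2023: 500.0, 2024: 550.0})

years = common_window(gem_co, main_co)
print(years, growth_over_window(gem_co, years), growth_over_window(main_co, years))
```

Comparing only the overlapping years keeps the Main Board company's extra year from distorting the growth figure.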
Building the Database Architecture
Schema Design: From Prospectus to Structured Fields
The core schema for an industry research database derived from prospectuses should contain five entity types: Company, Financial Statement, Business Segment, Risk Factor, and Use of Proceeds. Each entity carries mandatory fields derived from the disclosures the HKEX Listing Rules require; the Company and Financial Statement entities are set out below.
For the Company entity, fields include:
- Stock code (post-listing) or applicant reference number
- Industry classification (HKEX Industry Classification System, 4-digit code)
- Jurisdiction of incorporation (BVI, Cayman Islands, Bermuda, or Hong Kong)
- VIE structure flag (yes/no, per HKEX Listing Decision LD43-3)
- Sponsor firm(s) and reporting accountant
For the Financial Statement entity, fields must capture:
- Reporting currency and constant currency adjustment if applicable
- Revenue (HKD), gross profit, operating profit, and net profit for each year of the track record period (three years for the Main Board, typically two for GEM)
- EBITDA as disclosed (or derived from cash flow statement)
- Working capital ratio (current assets/current liabilities)
- Net debt/EBITDA leverage ratio
A 2024 study by the Hong Kong Institute of Certified Public Accountants found that 62% of prospectuses now disclose adjusted EBITDA as a non-HKFRS measure, typically in a reconciliation note. The database should capture both the statutory and adjusted figures, with a “reconciliation provided” boolean field.
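The Company and Financial Statement entities above could be sketched as Python dataclasses. This is a minimal illustration, assuming the field names shown (they are invented for this example) and leaving the other three entities out; a production schema would more likely live in a relational database.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Company:
    identifier: str              # stock code, or applicant reference pre-listing
    industry_code: str           # HKEX industry classification code
    jurisdiction: str            # e.g. "Cayman Islands"
    has_vie_structure: bool
    listing_board: str           # "Main Board" or "GEM"
    sponsors: List[str] = field(default_factory=list)
    reporting_accountant: str = ""


@dataclass
class FinancialStatement:
    company_id: str
    fiscal_year: int
    reporting_currency: str
    revenue: float
    gross_profit: float
    operating_profit: float
    net_profit: float
    current_assets: float
    current_liabilities: float
    net_debt: float
    ebitda: Optional[float] = None            # statutory or derived figure
    adjusted_ebitda: Optional[float] = None   # non-HKFRS measure, if disclosed
    reconciliation_provided: bool = False

    def working_capital_ratio(self) -> Optional[float]:
        """Current assets divided by current liabilities."""
        if self.current_liabilities == 0:
            return None
        return self.current_assets / self.current_liabilities

    def net_debt_to_ebitda(self) -> Optional[float]:
        """Leverage ratio; undefined when EBITDA is missing or not positive."""
        if self.ebitda is None or self.ebitda <= 0:
            return None
        return self.net_debt / self.ebitda
```

Storing the inputs and deriving the ratios on demand avoids stale values when a figure is later restated.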
Data Extraction Methodology: Human-in-the-Loop
Automated extraction using natural language processing (NLP) can parse the “Summary” and “Financial Information” sections of a prospectus with 85-90% accuracy for numeric fields, based on testing by the author’s team on 50 HKEX prospectuses from the 2024 cohort. However, narrative fields—such as “competitive strengths” or “risk factors”—require human review. The recommended workflow is:
- Automated pass: Use regex patterns to extract currency amounts, dates, and percentages. For example, “revenue of HKD 1,234 million” becomes a structured record.
- Manual verification: A research analyst reviews the extracted data against the PDF, flagging any discrepancies. This step is critical for non-standard disclosures, such as “revenue from contracts with customers” under HKFRS 15, which may appear in a different section of the prospectus.
- Normalisation: Convert all financial figures to HKD using the exchange rate disclosed in the prospectus (typically the HKMA’s closing rate on the balance sheet date). For cross-border comparisons, also store the original currency figure.
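The automated pass and the normalisation step might look like the following sketch. The pattern covers only phrases of the form “revenue of HKD 1,234 million”; the metric list, the currency codes, and the exchange rate are assumptions for illustration, and real prospectus wording will need many more patterns plus the manual verification step.

```python
import re
from decimal import Decimal

# Hypothetical pattern: "<metric> of <currency> <number> <scale>"
AMOUNT_RE = re.compile(
    r"(?P<metric>revenue|gross profit|net profit)\s+of\s+"
    r"(?P<currency>HKD|RMB|USD)\s*"
    r"(?P<value>\d[\d,]*(?:\.\d+)?)\s*"
    r"(?P<scale>thousand|million|billion)?",
    re.IGNORECASE,
)
SCALE = {"thousand": Decimal(10) ** 3,
         "million": Decimal(10) ** 6,
         "billion": Decimal(10) ** 9}


def extract_amounts(text):
    """Return one structured record per matched currency amount."""
    records = []
    for m in AMOUNT_RE.finditer(text):
        scale = SCALE.get((m.group("scale") or "").lower(), Decimal(1))
        amount = Decimal(m.group("value").replace(",", "")) * scale
        records.append({
            "metric": m.group("metric").lower(),
            "currency": m.group("currency").upper(),
            "amount": amount,
        })
    return records


def to_hkd(record, rate_to_hkd):
    """Add an HKD figure while keeping the original currency amount."""
    rate = Decimal(str(rate_to_hkd))
    return {**record, "fx_rate": rate, "amount_hkd": record["amount"] * rate}


# Illustrative sentence and exchange rate, not from a real prospectus.
sample = ("The Group recorded revenue of HKD 1,234 million and "
          "net profit of RMB 56.5 million in FY2024.")
records = extract_amounts(sample)
normalised = to_hkd(records[1], "1.08")
```

`Decimal` is used instead of `float` so that extracted figures are stored exactly as printed.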
The SFC’s 2024 thematic review on prospectus quality (published January 2025) noted that 18% of prospectuses contain at least one material discrepancy between the “Summary” and the detailed financial notes. A database that sources only from the summary section will propagate these errors. The extraction protocol must therefore treat the audited financial statements in the Accountants’ Report (usually Appendix I, rather than a fixed page range) as the authoritative source and use the summary only as a cross-check.
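A source-precedence rule of this kind can be sketched as a small merge function. The 0.5% rounding tolerance and the field names are assumptions for illustration; a team would set its own threshold for what counts as a discrepancy worth reviewing.

```python
def reconcile_sources(summary, audited, tolerance=0.005):
    """Merge two extracts, preferring audited values and flagging mismatches.

    `summary` and `audited` map field names to numeric values.
    `tolerance` is a relative rounding allowance (assumed, not prescribed).
    """
    merged = {}
    for name, audited_value in audited.items():
        summary_value = summary.get(name)
        mismatch = (
            summary_value is not None
            and abs(summary_value - audited_value) > tolerance * abs(audited_value)
        )
        merged[name] = {"value": audited_value, "source": "audited",
                        "summary_mismatch": mismatch}
    # Fields that appear only in the summary are kept but marked as such.
    for name, summary_value in summary.items():
        if name not in audited:
            merged[name] = {"value": summary_value, "source": "summary",
                            "summary_mismatch": False}
    return merged


# Illustrative figures in HKD million.
merged = reconcile_sources(
    summary={"revenue": 1234.0, "net_profit": 90.0, "store_count": 50},
    audited={"revenue": 1234.2, "net_profit": 84.0},
)
```

Records with `summary_mismatch` set would be routed to the analyst for manual review.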
Industry-Specific Data Points and Cross-Sector Comparisons
Biotechnology Sector: Pipeline and Patent Data
Under HKEX Listing Rules Chapter 18A, biotech applicants must disclose their core product’s phase of clinical development, target indication, and regulatory status in at least one major market (PRC NMPA, US FDA, or EU EMA). A systematic database for the biotech vertical should capture:
- Core product name and mechanism of action
- Current phase (Phase I, II, III, or NDA submission)
- Primary endpoint and whether it was met (for Phase II/III trials)
- Patent expiry date for the core product’s composition-of-matter patent
- Cash runway in months, calculated from the “Use of Proceeds” section
A 2025 analysis of 12 HKEX-listed biotech firms (all post-Chapter 18A listings) found that the median cash runway at listing was 28 months, but firms disclosing a “strategic partnership” in the prospectus had a runway of 34 months. This data point, when extracted and normalised, becomes a leading indicator for future capital-raising needs.
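The article does not give a formula for cash runway, so the sketch below assumes one common definition: cash on hand plus net IPO proceeds, divided by the average monthly net cash burn. The function name, the inputs, and the figures are all hypothetical.

```python
def cash_runway_months(cash_on_hand, net_ipo_proceeds, annual_net_cash_burn):
    """Months of funding under an assumed definition of runway.

    Returns None when the company is not burning cash, because runway
    is not a meaningful measure in that case.
    """
    if annual_net_cash_burn <= 0:
        return None
    monthly_burn = annual_net_cash_burn / 12
    return (cash_on_hand + net_ipo_proceeds) / monthly_burn


# Illustrative applicants (HKD million): cash, net proceeds, annual burn.
pipeline = {
    "BioCo A": (300.0, 900.0, 480.0),
    "BioCo B": (100.0, 200.0, 240.0),
}
runways = {name: cash_runway_months(*inputs) for name, inputs in pipeline.items()}

# Flag applicants whose runway falls below a chosen threshold, here 18 months.
short_runway = [name for name, months in runways.items()
                if months is not None and months < 18]
```

Whatever definition is adopted, storing its inputs alongside the result keeps the figure auditable against the prospectus.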
Consumer and Retail Sector: Store-Level Economics
For Main Board applicants in the retail sector, prospectuses typically disclose same-store sales growth (SSSG), store count, and average revenue per store. The database should capture these as time-series data:
- Store count at period start and end
- Total retail floor area (sqm)
- SSSG for the most recent two years
- E-commerce revenue as a percentage of total revenue
HKEX Listing Rules require that any key operating metric disclosed in the “Business” section be reconciled to the financial statements. For example, if a company claims “average ticket size of HKD 450,” the prospectus must show how this is derived from revenue and transaction count. The database should store the reconciliation formula as a text field.
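As a rough sketch, the ticket-size reconciliation can be stored and checked as follows. The 1% tolerance, the function name, and the figures are assumptions for illustration.

```python
import math


def reconcile_ticket_size(disclosed_ticket, revenue, transaction_count, rel_tol=0.01):
    """Recompute average ticket size and compare it with the disclosed figure."""
    derived = revenue / transaction_count
    return {
        # Stored as text so analysts can see how the metric was derived.
        "formula": "average ticket size = revenue / transaction count",
        "derived": derived,
        "disclosed": disclosed_ticket,
        "reconciles": math.isclose(derived, disclosed_ticket, rel_tol=rel_tol),
    }


# Illustrative figures: HKD 90 million revenue across 200,000 transactions.
check = reconcile_ticket_size(disclosed_ticket=450.0,
                              revenue=90_000_000.0,
                              transaction_count=200_000)
```

A record where `reconciles` is false points to either an extraction error or a metric defined differently from the stored formula.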
Technology and TMT Sector: User Metrics and Monetisation
Technology applicants often disclose monthly active users (MAU), average revenue per user (ARPU), and customer acquisition cost (CAC). The reported figures are historical, but prospectuses frequently project them forward, and the SFC’s Code of Conduct (paragraph 17.3) requires that any forward-looking statement be based on reasonable assumptions. The database should capture the assumption disclosure for each metric, for example “ARPU growth of 15% assumes no change in pricing strategy.”
A 2024 cross-sector analysis of 30 TMT prospectuses revealed that 40% disclosed MAU but only 25% disclosed the methodology for counting MAU (e.g., “unique users over a 30-day period”). A database that flags “MAU methodology disclosed” versus “not disclosed” provides a quality filter for the analyst.
Maintaining and Updating the Database
Post-Listing Data Integration
A prospectus is a static document at the point of listing, but the company’s subsequent annual and interim reports provide updates. The database should link each prospectus record to the company’s stock code and then ingest data from HKEX’s e-disclosure system on a quarterly basis. For example, the “revenue” field for FY2023 in the prospectus should be overwritten by the audited figure from the 2024 annual report if a restatement occurred.
The SFC’s 2023 guidance on “Post-Listing Compliance” (circular dated 15 June 2023) reminds issuers that any material deviation from the prospectus forecast must be disclosed in a profit warning. The database can flag these deviations by comparing the prospectus’s “profit forecast” field (if disclosed) to the actual results.
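The deviation flag could be computed along these lines. The 10% materiality threshold is an assumption chosen for illustration; the source does not define what counts as material.

```python
def forecast_deviation(forecast_profit, actual_profit):
    """Relative deviation of actual from forecast; None if no usable forecast."""
    if forecast_profit is None or forecast_profit == 0:
        return None
    return (actual_profit - forecast_profit) / abs(forecast_profit)


def is_material_deviation(forecast_profit, actual_profit, threshold=0.10):
    """Flag results that differ from the forecast by at least `threshold`."""
    deviation = forecast_deviation(forecast_profit, actual_profit)
    return deviation is not None and abs(deviation) >= threshold


# Illustrative figures in HKD million.
flagged = is_material_deviation(forecast_profit=200.0, actual_profit=170.0)
```

Prospectuses without a profit forecast return `None` from `forecast_deviation` and are never flagged.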
Cross-Database Linking: Industry Benchmarks
Once a database contains 50-100 prospectus records for a given industry, it becomes a benchmarking tool. For the logistics sector, for example, one can query “average gross margin for PRC-based express delivery companies” and obtain a range of 12-18% from the 2024 cohort. Because every figure traces back to a disclosed source document, the result is easier to verify than sell-side consensus, which often rests on undisclosed assumptions.
The HKEX’s own “IPO Data” portal (launched in 2022) provides aggregate statistics on listing fees, sponsor market share, and sector distribution, but it does not include prospectus-level financial data. A proprietary database fills this gap, enabling the analyst to answer questions such as: “Which biotech applicants in the 2025 pipeline have a cash runway below 18 months?” or “What is the median revenue growth rate for GEM-listed software companies?”
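The two example questions can be expressed as queries against a small in-memory SQLite table. The table layout, column names, and rows below are hypothetical and exist only to show the shape of the query.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prospectus ("
    " company TEXT, sector TEXT, listing_board TEXT,"
    " cash_runway_months REAL, revenue_growth REAL)"
)
# Illustrative rows only, not real applicants.
rows = [
    ("BioCo A", "Biotech", "Main Board", 15.0, None),
    ("BioCo B", "Biotech", "Main Board", 28.0, None),
    ("SoftCo C", "Software", "GEM", None, 0.20),
    ("SoftCo D", "Software", "GEM", None, 0.35),
    ("SoftCo E", "Software", "GEM", None, 0.50),
]
conn.executemany("INSERT INTO prospectus VALUES (?, ?, ?, ?, ?)", rows)

# Which biotech applicants have a cash runway below 18 months?
short_runway = [r[0] for r in conn.execute(
    "SELECT company FROM prospectus"
    " WHERE sector = ? AND cash_runway_months < ? ORDER BY company",
    ("Biotech", 18),
)]

# Median revenue growth for GEM software companies. SQLite has no built-in
# median aggregate, so the values are fetched and the median taken in Python.
growth = [r[0] for r in conn.execute(
    "SELECT revenue_growth FROM prospectus"
    " WHERE sector = ? AND listing_board = ? AND revenue_growth IS NOT NULL",
    ("Software", "GEM"),
)]
median_growth = statistics.median(growth)
```

Parameterised queries keep the peer-group filters reusable across sectors and listing boards.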
Actionable Takeaways
- Build a five-entity schema (Company, Financial Statement, Business Segment, Risk Factor, Use of Proceeds) with mandatory fields derived from HKEX Listing Rules Chapters 11 and 18A, and normalise all financial figures to HKD at the balance sheet date exchange rate.
- Implement a human-in-the-loop extraction workflow that prioritises audited financial statements over the summary section, with a manual verification step to catch the 18% discrepancy rate identified by the SFC’s 2024 thematic review.
- Capture industry-specific metrics (e.g., cash runway for biotech, SSSG for retail, MAU for TMT) with a flag that records whether the prospectus disclosed the calculation methodology or reconciliation, and store the assumptions behind any forward-looking metric, which the SFC’s Code of Conduct paragraph 17.3 requires to be reasonable.
- Link prospectus records to post-listing data from HKEX’s e-disclosure system, refreshing fields quarterly as annual and interim reports are published and flagging deviations from prospectus forecasts as potential profit warning triggers.
- Use the database as a benchmarking tool for cross-sector comparisons, querying median gross margins, customer concentration ratios, and cash runways across peer groups of 50 or more applicants.