Chen Xiang
Processing Hong Kong Legislation data
⚖️


Introduction

Hong Kong’s legal system is rooted in common law, supplemented by statutory legislation. This means that the law is shaped both by past court rulings and by written legislation (read more here). While judgment documents are typically unstructured and often distributed in MS Word format, legislation documents are available in a structured XML format with consistent, semantic tags. This post focuses on how to process legislation documents for AI applications, such as building a semantic database.
 
Before we start, here are some places where legal documents can be found:

| Document Type | Website | Note |
|---|---|---|
| Judgments / Legislation | | Scraping for case law is prohibited. Documents are in well-formatted HTML pages. |
| Legislation / Instruments | | Documents are in PDF / RTF format. |
| Legislation / Instruments | | e-Legislation data in XML format; can be downloaded in bulk. |
| Judgments / Case Summary | | This subpage includes typical cases. |
 
This post processes XML legislation data downloaded from data.gov.hk; a detailed description of each tag can be found here.
 

Document Structure

There are two types of legislation in Hong Kong: (1) Ordinances, which are primary laws passed by the Legislative Council, and (2) Regulations, which include rules, orders, resolutions, and other subsidiary legislation empowered by an ordinance.
Ordinances are identified by the prefix "Cap. " followed by a number, such as "Cap. 1". In contrast, regulations are identified by "Cap. " followed by a number and a letter, for example "Cap. 1A", indicating that the regulation supplements the corresponding ordinance "Cap. 1". Despite the difference in their definitions, they are structured in a similar manner, so we can process them together.
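The naming rule above can be sketched as a small classifier. This is only an illustration: the function name classify_cap and the returned keys are made up for this post, and accepting more than one trailing letter is an assumption rather than something stated above.

```python
import re

# Assumed pattern: "Cap. " + digits, optionally followed by capital letters.
# Allowing several trailing letters is an assumption, not a documented rule.
CAP_PATTERN = re.compile(r"^Cap\.\s*(\d+)([A-Z]*)$")


def classify_cap(doc_name: str) -> dict:
    """Classify a chapter name as an ordinance or a regulation (hypothetical helper)."""
    match = CAP_PATTERN.match(doc_name.strip())
    if match is None:
        raise ValueError(f"Unrecognised chapter name: {doc_name!r}")
    number, suffix = match.groups()
    return {
        "name": f"Cap. {number}{suffix}",
        "kind": "regulation" if suffix else "ordinance",
        # For a regulation this is the ordinance it supplements;
        # for an ordinance it is the ordinance itself.
        "parent": f"Cap. {number}",
    }


print(classify_cap("Cap. 1"))
print(classify_cap("Cap. 1A"))
```

Grouping documents by the "parent" value is one way to keep a regulation next to the ordinance that empowers it.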
 

General Pattern

A simplified legislation document structure is as follows (... means the preceding group of tags may repeat):
<meta>
  <docName>Cap. x</docName>
  <docType>cap</docType>
  <docNumber>x</docNumber>
  <docStatus>In effect/suspended/omitted/repealed/expired</docStatus>
  <dc:identifier>/hk/capx!en</dc:identifier> <!--Unique identifier used when referencing-->
  <dc:date>2024-08-18</dc:date>
  <dc:subject>legislation</dc:subject>
  <dc:language>en</dc:language>
</meta>
<main>
  <part>
    <num></num> <!--The Part Number-->
    <heading></heading> <!--Part Heading-->
    <division> <!--Not Commonly Found-->
      <num></num> <!--The Division Number-->
      <heading></heading> <!--Division Heading-->
      <subdivision></subdivision> <!--Not Commonly Found-->
      <!--The following tag may also be in the subdivision tag-->
      <section>
        <subsections></subsections>
        <!--The following tag may also be in the subsection tag-->
        <num></num> <!--The Section Number-->
        <heading></heading> <!--Section Heading-->
        <leadIn></leadIn> <!--Introductory Text-->
        <content></content>
        <text></text> <!--When there are multiple paragraphs-->
        <paragraph>
          <subparagraph></subparagraph>
          <!--The following tag may also be in the subparagraph tag-->
          <num></num>
          <content></content>
        </paragraph>
        ...
      </section>
      ...
    </division>
    ...
  </part>
  ...
  <schedule>
    <num></num>
    <content></content> <!--May be further split into schedule sections-->
    <section>
      <!--Same as section tag above-->
    </section>
    ...
  </schedule>
  ...
</main>
Although this is the expected pattern, in practice many hierarchical layers, such as parts or divisions, may be omitted, so their child tags appear directly at higher levels of the document. Hence it is reasonable to query directly for section and schedule tags, as they are the primary hierarchical level and are present in almost all legislation.
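One way to query sections and schedules regardless of how deeply they are nested is sketched below with Python's standard xml.etree.ElementTree. The sample XML, its doc root element, and the helper names are made up for illustration. The sketch compares local tag names so that it works whether or not the real files declare an XML namespace, and it deliberately does not descend into a schedule, so sections inside a schedule stay with their schedule.

```python
import xml.etree.ElementTree as ET

# Made-up sample: one section inside a part, one directly under main,
# and one schedule that contains its own section.
SAMPLE = """<doc>
  <main>
    <part>
      <section name="s1"><num>1.</num><heading>Short title</heading></section>
    </part>
    <section name="s2"><num>2.</num><heading>Interpretation</heading></section>
    <schedule name="sch1">
      <num>Schedule 1</num>
      <section name="sch1-s1"><num>1.</num></section>
    </schedule>
  </main>
</doc>"""


def local_name(element):
    """Return the tag name without any '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def top_level_units(root):
    """Collect section and schedule elements at any depth.

    The walk stops at the first match on each branch, so nested
    sections inside a schedule are not returned separately.
    """
    units = []

    def walk(element):
        for child in element:
            if local_name(child) in ("section", "schedule"):
                units.append(child)
            else:
                walk(child)

    walk(root)
    return units


root = ET.fromstring(SAMPLE)
units = top_level_units(root)
print([(local_name(unit), unit.get("name")) for unit in units])
```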
 
Furthermore, sections and schedules usually convey distinct semantic meanings, so it can be practical to chunk a document by its sections or schedules, provided that each chunk is not overly lengthy.
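A minimal sketch of this chunking idea follows. It assumes the section and schedule elements have already been collected, and it uses a character limit (MAX_CHARS, a hypothetical value) as a stand-in for whatever length limit the embedding model imposes. Sections that exceed the limit are split greedily on word boundaries, which is just one possible fallback.

```python
import xml.etree.ElementTree as ET

MAX_CHARS = 80  # hypothetical limit; tune to the embedding model in use

LONG_BODY = "word " * 30  # filler text to force a split in the example

SAMPLE = f"""<main>
  <section name="s1"><num>1.</num><heading>Short title</heading>
    <content>This Ordinance may be cited as the Example Ordinance.</content></section>
  <section name="s2"><num>2.</num><heading>Long section</heading>
    <content>{LONG_BODY}</content></section>
</main>"""


def element_text(element):
    """Flatten an element to plain text with normalised whitespace."""
    return " ".join(" ".join(element.itertext()).split())


def split_long_text(text, max_chars):
    """Greedily pack whole words into pieces of at most max_chars characters."""
    pieces, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            pieces.append(current)
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def chunk_units(units, max_chars=MAX_CHARS):
    """Produce one chunk per unit, or several when a unit is too long."""
    chunks = []
    for unit in units:
        for index, piece in enumerate(split_long_text(element_text(unit), max_chars)):
            chunks.append({"unit": unit.get("name"), "part": index, "text": piece})
    return chunks


chunks = chunk_units(list(ET.fromstring(SAMPLE)))
for chunk in chunks:
    print(chunk["unit"], chunk["part"], len(chunk["text"]))
```

Keeping the unit name and part index on every chunk makes it possible to trace a retrieved chunk back to its section.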
 
A rendered example document structure may look like this:
 

References

A special tag that can appear in many places is the ref tag, which indicates a reference to another document or to another section of the same document.
  • ref tags usually have an href attribute, which is the identifier of the referenced document or section.
  • References to another section in the same document are usually wrapped in a note tag with the attribute role="crossReferences".
    • If there is no href attribute, the ref is also likely to point to another section in the same document.
  • References to other documents usually do not include a section number.
  • There may also be references to older versions of a piece of legislation; these are usually wrapped in a sourceNote tag to indicate when the content was added.
Here are some examples:
<!--Reference within the same document-->
<note class="align_right" role="crossReferences" type="inline"> [ <ref href="/hk/cap1/s62">s. 62</ref> ] </note>

<!--Reference within the same document with no href-->
<content> <ref>section 34</ref> shall not apply to such subsidiary legislation; </content>

<!--Reference to other document-->
<ref href="/hk/cap17">Cap. 17</ref>

<!--Reference to non-current version documents-->
<sourceNote> (Added <ref href="/hk/1998/26">26 of 1998 s. 36</ref> ) </sourceNote>
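The rules of thumb above can be turned into a rough classifier. The sketch below is a heuristic only: the category labels (same_document, other_document, amendment_source) and the function names are invented for this post, and the wrapping section element in the sample exists only so the four examples parse as one document.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<section>
  <note class="align_right" role="crossReferences" type="inline">[<ref href="/hk/cap1/s62">s. 62</ref>]</note>
  <content><ref>section 34</ref> shall not apply to such subsidiary legislation;</content>
  <content>See <ref href="/hk/cap17">Cap. 17</ref>.</content>
  <sourceNote>(Added <ref href="/hk/1998/26">26 of 1998 s. 36</ref>)</sourceNote>
</section>"""


def local_name(element):
    """Return the tag name without any '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def classify_refs(root):
    """Label every ref element using its href and its ancestor tags (heuristic)."""
    results = []

    def walk(element, ancestors):
        if local_name(element) == "ref":
            href = element.get("href")
            in_source_note = any(local_name(a) == "sourceNote" for a in ancestors)
            in_cross_reference = any(
                local_name(a) == "note" and a.get("role") == "crossReferences"
                for a in ancestors
            )
            if in_source_note:
                kind = "amendment_source"
            elif href is None or in_cross_reference:
                kind = "same_document"
            else:
                kind = "other_document"
            results.append(
                {"kind": kind, "href": href, "text": "".join(element.itertext()).strip()}
            )
        for child in element:
            walk(child, ancestors + [element])

    walk(root, [])
    return results


refs = classify_refs(ET.fromstring(SAMPLE))
for ref in refs:
    print(ref)
```

Because the source notes only say what is usually true, a production pipeline would probably want to log refs that fall into none of the expected patterns instead of trusting the fallback label.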
 

Titles

Ordinances have a longTitle tag that wraps a long description setting out the purposes of the Bill or Ordinance. The first section of an ordinance almost always contains the shortTitle of the legislation.
Regulations have a docTitle tag, which wraps the actual title of the regulation (essentially the same as a short title).
<!--example ordinance first section-->
<section id="ID_1438402532765_001" name="s1" reason="inEffect" startPeriod="1997-06-30" status="operational" temporalId="s1">
  <num value="1">1.</num>
  <heading>Short title</heading>
  <content>This Ordinance may be cited as the <shortTitle> Hongkong and Kowloon Wharf and Godown Company Limited (By-laws) Ordinance </shortTitle>. </content>
</section>
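A small sketch of pulling these title tags out of a parsed document is shown below. The sample document, its doc root element, and its longTitle text are made up for illustration; only the tag names longTitle, shortTitle, and docTitle come from the description above. Falling back from shortTitle to docTitle for a display title is a design choice of this sketch, not something the data format prescribes.

```python
import xml.etree.ElementTree as ET

# Made-up ordinance fragment; the longTitle wording is a placeholder.
SAMPLE = """<doc>
  <longTitle>To provide for the making of by-laws by the company.</longTitle>
  <section name="s1">
    <num value="1">1.</num>
    <heading>Short title</heading>
    <content>This Ordinance may be cited as the <shortTitle> Hongkong and Kowloon Wharf and Godown Company Limited (By-laws) Ordinance </shortTitle>. </content>
  </section>
</doc>"""


def local_name(element):
    """Return the tag name without any '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def first_text(root, tag):
    """Return the whitespace-normalised text of the first element named tag, or None."""
    for element in root.iter():
        if local_name(element) == tag:
            return " ".join("".join(element.itertext()).split())
    return None


def extract_titles(root):
    """Gather the title tags and pick one title for display."""
    titles = {
        "long_title": first_text(root, "longTitle"),
        "short_title": first_text(root, "shortTitle"),
        "doc_title": first_text(root, "docTitle"),
    }
    # Ordinances normally carry a shortTitle, regulations a docTitle.
    titles["display_title"] = titles["short_title"] or titles["doc_title"]
    return titles


titles = extract_titles(ET.fromstring(SAMPLE))
print(titles["display_title"])
```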
 

Interpretations (Term definition)

In many of these documents, the first section is dedicated to “Interpretation”, which defines how certain terms should be interpreted within that document.
<!--example of a term interpretation-->
<section id="ID_1438403273502_001" name="s1" reason="inEffect" startPeriod="2023-08-03" status="operational" temporalId="s1">
  <num value="1">1.</num>
  <heading>Interpretation</heading>
  <leadIn>In this Regulation, unless the context otherwise requires—</leadIn>
  <def name="Bus">
    <term>bus</term> ( <term xml:lang="zh-Hant-HK">巴士</term> ) has the same meaning as in the Road Traffic Ordinance ( <ref href="/hk/cap374">Cap. 374</ref> );
    <inline class="width_3"> </inline>
    <sourceNote> ( <ref href="/hk/2009/ln104">L.N. 104 of 2009</ref> ) </sourceNote>
  </def>
  ...
</section>
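Definitions like the one above can be collected into a simple glossary. The sketch below is illustrative: the function name and output keys are invented, the sample is a trimmed copy of the example, and treating a term without an xml:lang attribute as English is an assumption.

```python
import xml.etree.ElementTree as ET

# Attribute key that ElementTree uses for xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<section name="s1">
  <heading>Interpretation</heading>
  <def name="Bus">
    <term>bus</term> (<term xml:lang="zh-Hant-HK">巴士</term>) has the same meaning as in the Road Traffic Ordinance (<ref href="/hk/cap374">Cap. 374</ref>);
    <sourceNote>(<ref href="/hk/2009/ln104">L.N. 104 of 2009</ref>)</sourceNote>
  </def>
</section>"""


def local_name(element):
    """Return the tag name without any '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def extract_definitions(section):
    """Return one record per def element: its terms by language and its full text."""
    definitions = []
    for element in section.iter():
        if local_name(element) != "def":
            continue
        terms = {}
        for term in element.iter():
            if local_name(term) == "term":
                # Assumption: a term without xml:lang is the English term.
                language = term.get(XML_LANG, "en")
                terms[language] = "".join(term.itertext()).strip()
        definitions.append(
            {
                "name": element.get("name"),
                "terms": terms,
                # The flattened text still includes the trailing source note.
                "text": " ".join("".join(element.itertext()).split()),
            }
        )
    return definitions


definitions = extract_definitions(ET.fromstring(SAMPLE))
for definition in definitions:
    print(definition["terms"], "->", definition["text"])
```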