Flinders University Library
Search Smart

Open Access

Open Access is the process of making your research outputs freely and publicly available online.

Basic Search Help

Use the right-hand side menu to browse by:

  • Communities & Collections
  • By Issue Date
  • Authors
  • Titles
  • Subjects

Acts of the Parliament of South Australia 1837 - 2002 - Search Help

Preferred search is by Act Year and Number 

 Searching by Short Title

  • Use Advanced Search
  • Select Title as search type
  • Please note: short title information is only included for acts from 1936 onwards. For acts up to 1935, the subject information from Public General Acts of the Parliament of South Australia 1837 – 1936 Vol 9 Tables and Index (1940) is listed as the title.

Searching by Long Title

  • Use Advanced Search
  • Select Abstract
  • Please note: searching by long title is only available for acts from 1837-1935. For acts 1936 onwards, you may try a full-text search but please be aware of the limitations described below.

Full Text Searching

  • Available from the main search box on each page
  • Or use the Advanced Search and select Keyword as the search type
  • Please note: there are limitations in the full-text search functionality of the Legal History Archive. As the acts have been scanned from original documents of varying paper and print quality, Optical Character Recognition (OCR) does not perform reliably. For further information, see "OCR and Full Text Indexing" below.
 
 

OCR and Full Text Indexing

An Acrobat document that has been scanned, rather than exported from the software that created it (such as a word processor), is just an image of the page.

To a computer, a picture of the letter "A" is not the same as the text character "A," so when you try to text-search a scanned document, you get no hits because there's no text to search.

A distinctive thing about Acrobat files is that they can contain a document image plus the text in the one document. The Acrobat software "reads" the page image and tries to figure out what the text is by using OCR (Optical Character Recognition). While you still see the "image" displayed on screen, the software can also read the underlying text. This allows searching within the Acrobat document. OCR is not perfect, and it works best on first-generation, laser-printed images (just like your eyes do).

The scanned items in the legal history database come from a variety of sources in a variety of conditions. Many of the older volumes have the following problems:

  • they have been rebound several times, resulting in narrow margins (gutter shadow and curvature problems in the scanned image),
  • in rebinding, some signatures have been skewed, leading to possible truncation of some end-of-line characters,
  • some pages include manual annotations and text underlining (often in beautiful copperplate penmanship).

Furthermore,

  • some papers have significant ink bleeding in smaller point sizes (the voids of letters like "c" or "t" fill in and the letter can be mistaken for an "e" or "o"),
  • letter spacing is variable (words are unintentionally split into smaller components by the OCR process).

All of these issues and many others impact upon the accuracy of automated OCR.

The OCR text layer of each Acrobat document is added to the full text index within DSpace and becomes searchable.
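
As a rough illustration of why OCR errors matter for searching, the sketch below runs an exact phrase match over three invented OCR text layers. The document ids, the sample text, and the phrase_search helper are all hypothetical, and a plain substring match stands in for the real full text index, which will behave differently in detail.

```python
# Hypothetical OCR text layers for three scanned documents (invented for illustration).
ocr_text = {
    "doc1": "the said Commissioners may make and maintain a line of railway",
    "doc2": "the said Commissioners may make and maintain a line of rai1way",   # "l" read as "1"
    "doc3": "the said Com missioners may make and maintain a line of railway",  # word split in two
}

def phrase_search(index, phrase):
    """Return the ids of documents whose OCR text contains the exact phrase."""
    needle = phrase.lower()
    return sorted(doc_id for doc_id, text in index.items() if needle in text.lower())

# Every document "contains" both phrases on the printed page,
# but each search misses the document whose OCR text has an error in that phrase.
hits_railway = phrase_search(ocr_text, "line of railway")
hits_commissioners = phrase_search(ocr_text, "said Commissioners")
print(hits_railway)
print(hits_commissioners)
```

A single misread character or an unintended word split is enough to hide a document from a phrase search, even though a human reader would see the phrase on the page.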

Let's apply this information to searching in the South Australian Legal History collection.
Imagine that we wish to search the South Australian Legal History Archive for the very legal sounding phrase "something or other".

First, let's make some assumptions:

  1. OCR character recognition accuracy is around 99%.
  2. The phrase actually occurs 100 times in the database (in 100 documents, once per document).
  3. The phrase does not appear in any of the metadata records.

We therefore have a target of 100 documents to find based upon a phrase search of the OCR full text index.

With 99% accuracy we could assume that we would find 99 out of the 100 documents. That is, there is an OCR error in one of the one hundred occurrences of the phrase. Yes? Well... no. We should, perhaps, expect to find far fewer occurrences.

The phrase "something or other" (string) contains 18 characters: 16 letters and 2 spaces. For simplicity let’s ignore the spaces and count 16 characters.

The total character count of the 100 instances of the string across all the documents is 1600 characters.

With 99% character accuracy we could expect about 16 characters to be recognised incorrectly in our sample of 1600 characters.

If we were extremely lucky, and all the planets aligned, all of the expected errors would occur in a single occurrence of the string and we would find 99 occurrences, one shy of the desired outcome. This is, perhaps, unlikely.

We should expect to find 99 documents at best (all expected errors in the same document, or there was just one OCR error) and 84 at worst (if 16 errors are spread across the phrase in 16 of the documents, one OCR error in each).

So let’s call it a 60% chance that somewhere between 6 and 9 documents would be missed in our search scenario (documents that contain the phrase but have some OCR error that prevents a match).

However, the real situation could be better or worse than that; it really depends on the accuracy of the OCR. An accuracy of 99% is optimistic, given the issues discussed above.
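
The arithmetic in this scenario can be sketched in a few lines. The best and worst cases follow the reasoning above. The expected_found function adds one further assumption that is not part of the discussion above: that every character is misread independently with the same probability. It is a simple model for illustration, not a measurement of the archive.

```python
PHRASE = "something or other"
LETTERS = sum(1 for ch in PHRASE if ch != " ")  # 16 letters, spaces ignored as in the text
DOCUMENTS = 100

def expected_found(char_accuracy, letters=LETTERS, documents=DOCUMENTS):
    """Expected number of documents found, ASSUMING independent per-character errors."""
    p_phrase_ok = char_accuracy ** letters  # every letter of the phrase must be read correctly
    return documents * p_phrase_ok

# Bounds from the text: about 16 expected errors among 1600 letters at 99% accuracy.
expected_errors = round(LETTERS * DOCUMENTS * (1 - 0.99))
best_case = DOCUMENTS - 1                 # all errors fall in a single occurrence
worst_case = DOCUMENTS - expected_errors  # each error falls in a different occurrence

print(f"best case {best_case}, worst case {worst_case}")
for accuracy in (0.999, 0.99, 0.95, 0.90):
    print(f"accuracy {accuracy:.3f}: about {expected_found(accuracy):.0f} of {DOCUMENTS} found")
```

Under this independence assumption, 99% character accuracy gives roughly 85 of the 100 documents, close to the worst case above, and the figure falls quickly as accuracy drops.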

Additionally, the documents contain margin text, as in this example from the "Blyth and Gladstone Railway 54 and 55 Vic., 1891, No. 522".

The OCR process has difficulty interpreting such an image. The OCR text index for this document contains the following text:

2, The South Australian Railway Commissioners," hereinafter Power to masr 
called 'L the said Commissioners," may make and maintain a line of dw?y. 
822 railway 

 

Here we have around 15 OCR errors and 2 formatting errors in a passage of only 136 characters. If we call that an error rate of 10%, the OCR outcome is 10 times worse than the 1% error rate (99% accuracy) we assumed above.

When Acrobat looks at the page image, it reads straight across the page from left to right. In this case it does not recognise a two-column document, but treats the page as a single unjustified block of text.

So, to our original assumptions about phrase searching, we need to add that a "phrase" only appears as one in the full text index if:

  1. there are no OCR errors, and
  2. where the phrase wraps over more than one line, it is not adjacent to any margin text.
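
The second condition can be illustrated with a small sketch. The page_lines data is a guessed clean reading of the passage quoted above (the margin note wording in particular is an assumption), and read_across and read_by_column are hypothetical helpers that only mimic two possible reading orders. They are not how Acrobat works internally.

```python
# Each tuple is (main column text, margin note text) for one printed line.
# The wording is a guessed clean reading of the sample passage, for illustration only.
page_lines = [
    ('2. The South Australian Railway Commissioners, hereinafter', 'Power to make'),
    ('called "the said Commissioners," may make and maintain a line of', 'railway.'),
    ('railway', ''),
]

def read_across(lines):
    """Read each printed line straight across, as if the page were a single column."""
    return " ".join(" ".join(part for part in line if part) for line in lines)

def read_by_column(lines):
    """Read the whole main column first, then the margin notes."""
    main = " ".join(text for text, _ in lines if text)
    margin = " ".join(note for _, note in lines if note)
    return main + " " + margin

phrase = "hereinafter called"  # wraps from the first printed line onto the second
across = read_across(page_lines)
by_column = read_by_column(page_lines)
found_across = phrase in across        # margin text lands between the two words
found_by_column = phrase in by_column  # the words stay adjacent
print(found_across, found_by_column)
```

Even with no character errors at all, reading straight across puts the margin note between "hereinafter" and "called", so the phrase no longer exists as a phrase in the extracted text.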

In a nutshell, the full text index has some use but it cannot be relied upon to retrieve all documents that actually contain the text that is searched for.

We can be fairly confident of retrieving all documents that contain a two-word phrase, but as the length of the phrase increases, so does the probability that the phrase includes an OCR error. With often only 8 or so words to a line, it is quite likely that phrases over 3 or 4 words in length will span multiple lines, increasing the likelihood of missed matches due to adjacent margin text. As a general rule of thumb, the more recent the Act and the higher the quality of the original, the more accurate the OCR.
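
To give a feel for how phrase length affects the chance of an error-free match, here is a minimal sketch. It assumes 99% character accuracy, an average of 5 letters per word, and independent errors on each letter. All three figures are assumptions chosen for illustration, and the sketch ignores the margin text problem entirely.

```python
CHAR_ACCURACY = 0.99   # the optimistic figure assumed in the scenario above
LETTERS_PER_WORD = 5   # assumption for illustration only

def p_phrase_intact(words, char_accuracy=CHAR_ACCURACY, letters_per_word=LETTERS_PER_WORD):
    """Chance that every letter of a phrase is read correctly, ASSUMING independent errors."""
    return char_accuracy ** (words * letters_per_word)

# Probability of an error-free phrase for a range of phrase lengths.
table = {words: p_phrase_intact(words) for words in (1, 2, 3, 4, 6, 8)}
for words, p in table.items():
    print(f"{words} word(s): {p:.0%} chance of an error-free phrase")
```

The shorter the phrase, the better the odds, which is why searching for one or two distinctive words is generally a safer strategy than searching for a long phrase.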


Examples of problem text for OCR.

Narrow margin shadow, curvature and truncation (caused by rebinding).

Hand annotation.

Acrobat cannot tell the difference between the printed document and the hand annotation.
 

Hand underlining.

Underlining makes it difficult for OCR to isolate the lines of text.
 

Skewed signature binding.

Text at an angle is difficult for OCR to recognise.
 

Ink bleeding.

Does that say "Present Governors." or, as Acrobat OCR interprets it, "~zesenGt wemorr."?