addressparse(3)

NAME

Lingua::EN::AddressParse - manipulate geographical
addresses

SYNOPSIS

    use Lingua::EN::AddressParse;

    my %args =
    (
       country                     => 'Australia',
       auto_clean                  => 1,
       force_case                  => 1,
       abbreviate_subcountry       => 0,
       abbreviated_subcountry_only => 1
    );

    my $address = new Lingua::EN::AddressParse(%args);
    $error = $address->parse("14A MAIN RD.  ST  JOHNS  WOOD NEW SOUTH WALES 2000");
    %my_address = $address->components;
    $suburb = $my_address{suburb};
    $correct_casing = $address->case_all;

REQUIRES

Perl, version 5.004 or higher, Lingua::EN::NameParse,
Locale::SubCountry, Parse::RecDescent

DESCRIPTION

This module takes as input an address or post box in free
format text, such as:

    12/3-5 AUBREY ST MOUNT VICTORIA WA 6133
    "OLD REGRET" WENTWORTH FALLS NSW 2782 AUSTRALIA
    2A OLD SOUTH LOW ST. KEW NEW SOUTH WALES 2123
    GPO Box K318, HAYMARKET, NSW 2000

and attempts to parse it. If successful, the address is
broken down into components, and useful functions can be
performed, such as:

    converting upper or lower case values to name case
    (2 Low St. Kew NSW 2123)

    extracting the address's individual components
    (2,Low St.,KEW,NSW,2123)

    determining the type of format the address is in
    ('suburban')

If the address cannot be parsed, you have the option of
cleaning the address of bad characters, or extracting any
portion that was parsed and the portion that failed.

This module can be used for analysing and improving the
quality of lists of addresses.

DEFINITIONS

The following terms are used by AddressParse to define the
components that can make up an address or post box.

    Post Box - GPO Box K123, LPO 2345, RMS 23 ...

    Property Identifier
        Sub property description - Level, Unit, Apartment, Lot ...
        Property number - 12/66A, 24-34, 2A, 23B/12C, 12/42-44

    Property name - "Old Regret"

    Street name - O'Hare, New South Head, The Causeway

    Street type - Road, Rd., St, Lane, Highway, Crescent, Circuit ...

    Suburb - Dee Why, St. John's Wood ...

    Sub country - NSW, New South Wales, ACT, NY, AZ ...

    Post code - 2062, 34532, SG12A 9ET

    Country - Australia, UK, US or Canada

Refer to the component grammar defined in the AddressGrammar
module for a list of combinations.

The following address formats are currently supported:

    'suburban' - property_identifier(?) street street_type
                 suburb subcountry post_code country(?)
    'post_box' - post_box suburb subcountry post_code country(?)
    'rural'    - property_name suburb subcountry post_code country(?)

METHODS

new

The "new" method creates an instance of an address object
and sets up the grammar used to parse addresses. This must
be called before any of the following methods are invoked.
Note that the object only needs to be created once, and
can be reused with new input data.

Various setup options may be defined in a hash that is
passed as an optional argument to the "new" method.

    my %args =
    (
       country                     => 'Australia',
       auto_clean                  => 1,
       force_case                  => 1,
       abbreviate_subcountry       => 1,
       abbreviated_subcountry_only => 1
    );

    my $address = new Lingua::EN::AddressParse(%args);

country
The country argument must be specified. It determines the
possible list of valid sub countries (states, counties
etc., defined in the Locale::SubCountry module) and post
code formats. Either the full name or abbreviation may be
specified. The currently supported country names and codes
are:

    AU or Australia
    CA or Canada
    GB or United Kingdom
    US or United States

All forms of upper/lower case are acceptable in the
country's spelling. If a country name is supplied that the
module doesn't recognise, it will die.
force_case (optional)
This option will force the "case_all" method to address
case the entire input string, including any unmatched
sections that failed parsing. This option is useful when
you know your data has invalid addresses, but you cannot
filter out or reject them.
auto_clean (optional)
When this option is set to a positive value, any call to
the "parse" method that fails will attempt to 'clean' the
address and then reparse it. See the "clean" method in
Lingua::EN::NameParse for details. This is useful for
dirty data with embedded unprintable or non-alphabetic
characters.
abbreviate_subcountry (optional)
When this option is set to a positive value, the sub
country is forced to its abbreviated form, so "New South
Wales" becomes "NSW". If the sub country is already
abbreviated then its value is not altered.
abbreviated_subcountry_only (optional)
When this option is set to a positive value, only the
abbreviated form of sub country is allowed, such as "NSW"
and not "New South Wales". This will make parsing quicker
and ensure that addresses comply with postal standards
that normally specify abbreviated sub countries only.
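
Because an unrecognised country makes "new" die, callers that
take the country from configuration or user input may want to
trap the failure. The following is a minimal sketch that relies
only on the behaviour described above. The helper name
try_new_parser and its class-name parameter are hypothetical and
are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::EN::AddressParse.
# Calls $class->new(%args) inside eval so that a die (for example on an
# unrecognised country) is returned as an error string instead of ending
# the program. Returns ($parser, undef) or (undef, $error).
sub try_new_parser {
    my ($class, %args) = @_;
    my $parser = eval { $class->new(%args) };
    return ($parser, undef) if defined $parser;
    my $error = $@ || 'constructor returned no object';
    return (undef, $error);
}

# Typical use, assuming the module is installed:
#   require Lingua::EN::AddressParse;
#   my ($address, $error) =
#       try_new_parser('Lingua::EN::AddressParse', country => 'Australia');
#   warn "cannot create parser: $error" unless $address;
```

Testing the value returned by eval, rather than $@ alone, is the
more robust way to detect that the constructor failed.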
parse

    $error = $address->parse("12/3-5 AUBREY ST VERMONT VIC 3133");

The "parse" method takes a single parameter of a text
string containing an address. It attempts to parse the
address and break it down into the components described
above. If the address was parsed successfully, a 0 is
returned, otherwise a 1. This step is a prerequisite for
the following functions.
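
Since "parse" returns 0 on success and 1 on failure, a list of
addresses can be split into parsed and unparsed groups by
testing that return value. The sketch below assumes only this
documented return convention. The helper partition_addresses is
hypothetical, and $parser is assumed to be an object created
with "new" as shown earlier.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::EN::AddressParse.
# Feeds each address to $parser->parse and sorts it into one of two
# lists according to the return value (0 = parsed, anything else = failed).
# Returns two array references: (\@parsed, \@failed).
sub partition_addresses {
    my ($parser, @addresses) = @_;
    my (@parsed, @failed);
    for my $text (@addresses) {
        if ($parser->parse($text) == 0) {
            push @parsed, $text;
        }
        else {
            push @failed, $text;
        }
    }
    return (\@parsed, \@failed);
}

# Typical use, assuming the module is installed:
#   use Lingua::EN::AddressParse;
#   my $address = new Lingua::EN::AddressParse(country => 'Australia');
#   my ($parsed, $failed) = partition_addresses($address, @lines);
#   printf "%d parsed, %d failed\n", scalar @$parsed, scalar @$failed;
```

The same object is reused for every address, in line with the
note under "new" that it only needs to be created once.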
case_all

    $correct_casing = $address->case_all;

The "case_all" method converts the first letter of each
component to capitals and the remainder to lower case,
with the following exception:

    Proper name capitalisation, such as MacNay and O'Brien,
    is observed.

The method returns the entire cased address as text.
case_components

    %my_address = $address->case_components;
    $cased_suburb = $my_address{suburb};

The "case_components" method does the same thing as the
"case_all" method, but returns the address's cased
components in a hash. The following keys are used for each
component:

post_box
property_identifier
property_name
street
street_type
suburb
subcountry
post_code
country
If a key has no matching data for a given address, its
value will be set to the empty string.
components

    %my_address = $address->components;
    $suburb = $my_address{suburb};

The "components" method does the same thing as the
"case_components" method, but each component is returned
as it appears in the input string, with no case
conversion.
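
To see the difference between "components" and
"case_components", the two hashes can be printed side by side
for every key that holds data. This is a sketch only. The helper
component_report is hypothetical, and it assumes that both
methods return a flat key/value list using the keys listed
above, with "parse" already called on $parser.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::EN::AddressParse.
# Builds one line per non-empty component, showing the value as entered
# (from components) next to its cased form (from case_components).
sub component_report {
    my ($parser) = @_;

    # Keys as listed under "case_components" in this page.
    my @keys = qw(
        post_box property_identifier property_name street street_type
        suburb subcountry post_code country
    );

    my %raw   = $parser->components;
    my %cased = $parser->case_components;

    my @lines;
    for my $key (@keys) {
        my $value = $raw{$key};
        next unless defined $value && length $value;   # skip empty components
        my $cased_value = defined $cased{$key} ? $cased{$key} : '';
        push @lines, sprintf '%-20s %s -> %s', $key, $value, $cased_value;
    }
    return @lines;
}

# Typical use, assuming the module is installed and $address was created
# with "new":
#   if ($address->parse("2A OLD SOUTH LOW ST. KEW NSW 2123") == 0) {
#       print "$_\n" for component_report($address);
#   }
```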
properties

The "properties" method returns several properties of the
address as a hash. The following keys are used:

type
The type of format the address is in, as one of the following
strings:

    suburban
    rural
    post_box
    unknown

non_matching
Any unmatched section that was found.
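
The "type" and "non_matching" values can be combined with the
return value of "parse" to report what happened to an address.
A minimal sketch follows. The helper describe_parse is
hypothetical, and it assumes that "properties" returns a flat
key/value list, in the same way as "components".

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::EN::AddressParse.
# Parses $text and returns a one-line summary built from the parse
# return value and the "type" and "non_matching" properties.
sub describe_parse {
    my ($parser, $text) = @_;

    my $failed = $parser->parse($text);      # 0 = success, 1 = failure
    my %props  = $parser->properties;

    my $type = defined $props{type} ? $props{type} : 'unknown';
    return "ok: $type" unless $failed;

    my $rest = defined $props{non_matching} ? $props{non_matching} : '';
    return "failed: type=$type, unmatched='$rest'";
}

# Typical use, assuming the module is installed and $address was created
# with "new":
#   print describe_parse($address, "GPO Box K318, HAYMARKET, NSW 2000"), "\n";
```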

LIMITATIONS

The huge number of character combinations that can form a
valid address makes it impossible to correctly identify
them all.

Valid addresses must contain a suburb, subcountry (state)
and post code, in that order. This format is widely
accepted in Australia and the US. UK addresses will often
include suburb, town, city and county, formats that are
very difficult to parse.

Property names must be enclosed in quotes, like "Old
Regret".

Because of the large combination of possible addresses
defined in the grammar, the program is not very fast.

REFERENCES

"The Wordsworth Dictionary of Abbreviations & Acronyms"
(1997)

Australian Standard AS4212-1994 "Geographic Information
Systems - Data Dictionary for transfer of street
addressing information"

ISO 3166-2:1998, Codes for the representation of names of
countries and their subdivisions. Also released as AS/NZS
2632.2:1999

FUTURE DIRECTIONS

Define grammar for other languages. Hopefully, all that
would be needed is to specify a new module with its own
grammar, and inherit all the existing methods. I don't
have the knowledge of the naming conventions for
non-English languages.

SEE ALSO

Lingua::EN::NameParse, Parse::RecDescent,
Locale::SubCountry

TO DO BUGS

Streets such as "The Esplanade" will return a street of
"The Esplanade" and a street type of null string.

COPYRIGHT

Copyright (c) 1999-2002 Kim Ryan. All rights reserved.
This program is free software; you can redistribute it
and/or modify it under the terms of the Perl Artistic
License (see http://www.perl.com/perl/misc/Artistic.html).

AUTHOR

AddressParse was written by Kim Ryan
<kimaryan@ozemail.com.au>. <http://www.data-distillers.com>