nameparse(3)
NAME
Lingua::EN::NameParse - routines for manipulating a person's name
SYNOPSIS
use Lingua::EN::NameParse qw(clean case_surname);
# optional configuration arguments
my %args =
(
salutation => 'Dear',
sal_default => 'Friend',
auto_clean => 1,
force_case => 1,
lc_prefix => 1,
initials => 3,
allow_reversed => 1,
joint_names => 0,
extended_titles => 0
);
my $name = new Lingua::EN::NameParse(%args);
$error = $name->parse("MR AC DE SILVA");
%name_comps = $name->components;
$surname = $name_comps{surname_1}; # DE SILVA
$correct_casing = $name->case_all; # Mr AC de Silva
$correct_casing = $name->case_all_reversed ; # de Silva, AC
$good_name = &clean("Bad Na9me "); # "Bad Name"
$name->salutation; # Dear Mr de Silva
%my_properties = $name->properties;
$number_surnames = $my_properties{number}; # 1
$bad_input = $my_properties{non_matching};
$lc_prefix = 0;
$correct_case = &case_surname("DE SILVA-MACNAY",$lc_prefix); # De Silva-MacNay
REQUIRES
Perl, version 5.001 or higher and Parse::RecDescent
DESCRIPTION
This module takes as input a person's or persons' name in free
format text, such as:

Mr AB & M/s CD MacNay-Smith
MR J.L. D'ANGELO
Estate Of The Late Lieutenant Colonel AB Van Der Heiden

and attempts to parse it. If successful, the name is broken down
into components, and useful functions can be performed, such as:

converting upper or lower case values to name case (Mr AB MacNay)
creating a personalised greeting or salutation (Dear Mr MacNay)
extracting the name's individual components (Mr,AB,MacNay)
determining the type of format the name is in (Mr_A_Smith)

If the name cannot be parsed, you have the option of cleaning the
name of bad characters, or extracting any portion that was parsed
and the portion that failed.

This module can be used for analysing and improving the quality of
lists of names.
DEFINITIONS
The following terms are used by NameParse to define the components
that can make up a name.

Precursor   - Estate of (The Late), Right Honourable ...
Title       - Mr, Mrs, Ms., Sir, Dr, Major, Reverend ...
Conjunction - word to separate names or initials, such as "And"
Initials    - 1-3 letters, each with an optional space and/or dot
Surname     - De Silva, Van Der Heiden, MacNay-Smith, O'Reilly ...
Suffix      - Snr., Jnr, III, V ...

Refer to the component grammar defined within the code for a
complete list of combinations.

'Name casing' refers to the correct use of upper and lower case
letters in people's names, such as Mr AB McNay.

To describe the formats supported by NameParse, a shorthand
representation of the name is used. The following formats are
currently supported:
Mr_A_Smith_&_Ms_B_Jones
Mr_&_Ms_A_&_B_Smith
Mr_A_&_Ms_B_Smith
Mr_&_Ms_A_Smith
Mr_A_&_B_Smith
Mr_John_A_Smith
Mr_John_Smith
Mr_A_Smith
John_Adam_Smith
John_A_Smith
J_Adam_Smith
John_Smith
A_Smith

Precursors and suffixes are only applied to the following formats:

Mr_John_A_Smith
Mr_John_Smith
Mr_A_Smith
John_Adam_Smith
John_A_Smith
J_Adam_Smith
John_Smith
A_Smith
METHODS
new
The "new" method creates an instance of a name object and
sets up the grammar used to parse names. This must be
called before any of the following methods are invoked.
Note that the object only needs to be created ONCE, and
should be reused with new input data. Calling "new"
repeatedly will significantly slow your program down.

Various setup options may be defined in a hash that is passed as an
optional argument to the "new" method. All of the arguments are
optional; define the combination that is appropriate for your usage.

my %args =
(
salutation => 'Dear',
sal_default => 'Friend',
auto_clean => 1,
force_case => 1,
lc_prefix => 1,
initials => 3,
allow_reversed => 1
);

my $name = new Lingua::EN::NameParse(%args);
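
The advice about creating the object once can be illustrated with a short sketch. It assumes the module is installed and uses only the "new", "parse" and "case_all" methods described in this page; the input list is made up for illustration.

```perl
use strict;
use warnings;
use Lingua::EN::NameParse;

# Build the parser ONCE; setting up the grammar is the expensive step.
my $parser = Lingua::EN::NameParse->new(
    auto_clean => 1,
    force_case => 1,
    lc_prefix  => 1,
    initials   => 3,
);

# Hypothetical input list, for illustration only.
my @raw_names = ( "MR AC DE SILVA", "MR J.L. D'ANGELO" );

# Reuse the same object for every name instead of calling "new" again.
for my $raw (@raw_names) {
    my $error = $parser->parse($raw);    # 0 on success, 1 on failure
    my $cased = $parser->case_all;
    printf "%-24s => %s%s\n", $raw, $cased,
        $error ? ' (not fully parsed)' : '';
}
```

Keeping the constructor outside the loop is the whole point of the sketch; the particular options chosen here are only one possible combination.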

salutation
This option defines the salutation word, such as "Dear" or
"Greetings". It must be defined if you are planning to use the
"salutation" method.

sal_default
This option defines the default word to substitute for the title and
surname(s) when parsing fails to identify them. It is also used when
a precursor occurs. Examples are "Friend" or "Member". It must be
defined if you are planning to use the "salutation" method. If an
'&' or 'and' occurs in the unmatched section, it is assumed that we
are dealing with more than one person, and an 's' is appended to the
default word.

force_case
This option forces the "case_all" method to name case the entire
input string, including any unmatched sections that failed parsing.
For example, in "MR A JONES & ASSOCIATES", "& ASSOCIATES" will also
be name cased. The casing rules for unmatched sections are the same
as for surnames. This is usually the best option, although any
initials in the unmatched section will not be correctly cased. This
option is useful when you know your data has invalid names, but you
cannot filter out or reject them.

auto_clean
When this option is set to a positive value, any call to the "parse"
method that fails will attempt to 'clean' the name and then reparse
it. See the "clean" function for details. This is useful for dirty
data with embedded unprintable or non-alphabetic characters.

lc_prefix
When this option is set to a positive value, it forces the
"case_all" and "case_components" methods to lower case the first
letter of each word that occurs in the prefix portion of a surname.
For example, Mr AB de Silva, or Ms AS von der Heiden.

initials
Allows the user to control the number of letters that can occur in
the initials. Valid settings are 1, 2 or 3. If no value is supplied,
a default of 2 is used.

allow_reversed
When this option is set to a positive value, names in reverse order
will be processed. The only valid format is the surname followed by
a comma and the rest of the name, which can be in any of the
combinations allowed by non-reversed names. Some examples are:

Smith, Mr AB
Jones, Jim
De Silva, Professor A.B.

The program changes the order of the name back to the non-reversed
format, and then performs the normal parsing. Note that if the name
can be parsed, the fact that its order was originally reversed is
not recorded as a property of the name object.

joint_names
When this option is set to a positive value, joint names are
accounted for:

Mr_A_Smith_&_Ms_B_Jones
Mr_&_Ms_A_&_B_Smith
Mr_A_&_Ms_B_Smith
Mr_&_Ms_A_Smith
Mr_A_&_B_Smith

Note that if this option is not specified, then by default joint
names are ignored. Disabling joint names speeds up processing
considerably.

extended_titles
When this option is set to a positive value, all combinations of
titles, such as Colonel, Mother Superior are used. If this value is
not set, only the following titles are accounted for:
Mr
Ms
M/s
Mrs
Miss
Dr
Sir
Dame
Reverend
Reverand
Father
Captain
Capt
Colonel
Col
General
Gen
Major
Maj

Note that if this option is not specified, then by default extended
titles are ignored. Disabling extended titles speeds up the
processing.

parse
$error = $name->parse("MR AC DE SILVA");

The "parse" method takes a single parameter of a text string
containing a name. It attempts to parse the name and break it down
into the components described above. If the name was parsed
successfully, a 0 is returned, otherwise a 1. This step is a
prerequisite for the following functions.

case_all
$correct_casing = $name->case_all;

The "case_all" method converts the first letter of each component to
capitals and the remainder to lower case, with the following
exceptions:

initials remain capitalised
surname spellings such as MacNay-Smith, O'Brien and Van Der Heiden
are preserved
see "surname_prefs.txt" for user defined exceptions

A complete definition of the capitalising rules can be found by
studying the component grammar defined within the code.

The method returns the entire cased name as text.

case_all_reversed
$correct_casing = $name->case_all_reversed;

The "case_all_reversed" method applies the same type of casing as
"case_all". However, the name is returned as surname followed by a
comma and the rest of the name, which can be any of the combinations
allowed for a name, except the title. Some examples are: "Smith,
John", "De Silva, A.B." This is useful for sorting names
alphabetically by surname.

The method returns the entire reverse order cased name as text.

case_components
%my_name = $name->case_components;
$cased_surname = $my_name{surname_1};

The "case_components" method does the same thing as the "case_all"
method, but returns the name cased components in a hash. The
following keys are used for each component:

precursor
title_1
title_2
given_name_1
initials_1
initials_2
middle_name
conjunction_1
conjunction_2
surname_1
surname_2
suffix

If a key has no matching data for a given name, its value will be
set to the empty string.

components
%my_name = $name->components;
$surname = $my_name{surname_1};

The "components" method does the same thing as the "case_components"
method, but each component is returned as it appears in the input
string, with no case conversion.

case_surname
$correct_casing = &case_surname("DE SILVA-MACNAY" [,$lc_prefix]);

"case_surname" is a stand alone function that does not require a
name object. The input is a text string. An optional input argument
controls the casing rules for prefix portions of a surname, as
described above in the "lc_prefix" section.

The output is a string converted to the correct casing for surnames.
See "surname_prefs.txt" for user defined exceptions.

This function is useful when you know you are only dealing with
names that do not have initials, like "Mr John Jones". It is much
faster than the "case_all" method, but does not understand context,
and cannot detect errors on strings that are not personal names.

surname_prefs.txt
Some surnames can have more than one form of valid capitalisation,
such as MacQuarie or Macquarie. Where the user wants to specify one
form as the default, a text file called surname_prefs.txt should be
created and placed in the same location as the NameParse module. The
text file should contain one surname per line, in the capitalised
form you want, such as:

Macquarie
MacHado

NameParse will still operate if the file does not exist.
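
To make the idea of a preference file concrete, here is a minimal sketch of how such a one-surname-per-line file could be turned into a case-insensitive lookup. This is not the module's own code: the helper names load_surname_prefs and apply_surname_pref are hypothetical, and an in-memory string stands in for the real file.

```perl
use strict;
use warnings;

# Hypothetical helper: read preferred spellings, one per line,
# into a hash keyed by the lower-cased surname.
sub load_surname_prefs {
    my ($fh) = @_;
    my %prefs;
    while ( my $line = <$fh> ) {
        $line =~ s/^\s+|\s+$//g;    # trim whitespace and the newline
        next if $line eq '';        # skip blank lines
        $prefs{ lc $line } = $line;
    }
    return \%prefs;
}

# Hypothetical helper: return the preferred spelling if one is listed,
# otherwise return the surname unchanged.
sub apply_surname_pref {
    my ( $surname, $prefs ) = @_;
    my $key = lc $surname;
    return exists $prefs->{$key} ? $prefs->{$key} : $surname;
}

# In-memory stand-in for surname_prefs.txt.
my $file_text = "Macquarie\nMacHado\n";
open my $fh, '<', \$file_text or die "cannot open in-memory file: $!";
my $prefs = load_surname_prefs($fh);
close $fh;

print apply_surname_pref( 'MACQUARIE', $prefs ), "\n";    # Macquarie
print apply_surname_pref( 'Smith',     $prefs ), "\n";    # Smith
```

Keying the lookup on the lower-cased form means the preference applies however the input happens to be capitalised.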

salutation
The "salutation" method converts a name into a personal greeting,
such as "Dear Mr & Mrs O'Brien".

If an error is detected during parsing, such as with the name "AB
Smith & Associates", the title (if it occurs) and the surname(s) are
replaced with a default word like "Friend" or "Member". If the input
string contains a conjunction, an 's' is added to the default.

If the name contains a precursor, a default salutation is also
produced.

clean
$good_name = &clean("Bad Na9me");

"clean" is a stand alone function that does not require a name
object. The input is a text string and the output is the string
with:

all repeating spaces removed
all characters not in the set (A-Z a-z - ' , . &) removed
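
The two rules above can be approximated with a pair of substitutions. The following is a sketch of the documented behaviour, not the module's actual implementation; the name sketch_clean is hypothetical, and trimming the ends of the string is an assumption based on the SYNOPSIS example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the documented cleaning rules.
sub sketch_clean {
    my ($text) = @_;
    # Drop every character outside the documented set; the space is
    # kept because it separates the words of the name.
    $text =~ s/[^A-Za-z\-',.& ]//g;
    # Collapse runs of spaces into a single space.
    $text =~ s/ {2,}/ /g;
    # Assumption: trim leading and trailing spaces as well.
    $text =~ s/^ +| +$//g;
    return $text;
}

print sketch_clean("Bad Na9me "), "\n";            # Bad Name
print sketch_clean("MR  A.B.  O'BRIEN#"), "\n";    # MR A.B. O'BRIEN
```

Removing the unwanted characters first matters: deleting a character can leave two spaces side by side, which the second substitution then collapses.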

properties
The "properties" method returns all the properties of the name
(non_matching, number and type) as a hash.

type
The type of format a name is in, as one of the follow
ing strings:
Mr_A_Smith_&_Ms_B_Jones
Mr_&_Ms_A_&_B_Smith
Mr_A_&_Ms_B_Smith
Mr_&_Ms_A_Smith
Mr_A_&_B_Smith
Mr_John_A_Smith
Mr_John_Smith
Mr_A_Smith
John_Adam_Smith
John_A_Smith
J_Adam_Smith
John_Smith
A_Smith
unknown

non_matching
Returns any unmatched section that was found.
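
A typical use of these properties is to separate names that parsed from names that did not. The sketch below assumes the module is installed and uses only "parse", "properties" and "components" as described above; report_name is a hypothetical helper and the sample inputs are taken from examples elsewhere in this page.

```perl
use strict;
use warnings;
use Lingua::EN::NameParse;

my $parser = Lingua::EN::NameParse->new( auto_clean => 1, initials => 3 );

# Hypothetical helper: parse one name and summarise the outcome.
sub report_name {
    my ( $name_parser, $raw ) = @_;
    my $error = $name_parser->parse($raw);    # 0 on success, 1 on failure
    my %props = $name_parser->properties;
    if ($error) {
        # Report the part of the input that could not be matched.
        my $rest = defined $props{non_matching} ? $props{non_matching} : '';
        return "FAILED '$raw' (unmatched: '$rest')";
    }
    my %comps = $name_parser->components;
    return "OK '$raw' type=$props{type} surname=$comps{surname_1}";
}

print report_name( $parser, $_ ), "\n"
    for ( 'MR AC DE SILVA', 'AB Smith & Associates' );
```

Checking the return value of "parse" before reading the components keeps failed names out of downstream processing.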
LIMITATIONS
The huge number of character combinations that can form a valid name
makes it impossible to correctly identify them all. Firstly, there
are many ambiguities, which have no right answer:

Macbeth or MacBeth, are both valid spellings
Is ED WOOD E.D. Wood or Edward Wood
Is 'Mr Rapid Print' a name or a company

One approach is to have large lookup files of names and words,
statistical rules and fuzzy logic to attempt to derive context. This
approach gives high levels of accuracy but uses a lot of your
computer's time and resources.

NameParse takes the approach of using a limited set of rules, based
on the formats that are commonly used by business to represent
people's names. This gives us fairly high accuracy, with acceptable
speed and program size.

NameParse will accept names from many countries, like Van Der
Heiden, De La Mare and Le Fontain. Having said that, it is still
biased toward English, because the precursors, titles and
conjunctions are based on English usage.

Names with two or more words, but no separating hyphen, are not
recognized. This is a real quandary, as Indian, Chinese and other
names can have several components. If these are allowed for, any
component after the surname will also be picked up. For example, "Mr
AB Jones Trading As Jones Pty Ltd" will return a surname of "Jones
Trading".

Because of the large combination of possible names defined in the
grammar, the program is not very fast, except for the more limited
"case_surname" subroutine. See the "Future Directions" section for
possible speed ups.

As the parser has a very limited understanding of context, the
"John_Adam_Smith" name type is most likely to cause problems, as it
contains no known tokens like a title. A string such as "National
Australia Bank" would be accepted as a valid name, first name
National etc. Supplying a list of common pronouns as exceptions
could solve this problem.
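
The exception list idea can be sketched as a pre-filter that runs before the name is handed to the parser. This is an illustration only and not part of NameParse; the word list and the helper name contains_exception_word are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical exception list; a real one would be much longer.
my %exception_word =
    map { ( lc($_) => 1 ) } qw(National Bank Trading Pty Ltd Associates);

# Return 1 if any word of the text appears in the exception list.
sub contains_exception_word {
    my ($text) = @_;
    for my $word ( split /[^A-Za-z']+/, $text ) {
        return 1 if $exception_word{ lc $word };
    }
    return 0;
}

for my $candidate ( 'National Australia Bank', 'John Adam Smith' ) {
    my $verdict =
        contains_exception_word($candidate)
        ? 'skip parsing'
        : 'parse as a name';
    print "$candidate: $verdict\n";
}
```

A filter like this trades recall for precision: a person whose name happens to contain a listed word would also be skipped, so the list needs to be chosen with the data in mind.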
REFERENCES
"The Wordsworth Dictionary of Abbreviations & Acronyms"
(1997)
Australian Standard AS4212-1994 "Geographic Information
Systems - Data Dictionary for transfer of street address
ing information"
FUTURE DIRECTIONS
Add filtering of very long names
Add diagnostic messages explaining why parsing failed
Add transforming methods to do things like remove dots from initials
Try to derive gender (Mr... is male, Ms, Mrs... is female)

Let the user select what level of complexity of grammar they need
for their data. For example, if you know most of your names are in a
"John Smith" format, you can avoid the ambiguity between two letter
given names and initials. Using a limited grammar subset will also
be much faster.

Define grammar for other languages. Hopefully, all that would be
needed is to specify a new module with its own grammar, and inherit
all the existing methods. I don't have the knowledge of the naming
conventions for non-English languages.
SEE ALSO
Lingua::EN::AddressParse, Lingua::EN::MatchNames,
Lingua::EN::NickNames, Lingua::EN::NameCase, Parse::RecDescent
BUGS
The dot in a suffix of Jnr. or Snr. will be consumed as
unmatched text, and not be retained with the suffix.
COPYRIGHT
Copyright (c) 1999-2002 Kim Ryan. All rights reserved.
This program is free software; you can redistribute it
and/or modify it under the terms of the Perl Artistic
License (see http://www.perl.com/perl/misc/Artistic.html).
AUTHOR
NameParse was written by Kim Ryan <kimryan@cpan.org>
<http://www.data-distillers.com>

Thanks to all the people who provided ideas and suggestions,
including:

QM Industries <http://www.qmi.com.au>
Damian Conway <damian@cs.monash.edu.au>, author of Parse::RecDescent
<mark.summerfield@chest.ac.uk>, author of Lingua::EN::NameCase
Ron Savage <rpsavage@ozemail.com.au>
<alastair@calliope.demon.co.uk>
Adam Huffman
Douglas Wilson