lingua::en::sentence(3)

NAME

Lingua::EN::Sentence - Module for splitting text into
sentences.

SYNOPSIS

use Lingua::EN::Sentence qw( get_sentences add_acronyms );
add_acronyms('lt','gen');               ## adding support for 'Lt. Gen.'
my $sentences=get_sentences($text);     ## Get the sentences.
foreach my $sentence (@$sentences) {
        ## do something with $sentence
}

DESCRIPTION

The "Lingua::EN::Sentence" module contains the function
get_sentences, which splits text into its constituent
sentences, based on a regular expression and a list of
abbreviations (built in and given).

Certain well-known exceptions, such as abbreviations, may
cause incorrect segmentations. Some of them are already
integrated into this code and are taken care of. Still, if
you see that there are words causing get_sentences() to
fail, you can add them to the module so that it recognizes
them.

ALGORITHM

Basically, I use a 'brute' regular expression to split the
text into sentences. (Well, nothing is split yet; I just
mark the end-of-sentence.) Then I look into a set of
rules which decide when an end-of-sentence is justified
and when it's a mistake. In case of a mistake, the
end-of-sentence mark is removed.

What are such mistakes? Cases of abbreviations, for
example. I have a list of such abbreviations (please see
the `Acronym/Abbreviations list' section), and more general
rules (for example, the abbreviations 'i.e.' and 'e.g.'
need not be in the list, as a special rule takes care of
all single-letter abbreviations).
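
The mark-then-unmark idea described above can be sketched in a few
lines of core Perl. This is only an illustration and not the module's
actual code: the marker value, the short abbreviation list and the
function name split_sentences are all hypothetical, and the rules are
far cruder than the real ones (for instance, a sentence that really
ends in a single letter would be joined to the next one).

```perl
use strict;
use warnings;

# Hypothetical marker and abbreviation list, chosen for illustration only.
my $EOS  = "\x{1}";
my @ABBR = qw(mr mrs dr lt gen);

sub split_sentences {
    my ($text) = @_;

    # Step 1: 'brute' marking. Put a marker after every '.', '?' or '!'
    # that is followed by whitespace. Nothing is split yet.
    $text =~ s/([.?!])\s+/$1$EOS/g;

    # Step 2: remove marks that directly follow a listed abbreviation.
    my $abbr = join '|', map { quotemeta } @ABBR;
    $text =~ s/\b($abbr)\.$EOS/$1. /gi;

    # Step 3: remove marks that follow a single-letter abbreviation,
    # such as the final 'e.' of 'i.e.'.
    $text =~ s/\b([a-z])\.$EOS/$1. /gi;

    # Step 4: split on the surviving marks, trim each piece, and drop
    # pieces that contain no alphanumeric characters.
    my @sentences;
    for my $piece (split /\Q$EOS\E/, $text) {
        $piece =~ s/^\s+//;
        $piece =~ s/\s+$//;
        push @sentences, $piece if $piece =~ /[A-Za-z0-9]/;
    }
    return \@sentences;
}

my $demo = split_sentences(
    'Lt. Gen. Smith arrived, i.e. he came. He left. Did Dr. Jones stay?'
);
print "$_\n" for @$demo;
```

With the sample text, the marks after 'Lt.', 'Gen.', 'i.e.' and 'Dr.'
are removed again, so three sentences are printed, one per line.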

FUNCTIONS

All functions used should be requested in the 'use'
clause. None is exported by default.

get_sentences( $text )
    The get_sentences function takes a scalar containing ASCII
    text as an argument and returns a reference to an array of
    the sentences that the text has been split into. Returned
    sentences are trimmed of leading and trailing whitespace.
    Strings with no alphanumeric characters in them are not
    returned as sentences.

add_acronyms( @acronyms )
    This function is used for adding acronyms not supported by
    this code. Please see the `Acronym/Abbreviations list'
    section for the abbreviations already supported by this
    module.

get_acronyms( )
    This function returns the defined list of acronyms.

set_acronyms( @my_acronyms )
    This function replaces the predefined acronym list with the
    given list.

get_EOS( )
    This function returns the value of the string used to mark
    the end of sentence. You might want to see what it is, and
    to make sure your text doesn't contain it. You can use
    set_EOS() to alter the end-of-sentence string to whatever
    you desire.

set_EOS( $new_EOS_string )
    This function alters the end-of-sentence string used to
    mark the end of sentences.

set_locale( $new_locale )
    Receives a language locale in the form
    language.country.character-set, for example
    "fr_CA.ISO8859-1" for Canadian French using character set
    ISO8859-1. Returns a reference to a hash containing the
    current locale formatting values. Returns undef if given
    undef.

    The following will set the LC_COLLATE behaviour to
    Argentinian Spanish. NOTE: The naming and availability of
    locales depends on your operating system. Please consult
    the perllocale manpage for how to find out which locales
    are available in your system.

    $loc = set_locale( "es_AR.ISO8859-1" );

    This actually does this:

    $loc = setlocale( LC_ALL, "es_AR.ISO8859-1" );
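
As a usage sketch for the functions above, the following checks the
end-of-sentence marker against the input before splitting, as get_EOS()
suggests. It assumes Lingua::EN::Sentence is installed; the sample text
and the replacement marker string are hypothetical and chosen only for
illustration.

```perl
use strict;
use warnings;
use Lingua::EN::Sentence qw( get_sentences add_acronyms get_EOS set_EOS );

# Hypothetical sample text.
my $text = 'Lt. Gen. Smith arrived at noon. He left an hour later.';

# Register two extra abbreviations, as in the SYNOPSIS.
add_acronyms('lt', 'gen');

# If the text happens to contain the current marker, switch to another
# one before splitting. The replacement string is an arbitrary choice.
if (index($text, get_EOS()) >= 0) {
    set_EOS("\x{1}EOS\x{2}");
}

# get_sentences() returns a reference to an array of sentences.
my $sentences = get_sentences($text);
print "$_\n" for @$sentences;
```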

Acronym/Abbreviations list

The list of supported acronyms has become too long to reproduce in
this documentation. Use the get_acronyms() function to retrieve it.

If I come across a good general-purpose list, I'll
incorporate it into this module. Feel free to suggest such
lists.

FUTURE WORK

[1] Object Oriented like usage
[2] Supporting more than just English/French
[3] Code optimization. Currently everything is RE based and
    not so optimized RE
[4] Possibly use more semantic heuristics for detecting a
    beginning of a sentence

SEE ALSO

Text::Sentence

AUTHOR

Shlomo Yona <Shlomo.Yona@Siftology.com>

COPYRIGHT

Copyright (c) 2001 Siftology Inc. All rights reserved.

This library is free software. You can redistribute it
and/or modify it under the same terms as Perl itself.