squeeze(3)
NAME
Squeeze.pm - Shorten text to minimum syllables by using
hash table and vowel deletion
REVISION
$Id: Squeeze.pm,v 1.25 1998/12/04 10:00:08 jaalto Exp $
SYNOPSIS
use Squeeze;                # import only function
use Squeeze qw( :ALL );     # import all functions and variables
use English;

while (<>)
{
    print SqueezeText $ARG;
}
DESCRIPTION
- Squeeze English text to the most compact format possible, so
that it is barely readable. You should convert all text to
lowercase for maximum compression, because the optimizations
have been designed mostly for uncapitalised letters.
- "Warning: Each line is processed multiple times, so
prepare for slow conversion time"
- You can use this module e.g. to preprocess text before it
is sent to electronic media that has some maximum text
size limit. For example pagers have an arbitrary text size
limit, typically 200 characters, which you want to fill as
much as possible. Alternatively you may have GSM cellular
phone which is capable of receiving Short Messages (SMS),
whose message size limit is 160 characters. For demonstration
of this module's SqueezeText() function, the description
text of this paragraph has been converted
below. See yourself if it's readable (Yes, it takes some
time to get used to). The compress ratio is typically
30-40%
u _n use thi mod e.g. to prprce txt bfre i_s snt to
elrnic mda has som max txt siz lim. f_xmple pag
hv abitry txt siz lim, tpcly 200 chr, W/ u wnt
to fll as mch as psbleAlternatvly u may hv GSM cllar- P8
w_s cpble of rcivng Short msg (SMS), WS/ msg siz
lim is 160 chr. 4 demonstrton of thi mods SquezText
fnc , dsc txt of thi prgra has ben cnvd_ blow
See uself if i_s redble (Yes, it tak som T to get usd - to
compr rat is tpcly 30-40
- And if $SQZ_OPTIMIZE_LEVEL is set to non-zero:
u_nUseThiModE.g.ToPrprceTxtBfreI_sSntTo
elrnicMdaHasSomMaxTxtSizLim.F_xmplePag
hvAbitryTxtSizLim,Tpcly200Chr,W/UWnt
toFllAsMchAsPsbleAlternatvlyUMayHvGSMCllarP8
w_sCpbleOfRcivngShortMsg(SMS),WS/MsgSiz
limIs160Chr.4DemonstrtonOfThiModsSquezText
fnc,DscTxtOfThiPrgraHasBenCnvd_Blow
SeeUselfIfI_sRedble(Yes,ItTakSomTToGetUsdto
comprRatIsTpcly30-40
- The comparison of these two shows:
Original text : 627 characters
Level 0       : 433 characters, reduction 31 %
Level 1       : 345 characters, reduction 45 % (+14 % improvement)
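The reduction figures above can be re-derived from the character counts alone. The following is a minimal sketch; the helper name reduction_percent is made up for illustration and is not part of the module.

```perl
use strict;
use warnings;

# Percentage by which the squeezed text is shorter than the original,
# rounded to the nearest whole percent. (Hypothetical helper, not in Squeeze.pm.)
sub reduction_percent {
    my ($original_len, $squeezed_len) = @_;
    return sprintf '%.0f', 100 * ($original_len - $squeezed_len) / $original_len;
}

my $original = 627;                               # character counts quoted above
my $level0   = reduction_percent($original, 433);
my $level1   = reduction_percent($original, 345);

printf "Level 0: %d %%\n", $level0;
printf "Level 1: %d %% (+%d)\n", $level1, $level1 - $level0;
```

The "+14" is the difference between the two rounded percentages, i.e. percentage points gained by level 1 over level 0.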
- There are a few grammar rules which are used to shorten some
English tokens considerably:
Word that has _ is usually a verb
Word that has / is usually a substantive, noun,
pronoun or other non-verb
- For example, these tokens must be understood before the text
can be read. This is not yet like Geek code, because you
don't need an external parser to understand this, just
some common sense and time to adapt yourself to this text.
For a complete up to date list, you have to peek at the source code.

automatically => 'acly_'

for       => 4
for him   => 4h
for her   => 4h
for them  => 4t
for those => 4t

can       => _n
does      => _s

it is     => i_s
that is   => t_s
which is  => w_s
that are  => t_r
which are => w_r

less      => -/
more      => +/
most      => ++

however   => h/ver
think     => thk_

useful    => usful

you       => u
your      => u/
you'd     => u/d
you'll    => u/l
they      => t/
their     => t/r

will      => /w
would     => /d
with      => w/
without   => w/o
which     => W/
whose     => WS/

- Time is expressed with big letters
time      => T
minute    => MIN
second    => SEC
hour      => HH
day       => DD
month     => MM
year      => YY
- Other big letter acronyms
phone     => P8
EXAMPLES
- To add new words e.g. to the word conversion hash table, you'd
define your custom set and merge it with the existing ones. Do
similarly to %SQZ_WXLATE_MULTI_HASH and $SQZ_ZAP_REGEXP
and then start using the conversion function.

use English;
use Squeeze qw( :ALL );

my %myExtraWordHash =
(
    'new-word1'   => 'conversion1'
    , 'new-word2' => 'conversion2'
    , 'new-word3' => 'conversion3'
    , 'new-word4' => 'conversion4'
);

# First take the existing tables and merge them with my
# translation table

my %myCustomWordHash =
(
    %SQZ_WXLATE_HASH
    , %SQZ_WXLATE_EXTRA_HASH
    , %myExtraWordHash
);

my $myXlat = 0;     # state flag

while (<>)
{
    if ( $condition )                       # $condition: your own test
    {
        SqueezeHashSet \%myCustomWordHash;  # Use MY conversions
        $myXlat = 1;
    }

    if ( $myXlat and $condition )
    {
        SqueezeHashSet "reset";             # Back to default table
        $myXlat = 0;
    }

    print SqueezeText $ARG;
}
- Similarly you can redefine the multi word translate table
by supplying another hash reference in the call to
SqueezeHashSet(). To kill more text immediately in addition
to the default, just concatenate the regexps to $SQZ_ZAP_REGEXP.
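Concatenating extra patterns might look like the sketch below. It assumes that $SQZ_ZAP_REGEXP holds a regexp source string whose alternatives can be joined with "|"; this page does not show the variable's actual default, so a local stand-in $zap is used here, and the helper extend_zap and the extra words are hypothetical. Since the regexp is compiled on the first call to SqueezeText(), any such extension would have to happen before that call.

```perl
use strict;
use warnings;

# Join an existing regexp source string with extra alternatives.
# (Hypothetical helper, not part of Squeeze.pm.)
sub extend_zap {
    my ($base, @extra) = @_;
    return join '|', $base, @extra;
}

# Stand-in for $SQZ_ZAP_REGEXP (assumption: it is a regexp source string;
# the real default value lives in the module source).
my $zap = '\b(?:hm|hi|hello)\b';

# Hypothetical extra words to delete.
$zap = extend_zap($zap, '\bcheers\b', '\bregards\b');

# Show the effect of the combined pattern on a sample line.
(my $out = "hello world, cheers") =~ s/$zap//gi;
print "[$out]\n";
```

With the real module, the same join would be applied to $SQZ_ZAP_REGEXP itself after importing it with qw( :ALL ).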
KNOWN BUGS
- There may be a lot of false conversions, and if you think
that some word squeezing went too far, please 1) turn on
the debug 2) send your example text and 3) the debug log to the
maintainer. To see how the conversion goes e.g. for the word
Messages:

use English;
use Lingua::EN::Squeeze;

# activate debug when case-insensitive word "Messages"
# is found from the line.

SqueezeDebug( 1, '(?i)Messages' );

$ARG = "This line has some Messages in it";
print SqueezeText $ARG;
EXPORTABLE VARIABLES
The defaults may not cover all possible text, so you may
wish to extend the hash tables and $SQZ_ZAP_REGEXP to cope
with your typical text.
- $SQZ_ZAP_REGEXP
- Text to kill immediately, like "Hm, Hi, Hello..." You
can only set this once, because this regexp is compiled
immediately when "SqueezeText()" is called for
the first time.
- $SQZ_OPTIMIZE_LEVEL
- This controls how optimized the text will be. Currently
there are only levels 0 (default) and 1; level 1
squeezes out all spaces. This improves compression by
an average of 10%, but the text is harder to read.
If space is tight, use this extended compression optimization.
- %SQZ_WXLATE_MULTI_HASH
- Multi Word conversion hash table: "for you" => "4u" ...
- %SQZ_WXLATE_HASH
- Single Word conversion hash table: word => conversion.
This table is applied after %SQZ_WXLATE_MULTI_HASH has
been used.
- %SQZ_WXLATE_EXTRA_HASH
- Aggressive Single Word conversions like: without => w/o. Applied last.
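The order in which the three tables are applied can be illustrated with a small self-contained sketch. This is not the module's implementation (which also deletes vowels and applies $SQZ_ZAP_REGEXP); the hashes below are tiny hypothetical stand-ins filled with entries quoted on this page, and the assignment of "with" to the single word table is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for %SQZ_WXLATE_MULTI_HASH, %SQZ_WXLATE_HASH and
# %SQZ_WXLATE_EXTRA_HASH, holding only a few entries quoted on this page.
my %multi  = ('for you' => '4u', 'it is' => 'i_s');
my %single = ('you' => 'u', 'with' => 'w/');
my %extra  = ('without' => 'w/o');

# Replace every whole-word occurrence of each key with its conversion.
sub apply_table {
    my ($text, $table) = @_;
    # Longer keys first, so a longer phrase is tried before a shorter one.
    for my $from (sort { length($b) <=> length($a) } keys %$table) {
        $text =~ s/\b\Q$from\E\b/$table->{$from}/g;
    }
    return $text;
}

# Apply the tables in the documented order: multi, single, extra.
sub squeeze_sketch {
    my ($text) = @_;
    $text = apply_table($text, $_) for \%multi, \%single, \%extra;
    return $text;
}

print squeeze_sketch("it is for you, with or without you"), "\n";
# prints: i_s 4u, w/ or w/o u
```

Running the multi word table first matters: if "you" were converted to "u" before "for you" was tried, the phrase would no longer match and "4u" could never be produced.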
INTERFACE FUNCTIONS
SqueezeText($)
- Description
- Squeeze text by using vowel substitutions and deletions
and hash tables that guide text substitutions.
The line is parsed multiple times and this will take
some time.
- arg1: $text
- String. Line of Text.
- Return values
- String, squeezed text.
- new()
- Description
- Return class object.
- Return values
- object.
- SqueezeHashSet($;$)
- Description
- Set hash tables to use for converting text. The multiple
word conversion is done first and after that the
single word conversions.
- arg1: wordHashRef
- Hash reference to be used to convert single words. If
"reset", use default hash table.
- arg2: multiHashRef [optional]
- Hash reference to be used to convert multiple words. If
"reset", use default hash table.
- Return values
- None.
- SqueezeControl(;$)
- Description
- Select level of text squeezing: off (noconv), on (conv),
medium (med), maximum (max).
- arg1: $state
- String. If nothing, set maximum squeeze level (in effect:
restore defaults).
noconv  Turn off squeeze
conv    Turn on squeeze
med     Set squeezing level to medium
max     Set squeezing level to maximum
- Return values
- None.
- SqueezeDebug(;$$)
- Description
- Activate or deactivate debug.
- arg1: $state [optional]
- If not given, turn debug off. If non-zero, turn debug
on. You must also supply "regexp" if you turn on
debug, unless you have given it previously.
- arg2: $regexp [optional]
- If given, use regexp to trigger debug output when
debug is on.
- Return values
- None.
AVAILABILITY
Author can be reached at jari.aalto@poboxes.com. HomePage
via forwarding service is at
http://www.netforward.com/poboxes/?jari.aalto or alternatively
the absolute URL is at ftp://cs.uta.fi/pub/ssjaaa/ but this may
move without notice. Prefer keeping the forwarding service link in
your bookmarks.
Latest version of this module can be found at
$CPAN/modules/by-module/Lingua/
AUTHOR
- Copyright (C) 1998-1999 Jari Aalto. All rights reserved.
This program is free software; you can redistribute it
and/or modify it under the same terms as Perl itself or under
the terms of the GNU General Public Licence v2 or later.