1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
|
# -*- perl -*-
# (c) Copyright 1998-2003 by Mark Mielke
#
# Freedom to use these sources for whatever you want, as long as credit
# is given where credit is due, is hereby granted. You may make modifications
# where you see fit but leave this copyright somewhere visible. As well, try
# to initial any changes you make so that if I like the changes I can
# incorporate them into later versions.
#
# - Mark Mielke <mark@mielke.cc>
#
package Text::Soundex;
require 5.006;
use Exporter ();
use XSLoader ();
use strict;
our $VERSION = '3.02';
our @EXPORT_OK = qw(soundex soundex_unicode soundex_nara soundex_nara_unicode
$soundex_nocode);
our @EXPORT = qw(soundex $soundex_nocode);
our @ISA = qw(Exporter);
our $nocode;
# Previous releases of Text::Soundex made $nocode available as $soundex_nocode.
# For now, this part of the interface is exported and maintained.
# In the feature, $soundex_nocode will be deprecated.
*Text::Soundex::soundex_nocode = \$nocode;
sub soundex_noxs
{
# Strict implementation of Knuth's soundex algorithm.
my @results = map {
my $code = $_;
$code =~ tr/AaEeHhIiOoUuWwYyBbFfPpVvCcGgJjKkQqSsXxZzDdTtLlMmNnRr//cd;
if (length($code)) {
my $firstchar = substr($code, 0, 1);
$code =~ tr[AaEeHhIiOoUuWwYyBbFfPpVvCcGgJjKkQqSsXxZzDdTtLlMmNnRr]
[0000000000000000111111112222222222222222333344555566]s;
($code = substr($code, 1)) =~ tr/0//d;
substr($firstchar . $code . '000', 0, 4);
} else {
$nocode;
}
} @_;
wantarray ? @results : $results[0];
}
sub soundex_nara
{
# Implementation of NARA's soundex algorithm. If two sounds are
# identical, and separated by only an H or a W... they should be
# treated as one. This requires an additional "s///", as well as
# the "9" character code to represent H and W. ("9" works like "0"
# except it combines indentical sounds around it into one)
my @results = map {
my $code = uc($_);
$code =~ tr/AaEeHhIiOoUuWwYyBbFfPpVvCcGgJjKkQqSsXxZzDdTtLlMmNnRr//cd;
if (length($code)) {
my $firstchar = substr($code, 0, 1);
$code =~ tr[AaEeHhIiOoUuWwYyBbFfPpVvCcGgJjKkQqSsXxZzDdTtLlMmNnRr]
[0000990000009900111111112222222222222222333344555566]s;
$code =~ s/(.)9\1/$1/g;
($code = substr($code, 1)) =~ tr/09//d;
substr($firstchar . $code . '000', 0, 4);
} else {
$nocode
}
} @_;
wantarray ? @results : $results[0];
}
sub soundex_unicode
{
require Text::Unidecode unless defined &Text::Unidecode::unidecode;
soundex(Text::Unidecode::unidecode(@_));
}
sub soundex_nara_unicode
{
require Text::Unidecode unless defined &Text::Unidecode::unidecode;
soundex_nara(Text::Unidecode::unidecode(@_));
}
eval { XSLoader::load(__PACKAGE__, $VERSION) };
if (defined(&soundex_xs)) {
*soundex = \&soundex_xs;
} else {
*soundex = \&soundex_noxs;
*soundex_xs = sub {
require Carp;
Carp::croak("XS implementation of Text::Soundex::soundex_xs() ".
"could not be loaded");
};
}
1;
__END__
# Implementation of soundex algorithm as described by Knuth in volume
# 3 of The Art of Computer Programming.
#
# Some of this documention was written by Mike Stok.
#
# Knuth's test cases are:
#
# Euler, Ellery -> E460
# Gauss, Ghosh -> G200
# Hilbert, Heilbronn -> H416
# Knuth, Kant -> K530
# Lloyd, Ladd -> L300
# Lukasiewicz, Lissajous -> L222
#
=head1 NAME
Text::Soundex - Implementation of the Soundex Algorithm as Described by Knuth
=head1 SYNOPSIS
use Text::Soundex 'soundex';
$code = soundex($name); # Get the soundex code for a name.
@codes = soundex(@names); # Get the list of codes for a list of names.
# Redefine the value that soundex() will return if the input string
# contains no identifiable sounds within it.
$Text::Soundex::nocode = 'Z000';
=head1 DESCRIPTION
This module implements the soundex algorithm as described by Donald Knuth
in Volume 3 of B<The Art of Computer Programming>. The algorithm is
intended to hash words (in particular surnames) into a small space
using a simple model which approximates the sound of the word when
spoken by an English speaker. Each word is reduced to a four
character string, the first character being an upper case letter and
the remaining three being digits.
The value returned for strings which have no soundex encoding is
defined using C<$Text::Soundex::nocode>. The default value is C<undef>,
however values such as C<'Z000'> are commonly used alternatives.
For backward compatibility with older versions of this module the
C<$Text::Soundex::nocode> is exported into the caller's namespace as
C<$soundex_nocode>.
In scalar context, C<soundex()> returns the soundex code of its first
argument. In list context, a list is returned in which each element is the
soundex code for the corresponding argument passed to C<soundex()>. For
example, the following code assigns @codes the value C<('M200', 'S320')>:
@codes = soundex qw(Mike Stok);
To use C<Text::Soundex> to generate codes that can be used to search one
of the publically available US Censuses, a variant of the soundex()
subroutine must be used:
use Text::Soundex 'soundex_nara';
$code = soundex_nara($name);
The algorithm used by the US Censuses is slightly different than that
defined by Knuth and others. The descrepancy shows up in names such as
"Ashcraft":
use Text::Soundex qw(soundex soundex_nara);
print soundex("Ashcraft"), "\n"; # prints: A226
print soundex_nara("Ashcraft"), "\n"; # prints: A261
=head1 EXAMPLES
Knuth's examples of various names and the soundex codes they map to
are listed below:
Euler, Ellery -> E460
Gauss, Ghosh -> G200
Hilbert, Heilbronn -> H416
Knuth, Kant -> K530
Lloyd, Ladd -> L300
Lukasiewicz, Lissajous -> L222
so:
$code = soundex 'Knuth'; # $code contains 'K530'
@list = soundex qw(Lloyd Gauss); # @list contains 'L300', 'G200'
=head1 LIMITATIONS
As the soundex algorithm was originally used a B<long> time ago in the US
it considers only the English alphabet and pronunciation. In particular,
non-ASCII characters will be ignored. The recommended method of dealing
with characters that have accents, or other unicode characters, is to use
the Text::Unidecode module available from CPAN. Either use the module
explicitly:
use Text::Soundex;
use Text::Unidecode;
print soundex(unidecode("Fran\xE7ais")), "\n"; # Prints "F652\n"
Or use the convenient wrapper routine:
use Text::Soundex 'soundex_unicode';
print soundex_unicode("Fran\xE7ais"), "\n"; # Prints "F652\n"
Since the soundex algorithm maps a large space (strings of arbitrary
length) onto a small space (single letter plus 3 digits) no inference
can be made about the similarity of two strings which end up with the
same soundex code. For example, both C<Hilbert> and C<Heilbronn> end
up with a soundex code of C<H416>.
=head1 MAINTAINER
This module is currently maintain by Mark Mielke (C<mark@mielke.cc>).
=head1 HISTORY
Version 3 is a significant update to provide support for versions of
Perl later than Perl 5.004. Specifically, the XS version of the
soundex() subroutine understands strings that are encoded using UTF-8
(unicode strings).
Version 2 of this module was a re-write by Mark Mielke (C<mark@mielke.cc>)
to improve the speed of the subroutines. The XS version of the soundex()
subroutine was introduced in 2.00.
Version 1 of this module was written by Mike Stok (C<mike@stok.co.uk>)
and was included into the Perl core library set.
Dave Carlsen (C<dcarlsen@csranet.com>) made the request for the NARA
algorithm to be included. The NARA soundex page can be viewed at:
C<http://www.nara.gov/genealogy/soundex/soundex.html>
Ian Phillips (C<ian@pipex.net>) and Rich Pinder (C<rpinder@hsc.usc.edu>)
supplied ideas and spotted mistakes for v1.x.
=cut
|