pdSurname last name database

Last Name Database

How do you match and merge last names on your lists? The answer is pdSurname. It is a one-of-a-kind proprietary resource that does for surnames what our highly regarded pdNickname software does for first names. The package is designed to facilitate matching last names that are not exactly the same but are close in relationship, spelling, or sound. It is available in Pro and Standard editions.

Regardless of the version, it contains a large set of last names and variations covering more than 600 languages and all races along with a host of additional features never before available on this scale. The software is also recommended for genealogical and scholarly research.

Pro and Standard

Both editions include the same names and features, except that the Pro edition adds fuzzy logic, which allows matching when lists contain typographical errors.

DOES FOR LAST NAMES WHAT PDNICKNAME DOES FOR FIRST NAMES

pdSurname is a proprietary resource not duplicated elsewhere. For more than 20 years our software has been utilized by businesses and organizations around the world in applications you use every day.

Comprehensive last name database:
  • 335,000 last name formations
  • 80 million standard last name variation records
Last name relationships identified:
  • Close variant
  • Near variant
  • Distant variant
  • Phonetic match
  • Fuzzy logic match (Pro only)
Most advanced phonetic matching algorithms
Special algorithms for prefix names: MC, MAC, O, DE, LA, VAN, AL, ST, others
Match quality scored on a 1 to 99 scale for exceptional ordering of results
More than 600 languages of last name origin and use identified
Last name usage by race identified:
  • White
  • Black
  • Hispanic/Latino
  • Asian/Pacific
  • Native American/Alaskan
  • Multirace
Excellent for genealogical and scholarly research

Pro only features

28 million fuzzy logic last name variation records
Other benefits:
  • Designed to be fully compatible with pdNickname and pdGender
  • Comes in multiple file formats: Comma Delimited (CSV), Fixed Length, and DBF
  • Full documentation
  • Perpetual Site License—allowing installation on all computers in the same building within a single company or organization
  • Available for immediate download

SPECIFICATIONS

SKU | Pro: 1SS100P | Standard: 1SS100S
Product name | pdSurname Pro | pdSurname Standard
Version number | 1.0
Description | Last name variation software
Total records* | Pro: 109,932,801 | Standard: 81,079,801
Zipped size** | Pro: 1.2 GB | Standard: 962 MB
Extracted size** | Pro: 22.9 GB | Standard: 16.8 GB
File formats included | Comma Delimited (CSV), Fixed Length, and DBF
Availability | Immediate download
List price | Pro: $495 | Standard: $399

*The record count is the total number of records contained in each of the three included file formats.

**The zipped and extracted sizes show the combined total size of all product files.

Compatibility | pdSurname utilizes only the ANSI character set (ASCII values 0 to 127 and extended values 128 to 255) and comes in multiple file formats to ensure compatibility. It is also designed to be fully compatible with pdNickname and pdGender.
Optional developer license | Available

DOCUMENTATION

For better usability of our software we create precision documentation with examples, so you don’t have to worry. The user guide includes detailed instructions, file layouts, the site license, and additional information useful for both business applications and those employing the product for research.

To view the PDF user guide you will need Adobe Acrobat Reader version 4.05 or higher installed on your computer or device. This is a free program downloadable from the Adobe website.

SAMPLE

A random sample of the software database is available for download. It includes records from the main database with language information as well as the written documentation from the product and other information. The sample is extracted from the Pro edition. The Standard edition does not include fuzzy logic records.

The database comes in three file formats to ensure compatibility with any database system. Each format contains the same data. Formats include: comma delimited (CSV), fixed length, and DBF.
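As a rough illustration of working with the comma delimited format, the Python sketch below reads name pairs from a CSV stream. The NAME1 and NAME2 column names come from the record description on this page, but the presence of a header row, the exact column order, and the "cp1252" encoding suggested in the comment are assumptions; the user guide's file layouts are the authority.

```python
import csv
import io


def load_pairs(fileobj):
    """Yield (NAME1, NAME2) tuples from a CSV stream with a header row (assumed)."""
    reader = csv.DictReader(fileobj)
    for row in reader:
        yield row["NAME1"], row["NAME2"]


# Tiny in-memory stand-in for a real file, using two name pairs from this page.
sample = io.StringIO("NAME1,NAME2\nACKERMAN,AKERMAN\nCORREY,CURIE\n")
pairs = list(load_pairs(sample))

# For a file on disk, something like the following would be used instead.
# The path is a placeholder and the encoding is a guess at "ANSI":
#   with open(path, newline="", encoding="cp1252") as f:
#       pairs = list(load_pairs(f))
```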

The written documentation, including the site license, comes in Adobe Acrobat PDF format. To view these documents you will need Adobe Acrobat Reader version 4.05 or higher installed on your computer or device. This is a free program downloadable from the Adobe website.

What is fuzzy logic?

The fuzzy logic technology in the Pro edition of this software allows matching data that has typographical errors. If users look at the fuzzy logic records, they are likely to see errors they have repeatedly made or seen. In many cases you will have to look closely to see the difference, but they are different. There are more than 28 million fuzzy logic records.

The most likely typographical errors are determined based on the number of letters, the characters involved, where they are located in the name, the language, and other factors. None of the fuzzy spellings duplicates a real name already in the database. Such a collision sometimes arises when a candidate fuzzy spelling is already a real variation of the same name, and those spellings are not included as fuzzy records.

Some fuzzy logic matches have one typographical error while others have multiple issues, so the technology is suited for even the worst typists and transcribers. The algorithms have five layers:

Phonetic misspellings

These algorithms look at digraphs, trigraphs, tetragraphs, pentagraphs, hexagraphs, and even a German heptagraph, “SCHTSCH”, used to transliterate Russian words with the “SHCHA” or “SHCH” (romanized) sound. These are, respectively, two- to seven-letter sequences that form one phoneme or distinct sound. Most of the letter sequences of trigraph length and above are Irish, a language with more spelling rules than you can shake a stick at.

Many misspellings occur as transcribers enter the sounds they hear. The character sequences and the sounds they produce are different for each language and situation, such as before, after, or between certain vowels and consonants, so our substitutions are language-rule based. Furthermore, our algorithms consider both how a name may sound to someone who speaks English as well as how it may sound to someone who speaks Spanish, which is often different. Take the digraph “SC”. Before the vowels “E” or “I” it is most likely to be misspelled by an English speaker as “SHE” or “SHI” while a Spanish speaker may hear “CHE” or “CHI” and sometimes “YE” or “YI”. Our library includes over 80,000 language-based letter sequence phonetic rules. Phonetic misspelling examples:

Example 1 | Real: AGLIANO | Fuzzy: ALLANO
Example 2 | Real: GUALTIERREZ | Fuzzy: GUALTIEREZ
Example 3 | Real: HEATHFIELD | Fuzzy: HEATHFALD
Example 4 | Real: AAGARD | Fuzzy: OUGHGARD
Example 5 | Real: YOUNGMAN | Fuzzy: YONGMAN
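To make the “SC” rule described above concrete, here is a minimal Python sketch. It encodes only that single rule and is not the product's rule library. The rule table, the function name, and the test name SCIARRA are illustrative assumptions.

```python
import re

# Hypothetical rule table holding only the "SC" before "E"/"I" rule from the text.
# Keys are the listener's language; values are what the listener may write instead.
SC_RULES = {
    "english": ["SH"],
    "spanish": ["CH", "Y"],
}


def sc_misspellings(name: str, listener: str) -> list[str]:
    """Return candidate phonetic misspellings of `name` for one listener language."""
    name = name.upper()
    results = []
    for replacement in SC_RULES[listener]:
        # The lookahead keeps the vowel: only "SC" followed by E or I is replaced.
        candidate = re.sub(r"SC(?=[EI])", replacement, name)
        if candidate != name and candidate not in results:
            results.append(candidate)
    return results
```

A real rule set would carry one such table per language and context, which is where the scale of the rule library comes from.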

Reversed digraphs

These algorithms look for misspellings due to reversed digraphs (two letter sequences that form one phoneme or distinct sound) which are a common typographical issue, such as “IE” substituted for “EI”. The character sequences and the sounds they produce are different for each language and situation, such as before, after, or between certain vowels and consonants, so our substitutions are language-rule based. Reversed digraph examples:

Example 6 | Real: ANGLES | Fuzzy: ANLGES
Example 7 | Real: DIELEMAN | Fuzzy: DEILEMAN
Example 8 | Real: OLEARY | Fuzzy: OLAERY
Example 9 | Real: RODREGUEZ | Fuzzy: RODREUGEZ
Example 10 | Real: SCHUMACHER | Fuzzy: SCHUMAHCER
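The reversals in these examples can be reproduced with a simple adjacent-letter swap, sketched below in Python. This is a deliberately naive illustration: it swaps every adjacent pair, while the product, as described above, applies language-based rules to decide which reversals are plausible. The function name is an assumption.

```python
def adjacent_swaps(name: str) -> list[str]:
    """Return every spelling obtained by swapping one pair of adjacent letters."""
    name = name.upper()
    variants = []
    for i in range(len(name) - 1):
        if name[i] == name[i + 1]:
            continue  # swapping identical letters changes nothing
        swapped = name[:i] + name[i + 1] + name[i] + name[i + 2:]
        if swapped not in variants:
            variants.append(swapped)
    return variants
```

All five fuzzy spellings listed above appear among the candidates this function generates for their real names, along with many implausible ones that a rule-based filter would discard.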

Double letter misspellings

These algorithms look for misspellings due to double letters typed as single letters and single letters that are doubled. The most common typographical issues occur with the characters, in order of frequency, “SS”, “EE”, “TT”, “FF”, “LL”, “MM”, and “OO”. Double-letter misspelling examples:

Example 11 | Real: HUMBER | Fuzzy: HUMBEER
Example 12 | Real: ZWOLLE | Fuzzy: ZWOLE
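A minimal Python sketch of the idea follows. It collapses doubled letters and doubles single ones, restricted to the seven letters named above. The function name is an assumption, and the sketch does not weight the letters by frequency as the description implies the product does.

```python
# Letters named in the text as most prone to double-letter errors.
DOUBLE_PRONE = "SETFLMO"


def double_letter_variants(name: str) -> list[str]:
    """Collapse doubled letters to one and double single letters."""
    name = name.upper()
    variants = []
    for i, ch in enumerate(name):
        if ch not in DOUBLE_PRONE:
            continue
        if i + 1 < len(name) and name[i + 1] == ch:
            variant = name[:i] + name[i + 1:]   # "LL" typed as "L"
        elif i > 0 and name[i - 1] == ch:
            continue                            # second letter of a pair, already handled
        else:
            variant = name[:i] + ch + name[i:]  # "E" typed as "EE"
        if variant not in variants:
            variants.append(variant)
    return variants
```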

Missed letters

These algorithms look for missed keystrokes and provide fuzzy logic matches with missing letters. Unlike the other algorithms, these are not language specific. Keystrokes can be missed in any language. Missed letter examples:

Example 13 | Real: HUNTER | Fuzzy: UNTER
Example 14 | Real: TAMERON | Fuzzy: TAMRON
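Because missed keystrokes are not language specific, they are the simplest layer to sketch. The Python function below, whose name is an assumption, drops each letter in turn. The product presumably keeps only the likeliest omissions, which this sketch does not attempt.

```python
def missed_letter_variants(name: str) -> list[str]:
    """Return every spelling obtained by dropping exactly one letter."""
    name = name.upper()
    variants = []
    for i in range(len(name)):
        variant = name[:i] + name[i + 1:]
        if variant and variant not in variants:
            variants.append(variant)
    return variants
```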

String manipulations

Because so many of our algorithms are language-rule based, additional name string manipulations are provided for the relatively small number of names without language applied. Most of these are similar to the reversed digraph substitutions. String manipulation examples:

Example 15 | Real: ELWORTHY | Fuzzy: ELWROTHY
Example 16 | Real: PEOPLE | Fuzzy: POEPLE

Comprehensive last name database

Unlike first names, which have been with us since antiquity, fixed surnames were not used much until late in the High Middle Ages when populations grew and people found it necessary to be more specific when talking about someone else. For example, in England, the practice of family names was introduced after the Norman conquest in 1066, but was not fully adopted until the 13th and 14th centuries. The Welsh did not use them until the 17th century, and the Japanese did not have them until the 19th century except among aristocrats. On the other hand, matrilineal surnames existed in China prior to the Shang Dynasty (1600–1046 BC). Ireland was the first country in Europe to adopt fixed last names; the earliest on record, “Ó Cleirigh”, dates to 916.

In the beginning, surnames were names like John son of Thomas (patronymic), Jane of the Hills (habitational), Henry the Weaver (occupational), and Mary the Redhead (characteristic), until the adoption of modern last names, which were often alterations of these old-fashioned names.

The problem

Through this process, many variations of last names occurred, some by design and others due to carelessness or lack of education. For example, as families immigrated to other countries they often modified or even translated their name to fit in with a new language. Many other variations occurred as a largely uneducated society tried to transcribe their names the best way they could, while educated families decided to attenuate, accent, or otherwise modify their surnames over time, so that Brown became Browne. Still other names are not variations at all, but sound similar.

pdSurname to the rescue

Apparently as all this was going on they were not thinking of modern day scribes, typists, and data processors who now need to work with all the variations and phonetic similarities. That is why pdSurname was invented. A one-of-a-kind proprietary resource that does for last names what our highly regarded pdNickname software does for first names, it is designed to facilitate matching last names that are not exactly the same but are close in relationship, spelling, or sound. Coverage includes surnames from hundreds of languages and the package employs the best matching algorithms designed for this process.

As a further benefit, for a large majority of last names, the language of origin and use have also been researched and included, and all names have a real or estimated calculation for usage among races, including white, black, Hispanic/Latino, Asian/Pacific, Native American/Alaskan, and multiracial use.

An enhanced version even incorporates sophisticated fuzzy logic which allows matching when lists have typographical errors.

This easy-to-use, comprehensive, and up-to-date software is of great value for businesses and organizations working with lists of names, but ancestry researchers, students, teachers, and scholars benefit as well because this software is recommended for study in genealogy, onomatology, anthroponymy, ethnology, linguistics, and related disciplines.

Related names

Each record has two sets of name information, a NAME1 side and a NAME2 side. The relationship between each name pair is identified as a close, near, or distant onomastic variation, as a phonetic variation or, in the Pro edition only, as a fuzzy logic variation.

The onomastic distance of true variations is rated on a 1 (closest) to 3 scale. The value is determined by tabulating or estimating the number of lines separating the names on a name tree.

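Records are coded as follows:
  • 1: Close onomastic variant
  • 2: Near onomastic variant
  • 3: Distant onomastic variant
  • P: Phonetic match
  • F: Fuzzy logic match (Pro only)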

Examples

Here are some examples of related names:

Example 1 | Name1: ACKERMAN | Name2: AKERMAN | Close onomastic variant (1)
Example 2 | Name1: MANCILL | Name2: MONSELL | Near onomastic variant (2)
Example 3 | Name1: WILLIAMSON | Name2: WILMSEN | Distant onomastic variant (3)
Example 4 | Name1: CORREY | Name2: CURIE | Phonetic match (P)
Example 5 | Name1: SANTILLA | Name2: SANTOLLA | Phonetic match (P)
Example 6 | Real: GUALTIERREZ | Fuzzy: GUALTIEREZ | Fuzzy logic match (Pro only) (F)
Example 7 | Real: AAGARD | Fuzzy: OUGHGARD | Fuzzy logic match (Pro only) (F)
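The Python sketch below shows one way such records might be held in memory and queried by relationship code. The pairs and codes are taken from the examples above. The field names, the function name, and the lookup direction (NAME1 to NAME2 only) are assumptions rather than the product's documented layout.

```python
from typing import NamedTuple

RELATION_LABELS = {
    "1": "Close onomastic variant",
    "2": "Near onomastic variant",
    "3": "Distant onomastic variant",
    "P": "Phonetic match",
    "F": "Fuzzy logic match (Pro only)",
}


class NamePair(NamedTuple):
    name1: str
    name2: str
    relation: str  # one of the keys of RELATION_LABELS


PAIRS = [
    NamePair("ACKERMAN", "AKERMAN", "1"),
    NamePair("MANCILL", "MONSELL", "2"),
    NamePair("WILLIAMSON", "WILMSEN", "3"),
    NamePair("CORREY", "CURIE", "P"),
    NamePair("GUALTIERREZ", "GUALTIEREZ", "F"),
]


def related(name, pairs=PAIRS, allowed=frozenset("123PF")):
    """Return (NAME2, label) for every record whose NAME1 equals `name`."""
    key = name.upper()
    return [
        (p.name2, RELATION_LABELS[p.relation])
        for p in pairs
        if p.name1 == key and p.relation in allowed
    ]
```

Passing a narrower `allowed` set, for example only the codes 1, 2, 3, and P, mimics what a Standard edition query would see.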

Most advanced phonetic matching algorithms

A major benefit of the software is the advanced system for indexing last names based on sound and spelling. Our structure analyzes hundreds of phonetic lines to make comparisons. These include proprietary algorithms designed for specific languages, language families, and dialects along with special situations. They are the most advanced algorithms ever designed for last names.

Additionally, as part of our phonetic indexing process, we include matches from six open source algorithms most data engineers are familiar with:

Soundex

This is the original phonetic algorithm. It was developed by Robert C. Russell and Margaret King Odell and patented in 1918 and 1922. The process was the first to index names by sound, as pronounced in English. The algorithm mainly encodes consonants. A vowel is not encoded unless it is the first letter.
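For readers who have not seen it, a compact Python sketch of Soundex follows. It implements the commonly documented American variant, in which H and W do not separate letters with the same code. It is shown for orientation only and says nothing about how pdSurname itself applies the algorithm.

```python
# Consonant groups and their Soundex digits; vowels, H, W, and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code (letter plus three digits)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]                 # the first letter is kept as-is
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                  # H and W do not break a run of equal codes
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit                  # a vowel resets prev, so repeats are coded again
    return (code + "000")[:4]         # pad with zeros or truncate to four characters
```

With this sketch, ACKERMAN and AKERMAN, the close-variant pair shown earlier on this page, both encode to A265.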

Metaphone

This is considered the first advanced phonetic algorithm. It was published in 1990 by Lawrence Philips and improved on Soundex by using information about variations and inconsistencies in English spelling and pronunciation to produce more accurate coding.

Double Metaphone

This algorithm, also published by Lawrence Philips, is called “Double” because it can return both a primary and a secondary code for a name string. The algorithm takes into account spelling peculiarities of a number of languages in addition to English.

New York State Identification and Intelligence System (NYSIIS)

This algorithm was developed in 1970 and is similar to Soundex except it maintains relative vowel positioning and handles some phonemes and sequential letters better. The accuracy increase over Soundex has been cited as 2.7 percent.

Caverphone

This algorithm was first developed by David Hood in the Caversham Project at the University of Otago in New Zealand in 2002 and revised in 2004. It was created to assist in data matching between late 19th century and early 20th century New Zealand electoral rolls.

Daitch-Mokotoff Soundex

This algorithm was developed in 1985 by Jewish genealogists Gary Mokotoff and Randy Daitch. It is a refinement of Soundex algorithms designed to allow greater accuracy in matching of Eastern European and Ashkenazi Jewish surnames with similar pronunciation but differences in spelling.

Special algorithms for prefix names

Our algorithms are specially tuned to work effectively with names that have prefixes such as MC, MAC, O, DE, LA, VAN, AL, ST, and many others. Traditionally phonetic algorithms have difficulty with these names because the prefixes create numerous false matches and miss true matches. Our algorithm greatly reduces this problem by separately measuring both the full name and the main part of the name following the prefix. Knowing the language of the name is key to our technique. Users will find this a major advantage with our system. Here are examples:

Example 1 | Name1: MCARTHUR | Name2: MCDALE | FALSE MATCH: Matched by open sources but not by our proprietary algorithms
Example 2 | Name1: DEGARCIA | Name2: GARCIA | TRUE MATCH: Matched only by our proprietary algorithms
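The idea of measuring the main part of the name separately can be sketched in Python as below. This is a simplified illustration, not the proprietary algorithm: the prefix list is limited to the prefixes named above, stems are compared by exact equality rather than phonetically, and the minimum stem length is an arbitrary guard. Function names are assumptions.

```python
# Prefixes named in the text, longest first so "MAC" is tried before "MC".
PREFIXES = ("MAC", "VAN", "MC", "DE", "LA", "AL", "ST", "O")


def split_prefix(name: str, min_stem: int = 4) -> tuple[str, str]:
    """Split a name into (prefix, stem); the prefix is "" when none applies."""
    name = name.upper()
    for prefix in PREFIXES:
        # The length guard avoids stripping a "prefix" from a very short name.
        if name.startswith(prefix) and len(name) - len(prefix) >= min_stem:
            return prefix, name[len(prefix):]
    return "", name


def stems_match(a: str, b: str) -> bool:
    """Compare two names on the part that follows any prefix."""
    return split_prefix(a)[1] == split_prefix(b)[1]
```

A purely mechanical split like this one also strips "LA" from a name such as LARSON, which illustrates why knowing the language of the name matters to the technique.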

Match quality scored on a 1 to 99 scale for exceptional ordering of results

The overall quality of each name-pair match is quantified on a scale of 01 (best) to 99. The scoring considers several factors:

A query can sometimes return a very large number of matches, and the score is effective in ordering and filtering the output. Users will find this a major advantage with our system.
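A short Python sketch of ordering and filtering by score follows. Lower scores are better, as stated above. The function name, the tuple layout, and the sample names and scores are invented for illustration and are not taken from the database.

```python
def order_matches(matches, max_score=99):
    """Keep matches scoring 1..max_score and return them best (lowest) first."""
    kept = [m for m in matches if 1 <= m[1] <= max_score]
    return sorted(kept, key=lambda m: m[1])


# Placeholder (name, score) results for a single query.
demo = [("VARIANT_A", 62), ("VARIANT_B", 4), ("VARIANT_C", 35)]
top = order_matches(demo, max_score=40)
```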

Languages of origin and use

Language coverage is extensive. The list exceeds 600 languages, language families, and dialects. Some languages refer to ethnic groups. None of the languages were derived algorithmically and the provided information represents years of extensive onomastic research. When different sources list different origins and usages they may be combined depending on the reliability of the source and the reasonability of the information. Differently styled names can have different language values.

Top 30 languages

The following are the top 30 languages with the number of occurrences in the names database. The language count is one for each unique name formation and not one for each relationship (which would be many more):

1. Polish Jewish (36,900)
2. Irish (35,500)
3. Czech Jewish (30,200)
4. German Jewish (22,500)
5. German (20,200)
6. Spanish (16,700)
7. French (14,600)
8. English (14,500)
9. Italian (12,800)
10. Scottish (12,400)
11. Other Jewish (7,700)
12. Dutch (7,100)
13. Russian (5,400)
14. Polish (5,000)
15. Catalan (3,500)
16. Armenian (2,800)
17. Arabic (2,500)
18. Native American Dakota (2,500)
19. Japanese (2,200)
20. Czech (2,200)
21. Hindi (2,000)
22. Swedish (1,700)
23. Hungarian (1,700)
24. Middle English (1,600)
25. Norwegian (1,400)
26. Indian (1,300)
27. Turkish (1,300)
28. Anglicized Irish (1,200)
29. Ukrainian (1,100)
30. Welsh (1,100)

Note that the counts are rounded down to the nearest 100.

A list of all the identified languages is included with the software as a Microsoft Excel (XLSX) file. The language names chosen are detailed and easy to search for.

Last name usage by race

The race usage of each name is identified in a series of fields which provide an actual or estimated percentage of use for each race. Differently styled names can have different race values. Race coverage includes:
  • White
  • Black
  • Hispanic/Latino
  • Asian/Pacific
  • Native American/Alaskan
  • Multirace

Excellent for genealogical and scholarly research

In addition to being a powerful resource for businesses and organizations working with lists of names, the software is also recommended for ancestry and scholarly research. Attention has been paid to accurately and precisely representing the origin and history of last names and the relationships between them. It is of particular benefit in the following fields:
  • Genealogy
  • Onomatology
  • Anthroponymy
  • Ethnology
  • Linguistics

Special features

In addition to the onomastic research, during each development cycle certain aspects are emphasized. In this version of pdSurname special attention was paid to the following: