NEW 3.0
pdGender gender coding database
MORE NAMES AND NEW FEATURES

Gender Coding Database

Male and female identification is essential for businesses and organizations. It allows you to send mail with a personal touch. Gender Coding also allows you to filter, map, and analyze your data based on this critical demographic. The new pdGender 3.0 lets you accomplish this in ways not before possible on this scale. It is available in Pro and Standard editions.

Coverage includes hundreds of thousands of names and the package employs the best matching algorithms designed for this process. As an added benefit, languages of origin and use have also been researched and included along with additional features never before available on this scale.

Pro and Standard

Both editions include the same names and features, except that the Pro edition adds fuzzy logic. Fuzzy logic allows matching when lists contain typographical errors or stylized spellings.

A POWERFUL GENDER CODING CREATION

pdGender is a proprietary resource not duplicated elsewhere. For more than 20 years our software has been utilized by businesses and organizations around the world in applications you use every day.

Comprehensive gender coding database:
  • 397,000 standard gender coding records

Advanced Gender Coding System:
  • 1 unfiltered gender field
  • 140 filtered gender fields
  • Rare usages of unisex names by one gender identified
  • More than 500 languages of origin and use identified
  • Excellent for genealogical and scholarly research

Pro only features:
  • More than 3 million fuzzy logic gender coding records

Other benefits:
  • Designed to be fully compatible with pdNickname and pdSurname
  • Comes in multiple file formats: Comma Delimited (CSV), Fixed Length, and DBF
  • Full documentation
  • Perpetual Site License, allowing installation on all computers in the same building within a single company or organization
  • Available for immediate download

SPECIFICATIONS

Sku | Pro: 1NG300P | Standard: 1NG300S
Product Name | pdGender Pro | pdGender Standard
Version number | 3.0
Description | Gender coding software
Total records* | Pro: 3,569,232 | Standard: 397,847
Zipped size** | Pro: 41.7 MB | Standard: 9.6 MB
Extracted size** | Pro: 3.4 GB | Standard: 662.4 MB
File formats included | Comma Delimited (CSV), Fixed Length, and DBF
Availability | Immediate download
List price | Pro: $395 | Standard: $299
Compatibility | pdGender utilizes only the ANSI character set (ASCII values 0 to 127 and extended values 128 to 255) and comes in multiple file formats to ensure compatibility. The software has also been developed to be fully compatible with pdNickname and pdSurname. The name pair format in pdNickname is very similar to that of the pdSurname database, except that pdNickname is used to match given names and nicknames while pdSurname matches last names. pdGender is based on the first name database and is designed to apply gender identification to first name records. Note that pdNickname and pdSurname are not required to use pdGender, but the three products are designed to work together.
Optional developer license | Available

*The record count is the total number of records contained in each of the three included file formats.

**The zipped and extracted sizes show the combined total size of all product files.

DOCUMENTATION

For better usability of our software we create precise documentation with examples, so you don’t have to worry. The user guide includes detailed instructions, file layouts, the site license, and additional information useful for both business applications and those employing the product for research.

To view the PDF user guide you will need Adobe Acrobat Reader version 4.05 or higher installed on your computer or device. This is a free program downloadable from the Adobe website.

View documentation

SAMPLE

A random sample of the software database is available for download. It includes records from the main database as well as the written documentation from the product and other information. The sample is extracted from the Pro edition. The Standard edition does not include fuzzy logic records.

The database comes in three file formats to ensure compatibility with any database system. Each format contains the same data. Formats include: comma delimited (CSV), fixed length, and DBF.

The written documentation, including the site license, comes in Adobe Acrobat PDF format. To view these documents you will need Adobe Acrobat Reader version 4.05 or higher installed on your computer or device. This is a free program downloadable from the Adobe website.

Download sample

What is fuzzy logic?

If you typed “Garfeild” into a word processor, it would probably be underlined with a squiggly red line signifying a misspelling. It is the name “Garfield” with the “IE” reversed to “EI”—a common mistake.

The fuzzy logic technology in the Pro edition of this software allows matching name data that has typographical errors. If you look at the fuzzy logic examples we have provided below, you are likely to see errors you have repeatedly made or seen. In many cases you will have to look closely to see the difference, but they are different.

Fuzzy logic attempts to duplicate real errors created while entering names into databases. The most likely typographical errors are determined based on the number of letters, the characters involved, where they are located in the name, the language, and other factors.

The biggest advantage of our technology is its ability to work with language rules that indicate how individuals of various nationalities may hear and spell names.

Some fuzzy logic spellings have one typographical error while others have multiple issues, so the technology is suited for even the worst typists and transcribers. The algorithms have five layers:

Phonetic misspellings

These algorithms look at digraphs, trigraphs, tetragraphs, pentagraphs, hexagraphs, and even a German heptagraph, “SCHTSCH”, used to transliterate Russian words with the “SHCHA” or “SHCH” (romanized) sound. These are, respectively, two- to seven-letter sequences that form one phoneme or distinct sound. Most of the letter sequences of three or more letters are Irish, a language with an unusually large number of spelling rules.

Many misspellings occur as transcribers enter the sounds they hear. The character sequences and the sounds they produce are different for each language and situation, such as before, after, or between certain vowels and consonants, so our substitutions are language-rule based. Furthermore, our algorithms consider both how a name may sound to someone who speaks English as well as how it may sound to someone who speaks Spanish, which is often different. Take the digraph “SC”. Before the vowels “E” or “I” it is most likely to be misspelled by an English speaker as “SHE” or “SHI” while a Spanish speaker may hear “CHE” or “CHI” and sometimes “YE” or “YI”. Our library includes over 80,000 language-based letter sequence phonetic rules. Phonetic misspelling examples:

Example 1 | Real: BARTHOLOMEW | Fuzzy: BARTHOLOMUE
Example 2 | Real: DAWNETTE | Fuzzy: DAUNETTE
Example 3 | Real: NATHANIEL | Fuzzy: NATHANAIL
Example 4 | Real: PHYLLIS | Fuzzy: FYLLIS
Example 5 | Real: SIGOURNEY | Fuzzy: SIGOURNI
Example 6 | Real: XAVIER | Fuzzy: XAVAR
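
As a rough illustration of rule-based phonetic substitution, the sketch below applies a tiny rule table to a name, one rule at one position at a time. The three rules are assumptions inferred from Examples 1, 2, and 4 above. They are not the product's rule library, and the sketch ignores language and letter context, which the real rules take into account.

```python
# Hypothetical rule table: (letter sequence, substitute).
# Inferred from the examples above; NOT the product's 80,000-rule library.
PHONETIC_RULES = [("PH", "F"), ("AW", "AU"), ("EW", "UE")]


def phonetic_variants(name, rules=PHONETIC_RULES):
    """Return spellings made by applying one rule at one position."""
    name = name.upper()
    variants = set()
    for sequence, substitute in rules:
        start = name.find(sequence)
        while start != -1:
            end = start + len(sequence)
            variants.add(name[:start] + substitute + name[end:])
            start = name.find(sequence, start + 1)
    return variants


print(sorted(phonetic_variants("PHYLLIS")))
```

A language-aware version would attach conditions to each rule, such as which letters must precede or follow the sequence.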

Reversed digraphs

These algorithms look for misspellings due to reversed digraphs (two letter sequences that form one phoneme or distinct sound) which are a common typographical issue, such as “IE” substituted for “EI”. The character sequences and the sounds they produce are different for each language and situation, such as before, after, or between certain vowels and consonants, so our substitutions are language-rule based. Reversed digraph examples:

Example 7 | Real: ANNABETH | Fuzzy: ANNABEHT
Example 8 | Real: CAETLIN | Fuzzy: CEATLIN
Example 9 | Real: EUGENE | Fuzzy: UEGENE
Example 10 | Real: FRIEDRICH | Fuzzy: FREIDRICH
Example 11 | Real: RAQUEL | Fuzzy: RAUQEL
Example 12 | Real: VICKTOR | Fuzzy: VIKCTOR
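
The reversed pairs in Examples 7 to 12 can all be produced by swapping two neighboring letters. The sketch below generates every such swap for a name. It is a simplified assumption for illustration: it swaps any adjacent pair, whereas the description above says the product's substitutions are guided by language rules.

```python
def adjacent_transpositions(name):
    """Return every spelling made by swapping two neighboring letters."""
    name = name.upper()
    variants = set()
    for i in range(len(name) - 1):
        # Swapping two identical letters would reproduce the original name.
        if name[i] != name[i + 1]:
            variants.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])
    return variants


print("FREIDRICH" in adjacent_transpositions("FRIEDRICH"))
```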

Double letter misspellings

These algorithms look for misspellings due to double letters typed as single letters and single letters that are doubled. The most common typographical issues occur with the characters, in order of frequency, “SS”, “EE”, “TT”, “FF”, “LL”, “MM”, and “OO”. Double-letter misspelling examples:

Example 13 | Real: EMANNUEL | Fuzzy: EMMANNUEL
Example 14 | Real: KASSANDREA | Fuzzy: KASANDREA
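
A minimal sketch of the two double-letter cases follows: a doubled letter typed once, and a single letter typed twice. It treats every letter alike, which is an assumption made for brevity. The description above indicates that some letters are far more likely to be involved than others.

```python
def double_letter_variants(name):
    """Return spellings with one doubled letter collapsed or one single letter doubled."""
    name = name.upper()
    variants = set()
    for i, letter in enumerate(name):
        prev_same = i > 0 and name[i - 1] == letter
        next_same = i + 1 < len(name) and name[i + 1] == letter
        if next_same:
            # Doubled letter typed once: drop one copy.
            variants.add(name[:i] + name[i + 1:])
        elif not prev_same:
            # Single letter typed twice: insert a second copy.
            variants.add(name[:i] + letter + name[i:])
    return variants


print("KASANDREA" in double_letter_variants("KASSANDREA"))
```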

Missed letters

These algorithms look for missed keystrokes and provide fuzzy logic matches with missing letters. Unlike the other algorithms, these are not language specific. Keystrokes can be missed in any language. Missed letter examples:

Example 15 | Real: ABDUL | Fuzzy: ADUL
Example 16 | Real: MARGARET | Fuzzy: MRGARET
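
Because missed keystrokes are not language specific, they are the simplest case to sketch: drop each letter in turn. The function name is hypothetical, and the product may limit which deletions it keeps.

```python
def missed_letter_variants(name):
    """Return every spelling made by dropping exactly one letter."""
    name = name.upper()
    if len(name) < 2:
        return set()  # dropping the only letter would leave an empty name
    return {name[:i] + name[i + 1:] for i in range(len(name))}


print(sorted(missed_letter_variants("ABDUL")))
```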

String manipulations

These algorithms change letters and syllables in a variety of ways. They are less guided by language rules and more guided by randomness. String manipulation examples:

Example 17 | Real: CYNTHIA | Fuzzy: CYNTTHA
Example 18 | Real: GERALD | Fuzzy: GERLLD

Comprehensive first name and nickname database

The product is like no other gender coding database. It is essentially very simple, but with a lot of power. Users match the first names in their database lists and the software provides male or female gender identification.

The main gender database lists 397,000 given names and nicknames, complete with gender, languages of origin and use, name rank in the United States, and other demographics.

For unisex names, there are special filters allowing users to tweak the output based on languages and nationalities, usage by gender, and other factors.

The database contains all the first name spellings gathered and published by the U.S. Census Bureau and Social Security Administration between 1800 and the present time, related nicknames, and ethnic given names and nicknames not found in the United States. About 75 percent of the given names and nicknames can be found in the United States, and the remainder are found only outside the United States.

Using this software results in reduced long-term costs, improved customer service, and better marketing data.

Gender Coding filters

This software identifies names as male (“M”), female (“F”) or, when the name is both male and female, unisex (“U”).

The WORLD field is the first in a series of 141 gender coding fields. Notably, it is the only gender coding field without filters of any kind. It is called “world” because it defines the basic international usage of each name. It can be used like the standard unfiltered gender coding fields most users are familiar with in other products. It derives the largest number of unisex identifications because it gives equal weight to all languages and nationalities. If a name is male in the United States and female in Vietnam, this field will flag the name as unisex.
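
The equal-weight behavior described for the WORLD field can be sketched as follows. The function and its input layout are assumptions for illustration only. The input maps each language or nationality to the gender code a name has there.

```python
def world_gender(genders_by_usage):
    """Combine per-usage codes ("M", "F", or "U") into one unfiltered code."""
    codes = set(genders_by_usage.values())
    if not codes:
        return None  # no usage information at all
    if codes == {"M"}:
        return "M"
    if codes == {"F"}:
        return "F"
    # Any disagreement between usages, or any unisex usage, yields unisex.
    return "U"


print(world_gender({"United States": "M", "Vietnam": "F"}))
```

With every usage weighted equally, a single opposite-gender usage anywhere is enough to produce “U”, which is why this field yields the most unisex identifications.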

Following this field is a series of 140 filtered gender coding fields, which are the heart of the pdGender matching system. They allow filtering the gender coding output for languages and nationalities, rare usage by one gender, archaic names, and nicknames.

The field names are designed to indicate what filters are applied. Here are examples:

Example 1 | Field: WORLD_XA | Description: Gives equal weight to all languages and nationalities and reduces precedence of archaic names in gender determination
Example 2 | Field: USA_XAN | Description: Prioritizes United States names and reduces precedence of archaic names and nicknames in gender determination
Example 3 | Field: EN_FR_XAR | Description: Prioritizes English and French names and reduces precedence of archaic names and rare usages in gender determination
Example 4 | Field: HISP_XANR | Description: Prioritizes Hispanic names and reduces precedence of archaic names, nicknames, and rare usages in gender determination
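
A lookup that lets the caller choose which gender coding field to read might look like the sketch below. The records, their values, and the NAME column are made up for illustration and are not taken from the product files. Only the field names WORLD, USA_XA, and ITA_XA follow the naming scheme described here.

```python
# Illustrative records only: values are invented for this sketch and
# "NAME" is a hypothetical column name, not the product's file layout.
RECORDS = [
    {"NAME": "ANDREA", "WORLD": "U", "USA_XA": "F", "ITA_XA": "M"},
    {"NAME": "MARGARET", "WORLD": "F", "USA_XA": "F", "ITA_XA": "F"},
]
INDEX = {row["NAME"]: row for row in RECORDS}


def code_gender(first_name, field="WORLD", index=INDEX):
    """Return the gender code for a first name from the chosen field."""
    row = index.get(first_name.strip().upper())
    if row is None:
        return None  # name not in the list
    # Fall back to the unfiltered field if the requested one is absent.
    return row.get(field, row["WORLD"])


print(code_gender("Andrea"), code_gender("Andrea", "ITA_XA"))
```

Choosing the field per mailing list, rather than per name, keeps the output consistent for a given audience.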

Prioritizing Languages and Nationalities

Because names can have different genders in different languages and nationalities, a filter is provided allowing the choice of which languages and nationalities to take precedence. There are 35 options which are indicated in the prefix of each gender coding field name. The choices and field name prefixes are:

Prefix | Filter
WORLD_ | All languages and nationalities are given equal weight
USA_ | United States names are prioritized
US_ES_ | United States and Spanish (Español) names are prioritized
US_HS_ | United States and Hispanic names are prioritized
US_FR_ | United States and French names are prioritized
ENG_ | English names are prioritized
EN_AA_ | English and African American names are prioritized
EN_ES_ | English and Spanish (Español) names are prioritized
EN_HS_ | English and Hispanic names are prioritized
EN_FR_ | English and French names are prioritized
AFRAM_ | African American names are prioritized
SPA_ | Spanish names are prioritized
HISP_ | Hispanic names are prioritized
FRA_ | French names are prioritized
AFR_ | African (non-Muslim) names are prioritized
BRIT_ | British names are prioritized
CEL_ | Celtic (language family) names are prioritized
EASIA_ | East Asian names are prioritized
EA_PI_ | East Asian and Pacific Islander names are prioritized
GAEL_ | Gaelic (Goidelic language family) names are prioritized
DEU_ | German (Deutsch) names are prioritized
GEM_ | Germanic (language family) names are prioritized
HAW_ | Oceania Hawaiian names are prioritized
IND_ | Indian (South Asia) names are prioritized
ITA_ | Italian names are prioritized
JW_ | Jewish, Yiddish, and Hebrew names are prioritized
MUS_ | Muslim names are prioritized
NATAM_ | Native American names are prioritized
PISLR_ | Pacific Islander names are prioritized
ROA_ | Romance (language family) names are prioritized
SCAND_ | Scandinavian names are prioritized
SLA_ | Slavic (language family) names are prioritized
CYM_ | Welsh (Cymraeg) names are prioritized
WEST_ | Western World names are prioritized
NWEST_ | Non-Western World names are prioritized

Adding Other Filters

The suffix of each gender coding field name indicates any additional filters that are applied. They all begin with an “X”, indicating eXclusion, followed by up to three characters (“A”, “N”, and/or “R”, in respective order) showing what filters are applied. There are four possible suffixes:

Suffix | Filter
_XA | Reduces precedence of archaic names in gender determination
_XAN | Reduces precedence of archaic names and nicknames in gender determination
_XAR | Reduces precedence of archaic names and rare usages in gender determination
_XANR | Reduces precedence of archaic names, nicknames, and rare usages in gender determination
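
Combining the 35 prefixes with the 4 suffixes gives 35 × 4 = 140 filtered field names, which matches the count of filtered fields stated earlier. The sketch below builds the names and splits one back into its parts. The helper names are hypothetical, and the prefix and suffix lists are copied from the two tables above.

```python
PREFIXES = [
    "WORLD_", "USA_", "US_ES_", "US_HS_", "US_FR_", "ENG_", "EN_AA_",
    "EN_ES_", "EN_HS_", "EN_FR_", "AFRAM_", "SPA_", "HISP_", "FRA_",
    "AFR_", "BRIT_", "CEL_", "EASIA_", "EA_PI_", "GAEL_", "DEU_", "GEM_",
    "HAW_", "IND_", "ITA_", "JW_", "MUS_", "NATAM_", "PISLR_", "ROA_",
    "SCAND_", "SLA_", "CYM_", "WEST_", "NWEST_",
]
SUFFIXES = ["_XA", "_XAN", "_XAR", "_XANR"]


def filtered_field_names():
    """Build every prefix and suffix combination, e.g. "USA_" + "_XAN" -> "USA_XAN"."""
    return [p + s.lstrip("_") for p in PREFIXES for s in SUFFIXES]


def split_field_name(field):
    """Split a field name into (prefix, set of exclusion flags)."""
    prefix, separator, flags = field.rpartition("_X")
    if not separator:
        return field + "_", set()  # unfiltered field such as "WORLD"
    return prefix + "_", set(flags)


print(len(filtered_field_names()), split_field_name("EN_FR_XAR"))
```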

Rare usages of unisex names by one gender

One of the most useful features of the software is that rare usages of names are identified by language. These flags are applied to unisex names and show when a name is used less than 20 percent of the time in the cited language and gender. This indicator allows filtering out rare usages in gender coding.

Note that rare usage indicators should not be compared across languages, only within the same language. A usage labeled rare in Spanish but not in English does not mean the name is used less in Spanish than in English. It means the usage is rare in Spanish compared with the opposite-gender usage in Spanish.
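
The 20 percent rule can be sketched as a share computed within one language. The counts and the function name are hypothetical. The product supplies ready-made flags rather than raw counts.

```python
def rare_genders(usage_counts, threshold=0.20):
    """Return the genders whose share of a name's usage, within ONE language,
    is above zero but below the threshold."""
    total = sum(usage_counts.values())
    if total == 0:
        return set()
    return {
        gender
        for gender, count in usage_counts.items()
        if 0 < count / total < threshold
    }


# Hypothetical counts for one name in one language.
print(rare_genders({"M": 90, "F": 10}))
```

Because each share is computed against the same language's total, a flag says nothing about how common the name is in any other language.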

Languages of origin and use

Language coverage is extensive. The list exceeds 500 languages, language families, and dialects. Some languages refer to ethnic groups. None of the languages were derived algorithmically and the provided information represents years of extensive onomastic research. When different sources list different origins and usages they may be combined depending on the reliability of the source and the reasonability of the information. Differently styled names can have different language values.

Top 30 languages

The following are the top 30 languages with the number of occurrences in the names database. The language count is one for each unique name formation:

1. English (225,000)
2. Arabic (46,700)
3. Turkish (6,700)
4. Punjabi (6,700)
5. French (5,900)
6. Iranian (5,800)
7. Urdu (4,900)
8. Afghan Arabic (4,400)
9. Swedish (4,100)
10. Finnish (3,600)

11. Spanish (3,300)
12. Italian (3,100)
13. Bengali (3,100)
14. German (3,100)
15. Pashto (3,000)
16. Norwegian (3,000)
17. Danish (2,900)
18. Korean (2,700)
19. Egyptian Arabic (2,400)
20. Czech (2,200)

21. Czech (2,000)
22. Russian (1,900)
23. Dutch (1,800)
24. Hungarian (1,800)
25. Portuguese (1,700)
26. Malaysian Malay (1,700)
27. Albanian (1,700)
28. Japanese (1,700)
29. Bosniak Bosnian (1,500)
30. Icelandic (1,400)

Note that the counts are rounded down to the nearest 100.

Also note that the Arabic and Muslim name section is very large due to the many different variations and ways of writing these names. These include theophoric combination names such as those with the religious prefix “Abdul”. Both common and uncommon possibilities are included, and the use of Sun Letters in Arabic and Maltese is accounted for.

A list of all the identified languages with counts is included with the software as a Microsoft Excel (XLSX) file. The language names chosen are detailed and easy to search for.

Excellent for genealogical and scholarly research

In addition to being a powerful resource for businesses and organizations working with lists of names, the software is recommended for ancestry researchers, students, teachers, and scholars. Attention has been paid to accurately and precisely representing the origin and history of given first names (also known as personal names and forenames) and nicknames (including short forms, diminutives, and even hypocoristics) and the relationships between them. It is of particular benefit in the following fields:

Special and unique origins

Of interest to those studying names, many records provide information about special and unique origins: