Tuesday, November 10, 2009

Parsing Names

I'm thankful for the coworkers who ask me to do odd-jobs in Excel.  If it wasn't for them, I'd never have anything to blog about.

Today's post is about parsing text, and specifically about turning a full name into a first, middle, and last name.  Let's start with a list of names.  Oh look, an obscure 80s television reference!



So we want to change these into first, middle, and last names.  How hard is that?  Select the names, go to Data > Text to Columns, select space delimited, finish the wizard, and Bob's your uncle.  Is that a correct use of an apostrophe?  A contraction of Bob is?  I think so.



Anyway, this doesn't work so well because some people have four names.  The census I was using in reality had a person with five different words in their full name.  Obviously these people are trouble makers and deserve to have their benefits revoked, but let's be the bigger person and try to work with them.



First off, a few things:
  1. Some people have small words that are part of their last name (Van, De La, etc.).  Because people can also have short middle names, we can't just pull these out by checking the length.  We'd have to keep a list of popular last-name articles and hope nobody's middle name was Van.  But we don't want to do this because...
  2. I don't know anything about the naming conventions of other countries/languages.  Would Chow Yun Fat consider Yun a middle name, or is it part of his last name?  I don't know!  So...
  3. We're going to have to consider the first chunk of letters the first name and the last chunk of letters the last name, and just throw everything else into the middle name and let someone else sort it out.
Now that we've defined our task, the solution is relatively straight forward.  To grab the first name, we're going to use the LEFT formula to grab the text from the left side of the name and the FIND formula to find the first space, or where we want to stop.



LEFT takes 2 arguments:  the word you want to chop and how many characters you want counting in from the left side of the word.  FIND is going to look for the string we specify, in this case a space (" "), and return the position of the first occurrence of that string in the text that we are searching.  Because we don't want to include the space, we'll subtract 1 from FIND's result.

The last name is not as easy.  I spent a good 2 minutes thinking about it, decided that there must be a function in Excel that reverses text, and went online to look for it.  No need to re-invent the wheel, there are probably only 5 people in the world that are actually doing new things in Excel and incidentally they all have very nice Excel blogs.

Instead of just a reverse text function, I found a question about and answer to the exact same problem I was having.  As a bonus, the link does include the code for a user-defined function called REVERSETEXT that was apparently written by the infamous J. Walkenbach.  I think he is a professional banjo player or something.  I'm going to use the ugly solution though, as originally posted by Aladin Akyurek at that same link:
=RIGHT(D2,LEN(D2)-SEARCH("@",SUBSTITUTE(D2," ","@",LEN(D2)-LEN(SUBSTITUTE(D2," ","")))))
What is this code doing?  Well the end of my lunch is fast approaching so some of this may be less than accurate, but here goes.  The first thing it's doing is counting the number of spaces in the string:
LEN(D2)-LEN(SUBSTITUTE(D2," ",""))
By deleting the spaces (John Doe = JohnDoe) and seeing how much the length of the string (returned by LEN) changes, we can find the number of spaces in the name.

We use this number for the second SUBSTITUTE function.  This function is substituting the last space in the name with an @ symbol.  The SUBSTITUTE function first takes the original string as an argument, then the text you want to change, what you want that text to change to, and, optionally, which instance of that old text you want to change.  So, for example, the formula SUBSTITUTE("Ba Ba","Ba","Da Bing",2) will return the string "Ba Da Bing", as it is only changing the second occurrence of "Ba".  We counted the number of spaces in the full name and are using that as the instance, so only the last space will become an @ symbol.

After that, we're just using the SEARCH function to find the position of the @ symbol in the array.  We subtract that position from the length of the full name, which will give us the number of characters after the last space in the string.  We can then pass this number to the RIGHT function to count in that many characters from the right side of the string and pull those out.



The next part is easy.  For the middle name, we're just going to use the MID formula and subtract the lengths of the first name and last name, then TRIM any extra spaces.



The MID formula returns sections of text.  You pass it the text you want to cut, where you want to start cutting, and how many letters you want to take.  We're telling it to cut the full name, start at the character after the length of the first name, which will be the first space, and then telling it to take as many characters as the full name has less the characters we already pulled out for the first and last names.  Since this will include the spaces, we're also using the TRIM formula.

The nice thing about this method is that it works fine for people with and without a middle name.  Where doesn't it work?  People with only one name.  People like Sting.  Or Bono.  Or Prince.  As you can see, people with only one name are generally pompous, self-important rock stars.  We don't shouldn't care whether or not they get correctly registered for their life insurance benefit or whatever else you might be using your census for.  If you really wanted to though, you can fix that by catching errors using the IF and ISERROR functions.

-David

No comments:

Post a Comment

 
Creative Commons License
David @ Work by David Montgomery is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.