Database conversions are not a “flip of the switch” process and require some work to get the data from the source database into the target database.
This walkthrough will outline tips and techniques used in Microsoft Excel to prepare data for loading into a target database or for performing offline data cleansing.
General Rules
- Save your work in versions so you can revert back if necessary
- Any new data added to the system post-export may have to be re-cleansed so you might want to freeze your database
- Work with the Excel file only in a secured and control environment
- Excel does have its limitations. 65536 rows to be exact
- Audit your work.
Sections
- Identify duplicate records
- Remove duplicate records
- Manipulate database columns to match a target format
- Populate blank data quality codes
- Split up one field into several fields
- Check for a middle initial
- Strip out undesirable characters
- Combine data elements that are stored across multiple columns into one column
- Translate values to be consistent with the target format
- Identify and remove erroneous service records
Identify Duplicate Records
- Copy the unique fields you wish to concatenate to blank columns at the beginning of the worksheet. These are considered your “key fields” and should be the key identifiers you use to find a record. (HUD recommends name, DOB, gender and social)
- Insert another blank column to the left of these columns
- Enter the concatenate function (i.e. =CONCATENATE(C2,D2,E2)) in the first row of the column you just created. Copy this formula down the column to the last record
- Click on Data | Sort and select the column containing the newly created key
- Insert another column. Check for duplicates using the EXACT command. If the new key field is in column A and Column B is blank enter the formula in the first row of Column B as follows =EXACT(A2,A3). Copy the formula down to the remaining cells. The EXACT() function returns TRUE if the values it is given are identical, and FALSE otherwise.
- Since it may be difficult to locate all of the cells that have a value of TRUE in them you may wish to highlight the TRUE values using conditional formatting. To do this select the test column and click on Format | Conditional Formatting. Arrange the drop down boxes to read “Cell Value Is” “Equal To” “TRUE”. Select the Format button, followed by the Patterns tab, then choose the color to highlight the cell if the function value is true
Remove Duplicate Records
If you are working with one flat table and need to break it into separate tables you MUST copy the client ID from the first record to the duplicated records so everything jives. Once you have performed this step you can copy the table over to the other worksheets where they can be manipulated further as needed
Scroll the worksheet for values of “True”. Compare the matching records to determine how to handle the record as follows:
A). If the data is purely redundant (elevator button syndrome) then delete any duplicate records from the table with a value in Exact Match?=“True”.
B). If there are potentially valid data in multiple records you may choose to merge the data into one record and delete the other(s).
Manipulate the database columns as needed to match the desired target format
Basic Manipulation
- Inserting/Adding Columns – Simply select the column to the right of the column you wish to add, right click, and select “Insert”
- Deleting Columns – This process can be handled simply by selecting the column that is not needed, right clicking and selecting “Delete”
- Moving Columns – Select the column you wish to move, right click, select “Cut”, select the column to the right of where you want the column to go, right click, and select “Paste”
Advanced Manipulation
- Create a separate worksheet in the same workbook that has the source data.
- Copy the target format field names into a header record (the first row) on the worksheet. You may have to use Copy | Paste Special and select Transpose in order to copy field names listed vertically to be listed across the worksheet
- Enter the equals sign (=) in the first column and first row under the newly copied file header row. Click over to the source file, select the corresponding field in the first row under the source file’s header row and hit Enter. Repeat this process until you have mapped one field for all of the values in your target database
- Copy the row containing these new formulas down your spreadsheet to at least as many rows as you have records in your source file
- Create a blank worksheet. Copy the entire worksheet using Select All (the box in the top left corner of the workbook) and use Paste Special. Select the option to paste Values only
- Save the newly created target file
Populate blank data quality codes
Fields such as last permanent zip code and social security number require a data quality code. If your staff has not been answering this question you can derive whether or not the code is partial or full based on the length of the data entered using the =LEN() function to count the number of characters
To use this technique perform the following:
- Insert a column next to the field you want to create codes for
- Enter =LEN(FieldYouWantToCheck) into Row 2 of a new column and copy the formula down the column.
- Sort the database by this new length column
- Use either Copy | Paste (CTRL C, CTRL V) or Find|Replace (CTRL F) to change values from lengths to HUD data quality code values.
Splitting up one field into several fields
To use this technique perform the following:
- Right-click column heading for the full name field and select Insert. If you are splitting the field
- Select the name field column then click Data in the menu bar and select Text to Columns
- Select the option for Delimited and click Next
- In the Delimiters section click Space and/or Comma (depending on how your data is stored) and then click Finish. Note if the data is stored as Simmonds, Matthew D. you will have to insert three columns and use both comma and space as delimiters
- When prompted, answer OK to replace the contents of the destination cells
- Change the text in the column headers to now accurately reflect the change. (i.e. Last Name, First Name, Middle Name
Checking for a middle initial
If you are breaking up one field, such as Full Name, into several you may have to account for smaller fields such as a middle initial field that will sometimes be incorrectly spliced into other fields. You can use the Len() formula to check this as well
To use this technique perform the following:
- Enter =Len(cell_you_want_to_check_for_the_middle_ initial) in a blank column
- Copy this formula down to the last data row
- Sort the data by this newly created length field using Data | Sort
- Copy & Paste data values as necessary to separate middle initials from last name fields
Stripping out undesirable characters
For the sake of data consistency, it is common to remove extra characters such as $,-/. If you are trying to link a social security number that has dashes with one that does not you will never get a match. You can use the Find|Replace (CTRL F) function to resolve this
To use this technique perform the following:
- To ensure the Find | Replace works only over the column you wish to work with copy the data column to a blank worksheet
- Highlight the column you want to work with and hit CTRL+F.
- To remove dashes you would enter “–” in the field labeled Find what:, click the Replace tab and replace the dash with blanks by hitting the space bar once in the Replace with: field and click Replace.
- After performing the Find|Replace copy the properly formatted data back into the original worksheet.
Combining Data Elements Across Multiple Columns to one Column
To use this technique perform the following:
- Create a blank column and use the =Concatenate formula as shown above to merge the values
- Copy this formula down to the last data row
- Use Copy | Paste Special to copy the newly created values to another blank column
- Use Find | Replace to change the merged values with valid values. (ie replace FALSEFALSEFALSEFALSETRUE with “White”)
- Repeat this process until all merged values are replaced with valid values
Translating Values to be consistent with a target format
- Create an Excel workbook with three worksheets. The first will contain the values from the source table. The second will be a Field Value Map table to match values used within the source and target databases. The third table will contain the formulas need to translate your table.
- Map the source values to the target values on the Field Value Map worksheet similar to the table shown.
NOTE: You MUST have your source values listed in ascending order (alphabetically or numerically) for this to work properly. - For any fields that need value translations to use the =VLOOKUP function in Excel to translate the source database values to the target values. For values that are not to be translated you can simply use the = technique described earlier
- Open a blank workbook. Use Copy | Paste Special and select to paste the Values on your target worksheet to the new workbook.
- Save the newly translated file.
Removing erroneous service records
Quite often it is difficult to figure out when a client has left (especially for programs such as street outreach) so the data records for a recorded service end up having blanks for an exit date. Blanks can be valid if you are still servicing the client. One way to check for erroneous blank conditions is to determine if you have provided the same client with the same service since the record where the blank exit occurs. You can do this using the =AND function as shown