Getting Started with U.S. Census Data

U.S. Census QuickFacts data is an amazing resource available to the public free of charge.  This treasure trove of information is incredibly useful for many different types of analysis involving where people live in the United States, along with other demographic information, all broken down by county and state.  While U.S. Census QuickFacts data is available and summarized from multiple sources throughout the Internet, the U.S. Census Bureau makes the information directly available only via flat text files on its website.  Fear not!  Without too much effort, we can take those flat files and import them into a relational database.

In this post, I’ll be covering the process of downloading the U.S. Census QuickFacts data from the Census Bureau’s website (currently data from the 2010 census) and importing it as new tables into an existing SQL Server database using SQL Server Management Studio and a little bit of Microsoft Excel.  I’m using SQL Server 2012 and Microsoft Excel 2013, but any recent version of either should work.

The first step in using U.S. Census QuickFacts data is to download it.  This can be done via the U.S. Census Bureau website, where there are three files of interest:

  • DataSet.txt — the actual census data in a CSV file
  • FIPS_CountyName.txt — the FIPS codes for each county and state used within the DataSet file.
  • DataDict.txt — descriptions of what each column name represents.  This file is a useful reference, but we won’t be importing it into our database.

Start things off by downloading the three text files.

U.S. Census data available for download from the Census Bureau website.

After downloading the three files, we need to prepare them for import into our database.  Start by renaming the dataset and FIPS files to the names you’d like the final tables to have in the database.  I renamed DataSet.txt to census_data.txt and FIPS_CountyName.txt to census_counties.txt.  Since we won’t be using DataDict.txt except for our own reference, we don’t need to do anything with it.

Importing FIPS_CountyName.txt (aka census_counties.txt)
If you look at FIPS_CountyName.txt in a text editor, you will notice one, two, or three columns depending on how you delimit them.  In my example, column 1, the FIPS code, is a 5-character numeric code.  Column 2, separated by a single space, is the county name, and column 3, separated by a comma, is the state name.  Since this 3-column version is part fixed width and part comma delimited, we’re going to use Excel to get it into a purely comma-delimited format.

Start by opening Excel.  Once within Excel, choose to open a file and select your FIPS_CountyName.txt file (which hereafter I will refer to as census_counties.txt).  The Text Import Wizard will appear.  The first goal is to split the FIPS code column from the county name and state column.  To do this, perform the following actions within the Text Import Wizard:

  • On step 1, choose Fixed width
  • On step 1, de-select “My data has headers”
  • On step 2, add a break at character 5 (the end of the FIPS code) and a break at character 6 (the start of the name)
  • On step 3, set the column data format to Text for the first column so it doesn’t convert to a number (which would lose the leading 0’s)
Choose Fixed width and de-select “My data has headers.”
Make sure you have column breaks at characters 5 and 6 and choose the type of Text for column 1.

Upon completing the wizard, you will now have a three column spreadsheet in which “A” contains the FIPS code, “B” is blank, and “C” is the county and state.  Delete the empty “B” column so we only have a two column spreadsheet:

Our census_counties.txt spreadsheet with the FIPS code broken out from the county name.

Next, we need to separate the county name from the state name.  We can use the comma as a delimiter and “split” the column based on that.  To do so, select Column “B” and from the Data ribbon, choose “Text to Columns.”  You will see a familiar-looking wizard.

Choose “Text to Columns” from the Data ribbon to “split” a column into two.

Within the “Convert Text to Columns Wizard,” choose Delimited and on the following page, choose Comma and complete the wizard.

The Convert Text to Columns Wizard used to split a column based on a delimiter.
The Convert Text to Columns Wizard with a comma delimiter splits the county name column into two.

After splitting the county name and state column into two distinct columns, we’re almost done!  We next need to add column headers to our three columns.  In this case, I named the first column fips, the second column County_Name, and the third column County_State.  Save the file as a CSV, such as census_counties.csv, and close it (Excel will give you some warnings about your choice of file format, which you can ignore).

Our nearly completed census_counties.csv file in Excel after being split into three columns.

We have one last task to finish cleaning our census_counties.csv file.  If you look closely at the State column, you’ll notice that each state is preceded by a space.  We can clean this up pretty easily by opening the census_counties.csv file we just created in Notepad, the built-in Windows text editor.

With the file open in Notepad, perform a Replace and replace all instances of <comma><space> with <comma>.  The space before each state should disappear.  Save the file, and we now have a properly formatted CSV file ready for import.

Perform a Replace on your census_counties.csv file to replace <comma><space> with <comma>.
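If you’d rather script all of the Excel and Notepad steps, a minimal Python sketch can do the same cleanup in one pass.  This assumes the layout described above (a 5-character FIPS code, one space, then “County Name, ST”), so verify it against your copy of the file before trusting the output:

```python
# Sketch of the Excel and Notepad steps above in one pass: split each
# line of FIPS_CountyName.txt into (fips, county, state) and trim the
# stray space before the state. Assumes each line is a 5-character
# FIPS code, one space, then "County Name, ST".
def clean_fips_lines(lines):
    rows = []
    for line in lines:
        fips = line[:5]                # keep as text: leading zeros matter
        county, _, state = line[6:].strip().rpartition(",")
        rows.append((fips, county.strip(), state.strip()))
    return rows

# Two example entries, formatted per the assumption above:
rows = clean_fips_lines(["01001 Autauga County, AL",
                         "13135 Gwinnett County, GA"])
# rows[1] == ("13135", "Gwinnett County", "GA")
```

Writing the rows out with Python’s csv module (plus a fips, County_Name, County_State header row) would give the same census_counties.csv we just built by hand.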

With our properly formatted CSV file, we can finally use SQL Server Management Studio to start importing data into our SQL database.

Open SQL Server Management Studio and right-click on your database (if you don’t have a database, you can create a new one at this point).  Choose Tasks -> Import Data…, which will bring up the SQL Server Import and Export Wizard.

Right click your database and choose Tasks -> Import Data… to launch the Import and Export Data Wizard.

Within the SQL Server Import and Export Wizard, choose a Flat File Source and browse to your census_counties.csv file.  On the General page of the data source settings, the format should be Delimited and “Column names in the first data row” should be checked.  Since our file is properly formatted, the wizard should have guessed these correctly already.  Review the General, Columns, and Preview pages to ensure the wizard is finding three columns and assigning appropriate column names.

The SQL Server Import and Export Wizard data source general page should have Delimited selected.
The SQL Server Import and Export Wizard Columns page with the correct columns shown.
The SQL Server Import and Export Wizard Preview page with the correct columns shown.

Finally, choose a destination for where the data will be imported.  This should be a SQL Server (SQL Server Native Client) database.  In my case, I’m importing into the CensusDemo database on localhost.  Complete the wizard and have it Run immediately and your data should import into your database.

Within the SQL Server Import and Export Wizard, choose a destination server and database for the data.
A final chance to review mappings within the SQL Server Import and Export Wizard before beginning the import.
Execute the SQL Server Import and Export Wizard with the Run immediately option.
The completion of the SQL Server Import and Export Wizard should end with success and 3195 rows imported.

At this point, we’ve successfully imported the County Name data (census_counties.csv) into our database.  We’re now ready to move on to importing actual census data!  Don’t worry, that process is much more straightforward…

Importing DataSet.txt (aka census_data.txt)
The DataSet.txt file (hereafter referred to as census_data.txt) is a properly formatted CSV file with all of the actual census data contained within.  Fortunately, since this file is properly formatted, we don’t need to use Excel or any other tools to prepare it for import.

There is one catch, however.  By default, without any other guiding information, the SQL Server Import and Export Wizard (which is what we’ll use to import the data into SQL Server) considers the data type of every column to be a VARCHAR (text-based) field.  This is a problem: since we will be doing a lot of aggregating of this data, we don’t want to have to cast each field to a numeric type every time.
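Here is the same problem in miniature (a Python analogy, not the wizard itself): when numbers are stored as text, every numeric use requires a conversion first, much like wrapping every column in CAST(column AS FLOAT) in every query.

```python
# Text values don't aggregate: summing these strings directly raises a
# TypeError, so every calculation needs a conversion first.
populations = ["54571", "805321", "229930"]   # example values only
total = sum(int(p) for p in populations)      # convert on every use
# total == 1089822
```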

We could manually choose the data type for each column in the Mappings section of the SQL Server Import and Export Wizard, however, there are a lot of columns…and that sounds pretty tedious considering every column should be numeric.

Instead, we will create the blank table with the correct data types.  During the import process, the wizard will detect that the table already exists and honor its data types.  The following T-SQL code snippet can be used to create the census_data table for the 2010 QuickFacts Census data:


CREATE TABLE [dbo].[census_data]
(
[fips] VARCHAR(50) NOT NULL,
[PST045213] FLOAT NULL,
[PST045212] FLOAT NULL,
[PST040210] FLOAT NULL,
[PST120213] FLOAT NULL,
[PST120212] FLOAT NULL,
[POP010210] FLOAT NULL,
[AGE135212] FLOAT NULL,
[AGE295212] FLOAT NULL,
[AGE775212] FLOAT NULL,
[SEX255212] FLOAT NULL,
[RHI125212] FLOAT NULL,
[RHI225212] FLOAT NULL,
[RHI325212] FLOAT NULL,
[RHI425212] FLOAT NULL,
[RHI525212] FLOAT NULL,
[RHI625212] FLOAT NULL,
[RHI725212] FLOAT NULL,
[RHI825212] FLOAT NULL,
[POP715212] FLOAT NULL,
[POP645212] FLOAT NULL,
[POP815212] FLOAT NULL,
[EDU635212] FLOAT NULL,
[EDU685212] FLOAT NULL,
[VET605212] FLOAT NULL,
[LFE305212] FLOAT NULL,
[HSG010212] FLOAT NULL,
[HSG445212] FLOAT NULL,
[HSG096212] FLOAT NULL,
[HSG495212] FLOAT NULL,
[HSD410212] FLOAT NULL,
[HSD310212] FLOAT NULL,
[INC910212] FLOAT NULL,
[INC110212] FLOAT NULL,
[PVY020212] FLOAT NULL,
[BZA010211] FLOAT NULL,
[BZA110211] FLOAT NULL,
[BZA115211] FLOAT NULL,
[NES010211] FLOAT NULL,
[SBO001207] FLOAT NULL,
[SBO315207] FLOAT NULL,
[SBO115207] FLOAT NULL,
[SBO215207] FLOAT NULL,
[SBO515207] FLOAT NULL,
[SBO415207] FLOAT NULL,
[SBO015207] FLOAT NULL,
[MAN450207] FLOAT NULL,
[WTN220207] FLOAT NULL,
[RTN130207] FLOAT NULL,
[RTN131207] FLOAT NULL,
[AFN120207] FLOAT NULL,
[BPS030212] FLOAT NULL,
[LND110210] FLOAT NULL,
[POP060210] FLOAT NULL
)
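Typing out fifty column definitions by hand is tedious.  If you’d rather generate them, a short Python sketch can build the DDL from DataDict.txt.  This assumes each data line in that file begins with the column code — verify your copy’s layout before relying on it:

```python
# Sketch: build the CREATE TABLE statement from DataDict.txt lines.
# Assumes each line starts with the column code (e.g. "POP010210 ...");
# check the actual file layout before using this for real.
def make_ddl(dict_lines, table="census_data"):
    cols = ["    [fips] VARCHAR(50) NOT NULL"]   # fips stays text: leading zeros
    for line in dict_lines:
        code = line.split()[0]                   # first token = column code
        cols.append(f"    [{code}] FLOAT NULL")
    joined = ",\n".join(cols)
    return f"CREATE TABLE [dbo].[{table}]\n(\n{joined}\n)"

sample = ["POP010210 Population, 2010",
          "LND110210 Land area, 2010"]
print(make_ddl(sample))
```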


With our empty table created, it’s time to launch the SQL Server Import and Export Wizard by right-clicking on your database and choosing Tasks -> Import Data…

Choose a Flat File Source and browse to your census_data.txt file.  Flip through the General, Columns, and Preview sections and verify they look correct (all the appropriate settings should auto-detect).
The SQL Server Import and Export Wizard with a Flat File Source of your census_data.txt file.
The Preview section of the SQL Server Import and Export Wizard data source with the correct formatting.

The final page of the SQL Server Import and Export Wizard, Edit Mappings, shows the mapping for each column to be imported.  Had we not pre-created the table, this is where we’d manually change each type from VARCHAR to FLOAT.

Complete the mappings page and you may see a warning indicating that the import could fail if the data types in the import file don’t match those we’ve specified in our table.  We’re OK with that.  Complete the wizard and the data should import successfully.

The SQL Server Import and Export Wizard “Edit Mappings” page with correct data type mappings.
The SQL Server Import and Export Wizard warning that chosen data types could produce errors if not matched in input file.
The completed import from the SQL Server Import and Export Wizard with 3195 rows imported (one for each county).

Congratulations!  You’ve now successfully imported the primary census data file and have wrapped up this section.

Testing and Putting it All Together
We should now have two tables in our database, one containing code to county mappings and the other containing census data by county code.  We can test these pretty simply with a few T-SQL queries.

First, let’s look at a sample from the first table we imported, census_counties.  This table should consist of three columns — a varchar identifier, a county name, and a two-letter state.  We can check a sample of the table with the following syntax:

SELECT TOP 10 * FROM census_counties

Testing our census_counties table — 3 named columns and sensible data, it checks out!

Next, let’s look at a sample from the second table we imported, census_data.  This table should consist of a large number of columns, each with a FLOAT-based number.  We can check a sample of the table with the following syntax:

SELECT TOP 10 * FROM census_data

Testing our census_data table — a lot of columns, names matching our dictionary, and floats for data — it checks out!

Last but not least, let’s use the data.  I live in Gwinnett County, GA.  I know from looking at the DataDict.txt file mentioned earlier that the column representing population is named POP010210.  So, let’s write a simple T-SQL query to join the two tables and tell me the population of Gwinnett County as of the 2010 Census:

SELECT
    c.County_Name,
    c.County_State,
    d.POP010210
FROM dbo.census_data d
LEFT OUTER JOIN dbo.census_counties c ON (c.fips = d.fips)
WHERE c.County_Name = 'Gwinnett County'

The results of a simple join of our two census tables to tell the population of Gwinnett County, GA.

There it is.  With the testing and sample query successful, we can now move on to the endless possibilities for analyzing the 2010 Census QuickFacts data.

A Little Simplification
You may notice that to use the census tables, a join is necessary since we have two tables of interest.  This adds a small amount of complexity, so you may want to simplify it a bit.  In my case, I have no intention of modifying the census data tables, so I will denormalize them and insert the County and State columns into the census_data table.

Let’s first add two new columns to our census_data table, County and State:

ALTER TABLE census_data ADD 
County VARCHAR(50) NULL,
[State] VARCHAR(50) NULL

Now, let’s populate these two new columns with the county and state data originally stored in the census_counties table.

UPDATE d
SET
    d.County = c.County_Name,
    d.[State] = c.County_State
FROM census_data d
INNER JOIN census_counties c ON (d.fips = c.fips)

Lastly, since the data is now denormalized and contained within the data table, we no longer need the counties table to join to and can remove it from the database:
DROP TABLE census_counties

As a bit of a shortcut, should you not want to repeat all of the steps above, I’ve included a script to generate the SQL Server tables with all included data here.  Run the script against your desired database and the tables will be created.  Enjoy!