Friday, November 2, 2007

Mining for Data - Part 5

There are a number of ways to manipulate data. Since I am an old hacker from way back -- my first spreadsheet was DynaCalc, running under Microware OS-9 back in 1986 -- I tend to prefer spreadsheets over databases. They may not be as pretty sometimes, but they get the job done. Excel seems to be the spreadsheet of choice, but many people balk at the price of Microsoft Office, which does not normally come bundled with their computer, so an excellent alternative is OpenOffice's Calc application ("scalc"). Calc looks like Excel, behaves like Excel, and it is free.

We have downloaded our data reports in .csv format, so they are comma delimited text files. They also have funny file names, courtesy of Realist, such as "1856407_39153.csv", which can be a pain to keep track of. We need to organize our data. With the right mouse button, click on the .csv file name, and "Open With..." scalc (or Excel, or whatever your spreadsheet is called). Once the file is open, you can see from the data just what kind of fish you have caught. In this case, we see 2001 sales data for the 44302 ZIP Code. I saved that as "2001 44302 raw.xls" in a folder called "Realist Raw Data".

The full report starts out with tax mailing information, which I don't need for anything, so I simply highlight all those columns by holding down the Control ("CTRL") key and clicking on each column header, and then delete. That leaves the Tax ID field in column A. By clicking on the column B header, all of column B is highlighted, and a new column can be inserted. This is done twice, to make two new empty columns. I then go across the top of the spreadsheet, reformatting the column widths to whatever I find comfortable. Because both columns B and C are now empty -- the whole sheet has been shifted right by two columns -- I have room to move the "Recording Date" and "Sales Price" fields.

Click the header cell directly above "Recording Date". The entire column will highlight. Position the cursor over the "Recording Date" cell, press and hold down the mouse button, and drag the entire column to the left, into column B. Do the same thing with the "Sales Price" column, dragging it to column C. You now have the Tax ID number in Column A, the Recording Date in column B, and the Sales Price in column C. This will be handy for doing the stats later. Don't forget to save your spreadsheet early and often.

The 44302 ZIP Code has addresses from several Census Tracts. Click on cell A1 to park the cursor in a handy spot, then from the Data menu, choose Sort. A Sort Criteria window will open. At this point we will sort by Census Tract; the second sort criterion can be set to Carrier Route ( a little trick that will make it easier to fine tune the Census Tract Block Numbers, which are not a provided sort criterion). When the sort is done, we find that we can highlight all the data for Census Tract 5064 and 5065, copy it, and paste it into a new spreadsheet. I named the new spreadsheet "2001 sales data.xls" and saved it in a folder called "Realist Stats".

The file "2001 44302 raw.xls" can now be closed. I repeated the process of creating .xls files from the .csv files for each of the downloaded reports; there were 18 altogether. The files that I had named "2001 44313 raw.xls" and "2001 44320 raw.xls" were manipulated in the same way as I had manipulated "2001 44302 raw.xls", and the data for Census Tracts 5064 Carrier Route C053 and Census Tract 5065 copied from each and pasted into "2001 sales data.xls". At that point, I had a spreadsheet with all the recorded sales for 2001 within the neighborhood bounded north by Amelia, east by West Exchange, south by Copley, and west by Storer.

The whole process was then repeated for each of the years 2002, 2003, 2004, 2005, and 2006. As each "200x sales data.xls" sheet was completed, it was resorted using as the criteria (1) Census Tract", (2) Street Name, and (3) House Number. The data was scanned to make sure that properties located on the northeast side of Exchange Street were not accidentally included, and the files were then ready for a year-by-year analysis of the sales data.

Next : Crunch Time

No comments:

Post a Comment