It seems that whenever I get something to work, it is just a small step towards another problem. Once I had successfully drawn a grid of all my artists (coloured by gender), I set about trying to get it to sort in some kind of useful order. Initially the grid is displayed in the order that artists were entered into the database, although I can’t see any distinguishable pattern in that ordering. Instead I wanted to display the artists by the year they were born, the idea being that you would be able to see which gender was predominant during different time periods.
The main problem that emerged here was with my data. It was messy and inconsistent. Unfortunately there were many different date formats. See below:
I only wanted to use the year but, as you can see, sometimes the year is shortened to two digits. To a human that’s fine: with the other details from the record we can work out whether 26-Jan-22 is 1922 or 1722. The computer has no idea, though, so these records don’t count. In some cases there’s no information at all, hence the big NULL, and there are a lot of these records, which need to be ignored too. When I filtered out the messy data it had a rather dramatic effect on the size of my grid.
Instead of displaying about 16,000 records on the screen, it only shows 5009. I’ve instantly lost more than two thirds of all my artists.
I started playing around with Freebase Gridworks, a Java program that runs in the browser and lets you clean up messy data. It’s got a great interface and seems very powerful, even if it is a little confusing to understand. To get my 5009 artists, I just filtered out all the non-numeric data in Gridworks and then exported a new TSV file. But when I started looking at the non-numeric data I could see that the filter had also thrown out some entries where the year is complete (i.e. not just 16-02-22 but 16-02-1822). These are years I want to include, so I need to take a different approach. Therefore I’ve started to introduce myself to regular expressions (regexp). Basically I want to pull out a year by matching any time four digits occur in a row, without any dashes or spaces between them, and then set that as the year.
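For my own reference, here is a rough sketch of what that match might look like in plain Java (which is what Processing sits on top of). The class name `YearExtractor` and the method `extractYear` are made up for illustration, and it assumes that any run of exactly four digits in the field is the birth year, which may not hold for every record.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls a four-digit year out of a messy date string.
class YearExtractor {

    // Exactly four digits in a row, with no digit directly before or after,
    // so "16-02-1822" matches 1822 but "16-02-22" matches nothing.
    private static final Pattern YEAR = Pattern.compile("(?<!\\d)(\\d{4})(?!\\d)");

    // Returns the first four-digit year found, or null if there is none
    // (two-digit years, NULL entries, empty fields).
    static Integer extractYear(String raw) {
        if (raw == null) {
            return null;
        }
        Matcher m = YEAR.matcher(raw);
        return m.find() ? Integer.valueOf(m.group(1)) : null;
    }

    public static void main(String[] args) {
        // Sample values in the spirit of the formats described above.
        String[] samples = { "16-02-1822", "16-02-22", "26-Jan-22", "NULL", "1856" };
        for (String s : samples) {
            System.out.println(s + " -> " + extractYear(s));
        }
    }
}
```

Records that come back as null would still have to be skipped or fixed by hand, but entries with a complete year buried inside a longer date would no longer be filtered out.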
So that’s where I’m at now, just trying to get my data to be clean enough to achieve this.
I was thinking though, I don’t know if I want to create a separate file with just the date stuff, because that’ll result in a few different files with different amounts of data. Maybe once I get the regexp working I can just match it straight through my code in Eclipse/Processing. Or use the date syntax as Mitchell suggested, or the PHP strtotime that Geoff likes. Hmm, thinking, thinking, thinking.