Monthly Archives: August 2010

Refining problems

by in PhD.

It seems that whenever I get something to work it is just a small step towards another problem. Once I successfully drew a grid of all my artists (coloured by gender) I set about trying to get it to sort in some kind of useful order. Initially the grid is displayed in the order that artists were entered into the database, although I can’t see any discernible pattern in that ordering. Instead I wanted to display the artists by the year they were born. The idea is that you would be able to see which gender was predominant during different time periods.

The main problem that emerged here was with my data. It was messy and inconsistent. Unfortunately there were many different date formats. See below:

I only wanted to use the year but, as you can see, sometimes the year is shortened to two digits. To a human that’s fine: with the other details from the record we can work out whether 26-Jan-22 is 1922 or 1722. The computer has no idea, though, so these records don’t count. In some cases there’s no information at all, hence the big NULL, and there are a lot of these records, which need to be ignored too. When I filtered out the messy data it had a rather dramatic effect on the size of my grid.

Instead of displaying about 16,000 records on the screen, it only shows 5009. I’ve instantly lost more than two thirds of all my artists.

I started playing around with Freebase Gridworks, a Java program that runs in the browser and allows one to clean up messy data. It’s got a great interface and seems very powerful, even if it is a little confusing to try and understand. To get my 5009 artists, I just filtered out all the non-numeric data in Gridworks and then exported a new tsv file. But when I started looking at the non-numeric data I could see that it had also filtered out some entries where the year is complete (i.e. not just 16-02-22 but 16-02-1822). These are years I want to include, so I need to take a different approach. Therefore I’ve started to introduce myself to regular expressions (regexp). Basically I want to create a year by matching any time four digits occur in a row, without any dashes or spaces between them, and then set that as the year.
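A minimal sketch of that four-digits-in-a-row idea in Java follows. It is only an illustration, not the code used in the project: the class name `YearExtractor` and the method `extractYear` are hypothetical, and the sample strings are taken from the examples mentioned above. The pattern insists that the four digits are not part of a longer run of digits, so a two-digit year like 26-Jan-22 is rejected while 16-02-1822 is kept.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; names are assumptions, not project code.
class YearExtractor {
    // Exactly four digits, with no digit immediately before or after them.
    private static final Pattern FOUR_DIGITS =
            Pattern.compile("(?<!\\d)\\d{4}(?!\\d)");

    // Returns the first four-digit run found in the raw date text,
    // or an empty result when there is none (e.g. "26-Jan-22" or "NULL").
    static OptionalInt extractYear(String rawDate) {
        if (rawDate == null) {
            return OptionalInt.empty();
        }
        Matcher m = FOUR_DIGITS.matcher(rawDate);
        if (m.find()) {
            return OptionalInt.of(Integer.parseInt(m.group()));
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        String[] samples = {"16-02-1822", "26-Jan-22", "NULL", "1922"};
        for (String s : samples) {
            System.out.println(s + " -> " + extractYear(s));
        }
    }
}
```

Records that come back empty could then be skipped when drawing the grid, rather than being removed from the data file itself. Note that this takes any four-digit run as a year, so it assumes the date field never contains other four-digit numbers.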

So that’s where I’m at now, just trying to get my data to be clean enough to achieve this.

I was thinking though, I don’t know if I want to create a separate file with just the date stuff, because that’ll result in a few different files with different amounts of data. Maybe once I get the regexp working I can just match it straight through my code in Eclipse/Processing. Or use the date syntax as Mitchell suggested, or the PHP strtotime that Geoff likes. Hmm, thinking, thinking, thinking.


by in Honours, Research.

I’ve been working on this one all day without much success, until Mitchell the lifesaver pointed out I had a different integer in the loop, which was causing me some serious problems. One change and bang, works a treat!

Each box represents an artist in the collection. It’s a bit pretty at the moment and you can’t distinguish the colours, so here’s an updated version:

yellow = male, purple = female, orange = company, white = unknown

playing with data

by in Honours, Research, thinking.

I’ve spent the last couple of days struggling with my data files and getting them to split properly. Everything works fine with relatively small files (5000 lines or fewer), but once I went over that I had some serious issues where the program wouldn’t split the file properly.

After a lot of trial and error (importing and exporting the files into Excel, checking them in TextWrangler, removing tabs, adding tabs again, cursing, making smaller files, boosting the memory available to Java and changing the default size of the HashMap, etc.) I realised that it seemed to be getting caught on the blank fields. So I ran a little find and replace in Excel, selecting any blank cells and replacing them with the word ‘NULL’. And it worked! It’s finally spitting the selected artist out below, rather than the error I’ve been getting for the last few days.
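The same blank-to-NULL replacement could also be done in code when each row is split, instead of in a spreadsheet. The sketch below is an assumption about how that might look in plain Java, not the fix actually used: `BlankFieldFiller` and `splitRow` are hypothetical names, and the sample row is made up. One detail worth knowing is that Java’s `String.split` with a single argument silently drops trailing empty fields, so the sketch passes a limit of `-1` to keep them.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; names and sample data are assumptions.
class BlankFieldFiller {
    // Splits one tab-separated row and replaces every blank field with "NULL",
    // so each row keeps the same number of columns.
    static String[] splitRow(String line) {
        // A limit of -1 keeps trailing empty fields that split("\t") would drop.
        String[] fields = line.split("\t", -1);
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].trim().isEmpty()) {
                fields[i] = "NULL";
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up row: a name, a blank field, a date, and a blank last field.
        String row = "Smith\t\t16-02-1822\t";
        System.out.println(Arrays.toString(splitRow(row)));
    }
}
```

Filling blanks at load time means the original export can stay untouched, at the cost of the program needing to treat the literal text NULL as missing data everywhere else.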

Also, I’ve switched to using Eclipse with the Proclipsing plugin on Mitchell’s advice. It took a bit of effort to understand the differences between the usual Processing GUI and how Processing works in Eclipse, but it’s a lot easier for me to understand, as I used Eclipse last year for our Immerson project.