Over the last few weeks I’ve been working my way through Ben Fry’s Visualizing Data book, slowly coding up the examples and thinking about what I will be able to achieve and how. The problem was that I didn’t feel I was getting anywhere. Sure, I was entering all the code, but I struggled to see the relevance it would have to me. So I decided to jump right in and attempt my first data visualisation with data from the Prints and Printmaking databases. First though I had to extract the data…
I have a backup of the database from MSSQL in .bak format. Of course I can’t import this into MySQL, so I’ve had to install Microsoft SQL Server Management Studio on one of our computers, and after playing around I was able to import my backup file and see how the multiple databases fit together and work. It was complicated. Originally I thought there were 6 databases, but there are actually 29! They are all connected in some way, but trying to match those connections is proving to be a not so fun task.
I started querying the databases in different ways, trying to match artists to works through various numbers (IRN, UniqueArtistIdentifier, etc). This took some time and the results were less than rewarding. So after some playing around I was able to instead generate a simple text file with each artist’s name, gender, IRN, and artistIdentifier.
And so the really fun work started. It was time to play with some data!
I asked myself the simple question: Which gender has a higher representation in the collection?
So I set about trying to bend examples from the book to work with the file I had. Basically I wanted to draw some circles to represent the number of artists that were male, female, a company or other. I knew what I wanted to do, but actually achieving it was surprisingly hard (it was my first attempt, though). After about an hour I had produced this:
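For anyone wanting to try the counting step outside the book’s examples, here is a rough sketch in Python of tallying artists by gender. It is not the code I actually used: the tab-separated layout (name, gender, IRN, artistIdentifier), the function name `count_genders` and the sample rows are all assumptions made up for illustration.

```python
import csv
import io
from collections import Counter

def count_genders(lines, gender_column=1):
    """Tally rows per gender category from tab-separated lines.

    Assumes (hypothetically) that gender is the second column.
    Empty gender values are counted as "unknown".
    """
    counts = Counter()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) <= gender_column:
            continue  # skip blank or malformed rows
        gender = row[gender_column].strip().lower()
        counts[gender or "unknown"] += 1
    return counts

# Made-up sample rows standing in for the exported text file.
sample = io.StringIO(
    "Artist A\tMale\t1001\tA1\n"
    "Artist B\tFemale\t1002\tA2\n"
    "Artist C\t\t1003\tA3\n"
    "Artist D\tMale\t1004\tA4\n"
    "Studio E\tCompany\t1005\tA5\n"
)
gender_counts = count_genders(sample)
for gender, total in gender_counts.most_common():
    print(gender, total)
```

With a real export you would pass an open file object instead of the `io.StringIO` sample.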
Yep, that’s it, some green circles across the top of the screen. It didn’t even really use the data I wanted it to, but it was something, and it proved I was able to parse and mine the text file, which was a relief. After a few more hours and many attempts, I was able to get to this stage:
Four circles of completely random sizes, which aren’t scaled to the actual counts in the data set. Nonetheless they do show the prominence of male artists compared to female, and the big black bulk of unknown gender artists.
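One common way to make circle sizes meaningful is to scale the area, rather than the radius, to the count, which means the radius grows with the square root of the value. The sketch below shows that idea in Python. The function name `radii_for_counts`, the `max_radius` parameter and the counts are invented for illustration and are not figures from the collection.

```python
import math

def radii_for_counts(counts, max_radius=100.0):
    """Return a radius per category so that circle AREA is proportional
    to the count. The largest category gets max_radius."""
    largest = max(counts.values(), default=0)
    if largest <= 0:
        return {key: 0.0 for key in counts}
    return {
        key: max_radius * math.sqrt(value / largest)
        for key, value in counts.items()
    }

# Made-up counts, chosen only to make the arithmetic easy to follow.
example_counts = {"male": 400, "female": 100, "company": 25, "unknown": 400}
radii = radii_for_counts(example_counts)
for key, radius in radii.items():
    print(key, round(radius, 1))
```

Scaling the radius directly would exaggerate differences: a group four times as large would get a circle with sixteen times the area.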
This simple visualisation made me realise that data visualisation has another important function: it allows us to see how much data is missing. The black circle is about the same size as the group of males, which means that a lot of artist entries don’t have a gender assigned to them. In my next attempt, I’ll try to display the IRN (unique ID of the artist) or name of the artists, which will help identify which artists are missing that data.
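As a first step towards that, the entries with no gender can be pulled out as a plain list before anything is drawn. This is only a sketch under the same guesses as before: a tab-separated file with name, gender and IRN in the first three columns, a hypothetical function called `artists_missing_gender`, and invented sample rows.

```python
import csv
import io

def artists_missing_gender(lines):
    """Return (IRN, name) pairs for rows whose gender field is empty.

    Assumes (hypothetically) the column order: name, gender, IRN, ...
    """
    missing = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        name, gender, irn = row[0].strip(), row[1].strip(), row[2].strip()
        if not gender:
            missing.append((irn, name))
    return missing

# Made-up sample rows; two of them have no gender recorded.
sample_rows = io.StringIO(
    "Artist A\tMale\t1001\tA1\n"
    "Artist B\t\t1002\tA2\n"
    "Artist C\tFemale\t1003\tA3\n"
    "Artist D\t \t1004\tA4\n"
)
missing_entries = artists_missing_gender(sample_rows)
for irn, name in missing_entries:
    print(irn, name)
```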
This first attempt is small and so simple, but I am pleased I could produce it (even with the swearing) in a few hours. It’ll be a steep hill to climb…