This week has gone pretty quickly and I’ve mostly been working on the text analyser / summary program. I even managed to take some photos! The week started with @dumbledad (= Tim) showing me some of the visualisation stuff he and an intern had been working on to visualise a book, some of which will appear shortly on a site somewhere… It’s all about new and interesting data presentation in the spirit of Information Aesthetics, and he sent me a link to some stuff he did on ManyEyes – word clouds (or ‘wordles’) comparing frequencies of words in narrative and speech. Some of the other ones are more difficult to describe but I’ll be sure to tweet links to them when they get published.
The idea of the summary program was that it split the book into sections, then compared a histogram of word frequency densities in each section with another histogram for the entire book, and picked out the words most likely to be important to the section by choosing the ones used unusually frequently there. The problem was that the program wasn’t picking out main characters, because they were mentioned throughout the book and so never looked unusual in any one section. So I was to implement a system to split words into three categories: local to the section, local to the book (main characters) and common to the English language. The existing framework for the two-way split (local to the section vs. local to the book) had already been written, so my job was to extend it to the three-way split.
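To make the three-way split concrete, here is a minimal Python sketch of one way it could work. It is not the code I wrote that week: the function names, the `ratio` threshold of 2 and the reference table of English word frequencies are all assumptions made for illustration. A word counts as local to the section if it is much denser there than in the whole book, as local to the book if the book uses it much more than ordinary English does, and as common English otherwise.

```python
from collections import Counter
import re


def relative_frequencies(text):
    """Map each lowercase word to its share of all the words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}


def classify_words(section, book, english_freq, ratio=2.0, floor=1e-6):
    """Label every word in the section as 'section', 'book' or 'english'.

    english_freq is an assumed reference table (word -> relative frequency
    in general English); ratio and floor are illustrative values only.
    """
    section_freq = relative_frequencies(section)
    book_freq = relative_frequencies(book)
    labels = {}
    for word, f_section in section_freq.items():
        f_book = book_freq.get(word, floor)
        f_english = english_freq.get(word, floor)
        if f_section >= ratio * f_book:
            labels[word] = "section"   # unusually dense in this section
        elif f_book >= ratio * f_english:
            labels[word] = "book"      # dense across the book, e.g. a main character
        else:
            labels[word] = "english"   # about as common as in everyday English
    return labels


# Toy data: the 0.2 frequency for "the" is made up to suit such a tiny text.
section = "alice met the dragon"
book = section + " alice saw the cat and alice saw the queen and alice ran to the door"
print(classify_words(section, book, {"the": 0.2}))
```

In the toy text "dragon" only appears in the one section, "alice" appears throughout the book but not in the reference table, and "the" is common everywhere, so they land in three different categories.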
By Wednesday I’d finished the actual implementation so I started trying to invent a visualisation. My original idea was to have a ‘story line’ (no pun was actually intended) along which various threads would undulate, and the further out from the story line they are, the more important they are; think of it as a radial graph – I think I was probably inspired by the RealPlayer (yuk, I know) ‘cosmic string’ visualisation. I built a really flickery version as a mockup which was approved, and since I was by then starting to shy away from WPF I ended up learning DirectX overnight to implement a final 3D non-flickery version of it. After spending a whole day stressing over the edges of the scene getting cut off, and eventually realising I’d set the camera’s maximum viewing distance ridiculously low, I finally got it to work, and after writing some homebrew Bézier curve code it looked pretty good (if I may say so myself); Tim tells me he’ll probably add a screen video of it to the online display of visualisations so … watch this space.
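For anyone wondering what homebrew Bézier code involves, here is a minimal Python sketch of evaluating a cubic Bézier curve with the standard Bernstein weights. It is a generic illustration rather than the DirectX code from the project; the function names and the choice of 16 segments are assumptions.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Return the point at parameter t (0 to 1) on a cubic Bezier curve.

    p0 and p3 are the end points, p1 and p2 the control points; all four
    are tuples of the same length (2D or 3D).
    """
    u = 1.0 - t
    weights = (u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3)
    points = (p0, p1, p2, p3)
    return tuple(
        sum(w * p[axis] for w, p in zip(weights, points))
        for axis in range(len(p0))
    )


def sample_curve(p0, p1, p2, p3, segments=16):
    """Approximate the curve with segments + 1 evenly spaced sample points."""
    return [cubic_bezier(p0, p1, p2, p3, i / segments) for i in range(segments + 1)]


# A thread bulging away from the story line and coming back again.
thread = sample_curve((0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 0))
print(len(thread), thread[0], thread[-1])
```

The sampled points can then be joined up as a line strip; more segments give a smoother thread at the cost of more vertices.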
Another excitement of the week was a talk from TrueKnowledge (= TK), an internet answer engine. It’s similar to the famous Wolfram Alpha (= Walfa); however, in my opinion it actually has more potential. Walfa throws manpower at writing new code to scrape information from various different sources on the fly, which essentially means the more information you want, the more you’re going to need to work. TK, on the other hand, stores information in an enormous database which has a structure suitable for storing any type of information, and although work is done to ‘crawl’ Wikipedia and other sources for knowledge, it also crowdsources information from the community, which means it can gather lots of important knowledge very quickly with minimal effort. It also has awesome natural language parsing (ask it ‘what colour are red cars’ for example) and it can give you a step-by-step explanation of the logical process that leads to its final answer.
It of course differs from Walfa in that it hasn’t got a tonne of Mathematica code behind it – its strengths are in factual and inferred knowledge as opposed to evaluating integrals. It’s currently in Beta and has an API (yay!) so I strongly encourage anyone who has used Walfa to give TK a go.
On Tuesday the weekly Mexican food van appeared – until then I’d never realised quite how amazingly good burritos can be! While we were eating we started discussing presentation of text. The problem is that a conventional layout presents the reader with a formidable block of text interspersed with some images, which is difficult to follow and annoying to read since one always has to alternate between studying the image and reading the text. However, attempts at producing non-linear presentations of information, such as embedding text into the image as tooltips or expandable areas of the image etc., have always resulted in people simply not reading very much of the text and consequently missing out important stuff. The best solution we came up with is using an old method of collapsible clauses, just like collapsible code. For example, if a relative clause which in this case is italicised and relatively long yet somehow doesn’t contribute much to the sentence thus merely adds length and unnecessary information to the text making the ultimate meaning more difficult to discern is considered superfluous to the meaning of the sentence, it could be replaced by a small button that only shows the clause if clicked – such ideas are particularly relevant to German sentences, which tend to have huge diversions into clauses before the verb is revealed right at the end. This way readers can quickly get the gist of what’s going on so they may study the image in an enlightened way, then go back and expand the text to get the full meaning.
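The collapsible clause idea is easy to mock up. Below is a minimal Python sketch, assuming an invented markup in which the writer wraps each optional clause in double square brackets; the `[[...]]` syntax, the `[+]` placeholder and the `render` function are all made up for illustration and were not part of our lunchtime discussion.

```python
import re

# Hypothetical markup: an optional clause is wrapped in [[double brackets]].
CLAUSE = re.compile(r"\[\[(.+?)\]\]")


def render(text, expanded=False):
    """Show the text with optional clauses collapsed to '[+]' or fully expanded."""
    if expanded:
        return CLAUSE.sub(lambda match: match.group(1), text)
    return CLAUSE.sub("[+]", text)


sentence = "The cat [[which was sitting on the mat]] purred."
print(render(sentence))                 # the gist, with a stand-in for the button
print(render(sentence, expanded=True))  # the full sentence
```

A real reader would of course use a clickable button rather than a `[+]` marker, but the two renderings show the gist-first, detail-later reading order.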
There are also a few things I noticed about MSR in general. There is a strong sense of company loyalty – all employees seem to use Bing, and everyone I’ve seen even goes as far as using IE instead of Firefox! Using only Microsoft products to perform tasks did, however, make me aware of the wide range of programs they produce – they even have virtual machine software and an internal proprietary alternative to SVN. I guess it helps the developers of these applications a lot to have an enormous internal test group: all the employees and interns. There’s also pretty close integration with Redmond (Outlook + Office Communicator + global WAN shares) so feedback can be delivered quite efficiently. The entire place also operates in the spirit of trust – all users have admin rights (necessary for developers anyway) – which is so much better than what is implemented at school: a highly restrictive policy which, despite recent changes for the better, still filters out most protocols (FTP included) and in fact, instead of preventing people from doing things, simply makes everything so much more difficult to do. Now I have to connect through encrypted VPN to use FTP…
Anyway, overall it was a great two weeks. I enjoyed it hugely, I didn’t need to touch Excel, I didn’t make anyone coffee and I didn’t do any filing (who needs paper anyway? It’s a software company!) – instead I worked on real (and rather cool) projects, learnt some useful things, and made new acquaintances.
In other news, I’m off tomorrow to Cranfield for the Aerospace Challenge Finals – I’ll get to fly (actual!) planes, take lots of photos and it should be another great experience. They’d just better have wifi, though I’m bringing my Alfa Awus (ridiculously powerful) along in case of weak signal!