Data Visualisation

23 Apr

This is a good article on data visualization. The author notes in his considerations section that “real data can be very difficult to work with at times and so it must never be mistaken that data visualisation is easy to do purely because it is more graphical.” This is a good point. In fact, in some respects determining the right visualization can be harder than simply working with the data directly. Without a visualization, however, it is much harder to communicate key insights to a diverse audience.

What rarely gets enough attention is that, in order to create interesting visualizations, the underlying data needs to be structured and enhanced to feed the visualizations appropriately. The recent Boston bombing, where one of the bombers slipped through the system due to a name misspelling, reminded me of a project years ago where we enhanced the underlying data to identify “similarities” between entities (people, cars, addresses, etc.). For each entity type, the notion of similarity was defined differently: for addresses it was geographic distance; for names it was semantic distance; for cars it was matching on a number of different variables; and for text narratives in documents we used the same approach that plagiarism tools use. In that project, catching a name misspelling, together with the ability to tune the software to resolve names based on our willingness to accept false positives, allowed us to find linkages that revealed networks. Once a link was established, we went back and validated it. In that example, the metadata generated to create a relatively simple link chart was significant; it was the bulk of the work. It is not unusual for the derived data to dwarf the original data set, especially when text exploitation and other unstructured data mining approaches are used.
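To make the tuning idea concrete, here is a minimal sketch of name matching with an adjustable threshold. It is not the software from that project: it swaps in plain character-level similarity from Python's standard `difflib` module, and the function names, sample names, and threshold values are all made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Score two names from 0.0 to 1.0, ignoring case and extra whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def candidate_links(names, threshold):
    """Return every pair of names whose similarity meets the threshold.

    A lower threshold accepts more false positives in exchange for
    catching more misspellings; each link still needs manual validation.
    """
    links = []
    for a, b in combinations(names, 2):
        score = name_similarity(a, b)
        if score >= threshold:
            links.append((a, b, round(score, 2)))
    return links


# Hypothetical records, not real data.
names = ["Jonathan Smith", "Jonathon Smith", "Maria Garcia", "Mario Garza"]

strict = candidate_links(names, threshold=0.90)  # fewer, safer links
loose = candidate_links(names, threshold=0.75)   # more links, more noise

for pair in loose:
    print(pair)
```

Note that every candidate link carries a score, which is exactly the kind of generated metadata that ends up feeding the link chart. With thousands of entities, the pairwise scores alone can easily outgrow the original records.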

So … next time a sales guy shows you a nifty data visualization tool, ask about the data set used and how much massaging it needed.


