Monday, December 28, 2009

[18] Some thoughts on Compression, Visualization, and Blackjack

Beautiful Mnemonics - Compressed Graphics

It occurred to me a while ago that some printed graphics can be said to be "compressed," not in the JPEG sense, but rather in the traditional information theory sense of the word, i.e. they can contain more information than explicitly shown, provided the recipient knows the conventions necessary to extract the implied information. In traditional Tufte terminology, these graphics could be said to have a data-ink ratio higher than 1.0, and more poetically, to borrow another Tufte motto, they can illustrate a negative "smallest effective difference."

The example shown below is the compression of the traditional table for Blackjack basic strategy to a set of smaller tables with many implied cells. Perhaps these tables will not be useful to a complete amateur, but everyone with a basic knowledge of the game with whom I have tested them has found them very useful for memorizing the rules completely, something they had failed to do with the traditional table, which shows all 220 data points independently.

I believe this is an effective example of a special type of "compressed graphics" that can serve as "beautiful mnemonics" rather than "beautiful explanations."




Below is a longer explanation of the two points above for those not familiar with Tufte's terms.

A longer version - Questioning data ink ratios and the smallest effective difference

Given the amount of manipulative and content-free media pushed around us (think Fox News), teaching people how to create elegant and meaningful visual explanations seems a useful goal. Perhaps no one has contributed more to it than Edward Tufte, who in his books explains the basic principles of effective information design and shows many outstanding applications. If you have not yet had a chance to see his work, I recommend you stop here and check the following link:

http://www.edwardtufte.com/tufte/

One of Tufte's recurrent themes is avoiding waste: making graphics with a high "data-ink ratio." In other words, charts where every line and piece of text is meaningful, without unnecessary colors and decorations. A sort of visual Strunk and White, which Tufte praises with the lyrical phrase "the smallest effective difference."

In Tufte's books, "the smallest effective difference" is usually illustrated by starting with a bloated and ugly graphic and successively getting rid of unnecessary ink, minimizing waste. The result is usually a beautiful and compact representation of the original data, with a higher "data-ink ratio."

I've often wondered about the possible results of the "clean-up process" when the starting point is not a bloated graphic but an already clean one [1]. To what point can it be simplified? Are there cases where we can delete even data-ink and still have a useful graphic? Perhaps an even better graphic? In short, is there a case to be made for a negative smallest effective difference?

[1] Playing with other properties of the graphic, such as its dimensions, is one possible answer. This is illustrated in Tufte's discussions of sparklines. Here I'm referring to further deletion of data, past the obviously desirable deletion of fluff.

Before jumping into the formulas below, it is important to make a disclaimer: Tufte's points are heuristics that borrow math terminology, not standalone formulas for quality. With that in mind, let's phrase the idea of the smallest effective difference in pseudo-math terms as follows:

Data ink ratio = data ink / graphic ink (ink that conveys information vs total ink used to produce the graphic)

From this, one interpretation of the smallest effective difference could be:
graphic ink - data ink

This would seem to imply that the goal is a data-ink ratio as close as possible to 1.0, in other words, a smallest effective difference of 0. However, consider a graphic that implies more data than it explicitly expresses:
data ink = explicit data + implied data

To express more data than is explicitly shown, the recipient must know an algorithm to extract that data from the original message. In other words, the reader must be able to infer the implied data. In traditional information theory, a classic example of compression is run-length encoding, where contiguous identical digits are implied by specifying how many times they repeat instead of writing them out explicitly. For example:
3.422222222231333333 is converted to
3.42{9}313{6}
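For readers who prefer code, here is a minimal Python sketch of that convention. The function names and the `min_run` threshold are my own assumptions for illustration, not part of any standard; the only thing taken from the example above is the `c{n}` notation, meaning "character c repeated n times."

```python
import re
from itertools import groupby


def rle_encode(text: str, min_run: int = 3) -> str:
    """Replace each run of min_run or more identical characters with c{n}."""
    parts = []
    for char, group in groupby(text):
        n = len(list(group))
        # Short runs are left as-is, since c{2} would be longer than cc.
        parts.append(f"{char}{{{n}}}" if n >= min_run else char * n)
    return "".join(parts)


def rle_decode(encoded: str) -> str:
    """Expand every c{n} marker back into n copies of c.

    Assumes the original text contains no literal braces.
    """
    return re.sub(
        r"(.)\{(\d+)\}",
        lambda m: m.group(1) * int(m.group(2)),
        encoded,
    )


# Build the number from the example: nine 2s and a final run of six 3s.
original = "3.4" + "2" * 9 + "31" + "3" * 6
encoded = rle_encode(original)

print(encoded)                      # 3.42{9}313{6}
print(len(original), len(encoded))  # 20 13
assert rle_decode(encoded) == original
```

The decoder is the "algorithm the recipient must know": without it, the braces are just noise.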

The compressed format is shorter, although it requires the reader to know a convention. If we were to consider text as graphics we'd already have a first example of a graphic with a data ink ratio higher than one. Thankfully, we don't have to make such an extreme case to illustrate the idea: the blackjack table below is a more meaningful example.

The point of many graphics is to make new data clear to the reader. In such cases, using a "compressed" graphic may not be a good idea, because one would first have to explain to the reader the conventions necessary to read the implied data. However, not all graphics are meant to explain new data. Memorizing and quickly confirming previously known information are also useful purposes.

Here's a practical example: compressing the traditional table for Blackjack basic strategy to a set of smaller tables with many implied cells. Perhaps these tables will not be useful to a complete amateur, but everyone with a basic knowledge of the game with whom I have tested them has found them very useful for memorizing the rules completely, something they had failed to do with the traditional table, which shows all 220 data points explicitly. I believe this is an effective example of a special type of "compressed graphics" that can serve as "beautiful mnemonics" rather than "beautiful explanations."
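To make the "implied cells" idea concrete, here is a rough Python sketch using a toy grid. The grid, its labels, and its actions are hypothetical and are NOT real Blackjack strategy; the sketch also assumes, crudely, that every printed cell costs one unit of ink and that the reader knows a single convention: "any cell not shown takes the default action."

```python
from itertools import product

# Hypothetical toy grid (not real blackjack strategy): 4 rows x 5 columns.
ROWS = ["A", "B", "C", "D"]
COLS = [1, 2, 3, 4, 5]
DEFAULT = "hit"

# The "traditional" table spells out every cell independently.
full_table = {(r, c): DEFAULT for r, c in product(ROWS, COLS)}
full_table[("C", 2)] = "stand"
full_table[("D", 2)] = "stand"
full_table[("D", 3)] = "stand"


def compress(table, default):
    """Keep only the cells that differ from the agreed default."""
    return {cell: action for cell, action in table.items() if action != default}


def expand(exceptions, default, rows, cols):
    """Rebuild the full table; this is the convention the reader must know."""
    return {(r, c): exceptions.get((r, c), default) for r, c in product(rows, cols)}


compressed = compress(full_table, DEFAULT)

# Ink actually printed: the exception cells plus one legend entry for the default.
graphic_ink = len(compressed) + 1
# Data conveyed: explicit cells plus implied cells, i.e. the whole table.
data_ink = len(full_table)

print(graphic_ink, data_ink)   # 4 20
print(data_ink / graphic_ink)  # 5.0
assert expand(compressed, DEFAULT, ROWS, COLS) == full_table
```

Under these assumptions the toy graphic has a data-ink ratio of 5.0, well above the supposed ceiling of 1.0, and nothing is lost as long as the reader can run `expand` in their head.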




A traditional Blackjack strategy table vs. a compressed/mnemonic table