How Polynary Creates Natural Language Descriptions of Data

Making Sense of Quantifying Adverbs

To an analyst, language descriptions can feel slippery and vague.  Dictionaries don’t even define the meaning of quantifying adverbs.  Numbers can be measured with great precision, and their mathematical manipulations are well defined.  With this sense of exactitude, why switch to language?

The short answer is that language is the medium people use to state conclusions from the findings of analytic efforts.  In statistical studies these are generalizations about the patterns discovered in empirical data.  And language is the primary way people share the rules-of-thumb we pass around as knowledge.

Using descriptions requires that we operationalize the meaning of adverbs.  To this end we collected data from 100 native speakers of English on the range of values implied by each adverb.  The results for a set of primary adverb phrases are shown as an array of histograms in Figure 1, relative to a 0-100 scale.

Figure 1: The quantitative meaning of selected adverbs

Adverb Histograms.png

The first observation about the graphs in Figure 1 is that adverbs do make quantitative distinctions.  The peak of each graph shows where there is maximum consensus.  But unlike numbers, adverbs convey a range of values.  This is appropriate because descriptions form generalizations that cover regions of space, not points.

                                              The Construction of Descriptions

Descriptions connote the location, size, and shape of a region within an N-dimensional space.  Every region (basin) has a marginal distribution of values along each dimension.  The best adverb phrase for each dimension is the one that conveys this distribution most accurately, as gauged by the percent overlap between the marginal distribution and the adverb distributions in Figure 1 above.
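
As a minimal sketch of what a marginal distribution means here, assume a 2-dimensional region is stored as a grid of 0/1 cells.  The grid layout and the names region, marginals, marginal_a, and marginal_b are illustrative assumptions, not part of Polynary itself.

```python
# Hypothetical 3x4 grid: 1 marks a cell inside the region, 0 a cell outside it.
# Rows index the B dimension, columns index the A dimension (an assumed layout).
region = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
]


def marginals(grid):
    """Return (marginal along A, marginal along B) as cell counts."""
    along_a = [sum(column) for column in zip(*grid)]  # one total per A value
    along_b = [sum(row) for row in grid]              # one total per B value
    return along_a, along_b


marginal_a, marginal_b = marginals(region)
print(marginal_a)  # [0, 1, 3, 2]
print(marginal_b)  # [2, 3, 1]
```

Each marginal records how much of the region lies at each value of one dimension, which is the shape an adverb graph is later compared against.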

We illustrate the method with the 2-dimensional example in Figure 2, where the task is to describe the blackened region of space (basin) labeled A0B within the yellow square (panel a).  The horizontal dimension is labeled A, and the vertical dimension is labeled B, each contextualized here with values between 0 and 100.

Figure 2: Identifying the Best Description of Region A0B within the yellow square

Identifying Best Description.png

The marginal distribution of A-values in Figure 2 reflects that some values of A are more common than others (and therefore more descriptive of A) across the patch A0B.  This distribution is ‘scaled up’ to an area of 100 to allow a direct comparison to adverb graphs.  Selecting the best adverb is finding the adverb with the greatest overlap with the data being described.  The overlap between two graphs is the sum of the minimum of their graph heights at each value of X from 0 to 100.  The degree of overlap is zero if the two graphs do not overlap at all and 100 if the two graphs are identical.
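
The overlap calculation just described can be sketched as follows.  This is a minimal illustration that assumes both graphs are sampled at the same X values; the function names and the toy 5-bin graphs are hypothetical, not taken from Polynary.

```python
def scale_to_area(heights, area=100.0):
    """Rescale non-negative graph heights so that they sum to `area`."""
    total = sum(heights)
    if total <= 0:
        raise ValueError("graph has no mass to rescale")
    return [h * area / total for h in heights]


def overlap(graph_a, graph_b):
    """Sum of the pointwise minimum of two graphs, each rescaled to area 100."""
    if len(graph_a) != len(graph_b):
        raise ValueError("graphs must be sampled at the same X values")
    a = scale_to_area(graph_a)
    b = scale_to_area(graph_b)
    return sum(min(x, y) for x, y in zip(a, b))


# Toy graphs on a coarse 5-bin grid (illustrative values only).
marginal = [0, 1, 3, 1, 0]
same_shape = [0, 2, 6, 2, 0]  # identical to `marginal` after rescaling
disjoint = [5, 0, 0, 0, 5]    # shares no bins with `marginal`

print(overlap(marginal, same_shape))  # 100.0
print(overlap(marginal, disjoint))    # 0.0
```

Rescaling both graphs to the same area is what makes the result read as a percentage between 0 and 100.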

Barely High is the best choice along the A-dimension because it has the greatest amount of overlap at 89%.  The best adverb phrase for the marginal distribution along the B-dimension is Somewhat to Barely Low, with an overlap of 81%.  In short, ‘A is barely high, and B is somewhat to barely low’ is the most accurate description of the region A0B out of over 12,000 alternative ways to describe it.

This approach is completely general; we can use it to generate a language description for any arbitrary chunk of N-dimensional space.  For now, we will assume we need one adverb-adjective phrase per dimension; more succinct descriptions are derived from them.  The important connection between language and geometry is that quantitative descriptions point to the location, size, and shape of a sub-region of space.

To recap, the best adverb phrase for each dimension is the one that has the greatest overlap with the marginal distribution along that dimension.  In application there are over 100 adverb phrases available, and the accuracy of the ‘best’ adverb phrase is commonly around 85 to 95%.  This was a surprise; quantitative adverbs are not as vague as commonly imagined.  Importantly, this approach removes the vagaries of how individual speakers/writers understand adverbs.  For effective communication it is more relevant to know how hearers interpret those adverbs.
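
The selection rule in this recap can be sketched as a search over candidate adverb graphs.  The three adverb graphs below are invented for illustration on a 5-bin grid; real graphs would come from survey data, and the function names are assumptions.

```python
def overlap(graph_a, graph_b):
    """Percent overlap of two graphs on the same grid, each normalized to area 100."""
    total_a, total_b = sum(graph_a), sum(graph_b)
    return sum(
        min(100.0 * a / total_a, 100.0 * b / total_b)
        for a, b in zip(graph_a, graph_b)
    )


def best_phrase(marginal, adverb_graphs):
    """Return (phrase, overlap) for the adverb graph that best matches `marginal`."""
    scores = {phrase: overlap(marginal, graph) for phrase, graph in adverb_graphs.items()}
    phrase = max(scores, key=scores.get)
    return phrase, scores[phrase]


# Hypothetical adverb graphs; the phrases and heights are placeholders.
ADVERB_GRAPHS = {
    "very low": [6, 3, 1, 0, 0],
    "moderate": [0, 2, 6, 2, 0],
    "very high": [0, 0, 1, 3, 6],
}

phrase, score = best_phrase([0, 0, 2, 3, 5], ADVERB_GRAPHS)
print(phrase, round(score, 1))  # very high 90.0
```

In an N-dimensional setting the same search would be repeated once per dimension, using that dimension's marginal distribution.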

                                          The Parsing Problem—Why these Descriptions?

One central principle to descriptive accounts is that we want them to be as concise as possible.  What does this mean?  The dictionary defines concise as the stating of much in few words.  ‘Few words’ is clear, but the meaning of ‘much’ is left to us.

One component of ‘much’ is the amount of accurate information conveyed by a description.  We already defined what we mean by the accuracy of an adverb phrase; we always want to use the most accurate adverb.  But what does information mean?  Information is closely related to how specific an adverb is.

A well-known measure, Fisher information, makes this concrete: for a normally distributed quantity, the Fisher information about its mean is the reciprocal of the variance (the standard deviation squared).  We use the reciprocal of the variance as the information in an adverb phrase, estimated from adverb graphs like those shown in Figure 1.  The more specific or precise the adverb, the more information it conveys, because it covers a smaller range of values.
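
As a rough sketch, the information in an adverb graph can be estimated by treating the graph as a weighted histogram and taking the reciprocal of its variance.  The bin centers and heights below are invented, and the function names are assumptions.

```python
def graph_variance(values, heights):
    """Variance of a histogram with bin centers `values` and weights `heights`."""
    total = sum(heights)
    mean = sum(v * h for v, h in zip(values, heights)) / total
    return sum(h * (v - mean) ** 2 for v, h in zip(values, heights)) / total


def information(values, heights):
    """Reciprocal of the variance; larger for narrower (more specific) graphs."""
    variance = graph_variance(values, heights)
    if variance == 0:
        raise ValueError("a graph concentrated on one value has unbounded information")
    return 1.0 / variance


bin_centers = [10, 30, 50, 70, 90]  # hypothetical positions on the 0-100 scale
narrow = [0, 1, 2, 1, 0]            # a specific adverb
broad = [1, 1, 1, 1, 1]             # a vague adverb

print(information(bin_centers, narrow))  # 0.005
print(information(bin_centers, broad))   # 0.00125
```

The narrow graph has a variance of 200 and the broad one a variance of 800, so the narrow adverb carries four times the information under this measure.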

The accurate information conveyed by an adverb phrase is the product of its accuracy and its information.  A given phrase can be perfectly accurate but cover a range of values so broad (e.g. ‘from extremely low to extremely high’) that it doesn’t really tell you anything.  And a given phrase may be very specific (e.g. ‘between very and extremely high’), but if it is inaccurate it is not saying anything valid.  Desirable adverb phrases are both accurate and specific, and they can fail on either account.

So one component to consider when we evaluate a description is the total amount of accurate information it conveys.  This is the sum of the accurate information imparted by its adverb phrases; each phrase adds more to the characterization of the object it is covering.
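
A small sketch of this bookkeeping follows.  It assumes accuracy is expressed as a fraction between 0 and 1 (the overlap divided by 100); the accuracies 0.89 and 0.81 echo the example above, while the information values and the class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phrase:
    text: str
    accuracy: float     # overlap with the marginal, as a fraction (assumption)
    information: float  # reciprocal of the adverb graph's variance

    @property
    def accurate_information(self) -> float:
        """Accuracy times information for this single phrase."""
        return self.accuracy * self.information


def description_accurate_information(phrases):
    """Total accurate information: the sum over the description's phrases."""
    return sum(p.accurate_information for p in phrases)


description = [
    Phrase("A is barely high", accuracy=0.89, information=0.01),
    Phrase("B is somewhat to barely low", accuracy=0.81, information=0.005),
]
total_accurate_information = description_accurate_information(description)
print(round(total_accurate_information, 5))  # 0.01295
```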

There is another component to the concept of ‘much’ to consider.  We want descriptions to be substantive.  We can construct descriptions that convey a lot of accurate and specific information but account for only a tiny percent of the objects we are trying to describe.  A description can’t say ‘much’ if it rarely applies.  Relying on such descriptions would require too many of them to be practical.  The challenge of characterizing a collection of objects in words is to channel them into a cognitively manageable number of descriptions.  The measure of substantive-ness is the percent of objects covered by a description.

It appears that the term ‘much’ depends on two different components.  One component is the amount of accurate information conveyed by the description itself, the other is how substantive it is.  We define ‘much’ as the product of these two components; the larger the value, the more ‘telling’ a description is.  We will use ‘much’ and ‘telling’ to mean the same thing; a concise description is one that tells succinctly.
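
The definition of ‘much’ as a product can be sketched in a few lines.  The function name and the numbers are hypothetical, and substantive-ness is assumed to be given as a fraction of the sample rather than a percent.

```python
def telling(accurate_information, substantiveness):
    """Telling-ness: accurate information times the fraction of objects covered."""
    if not 0.0 <= substantiveness <= 1.0:
        raise ValueError("substantiveness is expected as a fraction between 0 and 1")
    return accurate_information * substantiveness


# A very specific description that covers 1% of the objects ...
rare_but_specific = telling(accurate_information=0.02, substantiveness=0.01)
# ... versus a less specific description that covers 40% of them.
common_but_broader = telling(accurate_information=0.008, substantiveness=0.40)

print(common_but_broader > rare_but_specific)  # True
```

Because the two components multiply, a description that fails badly on either one scores low regardless of how well it does on the other.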

In the following section we will use the criterion measure of telling-ness to guide how we cluster basins into regions of space that are subsequently described in words.  Of the almost uncountable number of ways we could parse space into a set of descriptions, we seek to optimize how much the resulting descriptions convey.  That is, we seek a set of descriptions that offer a concise account of the objects in the sample. 

We end this section with another observation about quantitative language descriptions.  Now that we can calculate the amount of accurate information conveyed by each adverb-adjective phrase in a description, we can sort its phrases by how much accurate information they convey. This way we can make the description brief by ‘cutting off’ the later phrases with minimal loss of accurate information.  Descriptions can be made more succinct; as conceptual bins they only need to be long enough to draw the desired distinctions. 
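
One way to sketch this shortening step is to sort the phrases and keep them until a chosen share of the total accurate information is retained.  The 90% threshold, the function name, and the third phrase are assumptions made for illustration; the text above does not fix a cutoff rule.

```python
def shorten(phrases, min_fraction=0.9):
    """Keep the most informative phrases until `min_fraction` of the total is retained.

    `phrases` is a list of (text, accurate_information) pairs.
    """
    ranked = sorted(phrases, key=lambda item: item[1], reverse=True)
    total = sum(value for _, value in ranked)
    kept, retained = [], 0.0
    for text, value in ranked:
        if retained >= min_fraction * total:
            break  # the remaining phrases add little accurate information
        kept.append(text)
        retained += value
    return kept


phrases = [
    ("C is moderately high", 0.0005),          # hypothetical low-value phrase
    ("A is barely high", 0.0089),
    ("B is somewhat to barely low", 0.00405),
]
print(shorten(phrases))
# ['A is barely high', 'B is somewhat to barely low']
```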

                                                     The Method of Natural Clusters

As Marvin Minsky said, when you change how you represent information you change how you think about and process it.  For example, we can represent a set of N-dimensional objects by a one-way frequency table of polynary strings.  This provides a very compact way to summarize a data set (and compare it to other samples).  We could turn each of these polynary strings into a description and report the percent of objects that fit each description.  We can change the amount of detail in this account by changing the length of the polynary strings.
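
A one-way frequency table of this kind can be sketched with a counter.  The example treats each polynary string as an opaque label; how objects are encoded into strings is not shown, and the labels other than A0B are placeholders.

```python
from collections import Counter


def frequency_table(labels):
    """Return (label, count, percent) rows, most frequent label first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, 100.0 * n / total) for label, n in counts.most_common()]


# Hypothetical sample: one polynary string per object.
sample = ["A0B"] * 5 + ["A1B"] * 3 + ["B0A"] * 2

for label, count, percent in frequency_table(sample):
    print(f"{label}  {count:3d}  {percent:5.1f}%")
```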

But the empirical patterns in data are difficult to interpret through the frequency table format.  The notion of patterns is a geometric notion.  In the past we had no effective coordinate system to directly see the patterns in spaces beyond three dimensions, and almost no words or metaphors to directly characterize them.  But if we could identify the spatial structures in data, we could use quantitative descriptions to point to them. 

The problem is a generic one.  No matter how data are distributed in space, it’s unlikely that Polynary partitions the space in a way that neatly aligns with the patterns and structures we seek to identify.  At the same time, polynary graphs of even short strings generate more descriptions than we would normally want.  The solution to these issues is to join spatially adjacent polynary basins into fewer regions that better approximate the empirical patterns in the data, and then describe each of these regions.

Describing a collection of multivariate objects is the task of grouping similar objects together to create a smaller set of descriptions.  This comes naturally to people.  Even as toddlers we could organize a hodgepodge collection of wooden blocks by putting them into piles of similar shape and size.  Each block is unique if examined closely; ignoring the details allows us to group them into a smallish number of piles.  This leaves us with a manageable number of unique descriptive generalizations characterizing the members of each pile. 

It is sometimes unclear whether we should put a block into one pile or another.  The choice changes the members of the pile and their resulting description.  We use the measure of ‘much’ (the substantive amount of accurate information conveyed) to guide our decision.  Descriptions that have higher values of ‘much’ are more telling because they convey a more substantial amount of accurate information.  The result we are seeking is to channel the collection of objects in a sample into a set of descriptions that offers the most telling account.

The method of Natural Clusters is a bottom-up clustering method applied to basins.  The first step is to identify every basin that is more telling than any of its N to 2N adjacent basins.  This initial set, called seed clusters, yields a set of characterizations that tell more about the collection than any other.  These seed clusters are subsequently grown into larger regions by joining them to adjacent basins until we account for all the basins (with data).
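
The seed-finding step can be sketched as a search for local maxima.  This sketch assumes basins sit on an integer grid, that adjacency means differing by one step along a single dimension, and that basins without data count as having zero telling-ness; all names and values are hypothetical.

```python
def neighbors(cell, shape):
    """Yield the cells one step away along each dimension (N to 2N of them)."""
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            position = cell[axis] + step
            if 0 <= position < size:
                yield cell[:axis] + (position,) + cell[axis + 1:]


def seed_basins(telling, shape):
    """Basins whose telling-ness exceeds that of every adjacent basin."""
    seeds = []
    for cell, value in telling.items():
        if all(value > telling.get(nb, 0.0) for nb in neighbors(cell, shape)):
            seeds.append(cell)
    return sorted(seeds)


# Hypothetical telling-ness values on a 3x3 grid of basins.
telling = {
    (0, 0): 0.9, (0, 1): 0.4, (0, 2): 0.1,
    (1, 0): 0.3, (1, 1): 0.2, (1, 2): 0.5,
    (2, 0): 0.1, (2, 1): 0.6, (2, 2): 0.8,
}
print(seed_basins(telling, shape=(3, 3)))  # [(0, 0), (2, 2)]
```

On this grid a corner basin has N = 2 neighbors and the center basin has 2N = 4, matching the range given above.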

As clusters are grown, they become more substantive (they include more objects), but the amount of accurate information in their descriptions tends to become smaller (because the regions described are larger).  At each step a single basin is joined to a seed cluster, choosing the join that minimizes the loss in telling-ness.  When we are done, the resulting set of clusters offers a comprehensive account of the objects in the collection.
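
The growth step can be sketched as a greedy loop.  The telling-ness function is passed in as a callable because its full computation is described elsewhere; the stand-in used in the example (objects covered divided by number of basins) is only a placeholder, as are the basin ids and counts.

```python
def grow_clusters(seeds, basins, neighbors, telling):
    """Greedily join one adjacent basin per step, minimizing the loss in telling-ness.

    seeds: list of basin ids; basins: all basins with data;
    neighbors: callable basin -> iterable of adjacent basins;
    telling: callable frozenset of basins -> float.
    Returns (clusters, join_order), where join_order lists (basin, cluster index).
    """
    clusters = [frozenset([seed]) for seed in seeds]
    unassigned = set(basins) - set(seeds)
    join_order = []
    while unassigned:
        best = None
        for index, cluster in enumerate(clusters):
            frontier = {nb for b in cluster for nb in neighbors(b)} & unassigned
            for basin in sorted(frontier):
                loss = telling(cluster) - telling(cluster | {basin})
                if best is None or loss < best[0]:
                    best = (loss, index, basin)
        if best is None:
            break  # the remaining basins touch no cluster
        _, index, basin = best
        clusters[index] = clusters[index] | {basin}
        unassigned.discard(basin)
        join_order.append((basin, index))
    return clusters, join_order


# Hypothetical 1-dimensional example: six basins in a row, with object counts.
counts = {0: 10, 1: 8, 2: 1, 3: 1, 4: 9, 5: 12}
line_neighbors = lambda b: [n for n in (b - 1, b + 1) if n in counts]
stand_in_telling = lambda cluster: sum(counts[b] for b in cluster) / len(cluster)

clusters, join_order = grow_clusters([0, 5], counts, line_neighbors, stand_in_telling)
print(clusters)    # [frozenset({0, 1, 2, 3}), frozenset({4, 5})]
print(join_order)  # [(1, 0), (4, 1), (2, 0), (3, 0)]
```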

Tracking the order in which basins join seed clusters in this stepwise process means that we can offer the most telling account regardless of the percent of overall cases we wish to describe.  For instance, we can describe the majority of cases by describing the clusters at the point where they first cover half the sample.  This reflects another way language flexibly serves different descriptive goals.
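
As a sketch of how a recorded join order might be used, the function below finds the first step at which the clusters cover a target share of the sample.  The function name and the object counts are hypothetical.

```python
def steps_to_cover(seed_counts, join_counts, total, target=0.5):
    """Number of joins needed before the clusters first cover `target` of the sample.

    seed_counts: objects in each seed cluster;
    join_counts: objects added by each join, in the order the joins happened.
    Returns None if the target is never reached.
    """
    covered = sum(seed_counts)
    if covered >= target * total:
        return 0
    for step, count in enumerate(join_counts, start=1):
        covered += count
        if covered >= target * total:
            return step
    return None


# Hypothetical run: two seeds holding 10 and 12 objects, then four joins.
print(steps_to_cover([10, 12], [8, 9, 1, 1], total=41, target=0.5))  # 0
print(steps_to_cover([10, 12], [8, 9, 1, 1], total=41, target=0.9))  # 2
```

Describing the clusters as they stand after that many joins gives an account for the chosen share of cases.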

Summary Remarks

We spent a lot of time on quantifying the properties of descriptions.  In many respects our approach is not new; we have all spoken and understood quantitative descriptions since childhood.  We intuitively understand that descriptions carry information, and vary in their accuracy, concision, and substantive-ness.  But what leads us to the adverbs we choose?  How do descriptions convey information?  And why do some descriptions seem better or more apt than others?

We think the coherent and consistent method of creating descriptions of data we have described will be useful in making analytic findings more widely accessible.  The mathematical methods used in statistics make the communication of the findings to clients and decision-makers difficult.  Traditional methods don’t handle multivariate data effectively; they tend to focus at the level of individual dimensions.  By contrast, the conclusions from analytic studies that people seek are generalizations over objects.  And the real currency of communication for these statements is words, not numbers or equations.

A description is more telling if it conveys a greater substantive amount of accurate information.  The Method of Natural Clusters draws on this criterion measure to cluster basins into spatial regions whose descriptions provide the most concise account of a sample. Language is the medium people use to state conclusions from the findings of analytic efforts.  In statistical studies these are generalizations about the patterns discovered in empirical data.  And language is the primary way people share the rules-of-thumb we pass around as knowledge.