October 27, 2007

Bartender

Last week, Shashi Borade was asking a few people whether they found the following joke to be funny: C.E. Shannon goes into a bar and orders a beer.  The bartender says, "I don't see where this joke is going."  Shannon replies, "It's an open problem."

I thought it was mildly amusing because of the breaking of the fourth wall.  Others did not agree.

At this point, you must be thinking, "I don't see where this post is going." 

Some people find the joke funny and some do not, but it is not a noisy measurement, so averaging across subjects does not make sense.  A paper that I was flipping through recently develops a probability model for exactly this type of individual variation.  I reproduce the first paragraph of "Modeling Individual Differences Using Dirichlet Processes," by Navarro, Griffiths, Steyvers, and Lee (2005) here:

Suppose we asked one hundred people which number was the most unlucky. Of those people, fifty said ‘13’, forty said ‘4’, and ten said ‘87’. This variation is unlikely to be due to noise in the cognitive process by which people make unluckiness judgments: If we replicated the experiment with the same people, the same fifty people would probably say 13 again. It seems much more likely that most of the observed variation arises from genuine differences in what those people believe. A complete explanation of people’s answers would have to account for this variation.
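To make the idea of grouping people by genuine differences a little more concrete, here is a minimal sketch of the Chinese restaurant process, a standard construction associated with Dirichlet processes.  It is not the model from the paper; the function name `crp_assignments`, the concentration value `alpha`, and the seed are assumptions chosen purely for illustration.

```python
import random

def crp_assignments(n_people, alpha, seed=0):
    """Sample a partition of n_people into groups via the Chinese restaurant process.

    Each new person joins an existing group with probability proportional to
    its current size, or starts a new group with probability proportional to
    alpha (assumed to be positive).
    """
    rng = random.Random(seed)
    assignments = []  # group index chosen by each person, in arrival order
    sizes = []        # current number of people in each group
    for _ in range(n_people):
        weights = sizes + [alpha]  # last slot stands for "start a new group"
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(sizes):
            sizes.append(1)
        else:
            sizes[choice] += 1
        assignments.append(choice)
    return assignments, sizes

assignments, sizes = crp_assignments(100, alpha=1.0)
print(len(sizes), "groups with sizes", sizes)
```

The number of groups is not fixed in advance, which is what makes this family of models attractive when we do not know how many distinct opinions exist among the hundred people.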

As I have become aware through the course I am taking this semester, linguists (syntacticians in particular) are sometimes criticized for getting data from just one speaker of a language and then basing theories on that data.  The common thinking is that data must be gathered from many subjects to be reliable. 

This contention was somewhat implicit in a workshop last week entitled "Where Does Syntax Come From? Have We All Been Wrong?" co-organized by Bob Berwick.  One of the speakers, Christopher Manning, talked about learning language from very large corpora, such as a corpus of Wall Street Journal articles, but his probability models did not seem to be mixture models, which would be necessary to capture individual variation.  Noam Chomsky, in his talk, basically brushed off statistics within his first two sentences. 

The 'LIDS-iest' of the talks during the workshop was given by Partha Niyogi, who applied learning theory to language acquisition and evolution.  The starting point for the talk was that the structure of all languages is the same except for certain parameters that may be set differently.  The running example of a parameter was whether a language is head-initial or head-final.  In English, the tree-structure of a sentence is something like this:
[Figure: tree structure of "Cows eat grass"]
In Hindi-Urdu, the tree-structure is something like this:
[Figure: tree structure of "Cows grass eat"]
As anyone familiar with graph theory knows, the two trees are the same, but are just drawn differently.  How to draw the tree is the parameter. 
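The graph-theoretic point can be checked with a small sketch.  The tuple representation `(label, children)`, the node labels "S" and "VP", and the helper names are my own assumptions for illustration, not anything presented in the talk.

```python
# Hypothetical representation: a tree is (label, [children]).
def canonical(tree):
    """Return an order-independent form of a tree by sorting the children."""
    label, children = tree
    return (label, tuple(sorted(canonical(child) for child in children)))

def same_unordered(a, b):
    """True if two trees are equal once the left-to-right order is ignored."""
    return canonical(a) == canonical(b)

# Head-initial: the verb precedes its object inside the verb phrase.
english = ("S", [("cows", []), ("VP", [("eat", []), ("grass", [])])])
# Head-final: the verb follows its object inside the verb phrase.
hindi_urdu = ("S", [("cows", []), ("VP", [("grass", []), ("eat", [])])])

print(same_unordered(english, hindi_urdu))  # True
print(english == hindi_urdu)                # False
```

The two trees differ only when the order of children matters, which is the sense in which the parameter is about how the tree is drawn.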

As another example, consider the two sentences: (1) Norbert thinks that he is a genius; (2) Norbert thinks that himself is a genius.  Sentence (1) is grammatical, but "he" need not refer to Norbert.  In sentence (2), "himself" refers to Norbert, but the sentence is ill-formed.  What we should really have is the sentence: (3) Norbert thinks that heself is a genius.  A parameter called the nominative island condition prevents English from having a sentence like (3), but Chinese has the parameter set differently.  In Chinese, the word "taziji" is used for "heself."

Niyogi provided an analysis showing that if an ideal learning algorithm is used to learn from a heterogeneous population of speakers -- most with parameter setting A and a few with parameter setting A' -- a language can evolve very quickly from having parameter setting A to having A'.  One such historical change was the conversion from Old English, a head-final language, to modern English, a head-initial language.  Slides from a similar talk are available here.

If I were clever, I would now tie together the penultimate paragraph about ideal learning algorithms and the opening gambit about the bartender.  However, how to do so is an open problem. 

October 09, 2007

Aleae

Have we crossed the Rubicon in how we deal with probabilistic reasoning?

In his textbook, Terrence L. Fine has written that:

Probabilistic reasoning is a complex of approaches to navigating in the wide variety of realms of chance, random, uncertainty, and nondeterministic phenomena.  This complex ranges from the informal, intuitive, qualitative reasoning of everyday life (e.g. "I'll probably see you tomorrow") to the formal, quantitative reasoning that underlies much of engineering applications (e.g. "The probability that the next transmitted bit will be in error is .0001").  Our focus will be on a formal presentation of quantitative probability, a form of probabilistic reasoning that has been vigorously explored in this past century.

Having trudged up the slope to reach his class at 8:40 in the morning during the spring semester of 2002 and been a group tutor for it in 2004, I was well aware of Prof. Fine's view of and pedagogical style with quantitative probability -- a style not well received by some.  I did not know much about how he views the other parts of the complex of approaches of probabilistic reasoning, but I had the chance to learn when he visited LIDS last week.

The case Prof. Fine presented during the LIDS colloquium was that noblemen and their dice need not be the only foundation for probability.  Converting everything to a single real number in the interval zero to one is not the only option.  There are things like comparative probability, upper and lower probabilities, and interval-valued probability that may be a better fit depending on what you're doing. 

The colloquium talk was followed up the next day by a talk entitled "Computationally-based Agnostic Induction" featuring some recent ideas.  The idea is to come up with the degree of inductive support for hypotheses from evidence, denoted h|e.  In standard, quantitative probability, this amounts to applying Bayes' theorem to obtain p(h|e).
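For reference, the standard quantitative route looks like the sketch below.  The two hypotheses, their priors, and their likelihoods are made-up numbers for illustration only, and `posterior` is a hypothetical helper name.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Bayes' theorem: p(h|e) = p(e|h) p(h) / sum over h' of p(e|h') p(h')."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}

# Assumed example: the evidence e is "the coin came up heads".
priors = {"fair coin": Fraction(1, 2), "two-headed coin": Fraction(1, 2)}
likelihoods = {"fair coin": Fraction(1, 2), "two-headed coin": Fraction(1, 1)}

post = posterior(priors, likelihoods)
print(post["fair coin"])        # 1/3
print(post["two-headed coin"])  # 2/3
```

Everything is collapsed to a single number between zero and one per hypothesis, which is exactly the step the talk set out to avoid.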

Instead, what was proposed was to think of evidence and hypotheses in terms of a formal language such that the evidence can be expressed as a binary sequence and hypotheses can also be expressed as binary sequences. 

Simple things have short sequences.  As an example, let us say that I flipped a coin today in the Stata Center and it came up heads.  Let us also say that I cannot observe the result directly and only know that it either landed on its edge or came up heads.  Two possible explanations for this evidence are: (1) the coin never lands on its edge, it is random whether it comes up heads or tails, and on this instance it happened to come up heads; and (2) on days when two of the following three people (the Dalai Lama, Prof. Fine, and Andrey Kolmogorov) are in the same city, my coin flip in the Stata Center comes up heads; if all three are in the same city, it lands on its edge; and otherwise it comes up tails.  Explanation (1) is simpler and requires a shorter description, whereas (2) is more complex and requires a longer description.

I am not very comfortable with the theory and mathematics of this sort of stuff -- an interested reader should refer to this.  But in any case, the point is that if we go through all possible explanations of evidence and compute the complexity or description length for each of the hypotheses, the short explanations will be 'most likely.'  I put 'most likely' in quotation marks, because what we will really get is a partial ordering of which explanation is shorter than which other explanation.  The computation is done using a Turing machine.  The work is mathematically solid, but beyond my ability to explain fully. 
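A very crude sketch of the ranking step follows.  True description length is defined relative to programs on a Turing machine and cannot be computed in general, so here the length of an English description in bits stands in for it.  That stand-in, the wording of the two descriptions, and the function name are all assumptions of mine, not the method from the talk.  Note also that sorting by a single number yields a total order, whereas the talk described a partial ordering.

```python
def description_length_bits(description):
    """Crude proxy for complexity: size of the UTF-8 encoded text in bits."""
    return 8 * len(description.encode("utf-8"))

# Paraphrased (assumed) descriptions of the two explanations of the coin flip.
hypotheses = {
    "(1) random coin": "The coin never lands on its edge; heads or tails is random.",
    "(2) three people": (
        "If two of the Dalai Lama, Prof. Fine, and Andrey Kolmogorov are in "
        "the same city, heads; if all three are, edge; otherwise tails."
    ),
}

# Shorter descriptions come first, i.e. they are 'most likely' in this sense.
ranked = sorted(hypotheses, key=lambda name: description_length_bits(hypotheses[name]))
for name in ranked:
    print(name, description_length_bits(hypotheses[name]))
```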

In his textbook, Fine has also written that:

The ancient Greeks struggled to develop a notion of possibility or contingency as opposed to necessity.  For example, a future event ("there will be a naval battle tomorrow") considered today need not necessarily happen or fail to happen.

I had the intuition that going through all of the scenarios, explanations, or programs and performing the computation is like possible worlds semantics.  In semantics, the epistemic reading of a sentence like "John may leave" is true if and only if there exists some world w' compatible with the evidence in our world w in which John leaves.  The extension here is to also look at the complexity of w'.  According to Prof. Fine, however, the intuition is not quite right. 
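The epistemic reading of "may" described above can be written down directly.  This sketch covers only that textbook definition, not the complexity extension; the worlds, the evidence, and the function names are invented for illustration.

```python
def may(proposition, worlds, compatible_with_evidence):
    """Epistemic 'may': true iff the proposition holds in at least one world
    that is compatible with the evidence."""
    return any(proposition(w) for w in worlds if compatible_with_evidence(w))

# Hypothetical worlds, each a small record of facts.
worlds = [
    {"john_leaves": True, "car_in_lot": False},
    {"john_leaves": False, "car_in_lot": False},
    {"john_leaves": False, "car_in_lot": True},
]

def john_leaves(world):
    return world["john_leaves"]

def lot_is_empty(world):
    # Assumed evidence in our world: John's car is not in the lot.
    return not world["car_in_lot"]

def car_is_there(world):
    # Alternative assumed evidence: John's car is in the lot.
    return world["car_in_lot"]

print(may(john_leaves, worlds, lot_is_empty))  # True
print(may(john_leaves, worlds, car_is_there))  # False
```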

This entire inquiry, reconsidering the foundations of probability, is quite philosophical.  I didn't appreciate everything in the two talks, but what I did get was that although some may consider the die for standard, quantitative probability to have been cast, that is not the only opinion. 
