June 05, 2008

Purpose

Walking home from the Stochastic Systems Symposium, which I played a role in organizing, I encountered two shirtless guys riding their bikes down the middle of the street yelling that the Celtics had advanced to the NBA Finals.  Who knows why they were doing this a night late? 

Let me not be tardy like those shirtless guys on bicycles and write about an emerging paradigm.  Learning of complicated models from training examples has often been a generative pursuit, that is the goal is to learn a model from which one can generate new examples that are like the training examples.  One piece of work of this type is: Describing Visual Scenes Using Transformed Objects and Parts by Sudderth, Torralba, Freeman, and Willsky.  The key word in the title is describing.  The goal is to model or describe visual scenes, but part of the evaluation looks at classification tasks.  If the goal or purpose of the learning is in fact the classification task, then it makes sense to include classification in the objective function, making things more discriminative. 

In both Supervised Topic Models by Blei and McAuliffe in the most recent NIPS and Conditionally Trained Latent Dirichlet Allocation for Text Modeling and Categorization by Lacoste-Julien, Sha, and Jordan presented a couple of months ago at the Learning Workshop in Snowbird, models very much like those of Sudderth et alii are learned with the purpose of classification in mind.  Undirected graphical models are learned for the purpose of classification (hypothesis testing) in Learning Graphical Models for Hypothesis Testing by Sanghavi, Tan, and Willsky presented at the most recent SSP Workshop.  Overcomplete dictionaries are learned for the same purpose in Discriminative Learned Dictionaries for Local Image Analysis by Mairal, Bach, Ponce, Sapiro, and Zisserman to appear at CVPR later this month.

The purpose of classification (hypothesis testing) is also central to work I have done with my brother and fellow LIDS student Lav.  In Quantization of Prior Probabilities for Hypothesis Testing by Varshney and Varshney, which was recently accepted by the IEEE Transactions on Signal Processing and is available from arXiv, we are doing quantization (k-means clustering) with classification performance as the objective.  I encourage you to read the article and see how it is related to NBA referees and racial discrimination.  I'm biased, but I think it is quite interesting. 

I would write more about it, but Game 1 is about to start.  Beat L.A.!

February 14, 2008

Jury Duty

A few days after coming back from Houston, I had the pleasure of going up to Lowell, MA for jury duty

While biding my time in the courthouse, it occurred to me that jury duty is much like reviewing.  (Other people have, of course, made the same observation.)  The review process is subject to much debate, e.g. these two blog entries.  One difference for me is that I hope to never be on the other side in criminal proceedings, but I have and will continue to be on the other side in peer review. 

What was the probability of my getting called to jury duty in Lowell, not being empanelled, getting released before lunchtime, stopping at the Burlington Mall for lunch but first walking into Sears to buy pants at the same time the principal chair of the trumpet section in my high school's wind ensemble during my freshman year is walking out of Sears, and remembering his name (Larry Chien) ten seconds later?  There are many considerations in reporting probabilities of coincidental events after the fact, as discussed here recently in the context of primary elections in the city of my birth. 

The juror's task in a criminal trial is to decide whether the defendant is guilty beyond a reasonable doubt based on the evidence presented.  He or she is supposed to not have prior biases, but is supposed to use intelligence.  Are not priors a critical part of intelligence?  I am not a philosopher and do not pretend to have any insight about this topic.  Some quotations loosely related to the topic from things I have read recently:

The most important qualifications of a juror are fairness and impartiality. The juror must be led by intelligence, not by emotions, must put aside all bias and prejudice, must decide the facts and apply the law impartially.

It is a striking theorem of Bayesian analysis that, if the DM's prior distribution of the parameter p is sufficiently 'open minded', then, if the true value of p is p* (say), then the sequence of the DM's posterior distributions of p will become more and more concentrated in the neighborhood of p*.  In other words, the DM will asymptotically learn the true value of the parameter p.  By 'open minded' I mean, roughly speaking, that the DM does not rule out as impossible any value of the parameter between zero and one.  Technical Note: More precisely, 'open minded' means that the support of the prior is the entire unit interval.

It was with a sense of bewilderment and confusion that I left Benares.  Yet I was captivated by it all, and from that moment to this India has been to me the land of enduring and ever-increasing fascination, nor has a day passed without my learning something new and strange about the working of the Indian mentality.  How to express the thing is difficult, but I may put it thus.  As opposed to the Western mind, the Indian mind does not seem to be conditioned by facts.  Take the most highly educated Indian graduate of Oxford or Cambridge, a man versed in the arts, sciences or philosophy.  He will not think it incompatible with his learning to go on believing what he had been taught as a boy.  He has absorbed a new knowledge but it has not displaced the old.  Also, there seems to be a fusion in the Indian mind between myth and history, as though both were of a piece.  Of Indian thinking as I encountered it at the end of the nineteenth century, the most consistent thing was inconsistency.  Yet Indians in general are highly intelligent.  Their lawyers are among the best, and their linguists and teachers compare favourably with those of any other people.

The trouble is that this traditional picture of the relation between deduction and induction conflates two quite different things, a theory of reasoning and a theory of what follows from what.

  1. Office of Jury Commissioner, Commonwealth of Massachusetts (1998). The Trial Juror's Handbook, Sixth Edition.
  2. Roy Radner (2005). Costly and Bounded Rationality in Individual and Team Decision-making.  In: Understanding Industrial and Corporate Change. Oxford University Press.
  3. Sam Higginbottom (1949). Sam Higginbottom: Farmer, p. 53. New York: Charles Scribner's Sons.
  4. Gilbert Harman and Sanjeev Kulkarni (2007). Reliable Reasoning: Induction and Statistical Learning Theory, p. 5. MIT Press.

October 27, 2007

Bartender

Last week, Shashi Borade was asking a few people whether they found the following joke to be funny: C.E. Shannon goes into a bar and orders a beer.  The bartender says, "I don't see where this joke is going."  Shannon replies, "It's an open problem."

ShannonI thought it was mildly amusing because of the breaking of the fourth wall.  Others did not agree. 

At this point, you must be thinking, "I don't see where this post is going." 

Some people find the joke funny and some do not, but it is not a noisy measurement, so averaging across subjects does not make sense.  A paper that I was flipping through recently develops a probability model for exactly this type of individual variation.  I reproduce the first paragraph of "Modeling Individual Differences Using Dirichlet Processes," by Navarro, Griffiths, Steyvers, and Lee (2005) here: Suppose we asked one hundred people which number was the most unlucky. Of those people, fifty said ‘13’, forty said ‘4’, and ten said ‘87’. This variation is unlikely to be due to noise in the cognitive process by which people make unluckiness judgments: If we replicated the experiment with the same people, the same fifty people would probably say 13 again. It seems much more likely that most of the observed variation arises from genuine differences in what those people believe. A complete explanation of people’s answers would have to account for this variation.

As I have become aware through the course I am taking this semester, linguists (syntacticians in particular) are sometimes criticized for getting data from just one speaker of a language and then basing theories on that data.  The common thinking is that data must be gathered from many subjects to be reliable. 

This contention was somewhat implicit in a workshop last week entitled "Where Does Syntax Come From? Have We All Been Wrong?" co-organized by Bob Berwick.  One of the speakers, Christopher Manning, talked about learning language from very large corpora, such as a corpus of Wall Street Journal articles, but his probability models did not seem to be mixture models, which would be necessary to capture individual variation.  Noam Chomsky, in his talk, basically brushed off statistics within his first two sentences. 

The 'LIDS-iest' of the talks during the workshop was given by Partha Niyogi, who applied learning theory to language acquisition and evolution.  The starting point for the talk was that the structure of all languages is the same except for certain parameters that may be set differently.  The running example of a parameter was whether a language is head-initial or head-final.  In English, the tree-structure of a sentence is something like this:
Cows_eat_grass
In Hindi-Urdu, the tree-structure is something like this:
Cows_grass_eat
As anyone familiar with graph theory knows, the two trees are the same, but are just drawn differently.  How to draw the tree is the parameter. 

As another example, consider the two sentences: (1) Norbert thinks that he is a genius; (2) Norbert thinks that himself is a genius.  Sentence (1) is grammatical, but he cannot refer to Norbert.  In sentence (2) himself refers to Norbert, but is an ill-formed sentence.  What we should really have is the sentence: (3) Norbert thinks that heself is a genius.  A parameter called the nominative island condition prevents English from having a sentence like (3), but Chinese has the parameter set differently.  In Chinese, the word taziji is used for heself

Niyogi provided analysis which explains that if an ideal learning algorithm is used to learn from a heterogeneous population of speakers -- most with parameter setting A and a few with parameter setting A' -- a language can evolve very quickly from having parameter setting A to having A'.  One such historical change was the conversion from Old English, a head-final language, to modern English, a head-initial language.  Slides from a similar talk are available here.

If I were clever, I would now tie together the penultimate paragraph about ideal learning algorithms and the opening gambit about the bartender.  However, how to do so is an open problem. 

October 09, 2007

Aleae

Have we crossed the Rubicon in how we deal with probabilistic reasoning?

In his textbook, Terrence L. Fine has written that: Probabilistic reasoning is a complex of approaches to navigating in the wide variety of realms of chance, random, uncertainty, and nondeterministic phenomena.  This complex ranges from the informal, intuitive, qualitative reasoning of everyday life (e.g. "I'll probably see you tomorrow") to the formal, quantitative reasoning that underlies much of engineering applications (e.g. "The probability that the next transmitted bit will be in error is .0001").  Our focus will be on a formal presentation of quantitative probability, a form of probabilistic reasoning that has been vigorously explored in this past century.

Having trudged up the slope to reach his class at 8:40 in the morning during the spring semester of 2002 and been a group tutor for it in 2004, I was well aware of Prof. Fine's view of and pedagogical style with quantitative probability -- a style not well received by some.  I did not know much about how he views the other parts of the complex of approaches of probabilistic reasoning, but I had the chance to learn when he visited LIDS last week

The case Prof. Fine presented during the LIDS colloquium was that noblemen and their dice need not be the only foundation for probability.  Converting everything to a single real number in the interval zero to one is not the only option.  There are things like comparative probability, upper and lower probabilities, and interval-valued probability that may be a better fit depending on what you're doing. 

The colloquium talk was followed up the next day by a talk entitled "Computationally-based Agnostic Induction" featuring some recent ideas.  The idea is to come up with the degree of inductive support for hypotheses from evidence, denoted h|e.  In standard, quantitative probability, this amounts to application of the Bayes theorem to obtain p(h|e). 

Instead, what was proposed was to think of evidence and hypotheses in terms of a formal language such that the evidence can be expressed as a binary sequence and hypotheses can also be expressed as binary sequences. 

Simple things have short sequences.  As an example, let us say that I flipped a coin today in the Stata Center and it came up heads.  Let us also say that I cannot observe the result directly and only know that it either landed on its edge or came up heads.  Two possible explanations for this evidence are: (1) the coin never lands on its edge, it is random whether it comes up heads or tails, and on this instance it happened to come up heads; and (2) on days that two of the following three people: the Dalai Lama, Prof. Fine, and Andrey Kolmogorov, are in the same city, my coin flip in the Stata Center comes up heads, if all three are in the same city, it lands on its edge, and otherwise it comes up tails.  Explanation (1) is simpler and requires a shorter description, whereas (2) is more complex and requires a longer description. 

I am not very comfortable with the theory and mathematics of this sort of stuff -- an interested reader should refer to this.  But in any case, the point is that if we go through all possible explanations of evidence and compute the complexity or description length for each of the hypotheses, the short explanations will be 'most likely.'  I put 'most likely' in quotation marks, because what we will really get is a partial ordering of which explanation is shorter than which other explanation.  The computation is done using a Turing machine.  The work is mathematically solid, but beyond my ability to explain fully. 

In his textbook, Fine has also written that: The ancient Greeks struggled to develop a notion of possibility or contingency as opposed to necessity.  For example, a future event ("there will be a naval battle tomorrow") considered today need not necessarily happen or fail to happen.

I had the intuition that going through all of the scenarios, explanations, or programs and performing the computation is like possible worlds semantics.  In semantics, the epistemic reading of a sentence like "John may leave" is true if and only if there exists some world w' compatible with the evidence in our world w in which John leaves.  The extension here is to also look at the complexity of w'.  According to Prof. Fine, however, the intuition is not quite right. 

This entire inquiry, reconsidering the foundations of probability, is quite philosophical.  I didn't appreciate everything in the two talks, but what I did get was that although some may consider the die for standard, quantitative probability to have been cast, that is not the only opinion. 

August 21, 2007

Cheminformatics

A few days ago, my officemate and I were talking about how our respective summers went.  (She was out in Berkeley for a good chunk of it.)  I think that the general consensus among grad students about the summer is that yeah, you did do stuff, but you think you could have done more.  I had a similar discussion with one of my roommates Lee-Ping, a graduate student in chemistry.  He was saying that he did make progress on research, but didn't do enough to have something for the American Chemical Society national meeting that is here in Boston this year.  He didn't register for it either, but was thinking of possibly sneaking in.

Yesterday, LIDS students had the opportunity to hear about some chemistry without sneaking into the ACS meeting.  Shaillay Kumar Dogra, from Strand Life Sciences, spoke during a special LIDS seminar about cheminformatics. 

Potential pharmaceutical chemicals have two requirements: that they be biologically effective and that they have good absorption, distribution, metabolism, and excretion (ADME) properties.  Testing for ADME properties through experiments is costly and tedious, so if it could be done through cheaper and faster means, that would be good. 

The main idea of cheminformatics is that given experimental data about a property such as permeability through the intestine for a large database of molecules, model the property through machine learning techniques so that when presented with a new molecule not in the database, the property value for it can be estimated without running the actual experiment. 

To do machine learning, one needs features about the molecules.  Mr. Dogra talked about a 1081-dimensional feature space containing simple features such as the total number of atoms (nodes in the graph) and the total number of bonds (edges in the graph) and more complicated features such as graph eigenvalues.  On the algorithmic side, he employs permutation testing-based methods developed by our MIT colleagues.  Perhaps follow-up work on the same topic could be useful for this application domain. 

I wonder if my former roommate and chemist Brian Hanna snuck in to the ACS meeting.

May 19, 2007

Mind Games

Yesterday's LIDS lunch brought the academic year to a close and was the occasion for the release of the third volume of LIDS-All magazine; despite the absence of a trip inside the mind of Erik Sudderth, it contains many nice features.  A secret about the end of the semester is that professors and TAs are more thrilled about it than students. 

In the last lecture of the linguistics class I have been taking this semester, Semantics & Pragmatics, Kai introduced us to the Wason selection task, a psychology experiment that shows that humans behave illogically when making snap decisions about abstract correlations.  However, we are much, much better when the same exact task is presented in terms of social interactions, conventions, or contracts. 

Semantics is the study of meaning in natural human language.  Formal semantics makes use of partial functions, logic, and set theory to describe the truth conditions of a sentence, i.e. when it is true and when it is false.  Bob receives most messages is true if and only if the cardinality of the set of messages that Bob receives is greater than half the cardinality of the set of messages.

Pragmatics concerns itself with implicatures that can be drawn from sentences assuming cooperative communication among agents.  Through the logic of pragmatics, we infer that Bob does not receive all messages.  Bob receiving all messages entails Bob receiving most messages.  The hearer of the sentence would be interested in knowing whether Bob receives all messages and to be cooperative, the speaker should give the proper amount of information; since the speaker does not say that Bob receives all messages, the hearer believes that it must not be true. 

Detection, estimation, communication, and decision theory built upon probability theory, which is itself built upon set theory, is much more like semantics than pragmatics.  On the other hand, humans have difficulty separating semantics and pragmatics because we are always busy trying to make inferences.  In communication, we would like it if not sending a signal were to give us a free bit of information, but in some schemes it doesn't seem to work out that way in terms of communication theory, but in terms of human social interaction, not sending a signal provides a wealth of information. 

Two recent happenings in the world of sports and common reactions to them reveal something about how different the mind is from the axioms upon which study in LIDS is based.  The selection committee for the NCAA lacrosse tournament seeded Cornell fourth despite their primacy by many criteria.  Reactions did not bring up the mathematical fact that there is no optimal way to rank (Arrow's theorem), but focused on social/psychological aspects like fairness, bullying, and manipulation.  An econometric study was released showing that NBA referees are biased.  Reactions here disputed the math and statistics as much as, if not more than, the bias/illogicalness of humans when making snap decisions. 

The focus on social interaction differentiates us from systems that we design at present.  The quest for semantic communication is really a misnomer; it is pragmatic communication that we ought to seek, and taking a trip inside the mind is one way to go. 

May 08, 2007

Sea of Knowledge

LIDS is a vast sea of knowledge, but a fluid one.  Interactions with other seas of knowledge keep the lab from getting staid.  A confluence of events led to a deluge of colloquia and seminars in the first week of May. 

Compressive sampling and ℓ1, which has turned out to be a recurring theme, was presented by Emmanuel Candès.  Going through capacity calculations, Andrew Viterbi, an MIT alumnus, talked about wideband wireless communication like CDMA and OFDM, and how the next thing in cellular telephony is MIMO, probably built up on OFDM.  An electrical engineer's perspective on the life sciences, annotating genomes via Markov models in particular, was given by M. Vidyasagar, who interestingly or not, has an uncanny resemblance to the orthodontist I used to go to.  David Tse, a LIDS alum, gave two talks, the first about communication limited by a lack of cooperation and the second about communication enhanced by cooperation in a network.  Venkat Anantharam spoke on a counterintuitive topic, stochastic resonance, which I was first introduced to by Bernie Hutchins in my last semester at Cornell; the idea is that system performance can be improved by an increase in noise. 

As a student, the great thing about having such luminaries come through is that you get a sense of what's on the horizon and beyond.  Let me take a shot, as flawed as it may be, at a concise description of the scene painted by the roster of five speakers.  Like the five classical elements that have guided the understanding of nature: earth, water, fire, wind, and ether, five fundamentals: sparsity, diversity, Markovianity, cooperation, and randomness, have and will continue to define the study and design of information and decision systems. 

Elements_2

April 16, 2007

Dense Sparsity

I explained my work to him and he said this: 'The 20th century was the century of least squares, and the 21st may very well be that of ℓ1.'

Quite a statement.

On Wednesday, the week before last, Martin Wainwright, a LIDS alumnus, was back to give a talk at my group's weekly seminar series.  He talked about practical and information-theoretic bounds related to recovering the sparsity pattern of noisy signals using ℓ1-constrained quadratic programming.  The intimate relationship with sparsity is where ℓ1 derives its utility.

On Thursday, I had my research qualifying exam.  The RQE is the second part of the qualifying procedure for the Ph.D. in EECS.  It involves a written report and presentation to a committee of two professors about some past research.  I talked about my master's research, which was related to sparsity, applied to synthetic aperture radar image formation.  People almost always sail through the RQE, but you can't be nonchalant about it.  In my work, I used ℓp, 0<p<1, regularization to favor sparsity rather than ℓ1

If one really wants to look at sparsity, one should look at the ℓ0-criterion, which counts the number of non-zeroes.  ℓ0 is intractable to work with; ℓp, 0<p<1, is non-convex; and ℓ1 is convex.  A major thrust of research is in showing the equivalence of sparse solutions obtained by minimizing ℓ0 and ℓ1 in various settings.  Two of the folks working in the field, Tao and Donoho, combined to give six lectures at MIT.  The lectures were hosted by the math department, but a lot of LIDS students, post-docs, and faculty attended.  Tao didn't talk about sparsity, but Donoho was all over it.  In his first lecture on Monday, Donoho presented the preliminaries and showed some of sparsity's magic.  I fully and completely understood the talk, which is a rarity for me. In most talks I attend, if I get 15% of it, that is remarkable, although I have improved in my three years as a graduate student.

For lecture 2 of Donoho on Tuesday, I was back down to less than 15%, when he talked about neighborly polytopes and ℓ1.  I couldn't make it to lecture 3, because I had a plane to catch.  (Look out for my next post recapping the conference I attended, which unsurprisingly had some sparsity and ℓp mixed in.)

Martin made the comment that working in this specific area is stressful because it is moving forward so quickly.  My comrade Dmitry Malioutov can attest that independent, nearly simultaneous discoveries of the same or similar results are happening all the time in research on sparse signal representation and approximation.  Forecasting the future of this frontier is beyond me, but if my last week-and-a-half is to be believed, then there is some truth to the quotation at the beginning. 

Most Recent Photos

  • Kneecrack
  • Leonardo6
  • Cartoon
  • Mr_strong_small
  • N515449250_884026_6342_2
  • N515449250_884026_6342