It’s an interesting thought. The upcoming Sentiment Analysis Symposium is having participants compete to try to do just that. But sentiment analysis would be only one tool out of a whole bag of NLP tricks that professionals in this field could use to try to figure out who will be our next president. The thing that strikes me about the whole idea is the thing that usually jumps out at me when I’m asked about predictive tasks of this kind: what’s the source data and how much do you know about it? The answer – for me at least – is that I don’t know nearly enough to have any chance of success at this task (well, beyond the coin-toss chance of anyone else). Just imagining the number of possibilities boggles the mind.

Let’s say I wanted to make this a linear-sequence problem – similar to part-of-speech tagging. I could take sequences of Republicans-in-office followed by Democrats, Democrats-in-office followed by Democrats… you get the idea. But, oops… can’t really do that because, well, we just haven’t had enough presidents to make a decently robust NLP problem out of this.

OK, back to the drawing board. How about I analyze the convention-night speeches of all the winners in American history vs. all the losers? Now I am on to something. I am sure to have a few million words, and I can take tf-idf scores, topic profiles and the like and use them to train up a nice winner-loser classifier. Why, it’ll be better than a spam filter. Trouble is, what I am more likely to get out of it is author recognition rather than winner recognition, unless I can take out all the words that are stylistically associated with the speakers. I could do that, but it would be very time-consuming… it would probably take a few months for me to find a bunch of speeches from each candidate and… well, the election would be over.

So maybe all these professional polling folks have it right. Maybe the best way to predict who the next president is going to be is to ask people who they are voting for. Maybe nobody really listens to these politicians anyway, no matter what they say. And therein lies the trouble… and the reason this is one problem NLP might not lick. Doesn’t mean I won’t try, just for fun.
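For the curious, here is a minimal sketch of the winner-loser classifier I floated above – tf-idf features feeding a simple linear model, assuming scikit-learn. The speeches and labels are hypothetical placeholders; a real experiment would need the actual convention-night corpus, and, as I said, a model like this would probably learn authorship as much as anything.

```python
# A minimal sketch of the winner/loser speech classifier, assuming scikit-learn.
# The speeches and labels below are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

speeches = [
    "my fellow americans, tonight we stand at a crossroads of history",
    "four more years of peace and prosperity for every working family",
]
labels = ["winner", "loser"]  # hypothetical election outcomes

# tf-idf features feed a simple linear classifier. Without stripping
# speaker-specific style words first, a model like this is likely to
# learn authorship rather than "winner-ness".
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(speeches, labels)
print(model.predict(["we will fight for every family in this country"]))
```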

It’s been a while since my last post, but I am looking forward to the upcoming Sentiment Analysis Symposium in May. One of the new things to think about with sentiment is how it can be expressed in different genres. Until recently, linguists did not pay much attention to how genre might influence the lexical and grammatical features used for classification and other tasks. That is certainly starting to change, but I think the impact on sentiment analysis in particular deserves deeper investigation going forward.

For one thing, the microblog document (OK, Twitter) has so many more ways to express sentiment than other document types. It is not just emoticons that are of interest but all kinds of textual manifestations of emotion, including representations of sound (ugh! eeew!), that are fairly rare elsewhere, even in email. I am also excited by the idea of how the concept of “sentiment” has become intertwined with “reputation”. Why is that exciting? Well, because the traditional polarity expectations change when something as subjective as a “reputation” becomes the topic. For example, when sentiment analysis was applied mostly to product reviews or news snippets, what sorts of things happened? Well, your “bad” news events like earthquakes and people complaining about products were tagged negative, while product raves and good news were tagged positive. Sure, once in a while a sarcastic review will stump the classifier, as will things like “plummeting” inflation. But reputation is a different animal. Many people do not want certain things exposed simply because of their position relative to other things. For example, a Republican does not want certain types of quotes exposed – even if they are genuine and popular with the public – simply because they hurt his reputation *as a Republican* in the media. Reputations may indeed have polarity – it just is not as invariant as the polarity inherent in events in other contexts.
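As a toy illustration of the microblog point, here is how one might start to spot those sound-based expressions of emotion. The pattern list is purely my own illustrative guess, not a real lexicon.

```python
# A rough sketch of spotting "sound" expressions of sentiment (ugh!, eeew!)
# in microblog text. The pattern list is illustrative, not exhaustive.
import re

SOUND_PATTERN = re.compile(r"\b(ugh+|ee+w+|yay+|argh+|meh|wow+)\b", re.IGNORECASE)

tweet = "Eeew! that new phone case... ugh, returning it"
print(SOUND_PATTERN.findall(tweet))  # ['Eeew', 'ugh']
```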
I admit I don’t quite have my head around this reputation question yet, but I am thinking it over ahead of the symposium and wondering if others are thinking about this too. I originally did not imagine that online communications would vary so much in the way content is structured, but I have been surprised…

Well, I came away from the Sentiment Analysis Symposium very excited about all the applications of sentiment that were presented. One of the best talks discussed stock price fluctuations as a function of document-level sentiment viewed over time (from media sources). This is, of course, a very powerful application – depending upon its reliability. I had no doubt even before hearing this talk that sentiment analysis consumers were going to have a large appetite for this kind of application. One thing that struck me in particular, though, was how the “semantic scope” of sentiment might be expanded – what could anyone add to the financial analysis of unstructured data in this area that would be interesting? Or is it all and only what “people” (I include pundits in this designation) “think” of a company?

For example, while it might seem superficially that only “positive” and “negative” have any application to sentiment analysis in finance, this is, I believe, not the case. If we expand to other types of oppositions, we see some interesting types of document analyses that could be very useful. Let’s take a simple event opposition like “buy” vs. “sell” – or its dispositional relative, “long” vs. “short”. What if we applied that to topics on a document level or entities on a sentence level? Wouldn’t it be nice to see the groups that are long on gold and short on t-bills – or to turn that into grouping “contrarian” vs. “conventional” positions? OK, so maybe things like this are already covered in structured data and available on Bloomberg, but what about more subtle oppositions like “expanding”/“growing” vs. “retracting”/“shrinking”? These oppositions are much more likely to be discussed in quotes from corporate leaders and reported in the media – often echoed in the text of “forward-looking statements” in annual reports. Now, this sort of thing gets a little more difficult for the quants to pick up without language analysis and would, I think, make a nice addition to the greater SA offering. I am sure I will have colleagues laughing at my giddy insistence on putting the cart before the horse – after all, traditional sentiment analysis accuracy testing is still controversial – but, hey, it’s my blog and I’ll dream if I want to.
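To make the idea concrete, here is a toy sketch of tagging sentences with dispositional poles rather than plain positive/negative. The cue lexicon is a hypothetical stand-in, not a real financial vocabulary.

```python
# A toy sketch of tagging text with "opposition" poles instead of plain
# positive/negative polarity. The cue lexicon is a hypothetical stand-in.
OPPOSITIONS = {
    "long": ("long on", "bullish on", "buying"),
    "short": ("short on", "bearish on", "selling"),
    "expanding": ("expanding", "growing"),
    "shrinking": ("retracting", "shrinking", "contracting"),
}

def tag_oppositions(sentence):
    """Return every opposition pole whose cues appear in the sentence."""
    s = sentence.lower()
    return [pole for pole, cues in OPPOSITIONS.items() if any(c in s for c in cues)]

print(tag_oppositions("The fund is long on gold and short on t-bills"))
# ['long', 'short']
print(tag_oppositions("The CEO said the division is growing faster than expected"))
# ['expanding']
```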

Since I am speaking at the Sentiment Analysis Symposium next week, I have had sentiment analysis on my mind, as you might imagine. What I find interesting is that, like so many other areas of natural language processing technology, it tends to have its own little niche of practitioners who are completely shut off from the other communities under the NLP umbrella. Very few who associate themselves with it have given much thought to the interaction of SA with information retrieval, machine translation or even document classification, of which it is a proper part. The latter is especially surprising to me, since considering the semantic nature of the three (aha – or should there be two??) “traditional” sentiment classes – positive, negative and neutral – raises some important issues in the general semantics of “opposition”. Let’s start with the paradox of having more than two sentiment classes. The effect of that idea is to move sentiment analysis out of the semantic “bucket” of polarity altogether. Is that something that makes sense for usability and information quality? Does it open the door to making sentiment gradable in general? What would that mean? One thing it would mean is that any hope of alignment with human judgments – already shaky – would be gone. It would also hurt usability by virtue of the weak semantic substance in the (theoretically infinite) number of sentiment classes. Each vendor of the technology could have a different proprietary scale, making product comparisons impossible as an added distraction.
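To see the comparability problem in miniature, imagine two hypothetical vendors with different proprietary scales. Even a naive normalization onto a common interval does not tell you the scores mean the same thing – a small illustration with made-up numbers:

```python
# An illustration of the scale-comparability problem: two hypothetical
# vendors with different proprietary sentiment scales.
def normalize(score, lo, hi):
    """Linearly map a vendor score from [lo, hi] onto [-1, 1]."""
    return 2 * (score - lo) / (hi - lo) - 1

# Vendor A uses five classes 1..5; vendor B uses a 0..100 score.
print(normalize(4, 1, 5))     # 0.5
print(normalize(75, 0, 100))  # 0.5 -- numerically equal, but nothing
# guarantees the two vendors' judgments are semantically comparable
```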

OK, well, maybe I’m getting a bit extreme… and we haven’t seen sentiment scales above three coming out in products. But, on the other hand, is dividing the world of thought up by applying binary sentiment to popular opinion a reasonable classification goal as an alternative? Or is that too limiting? I actually believe that opening sentiment analysis up to explore the greater world of semantic “opposites” is the way to push the technology into a future of greater usability and profit. I suppose we’ll see what people think when I float that idea at my talk…

My last post left off by asking readers to play 20 Questions using people as the intended objects and then, reflecting on how that unfolded, to read about the Frame Problem – a much-discussed and debated issue in both computer science and contemporary philosophy.

Before I get into what I believe to be the applications of the Frame Problem to today’s search technology paradigm, I will go back to the thread of “properties” to which I promised you I would return.

Remember that the “properties” of George Bush we discussed – properties such as “IS_FUNNY” and “IS_FORMER_PRESIDENT_OF_U.S.” – were things the search engine did not understand and could not use to help the user find more “useful” results, despite finding results that were, technically, “relevant” to “George Bush”.
To show the importance of properties in general information retrieval (and now I am going far beyond just search technology), try playing 20 Questions again as if you were a typical search engine. Someone starts the game with a person in mind. You would be tempted to say something like “Is this person in the news?” or “Is this person female?” But things like “HAS_GENDER” and “IS_FAMOUS” are properties, aren’t they? So you can’t do that. If you were a search engine, all you could do is blindly throw out contexts where you had encountered a “person” in the past – definitions, lists of synonyms, etc. You could only distinguish on the basis of frequency (or, more precisely, features) of occurrence. Now, you are never going to get anywhere in 20 Questions this way, are you? And this is why search engines that can’t distinguish properties don’t get you useful results, even though what they produce may be relevant or “popular”.
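Here is a small sketch of the contrast I have in mind: a property-aware player can filter the candidate set with each question, while the frequency-only “search engine” player has nothing to filter on. The candidate set is, of course, hypothetical.

```python
# A sketch of property-aware 20 Questions: each question filters the
# candidate set by a property value. The candidates are hypothetical.
candidates = {
    "person_a": {"HAS_GENDER": "female", "IS_FAMOUS": True},
    "person_b": {"HAS_GENDER": "male", "IS_FAMOUS": True},
    "person_c": {"HAS_GENDER": "male", "IS_FAMOUS": False},
}

def ask(remaining, prop, value):
    """Keep only the candidates whose property matches the answer."""
    return {name: props for name, props in remaining.items()
            if props.get(prop) == value}

remaining = ask(candidates, "HAS_GENDER", "male")  # "Is this person male?"
remaining = ask(remaining, "IS_FAMOUS", True)      # "Is this person famous?"
print(list(remaining))  # ['person_b']
```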

All of this ties in with the notion of the Frame Problem. This problem, as I mentioned before, is a long-discussed and disputed problem in artificial intelligence and philosophy. But really it is very relevant not just to search technology but to the very activity of search in general – the idea of task completion, really. So, your “task” in 20 Questions is to guess the identity of a person, place or thing within a certain number of tries, and to complete this task as efficiently as possible (and “win” the “game”) you must have a strategy. The importance of a “strategy” in completing any task – from supplying search engine users with good results to winning 20 Questions – cannot be overstated. In fact, if you read Daniel C. Dennett’s seminal work on the Frame Problem (see Dennett, D.C. 1984. Cognitive wheels: The frame problem in artificial intelligence. In Hookway, Minds, Machines and Evolution, 129–151), you will quickly learn how much knowledge is required just to make a turkey sandwich! The Frame Problem is really about “framing” the knowledge required for task completion so that it involves neither too much nor too little data. There are all kinds of data points a human being processes when making a turkey sandwich, but only a subset of them are relevant to completing the task – for example, you maintain the knowledge that refrigerators keep things cold, but you don’t really need to draw on that knowledge to make your sandwich, do you? So effective task completion involves not just knowing how to do something but using the right knowledge at the right time.
I will leave off with a Google search for Toyota – which has at least three possible referents: an organization, a product manufactured by that organization, and a place. Google is able to separate genre pretty well – that is, it has news separated from wiki pages, separated from Twitter feeds. So while genre recognition is indeed getting closer to notions of “utility” and salient contextual knowledge in our search technology, it still falls short of truly recognizing properties of entities.

More next time… until then, check out Dennett 1984, and this time think of how to program a robot to be good at 20 Questions!

One of my favorite topics – and I hope one that I have not beaten to death quite yet – is the difference between “objects” and “referents” and the problem it poses for search engines – a problem that current search technology is not set up to recognize, and one to which the idea of “relevance” does not apply. To summarize its effects: when I put the name of an individual into a search box, I might get a wiki page – which will single out a particular individual in the world and tell me “who” that individual “is” – or I will get pages about any individual with that same name in all sorts of different contexts. These sorts of results may be relevant, but the point I made was that they often fail to be “useful”. So this raises the question of what would be useful. Here is where I stopped, by suggesting that solving the “object”/“referent” distinction may not really help because of the many “properties” attributed to individuals. These “properties”, amounting in most cases to an individual’s “role” in all kinds of situations, are actually the beginning of what the search engine needs in order to distinguish one individual from another. So even if we were to strip out all data (such as wiki pages) that supplies definition-style information, and were left with all-and-only pages relevant to the string “g-e-o-r-g-e b-u-s-h”, we still would not find that each equally relevant page was equally useful!

OK, now, some readers who are emotionally attached to physical being will have to risk joining me in the parallel universe of the search engine for a moment. What I am saying is that these “properties” need to be recognized in order to satisfy both utility and relevance at the same time. Why? Let’s give the “George Bush” results a closer look. As an individual, Mr. Bush has many properties. He is a former president of the United States and a member of the Republican Party. He also, evidently, is “funny”. This latter property, “IS_FUNNY”, turns out to be quite a salient attribute to the search engine. Of course, the search engine doesn’t know that the properties “IS_FUNNY”, “IS_MEMBER_OF_THE_REPUBLICAN_PARTY” and “IS_FORMER_PRESIDENT_OF_U.S.” apply to the same individual in the world (technically it does not know what a property is, but I’ll save that subject for another post). That may not ever really be knowable or truly important for usability. What is really important is that it also does not know that these properties are different, which partially explains why it can’t determine the different roles of an individual. Knowing, for example, that the latter two properties are relevant to a POLITICIAN would be useful, wouldn’t it? Wouldn’t it be even better to see results separating George Bush “the Politician” from George Bush “the Comedian”?
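As a sketch of what knowing that properties are different might buy you, imagine a toy lexicon that sorts properties into roles. The mapping below is an illustrative assumption on my part, not a claim about any real system.

```python
# A toy sketch: sorting an individual's properties into roles so that
# results could be grouped by role. The role lexicon is an assumption.
ROLE_OF_PROPERTY = {
    "IS_FORMER_PRESIDENT_OF_U.S.": "POLITICIAN",
    "IS_MEMBER_OF_THE_REPUBLICAN_PARTY": "POLITICIAN",
    "IS_FUNNY": "COMEDIAN",
}

george_bush_properties = [
    "IS_FORMER_PRESIDENT_OF_U.S.",
    "IS_MEMBER_OF_THE_REPUBLICAN_PARTY",
    "IS_FUNNY",
]

roles = {ROLE_OF_PROPERTY[p] for p in george_bush_properties}
print(roles)  # {'POLITICIAN', 'COMEDIAN'} -- one referent, two result groupings
```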

Well, the good news is that this is actually possible with some of the new techniques being used in today’s information retrieval products. And it is being put into practice. There are, however, some daunting challenges inherent in such efforts, and they don’t necessarily lie in the details of implementing latent semantic indexing. In fact, you can find them easily by playing 20 Questions. Go ahead. Try playing 20 Questions when the object you are thinking of is a person instead of a thing or place. My guess is that those who win this game are very good at narrowing down and identifying salient properties of people. I encourage all readers not only to try this but, once you have, to read about the Frame Problem in the Stanford Encyclopedia of Philosophy: http://plato.stanford.edu/entries/frame-problem/
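Since latent semantic indexing came up, here is a minimal sketch of it using tf-idf plus truncated SVD via scikit-learn. The documents are hypothetical “George Bush” results; the hope is that politician-flavored and comedian-flavored pages land near each other in the latent space.

```python
# A minimal latent-semantic-indexing sketch: tf-idf followed by truncated
# SVD. The four documents are hypothetical search results.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "george bush former president republican party white house",
    "george bush stand-up comedian funny impressions on stage",
    "george bush administration policy politics washington",
    "george bush comedy routine funny jokes audience laughs",
]

tfidf = TfidfVectorizer().fit_transform(docs)
# Project into a 2-dimensional latent space; documents sharing topics
# (politics vs. comedy) should end up with similar coordinates.
lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
print(lsi.round(2))
```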

The reason that 20 Questions is a good test is that it was originally conceived to find the information necessary to identify an arbitrary object. The game suggests that this information approximates a limit of 20 bits – under the assumption that each question allows the questioner to eliminate half the objects in his information universe – allowing the questioner to distinguish among 2^20, or 1,048,576, subjects. The best strategy for 20 Questions, of course, is to ask the type of questions that do in fact split the information field in half. Not so simple. And even more complex when trying to split data relevant to people this way.
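The arithmetic is easy to check: if every question splits the remaining candidates exactly in half, n questions distinguish 2^n objects.

```python
# The arithmetic behind the "20 bits" observation: n perfectly halving
# questions distinguish 2**n objects.
import math

print(2 ** 20)               # 1048576 objects distinguishable in 20 questions
print(math.log2(1_048_576))  # 20.0 bits needed to single one out
```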

What I would like to explore – and have you help me explore – in my next post is why some searches are so much more difficult than others. It may not be just about the sheer number of properties associated with any given entity…..it may be about finding and articulating the properties. Some properties just seem to be easier to “pin down” than others.
More next time……..