Opinion polls have had an air of mystery about them. In this special article, Rajeeva Karandikar, renowned statistician and Director, Chennai Mathematical Institute, explains the science behind opinion polls, and asserts that when done correctly, they can provide a near-accurate estimation of the mood of voters. Ahead of the general elections later this year, Dr. Karandikar explains his methodology and estimates that the Bharatiya Janata Party (BJP) is way ahead of the other parties, and about 100 seats ahead of the Congress. On the other hand, the BJP, along with its current allies, is about 50 short of the majority mark.
In the last few months, there has been a lot of discussion about opinion polls. Various television channels have been conducting nationwide opinion polls since last summer and the overwhelming message seems to be that the United Progressive Alliance (UPA) government led by Dr. Manmohan Singh is on the back foot and likely to lose, while the Bharatiya Janata Party (BJP) under the leadership of Mr. Narendra Modi is on the upswing. As usual, some people have questioned the scientific basis of opinion polls. Indeed, each time we disclose our findings based on an opinion poll, someone or the other raises this question. Typically, it is representatives of the party projected to lose who question the surveys.
This article is divided into three parts. In the first part, I will explain the scientific basis of opinion polls and I will also discuss the power as well as the limitations of opinion polls and comment on the demand to ban the polls.
In the second part, I will give our broad methodology and also our track record. By ‘our’, I mean the Centre for the Study of Developing Societies (CSDS), television channel CNN-IBN, and myself. Since November 2005, we have worked together. CNN-IBN commissioned the survey (with different print partners at different times), CSDS conducted the poll through its associates and I translated the vote data collected by CSDS into seat projections for the Lok Sabha (Lower House of Parliament) or Vidhan Sabha (State Assembly).
In the third part, I will present our findings on the mood of the nation as seen by us on the basis of our tracker poll conducted in January 2014.
How can the opinions of 30,000 people give any insight into the way the country, with over 800 million voters, thinks?
Here I will try to explain the scientific basis of an opinion poll with minimal use of technical terms. All I hope to convince the reader of is this: provided a proper methodology is followed, an opinion poll based on a sample of, say, 25,000 respondents can yield surprisingly good results, even in a country like ours with over 80 crore voters.
A professor walks into a class with a box containing cards that are identical in all aspects except for their colours, red or blue. The professor tells the students that 990 of the cards are of one colour while only 10 are of the other colour. She calls upon a student to come forward, mix the cards in the box by shaking it (without opening the box), and then pull out one card from the box without looking inside. All students can see that the colour of the card is red. The professor then asks the students to guess the colour that has 990 cards. She then tells the students that those who can guess the colour correctly will get three extra days to submit the assignment that is due the next day with no penalty for a wrong guess. Readers can put themselves in the place of a student and think, as to what would be their guess? I suppose the answer is obvious—red is the dominant colour. Those who have not studied probability theory would answer this out of common sense and those who have studied probability and statistics would have a justification of their own. However arrived, the answer is same—the observed colour must be the dominant colour.
Consider a boys’ school where the principal wishes to ascertain whether cricket or football is more popular among the students. Let us suppose that the number of students in the school is 10,000 and the principal thinks that she has enough time to talk to about 100 students. She has a box containing slips of paper, each having the name of one student. After mixing, she repeatedly draws one slip from the box and does this 101 times, thus generating a list of 101 students, and talks to these students to ascertain their opinion.
Suppose one were to make all possible lists of 101 students and in each list the opinion of the majority of students in that list is recorded. The following table (Table 1) shows the percentage of lists that have cricket as the majority opinion: the column header is the overall proportion of students who prefer cricket and the row header is the number of students in each list. Note here that we have considered all possible lists.
We see that the majority opinion on the chosen list would very likely reflect the dominant opinion in the school if the popularity of the sport exceeds 60%. In this case, the principal would be wrong with only 2% probability—if cricket is the choice of 60% (or more) then it is the dominant choice on 98% lists (of size 101) and if football is the choice of 60% (so cricket is the choice of only 40%), only in 2% of lists it shows up as the preferred choice. And if, instead, she generates a list of 201 students and goes with the majority opinion, she would be wrong only in 0.2% cases. Now, if instead of generating the lists by drawing sequentially names from the box containing slips of paper with names of students, what if the principal just steps out of her office during the lunch break and walks into the corridor and talks to the first 201 students she meets? Can we be sure that “go with the majority” strategy would only have a 0.2% error? Not quite. What if, at that time, a cricket T20 match is going on? Many cricket fans would be checking updates on their mobiles instead of going out of the class during the lunch break. Or she could meet a group of students who are debating the tactics of the captain of the Indian cricket team, changing the topic to global warming as they see her approach them! In either case, the error could be much higher.
Now let us consider another scenario. A political party has two claimants for a Lok Sabha constituency, say Sethumani and Perumal. Suppose the constituency has 5 lakh voters. Let us imagine that we have generated all possible lists with the names of 501 voters from the constituency and each list is written on a lottery ticket. The ticket is coloured red if 251 or more voters on that list prefer Sethumani over Perumal and the ticket is coloured blue if 251 or more voters on that list prefer Perumal over Sethumani. Suppose all such lists are written out on lottery tickets. The following table (Table 2) gives the proportion of red tickets. The column header is the percentage of support for Sethumani and the row header is the number of voters on each list.
Let us assume that there is a gap of at least 10 percentage points in the support level of the two candidates, i.e. the winner is getting at least 55% votes. If the party makes lists of size 501, Table 2 shows that 98.8% of lists will be red if the voters prefer Sethumani and only 1.2% lists will be red if over 55% voters prefer Perumal. This is just a question of counting and is purely mathematical—no element of probability or statistics enters here. Thus if all possible lists are written and mixed and one list is drawn, talking to the voters on the list will give the party a clear idea as to who is the preferred candidate.
Even if the winning candidate has support of 53% voters, if the party makes lists of 1,501 voters, chooses one list (as described above) and talks to voters on the list, with 99% accuracy, the party can figure out who has more support.
What if the constituency has 10 lakh voters? Naive readers would think that to get same level of accuracy, the list size should be doubled too. However, an interesting fact is that if we were to count lists with total population size 10,00,000 instead of 5,00,000, the resulting table does not change.
In the next table (Table 3), we consider a population where the gap between the support of the two candidates is 4 percentage points, with the winner getting 52% support. This time the column headers are population size and row headers are sample size and the entries are the percentage of samples that identify the winner correctly.
We can see that there is very small change as the population size increases. Indeed, the columns corresponding to population size 50,000 and 50,00,000 are essentially the same, and the only difference is in the decimal place, maximum change being 0.3. Thus a list of size 4,001 out of 1,00,000 population yields the same level of accuracy as out of 50,00,000 population size—about 99.4%.
The list is what is known as a sample, the list size is sample size and choosing one list as described above is what is known as choosing a sample via random sampling.
At the bottom of this article, I have included a computer program written in Python for computing these numbers. You either have to believe me or have a mathematical expert confirm the accuracy of the program and then run it on a computer with Python installed. A reader can change the population size, the sample size and the support level for the candidate of interest to get the percentage of samples that show the candidate of interest to be the winner.
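For readers who want to experiment right away, here is a minimal sketch of such a computation. It is not the program referred to above; the function names are my own, and it assumes a two-candidate contest in which a list counts for a candidate when a strict majority of voters on the list prefer that candidate. It counts lists exactly (the hypergeometric distribution), using log-gamma to keep the very large binomial coefficients manageable.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def share_of_majority_lists(population, sample, support):
    """Fraction of all possible lists of size `sample` (drawn from `population`
    voters) on which the candidate with overall support `support` is preferred
    by a strict majority of the voters on the list."""
    supporters = round(population * support)
    others = population - supporters
    needed = sample // 2 + 1                # smallest strict majority
    log_total = log_comb(population, sample)
    lowest = max(needed, sample - others)   # cannot list more "others" than exist
    highest = min(sample, supporters)
    share = 0.0
    for k in range(lowest, highest + 1):
        log_lists = log_comb(supporters, k) + log_comb(others, sample - k)
        share += math.exp(log_lists - log_total)
    return share


# The school example: 10,000 students, 60% prefer cricket.
for size in (101, 201):
    print(size, round(100 * share_of_majority_lists(10_000, size, 0.60), 1))

# The constituency example: doubling the population barely changes the answer.
for voters in (500_000, 1_000_000):
    print(voters, round(100 * share_of_majority_lists(voters, 501, 0.55), 1))
```

Changing the three arguments reproduces the kind of entries described for Tables 1 to 3; the printed percentages should be compared with the figures quoted in the text rather than taken as the published tables.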
Now suppose there are seven candidates. We again prepare all possible lists of 2,501 voters. On each ticket, we write down the percent of votes out of voters in that list that each candidate receives. Now if we mix the tickets well and then draw one ticket, then we can be confident that the true vote percentage in the whole constituency for a given candidate is within 2.5% of the vote percentage written on that ticket.
Essentially this is what we do when we conduct an opinion poll. In reality we do not create these tickets, but select a group of 2,501 voters, and ascertain the opinion of this group, henceforth called a sample. It is the vote percentage in this chosen sample that we report. The crucial thing is that our choice should be as if we have written all possible lists and put them in a box, mixed them and then drew one. This is what is called random sampling.
Colloquially, most people think that random is the same as arbitrary. This is the sense in which it is used in RAM (Random Access Memory). In a statistical context, random sampling refers to the methodology of choosing the sample rather than to the chosen sample.
So the essence of what has been written above is that if a random sample of size 2,501 is selected, then with 99% probability, the proportion of supporters of a candidate in the population and the chosen sample differ by at most 2.5%. This number of 2.5% for a sample size of 2,501 is called a sampling error. Thus if 34% of voters in my sample (of size 2,501) prefer Mr. X, I can be confident that between 31.5% and 36.5% voters in the population (from where the sample was drawn) prefer Mr. X.
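The size of the sampling error can be checked with the usual normal approximation for a proportion under simple random sampling. The sketch below is my own illustration under that assumption (the function name and defaults are hypothetical, not part of the methodology described here). For a sample of 2,501 it gives a 99% margin of roughly 2.6 percentage points in the worst case of 50% support and roughly 2.4 points at 34% support, in line with the figure of about 2.5% used above.

```python
from statistics import NormalDist


def sampling_error(sample_size, share=0.5, confidence=0.99):
    """Approximate margin of error for an estimated proportion.

    Assumes simple random sampling and the normal approximation;
    share=0.5 is the worst case (largest margin)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided critical value
    return z * (share * (1 - share) / sample_size) ** 0.5


print(round(100 * sampling_error(2501), 2))         # worst case, in percentage points
print(round(100 * sampling_error(2501, 0.34), 2))   # at 34% support
```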
Suppose we have access to a list of all telephone numbers in use in a constituency. We can use a computer program to generate a list of 2,501 randomly generated phone numbers from this list. We can then call these numbers and ascertain the view of the owner. In this case we could estimate the opinion of the group of people who have phones. In this method, richer, urban, educated class will be over represented and our estimate could be off the mark. Thus, the most important ingredient in the opinion poll is the methodology of sample selection. Unless this is done properly, there is no statistical guarantee that the estimate would fall within the error margin of 2.5% (for the sample size of 2,501). In the United States, most opinion polls are done using randomly drawn telephone numbers and seem to work fine since the telephone penetration is almost universal. Of course, way back in 1948, such surveys gave an incorrect verdict—that Truman would lose the election—which he won hands down. At that time, the poorer rural section of the United States was under-represented.
Thus, a good survey can yield surprisingly good estimates of vote shares of main political parties across a nation or a State or a constituency. One needs a reasonable sample size at the level that one needs the vote estimate. And one needs to follow sampling methodology strictly.
Limitations of opinion poll based projections
In the previous section I have talked about the power of opinion polls. Now let me come to their limitations. The public interest is in the number of seats leading parties would get and not the vote shares, as it is the seats that determine who forms the government. Of course, if one could get a sample of 4,000 in every constituency, one could predict the seats fairly accurately. However, this is impractical: across 543 constituencies it would require a sample of over 21 lakh, which would cost a lot and would need a large, trained and reliable workforce that we do not have. So one can only get vote shares at the State level or for a cluster of 10-12 constituencies. Going from there to predicting seats requires a mathematical model. I will explain how we handle this later. But this step brings in the possibility of an error over and above the sampling error, which I will call modelling error.
The other limitation of a survey done well ahead of actual polling day is that though the survey measures the opinion of the whole population, what really counts is the group that actually goes and votes. And it is known that the propensity to vote is much lower among the urban, upper middle class and upper class, college educated, high income groups. The other factor is that in India, voting intentions can undergo massive swings as voting day approaches. These two factors mean that the predictive power of any opinion poll done weeks ahead of the poll is limited. All it can measure is the mood of the nation at the time the poll was conducted.
I have written about the possibility of change in voting intentions as the voting day approaches. We had seen such a sea change in 1998, when we had done a pre-election opinion poll for India Today ahead of the first phase and then after each phase we did a day-after poll targeting the same respondents and we found that during the interim period (8 days for about 180 constituencies, 17 days for another 180 and 26 days for the remaining constituencies), about 30% respondents had changed their voting intention. While the BJP had gained during the period, there had been an overall churn, the largest being in Tamil Nadu, perhaps due to the blast in Coimbatore.
On the demand to ban opinion polls
Several things are being said in support of the demand to ban opinion polls—the main ones being:
1. That the polls are not scientific as they are based on the opinions of a very small fraction of voters.
2. That voters are influenced by the polls and thus it is possible for psephologists and the media to manipulate public opinion using opinion polls.
I have already answered the first one. It is scientific and if done properly, it can lead to an insight into the voters’ minds.
As for opinion polls influencing voters, I do feel that it probably does, and I have data which makes me think so. In every poll that we do, we ask a question as to whom the respondent voted for in the previous election (previous Lok Sabha election / Vidhan Sabha election depending upon the current election) and we have found that invariably, the recall for whoever won the last election is much higher, even when our estimate for the current round turns out to be very good. Thus in 2011 [the State elections to the West Bengal Assembly], over 70% respondents recalled having voted for the Left front in 2006 (whose actual vote share was about 50%), even though they were then going to vote for the Trinamool Congress! This we have observed across the country and over several years. This phenomenon points to a tendency of voters to identify with the winner. Thus there could be a tendency to vote with whoever is being projected to win. But there is also another effect, complacency. Some do believe that people were so convinced in 2004 that the National Democratic Alliance (NDA) would win, that several supporters simply did not bother to go out and vote! This may or may not be true. Though I do feel that polls have an influence, I don’t believe there is any reason to ban them. The views of experts on numerous TV channels, news about the campaign, editorials in newspapers also influences the voters. Can we ban all this? And if the demand for a ban emanates from the apprehension that opinion polls can be manipulated, then the same applies to expert opinions, reports and editorials. A political expert would be allowed to come on air and go on and on as to why such and such social group would vote for a specific party. And a reporter would be allowed to talk about what people coming out of a political rally were saying (perhaps based, in turn, on the correspondent’s conversations with five to ten persons coming out of the rally). 
An editor will give his take on how the mood of the nation is. As we do not ban these, why then, should opinion polls be banned?
If such a ban is imposed, I wonder if a reporter would be prevented from writing about polls done privately. For example, highly reliable sources may tell her that a private poll commissioned by a corporate house says that the CKQ party is winning the polls hands down. Betting is illegal in India and yet newspapers write about the odds being offered in the betting market for an outcome of a cricket match without giving names and addresses of the bookies. Betting on the outcome of an election is a huge market too and mainstream newspapers have reported on the odds being offered on a candidate winning a seat or a party getting a majority in a State or at the centre. Would the law ban newspapers from writing about this? And can such a ban be enforced? Law can prevent Indian news channels from airing this as they are governed by stiff laws on the uplinking of newsfeed. But what is to prevent, say, the BBC or CNN from airing the findings of such a survey?
Of course a question remains: why have opinion polls at all? The answer lies beyond predictions of who is winning or losing. Only an opinion poll can tell us why people voted the way they did. Indeed, CSDS has been doing such polls to understand the mind of the voter. This alone can give social scientists an insight into the issues that are important and into how different socio-economic groups are voting. For example, such polls have shown that the nuclear treaty (the issue on which the Left front withdrew support to the UPA government in 2008-09) was a non-issue for the general public, and that about 5% to 7% of Muslims have been voting for the BJP over the last two decades.
I think that an outright ban by law is a bad idea. However, I think that all media should come together and agree on some norms such as disclosure of methodology and certain other transparency measures. Professor Yogendra Yadav, a political scientist, psephologist and now a politician, has proposed such measures in an article written some time ago. Perhaps the Press Council of India can take a lead role in this.
How do we handle sampling and vote-to-seat conversion?
People have often asked me why I need to develop a new methodology, why not simply use the methodology used in U.S. or the U.K., proven over the years. Well, the U.S. electoral system is completely different, the main interest being the Presidential race, with the winner-take-all at the State level. Of course the Indian system is close to the one in U.K., indeed, derived from the system in U.K. When I first got involved with this exercise, for India Today and Doordarshan [India’s state broadcaster] for the 1998 poll, Professor Clive Payne, a statistician and a renowned psephologist with over two decades of experience of conducting and analysing opinion polls in the U.K. for the BBC, had been flown in as a consultant. We spent two days in the Delhi winter, walking in Lodhi Gardens and exchanging notes. He explained to me the political scene in the U.K., and the methodology used by him in the U.K. context. He then went on to ask me literally hundreds of questions about the Indian system, availability of data etc. At the end of two days, we came to the conclusion that the methodology used by him cannot be used in the Indian context and that we would have to start from scratch. There were several reasons for reaching this conclusion. I will list two. In the U.K., they have good data on the socio-economic profile of each locality and constituency even at the level of polling stations and this is a crucial input in the model used by Prof. Payne.
On the other hand, in India, the census data is available at district level but there is no matching between districts and constituencies, with each constituency spreading across two or three districts. Thus the socio-economic profiles of constituencies are, at best, a reasonable guess based on census data, and booth-level profiles are simply not there. The bigger reason is to do with differences in voter behaviour in India and in the U.K. According to Prof. Payne, the voting intentions in the U.K. are fairly stable across time with a large proportion of voters not changing their vote in a five-year time horizon. Thus most experts would be able to pin down for large number of seats as to who will win (and perhaps, in private, the parties would also agree). These are what Prof. Payne called “safe seats”. Thus his methodology focuses on the remaining seats and the model uses past data and socio-economic profile of polling booths and constituency.
As I have remarked earlier, in India there is a big churn even in the few weeks leading to a poll and thus a potential for big change from one election to another. This means that very few seats (if any) are safe seats (in the sense described above) and almost all seats are up for grabs. Thus we (Prof. Clive Payne, Prof. Yogendra Yadav and I) agreed that we would have to work out our own methodology.
As explained above, using statistical techniques, one can get a fairly good estimate of the percentage of votes of the major parties in the country (or in a State) at the time the survey is conducted. However, the public interest is in prediction of number of seats and not percentage of votes for parties and thus the media is also interested in seat projection.
It is possible (though very unlikely), even in a two-party system, for a party ‘A’ with say 25.5% [of the vote share] to win 272 (out of 543) seats (simple majority) while the other party 'B' with 74.5% votes to win only 271 seats (‘A’ gets just over 50% votes in 272 seats winning them, while ‘B’ gets 100% votes in the remaining 271 seats). Thus, a good estimate of vote percentages does not automatically translate to a good estimate of the number of seats for major parties.
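The arithmetic behind this extreme case is easy to verify. The sketch below assumes, purely for illustration, that all 543 constituencies have the same number of voters and that 'A' polls 50.9% in each of the 272 seats it wins and nothing elsewhere; both figures are my assumptions, chosen so that the national share comes out near the 25.5% mentioned above.

```python
def national_vote_share(seats_won, total_seats, share_in_won_seats,
                        share_in_lost_seats=0.0):
    """National vote share of a party, assuming equal-sized constituencies."""
    won = seats_won * share_in_won_seats
    lost = (total_seats - seats_won) * share_in_lost_seats
    return (won + lost) / total_seats


share_a = national_vote_share(272, 543, 0.509)   # hypothetical 50.9% in seats won
share_b = 1.0 - share_a                          # two-party system
print(round(100 * share_a, 1), round(100 * share_b, 1))
```

Party 'A' ends up with a simple majority of seats on roughly a quarter of the national vote, which is why vote shares alone cannot be read as seat counts.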
So in order to predict the number of seats for parties, we need to estimate not only the percentage of votes for each party, but also the distribution of votes of each of the parties across constituencies. And here, independents and smaller parties that have influence across a few seats make the vote-to-seat translation that much more difficult. As I have said earlier, the possibility of conducting a poll with sample size 4,000 in each of the 543 constituencies is out of the question. So what can be done?
One way out is to construct a model of voter behaviour. While such a model can be built, estimating various parameters of such a model would itself require a very large sample size. Another approach is to use past data in conjunction with the opinion poll data. In order to do this, we need to build a suitable model of voting behaviour—not of individual voters but for percentage votes for a party in a constituency.
To make a model, let us observe some features of Indian democracy. Voting intentions in India are volatile—in a matter of months they can undergo a big change. An example: Delhi in March 1998 Lok Sabha (a BJP victory), November 1998 Vidhan Sabha (the Congress wins hands down), October 1999 Lok Sabha (once again the BJP wins).
Let us note that while the behaviour of voters in a constituency may be correlated with that in adjacent constituencies in the same State, the voting behaviour in one State has no correlation with that in another State. The behaviour is influenced by many local factors including the political history of a State. Take the recent case of adjacent constituencies on the Tamil Nadu-Karnataka border or the Uttar Pradesh (U.P.) - Bihar border. One can see that voting patterns are very different across various elections.
Socio-economic factors do influence the voting patterns significantly. However, incorporating them directly in a model would require too many parameters. It is reasonable to assume that the socio-economic profile of most of the constituencies does not change significantly from one election to the next. So while the differences in socio-economic profiles between two constituencies are reflected in the differences in voting pattern in a given election, the change from one election to the next in a given constituency does not depend on them.
So we make an assumption that the change in the percentage of votes for a given party from the previous election to the present is constant across a given State. The change in the percentage of votes is called a swing. Under this model, all we need to do via sampling is to estimate the swing for each (major) party in each State. Then, applying the swing to the past data, we will have an estimate of the percentage of votes for each party in each constituency. Here we can refine this a little: we can divide the big States into geographic regions and postulate that the swing in a seat is a convex combination of the swing across the State and the swing across the region. The resulting model is not very accurate if we look at historical data, but it is a reasonably good approximation for predicting the seats for major parties at the national level. I will come back to this point later.
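In code, the swing model amounts to very little. The following is a minimal sketch with hypothetical names; in particular the weight given to the State swing (here 0.5 by default) is an assumption of mine, since the weights actually used are not stated, and the clipping to the range 0 to 1 is added only to keep the illustration well behaved.

```python
def predicted_share(previous_share, state_swing, region_swing, state_weight=0.5):
    """Predicted vote share of a party in one constituency.

    The swing applied is a convex combination of the State-level and
    region-level swings; `state_weight` must lie between 0 and 1."""
    if not 0.0 <= state_weight <= 1.0:
        raise ValueError("state_weight must be between 0 and 1")
    swing = state_weight * state_swing + (1.0 - state_weight) * region_swing
    return min(1.0, max(0.0, previous_share + swing))


# A party that had 40% in a seat, with a +4 point State swing and +2 point regional swing.
print(round(predicted_share(0.40, 0.04, 0.02), 4))
```

Setting `state_weight` to 1 recovers the simpler model in which the swing is constant across the whole State.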
Working with this model, we need to devise a sampling scheme to estimate vote shares across States and geographical regions. The model would then yield estimated vote shares of major parties in each constituency and then we need a technique for translating these approximate estimates of vote shares to seats for major parties. I will now describe each of these steps.
We first decide upon the total number of constituencies to be sampled. We have usually picked anywhere between 100 and 280 constituencies for sampling. For the same overall sample size, doing just 100 has its advantages while doing 280 too has some advantages.
The parliamentary constituencies are listed by the Election Commission in such a way that constituencies in a State form a cluster. Thus, we choose to pick constituencies via circular random sampling or systematic sampling. This ensures that each State has almost proportional representation. Next, we get the list of polling booths in each of the chosen constituencies. Here we have seen that booths in each locality form a cluster, with booths from adjacent localities also mostly appearing close to each other. Likewise, in the electoral roll for each booth, one household comes together, one building or a society comes as a cluster, neighbours appear close to each other, and so on. Thus we choose, say, 6 booths in each constituency via circular sampling and then obtain the voters' lists of these booths and once again choose, say, 40 respondents in each booth via circular sampling. Over the years we have seen that samples obtained by this method (and the analogous one for State elections) seem to mimic the population on attributes for which census data is available. The scheme chosen is dependent upon the way the data is organised by the Election Commission.
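A minimal sketch of circular systematic sampling follows. The function name, the fixed seed and the booth and roll sizes are hypothetical, used only to make the illustration reproducible; the idea is simply a random starting point followed by evenly spaced picks that wrap around the end of the list, so that units listed next to each other are rarely chosen together.

```python
import random


def circular_systematic_sample(items, sample_size, rng):
    """Pick `sample_size` items at (nearly) equal spacing from a random start,
    wrapping around the end of the list."""
    n_items = len(items)
    if not 0 < sample_size <= n_items:
        raise ValueError("sample_size must be between 1 and len(items)")
    start = rng.randrange(n_items)
    return [items[(start + (i * n_items) // sample_size) % n_items]
            for i in range(sample_size)]


rng = random.Random(2014)                       # fixed seed, for reproducibility only
constituencies = list(range(1, 544))            # serial numbers 1..543
chosen = circular_systematic_sample(constituencies, 100, rng)

# Within one chosen constituency (sizes below are made up for illustration):
booths = circular_systematic_sample(list(range(1, 1501)), 6, rng)
respondents = circular_systematic_sample(list(range(1, 1001)), 40, rng)
print(sorted(chosen)[:5], booths, len(respondents))
```

Because the constituency list is ordered State by State, evenly spaced picks give each State a share of the sample roughly proportional to its number of constituencies.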
Most of the agencies that conduct opinion polls use methodologies used in market research whereby the enumerators are given a profile on 5 or 6 attributes such as gender, education, religion, caste, income and are asked to somehow identify voters so that the sample has this pre-assigned profile. The hope is that if the sample matches the population profile on these important attributes, it will match on attributes of interest as well. However, there is no statistical guarantee that this will be so, since there is no randomization.
I would stress again that in our methodology, which involves enumerators going door-to-door to get opinion of voters that have been selected via randomization as described above, we do not impose a sample profile. Indeed, our first check point after we get the data is to see if the sample profile matches the population profile (at State level) on each of the attributes on which such profiles are available. Minor deviations can be corrected via suitable multipliers while estimating vote shares, while major deviations would indicate deficiency with the whole sampling exercise or may point to possible data entry error.
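One common way to apply such multipliers is to weight each respondent by the ratio of the group's share in the population to its share in the sample. The sketch below illustrates that idea only; it is an assumption about what "suitable multipliers" could look like, with hypothetical names and made-up numbers, and not a description of the exact correction used.

```python
def weighted_vote_share(respondents, population_profile, party):
    """Vote share for `party` after reweighting the sample to the population.

    respondents: list of (group, vote) pairs, one per respondent.
    population_profile: dict mapping group -> share of that group in the population."""
    n = len(respondents)
    sample_counts = {}
    for group, _ in respondents:
        sample_counts[group] = sample_counts.get(group, 0) + 1
    # Multiplier for each group: population share divided by sample share.
    weights = {g: population_profile[g] / (count / n)
               for g, count in sample_counts.items()}
    total = sum(weights[g] for g, _ in respondents)
    for_party = sum(weights[g] for g, vote in respondents if vote == party)
    return for_party / total


# Made-up sample: urban voters are over-represented (60% of sample, 30% of population).
sample = ([("urban", "X")] * 30 + [("urban", "Y")] * 30 +
          [("rural", "X")] * 10 + [("rural", "Y")] * 30)
profile = {"urban": 0.30, "rural": 0.70}
print(round(weighted_vote_share(sample, profile, "X"), 3))
```

In this made-up sample the unweighted share for X is 40%, while the reweighted share is lower because rural respondents, who favour X less, count for more once the profile is corrected.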
The CSDS conducts the poll through Lokniti—a network of political scientists across the country and I must congratulate the CSDS-Lokniti team for conducting meticulously the surveys over the years.
Vote share to seats conversion
As I have explained above, once we have the vote shares across States and across geographical regions, we can use the past data and obtain estimates of vote shares of major parties in each constituency.
One way forward is to assume that our estimated vote share is the true vote share, declare whoever gets the maximum estimated vote share in each constituency the winner, and then count winners. However, this ignores how close each contest is. If, in a constituency, our predicted margin for the leading candidate is 4%, we will be a lot more confident about the leading candidate winning the seat than when our predicted lead is just 1%. So we translate this confidence into a probability of victory for the two leading candidates. The best case scenario for the candidate who is second is that he actually has a slender lead and yet a sample of the given size shows him trailing by the given margin.
Suppose the constituencies #114 and #117 are in a small State where the State sample is 1,600 and in constituency #114, a candidate A is projected to get 50.5% votes and candidate B is projected to get 49.5% votes, while in constituency #117, candidates C and D are projected to get 52% and 48% votes respectively. Now the best case scenario for B is that both A and B have nearly equal support and yet a sample of size 1,600 shows him to have only 49.5% votes. This is the same as seeing 792 (= 49.5% of 1,600) heads or fewer in 1,600 tosses of a fair coin. The probability of this happening can be shown to be 0.3538. Thus we assign B a probability of win of 0.3538 and A gets 1 minus 0.3538 = 0.6462. Similar calculations show that the probability of win for C (in constituency #117) is 0.9424 and for D is 0.0576.
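These probabilities can be reproduced with the normal approximation to the coin-tossing distribution, including a continuity correction. The choice of that approximation is my assumption (it matches the figures quoted above to four decimal places), and the function name is hypothetical.

```python
from statistics import NormalDist


def trailing_win_probability(sample_size, trailing_share):
    """Probability that a fair coin tossed `sample_size` times shows at most
    `trailing_share * sample_size` heads (normal approximation with
    continuity correction)."""
    heads = round(sample_size * trailing_share)
    mean = sample_size / 2
    sd = (sample_size ** 0.5) / 2
    return NormalDist(mean, sd).cdf(heads + 0.5)


p_b = trailing_win_probability(1600, 0.495)   # candidate B in constituency #114
p_d = trailing_win_probability(1600, 0.48)    # candidate D in constituency #117
print(round(p_b, 4), round(1 - p_b, 4))
print(round(p_d, 4), round(1 - p_d, 4))
```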
Finally, summing the probabilities of a win across all constituencies for a given party yields an estimate of its expected seats. The process described above for two parties can be extended to more parties. Based on our experience, we have chosen to stop at the first three. So we look at the top 3 parties in every constituency and, based on their predicted vote shares, we assign a probability of victory to each of the three (adding up to 1).
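The summation step can be sketched as follows. The party labels P and Q are hypothetical: purely for illustration, candidates A and D of the example above are taken to belong to party P, and candidates B and C to party Q.

```python
def expected_seats(win_probabilities):
    """Sum each party's probability of winning over all constituencies.

    win_probabilities: list of dicts, one per constituency, mapping
    party -> probability of winning that seat (each dict sums to 1)."""
    totals = {}
    for constituency in win_probabilities:
        for party, probability in constituency.items():
            totals[party] = totals.get(party, 0.0) + probability
    return totals


seats = expected_seats([
    {"P": 0.6462, "Q": 0.3538},   # constituency #114 (A vs B)
    {"P": 0.0576, "Q": 0.9424},   # constituency #117 (D vs C)
])
print({party: round(value, 4) for party, value in seats.items()})
```

The expected seats across parties always add up to the number of constituencies considered, here two.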
This seems to work well. Now that I have explained the full methodology, I will explain the notion of back testing in this context.
Let us do a mental exercise. Using the 2009 Lok Sabha election data, let us obtain the actual vote shares of major parties in each State and geographical region. Suppose that in February 2009 someone had, based on an opinion poll or a crystal ball, given us these exact vote shares. We can then use the 2004 actual data as the past data and use these vote shares with our model to predict vote shares in each seat and, based on that, predict the total number of seats for major parties. We have done so and found that the resulting estimate is fairly close to the true outcome of 2009. This is what is called back testing a model. If we compare the predicted vote shares in each constituency with the actual ones, we will see a lot of variation. Thus the model is rather crude at the micro level but good enough for predictions at the national level. Indeed, even the State level estimates of seats are not all that good.
Difficulties faced in implementation of this model
I have described the broad methodology used by me over the years. Now let me come to some difficulties faced over the years. We have talked about parties. But how does one deal with an alliance? Take the case of the Congress-NCP alliance in Maharashtra. The two parties were allies in 2009 and it appears that they will be allies in 2014 too, so as far as Maharashtra is concerned, we can essentially treat the UPA (Congress+NCP) as one party. However, in 2009 in Tamil Nadu, the Congress and the DMK were allies, and now they are most likely contesting separately. So, using the data from 2009, we will have to separate the votes of the Congress and the DMK. Here some political judgement comes into play. Treating the DMK as the dominant partner, we could split the alliance votes as 70% for the DMK and 30% for the Congress. Or we could assign 70% to the DMK and 30% to the Congress in seats where the candidate was from the DMK, and 60% to the DMK and 40% to the Congress where the alliance candidate was from the Congress. In this fashion, when alliances change, we use judgement to come up with a simulated history file, which then becomes the basis for applying swing. When a new party enters the fray, such as the Aam Aadmi Party in Delhi in 2013, we use expert opinion to figure out areas where the party is strong and areas where it is weak, and use this to create an artificial history vote file.
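The second splitting rule above can be written down as a short sketch. The seat names and vote counts are hypothetical, and `split_alliance_votes` is an illustrative helper, not the actual procedure used to build the history file.

```python
def split_alliance_votes(alliance_votes, candidate_party):
    """Split combined alliance votes between the DMK and the Congress.

    Uses the second rule in the text: 70/30 where the alliance candidate
    was from the DMK, 60/40 where the candidate was from the Congress.
    """
    dmk_pct = 70 if candidate_party == "DMK" else 60
    dmk_votes = alliance_votes * dmk_pct // 100  # integer arithmetic
    return {"DMK": dmk_votes, "Congress": alliance_votes - dmk_votes}

# Hypothetical 2009 alliance results for two seats.
history_2009 = [
    {"seat": "X", "candidate_party": "DMK", "alliance_votes": 400000},
    {"seat": "Y", "candidate_party": "Congress", "alliance_votes": 350000},
]

simulated_history = {
    row["seat"]: split_alliance_votes(row["alliance_votes"], row["candidate_party"])
    for row in history_2009
}
for seat, votes in simulated_history.items():
    print(seat, votes)
```

The percentages are a matter of political judgement, so in practice they would be parameters to revisit rather than fixed constants.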
Some years ago, the boundaries of constituencies were redrawn and we had to get whatever information we could and then create a simulated history file with new constituencies. Thus each time there are challenges and we do work out some reasonable fixes. The broad model remains the same.
Pre-election poll, exit poll and day-after-poll
We have mentioned that the predictive power of an opinion poll conducted several weeks before the actual voting is rather limited. There are two reasons for this. First, there could be, and often is, a difference between the opinion of the whole population and the opinion of the subset that votes in the election. Second, the opinion of voters can itself change during the weeks preceding voting day.
Exit polls seem like an answer to these distortions. Here the interviews are done as voters are exiting the polling booth so that one samples only among the voters who have voted and soon after they have voted. However, during an exit poll, it is difficult to implement our strict sampling scheme. At best we can choose the constituencies and booths where we would conduct the survey but the choice of respondents would have to be left to the enumerator since it would not be feasible to identify respondents out of a pre-selected list. So the practice is that every 10th voter coming out is targeted.
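The "every 10th voter" rule is a form of systematic sampling. A minimal sketch follows, assuming voters are simply counted in the order they leave the booth; the voter labels and the helper `every_kth` are hypothetical.

```python
def every_kth(voters, k=10):
    """Return the k-th, 2k-th, 3k-th, ... voter in order of exit."""
    return voters[k - 1::k]

# Hypothetical stream of 95 voters leaving one polling booth.
exiting = [f"voter_{i}" for i in range(1, 96)]
targeted = every_kth(exiting)
print(len(targeted), targeted[0], targeted[-1])
```

Note that this fixes only who is approached, not who agrees to respond, which is one reason the resulting sample can be less balanced than a pre-selected list.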
In our experience, exit polls did not yield a sample that was as balanced on demographic attributes as the door-to-door method. In view of this, we have generally avoided exit polls. Instead, given the practice over the last two decades of a multi-phase poll with a gap between the last phase and counting day, we have been conducting our survey during the two or three days following actual voting in the given constituency. We have been calling this a post-poll.
Thus, in the recently completed polls for the four States, on December 4, when voting took place for the Delhi Assembly election in the last phase, we went on air with our findings of the post-polls conducted in Madhya Pradesh, Chhattisgarh and Rajasthan. We conducted the post-poll survey in Delhi on December 5 and went on air with our findings for Delhi on December 6. While the post-poll includes people who have not voted, we record whether the ink mark is visible and, as far as seat prediction is concerned, we take into account only the opinion of those with the ink mark visible.
I am confident that a post-poll conducted following the methodology outlined above yields a fairly good estimate of the seats for major parties/alliances. But still it has its own limitations. It would mostly be right and sometimes off the mark.
Our track record
Beginning with October 2005, CNN-IBN, CSDS-Lokniti and I have done numerous poll projections, mostly based on post-poll surveys, occasionally based on pre-election poll surveys. Here is the listing of all such occasions, what we said and what actually happened:
So, according to my own assessment:
We were not good on 4 occasions (off the mark and others did better than us)
(i) Punjab (2007)
(ii) Gujarat (2007)
(iii) Karnataka (2008)
(iv) Gujarat (2012)
On the following 7 occasions we were good (generally on track and as good as others)
(i) Kerala (2006)
(ii) Uttarakhand (2007)
(iii) Uttar Pradesh (2007)
(iv) Lok Sabha (2009)
(v) Tamil Nadu (2011)
(vi) Himachal Pradesh (2012)
(vii) Uttarakhand (2012)
And on the following 16 occasions we were very good (estimates on the dot or close and better or as good as others)
(i) Bihar (2005)
(ii) Assam (2006)
(iii) Tamil Nadu (2006)
(iv) West Bengal (2006)
(v) Bihar (2010)
(vi) Assam (2011)
(vii) Kerala (2011)
(viii) West Bengal (2011)
(ix) Uttar Pradesh (2012)
(x) Punjab (2012)
(xi) Manipur (2012)
(xii) Karnataka (2013)
(xiii) Madhya Pradesh (2013)
(xiv) Rajasthan (2013)
(xv) Chhattisgarh (2013)
(xvi) Delhi (2013)
These results show the power as well as the limitations of opinion polls. One can get the basic story right: which party will get the largest number of seats, and whether this party (or alliance) will cross the half-way mark comfortably, be around that number, or fall well short.
Moreover, opinion polls give an insight into why people voted the way they did, for instance which issues decided their vote. The CSDS brings out detailed studies on such questions. This is the only way one can get insights into the mind of the voter.
Mood of the Nation in 2014
There have been a lot of surveys giving projections for the next general elections. We did our own survey in January 2014. Here are our findings. The BJP is way ahead of the rest and about 100 seats ahead of the Congress. On the other hand, the BJP, along with its current allies, is about 50 short of the majority mark.
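# Exact share of random samples (drawn without replacement) that show the candidate of interest as the winner
v=52 # % support for candidate of interest
psize=5000000 # population size
ssize=4001 # sample size (odd, so the sample cannot be tied)
def binomlist(N, R):
    """Return the list [C(N,0), C(N,1), ..., C(N,R)] as exact integers."""
    a=[1]
    for k in range(1, R+1): # k runs from 1 to R inclusive
        a.append((a[k-1]*(N-k+1))//k) # // for integer division
    return a
r=ssize
t=r//2 # the candidate leads the sample if at most t respondents do not support him
c=binomlist(psize*v//100, r) # ways of choosing supporters
d=binomlist(psize-psize*v//100, r) # ways of choosing non-supporters
z=sum([ c[r-k]*d[k] for k in range(0,t+1)]) # samples in which the candidate leads
total=sum([ c[r-k]*d[k] for k in range(0,r+1)]) # all possible samples
print("Population size ->> ",psize)
print("Sample size ->> ",ssize)
print("% support for candidate of interest ->> ",v)
print("% of samples with the candidate as winner ->> ",100*z/total)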