Mining the Online Mandi

“Frankly the overwhelming majority of academics have ignored the data explosion caused by the digital age. The world’s most famous linguists analyse individual texts; they largely ignore the patterns revealed in billions of books. The methodologies taught to graduate students in psychology, political science and sociology have been, for the most part untouched by the digital revolution. The broad, mostly unexplored terrain opened by the data explosion has been left to a small number of forward thinking professors, rebellious grad students and hobbyists.” This is the conclusion drawn by Seth Stephens-Davidowitz in his New York Times bestseller “Everybody Lies: What the Internet Can Tell Us About Who We Really Are”.

Many of us who work with data are used to receiving it from certain sources, structured in certain ways, so that we can process it and get down to the analysis. I am no different, but about a year ago I worked on a project that tested my creativity and forced me to step out of my comfort zone. It was not a major shift for me, but it did teach me that we have a plethora of resources at our disposal that we can exploit to tackle some pressing questions. The data that researchers have access to by way of the world wide web is immense. The approach I describe here is only illustrative and can be modified and used as needed.

I have a client who runs an agricultural marketing and distribution social enterprise. They were collaborating with a friend, a farmer working with a group of fellow farmers who were struggling with semi-drought-like conditions in the southern Indian state of Tamil Nadu. To make any kind of impact, we had to have what economists call economies of scale. As long as procurement was limited to small quantities of vegetables, supply from the farmers, who were often struggling to make ends meet, would be unreliable and often not of the best quality, with wild fluctuations in price. Stability would be possible only when the farmers were reasonably sure that a large proportion of their produce was going to be procured by my client. Aspects such as the quantities and the exact mix of vegetables, fruit and staples needed would have to be determined. To do this, my client needed to be fairly certain of the demand for their pesticide-free produce, which was procured with the well-being of the farmers in mind.

Since this organisation was at the very inception of its operations, there was very little historical data for me to work with. We collected data at the most granular level, but since it had not been collected over a sufficiently long period, it was difficult to use it to determine demand. In addition, the business was set to undergo a massive change: initially it operated only on Saturdays, but it later scaled up to six days a week. Hence, in order to fine-tune the strategy and get a sense of the demand, it would be useful to know what customers who shopped in organic stores in the city were looking for.

One avenue that is always open is a survey of customers and potential customers, which I did anyway to get a sense of what people were looking for when they visited the store. But the sample size was fairly small, and it was crucial to ensure that the findings would hold for the entire target customer segment. In my search for data on the demand for organic vegetables in the city, I came across Google Ratings data. This data is collected for the retail segment and consists of a rating given by a customer along with a comment on why they rated the store as they did. The comment is not mandatory, but for analysis purposes it would be difficult to include ratings without one, as we needed to attribute each rating to various causal factors. Another attractive aspect of this data is that people are more likely to be honest about their experience when they share it voluntarily. When surveys are administered, participants may respond in a manner that is biased towards the expectations of the enumerator.

The downloaded data has to be cleaned and prepared for the analysis phase. The ratings as they stand are not very useful for analysis: each comment has to be read and tagged with the factors it mentions. If the volume of data is very large, it becomes important to automate this using Natural Language Processing or something similar. Admittedly, this is a fairly time-consuming process that needs to be done carefully to render the data useful for the purposes for which it was collected. Procuring data from the internet very often necessitates knowledge of tools suited to handling big data, as manual processing would be fairly tedious. Hence we could say that big data can be a double-edged sword, and analyses need to be tailored accordingly.
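As a minimal sketch of what automating this step could look like, the snippet below drops ratings that come without a comment and tags each remaining review with the themes it mentions, using a plain keyword lookup. The theme names, keyword lists and sample reviews are assumptions made up for illustration, not the categories or data used in the project, and a real pipeline would probably need a proper NLP library rather than keyword matching.

```python
import re

# Hypothetical themes and keywords, chosen only for illustration.
THEME_KEYWORDS = {
    "quality": {"fresh", "quality", "stale", "rotten"},
    "range": {"range", "variety", "selection", "stock"},
    "price": {"price", "prices", "expensive", "cheap", "costly"},
    "service": {"staff", "service", "helpful", "rude"},
}


def tokenize(comment):
    """Lower-case a comment and return the set of alphabetic words in it."""
    return set(re.findall(r"[a-z]+", comment.lower()))


def tag_review(comment):
    """Return the sorted list of themes whose keywords appear in the comment."""
    words = tokenize(comment)
    return sorted(
        theme for theme, keywords in THEME_KEYWORDS.items() if words & keywords
    )


def prepare(reviews):
    """Keep only reviews that carry a comment and tag each with its themes."""
    prepared = []
    for review in reviews:
        comment = (review.get("comment") or "").strip()
        if not comment:
            continue  # a bare star rating cannot be attributed to a cause
        prepared.append({"rating": review["rating"], "themes": tag_review(comment)})
    return prepared


# Invented sample reviews, not real data.
SAMPLE_REVIEWS = [
    {"rating": 5, "comment": "Very fresh vegetables and helpful staff."},
    {"rating": 2, "comment": "Too expensive, limited variety."},
    {"rating": 4, "comment": ""},
    {"rating": 3, "comment": None},
]

for row in prepare(SAMPLE_REVIEWS):
    print(row)
```

Keyword matching is crude: it misses synonyms, misspellings and negation ("not fresh" still counts as a mention of quality), so a sample of the tags should be checked by hand before the results are relied on.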

Once we are able to segment the shoppers of organic food better, we can begin to identify their needs better. The graph shows that almost half the shoppers who posted qualified reviews (ratings accompanied by a comment) mentioned the quality of products as a factor in their consideration process. This makes sense, because people who care that their vegetables are organic and grown in good conditions, and who for the most part pay a higher price for them, would be quality-conscious. As for range, it is fairly natural that people would like to get most of their common supplies from one store. The rest of the analysis proceeds as it normally would. The findings from such studies are invaluable to clients, who can tailor their product mix or business operations to what their customers need. This feedback, transmitted to the farmers, would also help them plan their crops accordingly and be an important factor in establishing a sustainable livelihood for them.
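The share of reviews mentioning each factor, which is the kind of figure such a graph summarises, could be computed along the following lines. This is only a sketch: the tagged reviews below are invented for illustration and are not the findings of the project, and the function name `theme_shares` is hypothetical.

```python
from collections import Counter

# Hypothetical tagged reviews: each entry lists the themes one review mentioned.
TAGGED_REVIEWS = [
    ["quality", "service"],
    ["price", "range"],
    ["quality"],
    ["range"],
]


def theme_shares(tagged_reviews):
    """Return the fraction of reviews that mention each theme."""
    total = len(tagged_reviews)
    if total == 0:
        return {}
    # set() ensures a theme is counted at most once per review.
    counts = Counter(theme for themes in tagged_reviews for theme in set(themes))
    return {theme: count / total for theme, count in counts.items()}


shares = theme_shares(TAGGED_REVIEWS)
for theme, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{theme}: {share:.0%}")
```

Because one review can mention several factors, the shares need not add up to 100%.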

The world wide web is a treasure trove of great data, and in this era of big data we are limited only by our creativity in how we use it. As with the problem described above, you could also conduct extensive analyses of pricing, competitor product ranges and so on by scraping data from various sites and using it to get a sense of pricing in a fairly competitive retail environment. In a country such as India, most farmers buy and sell vegetables in markets commonly referred to as mandis. These mandis are notoriously difficult for many farmers to navigate. Intermediaries who buy the produce and transport it from the rural markets to the urban stores often pay next to nothing for it. Very often this leaves farmers struggling to break even and pushes them into heavy debt, usually at usurious rates, causing extreme financial distress for many small landholders.

The Minimum Support Price, or MSP, was the government’s answer to the farmers’ woes. More recently, however, trading vegetables on a commodities market has been touted as a superior risk mitigation mechanism. Commodity futures trading is expected to help manage price risk as well as enable price discovery. Price discovery can be managed digitally rather than through purely speculative means, allowing the farmers some semblance of financial security. Evolution to an organised futures market would do wonders for the agricultural sector, but for now even an informal agreement for future procurement of produce through "fair-trade"-like organisations could transform the livelihoods of farmers.

Mining data from the internet can be both a boon and a bane. It is potentially a very rich and diverse source of data, but its veracity and integrity have to be checked before it is analysed and critical decisions are taken on the basis of it. Still, the fact that it opens up possibilities for analyses that were hitherto impossible makes it too valuable to ignore. Where data is already available, internet data can augment traditional sources such as ERP and CRM systems and greatly expand and enhance analyses. All in all, I would say that there has never been a better time to be an analyst. What do you think? Comment below so that we can exchange ideas.

Edited by Nirmala Samuel