A culture of discovery
Greg Stevens discovers a start-up using crowdsourcing to change the way complex problems are solved.
In many industries, even a fractional improvement in the ability to predict an outcome based on past data can be worth thousands, if not millions, of dollars. In medicine, it can mean saving lives.
The need for making the best possible predictions is well understood among research institutions and businesses alike – even if the means to accomplish this goal is not. This is the need that technology start-up Kaggle is looking to fill, by pitting teams against teams in a competition to build the best predictive models.
Kaggle has put together a social network of thousands of scientists, engineers, researchers, and analysts in fields ranging from mathematics and engineering to psychology and history. These are all highly trained and intelligent people who have one thing in common: they love solving problems.
Specifically, they love solving analytical problems: problems that involve finding patterns in data. Some have expertise in a particular computational tool or mathematical technique; others have expertise in a particular subject area. But all of them are on the lookout for opportunities to apply their skills and expertise to new and challenging conundrums.
Enter the customer. The customer can be any company, research organization, or even individual who has a problem to solve. They will provide a set of data, often with hundreds of variables and possibly millions of data points, and propose a challenge: find the best model that is able to predict one of the variables (usually some result or behaviour) from all of the others.
Kaggle will then marshal teams of analysts in its social network to compete with one another to create the best model.
Currently, the International Conference on Frontiers in Handwriting Recognition is sponsoring a competition to find the best model that can predict the author of a handwritten Arabic document based on visual features of the writing itself. Boehringer Ingelheim, a pharmaceutical company, is running a competition to find the best model that predicts the behavior of a certain class of molecules based on features of those molecules (such as their size, shape, or composition).
Contests have been set up to predict who will make an insurance claim in a two-month period, who will be the winner in chess tournaments, which used cars are most likely to develop problems and even which photos in an album people are likely to find the most pleasing.
In most cases, the group or company proposing the competition already has a model of their own. The point behind the competition is to get a better one.
Because these are mathematical models, based on numerical data, it’s always possible to measure exactly how well a model is doing. As the teams work and test out different strategies and alternatives, they can measure the performance of the model that they are working on and compare it to models that have already been established.
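To make the idea of measuring a model concrete, here is a minimal sketch in Python. It assumes the error measure is root-mean-square error on a set of held-out outcomes; the article does not say which measure any particular competition uses, and all the numbers and variable names below are hypothetical.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between predictions and true outcomes (lower is better)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical held-out outcomes and two sets of predictions for them.
actual = [3.0, 5.0, 2.0, 8.0]
baseline_predictions = [4.0, 4.0, 4.0, 4.0]    # stands in for the sponsor's existing model
challenger_predictions = [3.0, 5.0, 3.0, 7.0]  # stands in for a competing team's model

baseline_score = rmse(baseline_predictions, actual)
challenger_score = rmse(challenger_predictions, actual)

# A lower error means the challenger predicts the held-out data more closely.
challenger_is_better = challenger_score < baseline_score
```

Because both models are scored against the same held-out outcomes with the same measure, the comparison does not depend on anyone's judgment of the approach taken.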
More importantly, Kaggle gives the teams a real-time scoreboard from which they can compare the performance of their own model with the progress that has been made by others. This provides extra motivation, because it means that while the contest is still running participants can never afford to rest: the team with the best model on Monday could easily be overtaken by another team on Tuesday.
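The scoreboard mechanism can be sketched in a few lines of Python. This is an illustration only, not Kaggle's implementation: the `Leaderboard` class, the team names and the scores are all hypothetical, and it assumes that a lower score is better and that each team is ranked by its best submission so far.

```python
class Leaderboard:
    """Tracks each team's best score so far; lower scores rank higher (assumed)."""

    def __init__(self):
        self._best = {}

    def submit(self, team, score):
        """Record a submission, keeping it only if it beats the team's previous best."""
        previous = self._best.get(team)
        if previous is None or score < previous:
            self._best[team] = score

    def standings(self):
        """Return (team, best_score) pairs ordered from best to worst."""
        return sorted(self._best.items(), key=lambda item: item[1])

board = Leaderboard()
board.submit("team_a", 0.82)  # Monday: team_a is in front
board.submit("team_b", 0.90)
monday_leader = board.standings()[0][0]

board.submit("team_b", 0.75)  # Tuesday: team_b submits an improved model
tuesday_leader = board.standings()[0][0]
```

Every new submission can reorder the standings, which is why a lead is never safe while the contest is open.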
Smaller customers, offering smaller prizes and usually dealing in data that isn’t highly sensitive, will sponsor public competitions. Anybody can join a team to compete in a public competition; indeed, these competitions often draw a very wide audience of potential analysts.
In the last few years, there have been some highly publicized contests, such as the 2010 competition to predict the winner of the Eurovision Song Contest. These have attracted no end of enthusiastic students, mathematical hobbyists and amateurs keen to cut their teeth on real-world data and to win some recognition along the way.
The competitions run for a limited period of time, ranging from a few months to a year or more. At the end of the competition period, prizes are given to the first, second and third most successful models. But, for many participants, the monetary prize isn’t the primary motivation for joining.
Students, for example, are often excited just by having the opportunity to practice their data modeling skills on a real-world problem. It helps them to refine their techniques on large sets of data that they would not otherwise have access to, and gets them some recognition among their peers: academic celebrity that can open doors to other opportunities.
For Kaggle, the public competitions also serve another valuable purpose: identifying talent. Inevitably, there are a handful of participants who “come out of nowhere” but whose abilities truly shine. This is important, because although public competitions are good for publicity and function as a vetting ground for analysts, they are not ultimately the core of the Kaggle business model, which lies in the private competitions.
Kaggle offers private competitions for customers who put up larger prizes or have very sensitive data. In private competitions, membership on a competing team is offered only to select members of the Kaggle network, based on known performance and abilities.
Often, non-disclosure and intellectual property agreements are signed to ensure the security and privacy of the data involved. Prizes can run to millions of dollars, and at this level every team that participates gets some compensation for their time and effort, although naturally the “bonus” for producing the winning model is substantial.
Vision
Private competitions are what bring the Kaggle business model out of the realm of a game or a curiosity – something to have some fun with while placing bets on Eurovision – and into its own as a transformative approach to scientific research and business analysis.
Because all of the participating teams in private competitions are paid (albeit with performance-related bonuses), being a Kaggle analyst is more than just a novelty or hobby: it has the potential to provide a steady income. Moreover, the controlled participation and signed non-disclosure contracts in the private competitions ought to go a long way toward allaying the fears of businesses and research institutions who are handing over their data to Kaggle.
Of course, some businesses are still hesitant. After all, it’s easy to under-sell Kaggle’s business model by thinking of it as just another form of outsourcing.
If businesses think that the main advantage of using Kaggle for their analytical work is the standard outsourcing pitch – a reduction in payroll and other employee-related costs – then the well-known pitfalls of handing business processes, not to mention intellectual property, over to an outside company are sure to make it a hard sell.
But being an intellectual outsourcing company is not what Kaggle is about. “The main saving isn’t in employee costs for research and development,” says Anthony Goldbloom, Kaggle’s founder and chief executive. “It’s in getting a better result with Kaggle than you get with any in-house team of experts.”
This is no idle boast. In the competitions the company has run, Kaggle’s winning models have consistently out-performed models developed by the businesses that originally provided the data.
Even in fields where large teams of well-funded researchers have spent years developing analytical models – fields where modeling is critical, like HIV research or financial forecasting – Kaggle competitions are, within months, producing results that outstrip the best models developed by corporate or research institute teams.
People power
Understanding the power behind Kaggle comes down to understanding the real power behind crowdsourcing.
Some people compare it to adding microprocessors to a computer. The thinking goes: the more processors you have, the more powerful the computer.
Many of the media reports about Kaggle have even used phrases that suggest this type of imagery: they say that Kaggle “brings together thousands of intelligent minds to solve a problem,” as if “intelligence” were a fungible asset and results were simply a product of body count and IQ. If that were the case, there would be very little compelling reason for large corporations to entrust their data to Kaggle, rather than simply hiring their own team of scientists.
But the real advantage this company has isn’t in the number of experts or their level of intelligence: it’s the mechanism of the competition itself. For one thing, the selection of the winner in a competition is based purely on the performance of the specific model at hand. It doesn’t matter how long one’s resume is, or who has “seniority” or more publications in a particular field; on a case-by-case basis, Kaggle analysts are rewarded based on their success in modeling the specific problem they are working on.
As Goldbloom is fond of saying, this makes Kaggle one of the world’s first true “meritocratic data markets”. This is a critical factor, he explains, because true discovery in the data analysis world is probably 10 per cent expertise in the right tools, and 90 per cent insight. In a competition designed to predict which used cars would be the most reliable, the winning team had the insight to use car color as a factor. (It turns out you are least likely to have trouble with your used car if you buy an orange one.)
In the competition to predict how rapidly HIV symptoms will progress based on genetic markers, it was neither a data analyst nor a geneticist who solved the problem: it was an English major who learned data modeling techniques from YouTube videos. In the Kaggle world, a person’s ability to have remarkable insights can be discovered and rewarded in a way that simply isn’t possible when teams are put together by review boards and corporate hiring managers.
Another critical feature of Kaggle’s competition model is the real-time leaderboard, from which each team can see where it stands with respect to the others for the entire duration of the competition. In the corporate world, the conventional approach is to work on developing a model until it meets some specific set of requirements, and then to stop work on that model and declare success.
In Kaggle competitions, the teams work against each other for a set period of time, and constantly live with the possibility that another team will develop a model to overtake them before the time is up. This motivates the teams to keep pushing themselves as hard as they can.
The Kaggle model can’t be used for everything. In order for it to work, the problem to be solved has to satisfy two important criteria: progress has to be objectively measurable, and there has to be a way to track progress continually in real time so that teams can keep an eye on each other’s progress.
So, for example, the Kaggle concept couldn’t readily be applied to game development, where there is no single objective way of measuring the success of the final result. (You can imagine that people’s willingness to participate in competitions would drop off precipitously if they felt that their product could be the best and still not be chosen as the winner.)
A competition like “be the first team to build a rocket that can make it to the moon” would also be difficult, because although it has an objective success measure (you either get to the moon or you do not), it is difficult to envisage a real-time leaderboard that is able to track “progress” of the teams toward that goal.
But a broad array of opportunities is still out there for Kaggle, and the company has the potential to fundamentally change the way that big data analysis is done. Anthony Goldbloom envisages a future where the competitive market of Kaggle provides a full-time salary for its analysts.
More than a workaday academic salary, too: in the competitive Kaggle meritocracy, academic exceptionalism could reward researchers in the same way that the financial industry rewards hedge fund managers. The sky is the limit as to what you can make, bounded only by the strength of your skills and depth of your insights.