What does a data scientist do?

More and more companies and industries are grappling with the challenges of extracting value from large amounts of data. Data scientists, the people whose job it is to overcome these challenges, are becoming more prominent, yet what it is they do, and how they’re different than software engineers, is still a mystery to a lot of people. The goal of this article is to explain one of the most important tools that data scientists use: machine learning (ML). The bottom-line is: using ML is slow, costly, and error-prone, but it allows companies to achieve business objectives that are unattainable any other way.

Just like a software engineer, the goal of a data scientist is to develop programs that perform business functions. In software engineering, the engineer writes a program that encodes all of the “rules” for what the program is supposed to do, based on the requirements. For example, take the task of returning all of the product reviews for a given product ID. The rules here include things like how to make sure the product ID is valid (and what to do when it’s not), how to query a database of reviews with the product ID, and how to format the reviews to be returned. A reasonably skilled software engineer can easily write a program that encodes all of these rules.

However, there are many problems where it is not feasible for anyone to write down all of the rules required for a software program to perform some task. Sometimes this is because the rules are simply not known, and other times it’s because there are way too many rules. Several good example of the latter type come from natural language processing (NLP), like the problem of predicting the sentiment of movie reviews (i.e. did the reviewer like the movie or not?). Like nearly all NLP problems, this is something that a human could do reasonably well, but it doesn’t mean that a person could easily write down a set of rules for how they made their decisions. If you had to, you’d probably start by listing key words or phrases that would indicate the reviewer’s sentiment, like “great”, “liked”, “smash hit”, “boring” and “terrible”, and use their appearance in a review to judge the sentiment. This will only work so far, because language is much more complex than that. You’d need additional rules to get around the fact that many of the key words can mean different things in different contexts, more rules to cover all of they ways that you can express sentiment without using a key word, and even more rules to detect when sentiment gets flipped by a negating phrase. Add to this the fact that people don’t actually write in perfect English, and you’d end up with a seemingly endless list of rules and an impossible software engineering problem.

ML has completely different approach: instead of writing out the rules in a top-down fashion, ML attempts to infer the rules in a bottom-up way, from data. For the problem of predicting the sentiment of movie reviews, you’d take a set of movie reviews with the actual sentiment of each, and feed them into a ML program. This ML program then literally outputs another program that takes in a movie review and outputs a prediction of its sentiment (with some expected accuracy). One reason that ML can work (not perfectly) on NLP problems is because where a human would have a very hard time creating millions and millions of rules, a computer has no such limitation.

Of course, there are a few catches to using ML. First, the data is not always cheap. Sentiment analysis of movie reviews has been popular topic only because many movie reviews online come with ratings (usually on a scale of 1 to 5), which tell you the answer — the term for this in ML is “ground truth”. For many problems, the ground truth is not readily available and can be very costly to figure it out. A recent Kaggle competition on this topic used a dataset of 50,000 movie reviews — imagine needing to read every single one of these to determine the ground truth sentiment.

Another catch, which I’ve already mentioned, is that ML will rarely produce a program with perfect accuracy. This is because for most real-world problems, it’s impossible to provide the ML program with all of the relevant information. For NLP problems, humans have a wealth of knowledge about what words mean, while computers do not. Many other real-world problems involve predicting human behavior, but it’s impossible to observe everything that’s going on in people’s heads. While ML algorithms are designed to make the most out of limited information, they’re only viable as a solution when the business objectives can tolerate some amount of error.

ML solutions are also slow to develop, even more so than standard software engineering solutions. As mentioned earlier, ML solutions need to work with limited information, which means that it’s impossible to know whether ML will meet the business’s requirements beforehand. Effective data scientists will make educated guesses about the data, ML algorithms, and algorithm parameters that are most likely to succeed based on the problem, but experimentation is always required. This can mean multiple iterations refining the data, algorithms, and parameters before a definitive answer can be reached about whether an ML solution will work.

Last, ML solutions in production are costly to maintain. Their performance needs to be continuously monitored, because their performance can change over time as the characteristics of the data they’re analyzing changes. In the case of predicting the sentiment of movie reviews, just changes in writing style could drop the accuracy significantly. Processes and instrumentation are required to evaluate a statistically significant sample of the solution’s predictions, and take actions to improve performance whenever it drops such as creating a new program using newer data.

I hope this de-mystifies some of what data scientists do, and explains one of the important tools they use to extract value from large amounts of data.