Hi, I’m Tim, a Senior Data Scientist in the Data Team. I joined AO last summer, and I’ve been doing Statistics and Data Science for around 10 years now, working on all kinds of data-driven problems.
I’ve been part of many different types of Data Science projects. I’m proud that some have been very successful in helping the customer with their goals, but sometimes it just doesn’t work out. Sometimes projects fail for reasons outside your or anyone else’s control, but there are things that you can control. With that in mind, here are my top 5 tips for a successful Data Science project:
1. Define the Problem
You can’t create a good solution if you don’t know the problem. This one might seem obvious, but it’s often underestimated. Where I most often see this go wrong is that Data Scientists are impatient to start the technical stuff, or are given unrealistic deadlines, and jump in assuming they know what the customer wants. It’s important to take the time to understand your customer’s business area and the context behind what they are asking for.
You might also find that what they are asking for won’t actually solve their problem. This is often due to a misunderstanding of how Data Science works. Don’t blame the customer when this happens. Remember, you’re the expert here: talk them through your concerns and make sure you’re speaking the same language.
Finally, once you have a problem definition everyone is happy with, make sure it’s documented. That way everyone has something to refer to and to check progress against.
2. Plot, Plot and Plot Again
For me, this is a golden rule of all Data Science work. Understanding the data you’re working with is always going to help. Often people try to build up an understanding using tables of summary statistics or by fitting models, but a picture really does say 1000 words. I always use scatter plots for numerical variables, and boxplots are great for categorical variables. These plots can tell you how the data is distributed, whether you have outliers, whether the relationships are linear, whether you should be applying a transform, whether your features are collinear and so much more. This process of visualising the data informs your choice of modelling technique and highlights any issues with the data.
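To make the point about summary statistics concrete, here is a minimal Python sketch using Anscombe's quartet. This is a well-known teaching dataset and is my choice of illustration, not something taken from a project described here. The helper names (`pearson`, `summarise`, `plot_quartet`) are made up for the example, and the plotting function assumes matplotlib is installed.

```python
from statistics import mean

# Anscombe's quartet: four small datasets with near-identical summary statistics.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient, written out to avoid extra dependencies."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def summarise(xs, ys):
    """The kind of summary table people often rely on instead of a plot."""
    return {
        "mean_x": round(mean(xs), 2),
        "mean_y": round(mean(ys), 2),
        "r": round(pearson(xs, ys), 2),
    }


SUMMARIES = {name: summarise(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, summary in SUMMARIES.items():
    print(name, summary)


def plot_quartet(path="quartet.png"):
    """Scatter plot of each dataset. Assumes matplotlib (third-party) is installed."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
    for ax, (name, (xs, ys)) in zip(axes.flat, QUARTET.items()):
        ax.scatter(xs, ys)
        ax.set_title(f"Dataset {name}")
    fig.savefig(path)
    plt.close(fig)
```

The printed summaries are practically indistinguishable, yet calling `plot_quartet()` shows four very different pictures: a noisy linear trend, a smooth curve, a line with one outlier, and a vertical cluster with a single extreme point.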
Once you’ve started modelling, don’t think you’re done plotting! Comparing models using plots is always better than tables of error metrics. Plots of predictions vs observations or residuals vs features can highlight hidden issues with your models and show how you can fix them. They are also a great way to communicate your progress to customers, who can easily get lost in tables of figures.
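As a rough sketch of why a residuals vs feature plot can tell you more than a single error metric, the Python below fits a straight line to deliberately curved data. The data (`y = x^2`) and the function names are hypothetical and chosen purely for illustration. The plotting function assumes matplotlib is installed.

```python
from statistics import mean

# Hypothetical data: a curved relationship that a straight line cannot capture.
xs = list(range(11))
ys = [x ** 2 for x in xs]


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# A single error metric gives one number and hides where the model goes wrong.
rmse = mean([r ** 2 for r in residuals]) ** 0.5
print(f"RMSE: {rmse:.2f}")

# The residuals themselves show structure: positive at both ends and
# negative in the middle, a sign that the model is missing curvature.
for x, r in zip(xs, residuals):
    print(x, round(r, 2))


def plot_residuals(path="residuals.png"):
    """Residuals vs feature. Assumes matplotlib (third-party) is installed."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(xs, residuals)
    ax.axhline(0, linestyle="--")
    ax.set_xlabel("feature x")
    ax.set_ylabel("residual (observed - predicted)")
    fig.savefig(path)
    plt.close(fig)
```

Calling `plot_residuals()` produces a clear U shape around zero, which suggests adding a squared term or transforming the feature. The RMSE on its own would not have pointed you there.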
I sometimes hear the excuse that plots aren’t meaningful when you have big and complex data. My reply is that you just need to be smarter about what you plot. It’s true that when you have 1000s of variables you can’t plot them all. In cases like this, I tend to work iteratively: I do a bit of exploratory analysis and feature engineering, then stop to make some plots to check that I’m on the right track and that the features I’ve made make sense. This might be difficult, but it’s easier than trying to do everything blindly.
3. Bring the Customer with You
If you followed tip 1 then you should have a clear idea of the problem you are solving and the business context behind it. The temptation is for Data Scientists to go off on their own and come back to the business once they have a solution. This can seem attractive as it lets everyone focus on their job and cuts down on the number of meetings, but it rarely leads to success. I always try to push back on this approach and insist that the customer is actively involved in as much of the project as possible. The reason is that good Data Science work should generate insights and questions at every stage. If the customer is on hand to answer these questions and discuss the findings, everything moves quicker. I’ve also seen many projects where a discussion with the customer has radically changed the requirements. While this can be disruptive, it’s much better to find out early on, when you can still fix it.
Another benefit of bringing the customer with you is that it lets them really understand the work you are doing and feel it’s their project too. For a Data Science project to have real impact you need the customer to change the way they do their job and maybe their whole business process. This can be difficult for them and may cause some hesitation and pushback. If the customer, or someone in their team, is part of the project, it gives them that extra reassurance that they are doing the right thing.
4. Get a Peer Review
I come from an academic background where peer review is part of the culture. I find that in industry this is often forgotten or even actively avoided. Every Data Science project is unique in some way (it’s why it’s interesting), so you’re always doing something that has never been done before. This doesn’t just apply to junior staff. We’re all human and we all have gaps in our knowledge, no matter how much experience we have. Therefore, I’d always encourage Data Scientists to talk about their ongoing work with colleagues. When you’re wrapped up in a project it’s easy to miss things, and getting an outside perspective can help even if it’s just a quick chat over coffee.
I also encourage a more formal review before any important project milestones or decisions. This often gets pushback, as business stakeholders see it as adding unnecessary cost and delay to a project. I always try to emphasise that a review can be done in a few days or even hours and can spot issues that would have taken weeks or months to fix later. To make the review easier, try to start early with informal chats and advice. That way the reviewer is already aware of the project and has already shared any concerns.
5. Plan How to Productionise It
I’ve put this one at number 5 because deployment usually comes at the end of the project, but it needs to be considered from the start. A lot of Data Science projects aim to deliver a product, such as a model or software tool, which needs to be productionised. It’s therefore important that whatever you produce can be hosted in whatever environment your company has. I’ve had projects where the whole thing had to be recoded in another language because what we’d initially used wasn’t supported. This can add significant time and increase the chance of introducing bugs, all of which could have been avoided. You also need to consider the compute time and data volumes you’re using. Can the production platform scale to what you need? If it’s a cloud platform it probably can, but will that push the cost up?
Ideally, you want your team to have a well-defined pipeline for productionising Data Science work. That way you can follow a set pattern and know it will work. If this doesn’t exist for you, or your project doesn’t fit, make sure you have discussions with whoever is responsible for the production platform so they know what you’re planning and can spot any potential issues. If the production platform doesn’t exist then you need to factor that in: there will be increased cost and effort, and it’s a bit of a step into the unknown.