In our society, data science is everywhere. Spend 10 minutes on Twitter and you are very likely to encounter something related to it. Although having so many people discuss data science and machine learning is good for raising awareness of the field, I believe it also makes it harder to understand what practitioners, such as data science consultants, actually do. That’s why I decided to write about the main phases involved in a data science project.
1.Initial assessment
We always start by conducting an initial assessment of the problem. Although this step might appear trivial, I believe it to be crucial for the outcome of the project. During this initial evaluation phase, we gather as much information as possible about the problem that we need to solve. Also, we identify all the data sources and any potential issues that might arise with the data.
2.Data architecture and database design
Once we complete our initial assessment, we progress to structuring, normalising, and cleaning the data. In this phase, we usually agree on the database architecture and schema. We also set up the processes through which data will be captured, stored, and accessed in the later stages of the project.
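To make this phase more concrete, here is a minimal sketch of structuring, normalising, and cleaning a small dataset, using Python's built-in sqlite3 module. The table names (customers, orders), the sample rows, and the cleaning rule are all hypothetical and chosen only for illustration; a real project would agree on its own schema with the client.

```python
import sqlite3

# Hypothetical raw records: customer details are repeated on every order row.
raw_rows = [
    ("alice@example.com", " Alice ", "2021-03-01", 120.0),
    ("ALICE@example.com", "Alice", "2021-03-05", 80.5),
    ("bob@example.com", "Bob", "2021-03-02", None),  # missing amount
]


def build_database(rows):
    """Load raw rows into a small normalised schema (customers, orders)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE customers (
            id    INTEGER PRIMARY KEY,
            email TEXT NOT NULL UNIQUE,
            name  TEXT NOT NULL
        );
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            order_date  TEXT NOT NULL,
            amount      REAL NOT NULL
        );
    """)
    for email, name, order_date, amount in rows:
        if amount is None:
            continue  # assumed cleaning rule: drop orders without an amount
        clean_email = email.strip().lower()
        # Each customer is stored once; repeated emails are ignored.
        conn.execute(
            "INSERT OR IGNORE INTO customers (email, name) VALUES (?, ?)",
            (clean_email, name.strip()),
        )
        (customer_id,) = conn.execute(
            "SELECT id FROM customers WHERE email = ?", (clean_email,)
        ).fetchone()
        conn.execute(
            "INSERT INTO orders (customer_id, order_date, amount) "
            "VALUES (?, ?, ?)",
            (customer_id, order_date, amount),
        )
    conn.commit()
    return conn
```

Calling build_database(raw_rows) returns a connection in which each customer appears once and every order points to its customer through a foreign key, so later phases can query the data without worrying about duplicated or incomplete records.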
3.Algorithm design and implementation
Here is where all the fun stuff happens. In this phase, we create all our hypotheses, develop and implement algorithms, and train models. In my experience, this phase is where we conduct most of the research and exploration. It is here that we probe alternative hypotheses, try different algorithms, invent new methods for doing computations, or build new models to describe the data. During this phase, you see all your data scientists and DevOps engineers gathering around their laptops and getting excited about some strange kink in a graph or some stunning set of numbers. Although it may look daunting, this interaction and the ability to try out different ideas usually result not only in a better solution for the problem at hand but also in some thoughts on how to solve many other adjacent problems.
4.Testing
I included the testing phase as a separate entity, but that is not always the case. The algorithm design and implementation phase and the testing phase are in many cases connected. What usually happens is that we have a hypothesis and a potential solution for the problem, which we implement and test right away. You shouldn’t be alarmed if you sometimes see your data scientist redesigning the whole algorithm after a set of failed tests. That is common. Beyond that, we sometimes need to revise the data itself before we can progress.
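As an illustration of the implement-and-test loop described above, here is a minimal sketch that fits a simple model on one part of the data and checks it on a held-out part against a naive baseline. The synthetic data, the straight-line model, and the function names are assumptions made for this example, not a description of any particular project.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit for y = a * x + b."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return mean((predict(x) - y) ** 2 for x, y in zip(xs, ys))


def evaluate(seed=0):
    """Train on 150 synthetic points, test on the remaining 50."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(200)]
    ys = [3.0 * x + 2.0 + rng.gauss(0, 1) for x in xs]  # assumed true relation
    train_x, test_x = xs[:150], xs[150:]
    train_y, test_y = ys[:150], ys[150:]

    a, b = fit_line(train_x, train_y)
    baseline = mean(train_y)  # naive model: always predict the training mean

    model_mse = mse(lambda x: a * x + b, test_x, test_y)
    baseline_mse = mse(lambda x: baseline, test_x, test_y)
    return model_mse, baseline_mse
```

If the model's held-out error were not clearly below the baseline's, that would be the kind of failed test that sends us back to redesign the algorithm or revisit the data.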
5.Deploying to production
Once we have thoroughly tested a solution, we refactor it to get it ready for production. After careful examination, we deploy the solution (algorithms and data) to the client’s production environment. Before this final step, we also set up monitoring and reporting processes to ensure that any future bugs are accurately captured and dealt with, with minimal disruption to the service.
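One simple way to picture the monitoring described above is a wrapper that logs every failure of a deployed prediction function and returns a fallback value instead of interrupting the service. This is only a sketch using Python's standard logging module; the names monitored, fallback, and the logger name are hypothetical, and a real deployment would typically forward these logs to the client's own reporting tools.

```python
import logging

logger = logging.getLogger("model_monitoring")  # hypothetical logger name


def monitored(predict, fallback):
    """Wrap a prediction function so failures are logged, not propagated."""
    stats = {"calls": 0, "errors": 0}

    def wrapper(features):
        stats["calls"] += 1
        try:
            return predict(features)
        except Exception:
            stats["errors"] += 1
            # Record the full traceback and the offending input for later review.
            logger.exception("prediction failed for input %r", features)
            return fallback

    wrapper.stats = stats  # simple counters that a report could read
    return wrapper
```

For example, wrapping a function with monitored(predict, fallback=0.0) keeps the service responding when an unexpected input raises an error, while the counters in wrapper.stats give a basic error rate to report on.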
The above steps briefly outline our process for approaching any project. It is important to note that some of these phases are sometimes skipped, especially phase 2. Skipping phase 2 is common when working with already established data pipelines, where the client has already done all the database design and architecture. Nevertheless, even if we don’t need to create the databases ourselves, we first review what the client has implemented so that we know what data is available for developing our models and algorithms.
To summarise, a data science solution usually comprises five extended phases, starting with an initial assessment of the problem and ending with a fully developed and tested solution running live on the client’s side.