Predicting how long it will take to get a Machine Learning (ML) project into production can be tricky. When delays occur, they are more often than not rooted in a disconnect between the engineering and data science teams. Collaboration between data science and engineering is critical for ML projects, yet it is often a challenge.
Although data scientists and engineers both work with code and machines, their roles and mindsets are different. Data scientists extract knowledge and insights from data, while software engineers build products and systems. Data scientists can spend considerable time creating and tweaking models and algorithms to reach an ideal result, which makes their work more experimental and iterative than software development. Engineers are responsible for building functionality around the ML models and getting products into production within a set timeframe.
The model development portion of an ML project is considered the research phase and is where many ML projects get stalled due to continual model adjustments. Therefore, it can be extremely beneficial for data scientists to think in engineering terms, which often leads to a faster production cycle.
When it comes to ML project management, one can separate the process into three stages: Proof of Concept (PoC) or the research phase, the Demo phase, and the Engineering phase. In this article, we examine these different phases and how one can handle them to ensure smooth and timely delivery of projects. The resulting protocol can also ensure better estimates of time for production deployment.
The Rule Book
Based on several years of experience handling various ML projects as part of a data science team, we have created a number of heuristic rules that one can follow to ensure smooth, predictable and faster time to production.
PoC Phase
Almost all ML projects require a PoC phase. The PoC establishes that a reasonably performing model is achievable, in addition to demonstrating overall feasibility.
Rule 1: Time-box PoC Efforts
Since the PoC is essentially a research effort, it can go on for an undetermined time for two main reasons: 1) data scientists are never done searching for a better model and 2) ML models have a multitude of hyper-parameters to adjust and refine. Therefore, it is essential to set and stick to a pre-determined timeframe to complete the PoC. This reality also drives the need for Rule 2.
Rule 2: Set Expectations of PoC beforehand
Start by clearly defining the output of the PoC, either in terms of metrics or a set of feature behaviours. One could argue that if Rule 2 is defined clearly enough, Rule 1 becomes unnecessary. However, Rule 2 only holds if the problem can actually be solved; Rule 1 ensures the team does not go beyond a certain number of retries before giving up.
So how do you estimate the appropriate amount of time to develop a PoC? This takes experience and can evolve, but as a rule of thumb:
- 5 months for problems involving classical learning techniques or problems involving transfer learning
- 3 months for problems involving proven deep learning techniques
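As a minimal sketch of Rules 1 and 2 together (the metric names, thresholds, and date below are purely illustrative assumptions), the agreed targets and time box can be captured in a small, version-controlled structure so that "done" is unambiguous for everyone:

```python
# poc_criteria.py -- illustrative sketch; metric names, thresholds and dates are assumptions.
from dataclasses import dataclass
from datetime import date


@dataclass
class PoCCriteria:
    """Exit criteria agreed before the PoC starts (Rules 1 and 2)."""
    deadline: date                    # hard time box for the research effort (Rule 1)
    target_metrics: dict[str, float]  # metric name -> minimum acceptable value (Rule 2)


def poc_is_done(criteria: PoCCriteria, achieved: dict[str, float], today: date) -> bool:
    """Stop the PoC once every target is met, or once the time box expires."""
    metrics_met = all(
        achieved.get(name, float("-inf")) >= threshold
        for name, threshold in criteria.target_metrics.items()
    )
    return metrics_met or today >= criteria.deadline


# Example with made-up numbers: both targets are met, so the PoC can stop early.
criteria = PoCCriteria(deadline=date(2024, 6, 30),
                       target_metrics={"f1": 0.80, "precision": 0.85})
print(poc_is_done(criteria, {"f1": 0.82, "precision": 0.87}, date(2024, 5, 1)))  # True
```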
Demo Phase
Once the viability of the ML project is established, demonstrating the work becomes important. This also sets the path toward a Minimum Viable Product (MVP).
Rule 3: Demonstrate the PoC Effort to All Stakeholders
Involving the stakeholders affects the MVP in a number of ways:
- Defining future course of the product
- Defining or re-defining supported features
- Redefining ML metrics
- Defining how it fits into an existing product or, in the case of a new product, what the final product will look like
Though stakeholders differ from project to project, the minimum stakeholders should include:
- Product Owners: Person(s) who defined the product in the first place (e.g. CTO/CEO)
- Data Science Team: Team involved in PoC and subsequently, person(s) taking it to production
- Engineering Team: This team helps define the feasibility of a product
- Dependents: Mostly the UI/UX team, which takes the product to end-users. In some cases, UI/UX may be combined with the Engineering team
The quality of the demonstration matters because this is the buy-in phase: the better the demonstration, the higher the chance the project is approved. Data science is all about telling stories with data, and this is the phase where those stories should speak clearly. These data science stories, combined with visualizations the business can readily understand, are direct indicators of a successful demo. In addition to the model demo, the engineering team and other dependents should also present a snapshot of how the PoC would be taken to production.
The demo phase should be time-boxed and should not last more than a month. Delaying a demo raises the chances of the PoC landing in the scrap yard or being pre-empted by higher-priority projects.
Engineering Phase
Once the project is approved, the next step is to take the PoC to production. Taking a PoC into production needs to be handled carefully, since the underlying product sometimes becomes the face of the company.
Rule 4: Set the Requirements Clearly
Setting clear requirements is important as it defines the goals not only for the data science team, but also for all the teams and parties on which the ML project depends. The following factors should be accounted for:
- Features that will be supported
- Business metrics to be met
- Engineering and/or UI/UX requirements
- Infrastructure needs and DevOps requirements
- Budget allocation
The requirements should also spell out what happens if the final model's performance does not meet expectations, whether due to data unavailability or unforeseen model limitations. In such situations, one can still deploy to a limited set of users to validate the feature, as discussed under Rule 7.
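One lightweight way to keep these requirements visible to every team is to record them in a small structured spec; the sketch below is only an illustration, and every field name and value in it is an assumption:

```python
# requirements_spec.py -- illustrative sketch; all field names and values are assumptions.
from dataclasses import dataclass


@dataclass
class MLProjectRequirements:
    supported_features: list[str]        # features the first release must support
    business_metrics: dict[str, float]   # business metrics to be met, e.g. minimum lift
    infra: dict[str, str]                # infrastructure and DevOps needs
    budget_usd: float                    # allocated budget
    fallback_plan: str = "limited-user rollout"  # if metrics are missed (see Rule 7)


# Hypothetical example values:
requirements = MLProjectRequirements(
    supported_features=["ranking", "personalised recommendations"],
    business_metrics={"click_through_rate_lift": 0.05},
    infra={"serving": "autoscaled GPU inference", "ci": "model tests on every merge"},
    budget_usd=150_000,
)
```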
Rule 5: Define Clear Timelines and the Design
Defining timelines for the data science team ensures the project is tracked and brought to closure within the estimated time. It also sets the product launch date, so timelines should be set carefully, accounting for unknowns. Timelines should likewise be defined for dependents so that all parties can work in parallel. Regular, agile-style tracking is required to identify blockers early and bring them to closure before they start to overwhelm the project.
Timelines often neglect to allocate sufficient time for QA and code reviews. Code reviews ensure code quality and coverage before QA takes over, and QA determines product stability; both should be accounted for appropriately during the planning stage.
Timelines should clearly call out integration points. Where a dedicated engineering team is available, at least one engineer from the product team should work alongside the ML engineer to ensure smooth and faster integration with the system.
Design is an integral part of the system: a well-thought-through system design accommodates future changes in addition to ensuring robustness. Timelines should allocate sufficient time for design, which varies depending on whether it is an entirely new feature/product or an add-on feature. Aspects to consider while designing include (a minimal interface sketch follows the list):
- Modularity and Reusability
- Scalability
- Ability to accommodate improved ML models
- Ease of use by end users
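As a minimal sketch of the modularity and model-accommodation points (the interface and class names are assumptions, not a prescribed design), the product can integrate against a stable predictor contract so that a retrained or entirely different model can be swapped in without touching the surrounding system:

```python
# predictor.py -- illustrative interface sketch; names and the wrapped model API are assumptions.
from abc import ABC, abstractmethod


class Predictor(ABC):
    """Stable contract the product integrates against, regardless of model version."""

    @abstractmethod
    def predict(self, features: dict) -> float:
        ...


class HeuristicPredictor(Predictor):
    """Rule-based baseline, useful before or alongside the ML model."""

    def predict(self, features: dict) -> float:
        return 1.0 if features.get("score", 0.0) > 0.5 else 0.0


class MLPredictor(Predictor):
    """Wraps a trained model (assumed here to expose a scikit-learn-style predict)."""

    def __init__(self, model):
        self._model = model

    def predict(self, features: dict) -> float:
        return float(self._model.predict([list(features.values())])[0])
```

A newer or better model only needs to honour the same `Predictor` contract, which keeps the integration code untouched when the model improves.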
Most ML projects take roughly 6-8 months to go to production.
Rule 6: Pre-launch Demo with All Stakeholders
A pre-launch demo is a good way to make sure the final product is consistent with what was agreed upon at the start of the project. It also gives the team a chance to accommodate minor changes arising from observations of the final product and the resulting discussions. A pre-launch demo is also a counter-check on the business metrics defined earlier. Therefore, the pre-launch demo should be completed roughly a month before the launch.
Rule 7: Phase-wise Product Deployment
Deployment should be carried out in phases so that user feedback is incorporated incrementally, further improving product quality and stability. The specific phase-wise approach will differ depending on the type of ML project, but generally includes the phases below (a minimal routing sketch follows the list):
- Selective User Deployment: Pre-define the users to whom the product will be available. Typically these are internal users, who pose less risk to the business and will provide detailed feedback.
- A/B Testing Deployment: This phase compares the ML solution against an existing solution, which in most cases is a heuristic or rule-based approach. The product is exposed to a selected share of end-users to judge the performance of the ML model.
- Final Deployment: In this phase, the product is exposed to all users and/or all organizations.
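A minimal sketch of how such phase-wise routing might look (assuming stable user identifiers are available; the allow-list, traffic percentage, and function names are all illustrative):

```python
# rollout.py -- illustrative sketch of phase-wise routing; all names and values are assumptions.
import hashlib

INTERNAL_USERS = {"alice@company.com", "bob@company.com"}  # phase 1 allow-list
ML_TRAFFIC_PERCENT = 10  # share of remaining users routed to the ML model during A/B testing


def _bucket(user_id: str) -> int:
    """Deterministically map a user to a bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100


def use_ml_model(user_id: str) -> bool:
    """Decide whether this user is served by the ML model or the existing heuristic."""
    if user_id in INTERNAL_USERS:                  # phase 1: selective user deployment
        return True
    return _bucket(user_id) < ML_TRAFFIC_PERCENT   # phase 2: A/B split against the baseline


# Raising ML_TRAFFIC_PERCENT to 100 corresponds to the final deployment phase.
```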
The deployment as a whole may take 2-6 months, depending on the phases involved, but plan on a minimum of 2 months.
The Bottom Line
After considering all the phases and steps, it is clear that an ML project can take roughly 10-12 months from PoC to production. To ensure the project is delivered within the allotted timeframe, start with clear requirements and well-defined business metrics, and allow sufficient time for QA and a phased deployment schedule. By following the framework above, you can dramatically increase the probability of delivering your ML project on time.