Translating Business Problems into Data Science Problems
Data science has come a long way in recent times; its processes are used across analysis, analytics, machine learning, deep learning, and artificial intelligence.
It is the cornerstone for solving problems with data.
These problems can relate to finance, health, banking, insurance, entertainment, and more.
However, there’s a need to translate these problems into a form that can be interpreted and solved with data science.
This post addresses the phase between identifying a problem, coming up with the required data and designing a minimum viable product as a solution to solve the problem.
It should help data scientists and machine learning engineers convert vague business problems into data science problems with solutions.
There are 3 steps to translating business problems to data science problems:
1. Understand & Define the problem
Frame the business problem
Prepare for a decision
2. Set analytic goals and scope your solution
Set objectives and define milestones
Design minimum viable product
Identify target metrics
3. Plan the analysis
Plan your datasets
Plan your methods
Understand & Define the problem
Frame the business problem
The first step deals with framing the problem. Many times, data scientists are presented with very vague problems such as: how do we reduce customer churn? How do we increase revenue? How do we cut costs? How do we improve sales? What do users want?
These problems are very vague; however, it is the job of the data scientist to frame and define them in a way that can be solved with data science. A data scientist is expected to probe and ask the stakeholders/process owners questions. Asking questions further demystifies the problem and makes it concrete.
When asking questions, make things tangible as quickly as possible. For example, if the business wants to reduce churn and increase revenue, you want to ask the stakeholder/process owner questions like: What strategies do you employ to retain customers? What initiatives does the business employ to increase revenue? What promotions are given to users? What are the major pain points that led to a loss of revenue? Which product had the biggest decline in revenue?
This gives them clarity of thought and also helps you get to the bottom of the problem.
Next, prioritize the pain points: areas that give the customers or the business the most trouble should be at the top of the list when defining the problem.
Try to get a balanced perspective from stakeholders: if some users are not happy with the product, compare their view with that of users who are happy with it. This helps identify bias.
In defining the problem, note that the problem posed by the stakeholder might not always be the most pressing one. For example, the stakeholder might want to find out why users come to the website but do not purchase anything, while the real problem is whether they can improve recommendations so they align with users' interests and push them to place an order.
Look for problems that the stakeholders mention incidentally as they tell you about what they think is their main problem.
Prepare for a decision
When defining the problem, it is important to think in terms of the decision that needs to be made to solve it, such as: Which users will churn in the next 70 days? Which users must be given discounts to stay on the app, and when should those discounts be triggered? For a new user who has just landed on the app, what is the right ad to show?
Stakeholders need to make decisions, and you should never assume that you fully understand what those decisions are, and you should definitely not assume that stakeholders will be able to naturally map your findings to their decision.
Asking questions like “who”, “what”, “where”, “when” and “why” helps you create a map of decisions and outcomes that will need to be considered when implementing the solution you eventually develop.
Here are some guidelines for mapping out relevant decisions.
Consider timing. The problem should be framed in a way that enables the decision to be made with respect to time. For example, when should a particular ad be shown to a user for maximum conversion?
Also, clarify expectations for the stakeholders and understand the downstream impact of your solution, i.e., on business finances, employees, and business partners.
Some problems might not be data science problems. Ensure the problem is solvable in principle, that is, that it can be solved using one of the data science techniques such as supervised or unsupervised learning.
Analyse every data science problem in a way that leads to quantifiable impact for users, such as an increase in daily active users, and quantifiable impact for stakeholders, such as an increase in revenue at lower cost.
Now you have defined your problem: “Which users should be given a discount to prevent them from churning in the next 70 days?”
Set analytic goals and scope your solution
Set objectives and define milestones
Translate the defined problem into analytical needs.
In most cases, you cannot and should not wait for other people to tell you
what steps you need to take to solve a problem. Part of your job as a data scientist is to define the path to a solution, not just take the path others have
laid down. Do not think about methods and algorithms yet. Your task right now is to plan out what a viable solution will look like.
What analytics goal do you need to accomplish in order to claim you have found a solution?
What are the options for reaching those goals?
Which options are cost-effective?
How will you measure the extent to which your proposed solution addresses the business problem?
For example, the goal of “which user should be given a discount to prevent them from churning in the next 70 days” is clear enough from a business perspective, but in terms of running an actual analysis, we need to further break it down into smaller milestones. It often helps to re-frame the business goal as a question.
The right granular milestone questions are important; they guide you to your goal:
- How do we identify customers that are going to churn in the next 70 days?
- What criteria should be used to determine who should be given a discount?
- What features can be used to differentiate churners from non-churners?
- What is the lifetime value for each customer?
- How do we determine when to trigger them with a discount, what data do we need?
These questions also guide you in thinking of important data points while solving your problem; you can bucket important data features. For our example, critical features would be customer age bucket (young, adolescent, young adult, middle, old), number of site visits, customer spend, location, time spent on the site, and product clicks.
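To make the bucketing idea concrete, here is a minimal sketch in Python that maps raw ages into the brackets mentioned above. The boundary values are illustrative assumptions, not rules from the text; in practice they would come from the business or from the data.

```python
def age_bracket(age: int) -> str:
    """Map a raw age to one of the buckets listed above.
    The cut-off values are hypothetical and only for illustration."""
    if age < 13:
        return "young"
    elif age < 18:
        return "adolescent"
    elif age < 30:
        return "young adult"
    elif age < 55:
        return "middle"
    else:
        return "old"

# Derive the engineered feature for a couple of toy customer records.
customers = [{"id": 1, "age": 24}, {"id": 2, "age": 61}]
for c in customers:
    c["age_bucket"] = age_bracket(c["age"])
```

The same pattern applies to the other features: derive a coarse, interpretable bucket from a raw value so that later analysis and metrics can be reported per group.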
Thinking in terms of milestones helps you foresee dependencies.
Design Minimum Viable Product
After defining your problem and setting your milestones, you want to start building the solution. As a data scientist, you want to build a minimum viable product that allows you to provide value to your stakeholders in smaller increments.
For example, a client wants to build a mansion. Inexperienced data scientists will try to figure out how to build the mansion they were asked for. Experienced data scientists will figure out how to build a shed, then how to turn the shed into a tent, then into a hut, a bungalow, a storey building, and finally a mansion.
A data science project should be done in a way that gives incremental value to the client within the project timeline.
A minimum viable product could be an analysis report, an analytical dashboard, an interactive web app, a mobile app, or an API.
The MVP will depend on the stage and scope of the project. It is important to consider the following questions when building it:
- What is the smallest benefit stakeholders could get from the analysis and still consider it valuable?
- When do stakeholders need results by? Do they need all the results at once, or do some results have a more pressing deadline than others?
- What is the simplest way to meet a benchmark, regardless of whether you consider it the “best” way?
Building an MVP allows the stakeholders to be involved in shaping the product at each version and reduces the risk of having to throw away months of work because of misunderstood or miscommunicated requirements.
The typical journey of a data science product is:
Descriptive solution — tells you what happened in the past.
Diagnostic solution — helps you understand why something happened in the past.
Predictive solution — predicts what is most likely to happen in the future.
Prescriptive solution — not only identifies what’s likely to happen but also provides insights and recommends actions you can take to affect those outcomes.
A data scientist should plan in sprints, think modularly and get regular feedback from the stakeholders.
Identify target metrics
Having a target metric is important because it tells you and your stakeholders how successful your data science solution is in solving the business problem.
One way to look at it is as the actual outcome the business stakeholder wants to achieve. A metric must be measurable.
One could say the target metric for the business problem “which user should be given a discount to prevent them from churning in the next 70 days” is “is this user likely to churn in the next 70 days? (Yes/No)”, a classification problem.
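A minimal sketch of how such a binary target could be labelled from historical data, assuming we know each user's last-active date. The function name, the reference date, and the idea of deriving the label from inactivity are all assumptions for illustration; real labels would be built from event logs, with features taken from before the labelling window.

```python
from datetime import date, timedelta

CHURN_WINDOW_DAYS = 70  # the window from the problem definition

def churn_label(last_active: date, as_of: date) -> int:
    """Return 1 if the user has been inactive for the full 70-day window,
    else 0. Hypothetical labelling rule, shown only as a sketch."""
    return int((as_of - last_active) >= timedelta(days=CHURN_WINDOW_DAYS))

as_of = date(2021, 6, 1)
labels = {
    "long_inactive": churn_label(date(2021, 1, 1), as_of),   # churned
    "recently_active": churn_label(date(2021, 5, 20), as_of), # retained
}
```

Making the labelling rule explicit like this is itself part of translating the business problem: the 70-day window moves from a vague goal into a testable definition.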
On a side note, you may have extremely high confidence in the quality of your analysis, and yet the results of the analysis might not be cost-effective for the business to implement. It is important to consider the following questions to guide your thinking:
- Why should anyone trust the results of this analysis?
- What is the confidence in the prediction? Can stakeholders act on the suggestion directly, or are other checks needed?
- Where does the bulk of the value come from? Are there parts of the analysis that are more valuable than others?
- Along with the provided solution, can you solve other problems?
Here are some guidelines for selecting good metrics.
Think explicitly about trade-offs. Almost any metric will involve a trade-off. For example, in a classification problem, “precision” focuses on minimizing false positives, while “recall” focuses on minimizing false negatives. False positives might be more important to the business than false negatives, or the reverse could be true.
Which is more harmful: identifying a loyal customer as likely to churn, or identifying a likely-to-churn customer as loyal? The stakeholders want to identify customers that are likely to churn, so identifying likely-to-churn customers as loyal would not help the business. Hence we want to reduce false negatives, so a high-recall model would be more suitable.
Find out the business’s “value” units: Find out what unit of value your stakeholders think in, and estimate the value of your analysis using that unit. For example, stakeholders have said that they want to reduce churn, but upon further investigation, you might find that what they really want is increased daily active users, which in turn impacts revenue.
Subset all metrics. An analysis should almost never have only one set of metrics. All metrics used for the analysis as a whole should be repeated for any relevant subsets: customer age bracket, customer spend, site visits, etc. An analysis may perform very well on average but abjectly fail for certain subsets.
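A sketch of subsetting a metric, here recall, by the age buckets from earlier. The rows and their labels are made up; the point is the pattern of grouping predictions before scoring, so a weak subset cannot hide behind a good average.

```python
from collections import defaultdict

def recall(pairs):
    """Recall over (true, predicted) label pairs; None if no positives."""
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else None

# (age_bucket, true_label, predicted_label) rows; values are illustrative.
rows = [
    ("young adult", 1, 1), ("young adult", 1, 0),
    ("old", 1, 1), ("old", 1, 1), ("old", 0, 0),
]

by_bucket = defaultdict(list)
for bucket, t, p in rows:
    by_bucket[bucket].append((t, p))

subset_recall = {b: recall(pairs) for b, pairs in by_bucket.items()}
```

Here the “young adult” subset scores half the recall of the “old” subset, the kind of gap that an overall number would smooth over.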
Make metrics as non-technical and explainable as possible; stakeholders need to be able to understand whatever metrics you use.
A good target metric keeps you from pursuing interesting analytic questions that don’t ultimately lead to value for the business. It keeps you focused on explaining and justifying your work, which helps those around you support you better.
At this point, you know how and why the problem is important to the business. You’re focused on a specific decision stakeholders need to make. You’ve identified what metrics you will need to make your case to the company’s stakeholders.
Plan the analysis
Plan your datasets
It is important that you consider the following questions when collecting data:
- What data is available to answer your questions, and is that data sufficient for you to give an answer you feel good about?
- How difficult is it to obtain the data that you are looking for? Is the data in the public domain, or does it incur costs to obtain the data that you need?
- What is the form factor for the data you need? Is it in a neatly labelled format? If not available in the required format, how much effort does it take to label the data?
- Which data can be acquired easily, which data needs additional effort to acquire? Align your milestones to make a minimum viable product with easily acquired data first and then add more and more data.
- Do all the data you need exist in datasets that can be easily joined together? Or will you have to spend time figuring out how to link records across datasets?
- How many pieces of data that you want can actually be missing or inaccessible before you decide that the analysis is simply not feasible?
You will need to standardize the different columns and get the data into the format you need. There can be a lot of inconsistencies in the data, and cleaning and transforming it becomes really crucial. As you go deeper into data wrangling, analysis, and aligning the data to the problem, more such challenges will arise that need to be overcome.
Always remember that the key to solving the problem is obtaining, cleaning, and wrangling the data. An estimated 80% of the effort is spent in this stage, so you need to be patient and try to question the data at every stage.
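One small, concrete slice of that standardization work is normalizing column names so datasets from different sources line up. A sketch, with hypothetical raw column names:

```python
import re

def standardize_column(name: str) -> str:
    """Normalize a raw column name to snake_case.
    A tiny slice of data cleaning; real wrangling goes much further
    (types, units, duplicates, invalid values, ...)."""
    name = name.strip().lower()
    name = re.sub(r"[^\w]+", "_", name)  # collapse runs of non-word chars
    return name.strip("_")

raw_columns = ["Customer Spend ($)", " Site-Visits ", "location"]
clean = [standardize_column(c) for c in raw_columns]
```

Small, mechanical helpers like this pay off because every downstream join, feature, and metric refers to columns by name.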
Data is the key to success or failure for any data science project.
Ensure you check for sufficiency of data to solve the problem. It’s your job as a data scientist to identify data problems before you conduct your analysis and to only spend your time trying methods that are appropriate to the situation. Sometimes you won’t even realize that a crucial data point is missing until you are in the thick of your analysis.
Identify all dataset needs ahead of time. Make sure you have all the pieces to the data puzzle available. For example, you could say: “customer age bracket, site visit, location, customer spend to start with.”
Focus on necessity: the things you need in order to proceed with your work. It is harder to focus on sufficiency: even if you have everything you need, that doesn’t mean you’ll still be able to complete the analysis as planned.
If data from different datasets don’t have a common key on which to join the information, or you can’t get access to some datasets even though they exist, or some of the data have so many missing values that they cannot support your use case, then your analysis will disappoint both you and your stakeholders.
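Join coverage is cheap to check before committing to an analysis. A sketch with two hypothetical record sets keyed by customer id; in practice these would come from two different databases or exports:

```python
# Hypothetical record sets keyed by customer id (toy data).
profiles = {101: {"age_bucket": "young adult"}, 102: {"age_bucket": "old"}}
transactions = {101: {"spend": 250.0}, 103: {"spend": 80.0}}

# Which ids can actually be joined, and which exist on only one side?
common = profiles.keys() & transactions.keys()
only_profiles = profiles.keys() - transactions.keys()
only_transactions = transactions.keys() - profiles.keys()

# Fraction of all known customers for whom a full joined record exists.
coverage = len(common) / len(profiles.keys() | transactions.keys())

# Join only where both sides are present.
joined = {cid: {**profiles[cid], **transactions[cid]} for cid in common}
```

If coverage comes out far below what the analysis needs, that is exactly the kind of data problem worth surfacing to stakeholders before the work begins, not midway through it.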
Understand the data-generating process. Even if the data technically exists somewhere in a database, take the time to figure out how it got there. Understand any ways it was filtered, transformed, or otherwise processed before it reached the place where you will receive it.
Focus on data refresh cycles: How old is the data? When does it get updated? How is it updated? Who or what decides when it is updated?
Know when additional data collection is necessary. Sometimes the only way to complete an analysis is to collect more data. If additional data collection isn’t possible, then the scope and goals of the analysis need to be renegotiated with stakeholders.
It is always easier to plan for contingencies before you begin your analysis than it is to try to adapt in the middle of your work as deadlines approach. It ensures you have all of the support you need, and if stakeholders are made aware of problems in the data from the start, they will be more patient and sympathetic when you face delays or unexpected obstacles.
Identify the bare minimum data you can collect, along with the sources from which you would collect it. Identify the challenges you anticipate. Look at apps in the same domain as the problem you are solving and note the different data points that are present.
Some of those problems manifest themselves only through careful Exploratory Data Analysis (EDA). It’s easy to look at a column name and assume the dataset has what you need. Because of that, it’s very common for data scientists to find out, at least halfway into their analysis, that the data they have isn’t really the data they need. Hence a thorough EDA is essential before applying any methods. If you can answer most of the questions in the EDA phase and surface the right insights for the stakeholders, that is itself a huge value add.
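A first EDA check that catches the “column name promises more than the data holds” trap is a missingness report. A sketch over toy records, where `None` stands in for a missing value:

```python
def missing_report(records):
    """Fraction of records missing (None) each field.
    A first-pass EDA check before trusting that a column actually
    contains what its name promises."""
    fields = {f for r in records for f in r}
    return {
        f: sum(1 for r in records if r.get(f) is None) / len(records)
        for f in fields
    }

# Toy records: half the spend values and a quarter of locations are missing.
records = [
    {"customer_spend": 120.0, "location": "NY"},
    {"customer_spend": None, "location": "SF"},
    {"customer_spend": 80.0, "location": None},
    {"customer_spend": None, "location": "LA"},
]
report = missing_report(records)
```

A column that is half empty may still be usable, or it may sink a milestone; either way, it is far better to learn that during EDA than mid-analysis.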
Plan your methods/models
Which methods/models are inappropriate for your analysis? Of those methods/models that are appropriate, what are the costs and benefits of using each one? If you find a number of methods that are appropriate and have roughly the same costs and benefits, how do you decide how to proceed?
This is the core competency of a data scientist: choosing and using analytic techniques to derive value from data. Identify unsuitable models first. Judge whether a black-box solution would suffice for the business needs, or whether the model needs to be interpretable so you can explain the results to the stakeholders.
Keep constraints in mind. If your preferred method requires a GPU but you don’t have easy access to a GPU, then it shouldn’t be your preferred method, even if you think it is analytically superior to its alternatives. Similarly, some methods simply do not work well for large numbers of features, or only work if you know beforehand how many clusters you want. Save time by thinking about the constraints each method places on your work, because every method carries constraints of some kind.
Choose boring technology. Analytic approaches like deep learning and reinforcement learning are exciting. As a general rule, the more exciting the technology is, the less you should use it.
When technologies are new, they are less stable and harder to support and maintain. It’s often not particularly fun to implement a simple heuristic or use a model that has been around for decades, but that is often the most appropriate choice for a business even though it is less appealing. Look for surprises in your data, not in your technology, and you will tend to build tools that last longer and work better.
Even after you have eliminated unsuitable methods and further narrowed down your list to accommodate your project’s constraints, you will still likely have more than one method that could plausibly work for you. There is no way to know beforehand which of these methods is better; you will have to try as many of them as possible, and try each with as many initializing parameters as possible, to know what performs best.
You will probably run out of time before you run out of models and configurations to try. Don’t fall into the trap of thinking you need to ask for more time in order to test everything; set yourself a time limit and go with the best you have at the end of that time. A time limit keeps you from wasting time on methods that will not ultimately suit your purpose. If a method works beautifully but does not work at scale, and you need it to work at scale, then it is not a good method to choose. If a method can’t handle a high number of variables without overfitting, and you have a high number of variables, it is not a good method to choose.
Keep your work compatible with the rest of the business. Be a good colleague and think about how your work is going to impact others. Your work shouldn’t just accomplish your own commitments to stakeholders. It should make it as easy as possible for others, such as engineers, to accomplish their commitments. Build things in a way that others can use them as easily as possible.
In our example, there are constraints on deployment costs, so the use of a GPU must be avoided. Considering all this, logistic regression or random forests might be the right choices for modelling.
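The “time limit, then go with the best you have” discipline can be sketched as a simple loop. Everything here is a placeholder: the candidate names, the stand-in scores, and the `evaluate` callable, which in a real project would fit each model and cross-validate it against the target metric.

```python
import time

def best_within_budget(candidates, evaluate, budget_seconds):
    """Evaluate candidate models until the time budget runs out,
    then return the best seen so far. `candidates` maps a name to
    whatever `evaluate` can score; both are placeholders in this sketch."""
    best_name, best_score = None, float("-inf")
    deadline = time.monotonic() + budget_seconds
    for name, model in candidates.items():
        if time.monotonic() > deadline:
            break  # out of time: ship the best so far instead of stalling
        score = evaluate(model)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

# Stand-in evaluation: the scores are made up; a real `evaluate` would
# train the model and return, e.g., cross-validated recall.
scores = {"logistic_regression": 0.81, "random_forest": 0.84}
name, score = best_within_budget(scores, lambda s: s, budget_seconds=1.0)
```

The structure matters more than the numbers: the loop makes the time-boxing explicit, so the comparison ends with a defensible choice rather than an open-ended search.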
What you have just seen is a representation of how a business problem is converted into a data science problem and the steps involved in the analysis. The approach and the questions might differ from case to case, but overall these guidelines will get the job done.