
First the Question, Then the Model: CRISP-DM and Rollins Methodology

In data science, success comes not from picking the right algorithm, but from asking the right question. In this post, I compare CRISP-DM and Rollins' 10-Stage Methodology and walk through how a data scientist manages the process step by step. What you'll find here comes purely from my own learning notes: not expert advice, just a student's perspective.

The Key to Success in Data Science Projects: Methodology from Business Understanding to Feedback

In data science projects, there is something just as important as technical knowledge: following the right process. In this post, we will look at two core methodologies. The first is CRISP-DM, which has been the industry standard since 1996. The second is Rollins' 10-Stage Methodology, built on top of CRISP-DM as its foundation. We will cover how both work, what they have in common, and where they differ.

CRISP-DM: The Roadmap of Data Science

Before defining CRISP-DM directly, let's ask ourselves a question to better understand it: imagine we have just received a data science project. We have a large dataset and a problem in front of us. What should we do first? Where do we even begin? Many people might think about jumping straight into the modeling phase. This is exactly where CRISP-DM comes in. The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model developed in 1996 through a collaboration of several companies, designed to make managing data science and data mining projects easier and more structured. In plain terms, it is the systematic answer to the question "How do we structure a data project?"

So why does it exist? Why do we need CRISP-DM and what problem does it solve? If we cannot manage the process systematically when starting a project, our models may produce no value at all, or the quality of the output may be poor. Without a methodology, our projects will most likely run into the following problems:

  1. If modeling starts before the business problem is clearly defined, we may find the right answers to the wrong question. That does not help us solve the actual business problem.

  2. If a model is built without cleaning the data first, poor quality data will not produce quality results, and again we fail to solve the business problem.

  3. If the results are never deployed, there will be no working project to solve the business problem in the first place.

CRISP-DM — 6 Phases

Phase 1 — Business Understanding

What problem are we trying to solve? Moving forward without a clear goal can lead to a serious waste of time in any project. That is why this is the most critical phase. Here we define the business objective, set the success criteria, choose the data mining goal, and assess the risks. The aim is to clarify what problem our model will solve and how the results will serve the business problem.

Phase 2 — Data Understanding

This is where we face the data. The only question we ask is: "What do we have?" Data is collected, an initial exploratory analysis is performed (EDA — Exploratory Data Analysis), data quality is assessed (missing values, outliers, and inconsistencies are identified), and patterns in the data are noticed. By the end of this phase, we start to get familiar with the data and begin to understand it.
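As a rough sketch of what this phase can look like in practice, here is a minimal pandas example on a small invented dataset (the column names and values are hypothetical, chosen only to illustrate the checks described above):

```python
import pandas as pd

# Tiny invented dataset standing in for the project data
df = pd.DataFrame({
    "age": [25, 41, 38, None, 52],
    "income": [3200, None, 5400, 4100, None],
    "churned": [0, 1, 0, 1, 1],
})

# "What do we have?" -- shape of the data
n_rows, n_cols = df.shape

# Data quality: share of missing values per column
missing_ratio = df.isna().mean()

# First summary statistics; extreme values hint at outliers
summary = df.describe()
```

Nothing is fixed at this point; the goal is only to see what the data looks like and where its problems are.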

Phase 3 — Data Preparation

We now understand the data; next comes preparing it for the model. Around 60-70% of the time spent on a project is in this phase. Missing values identified in the Data Understanding phase are filled in (imputation), encoding is applied (categorical to numerical), and new features are derived. The data is first cleaned, then brought into a healthy state ready for the modeling phase.

Phase 4 — Modeling

We now have clean data. It is time to select and train an algorithm. In this phase, a suitable model is chosen (Decision Tree, Random Forest, XGBoost…). The data can also be tested with multiple models, and results are compared to select the one that best fits the business goal.
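That comparison across candidate models can be sketched with scikit-learn on a synthetic dataset; the two candidate algorithms and the cross-validation setup here are illustrative choices, not a prescription:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the prepared business dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Candidate algorithms, evaluated on the same cross-validation folds
candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in candidates.items()}

# Pick the candidate that scores best -- still subject to the business goal
best_model = max(scores, key=scores.get)
```

In a real project the "best" model is the one that best serves the business objective, which is not always the one with the highest raw score.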

Phase 5 — Evaluation

The first question to ask in this phase is: "Does the model actually work?" Our goal is for the model to properly meet the business objective. We check whether the success criteria set in Phase 1 have been reached. Is there anything we missed? Are the results good enough for the business goal? Even if the model is technically successful, if it does not solve the business problem, it is still useless. In that case, we go back to Phase 1.

Phase 6 — Deployment

The model is now ready to be taken into the real world. The business objectives have been met, the problem defined in Phase 1 has been solved, and a working model that produces healthy output has been built. It is now time to deliver this model to business partners. It can be shared as an API, integrated into reporting systems, and handed over in a working state. A maintenance plan is created and put in place to keep the model running properly.

John Rollins Data Science Methodology

John Rollins is one of IBM's senior data scientists. He took CRISP-DM as a foundation and developed a clearer and more didactic methodology on top of it. This methodology has become a standard in data science education programs. Rollins extended the 6-phase CRISP-DM one step further by laying out a 10-stage roadmap. Each stage looks for an answer to a specific question, and in doing so, all stages work together to build a solid foundation for any project.

Why Did Rollins Feel the Need to Develop a New Methodology?

CRISP-DM came out in 1996. For its time, it was truly a groundbreaking framework. However, over time, certain gaps started to appear in the processes of companies and projects. CRISP-DM was still useful, but it had its shortcomings. There was a need to make additions and build a new synthesis that also made use of CRISP-DM.

Shortcomings of CRISP-DM

  1. CRISP-DM gives data scientists the phases, but it does not tell us what question to ask at each one. For an experienced data scientist this may not be a problem, but for someone new to the field it is not clear enough.

  2. After understanding the business problem, the process moved directly into data collection. But Rollins stopped here and asked: "Have we identified the type of this problem? Is it classification? Clustering? Regression?" Moving into data collection without knowing the answers to these questions simply did not make sense.

  3. CRISP-DM ended at the deployment phase. Yet in today's world, models continue to live after they go live. Models degrade over time and need to be updated. Rollins made this a formal part of the cycle.

To sum it up: CRISP-DM tells you what to do, Rollins teaches you why and how to think. That is why Rollins' methodology is highly informative and educational for those who are just getting started in data science today.

10 Stages of Rollins' Methodology

Phase 1 — Business Understanding

This is the starting point of the project. Clients come to the data scientist with a problem, but that problem is usually generalized, raw, and vague. The data scientist removes this ambiguity and steers the project toward a truly measurable goal.

Example: "Our sales are dropping" → "Which customer segment churned in the last 6 months?"

Phase 2 — Analytic Approach

Once the problem is clear, the next question is: "What type of problem is this?" The answer determines what kind of algorithm will be used.

  • Yes/No → Classification

  • Numerical Prediction → Regression

  • Discovering Groups → Clustering

  • "What happened?" → Descriptive Analysis

Phase 3 — Data Requirements

The problem was defined in Phase 1, and the analytic approach was set in Phase 2. Now we ask: "What data do I need to answer this question?" For example: customer demographics, purchase history, last activity date, and so on.

Phase 4 — Data Collection

Once the required data is listed, the collection process begins. We determine where and how the data will be gathered, and work it out in coordination with the client. As data comes in, requirements may change; in that case, we can go back to Phase 3.

Phase 5 — Data Understanding

After the data is collected, exploratory analysis begins. "What is this data telling us?"

Continuing with our sales example:

  • 70% of churned customers have not made a purchase in the last 3 months.

  • The 35-45 age group has the highest churn rate.

  • Customers with a high number of complaints churned three times more often.

  • The income column has 15% missing data.

This is one of the most critical phases, because data quality issues surface here. What matters is not just seeing the data, but understanding and analyzing it deeply. Recognizing which variables carry a strong signal and which are just noise happens here.

Phase 6 — Data Preparation

The data has been analyzed and understood. Now we need to figure out how to build a clean dataset that can be fed into the model. This is usually the longest phase in any project, and around 60-70% of the time is spent here. Continuing with our sales example:

  • The 15% missing data in the income column is filled in (segment average can be used).

  • New features are derived from variables such as complaint count and last purchase date.

  • Categorical variables like location are converted to numerical values (encoding).

  • All numerical values are brought to the same scale.

  • The dataset is split into two parts (for training and testing).

The more carefully this phase is handled, the stronger the dataset becomes. The output of any model is directly tied to the quality of the clean dataset built here.
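The steps above can be sketched with pandas and scikit-learn. The dataset, the segment labels, and the column names are all invented for illustration; only the sequence of operations mirrors the list above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical churn dataset; values are invented for illustration
df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B", "A"],
    "income": [3000, None, 5000, None, 4000, 3500],
    "complaints": [0, 5, 1, 4, 2, 0],
    "churned": [0, 1, 0, 1, 0, 0],
})

# 1) Fill missing income with the segment average
df["income"] = df.groupby("segment")["income"].transform(lambda s: s.fillna(s.mean()))

# 2) Derive a new feature from an existing variable
df["has_complained"] = (df["complaints"] > 0).astype(int)

# 3) Encode the categorical segment column as numeric dummy columns
df = pd.get_dummies(df, columns=["segment"])

# 4) Bring the numerical values to the same scale
df[["income", "complaints"]] = StandardScaler().fit_transform(df[["income", "complaints"]])

# 5) Split the dataset into training and test parts
X = df.drop(columns="churned")
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```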

Phase 7 — Modeling

After the data is prepared, we arrive at what many consider the most exciting phase in data science: modeling. We decide which algorithm to use. This decision was already made back in Phase 2, but the model is also tested with different algorithms and the results are compared to reach the most accurate outcome possible. One thing to keep in mind though: no matter how powerful the model or algorithm is, the results it produces are directly tied to the quality of the work done in Phase 6. That is why Phase 6 sits at the heart of any project.

Phase 8 — Evaluation

The model has been trained. Now we check whether it is actually answering the right question. This phase involves a two-layered evaluation: the technical layer and the business layer. In the technical layer, measurements are taken using the test set. In the business layer, we go back to Phase 1 and investigate whether the model is truly answering the question "Which customer segment churned?" If needed, we can return to Phases 6 and 7 to make improvements. Even if the model is technically successful, if it does not solve the business problem, we cannot move on to deployment.
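Here is a small sketch of the two layers, with invented test-set labels and an assumed recall target (the 0.70 threshold is not from any real project, just an example of a Phase 1 success criterion):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical test-set results: 1 = churned, 0 = stayed
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Technical layer: measurements on the test set
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# Business layer: compare against the success criterion set in Phase 1
# (0.70 is an assumed example threshold)
RECALL_TARGET = 0.70
ready_for_deployment = recall >= RECALL_TARGET
```

If the business-layer check fails, we loop back to Phases 6 and 7 instead of deploying.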

Phase 9 — Deployment

The model has been approved and is working. The next step is taking it into the real world. Based on our sales example, here is how it might go:

  • The model is served as an API and integrated into the CRM system.

  • The sales team can receive a "high risk" list every morning: "These 50 customers may churn this month."

  • A dashboard can be set up for management.

  • A maintenance plan is created, defining who will update the model and when.

Once the model goes live, the work is not fully done. The real lifecycle starts here.
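The morning "high risk" list mentioned above could be produced by a small helper like this; the customer IDs, scores, and threshold are all hypothetical:

```python
def high_risk_list(scores, threshold=0.5, top_n=50):
    """Return customer IDs at or above the risk threshold, highest risk first."""
    risky = [(cid, p) for cid, p in scores.items() if p >= threshold]
    risky.sort(key=lambda item: item[1], reverse=True)
    return [cid for cid, _ in risky[:top_n]]

# Invented churn probabilities coming from the deployed model
scores = {"C001": 0.91, "C002": 0.12, "C003": 0.67, "C004": 0.48}
print(high_risk_list(scores, top_n=2))  # → ['C001', 'C003']
```

In a real deployment this logic would sit behind the API or a scheduled job feeding the CRM system.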

Phase 10 — Feedback

This is the final phase and, at the same time, the point where the cycle begins again. We look for an answer to the question: "Is the model actually working in the real world?" Looking through our sales example:

  • After a certain period, we check back on the model: "Did the customers predicted to churn actually churn?"

  • The sales team's feedback is analyzed: "Is the list working, or are there too many false alarms?"

  • We check whether the model has degraded over time. With new customer behaviors, the model may start producing incorrect results.

Based on the feedback, we go back to Phase 1 and the cycle begins again.
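The first two checks can be sketched with plain Python sets; the customer IDs are invented for illustration:

```python
# Customers the model flagged as likely to churn, and what actually happened
predicted_churn = {"C001", "C003", "C007"}
actually_churned = {"C001", "C007", "C009"}

# Did the customers predicted to churn actually churn?
hits = predicted_churn & actually_churned
field_precision = len(hits) / len(predicted_churn)

# How many false alarms did the sales team get?
false_alarms = predicted_churn - actually_churned
```

A field precision that keeps dropping over time is exactly the kind of degradation signal that sends the cycle back to Phase 1.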

Common Ground and Differences Between the Two Methodologies

Both methodologies share the same core philosophy: data science projects should be carried out through a systematic process, not randomly. Both start with understanding the business problem and end with deploying the model. This structure is common to both.

Throughout the process, both methodologies have a mechanism for going back. Going back is not a mistake; the structure of these methodologies is already cyclical by design. An unexpected issue may come up during data preparation, or a new segment may need to be added at the deployment stage. This cyclical structure ensures that the business problem is solved with a continuously evolving understanding. The model stays up to date, offers new solutions to the problem, and gets better over time.

Finally, both are industry-agnostic. The methodologies can be applied in any field; they are not tied to a specific tool or algorithm. The core goal is to make data processes systematic and ground them in universally applicable rules.

Differences

CRISP-DM tells you what to do, Rollins teaches you why and how to think. CRISP-DM does not specify what question to ask at each phase. Rollins ties every phase to a question, which makes the process much easier to follow.

On top of that, Rollins added two critical stages that are not found in CRISP-DM: Analytic Approach and Feedback. With Analytic Approach, right after the business problem is clarified, the question "What type of problem is this?" is asked. With Feedback, once the model goes live, the cycle formally closes and begins again.

Conclusion

Both methodologies remind us of the same thing: success in data science is not just about picking the right algorithm. Asking the right question, understanding the data, and managing the process in a systematic way are just as critical as building the model itself. CRISP-DM gives us the structure, and Rollins turns that structure into a way of thinking. A good data scientist follows this systematic structure and, most importantly, communicates it with their business partners, just like a storyteller.


First the Question, Then the Model: CRISP-DM and Rollins Methodology | Samet Caner