What is SEMMA?
SEMMA is a data mining process developed by the SAS Institute. Its name is an acronym of its five steps: Sample, Explore, Modify, Model, and Assess. You can use the SEMMA data mining methodology to solve a wide range of business problems, including fraud identification, customer retention and turnover, database marketing, customer loyalty, bankruptcy forecasting, market segmentation, as well as risk, affinity, and portfolio analysis.
Why SEMMA?
Businesses use the SEMMA methodology on their data mining and machine learning projects to achieve a competitive advantage, improve performance, and deliver more useful services to customers. The data we collect about our surroundings serve as the foundation for hypotheses and models of the world we live in.
Ultimately, data is accumulated to help in collecting knowledge. That means the data is not worth much until it is studied and analyzed. But hoarding vast volumes of data is not equivalent to gathering valuable knowledge. It is only when data is sorted and evaluated that we learn anything from it.
Thus, SEMMA is designed as a data science methodology to help practitioners convert data into knowledge.
The 5 Stages Of SEMMA
SAS presents SEMMA as an organized, functional toolset associated with its SAS Enterprise Miner software. Although the process is less clearly defined for those who do not use that tool, most practitioners regard SEMMA as a general data mining methodology rather than a specific tool.
The process breaks down into its own set of stages. These include:
- Sample: This step entails choosing a subset of appropriate size from the large dataset provided for the model's construction. The goal of this initial stage is to identify the variables or factors (both dependent and independent) that influence the process. The collected information is then split into preparation (training) and validation sets.
- Explore: During this step, univariate and multivariate analyses are conducted to study the relationships between data elements and to identify gaps in the data. Multivariate analysis studies the relationships between variables, while univariate analysis looks at each factor individually to understand its part in the overall scheme. All of the factors that may influence the study's outcome are analyzed, with heavy reliance on data visualization.
- Modify: In this step, the lessons learned during exploration are applied to the sampled data, together with business logic. The data is parsed, cleaned, and, where needed, refined and transformed before being passed on to the modeling stage.
- Model: With the variables refined and the data cleaned, the modeling step applies a variety of data mining techniques to produce a model that predicts the desired outcome from this data.
- Assess: In this final SEMMA stage, the model is evaluated for how useful and reliable it is for the studied topic. The model can now be tested against data, such as the validation set, to estimate how well it performs.
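To make the five stages concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not how SAS Enterprise Miner implements SEMMA: the toy dataset, the function names, the 70/30 split, mean imputation, and the single-variable linear model are all hypothetical choices made for this example.

```python
import random
import statistics


def make_toy_data(n=500, seed=1):
    """Hypothetical dataset: y is roughly 3x + 2, with about 5% of x values missing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        y = 3.0 * x + 2.0 + rng.gauss(0, 1)
        rows.append((None if rng.random() < 0.05 else x, y))
    return rows


def sample(rows, n, train_fraction=0.7, seed=0):
    """Sample: draw a random subset of n rows, then split it into training and validation sets."""
    subset = random.Random(seed).sample(rows, n)
    cut = round(n * train_fraction)
    return subset[:cut], subset[cut:]


def explore(rows):
    """Explore: univariate summary of x, a count of gaps, and the x-y correlation."""
    complete = [(x, y) for x, y in rows if x is not None]
    xs = [x for x, _ in complete]
    ys = [y for _, y in complete]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in complete)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return {
        "missing_x": len(rows) - len(complete),  # gaps in the data
        "mean_x": mean_x,                        # univariate
        "stdev_x": statistics.stdev(xs),         # univariate
        "corr_xy": cov / (var_x * var_y) ** 0.5, # multivariate (Pearson)
    }


def modify(rows, fill_value):
    """Modify: clean the data by imputing missing x values with fill_value."""
    return [(fill_value if x is None else x, y) for x, y in rows]


def model(rows):
    """Model: fit y = slope * x + intercept by ordinary least squares."""
    mean_x = statistics.mean(x for x, _ in rows)
    mean_y = statistics.mean(y for _, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def assess(params, rows):
    """Assess: root mean squared error of the fitted line on held-out rows."""
    slope, intercept = params
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in rows]
    return (sum(squared) / len(squared)) ** 0.5


def run_semma():
    """Run the five stages in order on the toy data."""
    train, valid = sample(make_toy_data(), n=200)
    summary = explore(train)
    # Use the training mean for both sets so no validation information leaks in.
    train = modify(train, summary["mean_x"])
    valid = modify(valid, summary["mean_x"])
    params = model(train)
    return summary, params, assess(params, valid)


if __name__ == "__main__":
    summary, (slope, intercept), rmse = run_semma()
    print(summary)
    print(f"slope={slope:.2f} intercept={intercept:.2f} validation RMSE={rmse:.2f}")
```

Each function corresponds to one stage, so the flow of data from sampling through assessment mirrors the order of the list above. In a real project each stage would be far richer (for example, visual exploration and several candidate models), but the hand-off between stages would look similar.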
How Popular is SEMMA?
In four KDnuggets.com polls conducted between 2002 and 2014, respondents selected SEMMA 7–13% of the time. While this is significantly less than CRISP-DM, it makes SEMMA the second most commonly selected pre-defined framework.
We conducted a similar poll on this site in 2020. Only a single person selected SEMMA. This is not a direct comparison to KDnuggets' polls, as our audience likely has different demographics and our question and answer options were different.
However, anecdotally, we don't encounter many practitioners who have even heard of SEMMA. And given its myopic focus (as discussed in the next section), SEMMA has likely fallen out of favor as more modern and comprehensive data science methodologies have emerged.