Model-based testing areas, tools and challenges: A tertiary study

Context: Model-based testing is one of the most studied approaches by secondary studies in the area of software testing. Aggregating knowledge from secondary studies on model-based testing can be useful for both academia and industry. Objective: The goal of this study is to characterize secondary studies in model-based testing, in terms of the areas, tools and challenges they have investigated. Method: We conducted a tertiary study following the guidelines for systematic mapping studies. Our mapping included 22 secondary studies, of which 12 were literature surveys and 10 systematic reviews, over the period 1996–2016. Results: A 3-level hierarchy of model-based testing areas was built based on existing taxonomies as well as data that emerged from the secondary studies themselves. This hierarchy was then used to classify the studies, tools, challenges and their tendencies in a unified classification scheme. The most studied areas overall are two model paradigms: UML models and transition-based notations, and two test levels: unit and integration testing. Only five studies were found to compare and classify model-based testing tools, which motivated us to classify all tools found into our hierarchy of areas. Most tools fell within the model paradigm area. Over time, tools that test the functional behavior have prevailed, with a recent tendency to support executable tests. With regard to model-based testing challenges, most of them are associated with the model specification area. In addition, a grounded analysis was done on challenges, yielding six categories. Availability is the category where the most challenges are reported. Over time, challenges have moved from complexity to lack of approaches for specific software domains.
Conclusions: Only a few systematic reviews on model-based testing could be found; therefore, some areas still lack secondary studies, particularly test execution aspects, language types, model dynamics, as well as some model paradigms and generation methods. We thus encourage the community to perform further systematic reviews and mapping studies, following known protocols and reporting procedures, in order to increase the quality and quantity of empirical studies in model-based testing.


Introduction
Model-based testing (MBT) is a commonly studied software testing approach in secondary studies. MBT attempts to automate parts of the software testing process, primarily the test design and the test case execution stages. It has been found that MBT detects at least as many errors as manual testing [1]. Investing in MBT has been shown to reduce the overall cost, time and effort of the testing process [2]. Also, MBT may improve testing quality by meeting certain criteria such as code coverage [2]. Finally, tools for model-based testing can help with requirements traceability and evolution, both of which can be difficult to achieve in traditional testing processes [2].
Model-based testing suffers from some limitations. First, it requires a tester with a different skill set than that for manual testing [2]. In particular, the tester needs to be able to abstract and design the model from the requirements, which means she has to know and decide which aspects of the software under test (SUT) should be modeled and which should not. Second, training costs are high (due to the special skills needed) and therefore there is a learning curve before MBT can be used [2]. Third, a considerable investment is required to build the model, since the quality of generated tests depends on the quality of the model.
Recently, a few tertiary studies have been conducted in software engineering [3,4] and software testing [5].
According to Garousi et al. [5], MBT is the testing method with the most secondary studies. There are no tertiary studies specific to MBT, to the best of our knowledge. A tertiary study that aggregates and synthesizes knowledge on MBT from secondary studies can be useful to both academia and industry. Academia benefits from tertiary studies because they point out knowledge areas in which further research is required, and industry benefits from the information regarding novel tools and techniques uncovered by tertiary studies [4].
To address the lack of tertiary studies on model-based testing, our research goal was to identify and characterize secondary studies in model-based testing, in terms of the areas, tools, and challenges investigated. We conducted a tertiary study based on the guidelines described in [7,8] as well as the recommendations stated in [9,10,11]. The studies were characterized in terms of specific MBT areas, model types, test selection criteria, test generation and test execution techniques, testing tools, and challenges. We established three specific research questions to guide our analysis:
RQ1. What model-based testing specific areas have been investigated in secondary studies? Answering this question will enable us to determine which specific MBT areas have been addressed (or not) by secondary studies, including model representations, criteria for test case selection, and techniques for both test case generation and test case execution.
RQ2. How have model-based testing tools been characterized in the literature? This question will allow us to identify existing classifications for MBT tools, as well as the particular characteristics exhibited by current tools, according to these classifications. Furthermore, we will classify MBT tools according to the areas hierarchy derived from RQ1, as a unified classification scheme.
RQ3. What are the challenges reported in the literature of model-based testing? This question will allow us to identify common challenges of current research on MBT, classify these challenges according to the MBT areas hierarchy derived from RQ1, and devise potential areas of future research.
The analysis derived from our research questions RQ1 to RQ3 will enable us to identify areas, tools and challenges in MBT research. It is also interesting to analyze tendencies of these three aspects over time, with the underlying motivation of recognizing which ones are booming, declining or stable in time.
In this paper, we extend our previous publication [6] by broadening the analysis performed for each research question to include trends in areas, tools and challenges over time, with the aim of discovering how research in model-based testing has evolved. In addition, we expanded the results to include counts and percentages of studies and tools per MBT area, which allowed us to offer a richer discussion of the results. Moreover, we performed a new grounded analysis of challenges, resulting in six categories: Efficacy; Availability; Complexity; Professional Skills; Investment, Cost & Effort; and Evaluation & Empirical Evidence. Furthermore, each challenge was also classified according to the top-level areas of our hierarchy. We analyzed these data in three different ways: challenges by categories, challenges by areas, and the relationship between categories and areas.
The remainder of our paper is structured as follows. We discuss the background in Section 2. Section 3 presents the methodology. Section 4 answers our research questions. Section 5 offers our conclusions and future work.

Background
Software testing consists of verifying if a program matches its expected behavior [12]. It can be performed at different levels: unit, integration, and system. Software testing approaches can also be categorized as white-box or black-box, depending on whether they use information about the internal structure of the SUT [12]. Model-based testing is the automation of the design of black-box tests [2]. It is based on a model of the SUT that represents aspects that will be tested. From this model, test cases can be generated in an automated way, usually with the help of a tool. Model-based testing is usually performed at the unit level and covers the functional behavior of the SUT [2].
Utting and Legeard's book on model-based testing [2] is essentially a body of knowledge in MBT, containing existing approaches, process stages, tools and examples. It offers valuable insight for those new to the area. The process of model-based testing has four stages [13], as illustrated in Figure 1: (1) building the model of the SUT from the requirements, (2) choosing the test selection criteria, (3) generating test cases as either abstract test sequences or concrete test scripts, and (4) executing the test cases. The model should be built for testing purposes (i.e., simpler than the model used for development) [2]. An MBT tool may provide support for activities in any of these stages.
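To make the four stages concrete, the following is a minimal sketch, not taken from any of the surveyed studies or tools. It assumes a hypothetical toy SUT (a turnstile) modeled as a finite-state machine, uses all-transitions coverage as the test selection criterion, and generates abstract test sequences by prefixing each transition with a shortest path from the initial state. The names TRANSITIONS, INITIAL, all_transitions_tests and run are illustrative assumptions.

```python
from collections import deque

# Stage 1 (model): hypothetical turnstile model, (state, input) -> next state.
TRANSITIONS = {
    ("locked", "coin"): "unlocked",
    ("locked", "push"): "locked",
    ("unlocked", "push"): "locked",
    ("unlocked", "coin"): "unlocked",
}
INITIAL = "locked"


def shortest_input_paths(transitions, initial):
    """Breadth-first search: shortest input sequence that reaches each state."""
    paths = {initial: []}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        for (src, inp), dst in transitions.items():
            if src == state and dst not in paths:
                paths[dst] = paths[state] + [inp]
                queue.append(dst)
    return paths


def all_transitions_tests(transitions, initial):
    """Stages 2-3: one abstract test sequence per reachable transition."""
    paths = shortest_input_paths(transitions, initial)
    tests = []
    for (src, inp), dst in transitions.items():
        if src in paths:  # transitions from unreachable states are skipped
            tests.append({"inputs": paths[src] + [inp],
                          "expected_final_state": dst})
    return tests


def run(transitions, initial, inputs):
    """Stage 4 stand-in: replay the inputs on the model and return the final state."""
    state = initial
    for inp in inputs:
        state = transitions[(state, inp)]
    return state


if __name__ == "__main__":
    for test in all_transitions_tests(TRANSITIONS, INITIAL):
        print(test)
```

In a real setting, stage 4 would replay the inputs against the SUT through an adapter rather than against the model itself; here the model is replayed only to keep the sketch self-contained.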
Utting et al. [14] attempt to sort out the MBT body of knowledge by defining a taxonomy of MBT approaches. This taxonomy has six dimensions: (1) model scope, which encompasses aspects of the SUT that are being represented by the model, (2) model characteristics, like non-determinism, (3) model paradigm, which groups modeling notations into families, (4) test selection criteria, which covers the criteria that are used to generate the test cases, (5) test generation technology, which comprises techniques that generate test cases from the model, and (6) test execution, which includes the different ways of generating and executing the test cases. The authors suggest that other taxonomies may be defined to express different aspects of MBT approaches or tools.
Several benefits of MBT over traditional testing have been identified [2]. First and foremost, MBT has proven to be more likely to find errors than traditional testing. While the quality of the testing still relies on the experience of the tester (who either creates the model or the actual test cases), its effectiveness is still unquestionable. Second, MBT reduces testing time and costs. Third, test cases generated by an MBT tool are typically of higher quality than those manually written, since the tools can generate large test sets with good coverage (more consistently than a tester). Fourth, MBT can help detect defects in software requirements, given that some failures occur due to unspecified, unknown or wrong behavior from the requirements. Fifth, MBT makes it easier to implement new requirements because the model is usually smaller and simpler than a test suite. Sixth and last, MBT offers better traceability from test cases to informal requirements and test selection criteria.
Limitations of MBT have also been recognized [2]. MBT requires a different set of skills, compared to traditional testing. A skillful traditional tester may find MBT difficult to adopt because of this. Besides, MBT is normally recommended for organizations with a certain level of maturity in their testing process, due to its complexity. Another shortcoming is that MBT is usually applied for functional testing only. Furthermore, MBT tends to have problems with outdated requirements, tends to consume a significant amount of time in manual analysis of failed tests, and can generate many useless test cases [2].
The International Software Testing Qualifications Board's Foundation Level Certified Model-Based Tester Syllabus [15] is a guide for certification on MBT. It explains multiple aspects of model-based testing in a concise way, and it focuses on the practical part of MBT. This guide also covers the important topic of how to evaluate and deploy an MBT approach.
Anand et al. [16] performed an orchestrated survey on existing test case generation techniques. Their paper includes five generation techniques. In their model-based testing section, they cover some of the previous work in MBT. They explain the finite-state machine and labeled transition system approaches. They also showcase some commercial tools, such as Conformiq Designer, Smartesting CertifyIt, and Spec Explorer.
Utting et al. [13] report on updates to the literature of model-based testing over the last 10 years. In this chapter, they give an overview of the updated model-based testing process and its history, and the current state of MBT adoption in the industry. They also present the current modeling languages and test case generation technologies. At the end of the chapter, they suggest recommendations for the MBT research community and offer some ideas in order to assess the existing challenges.
We have found few tertiary studies that include model-based testing as one of the studied approaches, such as [5], but none in which MBT is the main topic. Our study will perform a deeper, more accurate mapping of model-based testing than the existing tertiary studies. We believe it will be useful for those new to MBT research, as a guide for other studies.

Methodology
Our tertiary study was conducted following the methodology stated in [7,8] and the recommendations in [9,10,11,17]. In particular, our search strategy was hybrid, as recommended in [11]. First, we conducted an automated search in the Scopus and Web of Science engines to select a relevant seed set of secondary studies. Then, we performed backward snowballing to allow the identification of new relevant papers.

Search process and study selection
We initially conducted an exploratory search to identify the terms (or keywords) relevant for our mapping. Then we selected this search string: This search string was used to conduct a search on the Scopus (subareas "COMP" and "ENG") and Web of Science online academic engines, up to April of 2017. We looked for studies that contained the search terms anywhere in the title, abstract or keywords fields.

Inclusion and exclusion criteria
Articles meeting the following criteria were included:  When a secondary study was published in more than one journal or conference, both versions of the study were compared for the purpose of data extraction.

Selection process
Our study selection process is depicted in Figure 2. The search string produced an initial set of 47 potentially relevant studies. The inclusion and exclusion criteria were applied to these studies, considering only the title, abstract and keywords fields. Any paper that was irrelevant was excluded. The following steps were performed during the inclusion and exclusion process: (i) all papers were evaluated by five members of the research group, (ii) each member independently screened the paper against the inclusion and exclusion criteria, and decided whether or not to include the paper, offering a justification, (iii) the team met and compared their results. Any disagreement about the inclusion of a paper was discussed until a final consensus was reached, and the decision was documented. In general, we tended to include rather than exclude potentially relevant papers. After applying the exclusion criteria, we were left with 13 studies [S03, S04, S06, S08, S10, S11, S16, S17, S18, S19, S20, S21, S22].
To expand the scope of our automated search, backward snowballing was manually applied to the remaining secondary studies. We further applied backward snowballing to 3 relevant tertiary studies [5,3,4]. This expansion process yielded 9 more papers [S01, S02, S05, S07, S09, S12, S13, S14, S15]. Our automated search was not able to find these papers because they did not match our search string: some targeted specific models such as UML and FSMs; some used a name other than model-based testing, for example, a formal testing approach; and yet others focused on a specific aspect of MBT like test case generation.

Quality assessment
The quality of the secondary studies was evaluated using the Database of Abstracts of Reviews of Effects (DARE) criteria, as in [4,17]. The original DARE assessment is based upon four questions that aim to evaluate how well the systematic review process is reported. These questions are:
Q1. Are the study's inclusion and exclusion criteria explicitly and clearly defined? Y (yes): the inclusion criteria are explicitly and clearly defined; P (partially): the inclusion criteria are implicit; N (no): the inclusion criteria are not defined and cannot be easily inferred.
Q2. Is the literature search likely to have covered relevant studies? Y: the authors have either searched a digital library and included additional search strategies or identified and referenced all journals addressing the topic of interest; P: the authors have searched a digital library with no extra search strategies; N: the authors have searched an extremely restricted set of journals or conferences.
Q3. Did the reviewers assess the quality/validity of the included studies? Y: the authors have explicitly defined quality criteria and extracted them from each primary study; P: the research question involves quality issues that are addressed by the study; N: no explicit quality assessment of individual papers has been attempted, or quality data has been extracted but not used.
Q4. Were the basic data/studies adequately described? Y: information is presented about each paper so that data summaries can clearly be traced to relevant papers; P: only summary information is presented about individual papers, and it is not possible to link individual studies to each category; N: the results of the individual studies are not specified, and the individual primary studies are not cited.
A commonly used convention in DARE is to score each question on the following 3-point scale: Yes = 1, Partially = 0.5, and No = 0, and then add the scores of individual questions for an overall score. A high overall DARE score is usually interpreted as representing a high quality secondary study, meaning we have confidence in its findings [17].
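The scoring convention just described can be sketched as follows. This is only an illustration of the arithmetic; the function name dare_score and the answer encoding ("Y", "P", "N") are assumptions made for the example, and the sample answers are not taken from Table 1.

```python
# Score per answer, following the 3-point scale: Yes = 1, Partially = 0.5, No = 0.
SCORE = {"Y": 1.0, "P": 0.5, "N": 0.0}


def dare_score(answers):
    """Overall DARE score: sum of the scores for the four answers (Q1..Q4)."""
    if len(answers) != 4:
        raise ValueError("DARE uses exactly four questions")
    return sum(SCORE[answer] for answer in answers)


if __name__ == "__main__":
    # Hypothetical study: clear criteria, partial search, no quality
    # assessment, well-described data.
    print(dare_score(["Y", "P", "N", "Y"]))
```

The maximum possible score is 4, reached when all four questions are answered Y.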
The quality assessment of our 22 papers is presented in Table 1. We observe that all but one of the systematic reviews (SR) exhibited high overall DARE scores: between 3 and 4 (4 being the maximum possible score). On the other hand, only one literature survey (LS) attained an overall DARE score of 3, while the others obtained overall scores of 1 or below. Studies with low quality scores were not excluded from our subsequent analysis. This decision was taken because DARE is inherently biased towards systematic literature reviews and their formal and rigorous process [17], usually granting low quality scores to less formal secondary studies such as surveys. Furthermore, DARE would assign high scores to secondary studies that actually report or describe the process steps in detail, yet it does not (and possibly cannot) assess whether the steps were correctly performed.

Data extraction and analysis
The extraction process was performed in four steps, which are described next. The first step consisted of randomly assigning each paper to a pair of researchers (from the research team, including the authors of this paper). In the second step, each researcher independently screened the assigned papers for extraction. The third step consisted of having each pair meet and compare their individual results, discussing about … We analyzed all 22 secondary studies and extracted any relevant data that helped answer our research questions. In Table 2 we list the data items extracted from each study. A spreadsheet was created to aid in the organization and categorization of the extracted data. No statistical data analysis (such as meta-analysis) was performed due to the heterogeneity found among secondary studies.
For RQ1, our initial hierarchy was the union of the categories in our reference taxonomies [14,34] ([34] is S17 in Table 1). Then, we kept adding categories (areas) that emerged from the secondary studies. Since we wanted to create a tree rather than a flat structure (for better understanding and categorization of data), we had to group subareas sharing one or more characteristics into an upper-level node/area. We tried to respect the original category names given by the authors, but sometimes we had to choose (whenever different names were given to the same area), and in one case, we had to create the label. When in doubt about the classification of an approach, we followed our reference taxonomies.
For RQ2, model-based testing tools mentioned in secondary studies were identified. We extracted information reported about each tool, usually based on an existing classification/taxonomy. We then attempted to group them in common categories. Furthermore, we classified each tool based on the hierarchy of areas derived from RQ1 (Section 4.1), as a way to standardize the classification of all tools found. Finally, we also analyzed tendencies of tools over time, along the MBT areas discovered in RQ1.
For RQ3, challenges reported in secondary studies were identified. Then, a grounded analysis was performed to allow categories to emerge from the data. These categories were associated with challenges previously reported in the literature [21,23,22,2,6]. Afterwards, each challenge was classified based on the hierarchy of areas derived from RQ1 (see Section 4.1), in an attempt to standardize the classification of data and the presentation of results. Lastly, we analyzed tendencies of challenges in time, across the MBT areas discovered in RQ1.

Threats to validity
We briefly discuss here the main threats to the validity of our study.
Search terms and digital libraries. In order to minimize the risk of missing relevant studies, we first did an exploratory search in order to identify relevant papers and keywords for our search string. Several trials and validations were performed on the search string by multiple researchers, in order to have a stable and final search string. We only used the Scopus and Web of Science search engines, but to counteract this limitation, we applied snowballing, as recommended by [11]. The snowballing process was a backward process based only on titles, implying the risk of missing relevant studies. Thus, some studies could still be missing from our analysis.
Selection of studies. In order to minimize the researcher bias in applying the exclusion and inclusion criteria, multiple researchers performed various validations of the excluded and included studies. Since we included only secondary studies in model-based testing, our findings apply only in this context.
Data extraction and categorization. In order to minimize possible inaccuracies in the data extraction as well as researcher bias in the classification, various researchers validated the extraction process. Although the categorization was mainly performed by a single researcher, the other researchers gave feedback on the categories, as a way to validate them. Moreover, most of the areas or categories used in RQ1 through RQ3 were based on previous MBT studies. Nevertheless, some subjective interpretation took place when merging categories and determining whether a study provided support for a specific category.

Results and discussion
In this section we present the results of our mapping study, based on 22 secondary studies, and address our specific research questions.

RQ1: Model-based testing research areas
The RQ1 research question helped us identify MBT areas that have been investigated by secondary studies, as well as their tendencies over time. Particularly, we aimed to discover which model representations, test case selection criteria, test case generation methods and test case execution techniques have been most (and least) studied. For this, we built a hierarchy of MBT areas and subareas based upon two existing taxonomies: one proposed by Utting et al. [14] (aimed at MBT approaches) and the other proposed by Marinescu et al. [34] (aimed at MBT tools). Areas (or categories) were added, merged or renamed as papers were read and analyzed. When the hierarchy was finalized, we reviewed all the studies again in order to complete the list of supporting studies per area. The final 3-level hierarchy of MBT areas is shown in Figure 4. References inside internal nodes are studies that actually use or propose the category represented by the node. The reference list per leaf node corresponds to studies that support that specific category. A study was considered to 'support' a category or area if it mentions the category (not necessarily as a mainstream topic). The coloring scheme used in the third level of our hierarchy is as follows: red nodes are highly investigated areas (with 9 or more studies), blue nodes are moderately investigated areas (between 5 and 8 studies), uncolored (or white) nodes are little investigated areas (between 2 and 4 studies), and yellow nodes are scarcely investigated areas (with 1 or no studies).
Some collateral findings emerged during the construction of our hierarchy. Most of them were inconsistencies among classifications offered by different authors. For example, [S12] classified FSM and LTS as state-based models, while Utting et al. [14] and [S17] classified them as transition-based models in their taxonomies. Another example is regression testing, which is classified as a test level by [S04] and as a test type by us and Utting et al. [13]. Additionally, [S21] considered search-based algorithms for test case generation (also known as meta-heuristic search) as a category outside MBT, while other studies [S13, S15, S21] and Utting et al. [14] included it as a test generation method within MBT.
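The coloring scheme used for the third level of the hierarchy in Figure 4 amounts to a threshold rule on the number of supporting studies. A minimal sketch follows; the function name area_color is an assumption made for illustration.

```python
def area_color(num_studies):
    """Map a level-3 area's number of supporting studies to its node color."""
    if num_studies >= 9:
        return "red"      # highly investigated
    if num_studies >= 5:
        return "blue"     # moderately investigated
    if num_studies >= 2:
        return "white"    # little investigated (uncolored)
    return "yellow"       # scarcely investigated (1 or no studies)


if __name__ == "__main__":
    # UML models (16 studies) and transition-based notations (12 studies)
    # are the two most studied areas reported for Figure 4.
    print(area_color(16), area_color(12))
```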
From Figure 4 we see that the two most studied areas throughout the hierarchy are UML models (with 16 studies) and transition-based notations (with 12 studies), both belonging to the model paradigm level-2 area and model specification level-1 area. Likewise, the third and fourth most studied areas throughout the hierarchy are unit testing and integration testing, both belonging to the test level level-2 area and test objectives level-1 area.
Table 3 shows the number of secondary studies found per area of our hierarchy (the specific studies mapped to each area can be found in Figure 4). A single study may provide support for one, multiple, or no areas at any level; hence, it could potentially be added to the paper counts of different areas. However, it would only be counted once for the higher-level (parent) area, to avoid over-counting. Therefore, paper counts of the subareas do not necessarily add up to the total paper count of their parent area, in the same way that paper counts do not add up to our total number of secondary studies (22). Table 3a shows a top-level view of the number of studies per level-1 area of our hierarchy. We observe from this table that the level-1 area with the most studies (19) is model specification, followed in second place by test objectives (with 14 studies), and in third place by test generation (with 12 studies). The level-1 area with the fewest studies is test execution (with just 5 studies). Next we describe the most and least investigated subareas (level-2 and level-3) for each of the top (level-1) areas of our hierarchy.
SUT. Table 3b shows the number of studies for level-2 and level-3 areas belonging to the SUT area. The level-2 area with the most studies is domain (with 8 studies), while the level-2 area with the fewest studies is software paradigm (with 4 studies). The most studied level-3 area of the software paradigm area is object-oriented, with 4 studies. Regarding the domain area, all its level-3 areas (embedded systems, real-time systems, reactive systems, and web applications) have 4 studies each.
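The counting rule behind Table 3 (a study counts once per subarea it supports, and only once for the parent area) corresponds to taking the union of the subareas' study sets. The sketch below illustrates this with hypothetical placeholder study identifiers (P1 to P3) and subarea names; none of these values come from Table 3, and the function name count_studies is an assumption.

```python
def count_studies(subarea_to_studies):
    """Return (count per subarea, count for the parent area).

    The parent count uses the union of all subarea study sets, so a study
    that supports several subareas is counted only once for the parent.
    """
    per_subarea = {area: len(set(studies))
                   for area, studies in subarea_to_studies.items()}
    parent_total = len(set().union(*subarea_to_studies.values()))
    return per_subarea, parent_total


if __name__ == "__main__":
    # Hypothetical data: P2 supports both subareas.
    example = {"subarea_a": ["P1", "P2"], "subarea_b": ["P2", "P3"]}
    print(count_studies(example))
```

With this rule the subarea counts (2 and 2 in the hypothetical data) need not add up to the parent count (3), which is why the counts in Table 3 do not sum to the parent totals.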
Test objectives. Table 3c shows the number of studies for level-2 and level-3 areas of the test objectives area. The level-2 area with the most studies is test level, with a total of 11 studies, whereas the level-2 area with the fewest studies is test artifact, with 8 studies. Among the most studied level-3 areas of test artifact are functional …
Test generation. Table 3e shows the number of studies for level-2 and level-3 areas of the test generation area. The level-2 area of test generation with the most studies is generation method, with 9 studies, whereas the level-2 area with the fewest studies is selection criteria, with 6 studies. We found that the most studied level-3 area of selection criteria is structural model coverage, with 6 studies, while the least studied is ad-hoc test case specification, with one study. Regarding the level-3 areas of generation method, the most studied ones are random and search-based, with 4 studies each, and the least studied is constraint solving, with no studies.
Test execution. Table 3f shows the number of studies for level-2 and level-3 areas belonging to the test execution area. The level-2 area of test execution with the most studies is technology, with 4 studies, while the least studied level-2 areas are mapping and test scaffolding, with 1 study each. The most studied level-3 area of technology is offline, with 4 studies, while the least studied is online, with 3 studies. With respect to the level-3 areas of conformance check, the most studied area is conformance relation, with 3 studies, and the least studied is test oracle, with 2 studies. Additionally, we found that all level-3 areas belonging to mapping and scaffolding have only one study each.

Trends in MBT research areas
We performed an analysis of secondary studies by MBT area through time, in order to determine research trends. The right-most column (Total) shows the total number of studies for each subarea.
Figure 5a shows the research trends for the top (level-1) areas of MBT. Initially (in 1996), the only studied area was model specification. There was a gap from 1997 to 2004, where no studies were reported. Later, between 2005 and 2016, the strongest areas were test objectives and model specification, with constant activity almost every year. There have been two peak years for all MBT areas: 2010 and 2015, when many studies were published. Particularly, in 2015 all areas were studied by at least two papers. More recently (in 2016), there has been activity in all but one area (test objectives).
Figure 5b shows the research trends for the model paradigm subareas. We chose to analyze this level-2 area because it contains two of the most studied areas overall. For model paradigm, we also observe a peak in 2015 across its level-3 areas. Areas with a steady development in time are UML models and transition-based notations. Areas with recent activity include data-flow, operational, and attribute event grammar, each of them having one study in 2015. Stochastic models and state-based (pre-post) notations are also gaining popularity.
Figure 5c shows the research trends for the test level subareas. We chose to analyze this level-2 area because it contains the third and fourth most studied areas overall. Research on unit and integration testing has been steady through time, with a small increase in 2015. On the other hand, system testing has only been studied sporadically. No studies were found at the acceptance testing level.
Figure 5d shows the research trends for the test artifact subareas. We see here two peak years for all level-3 areas: 2010 and 2015. Furthermore, we note that functional and extra-functional behavior are the strongest and earliest subareas, but there has been a recent emergence of architecture and environment artifacts.

Discussion on MBT areas
It is clear from Table 3 and Figure 5 that there have been a few well studied areas, which have dominated the research on MBT over time, but many other areas need to be studied further. Examples of well studied areas in MBT are popular model paradigms such as UML models and transition-based notations, and widespread test levels such as unit and integration testing. Other areas that have been moderately studied are common test artifacts such as functional and extra-functional behavior, popular test types like functional and regression testing, state-based notations, the system test level, and structural model coverage as selection criteria for test generation. Examples of areas that have been little studied fall under software paradigms (particularly, cyber-physical-oriented, aspect-oriented and service-oriented), SUT domain (mobile applications and software product lines), language type (design, test-specific, and generic), model paradigm (data-flow, operational, and attribute event grammar), model characteristics (dynamics), selection criteria (ad-hoc test case specification), test generation method (manual, model-checking, theorem proving, axiomatic and labeled transition system), and test execution mapping and scaffolding. Finally, there are some areas that have not been studied at all: acceptance test level, model scope (input-only or input-output models), constraint solving generation method, and model paradigms such as history-based and functional. We recommend expanding MBT research to cover areas that have been either little studied or not studied at all.
We can also analyze what stages of the MBT process (refer to Figure 1) have been most studied. In order to do that, we first need to establish an association between the MBT stages and our areas. In particular, the model building stage maps directly to the model specification level-1 area, and indirectly to the SUT and test objectives level-1 areas. The test selection criteria stage maps to the selection criteria level-2 area. The test case generation stage maps to the generation method level-2 area. Lastly, the test case execution stage maps to the test execution level-1 area. Having established this mapping, we can now infer where the focus of each MBT stage has been, research-wise. For the model building stage of the MBT process, the focus has been on two modeling paradigms: UML models and transition-based notations, at the unit and integration levels. For the test selection criteria stage, the focus has been on structural model coverage. For the test case generation stage, both random and search-based algorithms have prevailed. For the test case execution stage, there has been a focus on offline technology.
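The association between MBT stages and areas described in this paragraph can be written down as a small lookup table. The following is only an illustrative sketch: the stage and area names come from the text, whereas the dictionary layout, the function name `areas_for_stage`, and the labeling of the unqualified mappings as "direct" are our assumptions.

```python
# Stage-to-area association as described in the text.
# Each entry is (area name, hierarchy level, kind of mapping).
# Assumption: mappings the text does not qualify are labeled "direct".
STAGE_TO_AREAS = {
    "model building": [
        ("model specification", 1, "direct"),
        ("SUT", 1, "indirect"),
        ("test objectives", 1, "indirect"),
    ],
    "test selection criteria": [("selection criteria", 2, "direct")],
    "test case generation": [("generation method", 2, "direct")],
    "test case execution": [("test execution", 1, "direct")],
}


def areas_for_stage(stage, include_indirect=True):
    """Return the names of the areas associated with an MBT stage."""
    return [
        name
        for name, _level, kind in STAGE_TO_AREAS[stage]
        if include_indirect or kind == "direct"
    ]


if __name__ == "__main__":
    for stage in STAGE_TO_AREAS:
        print(stage, "->", areas_for_stage(stage))
```

With such a table, the per-stage focus discussed in the text amounts to looking up the areas of a stage and reading off their study counts from Table 3.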
To analyze the implications of our research for the industry, we compared our hierarchy of MBT knowledge areas with the ISTQB foundation level model-based tester syllabus [40], inspired by what Garousi et al. [5] did in their tertiary literature review. Next, we discuss important MBT areas that are not included in the ISTQB syllabus, as well as sections of the ISTQB syllabus that are not covered by secondary studies.
Knowledge areas that should be considered for inclusion in the ISTQB syllabus are: validity of a particular MBT approach for different SUT domains and paradigms, representational characteristics of models, methods or technologies for automated test case generation, methods for automated conformance checking (such as oracles), and scaffolding support of MBT tools. We also believe that the syllabus should provide more detail for the knowledge areas it currently covers. As an example, the syllabus mentions the concept of modeling notations and some of their existing types, with no further analysis or comparison among them. Lastly, the syllabus should somehow integrate the broad spectrum of existing MBT tools, either as a separate section or as a subsection of 'tool support' for existing sections.
On the other hand, there are some sections of the syllabus that need to be covered by secondary studies: the integration of model-based testing into the software development cycle, the creation and maintenance of an MBT model, best practices for modeling or test selection, and the deployment of model-based testing approaches. In general terms, we see that the research community has gravitated towards secondary studies mainly related to MBT techniques, neglecting important aspects from the industry perspective such as how to integrate, deploy, and adopt MBT in industrial contexts.

RQ2: Model-based testing tools
The RQ2 research question led us to identify classifications that have been used in secondary studies for MBT tools. This research question also motivated us to classify all MBT tools according to the areas hierarchy derived from RQ1, in an attempt to provide a single unified classification scheme. Finally, tendencies of MBT tools over time were also addressed with RQ2.

Existing classifications of MBT tools
In what follows, we present the results with respect to the tool classifications found in secondary studies. We found 11 secondary studies that mentioned MBT tools. Of these, only 5 gathered multiple tools and classified them along dimensions. The other 6 simply mentioned the tools. Several studies hinted at the existence of tools, but offered no name for them. For the purpose of this study, only tools with a name were considered.
Studies that classified tools used similar criteria, which can be generalized as support for the following four dimensions: (i) model specification [S03, S10, S19, S20], (ii) test selection criteria [S10, S17, S19], (iii) test generation [S10, S17], and (iv) test execution [S10, S17, S19]. The most common criterion for tool classification was (support for) model specification, with 4 studies. This criterion corresponds, in most cases, to the model paradigm level-2 area of our hierarchy (in Figure 4), but in some cases it also includes the model scope, language type and model characteristics areas. The next two most common criteria for tool classification were (support for) test selection criteria and test execution, with 3 studies each. The test selection criteria criterion corresponds to a level-2 area with the same name in our hierarchy, while the test execution criterion corresponds to a level-1 area with the same name. The least common criterion for tool classification was (support for) test generation, with just 2 studies. This criterion corresponds to the generation method level-2 area of our hierarchy.
Marinescu et al. [S17] performed a state-of-the-art survey in tool-supported model-based testing by presenting and classifying some of the most mature tools available in the literature. The authors approached this by proposing a taxonomy based on the one by Utting et al. [14]. Many tools were characterized according to this new taxonomy. Shafique and Labiche [S19] studied the support of MBT tools that rely on state-based models. They characterized and compared these tools along several criteria, particularly, their support for test, test scaffolding, and related activities. These criteria were further divided into more specific classes like model-flow, script-flow, and data-flow coverage criteria. This is one of the studies that perform a more in-depth tool classification. Saifan and Dingel [S10] conducted a survey on how MBT has been used for testing the quality attributes of software. In their work, they extended the taxonomy of Utting et al. [14] by adding three additional criteria, and characterized existing tools based on these categories. Siavashi and Truscan [S20] performed a study on the use of environment models in MBT. They gathered a list of tools that use these models, and grouped them by model language. Mahdian et al. [S07] conducted a survey on regression testing that employs UML models. In their study, they cited studies that investigate regression approaches, types of UML diagrams, and proposed tools.
Studies that did not classify MBT tools were also considered for the purpose of our study. In general terms, they either mentioned the tools as examples of particular approaches/models, or commented on primary studies that proposed and experimented with those tools. The information found in this type of study was: the description of the tool, the model or language supported by the tool, how the tool works, the studies that proposed or employed the tool, and its usage in industry. This information was very heterogeneous, but it allowed us to identify tools that were not cited in other papers.
We found a total of 98 MBT tools in the analyzed papers. Approximately 38 of these tools had little or no information, so they could not be associated with MBT areas. The complete list of tools and their characterization can be found at https://tinyurl.com/yc29yxl6. This table is sparse for most of the tools and some areas, indicating that no study covers all the dimensions found.

Classification of tools based on the hierarchy of MBT areas
We classified these tools with respect to the hierarchy of areas depicted in Figure 4, in order to find out which areas are supported by tools and which ones are not. For each low-level (level-3) area of the hierarchy tree (in Figure 4), we list the specific tools that support that area, and highlight in bold the secondary studies from which tools were extracted. By doing this we homogenize the tool data reported by all secondary studies. Tables 4 through 8 show the number of tools that support each area of our hierarchy. A tool may support one, multiple, or no areas at any level, and so the total counts do not add up to the total number of tools (98). In what follows, we summarize the results of this classification, describing the most and least tool-supported areas.
Table 4 shows the number of tools reported for each of the five top-level (level-1) areas of our hierarchy. We found that most tools provided support for the model specification area. Test objectives is the second most supported area, with test execution and test generation following closely behind. No tools were reported to support the SUT area. It is important to note that tools may support multiple areas at a particular level. We count tool support for a particular area as the number of tools that support at least one of its sub-areas. This means that the number of tools that support a particular area may be less than the sum of tool support across all its sub-areas. For each level-2 area of the taxonomy, the number of tools that support the area is shown in parentheses beside its name.
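The counting rule stated here (a tool counts once for an area if it supports at least one of its sub-areas) is a set union rather than a sum. A minimal sketch of this rule follows; the tool names `ToolA`, `ToolB` and `ToolC` and the data layout are hypothetical and are not taken from the tool list of this study.

```python
# Hypothetical data: made-up tools and the level-3 sub-areas they support.
SUBAREAS = {
    "model paradigm": ["transition-based", "UML models"],
}
TOOLS_BY_SUBAREA = {
    "transition-based": {"ToolA", "ToolB"},
    "UML models": {"ToolB", "ToolC"},
}


def tools_for_area(area):
    """Tools supporting at least one sub-area of `area` (each counted once)."""
    supported = set()
    for subarea in SUBAREAS[area]:
        supported |= TOOLS_BY_SUBAREA.get(subarea, set())
    return supported


def subarea_sum(area):
    """Sum of the per-sub-area tool counts (tools may be counted twice)."""
    return sum(len(TOOLS_BY_SUBAREA.get(s, set())) for s in SUBAREAS[area])


if __name__ == "__main__":
    area = "model paradigm"
    print(len(tools_for_area(area)), "tools support the area;",
          subarea_sum(area), "is the sum over its sub-areas")
```

In this toy example `ToolB` supports both sub-areas, so the area count (3) is smaller than the sum over sub-areas (4). The same effect is consistent with the counts reported for the model paradigm area, where the 35 and 17 tools of two level-3 areas together exceed the 48 tools of the level-2 area.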
In what follows, we present and discuss the number of tools that support each area. For each top-level area, we present all of its level-2 areas, and the level-3 areas of each level-2 area that is supported by at least one tool. In other words, we do not enter a level-2 area if no tools support this area.
SUT. No tools were reported as supporting the SUT area. The only study that reviews MBT tools and mentions the SUT area is Saifan and Dingel's [27].
Test objectives. Table 5 shows the number of tools supporting the level-2 and level-3 areas belonging to the test objectives area. Of all level-2 areas, test artifact has the most support, totaling 35 tools. The functional behavior level-3 area is highly supported by MBT tools, with a total of 34 tools. In contrast, the extra-functional behavior, architectural and environment level-3 areas are scarcely supported, with only 3, 0 and 1 tools respectively. Interestingly, only 10 tools are reported to support the test type level-2 area, even though it is closely related to the test artifact area. Among the level-3 areas of test type, functional testing is the most supported one, with 5 tools, followed by regression testing, with 3 tools.
Support for the test level area is reported for only one tool, in the integration level-3 area.
Model specification. Table 6 shows the number of tools supporting the level-2 and level-3 areas belonging to the model specification area. Studies that report on this area focus mainly on the model paradigm level-2 area, with a total of 48 tools. Of these, the majority (35) support transition-based models in level-3. The second most supported level-3 area is UML models, with a total of 17 tools. The remaining level-3 areas of model paradigm have 5 or fewer supporting tools (some have none). This shows a dominance of transition-based models for MBT tools. Eight tools were reported to support the model characteristics area. Of the 8 tools that support the timing level-3 area, all support timing in models, and 6 support untimed models. Of the 5 tools that support the non-determinism level-3 area, 5 allow determinism in models, and 3 allow non-determinism. Of the 6 tools that support the dynamics level-3 area, 5 use discrete models, 1 uses hybrid models, and none support continuous models. Seven tools are reported in the language type level-2 area. Of these, 4 support test-specific languages, 2 support domain-specific languages, 1 supports design languages, and none support the generic languages level-3 area. No tools are reported in the model scope level-2 area.
Test generation.

Trends in MBT tools
We performed an analysis of MBT tools through time, in order to determine trends in tool research along our MBT areas. Figure 6 shows the quantity of MBT tools released by year for (a) the top-level areas of our hierarchy, (b) the modeling notation area, (c) the mapping area, and (d) the test artifact area. The rightmost column (Total) shows the total number of tools reported to support each subarea. This quantity may not match the aggregated sum over all years, as the release date was not available for all tools.
Figure 6a shows the research trends for the top-level MBT areas. The greatest activity in tools was between the years 2002 and 2007, as this period contains peak years for all MBT areas. From 2008 onward, the activity in MBT research decreased, but was still above that of the period before 2002. The year 2007 has been the peak year in all MBT research areas, and 2002 has been the second largest research peak.
Figure 6b shows the research trends for the modeling notation area. For UML, tools first appeared in the year 1997, and have had steady activity since then.
Figure 6c shows the research trends in the level-2 mapping area and its level-3 areas. Research on tools for abstract tests occurred between the years 2002 and 2007, and peaked in its initial year, 2002. On the other hand, research on tools for executable tests peaked in the year 2007, and has been steady over time. While tool support for abstract tests was more popular during the 2002-2006 period, tool support for executable tests became the main research focus in the following years.
Figure 6d shows the research trends in the test artifact level-2 area and its level-3 areas. Research on tools for functional behavior peaked in 2007, and has been active over time. This level-3 area also saw high activity in the years 2004 and 2005. Research on tools for extra-functional behavior started in 2009 but is still an area that needs more research. We did not find any research on tools that support the modeling of architectural descriptions. Regarding tools for environment modeling, only one was found, in the year 1997.
We believe that research in model-based testing tools will probably keep the current trend, focusing on modeling of functional behavior, using transition-based notations, to generate executable tests.
Incorporating non-functional characteristics of the SUT into the models will require expanding the capacity of existing modeling notations. Thus, research on non-functional MBT tools will first require research on new modeling notations. We believe the number of tools with such support will slowly rise in the coming years, as it has in recent years. Lastly, it is possible that tool support for test scaffolding remains low, as this research area has only been active recently.

Discussion on MBT tools
The tools we found with this literature mapping were not explicitly classified in terms of their SUT domain or software paradigm. The absence of a tool classification by software paradigm may be because the term was recently coined. The key paper related to software paradigms was published in 2014 [S16], and the majority of studies that classified tools were published up to 2015 [S03, S10, S17, S19, S20]. Thus, it is possible that the software paradigm dimension was too new to be considered in the secondary studies of our mapping. On the other hand, the lack of tool classifications by SUT domain draws attention, since early MBT studies (from 2007) had already used this dimension (although not for tools) [S04, S05]. It then seems that this dimension has not yet been adopted by MBT researchers to classify tools.
The current state of MBT tools allows testing of mainly functional behavior of a system. It was not until recently that research on MBT tools started to shift towards extra-functional behavior, which happened to be one of the main gaps in MBT tool research. Future research should consider the representation of non-functional aspects of systems, and their corresponding supporting tools.
Most of the MBT tools found in our study use either transition-based or UML notations. This is a trend that has continued over time, in favor of less diversity of modeling notations. We believe this trend will continue due to the popularity and ease of use of those notations. A possible reason for the lack of tools on other modeling notations may be that common generation techniques, such as graph search and model-checking, are applicable only to UML and transition-based notations. This would imply that research on different model notations requires research on new generation methods as well.
One of the most important advances in MBT tools has been the shift from abstract test sequences to concrete test cases. This has allowed an increase in the automation of the tools, and a corresponding decrease in the amount of work testers need to do. However, some tasks must still be performed by a tester: they need to provide scaffolding (in the form of adapters, stubs, and oracles) for the tool to be able to generate executable test scripts. Automation of these tasks is an open area of research.
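To make the role of scaffolding more concrete, the following sketch shows how a tester-written adapter and oracle can turn an abstract test sequence into an executable one. Everything in it is a hypothetical illustration: the `Turnstile` SUT, the abstract action names, and the class and function names are our assumptions and do not correspond to any tool covered by this study. Stubs are omitted because the example SUT has no external dependencies.

```python
class Turnstile:
    """Hypothetical SUT: a coin-operated turnstile."""

    def __init__(self):
        self.locked = True

    def insert_coin(self):
        self.locked = False

    def push_arm(self):
        self.locked = True


# Abstract test, as a model-based generator might emit it:
# a sequence of (abstract action, expected abstract state) pairs.
ABSTRACT_TEST = [("coin", "unlocked"), ("push", "locked"), ("push", "locked")]


class TurnstileAdapter:
    """Adapter: maps abstract actions and states to the concrete SUT API."""

    def __init__(self, sut):
        self.sut = sut
        self._actions = {"coin": sut.insert_coin, "push": sut.push_arm}

    def execute(self, action):
        self._actions[action]()

    def observe(self):
        return "locked" if self.sut.locked else "unlocked"


def oracle(expected, observed):
    """Oracle: decides whether the observed state conforms to the model."""
    return expected == observed


def run(abstract_test, adapter):
    """Execute every abstract step through the adapter and check it."""
    verdicts = []
    for action, expected in abstract_test:
        adapter.execute(action)
        verdicts.append(oracle(expected, adapter.observe()))
    return all(verdicts)


if __name__ == "__main__":
    print("pass" if run(ABSTRACT_TEST, TurnstileAdapter(Turnstile())) else "fail")
```

The generator only needs to know the abstract vocabulary ("coin", "push", "locked", "unlocked"); the adapter and the oracle are the manually written parts, which is where the tester effort discussed in the text is spent.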

RQ3: Challenges
The RQ3 research question helped us identify the challenges related to MBT that have been reported in secondary studies. In particular, it allowed us to identify common challenges in MBT research, classify them into our hierarchy of MBT areas (derived from RQ1), and devise potential areas of future research.
In previous studies, Utting et al. [2] stated that MBT requires different skills than manual testing, and that it is normally used for functional testing only. Dias-Neto et al. [21, 22, 23] pointed out that the main challenges are the cost, effort, complexity and skill required to use MBT approaches. Besides, they found that more empirical evidence is needed to support the selection of MBT approaches and tools. There is little evidence of experiences in the industry, thus more case studies on the application of different MBT approaches are needed. In our previous study [6], we found that the main challenges of MBT were (i) the need for more empirical evidence that supports the selection of MBT approaches and tools, (ii) a better understanding of similarities and differences among MBT approaches, to help in choosing the most appropriate technique and tool for specific operating contexts, and (iii) the need for considering both functional and non-functional testing in MBT studies. Additionally, other challenges found in [6] were related to the efficacy of applying MBT approaches and tools, the availability of MBT tools for different domains and contexts, the skills required from testers to use MBT, as well as the complexity and feasibility of adopting MBT approaches in practice.
We identified a total of 171 challenges related to MBT approaches and tools from secondary studies. In order to categorize challenges in MBT research, we performed a grounded analysis that identified the following six emerging categories: Efficacy, Availability, Complexity, Professional skills, Investment, cost & effort, and Evaluation & empirical evidence. Section 4.3.1 describes the challenges and their mapping, for each of the emerging categories. In addition, challenges were classified according to the top-level areas of the hierarchy presented in Section 4.1, namely: SUT, Test objectives, Model specification, Test generation, and Test execution. A 'General' category was added for challenges that apply to MBT approaches overall. Section 4.3.2 presents the challenges and their corresponding mapping, classified by the top-level MBT areas. Table 9 summarizes the challenge counts and percentages, by emerging category (Table 9a) and by top-level area (Table 9b). Furthermore, the relationship between challenges in these two classification schemes is discussed. Finally, an analysis of trends in MBT challenges is presented.

Relationship between top-level MBT areas and emerging challenge categories
Table 10 shows the number of challenges for each combination of emerging categories and top-level MBT areas. Figure 7 shows this relation in the form of a bubble graph. The bottom axis shows the MBT areas, and the left axis shows the categories of challenges associated with these areas. The size of each bubble is proportional to the number of challenges at the intersection of a challenge category and an area. For each area, we study the types of challenges that are related to it.
SUT. For the SUT area, 3 challenges belong in the availability category. This relation is due to the lack of MBT approaches and tools for particular types of software systems or paradigms. The 2 challenges related to complexity, and the challenge belonging to evaluation, deal with the difficulty of applying MBT to larger software systems.
Test objectives. For the test objectives area, 20 of the challenges belong in the availability category. This relation is due to the lack of MBT approaches that incorporate non-functional, environment, and architecture testing. The remaining 7 challenges belong in the other categories: professional skills, evaluation, complexity, and efficacy. These report on the importance of selecting an adequate MBT approach that adapts to the specific software project.
Model specification. For the model specification area, 14 of the challenges belong in the complexity category. This indicates that modeling as an activity, and the existing model notations, are among the most difficult aspects of model-based testing. The 7 challenges that relate to professional skills follow this same trend, reporting that testers should have sufficient knowledge and skills to be able to apply MBT. Four of the challenges are related to the investment, cost & effort of MBT; they report on the risk of incorrect modeling, which results in increased costs and effort. The 7 challenges related to evaluation & empirical evidence, as well as the 5 about efficacy, suggest that it is necessary to test and improve the efficacy of the existing notations, as they have important drawbacks. Six of the challenges are related to availability; they report on the lack of tools for specific model notations.
Test generation. For the test generation area, 9 of the challenges belong in the availability category. These report mainly on the current lack of support for more complex generation criteria. The 7 challenges related to efficacy, as well as the 6 related to evaluation, state that there is no evidence on many of the existing approaches, which results in problems with the efficacy of these approaches, such as the ability to find faults and the relevance of the generated test cases. The remaining 9 challenges belong in the complexity, professional skills, and investment categories. These describe that testers must be familiar with the complex generation techniques in order to correctly apply MBT.
Test execution. For the test execution area, 6 of the challenges belong in the availability category. These are mainly related to the lack of support for test scaffolding in existing MBT tools, especially the lack of support for the creation of test oracles. The 2 challenges related to evaluation & empirical evidence report on the lack of evidence on the benefits of online and offline approaches.
General. For the challenges that do not belong in a specific area of MBT, 15 belong in the evaluation & empirical evidence category, and 3 belong in the efficacy category. These report on one of the largest challenges in MBT: the lack of empirical evidence on the existing approaches. The 6 challenges in the availability category report the lack of availability of certain tools and approaches reported in studies. The challenges in the professional skills, complexity, and investment categories are related to the increased complexity of MBT approaches, which impedes their adoption by the industry.
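The cross-classification behind Table 10 and Figure 7 is a two-way count of challenges by emerging category and top-level area. A minimal sketch of such a cross-tabulation is given below; the records are hypothetical placeholders (the actual counts are those of Table 10), and the function names and the `max_area` scaling parameter are our assumptions.

```python
from collections import Counter

# Hypothetical (category, area) records; the real data are in Table 10.
CHALLENGES = [
    ("Availability", "Test objectives"),
    ("Availability", "Test objectives"),
    ("Availability", "Test execution"),
    ("Complexity", "Model specification"),
    ("Complexity", "Model specification"),
    ("Complexity", "Model specification"),
    ("Efficacy", "Test generation"),
]


def cross_tabulate(records):
    """Count challenges for every (category, area) combination."""
    return Counter(records)


def marginals(table):
    """Totals per category and per area, as in a two-way summary table."""
    by_category, by_area = Counter(), Counter()
    for (category, area), count in table.items():
        by_category[category] += count
        by_area[area] += count
    return by_category, by_area


def bubble_areas(table, max_area=1000.0):
    """Scale counts so that each bubble area is proportional to its count."""
    largest = max(table.values())  # assumes a non-empty table
    return {cell: max_area * count / largest for cell, count in table.items()}


if __name__ == "__main__":
    table = cross_tabulate(CHALLENGES)
    by_category, by_area = marginals(table)
    print(table.most_common(1), dict(by_category), dict(by_area))
```

The marginal totals correspond to the kind of summary given in Table 9 (by category and by area), while the scaled values could be passed to any plotting library as bubble sizes.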

Trends in MBT challenges
We performed an analysis of challenges by emerging categories through time, in order to determine trends. The results of this analysis are summarized in Table 11 as well as in Figure 8. Table 11 in particular shows the challenge count per year and emerging category. Figure 8 depicts the cross relationship between categories and years, where the left axis shows the categories of challenges while the bottom axis is a timeline. The size of each bubble in this figure is proportional to the quantity of challenges at the intersection of challenge category and year. The earliest challenges in MBT (in 1996) dealt with the complexity of models. The year 2006 had important activity for challenges, having at least two studies in each challenge category. Complexity was the most reported challenge until 2007. In 2007, the largest number of challenges on professional skills was reported. During the first years of MBT, challenges seem to be related to the complexity of both models and tools, as well as to the professional skills required to use MBT approaches. In 2008, challenges related to empirical evidence emerged. Evidence was required to support the selection of MBT approaches and tools, and experiences from industry were needed. Challenges in evaluation and empirical evidence spiked in 2008 and 2009. In the year 2010, all challenge types were reported, with complexity challenges being the most frequent. This trend shifted to efficacy in 2012. From 2012 onward, efficacy and availability challenges became prevalent. The efficacy challenge is related to the correctness of the models in describing the SUT's behavior, as well as the limitations of the models to represent the SUT. Additionally, efficacy also refers to the relevance of the generated test cases, and how these test cases make sense according to the source code of the SUT. The availability challenge refers to approaches, techniques, or tools for specific areas not being available to the public, and being very limited in their automation of MBT stages. In particular, the years 2014 and 2015 exhibited the largest activity in one challenge category: availability. In 2016, the number of challenges in MBT decreased considerably, with attention moving back to the efficacy of the approaches.
Overall, in the early years of MBT, from 2005 to 2010, the research on MBT reported a lack of empirical evidence, a high complexity of the existing approaches, and the requirement of knowledge and skills to apply them. From the year 2012 onward, these problems were reported to a lesser degree, as the importance of availability of tools for a particular MBT area and the complexity of the existing MBT approaches rose. Through the years, challenges related to investment, cost & effort were never the dominant trend in MBT research. This may be due to MBT being still in its early stages, with entities being more interested in having tools and approaches that function correctly, rather than focusing on resource efficiency.

Discussion on MBT Challenges
In general, we found that more empirical evidence is needed to support the selection of MBT approaches and tools. There is little evidence about experiences in the industry [S06], and case studies of practice using different approaches are needed [S03]. Moreover, published evidence is usually biased towards success stories [S03]. More research is needed in evaluating and analyzing the applicability of different models in MBT [S20], and the landscape of MBT approaches and tools should be compared [S07, S19]. It is necessary to evaluate not just small modules, but full systems to find the better suited approach for a specific [...]
In practice, evidence for determining whether MBT approaches are useful in different domains and contexts is necessary to support the selection process. Different factors such as type and level of testing, models, types of software projects, and tool characteristics should be studied to objectively measure the performance, effectiveness, complexity, costs, effort, flexibility, feasibility, advantages and benefits [S03, S04, S05, S06, S13, S20]. Many MBT approaches have not been empirically evaluated or not transferred to industrial environments [S04]. Understanding the similarities and differences among approaches could help determine appropriate MBT techniques and tools for specific operating contexts in practice. For [S11], a real work that remains is fitting specific models to specific application domains because many models have [...]
Some of the main risks associated with MBT are related to the quality assurance of the artifacts used for test generation, the testing schedule planning, the selection of MBT approaches, the construction of behavioral or structural models, the selection criteria for test cases, tracking and change management, manual tasks in the MBT process, and the control of the test generation and execution process [S05]. To increase the practical application of MBT, tools have to support modeling and test generation steps. It is necessary to integrate automated and manual steps because complex non-automated steps make any approach unfeasible. Thus, increasing the automation level of the steps in an MBT approach and reducing the requirements to use these approaches determine the feasibility of using MBT. The construction of models is not a trivial task due to the lack of a systematic methodology and of supporting tools for its automation [S20]. MBT tools should be integrated with the software development processes [S04, S11, S14, S20].
There exist many MBT tools which vary significantly in their specific designs, testing target, tool support, and evaluation strategies [S08]. MBT tools generate many test cases, and not having full automation support for managing them could make the use of MBT infeasible.

Conclusions and future work
In this paper, we have reported the results of a tertiary study that performs a systematic mapping of secondary studies in MBT, published between 1996 and 2016. From an initial set of 47 secondary studies, we ended up analyzing 22 papers (10 systematic reviews and 12 literature surveys), after applying the exclusion and inclusion criteria. The complete data extraction form is available online. We created a hierarchy of MBT areas, partially based on taxonomies from previous studies. We also categorized the MBT tools into this hierarchy. Finally, we grouped and classified the main challenges found in secondary studies.
Our main findings regarding the areas of MBT that have been investigated are the following: (1) the most studied modeling paradigm is UML, (2) the most studied test case selection criterion is structural model coverage, (3) the most studied test case generation methods are random and search-based algorithms, and (4) the most studied test execution area is technology: online/offline. Research on MBT over time has had two strong areas with constant activity: test objectives and model specification.
With respect to model-based testing tools, we found 98 MBT tools which were classified according to model specification, test selection criteria, test generation criteria, and test execution. There are many MBT areas not supported by tools, such as testing of non-functional characteristics, modeling of architecture and environment, multiple modeling paradigms (history-based, functional, and operational), several coverage criteria (data-flow, data, and fault-based), specific test generation techniques (axiomatic and UML-, LTS-, and FSM-based), and test scaffolding support. Very few studies categorize tools by test level, model scope, modeling language, and model characteristics. Also, no studies reported the SUT domain or paradigm targeted by the tool. Research on MBT tools over time has focused on the testing of functional behavior of systems, supporting only a subset of all the existing modeling notations. Tools have also shifted from abstract test cases towards the generation of executable ones. We believe this trend will continue until more powerful modeling notations are investigated.
With regard to challenges in model-based testing, six common categories emerged from a grounded analysis: (1) efficacy, (2) availability, (3) complexity, (4) professional skills, (5) investment, cost & effort, and (6) evaluation & empirical evidence. We found that most challenges were related to availability. We also classified challenges according to our hierarchy of MBT areas, and found that most challenges fell in the model specification area. Challenges in MBT research have evolved over time. At the beginning, the largest problems in MBT were related to the complexity of the approaches, and their lack of empirical evidence. More recently, the greatest challenge identified is the lack of approaches for specific software domains. Furthermore, a challenge that has been present over time is that MBT is regarded as difficult, demanding, and hard to adapt.
Since there are few secondary studies on MBT, we consider it important to develop studies encompassing the following areas: model scoping (e.g., input-only vs. input-output models), model dynamics (e.g., discrete, continuous, and hybrid), uncommon model paradigms (e.g., history-based, functional, data-flow, operational, and attribute event grammars), language types (e.g., design, test-specific, and generic languages), selection criteria like ad-hoc test case specification, test generation methods (e.g., constraint solving, manual, model checking, theorem proving, axiomatic, and labeled-transition system), specific SUT domains (e.g., mobile applications and software product lines), specific software paradigms (e.g., cyber-physical-oriented, aspect-oriented, and service-oriented), and test execution aspects such as mapping (e.g., abstract vs. executable tests) and scaffolding of test cases (e.g., test adapters, stubs and oracles).
Our recommendation for MBT researchers is to use a systematic approach when performing secondary studies, and to follow standard reporting procedures. Systematic reviews scored high in our quality assessment, and tended to offer information in a more structured and useful way. We also encourage the use of existing taxonomies for MBT (including the one presented in this work) when performing primary or secondary studies. Classifying tools and approaches under known classification schemes makes information easier for others to extract, and more comparable across papers. MBT is gaining popularity as a research area, and we believe that the adoption of a systematic methodology is important in order to increase the quality of the studies.
In the future, we plan to expand this tertiary study to cover recently published secondary studies in MBT (2017-2018) and to analyze additional aspects of these studies. A recent search yielded a number of relevant systematic studies that we are interested in reviewing.
I1. Language: English
I2. Type of study: Secondary study (survey, map or review)
I3. Approach: Model based testing

Articles meeting the following criteria were excluded:

E1. Papers not available in full text
E2. Publications not related to the software engineering domain
E3. Non peer reviewed publications

Figure 2: Overview of the selection process used in the study.

Figure 3: Number and type of secondary studies in time.
Figure 5 illustrates this by showing the number of studies by year, for (a) the top-level areas of our MBT hierarchy, along with three selected level-2 areas (that showed high study counts and interesting trends): (b) model paradigm's level-3 areas, (c) test level's level-3 areas, and (d) test artifact's level-3 areas.
(a) Trends of studies over time, for all level-1 areas. (b) Trends of studies over time, for model paradigm's level-3 areas.

Figure 5: Trends of secondary studies in time.
(c) Trends of studies over time, for level-3 areas belonging to the test level area. (d) Trends of studies over time, for test artifact's level-3 areas.

Figure 5: (Continued) Trends of secondary studies in time.
Figure 5b shows the research trends in the model paradigm level-2 area and each of its level-3 areas. Research on state-based notations peaked in 2004. Research on transition-based notations peaked in 2007, and has been steady through the years. Research on stochastic notations started in 2003 but did not continue until 2011. Research on data-flow notations started in 1997 and continued up to 2004, after which a gap occurred until 2014. Research on event grammar notations only appeared in 2006.
(a) Trends of tools over time, for all level-1 areas. (b) Trends of studies over time, for model paradigm's level-3 areas.

Figure 6: Trends of tools in time.
(c) Trends of studies over time, for mapping's level-3 areas. (d) Trends of studies over time, for test artifact's level-3 areas.

Figure 6: (Continued) Trends of tools in time.

4.3.1 Challenges by emerging categories

Efficacy. The efficacy of the existing MBT approaches is addressed in 16 (9%) challenges [S03, S04, S05, S11, S13, S14, S22]. The quality and characteristics of the model affect the quality of the generated test cases [S04, S11, S13, S22], as do the criteria used to generate the test cases [S13]. Defining languages for the model and the test objectives is a key factor [S03]. Test generation algorithms have model exploration problems, in particular when exploring cycles [S12, S13]. Generating test data for test cases is also reported as a challenge for complex software systems [S13]. The generated test cases may not be feasible [S13], and do not consider traceability with test scenarios [S14]. For non-deterministic models, it is difficult to determine the number of test cases required to reach a certain coverage [S22].

Figure 7: Relationship between MBT top-level areas and challenge categories.
[S19]. Practitioners faced problems related to learning and understanding the features of MBT tools [S19, S20]. Many of them can generate executable test cases, but some still generate only abstract tests [S17]. In many cases, there are no tools available for modeling and generating tests in a particular domain, and custom-made tools are needed [S03]. In other cases, MBT tools are not available to the public [S18, S21]. Furthermore, most open source tools have incomplete or outdated documentation [S19].

Table 1: Quality assessment of secondary studies, based on DARE.

Table 2: Extracted data items.

Table 3: Number of secondary studies per MBT areas.
(b) Number of studies per level-2 and level-3 areas belonging to the SUT area. (c) Number of studies per level-2 and level-3 areas belonging to the test objectives area.

Table 3: (Continued) Number of secondary studies per MBT areas.
(d) Number of studies per level-2 and level-3 areas of the model specification area. (e) Number of studies per level-2 and level-3 areas belonging to the test generation area.

Table 3: (Continued) Number of secondary studies per MBT areas.
(f) Number of studies per level-2 and level-3 areas belonging to the test execution area.
behavior (or properties), with 6 studies, and extra-functional behavior (or non-functional aspects), with 5 studies. Among the least studied level-3 areas of test artifact are architectural description and environment, each with 2 studies. The most studied level-3 areas of test type are functional and regression testing, with 5 studies each, while the least studied is non-functional testing, with 4 studies. On the other hand, the most studied level-3 areas of test level are unit testing, with 10 studies, and integration testing, with 9 studies. Interestingly, we did not find any study on acceptance testing, and only 5 on system testing.

Model specification. Table 3d shows the number of studies for level-2 and level-3 areas belonging to the model specification area. The level-2 area of model specification with the most studies is model paradigm (with a total of 19 studies), while the least studied level-2 area is model scope (with 0 studies). None of the level-3 areas of model scope have studies. Regarding the level-3 areas of language type, the most studied is domain-specific, with 2 studies. All the other level-3 areas of language type have 1 study each. We also found that the most studied level-3 area within model paradigm is UML (with 16 secondary studies), followed by transition-based notation (with 12 studies). Within UML, behavioral models (including state machine, sequence and activity diagrams) are the most common, with 16 secondary studies in total. Within transition-based notations, finite state machines are the most common, with 9 secondary studies. The least studied level-3 areas of model paradigm are history-based and functional, with no studies at all. Furthermore, with regard to the level-3 areas of model characteristics, the most studied ones are timing and non-determinism, with 2 studies each, whereas the least studied one is dynamics (i.e., discrete, continuous, hybrid), with only 1 study.

Table 5: Tools per level-2 and level-3 areas belonging to the test objectives area.

Table 6: Tools per level-2 and level-3 areas belonging to the model specification area.

Table 7: Tools per level-2 and level-3 areas belonging to the test generation area.

Table 8: Tools per level-2 and level-3 areas belonging to the test execution area.
Table 7 shows the number of tools supporting the level-2 and level-3 areas belonging to the test generation area. The selection criteria level-2 area is supported by 30 tools. Of these, 26 support structural model coverage, 11 support data coverage, 10 support requirements-based coverage, 2 support stochastic and random criteria, and none supports fault-based coverage or ad-hoc test case specifications. On the other hand, 29 tools support the generation method level-2 area. Of these, 11 tools support model checking and 9 support graph search, while the remaining areas have 4 or fewer supporting tools. No tools were reported to support the UML, axiomatic, LTS and FSM generation methods.

Test execution. Table 8 shows the number of tools supporting the level-2 and level-3 areas belonging to the test execution area. The technology level-2 area is supported by 35 tools. Of these, 30 perform offline testing and 15 perform online testing. The conformance check level-2 area is supported by 17 tools, 14 of which belong to the test oracles level-3 area and 2 to the conformance relation level-3 area. The mapping level-2 area is supported by 33 tools, of which 24 support executable test cases and 9 support abstract test cases. The scaffolding level-2 area is supported by 13 tools. Among the level-3 areas of scaffolding, 9 tools support the creation of adapters, 4 support test oracles, and 2 support test stubs.
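Because a tool can be classified under several level-3 areas at once, the level-3 counts reported above need not add up to the corresponding level-2 total (for example, 30 offline plus 15 online tools against 35 tools for technology). The following Python sketch illustrates this kind of multi-label tally. The tool names and their labels are hypothetical and are not data from this study, and the helper names `count_tools` and `min_overlap` are ours.

```python
from collections import defaultdict

# Hypothetical tools and labels, for illustration only (not data from the study).
# Each label is a (level-1, level-2, level-3) path in the hierarchy of MBT areas.
TOOL_LABELS = {
    "ToolA": [("test execution", "technology", "offline")],
    "ToolB": [("test execution", "technology", "offline"),
              ("test execution", "technology", "online")],
    "ToolC": [("test execution", "technology", "online"),
              ("test execution", "scaffolding", "adapters")],
}


def count_tools(tool_labels):
    """Count distinct tools per level-2 area and per level-3 area."""
    level2 = defaultdict(set)
    level3 = defaultdict(set)
    for tool, labels in tool_labels.items():
        for l1, l2, l3 in labels:
            # Sets ensure a tool is counted once per area,
            # even when it carries several labels in that area.
            level2[(l1, l2)].add(tool)
            level3[(l1, l2, l3)].add(tool)
    return ({k: len(v) for k, v in level2.items()},
            {k: len(v) for k, v in level3.items()})


def min_overlap(count_a, count_b, total):
    """Smallest possible number of tools in both subsets (inclusion-exclusion)."""
    return max(0, count_a + count_b - total)


level2_counts, level3_counts = count_tools(TOOL_LABELS)
```

In this toy data, technology is supported by 3 tools while offline and online are supported by 2 tools each, since ToolB is counted under both. By the same inclusion-exclusion argument, `min_overlap(30, 15, 35)` indicates that at least 10 of the 35 tools reported for technology would support both offline and online testing.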

Table 9: Challenges of model-based testing. (a) Challenges grouped by emerging categories.

Selecting an MBT approach is a complex task [S05, S06], and it requires knowledge of the domain of the software system [S04]. Creating a model for a system is a complex activity [S04, S11, S20], as modeling languages can prove to be hard to use [S03]. It is necessary to validate the model, as well as other artifacts, as part of the MBT process [S04, S16]. Models suffer from implicit complexity in the form of the state explosion problem [S01, S11], which is more apparent with non-deterministic models [S22]. The selection of adequate test criteria has also been reported as a complex activity [S06, S11]. In order to reduce complexity, it has been reported that MBT could automate the most complex parts of its process [S04, S19].

Part of the cost of MBT comes from the learning overhead [S04, S11] and from the creation of test oracles, which may be as complex as the SUT itself [S03]. The benefits of MBT should be attainable with low investment [S04].

Evaluation & empirical evidence. The evaluation and empirical evidence of existing MBT approaches is addressed in 33 (19%) challenges [S02, S03, S04, S05, S06, S08, S11, S16, S17, S19, S20]. There is a need for comparison of existing MBT approaches, both against traditional testing [S05] and against other MBT approaches [S03, S06, S08, S16]. There is also a need for evidence on the costs, effort and quality of the existing MBT approaches [S04, S06]. It has also been reported that the efficacy of models in different SUT domains should be evaluated [S11]. Problems in existing MBT evaluations include the lack of evaluation of test criteria [S02], the use of small systems [S05], and not using test verdicts [S16]. The complexity of modeling makes MBT approaches less comparable [S19, S20].

4.3.2 Challenges by top-level MBT areas

Test objectives. The test objectives area is addressed in 27 (16%) challenges [S03, S04, S05, S10, S16, S17, S20]. Testing of both functional and non-functional characteristics of software systems is desirable for MBT [S04]. However, the support for non-functional testing in MBT approaches is limited, and is still an open field of research [S04, S05, S06, S10, S16, S17, S20]. MBT approaches should also support testing of architecture [S20] and environment [S20]. Studies on environment modeling do not report how this process was performed [S20]. Selection of an MBT approach depends on the objectives of testing, such as test level and type [S05].

General. The entirety of MBT, without regard for a particular area, is addressed in 56 (33%) challenges. Adoption of MBT is a difficult and complex process [S03], and should be simplified in order to increase its feasibility [S04]. This can be done by automating the most complex parts of the process [S04], or by reducing its skill and knowledge requirements [S04, S06]. Model-based testing has been reported as being costly [S03, S05]. MBT requires the system design to be performed twice, once for development and once for testing [S03], and the correct approach of MBT must be chosen in order to reduce the risk of increased efforts

and do not support certain, more powerful generation criteria [S16, S19]. The selection of which of these criteria to use impacts the final quality of the test cases [S06]. There is a lack of evidence on the efficacy of test generation criteria [S02] and of the existing test generation techniques [S16].

Test execution. The test execution area is addressed in 8 (5%) challenges [S06, S17, S19, S20]. There is more support for offline technology and executable test cases than for online technology and abstract test cases [S17]. Most of the existing MBT tools do not support test scaffolding activities [S19]. In particular, no tools fully support test stubs [S19], and most tools require test adapters to be implemented manually [S20].

SUT. The SUT area is addressed in 6 (4%) challenges [S04, S05, S16]. Software systems are becoming more complex each year, which raises the need for MBT evaluations with large SUTs [S04, S05]. Existing MBT approaches do not specify to which software systems they are applicable [S04], and those that do focus on component-oriented systems [S16]. It is necessary to have knowledge of the SUT before applying an MBT approach [S04].
process [S04, S13]. Models in MBT should be validated [S03], as errors made in modeling impact the quality of the generated tests [S05]. These challenges make learning MBT a complex task overall [S19].

Investment, cost & effort. The investment, cost and effort of existing MBT approaches is addressed in 13 (8%) challenges [S03, S04, S05, S06, S11, S22]. The selection of the correct MBT approach is important, as it leads to a reduction of the effort and costs of MBT [S06]. This leads to the need for empirical knowledge on costs and effort in MBT [S04, S05]. Moreover, MBT requires planning in order to minimize costs and effort [S11].
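The challenge counts in this section are reported together with rounded percentages, such as 16 (9%), 13 (8%), 33 (19%), 27 (16%), 56 (33%), 8 (5%) and 6 (4%). As a sanity check, the following Python sketch searches for the totals that are consistent with all of these pairs. It is only an illustration under two assumptions of ours: that every percentage is taken over the same total number of challenges, and that percentages are rounded to the nearest integer (halves rounded up). The function name `consistent_totals` is hypothetical.

```python
# (count, rounded percentage) pairs as reported in the text of this section.
REPORTED = [(16, 9), (13, 8), (33, 19), (27, 16), (56, 33), (8, 5), (6, 4)]


def consistent_totals(reported, max_total=1000):
    """Return every total n for which each count rounds to its reported percentage.

    A count c rounds to p percent of n when p - 0.5 <= 100 * c / n < p + 0.5.
    Multiplying by 2 * n keeps the comparison in exact integer arithmetic:
    (2p - 1) * n <= 200 * c < (2p + 1) * n.
    """
    totals = []
    for n in range(1, max_total + 1):
        if all((2 * p - 1) * n <= 200 * c < (2 * p + 1) * n for c, p in reported):
            totals.append(n)
    return totals


candidates = consistent_totals(REPORTED)
```

Under these assumptions, the only totals consistent with all seven pairs are 170 and 171, which suggests that the reported percentages are mutually coherent.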

Table 10: Challenge count by top-level MBT area and category.

Table 11: Count of challenges per year, by category.
implicit drawbacks. Also, the number of techniques proposed for test case generation is very large, and deeper insight into the proposed techniques is needed [S16]. In some cases, the generated test cases may become irrelevant due to the disparity between a model and its corresponding code [S11]. A key step in any MBT