Attributes Influencing the Reading and Comprehension of Source Code – Discussing Contradictory Evidence

Background: Coding guidelines can be contradictory despite their intention of providing a universal perspective on source code quality. For instance, five attributes (code size, semantic complexity, internal documentation, layout style, and identifier length) out of 13 presented contradictions regarding their influence (positive or negative) on source code readability and comprehensibility. Aims: To investigate source code attributes and their influence on readability and comprehensibility. Method: A literature review was used to identify source code attributes impacting source code reading and comprehension, and an empirical study was performed to support the assessment of four attributes that presented empirical contradictions in the technical literature. Results: Regardless of their experience, all participants showed more positive comprehensibility perceptions for Python snippets with more lines of code. However, their readability perceptions regarding code size were contradictory: the less experienced participants preferred more lines of code, while the more experienced ones preferred fewer. Long and complete-word identifiers presented better readability and comprehensibility according to both novices and experts. Comments contribute to better comprehension. Furthermore, four indentation spaces dominated the code reading preference. Conclusions: Contradictions among coding guidelines still demand further investigation to indicate possible confounding factors explaining some of the inconclusive results.


Introduction
Quality is a complex issue since its perception depends on different people's expectations for a product [1].
Commonly, software quality is assessed through the number of software defects and the effort required for maintenance. Although this perspective is expected given the high costs involved in the evolution of software [2], many other features can influence the perception of quality regarding software products. The readability and comprehensibility of source code represent characteristics used to perceive software quality. Such characteristics impact software maintainability, since reading and comprehending source code are the most time-consuming tasks in software maintenance and evolution, especially when there is a high turnover of developers and the system's documentation is incomplete or outdated [3] [4]. Besides the effort involved in these tasks, low readability and comprehensibility of source code usually cause misunderstandings regarding the software objectives and, consequently, source code reconstruction due to a lack of understanding, which challenges requirements conformance and software evolution and leads to rework [3].
Although there is a consensus that readability and comprehensibility are essential characteristics to determine the quality of source code, the features contributing to increasing or decreasing the levels of readability and comprehensibility are still under investigation [4] [5]. Therefore, source code guidelines are usually developed without a proper theoretical and experimental background [6]. For instance, several programming styles aim at providing standards for coding in different programming languages, presenting a list of source code attributes and their referential values (in the form of rules) that should be observed. However, it is common to come across shared attributes with contradictory thresholds among these standards: the Google Style Guide for Python 1 and for Java 2 present different rules for naming, and the internal style guideline for Python developers in use at Google 3 presents indentation rules that diverge from both the Google Style Guide for Python and the official Python style guide 4 .
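To illustrate the kind of divergence at stake, consider a minimal hypothetical Python pair (the functions and names are ours, not taken from any guide): both definitions are valid, and guidelines disagree only on which indentation width reads better.

```python
# Two functionally identical snippets that differ only in indentation width.
# Python accepts both, which is why style guides can diverge on this rule.

def total_two_spaces(values):
  result = 0
  for v in values:
    result += v
  return result

def total_four_spaces(values):
    result = 0
    for v in values:
        result += v
    return result
```

Which of the two forms readers actually prefer is exactly the kind of empirical question addressed later in this paper.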
The lack of agreement among standards can augment the misalignment regarding source code quality among practitioners, leading to regular reconstructions of code by different developers over time [3]. Avoiding source code reconstruction is a motivation for our work. We observed this issue in one of our collaborations with the Brazilian industry a few years ago. Following an evidence-based approach, we planned and performed a structured literature review to acquire evidence regarding the influence of source code attributes on code reading and comprehension, supporting the assembly and tailoring of coding guidelines aligned with the company's quality goals. However, the identification of contradictory evidence in the technical literature posed additional challenges to proposing an evidence-based solution for the company at that moment.
This topic was previously discussed in [7], which presented contradictory evidence concerning the impact of code size, identifier length, and indentation spacing on code readability and comprehensibility. To investigate whether software development experience and domain knowledge would help explain the divergent influences of these quality attributes, we performed an empirical study that turned out to be inconclusive for some cases. Overall, regardless of programming experience, four spaces for indentation dominated the readability preference of participants in the empirical study. The readability and comprehensibility preferences towards long and complete-word identifiers held for both novice and expert developers. Furthermore, while all participants showed more positive comprehensibility perceptions for Python snippets with more lines of code, their readability perceptions regarding code size were contradictory: the less experienced participants preferred more lines of code, while the more experienced ones preferred fewer. Now, in this extended version, we include further discussions by exploring additional works published in the technical literature in recent years and by contributing to a broader view of the attributes that positively and negatively influence the readability and comprehensibility of source code, describing additional contextual information on the contradictory evidence identified in the technical literature and in the performed empirical study. We hope these discussions can contribute to future investigations in the field of source code quality.
Therefore, besides this Introduction, this paper is organized into four additional sections. Section 2 presents the literature review extension, as well as an overview of 13 source code attributes 5 influencing the readability and comprehensibility of source code and their 94 measurement procedures. In this section, we also highlight the contradictory results identified through the literature review, which led us to plan and execute an empirical study to observe the role and influence of programming experience and domain knowledge in the revealed contradictions. As we detail in Section 3, the study was planned and executed in the context of a software engineering course with 38 undergraduate students with different levels of expertise and domain knowledge, which offered us the opportunity to observe whether these factors might help explain the contradictory results. While programming experience helped us identify a probable root for the inconsistencies found in the technical literature regarding code size (novices tend to require more verbose code, while experts tend to appreciate compact code when assessing readability) and identifier length (inexperienced programmers had more difficulty comprehending short identifiers than experts), it did not explain the contradiction regarding the effect of indentation spacing on readability perception. Section 4 discusses the threats to the validity of both studies (literature review and empirical study), and Section 5 offers the conclusions and future work.

2 Contradictory Evidence on Source Code Readability and Comprehensibility

Readability and Comprehensibility of Source Code
The readability of source code can be defined as a quality characteristic based on the human judgment of how easy it is to read the written code. This feature relates to program elements and their presentation to programmers. Conversely, the comprehensibility of source code is a quality characteristic based on the human judgment of how easy it is to understand the source code. This characteristic relates to the interpretation of source code by its reader. Thus, source code can be readable but not comprehensible, and vice versa, although its readability might affect its comprehensibility. These two concepts are often used interchangeably in technical works due to their close meaning and relation, yet they do not represent the same concept.

Different strategies provide ways to support either source code reading or source code comprehension. During a scoping review aiming at revealing technical works presenting strategies to enhance or support source code reading and comprehension, four distinct categories of works were identified. The first category regards structural code analysis and modification. These works (for instance, by Kreimer [8] and Alshayeb [9]) evaluate the program structure as a whole against refactoring techniques and code patterns, assessing the source code comprehensibility and recommending high-level modifications. The second category represents the works concerned with providing software visualization support for source code comprehension. In this case, the focus is not the program evaluation/improvement per se, but rather its comprehension through the visualization of its structure [10] [11]. The third category organizes the works using colors to highlight the main source code features, especially its control flow, declaration, and data input/output structures [12]. Instead of evaluating the source code, these works aim at calling attention to program elements, simplifying the reading of source code.
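Returning to the distinction drawn at the beginning of this section, a small hypothetical Python pair (ours, not drawn from any cited study) shows how the two characteristics can diverge:

```python
# Cleanly formatted and short, hence arguably readable, but the intent of
# the bit trick is opaque to many readers (low comprehensibility).
def f(n):
    return n & (n - 1) == 0 and n != 0

# Identical logic with an intention-revealing name: easier to comprehend,
# even though the dense one-line body is arguably harder to read.
def is_power_of_two(n): return n & (n - 1) == 0 and n != 0  # exactly one bit set
```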
The fourth category consists of works evaluating source code in the light of its attributes, identifying their influence on its readability and comprehensibility. Works like [13] [14] [15] [4] [16] assess, through empirical studies, code size, complexity, command separation, line length, and indentation, among others, to identify their influence on the readability and comprehensibility of source code.
Even though the first three categories represent essential works to support the source code reading and comprehension, they do not present alternatives to overcome the reconstruction issue we observed during one of our academia-industry collaborations. Our primary goal was to prevent the rewriting of source code (rework) by the practitioners in the company. Therefore, we needed to identify those source code attributes that could be modified to improve the source code quality and to align their different quality perspectives, focusing on the main types of code commonly reconstructed by the developers: code snippets.
From this set of works on readability and comprehensibility of source code, we selected the fourth category as appropriate to support our primary objective of preventing source code reconstruction in the target company. To identify other works in this category and enlarge our dataset, we planned and executed a structured literature review using the fourth category's works as control articles. The literature review was performed in January 2014 to capture papers up to 2013 and re-executed in June 2017 to capture works up to 2016. The following results represent the overall outcome of the literature review.

Literature Review Planning
A structured literature review based on systematic literature review (SLR) practices [17] was undertaken. The decision to perform a less in-depth literature search was based on the time available for executing the task in the context of the academia-industry collaboration. This fact made us simplify some of the SLR steps, such as the search strategy and data synthesis, explained in the following subsections. Table 1 presents the objective set for the structured literature review. During the literature review, we wanted to answer the following research questions, which were also used to guide the rest of the planning:
• Which source code attributes are used to evaluate its readability and comprehensibility?
• What are the measurement procedures for these attributes?
• What are the existing relations between the identified attributes and the quality characteristics (readability/comprehensibility)?

Primary Study Search and Selection Strategies
As our first simplification for the structured literature review, we decided to select a single search engine for gathering the primary studies. Based on our previous experiences conducting systematic literature reviews, we chose Scopus as the search engine, since it has presented high coverage of works from the leading conferences and journals in the software engineering field. To support answering the research questions, we assembled a search string to retrieve specific attributes/measures for source code that would influence the quality characteristics under judgment. Additionally, we used five control articles (Table 2) to gauge the following search string: TITLE-ABS-KEY((metric OR measure OR attribute OR predictor OR evaluation OR assessment OR improvement OR style OR standard OR pattern) AND (readability OR comprehensibility OR understandability OR identifier OR naming OR comment) AND ("software quality" OR "software readability" OR "software comprehension" OR "software understanding" OR "program quality" OR "program readability" OR "program comprehension" OR "program understanding" OR "code quality" OR "code readability" OR "code comprehension" OR "code understanding")). We did not intend to embrace all the technical literature in the field, but rather to gather a set of attributes and their measurement procedures that would support us in assembling some initial evidence-based coding guidelines for the company, preventing its developers from reconstructing source code over time. To suit their development context, we used the categories previously described (Section 2.1) to support the decision for including and excluding papers. We aimed at gathering only papers fitting the fourth category. Thus, we created the following inclusion and exclusion criteria for the primary studies selection:

Inclusion Criteria
• (I1) The paper's quality focus includes the readability and/or comprehensibility of source code snippets; AND
• (I2) The paper presents source code attributes used to assess the quality of source code snippets regarding their readability and/or comprehensibility; AND
• (I3) The paper reports evidence of source code attributes influencing the readability and/or comprehensibility of source code snippets.

Exclusion Criteria
• (E1) The paper is not related to computer science; OR
• (E2) The paper's focus is not software engineering; OR
• (E3) The paper's quality focus is not the readability and/or comprehensibility of source code snippets; OR
• (E4) The paper does not present source code attributes used to assess the quality of source code regarding its readability and/or comprehensibility; OR
• (E5) The paper is not available; OR
• (E6) The paper is not written in one of the following languages: English, Spanish, or Portuguese; OR
• (E7) The paper does not report any evidence of source code attributes influencing the readability and/or comprehensibility of source code snippets.
To select a paper for data extraction, a researcher should go through all the papers returned and assess their titles and abstracts according to the inclusion and exclusion criteria. In case of doubt about the selection of any paper, a second researcher should re-evaluate it, and if doubt concerning the paper's acceptance remained, the paper should be included for a thorough reading. Whenever necessary, the papers' introduction and conclusions should be used during the study selection phase to make sure the papers fulfilled the conditions for their acceptance. In addition, if a paper was included but, after its reading, the researchers noticed it did not satisfy the conditions for its inclusion in the review, it should be excluded.

Data Extraction and Synthesis
The included papers should have the following information extracted: i) quality characteristics definition; ii) source code quality attributes and their measurement procedures; iii) empirical evaluation type; iv) evaluation description; v) characteristics of the source codes used for the assessment (programming language, application, among others); vi) subjects' characteristics (practitioners or academics), when applicable; and vii) results from the evaluation. Based on the control articles identified during the scoping review, we knew it would be hard to aggregate the results of the different included studies, mainly because, while some studies report the same source code attributes, their measurement procedures do not match, making it impossible to combine evidence from different sources. Our synthesis therefore followed a qualitative perspective: we created clusters of similar attributes and then appraised their overall influence on the readability and comprehensibility of source code.

Quality Assessment
When following an evidence-based approach in which the results from the technical literature will affect the selection of a software technology in real development scenarios, the quality assessment of the included papers is mandatory.
In this work, we identified seven criteria to assess the included papers. In the end, each paper would receive a score from zero to ten, used to determine whether its results would be discussed with the company's practitioners or not. It is essential to observe that, while following an evidence-based approach, we had to consider the strength of the evidence as one of the leading items in assessing the quality of a study; for this reason, criterion D accounts for the majority of the assessment score.

Quality Appraisal Criteria
• (A) Does the paper define the identified quality attributes? (number of attributes with a definition / total number of attributes identified in the paper)
• (B) Does the paper describe the attributes' measurement procedures? (number of attributes with measurement procedures / total number of attributes identified in the paper)
• (C) Are the measurement procedures reproducible? (0 = none of them are; 0.25 = most of them are not; 0.5 = half of them are; 0.75 = most of them are; 1 = all of them are)
• (D) What type of empirical study for assessing the influence of the quality attributes on the source code readability and comprehensibility is reported? (1 = proof of concept; 2 = survey or observational study; 3 = case study; 4 = controlled experiment)
• (E) Are the source codes used as instruments in the empirical studies toy examples or real codes? (0 = toy; 1 = real)
• (F) Are the participants of the empirical study practitioners or students? (0 = students or not applicable; 1 = practitioners)
• (G) Are the study results reported in detail? (0 = no; 0.5 = partially; 1 = yes)
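Assuming the seven criteria are simply summed (the aggregation formula is our reading, since it is not spelled out above), the maximum score is 1 + 1 + 1 + 4 + 1 + 1 + 1 = 10, which matches the zero-to-ten range and makes criterion D the heaviest item:

```python
def appraisal_score(a, b, c, d, e, f, g):
    """Combine the seven quality-appraisal criteria (assumed: a plain sum).

    a, b: fractions of attributes defined / with measurement procedures (0..1)
    c:    reproducibility of the procedures (0, 0.25, 0.5, 0.75 or 1)
    d:    study type (1 proof of concept, 2 survey/observational,
          3 case study, 4 controlled experiment)
    e:    toy (0) or real (1) code; f: students (0) or practitioners (1)
    g:    detail of the reported results (0, 0.5 or 1)
    """
    return a + b + c + d + e + f + g
```

For example, a controlled experiment with practitioners on real code, with fully defined and reproducible measures and detailed results, would score the maximum of 10.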

Source Code Attributes Impacting Readability and Comprehensibility
Overall, 404 papers were returned up to 2016 after re-executing the search string in Scopus. Each paper had its title and abstract checked according to the inclusion and exclusion criteria (including those already included/excluded in the previous literature review presented in [18]), and 24 papers were included for data extraction.
Two papers ([19] and [20]) could not be found. Thus, the data we present concerns the 22 papers we had access to (see Table 3).

The works were organized in clusters according to their proximity regarding attributes under evaluation and the primary purpose of the work. Figure 1 depicts the dependency graph among these 22 included papers, and the four main clusters of works returned by our search. As a general comment, most of the works neither provide a precise definition of quality characteristics under investigation (readability and comprehensibility) nor differentiate them, which hampered the presentation of source code attributes for a specific quality characteristic and led us to indicate their influence on both the readability and comprehensibility.
K [4] is the most cited paper in the group, which can be explained by the nature of its work. K intends to identify a way of measuring the readability/comprehensibility of source code using different combinations of source code attributes. The paper also has a high quality appraisal score, which, among other reasons, indicates it describes in detail the attributes, their measurement procedures, and the evidence of their influence on source code readability/comprehensibility. Likewise, B [22], I [15], and O [16] also propose a metric for source code readability/comprehensibility by combining different source code attributes. That is why they together compose one of the clusters (1) identified in our structured literature review.
A [21], R [33], T [35], and U [36] share some similarities with the first cluster of works. Although they do not intend to propose a metric for measuring source code readability/comprehensibility, they try to identify, from a large list of source code measures, those that have some influence on the reading and comprehension of source code. To do so, they measured several source code features in either real or toy projects previously assessed concerning their readability/comprehensibility. After that, they applied either mining or a regression technique to identify the measures that most affected the assessment of readability/comprehensibility. Among all the included papers, this set contains the ones with the lowest quality appraisal scores regarding our evaluation criteria. Although this assessment does not have any external significance, we would not be able to use them to support any final decision, since we could not gather from them the necessary amount of information to answer our research questions.
The cluster containing C [23], E [25], and V [5] (3), unlike the other two clusters, presents a very restricted list of source code attributes under evaluation, all of them concerned with comments. Similarly, the cluster containing F [26], G [13], H [14], J [27], L [28], M [29], P [31], Q [32], and S [34] (4) has works that restrict their studies to the analysis of the influence of different identifier attributes on source code reading/comprehension. In some sense, the papers dealing with specific source code entities (identifier and comment) reflect the importance of these two entities for the reading and comprehension of source code. D [24] and N [30] were not grouped in any cluster given their distance in purpose from the other works. D deals specifically with indentation spacing, and N deals with four different measures related to code size and structural and semantic complexity. Although N looks similar to the works in Cluster 2, it differs regarding how it evaluated the influence of the attributes on source code reading/comprehension: the authors used an experiment instead of mining or regression analysis.
Briefly, all the papers provided some information concerning our research questions, even though not all of them presented detailed information regarding the measurement procedures of the evaluated source code attributes. From the 22 included papers, 13 source code attributes from three source code entities (program unit, identifier, and comment) presented evidence of their influence on the readability and comprehensibility of source code. At least one measurement procedure was found for each attribute, which led to 94 measurement procedures (either qualitative or quantitative). Appendix A presents the complete list of entities, attributes, measurement procedures and impacts on source code readability/comprehensibility gathered and aggregated from the included papers. The list lacks some attributes and measurement procedures since some of the primary studies either did not list all the observed attributes or described some of those they listed without enough information to be understood.
We followed a bottom-up method for identifying the source code attributes and their respective entities. For each paper, we first identified the cause-effect model being assessed: we gathered what was being measured in the source code (cause) and what was being observed concerning its impact on source code readability or comprehensibility (effect). In addition, we identified how the causes and effects were measured. Not all of the papers made it explicit that they were evaluating source code attributes; most of them measured different program elements and tried to establish a causal relationship between such program elements and the readability/comprehensibility of source code. From a list of 94 measurement procedures (regarding amount/average/maximum of lines, comments, statements, operators, identifiers, spaces, characters, etc.), we identified 13 source code attributes: six concerning a program unit, six concerning particular features of identifiers, and one concerning comments (see Figure 2).
One may wonder whether it is acceptable to compare/group works from more than 30 years ago with recent ones, since programming languages have evolved considerably during this time. We should highlight that most of the measurement procedures we identified in the oldest works can also be used in modern programming languages, and for this reason, we did not discard them from our set of included papers. Moreover, the qualitative aggregation we performed with the results was at the level of attributes, which also allowed us to compare works that present very different measurement procedures.

The measurement procedures regarding program units are concerned with measuring program elements in the whole piece of a program: for instance, counting the number of lines of code in a program unit, counting the number of branches, identifying the presence of comments, or counting the number of spaces used for indentation. Each of these measurement procedures, although used to measure program elements, serves as a surrogate for a program unit attribute. We identified six program unit attributes (details in Appendix A):
• size/length: groups measurement procedures intending to measure spatial dimensions of a program unit;
• structural complexity: groups measurement procedures intending to measure the way the information is organized and connected in the program unit;
• semantic complexity: groups measurement procedures intending to measure the conceptual information embedded in the program unit;
• implementation level: groups measurement procedures intending to measure the level of abstraction of an implementation;
• internal documentation: groups measurement procedures intending to measure the amount of documentation support the program unit has;
• layout style: groups measurement procedures intending to measure the spatial location of program elements in the program unit.
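As a sketch of how such surrogate procedures operate (the function, names, and keyword list below are our own simplifications, not taken from any included paper), a few program-unit measures can be computed directly from a snippet's text:

```python
def program_unit_measures(source):
    """Toy surrogate measures for program-unit attributes (our simplification)."""
    lines = source.splitlines()
    code = [l for l in lines if l.strip() and not l.strip().startswith("#")]
    return {
        "loc": len(lines),                                      # size/length
        "blank_lines": sum(1 for l in lines if not l.strip()),  # layout style
        "comment_lines": sum(                                   # internal documentation
            1 for l in lines if l.strip().startswith("#")),
        "branches": sum(                                        # structural complexity
            l.strip().startswith(("if ", "elif ", "for ", "while "))
            for l in code),
        "max_indent": max(                                      # layout style
            (len(l) - len(l.lstrip(" ")) for l in code), default=0),
    }
```

Each dictionary entry is a measurement procedure acting as a surrogate for one of the attributes listed above.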
Measurement procedures related to identifiers concern measuring specific properties of an identifier, disregarding its relationship with other program elements in the program unit. We identified six identifier attributes; differently from program unit attributes, most of them are measured qualitatively (details in Appendix A):
• consistency: groups the measurement procedures aiming at measuring the consistency of identifier names regarding upper/lower case and underscore use;
• meaningfulness: groups the measurement procedures that intend to measure the retention of identifier names in memory;
• length: groups the measurement procedures aiming at measuring the spatial dimensions of an identifier;
• scope: groups measurement procedures intending to measure the identifier's boundaries of use;
• multi-word separator style: groups measurement procedures aiming at measuring the way multi-word identifiers have their words separated;
• type encoding style: groups measurement procedures intending to measure the way identifiers show their type information to the reader.
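A comparable sketch for identifier attributes (again with deliberately naive rules of our own, not any study's procedure) could look like:

```python
import re

def identifier_measures(name):
    """Naive measures for a single identifier (our own simplification)."""
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return {
        "length": len(name),                  # length (characters)
        "word_count": len(words),
        "camel_case": bool(                   # multi-word separator style
            re.fullmatch(r"[a-z]+(?:[A-Z][a-z0-9]*)+", name)),
        "snake_case": "_" in name,
        "consistent": not (                   # consistency: no mixed styles
            "_" in name and any(c.isupper() for c in name)),
    }
```

For instance, `totalCount` would register as camel case with two words, while `total_Count` would be flagged as inconsistent for mixing underscores with upper case.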
Similarly to the identifier entity, measurement procedures related to comments regard measuring specific properties of a comment. Just one attribute was identified for comments: quality. In this case, quality is measured regarding whether the comment presents information that goes beyond what the code provides. It is important to note that the program unit entity also presents measurement procedures related to comments, but these intend to measure the presence, amount, and length of comments in the program unit, meaning no feature of a single comment is assessed concerning its influence on source code reading/comprehension.
With such a variety of measurement procedures, it is not a surprise to encounter contradictory evidence regarding the influence of some source code attributes on reading/comprehension (rows of positive and negative symbols for each attribute in Figure 2). Still, it is possible to draw some insights from the bigger picture presented in Figure 2, especially from those works presenting confirmatory results. According to our findings, the bigger the structural complexity of a program unit, the worse the source code readability/comprehensibility (negative symbols in Figure 2); also, a higher abstraction level of implementation leads to better source code reading/comprehension. Concerning identifiers, consistency, meaningfulness, and multi-word separator style (especially the use of camel case) presented a positive influence on source code reading/comprehension. High-quality comments also influence the quality of source code positively, especially regarding comprehensibility.
Such confirmatory evidence supported their use to align the quality perspectives and, at the same time, to improve the source code quality in the target company, motivating this evidence-based work. On the other hand, the existence of contradictions prevented us from using some of the results, which led us to investigate the contradictions further.

Contradictory Results Regarding Source Code Attributes
The encountered contradictions concern five attributes: four related to the program unit (size/length, semantic complexity, internal documentation, and layout style) and one related to the identifier (length). The following subsections present a discussion on them.

Contradictory Behavior for Program Unit Size/Length
N [30] measured the size using only the total number of lines. While the former identified it as being directly proportional to the source code readability/comprehensibility, the latter reached no conclusive influence on these characteristics. None of the works presented conclusive justifications concerning their findings, which leaves the reasons for such divergences unclear. Still, we can conjecture that the number of blank lines, presented in other works (B [22], I [15], and K [4]) as having a positive influence on source code reading/comprehension, might have been a confounding factor in these studies. Also, the fact that, to this day, different procedures and tools measure lines of code differently compromises the actual comparison between the identified works, since we cannot be sure how they measured this attribute, leading to another likely confounding factor.

Contradictory Behaviors for Program Unit Semantic Complexity and Identifier Length
Similarly, the studies in this group of contradictions measured neither program unit semantic complexity nor identifier length using the same measurement procedures. However, in some sense, their results can be compared. Table 5 presents the works and the actual contradictions for these attributes. We can identify three central contradictions in Table 5: i) concerning arithmetic operators; ii) concerning commas; and iii) concerning the length of identifiers. Three different works investigated the impact of arithmetic operators on source code reading/comprehension. While it sounds reasonable that arithmetic expressions can be detrimental to comprehension, since they require extra attention from readers, I [15] and K [4] presented a positive relation between arithmetic operators and source code readability. All the works that investigated this program element applied mining/regression techniques to identify the impact of various measures on source code reading/comprehension (cluster 2 in Figure 1), and they did not provide detailed results for this particular case.
The other contradiction concerns the use of commas, which can be controversial. None of the papers provided any explanation on why they decided to measure commas nor why they assumed that commas might have some influence on source code reading/comprehension. We can infer that the authors used commas as a surrogate for parameters in method calls. Another possible reason for measuring commas is to capture information concerning the declaration of multiple variables for the same type. In both cases, it might be sensible to say that the number of parameters in method calls/multiple variables being declared for the same type can increase the line length, which was shown in these same works (I [15], K [4] and T [35]) to be detrimental to source code reading/comprehension. Especially regarding the declaration of multiple variables in the same line, this might also limit the existence of comments for variable explanations, which can be detrimental to source code comprehension. The lack of further information on this prevents us from making more conjectures about the results.
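To make the conjecture above concrete, the hypothetical Python fragment below (the variable names are ours, not taken from any of the reviewed works) illustrates why comma counts may act as a proxy for line length and for the loss of per-variable comments:

```python
# Version 1: several variables declared on one line. The line grows with
# every addition, and there is no room for a per-variable explanation.
width, height, depth, dpi = 210, 297, 0, 300

# Version 2: one assignment per line leaves space for clarifying comments.
width = 210    # page width in millimetres (A4)
height = 297   # page height in millimetres (A4)
depth = 0      # flat document, no third dimension
dpi = 300      # print resolution


def area(w, h):
    """Return the area of a w-by-h rectangle."""
    return w * h


assert area(width, height) == 62370
```

Both versions bind the same values; only the opportunity for commentary differs, which is consistent with the conjecture that comma-dense lines can crowd out explanatory comments.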
Regarding identifier length, it is possible to observe that some works dealt with it in a broader scenario, measuring the average length of identifiers in a whole program unit, while others focused on the length of a single identifier, disregarding its presence in a program unit. Although most of them assessed identifier length through the number of characters, as presented in DeYoung and Kampen (A) [21], Lawrie et al. (H) [14] used identifiers in the form of complete words, abbreviations, and single characters to assess the influence of these configurations on source code comprehension. The latter study concluded that identifiers with complete words influence source code comprehension more positively than abbreviations and single-character formats. Lawrie et al. (H) [14] could also draw some interesting conclusions regarding gender/experience versus identifier comprehension. Their results revealed that informative identifier names are more important for women than for men, even though women comprehend more from abbreviations than men do. Inexperienced programmers have more difficulty comprehending uninformative identifiers than experienced ones; on the other hand, neither work experience nor schooling seems to influence the writing of correct source code descriptions. Even though the results for identifier length were not conclusive, DeYoung and Kampen (A) [21] observed that short identifier names are more suited to limited program scopes, while long identifier names suit broader program scopes. Jørgensen (B) [22] obtained results equivalent to those of Lawrie et al. (G and H) [13] [14], showing that the longer the identifier, the better the comprehension of source code. In the opposite direction are the results of Buse et al. [6,14], which say that the average length of identifier names has no influence on readability/comprehensibility and that the length of the most extended identifiers has a negative impact on the readability/comprehensibility of source code, meaning the longer the identifier, the worse the source code is to read/comprehend. Butler et al. (M) [29] also presented negative results. However, the authors argued that there is both an upper and a lower limit to the length of an identifier, meaning it cannot be too long nor too short.
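As a concrete illustration of the identifier configurations discussed above (our own hypothetical example, not one of the materials used in the cited studies), the same computation can be written with complete-word, abbreviated, and single-character identifiers:

```python
# The same computation under three identifier styles.

def monthly_interest(principal_amount, annual_interest_rate):
    # Complete words: self-describing, at the cost of longer lines.
    return principal_amount * annual_interest_rate / 12


def mon_int(princ_amt, ann_rate):
    # Abbreviations: shorter, but the reader must expand them mentally.
    return princ_amt * ann_rate / 12


def f(p, r):
    # Single characters: shortest, and the least informative.
    return p * r / 12


# All three are behaviourally identical; only readability differs.
assert monthly_interest(1200, 0.06) == mon_int(1200, 0.06) == f(1200, 0.06)
```

The three functions are interchangeable in behaviour, which is precisely what makes identifier style a pure readability/comprehensibility variable in such studies.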

Contradictory Behaviors for Program Unit Internal Documentation
The contradiction concerning the presence of comments can also be controversial, since it is almost universal sense that commenting the source code is an excellent practice for improving its comprehension. Among the works listed in Table 6, the one by Lee, Lee and In (U) [36] was the only one suggesting the contrary. This can be justified by the fact that the authors did not consider comments as part of the source code; this way, they noticed that whenever a source code was assessed as not being readable/comprehensible, there was a comment attached to it providing explanations to clarify probable doubts from readers. It is important to note that the quality appraisal score for this work was significantly low, meaning it would not be used to make a definitive decision in our evidence-based work setting.
The other contradiction in this group concerns indentation spacing, investigated in [15] and in 2010 [4], and by Lee, Lee and In (U) in 2015 [36]. According to these works, indentation does aid program readability, but the number of spaces used for indentation differs among the studies. The works described by Jørgensen (B) [22] and by Lee, Lee and In (U) [36] presented a positive relation between indentation spacing and the readability of source code, meaning the more spaces used for indentation, the better. Conversely, all other works described a negative relationship between the number of spaces used for indentation and readability, meaning that, although significant, the indentation should not contain too many spaces, to prevent hampering readability/comprehensibility. Miara et al. (D) [24] showed that the best results for source code readability and comprehensibility (a characteristic also evaluated in their study) were obtained using indentation from two to four spaces, as 6-space indentation either made lines longer or split statements across different lines. According to their work, increasing the indentation spaces and using nested commands reduce the readability of source code.
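For illustration, the sketch below (our own example; the snippet contents are hypothetical) contrasts four-space and eight-space indentation in Python, where each additional nesting level consumes horizontal space quickly:

```python
# The same nested logic, indented with four spaces and with eight spaces.
# With eight spaces, two nesting levels already consume 16 columns,
# pushing statements toward the line-length limit noted by Miara et al.

def classify_four(values):
    result = []
    for v in values:
        if v >= 0:
            result.append("non-negative")
        else:
            result.append("negative")
    return result


def classify_eight(values):
        result = []
        for v in values:
                if v >= 0:
                        result.append("non-negative")
                else:
                        result.append("negative")
        return result


# Python accepts any consistent indentation width; behaviour is identical.
assert classify_four([1, -1]) == classify_eight([1, -1])
```

Since both widths are legal and behaviourally equivalent, the choice is purely a readability question, which is what the contradictory studies above were probing.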

Empirical Study on Contradictory Attributes for Readability and Comprehensibility of Source Code
The contradictory results challenged our decision-making on whether or not to use such source code attributes to tailor coding guidelines for the company. Thus, we still had to observe these divergent results, and therefore decided to evaluate some of them in an external context. We should stress, though, that not all of the attributes would be discussed with the company's practitioners; only those with a high-quality assessment and that were important to the company were considered at that moment.

Observing Contradictions: the Empirical Study Planning
An empirical study was planned with the object of study presented in Table 8, as follows. During the empirical study, we wanted to answer the following research questions (RQs), which were also used to guide the rest of the planning:

- RQ1: What is the influence of the size of code on the comprehensibility of source code?
  - What features of the source code, participants, or environmental characteristics support this influence?
- RQ2: What is the influence of the length of identifiers on the readability and comprehensibility of source code?
  - Do gender, programming experience, and program scope features support this influence?
- RQ3: What is the influence of the presence of comments on the comprehensibility of source code?
  - Do programming experience and program scope features support this influence?
- RQ4: What is the influence of code indentation spacing on the readability of source code?
  - Do the length of lines, nested commands, and programming experience features support this influence?

For each RQ, a null hypothesis (H0) representing no influence of the attribute on the quality characteristic was also formulated.
To answer these research questions, we decided to observe the participants' perceptions regarding the readability and comprehensibility of code snippets. We selected and modified, according to the mentioned attributes, source code snippets from real open source projects. Python was the programming language used in the snippets due to the participants' familiarity with it, according to a previous survey done during the software engineering course. Two different well-known open source projects (according to GitHub) written in Python provided the snippets: BitTorrent and YoutubeDL. For each attribute, two separate versions of the same source code were created: i) small versus large (in terms of number of lines) source code size; ii) complete-word versus abbreviated identifiers, relating to the length of identifiers; iii) with versus without comments; and iv) four versus eight spaces regarding indentation spacing. The snippets should be short and self-contained to be assessed during the time (one hour) available for the task.
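A "without comments" version can be derived mechanically from the commented one. The sketch below (our illustration, not necessarily the procedure used in the study) drops whole-line '#' comments while preserving behavior:

```python
def strip_comment_lines(source):
    """Remove lines containing only a '#' comment (naive but predictable)."""
    kept = [line for line in source.splitlines()
            if not line.lstrip().startswith("#")]
    return "\n".join(kept)


# A hypothetical commented snippet (not from BitTorrent or YoutubeDL).
with_comments = """\
# Compute the arithmetic mean of a list of numbers.
def mean(numbers):
    # Guard against an empty input list.
    if not numbers:
        return 0.0
    return sum(numbers) / len(numbers)
"""

without_comments = strip_comment_lines(with_comments)
assert "#" not in without_comments

# Both versions still define the same behaviour.
namespace = {}
exec(without_comments, namespace)
assert namespace["mean"]([2, 4, 6]) == 4.0
```

Keeping the two versions behaviourally identical is what allows any difference in participants' perceptions to be attributed to the comments alone.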
The empirical study was designed to be conducted in person (in vitro) during a software engineering class regarding the quality of source code. The participants received a lecture on software quality and a general explanation of the readability and comprehensibility of source code; no details about source code attributes were provided before the study. Next, they received: i) a consent form; ii) a characterization form; and iii) eight different source code snippets (randomly assigned to the students, one for each version of the attributes under evaluation). Since we did not want different source code attributes interacting among themselves, we decided to use different source code snippets for the evaluation of each source code attribute, controlling to the extent possible the factors affecting source code readability and comprehensibility. This way, we had eight different source code snippets, each one having two versions according to each attribute under evaluation. We wanted to guarantee that all the participants would receive the same code snippets, with the only difference being their versions (see Figure 3). The study was anonymous, and it did not influence the performance of participants in the module.
Following some pieces of advice on how to measure program comprehension [37], we assembled the study so that the participants should provide their perceptions of the readability (where applicable) and comprehensibility (where applicable) of snippets using a five-point Likert scale, along with reasons (possibly more than one) for the perception (in the case of readability) and an answer to a comprehension question (in the case of comprehensibility). Since they received printed snippets, some strategies were used to better simulate the way source code is typically viewed on a computer screen, for instance, printing the snippets in landscape orientation and bolding the native commands. In addition, smileys were used as a replacement for written Likert level labels. Figure 3 depicts an example of two versions of the same code snippet used in the study (a participant assessed only one of the versions).

Figure 3: Two Different Versions of a Code Snippet Concerning Length of Identifiers
A senior programmer performed a pilot to assess the code snippets' difficulty level and the time needed to complete all the evaluations. The senior programmer completed the study in 29 minutes (all possible options) and provided feedback regarding the likely difficulty levels for undergraduate students, which supported the replacement of some of the previously selected snippets.

Contradictory Results
The study had 38 participants. A third of them finished the evaluation in less than 40 minutes. Their characterization revealed a broad range of different profiles. According to their self-reported expertise (using a five-point Likert scale ranging from "I have no experience" to "I practice it in several projects in the industry") in writing and debugging source code, software development and evolution activities, and Python language knowledge, half of them were characterized as novices and half as experts; according to their self-reported knowledge (using a three-point Likert scale ranging from "I do not know the domain" to "I am familiar with the domain") of the software from which the snippets were extracted (BitTorrent and YoutubeDL), half of them were marked as having a low level of knowledge and half a high level. We decided to use self-reported expertise and knowledge since a recent study showed that it can be an accurate way of measuring expertise [38]. Whereas looking at the overall results for each attribute may not give us any conclusive answer as to the reasons for their different influence on the readability and comprehensibility of source code, data stratification considering the main findings in the literature review can give us some hints about the previously observed contradictory results. Thus, the participants' characteristics and source code properties were used to support the data analysis, as can be seen in the following subsections.

Results Regarding Source Code Size (RQ1)
Neither readability nor comprehensibility perceptions presented significant differences in the analyzed cases (see Table 9). Concerning the readability characteristic, almost all explanations against the size of code were for the bigger code snippet versions, independent of the participants' programming experience, which leads us to conjecture that code size has a negative influence on readability. It is important to note that more lines tended to yield worse readability perceptions for experts than for novices.

Comprehensibility Answers
On the other hand, the results for comprehensibility reached a bit more agreement among the participants and snippet versions. The results, in this case, showed some trend toward the larger snippet versions, especially for novices, who presented more positive comprehension perceptions than the experts. While analyzing other source code attributes of the two snippets under evaluation, we observed some hints for this slight difference. The smaller version of snippet 1 has more complicated arithmetic expressions than its larger version, which took advantage of more IF-conditions to solve the problem of formatting bytes with their suffix. According to what was pointed out in the previous sections, arithmetic expressions can hamper the comprehension of source code, but what we noticed is that their complexity is more detrimental than their quantity. In the case of snippet 2, the smaller version avoided the use of an additional IF-condition, while the larger version made the return of a function explicit through this condition. At least concerning perception, novices tended to like the versions with explicit source code more than experts did. No differences in results could be seen for different levels of participants' domain knowledge.
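To make the contrast concrete, the hypothetical pair below (our reconstruction of the idea, not the actual study snippet from BitTorrent or YoutubeDL) formats a byte count with a unit suffix using denser arithmetic versus explicit IF-conditions:

```python
import math

# Compact version: fewer lines, but denser arithmetic (log, exponent, index).
def format_bytes_compact(n):
    if n == 0:
        return "0.0B"
    exp = min(int(math.log(n, 1024)), 4)
    return "%.1f%s" % (n / 1024.0 ** exp, "BKMGT"[exp])


# Expanded version: more lines, but each case is an explicit IF-condition.
def format_bytes_expanded(n):
    if n < 1024:
        return "%.1fB" % n
    if n < 1024 ** 2:
        return "%.1fK" % (n / 1024.0)
    if n < 1024 ** 3:
        return "%.1fM" % (n / 1024.0 ** 2)
    if n < 1024 ** 4:
        return "%.1fG" % (n / 1024.0 ** 3)
    return "%.1fT" % (n / 1024.0 ** 4)


assert format_bytes_compact(2048) == format_bytes_expanded(2048) == "2.0K"
```

The two functions are equivalent, so any comprehensibility difference between them stems from the density of the arithmetic rather than from what the code computes.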

Results Regarding Identifier Length (RQ2)
Gender, programming experience, and program scope were mentioned as factors that could contribute to the influence of identifier length on the readability and comprehensibility of source code. Five out of 38 participants were women, which makes it unfeasible to draw any conclusion on the influence of gender in the identifier length and readability/comprehensibility analysis. The fact that we were more interested in assessing 'low-level' details of readability and comprehensibility (a preference of the company) made us select self-contained snippets, which means none of the snippets in the study presented identifiers from outside their scopes, also making the program scope analysis unfeasible. Still, programming experience and domain knowledge were used to support data stratification.

Comprehensibility Answers
Although the participants wrote their perceptions regarding the readability of identifiers, all the given explanations for and against the style concerned comprehension issues instead of reading ones, which made us question whether the participants could correctly assess the influence of this attribute on this characteristic. The fact that both readability and comprehensibility perceptions presented similar results regardless of the data stratification used (experience or domain knowledge) also lowers our confidence in drawing conclusions about identifier length and readability. The results regarding identifier length and comprehensibility, on the other hand, highlighted interesting insights. The comprehensibility perceptions for long identifiers were higher in most cases. Participants with higher programming experience and a low level of domain knowledge were the ones to give lower comprehensibility perceptions for long identifiers, and it was regarding snippet 4 (see Figure 3 and Table 10). Concerning the statistical tests, significant differences were found for experts and the low-level domain group concerning snippet 3.
Even though the perceptions were different, we could not conclude much from the scores for the comprehension questions. We could observe, though, that inexperienced programmers did have more difficulty comprehending short identifiers than experts in the case of snippet 4, which is in line with Lawrie et al.'s findings [14]. In fact, their perception was almost the opposite for this case. It is important to point out that snippet 4 is more extensive, has long lines of code, and presents more information than snippet 3, supporting an explanation for the difficulties presented by novices when compared with experts. We believe identifier names embody the connection between the source code and the problem domain [32], and for this reason, we expected that domain knowledge could show some influence on the relation between identifier length and source code comprehension. Surprisingly, high domain knowledge did not seem to lead to better scores on comprehension questions. As an interesting observation, many participants in the expert group were also in the low-level domain knowledge group. This fact might help to explain the similarities obtained in these two groups for snippet 3, meaning we cannot be sure whether experience or knowledge influenced the results the most. Probably these factors confounded this result.

Results Regarding Comments Presence - Study Design Control (RQ3)
The presence of comments was used as a design control to enhance our confidence in the participants' responses for the other attributes. We decided to use the presence of comments as a control because it is a consensus in many studies that the presence of comments positively influences the comprehensibility of source code. To avoid the negative influence presented by Lee, Lee and In (U) [36], we warned the participants that everything inside the code box should count in their assessment (see Table 11).

Comprehensibility Answers
The results showed that, while answering comprehension questions, students achieved better scores for snippets presenting comments describing their function, as expected. Their perceptions concerning the comprehensibility of snippets were also more favourable for those presenting comments, which was confirmed by Mann-Whitney statistical tests for ordinal data (p-value < 0.05 for novices, experts, low-level and high-level knowledge; alpha = 5%, as shown in Table 11), allowing us to indicate that there is a difference in the perception of source code with and without comments. This initial analysis allowed us to strengthen confidence in the answers for the other attributes.
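For readers unfamiliar with the test, the sketch below computes the Mann-Whitney U statistic by hand (statistic only, no p-value) on hypothetical five-point Likert ratings; the data shown are invented, and the study's actual responses and tooling are not reproduced:

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) using average ranks, which handles Likert ties."""
    combined = sorted(sample_a + sample_b)
    ranks = {}
    i = 0
    while i < len(combined):
        # Find the tie group [i, j) sharing the same value.
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        # Mean of the 1-based ranks i+1 .. j.
        ranks[combined[i]] = (i + 1 + j) / 2.0
        i = j
    rank_sum_a = sum(ranks[v] for v in sample_a)
    n_a, n_b = len(sample_a), len(sample_b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    return u_a, n_a * n_b - u_a


with_comments = [5, 4, 5, 4, 4]       # hypothetical Likert ratings
without_comments = [3, 2, 3, 4, 2]

u1, u2 = mann_whitney_u(with_comments, without_comments)
# Sanity check: the two U statistics always sum to n_a * n_b.
assert u1 + u2 == len(with_comments) * len(without_comments)
```

A U value near zero for one group (here, u2 = 1.5) indicates that nearly all of its ratings rank below the other group's, which is the kind of separation the test then evaluates for significance.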

Results Regarding Indentation Spacing (RQ4)
Even though Python enforces a specific indentation scheme, the number of spaces used for indentation was under evaluation, rather than the existence of indentation, since the principal contradiction encountered in the technical literature was regarding the former. We clustered the results according to the participants' experience as it can be seen in Table 12.

Readability Perceptions and Explanations
No significant difference in perception was observed regarding indentation spacing, neither for novices nor for experts. Some interesting observations can be made from the perception dispersion in the two groups, though. The readability perceptions did not vary as much in the expert group as they did in the novice one. As an interesting observation, all the explanations against the indentation style used were directed at snippets with eight spaces, indicating agreement between the results obtained in our empirical study and the ones presented by Miara et al. [24] and Buse et al. [15] [4].
In both groups, the median readability perception was lower for snippet 8 in its 8-space indentation version in comparison to the other snippet configurations. As additional information, snippet 8 had shorter lines of source code compared to snippet 7; another difference is the existence of a loop in snippet 7 that is not present in snippet 8. Other than those, no further explanations could be gathered to support a justification for the difference in perception. Further investigation is required to check whether line length and loops can be moderator factors of indentation spacing and readability of source code. No findings related to nested commands were observed, probably due to the simplified source code structures used in our study.

Threats to Validity
Concerning the literature review, we are aware that the coverage of articles might not be complete, mainly because we had to simplify our search to provide feedback to the company in a reasonable time span. We must acknowledge that the list of included works presented in this paper can be used as a quasi-gold standard [39] in future investigations in the area. Concerning the study, as the human judgment of quality is somewhat subjective, the alternative used to measure it might not have been the most appropriate one, threatening the construct validity. Also, since it is almost impossible to isolate source code attributes so as to evaluate their influence on quality separately from the rest of the source code, some other source code attributes might have affected the participants' responses, threatening the internal validity. Besides, there is a chance the participants perceived readability and comprehensibility as a single concept, even though their differences were explained beforehand. Concerning the external validity, the results from this empirical study cannot be generalized nor applied in general cases, since the selected code snippets and participants may not be representative of a broader industrial scenario. Statistical tests for ordinal data were used to support the analysis of readability and comprehensibility perceptions, mitigating the threats regarding conclusion validity.

Conclusions
The results presented in this work supported a more thorough understanding of the interaction among programmers' characteristics, program elements, and source code readability and comprehensibility, by describing previous study observations and conducting an empirical study regarding source code attributes and their impact on source code quality. We presented 13 different attributes and 94 measurement procedures that have evidence of their influence on source code reading and comprehension. Some contradictions were identified during the literature review, which guided us to an empirical study investigation. For code size, while it has a negative influence on readability - in particular for experts -, our results showed a positive impact on comprehensibility, supporting an explanation for the contradictory results found in the technical literature. The comprehensibility perceptions for long identifiers were higher in most cases - showing some significant differences for snippet 3 -, especially for less experienced participants, indicating a positive relation between identifier length and comprehensibility. Regarding the presence of comments (our study design control), we could observe, with significant results, that the presence of comments is favorable to source code comprehension. Moreover, even though no significant difference regarding readability perceptions was observed, participants showed a preference for indentation using four spaces over eight, indicating a negative relation between indentation spacing and readability. Most likely, a misunderstanding between the concepts of readability and comprehensibility caused the contradictory results presented in the technical publications. Still, the presented results indicate that coding guidelines contradictions demand further investigation to provide indications of possible confounding factors explaining some of the inconclusive results and to support the clarification of these findings.