Colloquia - Computer Science

Extra Seminar Computer Science

When: Mo 14-01-2019 09:00 - 10:00
Where: 5161.0041B Bernoulliborg


The Relevance of Application Domains in Empirical Findings


Research on empirical software engineering has increasingly used data from online repositories or collective efforts. The latest trend among researchers is to gather as much data as possible to (i) prevent the bias that comes with a small sample, (ii) work with a sample as close as possible to the population itself, and (iii) showcase the performance of existing or new tools in treating vast amounts of data.

The effects of harvesting enormous amounts of data have been only marginally considered so far: data could be corrupted; repositories could be forked; and developer identities could be duplicated. In this paper we posit that there is a fundamental flaw in harvesting large amounts of data and then generalising the conclusions: the application domain, or context, of the analysed systems is overlooked, whereas it must be the primary factor for the cluster sampling of FOSS projects.

In this talk we analyse a sample of software systems and, using an existing approach based on Latent Dirichlet Allocation (LDA), derive their application domains. We extract a suite of structural OO metrics from each project and cluster projects by domain: we show that most of the chosen metrics come from different populations, depending on the application domain.
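
As a rough illustration of what "coming from different populations" can mean in practice, the sketch below groups one metric by application domain and computes a Kruskal-Wallis H statistic. This is an assumption made for illustration only: the abstract does not say which statistical test the speakers use, and the domain names, the metric (coupling between objects) and all values here are invented toy data, not results from the talk. The statistic is computed without tie correction.

```python
from collections import defaultdict


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H (no tie correction) for a dict: domain -> metric values."""
    labels, pooled = [], []
    for domain, values in groups.items():
        for value in values:
            labels.append(domain)
            pooled.append(value)
    n = len(pooled)
    rank_sums = defaultdict(float)
    for label, rank in zip(labels, average_ranks(pooled)):
        rank_sums[label] += rank
    spread = sum(rank_sums[d] ** 2 / len(groups[d]) for d in groups)
    return 12.0 / (n * (n + 1)) * spread - 3 * (n + 1)


# Hypothetical toy data: one metric value per project, grouped by domain.
toy_cbo = {
    "web": [1, 2, 3, 4, 5],
    "database": [6, 7, 8, 9, 10],
    "games": [11, 12, 13, 14, 15],
}

# Chi-square critical value for 2 degrees of freedom (3 groups) at alpha = 0.05.
CRITICAL_DF2_ALPHA_05 = 5.991

h = kruskal_wallis_h(toy_cbo)
print(f"H = {h:.2f}, differs across domains: {h > CRITICAL_DF2_ALPHA_05}")
```

With real data, each list would hold the metric values of the projects that LDA assigned to that domain, and a library routine with tie correction and an exact p-value would be the safer choice.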