The experimental community corrupts the researcher

Experimentation is a valuable tool in computer science. Random experimentation is, however, a danger. This post examines current research practices and casts doubt on their benefits. Inspired by the French writer Rousseau (1972) and his idea that society corrupts the individual, my conviction is that the experimental community corrupts the researcher.

The scientific method has evolved over the centuries, and philosophers have played a distinguished role in that change by questioning the beliefs used to guide discoveries and by challenging new ways of thinking, sometimes without providing any answers, for the pleasure of asking questions (Russell 1997). This passion and awakened mind are missing in the current computer science community. Empiricists, especially, write large numbers of plain technical reports tracing experiments, oblivious to the beauty of essays and the excitement of sharing outlandish ideas. Going through the literature has often become a mechanical skimming and scanning task, a search for the decimals by which the proposed techniques outperform their competitors. These results sustain a kind of research based on mere parameter tuning of established algorithms.
The purpose of this viewpoint is to show the disenchantment that would-be researchers, and even tenured researchers, suffer and to denounce the proliferation of questionable practices that are killing innovation.
We first review the effect of the modern obsession with publishing and the extent to which academic research has distorted experimental science. Next, we see what calls the current methodology into question and why approaches are simply ignored or incomprehensibly revived.

The perversion of the community
Pure research is becoming less attractive nowadays. Many research lines are abandoned because investors are more interested in applications, despite the relevance of fundamental investigation. Hence, the groups that survive do so either because their research leads in applied domains or because their volume of publications is high. What is behind these numbers? Parnas (2007) makes a strong point about them and dissects every single perversion of the community (authorship pacts, monthly instalments, tailor-made conferences) that has encouraged superficial research by overly large groups, repetition, insignificant studies, and half-baked ideas. Publish or perish has wreaked havoc on daily investigation, at least in the halls of European academia.
Eventually, impact factors, the h-index, and the g-index have become the fallacious indicators of good research and good researchers, and have fired up the paper factory. Fresh Ph.D. students become burdened junk writers as soon as they learn that their careers will be measured by these statistics. The pressure is intensified by supervisors, assessed by the same yardstick, who need to keep their CVs up to date or repay colleagues in a favour chain that promotes quantity over substance.
This compulsive publishing has plagued conferences and journals with so many papers that it is getting difficult to track innovative ideas. The more one reads, the more one bumps into similar attempts and déjà vus, which slow down learning and discourage further reading. Showing off the abilities of regular methods to non-technical experts and cherry-picking results from much wider experimentation are the most common schemes. The raison d’être of empiricism has been abused, and it now entails repeated preliminary results with no further continuation.

Experimental computer science
Experimental computer science, defined by Denning (1980) as “an apparatus to be measured, a hypothesis to be tested, and systematic analysis of the data (to see whether it supports the hypothesis)”, is recurrent in machine learning, algorithm development, and software engineering. Nevertheless, experimental methodology has been twisted: instead of sustaining conjectures, experiments are run to provide material to decide them retroactively, that is, to build a posteriori theories.
Machine learning, for instance, is based on trials with performance measures, learners, and data. The combination of these elements led Langley (1988) to encourage practitioners to embrace empirical testing as a process of theory formation. Competition testing, a term coined by Hooker (1995) in relation to heuristics, has been the chaotic aftermath of that call. Many years later, no new learning paradigm has been introduced, some progress in standards has been made, and micro-tuning of existing techniques is the trendy research, the latter being a gold mine for publications. Superiority of techniques is usually claimed following a three-step procedure: selection of a few data sets, selection of reference learners to compare with, and extraction of performance conclusions supported by erroneous statistical tests. With a pessimistic but very realistic description of the scene, Demsar (2008) warned of the misuse of such experimentation. Conventional statistical tests are designed to assess single learners in isolation; they are ill-suited to multiple comparisons.
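To make the multiple-comparison problem concrete, here is a minimal sketch of how the chance of at least one spurious "significant" win grows with the number of tests. It assumes the tests are independent, which real comparisons over shared data sets usually are not, and it uses the Bonferroni correction as a simple stand-in remedy; neither the function names nor that correction come from the works cited in this post.

```python
def familywise_error_rate(alpha: float, comparisons: int) -> float:
    """Probability of at least one false positive across independent tests.

    Assumption: every test is run at level `alpha` and the tests are independent.
    """
    return 1.0 - (1.0 - alpha) ** comparisons


def bonferroni_alpha(alpha: float, comparisons: int) -> float:
    """Per-test level that keeps the family-wise rate at or below `alpha`."""
    return alpha / comparisons


if __name__ == "__main__":
    # One learner compared against k competitors, each test at the usual 5% level.
    for k in (1, 5, 10, 20):
        rate = familywise_error_rate(0.05, k)
        print(f"{k:2d} comparisons: chance of a spurious win = {rate:.3f}")
```

With ten comparisons at the 5% level, the chance of at least one false claim of superiority is already about 40%, which is why a test designed for a single comparison says little when it is applied repeatedly.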
Hypothesis testing is useful to say whether the apparent accuracy of a learner is due to chance, but its power goes down as the number of data sets examined increases. It is therefore worth determining what the ideal size of the test set is and which problems have to be involved, and strengthening the testing methodology with sufficient data analysis. These old claims are things one expects to be delighted with when reading papers. Yet they are complicated milestones, and many negative results are derived from such studies. Although negative results are also meaningful for progress, the community does not consider them. This forces researchers to move back to classical developments. In addition, groundless rejections cause frustration in researchers, which is reflected in their subsequent reviews. In turn, after being taught that going against the mass culture is not profitable, they will unwittingly block promising ideas, frustrating new generations again.

Gaming the system in lieu of research
The clout of journals and reviewers, and the inertia of the scientific community as a society, have a lot to do with how incoming contributions are validated.
Current research is like politics: each tendency has its own press. No matter how thorough the content, if the work submitted to a journal is not aligned with the thinking of its staff, it will never get the green light. This results in contributions focused more on pre-empting reviewers’ opinions than on disseminating the work. Demsar (2008) suggests the web-to-peer review. This unlikely idea, which appears to enable critical and fair evaluations of “correctness, interestingness, usefulness, beauty, novelty”, also evidences the urge to adopt other measures of productivity and recognition in order to put an end to the pretence of rigour and to biased opinions. The new peer-review process should give credibility back to publications, and researchers should not be able to game it.
Indeed, references play a crucial role in the shallow statistics above. Everyone knows they provide the information for the productivity computation. Thus, self-citations, citations to friends and the community clique, and citations to particular journals are some of the mechanisms used to climb the rankings. Citing has lost its purpose: guiding the reader to the background necessary to understand the paper.
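As a rough illustration of how easily such statistics move, the sketch below computes the h-index (the largest h such that h papers have at least h citations each) for a made-up publication record, with and without self-citations. The citation counts are hypothetical and chosen only for the example.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical citation counts per paper, made up for illustration.
external = [9, 6, 4, 3, 3, 2]      # citations received from other authors
self_cites = [0, 0, 1, 2, 2, 3]    # citations the author gave to their own papers
with_self = [e + s for e, s in zip(external, self_cites)]

if __name__ == "__main__":
    print("h-index without self-citations:", h_index(external))
    print("h-index with self-citations:   ", h_index(with_self))
```

In this toy record, eight self-citations spread over the least cited papers raise the index from 3 to 5 without a single new reader.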
A reinterpreted experimental science and a deep knowledge of the system have been the means for academic researchers to satisfy demanding productivity targets. Unfortunately, this praxis is learnt by the new generation of researchers, who will mistake research for poor scientific journalism or scientific patter. Publications should be the recognition of mature work, and their pace should slow down to gain in quality.

References
Demsar, J. “On the appropriateness of statistical tests in machine learning.” Proceedings of the 3rd Workshop on Evaluation Methods for Machine Learning. 2008.
Denning, P.J. “What is experimental computer science?” Communications of the ACM 23, no. 10 (1980): 543-544.
Hooker, J.N. “Testing heuristics: We have it all wrong.” Journal of Heuristics 1, no. 1 (1995): 33-42.
Langley, P. “Machine learning as an experimental science.” Machine Learning 3, no. 1 (August 1988): 5-8.
Parnas, D.L. “Stop the numbers game.” Communications of the ACM 50, no. 11 (2007): 19-21.
Rousseau, J.J. Les confessions. Paris: Librairie Générale Française, 1972.
Russell, B. The problems of philosophy. New York: Oxford University Press, 1997.

Data complexity in supervised learning

My thesis, Data complexity in supervised learning: A far reaching implication, is finally available online.

This thesis takes a close look at data complexity and its role in shaping the behaviour of machine learning techniques in supervised learning, and explores the generation of synthetic data sets through complexity estimates. The work has been built upon four principles which have naturally followed one another. (1) A critique of the current methodologies used by the machine learning community to evaluate the performance of new learners triggers (2) an interest in alternative estimates based on the analysis of data complexity and its study. However, both the early stage of the complexity measures and the limited availability of real-world problems for testing inspire (3) the generation of synthetic problems, which becomes the backbone of this thesis, and (4) the proposal of artificial benchmarks resembling real-world problems.

The ultimate goal of this research flow is, in the long run, to provide practitioners (1) with some guidelines to choose the most suitable learner given a problem and (2) with a collection of benchmarks to either assess the performance of the learners or test their limitations.

DCoL: New release v1.1

A new version of the data complexity library (DCoL) in C++ is available at http://dcol.sourceforge.net/.

DCoL implements a set of measures designed to characterize the apparent complexity of data sets for supervised learning, originally proposed by Ho and Basu (2002). More specifically, the implemented measures focus on the complexity of the class boundary and estimate (1) the overlap in feature values between classes, (2) the class separability, and (3) the geometry, topology, and density of manifolds. In addition, the library includes two complementary functionalities: (4) stratified k-fold partitioning and (5) routines to transform m-class data sets (m > 2) into m two-class data sets. The source code can be compiled on multiple platforms (Linux, Mac OS X, and MS Windows) and can be easily configured and run from the command line.
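For readers unfamiliar with this family of measures, the following sketch shows one of the feature-overlap estimates as I read it from Ho and Basu: the maximum Fisher's discriminant ratio, (mu1 - mu2)^2 / (var1 + var2), taken over all features of a two-class problem. It also shows a one-vs-rest relabeling of the kind functionality (5) describes. This is an independent Python illustration, not DCoL's code or interface; the function names, the use of population variance, and the toy data are all assumptions.

```python
from statistics import mean, pvariance


def fisher_ratio(values_a, values_b):
    """Fisher's discriminant ratio of one feature for two classes.

    Assumption: population variance is used for each class.
    """
    numerator = (mean(values_a) - mean(values_b)) ** 2
    denominator = pvariance(values_a) + pvariance(values_b)
    if denominator == 0:
        # Both classes are constant on this feature.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


def one_vs_rest(labels, positive):
    """Relabel an m-class problem as `positive` (1) against all other classes (0)."""
    return [1 if label == positive else 0 for label in labels]


def max_fisher_ratio(rows, binary_labels):
    """Maximum per-feature Fisher ratio over a two-class data set."""
    best = 0.0
    for j in range(len(rows[0])):
        pos = [row[j] for row, y in zip(rows, binary_labels) if y == 1]
        neg = [row[j] for row, y in zip(rows, binary_labels) if y == 0]
        best = max(best, fisher_ratio(pos, neg))
    return best


# Toy three-class data set with two features, made up for illustration.
rows = [(1.0, 5.0), (2.0, 6.0), (1.0, 7.0), (2.0, 8.0), (9.0, 5.0), (10.0, 6.0)]
labels = ["a", "a", "b", "b", "c", "c"]

# One two-class problem per class, each summarized by its best single feature.
ratios = {c: max_fisher_ratio(rows, one_vs_rest(labels, c)) for c in sorted(set(labels))}

if __name__ == "__main__":
    for name, value in ratios.items():
        print(f"class {name} vs rest: {value:.3f}")
```

A higher ratio means that at least one feature separates the class well from the rest; in the toy data, class c stands apart on the first feature, while class a overlaps the others on both.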

Practitioners are encouraged to consider the use of this software in the analysis of their data. A closer reading of data complexity can help them to understand the performance of machine learning techniques and their behavior.

Universitat d’Estiu d’Andorra

After a first-rate opening in May with the talk given by Prof. Cirac, the 27th edition of the Universitat d’Estiu d’Andorra officially starts today, Aug 30, with a promising agenda:

6:00 pm: Equilibri climàtic del planeta Terra (Climate balance on planet Earth), presented by Josefina Castellví Piulachs, oceanographer specialized in marine bacteriology (Barcelona).
7:30 pm: La bellesa és dins el cervell? (Is beauty in the mind?), presented by Jean-Pierre Changeux, doctor in biology and pioneer of modern neurobiology (Paris).

For five days, Andorra will offer, under the interesting title Del cosmos a l’àtom passant per la vida, a series of talks focussed on science and society.

ICPR 2010 – Contest: Extended Deadline May 26

Call for Contest Participation – Classifier domains of competence: The landscape contest (ICPR 2010)

Classifier domains of competence: The landscape contest is a research competition aimed at finding out the relation between data complexity and the performance of learners. Comparing your techniques to those of other participants on targeted-complexity problems may help enrich our understanding of the behavior of machine learning techniques and open further research lines.

The contest will take place on August 22, during the 20th International Conference on Pattern Recognition (ICPR 2010) in Istanbul, Turkey.

We encourage everyone to participate and share your work with us! For further details about dates and submission, please see http://www.salle.url.edu/ICPR10Contest/.

SCOPE OF THE CONTEST

The landscape contest involves the running and evaluation of classifier systems over synthetic data sets. Over the last two decades, the pattern recognition and machine learning communities have developed many supervised learning techniques. Nevertheless, the competitiveness of such techniques has always been claimed over a small and repetitive set of problems. This contest provides a new and configurable testing framework, reliable enough to test the robustness of each technique and detect its limitations.

INSTRUCTIONS FOR PARTICIPANTS

Contest participants are allowed to use any type of technique. However, we highly encourage and appreciate the use of novel algorithms.

Participants are required to submit the results by email to the organizers.
Submission e-mail: nmacia@salle.url.edu
Submission deadline: Wednesday, May 26, 2010

The contest is divided into two phases: (1) offline test and (2) live test. For the offline test, participants should run their algorithms over two sets of problems, S1 and S2. However, the real competition, the live test, will take place during the conference. Two more collections of problems, S3 and S4, will be presented.

S1: Collection of data sets spread along the complexity space to train the learner. All the instances will be duly labeled.

S2: Collection of data sets spread along the complexity space with no class labeling to test the learner performance.

S3: Collection of data sets with no class labeling, like S2, to be run for a limited period of time.

S4: Collection of data sets with no class labeling covering specific regions of the complexity space to determine the neighborhood dominance.

For the offline test, the results report consists of:

1. Labeling the data sets of the collection S2.

The procedure is the following:

  1. Train the learner using Dn-trn.arff in S1.
  2. Provide the rate of the correctly classified instances over a 10-fold cross validation.
  3. Label the corresponding data set Dn-tst.arff in S2.
  4. Store the n models generated for each data set to perform the live contest on August 22. Be ready to load them on this day.

2. Describing the techniques used.

A brief summary (1–2 pages) of the machine learning technique(s) used in the experiments must be submitted. We expect details such as the learning paradigm, configuration parameters, strengths and limitations, and computational cost.
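Step 2 of the offline procedure above asks for the rate of correctly classified instances over a 10-fold cross validation. The sketch below shows one way to compute that figure. It is only an illustration under several assumptions: the folds are stratified by class (the call does not require this), no shuffling is applied, ARFF loading is left out, and the majority-class learner is a stand-in for whatever technique a participant actually uses. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k):
    """Assign each instance index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        for index in by_class[label]:
            # Deal instances round-robin so every fold gets a similar class mix.
            folds[position % k].append(index)
            position += 1
    return folds


def cross_validated_accuracy(rows, labels, k, train, predict):
    """Rate of correctly classified instances over a k-fold cross-validation."""
    correct = 0
    for fold in stratified_folds(labels, k):
        held_out = set(fold)
        train_rows = [r for i, r in enumerate(rows) if i not in held_out]
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        model = train(train_rows, train_labels)
        correct += sum(predict(model, rows[i]) == labels[i] for i in fold)
    return correct / len(labels)


# Stand-in learner (majority class), used only to make the sketch runnable.
def train_majority(rows, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, row):
    return model


# Toy data set: 30 instances of one class and 10 of the other.
rows = [(float(i),) for i in range(40)]
labels = ["pos"] * 30 + ["neg"] * 10
accuracy = cross_validated_accuracy(rows, labels, 10, train_majority, predict_majority)

if __name__ == "__main__":
    print(f"10-fold accuracy of the majority baseline: {accuracy:.2f}")
```

Replacing `train_majority` and `predict_majority` with a real learner's training and prediction routines leaves the evaluation loop unchanged.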

IMPORTANT DATES

* May 26, 2010: Deadline for submission of the results and technical report

* May 29, 2010: Notification of participation

* Aug 22, 2010: Release of S3 and S4

* Aug 22, 2010: ICPR 2010 – Interactive Session


CONTACT DETAILS

Dr. Tin Kam Ho – tkh at research.bell-labs.com
Núria Macià – nmacia at salle.url.edu
Prof. Albert Orriols Puig – aorriols at salle.url.edu
Prof. Ester Bernadó Mansilla – esterb at salle.url.edu