Does Free Software Production in a Bazaar Obey the Law of Diminishing Returns?


Abstract

Free Software is defined. Brooks' Law is introduced as a limiting factor in software production. A parallel is drawn between Brooks' Law and the Law of Diminishing Returns. Eric S. Raymond's "Bazaar" model is introduced as a possible exception to the rule. Statistics gathered from an existing Free Software project are used to test whether "Bazaars" adhere to the law of diminishing returns. A possible flaw in Raymond's hypothesis is identified, and an economic alternative is offered.

Contents

  1. Abstract
  2. Contents
  3. Introduction
    1. Free Software Defined
  4. Brooks' Law and The Law of Diminishing Returns
  5. The "Bazaar" Introduced
  6. The GNOME Project
  7. Examination and Discussion of GNOME Data
  8. Conclusion
  9. Endnotes
  10. Bibliography
  11. Appendices
    1. Acknowledgements and Thanks
    2. The Opensource Definition
    3. Caveats
    4. Economic Principles in Free Software
    5. Program Listings


Introduction

This essay seeks to examine the rapidly emerging production method of "Free Software" or "Opensource Software". It outlines analogues between Economic theory and Software Engineering in order to bring economic analysis to bear on the area.

Specifically, it aims to provide a quantitative analysis of what has until now been primarily examined in a qualitative way. Free Software has existed in one form or another since the very early days of computing, but very little attention has been paid to it until recently. Many Free Software projects have achieved significant or even dominant positions in their marketplace, and more firms are starting to utilise or release Free Software.

Free Software Defined

Definitions of Free Software (also known as "OSS", for Open Source Software) are typically framed from two perspectives: the ideological and the legal.

The two major ideological principles underlying Free Software are the protection of user/programmer choice, and the belief that the best solutions must be shared.

The first principle arose from Richard M. Stallman's dismay at the rise of proprietary software as the dominant model. Stallman believes that proprietary (also known as closed-source) software is a violation of the individual's right to choose other packages. He argues that access to the sourcecode grants freedoms to modify and augment without being "locked in" to one company's whims. Further, he argues that sourcecode access gives users a choice to go their own way, in defiance of the company's wishes should those wishes prove detrimental.

The second principle, that good solutions should be shared, arises from the so-called "Hacker Culture". Within this culture, brainpower is seen as a limited resource, which should not be wasted on unnecessarily reinventing the wheel. It is reasonable, therefore, that all solutions (embodied in sourcecode) should be available for anyone to use. The corollary is that withholding solutions (or source) is effectively evil or sad, inasmuch as it is wasteful of resources. A sense of futility in this approach is summed up neatly by Philip Greenspun: "We're not fond of Bill Gates, but it still hurts to see Microsoft struggle with problems that IBM solved in the 1960s. Thus, we share our source code with others in the hopes that programmers overall can make more progress by building on each other's works than by trying blindly to replicate what was done decades ago."

The legal foundations of Free Software stem from a careful blend of Contract and Copyright laws. Free Software Licenses use the principles of Contract law to create their terms. Typically, these ensure that an author's version of the source is always available and that any modification made by anyone is likewise available under the same terms. Some licenses go so far as to impose these terms on any software into which the opensource code is incorporated from an external source.

If the user does not agree to the License, the law of Contract leaves its terms without force over them. At that point, however, standard Copyright law takes effect, and the user is granted no rights whatsoever. There is a strong incentive, then, to accept the terms of the License on a contractual basis.

To place Free Software in an economic framework is considerably more difficult, but quite profitable. There are a range of issues and outcomes that emerge naturally from the application of elementary economic thought to a Free Software "economy". The main body of this essay assumes that just such a framework has been established. However, the description of such a framework is long and outside the scope of the main body of this essay, so a short treatise on the elementary economic framework of Free Software can be found in Appendix IV.

Brooks' Law and The Law of Diminishing Returns

According to Fred Brooks's law, adding people to a late project makes it later. It's like adding gas to a fire. New people need time to familiarise ... their training takes up the time of ... [other] people ... and merely increasing the number of people increases the complexity and amount of project communication. Brooks points out that the fact that a woman can have a baby in nine months does not imply that nine women can have a baby in one month.

Managers need to understand ... More workers working doesn't necessarily mean more work will get done.

--Steve McConnell, "Code Complete"

In his essay "The Magic Cauldron", Eric S. Raymond estimates that almost 95% of software development is "in-house". This is the traditional meal-ticket of programmers and software engineers. It is from this heartland that Brooks' Law is drawn.

Brooks' Law - specifically - is that adding people to a late project willy-nilly will only make it later. Brooks derived this law from his own personal experience as a project manager on IBM's original OS/360 project. In his book, The Mythical Man-Month, Brooks pointed out the fallacy of simply throwing more "man-hours" (labour units) at the project in order to deliver it earlier.

According to McConnell, Brooks' analysis of his own law suggests an exception to the rule. " ... if a project's tasks are partitionable, you can divide them further and assign them to ... people who are added late to the project."

In short, we can summarise Brooks' Law in two parts:

  1. If, in a (late) project, we can further subdivide tasks efficiently, we can add extra programmers or software engineers without penalty.
  2. Otherwise, if tasks cannot be easily or efficiently subdivided, there will be a penalty for adding extra programmers or software engineers.

Brooks gave several justifications for his law, outlined by McConnell above. One of the easier to seize upon, in economic and mathematical terms, is the complexity problem. It is argued that programming requires a large amount of communication between workers. It can be shown mathematically that if the number of programmers rises linearly, the number of possible communication paths between them rises quadratically. This is illustrated by the diagram below.

[Figure 1-1 Adapted from Code Complete]
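
To make the arithmetic explicit (the notation here is introduced for illustration and is not drawn from the figure): with n programmers, every pair of programmers is a potential communication path, so the number of paths is

    \[
    \binom{n}{2} = \frac{n(n-1)}{2},
    \]

which grows quadratically. Three programmers need only 3 paths; ten need 45; fifty need 1,225.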

But Brooks' Law is not original. In fact, the first famous instances of the principles that Brooks expounded are not found in software engineering - they are found in a rice paddy.

In the classic example of the Law of Diminishing Returns, many textbook authors ask us to imagine a rice paddy. We might start with one worker on this paddy, who is barely able to care for and harvest even a fraction of it. In comparison to neighbouring paddies, this one is woefully inefficient.

So another worker is added. Productivity rises sharply, as two workers can now work the field. We can measure this rise in productivity in terms of the total output and the change in the output - the marginal output.

We continue to add workers. At first, they replicate each other's work: each, say, takes a quadrant of the paddy and works it. Later, some will specialise. Some will care for the rice, some will harvest it. Productivity continues to rise.

But the trend is not endless. At a certain point, adding more workers no longer causes a rise in productivity. Perhaps these extra workers need to be trained by other workers. Perhaps they get in one another's way, or there are workers standing by, idle, as excess working capacity. In any case, the marginal rate of productivity begins to fall, followed by the average total output.

It is not difficult to draw parallels with software engineering. Indeed, if Adam Smith had been working today, he might have used a software project as his example!

Let us take a project with one programmer. This programmer has begun to code, but the project is large. He is unable to produce many lines of code on his own, having to continuously stop and refer to manuals for unfamiliar areas, even to remind himself of what part of the project he is dealing with.

Let us add another programmer. Suddenly, they can divide the work amongst them, working on two different parts of the program at once. Then we can add more programmers - including specialists. The program is divided and subdivided into smaller units of specific purpose, and the specialists can focus on these parts.

The subdivision of units and the matching of specialists with program components means that the productivity rises.

But again, we come to a certain point where it begins to falter. Programmers are added who need to be trained on the deep secrets of the existing work. They need to be introduced to procedures and their tools, diverting the time of existing programmers. They add overhead to communications paths, and some may spend time fallow, dragging down the average total output once more.

In economic terminology, Brooks' Law might be re-summarised thusly:

  1. If, in a production process, we can further subdivide tasks efficiently, we can add extra Labour units without penalty.
  2. Otherwise, if tasks cannot be easily or efficiently subdivided, there will be a penalty for adding extra Labour units.

The implication is that there is a curve with a rise, a peak, and a fall; that there is a point of maximum productivity. Indeed, while this rewriting of Brooks' Law in economic terms does not exactly match the Law of Diminishing Returns, it is really the graphs that clinch it.

[Figure 1-2 Adapted from Economics from a Global Perspective]

This is an example of a typical Law of Diminishing Returns graph. One line shows Average Total Output (ATO), and the other line shows Marginal Output (MO).

The real signature of diminishing returns is the marginal output line. The MO line is, in mathematical terms, the derivative of total output: it shows the rate of change of output at any given "x-value", or quantity of labour.

The classic Law of Diminishing Returns MO curve is shown. It rises, peaks, and then falls; where it crosses the ATO curve, ATO is at its maximum, and thereafter the ATO curve noses over and dives. Where MO falls below zero, total output itself begins to shrink.
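
For definiteness, the standard relationships can be written compactly. Writing Q(L) for total output at L units of labour (notation introduced here for illustration only):

    \[
    ATO(L) = \frac{Q(L)}{L}, \qquad MO(L) = \frac{dQ}{dL}.
    \]

MO intersects ATO precisely where ATO is at its maximum, and MO crosses zero precisely where Q itself is at its maximum; the Law of Diminishing Returns asserts only that MO must eventually decline.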

Examples from which hard, actual data could be drawn are legion. Generally, this pattern will occur wherever the Law holds; indeed, it is the recurrence of this pattern that is often used as evidence of the Law's applicability.

[Figure 1-3 Adapted from The Mythical Man-Month]

This graph takes a slightly different tack. Rather than showing Labour-units vs output, it shows "Men" vs "Months". What we see is the 'fruitbowl' curve. This curve is the same curve we see in the Law of Increasing Costs - the logical twin of the Law of Diminishing Returns. Compare this graph with the one below.

[Figure 1-4 Adapted from Economics from a Global Perspective]

So it becomes reasonable to assert that the Law of Diminishing Returns and Brooks' Law are roughly equivalent. The terminology between them fluctuates, but the meaning and the consequent graphs are highly similar. And, just as the Law of Diminishing Returns has been demonstrated to appear over and over, so has Brooks' Law. It is not a case of a coincidental match of graphs.

For the rest of this essay, Brooks' Law and the Law of Diminishing Returns will be assumed to be functionally equivalent. This being so, it becomes viable to apply certain basic economic analyses to Brooks' Law. In particular, we investigate the bold claim that Brooks' Law can be broken.

The Bazaar Introduced

In his influential paper, "The Cathedral and the Bazaar", Eric S. Raymond proposes a theory about how Brooks' Law can be overcome. He describes a model of development which he calls the "Bazaar".

The Bazaar concept is somewhat multi-faceted; its key elements are reflected in the mechanisms described below.

In essence, Raymond argues that when Bazaar conditions exist, the opposite of Brooks' Law becomes true: more programmers mean higher productivity. He proposes several reasons why this might be so. These arise, almost naturally, from a combination of what Raymond considers the 'Hacker mindset'[4], ease of communication, and the open availability of sourcecode:

  1. Low Management Overhead: Programmers are able to modify sourcecode freely. They can add new features and fix bugs without being directed to do so, and without the necessity of broad and deep coordination.
  2. Competition between Solutions: As problems to solve become known in an open-source project, several competing solutions may emerge from multiple sources. The best solution can then be chosen and integrated. If there is disagreement, parties can split off their version of the solution by "forking the tree", in which case, both solutions are implemented.
  3. Parallel Handling of Vertical Problems: A vertical problem is one which is standalone (it stands up without support: vertically). Because these problems can be solved in isolation (the best example being debugging), the number that can be solved scales easily with additional workers. An open-source environment has no practicable physical limit on how many programmers can participate, which means such projects excel at solving vertical problems in parallel.

Raymond asserts that, for these reasons, open-source projects can break Brooks' Law. In particular, he points to the parallel nature of open-source development, going so far as to say "Given enough eyeballs, all bugs are shallow", dubbing this "Linus' Law" in honour of Linus Torvalds, originator of the Linux kernel.

Previously, we sought to establish a link between Brooks' Law and the Law of Diminishing Returns. The outcome was that Brooks' Law is analogous to the Law of Diminishing Returns, having been derived as a single case from one field of human endeavour rather than as a general law. The two were shown to be equivalent in describing a process.

By implication, then, Raymond has asserted that the Law of Diminishing Returns can be broken in a Bazaar environment.

The GNOME Project

In order to assess Raymond's claim, this essay uses a high-profile Bazaar-style project. The GNOME project (as with many newer Free Software projects) explicitly adopted Raymond's theory of the Bazaar as its working principles[5].

The GNOME project has several properties which make it an ideal source of data: it is large, its development data are fully public, and it makes extensive use of the CVS revision-control system[6].

It is the last of these properties - the extensive use of CVS - that renders GNOME a particularly useful source of data. The CVS system automatically keeps extensive logs of all programmer activity. For the GNOME project, all of this data is publicly available. It is these logs that form the tables on which this essay is based.
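
As an illustration of the kind of extraction involved, the sketch below tallies lines-of-code changed per contributor from the output of the standard `cvs log` command, whose per-revision headers record an author and a "lines: +added -removed" field. It is a minimal, hypothetical sketch only: the file name 'cvslog.txt' is invented, and it is not the program actually used to build the essay's tables (those appear in Appendix V).

    #!/usr/bin/perl -w
    # Hypothetical sketch: tally LOC changed per contributor from `cvs log` output.
    use strict;

    my %loc_by_author;

    open(my $log, '<', 'cvslog.txt') or die "cannot open cvslog.txt: $!";
    while (my $line = <$log>) {
        # A typical revision header looks like:
        # date: 2001/03/04 12:34:56;  author: jdoe;  state: Exp;  lines: +10 -2
        if ($line =~ /author:\s*([^;]+);.*lines:\s*\+(\d+)\s+-(\d+)/) {
            my ($author, $added, $removed) = ($1, $2, $3);
            $loc_by_author{$author} += $added + $removed;   # count every changed line
        }
    }
    close($log);

    # Print contributors from most to least active, as "author,loc".
    for my $author (sort { $loc_by_author{$b} <=> $loc_by_author{$a} } keys %loc_by_author) {
        print "$author,$loc_by_author{$author}\n";
    }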

GNOME also happens to be a quite large project. It is not the largest Free Software project (the largest probably being the Linux operating system), but it is one of the largest with public-access CVS logs. It is its size which allows for the construction of smoother curves.

Examination and Discussion of GNOME Data

There are five graphs that this essay will examine. What we are looking for is one of two 'signature' curves in the data: the 'bell' present in LODR Average Total Output graphs, and the positive-to-negative crossing present in LODR Marginal Output graphs.

The five graphs reflect the two ways of viewing the GNOME project. The first view is GNOME as a single, conglomerated project. All data is combined for representation in these graphs; all programmers, all output in the form of lines of code changed (LOC).

There are three likely conclusions that we could draw from examination of these graphs. In answer to the question "Does Free Software Production in a Bazaar Obey the Law of Diminishing Returns?", we might find that:

  1. Yes, Bazaars clearly Obey the Law of Diminishing Returns - in this case, we will expect to see the distinctive shapes, as described earlier in the essay.
  2. Yes, Bazaars Obey the Law of Diminishing Returns, but are extremely efficient - in this case, we cannot decisively say that Bazaars obey or break the Law of Diminishing Returns, but it looks as though they would in a project substantially bigger than GNOME.
  3. No, Bazaars do not Obey the Law of Diminishing Returns - in this case, we are in uncharted territory. The Law of Diminishing Returns, in cases where it is said to apply, has not yet been contradicted. This is the hardest conclusion to 'prove'.

These conclusions are dependent on the logical chain of argument that the preceding body of the essay has sought to establish. In particular, they assume that Brooks' Law and the Law of Diminishing Returns are functionally equivalent, and that a Bazaar environment is a production process in the short run. There are alternative conclusions if these assumptions are challenged; more detail is available in Appendix III.

We will now examine the GNOME data from two perspectives: as an aggregate set of data, and as a collection of subprojects.

GNOME as a Single Project

GNOME as a Single Project provides us with a large set of data to work with. All data from all projects is aggregated into a single table. From this table, we can generate figure 2-1: Contributors versus Average Total Output.

[Figure 2-1]

This graph displays a smooth gradation in the project's ATO as contributors are added. Moreover, the curve displays slight acceleration. This happens to be the same pattern as a very early-stage ATO graph in a Diminishing Returns situation - marginal output is still accelerating. However, this is not conclusive proof, as the graph could (theoretically) now take any one of an infinite number of paths as more contributors are added. Thus, we can take the evidence presented by this graph as circumstantial only.

GNOME as a Collection of Subprojects

Now we turn to a more plentiful collection of graphs, which consider GNOME not as an aggregate data-set, but as a collection of smaller projects from which data is drawn. Instead of forming a single, large table, we have extracted data individually for each project and then aggregated it into new tables.

These graphs probably present a better picture of Bazaar behaviour than the Single Project view. This is because each project is largely self-contained and self-led. Whilst a certain proportion of the GNOME project is planned on a higher level than the subproject, a larger fraction is formed by projects that are operated somewhat independently of the larger project.

The subproject level also has a higher 'concentration' of data. In the aggregate graph, the contributions of programmers are counted once. However, contributors to GNOME often contribute to multiple projects. This means that in contributor terms, GNOME hosts a 'virtual' Labour force of more than its 300 or so programmers (this number varies, see Appendix III for more detail).

In the project-level view, the contributions made to each project are individually counted and then aggregated across the projects. This means that GNOME's entire 'virtual Labour force' is represented in full, rather than just the roll of actual contributors.

The 'virtual Labour force' does pose some problems for dealing with the data, however. Refer to Appendix III.

Our first graph is the Average Total Output over Subprojects. This graph is formed by breaking down each project into levels of programmers versus LOC changed. These figures are then averaged across all the subprojects, as sketched below.
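
A hypothetical sketch of this averaging step follows. It assumes an input file, here called 'subprojects.csv', holding one "subproject,contributors,loc_changed" line per subproject; the actual arrangement of the essay's data may differ (the real scripts are in Appendix V). The marginal figure is taken as the change in the averaged output between successive contributor counts, which is one plausible way of building the Marginal Output curve discussed later.

    #!/usr/bin/perl -w
    # Hypothetical sketch of the per-subproject averaging described above.
    use strict;

    my (%total_loc, %project_count);

    open(my $in, '<', 'subprojects.csv') or die "cannot open subprojects.csv: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($name, $contributors, $loc) = split /,/, $line;
        next unless defined $loc;                  # skip malformed lines
        $total_loc{$contributors}     += $loc;
        $project_count{$contributors} += 1;
    }
    close($in);

    # Emit "contributors,average_loc,marginal_loc" ordered by contributor count.
    my $previous;
    for my $n (sort { $a <=> $b } keys %total_loc) {
        my $average  = $total_loc{$n} / $project_count{$n};
        my $marginal = defined $previous ? $average - $previous : 0;
        printf "%s,%.1f,%.1f\n", $n, $average, $marginal;
        $previous = $average;
    }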

[Figure 2-2]

This graph is similar to our very first graph. Once again, it rises steadily upwards. However, the graph is not nearly so smooth. Most notably, there are two sharp dips in the graph, in the mid-fifties and low seventies. The general trend observed before persists, but for some reason projects in the mid-fifties and seventies region seem to have a lower productivity than projects elsewhere on the curve.

This evidence, whilst less smooth than the aggregate single-project curve, does roughly concur with it. There are no surprises here, in that sense. If we repeat the earlier interpretation, we would again say that there is circumstantial evidence of the Law of Diminishing Returns applying, but that this evidence is by no means conclusive.

Our second graph is the Marginal Output curve, generated from the same data.

[Figure 2-3]

This graph is the most interesting from an economist's standpoint. A standard Law of Diminishing Returns Marginal Output graph would, in theory, start positive and then dip into negative values.

This graph has done no such thing.

What we see instead is a pattern of wide oscillation that generally does not fall below zero at all: what start as small, almost insignificant 'pulses' grow in amplitude as the number of contributors increases. From this graph, it is not possible to make any conclusive points with regard to the applicability of the Law of Diminishing Returns.

Conclusions

This essay will now consider conclusions, based upon the data.

If GNOME had obeyed the Law of Diminishing Returns, we would have seen the distinctive hill-and-valley curve in the marginal output. We saw no such curve.

Therefore, GNOME may be breaking the Law of Diminishing Returns.

If GNOME had obeyed the Law of Diminishing Returns, we would also expect to see the bell-shaped curve in the average total output graphs. We saw what might have been the beginning of such a curve in both cases.

Therefore, GNOME may be obeying the Law of Diminishing Returns.

As the astute reader will have noticed, these conclusions are contradictory. We are now forced to fall back on the balance of likelihood to make our conclusion.

Given that the Law of Diminishing Returns has not yet been observed to be broken, the author is inclined to say that:

GNOME probably obeys the Law of Diminishing Returns, but has not yet reached its production turning point.

Implications and Considerations

There are several implications worth considering.

The first is that the GNOME data-set was too small to be conclusive. Whilst there are hundreds of actual volunteers on the project, and a 'virtual Labour force' several times as large, the data simply has not shown us anything conclusive. A much larger set of data will be needed, in particular to:

  1. See if the 'bell' shape manifests on the ATO curves
  2. See if the 'hill-and-valley' shape manifests on MO curves

The second consideration is the oscillatory nature of the MO curve. A standard MO curve moves in a definite way. This MO curve bears no resemblance to the standard one, giving credence to the possibility that Bazaars may, in fact, break the Law of Diminishing Returns.

Further Directions of Research and Work

This essay, alas, raises more questions than it answers. While it has failed to draw definitive conclusions, it does manage to form an initial foray of economic theory into the Free Software world. But there is far, far more room for research.

In terms of assisting this research, a better range of tools needs to be made available and used for the gathering of project data. Whilst the CVS system keeps logs, these are by themselves not enough. Primarily, they are stored in a format which makes meaningful extraction of data difficult, and the extracted data less useful than it could be.

In Summation

In summation:
  1. The essay cannot conclusively state whether Free Software Production does or does not Obey the Law of Diminishing Returns.
  2. The essay concludes, on the balance of likelihood, that Bazaar production does not break the Law of Diminishing Returns.
  3. The research would be more profitable if repeated on a larger project.
  4. Future research needs to be assisted with better data tools.

Endnotes

  1. Sourcecode is another form of software. It is from sourcecode that the harder-to-modify "executable" is derived.

    Sourcecode is the 'recipe' for a task a computer might perform. It tells a computer how the data it is using is structured, how to access it, and what to do with it. This can be expressed in a number of artificial "programming languages". If one has access to the sourcecode, it is possible to intimately understand how a program works. It also becomes possible to expand or modify its functionality; or to reuse pre-existing sourcecode in new programs.

  2. In more detailed terms, we might define the factors thus:
  3. Meaning "Lines Of Code". Probably the most widely used metric in software engineering. It refers to a single line of instructions for the computer that is not blank or a comment intended for human use. Because, typically, only one instruction is placed per line, LOC provides a usable measurement of both the gross size and general complexity of a given project.
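
    As a concrete (and purely hypothetical) illustration of the metric, the following sketch counts LOC in a Perl source file, treating lines that are blank or contain only a comment as non-code. The '#' comment rule shown is Perl's own; other languages would need their own rule, and the file name is illustrative only.

        #!/usr/bin/perl -w
        # Hypothetical sketch of the LOC metric as defined in this endnote.
        use strict;

        my $loc = 0;
        open(my $src, '<', 'example.pl') or die "cannot open example.pl: $!";
        while (my $line = <$src>) {
            next if $line =~ /^\s*$/;   # skip blank lines
            next if $line =~ /^\s*#/;   # skip comment-only lines
            $loc++;
        }
        close($src);
        print "LOC: $loc\n";
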
  4. "Hacker" is a term of considerable stigma in mainstream society, usually taken to mean a person who maliciously trespasses on other people's computer systems.

    Raymond's use of the term is, in fact, the original computer-world meaning. He takes a hacker to be a person who delights in problem solving (especially in programming), and who believes that once a problem is solved, the solution ought to be shared.

    The more common use - intruder - is denoted by traditional hackers with the word "cracker". Hackers go to great lengths to distance themselves from crackers and their activities.

  5. Ironically, Miguel de Icaza, in private conversation with the essay's author, said that he did not himself believe that it was a case of "The Cathedral versus the Bazaar". He saw them as two extremes between which there are many compromises - two ends of the same stick.
  6. Concurrent Versions System. CVS is a change-management system. It provides a central point of work for multiple programmers working on the same sourcecode. Programmers can 'check out' code, work on it on their own computer, and then 'merge' their changes back into the original.

    Apart from its usefulness in centralising code storage and management, CVS provides change-tracking capabilities. It keeps 'delta-files', which list every change made to any file, at any time, by any programmer. This is normally used as a sort of super-powered "undo" function.

    Bibliography

    Appendices

    These appendices have been collated in order to remove a burden of text and detail from the main document. They encapsulate a body of supporting work which helps to set the main body into a better context than it could do by itself.

    In order, the appendices are:

    1. Acknowledgements and Thanks - A list of people who have directly or indirectly assisted with this essay.
    2. The Opensource Definition - A document published by the Open Source Initiative (OSI) is adapted in this appendix to provide the 'common list' of Free Software/Opensource licensing features.
    3. Caveats - A collection of, well, caveats. These are possible flaws and 'gotchas' in the main body's theses and conclusions, and in the data upon which all of it rests. Since the main body spends most of its time establishing a logical flow, the Caveats section serves to treat these issues in more depth.
    4. Economic Principles in Free Software - Discussion of the Caveats produces a number of open questions regarding how elementary economic theory applies to the Free Software world. Since this essay has already introduced the ideological (see "Free Software Defined") and legal (see "The Opensource Definition") aspects of Free Software, this appendix exclusively aims to show how and where the most basic principles of economics apply. This is necessary to help create the theoretical framework upon which the main body of this essay relies.
    5. Program Listings - This section includes the programs written by the author in order to 'massage' the data into a graphable form. Because they incorporate the logic of how the data was arranged for graphical formatting, they are an essential part of the overall essay conclusion.

    Appendix I: Acknowledgements and Thanks

    As always, there are more to thank than I can easily recall, so I will try to run through in chronological order.

    Firstly, then, I would like to thank my first teacher of economics, ("Sir") Andrew Boukaseff ("QC"). Boukaseff - known as bouka, seffsta, andrew, drew, joke and bloke to friends and students - was the kind of teacher everyone loves to get. He made economics real and relevant. In a subject where the numbers are critical, he reminded us that beneath the national income accounts and development theories there are real human beings. The lesson that sticks with me best is that economics is just one of the parts of the greater human story.

    Secondly, I'd like to thank Eric S. Raymond. Eric's theories and papers provided the jumping-off point for a lot of my own work. He has given me moral support as I have slowly chipped away at this essay for the last two years; he was the first to agree to review it.

    Thirdly, I'd like to thank my Theory of Knowledge teacher, Mrs Forbes-Harper. We've had our angry exchanges and disagreements over the years, but Mrs F-H has helped me more (than almost anyone else) to understand just how one goes about proving things.

    Fourthly, I'd like to thank my father, Barry Chester. My dad has been a useful wellspring of critique over the years. Certainly, when I want to critique my own work, I find myself thinking "What would Dad say?"

    Fifthly, I'd like to thank my second economics teacher, Tony Trickey. The trickster is apt to get angry when poked with a stick or likewise provoked, but he certainly hammers home the point that the numbers do matter. Don't worry sir, you didn't fret for nothing!

    Sixthly, I'd like to thank Michael Zucchi, without whose help I would be paddling up a certain proverbial creek sans paddle. His programming ability and expertise with the GNOME CVS system made him the ideal candidate to generate the raw data I needed for this essay.

    Next, I'd like to thank the band of online reviewers who saw my essay in various incomplete and generally spotty incarnations. Some of these have already been listed above, others include Miguel de Icaza (leader of the GNOME project) and the license-discuss list hosted by the Open Source Initiative. Thanks, guys.

    Finally, but perhaps most of all, I would like to thank Richard M. Stallman.

    More than anyone else on this list, Stallman (RMS, as he is widely known) is a 'demi-god' in the hacker community. Even so, he has taken much time out to kindly help me with my work, pointing out items of representation and wording that could be fixed.

    He is a programming genius, probably one of the greatest of all time. His calm approach, and unflinching resolve to uphold his principles, make him a leader and visionary who will - I think - one day be listed alongside the likes of Martin Luther King and Gandhi.

    To anyone else I have neglected to mention - thank you. Thank you so very, very much for helping me on this long journey. I feel like I have been doing this forever, and now it's done! Best wishes to all of you, and be well.

    Appendix II: The Opensource Definition

    This appendix is adapted from two online documents: the Opensource Definition (http://www.opensource.org/osd.html) and the Opensource Definition Rationale (http://www.opensource.org/osd-rationale.html). Text in plain type is from the OSD; the rationale is presented in italics.

    The intent of the Open Source Definition is to write down a concrete set of criteria that we believe capture the essence of what the software development community wants "Open Source" to mean -- criteria that ensure that software distributed under an open-source license will be available for independent peer review and continuous evolutionary improvement and selection, reaching levels of reliability and power no closed product can attain.

    For the evolutionary process to work, we have to counter short-term incentives for people to stop contributing to the software gene pool. This means the license terms must prevent people from locking up software where very few people can see or modify it.

    Open source doesn't just mean access to the source code. The distribution terms of open-source software must comply with the following criteria:

    1. Free Redistribution

    The license may not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources. The license may not require a royalty or other fee for such sale.

    By constraining the license to require free redistribution, we eliminate the temptation to throw away many long-term gains in order to make a few short-term sales dollars. If we didn't do this, there would be lots of pressure for cooperators to defect.

    2. Source Code

    The program must include source code, and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code for no more than a reasonable reproduction cost -- preferably, downloading via the Internet without charge. The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.

    We require access to un-obfuscated source code because you can't evolve programs without modifying them. Since our purpose is to make evolution easy, we require that modification be made easy.

    3. Derived Works

    The license must allow modifications and derived works, and must allow them to be distributed under the same terms as the license of the original software.

    The mere ability to read source isn't enough to support independent peer review and rapid evolutionary selection. For rapid evolution to happen, people need to be able to experiment with and redistribute modifications.

    4. Integrity of The Author's Source Code.

    The license may restrict source-code from being distributed in modified form only if the license allows the distribution of "patch files" with the source code for the purpose of modifying the program at build time. The license must explicitly permit distribution of software built from modified source code. The license may require derived works to carry a different name or version number from the original software.

    Encouraging lots of improvement is a good thing, but users have a right to know who is responsible for the software they are using. Authors and maintainers have reciprocal right to know what they're being asked to support and protect their reputations.

    Accordingly, an open-source license must guarantee that source be readily available, but may require that it be distributed as pristine base sources plus patches. In this way, "unofficial" changes can be made available but readily distinguished from the base source.

    5. No Discrimination Against Persons or Groups.

    The license must not discriminate against any person or group of persons.

    In order to get the maximum benefit from the process, the maximum diversity of persons and groups should be equally eligible to contribute to open sources. Therefore we forbid any open-source license from locking anybody out of the process.

    Some countries, including the United States, have export restrictions for certain types of software. An OSD-conformant license may warn licensees of applicable restrictions and remind them that they are obliged to obey the law; however, it may not incorporate such restrictions itself.

    6. No Discrimination Against Fields of Endeavor.

    The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.

    The major intention of this clause is to prohibit license traps that prevent open source from being used commercially. We want commercial users to join our community, not feel excluded from it.

    7. Distribution of License.

    The rights attached to the program must apply to all to whom the program is redistributed without the need for execution of an additional license by those parties.

    This clause is intended to forbid closing up software by indirect means such as requiring a non-disclosure agreement.

    8. License Must Not Be Specific to a Product.

    The rights attached to the program must not depend on the program's being part of a particular software distribution. If the program is extracted from that distribution and used or distributed within the terms of the program's license, all parties to whom the program is redistributed should have the same rights as those that are granted in conjunction with the original software distribution.

    This clause forecloses yet another class of license traps.

    9. License Must Not Contaminate Other Software.

    The license must not place restrictions on other software that is distributed along with the licensed software. For example, the license must not insist that all other programs distributed on the same medium must be open-source software.

    Distributors of open-source software have the right to make their own choices about their own software.

    Yes, the GPL is conformant with this requirement. GPLed libraries `contaminate' only software to which they will actively be linked at runtime, not software with which they are merely distributed.

    Appendix III: Caveats

    The deep waters of Free Software research are largely untested. There are areas where deep points of disagreement can arise in any paper which strives (as this one does) to objectively analyse the field.

    Again, due to constraints placed upon the main body of this essay, there is scant space in which to fairly and properly address these items of contention. As such, the author finds it necessary to instead provide a series of possible problems and flaws in the essay. In the spirit of laissez faire, these are the caveats - literally, the "Bewares".

    These come from two sources. One is from the Author's own attempt to self-critique. The second is from a small body of reviewers who have offered varying degrees of insight. These people are listed at length in Appendix I.

    These caveats could be sorted into several 'genres'.

    But really, such categorisations are without limit. Instead, we treat the caveats where they rise: essentially, from the logical structure of the essay. Where either a premise is made or a derivation drawn, there is room for contention. Indeed, at many links in the logical chain, there arose caveats.

    A summary of the essay's logical structure might be rendered thusly:

    1) Describe Free Software
    in terms of elementary
    Economic theory
    |
    2) Define Free Software
    |
    3) Describe Brooks' Law
    |
    4) Use analogy to equate
    Brooks' Law and the
    Law of Diminishing
    Returns
    |
    5) Introduce Raymond's
    Bazaar, including
    "Breaks Brooks' Law"
    assertion.
    |
    6) From premise that
    Brooks' Law = LODR
    and from premise
    that Bazaars break
    Brooks' Law, derive
    that Bazaars can
    break the LODR.
    |
    7) Select and introduce
    the GNOME test-case.
    |
    8) Using GNOME data,
    create graphs.
    |
    9) Use graphical patterns
    to draw conclusions on
    Raymond's assertion.
    

    If, at each stage, you accept the essay's assumptions and premises, the logical flow is basically self-consistent and valid. But, if you do not, there is room for contention.

    Describe Free Software in terms of elementary economic theory.

    This is not part of the main body, but appears largely in Appendix IV.

    The major caveats here are twofold. Firstly, there is the assignment of factors of production. It may be that ideas are not Land at all, but are merely Capital. In this case, the fixed factor of the essay would be Capital.

    Secondly, there is contention about whether it is a fixed factor at all. Whilst the fixed-factor argument is justified, it is open to attack - primarily by the argument that Bazaars are evolutionary in nature, and thus do not have a fixed set of ideas to work to at all.

    There is a minor issue of understanding, also, with the use of the word "Free". The word "Free" appears in two contexts in Appendix IV. The first is in the liberty sense: Free as in Freedom. The other is in the economic sense: Free as in Free Good.

    Richard Stallman noted that "This is a legitimate question for study, but to put the question in context, it is important to note at the beginning that economics can only partly explain what happens in free software. Free software development as an activity often has strong noneconomic motivations, including political idealism, and free software as a social phenomenon is influenced by many noneconomic factors, such as community spirit."

    Richard asked me to switch from using the term "Free" in reference to Free Goods, in favour of "zero-price". Whilst this term would reduce confusion, I felt it would fail to convey the additional meaning the word "Free" carries in economics.

    Define Free Software

    There were no real objections to this part of the essay. Reviewers commented on the use of words, rather than the content. It does not significantly alter the logical flow of the essay. Rather, it serves to introduce the reader to the subject of study.

    Describe Brooks' Law

    There was some disagreement about my description of Brooks' Law.

    In particular, the exact wording of Brooks' Law is that "Adding people to a late project makes it later"; whereas I had included the "efficient partitioning" component.

    In regards to the 'extended' version of Brooks' Law, I rely on Steve McConnell's Code Complete as my authority. He describes further analysis that Brooks applies to his own law, which includes the 'extended' version.

    Use analogy to equate Brooks' Law and the Law of Diminishing Returns

    This is the most important part of the essay. There are two levels of objection here.

    Firstly, in the extreme case, this section may be entirely wrong. The reasoning is by analogy and graphical comparison. There is no formal or mathematical equivalence at work. The equivalence is defensible only in qualitative terms.

    Secondly, in the less extreme case, Eric S. Raymond says that "I strongly suspect [that] ... Brooks's Law is not precisely equivalent to LODR, but is rather a special case of it involving particular nonlinear scaling phenomena. Accordingly, one may assert that the bazaar mode repeals Brooks's Law without making any commitment about the applicability of the LODR in general."

    Introduce Raymond's Bazaar, including "Breaks Brooks' Law" assertion

    This section does not have any caveats in and of itself. There is the possibility that Raymond's hypotheses are in themselves flawed. If this is the case, this essay is built upon a flawed base.

    Otherwise, this section was not contested by the reviewers. Most notably, Eric S. Raymond (the author of the paper in which the Bazaar was introduced) raised no objection to it.

    From the premise that Brooks' Law = LODR and from the premise that Bazaars break Brooks' Law, derive that Bazaars can break the LODR.

    This section is, in terms of deductive logic, valid: if the two premises are true, the conclusion must also be true.

    For the purposes of the essay, it is assumed that the premises are both true. If they are not, then this derived answer would be thrown into doubt.

    Select and introduce the GNOME test-case.

    There is some contention about the choice of the GNOME project as a test-case. Some reviewers argued that each major Free Software project has its own way of organising itself, and that the conclusion would not be generally applicable across all Free Software projects.

    Using GNOME data, create graphs.

    This is perhaps the richest vein of caveats.

    Firstly, there is the data itself. Whilst the GNOME project has relatively usable data in its CVS system, this data is not likely to be entirely accurate. This is due largely to the problem of contributor ambiguity.

    The CVS system is only able to keep track of contributors based on the identity they supply. As such, there is a strong possibility that some of the contributors are in fact 'duplicates', where the work contributed by one GNOME project member is logged against several identities.

    Estimating the scale of this ambiguity is difficult. The raw CVS data (as supplied to the author by Michael Zucchi) included a list of 'possible other contributors'. These were based on emails, but once again, email addresses can (and do, in the GNOME project membership) change.
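
    One possible mitigation, sketched below purely as an illustration, is to fold the several identities a single person may use into one canonical name via a hand-maintained alias table before any totals are computed. All names and addresses in the sketch are invented, and no such table was used in preparing this essay's data.

        #!/usr/bin/perl -w
        # Hypothetical sketch: collapse duplicate contributor identities.
        use strict;

        my %alias = (
            'jdoe'                 => 'John Doe',
            'jdoe@example.org'     => 'John Doe',
            'john.doe@example.net' => 'John Doe',
        );

        my %loc_by_person;

        # Each input line is assumed to be "identity,loc_changed".
        while (my $line = <STDIN>) {
            chomp $line;
            my ($identity, $loc) = split /,/, $line;
            next unless defined $loc;
            my $person = $alias{$identity} || $identity;   # fall back to the raw identity
            $loc_by_person{$person} += $loc;
        }

        print "$_,$loc_by_person{$_}\n" for sort keys %loc_by_person;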

    Second is the formation of the graphs themselves. There is a possibility that the author's construction of the graphs is flawed. The reader is advised that Appendix V includes program listings for this essay, in the popular programming language, Perl.

    Third is the size of the data set. Whilst it was large enough to cause the author headaches, it may not be large enough to iron out statistical anomalies and was certainly not large enough to reach a decisive conclusion.

    Use graphical patterns to draw conclusions on Raymond's assertion.

    Like the earlier use of graph comparison, the use of graphs to draw conclusions is not perfect. At best, we will be able to assert qualitative outcomes from quantitative data.

    A more in-depth study would ideally test data against an expected value or values in any number of variables.

    Appendix IV: Economic Principles in Free Software

    As a field of research, the world of Free Software and/or Opensource Software is relatively untouched. It is the child of a thriving subculture and several obscure accidents of history; but its implications have changed, change, and will continue to change the world. Among the children and siblings of Free Software, we can count the Internet, email, the World Wide Web, the GNU/Linux operating system; and we can count things that rely on such software. Amongst these ranks are endless websites, commercial firms, governments and educational institutions.

    Yet very little in the way of research has been done in this area. It rests as a vast, untapped goldmine for the social sciences. Certain elementary rules appear to have changed: but the changes have thus far been remarkably successful. Reasons as to why this is so are likely to form the basis of many fruitful years of research to come.

    This appendix attempts to take some of the most elementary tenets of Economics and apply them to the Free Software world. It serves two functions. In the first and more general instance, it provides a kind of "translating dictionary" function for both Economists and the Free Software community. It hopefully allows for meaningful insight on both sides of the fence.

    In the second instance, it exists as a supporting document to the main body of this essay. As well as highlighting areas of connection, it shows which things are considered to be 'true' for the purposes of the essay. This is an important issue, where even assigning Free Software items into the Factors of Production is fraught with possible oversights and flaws. This essay will point out these danger zones, and indicate which 'truth' the essay relies upon for its theses and conclusions.

    This appendix will address these issues in order, under headings that may read like the table of contents for an economics textbook. This is purely because appending " ... and Free Software" to each would be annoying and redundant for both author and reader.

    Sourcecode as a Good

    This essay assumes that the sourcecode is the product.

    Sourcecode is a series of instructions in a language designed to be readable by humans. In the act of programming, the most common process is to transcribe ideas into a computer program by creating sourcecode describing the ideas.

    This sourcecode is then 'compiled'. Compilation is the act of turning the sourcecode into machine-code, the language that the computer itself understands.

    Once a program has been compiled into machine-code, further implications arise; they are, however, peripheral to this essay.

    Hence, the act of compilation has been ignored in this essay. Instead, the focus is on sourcecode as the final outcome of the Free Software production process.

    The first step is to try and classify Free Software as a Good. In economics, we discuss 'Free Goods' and 'Economic Goods'. The economic meaning of the word 'Free' is very different from the meaning meant by 'Free Software', so it is worth clearing this point up.

    The word 'Free', as used in 'Free Software', explicitly refers to the liberty aspect of the software. It does not refer to the economic meanings of price or scarcity. In order to place sourcecode into either the Free or the Economic category, we summarise the definitions of both.

    An Economic Good is a good where:

    1. There is more wanted than available (ie, there is scarcity)
    2. The good can command a price in the marketplace

    A Free Good is a good where:

    1. There is as much or more available than wanted (ie, there is abundance)
    2. The good is available at a zero or near-zero opportunity cost

    For reasons that will be further outlined in this appendix, it is arguable that Free Software is:

    1. In abundance
    2. Available at zero or near-zero opportunity cost

    Therefore, it is reasonable to assert that such sourcecode is a Free Good.

    Scarcity and Choice

    In Economics, our focus is on decisions made in conditions of scarcity. Indeed, many textbooks cite economics as the study of meeting 'unlimited wants with limited means'.

    We define choice as the act of distinguishing between alternative courses of action. Choice is usually based on the concept of scarcity and opportunity cost, where opportunity cost is "the next best choice".

    However, we have already classified Free Software as a Free Good; the opportunity cost of obtaining it is therefore near-zero. The opportunity cost of producing Free Software is an entirely different matter. By no means is there a near-zero opportunity cost for the production of Free Software: it takes time and considerable mental effort to create quality sourcecode, time and effort that could be expended in any number of alternative ways, some of them carrying monetary payment.

    The Three Questions of Production

    And so we shift from the scarcity-choice (or rather, the abundance-choice) world of the sourcecode consumer to the scarcity-choice world of the programmer.

    The Free Software programmer is part of a production process. They are producing sourcecode. As with all production processes, they are implicitly or explicitly resolving three fundamental questions:

    1. What to Produce - What project they should devote their time and mental effort to
    2. How to Produce - What production system they should use to create Free sourcecode
    3. For Whom to Produce - Who they are producing the sourcecode for.

    As regards What to Produce, there are three basic hypotheses, commonly labelled the 'Developer's Itch', 'Homesteading' and the 'Big Itch'.

    This list is abbreviated from several competing, similar hypotheses that have been proposed. Essentially, they cover the field. Some also tend to include "Big Itch" in "Homesteading", as an example of satisfaction through altruism.

    As regards How to Produce, the potential answers are varied. One popular option is Raymond's "Bazaar" model, which has already been outlined above and one of whose claims is examined by this essay.

    As regards For Whom to Produce, the answer is heavily influenced by the What question. Programmers either work in their own self-interest (Developer's Itch, Homesteading), or they work for 'everyone', towards an ideal or from an altruistic motive (Big Itch).

    The Factors of Production

    The production process, in economic terms, is the act of putting the four Factors of Production in a black box and shaking them up a little. The four factors are Land, Labour, Capital and Enterprise. We will show how these factors map to Free Software.

    The economic meaning of the word "Land" probably causes more confusion than all other economic jargon combined. While, certainly, Land can include real estate, it is not only real estate that can be Land. Land is considered to be any naturally-occurring thing included in the production process. It can be the real-estate used for a farm or a factory, it can be minerals extracted from the ground, it can even (for a tourist destination) be constant, dependable sunshine.

    Assuming that Free Software production takes place in a networked, online environment (as the Bazaar model states it must), real estate is of almost no concern to a Free Software production process. The only plausibly naturally-occurring item in Free Software is the ideas from which software is made. More on that below.

    Labour is easier to map to general terms. Labour is the component that does human work. In Free Software, the programmers can be counted as Labour.

    After Land, Capital is probably the next most confusing hijacking of a word by economics. In a production process, Capital does not mean cash. It refers to manufactured items which assist production. Essentially, this means machines and components. Just as with Land, it can either be a part of something (just as mineral ores are used to make steel) or it can help something to occur (just as sunshine encourages tourists to visit).

    Capital in the Free Software world breaks down into two areas, then: Tools and Libraries. Tools are programs that facilitate the creation of sourcecode: sourcecode editing programs, compilers, revision management systems and the like. Libraries are components which can provide significant portions of any new sourcecode's functionality.

    Enterprise is the factor which draws all the others together. It is the component that makes the decisions, takes the risks, and receives the bulk of the profits for those risks. In Free Software, Enterprise is usually a matter of trust rather than official authority. Since there are no means of punitive enforcement over the Labour, Enterprise programmers can lead only by merit and consensus. Otherwise, their Labour will desert them for other projects.

    That ideas are Land - naturally occurring - has been an issue of debate for thousands of years. Rather than engaging in an epistemological argument, this essay assumes that ideas are naturally-occurring. It also takes the line that software production, as a process, turns ideas into sourcecode.

    It can then be said that a program design is a selection of which ideas to implement. Since most Free Software projects work towards the implementation of only a few ideas at a time, it is taken that Land (in the form of ideas) is the fixed factor in Free Software production. This allows the essay's use of the Law of Diminishing Returns to apply.
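
    Stated compactly (in notation assumed here rather than used elsewhere in the essay), the short-run production function being relied upon is

        \[
        Q = f(L;\ \overline{Ideas},\ \overline{K}),
        \]

    with ideas (Land) and Capital held fixed, and with the Law of Diminishing Returns asserting that the marginal product dQ/dL must eventually decline as L grows.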

    Ownership and Control

    Ownership and Control are two aspects of the production process that are related but separate. Ownership is the means whereby one gains the rewards for possession of something; Control is the means whereby one takes risks with something.

    In private enterprise, it is common that Ownership and Control are effectively divided. Shareholders, while ultimately in control of their holding, give authority to corporate executives to take risks on their behalf. In return, they reap the benefits.

    In Free Software, Ownership rests with the author of the code. This is in keeping with standard copyright laws. Control, however, is markedly reduced. Free Software licenses specifically waive much of the legally-enforced control that an author has over their sourcecode.

    This means that control effectively rests with community consensus. If the sourcecode-producing community wishes the sourcecode to go in a certain direction, the chances are that it will. If there is both a powerful desire to take a certain direction and powerful resistance to it, the sourcecode often 'forks': the community that has grown up around the code splits, each faction taking its own copy of the code. This is akin to genetic mutation in ecosystems.

    The Marketplace

    Sourcecode has a marketplace of consumers and producers. Indeed, it is probably the purest form of the Free Market that there is. Specifically:

    The same sourcecode is propagated across many points of access, and anyone on the internet can reach any of these points easily. The sourcecode is identical, or close to identical, across these points. The internet allows nearly-instantaneous transmission of information about suppliers' products throughout the market.

    Finally, there is a push in Free Software to sell by product differentiation. One such firm is Red Hat, which has established brand recognition in the marketplace by selling pre-packaged, easy-to-use Free Software. While anyone else may distribute the same software, Red Hat has differentiated itself from the rest of the market effectively enough to assume a commanding position.

    Appendix V: Program Listings

    This appendix provides sourcecode listings for the Perl programs used in the creation of the essay's graphs. In the spirit of Free Software, every listing in this appendix is provided under the terms of the GNU General Public License. No Warranty is provided; these programs are provided As-Is.

    These scripts were written for Perl version 5.005_03, built for BeOS on the x86 platform. Note that the auto-execution line ("#!") is set to the BeOS default and will not work on Linux or other Unices without adjustment.
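
    For example, on a typical Linux system the first line of each listing would need to point at the local Perl binary instead; the exact path varies from system to system, but a common form is:

    #!/usr/bin/perl -w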

    strip_pvloc.pl

    #!/boot/home/config/bin/perl -w
    #
    # strip_pvloc.pl: Produces data for Programmers vs LOC graphs based on CVS
    # data extracts.
    
    # Load file
    
    print "\nstartup succeeds";
    
    $InFile = 'scanall.out';
    $OutFile = 'gnome.strip';
    &SpliceToFile;
    
    sub SpliceToFile {
    
    open(CVSIN, $InFile) or die "cannot open $InFile: $!\n";
    open(OUTFILE, ">$OutFile") or die "cannot write $OutFile: $!\n";
    
    print "\nopen file succeeds\n";
    
    # Parse file
    
    $k = -1;                  # Index of the current row in @CvsData; starts at -1
                              # so that the first useful line is stored at index 0.
    
    # Read CVSIN one line at a time until end-of-file.
    
    while (defined($CurrentLine = <CVSIN>))
        {
             chomp($CurrentLine);
             @NewRow = split(/,/, $CurrentLine);
         
    # If line contains useful data, splice it and output.
    
             if ($NewRow[0] eq '')
             	{ 
             	    $k += 1;
                    splice(@NewRow, 0, 2, ());    # Remove filename and version
                    splice(@NewRow, 2, 1, ());    # Remove state
                    splice(@NewRow, 4, 1, ());    # Remove other users
                    push(@CvsData, [ @NewRow ]);
    	                for($j = 0; $j < 4; $j++)
    	                {
    	                    print OUTFILE $CvsData[$k][$j] . ",";
    	                }
                    print OUTFILE "\n";
                }
             else { next; }
        }
    
    }          
    
    print "\nparse file succeeds\n";
    
    
    print "\nprogram complete.\n";
    

    strip_proj.pl

    #!/boot/home/config/bin/perl -w
    #
    # strip_proj.pl: Produces data for Programmers vs LOC/Project graphs based on
    # CVS data extracts.
    
    # Load file
    
    print "\nstartup succeeds";
    
    $InFile = 'gnome.exp.csv';
    $OutFile = 'gnome.proj.strip';
    &SpliceToFile;
    
    sub SpliceToFile {
    
    open(CVSIN, $InFile) or die "cannot open $InFile: $!\n";
    open(OUTFILE, ">$OutFile") or die "cannot write $OutFile: $!\n";
    
    print "\nopen file succeeds\n";
    
    # Parse file
    
    $k = -1;                  # Index of the current row in @CvsData; starts at -1
                              # so that the first useful line is stored at index 0.
    
    # Read CVSIN one line at a time until end-of-file.
    
    while (defined($CurrentLine = <CVSIN>))
        {
             chomp($CurrentLine);
             @NewRow = split(/,/, $CurrentLine);
         
    # If line contains useful data, splice it and output.
    # File filed order: 0 filename, 1 version, 2 datestamp, 3 user, 4 state,
    # 5 added lines, 6 removed lines, 7 possible other contributors
    #
    # Expected output is to project name, datestamp, user, added lines
    # and removed lines (5 fields)
    
             if ($NewRow[0])
             	{ 
             	    $k += 1;
                    
                    splice(@NewRow, 7, 1, ());    # Remove other users
                    splice(@NewRow, 4, 1, ());    # Remove state
                    splice(@NewRow, 1, 1, ());    # Remove version
                    $FileName = $NewRow[0];       # Get filename
                    @TempArray = split(/\//, $FileName);     # extract project name
                    $NewRow[0] = $TempArray[0];   # Assign proj. name to 1st field
                    
                    push(@CvsData, [ @NewRow ]);  # Store for later output
                    
                    # output data to $OutFile
    	            
    	                for($j = 0; $j < 5; $j++)
    	                {
    	                    print OUTFILE $CvsData[$k][$j] . ",";
    	                }
                    print OUTFILE "\n";
                }
             else { next; }
        }
    
    }          
    
    print "\nparse file succeeds\n";
    
    
    print "\nprogram complete.\n";
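
    The graphing scripts below read gnome.strip.sorted and gnome.proj.strip-sorted rather than the files written by the two stripping scripts above, so a sorting pass between the stripping and graphing stages is implied. That pass is not among the original listings; what follows is only a minimal sketch of what such a step might look like (the script name and approach are assumptions). It relies on both stripped files beginning each line with fields - a datestamp in one case, a project name followed by a datestamp in the other - that group and order correctly under a plain string sort, provided the datestamp format itself sorts lexically.

    #!/boot/home/config/bin/perl -w
    #
    # sort_strip.pl (hypothetical): lexically sorts the lines of a stripped CSV
    # file so that the graphing scripts receive their input grouped and ordered.
    # Usage: sort_strip.pl <infile> <outfile>
    
    ($InFile, $OutFile) = @ARGV;
    
    open(INFILE, $InFile) or die "cannot open $InFile: $!\n";
    @Rows = grep { /\S/ } <INFILE>;       # Read every row, dropping blank lines
    close(INFILE);
    
    open(OUTFILE, ">$OutFile") or die "cannot write $OutFile: $!\n";
    print OUTFILE sort @Rows;             # Plain string sort of whole lines
    close(OUTFILE);

    Run as, for example, "sort_strip.pl gnome.strip gnome.strip.sorted" and "sort_strip.pl gnome.proj.strip gnome.proj.strip-sorted" to produce the two sorted files used below.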
    
    

    prog_v_loc.pl

    #!/boot/home/config/bin/perl -w
    #
    # Note that the above auto-exec line is for the BeOS, not Linux/Unix.
    #
    # prog_v_loc.pl: Produces data for Programmers vs LOC graphs based on massaged
    # CVS data.
    
    # Load file
    
    print "\nstartup succeeds";
    
    $InFile = 'gnome.strip.sorted';
    $OutFile = 'gnome.pvloc.dat';
    
    # Load up the @GraphData array.
    
    open(INFILE, $InFile) or die "cannot open $InFile: $!\n";
    
    # Read INFILE one line at a time until end-of-file.
    
    while (defined($CurrentLine = <INFILE>))
        {
             chomp($CurrentLine);
             ($Time,$Contributor,$LinesAdd,$LinesRm) = split(/,/, $CurrentLine);
             $LinesChanged = $LinesAdd + $LinesRm;             # Consolidate lines added and removed
             @NewRow = ($Time, $Contributor, $LinesChanged);
             push(@GraphData, [@NewRow]);
        }
              
    
    open(OUTFILE, ">$OutFile") or die "cannot write $OutFile: $!\n";
    $CurrentProgrammer = $GraphData[0][1];
    $CurrentTotal = $GraphData[0][2];
    $NumProgrammers = 0;
    
    # Below is some hideous nesting. It makes more sense read from the inside out.
    # The short form is that it compares $NewProgrammer to each element in the
    # @ProgrammerNames array. If there is no match, it adds the name to the array, increments
    # the running total of programmers by one, and prints the number of programmers
    # and total current LOC to the output file.
    
    for($i = 1; $i < $#GraphData ; $i++ )
       {
           $NewProgrammer = $GraphData[$i][1];
           $NewTotal = $CurrentTotal + $GraphData[$i][2];
    
           if( $NewProgrammer ne $CurrentProgrammer )
               { 
                 $MatchCounter = 0;
                 for($j = 0; $j <= $#ProgrammerNames; $j++)
                     {
                         if( $ProgrammerNames[$j] eq $NewProgrammer )
                            { 
                             $MatchCounter+=1;
                            }
                     }
                 if( $MatchCounter == 0)
                     {
                      push(@ProgrammerNames, $NewProgrammer);
                      $NumProgrammers +=1;
                      print OUTFILE "\n$NumProgrammers,$CurrentTotal";
                     }
               }
           $CurrentProgrammer = $NewProgrammer;
           $CurrentTotal = $NewTotal;
       }
    
    # Accounting for the final programmer, who gets missed in the loop above.
    
    $NumProgrammers +=1;
    print OUTFILE "\n$NumProgrammers,$CurrentTotal";
    
    print "\nprogram complete.\n";
    

    prog_v_mloc.pl

    #!/boot/home/config/bin/perl -w
    #
    # Note that the above auto-exec line is for the BeOS, not Linux/Unix.
    #
    # prog_v_mloc.pl: Produces data for Marginal TO vs No. Programmers graphs based
    # on massaged CVS data.
    
    # Load file
    
    #print "\nstartup succeeds";
    
    $InFile = 'gnome.proj.strip-sorted';
    $OutFile = 'gnome.tp_v_mloc.dat';
    
    # Load up the @GraphData array.
    
    open(INFILE, $InFile) or die "cannot open $InFile: $!\n";
    
    # Read INFILE one line at a time until end-of-file.
    
    while (defined($CurrentLine = <INFILE>))
        {
             chomp($CurrentLine);
             ($Project, $Time, $Contributor, $LinesAdd, $LinesRm) = split(/,/, $CurrentLine);
             $LinesChanged = $LinesAdd + $LinesRm;             # Consolidate lines added and removed
             @NewRow = ($Project, $Time, $Contributor, $LinesChanged);
             push(@GraphData, [@NewRow]);
        }
        
    
    #Summarise marginal productivities per project
    
    $CurrentProject = $GraphData[0][0];
    $CurrentTotal = $GraphData[0][3];
    $CurrentProgrammer = $GraphData[0][2];
    
    $LineCount = 0;
    print "\nrows loaded: " . ($#GraphData + 1) . "\n";
    
    open(OUTFILE, ">$OutFile") or die "cannot write $OutFile: $!\n";
    
    for($i = 0; $i < $#GraphData; $i++ )
        {
         $LineCount += 1;
    
         $NewProgrammer = $GraphData[$i][2];      # Programmer, project and running
         $NewProject = $GraphData[$i][0];         # LOC total for this row.
         $NewTotal = $CurrentTotal + $GraphData[$i][3];
    
         # When a new project is reached, reset the running LOC total and the
         # list of programmers seen so far.
    
         if($NewProject ne $CurrentProject)
             {
              $NewTotal = 0;
              $CurrentTotal = 0;
              undef @ProgrammerNames;
              $NumProgrammers = 0;
             }
    
         # Otherwise, check whether this row introduces a programmer not yet seen
         # on this project; if so, record the name and print the project, the
         # programmer count and the running LOC total to the output file.
    
         elsif($NewProject eq $CurrentProject)
             {
              if( $NewProgrammer ne $CurrentProgrammer )
                  {
                   $MatchCounter = 0;
                   for($j = 0; $j <= $#ProgrammerNames; $j++)
                       {
                        if( $ProgrammerNames[$j] eq $NewProgrammer )
                           {
                            $MatchCounter += 1;
                           }
                       }
                   if( $MatchCounter == 0)
                       {
                        push(@ProgrammerNames, $NewProgrammer);
                        $NumProgrammers += 1;
                        print OUTFILE "\n$CurrentProject,$NumProgrammers,$CurrentTotal";
                       }
                  }
             }
    
         $CurrentProgrammer = $NewProgrammer;
         $CurrentTotal = $NewTotal;
         $CurrentProject = $NewProject;
        }
    
    #Average out marginal productivities
    #finish
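
    The two trailing comments above mark an averaging step that the listing does not carry out. Purely as an illustration of one way such a step could be written - the script name, the reading of the output rows as "project, programmer count, cumulative LOC", and the treatment of each project's first recorded programmer are all assumptions rather than part of the original tool-chain - the marginal LOC per additional programmer could be averaged across projects as follows:

    #!/boot/home/config/bin/perl -w
    #
    # avg_mloc.pl (hypothetical): averages, across projects, the marginal LOC
    # contributed by the n-th programmer, reading rows of the assumed form
    # "project,programmer count,cumulative LOC" from gnome.tp_v_mloc.dat.
    
    $InFile = 'gnome.tp_v_mloc.dat';
    
    open(INFILE, $InFile) or die "cannot open $InFile: $!\n";
    while (defined($Line = <INFILE>))
        {
             chomp($Line);
             next if ($Line eq '');                  # Skip blank lines
             ($Project, $NumProgrammers, $Total) = split(/,/, $Line);
    
             # Marginal LOC at this programmer count is the cumulative total
             # minus the previous cumulative total seen for the same project
             # (taken as zero for the project's first recorded programmer).
    
             $Marginal = $Total - ($PrevTotal{$Project} || 0);
             $PrevTotal{$Project} = $Total;
    
             if (defined $MarginalSum{$NumProgrammers})
                 {
                  $MarginalSum{$NumProgrammers} += $Marginal;
                  $MarginalCount{$NumProgrammers} += 1;
                 }
             else
                 {
                  $MarginalSum{$NumProgrammers} = $Marginal;
                  $MarginalCount{$NumProgrammers} = 1;
                 }
        }
    close(INFILE);
    
    # Print, for each programmer count, the average marginal LOC across projects.
    
    foreach $n (sort { $a <=> $b } keys %MarginalSum)
        {
             printf "%s,%.1f\n", $n, $MarginalSum{$n} / $MarginalCount{$n};
        }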