Study of Historical Code
I’ve started studying a larger historical code base. In this post, I want to summarize the kinds of historical questions we might ask, along with notes on how to approach them.
Objective
My objectives for studying and writing about historical source code are to understand and communicate the:
- Intent and purpose of the software
- Design, engineering trade-offs, and technical decisions
- Significance and influence of the code, its other forms, and how it was used
- Authorship, the process of development, inspirations, and why it was written
In his speech about history writing, [Knuth] gave this (paraphrased) list:
- Understand the process of discovery
- Understand the process of failure
- Celebrate the contributions of many cultures
- Tell historical stories, as the best way to teach
- Learn how to cope with life
- Become more familiar with the world, and to know how science fits into the overall history of mankind
In contrast to Knuth’s list, my list is less focused on the “lives of scientists” angle, although I am similarly interested in the process of development, the process of failure, and recognizing sources of influence and contribution. For these kinds of studies, I am less interested in the development of particular algorithms or discoveries and more interested in larger-scale engineering efforts, which by nature tend to be more impersonal.
There appear to be very few studies of historical code. [Charoenwet] is motivated by historical analysis rather than algorithmic analysis; however, the paper focuses on a methodological experiment using LDA rather than on reading the source code as text. The field of archaeogaming, which applies archaeological techniques to digital games and worlds, has featured papers on the technical methods used in games (e.g. [Aycock], with its analysis of a maze generation algorithm). Thus, as more historical sources come to light, this appears to be a wide-open field for new insights and methods.
Historical Questions
[Wardhaugh] discusses how to read historical mathematics. Paralleling that list, we can similarly analyze source code.
What does it say / do?
- Programs are (almost always) written to perform some functional purpose. What was that purpose?
The source code describes the computation of some business logic, within some constraints. I suspect we will usually have more than just the source code, which can shed additional light on the code.
- What data types encode the business domain? What algorithms are used to compute the results? What input/output is used to read in data and communicate the result?
- What programming language or languages are used? Does the construction follow modern ideals or does it follow unusual or archaic patterns? How would this program interface with its execution environment: the hardware, the operating system, and other programs?
Older programs are likely to be batch-oriented, reading a stream of records (likely passed in via cards or tape), with variables and control logic provided either in-band or out-of-band. Later programs may be more file-oriented or interactive. Such clues could feed a “potsherd”-like system for dating programs.
- Can the program be translated into a modern language or structure?
This question is less about “can” and more about understandability for modern readers.
Who developed it?
- Are there parts of the design, implementation, or documentation that indicate authorship and place of development?
Differing code styles may point to multiple authors, or development across time, although a consistent code style may just indicate multiple authors were working from the same guide or were similarly educated/trained.
Names of authors may be hidden as easter eggs or as magic numbers.
- Why did the author or authors develop it? Did the author’s background or circumstances influence the result?
This typically requires research beyond the source code; [Wardhaugh], for example, includes a letter which obliquely sheds light on the author’s circumstances. Source code is rarely narrative, though.
- Who were the author’s colleagues? mentors? enemies?
This also calls for research beyond the source code, although developers have often expressed frustrations with partners, hardware, or customers in source code comments. If the source code was originally commercial, however, these comments are likely to have been scrubbed.
Authorship may be afflicted with the “most famous person associated” curse, so we need to be careful in interpreting the evidence.
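As a rough illustration of the code-style point above, here is a minimal sketch of a per-file “style fingerprint” that could be compared across files. The function name `style_fingerprint`, the chosen features, and the `comment_prefix` parameter are all assumptions made for illustration; real historical code (fixed-column card formats, for example) would need features suited to its language. As noted above, a shared style guide can make different authors look alike, so differences are weak evidence at best.

```python
def style_fingerprint(source: str, comment_prefix: str = "#") -> dict:
    """Summarize a few crude layout habits of one source file.

    Hypothetical helper: the features below are illustrative, not a
    validated authorship-attribution method.
    """
    # Ignore blank lines so they do not skew the ratios.
    lines = [ln for ln in source.splitlines() if ln.strip()]
    total = len(lines)

    tab_indented = sum(1 for ln in lines if ln[0] == "\t")
    space_indented = sum(1 for ln in lines if ln[0] == " ")
    comment_lines = sum(1 for ln in lines if ln.strip().startswith(comment_prefix))

    return {
        "lines": total,
        "tab_indented": tab_indented,
        "space_indented": space_indented,
        "comment_ratio": comment_lines / total if total else 0.0,
        "mean_line_length": sum(len(ln) for ln in lines) / total if total else 0.0,
    }
```

Files whose fingerprints differ sharply (tabs versus spaces, heavy versus sparse commenting) are candidates for a closer manual reading, not proof of separate authorship.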
How was it built?
- Was this code meant to serve a short-term use (like a specific study or job), or was it intended for long-term use?
- If long-term use, were choices made to make it more maintainable?
- How much effort or time was involved?
If we have access to source code control history, we can infer the timeline with considerable accuracy. If not, and we have no external data on the development, models based on lines of code (e.g. COCOMO) can provide a rough estimate.
- Did it have a clarity of purpose or did the design change over time?
- Who did the development team expect to use the program? Who actually did?
- Why has this source code survived (and become available for study)?
Although some code survives as printouts stored in a garage, most code is only available if there was a deliberate decision to retain and release it. Filters such as media decay, lack of archiving, and company failures all lead to the loss of code.
- Has this code been translated or modified before becoming available for study?
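For the lines-of-code estimate mentioned above, here is a minimal sketch of the Basic COCOMO model (Boehm, 1981), which computes effort as a · KLOC^b person-months and schedule as c · effort^d months. The coefficients are the published Basic COCOMO values; the names `basic_cocomo` and `Estimate` are my own. The model was calibrated on projects of its era, so for much older or very different code (assembly, card decks) treat the output as a rough suggestion only.

```python
from dataclasses import dataclass

# Basic COCOMO coefficients (a, b, c, d) for each project class.
COEFFICIENTS = {
    "organic": (2.4, 1.05, 2.5, 0.38),
    "semi-detached": (3.0, 1.12, 2.5, 0.35),
    "embedded": (3.6, 1.20, 2.5, 0.32),
}


@dataclass
class Estimate:
    effort_person_months: float
    schedule_months: float
    average_staff: float


def basic_cocomo(lines_of_code: int, mode: str = "organic") -> Estimate:
    """Estimate effort and schedule from a count of source lines."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    a, b, c, d = COEFFICIENTS[mode]
    kloc = lines_of_code / 1000.0      # the model works in thousands of lines
    effort = a * kloc ** b             # person-months
    schedule = c * effort ** d         # calendar months
    return Estimate(effort, schedule, effort / schedule)
```

For example, `basic_cocomo(1000)` yields an effort of 2.4 person-months in the "organic" class. Which class a historical project belongs to is itself a judgment call, so it is worth reporting the estimate under more than one mode.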
Who consumed it?
- Who read or used this program (which may differ from the intended audience)? In what kind of computing environment was it used? Under what licensing terms?
- If they had a choice, why did consumers choose this program over alternatives?
- Does this program fall into a genre?
Constraints of Research
- Disclose constraints and limitations of research, including any licensing or contractual limitations on the research, as well as limitations in terms of the information available to do the study.
For example, the Computer History Museum’s EULA.
References
[Aycock] Aycock, John, and Tara Copplestone. 2018. “Entombed: An Archaeological Examination of an Atari 2600 Game.” The Art, Science, and Engineering of Programming 3 (2). https://doi.org/10.22152/programming-journal.org/2019/3/4.
[Charoenwet] Charoenwet, Wachiraphan. 2018. “A Digital Collection Study and Framework Exploration — Applying Textual Analysis on Source Code Collection.” In 2018 3rd Digital Heritage International Congress (DigitalHERITAGE), 1–8. https://doi.org/10.1109/DigitalHeritage.2018.8810105.
[Knuth] Knuth, Donald, and Len Shustek. 2021. “Let’s Not Dumb down the History of Computer Science.” Communications of the ACM 64 (2): 33–35. https://doi.org/10.1145/3442377.
[Wardhaugh] Wardhaugh, Benjamin. 2010. How to Read Historical Mathematics. Princeton and Oxford: Princeton University Press. https://press.princeton.edu/books/hardcover/9780691140148/how-to-read-historical-mathematics.