Why Software Heritage is creating a global software archive

Stefano Zacchiroli
Professor in computer science at Télécom Paris (IP Paris)
Key takeaways
  • Software Heritage was launched in 2015 as a digital conservation initiative: a "Library of Alexandria" for the precious modern asset of software source code.
  • Its ambition: to collect, preserve and share all publicly available software in source code form.
  • Open-source software is ubiquitous in IT products these days, for example, in IoT devices, phones, cars and cameras.
  • Preserving source code and software is also important for research and industry, which increasingly rely on them.
  • Like the great libraries of the ancient world, its aim is to accumulate as much knowledge as possible in one place.

Software Heritage was launched in 2015 as a digital preservation initiative. Its ambition: to collect, preserve, and share all software that is publicly available in source code form. This universal software archive will help guarantee the reliability and originality of source code, so that the “official”, unmodified versions remain preserved forever, regardless of any subsequent changes that may be made.

Such preservation is important because software takes a lot of intellectual energy to create and contains advanced technical knowledge, in the form of algorithms, that can be understood only by reading the software's source code. “This knowledge can contain innovations, so the source code of some software can be as innovative as a scientific paper or a patent,” explains Stefano Zacchiroli of Télécom Paris, one of the founders of Software Heritage. UNESCO has acknowledged the importance of preventing the loss of this technical knowledge in the Paris Call on Software Source Code as Heritage [1].

Preserving source code and software is also important for research and industry, which rely on them more and more. Indeed, a large part of the technical and scientific knowledge developed today resides in software, which must therefore be preserved to guarantee the reproducibility of experiments and results, the basis of the scientific method. This approach is already visible in movements such as Open Access, which ensures that scientific papers remain available in the long term and accessible to everyone. It is also embodied in the open data movement, whose aim is to keep scientific data open and universally shared.

Software Heritage has assembled the largest public archive of software in source code form, comprising more than 10 billion unique source code files and more than two billion commits (the internal revisions of software used by developers), harvested from more than 160 million development projects. Among the most famous items are the source code of the Apollo 11 navigation system, which allowed humans to go to the Moon, and that of the NCSA Mosaic browser, which popularised the use of the Web. The archive currently holds about one petabyte, which is large but still smaller than video archives, for instance. The project was founded by Roberto Di Cosmo (Inria and Université de Paris) and Stefano Zacchiroli (Télécom Paris), in collaboration with Inria and UNESCO; it was launched in 2015 and unveiled to the public in 2016. It now has a number of sponsors from both the private and public sectors.

Software Heritage contributes to these open-science efforts through the long-term archival of software in source code form. Sometimes the software is itself a major scientific breakthrough and, as such, contains valuable knowledge that needs to be preserved for future reuse.

Open-source software is also ubiquitous in IT products: for example, in IoT devices, phones, cars and cameras. The difficulty here is that anyone can modify this software, so variants of a piece of software become an integral part of new devices. Software Heritage provides a place where this software can be stored over the long term, with identifiers that can be used to recognise the specific version of the software originally installed in a given device. This is very important for tracking vulnerabilities and identifying newer products that need to be “fixed”.
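To make the idea of a version-specific identifier concrete, here is a minimal sketch. It assumes the identifiers in question are Software Heritage persistent identifiers (SWHIDs), whose content form is "swh:1:cnt:" followed by the same SHA-1 that Git computes for a file blob. The function name content_swhid is illustrative, not part of any Software Heritage library.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute a SWHID-style identifier for the raw bytes of one file.

    Assumption: content identifiers reuse Git's blob hashing, i.e. the
    SHA-1 of the header b"blob <length>\\0" followed by the file bytes.
    """
    header = b"blob %d\0" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


# Two byte-identical files get the same identifier wherever they are found;
# changing a single byte yields a different one.
original = content_swhid(b"int main(void) { return 0; }\n")
patched = content_swhid(b"int main(void) { return 1; }\n")
print(original)
print(patched)
```

Because the identifier depends only on the bytes, a vendor could in principle record it at build time and later check whether the code shipped in a device matches a version known to be vulnerable.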

How is the database constructed?

Software Heritage mainly archives source code by crawling the public platforms that developers use to collaboratively develop open-source software. The best known of these are GitHub and GitLab. Another way is to involve researchers, who actively “push” software to the archive. For instance, in France, HAL is a popular, publicly funded, open access platform used by the scientific community to deposit papers as preprints; researchers can also use it to deposit software, which is then passed on to the archive.

The key point to note here is that software, and open-source software in particular, is massively duplicated these days: the same piece of source code can be found in thousands or millions of different places on the Internet at the same time.

To address this challenge, Software Heritage structures the archive as a giant graph (a Merkle DAG), which is entirely de-duplicated. This means that if the same source code file is found in thousands or millions of different places, it will be archived only once, while keeping track of all the different places it is linked from. This applies not only to individual files, but also to entire source code directories, which can be very big for substantial pieces of software, as well as to commits and software releases.
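The de-duplication described above can be sketched as a toy content-addressed store. This is an illustration under simplifying assumptions, not the archive's actual data model: the class DedupStore, its method names, the use of SHA-256 and the example.org origins are all hypothetical.

```python
import hashlib
from collections import defaultdict


class DedupStore:
    """Toy content-addressed store: each distinct object is kept once."""

    def __init__(self):
        self.objects = {}                # object id -> payload bytes
        self.origins = defaultdict(set)  # object id -> places it was seen

    def _put(self, kind: str, payload: bytes, origin: str) -> str:
        # The id is a hash of the object's kind and content, so identical
        # objects always map to the same id and are stored only once.
        oid = hashlib.sha256(kind.encode() + b"\0" + payload).hexdigest()
        self.objects.setdefault(oid, payload)
        self.origins[oid].add(origin)
        return oid

    def add_file(self, data: bytes, origin: str) -> str:
        return self._put("file", data, origin)

    def add_directory(self, entries: dict, origin: str) -> str:
        # A directory is hashed over its sorted (name, child id) pairs, so
        # two identical trees collapse into one node: the Merkle property.
        lines = [f"{name} {child}" for name, child in sorted(entries.items())]
        return self._put("dir", "\n".join(lines).encode(), origin)


# The same project found in two places is archived once, with both
# origins remembered.
store = DedupStore()
for origin in ("https://example.org/project", "https://example.org/fork"):
    readme = store.add_file(b"hello\n", origin)
    root = store.add_directory({"README": readme}, origin)

print(len(store.objects))           # number of distinct objects stored
print(sorted(store.origins[root]))  # every place the directory was seen
```

Hashing a directory over the ids of its children is what lets whole trees, and not just single files, be shared between projects.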

Doing this is essential for keeping the size of the archive “small”, that is, for minimising the duplication of information that needs to be saved. It is also useful for scientific use cases: by looking at this global graph of public code, one can see who else has referenced a given piece of software and perhaps used it to create something else. In a sense, this graph makes it possible to measure the impact of software developed by researchers and open-source developers.
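As a rough illustration of how such a graph could answer the question "who else uses this code?", the sketch below walks the edges backwards from one file to every project that contains it. The edge list, the node names and the function referencing_origins are made up for the example and do not reflect the archive's real schema or query interface.

```python
from collections import defaultdict, deque


def referencing_origins(edges, origins, target):
    """Return, sorted, the origins from which `target` can be reached.

    `edges` is an iterable of (parent, child) pairs; `origins` is the set
    of nodes that stand for projects. Both are hypothetical inputs.
    """
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)

    # Breadth-first search along reversed edges, starting from the target.
    seen, queue = {target}, deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen & set(origins))


edges = [
    ("origin:A", "dir:1"), ("dir:1", "file:x"),
    ("origin:B", "dir:2"), ("dir:2", "file:x"),
    ("origin:C", "dir:3"), ("dir:3", "file:y"),
]
users = referencing_origins(edges, {"origin:A", "origin:B", "origin:C"}, "file:x")
print(users)  # the projects that contain file:x
```

Counting the projects returned for a given file or directory is one simple, if crude, proxy for the impact the paragraph above refers to.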

“Software Heritage is a ‘great library of source code’, analogous to the great libraries of the ancient world,” says Zacchiroli. “The aim of these ancient libraries was to accumulate as much knowledge as possible in a single place. Software Heritage is the Great Library of Alexandria for the precious modern good that is software source code.”

People can visit the Software Heritage library to find the code they are interested in, perhaps because it has disappeared from its original hosting place, or perhaps to analyse the full extent of knowledge stored in it.

Interview by Isabelle Dumé

For more information: https://www.softwareheritage.org/

[1] https://en.unesco.org/foss/paris-call-software-source-code

Contributors

Stefano Zacchiroli

Professor in computer science at Télécom Paris (IP Paris)

Stefano Zacchiroli's research focuses on the digital commons, open source software engineering, computer security and the software supply chain. He is co-founder and technical director of Software Heritage, the largest public archive of software source code. He has been a Debian developer since 2001 and was the leader of the Debian project from 2010 to 2013. He is a former board director of the Open Source Initiative (OSI) and winner of the 2015 O'Reilly Open Source Award.