Why Software Heritage is creating a global software archive

Stefano Zacchiroli
Professor in computer science at Télécom Paris (IP Paris)
Key takeaways
  • Software Heritage was launched in 2015 as a digital conservation initiative: a "Library of Alexandria" for the precious modern asset of software source code.
  • Its ambition: to collect, preserve and share all publicly available software in source code form.
  • Open-source software is ubiquitous in IT products these days, for example, in IoT devices, phones, cars and cameras.
  • Preserving source code and software is also important for research and industry, as it is increasingly used in these fields.
  • Like the great libraries of the ancient world, it aims to accumulate as much knowledge as possible in one place.

Software Heritage was launched in 2015 as a digital preservation initiative. Its ambition: to collect, preserve, and share all software that is publicly available in source code form. This universal software archive will help guarantee the reliability and originality of source code, so that the “official”, unmodified versions remain preserved forever, regardless of any subsequent changes that may be made.

Such preservation is important because software takes a lot of intellectual energy to create and contains advanced technical knowledge, in the form of algorithms, that can be understood only by reading the source code. “This knowledge can contain innovations, so the source code of some software can be as innovative as a scientific paper or as a patent,” explains Stefano Zacchiroli of Télécom Paris, one of the founders of Software Heritage. The importance of preventing the loss of this technical knowledge has been acknowledged by UNESCO in the Paris Call on Software Source Code as Heritage [1].

Preserving source code and software is also important for research and industry, which rely on them more and more. Indeed, a large part of the technical and scientific knowledge developed today resides in software, which must therefore be preserved to guarantee the reproducibility of experiments and results – the basis of the scientific method. The same approach is already at work in movements like Open Access, which ensures that scientific papers remain available in the long term and accessible to everyone, and in the open data movement, whose aim is to keep scientific data open and universally shared.

Software Heritage has assembled the largest public archive of software in source code form, comprising more than 10 billion unique source code files and more than two billion commits (the internal revisions of software used by developers), harvested from more than 160 million development projects. Among the most famous items are the source code of the Apollo 11 navigation system, which allowed humans to go to the Moon, and that of the NCSA Mosaic browser, which popularised the use of the Web. The archive currently occupies about one petabyte, which is large but still smaller than archives of videos, for instance. The project, started in 2015 and opened to the public in 2016, was founded by Roberto Di Cosmo (Inria and Université de Paris) and Stefano Zacchiroli (Télécom Paris) in collaboration with Inria and UNESCO. It now has a number of sponsors from both the private and public sectors.

Software Heritage contributes to these open-science efforts through the long-term archival of software in source code form. Sometimes the software is itself a major scientific breakthrough and, as such, contains valuable knowledge that needs to be preserved for future reuse.

Open-source software is also ubiquitous in IT products, for example in IoT devices, phones, cars and cameras. The difficulty here is that anyone can modify this software, so variants of a piece of software become an integral part of new devices. Software Heritage provides a place where this software can be stored over the long term, with identifiers that can be used to recognise the specific version of a piece of software originally installed in a given device. This is very important for tracking vulnerabilities and identifying newer products that need to be “fixed”.
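The article does not describe how such identifiers are computed. As a minimal sketch, assume an identifier derived from the content itself, modelled here on Git's blob hash: identical bytes always give the same identifier, and any modified variant gives a different one. The function name `content_id` and the `cnt:` prefix are illustrative assumptions, not the archive's actual interface.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Return a content-derived identifier for a file's bytes.

    Assumption: the scheme mirrors Git's blob hash (SHA-1 over a
    "blob <size>" header, a NUL byte, then the data). The "cnt:"
    prefix is purely illustrative.
    """
    header = b"blob %d\0" % len(data)
    return "cnt:" + hashlib.sha1(header + data).hexdigest()


original = b"int main(void) { return 0; }\n"
patched = b"int main(void) { return 1; }\n"

# Identical bytes always map to the same identifier...
assert content_id(original) == content_id(bytes(original))
# ...while any modification, however small, yields a different one.
assert content_id(original) != content_id(patched)
```

Because the identifier depends only on the bytes, the version found in a device can be matched against an archived copy without trusting file names or download locations.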

How is the database constructed?

Software Heritage mainly archives source code by crawling the public platforms that developers use to collaborate on open-source software, the best known being GitHub and GitLab. Another way is to involve researchers, who actively “push” software to the archive. In France, for instance, this can be done through HAL, a popular, publicly funded open access platform that the scientific community uses to deposit papers as preprints.

The key point here is that software, and open-source software in particular, is massively duplicated these days: the same piece of source code can be found in thousands or millions of different places on the Internet at the same time.

To address this challenge, Software Heritage structures the archive as a giant, fully de-duplicated graph (a Merkle DAG). If the same source code file appears in thousands or millions of different places, it is archived only once, while the archive keeps track of all the places it is linked from. This holds not only for individual files, but also for entire source code directories and commits, which can be very big for substantial pieces of software, and for software releases.
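The de-duplication described above can be sketched with a toy content-addressed store. This is an illustration under stated assumptions, not the archive's actual implementation: the `Archive` class, its method names, the use of SHA-256 and the example.org origins are all hypothetical.

```python
import hashlib


class Archive:
    """Toy content-addressed store: one entry per distinct object."""

    def __init__(self):
        self.objects = {}   # object id -> stored payload
        self.origins = {}   # object id -> set of places it was seen

    def _store(self, kind: str, payload: bytes, origin: str) -> str:
        oid = hashlib.sha256(kind.encode() + b"\0" + payload).hexdigest()
        self.objects.setdefault(oid, payload)            # stored only once
        self.origins.setdefault(oid, set()).add(origin)  # every place is recorded
        return oid

    def add_file(self, data: bytes, origin: str) -> str:
        return self._store("file", data, origin)

    def add_directory(self, entries: dict, origin: str) -> str:
        """`entries` maps a file name to the id of an already stored object."""
        # A directory's id depends on its children's ids (the Merkle property),
        # so identical trees collapse to a single node as well.
        listing = "\n".join(f"{name} {oid}" for name, oid in sorted(entries.items()))
        return self._store("dir", listing.encode(), origin)


archive = Archive()
for origin in ("https://example.org/alice/tool", "https://example.org/bob/tool-fork"):
    licence = archive.add_file(b"Permission is hereby granted...\n", origin)
    main = archive.add_file(b"print('hello')\n", origin)
    archive.add_directory({"LICENSE": licence, "main.py": main}, origin)

print(len(archive.objects))           # 3 objects stored, not 6
print(len(archive.origins[licence]))  # seen in 2 places
```

Two projects with identical contents cost the archive three objects in total, and the store still remembers both places each object came from.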

Doing this is essential for keeping the size of the archive “small”, that is, for minimising the duplication of information that needs to be saved. It is also useful for scientific use cases: by looking at this global graph of public code, one can see who else has referenced a piece of software and perhaps used it to create something else. In a sense, this graph makes it possible to measure the impact of software developed by researchers and open-source developers.
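To illustrate how such a graph can show who has referenced a piece of software, here is a minimal sketch that walks the edges backwards from a file to every project containing it. The node names and the `edges` dictionary are made up for the example; a real archive graph would be far larger and queried differently.

```python
from collections import defaultdict, deque

# Hypothetical edges of a tiny archive graph: parent node -> nodes it contains.
edges = {
    "project:alice/tool": ["dir:src-v1"],
    "project:bob/fork": ["dir:src-v1"],
    "project:carol/app": ["dir:vendor"],
    "dir:src-v1": ["file:parser.c", "file:LICENSE"],
    "dir:vendor": ["file:parser.c"],
}


def referencing_projects(node: str) -> set:
    """Return every project from which `node` can be reached."""
    # Invert the edges so that each node knows its direct parents.
    parents = defaultdict(set)
    for parent, children in edges.items():
        for child in children:
            parents[child].add(parent)

    # Breadth-first walk upwards through the inverted graph.
    seen, queue = set(), deque([node])
    while queue:
        for parent in parents[queue.popleft()]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return {n for n in seen if n.startswith("project:")}


print(sorted(referencing_projects("file:parser.c")))
# ['project:alice/tool', 'project:bob/fork', 'project:carol/app']
print(sorted(referencing_projects("file:LICENSE")))
# ['project:alice/tool', 'project:bob/fork']
```

Counting the projects returned for a given file or directory is one crude way to estimate how widely it has been reused.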

“Software Heritage is a ‘great library of source code’, analogous to the great libraries of the ancient world,” says Zacchiroli. “The aim of these ancient libraries was to accumulate as much knowledge as possible in a single place. Software Heritage is the Great Library of Alexandria for the precious modern good that is software source code.”

People can visit the Software Heritage library to find the code they are interested in, perhaps because it has disappeared from its original hosting place, or perhaps to analyse the full extent of knowledge stored in it.

Interview by Isabelle Dumé

For more information: https://www.softwareheritage.org/

[1] https://en.unesco.org/foss/paris-call-software-source-code

Contributors

Stefano Zacchiroli

Professor in computer science at Télécom Paris (IP Paris)

Stefano Zacchiroli's research focuses on the digital commons, open source software engineering, computer security and the software supply chain. He is co-founder and technical director of Software Heritage, the largest public archive of software source code. He has been a Debian developer since 2001 and was the leader of the Debian project from 2010 to 2013. He is a former board director of the Open Source Initiative (OSI) and winner of the 2015 O'Reilly Open Source Award.
