Why Software Heritage is creating a global software archive

Stefano Zacchiroli
Professor in computer science at Télécom Paris (IP Paris)
Key takeaways
  • Software Heritage was launched in 2015 as a digital conservation initiative: a "Library of Alexandria" for the precious modern asset of software source code.
  • Its ambition: to collect, preserve and share all publicly available software in source code form.
  • Open-source software is ubiquitous in IT products these days, for example, in IoT devices, phones, cars and cameras.
  • Preserving source code and software is also important for research and industry, as it is increasingly used in these fields.
  • The aim of these libraries is to accumulate as much knowledge as possible in one place.

Software Heritage was launched in 2015 as a digital preservation initiative. Its ambition: to collect, preserve, and share all software that is publicly available in source code form. This universal software archive will help guarantee the reliability and originality of source code, so that the “official”, unmodified versions remain preserved forever, regardless of any subsequent changes that may be made.

Such preservation is important because software takes a lot of intellectual energy to create and contains advanced technical knowledge, in the form of algorithms, that can be understood only by reading the software's source code. “This knowledge can contain innovations, so the source code of some software can be as innovative as a scientific paper or as a patent,” explains Stefano Zacchiroli of Télécom Paris, one of the founders of Software Heritage. The need to prevent this important technical knowledge from being lost has been acknowledged by UNESCO, as part of the Paris Call on Software Source Code as Heritage [1].

Preserving source code and software is also important for research and industry, since both are increasingly used in these domains. Indeed, a large part of the technical and scientific knowledge developed today resides in software, which must therefore be preserved to guarantee the reproducibility of experiments and results, the basis of the scientific method. The same approach is already visible in movements such as Open Access, which ensures that scientific papers are available in the long term and accessible to everyone. We also see it embodied in the open data movement, whose aim is to keep scientific data open and universally shared.

Software Heritage has assembled the largest public archive of software in source code form, comprising more than 10 billion unique source code files and more than two billion commits (the internal revisions of software used by developers) harvested from more than 160 million development projects. Among the most famous are the source code of the Apollo 11 navigation system, which allowed humans to go to the Moon, and that of the NCSA Mosaic browser, which popularised the use of the Web. The size of the archive is currently about one petabyte, which, while large, is smaller than video archives, for instance. The project was founded in 2016 by Roberto Di Cosmo (Inria and Université de Paris) and Stefano Zacchiroli (Télécom Paris) in collaboration with Inria and UNESCO. It now has a number of sponsors from both the private and public sectors.

Software Heritage is contributing to these open-science efforts with the long-term archival of software in source code form. Sometimes the software is itself a major scientific breakthrough and, as such, contains valuable knowledge that needs to be preserved for future reuse.

Open-source software is also ubiquitous in IT products: for example, in IoT devices, phones, cars and cameras. The difficulty here is that anyone can modify this software, so variants of a piece of software become an integral part of new devices. Software Heritage provides a place where this software can be stored over the long term, with identifiers that can be used to recognise the specific version of the software originally installed in a given device. This is very important for tracking vulnerabilities and identifying newer products that need to be “fixed”.
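One way such identifiers can work is to derive them from the content itself, so that any modification, however small, yields a different identifier. The sketch below is illustrative only. The article does not describe the scheme; the `swh:1:cnt:` prefix combined with a Git-style SHA-1 blob hash is an assumption based on the publicly documented SWHID format for file contents, and the function name `content_id` and the sample files are hypothetical.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Return a Git-style intrinsic identifier for a file's bytes.

    Assumption: the identifier is the SHA-1 of a "blob <length>\\0" header
    followed by the raw bytes, prefixed with "swh:1:cnt:".
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


# Two hypothetical firmware source files that differ by one character.
original = b"int main(void) { return 0; }\n"
patched = b"int main(void) { return 1; }\n"

print(content_id(original))
print(content_id(original) == content_id(patched))  # False
```

Because the identifier depends only on the bytes, anyone holding a copy of the file can recompute it and check which archived version a device was built from.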

How is the database constructed?

Software Heritage mainly archives source code by crawling the public platforms that developers use to collaboratively develop open-source software. The best known of these are GitHub and GitLab. Another way is to involve researchers who actively “push” software to the archive. For instance, in France, HAL is a popular, publicly funded open access platform used by the scientific community to deposit papers as preprints.

The key point to note here is that software, and open-source software in particular, is massively duplicated these days: the same piece of source code can be found in thousands or millions of different places on the Internet at the same time.

To address this challenge, Software Heritage structures the archive as a giant graph (a Merkle DAG), which is entirely de-duplicated. This means that if the same source code file is stored in thousands or millions of different places, it will be archived only once, while keeping track of all the different places it is linked from. This is the case not only for individual files, but also for entire source code directories and commits, which can be very big for substantial pieces of software. It is also true for software releases.
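The de-duplication idea can be sketched in a few lines. This is a toy model under stated assumptions, not the archive's actual implementation: the `Archive` class, its methods, the use of SHA-256, and the `example.org` origins are all hypothetical. Each object is stored under a hash of its content, and a directory's hash is computed from the hashes of its children, which is what makes the graph a Merkle DAG.

```python
import hashlib


class Archive:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        self.objects = {}  # object id -> payload, each stored only once
        self.origins = {}  # object id -> set of places it was seen

    def _put(self, kind: str, payload: bytes, origin: str) -> str:
        oid = hashlib.sha256(kind.encode() + b"\0" + payload).hexdigest()
        self.objects.setdefault(oid, payload)  # no-op if already archived
        self.origins.setdefault(oid, set()).add(origin)
        return oid

    def add_file(self, data: bytes, origin: str) -> str:
        return self._put("file", data, origin)

    def add_directory(self, tree: dict, origin: str) -> str:
        # A directory's identity is derived from its children's ids.
        entries = []
        for name in sorted(tree):
            child = tree[name]
            if isinstance(child, dict):
                cid = self.add_directory(child, origin)
            else:
                cid = self.add_file(child, origin)
            entries.append(f"{name}:{cid}")
        return self._put("dir", "\n".join(entries).encode(), origin)


archive = Archive()
project = {"README": b"hello\n", "src": {"main.c": b"int main(void) { return 0; }\n"}}

# The same project is found in two different places on the Internet.
root_a = archive.add_directory(project, "https://example.org/alice/project")
root_b = archive.add_directory(project, "https://example.org/bob/fork")

print(root_a == root_b)              # True: identical trees share one id
print(len(archive.objects))          # 4: two files and two directories
print(len(archive.origins[root_a]))  # 2: both places are remembered
```

Archiving the fork adds no new objects; only the record of where each object was seen grows.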

Doing this is essential for keeping the size of the archive “small”, that is, for minimising the duplication of information that needs to be saved. It is also useful for scientific use cases because, by looking at this global graph of public code, one can see who else has referenced a piece of software and perhaps used it to create something else. In a sense, this graph makes it possible to measure the impact of software developed by researchers and open-source developers.
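As a rough illustration of how reuse might be read off such a graph, the sketch below builds a reverse index from file content to the projects containing it, then asks which other projects include a given file. The function names, the project names and the choice of SHA-1 are assumptions made for the example; the article does not say how the archive is actually queried.

```python
import hashlib
from collections import defaultdict


def build_reverse_index(origins: dict) -> dict:
    """Map each file's content hash to the set of origins that contain it."""
    index = defaultdict(set)
    for origin, files in origins.items():
        for data in files:
            index[hashlib.sha1(data).hexdigest()].add(origin)
    return index


def reusers(index: dict, data: bytes, home: str) -> list:
    """List the origins, other than `home`, that contain the same file."""
    found = index.get(hashlib.sha1(data).hexdigest(), set())
    return sorted(found - {home})


# Hypothetical projects: a research lab's solver is reused in a product.
solver = b"def solve(x): return x * 2\n"
origins = {
    "lab/solver": [solver],
    "acme/product": [solver, b"print('ui')\n"],
    "misc/unrelated": [b"print('other')\n"],
}

index = build_reverse_index(origins)
print(reusers(index, solver, home="lab/solver"))  # ['acme/product']
```

Counting such reusers gives one crude measure of a piece of software's impact.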

“Software Heritage is a ‘great library of source code’, analogous to the great libraries of the ancient world,” says Zacchiroli. “The aim of these ancient libraries was to accumulate as much knowledge as possible in a single place. Software Heritage is the Great Library of Alexandria for the precious modern good that is software source code.”

People can visit the Software Heritage library to find the code they are interested in, perhaps because it has disappeared from its original hosting place, or perhaps to analyse the full extent of knowledge stored in it.

Interview by Isabelle Dumé

For more information: https://www.softwareheritage.org/

[1] https://en.unesco.org/foss/paris-call-software-source-code

Contributors

Stefano Zacchiroli

Professor in computer science at Télécom Paris (IP Paris)

Stefano Zacchiroli's research focuses on the digital commons, open source software engineering, computer security and the software supply chain. He is co-founder and technical director of Software Heritage, the largest public archive of software source code. He has been a Debian developer since 2001 and was the leader of the Debian project from 2010 to 2013. He is a former board director of the Open Source Initiative (OSI) and winner of the 2015 O'Reilly Open Source Award.
