What are the next challenges for AI?

Machine Learning: can we correct biases?

With Sophy Caulier, independent journalist
On December 1st, 2021 | 4 min reading time
Stephan Clémençon
Professor of Applied Mathematics at Télécom Paris (IP Paris)
Key takeaways
  • AI is a set of tools, methods and technologies that allow a system to perform tasks in an (almost) autonomous way.
  • The question of trust in Machine Learning (ML) tools is recurrent, because deep learning requires very large volumes of data, which often come from the web.
  • There are different types of bias that can be related to the data source used. These include “selection bias”, due to a lack of representativeness, and “omission bias”, where data are lacking.
  • When the available data are too sparse to implement ML in a simple way, we talk about “weak signals”. Hybridisation of ML with symbolic AI could provide solutions.

What are the major challenges currently facing artificial intelligence?

In my area of expertise, which is machine learning (ML), the three topics that I am currently passionate about, and that could potentially be considered as the great challenges in this field, are bias and fairness, weak signals, and learning on networks. But this is only a partial view of the challenges in AI, which is a very broad and mostly interdisciplinary field. AI is a set of tools, methods and technologies that enable a system to perform tasks in a quasi-autonomous way, and there are different ways of achieving this.

ML is about the machine learning from examples, training itself to perform efficiently the tasks it will later undertake. The great successes in this area are computer vision and automatic listening, used for applications in biometrics for example, and natural language processing. One of the questions that currently arises is how much confidence can be placed in ML tools, as deep learning requires very large volumes of data, which very often come from the web.

Unlike datasets previously collected by researchers, web data is not acquired in a “controlled” way. The sheer quantity of this data sometimes means that the methodological questions that should be asked before exploiting the information it contains are simply ignored. For example, training a facial recognition model directly on web data can lead to bias, in the sense that the model will not recognise all types of faces with the same efficiency. In this case, the bias may stem from a lack of representativeness in the faces used.

If, for example, the data corresponds mainly to Caucasian faces, the system developed may recognise Caucasian faces more easily than other types of faces. However, the disparities in performance may also be due to the intrinsic difficulty of the prediction problem and/or the limitations of current ML techniques: it is well known, for example, that deep learning does not perform as well in recognising the faces of new-borns as it does adult faces. That said, there is currently no clear theoretical insight into the link between the structure of the deep neural network used and the performance of the model on a given task.
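
To make the disparity concrete, here is a minimal sketch, in Python, of the kind of per-group audit used to expose such a bias: the model's accuracy is computed separately for each demographic group and the gap between groups is reported. The data, labels and group names are illustrative assumptions, not material from the interview.

```python
# Minimal fairness audit sketch: compare a classifier's accuracy across
# demographic groups. All data below is made up for illustration.
import numpy as np

def accuracy_by_group(y_true, y_pred, groups):
    """Return the prediction accuracy within each group."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    return {
        g: float(np.mean(y_pred[groups == g] == y_true[groups == g]))
        for g in np.unique(groups)
    }

# Hypothetical evaluation results for a face verification model.
y_true = [1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0]
groups = ["A", "A", "A", "B", "B", "B", "B", "B"]

per_group = accuracy_by_group(y_true, y_pred, groups)
print(per_group)                                          # {'A': 1.0, 'B': 0.6}
print(max(per_group.values()) - min(per_group.values()))  # disparity gap: 0.4
```

A large gap between groups, measured on a test set that is itself representative, is the usual first signal that the training data under-represents some of them.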

You say “currently”. Does that mean that these biases could one day be removed, or their effect could diminish?

There are different types of bias. They can be relative to the data: there are the so-called “selection” biases, linked to the lack of representativeness, “omission” biases, where data are lacking, errors due to endogeneity, etc. Biases are also inherent in the choice of the neural network model, the ML method, a choice that is inevitably restricted to the state of the art and limited by current technology. In the future, we may use other, more efficient, less computationally intensive representations of information, which can be more easily deployed, and which may reduce or eliminate these biases, but for the moment they exist!

What role does the quality of the datasets used for training play in these biases?

It is very important. As I said, given the volume of data required, it is often sourced from the web and therefore not acquired in a sufficiently controlled way to ensure representativeness. But there is also the fact that this data can be ‘contaminated’ in a malicious way. This is currently an issue for the computer vision solutions that will be used in autonomous vehicles. The vehicle can be deceived by manipulating the input information. It is possible to modify the pixel image of, say, a traffic sign so that the human eye sees no difference, but the neural network ‘sees’ something other than the traffic sign.
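
As an illustration of this kind of manipulation, here is a minimal sketch of the fast gradient sign method (FGSM), one standard way of crafting a perturbation that is imperceptible to the eye yet changes a network's prediction. It is written in PyTorch against a generic image classifier; the model and the step size `epsilon` are assumptions, and nothing here refers to an actual vehicle system.

```python
# FGSM sketch: nudge each pixel a tiny step in the direction that most
# increases the classifier's loss, so the image looks unchanged to a human
# but may be misclassified by the network.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=0.01):
    """Return an adversarially perturbed copy of `image` (batch, C, H, W)."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step along the sign of the gradient, then clip back to valid pixel range.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```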

ML is based on a frequentist principle, and the question of the representativeness of the data in the learning phase is a major issue. Using autonomous driving as an example, we now see many vehicles on the Saclay plateau equipped with sensors to store as much experience as possible. That being said, it is difficult to say how long it will be before we have seen enough situations to be able to deploy a sufficiently intelligent and reliable system in this field, enabling us to deal with all future situations.

There are certainly applications for which the data available today allows ML to be implemented in a satisfactory manner. This is the case, for example, for handwriting recognition, for which neural networks are perfectly developed. For other problems, in addition to experimental data, generative models will also be used, producing artificial data that account for adverse situations but without claiming to be exhaustive. This is the case for ML applications in cybersecurity, in an attempt to automatically detect malicious intrusions into a network, for example.
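
The interview does not name a particular generative model, so as a deliberately simple stand-in, the sketch below fits a Gaussian mixture (scikit-learn's GaussianMixture) to a handful of simulated attack records and samples artificial “adverse” examples from it to enrich a training set. A real cybersecurity pipeline would use a far more capable generator; everything here is illustrative.

```python
# Sketch of generative data augmentation: fit a density model to a few
# observed attack samples, then draw synthetic variants. Simulated data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
observed_attacks = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(40, 2))

generator = GaussianMixture(n_components=2, random_state=1).fit(observed_attacks)
synthetic_attacks, _ = generator.sample(200)  # artificial adverse situations
print(synthetic_attacks.shape)                # (200, 2)
```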

Generally speaking, there are many problems for which the data available is too sparse to implement ML in a simple way. This is often the case in anomaly detection, particularly for predictive maintenance of complex systems. In some cases, the hybridisation of ML and symbolic techniques in AI could provide solutions. These avenues are being explored in the civil and military aviation sectors, as well as in medical imaging. In addition to their effectiveness, such approaches may also enable machines to make decisions that are easier to explain and interpret.
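
For the sparse-data anomaly detection setting just described, here is a minimal sketch using scikit-learn's IsolationForest on simulated sensor readings; the detector, the data and the contamination rate are all assumptions for illustration, not the hybrid ML/symbolic methods the interview alludes to.

```python
# Anomaly detection sketch for a predictive maintenance setting: train only
# on routine sensor readings, then flag readings that look unlike them.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 4))  # routine readings
faults = rng.normal(loc=5.0, scale=1.0, size=(5, 4))    # rare failure signatures

detector = IsolationForest(contamination=0.01, random_state=0).fit(normal)
print(detector.predict(faults))  # -1 flags an anomaly, +1 a normal reading
```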

What is driving evolution in AI today?

The field of mathematics contributes a lot, especially in terms of efficient information representation and algorithms. But it is also technological progress that is driving AI forward. The mathematical concept of neural networks has been around for many decades. Recent technical advances, particularly in the field of memory, have made it possible to successfully implement deep neural network models. Similarly, distributed computing architectures and dedicated programming frameworks have made it possible to scale up learning on large volumes of data. What remains to be done is to design more frugal approaches, so as to reduce the carbon footprint of computations, which is a very topical issue!
