What are the next challenges for AI?

Machine Learning: can we correct biases?

Sophy Caulier, Independent journalist
On December 1st, 2021 | 4 min reading time
Stephan Clémençon
Professor of Applied Mathematics at Télécom Paris (IP Paris)
Key takeaways
  • AI is a set of tools, methods and technologies that allow a system to perform tasks in an (almost) autonomous way.
  • The question of trust in Machine Learning (ML) tools is recurring, because deep learning requires very large volumes of data, which often come from the web.
  • There are different types of bias that can be related to the data source used. These include “selection bias”, due to a lack of representativeness, and “omission bias”, where data are lacking.
  • When the available data are too sparse to implement ML in a simple way, we talk about “weak signals”. Hybridisation of ML with symbolic AI could provide solutions.

What are the major challenges currently facing artificial intelligence?

In my area of expertise, which is machine learning (ML), the three topics that I am currently passionate about, and that could potentially be considered as the great challenges in this field, are bias and fairness, weak signals, and learning on networks. But this is only a partial view of the challenges in AI, which is a very broad and mostly interdisciplinary field. AI is a set of tools, methods and technologies that enable a system to perform tasks in a quasi-autonomous way, and there are different ways of achieving this.

ML is about machines learning from examples: a system trains itself to perform efficiently the tasks that it will later undertake. The great successes in this area are computer vision and machine listening, used for applications in biometrics for example, and natural language processing. One of the questions that currently arises is how much confidence can be placed in ML tools, as deep learning requires very large volumes of data, which very often come from the web.

Unlike datasets previously collected by researchers, web data is not acquired in a “controlled” way. Because this data is so abundant, the methodological questions that should be asked before exploiting the information it contains are sometimes ignored. For example, training a facial recognition model directly from web data can lead to bias, in the sense that the model would not recognise all types of faces with the same accuracy. In this case, the bias may stem from a lack of representativeness in the faces used.

If, for example, the data corresponds mainly to Caucasian faces, the system developed may recognise Caucasian faces more easily than other types of faces. However, the disparities in performance may also be due to the intrinsic difficulty of the prediction problem and/or the limitations of current ML techniques: it is well known, for example, that deep learning does not perform as well in the recognition of the faces of newborns as it does for adult faces. Moreover, there is currently no clear theoretical insight into the link between the structure of the deep neural network used and the performance of the model for a given task.
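A disparity of this kind only becomes visible when a model is scored separately for each group rather than with a single overall figure. The following is a minimal sketch in Python; the function name `accuracy_by_group`, the group labels and the toy predictions are illustrative assumptions, not results from any real face recognition system.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return the fraction of correct predictions for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == guess)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical toy data: group "A" is well represented, group "B" is not.
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0, 0, 0]
groups = ["A", "A", "A", "A", "A", "A", "A", "A", "B", "B"]

per_group = accuracy_by_group(y_true, y_pred, groups)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(overall, per_group)
```

An overall score that looks acceptable can hide a much lower score on the under-represented group. Such a gap does not by itself say whether the cause is the data or the intrinsic difficulty of the task.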

You say “currently”. Does that mean that these biases could one day be removed, or that their effect could diminish?

There are different types of bias. Some relate to the data: the so-called “selection” biases, linked to a lack of representativeness, “omission” biases, where missing data lead to errors through endogeneity, etc. Biases are also inherent in the choice of the neural network model and of the ML method, a choice that is inevitably restricted to the state of the art and limited by current technology. In the future, we may use other, more efficient, less computationally intensive representations of information, which can be more easily deployed, and which may reduce or eliminate these biases, but for the moment they exist!

What role does the quality of the datasets used for training play in these biases?

It is very important. As I said, given the volume of data required, it is often sourced from the web and therefore not acquired in a sufficiently controlled way to ensure representativeness. But there is also the fact that this data can be maliciously ‘contaminated’. This is currently an issue for the computer vision solutions that will be used in autonomous vehicles. The vehicle can be deceived by manipulating the input information. It is possible to modify the pixels of an image of, say, a traffic sign so that the human eye sees no difference, but the neural network ‘sees’ something other than the traffic sign.
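The traffic-sign manipulation can be illustrated on a much simpler model than a neural network. The sketch below uses a toy linear classifier and shifts every pixel by a small amount in the direction given by the sign of its weight, in the spirit of gradient-sign attacks. The weights, the bias, the image and the budget `eps` are all made-up values chosen for illustration; real attacks on deep networks are more involved.

```python
def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def score(weights, bias, pixels):
    """Linear score of the toy classifier."""
    return sum(w * p for w, p in zip(weights, pixels)) + bias


def is_sign(weights, bias, pixels):
    """The toy classifier answers 'traffic sign' when the score is positive."""
    return score(weights, bias, pixels) > 0


def perturb(weights, bias, pixels, eps):
    """Move each pixel by at most eps, pushing the score towards the
    other side of the decision boundary."""
    step = -eps if is_sign(weights, bias, pixels) else eps
    return [p + step * sign(w) for w, p in zip(weights, pixels)]


# Hypothetical classifier and image (pixel intensities between 0 and 1).
n_pixels = 1000
weights = [0.01 if i % 2 == 0 else -0.01 for i in range(n_pixels)]
bias = 0.2
image = [0.5] * n_pixels

tampered = perturb(weights, bias, image, eps=0.03)
largest_change = max(abs(a - b) for a, b in zip(image, tampered))
print(is_sign(weights, bias, image), is_sign(weights, bias, tampered), largest_change)
```

No pixel moves by more than the chosen budget, yet the many small shifts add up across the image and are enough to flip the decision of this toy model.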

ML is based on a frequentist principle and the question of the representativeness of the data in the learning phase is a major issue. Using autonomous driving as an example, we now see many vehicles on the Saclay plateau, equipped with sensors to store as much experience as possible. That being said, it is difficult to say how long it will be before we have seen enough situations to be able to deploy a sufficiently intelligent and reliable system in this field, enabling us to deal with all future situations.
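When the make-up of the target population is known, one generic statistical remedy for an unrepresentative sample is to reweight each example by the ratio between its group's share in the population and its share in the sample. This technique is not described in the interview; the sketch below is only an illustration, and the group shares, error values and function names are assumptions.

```python
from collections import Counter


def importance_weights(sample_groups, target_shares):
    """Weight each example by (target share) / (share observed in the sample)."""
    n = len(sample_groups)
    counts = Counter(sample_groups)
    return [target_shares[g] / (counts[g] / n) for g in sample_groups]


def weighted_mean(values, weights):
    """Average of values in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical sample: 8 examples from group "A" and 2 from group "B",
# whereas the population of interest is assumed to be split 50/50.
sample_groups = ["A"] * 8 + ["B"] * 2
errors = [0.1] * 8 + [0.5] * 2        # illustrative per-example error values
target_shares = {"A": 0.5, "B": 0.5}  # assumed to be known in advance

weights = importance_weights(sample_groups, target_shares)
naive = sum(errors) / len(errors)
corrected = weighted_mean(errors, weights)
print(naive, corrected)
```

The naive average is dominated by the over-represented group, while the reweighted average reflects the assumed population. The correction is only as good as the knowledge of the target shares, and it cannot help with situations that are entirely absent from the sample.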

There are certainly applications for which the data available today allows ML to be implemented in a satisfactory manner. This is the case, for example, for handwriting recognition, for which neural networks are very well developed. For other problems, in addition to experimental data, generative models will also be used, producing artificial data that account for adverse situations but without claiming to be exhaustive. This is the case for ML applications in cybersecurity, for example in attempts to automatically detect malicious intrusions into a network.

Generally speaking, there are many problems for which the data available is too sparse to implement ML in a simple way. This is often the case in anomaly detection, particularly for predictive maintenance of complex systems. In some cases, the hybridisation of ML and symbolic techniques in AI could provide solutions. These avenues are being explored in the civil and military aviation sectors, as well as in medical imaging. In addition to their effectiveness, such approaches may also enable machines to make decisions that are easier to explain and interpret.

What is driving evolution in AI today?

The field of mathematics contributes a lot, especially in terms of efficient information representation and algorithms. But it is also technological progress that is driving AI forward. The mathematical concept of neural networks has been around for many decades. Recent technical advances, particularly in the field of memory, have made it possible to successfully implement deep neural network models. Similarly, distributed computing architectures and dedicated programming frameworks have made it possible to scale up learning on large volumes of data. What remains to be done is to design more frugal approaches, so as to reduce the carbon footprint of computations, which is a very topical issue!
