What are the next challenges for AI?

Machine Learning: can we correct biases?

with Sophy Caulier, Independent journalist
On December 1st, 2021 | 4 min reading time
Stephan Clémençon
Professor of Applied Mathematics at Télécom Paris (IP Paris)
Key takeaways
  • AI is a set of tools, methods and technologies that allow a system to perform tasks in an (almost) autonomous way.
  • The question of trust in Machine Learning (ML) tools is recurrent, because deep learning requires very large volumes of data, which often come from the web.
  • There are different types of bias that can be related to the data source used. These include “selection bias”, due to a lack of representativeness, and “omission bias”, where data are lacking.
  • When the available data are too sparse to implement ML in a simple way, we talk about “weak signals”. Hybridisation of ML with symbolic AI could provide solutions.

What are the major challenges currently facing artificial intelligence?

In my area of expertise, which is machine learning (ML), the three topics that I am currently passionate about, and that could potentially be considered the great challenges in this field, are bias and fairness, weak signals, and learning on networks. But this is only a partial view of the challenges in AI, which is a very broad and mostly interdisciplinary field. AI is a set of tools, methods and technologies that enable a system to perform tasks in a quasi-autonomous way, and there are different ways of achieving this.

ML is about the machine learning from examples, training itself so that it can later perform tasks efficiently. The great successes in this area are computer vision and automatic listening, used for applications in biometrics for example, and natural language processing. One of the questions that currently arises is how much confidence can be placed in ML tools, as deep learning requires very large volumes of data, which very often come from the web.

Unlike datasets previously collected by researchers, web data is not acquired in a “controlled” way. The sheer quantity of this data sometimes means that the methodological questions that should be asked in order to exploit the information it contains are ignored. For example, training a facial recognition model directly from web data can lead to bias, in the sense that the model would not recognise all types of faces with the same efficiency. In this case, the bias may stem from a lack of representativeness in the faces used.

If, for example, the data corresponds mainly to Caucasian faces, the system developed may recognise Caucasian faces more easily than other types of faces. However, the disparities in performance may also be due to the intrinsic difficulty of the prediction problem and/or the limitations of current ML techniques: it is well known, for example, that deep learning does not perform as well in the recognition of the faces of new-borns as it does for adult faces. At present, though, there is no clear theoretical insight into the link between the structure of the deep neural network used and the performance of the model for a given task.
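
To make the selection-bias point concrete, here is a minimal, self-contained sketch; the groups, features and sample sizes are all invented for illustration and are not the speaker's own experiment. A classifier is trained on a sample where one group vastly outnumbers the other, then evaluated separately on each group: the under-represented group gets noticeably worse accuracy.

```python
# Minimal sketch of "selection bias": a model trained on data where one
# group is heavily over-represented shows unequal per-group accuracy.
# Groups, features and sizes below are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_group(n, shift):
    """Synthetic two-class data for one group; `shift` moves its boundary."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] > 1.5 * shift).astype(int)
    return X, y

# Group A dominates the training sample; group B is under-represented.
Xa, ya = make_group(5000, shift=0.0)
Xb, yb = make_group(100, shift=1.0)
model = LogisticRegression().fit(np.vstack([Xa, Xb]), np.concatenate([ya, yb]))

# Evaluated on fresh, equal-sized samples, group B fares visibly worse.
for name, shift in [("A", 0.0), ("B", 1.0)]:
    Xt, yt = make_group(2000, shift)
    print(f"group {name}: accuracy = {model.score(Xt, yt):.3f}")
```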

You say “currently”. Does that mean that these biases could one day be removed, or their effect could diminish?

There are different types of bias. Some are relative to the data: the so-called “selection” biases, linked to a lack of representativeness; “omission” biases, due to endogeneity errors; and so on. Biases are also inherent in the choice of the neural network model and the ML method, a choice that is inevitably restricted to the state of the art and limited by current technology. In the future, we may use other representations of information that are more efficient, less computationally intensive and more easily deployed, which may reduce or eliminate these biases, but for the moment they exist!

What role does the quality of the datasets used for training play in these biases?

It is very important. As I said, given the volume of data required, it is often sourced from the web and therefore not acquired in a sufficiently controlled way to ensure representativeness. But there is also the fact that this data can be ‘contaminated’, in a malicious way. This is currently an issue for the computer vision solutions that will be used in autonomous vehicles. The vehicle can be deceived by manipulating the input information. It is possible to modify the pixel image of, say, a traffic sign so that the human eye sees no difference, but the neural network ‘sees’ something other than the traffic sign.
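
The attack described here can be sketched in a few lines. The fragment below uses the classic “fast gradient sign” recipe, named here as an assumption since the interview does not specify a method, against a toy untrained stand-in network; with a real trained vision model, a budget of a few grey levels per pixel is typically enough to change the prediction.

```python
# Hedged sketch of a pixel-level adversarial perturbation (FGSM):
# each pixel is nudged by at most `eps`, invisible to the human eye,
# in the direction that most increases the classifier's loss.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in classifier; a real attack would target a trained network.
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 10))
loss_fn = nn.CrossEntropyLoss()

image = torch.rand(1, 3, 32, 32, requires_grad=True)  # a "traffic sign"
label = torch.tensor([3])                             # its true class

loss = loss_fn(model(image), label)
loss.backward()  # gradient of the loss w.r.t. the input pixels

eps = 8 / 255  # perturbation budget: a few grey levels per pixel
adversarial = (image + eps * image.grad.sign()).clamp(0, 1).detach()

with torch.no_grad():
    print(f"loss before: {loss.item():.3f}")
    print(f"loss after:  {loss_fn(model(adversarial), label).item():.3f}")
```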

ML is based on a frequentist principle and the question of the representativeness of the data in the learning phase is a major issue. Using autonomous driving as an example, we now see many vehicles on the Saclay plateau, equipped with sensors to store as much experience as possible. That being said, it is difficult to say how long it will be before we have seen enough situations to be able to deploy a sufficiently intelligent and reliable system in this field, enabling us to deal with all future situations.

There are certainly applications for which the data available today allows ML to be implemented in a satisfactory manner. This is the case, for example, for handwriting recognition, for which neural networks are perfectly developed. For other problems, in addition to experimental data, generative models will also be used, producing artificial data that account for adverse situations but without claiming to be exhaustive. This is the case for ML applications in cybersecurity, in an attempt to automatically detect malicious intrusions into a network for example.
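
As a rough illustration of that last idea, the sketch below (an assumed setup, not a description of the actual cybersecurity systems mentioned) fits a simple generative model to a handful of known intrusions and samples artificial ones to enlarge the training set of a detector.

```python
# Sketch: augmenting rare "adverse" examples with a generative model.
# All traffic distributions and sizes are invented for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
normal_traffic = rng.normal(0.0, 1.0, size=(5000, 8))  # abundant
known_attacks = rng.normal(3.0, 0.5, size=(30, 8))     # rare

# Fit a (deliberately simple) generator on the rare class, then sample
# synthetic attacks; richer generative models would play the same role.
generator = GaussianMixture(n_components=1, random_state=0).fit(known_attacks)
synthetic_attacks, _ = generator.sample(1000)

X = np.vstack([normal_traffic, known_attacks, synthetic_attacks])
y = np.concatenate([np.zeros(5000), np.ones(30), np.ones(1000)])
detector = RandomForestClassifier(random_state=0).fit(X, y)
print("detector trained with synthetic attack examples added")
```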

Generally speaking, there are many problems for which the data available is too sparse to implement ML in a simple way. This is often the case in anomaly detection, particularly for predictive maintenance of complex systems. In some cases, the hybridisation of ML and symbolic techniques in AI could provide solutions. These avenues are being explored in the civil and military aviation sectors, as well as in medical imaging. In addition to their effectiveness, such approaches may also enable machines to make decisions that are easier to explain and interpret.
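
What such a hybrid might look like in miniature: the sketch below is an invented example, not a description of the aviation or medical systems cited. It combines a statistical anomaly score with explicit symbolic rules, so that when an alarm is raised the system can also report which human-readable rule fired.

```python
# Sketch of ML/symbolic hybridisation for anomaly detection: a learned
# anomaly score backed by explicit rules, which makes decisions easier
# to explain. Sensors, thresholds and rules are invented for illustration.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
readings = rng.normal(50.0, 5.0, size=(1000, 3))  # temp, pressure, vibration
forest = IsolationForest(random_state=0).fit(readings)

RULES = [  # symbolic layer: domain knowledge, independent of the data
    ("temperature above rated limit", lambda x: x[0] > 80.0),
    ("pressure below safe minimum", lambda x: x[1] < 10.0),
]

def diagnose(x):
    """Flag an anomaly if a rule fires or the learned score is too low."""
    fired = [name for name, rule in RULES if rule(x)]
    if fired:
        return "anomaly: " + ", ".join(fired)        # explainable branch
    score = forest.score_samples([x])[0]             # lower = more unusual
    if score < -0.55:                                # illustrative threshold
        return f"anomaly: unusual pattern (score {score:.2f})"
    return "normal"

print(diagnose([95.0, 50.0, 48.0]))  # caught by a rule
print(diagnose([50.0, 50.0, 50.0]))  # typical reading
```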

What is driving evolution in AI today?

The field of mathematics contributes a lot, especially in terms of efficient information representation and algorithms. But it is also technological progress that is driving AI forward. The mathematical concept of neural networks has been around for many decades. Recent technical advances, particularly in the field of memory, have made it possible to successfully implement deep neural network models. Similarly, distributed computing architectures and dedicated programming frameworks have made it possible to scale up learning on large volumes of data. What remains to be done is to design more frugal approaches, so as to reduce the carbon footprint of computations, which is a very topical issue!
