AI, a weapon against tax fraud

Christophe Gaie
Head of the Engineering and Digital Innovation Division at the Prime Minister's Office
Key takeaways
  • Tax fraud is a major issue: it accounts for a significant share of the tax gap, which is estimated at between 4% and 15% of the sums owed in various OECD countries.
  • In France, there is a desire to step up the fight against fraud, in particular by using artificial intelligence tools.
  • The CISIRH has developed an operational and theoretical framework for comparing fraud detection algorithms developed around the world.
  • To combat tax fraud effectively, AI and algorithms will not be enough; this fight must be part of a collective and human approach.

Detecting tax fraud is a major challenge, particularly in the current context of high government deficits. Fraud accounts for a significant proportion of the tax gap, estimated at between 4% and 15% of the sums owed in various OECD countries. In France, for example, VAT fraud alone is estimated at €20–25bn [1]. As a result, the Cour des Comptes has published numerous studies highlighting the importance of stepping up the fight against fraud [2]. In France, tax fraud is monitored by the DGFiP, which uses several artificial intelligence tools that have produced very promising results.

With this in mind, Christophe Gaie set up a project group with students from CentraleSupélec. Together, they carried out a research study aimed at putting in place an operational framework (methodology, algorithmic approach, computer code, simulation data, etc.) and sharing it with all those involved in the fight against fraud [3].

What was the aim of this study?

This project is a continuation of the more theoretical research that has helped to define and articulate the various concepts, issues and directions in the field [4]. It extends that theoretical work, puts it into practice and proposes an operational framework in which algorithms from researchers all over the world can be developed and compared.

As tax optimisation is not prohibited, our work has focused on fraud in the sense of irregularities. We have also concentrated our efforts on detecting fraud perpetrated by individuals, as fraud by legal entities can be dealt with elsewhere.

Where does your database for this study come from?

A tax file can contain a lot of data relating to the individual: family situation, income, assets, etc. Whether in the laboratory or when studying real data, it is not always possible to have all the data available. We therefore created a fictitious database based on a set of pre-selected data: socio-professional category, income, expenditure and amount of property owned. This database can of course be extended at a later date.

For perfectly valid reasons of confidentiality of personal data, the DGFiP cannot make data available for the detection of tax fraud. As a result, each researcher builds his or her own database independently, which has proved detrimental for several reasons. First, building a database is time-consuming and requires the researcher to master concepts such as income and assets in order to detect fraud. Second, researchers’ algorithms are not necessarily comparable with each other, whereas shared reference databases are a classic approach in the field of digital research (databases of references, telecommunication signals or images, for example).

How does this AI detect fraud?

The AI is based on a tax file model that allows the selection of files to be checked according to configurable criteria. Based on our knowledge of the majority of fraud cases, we have defined a taxpayer’s likelihood of fraud according to different typologies:

  • High expenses and/or high assets in relation to income,
  • Low expenses and/or assets compared to income,
  • High wealth compared to similar people in the same socio-professional category.
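
The three typologies above can be sketched as simple rule-based flags. This is only an illustration under stated assumptions, not the scoring used in the study: the `TaxFile` fields, every threshold (`high_expense_ratio`, `high_asset_ratio`, `low_expense_ratio`, `low_asset_ratio`, `peer_factor`) and the choice of the category median as the peer benchmark are hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TaxFile:
    """Hypothetical tax file holding the fields named in the article."""
    category: str   # socio-professional category
    income: float
    expenses: float
    assets: float


def typology_flags(file, peers, high_expense_ratio=1.2, high_asset_ratio=20.0,
                   low_expense_ratio=0.2, low_asset_ratio=0.1, peer_factor=3.0):
    """Return the names of the typologies matched by a file.

    All thresholds are illustrative assumptions, not values from the study.
    """
    flags = []
    income = max(file.income, 1.0)  # guard against division by zero
    expense_ratio = file.expenses / income
    asset_ratio = file.assets / income

    # Typology 1: high expenses and/or high assets in relation to income.
    if expense_ratio > high_expense_ratio or asset_ratio > high_asset_ratio:
        flags.append("high_vs_income")
    # Typology 2: low expenses and/or assets compared to income.
    if expense_ratio < low_expense_ratio or asset_ratio < low_asset_ratio:
        flags.append("low_vs_income")
    # Typology 3: high wealth compared to people in the same category.
    peer_assets = [p.assets for p in peers if p.category == file.category]
    if peer_assets and file.assets > peer_factor * median(peer_assets):
        flags.append("high_vs_peers")
    return flags


peers = [
    TaxFile("employee", 30000, 25000, 50000),
    TaxFile("employee", 32000, 26000, 60000),
    TaxFile("employee", 28000, 24000, 40000),
]
suspect = TaxFile("employee", 30000, 45000, 400000)
print(typology_flags(suspect, peers))
```

Because the criteria are exposed as parameters, the selection of files remains configurable, in the spirit of the tax file model described above.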

The dataset [5] was compiled using reference data published by INSEE, taking into account the distribution of socio-professional categories, the distribution of income and wealth, and the distribution of expenditure according to these socio-professional categories. The division into categories simply follows the percentages observed in the actual population. For the other parameters, we have used a Singh-Maddala distribution [6].
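
As an illustration of how synthetic values can be drawn from a Singh-Maddala distribution, here is a minimal sketch using inverse-transform sampling. The parameter names `a`, `b`, `q` follow the usual textbook parameterisation, and the numeric values in the usage lines are arbitrary assumptions, not the ones fitted in the study.

```python
import random


def singh_maddala_cdf(x, a, b, q):
    """CDF of the Singh-Maddala distribution: F(x) = 1 - (1 + (x/b)^a)^(-q)."""
    return 1.0 - (1.0 + (x / b) ** a) ** (-q)


def sample_singh_maddala(a, b, q, rng=random):
    """Draw one value by inverting the CDF.

    Solving F(x) = u for x gives x = b * ((1 - u)^(-1/q) - 1)^(1/a).
    """
    u = rng.random()  # uniform in [0, 1)
    return b * ((1.0 - u) ** (-1.0 / q) - 1.0) ** (1.0 / a)


# Hypothetical parameters: shape a, scale b (in euros), shape q.
rng = random.Random(0)
incomes = [sample_singh_maddala(a=2.5, b=30000.0, q=1.5, rng=rng)
           for _ in range(10000)]
print(min(incomes) >= 0.0, len(incomes))
```

The same function could be called with one parameter set per socio-professional category to generate income, wealth and expenditure columns.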

We have developed different types of algorithms to detect potential fraud cases: either based on neural networks with different sampling strategies, or based on a random forest, i.e. a collection of decision trees used to solve a classification problem.
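
To make the idea of "a collection of decision trees" concrete, here is a deliberately simplified sketch: each tree is reduced to a single split (a decision stump) trained on a bootstrap sample and one randomly chosen feature, and the forest predicts by majority vote. It is not the study's implementation; the feature layout (expense-to-income and asset-to-income ratios) and the toy data are assumptions.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def fit_stump(X, y, feature):
    """Pick the threshold on one feature that minimises weighted Gini impurity."""
    best = None
    for threshold in sorted({row[feature] for row in X}):
        left = [lab for row, lab in zip(X, y) if row[feature] <= threshold]
        right = [lab for row, lab in zip(X, y) if row[feature] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            left_vote = Counter(left).most_common(1)[0][0]
            right_vote = Counter(right).most_common(1)[0][0] if right else left_vote
            best = (score, threshold, left_vote, right_vote)
    _, threshold, left_vote, right_vote = best
    return feature, threshold, left_vote, right_vote


def fit_forest(X, y, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        feature = rng.randrange(n_features)          # random feature for this tree
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Majority vote of all trees: 1 = flagged as potential fraud."""
    votes = sum(left if row[f] <= t else right for f, t, left, right in forest)
    return 1 if 2 * votes > len(forest) else 0


# Toy data: [expense/income ratio, asset/income ratio], label 1 = fraud.
X = [[0.5, 2.0], [0.6, 3.0], [0.7, 1.5], [0.8, 2.5],
     [1.6, 9.0], [1.8, 12.0], [2.0, 10.0], [1.7, 11.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
print(predict(forest, [1.9, 10.5]), predict(forest, [0.4, 1.0]))
```

A production model would use deeper trees and a tested library; the sketch only shows how bagging and voting combine weak classifiers.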

Have these algorithms been used on real cases?

Although the algorithms have not been run on real data, it is quite possible to share these elements with public officials, in particular those of the DGFiP’s SJCF-1D “Control programming and data analysis” office, where one of the students subsequently completed an internship. Any collaboration with, or feedback from, a public entity would be a welcome opportunity.

What is the level of accuracy?

It is important to remember that there is a trade-off in detection between precision (i.e. the rate of correct predictions among positive responses) and sensitivity, also called recall (i.e. the rate of truly positive individuals detected by the model). The results of an algorithm are therefore expressed in terms of a metric that considers the trade-off between precision and sensitivity (AUPRC: “area under the precision-recall curve”).
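
The two quantities and their summary metric can be computed as in the sketch below. The area under the precision-recall curve is estimated here by average precision, one common estimator; the study may have used a different one, and the labels and scores are made-up values.

```python
def precision_recall(y_true, y_pred):
    """Precision and sensitivity (recall) for binary labels, 1 = fraud."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(y_true, scores):
    """Estimate AUPRC as the mean of precision at each rank holding a true fraud.

    Files are ranked by decreasing score; tied scores get no special handling.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        return 0.0
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / total_pos


y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]           # hypothetical model scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # one possible decision threshold
print(precision_recall(y_true, y_pred))
print(average_precision(y_true, scores))
```

Lowering the decision threshold raises sensitivity at the cost of precision, which is why a threshold-free metric such as AUPRC is used to compare algorithms.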

The proposed algorithms achieve an AUPRC of up to 0.851 for the sensitivity-optimised random forest. This is an excellent result, which points to particularly useful prospects for detecting potential fraud using artificial intelligence.

Is AI enough on its own?

No. The fight against fraud cannot be based on simple detection algorithms; it must be integrated into a collective and human approach, because it is not just a technological issue. The detection of potential fraud must be corroborated by the action of a tax auditor, as part of a procedure that respects the rights of the taxpayer. This approach guarantees that the situation will be examined by people who will take account of tax case law, under the supervision of a judge.

It is therefore important to understand that the analysis of a case is entrusted to auditors on the basis of criteria such as skills, workload, professional interest, coverage of the tax system and so on. We have developed algorithms that aim to suggest a distribution of cases to the head of a team of auditors, who then has the final say. He or she may also take subjective criteria into account, such as the need to train new agents, even if the allocation of files would then no longer be ideal.
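
One simple way to produce such a suggestion is a greedy rule: give each case to the qualified auditor with the lowest relative workload. The sketch below is an assumption-laden illustration, not the allocation algorithm developed in the study; the `Auditor` fields, the single `required_skill` per case and the capacity limit are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Auditor:
    """Hypothetical auditor profile."""
    name: str
    skills: set
    capacity: int                      # maximum number of cases
    cases: list = field(default_factory=list)


def suggest_allocation(cases, auditors):
    """Suggest an auditor for each (case_id, required_skill) pair.

    Returns a dict case_id -> auditor name, or None when no qualified
    auditor has capacity left. The team head remains free to override it.
    """
    suggestion = {}
    for case_id, skill in cases:
        candidates = [a for a in auditors
                      if skill in a.skills and len(a.cases) < a.capacity]
        if not candidates:
            suggestion[case_id] = None
            continue
        # Least relative workload first; ties go to the first auditor listed.
        chosen = min(candidates, key=lambda a: len(a.cases) / a.capacity)
        chosen.cases.append(case_id)
        suggestion[case_id] = chosen.name
    return suggestion


team = [
    Auditor("A", {"income", "property"}, capacity=2),
    Auditor("B", {"income"}, capacity=2),
]
todo = [("c1", "income"), ("c2", "income"), ("c3", "property"), ("c4", "property")]
print(suggest_allocation(todo, team))
```

Criteria such as professional interest or the need to train new agents are left out on purpose: they are exactly the kind of judgement that stays with the head of the team.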

Finally, it is worth remembering that a fraud detection application must be integrated into an information system that ensures that all the administration's functions are carried out. Therefore, in addition to research work, operational implementation requires planning for both interconnections with other applications and the maintainability of the fraud detection application. The ability to integrate new, more powerful algorithms should also be planned for.

James Bowers

Legal disclaimer: The contents of this article are the sole responsibility of the author and are not intended for any purpose other than academic information and research.

Acknowledgements: The author would like to thank the CentraleSupélec students who worked on the project and all the co-authors whose joint work with him has contributed to academic research into fraud.

[3] Prolhac, J., Gaie, C. “Providing an open framework to facilitate tax fraud detection”, International Journal of Computer Applications in Technology, in press, 2023, https://doi.org/10.1504/IJCAT.2023.10055494
[4] Gaie, C. (2023). Struggling Against Tax Fraud, a Holistic Approach Using Artificial Intelligence. In: Gaie, C., Mehta, M. (eds) Recent Advances in Data and Algorithms for e-Government. Artificial Intelligence-Enhanced Software and Systems Engineering, vol 5. Springer, Cham. https://doi.org/10.1007/978-3-031-22408-9_4
[6] Singh, A., Narina, T. and Aakanksha, S. (2016) “A review of supervised machine learning algorithms”, Proceedings of the 3rd International Conference on Computing for Sustainable Global Development (INDIACom), pp.1310–1315. https://ieeexplore.ieee.org/abstract/document/7724478
