TY - GEN
T1 - Tax fraud detection for under-reporting declarations using an unsupervised machine learning approach
AU - De Roux, Daniel
AU - Pérez, Boris
AU - Moreno, Andrés
AU - Del Pilar Villamil, Maria
AU - Figueroa, César
N1 - Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/7/19
Y1 - 2018/7/19
N2 - Tax fraud is the intentional act of lying on a tax return form with intent to lower one's tax liability. Under-reporting is one of the most common types of tax fraud, it consists in filling a tax return form with a lesser tax base. As a result of this act, fiscal revenues are reduced, undermining public investment. Detecting tax fraud is one of the main priorities of local tax authorities which are required to develop cost-efficient strategies to tackle this problem. Most of the recent works in tax fraud detection are based on supervised machine learning techniques that make use of labeled or audit-assisted data. Regrettably, auditing tax declarations is a slow and costly process, therefore access to labeled historical information is extremely limited. For this reason, the applicability of supervised machine learning techniques for tax fraud detection is severely hindered. Such limitations motivate the contribution of this work. We present a novel approach for the detection of potential fraudulent tax payers using only unsupervised learning techniques and allowing the future use of supervised learning techniques. We demonstrate the ability of our model to identify under-reporting taxpayers on real tax payment declarations, reducing the number of potential fraudulent tax payers to audit. The obtained results demonstrate that our model doesn't miss on marking declarations as suspicious and labels previously undetected tax declarations as suspicious, increasing the operational efficiency in the tax supervision process without needing historic labeled data.
AB - Tax fraud is the intentional act of lying on a tax return form with intent to lower one's tax liability. Under-reporting is one of the most common types of tax fraud, it consists in filling a tax return form with a lesser tax base. As a result of this act, fiscal revenues are reduced, undermining public investment. Detecting tax fraud is one of the main priorities of local tax authorities which are required to develop cost-efficient strategies to tackle this problem. Most of the recent works in tax fraud detection are based on supervised machine learning techniques that make use of labeled or audit-assisted data. Regrettably, auditing tax declarations is a slow and costly process, therefore access to labeled historical information is extremely limited. For this reason, the applicability of supervised machine learning techniques for tax fraud detection is severely hindered. Such limitations motivate the contribution of this work. We present a novel approach for the detection of potential fraudulent tax payers using only unsupervised learning techniques and allowing the future use of supervised learning techniques. We demonstrate the ability of our model to identify under-reporting taxpayers on real tax payment declarations, reducing the number of potential fraudulent tax payers to audit. The obtained results demonstrate that our model doesn't miss on marking declarations as suspicious and labels previously undetected tax declarations as suspicious, increasing the operational efficiency in the tax supervision process without needing historic labeled data.
KW - Anomaly detection
KW - Kernel density estimation
KW - Spectral clustering
KW - Tax fraud detection
KW - Unsupervised machine learning
UR - https://www.scopus.com/pages/publications/85051492335
U2 - 10.1145/3219819.3219878
DO - 10.1145/3219819.3219878
M3 - Conference contribution
AN - SCOPUS:85051492335
SN - 9781450355520
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 215
EP - 222
BT - KDD 2018 - Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
PB - Association for Computing Machinery
T2 - 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2018
Y2 - 19 August 2018 through 23 August 2018
ER -