Abstract
A widely acknowledged drawback of many statistical modelling techniques, commonly used in machine learning, is that the resulting model is extremely difficult to interpret. A number of new concepts and algorithms have been introduced by researchers to address this problem. They focus primarily on determining which inputs are relevant in predicting the output. This work describes a transparent, advanced non-linear modelling approach that enables the constructed predictive models to be visualised, allowing model validation and assisting in interpretation. The technique combines the representational advantage of a sparse ANOVA decomposition with the good generalisation ability of a kernel machine. It achieves this by employing two forms of regularisation: a 1-norm based structural regulariser to enforce transparency, and a 2-norm based regulariser to control smoothness. The resulting model structure can be visualised, showing the overall effects of different inputs, their interactions, and the strength of the interactions. The robustness of the technique is illustrated using a range of both artificial and “real world” datasets. The performance is compared to other modelling techniques, and it is shown to exhibit competitive generalisation performance together with improved interpretability.
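The two-regulariser idea described above can be sketched in code. The following is a minimal NumPy illustration, not the authors' algorithm: it builds a small dictionary of ANOVA-component Gram matrices (univariate kernels plus pairwise tensor-product kernels), then alternates between a kernel ridge step (the 2-norm smoothness regulariser) and a nonnegative lasso step over component weights (the 1-norm structural regulariser). The toy data, kernel widths, regularisation weights, and the alternating scheme itself are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(a, b, width=1.0):
    # Univariate Gaussian (RBF) Gram matrix between sample vectors a and b.
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2.0 * width ** 2))

# Toy data (illustrative): y depends on x1 and x2 only; x3 is irrelevant.
n = 120
X = rng.uniform(-2, 2, size=(n, 3))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)

# Dictionary of ANOVA-component Gram matrices: univariate terms k_j(x_j, x'_j)
# plus pairwise interactions built as tensor (elementwise) products.
uni = [gauss_kernel(X[:, j], X[:, j]) for j in range(3)]
pairs = [(0, 1), (0, 2), (1, 2)]
Ks = uni + [uni[i] * uni[j] for i, j in pairs]
m = len(Ks)

lam1, lam2 = 1.0, 1e-2   # 1-norm (structure) and 2-norm (smoothness) weights
d = np.ones(m)           # nonnegative weight per ANOVA component

for it in range(20):
    # Step A (2-norm): kernel ridge with the current weighted kernel sum.
    K = sum(dj * Kj for dj, Kj in zip(d, Ks))
    alpha = np.linalg.solve(K + lam2 * np.eye(n), y)
    # Step B (1-norm): nonnegative lasso over component responses z_j = K_j alpha,
    # solved by cyclic coordinate descent with soft-thresholding clipped at zero.
    Z = np.stack([Kj @ alpha for Kj in Ks], axis=1)
    for _ in range(50):
        for j in range(m):
            r = y - Z @ d + Z[:, j] * d[j]
            d[j] = max(0.0, (Z[:, j] @ r - lam1 / 2.0) / (Z[:, j] @ Z[:, j]))

f_hat = Z @ d
print("component weights:", np.round(d, 3))
print("train MSE:", np.mean((y - f_hat) ** 2))
```

Components whose weight `d[j]` is driven to zero drop out of the model, which is what makes the fitted structure visualisable: the surviving univariate and pairwise terms can each be plotted on their own.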
Cite this article
Gunn, S., Kandola, J. Structural Modelling with Sparse Kernels. Machine Learning 48, 137–163 (2002). https://doi.org/10.1023/A:1013903804720