Learning of Models

A generic approach to studying complex systems is to let the computer learn models from data that serve as examples. A model may be considered a ‘view on the world’; several competing models are usually possible, so model selection is required. Learning may also take place on-line, with the system updating its model continuously as new data come in (‘live’ models). This is especially relevant for time-dependent systems, for example evolutionary or adaptive systems. Big data provides far more examples to learn from, which makes the learning process more robust and allows models to be verified continuously. Here statistical methods such as Bayesian networks play an important role.
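As a minimal sketch of what learning ‘in a continuous way’ can mean, the Python fragment below updates a running estimate of a mean and variance one observation at a time, using Welford's algorithm. The class name `OnlineGaussian` and the sample values are illustrative assumptions; the fragment does not describe any particular system.

```python
from dataclasses import dataclass


@dataclass
class OnlineGaussian:
    """Running mean and variance of a data stream (Welford's algorithm)."""

    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        """Absorb one new observation without storing the earlier ones."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Unbiased sample variance; undefined for fewer than two points.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


# Hypothetical stream of observations arriving one by one.
model = OnlineGaussian()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    model.update(x)

print(model.n, model.mean, model.variance)
```

The model is usable after every update, and memory use does not grow with the length of the stream. That is the property that makes this style of learning suitable for data that keep arriving.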

The Monk system, developed at ALICE, is a continuously (24/7) learning engine that handles hundreds of millions of word images in handwriting styles ranging from the Dead Sea Scrolls to Western medieval documents and Chinese poems.

This topic clearly links data science with systems complexity. In addition, system identification is important for linking data with system models. Data sets abound in which not only the number of observations but also the number of attributes per observation is very large. Although the system may initially be described in a high-dimensional space, a much smaller set of features may suffice to describe its qualitative behaviour. As regards model validation, it is clear that with massive amounts of data many traditional statistical methods will not remain tenable much longer. Small deviations from the stylized model (‘let’s assume a Gaussian’) can no longer be ignored: they can be measured reliably enough to warrant the development of new, more sophisticated methods of density estimation and statistical testing. We note that model testing via Bayesian techniques is now widely used in astronomy, where large data sets are common.
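The remark about small deviations from a stylized model can be illustrated with a one-sample z-statistic. The sketch below assumes independent observations with known standard deviation `sigma`, and a true mean that differs from the assumed mean by a hypothetical 1% of `sigma`. The function name and the numbers are illustrative only.

```python
import math


def z_statistic(shift: float, sigma: float, n: int) -> float:
    """z-score of a sample mean that lies `shift` away from the model mean."""
    standard_error = sigma / math.sqrt(n)
    return shift / standard_error


# Hypothetical numbers: the model mean is off by 1% of the standard deviation.
shift, sigma = 0.01, 1.0
for n in (100, 10_000, 1_000_000, 100_000_000):
    print(f"n = {n:>11,d}   z = {z_statistic(shift, sigma, n):6.1f}")
```

Because the standard error shrinks as the square root of `n`, the same small deviation moves from roughly 0.1 standard errors at a hundred observations to roughly 100 standard errors at a hundred million. A model that was adequate for small samples is then firmly rejected.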

Key questions:

How can we reduce the dimensionality of a complex system to a more manageable size? Model order reduction for the analysis and control of complex large-scale systems also fits here.
Using too many variables with too little data may lead to overfitting. How does this change when big data is available for learning?
How can we quantify the complexity of systems and (big) data? Concepts such as Kolmogorov-Sinai complexity, Lyapunov exponent estimation, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), intrinsic dimension, Vapnik-Chervonenkis dimension, model ranking via Bayesian evidence (e.g., as used in astronomy), and many other descriptors may play a role here.
Can we effectively combine learning and model-based approaches?
How do we explore and visualize patterns in high-dimensional big data spaces?
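Two of the descriptors named in the questions above, AIC and BIC, can be written down in a few lines. The sketch below compares a constant model with a straight-line model on synthetic data, assuming independent Gaussian errors. The data, the helper names and the parameter counts (which include the noise variance) are assumptions made for illustration.

```python
import math
import random


def gaussian_log_likelihood(residuals):
    """Maximised Gaussian log-likelihood given the residuals of a fitted model."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n  # ML estimate of the noise variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_lik


def residuals_constant(xs, ys):
    """Residuals of the least-squares constant model y = c."""
    mean_y = sum(ys) / len(ys)
    return [y - mean_y for y in ys]


def residuals_line(xs, ys):
    """Residuals of the least-squares line y = a + b x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Synthetic data (assumed for illustration): a line plus Gaussian noise.
rng = random.Random(0)
xs = [i / 10 for i in range(200)]
ys = [1.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

n = len(xs)
scores = {}
# k counts the fitted parameters, including the noise variance.
for name, fit, k in [("constant", residuals_constant, 2), ("line", residuals_line, 3)]:
    log_lik = gaussian_log_likelihood(fit(xs, ys))
    scores[name] = {"AIC": aic(log_lik, k), "BIC": bic(log_lik, k, n)}
    print(name, scores[name])
```

Lower values indicate the preferred model. Both criteria penalize extra parameters, but the BIC penalty grows with ln n, so the two can disagree when large data sets are used to compare models of very different size.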

Last modified: 12 August 2020 11.54 a.m.