Industrial Strength Data-Driven Modeling
about the course
This one-day course opens with a discussion of the potential of, and requirements for,
successful data-driven models. We introduce commonly used modeling
methods, then focus on ensemble-based symbolic regression via genetic
programming. This allows construction of transparent non-linear models using
high-dimensional input-response data with possibly coupled, correlated and
noisy input variables. The results are compared with alternative methods on a
sequence of toy problems and real-world examples.
In the last two decades, data-driven computational modeling of systems and
processes has evolved from a solution of last resort to a mainstream approach
to industrial problem solving (prediction, control, optimization), providing
enhanced and timely predictive capabilities to people and businesses.
To be accepted by domain experts and considered by fundamental modelers,
data-driven input-response models should be (1) interpretable, (2)
parsimonious, (3) accurate, (4) extrapolative, (5) trustable, and (6) robust.
In an industrial setting the capability to have a trustable prediction of the
output within and outside the training range is as important as interpretability.
The possibility of integrating information from first principles, low
maintenance and development costs with no (or negligible) operator
interference, robustness with respect to variability in the data, and the
ability to detect novelties in the data and adapt to changes in the system's
behavior are also essential.
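One way to make the "trustable prediction" requirement concrete is to predict with an ensemble of structurally different models and to treat their disagreement as a warning sign. The sketch below is illustrative only: the three models are hypothetical stand-ins assumed to agree on a training range of roughly 0 to 1, and `predict_with_trust` is a name invented for this example, not part of any tool mentioned in this announcement.

```python
import statistics

# Hypothetical ensemble: three models assumed to fit the same training data
# (inputs roughly in [0, 1]) about equally well, but with different structure.
ensemble = [
    lambda x: x,
    lambda x: x + 0.02 * x * (x - 1.0),
    lambda x: x - 0.03 * x * x * (x - 1.0),
]

def predict_with_trust(x):
    """Return (median prediction, spread) for input x.

    The spread (max minus min over ensemble members) is used here as a
    simple disagreement measure: a large spread suggests low trust.
    """
    predictions = [model(x) for model in ensemble]
    return statistics.median(predictions), max(predictions) - min(predictions)

inside_prediction, inside_spread = predict_with_trust(0.5)    # within the assumed training range
outside_prediction, outside_spread = predict_with_trust(3.0)  # far outside it
print(inside_prediction, inside_spread)
print(outside_prediction, outside_spread)
```

In this toy setup the members nearly coincide inside the assumed training range and diverge outside it, so the spread can flag predictions that should be treated with caution.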
The complicating factors in most real-world problems are usually the high
dimensionality of the available data, noise, missing data, correlated variables,
and a lack of information about which inputs are significantly related to the response.
There is no single technique producing models that are guaranteed to fulfill
all of the requirements above; rather, there is a continuum of methods (and
hybrids) offering different trade-offs among these competing objectives.
Commonly used techniques include linear and nonlinear regression, regression
random forests, radial-basis functions, neural networks, support vector
machines (SVMs), and symbolic regression. Additional approaches such as
boosting and ordinal optimization can be used to amplify the capabilities of
all of the above methods even further.
This course predominantly focuses on modeling with symbolic regression and
compares it with the use of linear regression on several toy and real-world
examples. It considers two case studies to illustrate the data-driven modeling
workflow that leads the modeler from input-response data to models, to
questions about the problem difficulty and dimensionality, to insights, and,
eventually, to actionable knowledge (prediction, validation, active design of
experiments).
Symbolic regression is a field of supervised learning by evolutionary
algorithms, aimed at modeling given numeric input-response data. Unlike
classical regression, which assumes a certain model structure and optimizes the
parameters, symbolic regression searches for appropriate model structure and
coefficients. Symbolic regression models are defined in a space of all possible
explicit expressions of the response variable as functions of some of the input
variables, constants and operators from a given set.
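The difference between fitting parameters of a fixed structure and searching over structures can be sketched in a few lines. The example below is a deliberately tiny stand-in: it enumerates a handful of candidate expressions exhaustively instead of evolving them with genetic programming, and fits only a scale and offset for each one. The toy data, the operator set, and all function names are assumptions made for illustration.

```python
import itertools
import math

# Toy data from a hypothetical "unknown" process y = 3*x1*x2 + 2.
# x3 is an irrelevant input that the search should end up ignoring.
data = [(x1, x2, x3)
        for x1 in (0.5, 1.0, 2.0)
        for x2 in (1.0, 1.5, 3.0)
        for x3 in (0.0, 4.0)]
y = [3.0 * x1 * x2 + 2.0 for x1, x2, _ in data]

VARIABLES = {"x1": 0, "x2": 1, "x3": 2}
BINARY_OPS = {
    "+": lambda a, b: a + b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else float("nan"),
}

def candidate_structures():
    """Yield (text, function) pairs: single variables and op(variable, variable)."""
    for name, idx in VARIABLES.items():
        yield name, (lambda row, i=idx: row[i])
    for (n1, i1), (n2, i2) in itertools.product(VARIABLES.items(), repeat=2):
        for symbol, op in BINARY_OPS.items():
            yield f"({n1} {symbol} {n2})", (lambda row, i=i1, j=i2, f=op: f(row[i], row[j]))

def fit_scale_offset(f_vals, targets):
    """Closed-form least squares for targets ~= a * f + b."""
    n = len(f_vals)
    mean_f, mean_t = sum(f_vals) / n, sum(targets) / n
    var_f = sum((f - mean_f) ** 2 for f in f_vals)
    if var_f == 0:
        return 0.0, mean_t  # constant structure: best fit is the mean
    a = sum((f - mean_f) * (t - mean_t) for f, t in zip(f_vals, targets)) / var_f
    return a, mean_t - a * mean_f

def search():
    """Return (sse, text, a, b) for the structure with the lowest squared error."""
    best = None
    for text, fn in candidate_structures():
        f_vals = [fn(row) for row in data]
        if any(math.isnan(v) or math.isinf(v) for v in f_vals):
            continue  # skip structures undefined on the data (division by zero)
        a, b = fit_scale_offset(f_vals, y)
        sse = sum((a * f + b - t) ** 2 for f, t in zip(f_vals, y))
        if best is None or sse < best[0]:
            best = (sse, text, a, b)
    return best

best_sse, best_text, best_a, best_b = search()
print(f"y ~= {best_a:.3f} * {best_text} + {best_b:.3f}")
```

Classical regression corresponds to calling `fit_scale_offset` on a single structure chosen in advance; the search adds a loop over structures. Real symbolic regression explores a far larger expression space and therefore needs a guided search such as genetic programming.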
We present symbolic regression as a powerful methodology for industrial data
analysis and data-driven modeling, and cover state-of-the-art strategies for
efficient generation of plausible regression models, which are designed to
optimize competing trade-offs of high accuracy, low complexity, improved
generalization capabilities and trustworthiness.
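The trade-off between accuracy and complexity is commonly handled by keeping only models that are not dominated in both objectives, that is, the Pareto front. The following is a minimal sketch of that selection step; the candidate models, their complexity scores, and their error values are made up for illustration and are not results from any real run.

```python
def pareto_front(models):
    """Return the models not dominated in (complexity, error).

    Lower is better for both objectives. A model is dominated if another
    model is at least as good in both objectives and strictly better in one.
    """
    front = []
    for m in models:
        dominated = any(
            o["complexity"] <= m["complexity"]
            and o["error"] <= m["error"]
            and (o["complexity"] < m["complexity"] or o["error"] < m["error"])
            for o in models
        )
        if not dominated:
            front.append(m)
    return sorted(front, key=lambda m: m["complexity"])

# Hypothetical candidates with illustrative complexity and error values.
models = [
    {"expr": "x1", "complexity": 1, "error": 0.40},
    {"expr": "x1 * x2", "complexity": 3, "error": 0.10},
    {"expr": "x1 * x2 + x3", "complexity": 5, "error": 0.10},
    {"expr": "x1 + x2", "complexity": 3, "error": 0.25},
    {"expr": "(x1 * x2) / (1 + x3)", "complexity": 7, "error": 0.02},
]

front = pareto_front(models)
for m in front:
    print(m["complexity"], m["error"], m["expr"])
```

Here "x1 * x2 + x3" is dropped because "x1 * x2" reaches the same error with lower complexity, and "x1 + x2" is dropped because "x1 * x2" is more accurate at the same complexity. The remaining models leave the final choice along the front to the modeler.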
- Motivation and challenges for data-driven modeling
- Commonly used modeling methods
- Requirements for successful input-response models
- Linear regression for input-response modeling: details and examples
- Symbolic regression via genetic programming using the DataModeler add-on for Wolfram Mathematica:
  - Model generation and initialization
  - Model selection and complexity control
  - Model evaluation
  - Model-based feature selection and dimensionality reduction
  - Model-based outlier detection
  - Model ensembles and active design of experiments
- Case studies
This is a hands-on course. Participants will benefit maximally if they bring
laptops with Wolfram Mathematica installed and explore examples together
with the instructor.
- Mathematica notebooks with examples and case studies
- Evolved-Analytics DataModeler Mathematica Add-on environment for
industrial data analysis (90-day license). See evolved-analytics.com for more
21 June 2012, 10.00-16.30
Katya Vladislavleva, PhD, PDEng
Katya Vladislavleva is Chief Data Scientist and Partner at Evolved Analytics and CEO of Evolved Analytics Europe. She earned her PhD at Tilburg University, the Netherlands, working on symbolic regression for empirical model building on noisy high-dimensional data and on applications of system identification in industry (discovering structure-property or structure-activity relationships for product development and building soft sensors for process control).
She also holds a Professional Doctorate in Engineering (industrial mathematics) from Eindhoven University of Technology, the Netherlands, and a Master of Science in Mathematics (mathematical theory of intelligent systems) from Lomonosov Moscow State University, Russia. Her research interests include industrial process optimization, data-driven modeling, and high-performance computing, particularly industrial-scale data analysis and feature selection for regression.
From 2008 to 2011 she was a Guest Professor at Antwerp University, Belgium, teaching graduate and postgraduate courses on numerical linear algebra, optimization, elements of numerical methods, and advanced numerical methods.