Industrial Strength Data-Driven Modeling

about the course

This one-day course opens with a discussion of the potential of, and requirements for, successful data-driven models. We introduce commonly used modeling methods, then focus on ensemble-based symbolic regression via genetic programming. This approach allows the construction of transparent nonlinear models from high-dimensional input-response data with possibly coupled, correlated, and noisy input variables. The results are compared with alternative methods on a sequence of toy problems and real-world examples.

motivation

In the last two decades, data-driven computational modeling of systems and processes has evolved from a solution of last resort to a mainstream approach to industrial problem solving (prediction, control, optimization), providing enhanced and timely predictive capabilities to people and businesses.

To be accepted by domain experts and considered by fundamental modelers, data-driven input-response models should be (1) interpretable, (2) parsimonious, (3) accurate, (4) extrapolative, (5) trustable, and (6) robust.

In an industrial setting, a trustable prediction of the output both within and outside the training range is as important as interpretability. Also essential are the possibility of integrating information from first principles, low development and maintenance costs with no (or negligible) operator interference, robustness to variability in the data, and the ability to detect novelties in the data and adapt to changes in the system's behavior.

The complicating factors in most real-world problems are usually the high dimensionality of the available data, noise, missing data, correlated variables, and a lack of information about which inputs are significantly related to the response.

No single technique produces models that are guaranteed to fulfill all of the requirements above. Instead, there is a continuum of methods (and hybrids) offering different trade-offs among these competing objectives. Commonly used predictive modeling techniques include linear and nonlinear regression, regression random forests, radial-basis functions, neural networks, support vector machines (SVMs), and symbolic regression. Additional approaches such as boosting and ordinal optimization can amplify the capabilities of all of the above methods even further.

This course predominantly focuses on modeling with symbolic regression and compares it with the use of linear regression on several toy and real-world examples. It considers two case studies to illustrate the data-driven modeling workflow that leads the modeler from input-response data to models, to questions about the problem difficulty and dimensionality, to insights, and, eventually, to actionable knowledge (prediction, validation, active design of experiments).

Symbolic regression is a field of supervised learning by evolutionary algorithms, aimed at modeling given numeric input-response data. Unlike classical regression, which assumes a certain model structure and optimizes only its parameters, symbolic regression searches for both an appropriate model structure and its coefficients. Symbolic regression models are defined in the space of all possible explicit expressions of the response variable as functions of some of the input variables, constants, and operators from a given set.
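The difference between fitting a fixed structure and searching over structures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy data, the operator set, and the helper names (fit_linear, candidate_structures, search) are invented for this sketch, and the search is a brute-force enumeration of a handful of expressions, not genetic programming and not the DataModeler workflow used in the course.

```python
import itertools
import numpy as np

# Toy data (assumption for this sketch): the true relation is y = 2*x1*x2 + 1.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 3.0, size=(50, 2))
y = 2.0 * X[:, 0] * X[:, 1] + 1.0


def fit_linear(features, y):
    """Least-squares fit of y on the given feature column(s) plus an intercept."""
    A = np.column_stack([features, np.ones(len(y))])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    rmse = float(np.sqrt(np.mean((A @ coeffs - y) ** 2)))
    return coeffs, rmse


# Classical regression: the structure y = a1*x1 + a2*x2 + b is fixed in advance
# and only the parameters are optimized.
_, fixed_rmse = fit_linear(X, y)

# Structure search: enumerate candidate expressions built from the inputs and a
# small operator set, fit coefficients for each, and keep the best structure.
OPERATORS = {"+": np.add, "-": np.subtract, "*": np.multiply, "/": np.divide}


def candidate_structures(X):
    """Yield (name, feature) pairs such as ('x1 * x2', x1 * x2)."""
    pairs = [(0, 1), (1, 0)]
    for (i, j), (symbol, op) in itertools.product(pairs, OPERATORS.items()):
        yield f"x{i + 1} {symbol} x{j + 1}", op(X[:, i], X[:, j])


def search(X, y):
    """Return (name, coeffs, rmse) of the candidate with the lowest RMSE."""
    best = None
    for name, feature in candidate_structures(X):
        coeffs, rmse = fit_linear(feature, y)
        if best is None or rmse < best[2]:
            best = (name, coeffs, rmse)
    return best


best_name, best_coeffs, best_rmse = search(X, y)
print("fixed linear structure RMSE:", fixed_rmse)
print("best searched structure:", best_name, "RMSE:", best_rmse)
```

On this toy data the fixed linear structure cannot represent the product term, while the search recovers a product of the two inputs with a scale and an offset. A genetic-programming search explores a far larger expression space by evolving a population of expressions rather than enumerating them.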

We present symbolic regression as a powerful methodology for industrial data analysis and data-driven modeling, and cover state-of-the-art strategies for the efficient generation of plausible regression models designed to balance the competing objectives of high accuracy, low complexity, improved generalization, and trustworthiness.
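One common way to handle a trade-off between accuracy and complexity is to keep only the models that no other model beats on both criteria at once (a Pareto front). The sketch below assumes that reading of the trade-off. The Model tuple, the complexity scores, and the error values are hypothetical and serve only to illustrate the selection step; they are not course results.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    expression: str
    complexity: int  # hypothetical size measure, e.g. number of nodes
    error: float     # hypothetical prediction error on the data


def dominates(a: Model, b: Model) -> bool:
    """True if a is at least as good as b on both criteria and better on one."""
    no_worse = a.complexity <= b.complexity and a.error <= b.error
    better = a.complexity < b.complexity or a.error < b.error
    return no_worse and better


def pareto_front(models: List[Model]) -> List[Model]:
    """Keep the models that no other model dominates, sorted by complexity."""
    front = [m for m in models if not any(dominates(o, m) for o in models)]
    return sorted(front, key=lambda m: m.complexity)


# Hypothetical candidate models with made-up scores.
candidates = [
    Model("x1", 1, 0.90),
    Model("x1 + x2", 3, 0.40),
    Model("x1 * x2", 3, 0.10),
    Model("x1 * x2 + x1", 5, 0.10),
    Model("x1 * x2 + sin(x2)", 6, 0.05),
]

front = pareto_front(candidates)
for m in front:
    print(m.complexity, m.error, m.expression)
```

Here "x1 + x2" drops out because "x1 * x2" is more accurate at the same complexity, and "x1 * x2 + x1" drops out because it is more complex without being more accurate. The remaining models are the ones a modeler would inspect when choosing how much complexity a gain in accuracy is worth.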

topics include

  1. Motivation and challenges for data-driven modeling
  2. Commonly used modeling methods
  3. Requirements for successful input-response models
  4. Linear regression for input-response modeling: details and examples
  5. Symbolic regression via genetic programming using the DataModeler add-on for Wolfram Mathematica:
    1. Model generation and initialization;
    2. Model selection and complexity control;
    3. Model evaluation;
    4. Model-based feature selection and dimensionality reduction;
    5. Model-based outlier detection;
    6. Model ensembles and active design-of-experiments.
  6. Case studies

This is a hands-on course. Participants will get the most out of it if they bring a laptop with Wolfram Mathematica installed and explore the examples together with the instructor.

course material

  • Slides
  • Mathematica notebooks with examples and case studies
  • Evolved Analytics DataModeler, a Mathematica add-on environment for industrial data analysis (90-day license). See evolved-analytics.com for more information.

dates

21 June 2012, 10.00-16.30

price

EURO 250,-

by

Katya Vladislavleva, PhD, PDEng

Katya Vladislavleva is Chief Data Scientist and Partner at Evolved Analytics and CEO of Evolved Analytics Europe. She completed a PhD at Tilburg University, the Netherlands, on symbolic regression for empirical model building on noisy high-dimensional data and on applications of system identification in industry (discovering structure-property or structure-activity relationships for product development and building soft sensors for process control).

She also holds a Professional Doctorate in Engineering (industrial mathematics) from Eindhoven University of Technology, the Netherlands, and a Master of Science in Mathematics (mathematical theory of intelligent systems) from Lomonosov Moscow State University, Russia. Her research interests include industrial process optimization, data-driven modeling, and high-performance computing, particularly industrial-scale data analysis and feature selection for regression.

From 2008 to 2011 she was a Guest Professor at Antwerp University, Belgium, teaching graduate and post-graduate courses on numerical linear algebra, optimization, and elementary and advanced numerical methods.

All Mathematica 9 events:
4 June 2013: Introduction to Mathematica (free), 10.00-17.00
11 June 2013: Wolfram European Technology Conference, 2 days
25 June 2013: Programming with Mathematica (free), 10.00-17.00