Hydroinformatics
Dimitri Solomatine's pages

DATA MINING, MACHINE LEARNING and DATA-DRIVEN MODELLING


What is a model?

The term "model" admits dozens of definitions. Here the following one is used: a model is a simplified representation of reality built with the objective of explaining or predicting it. Depending on the purpose, the desired accuracy, the available data and other factors, models can use very different means of representation. For example, a structural engineer applies materials science and formulates equations describing the material behaviour, or builds a simplified structure resembling the real one and tests various designs, load conditions, etc. Different methods can be used to construct such a representation.

Several steps are normally undertaken when a model is built. In simplified form they can be formulated as follows.

1. Explore the problem space: state the problem.
2. Explore the solution space: specify what exactly is expected as output.
3. Specify the modelling methods and choose the tools (software).
4. Perform the modelling (this part is often called data mining):
    - prepare the data;
    - survey the data;
    - build the model.
5. Apply the model (and discover some knowledge).
6. Evaluate results.
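
The steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the rainfall and flow records are invented, and the "model" is a deliberately naive baseline (the mean of the training flows), chosen so that the workflow itself stays visible.

```python
import math

# Hypothetical records: (rainfall in mm, observed flow in m3/s).
records = [(0.0, 5.0), (2.0, 6.1), (4.0, 7.0), (6.0, 8.2), (8.0, 8.9), (10.0, 10.1)]

# Step 4, prepare the data: keep the last third aside for evaluation.
split = len(records) * 2 // 3
train, test = records[:split], records[split:]

# Step 4, survey the data: simple summary statistics of the training flows.
train_flows = [flow for _, flow in train]
summary = {
    "min": min(train_flows),
    "max": max(train_flows),
    "mean": sum(train_flows) / len(train_flows),
}

# Step 4, build the model: a naive baseline that always predicts the mean flow.
def model(rainfall):
    return summary["mean"]

# Step 5, apply the model to the held-out records.
predictions = [model(rain) for rain, _ in test]

# Step 6, evaluate the results with the root mean squared error (RMSE).
rmse = math.sqrt(sum((p - q) ** 2 for p, (_, q) in zip(predictions, test)) / len(test))
print(summary, round(rmse, 3))
```

Any real model would replace the baseline in the "build the model" step; the preparation and evaluation steps stay the same.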

In the context of civil engineering and hydroinformatics the following models can be considered:

[Figure: Hydroinformatics system and the models used]

What is data-driven modelling? Context of civil engineering

Traditionally, modelling in civil engineering is based on a good understanding of the underlying processes and uses so-called "physically-based" (or "knowledge-driven", behavioral) models. Physical principles describing the water dynamics are written down as equations (for example, the Navier-Stokes equations), transformed into a solvable form and solved using a computer.

These could be, for example, models based on the Navier-Stokes equations describing the behavior of water in particular circumstances. Examples are 1D surface (river) water models, 2D coastal models, groundwater models, etc. The equations are solved using finite-difference, finite-element or other schemes, and the results (normally water levels and discharges) are presented to decision makers. Knowledge-driven models can also be "social", "economic", etc. Observed data are used during model calibration. Such models are referred to as physically-based, simulation, or process models.

In contrast, a "data-driven" model of a system is defined as a model connecting the system state variables (input, internal and output variables) while using only limited knowledge of the details of the "physical" behavior of the system. Probably the simplest data-driven model is a linear regression model.
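
As an illustration of such a simplest data-driven model, the sketch below fits a straight line y = a + b*x by ordinary least squares. The water levels and discharges are hypothetical numbers invented for the example, and the function name `fit_linear` is likewise only illustrative.

```python
def fit_linear(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical observations: water level (m) and discharge (m3/s).
levels = [1.0, 2.0, 3.0, 4.0]
discharges = [12.0, 21.0, 29.0, 38.0]

a, b = fit_linear(levels, discharges)

# The fitted model connects input and output without any hydraulic equations.
predicted = a + b * 2.5
print(a, b, predicted)
```

Nothing in the fit encodes the hydraulics; the relation is learned from the observations alone, which is the defining feature of a data-driven model.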

General knowledge of the physics is of course still needed (for the proper choice of relevant parameters), but this knowledge need not be as detailed as that required for physically-based models. "Hybrid models" combine both types of models.

Generally speaking, physically-based models are more accurate and more general. The problem is that sometimes it is not possible to build a trustworthy physically-based model. In such cases, if observational data are available, data-driven models may help. Data-driven modelling (DDM) complements simulation modelling and in some cases could replace it.

Machine learning

Machine learning (ML) is traditionally considered a part of artificial intelligence. It aims at building programs that improve with experience (that is, learn). The two most important problems that ML addresses are:

  • classification – when an example (a point in the input space) has to be classified to one of several classes;
  • regression (numerical prediction).

Important results were achieved in this area in the 1970s and 1980s. The researchers from Prof. Aizerman's group at the Institute of Control Problems of the Russian Academy of Sciences could be mentioned here. One of the outstanding results of that group was the development by Vladimir Vapnik of statistical learning theory, currently used in techniques based on so-called support vector machines (not to be confused with vector computers!). This area has lately received a lot of attention (see Vapnik's book The Nature of Statistical Learning Theory, 1995).

Relation between data-driven modelling and machine learning

Simply put, a data-driven model is built using machine learning techniques. The process of training it is in fact a machine learning process.

[Figure: Learning in data-driven modelling]


What is data mining?

Data mining relates to statistics, data analysis, databases, and machine learning. Data mining is closely related to the analysis (mining) of very large data sets in databases. It allows for finding trends and relationships between variables characterising systems and processes with the objective of predicting their future state.

Data mining is often mentioned as part of a wider area - knowledge discovery in databases (KDD). Important applications of data mining and KDD are seen in banking, financial services and marketing.

Data mining, knowledge discovery, neural networks, machine learning, computational intelligence: is it all the same?

The several terms in the title compete to name the same interdisciplinary area. It is difficult, if not impossible, to accommodate in one formal definition disparate areas with their own established identities, such as fuzzy sets, neural networks, evolutionary computation, machine learning, Bayesian reasoning, etc. Following a good academic tradition, an individual or a group of researchers often identifies an area slightly different from an existing one, introduces terminology, and organises a new conference, a journal, professorships, a school of thought, etc. This is what has been happening in areas close to artificial intelligence (AI) during the last two decades.

Data mining (DM), knowledge discovery in databases (KDD), computational intelligence (CI), machine learning (ML), intelligent data analysis (IDA), soft computing, pattern recognition: all these areas intersect considerably and have similar focus and application areas. It is really difficult to find a clear-cut difference between them. Still, certain differences can be formulated:

  • CI is seen as "a new name" for AI embodying all other areas;
  • ML is an area of computer science, a sub-area of AI concentrating on the theoretical foundations. Classification (pattern recognition) problems are addressed by ML more often than regression (numerical prediction) problems. Technically speaking, most ML problems can be formulated as problems of function approximation;
  • DM and KDD often focus on very large databases and are associated with applications in banking, financial services and customer relationship management (CRM). DM is seen as a part of the wider KDD. The methods used are mainly from statistics and ML;
  • IDA is relatively new and seems to concentrate more on data analysis in medicine and research. The methods used are also from statistics and ML;
  • soft computing includes, in particular, fuzzy rule-based systems induced from data.

What are the most popular techniques used in data-driven modelling?

The most popular techniques for solving numerical prediction (regression) problems are:

  • statistical techniques,
  • artificial neural networks (ANN),
  • fuzzy rule-based systems.

Other methods (that are often even more accurate) include:

  • M5 model trees,
  • instance-based learning,
  • locally weighted regression,
  • chaos theory
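
To give a flavour of one of these methods, here is a minimal sketch of locally weighted regression for a single input variable. It assumes a Gaussian weighting kernel and a local straight-line fit; the function name `lwr_predict`, the `bandwidth` parameter and the data are illustrative choices, not taken from any particular package.

```python
import math

def lwr_predict(xs, ys, x_query, bandwidth=1.0):
    """Predict y at x_query from a weighted straight-line fit around x_query."""
    # Gaussian kernel: training points close to the query get larger weights.
    ws = [math.exp(-((x - x_query) ** 2) / (2 * bandwidth ** 2)) for x in xs]
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    # Weighted least squares slope of the local line.
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y + slope * (x_query - mean_x)

# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
linear_prediction = lwr_predict(xs, ys, 2.5)

# The same method applied to a curved relationship, y = x squared.
curved_prediction = lwr_predict(xs, [x ** 2 for x in xs], 2.0, bandwidth=0.5)
print(linear_prediction, curved_prediction)
```

A separate local model is fitted for every query point, so there is no single global equation. A smaller bandwidth follows the data more closely but is more sensitive to noise.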

For classification and clustering problems the most popular are:

  • decision trees
  • Bayesian methods
  • instance-based learning (k-nearest neighbor algorithm)
  • self-organizing feature maps and others.
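
The k-nearest neighbor idea from the list above can be sketched as follows: a new example receives the class that is most frequent among its k closest training examples. The sample values and the two class labels are hypothetical, and `knn_classify` is an illustrative name.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train is a list of (features, label) pairs; query is a feature tuple."""
    # Sort training examples by Euclidean distance to the query, keep the k nearest.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote among the labels of the nearest neighbours.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training examples: two measured features and a known class.
samples = [
    ((15.0, 0.6), "sand"),
    ((12.0, 0.9), "sand"),
    ((18.0, 0.5), "sand"),
    ((1.0, 4.0), "clay"),
    ((1.5, 3.5), "clay"),
    ((0.8, 4.5), "clay"),
]

first = knn_classify(samples, (14.0, 0.7))
second = knn_classify(samples, (1.2, 3.8))
print(first, second)
```

In practice the features would normally be scaled to comparable ranges first, since otherwise the variable with the largest numerical range dominates the distance.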

Are there some examples of applications in civil engineering?

Plenty of examples can be found in peer-reviewed journals and on the Internet. Several examples from our own projects follow:

  • finding the relationships between wind, wave and pressure data and the surge water levels and currents in the ocean, allowing their prediction;
  • using data on rainfall and past river flows to predict future flows and floods;
  • analysing cone penetration test (CPT) data with the objective of predicting the soil type.
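
For the rainfall and river flow example, a common first step is to turn the two time series into input-output examples on which any regression technique can then be trained. The sketch below is one possible way to do this, assuming equally long series with one value per time step; the function name, the number of lags and the data are illustrative assumptions.

```python
def make_lagged_examples(rainfall, flow, rain_lags=2, flow_lags=1):
    """Build (inputs, target) pairs for one-step-ahead flow prediction.

    Inputs at time t are R(t), R(t-1), ... and Q(t), Q(t-1), ...;
    the target is the flow one step later, Q(t+1).
    """
    first = max(rain_lags, flow_lags) - 1
    examples = []
    for t in range(first, len(flow) - 1):
        inputs = [rainfall[t - i] for i in range(rain_lags)]
        inputs += [flow[t - i] for i in range(flow_lags)]
        examples.append((inputs, flow[t + 1]))
    return examples

# Hypothetical daily series: rainfall (mm) and river flow (m3/s).
rainfall = [0.0, 5.0, 12.0, 3.0, 0.0]
flow = [4.0, 4.5, 7.0, 9.5, 6.0]

examples = make_lagged_examples(rainfall, flow)
for inputs, target in examples:
    print(inputs, "->", target)
```

How many past values to include is itself a modelling decision, usually guided by the response time of the catchment and by analysis of the data.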

Want to know more?

Check these publications and some links. Check also the web site http://datamining.ihe.nl

Presentation Data-driven modelling: paradigm, methods, experiences (4.5M) at the 5th International Conference on Hydroinformatics, Cardiff, UK, July 2002.

Check my other pages on:

  artificial neural networks
  fuzzy logic
  chaos theory
  global optimization
   

 
