What is a model?
The term "model" admits dozens of definitions. Here a model is
defined as a simplified representation of reality built with the
objective of explaining or predicting it. Depending on the purpose,
the desired accuracy, the available data and other factors, models
can use very different means of representation. For example, a
structural engineer applies materials science and formulates
equations describing the behaviour of the material, or builds a
simplified structure resembling the real one and tests various
designs, load conditions, etc. Different methods can be used to
construct such a representation.
Building a model normally involves several steps. In simplified
form they can be formulated as follows.
1. Explore the problem space: state the problem.
2. Explore the solution space: specify what exactly is expected
as output.
3. Specify the modelling methods and choose the tools (software).
4. Perform the modelling (this part is often called data
mining):
- prepare the data;
- survey the data;
- build the model.
5. Apply the model (and discover some knowledge).
6. Evaluate results.
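The steps 4 to 6 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records (rainfall in mm, river flow in m3/s) are invented, and the "model" is deliberately trivial (it always predicts the mean flow of the training records), since the point is the sequence of steps rather than the modelling technique.

```python
import math
import statistics

# Hypothetical raw records: (rainfall in mm, river flow in m3/s); None marks a gap.
raw = [(2.0, 11.0), (0.0, 9.0), (None, 10.0), (5.0, 15.0),
       (1.0, 10.0), (8.0, None), (3.0, 12.0), (6.0, 17.0)]

# Step 4a, prepare the data: drop records with missing values.
data = [(r, q) for r, q in raw if r is not None and q is not None]

# Step 4b, survey the data: basic summary statistics of the flows.
flows = [q for _, q in data]
summary = {"n": len(flows), "min": min(flows), "max": max(flows),
           "mean": statistics.mean(flows)}

# Step 4c, build the model on a training subset (the first 4 records).
train, test = data[:4], data[4:]
mean_flow = statistics.mean(q for _, q in train)

def model(rainfall):
    """Placeholder model: ignores its input and predicts the mean training flow."""
    return mean_flow

# Step 5, apply the model to records it has not seen.
predictions = [model(r) for r, _ in test]

# Step 6, evaluate: root mean squared error on the test records.
rmse = math.sqrt(statistics.mean((p - q) ** 2
                                 for p, (_, q) in zip(predictions, test)))
print(summary, round(rmse, 3))
```

Keeping some records aside for step 6 matters more than the choice of model here: an error measured on the training records alone would say little about how the model behaves on new data.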
In the context of civil engineering and hydroinformatics the following
models can be considered:

Figure: Hydroinformatics system and the models used
What is data-driven modelling? Context of civil engineering
Traditionally, modelling in civil engineering is based on a good
understanding of the underlying processes and uses so-called
"physically-based" (or "knowledge-driven", behavioural) models.
Physical principles describing the water dynamics are written down
as equations (for example, the Navier-Stokes equations), transformed
into a solvable form and solved on a computer. Examples are 1D
surface (river) water models, 2D coastal models, groundwater models,
etc. The equations are solved using finite-difference, finite-element
or other schemes, and the results, normally water levels and
discharges, are presented to decision makers. Knowledge-driven
models can also be "social", "economic", etc. Observed data is used
during model calibration. Such models are referred to as
physically-based, simulation, or process models.
In contrast, a "data-driven" model of a system is defined as a model
connecting the system state variables (input, internal and output
variables) using only limited knowledge of the details of the
"physical" behaviour of the system. Probably the simplest data-driven
model is a linear regression model. General knowledge of the physics
is of course still needed (for the proper choice of relevant
parameters), but this knowledge need not be as detailed as for
physically-based models. "Hybrid models" combine both types of
models.
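As an illustration of the simplest case mentioned above, here is a minimal sketch of a linear regression model fitted by ordinary least squares. The water levels are invented for the example, and the names `fit_line` and `predict` are ours. Nothing about the hydraulics is encoded in the model; the relation is taken from the observations alone.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b

# Hypothetical observations: upstream and downstream water levels (m).
upstream = [1.0, 2.0, 3.0, 4.0]
downstream = [0.9, 1.4, 1.9, 2.4]

a, b = fit_line(upstream, downstream)

def predict(level):
    """Predict the downstream level from an upstream level."""
    return a + b * level

print(round(a, 3), round(b, 3), round(predict(2.5), 3))
```

The physical knowledge enters only through the choice of input: someone had to decide that the upstream level is a sensible predictor of the downstream one.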
Generally speaking, physically-based models are more accurate
and more general. The problem is that sometimes it is not possible
to build a trustworthy physically-based model. In such cases, if
observation data is available, data-driven models may help.
Data-driven modelling (DDM) complements simulation modelling and in
some cases could replace it.
Machine learning
Machine learning (ML) is traditionally considered a part of artificial
intelligence (AI). It aims at building programs that improve with
experience (that is, learn). The two most important problems that ML
solves are:
- classification – when an example (a point in the input space)
has to be classified to one of several classes;
- regression (numerical prediction).
Important results were achieved in this area in the 1970s and 1980s.
The researchers from Prof. Aizerman's group at the Institute of Control
Problems of the Russian Academy of Sciences deserve mention here.
One of the outstanding results of that group was the development
by Vladimir Vapnik of statistical learning theory, which currently
underlies techniques based on so-called support vector machines
(not to be confused with vector computers!). This area has recently
received a lot of attention (see Vapnik's book The Nature of
Statistical Learning Theory, 1995).
Relation between data-driven modelling and machine learning
Simply put, a data-driven model is based on machine learning
techniques. The process of training it is in fact a machine
learning process.

Figure: Learning in data-driven modelling
What is data mining?
Data mining relates to statistics, data analysis, databases,
and machine learning. Data mining is closely related to the analysis
(mining) of very large data sets in databases. It allows for finding
trends and relationships between variables characterising systems
and processes with the objective of predicting their future state.
Data mining is often mentioned as part of a wider area - knowledge
discovery in databases (KDD). Important applications of data
mining and KDD are seen in banking, financial services and marketing.

Data mining, knowledge discovery, neural networks,
machine learning, computational intelligence: is it all the same?
The several terms (expressions) in the title compete to name the
same interdisciplinary area. It is difficult, if not impossible,
to accommodate in a formal definition disparate areas with their
own established individualities such as fuzzy sets, neural networks,
evolutionary computation, machine learning, Bayesian reasoning,
etc. Following a good academic tradition, an individual or a group
of researchers often identifies an area which is slightly different
from an already existing one, introduces terminology, organises
a new conference, a journal, professorship positions, school of
thought, etc. This is what was happening with areas close to artificial
intelligence (AI) during the last two decades.
Data mining (DM), knowledge discovery in databases (KDD), computational
intelligence (CI), machine learning (ML), intelligent data analysis
(IDA), soft computing, pattern recognition: all these areas overlap
considerably and have similar focus and application areas. It
is really difficult to find a clear-cut difference between them.
Still, certain differences can be formulated:
- CI is seen as "a new name" for AI embodying all other
areas;
- ML is an area of computer science, a sub-area of AI concentrating
on the theoretical foundations. Classification (pattern recognition)
problems are addressed by ML more often than regression (numerical
prediction) problems. Technically speaking, most ML problems
can be formulated as problems of function approximation;
- DM and KDD often focus on very large databases and are
associated with applications in banking, financial services and
customer relationship management (CRM). DM is seen as a part of the
wider KDD. The methods used are mainly from statistics and ML;
- IDA is relatively new and seems to concentrate more on data
analysis in medicine and research. The methods used are also from
statistics and ML;
- soft computing covers, in particular, fuzzy rule-based systems
induced from data.
What are the most popular techniques used in data-driven modelling?
The most popular techniques for solving numerical prediction
(regression) problems are:
- statistical techniques;
- artificial neural networks (ANN);
- fuzzy rule-based systems.
Other methods (that are often even more accurate) include:
- M5 model trees;
- instance-based learning;
- locally weighted regression;
- chaos theory.
For classification and clustering problems the most popular are:
- decision trees;
- Bayesian methods;
- instance-based learning (k-nearest neighbor algorithm);
- self-organizing feature maps and others.
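To give a feeling for one of the listed techniques, here is a minimal sketch of instance-based learning with the k-nearest neighbor algorithm used for classification. The examples are invented (two features loosely inspired by cone penetration tests, with made-up values), and the function name `knn_classify` is ours. A new point is assigned the class that is most common among its k closest stored examples.

```python
import math
from collections import Counter

def knn_classify(examples, query, k=3):
    """Return the majority label among the k stored examples nearest to query.

    examples: list of (feature_tuple, label) pairs.
    """
    nearest = sorted(examples, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical examples: (cone resistance in MPa, friction ratio in %) -> soil type.
examples = [
    ((12.0, 0.5), "sand"), ((15.0, 0.6), "sand"), ((10.0, 0.8), "sand"),
    ((1.0, 4.0), "clay"), ((1.5, 5.0), "clay"), ((0.8, 4.5), "clay"),
]

print(knn_classify(examples, (11.0, 0.7)))
print(knn_classify(examples, (1.2, 4.2)))
```

No model is built in advance: the stored examples are the model. In practice the features would normally be scaled to comparable ranges first, since otherwise the feature with the largest numerical range dominates the distance.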
Are there some examples of applications in civil engineering?
Plenty of examples can be found in peer-reviewed journals
and on the Internet. Several examples that we have dealt with in our
projects follow:
- finding the relationships between wind, wave and pressure data
and the surge water levels and currents in the ocean, allowing
their prediction;
- using data on rainfall and past river flows to predict future
flows and floods;
- analysis of cone penetration test (CPT) data with the objective
of predicting the soil type.
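For the rainfall and river flow example, a typical first step is to turn the two time series into input-output examples that a data-driven model can learn from. The following is a sketch under our own assumptions: the series are invented, the function name `make_lagged_examples` is hypothetical, and the choice of two past time steps as inputs is arbitrary.

```python
def make_lagged_examples(rain, flow, lags=2):
    """Build (inputs, target) pairs for one-step-ahead flow prediction.

    inputs: the last `lags` rainfall values followed by the last `lags`
    flow values; target: the flow at the next time step.
    """
    examples = []
    for t in range(lags, len(flow)):
        inputs = rain[t - lags:t] + flow[t - lags:t]
        examples.append((inputs, flow[t]))
    return examples

# Hypothetical series with one value per time step.
rain = [0.0, 5.0, 2.0, 0.0, 1.0]       # rainfall, mm
flow = [10.0, 12.0, 15.0, 13.0, 11.0]  # river flow, m3/s

for inputs, target in make_lagged_examples(rain, flow, lags=2):
    print(inputs, "->", target)
```

Any of the regression techniques listed earlier could then be trained on such pairs. How many past steps to include is itself a modelling decision, usually guided by knowledge of the catchment and by analysis of the data.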
Want to know more?
Check these publications
and some links. Check also the web site
http://datamining.ihe.nl
Presentation Data-driven modelling:
paradigm, methods, experiences (4.5M) at the 5th International
Conference on Hydroinformatics, Cardiff, UK, July 2002.