Plenary Lectures
Plenary Lecture 5 | Prediction and Decision Making from Bad Data
Date / Time | 26 September 2019, Thursday / 14:40 - 15:20 hrs
Speaker | Scott Ferson, Professor, School of Engineering, University of Liverpool, Liverpool, UK; Director of the Liverpool Institute for Risk and Uncertainty; Director of the EPSRC and ESRC Centre for Doctoral Training in Quantification and Management of Risk & Uncertainty in Complex Systems & Environments
Engineering has entered a new phase in which ad hoc data collection plays an ever more important role in planning, development/construction, operation, and decommissioning of structures and processes. Intellectual attention has largely focused on exciting new sensing technologies, and on the prospects and challenges of ‘big data’. A critical issue that has received less attention is the need for new data analysis techniques that can handle what we might call bad data: data that do not obey the assumptions required for a planned analysis. Most widely used statistical methods, and essentially all machine learning techniques, are limited in application to situations in which their input data are (i) precise, (ii) abundant, and (iii) characterised by specific properties such as linearity, independence, completeness, balance, or being distributed according to a particular named distribution.
Although statistical techniques have been developed for situations in which some of these requirements can be relaxed, the techniques often still make assumptions about the data that may be untenable
in practice. For instance, methods to handle missing data may assume the data are missing at random, which is rarely true when sensors fail under stress. Of course, even in the age of big data, we may
have small data sets for rare events such as those associated with tiny failure rates, unusual natural events, crime/terror incidents, uncommon diseases, etc. Although many statistical methods allow for
small sample sizes, they generally require data to be representative of the underlying population, which can be hard to guarantee. Moreover, not all uncertainty has to do with small sample sizes. Poor or
variable precision, missing values, non-numerical information, unclear relevance, dubious provenance, contamination by outliers, errors, and lies are just a few of the other sources of bad data.
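For instance, consider a hypothetical sensor that tends to fail at high loads, so readings are missing precisely where the quantity of interest is largest. The sketch below is a made-up illustration of this point rather than an example from the lecture; the failure threshold and the sensor range are assumptions. It shows how a naive estimate that ignores the missingness mechanism is biased, while bounding the missing values by their physical limits still brackets the truth.

```python
# Illustrative sketch only: how data that are missing NOT at random bias a
# naive estimate, and how crude interval bounds can still bracket the truth.
# The threshold (120) and sensor range ([0, 200]) are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
true_load = rng.normal(100.0, 15.0, size=1000)       # true quantity of interest

# Hypothetical sensor that returns no reading when the load exceeds 120:
recorded = np.where(true_load < 120.0, true_load, np.nan)

naive_mean = np.nanmean(recorded)                     # ignores why values are missing
print(f"true mean   : {true_load.mean():6.2f}")
print(f"naive mean  : {naive_mean:6.2f}   (biased low: data are not missing at random)")

# A more honest summary: replace each missing value by the extremes of the
# sensor's physical range to obtain bounds on the mean.
lower = np.where(np.isnan(recorded), 0.0, recorded).mean()
upper = np.where(np.isnan(recorded), 200.0, recorded).mean()
print(f"mean bounds : [{lower:6.2f}, {upper:6.2f}]   (interval that brackets the truth)")
```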
We review the surprising answers to a few questions about bad data:
- How can we handle data that are incomplete, unbalanced, or have missing or censored values?
- When investing in sensors, when are more sensors preferable to more precise sensors?
- What can be done with ludicrously small data sets, like n=8, or n=2, or even n=1?
- What if the data are clearly not collected randomly?
- Can bad data be combined with good data? When shouldn't they be combined?
- When can increasing the number of sensors counterintuitively increase uncertainty?
Analyses can be conducted along a spectrum of increasing robustness, from assumption-laden to assumption-free. Software tools are needed to track the assumptions we make in data analyses and to automatically characterise the robustness of the estimates and conclusions we draw from them.
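As one illustration of the assumption-light end of that spectrum, the sketch below builds a distribution-free 95% confidence band around an empirical distribution from only n = 8 observations, using the two-sided Kolmogorov-Smirnov critical value. This is a generic construction offered for illustration, not necessarily the speaker's own method; the data values are invented, and scipy.stats.kstwo requires a recent SciPy release (1.5 or later).

```python
# Minimal sketch: a distribution-free 95% confidence band (a p-box-like bound)
# around an empirical CDF built from only eight observations. The data values
# are invented for illustration. Requires SciPy >= 1.5 for scipy.stats.kstwo.
import numpy as np
from scipy.stats import kstwo

x = np.sort(np.array([2.1, 3.4, 3.9, 4.2, 5.0, 5.5, 6.8, 9.3]))   # n = 8
n = len(x)
ecdf = np.arange(1, n + 1) / n            # empirical CDF at the sorted points

d_crit = kstwo.ppf(0.95, n)               # exact two-sided KS critical distance
lower = np.clip(ecdf - d_crit, 0.0, 1.0)  # band clipped to valid probabilities
upper = np.clip(ecdf + d_crit, 0.0, 1.0)

print(f"n = {n}, 95% KS critical distance = {d_crit:.3f}")
for xi, lo, hi in zip(x, lower, upper):
    print(f"F({xi:4.1f}) is within [{lo:.3f}, {hi:.3f}]")
# The whole band covers the true CDF with 95% confidence, with no assumption
# about the shape of the underlying (continuous) distribution.
```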