Some risks and opportunities of automated data analysis

Some thoughts arising from the TMCF 2024 workshop at the Turing Institute.

Authors: Daniel Archambault, Roger Beecham, Andrew Gelman, Jessica Hullman, and Edwin Pos

Published: June 25, 2024

Introduction

We have all been there: you have your research question, you’ve gathered your dataset, and you are ready to start digging. Or perhaps you only have a question and are wondering how you could answer it in the first place given all the options available. Ever since the invention of data analysis, the human analyst has held the primary role of thinking, deciding and analyzing. At the same time, there has been a steady progression toward automating parts of the statistical workflow that are better done by machines, such as calculation. With impressive recent advancements in general-purpose intelligent assistants like chatbots, we are in a better position than ever to imagine what an intelligent assistant that acts as a helpful collaborator during data analysis might look like. These new possibilities provide an opportunity to reflect on the ideal interaction between statistical tools and human knowledge. What are the risks involved with increasing amounts of automation in data analysis steps like problem specification, data collection, model specification and selection, and interpretation of results? What are the opportunities? How can we ensure that the sum of human knowledge and AI prediction in data analysis will be more than its parts?

To formalize the problem slightly, one could say that data analysis has traditionally involved applying some sequence of operations (e.g., reductions or visualizations), which we shall express as functions \(f = (f_1, f_2, …, f_n)\), to data \(D\), each of which generates some intermediate output \(y_i = f_i(D)\), to produce some ultimate knowledge output or interpretation \(K = f_{human}(y)\), where \(f_{human}\) represents a human processing \(y = (y_1, y_2, …, y_n)\). For example, a conventional visual analytics workflow involves a human selecting and applying to data some set of queries or operations \(f(D)\) (which might consist of filters, aggregations, regressions, etc.), then interpreting the output of these functions to produce some interpretation or decision \(K\).
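
To make the notation concrete, here is a minimal sketch (in Python, on synthetic data, with operations we invent purely for illustration) of the human-directed version of this workflow: the analyst chooses the operations \(f\), the machine computes the intermediate outputs \(y\), and the interpretation \(K\) remains a human step.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=50, scale=10, size=200)   # hypothetical dataset of 200 scores

# The analyst-chosen operations f = (f_1, f_2, f_3): filters, aggregations, etc.
f = [
    lambda d: d.mean(),                      # f_1: an aggregation
    lambda d: d[d > 60].size / d.size,       # f_2: a filter-then-summarize query
    lambda d: np.histogram(d, bins=10),      # f_3: input to a visualization
]

# The machine computes the intermediate outputs y_i = f_i(D) ...
y = [f_i(D) for f_i in f]

# ... but the interpretation K = f_human(y) is produced by the analyst,
# drawing on prior knowledge that is not written down anywhere in code.
for i, y_i in enumerate(y, start=1):
    print(f"y_{i}:", y_i)
```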

As we consider how this simple formulation changes with increasing levels of automation, we propose that it gives rise to a spectrum. At one end is the human-directed extreme, as in conventional visual analytics (Keim et al., 2008), where the human has chosen the specific functional form of the members of \(f\) to apply to the data, given their prior knowledge, to find \(K\), and automation is involved primarily to perform what would otherwise be tedious calculations.

In the current age of technological development, where products like ChatGPT are promoted for data analysis, from insight discovery to communication, we are closer to being able to imagine the other extreme of this spectrum: fully automated data analysis. In this case, the human only asks the question or provides a goal and optionally some data \(D\). The remainder of the process is automated, concluding with the machine presenting its interpretation \(K\) (i.e., \(K = f_{assistant}(y)\)). The human is not directly involved in selecting \(f\) nor in processing the intermediate outputs \(y\). Consequently, the selected \(f\) may not be interpretable to a human in the sense of being explainable via human-meaningful parameters.

Although it may seem to some readers to be immediately apparent that removing the human from data science is a bad idea, it is not at all clear that the current extent of automation in data analysis workflows is sufficient. Integrating more machine intelligence may help address human limitations we have had no choice but to accept in data analysis. For example, the human will be limited by their own experience, and may only be able to see a limited set of all the possible operations or functions that could be applied. Or, information processing biases may lead them to misperceive \(y\), resulting in a non-optimal \(K\) even if their selection of \(f\) was in fact optimal. There would seem to be many ways that using AI could improve analysis outputs in light of the limited knowledge and experience that any human analyst might have.

We propose that thinking about the spectrum from fully human-driven to fully machine-driven analysis is helpful in several ways. First, by explicitly reflecting on what the ideal role of the human is in data analysis, we gain insight into what aspects of data analysis we think are truly human or non-automatable, and which are better given to the machine. Second, as research and practice increasingly make use of predictive modeling, we are in a better position to apply our expertise in statistical modeling to design better futures for interactive data analysis. Below we explore this spectrum from its two extremes.

The human-directed extreme

At one end of the spectrum, the human selects the functions to be applied, processes the outputs of those functions more or less manually (i.e., without necessarily using formal decision rules), and maps the beliefs that they form to an interpretation. We assume that the human has access to visual analytics processes and technology (Keim et al., 2008). By this we mean that the human does not perform the calculation themselves (we aren't worried about mathematical errors in generating the individual \(y_i\)), and they have access to visual analytics systems that perform the calculations and present the answers in a visual form for interpretation through \(f_{human}\). We assume the human has some level of understanding of how the particular set of functions work and what they mean. In other words, \(f\) consists of interpretable operations, such as parameterized functions \(f_\theta\) where \(\theta\) is considered meaningful to the human, or simple deterministic functions like finding the maximum of a data series. It is worth noting that despite the interpretability of the operations applied, how exactly the human processes the available information (i.e., \(f_{human}\)) can be thought of as a black box of sorts, as the generation of \(y\) is guided implicitly by the previous experience and domain-specific knowledge of the human in addition to the data. It is assumed that the human knows how to best apply their domain knowledge in this scenario.

We can loosely analogize the learning problem to Bayesian decision theory for the purpose of identifying failure points. Assume some high-level learning goal given some observed dataset; for example, an analyst hired by a large school district might be investigating the question, “What factors explain the dip in high school math performance between 2018 and 2020?” While normally in applied Bayesian analysis we would focus on how posterior beliefs about some parameters deemed meaningful are arrived at within the context of some particular model specification, here we will instead think of the knowledge gained from the entire analysis workflow as resulting from some combination of the analyst’s prior knowledge and beliefs, the data at hand, and modeling assumptions. Assume, for example, that what the analyst finds is intended to inform the school district’s decision about what to invest next year’s budget in. The human analyst will likely bring some relevant domain knowledge (prior beliefs) on which variables are likely to matter and how much, how they should model the data to achieve their goals, what aspects of context should influence their interpretation, etc. For example, maybe they have prior experience on factors that predict high school test scores, biases that exist in available data, contextual knowledge about the specific school district, etc. What they know about such problems will influence the statistical modeling approach they select (e.g., a multiple regression predicting 11th grade standardized math test performance from some particular set of covariates). The output of the models and operations they apply informs their posterior beliefs about which factors influence student performance and how this information should inform a decision.
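
As a hedged illustration of the kind of model such an analyst might select, the sketch below fits a multiple regression on entirely synthetic data; the covariates (attendance, per-student funding, share of remote instruction) are hypothetical stand-ins rather than features of any real district’s data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical covariates (synthetic data, for illustration only).
attendance = rng.uniform(0.7, 1.0, n)        # attendance rate
funding = rng.normal(10_000, 1_500, n)       # funding per student
remote_share = rng.uniform(0.0, 0.8, n)      # share of instruction delivered remotely

# Synthetic "true" relationship used to generate outcomes.
math_score = (
    400 + 150 * attendance + 0.002 * funding - 40 * remote_share
    + rng.normal(0, 15, n)
)

# Multiple regression via ordinary least squares.
X = np.column_stack([np.ones(n), attendance, funding, remote_share])
coef, *_ = np.linalg.lstsq(X, math_score, rcond=None)

for name, b in zip(["intercept", "attendance", "funding", "remote_share"], coef):
    print(f"{name:>13}: {b: .4f}")
```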

If the analyst were fully Bayesian, they would optimally combine the new information they learn with their prior beliefs to arrive at posterior beliefs, then select the utility-maximizing interpretation or decision from a space of possible decisions. But a number of things can go wrong when dealing with boundedly rational agents. The analyst’s prior knowledge itself might be biased or incomplete, leading them to analyze the wrong data, fail to select an appropriate set of operations \(f\), etc. They might make bad assumptions in model specification or selection, leading to misleading modeling outputs. They might overconstrain their analysis based on their priors (or discount them too easily) by taking advantage of degrees of freedom; for example, if they are predisposed to prefer a certain interpretation, they might settle on a particular fitted model because it aligns with these preferences, arriving at biased beliefs. They might arrive at the wrong interpretation or decision in light of their beliefs.
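
A minimal sketch of this idealized Bayesian analyst, using a conjugate normal model and numbers chosen purely for illustration: a prior belief about one effect of interest is combined with (synthetic) evidence to form a posterior, and the decision with the highest posterior expected utility is selected.

```python
import numpy as np

rng = np.random.default_rng(2)

# Prior belief about the effect of, say, a tutoring investment on test scores.
prior_mean, prior_sd = 5.0, 4.0

# Synthetic evidence: estimated effects from comparable districts (assumed data).
data = rng.normal(loc=8.0, scale=6.0, size=12)
sigma = 6.0                                   # assumed known observation noise

# Conjugate normal-normal update: posterior over the effect.
post_var = 1 / (1 / prior_sd**2 + len(data) / sigma**2)
post_mean = post_var * (prior_mean / prior_sd**2 + data.sum() / sigma**2)

# Two hypothetical decisions, each with a utility that depends on the true effect.
def utility(decision, effect):
    if decision == "invest_in_tutoring":
        return 10 * effect - 30               # payoff scales with the effect, minus cost
    return 0.0                                # status quo: no cost, no gain

# Select the decision that maximizes posterior expected utility.
draws = rng.normal(post_mean, np.sqrt(post_var), size=10_000)
for d in ["invest_in_tutoring", "status_quo"]:
    eu = np.mean([utility(d, e) for e in draws])
    print(f"expected utility of {d}: {eu:.2f}")
```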

On the other hand, there may be advantages to the fully human case. The human understands exactly what was applied in \(f\) and drives the interpretation of \(y\), enabling them to apply domain knowledge in a flexible, unscripted way. Their understanding and experience with the functions or models they apply makes it easier for them to debug issues, and they may feel more confident about the interpretation they ultimately arrive at. They have the ability to reformulate the goals and problem upon viewing results at any point, for example if they originally failed to consider some important data. More broadly, if the purpose of most data science is to generate insight from data for human purposes, in the fully human-driven case we are 100% sure that data science is happening as every step was done by and consumed by the human, providing an opportunity for understanding, critique, and reconsideration at each step.

The fully automated extreme

At the fully automated extreme, the human provides an analysis goal and, optionally, data. In terms of the visual analytics model of Keim et al. (2008), the assistant has complete control over all stages, including the automated processing of knowledge. In the extreme, the assistant simply returns the answer, \(K\), without any justification or provenance for the analysis. For the purposes of this post, we assume the assistant optimally combines what can be gleaned from all available prior statistical and scientific modeling examples in an autoregressive framework in which its task is to predict the most appropriate sequence of steps given the human’s prompt.
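
The sketch below caricatures this autoregressive framing with a toy, hand-specified conditional distribution over analysis steps; in a real assistant these probabilities would come from a large model trained on a corpus of prior analyses, not from a lookup table.

```python
# Toy conditional distribution p(next step | previous step), standing in for a
# model trained on prior analyses (the probabilities are illustrative only).
transitions = {
    "START":             {"load_data": 1.0},
    "load_data":         {"clean_data": 0.7, "fit_model": 0.3},
    "clean_data":        {"explore": 0.5, "fit_model": 0.5},
    "explore":           {"fit_model": 0.8, "clean_data": 0.2},
    "fit_model":         {"check_diagnostics": 0.6, "report_K": 0.4},
    "check_diagnostics": {"fit_model": 0.3, "report_K": 0.7},
}

def run_assistant(prompt, max_steps=10):
    """Greedily pick the most probable next step until the answer is 'reported'.

    In a real system the prompt (and any provided data) would condition these
    probabilities; this toy ignores it.
    """
    steps, current = [], "START"
    for _ in range(max_steps):
        options = transitions[current]
        current = max(options, key=options.get)   # most probable continuation
        steps.append(current)
        if current == "report_K":
            break
    return steps

print(run_assistant("What explains the dip in math performance?"))
# The human sees only the final K; the chosen sequence of steps stays hidden.
```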

Several risks arise from the training constraints on the assistant. For example, one risk arises from how the assistant’s suggestions will be constrained by what it has seen, which in the best case consists of all prior observed statistical analyses. The model cannot necessarily suggest an altogether new form of analysis unless it in some way represents a combination of previous observed analyses. This raises the question of whether and when we might expect completely new analysis paradigms to emerge which could not somehow be reconstitutions of existing ideas.

Relatedly, the assistant is constrained to suggesting what is most probable. Ask any statistician if they expect frequencies within a corpus representing all historical examples of statistical analysis to capture the appropriateness of a given analysis path and they will answer with a resounding no. If ritualistic choice of models and interpretation of results is a problem in the fully human-directed case, then any agent restricted to optimizing an autoregressive objective over some corpus of training data has the potential to miss something more appropriate but less prevalent in the training data. To some extent, modern machine learning pipelines can overcome these biases through processes like fine-tuning, where a small amount of preference data is collected after training an unsupervised foundation model and used to adjust the conditional distribution of the model’s output. However, fine-tuning for particular cases implies a form of interaction that would seem to contradict the extreme fully automated case: if we expect analyses to improve through interaction to obtain further context- or expert-specific input, we would seem to be advocating for an interactive dialogue between the human user and the assistant.

Relatedly, the fully automated scenario assumes that the human, at the time of prompting, is able to specify their true analysis goal, and in doing so will know how to customize the prompt to contain any relevant domain knowledge they have. But this begins to sound a lot like our trust that the human knows best how to apply their knowledge in the fully human-driven case. Is this reasonable? Or does the real human analyst require intermediate outputs along the way in order to cue lessons from their prior knowledge and experience? Consider how often real analysis workflows involve a shift of direction. Upon viewing representations of our data in early exploratory data analysis, we might realize our misconceptions about what it contained. Perhaps the fidelity of information we thought we could capture (and which we need to achieve our analysis goals) is simply not attainable, and we must rethink our questions altogether. Or, we might realize upon reviewing model diagnostics that there is little signal in the variables we have collected, but have an idea of what other features we could collect. Can we imagine data analysis without such flexibility? Is it possible that an assistant could anticipate such changes in direction? In the extreme case, the answer is simply returned without any reasoning or record of the process used to derive it, leaving the human no obvious recourse for further action. Knowledge is gained, in some sense, only if they choose to trust the output, and that trust may be hard to come by.

Charting a generalized path through extremes of \(y\)

The above reflections suggest the ideal amount of machine assistance in data analysis will lie somewhere between the two extremes, and may differ depending on the expertise of the particular human and the specific context. However, we could say that the goal in finding the sweet spot is to identify the points where the human’s imagination holds them back from realizing a better path. If good statistical practice involves judicious use of computation to extend the human’s ability to imagine possible outcomes (e.g., alternative values for a statistic resulting from bias or sampling error, counterfactuals in causal inference, equivalently performing models in machine learning, etc.), a computational assistant that can entertain many models or theories simultaneously provides opportunities to “amplify” human cognition, a stated goal of visualization.
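
One way to picture this kind of amplification is a multiverse-style sweep, sketched below on synthetic data: the machine fits many reasonable specifications at once and reports how the quantity of interest varies across them, something a lone analyst would rarely attempt by hand.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
n = 300

# Synthetic data with one exposure, one outcome, and two candidate covariates.
z1, z2 = rng.normal(size=n), rng.normal(size=n)
x = 0.5 * z1 + rng.normal(size=n)
y = 1.0 * x + 0.8 * z1 - 0.3 * z2 + rng.normal(size=n)

def fit_effect(include_z1, include_z2, drop_outliers):
    """Estimate the coefficient on x under one analysis specification."""
    cols = [np.ones(n), x] + ([z1] if include_z1 else []) + ([z2] if include_z2 else [])
    X, Y = np.column_stack(cols), y
    if drop_outliers:                        # a discretionary preprocessing choice
        keep = np.abs(Y - Y.mean()) < 2.5 * Y.std()
        X, Y = X[keep], Y[keep]
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return coef[1]                           # coefficient on x

# The assistant entertains every combination of choices simultaneously.
effects = [fit_effect(a, b, c) for a, b, c in product([True, False], repeat=3)]
print(f"effect of x across 8 specifications: "
      f"min={min(effects):.2f}, max={max(effects):.2f}")
```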

But how much should an assistant prompt the human to imagine outside their comfort zone? Attempting to design the optimal human-machine pairing naturally motivates reflection on the extent to which we want data-driven science to be pluralistic, allowing for different beliefs and conventions when it comes to how to best learn from data. Consider a familiar tension in the field of data analysis between inference, theory and explanation, and prediction. Researchers and practitioners in different fields and domains vary in how much they value each. It would seem that an optimal assistant would need to adjust to the specific problem in ways congruent with domain-specific values. In social and natural science applications, the assistant might, for example, behave like a scientist from those traditions, where the generated \(y\) (and consequently the \(K\) derived from \(y\)) is grounded in substantive theory. In operations research, where prediction is typically the goal, we might judge possible outputs purely in terms of best out-of-sample performance, leading to a much more quantitative (minimal) inspection of \(y\) to arrive at \(K\).

Paradigmatic differences leave us with many questions around how the assistant should be constrained or configured to realize the goals of scientific knowledge development. We should think about what level of constraints are tolerable and ultimately desirable for learning in a domain, which is a question about how comfortable we are with letting an assistant push us outside our comfort zones as scientists. If the social or natural scientist fine-tunes or otherwise configures the assistant only in a way that reproduces existing practice, the risk is simply reproducing the status quo. What level of “expanded” imagination is desirable for the purpose of scientific progress? For example, should researchers working in domains that prioritize explanatory approaches be given recommendations informed by predictive modeling, as proposed by Hofman et al. (2021) in advocating for integrative modeling? Are there unified goals, constraints or features of analysis – ways of evaluating and thinking about the \(y\) that gives rise to \(K\) – that are independent of domain, or are we ultimately constrained to work only within a particular modeling paradigm because boundaries have historically existed? To what extent is explicit design toward achieving specific features/goals of scientific analysis and different standards of evidence (Hofman et al., 2021) useful in constraining the assistant?

For example, to what extent should the ideal pairing of the human and machine for data science seek to prioritize:

  • Substantive theory – Extent to which the \(K\) that is derived supports, validates and extends knowledge.
  • Reproducibility and transfer – Extent to which \(K\) is reproducible – for example, is it invariant under perturbations we do not believe should substantively change results, such as slight variations in how the assistant is called (see the sketch after this list)? Notions of replicability may also matter: when should a closely related analysis on similar datasets in the same domain produce an analogous result?
  • Transparency – Extent to which \(K\) can be easily understood and interrogated.
  • Coverage and generalizability – Extent to which \(K\) encompasses a narrow or wide range of settings (contexts, scales, etc.).
  • Expansiveness – Extent to which \(K\) extends imagination, or enables the analyst to push beyond the bounds of inherited modeling and data paradigms.
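
As a rough illustration of the reproducibility criterion above, the sketch below reruns a single (synthetic) analysis under perturbations we do not believe should matter (here, bootstrap resamples stand in for incidental variation such as how an assistant happens to be called) and checks whether the qualitative conclusion \(K\) survives.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200

# Synthetic data where x has a modest positive association with y.
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)

def analysis(xs, ys):
    """One analysis path: estimate the slope and report a qualitative K."""
    slope = np.polyfit(xs, ys, deg=1)[0]
    return "positive association" if slope > 0 else "no positive association"

# Perturbations that should not change the substantive conclusion.
conclusions = []
for _ in range(100):
    idx = rng.integers(0, n, size=n)         # bootstrap resample
    conclusions.append(analysis(x[idx], y[idx]))

stable = conclusions.count("positive association") / len(conclusions)
print(f"K = 'positive association' in {stable:.0%} of perturbed reruns")
```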

Different domains and use cases might naturally have different weightings on each of these features, leading to different modeling paradigms. Where do we find ourselves on the spectrum, and to what extent might we expect an intelligent assistant to be informed by knowledge of other points along it, so as to push analysts in a given domain outside of their comfort zone?

Conclusions

There are many more risks and benefits of the human-directed and automated extremes than those discussed above. However, several themes arise even from our partial treatment. One is that it seems unlikely that the optimal analysis approach is either fully human-directed (with the primary form of automation being calculation, as assumed above) or nearly fully automated, with the human providing only the high-level goal and, optionally, some data. Without any visibility into how \(K\) was produced, the human has no opportunity to apply knowledge that is not contained in the training data to debug operations chosen by the machine. They may not feel confident in their results, and their lack of insight into how the results were reached may prevent them from applying the knowledge that is output, raising the question of whether it is knowledge at all.

At the same time, a machine that can run many analyses simultaneously, including approaches the human may not know of or be familiar with, has the potential to result in much more informed interpretations of data. There are many blindspots in human analysis. Forms of model multiplicity (the fact that we can get the same model fit or performance from models that imply very different interpretations of a phenomenon) and sources of uncertainty (e.g., about how good our assumptions are) are routinely overlooked. In general, it is unlikely that either extreme will be sufficient, prompting a number of questions about how explicitly an analysis assistant should push a human analyst to think more expansively than they otherwise might. Key challenges lie in identifying when, how, and why to elicit human knowledge, so as to show the analyst what they may miss, and how to communicate results from operations that may not be human-interpretable in ways that humans can understand. In the current age of increasingly capable automated methods, we expect that addressing these challenges will indeed make the sum more than its parts.
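
To make the model-multiplicity point concrete, here is a small synthetic example in which two regressions fit the data almost equally well but attribute the outcome to different variables, so the same predictive performance supports conflicting interpretations.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000

# Two highly correlated predictors; the outcome depends on their shared component.
common = rng.normal(size=n)
x1 = common + 0.1 * rng.normal(size=n)
x2 = common + 0.1 * rng.normal(size=n)
y = common + 0.5 * rng.normal(size=n)

def r_squared(X, y):
    """Ordinary least squares fit quality for a given design matrix."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - resid.var() / y.var()

ones = np.ones(n)
r2_model_a = r_squared(np.column_stack([ones, x1]), y)   # "x1 explains performance"
r2_model_b = r_squared(np.column_stack([ones, x2]), y)   # "x2 explains performance"

print(f"model A (uses x1 only): R^2 = {r2_model_a:.3f}")
print(f"model B (uses x2 only): R^2 = {r2_model_b:.3f}")
# Nearly identical fit, yet the two models tell different stories about what matters.
```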

References

Hofman, J. M., Watts, D. J., Athey, S., et al. (2021). Integrating explanation and prediction in computational social science. Nature 595, 181–188. doi: 10.1038/s41586-021-03659-0

Hullman, J., and Gelman, A. (2021). Comparing human to automated statistics. Section 6 of Designing for interactive exploratory data analysis requires theories of graphical inference, Harvard Data Science Review 3 (3).

Hullman, J., Holtzman, A., and Gelman, A. (2023). Artificial intelligence and aesthetic judgment. http://stat.columbia.edu/~gelman/research/unpublished/AI_aesthetic_judgment.pdf

Keim, D., Andrienko, G., Fekete, JD., Görg, C., Kohlhammer, J., Melançon, G. (2008). Visual Analytics: Definition, Process, and Challenges. In: Kerren, A., Stasko, J.T., Fekete, JD., North, C. (eds) Information Visualization. Lecture Notes in Computer Science, vol 4950. Springer, Berlin, Heidelberg. doi: 10.1007/978-3-540-70956-5_7

Citation

BibTeX citation:
@online{archambault2024,
  author = {Archambault, Daniel and Beecham, Roger and Gelman, Andrew
    and Hullman, Jessica and Pos, Edwin},
  title = {Some Risks and Opportunities of Automated Data Analysis},
  date = {2024-06-25},
  url = {https://theory4ida.github.io/tmcf/posts/01-models/},
  langid = {en}
}
For attribution, please cite this work as:
Archambault, Daniel, Roger Beecham, Andrew Gelman, Jessica Hullman, and Edwin Pos. 2024. “Some Risks and Opportunities of Automated Data Analysis.” June 25, 2024. https://theory4ida.github.io/tmcf/posts/01-models/.