Dynamic Control vs. Causal Inference

A bunch of people had really good comments on the “No more RCTs” post, and I think I owe it to their good-faith engagement to flesh out what was a very barebones argument. Before I dive in I want to be very clear that there are other, much more in-depth formal treatments of these issues (e.g. Dean Eckles, here), so I’m not claiming to be doing anything new, just trying to get some clarity for myself and think about it in the context of the field I care most about, which is science-of-science.

I’m going to start with some basic terminology and notation definitions, but I’ll label things with Section headers so you can skip those parts if you’re familiar. Where we’re headed is a more clearly defined version of the original hypothesis: I think people are spending too much time on causal inference relative to working on dynamic control.

RL basics

Okay, if you already know all about reinforcement learning (RL) / Markov decision process terminology, feel free to skip this bit. And if you want the actual, formal treatment without what are sure to be mistakes on my part, Sutton/Barto is the way to go. I’m going to be talking in terms of RL because I’ve worked with it the most relative to other methods, but in practice you’ll see lots of techniques that are not machine-learning based, e.g. Linear Programming, Dynamic Programming, Mixed Integer Programming, Genetic Algorithms, and I’m sure a bunch of stuff I don’t know about. The important thing is to set some notation for talking about dynamic control problems, so:

So basically speaking, we’re going to have a State, an Action, and a Reward: {S,A,R}. RL is dynamic control: dynamic because we proceed through timesteps, and what we do at timestep t has bearing on the state, and therefore on our decision at timestep t+1; control because we’re concerned with choosing the action, at each timestep, that maximizes rewards (more accurately, we want the trajectory of actions over all timesteps in the episode, or over an infinite horizon with discounting, that maximizes the discounted sum of rewards).

The State is usually a vector of relevant contextual features about the environment. In a game of chess, this would be some representation of where all the pieces are on the board. The Action can be any vector representing any number of actions, discrete or continuous, so in chess you’d have as many possible actions at each timestep as there are legal moves across all of your pieces. And the Reward is a bit tricky, but in chess canonically it would just be 0 for a loss and 1 for a win — of course people spend a lot of time and energy figuring out ways to look at more immediate rewards, so we can learn faster, but that’s for later.

Generally, you also have a State transition function, which a lot of the time is going to be unknown to whatever algorithm you use. It describes the specific dynamics, whether deterministic or stochastic, that govern how you get from one State to the next given your Action, or more formally, S_t+1 = f(S_t, A_t) in the deterministic case and S_t+1 ~ P(· | S_t, A_t) in the stochastic case.
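To make the loop concrete, here’s a minimal Python sketch of the {S,A,R} machinery with a toy deterministic transition function. Everything in it (the transition rule, the reward, the horizon) is an illustrative assumption, not anything from a real problem.

```python
import random

def transition(state, action):
    # S_{t+1} = f(S_t, A_t): here, just a toy deterministic rule.
    return state + action

def reward(state, action):
    # Toy reward: we want the state to stay near zero.
    return -abs(state + action)

def run_episode(policy, horizon=100, gamma=0.99):
    state, discounted_return = 0, 0.0
    for t in range(horizon):
        action = policy(state)                      # choose A_t from S_t
        discounted_return += (gamma ** t) * reward(state, action)
        state = transition(state, action)           # move to S_{t+1}
    return discounted_return

# A dumb random policy, just to show the interface.
print(run_episode(lambda s: random.choice([-1, 0, 1])))
```

Swap the deterministic transition for a stochastic one and you have the general MDP setting.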

A Policy in this context has a specific meaning — using the Sutton/Barto definition (p.13) it’s a “decision-making rule,” usually denoted π (pi), and usually used to talk about π(a|s), or in other words the “probability of taking action a in state s under stochastic policy π.” Basically, you can think of it as a mapping from state to action — this’ll be the output of your RL algorithm, a policy to control actions.
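As a concrete (and entirely made-up) example of π(a|s), here’s a tiny stochastic policy for the toy environment sketched above; the action set and the probabilities are arbitrary choices for illustration.

```python
import random

ACTIONS = [-1, 0, 1]

def pi(state):
    # If the state is positive, prefer the action that pulls it back down,
    # and vice versa; the exact probabilities are arbitrary.
    if state > 0:
        weights = [0.7, 0.2, 0.1]
    elif state < 0:
        weights = [0.1, 0.2, 0.7]
    else:
        weights = [1 / 3, 1 / 3, 1 / 3]
    return random.choices(ACTIONS, weights=weights, k=1)[0]

# Usage: plug it straight into the run_episode sketch above.
print([pi(s) for s in (-2, 0, 3)])
```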

Bringing in Causal Graphs

Okay so probably you can already imagine how all this relates to causal inference, but the easiest bridge in my mind is via causal graphs. Generally speaking, you can choose to represent the MDP notation above just as well with a causal graph.

In some joint work with people much smarter than me, we take a few canonical toy problems and show how they look as causal graphs (we talk about ‘factored graphs’ in the paper because some reviewers got cranky about terminology, but they’re causal graphs). This is not at all the only work to do this; it’s not special, just easy for me to reference. So anyway, you can take problems like Bin Packing, or Newsvendor, and show how they look with a causal graph, like so:
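To make that concrete in text, here’s a rough sketch of a stripped-down Newsvendor graph as plain data; the node names and edges are my own illustrative choices, not the ones from the paper or its figure.

```python
# Nodes are variables; edges point from cause to effect.
newsvendor_graph = {
    "Order Quantity (Action)": ["Inventory"],
    "Customer Demand (State)": ["Units Sold"],
    "Inventory": ["Units Sold", "Holding Cost"],
    "Units Sold": ["Revenue"],
    "Revenue": ["Profit (Reward)"],
    "Holding Cost": ["Profit (Reward)"],
}

# A causal-inference question is a question about some edge (or path) in this
# graph; a control question is about choosing the Action node's value to
# maximize the Reward node, timestep after timestep.
for cause, effects in newsvendor_graph.items():
    for effect in effects:
        print(f"{cause} -> {effect}")
```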

And now you see it’s only a hop, skip, and a jump over to whatever causal inference method you like, on any part of this graph. DAGs don’t require you to be in Pearl’s world: go wild with diff-in-diffs, RDDs, whatever. You can always translate back and forth to a graph, and you can usually contextualize a problem you’re attacking with causal inference inside a larger MDP graph that’s really a dynamic control problem.

Getting to the Point

So to give an example with the Newsvendor problem above, one obvious extension here is that you might want to forecast customer demand, so imagine we also have a bunch of nodes pointing into “Customer Demand” as State variables. Like, I don’t know, we’re trying to buy Michael Jordan Jerseys, and so we want to figure out the demand for Michael Jordan Jerseys, and there’s a node going into demand that’s like… “Release of The Last Dance.”

If you’re just doing pure Reinforcement Learning, you don’t really care one whit about causal inference: you just toss that new variable into the State vector, and off you go.
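A minimal sketch of what “toss it into the State vector” looks like in code; the feature names and the 0/1 encoding of the documentary’s release are hypothetical.

```python
def build_state(inventory, recent_demand, last_dance_released):
    # The RL agent doesn't need a causal estimate of the release's effect;
    # it just sees the flag as one more state feature and learns to condition
    # its ordering decisions on it.
    return [
        float(inventory),
        float(recent_demand),
        1.0 if last_dance_released else 0.0,
    ]

print(build_state(inventory=120, recent_demand=85, last_dance_released=True))
```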

But if your whole toolkit comes from causal inference methods, what I see a lot of people doing is caring only about estimating the causal effect of “Release of The Last Dance” on “Customer Demand for Michael Jordan Jerseys.” And you could of course do this in any number of ways, pick your favorite method — I was picking on RCTs, but obviously lots of people like diff-in-diffs, or maybe you’re well set up for an RDD, or whatever.

(One important side note is that a fair number of RCTs are actually about testing some Policy π_1 against π_2 — but a lot of the time these are crude, inflexible/brittle π that are completely state-independent. I do think there’s a solid use-case for RCTs comparing π_1 against π_2, but I don’t really see people spending a lot of effort making sure either π_1 or π_2 is optimal. Like the Chetty paper testing a 4-week or 6-week deadline for peer reviews — why are we not treating the deadline weeks as an Action and trying to optimize it? Even if we’re going to ignore State, it’s at minimum a bandit problem to find the optimal number of weeks.)
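Just to show how small that bandit problem is, here’s an epsilon-greedy sketch over a handful of candidate deadlines. The candidate arms, the 0/1 reward (say, “review returned on time and usable”), and the true success rates used only to simulate feedback are all hypothetical.

```python
import random

ARMS = [4, 5, 6, 8]                                  # candidate deadlines in weeks
TRUE_RATES = {4: 0.55, 5: 0.62, 6: 0.60, 8: 0.50}    # unknown to the agent

counts = {a: 0 for a in ARMS}
values = {a: 0.0 for a in ARMS}                      # running mean reward per arm

def choose(epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(ARMS)                   # explore
    return max(ARMS, key=lambda a: values[a])        # exploit

for t in range(2000):
    arm = choose()
    reward = 1.0 if random.random() < TRUE_RATES[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean

print({a: round(values[a], 3) for a in ARMS})
```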

My main point is this: how much should we actually care about getting a really well-identified causal estimate of the effect of “Release of The Last Dance” on “Customer Demand for Michael Jordan Jerseys,” as opposed to spending research effort on developing a really good Policy π(a|s) for the Action “Buy Michael Jordan Jerseys” to maximize the Reward “Profit”?

Or, in the context of a problem I care much more about, how much should we care about getting a really well-identified causal estimate of the effect of “author-reviewer homophily” on “reviewer score” (which, remember, is just one really small piece of what is probably a very big causal graph in a {S,A,R} MDP formulation) instead of working on a dynamic control Policy π(a|s) for the Action “Accept/Reject Paper to Journal/Conference/Grant-Funding” to maximize the Reward “Scientific Impact”?

Where I’m Not Sure

So this is the part I’m not sure about — when are causal inference methods actually the best use of limited research time and resources, and when should dynamic control methods be the focus of that limited time and those resources? My guess, my intuition, is that you really want both, but right now I see the research community, at least in science-of-science, doing something like 99% causal inference, and if there’s a 1% doing dynamic control, I’ve never met them / they’re computer scientists.

If you read through the paper I linked above, you’ll see that the mere fact of knowing the directional relationships of the causal graph helps you train a Policy faster. And I think there are some massive complications when we get to talking about science-of-science policies (or Policies), because as opposed to something quite well-known and well-studied like Newsvendor, I don’t think we know very much about the relevant State, let alone the State transition function. That is to say, it would be hard to bring many practical dynamic control algorithms to bear, because we’d lack an accurate simulator, and I think you’d have to rely a lot on just the data + domain knowledge, which can be scary (or, more relevantly, can lead you to implement bad Policies).

Probably most central, and the part where I really would not be able to do the math without a lot of help, is that I think we’re basically talking about sample efficiency here. Which is better — to spend 100 timesteps in the mode of an RCT to obtain an estimate between two nodes and then update your Policy at the 101st timestep, or to spend those same 100 timesteps actively trying to improve your Policy at every timestep?
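Here’s a toy simulation of that tradeoff on a two-armed bandit: strategy A randomizes 50/50 for 100 timesteps and then commits to the better-looking arm, while strategy B runs epsilon-greedy from the first timestep. The arms, their success rates, the horizon, and the epsilon are all made-up assumptions, and which strategy comes out ahead depends on them.

```python
import random

RATES = [0.50, 0.60]          # true (unknown) reward probabilities per arm
HORIZON = 300

def pull(arm):
    return 1.0 if random.random() < RATES[arm] else 0.0

def rct_then_commit():
    total, wins = 0.0, [0.0, 0.0]
    for t in range(HORIZON):
        if t < 100:
            arm = t % 2                               # uniform randomization
        else:
            arm = 0 if wins[0] >= wins[1] else 1      # commit to the winner
        r = pull(arm)
        total += r
        if t < 100:
            wins[arm] += r
    return total

def learn_every_step(epsilon=0.1):
    total, counts, values = 0.0, [0, 0], [0.0, 0.0]
    for t in range(HORIZON):
        if random.random() < epsilon:
            arm = random.randrange(2)                 # explore
        else:
            arm = 0 if values[0] >= values[1] else 1  # exploit current estimate
        r = pull(arm)
        total += r
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]
    return total

n = 500   # average over many runs to smooth out noise
print("RCT then commit:", sum(rct_then_commit() for _ in range(n)) / n)
print("Learn every step:", sum(learn_every_step() for _ in range(n)) / n)
```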

And of course, there are a number of other complications there, both theoretical and practical. For one thing, we often don’t face such a strict choice — why not both RCTs and dynamic control? Moreover, a lot of econometrics today makes use of historical (observational) data — who’s to say we shouldn’t use such data as a starting point, with causal inference methods to sharpen our estimates of the State transition function, and then use our preferred dynamic control algorithm, which will perform much better because we now have a nice simulator or the right assumptions or whatever? Also, people mostly use citations or patents as a proxy for impact, and if we want to do control, we really are going to want some type of nearer-term reward signal. And then practically, as Tom Wollmann pointed out when I was complaining about this stuff a while back, there are quite a lot of instances where you don’t have anything close to a nice enough setup to adjust your Policy on the fly. E.g., if you’re giving out cash transfers by physically writing checks, it’s kind of hard to change the amounts day by day (although it’s much easier if you’re doing the transactions programmatically through Mpesa or something).

Closing Thoughts

  1. A large chunk of the problems we care about in science-of-science can be cast as dynamic control problems, and I think keeping the larger {S,A,R} context in mind is really critical for researchers even if you’re focusing solely on a causal estimate between two state variables.
  2. I suspect we want researchers working on both causal inference methods and dynamic control algorithms in the science-of-science field, but right now it’s weighted really heavily toward causal inference, and that’s a problem because:
  3. Decision-makers have to make decisions every day anyway (on funding, on mentorship matching, on conference reviewer assignment, etc. etc. etc.), and they are already implementing what I think are probably really suboptimal dynamic control Policies on their own, given that those Policies are usually informal, human-driven, manual heuristics that often don’t even take the causal evidence into account and aren’t optimizing for anything.

No more RCTs

I mean not really, but…

Here’s what I worry about. Pierre Azoulay is one of the smartest people I’ve ever met, and he can’t convince the NIH to do RCTs so we can discover better ways of allocating research funding. But more than that, let’s say Pierre was successful, the NIH saw the light, and decided to run a new RCT with its funding model every single year — would that actually maximize research output more than if we just gave the NIH some funding algorithms directly?

For example, one year you might play around with reviewer blinding. The next year maybe you swap around reviewers according to homophily with each other. Then homophily with the PIs. Then research similarity with the PIs. Then the number of reviewers. Then the scoring system. Then the scoring system again. Etc., etc. Each time, let’s say for the sake of argument, we get the results of the RCT back within the year, get a nice causal estimate of the effect of X on Y, and then update how we do things next year.

And my worry is, isn’t that way, way, way too slow? Why wouldn’t we instead spend all that research effort just coming up with funding control algorithms, handing them off to NIH managers, and letting competition sort out the best algorithm in practice? Shouldn’t we just hand algorithmic control of the relevant action levers to an optimization agent, and not worry so much about perfect identification strategies to isolate the causal effect of X on Y, when we have maybe hundreds of Xs to sort through? Isn’t every single funding decision made under the previous policy, by definition, worse in expectation than one made under the state-of-the-art policy? Why would we wait a year for a robust causal estimate instead of updating our funding policy the second we get back enough data to shift the policy decision?

Anyway, these are the things that keep me up at night. If anyone knows of proper papers that have worked all this out formally (i.e. under what conditions it makes sense to expend limited research effort on precise causal estimates rather than on direct control/optimization algorithms), do let me know.