Here I want to briefly sketch out a vision for how to solve a key set of problems facing science-of-science researchers, using the relatively new idea of a ‘data trust.’
In my ideal world, science-of-science researchers would have access to a data utopia — a giant longitudinal dataset encompassing individual and project-level data on the end-to-end scientific process for researchers worldwide, which is technically secure and safeguards privacy.
I think if you’re one of the five people reading this, the value of such a dataset is self-evident. But to be explicit: I think it would be a large advance in our ability to ask and answer questions about the nature of the scientific process. Answering those questions should ultimately help researchers reach their potential as they adopt better tools, methods, and processes, resulting in faster scientific progress to the benefit of us all!
Okay so to be more specific, here are the problems that I think such a data utopia would need to solve:
1. Safeguarding privacy of the contributing individuals. This is goal number one; none of us wants to live in a world filled with surveillance.
   - Individuals need to be able to specify what data they’re comfortable sharing with science-of-science researchers, if any.
   - Individuals need to be able to specify how that data is to be used (e.g. for public research only, or for private research conditional on payment).
   - Individuals need to be able to specify who is able to access and analyze the data.
2. Safeguarding the technical security of the data. All of (1) is for naught if the database is insecure.
3. Unify individual-level and project-level data horizontally, across steps of the scientific process. That is, we would want to see a given project as it advances from the idea stage, to funding, to team formation, to experiments/modelling, to writing up the results, to presenting at conferences, to publication. Obviously these steps will look different for every project, but the point is we want a view of the same project over time, from start to finish.
4. Unify individual-level and project-level data vertically, across disparate tools. That is, say Person A uses Microsoft Word to write their research paper, Person B uses LaTeX Studio, and Person C uses Overleaf. These are all in the same stage of the scientific process (writing up the results), but right now there is no standardized metadata format to compare the writing done with each tool. So, we would need some standardization in the way these metadata are captured. The same goes for, say, conference presentations: we should be able to represent a paper presentation at NeurIPS or at NBER in a standardized way, even though they’re very different conferences serving very different sets of researchers.
So that sounds like a pipe dream.
But, I’m optimistic that these are solvable problems, and that some form of a data trust can solve them.
What is a data trust?
I don’t really know! I’m only just learning about them. My high-level understanding is this: it is a professionally managed legal organization which mediates the relationship between individuals who can provide their own data and third parties (say, academic researchers) who want to analyze that data. Critically, my understanding is that the professional managers of the data trust have a legal duty to the contributing individuals to act on their behalf, with their best interests in mind.
In practice, I think what that means is something like this: someone like the Sloan Foundation funds the creation of a new Science-of-Science Data Trust, whose explicit legally mandated purpose is to advance science-of-science research on behalf of anyone contributing data to the Trust.
The managers of the Trust go out to universities, national labs, etc., and try to convince some researchers to set up a data pipeline into the Trust (I’m sure they also need to negotiate with the university/lab lawyers; seems easy!). At this step, and through some always-available mechanism (say, a web interface), the individual researchers contributing data would be able to specify their privacy requirements as in (1) above: what data they’re willing to share, with whom, and for what purposes it may be used. These data pipelines would then send structured data about the contributing researchers’ scientific process to databases managed by the Trust.
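To make the "what, with whom, and for what purpose" idea a bit more concrete, here is a minimal sketch of how one contributor's preferences might be recorded and checked. Everything in it is my own assumption for illustration: the `ConsentPolicy` and `AccessRequest` types, the field names, and the two purposes are hypothetical, not anything an existing data trust defines.

```python
from dataclasses import dataclass
from enum import Enum


class Purpose(Enum):
    # Hypothetical purposes, mirroring the examples in the list above.
    PUBLIC_RESEARCH = "public_research"
    PRIVATE_RESEARCH_PAID = "private_research_paid"


@dataclass(frozen=True)
class ConsentPolicy:
    """One contributor's sharing preferences (all field names are hypothetical)."""
    contributor_id: str
    shared_data_types: frozenset = frozenset()   # what data may be shared
    allowed_purposes: frozenset = frozenset()    # how it may be used
    allowed_requesters: frozenset = frozenset()  # who may access it


@dataclass(frozen=True)
class AccessRequest:
    requester: str
    data_type: str
    purpose: Purpose


def is_permitted(policy: ConsentPolicy, request: AccessRequest) -> bool:
    """Default-deny: every one of the three conditions must be explicitly granted."""
    return (
        request.data_type in policy.shared_data_types
        and request.purpose in policy.allowed_purposes
        and request.requester in policy.allowed_requesters
    )
```

The one design choice I'd want to keep from this sketch is the default: a contributor who has specified nothing shares nothing.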
Presumably, the Trust would develop metadata standards so that we can do things like in (4) and compare data across disparate tools. They might need to go talk to e.g. Overleaf, LaTeX Studio, the NeurIPS organizers, or the NBER organizers so that, conditional on individuals’ permission, the Trust can get standardized data through its pipelines from these third parties (whether originating from the individual’s computing resources, their lab’s shared resources, or in fact the third party’s servers).
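As a rough sketch of what such a metadata standard could look like, here is one tool-agnostic record type plus two small adapters that map tool-specific payloads onto it. To be clear, the `ProcessEvent` schema is hypothetical, and the input dictionaries are made-up shapes, not the real export formats of Overleaf or Microsoft Word.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProcessEvent:
    """A tool-agnostic record of one step in a project (hypothetical schema)."""
    project_id: str
    contributor_id: str
    stage: str        # e.g. "writing" or "presentation"
    source: str       # the tool or venue the event came from
    timestamp: datetime


def from_overleaf(raw: dict) -> ProcessEvent:
    # Assumed payload shape for illustration; not Overleaf's real format.
    return ProcessEvent(
        project_id=raw["project"],
        contributor_id=raw["user"],
        stage="writing",
        source="overleaf",
        timestamp=datetime.fromtimestamp(raw["edited_at_unix"], tz=timezone.utc),
    )


def from_word(raw: dict) -> ProcessEvent:
    # Assumed payload shape for illustration; not Microsoft Word's real format.
    return ProcessEvent(
        project_id=raw["ProjectId"],
        contributor_id=raw["Author"],
        stage="writing",
        source="word",
        timestamp=datetime.fromisoformat(raw["LastSaved"]),
    )


def project_timeline(events, project_id):
    """Return one project's events in chronological order."""
    return sorted(
        (e for e in events if e.project_id == project_id),
        key=lambda e: e.timestamp,
    )
```

Once events from different tools share one shape, the same records also give the start-to-finish view of a project: `project_timeline` is just a filter and a sort.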
With all this data assembled from contributing researchers, the Trust would run some vetting process for academic researchers, taking proposals for doing research with these data. Maybe there’s some mechanism where these proposals are also vetted by the trustees. Researchers whose proposals pass would then get access to the data and be allowed to publish research papers based on it.
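One way to picture that vetting process is as a small state machine. The sketch below is purely illustrative: the status names and the `Proposal` class are hypothetical, and it assumes the trustee review is a required second step, which is only a "maybe" in my description above.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Status(Enum):
    SUBMITTED = auto()
    STAFF_VETTED = auto()
    TRUSTEE_APPROVED = auto()
    REJECTED = auto()


# Allowed transitions in this hypothetical review workflow.
TRANSITIONS = {
    Status.SUBMITTED: {Status.STAFF_VETTED, Status.REJECTED},
    Status.STAFF_VETTED: {Status.TRUSTEE_APPROVED, Status.REJECTED},
    Status.TRUSTEE_APPROVED: set(),
    Status.REJECTED: set(),
}


@dataclass
class Proposal:
    title: str
    requester: str
    status: Status = Status.SUBMITTED

    def advance(self, new_status: Status) -> None:
        """Move to new_status, refusing any step the workflow does not allow."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(
                f"cannot go from {self.status.name} to {new_status.name}"
            )
        self.status = new_status

    @property
    def grants_access(self) -> bool:
        # Data access only after the final approval step.
        return self.status is Status.TRUSTEE_APPROVED
```

A proposal cannot skip straight from submission to approval here, so no one gets data without passing both reviews.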
What’s Next?
Well for me, first I need to read everything I can about data trusts, because as outlined above, I don’t really know very much. Maybe what I’m proposing is not at all compatible with the idea of data trusts, and reading more will set me right. Second, if in my reading data trusts still seem like they could solve these problems, I need to talk to a couple of sets of people: (1) any of the folks who are piloting data trusts in the wild right now, because I’m sure their practical experience will differ a lot from what I find in just reading about it, and (2) some folks at e.g. the Sloan Foundation, to see if anyone has interest in funding such a thing. Then, I don’t know, let’s see if we can get a few institutions to buy in, fund a pilot, and try it out!
Many chances for failure, and I’m quite open to being wrong about any or all of this, but this has been in the back of my mind for a year or two now and nothing in my brain has yet realized why the idea is stupid and bad, so I’m taking that as a signal that it’s worth a try. Will write up more as I learn more, and if you read this and have useful direction, whether that’s stuff to read, people to talk to, or reasons why this will all fail, please do let me know!