A belated Happy New Year to all!

Buggy is back for his first post of the year. This is also my first post in a while; put that down to a combination of conference season, project planning season and too much holiday cheer.

I recently had an exchange on Twitter with a colleague (and yes, before you ask, we are allowed to use social media at work) about where scientists could deposit their data on the web at the end of a study. I had a few suggestions (and, as usual, a few opinions) about how, where and why we should be depositing our data.

As science moves towards a more ‘open source’ philosophy, making data available as part of the publication process is becoming more common. Of course the taxonomists, systematists and gene-jockeys amongst us have been doing that for a while, using systems like NCBI’s GenBank. Where the revolution (if I could be so bold as to use that word) is coming is in the ecological sciences. Publishers and the broader scientific community increasingly expect that data will be made available online and in an accessible format. To accommodate this, a number of projects have been launched to give us places to publish data sets.

But why publish your data? In theory raw data was always available: you just needed to ask for it. In practice, people can refuse, move on or pass away; data can be lost, formats can change and software can go obsolete, all of which makes the reuse of data difficult. Publishing your data solves this problem.

Publishing your data also makes your work reproducible. With access to your data and your analysis code, anyone can repeat your work – or better yet, extend your work and gain new insight. In fact, in a great many fields your paper will not be published until you deposit the data (see here, for instance). I’d also argue that if your research is publicly funded you have an obligation to make your data available. Of course, that is, after you’re done with it!

So why don’t more of us share our data? Well, the biggest fear, of course, is that you might get ‘scooped’. That’s reasonable, but I think it’s unfounded, and here’s why: we expect that if someone wants to use our ideas, they will cite us. Otherwise it’s plagiarism (or at least bad manners), and there are ways to deal with that. So, extending that logic, it’s reasonable to expect that if someone wishes to use our data, they will cite us as well (and now you can even track those citations!).

I’d go further and state that the benefits of publishing data outweigh the pitfalls. From an ‘economic’ perspective we can gain professional currency in the form of citations (see here and here), which have value in application, tenure and promotion packages.

Professionally, publishing data can help you attract new collaborators and new research opportunities. It is one more way people can become aware of you and your work, and that awareness is important. There is the old saying that data without context is just noise. If your data can be applied elsewhere, only you as the collector can provide that additional insight into the specifics of your system. That insight can help to explain new results, but it can also lead to new hypotheses and collaborations with people you may never have otherwise interacted with (or who would have never read your paper).

Personally, I think that the potential for greater insight resulting from others ‘playing around’ with your data can only result in a deeper understanding of your own system. And really, isn’t that something we’re all after?

Below is a list of some places where you can publish your data. Do you have any other suggestions, or want to share your experiences with publishing data? Let me know in the comments.


(With thanks to Simon Bridge of Natural Resources Canada Canadian Forest Service for suggesting I write this up.)



Dryad

From their about page: “Dryad is both an international repository of data underlying peer-reviewed articles in the basic and applied biosciences, and a membership organization, governed by journals, publishers, scientific societies, and other stakeholders. Dryad welcomes data submissions related to published, or accepted, scholarly publications.”

The Ecological Society of America’s Data Registry  

A data repository for articles published in the ESA’s journals


A repository for phylogenetic trees and data

Or find a journal where you can publish the data as a digital appendix like I did here.


  1. davidshorthouse716 says:

    If folks have collections or taxonomic/ecological checklist data as major components of their manuscripts, Canadensys can archive and serve these on behalf of researchers in its repository, http://data.canadensys.net/ipt/. Similar to Dryad, we stamp DOIs on these so the link between reprint and data package can also be made. Unlike Dryad, we’ll work with researchers to ensure that the content of these submissions conforms to international data standards, which will result in greater opportunity for reuse.

  2. davidshorthouse716 says:

    Unfortunately, tracking citations of data via DOIs has not yet matured the way it has for scientific papers (but see DataCite, a more pertinent DOI Registration Agency, http://www.datacite.org/, that has a branch at the NRC, http://cisti-icist.nrc-cnrc.gc.ca/eng/services/cisti/datacite-canada/index.html). Scientific papers are well structured with their own literature cited sections, which publishers assiduously parse and send off to CrossRef. That’s how CrossRef’s “CitedBy” mechanism works. Someone will have to step up to do the same for synthesized data. This won’t be a technical problem – it’ll be a human effort problem. It will require that data depositors have the desire and facile tools to indicate whose data they incorporated into their study.