19 August 2014

Announcing the Dat Alpha

The first code went into dat one year ago, on August 17th 2013. Today, after a year of work, we are really excited to release the first major version of dat along with a new website.

Our overall goal with dat is to make a set of tools for creating and sharing streaming data pipelines: a sort of ETL-style system, but designed from the ground up to be developer-friendly, open source and streaming. We are aligned with the goals of the frictionless data initiative and see dat as an important tool for sharing data wrangling, munging and clean-up code so that data consumers can simply dat clone to get good data.

The first six months of dat development were spent making a prototype (thanks to the Knight Foundation Prototype Fund). In April of this year we were able to expand the team working on dat from one person to three, thanks to support from the Sloan Foundation. At that time dat also became an official US Open Data Institute project, to ensure that open data remains a top priority going forward.

Sloan’s proposition was that they liked the initial dat prototype but wanted to see scientific data use cases treated as the top priority. As a result we expanded the scope of the project from its tabular-data-specific beginnings and have focused on adding features that will help us work with larger scientific datasets.

Up until this point, the dat API has been in flux, as we were constantly iterating on it. From this point forward we will be taking backwards compatibility much more seriously, so that third party developers can feel confident building on top of dat.

How to get involved

Try it out

You can install dat today and play around with it by importing or cloning a dataset.

You can also click this button to deploy a dat to Heroku for testing purposes for free (but be aware of the Heroku ephemeral filesystem limitations):

The dat REST API comes bundled with the dat-editor web application.

To start learning about how to use dat please read our getting started guide.

To help you choose an approach to loading data into dat we have created a data importing guide.

Write a module or 5

The benefit of dat isn’t in the dat module itself, but rather in the ecosystem that can be built around it.

There are a lot of modules that we think would be really awesome to have, and we started a wishlist here. If you see something you are interested in building, please leave a comment on that thread stating your intent. Similarly, if there is a format or storage backend that you would like to see dat support, leave it in the comments.

Pilot users

This release of dat represents our efforts to get it to a point where we can start working with scientists on modeling their data workflows with dat. We will now be starting concrete work on these pilot use cases.

If you have a use case in mind and you want to bounce it off us, please open an issue on the maxogden/dat repository with a detailed description.

While we don’t have many details to share today about these pilots, we hope to change that over the next few months.

Bionode (Bioinformatics -- DNA)

Dat core team member @bmpvieira, a Bioinformatics PhD student at Queen Mary University in London, is working on applying dat to a variety of DNA analysis datasets.

Bruno runs the Bionode project. We will be working on integrating Bionode with dat workflows to solve common problems in DNA bioinformatics research.

RNA-Seq (Bioinformatics -- RNA)

Two researchers from UC-San Diego reached out to us recently and have started explaining their use case here and here. We hope to use dat to make their data management problems go away.

Sloan Digital Sky Survey (Astronomy)

We will be working with the SDSS project to share their large scans of the visible universe, and eventually connect their data with sky survey data from other organizations.

The future of dat

This release is the first step towards our goal of creating a streaming interface between every database or file storage backend in the world. We are trying to solve hard problems the right way. This is a process that takes a lot of time.

In the future we would also like to work on a way to easily host and share datasets online. We envision a sort of data package registry, similar to npmjs.org, but designed with datasets in mind. This kind of project could also eventually turn into a sort of "GitHub for data".

We also want to hook dat up to P2P networks, both to make downloads faster and to make datasets more permanent. Dat advisor Juan Benet is now working on IPFS, which we are excited to hook up to dat when it is ready.

Certain datasets are simply too large to share, so we also expect to work on a distributed computation layer on top of dat in the future (similar to the ESGF project).

You can help us discuss these high level future ideas on this issue.

To keep up to date with the dat project you can follow @dat_project on Twitter or watch the repo on GitHub.