We’ve been working hard on a new implementation of the
dat command-line tool. You can try it out now (and remember to let us know if things break):
npm install -g dat
This is our first 1.0 release candidate, meaning we will work on stability and performance until we feel comfortable enough to release a stable 1.0 for long-term support. Please note that the alpha and beta versions of
dat are not compatible with these new versions, but this will be the last time that we choose to make backwards-incompatible changes.
We also recently published a post about the previous versions of the
dat CLI, and how the project has evolved, called “A Brief History of Dat.”
Dat Overview #
Dat is a command-line tool that we have been developing as an open source project with input from scientific research partners over the past two years. The pilot users have so far been in the fields of astronomy, bioinformatics, and neuroscience. These fields are becoming heavily data-based (e.g. digital imaging from telescopes and microscopes, DNA sequencing) and tend to share their research publicly on the web. We have been in a constant cycle of research and development during this period, and have rewritten the Dat CLI four times. Each rewrite incorporates the best modules and approaches from the last iteration and presents what we hope is a more straightforward workflow. We firmly believe this iterative process is the only way we could have arrived at the conclusions that led us to the features that we’re focusing on for the 1.0 release.
Dat has been redesigned with the following goals in mind:
1. Ensure datasets can be accessed even if the original data source becomes unavailable #
The key principle is to let anyone who has data share it with anyone who wants that data, in an automated and decentralized way. This makes it possible to create many copies of data, so that if the original provider disappears, other providers can take their place. We are pursuing partnerships with cloud storage companies, academic institutions, and internet freedom organizations to ensure there will be “super-sharer” nodes available, with lots of reliable storage and bandwidth. We will also be building an open-source registry service, funded by a generous grant from the John S. and James L. Knight Foundation.
2. Make available all versions of all datasets #
A drawback of existing peer-to-peer file sharing tools is that they lack the ability to dynamically update a dataset or access previous versions of a dataset. This is key for reproducibility of data, a huge issue in science. Every time a Dat repository is shared, Dat uses cryptographic fingerprinting of the files to ensure others will get exactly the same data. Peers that wish to host previous versions of data may do so at the cost of more storage space.
3. Fast dataset transfer speeds at reduced cost #
Downloads are frequently bandwidth-limited, either explicitly by the server or by the nature of clogged single-socket connections. We maximize transfer speeds by downloading in parallel from a swarm of nodes. Because we use Rabin fingerprinting (similar to rsync) Dat updates datasets by transferring only the data that changed. These two optimizations increase download bandwidth and reduces bandwidth costs, especially in cases where large datasets change only slightly between versions.
The new API is very simple. It comes with only two commands: one for sharing data, and one for downloading data.
Sharing data with
dat link #
dat link will create a unique hash for your data, and then serve it using our peer-to-peer network. This hash is generated from the contents of your files—if you change any files, the hash will change. In the future, we plan on creating feeds that will allow users to publish updates to old data, informing peers of newer versions, like Git.
$ dat link Creating share link for 19 files, 3 folders, 13.8 MB total Link: dat://a53d819bdf5c3496a2855df83daaac885686cac4b0bccfc580741b04898e3b32 Sharing data on port 3282, connected to 0/2 peers
Downloading data with
To download the data shared by someone else, you can simply type
dat <some-dat-link> and dat will traverse the peer to peer network, connect to anyone else that has a copy, and download the data directly from all of them simultaneously. File metadata and contents do not travel over any centralized servers—it flows directly from the peers to your computer. Peers are discovered using either DNS, the BitTorrent DHT, or WebRTC with the desktop app.
$ dat dat://a53d819bdf5c3496a2855df83daaac885686cac4b0bccfc580741b04898e3b32 Downloaded 1/1 files (4 MB/s, 13.8 MB total) Download complete, sharing data. Connected to 1/2 peers
We have built a variety of modules to support dat’s wide range of use cases.
- hyperdrive: The file sharing protocol dat uses to distribute files and data.
: This is the module that dat uses to find peers on the local network or the internet who are sharing the same files and data that you are.
- dat-desk: A cross-platform desktop application for managing local Dats
- dat-explorer: Explore a Dat link in the browser.
- datpy: Python client for Dat
Over the last couple of years we’ve come up with a long list of things we want to use Dat to build. Dat will be a stable building block that anyone can use for a variety of applications and use cases, such as: structured data like CSVs or database tables; reproducible code environments using containers; mounting filesystems to capture file changes; an open source file sharing app along the lines of Dropbox/BitTorrent Sync; deploying code and data to a compute cluster…the list goes on. We are excited to see what the community will build in the coming months.
It’s almost production-ready, please help us get it over the finish line!