
Data gardening on the Bitcoin blockchain

Public perception of the permanence of data put onto blockchains remains confused. Enough misinformation exists on the internet to warrant clarifying some of the common misunderstandings. As product and innovation officer at TAAL Distributed Information Technologies Inc. (CSE:TAAL | FWB:9SQ1 | OTC: TAALF), a company building enterprise infrastructure solutions using Bitcoin, I’d like to address some of the commonly repeated claims about blockchain data in general, and also talk specifically about data permanence on the Bitcoin (BSV) blockchain itself.

First off, let’s dispel some of the most commonly shared beliefs about blockchain data:

  1. It is immutable
  2. There are no storage costs
  3. It is a public service

These are all stated as absolutes. In truth, each is only partially true: the reality is more subtle, and each represents a tradeoff between cost and convenience. Let’s go over them one at a time, while highlighting the “3 innovations of Bitcoin.”

It is immutable

When people hear this, they immediately think that the data is indestructible. This is the biggest misconception, and leads one to think that the data can never be deleted or lost, implying some sort of magical data permanence. In fact, all ‘immutable’ means is that the data cannot be changed without detection: any change to it will be noticed, and any tampering can be proven. In other words, it cannot be altered secretly. Immutability is not about the survival of the data itself but about its integrity, which is much more important. Storing data redundantly is useless if there is no way to ensure that the stored data is faithful to the original version.

The problem with most data storage systems today is that while they are very good at storing data redundantly, thereby reducing the risk of its permanent loss, there was no way (before the advent of Bitcoin) to ensure that the copies were valid and true copies of the original. Timestamps can be altered. Records can be changed. It all boils down to whom you can trust.

Do you trust your cloud storage provider? What if they were hacked? Bitcoin solved this problem by publishing a public record and timestamp of the data. Storing the data’s fingerprint on the blockchain is therefore sufficient to determine which copy is the original. Couple this with direct integration with a digital signature system[1], and all of a sudden changes to the data can be attributed to legal entities, which allows for legal accountability. That is the technical innovation of Bitcoin.
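The fingerprint check described here can be sketched in a few lines of Python. This is only an illustration: it assumes SHA-256 as the fingerprint function, and the sample documents and the helper names `fingerprint` and `is_faithful` are hypothetical. Publishing the fingerprint in a transaction is left out entirely.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_faithful(copy: bytes, published_fingerprint: str) -> bool:
    """Check a stored copy against the fingerprint that was published."""
    return fingerprint(copy) == published_fingerprint


# The fingerprint of the original is what would be timestamped publicly.
original = b"Quarterly report, final version"
published = fingerprint(original)

# Copies held anywhere else can be checked later without trusting the host.
faithful_copy = b"Quarterly report, final version"
tampered_copy = b"Quarterly report, final version."  # one extra character

print(is_faithful(faithful_copy, published))  # True
print(is_faithful(tampered_copy, published))  # False
```

The check says nothing about whether a copy still exists somewhere, only whether a given copy matches what was registered.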

There are no storage costs

While the previous misconception is by far the most widespread, the notion that storage is essentially free is the one most relevant to businesses that build on the blockchain, because it leads them down a path of misunderstanding and confusion.

The incorrect belief is that once stored, data is effectively stored forever and for free “by the network.” This is not the case. It may appear that way to users today because of current economics, but there is no reason to believe that it will remain so in the future. Those who falsely believe this think that the blockchain is a replacement for a service such as Dropbox, Google Drive or Flickr. This is far from the truth.

For one thing, data in the blockchain is public, and therefore even encrypted files are at risk. Any cryptographer will tell you that encryption is not infallible, and that most encryption schemes will be broken by brute force given enough time. When something is posted publicly, you can assume that the time a hacker has at their disposal to brute force an encrypted file is infinite. And thus, you can assume that your data will eventually be decrypted and compromised.

Secondly, to assume that there are no storage costs is to assume that you will be able to retrieve the information indefinitely from “the network.” This is not true. There is no “network”; there are only individual nodes, and those that store the blockchain.

Currently there are plenty of free-to-use tools[2] that allow storing, inspecting and ‘browsing’ data on the blockchain, and these can double as a data retrieval service. But eventually specialized businesses and services will become available which will serve up data stored on the blockchain at higher bandwidth for a fee. The one thing that we can always be assured of is that the free market ensures that if there is a demand, then the demand will be met, for a fee. Why don’t we have these services today? Simply because the storage burden of all the data on the blockchain has not yet become imposing enough to compel nodes (miners) on the network to prune the data of old, spent transactions from the blockchain. The time at which this starts to happen is economically driven and difficult to predict, as it is determined by the amount of data stored as well as the operational costs and margins of the businesses and entities running a node.

One thing is for certain: if you are not paying the node to keep your data, then you will have to be content with the fact that you may have to retrieve it from some slow, low-bandwidth source. There is one thing that my 14 years on Wall Street (and Robert A. Heinlein)[3] have taught me, and that is: “there ain’t no such thing as a free lunch.”

But developers may ask, is it even possible for data to be pruned from the blockchain? Wouldn’t that break the integrity of the blockchain? 

Of course it is possible. The details are out of scope for this article as they are technical, but briefly, they generally involve simply discarding old transactions which are spent and not keeping any unspendable outputs. At the end of the day, technically speaking, mining nodes do not need to keep any transactions after they have validated the block that contains them. They only need to keep enough information to validate new incoming transactions. Most people believe that this implies keeping an up-to-date UTXO (unspent transaction output) list, but even this requirement can be waived if wallets provide the outputs that they wish to spend along with a Merkle proof showing that they are valid outputs. Work on this is already in progress, with newer wallets implementing the proper SPV method of payment verification. Suffice it to say, mining nodes can take adequate measures to keep their storage requirements minimal. Eventually, just as the task of hashing was outsourced to hashing data centres, nodes will outsource blockchain storage to other actors whose role it will be to serve historical blockchain data. And this will be the beginning of retail “validated” storage services, which may indeed compete with Dropbox and Google Drive.
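To make the Merkle proof idea concrete, here is a minimal Python sketch of how a wallet could prove that a transaction belongs to a block whose Merkle root the verifier already knows. It follows Bitcoin's convention of double SHA-256 hashing and of duplicating the last hash on odd-sized levels, but the transaction IDs are made-up placeholders, the function names are hypothetical, and details such as the byte order used when displaying real transaction IDs are ignored.

```python
import hashlib


def dsha256(data: bytes) -> bytes:
    """Double SHA-256, the hash used for Bitcoin's Merkle tree."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def merkle_root_and_proof(leaves, index):
    """Return (root, proof) for the leaf at `index`.

    The proof is a list of (sibling_hash, sibling_is_left) pairs, one per
    tree level, ordered from the leaves up to the root.
    """
    level = list(leaves)
    proof = []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd levels
        sibling = index ^ 1          # neighbour within the same pair
        proof.append((level[sibling], sibling < index))
        level = [dsha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        index //= 2
    return level[0], proof


def verify_proof(leaf, proof, root):
    """Recompute the path from the leaf to the root and compare."""
    current = leaf
    for sibling, sibling_is_left in proof:
        if sibling_is_left:
            current = dsha256(sibling + current)
        else:
            current = dsha256(current + sibling)
    return current == root


# Placeholder transaction IDs standing in for the contents of one block.
txids = [dsha256(f"tx-{i}".encode()) for i in range(5)]
root, proof = merkle_root_and_proof(txids, 3)

print(verify_proof(txids[3], proof, root))  # True
print(verify_proof(txids[2], proof, root))  # False
```

The verifier needs only the root and a proof that grows with the logarithm of the number of transactions, which is why a node that has discarded the transactions themselves can still check the claim.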

While it is very likely that data will be available somewhere from someone, given that it is a public network, the economic tradeoff of not paying for historical data is that you will only be able to retrieve it slowly. As more and more valuable data is made available online, there will be a demand for high-bandwidth access to historical blockchain data. It is no longer a question of whether your data is there, but a question of how fast you can access it. If fast access is needed, then it will need to be specifically provisioned at a cost. Mining nodes of the network will focus only on the efficient validation, computation, and sequencing of transactions into blocks. This allows the invisible hand of the free market to step in and meet any remaining requirements for running the network infrastructure. This is Bitcoin’s economic innovation.

It is a public service

This is the classic collectivist belief that if there is a good service, it must be made free. This is simply not the case in Bitcoin. Bitcoin was designed as an economic system as much as a technical one, and it relies on economic incentives for all parties in order for the system to operate and remain secure. Blockchain storage isn’t a public storage facility. It will be a service in which economic players operate paid services in order to provide data retrieval.

Paying for data retrieval sounds alien to most people today, but that is because we have grown up in the age in which the internet was born and, shortly after, its first monetization model was created. In this model, which is arguably the biggest problem for personal privacy today, the internet was monetized by selling individual users’ data. You have likely heard many times that Facebook isn’t free: it sells your data in order to monetize (very, very well!) its platform.

In the same way, Google searches are not free, websites are not free, information is not free[4]. It may have been originally, back in the early days of the internet and the World Wide Web (do people still use that term these days?), but it ceased being free after platforms like Amazon, Google and AdSense used web cookies to allow data to be collected from your browser and passed back to the website you are viewing. It has not been free since the fall of CompuServe and AOL brought an end to the “subscription model” of the internet. Since then, the internet has encroached more and more on your privacy, and started to grab as much data as possible in order to categorize your habits, hobbies and interests, so that it could be sold to advertisers for lots and lots of money. An entire subject of study in computer science called “data science” was invented solely to train expert mathematicians and statisticians to analyze the petabytes of data that the world was providing internet platform providers for free. You became the product.

So no, surfing the web isn’t free. It hasn’t been for over 20 years. But what does this have to do with Bitcoin and blockchains? Well, one solution to this scourge of the Information Age is that, with Bitcoin, we could turn this model on its head.

We could, if we had the technology, have advertisers pay the data owners directly for collecting their data. We could have users of an internet search engine pay the operators for every search they conducted. We could have every website be paid for every page that it served to a web browser. We could even have every routed IP packet pay the router for the service of routing. If only this were possible, then we wouldn’t have internet data harvesting platforms like Facebook needing to steal our information. We wouldn’t need to tolerate advertisements on our websites cluttering our web browsing experience. But the problem in the past was the issue of micropayments. How much would a page load cost? How much should a web search cost? Or how much should viewing an ad pay? Certainly these amount to values of less than 1 cent, and how can you pay a router 0.0001 of a cent to route a packet?

The problem is that our current tools for electronic commerce, namely credit cards, make such micropayment trades impossible. The payment rails themselves cost more than the transaction amount. But in the future, perhaps very soon, people will understand the need for and benefit of this internet revolution in the making: the conversion from the model of theft and sale of data to the micropayment model of paying for data and services.
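A quick back-of-the-envelope calculation shows the scale of the mismatch. The fee schedule below (a fixed 30 cents plus 2.9 percent) is a hypothetical figure chosen for illustration, not a quote from any card network; only the 0.0001 of a cent packet price comes from the example above.

```python
from decimal import Decimal

# Hypothetical card fee schedule, for illustration only.
FIXED_FEE = Decimal("0.30")     # flat fee per transaction, in USD
PERCENT_FEE = Decimal("0.029")  # 2.9% of the amount


def card_fee(amount: Decimal) -> Decimal:
    """Fee charged on a single card payment under the assumed schedule."""
    return FIXED_FEE + amount * PERCENT_FEE


# 0.0001 of a cent, expressed in dollars.
packet_price = Decimal("0.000001")

fee = card_fee(packet_price)
overhead = fee / packet_price

print(f"payment: ${packet_price}, fee: ${fee}")
print(f"the fee is about {overhead:,.0f} times the payment")
```

Under these assumed numbers the fee is roughly 300,000 times the amount being paid, so any scheme with a meaningful fixed cost per payment rules out this kind of trade.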

Against this backdrop, it is clear that paying for the initial “storing” of data on the blockchain is just the cost of time-stamping and validating the data (registering the data). The cost of continued storage will be paid at the time you want to retrieve the data, and each time that you do. Businesses that store archival blockchain data will emerge and charge a fee to retrieve whatever data anyone wishes. This does not mean that the subscription model for data storage that Dropbox employs will be supplanted, only that it will not be the only way you can pay for data storage.

While your data will very likely always be stored somewhere, given that transactions are publicly broadcast, if you want any guarantees on how and when you can access it, then you may want to pay someone to guarantee serving it. Data access can be free and slow, or paid and fast. This is Bitcoin’s commercial innovation.

So to summarize:

1. Bitcoin data is deletable, and can be lost, if you aren’t paying someone to explicitly keep it. That said, because of the commercial incentives put in place by the 3rd innovation of Bitcoin, and the fact that all the data is public, there is a very high likelihood that someone out there will have your data, and will be more than willing to serve it to you, for a micropayment fee, in the future. When and for what price will depend on free market forces.

2. There are storage costs to data on the blockchain, and they are going to be paid by those who wish to keep the data around in the hope of being able to monetize serving it in the future micropayments economy. This service need not be provided by mining nodes, but can be.

3. The blockchain is a publicly available resource, but just because it is publicly available does not mean that data retrieval services are going to be without cost. There is no free public service, only a free public opportunity to build your business in the upcoming micropayment economy. Just because no businesses presently provide these data persistence guarantees does not mean there won’t be any in the future, for the free market allows for the specialization of services.

In short, Bitcoin is like having a big public yard which everyone can use for their own purposes. You can choose to let it grow over with weeds, leave your old broken washing machines and worn tires on it, or let the neighbor’s dog use it as a public toilet. Alternatively, you can use it to grow vegetables for food, or turn it into a garden where you grow flowers for sale and saplings into future Christmas trees. Or you can just prune hedges and shrubs, turn it into a beautiful hedge maze and charge people an entrance fee. It is up to you. But one thing is for certain: if you are using it to store trash, then don’t expect anyone to pay you to visit your corner of the yard!

Technical notes on pruning the blockchain:

a. If you are storing data in unspendable outputs on the blockchain (putting data after an OP_FALSE OP_RETURN), then do not expect mining nodes to have them for any period of time after the block has been confirmed. Mining nodes can prune these transactions immediately. In reality they may keep them around for a while, but this should not be the expectation.

b. If you are storing data in spendable outputs (OP_PUSHDATA), then you can expect mining nodes to have the data, but they are still under no obligation to serve it to you. Data serving is not the business of a node of the network. Technically speaking, nodes only need the information necessary to validate subsequent spends of the output, and an OP_PUSHDATA followed by an OP_DROP, while part of the unlocking script, isn’t required to validate the transaction. Some mining nodes may keep the data as part of an archive data service offering. Expect this type of storage to incur a higher than normal transaction fee, due to the fact that the network must store it perpetually (though nodes may still opt to outsource the storage).

c. If you are thinking of building a data business on blockchain, then you should plan on storing the data yourself or arranging a contract with a party who will store it for you, if you wish to be able to retrieve transactions from the blockchain. Plenty of free block explorers already run the infrastructure. Perhaps you should approach them with cash in hand and ask them to guarantee storage of your data. Tagging your data would be a good strategy in this situation. Refer to the Metanet protocol for one such indexing scheme.
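For readers who want to see what the two kinds of data-carrying outputs in notes (a) and (b) look like as bytes, here is a minimal Python sketch. The opcode values are the standard Bitcoin script opcodes, but the function names are hypothetical, the 20-byte public key hash is a dummy, and placing the data push and OP_DROP in front of a standard pay-to-public-key-hash script is just one assumed layout for the spendable case, not something the notes above prescribe.

```python
# Standard Bitcoin script opcode values.
OP_FALSE, OP_RETURN, OP_DROP = 0x00, 0x6A, 0x75
OP_DUP, OP_HASH160, OP_EQUALVERIFY, OP_CHECKSIG = 0x76, 0xA9, 0x88, 0xAC
OP_PUSHDATA1, OP_PUSHDATA2, OP_PUSHDATA4 = 0x4C, 0x4D, 0x4E


def push(data: bytes) -> bytes:
    """Encode a data push using the shortest push opcode that fits."""
    n = len(data)
    if n < OP_PUSHDATA1:  # up to 75 bytes: the length byte is the opcode
        return bytes([n]) + data
    if n <= 0xFF:
        return bytes([OP_PUSHDATA1, n]) + data
    if n <= 0xFFFF:
        return bytes([OP_PUSHDATA2]) + n.to_bytes(2, "little") + data
    return bytes([OP_PUSHDATA4]) + n.to_bytes(4, "little") + data


def unspendable_data_script(data: bytes) -> bytes:
    """Note (a): data placed after OP_FALSE OP_RETURN, prunable at once."""
    return bytes([OP_FALSE, OP_RETURN]) + push(data)


def spendable_data_script(data: bytes, pubkey_hash: bytes) -> bytes:
    """Note (b), assumed layout: <data> OP_DROP followed by a P2PKH script."""
    if len(pubkey_hash) != 20:
        raise ValueError("pubkey_hash must be 20 bytes")
    p2pkh = (bytes([OP_DUP, OP_HASH160]) + push(pubkey_hash)
             + bytes([OP_EQUALVERIFY, OP_CHECKSIG]))
    return push(data) + bytes([OP_DROP]) + p2pkh


payload = b"hello"
dummy_hash = bytes(20)  # placeholder, not a real key hash

print(unspendable_data_script(payload).hex())  # 006a0568656c6c6f
print(spendable_data_script(payload, dummy_hash).hex())
```

In the second script the pushed data is dropped from the stack straight away and plays no part in checking the signature, which is the sense in which it is not needed to validate a later spend.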

***

[1] This is why BTC with Segregated Witness, or the separation of the digital signatures from the blockchain, breaks this legal link to changes on the blockchain, thereby making BTC unable to provide this guarantee.

[2] Whatsonchain.com, blockchair.com, bitpost.app

[3] TANSTAAFL, The Moon Is a Harsh Mistress, Robert A. Heinlein, 1966

[4] Not in the sense of “free beer” as Stallman would be apt to say
