With so many distributed systems disrupted the past few days. I thought it would be a good idea to expand on the topic as a series.

#hugops to all those #sysadmin who work on these things.

Explain Distributed Storage


Background Context

If you go around distribution storage talks, conferences, or articles, you may commonly hear the following two terms:

  • hot
  • cold

What makes it extremely confusing is that these terms are used inconsistently across technology platforms.

And due to the complete absence of an official definition: SSD, for example, can be considered "hot" in one, and "cold" in another.

Of which at this point, even experienced sysadmins who are unfamiliar with the terms would go...

Wait, I'm confused


So what the heck is Hot & Cold ???

With the lack of official definition, I would suggest the following ...

Data temperature terms (hot / warm / cold), are colourful terms used to describe data or storage hardware, in relative terms of each other, along a chosen set of performance criteria (Latency, IOPS, and/or Bandwidth) on the same platform (S3, Gluster, etc).

In general, the "hotter" it is, the "faster" it is

~ Eugene Cheah

Important to note: these temperature metrics are used as relative terms for a platform. The colorful analogy of red hot and cold blue is not interchangeable across technology platform

The following table describes in very inaccurate terms, which would be considered hot or cold for a specific platform technology. Along with common key criteria (besides cost, as that is always a criteria)

platform "hot" "cold" common key criteria
High-Speed Caching (For HPC computing) Computer RAM SSD Data Latency (<1ms)
File Servers SSD HDD IOPS
Backup Archive HDD Tape Bandwidth

Hence due to the relative difference of terms, and criteria. I re-emphasize, do not mix these terms across different platforms.

For example, a very confusing oddball : intel optane memory, which is designed specifically for high-speed low latency data transfer, and is slower in total bandwidth, when compared to SSD.

This makes it a good candidate for high capacity caching server. But "in most workloads", ill-suited for file servers, due to its lower bandwidth, and high price per gig, when compared to SSD.

If you're confused by optane, its alright, move on - almost everyone including me is.


Why not put everything into RAM or SSD?

After all, ain't you a huge "keep it simple stupid" supporter?

For datasets below 1 TB, you could and probably should for simplicity, choose a single storage medium and stick with it. As mentioned in the previous article in the series - Distributed systems do have performance overheads that only makes it worth it at "scale" (especially one with hybrid storage).

However, once you cross that line, hot and cold storage, slowly becomes a serious consideration, after all...

its all about the money

In general, the faster the storage is in 2019, the more expensive it gets. The following table shows the approximate raw material cost for the storage components without its required overheads (like motherboards, networking, etc)

Technology Reference Type Price per GB Price per TB Price per PB
ECC RAM 16 GB DDR3 $2.625 $2,625 $2.6 million
SATA SSD 512 GB SSD $0.10 $100 $100,000
Hard Drive 4TB 5900RPM $0.025 $25 $25,000
Tape Storage 6TB Uncompressed LTO 7 $0.01 $10 $10,000
"Reference Type" pricing for each technology is already intentionally chosen for lowest price per GB, during 2017-2018

When it comes to Terabyte or even Petabyte worth of data, the raw cost in ram sticks alone would be $2.6k per Terabyte, or $2.6 million per Petabyte.

This is even before an easy 10x+ multiplier for any needed equipment (eg. motherboard), infrastructure, redundancy, and replacements, electricity, and finally its manpower that cloud providers give "as a service".

And since I'm neither bill nor mark, its a price out of reach for me. Or most of us for that matter.

As a result: for practicality reasons, once the data set hits a certain scale, most sysadmins of distributed systems will use some combination of "hot" storage specced out to the required performance workload (which is extremely case by case). With a mix of colder storage, to cut down on storage cost.

In such a setup, the most actively used files, which would need high performance, would be in quick and fast hot storage. While the least actively used file would be in slow cheap cold storage.

Such mix of hot and cold storage can be done on both a single server, or (as per this series) on a much larger scale across multiple servers, or even multiple nations. Where one can take advantage of much cheaper servers in Germany or Frankfurt, where electricity and cooling is cheaper.

Hetzner Helsinki Datacenter

It is pretty cool, that uiliciosu cold storage is literally being cooled by snow

How is the data split across fast and slow then?

Distribution of data, can be performed either by the storage technology chosen itself, if supported. Such as GlusterFS, or ElasticSearch. Where it can be done automatically, based on its predefined configuration.

Cloud storage technologies, such as S3, similarly has the configuration to automatically migrate such storage into "colder" state, according to a predefined age rule.

Alternatively, the distribution of "hot & cold" storage could also be done within the application, instead of single platform technology. Which would let it have very fine grained control over as it moves its storage workload from one system or another according to its expected use case.

In such a case the term "hot or cold" would refer to how the application developers plan this out with their devops, and sysadmins. A common example would be store hot data in a network file server or even SQL database, with cold data inside S3 like blob storage.


Why do some articles define it (confusingly) by the age of data then?

Because this would represent a commonly, over-generalized use case. Nothing more, nor less.

Statistically speaking, and especially over a long trend line, your users are a lot more likely to read (or write) a recently created file. Then an old 5+ year file.

xkcd : old files

xkcd 1360 : Old Files

You can easily see this in practice if you have a really good internet connection. By randomly (for the first time in a long time) jumping back in facebook or google to open up images that are over 5+ years old. And compare the load times to your immediate news feed.

Though if you are unlucky, while everyone's "hot" working files are loading blazing fast, you can find that your year old "cold" files sadly gone for a few hours.

Which is unfortunately what happened to me recently during the Google outage ...

{% twitter 1105703004622094336 %}

Btw it has been long fixed - #hugops the hardworking google SRE's

While such rule of thumbs like "if older then 31 days, put into cold storage", is convenient...

Do not fall for an easy rule of thumbs (for distributed storage, or any technology)

The flaw is that it completely ignores a lot of the nuances that one should be looking into for ones use case.

Such as situations where you would want to migrate cold data to hot data. Or to keep it in hot data way longer than a simple 31 days period. Which is very workload specific.

Small nuances if not properly understood, and planned for, can come back to either bring services down to a crawl, when heavy "hot" data like traffic hits the "cold" data storage layer. Or worse, for cloud blob storage, an unexpected giant bill.

And perhaps the worse damage is done when: starting new learners down a confusing wrong path of learning.

I have met administrators were obsesses over configuring their hot/cold structure purely by days, without understanding their workload in details.


Ok, how about showing some example workloads then?

For example, uilicious.com heaviest file workload, consist of test scripts and results, where user writes and run 1000's of test scripts like these ...

// Lets go to dev.to
I.goTo("https://dev.to")

// Fill up search
I.fill("Search", "uilicious")
I.pressEnter()

// I should see myself or my co-founder
I.see("Shi Ling")
I.see("Eugene Cheah")

which churn out sharable test result like these ...

Uilicious Snippet dev.to test

... which then get shared publicly on the internet, or team issue tracker ๐Ÿ˜Ž


The resulting workload ends up being the following :

  • When a test result is viewed only once (when executed), it is very likely to be never viewed again.
  • When a test result is shared and viewed multiple times (>10) in a short period, it means it's being "shared" within a company or even social media. And will get several hundred or even thousands more requests soon after. In such a case, we would want to immediately upgrade to "hot" data storage. Even if the result was originally over a month+ old.
  • When used with our monitoring scheduling feature. Which automates the execution of scripts. Most people view such "monitoring" test results, only within the first 2 weeks. And never again soon after.

As a result, we make tweaks to glusterfs (our distributed file store) specifically to support such a workload. In particular to the logic in moving data from hot to cold, or vice visa.


Other notable examples of workloads are ...

  • Facebook like "On This Day", showcasing images from the archive in the past
  • Accounting and audit of company data, at the start of the year, for previous year data
  • Seasonal promotion content (New Year, Christmas, etc)

Because good half of workloads I have seen so far would benefit from unique configurations that would not make sense to other workloads. And even at times fly against the simple rule of thumb.

My personal advice is to stick to the concept of hot and cold, being fast data or slow data accordingly, allowing you to better visualize your workload. Instead of date creation time, which oversimplifies the concept.

A quick reminder of one of my favorite quotes of all time :

Everything should be made as simple as possible. But not simpler. ~ Albert Einstein

Alternatively : There is no silver bullet

In the end, it is important to understand your data workload, especially as you scale into the petabyte scale.


So why didn't we just call them Fast N Slow

And don't get me started on how much I hate the term "warm" ๐Ÿ˜ก

Oh damn, I so wish they did so. And we would be never needing this whole article.

However, while I could not trace the true history of the term (this would be an interesting topic on its own)

If you rewind back before the era of SSD, and to the predecessor to SSD storage for the early '20s.

I present one, bad ass HDD

I present, one, hot HDD

These high speed 15,000 RPM HDD are so blazing hot, they needed their own dedicated fans once you stack them side by side.

These hot drives would run multiple laps around the slower cheaper drives spinning at 5,400 RPM, which served as cold storage.

Hot data was literally, blazing hot spinning wheels of data.

one of the most expensive hot wheels

btw: these collectables go for $2,000 each

For random context, your car wheels go at around 1,400 RPM when speeding along 100 mp/h (160 km/h). So imagine 10x that - its a speed that you wouldn't want your fingers to be touching. (not that you can, all the hard-drive images you see online are lies, these things come shipped with covers)

The temperature terms are also used to describe different levels of backup sites, a business would run their server infrastructure with. With hot backup sites, having every machine live and running, waiting to take over when the main servers fail. To cold backup sites, which would have the power unplugged, or may even be empty room space for servers to be placed later.

So yea, sadly there are some good reason why the term caught on ๐Ÿ˜‚ (as much as I hate it)


So what hardware should I be using for hot and cold, for platform X?

Look at the platform guide, or alternatively, wait for part 3 of this series ๐Ÿ˜

In the meantime: if you are stuck deciding, take a step outside the developer realm, and play yourself a pop song as you think about it.


Happy Shipping ๐Ÿ––๐Ÿผ๐Ÿš€


This article is mirrored on dev.to
For comments and discussion please do so on dev.to instead