The cloud offers many services for storing data, which can be a double-edged sword. In this blog post I look at some of the considerations that should be worked through before any data is moved into a cloud environment.
Know Your Data
The amount of data that you have may be staggering, as humanity is producing more and more data every day. Seagate estimated that our data production is now in the zettabytes (see: https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf ). Growing up I used a PC with a 10 gigabyte hard disk and floppy disks that were measured in megabytes. Having external media in the hundreds of gigabytes / terabytes still impresses me, so zettabytes' worth of data leaves me speechless.
In case you're trying to imagine that size – byte, kilobyte (1,000 bytes), megabyte (1,000,000 bytes), gigabyte (1,000,000,000 bytes)…then terabytes, petabytes, exabytes and finally zettabytes, which is 1,000,000,000,000,000,000,000 bytes, or a trillion gigabytes (thanks, Cisco).
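The ladder of decimal (SI) units above can be sketched in a few lines of code – each step is a factor of 1,000:

```python
# Decimal (SI) storage units: each unit is 1,000x the previous one.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]

for power, name in enumerate(UNITS):
    print(f"1 {name} = {1000 ** power:,} bytes")

# A zettabyte (1000**7 bytes) really is a trillion gigabytes:
print(1000 ** 7 // 1000 ** 3)  # 1,000,000,000,000 gigabytes
```

(Storage vendors use these decimal units; operating systems sometimes report binary units of 1,024, which is why a "1 TB" drive shows up slightly smaller.)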
You personally (at the time of writing this post in 2021) probably have data in the gigabyte, maybe terabyte range. A business may have data in the terabyte / petabyte range. Should all that data be placed into cloud storage? Probably not.
What Format Is The Data In?
There are programs out there that use proprietary file formats, and there are files out there in formats that are no longer readable because the software has either stopped being supported or only works on older platforms (e.g. operating systems). If data is in a format that cannot be read then it should be converted to a readable format before placing it in the cloud – deferring conversion creates technical debt which, if unmanaged / undocumented, will probably rear its head in the future. Data format also covers databases – does the data need to be in a SQL database? A NoSQL database? Can it be converted from one database format to another (potentially open source)?
Is The Data Still Relevant? / Regulatory Considerations
Not all data is needed forever. My Fallout 3 video game saves were of huge importance to me back in the 2000s when I was gaming more. In 2021, not so important. My coursework from high school was important at one point, but since high school was many years ago it is no longer needed. My photos of family and friends: important and needed. From a personal perspective it can be quite easy to decide what's still needed and what is not, with the needed data being a potential candidate for cloud storage.
However, from a business / organisation perspective it can be a little harder. Regulatory rules come into play (e.g., Data Protection Act, GDPR) which should be checked before any data is considered for the cloud and before any data deletion. Then the potential tech debt mountain of data review needs to happen, as various parts of the business / organisation may have been keeping data for many years and are unsure if it's required for their role. Moving unneeded data into cloud storage again creates tech debt for someone to sort out in the future and can also start to get costly (storage can be cheap, but it still involves cost).
Data Size
After checking the data format, making sure the data is relevant and that it meets any required regulatory requirements, it's time to size up the data. Having a rough idea of the size of data that could potentially be stored in the cloud gives an opportunity to estimate pricing with the various Cloud Service Providers (CSPs). Data size also gives a good indication of how long a migration may take, depending on your connection's upload speed.
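As a back-of-the-envelope sketch (my own illustration, not a CSP tool), migration time can be estimated from data size and upload bandwidth:

```python
# Rough estimate of how long an upload-based migration takes.
# Ignores protocol overhead, retries, throttling and parallelism.
def migration_days(data_terabytes: float, upload_mbps: float) -> float:
    """Days needed to upload the data at a sustained rate."""
    data_bits = data_terabytes * 1e12 * 8      # decimal TB -> bits
    seconds = data_bits / (upload_mbps * 1e6)  # megabits/s -> bits/s
    return seconds / 86_400                    # seconds -> days

# e.g. 10 TB over a sustained 100 Mbps uplink:
print(round(migration_days(10, 100), 1))  # about 9.3 days
```

Numbers like this are why CSPs also offer physical transfer options (shipping disks) for very large datasets.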
Access Frequency
Not all data is accessed equally. At one time I would have grabbed a movie from my Blu-ray / DVD collection, but those discs are rarely touched now as a lot of the content is available via a streaming media service. Same for my old MP3 library: it's there but has not been accessed in a very long time. The access frequency for these is very low. The files related to my current university apprenticeship are accessed a few times a week, so their access frequency is high.
Understanding how often data is accessed allows you to understand how much you need to pay a CSP, as you may be charged per read / write. A thousand reads a month may sound generous, but if it's a file that is used by dozens of employees multiple times a day, then the monthly read allowance may be run through very quickly.
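The arithmetic behind that warning is simple; with hypothetical numbers (two dozen staff, a few opens each per working day) the allowance disappears within the month:

```python
# Hypothetical figures showing how fast a read allowance is consumed.
employees = 24        # dozens of staff using the same file
reads_per_day = 3     # each opens it a few times a day
working_days = 21     # roughly one working month

monthly_reads = employees * reads_per_day * working_days
allowance = 1000      # the "sounds good" monthly read allowance

print(monthly_reads)              # 1512 reads actually needed
print(monthly_reads > allowance)  # True: allowance exhausted mid-month
```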
Access Speed
Alongside access frequency is access speed. My physical movie collection may not be used much, but if I do decide to use it then it takes only a few moments to grab a disc from the shelf. If I moved it into offsite storage then it would take longer, as I would need to get in the car, drive to the storage building, find my storage room, scramble through boxes…etc…etc.. and eventually decide it may be quicker just to watch something else. However, for something like insurance renewals I don't mind my calendar reminding me a few weeks beforehand and my computer slowly crawling through several document archives to find the relevant paperwork (as I have a few weeks to resolve it).
It’s similar for a business / organisation. If data is business critical and needed at the drop of a hat, then access speed is a very high priority. If the data is rarely used and there can be hours (sometimes longer) between someone requesting it and then needing it (e.g., legal data that can be responded to within a month rather than on the same day) then access speed is a low priority (unless the business / organisation is bad at planning when data is needed).
Cloud Service Providers offer different options for storage speeds, with terms such as "hot", where data is generally available near instantly, and "cold", where data may take a while to be retrieved. Understanding the access speed requirements helps to understand where the data should be stored and gives a more accurate costing estimate.
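Putting access frequency and access speed together, tier selection can be sketched as a simple decision function. The tier names and thresholds below are my own illustration, not any specific CSP's pricing model:

```python
# Illustrative tier chooser: thresholds and tier names are made up
# for this sketch, not taken from a real CSP's offering.
def suggest_tier(reads_per_month: int, max_wait_seconds: float) -> str:
    if max_wait_seconds < 1 or reads_per_month > 100:
        return "hot"   # near-instant access, higher storage cost
    if reads_per_month > 1:
        return "cool"  # slower and cheaper middle ground
    return "cold"      # archival: retrieval can take hours

print(suggest_tier(500, 0.1))       # busy, latency-sensitive data -> hot
print(suggest_tier(0, 3600 * 12))   # rarely touched, can wait hours -> cold
```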
Retention Period (How Long Should Data Live)?
Just as data should be checked for relevancy before it is sent into the cloud, it should also be given an estimated lifetime. Reviewing data long after creation can be a long and often impractical task (e.g., the original data owner has left, the project team has changed multiple times, etc.), so it is better to tag data as it is moved into the cloud with a potential lifetime. For example, if a business / organisation has collected data that can only be kept for 3 years, then tag it with a value saying "delete in November 2024". If an automated script / function is then used to regularly review tags, the deletion will be automatic, saving someone from reviewing / deleting data manually. Or use a tag saying "review in November 2024" and an automated script / function to ask the data owner: "that data you stored 3 years ago is up for review – is it still needed?".
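Such a tag-review function can be sketched in a few lines. The tag format here ("delete:" / "review:" plus a year and month) is my own invention for illustration; real CSPs have their own tagging and lifecycle-rule mechanisms:

```python
# Minimal sketch of automated retention review based on object tags.
# Tag format ("delete:YYYY-MM" / "review:YYYY-MM") is invented for
# this example, not a real CSP convention.
from datetime import date

def process_tag(name: str, tag: str, today: date) -> str:
    action, _, due = tag.partition(":")
    year, month = (int(part) for part in due.split("-"))
    if (today.year, today.month) < (year, month):
        return "keep"                      # retention date not yet reached
    if action == "delete":
        return f"deleting {name}"          # automatic deletion
    return f"emailing owner of {name}"     # ask: is it still needed?

print(process_tag("report.csv", "delete:2024-11", date(2025, 1, 1)))
print(process_tag("logs.zip", "review:2024-11", date(2024, 6, 1)))
```

Run on a schedule, a function like this replaces the manual "is this still needed?" trawl described above.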
Having retention periods helps keep within regulatory requirements and also helps reduce potential costs as data that is not needed is deleted.
Who Needs Access?
Having a clear picture of who needs access to the data can give a clear indication of the security requirements. If the data is private then placing it in storage open to the public would be bad. If several people need access to read, and several more need read / write then differing access levels and a mechanism to control them will be needed.
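A mechanism for differing access levels can be as simple as a lookup from resource and user to permission. This toy access-control list (names and levels are illustrative only) shows the idea behind what CSPs implement with IAM policies:

```python
# Toy access-control list: resource -> {user: permission level}.
# Users, files and level names are invented for this sketch.
ACL = {
    "finance-report.xlsx": {"alice": "read-write", "bob": "read"},
}

def can_write(user: str, resource: str) -> bool:
    # Anyone absent from the ACL gets no access at all (deny by default).
    return ACL.get(resource, {}).get(user) == "read-write"

print(can_write("alice", "finance-report.xlsx"))    # True: read-write
print(can_write("bob", "finance-report.xlsx"))      # False: read only
print(can_write("mallory", "finance-report.xlsx"))  # False: no entry
```

The deny-by-default lookup is the important design choice: unknown users get nothing, rather than the storage defaulting to public.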
Understand Cloud Service Provider Options
With your data understood, the options offered by each CSP come into play.
Does the CSP offer a set speed for download / upload of data? Is there any throttling? What are the options for transferring data into the cloud?
Does the CSP’s uptime average meet your requirements? What happens if the CSP storage has downtime (planned or not)?
Does the CSP store the data in a region that meets your data requirements / regulatory requirements? Is the storage close by, or on the other side of the world (latency may come into play)?
Building on the “Who Needs Access” paragraph, what security controls does the CSP offer? Are they appropriate / do they meet requirements? Is data encrypted in transit? Is data encrypted at rest?
Cost / Pricing
With a clear (or clearer) understanding of the data requirements, a CSP's price estimation calculator can be used to provide potential costs. These estimates will draw on the information gathered above: data format, data size, access frequency, access speed, retention period and who needs access.
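The shape of such a calculation is straightforward; the per-GB and per-request rates below are made-up numbers for illustration, not any real CSP's tariff:

```python
# Illustrative monthly cost model: storage plus request charges.
# All rates are invented for this sketch, not real CSP pricing.
def monthly_cost(size_gb: float, reads: int, writes: int,
                 gb_rate: float = 0.02,        # per GB-month
                 read_rate: float = 0.0004,    # per read request
                 write_rate: float = 0.005) -> float:
    """Estimated monthly bill: storage + reads + writes."""
    return size_gb * gb_rate + reads * read_rate + writes * write_rate

# e.g. 500 GB of data, 10,000 reads and 1,000 writes a month:
print(round(monthly_cost(500, 10_000, 1_000), 2))  # 19.0
```

Even a crude model like this makes the trade-offs visible: cheap-per-GB cold storage can become expensive if the access frequency estimate was wrong.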