Big data has become a priority for most organizations, which are increasingly aware of the central role data can play in their success. But firms continue to struggle with how best to protect, manage and analyze data within modern architectures. Failing to do so can result in extended downtime and data loss that cost the organization millions of dollars.
Traditional data platforms (Oracle, SQL Server, etc.) are managed by IT professionals. Big data platforms (Hadoop, Cassandra, Couchbase, HPE Vertica, etc.), by contrast, are often managed by engineers or DevOps groups, and some common misconceptions around big data backup and recovery need to be cleared up.
Some of the most common myths include:
Myth #1: Multiple replicas of data eliminate the need for separate backup/recovery tools for big data. Most big data platforms create multiple copies of data and distribute these copies across different servers or racks. This type of data redundancy protects data in case of hardware failures. However, other situations such as user errors, accidental deletions and data corruption will result in data loss, because these errors or corruptions quickly propagate to all copies of the data.
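The difference between redundancy and backup can be illustrated with a toy model. The sketch below is not how any particular platform implements replication; the `ReplicatedStore` class and its methods are hypothetical names invented for this example. It assumes only what the paragraph above describes: every write and delete is applied to all replicas.

```python
import copy


class ReplicatedStore:
    """Toy key-value store (hypothetical) that mirrors every change to all replicas."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]

    def put(self, key, value):
        for replica in self.replicas:
            replica[key] = value

    def delete(self, key):
        # A delete is just another change, so it reaches every copy.
        for replica in self.replicas:
            replica.pop(key, None)

    def fail_replica(self, index):
        # Simulate a hardware failure: one replica loses its contents.
        self.replicas[index] = {}

    def read(self, key):
        # Return the value from the first replica that still holds the key.
        for replica in self.replicas:
            if key in replica:
                return replica[key]
        return None


store = ReplicatedStore()
store.put("orders", "q3-data")

# An independent point-in-time copy, kept outside the replicated store.
backup = copy.deepcopy(store.replicas[0])

store.fail_replica(0)
after_hardware_failure = store.read("orders")  # surviving replicas still serve the data

store.delete("orders")  # accidental deletion by a user
after_user_error = store.read("orders")  # gone from every replica

recovered = backup.get("orders")  # only the separate backup still has it
```

The replicas survive the simulated hardware failure, but after the accidental delete only the independent copy still holds the data.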
Myth #2: Lost data can be quickly and easily rebuilt from the original raw data. This might actually work if you still have all the raw data needed to rebuild the lost data. But in most cases, that raw data has been deleted or is not easily accessible. Even if it were available, rebuilding the lost data at big data scale can take weeks, consume significant engineering resources and result in extended downtime for big data users.
Myth #3: Backing up a petabyte of big data is not economical or practical. Periodic full backups of a petabyte of data will take weeks and require infrastructure investments north of half a million dollars.
However, there are a few things you can do to mitigate these issues. You can identify the subset of data that is valuable to the organization and back up only that data. Adopting newer backup techniques, such as deduplication to store backups efficiently and incremental-forever backups to transfer only the changes, and running backups on commodity servers will also help reduce costs and shorten backup times.
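To make the deduplication and incremental-forever ideas concrete, here is a minimal sketch. It assumes fixed-size chunks identified by their SHA-256 hash, which is one common way to deduplicate but not necessarily what any given backup product does. The `DedupBackupStore` class, its methods and the tiny `CHUNK_SIZE` are all hypothetical and chosen for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, for illustration only


class DedupBackupStore:
    """Hypothetical backup store that keeps each unique chunk once."""

    def __init__(self):
        self.chunks = {}     # SHA-256 hex digest -> chunk bytes
        self.manifests = []  # one ordered list of digests per backup

    def backup(self, data: bytes) -> int:
        """Record a backup and return how many new bytes had to be stored."""
        manifest = []
        new_bytes = 0
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                # Only chunks never seen before are stored.
                self.chunks[digest] = chunk
                new_bytes += len(chunk)
            manifest.append(digest)
        self.manifests.append(manifest)
        return new_bytes

    def restore(self, backup_index: int) -> bytes:
        """Rebuild a full point-in-time copy from its manifest."""
        return b"".join(self.chunks[d] for d in self.manifests[backup_index])


store = DedupBackupStore()
day1 = b"AAAABBBBCCCCDDDD"
day2 = b"AAAABBBBXXXXDDDD"  # one chunk changed since day 1

first_written = store.backup(day1)   # every chunk is new on the first backup
second_written = store.backup(day2)  # only the changed chunk is stored
```

In this sketch the second backup stores 4 new bytes instead of 16, yet `store.restore(0)` and `store.restore(1)` each return a complete copy of the data as it was on that day.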
Myth #4: Remote disaster recovery copies can serve as a backup copy. It is prudent to have a copy of the data in a remote data center to protect against large-scale disasters such as fires and earthquakes. This is typically done by replicating data on a periodic basis from the production data center to the disaster recovery data center.
However, all changes made in the production data center are propagated to the disaster recovery site, including accidental deletions, database corruption and application corruption. As a result, the disaster recovery copy cannot serve as a backup copy, since it does not provide point-in-time copies that you can roll back to.
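The distinction between a replicated disaster recovery copy and point-in-time backups can be sketched as follows. This is a simplified model under stated assumptions: the `SnapshotCatalog` class, the `replicate` function and the integer timestamps are hypothetical, and real systems store snapshots far more efficiently than by copying the whole dataset.

```python
import bisect
import copy


class SnapshotCatalog:
    """Hypothetical catalog of point-in-time copies, ordered by timestamp."""

    def __init__(self):
        self.times = []      # snapshot timestamps, in increasing order
        self.snapshots = []  # dataset state captured at each timestamp

    def take(self, timestamp, dataset):
        # Assumes snapshots are taken in increasing timestamp order.
        self.times.append(timestamp)
        self.snapshots.append(copy.deepcopy(dataset))

    def restore_as_of(self, timestamp):
        """Return the most recent snapshot taken at or before `timestamp`."""
        index = bisect.bisect_right(self.times, timestamp) - 1
        if index < 0:
            raise LookupError("no snapshot at or before the requested time")
        return copy.deepcopy(self.snapshots[index])


production = {"users": ["ann", "bob"]}
dr_site = {}
catalog = SnapshotCatalog()


def replicate():
    # Periodic replication: the DR site becomes a mirror of production.
    dr_site.clear()
    dr_site.update(copy.deepcopy(production))


catalog.take(1, production)  # time 1: healthy state is snapshotted
replicate()

production["users"] = []     # time 2: corruption wipes the user list
replicate()                  # the corruption is faithfully mirrored to DR
catalog.take(3, production)  # time 3: a later snapshot captures the bad state

rolled_back = catalog.restore_as_of(2)  # roll back to just before time 3
```

After the second replication the disaster recovery site holds the same corrupted state as production, while the catalog can still return the healthy state captured at time 1.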