Here comes the SUN – Legacy Data Remediation



How Legacy Data Remediation Drives Technical Innovations for Future Projects

– Brendan Sullivan, CEO

Legacy data remediation has become the ultimate test of technical ingenuity. Almost every large remediation we have undertaken is substantially unique. There are over 100 major backup formats, over 100 tape formats, and goodness knows how many file formats. More recently, large networked disk-based systems (NAS filer / NDMP) became the initial backup target before data was dumped to deep tape. There are container files, compressed container files, and multiple database formats, many saved as agent-based backups. Suffice it to say that when a client requires data to be separated – “keep this, not that” – it can be complex. Further challenges are always posed by the timeline for completion. Backup systems are built for backup: data is added at a steady rate, and that is as fast as the systems ever need to operate. Data remediation, however, often involves multiple petabytes, and the native systems are encumbered by inadequate bandwidth, software license limitations, lapsed hardware support, and the departure of the IT staff who knew the environment.

But here’s where it gets exciting – here comes the SUN. Through our innovative approaches, we don’t just tackle complexity; we deliver what matters most: your Sensitive, Useful & Necessary (SUN) data. After removing all the ROT (Redundant, Obsolete, and Trivial information), what remains is pure SUN – the data that is truly essential.

We are constantly developing ways to do this for our customers, and every time we come across a new challenge and solve the problem, we add the code developed to our existing master codebase, TRACS. What this means is that remarkable capability and flexibility have accumulated in the platform, and there has never been a better time to tackle the legacy data remediation project you have been putting off. Below is a summary of four technical solutions we created for specific projects that are usable for all future projects, and we are quite proud of the innovations.

Remediate by backup session 

Backup sessions are very often multi-threaded across tapes from the servers feeding the media server (backup server), and sessions are separated by file marks. So on a set of backup tapes you might see segments of Exchange email, other database servers, file shares, print servers, and so on. The client wanted to remove all email and messages as well as many database files, but keep certain database files, including some proprietary types. Traditionally you would have to restore everything, land it on massive amounts of disk, filter it, and then destroy the originals. We came up with a system that creates a container image of the targeted backup sessions. We call it a TSF (Tape Session File); it is compressed, and a backup set will typically yield several of them. The really elegant piece is that when performing a copy (duplicate), our code lands only the targeted sessions, compiles them into a folder, and ignores everything else. The result was high-speed data remediation by backup session, all performed with about 100 TB of disk from over 50 PB of source data.
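TRACS and the TSF format are proprietary, but the selection logic can be illustrated. Below is a minimal Python sketch under stated assumptions: the session index (IDs, source servers, data types) is hypothetical and would in reality come from a tape catalog or session scan, and read_session() stands in for positioning the drive and streaming one session’s blocks. Only sessions on the keep-list are landed, each into its own compressed container.

```python
import gzip
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Iterator


# Hypothetical record for one backup session on tape. Real session
# metadata would come from the backup catalog or a tape session scan.
@dataclass
class Session:
    session_id: str
    source_server: str
    data_type: str          # e.g. "exchange", "sql", "fileshare", "print"


KEEP_TYPES = {"sql", "proprietary_db"}   # "keep this, not that"


def read_session(session: Session) -> Iterator[bytes]:
    """Placeholder: stream the raw blocks of one session from tape.

    A real implementation would position the drive at the session's
    file mark and read blocks until the next file mark.
    """
    raise NotImplementedError


def remediate_by_session(sessions: Iterable[Session], out_dir: Path) -> None:
    """Land only the targeted sessions, each as a compressed container."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for session in sessions:
        if session.data_type not in KEEP_TYPES:
            continue        # skip email, print servers, etc. entirely
        # One compressed container per kept session -- loosely
        # analogous to a TSF (Tape Session File).
        target = out_dir / f"{session.session_id}.tsf.gz"
        with gzip.open(target, "wb") as container:
            for block in read_session(session):
                container.write(block)
```

The key design point is that the non-targeted sessions are never read into staging at all, which is what keeps the disk footprint to roughly 100 TB against petabytes of source.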

File-level data remediation with no labor cost

We like automation, but it is not always possible. We have about 50 tape libraries in our Atlanta-based vault. Legacy data remediation can be an involved process, traditionally requiring large amounts of staging disk. At the file-extension level, however, we have a very effective automated solution. We consult with the client on what is to be kept, typically down to file type and date range. We enter the complete criteria in a text field in our software, TRACS. We load the library (a large single-tower IBM library can hold circa 700 tapes and 8–12 tape drives). We run the library with TRACS, which works through every file on every tape, producing a very comprehensive report of inclusions and exclusions and landing just the inclusions to disk. Then we load the next batch of 700 tapes. We typically get through 700 tapes per week without manual intervention. The result is high-volume, low-cost remediation at the file level. It is ideal for file shares and Lotus Notes NSF database email; support for container files such as Microsoft Exchange EDB is currently in development.
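To make the filtering pass concrete, here is a minimal Python sketch, assuming the tape contents have already been cataloged to a directory tree as a stand-in for tape media; the keep criteria (extensions plus a date window) are hypothetical stand-ins for TRACS’s actual criteria field. It classifies every file, writes an inclusion/exclusion report, and returns only the inclusions for landing.

```python
import csv
from datetime import date
from pathlib import Path

# Hypothetical keep criteria: extensions plus a date window. TRACS's
# actual criteria syntax is proprietary; these are stand-ins.
KEEP_EXTENSIONS = {".nsf", ".doc", ".xls", ".pdf"}
KEEP_FROM = date(2008, 1, 1)
KEEP_TO = date(2015, 12, 31)


def classify(path: Path) -> tuple[bool, str]:
    """Return (keep?, reason) for one cataloged file."""
    if path.suffix.lower() not in KEEP_EXTENSIONS:
        return False, "extension excluded"
    modified = date.fromtimestamp(path.stat().st_mtime)
    if not (KEEP_FROM <= modified <= KEEP_TO):
        return False, "outside date range"
    return True, "matched criteria"


def run_batch(root: Path, report_path: Path) -> list[Path]:
    """Classify every file, log the decision, return the keep list."""
    inclusions: list[Path] = []
    with report_path.open("w", newline="") as f:
        report = csv.writer(f)
        report.writerow(["file", "kept", "reason"])
        for path in sorted(root.rglob("*")):
            if not path.is_file():
                continue
            keep, reason = classify(path)
            report.writerow([str(path), keep, reason])
            if keep:
                inclusions.append(path)   # only these land on staging disk
    return inclusions
```

Because every decision is written to the report, the inclusion/exclusion record doubles as an audit trail for the client.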

Mailbox User Identification 

When a client wishes to identify particular mailbox accounts across a large volume of backup tapes spanning many years, those mailboxes could take many forms. In an MS Exchange environment they could be mailboxes inside an EDB, loose PSTs backed up separately, or OSTs from granular Exchange backups (formerly known as “brick-level backups”). There could also be very many instances, depending on the frequency and type of backup: full, differential, incremental, or synthetic full, taken daily, weekly, monthly, or yearly. The accounts could be in an Exchange Information Store, or embedded still deeper in VMs or NDMP NAS-filer dumps that cannot easily be seen without a full restore. In many cases the location of these mailbox accounts is not easily known, and a fishing expedition ensues, often starting with a series of tape session scans followed by tape catalog scans. S2|DATA has created a purpose-built locator for mailbox accounts, whether they are loose, inside EDBs, within NDMP dumps, or even in backed-up VMs. For large tape volumes the application can run entire libraries and identify where the target mailbox accounts are, significantly decreasing the number of tapes that would otherwise have to be fully restored before the accounts could be extracted.
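S2|DATA’s locator is proprietary, but a simplified first pass over catalog data can be sketched. The Python below assumes a hypothetical set of entries from a tape catalog scan; it matches target account names against cataloged paths and flags EDB, VM, and NDMP containers for deeper inspection, since the account name may not appear in the path at all.

```python
import re
from dataclasses import dataclass


# Hypothetical entry produced by a tape catalog scan.
@dataclass
class CatalogEntry:
    tape_barcode: str
    path: str         # e.g. r"E:\Exchange\priv1.edb" or "users/jsmith.pst"
    container: str    # "pst", "ost", "edb", "vmdk", "ndmp", ...


def locate_mailboxes(entries, target_accounts):
    """Map each target account to the tapes that may hold it.

    Account names are matched against cataloged paths; EDB, VM, and
    NDMP containers are flagged for deeper inspection because the
    account name may not appear in the path itself.
    """
    patterns = {acct: re.compile(re.escape(acct), re.IGNORECASE)
                for acct in target_accounts}
    hits = {acct: set() for acct in target_accounts}
    needs_deep_scan = set()
    for entry in entries:
        for acct, pattern in patterns.items():
            if pattern.search(entry.path):
                hits[acct].add(entry.tape_barcode)
        if entry.container in {"edb", "vmdk", "ndmp"}:
            needs_deep_scan.add(entry.tape_barcode)
    return hits, needs_deep_scan
```

Narrowing the restore set to the tapes in the hit list (plus any flagged containers) is what avoids the full-restore fishing expedition.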

Don’t Remediate the Legacy Data 

There are two main reasons to undertake a legacy data remediation project: cost, and legal preservation/discovery. If it is only the cost element that troubles the client, we can emulate the entire environment and either virtualize it or replace the existing native software licenses with our metadata review platform, Invenire. We populate Invenire with session- and file-level metadata from the backup environment and run a proof-of-concept test; the client can then retire the software, hardware, and IT resources and simply log into Invenire whenever a file needs to be found and extracted. The tapes or disk storage reside in S2|DATA’s vault. The result is an order-of-magnitude cost reduction for the client and faster time to data should it ever be required.
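Invenire’s internals are proprietary, so the sketch below is purely illustrative: a minimal SQLite-backed metadata catalog in Python with a hypothetical schema, showing how a file search can resolve straight to the tape that needs to be pulled from the vault.

```python
import sqlite3

# A minimal stand-in for a metadata review catalog. Invenire itself is
# proprietary, so this schema and query are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_metadata (
    tape_barcode TEXT,
    session_id   TEXT,
    path         TEXT,
    size_bytes   INTEGER,
    modified_utc TEXT
);
CREATE INDEX IF NOT EXISTS idx_path ON file_metadata(path);
"""


def open_catalog(db_path: str) -> sqlite3.Connection:
    """Open (or create) the metadata catalog."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def find_files(conn: sqlite3.Connection, name_fragment: str):
    """Return (tape, session, path) for every matching file, so the
    right tape can be pulled from the vault for extraction."""
    query = """
        SELECT tape_barcode, session_id, path
        FROM file_metadata
        WHERE path LIKE ?
        ORDER BY tape_barcode, session_id
    """
    return conn.execute(query, (f"%{name_fragment}%",)).fetchall()
```

The point of the design is that only metadata stays online; the bulk data stays on vaulted tape until a search proves a restore is actually needed.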

S2|DATA’s continued innovation is transforming legacy data remediation from a time-consuming, resource-intensive process into a streamlined, cost-effective one. Each new challenge solved adds to the TRACS platform’s capabilities: processing petabytes with a small staging disk, running thousands of tapes without manual intervention, and pinpointing specific data without time-consuming full restores. As our technology evolves and our codebase expands, we are consistently finding faster, smarter ways to unlock your trapped data. Contact us today to see how our latest innovations can dramatically reduce both the time and cost of your legacy data remediation project.