The enormous volumes of data generated by an organization will typically outgrow its infrastructure. Changes in an organization’s workflow affect data in a variety of ways, which in turn affects how internal data can be used as evidence in litigation.
Data is often transformed so that different systems across an organization can exchange, interpret, and use it cohesively. Data integration and interoperability are complex challenges for organizations deploying big data architectures, because data is heterogeneous by nature. Siloed storage emerges from the differing demands and specifications of different departments. Legacy data, which may once have been administratively useful, is stored, replaced, and frequently lost in transition. All of this helps explain why roughly two-thirds of the electronic data collected by organizations goes unused. Constant demands to reconfigure data processes, structures, and architecture carry significant risks for organizations, because those demands outpace administrative protocols and laws.
Properly integrating different data sources for an analysis requires an awareness of all of these technical complications.
Once potential data sources are identified for an analysis, the next step is to inspect the variables that will be integrated. Knowing exactly what each variable means may require additional questions and scrutiny, but it is an important step. In a given dataset, what is defined as “earnings”? What are all the potential values for a “location” variable? Are sensitive values, such as a user’s social security number, masked to ensure privacy? Variables are also defined by a class, or acceptable input type. One table may store a given date as a datetime class, while another stores the same value as a character string.
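As a minimal sketch, assuming Python with pandas and hypothetical column names, the snippet below shows how inspecting each variable’s class can surface mismatches, such as earnings stored as text or dates stored as character strings, before any integration begins:

```python
import pandas as pd

# Hypothetical payroll extract: "earnings" and "pay_date" arrive as text,
# even though another system stores the same fields as numeric and datetime types.
payroll = pd.DataFrame({
    "employee_id": ["E100", "E101"],
    "earnings": ["52,000", "61,500"],         # character string, not a number
    "pay_date": ["2021-03-31", "2021-06-30"]  # character string, not a datetime
})

print(payroll.dtypes)  # inspect the declared class of each variable before merging

# Normalize the classes so both sources agree on what each variable means
payroll["earnings"] = payroll["earnings"].str.replace(",", "").astype(float)
payroll["pay_date"] = pd.to_datetime(payroll["pay_date"])

print(payroll.dtypes)
```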
Confidence in the variables’ meanings carries through to confidence in the analysis, and ultimately to the presentation of evidence in court. A party bears additional risk if its expert witnesses cannot explain the ‘real meaning’ of a value under scrutiny.
A party also needs to know how potential variables will be used to merge datasets. Merging data within a single database is easily done with primary keys, whereas merges between two differently structured sources require more effort. How many common variables are needed to merge two sources without losing similar records? How much overlap between variables will yield a dataset of acceptable size? These factors affect the final output. Faulty merges, null values, and accidental data removals cost time and resources to resolve.
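As an illustration, assuming pandas and a hypothetical shared employee_id key, the sketch below shows how the choice of merge strategy determines whether unmatched records are silently dropped or retained as null values:

```python
import pandas as pd

# Hypothetical extracts from two systems that share an "employee_id" key.
hr = pd.DataFrame({
    "employee_id": ["E100", "E101", "E102"],
    "location": ["Chicago", "Dallas", "Remote"],
})
payroll = pd.DataFrame({
    "employee_id": ["E100", "E101", "E103"],
    "earnings": [52000.0, 61500.0, 48000.0],
})

# An inner join silently drops records that lack a match in either source (E102, E103).
inner = hr.merge(payroll, on="employee_id", how="inner")

# An outer join keeps every record but introduces null values that must be resolved.
outer = hr.merge(payroll, on="employee_id", how="outer")

print(inner)
print(outer.isna().sum())  # count the nulls produced by the merge
```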
There are various methods to extract, transform, and load disparate data into a unified schema. For simplicity, the ideal approach is to merge and aggregate the necessary inputs into the fewest datasets possible. Working with a few massive tables is preferable to analyzing scattered sources and proving their relevance as a whole. Proper data integration will reveal whether a litigant’s data is a gold mine or a time bomb.
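As a final sketch, again assuming pandas and hypothetical quarterly extracts, the snippet below illustrates that consolidation step: stacking siloed inputs into a single table and aggregating it to the level the analysis actually needs:

```python
import pandas as pd

# Hypothetical quarterly extracts from separate silos, consolidated into one table
# so the analysis rests on a single, well-documented dataset.
q1 = pd.DataFrame({"employee_id": ["E100", "E101"],
                   "earnings": [13000.0, 15000.0], "quarter": "Q1"})
q2 = pd.DataFrame({"employee_id": ["E100", "E101"],
                   "earnings": [13500.0, 15500.0], "quarter": "Q2"})

combined = pd.concat([q1, q2], ignore_index=True)

# Aggregate to the level the analysis requires: total earnings per employee.
annual = combined.groupby("employee_id", as_index=False)["earnings"].sum()
print(annual)
```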