Upcoming CLE in Baltimore, Maryland

EmployStats is sponsoring a CLE seminar on Data Analytics in Complex Litigation at the University of Baltimore's Merrick School of Business on April 5, 2019, from 9:30 AM to 1:30 PM. Complex litigation involves enormous amounts of data, which can seem impossible to sort or manage. This CLE seminar is all about how large datasets are analyzed in litigation.

The Merrick School of Business is located in the Mt. Vernon neighborhood of midtown Baltimore. Attendees will be within walking distance of Penn Station and a number of museums and restaurants.

Attendees will receive complimentary breakfast and lunch, and will hear from our accredited speakers: Roberto Cavazos, Ph.D., Kyle Cheek, Ph.D., Dwight Steward, Ph.D., and Vince McKnight. Our speakers have performed analytics work for top law firms and multinational companies across industries. They will cover a wide range of issues in Data Analytics and how its tools apply across the legal profession.

Looking to enroll? Visit: https://www.bigdatacleseminar.com/

Data Analytics and the Law: Integration

The enormous volumes of data generated by an organization will typically outgrow its infrastructure. Changes in an organization's workflow affect its data in a variety of ways, which in turn affect the use of that internal data as evidence in litigation.


Data is often transformed so that different systems within an organization can exchange, interpret, and use it cohesively. Data integration and interoperability are complex challenges for organizations deploying big data architectures, because data is heterogeneous by nature. Siloed storage emerges from the differing demands and specifications of different departments. Legacy data that was once administratively useful is stored, replaced, and frequently lost in these transitions. All of this helps explain why roughly two-thirds of the electronic data collected by organizations goes unused. Constant demands to reconfigure data processes, structures, and architectures carry significant risks for organizations, because those demands outpace administrative protocols and laws.


Properly integrating different data sources for an analysis involves an awareness of all these technical complications.


Once potential data sources are identified for an analysis, the next step is to inspect the variables that will be integrated. Knowing exactly what each variable means may require additional questions and scrutiny, but it is an important step. In a given dataset, what is defined as “earnings”? What are all the potential values of a “location” variable? Are sensitive values, like a Social Security number, masked to protect privacy? Variables are also defined by a class, or acceptable input type. One table may store a given date as a datetime class, while another stores the same value as a character string.
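
As a minimal sketch of this kind of variable inspection, the snippet below uses pandas with hypothetical column names ("earnings", "location", "hire_date"); the definitions and values are illustrative assumptions, not drawn from any actual case data.

```python
# A sketch of variable inspection: check each column's class, enumerate the
# values a field can take, and reconcile a date stored as text with a true
# datetime. All names and values here are hypothetical.
import pandas as pd

payroll = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "earnings":    [52000.0, 48500.0, 61250.0],  # gross? net? base pay only? confirm the definition
    "location":    ["TX", "TX", "MD"],
    "hire_date":   ["2015-03-01", "2017-11-15", "2019-06-30"],  # stored as character strings
})

print(payroll.dtypes)                 # the "class" of each variable
print(payroll["location"].unique())   # every value the field actually takes
print(payroll["earnings"].describe()) # range sanity check

# A date stored as text will not sort or subtract correctly until it is
# converted to a datetime class.
payroll["hire_date"] = pd.to_datetime(payroll["hire_date"], format="%Y-%m-%d")
print(payroll["hire_date"].dtype)     # datetime64[ns]
```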


Confidence in the variables’ meanings carries through to confidence in the analysis, and ultimately to the presentation of evidence in court. A party bears additional risk if its expert witnesses cannot explain the ‘real meaning’ of a value under scrutiny.


A party also needs to know which variables will be used to merge datasets. Merging data within a single database is easily done with primary keys, whereas merging two differently structured sources requires more effort. How many common variables are needed to merge two sources without deleting similar records? How much overlap between variables will yield a dataset of acceptable size? These factors affect the final output. Faulty merges, null values, and accidental data removals cost time and resources to resolve.
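
The sketch below illustrates these merge issues with two hypothetical tables. An outer merge with an indicator column surfaces exactly which records failed to match before anything is discarded; the table and column names are assumptions for illustration only.

```python
# Merging two sources on a shared key. Rows present in only one source show
# up as "left_only" / "right_only", and unmatched fields are filled with
# nulls rather than silently deleted.
import pandas as pd

claims = pd.DataFrame({
    "provider_id": [1, 1, 2, 3],
    "claim_id":    ["A1", "A2", "B1", "C1"],
    "amount":      [200.0, 450.0, 125.0, 980.0],
})
providers = pd.DataFrame({
    "provider_id":   [1, 2, 4],
    "provider_name": ["Acme Clinic", "Bayview Health", "Cedar Group"],
})

merged = claims.merge(providers, on="provider_id", how="outer", indicator=True)

print(merged["_merge"].value_counts())       # how many rows matched both sources
print(merged[merged["_merge"] != "both"])    # the records that need investigation
```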


There are various methods to extract, transform, and load disparate data into a unified schema. For simplicity, the ideal approach is to merge and aggregate the necessary inputs into the fewest datasets possible. The unwieldiness of a few massive tables is outweighed by the difficulty of analyzing scattered sources and proving their relevance as a whole. Proper data integration will reveal whether a litigant’s data is a gold mine or a time bomb.
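
A simplified extract-transform-load sketch of this idea follows: two hypothetical exports that describe the same transactions under different column names are renamed to a common layout, stacked into one table, and aggregated so the analysis can run against a single dataset. The schemas and values are assumed for illustration.

```python
# Harmonize two differently structured exports into one unified table.
import pandas as pd

legacy = pd.DataFrame({
    "EMP_NO":    [101, 102],
    "GROSS_PAY": [5200.0, 4800.0],
    "PAY_DT":    ["2018-01-31", "2018-01-31"],
})
current = pd.DataFrame({
    "employee_id": [101, 102],
    "earnings":    [5400.0, 4900.0],
    "pay_date":    ["2019-01-31", "2019-01-31"],
})

# Transform: map each source onto one agreed-upon schema.
legacy = legacy.rename(columns={"EMP_NO": "employee_id",
                                "GROSS_PAY": "earnings",
                                "PAY_DT": "pay_date"})

# Load: one unified table, then aggregate as needed.
unified = pd.concat([legacy, current], ignore_index=True)
unified["pay_date"] = pd.to_datetime(unified["pay_date"])

totals = unified.groupby("employee_id", as_index=False)["earnings"].sum()
print(totals)
```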

Data Analytics and the Law: Acquiring Data

Evidence based on Data Analytics hinges on the relevance of its underlying sources. Determining what potential data sources can prove is as important as generating an analysis. The first question should be, “What claims do I want to assert with data?” The type of case and the nature of the complaint should tell litigants where to start looking in discovery. For example, a dataset of billing information could determine whether or not a healthcare provider committed fraud. Structured data sources like Excel files, SQL servers, and third-party databases (e.g., Oracle) are the primary source material for statistical analyses, particularly those using transactional data.
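
As a purely hypothetical illustration of the billing example, the sketch below shows the kind of simple statistical screen that structured transactional data supports: tabulating billed amounts by provider and flagging those far above their peers. The screening rule and all values are assumptions for the sketch, not a fraud standard.

```python
# A toy screen over transactional billing records: summarize by provider and
# flag providers whose average billed amount is far above the typical provider.
import pandas as pd

billing = pd.DataFrame({
    "provider_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "amount":      [120, 130, 125, 110, 140, 135, 900, 880, 950],
})

per_provider = billing.groupby("provider_id")["amount"].agg(["count", "mean", "sum"])
threshold = 3 * per_provider["mean"].median()      # assumed screening rule for illustration
outliers = per_provider[per_provider["mean"] > threshold]

print(per_provider)
print(outliers)   # providers billing well above the typical provider
```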


In discovery, it is important that both parties be aware of these structured data sources. Often these sources do not have a single designated custodian; they may instead be the purview of siloed departments or an IT group. For any particular analysis, the necessary data is rarely held in one place. Identifying valuable source material becomes more difficult as the complexity of the interactions between sources increases. To efficiently stitch together smaller databases and tables, a party should conduct detailed data mapping, identifying the links between structured data sources: how two tables relate to one another, how a SQL table relates to an Excel file, or how a data cube relates to a cloud file. Data mapping identifies which structured data sources are directly linked through their variables, and how they fit together as a whole in an analysis.
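
A minimal data-mapping sketch appears below: it catalogs the variables in each hypothetical source and identifies the shared fields that could link them. In practice the column lists would be read directly from each SQL table, Excel file, or extract; the source and field names here are invented for illustration.

```python
# Identify which structured sources share linking variables.
from itertools import combinations

source_columns = {
    "sql.claims":        {"claim_id", "provider_id", "amount", "service_date"},
    "sql.providers":     {"provider_id", "provider_name", "npi"},
    "excel.remittances": {"claim_id", "paid_amount", "check_no"},
    "csv.enrollment":    {"member_id", "plan_id", "effective_date"},
}

for (name_a, cols_a), (name_b, cols_b) in combinations(source_columns.items(), 2):
    shared = cols_a & cols_b
    if shared:
        print(f"{name_a} <-> {name_b}: linked on {sorted(shared)}")
    else:
        print(f"{name_a} <-> {name_b}: no direct link")
```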


However, when using data-based evidence to answer a question, structured data is rarely clean or well organized. Variables defined in a table may be underused or entirely unused. Legacy files imported into newer systems can become corrupted. The authors of macros or scripts for data pulls may no longer work for the organization and may have left no detailed documentation. Sometimes the data simply do not exist: not because a party buried evidence, but because of the very nature of electronically stored information (ESI).
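
The short data-quality pass below, run on a hypothetical extract with assumed column names, shows the kind of checks this implies: how much of each field is actually populated, which columns carry no information, and whether duplicate records exist before the table is relied on as evidence.

```python
# Quick data-quality checks on a hypothetical extract.
import pandas as pd
import numpy as np

extract = pd.DataFrame({
    "claim_id":    ["A1", "A2", "A2", "A4"],
    "amount":      [200.0, np.nan, np.nan, 980.0],
    "legacy_code": [None, None, None, None],   # defined in the schema, never populated
    "status":      ["PAID", "PAID", "PAID", "PAID"],
})

print(extract.isna().mean())      # share of missing values per column
unused = [c for c in extract.columns if extract[c].nunique(dropna=True) <= 1]
print("constant or empty columns:", unused)
print("duplicate rows:", extract.duplicated().sum())
```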


Any defensible analysis is inherently limited by the data that is available. In data analytics, the maxim that “absence of evidence is not evidence of absence” is readily apparent. It is always more dangerous to exaggerate or generalize from the available data than to produce a narrow but statistically sound result. Thus, given the data available, what questions can be asked? What questions can be answered? And if there is no data, does that mean there is no problem?

Data Analytics and the Law: The Big Picture

With businesses and government now firmly reliant on electronic data for their regular operations, litigants are increasingly presenting data-driven analyses to support their assertions of fact in court. This application of Data Analytics, the ability to draw insights from large data sources, is helping courts answer a variety of questions. For example, can a party establish a pattern of wrongdoing based on past transactions? Such evidence is particularly important in litigation involving large volumes of data: business disputes, class actions, fraud, and whistleblower cases. The use cases for data-based evidence increasingly cut across industries, whether in financial services, education, healthcare, or manufacturing.


Given the increasing importance of Big Data and Data Analytics, parties with a greater understanding of data-based evidence have an advantage. Statistical analyses of data can provide judges and juries with information that otherwise would not be known. Electronic data hosted by a party is discoverable, data is impartial (in the abstract), and large datasets can be readily analyzed with increasingly sophisticated techniques. Data-based evidence, effectively paired with witness testimony, strengthens a party’s assertion of the facts. Realizing this, litigants engage expert witnesses to provide dueling tabulations or interpretations of data at trial. As a result, US case law on data-based evidence is still evolving, and judges and juries are making important decisions based on the validity and correctness of complex and at times contradictory analyses.


This series will discuss best practices for applying analytical techniques to complex legal cases, focusing on the important questions that must be answered along the way. Every step, from acquiring data to preparing an analysis, running statistical tests, and presenting results, carries significant consequences for the applicability of data-based evidence. In cases where both parties employ expert witnesses to analyze thousands if not millions of records, a party’s assertions of fact are easily undermined if its analysis is deemed less relevant or inappropriate. Outcomes may turn on the statistical significance of a result, the relevance of a prior analysis to a certain class, the importance of excluded data, or the rigor of an anomaly detection algorithm. At worst, expert testimony can be dismissed outright.


Many errors in data-based evidence come down, at their heart, to faulty assumptions about what the data can prove. Lawyers and clients may overestimate the relevance of a supporting analysis, or mold the data (and assumptions) to fit certain facts. Litigating parties and their witnesses must constantly ensure that data-driven evidence is grounded in best practices while addressing the matter at hand. Data analytics is a powerful tool, but it is only as good as its user.