Big Data permeates our society, but how will it affect U.S. courts? In civil litigation, attorneys and experts are increasingly reliant on analyzing of large volumes of electronic data, which provide information and insight into legal disputes that could not be obtained through traditional sources. There are limitless sources of Big Data: time and payroll records, medical reimbursements, stock prices, GPS histories, job openings, credit data, sales receipts, and social media posts just to name a few.  Experts must navigate complex databases and often messy data to generate reliable quantitative results. Attorneys must always keep an eye on how such evidence is used at trial. Big Data analyses also present new legal and public policy challenges in areas like privacy and cybersecurity, while advances continue in artificial intelligence and algorithmic design.  For these and many other topics, Employstats has a roadmap on the past, present, and future of Big Data in our legal system.

Order your copy of Dr. Dwight Steward and Dr. Roberto Cavazos’ book on Big Data Analytics in U.S. Courts!

On April 5, 2019, Dr. Dwight Steward, Ph.D. will be speaking alongside Robert Cavazos, Ph.D., Kyle Cheek, Ph.D., and Vince McKnight.  The experts and attorney will be presenting together on a panel at the EmployStats sponsored CLE seminar, titled Data Analytics in Complex Litigation.  The seminar will take place at the University of Baltimore in the Merrick School of Business, and will run from 9:30 AM to 1:30 PM.

 

The speakers will cover a spectrum of issues on Big Data Analytics, and its use in legal applications.  Specifically, the general session of the CLE will provide an overview of data analytics in a legal context, discussing the various aspects of how to manage large data sets in complex litigation settings.  Attendees will then be able to choose between two breakout sessions, Data Analytics in Litigation and Healthcare Litigation.  Lunch will be included.

 

To find out more on the upcoming CLE, visit: www.bigdatacleseminar.com

Also, make sure to follow our blog and stay up to date with Employstats news and sponsored events! www.EmployStats.com

Evidence based on Data Analytics hinges on the relevance of its underlying sources. Determining what potential data sources can prove is as important as generating an analysis. The first question should be “What claims do I want to assert with data?” The type of case and nature of the complaint should inform litigants where they should start looking in discovery. For example, a dataset of billing information could determine whether or not a healthcare provider committed fraud. Structured data sources like Excel files, SQL servers, or third party databases (e.x. Oracle), are the primary source material for statistical analyses, particularly those using transactional data.

 

In discovery, it’s important that both parties be aware of these structured data sources. Often, these sources do not have a single designated custodian, rather they may be the purview of siloed departments or an IT group. For any particular analysis, rarely is all the necessary data all held in one place. Identifying valuable source material is more difficult as the complexity of interactions between different sources increases. To efficiently stitch together smaller databases and tables, a party should conduct detailed data mapping by identifying links between structured data sources. For example, how two tables relate to another, how a SQL table relates to an Excel file, or how a data cube relates to a cloud file. Data mapping identifies which structured data sources are directly linked to one another through their variables, and how they as a whole fit together in an analysis.

 

However when using data based evidence to answer a question, structured data is rarely clean and/or well organized. Variables defined in a table may be underutilized or unused. Legacy files imported into newer systems can become corrupted. The originators of macros or scripts for data pulls may no longer work for an organization and forgo detailed instructions. Sometimes the data simply do not exist: not from a party burying evidence, but by the very nature of electronically stored information (ESI).

 

Any defensible analysis is inherently limited by what data is available. With data analytics the maxim “evidence of absence, is not absence of evidence,” is apparent. It’s always more dangerous to exaggerate or generalize from the available data than to produce a narrow, but statistically sound result. Thus, given the data available, what questions can be asked? What questions can be answered? Finally, if there is no data, does it mean there is no problem?

With businesses and government now firmly reliant on electronic data for their regular operations, litigants are increasingly presenting data-driven analyses to support their assertions of fact in court. This application of Data Analytics, the ability to draw insights from large data sources, is helping courts answer a variety of questions. For example, can a party establish a pattern of wrongdoing based on past transactions? Such evidence is particularly important in litigation involving large volumes of data: business disputes, class actions, fraud, and whistleblower cases. The use cases for data based evidence increasingly cuts across industries, whether its financial services, education, healthcare, or manufacturing.  

 

Given the increasing importance of Big Data and Data Analytics, parties with a greater understanding of data-based evidence have an advantage. Statistical analyses of data can provide judges and juries with information that otherwise would not be known. Electronic data hosted by a party is discoverable, data is impartial (in the abstract), and large data sets can be readily analyzed with increasingly sophisticated techniques. Data based evidence, effectively paired with witness testimony, strengthens a party’s assertion of the facts. Realizing this, litigants engage expert witness to provide dueling tabulations or interpretations of data at trial. As a result, US case law on data based evidence is still evolving. Judges and juries are making important decisions based the validity and correctness of complex and at times contradictory analyses.

 

This series will discuss best practices in applying analytical techniques to complex legal cases, while focusing on important questions which must be answered along the way. Everything, from acquiring data, to preparing an analysis, to running statistical tests, to presenting results, carries huge consequences for the applicability of data based evidence. In cases where both parties employ expert witnesses to analyze thousands if not millions of records, a party’s assertions of fact are easily undermined if their analysis is deemed less relevant or inappropriate. Outcomes may turn on the statistical significance of a result, the relevance of a prior analysis to a certain class, the importance of excluded data, or the rigor of an anomaly detection algorithm. At worst, expert testimony can be dismissed.

 

Many errors in data based evidence, at their heart, are faulty assumptions on what the data can prove. Lawyers and clients may overestimate the relevance of their supporting analysis, or mold data (and assumptions) to fit certain facts. Litigating parties and witnesses must constantly ensure data-driven evidence is grounded on best practices, while addressing the matter at hand. Data analytics is a powerful tool, and is only as good as the user.

All data projects can benefit from building a Data Management Plan (“DMP”) before the project begins.  Typically a DMP is a formal document that describes your data and what your team will do with it during and after the data project.

There is no cookie-cutter DMP that is right for every project, but in most cases the following questions should be addressed in your DMP:

  1. What kind of data will your project analyze?  What file formats and software packages will you use?  What will your data output be?  How will you collect and process the data?
  2. How will you document and organize your data?  What metadata will you collect?  What standards and formats will you use?
  3. What are your plans for data access within your team?  What are the roles that the individuals in your team will play in the data analysis process?  How will you address any privacy or ethical issues, if applicable?
  4. What are your plans for long term archiving?  What file formats will you archive the data in?  Who will be responsible for the data after the project is complete?  Where will you save the files?
  5. What outside resources do you need for your project?  How much time will the project take your team to complete and audit?  How much will it cost?

When working on any type of data project, planning ahead is a crucial step.  Before starting in on a project, it’s important to think through as many of the details as possible so you can budget enough time and resources to accomplish all of the objectives.  As a matter of fact, some organizations and government entities require a Data Management Plan (“DMP”) to be in place in all of their projects.

 

A DMP is a formal document that describes the data and what your team will do with it during and after the data project.  Many organizations and agencies require one, and each entity has specific requirements for their DMPs.

 

DMPs can be created in just a simple readme.txt file, or can be as detailed as DMPs tailored to specific disciplines using online templates such as DMPTool.org.  The DMPTool is designed to help create ready-to-use data management plans.

Doug Berg, Ph.D., is an expert in big data, and has been working with EmployStats and Principal Economist Dr. Dwight Steward for several years regarding class action and discrimination lawsuits.  Dr. Berg is currently a professor at Sam Houston State University in the Department of Economics.  He received his Bachelor’s degree in Accounting from the University of Minnesota, and his Ph.D. in Economics from Texas A&M University.  Dr. Berg will provide additional support and his expert insight into using big data in employment litigation.  Doug Berg, Ph.D., describes litigation as “living on data”, and the better the data, the better the argument.  EmployStats welcomes his insight into the underlying meaning behind the data our clients provide us!

Due to the massive computational requirements of analyzing big data, trying to find the best approach to big data projects can be a daunting task for most individuals.  At EmployStats, our team of experts utilize top of the line data systems and software to seamlessly analyze big data and provide our clients with high quality analysis as efficiently as possible.

  1. The general approach for big data analytics begins with fully understanding the data provided as a whole.  Not only must the variable fields in the data be identified, but one must also understand what these variables represent and determine what values are reasonable for each variable in the data set.  
  2. Next, the data must be cleaned and reorganized into the clearest format, ensuring that data values are not missing and are within reasonable ranges of certainty.  As the size of the data increases, the amount of work necessary to clean the data increases.  In larger datasets there are more individual components which are typically dependent on each other, therefore it is necessary to write computer programs to evaluate the accuracy of the data.
  3. Once the entire dataset has been cleaned and properly formatted, one needs to define the question that will be answered with the data.  One must look at the data and see how it relates to the question.  The questions for big data projects may be related to frequencies, probabilities, economic models, or any number of statistical properties.  Whatever it is, one must then process the data in the context of the question at hand.
  4. Once the answer has been obtained, one must determine that the answer is a strong answer.  A delicate answer, or one that would significantly change if the technique of the analysis was altered, is not ideal.  The goal of big data analytics is to have a robust answer, and one must try to attack the same question in a number of different ways in order to build confidence in the answer.

Big data is not simply a size, it is a way of describing the type of data tools that will be utilized for an analysis.  Most, if not all, of the big data we work with at EmployStats requires specific data tools that are ever changing and evolving, as well as new tools that are being introduced into the market constantly.  

Each avenue will handle big data differently, and offer specific benefits that will determine how an analysis will be performed, as well as how results will be interpreted.  EmployStats constantly keeps up to date with the latest and greatest data analytic software for large data sets in order to optimize the outcome of these types of analyses.  

Many recent cases such as United States of America v. Abbott Laboratories and Pompliano v. Snapchat have utilized big data analysis techniques in litigation, proving that not only is it common to use big data in litigation, it is necessary to bring many cases to a successful close.

The American Statistical Association released an important statement and supporting paper concerning the use and interpretation of statistical significance and p-values in statistical research.

pvalues

The American Statistical Associations’ statement notes that the increased quantification of scientific research and a proliferation of large, complex data sets, often referred to as Big Data, has expanded the scope for statistics.  Accordingly, the importance of appropriately chosen techniques, properly conducted analyses, and correct interpretation has also increased.

This statement by the ASA furthers, and in some ways solidifies, the ground roots “counter-statistical significance” movement that many economists and statisticians, such as Steve Zillack and Diedre McCloskey, have been working on for decades.

Real-World-Matters-Statistically-Significant-1024x509

According to the ASA statement “The p-value [and the concept of statistical significance] was never intended to be a substitute for scientific reasoning,” said Ron Wasserstein, the ASA’s executive director. In research analysts use the data to calculate a p-value which shows how consistent the data is with the research hypothesis.  A small p-value is typically interpreted as having a small likelihood of being consistent with the research hypothesis.   In research papers, small p-values are in essence viewed as a ‘good thing’ and according to the ASA statement, are more favored by journal editors for publication.

The ASA statement argues against this approach.  Instead, the ASA statement states that “Well-reasoned statistical arguments contain much more than the value of a single number and whether that number exceeds an arbitrary threshold.”

See:

Ronald L. Wasserstein & Nicole A. Lazar (2016): The ASA’s statement on p-values: context, process, and purpose, The American Statistician, DOI: 10.1080/00031305.2016.1154108

Ziliak, S.T., and McCloskey, D.N. (2008), The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives, Ann Arbor: University of Michigan Press

Ziliak, S.T. (2010), “The Validus Medicus and a New Gold Standard,” The Lancet, 376, 9738, 324-325.