EmployStats at the Upcoming NELA Spring Seminar

EmployStats is honored to be attending and speaking at the upcoming National Employment Lawyers Association (NELA) Spring Seminar.  The seminar, titled Epic Advocacy: Protecting Wages in Litigation & Arbitration, will take place in Denver, CO, on April 12-13, 2019.

 

EmployStats’ principal economist Dwight Steward, Ph.D., and Matt Rigling, MA, will be presenting alongside attorneys Michael A. Hodgson and Dan Getman.  The speakers’ session, Calculating Damages: Views from an Expert and Lawyers, will discuss all relevant aspects of calculating and proving liability and damages in wage and hour cases.

 

The panelists will present the options attorneys face when attempting to tabulate damages, discuss best practices for obtaining and analyzing data, and address common wage and hour issues such as sampling and surveys.  EmployStats’ statistical experts will also provide statistical background as it relates to labor and employment class action lawsuits, explaining concepts such as statistical significance, confidence intervals, stratified sampling, and margin of error.
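
For readers less familiar with these concepts, the minimal sketch below (in Python, using entirely hypothetical survey values) shows how a 95% confidence interval and margin of error might be computed for a sampled quantity in a wage and hour case:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of unpaid minutes per shift from a class member survey (illustrative only)
np.random.seed(42)
sample = np.random.normal(loc=12.0, scale=5.0, size=200)

mean = sample.mean()
sem = stats.sem(sample)                                   # standard error of the mean
margin_of_error = sem * stats.t.ppf(0.975, df=len(sample) - 1)

print(f"Mean unpaid minutes: {mean:.2f}")
print(f"95% confidence interval: ({mean - margin_of_error:.2f}, {mean + margin_of_error:.2f})")
print(f"Margin of error: +/- {margin_of_error:.2f} minutes")
```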

 

We hope to see you at the upcoming NELA Spring Seminar in Denver on April 13; we would love to meet and discuss how EmployStats can assist you with your wage and hour lawsuit.  To find out more about the seminar, please visit the NELA Website. For more on EmployStats, visit the EmployStats Website.

Data Analytics and the Law: Putting it Together

This series on data analytics in litigation has emphasized how best practices help secure reliable, valid, and defensible results based on “Big Data.” Whether in inter-corporate litigation, class actions, or whistleblower cases, electronic data is a source of key insights. Courts hold wide discretion in admitting statistical evidence, which is why opposing expert witnesses scrutinize or defend results so rigorously. The techniques, models, and coding languages for generating analytical results from “Big Data” are generally accepted. The underlying assumptions of a data analysis, however, are where bias creeps in. These assumptions are the largest potential source of error, leading parties to confuse, generalize, or even misrepresent their results. Litigants need to be aware of and challenge such underlying assumptions, especially in their own data-driven evidence.

 

When it comes to big data cases, the parties and their expert witnesses should be prepared to ask probing questions at every stage. Where (and in what program) the data are stored, how they are interconnected, and how “clean” they are all directly impact the final analysis. These stages can be overlooked, leading parties to miss key variables or spend additional time cleaning up fragmented data sets. When the data are available, litigants should not miss opportunities due to a lack of preparation or foresight. When data do not exist or do not support a given assertion, a party should readily examine its next best alternative.

 

When the proper analysis is compiled and presented, the litigating parties must remind the court of the big picture: how the analysis directly relates to the case. Do the results prove a consistent pattern of “deviation” from a given norm? In other instances, an analysis referencing monetary values can serve as a party’s anchor for calculating damages.

 

In Big Data cases, the data should be used to reveal facts, rather than be molded to fit assertions.

Data Analytics and the Law: Analysis

For data-based evidence, the analysis is the heart of the content: the output of the data compiled for a case. In most instances, the analytics do not need to be complex. Indeed, powerful results can be derived by simply calculating summary statistics (mean, median, standard deviation). More complicated techniques, like regressions, time-series models, and pattern analyses, do require a background in statistics and coding languages. But even the most robust results are ineffective if an opposing witness successfully argues they are immaterial to the case. Whether simple or complex, litigants and expert witnesses should ensure an analysis is both relevant and robust against criticism.
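
As a minimal sketch, assuming pandas and purely illustrative weekly-hours figures, summary statistics of this kind take only a few lines:

```python
import pandas as pd

# Hypothetical weekly hours for a group of employees (illustrative values only)
hours = pd.Series([38.5, 41.0, 44.5, 39.0, 52.0, 40.0, 47.5, 43.0])

print(hours.describe())                      # count, mean, std, min, quartiles, max
print("Median:", hours.median())
print("Share over 40 hours:", (hours > 40).mean())
```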

 

What type of result would provide evidence of a party’s assertion? The admissibility and validity of statistical evidence varies by jurisdiction. In general, data-based evidence should be as straightforward as possible; more complex models should only be used when necessary. Superfluous analytics are distractions, leading to expert witnesses “boiling the ocean” in search of additional evidence. Additionally, courts still approach statistical techniques with some skepticism, despite their acceptance in other fields.

 

If more complex techniques are necessary, like regressions, litigants must be confident in their methods. For example, what kind of regression will be used? Which variables are “relevant” as inputs? What is the output, and how does it relate to a party’s assertion of fact? Parties need to link outputs, big or small, to a “therefore” moment: “the analysis gave us a result, therefore it is proof of our assertion in the following ways.” Importantly, this refocuses the judge or jury’s attention to the relevance of the output, rather than its complex derivation.
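
As a minimal sketch of the kind of regression being described, here is an ordinary least squares model fit with statsmodels; the dataset and variable names are hypothetical and purely illustrative:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical dataset: weekly pay modeled on hours worked and tenure (illustrative only)
df = pd.DataFrame({
    "weekly_pay": [600, 640, 720, 780, 820, 900, 950, 1010],
    "hours":      [38,  40,  42,  45,  46,  50,  52,  55],
    "tenure_yrs": [1,   2,   2,   3,   4,   5,   6,   7],
})

# Which variables are "relevant" inputs is a substantive decision, not a coding one
model = smf.ols("weekly_pay ~ hours + tenure_yrs", data=df).fit()
print(model.summary())                       # coefficients, p-values, confidence intervals
```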

 

Does the analysis match the scope of the complaint or a fact in dispute? Is the certified class all employees, or just a subset of employees in a company? Is the location a state, or a county within a state? If the defendant is accused of committing fraud, for how many years? Generalizing from a smaller or tangential analysis is inherently risky, and an easy target for opposing witnesses. If given a choice, avoid conjecture. Do not assume that an analysis in one area, for one class, or for one time period automatically applies to another.

 

A key component of analytical and statistical work is replicability. In fields such as finance, insurance, or large-scale employment cases, the analysis of both parties should be replicable: outside parties should be able to analyze the same data and obtain the same results. In addition, replicability can expose errors, sleights of hand, or outright manipulation.
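
A minimal sketch of practices that support replicability, assuming Python/pandas and a hypothetical input file: record the environment, fix random seeds, and keep every transformation in a re-runnable script rather than in manual edits.

```python
import sys
import numpy as np
import pandas as pd

# Record the environment so an outside party can reproduce the run
print("Python:", sys.version)
print("pandas:", pd.__version__, "| numpy:", np.__version__)

np.random.seed(2019)  # fixed seed: any sampling step yields the same draw every run

# Hypothetical input file; every transformation lives in this script
df = pd.read_csv("payroll_extract.csv")
sample = df.sample(n=500, random_state=2019)    # reproducible sample
sample.to_csv("analysis_sample.csv", index=False)
```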

 

Data-based evidence requires focus, clarity, and appropriate analytical techniques, otherwise an output is just another number.

Changes to Texas Workforce Commission Data

The Texas Workforce Commission (“TWC”) recently announced that it will no longer utilize its TRACER 2 application to provide information regarding the Texas labor market.  For many years, the data scientists at EmployStats and other firms in Texas researched economic indicators such as employment statistics, salaries and wages, and job growth using the inquiry capabilities of the TRACER 2 application.

 

The TWC is the state agency responsible for managing and providing workforce development services to employers and potential employees in Texas.  One of the many services the TWC provides is access for job seekers and data scientists to reliable labor and employment statistics relevant to occupations and industries within the state of Texas.  Specifically, TWC’s TRACER 2 program provided search functions which allowed individuals to freely tabulate market trends and statistics such as employment/unemployment estimates, industry and occupational projections, and occupational wage data within Texas.

 

With the TWC’s TRACER 2 application “out to pasture” as the TWC puts it, data can now be accessed using a combination of other TWC databases, as well as United States Bureau of Labor Statistics (“BLS”) data such as the Local Area Unemployment Statistics (“LAUS”) and the Current Employment Statistics (“CES”).

Case Update: Travel Time Analyses

A common allegation in wage and hour lawsuits is off-the-clock work.  In these types of cases, employees usually allege that they performed work, such as travel between job sites, that they were not paid for performing.  Other common off-the-clock-work allegations typically involve activities such as spending time in security checkpoints, putting on a uniform, preparing for work, and logging onto computer systems.

 

Recently, the EmployStats Wage and Hour Consulting team completed work on a case where Plaintiffs alleged unpaid off-the-clock work for time spent driving from their homes to their job sites, as well as travel time between job sites.  In this case, EmployStats was able to analyze and assess Plaintiffs’ allegations by combining and creating datasets of personnel and job location data, and using mapping programs to calculate the time Plaintiffs could have potentially spent traveling and performing off-the-clock work.

 

The following is an example of how the EmployStats Wage and Hour Consulting team typically handles a case involving travel time (a brief code sketch follows the list):

  1. First, the EmployStats team works to combine and merge multiple databases containing employee home locations, employee time and payroll records, and job site locations into a single analyzable database.
  2. The EmployStats team then uses mapping platforms, such as Google Maps API or Mapquest API, to calculate the distance in miles and/or travel time in hours for each unique trip.
  3. Finally, the EmployStats team uses the employee time and payroll records to assess any potential damages due to travel time off-the-clock work.
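
The minimal sketch below illustrates steps 1 and 2 under simplifying assumptions: the file and column names are hypothetical, and a straight-line (haversine) distance stands in for the driving distance or time a Google Maps or MapQuest API call would return.

```python
import numpy as np
import pandas as pd

def haversine_miles(lat1, lon1, lat2, lon2):
    """Straight-line distance in miles; a mapping API would return driving distance/time."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 3958.8 * 2 * np.arcsin(np.sqrt(a))

# Hypothetical inputs: employee home coordinates and daily job-site assignments
homes = pd.read_csv("employee_homes.csv")     # employee_id, home_lat, home_lon
trips = pd.read_csv("daily_job_sites.csv")    # employee_id, work_date, site_lat, site_lon

# Step 1: merge into a single analyzable dataset
merged = trips.merge(homes, on="employee_id", how="left")

# Step 2: distance for each unique home-to-site trip
merged["trip_miles"] = haversine_miles(
    merged["home_lat"], merged["home_lon"], merged["site_lat"], merged["site_lon"]
)
print(merged.groupby("employee_id")["trip_miles"].sum())
```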

 

Check out the EmployStats website to see how we can help you with your wage and hour cases!

Data Analytics and the Law: Cleaning Data

After acquiring and merging data, litigants will want to rush to an analysis. But raw datasets, no matter how perfectly constructed, are inevitably riddled with errors. Such errors can potentially bias or invalidate results. Data cleaning, the process which ensures a slice of data is correct, consistent, and usable, is a vital step for any data-based evidence.

 

There is an often-quoted rule in data science which says 80% of one’s time is spent cleaning and manipulating data, while only 20% is spent actually analyzing it. Spelling mistakes, outliers, duplicates, extra spaces, missing values: the list of potential complications is nearly infinite. Corrections should be recorded at every stage, ideally in scripts of the program being used (e.g., R, SAS, SQL, STATA); data cleaning scripts leave behind a structured, defensible record. Different types of data will require different types of cleaning, but a structured approach will produce error-free analytical results.

 

One should start with simple observations. Look at batches of random rows: what values are stored for a given variable, and are those values consistent? Some rows may format phone numbers differently, inconsistently capitalize, or round values. How many values are null, and are there patterns in the null entries? Calculate summary statistics for each variable: are there obvious mistakes (e.g., negative time values)? After this assessment, cleaning can begin.
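
A minimal pandas sketch of this initial assessment, with hypothetical file and column names:

```python
import pandas as pd

df = pd.read_csv("time_records.csv")          # hypothetical raw extract

print(df.sample(10, random_state=1))          # spot-check random rows for consistency
print(df["phone"].head(20))                   # formatting differences show up quickly
print(df.isna().mean().sort_values(ascending=False))   # share of null values per column
print(df.describe())                          # obvious mistakes, e.g. negative hours
print(df[df["hours_worked"] < 0].head())      # flag impossible values for follow-up
```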

 

Fixing structural errors is straightforward: inconsistent spellings and capitalization, split values (e.g., data containing both ‘N/A’ and ‘Not Available’), and formatting issues (e.g., numbers stored as strings rather than integers) can be systematically reformatted. Duplicate observations, common when datasets are merged, can be easily removed.
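
A minimal pandas sketch of these structural fixes, again with hypothetical file and column names:

```python
import pandas as pd

df = pd.read_csv("time_records.csv")          # hypothetical raw extract

# Standardize capitalization and stray whitespace
df["location"] = df["location"].str.strip().str.title()

# Collapse split values into one canonical label
df["status"] = df["status"].replace({"N/A": "Not Available"})

# Numbers stored as strings -> numeric (bad entries become NaN for later review)
df["hours_worked"] = pd.to_numeric(df["hours_worked"], errors="coerce")

# Drop exact duplicates introduced by merging overlapping extracts
df = df.drop_duplicates()
```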

 

However, data cleaning is not entirely objective. Reasonable assumptions must be made when handling irrelevant observations, outliers, and missing values. If class X or transaction type Y is excluded from the litigation, it is reasonable to remove their observations. However, one cannot automatically assume Z, a similar class, can be removed as well. Outliers function the same way: what legal reasoning justifies removing this value from the dataset? A suspicious measurement is a good reason, but the fact that a value is unusually large or small does not, by itself, make it reasonable to remove.

 

Missing data is a difficult problem: how many missing or null values are acceptable for this analysis to still produce robust results? Should you ignore missing values, or should you generate values based on similar data points? There is no easy answer. Both approaches assume missing observations are similar to the rest of the dataset, yet the fact that observations are missing data is informative in and of itself. A more cautious stance, the one with the fewest assumptions, will inevitably be easier to defend in court.
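
A minimal sketch of the two approaches, dropping versus imputing, with hypothetical file and column names; either choice is an assumption that should be documented:

```python
import pandas as pd

df = pd.read_csv("time_records.csv")          # hypothetical raw extract

print(df["hours_worked"].isna().mean())       # how large is the problem?

# Option 1: ignore rows with missing hours (assumes they resemble the rest)
dropped = df.dropna(subset=["hours_worked"])

# Option 2: impute from similar data points, e.g. the employee's own median
imputed = df.copy()
imputed["hours_worked"] = imputed.groupby("employee_id")["hours_worked"].transform(
    lambda s: s.fillna(s.median())
)
```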

 

Skipping data cleaning, and assuming perfect data, casts doubt on any final product. Data-based evidence follows the maxim “garbage in, garbage out.”

Case Update: Time Clock Rounding Analyses

The EmployStats Wage & Hour Consulting Team recently completed work on a case in the state of New York where the Plaintiffs alleged unpaid straight time and overtime compensation due to the Defendant’s timekeeping policies.

In this case, as well as others that EmployStats has worked on in the past, the Plaintiffs alleged that the Defendant had a timekeeping policy which systematically understated the employees’ time worked in a given pay period.  Some time clock rounding policies may be neutral in principle but non-neutral in practice: for any number of reasons, the employee or the employer may benefit more often than not from a seemingly neutral rounding policy.
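
A minimal sketch of how a rounding policy’s net effect can be quantified, assuming quarter-hour rounding for illustration and hypothetical file and column names; a policy that is neutral in practice should net out near zero:

```python
import pandas as pd

punches = pd.read_csv("punch_data.csv")       # hypothetical: employee_id, actual_minutes per shift

# Quarter-hour rounding: round each shift's minutes to the nearest 15
punches["rounded_minutes"] = (punches["actual_minutes"] / 15).round() * 15
punches["difference"] = punches["rounded_minutes"] - punches["actual_minutes"]

# A neutral policy should average out near zero across shifts and employees
print("Average minutes gained/lost per shift:", punches["difference"].mean())
print("Share of shifts where the employee lost time:", (punches["difference"] < 0).mean())
print(punches.groupby("employee_id")["difference"].sum().describe())
```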

The analysis that we perform typically involves manipulating, matching and analyzing big data from inherently incompatible time and payroll databases.  In addition to analyzing the alleged straight time and overtime compensation owed to employees, EmployStats also assists attorneys in the calculation of penalties.

In states such as California and New York, there are penalties for noncompliance with the labor codes.  We work with attorneys to calculate the appropriate penalties and interest in the lawsuit or investigation.  The EmployStats Wage & Hour Consulting Team is proficient at providing calculations and tabulations that are insightful and well documented.

 

Upcoming CLE in Baltimore, Maryland

EmployStats is sponsoring a CLE seminar on Data Analytics in Complex Litigation at the University of Baltimore in the Merrick School of Business on April 5, 2019 from 9:30AM to 1:30PM. Complex litigation entails an enormous amount of data, which may appear impossible to sort or manage. This CLE seminar is all about how large datasets are analyzed in litigation.

The Merrick School of Business is located in the Mt. Vernon neighborhood of midtown Baltimore. Attendees are within walking distance of nearby Penn Station and a number of museums and restaurants.

Attendees will receive a complimentary breakfast and lunch, and hear from our accredited speakers: Roberto Cavazos, Ph.D., Kyle Cheek, Ph.D., Dwight Steward, Ph.D., and Vince McKnight. Our speakers have performed analytics work for top law firms and multinational companies across industries. They will be covering a wide range of issues in Data Analytics and how its tools are applicable across the legal profession.

Looking to enroll? Visit: https://www.bigdatacleseminar.com/

Data Analytics and the Law: Integration

The enormous volume of data generated by an organization will typically outgrow its infrastructure. Changes in an organization’s work flow affect data in a variety of ways, which in turn affect the use of internal data as evidence in litigation.

 

Often, data is transformed to ensure different systems within an organization exchange, interpret, and use it cohesively. Data integration and interoperability are complex challenges for organizations deploying big data architectures, as data is heterogeneous by nature: siloed storage emerges from the differing demands and specifications of different departments. Legacy data, which may have been administratively useful previously, is stored, replaced, and frequently lost in the transition. All of this helps explain why roughly two-thirds of the electronic data collected by organizations goes unused. Constant demands to reconfigure data processes, structures, and architecture carry significant risks for organizations, as these demands outpace administrative protocols and laws.

 

Properly integrating different data sources for an analysis involves an awareness of all these technical complications.

 

Once potential data sources are identified for an analysis, the next step is to inspect the variables which will be integrated. Knowing exactly what each variable means may involve additional questions and scrutiny, but it is an important step. In a given dataset, what is defined as “earnings”? What are all the potential values for a “location” variable? Are certain sensitive values, like a user’s social security number, masked to ensure privacy? Variables are also defined by a class, or acceptable input value. In one table, a given date may be stored as a datetime class, while another may store the same value as a character string.
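
A minimal pandas sketch of this kind of variable inspection, with hypothetical file and column names:

```python
import pandas as pd

earnings = pd.read_csv("earnings_extract.csv")   # hypothetical extract

print(earnings.dtypes)                  # which columns are numeric, strings, dates?
print(earnings["location"].unique())    # all potential values for a "location" variable

# Same value, different classes: parse a date stored as a character string
earnings["pay_date"] = pd.to_datetime(earnings["pay_date"], errors="coerce")

# Mask a sensitive identifier before the data leave the analysis environment
earnings["ssn_masked"] = "XXX-XX-" + earnings["ssn"].astype(str).str[-4:]
```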

 

Confidence in the variables’ meanings will be reflected in the confidence of the analysis, and ultimately in the presentation of evidence in court. A party bears additional risk if its expert witnesses are unable to explain the ‘real meaning’ of a value under scrutiny.

 

A party also needs to know how potential variables will be used to merge datasets. Merging data within a database is easily done with primary keys, whereas merging two differently structured sources requires more effort. How many common variables are necessary when merging two sources to prevent the deletion of similar values? How much overlap between variables will yield an acceptably sized dataset? These factors affect the final output. Faulty merges, null values, and accidental data removals cost time and resources to resolve.
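
A minimal pandas sketch of a merge on common keys, with hypothetical file and column names; the indicator flag surfaces records that fail to match:

```python
import pandas as pd

payroll = pd.read_csv("payroll.csv")        # employee_id, pay_period, gross_pay
timeclock = pd.read_csv("timeclock.csv")    # employee_id, pay_period, hours

# Merge on the common keys; indicator flags rows present in only one source
combined = payroll.merge(
    timeclock, on=["employee_id", "pay_period"], how="outer", indicator=True
)

print(combined["_merge"].value_counts())    # how many records failed to match?
unmatched = combined[combined["_merge"] != "both"]
unmatched.to_csv("unmatched_records.csv", index=False)   # resolve before analysis
```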

 

There are various methods to extract, transform, and load disparate data into a unified schema. For simplicity, the ideal scenario is to merge and aggregate the necessary inputs into the fewest datasets possible: a few massive tables are easier to work with than scattered sources whose relevance must be proven as a whole. Proper data integration will reveal whether a litigant’s data is a gold mine or a time bomb.