Working with large data sets: The new CMS medical records files

The new data files released by CMS regarding the payments made to U.S. medical doctors by drug and medical device manufacturers contain a treasure trove of information.  However, the sheer size of the data will, for some users, limit its use and the nuggets that can be mined from it.

Using the statistical program STATA, which is generally one of the fastest and most efficient ways to handle large data sets, required an allocation of 6G of RAM just to read in the data. STATA is efficient at handling large wage and hour, employment, and business data sets (such as ones with many daily prices).

The table below shows what STATA required in terms of memory to be able to read the data:
Current memory allocation

                    current                                 memory usage
    settable          value     description                 (1M = 1024k)
    --------------------------------------------------------------------
    set maxvar         5000     max. variables allowed           1.947M
    set memory         6144M    max. data space              6,144.000M
    set matsize         400     max. RHS vars in models          1.254M
    --------------------------------------------------------------------
                                                             6,147.201M
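For readers setting up a similar run, the lines below sketch the kind of allocation commands involved. This is a minimal sketch only: it assumes STATA/SE or MP version 11 or earlier, where memory must be set by hand before loading the data (newer versions size memory automatically), and the file name is a placeholder for the actual CMS extract.

    * Allocate memory before reading the CMS payments data
    * (set memory applies to Stata 11 and earlier; later versions manage memory automatically)
    clear all
    set maxvar  5000       // maximum number of variables allowed
    set memory  6144m      // 6G of data space for the payment records
    set matsize 400        // maximum right-hand-side variables in models

    * Placeholder file name -- substitute the actual CMS general payments extract
    insheet using "cms_general_payments.csv", comma clear
    describe, short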

Steps for converting non-analyzable wage, time, and business electronic data

When manual data entry of non-analyzable financial or wage data is not an option, OCR software combined with specially designed and written data-cleaning routines is a good alternative.

For example, in our approach we use a number of OCR programs, including ABBYY FineReader, to first translate the data into a format that is recognized by statistical programs such as STATA and scripting languages such as VBA.

Once the data is converted, we write specialized computer software routines to extract the relevant data from the converted file.  The computer code, which is written in STATA, VBA, or another scripting language, puts the extracted data into a format that can be analyzed by statistical and spreadsheet programs.
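As an illustration only, the sketch below shows the kind of extraction routine involved, written here in STATA. The file name, record layout, and variable names are hypothetical and would be tailored to the actual OCR output, which in this example is assumed to contain one punch record per line (an employee ID, a date, and in/out times).

    * Read each line of the OCR'ed text file into a single string variable
    infix str244 rawline 1-244 using "ocr_output.txt", clear

    * Keep only lines that look like punch records: an ID, a date, then times
    keep if regexm(rawline, "^[0-9]+ +[0-9]+/[0-9]+/[0-9]+")

    * Pull the individual fields out of each line
    gen str10 empid    = regexs(1) if regexm(rawline, "^([0-9]+) +([0-9/]+) +([0-9:]+) +([0-9:]+)")
    gen str10 workdate = regexs(2) if regexm(rawline, "^([0-9]+) +([0-9/]+) +([0-9:]+) +([0-9:]+)")
    gen str8  timein   = regexs(3) if regexm(rawline, "^([0-9]+) +([0-9/]+) +([0-9:]+) +([0-9:]+)")
    gen str8  timeout  = regexs(4) if regexm(rawline, "^([0-9]+) +([0-9/]+) +([0-9:]+) +([0-9:]+)")

    * Convert the date string to a STATA date and save in analyzable form
    gen date = date(workdate, "MDY")
    format date %td
    drop rawline
    save punches_clean, replace

Because the routine is an ordinary script run against the converted file, every extraction step is documented in the code itself.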

This approach to converting wage, business, employment, or other types of data has the advantage that it can be reproduced by either party if required.

Having both the data cleaning and the statistical and economic analysis performed by the same economic outfit and team is desirable.  Data cleaning is not performed in a vacuum; that is, the very definition of ‘dirty data’ depends on what the data is to be used for.  Some data items may not be converted very well by the OCR and software code, but those items may be of little value in the economic and statistical analysis in the first place.

One advantage of using the same research outfit to do both the data cleaning and the economic and statistical analysis is that this distinction, between data items that matter for the analysis and those that do not, gets made early in the analysis process.

Converting and analyzing wage and business data from PDFs

Some wage and business data is electronic but is not analyzable in the format in which it is maintained by the employer or company.

For instance, some employers use computerized data systems to record the start times, lunch periods, and end times of certain employees.  When reviewing this data in the regular course of business, some of these employers look at standardized, pre-formatted reports of the time punch data instead of the actual underlying time punches made by each individual employee.  Many of these standardized reports are presented in a PDF or other non-analyzable electronic format.

Similarly, some businesses retain certain information, such as itemized copies of purchase orders, only in a PDF or other non-analyzable electronic format.

The task, when addressing economic damage issues that rely on this type of non-analyzable electronic information, is to accurately and efficiently translate the data into a format that can be analyzed using statistical programs such as STATA.  In cases with relatively small amounts of data, spreadsheet programs such as Excel could also be used.
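As a brief, hedged illustration of what an analyzable format allows, the sketch below continues from the hypothetical punches_clean file built in the extraction example above and computes daily hours worked. The variable names and the simple 8-hour screen are assumptions made for illustration, not a damages model.

    * Load the cleaned punch data produced from the converted reports
    use punches_clean, clear

    * Convert the HH:MM punch strings to minutes since midnight
    gen in_min  = real(substr(timein,  1, 2))*60 + real(substr(timein,  4, 2))
    gen out_min = real(substr(timeout, 1, 2))*60 + real(substr(timeout, 4, 2))

    * Hours recorded for each employee-day
    gen hours = (out_min - in_min)/60
    summarize hours

    * Example screen: flag employee-days with more than 8 recorded hours
    gen byte over8 = (hours > 8) if !missing(hours)
    tabulate over8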

How is this done?