Flat File Testing

Testing of delimited files such as CSV files and fixed-width flat files.

What are Flat Files?

Flat files are extensively used for exchanging data between enterprises or between organizations within an enterprise. Flat files come in two forms: delimited files such as CSV (comma-separated) files, and fixed-width files.


What is Flat File Testing?

Flat File testing is the process of validating the quality of data in the flat file as well as ensuring that the data in the flat file has been consumed appropriately by the application or ETL process.


What are the Challenges in Flat File Testing?

Testing of inbound flat files presents unique challenges because the producer of the flat file is usually a different organization within the enterprise or an external vendor. Consequently, there might be differences in the format and content of the files, since there is no easy way to enforce data type and data quality constraints on the data in the flat files. Issues in flat file data can cause failures in the consuming process. While file processing requirements differ from project to project, the focus of this use case is to list some of the common checks that need to be performed when validating flat files.
 

Flat File Testing Categories

Flat File Ingestion Testing

When data is moved using flat files between enterprises or between organizations within an enterprise, it is important to perform a set of file ingestion validations on the inbound flat files before consuming the data in those files.
File name validation
Files are FTP'ed or copied over to a specific folder for processing. These files usually follow a specific naming convention so that the process consuming the file can identify the contents and the date. From a testing standpoint, the file name pattern needs to be validated to verify that it meets the requirement.

Example: A government agency gets files from multiple vendors on a periodic basis. The arriving files should follow a naming convention of 'CompanyCode_ContentType_DateTimestamp.csv'. However, the files coming in from a specific vendor do not have the correct company code.
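
For illustration, a minimal Python sketch of such a file name check. The exact regular expression and the sample names below are assumptions, not part of the requirement above:

import re

# Assumed convention: CompanyCode_ContentType_DateTimestamp.csv
FILE_NAME_PATTERN = re.compile(r'^[A-Z0-9]+_[A-Za-z]+_\d{8}\.csv$')

def is_valid_file_name(file_name):
    return bool(FILE_NAME_PATTERN.match(file_name))

print(is_valid_file_name('ACME_Contacts_20240115.csv'))  # True
print(is_valid_file_name('Contacts_20240115.csv'))       # False - company code missing
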
Size and Format of the flat files
Although flat files are generally delimited or fixed width, it is common for these files to have a header and footer. Sometimes the header contains a row count that can be used to verify that the file contains all the expected data.

Some of the relevant checks are:
   Verify that the size of the file is within the expected range where applicable.
   Verify that the header, footer and column heading rows have the expected format and have the expected location within the flat file.
   Perform any row count checks to cross-check the data in the header against the values in the delimited data.

Example: A financial reporting company generates files with a header that contains a summary amount, with the line items carrying the detailed split. The sum of the amounts in the line items should match the summary amount in the header.
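
A minimal Python sketch of these checks, assuming a hypothetical layout where the first row is a header of the form 'HDR,<row count>,<summary amount>' and the detail rows carry the amount in their second column:

import csv

def validate_header_totals(path):
    with open(path, newline='') as f:
        rows = list(csv.reader(f))
    header, details = rows[0], rows[1:]
    expected_count, expected_total = int(header[1]), float(header[2])
    actual_total = sum(float(row[1]) for row in details)  # assumed amount column
    assert len(details) == expected_count, 'row count mismatch'
    assert abs(actual_total - expected_total) < 0.01, 'summary amount mismatch'
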
File arrival, processing and deletion times
Files arrive periodically in a specific network folder or FTP location before getting consumed by a process. Usually, there are specific requirements that need to be met regarding file arrival time, order of arrival and retention.

Example: A pharma company gets a set of files from a vendor on a daily basis. The process consuming these files expects the complete set of files to be available before processing:
1. A file that was supposed to arrive yesterday was delayed. It came in after today's file had arrived, causing issues because the files were processed out of order.
2. After a file is processed, it is supposed to be moved to a specific directory where it is retained for a specified period of time and then deleted. However, the file did not get copied over.
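
A minimal Python sketch of arrival and retention checks along these lines. The folder locations and the 30-day retention period are assumptions:

import os
import time

INBOUND_DIR = '/data/inbound'   # hypothetical landing folder
ARCHIVE_DIR = '/data/archive'   # hypothetical archive folder
RETENTION_DAYS = 30             # assumed retention requirement

def check_file_lifecycle(file_name):
    inbound = os.path.join(INBOUND_DIR, file_name)
    archived = os.path.join(ARCHIVE_DIR, file_name)
    # The file should have arrived in the inbound folder or, if already
    # processed, been moved to the archive folder.
    if not os.path.exists(inbound) and not os.path.exists(archived):
        return 'FAIL: file never arrived'
    # Archived copies older than the retention period should have been deleted.
    if os.path.exists(archived):
        age_days = (time.time() - os.path.getmtime(archived)) / 86400
        if age_days > RETENTION_DAYS:
            return 'FAIL: archived file kept beyond the retention period'
    return 'PASS'
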
Automate file ingestion testing using ETL Validator

ETL Validator comes with Component Test Case and File Watcher which can be used to test Flat Files.

   Flat File Component: The Flat File Component is part of the Component Test Case. It can be used to define data type and data quality rules for the incoming flat file. The data in the flat file can also be compared with data from the database.
   File Watcher: Using File Watcher, test plans can be triggered automatically when a new file arrives in a directory, so that the test cases on the file are executed before the file is used further by the consuming process.
   SFTP Connection: Makes it easy to compare and validate flat files located on a remote SFTP server.

Flat File Data Type Testing

The purpose of Data Type testing is to verify that the type and length of the data in the flat file are as expected.
Data Type Check
Verify that the type and format of the data in the inbound flat file matches the expected data type for the file. For date, timestamp and time data types, the values are expected to be in a specific format so that they can be parsed by the consuming process.
Example: An ID column of the flat file is expected to contain only numbers. However, a few rows in the flat file contain non-numeric characters.
Data Length Check
The length of string and number values in the flat file should not exceed the maximum allowed length for those columns.

Example: Data for the comments column has more than 4000 characters in the inbound flat file while the limit for the corresponding column in the database is only 2000 characters.
Not Null Check
Verify that any required data elements in the flat file have data for all the rows.

Example: Date of Birth is a required data element but some of the records are missing values in the inbound flat file.
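
The three checks above can be scripted against the raw file before it is consumed. A minimal Python sketch, assuming a hypothetical customer file with 'id', 'comments' and 'date_of_birth' columns and a 2000-character limit on comments:

import csv
import re

NUMERIC = re.compile(r'^\d+$')
ISO_DATE = re.compile(r'^\d{4}-\d{2}-\d{2}$')

def find_invalid_records(path):
    problems = []
    with open(path, newline='') as f:
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            if not NUMERIC.match(row['id']):               # data type check
                problems.append((line_no, 'id is not numeric'))
            if len(row['comments']) > 2000:                # data length check
                problems.append((line_no, 'comments exceed 2000 characters'))
            if not row['date_of_birth'].strip():           # not null check
                problems.append((line_no, 'date_of_birth is missing'))
            elif not ISO_DATE.match(row['date_of_birth']): # date format check
                problems.append((line_no, 'date_of_birth is not YYYY-MM-DD'))
    return problems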

Automate flat file data type testing with ETL Validator


ETL Validator provides the capability to specify data type checks on the flat file in the Flat File Component. Based on the data types specified, ETL Validator automatically checks all the records in the incoming flat file to find any invalid records.

Flat File Data Quality Testing

The purpose of Data Quality tests is to verify the accuracy of the data in the inbound flat files.
Duplicate Data Checks
Check for duplicate rows in the inbound flat file using the unique key column or a unique combination of columns, as per the business requirement.
Example: The business requirement says that the combination of First Name, Last Name, Middle Name and Date of Birth should be unique for the Customer list flat file.
Sample query to identify duplicates (assuming that the flat file data can be imported into a database table)
SELECT fst_name, lst_name, mid_name, date_of_birth, count(1) FROM Customer GROUP BY fst_name, lst_name, mid_name, date_of_birth HAVING count(1) > 1
Reference Data Checks
Flat file standards may dictate that the values in certain columns should adhere to the values in a domain. Verify that the values in the inbound flat file conform to the reference data standards.
Example: Values in the country_code column should have a valid country code from a Country Code domain.
select distinct country_code from address
minus
select country_code from country
Data Validation Rules
Many data fields can contain a range of values that cannot be enumerated. However, there are reasonable constraints or rules that can be applied to detect situations where the data is clearly wrong. Instances of fields containing values that violate the defined validation rules represent a quality gap that can impact inbound flat file processing.
Example: Date of birth (DOB). This is defined as the DATE datatype and can assume any valid date. However, a DOB in the future, or more than 100 years in the past, is probably invalid. Also, the date of birth of a child should be later than that of their parents.
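
A minimal Python sketch of the DOB rules above; the 100-year window is approximated with a fixed day count for simplicity:

import datetime

def dob_violations(dob, today=None):
    today = today or datetime.date.today()
    violations = []
    if dob > today:
        violations.append('date of birth is in the future')
    # Approximate 100-year cutoff; ignores leap-year drift.
    if dob < today - datetime.timedelta(days=365 * 100):
        violations.append('date of birth is more than 100 years in the past')
    return violations

print(dob_violations(datetime.date(1890, 1, 1)))
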
Data Integrity Checks
This check addresses "keyed" relationships of entities within a domain. The goal is to identify orphan records in the child entity that have a foreign key to the parent entity:
   1. Count of records with null foreign key values in the flat file
   2. Count of invalid foreign key values in the flat file that do not have a corresponding primary key in the parent flat file or database table

Example: Consider a file import process for a CRM application which imports contact lists for existing Accounts. The contact lists are CSV files with a column containing the corresponding account_id. Let's assume that the contact list can be loaded into a database table for the purpose of validation.
1. Count of records with null account_id values in the contact list:
SELECT count(*) FROM contacts WHERE account_id IS NULL

2. Count of invalid foreign key values in the contact list:
SELECT account_id FROM contacts
minus
SELECT account_id FROM accounts

Automate flat file data quality testing using ETL Validator


ETL Validator supports defining data quality rules in the Flat File Component for automating data quality testing without writing any database queries. Custom rules can be defined and added to the Data Model template.

Flat File Data Completeness Testing

Data in the inbound flat files is generally processed and loaded into a database. In some cases the output may also be another flat file. The purpose of Data Completeness tests is to verify that all the expected data is loaded in the target from the inbound flat file. Some of the tests that can be run are: comparing and validating counts, aggregates (min, max, sum, avg) and actual data between the flat file and the target.
Record Count Validation
Compare the record count of the flat file with that of the target database table. Check for any rejected records.
Example: A simple comparison of record counts between the source flat file and the target table.
Source Query (assuming the flat file data is loaded into 'customer' table for validation)
SELECT count(1) src_count FROM customer
Target Query
SELECT count(1) tgt_count FROM customer_dim
Column Data Profile Validation
Column or attribute level data profiling is an effective tool to compare source and target data without actually comparing the entire data. It is similar to comparing the checksum of your source and target data. These tests are essential when testing large amounts of data.

Some of the common data profile comparisons that can be done between the flat file and target are:
   Compare unique values in a column between the flat file and target
   Compare max, min, avg, max length and min length values for columns depending on the data type
   Compare null values in a column between the flat file and target
   For important columns, compare data distribution (frequency) in a column between the flat file and target

Example 1: Compare counts of columns with values (non-null values) between source and target for each column based on the mapping.
Source Query (assuming the flat file data is loaded into 'customer' table for validation)
SELECT count(row_id), count(fst_name), count(lst_name), avg(revenue) FROM customer
Target Query
SELECT count(row_id), count(first_name), count(last_name), avg(revenue) FROM customer_dim

Example 2: Compare the number of customers by country between the source and target.
Source Query (assuming the flat file data is loaded into 'customer' table for validation)
SELECT country, count(*) FROM customer GROUP BY country
Target Query
SELECT country_cd, count(*) FROM customer_dim GROUP BY country_cd
Compare entire flat file and target data
Compare data (values) between the flat file and the target, effectively validating 100% of the data. In regulated industries such as finance and pharma, 100% data validation might be a compliance requirement. It is also a key requirement for data migration projects. However, performing 100% data validation is a challenge when large volumes of data are involved. This is where ETL testing tools such as ETL Validator can be used, because they have an inbuilt ELV engine (Extract, Load, Validate) capable of comparing large volumes of data.
Example: Write a source query on the flat file that matches the data in the target table after transformation.
Source Query (assuming the flat file data is loaded into 'customer' table for validation)
SELECT cust_id, fst_name, lst_name, fst_name||','||lst_name, DOB FROM Customer
Target Query
SELECT integration_id, first_name, Last_name, full_name, date_of_birth FROM Customer_dim

Automate flat file data completeness testing using ETL Validator


ETL Validator comes with Flat File Component and Data Profile Component as part of Component Test Case for automating the comparison of flat file and target data. It takes care of loading the flat file data into a table for running validations.

   Data Profile Component: Automatically computes profile of the flat file data and target query results - count, count distinct, nulls, avg, max, min, maxlength and minlength.
   Component Test Case: Provides a visual test case builder that can be used to compare multiple flat files and target.

Flat File Data Transformation Testing

Data in the inbound flat file is transformed by the consuming process and loaded into the target (table or file). It is important to test the transformed data. There are two approaches for testing transformations: white box testing and black box testing.
Transformation testing using White Box approach
White box testing is a testing technique that examines the program structure and derives test data from the program logic or code.
For transformation testing, this involves reviewing the transformation logic from the flat file data ingestion design document and corresponding code to come up with test cases.

The steps to be followed are listed below:
   Review the transformation design document
   Apply transformations on the flat file data using SQL or a procedural language such as PL/SQL to reflect the ETL transformation logic
   Compare the results of the transformed data with the data in the target table or target flat file.

The advantage of this approach is that the tests can easily be rerun on a larger data set. The disadvantage is that the tester has to reimplement the transformation logic.

Example: In a financial company, the interest earned on a savings account depends on the daily balance in the account for the month. The daily balances for the month are part of an inbound CSV file for the process that computes the interest.
1. Review the requirement and design for calculating the interest.
2. Implement the logic using your favorite programming language.
3. Compare your output with data in the target table.
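
A minimal white-box sketch for this example. The interest formula (average daily balance times a monthly rate) and the 2% annual rate are assumptions, not the actual business rule:

ANNUAL_RATE = 0.02  # assumed rate

def expected_interest(daily_balances):
    # Reimplement the assumed transformation logic independently of the ETL code.
    avg_balance = sum(daily_balances) / len(daily_balances)
    return round(avg_balance * ANNUAL_RATE / 12, 2)

daily_balances = [1000.0, 1200.0, 1100.0]  # parsed from the inbound CSV file
interest_in_target = 1.83                  # fetched from the target table
assert expected_interest(daily_balances) == interest_in_target
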
Transformation testing using Black Box approach
Black box testing is a method of software testing that examines the functionality of an application without peering into its internal structures or workings. For transformation testing, this involves reviewing the transformation logic from the mapping design document and setting up the test data appropriately.

The steps to be followed are listed below:
   Review the requirements document to understand the transformation requirements
   Prepare test data in the flat file to reflect different transformation scenarios
   Come up with the expected values for the transformed data using the test data from the previous step
   Compare the results of the transformed test data in the target table with the expected values.

The advantage of this approach is that the transformation logic does not need to be reimplemented during testing. The disadvantage is that the tester needs to set up test data for each transformation scenario and come up with the expected values for the transformed data manually.

Example: In a financial company, the interest earned on a savings account depends on the daily balance in the account for the month.
1. Review the requirement for calculating the interest.
2. Setup test data in the flat file for various scenarios of daily account balance.
3. Compare the transformed data in the target table with the expected values for the test data.
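
A minimal black-box sketch for the same example: the expected interest for each test scenario is worked out by hand from the requirements and compared with the values loaded into the target. The scenarios and figures below are hypothetical:

# Hand-computed expected values for each test data scenario.
expected = {'constant_balance': 1.67, 'rising_balance': 1.83, 'zero_balance': 0.0}

# Stand-in for the values queried from the target table after the ETL run.
actual = {'constant_balance': 1.67, 'rising_balance': 1.83, 'zero_balance': 0.0}

for scenario, value in expected.items():
    assert actual[scenario] == value, 'mismatch in scenario ' + scenario
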
Automate data transformation testing using ETL Validator

ETL Validator comes with Component Test Case which can be used to test transformations using the White Box approach or the Black Box approach.

   Visual Test Case Builder: Component test case has a visual test case builder that makes it easy to rebuild the transformation logic for testing purposes.
   Workschema: ETL Validator's workschema stores the test data from source and target queries. This makes it easy for the tester to implement transformations and compare using a Script Component.
   Benchmark Capability: Makes it easy to baseline the target table (expected data) and compare the latest data with the baselined data.

Flat File Ingestion Performance Testing

The goal of performance testing is to validate that the process consuming the inbound flat files is able to handle flat files with the expected data volumes and inbound arrival frequency.

Example 1: The process ingesting the flat file might perform well when there are only a few records in the file but perform poorly when there is a large number of rows.

Example 2: The flat file ingestion process may also perform poorly as data volumes increase in the target table.
End-to-End Data Testing of Flat File Ingestion
Integration testing of the inbound flat file ingestion process and the related applications involves the following steps:
   Estimate the expected data volumes in each of the source flat files for the consuming process for the next 1-3 years.
   Set up test data for performance testing, either by generating sample flat files or by obtaining representative sample files (see the sketch below).
   Execute the flat file ingestion process to load the test data into the target.
   Execute the flat file ingestion process again with a large volume of data in the target tables to identify bottlenecks.
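
For the test data step, sample flat files can be generated synthetically. A minimal Python sketch with a hypothetical column layout and row count:

import csv
import random

ROWS = 1000000  # assumed target volume

with open('customer_sample.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'first_name', 'country', 'revenue'])
    for i in range(ROWS):
        writer.writerow([i, 'name_%d' % i,
                         random.choice(['US', 'CA', 'GB']),
                         round(random.uniform(0, 10000), 2)])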
