ETL Testing


Type 2 Slowly Changing Dimensions (SCDs) are used in data warehouses to track changes to data by preserving historical values. This is achieved by creating a new record in the dimension whenever a value in the set of key columns is modified, and by maintaining start and end dates for each record. The latest records are identified either by querying for records that are not end-dated or by maintaining a flag (e.g., CURRENT_FLG) to easily identify them.
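To make the versioning mechanics concrete, below is a minimal sketch in Python of how a Type 2 change might be applied; the record layout and the apply_change helper are hypothetical illustrations, not the logic of any particular ETL tool.

# Minimal illustration of SCD Type 2 versioning. When a tracked attribute
# changes, the current row is end-dated and a new current row is inserted,
# preserving history. Names here (emp_no, apply_change) are hypothetical.
from datetime import date

dimension = [
    {"emp_no": 101, "job_title": "Analyst", "start_dt": date(2020, 1, 1),
     "end_dt": None, "current_flg": "Y"},
]

def apply_change(dim, emp_no, column, new_value, effective_dt):
    """End-date the current record and insert a new current record."""
    for row in dim:
        if row["emp_no"] == emp_no and row["current_flg"] == "Y":
            if row[column] == new_value:
                return  # value unchanged, nothing to version
            row["end_dt"] = effective_dt      # close out the old version
            row["current_flg"] = "N"
            dim.append(dict(row, **{column: new_value, "start_dt": effective_dt,
                                    "end_dt": None, "current_flg": "Y"}))
            return

apply_change(dimension, 101, "job_title", "Senior Analyst", date(2021, 6, 1))
# dimension now holds two rows for emp_no 101: one historical, one current.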

 

Testing SCD Type 2 Dimensions

Testing SCD Type 2 dimensions is tricky because it cannot be achieved by a simple comparison of the source and target data. In this article, we will examine the different aspects of a Type 2 SCD that can be tested using ETL Validator.

For the sake of this article, let's consider an SCD Type 2 Employee dimension (EMPLOYEE_DIM) which is sourced from a table called EMPLOYEE in the source system.

 

The EMPLOYEE table has the following columns:
ROW_ID
EMP_NO
FIRST_NAME
LAST_NAME
SSN
DOB
JOB_TITLE
SALARY

The EMPLOYEE_DIM SCD Type 2 dimension table has the following columns:
ROW_WID
EMP_NO
FIRST_NAME
LAST_NAME
SSN
DOB
JOB_TITLE
SALARY
START_DT
END_DT
CURRENT_FLG

 

Test 1: Verifying the current data

Use a Query Compare test case in ETL Validator to compare the current data records in the SCD Type 2 EMPLOYEE_DIM with the data in the source EMPLOYEE table. The source ROW_ID and the surrogate key ROW_WID are excluded from the comparison because they are not expected to match.

Source Query: select EMP_NO, FIRST_NAME, LAST_NAME, SSN, DOB, JOB_TITLE, SALARY from EMPLOYEE


Target Query: select EMP_NO, FIRST_NAME, LAST_NAME, SSN, DOB, JOB_TITLE, SALARY from EMPLOYEE_DIM where CURRENT_FLG = 'Y'

Test 2: Verifying the uniqueness of the key columns in the SCD

The combination of the key columns in the SCD should be unique. In the above example, the columns EMP_NO, FIRST_NAME, LAST_NAME, SSN, DOB, JOB_TITLE and SALARY form a unique key in the EMPLOYEE_DIM dimension. This can be easily verified using the Duplicate Check Rule in the Data Rules test plan of ETL Validator. The query generated by ETL Validator for the Duplicate Check Rule should look something like the following:

 

select EMP_NO, FIRST_NAME, LAST_NAME, SSN, DOB, JOB_TITLE, SALARY, COUNT(*) CNT
from EMPLOYEE_DIM
group by EMP_NO, FIRST_NAME, LAST_NAME, SSN, DOB, JOB_TITLE, SALARY
having COUNT(*) > 1

 

This query should not return any rows. 

Test 3: Verifying that historical data is preserved and new records are created

Whenever there is a change to the values in the key columns, a new record should be inserted into the EMPLOYEE_DIM and the old record should be end-dated. ETL Validator's Component Test Case can be used to verify this functionality. The Component test case has a Baseline and Compare feature which can be used to identify the changes in the EMPLOYEE_DIM.
 
Below are the steps:

  •   Create a Component test case and take a snapshot of the current values in the EMPLOYEE_DIM (called the Baseline)
  •   Modify a few records in the source EMPLOYEE table by updating the values in key columns such as SALARY and LAST_NAME
  •   Execute the ETL process so that the EMPLOYEE_DIM has the latest data
  •   Run the Component test case to compare the Baseline data with the Result table and identify the differences. Verify that the differences are as expected (a minimal sketch of this baseline-and-compare flow follows this list).
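
For readers who want to picture what Baseline and Compare does conceptually, here is a minimal Python sketch of the idea; the key columns and the fetch_employee_dim helper are hypothetical placeholders, and ETL Validator performs the equivalent internally.

# Conceptual baseline-and-compare: snapshot the dimension, re-run the ETL,
# then diff the snapshot against the new state. fetch_employee_dim and the
# key columns below are hypothetical.
def snapshot(rows, key_cols):
    """Index rows by key so the two states can be diffed."""
    return {tuple(r[c] for c in key_cols): r for r in rows}

def compare(baseline, result):
    """Return keys added, removed, or changed relative to the baseline."""
    added = [k for k in result if k not in baseline]
    removed = [k for k in baseline if k not in result]
    changed = [k for k in result if k in baseline and result[k] != baseline[k]]
    return added, removed, changed

# Usage sketch:
# baseline = snapshot(fetch_employee_dim(), ["EMP_NO", "START_DT"])
# ...modify source rows, run the ETL...
# added, removed, changed = compare(
#     baseline, snapshot(fetch_employee_dim(), ["EMP_NO", "START_DT"]))
# A Type 2 change should show up as one added row (the new version) and one
# changed row (the old version, now end-dated).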

 

ETL Validator thus provides a complete framework for automating the testing of SCD Type 2 dimensions. 

Over the last few years, we have been working with a number of customers across various industries such as Life Sciences, Financial Services, and Higher Education. While the problems related to data testing are similar across industries, and JSON and flat files are very common, there are many differences in the file formats. In some industries Avro is popular, while in others, depending on the use case, Parquet is more commonly used.

 

At Datagaps, one of the key challenges is empowering customers to test any of these formats with ease. Of course, we have the option of building a native connector for each format, but that is not always the best approach. At Datagaps, we try to leverage open standards and open-source frameworks to support our customers. This is where Apache Drill (https://drill.apache.org/) comes into play.

 

Over the last 6 months, we have been drilling around a bit and absolutely love the speed and flexibility that Apache Drill provides. As of version 3.4.5, we use Drill as the interface between ETL Validator and any of the file formats mentioned above (except flat files, since flat files are much more common and deserve native connectors). In this blog, I wanted to take a few minutes and explain how easy it is to get started with Drill and integrate it with ETL Validator.
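
As a taste of how little it takes, here is a minimal sketch that queries a Parquet file through Drill's REST API using Python; the file path is made up, and this shows plain Drill usage rather than how ETL Validator connects to it internally.

# Query a Parquet file through Apache Drill's REST API (default port 8047).
# Assumes a local Drill instance is running; the file path is hypothetical.
import requests

payload = {
    "queryType": "SQL",
    # Drill can query a file directly with SQL via the dfs storage plugin.
    "query": "SELECT * FROM dfs.`/data/employees.parquet` LIMIT 10",
}

response = requests.post("http://localhost:8047/query.json", json=payload)
response.raise_for_status()
for row in response.json()["rows"]:
    print(row)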

 

Over the last few years, the Salesforce platform has become an incredible force in the market for various reasons. Of course, the most obvious use case is its CRM capabilities. In addition, many organizations have started using the power of the Force platform to build and deploy custom applications in the cloud at an incredibly fast pace.

 

While this journey is truly exciting, there will always be a burning underlying need to test the data and ensure that it is always as expected. In this blog, I will highlight a few use cases and show how ETL Validator can help you address them.

Oracle E-Business Suite (EBS) R12 is a significant new version with valuable new features and capabilities. Although there is an upgrade path from EBS 11i to R12, most companies reimplement R12 and migrate the data from their 11i instance. Reimplementation can be a complex project, but it also gives companies the opportunity to improve their implementation.

 

When transitioning from EBS R12.1 to R12.2, companies generally perform an in-place upgrade. One of our customers was upgrading from EBS R12.1 to R12.2 and wanted to verify that the upgrade did not cause any issues with the data in their data warehouse. While testing the data warehouse and the dashboards can help identify data issues during the upgrade, it is also important to test the data in the EBS R12 instance from the backend. This type of testing is called database testing.
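
To illustrate what a simple backend check might look like, here is a hypothetical Python sketch comparing row counts captured before and after the upgrade; the table names are standard EBS tables, but the counts are made up for illustration.

# Hypothetical pre/post-upgrade sanity check: compare per-table row counts.
# In practice the "before" counts are captured prior to the upgrade.
counts_before = {"AP_INVOICES_ALL": 120_500, "GL_JE_LINES": 980_200}
counts_after = {"AP_INVOICES_ALL": 120_500, "GL_JE_LINES": 980_175}

for table, before in counts_before.items():
    after = counts_after.get(table)
    status = "OK" if after == before else f"MISMATCH ({before} -> {after})"
    print(f"{table}: {status}")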

 

Wednesday, 27 January 2016 17:02

ETL Validator for Data Migration testing

Recently, I stumbled upon a relatively old article on Data Migration from TDWI that explains what it is, how it differs from other integration patterns, popular techniques, and the typical challenges associated with this pattern.

 

Though the article is more than 5 years old, it still aligns with what we are seeing in the field today. Most customers use ETL technologies for their migration projects, and thus the problems encountered are very similar to the ones we see in data warehousing patterns.

At a recent event, one of the prospects attending came over to our booth and asked me to demonstrate a use case that was quite interesting. I got a similar request from another prospect recently and thought it might be a good idea to blog about it and show how it can be done using ETL Validator.

 

Problem: There is a source table and a target table. The attendee wanted to find the difference in a numeric field between the two tables. In addition, he wanted to specify an acceptable variance and define a rule based on it: if the difference is within the limit, the test case should be marked as a success; if the difference exceeds the variance, the test case should be marked as a failure. In just a few minutes, we were able to demonstrate this use case using version 3.4 of ETL Validator.
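
The check itself is simple to reason about. Here is a minimal Python sketch of the variance rule, where the totals stand in for aggregate queries on the two tables and the 1% threshold is illustrative rather than anything ETL Validator generates.

# Illustrative variance rule: compare an aggregated numeric field between a
# source and a target table, passing or failing based on an allowed variance.
source_total = 1_000_000.00   # e.g. the result of SUM(amount) on the source
target_total = 999_250.00     # e.g. the result of SUM(amount) on the target

ACCEPTABLE_VARIANCE = 0.01    # 1% of the source total

difference = abs(source_total - target_total)
limit = ACCEPTABLE_VARIANCE * source_total

result = "SUCCESS" if difference <= limit else "FAILURE"
print(f"difference={difference:.2f}, limit={limit:.2f}, result={result}")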

Most of the use cases we encounter are DW scenarios. However, recently we have also been seeing many scenarios for testing transactional systems. This blog highlights one such example.

 

Tuesday, 01 July 2014 23:21

Why ETL Validator when there is Python?

I am not sure if you have ever written Python, but it is one of the coolest languages out there. Easy to learn, easy to write and, oh yeah, easy to read! Writing code in Python is something I enjoy when I want a break from my routine job of putting together slides or preparing for upcoming demos.

Over the last few years, the usage of Python has gone up drastically, and one such area is test automation. With very few lines of code, you can achieve remarkable things. As an example, some time back I had to compare the data in two CSV files (tens of thousands of rows) and then spit out the differences. The code looked somewhat like this:
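
Here is a minimal sketch along those lines, assuming both CSV files share the same column layout; the file names are illustrative.

# Sketch of the comparison described above: diff two CSV files and print
# the rows unique to each side. File names are illustrative.
import csv

def load_rows(path):
    """Read a CSV file into a set of row tuples for fast set difference."""
    with open(path, newline="") as f:
        return set(tuple(row) for row in csv.reader(f))

rows_a = load_rows("extract_old.csv")
rows_b = load_rows("extract_new.csv")

for row in sorted(rows_a - rows_b):
    print("only in extract_old.csv:", row)
for row in sorted(rows_b - rows_a):
    print("only in extract_new.csv:", row)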

 

Sunday, 02 February 2014 19:37

Difference between ETL and Database testing

ETL testing applies to data warehouses or data integration projects, while database testing applies to any database holding data (typically transactional systems). Here are the high-level tests done in each:

ETL Testing: The primary goal is to check whether the data moved properly, as expected.

 

Database Testing: The primary goal is to check whether the data follows the rules/standards defined in the data model.