Removing Duplicates using Pentaho Data Integration

In this post we look at removing duplicates using Pentaho Data Integration. How to get rid of duplicate rows is a frequently asked question, since duplicates are cumbersome to deal with and cause recurring problems in day-to-day work.

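To solve this we can make use of the Unique rows step in Pentaho Data Integration. Regardless of how many rows come in, the step filters them down to unique rows based on the fields the user specifies. Note that Unique rows only compares each row with the one before it, so the input should be sorted on those fields first (for example with a Sort rows step). One such example is shown below.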

1. First we create some dummy data from scratch that contains a number of duplicate values.


2. Next we add a “Unique rows” step, which detects the duplicates. Here we specify the field name, i.e. “BEDROOM(TAX|MLS)”, whose duplicate values need to be removed, leaving only unique rows to deal with.


3. Finally, when we run the transformation and preview the output, we get the desired result with the duplicates removed.
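
For readers who prefer to see the logic spelled out, here is a rough Python sketch of what the steps above do. It is only an illustration of the idea, not Pentaho's actual implementation. The sample rows and the `ID` column are made up for this example, and only the field name “BEDROOM(TAX|MLS)” comes from the post. The sketch assumes the step keeps the first row of each run of adjacent duplicates, which is why the data is sorted before deduplicating.

```python
from itertools import groupby
from operator import itemgetter

KEY = "BEDROOM(TAX|MLS)"  # field name taken from the post

# Hypothetical dummy data with duplicate values in the KEY field.
rows = [
    {"ID": 1, KEY: 3},
    {"ID": 2, KEY: 2},
    {"ID": 3, KEY: 3},
    {"ID": 4, KEY: 4},
    {"ID": 5, KEY: 2},
]


def unique_rows(rows, field):
    """Keep the first row of each run of consecutive rows sharing `field`.

    Only neighbouring rows are compared, so duplicates that are not
    adjacent in the input survive.
    """
    return [next(group) for _, group in groupby(rows, key=itemgetter(field))]


# Sort first (Python's sort is stable) so equal values end up adjacent.
sorted_rows = sorted(rows, key=itemgetter(KEY))
deduped = unique_rows(sorted_rows, KEY)

for row in deduped:
    print(row)
```

Calling `unique_rows(rows, KEY)` on the unsorted list returns all five rows, because no two neighbouring rows share the same value. That is the reason for sorting the data before it reaches the deduplication step.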


Hope this helps!