Data Validation Using Pentaho Data Integration
Data validation is a process which ensures that a program is operating on cleansed, correct and effective data. It uses validation rules ,validation constraints or check routines, that check whether data is correct and meaningful aswell as the security of data that are input to the system. The rules are implemented through the facilities of a data dictionary, or by the inclusion of explicit application program regular expression validation logic.
Data validation is meant to provide well-defined, fit, accurate and consistent data for any of various kinds of user input into an application or automated system. Data validation rules can be designed using any of various methodologies, and be deployed in any of various contexts. Data validation should start with business process definition and set of business rules within this process. All the specific Rules and Regulations can be collected through the requirements capture exercise.
To address all the data validation related issues PENTAHO gives us a readymade plugin named “Data Validator” which overcomes all the data-related issues .
Following illustration is a small example as to how one can use this step effectively.
1.We have three steps namely a Microsoft Excel,a Data Validator and a Select Values step.
2.When we preview the data, in the LastupdatedDt column we can notice that all dates are in DD-MM-YYYY format except for one which is MM/DD/YYYY format.So inorder to overcome this issue we used a data validator step which will filter out all rows except for that one single row which has the error data which we need to overcome.
For overcoming this issue we used a regular expression which will take in only the dates in DD-MM-YYYY format and filter out any other format.
3.Finally we segregated the data into true data and false data steps where the former holds the data with DD-MM-YYYY formatted date and the other containing the MM/DD/YYYY format.