Hands on Pig Unit Testing

Why Unit Testing?

When we develop a distributed system, it is crucial to test the logic in an automated way and to be able to run those tests in the development environment (from an IDE). Without this, the long development-test-development cycle of a complex system can kill the productivity of our development process.

In the domain of distributed systems and big data analytics we have to implement complex logic, and sometimes we have to change logic we have already implemented because of changing requirements and other reasons. In this case, we need to be sure that changes to our code will not break the entire process or application.

What is Pig?

Pig is widely used for data analysis problems. It is a scripting language used with Apache Hadoop. Pig provides flexibility in data manipulation on Apache Hadoop, and it works with data from many sources (including structured and unstructured data).

Behind the scenes, Pig scripts are translated into a series of MapReduce jobs that run on the Apache Hadoop cluster.

You can find more details about Pig from this link.

Familiar with Pig Unit Test?

When you work with Pig, you need to test your Pig Latin scripts. Even after the data flows are implemented, you need to re-test your logic whenever you change the scripts, the UDFs, or the version of Pig or Hadoop that you are using, to ensure that the changes do not break your code.

For unit testing Pig scripts, the Apache Pig project provides the Pig Unit library. It lets you run Pig scripts from JUnit. Pig Unit can run in Local mode or MapReduce mode. Local mode is the default and does not require a cluster: it uses your local file system in place of a Hadoop cluster and creates a new local environment on each run. MapReduce mode, on the other hand, requires a Hadoop cluster and an HDFS installation. For development testing, developers can run Pig Unit in Local mode.




The minimal required setup for a Pig Unit test consists of four libraries:

Maven:

Gradle:
Link to dependencies.gradle of Pig Unit Test Tutorial Project
Link to build.gradle of Pig Unit Test Tutorial Project
Here I refer to my Word Count project to explain the features of the Pig Unit framework. Additionally, I have included a step that loads data from an HBase table, and I will explain how to mock an external data source in a Pig Unit test (which most tutorials do not cover).

NOTE: I have provided two shell scripts to create that HBase table on your cluster, along with the DML for it, in case you want to try it on a Hadoop environment. You can find them here.
My Word Count Pig script looks like this:

First, it loads an input file from an HDFS location that is passed in as a parameter and extracts words while ignoring unwanted characters. Then it loads data from an HBase table (the namespace and table name are also passed in as parameters), which holds the words used to filter the HDFS data. Next it joins the HBase data with the HDFS data, groups the words, and counts each word. Finally, it stores the result on HDFS at a path that is passed in as a parameter.
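To make the data flow easier to follow, here is a rough in-memory sketch of the same steps in plain Java. It is not the Pig script itself and it does not use Pig or HBase. The class and method names are hypothetical, and it assumes that words are lowercased, that "unwanted characters" means anything other than letters and digits, and that the join is an inner join that keeps only words present in the lookup table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class WordCountFlowSketch {

    // Step 1: split each line into lowercase words, dropping every
    // character that is not a letter or digit (an assumption).
    static List<String> extractWords(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String token : line.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!token.isEmpty()) {
                    words.add(token);
                }
            }
        }
        return words;
    }

    // Steps 2-4: keep only words found in the lookup set (stands in for
    // the join with the HBase data), then group the words and count them.
    static Map<String, Long> countMatchingWords(List<String> lines, Set<String> lookupWords) {
        Map<String, Long> counts = new TreeMap<>();
        for (String word : extractWords(lines)) {
            if (lookupWords.contains(word)) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Pig, pig & Hadoop!", "hadoop runs Pig scripts.");
        Set<String> lookupWords = Set.of("pig", "hadoop");
        // prints {hadoop=2, pig=3}
        System.out.println(countMatchingWords(lines, lookupWords));
    }
}
```

Having the expected counts for a small input like this is handy later, because the same pairs can serve as expected results in a Pig Unit test.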
Let us see how we are going to test this flow using the Pig Unit testing framework.
You can mock the parameters that are normally provided externally by passing them as a string array.
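As a small illustration, the sketch below builds such an array of name=value entries and shows what parameter substitution conceptually does to a statement. The parameter names, paths, and the substitute helper are hypothetical, and the substitution is a toy stand-in, not Pig's real preprocessor. In a real test the array would be passed as the second argument when constructing PigTest.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PigParamsSketch {

    // One "name=value" entry per script parameter.
    // All names and paths here are hypothetical examples.
    static String[] testParams() {
        return new String[] {
            "input=src/test/resources/input/words.txt",
            "output=src/test/resources/output/wordcount",
            "namespace=test_ns",
            "table=filter_words"
        };
    }

    // Toy substitution: replaces each $name in the statement with its value.
    // The word boundary keeps $input from matching inside $input_dir.
    static String substitute(String statement, String[] params) {
        String result = statement;
        for (String param : params) {
            int eq = param.indexOf('=');
            String name = param.substring(0, eq);
            String value = param.substring(eq + 1);
            String pattern = "\\$" + Pattern.quote(name) + "\\b";
            result = result.replaceAll(pattern, Matcher.quoteReplacement(value));
        }
        return result;
    }

    public static void main(String[] args) {
        String statement = "lines = LOAD '$input' AS (line:chararray);";
        // prints lines = LOAD 'src/test/resources/input/words.txt' AS (line:chararray);
        System.out.println(substitute(statement, testParams()));
        // With Pig Unit on the classpath, the array would be used like:
        //   PigTest test = new PigTest("path/to/script.pig", testParams());
    }
}
```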

You can use local file system paths instead of HDFS locations when you execute Pig Unit tests, and you can provide them as relative paths.
You also have to mock external data sources. For this we can use the Pig Unit override feature, which replaces the statement for a given alias at run time.


Here Pig Unit replaces the original Pig statement for that alias with the one supplied in the test.
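The idea behind the override can be sketched in plain Java as follows. This is only a toy model of the behavior, not how Pig Unit implements it. The alias name filter_words, the column name cf:word, and the file paths are hypothetical. In a real test the equivalent call would be override(alias, replacement) on the PigTest instance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class OverrideSketch {

    // Toy model of an override: any statement that assigns to `alias`
    // is swapped for `replacement`; all other statements are kept as-is.
    static List<String> overrideAlias(List<String> statements, String alias, String replacement) {
        List<String> result = new ArrayList<>();
        String definesAlias = Pattern.quote(alias) + "\\s*=.*";
        for (String statement : statements) {
            result.add(statement.trim().matches(definesAlias) ? replacement : statement);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical script: the second statement reads from HBase.
        List<String> script = List.of(
            "lines = LOAD '$input' AS (line:chararray);",
            "filter_words = LOAD 'hbase://$namespace:$table' USING "
                + "org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:word') AS (word:chararray);");

        // In the test, the HBase load is replaced by a load from a local file.
        String replacement =
            "filter_words = LOAD 'src/test/resources/input/filter_words.txt' AS (word:chararray);";

        for (String statement : overrideAlias(script, "filter_words", replacement)) {
            System.out.println(statement);
        }
        // With Pig Unit on the classpath, the equivalent call would be:
        //   test.override("filter_words", replacement);
    }
}
```

Because only the statement for the overridden alias changes, the rest of the script (the join, grouping, and counting) is exercised exactly as written.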

You can verify intermediate results (units) against an external file that contains the expected result, against a string array of expected results, and so on.
Note that Pig Unit currently drops all STORE and DUMP commands. To store the output you have to tell Pig Unit to keep the STORE command (by calling unoverride("STORE") on the test) and then execute the script. If you run this, Pig Unit will create an output part file inside the ‘/resources/output/wordcount/’ directory.


You can find my Pig Unit example on Git; refer to the ‘pig unit test tutorial’ project for this.
