Load Dynamic & Complex JSON Messages in a Pig Script Without a Schema


Load Complex & Dynamic JSON messages - Problem


Suppose that you have to load dynamic and complex JSON messages and then apply some transformations and aggregations in your Pig script. You can simply use Pig's built-in JsonLoader to load JSON messages from HDFS. As an example, let’s try to load some HDFS data stored in the following JSON format.


[NOTE - Each JSON message must be given as a single-line string, as in the Pig Example Git project.]

We can load the JSON data using Pig's built-in JsonLoader as shown below. I have flattened the data inside the Pig bag, and then the final tuple is created.



NOTE: $jsonFilePath is an external parameter passed to the script.

So yes, in a Pig script we can use the JsonLoader that ships with Pig as a built-in function to load complex JSON messages. The problem with this JsonLoader is that the entire schema has to be specified, as shown above.
What if the positions of the fields in a JSON message change while the format stays the same? Then JsonLoader loads the data into the wrong attributes, because it assigns values in the order given by the schema.
Let's look at a simple example. Take a simple JSON message like the one below.


{
"A": "val_a",
"B": "val_b"
}

If we give the schema as A: chararray, B: chararray, JsonLoader assumes that every record is in that format and converts it to a Pig data structure like this:

{
"A": "val_a",  -->  A: chararray
"B": "val_b"   -->  B: chararray
}

But suppose some data arrives as below, in the same format but with the fields in a different order.

{
"B": "val_b",
"A": "val_a"
}

Pig will load those messages as below, since we have provided the expected schema, and this will break the entire logic of the data flow.

{
"B": "val_b",  -->  A: chararray
"A": "val_a"   -->  B: chararray
}
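To make the mismatch concrete, here is a small Python simulation of the behavior described above. It is only a sketch of the idea, not Pig code and not how JsonLoader is implemented: the helper names load_positional and load_by_key are hypothetical, and the positional loader simply pairs the i-th value of each message with the i-th field of the schema.

```python
import json

def load_positional(line, schema):
    """Hypothetical stand-in for a schema-driven loader:
    the i-th JSON value is assigned to the i-th schema field,
    and the JSON property names are ignored."""
    values = list(json.loads(line).values())  # dict keeps the order of the JSON text
    return dict(zip(schema, values))

def load_by_key(line):
    """Hypothetical stand-in for a map-based loader:
    every value stays attached to its own property name."""
    return json.loads(line)

schema = ["A", "B"]
expected_order = '{"A": "val_a", "B": "val_b"}'
swapped_order = '{"B": "val_b", "A": "val_a"}'

positional_ok = load_positional(expected_order, schema)
positional_bad = load_positional(swapped_order, schema)
keyed = load_by_key(swapped_order)

print(positional_ok)   # fields line up with the schema
print(positional_bad)  # A and B have silently swapped values
print(keyed["A"], keyed["B"])
```

With the swapped message, the positional loader puts val_b into A and val_a into B without raising any error, which is why the problem is easy to miss in a real data flow.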

This is critical for data with a dynamic and complex JSON format. Twitter has released an open source library called “Elephant-Bird” that addresses this drawback. The library offers many features besides loading dynamic JSON messages.


How to use Elephant-bird - Solution 


With the JsonLoader from Elephant-Bird, there is no need to specify the schema. The JsonLoader simply returns a Pig map, and fields can be accessed by their JSON property names, as shown below.
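The map-based approach can also be sketched in Python. This is an illustration of the idea only, not the Elephant-Bird API: load_as_map is a hypothetical helper that turns each single-line JSON message into a map, and the fields are then projected by name, so neither a changed field order nor an extra field (here the made-up field C) disturbs the result.

```python
import json

def load_as_map(lines):
    """Hypothetical helper: parse each non-empty line into a map (dict)."""
    return [json.loads(line) for line in lines if line.strip()]

# One single-line JSON message per record, as the note earlier requires.
lines = [
    '{"A": "val_a", "B": "val_b"}',
    '{"B": "val_b", "A": "val_a", "C": "val_c"}',  # reordered, with an extra field
]

records = load_as_map(lines)

# Project fields by property name; a missing key yields None instead of
# shifting the remaining values into the wrong columns.
projected = [(r.get("A"), r.get("B")) for r in records]
print(projected)
```

Because every lookup goes through the property name, both records produce the same (A, B) pair even though their layouts differ.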


You can download the Elephant-Bird library from this, and refer to this to become more familiar with Elephant-Bird.

When you use an external library, you have to register it in the Pig script using REGISTER.

NOTE: $elephantBirdLibPath is the path to the Elephant-Bird jars (a local file path).

Please refer to this Git project to become familiar with the Elephant-Bird library. I have integrated PigUnit into the project, so you can run it in your local environment (just run the unit test).



Use the following Hadoop submit command if you run this on Hadoop.





