A Beginner's Guide To Apache Pig: Basic Navigation and Useful Commands

What is Pig?

Pig is a component of the Hadoop ecosystem that abstracts away lower-level MapReduce programming through a data flow scripting language called Pig Latin.

What do I need to get started?

This guide assumes that Pig and core Hadoop are installed and working properly. You will also need a dataset to manipulate. This tutorial uses a public dataset generated from FAA on-time flight data; the subset referenced in the tutorial is available for download by written request to expert@spryinc.com.

I have everything, now what?

Basic Navigation

The first lessons you need to learn are how to interact with Pig and the Pig Latin scripts you will write. Pig accepts commands either through an interactive console (the "grunt" shell) or by taking the location of a script as an argument.

There are several ways to invoke Pig and begin processing data. Each method is outlined in the table below.

Command                          Interactive Mode    MapReduce Mode
pig                              Yes                 Yes
pig -x local                     Yes                 No
pig [path_to_script]             No                  Yes
pig -x local [path_to_script]    No                  No

Given the same input, these methods produce equivalent results. However, interactive mode, which accepts one command at a time, gives the user more freedom to inspect the state of the data throughout the process and is very helpful when debugging. Local mode removes the requirement of moving test input files into HDFS, because Pig reads from the local filesystem instead.
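For example, a quick local-mode session reads straight from your working directory (the file name here is hypothetical; any tab-delimited local file will do):

$ pig -x local
<Successful Connection Log Messages>
grunt> flights = LOAD 'flight_data_subset.tsv' USING PigStorage('\t');
grunt> quit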

This guide will expect you to use the "pig" command while following along, so that we can take advantage of both the interactive console mode and the MapReduce mode. Try the following commands now, just to make sure that everything in your environment is set up correctly.

$ pig
<Successful Connection Log Messages>
grunt> -- This is a comment and doesn't affect anything
grunt> quit

Now that you have all of the information necessary to navigate within the Pig interactive console, let's start interacting with some data.

Basic Data Manipulation

Pig handles data in terms of "tuples" and "bags". A tuple can be thought of as an ordered collection of fields, and a bag as a collection of tuples. Both can be nested structures: a tuple can be made up of any combination of bags, tuples, and scalars.

When interacting with bags and tuples, data can be retrieved using positional notation (e.g. $0, $1) or by field name (e.g. origin_city, dep_time). If the field is being accessed as part of a bag, you will need to use dot notation to reach the field within the bag (e.g. flight_time.$0, flight_time.fl_date).
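As a quick illustration, assuming the flight_ontime alias defined in the LOAD section below, both of the following statements project the same first field:

-- $0 and fl_date refer to the same (first) field of each tuple
dates_by_position = FOREACH flight_ontime GENERATE $0;
dates_by_name = FOREACH flight_ontime GENERATE fl_date;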

Essential Functions

LOAD

This function loads data from HDFS into an alias so that Pig can perform transformations against it.

flight_ontime = LOAD '/user/spry/flight_data_subset' USING PigStorage('\t') AS (fl_date:chararray, origin:chararray, origin_city:chararray, dest:chararray, dest_city_name:chararray, dep_time:chararray, dep_delay:int, arr_time:chararray, arr_delay:int);

DESCRIBE

This command lets us view the schema of an alias so that we know which fields are available.

DESCRIBE flight_ontime;
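For the flight_ontime alias loaded above, the output should look roughly like this:

flight_ontime: {fl_date: chararray,origin: chararray,origin_city: chararray,dest: chararray,dest_city_name: chararray,dep_time: chararray,dep_delay: int,arr_time: chararray,arr_delay: int}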

DUMP

Once data has been loaded into an alias, we can also write it out to the screen to verify that everything was loaded correctly.

DUMP flight_ontime;
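Be aware that DUMP materializes the entire alias, which on a large dataset means a full MapReduce job. A common trick is to pair it with LIMIT (covered below) to spot-check a handful of rows:

-- print only the first 10 tuples rather than the whole dataset
flight_sample = LIMIT flight_ontime 10;
DUMP flight_sample;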

FILTER

Use this command to select a subset of the data in an alias according to a particular expression. 

late_flights = FILTER flight_ontime BY dep_delay > 0;
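Conditions can be combined with AND, OR, and NOT. As a sketch, this keeps only delayed departures out of one airport ('JFK' is just an illustrative origin code):

late_jfk_flights = FILTER flight_ontime BY dep_delay > 0 AND origin == 'JFK';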

SPLIT

This is roughly the equivalent of an "if" statement in other programming languages: the dataset is subdivided into multiple aliases, and each tuple lands in every alias whose condition it satisfies.

SPLIT flight_ontime INTO early_flights IF dep_delay < 0, ontime_flights IF dep_delay == 0, late_flights IF dep_delay > 0, new_york_flights IF origin_city == 'New York, NY';
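Note that the conditions are evaluated independently, so a delayed New York flight lands in both late_flights and new_york_flights. If you are on Pig 0.10 or later, an OTHERWISE clause catches every tuple the preceding conditions missed:

-- OTHERWISE requires Pig 0.10+
SPLIT flight_ontime INTO negative_delays IF dep_delay < 0, remaining_flights OTHERWISE;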

STORE

After using Pig to transform the input data into the desired output dataset, we can write the alias out to HDFS so that other applications can interact with it. For this example, we will write the list of late flights to HDFS.

STORE late_flights INTO '/user/spry/late_flights' USING PigStorage('\t');
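The output path is a directory containing one part-* file per reducer; you can verify the write from outside Pig with the Hadoop shell:

$ hadoop fs -ls /user/spry/late_flights
$ hadoop fs -cat /user/spry/late_flights/part-*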

GROUP

This command will take a set of tuples and create multiple bags such that tuples that have the specified columns in common will be placed into the same bag. Once the bags have been created, the data inside each bag can be accessed using the dot notation mentioned above.

flights_by_origin = GROUP flight_ontime BY origin;
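Running DESCRIBE on the result shows the nested structure: each tuple holds the grouping key in a field named group, plus a bag named after the original alias:

grunt> DESCRIBE flights_by_origin;
flights_by_origin: {group: chararray,flight_ontime: {(fl_date: chararray,origin: chararray,origin_city: chararray,dest: chararray,dest_city_name: chararray,dep_time: chararray,dep_delay: int,arr_time: chararray,arr_delay: int)}}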

FOREACH

This is the equivalent of a "for" loop in other programming languages: the expression after GENERATE is applied to every tuple in the alias, so that logic can be applied to specific columns in each row of the data.

-- the bag produced by GROUP keeps the original alias's name, so COUNT takes flight_ontime, not group
number_flights_by_origin = FOREACH flights_by_origin GENERATE group, COUNT(flight_ontime);
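The same pattern extends to other aggregates; here the dot notation from earlier reaches into the bag to average a single field (a sketch building on the aliases above):

avg_delay_by_origin = FOREACH flights_by_origin GENERATE group, AVG(flight_ontime.dep_delay);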

More Useful Operators and Built-in Functions To Be Aware Of

LIMIT - Limits the number of tuples in an alias.
ORDER BY - Specifies the sort order for tuples in an alias.
CONCAT - Concatenates two fields (chararray or bytearray).
ISEMPTY - Checks whether or not a bag or map contains any elements.
SIZE - Returns the number of elements in a bag or tuple, or the length of a chararray or bytearray.
Aggregate functions: AVG, MAX, MIN, SUM, COUNT/COUNT_STAR
String functions: IndexOf, LOWER, REGEX_EXTRACT_ALL, REPLACE, STRSPLIT, SUBSTRING
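As a quick sketch combining two of these, the following finds the ten worst departure delays (assuming the flight_ontime alias from the LOAD section):

sorted_by_delay = ORDER flight_ontime BY dep_delay DESC;
worst_ten_delays = LIMIT sorted_by_delay 10;
DUMP worst_ten_delays;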

How do I know it worked?

First, create a Pig script using a combination of the commands outlined above. Then pass the script to Pig as a command-line argument.

$ pig [path_to_script]

If everything was successful, your script should complete with no errors.
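As a minimal end-to-end sketch, a script chaining the statements above might look like this (the file name and output path are hypothetical):

-- late_flights.pig: load the subset, keep delayed departures, write them back to HDFS
flight_ontime = LOAD '/user/spry/flight_data_subset' USING PigStorage('\t') AS (fl_date:chararray, origin:chararray, origin_city:chararray, dest:chararray, dest_city_name:chararray, dep_time:chararray, dep_delay:int, arr_time:chararray, arr_delay:int);
late_flights = FILTER flight_ontime BY dep_delay > 0;
STORE late_flights INTO '/user/spry/late_flights_from_script' USING PigStorage('\t');

Save it as late_flights.pig, submit it with "pig late_flights.pig", and check the output directory when it finishes.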

Where can I go for more information?

The official Apache Pig documentation at pig.apache.org is the definitive reference for every operator and function covered here. Or feel free to leave your question or comment below!
