Friday, October 26, 2012

Be careful using COUNT in Pig


I was running a Pig script to count the number of rows
and found that the COUNT function ignores rows in a bag whose first field is null.
I changed the field order and ran the two experiments below.
 
--------- source file ----------
A       E       G
        F       H
B       F       I
 
 
[Experiment#1]
--------- pig ----------
A = load '/user/myspn/exp' using PigStorage() as (X:chararray, Y:chararray, Z:chararray);
B = group A by (X,Y);
C = foreach B generate group.$0, group.$1, A, COUNT(A);
dump C;
--------- result ----------
(,F,{(,F,H)},0)
(A,E,{(A,E,G)},1)
(B,F,{(B,F,I)},1)
 
 
[Experiment#2]
--------- pig ----------
A = load '/user/myspn/exp' using PigStorage() as (X:chararray, Y:chararray, Z:chararray);
D = foreach A generate Y, X, Z;
B = group D by (Y,X);
C = foreach B generate group.$0, group.$1, D, COUNT(D);
dump C;
--------- result ----------
(F,,{(F,,H)},1)
(E,A,{(E,A,G)},1)
(F,B,{(F,B,I)},1)
 
 
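In Experiment#1 we see that grouping still works fine, but COUNT returns 0 for the row whose first field (X) is null. In Experiment#2, where Y is moved to the first position, the same row is counted.
So the conclusion is: make sure the first field of the rows to be counted is not null.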
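
If reordering the fields is not convenient, another option is Pig's builtin COUNT_STAR, which counts every tuple in the bag, including those with null fields. The sketch below reuses the input and schema from Experiment#1 and assumes your Pig version provides COUNT_STAR. The aliases cnt and cnt_star are names I made up for illustration.

```pig
-- Same input file and schema as Experiment#1.
A = load '/user/myspn/exp' using PigStorage() as (X:chararray, Y:chararray, Z:chararray);
B = group A by (X,Y);
-- COUNT skips tuples whose first field is null;
-- COUNT_STAR counts all tuples in the bag, nulls included.
C = foreach B generate group.$0, group.$1, COUNT(A) as cnt, COUNT_STAR(A) as cnt_star;
dump C;
```

With the source file above, I would expect cnt_star to be 1 for every group, including the one where X is null, while cnt stays 0 for that group.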
