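I was running a Pig script to count rows
and found that the COUNT function ignores rows in a bag that have null in the first field.
To confirm this, I changed the field order and ran the experiments below. The source file is tab-delimited, and the first field of its second row is empty, so Pig loads it as null.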
--------- source file ----------
A E G
F H
B F I
[Experiment#1]
--------- pig ----------
A = load '/user/myspn/exp' using PigStorage() as (X:chararray, Y:chararray, Z:chararray);
B = group A by (X,Y);
C = foreach B generate group.$0, group.$1, A, COUNT(A);
dump C;
--------- result ----------
(,F,{(,F,H)},0)
(A,E,{(A,E,G)},1)
(B,F,{(B,F,I)},1)
[Experiment#2]
--------- pig ----------
A = load '/user/myspn/exp' using PigStorage() as (X:chararray, Y:chararray, Z:chararray);
D = foreach A generate Y, X, Z;
B = group D by (Y,X);
C = foreach B generate group.$0, group.$1, D, COUNT(D);
dump C;
--------- result ----------
(F,,{(F,,H)},1)
(E,A,{(E,A,G)},1)
(F,B,{(F,B,I)},1)
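In experiment #1 the grouping still works fine, but COUNT returns 0 for the group whose row has null in its first field. In experiment #2 the null sits in the second field, and the same row is counted as 1.
So the conclusion is: make sure the first field of the rows to be counted is not null.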
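If reordering the fields is inconvenient, Pig's built-in COUNT_STAR may be an alternative, since it counts every tuple in the bag, including those with null in the first field. The sketch below reuses the load and grouping from experiment #1 and puts both functions side by side. It assumes a Pig version that ships COUNT_STAR, and I have not run it against this data:

```pig
-- Same load and grouping as experiment #1; only the counting differs.
A = load '/user/myspn/exp' using PigStorage() as (X:chararray, Y:chararray, Z:chararray);
B = group A by (X,Y);
-- COUNT skips tuples whose first field is null;
-- COUNT_STAR counts every tuple in the bag.
C = foreach B generate group.$0, group.$1, A, COUNT(A), COUNT_STAR(A);
dump C;
```

For the group with the null X, the two count columns should differ, while for the other groups they should agree.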