patternMinor
Cumulative sum using hiveql
Viewed 0 times
hiveqlcumulativeusingsum
Problem
I have a table in Hive which looks like:
How do I, with hiveql, group up
I've only managed to come up with the grouping and sum, but am out of ideas regarding the cumulative sum. Correlated sub-queries are not supported in Hive
With the following result:
col1 col2
b 1
b 2
a 3
b 2
c 4
c 5How do I, with hiveql, group up
col1 elements together, sum them up, sort by the sum, as well as create a cumulative sum (csum) based on the sum?id sum_all csum
a 3 3
b 5 8
c 9 17I've only managed to come up with the grouping and sum, but am out of ideas regarding the cumulative sum. Correlated sub-queries are not supported in Hive
select col1 as id
sum(col2) as sum_all
from t
group by col1
order by sum_allWith the following result:
id sum_all
a 3
b 5
c 9Solution
Since correlated subqueries are not allowed, try using derived tables and then joining them.
This is essentially a self-join on the derived group-by table. It might be more efficient to first save the group-by results into a temporary table and then do the self-join.
According to the manual, Hive also has windowing aggregates, so you could use them as well:
or with:
select
a.id,
a.sum_all,
sum(b.sum_all) as csum
from
( select col1 as id,
sum(col2) as sum_all
from t
group by col1
) a
join
( select col1 as id,
sum(col2) as sum_all
from t
group by col1
) b
on
( b.sum_all < a.sum_all )
or ( b.sum_all = a.sum_all and b.id <= a.id )
group by
a.sum_all, a.id
order by
a.sum_all, a.id ;This is essentially a self-join on the derived group-by table. It might be more efficient to first save the group-by results into a temporary table and then do the self-join.
According to the manual, Hive also has windowing aggregates, so you could use them as well:
select
a.id,
a.sum_all,
sum(a.sum_all) over (order by a.sum_all, a.id
rows between unbounded preceding
and current row)
as csum
from
( select col1 as id,
sum(col2) as sum_all
from t
group by col1
) a
order by
sum_all, id ;or with:
select
col1 as id,
sum(col2) as sum_all,
sum(sum(col2)) over (order by sum(col2), col1
rows between unbounded preceding
and current row)
as csum
from
t
group by
col1
order by
sum_all, id ;Code Snippets
select
a.id,
a.sum_all,
sum(b.sum_all) as csum
from
( select col1 as id,
sum(col2) as sum_all
from t
group by col1
) a
join
( select col1 as id,
sum(col2) as sum_all
from t
group by col1
) b
on
( b.sum_all < a.sum_all )
or ( b.sum_all = a.sum_all and b.id <= a.id )
group by
a.sum_all, a.id
order by
a.sum_all, a.id ;select
a.id,
a.sum_all,
sum(a.sum_all) over (order by a.sum_all, a.id
rows between unbounded preceding
and current row)
as csum
from
( select col1 as id,
sum(col2) as sum_all
from t
group by col1
) a
order by
sum_all, id ;select
col1 as id,
sum(col2) as sum_all,
sum(sum(col2)) over (order by sum(col2), col1
rows between unbounded preceding
and current row)
as csum
from
t
group by
col1
order by
sum_all, id ;Context
StackExchange Database Administrators Q#153338, answer score: 5
Revisions (0)
No revisions yet.