百度360必应搜狗淘宝本站头条
当前位置:网站首页 > 技术文章 > 正文

Hive函数大全(含例子)之数据聚合函数、表生成函数

wxin55 2024-10-25 18:02 13 浏览 0 评论

目录

  • 一、聚合函数 Aggregate Functions (UDAF)
  • 二、表生成函数 Table-Generating Functions (UDTF)
  • 三、相关文章
  • 四、下期预告

一、聚合函数 Aggregate Functions (UDAF)

count(*), count(expr), count(DISTINCT expr[, expr...])

  • 返回结果: 返回行数
  • 返回类型: bigint
WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 1 AS col
)
SELECT COUNT(col)
FROM tmp;  -- 结果为 2

WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 1 AS col
)
SELECT COUNT(case when col=2 then col end)
FROM tmp;  -- 结果为 0

WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 1 AS col
)
SELECT COUNT(DISTINCT col)
FROM tmp;  -- 结果为 1

sum(col), sum(DISTINCT col)

  • 返回结果: 返回求和的值
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 1 AS col
)
SELECT SUM(col)
FROM tmp;  -- 结果为 2

WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 1 AS col
)
SELECT SUM(case when col=2 then col end)
FROM tmp;  -- 结果为 NULL

WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 1 AS col
)
SELECT SUM(DISTINCT col)
FROM tmp;  -- 结果为 1

avg(col), avg(DISTINCT col)

  • 返回结果: 返回平均值
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 1 AS col UNION ALL SELECT 2 AS col
)
SELECT AVG(col)
FROM tmp;  -- 结果为 1.3333333333333333

WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 1 AS col UNION ALL SELECT 2 AS col
)
SELECT AVG(case when col=2 then col else 2 end)
FROM tmp;  -- 结果为 2.0

WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 1 AS col UNION ALL SELECT 2 AS col
)
SELECT AVG(DISTINCT col)
FROM tmp;  -- 结果为 1.5

min(col)

  • 返回结果: 返回最小值
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 2 AS col
)
SELECT MIN(col)
FROM tmp;  -- 结果为 1

WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 2 AS col
)
SELECT MIN(case when col=2 then col else 3 end)
FROM tmp;  -- 结果为 2

max(col)

  • 返回结果: 返回最大值
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 2 AS col
)
SELECT MAX(col)
FROM tmp;  -- 结果为 2

WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 2 AS col
)
SELECT MAX(case when col=2 then col else 3 end)
FROM tmp;  -- 结果为 3

variance(col), var_pop(col)

  • 返回结果: 返回数值列方差
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 2 AS col
)
SELECT variance(col)
FROM tmp;  -- 结果为 0.25

WITH tmp AS
(
  SELECT '1' AS col UNION ALL SELECT '2' AS col UNION ALL SELECT 'a' AS col
)
SELECT variance(col)
FROM tmp;  -- 结果为 0.25

var_samp(col)

  • 返回结果: 返回数值列的无偏样本方差
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 2 AS col
)
SELECT var_samp(col)
FROM tmp;  -- 结果为 0.5

WITH tmp AS
(
  SELECT '1' AS col UNION ALL SELECT '2' AS col UNION ALL SELECT 'a' AS col
)
SELECT variance(col)
FROM tmp;  -- 结果为 0.5

stddev_pop(col)

  • 返回结果: 返回数值列的标准方差
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 2 AS col
)
SELECT stddev_pop(col)
FROM tmp;  -- 结果为 0.5

WITH tmp AS
(
  SELECT '1' AS col UNION ALL SELECT '2' AS col UNION ALL SELECT 'a' AS col
)
SELECT stddev_pop(col)
FROM tmp;  -- 结果为 0.5

stddev_samp(col)

  • 返回结果: 返回样本标准差
  • 返回类型: 1222222
WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 2 AS col
)
SELECT stddev_samp(col)
FROM tmp;  -- 结果为 0.7071067811865476

WITH tmp AS
(
  SELECT '1' AS col UNION ALL SELECT '2' AS col UNION ALL SELECT 'a' AS col
)
SELECT stddev_samp(col)
FROM tmp;  -- 结果为 0.7071067811865476

covar_pop(col1, col2)

  • 返回结果: 返回协方差
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 1 AS col1,1 AS col2 
  UNION ALL 
  SELECT 2 AS col1,3 AS col2
)
SELECT covar_pop(col1, col2)
FROM tmp;  -- 结果为 0.5

covar_samp(col1, col2)

  • 返回结果: 返回样本协方差
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 1 AS col1,1 AS col2 
  UNION ALL 
  SELECT 2 AS col1,3 AS col2
)
SELECT covar_samp(col1, col2)
FROM tmp;  -- 结果为 1.0

corr(col1, col2)

  • 返回结果: 返回皮尔逊相关系数
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 1 AS col1,1 AS col2 
  UNION ALL 
  SELECT 2 AS col1,3 AS col2
)
SELECT corr(col1, col2)
FROM tmp;  -- 结果为 0.9999999999999999

percentile(BIGINT col, p)

  • 返回结果: 返回分位数(p必须介于0和1之间;如果输入为非整数,请使用percentile_approx)
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 2 AS col UNION ALL SELECT 10 AS col UNION ALL SELECT 20 AS col
)
SELECT percentile(col, 0.25)
FROM tmp;  -- 结果为 1.75

WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 2 AS col UNION ALL SELECT 10 AS col UNION ALL SELECT 20 AS col
)
SELECT percentile(col, array(0.5,0.75))
FROM tmp;  -- 结果为 [6.0,12.5]

percentile_approx(DOUBLE col, p [, B])

  • 返回结果: 返回分位数(B参数以消耗内存为代价控制近似精度;较高的值产生更好的近似值,默认值为10000;当col中不同值的数目小于B时,这将给出精确的百分位值)
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 1.1 AS col UNION ALL SELECT 2.1 AS col UNION ALL SELECT 10.1 AS col UNION ALL SELECT 20.1 AS col
)
SELECT percentile_approx(col, 0.25)
FROM tmp;  -- 结果为 1.1

WITH tmp AS
(
  SELECT 1.1 AS col UNION ALL SELECT 2.1 AS col UNION ALL SELECT 10.1 AS col UNION ALL SELECT 20.1 AS col
)
SELECT percentile_approx(col, array(0.5,0.75), 5)
FROM tmp;  -- 结果为 [2.1,10.1]

regr_avgx(y, x)

  • 返回结果: 计算回归线的自变量的平均值(如果y,x其中之一或者全为NULL,则该行忽略,最终返回其他所有行的x的平均值)
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 1 AS y,1 AS x 
  UNION ALL 
  SELECT 2 AS y,3 AS x
)
SELECT regr_avgx(y, x)
FROM tmp;  -- 结果为 2

regr_avgy(y, x)

  • 返回结果: 计算回归线的因变量的平均值(如果y,x其中之一或者全为NULL,则该行忽略,最终返回其他所有行的y的平均值)
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 1 AS y,1 AS x 
  UNION ALL 
  SELECT 2 AS y,3 AS x
)
SELECT regr_avgy(y, x)
FROM tmp;  -- 结果为 1.5

regr_count(y, x)

  • 返回结果: 返回非空对的数量(y,x都不为NULL则为非空对)
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 1 AS y,1 AS x 
  UNION ALL 
  SELECT 2 AS y,NULL AS x
)
SELECT regr_count(y, x)
FROM tmp;  -- 结果为 1

regr_intercept(y, x)

  • 返回结果: 返回可最佳拟合线性回归线的y轴截距b【b=AVG( y ) - REGR_SLOPE( y, x ) * AVG( x )】
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 2 AS y,1 AS x 
  UNION ALL 
  SELECT 3 AS y,2 AS x
)
SELECT regr_intercept(y, x)
FROM tmp;  -- 结果为 1

regr_r2(y, x)

  • 返回结果: 返回回归线的确定系数(也称为R平方或拟合度)
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 2 AS y,1 AS x 
  UNION ALL 
  SELECT 3 AS y,2 AS x
)
SELECT regr_r2(y, x)
FROM tmp;  -- 结果为 1

regr_slope(y, x)

  • 返回结果: 返回可最佳拟合线性回归线的斜率
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 2 AS y,1 AS x 
  UNION ALL 
  SELECT 3 AS y,2 AS x
)
SELECT regr_slope(y, x)
FROM tmp;  -- 结果为 1

regr_sxx(y, x)

  • 返回结果: 返回线性回归模型中使用的独立表达式的平方和,可用于计算回归模型的统计有效性(regr_sxx(y, x) = regr_count(y, x) * var_pop(x))
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 1 AS y,1 AS x 
  UNION ALL 
  SELECT 2 AS y,3 AS x
)
SELECT regr_sxx(y, x)
FROM tmp;  -- 结果为 2.0

regr_sxy(y, x)

  • 返回结果: 返回因变量与自变量的积和,可用于计算回归模型的统计有效性(regr_sxy(y, x) = regr_count(y, x) * covar_pop(y, x))
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 1 AS y,1 AS x 
  UNION ALL 
  SELECT 2 AS y,3 AS x
)
SELECT regr_sxy(y, x)
FROM tmp;  -- 结果为 1.0

regr_syy(y, x)

  • 返回结果: 返回可以计算回归模型统计有效性的值(regr_syy(y, x) = regr_count(y, x) * var_pop(y))
  • 返回类型: DOUBLE
WITH tmp AS
(
  SELECT 1 AS y,1 AS x 
  UNION ALL 
  SELECT 2 AS y,3 AS x
)
SELECT regr_syy(y, x)
FROM tmp;  -- 结果为 0.5

histogram_numeric(col, b)

  • 返回结果: 使用b个非均匀间隔的存储箱计算组中数值列的直方图。输出是一个大小为b的双值(x,y)坐标数组,表示bin中心和高度
  • 返回类型: array<struct {'x','y'}>
WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 2 AS col UNION ALL SELECT 10 AS col UNION ALL SELECT 20 AS col
)
SELECT histogram_numeric(col, 3)
FROM tmp;  -- 结果为 [{"x":1.5,"y":2.0},{"x":10.0,"y":1.0},{"x":20.0,"y":1.0}]

collect_set(col)

  • 返回结果: 返回一组删除了重复元素的对象
  • 返回类型: array
WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 1 AS col UNION ALL SELECT 10 AS col UNION ALL SELECT 20 AS col
)
SELECT collect_set(col)
FROM tmp;  -- 结果为 [1,10,20]

collect_list(col)

  • 返回结果: 返回具有重复项的对象列表
  • 返回类型: array
WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 1 AS col UNION ALL SELECT 10 AS col UNION ALL SELECT 20 AS col
)
SELECT collect_list(col)
FROM tmp;  -- 结果为 [1,1,10,20]

ntile(INTEGER x)

  • 返回结果: 将有序分区划分为x个组,称为bucket,并为分区中的每一行分配一个bucket编号。这样可以方便地计算三位数、四位数、十位数、百分位数和其他常见的汇总统计数据
  • 返回类型: INTEGER
WITH tmp AS
(
  SELECT 1 AS col UNION ALL SELECT 1 AS col UNION ALL SELECT 10 AS col UNION ALL SELECT 20 AS col
)
SELECT ntile(2) over(order by col) as bucket_id
FROM tmp;  -- 结果为 1 1 2 2

结果为:

bucket_id

1

1

2

2


二、表生成函数 Table-Generating Functions (UDTF)

explode(ARRAY a)

  • 返回结果: 将数组分解为多行。返回具有单个列(col)的行集,数组中每个元素一行
  • 返回类型: T
SELECT explode(array(1,2,3)) as col;

结果为:

col

123

WITH tmp AS
(
  SELECT 'a' AS col1, ARRAY(1,2) AS col2
)
SELECT tmp.col1,
       tmp.col2,
       t.col3
FROM tmp LATERAL VIEW explode (col2) t AS col3;

结果为:

col1 col2 col3

a [1,2] 1

a [1,2] 2


explode(MAP<Tkey,Tvalue> m)

  • 返回结果: 将MAP分解为多行。返回包含两列(键、值)的行集,对于MAP中的每个键值对返回一行
  • 返回类型: Tkey,Tvalue
SELECT explode(map('k1','v1','k2','v2')) as (key,value);

结果为:

key value

k1 v1

k2 v2


posexplode(ARRAY a)

  • 返回结果: 使用int类型的附加位置列(项目在原始数组中的位置,从0开始)将数组分解为多行。返回包含两列(pos,val)的行集,数组中每个元素一行
  • 返回类型: int,T
SELECT posexplode(array(1,2)) as (pos,val);

结果为:

pos val

0 1

1 2


inline(ARRAY<STRUCT<f1:T1,...,fn:Tn>> a)

  • 返回结果: 将结构数组分解为多行。返回一个行集,该行集包含N列(N=结构中顶级元素的数目),数组中每个结构一行
  • 返回类型: T1,...,Tn
SELECT inline(array(struct('a','uncle',1),struct('b','bean',2))) as (code,name,id);

结果为:

code name id

a uncle 1

b bean 2


stack(int r,T1 V1,...,Tn/r Vn)

  • 返回结果: 将n个值V1,…,Vn拆分为r行。每一行都有n/r列。r必须是常数。
  • 返回类型: T1,...,Tn/r
SELECT stack(2,'a','uncle',1,'b','bean',2);

结果为:

col1 col2 col3

a uncle 1

b bean 2

SELECT stack(2,1,2,3,4,5);

结果为:

col1 col2 col3

1234 NULL 3


json_tuple(string jsonStr,string k1,...,string kn)

  • 返回结果: 接收JSON字符串和n个键,返回n个值的元组。这比get_json_object更高效,因为它只需要一次调用就可以获得多个值。
  • 返回类型: string1,...,stringn
SELECT json_tuple('{"id":1,"name":"Uncle Bean"}', 'id', 'name') as (id,name);

结果为:

id name

1 Uncle Bean


parse_url_tuple(string urlStr,string p1,...,string pn)

  • 返回结果: 接收URL字符串和n个URL部件名称,返回n个值的元组。这类似于parse_url,但可以从一个url中一次提取多个部分。有效的部件名称是:HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY>。
  • 返回类型: string 1,...,stringn
SELECT json_tuple('{"id":1,"name":"Uncle Bean"}', 'id', 'name') as (id,name);

结果为:

host path

github.com /TheUncleWhoGrowsBeans/Melon-and-fruit-fields

三、相关文章

Hive函数大全(含例子)之数学函数(Mathematical Functions)

Hive函数大全(含例子)之字符串函数(String Functions)

Hive函数大全(含例子)之集合函数、日期函数、条件函数

Hive函数大全(含例子)之数据屏蔽函数、杂项函数、XML解析函数

四、下期预告

Hive函数大全(含例子)之窗口和分析函数

相关推荐

ES6中 Promise的使用场景?(es6promise用法例子)

一、介绍Promise,译为承诺,是异步编程的一种解决方案,比传统的解决方案(回调函数)更加合理和更加强大在以往我们如果处理多层异步操作,我们往往会像下面那样编写我们的代码doSomething(f...

JavaScript 对 Promise 并发的处理方法

Promise对象代表一个未来的值,它有三种状态:pending待定,这是Promise的初始状态,它可能成功,也可能失败,前途未卜fulfilled已完成,这是一种成功的状态,此时可以获取...

Promise的九大方法(promise的实例方法)

1、promise.resolv静态方法Promise.resolve(value)可以认为是newPromise方法的语法糖,比如Promise.resolve(42)可以认为是以下代码的语...

360前端一面~面试题解析(360前端开发面试题)

1.组件库按需加载怎么做的,具体打包配了什么-按需加载实现:借助打包工具(如Webpack的require.context或ES模块动态导入),在使用组件时才引入对应的代码。例如在V...

前端面试-Promise 的 finally 怎么实现的?如何在工作中使用?

Promise的finally方法是一个非常有用的工具,它无论Promise是成功(fulfilled)还是失败(rejected)都会执行,且不改变Promise的最终结果。它的实现原...

最简单手写Promise,30行代码理解Promise核心原理和发布订阅模式

看了全网手写Promise的,大部分对于新手还是比较难理解的,其中几个比较难的点:状态还未改变时通过发布订阅模式去收集事件实例化的时候通过调用构造函数里传出来的方法去修改类里面的状态,这个叫Re...

前端分享-Promise可以中途取消啦(promise可以取消吗)

传统Promise就像一台需要手动组装的设备,每次使用都要重新接线。而Promise.withResolvers的出现,相当于给开发者发了一个智能遥控器,可以随时随地控制异步操作。它解决了三大...

手写 Promise(手写输入法 中文)

前言都2020年了,Promise大家肯定都在用了,但是估计很多人对其原理还是一知半解,今天就让我们一起实现一个符合PromiseA+规范的Promise。附PromiseA+规范地址...

什么是 Promise.allSettled()!新手老手都要会?

Promise.allSettled()方法返回一个在所有给定的promise都已经fulfilled或rejected后的promise,并带有一个对象数组,每个对象表示对应的pr...

前端面试-关于Promise解析与高频面试题示范

Promise是啥,直接上图:Promise就是处理异步函数的API,它可以包裹一个异步函数,在异步函数完成时抛出完成状态,让代码结束远古时无限回掉的窘境。配合async/await语法糖,可...

宇宙厂:为什么前端离不开 Promise.withResolvers() ?

大家好,很高兴又见面了,我是"高级前端进阶",由我带着大家一起关注前端前沿、深入前端底层技术,大家一起进步,也欢迎大家关注、点赞、收藏、转发。1.为什么需要Promise.with...

Promise 新增了一个超实用的 API!

在JavaScript的世界里,Promise一直是处理异步操作的神器。而现在,随着ES2025的发布,Promise又迎来了一个超实用的新成员——Promise.try()!这个新方法简...

一次搞懂 Promise 异步处理(promise 异步顺序执行)

PromisePromise就像这个词的表面意识一样,表示一种承诺、许诺,会在后面给出一个结果,成功或者失败。现在已经成为了主流的异步编程的操作方式,写进了标准里面。状态Promise有且仅有...

Promise 核心机制详解(promise机制的实现原理)

一、Promise的核心状态机Promise本质上是一个状态机,其行为由内部状态严格管控。每个Promise实例在创建时处于Pending(等待)状态,此时异步操作尚未完成。当异步操作成功...

javascript——Promise(js实现promise)

1.PromiseES6开始支持,Promise对象用于一个异步操作的最终完成(包括成功和失败)及结果值的表示。简单说就是处理异步请求的。之所以叫Promise,就是我承诺,如果成功则怎么处理,失败怎...

取消回复欢迎 发表评论: