Hive中表的关联顺序对生成MapReduce作业数的影响案例 - 数据库编程

Hive中在做多表关联时,由于Hive的SQL优化引擎还不够强大,表的关联顺序不同往往导致产生不同数量的MapReduce作业数。这时就需要通过分析执行计划对SQL进行调整，以获得最少的MapReduce作业数。举一个例子（案例基于Hive 0.6.0）：

create table ljn1(

k1 bigint,

k2 String,

v1 int

);

create table ljn2(

k1 bigint,

v2 int

);

create table ljn3(

k1 bigint,

v3 int

);

create table ljn4(

k1 bigint,

v4 int

);

create table ljn5(

k1 bigint,

v5 int

);

create table ljn6(

k2 string,

v6 int

);

然后看一下下面这个SQL的执行计划:

explain

select a.v1

from

ljn1 a

left outer join ljn2 b on (a.k1 = b.k1)

left outer join ljn3 c on (a.k1 = c.k1)

left outer join ljn4 d on (a.k1 = d.k1)

left outer join ljn6 e on (a.k2 = e.k2)

left outer join ljn5 f on (a.k1 = f.k1);

STAGE DEPENDENCIES:

Stage-5 is a root stage

Stage-1 depends on stages: Stage-5

Stage-2 depends on stages: Stage-1

Stage-0 is a root stage

STAGE PLANS:

Stage: Stage-5

Map Reduce

Alias -> Map Operator Tree:

TableScan

alias: a

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 0

value expressions:

expr: k1

type: bigint

expr: k2

type: string

expr: v1

type: int

TableScan

alias: b

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 1

Reduce Operator Tree:

Join Operator

condition map:

Left Outer Join0 to 1

condition expressions:

0 {VALUE._col0} {VALUE._col1} {VALUE._col2}

handleSkewJoin: false

outputColumnNames: _col0, _col1, _col2

File Output Operator

compressed: true

GlobalTableId: 0

table:

output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Stage: Stage-1

Map Reduce

Alias -> Map Operator Tree:

$INTNAME

Reduce Output Operator

key expressions:

expr: _col0

type: bigint

sort order: +

Map-reduce partition columns:

expr: _col0

type: bigint

tag: 0

value expressions:

expr: _col1

type: string

expr: _col2

type: int

TableScan

alias: c

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 1

TableScan

alias: d

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 2

TableScan

alias: f

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 3

Reduce Operator Tree:

Join Operator

condition map:

Left Outer Join0 to 1

Left Outer Join0 to 2

Left Outer Join0 to 3

condition expressions:

0 {VALUE._col3} {VALUE._col4}

handleSkewJoin: false

outputColumnNames: _col3, _col4

File Output Operator

compresse

Hive中表的关联顺序对生成MapReduce作业数的影响案例(一)