Discuss New Concept,New Technic,New Tools, Including EAI,BPM,SOA,Tibco,IBM MQ,Tuxedo, Cloud,Hadoop,NoSQL,J2EE,Ruby,Scala,Python, Performance,Scalability,Distributed,HA, Social Network,Machine Learning.

HIVE

Nov 242012
 
 [repost ]Hive中的数据分区  November 24, 2012  Posted by on November 24, 2012 at 8:29 pm HIVE Tagged with: ,  No Responses »

orignal:http://www.cnblogs.com/subsir/articles/2631635.html

首先认识什么是分区

Hive 中的分区就是分目录,把一个大的数据集根据业务需要分割成更下的数据集。

 

1. 如何定义分区,创建分区

hive> create table test(name string,sex int) partitioned by (birth string, age string);
Time taken: 0.044 seconds
hive> alter table test  add partition (birth=’1980′, age =’30′);

Time taken: 0.079 seconds
hive> alter table test  add partition (birth=’1981′, age =’29′);

Time taken: 0.052 seconds
hive> alter table test  add partition (birth=’1982′, age =’28′);

Time taken: 0.056 seconds
hive> show partitions test;
birth=1980/age =30

birth=1981/age =29

birth=1982/age =28

 

2. 如何删除分区

hive> alter table test drop partition (birth=’1980′,age=’30′);

3. 加载数据到指定分区

load data local inpath ‘/home/hadoop/data.log’ overwrite into table

test partition(birth=’1980-01-01′,age=’30′);

创建分区原则: 最少粒度原则

 

4 向partition_test的分区中插入数据:

hive> insert overwrite table partition_test partition(stat_date=’20110728′,province=’henan’) select member_id,name from partition_test_input where stat_date=’20110728′ and province=’henan’;

 

5 还可以同时向多个分区插入数据,0.7版本以后不存在的分区会自动创建,0.6之前的版本官方文档上说必须要预先创建好分区:
hive>
> from partition_test_input
> insert overwrite table partition_test partition (stat_date=’20110526′,province=’liaoning’)
> select member_id,name where stat_date=’20110526′ and province=’liaoning’
> insert overwrite table partition_test partition (stat_date=’20110728′,province=’sichuan’)
> select member_id,name where stat_date=’20110728′ and province=’sichuan’
> insert overwrite table partition_test partition (stat_date=’20110728′,province=’heilongjiang’)
> select member_id,name where stat_date=’20110728′ and province=’heilongjiang’;
Total MapReduce jobs = 4