[AI Trainer] Comprehensive Case Study: Integrating HBase and Hive

9.1 HBase and Hive

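Objectives
Briefly review Hive
Understand the differences between Hive and HBase
Task List
Task 1: Introduction to Hive
Task 2: Differences between HBase and Hive
Steps
Task 1: Introduction to Hive
What is Hive? Apache Hive is a data warehouse built on top of the Hadoop infrastructure.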

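A data warehouse built on top of Hadoop
Defines an SQL-like query language, HQL (similar to SQL but not identical)
Typically used for offline data analysis (executed as MapReduce jobs)
Supports several underlying execution engines (for example, Hive on MapReduce)
Supports a variety of compression formats, storage formats, and user-defined functions
The Apache Hive data warehouse software makes it easy to read, write, and manage large datasets residing in distributed storage using SQL. Structure can be projected onto data that is already in storage. A command-line tool and a JDBC driver are provided to connect users to Hive. The architecture is shown below: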

9.1-1

The driver in turn consists of components such as the parser, compiler, optimizer, and executor.

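Task 2: Differences between HBase and Hive
Features

Hive

Runs MapReduce jobs and is JDBC-compatible
Uses partitioning to manage massive datasets
HBase

Works by storing key-value pairs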
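Limitations

Hive

Queries take a relatively long time to run
A schema must be defined in advance
Not ACID-compliant by default; transactional tables must be enabled explicitly
HBase

Steep learning curve
Depends on ZooKeeper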
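Use cases

Hive

Targets OLAP workloads; well suited to analytical queries over data collected during a period of time.
HBase

A key-value database; well suited to real-time queries over big data.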
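Summary

Hive

Tables are logical
Row-oriented data processing
Stores dense data
Processes data with Hadoop
Batch processing
Supports SQL
HBase

Tables are physical and store unstructured data
Column-oriented
Stores sparse data
Real-time queries
Supports row-level updates
Not suited to applications that involve joins, multi-level indexes, or complex relationships between tables.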

9.2 Installing Hive

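Objectives
Learn how to install MySQL from RPM packages
Learn how to install Hive
Task List
Task 1: Install MySQL
Task 2: Install Hive
Steps
Task 1: Install MySQL
1. Install the MySQL community repository package with rpm.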

rpm -ivh mysql-community-release-el7-5.noarch.rpm

9.2-1

2. Install the MySQL server.

yum -y install mysql-server

9.2-2

3. Change the owner and group of the MySQL data and runtime directories to mysql.

chown -R mysql:mysql /var/lib/mysql /var/run/mysqld

9.2-3

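4. At this point /var/run/mysqld contains no pid or sock file, so rebuild the grant tables and start the server as the mysql user. After running both commands, press Enter to return to the prompt.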

/usr/bin/mysql_install_db --user=mysql
/usr/bin/mysqld_safe --user=mysql &

9.2-4

5. Enter the MySQL client.

mysql -uroot

9.2-5

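Task 2: Install Hive
1. Configure the system environment variables in /etc/profile, then reload the file so that they take effect.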

vim /etc/profile
source /etc/profile

9.2-6
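
The listing above shows only the edit and reload commands, not the lines that go into /etc/profile. A minimal sketch of what those lines might look like, assuming Hive is unpacked at /root/software/hive-2.3.4 (the path that appears in the testdata examples in section 9.3); adjust it to your own installation:

```shell
# Lines to append to /etc/profile.
# The install path is an assumption; change it to match your system.
export HIVE_HOME=/root/software/hive-2.3.4
export PATH="$PATH:$HIVE_HOME/bin"
```

With $HIVE_HOME/bin on PATH, commands such as hive and schematool can be run from any directory.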

2. Check the version number to confirm that the configuration works.

hive --version

9.2-7

3. Create the HDFS directories that Hive needs.

hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -mkdir -p /user/hive/tmp
hadoop fs -mkdir -p /user/hive/log
hadoop fs -chmod 777 /user/hive/warehouse
hadoop fs -chmod 777 /user/hive/tmp
hadoop fs -chmod 777 /user/hive/log
hadoop fs -ls /user/hive

9.2-8
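
Because the same three paths are used by both the mkdir and the chmod commands, one way to keep them identical is to derive every command from a single variable. The sketch below is an illustration rather than part of the original procedure; HIVE_HDFS_ROOT, run, and DRY_RUN are hypothetical names introduced here. By default it only prints the commands it would run.

```shell
# Dry run by default: prints the commands. Set DRY_RUN=0 to execute them.
HIVE_HDFS_ROOT=/user/hive

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create each directory and open its permissions, using the same path for both.
for d in warehouse tmp log; do
  run hadoop fs -mkdir -p "$HIVE_HDFS_ROOT/$d"
  run hadoop fs -chmod 777 "$HIVE_HDFS_ROOT/$d"
done
```

Running it with DRY_RUN=0 on a machine where hadoop is on PATH performs the same work as the commands in step 3.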

4. In Hive's conf directory, create or edit hive-site.xml and add the following configuration.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<configuration>
<property>
<name>system:java.io.tmpdir</name>
<value>/tmp/hive/java</value>
</property>
<property>
<name>system:user.name</name>
<value>${user.name}</value>
</property>
<!-- Location for temporary job files -->
<property>
<name>hive.exec.scratchdir</name>
<value>/user/hive/tmp</value>
<description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
</property>
<!-- HDFS location of the Hive warehouse -->
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
<!-- Log location -->
<property>
<name>hive.querylog.location</name>
<value>/user/hive/log</value>
<description>Location of Hive run time structured log file</description>
</property>
<!-- JDBC URL of the MySQL metastore database -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
</property>
<!-- MySQL JDBC driver class -->
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<!-- User name for the connection -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hadoop</value>
</property>
<!-- Password for the connection -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hadoop</value>
</property>
</configuration>
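
The HDFS paths in this file have to match the directories created in step 3, so it can help to read them back out of the file. The helper below is a minimal sketch; get_prop is a hypothetical name, and it assumes each <name> and its <value> sit on separate lines, as they do in the listing above. The demo runs against a small temporary file rather than a real hive-site.xml.

```shell
# get_prop <file> <property-name>: print the <value> that follows a given <name>.
# Assumes <name> and <value> are on separate, consecutive-ish lines.
get_prop() {
  awk -v name="$2" '
    index($0, "<name>" name "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }
  ' "$1"
}

# Demo on a temporary file that mimics the layout of hive-site.xml.
demo=$(mktemp)
cat > "$demo" <<'EOF'
<configuration>
<property>
<name>hive.exec.scratchdir</name>
<value>/user/hive/tmp</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
</configuration>
EOF
get_prop "$demo" hive.metastore.warehouse.dir   # prints /user/hive/warehouse
rm -f "$demo"
```

Against a real installation the call would be something like get_prop "$HIVE_HOME/conf/hive-site.xml" hive.exec.scratchdir, where the file location is again an assumption.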

5. In MySQL, create a database named hive and a user named hadoop, grant that user sufficient privileges, and then exit MySQL.

CREATE DATABASE hive; 
USE hive; 
CREATE USER 'hadoop'@'localhost' IDENTIFIED BY 'hadoop';
GRANT ALL ON hive.* TO 'hadoop'@'localhost' IDENTIFIED BY 'hadoop'; 
GRANT ALL ON hive.* TO 'hadoop'@'%' IDENTIFIED BY 'hadoop'; 
FLUSH PRIVILEGES; 
quit;

6. Place the MySQL JDBC driver JAR in Hive's lib directory.

mv ./mysql-connector-java-5.1.35.jar lib/
ls lib/

9.2-9

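7. Initialize the Hive metastore schema in MySQL. When the command succeeds, its output ends with a line reporting that schemaTool completed.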

schematool -dbType mysql -initSchema

8. In Hive's conf directory, copy hive-env.sh.template to hive-env.sh, then edit it as shown below.

cp hive-env.sh.template hive-env.sh

9.2-10
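
The edited contents of hive-env.sh appear only in the figure. As a rough sketch, the settings usually changed in this file are the Hadoop home directory and the Hive configuration directory. Both values below are assumptions: /path/to/hadoop is a placeholder, and the Hive path is the one used in the testdata examples in section 9.3.

```shell
# Typical hive-env.sh settings (values are assumptions, not taken from the figure).
export HADOOP_HOME=/path/to/hadoop                     # placeholder: your Hadoop install dir
export HIVE_CONF_DIR=/root/software/hive-2.3.4/conf    # assumed Hive install location
```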

9. Start Hive from the terminal to test the environment.

bin/hive
show databases;

9.2-11

9.3 Syncing a Hive Table to HBase

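Objectives
Learn the syntax for creating a table that syncs to HBase and how to load data into it
Task List
Task 1: Sync a Hive table to HBase
Steps
Task 1: Sync a Hive table to HBase
Initialization: this step creates, in the hive database, the set of tables that Hive needs in order to run. From the Hive installation directory, run the initialization command.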

bin/schematool -dbType mysql -initSchema

9.3-1

1. Create the table

Open the Hive client, create a test database, and then create a table that HBase can recognize.

create database test;
use test;
create table if not exists hbase_hive(
id int,
name String
)
stored by "org.apache.hadoop.hive.hbase.HBaseStorageHandler"
with serdeproperties("hbase.columns.mapping"=":key,cf1:uname")
tblproperties("hbase.table.name"="h_hive");

2. Check in HBase

Open the HBase shell and check whether the h_hive table has been created.

9.3-2

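3. A table backed by the HBase storage handler cannot be populated with a LOAD DATA statement. The following demonstrates the incorrect approach: the testdata directory contains a file, student.txt, that holds the test data, and the statement below tries to load it into the Hive table.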

use test;
load data local inpath '/root/software/hive-2.3.4/testdata/student.txt' into table hbase_hive;

9.3-3

4. The correct approach: create an ordinary Hive table, stu, as a staging table for loading data into hbase_hive.

create table stu(
id int,
name string)
row format delimited fields terminated by '\t';

9.3-4
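
The clause fields terminated by '\t' means each line of the input file must hold an id and a name separated by a single tab. The contents of student.txt are not shown in the original, so the rows below are hypothetical; the sketch writes them to a separate sample file (its name and location are also assumptions) and checks that every line has exactly two tab-separated fields.

```shell
# Write three hypothetical rows in the id<TAB>name layout that table stu expects.
sample="${TMPDIR:-/tmp}/student_sample.txt"
printf '%s\t%s\n' 1 tom 2 jerry 3 alice > "$sample"

# Report the field count and parsed values for each line.
awk -F'\t' '{ print NF " fields: id=" $1 ", name=" $2 }' "$sample"
```

A line with a different delimiter, such as spaces or commas, would be read by Hive as a single field, leaving the id column NULL when it cannot be parsed as an int.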

5. Load the test data in student.txt into the staging table stu.

use test;
load data local inpath '/root/software/hive-2.3.4/testdata/student.txt' into table stu;

9.3-5

6. Use an INSERT statement to load the data into the HBase-backed table.

insert into table hbase_hive
select * from test.stu;
select * from hbase_hive;

9.3-6

7. Check in HBase. Open the HBase shell and view the data in the table.

9.3-7
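
The HBase-side checks in steps 2 and 7 are shown only as figures. A minimal sketch of the commands involved, wrapped so they can be run from the system shell: hbase_run is a hypothetical helper introduced here, and it assumes an HBase version whose shell accepts the -n (non-interactive) flag. When hbase is not on PATH it only reports what it would have run.

```shell
# Run one HBase shell command non-interactively; skip when hbase is not installed.
hbase_run() {
  if command -v hbase >/dev/null 2>&1; then
    echo "$1" | hbase shell -n
  else
    echo "skip (hbase not on PATH): $1"
  fi
}

hbase_run "list"             # step 2: h_hive should appear in the table list
hbase_run "scan 'h_hive'"    # step 7: rows inserted from Hive, stored under cf1:uname
```

The same two commands, list and scan 'h_hive', can also be typed directly at the interactive hbase shell prompt.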

9.4 Syncing an HBase Table to Hive

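Objectives
Learn the basic method for syncing an HBase table to Hive
Task List
Task 1: Sync an HBase table to Hive
Steps
Task 1: Sync an HBase table to Hive
Initialization

From the Hive installation directory, run the initialization step.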

bin/schematool -dbType mysql -initSchema

9.4-1

Create the table

Open the HBase shell and create a test namespace, ns, and a test table, test, with a single column family, info.

bin/hbase shell
create_namespace 'ns'
create 'ns:test','info'

9.4-2

Insert data

Insert test data into the test table.

put 'ns:test','1','info:name','tom'
put 'ns:test','1','info:age','22'
put 'ns:test','1','info:sex','man'
put 'ns:test','2','info:name','jeryy'
put 'ns:test','2','info:age','21'
put 'ns:test','2','info:sex','woman'

Create the Hive table

Create a Hive external table that points to the existing HBase table.

create external table hbase_hive(key int,name string,age string,sex string
) 
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,info:name,info:age,info:sex")
tblproperties("hbase.table.name" = "ns:test","hbase.mapred.output.outputtable" = "ns:test");
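
Hive pairs the entries of hbase.columns.mapping with the table's columns by position, with the HBase row key written as :key. The helper below only illustrates that pairing; pair_mapping is a hypothetical name, not a Hive or HBase tool. It expects :key to be listed explicitly, which is the form the Hive documentation recommends, and reports a mismatch when the two lists differ in length.

```shell
# pair_mapping "<hive columns, comma-separated>" "<hbase.columns.mapping value>"
# Prints one "hive_column -> hbase_column" line per position.
pair_mapping() {
  awk -v cols="$1" -v maps="$2" 'BEGIN {
    n = split(cols, c, ",")
    m = split(maps, h, ",")
    if (n != m) {
      printf "mismatch: %d Hive columns vs %d mapping entries\n", n, m > "/dev/stderr"
      exit 1
    }
    for (i = 1; i <= n; i++) print c[i] " -> " h[i]
  }'
}

# The four Hive columns of the external table, paired with an explicit mapping.
pair_mapping "key,name,age,sex" ":key,info:name,info:age,info:sex"
```

After the table is created, running select * from hbase_hive; in Hive should return the rows that were put into ns:test.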