Mybatis百万数据插入(含导出）

1 一般一次性插入多条数据

传统的sql语句：

INSERT INTO `table1` ( `field1`, `field2` ) VALUES( "data1", "data2" );
INSERT INTO `table1` ( `field1`, `field2` ) VALUES( "data1", "data2" );
INSERT INTO `table1` ( `field1`, `field2` ) VALUES( "data1", "data2" );
INSERT INTO `table1` ( `field1`, `field2` ) VALUES( "data1", "data2" );
INSERT INTO `table1` ( `field1`, `field2` ) VALUES( "data1", "data2" );
INSERT INTO `table1` ( `field1`, `field2` ) VALUES( "data1", "data2" );
INSERT INTO `table1` ( `field1`, `field2` ) VALUES( "data1", "data2" );
INSERT INTO `table1` ( `field1`, `field2` ) VALUES( "data1", "data2" );

在mybatis中，一次性插入多条数据的时候是用foreach循环实现的，mapper文件中的语句如下：

<insert id="batchInsert" parameterType="java.util.List">insert into USER (id, name) values<foreach collection="list" item="model" index="index" separator=","> (#{model.id}, #{model.name})</foreach>
</insert>

转化为：

INSERT INTO `table1` ( `field1`, `field2` )
VALUES( "data1", "data2" ),( "data1", "data2" ),( "data1", "data2" ),( "data1", "data2" ),( "data1", "data2" );

这个用法看上去是没有问题的，功能也能够实现，但是在一些特殊场景这个用法是不合适的。

当表的列数较多（20+），一次性插入的行数较多（5000+）时，整个插入的耗时十分漫长，一般需要十几分钟，这是不能容忍的。

2 解决方法

在mybatis执行的流程中，用户通过 SqlSession 调用一个方法，SqlSession 通过 Executor 找到对应的 MappedStatement。这一步成为了插入大量数据的关键。

MyBatis Dynamic SQL

https://mybatis.org/mybatis-dynamic-sql/docs/insert.html

2.1 会话对象中执行器的类型

1 默认SIMPLE

每次调用 insert 方法时，MyBatis 都需要创建一个预编译语句 (PreparedStatement) 并执行它。这意味着对于每个单独的插入操作，都会有一定的开销，这就导致了消耗时长成倍的增长。

在MyBatis中，当你使用<foreach>标签来构建一个大的SQL插入语句时，实际上是在构造一个单条SQL语句，只是语句中有很多占位符，这就使得只需要创建一个预编译语句 (PreparedStatement) 并执行它就可以了，这就节省了大量时间。

2 RESUSE

此类型的执行器会重用预编译语句（PreparedStatement）。这意味着对于相同的 SQL 语句，它会重用之前的 PreparedStatement 对象，而不是每次都创建新的 PreparedStatement。这可以减少预编译 SQL 语句的开销，但在某些情况下可能会引起性能问题，特别是当 SQL 语句频繁变化时。

3 BATCH

这是一种批处理执行器，用于执行批量更新操作。它会将多个 SQL 更新语句合并为一个批量执行的操作。这对于批量插入、更新或删除操作特别有用，因为它可以显著减少网络往返次数和事务提交次数，从而提高性能。

2.2 SIMPLE与BATCH的区别

SIMPLE

单行处理：在这种模式下，每次只读取Excel文件中的一行数据，并立即处理。
即时处理：数据读取后会立即交给相应的处理器进行处理，比如使用 BeanRowHandler 进行处理。
内存占用低：由于每次只处理一行数据，因此对内存的需求较低。

BATCH

批量处理：在这种模式下，数据会按照一定的批次读取和处理，而不是逐行处理，减少了I/O操作次数。
批量提交：数据会被收集到一个批次中，然后一起处理，例如一次性将多行数据提交到数据库。
内存占用较高：由于需要缓存一定数量的数据，因此相较于 SIMPLE 模式，内存占用可能会更高。

总结

SIMPLE 模式 更适合于数据量较小或需要立即处理每一行数据的场景。
BATCH 模式 更适合于数据量较大且可以批量处理的情况。

3 测试

3.1 准备工作

3.1.1 创建工程

<dependencies><dependency><groupId>org.springframework.boot</groupId><artifactId>spring-boot-starter-web</artifactId></dependency><dependency><groupId>mysql</groupId><artifactId>mysql-connector-java</artifactId><scope>runtime</scope></dependency><dependency><groupId>org.mybatis.spring.boot</groupId><artifactId>mybatis-spring-boot-starter</artifactId><version>2.3.0</version></dependency><dependency><groupId>org.projectlombok</groupId><artifactId>lombok</artifactId><optional>true</optional></dependency><dependency><groupId>cn.hutool</groupId><artifactId>hutool-all</artifactId><version>5.8.26</version></dependency><dependency><groupId>org.apache.poi</groupId><artifactId>poi-ooxml</artifactId><version>4.1.2</version></dependency>	    <dependency><groupId>com.alibaba</groupId><artifactId>easyexcel</artifactId><version>3.3.4</version></dependency></dependencies>

3.1.2 创建实体类

@Data
public class student {private String name;private int age;private String sex;private String address;private String phone;private String email;
}

百万数据表格Excel

3.2 开始测试

3.2.1 创建”读“方法

    public static List<student> readExcel(String name){// 获取桌面路径String desktopPath = System.getProperty("user.home") + File.separator + "Desktop";String fileName = name;String filePath = desktopPath + File.separator + fileName;// 创建 Excel reader//运用trycatch确保关闭流并捕获异常try(ExcelReader reader = ExcelUtil.getReader(new File(filePath))){List<student> readAll =reader.readAll(student.class);reader.close();return readAll;}catch (Exception e){e.printStackTrace();}return null;}

3.2.2 普通插入

    @PostMapping("/insert")public Integer insert( String name){//StopWatch类是 Hutool 工具库中的类,用于测量代码执行时间StopWatch stopWatch = new StopWatch();stopWatch.start();for (student student : readExcel(name)) {studentService.insert(student);}stopWatch.stop();System.out.println(stopWatch.prettyPrint());return 0;}

这是插入一百条数据。

插入一千条数据，用时2.4秒左右

插入两万条数据可以看到因为数据量太大，通过 HTTP 请求读取数据时超时了，不过程序还在运行，耗时32秒左右。

可以看出这个插入方法在数据量比较小的时候，比如几十条，速度还是可以的，但是在数据表列数较多且数据量较大的时候是不适用的，更不要说百万级别的数据了。

3.2.3 foreach插入

    @PostMapping("/banch")public Integer insertBanch( String name){//StopWatch类是 Hutool 工具库中的类,用于测量代码执行时间StopWatch stopWatch = new StopWatch();stopWatch.start();studentService.insertBatch(readExcel(name));stopWatch.stop();System.out.println(stopWatch.prettyPrint());return 0;}

直接上强度，就不一个一个的试了，直接插入两万条数据。

可以看出两万条数据只需要4.02秒左右，因为我的excel文件的列很少，只有6列，所以foreach默认的执行器类型SIMPLE勉强还算够用，但还是没有资格越去插入百万条数据只有6列的数据。

3.2.4 BATCH插入

    @PostMapping("/add1")public Integer add1( String name){//StopWatch类是 Hutool 工具库中的类,用于测量代码执行时间StopWatch stopWatch = new StopWatch();stopWatch.start();//获取批量插入的数据SqlSession session = sqlSessionFactory.openSession(ExecutorType.BATCH, false);try  {studentDao mapper = session.getMapper(studentDao.class);// 批量插入数据
//            mapper.insertBatch(readExcel(name));for (student student : readExcel(name)) {mapper.insert(student);}// 提交事务session.commit();} catch (Exception e) {e.printStackTrace();session.rollback();}stopWatch.stop();System.out.println(stopWatch.prettyPrint());return 0;}

25.95，为什么会这样呢？为什么会和上面的普通插入的时间差不多呢？

因为两万条数据在插入时，虽然使用了batch，但是还是和普通插入是一样的，每次调用 insert 方法时，MyBatis 都需要创建一个预编译语句 (PreparedStatement) 并执行它。这意味着对于每个单独的插入操作，都会有一定的开销。

可以看到，insertBatch与BATCH结合使用时，插入两万条数据只需要3.6秒左右，比4.2秒快的并不是很多，那是因为数据量太小的原因，如果加大数据量那么就会拉开两者间的差距，将BATCH插入的优势体现出来。

就用一百万的数据量将两者之间的差距彻底体现出来吧。

那么上面一下子读取所有数据的”读“方法肯定不适用了，那就换一种读的方法。

private RowHandler createRowHandler() {return new RowHandler() {@Overridepublic void handle(int sheetIndex, long rowIndex, List<Object> rowlist) {Console.log("[{}] [{}] {}", sheetIndex, rowIndex, rowlist);}};
}
ExcelUtil.readBySax("aaa.xls", 0, createRowHandler());

这是hutool工具读取大的Excel文件的方法，就把这个方法略作修改来进行验证试验吧。


@Service 
public class HutoolImportService {@Autowiredprivate SqlSessionFactory sqlSessionFactory;@Autowiredprivate com.by.dao.studentDao studentDao;private List<student> students = new ArrayList<>();public void importExcel() {String filePath = "C:\\Users\\ljj\\Desktop\\export-20240808194933.xlsx";System.out.println("开始读取数据………………");System.out.println("开始插入");StopWatch stopWatch = new StopWatch();stopWatch.start();ExcelUtil.readBySax(filePath, 0, createRowHandler());stopWatch.stop();System.out.println(stopWatch.getTotalTimeMillis());System.out.println("插入数据成功");}private RowHandler createRowHandler() {SqlSession session = sqlSessionFactory.openSession(ExecutorType.BATCH, false);com.by.dao.studentDao mapper = session.getMapper(com.by.dao.studentDao.class);return new RowHandler() {@Overridepublic void handle(int sheetIndex, long rowIndex, List<Object> rowlist) {if (rowIndex > 0) {String name = (String) rowlist.get(0);Integer age = Integer.valueOf(rowlist.get(1).toString());String sex = (String) rowlist.get(2);String address = (String) rowlist.get(3);String phone =  rowlist.get(4).toString();String email =  rowlist.get(5).toString();student student = new student(name, age, sex, address, phone, email);students.add(student);if (students.size() >= 100000) {
//                        studentDao.insertBatch(students);mapper.insertBatch(students);students.clear();}}}};}
}

SIMPLE插入

BATCH插入

可以看到SIMPLE插入所用的时间是79.4秒，BATCH插入所用的时间是66.1秒，当把数据量提升到一百万的时候，BATCH插入比SIMPLE插入快了13.3秒，这样就可以看出在进行大数据量的插入时BATCH的优势所在了。

4 多线程插入百万条数据(优化)

4.1 hutool 多线程插入

4.1.1 创建合理线程池

@Configuration
public class DynamicThreadPool {@Bean(name = "threadPoolExecutor")public ThreadPoolExecutor easyExcelStudentImportThreadPool() {int processors = Runtime.getRuntime().availableProcessors();//获取系统处理器数量return new ThreadPoolExecutor(processors + 1,//最小线程数：系统处理器数量 + 1processors * 2 + 1,//最大线程数：系统处理器数量 * 2 + 110 * 60,//线程空闲时间：10分钟TimeUnit.SECONDS,//单位：秒new LinkedBlockingQueue<>(1000000));//队列长度：1000000}}

先给线程池的类命名，这个名字随意些，用线程池的时候用不到。

然后开始配置线程池，运用@Bean注解给线程池命名，name一定要注意，运用线程池的时候需要用到。

设置线程池的最小线程数量是当前系统的处理器核心数，并在此基础上加1来设置核心线程数。

好处：核心线程数通常设置为处理器核心数加上一定的增量，这样可以充分利用系统的处理器资源，提高并行处理能力。增加一个额外的线程是为了处理突发性的任务，以避免线程池立即创建更多的线程。

设置线程池的最大线程数量是处理器核心数的两倍再加上1。

好处：这种设置允许线程池在高负载下扩展更多线程，以处理更多的并发任务。这对于处理短暂的高峰负载特别有用，同时避免了过多线程带来的开销。

空闲线程的存活时间，我谨慎一点设置成10分钟，其实几分钟就够用了，因为我不会让插入的时间超过一分钟。

4.1.2 测试

controller

   @RequestMapping("/import")public void importExcel(){System.out.println("开始导入步骤！");hutoolImportService.importExcel();}

service

 public void importExcel() {String filePath = "C:\\Users\\ljj\\Desktop\\export-20240808194933.xlsx";System.out.println("开始读取数据………………");System.out.println("开始插入");StopWatch stopWatch = new StopWatch();stopWatch.start();SqlSession session = sqlSessionFactory.openSession(ExecutorType.BATCH, false);//        ExcelUtil.readBySax(filePath, 0, new StudentRowHandler(0, 1, 1000000, student.class));ExcelUtil.readBySax(filePath, 0, createRowHandler());stopWatch.stop();System.out.println(stopWatch.getTotalTimeMillis());System.out.println("插入数据成功");}private RowHandler createRowHandler() {return new RowHandler() {@Overridepublic void handle(int sheetIndex, long rowIndex, List<Object> rowlist) {if (rowIndex > 0) {String name = (String) rowlist.get(0);Integer age = Integer.valueOf(rowlist.get(1).toString());String sex = (String) rowlist.get(2);String address = (String) rowlist.get(3);String phone = rowlist.get(4).toString();String email = rowlist.get(5).toString();student student = new student(name, age, sex, address, phone, email);students.add(student);if (students.size() >= 100000) {List<List<student>> lists = ListUtil.split(students, 10000);//将successList分割成多个子列表，每个子列表最多包含10000条记录CountDownLatch countDownLatch = new CountDownLatch(lists.size());SqlSession session = sqlSessionFactory.openSession(ExecutorType.BATCH, false);com.by.dao.studentDao mapper = session.getMapper(com.by.dao.studentDao.class);for (List<student> list : lists) {threadPoolExecutor.execute(() -> {//提交任务到线程池中执行try {mapper.insertBatch(list);//} catch (Exception e) {System.out.println("启动线程失败,错误信息:" + e.getMessage());} finally {//执行完一个线程减1,直到执行完countDownLatch.countDown();}});}try {//使用 countDownLatch.await() 方法等待所有子任务完成。//如果当前线程被中断，await 方法将抛出 InterruptedException。countDownLatch.await();session.commit();} catch (Exception e) {System.out.println("等待所有线程执行完异常,e:" + e);}// 提前将不再使用的集合清空，释放资源students.clear();lists.clear();}}}};}

4.2 easyExcel 多线程插入

4.2.1 创建线程池

4.2.2 "监听器"实现分片”读“和插入

@Servicepublic class ReadListener implements com.alibaba.excel.read.listener.ReadListener<student> {//成功的集合    private final List<student> successList = new ArrayList<>();//每次读取100000条    private final static int BATCH_COUNT = 100000;//注入sqlSessionFactory    @Autowired    private SqlSessionFactory sqlSessionFactory;//线程池    @Resource    private ThreadPoolExecutor threadPoolExecutor;//注入dao    @Autowired    private studentDao studentDao;/**     * 读取数据     *     * @param student     * @param analysisContext     */    @Override    public void invoke(student student, AnalysisContext analysisContext) {        if (StringUtils.isNotBlank(student.getName())) {            successList.add(student);            return;        }        //size是否为100000条:这里其实就是分批.当数据等于10w的时候执行一次插入        if (successList.size() >= BATCH_COUNT) {            System.out.println("读取数据：" + successList.size());            saveData();            //清理集合便于GC回收            successList.clear();        }    }/**     * 多线程插入     */    private void saveData() {List<List<student>> lists = ListUtil.split(successList, 10000);//将successList分割成多个子列表，每个子列表最多包含10000条记录CountDownLatch countDownLatch = new CountDownLatch(lists.size());//使用计数器控制线程同步。创建一个CountDownLatch对象，用于控制多个线程之间的同步和等待。for (List<student> list : lists) {threadPoolExecutor.execute(() -> {//提交任务到线程池中执行SqlSession session = sqlSessionFactory.openSession(ExecutorType.BATCH, false);com.by.dao.studentDao mapper = session.getMapper(studentDao.class);StopWatch stopWatch = new StopWatch();stopWatch.start();try {mapper.insertBatch(list);//stopWatch.stop();session.commit();System.out.println("插入数据：" + list.size() + "条,耗时：" + stopWatch.getTotalTimeMillis() + "纳秒。");} catch (Exception e) {System.out.println("启动线程失败,错误信息:" + e.getMessage());} finally {//执行完一个线程减1,直到执行完countDownLatch.countDown();}});}try {//使用 countDownLatch.await() 方法等待所有子任务完成。//如果当前线程被中断，await 方法将抛出 InterruptedException。countDownLatch.await();
//            session.commit();} catch (Exception e) {System.out.println("等待所有线程执行完异常,e:" + e);}// 提前将不再使用的集合清空，释放资源successList.clear();lists.clear();}@Override    public void doAfterAllAnalysed(AnalysisContext analysisContext) {        //读取剩余数据        if (CollectionUtils.isNotEmpty(successList)) {            System.out.println("读取数据：" + successList.size() + "条。");            saveData();        }    }}

首先设置需要的变量，其中要将线程池注入进来，因为是自己设置的线程池，而且放入了容器，其中加入了Bean注解定义了name，Autowired注解是会报错的，要用Resource注解，通过name注入。

重写ReadListener的invoke方法，这个方法在每次读取一行数据时被调用，从头开始读取数据，首先判断对象返回的name值是否为null，如果不为null放到list集合里面，然后判断集合的大小是否超过了十万，超过了就保存起来放入大集合里面进行保存。

使用 ListUtil.split 方法将 successList 分割成多个子列表，每个子列表最多包含10000条记录。ListUtil.split 用于将列表分割成指定大小的子列表。这种分批处理可以显著提高数据插入数据库的效率，因为它减少了与数据库的交互次数。

使用 CountDownLatch 来同步所有子列表的插入操作完成。lists.size() 指定了计数器的初始计数值，表示有多少个子任务需要完成。

将List<List<student>>循环插入到数据库，对于每个子列表list<student>，向线程池 threadPoolExecutor 提交一个 lambda 表达式作为任务，每个任务将执行数据插入操作。

使用countDownLatch.await() 方法等待所有子任务完成如果当前线程被中断，await 方法将抛出 InterruptedException运用try……catch处理。

在所有任务完成后，清空 successList 和 lists 集合，释放内存资源。

这段代码实现了将从Excel文件中读取的数据分批插入数据库的功能。通过使用批处理和多线程技术，它可以有效地提高数据处理的速度和效率。同时，通过使用 CountDownLatch 控制线程同步，确保所有数据插入操作完成后才继续执行后续操作。

doAfterAllAnalysed方法是ReadListener中的方法，表示在所有数据读取完成后被调用的方法。判断successList是否为空，如果 successList 非空，则说明还有未处理的数据，继续执行saveData方法。

4.2.3 测试

controller

@RestController@RequestMapping("/excel")public class ExprotController {@Autowired    private EasyExcelImportService easyExcelImportService;@RequestMapping("/import")    public void importExcel(String name){       easyExcelImportService.importExcel(name);    }}

service

@Service
public class EasyExcelImportService {@Autowiredprivate com.by.dao.studentDao studentDao;@Autowiredprivate ReadListener readListener;public void importExcel() {try {String filePath = "C:\\Users\\ljj\\Desktop\\export-20240808194933.xlsx";FileInputStream inputStream = new FileInputStream(filePath);long beginTime = System.currentTimeMillis();//加载文件读取监听器//easyexcel的read方法进行数据读取和插入EasyExcel.read(inputStream, student.class, readListener).sheet().doRead();System.out.println("读取文件耗时：" + (System.currentTimeMillis() - beginTime) / 1000 + "秒");} catch (Exception e) {System.out.println("导入异常" + e.getMessage());}}}

通过构造注入自定义的监听器，

使用 EasyExcel.read 方法开始读取操作。
第一个参数是输入流 inputStream，用于读取 Excel 文件。
第二个参数是数据模型的类 student.class，用于表示 Excel 中的数据。
第三个参数是 readListener，用于处理读取的数据。
调用 .sheet().doRead() 方法来指定读取的第一个工作表，并开始读取操作。
记录读取操作开始的时间戳 beginTime。
在读取完成后，计算并输出读取耗时。
捕获可能发生的异常，并输出错误信息。

插入一百万条数据

经过测试插入一百万条数据的6列Excel大概需要31秒左右，可以看到插入几乎是不消耗时间的，消耗时间的地方还是”读“这一步。

这个速度虽然勉强还算能够接受，但是阅读超时了，那就是说还不够好，还有优化的空间。

5 再优化（失败了）

运用原生的jdbc进行分批操作，也许就能快点了。

想的很好，但现实给了我一巴掌，优化宣告失败。但是这种方法也是可行，也可以把数据插入到数据库，代码写在下面了。

util工具类

package com.by.util;import org.springframework.context.annotation.Configuration;import java.io.IOException;
import java.sql.*;
import java.util.Properties;
@Configuration
public class propertyUtil {private static  String driver;private static  String url;private static  String name;private static  String password;static{Properties properties = new Properties();try {properties.load(propertyUtil.class.getClassLoader().getResourceAsStream("application.properties"));driver = properties.getProperty("spring.datasource.driver-class-name");url = properties.getProperty("spring.datasource.url");name = properties.getProperty("spring.datasource.username");password = properties.getProperty("spring.datasource.password");Class.forName(driver);} catch (IOException | ClassNotFoundException e) {e.printStackTrace();}}/*** 获取数据库连接对象* @return* @throws Exception*/public static Connection getConnect() throws Exception {return DriverManager.getConnection(url, name, password);}/*** 关闭数据库相关资源* @param conn* @param ps* @param rs*/public static void close(Connection conn, PreparedStatement ps, ResultSet rs) {try {if (conn != null) conn.close();if (ps != null) ps.close();if (rs != null) rs.close();} catch (SQLException e) {throw new RuntimeException(e);}}public static void close(Connection conn, PreparedStatement ps) {close(conn, ps, null);}public static void close(Connection conn, ResultSet rs) {close(conn, null, rs);}
}

然后用下列代码代替工具类中的saveData方法就可以了。

 public void import4Jdbc(){//分批读取+JDBC分批插入+手动事务控制Connection conn = null;//JDBC存储过程PreparedStatement ps = null;try {//建立jdbc数据库连接conn = propertyUtil.getConnect();//关闭事务默认提交conn.setAutoCommit(false);String sql = "insert into student (name,age,sex,address,email,phone) values";sql += "(?,?,?,?,?,?)";ps = conn.prepareStatement(sql);for (int i = 0; i < successList.size(); i++) {student user = new student();ps.setString(1,successList.get(i).getName());ps.setInt(2,successList.get(i).getAge());ps.setString(3,successList.get(i).getSex());ps.setString(4,successList.get(i).getAddress());ps.setString(5,successList.get(i).getEmail());ps.setString(6,successList.get(i).getPhone());//将一组参数添加到此 PreparedStatement 对象的批处理命令中。ps.addBatch();}//执行批处理ps.executeBatch();//手动提交事务conn.commit();} catch (Exception e) {e.printStackTrace();} finally {//记得关闭连接com.by.util.propertyUtil.close(conn,ps);}}