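  • The project needs to upload Office documents and preview them online. It uses LibreOffice to convert Office documents to PDF, and then pdfbox to convert the PDF to images.
  • This article covers the first step: converting Office documents to PDF with LibreOffice.
  • The same approach applies to OpenOffice; only LibreOffice is used as the example below.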

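Preface

  • To upload Office documents and preview them online, I tried POI and the commercial library aspose; neither gave satisfactory results.
  • Converting PDF to images, however, is a solved problem, so the question became: how to convert Office documents to PDF.
  • LibreOffice converts Office documents to PDF very well; the output is almost identical to saving as PDF directly from Microsoft Office. For online preview it gave the best results of the options I tried and may well be the best solution.
  • The prerequisite is that LibreOffice must be installed on the server, and on development machines too for testing. Fortunately, LibreOffice is cross-platform.

    I also tried converting Office documents directly to images with LibreOffice, but only the first page came out, and the help text shows no way to export every page as an image.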

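Converting Office documents to PDF from Java with LibreOffice

There are two ways to convert, each with trade-offs; choose whichever fits.

Asynchronous conversion

This approach invokes an operating system command. The conversion is asynchronous and takes an unpredictable amount of time depending on file size. If the document has to be previewed immediately after upload, use the synchronous approach instead.

  • Pros: simple to implement, no extra configuration, and no third-party dependency (LibreOffice itself must of course be installed).
  • Cons: once the command has been sent, there is no way to learn whether the conversion succeeded or hit an error, so the outcome is uncertain. With thorough testing, though, the conversion can generally be made reliable.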
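One way to reduce that uncertainty, sketched below, is to pass the arguments to ProcessBuilder as a list (so paths containing spaces survive), wait with a timeout, and then check for the expected output file instead of trusting the exit code alone. The class SofficeCommand and its methods are hypothetical names, not part of this article's code. The sketch also assumes that soffice is launched directly and stays in the foreground until the conversion finishes, which should be verified on the target platform.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of the original article's code. */
final class SofficeCommand {

    /** Builds the argument list; each element reaches the OS as exactly one argument. */
    static List<String> build(String executable, Path source, Path outDir) {
        List<String> args = new ArrayList<>();
        args.add(executable);
        args.add("--headless");
        args.add("--convert-to");
        args.add("pdf");
        if (outDir != null) {
            args.add("--outdir");
            args.add(outDir.toString());
        }
        args.add(source.toString());
        return args;
    }

    /** Path the PDF is expected at: same base name as the source, with a .pdf extension. */
    static Path expectedPdf(Path source, Path outDir) {
        String name = source.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        Path dir = outDir != null ? outDir : Paths.get("");
        return dir.resolve(base + ".pdf");
    }

    /** Starts the process and waits up to timeoutSeconds; true only if it exited with 0 in time. */
    static boolean run(List<String> args, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(args)
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.INHERIT)
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly(); // give up on a conversion that takes too long
            return false;
        }
        return process.exitValue() == 0;
    }
}
```

A caller would combine the pieces: build the list, call run, and treat the conversion as successful only if run returned true and Files.exists(expectedPdf(...)) holds.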

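Synchronous conversion

This approach uses JodConverter: https://github.com/sbraconnier/jodconverter

  • Pros: the conversion is synchronous, so whether it succeeded is known for certain.
  • Cons: a LibreOffice service has to be running while the code runs, which consumes operating system resources. Compared with the asynchronous approach, it also needs a third-party library and extra configuration.

Complete code

Add the dependency (synchronous approach only)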

<dependency>
    <groupId>org.jodconverter</groupId>
    <artifactId>jodconverter-local</artifactId>
    <version>4.2.4</version>
</dependency>

Add a libre.properties file to the resources directory (synchronous approach only)

Its contents:

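# LibreOffice home directory
libreOfficeHome=C:/dev/LibreOffice6.4
# To run several LibreOffice processes, list several ports; each port maps to one process
# portNumbers=2002,2003
portNumbers=2002
# Task execution timeout: 5 minutes
taskExecutionTimeoutMinutes=5
# Task queue timeout: 1 hour
taskQueueTimeoutHours=1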
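These values arrive as plain strings and are parsed later by OfficeManagerInstance, so a typo or an empty portNumbers entry only shows up at startup as a NumberFormatException. A small validation step, sketched here with the hypothetical helper LibreConfig (not part of this article's code), can give clearer errors and makes the unit conversions explicit.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that validates values from libre.properties before they are used. */
final class LibreConfig {

    /** Parses a comma-separated port list; rejects an empty or missing value with a clear message. */
    static int[] parsePorts(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            throw new IllegalArgumentException("portNumbers must list at least one port");
        }
        String[] parts = raw.split(",");
        int[] ports = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            ports[i] = Integer.parseInt(parts[i].trim()); // tolerate spaces around commas
        }
        return ports;
    }

    /** Reads a whole-number property and converts it to milliseconds using the given unit. */
    static long millis(Properties props, String key, TimeUnit unit) {
        String raw = props.getProperty(key);
        if (raw == null) {
            throw new IllegalArgumentException("missing property: " + key);
        }
        return unit.toMillis(Long.parseLong(raw.trim()));
    }

    public static void main(String[] args) throws IOException {
        // Same keys as the file above, loaded from a string so the demo needs no file on disk.
        Properties props = new Properties();
        props.load(new StringReader(
                "portNumbers=2002\ntaskExecutionTimeoutMinutes=5\ntaskQueueTimeoutHours=1\n"));
        System.out.println(parsePorts(props.getProperty("portNumbers")).length);
        System.out.println(millis(props, "taskExecutionTimeoutMinutes", TimeUnit.MINUTES));
        System.out.println(millis(props, "taskQueueTimeoutHours", TimeUnit.HOURS));
    }
}
```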

The conversion class LibreOfficeUtil

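package com.example.demo;

import com.example.factory.OfficeManagerInstance;
import org.jodconverter.JodConverter;

import java.io.File;

public class LibreOfficeUtil {
    /**
     * Converts an Office document to PDF with JodConverter (requires LibreOffice).
     * The conversion is synchronous: it has finished by the time this method returns.
     *
     * @return true if the conversion succeeded, false if an exception was thrown
     */
    public static boolean convertOffice2PDFSyncIsSuccess(File sourceFile, File targetFile) {
        try {
            OfficeManagerInstance.start();
            JodConverter.convert(sourceFile).to(targetFile).execute();
        } catch (Exception e) {
            e.printStackTrace();
            return false;
        }

        return true;
    }

    /**
     * Converts an Office document to PDF by calling LibreOffice directly. The conversion is
     * asynchronous: it may still be running when this method returns, and any conversion
     * error goes unnoticed.
     *
     * @param filePath directory containing the source file (on non-Windows systems it must end with a path separator)
     * @param fileName name of the source file
     * @param targetFilePath output directory; pass an empty string to omit --outdir
     * @return exit status of the child process
     */
    public static int convertOffice2PDFAsync(String filePath, String fileName, String targetFilePath) throws Exception {
        String command;
        String osName = System.getProperty("os.name");
        String outDir = targetFilePath.length() > 0 ? " --outdir " + targetFilePath : "";

        // Note: Runtime.exec(String) splits the command on whitespace, so paths containing spaces will break.
        if (osName.contains("Windows")) {
            // "start" returns as soon as soffice has been launched, without waiting for it to finish.
            command = "cmd /c cd /d " + filePath + " && start soffice --headless --invisible --convert-to pdf ./" + fileName + outDir;
        } else {
            command = "libreoffice6.3 --headless --invisible --convert-to pdf:writer_pdf_Export " + filePath + fileName + outDir;
        }

        return executeOSCommand(command);
    }

    /**
     * Runs command through the operating system console.
     * In testing, this method returned right after issuing the command rather than after the
     * conversion had finished. A return value of 0 only shows that the console command was
     * invoked correctly; it says nothing about the result of the conversion or whether an error
     * occurred. The synchronous conversion is therefore the better choice.
     */
    private static int executeOSCommand(String command) throws Exception {
        // Conversion takes time: a document of about 3 MB takes roughly 8 seconds. In testing, however,
        // execution did not wait for the conversion to finish; the next line ran as soon as the command was sent.
        Process process = Runtime.getRuntime().exec(command);

        // waitFor() already returns the exit value of the launched process.
        int exitStatus = process.waitFor();

        // Destroy the child process
        process.destroy();
        return exitStatus;
    }
}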
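Since convertOffice2PDFAsync can return before the PDF is complete, the caller needs some other signal that the output is ready. One option, sketched here, is to poll for the output file until it exists and its size has stopped changing. The class OutputWaiter and the method waitForFile are hypothetical names, and the stable-size check is a heuristic assumption rather than a guarantee that LibreOffice has finished writing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper, not part of the original article's code. */
final class OutputWaiter {

    /**
     * Polls until the file exists, is non-empty, and has the same size on two consecutive
     * checks, or until timeoutMillis has elapsed. Returns true if the file looked complete in time.
     */
    static boolean waitForFile(Path file, long timeoutMillis, long pollMillis)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        long lastSize = -1L;
        while (true) {
            if (Files.isRegularFile(file)) {
                long size = Files.size(file);
                if (size > 0 && size == lastSize) {
                    return true; // size unchanged since the previous poll
                }
                lastSize = size;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            Thread.sleep(pollMillis);
        }
    }
}
```

After calling convertOffice2PDFAsync, a caller could wait on the expected PDF path with a timeout chosen from its own measurements, and report a failure to the user if waitForFile returns false.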

OfficeManagerInstance (synchronous approach only)

package com.example.factory;

import org.jodconverter.office.LocalOfficeManager;
import org.jodconverter.office.OfficeManager;
import org.springframework.core.io.support.PropertiesLoaderUtils;
import org.springframework.stereotype.Component;

import javax.annotation.PostConstruct;
import java.io.IOException;
import java.util.Properties;

/**
 * github https://github.com/uncleAndyChen
 * email andy@lovesofttech.com
 * homepage https://www.lovesofttech.com/
 * author andyChen
 * since 2020/02/29
 */
@Component
public class OfficeManagerInstance {
    private static OfficeManager INSTANCE = null;

    public static synchronized void start() {
        officeManagerStart();
    }

    /** Builds the OfficeManager from libre.properties once Spring has created this bean. */
    @PostConstruct
    private void init() {
        try {
            Properties properties = PropertiesLoaderUtils.loadAllProperties("libre.properties");
            String[] portNumbers = properties.getProperty("portNumbers", "").split(",");
            int[] ports = new int[portNumbers.length];

            for (int i = 0; i < portNumbers.length; i++) {
                ports[i] = Integer.parseInt(portNumbers[i].trim());
            }

            // install() registers the built manager as the default one used by JodConverter.convert(...)
            LocalOfficeManager.Builder builder = LocalOfficeManager.builder().install();
            builder.officeHome(properties.getProperty("libreOfficeHome", ""));
            builder.portNumbers(ports);
            // Both timeouts are expressed in milliseconds; the long constants avoid int overflow.
            builder.taskExecutionTimeout(Integer.parseInt(properties.getProperty("taskExecutionTimeoutMinutes", "")) * 60L * 1000L); // minutes
            builder.taskQueueTimeout(Integer.parseInt(properties.getProperty("taskQueueTimeoutHours", "")) * 60L * 60L * 1000L); // hours

            INSTANCE = builder.build();
            officeManagerStart();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void officeManagerStart() {
        // INSTANCE stays null if init() has not run yet or failed to load the properties.
        if (INSTANCE == null) {
            throw new IllegalStateException("OfficeManager is not initialized; check libre.properties");
        }

        if (INSTANCE.isRunning()) {
            return;
        }

        try {
            INSTANCE.start();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

References

https://github.com/sbraconnier/jodconverter/wiki/Getting-Started
https://github.com/sbraconnier/jodconverter/wiki/Configuration
https://github.com/sbraconnier/jodconverter/wiki/Java-Library

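Pitfalls

See: Troubleshooting runtime errors caused by jar dependency conflicts in a Maven project

Appendix: libreoffice6.3 conversion help text

There is no detailed official online documentation for converting documents with libreoffice6.3, but -h prints detailed help that is sufficient for development.
For example, to convert a file to PDF: libreoffice6.3 --headless --invisible --convert-to pdf:writer_pdf_Export ./奇妙的记忆力.pptx. An output directory for the PDF can be added with --outdir; if it is omitted, the PDF is saved to the current directory.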
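The --convert-to value has the form OutputFileExtension[:OutputFilterName], and the examples in this article pass writer_pdf_Export for every input. LibreOffice also ships PDF export filters named calc_pdf_Export and impress_pdf_Export, so if you prefer to name the filter explicitly, one option is to choose it from the file extension. The sketch below is only an illustration: the class PdfFilter and the extension-to-filter mapping are assumptions, and unknown extensions fall back to plain pdf so that LibreOffice picks the filter itself.

```java
import java.util.Locale;

/** Hypothetical helper, not part of the original article's code. */
final class PdfFilter {

    /** Returns the value to pass after --convert-to for the given source file name. */
    static String convertToArg(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot >= 0 ? fileName.substring(dot + 1).toLowerCase(Locale.ROOT) : "";
        switch (ext) {
            case "doc":
            case "docx":
            case "odt":
            case "rtf":
                return "pdf:writer_pdf_Export";
            case "xls":
            case "xlsx":
            case "ods":
                return "pdf:calc_pdf_Export";
            case "ppt":
            case "pptx":
            case "odp":
                return "pdf:impress_pdf_Export";
            default:
                return "pdf"; // no explicit filter: let LibreOffice choose
        }
    }
}
```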

[root@ebs-60027 lib64]#  libreoffice6.3 -h
Usage: soffice [argument...]
argument - switches, switch parameters and document URIs (filenames).

Using without special arguments:
Opens the start center, if it is used without any arguments.
{file} Tries to open the file (files) in the components
suitable for them.
{file} {macro:///Library.Module.MacroName}
Opens the file and runs specified macros from
the file.

Getting help and information:
--help | -h | -? Shows this help and quits.
--helpwriter Opens built-in or online Help on Writer.
--helpcalc Opens built-in or online Help on Calc.
--helpdraw Opens built-in or online Help on Draw.
--helpimpress Opens built-in or online Help on Impress.
--helpbase Opens built-in or online Help on Base.
--helpbasic Opens built-in or online Help on Basic scripting
language.
--helpmath Opens built-in or online Help on Math.
--version Shows the version and quits.
--nstemporarydirectory
(MacOS X sandbox only) Returns path of the temporary
directory for the current user and exits. Overrides
all other arguments.

General arguments:
--quickstart[=no] Activates[Deactivates] the Quickstarter service.
--nolockcheck Disables check for remote instances using one
installation.
--infilter={filter} Force an input filter type if possible. For example:
--infilter="Calc Office Open XML"
--infilter="Text (encoded):UTF8,LF,,,"
--pidfile={file} Store soffice.bin pid to {file}.
--display {display} Sets the DISPLAY environment variable on UNIX-like
platforms to the value {display} (only supported by a
start script).

User/programmatic interface control:
--nologo Disables the splash screen at program start.
--minimized Starts minimized. The splash screen is not displayed.
--nodefault Starts without displaying anything except the splash
screen (do not display initial window).
--invisible Starts in invisible mode. Neither the start-up logo nor
the initial program window will be visible. Application
can be controlled, and documents and dialogs can be
controlled and opened via the API. Using the parameter,
the process can only be ended using the taskmanager
(Windows) or the kill command (UNIX-like systems). It
cannot be used in conjunction with --quickstart.
--headless Starts in "headless mode" which allows using the
application without GUI. This special mode can be used
when the application is controlled by external clients
via the API.
--norestore Disables restart and file recovery after a system crash.
--safe-mode Starts in a safe mode, i.e. starts temporarily with a
fresh user profile and helps to restore a broken
configuration.
--accept={connect-string} Specifies a UNO connect-string to create a UNO
acceptor through which other programs can connect to
access the API. Note that API access allows execution
of arbitrary commands.
The syntax of the {connect-string} is:
connection-type,params;protocol-name,params
e.g. pipe,name={some name};urp
or socket,host=localhost,port=54321;urp
--unaccept={connect-string} Closes an acceptor that was created with
--accept. Use --unaccept=all to close all acceptors.
--language={lang} Uses specified language, if language is not selected
yet for UI. The lang is a tag of the language in IETF
language tag.

Developer arguments:
--terminate_after_init
Exit after initialization complete (no documents loaded)
--eventtesting Exit after loading documents.

New document creation arguments:
The arguments create an empty document of specified kind. Only one of them may
be used in one command line. If filenames are specified after an argument,
then it tries to open those files in the specified component.
--writer Creates an empty Writer document.
--calc Creates an empty Calc document.
--draw Creates an empty Draw document.
--impress Creates an empty Impress document.
--base Creates a new database.
--global Creates an empty Writer master (global) document.
--math Creates an empty Math document (formula).
--web Creates an empty HTML document.

File open arguments:
The arguments define how following filenames are treated. New treatment begins
after the argument and ends at the next argument. The default treatment is to
open documents for editing, and create new documents from document templates.
-n Treats following files as templates for creation of new
documents.
-o Opens following files for editing, regardless whether
they are templates or not.
--pt {Printername} Prints following files to the printer {Printername},
after which those files are closed. The splash screen
does not appear. If used multiple times, only last
{Printername} is effective for all documents of all
--pt runs. Also, --printer-name argument of
--print-to-file switch interferes with {Printername}.
-p Prints following files to the default printer, after
which those files are closed. The splash screen does
not appear. If the file name contains spaces, then it
must be enclosed in quotation marks.
--view Opens following files in viewer mode (read-only).
--show Opens and starts the following presentation documents
of each immediately. Files are closed after the showing.
Files other than Impress documents are opened in
default mode , regardless of previous mode.
--convert-to OutputFileExtension[:OutputFilterName] \
[--outdir output_dir] [--convert-images-to]
Batch convert files (implies --headless). If --outdir
isn't specified, then current working directory is used
as output_dir. If --convert-images-to is given, its
parameter is taken as the target filter format for *all*
images written to the output format. If --convert-to is
used more than once, the last value of
OutputFileExtension[:OutputFilterName] is effective. If
--outdir is used more than once, only its last value is
effective. For example:
--convert-to pdf *.odt
--convert-to epub *.doc
--convert-to pdf:writer_pdf_Export --outdir /home/user *.doc
--convert-to "html:XHTML Writer File:UTF8" \
--convert-images-to "jpg" *.doc
--convert-to "txt:Text (encoded):UTF8" *.doc
--print-to-file [--printer-name printer_name] [--outdir output_dir]
Batch print files to file. If --outdir is not specified,
then current working directory is used as output_dir.
If --printer-name or --outdir used multiple times, only
last value of each is effective. Also, {Printername} of
--pt switch interferes with --printer-name.
--cat Dump text content of the following files to console
(implies --headless). Cannot be used with --convert-to.
--script-cat Dump text content of any scripts embedded in the files
to console (implies --headless). Cannot be used with
--convert-to.
-env:<VAR>[=<VALUE>] Set a bootstrap variable. For example: to set
a non-default user profile path:
-env:UserInstallation=file:///tmp/test

Ignored switches:
-psn Ignored (MacOS X only).
-Embedding Ignored (COM+ related; Windows only).
--nofirststartwizard Does nothing, accepted only for backward compatibility.
--protector {arg1} {arg2}
Used only in unit tests and should have two arguments.