Reading and Writing UTF-8 Text with a BOM in Java (Part 1)

2014-11-24 01:45:17

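What is a BOM

A BOM (byte-order mark) is a special marker inserted at the very beginning of a Unicode file encoded as UTF-8, UTF-16, or UTF-32. It identifies which Unicode encoding the file uses. For UTF-8 the BOM is optional: its main job is to signal the encoding and byte order (big-endian or little-endian) of encodings built from multi-byte code units, and UTF-8 is byte-oriented, so it has no byte-order ambiguity.

BOM byte signatures at the start of a file: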

00 00 FE FF = UTF-32, big-endian

FF FE 00 00 = UTF-32, little-endian

EF BB BF = UTF-8

FE FF = UTF-16, big-endian

FF FE = UTF-16, little-endian
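
The signature list above can be turned into a small lookup. The following is a minimal sketch, not part of the original article; the class name `BomDetector` and the method `detect` are hypothetical. One detail worth noting is the order of the checks: FF FE 00 00 (UTF-32, little-endian) also starts with FF FE (UTF-16, little-endian), so the four-byte signatures are tested first.

```java
/** Hypothetical helper: maps the leading bytes of a file to an encoding name. */
final class BomDetector {

    /** Returns the encoding implied by the BOM, or null if no known BOM is present. */
    static String detect(byte[] head) {
        // Test the 4-byte UTF-32 signatures before the 2-byte UTF-16 ones,
        // because FF FE 00 00 also begins with the UTF-16LE signature FF FE.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a' };
        System.out.println(detect(utf8)); // prints UTF-8
    }
}
```

A UTF-16LE file whose first character is U+0000 would be reported as UTF-32LE by this sketch; that ambiguity is inherent in the signatures themselves.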

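The following example handles the BOM of a UTF-8 file:

// Read the whole file as UTF-8 text (StringFileToolkit is a helper class not shown here).
String xmla = StringFileToolkit.file2String(new File("D:\\projects\\mailpost\\src\\a.xml"), "UTF-8");

// Re-encode to bytes: a leading BOM (U+FEFF) becomes the three bytes EF BB BF.
byte[] b = xmla.getBytes("UTF-8");

// Drop those three bytes and decode the rest. This assumes the file really starts with a BOM.
String xml = new String(b, 3, b.length - 3, "UTF-8");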

..............

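The idea: read the file as UTF-8, skip the first three bytes (the encoded BOM), build a new string from the rest, and then parse that string with Dom4j. The parser then no longer reports an error.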
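
The approach above unconditionally drops three bytes, so a file without a BOM would lose its first three bytes of real content. A more defensive variant, sketched here as an illustration rather than as the author's method, checks the decoded string for a leading U+FEFF, which is what the bytes EF BB BF decode to. The class name `BomStripper` is hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: removes a leading BOM character only if one is present. */
final class BomStripper {

    // U+FEFF is the character that the UTF-8 bytes EF BB BF decode to.
    private static final char BOM = '\uFEFF';

    static String stripBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text; // no BOM: return the input unchanged
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>' };
        // Java's UTF-8 decoder keeps the BOM as the first character of the string.
        String decoded = new String(raw, StandardCharsets.UTF_8);
        System.out.println(decoded.length());  // prints 5 (BOM + "<a/>")
        System.out.println(stripBom(decoded)); // prints <a/>
    }
}
```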

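Other encodings can be handled along the same lines. In fact, one could write a general-purpose utility that detects the BOM automatically, strips it, and returns the string.

However, someone has already solved this problem: http://koti.mbnet.fi/akini/java/unicodereader/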
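
As a rough illustration of such a utility, the sketch below decodes a byte array according to its BOM and returns the text without it. It is an assumption-laden example, not the author's tool: the class name `BomAwareText` and its methods are hypothetical, and only the UTF-8 and UTF-16 signatures are handled, so a UTF-32LE file (FF FE 00 00) would be misread as UTF-16LE.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical utility: decodes text by its BOM and returns it without the BOM. */
final class BomAwareText {

    /** Decodes data using the BOM if present, otherwise the fallback charset. */
    static String decode(byte[] data, Charset fallback) {
        if (hasPrefix(data, 0xEF, 0xBB, 0xBF)) {
            return new String(data, 3, data.length - 3, StandardCharsets.UTF_8);
        }
        if (hasPrefix(data, 0xFE, 0xFF)) {
            return new String(data, 2, data.length - 2, StandardCharsets.UTF_16BE);
        }
        if (hasPrefix(data, 0xFF, 0xFE)) {
            return new String(data, 2, data.length - 2, StandardCharsets.UTF_16LE);
        }
        return new String(data, fallback); // no recognized BOM
    }

    /** Convenience wrapper that reads the whole file into memory first. */
    static String readFile(Path path, Charset fallback) throws IOException {
        return decode(Files.readAllBytes(path), fallback);
    }

    private static boolean hasPrefix(byte[] data, int... signature) {
        if (data.length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf16le = { (byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0 };
        System.out.println(decode(utf16le, StandardCharsets.UTF_8)); // prints hi
    }
}
```

Reading the whole file into memory keeps the sketch short; for large files a streaming reader is the better fit.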


Example code using the UnicodeReader class

Here is an example method that reads a text file. It recognizes the BOM and skips it while reading.

//import http://koti.mbnet.fi/akini/java/unicodereader/UnicodeReader.java.txt

import java.io.BufferedReader;
import java.io.CharArrayWriter;
import java.io.FileInputStream;
import java.io.IOException;

public static char[] loadFile(String file) throws IOException {
    // Read a text file; auto-detect the BOM, or use the system default
    // encoding (defaultEnc == null) if no BOM is found.
    char[] buffer = new char[16 * 1024]; // 16k buffer
    int read;

    // try-with-resources closes both resources even if reading fails.
    // The original finally block could throw a NullPointerException on
    // writer.close() and then skip the remaining close() calls.
    try (BufferedReader reader = new BufferedReader(
             new UnicodeReader(new FileInputStream(file), null));
         CharArrayWriter writer = new CharArrayWriter()) {
        while ((read = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, read);
        }
        return writer.toCharArray();
    }
}
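
Where adding the third-party UnicodeReader class is not an option and the input is known to be UTF-8, the same skip-the-BOM behavior can be approximated with the standard library alone. This is a minimal sketch under that assumption; the class name `Utf8BomReader` and the method `open` are hypothetical. It relies on `BufferedReader.mark` and `reset` to peek at the first character and rewind if it is not U+FEFF.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: opens a UTF-8 stream and skips a leading BOM if present. */
final class Utf8BomReader {

    static BufferedReader open(InputStream in) throws IOException {
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);                 // remember the start of the stream
        if (reader.read() != '\uFEFF') {
            reader.reset();             // first char is real content (or EOF): rewind
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'b', 'c' };
        try (BufferedReader r = open(new ByteArrayInputStream(data))) {
            System.out.println(r.readLine()); // prints abc
        }
    }
}
```

Unlike UnicodeReader, this sketch does not detect UTF-16 or UTF-32 input; it only removes the UTF-8 BOM.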


Example code to write UTF-8 with a BOM

Write the BOM bytes at the start of an empty file, and text editors that honor the BOM will pick the correct charset when they read the file. Java's OutputStreamWriter does not write the UTF-8 BOM bytes itself.

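import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public static void saveFile(String file, String data, boolean append) throws IOException {
    File f = new File(file);

    // try-with-resources closes the stream even if a write fails. The original
    // finally block could throw a NullPointerException on bw.close() and then
    // never reach fos.close().
    try (FileOutputStream fos = new FileOutputStream(f, append)) {
        // Write the UTF-8 BOM only if the file is empty
        if (f.length() < 1) {
            final byte[] bom = new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };
            fos.write(bom);
        }

        // Closing the writer flushes the buffered text into fos
        try (BufferedWriter bw = new BufferedWriter(
                 new OutputStreamWriter(fos, StandardCharsets.UTF_8))) {
            if (data != null) bw.write(data);
        }
    }
}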
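
The statement that OutputStreamWriter does not write a UTF-8 BOM can be checked in memory, without touching the file system. The sketch below is only an illustration; the class name `BomWriteCheck` and the method `encodeWithBom` are hypothetical. It compares the plain UTF-8 encoding of a string with the same text written after an explicit BOM.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

/** Hypothetical check: Java's UTF-8 encoder adds no BOM unless it is written by hand. */
final class BomWriteCheck {

    /** Encodes text as UTF-8 with the three BOM bytes EF BB BF in front. */
    static byte[] encodeWithBom(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0xEF);
        out.write(0xBB);
        out.write(0xBF);
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(text);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] plain = "A".getBytes(StandardCharsets.UTF_8);
        byte[] withBom = encodeWithBom("A");
        System.out.println(plain.length);   // prints 1: no BOM was added
        System.out.println(withBom.length); // prints 4: 3 BOM bytes + 'A'
    }
}
```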

Practical application:

package com.dayo.gerber;

import java.io.BufferedReader;

import java.io.BufferedWriter;

import java.io.File;

import java.io.FileInputStream;

import java.io.FileOutputStream;

import java.io.IOException;

import java.io.InputStream;

import java.io.InputStreamReader;

import java.io.OutputStreamWriter;

import java.io.Reader;

imp