This is an original post. If you quote or repost it, please credit the source:
http://stephen830.javaeye.com/blog/259350 . Thanks for your support!
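Generating a UTF-8 file with Java:
If the file content contains no Chinese characters, the generated file is detected as ANSI;
if the file content contains Chinese characters, the generated file is detected as UTF-8.
In other words, a file without any Chinese content ends up looking like an ANSI file.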
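import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

/**
 * Writes a file using the UTF-8 encoding.
 * If the content contains no Chinese characters, the generated file is detected as ANSI;
 * if the content contains Chinese characters, the generated file is detected as UTF-8.
 * @param fileName name of the file to create (including the full path)
 * @param fileBody content of the file
 * @return true if the file was written, false if an error occurred
 */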
public static boolean writeUTFFile(String fileName, String fileBody) {
    // try-with-resources closes the writer and the stream even when writing fails;
    // an error raised while closing (for example a failed flush) is also reported as false.
    try (FileOutputStream fos = new FileOutputStream(fileName);
         OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8")) {
        osw.write(fileBody);
        return true;
    } catch (Exception e) {
        // Same contract as before: any failure is logged and reported as false.
        e.printStackTrace();
        return false;
    }
}
// main()
public static void main(String[] args) {
    writeUTFFile("C:\\test1.txt", "aaa");     // test1.txt is detected as ANSI
    writeUTFFile("C:\\test2.txt", "中文aaa"); // test2.txt is detected as UTF-8
}
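Comparing the files, I found that the UTF-8 file had an extra FF FE header that the ANSI file lacked, and that its content was stored as two bytes per character, while test1.txt used a single byte per character. I first took FF FE for the UTF-8 BOM, but it does not match the EF BB BF given in the specification, which left me confused. A file that starts with FF FE and stores two bytes per character is in fact UTF-16 little-endian, not UTF-8.
A friend, [Liteos], pointed out that FF FE is the Unicode (UTF-16) BOM, yet I saw FF FE on both my UTF-8 and my Unicode files, which is puzzling.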
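My guess was that Java's I/O layer only produces an ANSI file when every character is single-byte (even though the program asked for UTF-8; is that a bug?), and only applies the configured encoding (UTF-8, for example) once it meets a multi-byte character. It is not a bug, though: ASCII characters are encoded as the same single bytes in UTF-8 as in ANSI, and Java's UTF-8 encoder writes no BOM, so an ASCII-only UTF-8 file is byte-for-byte identical to an ANSI file.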
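One way to check this guess is to look at the bytes Java actually produces for the two test strings. The following is only a minimal sketch; the class name `EncodingBytesDemo` and the helper `toHex` are made up for illustration and are not part of the original program. It encodes the strings in memory instead of reading the files back, and writes the Chinese characters as Unicode escapes so the result does not depend on the source file's own encoding.

```java
import java.nio.charset.StandardCharsets;

class EncodingBytesDemo {
    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ASCII-only text: UTF-8 and US-ASCII yield the same bytes, with no BOM.
        System.out.println(toHex("aaa".getBytes(StandardCharsets.UTF_8)));    // 61 61 61
        System.out.println(toHex("aaa".getBytes(StandardCharsets.US_ASCII))); // 61 61 61

        // "\u4E2D\u6587aaa" is the test string with two Chinese characters:
        // each takes three bytes in UTF-8, and there is still no BOM in front.
        System.out.println(toHex("\u4E2D\u6587aaa".getBytes(StandardCharsets.UTF_8)));
        // E4 B8 AD E6 96 87 61 61 61
    }
}
```

Since the first two lines of output are identical, an editor has nothing in an ASCII-only file to tell UTF-8 apart from ANSI.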
Below is an excerpt from the W3C's description of the UTF-8 BOM (original article:
http://www.w3.org/International/questions/qa-utf8-bom)
FAQ: Display problems caused by the UTF-8 BOM
Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), CSS coders, XSLT developers, Web project managers, and anyone who is trying to diagnose why blank lines or other strange items are displayed on their UTF-8 page.
Question
When using UTF-8 encoded pages in some user agents, I get an extra line or unwanted characters at the top of my web page or included file. How do I remove them?
Answer
If you are dealing with a file encoded in UTF-8, your display problems may be caused by the presence of a UTF-8 signature (BOM) that the user agent doesn't recognize.
The BOM is always at the beginning of the file, and so you would normally expect to see the display issues at the top of a page. However, you may also find blank lines appearing within the page if you include text from a separate file that begins with a UTF-8 signature.
We have a set of test pages and a summary of results for various recent browser versions that explore this behaviour.
This article will help you determine whether the UTF-8 signature is causing the problem. If there is no evidence of a UTF-8 signature at the beginning of the file, then you will have to look elsewhere for a solution.
What is a UTF-8 signature (BOM)?
Some applications insert a particular combination of bytes at the beginning of a file to indicate that the text contained in the file is Unicode. This combination of bytes is known as a signature or Byte Order Mark (BOM). Some applications - such as a text editor or a browser - will display the BOM as an extra line in the file, others will display unexpected characters, such as ï»¿.
See the side panel for more detailed information about the BOM.
The BOM is the Unicode codepoint U+FEFF, corresponding to the Unicode character 'ZERO WIDTH NON-BREAKING SPACE' (ZWNBSP).
In UTF-16 and UTF-32 encodings, unless there is some alternative indicator, the BOM is essential to ensure correct interpretation of the file's contents. Each character in the file is represented by 2 or 4 bytes of data and the order in which these bytes are stored in the file is significant; the BOM indicates this order.
In the UTF-8 encoding, the presence of the BOM is not essential because, unlike the UTF-16 or UTF-32 encodings, there is no alternative sequence of bytes in a character. The BOM may still occur in UTF-8 encoding text, however, either as a by-product of an encoding conversion or because it was added by an editor.
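The byte sequences mentioned above can be reproduced by encoding the single code point U+FEFF in each encoding. This is a small illustrative sketch; the class name `BomBytesDemo` and the helper `toHex` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class BomBytesDemo {
    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bom = "\uFEFF"; // the BOM code point, U+FEFF

        // In UTF-8 there is only one possible byte sequence.
        System.out.println("UTF-8:    " + toHex(bom.getBytes(StandardCharsets.UTF_8)));    // EF BB BF
        // In UTF-16 the two bytes swap with the byte order, which is what the BOM reveals.
        System.out.println("UTF-16BE: " + toHex(bom.getBytes(StandardCharsets.UTF_16BE))); // FE FF
        System.out.println("UTF-16LE: " + toHex(bom.getBytes(StandardCharsets.UTF_16LE))); // FF FE
    }
}
```

FF FE is therefore the signature of little-endian UTF-16, while a UTF-8 signature is always EF BB BF.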
Detecting the BOM
First, we need to check whether there is indeed a BOM at the beginning of the file.
You can try looking for a BOM in your content, but if your editor handles the UTF-8 signature correctly you probably won't be able to see it. An editor which does not handle the UTF-8 signature correctly displays the bytes that compose that signature according to its own character encoding setting. (With the Latin 1 (ISO 8859-1) character encoding, the signature displays as characters ï»¿.) With a binary editor capable of displaying the hexadecimal byte values in the file, the UTF-8 signature displays as EF BB BF.
Alternatively, your editor may tell you in a status bar or a menu what encoding your file is in, including information about the presence or not of the UTF-8 signature.
If not, some kind of script-based test (see below) may help. Alternatively, you could try this small web-based utility. (Note, if it’s a file included by PHP or some other mechanism that you think is causing the problem, type in the URI of the included file.)
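A script-based test of this kind could look roughly like the Java sketch below. It is not the utility or script the W3C article refers to; the class `BomDetector` and its methods are hypothetical names chosen for illustration. It only reads the first three bytes of the file and, as a simplifying assumption, ignores UTF-32 signatures (a UTF-32LE file would be reported as UTF-16LE).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

class BomDetector {
    // True if data begins with the given byte values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Names the signature found at the start of the bytes, or "none".
    static String detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return "none";
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the file to inspect.
        byte[] head = new byte[3];
        int n = 0;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int r;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (n < head.length && (r = in.read(head, n, head.length - n)) != -1) {
                n += r;
            }
        }
        System.out.println(detectBom(Arrays.copyOf(head, n)));
    }
}
```

Run against a file produced by `OutputStreamWriter` with "UTF-8", this should report "none", because that writer does not add a signature on its own.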
Removing the BOM
If you have an editor which shows the characters that make up the UTF-8 signature you may be able to delete them by hand. Chances are, however, that the BOM is there in the first place because you didn't see it.
Check whether your editor allows you to specify whether a UTF-8 signature is added or kept during a save. Such an editor provides a way of removing the signature by simply reading the file in then saving it out again. For example, if Dreamweaver detects a BOM the Save As dialogue box will have a check mark alongside the text "Include Unicode Signature (BOM)". Just uncheck the box and save.
One of the benefits of using a script is that you can remove the signature quickly, and from multiple files. In fact the script could be run automatically as part of your process. If you use Perl, you could use a simple script created by Martin Dürst.
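For readers working in Java rather than Perl, the same idea can be sketched as follows. This is not Martin Dürst's script, only an illustrative equivalent under simple assumptions: the class `BomStripper` and method `stripUtf8Bom` are hypothetical names, each file is small enough to load into memory, and files are rewritten in place.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

class BomStripper {
    // Returns the data without a leading UTF-8 signature (EF BB BF).
    // If there is no signature, the same array is returned unchanged.
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) throws IOException {
        // Every command-line argument is treated as a file to clean.
        for (String name : args) {
            Path path = Paths.get(name);
            byte[] original = Files.readAllBytes(path);
            byte[] stripped = stripUtf8Bom(original);
            if (stripped != original) { // a new array means a signature was removed
                Files.write(path, stripped);
                System.out.println("Removed UTF-8 signature from " + name);
            }
        }
    }
}
```

Because the files are overwritten, it is worth trying this on copies first.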
Note: You should check the process impact of removing the signature. It may be that some part of your content development process relies on the use of the signature to indicate that a file is in UTF-8. Bear in mind also that pages with a high proportion of Latin characters may look correct superficially but that occasional characters outside the ASC