Parsing XML files in Java with dom4j

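A while ago, for a text-mining experiment, I went to a lot of trouble to get hold of the DUC 2002 corpus, and the next step was to process it. The corpus is stored in an XML file.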

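I have been doing this kind of processing in Java all along, so this time was no exception. Java has many frameworks for parsing XML. After reading about quite a few of them online, I settled on dom4j. Below is a brief record of how I used it.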

First, download it. I used dom4j 1.6.1. Download page: http://sourceforge.net/projects/dom4j/files/

After downloading, just unzip the archive. Then, in the project's Build Path, add an external JAR and select dom4j-1.6.1.jar from the root of the unzipped directory.

My XML file has quite a few levels of nesting. Here is an excerpt covering the parts I need:

<corpus>
	<cluster cid="d061j">
		<title>d061j</title>
		<topics>
		  <topic>d061j</topic>
		</topics>
		<queries/>
		<documents>
		  <document docid="AP880911-0016">
			<title> Hurricane Gilbert Heads Toward Dominican Coast</title>
			<text>
				<s sid="9"> Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas.</s>
				<s sid="10"> The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph.</s>
				<s sid="11"> ``There is no need for alarm,'' Civil Defense Director Eugenio Cabral said in a television alert shortly before midnight Saturday.</s>
				<s sid="12"> Cabral said residents of the province of Barahona should closely follow Gilbert's movement.</s>
				<s sid="13"> An estimated 100,000 people live in the province, including 70,000 in the city of Barahona, about 125 miles west of Santo Domingo.</s>
				<s sid="14"> Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a hurricane Saturday night.</s>
				<s sid="15"> The National Hurricane Center in Miami reported its position at 2 a.m. Sunday at latitude 16.1 north, longitude 67.5 west, about 140 miles south of Ponce, Puerto Rico, and 200 miles southeast of Santo Domingo.</s>
				<s sid="16"> The National Weather Service in San Juan, Puerto Rico, said Gilbert was moving westward at 15 mph with a ``broad area of cloudiness and heavy weather'' rotating around the center of the storm.</s>
				<s sid="17"> The weather service issued a flash flood watch for Puerto Rico and the Virgin Islands until at least 6 p.m. Sunday.</s>
				<s sid="18"> Strong winds associated with the Gilbert brought coastal flooding, strong southeast winds and up to 12 feet feet to Puerto Rico's south coast.</s>
				<s sid="19"> There were no reports of casualties.</s>
				<s sid="20"> San Juan, on the north coast, had heavy rains and gusts Saturday, but they subsided during the night.</s>
				<s sid="21"> On Saturday, Hurricane Florence was downgraded to a tropical storm and its remnants pushed inland from the U.S. Gulf Coast.</s>
				<s sid="22"> Residents returned home, happy to find little damage from 80 mph winds and sheets of rain.</s>
				<s sid="23"> Florence, the sixth named storm of the 1988 Atlantic storm season, was the second hurricane.</s>
				<s sid="24"> The first, Debby, reached minimal hurricane strength briefly before hitting the Mexican coast last month .</s>
			</text>
		  </document>
		</documents>
	</cluster>
</corpus>

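Each cluster has its own title. Next to that title, still inside the cluster, is a documents element, which contains a number of document elements. Each document has a title and a text, and the text holds the individual sentences, each tagged s. In my file the s elements are wrapped in a p element under text, although the excerpt above shows them directly under text.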

My task is to write the content of each document to its own text file, named "cl" + cid + "-" + "docid" + docid, where cid and docid are attributes of cluster and document respectively.
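As a small illustration of that naming rule, here is a sketch of a helper that builds the file name from the two attribute values. The class name DocFileNames and the method fileNameFor are made up for this example, and the .txt extension is taken from the program further down.

```java
public class DocFileNames {
    // Hypothetical helper: builds "cl" + cid + "-" + "docid" + docid + ".txt".
    public static String fileNameFor(String cid, String docid) {
        return "cl" + cid + "-" + "docid" + docid + ".txt";
    }

    public static void main(String[] args) {
        // Attribute values taken from the excerpt above.
        System.out.println(fileNameFor("d061j", "AP880911-0016"));
        // prints: cld061j-docidAP880911-0016.txt
    }
}
```

Keeping the rule in one method means the name only has to be changed in one place if the format changes.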

PS: I have already copied the file duc2002.xml into this project's src directory.

Here is the source code:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.util.Iterator;

import org.dom4j.Attribute;
import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

public class MyTest {
	public static void main(String[] args) throws Exception {
		SAXReader reader = new SAXReader();
		Document thedocument = reader.read(new File("./src/duc2002.xml"));
		Element root = thedocument.getRootElement();
		System.out.println(root);
		// count tracks how many clusters have been processed.
		int count = 0;
		for (Iterator i = root.elementIterator("cluster"); i.hasNext();) {
			Element cluster = (Element) i.next();
			System.out.println("cluster id:"
					+ cluster.attribute("cid").getText()); // the cluster's id
			Element documents = cluster.element("documents");
			for (Iterator j = documents.elementIterator("document"); j
					.hasNext();) {
				Element document = (Element) j.next();
				Element title = document.element("title");
				// Sentences sit under <text>, or under <text>/<p> when a <p> is present.
				Element text = document.element("text");
				Element p = text.element("p");
				Element container = (p != null) ? p : text;
				File file = new File("F:\\xmltest\\" + "cl"
						+ cluster.attribute("cid").getText() + "-" + "docid"
						+ document.attribute("docid").getText() + ".txt");
				file.getParentFile().mkdirs(); // make sure the output folder exists
				// Overwrite rather than append, so re-running does not duplicate text.
				try (BufferedWriter bw = new BufferedWriter(new FileWriter(file))) {
					bw.write("title:" + title.getText());
					bw.newLine();
					// One sentence per line.
					for (Iterator l = container.elementIterator("s"); l.hasNext();) {
						Element sentence = (Element) l.next();
						bw.write(sentence.getText().trim());
						bw.newLine();
					}
				} // try-with-resources closes the writer
			}
			count++;
			System.out.println(count + " finished!");
		}
		
	}
}

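This program keeps generating text files in the xmltest folder on the F: drive. Taking the file generated from the excerpt above as an example, the output looks like this: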

[Figure: result of writing the XML file out as txt]

With the program as a reference, here is a quick summary of dom4j's basic XML parsing features:

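  • Initialize the parser, read the XML file, and get the root node: new SAXReader(), then reader.read(new File("./src/duc2002.xml")), then thedocument.getRootElement()
  • Use root.elementIterator("cluster") to iterate over the cluster elements under the root node
  • Use Element cluster = (Element) i.next(); to get each cluster. To read an attribute, such as cid on the cluster element in the XML, use cluster.attribute("cid").getText()
  • Use the element method to get a child node, as in Element documents = cluster.element("documents");
  • To get a node's text content, call getText() directly, as in title.getText(), which returns the content of <title>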
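
The calls listed above can be tried without the corpus file. The following is a minimal sketch, assuming dom4j is on the classpath as set up earlier. The class name Dom4jBasics, the method describeFirstDocument and the shortened XML string are made up for illustration. It uses DocumentHelper.parseText(...) in place of SAXReader so the XML can come from a string, and the shortcuts attributeValue(...) and elementTextTrim(...), which the program above does not use.

```java
import java.util.Iterator;

import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;

public class Dom4jBasics {
    // A cut-down version of the excerpt above, kept in a string so no file is needed.
    public static final String XML =
            "<corpus>"
            + "<cluster cid=\"d061j\">"
            + "<title>d061j</title>"
            + "<documents>"
            + "<document docid=\"AP880911-0016\">"
            + "<title> Hurricane Gilbert Heads Toward Dominican Coast</title>"
            + "<text><s sid=\"19\"> There were no reports of casualties.</s></text>"
            + "</document>"
            + "</documents>"
            + "</cluster>"
            + "</corpus>";

    // Returns "cid/docid: title" for the first document of the first cluster.
    public static String describeFirstDocument(String xml) {
        try {
            Document doc = DocumentHelper.parseText(xml);       // parse from a string
            Element root = doc.getRootElement();                // <corpus>
            Iterator<?> clusters = root.elementIterator("cluster");
            Element cluster = (Element) clusters.next();        // first <cluster>
            Element document = cluster.element("documents").element("document");
            return cluster.attributeValue("cid") + "/"
                    + document.attributeValue("docid") + ": "
                    + document.elementTextTrim("title");        // trimmed <title> text
        } catch (DocumentException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describeFirstDocument(XML));
        // prints: d061j/AP880911-0016: Hurricane Gilbert Heads Toward Dominican Coast
    }
}
```

attributeValue("cid") returns the same string as attribute("cid").getText() when the attribute exists, and returns null instead of throwing when it does not.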