之前在做文本挖掘实验的时候,千辛万苦要到这份duc2002的语料库之后,就要对它进行处理了,它是存在xml文件里的。
一直在用java进行相关的处理,所以这次也不例外,java中,关于xml文件解析的框架有很多,在网上也看了很多,最终选择了dom4j这个框架。下面就简单记录一下使用的过程。
首先去下载,我下载的是dom4j1.6.1。下载地址:http://sourceforge.net/projects/dom4j/files/
下载之后解压就行。然后在工程的buildpath里面添加一个external jars,选择刚才解压之后根目录下的dom4j-1.6.1.jar文件。
我的这个xml文件的层级比较多,对照我要需要的内容,这里就列举一下一部分的内容:
<corpus>
<cluster cid="d061j">
<title>d061j</title>
<topics>
<topic>d061j</topic>
</topics>
<queries/>
<documents>
<document docid="AP880911-0016">
<title> Hurricane Gilbert Heads Toward Dominican Coast</title>
<text>
<s sid="9"> Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas.</s>
<s sid="10"> The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph.</s>
<s sid="11"> ``There is no need for alarm,'' Civil Defense Director Eugenio Cabral said in a television alert shortly before midnight Saturday.</s>
<s sid="12"> Cabral said residents of the province of Barahona should closely follow Gilbert's movement.</s>
<s sid="13"> An estimated 100,000 people live in the province, including 70,000 in the city of Barahona, about 125 miles west of Santo Domingo.</s>
<s sid="14"> Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a hurricane Saturday night.</s>
<s sid="15"> The National Hurricane Center in Miami reported its position at 2 a.m. Sunday at latitude 16.1 north, longitude 67.5 west, about 140 miles south of Ponce, Puerto Rico, and 200 miles southeast of Santo Domingo.</s>
<s sid="16"> The National Weather Service in San Juan, Puerto Rico, said Gilbert was moving westward at 15 mph with a ``broad area of cloudiness and heavy weather'' rotating around the center of the storm.</s>
<s sid="17"> The weather service issued a flash flood watch for Puerto Rico and the Virgin Islands until at least 6 p.m. Sunday.</s>
<s sid="18"> Strong winds associated with the Gilbert brought coastal flooding, strong southeast winds and up to 12 feet feet to Puerto Rico's south coast.</s>
<s sid="19"> There were no reports of casualties.</s>
<s sid="20"> San Juan, on the north coast, had heavy rains and gusts Saturday, but they subsided during the night.</s>
<s sid="21"> On Saturday, Hurricane Florence was downgraded to a tropical storm and its remnants pushed inland from the U.S. Gulf Coast.</s>
<s sid="22"> Residents returned home, happy to find little damage from 80 mph winds and sheets of rain.</s>
<s sid="23"> Florence, the sixth named storm of the 1988 Atlantic storm season, was the second hurricane.</s>
<s sid="24"> The first, Debby, reached minimal hurricane strength briefly before hitting the Mexican coast last month .</s>
</text>
</document>
</documents>
</cluster>
</corpus>
每个cluster有自己的title,接下来,documents是和cluster同级的,ducuments里面又有若干document,每个document下有title,text,text下有p标签,p标签下又有各个句子,标签是s。
我的任务是,将每一篇document的内容输出到一个文本文件,命名格式为:”cl”+cid+”-“+”docid”+docid,其中cid和docid分别是cluster和document的一个属性。
PS:我把文本文件duc2002.xml已经复制到本项目的scr下了。
下面贴上源代码:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.util.Iterator;
import org.dom4j.Attribute;
import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;
public class MyTest {
public static void main(String[] args) throws Exception {
SAXReader reader = new SAXReader();
Document thedocument = reader.read(new File("./src/duc2002.xml"));
Element root = thedocument.getRootElement();
System.out.println(root);
Iterator it = root.elementIterator();
int count = 0;
for (Iterator i = root.elementIterator("cluster"); i.hasNext();) {
Element cluster = (Element) i.next();
System.out.println("cluster id:"
+ cluster.attribute("cid").getText()); // cluster的id
Element documents = cluster.element("documents");
for (Iterator j = documents.elementIterator("document"); j
.hasNext();) {
Element document = (Element) j.next();
Element title = document.element("title");
for (Iterator k = document.element("models").elementIterator(
"model"); k.hasNext();) {
Element model = (Element) k.next();
}
File file = new File("F:\\xmltest\\" + "cl"
+ cluster.attribute("cid").getText() + "-" + "docid"
+ document.attribute("docid").getText() + ".txt");
file.createNewFile();
BufferedWriter bw = new BufferedWriter(new FileWriter(file,
true));
bw.write("title:" + title.getText());
bw.newLine();
for (Iterator l = document.element("text").element("p")
.elementIterator("s"); l.hasNext();) {
Element sentence = (Element) l.next();
bw.write(sentence.getText().trim());
bw.newLine();
}
bw.close();
}
count++;
System.out.println(count + "finished!");
}
}
}
这段程序在F盘的xmltest文件夹下不断生成文本文件,我就拿列举的那部分内容生成的文本文件做例子,输出结果是这样的:
对照程序,再简单整理一下dom4j基本的xml解析功能:
- 第13-15行,初始化解析器,读入xml文件,获得根节点
- 第19行,用
root.elementIterator("cluster")
来迭代获得root节点下的cluster
- 第20行,用
Element cluster = (Element) i.next();
来获得每一个cluster,如果要获得属性,比如xml文件第2行的cid,可以用第22行的cluster.attribute("cid").getText());
- 用element方法获得子节点,如23行的
Element documents = cluster.element("documents");
- 获得节点的内容,直接用getText()方法,如第38行,获得<title>的内容