An unexpected bonus from studying search in data structures: a simple Java-based web page fetcher.

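Recently, while grinding through easy practice problems, I stumbled on a handy "easy-problem sniffer" here: http://blog.csdn.net/hu1020935219/article/details/11697109. An expert told me it is a web crawler built with various search methods, which is really just the graph or tree traversal we learn in data structures. That got me very interested.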

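In the library I happened to find a book that had been borrowed only twice in three years, 《自己动手写网络爬虫》 (Write Your Own Web Crawler), and started learning how to write one.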

After two days of reading, here is a summary of what I have learned. (It is also a chance to review Java, which I had mostly forgotten.)

A web crawler is a script or program that automatically fetches information from the web according to a set of rules.
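To make the "it is just graph traversal" idea concrete, here is a minimal sketch of a breadth-first walk over a tiny link graph. It is not taken from the book or from the tool linked above; the class name `CrawlOrderSketch`, the method `bfs` and the page names A to D are all made up for illustration, and the "pages" live in an in-memory map instead of being downloaded.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class CrawlOrderSketch {

    // Breadth-first traversal of a link graph, starting from a seed page.
    // "links" maps each page to the pages it links to (hypothetical data).
    static List<String> bfs(Map<String, List<String>> links, String seed) {
        List<String> order = new ArrayList<String>();
        Set<String> visited = new HashSet<String>();
        Queue<String> frontier = new ArrayDeque<String>();
        visited.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String page = frontier.remove();
            order.add(page); // a real crawler would download the page here
            List<String> out = links.get(page);
            if (out == null) {
                continue; // no outgoing links recorded for this page
            }
            for (String next : out) {
                // add() returns false for pages already seen, which avoids loops
                if (visited.add(next)) {
                    frontier.add(next);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<String, List<String>>();
        links.put("A", Arrays.asList("B", "C"));
        links.put("B", Arrays.asList("D"));
        links.put("C", Arrays.asList("A", "D"));
        System.out.println(bfs(links, "A")); // prints [A, B, C, D]
    }
}
```

The visited set is what keeps the walk from going in circles when pages link back to each other, as C does to A here.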

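This post presents a small Java program that fetches the content of a page from a given URL and saves it locally. Fetching a page means reading the network resource identified by the URL from the network stream and saving it to local storage. It is like using a program to imitate a browser: the URL is sent to the server as the content of an HTTP request, and the server's response is then read back. Java has a natural advantage in network programming: it treats a network resource as a kind of file, so accessing one is as convenient as accessing a local resource, and it wraps the request and the response as streams. The java.net.URL class can send a request to a web server and receive the reply, but real network environments are complicated. If you imitate a browser using only the APIs in java.net, you have to handle the HTTPS protocol, HTTP status codes and so on yourself, and the code becomes very complicated. Real projects therefore often use Apache's HttpClient (which requires the HttpClient package) to imitate a browser and fetch page content. The main steps are as follows: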

// Create a client, like opening a browser
HttpClient httpClient = new HttpClient();
// Create a GET method, like typing an address into the browser; path is the URL
GetMethod getMethod = new GetMethod(path);
// Execute the request and get the response status code
int statusCode = httpClient.executeMethod(getMethod);
// Get the returned content
String result = getMethod.getResponseBodyAsString();
// Release the connection
getMethod.releaseConnection();
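For comparison, the plain java.net.URL route mentioned earlier can be sketched roughly as follows. This is only an illustration of "a network resource is read like a file stream": the class name `UrlStreamSketch` and the method `readAll` are made up, the text is assumed to be UTF-8, and nothing here checks status codes, which is exactly the kind of work HttpClient takes over.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

class UrlStreamSketch {

    // Reads everything behind a URL into a string, assuming UTF-8 text.
    static String readAll(URL url) throws IOException {
        StringBuilder sb = new StringBuilder();
        // openStream() hands back an ordinary InputStream, just like a file
        InputStream in = url.openStream();
        try {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, "UTF-8"));
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } finally {
            in.close(); // always close the stream, even on errors
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java UrlStreamSketch <url>");
            return;
        }
        System.out.println(readAll(new URL(args[0])));
    }
}
```

Because java.net.URL also understands `file:` URLs, the same method can read a local file, which shows how closely Java models network resources on files.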

The complete code is as follows:

/************ The simplest web page fetcher in Java ************/

//////////Written By C_Shit_Hu //////////////////

////////// Time : 2013.9.20 library ////////////

import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

public class RetrivePage {
	private static HttpClient httpClient = new HttpClient();

	public static boolean downloadPage(String path) throws HttpException,
			IOException {
		GetMethod getMethod = new GetMethod(path);
		try {
			// Execute the request and get the response status code
			int statusCode = httpClient.executeMethod(getMethod);
			if (statusCode != HttpStatus.SC_OK) {
				return false;
			}
			// Read the response body once and reuse it
			String pageString = getMethod.getResponseBodyAsString();
			System.out.println("response=" + pageString);
			// Write the page to a local file
			FileWriter fwrite = new FileWriter("hello.txt");
			try {
				fwrite.write(pageString);
			} finally {
				// Close the file even if the write fails
				fwrite.close();
			}
			return true;
		} finally {
			// Release the connection, whatever the status code was
			getMethod.releaseConnection();
		}
	}

	/**
	 * Test code
	 */
	public static void main(String[] args) {
		// Fetch the specified page and print it
		try {
			Scanner in = new Scanner(System.in);
			System.out.println("Input the URL of the page you want to get:");
			String path = in.next();
			System.out.println("program start!");
			RetrivePage.downloadPage(path);
			System.out.println("Program end!");
		} catch (HttpException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}
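One limitation of the program above is that every page is written to hello.txt, so each run overwrites the last. A possible way around that, sketched here under my own assumptions, is to derive the file name from the URL. The helper `fileNameFor` and the class `FileNameSketch` are hypothetical and not part of HttpClient; the `.html` suffix is an arbitrary choice.

```java
class FileNameSketch {

    // Turns a URL into a flat file name: drops the scheme ("http://"),
    // then replaces characters that are awkward in file names with '_'.
    static String fileNameFor(String url) {
        String rest = url.replaceFirst("^[a-zA-Z][a-zA-Z0-9+.-]*://", "");
        String safe = rest.replaceAll("[^A-Za-z0-9._-]", "_");
        return safe + ".html";
    }

    public static void main(String[] args) {
        String url = "http://blog.csdn.net/hu1020935219/article/details/11697109";
        // prints blog.csdn.net_hu1020935219_article_details_11697109.html
        System.out.println(fileNameFor(url));
    }
}
```

The result could then be passed to `new FileWriter(...)` in place of the fixed "hello.txt".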

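When compiling, make sure the libraries are imported: add all of the HttpClient jars (the HttpClient package is required) to the project, and the program will run.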

The package is available here: http://download.csdn.net/detail/hu1020935219/6293223

Tomorrow I will keep learning how to use search to strengthen this crawler...
