Calling Python from Java to download web pages

 

Reference for this post: http://tonl.iteye.com/blog/1918245

Python version: 2.7, 64-bit Windows build.

Download Python: http://www.python.org/getit/

First, write the following spider.py script:

# -*- coding: utf-8 -*-
# Python 2.7 script: download every URL listed in a text file.
from urllib import urlopen
import os
import sys

class Spider:
    """
    Download the web pages listed, one URL per line, in the given file.
    """
    def __init__(self, filename, downloadPath):
        """
        Store the URL list file and the download directory;
        exit with a non-zero status if either one is missing.
        """
        if not os.path.isfile(filename):
            print 'the given file does not exist, the program will exit'
            sys.exit(1)
        if not os.path.isdir(downloadPath):
            print 'the given download path does not exist, the program will exit'
            sys.exit(1)
        self.fname = filename
        self.dpath = downloadPath

    def download(self):
        """
        Download each URL in the file, line by line.
        """
        fp = open(self.fname, 'r')
        for line in fp:
            url = line.strip()  # drop the trailing newline before opening the URL
            if not url:
                continue        # skip blank lines
            # Build the file name from the alphanumeric characters of the URL.
            if 'html' in url:
                tempname = filter(str.isalnum, url).replace('html', '.html')
            else:
                tempname = filter(str.isalnum, url) + '.html'
            self.download_html(url, os.path.join(self.dpath, tempname))
        fp.close()

    def download_html(self, website, filename):
        """
        Download the page at the given URL and save it under the given file name.
        """
        response = urlopen(website)
        data = response.read()
        response.close()
        fp = open(filename, 'wb')  # binary mode; overwrite instead of appending on re-runs
        fp.write(data)
        fp.close()

def test():
    """
    Entry point: argv[1] is the URL list file, argv[2] is the download directory.
    """
    filename = sys.argv[1]
    downloadPath = sys.argv[2]
    spider = Spider(filename, downloadPath)
    spider.download()

if __name__ == '__main__':
    test()
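The script above takes two arguments. The first is a text file listing the addresses of the pages to download, one URL per line, for example (websites.txt):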

http://blog.csdn.net/fansy1990
http://www.baidu.com
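The second argument is the directory where the downloaded pages are saved. The directory must already exist.

Then run the script from the command line:

python D:\spider.py D:\websites.txt D:\download_tmp

Then look in download_tmp on drive D for the downloaded files. If they are there, the setup is correct.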
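
Since the file names are derived from the URLs, it helps to know what to look for in the download directory. The following is a minimal Java sketch that mirrors the naming rule in spider.py (keep only letters and digits, then end the name with .html). The class name ExpectedNames and the method expectedName are made up for this illustration, and the sketch assumes the URLs are plain ASCII.

```java
import java.util.Arrays;
import java.util.List;

class ExpectedNames {

    // Mirrors the naming rule in spider.py: keep only letters and digits,
    // then either turn "html" into ".html" or append ".html".
    // Assumption: URLs are plain ASCII, so [A-Za-z0-9] keeps the same
    // characters as Python 2's str.isalnum does for these inputs.
    static String expectedName(String url) {
        String alnum = url.trim().replaceAll("[^A-Za-z0-9]", "");
        if (url.contains("html")) {
            return alnum.replace("html", ".html");
        }
        return alnum + ".html";
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://blog.csdn.net/fansy1990",
                "http://www.baidu.com");
        for (String url : urls) {
            System.out.println(url + " -> " + expectedName(url));
        }
    }
}
```

For the two sample URLs, the files to look for are httpblogcsdnnetfansy1990.html and httpwwwbaiducom.html.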

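Finally, write the following Java program. The author added the jython-*.jar package to the project (version 2.2 was used). Note that the code itself starts the external python interpreter through Runtime.exec and does not reference any Jython class: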

package test;

import java.io.IOException;

public class PyTest {

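	/**
	 * Runs spider.py with the external python interpreter and waits for it to finish.
	 *
	 * @param args not used
	 * @throws IOException if the python process cannot be started
	 * @throws InterruptedException if waiting for the process is interrupted
	 */
	public static void main(String[] args) throws IOException, InterruptedException {	
		  String py_path="D:\\spider.py";
		  String websites="D:\\websites.txt";
		  String outDir="D:\\tmp"; // output directory; it must already exist
		  // Start "python <script> <url list> <output dir>" as an external process
		  Process pr=Runtime.getRuntime().exec("python "+py_path+" "+websites+" "+outDir );
		  pr.waitFor();
		  System.out.println("done ...");
	}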
	}

}
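To run the program above from Eclipse, open the Environment settings of its run configuration and add a PATH variable whose value is the Python installation directory.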
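
If editing the Eclipse environment is inconvenient, one alternative is to pass the interpreter's full path as the first element of the command. The sketch below is an illustration only, not the author's code: it uses ProcessBuilder in place of Runtime.exec, reads the script's output so that the printed messages from spider.py become visible, and reports the exit code. The class name PyLauncher, its helper methods, and the path C:\Python27\python.exe are assumptions; substitute the real installation path.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PyLauncher {

    // Builds the argument list: interpreter, script, then the script's arguments.
    // Passing arguments as separate list elements avoids problems with spaces in paths.
    static List<String> buildCommand(String python, String script, String... scriptArgs) {
        List<String> cmd = new ArrayList<String>();
        cmd.add(python);
        cmd.add(script);
        cmd.addAll(Arrays.asList(scriptArgs));
        return cmd;
    }

    // Starts the process, echoes everything it prints, and returns its exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr into stdout so one reader sees both
        Process pr = pb.start();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(pr.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
        return pr.waitFor();
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical interpreter location; adjust to the local installation.
        List<String> cmd = buildCommand("C:\\Python27\\python.exe",
                "D:\\spider.py", "D:\\websites.txt", "D:\\tmp");
        try {
            System.out.println("exit code: " + run(cmd));
        } catch (IOException e) {
            System.out.println("could not start python: " + e.getMessage());
        }
    }
}
```

Reading the output before calling waitFor also keeps the child process from blocking when it prints more than the pipe buffer can hold.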

After the program runs, the console shows:

*sys-package-mgr*: can't create package cache dir, *jython-2.2.jar\cachedir\packages'
This message can be ignored; it does not affect the program.

Share, grow, be happy.

When reposting, please credit the blog: http://blog.csdn.net/fansy1990


