Reading Word document content and data with Python

How to read Word file information with Python on Linux

Step 1: get the XML file that the docx file is built from

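import zipfile

def get_word_xml(docx_filename):
    """Return the raw bytes of word/document.xml from a .docx archive."""
    # A .docx file is a zip archive; ZipFile opens the path in binary mode itself.
    with zipfile.ZipFile(docx_filename) as archive:
        return archive.read('word/document.xml')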

Step 2: parse the XML into a tree data structure

from lxml import etree

def get_xml_tree(xml_string):
    """Parse the XML bytes into an lxml element tree."""
    return etree.fromstring(xml_string)

Step 3: read the Word content:

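# Namespace of the WordprocessingML elements in word/document.xml
WORD_SCHEMA = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Written as module-level functions so they work with get_xml_tree() from step 2.
def _check_element_is(element, type_char):
    """Return True if element is the <w:type_char> tag, e.g. 't' for a text run."""
    return element.tag == '{%s}%s' % (WORD_SCHEMA, type_char)

def _itertext(my_etree):
    """Iterator to go through xml tree's text nodes"""
    # tag=etree.Element skips comments and processing instructions
    for node in my_etree.iter(tag=etree.Element):
        if _check_element_is(node, 't'):
            yield (node, node.text)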
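The three steps can also be tried end to end without installing anything. The following is a minimal sketch that swaps lxml for the standard library's xml.etree.ElementTree; the names docx_paragraphs and make_sample_docx are hypothetical helpers introduced here, and the sample archive contains only word/document.xml, so it is a test fixture rather than a file Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(docx_file):
    """Return the text of each <w:p> paragraph in a .docx path or file object."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # A paragraph's text is split across <w:t> runs; join them back together.
        runs = [node.text or "" for node in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return paragraphs


def make_sample_docx(lines):
    """Build an in-memory archive holding only word/document.xml (demo fixture)."""
    body = "".join("<w:p><w:r><w:t>%s</w:t></w:r></w:p>" % line for line in lines)
    xml = '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


sample = make_sample_docx(["Hello", "World"])
print(docx_paragraphs(sample))
```

Grouping the text runs by paragraph, instead of yielding every text node separately, keeps the line structure of the document, which is usually what a caller wants.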

如何在 Linux 上使用 Python 读取 word 文件信息

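Note that every program here has #!/usr/bin/env python as its first line, which means we want the Python interpreter to run these scripts. To make a script executable, run chmod +x your-script.py; you can then run it as ./your-script.py (this article uses that form).

Exploring the platform module

The platform module is part of the standard library and has many functions that let us obtain system information. Let's start the Python interpreter and explore a few of them, beginning with platform.uname():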

>>> import platform

>>> platform.uname()

('Linux', 'fedora.echorand', '3.7.4-204.fc18.x86_64', '#1 SMP Wed Jan 23 16:44:29 UTC 2013', 'x86_64')

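If you already know the uname command on Linux, you will recognize this function as an interface to it. On Python 2 it returns a tuple containing the system type (or kernel name), hostname, release, version, machine hardware, and processor information. You can access individual attributes by index, like this: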

>>> platform.uname()[0]

'Linux'

On Python 3, the function returns a named tuple:

>>> platform.uname()

uname_result(system='Linux', node='fedora.echorand',
release='3.7.4-204.fc18.x86_64', version='#1 SMP Wed Jan 23 16:44:29
UTC 2013', machine='x86_64', processor='x86_64')

Because the result is a named tuple, you can pick out a specific attribute by name instead of having to remember its index, like this:

>>> platform.uname().system

'Linux'

The platform module also has direct interfaces to some of the attributes above, like this:

>>> platform.system()

'Linux'

>>> platform.release()

'3.7.4-204.fc18.x86_64'
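The interpreter session above can be folded into a small script. This is a sketch for Python 3 only, since it relies on the named-tuple fields; system_summary is a hypothetical helper name, and the values printed depend on the machine it runs on.

```python
import platform


def system_summary():
    """Collect a few platform fields into a plain dict."""
    info = platform.uname()  # a named tuple on Python 3
    return {
        "system": info.system,
        "node": info.node,
        "release": info.release,
        "version": info.version,
        "machine": info.machine,
        "python": platform.python_version(),
    }


summary = system_summary()
for key, value in summary.items():
    print("%-8s %s" % (key, value))
```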

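The Word format is not fully open, so there is essentially no general-purpose library for reading Word documents, and the interfaces offered by each vendor's software differ. You therefore first need to determine which software you are using.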

Python reads a Word document but cannot import it into a MySQL database

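Is the question "what to do when Python reads a Word document but cannot import it into a MySQL database"? If so, proceed as follows:

1. Use Python to convert the doc file to a docx file.

2. Read the docx file and save it as a file in htm format.

3. Use bs4 in Python to get the content of the p tags, and save a paragraph if it has content.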
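Step 3 can be sketched as follows. To keep the example free of third-party dependencies, it swaps bs4 for the standard library's html.parser; ParagraphCollector and extract_paragraphs are hypothetical names, and the HTML string is a made-up sample, not real output from Word.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text inside each <p>...</p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None  # list of text chunks while inside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._current is not None:
            text = "".join(self._current).strip()
            if text:  # keep only paragraphs that actually contain text
                self.paragraphs.append(text)
            self._current = None


def extract_paragraphs(html):
    """Return the non-empty paragraph texts found in an HTML string."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


sample_html = (
    "<html><body><p>First <b>bold</b> line</p>"
    "<p>  </p><p>Second line</p></body></html>"
)
print(extract_paragraphs(sample_html))
```

The resulting list of strings is what would then be passed to the database insert, one row per paragraph.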

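How to read and write mixed-layout Word content with Python

Python can process Word documents with the python-docx module, which works in an object-oriented way. The module treats the document, and the paragraphs, text, fonts, and so on inside it, as objects; working with those objects is how you work with the document's content.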

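2. Related concepts

To read the text in a Word document (in general, a program only needs the document's text), you first need to know a few python-docx concepts.

1. The Document object represents a Word document.

2. The Paragraph object represents one paragraph in the document.

3. The text attribute of a Paragraph object holds the paragraph's text.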

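3. Installing and importing the module

Note that python-docx is installed by running pip install python-docx at the cmd command line; the message Successfully installed means the installation succeeded.

Note that the module is imported with import docx.

Oddly enough, many modules are installed under one name and imported under another, so perhaps a Python module manager, call it python-men, is overdue. This paragraph is purely a postscript.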

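4. Reading Word text

With the above in mind, the rest is simple. First create the file D:\temp\word.docx and enter some sample text in it.

Then write a short program. The code and its output are as follows: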

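# Example: read the text in a docx file
import docx

# Get the document object
file = docx.Document("D:\\temp\\word.docx")
print("Paragraph count:" + str(len(file.paragraphs)))  # 13 paragraphs; each Enter starts a new one

# Print the content of each paragraph
for para in file.paragraphs:
    print(para.text)

# Print each paragraph's index and content
for i, para in enumerate(file.paragraphs):
    print("Paragraph " + str(i) + ": " + para.text)

Output:

================ RESTART: F:/360data/重要数据/桌面/学习笔记/readWord.py ================
Paragraph count:13
Ah

I see a mountain

A majestic mountain

So tall

Ah

This mountain is!

Really very tall!
Paragraph 0: Ah
Paragraph 1: 
Paragraph 2: I see a mountain
Paragraph 3: 
Paragraph 4: A majestic mountain
Paragraph 5: 
Paragraph 6: So tall
Paragraph 7: 
Paragraph 8: Ah
Paragraph 9: 
Paragraph 10: This mountain is!
Paragraph 11: 
Paragraph 12: Really very tall!
>>>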
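In the output above every other paragraph is empty, because each blank line in the document is a paragraph of its own. The sketch below filters those out. The function non_empty_paragraphs is a hypothetical helper that only relies on the paragraphs and text attributes described earlier; so that it runs without python-docx installed, the demo uses a stand-in object built with SimpleNamespace instead of a real docx.Document.

```python
from types import SimpleNamespace


def non_empty_paragraphs(document):
    """Return (index, text) pairs for paragraphs whose text is not blank."""
    return [
        (index, para.text)
        for index, para in enumerate(document.paragraphs)
        if para.text.strip()
    ]


# Stand-in with the same shape as a python-docx document (demo only).
fake_doc = SimpleNamespace(
    paragraphs=[SimpleNamespace(text=t) for t in ["Ah", "", "I see a mountain", ""]]
)
print(non_empty_paragraphs(fake_doc))
```

With a real file, the same function would be called as non_empty_paragraphs(docx.Document(path)). The original indexes are kept so that a line can still be traced back to its position in the document.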

Summary

That is everything in this walkthrough of reading Word text with Python. If anything is missing or wrong, corrections are welcome.

Batch-reading encrypted Word documents with Python and saving them as txt files

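# -*- coding: utf-8 -*-
# Requires Windows, Microsoft Word, and the pywin32 package.
import os
from win32com import client as wc

key = 'document password'  # placeholder: the password shared by the documents

def Translate(input_path, output_path):
    # Convert one password-protected Word document to a txt file
    wordapp = wc.Dispatch('Word.Application')
    try:
        # Arguments: FileName, ConfirmConversions, ReadOnly, AddToRecentFiles, PasswordDocument
        doc = wordapp.Documents.Open(input_path, False, False, False, key)
        # FileFormat=4 saves as text so that Python can later read the txt
        # in 'r' mode without garbled characters
        doc.SaveAs(FileName=output_path, FileFormat=4, Encoding="gb2312")
        doc.Close()
        print(input_path, "done")
        os.remove(input_path)  # note: deletes the source document after conversion
    except Exception:
        print(input_path, "wrong password")

if __name__ == '__main__':
    # Directory holding the Word documents
    path = r"C:Usersdocx"
    j = 0
    for file in os.listdir(path):
        if file.endswith(('.doc', '.docx')):
            name = os.path.splitext(file)[0]
            # Input document path
            input_file = os.path.join(path, file)
            # Output document path
            output_file = os.path.join(r"C:Users xt", name + ".txt")
            Translate(input_file, output_file)
            j = j + 1
            print(j)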
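The file-selection part of the script can be tested on any platform, separately from the Word automation. The following sketch uses pathlib to pair each Word file with its output path; plan_conversions is a hypothetical helper, and the demo runs on a temporary directory of empty placeholder files rather than real documents.

```python
import tempfile
from pathlib import Path


def plan_conversions(source_dir, target_dir):
    """Pair each .doc/.docx file in source_dir with a .txt path in target_dir."""
    pairs = []
    for entry in sorted(Path(source_dir).iterdir()):
        # Compare the real extension instead of searching for '.doc' anywhere in the name.
        if entry.is_file() and entry.suffix.lower() in (".doc", ".docx"):
            pairs.append((entry, Path(target_dir) / (entry.stem + ".txt")))
    return pairs


# Demo on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.docx", "b.doc", "notes.txt", "report.final.docx"):
        (Path(tmp) / name).touch()
    planned = [(src.name, dst.name) for src, dst in plan_conversions(tmp, tmp)]
    print(planned)
```

Each (source, target) pair could then be handed to the conversion function, with the paths converted to strings via str().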

Reading Word page content with Python

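# yourpath is a placeholder for the path of a text file whose lines look like A==B--C
with open(yourpath) as f:
    for line in f:
        t = line.rstrip("\n").split("==")
        part_1 = t[0] + "=="
        part_2, part_3 = t[1].split("--")
        print("Part 1:%s\tPart 2:%s\tPart 3:%s" % (part_1, part_2, part_3))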
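The same splitting logic can be tried on a single string. This sketch assumes each line has the form A==B--C, which is inferred from the separators in the snippet above; split_record is a hypothetical helper that uses str.partition so that a line without both separators yields None instead of raising an exception.

```python
def split_record(line):
    """Split 'A==B--C' into ('A==', 'B', 'C'); return None if a separator is missing."""
    head, sep1, rest = line.rstrip("\n").partition("==")
    middle, sep2, tail = rest.partition("--")
    if not sep1 or not sep2:
        return None
    return head + "==", middle, tail


print(split_record("title==body--footer\n"))
print(split_record("no separators here"))
```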
