#LET ME SEND DOCX AND XLSX AS FILES THAT ONE SHOULD EDIT ON THEIR COMPUTER ITS BETTEE THAN THE HELL IM LIVING IN
Explore tagged Tumblr posts
tchaikovskym · 1 year ago
Text
Tumblr might be straight shooting itself into death with all the greedy copycat changes, but by far the biggest enemy on the world wide web for me is Microsoft SharePoint
0 notes
loadingtax915 · 3 years ago
Text
Python Docx
Tumblr media
Python Docx4j
Python Docx To Pdf
Python Docx Table
Python Docx To Pdf
Python Docx2txt
Python Docx2txt
When you ask someone to send you a contract or a report there is a high probability that you’ll get a DOCX file. Whether you like it not, it makes sense considering that 1.2 billion people use Microsoft Office although a definition of “use” is quite vague in this case. DOCX is a binary file which is, unlike XLSX, not famous for being easy to integrate into your application. PDF is much easier when you care more about how a document is displayed than its abilities for further modifications. Let’s focus on that.
Python-docx versions 0.3.0 and later are not API-compatible with prior versions. Python-docx is hosted on PyPI, so installation is relatively simple, and just depends on what installation utilities you have installed. Python-docx may be installed with pip if you have it available.
Installing Python-Docx Library Several libraries exist that can be used to read and write MS Word files in Python. However, we will be using the python-docx module owing to its ease-of-use. Execute the following pip command in your terminal to download the python-docx module as shown below.
Python has a few great libraries to work with DOCX (python-dox) and PDF files (PyPDF2, pdfrw). Those are good choices and a lot of fun to read or write files. That said, I know I'd fail miserably trying to achieve 1:1 conversion.
Release v0.8.10 (Installation)python-docx is a Python library for creating and updating Microsoft Word (.docx) files.
Tumblr media
Looking further I came across unoconv. Universal Office Converter is a library that’s converting any document format supported by LibreOffice/OpenOffice. That sound like a solid solution for my use case where I care more about quality than anything else. As execution time isn't my problem I have been only concerned whether it’s possible to run LibreOffice without X display. Apparently, LibreOffice can be run in haedless mode and supports conversion between various formats, sweet!
I’m grateful to unoconv for an idea and great README explaining multiple problems I can come across. In the same time, I’m put off by the number of open issues and abandoned pull requests. If I get versions right, how hard can it be? Not hard at all, with few caveats though.
Testing converter
LibreOffice is available on all major platforms and has an active community. It's not active as new-hot-js-framework-active but still with plenty of good read and support. You can get your copy from the download page. Be a good user and go with up-to-date version. You can always downgrade in case of any problems and feedback on latest release is always appreciated.
On macOS and Windows executable is called soffice and libreoffice on Linux. I'm on macOS, executable soffice isn't available in my PATH after the installation but you can find it inside the LibreOffice.app. To test how LibreOffice deals with your files you can run:
In my case results were more than satisfying. The only problem I saw was a misalignment in a file when the alignment was done with spaces, sad but true. This problem was caused by missing fonts and different width of 'replacements' fonts. No worries, we'll address this problem later.
Setup I
While reading unoconv issues I've noticed that many problems are connected due to the mismatch of the versions. I'm going with Docker so I can have pretty stable setup and so I can be sure that everything works.
Let's start with defining simple Dockerfile, just with dependencies and ADD one DOCX file just for testing:
Tumblr media
Let's build an image:
After image is created we can run the container and convert the file inside the container:
Running LibreOffice as a subprocess
We want to run LibreOffice converter as a subprocess and provide the same API for all platforms. Let's define a module which can be run as a standalone script or which we can later import on our server.
Required arguments which convert_to accepts are folder to which we save PDF and a path to the source file. Optionally we specify a timeout in seconds. I’m saying optional but consider it mandatory. We don’t want a process to hang too long in case of any problems or just to limit computation time we are able to give away to each conversion. LibreOffice executable location and name depends on the platform so edit libreoffice_exec to support platform you’re using.
subprocess.run doesn’t capture stdout and stderr by default. We can easily change the default behavior by passing subprocess.PIPE. Unfortunately, in the case of the failure, LibreOffice will fail with return code 0 and nothing will be written to stderr. I decided to look for the success message assuming that it won’t be there in case of an error and raise LibreOfficeError otherwise. This approach hasn’t failed me so far.
Uploading files with Flask
Converting using the command line is ok for testing and development but won't take us far. Let's build a simple server in Flask.
We'll need few helper function to work with files and few custom errors for handling error messages. Upload directory path is defined in config.py. You can also consider using flask-restplus or flask-restful which makes handling errors a little easier.
The server is pretty straightforward. In production, you would probably want to use some kind of authentication to limit access to uploads directory. If not, give up on serving static files with Flask and go for Nginx.
Important take-away from this example is that you want to tell your app to be threaded so one request won't prevent other routes from being served. However, WSGI server included with Flask is not production ready and focuses on development. In production, you want to use a proper server with automatic worker process management like gunicorn. Check the docs for an example how to integrate gunicorn into your app. We are going to run the application inside a container so host has to be set to publicly visible 0.0.0.0.
Setup II
Now when we have a server we can update Dockerfile. We need to copy our application source code to the image filesystem and install required dependencies.
In docker-compose.yml we want to specify ports mapping and mount a volume. If you followed the code and you tried running examples you have probably noticed that we were missing the way to tell Flask to run in a debugging mode. Defining environment variable without a value is causing that this variable is going to be passed to the container from the host system. Alternatively, you can provide different config files for different environments.
Supporting custom fonts
I've mentioned a problem with missing fonts earlier. LibreOffice can, of course, make use of custom fonts. If you can predict which fonts your user might be using there's a simple remedy. Add following line to your Dockfile.
Now when you put custom font file in the font directory in your project, rebuild the image. From now on you support custom fonts!
Summary
This should give you the idea how you can provide quality conversion of different documents to PDF. Although the main goal was to convert a DOCX file you should be fine with presentations, spreadsheets or images.
Further improvements could be providing support for multiple files, the converter can be configured to accept more than one file as well.
Photo by Samuel Zeller on Unsplash.
Did you enjoy it? Follow me@MichalZalecki on Twitter, where I share some interesting, bite-size content.
This ebook goes beyond Jest documentation to explain software testing techniques. I focus on unit test separation, mocking, matchers, patterns, and best practices.
Get it now!
Mastering Jest: Tips & Tricks | $9
Latest version
Released:
Extract content from docx files
Project description
Extract docx headers, footers, text, footnotes, endnotes, properties, and images to a Python object.
The code is an expansion/contraction of python-docx2txt (Copyright (c) 2015 Ankush Shah). The original code is mostly gone, but some of the bones may still be here.
shared features:
Tumblr media
extracts text from docx files
extracts images from docx files
no dependencies (docx2python requires pytest to test)
additions:
extracts footnotes and endnotes
converts bullets and numbered lists to ascii with indentation
converts hyperlinks to <a href='http:/...'>link text</a>
retains some structure of the original file (more below)
extracts document properties (creator, lastModifiedBy, etc.)
inserts image placeholders in text ('----image1.jpg----')
inserts plain text footnote and endnote references in text ('----footnote1----')
(optionally) retains font size, font color, bold, italics, and underscore as html
extract user selections from checkboxes and dropdown menus
full test coverage and documentation for developers
subtractions:
no command-line interface
will only work with Python 3.4+
Installation
Use
Note on html feature:
Tumblr media
font size, font color, bold, italics, and underline supported
hyperlinks will always be exported as html (<a href='http:/...'>link text</a>), even if export_font_style=False, because I couldn't think of a more cononical representation.
every tag open in a paragraph will be closed in that paragraph (and, where appropriate, reopened in the next paragraph). If two subsequenct paragraphs are bold, they will be returned as <b>paragraph q</b>, <b>paragraph 2</b>. This is intentional to make each paragraph its own entity.
if you specify export_font_style=True, > and < in your docx text will be encoded as > and <
Return Value
Function docx2python returns an object with several attributes.
header - contents of the docx headers in the return format described herein
footer - contents of the docx footers in the return format described herein
body - contents of the docx in the return format described herein
footnotes - contents of the docx in the return format described herein
endnotes - contents of the docx in the return format described herein
document - header + body + footer (read only)
text - all docx text as one string, similar to what you'd get from python-docx2txt
properties - docx property names mapped to values (e.g., {'lastModifiedBy': 'Shay Hill'})
images - image names mapped to images in binary format. Write to filesystem with
Tumblr media
Return Format
Some structure will be maintained. Text will be returned in a nested list, with paragraphs always at depth 4 (i.e., output.body[i][j][k][l] will be a paragraph).
If your docx has no tables, output.body will appear as one a table with all contents in one cell:
Table cells will appear as table cells. Text outside tables will appear as table cells.
To preserve the even depth (text always at depth 4), nested tables will appear as new, top-level tables. This is clearer with an example:
becomes ...
This ensures text appears
only once
in the order it appears in the docx
always at depth four (i.e., result.body[i][j][k][l] will be a string).
Working with output
This package provides several documented helper functions in the docx2python.iterators module. Here are a few recipes possible with these functions:
Some fine print about checkboxes:
MS Word has checkboxes that can be checked any time, and others that can only be checked when the form is locked.The previous print as. u2610 (open checkbox) or u2612 (crossed checkbox). Which this module, the latter willtoo. I gave checkboxes a bailout value of ----checkbox failed---- if the xml doesn't look like I expect it to,because I don't have several-thousand test files with checkboxes (as I did with most of the other form elements).Checkboxes should work, but please let me know if you encounter any that do not.
Release historyRelease notifications | RSS feed
1.27.1
1.27
1.26
Python Docx4j
1.25
1.24
1.23
1.22
1.21
1.19
1.18
1.17
1.16
1.15
1.14
1.13
1.12
1.11
1.2
Python Docx To Pdf
1.1
Python Docx Table
1.0
0.1
Python Docx To Pdf
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Python Docx2txt
Files for docx2python, version 1.27.1Filename, sizeFile typePython versionUpload dateHashesFilename, size docx2python-1.27.1-py3-none-any.whl (22.9 kB) File type Wheel Python version py3 Upload dateHashesFilename, size docx2python-1.27.1.tar.gz (33.3 kB) File type Source Python version None Upload dateHashes
Close
Hashes for docx2python-1.27.1-py3-none-any.whl
Hashes for docx2python-1.27.1-py3-none-any.whlAlgorithmHash digestSHA25651f6f03149efff07372ea023824d4fd863cb70b531aa558513070fe60f1c420aMD54b0ee20fed4a8cb0eaba8580c33f946bBLAKE2-256e7d5ff32d733592b17310193280786c1cab22ca4738daa97e1825d650f55157c
Close
Hashes for docx2python-1.27.1.tar.gz
Python Docx2txt
Hashes for docx2python-1.27.1.tar.gzAlgorithmHash digestSHA2566ca0a92ee9220708060ece485cede894408588353dc458ee5ec17959488fa668MD5759e1630c6990533414192eb57333c72BLAKE2-25684783b70aec51652a4ec4f42aa419a8af18d967b06390764527c81f183d1c02a
Tumblr media
0 notes