Welcome to Anne Gentle's just write click blog


Author-it and converting UTF-16 to UTF-8

In trying to modify multiple Author-it topics (okay, 5,119 topics) with variable assignments, I have had to work with the XML output that Author-it exports.

To export to XML, you select a topic or multi-select topics, then right-click on the selection and choose XML > Save to file.

Turns out, Author-it outputs its XML encoded as UTF-16, but apparently most Windows applications expect UTF-8. When I tried to open my freshly exported XML file, XML Copy Editor gave me an error.

So I had to discover how to convert the XML from UTF-16 encoding to UTF-8. You can’t just open it in Notepad on Windows and change the 16 to 8: there are other embedded characters indicating the encoding, and, well, the file itself is encoded.

First, I used the identity transform, which is documented in many places (my favorite being the XSLT Cookbook), to convert the Author-it output to UTF-8. Here’s the XSLT code for that:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
	<xsl:output encoding="utf-8"/>
	<xsl:template match="@*|node()">
		<xsl:copy>
			<xsl:apply-templates select="@*|node()"/>
		</xsl:copy>
	</xsl:template>
</xsl:stylesheet>

I just ran the above transform against the AuthorIT Objects.xml file I exported, using the Instant Saxon XSLT processor.
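
If you don’t have an XSLT processor handy, the same re-encoding can be sketched in a few lines of Python. This is only an illustration, not what I ran: the function name is made up, and it assumes the export starts with a byte order mark (without one, Python’s "utf-16" codec falls back to the machine’s native byte order). It decodes the bytes, rewrites the encoding named in the XML declaration, and writes the text back out as UTF-8, which is exactly the part you can’t do by hand in Notepad.

```python
import re


def utf16_xml_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 XML bytes as UTF-8 and fix the XML declaration."""
    # The "utf-16" codec reads the byte order mark and strips it.
    text = data.decode("utf-16")
    # Rewrite only the encoding named in the XML declaration, if there is one.
    text = re.sub(
        r'(<\?xml[^>]*encoding=["\'])[^"\']+(["\'])',
        r"\g<1>UTF-8\g<2>",
        text,
        count=1,
    )
    return text.encode("utf-8")


if __name__ == "__main__":
    # Hypothetical file names; adjust to wherever you saved the export.
    with open("export-utf16.xml", "rb") as source:
        converted = utf16_xml_to_utf8(source.read())
    with open("export-utf8.xml", "wb") as target:
        target.write(converted)
```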

Then, I wanted to remove all <VariableAssignment> elements, effectively removing an entire node wherever it appears. Again, the identity transform (or copy transform) was effective. And I learned that I had to identify the AuthorIT namespace, thanks to this excellent helper article, Handling Default Namespaces on topxml.com.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
	xmlns:ait="http://www.authorit.com/xml/authorit">

	<!--Match everything using identity copy-->
	<xsl:template match="@*|node()">
		<xsl:copy>
			<xsl:apply-templates select="@*|node()"/>
		</xsl:copy>
	</xsl:template>

	<!--Remove all the VariableAssignment nodes-->
	<xsl:template match="ait:VariableAssignment">
		<xsl:comment>Removed VariableAssignment element</xsl:comment>
	</xsl:template>

</xsl:stylesheet>

This transform is pretty dangerous, though, because it’s taking away all the VariableAssignment elements, and you might have valuable metadata stored that you would quickly blow away. So use with care.
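
One way to use it with care is to count what would be removed before removing anything. Here is a minimal Python sketch of that idea using the standard library’s ElementTree. It is not part of my original workflow; the function names are invented for illustration, and the only things taken from the stylesheet above are the namespace URI and the VariableAssignment element name.

```python
import xml.etree.ElementTree as ET

# Namespace URI as declared in the stylesheet above.
AIT_NS = "http://www.authorit.com/xml/authorit"
TAG = "{%s}VariableAssignment" % AIT_NS


def count_variable_assignments(root):
    """Dry run: how many elements would the removal touch?"""
    return sum(1 for _ in root.iter(TAG))


def remove_variable_assignments(root):
    """Remove every VariableAssignment element in place; return the count."""
    removed = 0
    # Snapshot the tree first so removal does not disturb the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == TAG:
                # Note: this also drops any text that trailed the element.
                parent.remove(child)
                removed += 1
    return removed
```

Parse the export with ET.parse, call count_variable_assignments on the root, and only call remove_variable_assignments once the number matches what you expect to lose.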

This workaround is likely also useful for DITA and XHTML files, because Author-it outputs its DITA and XHTML files as UTF-16. I haven’t investigated its usefulness for those areas, but a quick search on the Author-it-users group on Yahoo revealed that sometimes people want UTF-8 rather than UTF-16. So, I hope this helps.


Posted on : Dec 04 2007
Posted under tools |

DITA round up

Just doing a little data mining of the posts I’ve written about DITA in the last few years. I think that there’s a gap for DITA users who are writers or content creators and not coders. I’d like to say that DITA bloggers can bridge that gap. Join me on the DITA blog by writing your own experiences with DITA.

These posts are ordered from newest to oldest, and I wrote them to share my experiences with DITA and to chronicle some of the Central Texas DITA User Group meetings I attended.

A watched folder for publishing from DITA source files

June 15, 2007: I’ve figured out a way to automate DITA builds where you just drop a zip file of your DITA source files into a “watched folder” and PDF and CHM files are automatically built.

Usability and inline links in user assistance systems

May 19, 2007: Examining DITA’s linking and usability.

Getting Started with DITA

April 12, 2007: A brief overview for a couple of fellow Austin writers who have asked me recently how and where to get started with DITA.

Checking out the new DITA Users website

April 10, 2007: Using a coupon code (it’s BETA) I joined the new DITA Users website for free today.

A new DITA Open ToolKit release and brand new DITA newbie blog

October 04, 2006: A couple of blog-worthy items in the DITA world

Turning information into DITA topics

September 14, 2006: What would you do to make this particular type of content into topics?

How to substitute your custom CSS when using DITA Open Toolkit transforms

September 07, 2006: When you want to use the DITA Open Toolkit transforms but you want to use your own CSS, here’s how to substitute your CSS for HTML Help (CHM)

DITA Open ToolKit now has a User Guide

August 22, 2006: Just released last week, the DITA Open ToolKit now has its own User Guide

Using the DITA catalog for your specializations, creating a Public ID

August 16, 2006: Thought our discovery might help you as you specialize DITA

Evaluating XML editors for DITA

August 01, 2006: Notes from the July 2006 Central Texas DITA User Group meeting

A web-form-based DITA editor

July 14, 2006: Could this be the perfect storm for a DITA wiki?

Troubleshooting tip for the DITA Open Toolkit install

June 23, 2006: Finally figured out the fix for my DITA Open Toolkit “resource/messages.xml” not found error

Where to put your files and other setup for DITA

June 09, 2006: Working with the environment setup for DITA

Defining OPML and relating to DITA maps

May 31, 2006: I found a nice definition for OPML from whatis.com as their word of the day, and I’m starting to wonder about similarities between OPML and DITA maps

Learning more about DITA

May 18, 2006: Learning about how to get started with DITA and a trivia item for fun

Notes from the central Texas DITA user group meeting

April 21, 2006: Two speakers shared their takeaways from DITA 2006 and CMS 2006

Our DITA experience at BMC Software

March 02, 2006: Link to a case study published about BMC’s DITA experience

DITA from the trenches

February 20, 2006: Information Architect from IBM, Kristin Thomas, presented to the Central Texas DITA User’s Group meeting last week, and here are my notes.

Moving from Books to Topic-oriented Writing

January 27, 2006: A report from JoAnn Hackos’ talk at the Central Texas DITA Users Group meeting January 2006

DITA and wiki combo

December 05, 2005: Darwin Information Typing Architecture, meet Wiki.

Darwin Information Typing Architecture - DITA (dih tuh)

November 04, 2005: Roundup of the DITA reading I’ve been diving back in to lately.


Frame 8 is here: conversion or migration from unstructured FrameMaker to DITA

FrameMaker 8 was released yesterday with DITA support built in, but you will need to do your homework to determine the best path for migrating legacy content from unstructured Frame to DITA. Questions such as Unstructured to DITA - Concise Conversion Information Needed and What are the steps to convert an FM book to DITA To CHM? keep appearing on the Adobe FrameMaker 7.2 application pack forum. Conversion is only one part of the migration process: first you have to examine your content and get it to a point where it can be converted, so I prefer the term migration to conversion.

I did some research on this topic, attempting a conversion of legacy unstructured FrameMaker content to DITA, and wanted to write it up here in case it helps others.

In my experience with Frame 7.2, I wasn’t able to complete the entire complex process, partially due to the unstructured Frame files and partially due to my limitations in getting XSL to work through FrameMaker and in writing a FrameMaker structured application. Feel free to comment on these instructions if you have questions or ideas for where this process may not work well. Note that I haven’t fully studied how this process might change with the introduction of Frame 8, although many of these tips and tricks will still be useful for preparing your content.

Content preparation tips and tricks

If your Frame content is not already written in topics and pretty well structured, work really hard on getting the legacy content into some sort of order that will help in conversion. For example, make each Heading a topic starting point. Next, you may want to classify the content as task, concept, or reference beforehand, using style tags such as heading1task or heading1concept to indicate what type of topic the content is.

Indicate your task step numbered lists with a different style name from “plain” numbered lists.

Write your abstract and shortdesc (short descriptions) as initial paragraphs of each topic.

Use test Frame files that have few paragraph tags used. Add in paragraph styles one at a time and test the step-by-step process to make debugging easier.

Use a subset of DITA tags for your conversion, meaning you will be converting content to topics with fewer tags than the full DITA tag set. After conversion you can re-tag the content more specifically but having fewer tags to go to means less troubleshooting for the conversion itself.

Performing the conversion

The overall steps for the conversion involve the following tasks:
1. Structure the current unstructured FrameMaker document with the conversion table and rules text file as mapping helpers.
2. Open Frame book file and import the EDD or Element Definition Document, a proprietary FrameMaker file that correlates to a DTD (Document Type Definition).
3. Save the new resulting file as XML, either using a customized migration structured application that applies custom XSLT, or using the DITA Application Plug-In (which is a structured application also).

Each overall step is described in the tasks below.

I’ll also describe the files you’d create to start such a conversion. To create a structured application, you create a new folder in the C:\Program Files\Adobe\FrameMaker7.2\Structure\xml\ directory. This folder contains:

  • A text file that tells FrameMaker what the XML means in its own formatting conventions. For example, it tells Frame that the “xref” element is a cross-reference in FrameMaker.
  • A Document Type Definition file that works in concert with the EDD file that FrameMaker needs to interpret the DTD.
  • An Element Definition Document required by FrameMaker. This file is what you import into a FrameMaker chapter file.
  • An XSL file that could transform FrameMaker XML into separate DITA topics, so that each topic is in its own file rather than many topics being kept in each chapter file.
  • A structured FrameMaker file containing the configuration information for both the DITA plug-in and the new structured application that could use the XSL file for transformations.

To apply the conversion table

1. Open the Frame book file that has had the EDD imported into each chapter file.
2. Open the conversion table file.
3. Click File->Utilities->Structure Current Document.
4. Select the conversion table file name from the drop-down list, and then click Add Structure.

To apply the EDD

1. Open the Frame book file and select all chapter files using Shift+click.
2. Click File->Import->Element Definitions.
3. Select the EDD file name and click Import.

To save as XML

1. Click File->Choose Structured Application. Select either the trx-migrate structured application or the DITA-Topic-FM structured application.
2. Open the Frame book file and select all chapter files.
3. Click File->Save As and choose XML.
4. If you are using the trx-migrate application, the XSLT would be run on this step. If you are simply saving as DITA XML, then you would choose DITA as the Structured Application from the File menu.

Comments on the conversion table

The conversion table takes existing content and maps each paragraph and character style to an XML element, then wraps the elements in outer elements as needed. To accurately portray the information and ensure that one change doesn’t affect other areas of content, you must modify this table one element at a time and then check for accuracy. Building the conversion table is a painstaking and time-consuming project in itself, depending on the complexity of the content you start with.
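
To make the “map, then wrap” idea concrete, here is a toy Python sketch. It is not FrameMaker’s conversion table format (that is a FrameMaker document, not code), and the style names, mapping, and function name are all hypothetical. Each paragraph style maps to an element, and consecutive paragraphs that share a wrapper are grouped inside one outer element.

```python
import xml.etree.ElementTree as ET

# Hypothetical style names mapped to (element, wrapper element or None).
STYLE_TO_ELEMENT = {
    "Body": ("p", None),
    "Bulleted": ("li", "ul"),
    "Numbered": ("li", "ol"),
}


def structure(paragraphs):
    """Turn (style, text) pairs into a <body> element tree."""
    body = ET.Element("body")
    current_wrapper = None
    for style, text in paragraphs:
        element_name, wrapper_name = STYLE_TO_ELEMENT[style]
        if wrapper_name is None:
            # Unwrapped paragraph: close any open wrapper.
            current_wrapper = None
            parent = body
        else:
            # Start a new wrapper unless the previous paragraph opened the same kind.
            if current_wrapper is None or current_wrapper.tag != wrapper_name:
                current_wrapper = ET.SubElement(body, wrapper_name)
            parent = current_wrapper
        ET.SubElement(parent, element_name).text = text
    return body
```

Adding one style to the mapping at a time and checking the output mirrors the one-element-at-a-time approach described above.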

Manual cleanup and DITA map creation

According to this article, you would have additional manual cleanup to do: importing graphics, copying and pasting all table cell text, creating cross-reference links, and creating the hierarchy for your DITA maps, because saving topics at the heading level flattens the original structure.



A web-form based DITA editor

Could this be the perfect storm for a DITA wiki?

Written with just HTML, JavaScript, DOM, and CSS, as far as I can tell, DITA Storm is a product that enables web-form-based DITA topic authoring and display. Go check it out: their web site has a lot more content now, and you can even request a copy to play with.

As my help infrastructure buddy said, “This is so cool I think I’m going to freak out like this kid. Nintendo Sixty-FOOOOOOOOOOUR” Yep, it’s Friday, so yep, it’s a video link. Just go watch it, and get a good laugh.

I’m imagining that you could do several things with a web-based DITA editor (and topic styler). One would be a DITA-based wiki, where you author directly using DITA topics rather than some cryptic ASCII codes for headings, bulleted lists, and so forth. End users don’t have to know DITA to enter content, either. I do think you’d also want to be able to copy and paste content from existing DITA topics, and I’m not sure how you’d do that (maybe there’s a “view XML code” feature in the works?). That said, you can link right to the XML topic from within the HTML page, which would be nifty for DITA topics you already had waiting in the wings.

According to the web site, “The prototype version 0.3 of DITA Storm supports following DITA elements: topic, title, shortdesc, body, section, title, note, lq, q, fn, related-links, link, linktext, desc, p, b, i, u, tt, sup, sub, task, taskbody, context, result, steps, step, cmd, stepresult.” So it is a subset of DITA so far. (Others are subsetting DITA, but it’s not really DITA once you subset it, and you can’t later import DITA-compliant content into your subsetted set. Whew.)

Another idea for using DITA Storm for end-user doc might be to build your context-sensitive help system right in with your web interface product. I suppose this idea would work only if you can lock down the content at a certain point. But the stylizing with CSS is very promising and really lets you do anything you want with the content. The stylized examples are really nice looking, such as a stylized task topic and a stylized basic topic.

And, my pet idea would also be to use DITA Storm to write blog entries. Entering blog posts in a structured language like XML would work right in with the microformatting concept. Build the next blogger.com site using DITA Storm.

This type of product in my estimation has the potential to be the next Writely, stealing from the desktop publishing user base with the beauty and simplicity of a web editor that just gets the job done. Yes, it’s another way of writing, but a nifty little tool that could help your content do some interesting cartwheels.


Notes from the Central Texas DITA User Group meeting

Two speakers shared their takeaways from DITA 2006 and CMS 2006

I attended the central Texas DITA users group meeting last night, and wanted to write up some notes. We had two speakers share their thoughts after attending two related conferences this spring.

Bob Beims from Freescale shared his thoughts on attending the DITA 2006 conference at North Carolina State in Raleigh, NC, the first conference of its kind. He thinks he heard there were 185 attendees, and was pleasantly surprised at the range of users he met there. People were from medical companies with products for nurses, from the financial industry, from power and electric companies, and there was the hardware and software crowd. He had a couple of great quotes from different sessions. How about: “This is not rocket science… it is really bow and arrow stuff that has been implemented with technology.” from Michael Priestley of IBM, or “there’s never enough time and money to do things right, but always enough time and money to do things twice!” from Bernard Aschwanden of Publishing Smarter. I personally liked “Take the leap (or fall off the cliff!)” from Bob himself.

Bob said he realized that DITA solves some topic orientation problems that our industry has faced for decades. He was pleased at the pace at which the DITA Technical Committee is churning out releases: 1.1 is due out soon, and 1.2 in the next nine months. He feels that the OASIS leadership proves that DITA is not “just an IBM thing.” He thinks DITA maps should be awarded innovation of the year. He said that if you hate the limitations of FrameMaker conditional text, you’ll love the future of DITA with key values (DITA proposed feature #40), which would allow boolean queries against conditions for output. A conditional text tags contest ensued, with a starting bid of a document with 13 conditional text tags; someone with a Frame document containing 39 conditional text markers finally won. :)

I appreciated his comments on the two strata of tools: either very expensive, very functional, and easy to use, or (almost) free, fairly functional, but you’d better be a gear head to use ‘em. He sees a definite lack in conversion helpers for legacy content. Of course, with those words, a lively discussion ensued about transforming content versus just getting the text out by converting. Nearly everyone experienced in unstructured-to-structured conversion projects discovers that a real human has to figure out how to make topics out of the text that comes from a conversion. People who had done conversions said that Perl on MIFs out of Frame does the trick for getting out text, but in some cases you’re better off starting from scratch to plan for reuse and true topic orientation. Still, a conversion script (or set of scripts) at least takes your existing text into a structured start.

Bob also said that something he has learned while researching from many presentations inside and outside of the DITA conference is that you must develop an Information Architect role or you’ll end up chasing your tail when it comes to truly gaining benefit from a topic-oriented architecture for your information.

What does Bob see as next for DITA? He’d like to see a lower bar for entry. Currently the entry “fee” includes a lot of time for preparing your content and training your writers, the skills necessary to participate are high, and there’s money required for a bat and ball. He thinks there can be integration with non-DITA XML information streams, especially for those who interface with manufacturing industries. His example from Freescale’s perspective was the RosettaNet effort, where hardware manufacturers can offer “product and detailed design information, including product change notices and product technical specifications” via XML specifications. Incorporating that with DITA topics would help them build their information deliverables. He also noted that the DITA community might be a small one, but it is definitely composed of bleeding edge technology and technologists.

Next, Paul Arellanes, an information architect at IBM, gave his impressions of the Content Management Strategies 2006 conference in San Francisco. He saw a definite eagerness to adopt and use the DITA Open ToolKit as well as eagerness to reuse, reuse, reuse. His talk, Taxonomy Creation and Subject Classification for DITA Topics, was highly attended (standing room only) and very well received. He also stressed the importance of training on topic orientation before going to XML. He has a programming background, and likens DITA to object-oriented documentation. He’d like to see code reviews of how the tags are used and whether they’re used correctly. He got a couple of good ideas at the conference for how to build code reviews into the document review cycle. I’ll talk about those in the next paragraph. Paul talked about reuse and asked whether it’s a boon or a curse. Can you reuse a topic if you can’t find it? What if the topic was never designed for reuse in the first place? How can you design for reuse in the first place? He’d like some best practices for reuse.

He said that implementing DITA is a chance to change your documentation processes: going to topics with a fresh start at content is more successful than a legacy conversion because you can build and design for reuse. His takeaways are that we need best practices for reuse and that he’d like to build in source code reviews, and he found a cool method for doing that with an editor’s CSS process that checks syntax. These are the common errors that you could find and mark up with CSS (basically, color-coding the output after running it through a syntax checker built on CSS). Often these types of syntax/markup errors happen because the writer is tagging for looks, not for the meaning of the content, but they can also happen with legacy conversion.

  • placement of index entries
  • sections that should be separate topics
  • use of definition lists to create sections
  • ordered list tags instead of using step tags
  • lists of parameters in ordered list tags instead of param list
  • use of unordered list tags with bold instead of definition list
  • use of <ol> or <ul> instead of substeps or choices element in a task topic
  • use of <filepath> for variables and terms
  • menucascade not used
  • uicontrol not used
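
Paul’s method used CSS to color-code the problems. As a rough illustration of the same kind of “code review for tags,” here is a small Python sketch that flags three of the errors in the list above. It swaps CSS for a script, the function name and messages are made up, and the checks are deliberately simplistic; it also assumes topics without a namespace or DTD-supplied defaults.

```python
import xml.etree.ElementTree as ET


def find_markup_smells(topic_xml):
    """Return a list of messages for a few tagging-for-looks patterns."""
    root = ET.fromstring(topic_xml)
    findings = []

    # An ordered list in a task body that has no <steps> at all.
    for taskbody in root.iter("taskbody"):
        if taskbody.find("steps") is None and taskbody.find(".//ol") is not None:
            findings.append("ol used in a task instead of steps")

    # An unordered list whose every item opens with bold text.
    for ul in root.iter("ul"):
        items = ul.findall("li")
        if items and all(
            len(li) > 0 and li[0].tag == "b" and not (li.text or "").strip()
            for li in items
        ):
            findings.append("ul with bold lead-ins instead of a definition list")

    # A plain list inside a step, where substeps or choices would carry meaning.
    for step in root.iter("step"):
        if step.find(".//ol") is not None or step.find(".//ul") is not None:
            findings.append("ol or ul inside a step instead of substeps or choices")

    return findings
```

Run over a folder of topics, the returned messages could feed a report for the document review cycle.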

Paul also has good ideas for the future, including a troubleshooting or problem analysis and determination specialization from the task topic, and perhaps a way to pull DITA elements out of a topic and plug them into interactive content using AJAX. He was pleased to see that the skill set among attendees is pretty high, including XML, XSLT, SAX, FOP, CSS and Ant build tool skills.

Interestingly, as far as our group could see, Adobe was not represented at the DITA 2006 conference, even though they have a group implementing DITA for solutions documentation.

If you’re like me and didn’t attend the DITA 2006 conference, you might enjoy (as I did) the transcript of Norman Walsh’s talk. Norm is the chair of the DocBook Technical Committee, and DocBook and DITA are constantly pitted against each other for solving the problems of information developers.