Tropes, Semantic Text Analysis - Online Reference Manual

CHAPTER 5 - Appendixes

File conversions

Notes on file format conversions:

Format

Extension

Description

Adobe PDF

*.PDF

The software uses a specific driver (Adobe® IFilter) to extract the relevant text from PDF® (Portable Document Format) files. Adobe Acrobat® does not need to be installed, but the Adobe® IFilter driver must be installed to use this file format. External optical character recognition (OCR) software may be needed for certain files (those produced by scanning a paper document).

ASCII

*.ASC

The software uses the Windows API function OemToChar to convert this format to ANSI ISO-8859-1.

HTML

*.HTM

See the XML remarks (below), which also apply to the HTML (Hypertext Markup Language) file format.

Macintosh

*.MAC

The Apple Macintosh® Latin character sets are converted to ANSI ISO-8859-1.

Microsoft PowerPoint

*.PPT

The software performs a binary analysis of these files to extract the relevant text. Microsoft PowerPoint® or Office® does not need to be installed.

Microsoft Excel

*.XLS

The software performs a binary analysis of these files to extract the relevant text. Microsoft Excel® or Office® does not need to be installed.

Microsoft Word

*.DOC

The software performs a native conversion of Microsoft Word® files. It can also extract the text by binary analysis when a file is problematic or not recognized. Microsoft Word® or Office® does not need to be installed.

RTF

*.RTF

The software interprets RTF (Rich Text Format) files containing characters in ANSI ISO-8859-1, Apple Macintosh, or IBM ASCII, coded on 7 or 8 bits. Unicode is not accepted. Because the RTF standards differ from one another, stray characters can appear in certain files.

SGML

*.SGM

The software discards SGML (Standard Generalized Markup Language, ISO 8879) tags and converts some HTML-specific entities (such as accented characters). It does not interpret the DTD. UTF-8 (UCS Transformation Format-8, ISO 10646 / RFC 2279) is automatically converted to ANSI ISO-8859-1. Other Unicode formats (Universal Character Set, ISO 10646) are not directly accepted.

Text

*.TXT

No conversion is performed on this file format, which is assumed to be in ANSI ISO-8859-1 (also called ISO-LATIN1, or ANSI Windows).

Macromedia Flash

*.SWF

The software performs a binary analysis of these files to extract the relevant text. Macromedia® software does not need to be installed.

XML

*.XML

The software uses its SGML parser (see above) to read XML (Extensible Markup Language) files. The same limitations therefore apply to this format. The engine does not interpret scripts or style sheets.
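The character-set and markup conversions listed in the table above can be illustrated with a short Python sketch. It is not the software's own code. The codec names are assumptions (the manual does not say which OEM code page or which Macintosh variant is expected), the function names to_latin1 and strip_tags are hypothetical, and Python's lenient HTML parser stands in for the SGML parser.

```python
from html.parser import HTMLParser

# Python codec names standing in for the character sets named in the table.
# These mappings are assumptions made for illustration only.
SOURCE_CODECS = {
    "asc": "cp437",      # IBM/OEM "ASCII" (cp850 is another common OEM page)
    "mac": "mac_roman",  # Apple Macintosh Latin character set
    "utf8": "utf-8",     # UTF-8 input found in SGML/XML files
    "txt": "latin-1",    # already ISO-8859-1, so effectively no conversion
}


def to_latin1(raw: bytes, kind: str) -> bytes:
    """Decode bytes of the given source kind and re-encode as ISO-8859-1."""
    text = raw.decode(SOURCE_CODECS[kind], errors="replace")
    # Characters with no ISO-8859-1 equivalent are replaced by "?".
    return text.encode("latin-1", errors="replace")


class _TextOnly(HTMLParser):
    """Collect character data, dropping tags and decoding entities."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup: str) -> str:
    """Return the text of a tagged document without its tags."""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)
```

For example, strip_tags("<p>caf&eacute; <b>noir</b></p>") yields the text "café noir", and to_latin1 maps the byte for "é" in each source character set to the single ISO-8859-1 byte 0xE9. A character such as the euro sign, which ISO-8859-1 lacks, comes out as "?".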

In addition, a utility (Essential Conversions Wizard, downloadable from Semantic-Knowledge's private area) can automatically extract messages from Microsoft Outlook® mailboxes and convert files to text format by using Microsoft Word® or Adobe® IFilter.

Converting PDF® documents requires the Adobe® IFilter component to be installed on your system. You can download the latest version of IFilter from Adobe's® site, at the following address: http://www.adobe.com/support/downloads/detail.jsp?ftpID=2611. Note that adding this small component does not require installing (or updating) Adobe Acrobat®.

The software uses an internal component to convert Microsoft Word® documents (detailed conversion). If a problem occurs (corrupted file, etc.), conversion automatically falls back to binary analysis (heuristic), which tries to recover the text from the raw binary content. If you notice that the software hangs on certain Word files, you can deactivate this conversion in the Analysis options dialog ([Conversions] tab, located in the indexing parameters if you are using Zoom).
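The fallback behaviour described above can be sketched as follows. This is only an illustration of the control flow, not the software's implementation: convert_word_document, native_convert, heuristic_extract and native_enabled are hypothetical names, with native_enabled standing in for the option in the [Conversions] tab.

```python
from typing import Callable


def convert_word_document(
    raw: bytes,
    native_convert: Callable[[bytes], str],
    heuristic_extract: Callable[[bytes], str],
    native_enabled: bool = True,
) -> str:
    """Try the native conversion first, then fall back to the heuristic."""
    if native_enabled:
        try:
            return native_convert(raw)
        except Exception:
            # Corrupted or unrecognized file: fall through to the heuristic.
            pass
    return heuristic_extract(raw)
```

A converter that hangs never raises an error, so a fallback of this kind cannot catch it. In that case the only remedy is to switch the native conversion off, which is what native_enabled=False models here.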

In all cases, password-protected files cannot be read or directly converted by the software. You have to convert these files manually before analyzing them.

When the software performs a binary analysis (heuristic) to extract the text, stray characters may appear in the displayed text. By definition, this extraction method cannot be perfect, because the software does not take the native file format into account.
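To see why a format-blind heuristic lets unwanted characters through, consider the minimal sketch below. It is not the software's algorithm, which the manual does not describe. It assumes a simple approach similar to the Unix strings tool: keep every run of at least min_len bytes that are printable in ISO-8859-1. The name extract_text_runs and the threshold of 4 are illustrative choices.

```python
import re
from typing import List


def extract_text_runs(raw: bytes, min_len: int = 4) -> List[str]:
    """Return runs of printable ISO-8859-1 bytes found in binary data."""
    # 0x20-0x7E is printable ASCII; 0xA0-0xFF is the upper Latin-1 range.
    pattern = re.compile(rb"[\x20-\x7e\xa0-\xff]{%d,}" % min_len)
    return [m.group().decode("latin-1") for m in pattern.finditer(raw)]
```

Given the bytes b"\x00\x01Hello world\x00\xff\xfeJUNK\x02ab\x00Second line\x00", this returns three runs: "Hello world", "ÿþJUNK" and "Second line". The middle run is structural data that merely looks printable, which is exactly the kind of noise a heuristic extraction cannot rule out without knowing the file format.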


Copyright Acetic and Semantic Knowledge, all rights reserved
www.semantic-knowledge.com