Tropes, Semantic Text Analysis - Online Reference Manual
  info@semantic-knowledge.com 

CHAPTER 5 - Appendixes

File conversions

Notes on file format conversions:

Format

Extension

Description

ASCII

*.ASC

The software uses the Windows API function OemToChar to convert this format to ANSI ISO-8859-1.

HTML

*.HTM

See the XML remarks (below) for the HTML (Hypertext Markup Language) file format.

Macintosh text files

*.MAC

The Apple Macintosh® Latin character sets are converted to ANSI ISO-8859-1.

Microsoft PowerPoint

*.PPT

The software performs a binary analysis of Microsoft PowerPoint® 97/2003 files to extract the relevant text. Microsoft PowerPoint® or Office® does not need to be installed.

Microsoft Excel

*.XLS

The software performs a binary analysis of Microsoft Excel® 97/2003 files to extract the relevant text. Microsoft Excel® or Office® does not need to be installed.

Microsoft Word

*.DOC
*.DOCX

The software performs a native conversion of Microsoft Word® 97/2003 files. In addition, it can extract the text by binary analysis from problematic or unrecognized files.
Microsoft Word® 2007, Microsoft Office® 2007, or the free Microsoft Filter Pack 2007 must be installed to read Microsoft Word 2007 DOCX files.

PDF

*.PDF

The software uses a specific driver (Adobe® IFilter) to extract the relevant text from PDF® (Portable Document Format) files. Adobe Acrobat® does not need to be installed, but the Adobe® IFilter driver must be installed to use this file format. External optical character recognition (OCR) software may be necessary for certain files, such as those produced by scanning.

RTF

*.RTF

The software interprets RTF (Rich Text Format) files containing characters in ANSI ISO-8859-1, Apple Macintosh, or IBM ASCII, coded on 7 or 8 bits. Unicode is not accepted. Because RTF standards differ from one another, spurious characters can appear in certain files.

SGML

*.SGM

The software discards the tags of SGML (Standard Generalized Markup Language, ISO 8879) files and converts some HTML-specific character entities (such as accented characters). It does not interpret the DTD. UTF-8 (UCS Transformation Format 8, ISO 10646 / RFC 2279) is automatically converted to ANSI ISO-8859-1. Other Unicode formats (Universal Character Set, ISO 10646) are not directly accepted.

Text

*.TXT

No conversion is performed on this file format, which is assumed to be in ANSI ISO-8859-1 (also called ISO-LATIN1, or ANSI Windows).

Flash

*.SWF

The software performs a binary analysis of these files to extract the relevant text. Adobe® or Macromedia® software does not need to be installed.

XML

*.XML

The software uses its SGML parser (see above) to read XML (Extensible Markup Language) files. The same limitations therefore apply to this format. The engine does not interpret scripts or style sheets.
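
The character-set conversions listed in the table above can be illustrated with a short Python sketch. It is not the software's actual implementation: the function name to_latin1, the extension-to-encoding mapping, the choice of cp850 as the OEM code page (the real one depends on the system), and the assumption that markup files arrive as UTF-8 are all hypothetical. Tag removal here is a simple regular expression, not a real SGML parser.

```python
import html
import re

# Hypothetical mapping from file extension to source encoding.
# The OEM code page is system-dependent; cp850 is only an example.
SOURCE_ENCODINGS = {
    ".asc": "cp850",      # OEM/DOS text (the case OemToChar handles on Windows)
    ".mac": "mac_roman",  # Apple Macintosh Latin character set
    ".sgm": "utf-8",      # markup files are assumed to be UTF-8 in this sketch
    ".xml": "utf-8",
    ".htm": "utf-8",
    ".txt": "latin-1",    # already ISO-8859-1: effectively no conversion
}

MARKUP_EXTENSIONS = {".sgm", ".xml", ".htm"}
TAG_PATTERN = re.compile(r"<[^>]*>")


def to_latin1(raw: bytes, extension: str) -> bytes:
    """Return the text content of `raw` re-encoded as ISO-8859-1."""
    ext = extension.lower()
    text = raw.decode(SOURCE_ENCODINGS[ext], errors="replace")
    if ext in MARKUP_EXTENSIONS:
        text = TAG_PATTERN.sub(" ", text)  # discard tags; the DTD is ignored
        text = html.unescape(text)         # e.g. &eacute; becomes é
    # Characters with no ISO-8859-1 equivalent are replaced by "?".
    return text.encode("latin-1", errors="replace")
```

For example, to_latin1(b"<p>caf&eacute;</p>", ".htm") yields the Latin-1 bytes for "café" surrounded by spaces, while a character outside ISO-8859-1, such as the euro sign, comes out as "?".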

In addition, a utility (Essential Conversions Wizard, bundled with Tropes Zoom Pro) can automatically extract messages from Microsoft Outlook® mailboxes and convert files to text format by using Microsoft Word® or Adobe® IFilters.

Tropes and Zoom software use the existing IFilters installed on your system. For more information about IFilter technology and error codes, read the Microsoft documentation (http://msdn.microsoft.com/en-us/library/ms691105%28VS.85%29.aspx).

The conversion of PDF documents may require the installation of an Adobe® IFilter component on your system. You can download the latest free version of the PDF IFilter from Adobe's® site or buy IFilters made by third-party vendors.

The software uses an internal component to convert Microsoft Word® 97/2003 files. If a problem occurs (corrupted file, etc.), conversion automatically switches to binary (heuristic) analysis, which tries to recover the text from the raw binary content. If you notice that the software hangs on certain Word files, you can deactivate the conversion in the Analysis options dialog ([Conversions] tab, located in the indexing parameters if you are using Zoom).
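
The fallback behavior described above can be sketched as follows. This is only an illustration of the control flow, not the software's code: convert_with_fallback, native, heuristic, and use_native are hypothetical names, and the two converters are passed in as plain functions.

```python
from typing import Callable


def convert_with_fallback(
    data: bytes,
    native: Callable[[bytes], str],
    heuristic: Callable[[bytes], str],
    use_native: bool = True,
) -> str:
    """Try the native converter first; fall back to binary analysis on failure.

    Setting use_native=False mirrors deactivating the conversion in the
    options dialog: the heuristic is then used directly.
    """
    if use_native:
        try:
            return native(data)
        except Exception:
            # Corrupted or unrecognized file: switch to the heuristic.
            pass
    return heuristic(data)
```

Note that a try/except block only catches a converter that fails. A converter that hangs never returns, which is why an option to skip the native conversion entirely is useful.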

In all cases, password-protected files cannot be read or directly converted by the software. You have to convert these files manually to analyze them.

When the software performs a binary (heuristic) analysis to extract the text, spurious characters may appear in the displayed text. By definition, this method of text extraction cannot be perfect, because the software does not take the native file format into account.
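
To see why a heuristic extraction produces spurious characters, consider a deliberately simple sketch that keeps every run of printable ISO-8859-1 bytes found in a binary file. The software's actual heuristic is not documented here and is certainly more elaborate. The function name extract_text_runs and the minimum run length of 4 are assumptions made for this example.

```python
import re
from typing import List

# A run of at least 4 printable ISO-8859-1 bytes is treated as text.
# The threshold of 4 is an arbitrary choice for this sketch.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e\xa0-\xff]{4,}")


def extract_text_runs(data: bytes) -> List[str]:
    """Return the printable runs found in `data`, decoded as ISO-8859-1."""
    return [run.decode("latin-1") for run in PRINTABLE_RUN.findall(data)]
```

Because the function ignores the file's structure, binary bytes that happen to fall in the printable range are kept as if they were text. In b"\x00\x01Hello world\x00\x02\x03abcd\xff\xfe\x00ok", the bytes 0xFF and 0xFE are attached to "abcd" as stray characters, and the short run "ok" is lost.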

For more information about the supported files, see:



Copyright Acetic and Semantic Knowledge, all rights reserved
www.semantic-knowledge.com