java.lang.Object
  org.opencms.search.extractors.A_CmsTextExtractor
public abstract class A_CmsTextExtractor
Base utility class that allows extraction of the indexable "plain" text from a given document format.
| Field Summary | |
|---|---|
| protected byte[] | m_inputBuffer: A buffer in case the input stream must be read more than once. |
| Constructor Summary | |
|---|---|
| A_CmsTextExtractor() | |
| Method Summary | |
|---|---|
| I_CmsExtractionResult | extractText(byte[] content): Extracts the text and meta information from the given binary document. |
| I_CmsExtractionResult | extractText(byte[] content, String encoding): Extracts the text and meta information from the given binary document, using the specified content encoding. |
| I_CmsExtractionResult | extractText(InputStream in): Extracts the text and meta information from the document on the input stream. |
| I_CmsExtractionResult | extractText(InputStream in, String encoding): Extracts the text and meta information from the document on the input stream, using the specified content encoding. |
| InputStream | getStreamCopy(InputStream in): Creates a copy of the original input stream so that the input can be read more than once, which is required for certain document types. |
| protected String | removeControlChars(String content): Removes "unwanted" control chars from the given content. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
protected byte[] m_inputBuffer
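The field description suggests a buffering pattern: read the stream to its end once, keep the bytes, and hand out a fresh stream over those bytes on every request. The following is a minimal sketch of that idea in plain Java, assuming this is roughly how m_inputBuffer and getStreamCopy(InputStream) cooperate. The class BufferedStreamSource and its helpers drain and readTwice are hypothetical stand-ins, not OpenCms code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in that illustrates the buffering idea; not the OpenCms class. */
class BufferedStreamSource {

    /** Bytes of the input once it has been read; plays the role of m_inputBuffer. */
    private byte[] inputBuffer;

    /** Reads the given stream to its end and returns all bytes. */
    static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = in.read(chunk)) != -1) {
            out.write(chunk, 0, read);
        }
        return out.toByteArray();
    }

    /** Returns a fresh stream over the buffered bytes; the original is consumed only on the first call. */
    InputStream getStreamCopy(InputStream in) throws IOException {
        if (inputBuffer == null) {
            inputBuffer = drain(in);
        }
        return new ByteArrayInputStream(inputBuffer);
    }

    /** Demo helper: reads the same input twice and joins both results with '|'. */
    static String readTwice(byte[] data) {
        BufferedStreamSource source = new BufferedStreamSource();
        InputStream original = new ByteArrayInputStream(data);
        try {
            String first = new String(drain(source.getStreamCopy(original)), StandardCharsets.UTF_8);
            String second = new String(drain(source.getStreamCopy(original)), StandardCharsets.UTF_8);
            return first + "|" + second;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The second call succeeds even though the original stream is already exhausted, because the copy is served from the buffer rather than from the stream itself.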
| Constructor Detail |
|---|
public A_CmsTextExtractor()
| Method Detail |
|---|
public I_CmsExtractionResult extractText(byte[] content)
throws Exception
Description copied from interface: I_CmsTextExtractor
The encoding of the document is either not required (the document type may have one common default encoding) or the extractor is able to determine the encoding from the provided binary array automatically.
Delivers the same result as calling I_CmsTextExtractor.extractText(byte[], String) with the String argument set to null.
Specified by:
extractText in interface I_CmsTextExtractor
Parameters:
content - the binary content of the document to extract the text from
Throws:
Exception - if the text extraction fails
See Also:
I_CmsTextExtractor.extractText(byte[])
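The relationship between the one-argument and two-argument variants can be illustrated with a small delegation sketch. It is a simplified stand-in under stated assumptions: SketchExtractor and EchoExtractor are hypothetical names, the result type is reduced to String instead of I_CmsExtractionResult, the throws Exception clause is left out for brevity, and the UTF-8 fallback is an arbitrary choice for the demo.

```java
import java.nio.charset.Charset;

/** Hypothetical, simplified stand-in for an extractor base class; not the OpenCms API. */
abstract class SketchExtractor {

    /** One-argument variant: passes null so the implementation decides on the encoding. */
    String extractText(byte[] content) {
        return extractText(content, null);
    }

    /** Two-argument variant that a concrete extractor implements. */
    abstract String extractText(byte[] content, String encoding);
}

/** Toy extractor that reports which encoding it used, followed by the decoded text. */
class EchoExtractor extends SketchExtractor {

    @Override
    String extractText(byte[] content, String encoding) {
        // Assumption for this demo only: fall back to UTF-8 when no hint is given.
        String used = (encoding == null) ? "UTF-8" : encoding;
        return used + ":" + new String(content, Charset.forName(used));
    }
}
```

With this arrangement a concrete extractor only has to implement the variant that takes an encoding, and callers without an encoding hint get the same behavior as passing null explicitly.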
public I_CmsExtractionResult extractText(byte[] content, String encoding)
throws Exception
Description copied from interface: I_CmsTextExtractor
The encoding is a hint for the text extractor; if the value given is null, the text extractor should try to figure out the encoding itself.
Specified by:
extractText in interface I_CmsTextExtractor
Parameters:
content - the binary content of the document to extract the text from
encoding - the encoding to use
Throws:
Exception - if the text extraction fails
See Also:
I_CmsTextExtractor.extractText(byte[], java.lang.String)
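How an extractor figures out the encoding when the hint is null depends on the document format and is not specified here. As one possible approach, the sketch below honors an explicit hint and otherwise looks for a byte order mark, falling back to UTF-8. The class EncodingHint, the method resolve, and the UTF-8 default are assumptions made for illustration, not part of the OpenCms API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that resolves an encoding hint; not part of OpenCms. */
final class EncodingHint {

    private EncodingHint() {
        // static helper only
    }

    /** Returns the charset named by the hint, or guesses from a byte order mark if the hint is null. */
    static Charset resolve(byte[] content, String encoding) {
        if (encoding != null) {
            // An explicit hint wins; unknown names raise an unchecked exception.
            return Charset.forName(encoding);
        }
        if (content.length >= 3
            && (content[0] & 0xFF) == 0xEF
            && (content[1] & 0xFF) == 0xBB
            && (content[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (content.length >= 2 && (content[0] & 0xFF) == 0xFE && (content[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (content.length >= 2 && (content[0] & 0xFF) == 0xFF && (content[1] & 0xFF) == 0xFE) {
            // Simplification: a UTF-32LE byte order mark starts with the same two bytes.
            return StandardCharsets.UTF_16LE;
        }
        // Assumed default when nothing else is known.
        return StandardCharsets.UTF_8;
    }
}
```

Binary formats that record their own encoding internally would not need such a guess, which matches the remark that the encoding may not be required at all.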
public I_CmsExtractionResult extractText(InputStream in)
throws Exception
Description copied from interface: I_CmsTextExtractor
The encoding of the input stream is either not required (the document type may have one common default encoding) or the extractor is able to determine the encoding from the provided input stream automatically.
Delivers the same result as calling I_CmsTextExtractor.extractText(InputStream, String) with the String argument set to null.
Specified by:
extractText in interface I_CmsTextExtractor
Parameters:
in - the input stream for the document to extract the text from
Throws:
Exception - if the text extraction fails
See Also:
I_CmsTextExtractor.extractText(java.io.InputStream)
public I_CmsExtractionResult extractText(InputStream in, String encoding)
throws Exception
Description copied from interface: I_CmsTextExtractor
The encoding is a hint for the text extractor; if the value given is null, the text extractor should try to figure out the encoding itself.
Specified by:
extractText in interface I_CmsTextExtractor
Parameters:
in - the input stream for the document to extract the text from
encoding - the encoding to use
Throws:
Exception - if the text extraction fails
See Also:
I_CmsTextExtractor.extractText(java.io.InputStream, java.lang.String)
public InputStream getStreamCopy(InputStream in)
throws IOException
Parameters:
in - the input stream to copy
Throws:
IOException - in case of read errors from the original input stream
protected String removeControlChars(String content)
Parameters:
content - the content to remove the unwanted control chars from
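Which control chars count as "unwanted" is not stated on this page. The sketch below shows one plausible reading, assuming that every ISO control character except tab, line feed, and carriage return is replaced by a space so that word boundaries survive. The class name ControlCharFilter and the exact rule are assumptions for illustration; the actual OpenCms method may behave differently.

```java
/** Hypothetical helper illustrating control character cleanup; not the OpenCms implementation. */
final class ControlCharFilter {

    private ControlCharFilter() {
        // static helper only
    }

    /** Replaces ISO control characters, except tab, line feed, and carriage return, with a space. */
    static String removeControlChars(String content) {
        if (content == null || content.isEmpty()) {
            return content;
        }
        StringBuilder result = new StringBuilder(content.length());
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            boolean keep = !Character.isISOControl(c) || c == '\t' || c == '\n' || c == '\r';
            result.append(keep ? c : ' ');
        }
        return result.toString();
    }
}
```

Replacing rather than deleting the characters keeps two words from being merged into one when a stray control character was the only separator between them.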