Package org.htmlcleaner
Class HtmlCleaner
- java.lang.Object
-
- org.htmlcleaner.HtmlCleaner
-
public class HtmlCleaner extends java.lang.Object
Main HtmlCleaner class. It represents the public interface to the user. Its task is to call the tokenizer with the specified source HTML, traverse the list of produced tokens, and create the internal object model. It also offers a set of methods to write the resulting XML to a string, file, or any output stream.
Typical usage is the following:
    // create an instance of HtmlCleaner
    HtmlCleaner cleaner = new HtmlCleaner();

    // take default cleaner properties
    CleanerProperties props = cleaner.getProperties();

    // customize cleaner's behavior with property setters
    props.setXXX(...);

    // Clean HTML taken from simple string, file, URL, input stream,
    // input source or reader. Result is root node of created
    // tree-like structure. Single cleaner instance may be safely used
    // multiple times.
    TagNode node = cleaner.clean(...);

    // optionally find parts of the DOM or modify some nodes
    TagNode[] myNodes = node.getElementsByXXX(...);
    // and/or
    Object[] myNodes = node.evaluateXPath(xPathExpression);
    // and/or
    aNode.removeFromTree();
    // and/or
    aNode.addAttribute(attName, attValue);
    // and/or
    aNode.removeAttribute(attName, attValue);
    // and/or
    cleaner.setInnerHtml(aNode, htmlContent);
    // and/or do some other tree manipulation/traversal

    // serialize a node to a file, output stream, DOM, JDom...
    new XXXSerializer(props).writeXmlXXX(aNode, ...);
    myJDom = new JDomSerializer(props, true).createJDom(aNode);
    myDom = new DomSerializer(props, true).createDOM(aNode);
-
-
Nested Class Summary
Nested Classes
protected class HtmlCleaner.NestingState
-
Constructor Summary
Constructors
HtmlCleaner()
    Constructor - creates cleaner instance with default tag info provider, default version and default properties.
HtmlCleaner(CleanerProperties properties)
    Constructor - creates the instance with default tag info provider and specified properties.
HtmlCleaner(ITagInfoProvider tagInfoProvider)
    Constructor - creates the instance with specified tag info provider and default properties.
HtmlCleaner(ITagInfoProvider tagInfoProvider, CleanerProperties properties)
    Constructor - creates the instance with specified tag info provider and specified properties.
-
Method Summary
Methods
protected void addPruneNode(TagNode node, org.htmlcleaner.CleanTimeValues cleanTimeValues)
TagNode clean(java.io.File file)
TagNode clean(java.io.File file, java.lang.String charset)
TagNode clean(java.io.InputStream in)
TagNode clean(java.io.InputStream in, java.lang.String charset)
TagNode clean(java.io.Reader reader)
protected TagNode clean(java.io.Reader reader, org.htmlcleaner.CleanTimeValues cleanTimeValues)
    Basic version of the cleaning call.
TagNode clean(java.lang.String htmlContent)
TagNode clean(java.net.URL url)
    Creates instance from the content downloaded from specified URL.
TagNode clean(java.net.URL url, java.lang.String charset)
    Deprecated.
protected java.util.Set<ITagNodeCondition> getAllowTagSet(org.htmlcleaner.CleanTimeValues cleanTimeValues)
protected java.util.Set<java.lang.String> getAllTags(org.htmlcleaner.CleanTimeValues cleanTimeValues)
java.lang.String getInnerHtml(TagNode node)
    For the specified node, returns its content as a string.
CleanerProperties getProperties()
protected java.util.Set<ITagNodeCondition> getPruneTagSet(org.htmlcleaner.CleanTimeValues cleanTimeValues)
ITagInfoProvider getTagInfoProvider()
CleanerTransformations getTransformations()
void initCleanerTransformations(java.util.Map transInfos)
protected boolean isRemovingNodeReasonablySafe(TagNode startTagToken)
void setInnerHtml(TagNode node, java.lang.String content)
    For the specified tag node, defines its HTML content.
-
-
-
Constructor Detail
-
HtmlCleaner
public HtmlCleaner()
Constructor - creates cleaner instance with default tag info provider, default version and default properties.
-
HtmlCleaner
public HtmlCleaner(ITagInfoProvider tagInfoProvider)
Constructor - creates the instance with specified tag info provider and default properties.
- Parameters:
tagInfoProvider - Provider for tag filtering and balancing
-
HtmlCleaner
public HtmlCleaner(CleanerProperties properties)
Constructor - creates the instance with default tag info provider and specified properties.
- Parameters:
properties - Properties used during parsing and serializing
-
HtmlCleaner
public HtmlCleaner(ITagInfoProvider tagInfoProvider, CleanerProperties properties)
Constructor - creates the instance with specified tag info provider and specified properties.
- Parameters:
tagInfoProvider - Provider for tag filtering and balancing
properties - Properties used during parsing and serializing
-
-
Method Detail
-
clean
public TagNode clean(java.lang.String htmlContent)
-
clean
public TagNode clean(java.io.File file, java.lang.String charset) throws java.io.IOException
- Throws:
java.io.IOException
-
clean
public TagNode clean(java.io.File file) throws java.io.IOException
- Throws:
java.io.IOException
-
clean
@Deprecated public TagNode clean(java.net.URL url, java.lang.String charset) throws java.io.IOException
Deprecated. Unmanaged network I/O does not handle proxies, slow servers, or broken connections well. Callers of HtmlCleaner should manage the connections themselves and provide the library only with a stream.
- Parameters:
url
charset
- Throws:
java.io.IOException
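
The deprecation note asks callers to own the connection and pass only a stream to the cleaner. One way this might look is sketched below using only java.net. ManagedFetch, StreamHandler, fetch, and the timeout parameters are hypothetical names introduced for illustration; they are not part of HtmlCleaner.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper, not part of HtmlCleaner: the caller configures and owns
// the connection, and only the resulting stream is handed to a consumer.
final class ManagedFetch {

    /** Callback that consumes the stream and may throw IOException. */
    interface StreamHandler<T> {
        T handle(InputStream in) throws IOException;
    }

    static <T> T fetch(URL url, int connectTimeoutMs, int readTimeoutMs,
                       StreamHandler<T> handler) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(connectTimeoutMs); // give up on unreachable hosts
        conn.setReadTimeout(readTimeoutMs);       // give up on stalled servers
        // try-with-resources closes the stream even if the handler throws
        try (InputStream in = conn.getInputStream()) {
            return handler.handle(in);
        }
    }
}
```

With HtmlCleaner on the classpath, the handler could be something like `in -> cleaner.clean(in, "UTF-8")`, which uses the clean(java.io.InputStream in, java.lang.String charset) overload listed in this class. Proxy selection, retries, and redirects would also stay on the caller's side under this arrangement.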
-
clean
public TagNode clean(java.net.URL url) throws java.io.IOException
Creates instance from the content downloaded from the specified URL. The HTML encoding is resolved by trying the following in sequence:
1. reading the Content-Type response header,
2. analyzing META tags at the beginning of the HTML,
3. using the platform's default charset.
- Parameters:
url
- Throws:
java.io.IOException
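
The three-step encoding fallback described for clean(java.net.URL url) can be illustrated with a small stand-alone sketch. This is not HtmlCleaner's actual implementation: CharsetResolution, resolve, and the regular expressions are assumptions made for illustration, and the real library may inspect headers and META tags differently.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the documented fallback order; not library code.
final class CharsetResolution {

    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:+-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META =
            Pattern.compile("<meta[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Header first, then META tags in the leading HTML, then the platform default. */
    static Charset resolve(String contentTypeHeader, String htmlHead) {
        Charset fromHeader = find(contentTypeHeader);          // step 1
        if (fromHeader != null) {
            return fromHeader;
        }
        if (htmlHead != null) {                                // step 2
            Matcher meta = META.matcher(htmlHead);
            while (meta.find()) {
                Charset fromMeta = find(meta.group());
                if (fromMeta != null) {
                    return fromMeta;
                }
            }
        }
        return Charset.defaultCharset();                       // step 3
    }

    /** Returns the first supported charset named in the text, or null if none. */
    private static Charset find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        while (m.find()) {
            try {
                if (Charset.isSupported(m.group(1))) {
                    return Charset.forName(m.group(1));
                }
            } catch (IllegalArgumentException e) {
                // syntactically illegal charset name: keep looking
            }
        }
        return null;
    }
}
```

Because the last step depends on the platform, the same page may decode differently on two machines when neither the header nor a META tag names a charset; passing an explicit charset to one of the stream-based clean overloads avoids that.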
-
clean
public TagNode clean(java.io.InputStream in, java.lang.String charset) throws java.io.IOException
- Throws:
java.io.IOException
-
clean
public TagNode clean(java.io.InputStream in) throws java.io.IOException
- Throws:
java.io.IOException
-
clean
public TagNode clean(java.io.Reader reader) throws java.io.IOException
- Throws:
java.io.IOException
-
clean
protected TagNode clean(java.io.Reader reader, org.htmlcleaner.CleanTimeValues cleanTimeValues) throws java.io.IOException
Basic version of the cleaning call.
- Parameters:
reader - (not closed)
- Returns:
- An instance of TagNode object which is the root of the XML tree.
- Throws:
java.io.IOException
-
isRemovingNodeReasonablySafe
protected boolean isRemovingNodeReasonablySafe(TagNode startTagToken)
- Parameters:
startTagToken
- Returns:
- true if no id attribute or class attribute
-
getProperties
public CleanerProperties getProperties()
-
getPruneTagSet
protected java.util.Set<ITagNodeCondition> getPruneTagSet(org.htmlcleaner.CleanTimeValues cleanTimeValues)
-
getAllowTagSet
protected java.util.Set<ITagNodeCondition> getAllowTagSet(org.htmlcleaner.CleanTimeValues cleanTimeValues)
-
addPruneNode
protected void addPruneNode(TagNode node, org.htmlcleaner.CleanTimeValues cleanTimeValues)
-
getAllTags
protected java.util.Set<java.lang.String> getAllTags(org.htmlcleaner.CleanTimeValues cleanTimeValues)
-
getTagInfoProvider
public ITagInfoProvider getTagInfoProvider()
- Returns:
- ITagInfoProvider instance for this HtmlCleaner
-
getTransformations
public CleanerTransformations getTransformations()
- Returns:
- Transformations defined for this instance of cleaner
-
getInnerHtml
public java.lang.String getInnerHtml(TagNode node)
For the specified node, returns its content as a string.
- Parameters:
node
- Returns:
- node's content as string
-
setInnerHtml
public void setInnerHtml(TagNode node, java.lang.String content)
For the specified tag node, defines its HTML content. This causes the cleaner to re-clean the given HTML portion and insert it inside the node in place of the previous content.
- Parameters:
node
content
-
initCleanerTransformations
public void initCleanerTransformations(java.util.Map transInfos)
- Parameters:
transInfos
-
-
-