<?xml version="1.0" encoding="utf-8"?>
<!-- generator="FeedCreator 1.7.2-ppt DokuWiki" -->
<?xml-stylesheet href="http://bithack.se/projects/methabot/lib/exe/css.php?s=feed" type="text/css"?>
<rss version="2.0">
    <channel>
        <title>The Methabot Project</title>
        <description></description>
        <link>http://bithack.se/projects/methabot/</link>
        <lastBuildDate>Tue, 09 Mar 2010 22:49:55 +0100</lastBuildDate>
        <generator>FeedCreator 1.7.2-ppt DokuWiki</generator>
        <image>
            <url>http://bithack.se/projects/methabot/lib/images/favicon.ico</url>
            <title>The Methabot Project</title>
            <link>http://bithack.se/projects/methabot/</link>
        </image>
        <item>
            <title>index</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=index&amp;rev=1245767569&amp;do=diff</link>
            <description>Methabot is an open source web crawler and command line tool optimized for speed. It supports scripted filetype parsing and a wide variety of customization options, and is easily configured to fit anyone's particular needs.

WEBSITE MOVED: This project has moved to a new website: &lt;http://metha-sys.org/&gt;</description>
            <author>sdac</author>
            <pubDate>Tue, 23 Jun 2009 16:32:49 +0100</pubDate>
        </item>
        <item>
            <title>docs:methanol:commands</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=docs:methanol:commands&amp;rev=1236958308&amp;do=diff</link>
            <description>Master =&gt; Slave

CLIENT

The slave should generate a login token for the given client. Command syntax:



CLIENT 0.0.0.0\n



Possible return values:
100 &lt;token&gt;  A token was registered and the client may log in using the token.
400  The slave failed/refused to register the client.

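A minimal sketch of how a master might build the CLIENT command and read the slave's reply. The function names are hypothetical, and the sketch assumes that the argument is the client's address and that the reply is a single line with the status code and token separated by one space:

```python
# Hypothetical helpers, not part of Methabot; they only illustrate the
# CLIENT command and its two documented return values.

def build_client_command(address):
    # Command syntax from the text above: CLIENT, a space, the address, newline.
    return "CLIENT " + address + "\n"


def parse_client_reply(line):
    # Split the status code from an optional token (assumed space separated).
    parts = line.strip().split(" ", 1)
    code = int(parts[0])
    # Only a 100 reply carries a token; a 400 reply has none.
    token = parts[1] if code == 100 and len(parts) == 2 else None
    return code, token
```

A reply of 400 yields no token, so the caller can treat a missing token as a refused registration.
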
CONFIG

Update the active configuration. The slave should not reply to the command.</description>
            <author>sdac</author>
        <category>docs:methanol</category>
            <pubDate>Fri, 13 Mar 2009 16:31:48 +0100</pubDate>
        </item>
        <item>
            <title>docs:methanol:status_codes</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=docs:methanol:status_codes&amp;rev=1235741360&amp;do=diff</link>
            <description>General

1xx - Success
100  Successful command/login.
101  Successful command/login, but the client should wait for further information before proceeding.
2xx - Client Error
200  Denied
201  Bad Request
202  Login type unavailable
3xx - Server Error
300  Internal Error
301  Temporarily Unavailable
Master &lt;-&gt; Slave

400 - Refuse
400  The slave refused to register the client, possibly due to max_clients being reached.
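
The classes above can be sketched as a small lookup. The names are hypothetical, and the sketch treats 400 as a single code rather than a whole class, since only that one code is listed:

```python
# Hypothetical lookup, not part of Methabot: maps a numeric status code
# to the class names used in the list above.
GENERAL_CLASSES = {
    1: "Success",
    2: "Client Error",
    3: "Server Error",
}


def classify_status(code):
    """Return the class name for a status code, or None if it is not listed."""
    if code == 400:
        # Master/slave specific: the slave refused to register the client.
        return "Refuse"
    return GENERAL_CLASSES.get(code // 100)
```

Codes outside the listed ranges return None so that a caller can decide how to handle unknown replies.
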
Slave &lt;-&gt; Client</description>
            <author>sdac</author>
        <category>docs:methanol</category>
            <pubDate>Fri, 27 Feb 2009 14:29:20 +0100</pubDate>
        </item>
        <item>
            <title>faq</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=faq&amp;rev=1235680557&amp;do=diff</link>
            <description>*  General
	*  The Command Line Tool
	*  Configuration Files
	*  E4X Scripting


General

What is The Methabot Project and Methanol?

To make a long story short, the Methabot project is an open source project spanning four child projects: the Methabot command line utility, a web crawling library (libmetha), a web crawling daemon, and a search engine server. The name Methanol applies whenever the client daemon is used in combinatio…</description>
            <author>sdac</author>
            <pubDate>Thu, 26 Feb 2009 21:35:57 +0100</pubDate>
        </item>
        <item>
            <title>docs:index</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=docs:index&amp;rev=1235433466&amp;do=diff</link>
            <description>This page is an index of documents about Methabot, its internals, and how to configure and script it. If you are a beginner, you will most likely want to start with the getting started page. You might also want to have a look at the How-to index.</description>
            <author>sdac</author>
        <category>docs</category>
            <pubDate>Tue, 24 Feb 2009 00:57:46 +0100</pubDate>
        </item>
        <item>
            <title>docs:builtin_parsers</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=docs:builtin_parsers&amp;rev=1235432990&amp;do=diff</link>
            <description>This page applies to libmetha/1.6.0+

How to bind your filetype to any of these parsers:



filetype[&quot;your_filetype&quot;]
{
    /* replace &quot;name&quot; with the name of the parser you would like to bind */
    parser = &quot;name&quot;;
}


html

Default HTML parser. Extracts links from HTML elements by looking at href and src attributes. This parser will also invoke the CSS parser when a style element is found, and the text parser for extracting links in plain text.</description>
            <author>sdac</author>
        <category>docs</category>
            <pubDate>Tue, 24 Feb 2009 00:49:50 +0100</pubDate>
        </item>
        <item>
            <title>docs:configuration_file_reference</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=docs:configuration_file_reference&amp;rev=1235432890&amp;do=diff</link>
            <description>Directives

include

Include another configuration file. Run mb --info to get information about where libmetha will search for the file. You may also specify a full path.

Example:



include &quot;default.conf&quot;
include &quot;image.conf&quot;


Keywords

Added in libmetha/1.5.0.</description>
            <author>sdac</author>
        <category>docs</category>
            <pubDate>Tue, 24 Feb 2009 00:48:10 +0100</pubDate>
        </item>
        <item>
            <title>howto:jail_methabot_in_a_specific_directory</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=howto:jail_methabot_in_a_specific_directory&amp;rev=1235432116&amp;do=diff</link>
            <description>Support for the --jail option was added in libmetha/1.6.0

The --jail or -j option lets you jail Methabot in a specific directory. Let's say you are extracting image files from a website, but you don't want Methabot to go up a directory to other content. This is when jailing is useful. Consider the URL http://example.com/images/. To jail Methabot in the /images/ directory, run:</description>
            <author>sdac</author>
        <category>howto</category>
            <pubDate>Tue, 24 Feb 2009 00:35:16 +0100</pubDate>
        </item>
        <item>
            <title>howto:index</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=howto:index&amp;rev=1235431852&amp;do=diff</link>
            <description>You are welcome to contribute your own how-to guides! Go to the registration page or log in, and you will get edit permissions to this page and its sub-pages.


	*  How to: Create a how-to page

Topics

Command Line

	*  How to perform automatic downloading of image files
	*  How to Automate Downloading
	*  How to convert ugly HTML to nice XHTML
	*  How to search on YouTube
	*  How to jail Methabot in a specific directory</description>
            <author>sdac</author>
        <category>howto</category>
            <pubDate>Tue, 24 Feb 2009 00:30:52 +0100</pubDate>
        </item>
        <item>
            <title>docs:e4x:writing_your_first_parser</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=docs:e4x:writing_your_first_parser&amp;rev=1235431267&amp;do=diff</link>
            <description>E4X is a simple but powerful and flexible extension to JavaScript. This article will help you get started with writing parsers for Methabot using JavaScript and E4X. You should have some experience with JavaScript or another procedural programming/scripting language before working through this article.</description>
            <author>sdac</author>
        <category>docs:e4x</category>
            <pubDate>Tue, 24 Feb 2009 00:21:07 +0100</pubDate>
        </item>
        <item>
            <title>docs:handlers</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=docs:handlers&amp;rev=1235431143&amp;do=diff</link>
            <description>Handlers are called before the parsers. The purpose of a handler is to download the data referenced by the target URL. The handler can, but should not, modify the downloaded data.

A handler should make a decision whether the data should continue to the parser chain or be discarded, and return true or false.</description>
            <author>sdac</author>
        <category>docs</category>
            <pubDate>Tue, 24 Feb 2009 00:19:03 +0100</pubDate>
        </item>
        <item>
            <title>docs:parser_chaining</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=docs:parser_chaining&amp;rev=1235430981&amp;do=diff</link>
            <description>Parser chaining is a concept introduced in libmetha-1.6.0. It allows multiple parsers to work on the same data.

Overview


data -&gt; [ parser_1 -&gt; parser_2 -&gt; parser_3 ] -&gt; output_data
             |           |           |
             v           v           v
        +----------------------------------+
        |        List of found URLs        |
        +----------------------------------+</description>
            <author>sdac</author>
        <category>docs</category>
            <pubDate>Tue, 24 Feb 2009 00:16:21 +0100</pubDate>
        </item>
        <item>
            <title>docs:configuration_file_basics</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=docs:configuration_file_basics&amp;rev=1235429746&amp;do=diff</link>
            <description>This page will help you get started with writing basic configuration files.
FIXME This page is a stub.

Basic Concepts

Currently there are two kinds of objects you can define in configuration files: crawlers and filetypes. Crawlers specify crawling behaviour, while each filetype specifies properties for a kind of file, such as audio files or HTML files. A basic configuration file needs at least one crawler and one filetype. The crawler should be named “default”, but the filetype can be na…</description>
            <author>sdac</author>
        <category>docs</category>
            <pubDate>Mon, 23 Feb 2009 23:55:46 +0100</pubDate>
        </item>
        <item>
            <title>docs:modules:lmm_mysql:index</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=docs:modules:lmm_mysql:index&amp;rev=1235427551&amp;do=diff</link>
            <description>The module lmm_mysql was added in Methabot/1.6.0. It binds native MySQL API functions to an object-oriented JavaScript interface. This module is at an early stage, but the implemented functions are stable and usable. Below is a complete list of the supported functions.</description>
            <author>sdac</author>
        <category>docs:modules:lmm_mysql</category>
            <pubDate>Mon, 23 Feb 2009 23:19:11 +0100</pubDate>
        </item>
        <item>
            <title>download</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=download&amp;rev=1235425667&amp;do=diff</link>
            <description>Have a look at this page if you need help with building Methabot.

Latest Methabot release is: Methabot/1.6.0.1

Source Code Packages

	*  Methabot/1.6.0.1, Feb 23 2009, Release notes
	*  Methabot/1.6.0, Feb 21 2009, Release notes
	*  Methabot/1.5.0, Jan 15 2009, Release notes
	*  Methabot/1.4.1, Jan 2 2009, Release notes
	*  Methabot/1.4.0, Dec 24 2008, Release notes</description>
            <author>sdac</author>
            <pubDate>Mon, 23 Feb 2009 22:47:47 +0100</pubDate>
        </item>
        <item>
            <title>docs:e4x:this</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=docs:e4x:this&amp;rev=1235222320&amp;do=diff</link>
            <description>Every time your parser function is called, an object called “this” will be filled in with some useful and informative values for you to parse.

Child elements
status_code  A number value depending on the used protocol and what value the server returned. For HTTP, this value should be 200 on success. Added in libmetha/1.5.0.
content_type  String identifying the content of the data in this.data. For HTTP, this is equal to the value returned in the Content-Type header from the server. Added in libme…</description>
            <author>sdac</author>
        <category>docs:e4x</category>
            <pubDate>Sat, 21 Feb 2009 14:18:40 +0100</pubDate>
        </item>
        <item>
            <title>docs:robotstxt</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=docs:robotstxt&amp;rev=1235130135&amp;do=diff</link>
            <description>Support for robots.txt was added in Methabot/1.6.0. The purpose of the robots.txt file is to allow or deny web crawler access to specific parts of a website. Below is a list of the directives supported by Methabot; any unknown directives are silently ignored.</description>
            <author>sdac</author>
        <category>docs</category>
            <pubDate>Fri, 20 Feb 2009 12:42:15 +0100</pubDate>
        </item>
        <item>
            <title>option_reference</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=option_reference&amp;rev=1234787033&amp;do=diff</link>
            <description>Short Option  Long Option     Parameters                  Description
-M            --mode          aggressive,friendly,coward  Set the amount of time Methabot should wait between all network communication. Default is aggressive.
-D            --depth-limit   (int)                       Decides how deep Methabot will crawl
-e            --external                                  If set, external URLs will not be discarded, temporarily disabled
-j            --jail                                      Restrict the crawling to only subfolders
              --spread                                    Spread w…</description>
            <author>pajlada</author>
            <pubDate>Mon, 16 Feb 2009 13:23:53 +0100</pubDate>
        </item>
        <item>
            <title>libmetha:api:lmetha_destroy - created</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=libmetha:api:lmetha_destroy&amp;rev=1234785214&amp;do=diff</link>
            <description>Free everything, clean up metha object

Synopsis


void lmetha_destroy(metha_t *m);


Usage

Return value</description>
            <author>pajlada</author>
        <category>libmetha:api</category>
            <pubDate>Mon, 16 Feb 2009 12:53:34 +0100</pubDate>
        </item>
        <item>
            <title>libmetha:api:lmetha_create</title>
            <link>http://bithack.se/projects/methabot/doku.php?id=libmetha:api:lmetha_create&amp;rev=1234785114&amp;do=diff</link>
            <description>Allocate a metha object

Synopsis


metha_t *lmetha_create(void);


Usage

Return value

0 on failure.

metha object on success.</description>
            <author>pajlada</author>
        <category>libmetha:api</category>
            <pubDate>Mon, 16 Feb 2009 12:51:54 +0100</pubDate>
        </item>
    </channel>
</rss>
