<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Linux für alle &#187; xsane</title>
	<atom:link href="http://www.simplylinux.ch/tag/xsane/feed" rel="self" type="application/rss+xml" />
	<link>http://www.simplylinux.ch</link>
	<description>Anyone can learn to master Linux...</description>
	<lastBuildDate>Sat, 19 Nov 2011 17:37:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Text Recognition with XSane and Tesseract</title>
		<link>http://www.simplylinux.ch/texterkennung-mit-xsane-und-tesseract</link>
		<comments>http://www.simplylinux.ch/texterkennung-mit-xsane-und-tesseract#comments</comments>
		<pubDate>Thu, 06 May 2010 11:40:37 +0000</pubDate>
		<dc:creator>hyper_ch</dc:creator>
				<category><![CDATA[Debian]]></category>
		<category><![CDATA[dergringo]]></category>
		<category><![CDATA[Desktops]]></category>
		<category><![CDATA[KDE]]></category>
		<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[Xfce]]></category>
		<category><![CDATA[ocr]]></category>
		<category><![CDATA[tesseract]]></category>
		<category><![CDATA[xsane]]></category>

		<guid isPermaLink="false">http://www.simplylinux.ch/?p=341</guid>
		<description><![CDATA[There are several tools for text recognition (OCR) on Linux. In my opinion, the easiest approach is still to scan the documents with XSane, which can then pass the scanned documents on to various OCR programs. I have used both GOCR and Ocrad, but concluded that Tesseract works most reliably for German texts. Required [...]]]></description>
			<content:encoded><![CDATA[<p>There are several tools for text recognition (OCR) on Linux. In my opinion, the easiest approach is still to scan the documents with XSane, which can then pass the scanned documents on to various OCR programs. I have used both GOCR and Ocrad, but concluded that Tesseract works most reliably for German texts.</p>
<p><span id="more-341"></span></p>
<h2>Installing the required packages</h2>
<p>First, install all the required packages:<br />
<div id="wpshdo_1" class="wp-synhighlighter-outer"><div id="wpshdi_1" class="wp-synhighlighter-inner" style="display: block;"><pre class="bash" style="font-family:monospace;"><span class="kw2">sudo</span> <span class="kw2">apt-get</span> <span class="kw2">install</span> xsane sane sane-utils imagemagick tesseract-ocr tesseract-ocr-deu</pre></div></div></p>
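<p>After the installation, a quick check like the following can confirm that the programs used later are actually on the PATH. This is only a sketch: the helper name <code>missing_tools</code> is made up for this example, and the command names (<code>xsane</code>, <code>scanimage</code>, <code>convert</code>, <code>tesseract</code>) are assumed to be what the packages above provide.</p>

```shell
# Print every command from the argument list that cannot be found on the PATH.
# "missing_tools" is a hypothetical helper, not part of any package.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Commands assumed to come from xsane, sane-utils, imagemagick and tesseract-ocr.
missing_tools xsane scanimage convert tesseract
```

<p>If the call prints nothing, all four commands were found; otherwise it lists the ones that are still missing.</p>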
<h2>Installing XSane2Tess</h2>
<p>To use Tesseract directly from XSane, a wrapper script has to be installed:<br />
<div id="wpshdo_2" class="wp-synhighlighter-outer"><div id="wpshdi_2" class="wp-synhighlighter-inner" style="display: block;"><pre class="bash" style="font-family:monospace;"><span class="kw2">sudo</span> <span class="kw2">touch</span> <span class="sy0">/</span>usr<span class="sy0">/</span>local<span class="sy0">/</span>bin<span class="sy0">/</span>xsane2tess
<span class="kw2">sudo</span> <span class="kw2">chmod</span> 0755 <span class="sy0">/</span>usr<span class="sy0">/</span>local<span class="sy0">/</span>bin<span class="sy0">/</span>xsane2tess
<span class="kw2">sudo</span> <span class="kw2">nano</span> <span class="sy0">/</span>usr<span class="sy0">/</span>local<span class="sy0">/</span>bin<span class="sy0">/</span>xsane2tess</pre></div></div><br />
Then paste in the following content:<br />
<div id="wpshdo_3" class="wp-synhighlighter-outer"><div id="wpshdi_3" class="wp-synhighlighter-inner" style="display: block;"><pre class="bash" style="font-family:monospace;"><span class="co0">#!/bin/bash</span>
<span class="co0">#</span>
<span class="co0">#</span>
<span class="co0">##############################################################################</span>
<span class="co0">#</span>
<span class="co0">#                                   xsane2tess 1.0</span>
<span class="co0">#</span>
<span class="co0">#                          *** tesseract made simple ***</span>
<span class="co0">#</span>
<span class="co0">#</span>
<span class="co0">##############################################################################</span>
<span class="co0">#</span>
<span class="co0"># xsane2tess is a TesseractOCR wrapper to be able to use tesseract with xsane</span>
<span class="co0">#</span>
<span class="co0">#</span>
<span class="co0">#</span>
<span class="re2">TEMP_DIR</span>=~<span class="sy0">/</span>tmp<span class="sy0">/</span>      <span class="co0"># folder for temporary files (TIFF &amp; tesseract data)</span>
<span class="re2">ERRORLOG</span>=<span class="st0">&quot;xsane2tess.log&quot;</span>  <span class="co0"># file where STDERR goes </span>
<span class="kw1">if</span> <span class="br0">&#91;</span><span class="br0">&#91;</span> <span class="re5">-z</span> <span class="st0">&quot;$1&quot;</span>  <span class="br0">&#93;</span><span class="br0">&#93;</span>
  <span class="kw1">then</span>
  <span class="kw3">echo</span> <span class="st0">&quot;Usage: $0 [OPTIONS]
  xsane2tess converts files to TIF, scans them with TesseractOCR
  and outputs the text in a file.
  OPTIONS:
    -i   define input file (any image-format supported)
    -o   define output-file (*.txt)
    -l   define language-data tesseract should use
  Progress- &amp; error-messages will be stored in this logfile:
     <span class="es2">$TEMP_DIR</span><span class="es2">$ERRORLOG</span>
  xsane2tess depends on
    - ImageMagick  http://www.imagemagick.org/
    - TesseractOCR http://code.google.com/p/tesseract-ocr/
  Some coding was stolen from 'ocube'
http://www.geocities.com/thierryguy/ocube.html
&quot;</span>
  <span class="kw3">exit</span>
<span class="kw1">fi</span>
<span class="co0"># get options...</span>
<span class="kw1">while</span> <span class="kw3">getopts</span> <span class="st0">&quot;:i:o:l:&quot;</span> OPTION
  <span class="kw1">do</span>
  <span class="kw1">case</span> <span class="re1">$OPTION</span> <span class="kw1">in</span>
    i<span class="br0">&#41;</span>  <span class="co0"># input filename (with path)</span>
      <span class="re2">FILE_PATH</span>=<span class="st0">&quot;<span class="es2">$OPTARG</span>&quot;</span>
    <span class="sy0">;;</span>
    o <span class="br0">&#41;</span>  <span class="co0"># output filename</span>
      <span class="re2">FILE_OUT</span>=<span class="st0">&quot;<span class="es2">$OPTARG</span>&quot;</span>
    <span class="sy0">;;</span>
    l <span class="br0">&#41;</span>  <span class="co0"># Language-selection</span>
      <span class="re2">TES_LANG</span>=<span class="st0">&quot;<span class="es2">$OPTARG</span>&quot;</span>
    <span class="sy0">;;</span>
  <span class="kw1">esac</span>
<span class="kw1">done</span>
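<span class="co0"># make sure the folder for temporary files exists</span>
<span class="kw2">mkdir</span> <span class="re5">-p</span> <span class="st0">&quot;<span class="es2">$TEMP_DIR</span>&quot;</span>
<span class="co0"># redirect STDOUT to FILE_OUT (quoted so paths with spaces work)</span>
<span class="kw3">exec</span> 1<span class="sy0">&gt;&gt;</span><span class="st0">&quot;<span class="es2">$FILE_OUT</span>&quot;</span>
<span class="co0"># redirect STDERR to ERRORLOG</span>
<span class="kw3">exec</span> 2<span class="sy0">&gt;&gt;</span><span class="st0">&quot;<span class="es2">$TEMP_DIR</span><span class="es2">$ERRORLOG</span>&quot;</span>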
<span class="co0"># strip path from FILE_PATH, use filename only</span>
<span class="re2">IN_FILE</span>=<span class="co1">${FILE_PATH##*/}</span>
<span class="re2">TIF_FILE</span>=<span class="st0">&quot;<span class="es2">$TEMP_DIR</span>&quot;</span><span class="st0">&quot;<span class="es3">${IN_FILE%.*}</span>&quot;</span>.tif
<span class="re2">TXT_FILE</span>=<span class="st0">&quot;<span class="es2">$TEMP_DIR</span>&quot;</span><span class="st0">&quot;<span class="es3">${IN_FILE%.*}</span>&quot;</span>
<span class="co0"># converting image into TIFF (ImageMagick)</span>
convert <span class="st0">&quot;<span class="es2">$FILE_PATH</span>&quot;</span> <span class="re5">-compress</span> none  <span class="st0">&quot;<span class="es2">$TIF_FILE</span>&quot;</span> <span class="nu0">1</span><span class="sy0">&gt;&amp;</span><span class="nu0">2</span>
<span class="co0"># start OCR (tesseract expands output with *.txt)</span>
tesseract <span class="st0">&quot;<span class="es2">$TIF_FILE</span>&quot;</span> <span class="st0">&quot;<span class="es2">$TXT_FILE</span>&quot;</span> <span class="re5">-l</span> <span class="st0">&quot;<span class="es2">$TES_LANG</span>&quot;</span> <span class="nu0">1</span><span class="sy0">&gt;&amp;</span><span class="nu0">2</span>
<span class="co0"># STDOUT scanned text =&gt; FILE_OUT</span>
<span class="kw2">cat</span> <span class="st0">&quot;<span class="es2">$TXT_FILE</span>&quot;</span>.txt
<span class="co0"># delete graphic file after use</span>
<span class="kw2">rm</span> <span class="st0">&quot;<span class="es2">$TIF_FILE</span>&quot;</span>
<span class="co0"># delete tesseract output</span>
<span class="kw2">rm</span> <span class="st0">&quot;<span class="es2">$TXT_FILE</span>&quot;</span>.txt</pre></div></div><br />
A current version of the script is available <a href='http://doc.ubuntu-fr.org/xsane2tess' target='_blank'>here</a>.</p>
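<p>The script derives its temporary file names from the input path with two parameter expansions. The following sketch applies the same expansions to a made-up path; the directory and file name are assumptions chosen only for illustration.</p>

```shell
# Hypothetical values, used only to show what the expansions produce.
TEMP_DIR="/home/user/tmp/"
FILE_PATH="/home/user/scans/page 01.pnm"

# "##*/" removes the longest prefix ending in "/", leaving the bare file name.
IN_FILE=${FILE_PATH##*/}
# "%.*" removes the shortest suffix starting with ".", i.e. the extension.
TIF_FILE="$TEMP_DIR${IN_FILE%.*}.tif"
TXT_FILE="$TEMP_DIR${IN_FILE%.*}"

printf '%s\n' "$IN_FILE" "$TIF_FILE" "$TXT_FILE"
```

<p>Tesseract appends <code>.txt</code> to the output base name itself, which is why the script reads and deletes <code>"$TXT_FILE".txt</code> at the end.</p>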
<h2>Configuring XSane</h2>
<p>Once the wrapper script is &#8220;installed&#8221;, start XSane. Then open Setup under Preferences (or press Alt+s) and switch to the OCR tab.</p>
<p>Enter the following as the OCR Command:<br />
<div id="wpshdo_4" class="wp-synhighlighter-outer"><div id="wpshdi_4" class="wp-synhighlighter-inner" style="display: block;"><pre class="bash" style="font-family:monospace;">xsane2tess <span class="re5">-l</span> deu</pre></div></div></p>
<p>Enter the following as the Input File option:<br />
<div id="wpshdo_5" class="wp-synhighlighter-outer"><div id="wpshdi_5" class="wp-synhighlighter-inner" style="display: block;"><pre class="bash" style="font-family:monospace;"><span class="re5">-i</span></pre></div></div></p>
<p>Enter the following as the Output File option:<br />
<div id="wpshdo_6" class="wp-synhighlighter-outer"><div id="wpshdi_6" class="wp-synhighlighter-inner" style="display: block;"><pre class="bash" style="font-family:monospace;"><span class="re5">-o</span></pre></div></div></p>
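<p>To see how the three fields fit together, here is a rough sketch of the command line that results. It assumes that XSane builds the call by appending the input option, the scanned image, the output option and the target text file to the OCR Command, in that order; both file names are hypothetical placeholders for the ones XSane supplies at run time.</p>

```shell
# The three values entered in the OCR tab.
OCR_COMMAND="xsane2tess -l deu"
INPUT_OPTION="-i"
OUTPUT_OPTION="-o"

# Hypothetical file names; XSane fills in the real ones.
IMAGE="/tmp/xsane-scan.pnm"
TEXT="/home/user/scan.txt"

# Assumed order: command, input option, image, output option, text file.
CMD="$OCR_COMMAND $INPUT_OPTION $IMAGE $OUTPUT_OPTION $TEXT"
printf '%s\n' "$CMD"
```

<p>Running the printed command by hand in a terminal is a simple way to test the wrapper script independently of XSane.</p>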
<p>Besides German, Tesseract provides language data for other languages as well. Install the relevant package, then change the OCR Command in the XSane setup to use that language.</p>
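<p>As a sketch of switching languages: the helper below prints the install command and the matching OCR Command for a given language code. It assumes that the other language packs follow the same <code>tesseract-ocr-&lt;code&gt;</code> naming pattern as <code>tesseract-ocr-deu</code>; the function name <code>ocr_setup_for</code> and the example code <code>fra</code> are chosen for illustration only, so check the package list of your distribution first.</p>

```shell
# Print the install command and the XSane OCR Command for one language code.
# "ocr_setup_for" is a hypothetical helper; the package name pattern is an assumption.
ocr_setup_for() {
  printf 'sudo apt-get install tesseract-ocr-%s\n' "$1"
  printf 'xsane2tess -l %s\n' "$1"
}

ocr_setup_for fra
```

<p>The first printed line is what you would run in a terminal; the second is what goes into the OCR Command field.</p>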
]]></content:encoded>
			<wfw:commentRss>http://www.simplylinux.ch/texterkennung-mit-xsane-und-tesseract/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

