What is OCR and how does it work?

OCR stands for Optical Character Recognition. It works by analyzing the pixel content of each page image and identifying character shapes using pattern recognition models trained on large sets of text samples. The recognized characters are assembled into words and lines. This tool uses Tesseract running as a WebAssembly module in your browser, so no file is uploaded to any server during the process.

Is my file uploaded to a server for OCR?

No. The OCR engine runs entirely in your browser as a WebAssembly module. Your file is read locally by the browser File API and processed on your device. No data is transmitted to any server at any point.

Why is my scanned PDF not searchable?

A scanned PDF is created by photographing or scanning a physical page. The result is a raster image, a grid of pixels, with no underlying text data. PDF viewers render the image correctly so it looks like a normal document, but there is no text layer for the viewer to search or select. OCR reads the pixel content and identifies the characters, producing text that you can select, copy, and search.

Can OCR recognize handwritten text?

OCR is primarily designed for printed or typed text and is not reliable for handwriting. Handwritten characters vary significantly between individuals in shape, size, spacing, and slant, which makes accurate recognition much harder than for printed fonts. The tool may extract some handwritten words correctly, particularly if the writing is neat and consistent, but accuracy on handwritten documents is generally low.

How do I convert a scanned PDF to Word?

Run the scanned PDF through the OCR tool first to extract the text, then paste the extracted text directly into a Word document. You can also run the PDF to Word tool on the original scanned file, but that converter works best on text-based PDFs, so running OCR first gives you the raw text content you need.

Does OCR change the appearance of my PDF?

No. This tool extracts the recognized text and delivers it as a plain .txt file or clipboard copy. The original PDF is not modified in any way; only the extracted text is returned as output.

What languages does the OCR tool support?

The tool supports 19 languages: English, German, French, Spanish, Portuguese, Italian, Polish, Russian, Turkish, Japanese, Korean, Chinese (Simplified and Traditional), Arabic, Hindi, Indonesian, Malay, Vietnamese, and Thai. Select the language of your document from the dropdown before processing. Tesseract loads the language-specific trained data model, which significantly improves recognition accuracy compared to using the wrong language setting.
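As a rough illustration of how a language choice might map to a trained data model, the sketch below pairs each language listed above with its standard Tesseract language code. The codes themselves are the ones Tesseract uses for its `.traineddata` files; the map name, the `tessdataFileFor` helper, and the display labels are hypothetical, and the source does not show how this tool actually wires up its dropdown.

```typescript
// Hypothetical lookup table: display name -> standard Tesseract language code.
// Chinese is listed twice because Simplified and Traditional use separate models.
const OCR_LANGUAGE_CODES = new Map<string, string>([
  ["English", "eng"],
  ["German", "deu"],
  ["French", "fra"],
  ["Spanish", "spa"],
  ["Portuguese", "por"],
  ["Italian", "ita"],
  ["Polish", "pol"],
  ["Russian", "rus"],
  ["Turkish", "tur"],
  ["Japanese", "jpn"],
  ["Korean", "kor"],
  ["Chinese (Simplified)", "chi_sim"],
  ["Chinese (Traditional)", "chi_tra"],
  ["Arabic", "ara"],
  ["Hindi", "hin"],
  ["Indonesian", "ind"],
  ["Malay", "msa"],
  ["Vietnamese", "vie"],
  ["Thai", "tha"],
]);

// Hypothetical helper: returns the trained data file name for a display name.
// Throws for any language that is not in the table above.
function tessdataFileFor(language: string): string {
  const code = OCR_LANGUAGE_CODES.get(language);
  if (code === undefined) {
    throw new Error(`Unsupported OCR language: ${language}`);
  }
  return `${code}.traineddata`;
}
```

Counting Simplified and Traditional Chinese separately, the table has 19 entries, which matches the number of languages stated above.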

Can I run OCR on just some pages of a PDF?

Yes. The tool lets you specify which pages to process using the pages field. Enter individual page numbers separated by commas, or ranges using a hyphen, for example 1, 3, 5-7. Pages not included in the selection are skipped. This is useful for large documents where only certain pages are scanned images and you only need text from those specific pages.
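The page selection syntax above can be sketched as a small parser. This is a minimal illustration, not the tool's actual code: the function name `parsePageSelection`, the `pageCount` parameter, and the choice to reject out-of-range pages with an error are all assumptions.

```typescript
// Hypothetical helper: turns input such as "1, 3, 5-7" into a sorted list of
// unique 1-based page numbers. Entries are separated by commas; a range uses
// a hyphen. An empty input yields an empty list (the caller decides whether
// that means "all pages").
function parsePageSelection(input: string, pageCount: number): number[] {
  const selected = new Set<number>();
  for (const rawPart of input.split(",")) {
    const part = rawPart.trim();
    if (part === "") continue; // tolerate blank entries and trailing commas
    const match = /^(\d+)(?:\s*-\s*(\d+))?$/.exec(part);
    if (match === null) {
      throw new Error(`Invalid page entry: "${part}"`);
    }
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start || end > pageCount) {
      throw new Error(`Page entry out of range: "${part}"`);
    }
    for (let page = start; page <= end; page++) {
      selected.add(page);
    }
  }
  return Array.from(selected).sort((a, b) => a - b);
}

// Usage: parsePageSelection("1, 3, 5-7", 10) returns [1, 3, 5, 6, 7].
```

Collecting pages in a set means a page that appears twice, for example in overlapping ranges, is only processed once.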

Why is the OCR output missing some words or characters?

OCR accuracy depends on the quality of the source image. Common causes of missing or incorrect text include low scan resolution, skewed pages, faded ink, text that overlaps with images or watermarks, unusual fonts, and heavy JPEG compression artifacts. Scanning at 300 DPI or higher with good contrast between text and background produces the most accurate results. If recognition quality is poor, rescanning the original document at higher resolution before running OCR will give significantly better output.
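To check whether an existing scan meets the 300 DPI guideline, you can divide the image width in pixels by the physical page width in inches. The sketch below shows that arithmetic; the function names are hypothetical, and it assumes you know the physical size of the original page, which the image file alone may not tell you.

```typescript
// Hypothetical helper: effective resolution of a scan, in dots per inch,
// computed from the image width in pixels and the physical page width.
function estimateScanDpi(widthPx: number, pageWidthInches: number): number {
  if (!(widthPx > 0) || !(pageWidthInches > 0)) {
    throw new RangeError("Width in pixels and inches must both be positive");
  }
  return widthPx / pageWidthInches;
}

// Hypothetical helper: true when the scan reaches the recommended minimum.
// The 300 DPI default comes from the guideline in the text above.
function meetsRecommendedDpi(
  widthPx: number,
  pageWidthInches: number,
  minimumDpi: number = 300,
): boolean {
  return estimateScanDpi(widthPx, pageWidthInches) >= minimumDpi;
}

// Usage: a US Letter page is 8.5 inches wide, so a scan that is 2550 pixels
// wide works out to 2550 / 8.5 = 300 DPI, and one that is 1275 pixels wide
// works out to 150 DPI.
```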

Can I use OCR on an image file instead of a PDF?

Yes. The tool accepts JPG, PNG, GIF, and WebP image files in addition to PDF. When you select an image, Tesseract processes it directly and extracts the recognized text. This is useful for extracting text from photographs of documents, screenshots, or scanned pages that were saved as images rather than PDFs.
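One way a browser tool could tell these formats apart without trusting the file extension is to look at the first bytes of the file. The sketch below checks the well-known signatures for PDF, JPEG, PNG, GIF, and WebP. It is an illustration only: the names `detectFormat` and `startsWithBytes` are hypothetical, the source does not say how this tool identifies files, and in a browser the bytes would typically come from something like `new Uint8Array(await file.arrayBuffer())`.

```typescript
type DetectedFormat = "pdf" | "jpeg" | "png" | "gif" | "webp" | "unknown";

// True when `bytes` contains `signature` starting at `offset`.
function startsWithBytes(
  bytes: Uint8Array,
  signature: number[],
  offset: number = 0,
): boolean {
  if (bytes.length < offset + signature.length) return false;
  return signature.every((value, i) => bytes[offset + i] === value);
}

// Hypothetical helper: identifies a file by its leading signature bytes.
// Assumes the signature sits at the very start of the file.
function detectFormat(bytes: Uint8Array): DetectedFormat {
  // "%PDF-"
  if (startsWithBytes(bytes, [0x25, 0x50, 0x44, 0x46, 0x2d])) return "pdf";
  // JPEG start-of-image marker
  if (startsWithBytes(bytes, [0xff, 0xd8, 0xff])) return "jpeg";
  // PNG 8-byte signature
  if (startsWithBytes(bytes, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) {
    return "png";
  }
  // "GIF8" covers both GIF87a and GIF89a
  if (startsWithBytes(bytes, [0x47, 0x49, 0x46, 0x38])) return "gif";
  // "RIFF" container with "WEBP" at byte offset 8
  if (
    startsWithBytes(bytes, [0x52, 0x49, 0x46, 0x46]) &&
    startsWithBytes(bytes, [0x57, 0x45, 0x42, 0x50], 8)
  ) {
    return "webp";
  }
  return "unknown";
}
```

Checking the content rather than the extension catches cases such as a scanned page saved as a PNG but renamed to .pdf.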


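Recognize text with OCR

Extract text from scanned PDFs and images using optical character recognition.

Select a scanned PDF or image to extract its text. Processing happens entirely in your browser.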


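Extract text from any PDF - free OCR tool

Scanned PDFs and image-based documents look like ordinary files, but they contain no readable text layer; they are essentially photographs of pages. Our OCR PDF tool analyzes the image of each page, recognizes the characters in it with Tesseract, and outputs the extracted text as a plain text file that you can copy, edit, or paste into other tools. All processing runs entirely in your browser, using a WebAssembly build of the OCR engine, so your file never leaves your device.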

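What OCR is and when you need it

OCR stands for Optical Character Recognition. When a document is scanned, photographed, or exported by a system that rasterizes its pages, the resulting PDF contains no text data, only pixel images of letters. PDF viewers display these files normally, but you cannot select text, search with Ctrl+F, or copy a sentence. OCR solves this by analyzing the visual shapes on each page and identifying the characters they represent.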

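You need OCR when:

You have a scanned contract, invoice, or form and need to copy text from it
Your PDF was generated from a photo or fax and cannot be searched
You want to extract text from a scanned document before using the PDF to Word tool
You need to make archived documents searchable to meet compliance or record-keeping requirements
You received a PDF in which the text appears as an image and cannot be selected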

How our OCR tool works

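Select a file - Drop in a scanned PDF or an image file (JPG, PNG, WebP). The file is read locally by the browser and is not sent to any server.
Choose the language - Select the language of the text in your document. Tesseract loads the trained character model for that language to improve recognition accuracy.
Choose pages - Process the whole document, or specify individual pages or page ranges.
Run OCR - Tesseract analyzes each page image in your browser via WebAssembly and extracts the recognized text.
Download the result - Save the extracted text as a plain .txt file, or copy it straight to the clipboard. To edit the content as a document, paste the text into the PDF editor, or use it as the input for the PDF to Word converter after running OCR on the original scan.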

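What affects OCR accuracy

OCR accuracy depends on the quality of the source document. High-resolution scans with crisp dark text on a white background work best. Common factors that reduce accuracy include low scan resolution, skewed or rotated pages, handwriting, decorative fonts, colored backgrounds, watermarks that overlap text, and heavy compression artifacts. Selecting the correct language before processing also makes a significant difference, because Tesseract uses language-specific character and vocabulary models to resolve ambiguous characters.

If you want a deeper look at how OCR technology works and the best ways to handle scanned documents, our blog post on OCR for PDF files explains the core technology and its use cases in plain language.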

