The ingest attachment plugin lets Elasticsearch extract the contents of file attachments in common formats (such as PPT, XLS, and PDF) by using Tika, the Apache text extraction library. You can use the ingest attachment plugin as a replacement for the mapper attachment plugin. The source field must be base64-encoded binary. If you do not want to incur the overhead of converting to and from base64, you can use the CBOR format instead of JSON and specify the field as a bytes array instead of a string representation. The processor then skips the base64 decoding.
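As a rough illustration of the base64 requirement, the sketch below builds the two JSON bodies involved: an ingest pipeline definition that uses the attachment processor, and a document whose source field holds the base64-encoded file. The pipeline name attachment, the field name data, the index name my_index, and the helper names are illustrative assumptions, not values from this article. The snippet only builds and prints the payloads; it does not contact a cluster.

```python
import base64
import json


def attachment_pipeline(field="data"):
    """Body for PUT _ingest/pipeline/<name>: a single attachment processor.

    `field` is the (assumed) name of the document field holding the file.
    """
    return {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": field}}],
    }


def attachment_document(raw_bytes, field="data"):
    """Body for PUT <index>/_doc/<id>?pipeline=<name>.

    JSON cannot carry raw bytes, so the file content is base64-encoded.
    """
    return {field: base64.b64encode(raw_bytes).decode("ascii")}


# Stand-in bytes; a real caller would read them from a PDF, PPT or XLS file.
sample_bytes = b"%PDF-1.4 placeholder content"

pipeline_body = json.dumps(attachment_pipeline())
document_body = json.dumps(attachment_document(sample_bytes))

print("PUT _ingest/pipeline/attachment")
print(pipeline_body)
print("PUT my_index/_doc/1?pipeline=attachment")
print(document_body)
```

When a document is indexed through such a pipeline, the processor writes the extracted text under the attachment field by default (for example attachment.content), alongside the original base64 source field.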
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install ingest-attachment
The plugin must be installed on every node in the cluster, and each node must be restarted after installation. This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/ingest-attachment/ingest-attachment-7.5.0.zip.
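Because every node needs the plugin, it can be worth checking the cluster after installation. The sketch below assumes the JSON output of the _cat/plugins and _cat/nodes APIs, where each plugin row carries the node name under name and the plugin name under component. The helper name and the sample rows are made up for illustration.

```python
def nodes_missing_plugin(plugin_rows, node_names, plugin="ingest-attachment"):
    """Return the sorted node names that do not report `plugin`.

    plugin_rows: parsed output of GET _cat/plugins?format=json
    node_names:  node names, e.g. from GET _cat/nodes?h=name&format=json
    """
    have = {row["name"] for row in plugin_rows if row["component"] == plugin}
    return sorted(set(node_names) - have)


# Hypothetical sample data for a three-node cluster.
rows = [
    {"name": "node-1", "component": "ingest-attachment", "version": "7.5.0"},
    {"name": "node-2", "component": "analysis-icu", "version": "7.5.0"},
    {"name": "node-3", "component": "ingest-attachment", "version": "7.5.0"},
]
missing = nodes_missing_plugin(rows, ["node-1", "node-2", "node-3"])
print(missing)
```

Any node listed in the result still needs the plugin installed and a restart.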
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove ingest-attachment
The code below shows one way to get PDFs into Elasticsearch: it extracts the text from each PDF file in a directory with the PyPDF2 library and posts it to an Elasticsearch index.
import json
import os

import PyPDF2
import requests


class ElasticModel:
    """Document sent to Elasticsearch: the file name and its extracted text."""

    name = ""
    msg = ""

    def toJSON(self):
        return json.dumps(self, default=lambda o: o.__dict__, sort_keys=True, indent=4)


def __readPDF__(path):
    # Open the PDF in binary mode; the file is closed when the block exits.
    with open(path, 'rb') as pdfFileObj:
        # PdfReader, .pages and extract_text() are the PyPDF2 2.x/3.x names for
        # the older PdfFileReader, numPages/getPage() and extractText().
        pdfReader = PyPDF2.PdfReader(pdfFileObj)
        # Number of pages in the PDF.
        print(len(pdfReader.pages))
        # Extract the text of the first page only.
        line = pdfReader.pages[0].extract_text()
    line = line.replace("\n", "")
    print(line)
    return line


def __prepareElasticModel__(line, name):
    eModel = ElasticModel()
    eModel.name = name
    eModel.msg = line
    return eModel


def __sendToElasticSearch__(elasticModel):
    print("Name : " + str(elasticModel.name))
    # Change the index name if needed.
    index = "samplepdf"
    url = "http://localhost:9200/" + index + "/_doc?pretty"
    data = elasticModel.toJSON()
    response = requests.post(url, data=data, headers={
        'Content-Type': 'application/json',
        'Accept-Language': 'en'
    })
    print("Url : " + url)
    print("Data : " + str(data))
    print("Response : " + str(response))


# Change the PDF directory path.
pdfdir = "C:/Users/muthali/Desktop/TemplatesPDF/SamplePdf"
for file in os.listdir(pdfdir):
    path = pdfdir + "/" + file
    print(path)
    line = __readPDF__(path)
    eModel = __prepareElasticModel__(line, file)
    __sendToElasticSearch__(eModel)
If you want to skip the coding entirely, you can create a PDF search engine using ExpertRec.
How do I index a PDF as an Elasticsearch index?
Follow these steps to index a PDF file in Elasticsearch:
1. Install a PDF plugin for Elasticsearch: Elasticsearch does not support PDF indexing natively, so you need a plugin to index PDF files. The ingest attachment plugin described above is one such plugin. You can download it from the official Elasticsearch website and install it on your Elasticsearch server.
2. Convert the PDF file to text: The text must be extracted from the PDF file before it can be indexed. Apache PDFBox, Tika, and Poppler are a few of the open-source libraries that can do this, and any of them will work.
3. Create an Elasticsearch index: Once the plugin is installed and the text has been extracted, create an Elasticsearch index to store the data. You can create the index with a tool such as Kibana or through the Elasticsearch API.
4. Index the PDF file: Finally, index the PDF file by sending Elasticsearch a PUT request that contains the extracted text and specifies the index and document type. You can send this request with a tool such as cURL.
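The indexing step above can be sketched in Python. This is a minimal sketch under stated assumptions: the text has already been extracted from the PDF, and the cluster URL http://localhost:9200, the index name pdf_index, the field names filename and content, and both helper names are hypothetical. The generic _doc endpoint is used in place of a custom document type. The sending function uses only the standard library and is defined but not called here, so the snippet runs without a cluster.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_id, filename, text):
    """Build (url, body) for a PUT that stores extracted PDF text."""
    url = "{}/{}/_doc/{}".format(base_url.rstrip("/"), index, doc_id)
    body = json.dumps({"filename": filename, "content": text}).encode("utf-8")
    return url, body


def send_index_request(url, body):
    """Send the PUT request and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


url, body = build_index_request(
    "http://localhost:9200/", "pdf_index", "1",
    "sample.pdf", "Text extracted from the PDF",
)
print(url)
print(body.decode("utf-8"))
```

Calling send_index_request(url, body) against a running cluster would perform the actual write. With default settings Elasticsearch creates the index on the first write, so creating it explicitly beforehand mainly matters when you want to define your own mappings.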