LlamaParse feature `take_screenshot` does not work with AzStorageBlobReader #376

galvangoh · 2024-09-03T02:45:28Z

Describe the bug
It is not possible to parse a document with `AzStorageBlobReader` when the `take_screenshot=True` feature from LlamaParse is enabled. Also, the `AzStorageBlobReader` class does not provide any interface to download screenshots of the document.

Reproducible example:

from llama_parse import LlamaParse
from llama_index.readers.azstorage_blob import AzStorageBlobReader

import os
from dotenv import load_dotenv
load_dotenv()
LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")

instructions = """The provided document is an invoice. Please extract all basic
               information of the document, customer & supplier. The document
               also contains table of line items that needs to be extracted as
               well."""

container = 'CONTAINER_NAME'
folder_path = 'SUBDIR_1/SUBDIR_2'
connection_string = 'my_connection_string'
blob_name = 'MultiPageInvoice.pdf'

# parameters for LlamaParse
parser_params = {
    'api_key': LLAMA_CLOUD_API_KEY,
    'result_type': 'markdown',
    'parsing_instruction': instructions,
    'invalidate_cache': True,
    'do_not_cache': True,
    'skip_diagonal_text': True,
    'num_workers': 5,
    'ignore_errors': False,
    'use_vendor_multimodal_model': True,
    'vendor_multimodal_model_name': 'openai-gpt4o',
    'take_screenshot': True
}

# instantiate the parser
parser = LlamaParse(**parser_params)

file_extractor = {'.pdf': parser}

azure_loader = AzStorageBlobReader(
    container_name=f'{container}/{folder_path}',  # uses the `container` variable defined above
    connection_string=connection_string,
    blob=blob_name,
    file_extractor=file_extractor,
)

# begin parsing
document = azure_loader.load_data()  # errors out here

Error message:

Started parsing the file under job_id 2e1f4eb4-2025-4a23-9297-a71fd979de62
Error while parsing the file '<bytes/buffer>': Failed to parse the file: 2e1f4eb4-2025-4a23-9297-a71fd979de62, status: ERROR
Failed to load file file:///C:/####/####/####/####/####/####/MultiPageInvoice.pdf with error: Failed to parse the file: 2e1f4eb4-2025-4a23-9297-a71fd979de62, status: ERROR. Skipping...

Files
MultiPageInvoice.pdf

Job ID
2e1f4eb4-2025-4a23-9297-a71fd979de62

Screenshots

Client:

Python Library

Additional context
llama-parse==0.5.1
llama-index-readers-azstorage-blob==0.2.0
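
A possible workaround, sketched here under assumptions and not verified against this failing job: download the blob to a local temporary file, then call LlamaParse's JSON mode directly instead of going through `AzStorageBlobReader`. The helper names `fetch_blob_to_tempfile` and `parse_with_screenshots` are hypothetical. The sketch assumes `blob_client` behaves like `azure.storage.blob.BlobClient` (`download_blob().readall()`) and `parser` behaves like `LlamaParse` (`get_json_result`, `get_images`). Whether the screenshots are returned this way when `take_screenshot=True` is set, and whether the server-side `status: ERROR` is avoided, is untested.

```python
import os
import tempfile


def fetch_blob_to_tempfile(blob_client, suffix=".pdf"):
    """Write the blob's bytes to a temporary file and return its path.

    Assumption: `blob_client` exposes download_blob().readall(), as
    azure.storage.blob.BlobClient does.
    """
    data = blob_client.download_blob().readall()
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
    return path


def parse_with_screenshots(parser, pdf_path, image_dir):
    """Run the parser's JSON mode on a local file and save its images.

    Assumption: `parser` exposes get_json_result(path) and
    get_images(json_result, download_path), as llama_parse.LlamaParse does.
    """
    os.makedirs(image_dir, exist_ok=True)
    json_result = parser.get_json_result(pdf_path)
    images = parser.get_images(json_result, download_path=image_dir)
    return json_result, images
```

With the variables from the reproducible example, the blob client could presumably be built with `BlobClient.from_connection_string(connection_string, container_name=container, blob_name=f"{folder_path}/{blob_name}")` and passed to `fetch_blob_to_tempfile`, with the resulting path and the existing `parser` then passed to `parse_with_screenshots`.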


galvangoh added the bug Something isn't working label Sep 3, 2024

galvangoh changed the title ~~LlamaParse integration with AzStorageBlobReader~~ LlamaParse feature take_screenshot does not work with AzStorageBlobReader Sep 3, 2024
