Cameron Williams: Intelligent Document Processing for Artificial Intelligence

awschicago 53 views 17 slides Jun 24, 2024

About This Presentation

AWS Community Day Midwest 2024


Slide Content

MIDWEST | OHIO

Intelligent Document Processing for Artificial Intelligence
Cameron Williams | June 13th, 2024

What’s included
1. Introduction to business need
2. Our IDP workflow
3. Lessons learned
4. Moving forward with AI
5. Questions

Need for an AI-powered IDP solution at VA

Use Case
The PACT Act of 2022 expanded benefits for an estimated 5 million Veterans exposed to hazardous chemicals and materials
•Situation: The average claim folder contains 100+ documents averaging 13 pages each, and each document can take ~30 minutes to review
•Goal: Expedite disability benefits determination by reducing the manual review of documents
•Plan: Extract raw text from the benefits application documents and enable a search engine that helps to quickly find and review the documents

Why Amazon Textract
•Available on GovCloud and is FedRAMP High
•Ability to scale as required to process 50–80 million pages per day
•High accuracy with both printed and handwritten text
•Reduces time spent reviewing data by highlighting results using Amazon Textract geometry data
•System already running in AWS
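
As a rough illustration of the geometry bullet above: Textract reports each block's BoundingBox as ratios of the page width and height, so highlighting a search hit largely comes down to scaling those ratios to the rendered page size. The sketch below assumes a list of Textract blocks is already in memory; `PixelBox`, `bounding_box_to_pixels`, and `boxes_matching` are hypothetical names, and exact-word matching is a simplification. The talk does not describe how the actual UI performs highlighting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PixelBox:
    """Hypothetical rectangle in pixel coordinates for a rendered page."""
    left: int
    top: int
    right: int
    bottom: int


def bounding_box_to_pixels(block: dict, page_width: int, page_height: int) -> PixelBox:
    """Scale a Textract block's ratio-based BoundingBox to pixel coordinates."""
    box = block["Geometry"]["BoundingBox"]
    left = box["Left"] * page_width
    top = box["Top"] * page_height
    return PixelBox(
        left=round(left),
        top=round(top),
        right=round(left + box["Width"] * page_width),
        bottom=round(top + box["Height"] * page_height),
    )


def boxes_matching(blocks: list, term: str, page_width: int, page_height: int) -> list:
    """Return highlight rectangles for WORD blocks whose text equals the search term."""
    needle = term.casefold()
    return [
        bounding_box_to_pixels(block, page_width, page_height)
        for block in blocks
        if block.get("BlockType") == "WORD"
        and block.get("Text", "").casefold() == needle
    ]
```

A UI could draw each returned rectangle over the page image at the same pixel size that was passed in.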

Results/impact
Processing
•500 million documents
•10 billion+ pages processed
•Largest implementation of Amazon Textract in under one year
Efficiency
•Smart search engine to search and filter the content, minimizing the number of documents to review
•Estimated 9x efficiency boost to document processing and manual review time
AI/ML
•Enabling future innovation through ML applications including generative AI by leveraging the extracted text

Our IDP workflow
•System is FISMA High
•In GovCloud, all FedRAMP High
•Translation: services are super secure and reliable
[Architecture diagram: UI, S3 bucket, job queues, job initiator, Textract, results topic, results queue, results analyzer, API, and OpenSearch, with flows labeled "Store Textract results" and "Retrieve Textract results"]

Our IDP workflow: job queues and initiator
•Job queues invoke job initiator via event source mapping
•SQS and Lambda are scalable, durable, and easy to integrate
•Consideration: balance message creation with rate limits
[Architecture diagram: same IDP workflow components as the overview slide]
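
The job queue to job initiator hand-off can be sketched as a function that processes one SQS batch delivered through a Lambda event source mapping. This is a minimal sketch, not the presenter's code: the message body shape (`bucket`, `key`), the `process_batch` name, and the injected `start_job` callable are all assumptions. Returning `batchItemFailures` only has an effect when `ReportBatchItemFailures` is enabled on the event source mapping; with it, a message whose job could not be started (for example because the Textract call was throttled) goes back to the queue while the rest of the batch is acknowledged.

```python
import json
from typing import Callable


def process_batch(event: dict, start_job: Callable[..., None]) -> dict:
    """Start one Textract job per SQS record and report the records that failed.

    `start_job` is a hypothetical callable (bucket=..., key=...) that would wrap
    the real Textract call; injecting it keeps this sketch runnable offline.
    """
    failures = []
    for record in event.get("Records", []):
        try:
            # Assumed message shape: {"bucket": "...", "key": "..."}
            message = json.loads(record["body"])
            start_job(bucket=message["bucket"], key=message["key"])
        except Exception:
            # Mark only this message as failed so SQS redelivers it later
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

A real Lambda handler would be a thin wrapper such as `def handler(event, context): return process_batch(event, start_job)`. Keeping the batch size small is one way to keep the rate of Textract calls under the account's limits.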

Our IDP workflow: Textract
•Textract reads from and writes to S3 bucket using Lambda IAM role
•Synchronous (single-page) vs asynchronous (multipage) operations
•Synchronous operations return the result in the response, simplifying the architecture
•Asynchronous operations send job status to SNS (results topic)
•Consideration: store results in Textract's default location or in your own S3 bucket
[Architecture diagram: same IDP workflow components as the overview slide]
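
The synchronous versus asynchronous choice above can be sketched as a helper that picks the boto3 Textract operation and builds its keyword arguments. The operation names and parameter shapes (`Document`, `DocumentLocation`, `NotificationChannel`, `OutputConfig`) follow the boto3 Textract client; the helper name `build_textract_call`, its arguments, the default prefix, and the rule of choosing purely by page count are assumptions made for illustration.

```python
from typing import Optional, Tuple


def build_textract_call(
    bucket: str,
    key: str,
    page_count: int,
    sns_topic_arn: str,
    role_arn: str,
    output_bucket: Optional[str] = None,
    output_prefix: str = "textract-output",  # hypothetical default prefix
) -> Tuple[str, dict]:
    """Return (boto3 operation name, keyword arguments) for one document."""
    document = {"S3Object": {"Bucket": bucket, "Name": key}}
    if page_count == 1:
        # Synchronous: the blocks come back directly in the response
        return "detect_document_text", {"Document": document}
    params = {
        "DocumentLocation": document,
        # Asynchronous: completion status is published to this SNS topic
        "NotificationChannel": {"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    }
    if output_bucket is not None:
        # Write results to a bucket you own instead of Textract's default store
        params["OutputConfig"] = {"S3Bucket": output_bucket, "S3Prefix": output_prefix}
    return "start_document_text_detection", params
```

With a real client, the call would look like `operation, params = build_textract_call(...)` followed by `getattr(boto3.client("textract"), operation)(**params)`.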

Our IDP workflow: results analyzer
•Results analyzer retrieves all pages of blocks from S3
•Max of 1,000 blocks per page (each page stored as separate S3 object)
•Retrieving and storing data from hundreds of S3 objects uses a lot of memory
•Maps results to custom internal schema and persists them to a RESTful API
•Stores data in OpenSearch and S3
[Architecture diagram: same IDP workflow components as the overview slide]
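
One way to ease the memory pressure noted above is to read the result objects lazily, one at a time, rather than collecting them all before mapping. The sketch below is an assumption about how that could look, not the presenter's implementation: the function names and the output record fields are hypothetical, and it assumes results were written under a single prefix per job. The `s3` argument is expected to behave like a boto3 S3 client (`get_paginator("list_objects_v2")`, `get_object`).

```python
import json
from typing import Iterable, Iterator


def iter_result_pages(s3, bucket: str, prefix: str) -> Iterator[dict]:
    """Yield one parsed Textract result object at a time from S3."""
    paginator = s3.get_paginator("list_objects_v2")
    for listing in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for entry in listing.get("Contents", []):
            key = entry["Key"]
            # Assumption: skip the access-check marker Textract may write
            if key.endswith(".s3_access_check"):
                continue
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            yield json.loads(body)


def iter_lines(pages: Iterable[dict]) -> Iterator[dict]:
    """Map LINE blocks to a small, hypothetical internal record."""
    for page in pages:
        for block in page.get("Blocks", []):
            if block.get("BlockType") == "LINE":
                yield {
                    "page": block.get("Page"),
                    "text": block["Text"],
                    "box": block["Geometry"]["BoundingBox"],
                }
```

Because both functions are generators, only one result object is held in memory at a time. S3 lists keys in lexicographic order, so the `page` field is the safer way to restore reading order downstream.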

Our IDP workflow: user interaction
[Architecture diagram: same IDP workflow components as the overview slide]

Lessons learned
•Own your data: Configure Textract to store results in your S3 bucket and correlate Job IDs to a document identifier in your system
•Test thoroughly: Identify and test your edge cases, including file types, high page/word counts, and difficult handwriting
•Start small: Determine bottlenecks early
•Plan for the worst: Build with reprocessing in mind
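
Correlating Job IDs to documents, and deciding what to reprocess, can be sketched from the completion message Textract publishes to SNS. This sketch assumes the document identifier was passed as `JobTag` when the job was started and that the message reaches the consumer through an SQS queue subscribed to the topic; the function names and the returned field names are hypothetical.

```python
import json


def parse_textract_notification(sqs_body: str) -> dict:
    """Extract job and document identifiers from a Textract completion message."""
    envelope = json.loads(sqs_body)
    # SNS wraps the Textract message as a JSON string under "Message"
    # unless raw message delivery is enabled on the subscription
    message = json.loads(envelope["Message"]) if "Message" in envelope else envelope
    return {
        "job_id": message["JobId"],
        "status": message["Status"],
        # Assumption: JobTag carries our own document identifier
        "document_id": message.get("JobTag"),
        "source_key": message.get("DocumentLocation", {}).get("S3ObjectName"),
    }


def needs_reprocessing(notification: dict) -> bool:
    """Treat anything other than a successful job as a reprocessing candidate."""
    return notification["status"] != "SUCCEEDED"
```

Persisting the `job_id` to `document_id` pair when the job starts, and again when it finishes, gives a record from which failed or missing documents can be resubmitted later.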

Moving forward with AI
•Built prototype to answer predefined questions
•Collaborating with AWS on personally identifiable information recognition models
•Prepopulate long, complex forms based on model findings
•Generate recommended adjudication of claims with Benefits Copilot

Thank You!
Implement smart document search index with Amazon Textract and Amazon OpenSearch
https://aws.amazon.com/blogs/machine-learning/implement-smart-document-search-index-with-amazon-textract-and-amazon-opensearch/