Monday, January 27, 2025

Web Scraping with AWS Lambda in Python

  










APIs are not always available or feasible

APIs are often the best way to retrieve data from a provider, offering a structured, reliable, and efficient mechanism for data access. However, APIs may not always be available, especially when you are looking to get data for free. In such cases, web scraping becomes a valuable alternative, enabling you to extract data directly from web pages meant for human consumption. In this article, I will guide you through building a simple web scraping application using Python, deployed as an AWS Lambda function.



Exploring Scraping Tools

When working with Python, the BeautifulSoup library is often one of the first tools considered for web scraping. It is simple to use and well integrated with Python. However, it only parses static HTML and cannot execute JavaScript, so it falls short on modern, JavaScript-heavy websites built with frameworks like React, which have become prevalent across the web. Three prominent alternatives that can drive a real browser are Puppeteer (by Google), Playwright (by Microsoft), and Selenium. My choice is Playwright! Its clean Python interface and ease of packaging make it a pleasure to develop, test, and deploy scraping tools as AWS Lambda functions. The script below demonstrates how Playwright can be used to download the daily GTR Rates files from the CFTC dashboard; notice how compact and easy to follow the code is.

from playwright.sync_api import sync_playwright

def handler(event, context):
    with sync_playwright() as p:
        browser = p.chromium.launch(args=["--disable-gpu", "--single-process", "--headless=new"], headless=True)
        page = browser.new_page()
        page.goto("https://pddata.dtcc.com/ppd/cftcdashboard")

        # Click on the Cumulative Reports tab
        page.wait_for_selector("text=Cumulative Reports")
        page.get_by_text("Cumulative Reports").click()

        # Click on the Rates tab
        page.wait_for_selector("text=Rates")
        page.get_by_text("Rates").click()

        # Iterate over the rows and download each file
        page.wait_for_selector("role=row")
        rows = page.locator("role=row")
        for i in range(rows.count()):
            row = rows.nth(i)
            with page.expect_download() as download_info:
                row.locator("mat-icon").click()  # Trigger the download
            download = download_info.value
            save_to_s3(download)

        browser.close()
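
The save_to_s3 helper is omitted from the listing above. A minimal sketch of what it could look like, assuming boto3 and a hypothetical bucket name:

import boto3

def save_to_s3(download, bucket="kupala-gather-data"):  # bucket name is a placeholder
    # Lambda only allows writes under /tmp, so stage the download there first
    local_path = f"/tmp/{download.suggested_filename}"
    download.save_as(local_path)
    # Upload under the browser-suggested filename
    boto3.client("s3").upload_file(local_path, bucket, download.suggested_filename)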



            


Deploying Playwright-based scripts to AWS Lambda

To begin, I would like to acknowledge @shantanuo for sharing a Dockerfile example for packaging Playwright as an AWS Lambda function. The Dockerfile can be found here and serves as a starting point for building a Playwright-compatible Lambda container. To complement it, below are a template.yml for CloudFormation deployment and a .gitlab-ci.yml for GitLab/AWS automated deployment.
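
To give a rough sense of the approach, here is a minimal sketch of the common pattern (an illustration under stated assumptions, not the referenced Dockerfile; the image tag is illustrative): start from Playwright's Debian-based Python image, which bundles Chromium and its system dependencies, and add the AWS Lambda Runtime Interface Client so the container can serve Lambda events.

# Sketch only; not the Dockerfile referenced above
FROM mcr.microsoft.com/playwright/python:v1.49.0-jammy
# awslambdaric lets this non-AWS base image speak the Lambda runtime API
RUN pip install awslambdaric boto3
WORKDIR /var/task
COPY handler.py ./
ENTRYPOINT ["python", "-m", "awslambdaric"]
CMD ["handler.handler"]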

For more details on configuring GitLab CI/CD, including trusted connections and authentication mechanisms, refer to my earlier article https://kozyarchuk.blogspot.com/2024/11/connecting-gitlab-and-aws-m-917-536.html.

AWSTemplateFormatVersion: '2010-09-09'
Transform: 'AWS::Serverless-2016-10-31'
Parameters:
  DeploymentTimestamp:
    Type: String
    Default: 'Notprovided'
    Description: 'The timestamp of the deployment'
  ECRRegistry:
    Type: String
    Default: "<your_account_id>.dkr.ecr.<your_region>.amazonaws.com"

Resources:
  KupalaGatherLambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: KupalaGatherLambdaPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - execute-api:Invoke
                  - execute-api:ManageConnections
                Resource: !Sub arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:*
      Path: "/"

  KupalaGatherPipeline:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: KupalaGatherPipeline
      ImageUri: !Sub "${ECRRegistry}/kupala/gather:${DeploymentTimestamp}"
      PackageType: Image
      Timeout: 900
      MemorySize: 1024
      Role: !GetAtt KupalaGatherLambdaRole.Arn

  KupalaGatherPipelineInvokeConfig:
    Type: AWS::Lambda::EventInvokeConfig
    Properties:
      FunctionName: !Ref KupalaGatherPipeline
      MaximumRetryAttempts: 0  # Disable retries
      Qualifier: "$LATEST"

  KupalaECRRepository:
    Type: AWS::ECR::Repository
    Properties:
      RepositoryName: "kupala/gather"
      LifecyclePolicy:
        LifecyclePolicyText: |
          {
            "rules": [
              {
                "rulePriority": 1,
                "description": "Keep only the latest image",
                "selection": {
                  "tagStatus": "any",
                  "countType": "imageCountMoreThan",
                  "countNumber": 1
                },
                "action": {
                  "type": "expire"
                }
              }
            ]
          }
      RepositoryPolicyText:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action:
              - ecr:BatchCheckLayerAvailability
              - ecr:BatchGetImage
              - ecr:GetDownloadUrlForLayer
              - ecr:DescribeRepositories
              - ecr:GetRepositoryPolicy
              - ecr:ListImages


deploy dev:
  stage: deploy
  image: docker:20.10.16
  services:
    - docker:20.10.16-dind
  id_tokens:
    GITLAB_OIDC_TOKEN:
      aud: FooBar
  variables:
    ROLE_ARN: "arn:aws:iam::<accountId>:role/<DeployRole>"
    S3_BUCKET: "<deployment_bucket>"
    ECR_REGISTRY: <accountId>.dkr.ecr.us-east-1.amazonaws.com
    ECR_REPOSITORY: kupala/gather
    STACK_NAME: <stack_name>
    AWS_DEFAULT_REGION: us-east-1
  script:
    - echo "Building Docker image..."
    - apk add --no-cache python3 py3-pip
    - apk add --no-cache python3-dev libffi-dev gcc musl-dev  # Add build dependencies
    - apk add --no-cache jq
    - python3 -m venv /venv  # Create a virtual environment
    - . /venv/bin/activate  # Activate the virtual environment
    - pip install --no-cache-dir awscli aws-sam-cli  # Install AWS CLI and SAM CLI
    - export IMAGE_TAG=$(date +%Y%m%d%H%M%S)
    - docker build -t $ECR_REPOSITORY:$IMAGE_TAG .
    - echo "Assuming role ${ROLE_ARN} with GitLab OIDC token"
    - |
      CREDS=$(aws sts assume-role-with-web-identity \
        --role-arn ${ROLE_ARN} \
        --role-session-name "GitLabRunner-${CI_PROJECT_ID}-${CI_PIPELINE_ID}" \
        --web-identity-token ${GITLAB_OIDC_TOKEN} \
        --duration-seconds 3600 \
        --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
        --output json)
    - export AWS_ACCESS_KEY_ID=$(echo $CREDS | jq -r '.[0]')
    - export AWS_SECRET_ACCESS_KEY=$(echo $CREDS | jq -r '.[1]')
    - export AWS_SESSION_TOKEN=$(echo $CREDS | jq -r '.[2]')
    - aws sts get-caller-identity
    - echo "Logging in to Amazon ECR..."
    - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $ECR_REGISTRY
    - echo "Creating repository"
    - aws ecr create-repository --repository-name "${ECR_REPOSITORY}" --region "${AWS_DEFAULT_REGION}" || echo "Repository already exists."
    - echo "Tagging Docker image..."
    - docker tag $ECR_REPOSITORY:$IMAGE_TAG $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
    - echo "Pushing Docker image to ECR..."
    - docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
    - echo "Preparing deployment..."
    - aws s3 mb s3://$S3_BUCKET || true
    - sam package --s3-bucket $S3_BUCKET --output-template-file packaged.yaml --image-repository $ECR_REGISTRY/$ECR_REPOSITORY=$IMAGE_TAG
    - sam deploy --template-file packaged.yaml --stack-name $STACK_NAME --capabilities CAPABILITY_IAM --no-fail-on-empty-changeset --no-confirm-changeset --on-failure ROLLBACK --parameter-overrides DeploymentTimestamp=$IMAGE_TAG ECRRegistry=$ECR_REGISTRY --image-repository $ECR_REGISTRY/$ECR_REPOSITORY=$IMAGE_TAG
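
Once deployed, a scheduled EventBridge rule is the natural trigger for daily pulls; the function can also be invoked by hand. A minimal sketch using boto3, assuming the function name from the template above and a region matching the registry:

import boto3

# One-off asynchronous invocation of the deployed scraper
client = boto3.client("lambda", region_name="us-east-1")
response = client.invoke(
    FunctionName="KupalaGatherPipeline",
    InvocationType="Event",  # async; retries stay disabled per the EventInvokeConfig
)
print(response["StatusCode"])  # 202 means the async invocation was accepted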




Performance and Closing Thoughts

The runtime statistics of this setup are acceptable for daily scraping tasks where the goal is to pull data into a repository for analysis or archival. However, this approach falls short for ad-hoc or on-demand user queries, where faster and more efficient solutions are required.

Performance Metrics

  • Memory Usage: The script requires approximately 550MB of memory.

  • Initialization Time: Cold starts take around 3.5 seconds in the AWS Lambda environment.

  • Execution Time: The script completes in about 40 seconds for the given task.

I would welcome thoughts or suggestions on improving web scraping performance to bring it closer to API efficiency.

Practical Considerations

The example provided here, while functional for scraping the CFTC GTR dashboard, omits components you would need in many real-world scenarios, such as:

  • Managing cookies and session data to maintain authentication across requests (a sketch follows below).

  • Handling complex page layouts involving nested elements or dynamically loaded content.

These challenges require additional considerations and more advanced techniques, which can significantly increase the complexity of scraping workflows. If you're interested in handling advanced scenarios like dynamic forms or complex interactions, feel free to reach out for further examples or discussions.
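
To illustrate the first point, here is a minimal sketch of session persistence using Playwright's storage_state; the login steps and the state.json path are placeholders:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    # ... site-specific login flow would go here ...
    context.storage_state(path="state.json")  # Persist cookies and local storage

    # A later run restores the authenticated session without repeating the login
    restored = browser.new_context(storage_state="state.json")
    page = restored.new_page()
    browser.close()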
