Monday, January 27, 2025

Web Scraping with AWS Lambda in Python

  










APIs are not always available or feasible

APIs are often the best way to retrieve data from a provider, offering a structured, reliable, and efficient mechanism for data access. However, APIs may not always be available, especially when you are looking to get data for free. In such cases, web scraping becomes a valuable alternative, enabling you to extract data directly from web pages meant for human consumption. In this article, I will guide you through building a simple web scraping application using Python, deployed as an AWS Lambda function.



Exploring Scraping Tools

When working with Python, the BeautifulSoup library is often one of the first tools considered for web scraping. It is simple to use and well integrated with Python. However, it only parses static HTML and cannot execute JavaScript, so it falls short on modern, JavaScript-heavy websites built with frameworks like React, which have become prevalent across the web. Three prominent alternatives that can drive a real browser are Puppeteer (by Google), Playwright (by Microsoft), and Selenium. My choice is Playwright! Its clean Python interface and ease of packaging make it a pleasure to develop, test, and deploy scraping tools as AWS Lambda functions. The script below demonstrates how Playwright can be used to download the daily GTR Rates files from the CFTC dashboard; notice how compact and easy to follow the code is.

from playwright.sync_api import sync_playwright

def handler(event, context):
    with sync_playwright() as p:
        browser = p.chromium.launch(args=["--disable-gpu", "--single-process", "--headless=new"], headless=True)
        page = browser.new_page()
        page.goto("https://pddata.dtcc.com/ppd/cftcdashboard")

        # Click on the Cumulative Reports tab
        page.wait_for_selector("text=Cumulative Reports")
        page.get_by_text("Cumulative Reports").click()

        # Click on the Rates tab
        page.wait_for_selector("text=Rates")
        page.get_by_text("Rates").click()

        # Iterate over the rows and download each file
        page.wait_for_selector("role=row")
        rows = page.locator("role=row")
        for i in range(rows.count()):
            row = rows.nth(i)
            with page.expect_download() as download_info:
                row.locator("mat-icon").click()  # Trigger the download
            download = download_info.value
            save_to_s3(download)

        browser.close()
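
The save_to_s3 helper is omitted from the listing above. A minimal sketch of what it could look like, assuming boto3 and a hypothetical bucket name:

import boto3

def save_to_s3(download, bucket="kupala-gather-data"):  # bucket name is a placeholder
    # Lambda only allows writes under /tmp, so stage the download there first
    local_path = f"/tmp/{download.suggested_filename}"
    download.save_as(local_path)
    # Upload under the browser-suggested filename
    boto3.client("s3").upload_file(local_path, bucket, download.suggested_filename)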



            


Deploying Playwright-based scripts to AWS Lambda

To begin, I would like to acknowledge @shantanuo for sharing a Dockerfile example for packaging Playwright as an AWS Lambda function. The Dockerfile can be found here and serves as a starting point for building a Playwright-compatible Lambda container. To complement it, below are a template.yml for CloudFormation deployment and a .gitlab-ci.yml for GitLab/AWS automated deployment.
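
To give a rough sense of the approach, here is a minimal sketch of the common pattern (an illustration under stated assumptions, not the referenced Dockerfile; the image tag is illustrative): start from Playwright's Debian-based Python image, which bundles Chromium and its system dependencies, and add the AWS Lambda Runtime Interface Client so the container can serve Lambda events.

# Sketch only; not the Dockerfile referenced above
FROM mcr.microsoft.com/playwright/python:v1.49.0-jammy
# awslambdaric lets this non-AWS base image speak the Lambda runtime API
RUN pip install awslambdaric boto3
WORKDIR /var/task
COPY handler.py ./
ENTRYPOINT ["python", "-m", "awslambdaric"]
CMD ["handler.handler"]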

For more details on configuring GitLab CI/CD, including trusted connections and authentication mechanisms, refer to my earlier article https://kozyarchuk.blogspot.com/2024/11/connecting-gitlab-and-aws-m-917-536.html.

AWSTemplateFormatVersion: '2010-09-09'
Transform: 'AWS::Serverless-2016-10-31'
Parameters:
  DeploymentTimestamp:
    Type: String
    Default: 'Notprovided'
    Description: 'The timestamp of the deployment'
  ECRRegistry:
    Type: String
    Default: "<your_account_id>.dkr.ecr.<your_region>.amazonaws.com"

Resources:
  KupalaGatherLambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: KupalaGatherLambdaPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - execute-api:Invoke
                  - execute-api:ManageConnections
                Resource: !Sub arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:*
      Path: "/"

  KupalaGatherPipeline:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: KupalaGatherPipeline
      ImageUri: !Sub "${ECRRegistry}/kupala/gather:${DeploymentTimestamp}"
      PackageType: Image
      Timeout: 900
      MemorySize: 1024
      Role: !GetAtt KupalaGatherLambdaRole.Arn

  KupalaGatherPipelineInvokeConfig:
    Type: AWS::Lambda::EventInvokeConfig
    Properties:
      FunctionName: !Ref KupalaGatherPipeline
      MaximumRetryAttempts: 0  # Disable retries
      Qualifier: "$LATEST"

  KupalaECRRepository:
    Type: AWS::ECR::Repository
    Properties:
      RepositoryName: "kupala/gather"
      LifecyclePolicy:
        LifecyclePolicyText: |
          {
            "rules": [
              {
                "rulePriority": 1,
                "description": "Keep only the latest image",
                "selection": {
                  "tagStatus": "any",
                  "countType": "imageCountMoreThan",
                  "countNumber": 1
                },
                "action": {
                  "type": "expire"
                }
              }
            ]
          }
      RepositoryPolicyText:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action:
              - ecr:BatchCheckLayerAvailability
              - ecr:BatchGetImage
              - ecr:GetDownloadUrlForLayer
              - ecr:DescribeRepositories
              - ecr:GetRepositoryPolicy
              - ecr:ListImages


deploy dev:
  stage: deploy
  image: docker:20.10.16
  services:
    - docker:20.10.16-dind
  id_tokens:
    GITLAB_OIDC_TOKEN:
      aud: FooBar
  variables:
    ROLE_ARN: "arn:aws:iam::<accountId>:role/<DeployRole>"
    S3_BUCKET: "<deployment_bucket>"
    ECR_REGISTRY: <accountId>.dkr.ecr.us-east-1.amazonaws.com
    ECR_REPOSITORY: kupala/gather
    STACK_NAME: <stack_name>
    AWS_DEFAULT_REGION: us-east-1
  script:
    - echo "Building Docker image..."
    - apk add --no-cache python3 py3-pip
    - apk add --no-cache python3-dev libffi-dev gcc musl-dev  # Add build dependencies
    - apk add --no-cache jq
    - python3 -m venv /venv  # Create a virtual environment
    - . /venv/bin/activate  # Activate the virtual environment
    - pip install --no-cache-dir awscli aws-sam-cli  # Install AWS CLI and SAM CLI
    - export IMAGE_TAG=$(date +%Y%m%d%H%M%S)
    - docker build -t $ECR_REPOSITORY:$IMAGE_TAG .
    - echo "Assuming role ${ROLE_ARN} with GitLab OIDC token"
    - |
      CREDS=$(aws sts assume-role-with-web-identity \
        --role-arn ${ROLE_ARN} \
        --role-session-name "GitLabRunner-${CI_PROJECT_ID}-${CI_PIPELINE_ID}" \
        --web-identity-token ${GITLAB_OIDC_TOKEN} \
        --duration-seconds 3600 \
        --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
        --output json)
    - export AWS_ACCESS_KEY_ID=$(echo $CREDS | jq -r '.[0]')
    - export AWS_SECRET_ACCESS_KEY=$(echo $CREDS | jq -r '.[1]')
    - export AWS_SESSION_TOKEN=$(echo $CREDS | jq -r '.[2]')
    - aws sts get-caller-identity
    - echo "Logging in to Amazon ECR..."
    - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $ECR_REGISTRY
    - echo "Creating repository"
    - aws ecr create-repository --repository-name "${ECR_REPOSITORY}" --region "${AWS_DEFAULT_REGION}" || echo "Repository already exists."
    - echo "Tagging Docker image..."
    - docker tag $ECR_REPOSITORY:$IMAGE_TAG $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
    - echo "Pushing Docker image to ECR..."
    - docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
    - echo "Preparing deployment..."
    - aws s3 mb s3://$S3_BUCKET || true
    - sam package --s3-bucket $S3_BUCKET --output-template-file packaged.yaml --image-repository $ECR_REGISTRY/$ECR_REPOSITORY=$IMAGE_TAG
    - sam deploy --template-file packaged.yaml --stack-name $STACK_NAME --capabilities CAPABILITY_IAM --no-fail-on-empty-changeset --no-confirm-changeset --on-failure ROLLBACK --parameter-overrides DeploymentTimestamp=$IMAGE_TAG ECRRegistry=$ECR_REGISTRY --image-repository $ECR_REGISTRY/$ECR_REPOSITORY=$IMAGE_TAG
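
Once deployed, a scheduled EventBridge rule is the natural trigger for daily pulls; the function can also be invoked by hand. A minimal sketch using boto3, assuming the function name from the template above and a region matching the registry:

import boto3

# One-off asynchronous invocation of the deployed scraper
client = boto3.client("lambda", region_name="us-east-1")
response = client.invoke(
    FunctionName="KupalaGatherPipeline",
    InvocationType="Event",  # async; retries stay disabled per the EventInvokeConfig
)
print(response["StatusCode"])  # 202 means the async invocation was accepted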




Performance and Closing Thoughts

The runtime statistics of this setup are acceptable for daily scraping tasks where the goal is to pull data into a repository for analysis or archival. However, this approach falls short for ad-hoc or on-demand user queries, where faster and more efficient solutions are required.

Performance Metrics

  • Memory Usage: The script requires approximately 550MB of memory.

  • Initialization Time: Cold starts take around 3.5 seconds in the AWS Lambda environment.

  • Execution Time: The script completes in about 40 seconds for the given task.

I would welcome thoughts or suggestions on improving web scraping performance to bring it closer to API efficiency.

Practical Considerations

The example provided here, while functional for scraping the CFTC GTR dashboard, omits components you would need in many real-world scenarios, such as:

  • Managing cookies and session data to maintain authentication across requests (a sketch follows below).

  • Handling complex page layouts involving nested elements or dynamically loaded content.

These challenges require additional considerations and more advanced techniques, which can significantly increase the complexity of scraping workflows. If you're interested in handling advanced scenarios like dynamic forms or complex interactions, feel free to reach out for further examples or discussions.
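
To illustrate the first point, here is a minimal sketch of session persistence using Playwright's storage_state; the login steps and the state.json path are placeholders:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    # ... site-specific login flow would go here ...
    context.storage_state(path="state.json")  # Persist cookies and local storage

    # A later run restores the authenticated session without repeating the login
    restored = browser.new_context(storage_state="state.json")
    page = restored.new_page()
    browser.close()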
