APIs are not always available or feasible

APIs are often the best way to retrieve data from a provider, offering a structured, reliable, and efficient mechanism for data access. However, APIs are not always available, especially when you are looking to get data for free. In such cases, web scraping becomes a valuable alternative, enabling you to extract data directly from web pages meant for human consumption. In this article, I will guide you through building a simple web scraping application in Python, deployed as an AWS Lambda function.

Exploring Scraping Tools

When working with Python, the BeautifulSoup library is often one of the first tools considered for web scraping. It is simple to use and well integrated with Python. However, it does not support modern, JavaScript-heavy websites built with frameworks like React, which have become prevalent across the web. The three prominent alternatives that do support React are Puppeteer (by Google), Playwright (by Microsoft), and Selenium. My choice is Playwright! Its clean Python interface and ease of packaging make it a pleasure to develop, test, and deploy scraping tools as AWS Lambda functions. The script below demonstrates how Playwright can be used to download the daily GTR Rates files from the CFTC dashboard; notice how compact and easy to follow the code is.

from playwright.sync_api import sync_playwright

def handler(event, context):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            args=["--disable-gpu", "--single-process", "--headless=new"],
            headless=True,
        )
        page = browser.new_page()
        page.goto("https://pddata.dtcc.com/ppd/cftcdashboard")

        # Click on Cumulative Reports Tab
        page.wait_for_selector("text=Cumulative Reports")
        page.get_by_text("Cumulative Reports").click()

        # Click on Cumulative Rates Tab
        page.wait_for_selector("text=Rates")
        page.get_by_text("Rates").click()

        # Iterate over each row and download the file
        page.wait_for_selector("role=row")
        rows = page.locator("role=row")
        for i in range(rows.count()):
            row = rows.nth(i)
            with page.expect_download() as download_info:
                row.locator("mat-icon").click()  # Trigger the download
            download = download_info.value
            save_to_s3(download)

        browser.close()
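The save_to_s3 helper used above is not shown in the snippet. A minimal sketch of what it might look like, assuming boto3 and a hypothetical GTR_BUCKET environment variable for the target bucket, is:

import os
import boto3

# Hypothetical helper: upload a completed Playwright download to S3.
# GTR_BUCKET is an assumption; point it at your own bucket.
GTR_BUCKET = os.environ.get("GTR_BUCKET", "my-gtr-rates-bucket")

def save_to_s3(download):
    s3 = boto3.client("s3")
    local_path = download.path()          # Playwright stores the file in a temporary location
    key = download.suggested_filename     # keep the filename suggested by the site
    s3.upload_file(str(local_path), GTR_BUCKET, key)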
Deploying Playwright-based scripts to AWS Lambda

To begin, I would like to acknowledge @shantanuo for sharing a Dockerfile example for packaging Playwright as an AWS Lambda function. The Dockerfile can be found here and serves as a starting point for building a Playwright-compatible Lambda container. To complement this Dockerfile, below are a template.yml for CloudFormation deployment and a .gitlab-ci.yml for automated GitLab/AWS deployment. For more details on configuring GitLab CI/CD, including trusted connections and authentication mechanisms, refer to my earlier article https://kozyarchuk.blogspot.com/2024/11/connecting-gitlab-and-aws-m-917-536.html.

AWSTemplateFormatVersion: '2010-09-09'
Transform: 'AWS::Serverless-2016-10-31'
Parameters:
  DeploymentTimestamp:
    Type: String
    Default: 'Notprovided'
    Description: 'The timestamp of the deployment'
  ECRRegistry:
    Type: String
    Default: "<your_account_id>.dkr.ecr.<your_region>.amazonaws.com"
Resources:
  KupalaGatherLambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: KupalaGatherLambdaPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - execute-api:Invoke
                  - execute-api:ManageConnections
                Resource: !Sub arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:*
      Path: "/"
  KupalaGatherPipeline:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: KupalaGatherPipeline
      ImageUri: !Sub "${ECRRegistry}/kupala/gather:${DeploymentTimestamp}"
      PackageType: Image
      Timeout: 900
      MemorySize: 1024
      Role: !GetAtt KupalaGatherLambdaRole.Arn
  KupalaGatherPipelineInvokeConfig:
    Type: AWS::Lambda::EventInvokeConfig
    Properties:
      FunctionName: !Ref KupalaGatherPipeline
      MaximumRetryAttempts: 0  # Disable retries
      Qualifier: "$LATEST"
  KupalaECRRepository:
    Type: AWS::ECR::Repository
    Properties:
      RepositoryName: "kupala/gather"
      LifecyclePolicy:
        LifecyclePolicyText: |
          {
            "rules": [
              {
                "rulePriority": 1,
                "description": "Keep only the latest image",
                "selection": {
                  "tagStatus": "any",
                  "countType": "imageCountMoreThan",
                  "countNumber": 1
                },
                "action": {
                  "type": "expire"
                }
              }
            ]
          }
      RepositoryPolicyText:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action:
              - ecr:BatchCheckLayerAvailability
              - ecr:BatchGetImage
              - ecr:GetDownloadUrlForLayer
              - ecr:DescribeRepositories
              - ecr:GetRepositoryPolicy
              - ecr:ListImages
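Once the pipeline below has pushed the image and deployed this stack, a quick way to smoke-test the function is to invoke it directly. This is an illustrative snippet rather than part of the deployment; the function name comes from the template above, while the region is an assumption:

import json
import boto3

# Invoke the deployed scraper once and print the outcome.
# Region is an assumption; use the region the stack was deployed to.
client = boto3.client("lambda", region_name="us-east-1")
response = client.invoke(
    FunctionName="KupalaGatherPipeline",
    InvocationType="RequestResponse",
    Payload=json.dumps({}).encode("utf-8"),
)
print(response["StatusCode"], response["Payload"].read().decode("utf-8"))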
deploy dev:
  stage: deploy
  image: docker:20.10.16
  services:
    - docker:20.10.16-dind
  id_tokens:
    GITLAB_OIDC_TOKEN:
      aud: FooBar
  variables:
    ROLE_ARN: "arn:aws:iam::<accountId>:role/<DeployRole>"
    S3_BUCKET: "<deployment_bucket>"
    ECR_REGISTRY: <accountId>.dkr.ecr.us-east-1.amazonaws.com
  script:
    - echo "Building Docker image..."
    - apk add --no-cache python3 py3-pip
    - apk add --no-cache python3-dev libffi-dev gcc musl-dev  # Add build dependencies
    - apk add --no-cache jq
    - python3 -m venv /venv  # Create a virtual environment
    - . /venv/bin/activate  # Activate the virtual environment
    - pip install --no-cache-dir awscli aws-sam-cli  # Install AWS CLI and SAM CLI
    - export IMAGE_TAG=$(date +%Y%m%d%H%M%S)
    - docker build -t $ECR_REPOSITORY:$IMAGE_TAG .
    - echo "Assuming role ${ROLE_ARN} with GitLab OIDC token"
    - |
      CREDS=$(aws sts assume-role-with-web-identity \
        --role-arn ${ROLE_ARN} \
        --role-session-name "GitLabRunner-${CI_PROJECT_ID}-${CI_PIPELINE_ID}" \
        --web-identity-token ${GITLAB_OIDC_TOKEN} \
        --duration-seconds 3600 \
        --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
        --output json)
    - export AWS_ACCESS_KEY_ID=$(echo $CREDS | jq -r '.[0]')
    - export AWS_SECRET_ACCESS_KEY=$(echo $CREDS | jq -r '.[1]')
    - export AWS_SESSION_TOKEN=$(echo $CREDS | jq -r '.[2]')
    - aws sts get-caller-identity
    - echo "Logging in to Amazon ECR..."
    - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $ECR_REGISTRY
    - echo "Creating repository"
    - aws ecr create-repository --repository-name "${ECR_REPOSITORY}" --region "${AWS_DEFAULT_REGION}" || echo "Repository already exists."
    - echo "Tagging Docker image..."
    - docker tag $ECR_REPOSITORY:$IMAGE_TAG $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
    - echo "Pushing Docker image to ECR..."
    - docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
    - echo "Preparing deployment..."
    - aws s3 mb s3://$S3_BUCKET || true
    - sam package --s3-bucket $S3_BUCKET --output-template-file packaged.yaml --image-repository $ECR_REGISTRY/$ECR_REPOSITORY=$IMAGE_TAG
    - sam deploy --template-file packaged.yaml --stack-name $STACK_NAME --capabilities CAPABILITY_IAM --no-fail-on-empty-changeset --no-confirm-changeset --on-failure ROLLBACK --parameter-overrides DeploymentTimestamp=$IMAGE_TAG ECRRegistry=$ECR_REGISTRY --image-repository $ECR_REGISTRY/$ECR_REPOSITORY=$IMAGE_TAG

Performance and Closing Thoughts

The runtime statistics of this setup are acceptable for daily scraping tasks where the goal is to pull data into a repository for analysis or archival. However, this approach falls short for ad-hoc or on-demand user queries, where faster and more efficient solutions are required.

Performance Metrics
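If you want to capture similar end-to-end timings for your own runs, a simple wrapper around the handler is enough. This is an illustrative sketch, not part of the deployed code:

import time

def timed_handler(event, context):
    # Illustrative wrapper: log the end-to-end scrape duration to CloudWatch Logs.
    start = time.perf_counter()
    result = handler(event, context)
    print(f"Scrape completed in {time.perf_counter() - start:.1f} seconds")
    return result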
I would welcome thoughts or suggestions on improving web scraping performance to bring it closer to API efficiency.

Practical Considerations

The example provided here, while functional for scraping the CFTC GTR dashboard, lacks some of the components you would see in many other real-world scenarios, such as:
These challenges require additional considerations and more advanced techniques, which can significantly increase the complexity of scraping workflows. If you're interested in handling advanced scenarios like dynamic forms and complex interactions, feel free to reach out for further examples or discussions.