Sunday, November 17, 2024

Connecting GitLab and AWS

 


     Bypassing the GitLab vs GitHub or GitLab vs AWS CodePipeline debates, this article focuses on the logistics of connecting GitLab, acting as the CI/CD system, to deploy to AWS.  One way to achieve this is to add AWS access keys as variables on GitLab; while this works, it is not a recommended practice from a security perspective.  It's beyond the scope of this article to expand on that; I would just add that the accounts/roles needed for AWS deployment are typically quite powerful, and should they get compromised, the damage is likely to be extensive.

The recommended option is to connect AWS with GitLab using the OpenID Connect (OIDC) protocol.  With this setup, GitLab acts as both the OpenID identity provider and the initiator of the authentication requests.  AWS is then configured to trust GitLab as the identity provider for a specific role, guarded by three aspects:

  • requests come from the specific GitLab instance that is configured for the role

  • requests come from a particular group, project, or branch

  • optionally, a client identifier also known as the Audience.  AWS advertises this as the key aspect of security, but while it is supported by GitLab, it can easily be spoofed from a pipeline.  This point causes quite a bit of confusion when getting the handshake set up if you follow the official AWS docs and forums.

There are generally five steps to getting this setup configured, and I would recommend you take a deep breath and allocate a good portion of a day to get through them.

Step 1. Optional in practice, but it appears crucial if you read the relevant articles: establish a Client ID, or Audience.  To start, I would highly recommend you read this AWS article.  While it is misleading on a couple of points, namely that "you must register your application with the IdP to receive a client ID" and that /.well-known/openid-configuration must support id_token, it is generally a good article that gives you an overview of the process.  If you want to go ahead with creating a Client ID, you can do so by logging on to your GitLab instance, going to your group, then Settings -> Applications, and creating a new application.  The one confusing point you will encounter is entering a value for the callback URL; feel free to enter https://localhost, as it will not be used.  The reference document from GitLab can be found here: https://docs.gitlab.com/ee/integration/oauth_provider.html


Step 2: Add a new Identity Provider in AWS IAM.  I would refer you back to the article referenced in Step 1, as it provides a reasonable description of the process.  A few notes on that doc: its Step 1 is, as mentioned, optional, and if you skip it, then for its step 6 (Audience) you can add any value; it is simply there to create a placeholder policy that you will then edit via JSON.
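If you prefer to script this step rather than click through the console, below is a minimal boto3 sketch (not from the AWS doc) of creating the gitlab.com identity provider.  The audience value and the thumbprint are placeholders you would replace with your own values.

# Sketch: create the gitlab.com OIDC identity provider with boto3 instead of the console.
# The ClientIDList entry and the thumbprint are placeholders, not real values.
import boto3

iam = boto3.client("iam")
response = iam.create_open_id_connect_provider(
    Url="https://gitlab.com",  # your GitLab instance URL
    ClientIDList=["My Favorite Client"],  # the Audience; any value works if you skipped Step 1
    ThumbprintList=["0000000000000000000000000000000000000000"],  # placeholder; use the provider's certificate thumbprint
)
print(response["OpenIDConnectProviderArn"])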

Step 3: Finish setting up the AWS role.  In Step 2 you created an AWS role, but that role relies only on the Audience, which with GitLab is not much more secure than storing AWS access keys as GitLab variables.  To secure it further, you should restrict access to a particular group, project, or branch; I would refer you to this article on GitLab to learn more.  Below is a sample trust policy you will end up with.  If you set up a Client ID/Audience in Step 1, you can keep the aud condition as in the template below; otherwise you can simply drop it.  In the example below, the client ID is 'My Favorite Client', the group is 'superproj' hosted on gitlab.com, and all repos and branches within the group can assume the AWS role.

"Action": "sts:AssumeRoleWithWebIdentity",

"Condition": {

    "StringEquals": {

        "gitlab.com:aud": "My Favorite Client"

    },

    "StringLike": {

        "gitlab.com:sub": "project_path:superpoj/*:ref_type:branch:ref:*"

    }

}


Step 4: Getting your .gitlab-ci.yml set up.  The GitLab doc at https://docs.gitlab.com/ee/ci/cloud_services/aws/ offers an example for setting up the pipeline, but I found a few issues with it.

  • It doesn't say which image to use; I found another GitLab doc that provides the relevant answer: https://docs.gitlab.com/ee/ci/cloud_deployment/.

  • Its parsing logic for assigning the response of aws sts assume-role-with-web-identity to environment variables did not work for me; I used the approach in the template below instead.

  • It suggests setting the 'aud' of GITLAB_OIDC_TOKEN to https://gitlab.com; this doesn't work.  Instead you need to set it to whatever your AWS policy expects, or 'My Favorite Client' to stick with the example I started.

I found the template below to work for me.  jq is included in the AWS base image, but if you need to add it to your own image, that is not a big deal either.

  image: public.ecr.aws/sam/build-python3.11
  id_tokens:
    GITLAB_OIDC_TOKEN:
      aud: "My Favorite Client"
  variables:
    ROLE_ARN: "arn:aws:iam::<your account>:role/<NameOfYourRole>"
  script:
    - CREDS=$(aws sts assume-role-with-web-identity --role-arn ${ROLE_ARN} --role-session-name "GitLabRunner-${CI_PROJECT_ID}-${CI_PIPELINE_ID}" --web-identity-token ${GITLAB_OIDC_TOKEN} --duration-seconds 3600 --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' --output json)
    - export AWS_ACCESS_KEY_ID=$(echo $CREDS | jq -r '.[0]')
    - export AWS_SECRET_ACCESS_KEY=$(echo $CREDS | jq -r '.[1]')
    - export AWS_SESSION_TOKEN=$(echo $CREDS | jq -r '.[2]')
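If the assume-role call is rejected, the usual culprit is a mismatch between the token's aud or sub claims and the role's conditions.  A quick way to check, sketched below, is to decode the token payload in Python (JWTs are base64url-encoded JSON, so no signature verification is needed just to read the claims); the token is assumed to be available in the GITLAB_OIDC_TOKEN environment variable, as in the pipeline above.

# Sketch: decode the GitLab OIDC token's payload to inspect the aud and sub claims.
import base64
import json
import os

token = os.environ["GITLAB_OIDC_TOKEN"]
payload = token.split(".")[1]               # JWT format: header.payload.signature
payload += "=" * (-len(payload) % 4)        # restore base64 padding
claims = json.loads(base64.urlsafe_b64decode(payload))
print("aud:", claims.get("aud"))            # must equal the gitlab.com:aud value in the trust policy
print("sub:", claims.get("sub"))            # must match the gitlab.com:sub StringLike pattern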


Step 5: Add the required permissions to your AWS role.  This is something I am still working through, but the entitlements required are extensive.

Is this better than storing AWS keys in GitLab?  Not if you are only using an aud value that can be overridden in your pipeline.  I would recommend restricting your role further to just the protected deployment branch (for example, a sub pattern like project_path:superproj/<repo>:ref_type:branch:ref:main) and making sure that only the people you want to empower with the powerful AWS deployment role are allowed to merge into that branch.  Then there is also the question of the security of the GitLab server and user access to it, but that feels like a risk on the level of securing access to AWS itself.  At that level you should be concerned about protecting sensitive data with encryption, and perhaps less concerned about the cost of people spinning up unexpected AWS resources on your tab, or even bringing down your application, which you should be able to recover from backups.



Thursday, November 14, 2024

Building on AWS



When I started developing the Kupala-Nich platform, I knew I wanted to leverage AWS’s serverless environment. New to serverless technology, I spent considerable time refining the platform’s architecture, learning the limitations of certain tools, and discovering AWS components along the way. Here, I’ll share the current architecture of the platform and invite feedback on whether best practices or AWS stack components could further enhance its robustness, security, and efficiency. Below is a high-level diagram of the Kupala-Nich application architecture followed by a brief description of the components.


Platform Architecture Overview
The frontend is a React application that maintains a WebSocket connection to the API Gateway, handling most server interactions. The backend consists of several Lambda functions, DynamoDB, and S3 storage.
  • WS Lambda: A lightweight Lambda function that exposes various endpoints for the UI. Built in Python without dependencies beyond the standard library, on average it responds in under 50ms and requires less than 100MB of RAM.
  • PyCaret Lambda: Deployed via Docker, this function packages PyCaret for ML analysis, taking 3-5 minutes to execute with ~500MB of RAM. Training datasets and generated analysis are stored in S3.
  • CalcPosition Lambda: Triggered by DynamoDB Streams upon updates to the Position and Market Data tables, it calculates positions and P&L values, updates the CalcPosition table, and publishes results to WebSocket clients (a rough handler sketch follows this list). Although lightweight today, it will need to scale as position complexity increases.
  • EOD & YFinance Lambdas: These are event-triggered by timers. YFinance refreshes market data, and EOD snapshots positions to the EODPosition table and performs maintenance, such as rebalancing and closed position aggregation. The YFinance Lambda requires pandas and yfinance libraries, so it’s deployed as a zip package.
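To give a rough picture of the CalcPosition flow, here is a minimal sketch of a Lambda handler that reacts to DynamoDB Stream records and pushes a result to connected WebSocket clients.  This is not the platform's actual code; the endpoint URL, the WSConnections table, and the message shape are hypothetical placeholders.

# Sketch (hypothetical names): DynamoDB Streams -> recalculate -> push to WebSocket clients.
import json
import boto3

# Endpoint of the WebSocket API Gateway stage; placeholder value.
apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://<api-id>.execute-api.<region>.amazonaws.com/<stage>",
)
dynamodb = boto3.resource("dynamodb")
connections = dynamodb.Table("WSConnections")   # hypothetical table of active connection ids


def handler(event, context):
    for record in event.get("Records", []):
        new_image = record.get("dynamodb", {}).get("NewImage")
        if not new_image:
            continue  # e.g. a REMOVE event
        # Recalculate position/P&L here; this sketch just echoes the changed attributes.
        message = json.dumps({"updated": list(new_image.keys())})
        for item in connections.scan().get("Items", []):
            apigw.post_to_connection(
                ConnectionId=item["connection_id"],
                Data=message.encode("utf-8"),
            )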


Code Repositories & CI/CD Pipeline
The platform’s complexity lies not only in its architecture but in the automation of CI/CD pipelines and management of permissions. Here’s an overview of the code structure and CI/CD practices:
  • Frontend Repo: This contains all React code. The CI/CD pipeline is straightforward, running npm install, tests, npm build, and finally deploying the build to the Apache server. The pipeline, built on Node, completes in about two minutes, with most time in npm install and build steps. Unit test coverage is moderate, focusing on formatting logic and component stability across changes.
  • Backend Repo: This includes all backend code, except for PyCaret-specific functions, which are in a separate repository. Each Lambda has its own Python package, with shared components in a common package. Test coverage is extensive; most integration-level scenarios use Moto for AWS mocking (see the sketch after this list). Lightweight Lambdas share a package with different entry points based on the trigger type. The YFinance Lambda is packaged separately due to its dependencies. The repo also includes a CloudFormation template that defines the application's tables, Lambdas, API Gateway configurations, and security roles. CI/CD here includes a test stage for type checking and validation and a deployment stage for packaging and SAM deployments.
  • PyCaret Repo: This houses code for PyCaret-based analysis and data retrieval. The Lambda can be triggered by WebSocket API Gateway or AWS Step Function events. Test coverage is minimal as the focus is on PyCaret invocations. Docker packaging takes about 7 minutes due to PyCaret’s dependencies. To expedite testing, different base images are used for test and package steps. This repo also includes two CloudFormation templates, one for the Lambda and another for the AWS Step Function definition.
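As an illustration of the Moto-based testing approach mentioned above, here is a sketch of an integration-style test that creates a DynamoDB table in memory and exercises it without touching AWS.  It assumes moto 5's mock_aws decorator; the Positions table and its fields are hypothetical.

# Sketch: integration-style test using Moto's in-memory AWS mocks (hypothetical table/fields).
import boto3
from moto import mock_aws


@mock_aws
def test_position_roundtrip():
    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
    table = dynamodb.create_table(
        TableName="Positions",
        KeySchema=[{"AttributeName": "position_id", "KeyType": "HASH"}],
        AttributeDefinitions=[{"AttributeName": "position_id", "AttributeType": "S"}],
        BillingMode="PAY_PER_REQUEST",
    )
    table.put_item(Item={"position_id": "AAPL", "quantity": 100})
    item = table.get_item(Key={"position_id": "AAPL"})["Item"]
    assert item["quantity"] == 100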


Next Steps: Scalability & Security
As the Kupala-Nich platform evolves from a demo to a production product, scaling and security will be crucial focuses. I’ll cover these topics in detail in a future article. If you would like access to the code repo, please reach out to me by email.



Monday, November 11, 2024


Making Machine Learning accessible to individual investors with Kupala-nich





You’ve heard the buzz about AI, and you’ve likely spent time with tools like ChatGPT, finding them useful for general queries and assistance. However, when it comes to analyzing your portfolio or uncovering investment opportunities, they fall short. To bridge this gap, the Kupala-Nich platform brings Machine Learning (ML) models tailored for financial applications within reach of classically trained financial analysts and individual investors.
There are many ML models and frameworks used across industries like healthcare, retail, finance, and manufacturing. Typically, applying these models requires data scientists skilled in data engineering and model selection. AutoML frameworks have recently emerged to simplify ML access by bundling various algorithms into a single package with a low-code API. This has lowered the entry barrier for users with basic Python skills and a foundational grasp of ML workflows, allowing them to train, select, and apply models more easily. A leading open-source AutoML framework, PyCaret, is powerful but tailored to the research community—users familiar with Jupyter notebooks and standard ML steps like data preprocessing, feature engineering, model selection, and hyperparameter tuning. This leaves a gap for financial analysts who primarily work with tools like Excel and are accustomed to spreadsheet formats.
Kupala-nich bridges this gap by making PyCaret and AutoML technology accessible to traditional financial analysts. It does this by integrating rich datasets on broadly held companies with Excel-like tools for data pivoting, filtering, sorting, and custom column derivation, enabling deeper insights and analysis. Users can visually explore and identify columns of interest before kicking off model training with a simple button click. It also allows users to upload their proprietary Excel sheets and datasets.
Behind the scenes, Kupala-nich uses PyCaret’s AutoML features to train multiple models and produce predictive analysis based on the best-performing model. This functionality is delivered via an AWS Lambda deployment. While AWS Lambda might seem like an unconventional choice for deploying PyCaret due to its package size (~1.5 GB), it has proven to be highly cost-effective for typical analysis workloads associated with S&P 500 datasets, with 10 runs costing only around $0.01. For users who require fewer than a thousand runs daily, Lambda is a far more economical solution compared to options like AWS Fargate. This calculus may shift for more intensive workloads, such as time series analysis and backtesting on large datasets.
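To give a flavor of what happens behind that button click, here is a sketch of a generic PyCaret regression flow; it is not the platform's actual code, and the dataset file and target column are hypothetical.

# Sketch: generic PyCaret AutoML flow (hypothetical dataset and target column).
import pandas as pd
from pycaret.regression import setup, compare_models, finalize_model, predict_model

data = pd.read_csv("sp500_fundamentals.csv")      # hypothetical training dataset
setup(data=data, target="forward_return", session_id=42)

best = compare_models()                           # train many candidate models, keep the best performer
model = finalize_model(best)                      # refit the best model on the full dataset
predictions = predict_model(model, data=data)     # predictive analysis based on the best-performing model
print(predictions.head())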

If you're interested in learning more, check out the Financial Analysis tab on Kupala-Nich or reach out to me at maksim.kozyarchuk@gmail.com.

Wednesday, May 13, 2015

Open Source Trade Model



Overview

Over the past couple of months, I’ve written several articles that discuss the Trade Model, Domain Model, and Data Entry. Many of those ideas are now checked in as code on GitHub in the following projects.

NCT-portal -- Front end that demonstrates the type of functionality that can be expected from a Trade Model.  It contains front-end JavaScript that leverages Bootstrap and a Python-based backend that runs on Flask.  It is designed to be deployed on the Elastic Beanstalk web server tier.

NCT-workers -- Contains the business logic and domain model for entering and representing trades.  It is built in Python and is designed to be deployed on the Elastic Beanstalk worker tier.

These projects are made available under the GPL 2 license.


The Goal
   Over the past 15+ years, I’ve had the opportunity to contribute to code around trading, portfolio management, and accounting systems.  Most of the systems I’ve worked with were able to support multiple asset classes side-by-side and enable multiple user groups, including traders, middle office, and accountants, to work off a single copy of the data.  The combination of cross-asset and cross-functional capability in an asset management system is the ZEN of Trading Systems.  It’s the Holy Grail that allows high data quality as well as simple and consistent usability.  This in turn allows better transparency of data and accuracy of reporting.

   However, it is my experience that the vast majority of trading, portfolio management, and accounting systems are not cross-asset and have highly specialized domain/data models, requiring lots and lots of customized “data feeds and compares” between them.

This project has two goals:
  1. Demonstrate a trade model capable of supporting a cross-asset portfolio, enabling a single data store behind multiple asset management applications.
  2. Develop a set of tools and APIs for loading data into and extracting data out of the system.


Commitment

Quality:  Code for this project will be developed test-first and will be monitored by a continuous integration system.  Currently, unit tests are running on Travis-CI (https://travis-ci.org/kozyarchuk/NCT-workers and https://travis-ci.org/kozyarchuk/NCT-portal)

Simple:  Solutions developed in this project will favor simplicity and API consistency over product-specific market conventions

Functional:  Solutions developed in this project will be deployable and runnable via an API.  This project is currently designed to be deployed to AWS Elastic Beanstalk.


Roadmap

There is a significant backlog of functionality that needs to be developed. This backlog is currently maintained in my notes but will be moved to a work management system at some point in time.  The following are some of the key functional goals for the near term.
  • Trade File Management process capable of importing trade executions from several broker platforms
  • Extensions to Trade Model to support functionality discussed here
  • Positions, P&L and Cash Balances
  • Time Series and Daily Mark to Market
  • P&L, Risk and Position reporting
  • Trade Fills, Trade Allocations and commission calculations



Call to Action
Developing an effective Trade Model is a large undertaking, but if done well it can have a profound impact on the technology landscape within the asset management world.  If you are interested in learning how you can contribute, please send me a note at maksim_kozyarchuk@yahoo.com



Monday, May 4, 2015

Implementing SOA on AWS



by  Maksim Kozyarchuk




History of SOA and SOA in the Cloud

   It is a common design pattern to offload processing of business tasks away from the user interface.  This design pattern is at the core of many modern applications and has many benefits including:
  1. Improved responsiveness of user interfaces (web, mobile, or desktop)
  2. Decoupling of presentation and business logic layers, allowing specialization and improving code reuse
  3. Ability to effectively scale applications and gracefully handle bursts in user activity

   A number of technologies have been developed to support this design pattern under the name of Service Oriented Architecture.  The first and most prominent were Java’s J2EE and ESB stacks.  Within the Python ecosystem, Celery, with RabbitMQ or Redis as the transport, is a fairly robust and popular implementation of this design pattern.

  Technologies such as VMWare and Docker make it possible to run an SOA implementation in the cloud with little architectural change.   However, the current generation of cloud computing solutions, led by AWS, includes an SOA toolkit as part of the managed offering, and it’s worth looking at how these compare to self-managed SOA deployments.   In this article, I will walk through the setup of an Elastic Beanstalk Worker Tier environment and discuss how to use it as the backbone of a Service Oriented Architecture in the cloud.


Components of EB Worker Tier

Elastic Beanstalk’s Worker Tier builds on top of the Elastic Beanstalk model for deployment, load balancing, scaling, and health monitoring.   It adds a daemon process that runs within each EB instance.  That daemon process pulls requests off a preconfigured SQS queue and sends them as HTTP POST requests to a local URL.  This simple add-on to the EB stack effectively provides an SOA platform that supports:
  1. Rolling upgrades of the underlying software, enabling A/B testing and zero downtime
  2. Automatic scaling and load balancing of the workers
  3. Configurable health monitoring of the application
  4. Use of HTTP as the invocation method, making the platform largely technology agnostic

Furthermore, the EB Worker Tier also supports a dead letter queue, allowing for troubleshooting and replay of failed messages.
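To make the mechanics concrete, here is a minimal Flask sketch of a worker endpoint that the SQS daemon could POST messages to.  The route path and payload handling are hypothetical placeholders; the daemon treats an HTTP 200 response as successful processing, so any other status leaves the message on the queue for retry.

# Sketch: a Flask endpoint for the EB Worker Tier daemon to POST SQS messages to.
# The route path and payload fields are hypothetical placeholders.
from flask import Flask, request

application = Flask(__name__)   # Elastic Beanstalk's Python platform looks for an object named "application"


@application.route("/process-task", methods=["POST"])
def process_task():
    task = request.get_json(force=True)   # body of the SQS message posted by the daemon
    # ... dispatch to business logic based on the task payload ...
    print("processing task:", task)
    return "", 200                        # 200 tells the daemon the message was handled


if __name__ == "__main__":
    application.run(host="0.0.0.0", port=8080)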

The recent addition of scheduling capabilities to the EB Worker Tier adds a powerful capability: seamless automation of simple repetitive tasks without any additional infrastructure.


Implementing EB Worker Tier

Implementation of Elastic Beanstalk’s Worker Tier is straightforward and is well documented here.  The only component that was not clearly documented was the needed permissions.  To get the right permissions, I ended up adding the AmazonSQSFullAccess and CloudWatchFullAccess policies to the aws-elasticbeanstalk-ec2-role role used to run my Worker Tier components.
   

Final thoughts  

Cloud service providers have done a great job of accommodating existing SOA platforms in the cloud.  However, when considering a move to the cloud, it is worthwhile to look at the new technologies that cloud providers bring.  They will likely help you solve many long-standing functionality gaps within your SOA platform.  Another exciting technology emerging in the AWS stack is Lambda, which further abstracts the SOA platform.  However, Lambda is still fairly new and limiting, supporting only Node.js-based deployments.