PrivAIcy Matters: Balancing Privacy, Price, and Performance in In-House LLM Deployments

Welcome to the enthralling universe of Generative AI! A family of artificial intelligence algorithms with the ability to conjure fresh content from existing data, this technology has quietly become the new frontier for a plethora of industries, including tech, banking, and media. And, no surprises here, the technological world has embraced it in myriad innovative ways.

Let's talk about the recent sensation that has been taking the digital world by storm: ChatGPT. An innovation by OpenAI, this AI-powered chatbot, backed by notable investors like Microsoft, boasts advanced conversational capabilities that have caught the world's attention. The technology has been smoothly integrated into Microsoft's Bing search engine and Edge browser, with further plans to weave it into “every layer of the stack,” as revealed by CEO Satya Nadella.

The narrative took a new twist in March 2023 when OpenAI introduced GPT-4, the latest model in the series that powers ChatGPT, currently available to a circle of subscribers and developers. And the competition is heating up! Alphabet, for instance, is on the brink of launching a conversational AI service, aptly named Bard.

A Peek into Privacy Concerns with OpenAI

However, when dealing with sensitive data, we wade into a more complex scenario. Imagine your medical history or bank account details being exposed; these are details most of us prefer to keep under a veil of privacy. The corporate world is no stranger to this necessity either, where sensitive assets like proprietary research data need to be handled with kid gloves.

A few headlines to ponder:

ChatGPT under investigation by Canadian privacy watchdog

Germany considers following Italy in banning ChatGPT

ChatGPT bug leaked payment data, conversation titles of users, confirms OpenAI

OpenAI gives in to Italy's data privacy demands, ending ChatGPT ban

Exploring the In-House Model 🏠

Here's an alternative: an in-house AI model. It does not come without its own set of challenges, however. This approach demands a dedicated team of data scientists to maintain and update the model, making it a resource-intensive strategy. But there's a silver lining: adopting an in-house model gives you a superior level of control over data and its processing, offering a secure option for handling sensitive information.

Yet the dilemma persists: re-training will be confined to your private dataset unless your data scientists take on the task of merging the LLM with updates from public sources, so the model benefits from external enhancements without any internal data leaving your network.

Navigating the Path to Deploy Open Source LLMs Privately

And so, we arrive at the million-dollar question: How does one deploy open source LLMs on a private network? 🔐

Generative AI applications often utilize common components like embedding models, vector stores, and LLMs. Tech aficionados following Bluetick Consultants' generative AI journey may already have a grasp on utilizing open-source vector stores and embedding models. The real puzzle, however, lies in deploying LLMs while accounting for latency and cost.

    1. Harnessing Amazon SageMaker's Might for LLM Deployment

    Amazon SageMaker, celebrated as a fully-managed machine learning service, presents a seamless horizon where data scientists and developers can not only effortlessly build and train machine learning models but also usher them into a production-ready hosted environment. It arms you with an integrated Jupyter authoring notebook instance, ensuring smooth access to your data sources for exploration and analysis, all while liberating you from the shackles of server management. Moreover, it extends a plethora of common machine learning algorithms, optimized to perform with utmost efficiency against sizeable data in a distributed environment.

    For additional information, see Amazon SageMaker developer resources.

    • Step 1: Craft Your Domain and Step into the Studio
    • Start by creating a domain and opening the Studio, then navigate to the SageMaker JumpStart option. For today's journey, we'll deploy the Llama 2 7B model.

      📌 A Quick Detour: Why the Llama 2 7B model, you ask? The face-off among LLMs (GPT 3.5 Turbo, Llama 2 7B, and Falcon 7B) tells the story clearly. Dive into “The LLM Face-Off” to explore the nuances.

      The LLM Face-Off: GPT 3.5 Turbo, Llama 2 7B, and Falcon 7B

      SageMaker defaults the instance type to ml.g5.2xlarge; you can downgrade only as far as ml.g5.xlarge, or upgrade to, say, ml.g5.4xlarge or ml.g5.8xlarge to minimize latency.

    • Step 2: Open and Operate Your Notebook
    • Within SageMaker Studio, open a notebook. Each notebook runs on its own instance type, with the default being ml.t3.medium.

    • Step 3: Examine the Model and Endpoint
    • Be sure to shut down the Studio when it's idle to conserve resources. Navigate through SageMaker's left menu and select “Inference.” Here, your deployed model, endpoint, and related configurations stand ready for exploration.

    • Step 4: Time for Cleanup
    • Especially crucial for those in the testing or research phase: don't overlook cleanup. Under “Inference,” wipe the slate clean by deleting both the model and the endpoint (along with its configuration) so you aren't billed for idle resources. A scripted sketch of invoking and then tearing down the endpoint follows these steps.
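
      If you prefer to script the invocation and cleanup instead of clicking through the console, here is a minimal boto3 sketch. The endpoint, endpoint-config, and model names below are placeholders (take the real names from the “Inference” pages), and the request payload shape can vary with the container version, so treat it as an assumption to adjust.

       import json
       import boto3

       # Placeholders: copy the actual names from the "Inference" pages in SageMaker Studio
       ENDPOINT_NAME = "<your-llama-2-7b-endpoint>"
       ENDPOINT_CONFIG_NAME = "<your-endpoint-config>"
       MODEL_NAME = "<your-model>"

       # Invoke the deployed endpoint (the payload schema may differ per container version)
       runtime = boto3.client("sagemaker-runtime")
       response = runtime.invoke_endpoint(
           EndpointName=ENDPOINT_NAME,
           ContentType="application/json",
           Body=json.dumps({"inputs": "What is generative AI?",
                            "parameters": {"max_new_tokens": 128}}),
           CustomAttributes="accept_eula=true",  # Llama 2 requires accepting Meta's EULA
       )
       print(json.loads(response["Body"].read()))

       # Step 4 cleanup: delete the endpoint, its configuration, and the model
       sm = boto3.client("sagemaker")
       sm.delete_endpoint(EndpointName=ENDPOINT_NAME)
       sm.delete_endpoint_config(EndpointConfigName=ENDPOINT_CONFIG_NAME)
       sm.delete_model(ModelName=MODEL_NAME)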

    2. Empowering Deployment with RunPod

    RunPod is a cloud computing platform meticulously sculpted for AI and machine learning applications. With offerings like GPU Instances, Serverless GPUs, and AI Endpoints, RunPod is steadfast in its mission: making cloud computing not only accessible and affordable but also rich in features and user-friendly. The platform aims to arm both individuals and enterprises with pioneering technology, unlocking the colossal potential that AI and cloud computing bring to the table.
    Embark with us on a journey to deploy the Llama 2 7B model using RunPod, ensuring every step is unveiled, from account creation to API utilization.

    • Step 1: Begin with Basics - Account Creation & API Key Generation
    • Commence by creating your RunPod account and weaving through the settings page to generate an API key. It's crucial to note that RunPod operates as a prepaid service, so let's kick things off by adding $10 to get the ball rolling.

    • Step 2: Navigate the “Secure Cloud”
    • Find your way to the “Secure Cloud” tab, where you'll be greeted by a myriad of GPU servers, available with either on-demand or monthly options to cater to your specific needs.

      For deploying the Llama 2 7B model, we'll use an NVIDIA RTX A5000 GPU instance with 24 GB VRAM, 27 GB RAM, and 7 vCPUs.

    • Step 3: Deploying the Llama 2 7B Model - Let's Get Technical with Python
    • Now, let's get our hands a bit dirty with the code. We'll deploy the Llama 2 7B model using Python, and here's how we'll do it:

      Install the Needed Packages:

                                  
       !pip install --quiet requests==2.31.0 --progress-bar off
       !pip install --quiet runpod --progress-bar off
                                  
                              

      Import the relevant modules

                                  
       import requests
       import runpod
                                  
                              

      Add your API key generated in step 1

                                  
       runpod.api_key = "Add your Runpod API Key"  # ideally load this from an environment variable rather than hardcoding it
                                  
                              

      Create a pod on RunPod

                                  
       gpu_count = 1

       # Launch a pod running the Hugging Face text-generation-inference container
       # serving the Llama 2 7B chat model on a secure-cloud RTX A5000
       pod = runpod.create_pod(
           name="Llama-7b-chat",
           image_name="ghcr.io/huggingface/text-generation-inference:0.9.4",
           gpu_type_id="NVIDIA RTX A5000",
           cloud_type="SECURE",
           docker_args="--model-id TheBloke/Llama-2-7b-chat-fp16",
           gpu_count=gpu_count,
           volume_in_gb=50,            # persistent volume for the model weights
           container_disk_in_gb=5,
           ports="80/http,29500/http",
           volume_mount_path="/data",
       )
                                  
                              

      To access the API via Swagger Docs, utilize the URL, as shown below.

                                  
       SERVER_URL = f'https://{pod["id"]}-80.proxy.runpod.net'
       print(SERVER_URL)
                                  
                              

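      Once the pod is running (it may take a few minutes to pull the image and download the weights), the text-generation-inference container exposes a /generate route on the URL above. Here is a minimal sketch of querying it and then terminating the pod when you are done; the prompt and generation parameters are illustrative only.

       import requests

       # Query the text-generation-inference /generate endpoint exposed by the pod
       payload = {
           "inputs": "Explain the advantages of hosting an LLM privately.",
           "parameters": {"max_new_tokens": 200, "temperature": 0.5},
       }
       resp = requests.post(f"{SERVER_URL}/generate", json=payload)
       print(resp.json()["generated_text"])

       # Terminate the pod when finished so you stop incurring charges
       runpod.terminate_pod(pod["id"])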

    3. LLM Deployment on Your Private, GPU-Enabled Server! 🚀

    In the world of AI and generative algorithms, privacy isn't just a luxury; it's an imperative. When privacy and control are paramount, hosting LLM models on a private server provides the extra layer of security and customization that businesses often need.

    Why a GPU-Enabled Server?

    GPUs (Graphics Processing Units) are designed to handle multiple tasks simultaneously. They've been the powerhouses behind the computational might needed for high-intensity tasks like deep learning. LLM models, given their intricate architecture and the vast amounts of data they process, can benefit immensely from the parallel processing capabilities of GPUs.
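
    Before deploying anything, it's worth confirming that the server's GPU is actually visible to your deep learning stack. A minimal sketch with PyTorch (assuming PyTorch is your framework of choice):

       import torch

       # Verify that a CUDA-capable GPU is visible before loading any model
       if torch.cuda.is_available():
           print(f"GPU detected: {torch.cuda.get_device_name(0)}")
       else:
           print("No GPU detected; inference would fall back to much slower CPU execution")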

    So, let's unravel the steps to establish this deployment.

    • Step 1: LLM Model - Download It!
    • Begin by downloading your chosen LLM model. Ensure that the model aligns with your objectives and has been validated for its performance and reliability.

                                  
       import torch
       from transformers import AutoTokenizer, AutoModelForCausalLM

       # Define model ID
       model_id = "TheBloke/Llama-2-7b-chat-fp16"

       # Load tokenizer
       tokenizer = AutoTokenizer.from_pretrained(model_id)

       # Load model
       model = AutoModelForCausalLM.from_pretrained(
           model_id,
           cache_dir="/opt/workspace/",
           torch_dtype=torch.bfloat16,
           trust_remote_code=True,
           device_map="auto",
           offload_folder="offload",
       )
                                  
                              

      OR

                                  
       from huggingface_hub import hf_hub_download

       model_name_or_path = "TheBloke/Llama-2-7b-chat-fp16"
       model_basename = "llama-2-7b-chat.ggmlv3.q5_0.bin"  # the model is in .bin (GGML) format

       # Download the quantized model file from the Hugging Face Hub
       model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)
                                  
                              
    • Step 2: Load Up with a Tokenizer or CTransformers
    • With the LLM model in your possession, employ a tokenizer (via a Transformers pipeline) or CTransformers to load it. These tools prepare your text data in a format that's comprehensible and usable for the LLM, ensuring that the model interprets the input effectively to generate the desired outputs.

                                  
       from langchain.llms import CTransformers
       llm = CTransformers(
               model = model_path,
               model_type="llama",
               max_new_tokens = 512,
               temperature = 0.5
           )
                                  
                              

      OR

                                  
       import transformers

       # Wrap the model and tokenizer loaded in Step 1 in a text-generation pipeline
       pipeline = transformers.pipeline(
           "text-generation",
           model=model,
           tokenizer=tokenizer,
           device_map="auto",
           max_length=400,
           do_sample=True,
           top_k=10,
           num_return_sequences=1,
           eos_token_id=tokenizer.eos_token_id,
       )
                                  
                              
    • Step 3: Bridge the Gap with a Framework
    • Introduce a framework designed to form a stable and efficient connection between the LLM and your application code. A popular choice in this context is LangChain, renowned for wrapping LLMs behind a consistent interface and making it straightforward to compose prompts, chains, and pipelines around them.

                                  
       from langchain.prompts import PromptTemplate
       from langchain.chains import LLMChain
       from langchain.llms import HuggingFacePipeline

       # Set up the prompt template
       template = PromptTemplate(input_variables=["input"], template="{input}")

       # Pass the Hugging Face pipeline to the LangChain wrapper class
       llm = HuggingFacePipeline(pipeline=pipeline, model_kwargs={"temperature": 0})

       # Build the stacked LLM chain, i.e. prompt formatting + LLM
       chain = LLMChain(llm=llm, prompt=template)
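
      As a quick sanity check, you can invoke the chain directly; the prompt below is just an illustrative placeholder:

       # Run the prompt-formatting + LLM chain on a sample input
       print(chain.run("Summarize the benefits of hosting an LLM on a private server."))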
                                  
                              

      Alternatively, you can build an API.

                                  
       from flask import Flask, request, jsonify

       app = Flask(__name__)

       @app.route("/generate_response", methods=["POST"])
       def generate_response():
           try:
               data = request.get_json()
               user_prompt = data["user_prompt"]

               # Generate a response using the LLMChain built above
               response = chain.run(user_prompt)

               return jsonify({"response": response})

           except Exception as e:
               return jsonify({"error": str(e)}), 500
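
      With the Flask app running (for example via app.run(host="0.0.0.0", port=5000)), any client on the private network can call the endpoint. A minimal sketch, assuming the server is reachable on localhost port 5000:

       import requests

       # Hypothetical local endpoint; adjust the host and port to your deployment
       resp = requests.post(
           "http://localhost:5000/generate_response",
           json={"user_prompt": "What are the privacy benefits of an in-house LLM?"},
       )
       print(resp.json()["response"])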
                                  
                              

      Eager to explore more? Dive deeper with the following resources where similar deployments have been carved out meticulously:

      Falcon LLM in Action: A Step-by-Step Tutorial

      Exploring Llama 2: From Installation to Interaction

      In these resources, you'll find a treasure trove of insights, strategies, and step-by-step guidelines to navigate the exciting world of LLM deployments.

Pricing

Ah, pricing! The crucial decider when it comes to choosing a platform for deploying your LLM. As your guide through the terrain of LLM deployments, we're here to dissect the costs related to deploying open-source LLMs using different cloud platforms, ensuring that your AI adventures don't burn a hole in your pocket.

Let's delve into the nitty-gritty by comparing the pricing, specification, and latency across different platforms:

Cloud Service | Instance Type | Specification | Latency | Cost
SageMaker | ml.g5.2xlarge | Accelerated computing, 8 vCPU, 32 GiB memory | 1 to 2 secs | On demand: $1.515/hour (~$1,090/month)
RunPod | NVIDIA RTX A5000 | 24 GB VRAM, 27 GB RAM, 7 vCPU | 2 to 3 secs | On demand: $0.44/hour (~$316/month)
OVHcloud | ai1-le-1-GPU | NVIDIA Tesla V100S, 32 GiB, 16.4 TFLOPS, 13 vCore, 40 GiB memory | - | $1.01/hour (~$727/month)
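
The monthly figures follow directly from the hourly rates, assuming the instance runs around the clock for a 30-day month (roughly 720 hours); a quick sketch of that conversion:

    # Approximate monthly cost for an always-on instance: hourly rate x 720 hours (30 days x 24 hours)
    HOURS_PER_MONTH = 720
    for provider, hourly_rate in [("SageMaker", 1.515), ("RunPod", 0.44), ("OVHcloud", 1.01)]:
        print(f"{provider}: ~${hourly_rate * HOURS_PER_MONTH:,.0f} per month")
    # Prints roughly $1,091, $317, and $727, matching the table above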

Choosing the right platform essentially boils down to balancing your budget against your need for speed (latency) and computational power (specifications). It's imperative to ponder upon:

  • Will the latency impact user experience?
  • Is the computational power sufficient to manage the workload?
  • Does the cost align with your budget allocations?

Always factor in additional costs like data transfer, storage, and potential overage charges to get a holistic view of your total expenditure.

Conclusion

Embarking upon the intricate pathways of deploying LLMs leads us to the intersection of innovation, pragmatism, and prudent financial planning. With each model we deploy, a new chapter of digital communication unfolds, steering us toward a future where our virtual interactions are increasingly sophisticated, personalized, and secure. It's vital to remember that every algorithm we bring to life in the silent depths of our private networks not only empowers our digital endeavors but also holds the mirror to our ethical and strategic choices in the enthralling realm of Generative AI.

“In every byte of data, there's a story yet to be told; let's write it wisely, let's write it boldly.”
