Share Large Amounts of Live Data With Delta Sharing and Docker

Written by frankmunz | Published 2021/09/03
Tech Story Tags: delta-lake | pandas | apache-spark | open-source | linux-foundation | machine-learning | python | programming

TLDR: Delta Sharing is an independent open source project that makes it easy to share massive amounts of live data with other organizations in an open and secure way. This article describes a smoke test of the new Docker-based Delta Sharing server. Delta Sharing works across vendors, system boundaries, and clouds, and it follows the lake-first approach: with Delta Lake, you can share data from your existing data lake as Delta tables.

Delta Sharing is an independent open source project that makes it easy to share massive amounts of live data with other organizations openly and securely. This article describes a smoke test of the new Docker-based open-source Delta Sharing server.

Introduction

Delta Sharing works across vendors, system boundaries, and clouds. It uses an open format and is open source.

Any pandas or Apache Spark client can read the data. Many commercial clients support it too. Delta Sharing is a simple yet powerful and elegant way of sharing your data. And without anything proprietary, it is naturally multi-cloud.

Delta Sharing Linux Foundation

At the time of writing, Delta Sharing version 0.2.0 had just been released.

My favorite feature of this release is the new Docker image of the sharing server. The Docker image gives you more flexibility and options for running the server.

Delta Sharing using Docker

Delta Sharing Docker Image

Let’s look at the Docker image first, which is published on Docker Hub. Assuming you have Docker installed, you can start Delta Sharing with the following command:

docker run -p 9999:9999 \
--mount type=bind,source=/home/ec2-user/config.yaml,target=/config/delta-sharing-server-config.yaml \
deltaio/delta-sharing-server:0.2.0 -- \
--config /config/delta-sharing-server-config.yaml

Docker then retrieves the image if necessary and runs a container.

The -p parameter maps the Delta Sharing server port to the same port on my local machine; I am using 9999. The --mount option makes the config.yaml from my local machine visible inside the container. See the discussion below about how the Delta Sharing server gets access to the cloud object storage.

Configuration config.yaml

The config.yaml file specifies the port (9999) and the endpoint (/delta-sharing) the sharing server listens on, whether it uses a bearer token, and so on.

The most important configurations are the shares, schemas, and tables offered by the sharing server from your Lakehouse. Delta Sharing follows the lake first approach.

With Delta Lake, you can share data from your existing data lake as Delta tables. For this article, I decided to share car data stored in an S3 bucket named deltafm2805.

# The format version of this config file
version: 1
# Config shares/schemas/tables to share
shares:
- name: "share1"
  schemas:
  - name: "default"
    tables:
    - name: "cars"
      location: "s3a://deltafm2805/cars"
 

#authorization:
#  bearerToken: "1234"

host: "localhost"
port: 9999
endpoint: "/delta-sharing"

S3 Storage

To check the data in the S3 bucket, you can use the AWS CLI to list the bucket deltafm2805. Note that the bucket also contains other data (the range/ prefix) that is not shared because it is not listed in config.yaml:

$ aws s3 ls deltafm2805
                           PRE cars/
                           PRE range/

Authorization

The Delta Sharing server requires authorization to access S3. There are various ways to enable this.

  1. Technically, you could pass the AWS access key ID and secret access key as environment variables using -e with docker run.

    docker run -p 9999:9999 \
    -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID  \
    -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY  \
    ...
    

    I don’t recommend doing this since anyone on the host machine with the privilege to run a simple ps command will be able to see your AWS credentials.

  2. A better way is to pass the credentials to the instance via the EC2 user-data mechanism.

  3. Personally, I recommend creating an EC2 instance profile in IAM that grants the instance access to S3. Then assign that role to the instance as shown in the screenshot below.

    This approach eliminates the need to put keys on the instance or into its user-data configuration. Where there are no keys, none can be compromised. A rough CLI sketch of this approach follows after this list.
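For illustration, option 3 could look roughly like the following AWS CLI sketch. The role, profile, and instance IDs are placeholders, the trust policy file is not shown, and in practice you would scope the S3 permissions down to the shared bucket instead of using the broad read-only policy:

# Create a role that EC2 instances are allowed to assume (trust policy not shown)
aws iam create-role --role-name delta-sharing-s3-role \
  --assume-role-policy-document file://ec2-trust-policy.json

# Grant the role read access to S3
aws iam attach-role-policy --role-name delta-sharing-s3-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

# Wrap the role in an instance profile and attach it to the EC2 instance
aws iam create-instance-profile --instance-profile-name delta-sharing-profile
aws iam add-role-to-instance-profile --instance-profile-name delta-sharing-profile \
  --role-name delta-sharing-s3-role
aws ec2 associate-iam-instance-profile --instance-id i-0123456789abcdef0 \
  --iam-instance-profile Name=delta-sharing-profile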

Verify it Works!

Profile File

To verify that Delta Sharing works as expected, retrieve the shared car data set and display a filtered subset. The code uses the aws.share profile file, which defines the endpoint and bearer token of the server. The content of that JSON file looks as follows:

{
 "shareCredentialsVersion": 1,
 "endpoint": "http://localhost:9999/delta-sharing",
 "bearerToken": "" 
}

Python Receiver Client

In the receiver code, create a delta_sharing SharingClient and use it to read the data as a pandas data frame with load_as_pandas.

import delta_sharing

# Point to the profile file. 
profile_file = "aws.share"

# Create a SharingClient.
client = delta_sharing.SharingClient(profile_file)

# List all shared tables.
print(client.list_all_tables())

# load data as pandas dataframe (or Spark if you prefer that)
table_url = profile_file + "#share1.default.cars"
cars = delta_sharing.load_as_pandas(table_url)

Then use pandas to filter the data to retrieve all Volkswagen cars:

cars[cars['car_make']=="Volkswagen"]

In my case, the data set contains two Volkswagens, and the filtered data frame returns exactly those two rows.

Under the Hood

Pre-signed URLs

Delta Sharing uses pre-signed, short-lived URLs; therefore, data is retrieved at the speed of the cloud object storage. Throughput and bandwidth are not limited by the sharing server.

Delta Sharing REST API

The example above used pandas after installing the necessary connector with pip. The delta-sharing Python connector is implemented on top of the Delta Sharing REST API documented on GitHub.
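If you want to reproduce this, the connector is available on PyPI:

pip install delta-sharing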

If you want to explore accessing the API directly, you could, e.g., use the following command to list all tables in the default schema of share1:

curl -v -H 'Accept: application/json' 'http://localhost:9999/delta-sharing/shares/share1/schemas/default/tables'

The output then looks as follows:

{"items":[{"name":"cars","schema":"default","share":"share1"}

Unless you want to build your own integration with Delta Sharing, you won’t need to use the REST API directly.

Multiple Delta Lakes

A single Delta Sharing server can be set up to serve data from several Delta Lakes. Make sure to list them in the server's config.yaml and to grant the EC2 instance profile IAM role explained above access to all of them. A configuration sketch follows below.
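As an example, the shares section of config.yaml could point to tables in different buckets (or even different clouds). The second table and bucket below are made-up placeholders:

shares:
- name: "share1"
  schemas:
  - name: "default"
    tables:
    - name: "cars"
      location: "s3a://deltafm2805/cars"
    - name: "trips"
      location: "s3a://my-second-delta-lake/trips"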

Other Options and Conclusion

At the moment, there are three main options to deploy Delta Sharing:

  1. Host your own Delta Sharing server, e.g., on your laptop or a cloud instance. If you want to get started but feel uncomfortable with Docker, I’d recommend this way. The pre-built Delta Sharing server package can be downloaded from the project's GitHub releases.

  2. Run your sharing server using Docker, as explained in this article. This isn’t any more difficult than running the server without Docker, and with Docker it is easy to always run the newest image of the sharing server. So I recommend this option for running your own server.

    [If you don’t feel comfortable with Docker, ask a good friend to show you the basics and take a self-paced training course].

  3. Share data using Delta Sharing from a Databricks notebook. This is by far the easiest solution because it eliminates “the heavy lifting.”

    Delta Sharing is built into your Databricks workspace, so you don’t need to worry about installing, configuring, operating, patching, and monitoring the sharing server.

    Using a Databricks notebook, everything can be done with simple SQL statements. You create a share, add a table to the share, create a recipient, and finally grant the recipient access to that share (a rough sketch of these SQL statements follows below).
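As an illustration of those four steps, the SQL looks roughly like this. The share and recipient names are made-up placeholders, and the exact syntax may vary with your Databricks release:

CREATE SHARE IF NOT EXISTS car_share;
ALTER SHARE car_share ADD TABLE default.cars;
CREATE RECIPIENT IF NOT EXISTS partner_org;
GRANT SELECT ON SHARE car_share TO RECIPIENT partner_org;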

More?


Please like this article and share it on social media if you enjoyed reading it. For more cloud-based data science, data engineering, and AI/ML content, follow me on Twitter (or LinkedIn).


Written by frankmunz | Databricks DevRel EMEA. AWS & GCP, Big Data & ML certified. My Uber rating was 4.9 before the world shut down.
Published by HackerNoon on 2021/09/03