"What is the best practice for managing identities in Databricks?" is one of the most common questions a new Databricks customer will ask the Databricks implementation team. The answer is always "It depends", because it depends on the size of your organization, the number of users, the number of groups, and so on. If you keep pushing for a more specific answer, it will most likely be "use the Databricks UI" for small organizations and "use Terraform or SCIM provisioning" for large organizations.
Unfortunately, neither answer is satisfying. The first one is not scalable and the second one is not flexible, given the nature of our workflow. I am writing this article to share my experience on this topic.
In order to understand the solution, we need to understand the problem first.
Why not the Databricks UI? That should be self-explanatory: it is not scalable and leaves a lot of room for human error. The UI makes sense for exploration and testing, but not for production.
Why not Terraform or SCIM provisioning? In my opinion, Terraform is for infrastructure as code, not for identity management. It is possible to use it for this purpose, but you will end up with many data resource objects, which makes the infrastructure code hard to understand and maintain.
SCIM provisioning is a better option than Terraform because it is designed for identity management. It is also the solution we planned to adopt in the first place. However, we did not go with it because of the communication overhead. In our workflow, group creation and group permissions are handled by Terraform. Using SCIM provisioning means letting the identity provider take the driver's seat, which might result in the group configuration from Terraform being overwritten. In addition, our team has no visibility into the identity provider configuration, which makes it hard to debug when something goes wrong.
I think SCIM provisioning can be a good solution when everything is stable and a dedicated team is responsible for identity management. But we are not there yet, so we need a better solution.
After a few weeks of research, I found a better way to manage identities in Databricks. In a nutshell, the solution is to use a Databricks notebook to manage users and groups. The notebook reads a YAML file that contains the user and group information and calls the Databricks API to add or remove users from each group. The benefit of this solution is that it is flexible, scalable, and transparent, and everything stays inside Databricks.
Nothing is required per se. However, I do have some pro tips for you if you are considering adopting this solution.
The YAML file contains the user and group information. It can be in any format you feel comfortable with. I only keep the necessary information in the YAML file, because you can get the rest from the Databricks API.
team_1:
  members:
    - user1@email.com
    - user2@email.com
team_2:
  members:
    - user1@email.com
    - user2@email.com
This is the pseudocode of what the notebook does; I will go over each step in detail.
import all the required libraries
get the token from somewhere safe
get Databricks API token
call users API to get all the users information
call groups API to get all the group information
read the YAML file
loop through group information to add/remove users
This code block is self-explanatory.
import json
import requests
import yaml
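The later snippets also rely on a shared requests session and a couple of account-level constants that the imports alone do not show. Here is a minimal sketch, assuming the AWS account console endpoints (Azure and GCP use different hosts) and a placeholder account ID; adjust these to your environment.
# Placeholder: your Databricks account ID from the account console.
DATABRICKS_ACCOUNT_ID = "<your-account-id>"

# Assumed base URLs for the AWS account console; adjust for your cloud.
DATABRICKS_OAUTH_BASE_URL = "https://accounts.cloud.databricks.com/oidc/accounts"
DATABRICKS_ACCOUNT_LEVEL_BASE_URL = "https://accounts.cloud.databricks.com/api/2.0/accounts"

# One session reused for every API call in this notebook.
session = requests.Session()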
I am using Databricks secret management to store these credentials, to reduce the dependency on other tools.
git_token = dbutils.secrets.get(  # noqa: F821
    scope="github",
    key="github_token",
)
databricks_sp_client_id = dbutils.secrets.get(  # noqa: F821
    scope="databricks_sp",
    key="client_id",
)
databricks_sp_client_secret = dbutils.secrets.get(  # noqa: F821
    scope="databricks_sp",
    key="client_secret",
)
I am using a service principal token to handle all API calls. This prevents the workflow from breaking because the end user who set it up is no longer in the organization.
For other authentication methods, you can find more information on Authentication for Databricks automation - overview.
response = session.post(
    url=f"{DATABRICKS_OAUTH_BASE_URL}/{DATABRICKS_ACCOUNT_ID}/v1/token",
    auth=(databricks_sp_client_id, databricks_sp_client_secret),
    data={
        "grant_type": "client_credentials",
        "scope": "all-apis",
    },
    timeout=10,
)
if response.status_code == 200:
    # Success!
    access_token = response.json()["access_token"]
else:
    # Something went wrong
    print(response.status_code)
    print(response.content)
Generate a dictionary with the email as the key and the user ID as the value, so we can use it to translate the YAML file.
response = session.get(
    url=f"{DATABRICKS_ACCOUNT_LEVEL_BASE_URL}/{DATABRICKS_ACCOUNT_ID}/scim/v2/Users",
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=10,
)
active_users_dict = {
    row["emails"][0]["value"]: row["id"]
    for row in response.json()["Resources"]
    if row["active"]
}
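One caveat worth noting: the SCIM Users endpoint is paginated, so a single GET may not return everyone in a large account. Here is a hedged sketch that pages through the results with the standard SCIM startIndex and count parameters; double-check the maximum page size allowed for your account.
active_users_dict = {}
start_index = 1
page_size = 100

while True:
    # Fetch one page of users at a time (SCIM 2.0 paging parameters).
    page = session.get(
        url=f"{DATABRICKS_ACCOUNT_LEVEL_BASE_URL}/{DATABRICKS_ACCOUNT_ID}/scim/v2/Users",
        headers={"Authorization": f"Bearer {access_token}"},
        params={"startIndex": start_index, "count": page_size},
        timeout=10,
    ).json()
    resources = page.get("Resources", [])
    for row in resources:
        if row["active"]:
            active_users_dict[row["emails"][0]["value"]] = row["id"]
    # Stop once the last (partial or empty) page has been processed.
    if len(resources) < page_size:
        break
    start_index += page_size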
The groups call works the same way. The dictionary key is the group name, and the value holds the group ID and the member IDs.
response = session.get(
    url=f"{DATABRICKS_ACCOUNT_LEVEL_BASE_URL}/{DATABRICKS_ACCOUNT_ID}/scim/v2/Groups",
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=10,
)
groups_dict = {}
for row in response.json()["Resources"]:
    groups_dict[row["displayName"]] = {
        "id": row["id"],
        "members": {user["value"] for user in row.get("members", [])},
    }
As I mentioned before, it is highly recommended to store the YAML file in a repository. This makes the workflow more transparent and makes it easier to share the provisioning responsibility with other teams, since end users do not need Databricks access to update the YAML file.
After the YAML file is read, we need to convert the emails to user IDs so we can communicate with the Databricks API.
res = session.get(
    url=(
        f"https://api.github.com/repos/{user_name}/{repo_name}/contents/"
        f"{the_file_path}/config.yml"
    ),
    headers={
        "Authorization": f"Bearer {git_token}",
        "Accept": "application/vnd.github.v3.raw",
    },
    timeout=10,
)
if res.status_code == 200:
    config = yaml.safe_load(res.content.decode("utf-8"))
    for _, value in config.items():
        value["members"] = (
            {
                active_users_dict[member]
                for member in value["members"]
                if member in active_users_dict
            }
            if value["members"]
            else set()
        )
A basic set operation finds the delta between the group members from the Databricks API and those in the YAML file. Then we call the PATCH API to update the group.
for key, value in groups_dict.items():
    if key in config:
        group_name = key
        group_id = value["id"]
        members_from_databricks_api = value["members"]
        members_from_yaml = config[key].get("members", set())
        print(f"Check {group_name}")
        operations = []
        if len(members_from_databricks_api - members_from_yaml):
            operations.append(
                {
                    "op": "remove",
                    "path": "members",
                    "value": [
                        {"value": member}
                        for member in members_from_databricks_api - members_from_yaml
                    ],
                }
            )
        if len(members_from_yaml - members_from_databricks_api):
            operations.append(
                {
                    "op": "add",
                    "path": "members",
                    "value": [
                        {"value": member}
                        for member in members_from_yaml - members_from_databricks_api
                    ],
                }
            )
        if operations:
            print(f"[{group_name}] delta found: {operations}")
            url = (
                f"{DATABRICKS_ACCOUNT_LEVEL_BASE_URL}/{DATABRICKS_ACCOUNT_ID}"
                f"/scim/v2/Groups/{group_id}"
            )
            payload = {
                "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
                "Operations": operations,
            }
            response = session.patch(
                url,
                data=json.dumps(payload),
                headers={"Authorization": f"Bearer {access_token}"},
                timeout=10,
            )
            if response.status_code == 200:
                print(f"[{group_name}] provision is finished.")
            else:
                # Something went wrong
                print(response.status_code)
                print(response.content)
With this solution, my team spends less time on Databricks administration work. Even when a new group is created to support a new project, we can provision it for the end users who need it right away, because we also wired this job into our CI/CD pipeline to fully automate the process.
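For reference, here is a minimal sketch of what that CI/CD trigger can look like, assuming the notebook is attached to a Databricks job and the pipeline has a workspace token available. The workspace URL, token environment variable, and job ID below are placeholders, not values from our setup.
import os

import requests

# Hypothetical placeholders: your workspace URL and the job ID of the provisioning notebook job.
DATABRICKS_WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
PROVISIONING_JOB_ID = 123456789

# Trigger the provisioning job from the pipeline via the Jobs run-now API.
run = requests.post(
    url=f"{DATABRICKS_WORKSPACE_URL}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"job_id": PROVISIONING_JOB_ID},
    timeout=10,
)
run.raise_for_status()
print(f"Triggered provisioning run {run.json()['run_id']}")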
As you can see, this solution is built on top of the Databricks API and basic Python operations, which makes it very flexible and scalable. You can easily modify the logic to fit your needs.
I hope you find this post helpful. If you have any questions or suggestions, please feel free to reach out to me.