[[How I Built an Automated Dependency Diagram#The Result|Click here]] to see the end result.

# The Problem

At the City of Boston we have hundreds of data pipelines that do everything from creating simple automated reports to powering mission-critical applications. Because of both the number of pipelines and the complexity of some of them, we needed a way to quickly find out who and what would be affected when a problem arose.

# The Action

At first, we tried a simple solution: documenting the pipelines in [draw.io](http://draw.io/). This worked okay when we had just a handful of pipelines, but it quickly became difficult to navigate and edit without messing something up or accidentally deleting part of the diagram. It was also a very manual and time-consuming process, which meant it usually just didn't get done and the diagram fell out of date.

Because of how I had standardized and structured our pipelines, I believed we could programmatically parse them and build the diagram from that information. I had two problems to solve: how to parse our pipelines, and how to visualize them and make the diagram searchable.

## Parsing our pipelines

First, a bit of background on how our pipelines are structured. We use a platform called [Civis Analytics](https://www.civisanalytics.com/), which uses the [Mistral Workflow Service](https://docs.openstack.org/mistral/latest/) for defining and orchestrating data pipelines/workflows. In the [Mistral workflow language](https://docs.openstack.org/mistral/latest/user/wf_lang_v2.html), you define pipelines using a combination of [[YAML]] for the pipeline definition and [[YAQL]] or [[Jinja2]] for evaluating expressions such as "if the task succeeds but doesn't have an output, go to task x instead of task y."

Early on in using Civis, I standardized how we created tasks in our pipelines - part of which included surfacing all of the important parameters of a task, like the table(s) it references, business logic files, and source/destination information. Originally, I made this decision so engineers reading a pipeline could quickly understand how it worked without having to dig into the details, but it also made parsing the pipelines a much easier job.

Here's a slightly condensed example workflow that pulls from a database and updates a dataset of [women-owned businesses](https://data.boston.gov/dataset/women-owned-businesses/resource/91ff216b-4f54-4a2c-af00-2c0f987070ee) on our open data portal.

```YAML
version: '2.0'
workflow:
  ...
  tasks:
    extract_from_database:
      action: civis.scripts.custom
      <<: *default_retry
      input:
        name: 'Extract: Women-Owned Businesses'
        from_template_id: *custom_database_import
        arguments:
          SOURCE: *specific_db_credential_id
          SOURCE_LOGIC: cool_dataset_certain_columns.sql
          DESTINATION_TABLE: oed_open_data.women_owned_businesses
          EXISTING_TABLE_ROWS: truncate
      on-success:
        - ckan_update_resource

    ckan_update_resource:
      action: civis.scripts.custom
      <<: *default_retry
      input:
        name: 'Update: Women-Owned Businesses'
        from_template_id: *custom_ckan_update_resource
        arguments:
          RESOURCE_ID: 91ff216b-4f54-4a2c-af00-2c0f987070ee
          RESOURCE_NAME: Women-Owned Businesses
          CKAN_ENVIRONMENT: *ckan_prod
          CKAN_API_KEY: *ckan_api_key_credential_id
          SOURCE_TABLE: oed_open_data.women_owned_businesses
```

I might go into detail on how this works behind the scenes in a different post, but for now you can see that the workflow is organized into different types of tasks and that all of the important information lives in the task arguments.
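To give a sense of what that parsing can look like, here's a minimal sketch of a per-task-type parser that walks a workflow like the one above and emits node and relationship records. Everything here is illustrative - the parser names, node labels, relationship types, and template keys are placeholders rather than the production code - and it assumes PyYAML and a complete workflow file with all of its anchors defined.

```Python
# A minimal, illustrative sketch of per-task-type parsing - not production code.
import yaml


def parse_database_import(task_name, args):
    """A database import writes its output to DESTINATION_TABLE."""
    nodes = [{"label": "Table", "name": args["DESTINATION_TABLE"]}]
    rels = [{"type": "LOADS", "from": task_name, "to": args["DESTINATION_TABLE"]}]
    return nodes, rels


def parse_ckan_update(task_name, args):
    """A CKAN resource update reads SOURCE_TABLE and refreshes a portal resource."""
    nodes = [{"label": "CkanResource", "id": args["RESOURCE_ID"], "name": args["RESOURCE_NAME"]}]
    rels = [
        {"type": "READS", "from": task_name, "to": args["SOURCE_TABLE"]},
        {"type": "UPDATES", "from": task_name, "to": args["RESOURCE_ID"]},
    ]
    return nodes, rels


# One small parser per task type, keyed by the task's resolved template id
# (the keys here are placeholders).
PARSERS = {
    "custom_database_import": parse_database_import,
    "custom_ckan_update_resource": parse_ckan_update,
}


def parse_workflow(yaml_text):
    """Walk every task in a workflow and collect node/relationship records."""
    workflow = yaml.safe_load(yaml_text)
    nodes, relationships = [], []
    for task_name, task in workflow["workflow"]["tasks"].items():
        parser = PARSERS.get(task["input"].get("from_template_id"))
        if parser is None:
            continue  # skip task types the diagram doesn't track
        task_nodes, task_rels = parser(task_name, task["input"]["arguments"])
        nodes.extend(task_nodes)
        relationships.extend(task_rels)
    return nodes, relationships
```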
This standardization allowed me to create a simple Python function for each type of task, because all of the tasks of a certain type have the same arguments. I parsed the workflows into nodes and relationships and then stored this information in tables in our data warehouse while I figured out how best to present the data. I also enriched it with data from the warehouse itself, such as table descriptions and statistics.

## Finding a visualization/search tool

I was looking for something we could use to both visualize and query the data. There were a handful of Python and JavaScript libraries we could use for visualization, but that would mean spending time creating and maintaining another app, and it would require web development skills our team didn't have.

I remembered reading a while ago how NASA used a graph database called [[Neo4j]] for a similar purpose and decided to check that out first. After doing some research and going through the documentation, I decided Neo4j could be a potential low-maintenance solution and proceeded to create a proof of concept with it. I found that you could quickly [bulk load CSV files into Neo4j](https://neo4j.com/developer/guide-import-csv/#import-load-csv) and map the columns into nodes and relationships using a [[Cypher Query Language|Cypher]] query, Cypher being Neo4j's query language. I used the [neo4j Python driver](https://github.com/neo4j/neo4j-python-driver) to accomplish this, and here's an example of the Cypher query that maps a CSV file to database nodes:

```Python
def create_database_nodes(session, data):
    # Bulk load a CSV of databases and create one :Database node per row.
    return session.run(
        '''
        load csv with headers from $csv_url as t
        create (:Database {id: t.index,
                           name: t.name,
                           description: t.description,
                           created_at: t.created_at,
                           owner: t.owner,
                           remote_host_id: t.remote_host_id,
                           remote_host_name: t.remote_host_name,
                           state: t.state,
                           updated_at: t.updated_at});
        ''',
        csv_url=data
    )
```

And you might be surprised, but that's pretty much it. While developing the proof of concept I used Neo4j Desktop, and after it was deemed a success I asked one of our engineers to spin up a Neo4j instance on AWS so the rest of the team could use it going forward.

One notable thing I found was an add-on called Bloom, which allows for more customization of the nodes/relationships and also offers a powerful feature called parameterized queries. Parameterized queries let you expose a Cypher query behind a plain-language search phrase and fill in parts of it dynamically from the data. Here's an example of a parameterized query that lets me search "tables dependent on ..." and surfaces database options based on the data.

![[Pasted image 20210131182136.png]]

![[Pasted image 20210131182437.png]]

# The Result

We now had an automated solution that gave us the ability to answer more complex questions about our data pipelines and dependencies. Each time a pipeline was added or modified in GitHub, it would kick off a job to parse that pipeline and update the data. The beauty of this solution is that it's relatively simple to maintain because the workflow just creates CSVs and bulk imports them into Neo4j. An added bonus is that we don't have to maintain another application - just host a database.

Since starting the project, I've also helped integrate it into a larger asset dependency diagram for the entire department and convinced the team building it to adopt Neo4j as the backend.
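For a sense of what querying the graph directly looks like (outside of Bloom's parameterized searches), a question like "what depends on this table?" comes down to a short Cypher traversal. The sketch below uses the same driver pattern as the import example; the `:Table` label and `:DEPENDS_ON` relationship type are illustrative, since the real schema isn't spelled out here.

```Python
def find_dependents(session, table_name):
    # Follow dependency relationships any number of hops downstream of a table.
    # Illustrative only: node labels and relationship types are placeholders.
    result = session.run(
        '''
        match (t:Table {name: $table_name})<-[:DEPENDS_ON*1..]-(dependent)
        return distinct dependent.name as name, labels(dependent) as labels;
        ''',
        table_name=table_name
    )
    return [record.data() for record in result]
```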
One downside is that it requires some of the team to learn Cypher, and the team is already stretched supporting a dozen skill sets as it is, but I'm hoping the parameterized queries will help minimize that. In the future, I'd also like to attach stakeholder information to pipelines so that if a service goes down we could quickly get a list of everyone affected and contact them.

**High-level overview of the dependency diagram**
![[dependency-diagram-high-level.png]]

**Example of a single data pipeline**
![[Inkeddependency-diagram-low-level.jpg]]

**Example of clicking on a table node and viewing its details**
![[dependency-diagram-node-details.png]]