# Overview

[CityScore](https://www.boston.gov/innovation-and-technology/cityscore) is a city-wide KPI that we on the Boston Analytics Team developed so that the Mayor and city managers could track the overall "health" of the city every day.

![[Pasted image 20210128002621.png]]
_The aggregated KPI that would be displayed around Boston City Hall._

CityScore is public, so anyone can view exactly what the Mayor uses, and it has proven to drive meaningful change:

- [Improving EMS Response Time](https://www.boston.gov/innovation-and-technology/cityscore#case-studies)
- [Installing Traffic Signs](https://www.boston.gov/innovation-and-technology/cityscore#case-studies)

# My Contributions

CityScore existed before I started working for the City of Boston, but the first iteration left plenty of room for improvement in the optimization, simplicity, and reliability of the data pipeline that powered it. I created the second iteration, which we internally called CityScore 2.0.

## Optimization

The pipeline pulls metrics from 23 different data sources owned by departments scattered across the city. Each metric is weighted by relative importance and tracks topics important to the Mayor such as transportation, crime, and city services. Each metric is calculated by day, week, month, and quarter and bundled together.

The first win was adding parallel processing to the ingestion step. We could now ingest data from all sources at the same time and asynchronously start processing each dataset as soon as it was available. Later steps that needed multiple pieces of data for a calculation would wait until their upstream tasks had completed successfully, using Airflow trigger rules.

I stored these metrics using a modified [activity schema](https://dataengineering.wiki/Concepts/Activity+Schema), which improved query performance and made the data easier to use: everything lived in one table, so no further joins were required.
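The fan-out/fan-in pattern above can be sketched in plain Python. In production this was expressed as Airflow tasks with trigger rules; the source names and the `ingest`/`process` functions here are hypothetical stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-ins for a few of the 23 departmental sources.
SOURCES = ["ems", "streetlights", "permits", "311"]

def ingest(source: str) -> list[int]:
    # In the real pipeline this would call a departmental API or database.
    return [len(source), len(source) * 2]

def process(source: str, rows: list[int]) -> int:
    # Per-source processing starts as soon as that source's data arrives.
    return sum(rows)

def run_pipeline() -> dict[str, int]:
    results: dict[str, int] = {}
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(ingest, s): s for s in SOURCES}
        # as_completed yields each ingest as it finishes, so processing
        # is never blocked waiting on the slowest source.
        for fut in as_completed(futures):
            source = futures[fut]
            results[source] = process(source, fut.result())
    # An aggregation step (trigger_rule="all_success" in Airflow) would
    # only run here, once every upstream result is present.
    return results

scores = run_pipeline()
```

The key property is that the final aggregation sees all sources, while per-source work overlaps freely.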
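The one-table idea behind the activity schema can be sketched with `sqlite3`. The column names below are illustrative, not the actual CityScore schema:

```python
import sqlite3

# One table holds every metric observation, so queries need no joins.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metrics (
        ts     TEXT,  -- when the observation was recorded
        metric TEXT,  -- e.g. 'ems_response_time' (illustrative name)
        period TEXT,  -- 'day', 'week', 'month', or 'quarter'
        score  REAL
    )
""")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?, ?)",
    [
        ("2021-01-28", "ems_response_time", "day", 1.02),
        ("2021-01-28", "ems_response_time", "week", 0.97),
        ("2021-01-28", "pothole_repairs", "day", 1.10),
    ],
)
# Any slice (here, all daily scores) comes from the single table.
daily = conn.execute(
    "SELECT metric, score FROM metrics WHERE period = 'day' ORDER BY metric"
).fetchall()
```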
## Simplification

All of the metric definitions were originally stored in a table in our data warehouse. This made it easy to change, add, or remove metrics. However, it also meant the metrics weren't version controlled, and the queries were very difficult to understand unless you pulled them out and formatted them.

I had previously created a standardized ETL framework for our team, so I adapted the pipeline to fit it. This meant taking all of the business logic (the SQL embedded in the table) out and storing it as proper SQL files in the transform section of our codebase.

## Increased Reliability

I also broke the pipeline up into logical pieces. Before, everything was done in two or three Python files, with multiple transformations happening inside each file, which wasn't clear if you were looking at the pipeline from a high level. Breaking it into logical tasks made the pipeline easy for anyone to understand at a glance, and it also increased its reliability. If a task failed near the end of the pipeline, we no longer had to re-run the entire thing; we could make a fix and re-run only the failed task and everything downstream of it. Given the size of the pipeline, this saved us a lot of time and compute.
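Keeping transforms as files rather than rows in a warehouse table can be sketched like this. The directory layout and file names are illustrative, not the framework's actual conventions:

```python
import pathlib
import sqlite3
import tempfile

# Stand-in for the transform section of the codebase: plain .sql files
# that live in version control instead of a warehouse table.
transform_dir = pathlib.Path(tempfile.mkdtemp())
(transform_dir / "01_create_scores.sql").write_text(
    "CREATE TABLE scores (metric TEXT, score REAL);"
)
(transform_dir / "02_seed_scores.sql").write_text(
    "INSERT INTO scores VALUES ('ems_response_time', 1.02);"
)

conn = sqlite3.connect(":memory:")
# Numbered prefixes sort lexicographically, giving execution order.
for path in sorted(transform_dir.glob("*.sql")):
    conn.executescript(path.read_text())

row = conn.execute("SELECT metric, score FROM scores").fetchone()
```

Because each query is a formatted file, it can be reviewed, diffed, and blamed like any other code.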
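The re-run behavior can be illustrated with a toy task runner. Task names and bodies are hypothetical; in practice Airflow tracks which tasks succeeded and handles the resume:

```python
# Each pipeline step is a named task operating on shared state.
def extract(state):
    state["rows"] = [1, 2, 3]

def transform(state):
    state["rows"] = [r * 10 for r in state["rows"]]

def load(state):
    state["loaded"] = sum(state["rows"])

TASKS = [("extract", extract), ("transform", transform), ("load", load)]

def run(state, completed):
    """Run tasks in order, skipping any already marked complete."""
    for name, task in TASKS:
        if name in completed:
            continue  # upstream work is preserved, not recomputed
        task(state)
        completed.add(name)
    return state

# Suppose a prior run finished extract and transform, then load failed.
# After a fix, only load (and anything downstream) runs again.
resumed = run({"rows": [10, 20, 30]}, {"extract", "transform"})
```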