# Clean Data

![clean-merge-data](img/scrub-process-diagram.png)

## Import Libraries

### Standard Libraries


In [1]:
import json

For more on working with `json` in Python, see {cite}`lofaro2018json`.

### External Libraries

In [2]:
import geopandas as gpd

## Define Variables

In [14]:
nyc_street_flooding_input = 'data/street-flooding/street-flood-complaints_rows-all.geojson'
nyc_street_flooding_output = 'data/street-flooding/clean_street-flood-complaints_rows-all.geojson'
data_stats_json_output = 'data/data-stats.json'

## Get Original Data

In [4]:
street_flooding_gdf = gpd.read_file(nyc_street_flooding_input)

## Before Count

In [5]:
street_flooding_complaints_before_count = len(street_flooding_gdf)
print(f'There were {street_flooding_complaints_before_count:,} street flooding complaints from 2010 to the present.')

There were 35,056 street flooding complaints from 2010 to the present.


## Set `unique_key` as Index

In [6]:
street_flooding_gdf.set_index('unique_key', inplace=True)
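
Using `unique_key` as the index assumes that every complaint has a distinct key. The sketch below shows one way that assumption could be checked. It uses a small made-up DataFrame (the `sample_df` rows and variable names are illustrative, not taken from the notebook). With `verify_integrity=True`, pandas raises a `ValueError` if the new index contains duplicates.

```python
import pandas as pd

# Hypothetical sample rows standing in for the complaints table.
sample_df = pd.DataFrame({
    'unique_key': [15639934, 15640572, 15640664],
    'borough': ['BROOKLYN', 'STATEN ISLAND', 'QUEENS'],
})

# Check for repeated keys before indexing.
has_duplicate_keys = sample_df['unique_key'].duplicated().any()

# verify_integrity=True raises ValueError if the new index has duplicates.
indexed_df = sample_df.set_index('unique_key', verify_integrity=True)
print(indexed_df.index.is_unique)
```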

## Remove Rows With Missing `geometry`

In [7]:
street_flooding_gdf.dropna(subset=['geometry'], inplace=True)
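
GeoJSON allows a feature's `geometry` member to be `null`, which is presumably where the rows with missing `geometry` come from. As a rough illustration, the sketch below counts such features in a small hand-written FeatureCollection using only the standard `json` module. The `sample_geojson` string and its keys are made up for the example and are not read from the input file.

```python
import json

# Hypothetical, hand-written FeatureCollection; the second feature has no geometry.
sample_geojson = '''
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature", "properties": {"unique_key": "1"},
     "geometry": {"type": "Point", "coordinates": [-73.92178, 40.58778]}},
    {"type": "Feature", "properties": {"unique_key": "2"}, "geometry": null},
    {"type": "Feature", "properties": {"unique_key": "3"},
     "geometry": {"type": "Point", "coordinates": [-74.14329, 40.63866]}}
  ]
}
'''

features = json.loads(sample_geojson)['features']

# A feature counts as "missing geometry" when its geometry member is null.
missing_geometry_keys = [
    feature['properties']['unique_key']
    for feature in features
    if feature.get('geometry') is None
]
print(f'{len(missing_geometry_keys)} of {len(features)} features have no geometry.')
```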

## After Count

In [8]:
street_flooding_complaints_after_count = len(street_flooding_gdf)
print(f'There were {street_flooding_complaints_after_count:,} street flooding complaints after rows with missing geometry have been removed.')

There were 34,049 street flooding complaints after rows with missing geometry have been removed.
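
The two printed counts also show how much data the cleaning step removed. This short sketch works that out from the printed totals. The variable names are illustrative and the numbers are copied from the outputs above rather than recomputed from the data.

```python
# Counts copied from the printed outputs above.
before_count = 35_056
after_count = 34_049

removed_count = before_count - after_count
removed_share = removed_count / before_count

print(f'{removed_count:,} rows removed ({removed_share:.1%} of the original).')
# prints: 1,007 rows removed (2.9% of the original).
```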


## Preview Street Flooding Data

In [9]:
street_flooding_gdf[['created_date', 'borough', 'bbl', 'geometry']].head(10)

| unique_key | created_date | borough | bbl | geometry |
|---|---|---|---|---|
| 15639934 | 2010-01-02 08:26:00 | BROOKLYN | 3089000064.0 | POINT (-73.92178 40.58778) |
| 15640572 | 2010-01-02 12:00:00 | STATEN ISLAND |  | POINT (-74.14329 40.63866) |
| 15640664 | 2010-01-02 17:45:00 | QUEENS | 4120050012.0 | POINT (-73.79530 40.68140) |
| 15655327 | 2010-01-04 16:47:00 | QUEENS | 4106210008.0 | POINT (-73.73843 40.72006) |
| 15668560 | 2010-01-05 10:37:00 | BROOKLYN | 3086550021.0 | POINT (-73.90969 40.61250) |
| 15674300 | 2010-01-06 19:26:00 | BROOKLYN | 3029270015.0 | POINT (-73.93297 40.71584) |
| 15674896 | 2010-01-06 08:24:00 | QUEENS | 4119960122.0 | POINT (-73.80255 40.67925) |
| 15674924 | 2010-01-06 09:17:00 | STATEN ISLAND | 5040740044.0 | POINT (-74.10646 40.55866) |
| 15675505 | 2010-01-06 06:00:00 | QUEENS | 4030030044.0 | POINT (-73.87694 40.71804) |
| 15683503 | 2010-01-07 10:16:00 | STATEN ISLAND | 5014850078.0 | POINT (-74.14943 40.61979) |


In [10]:
street_flooding_gdf[['created_date', 'borough', 'bbl', 'geometry']].tail(10)

| unique_key | created_date | borough | bbl | geometry |
|---|---|---|---|---|
| 56894127 | 2023-02-25 21:17:00 | QUEENS | 4066360043 | POINT (-73.82293 40.71523) |
| 56895026 | 2023-02-25 12:47:00 | QUEENS | 4067470075 | POINT (-73.81219 40.73705) |
| 56899909 | 2023-02-25 20:08:00 | BROOKLYN | 3056230001 | POINT (-73.99062 40.63595) |
| 56900879 | 2023-02-26 09:08:00 | QUEENS | 4015360120 | POINT (-73.88446 40.73925) |
| 56904542 | 2023-02-26 18:05:00 | STATEN ISLAND | 5061080026 | POINT (-74.20391 40.54321) |
| 56909777 | 2023-02-27 12:26:00 | QUEENS | 4046820038 | POINT (-73.81259 40.78476) |
| 56911030 | 2023-02-27 15:38:00 | BROOKLYN | 3078620043 | POINT (-73.93696 40.61734) |
| 56913386 | 2023-02-27 17:47:00 | BROOKLYN | 3056220059 | POINT (-73.99228 40.63696) |
| 56914818 | 2023-02-27 08:17:00 | QUEENS | 4051937501 | POINT (-73.82129 40.75430) |
| 56915899 | 2023-02-27 10:23:00 | QUEENS | 4137350027 | POINT (-73.74750 40.65427) |


## Save Datasets

### Save Street Flooding GeoDataFrame

In [11]:
street_flooding_gdf.to_file(nyc_street_flooding_output, driver='GeoJSON')

### Save Counts to JSON File

In [12]:
gdf_counts = {
    "street_flood_orig": street_flooding_complaints_before_count,
    "street_flood_clean": street_flooding_complaints_after_count
}

In [16]:
with open(data_stats_json_output, 'w') as write_json:
    json.dump(gdf_counts, write_json, indent=4)
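
A later step that needs these counts could read them back with `json.load`. The sketch below illustrates the round trip under some assumptions: it writes to a temporary directory instead of `data/data-stats.json`, and `sample_counts`, `loaded_counts`, and `rows_removed` are illustrative names. The values are copied from the printed counts earlier in this notebook.

```python
import json
import os
import tempfile

# Counts copied from the printed outputs earlier in this notebook.
sample_counts = {
    "street_flood_orig": 35056,
    "street_flood_clean": 34049,
}

# Write to a temporary file so the real data/data-stats.json is left untouched.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'data-stats.json')
    with open(sample_path, 'w') as write_json:
        json.dump(sample_counts, write_json, indent=4)
    with open(sample_path) as read_json:
        loaded_counts = json.load(read_json)

# The loaded dictionary can be used like the original one.
rows_removed = loaded_counts["street_flood_orig"] - loaded_counts["street_flood_clean"]
print(rows_removed)
```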

## References

### JSON

[Working With JSON Data in Python | Real Python](https://realpython.com/python-json/)