EV Charger Site Selection with Geospatial Machine Learning

For an energy infrastructure company evaluating locations for EV charger installation, I developed a geospatial machine learning pipeline integrating real charging network data, EV registrations, demographics, and traffic to predict weekly electricity consumption. The system analyzes hundreds of potential sites and generates ranked recommendations, enabling data-driven capital allocation for EV charging infrastructure deployment.

The Challenge

An energy infrastructure company needed to prioritize hundreds of locations for EV charger installation. The decision faced several challenges:

High capital costs: Each installation requires investment with multi-year payback periods
No historical data: Locations have no charging history (cold-start problem)
Complex location factors: Success depends on EV adoption, traffic patterns, demographics, and competition
Data fragmentation: Required data sources spread across government databases and commercial providers
Risk mitigation: Poor site selection could result in underutilized chargers with negative ROI

They needed:

Predictive model forecasting weekly kWh consumption for each potential site
Data integration from diverse geospatial sources
Feature engineering capturing location-specific demand drivers
Ranked recommendations prioritizing high-value installation sites

Solution Architecture

I designed a five-phase data pipeline integrating multiple geospatial datasets with machine learning prediction:

Data Integration:

I integrated five diverse data sources to build comprehensive location profiles:

Charging Network: Real operational data from ~10,000 EV charging stations across the US (September-November). This included weekly charging session duration and power levels, providing ground truth for model training.
EV Registration Data: State-level data aggregated by ZIP code. I separated Battery Electric Vehicles (BEVs) from Plug-in Hybrids (PHEVs) since BEVs charge more frequently and represent higher-value customers.
Demographics: Census data providing population, median household income, and housing units by ZIP code. These socioeconomic indicators correlate strongly with EV adoption patterns.
Highway Traffic: HPMS (Highway Performance Monitoring System) data with Annual Average Daily Traffic (AADT) for 2M+ road segments. I preprocessed this data and built BallTree spatial indices for efficient distance queries.
Urban Boundaries: Census TIGER/Line shapefiles classifying locations as urban or rural, since charging patterns differ significantly between these environments.

Feature Engineering:

I engineered 19 features organized into five categories capturing location-specific demand drivers:

EV Registration Features: Total EV count and BEV-only count within a 15-mile radius, since Battery Electric Vehicles charge more frequently than plug-in hybrids.

Neighbor Features: Weekly kWh consumption and station counts in distance bands (0-2mi, 2-5mi). These features address the cold-start problem by using nearby station performance as a proxy for demand at locations without charging history.

Demographic Features: Inverse-distance weighted averages of population, median income, and housing units within 15 miles, capturing socioeconomic characteristics that correlate with EV adoption.

Traffic Features: Daily traffic volume with distance weighting, arterial ratio (major roads vs local streets), and nearest highway traffic, distinguishing highway corridors from residential areas.

Interaction Features: Derived features like traffic-per-capita (identifies through-traffic locations), income-weighted traffic (affluent corridors), and average kWh per nearby station (market health vs saturation).
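As a rough illustration of the neighbor and interaction features described above, the sketch below works on one candidate site whose distances to existing stations are already computed. The function names, the feature keys, the choice to sum kWh per band, and the exact formulas (a product for income-weighted traffic, a zero fallback for empty denominators) are assumptions made for the example, not the project's actual code.

```python
import numpy as np

def neighbor_band_features(dist_mi, weekly_kwh, bands=((0, 2), (2, 5))):
    """Station count and total weekly kWh per distance band around one site."""
    dist_mi = np.asarray(dist_mi, dtype=float)
    weekly_kwh = np.asarray(weekly_kwh, dtype=float)
    feats = {}
    for lo, hi in bands:
        # Half-open band [lo, hi) so a station is never counted twice.
        in_band = (dist_mi >= lo) & (dist_mi < hi)
        feats[f"stations_{lo}_{hi}mi"] = int(in_band.sum())
        feats[f"kwh_{lo}_{hi}mi"] = float(weekly_kwh[in_band].sum())
    return feats

def safe_ratio(num, den):
    """num / den, returning 0.0 when the denominator is not positive."""
    return num / den if den > 0 else 0.0

def interaction_features(feats, traffic, population, median_income):
    """Derived features; the formulas here are illustrative assumptions."""
    n_stations = feats["stations_0_2mi"] + feats["stations_2_5mi"]
    total_kwh = feats["kwh_0_2mi"] + feats["kwh_2_5mi"]
    return {
        "traffic_per_capita": safe_ratio(traffic, population),
        "income_weighted_traffic": traffic * median_income,
        "avg_kwh_per_station": safe_ratio(total_kwh, n_stations),
    }
```

A site with no stations within 5 miles gets an average of 0.0 here; a missing-value marker would be an equally reasonable choice, since tree models can treat "no neighbors" as its own case.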

Model Training:

Using the ~10,000 charging stations as training data, I built a Random Forest regression model:

Data preparation: Aggregated kWh by summing across chargers per station and averaging across weeks.
Log transformation: Applied log(1+x) transformation to handle right-skewed consumption distribution.
Model architecture: Random Forest with 600 trees, max depth 20, bootstrap sampling with 80% subsample, and 50% feature subsampling per split.
Validation: 5-fold cross-validation.
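The training steps above can be sketched with pandas and scikit-learn. The column names (`station_id`, `week`, `kwh`) are hypothetical, and mapping "80% subsample" to `max_samples=0.8` and "50% feature subsampling" to `max_features=0.5` is my reading of the description rather than a confirmed configuration. Wrapping the forest in `TransformedTargetRegressor` is one convenient way to apply log(1+x) during fitting and invert it at prediction time.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def station_weekly_kwh(readings):
    """Sum kWh across chargers per station-week, then average across weeks."""
    per_week = readings.groupby(["station_id", "week"])["kwh"].sum()
    return per_week.groupby(level="station_id").mean()

def build_model(n_estimators=600, random_state=0):
    """Random Forest fitted on log1p(kWh); predictions come back in kWh."""
    forest = RandomForestRegressor(
        n_estimators=n_estimators,
        max_depth=20,
        bootstrap=True,
        max_samples=0.8,    # assumed meaning of "80% subsample"
        max_features=0.5,   # assumed meaning of "50% feature subsampling"
        random_state=random_state,
    )
    return TransformedTargetRegressor(
        regressor=forest, func=np.log1p, inverse_func=np.expm1
    )

def cross_validated_r2(model, X, y, folds=5):
    """Per-fold R² scores, measured on the original kWh scale."""
    return cross_val_score(model, X, y, cv=folds, scoring="r2")
```

Note that R² scored this way is on the back-transformed kWh scale; scoring in log space would give a different number, so it is worth stating which one a reported figure refers to.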

Prediction and Ranking:

For each location, the model predicted weekly kWh consumption. I ranked all locations and assigned them to four tiers based on percentiles:

Tier 1: Top 20% (install first - highest predicted demand)
Tier 2: Next 30%, ranked in the top 20-50% (medium priority)
Tier 3: Following 30%, ranked in the top 50-80% (lower priority)
Tier 4: Bottom 20% (not recommended)
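A minimal sketch of the tier assignment, assuming the percentages refer to rank position counted from the highest prediction (Tier 1 is the top 20% of sites, Tier 4 the bottom 20%). The function name and the stable tie-breaking rule are choices made for this example.

```python
import numpy as np

def assign_tiers(predicted_kwh, cutoffs=(0.2, 0.5, 0.8)):
    """Return a tier (1-4) per site from its rank among all predictions."""
    preds = np.asarray(predicted_kwh, dtype=float)
    n = preds.size
    # Position 0 is the highest prediction; ties keep their input order.
    order = np.argsort(-preds, kind="stable")
    rank_frac = np.empty(n)
    rank_frac[order] = (np.arange(n) + 1) / n
    # A rank fraction <= 0.2 maps to tier 1, <= 0.5 to tier 2, and so on.
    return 1 + np.searchsorted(np.asarray(cutoffs), rank_frac, side="left")
```

With ten candidate sites this yields two sites in Tier 1, three each in Tiers 2 and 3, and two in Tier 4.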

Technical Implementation

Geospatial Feature Calculations:

I implemented vectorized haversine distance functions for computational efficiency. For each location, the system calculates distances to all ZIP code centroids and charging stations, then aggregates features within specified radius bands.
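The vectorized distance step might look like the following sketch, which uses NumPy broadcasting to produce a sites-by-points distance matrix in one call. The function names, the Earth radius constant, and the 15-mile default are assumptions for illustration.

```python
import numpy as np

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles

def haversine_matrix(site_lat, site_lon, pt_lat, pt_lon):
    """Great-circle distances in miles, shape (n_sites, n_points)."""
    lat1 = np.radians(np.asarray(site_lat, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(site_lon, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(pt_lat, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(pt_lon, dtype=float))[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Clip guards against tiny floating-point overshoot before arcsin.
    return 2 * EARTH_RADIUS_MI * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

def sum_within_radius(dist_mi, values, radius_mi=15.0):
    """Sum a per-point value (e.g. EV count per ZIP) within a radius of each site."""
    values = np.asarray(values, dtype=float)
    return np.where(dist_mi <= radius_mi, values[None, :], 0.0).sum(axis=1)
```

A full distance matrix is practical for a few hundred sites against ZIP centroids or stations; for millions of points, a spatial index is the better fit.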

For traffic features, I built BallTree spatial indices, enabling roughly O(log n) distance queries per site instead of O(n) brute-force scans. This optimization was critical for scaling to 2M+ road segments without performance degradation.
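The spatial index can be sketched with scikit-learn's `BallTree` and its haversine metric, which expects (latitude, longitude) pairs in radians and returns distances in radians. The helper names below are hypothetical, and representing each road segment by a single point is a simplifying assumption.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles

def to_radians(lat, lon):
    """Stack latitude and longitude (degrees) into an (n, 2) array in radians."""
    return np.radians(np.column_stack([lat, lon]).astype(float))

def build_segment_index(seg_lat, seg_lon):
    """Build the tree once and reuse it for every candidate site."""
    return BallTree(to_radians(seg_lat, seg_lon), metric="haversine")

def nearest_segment(tree, site_lat, site_lon):
    """Distance in miles and index of the closest segment for each site."""
    dist_rad, idx = tree.query(to_radians(site_lat, site_lon), k=1)
    return dist_rad[:, 0] * EARTH_RADIUS_MI, idx[:, 0]

def segments_within(tree, site_lat, site_lon, radius_mi):
    """Index arrays of all segments within radius_mi of each site."""
    return tree.query_radius(
        to_radians(site_lat, site_lon), r=radius_mi / EARTH_RADIUS_MI
    )
```

Building a second tree from only the major-road segments gives the separate highway-proximity lookup without filtering at query time.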

Distance-Based Aggregation:

For demographic and EV registration features, I used inverse-distance weighting where closer ZIP codes have more influence (weight = 1/distance). This captures local characteristics while incorporating regional context.

For traffic features, I used inverse-squared distance weighting (weight = 1/distance²), which decays faster than 1/distance so that the nearest roads dominate, and calculated arterial ratios to distinguish highway corridors (travelers needing fast charging) from residential areas (locals charging at home).
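Both weighting schemes fit one small function where the exponent controls how quickly influence falls off with distance. This is a sketch under assumptions: the function name, the optional radius cutoff, and the minimum-distance floor (which avoids division by zero when a site sits on a centroid) are not taken from the original pipeline.

```python
import numpy as np

def idw_average(values, dist_mi, power=1.0, radius_mi=None, min_dist_mi=0.1):
    """Inverse-distance weighted mean with weight = 1 / distance**power."""
    values = np.asarray(values, dtype=float)
    dist = np.asarray(dist_mi, dtype=float)
    if radius_mi is None:
        keep = np.ones(dist.shape, dtype=bool)
    else:
        keep = dist <= radius_mi
    if not keep.any():
        return float("nan")  # nothing in range; caller decides how to fill
    # Floor the distance so a zero-distance point cannot produce infinity.
    weights = 1.0 / np.maximum(dist[keep], min_dist_mi) ** power
    return float(np.sum(weights * values[keep]) / np.sum(weights))
```

For two points with values 10 and 20 at 1 and 2 miles, `power=1` gives about 13.33 and `power=2` gives 12.0, so the squared form pulls the result further toward the nearer point.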

Data Quality and Validation:

I implemented comprehensive input file validation checking for required CSV files, shapefiles, and EV registration data before processing. The system provides detailed error reporting and exits early if data is incomplete, preventing partial processing.
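A fail-fast input check along these lines could be written with `pathlib`. The file names in `REQUIRED_FILES` are placeholders invented for the example, as are the function names; the real pipeline's inputs and checks may differ.

```python
import sys
from pathlib import Path

# Hypothetical file names, for illustration only.
REQUIRED_FILES = (
    "stations.csv",
    "demographics.csv",
    "ev_registrations.csv",
    "urban_areas.shp",
)

def find_input_problems(data_dir, required=REQUIRED_FILES):
    """Return a list of human-readable problems; empty if inputs look usable."""
    base = Path(data_dir)
    problems = []
    for name in required:
        path = base / name
        if not path.is_file():
            problems.append(f"missing: {path}")
        elif path.stat().st_size == 0:
            problems.append(f"empty: {path}")
    return problems

def validate_or_exit(data_dir, required=REQUIRED_FILES):
    """Report every problem at once, then stop before any processing starts."""
    problems = find_input_problems(data_dir, required)
    if problems:
        for line in problems:
            print(line, file=sys.stderr)
        sys.exit(1)
```

Collecting all problems before exiting means a single run reveals every missing file, rather than one per attempt.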

Key Design Decisions

19-feature model with interaction terms:

Base features capture primary demand drivers
Interaction features expose ratios and products that splits on single features can only approximate
Limited feature count reduces overfitting risk while maintaining interpretability

Neighbor features as cold-start solution:

Locations lack charging history
Nearby station performance validates market demand
High average kWh indicates proven demand with room for growth
Low average kWh indicates saturation or poor location

Distance-band segmentation:

0-2mi captures immediate competition and demand validation
2-5mi captures regional market characteristics
15mi radius for demographics balances local context with sample size
0.5-1mi for traffic captures accessibility without dilution

Log-transformed target variable:

Weekly kWh is highly right-skewed
Log transformation reduces the skew so a few very busy stations do not dominate the fit, improving model performance
Predictions transformed back for interpretable business metrics

Spatial indexing for traffic data:

BallTree enables efficient queries on large datasets
Pre-built indices amortize construction cost
Separate index for major roads optimizes highway proximity queries

Results & Impact

Built geospatial ML pipeline analyzing over 500 locations for EV charger suitability
Integrated 5 diverse data sources: charging network, EV registrations, demographics, traffic, urban boundaries
Engineered 19 features capturing location-specific demand drivers
Trained Random Forest model on ~10,000 real charging stations achieving R² = 0.52
Generated ranked recommendations with predicted weekly kWh consumption forecasts
Implemented distance-based feature calculations with spatial indexing for efficiency
Created percentile-based tier system enabling phased deployment strategy
Enabled data-driven site selection reducing risk of underutilized installations
Feature importance analysis identified neighbor kWh, EV counts, and traffic as top predictors

Technologies

Python, scikit-learn, Random Forest, pandas, NumPy, GeoPandas, Shapely, matplotlib, seaborn, Geospatial Analysis, Feature Engineering, Machine Learning