Build a Maps Analytics Dashboard: Comparing Google Maps and Waze Data with ClickHouse

codeacademy
2026-01-25
13 min read

Learn how to ingest Google Maps and Waze traffic data into ClickHouse for fast OLAP analysis, real-time dashboards, and provider comparisons.

Turn fragmented traffic signals into fast, actionable analytics

You're trying to build an analytics dashboard that answers real questions: Which navigation source (Google Maps or Waze) predicts travel time more accurately in my city? Which roads show persistent congestion? But you face common pain points: multiple APIs with different schemas, rate limits, noisy community-sourced events, and heavy aggregation queries that slow down your dashboard. This guide shows how to ingest route and traffic data from the Google Maps API and Waze into ClickHouse for fast OLAP analysis and visualization — a production-ready pattern for realtime and historical traffic analytics in 2026.

The high-level approach

At the top level you'll:

  1. Collect route requests and traffic events from Google Maps and Waze (respecting terms of service).
  2. Normalize and enrich data (geohash, route hashing, map-matching) in an ETL layer.
  3. Stream into ClickHouse using the Kafka engine or HTTP bulk inserts for OLAP storage.
  4. Create materialized views and aggregated tables for sub-second dashboards.
  5. Visualize via Grafana/Superset and run comparative queries to measure discrepancy.

Why ClickHouse in 2026?

ClickHouse remains a top choice for high-concurrency OLAP workloads. Recent 2025–2026 momentum (including large funding rounds and rapid feature growth) has pushed ClickHouse deeper into the cloud and analytics ecosystem. With improved geo functions, Kafka/streaming integrations, and first-class cloud offerings, ClickHouse balances low-latency ad-hoc queries and high-throughput ingestion — perfect for traffic analytics.

Fact (2026): ClickHouse's growth and funding signal robust enterprise adoption and continued innovation in OLAP features that help pipelines like the one in this tutorial scale from POC to production.

Design decisions and constraints to set up first

Before writing code, decide on these constraints:

  • Data licensing & compliance: Join Waze Connected Citizens or use Waze for Cities feeds; use Google Maps APIs per terms. Avoid scraping the Live Map.
  • Sampling & rate limits: Google Maps and Waze have quotas. Use sensible sampling and local caching to minimize cost.
  • Latency target: Do you need real-time (<5 s), near-real-time (30 s–1 min), or batch (hourly)?
  • Retention & storage costs: Store raw events for 7–30 days and aggregated rollups longer. Use TTL to manage cold data.
  • Spatial precision: Use geohash or ClickHouse geometry types to spatially join across providers.

Architecture overview

Use a pipeline pattern that decouples collection and storage. Here’s a resilient architecture that scales:

  • Producers: Microservices that call the Google Maps Directions, Distance Matrix, and Roads APIs and Waze feeds.
  • Message bus: Apache Kafka (recommended) for buffering and replayability.
  • ClickHouse ingestion: Kafka engine tables + materialized views to transform and persist to MergeTree tables.
  • Rollups: Materialized views precompute hourly/daily aggregates.
  • Visualization: Grafana / Apache Superset querying ClickHouse directly.

Why Kafka?

Using Kafka provides replay, backpressure handling, and decoupling between real-time ingestion and ClickHouse writes. ClickHouse's Kafka engine lets you consume and transform messages into MergeTree with materialized views — a common, battle-tested pattern in 2026.

Step 1 — Collecting data from Google Maps and Waze

Collect two primary kinds of signals:

  • Route requests and travel_time estimates (Google Directions API, Distance Matrix API).
  • Traffic events & jams (Waze for Cities: alerts, jams, and traffic flow).

Google Maps: what to call

Important endpoints:

  • Directions API: route geometry, legs, and duration_in_traffic.
  • Distance Matrix API: travel_time estimates and statuses for many origin-destination pairs.
  • Roads API: map-matching and snapped points for cleaner segments.

Example: call Directions API with departure_time=now to get duration_in_traffic.
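The Directions call itself appears in the collector in Step 4. Because later sections recommend Distance Matrix batching to control cost, here is a minimal sketch of that pattern. It assumes a valid API key and that you only need the diagonal of the matrix (each origin paired with its own destination); quota handling and per-element status checks beyond the basics are left out.

import requests

def fetch_distance_matrix(od_pairs, api_key):
  # od_pairs: list of ((orig_lat, orig_lng), (dest_lat, dest_lng)) tuples
  origins = '|'.join(f"{o[0]},{o[1]}" for o, _ in od_pairs)
  destinations = '|'.join(f"{d[0]},{d[1]}" for _, d in od_pairs)
  params = {'origins': origins, 'destinations': destinations,
            'departure_time': 'now', 'key': api_key}
  r = requests.get('https://maps.googleapis.com/maps/api/distancematrix/json',
                   params=params, timeout=10)
  r.raise_for_status()
  results = []
  for i, row in enumerate(r.json().get('rows', [])):
    element = row['elements'][i]                # keep only the diagonal: origin i -> destination i
    if element.get('status') == 'OK':
      duration = element.get('duration_in_traffic', element['duration'])['value']
      results.append({'od_pair': od_pairs[i],
                      'duration_seconds': duration,
                      'distance_meters': element['distance']['value']})
  return results

Keep in mind that the Distance Matrix API bills per element (origins × destinations), so batching unrelated pairs still pays for the full cross product; group pairs that share origins or destinations where you can.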

Waze: what to call

Waze provides city program feeds (alerts, jams) via Waze for Cities or Connected Citizens Program. These include geometry for jams and timestamps for events. Do not scrape Waze Live Map — use official feeds and respect licensing.
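Below is a minimal sketch of a partner-feed consumer, assuming a JSON feed URL issued to you through Waze for Cities (the URL shown is a placeholder). The field names used here (jams, uuid, pubMillis, line, delay, length) follow the commonly documented partner feed shape, but verify them against the spec that comes with your agreement.

import time, requests

WAZE_FEED_URL = 'https://example.com/waze-partner-feed.json'  # placeholder; issued by Waze for Cities

def fetch_waze_jams(feed_url=WAZE_FEED_URL):
  r = requests.get(feed_url, timeout=10)
  r.raise_for_status()
  payload = r.json()
  events = []
  for jam in payload.get('jams', []):           # field names: verify against your feed spec
    line = jam.get('line', [])                  # list of {'x': lng, 'y': lat} vertices
    if len(line) < 2:
      continue
    events.append({
      'provider': 'waze',
      'event_type': 'jam',
      'event_id': str(jam.get('uuid', '')),
      'timestamp': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime(jam.get('pubMillis', 0) / 1000)),
      'origin': {'lat': line[0]['y'], 'lng': line[0]['x']},
      'destination': {'lat': line[-1]['y'], 'lng': line[-1]['x']},
      'route_polyline': '',                     # optionally encode `line` as WKT or a polyline
      'duration_seconds': int(jam.get('delay', 0)),
      'distance_meters': int(jam.get('length', 0)),
      'congestion_level': 'unknown',            # map the jam `level` to a common scale in Step 2
      'metadata': {'raw': jam},
    })
  return events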

Step 2 — Normalization: unify schemas

Google and Waze provide different shapes. Build a normalized event schema that captures common fields and preserves provider metadata for later comparison.

Normalized JSON event model

{
  "provider": "google|waze",
  "event_type": "route|traffic_event|jam",
  "event_id": "string",
  "timestamp": "ISO8601",
  "origin": {"lat": float, "lng": float},
  "destination": {"lat": float, "lng": float},
  "route_polyline": "encoded_polyline_or_wkt",
  "duration_seconds": int,    // Google: duration_in_traffic; Waze: derived
  "distance_meters": int,
  "congestion_level": "low|medium|high|unknown",
  "metadata": { "raw": { ... } }
}

Key normalizations:

  • Convert all timestamps to UTC and ISO8601.
  • Map congestion indicators to a common scale (see the sketch after this list).
  • Store raw payload under metadata.raw for debugging and future fields.
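One way to implement the congestion mapping referenced above. The assumed Waze jam level range (0–5) and the Google travel-time ratio thresholds are starting points to tune against your own data.

def congestion_from_waze_level(level):
  # Waze jam severity is reported as a small integer (assumed 0-5 here).
  if level is None:
    return 'unknown'
  return 'low' if level <= 1 else 'medium' if level <= 3 else 'high'

def congestion_from_google(duration_in_traffic_s, free_flow_duration_s):
  # Ratio of live travel time to free-flow travel time; thresholds are arbitrary starting points.
  if not free_flow_duration_s:
    return 'unknown'
  ratio = duration_in_traffic_s / free_flow_duration_s
  return 'low' if ratio < 1.2 else 'medium' if ratio < 1.6 else 'high'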

Step 3 — ClickHouse schema design

Design for fast OLAP queries. Use a MergeTree family table partitioned by date and sorted by a spatial key & time for range lookups.

Example DDL: raw events table

CREATE TABLE traffic_raw (
  provider String,
  event_type String,
  event_id String,
  event_time DateTime64(3),
  origin_lat Float64,
  origin_lng Float64,
  dest_lat Float64,
  dest_lng Float64,
  route_polyline String,
  duration_seconds Int32,
  distance_meters Int32,
  congestion_level String,
  geohash UInt64,               -- geohash cell packed or hashed into a UInt64 for spatial bucketing
  raw_payload String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (geohash, event_time, provider)
TTL event_time + INTERVAL 30 DAY
SETTINGS index_granularity = 8192;

Notes:

  • Use a geohash (e.g., 6–8 characters packed or hashed into a UInt64) for fast spatial bucketing; see the packing sketch after this list.
  • TTL enforces retention for raw events; keep longer for aggregated tables.
  • Store raw as JSON for future reprocessing.
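The ETL layer has to produce that UInt64 bucket somewhere. Here is a minimal sketch of the packing approach, assuming the standard geohash bit-interleaving and an arbitrary precision of 7 (roughly a 150 m cell); ClickHouse also ships geohashEncode/geohashDecode if you prefer server-side string geohashes.

def geohash_uint64(lat, lng, precision=7):
  # Interleave longitude/latitude bits (standard geohash order) and pack
  # precision * 5 bits into an integer that fits ClickHouse's UInt64.
  lat_lo, lat_hi = -90.0, 90.0
  lng_lo, lng_hi = -180.0, 180.0
  value = 0
  for i in range(precision * 5):
    if i % 2 == 0:                 # even bits refine longitude
      mid = (lng_lo + lng_hi) / 2
      bit = 1 if lng >= mid else 0
      if bit: lng_lo = mid
      else:   lng_hi = mid
    else:                          # odd bits refine latitude
      mid = (lat_lo + lat_hi) / 2
      bit = 1 if lat >= mid else 0
      if bit: lat_lo = mid
      else:   lat_hi = mid
    value = (value << 1) | bit
  return value

Points that share a cell at the chosen precision get the same value, so ORDER BY (geohash, event_time, provider) keeps events from the same cell together on disk.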

Example DDL: hourly rollup

-- Note: SummingMergeTree collapses rows with equal sorting-key values by summing numeric
-- columns, which is only correct for additive measures (e.g. events). For averages and
-- percentiles in production, prefer AggregatingMergeTree with aggregate-function states,
-- or store sums and counts and divide at query time.
CREATE TABLE traffic_hourly
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (geohash, provider, hour)
AS SELECT
  toStartOfHour(event_time) AS hour,
  provider,
  geohash,
  count() AS events,
  avg(duration_seconds) AS avg_duration_s,
  avg(distance_meters) AS avg_distance_m,
  quantile(0.9)(duration_seconds) AS p90_duration_s
FROM traffic_raw
GROUP BY hour, provider, geohash;

Step 4 — Ingest with Kafka engine or HTTP

You have two production-friendly ingestion choices:

  • Kafka engine: Good for high-volume streaming and replay. ClickHouse consumes JSON messages directly from Kafka topics and writes through materialized views into MergeTree.
  • HTTP bulk insert (JSONEachRow/CSV): Simple for small-scale or batch ingestion; implement backoff & batching. See audit-ready pipelines for patterns that validate and normalize JSON before ingest.

Kafka engine pattern

Create a Kafka table and a materialized view that transforms JSON to the MergeTree table.

CREATE TABLE kafka_traffic (
  message String
) ENGINE = Kafka SETTINGS
  kafka_broker_list = 'kafka:9092',
  kafka_topic_list = 'traffic-events',
  kafka_group_name = 'clickhouse-consumer',
  kafka_format = 'JSONAsString';  -- each whole JSON object lands in the single `message` column

CREATE MATERIALIZED VIEW mv_kafka_to_raw TO traffic_raw AS
SELECT
  JSONExtractString(message, 'provider') AS provider,
  JSONExtractString(message, 'event_type') AS event_type,
  JSONExtractString(message, 'event_id') AS event_id,
  parseDateTimeBestEffort(JSONExtractString(message, 'timestamp')) AS event_time,
  JSONExtractFloat(message, 'origin', 'lat') AS origin_lat,
  JSONExtractFloat(message, 'origin', 'lng') AS origin_lng,
  JSONExtractFloat(message, 'destination', 'lat') AS dest_lat,
  JSONExtractFloat(message, 'destination', 'lng') AS dest_lng,
  JSONExtractString(message, 'route_polyline') AS route_polyline,
  toInt32(JSONExtractInt(message, 'duration_seconds')) AS duration_seconds,
  toInt32(JSONExtractInt(message, 'distance_meters')) AS distance_meters,
  JSONExtractString(message, 'congestion_level') AS congestion_level,
  -- Bucket by a ~150 m geohash cell at the origin, hashed to fit the UInt64 column
  cityHash64(geohashEncode(JSONExtractFloat(message, 'origin', 'lng'), JSONExtractFloat(message, 'origin', 'lat'), 7)) AS geohash,
  message AS raw_payload
FROM kafka_traffic;

Notes: ClickHouse's JSONExtract* functions take nested keys as separate arguments (as above: 'origin', 'lat'); pre-validate messages upstream so malformed JSON never reaches the consumer and produces bad rows.

HTTP bulk insert pattern (Python example)

For batch inserts, send JSONEachRow to ClickHouse's HTTP endpoint. Example using requests:

import requests, json
CLICKHOUSE_URL = 'http://clickhouse:8123/?query=INSERT+INTO+traffic_raw+FORMAT+JSONEachRow'
batch = [ ... ]  # list of dicts whose keys match the traffic_raw columns (origin_lat, dest_lng, ...)
resp = requests.post(CLICKHOUSE_URL, data='\n'.join(json.dumps(e) for e in batch), timeout=30)
resp.raise_for_status()
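For production use, wrap that insert in batching plus exponential backoff, as the ingestion-choices list above recommends. A minimal sketch (retry counts and delays are arbitrary starting points):

import json, time, requests

CLICKHOUSE_URL = 'http://clickhouse:8123/?query=INSERT+INTO+traffic_raw+FORMAT+JSONEachRow'

def insert_batch(rows, max_retries=5, base_delay=1.0):
  # rows: dicts whose keys match the traffic_raw columns
  payload = '\n'.join(json.dumps(r) for r in rows)
  for attempt in range(max_retries):
    try:
      resp = requests.post(CLICKHOUSE_URL, data=payload.encode('utf-8'), timeout=30)
      resp.raise_for_status()
      return
    except requests.RequestException:
      if attempt == max_retries - 1:
        raise
      time.sleep(base_delay * (2 ** attempt))   # exponential backoff before retrying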

Example collector: Python microservice (Google Directions + Waze)

The collector fetches routes from Google and events from Waze, normalizes, and produces to Kafka. Keep collectors lightweight and idempotent.

import time, requests, json
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='kafka:9092', value_serializer=lambda v: json.dumps(v).encode('utf-8'))

def fetch_google_route(origin, dest, api_key):
  url = 'https://maps.googleapis.com/maps/api/directions/json'
  params = {'origin': f"{origin[0]},{origin[1]}", 'destination': f"{dest[0]},{dest[1]}", 'departure_time': 'now', 'key': api_key}
  r = requests.get(url, params=params)
  r.raise_for_status()
  return r.json()

# Simplified example, production code must handle retries, quota, and polyline decode

def normalize_google(resp, origin, dest):
  route = resp['routes'][0]
  leg = route['legs'][0]
  e = {
    'provider': 'google',
    'event_type': 'route',
    'event_id': route.get('overview_polyline', {}).get('points','')[:64],
    'timestamp': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
    'origin': {'lat': origin[0], 'lng': origin[1]},
    'destination': {'lat': dest[0], 'lng': dest[1]},
    'route_polyline': route.get('overview_polyline', {}).get('points', ''),
    'duration_seconds': leg['duration_in_traffic']['value'] if 'duration_in_traffic' in leg else leg['duration']['value'],
    'distance_meters': leg['distance']['value'],
    'congestion_level': 'unknown',
    'metadata': {'raw': route}
  }
  return e

# produce to Kafka
producer.send('traffic-events', normalize_google(fetch_google_route((40.7,-74.0),(40.8,-73.9),'GOOGLE_API_KEY'), (40.7,-74.0),(40.8,-73.9)))
producer.flush()

Step 5 — Spatial joins and comparing Google vs Waze

You want to align events across providers to compare travel time or congestion. Two robust strategies:

  1. Geohash bucketing: compute a geohash for each segment and group by geohash + time window.
  2. Geometry intersection: use ClickHouse geometry functions (for example pointInPolygon or the polygon-intersection family) to see whether Waze jam polygons intersect the Google route geometry.

Sample query: compare average duration by geohash

SELECT
  hour,
  geohash,
  anyIf(avg_duration_s, provider='google') AS google_avg,
  anyIf(avg_duration_s, provider='waze') AS waze_avg,
  (google_avg - waze_avg) AS diff_seconds
FROM (
  SELECT toStartOfHour(event_time) AS hour, provider, geohash, avg(duration_seconds) AS avg_duration_s
  FROM traffic_raw
  WHERE event_time >= now() - INTERVAL 7 DAY
  GROUP BY hour, provider, geohash
)
GROUP BY hour, geohash
ORDER BY hour DESC
LIMIT 100;

This query surfaces where providers disagree the most. Combine with p90 and count to avoid noise from low-sample cells.

Using geometry for precise matches

If you store route geometry as WKT (LINESTRING) and Waze jams as POLYGON, use ClickHouse's geometry functions (for example pointInPolygon or the polygon-intersection functions) to match precisely. ClickHouse has been improving geo support; prefer geometry joins for high-precision analysis if your dataset size permits.
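If you would rather pre-compute matches in the ETL layer than in SQL, a geometry library such as shapely can flag route/jam intersections before rows reach ClickHouse. A minimal sketch (the 0.0005-degree buffer, roughly 50 m, is an arbitrary tolerance for jams delivered as polylines):

from shapely.geometry import LineString, Polygon

def route_intersects_jam(route_points, jam_points):
  # route_points / jam_points: lists of (lng, lat) tuples, e.g. a decoded Google
  # overview_polyline and a Waze jam geometry.
  route = LineString(route_points)
  if len(jam_points) >= 3:
    jam = Polygon(jam_points)
  else:
    jam = LineString(jam_points).buffer(0.0005)   # widen a polyline jam into a thin polygon
  return route.intersects(jam)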

Step 6 — Aggregations, materialized views, and dashboard performance

For UI performance, precompute common aggregates:

  • Hourly/daily averages and percentiles by road segment (geohash or segment_id).
  • Top-K congested roads per area.
  • Delta tables that measure provider disagreement.

Materialized view example — hourly rollup

CREATE MATERIALIZED VIEW mv_hourly_rollup TO traffic_hourly AS
SELECT
  toStartOfHour(event_time) AS hour,
  provider,
  geohash,
  count() AS events,
  avg(duration_seconds) AS avg_duration_s,
  avg(distance_meters) AS avg_distance_m,
  quantile(0.9)(duration_seconds) AS p90_duration_s
FROM traffic_raw
GROUP BY hour, provider, geohash;

Materialized views reduce dashboard latency and avoid repeated heavy scans of raw tables. For orchestration and repeatable transformations, consider automation tooling (CI and orchestrators) such as FlowWeave to codify rollups and backfills.

Operational best practices (ETL & reliability)

  • Idempotency: Use event_id and dedupe at ingest to avoid double-counting (see the dedup sketch after this list).
  • Backfills: Keep raw_payload to reprocess with updated logic; use Kafka replay or batch loaders.
  • Monitoring: Track ingestion lag, ClickHouse replica health, and Kafka consumer group lag.
  • Security: Store API keys in a secret manager, use least-privilege service accounts, and encrypt traffic to ClickHouse. Also consider procurement & device hygiene when building edge collectors (procurement guidance).
  • Costs: Monitor Google Maps API billing; prefer Distance Matrix batching where possible.
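A minimal sketch of producer-side deduplication for the idempotency point above, assuming a single collector instance; with several producers you would back this with a shared store (e.g., Redis) or rely on engine-level dedup such as ReplacingMergeTree inside ClickHouse.

import hashlib, time

class RecentEventDeduper:
  def __init__(self, ttl_seconds=3600):
    self.ttl = ttl_seconds
    self.seen = {}                       # event key hash -> last-seen unix timestamp

  def is_duplicate(self, event):
    key = hashlib.sha256(f"{event['provider']}:{event['event_id']}".encode('utf-8')).hexdigest()
    now = time.time()
    # Evict expired entries so the cache does not grow without bound.
    self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
    if key in self.seen:
      return True
    self.seen[key] = now
    return False

Call deduper.is_duplicate(event) before producer.send(...) and skip the send when it returns True.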

Performance tuning for ClickHouse

For 2026-scale workloads, tune these settings:

  • Partitioning: Partition by month/day to prune quickly.
  • ORDER BY: Use geohash + time to accelerate spatial-time queries.
  • Index granularity: Increase index_granularity for larger rows to reduce memory at query time.
  • Compression: Use ZSTD or ZSTD(level) to save storage while maintaining speed.
  • Materialized views: Pre-aggregate heavy workloads and use MergeTree engines optimized for aggregation, such as SummingMergeTree for additive merges or AggregatingMergeTree for aggregate-function states.

Visualization best practices (UX and metrics)

Design dashboards that help users act:

  • Map overlay: Show Waze jams as polygons and Google route lines; color by deviation between providers.
  • Time slider: Let users compare time-of-day patterns and historical anomalies.
  • KPIs: p50/p90 travel_time, disagreement_rate, incident_count.
  • Alerting: Trigger alerts when Google & Waze diverge above a threshold for critical corridors (see the sketch after this list).
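A minimal sketch of that alerting idea: poll traffic_hourly over ClickHouse's HTTP interface and post to a webhook when last hour's divergence exceeds a threshold. The webhook URL, the 30% threshold, and the hourly schedule are assumptions.

import requests

CLICKHOUSE_URL = 'http://clickhouse:8123'
WEBHOOK_URL = 'https://hooks.example.com/traffic-alerts'   # placeholder
THRESHOLD = 0.30                                           # 30% relative divergence

QUERY = """
SELECT geohash,
       avgIf(avg_duration_s, provider = 'google') AS g,
       avgIf(avg_duration_s, provider = 'waze')   AS w
FROM traffic_hourly
WHERE hour = toStartOfHour(now() - INTERVAL 1 HOUR)
GROUP BY geohash
HAVING g > 0 AND w > 0 AND abs(g - w) / g > {threshold}
FORMAT JSONEachRow
""".format(threshold=THRESHOLD)

def check_divergence():
  resp = requests.post(CLICKHOUSE_URL, data=QUERY, timeout=30)
  resp.raise_for_status()
  diverging = [line for line in resp.text.splitlines() if line.strip()]
  if diverging:
    requests.post(WEBHOOK_URL, json={'text': f'{len(diverging)} corridors diverge by more than 30%'}, timeout=10)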

Sample analytic questions and ClickHouse queries

1) Where do Google and Waze disagree the most (7-day window)?

SELECT
  geohash,
  avg(google_avg) AS google_avg,
  avg(waze_avg) AS waze_avg,
  avg(abs(google_avg - waze_avg)) AS avg_abs_diff
FROM (
  SELECT
    hour, geohash,
    anyIf(avg_duration_s, provider='google') AS google_avg,
    anyIf(avg_duration_s, provider='waze') AS waze_avg
  FROM traffic_hourly
  WHERE hour >= now() - INTERVAL 7 DAY
  GROUP BY hour, geohash
)
GROUP BY geohash
ORDER BY avg_abs_diff DESC
LIMIT 50;

2) Top 10 congested segments this morning (p90)

SELECT geohash, provider, hour, p90_duration_s
FROM traffic_hourly
WHERE hour >= toStartOfDay(now())
ORDER BY p90_duration_s DESC
LIMIT 10;

Case study: from POC to production

Imagine a city operations team that wants to validate Waze-sourced jam alerts against Google travel_time estimates. Using the pipeline above, they:

  1. Deployed collectors for a curated set of critical OD pairs to reduce Google billing.
  2. Joined Waze Connected Citizens to receive official alerts and jams.
  3. Streamed normalized events to Kafka and used ClickHouse materialized views for hourly rollups.
  4. Built a dashboard showing disagreement by corridor and configured alerts for >30% deviation.

Result: The team can now prioritize incident responses where Waze reports jams but Google does not, and quantify how often Google underestimates congestion on certain arterial roads.

Trends shaping traffic analytics in 2026

As of 2026, several trends shape traffic analytics:

  • Edge collection: Mobile-edge collectors and CDN-based functions reduce API round-trips and cost for frequent route sampling.
  • Vector tiles & geometry compute: More tooling supports vector tile-based joins and server-side geometry ops in ClickHouse, enabling richer spatial analyses.
  • AI-assisted anomaly detection: Use LLMs and time-series models to flag unusual deviations; ClickHouse async queries + export to ML pipelines remains common. See audit-ready text pipelines for provenance and normalization patterns that feed ML workflows.
  • Privacy-first data: Differential privacy and coarse-grained sampling for user-level events to comply with stricter post-2024 privacy regimes — consider local-first sync appliances and edge storage options when designing collectors.

In practice, pair ClickHouse with a lightweight feature store or ML pipeline to run models that predict incident propagation or forecast travel_time changes.

Common gotchas and troubleshooting

  • Duplicate events: Ensure dedupe by provider+event_id and/or content hashing.
  • Bad JSON: Validate payloads before sending to Kafka to avoid poisoning ClickHouse JSON parsers.
  • High-cardinality geohash: Choose geohash precision wisely to avoid too many partitions or index inefficiency (see the precision helper after this list).
  • Query spikes: Use pre-aggregations and cache commonly accessed tiles to avoid spike-induced cluster slowdowns.
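For the geohash-precision point above, the commonly cited approximate cell sizes are a useful rule of thumb; the small helper below picks a precision for a target segment length (the table values are equator approximations).

# Approximate geohash cell dimensions (width x height, meters) near the equator.
GEOHASH_CELL_SIZE_M = {
  5: (4900, 4900),
  6: (1200, 610),
  7: (153, 153),
  8: (38, 19),
}

def suggest_precision(target_segment_length_m):
  # Pick the coarsest precision whose cell is no larger than the target length.
  for precision in sorted(GEOHASH_CELL_SIZE_M):
    width, height = GEOHASH_CELL_SIZE_M[precision]
    if max(width, height) <= target_segment_length_m:
      return precision
  return max(GEOHASH_CELL_SIZE_M)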

Security, cost, and compliance checklist

  • Use secrets manager for API keys and rotate regularly.
  • Monitor Google Maps API billing and use batching to reduce calls.
  • Document data retention and anonymization policies for GDPR/CCPA compliance.
  • Confirm Waze/Google usage rights for derived analytics and dashboards (especially public-facing).

Next steps: concrete checklist to implement this pipeline

  1. Register for Google Maps APIs and join Waze for Cities / Connected Citizens if needed.
  2. Prototype a collector for a small set of OD pairs and a Waze feed consumer.
  3. Set up Kafka (or use managed alternatives) and a ClickHouse cluster (managed or self-hosted).
  4. Create the raw and rollup tables shown above and test ingestion with small batches.
  5. Build a Grafana dashboard and iterate on the UX with stakeholders.
  6. Scale ingestion by adding sampling, backpressure, and monitoring.

Final thoughts

Combining Google Maps and Waze data gives you complementary perspectives: Google offers route-based, predictive travel_time estimates and enterprise-grade APIs; Waze contributes community-sourced incident signals and highly localized jams. In 2026, with ClickHouse's OLAP speed and streaming integrations, you can build dashboards that answer operational questions in near-real-time and scale to city-wide analysis.

Action — try it now

Ready to build this pipeline? Start with a single OD pair and one Waze feed. Put together the normalized JSON model, stream a small dataset to ClickHouse (HTTP insert), and create the hourly rollup. If you want a reference repository, cloning a template collector + ClickHouse schema will get you from 0 to dashboard in a day. Comment with the city you're analyzing and the road corridor — I’ll suggest geohash precision and sampling rates to get meaningful results quickly.

Call to action: Implement the pipeline for one critical corridor, run the comparison queries in this guide, and share your dashboard screenshots or query results in the comments to get feedback. If you'd like, I can provide a starter GitHub scaffold (collector + ClickHouse DDL + Grafana dashboard) tailored to your dataset.
