Big batch of duplicate cities

Hey @JamesChevalier, I’ve been doing some data gathering and found some dupes you might be interested in. The heuristic I used is: if two cities have the same name and the same street count, I mark them as likely duplicates. This seems to hold up once you pass a certain minimum number of streets (read: the ones at 1 street are probably false positives).
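
For anyone who wants to reproduce the heuristic, here is a minimal pandas sketch. It assumes a DataFrame with the columns `cs_id`, `name`, `region_country` and `street_count`; the function name `find_likely_duplicates` and the `min_streets` cutoff are made up for illustration, and the "Exampleville" rows are invented sample data.

```python
import pandas as pd


def find_likely_duplicates(cities: pd.DataFrame, min_streets: int = 2) -> pd.DataFrame:
    """Return every row that shares both name and street_count with another row."""
    # Drop tiny cities first: matches at 1 street are probably false positives.
    big_enough = cities[cities["street_count"] >= min_streets]
    # keep=False flags all members of a duplicate group, not just the later ones.
    is_dupe = big_enough.duplicated(subset=["name", "street_count"], keep=False)
    return big_enough[is_dupe].sort_values(
        ["street_count", "name", "cs_id"], ascending=[False, True, True]
    )


cities = pd.DataFrame(
    {
        "cs_id": [5732, 6087, 12661, 11286, 1, 2, 3],
        "name": ["El Paso", "El Paso", "Bella Vista", "Bella Vista",
                 "Exampleville", "Exampleville", "Exampleville"],
        "region_country": [
            "New Mexico, United States", "Texas, United States",
            "Arkansas, United States", "Missouri, United States",
            "Region A, Nowhere", "Region B, Nowhere", "Region C, Nowhere",
        ],
        # The invented rows: two 1-street matches and one with a different count.
        "street_count": [6541, 6541, 2683, 2683, 1, 1, 40],
    }
)

dupes = find_likely_duplicates(cities)
print(dupes)
```

Same name with a different street count is not flagged, and the 1-street matches are filtered out by the cutoff.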

Here is the full list gathered. It’s in CSV format. That website seems to have a little bit of trouble with RTL characters though, so let me know if I need to upload it elsewhere.

Quite a few are near a border, which I assume is because their boundary slightly overlapped the boundary of a larger area, so they got caught in an Overpass query.

Also to note: Maryland has a lot of duplicate cities.

Here are the largest few from that list (ignore the first column, it’s an internal index in the pandas library). Sorry to anyone that may lose a lot of streets because of this :sweat_smile:

| | cs_id | name | region_country | street_count |
|---|---|---|---|---|
| 150787 | 225689 | محافظة الإسكندرية | البحيرة, مصر | 8492 |
| 150788 | 225682 | محافظة الإسكندرية | محافظة الإسكندرية, مصر | 8492 |
| 39984 | 5732 | El Paso | New Mexico, United States | 6541 |
| 39985 | 6087 | El Paso | Texas, United States | 6541 |
| 11930 | 12661 | Bella Vista | Arkansas, United States | 2683 |
| 11931 | 11286 | Bella Vista | Missouri, United States | 2683 |
| 70822 | 95249 | Laredo | Tamaulipas, México | 2304 |
| 70824 | 5000 | Laredo | Texas, United States | 2304 |
| 171168 | 223550 | თბილისი | თბილისი, საქართველო | 2149 |
| 171169 | 229539 | თბილისი | თბილისი, საქართველო | 2149 |
| 128387 | 6726 | St. George | Arizona, United States | 2066 |
| 128391 | 8497 | St. George | Utah, United States | 2066 |
| 98073 | 2521 | Pahrump | California, United States | 1727 |
| 98074 | 5523 | Pahrump | Nevada, United States | 1727 |
| 70700 | 136722 | Lappeenranta | Северо-Западный федеральный округ, Россия | 1726 |
| 70701 | 177672 | Lappeenranta | Etelä-Karjala, Suomi | 1726 |
| 64746 | 212626 | Kaunas | Kauno apskritis, Lietuva | 1617 |
| 64747 | 213887 | Kaunas | Kauno apskritis, Lietuva | 1617 |
| 63919 | 221250 | Kabupaten Kampar | Riau, Indonesia | 1509 |
| 63920 | 221230 | Kabupaten Kampar | Sumatera Barat, Indonesia | 1509 |
| 133438 | 209246 | Timișoara | Vest, România | 1252 |
| 133439 | 220276 | Timișoara | Timiș, România | 1252 |
| 122933 | 219767 | Sector 1 | Municipiul București, România | 1243 |
| 122934 | 229542 | Sector 1 | Municipiul București, România | 1243 |
| 44865 | 12107 | Fort Smith | Arkansas, United States | 1241 |
| 44867 | 4223 | Fort Smith | Oklahoma, United States | 1241 |
| 84788 | 95250 | Mission | Tamaulipas, México | 1081 |
| 84792 | 5624 | Mission | Texas, United States | 1081 |
| 39791 | 214503 | Ellicott City | Maryland, United States | 1073 |
| 39792 | 187896 | Ellicott City | Maryland, United States | 1073 |
| 74068 | 10167 | Lexington Park | Maryland, United States | 1070 |
| 74069 | 214495 | Lexington Park | Maryland, United States | 1070 |
| 29320 | 208524 | Cluj-Napoca | Nord-Vest, România | 1063 |
| 29321 | 219204 | Cluj-Napoca | Cluj, România | 1063 |
| 81858 | 95251 | McAllen | Tamaulipas, México | 1060 |
| 81859 | 5641 | McAllen | Texas, United States | 1060 |
| 49214 | 214370 | Glen Burnie | Maryland, United States | 1031 |
| 49215 | 5148 | Glen Burnie | Maryland, United States | 1031 |
| 128530 | 16765 | St. Joseph | Kansas, United States | 1013 |
| 128533 | 12557 | St. Joseph | Missouri, United States | 1013 |
| 122941 | 219775 | Sector 5 | Municipiul București, România | 1003 |
| 122942 | 229544 | Sector 5 | Municipiul București, România | 1003 |
| 151511 | 58921 | Λευκωσία - Lefkoşa | Κύπρος - Kıbrıs, Κύπρος - Kıbrıs | 1001 |
| 151512 | 159142 | Λευκωσία - Lefkoşa | Kuzey Kıbrıs, Κύπρος - Kıbrıs | 1001 |
| 122935 | 219763 | Sector 2 | Municipiul București, România | 999 |
| 122936 | 229541 | Sector 2 | Municipiul București, România | 999 |
| 111615 | 172000 | Rotorua District | Bay of Plenty, New Zealand / Aotearoa | 982 |
| 111616 | 172045 | Rotorua District | Waikato, New Zealand / Aotearoa | 982 |
| 108470 | 189551 | Rejon raciborski | województwo opolskie, Polska | 981 |
| 108471 | 191788 | Rejon raciborski | województwo śląskie, Polska | 981 |
| 30754 | 208985 | Constanța | Sud-Est, România | 981 |
| 30755 | 219259 | Constanța | Constanța, România | 981 |
| 101500 | 95253 | Pharr | Tamaulipas, México | 945 |
| 101501 | 13545 | Pharr | Texas, United States | 945 |
| 60383 | 208831 | Iași | Nord-Est, România | 939 |
| 60384 | 219664 | Iași | Iași, România | 939 |
| 103054 | 208690 | Ploiești | Centru, România | 924 |
| 103055 | 209113 | Ploiești | Sud-Muntenia, România | 924 |
| 103056 | 219895 | Ploiești | Prahova, România | 924 |
| 32442 | 209184 | Craiova | Sud-Vest Oltenia, România | 919 |
| 32443 | 219339 | Craiova | Dolj, România | 919 |
| 124020 | 214381 | Severna Park | Maryland, United States | 885 |
| 124021 | 5268 | Severna Park | Maryland, United States | 885 |
| 11738 | 214456 | Bel Air South | Maryland, United States | 885 |
| 11739 | 7246 | Bel Air South | Maryland, United States | 885 |
| 131490 | 172003 | Taupō District | Bay of Plenty, New Zealand / Aotearoa | 874 |
| 131491 | 172008 | Taupō District | Hawke's Bay, New Zealand / Aotearoa | 874 |
| 131492 | 172016 | Taupō District | Manawatu-Wanganui, New Zealand / Aotearoa | 874 |
| 131493 | 172046 | Taupō District | Waikato, New Zealand / Aotearoa | 874 |

Thanks for gathering this list! :raised_hands:
:sweat_smile: My trouble now is that I have to look at every single city to determine which one is correct.

If a bunch of us work through the list at the same time, it will go much faster.
I added a sheet named “Duplicate City List” to Missing/Broken Cities Tracker - Google Sheets that I’ll work in. I’ll delete rows as I work through them instead of marking them as ‘delete’, and I’ll batch delete rows as they’re filled in by any of you who feel like helping out.

Are there Striders in the duplicated city?

I looked at El Paso, New Mexico - CityStrides and see 166 Striders.
El Paso, Texas - CityStrides also 166 Striders.

OK, so that idea won’t work.

Or would it? @JamesChevalier Would it be easier to verify that all (any) Striders (GPS data) have actually Strode in both cities? Wherever the Striders aren’t is the delete.

Updated sheet that El Paso, NM is a delete.
As is Bella Vista, MO

No, they’re complete duplicates. The only difference between each is which region they’re related to.

So I could randomly delete one of each dupe and then wait for the “my city is listed in the wrong region” support requests to come in. I don’t want to handle more support requests, though.

Yikes! :man_facepalming:

So I was looking at (Google maps) my home state of Texas, marking deletes, and learned for the first time, that there is a:

Texarkana, TX and a Texarkana, AR (Arkansas)

So I don’t know what to do. :thinking:

Now I see:

[image]

Will look closer in a bit and figure which are deletes.

I expect that the process to figure out which dupe should be deleted will be:

  • Click the link for one of the dupe cities
  • Zoom out a little bit to determine which region the city is in
  • If the border you’re looking at is in the region it’s listed as, then it’s the other city that gets deleted
  • If the border you’re looking at is not in the region it’s listed as, then it’s this city that gets deleted

Well, that paired with any local knowledge and just knowing what’s right/wrong. :smile:

Hah, yup that is too long to quickly do alone. I’m travelling this weekend or I would have started some before even posting this. Now I figured maybe someone will find a handy solution by the time I return :stuck_out_tongue:

Potentially there is a way to do this programmatically on your database, if you can easily perform a check whether a point is within a boundary: if one of the two city options has decidedly more points of its boundary within the boundary of the region it is linked to, then it is probably the correct one to keep. This assumes you have the region boundaries stored, idk if you do.
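
A rough sketch of that idea in plain Python, in case it helps. It assumes both the city and region boundaries are available as simple lists of (lon, lat) points, treats coordinates as flat x/y, and ignores holes and multipolygons. The names `point_in_polygon`, `fraction_inside`, `pick_keeper` and the `margin` threshold are all hypothetical, and the ids 1 and 2 are placeholders.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (longitude, latitude), treated as flat x/y


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray casting: toggle 'inside' for every edge crossed by a ray going right."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def fraction_inside(city: Sequence[Point], region: Sequence[Point]) -> float:
    """Share of the city's boundary points that fall inside the region."""
    if not city:
        return 0.0
    return sum(point_in_polygon(p, region) for p in city) / len(city)


def pick_keeper(
    candidates: List[Tuple[int, Sequence[Point], Sequence[Point]]],
    margin: float = 0.25,
) -> Optional[int]:
    """candidates: (id, city boundary, boundary of the region it is linked to).

    Returns the id to keep, or None when no candidate is decidedly better.
    """
    scored = sorted(
        ((fraction_inside(city, region), cs_id) for cs_id, city, region in candidates),
        reverse=True,
    )
    if len(scored) < 2 or scored[0][0] - scored[1][0] < margin:
        return None
    return scored[0][1]


# Toy data: one city square, two neighbouring region squares.
city = [(2.0, 2.0), (4.0, 2.0), (4.0, 4.0), (2.0, 4.0)]
region_a = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
region_b = [(10.0, 0.0), (20.0, 0.0), (20.0, 10.0), (10.0, 10.0)]

keeper = pick_keeper([(1, city, region_a), (2, city, region_b)])
print(keeper)
```

Returning `None` when the scores are close leaves the ambiguous pairs for manual review instead of guessing.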

Bummer, I don’t have region boundaries stored.

I just ran some tests with http://www.rubygeocoder.com using both their coordinate response and bounding box … but distance from the city center that I have to their returned coordinates isn’t reliable enough … and neither is whether or not the city is within the bounding box. These two (absurd) examples highlight that really well :laughing:

I can cover off the New Zealand duplicates. For Rotorua District, there is also a plain Rotorua that is a duplicate as well (sorry, I’m not sure where to see the CS ID). I think the Rotorua District in Bay of Plenty is the best one to keep, as it is representative of the city based on the boundaries.

For Taupo District, the Waikato version would be best to use.

The duplicates look to come from the level 6 boundary crossing multiple level 4 boundaries. Not sure why it is set up that way in OSM. I don’t know if it is correct or not, so it isn’t something I would just change.

So this is interesting… There is a:

|Jacksonville|Illinois|United States|332|
and a
|Jacksonville|Texas|United States|332|

And they do not share a border, yet street count is the same.

Actually, there are quite a number of J-villes!

Question: Is there a way to have the Sheets page open in two tabs, so that changing the filter in one tab does not affect the other?

Basically, I’m trying to figure out how to show only Texas cities but also see where each one’s twin is, so I can determine which is correct.

Edit, or I could try using two different browsers… Or maybe pull the tab off into a new window? ← I’ll try that when I get back.

Thinking out loud: There are so many cities, 4246, needing looking at, that I am doubtful there are enough Striders, in this forum, with enough knowledge, to accomplish this task… Unless we just wait for the problem to be noticed (then point them here)?

There isn’t really much need for specific knowledge.

Mapbox’s design does a pretty good job of showing the region name in a slightly larger font. I suppose you’re right as far as language is concerned, though e.g. https://citystrides.com/cities/141262

Update: after going through a few myself, I’m noticing discrepancies between what Mapbox displays as the region and what is present as the region in CityStrides. This definitely adds to the difficulty of figuring these dupes out in certain places, so I see what you mean now.

Looking at the distribution of duplicates by country I suspect that there is some systemic issue with the imports from Romania, Russia, Ukraine, Latvia and Lithuania. Finding the issue would probably be quicker than going through the list one-by-one.
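
For reference, a per-country count like this can be pulled out of the list with a few lines of pandas. This is only a sketch: it assumes the country is whatever follows the last comma in the `region_country` column, and `duplicates_per_country` is a made-up name. The sample rows are taken from the list at the top of the thread.

```python
import pandas as pd


def duplicates_per_country(dupes: pd.DataFrame) -> pd.Series:
    """Count duplicate rows per country, most affected country first."""
    # "Texas, United States" -> "United States": keep the part after the last comma.
    country = dupes["region_country"].str.rsplit(pat=",", n=1).str[-1].str.strip()
    return country.value_counts()


dupes = pd.DataFrame(
    {
        "name": ["El Paso", "El Paso", "Timișoara", "Timișoara",
                 "Cluj-Napoca", "Cluj-Napoca"],
        "region_country": [
            "New Mexico, United States", "Texas, United States",
            "Vest, România", "Timiș, România",
            "Nord-Vest, România", "Cluj, România",
        ],
    }
)

counts = duplicates_per_country(dupes)
print(counts)
```

Raw counts favour countries with many cities, so dividing by each country's total city count would give a fairer picture if that total is available.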

I had a (very) quick look at Romania, and I suspect that overlapping entities were imported. For example, the Centru region is composed of six counties (Centru (development region) - Wikipedia), which are individually also present as regions in CS. There are 8 such development regions, so it’s probably a question of deciding whether these or the counties make the better region for CS purposes and deleting the other entity type.

Yeah, for Romania it’s a decision between…

The Development regions of Romania - Wikipedia

Development Regions in CityStrides

București - Ilfov
Centru
Nord-Est
Nord-Vest
Sud-Est
Sud-Muntenia
Sud-Vest Oltenia
Vest

The Administrative divisions of Romania - Wikipedia

Counties in CityStrides

Alba
Arad
Argeș
Bacău
Bihor
Bistrița-Năsăud
Botoșani
Brăila
Brașov
Buzău
Călărași
Caraș-Severin
Cluj
Constanța
Covasna
Dâmbovița
Dolj
Galați
Giurgiu
Gorj
Harghita
Hunedoara
Ialomița
Iași
Ilfov
Maramureș
Mehedinți
Municipiul București
Municipiul București
Mureș
Neamț
Olt
Prahova
Sălaj
Satu Mare
Sibiu
Suceava
Teleorman
Timiș
Tulcea
Vâlcea
Vaslui
Vrancea

Wikipedia notes:

The developing regions of Romania have no administrative role. They were formed to draw funds from the European Union

I’m not Romanian, so I don’t know what level would most closely relate to “States” in the US. My instinct is to go with the Counties, because most of what I read about the Development Regions is that they’re present for better integration with the EU but have no internal standing, but I don’t feel confident about the decision.

(As an aside, I certainly never expected that I’d be concerned with Romania’s administrative structure when I started building CityStrides :sweat_smile: :rofl: )

For Ukraine, I have a very clean region/city specification where regions are admin level 4 and cities are admin level 8. Most of the duplicate cities I’m seeing in the list are proper duplicates - they’re in the same region & I just have to delete one.
Some other duplicates that I’m seeing are just same-named cities in different regions e.g.:

I should be able to clean this portion of the list up pretty quickly.

Update: further down the list, I’m seeing a few duplicate cities across regions. I’m guessing that the Overpass query included cities along the region borders in odd ways. Ugh.
I’m also seeing that a huge part of this list is Crimea being included in both Ukraine and Russia. :sweat_smile: :grimacing:

For Latvia and Lithuania, that looks like it’s coming down to the Overpass query for their Regions just way overstepping boundaries. I compared the output from a few different Overpass servers, and it looks like things were cleaned up sometime after July (some servers respond with old data). I’m hopeful that this cleanup will be a matter of determining which regions are in which countries & deleting the extras - seems like a simple task. (famous last words, and all)

Update: :rofl: it didn’t take too long to eat my words

Exactly what I was thinking! :upside_down_face:

In other European countries like France, Italy or Poland, the regions in CS seem to sit closer to the equivalent of the development regions (for example, for France the counties would be Departments of France - Wikipedia and the development regions would be Regions of France - Wikipedia). That said, based on the description of the Romanian development regions I would agree that there is a strong case to be made to use the lower level in this instance. The level that would most closely relate to US states is probably the country as a whole: it has about the population of NY state and the size of Minnesota.

As far as Crimea is concerned, the pragmatic approach might be to wait a few months before deleting one or the other :frowning:

Has anyone been able to spot a pattern for the US ones yet? Is it mostly cities on state borders (like the Baltics), or is it just the sheer number of cities that makes it possible to have so many same-named cities with the same number of streets across different states?

I’ve gone through a number of the less represented countries. There are actually quite a few cases of genuine “same name different place” cities in the list, especially when it’s small villages. North Macedonia alone has 3 one-street “cities” with the same name :laughing:

This one slipped through the cracks before due to slightly different boundaries and thus street counts. You’d think I would have noticed due to it being the 7th biggest city in CityStrides, but the Arabic name made me not notice the similarities as easily as I would have in Latin script.
