Hey @JamesChevalier, I’ve been doing some data gathering and found some dupes you might be interested in. The heuristic I used is: if two cities have the same name and the same street count, I mark them as a likely duplicate. This seems to hold up once you pass a certain minimum number of streets (read: the ones at 1 street are probably false positives).
Here is the full list gathered. It’s in CSV format. That website seems to have a little trouble with RTL characters though, so let me know if I need to upload it elsewhere.
Quite a few are near a border, which I assume is because their boundary slightly overlapped the boundary of a larger area, so they got caught in an Overpass query.
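For anyone who wants to reproduce the check, here is a minimal sketch of that heuristic in plain Python. The field names (`id`, `name`, `region`, `street_count`), the `min_streets` cutoff and the sample rows are all made up for illustration; they are not the real CityStrides schema or data, and this is not necessarily how the original list was produced.

```python
from collections import defaultdict


def find_likely_duplicates(cities, min_streets=2):
    """Group cities by (name, street_count) and keep groups with 2+ members.

    Cities below `min_streets` are skipped, since very small places tend to
    collide by coincidence (assumed cutoff, tune as needed).
    """
    groups = defaultdict(list)
    for city in cities:
        if city["street_count"] >= min_streets:
            groups[(city["name"], city["street_count"])].append(city)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Hypothetical sample rows, not real data.
cities = [
    {"id": 1, "name": "Springfield", "region": "Region A", "street_count": 120},
    {"id": 2, "name": "Springfield", "region": "Region B", "street_count": 120},
    {"id": 3, "name": "Springfield", "region": "Region C", "street_count": 95},
    {"id": 4, "name": "Hilltop", "region": "Region A", "street_count": 1},
    {"id": 5, "name": "Hilltop", "region": "Region B", "street_count": 1},
]

likely_duplicates = find_likely_duplicates(cities, min_streets=2)
for (name, street_count), members in likely_duplicates.items():
    print(name, street_count, [c["id"] for c in members])
```

With `min_streets=2` the two one-street "Hilltop" rows are ignored; lowering it to 1 would flag them too, which is where the false positives come from.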
Also to note: Maryland has a lot of duplicate cities.
Here are the largest few from that list (ignore the first column, it’s an internal index in the pandas library). Sorry to anyone that may lose a lot of streets because of this
Thanks for gathering this list!
My trouble now is that I have to look at every single city to determine which one is correct.
If a bunch of us work through the list at the same time, it will go much faster…
I added a sheet named “Duplicate City List” to Missing/Broken Cities Tracker - Google Sheets that I’ll work in. I’ll delete rows as I work through them instead of marking them as ‘delete’, and I’ll batch delete rows as they’re filled in by any of you who feel like helping out.
Hah, yup that is too long to quickly do alone. I’m travelling this weekend or I would have started some before even posting this. Now I figured maybe someone will find a handy solution by the time I return
Potentially there is a way to do this programmatically on your database, if you can easily perform a check whether a point is within a boundary: if one of the two city options has decidedly more points of its boundary within the boundary of the region it is linked to, then it is probably the correct one to keep. This assumes you have the region boundaries stored, idk if you do.
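A rough sketch of that idea, assuming boundaries are available as simple lists of (longitude, latitude) points. It uses a standard ray-casting point-in-polygon test and treats coordinates as flat, which is only an approximation and ignores holes, multipolygons and antimeridian crossings. The function names and the sample shapes are hypothetical; a real database would more likely use its own spatial functions.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if `point` lies inside `polygon`.

    `polygon` is a list of (x, y) vertices; the last vertex is implicitly
    connected back to the first.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def fraction_inside(city_boundary, region_boundary):
    """Share of the city's boundary points that fall inside the region."""
    hits = sum(point_in_polygon(p, region_boundary) for p in city_boundary)
    return hits / len(city_boundary)


# Made-up shapes: a square region and two candidate city boundaries.
region = [(0, 0), (10, 0), (10, 10), (0, 10)]
city_a = [(1, 1), (2, 1), (2, 2), (1, 2)]    # fully inside the region
city_b = [(9, 1), (12, 1), (12, 3), (9, 3)]  # straddles the region border

score_a = fraction_inside(city_a, region)
score_b = fraction_inside(city_b, region)
print(score_a, score_b)
```

The candidate with the clearly higher score for its linked region would be the one to keep. When the scores are close, it probably still needs a manual look.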
I just ran some tests with http://www.rubygeocoder.com using both their coordinate response and bounding box … but distance from the city center that I have to their returned coordinates isn’t reliable enough … and neither is whether or not the city is within the bounding box. These two (absurd) examples highlight that really well
I can cover off the New Zealand duplicates. For Rotorua District, there is also just Rotorua, which is also a duplicate (sorry, I’m not sure where to see the CS ID). I think selecting the Rotorua District in Bay of Plenty is the best option and is representative of the city based on the boundaries.
For Taupo District, the Waikato version would be best to use.
The duplicates look to come from the level 6 boundary crossing multiple level 4 boundaries. I’m not sure why it is set up that way in OSM, and I don’t know if it is correct or not, so it isn’t something I would just change.
Question: Is there a way to have the Sheets page open, in two tabs, but changing the filter in one tab, does not affect the other?
Basically I’m trying to figure out how to show only Texas cities, but also see where its twin is, so I can determine which is correct?
Edit, or I could try using two different browsers… Or maybe pull the tab off into a new window? ← I’ll try that when I get back.
Thinking out loud: There are so many cities, 4246, needing looking at, that I am doubtful there are enough Striders, in this forum, with enough knowledge, to accomplish this task… Unless we just wait for the problem to be noticed (then point them here)?
Update after going through a few myself: I’m noticing discrepancies between what Mapbox displays as the region and what is present as the region in CityStrides. This definitely adds to the difficulty of figuring these dupes out in certain places, so I see what you mean now.
Looking at the distribution of duplicates by country I suspect that there is some systemic issue with the imports from Romania, Russia, Ukraine, Latvia and Lithuania. Finding the issue would probably be quicker than going through the list one-by-one.
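For reference, a per-country tally like this is a few lines of Python. This is only a sketch: the `country` field and the rows below are placeholders, not values from the actual duplicate list.

```python
from collections import Counter


def duplicates_by_country(rows):
    """Count duplicate-list rows per country, most frequent first."""
    return Counter(row["country"] for row in rows).most_common()


# Placeholder rows standing in for the duplicate list (not real counts).
rows = [
    {"name": "City 1", "country": "Romania"},
    {"name": "City 2", "country": "Romania"},
    {"name": "City 3", "country": "Romania"},
    {"name": "City 4", "country": "Latvia"},
    {"name": "City 5", "country": "New Zealand"},
]

tally = duplicates_by_country(rows)
for country, count in tally:
    print(country, count)
```

A country that sits far above the rest of the tally is a hint that the import itself went wrong there, rather than individual cities.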
I had a (very) quick look at Romania, and I suspect that overlapping entities were imported. For example, the Centru region is composed of six counties (Centru (development region) - Wikipedia), which are individually also present as regions in CS. There are 8 such development regions, so it’s probably a question of deciding whether these or the counties make the better region for CS purposes and deleting the other entity type.
The development regions of Romania have no administrative role. They were formed to draw funds from the European Union.
I’m not Romanian, so I don’t know what level would most closely relate to “States” in the US. My instinct is to go with the Counties, because most of what I read about the Development Regions is that they’re present for better integration with the EU but have no internal standing, but I don’t feel confident about the decision.
(As an aside, I certainly never expected that I’d be concerned with Romania’s administrative structure when I started building CityStrides )
For Ukraine, I have a very clean region/city specification where regions are admin level 4 and cities are admin level 8. Most of the duplicate cities I’m seeing in the list are proper duplicates - they’re in the same region & I just have to delete one.
Some other duplicates that I’m seeing are just same-named cities in different regions e.g.:
I should be able to clean this portion of the list up pretty quickly.
Update: further down the list, I’m seeing a few duplicate cities across regions. I’m guessing that the Overpass query included cities along the region borders in odd ways. Ugh.
I’m also seeing that a huge part of this list is Crimea being included in both Ukraine and Russia.
For Latvia and Lithuania, that looks like it’s coming down to the Overpass query for their Regions just way overstepping boundaries. I compared the output from a few different Overpass servers, and it looks like things were cleaned up sometime after July (some servers respond with old data). I’m hopeful that this cleanup will be a matter of determining which regions are in which countries & deleting the extras - seems like a simple task. (famous last words, and all)
In other European countries like France, Italy or Poland the regions in CS seem to be more at what the equivalent of the development regions would be (for example for France the counties would be Departments of France - Wikipedia and the development region Regions of France - Wikipedia). That said, based on the description of the Romanian development regions I would agree that there is a strong case to be made to use the lower level in this instance. The level that would most closely relate to US states is probably the country as a whole: it has about the population of NY state and the size of Minnesota.
As far as Crimea is concerned, the pragmatic approach might be to wait a few months before deleting one or the other
Has anyone been able to spot a pattern for the US ones yet? Is it mostly cities on state borders (like the Baltics), or is it just the sheer number of cities that makes it possible to have so many cities with the same name and the same number of streets across different states?
I’ve gone through a number of the less represented countries. There are actually quite a few cases of genuine “same name different place” cities in the list, especially when it’s small villages. North Macedonia alone has 3 one-street “cities” with the same name
This one slipped through the cracks before due to slightly different boundaries and thus street counts. You’d think I would have noticed due to it being the 7th biggest city in CityStrides, but the Arabic name made me not notice the similarities as easily as I would have in Latin script.