
SE Radio 694: Jennings Anderson and Amy Rose on Overture Maps

Jennings Anderson, a Software Engineer with Meta Platforms, and Amy Rose, the Chief Technology Officer at Overture Maps Foundation, speak with host Gregory M. Kapfhammer about the Overture Maps project, which creates reliable, easy-to-use, and interoperable open map data. After exploring the foundations of geospatial information systems, Gregory and his guests dive deep into the implementation of Overture Maps through features like the Global Entity Reference System (GERS). In addition to discussing the organizational structure of the Overture Maps Foundation and the need for a unified database of geospatial data, Jennings and Amy explain how to implement applications using data from Overture Maps.

Brought to you by IEEE Computer Society and IEEE Software magazine.



Show Notes

Related Episodes

Other References


Transcript

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Gregory Kapfhammer 00:00:19 Welcome to Software Engineering Radio. I’m your host, Gregory Kapfhammer. Today’s guests are Amy Rose and Jennings Anderson. Amy is the CTO at the Overture Maps Foundation, and Jennings is a software engineer at Meta Platforms. Amy and Jennings, welcome to the show.

Amy Rose 00:00:36 Thanks, Greg. Thanks for having us.

Jennings Anderson 00:00:37 Happy to be here.

Gregory Kapfhammer 00:00:38 Hey, I’m glad that you’re here for this interview where we’re going to be exploring Overture Maps. It’s a project that’s aimed at creating reliable, easy-to-use, and interoperable open map data. We’re going to start by diving into the world of geospatial data for software engineers, and then we’re going to talk more about Overture’s implementation and its ecosystem. Amy and Jennings, are you ready to dive in?

Jennings Anderson 00:01:02 Ready. Let’s do it.

Gregory Kapfhammer 00:01:03 All right. So we’re going to start by introducing some of the core concepts associated with geospatial data and the systems that software engineers may use in order to explore or build geospatial data platforms. Amy, to start off our discussion, what is a geospatial information system? Can you give me more details about it?

Amy Rose 00:01:22 Yeah, sure. So Geographic Information System, so shorthand would be GIS. It’s basically a system that’s designed to capture, store, manipulate, analyze, manage, and then present, I guess is a good word to say, all types of geographical data. So if you think of GIS as kind of a super powered map, that can do a lot more than just show you where things are. So it’s very akin to any other kind of information system, combines hardware and software, data, methods, and of course there has to be the people that operate that system and interpret the results that come out of it. So that’s kind of GIS in a nutshell.

Gregory Kapfhammer 00:02:01 Many of our listeners may be familiar with relational data. Can you explain how geospatial data might be different to or similar to relational data?

Amy Rose 00:02:11 Yeah, I mean, in a lot of ways it’s very similar, and it can be stored in databases, queried, linked, just like relational data. In fact, we can store it very similarly. For example, let’s say in a relational database, you might have a table of customers with their names and addresses. In GIS, you could actually link that address to a specific point on the map. So a specific place in the world. The biggest difference, I think, is that geospatial data obviously has a spatial component, meaning that it’s tied to a specific location on the earth’s surface, and most relational databases don’t, out of the box, understand that aspect. And so when you’re talking about geospatial data, it’s really built to understand not just the relationships that you would have through a linked table, but also the spatial relationships between records, like the distance between things, the proximity between different features, how they might overlap or connect. So following on that example that I gave earlier, if you have customers in New York City, that’s something that’s queryable in a relational data table. But in GIS, you could also take that a step further and think about how far those customers are from perhaps a new store that you might build, or whether those customers might live within a certain flood zone, or any other contextual location information.

Gregory Kapfhammer 00:03:33 Thanks, that was really helpful, Jennings, a moment ago I heard Amy talk about the concept of a point, and when I was learning more about geospatial information systems, I learned that things like points and lines and polygons are all critical to GIS. Can you tell us a little bit more about what points and lines and polygons are?

Jennings Anderson 00:03:54 Yeah, so points, lines and polygons are the primitive data types for GIS systems. And when you think about modeling the real world, you can imagine a point as say like a point of interest to which you’re attaching data. So that could be a customer address and you might have other information about that point. But then fundamentally that point is going to have a latitude and a longitude when we’re talking about geographic coordinates. And that’s going to represent an actual point on the surface of the earth. And if we string a bunch of points together, we can create a line string, and you can imagine a road would be best represented as a line string. And then if you have a line string that can close on itself, you can create a polygon and that’s going to be the best way to represent something like an area. And so we can put all these pieces together and you might have a polygon that’s representing a city park, for example.

Jennings Anderson 00:04:45 And so on a map you’re going to show that as some sort of green space. And then you’re going to have your line strings around that, which would be your roads most likely. And then you could have points within that park denoting points of interest such as a fountain or an information kiosk. And so you now have this inherent spatial relationship between each of these entities, where in a database they might each just be rows, but because they have these coordinates attached to their geometries, you’re able to do these further operations as Amy was describing, where you could query for, oh, okay, what are the points that are actually within this polygon representing the park? And you can create these spatial relationships. And another way to think about that is it’s another form of join on a relational database, but you’re getting a different WHERE clause now. You’re joining on a condition that is using those geometries, and that’s fundamentally different from, say, joining on an ID or another type of key.
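The spatial join Jennings describes can be sketched in a few lines. This is a toy illustration, not Overture code: the park polygon and point names are made up, and the point-in-polygon test is a plain ray-casting check, where production systems would use a geometry library such as Shapely or PostGIS.

```python
# A minimal sketch of a "spatial join": which points fall inside a polygon?
# The predicate replaces the usual key-equality WHERE clause.

def point_in_polygon(point, polygon):
    """Ray-casting test: count edge crossings of a ray going right from point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does the edge (x1,y1)-(x2,y2) cross the horizontal ray at height y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# A hypothetical city park as a polygon of (lon, lat) pairs ...
park = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
# ... and some point features, e.g. a fountain and a restaurant outside the park.
points = {"fountain": (2.0, 1.5), "kiosk": (3.5, 2.5), "restaurant": (6.0, 1.0)}

# The "join" condition is geometric containment, not an ID match.
in_park = [name for name, pt in points.items() if point_in_polygon(pt, park)]
print(in_park)  # ['fountain', 'kiosk']
```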

Gregory Kapfhammer 00:05:40 Okay. As I mentioned a moment ago, we’re going to be talking about something called Overture Maps, and the documentation for Overture says that Overture provides free and open map data and that it’s normalized to one schema. So I’d like to talk a little bit at a high level about what Overture Maps provides, and then additionally, could one of you explain what it means when we talk about the Overture Maps schema?

Amy Rose 00:06:03 Yeah, sure. So schema, one of my favorite topics. I’m obviously the life of the party. So when you’re thinking about schema, whether you’re talking about traditional databases that are non-geospatial or about geospatial databases, the schema typically defines the structure of the data. And so it’s kind of like the blueprint of rules that dictate anything from what types of data can be stored to the names of the fields or the columns. And then of course the relationships between the different pieces of data. The key difference, and Jennings touched on this a little bit, is that the schema will also include definitions for the spatial data types. So where in a non-geospatial schema you might have a field for address, which is just represented as a string, a geospatial schema would actually include a geometry field. So like a point representing the coordinates of the actual location of that address, and potentially other spatial attributes.
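The difference Amy describes can be sketched with two toy record types. The field names here are hypothetical, not the actual Overture schema; the point is only that the geospatial version carries a typed geometry alongside the ordinary attributes.

```python
# Toy illustration (not the Overture schema): the same customer record in a
# non-spatial shape vs. a geospatial shape with an explicit geometry field.
from dataclasses import dataclass

@dataclass
class Customer:          # non-spatial: the address is just a string
    name: str
    address: str

@dataclass
class GeoCustomer:       # geospatial: the address plus a point geometry
    name: str
    address: str
    geometry: tuple      # (longitude, latitude) in WGS 84

plain = Customer("Acme Deli", "123 Main St, New York, NY")
geo = GeoCustomer("Acme Deli", "123 Main St, New York, NY", (-73.99, 40.73))
print(geo.geometry)  # (-73.99, 40.73)
```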

Gregory Kapfhammer 00:07:00 Thanks, that was really helpful. A moment ago you mentioned the idea of addresses, and I think that connects to the six themes that are associated with Overture Maps data. The six themes are addresses, buildings, base, divisions, places, and transportation. Can the two of you help us out to understand more about what each of those data themes actually are?

Jennings Anderson 00:07:21 Sure thing. So going back to that list, for addresses there are about 300 million features around the world now in our addresses theme, mostly coming from open data sources such as OpenAddresses, and these are point features that represent actual addresses. So this is your street number, your house number, and the street name and the geographical region for your location. And buildings is pretty self-explanatory. It’s a data set now of about 2.4 billion buildings around the world. This is another combination of open data sources, from Microsoft’s machine learning buildings dataset and Google Open Buildings and OpenStreetMap buildings, as well as Esri Community Maps. And what we do is conflate all these data sets together and produce one unified layer of buildings across the entire planet. And so we like to think it’s maybe one of the most complete open building data sets, because we’ve been combining all of these other gigantic, fairly complete open building data sets.

Jennings Anderson 00:08:24 And these are all going to be polygon representations of these buildings. And where we have information about the height, we also include that as one of the attributes on these buildings. And so you can get these nice, extruded building models. And then base is a contextual layer, so base has types within it, and this is going to be your land use coming from OpenStreetMap, for example. So these are your parks and kind of green spaces, water as well from OpenStreetMap, and then other natural features and infrastructure such as chairlifts and fences and other stuff. And so that’s all coming from OpenStreetMap and is converted into the Overture Maps schema so that it’s compatible with the rest of the Overture Maps dataset. And we like to think of this as the rest of the color and detail kind of under the rest of your features on the map. And then we also include information in there for land cover, as extracted from ESA satellite imagery, and a bathymetry layer as well, so you can show some depth to the ocean. So yeah, that base theme covers everything you might want for adding color and context if you’re rendering a map from Overture data.

Gregory Kapfhammer 00:09:28 Thanks for that response. It was really helpful. What I want to do now is pick up on something that I heard both of you say. First of all, Amy, you mentioned that you’ve normalized all of the data to a single schema, which describes the structure of the data. And then Jennings, you talked about how this dataset comes from multiple sources. So the next thing that I’d like to talk about further is the challenges that you’ve experienced when it comes to unifying this geospatial data from different organizations into one normalized schema for data. Can you talk about that in greater detail?

Amy Rose 00:10:01 Well, maybe I’ll start with just why that’s important, and then maybe Jennings can talk a little bit about some of the challenges. So imagine you’re trying to build an application that uses map data. You’re pulling all this information from different sources, and each source has its own unique way of describing things. So let’s say some data refers to roads as quote unquote streets and others as highways. Or in some datasets the height is measured in meters and in others in feet, just because of cultural differences. It’s a total nightmare to get them all to work together, right? We’ve all been there. And so you spend forever writing this custom code and these translators to convert the data just to get it to a point where it’s usable. So that’s where normalizing to a single schema comes in, and it’s like we all agree on a common language, a common structure for the data, and that’s really key for interoperability. So when you have data that adheres to that single schema, it’s easy to combine, it’s easy to compare, it’s easy to use from different sources. So for the Overture releases, what we’re doing is that work up front, doing a lot of that translation as we’re bringing all these different data sets together, so that developers that are using the data can just plug and play.

Gregory Kapfhammer 00:11:14 Jennings, did you want to add to that further?

Jennings Anderson 00:11:16 I think that was a fantastic description. One thing that I think Overture is doing: Amy mentioned all of these custom scripts to normalize the data that people get stuck writing, and unfortunately there’s no way around that. Overture says, well, let’s do it once in one place, and then we will provide that unified schema. And that’s something that Overture is trying to provide to the community. So we are dealing with a lot of those complications and difficulties that you mentioned, doing a lot of this normalization ourselves as we bring in these different open data sets and then putting them into this schema. And so part of the idea here is just, we’ll do it so you don’t have to, to some level.

Gregory Kapfhammer 00:11:57 Okay, that makes a lot of sense. When I was checking out the documentation for Overture, I remember reading that it had 4.2 billion features and that it was continuing to expand in terms of size and scale and the data sources that you’re pulling in. I’m curious, what do you mean when you say 4.2 billion features? What are those features in particular?

Jennings Anderson 00:12:19 Yeah, those features are across all of the six themes that we’ve brought up. And so each feature might not carry the same weight. So for example, there are hundreds of millions of points that represent the actual intersections of transportation segments. We call them connectors in Overture. And so these are single points that represent two roads coming together and making an opportunity for a routing decision. So it needs to be absolutely modeled in the database, but that single point or intersection might not have as much information as, say, a point in the places data set that describes a restaurant and all the attributes attached to it. So ultimately, our data sets are, yes, massive: hundreds of gigabytes in their Parquet format and, yeah, billions and billions of features. But then each theme does break down and have its own schema for how we represent that data.

Gregory Kapfhammer 00:13:12 Thanks for that response. I appreciate it. In a moment, I want to talk about a few more basics of geospatial data sets, but before I do that, let’s pick up on some of the things you said a moment ago, Jennings. Can you both tell us a little bit more about the overall size of the Overture data set? And I know that Overture is deployed on the public cloud, so if you could tell us a little bit more about like for example, what are the number of API invocations that you get per month, or how frequently you release data or how frequently people download data? If you could help our listeners to understand a little bit more about the scale and scope of Overture, I think that would be helpful for us.

Amy Rose 00:13:49 Well, I think I would start with one caveat, which is we want to make this data as open as possible. So by design, we don’t necessarily put up gates for things like downloads. So there are certain ways that we think about the data. One of the reasons why we decided to go with the cloud-native format GeoParquet and actually store the data on the cloud is so that it would be easier for a variety of different applications to access the data. And just by default, in a lot of ways, we miss out on some of what you would consider ways to track those metrics of who’s downloading the data, who is using the data, who is accessing the data. But I would say that the goal is really not so much to track those direct downloads or hits, but to understand how it’s kind of cycling out into the ecosystem. So for example, Esri uses Overture data as part of their Living Atlas. So if you can imagine, hundreds of thousands, millions of users really, that use Esri software are able to access Overture Maps data. They might not know it’s Overture Maps data. There’s no attribution clause, so they might not even know. But the reality is that they might be using it, and it’s probably percolating out to, like I said, millions if not hundreds of millions of users just through one specific platform.

Gregory Kapfhammer 00:15:11 Okay, that makes sense. What I want to do now is follow up on something you said a moment ago, Amy, you mentioned something called the Esri Living Atlas. Can you tell us what that is?

Amy Rose 00:15:21 Right, so the Esri Living Atlas is basically a compilation of a variety of different data sets that they put together to operate as an out-of-the-box base map. So anybody using their software, whether it’s their desktop or their online platform, can have access to this Living Atlas data. And so it’s really meant for getting you to a certain level of operability within the context of developing maps, so that you’re not creating all of this from scratch. So it will have styling, it’ll have cartographic properties. So if you really just wanted to create a quick map, that’s an easy way to do it. They draw on a variety of different data sets, like I said, of which one is Overture.

Gregory Kapfhammer 00:16:05 Okay, that’s really helpful. Now, I wanted to summarize a few points before we move on to the next phase of our conversation. First of all, my understanding is that Overture Maps and its dataset is deployed on the public cloud. Did I get that part right?

Jennings Anderson 00:16:18 Yes.

Gregory Kapfhammer 00:16:19 Okay, cool. And then because it’s deployed on the public cloud, one of the things that I’m understanding is that you don’t necessarily track the number of API invocations or things of that nature. Did I understand that correctly as well?

Jennings Anderson 00:16:32 Yes.

Gregory Kapfhammer 00:16:33 And then the next thing, just to make sure that I’m understanding everything carefully, you can access the data from a public cloud like Amazon AWS, and there are other organizations, like for example Meta or Microsoft or TomTom or the Living Atlas, that are all drawing on the data that Overture Maps provides. Did I get that correct as well?

Jennings Anderson 00:16:54 Exactly.

Gregory Kapfhammer 00:16:55 Okay, cool. Now I wanted to cover just a few quick terms that are related to geospatial data, and then we’re going to do a deep dive into Overture Maps. So there’s a couple of terms that I think our listeners should try to understand a little bit better, including things like a coordinate system and then a projection and a transformation. Could we work together to figure out what those three terms mean?

Jennings Anderson 00:17:16 Sure. This feels like a GIS pop quiz. I’m happy to take a look at this. So in GIS, everything comes back to the datum. The datum defines the coordinate system. And so the datum that I think we’re all familiar with today is something called WGS 84, which is an agreed-upon representation of the earth as an ellipsoid. And modeling the earth as an ellipsoid allows you to come up with these specific transformations between different coordinate systems, if we all have a shared understanding of the shape and size of this ellipsoid of the planet. So for GIS, WGS 84 enables these projections. And so the best way to imagine a projection is to think of the earth as an orange, and then you start to peel the orange and you’re left with these pieces. If you peel it off all in one piece, it’s going to be a representation of something you could lay out flat, but it doesn’t necessarily look like that globe or that circular shape that you might be used to seeing on a map, or that oval shape.

Jennings Anderson 00:18:18 So that’s what we’re talking about with projections. And so different transformations allow us to create different projections. And the one that we’re probably most familiar with in our current digital world is something called Web Mercator, which is what’s most popular across all of the web-based maps that we see. And that’s something that does use our standard WGS 84 latitudes and longitudes, adjusted a little bit at the poles, but essentially we can still describe data in these fairly common latitude and longitude terms, right, where we have zero at the equator, up to positive 90 at the North Pole and negative 90 at the South Pole. And then you have your longitudes, which are going to go from negative 180 to positive 180, so 360 degrees in total around the earth. And that allows us to have a shared reference of what these actual coordinates are that we can then start constructing our primitive geometry types from.
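The projection Jennings describes can be sketched with the standard spherical Web Mercator forward formulas (EPSG:3857). This is a minimal sketch rather than code from any mapping library; the clamp value is the conventional Web Mercator latitude limit.

```python
# Spherical Web Mercator forward projection (EPSG:3857): WGS 84
# longitude/latitude in degrees to meters on the flattened web map.
# Latitudes are clamped because the projection stretches to infinity
# at the poles, which is the "adjusted a little bit at the poles" part.
import math

EARTH_RADIUS = 6378137.0  # WGS 84 semi-major axis, in meters

def to_web_mercator(lon_deg, lat_deg):
    # Clamp to Web Mercator's usable range (about +/- 85.05113 degrees).
    lat_deg = max(min(lat_deg, 85.05112878), -85.05112878)
    x = math.radians(lon_deg) * EARTH_RADIUS
    y = math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2)) * EARTH_RADIUS
    return x, y

x, y = to_web_mercator(0.0, 0.0)
print(round(x, 1), round(y, 1))  # 0.0 0.0  (the equator/prime-meridian origin)
```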

Gregory Kapfhammer 00:19:13 Jennings, thanks for taking the GIS quiz. I really appreciate that. What I wanted to do very quickly is to talk about some commercial or open-source applications that are powered by geospatial data. I know that Amy, you previously mentioned something called the Living Atlas. Could the two of you give me some more examples of real-world applications that use GIS?

Amy Rose 00:19:34 Well, I’ll give you one, and I think if Jennings would like to elaborate on it, he’s definitely the person that’s better suited for it. But recently, if you saw the announcement from Instagram, they have Instagram maps. And so that’s a big part of how Meta is starting to use Overture data within their own platforms.

Jennings Anderson 00:19:51 Yeah, I’m happy to talk about that for a second. So yes, Instagram just added the maps capability to their platform where you’re able to see your friend’s location and your location on the map as well. And I’ll take that one step further to say that Meta powers all of their open base maps with Overture data. And so that’s one of the reasons that Meta is involved in Overture. And these are going to be base maps that you see across any number of Meta products. So there is the Instagram map and there’s also, if you’re on Facebook Marketplace or on any Facebook page, if there’s a map in the background describing the location of say, a business, that is going to be the map that is derived from Overture Maps data, which is really exciting.

Gregory Kapfhammer 00:20:32 Okay, thanks for those responses. Now we’re ready to start talking about the key implementation details and the technical components of Overture Maps. So Amy, I wanted to start by talking about how Overture Maps is a collaboration among multiple organizations. So I know that Meta and Microsoft and Tom Tom and others are all part of this collaboration. You alluded to this previously, but can you tell us a little bit more about why this collaboration is needed and what you couldn’t solve if you did it individually versus doing it in a collaborative fashion?

Amy Rose 00:21:04 Right. Yeah, I mean that’s a big part of what Overture is and how we operate. So about three years ago, the organizations, AWS, TomTom, Microsoft, and Meta, with Esri shortly following, came together and were talking about, hey, we’re all doing very similar things, trying to solve the same problem of creating and maintaining a comprehensive, high-quality, up-to-date global map data set. And if you think about it, creating and maintaining any global data set is very difficult, and you start to consider the implications of a very dynamic world. Things are ever changing, right? Businesses open, they close, they move, new roads are built, new buildings are built or demolished. How do you constantly maintain that so that any of the work that you’re doing, or the applications that you’re building on top of this map data, are as up to date as possible and are providing the best information possible?

Amy Rose 00:21:59 And so it’s really incredibly expensive and resource intensive to do that. So they got together and really thought, hey, we’re all doing this thing, it’s pretty competitive. We’re not necessarily building our company on top of this work, but we all need it to do what we need to do for our particular business value. And so how do we not just get together and build and maintain this data, but how can we start to think about it more as an interoperable open ecosystem so that it becomes much easier for anybody, not just these companies that got together, but of course any Overture member, but then the broader community to be able to very quickly put together data and have some information about how fresh it is, what is the quality of it, the things that take quite a bit of engineering to do, do it once with a lot of people involved so that you’re really being much more efficient and effective about it.

Amy Rose 00:22:56 And at the same time opening up all of that opportunity for a broader ecosystem. And I think one of the biggest benefits that we don’t talk about a whole lot is just how that can then spur additional innovation. So for example, if you’re no longer spending all this time as your own company doing this work alone, instead you’re collaborating, that frees up that time that was being spent on building the map independently to do things that are either more value added to your own organization or contributing back to value added for an open-source community.

Gregory Kapfhammer 00:23:27 Thanks for that response. It was really helpful. For listeners who want to learn more in the show notes, we’ll link you to a web-based system that’s called the Overture Maps Explorer, and they can actually try out some of the data sets and the visualizations that are associated with it. In a moment we’re going to talk about something called the Global Entity Reference System. But before I do that, I also want to point out to listeners, they can check SE Radio Episode 607 and 546 for some additional details that are related to geospatial systems. Now building on the response that Amy gave a moment ago, I want to talk a little bit about the Global Entity Reference System that’s built into Overture Maps and Jennings, if you could pick this up, can you tell us a little bit more about why G-E-R-S or GERS is important for Overture Maps and what it actually is?

Jennings Anderson 00:24:17 Yeah, so GERS, as it’s affectionately called, is, as you said, the Global Entity Reference System. And I think the key piece there is that E in the middle, which is saying entity. And this is an entity-based system which allows us to first define an entity across our themes and then create this shared reference to it. So the idea being, let’s take a building for example. Buildings are modeled in many ways. They might have many different attributes depending on the data set, and there are going to be multiple data sets of buildings for any city, for example, from different departments to different organizations. And the idea of GERS is to say, okay, well there might be different representations of each of these buildings, but each building is in fact its own building entity, the building-ness of the building, so to say. So we will define a single entity for a building and give that a unique identifier.

Jennings Anderson 00:25:18 We’re using standard UUIDs, so 128-bit random IDs. And Overture then says, okay, as long as Overture has any representation of this building across any data set, we are going to continue to use this ID, and we’re going to do conflation on incoming data sets to compare them against our current understanding of this entity, to see if a new incoming building spatially overlaps, with some tolerance, with this existing building. And if it does, we’re going to group that together and say, okay, this might be its own new unique feature from this other source, but from the Overture perspective, it is this GERS entity, it is the original building. And that allows the Overture release to go out and have a specific ID on this building. And then any two groups who are also using the Overture data set now have a shared reference, have this shared ID for this building. And we can take that one step further: if you are, say, another group looking to share or enrich information about a building, you could also then just use that ID to refer to that entity. And of course this goes for buildings and places and addresses and transportation segments, et cetera. And so that’s the promise of the GER system: building up this shared universal reference for geographic entities globally.
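The ID lifecycle Jennings describes can be sketched in miniature. This is a toy illustration, not Overture’s pipeline: the matcher here is a deliberately naive centroid-distance check with a made-up tolerance, where Overture’s real matchers are per-theme and far more sophisticated.

```python
# Toy GERS-style sketch: every entity gets a random 128-bit UUID, and an
# incoming feature keeps the existing ID whenever a matcher decides it is
# the same real-world entity; otherwise a fresh ID is minted.
import uuid

TOLERANCE = 0.001  # hypothetical matching tolerance, in degrees

def centroid(polygon):
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return sum(xs) / len(xs), sum(ys) / len(ys)

def assign_id(incoming_polygon, registry):
    """Reuse an existing entity ID on a match, otherwise mint a new one."""
    cx, cy = centroid(incoming_polygon)
    for entity_id, (ex, ey) in registry.items():
        if abs(cx - ex) < TOLERANCE and abs(cy - ey) < TOLERANCE:
            return entity_id          # same entity: stable ID across releases
    new_id = str(uuid.uuid4())        # new entity: mint a fresh 128-bit ID
    registry[new_id] = (cx, cy)
    return new_id

registry = {}
building_v1 = [(0.0, 0.0), (0.001, 0.0), (0.001, 0.001), (0.0, 0.001)]
# A slightly shifted footprint of the same building from another source:
building_v2 = [(0.0001, 0.0), (0.0011, 0.0), (0.0011, 0.001), (0.0001, 0.001)]

first_id = assign_id(building_v1, registry)
second_id = assign_id(building_v2, registry)
print(first_id == second_id)  # True: the entity's ID is stable
```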

Gregory Kapfhammer 00:26:45 Okay, that makes sense. So you talked about how the GERS IDs are going to be unique. Another thing that I read is that these IDs have to be stable. Can one of you briefly explain what it means for a GERS ID to be stable?

Jennings Anderson 00:26:58 Yeah, we do stability through our matchers. And so for a given theme, I like to go back to buildings again, for example, because I think it’s easier to visualize: if you have two polygons that might represent the same building, these data sets come together and the first thing that Overture does is compare the two data sets. You can think of it as stacking them on top of each other and then looking at where there is overlap. And we do what’s called an intersection-over-union comparison in the buildings matcher, which is a fairly common spatial operation, where you take the area of the intersection of the two polygons divided by the area of the union of the two polygons, and you get this ratio that compares the two polygons’ similarity, essentially. And depending on the value of that, and we set various thresholds, we will then say, this is the same building, and therefore in the next release it will have that same GERS ID. And so that’s how we achieve stability over time across releases with our GERS ID system: it’s based on the quality of our matchers. And you can imagine that’s just for buildings; transportation obviously has to do something entirely different in how they line up the transportation network and ensure that those IDs are maintained stable, and places is also different, using different attributes and comparisons. But that’s how we ensure stability at Overture.
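The intersection-over-union comparison can be sketched on axis-aligned boxes, which keeps the geometry trivial. Real building footprints are arbitrary polygons, and Overture’s actual thresholds aren’t stated here; the 0.5 cutoff below is just illustrative.

```python
# Intersection over union (IoU) on two (min_x, min_y, max_x, max_y) boxes:
# area of the overlap divided by area of the combined footprint.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap of the two boxes (zero if they don't intersect).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def same_building(box_a, box_b, threshold=0.5):
    """Declare a match when similarity clears an (illustrative) threshold."""
    return iou(box_a, box_b) >= threshold

# Two nearly identical footprints of one building from different sources:
a = (0.0, 0.0, 10.0, 10.0)
b = (1.0, 0.0, 11.0, 10.0)   # shifted one unit east
print(round(iou(a, b), 3), same_building(a, b))  # 0.818 True
```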

Gregory Kapfhammer 00:28:18 Okay, that makes a lot of sense. So thanks for explaining matching and stability. A moment ago I heard you talk about the concept of conflation. Can you briefly define what conflation is?

Jennings Anderson 00:28:28 Yes, it’s when we compare those two data sets, or really not just two data sets, any number of data sets, and determine how we’re going to combine the attributes of these data sets or determine their quality. And so it’s also a form of deduplication within a data set, identifying what is the same. And so depending on the conflation algorithm that we’re using, we’re going to get different results. And when I say conflation algorithm here, I just mean whatever steps we’re going to take, like that intersection over union, or maybe checking whether the two names of a place are equal; we’re going to develop various thresholds and systems to determine the equality of these features.

Gregory Kapfhammer 00:29:05 So my understanding is that there must be many incredible challenges associated with unifying data sets that come from disparate organizations and that things like conflation and matching could have a lot of gotchas that go along with them. So I’m hoping that two of you could tell me some stories that are related to the challenges that you faced when it comes to unifying the data or handling issues related to matching and conflation. Can you give us some more details?

Amy Rose 00:29:32 Maybe I'll start with just the point that we wouldn't be bringing all these different data sets together if there weren't a reason to do so. And that reason really goes back to: there's not always one version of a feature that's the right version. It goes back to fitness for purpose. So if somebody really just needs the 2D footprint of a building, that's great, but other data sets might actually represent 3D characteristics of that information. And so when we bring these data sets together, what we're trying to ideally do is put together a representation that will be able to accommodate GERS, so be kind of a stable representation of that feature, but also easily linked to other data sets. And so one of the biggest challenges is building out the logic, or making the decisions and the rules, around what pieces and parts of a feature, whether it be the geometry or the attributes, we want to keep for that single representation. I mean, that's why conflation's really important, right? So that we don't push six versions of the same building to the release; we want one version. Obviously it will be an opinionated version to some degree, but the idea is to use it as a reference map so that it can refer back to any source that also has that GERS ID for that building.

Gregory Kapfhammer 00:30:50 You mentioned the idea of a reference map. I think that must mean that there are types of maps that would not be considered a reference map. Can you give us an example of a map that’s not a reference map and then compare and contrast that with what Overture provides?

Amy Rose 00:31:04 I think some people might say you have to be very specific about what you mean by reference, but I would say what we're trying to get at is not a cartographic map per se. A cartographic map is very stylized; the labeling and the representation are such that everything looks good together, and it's very obvious whether you're using it to get from point A to point B, or really just getting the context and sense of an area. What we are trying to do with building a reference map, and in this case the reference map is actually our data releases, is to come up with that, as Jennings said earlier, common reference feature, meaning no matter who has that building represented in their data set, if we are all using the same ID, we can very easily talk across those data sets. So it becomes more akin to a table join, a column join, rather than these very complex spatial joins, which are messy at best just because of the geometry that's usually involved. And so the reference map is really meant to just act as kind of this common way to reference the same features on Earth. It can be used as the foundation for cartographic maps or visualizations or analysis or any of those things, but the idea is really to be a reference across data sets.

Gregory Kapfhammer 00:32:27 Okay, that makes a lot of sense. Now, a moment ago when Jennings was talking, you mentioned the idea of thresholds and then Amy when you were just giving that response now, I’m assuming that somehow there has to be a threshold. If one map says a building is in a certain location and another map says it’s in a different location represented by a different polygon, how do you decide when it’s actually the same building based upon a threshold for the overlap of that building when it comes from different maps? Can you help us to understand that a little bit better?

Jennings Anderson 00:32:58 Yeah, so right now we're using a very simple threshold of literally 0.5 for this intersection over union ratio. For our current buildings dataset, what we want to produce is the most complete dataset representing 2D footprints on the map, just so that we can assign that ID, and then anyone can use that ID and build on top of those buildings. So our matcher is fairly simple right now, and what we do is we take a priority of our input data sources. So right now for buildings we start with OpenStreetMap at the top, and that way we can get edits back from anyone in the community who's editing OpenStreetMap and updating buildings around the world. We're always going to take that as the top priority. And then we essentially fill in the rest of the map with ML data sets from Google or Microsoft where there isn't an OpenStreetMap building, or a building from the Esri Community Maps program, where a lot of local governments will open up their building data sets. And our conflation algorithm then is really optimized for just ensuring that, at the end of the process, we have a map where we don't have buildings that overlap one another and each building has this distinct ID.

Jennings Anderson 00:34:17 And so there are many areas where it's definitely not perfect, but we want to build this up from the IDs, and that's why we have this very specific conflation order and create the most globally complete dataset. And so we won't necessarily debate the perfection between any two buildings; we just want to make sure that we're actually creating that reference to that building so we can build from there.
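The priority-ordered conflation Jennings outlines, OpenStreetMap first with ML sources filling the gaps, might be sketched like this. It is a loose illustration under simplified assumptions (rectangular footprints, a naive overlap test), not Overture's production pipeline:

```python
# Hypothetical priority-ordered conflation: higher-priority sources come
# first, and a footprint from a lower-priority source is only accepted
# where no already-accepted footprint overlaps it.

def overlaps(a, b):
    # True if two (xmin, ymin, xmax, ymax) rectangles overlap.
    return not (a[2] <= b[0] or b[2] <= a[0] or
                a[3] <= b[1] or b[3] <= a[1])

def conflate(sources_in_priority_order):
    """sources_in_priority_order: list of (source_name, [rect, ...])."""
    accepted = []
    for name, footprints in sources_in_priority_order:
        for fp in footprints:
            if not any(overlaps(fp, kept) for _, kept in accepted):
                accepted.append((name, fp))
    return accepted

osm = [(0, 0, 10, 10)]
ml = [(2, 2, 8, 8),      # duplicate of the OSM building -> dropped
      (20, 20, 30, 30)]  # fills a gap in OSM coverage -> kept
result = conflate([("osm", osm), ("ml", ml)])
print([name for name, _ in result])  # ['osm', 'ml']
```

The output map keeps one footprint per building: the OSM footprint wins where the sources disagree, and the ML footprint survives only where OSM has nothing.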

Gregory Kapfhammer 00:34:41 Okay. Now one of the things I noticed is that through our conversation today, we've mentioned OpenStreetMap more than once, and I remember that the FAQ for Overture Maps explains that Overture and OpenStreetMap are like complementary parts of the open map data ecosystem. So I'm wondering, does data flow from OpenStreetMap to Overture, or is it from Overture to OSM? How does that process actually work? Can you explain that in greater detail, Amy?

Amy Rose 00:35:08 Yeah, I mean, to answer your question directly, ideally it would be both; there would be kind of a flow both ways. Overture and OpenStreetMap are both open data projects, but the work and the goals for each project are kind of fundamentally different. Overture is more focused on the end user requirements, particularly around interoperability of global open map data. So this is where things like the standardized schema and GERS come in, ideally making it easier to use a variety of data together, whereas the OSM community has created, and continues to edit and maintain, worldwide map data. If you're not familiar with OpenStreetMap, it's really an amazing project, and when you think about what that community has done for geospatial in general, they've moved it forward in really big ways. So if you don't know about it, I definitely encourage you to check it out.

Amy Rose 00:35:59 But in the context of that relationship, Overture obviously uses OSM data, so data that's being created and maintained by the OSM community, along with hundreds of other data sources, to produce new open map data sets. The idea is that those new open map data sets are more complete, going back to Jennings' description of bringing a variety of different building sources together to make the complete reference map; that they're standardized, so they adhere to a very standard schema; and that they're interoperable, so whether it's GERS or other things that we're talking about, we're moving in the direction of making sure that data is very easy to use and combine with other data sets.

Gregory Kapfhammer 00:36:39 That makes sense. We'll link our listeners to details about OpenStreetMap and the FAQ for Overture Maps so that they can learn more about how those two projects are related. In a moment, I want to talk a little bit more about Overture Maps and specifically how you might build an application using something like DuckDB. But before I do that, is there anything else that either of you wanted to highlight when it comes to GERS or stable and unique identifiers? Because that seems to be a central component of Overture.

Amy Rose 00:37:07 I would maybe add just a couple of notes. Stability is obviously really important: you want to be able to make sure that you can reliably link data sets, and GERS is a way to do that. The IDs have to be unique, they have to be global, and ideally stable, but the world around us is not stable, right? Things are constantly changing, like I mentioned before. And so what we don't want to do is try to preserve stability at the expense of correctness. The way to think about this is: we're not going to keep a stable ID if the feature we have in a release is actually not the right feature. What we want to do instead is provide a mechanism for tracking the lineage of IDs. So for example, suppose a feature disappears: say a business goes out of business and is no longer showing up in the map, but then a new business opens up at the same location. Those aren't going to have the same ID; they shouldn't have the same ID, because they're two different businesses, or technically two different places.

Amy Rose 00:38:08 So we want to be able to capture that: one place that was referenced is no longer active, and there's a new place being referenced in the same location that is active. And so we provide this thing called the GERS registry. If you think about UUIDs, we don't worry as much about uniqueness because there's a pretty low chance of collision. But what we do worry about is making sure that as people are starting to adopt GERS and use it to link data sets, there's a way to see: when did a GERS ID first get assigned? When was the last time it was seen in a release? What is its current status? Where is it on the map? So the lineage is a really key aspect of what we're doing for the interoperability piece.
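Amy's description of the GERS registry suggests a simple lineage model. The field names and release labels below are illustrative, not the real registry schema:

```python
# Toy lineage registry: each ID records when it first appeared, when it
# was last seen, and whether it is still active in the latest release.

registry = {}

def record_release(release, ids_in_release):
    # Mark every ID in this release as active and update its lineage.
    for gers_id in ids_in_release:
        entry = registry.setdefault(
            gers_id,
            {"first_seen": release, "last_seen": release, "status": "active"})
        entry["last_seen"] = release
        entry["status"] = "active"
    # Anything previously registered but absent from this release is retired.
    for gers_id, entry in registry.items():
        if gers_id not in ids_in_release:
            entry["status"] = "retired"

# A pizzeria closes and a new business opens at the same spot: the new
# place gets a NEW id, and the old id's lineage shows it was retired.
record_release("2024-01", {"id-pizza"})
record_release("2024-02", {"id-cafe"})
print(registry["id-pizza"]["status"], registry["id-cafe"]["first_seen"])
```

The key design point Amy makes is visible here: rather than reusing `id-pizza` for the new cafe, the registry retires the old ID and records where and when the new one was first assigned.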

Gregory Kapfhammer 00:38:52 Wow, I really appreciate that clarification. So we need to think about whether or not it's unique, we need to think about stability, and then, as you've said now, we need to consider issues related to lineage. Now before we move on, one last thing. Each of us has been talking about open data, and we talk about OpenStreetMap or open mapping data. Many of our listeners may be familiar with the concept of open-source software. What is open data? Can you talk briefly about what that concept actually means?

Amy Rose 00:39:19 So you might want to lump open-source software and open data together because it's all open source, but they're very different things. Open data can be a variety of things. There are data sets that might be commercially available, but they're not open, meaning they have certain restrictions about use and access. When you get into open data, there are far fewer restrictions about use and access, although licenses still vary. With Overture, we want to be as permissive as possible, with the ability to, quote unquote, license data openly. When you get into something like open-source software, you can think about writing code, and the licensing is much clearer because it's been going on for quite some time. How you build open-source software is much clearer; there are a lot of tools out there already to very cleanly execute on these types of collaborations, particularly in big projects. And code is pretty easy to think about in terms of how you would release and continue to maintain and update it. Data's a very different animal, not least because it can get really big. So one consideration that we have, whether it's open data or not, is just how much cloud storage you would need to store an open data project versus an open-source code project.

Gregory Kapfhammer 00:40:42 That was really helpful. So the scale of the data makes a lot of sense, and what I want to do now is build on your response and actually talk about how one of our listeners could use Overture Maps in practice. Before we dive into that topic, I should point out that there are some great tutorials on the Overture Maps website, and we'll make sure to link our listeners to that episode material in the show notes. So let's talk a little bit now about how I would actually build a program using Overture Maps. I'm going to assume for now that I'm getting my data from Amazon S3 because, as you mentioned, Amy, we're talking about a really big scale of data, and additionally I'm going to assume that we're going to be storing our data in something called DuckDB. So my goal is something of the following, which connects to something that one of you mentioned earlier in the show.

Gregory Kapfhammer 00:41:28 So say I’m interested in searching for all pizza restaurants in a bounding box for a specified region in New York City. And I know that it’s going to be hard for us to go over specific details related to code, but I’m hoping the two of you could walk me through some of the specific things that I would need to do if I want to access the data from Amazon S3, store it inside of a DuckDB and then actually run a query so that I could achieve my goal of finding the pizza restaurants within the bounding box within a region of New York City. So can one of you help us to get started on how we would actually build this kind of system using Overture Maps?

Jennings Anderson 00:42:08 Absolutely, happy to take a crack at this. So our data is all released in GeoParquet format, which is a columnar format that's optimized for the cloud-native environment, and DuckDB is a fantastic open-source query engine that allows us to investigate that data in situ in the cloud. A lot of the traditional geospatial workflow is: go get my data, download it, probably to a giant external hard drive, then run my GIS software to interrogate that data, click run, go get lunch, and come back and hope that it finished. However, with DuckDB, and really with GeoParquet, what that enables is the ability to actually query the data in place. A lot of stuff's happening behind the scenes, but effectively the user is just retrieving the data most interesting and relevant to them and their query.

Jennings Anderson 00:43:07 So for your example of New York City here, we would identify the bounding box, as you said. That would just be the minimum and maximum latitudes and longitudes for the area of New York City. So we would first get those, and we would then use those in our WHERE clause in the query, because each row of our data actually has a bounding box field. And so what happens is DuckDB reaches out to the cloud data, whether it's on AWS or Microsoft Azure, and the data is chunked into something called row groups. And each row group, because of how the data is organized, knows, for that chunk of data, which is going to be probably hundreds of thousands of features, what the minimum and maximum bounding boxes are for those features. And by reading just the row group metadata first, it's able to make an intelligent decision about whether or not it needs to actually access that part of the file.

Jennings Anderson 00:44:04 And this is all done through HTTP range requests. So in the end, in the DuckDB query, you're actually able to write something like: select name, geometry, and category from Overture data where the bounding box is within these parameters and category equals pizza restaurant. DuckDB is then going to do all that work on the backend to make the minimum number of requests to the cloud storage. There are about 600 different file partitions on blob storage here, and it's able to turn that request into really intelligent HTTP range requests to only fetch those pieces that we're interested in. So the first part is only downloading what you really, really need, and then you have your bounding box with your data stored locally in DuckDB, and from there you can write it out to any other format that you're more familiar with, or put it into your application however you see fit.
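The row-group pruning Jennings walks through can be modeled in a few lines. The byte offsets and bounding boxes below are invented for illustration; a real engine like DuckDB reads the Parquet footer metadata and issues HTTP range requests for just the matching groups:

```python
# Toy model of row-group pruning: each row group carries min/max
# bounding-box statistics, and only groups that could intersect the
# query bbox are fetched from cloud storage.

def bbox_intersects(a, b):
    # (xmin, ymin, xmax, ymax) rectangles, e.g. lon/lat extents.
    return not (a[2] < b[0] or b[2] < a[0] or
                a[3] < b[1] or b[3] < a[1])

def row_groups_to_fetch(row_groups, query_bbox):
    """row_groups: list of (byte_offset, bbox); returns offsets to fetch."""
    return [offset for offset, bbox in row_groups
            if bbox_intersects(bbox, query_bbox)]

# A rough New York City query box in lon/lat (illustrative numbers):
nyc = (-74.26, 40.49, -73.70, 40.92)
row_groups = [
    (0,       (-75.0, 40.0, -73.5, 41.0)),   # overlaps NYC -> fetch
    (1 << 20, (-122.5, 37.5, -122.3, 37.9)), # San Francisco -> skip
    (2 << 20, (-74.1, 40.6, -73.9, 40.8)),   # inside NYC -> fetch
]
print(row_groups_to_fetch(row_groups, nyc))  # [0, 2097152]
```

In the real workflow, each returned offset would become one HTTP range request against the GeoParquet file; the San Francisco group is never downloaded at all, which is the entire point of the bounding-box metadata Jennings describes.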

Gregory Kapfhammer 00:45:07 Thanks, Jennings, that was really helpful. A moment ago, I remember hearing you talk about how DuckDB was columnar in nature, or, as I think people sometimes say, column oriented. Can you briefly explain for our listeners what it means for DuckDB to be column oriented, and how that is specifically helpful in the context of geospatial systems?

Jennings Anderson 00:45:26 Yeah, so it's much more optimized for things like large bulk reads and analytics than, say, other relational database forms. As the name says, the data is organized columnarly, with data of similar types in columns as opposed to across rows. So if you're asking questions like "what is the average value of this column," it's going to have all of that data located closer together when you're running that query, and so it's able to run that more efficiently rather than scanning across all the rows and extracting just that column. That's what we mean by columnar format here.
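A toy contrast between row and column layouts shows why the average-of-a-column query Jennings mentions benefits from columnar storage (the data here is made up):

```python
# Row-oriented layout: every record is a full object, so computing one
# column's mean means touching every record and pulling out one field.
rows = [{"id": i, "height": float(h), "name": f"bldg-{i}"}
        for i, h in enumerate([10, 20, 30, 40])]
row_mean = sum(r["height"] for r in rows) / len(rows)

# Column-oriented layout: the same field is already one contiguous array,
# which is what lets a columnar engine scan (and cache) it efficiently.
columns = {"height": [10.0, 20.0, 30.0, 40.0]}
col_mean = sum(columns["height"]) / len(columns["height"])

print(row_mean, col_mean)  # 25.0 25.0
```

Both layouts produce the same answer; the difference in a real engine is memory locality and how much unrelated data must be read to answer the query.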

Gregory Kapfhammer 00:46:07 Okay, I see what you’re getting at. I wanted to briefly talk about some of the evaluation metrics that would be associated with a system that’s using Overture Maps data. You mentioned a moment ago issues related to like bulk queries or bulk downloads. So I’m thinking about things like response time or throughput or resource utilization. Jennings, can you talk a little bit about what some of the trade-offs are when it comes to building a system that uses Overture data and DuckDB in order to balance the trade-offs between things like utilization and response time and throughput?

Jennings Anderson 00:46:41 Yeah, I'd maybe take a step back. As Amy mentioned earlier, there's this idea of fitness for purpose, and Overture data is organized to be the best distribution format. We want people to be able to easily obtain the data that they need for their purpose and then put that into their application. And so we hope that our users are only limited here by perhaps their bandwidth, or the speed of their internet connection, in terms of accessing the data, and that they're not instead bound by complex geospatial operations to retrieve the data that they're most interested in. And then once they have that data, they can put it into any application that works for them. So I think there's a fine line here between data and service. We think of Overture Maps as providing the data, and then we want to enable, that's kind of a key word here, we want to enable both interoperability and also enable anyone to build the best mapping service for their product or their use case from that data. So to that end, I would say that what we're optimizing for would be robustness of delivering the data consistently and stably, and allowing anyone to then access it and put it directly into their product.

Gregory Kapfhammer 00:47:58 Okay, that's helpful, and in fact it leads into my next question, because some of our listeners maybe aren't familiar with DuckDB, but they might be familiar with GeoPandas or maybe CARTO or QGIS or maybe GRASS. Could one of you talk a little bit about how you might enable someone to use Overture data if they're not using DuckDB, but perhaps using one of these other systems that I mentioned a moment ago?

Amy Rose 00:48:21 Yeah, that's a great question. You know, it kind of goes back to something that used to be somewhat of a mantra in geospatial; we would say "spatial is special." So to work with it you needed very specialized tools, and you had a fairly small community of practitioners. But actually, the reality is that geospatial data is just another type of data, and anybody should be able to harness that data and take advantage of location's context. So one of the big goals for Overture data is to be easily used by a wide variety of tools and by many different types of professionals, not just people specializing in geospatial data. That's becoming a lot easier now. For example, we talked about GeoParquet. Often when you see "geo" in front of a well-known tool or library or data format now, it's really a good indicator that it's an extension of that popular tool, and it's specifically built to handle geospatial data types.

Amy Rose 00:49:16 So that means if you're already comfortable using those tools, jumping into Overture data will feel pretty natural. For example, for software engineers already working in Python, a great way to start exploring Overture data is by using GeoPandas. As the name suggests, GeoPandas extends the Pandas library, something that a lot of people already use for tabular data, to handle geospatial data. So you can load Overture GeoParquet files directly into a GeoPandas GeoDataFrame, to be exact. And once it's in a GeoDataFrame, you can do all sorts of pretty powerful spatial analysis. You still get the benefits of how you would normally deal with data that's in tabular format, so you can query things that are more attribute based, like buildings above a certain height, but you can also do things that are more akin to dealing with spatial relationships.

Amy Rose 00:50:14 So like spatial joins. You can, for example, join Overture buildings with other data sets that are spatial in nature, that maybe have demographic data associated with them, so you could get population density per building. You can do spatial operations like proximity or distance or buffering around features; if I wanted to find all the points of interest within 50 meters of a road, that's something you could do. And then aggregating data, but doing it in the context of a spatial operation, so calculating the total area of buildings within a city block. And then, not least, there's the ability to visualize your results. That's a great example of using it in GeoPandas, but there are plenty of ways to connect to Overture data. If you go to our website, you'll see a lot of these things listed out with examples, but I think one of the key pieces is trying to make the data available in ways that people popularly consume it.
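The "points of interest within 50 meters of a road" query Amy mentions can be approximated without any GIS library, assuming coordinates are already projected to meters. GeoPandas would express this with a buffer and a spatial join; the names and data here are hypothetical:

```python
# Point-to-road proximity query with plain geometry: which POIs fall
# within 50 m of a road segment? All coordinates are (x, y) in meters.
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment, clamped to its endpoints.
    t = max(0.0, min(1.0,
            ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def pois_near_road(pois, road_a, road_b, radius_m=50.0):
    return [name for name, p in pois
            if point_segment_distance(p, road_a, road_b) <= radius_m]

road_start, road_end = (0.0, 0.0), (1000.0, 0.0)
pois = [("pizzeria", (500.0, 30.0)),    # 30 m from the road -> included
        ("warehouse", (500.0, 200.0))]  # 200 m away -> excluded
print(pois_near_road(pois, road_start, road_end))  # ['pizzeria']
```

This is exactly the shape of question a buffer-then-join in GeoPandas answers at scale; doing it by hand once makes the spatial-join idea concrete.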

Amy Rose 00:51:19 So we do have the data sitting out in AWS and Azure, and there are also data mirrors that you can connect to. For example, CARTO lists and maintains mirrors in Google BigQuery, the Databricks Marketplace, and the Snowflake Data Marketplace; all of those are ways that people typically interact with data in the cloud, and they're available for anybody who wants to try out Overture data. There are also plenty of working examples of tools and libraries. So if you're used to using things like Jupyter Notebooks, you can go to our website and see working examples of how you can pull the data into platforms like that, or things like Kepler.gl or QGIS. All of these span both popular tools that people are already using and specialized geospatial tools, and they cut across both cloud platforms and desktop.

Gregory Kapfhammer 00:52:13 Okay, that's cool. Now, many of our listeners may not have previously built a geospatial system, or they might not be familiar with CARTO or QGIS. So I'm wondering if both of you could briefly share a story or a concrete example of how you debug an Overture Maps-based application, or how you write a test case, or how you actually fix a bug after you've found it. So I'm hopeful that before we move to the next phase of our conversation, both of you could share a quick story about issues related to testing or debugging or fixing faults.

Amy Rose 00:52:46 Maybe I'll start, since we were just talking about QGIS. Anybody who works with data knows that there's going to be a laundry list of unexpected issues that come up, and for Overture data, but really for any geospatial data, what can go wrong is kind of amplified because you're now getting into geometry and not just attributes. And so to debug, it's not just about inspecting the data from a schema perspective and a table perspective; it's also being able to visually see the geometries and see how the data might be interacting with other data sets. In the context of QGIS, I kind of look at that as a really great, what I'll call, sniffer tool, right? Something that, just by the nature of loading geospatial data into it, tells you a lot about what kind of issues you might be seeing right out of the gate. Do you have conflicting coordinate reference systems across the data that you're looking at? Do you have incompatibility with the schema, so it's expecting one data type and it's really seeing a different type? So maybe not an exact story, but I think the idea is that with geospatial data, being able to have these more contextual clues that have to do with the geometry can be really helpful to interrogate some of these potential bugs or issues.

Gregory Kapfhammer 00:54:06 Thanks for that response. Amy, Jennings, did you want to add anything here?

Jennings Anderson 00:54:10 Yeah, it's kind of also in the form of a shout out to one of our members as well, and that would be Wherobots, who supports the Apache Sedona library. But our technical use case here is that we need to be building on the existing work that's happening in this space, which is an incredibly fast-moving target. Geo tech is evolving very quickly, and we're working right up on the edge with GeoParquet. And since these are massive data sets, we're obviously working in the cloud and on distributed platforms, leveraging things like Apache Spark. And one thing that we tend to do when starting into a project is, unfortunately, end up reinventing little parts of the wheel, just because you don't know what else is out there or what was published last week. And then, is it safe to take a production dependency on something that was just updated?

Jennings Anderson 00:54:59 It’s all these software engineering best practices and concerns. But one thing that we were doing early on was, maybe what we thought was over optimizing and trying to break the data up in very specific ways that worked well for this part of the world, but not as well for the data over in this part of the world, just based on the size of the shapes. And turns out that, that problem had actually totally been solved in the latest version of Apache Sedona, which enables you to do all these complex geospatial operations in Spark. And so what we learned was to make sure that you’re building on top of the right tools from the beginning. And that’s, definitely something that we’ve learned on the technical side. And then, so our pipelines end to end embrace open-source geospatial software from a very active, evolving community of geospatial software developers.

Gregory Kapfhammer 00:55:47 Thanks for that response, Jennings, and Amy, thank you for yours as well. As we draw our episode to a conclusion, I wanted to ask you each one final question and then see if you have any final topics that you'd like to discuss. Amy, in particular for you: I noticed that Overture Maps is a project that's under the Linux Foundation. Can you tell us why it's under the Linux Foundation, and what lessons you've learned from the perspective of governance and the Overture Maps project as a whole?

Amy Rose 00:56:13 Right, yeah, that's a great question. In fact, we get that question quite a bit. I hope that folks are familiar with the Linux Foundation; obviously, this is what the Linux kernel is associated with and still maintained under. As part of the Linux Foundation, Overture operates as a nonprofit technology consortium. The idea of operating under that structure is that it provides a neutral, vendor-agnostic home for the project. It gives us a much better way to remain open and truly collaborative. And so Overture has a steering committee, made up of, certainly, the founding members, and that steering committee really guides and oversees the strategic direction of the project. But most of the tactical and operational decisions are really made at the working group level, and the working groups include any of the members that are interested in participating in that particular working group.

Amy Rose 00:57:11 And so the technical decisions are really the result of collaborative discussions and a lot of consensus building. The nice thing about it is that it creates a lot of space for the best ideas to rise, but still takes advantage of the diversity of perspectives coming from all the different organizations that are involved as members. When I first joined Overture, I kind of wondered: how is this going to work? Is this really going to work, or is it kind of pie in the sky? And I think some people look at this as, is this going to be a bunch of competitors getting in a room? And the reality is, well, yeah, sure, some folks are, but everything that we're doing is really pre-competitive, and you don't really see that rivalry coming through in the project.

Amy Rose 00:57:57 The real challenge is really more about bringing different organizations together; they have different ways of working and different cultures within their own organizations, and we're trying to bring all that together to solve these really big problems. So it's finding the best ways to work together, and thinking about how we align on our shared mission while everybody operates in a different way, from the organizational level all the way down to the tech stack that they're used to using. How do we bring those things together and be mindful that those are still key pieces of how these individual organizations operate, but still find a way to make it all work together for the mutual benefit of this larger consortium?

Gregory Kapfhammer 00:58:43 Thanks for that insightful response, Amy. Amy, I know that you and Jennings together have decades of experience in the geospatial community. Could both of you give some advice to the software engineers who are listening to this episode who want to get involved with GIS or who want to explore Overture? What do you think should be their first step in order to engage with the geospatial community?

Jennings Anderson 00:59:04 Yeah, I would say, to echo what Amy said earlier: as geo tech keeps evolving and improving, we're really trying to dissolve that traditional notion that spatial is special. And I think that's an invitation for more people outside of this kind of siloed area to get involved and see what they can do with this type of data, because it is extraordinarily valuable data. So, selfishly, I'd love to see more folks looking at the Overture GitHub repositories and helping to identify issues or raise questions about the data, because we do take all of the discussions and issues that people raise very, very seriously. And if there's something in the schema that's not working for somebody, we'd love to know how we can all work together to evolve our own data set and improve our product.

Gregory Kapfhammer 00:59:49 Okay, thanks. Amy, did you want to add anything?

Amy Rose 00:59:52 I mean, I think that was a pretty great answer. Maybe I'd just add, to underscore what Jennings was saying about different people coming into this geospatial community, that those walls around geospatial are coming down. It's really true. You see folks that have backgrounds in journalism or art or anthropology, and definitely computer science and engineering, across the board, because this idea of location is such an important concept in the world and in how we see the world and interact with the world. So don't think about geospatial data as different, and don't look at the tools as very unique; instead, look at the ways to extend how you operate now to really take advantage of geospatial context. And I'll also just plus-one the invitation to come visit our GitHub site; reach out if there's a way that you feel like you want to give feedback or get involved. It's a fun community.

Gregory Kapfhammer 01:00:52 Thanks for those responses. That was really helpful. We’ve had a brief conversation about a wide variety of topics. We talked about Overture Maps and how it’s connected to a wide variety of different GIS concepts. We’ve also discussed the specifics about how you built and deployed Overture Maps and how people can actually build their own applications and access the Overture Maps data. Before we get to the end of our episode, I’m wondering if either of you have any remaining topics that you think we should briefly discuss.

Amy Rose 01:01:20 Yeah, I'd be happy to go back and revisit our six data themes. I believe we covered the first three, addresses, buildings, and base, in detail, but I just wanted to add a note on the remaining three: divisions, transportation, and places. Divisions is commonly referred to as our administrative boundaries; these are all of our geographic divisions, representing everything from countries and counties down to cities. Then we have our transportation theme, which is broken into segments and connectors. This is a full, globally routable transportation network that combines data from OpenStreetMap as well as TomTom, and it has been re-segmented for optimized routing, with linear referencing, et cetera. And the last one is places, which is Overture's places of business and points of interest data set. This is data from both Microsoft and Meta, and it continues to expand.
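[Editor's note: Overture distributes its themes as GeoParquet files that are commonly queried with tools like DuckDB. The sketch below instead uses Python's built-in sqlite3 with a made-up, flattened table so it runs anywhere; the real places schema nests names and categories in structs, so the column names here are simplified stand-ins. It only illustrates the common category-plus-bounding-box filter pattern that Amy's places theme supports.]

```python
# Minimal local sketch of filtering Overture-style "places" records.
# NOTE: columns (id, name, category, lon, lat) are simplified stand-ins
# for the real Overture places schema, which nests names/categories.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE places (id TEXT, name TEXT, category TEXT,"
    " lon REAL, lat REAL)"
)
con.executemany(
    "INSERT INTO places VALUES (?, ?, ?, ?, ?)",
    [
        ("p1", "Corner Cafe", "cafe", -122.41, 37.77),
        ("p2", "City Library", "library", -122.43, 37.78),
        ("p3", "Harbor Cafe", "cafe", -122.39, 37.80),
    ],
)

# Category filter plus a bounding box: the same shape of query used
# against the real GeoParquet files in cloud storage.
rows = con.execute(
    """
    SELECT id, name FROM places
    WHERE category = 'cafe'
      AND lon BETWEEN -122.42 AND -122.38
      AND lat BETWEEN 37.75 AND 37.81
    ORDER BY id
    """
).fetchall()
print(rows)  # the two cafes inside the box
```

In practice the same SELECT runs directly against Overture's published Parquet files with DuckDB, so no local database build is needed.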

Gregory Kapfhammer 01:02:21 Thanks for that response, Amy, that was really helpful and it helps to make sure that we've covered all the key details related to Overture Maps. Jennings, it was great to have you on the program. I really appreciate it. And Amy, thanks for being on the show as well; I'm glad that you could give such insightful responses throughout the episode.

Jennings Anderson 01:02:37 Yeah, thanks for having me. This was a lot of fun. I always love talking about geospatial data, so thanks for having both of us.

Gregory Kapfhammer 01:02:44 Okay. It was great for each of you to be on the episode. For our listeners who want to learn more about Overture Maps and the Overture dataset, make sure you check the show notes of the episode to learn more details. At this point, Amy and Jennings, thanks again for being a part of the fantastic conversation we’ve had today. This is Gregory Kapfhammer signing off for Software Engineering Radio.

[End of Audio]
