Stardog and corona

I'm not a huge fan of when companies jump on a crisis, or when they swing the IT hammer to solve any and all problems, but I'm going to suggest this because I don't know what else to do. Is there any chance that Stardog could open the sandbox to COVID-19 datasets and maybe start a GitHub repo for mappings? There seems to be a lot of data out there (Search · covid · GitHub). I thought it might be something to do while we're all stuck inside.


Interestingly, I haven't been able to find much raw data. There are plenty of apps, but it's difficult to tell where they're pulling their information from. Given the global nature of this virus, the multilingual support in RDF might be helpful with these datasets. This seems to be the best resource I've found yet: GitHub - soroushchehresa/awesome-coronavirus: 🦠 Huge collection of useful projects and resources for COVID-19 (2019 novel Coronavirus)

Here's a nice dataset for Korean cases that highlights the multilingual aspects of it. Maybe we can get translations for Italian. GitHub - jihoo-kim/Data-Science-for-COVID-19: DS4C: Data Science for COVID-19 in South Korea
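To make the multilingual angle concrete (the property names and IRIs here are hypothetical, not from the dataset), language-tagged literals let one resource carry Korean and English labels side by side, and Italian could be added the same way:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex: <http://example.com/covid/>

# Hypothetical sketch: one region resource labeled in several languages
INSERT DATA {
    ex:region-daegu a ex:Region ;
        rdfs:label "대구"@ko , "Daegu"@en .
}

A query can then pick whichever label it needs with FILTER(langMatches(lang(?label), "ko")).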

I've started a GitHub repo at GitHub - semantalytics/coronavirus

The dataset I posted has lat/long and geographic region information, which we can join with GeoNames to get population density, as well as contact chaining, although I haven't had a chance to see how that's done yet.

I did a really quick mapping just to get the contact chaining. You can see the now infamous patient 31 on the right.
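In case it's useful to others, here's a rough sketch of what such a mapping can look like in Stardog's mapping syntax (SMS2). The column names patient_id and infected_by are what I understand the DS4C patient file to use, and the IRIs are made up - this isn't the exact mapping from the repo:

PREFIX : <http://example.com/covid/>

MAPPING
FROM CSV {
}
TO {
    ?patient a :Patient ;
        :patientId ?patient_id ;
        :infectedBy ?source .
}
WHERE {
    # With CSV sources, each column header is available as a variable
    BIND(template("http://example.com/covid/patient/{patient_id}") AS ?patient)
    BIND(template("http://example.com/covid/patient/{infected_by}") AS ?source)
}

Once the :infectedBy edges are loaded, the chain around a patient is just a property path query over ^:infectedBy.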

Zach,

Incidentally, last night I had the same intent - to put COVID-19 CSV data into Stardog and then visualize it in Linkurious to learn more about its spread. How many features can be gathered from the data?

Hey Zach,

Are you planning on publishing the mappings to that github repo you created? I also have an interest in assembling a knowledge graph here.

Cheers,
Al

1 Like

I was planning on it, but I'd be happy to have Stardog take the lead and submit PRs to your repo. I also thought it would be a good exercise in taking notes on the pain points of rapidly producing mappings.

One thing that I struggle with, especially with CSV files, is whether I should just quickly import the file and fix it up afterward with SPARQL Update queries, or load it into a relational database where I have more control over the mappings. If it's even a moderately sized dataset I go the relational database route, since it's really painful to reimport every time you update the mappings.
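For the "import quickly, fix it afterward" route, the cleanup is usually a handful of DELETE/INSERT queries. A minimal sketch, assuming a hypothetical :confirmed_date property that came in from the CSV as a plain string:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX : <http://example.com/covid/>

# Retype string-valued dates from a naive CSV import as xsd:date
DELETE { ?s :confirmed_date ?raw }
INSERT { ?s :confirmed_date ?typed }
WHERE {
    ?s :confirmed_date ?raw .
    FILTER(datatype(?raw) = xsd:string)
    BIND(xsd:date(?raw) AS ?typed)
    FILTER(bound(?typed))   # skip rows where the cast fails so nothing is silently dropped
}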


Hi Zach,

My 2c - previously I have solved similar complex mapping challenges by using Python to transform the source into JSON-LD and loading it into Stardog. My source was plain enterprise-attack.json - simple JSON - but the same challenge exists when dealing with CSVs, I suppose.
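For anyone who hasn't worked with JSON-LD, the transformation mostly amounts to adding an @context that maps the plain JSON keys onto IRIs so the result loads straight into Stardog as RDF. A tiny, hypothetical sketch (the keys and vocabulary here are made up, not the actual ATT&CK fields):

{
  "@context": {
    "@vocab": "http://example.com/vocab#",
    "id": "@id"
  },
  "@graph": [
    { "id": "http://example.com/item/1", "name": "Example item", "created": "2020-03-01" }
  ]
}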

I'd be glad to provide guidance with a Jupyter notebook to get this custom mapping and loading working.

Hope this helps,
Regards

I found a great general resource for contributing to COVID-19 response efforts: https://helpwithcovid.com/

I posted initial mappings of the Korean dataset to GitHub - semantalytics/stardog-covid-19-south-korea-kcdc

Where can one find a description of the fields in those files?

The original dataset is [NeurIPS 2020] Data Science for COVID-19 (DS4C) | Kaggle, which has some short descriptions. It's hard to tell exactly where all the data is coming from. People are copying it and adding additional stuff. This dataset appears to use it but seems to have some additional data: GitHub - parksw3/COVID19-Korea: Public line list and summaries of the COVID-19 outbreak in South Korea

This might be a good dataset to add. It's the locations of all US hospitals: https://catalog.data.gov/dataset/hospitals-dcdfc . There's also this one from HIFLD, although I'm not sure if it's any different: Hospitals

This is a global thing, so I'll see if I can find hospital locations for other countries. This was just an easy one to find.

I found this: Healthsites.io

...and this, this morning: https://pages.semanticscholar.org/coronavirus-research . It's mostly published papers, but there might be some interesting stuff in the metadata. Maybe use NLP/BITES, etc. There are also some good links to other sources of data at the bottom.

Italian dataset. Will need to use Google Translate: GitHub - pcm-dpc/COVID-19: COVID-19 Italia - Monitoraggio situazione (situation monitoring for Italy)

Tableau started a COVID-19 data hub - https://www.tableau.com/covid-19-coronavirus-data-resources

It might have additional data sources.

They just released a new 2.0 version of the dataset: [NeurIPS 2020] Data Science for COVID-19 (DS4C) | Kaggle


I'm back on this after a brief adjustment to my work/home life. I'm now looking at the data set provided by https://covidtracking.com . Most of the data coming out is very tabular in nature and I'm looking for ways to pull in data from other data sets.

Right now if you wanted to pull in positive cases you could run the following query

select ?abbr ?date ?positive where {
    ?s :state ?abbr ; :positive ?positive ; :date ?date
} order by ?abbr ?date

but say you wanted to normalize by population. You can pull in the population from Wikidata with the following query, looking at just New York:

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>

select ?abbr ?date (xsd:integer(?positive)/?pop * 100 AS ?ratio) where {
    ?s :state ?abbr; :positive ?positive ; :date ?date 
    service <https://query.wikidata.org/sparql> {
         ?state wdt:P31 wd:Q35657; wdt:P5086 ?abbr; wdt:P1082 ?pop .
         SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
    }
    filter(?abbr = "NY")
} order by ?abbr ?date

and to look at the latest

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>

select ?abbr ?date (xsd:integer(?positive)/?pop * 100 AS ?ratio) where {
    ?s :state ?abbr; :positive ?positive ; :date ?date 
    service <https://query.wikidata.org/sparql> {
         ?state wdt:P31 wd:Q35657; wdt:P5086 ?abbr; wdt:P1082 ?pop .
         SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }
    }
    filter(?abbr = "NY")
} order by ?abbr desc(?date) limit 1

1.33% of New Yorkers have tested positive. (1.42% as of 4/26, ugh.) There are a lot of other factors that go into the infection rate, but it's interesting.
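As a follow-on, dropping the NY filter and aggregating gives a rough state-by-state comparison. This is only a sketch - it assumes the :positive counts are cumulative, so the maximum per state is effectively the latest value:

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>

select ?abbr (MAX(xsd:integer(?positive)) / MAX(?pop) * 100 AS ?ratio) where {
    ?s :state ?abbr ; :positive ?positive .
    service <https://query.wikidata.org/sparql> {
        ?state wdt:P31 wd:Q35657 ; wdt:P5086 ?abbr ; wdt:P1082 ?pop .
    }
} group by ?abbr order by desc(?ratio)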

If anyone has any suggestions of what else might be interesting to pull in that is outside the standard tabular reporting I'd be happy to look into that.

Some observations on the coronavirus and data. The world community isn't tracking very interesting data about the virus. Some of that can probably be attributed to privacy considerations for health-related data, and even when tracking only the most basic data - positive cases, hospitalizations, etc. - it is still challenging to collect and report, and the quality is questionable. Answering even the simplest and seemingly easiest question - how many people have tested positive? - is not an easy task.

Since I'm using this as a study on data collection and integration to identify opportunities for improvement, I want to note how poor the CDC's data is. Say you want to compare SARS-CoV-2 to other coronaviruses. There's a nice chart at Coronavirus National Trends - NREVSS | CDC. If you click on the link at the bottom you get an HTML chart, Coronavirus for the US. It's not the worst - at least you get the data - but it's not that great.

If you want CDC data on laboratory-confirmed hospitalizations, you can go to this page, COVID-19 Hospitalizations, and there is a data download link, but you'd need to use it for each region and there isn't a simple URL behind it. They're doing some strange cookie thing based on your selections. I don't need a chart. These are small datasets and I could do the chart in a minute with Excel. Just a CSV would be better than what they have, and only a PDF would be worse.

Pulling in some genomic information might make this more interesting. Not interesting as in, "yay, this is fun" but more in the way a serial killer is interesting.

I've been reading reports about multiple SARS-CoV-2 strains, approximately 30, and gene deletion sequences. I know that semantic web technology is used a lot in the biosciences, but I don't follow it too closely. I find that if you're not in that field it's tough to follow.

https://www.tillett.info/2020/04/28/there-are-many-sars-cov-2-strains-with-gene-deletions/

It might be interesting to pull in some data on this as people are interested in identifying an attenuated strain.