Public Knowledge Graph

Alexy Khrabrov
Chief Scientist
2 min read · Jan 12, 2016

--

In multiple SF Text meetups and Text By the Bay presentations, modern NLP has revealed itself as a means to extract actionable knowledge from the world's Internet data and user data in a specific context, derive human intent, and follow up on it. This used to be called AI. Machine Learning (ML) is much more popular now under the new brand management of Deep Learning, which is when it really went mainstream.

Carlos Guestrin, the CEO and founder of Dato (formerly GraphLab), keynoted the Data Summit 2015 with a strong prediction: all apps will be ML-enabled. ML will become a commodity, and each ML user or provider will need to package it as (micro)services and then aggregate and consume them.

Consumer applications will need to know about the real world and its context. Google has its Knowledge Graph; LinkedIn has its Economic Graph. Knowledge bases and providers such as Factual can be used by startups, but the key problem is not yet widely recognized.

Currently, only global and usually public corporations, such as Google, Facebook, Amazon, Microsoft, LinkedIn and a few others can maintain a vast and comprehensive knowledge graph about the real world. The graph is used to disambiguate named entities, such as places, company names, people, etc. API providers such as Factual exist, but their databases are usually not as complete as the big guys’ (BG’s).
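To make the disambiguation point concrete, here is a minimal sketch of how a knowledge graph resolves an ambiguous mention. The entities, IDs, and context terms below are all invented for illustration; real systems (Google's Knowledge Graph, Factual's APIs) are vastly larger and use learned models rather than simple word overlap.

```python
# Toy sketch: knowledge-graph-backed named-entity disambiguation.
# All entity IDs and attributes here are made up for illustration.

KNOWLEDGE_GRAPH = {
    "apple": [
        {"id": "ENT-1", "label": "Apple Inc.", "type": "company",
         "context": {"iphone", "mac", "cupertino", "ceo"}},
        {"id": "ENT-2", "label": "apple (fruit)", "type": "food",
         "context": {"fruit", "orchard", "pie", "juice"}},
    ],
}

def disambiguate(mention, surrounding_words):
    """Pick the candidate entity whose stored context best overlaps
    the words around the mention in the text."""
    candidates = KNOWLEDGE_GRAPH.get(mention.lower(), [])
    if not candidates:
        return None
    words = {w.lower() for w in surrounding_words}
    return max(candidates, key=lambda c: len(c["context"] & words))

best = disambiguate("Apple", "the iPhone maker based in Cupertino".split())
print(best["label"])  # Apple Inc.
```

The gap the post describes is exactly here: the BG's version of `KNOWLEDGE_GRAPH` covers millions of entities with rich context; an SG's version does not.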

What does this mean for the ML-enabled apps Carlos outlined? The BG's apps will be more precise in real-world context than the small guys' (SG's), and the gap will only widen. The BG's are leveraging network effects across their web properties, including global user accounts.

How can startups, nonprofits, and other SG’s compete in the brave new ML world?

The answer is the same way Spark and Hadoop are replacing Oracle: Open Source. There are two parts to the Public Knowledge Graph solution.

First, we need to recognize that individual companies can join forces around the graph, at least the part that represents public knowledge in a semantically enriched formalism. We can then collaborate on making the Big Data pipeline that builds the graph fully OSS. E.g., the SMACK stack (Spark, Mesos, Akka, Cassandra, and Kafka; see noetl.org) is already used for this "shell".

Second, the knowledge itself should be crowdsourced. Facts about where people work, where businesses are located, and other shared aspects of the real world are public, and we the people can keep them so. There are multiple efforts in this direction, and enabling ML apps via composable services will drive convergence of the public efforts. Furthermore, open-source alliances such as framework.foundation could help frame "last mile" production-ready work. The tipping point will be the fusion of data science and data engineering that we see emerging with the SMACK stack and microservices.
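A crowdsourced graph needs some notion of provenance and consensus, or it degrades into noise. The sketch below is one naive way to frame it, assuming facts are (subject, predicate, object) triples and that agreement among independent contributors is an acceptable acceptance rule; the class, method names, and threshold are all hypothetical.

```python
# Minimal sketch of a crowdsourced public-knowledge store with provenance.
# Triples are accepted once enough distinct contributors confirm them.
from collections import defaultdict

class PublicGraph:
    def __init__(self, min_sources=2):
        # triple -> set of contributor ids who asserted it
        self.claims = defaultdict(set)
        self.min_sources = min_sources

    def contribute(self, contributor, subject, predicate, obj):
        """Record that a contributor asserts this triple."""
        self.claims[(subject, predicate, obj)].add(contributor)

    def accepted_facts(self):
        """Triples confirmed by at least min_sources distinct contributors."""
        return [triple for triple, who in self.claims.items()
                if len(who) >= self.min_sources]

g = PublicGraph()
g.contribute("startup_a", "Acme Corp", "located_in", "Palo Alto")
g.contribute("startup_b", "Acme Corp", "located_in", "Palo Alto")
g.contribute("startup_a", "Acme Corp", "located_in", "Mars")  # outlier
print(g.accepted_facts())  # [('Acme Corp', 'located_in', 'Palo Alto')]
```

Real efforts in this space (Wikidata, OpenStreetMap) use far richer moderation, but the core design choice is the same: every fact carries who asserted it, so the graph can be audited and merged across contributors.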

One idea I had is to create a VC firm, or herd existing ones into a syndicate, where all portfolio companies share a knowledge graph. The condition of funding is that you can use the graph, but you also have to contribute back to it.


Open-Source Science Founder and Chair, NumFOCUS. Founder and organizer, Scale By the Bay and Bay Area AI. Dad of 4.