Was 2017 the Year of the Graph?

As I sit here working on my presentation for GraphDay Texas 2018 (abstract here), I got to thinking about what Lynn Bender said in his opening to GraphDay Texas 2017, when he declared that 2017 was “The Year of the Graph”.  Looking back on all the happenings in the graph world over the past year, all I can say is that he was certainly right.  As a rule I tend to hate “Best of 2017” type articles, but in this case I decided to take some time to reflect on how much momentum the graph ecosystem has gained over the last year.

Ever Evolving Landscape

JanusGraph

Last January, JanusGraph was officially announced as a fork of Titan, the popular but no longer maintained distributed graph database.  This announcement was met with great excitement from the community, as there was now a viable path forward for all the Titan users who had been left out in the cold.  In the year since it was announced, JanusGraph has had two major releases, including several major updates and additions to the supported backend storage and indexing engines.  As a nod to the maturity of JanusGraph, IBM announced that they were ending support for their IBM Graph product in favor of Compose for JanusGraph.

CosmosDB and Amazon Neptune

In May, at Build 2017, Microsoft got into the graph Database as a Service (gDaaS) game with the announcement of Azure CosmosDB.  CosmosDB is a globally distributed multi-model data service that supports several different data query APIs.  Currently there is support for MongoDB, Table, SQL, Cassandra and Graph, the last using TinkerPop’s Gremlin query language.

At AWS re:Invent 2017, Amazon announced a limited preview of their gDaaS platform, Amazon Neptune. Amazon Neptune is a fully managed graph database that can be used either as an RDF triple store queried with SPARQL or as a property model datastore queried with TinkerPop’s Gremlin query language.

Newcomers

There were a variety of newcomers to the field this year, but I want to mention two specifically because of how they are both similar and different.  The first vendor is DGraph, which has been around since 2016 but had their 1.0 release this year.

The second vendor is Memgraph, which had a preview release last year and was named as one of the London TechStars in 2016.

A few of the interesting similarities I see between these two vendors are:

  • Both are targeting real-time transactional workloads
  • Both are starting from the ground up by designing distributed property model data stores.
  • Unlike many of the other major vendors in the space (Titan, DSE Graph, Neo4j, JanusGraph), they are both building out their engines using native code instead of JVM languages (DGraph is using Go, Memgraph is using C++).

While building real-time distributed property model graphs using non-JVM languages is not a novel idea (TigerGraph has done something similar), and two instances are certainly not enough to call it a trend, I do think this is strong evidence of a market need in this area.

While there are several interesting similarities, there are also some distinct differences.  The first and probably most distinct is their approach to established query languages.  DGraph allows communication either through their gRPC client (which is a novel concept) or through their custom query language, GraphQL+-, which is based on Facebook’s GraphQL.  Memgraph has taken a different approach and made their system compatible with OpenCypher.  I have had the chance to get hands-on with Memgraph (I hope to write this up when I get some free time), and I found that their compatibility with OpenCypher allowed me to quickly and easily integrate with the growing ecosystem of tools that support Neo4j and Cypher.

In my mind this difference is very telling of the raw nature of the property model graph ecosystem.  The lack of one truly dominant standard (as SQL is for relational databases) is definitely an area where I wish we were able to come to an agreement.  While both OpenCypher and TinkerPop Gremlin are open and widely adopted standards, there are many vendors that have felt the need to create their own languages to address some of the shortcomings in both.

Other Major Events

The items listed above were certainly not the only ones to happen in the graph ecosystem this year.  Some of the other events I took note of were:

  • CallidusCloud’s purchase of well-known graph vendor OrientDB is certainly another noteworthy event.
  • The introduction of Gremlin Language Variants to Apache TinkerPop, allowing programming languages to fluently query the graph, was a tremendous leap forward.
  • Neo4j’s announcement of their transition from being just a graph database to being a graph platform including support for Cypher on Apache Spark as well as better tooling and integration in enterprise environments.
  • Additionally Neo4j has introduced a new starter kit platform for modern full stack application development in the form of their GRAND stack.
  • Increased support for additional graph databases from vendors such as Linkurious, Cambridge Intelligence’s KeyLines, and Tom Sawyer makes visualizing graph data easier than ever.
  • Kelvin Lawrence’s release of Practical Gremlin – An Apache TinkerPop Tutorial, which is the most complete and most informative explanation of the Gremlin query language.  This, along with Doan DuyHai’s blog series The Gremlin Compendium, has become one of my go-to resources.
  • The open source community stepping up to provide some of the much needed missing tools for working with graph databases.  This includes:

Looking Forward

Whew, that was a lot to get through, but I am very excited to see what this coming year brings.  In particular I am keeping my eye on the evolution of Machine Learning integrations into Graph Databases.  In the last year the ecosystem seems to have matured enough to move beyond the “What is a Graph Database?” stage and on to a more productive “How can I use this to help me?” phase.  Both Stardog and Neo4j released integrated machine learning libraries last year, and I am expecting (or maybe just hoping) that others follow suit this year.  With all that being said, I am looking forward to all the new and interesting ideas that come out of DataDay Texas 2018.  If I missed anything you think is important, please feel free to leave it in the comments below.

 

Dipping my Toe in CosmosDB Graph API

For the inaugural post of my new blog I am going to discuss something else that is brand new: CosmosDB.  CosmosDB was a major announcement at Microsoft’s Build 2017 conference, held last week.  For those of you who missed the announcement, you can find it here.

TL;DR Summary on CosmosDB 

CosmosDB is the next generation of Azure’s DocumentDB, and it supports globally distributed multi-model data including Key/Value, Wide Row (Table), Document and Graph data models.  I am not going to repeat all the details here, but if you want them I suggest you read this post.

My Interest

There are a lot of new features included as part of CosmosDB (global replication, data partitioning, tunable consistency levels, …) that are each worthy of their own post, but what really caught my eye was CosmosDB’s Graph API.  This new feature provides support for the Apache TinkerPop Gremlin query language.  I was particularly interested in this because part of my current project at work has been evaluating TinkerPop-enabled graph databases for use in upcoming projects.  I currently work at a .NET development shop, and in general the .NET drivers and support for the major graph databases lag behind their Java, NodeJS and Python counterparts.  With CosmosDB being a Microsoft project, the .NET driver is really a first-class citizen in the ecosystem, and that makes it an intriguing prospect.

First Impression

My first experience using CosmosDB’s Graph API was to set up the initial database using the Azure web portal.  As shown in their docs (click here), creating the initial graph was the sort of point-and-click experience that you would expect from a managed service.  A nice additional feature in the web portal was the ability to download either a customized .NET solution or a version of the Gremlin console with all the proper connection information pre-configured for you.  I initially missed this, and I struggled to get the Gremlin console connected to the graph because I could not figure out the correct username and password.  They were in the documentation provided online, but I missed them.  In case you need them, the docs on how to manually configure the Gremlin console are available here.

Second Step – Migrate a Real Use Case

Since my first experience was so painless I thought, why not take this a step further and try porting my current project over to use CosmosDB?  Since this application was a .NET Core project, the migration was a rather straightforward process that took me less than an hour.  I just replaced my current driver with the .NET CosmosDB driver from NuGet, available here (it is currently in preview, so make sure to check the “Include prerelease” box).

Once the driver was installed there was a bit of coding required to migrate to the CosmosDB method of executing Gremlin traversals.  The changes required were minimal and were easily copied from their sample project.  I was able to get it to compile and run.  Unfortunately I ran into a few hiccups with my traversals due to some Gremlin features/steps that are not yet supported.  The list of currently supported steps is available here.

The two specific issues I ran into were:

  • Recurring .V() steps (a second V() in the middle of a traversal) are not currently supported, such as:
        g.V().has('titan', 'name', 'saturn').as('s').V().has('god', 'name', 'neptune').as('n').addE('father').from('n').to('s')

While this was an annoyance, it only took a bit of reworking of my traversals to get my insert queries for adding vertices and edges working.

  • Subgraph steps are not currently supported.  This was a bit of a show stopper for me as my current project relies heavily on the use of subgraphs when retrieving data.  

I contacted the CosmosDB team about these issues and they quickly responded that both features are currently under development or on the near-term roadmap.
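As an illustration of the kind of rework the first issue calls for, here is a minimal sketch (not the exact traversals from my project).  It builds an edge-insert query that starts from the source vertex and passes the target lookup as a nested traversal to to(), so there is no second .V() step in the main traversal.  The helper names (quote, vertex_lookup, add_edge_query) are hypothetical, and whether a given CosmosDB preview build accepts a has() lookup inside to() is an assumption you should check against the supported steps list.

```python
# Hypothetical helpers that build a Gremlin query string without a
# mid-traversal .V() step. Nothing here talks to a database.

def quote(value: str) -> str:
    """Render a Python string as a single-quoted Gremlin string literal."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def vertex_lookup(label: str, key: str, value: str) -> str:
    """Build a traversal that finds vertices by label and one property."""
    return f"g.V().has({quote(label)}, {quote(key)}, {quote(value)})"


def add_edge_query(edge_label: str, source: tuple, target: tuple) -> str:
    """Build an edge insert from source to target.

    source and target are (label, key, value) tuples. The target lookup is
    nested inside to(), so the main traversal contains only one V() step.
    """
    return (
        f"{vertex_lookup(*source)}"
        f".addE({quote(edge_label)})"
        f".to({vertex_lookup(*target)})"
    )


# Same edge as the unsupported example: neptune -[father]-> saturn
query = add_edge_query(
    "father",
    source=("god", "name", "neptune"),
    target=("titan", "name", "saturn"),
)
print(query)
```

The resulting string can be submitted through whichever driver or console you are using.  Building queries by string concatenation is only reasonable for trusted input; if your driver supports parameter bindings, those are the safer choice.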

Using CosmosDB

In addition to the application drivers, you are provided with two additional options to interact with CosmosDB.  

The first is to use the Gremlin console to connect to the remote Gremlin server and send your traversals.  Don’t forget to prefix each command with :> (I did at first), otherwise it is not submitted to the remote server.  This gives you the ability to run all your Gremlin traversals in your standard terminal window.

The second is to use the Data Explorer, which is a visualization tool that is built into the Azure Portal.  It provides a way to visually interact with your data, which I find very helpful when working with graph models.  

It does have a few quirks about it, specifically the default node layout can start out a bit strange.

In addition to that I was not able to figure out how to see edge properties.  With that said it is a really nice tool to help you out when trying to visualize your data.

I haven’t yet tried connecting with any of the other 3rd party tools, but I suspect/hope that any tool that works with a standard Gremlin server will work here as well.

Summary

I think it is a really great addition to the graph database ecosystem to have another Gremlin-enabled graph database on the market, especially one with strong support for .NET.  I am interested in taking some time to explore their different consistency models (read about them here and here) as well as looking at the performance of the Graph API.

As with other current offerings, CosmosDB has a few rough edges.  Given that it was only announced as a preview last week, it is a strong initial offering worth taking notice of.  The fact that it is a newly released, globally distributed multi-model datastore supporting four different types of data model is a huge accomplishment that the entire CosmosDB team should be proud of.

I will be watching to see where this goes next.