2018

2018-03-07 Peter

It’s time for the news again…

Technology updates (details are on the relevant technology pages):

Apache Kylin has hit 2.3
Apache Spark has also hit 2.3

Other technology news:

Druid has been donated to the Apache Incubator - proposal; incubator page
Elastic have announced that they’ll be open sourcing their Elastic X-Pack as of Elastic 6.3. The code will be moved into the public repos for their other products (but under the Elastic EULA), and the free elements will be pre-bundled with those products rather than requiring a separate download - announcement; details; Datanami view
An excellent article on Data Warehouse Automation - we’ll get to talking more about this soon - link
MapR have announced “MapR Data Fabric for Kubernetes” - persistent storage for containers running on Kubernetes - announcement; homepage; Datanami view
Hortonworks have blogged about what’s new in Cloudbreak 2.4 - link
The latest Hortonworks blog post on HDF 3.1 is up, this time on the MiNiFi C++ agent - link
AWS have published their best practice for running Kafka on AWS - link
Datanami have covered Cloudera’s announcement of Altus Data Science (R and Python-based machine learning workloads based on their Data Science Workbench) coming to beta soon, with an operational database built on HBase coming as the fourth package in the future - link
Again from Datanami, a report that Streamlio is claiming up to a 150% performance advantage for Apache Pulsar over Apache Kafka as a Streaming Data Store - link
From ZDNet, an article that’s dense with information and well worth a read if you have an interest in Graph Databases or RDF Databases - link
- Cypher (the open source Graph query language from Neo4j) now has adapters to allow Cypher jobs to be run over Spark and TinkerPop Gremlin compatible databases
- There’s a SPARQL Gremlin bridge, allowing you to run SPARQL queries over TinkerPop Gremlin compatible databases
- Amazon Neptune (which supports both Gremlin and SPARQL) is apparently built on BlazeGraph
- There’s a new massively parallel distributed graph database from Cambridge Semantics (CS) called AnzoGraph, which they compare to TigerGraph
Looks like I missed the donation of this to the Apache Foundation, but Apache Hivemall is a scalable machine learning library implemented as Hive UDFs/UDAFs/UDTFs - home page
LinkedIn have proposed Dr. Elephant to the Apache Foundation - their performance monitoring and tuning service for jobs and workflows that run on Apache Hadoop and Apache Spark - proposal