Slides from an O'Reilly Webinar given on July 29th, 2015. This presentation describes how the Doradus database framework and the OLAP storage service extend Cassandra to provide a unique database solution for certain big data applications. Doradus OLAP uses columnar storage, application-level sharding, compression, and other techniques to store data very densely, yielding fast loading and queries that can scan millions of objects per second.
Overview of the Doradus database open source project and the Cassandra database on which it is based. This presentation was given to the Orange County Big Data Meetup group on July 16, 2014.
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2sf2z6i
This CloudxLab Introduction to Spark SQL & DataFrames tutorial helps you to understand Spark SQL & DataFrames in detail. Below are the topics covered in this slide:
1) Introduction to DataFrames
2) Creating DataFrames from JSON
3) DataFrame Operations
4) Running SQL Queries Programmatically
5) Datasets
6) Inferring the Schema Using Reflection
7) Programmatically Specifying the Schema
• Distributed datasets loaded into named columns (similar to relational DBs or
Python DataFrames).
• Can be constructed from existing RDDs or external data sources.
• Can scale from small datasets to TBs/PBs on multi-node Spark clusters.
• APIs available in Python, Java, Scala and R.
• Bytecode generation and optimization using Catalyst Optimizer.
• Simpler DSL to perform complex and data heavy operations.
• Faster runtime performance than vanilla RDDs.
User Defined Aggregation in Apache Spark: A Love Story (Databricks)
This document summarizes a user's journey developing a custom aggregation function for Apache Spark using a T-Digest sketch. The user initially implemented it as a User Defined Aggregate Function (UDAF) but ran into performance issues due to excessive serialization/deserialization. They then worked to resolve it by implementing the function as a custom Aggregator using Spark 3.0's new aggregation APIs, which avoided unnecessary serialization and provided a 70x performance improvement. The story highlights the importance of understanding how custom functions interact with Spark's execution model and optimization techniques like avoiding excessive serialization.
This document provides an agenda for a presentation on Big Data Analytics with Cassandra, Spark, and MLLib. The presentation covers Spark basics, using Spark with Cassandra, Spark Streaming, Spark SQL, and Spark MLLib. It also includes examples of querying and analyzing Cassandra data with Spark and Spark SQL, and machine learning with Spark MLLib.
This document discusses Spark SQL and DataFrames. It provides three key points:
1. DataFrames are distributed collections of data organized into named columns similar to a table in a relational database. They allow SQL-like operations to be performed on structured data.
2. DataFrames can be created from a variety of data sources like JSON, Parquet files, existing RDDs, or Hive tables. The schema can be inferred automatically using case classes or specified programmatically.
3. Common SQL operations like selecting columns, filtering rows, aggregation, and joining can be performed on DataFrames to analyze structured data. The results are DataFrames that support additional transformations.
Amazon DynamoDB is a fully managed, highly scalable distributed database service. In this technical talk, we will deep dive on how to: Use DynamoDB to build high-scale applications like social gaming, chat, and voting. - Model these applications using DynamoDB, including how to use building blocks such as conditional writes, consistent reads, and batch operations to build the higher-level functionality such as multi-item atomic writes and join queries. - Incorporate best practices such as index projections, item sharding, and parallel scan for maximum scalability
This document discusses using Apache Spark to perform analytics on Cassandra data. It provides an overview of Spark and how it can be used to query and aggregate Cassandra data through transformations and actions on resilient distributed datasets (RDDs). It also describes how to use the Spark Cassandra connector to load data from Cassandra into Spark and write data from Spark back to Cassandra.
Spark streaming can be used for near-real-time data analysis of data streams. It processes data in micro-batches and provides windowing operations. Stateful operations like updateStateByKey allow tracking state across batches. Data can be obtained from sources like Kafka, Flume, HDFS and processed using transformations before being saved to destinations like Cassandra. Fault tolerance is provided by replicating batches, but some data may be lost depending on how receivers collect data.
This document provides an agenda and overview of Big Data Analytics using Spark and Cassandra. It discusses Cassandra as a distributed database and Spark as a data processing framework. It covers connecting Spark and Cassandra, reading and writing Cassandra tables as Spark RDDs, and using Spark SQL, Spark Streaming, and Spark MLLib with Cassandra data. Key capabilities of each technology are highlighted such as Cassandra's tunable consistency and Spark's fault tolerance through RDD lineage. Examples demonstrate basic operations like filtering, aggregating, and joining Cassandra data with Spark.
My Hadoop Ecosystem presentation at the 2011 BreizhCamp.
See the talk video (in french):
http://mediaserver.univ-rennes1.fr/videos/?video=MEDIA110628093346744
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ... (StampedeCon)
Learn how to model beyond traditional direct access in Apache Cassandra. Utilizing the DataStax platform to harness the power of Spark and Solr to perform search, analytics, and complex operations in place on your Cassandra data!
Apache Spark is a fast and general engine for large-scale data processing. It provides a unified API for batch, interactive, and streaming data processing using in-memory primitives. A benchmark showed Spark was able to sort 100TB of data 3 times faster than Hadoop using 10 times fewer machines by keeping data in memory between jobs.
The document discusses the author's initial experiences learning and working with Amazon DynamoDB, a fully managed NoSQL database service. Some key points:
- DynamoDB allows for flexible data structures and extreme performance but is not suitable for complex queries. It uses a NoSQL model with primary and secondary indexes.
- The author created tables for a forum application and implemented a simple REST API for basic CRUD operations on the tables.
- While DynamoDB requires less maintenance than self-managed databases, the author notes it has limitations like the inability to modify indexes and less flexibility for querying data.
- Additional tools like Pentaho ETL were used to process and load sample data into the DynamoDB tables.
My talk about Catalyst for QCon Beijing 2015. In this talk, we walk through Catalyst, Spark SQL's query optimizer, by using a simplified version of Catalyst to build an optimizing Brainfuck compiler named Brainsuck in less than 300 lines of code.
If you’re familiar with relational databases, designing your app to use a fully-managed NoSQL database service like Amazon DynamoDB may be new to you. In this webinar, we’ll walk you through common NoSQL design patterns for a variety of applications to help you learn how to design a schema, store, and retrieve data with DynamoDB. We will discuss best practices with DynamoDB to develop IoT, AdTech, and gaming apps.
Lightning fast analytics with Spark and Cassandra (nickmbailey)
Spark is a fast and general engine for large-scale data processing. It provides APIs for Java, Scala, and Python that allow users to load data into a distributed cluster as resilient distributed datasets (RDDs) and then perform operations like map, filter, reduce, join and save. The Cassandra Spark driver allows accessing Cassandra tables as RDDs to perform analytics and run Spark SQL queries across Cassandra data. It provides server-side data selection and mapping of rows to Scala case classes or other objects.
This document provides an overview and summary of key aspects of DynamoDB, including:
1. DynamoDB is a fully managed NoSQL database that scales to any workload and provides fast and consistent performance.
2. DynamoDB uses a table structure with partition and sort keys to organize and access data, and scales both read and write throughput independently across partitions.
3. Common challenges include hot keys/partitions that can cause throttling, and designing schemas and partitions to spread access uniformly across the keyspace.
4. NoSQL data modeling focuses on aggregations rather than relations, using patterns like hierarchical data structures and parent-child relationships to model one-to-many and many-to-many relationships.
This document summarizes a presentation about using Spark with Apache Cassandra. It discusses using Spark jobs to load and transform data in Cassandra for purposes such as data import, cleaning, schema migration and analytics. It also covers aspects of the connector architecture like data locality, failure handling and cross-cluster operations. Examples are given of using Spark and Cassandra together for parallel data ingestion and top-K queries on a large dataset.
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. You can use Amazon DynamoDB to create a database table that can store and retrieve any amount of data, and serve any level of request traffic. Amazon DynamoDB automatically spreads the data and traffic for the table over a sufficient number of servers to handle the request capacity specified by the customer and the amount of data stored, while maintaining consistent and fast performance.
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra (Piotr Kolaczkowski)
The document discusses using Apache Spark and Apache Cassandra together for fast data analysis as an alternative to Hadoop. It provides examples of basic Spark operations on Cassandra tables like counting rows, filtering, joining with external data sources, and importing/exporting data. The document argues that Spark on Cassandra provides a simpler distributed processing framework compared to Hadoop.
Time series with Apache Cassandra - Long version (Patrick McFadin)
Apache Cassandra has proven to be one of the best solutions for storing and retrieving time series data. This talk will give you an overview of the many ways you can be successful. We will discuss how the storage model of Cassandra is well suited for this pattern and go over examples of how best to build data models.
Fast track to getting started with DSE Max @ ING (Duyhai Doan)
This document provides an overview of Apache Spark and Apache Cassandra and how they can be used together. It begins with introductions to Spark, describing its core concepts like RDDs and transformations. It then introduces Cassandra and covers concepts like data distribution and token ranges. The remainder discusses the Spark Cassandra connector, covering how it allows reading and writing Cassandra data from Spark and maintaining data locality. It also discusses use cases, failure handling, and cross-datacenter/cluster operations.
Apache Spark is a fast, general engine for large-scale data processing. It supports batch, interactive, and stream processing using a unified API. Spark uses resilient distributed datasets (RDDs), which are immutable distributed collections of objects that can be operated on in parallel. RDDs support transformations like map, filter, and reduce and actions that return final results to the driver program. Spark provides high-level APIs in Scala, Java, Python, and R and an optimized engine that supports general computation graphs for data analysis.
A deeper-understanding-of-spark-internals (Cheng Min Chi)
The document discusses Spark's execution model and how it runs jobs. It explains that Spark first creates a directed acyclic graph (DAG) of RDDs to represent the computation. It then splits the DAG into stages separated by shuffle operations. Each stage is divided into tasks that operate on data partitions in parallel. The document uses an example job to illustrate how Spark schedules and executes the tasks across a cluster. It emphasizes that understanding these internals can help optimize jobs by increasing parallelism and reducing shuffles.
This document provides an introduction to Cassandra including:
- Datastax is a company that contributes to Apache Cassandra and sells Datastax Enterprise.
- Cassandra was created at Facebook and is now open source software with the current version being 3.2.
- Cassandra's key features include linear scalability, continuous availability, multi-datacenter support, operational simplicity, and Spark integration.
Talk given at ClojureD conference, Berlin
Apache Spark is an engine for efficiently processing large amounts of data. We show how to apply the elegance of Clojure to Spark - fully exploiting the REPL and dynamic typing. There will be live coding using our gorillalabs/sparkling API.
In the presentation, we will of course introduce the core concepts of Spark, like resilient distributed datasets (RDDs). And you will learn how Spark's concepts resemble those well known from Clojure, like persistent data structures and functional programming.
Finally, we will provide some Do’s and Don’ts for you to kick off your Spark program based upon our experience.
About Paulus Esterhazy and Christian Betz
Being a LISP hacker for several years, and a Java-guy for some more, Chris turned to Clojure for production code in 2011. He’s been Project Lead, Software Architect, and VP Tech in the meantime, interested in AI and data-visualization.
Now, working on the heart of data driven marketing for Performance Media in Hamburg, he turned to Apache Spark for some Big Data jobs. Chris released the API-wrapper ‘chrisbetz/sparkling’ to fully exploit the power of his compute cluster.
Paulus Esterhazy
Paulus is a philosophy PhD turned software engineer with an interest in functional programming and a penchant for hammock-driven development.
He currently works as Senior Web Developer at Red Pineapple Media in Berlin.
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ... (randyguck)
Slides from my Strata+Hadoop 2015 Conference session titled: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP. This talk describes the Doradus OLAP query/storage engine, which is an open source module that runs on top of the Cassandra NoSQL DB. Among the benefits of this service is fast data loading, a rich query language with full text and graph query features, and very dense data storage. See the Notes section for details on each slide.
[PASS Summit 2016] Azure DocumentDB: A Deep Dive into Advanced Features (Andrew Liu)
Let's talk about how you can get the most out of Azure DocumentDB. In this session we will dive deep into the mechanics of DocumentDB and explain the various levers available to tune performance and scale. From partitioned collections to global databases to advanced indexing and query features - this session will equip you with the best practices and nuggets of information that will become invaluable tools in your toolbox for building blazingly fast large-scale applications.
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ... (Databricks)
Of all the developers’ delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs-RDDs, DataFrames, and Datasets-available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as best practices 2) outline its performance and optimization benefits; and 3) underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (this will be vocalization of the blog, along with the latest developments in Apache Spark 2.x Dataframe/Datasets and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
This document provides an agenda for a presentation on integrating Apache Cassandra and Apache Spark. The presentation will cover RDBMS vs NoSQL databases, an overview of Cassandra including data model and queries, and Spark including RDDs and running Spark on Cassandra data. Examples will be shown of performing joins between Cassandra and Spark DataFrames for both simple and complex queries.
This document discusses using PySpark with Cassandra for analytics. It provides background on Cassandra, Spark, and PySpark. Key features of PySpark Cassandra include scanning Cassandra tables into RDDs, writing RDDs to Cassandra, and joining RDDs with Cassandra tables. Examples demonstrate using operators like scan, project, filter, join, and save to perform tasks like processing time series data, media metadata processing, and earthquake monitoring. The document discusses getting started, compatibility, and provides code samples for common operations.
Structured streaming provides a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. It allows processing live data streams using continuous queries that look identical to batch queries. The presentation discusses Spark components including RDDs, DataFrames and Datasets. It then covers limitations of the traditional Spark Streaming model and how structured streaming addresses them by using incremental execution plans and exactly-once semantics. An example of a word count application and demo is presented to illustrate structured streaming concepts.
NoSQL - MongoDB. Agility, scalability, performance. I am going to talk about the basics of NoSQL and MongoDB. Why do some projects require RDBMSs and others NoSQL databases? What are the pros and cons of NoSQL vs. SQL? How is data stored and transferred in MongoDB? What query language is used? How does MongoDB support high availability and automatic failover through replication? What is sharding and how does it help scalability? Also covered: the newest levels of concurrency control, collection-level and document-level.
Erich Ess, CTO of SimpelRelevance, introduces the Spark distributed computing platform and explains how to integrate it with Cassandra. He demonstrates running a distributed analytic computation on a data set stored in Cassandra.
This document provides an overview of Weather.com's analytics architecture using Apache Cassandra and Spark. It summarizes Weather.com's initial attempts using Cassandra, lessons learned, and its improved architecture. The improved architecture uses Cassandra for streaming event data with time-window compaction, stores all other data in Amazon S3 for batch processing in Spark, and replaces Kafka with Amazon SQS for event ingestion. It discusses best practices for data modeling in Cassandra including partitioning, secondary indexes, and avoiding wide rows and nulls. The document also highlights how Weather.com uses Apache Zeppelin notebooks for data exploration and visualization.
1. Scalding is a library that provides a concise domain-specific language (DSL) for writing MapReduce jobs in Scala. It allows defining source and sink connectors, as well as data transformation operations like map, filter, groupBy, and join in a more readable way than raw MapReduce APIs.
2. Some use cases for Scalding include splitting or reusing data streams, handling exotic data sources like JDBC or HBase, performing joins, distributed caching, and building connected user profiles by bridging data from different sources.
3. For connecting user profiles, Scalding can be used to model the data as a graph with vertices for user interests and edges for bridging rules.
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map... (Victor Giannakouris)
This document proposes CSMR, a scalable algorithm for text clustering that uses cosine similarity and MapReduce. CSMR performs pairwise text similarity by representing text documents as vectors in a vector space model and measuring similarity in parallel using MapReduce. It is a 4-phase algorithm that includes word counting, text vectorization using term frequencies, applying TF-IDF to document vectors, and measuring cosine similarity. The algorithm is designed to cluster large text corpora in a scalable manner on distributed systems like Hadoop. Future work includes implementing and testing CSMR on real data and publishing results.
This document discusses MATLAB support for scientific data formats and analytics workflows. It provides an overview of MATLAB's capabilities for accessing, exploring, and preprocessing large scientific datasets. These include built-in support for HDF5, NetCDF, and other file formats. It also describes datastore objects that allow loading large datasets incrementally for analysis. The document concludes with an example that uses a FileDatastore to access and summarize HDF5 data from NASA ice sheet surveys in a MapReduce workflow.
Unified Big Data Processing with Apache Spark (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
Apache Spark is a cluster computing framework that allows for fast, easy, and general processing of large datasets. It extends the MapReduce model to support iterative algorithms and interactive queries. Spark uses Resilient Distributed Datasets (RDDs), which allow data to be distributed across a cluster and cached in memory for faster processing. RDDs support transformations like map, filter, and reduce and actions like count and collect. This functional programming approach allows Spark to efficiently handle iterative algorithms and interactive data analysis.
This is a introduction to PostgreSQL that provides a brief overview of PostgreSQL's architecture, features and ecosystem. It was delivered at NYLUG on Nov 24, 2014.
http://www.meetup.com/nylug-meetings/events/180533472/
This document provides an introduction and overview of Neo4j, a graph database. It discusses trends in big data, NoSQL databases, and different types of NoSQL databases like key-value stores, column family databases, and document databases. It then defines what a graph and graph database are, and introduces Neo4j as a native graph database that uses a property graph model. It outlines some of Neo4j's features and provides examples of how it can be used to represent social network, spatial, and interconnected data.
Application development with Oracle NoSQL Database 3.0Anuj Sahni
The document introduces table-based data modeling features for Oracle NoSQL Database. It discusses using tables to simplify application data modeling with familiar concepts like tables and data types. Examples show how to model user and email data using tables, including defining the schema using DDL, querying the data using DML, and indexing the tables. The document also provides an example of modeling user and email data from an email client application to illustrate how to approach data modeling.
Keeping Spark on Track: Productionizing Spark for ETL (Databricks)
ETL is the first phase when building a big data processing platform. Data is available from various sources and formats, and transforming the data into a compact binary format (Parquet, ORC, etc.) allows Apache Spark to process it in the most efficient manner. This talk will discuss common issues and best practices for speeding up your ETL workflows, handling dirty data, and debugging tips for identifying errors.
Speakers: Kyle Pistor & Miklos Christine
This talk was originally presented at Spark Summit East 2017.
This document discusses enabling multi-region Cassandra clusters that span heterogeneous data centers using Network Address Translation (NAT) and DNS-based Service Discovery (DNS-SD). It describes how NAT allows sharing a limited number of public IP addresses between private nodes by mapping private ports to public ports. DNS-SD is proposed to advertise the port mappings so nodes can discover each other, with SRV and TXT records storing port and cluster details. Minor modifications to Cassandra and drivers are suggested to lookup ports via DNS-SD during connection establishment.
Jump Start with Apache Spark 2.0 on Databricks (Databricks)
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Extending Cassandra with Doradus OLAP for High Performance Analytics
1. Extending Cassandra with
Doradus OLAP for
High Performance Analytics
Randy Guck
Principal Engineer,
Dell Software
30 Doradus: The Tarantula Nebula
Source: Hubble Space Telescope
Webcast sponsored by:
2. Agenda
• Doradus Overview
– What is it?
– Why Cassandra?
– Data model and DQL
• Architecture
– Storage services
• Doradus OLAP
– Motivation and sharding model
– Example Events application
– Example queries
3. Doradus In A Nutshell
• Storage and query service
• Uses NoSQL DB for persistence
• Pure Java
• Stateless
• Full text and graph query language
• Pluggable storage services
• Bundled with Dell UCCS Analytics for 3 years
• Open source, Apache License 2.0
[Diagram: Application → Doradus via the REST API; Doradus → Cassandra via Thrift or CQL; Cassandra persists the data and log files]
5. Cassandra: Best NoSQL DB?
• Choice for many big data apps
• Wide-column model
• CQL: Similar to SQL
• Pure peer architecture
• Cabinet- and Data Center-aware
• Elasticity, recovery features
• Consistent benchmark winner
6. So, Why Doradus?
• Cassandra limitations (by design):
– Minimal indexing
– No relationship support
– Limited queries
• e.g., SELECT <columns> FROM <table> WHERE KEY IN (<list>)
• No joins, embedded selects, OR clauses, ...
• No full text, range, or statistical queries
– API: requires client driver
• What if we could leverage Cassandra’s best features
while elevating the data model and query language?
7. Doradus Data Model:
Example Message Tracking Schema
[Diagram: message-tracking schema shown as a graph of tables connected by link pairs]
Tables: Message {Size, SendDate, ...}, Participant {ReceiptDate}, Address {Email}, Person {FirstName, LastName, Department, ...}, Attachment {Size, Extension}
Link pairs: Manager/Employees, Sender/MessageAsSender, Recipients/MessageAsRecipient, Address/Participants, Person/Address, Attachments/Message
Bi-directional relationships are formed by link pairs
8. Doradus Query Language:
Object Queries
• Goal:
Find and return objects
• Builds on Lucene syntax:
Full text queries: terms, phrases,
ranges, AND/OR/NOT, etc.
• Adds link paths:
Directed graph searches
Quantifiers and filters
Transitive searches
• Other features:
Stateless paging
Sorting
• Example DQL expressions:
// On Person:
LastName = Smith AND NOT (FirstName : Jo*) AND BirthDate = [1986 TO 1992]
// On Message:
ALL(Recipients).ANY(Address.WHERE(Email='*.gmail.com')).Person.Department : support
// On Person:
Employees^(4).Office='San Jose'
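To make this concrete, here is a minimal sketch (Python with the requests library) that issues the first DQL expression above as an object query over the REST API. It assumes the _query resource and q/f/s parameters described in the Doradus documentation and the localhost:1123 endpoint used later in this deck; the application name "MsgTracker" is hypothetical:

import requests

# Hedged sketch: run a DQL object query against a local Doradus server.
# "MsgTracker" is a made-up application name for the slide-7 schema.
BASE = "http://localhost:1123"
params = {
    "q": "LastName = Smith AND NOT (FirstName : Jo*) AND BirthDate = [1986 TO 1992]",
    "f": "FirstName,LastName,Department",  # scalar fields to return
    "s": 50,                               # page size (stateless paging)
}
resp = requests.get(BASE + "/MsgTracker/Person/_query", params=params)
resp.raise_for_status()
print(resp.text)  # XML by default; send Accept: application/json for JSON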
10. Doradus Service Architecture
[Diagram: Doradus server architecture. A DoradusServer process, configured via doradus.yaml, exposes a REST API and JMX and hosts pluggable modules: service components (REST, Schema, TaskManager, MBean), storage services (Spider, OLAP, Logging), and DB services (Thrift, CQL, DynamoDB), which connect to Cassandra or AWS DynamoDB. The figure also labels a Tenant element, reflecting multi-tenant support.]
11. Storage Service Comparison
Feature               Spider           OLAP                Logging
Data variability      Unstructured     Semi-structured     Unstructured
Data mutability       High             Medium              None
Update granularity    Fine             Batch               Batch
Load performance      Medium           High                Very high
Space usage           High             Low                 Low
Query focus           Object queries   Aggregate queries   Time-based aggregate queries
Data aging?           yes              yes                 yes
Supports links?       yes              yes                 no
Dynamic fields?       yes              no                  yes
Event load rate/node  1.3K/second      124K/second         444K/second
115M-event DB size    100GB            1.89GB              1.49GB
12. Doradus OLAP: Motivation
• Combines ideas from:
– Online Analytical Processing: data arranged in static cubes
– Columnar databases: column-oriented storage and compression
– NoSQL databases: Sharding
• Features:
– Fast loading: up to 1M objects/second/node
– Dense storage: 1 billion objects in 2 GB
– Fast cube merging: typically seconds
– No indexes
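As an intuition for the columnar-plus-compression point above, here is an illustrative Python sketch of run-length encoding a single low-cardinality column; this is not Doradus's actual codec, just a demonstration of why storing each field's values together compresses so well:

# Illustrative only: Doradus OLAP's real encodings differ, but the effect
# is similar because all values of one field are stored contiguously.
def run_length_encode(column):
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

# A 1,000,000-cell "Type" column with two distinct values collapses to 2 runs.
column = ["Success Audit"] * 900_000 + ["Failure Audit"] * 100_000
print(run_length_encode(column))
# [['Success Audit', 900000], ['Failure Audit', 100000]]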
17. OLAP Shard Merging
[Diagram: two update batches (Batch #1 and Batch #2), each containing Message (ID, Size, SendDate, ...), Person (ID, FirstName, LastName, ...), and Address (ID, Person, Messages, ...) tables; both target shard 2014-03-01 and are merged into the OLAP store]
OLAP store layout (one physical row per application/table/shard/field, with compressed column data):
Key                                Columns
Email/Message/2014-03-01/ID        [compressed data]
Email/Message/2014-03-01/Size      [compressed data]
Email/Message/2014-03-01/SendDate  [compressed data]
...
Email/Person/2014-03-01/ID         [compressed data]
Email/Person/2014-03-01/FirstName  [compressed data]
Email/Person/2014-03-01/LastName   [compressed data]
...
Email/Address/2014-03-01/ID        [compressed data]
Email/Address/2014-03-01/Person    [compressed data]
Email/Address/2014-03-01/Message   [compressed data]
...
Email/Message/2014-02-28/ID        [compressed data]
...
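A minimal sketch of the merge bookkeeping implied by this slide, assuming each batch delivers one ID-to-value segment per field and later batches supersede earlier ones; it models the idea only, not Doradus's physical row format or compression:

# Hedged sketch: merge two batch segments of one field ("Size") for one shard.
def merge_segments(batch1, batch2):
    """Each segment maps object ID -> value; batch #2 wins on conflicts."""
    merged = dict(batch1)
    merged.update(batch2)
    return dict(sorted(merged.items()))

size_batch1 = {"m001": 1200, "m002": 5300}
size_batch2 = {"m002": 5400, "m003": 800}   # m002 re-sent with a new Size
print(merge_segments(size_batch1, size_batch2))
# {'m001': 1200, 'm002': 5400, 'm003': 800}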
18. Does Merging Take Long?
[Chart: Merge Time vs. Shard Size. Shard size (total objects, 0 to ~45,000,000) plotted against merge time (0 to 90 seconds).]
19. OLAP Query Execution
• Example query:
– Count messages with Size between 1000-10000 and
HasBeenSent=false in shards 2014-03-01 to 2014-03-31
• How many rows are read?
– 2 fields x 31 shards = 62 rows
– Each row typically represents millions of objects
• Value arrays are scanned in memory
• Physical rows are read on “cold” start only
– Multiple caching levels create “hot” and “warm” data
pools
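To illustrate that execution model, here is a hedged Python sketch of the example query: per shard, each referenced field is one in-memory value array, so the scan touches 2 fields x 31 shards = 62 rows no matter how many objects they hold (a model of the idea, not Doradus code):

def count_matches(shards):
    total = 0
    for shard in shards:                      # e.g. 2014-03-01 .. 2014-03-31
        sizes = shard["Size"]                 # one value per object
        sent = shard["HasBeenSent"]
        total += sum(1 for size, was_sent in zip(sizes, sent)
                     if 1000 <= size <= 10000 and not was_sent)
    return total

shards = [{"Size": [500, 2000, 9999], "HasBeenSent": [False, False, True]}]
print(count_matches(shards))  # 1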
20. OLAP Example: Security Events
Sample event in CSV format:
MAILSERVER18,Security,"Sun, 22 Jan 2013 08:09:50 UTC","Success Audit",Security,
"Logon/Logoff",540,"NT AUTHORITY",SYSTEM,S-1-5-18,7,MAILSERVER18$,,Workstation,
"(0x0,0x142999A)",3,Kerberos,Kerberos
Fixed fields:
Computer Name   MAILSERVER18
Log Name        Security
Time Stamp      Sun, 22 Jan 2013 08:09:50 UTC
Type            Success Audit
Source          Security
Category        Logon/Logoff
Event ID        540
User Domain     NT AUTHORITY
User Name       SYSTEM
User SID        S-1-5-18
Variable fields (insertion strings):
1  MAILSERVER18$
2  (empty)
3  Workstation
4  (0x0,0x142999A)
5  3
6  Kerberos
7  Kerberos
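As a concrete companion to the layout above, a small Python sketch that splits the sample CSV event into its fixed fields and variable insertion strings; the field names are ours for illustration, not a Doradus API:

import csv
import io

# The 11th value ("7") is the count of insertion strings that follow it.
FIXED = ["ComputerName", "LogName", "TimeStamp", "Type", "Source",
         "Category", "EventID", "UserDomain", "UserName", "UserSID"]

line = ('MAILSERVER18,Security,"Sun, 22 Jan 2013 08:09:50 UTC","Success Audit",'
        'Security,"Logon/Logoff",540,"NT AUTHORITY",SYSTEM,S-1-5-18,7,'
        'MAILSERVER18$,,Workstation,"(0x0,0x142999A)",3,Kerberos,Kerberos')

row = next(csv.reader(io.StringIO(line)))
event = dict(zip(FIXED, row[:10]))
count = int(row[10])
event["Params"] = row[11:11 + count]   # the variable fields, in order
print(event["EventID"], event["Params"])
# 540 ['MAILSERVER18$', '', 'Workstation', '(0x0,0x142999A)', '3', 'Kerberos', 'Kerberos']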
22. OLAP Example: Events Loading
• Configuration:
– Single Dell PowerEdge™ server with dual Intel® Xeon® CPUs
– Cassandra: Xmx=2G, Doradus: Xmx=8G
• Load stats (15 threads):
Total shards: 860
Total events: 114,572,247
Total ins strings: 879,529,753
Total objects: 994,102,000
Total load time: 15 minutes, 27 seconds
Average load rate: 1.1M objects/second; 124K events/second
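(Sanity-checking the arithmetic: 15 minutes 27 seconds is 927 seconds, so 994,102,000 objects / 927 s ≈ 1.07M objects/second, rounded to 1.1M above, and 114,572,247 events / 927 s ≈ 124K events/second.)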
• Space usage:
nodetool -h localhost status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Token Rack
UN 127.0.0.1 1.89 GB 100.0% 9fe241b5-20f7-4afb-af92-207ef24b8095 -9176223118562734495 rack1
23. OLAP Example: Events Queries
• Count all Events in all shards
• REST command:
http://localhost:1123/OLAPEvents/Events/_aggregate?m=COUNT(*)&range=0
• Example response:
<results>
<aggregate metric="COUNT(*)" query="*"/>
<totalobjects>114572247</totalobjects>
<value>114572247</value>
</results>
[Annotations on the REST command above: "OLAPEvents" = application name, "Events" = table name, "_aggregate" = aggregate query resource, "m=COUNT(*)" = metric function, "range=0" = all shards]
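The same command issued from Python, as a minimal sketch that reuses the URL and response shape shown on this slide (requests plus the standard-library XML parser; error handling kept minimal):

import requests
import xml.etree.ElementTree as ET

url = "http://localhost:1123/OLAPEvents/Events/_aggregate"
resp = requests.get(url, params={"m": "COUNT(*)", "range": "0"})  # range=0: all shards
resp.raise_for_status()
root = ET.fromstring(resp.text)
print(root.findtext("totalobjects"), root.findtext("value"))  # 114572247 114572247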
24. OLAP Example: Events Queries
• Find the top 5 hours-of-the-day when certain
privileged events fail:
Event IDs are any of 577, 681, 529
Event type is ‘Failure Audit’
Insertion string 8 is (0x0,0x3E7)
Event occurred in first half of 2005 (181 shards)
GET /OLAPEvents/Events/_aggregate?m=COUNT(*)
&range=2005-01-01,2005-06-30
&q=Type='Failure Audit' AND EventID IN (577,681,529) AND
Params.WHERE(Index=8).Value='(0x0,0x3E7)'
&f=TOP(5,Timestamp.HOUR) AS HourOfDay
25. “Privileged Event Failures”:
Query Results
<results>
<aggregate metric="COUNT(*)" query="Type='Failure Audit' EventID IN (577,681,529) AND
Params.Index=8 AND Params.Value='(0x0,0x3E7)'" group="TOP(5,Timestamp.HOUR) AS HourOfDay"/>
<totalobjects>591</totalobjects>
<summary>591</summary>
<totalgroups>17</totalgroups>
<groups>
<group>
<metric>119</metric>
<field name="HourOfDay">8</field>
</group>
<group>
<metric>87</metric>
<field name="HourOfDay">18</field>
</group>
<group>
<metric>72</metric>
<field name="HourOfDay">19</field>
</group>
<group>
<metric>69</metric>
<field name="HourOfDay">9</field>
</group>
<group>
<metric>66</metric>
<field name="HourOfDay">5</field>
</group>
</groups>
</results>
Most failures
occur at 08:00
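A small sketch of walking the grouped XML above to recover the (hour, count) pairs; the element names match the response shown, trimmed here to two groups for brevity:

import xml.etree.ElementTree as ET

xml_text = """<results><groups>
  <group><metric>119</metric><field name="HourOfDay">8</field></group>
  <group><metric>87</metric><field name="HourOfDay">18</field></group>
</groups></results>"""   # trimmed copy of the response above

for group in ET.fromstring(xml_text).iter("group"):
    hour = group.find("field").text
    count = group.findtext("metric")
    print(f"hour {hour}: {count} failures")
# hour 8: 119 failures
# hour 18: 87 failures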
26. Doradus OLAP Summary
• Advantages:
– Simple REST API
– All fields are searchable without indexes
– Ad-hoc statistical searches
– Support for graph-based queries
– Near real time data warehousing
– Dense storage = less hardware
– Horizontally scalable when needed
• Good for applications where data:
– Is continuous/streaming
– Is structured to semi-structured
– Can be loaded in batches
– Is partitionable, typically by time
– Is typically queried in a subset of shards
– Emphasizes statistical queries
27. Thank You!
• Where to find Doradus
– Source and docs: github.com/dell-oss/Doradus
– Downloads: search.maven.org
• Contact me
– Randy.Guck@software.dell.com
– @randyguck
• Thanks to our sponsors!
30 Doradus R136 Cluster
Source: Hubble Space Telescope
PowerEdge is a trademark of Dell Inc.
Intel and Xeon are trademarks of Intel Corporation
in the U.S. and/or other countries.
Editor's Notes
This presentation describes Doradus OLAP, which is a query and storage service that extends the Cassandra NoSQL database. The name was derived from 30 Doradus, also known as The Tarantula Nebula, which is a region in the Large Magellanic Cloud. It is a very active and luminous region, serving as a birthplace of new stars and star clusters.
Here are the topics covered in this presentation. First we’ll describe Doradus at a broad level including its main features, how it leverages Cassandra, and its data model and query language. We’ll then touch on the internal architecture and the concept of storage services. Finally, we’ll take a close look at the Doradus OLAP service and how it provides near real time data warehousing. We’ll use an example Events application to demonstrate load and storage characteristics and sample queries.
Here is a quick overview of Doradus. From a high-level viewpoint, it is a storage and query service that leverages a NoSQL database for persistence. Doradus primarily uses Cassandra, but integrations with other persistence stores have also been demonstrated.
Doradus provides a high-level data model and a query language that supports full text and graph searching features. Applications communicate with Doradus via a REST API that supports JSON and XML messages.
Doradus is stateless, so multiple instances can be run against the same Cassandra cluster using either the Thrift or CQL API. Like Cassandra, Doradus is pure Java and runs on Windows and Linux. The architecture diagram shows a minimal implementation consisting of a user application, Doradus instance, and Cassandra instance, all of which can run on a single machine.
The Doradus architecture supports pluggable storage services, which organize data for specific application scenarios with different space/performance tradeoffs. One of those services, Doradus OLAP, is the primary focus of this presentation. Because of its dense storage techniques, Doradus OLAP allows many big data applications to run on a single node. However, the Cassandra NoSQL underpinning allows horizontal expansion when needed.
Doradus has been bundled with the Dell UCCS Analytics product for about 3 years now and is available as open source under the Apache License 2.0. This means the source code is free for anyone to download, use, and even modify.
Both Doradus and Cassandra can be scaled horizontally.
When increased scalability or failover are needed, a multi-node Cassandra cluster can be used. In this example, Cassandra is deployed in a 3-node cluster. Technically, only one Doradus instance is needed: it will rotate requests through all Cassandra nodes and ignore failed nodes automatically. Because Doradus is stateless, you can run additional instances for increased capacity or failover. For simplicity, one approach is to run a Doradus instance on each Cassandra node so that every node is configured identically.
From Cassandra’s viewpoint, Doradus is an application. Doradus adds functionality from the “outside” and does not modify core Cassandra code. Doradus has been tested with both Cassandra 2.0 and 2.1 releases.
There are many good NoSQL DBs available today, so why did we choose Cassandra? Here are some of the reasons:
Cassandra is the choice of many companies for serious big data applications, including Netflix, EBay, and Apple. At last count, the Netflix cluster uses over 2,800 nodes. Apple reportedly has deployed over 75,000 Cassandra nodes that manage over 10PB of data.
Cassandra’s wide-column or tabular data model is extremely flexible. It can be used for simple key/value and document-oriented applications, but it supports up to 2 billion columns per row, and a column value can be up to 2GB in length.
The primary query language for Cassandra is CQL, which borrows from SQL for both DDL (schema) and DML (query) operations. This makes it easier to transition from relational databases.
Cassandra is pure Java and uses a “pure peer” architecture: there is only one process type, and requests can be sent to any node. There are no master/secondary processes, which makes Cassandra easy to deploy and extend.
Cassandra has arguably best-in-class features for horizontal scalability. It is cabinet- and data center-aware, allowing it to choose replication strategies that maximize availability while balancing network usage.
When it comes to managing a distributed cluster, again Cassandra has best-in-class features. New nodes can be added dynamically while the cluster rebalances in the background. Lost nodes can be recovered from replicated data, and Cassandra handles complex problems such as “split brain” failures.
Cassandra is a consistent winner of NoSQL benchmarks.
Most NoSQL DBs are easy to install and get started with. But as data and network complexity grow, Cassandra is arguably the best NoSQL database. This is why many companies choose it and why we use it with Doradus.
If Cassandra is so popular, why do we need Doradus? Many applications find that Cassandra provides exactly what they need. But some applications need more than Cassandra’s basic features and must add functionality in application code. By design, Cassandra supports limited queries and indexing to encourage applications to follow its partition-based access pattern. Here are some examples of when additional functionality is needed:
Indexing: Cassandra supports minimal indexing. Secondary indexes are hash structures that only support equality searches. Furthermore, they are only recommended for “low cardinality” scalar fields, and some experts don’t recommend them at all.
Data model: Cassandra supports single-valued and limited multi-valued scalars. It does not directly support any means for relating data objects.
Queries: By design, CQL does not support joins, subqueries, or aggregation. The most complex select query it supports uses an IN clause. There is no support for full text queries, general inequalities, or statistical queries.
API: Cassandra must be accessed via its Thrift or CQL APIs, and both require a client-side driver. Some applications find this limiting and would prefer to use a REST API.
What if we could leverage Cassandra’s best features—scalability, elasticity, recoverability—while elevating the data model and query language? This is the premise on which Doradus was started.
Let’s look how Doradus extends Cassandra, starting with the data model and an example Message Tracking application. This schema uses 5 tables to store information about email message traffic. The objects within each table hold scalar field values such as Size and SendDate, but they can also define link fields, which form bi-directional relationships. Every link field has an inverse that defines the same relationship from the opposite direction. For example, the Message table defines a link called Recipients, which points to the Participant table. The inverse link is called MessageAsRecipient, which points back to the Message table. Another pair of links, Sender and MessageAsSender, form another relationship between the same tables. Doradus maintains inverse links automatically: if Message.Sender is updated, the inverse Participant.MessageAsSender link is automatically added or deleted as needed.
The Person table defines a link called Manager, which points to Person with Employees as the inverse link. The Manager/Employees relationship forms an org chart graph that can be searched “up” or “down”. A link whose extent—the table it points to—is the same as its owner is called a reflexive link. Reflexive links can be searched recursively via transitive functions. Doradus also allows self-reflexive links that are their own inverse. An example self-reflexive relationship is Friends (though friendship is not always reciprocal!)
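For illustration, here is a rough Python sketch of how such a schema might be defined over the REST API. The /_applications endpoint, the JSON shape, and the MsgApp application name are assumptions for illustration only; consult the Doradus documentation for the actual schema format.

    import requests

    # Hypothetical schema fragment: two tables related by inverse link pairs.
    # The JSON shape and endpoint below are illustrative assumptions, not
    # verbatim from the Doradus documentation.
    schema = {
        "MsgApp": {
            "tables": {
                "Message": {
                    "fields": {
                        "Size": {"type": "INTEGER"},
                        "SendDate": {"type": "TIMESTAMP"},
                        # Every link declares its inverse; Doradus maintains
                        # both directions automatically.
                        "Sender": {"type": "LINK", "table": "Participant",
                                   "inverse": "MessageAsSender"},
                        "Recipients": {"type": "LINK", "table": "Participant",
                                       "inverse": "MessageAsRecipient"}
                    }
                },
                "Participant": {
                    "fields": {
                        "MessageAsSender": {"type": "LINK", "table": "Message",
                                            "inverse": "Sender"},
                        "MessageAsRecipient": {"type": "LINK", "table": "Message",
                                               "inverse": "Recipients"}
                    }
                }
            }
        }
    }

    requests.post("http://localhost:1123/_applications", json=schema)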
We’ll see how links are used in the query language next.
The Doradus Query Language (DQL) can be used in two contexts: An object query selects and returns objects from a specific table, called the perspective. Object queries can return fields from both selected objects and objects linked to them.
DQL is based on and extends the Lucene full text language. This means that DQL supports clause types such as terms, phrases, wildcards, ranges, equalities and inequalities, AND/OR/NOT and parentheses, and more. To these full text concepts DQL adds the notion of link paths to traverse relationship graphs. Link paths can use functions such as quantifiers and filters to narrow the selection of objects. Reflexive relationships can use a transitive function to search a graph recursively. DQL queries also support stateless paging and results sorting.
The first example on the right is a simple full text query, fetching objects from the Person table. It consists of three clauses that select objects (1) whose LastName is Smith, (2) whose FirstName does not contain a term that begins with “Jo”, and (3) whose BirthDate falls between the years 1986 and 1992 (inclusively). This example uses several features including an equality clause, a term clause, a wildcard term, a range clause, and clause negation.
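As a rough sketch, that first query could be issued from Python as follows. The _query resource name and the exact DQL spelling are reconstructions from the description above, not verbatim from the slide:

    import requests

    # Reconstructed object query: LastName is Smith, FirstName contains no
    # term starting with "Jo", and BirthDate falls in 1986..1992 inclusive.
    resp = requests.get(
        "http://localhost:1123/MsgApp/Person/_query",   # MsgApp is a placeholder
        params={"q": "LastName=Smith AND NOT FirstName:Jo* "
                     "AND BirthDate=[1986 TO 1992]"})
    print(resp.text)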
The second example queries the Message and demonstrates how link fields are used. Link paths are one of the features that makes DQL easy to use. Though deceptively simple, each “.” in a link path such as X.Y.Z essentially performs a join. This example shows how link paths are explicitly quantified. When a link path is enclosed in the quantifier ANY, ALL, or NONE, a specific number of the objects in the quantified link path must match the clause’s condition. In this example, ALL(Recipients) means that every Recipients value must meet the remainder of the clause. The Address link is quantified with ANY, and it is also filtered with a WHERE clause. This means that only Address objects whose Email ends with “gmail.com” are selected, and at least one of them (ANY) must meet the remainder of the clause. Without the quantifiers and filter, the link path is Recipients.Address.Person.Department, and it uses a contains clause searching for the term support. This means the Department field must contain the term “support” (case-insensitive). Textually, this query selects objects where all Recipients have at least one Address whose Email ends with “gmail.com” and whose linked Person’s department is in support.
The last example demonstrates transitive searches using the Person table. Because Employees is a reflexive link, we can search it recursively through the graph, in this case “down” the org chart. The carat (^) is the transitive function and causes the preceding link, Employees, to be searched recursively. The optional value in parentheses after the carat limits the recursion to a maximum number of levels. In this case, ^(4) says “recurse the link to no more than 4 levels”. Without the limit parameter, the transitive function searches the graph until a cycle or a “leaf” object is found.
The other context in which DQL can be used is aggregate queries, which select objects from a perspective table and performs statistical computations across those objects. This slide shows several example aggregate queries. These examples demonstrate the metric expression, grouping expression, and query expression components of aggregate queries. All examples use the Message table as the perspective.
The first example performs three computations across selected objects: (1) a COUNT of all objects, (2) the AVERAGE value of the Size field, and (3) the smallest (MIN) Birthdate found among objects linked via Recipients.Address.Person. All three statistical computations are made in a single pass through the data. Since this example contains no grouping expression, a single value is returned for each of these computations.
The second example demonstrates multi-level grouping, which divides objects into groups and performs the metric computations across each subgroup. In this example, a single metric function is computed: the unique (DISTINCT) Attachments.Extension values within each group. The grouping expression groups objects first by their Tags field and secondarily by the values found in the link path Recipients.Address.Person.Department. Because this example has a query expression, only those objects matching the selection expression are included in the aggregate computation.
The third example computes AVERAGE Size for all objects, grouped by a single-level grouping expression Recipients.Address.Email. The grouping expression is enclosed in a TOP(10) function, which means that only the groups with the top 10 metric values (average Size) are returned.
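The third example maps directly onto the REST parameters used later in this presentation (m for the metric, f for the grouping expression). A minimal Python sketch, assuming a placeholder application named MsgApp:

    import requests

    # AVERAGE(Size) across all objects, grouped by recipient email, returning
    # only the 10 groups with the highest average Size.
    resp = requests.get(
        "http://localhost:1123/MsgApp/Message/_aggregate",
        params={"m": "AVERAGE(Size)",
                "f": "TOP(10,Recipients.Address.Email)"})
    print(resp.text)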
Doradus supports a wide range of grouping functions to create aggregate query groups. Some of the more popular functions are summarized below:
BATCH: creates groups from explicit value ranges
TOP/BOTTOM: returns a limited number of groups based on their highest/lowest metric values
FIRST/LAST: returns a limited number of groups based on their highest/lowest alphabetical group names
INCLUDE/EXCLUDE: includes or excludes specific group values within a grouping level
UPPER/LOWER: Creates groups with case-insensitive text values
SETS: creates groups from arbitrary query expressions
TERMS: creates one group per term within a text field
TRUNCATE: creates groups from a timestamp value rounded down to a specific date/time granularity
See the Doradus documentation for a full description of all aggregate query grouping functions.
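For a feel of how these functions look in practice, here are a few grouping expressions as they might be passed via the f parameter. Only the TOP spelling is taken from this presentation’s own examples; the BATCH and TRUNCATE spellings are hypothetical, so check the documentation for exact syntax:

    # Grouping-expression sketches for the f parameter of an aggregate query.
    grouping_examples = [
        "TOP(5,Timestamp.HOUR) AS HourOfDay",  # from this presentation: top 5 groups
        "BATCH(Size,0,1000,10000)",            # hypothetical: explicit value ranges
        "TRUNCATE(Timestamp,HOUR)",            # hypothetical: round timestamps down
    ]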
Here we look at the Doradus internal architecture. The major internal building blocks are called services. Some services such as the Schema service are required, whereas some services such as Task Manager and REST services can be disabled when not needed. This allows “skinny” execution for embedded scenarios such as bulk load applications. Services are controlled via parameters specified in the doradus.yaml configuration file. All parameters can also be controlled at runtime via command-line arguments. The core DoradusServer component handles the startup and shutdown of all services.
Two service types are abstract, allowing configurable, concrete services to be dynamically selected. These pluggable service types are:
Storage: Storage services are responsible for mapping update and query requests to low-level rows and columns. That is, storage services organize data using techniques that are optimal for specific application scenarios. Three primary storage services are currently available: Spider, OLAP, and Logging, which are discussed in the next slide. But we are always experimenting with new services. Multiple storage services can be used at the same time in a single Doradus instance.
DB: The DB service maps row and column read/write requests to a specific underlying persistence API. For access to Cassandra, both Thrift and CQL services are available: Thrift is older but more performant; CQL has more advanced failover features. Very recently, a DynamoDB implementation was added to allow data to be persisted using Amazon Web Services (AWS) DynamoDB. Experimentally, we have also created DB service implementations that store data in flat files, SQL databases, and memory-only storage. In a given Doradus instance, only a single DB service type can be used.
Below is a summary of the other services currently used by Doradus:
REST: This service processes REST API requests using the Java servlet API. A Jetty server is bundled with Doradus and is the default web server, but Doradus can be hosted in another servlet-based web server such as Tomcat. The REST service is optional and can be disabled for embedded applications.
Schema: This service is responsible for processing schema requests such as initializing new schemas (called applications) and deleting existing ones. This service is required.
Task Manager: This service manages background tasks such as data aging and automatic merging (for OLAP). In a multi-instance cluster, tasks are distributed among all Doradus instances. The Task Manager can be disabled for embedded applications.
MBean: This service collects statistics on REST command usage and other metrics and makes them available via JMX. This service can be disabled when these metrics are not needed.
Tenant: This service provides multi-tenant features. It is a required service.
This slide highlights the feature differences between the Doradus storage services.
Doradus Spider is the first storage service we developed. It is still the most flexible service because it supports highly mutable data with fine-grained updates. Spider stores each object in a separate row and adds customizable indexing for each field. Its focus is object queries that require full text search features. Although Spider is the most flexible service, it is intended for moderate-sized databases (millions of objects) and moderate load rates (thousands to tens of thousands of objects/second).
Doradus OLAP is our workhorse and most mature storage service. It uses columnar storage and other techniques to compact data from many objects into a single row. OLAP has more restrictions than Spider: for example, all data fields must be pre-defined and data must be loaded in batches. However, OLAP yields very fast loading, high performance analytic queries, and dense storage usage.
Doradus Logging is our newest storage service intended for time-series log data. The Logging service has more restrictions than OLAP: for example, data cannot be modified once added, and link fields are not supported. However, the Logging service yields even faster loading than OLAP, slightly denser storage, and fast time-based analytic queries.
This slide compares the load and storage characteristics of the three storage services using a common data set consisting of 115 million events:
Spider takes 24 hours to load this data and requires over 100GB of storage.
OLAP loads the same data in only 15 minutes and creates a database that uses less than 2GB of space.
Logging loads the same data in only 5 minutes and uses even less space.
Now let’s look at Doradus OLAP more closely, starting with what motivated it. After we created Doradus Spider, our primary client application changed focus from full text object queries to statistical queries. The variability of search criteria meant that traditional indexing would not be sufficient. We needed to scan millions of objects per second, which was not possible using traditional storage techniques. Some out of the box thinking resulted in Doradus OLAP, which borrows from several disciplines:
Online Analytical Processing: OLAP is widely used in data warehouses, where data is transformed and stored in n-dimensional cubes.
Columnar databases: These databases store column values instead of row values in the same physical record.
NoSQL databases: NoSQL DBs commonly use sharding to co-locate data, especially time-series data.
Doradus OLAP combines ideas from these areas to provide a new storage service that is suitable for certain kinds of applications. Compared to the Spider storage manager, OLAP imposes some restrictions. But for applications that can use it, OLAP offers many advantages including:
Fast data loading: We’ve observed loads up to 1M objects/second on a single node in bulk-load scenarios.
Dense storage: Data is stored in a way that makes it highly compressible.
Fast merging: One of the typical drawbacks of data warehouses is the time required to update the database with new data. With Doradus OLAP, data is stored in cubes that can be updated quickly, often in a few seconds.
To accomplish this, Doradus OLAP uses no indexes. Note that OLAP is not a memory database: data can be orders of magnitude larger than available memory.
Let’s look at how data loading works with Doradus OLAP. The data loading sequence also highlights the criteria that applications must meet to effectively load data.
As with most database applications, there may be multiple data sources each with their own “velocity”. That means that some data may be generated quickly, perhaps continuously, whereas other data may change infrequently. Events, for example, may be collected in a continuous stream, whereas information about People may be collected from a directory server via a daily snapshot. It is up to the application to decide how often data is collected and loaded.
Doradus OLAP requires data to be loaded in batches. A single batch can mix new, modified, and deleted objects, and a batch can contain partial updates such as adding or modifying a single field. Each batch can update data in multiple tables.
The ideal batch size depends on many factors including how many fields are updated, but tests show that good batch sizes are at least a few hundred objects up to many thousands of objects.
As each batch is loaded, it identifies the shard to which it belongs. A shard is a partition of the database and can be visualized as a cube. A shard is typically a snapshot of the database for a specific time period. The most common shard granularity is “one day”, though finer and coarser shard granularities will be ideal for some applications. Each shard has a name: the name is not meaningful to Doradus, but the shard names are considered ordered. So, if you want to query a specific range of day shards, say March 1st through March 31st, a good idea is to name shards using the format YYYY-MM-DD. Then you can query for the shard range 2015-03-01 to 2015-03-31 and the appropriate shards are selected.
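A minimal Python sketch of this batching scheme, assuming one-day shards named YYYY-MM-DD. The batch-posting endpoint is a hypothetical placeholder; the real REST command for loading OLAP batches may differ:

    from collections import defaultdict
    from datetime import datetime
    import requests

    def shard_name(ts: datetime) -> str:
        # YYYY-MM-DD names sort chronologically, so a query can select the
        # shard range 2015-03-01 through 2015-03-31 by name alone.
        return ts.strftime("%Y-%m-%d")

    events = [
        {"EventID": 540, "Timestamp": datetime(2015, 3, 1, 8, 30)},
        {"EventID": 529, "Timestamp": datetime(2015, 3, 2, 9, 15)},
    ]

    # Accumulate events into one batch per one-day shard.
    batches = defaultdict(list)
    for event in events:
        batches[shard_name(event["Timestamp"])].append(
            {**event, "Timestamp": event["Timestamp"].isoformat()})

    for shard, batch in batches.items():
        requests.post(  # hypothetical endpoint for posting a batch to a shard
            f"http://localhost:1123/OLAPEvents/Events?shard={shard}",
            json={"batch": batch})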
When a batch is loaded, it is not immediately visible in the shard, which means its data is not returned in queries. Periodically, the shard must be merged, which causes its pending batches to be applied. You can think of a shard’s queryable data as the “live cube”: merging applies updates from batches, creating a new live cube. After merging, update batches are deleted. Merging can be requested manually or performed automatically in the background based on a schedule.
The frequency with which shards are merged affects the latency of the data. If shards are merged once per hour, queries will reflect data that is up to 1 hour old. Merging more often yields fresher data but consumes more resources.
As each shard is updated and merged, a “cube” is added to the OLAP store, which is a Cassandra ColumnFamily. A typical OLAP application is expected to have 100’s to 1000’s of shards. Inside, a shard consists of arrays that are designed for fast merging and fast loading during queries. This is a critical component of Doradus OLAP, so let’s take a closer look.
When batches are loaded, data is extracted and stored in minimally-sized arrays. For example, if an integer field is found to have a maximum value of 50,000, then a 2-byte array is used. Boolean fields use a bit array. Text values are de-duped and stored case-insensitively. Each value array is sorted in object ID order and then stored as a compressed row in Cassandra. Because each array contains homogeneous data (e.g., all integers), it compresses extremely well, often 90% or better. Rows that are too small to warrant compression are stored uncompressed. Multiple, small rows are joined into single large rows so that all values can be loaded with a single read. The idea is to use as little physical disk space as possible to allow a large number of values to be loaded at once.
When a shard is merged, all updates are combined with the shard’s current live data to create a new live data set. Since batch loading generates sorted arrays, the merge process consists of heap merging all arrays of the same table and field into a new array. Heap merging is very fast, so merging completes quickly.
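A toy Python sketch of the idea, using (object ID, value) pairs and the standard library’s k-way heap merge; the last-writer-wins rule for duplicate object IDs is an assumption for illustration:

    import heapq
    from operator import itemgetter

    # The live array plus two pending batches, each already sorted by object ID.
    live    = [(1, "a"), (4, "d"), (9, "x")]
    batch_1 = [(2, "b"), (9, "y")]   # updates object 9
    batch_2 = [(7, "q")]

    # heapq.merge streams all inputs in object-ID order without re-sorting.
    # With a key function the merge is stable, so for a duplicate ID the live
    # value arrives first and a later batch's value overwrites it.
    merged = {}
    for obj_id, value in heapq.merge(live, batch_1, batch_2, key=itemgetter(0)):
        merged[obj_id] = value
    new_live = sorted(merged.items())
    print(new_live)   # [(1, 'a'), (2, 'b'), (4, 'd'), (7, 'q'), (9, 'y')]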
In traditional data warehouse technology, the extract-transform-load (ETL) process is typically very lengthy, sometimes many hours. Since a Doradus OLAP shard is intended to hold millions of objects, you might therefore assume that the merge process takes a long time. However, the merge process is designed to be fast, often only a few seconds.
This graph shows the merge time required for an event tracking application (described later). In this load, 860 shards of varying sizes were loaded. For each shard containing 100,000 objects or more, the shard’s object count is plotted against its merge time. As the graph shows, merge time is directly proportional to the number of objects in the shard. For this application, the longest merge took just over 80 seconds, for a shard containing over 37 million objects. (Keep in mind that merge time is also affected by the number of tables and fields being merged, so your mileage may vary.)
Quick merge time means that Doradus OLAP allows data to be added and merged fairly often, allowing queries to access data that is close to real time.
Since Doradus OLAP uses no indexes, how are queries efficient? This slide shows how OLAP arrays are accessed during queries. The query searches the Message table and counts objects whose Size field falls between 1,000 and 10,000 and whose HasBeenSent flag is false. Furthermore, the query searches shards whose name falls between 2014-03-01 and 2014-03-31. Assuming we used 1-day shards, this requires searching 31 shards. Since the query accesses two fields, OLAP must read 62 rows: the value array for each field in each shard. This might sound like a lot of reads for a single query, but remember that shards typically contain a large set of objects. If our shards averaged 1 million objects each, this query scans 31 million objects with only 62 reads!
As each array is read, it is decompressed and scanned in memory. Value arrays are designed for fast scanning: modern processors can typically scan 10’s of millions of values per second. When a value array is scanned, an “object bit array” is generated to reflect the selected objects. Each additional value array scan turns on or off bits; the final bit array represents the results of the query.
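A toy Python version of this scan for a single shard, using the Size and HasBeenSent fields from the example above; in Doradus the arrays are packed and the operations are far more optimized, but the logic is the same in spirit:

    # One shard's value arrays: one entry per object, in object ID order.
    sizes         = [500, 2500, 9000, 120000]
    has_been_sent = [True, False, False, True]

    # Scan each array into an object bit array, then combine per-clause bits.
    size_bits = [1000 <= s <= 10000 for s in sizes]   # Size in [1000, 10000]
    sent_bits = [not sent for sent in has_been_sent]  # HasBeenSent = false
    result    = [a and b for a, b in zip(size_bits, sent_bits)]

    print(sum(result))   # per-shard COUNT; a 31-shard query sums 31 such counts
    # -> 2 (objects 1 and 2 match both clauses)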
On a “cold” system, where nothing is cached, each row read requires a physical read. However, in practice data is cached at multiple levels:
The operating system caches Cassandra data files (SSTables). Since data is compressed, a large amount of data is typically cached at this level.
Cassandra caches certain information such as recently-read rows, key indices, and bloom filters to speed-up access to recently-accessed data.
Doradus OLAP caches value arrays on a most-recently-used (MRU) basis, hence recently-accessed values are not re-read from Cassandra.
Since query results consist of compact bit arrays, these are also cached at the clause level on an MRU basis. Hence, recent repeated query clauses are reused and act as cached queries.
These caching levels produce natural “hot” and “warm” data pools that speed-up access to the most requested data.
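A sketch of the value-array cache idea in Python: recently used arrays stay resident, and the least recently touched entry is evicted when the cache fills. The (shard, table, field) key shape and the cache size are assumptions for illustration:

    from collections import OrderedDict

    class ValueArrayCache:
        def __init__(self, max_entries=1024):
            self.max_entries = max_entries
            self.entries = OrderedDict()   # key -> decompressed value array

        def get(self, key, load_fn):
            if key in self.entries:
                self.entries.move_to_end(key)      # mark as recently used
                return self.entries[key]
            if len(self.entries) >= self.max_entries:
                self.entries.popitem(last=False)   # evict least recently used
            self.entries[key] = load_fn(key)       # read + decompress from store
            return self.entries[key]

    cache = ValueArrayCache()
    array = cache.get(("2014-03-01", "Message", "Size"),
                      lambda key: [500, 2500, 9000])   # stand-in for a Cassandra read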
Let’s look at a real-world Doradus OLAP application: a Windows event tracking application. Shown is an example source event in CSV format. Each event consists of 10 fixed fields and 0 or more variable fields, called “insertion strings”. The number of insertion strings depends on the event’s ID.
In this example, a 540 event has 7 insertion strings, one of which is null. The index of the insertion strings is significant: that is, we must store the index of each insertion string as well as the value since each position is meaningful to the corresponding event.
Doradus OLAP requires predefined tables and fields. However, although our event data is variable in format, we can capture all event fields with a two-table schema: Events stores the 10 fixed fields of each event, and InsertionStrings stores the index and value of each variable field. The Events.Params link field and its inverse InsertionStrings.Event connect each event to its insertion string objects. This allows us to find events and navigate to its insertion strings or vice versa.
For this test, we loaded just under 115 million events. Since events average around 7.7 insertion strings each, this requires the creation of 880 million insertion string objects. Each event object is connected to its insertion string objects by populating the Params/Event links.
A standard benchmark we use is loading the 115M Events data set on a single node running both Doradus and Cassandra. The server is a Dell PowerEdge™ server with dual Intel® Xeon® CPUs and 48GB of memory. We configure Cassandra to use a maximum of 2GB of memory, and we configure Doradus, which runs embedded in the load application, to use a maximum of 8GB of memory. The load app uses 15 threads to load data in parallel.
The Events data set spans 860 calendar days, and we load data using one-day shards, which creates 860 shards. The bulk load application creates a total of over 994 million objects. On a single server, this data loads in about 15 minutes. This works out to an average load rate of 124K events per second or 1.1M total objects per second (Events + InsertionStrings).
When the Cassandra database is flushed and compacted, the “nodetool status” command reports that the entire database takes 1.89 GB.
In another test, we compared the space used by Doradus to the space used by a relational database for an auditing application. The SQL Server database required 8.5K per object whereas Doradus required only 87 bytes–a savings of almost 99%!
Doradus OLAP’s space savings result from columnar storage, compression, de-duping, and the lack of indexes.
Here is a simple DQL query that uses our events application. Because Doradus provides a REST API, we can just use a browser to query the database. The full REST URL (which assumes Doradus is running locally) is shown. The significant parts of the URL are highlighted:
The first node in the URI is the application name containing the data to be queried.
The second node is the table being queried.
The third node is the system resource name for aggregate queries, called _aggregate. (System resources begin with an underscore.)
As a query element of the URI, the m parameter defines the metric function to be computed; COUNT(*) simply counts all objects. The range parameter defines the range of shards to be queried; range=0 is shorthand for “all shards”.
This query counts all events in all 860 shards. It takes less than 10 seconds on a “cold” system and less than 1 second on a warm one.
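For completeness, the same count query issued from Python; the URL is exactly the one shown on the slide, with requests handling the parameter encoding:

    import requests

    resp = requests.get(
        "http://localhost:1123/OLAPEvents/Events/_aggregate",
        params={"m": "COUNT(*)", "range": "0"})   # range=0 selects all shards
    print(resp.text)   # <results> with <totalobjects> and <value>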
This example performs a typical analytical query. Suppose we’re trying to see if there’s a pattern of when certain privileged events fail. That is, we want to know if they mostly fail at the same time of day. The query looks for events where (1) the EventID belongs to a specific set of privileged event numbers, (2) the event Type field denotes a failure, and (3) the result code from the failed Windows operation (Index=8) has a specific event code (Value=‘(0x0,0x3E7)’). Since we are querying in a six month range, this query will search 181 shards.
Shown is the REST command for this query with the URI encoding details removed:
The metric parameter (m) is COUNT(*).
The shards are selected using a range parameter that selects the beginning and ending range.
The query expression (q) uses three clauses that specify the selection criteria. Notice that objects are selected in part based on having an object linked via the Params field whose Index is 8 and whose Value is the hex code we’re looking for.
The grouping parameter (f) is the HOUR component of the Timestamp field. Only the TOP 5 groups are returned, and the grouping expression is renamed HourOfDay for easier parsing in the query results, which are shown on the next slide.
BTW, these results can be returned in JSON by appending “&format=json” to the query.
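Here is the same command issued from Python, which also takes care of the URI encoding that was elided on the slide:

    import requests

    resp = requests.get(
        "http://localhost:1123/OLAPEvents/Events/_aggregate",
        params={
            "m": "COUNT(*)",
            "range": "2005-01-01,2005-06-30",   # 181 one-day shards
            "q": "Type='Failure Audit' AND EventID IN (577,681,529) "
                 "AND Params.WHERE(Index=8).Value='(0x0,0x3E7)'",
            "f": "TOP(5,Timestamp.HOUR) AS HourOfDay",
        })
    # Add "format": "json" to the parameters to get JSON instead of XML.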
Shown is a typical response to our “privileged event failures” query. The elements of this query result are:
aggregate: echoes the parameters used to create the query.
totalobjects: Shows the total number of objects that were selected by the query.
summary: This is a “rolled up” value of all groups. For this query, summary is the same value as totalobjects because the metric function is COUNT(*).
totalgroups: This indicates that data fell into a total of 17 groups. Since the grouping field was hour-of-day, this means 17 separate hours had occurrences of the requested privileged event failures.
groups: This outer element contains one group element for each group for which a metric computation was made. Because the grouping expression requested TOP(5), only the top 5 groups (those with the highest metric values) were returned.
group: For each group, the group’s metric value is returned along with the group’s identity, which is labeled HourOfDay because of the AS clause.
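A short sketch of pulling the groups out of this XML with the Python standard library, assuming resp holds the response from the query above:

    import xml.etree.ElementTree as ET

    root = ET.fromstring(resp.text)
    for group in root.iter("group"):
        hour = group.find("field").text     # <field name="HourOfDay">...</field>
        count = group.find("metric").text   # the group's COUNT(*) value
        print(f"hour {hour}: {count} failures")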
This shows that most of these failed events occur around 08:00, probably because people are logging into their computer at that time.
To summarize, the primary advantages of Doradus OLAP are:
The REST API is easy to use in all languages and platforms without requiring a specialized client library.
All fields are searchable without using specialized indexes.
The lack of indexes means Doradus OLAP is ideal for ad-hoc statistical queries.
DQL extends Lucene’s full text query language with graph-based query features that are much simpler than joins.
Fast data loading and shard merging allows an OLAP database to provide near real time data warehousing.
Because data is stored very compactly, less disk space is required, saving up to 99% compared to other databases. Combined with fast in-memory query processing, this means less hardware is required compared to other big data approaches. A single node will be sufficient for many analytics applications. But when necessary, the database can be scaled horizontally using Cassandra’s replication and sharding features.
Doradus OLAP works best for applications where:
Data is received as a continuous stream: events, log records, transactions, etc.
Data is structured (all fields are predefined) or semi-structured (variability can be accommodated as in the Events application in this presentation).
Data can be accumulated and loaded in batches.
Data can be partitioned into shards. Time-series data works the best, but strictly speaking other partitioning criteria can work.
Data is typically queried in a subset of shards. For time-sharded data, this means queries select specific time ranges.
The application primarily uses statistical queries. Full text queries are supported, but statistical queries are Doradus OLAP’s forte.
The Dell Software Group uses many open source software projects, and we’re pleased we can contribute back to the open source community. Doradus is open source under the Apache License 2.0, so everyone is free to download, modify, and use the code. Full source code and documentation are available from Github. Binary, source code, and Java doc bundles are available from Maven central. Comments and suggestions are more than welcome, so feel free to contact me.
I would also like to thank the sponsors of this O’Reilly webcast—Dell and Intel—whose support makes open source software and free webcasts such as this possible. Thank you!
- Randy Guck