Slides from an O'Reilly Webinar given on July 29th, 2015. This presentation describes how the Doradus database framework and the OLAP storage service extend Cassandra to provide a unique database solution for certain big data applications. Doradus OLAP uses columnar storage, application-level sharding, compression, and other techniques to store data very densely, yielding fast loading and queries that can scan millions of objects per second.
Overview of the Doradus database open source project and the Cassandra database on which it is based. This presentation was given to the Orange County Big Data Meetup group on July 16, 2014.
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2sf2z6i
This CloudxLab Introduction to Spark SQL & DataFrames tutorial helps you to understand Spark SQL & DataFrames in detail. Below are the topics covered in this slide:
1) Introduction to DataFrames
2) Creating DataFrames from JSON
3) DataFrame Operations
4) Running SQL Queries Programmatically
5) Datasets
6) Inferring the Schema Using Reflection
7) Programmatically Specifying the Schema
• Distributed datasets loaded into named columns (similar to relational DBs or
Python DataFrames).
• Can be constructed from existing RDDs or external data sources.
• Can scale from small datasets to TBs/PBs on multi-node Spark clusters.
• APIs available in Python, Java, Scala and R.
• Bytecode generation and optimization using Catalyst Optimizer.
• Simpler DSL to perform complex and data heavy operations.
• Faster runtime performance than vanilla RDDs.
User Defined Aggregation in Apache Spark: A Love Story (Databricks)
This document summarizes a user's journey developing a custom aggregation function for Apache Spark using a T-Digest sketch. The user initially implemented it as a User Defined Aggregate Function (UDAF) but ran into performance issues due to excessive serialization/deserialization. They then worked to resolve it by implementing the function as a custom Aggregator using Spark 3.0's new aggregation APIs, which avoided unnecessary serialization and provided a 70x performance improvement. The story highlights the importance of understanding how custom functions interact with Spark's execution model and optimization techniques like avoiding excessive serialization.
This document provides an agenda for a presentation on Big Data Analytics with Cassandra, Spark, and MLLib. The presentation covers Spark basics, using Spark with Cassandra, Spark Streaming, Spark SQL, and Spark MLLib. It also includes examples of querying and analyzing Cassandra data with Spark and Spark SQL, and machine learning with Spark MLLib.
This document discusses Spark SQL and DataFrames. It provides three key points:
1. DataFrames are distributed collections of data organized into named columns similar to a table in a relational database. They allow SQL-like operations to be performed on structured data.
2. DataFrames can be created from a variety of data sources like JSON, Parquet files, existing RDDs, or Hive tables. The schema can be inferred automatically using case classes or specified programmatically.
3. Common SQL operations like selecting columns, filtering rows, aggregation, and joining can be performed on DataFrames to analyze structured data. The results are DataFrames that support additional transformations.
Amazon DynamoDB is a fully managed, highly scalable distributed database service. In this technical talk, we will deep dive on how to: Use DynamoDB to build high-scale applications like social gaming, chat, and voting. - Model these applications using DynamoDB, including how to use building blocks such as conditional writes, consistent reads, and batch operations to build the higher-level functionality such as multi-item atomic writes and join queries. - Incorporate best practices such as index projections, item sharding, and parallel scan for maximum scalability
This document discusses using Apache Spark to perform analytics on Cassandra data. It provides an overview of Spark and how it can be used to query and aggregate Cassandra data through transformations and actions on resilient distributed datasets (RDDs). It also describes how to use the Spark Cassandra connector to load data from Cassandra into Spark and write data from Spark back to Cassandra.
Spark streaming can be used for near-real-time data analysis of data streams. It processes data in micro-batches and provides windowing operations. Stateful operations like updateStateByKey allow tracking state across batches. Data can be obtained from sources like Kafka, Flume, HDFS and processed using transformations before being saved to destinations like Cassandra. Fault tolerance is provided by replicating batches, but some data may be lost depending on how receivers collect data.
This document provides an agenda and overview of Big Data Analytics using Spark and Cassandra. It discusses Cassandra as a distributed database and Spark as a data processing framework. It covers connecting Spark and Cassandra, reading and writing Cassandra tables as Spark RDDs, and using Spark SQL, Spark Streaming, and Spark MLLib with Cassandra data. Key capabilities of each technology are highlighted such as Cassandra's tunable consistency and Spark's fault tolerance through RDD lineage. Examples demonstrate basic operations like filtering, aggregating, and joining Cassandra data with Spark.
My Hadoop Ecosystem presentation at the 2011 BreizhCamp.
See the talk video (in french):
http://mediaserver.univ-rennes1.fr/videos/?video=MEDIA110628093346744
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ... (StampedeCon)
Learn how to model beyond traditional direct access in Apache Cassandra. Utilizing the DataStax platform to harness the power of Spark and Solr to perform search, analytics, and complex operations in place on your Cassandra data!
Apache Spark is a fast and general engine for large-scale data processing. It provides a unified API for batch, interactive, and streaming data processing using in-memory primitives. A benchmark showed Spark was able to sort 100TB of data 3 times faster than Hadoop using 10 times fewer machines by keeping data in memory between jobs.
The document discusses the author's initial experiences learning and working with Amazon DynamoDB, a fully managed NoSQL database service. Some key points:
- DynamoDB allows for flexible data structures and extreme performance but is not suitable for complex queries. It uses a NoSQL model with primary and secondary indexes.
- The author created tables for a forum application and implemented a simple REST API for basic CRUD operations on the tables.
- While DynamoDB requires less maintenance than self-managed databases, the author notes it has limitations like the inability to modify indexes and less flexibility for querying data.
- Additional tools like Pentaho ETL were used to process and load sample data into the DynamoDB tables.
My talk about Catalyst for QCon Beijing 2015. In this talk, we walk through Catalyst, Spark SQL's query optimizer, by using a simplified version of Catalyst to build an optimizing Brainfuck compiler named Brainsuck in less than 300 lines of code.
If you’re familiar with relational databases, designing your app to use a fully-managed NoSQL database service like Amazon DynamoDB may be new to you. In this webinar, we’ll walk you through common NoSQL design patterns for a variety of applications to help you learn how to design a schema, store, and retrieve data with DynamoDB. We will discuss best practices with DynamoDB to develop IoT, AdTech, and gaming apps.
Lightning fast analytics with Spark and Cassandra (nickmbailey)
Spark is a fast and general engine for large-scale data processing. It provides APIs for Java, Scala, and Python that allow users to load data into a distributed cluster as resilient distributed datasets (RDDs) and then perform operations like map, filter, reduce, join and save. The Cassandra Spark driver allows accessing Cassandra tables as RDDs to perform analytics and run Spark SQL queries across Cassandra data. It provides server-side data selection and mapping of rows to Scala case classes or other objects.
This document provides an overview and summary of key aspects of DynamoDB, including:
1. DynamoDB is a fully managed NoSQL database that scales to any workload and provides fast and consistent performance.
2. DynamoDB uses a table structure with partition and sort keys to organize and access data, and scales both read and write throughput independently across partitions.
3. Common challenges include hot keys/partitions that can cause throttling, and designing schemas and partitions to spread access uniformly across the keyspace.
4. NoSQL data modeling focuses on aggregations rather than relations, using patterns like hierarchical data structures and parent-child relationships to model one-to-many and many-to-many relationships.
This document summarizes a presentation about using Spark with Apache Cassandra. It discusses using Spark jobs to load and transform data in Cassandra for purposes such as data import, cleaning, schema migration and analytics. It also covers aspects of the connector architecture like data locality, failure handling and cross-cluster operations. Examples are given of using Spark and Cassandra together for parallel data ingestion and top-K queries on a large dataset.
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. You can use Amazon DynamoDB to create a database table that can store and retrieve any amount of data, and serve any level of request traffic. Amazon DynamoDB automatically spreads the data and traffic for the table over a sufficient number of servers to handle the request capacity specified by the customer and the amount of data stored, while maintaining consistent and fast performance.
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra (Piotr Kolaczkowski)
The document discusses using Apache Spark and Apache Cassandra together for fast data analysis as an alternative to Hadoop. It provides examples of basic Spark operations on Cassandra tables like counting rows, filtering, joining with external data sources, and importing/exporting data. The document argues that Spark on Cassandra provides a simpler distributed processing framework compared to Hadoop.
Time series with Apache Cassandra - Long version (Patrick McFadin)
Apache Cassandra has proven to be one of the best solutions for storing and retrieving time series data. This talk will give you an overview of the many ways you can be successful. We will discuss how the storage model of Cassandra is well suited for this pattern and go over examples of how best to build data models.
Fast track to getting started with DSE Max @ ING (Duyhai Doan)
This document provides an overview of Apache Spark and Apache Cassandra and how they can be used together. It begins with introductions to Spark, describing its core concepts like RDDs and transformations. It then introduces Cassandra and covers concepts like data distribution and token ranges. The remainder discusses the Spark Cassandra connector, covering how it allows reading and writing Cassandra data from Spark and maintaining data locality. It also discusses use cases, failure handling, and cross-datacenter/cluster operations.
Apache Spark is a fast, general engine for large-scale data processing. It supports batch, interactive, and stream processing using a unified API. Spark uses resilient distributed datasets (RDDs), which are immutable distributed collections of objects that can be operated on in parallel. RDDs support transformations like map, filter, and reduce and actions that return final results to the driver program. Spark provides high-level APIs in Scala, Java, Python, and R and an optimized engine that supports general computation graphs for data analysis.
A deeper-understanding-of-spark-internals (Cheng Min Chi)
The document discusses Spark's execution model and how it runs jobs. It explains that Spark first creates a directed acyclic graph (DAG) of RDDs to represent the computation. It then splits the DAG into stages separated by shuffle operations. Each stage is divided into tasks that operate on data partitions in parallel. The document uses an example job to illustrate how Spark schedules and executes the tasks across a cluster. It emphasizes that understanding these internals can help optimize jobs by increasing parallelism and reducing shuffles.
This document provides an introduction to Cassandra including:
- Datastax is a company that contributes to Apache Cassandra and sells Datastax Enterprise.
- Cassandra was created at Facebook and is now open source software with the current version being 3.2.
- Cassandra's key features include linear scalability, continuous availability, multi-datacenter support, operational simplicity, and Spark integration.
Talk given at ClojureD conference, Berlin
Apache Spark is an engine for efficiently processing large amounts of data. We show how to apply the elegance of Clojure to Spark - fully exploiting the REPL and dynamic typing. There will be live coding using our gorillalabs/sparkling API.
In the presentation, we will of course introduce the core concepts of Spark, like resilient distributed datasets (RDDs). And you will learn how Spark's concepts resemble those well known from Clojure, like persistent data structures and functional programming.
Finally, we will provide some Do’s and Don’ts for you to kick off your Spark program based upon our experience.
About Paulus Esterhazy and Christian Betz
Being a LISP hacker for several years, and a Java-guy for some more, Chris turned to Clojure for production code in 2011. He’s been Project Lead, Software Architect, and VP Tech in the meantime, interested in AI and data-visualization.
Now, working on the heart of data driven marketing for Performance Media in Hamburg, he turned to Apache Spark for some Big Data jobs. Chris released the API-wrapper ‘chrisbetz/sparkling’ to fully exploit the power of his compute cluster.
Paulus Esterhazy
Paulus is a philosophy PhD turned software engineer with an interest in functional programming and a penchant for hammock-driven development.
He currently works as Senior Web Developer at Red Pineapple Media in Berlin.
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ... (randyguck)
Slides from my Strata+Hadoop 2015 Conference session titled: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP. This talk describes the Doradus OLAP query/storage engine, which is an open source module that runs on top of the Cassandra NoSQL DB. Among the benefits of this service is fast data loading, a rich query language with full text and graph query features, and very dense data storage. See the Notes section for details on each slide.
[PASS Summit 2016] Azure DocumentDB: A Deep Dive into Advanced Features (Andrew Liu)
Let's talk about how you can get the most out of Azure DocumentDB. In this session we will dive deep into the mechanics of DocumentDB and explain the various levers available to tune performance and scale. From partitioned collections to global databases to advanced indexing and query features - this session will equip you with the best practices and nuggets of information that will become invaluable tools in your toolbox for building blazingly fast large-scale applications.
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ... (Databricks)
Of all the developers’ delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs-RDDs, DataFrames, and Datasets-available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as best practices 2) outline its performance and optimization benefits; and 3) underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (this will be vocalization of the blog, along with the latest developments in Apache Spark 2.x Dataframe/Datasets and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
This document provides an agenda for a presentation on integrating Apache Cassandra and Apache Spark. The presentation will cover RDBMS vs NoSQL databases, an overview of Cassandra including data model and queries, and Spark including RDDs and running Spark on Cassandra data. Examples will be shown of performing joins between Cassandra and Spark DataFrames for both simple and complex queries.
This document discusses using PySpark with Cassandra for analytics. It provides background on Cassandra, Spark, and PySpark. Key features of PySpark Cassandra include scanning Cassandra tables into RDDs, writing RDDs to Cassandra, and joining RDDs with Cassandra tables. Examples demonstrate using operators like scan, project, filter, join, and save to perform tasks like processing time series data, media metadata processing, and earthquake monitoring. The document discusses getting started, compatibility, and provides code samples for common operations.
Structured streaming provides a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. It allows processing live data streams using continuous queries that look identical to batch queries. The presentation discusses Spark components including RDDs, DataFrames and Datasets. It then covers limitations of the traditional Spark Streaming model and how structured streaming addresses them by using incremental execution plans and exactly-once semantics. An example of a word count application and demo is presented to illustrate structured streaming concepts.
NoSQL - MongoDB. Agility, scalability, performance. I am going to talk about the basics of NoSQL and MongoDB. Why do some projects require RDBMSs and others NoSQL databases? What are the pros and cons of NoSQL vs. SQL? How is data stored and transferred in MongoDB? What query language is used? How does MongoDB support high availability and automatic failover through replication? What is sharding and how does it help scalability? Also covered: the newest levels of concurrency control, collection-level and document-level.
Erich Ess, CTO of SimpelRelevance, introduces the Spark distributed computing platform and explains how to integrate it with Cassandra. He demonstrates running a distributed analytic computation on a data set stored in Cassandra.
This document provides an overview of Weather.com's analytics architecture using Apache Cassandra and Spark. It summarizes Weather.com's initial attempts using Cassandra, lessons learned, and its improved architecture. The improved architecture uses Cassandra for streaming event data with time-window compaction, stores all other data in Amazon S3 for batch processing in Spark, and replaces Kafka with Amazon SQS for event ingestion. It discusses best practices for data modeling in Cassandra including partitioning, secondary indexes, and avoiding wide rows and nulls. The document also highlights how Weather.com uses Apache Zeppelin notebooks for data exploration and visualization.
1. Scalding is a library that provides a concise domain-specific language (DSL) for writing MapReduce jobs in Scala. It allows defining source and sink connectors, as well as data transformation operations like map, filter, groupBy, and join in a more readable way than raw MapReduce APIs.
2. Some use cases for Scalding include splitting or reusing data streams, handling exotic data sources like JDBC or HBase, performing joins, distributed caching, and building connected user profiles by bridging data from different sources.
3. For connecting user profiles, Scalding can be used to model the data as a graph with vertices for user interests and edges for bridging rules.
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map... (Victor Giannakouris)
This document proposes CSMR, a scalable algorithm for text clustering that uses cosine similarity and MapReduce. CSMR performs pairwise text similarity by representing text documents as vectors in a vector space model and measuring similarity in parallel using MapReduce. It is a 4-phase algorithm that includes word counting, text vectorization using term frequencies, applying TF-IDF to document vectors, and measuring cosine similarity. The algorithm is designed to cluster large text corpora in a scalable manner on distributed systems like Hadoop. Future work includes implementing and testing CSMR on real data and publishing results.
This document discusses MATLAB support for scientific data formats and analytics workflows. It provides an overview of MATLAB's capabilities for accessing, exploring, and preprocessing large scientific datasets. These include built-in support for HDF5, NetCDF, and other file formats. It also describes datastore objects that allow loading large datasets incrementally for analysis. The document concludes with an example that uses a FileDatastore to access and summarize HDF5 data from NASA ice sheet surveys in a MapReduce workflow.
Unified Big Data Processing with Apache Spark (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
Apache Spark is a cluster computing framework that allows for fast, easy, and general processing of large datasets. It extends the MapReduce model to support iterative algorithms and interactive queries. Spark uses Resilient Distributed Datasets (RDDs), which allow data to be distributed across a cluster and cached in memory for faster processing. RDDs support transformations like map, filter, and reduce and actions like count and collect. This functional programming approach allows Spark to efficiently handle iterative algorithms and interactive data analysis.
This is a introduction to PostgreSQL that provides a brief overview of PostgreSQL's architecture, features and ecosystem. It was delivered at NYLUG on Nov 24, 2014.
http://www.meetup.com/nylug-meetings/events/180533472/
This document provides an introduction and overview of Neo4j, a graph database. It discusses trends in big data, NoSQL databases, and different types of NoSQL databases like key-value stores, column family databases, and document databases. It then defines what a graph and graph database are, and introduces Neo4j as a native graph database that uses a property graph model. It outlines some of Neo4j's features and provides examples of how it can be used to represent social network, spatial, and interconnected data.
Application development with Oracle NoSQL Database 3.0Anuj Sahni
The document introduces table-based data modeling features for Oracle NoSQL Database. It discusses using tables to simplify application data modeling with familiar concepts like tables and data types. Examples show how to model user and email data using tables, including defining the schema using DDL, querying the data using DML, and indexing the tables. The document also provides an example of modeling user and email data from an email client application to illustrate how to approach data modeling.
Keeping Spark on Track: Productionizing Spark for ETL (Databricks)
ETL is the first phase when building a big data processing platform. Data is available from various sources and formats, and transforming the data into a compact binary format (Parquet, ORC, etc.) allows Apache Spark to process it in the most efficient manner. This talk will discuss common issues and best practices for speeding up your ETL workflows, handling dirty data, and debugging tips for identifying errors.
Speakers: Kyle Pistor & Miklos Christine
This talk was originally presented at Spark Summit East 2017.
This document discusses enabling multi-region Cassandra clusters that span heterogeneous data centers using Network Address Translation (NAT) and DNS-based Service Discovery (DNS-SD). It describes how NAT allows sharing a limited number of public IP addresses between private nodes by mapping private ports to public ports. DNS-SD is proposed to advertise the port mappings so nodes can discover each other, with SRV and TXT records storing port and cluster details. Minor modifications to Cassandra and drivers are suggested to lookup ports via DNS-SD during connection establishment.
Jump Start with Apache Spark 2.0 on Databricks (Databricks)
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Extending Cassandra with Doradus OLAP for High Performance Analytics
1. Extending Cassandra with
Doradus OLAP for
High Performance Analytics
Randy Guck
Principal Engineer,
Dell Software
30 Doradus: The Tarantula Nebula
Source: Hubble Space Telescope
Webcast sponsored by:
2. Agenda
• Doradus Overview
– What is it?
– Why Cassandra?
– Data model and DQL
• Architecture
– Storage services
• Doradus OLAP
– Motivation and sharding model
– Example Events application
– Example queries
3. Doradus In A Nutshell
• Storage and query service
• Uses NoSQL DB for persistence
• Pure Java
• Stateless
• Full text and graph query language
• Pluggable storage services
• Bundled with Dell UCCS Analytics for 3 years
• Open source, Apache License 2.0
[Diagram: Application → Doradus via the REST API; Doradus → Cassandra via Thrift or CQL; Cassandra persists the data and log files]
5. Cassandra: Best NoSQL DB?
• Choice for many big data apps
• Wide-column model
• CQL: Similar to SQL
• Pure peer architecture
• Cabinet- and Data Center-aware
• Elasticity, recovery features
• Consistent benchmark winner
6. So, Why Doradus?
• Cassandra limitations (by design):
– Minimal indexing
– No relationship support
– Limited queries
• e.g., SELECT <columns> FROM <table> WHERE KEY IN (<list>)
• No joins, embedded selects, OR clauses, ...
• No full text, range, or statistical queries
– API: requires client driver
• What if we could leverage Cassandra’s best features
while elevating the data model and query language?
7. Doradus Data Model:
Example Message Tracking Schema
[Diagram: message-tracking schema shown as a graph of tables connected by link pairs]
Tables: Message {Size, SendDate, ...}, Participant {ReceiptDate}, Address {Email}, Person {FirstName, LastName, Department, ...}, Attachment {Size, Extension}
Link pairs: Manager/Employees, Sender/MessageAsSender, Recipients/MessageAsRecipient, Address/Participants, Person/Address, Attachments/Message
Bi-directional relationships are formed by link pairs
8. Doradus Query Language:
Object Queries
• Goal:
Find and return objects
• Builds on Lucene syntax:
Full text queries: terms, phrases,
ranges, AND/OR/NOT, etc.
• Adds link paths:
Directed graph searches
Quantifiers and filters
Transitive searches
• Other features:
Stateless paging
Sorting
• Example DQL expressions:
// On Person:
LastName = Smith AND NOT (FirstName : Jo*) AND BirthDate = [1986 TO 1992]
// On Message:
ALL(Recipients).ANY(Address.WHERE(Email='*.gmail.com')).Person.Department : support
// On Person:
Employees^(4).Office='San Jose'
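To make this concrete, here is a minimal sketch (Python with the requests library) that issues the first DQL expression above as an object query over the REST API. It assumes the _query resource and q/f/s parameters described in the Doradus documentation and the localhost:1123 endpoint used later in this deck; the application name "MsgTracker" is hypothetical:

import requests

# Hedged sketch: run a DQL object query against a local Doradus server.
# "MsgTracker" is a made-up application name for the slide-7 schema.
BASE = "http://localhost:1123"
params = {
    "q": "LastName = Smith AND NOT (FirstName : Jo*) AND BirthDate = [1986 TO 1992]",
    "f": "FirstName,LastName,Department",  # scalar fields to return
    "s": 50,                               # page size (stateless paging)
}
resp = requests.get(BASE + "/MsgTracker/Person/_query", params=params)
resp.raise_for_status()
print(resp.text)  # XML by default; send Accept: application/json for JSON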
10. Doradus Service Architecture
[Diagram: Doradus server architecture. A DoradusServer process, configured via doradus.yaml, exposes a REST API and JMX and hosts pluggable modules: service components (REST, Schema, TaskManager, MBean), storage services (Spider, OLAP, Logging), and DB services (Thrift, CQL, DynamoDB), which connect to Cassandra or AWS DynamoDB. The figure also labels a Tenant element, reflecting multi-tenant support.]
11. Storage Service Comparison
Feature               Spider           OLAP                Logging
Data variability      Unstructured     Semi-structured     Unstructured
Data mutability       High             Medium              None
Update granularity    Fine             Batch               Batch
Load performance      Medium           High                Very high
Space usage           High             Low                 Low
Query focus           Object queries   Aggregate queries   Time-based aggregate queries
Data aging?           yes              yes                 yes
Supports links?       yes              yes                 no
Dynamic fields?       yes              no                  yes
Event load rate/node  1.3K/second      124K/second         444K/second
115M-event DB size    100GB            1.89GB              1.49GB
12. Doradus OLAP: Motivation
• Combines ideas from:
– Online Analytical Processing: data arranged in static cubes
– Columnar databases: column-oriented storage and compression
– NoSQL databases: Sharding
• Features:
– Fast loading: up to 1M objects/second/node
– Dense storage: 1 billion objects in 2 GB
– Fast cube merging: typically seconds
– No indexes
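As an intuition for the columnar-plus-compression point above, here is an illustrative Python sketch of run-length encoding a single low-cardinality column; this is not Doradus's actual codec, just a demonstration of why storing each field's values together compresses so well:

# Illustrative only: Doradus OLAP's real encodings differ, but the effect
# is similar because all values of one field are stored contiguously.
def run_length_encode(column):
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

# A 1,000,000-cell "Type" column with two distinct values collapses to 2 runs.
column = ["Success Audit"] * 900_000 + ["Failure Audit"] * 100_000
print(run_length_encode(column))
# [['Success Audit', 900000], ['Failure Audit', 100000]]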
17. OLAP Shard Merging
[Diagram: two update batches (Batch #1 and Batch #2), each containing Message (ID, Size, SendDate, ...), Person (ID, FirstName, LastName, ...), and Address (ID, Person, Messages, ...) tables; both target shard 2014-03-01 and are merged into the OLAP store]
OLAP store layout (one physical row per application/table/shard/field, with compressed column data):
Key                                Columns
Email/Message/2014-03-01/ID        [compressed data]
Email/Message/2014-03-01/Size      [compressed data]
Email/Message/2014-03-01/SendDate  [compressed data]
...
Email/Person/2014-03-01/ID         [compressed data]
Email/Person/2014-03-01/FirstName  [compressed data]
Email/Person/2014-03-01/LastName   [compressed data]
...
Email/Address/2014-03-01/ID        [compressed data]
Email/Address/2014-03-01/Person    [compressed data]
Email/Address/2014-03-01/Message   [compressed data]
...
Email/Message/2014-02-28/ID        [compressed data]
...
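A minimal sketch of the merge bookkeeping implied by this slide, assuming each batch delivers one ID-to-value segment per field and later batches supersede earlier ones; it models the idea only, not Doradus's physical row format or compression:

# Hedged sketch: merge two batch segments of one field ("Size") for one shard.
def merge_segments(batch1, batch2):
    """Each segment maps object ID -> value; batch #2 wins on conflicts."""
    merged = dict(batch1)
    merged.update(batch2)
    return dict(sorted(merged.items()))

size_batch1 = {"m001": 1200, "m002": 5300}
size_batch2 = {"m002": 5400, "m003": 800}   # m002 re-sent with a new Size
print(merge_segments(size_batch1, size_batch2))
# {'m001': 1200, 'm002': 5400, 'm003': 800}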
18. Does Merging Take Long?
[Chart: Merge Time vs. Shard Size. Shard size (total objects, 0 to ~45,000,000) plotted against merge time (0 to 90 seconds).]
19. OLAP Query Execution
• Example query:
– Count messages with Size between 1000-10000 and
HasBeenSent=false in shards 2014-03-01 to 2014-03-31
• How many rows are read?
– 2 fields x 31 shards = 62 rows
– Each row typically represents millions of objects
• Value arrays are scanned in memory
• Physical rows are read on “cold” start only
– Multiple caching levels create “hot” and “warm” data
pools
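To illustrate that execution model, here is a hedged Python sketch of the example query: per shard, each referenced field is one in-memory value array, so the scan touches 2 fields x 31 shards = 62 rows no matter how many objects they hold (a model of the idea, not Doradus code):

def count_matches(shards):
    total = 0
    for shard in shards:                      # e.g. 2014-03-01 .. 2014-03-31
        sizes = shard["Size"]                 # one value per object
        sent = shard["HasBeenSent"]
        total += sum(1 for size, was_sent in zip(sizes, sent)
                     if 1000 <= size <= 10000 and not was_sent)
    return total

shards = [{"Size": [500, 2000, 9999], "HasBeenSent": [False, False, True]}]
print(count_matches(shards))  # 1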
20. OLAP Example: Security Events
Sample event in CSV format:
MAILSERVER18,Security,"Sun, 22 Jan 2013 08:09:50 UTC","Success Audit",Security,
"Logon/Logoff",540,"NT AUTHORITY",SYSTEM,S-1-5-18,7,MAILSERVER18$,,Workstation,
"(0x0,0x142999A)",3,Kerberos,Kerberos
Fixed fields:
Computer Name   MAILSERVER18
Log Name        Security
Time Stamp      Sun, 22 Jan 2013 08:09:50 UTC
Type            Success Audit
Source          Security
Category        Logon/Logoff
Event ID        540
User Domain     NT AUTHORITY
User Name       SYSTEM
User SID        S-1-5-18
Variable fields (insertion strings):
1  MAILSERVER18$
2  (empty)
3  Workstation
4  (0x0,0x142999A)
5  3
6  Kerberos
7  Kerberos
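As a concrete companion to the layout above, a small Python sketch that splits the sample CSV event into its fixed fields and variable insertion strings; the field names are ours for illustration, not a Doradus API:

import csv
import io

# The 11th value ("7") is the count of insertion strings that follow it.
FIXED = ["ComputerName", "LogName", "TimeStamp", "Type", "Source",
         "Category", "EventID", "UserDomain", "UserName", "UserSID"]

line = ('MAILSERVER18,Security,"Sun, 22 Jan 2013 08:09:50 UTC","Success Audit",'
        'Security,"Logon/Logoff",540,"NT AUTHORITY",SYSTEM,S-1-5-18,7,'
        'MAILSERVER18$,,Workstation,"(0x0,0x142999A)",3,Kerberos,Kerberos')

row = next(csv.reader(io.StringIO(line)))
event = dict(zip(FIXED, row[:10]))
count = int(row[10])
event["Params"] = row[11:11 + count]   # the variable fields, in order
print(event["EventID"], event["Params"])
# 540 ['MAILSERVER18$', '', 'Workstation', '(0x0,0x142999A)', '3', 'Kerberos', 'Kerberos']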
22. OLAP Example: Events Loading
• Configuration:
– Single Dell PowerEdge™ server with dual Intel® Xeon® CPUs
– Cassandra: Xmx=2G, Doradus: Xmx=8G
• Load stats (15 threads):
Total shards: 860
Total events: 114,572,247
Total ins strings: 879,529,753
Total objects: 994,102,000
Total load time: 15 minutes, 27 seconds
Average load rate: 1.1M objects/second; 124K events/second
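(Sanity-checking the arithmetic: 15 minutes 27 seconds is 927 seconds, so 994,102,000 objects / 927 s ≈ 1.07M objects/second, rounded to 1.1M above, and 114,572,247 events / 927 s ≈ 124K events/second.)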
• Space usage:
nodetool -h localhost status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Token Rack
UN 127.0.0.1 1.89 GB 100.0% 9fe241b5-20f7-4afb-af92-207ef24b8095 -9176223118562734495 rack1
23. OLAP Example: Events Queries
• Count all Events in all shards
• REST command:
http://localhost:1123/OLAPEvents/Events/_aggregate?m=COUNT(*)&range=0
• Example response:
<results>
<aggregate metric="COUNT(*)" query="*"/>
<totalobjects>114572247</totalobjects>
<value>114572247</value>
</results>
[Annotations on the REST command above: "OLAPEvents" = application name, "Events" = table name, "_aggregate" = aggregate query resource, "m=COUNT(*)" = metric function, "range=0" = all shards]
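The same command issued from Python, as a minimal sketch that reuses the URL and response shape shown on this slide (requests plus the standard-library XML parser; error handling kept minimal):

import requests
import xml.etree.ElementTree as ET

url = "http://localhost:1123/OLAPEvents/Events/_aggregate"
resp = requests.get(url, params={"m": "COUNT(*)", "range": "0"})  # range=0: all shards
resp.raise_for_status()
root = ET.fromstring(resp.text)
print(root.findtext("totalobjects"), root.findtext("value"))  # 114572247 114572247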
24. OLAP Example: Events Queries
• Find the top 5 hours-of-the-day when certain
privileged events fail:
Event IDs are any of 577, 681, 529
Event type is ‘Failure Audit’
Insertion string 8 is (0x0,0x3E7)
Event occurred in first half of 2005 (181 shards)
GET /OLAPEvents/Events/_aggregate?m=COUNT(*)
&range=2005-01-01,2005-06-30
&q=Type='Failure Audit' AND EventID IN (577,681,529) AND
Params.WHERE(Index=8).Value='(0x0,0x3E7)'
&f=TOP(5,Timestamp.HOUR) AS HourOfDay
25. “Privileged Event Failures”:
Query Results
<results>
<aggregate metric="COUNT(*)" query="Type='Failure Audit' EventID IN (577,681,529) AND
Params.Index=8 AND Params.Value='(0x0,0x3E7)'" group="TOP(5,Timestamp.HOUR) AS HourOfDay"/>
<totalobjects>591</totalobjects>
<summary>591</summary>
<totalgroups>17</totalgroups>
<groups>
<group>
<metric>119</metric>
<field name="HourOfDay">8</field>
</group>
<group>
<metric>87</metric>
<field name="HourOfDay">18</field>
</group>
<group>
<metric>72</metric>
<field name="HourOfDay">19</field>
</group>
<group>
<metric>69</metric>
<field name="HourOfDay">9</field>
</group>
<group>
<metric>66</metric>
<field name="HourOfDay">5</field>
</group>
</groups>
</results>
Most failures
occur at 08:00
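A small sketch of walking the grouped XML above to recover the (hour, count) pairs; the element names match the response shown, trimmed here to two groups for brevity:

import xml.etree.ElementTree as ET

xml_text = """<results><groups>
  <group><metric>119</metric><field name="HourOfDay">8</field></group>
  <group><metric>87</metric><field name="HourOfDay">18</field></group>
</groups></results>"""   # trimmed copy of the response above

for group in ET.fromstring(xml_text).iter("group"):
    hour = group.find("field").text
    count = group.findtext("metric")
    print(f"hour {hour}: {count} failures")
# hour 8: 119 failures
# hour 18: 87 failures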
26. Doradus OLAP Summary
• Advantages:
– Simple REST API
– All fields are searchable without indexes
– Ad-hoc statistical searches
– Support for graph-based queries
– Near real time data warehousing
– Dense storage = less hardware
– Horizontally scalable when needed
• Good for applications where data:
– Is continuous/streaming
– Is structured to semi-structured
– Can be loaded in batches
– Is partitionable, typically by time
– Is typically queried in a subset of shards
– Emphasizes statistical queries
27. Thank You!
• Where to find Doradus
– Source and docs: github.com/dell-oss/Doradus
– Downloads: search.maven.org
• Contact me
– Randy.Guck@software.dell.com
– @randyguck
• Thanks to our sponsors!
30 Doradus R136 Cluster
Source: Hubble Space Telescope
PowerEdge is a trademark of Dell Inc.
Intel and Xeon are trademarks of Intel Corporation
in the U.S. and/or other countries.
Editor's Notes
This presentation describes Doradus OLAP, which is a query and storage service that extends the Cassandra NoSQL database. The name was derived from 30 Doradus, also known as The Tarantula Nebula, which is a region in the Large Magellanic Cloud. It is a very active and luminous region, serving as a birthplace of new stars and star clusters.
Here are the topics covered in this presentation. First we’ll describe Doradus at a broad level including its main features, how it leverages Cassandra, and its data model and query language. We’ll then touch on the internal architecture and the concept of storage services. Finally, we’ll take a close look at the Doradus OLAP service and how it provides near real time data warehousing. We’ll use an example Events application to demonstrate load and storage characteristics and sample queries.
Here is a quick overview of Doradus. From a high-level viewpoint, it is a storage and query service that leverages a NoSQL database for persistence. Doradus primarily uses Cassandra, but integrations with other persistence stores have also been demonstrated.
Doradus provides a high-level data model and a query language that supports full text and graph searching features. Applications communicate with Doradus via a REST API that supports JSON and XML messages.
Doradus is stateless, so multiple instances can be run against the same Cassandra cluster using either the Thrift or CQL API. Like Cassandra, Doradus is pure Java and runs on Windows and Linux. The architecture diagram shows a minimal implementation consisting of a user application, Doradus instance, and Cassandra instance, all of which can run on a single machine.
The Doradus architecture supports pluggable storage services, which organize data for specific application scenarios with different space/performance tradeoffs. One of those services, Doradus OLAP, is the primary focus of this presentation. Because of its dense storage techniques, Doradus OLAP allows many big data applications to run on a single node. However, the Cassandra NoSQL underpinning allows horizontal expansion when needed.
Doradus has been bundled with the Dell UCCS Analytics product for about 3 years now and is available as open source under the Apache License 2.0. This means the source code is free for anyone to download, use, and even modify.
Both Doradus and Cassandra can be scaled horizontally.
When increased scalability or failover are needed, a multi-node Cassandra cluster can be used. In this example, Cassandra is deployed in a 3-node cluster. Technically, only one Doradus instance is needed: it will rotate requests through all Cassandra nodes and ignore failed nodes automatically. Because Doradus is stateless, you can run additional instances for increased capacity or failover. For simplicity, one approach is to run a Doradus instance on each Cassandra node so that every node is configured identically.
From Cassandra’s viewpoint, Doradus is an application. Doradus adds functionality from the “outside” and does not modify core Cassandra code. Doradus has been tested with both Cassandra 2.0 and 2.1 releases.
There are many good NoSQL DBs available today, so why did we choose Cassandra? Here are some of the reasons:
Cassandra is the choice of many companies for serious big data applications, including Netflix, EBay, and Apple. At last count, the Netflix cluster uses over 2,800 nodes. Apple reportedly has deployed over 75,000 Cassandra nodes that manage over 10PB of data.
Cassandra’s wide-column or tabular data model is extremely flexible. It can be used for simple key/value and document-oriented applications, but it supports up to 2 billion columns per row, and a column value can be up to 2GB in length.
The primary query language for Cassandra is CQL, which borrows from SQL for both DDL (schema) and DML (query) operations. This makes it easier to transition from relational databases.
Cassandra is pure Java and uses a “pure peer” architecture: there is only one process type, and requests can be sent to any node. There are no master/secondary processes, which makes Cassandra easy to deploy and extend.
Cassandra has arguably best-in-class features for horizontal scalability. It is cabinet- and data center-aware, allowing it to choose replication strategies that maximize availability while balancing network usage.
When it comes to managing a distributed cluster, again Cassandra has best-in-class features. New nodes can be added dynamically while the cluster rebalances in the background. Lost nodes can be recovered from replicated data, and Cassandra handles complex problems such as “split brain” failures.
Cassandra is a consistent winner of NoSQL benchmarks.
Most NoSQL DBs are easy to install and get started with. But as data and network complexity grow, Cassandra is arguably the best NoSQL database. This is why many companies choose it and why we use it with Doradus.
If Cassandra is so popular, why do we need Doradus? Many applications find that Cassandra provides exactly what they need. But some applications need more than Cassandra’s basic features and must add functionality in application code. By design, Cassandra supports limited queries and indexing to encourage applications to follow its partition-based access pattern. Here are some examples of when additional functionality is needed:
Indexing: Cassandra supports minimal indexing. Secondary indexes are hash structures that only support equality searches. Furthermore, they are only recommended for “low cardinality” scalar fields, and some experts don’t recommend them at all.
Data model: Cassandra supports single-valued and limited multi-valued scalars. It does not directly support any means for relating data objects.
Queries: By design, CQL does not support joins, subqueries, or aggregation. The most complex select query it supports uses an IN clause. There is no support for full text queries, general inequalities, or statistical queries.
API: Cassandra must be accessed via its Thrift or CQL APIs, and both require a client-side driver. Some applications find this limiting and would prefer to use a REST API.
What if we could leverage Cassandra’s best features—scalability, elasticity, recoverability—while elevating the data model and query language? This is the premise on which Doradus was started.
Let’s look how Doradus extends Cassandra, starting with the data model and an example Message Tracking application. This schema uses 5 tables to store information about email message traffic. The objects within each table hold scalar field values such as Size and SendDate, but they can also define link fields, which form bi-directional relationships. Every link field has an inverse that defines the same relationship from the opposite direction. For example, the Message table defines a link called Recipients, which points to the Participant table. The inverse link is called MessageAsRecipient, which points back to the Message table. Another pair of links, Sender and MessageAsSender, form another relationship between the same tables. Doradus maintains inverse links automatically: if Message.Sender is updated, the inverse Participant.MessageAsSender link is automatically added or deleted as needed.
The Person table defines a link called Manager, which points to Person with Employees as the inverse link. The Manager/Employees relationship forms an org chart graph that can be searched “up” or “down”. A link whose extent—the table it points to—is the same as its owner is called a reflexive link. Reflexive links can be searched recursively via transitive functions. Doradus also allows self-reflexive links that are their own inverse. An example self-reflexive relationship is Friends (though friendship is not always reciprocal!)
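For illustration, here is a rough Python sketch of how such a schema might be defined over the REST API. The /_applications endpoint, the JSON shape, and the MsgApp application name are assumptions for illustration only; consult the Doradus documentation for the actual schema format.

    import requests

    # Hypothetical schema fragment: two tables related by inverse link pairs.
    # The JSON shape and endpoint below are illustrative assumptions, not
    # verbatim from the Doradus documentation.
    schema = {
        "MsgApp": {
            "tables": {
                "Message": {
                    "fields": {
                        "Size": {"type": "INTEGER"},
                        "SendDate": {"type": "TIMESTAMP"},
                        # Every link declares its inverse; Doradus maintains
                        # both directions automatically.
                        "Sender": {"type": "LINK", "table": "Participant",
                                   "inverse": "MessageAsSender"},
                        "Recipients": {"type": "LINK", "table": "Participant",
                                       "inverse": "MessageAsRecipient"}
                    }
                },
                "Participant": {
                    "fields": {
                        "MessageAsSender": {"type": "LINK", "table": "Message",
                                            "inverse": "Sender"},
                        "MessageAsRecipient": {"type": "LINK", "table": "Message",
                                               "inverse": "Recipients"}
                    }
                }
            }
        }
    }

    requests.post("http://localhost:1123/_applications", json=schema)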
We’ll see how links are used in the query language next.
The Doradus Query Language (DQL) can be used in two contexts: An object query selects and returns objects from a specific table, called the perspective. Object queries can return fields from both selected objects and objects linked to them.
DQL is based on and extends the Lucene full text language. This means that DQL supports clause types such as terms, phrases, wildcards, ranges, equalities and inequalities, AND/OR/NOT and parentheses, and more. To these full text concepts DQL adds the notion of link paths to traverse relationship graphs. Link paths can use functions such as quantifiers and filters to narrow the selection of objects. Reflexive relationships can use a transitive function to search a graph recursively. DQL queries also support stateless paging and results sorting.
The first example on the right is a simple full text query, fetching objects from the Person table. It consists of three clauses that select objects (1) whose LastName is Smith, (2) whose FirstName does not contain a term that begins with “Jo”, and (3) whose BirthDate falls between the years 1986 and 1992 (inclusively). This example uses several features including an equality clause, a term clause, a wildcard term, a range clause, and clause negation.
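As a rough sketch, that first query could be issued from Python as follows. The _query resource name and the exact DQL spelling are reconstructions from the description above, not verbatim from the slide:

    import requests

    # Reconstructed object query: LastName is Smith, FirstName contains no
    # term starting with "Jo", and BirthDate falls in 1986..1992 inclusive.
    resp = requests.get(
        "http://localhost:1123/MsgApp/Person/_query",   # MsgApp is a placeholder
        params={"q": "LastName=Smith AND NOT FirstName:Jo* "
                     "AND BirthDate=[1986 TO 1992]"})
    print(resp.text)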
The second example queries the Message and demonstrates how link fields are used. Link paths are one of the features that makes DQL easy to use. Though deceptively simple, each “.” in a link path such as X.Y.Z essentially performs a join. This example shows how link paths are explicitly quantified. When a link path is enclosed in the quantifier ANY, ALL, or NONE, a specific number of the objects in the quantified link path must match the clause’s condition. In this example, ALL(Recipients) means that every Recipients value must meet the remainder of the clause. The Address link is quantified with ANY, and it is also filtered with a WHERE clause. This means that only Address objects whose Email ends with “gmail.com” are selected, and at least one of them (ANY) must meet the remainder of the clause. Without the quantifiers and filter, the link path is Recipients.Address.Person.Department, and it uses a contains clause searching for the term support. This means the Department field must contain the term “support” (case-insensitive). Textually, this query selects objects where all Recipients have at least one Address whose Email ends with “gmail.com” and whose linked Person’s department is in support.
The last example demonstrates transitive searches using the Person table. Because Employees is a reflexive link, we can search it recursively through the graph, in this case “down” the org chart. The carat (^) is the transitive function and causes the preceding link, Employees, to be searched recursively. The optional value in parentheses after the carat limits the recursion to a maximum number of levels. In this case, ^(4) says “recurse the link to no more than 4 levels”. Without the limit parameter, the transitive function searches the graph until a cycle or a “leaf” object is found.
The other context in which DQL can be used is aggregate queries, which select objects from a perspective table and performs statistical computations across those objects. This slide shows several example aggregate queries. These examples demonstrate the metric expression, grouping expression, and query expression components of aggregate queries. All examples use the Message table as the perspective.
The first example performs three computations across selected objects: (1) a COUNT of all objects, (2) the AVERAGE value of the Size field, and (3) the smallest (MIN) Birthdate found among objects linked via Recipients.Address.Person. All three statistical computations are made in a single pass through the data. Since this example contains no grouping expression, a single value is returned for each of these computations.
The second example demonstrates multi-level grouping, which divides objects into groups and performs the metric computations across each subgroup. In this example, a single metric function is computed: the unique (DISTINCT) Attachments.Extension values within each group. The grouping expression groups objects first by their Tags field and secondarily by the values found in the link path Recipients.Address.Person.Department. Because this example has a query expression, only those objects matching the selection expression are included in the aggregate computation.
The third example computes AVERAGE Size for all objects, grouped by a single-level grouping expression Recipients.Address.Email. The grouping expression is enclosed in a TOP(10) function, which means that only the groups with the top 10 metric values (average Size) are returned.
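The third example maps directly onto the REST parameters used later in this presentation (m for the metric, f for the grouping expression). A minimal Python sketch, assuming a placeholder application named MsgApp:

    import requests

    # AVERAGE(Size) across all objects, grouped by recipient email, returning
    # only the 10 groups with the highest average Size.
    resp = requests.get(
        "http://localhost:1123/MsgApp/Message/_aggregate",
        params={"m": "AVERAGE(Size)",
                "f": "TOP(10,Recipients.Address.Email)"})
    print(resp.text)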
Doradus supports a wide range of grouping functions to create aggregate query groups. Some of the more popular functions are summarized below:
BATCH: creates groups from explicit value ranges
TOP/BOTTOM: returns a limited number of groups based on their highest/lowest metric values
FIRST/LAST: returns a limited number of groups based on their highest/lowest alphabetical group names
INCLUDE/EXCLUDE: includes or excludes specific group values within a grouping level
UPPER/LOWER: Creates groups with case-insensitive text values
SETS: creates groups from arbitrary query expressions
TERMS: creates one group per term within a text field
TRUNCATE: creates groups from a timestamp value rounded down to a specific date/time granularity
See the Doradus documentation for a full description of all aggregate query grouping functions.
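For a feel of how these functions look in practice, here are a few grouping expressions as they might be passed via the f parameter. Only the TOP spelling is taken from this presentation’s own examples; the BATCH and TRUNCATE spellings are hypothetical, so check the documentation for exact syntax:

    # Grouping-expression sketches for the f parameter of an aggregate query.
    grouping_examples = [
        "TOP(5,Timestamp.HOUR) AS HourOfDay",  # from this presentation: top 5 groups
        "BATCH(Size,0,1000,10000)",            # hypothetical: explicit value ranges
        "TRUNCATE(Timestamp,HOUR)",            # hypothetical: round timestamps down
    ]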
Here we look at the Doradus internal architecture. The major internal building blocks are called services. Some services such as the Schema service are required, whereas some services such as Task Manager and REST services can be disabled when not needed. This allows “skinny” execution for embedded scenarios such as bulk load applications. Services are controlled via parameters specified in the doradus.yaml configuration file. All parameters can also be controlled at runtime via command-line arguments. The core DoradusServer component handles the startup and shutdown of all services.
Two service types are abstract, allowing configurable, concrete services to be dynamically selected. These pluggable service types are:
Storage: Storage services are responsible for mapping update and query requests to low-level rows and columns. That is, storage services organize data using techniques that are optimal for specific application scenarios. Three primary storage services are currently available: Spider, OLAP, and Logging, which are discussed in the next slide. But we are always experimenting with new services. Multiple storage services can be used at the same time in a single Doradus instance.
DB: The DB service maps row and column read/write requests to a specific underlying persistence API. For access to Cassandra, both Thrift and CQL services are available: Thrift is older but more performant; CQL has more advanced failover features. Very recently, a DynamoDB implementation was added to allow data to be persisted using Amazon Web Services (AWS) DynamoDB. Experimentally, we have also created DB service implementations that store data in flat files, SQL databases, and memory-only storage. In a given Doradus instance, only a single DB service type can be used.
Below is a summary of the other services currently used by Doradus:
REST: This service processes REST API requests using the Java servlet API. A Jetty server is bundled with Doradus and is the default web server, but Doradus can be hosted in another servlet-based web server such as Tomcat. The REST service is optional and can be disabled for embedded applications.
Schema: This service is responsible for processing schema requests such as initializing new schemas (called applications) and deleting existing ones. This service is required.
Task Manager: This service manages background tasks such as data aging and automatic merging (for OLAP). In a multi-instance cluster, tasks are distributed among all Doradus instances. The Task Manager can be disabled for embedded applications.
MBean: This service collects statistics on REST command usage and other metrics and makes them available via JMX. This service can be disabled when these metrics are not needed.
Tenant: This service provides multi-tenant features. It is a required service.
This slide highlights the feature differences between the Doradus storage services.
Doradus Spider is the first storage service we developed. It is still the most flexible service because it supports highly mutable data with fine-grained updates. Spider stores each object in a separate row and adds customizable indexing for each field. Its focus is object queries that require full text search features. Although Spider is the most flexible service, it is intended for moderate-sized databases (millions of objects) and moderate load rates (thousands to tens of thousands of objects/second).
Doradus OLAP is our workhorse and most mature storage service. It uses columnar storage and other techniques to compact data from many objects into a single row. OLAP has more restrictions than Spider: for example, all data fields must be pre-defined and data must be loaded in batches. However, OLAP yields very fast loading, high performance analytic queries, and dense storage usage.
Doradus Logging is our newest storage service intended for time-series log data. The Logging service has more restrictions than OLAP: for example, data cannot be modified once added, and link fields are not supported. However, the Logging service yields even faster loading than OLAP, slightly denser storage, and fast time-based analytic queries.
This slide compares the load and storage characteristics of the three storage services using a common data set consisting of 115 million events:
Spider takes 24 hours to load this data and requires over 100GB of storage.
OLAP loads the same data in only 15 minutes and creates a database that uses less than 2GB of space.
Logging loads the same data in only 5 minutes and uses even less space.
Now let’s look at Doradus OLAP more closely, starting with what motivated it. After we created Doradus Spider, our primary client application changed focus from full text object queries to statistical queries. The variability of search criteria meant that traditional indexing would not be sufficient. We needed to scan millions of objects per second, which was not possible using traditional storage techniques. Some out of the box thinking resulted in Doradus OLAP, which borrows from several disciplines:
Online Analytical Processing: OLAP is widely used in data warehouses, where data is transformed and stored in n-dimensional cubes.
Columnar databases: These databases store column values instead of row values in the same physical record.
NoSQL databases: NoSQL DBs commonly use sharding to co-locate data, especially time-series data.
Doradus OLAP combines ideas from these areas to provide a new storage service that is suitable for certain kinds of applications. Compared to the Spider storage manager, OLAP imposes some restrictions. But for applications that can use it, OLAP offers many advantages including:
Fast data loading: We’ve observed loads up to 1M objects/second on a single node in bulk-load scenarios.
Dense storage: Data is stored in a way that makes it highly compressible.
Fast merging: One of the typical drawbacks of data warehouses is the time required to update the database with new data. With Doradus OLAP, data is stored in cubes that can be updated quickly, often in a few seconds.
To accomplish this, Doradus OLAP uses no indexes. Note that OLAP is not a memory database: data can be orders of magnitude larger than available memory.
Let’s look at how data loading works with Doradus OLAP. The data loading sequence also highlights the criteria that applications must meet to effectively load data.
As with most database applications, there may be multiple data sources each with their own “velocity”. That means that some data may be generated quickly, perhaps continuously, whereas other data may change infrequently. Events, for example, may be collected in a continuous stream, whereas information about People may be collected from a directory server via a daily snapshot. It is up to the application to decide how often data is collected and loaded.
Doradus OLAP requires data to be loaded in batches. A single batch can mix new, modified, and deleted objects, and a batch can contain partial updates such as adding or modifying a single field. Each batch can update data in multiple tables.
The ideal batch size depends on many factors including how many fields are updated, but tests show that good batch sizes are at least a few hundred objects up to many thousands of objects.
As each batch is loaded, it identifies the shard to which it belongs. A shard is a partition of the database and can be visualized as a cube. A shard is typically a snapshot of the database for a specific time period. The most common shard granularity is “one day”, though finer and coarser shard granularities will be ideal for some applications. Each shard has a name: the name is not meaningful to Doradus, but the shard names are considered ordered. So, if you want to query a specific range of day shards, say March 1st through March 31st, a good idea is to name shards using the format YYYY-MM-DD. Then you can query for the shard range 2015-03-01 to 2015-03-31 and the appropriate shards are selected.
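A minimal Python sketch of this batching scheme, assuming one-day shards named YYYY-MM-DD. The batch-posting endpoint is a hypothetical placeholder; the real REST command for loading OLAP batches may differ:

    from collections import defaultdict
    from datetime import datetime
    import requests

    def shard_name(ts: datetime) -> str:
        # YYYY-MM-DD names sort chronologically, so a query can select the
        # shard range 2015-03-01 through 2015-03-31 by name alone.
        return ts.strftime("%Y-%m-%d")

    events = [
        {"EventID": 540, "Timestamp": datetime(2015, 3, 1, 8, 30)},
        {"EventID": 529, "Timestamp": datetime(2015, 3, 2, 9, 15)},
    ]

    # Accumulate events into one batch per one-day shard.
    batches = defaultdict(list)
    for event in events:
        batches[shard_name(event["Timestamp"])].append(
            {**event, "Timestamp": event["Timestamp"].isoformat()})

    for shard, batch in batches.items():
        requests.post(  # hypothetical endpoint for posting a batch to a shard
            f"http://localhost:1123/OLAPEvents/Events?shard={shard}",
            json={"batch": batch})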
When a batch is loaded, it is not immediately visible in the shard, which means its data is not returned in queries. Periodically, the shard must be merged, which causes its pending batches to be applied. You can think of a shard’s queryable data as the “live cube”: merging applies updates from batches, creating a new live cube. After merging, update batches are deleted. Merging can be requested manually or performed automatically in the background based on a schedule.
The frequency with which shards are merged affects the latency of the data. If shards are merged once per hour, queries will reflect data that is up to 1 hour old. Merging more often yields fresher data but consumes more resources.
As each shard is updated and merged, a “cube” is added to the OLAP store, which is a Cassandra ColumnFamily. A typical OLAP application is expected to have 100’s to 1000’s of shards. Inside, a shard consists of arrays that are designed for fast merging and fast loading during queries. This is a critical component of Doradus OLAP, so let’s take a closer look.
When batches are loaded, data is extracted and stored in minimally-sized arrays. For example, if an integer field is found to have a maximum value of 50,000, then a 2-byte array is used. Boolean fields use a bit array. Text values are de-duped and stored case-insensitively. Each value array is sorted in object ID order and then stored as a compressed row in Cassandra. Because each array contains homogeneous data (e.g., all integers), it compresses extremely well, often 90% or better. Rows that are too small to warrant compression are stored uncompressed. Multiple, small rows are joined into single large rows so that all values can be loaded with a single read. The idea is to use as little physical disk space as possible to allow a large number of values to be loaded at once.
When a shard is merged, all updates are combined with the shard’s current live data to create a new live data set. Since batch loading generates sorted arrays, the merge process consists of heap merging all arrays of the same table and field into a new array. Heap merging is very fast, so merging completes quickly.
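A toy Python sketch of the idea, using (object ID, value) pairs and the standard library’s k-way heap merge; the last-writer-wins rule for duplicate object IDs is an assumption for illustration:

    import heapq
    from operator import itemgetter

    # The live array plus two pending batches, each already sorted by object ID.
    live    = [(1, "a"), (4, "d"), (9, "x")]
    batch_1 = [(2, "b"), (9, "y")]   # updates object 9
    batch_2 = [(7, "q")]

    # heapq.merge streams all inputs in object-ID order without re-sorting.
    # With a key function the merge is stable, so for a duplicate ID the live
    # value arrives first and a later batch's value overwrites it.
    merged = {}
    for obj_id, value in heapq.merge(live, batch_1, batch_2, key=itemgetter(0)):
        merged[obj_id] = value
    new_live = sorted(merged.items())
    print(new_live)   # [(1, 'a'), (2, 'b'), (4, 'd'), (7, 'q'), (9, 'y')]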
In traditional data warehouse technology, the extract-transform-load (ETL) process is typically very lengthy, sometimes many hours. Since a Doradus OLAP shard is intended to hold millions of objects, you might therefore assume that the merge process takes a long time. However, the merge process is designed to be fast, often only a few seconds.
This graph shows the merge time required for an event tracking application (described later). In this load, 860 shards of varying sizes were loaded. For each shard containing 100,000 objects or more, the shard’s object count is plotted against its merge time. As the graph shows, merge time is directly proportional to the number of objects in the shard. For this application, the longest merge took just over 80 seconds, for a shard containing over 37 million objects. (Keep in mind that merge time is also affected by the number of tables and fields being merged, so your mileage may vary.)
Quick merge time means that Doradus OLAP allows data to be added and merged fairly often, allowing queries to access data that is close to real time.
Since Doradus OLAP uses no indexes, how are queries efficient? This slide shows how OLAP arrays are accessed during queries. The query searches the Message table and counts objects whose Size field falls between 1,000 and 10,000 and whose HasBeenSent flag is false. Furthermore, the query searches shards whose name falls between 2014-03-01 and 2014-03-31. Assuming we used 1-day shards, this requires searching 31 shards. Since the query accesses two fields, OLAP must read 62 rows: the value array for each field in each shard. This might sound like a lot of reads for a single query, but remember that shards typically contain a large set of objects. If our shards averaged 1 million objects each, this query scans 31 million objects with only 62 reads!
As each array is read, it is decompressed and scanned in memory. Value arrays are designed for fast scanning: modern processors can typically scan 10’s of millions of values per second. When a value array is scanned, an “object bit array” is generated to reflect the selected objects. Each additional value array scan turns on or off bits; the final bit array represents the results of the query.
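A toy Python version of this scan for a single shard, using the Size and HasBeenSent fields from the example above; in Doradus the arrays are packed and the operations are far more optimized, but the logic is the same in spirit:

    # One shard's value arrays: one entry per object, in object ID order.
    sizes         = [500, 2500, 9000, 120000]
    has_been_sent = [True, False, False, True]

    # Scan each array into an object bit array, then combine per-clause bits.
    size_bits = [1000 <= s <= 10000 for s in sizes]   # Size in [1000, 10000]
    sent_bits = [not sent for sent in has_been_sent]  # HasBeenSent = false
    result    = [a and b for a, b in zip(size_bits, sent_bits)]

    print(sum(result))   # per-shard COUNT; a 31-shard query sums 31 such counts
    # -> 2 (objects 1 and 2 match both clauses)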
On a “cold” system, where nothing is cached, each row read requires a physical read. However, in practice data is cached at multiple levels:
The operating system caches Cassandra data files (SSTables). Since data is compressed, a large amount of data is typically cached at this level.
Cassandra caches certain information such as recently-read rows, key indices, and bloom filters to speed-up access to recently-accessed data.
Doradus OLAP caches value arrays on a most-recently-used (MRU) basis, hence recently-accessed values are not re-read from Cassandra.
Since query results consist of compact bit arrays, these are also cached at the clause level on an MRU basis. Hence, recent repeated query clauses are reused and act as cached queries.
These caching levels produce natural “hot” and “warm” data pools that speed-up access to the most requested data.
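A sketch of the value-array cache idea in Python: recently used arrays stay resident, and the least recently touched entry is evicted when the cache fills. The (shard, table, field) key shape and the cache size are assumptions for illustration:

    from collections import OrderedDict

    class ValueArrayCache:
        def __init__(self, max_entries=1024):
            self.max_entries = max_entries
            self.entries = OrderedDict()   # key -> decompressed value array

        def get(self, key, load_fn):
            if key in self.entries:
                self.entries.move_to_end(key)      # mark as recently used
                return self.entries[key]
            if len(self.entries) >= self.max_entries:
                self.entries.popitem(last=False)   # evict least recently used
            self.entries[key] = load_fn(key)       # read + decompress from store
            return self.entries[key]

    cache = ValueArrayCache()
    array = cache.get(("2014-03-01", "Message", "Size"),
                      lambda key: [500, 2500, 9000])   # stand-in for a Cassandra read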
Let’s look at a real-world Doradus OLAP application: a Windows event tracking application. Shown is an example source event in CSV format. Each event consists of 10 fixed fields and 0 or more variable fields, called “insertion strings”. The number of insertion strings depends on the event’s ID.
In this example, a 540 event has 7 insertion strings, one of which is null. The index of the insertion strings is significant: that is, we must store the index of each insertion string as well as the value since each position is meaningful to the corresponding event.
Doradus OLAP requires predefined tables and fields. However, although our event data is variable in format, we can capture all event fields with a two-table schema: Events stores the 10 fixed fields of each event, and InsertionStrings stores the index and value of each variable field. The Events.Params link field and its inverse InsertionStrings.Event connect each event to its insertion string objects. This allows us to find events and navigate to its insertion strings or vice versa.
For this test, we loaded just under 115 million events. Since events average around 7.7 insertion strings each, this requires the creation of 880 million insertion string objects. Each event object is connected to its insertion string objects by populating the Params/Event links.
A standard benchmark we use is loading the 115M Events data set on a single node running both Doradus and Cassandra. The server is a Dell PowerEdge™ server with dual Intel® Xeon® CPUs and 48GB of memory. We configure Cassandra to use a maximum of 2GB of memory, and we configure Doradus, which runs embedded in the load application, to use a maximum of 8GB of memory. The load app uses 15 threads to load data in parallel.
The Events data set spans 860 calendar days, and we load data using one-day shards, which creates 860 shards. The bulk load application creates a total of over 994 million objects. On a single server, this data loads in about 15 minutes. This works out to an average load rate of 124K events per second or 1.1M total objects per second (Events + InsertionStrings).
When the Cassandra database is flushed and compacted, the “nodetool status” command reports that the entire database takes 1.89 GB.
In another test, we compared the space used by Doradus to the space used by a relational database for an auditing application. The SQL Server database required 8.5K per object whereas Doradus required only 87 bytes–a savings of almost 99%!
Doradus OLAP’s space savings result from columnar storage, compression, de-duping, and the lack of indexes.
Here is a simple DQL query that uses our events application. Because Doradus provides a REST API, we can just use a browser to query the database. The full REST URL (which assumes Doradus is running locally) is shown. The significant parts of the URL are highlighted:
The first node in the URI is the application name containing the data to be queried.
The second node is the table being queried.
The third node is the system resource name for aggregate queries, called _aggregate. (System resources begin with an underscore.)
As a query element of the URI, the m parameter defines the metric function to be computed; COUNT(*) simply counts all objects. The range parameter defines the range of shards to be queried; range=0 is shorthand for “all shards”.
This query counts all events in all 860 shards. It takes less than 10 seconds on a “cold” system and less than 1 second on a warm one.
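For completeness, the same count query issued from Python; the URL is exactly the one shown on the slide, with requests handling the parameter encoding:

    import requests

    resp = requests.get(
        "http://localhost:1123/OLAPEvents/Events/_aggregate",
        params={"m": "COUNT(*)", "range": "0"})   # range=0 selects all shards
    print(resp.text)   # <results> with <totalobjects> and <value>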
This example performs a typical analytical query. Suppose we’re trying to see if there’s a pattern of when certain privileged events fail. That is, we want to know if they mostly fail at the same time of day. The query looks for events where (1) the EventID belongs to a specific set of privileged event numbers, (2) the event Type field denotes a failure, and (3) the result code from the failed Windows operation (Index=8) has a specific event code (Value=‘(0x0,0x3E7)’). Since we are querying in a six month range, this query will search 181 shards.
Shown is the REST command for this query with the URI encoding details removed:
The metric parameter (m) is COUNT(*).
The shards are selected using a range parameter that selects the beginning and ending range.
The query expression (q) uses three clauses that specify the selection criteria. Notice that objects are selected in part based on having an object linked via the Params field whose Index is 8 and whose Value is the hex code we’re looking for.
The grouping parameter (f) is the HOUR component of the Timestamp field. Only the TOP 5 groups are returned, and the grouping expression is renamed HourOfDay for easier parsing in the query results, which are shown on the next slide.
BTW, these results can be returned in JSON by appending “&format=json” to the query.
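Here is the same command issued from Python, which also takes care of the URI encoding that was elided on the slide:

    import requests

    resp = requests.get(
        "http://localhost:1123/OLAPEvents/Events/_aggregate",
        params={
            "m": "COUNT(*)",
            "range": "2005-01-01,2005-06-30",   # 181 one-day shards
            "q": "Type='Failure Audit' AND EventID IN (577,681,529) "
                 "AND Params.WHERE(Index=8).Value='(0x0,0x3E7)'",
            "f": "TOP(5,Timestamp.HOUR) AS HourOfDay",
        })
    # Add "format": "json" to the parameters to get JSON instead of XML.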
Shown is a typical response to our “privileged event failures” query. The elements of this query result are:
aggregate: echoes the parameters used to create the query.
totalobjects: Shows the total number of objects that were selected by the query.
summary: This is a “rolled up” value of all groups. For this query, summary is the same value as totalobjects because the metric function is COUNT(*).
totalgroups: This indicates that data fell into a total of 17 groups. Since the grouping field was hour-of-day, this means 17 separate hours had occurrences of the requested privileged event failures.
groups: This outer element contains one group element for each group for which a metric computation was made. Because the grouping expression requested TOP(5), only the top 5 groups (those with the highest metric values) were returned.
group: For each group, the group’s metric value is returned along with the group’s identity, which is labeled HourOfDay because of the AS clause.
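A short sketch of pulling the groups out of this XML with the Python standard library, assuming resp holds the response from the query above:

    import xml.etree.ElementTree as ET

    root = ET.fromstring(resp.text)
    for group in root.iter("group"):
        hour = group.find("field").text     # <field name="HourOfDay">...</field>
        count = group.find("metric").text   # the group's COUNT(*) value
        print(f"hour {hour}: {count} failures")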
This shows that most of these failed events occur around 08:00, probably because people are logging into their computer at that time.
To summarize, the primary advantages of Doradus OLAP are:
The REST API is easy to use in all languages and platforms without requiring a specialized client library.
All fields are searchable without using specialized indexes.
The lack of indexes means Doradus OLAP is ideal for ad-hoc statistical queries.
DQL extends Lucene’s full text query language with graph-based query features that are much simpler than joins.
Fast data loading and shard merging allows an OLAP database to provide near real time data warehousing.
Because data is stored very compactly, less disk space is required, saving up to 99% compared to other databases. Combined with fast in-memory query processing, this means less hardware is required compared to other big data approaches. A single node will be sufficient for many analytics applications. But when necessary, the database can be scaled horizontally using Cassandra’s replication and sharding features.
Doradus OLAP works best for applications where:
Data is received as a continuous stream: events, log records, transactions, etc.
Data is structured (all fields are predefined) or semi-structured (variability can be accommodated as in the Events application in this presentation).
Data can be accumulated and loaded in batches.
Data can be partitioned into shards. Time-series data works the best, but strictly speaking other partitioning criteria can work.
Data is typically queried in a subset of shards. For time-sharded data, this means queries select specific time ranges.
The application primarily uses statistical queries. Full text queries are supported, but statistical queries are Doradus OLAP’s forte.
The Dell Software Group uses many open source software projects, and we’re pleased we can contribute back to the open source community. Doradus is open source under the Apache License 2.0, so everyone is free to download, modify, and use the code. Full source code and documentation are available from Github. Binary, source code, and Java doc bundles are available from Maven central. Comments and suggestions are more than welcome, so feel free to contact me.
I would also like to thank the sponsors of this O’Reilly webcast—Dell and Intel—whose support makes open source software and free webcasts such as this possible. Thank you!
- Randy Guck