This document provides an agenda for a presentation on integrating Apache Cassandra and Apache Spark. The presentation will cover RDBMS vs NoSQL databases, an overview of Cassandra including data model and queries, and Spark including RDDs and running Spark on Cassandra data. Examples will be shown of performing joins between Cassandra and Spark DataFrames for both simple and complex queries.
Apache Cassandra and Python for Analyzing Streaming Big Data (prajods)
This presentation was made at the Open Source India Conference Nov 2015. It explains how Apache Spark, pySpark, Cassandra, Node.js and D3.js can be used for creating a platform for visualizing and analyzing streaming big data
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra (Piotr Kolaczkowski)
We present the basic functionality of the official DataStax spark-cassandra connector: how to load Cassandra tables as Spark RDDs and how to save Spark RDDs to Cassandra.
Time series with Apache Cassandra - Long version (Patrick McFadin)
Apache Cassandra has proven to be one of the best solutions for storing and retrieving time series data. This talk will give you an overview of the many ways you can be successful. We will discuss how the storage model of Cassandra is well suited for this pattern and go over examples of how best to build data models.
Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial (Natalino Busa)
Today’s services rely on massive amounts of data, but must remain fast and responsive at the same time. Building fast services on big data batch-oriented frameworks is definitely a challenge. At ING, we have worked on a stack that alleviates this problem: we materialize the data model by map-reducing Hadoop queries from Hive to Cassandra. Instead of sinking the results back to HDFS, we propagate them into Cassandra key-value tables. Those Cassandra tables are finally exposed via an HTTP API front-end service.
Analyzing Time Series Data with Apache Spark and Cassandra (Patrick McFadin)
You have collected a lot of time series data, so now what? It's not going to be useful unless you can analyze what you have. Apache Spark has become the heir apparent to MapReduce, but did you know you don't need Hadoop? Apache Cassandra is a great data source for Spark jobs! Let me show you how it works, how to get useful information and, the best part, storing analyzed data back into Cassandra. That's right. Kiss your ETL jobs goodbye and let's get to analyzing. This is going to be an action-packed hour of theory, code and examples, so caffeine up and let's go.
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk... (Spark Summit)
Scaling out doesn’t have to mean giving up transactions and efficient joins! Relational databases can scale horizontally, and using them as a store for Spark Streaming or batch computations can help cover areas in which Spark is typically weaker. Examples will be drawn from our experience using Citus (https://github.com/citusdata/citus), an open-source extension to Postgres, but lessons learned should be applicable to many databases.
Geospatial and bitemporal search in Cassandra with pluggable Lucene index (Andrés de la Peña)
Stratio presented its open source Lucene-based implementation of Cassandra’s secondary indexes at Cassandra Summit London 2014, which provided several search engine features. It used to be distributed as a fork of Apache Cassandra, which was a huge problem both for users and maintainers. Nowadays, due to some changes introduced at C* 2.1.6, we are proud to announce that it has become a plugin that can be attached to the official Apache Cassandra. With the plugin we have been able to provide C* with geospatial capabilities, making it possible to index geographical positions and perform bounding box and radial distance queries. This is achieved through Lucene’s geospatial module. Another feature we have provided with our plugin is the possibility of indexing bitemporal data models, which distinguish between system time and business time. This way, it is possible to make queries over C* such as “give me what the system thought in a certain instant about what happened in another instant”. The implementation combines range prefix trees with the 4R-Tree approach exposed by Bliujūtė et al. Full-text, geospatial and bitemporal queries can all be combined with Apache Spark to avoid systematic full scans, dramatically reducing the amount of data to be processed.
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, DataStax)
Cassandra is awesome for many things. One of the things it's awesome for is Time Series. Combining the power of Cassandra with the APIs of existing Time Series tools, such as Graphite, can yield interesting results.
Cyanite is a Time Series aggregator and store built on top of Cassandra. It's fully compatible with Graphite and can serve as a plug-in replacement for Graphite and Graphite web.
Cyanite uses SASI indexes for glob metric-path queries, and can query, aggregate, store, display and analyse metrics from hundreds or thousands of servers.
We'll look at which data modelling practices work best for Time Series, and which new Cassandra features you can use to make your Time Series analysis better.
About the Speaker
Alex Petrov Software Engineer, DataStax
Polyglot programmer. Interested in algorithms, distributed systems, algebra and high performance solutions.
Many people promise fast data as the next step after big data. The idea of creating a complete end-to-end data pipeline that combines Spark, Akka, Cassandra, Kafka, and Apache Mesos came up two years ago, sometimes called the SMACK stack. The SMACK stack is an ideal environment for handling all sorts of data-processing needs, whether nightly batch-processing tasks, real-time ingestion of sensor data, or business intelligence queries. The SMACK stack includes a lot of components that have to be deployed somewhere. Let’s see how we can create a distributed environment in the cloud with Terraform and how we can provision a Mesos cluster with Mesosphere Datacenter Operating System (DC/OS) to create a powerful fast data platform.
DataStax and Esri: Geotemporal IoT Search and Analytics (DataStax Academy)
Internet of Things (IoT) data frequently has a location and time component. Getting value out of this "geotemporal" data can be tricky. We'll explore when and how to leverage Cassandra, DSE Search and DSE Analytics to surface meaningful information from your geotemporal data.
Unified Big Data Processing with Apache Spark (C4Media)
Video and slides synchronized, mp3 and slide download available at http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
We hear a lot about lambda architectures and how Cassandra and Spark can help us crunch our data both in batch and real-time. After a year in the trenches, I'll share how we at The Weather Company built a general purpose, weather-scale event processing pipeline to make sense of billions of events each day. If you want to avoid much of the pain learning how to get it right, this talk is for you.
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas (Data Con LA)
Sig Narvaez, Executive Solution Architect at MongoDB
MongoDB is now a Developer Data Platform. Come learn what's new in the 6.0 release and Atlas following all the recent announcements made at MongoDB World 2022. Topics will include:
- Atlas Search which combines 3 systems into one (database, search engine, and sync mechanisms) letting you focus on your product's differentiation.
- Atlas Data Federation to seamlessly query, transform, and aggregate data from one or more MongoDB Atlas databases, Atlas Data Lake and AWS S3 buckets
- Queryable Encryption lets you run expressive queries on fully randomized encrypted data to meet the most stringent security requirements
- Relational Migrator which analyzes your existing relational schemas and helps you design a new MongoDB schema.
- And more!
Best Practices for Supercharging Cloud Analytics on Amazon Redshift (SnapLogic)
In this webinar, we discuss how the secret sauce of your business analytics strategy remains rooted in your approach, methodologies and the amount of data incorporated into this critical exercise. We also address best practices to supercharge your cloud analytics initiatives, and tips and tricks on designing the right information architecture, data models and other tactical optimizations.
To learn more, visit: http://www.snaplogic.com/redshift-trial
This spring, the data warehouse team at Ancestry, flawlessly migrated and validated nearly half a trillion records from Actian Matrix to Amazon Redshift. During this session, the Ancestry team will describe how they orchestrated the entire migration in less than four months, the technical challenges they faced and overcame along the way, as well as share tips and tricks to break through common pitfalls of data warehouse migrations. They will also highlight how they tuned and optimized the Amazon Redshift environment, adopted Redshift Spectrum, and how they leverage their collaboration with Amazon to deliver a powerful customer experience.
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016 (DataStax)
Most web applications start out with a Postgres database, and it serves the application very well for an extended period of time. Depending on the type of application, the data model will have a table that tracks some kind of state for either objects in the system or the users of the application. Common names for this table include logs, messages or events. The growth in the number of rows in this table is not linear as traffic to the app increases; it's typically exponential.
Over time, the state table will increasingly become the bulk of the data volume in Postgres, think terabytes, and become increasingly hard to query. This use case can be characterized as the one-big-table problem. In this situation, it makes sense to move that table out of Postgres and into Cassandra. This talk will walk through the conceptual differences between the two systems, a bit of data modeling, as well as advice on making the conversion.
About the Speaker
Rimas Silkaitis Product Manager, Heroku
Rimas currently runs Product for Heroku Postgres and Heroku Redis but the common thread throughout his career is data. From data analysis, building data warehouses and ultimately building data products, he's held various positions that have allowed him to see the challenges of working with data at all levels of an organization. This experience spans the smallest of startups to the biggest enterprises.
NoSQL - MongoDB. Agility, scalability, performance. I am going to talk about the basics of NoSQL and MongoDB. Why do some projects require an RDBMS and others a NoSQL database? What are the pros and cons of NoSQL vs. SQL? How is data stored and transferred in MongoDB? What query language is used? How does MongoDB support high availability and automatic failover with the help of replication? What is sharding and how does it help to support scalability? The newest levels of concurrency: collection-level and document-level.
Jump Start with Apache Spark 2.0 on Databricks (Databricks)
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
NoSQL Application Development with JSON and MapR-DB (MapR Technologies)
NoSQL databases are being used everywhere by startups and Global 2000 companies alike for data environments that require cost-effective scaling. These environments also typically need to represent data in a more flexible way than is practical with relational databases.
NoSQL Strikes Back (An introduction to the dark side of your data)
A long time ago in a database far, far away...
SQL was the only option to save vast amounts of application data for a long period of time. There were always some rebellion activities, to overcome the SQL Empire, which brought a new hope, but all other ways of storing data were never more than a phantom menace.
Now Cosmos DB awakens and is ready for the revenge of the NoSQL.
During this talk, we will have a look at what Azure Cosmos DB is, what you can achieve with its possibilities and how to use it in a galactic environment of data and applications.
Join me and find your way to the right solution for your application.
May the data be with you!
Data warehousing is a critical component for analysing and extracting actionable insights from your data. Amazon Redshift allows you to deploy a scalable data warehouse in a matter of minutes and start analysing your data right away using your existing business intelligence tools.
Your Timestamps Deserve Better than a Generic Database (Javier Ramirez)
If you are storing records with a timestamp in your database, it is very likely a time series database can make your life easier.
However, time series databases are still the great unknown for a large part of the tech community.
In this talk, I will show you what use cases they are good for, what they give you that you cannot get from a traditional database, and when it is a good idea (and when it is not) to use them.
For the demos, we will be using QuestDB, the fastest open-source time series database.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Notes on adjusting primitives for graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list-based graph representation. The experiments cover:
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
2. Big Data Era
• Online applications
• Internet of Things
• Big Data:
– Data Velocity
– Data Variety
– Data Volume
– Data Complexity
3. RDBMS vs NoSQL
• RDBMS relates to ACID
– A: Atomicity
– C: Consistency
– I: Isolation
– D: Durability
– All four guarantees hold
• NoSQL relates to CAP
– C: Consistency
– A: Availability
– P: Partition Tolerance
– Pick 2 out of 3
5. NoSQL
• Shared Nothing
– remove dependency between the scaling units
– private memory and peripheral disks
• Master-less Architecture
– Most NoSQL databases offer a master-master data replication strategy
(some exceptions may apply, e.g. Redis)
6. What is Apache Cassandra
• Open Source DDBMS
• Initially developed at Facebook, 2008
• Developed in Java
• Combination of
Dynamo (Amazon) – architecture principles
BigTable (Google) – SSTable design
• DataStax Enterprise commercial distribution
7. Why Cassandra?
• Storing Huge Datasets – Elastic Scalability
• Multi – Master Replication
• High Availability – No SPOF
• Eventual Consistency
• Flexible Data Model
• Locality
• Very high write throughput
8. Gossip & Seeds
• Gossip
– Protocol that discovers location and state information
– Timer runs every second
– Info: onJoin, onAlive, onDead, onChange
• Seeds
– Used for bootstrapping other nodes
9. Data Distribution & Replication
• In Cassandra data distribution and replication go together
• Replication is affected by:
1. Consistent Hashing & Partitioners
Data partitioning methodology across the cluster
2. Replication Strategy
Determines the replicas of each row of data
3. Snitch
Defines topology information for replica placement
4. Virtual Nodes
Assign data ownership to physical machines
10. Consistent Hashing
• Distributes the data across the cluster
• Partitions the data based on the partition key of each row
Row Key | Hashed Value | Node
John | 7723358927203680754 | D
Andrew | -6723372854036780875 | A
Mike | 1168604627387940318 | C
11. Partitioners
• Defines the Hash function for Consistent Hashing
• Compute the token for each row key
Types
Murmur3Partitioner
Values: [ -2^63 … +2^63 - 1 ]
Random Partitioner
Values: [ 0 … 2^127 - 1 ]
ByteOrderedPartitioner (Not Recommended)
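To make the hashing concrete, here is a toy consistent-hash ring in Python. It is only a sketch: it derives an MD5-based token in the RandomPartitioner range [0, 2^127 - 1] rather than using Cassandra's real Murmur3 function, and the node names and row keys are just illustrative.

import hashlib
from bisect import bisect_right

TOKEN_SPACE = 2 ** 127  # RandomPartitioner-style token range [0, 2^127 - 1]

def token(key: str) -> int:
    # Hash the partition key to a token (illustrative; real C* uses Murmur3).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % TOKEN_SPACE

# Four nodes; each owns the arc of the ring up to its own token.
ring = sorted((token(name), name) for name in ("A", "B", "C", "D"))
tokens = [t for t, _ in ring]

def node_for(key: str) -> str:
    # First node whose token is >= the key's token, wrapping past the end.
    return ring[bisect_right(tokens, token(key)) % len(ring)][1]

for row_key in ("John", "Andrew", "Mike"):
    print(row_key, "->", node_for(row_key))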
12. Data Replication
• Replication Factor
Number of replicas across the cluster (e.g. RF = 2, RF = 4)
• Replication Strategies
Simple Strategy
Single Data Center
Replicas are placed clockwise – no network topology into account
replication = {'class' : 'SimpleStrategy', 'replication_factor':3};
Network Topology Strategy
Multiple Data Centers
Replicas are placed in different racks
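A minimal sketch of both strategies issued from the Python driver. The local contact point and keyspace names are assumptions; with NetworkTopologyStrategy, the data center names must match what the snitch reports.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# Single data center: replicas placed clockwise around the ring.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo_simple
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# Multiple data centers: a replica count per named DC.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo_multi_dc
    WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2}
""")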
13. Snitch
• Affects where the replicas are placed
• Determines the data centers and the racks that the Machines belong to
• 9 Different Snitches
Simple Snitch: Single Data Center
Gossiping Property File Snitch: automatic update for new nodes via
gossip – production recommended
Property File Snitch: Location of nodes determined by rack and data
center
EC2 Multi Region Snitch: Amazon Web Services
Google Cloud Snitch: Multiple Regions
14. Cassandra Virtual Nodes
Ring without VNodes
Contiguous token
Contiguous data range
One large range
Ring with Vnodes
Non-contiguous token
Non-contiguous data range
Many smaller ranges
Why Vnodes
Even distribution of data
Faster rebuild of a node failure
16. Client Requests
Client
Running application, read/write requests
(Java, Python, C++, PHP, etc.)
Coordinator
• Handles the requests
• Finds Nodes based on Partitioner and Replica Placement Strategy
• Any Node can act as the Coordinator
17. Consistency Levels Write and Read
• Tunable consistency
• Specify how many replicas must respond to consider an operation a success
LEVEL WRITE READ
One X X
Two X X
Three X X
Quorum X X
Any X
All X X
Each_Quorum X X
Local_Quorum X X
Local_One X X
QUORUM = floor(RF/2) + 1
R + W > N
R: needed #replicas for read
W: needed #replicas for write
N: replication factor
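Worked example: with RF = 3, QUORUM = floor(3/2) + 1 = 2, so QUORUM writes plus QUORUM reads give R + W = 2 + 2 = 4 > 3, and every read quorum overlaps the latest write quorum. A sketch of setting the level with the Python driver; the contact point is an assumption and the table comes from a later slide.

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("highway")

# R + W > N: QUORUM (2 of 3 replicas) on the read path; use the same level on writes.
stmt = SimpleStatement(
    "SELECT speed FROM street_monitoring WHERE onstreet=%s AND year=%s AND month=%s",
    consistency_level=ConsistencyLevel.QUORUM)
rows = session.execute(stmt, ("I-10", 2015, 3))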
19. C* Data Model
The data model is Cassandra's backbone for efficient queries
Look at the queries first
First Concepts
Keyspace: similar to a relational schema, contains Column Families
Column Family (CF): similar to a relational table, contains the data
Super Column (SC): a column whose value is itself a collection of columns
20. C* Data Model
Next Concepts
– Column-based key value store (a multi-level dictionary)
– Think of it as a JSON representation, or Map[String, Map[String, Data]]
– SST: Sorted String Table
Column Family → Keys → Columns → Values:
{"Street Monitor":
  {"Hollywood":    { "avg_speed": 75, "vehicles": 45, "time": "2015-03-02 09:35" },
   "Santa Monica": { "avg_speed": 35, "vehicles": 50, "time": "2015-03-02 10:35" }
  }
}
21. C* PRIMARY KEY
Last Concept – Primary Key:
– Remember the JSON format
– Storage is a 2-level nested HashMap
– A table/column family has a Primary Key which consists of
Level 1: keyed by the partition key
Level 2: keyed by the clustering key, mapping to the data
Partition Key: responsible for hashing the data to the corresponding physical machine. Make it random to have evenly distributed datasets.
Clustering Key: responsible for ordering the data inside the table.
22. C* PRIMARY KEY
Types of C* Primary Key:
1. Compound Key
Exactly one Partition Key
e.g. PK(Partition_key1)
PK(Partition_key1, Clustering_key1)
PK(Partition_key1, Clustering_key1, Clustering_key2, …)
2. Composite Key
Two or more Partition Keys, careful with the syntax
e.g. PK( ( Partition_key1, Partition_key2 ) )
PK( ( Partition_key1, Partition_key2, … ), Clustering_key1, Clustering_key2, … )
Skinny Rows & Wide Rows
Skinny rows: the Primary Key contains only the partition key
Wide rows: the Primary Key contains columns other than the partition key (both shapes are sketched below)
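A minimal illustration of the two row shapes, sketched with hypothetical tables in the highway keyspace (reusing the driver session from the earlier keyspace sketch):

# Skinny rows: primary key = partition key only, so one row per partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS highway.sensor_info (
        sensor_id int PRIMARY KEY,
        region varchar
    )""")

# Wide rows: a clustering column orders many rows inside each partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS highway.sensor_readings (
        sensor_id int,
        event_time timestamp,
        speed int,
        PRIMARY KEY (sensor_id, event_time)
    )""")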
23. BDTS Example
Keyspace: highway
client.CreateTable ("""
CREATE TABLE IF NOT EXISTS highway.street_monitoring (
ONSTREET varchar,
YEAR int,
MONTH int,
DAY int,
TIME int,
POSTMILE float,
DIRECTION int,
FROMSTREET varchar,
TOSTREET varchar,
SPEED int,
VOLUME int,
OCCUPANCY int,
HOVSPEED int,
PRIMARY KEY ( ( ONSTREET, YEAR, MONTH ), DAY, TIME, POSTMILE, DIRECTION )
);
""")
client.CreateIndex("highway","street_monitoring",”SPEED")
client.CreateIndex("highway","street_monitoring","FROMSTREET")
client.CreateIndex("highway","street_monitoring","TOSTREET")
client.CreateTable ("""
CREATE TABLE IF NOT EXISTS highway.regional_monitoring (
REGION varchar,
SPEED int,
VOLUME int,
OCCUPANCY int,
HOVSPEED int,
YEAR int,
HH int,
MONTH int,
DAY int,
SENSOR_ID int,
PRIMARY KEY ( ( REGION,YEAR,MONTH ),HH, DAY, SENSOR_ID )
);
""")
client.CreateIndex("highway","regional_monitoring",”SPEED")
Partition Key: Onstreet, Year, Month
Clustering Key: Day, Time, Postmile, Direction
Partition Key: Region, Year, Month
Clustering Key: HH, Day, Sensor_id
24. C* Queries
• Pure CQL does not support:
JOINs, subqueries, or GROUP BY; ORDER BY works only on clustering columns
• No aggregate functions in Cassandra 2.0.x; later versions add them
• A key column can only be restricted if the preceding key columns are also restricted
Guidelines:
Partition key columns support the = operator
The last column in the partition key supports the IN operator
Clustering columns support the =, >, >=, <, and <= operators
Secondary index columns support the = operator
Query 1 – Some parts of the partition key
SELECT * FROM highway.street_monitoring WHERE onstreet='I-10' AND month=3
Error Message: cassandra.InvalidRequest: code=2200 [Invalid query] message="Partition key part year must be restricted since preceding part is"
Query 2 – Full partition key
SELECT * FROM highway.street_monitoring WHERE onstreet='I-10' AND year=2015 AND month IN (2,4)
Error Message: None
25. C* Queries
Query 3 – Range on partition key
SELECT * FROM highway.street_monitoring WHERE onstreet='I-10' AND year=2014 AND month<=1
Error Message: cassandra.InvalidRequest: code=2200 [Invalid query] message="Only EQ and IN relation are supported on the partition key (unless you use the token() function)"
Query 4 – Only secondary index
SELECT * FROM highway.street_monitoring WHERE day>=2 AND day<=3
Error Message: Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
Query 5 – Range on secondary index
SELECT * FROM highway.street_monitoring WHERE onstreet='47' AND year=2014 AND month=2 AND day=21 AND time>=360 AND time<=7560 AND speed>30
Error Message: cassandra.InvalidRequest: code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
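For completeness, here is Query 2, the only valid query above, issued through the Python driver (which is where the cassandra.InvalidRequest messages in this deck come from); the contact point is an assumption.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("highway")

# The full partition key is restricted with = / IN, as the guidelines require.
rows = session.execute(
    "SELECT * FROM street_monitoring "
    "WHERE onstreet=%s AND year=%s AND month IN (2,4)",
    ("I-10", 2015))
for row in rows:
    print(row.day, row.time, row.speed)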
26. Cassandra Configurations
{CASSANDRA_HOME}/conf/cassandra.yaml
listen_address: < internal IP address for rest of the nodes – communication, gossip />
broadcast_address: < external IP when deployed in multiple regions />
rpc_address: < address for drivers access – internal IP, hostname/>
seeds: < addresses of seed nodes />
commitlog_directory: < commitLog is written sequentially all the time, affected by the seek time/>
saved_caches_directory: < tables keys and row caches are stored />
data_file_directories: < SST tables, holds all data written to the nodes />
{CASSANDRA_HOME}/conf/cassandra-env.sh
MAX_HEAP_SIZE
sets the maximum heap size for the JVM
Default 1 GB; do not set it too high (max 8 GB)
HEAP_NEWSIZE
new generation size; a good guideline is 100 MB per CPU core
27. C* Last Call
• Real-Time Data – Clustering
CREATE TABLE latest_temperatures (
weatherstation_id text,
event_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
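Because rows within a partition are stored newest-first (event_time DESC), fetching the latest reading is just the first row of the partition. A hedged sketch, reusing a driver session as above and a made-up station id:

# Newest reading first, thanks to CLUSTERING ORDER BY (event_time DESC).
row = session.execute(
    "SELECT event_time, temperature FROM latest_temperatures "
    "WHERE weatherstation_id=%s LIMIT 1",
    ("station-42",)).one()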
28. Big Data Analytics Stack - BDAS
Origin Berkeley AMP Lab
Multiple Packages
Multiple Data Sources
Spark is the Kernel of Functionality
29. What is Spark
• An open-source cluster computing platform for fast
and general purpose large-scale data analytics
• In-memory computations
• Software suite
• Built in Scala
• Highly Accessible (Java, Python, Scala, SQL APIs)
• Started in 2009 at the Berkeley AMP Lab
• Master-Slave Architecture
• Still developing, numerous contributors
30. Spark vs Hadoop
Hadoop Issues
1. Data Replication
2. Disk I/O
3. Serialization increases execution time
Performance Degradation when applying:
1. Iterative Jobs
2. Interactive Analytics
31. Spark vs Hadoop
• Spark Characteristics against Hadoop MR
Data Reuse
Interactive data analytics
Ad-hoc queries
Iterative algorithms
( Machine Learning Algorithms - MLlib
Graph Processing Algorithms – GraphX )
Real-time data flow processing
( Spark Streaming )
Faster: up to 100x in memory, 10x on disk
32. Daytona Competition 2014
Goal: sort 100 TB of data
Hadoop: generates 3100 GB/s of disk I/O; time: 300% of Spark's
Spark: generates 500 GB/s of disk I/O
All the sorting took place on disk (HDFS), without using Spark's in-memory cache
http://sortbenchmark.org/
33. RDDs
• Stands for Resilient Distributed Datasets
Definition
An RDD is an immutable, in-memory collection of objects. Each RDD is split into multiple partitions, which in turn are computed on different nodes of the cluster.
RDDs can be created:
(1) from external datasets, or
(2) by parallelizing collections
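The deck's examples are Scala, but the same two creation paths look like this in PySpark (the input path is a hypothetical file):

from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-demo")

# (1) External dataset: each input split becomes a partition.
visits = sc.textFile("visits.txt")

# (2) Parallelized collection: distribute a driver-side list across the cluster.
nums = sc.parallelize([1, 2, 3, 4], numSlices=2)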
34. RDDs Operations
• Two Distinct Important Operations
transformations: return a new RDD
actions: return final value
Scala code:
val visits = sc.textFile("hdfs:// … ")   // one visited URL per line
/* transformation */
val counts = visits.map(url => (url, 1))
                   .reduceByKey((a, b) => a + b)
/* action */
counts.collect()
• Lazy Evaluation
Spark starts executing only when an action is called; internally it stores metadata describing how to compute each transformation's result.
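The same pipeline in PySpark terms, to show where laziness kicks in: the transformations only record lineage metadata, and nothing reads the file until collect() runs (sc and visits.txt as in the earlier sketch).

visits = sc.textFile("visits.txt")                 # nothing is read yet
counts = (visits.map(lambda url: (url, 1))         # transformation: recorded only
                .reduceByKey(lambda a, b: a + b))  # transformation: recorded only
result = counts.collect()                          # action: triggers the whole job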
35. Spark Fault Tolerance
• RDD lineage
Spark logs information for different RDDs
Information is derived from transformations (e.g. map, filter,
join)
Crucial for data recovery upon partition failure
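That logged lineage is directly inspectable; in PySpark, toDebugString() prints one line per upstream RDD, which is exactly the chain Spark replays to rebuild a lost partition.

# Show the recorded lineage of the word-count RDD from the previous sketch.
print(counts.toDebugString().decode())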
36. Spark Runtime
Driver
• Central Coordinator
• Creates a Spark Context
• Where the main() method runs
• Creates RDDs
• Performs RDDs transformations
and actions
Cluster Manager
Manages Cluster Resources
Cluster Worker
Contains the Spark worker, Executor
Driver + Executors = Spark Application
37. Spark Driver Duties
1. Convert User Program into Tasks
2. Schedule Tasks on Executors
• The Driver is responsible for
coordinating the individual tasks
on the Executors
• Checks the Executors and delivers
the tasks to the appropriate
location
• Tasks from different Applications
run on different JVMs
38. Executors
• Properties
Worker Processes
Launch at the start
Die when application ends
• Mission
1. Run individual Tasks &
return results to the Spark Driver
2. In memory storage for the RDDs
[ .cache() | .persist() ]
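Executor-side caching from the API, sketched in PySpark: persist() (or cache()) asks the executors to keep the computed partitions in memory, so the second action below skips recomputation.

squares = sc.parallelize(range(100000)).map(lambda x: x * x)
squares.persist()        # mark for in-memory storage on the executors

total = squares.sum()    # first action: computes and caches the partitions
count = squares.count()  # second action: served from the executors' cache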
39. Spark Cluster Managers
• Spark Standalone Scheduler
FIFO scheduling
• For multi-tenancy systems:
Hadoop YARN
recommended when dealing with HDFS
for fast access due to nodes locality
Apache Mesos
fine-grained: static memory, dynamic cores
coarse-grained: static memory and cores
40. SPARK Installation
Key Concepts:
1. Spark versions: 1.2.1 (stable), 1.3.1 (latest release), plus earlier versions
2. Apache Maven
3. Scala Build Tools (sbt)
4. Hadoop Version
HDFS protocol compatible across versions (e.g. 2.2.x, 2.3.x, default 1.0.4)
Apache Maven and SBT are required for configuring any dependencies or plugins required
for project installation (add new plugins, handle exceptions)
Examples:
Maven changes apply to:
{SPARK_HOME}/pom.xml
SBT changes apply to:
{SPARK_HOME}/sbt/sbt
44. Cassandra JOINS Spark
From Simple to Very Complex
SIMPLE:
val cc = new CassandraSQLContext(sc)
val config = sc.cassandraTable("highway", "highway_config")
  .select("config_id", "agency", "link_id")
  .where("onstreet = ?", "I-605")
val joined = config.joinWithCassandraTable("highway", "highway_history")
  .select("speed")
  .on(SomeColumns("config_id", "agency", "link_id"))
joined.collect()
COMPLEX:
val hists = sc.cassandraTable("highway", "highway_history")
  .where("config_id = ?", "85")
  .where("agency = ?", "Caltrans-D7")
  .where("event_time = ?", "2014-04-03 02:47")
val configs = sc.cassandraTable("highway", "highway_config_metrics")
  .where("onstreet = ?", "SR-60")
  .where("fromstreet = ?", "EUCLID")
val histsKeyBy = hists.map(f =>
  ((f.getInt("config_id"), f.getString("agency")),
   (f.getString("event_time"), f.getInt("occupancy"), f.getInt("speed"),
    f.getInt("volume"), f.getInt("hovspeed"))))
val configsKeyBy = configs.map(f =>
  ((f.getInt("config_id"), f.getString("agency")),
   (f.getString("event_time"), f.getString("onstreet"),
    f.getString("fromstreet"), f.getInt("direction"))))
val joined = histsKeyBy.join(configsKeyBy)
joined.collect()
val speed_avg = joined.map(x => x._2._1._3).mean()  // speed is the 3rd field of the history tuple