Data Modeling with Cassandra Column Families

gdusbabek

Slide notes I used for my presentation at ICOODB 2010.

Needle, Meet Haystack: Adapting your data models for Cassandra. Gary Dusbabek • Rackspace • ICOODB 2010

Data Modeling with Cassandra Column Families

1. Needle, Meet Haystack: Adapting your data models for Cassandra. Gary Dusbabek • Rackspace • ICOODB 2010

2. Outline: First Things First; Column Families; Trade-Offs; Procedures & Best Practices; Internals

3. It’s all about scalability

4. We can all be friends

5. Column Families

10.

11.

12.

13.

14. 2. Trade-Offs

15. No Transactions

16. No Ad Hoc Queries

17. No Joins

18. No Flexible Indexes

19. Don’t Panic!

20. Scalability, Availability, Replication & Backup

21. 3. Procedures & Practices

22. Relational Way: Define entities; Normalize; Identify many-to-many; Query any way you want

23. How Come? Scarcity; Efficiency

24. Cassandra Way: Know your app; Queries first; Denormalize

25. Know Your App

26. Queries First

27. Nobody is Normal

28. Relational Example

29. Column Family Example

30. Column Family Example

31. Column Family Example

32. Column Family Example

33. Does it feel strange?

34. 4. Internals

35. Sequential Writes Always

36. Consistency Level

37. Partitioning

38. Slices; Data Locality

39.

40. ColumnFamilies != Relational tables

41. Trade-offs: you win some, you lose some

42. Know your application

43. Queries first

44. Denormalization is OK

45.

46. Image Credits
haystack http://www.flickr.com/photos/james_lumb/3921968993
pyramids http://www.flickr.com/photos/gracewong/93631410
scales http://www.flickr.com/photos/eflon/3465042138
friends http://www.flickr.com/photos/ngmmemuda/4166182931
television http://www.flickr.com/photos/angelrravelor/314306023
columns http://www.flickr.com/photos/nostri-imago/3564300653
devil http://www.flickr.com/photos/52890443@N02/4887855756
angel http://www.flickr.com/photos/75001512@N00/4938623021
transaction http://www.flickr.com/photos/neubie/2273635564
queries http://www.flickr.com/photos/-bast-/349497988
rings http://www.flickr.com/photos/baldur/4395738741
indexes http://www.flickr.com/photos/waferboard/4137041591
panic http://www.flickr.com/photos/pasukaru76/3998981988
procedures "The Anatomy Lesson of Dr. Nicolaes Tulp" by Rembrandt
relational http://www.flickr.com/photos/35536700@N07/3292544674
desert http://www.flickr.com/photos/waldenpond/4252575735
jet http://www.flickr.com/photos/rmahle/709685
queries http://www.flickr.com/photos/andreanna/2812118063
blackboard http://www.flickr.com/photos/shonk/418180402
normal http://www.flickr.com/photos/infrogmation/3180606117
phonograph http://www.flickr.com/photos/shiyazuni/4770244591
dodo http://www.flickr.com/photos/wheatfields/2071347416
Internals http://www.flickr.com/photos/37hz/4057856826
writing http://www.flickr.com/photos/stevendepolo/3877225152
consistency http://www.flickr.com/photos/betsyweber/4962297050
partitioning http://www.flickr.com/photos/featheredtar/3137028766
slices http://www.flickr.com/photos/free-stock/4899674517
summary http://www.flickr.com/photos/jkdsphotography/4061838798
links http://www.flickr.com/photos/creative_stock/3397559016

Editor's Notes

It's not about one data model vs. another. It's not about one storage engine vs. another. Cassandra excels at replicating data and achieving high sustained write throughput.
The right tool for the right job.
Shaped by distribution model
Sparse: columns do not have to exist in every row.
Flexible column naming. You define the sort order. A row is not required to have a specific column just because another row does.
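
To make the two notes above concrete, here is a minimal sketch in plain Python. It is a toy model, not Cassandra's client API: the `users` and `events` rows and the `sorted_columns` helper are invented for illustration. It shows rows that carry different columns, and a sort order chosen by the modeler rather than fixed by the store.

```python
# Toy in-memory model of a column family (an assumption for illustration,
# not Cassandra's API): row key -> {column name -> value}.
users = {
    "alice": {"email": "alice@example.com", "city": "Austin"},
    "bob": {"email": "bob@example.com", "twitter": "@bob"},  # no "city" column
}


def sorted_columns(row, key=None):
    """Return (name, value) pairs ordered by column name.

    `key` stands in for the comparator you pick per column family;
    with no key, names are compared as plain strings.
    """
    return sorted(row.items(), key=lambda item: key(item[0]) if key else item[0])


# Hypothetical row whose column names are numeric sequence ids stored as text.
events = {"10": "logout", "9": "login"}

as_text = [name for name, _ in sorted_columns(events)]             # string order
as_numbers = [name for name, _ in sorted_columns(events, key=int)]  # numeric order
```

The same column names come back in a different order depending on the comparator, which is why the sort order has to be decided when the column family is designed.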
Look familiar?
The trade-offs arise because of the distribution model, not the column family (CF) model.
Writes are atomic at the CF row level, but not isolated. Large transactional apps push work down to the node (shared nothing). Guaranteeing ACID constraints across nodes is a hard problem.
On the other hand, you do get a lot of things: data redundancy, very fast writes, and fast reads.
Relational: formally defined, so it feels correct. Query-first: not formally defined, so it feels somehow incorrect. You get some things in exchange: scalability, availability, and replication.
Focus on query & analysis. B+trees. Update once. Cassandra typically becomes I/O bound before becoming CPU bound.
Not set in stone. Your application may require a different approach.
Recognize non-starters: Is my dataset going to become very large? Will I need to sustain high write throughput? Also, what are the common operations? Optimize CFs for those operations.
Columns are sorted. Choose keys and columns carefully: you need to think about how you plan to slice your data. Related data is kept close together to reduce I/O.
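
The point about slicing can be sketched as follows. This is a toy illustration in plain Python, not Cassandra's slice API: the `get_slice` helper and the `readings` row are hypothetical. Because column names are kept in sorted order, a range of adjacent columns can be located with two binary searches and read as one contiguous run.

```python
from bisect import bisect_left, bisect_right


def get_slice(row, start, finish):
    """Return the columns whose names fall in [start, finish], in sorted order.

    A real store keeps the names sorted already; this toy version sorts on
    every call to stay short.
    """
    names = sorted(row)
    lo = bisect_left(names, start)
    hi = bisect_right(names, finish)
    return [(name, row[name]) for name in names[lo:hi]]


# Hypothetical row: one sensor's readings, with timestamps as column names
# so that readings from the same day sit next to each other.
readings = {
    "2010-09-01T00:00": 20.5,
    "2010-09-01T06:00": 22.1,
    "2010-09-02T00:00": 19.8,
    "2010-09-03T00:00": 21.0,
}

day_one = get_slice(readings, "2010-09-01", "2010-09-01T23:59")
```

Choosing timestamps as column names is the modeling decision here: it is what makes "all readings for one day" a single adjacent slice instead of a scan.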
Denormalize. Use the disk. Don't be afraid to create another CF that duplicates some data.
Composite column names. Painful updates of denormalized parts. Fast reads & insertions.
Key
Normal attributes
Composite column names. Pulling in relationships. Painful updates. Denormalization is best when data doesn't change.
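
The trade-off in these notes can be sketched with two toy column families in plain Python. Everything here is an assumption made for illustration (the `users` and `group_members` names, the `last_name:user_id` composite format, and the helper functions); none of it comes from the slides or from Cassandra's API.

```python
users = {}          # CF 1: user_id -> attributes
group_members = {}  # CF 2: group_id -> {"<last_name>:<user_id>": display name}


def add_user(user_id, first, last):
    users[user_id] = {"first": first, "last": last}


def join_group(group_id, user_id):
    user = users[user_id]
    # Composite column name: a sortable part plus a unique part.
    column = f"{user['last']}:{user_id}"
    # The display name is duplicated here so the read needs no join.
    group_members.setdefault(group_id, {})[column] = f"{user['first']} {user['last']}"


def members(group_id):
    """Answer 'who is in this group, by last name' from a single row."""
    row = group_members.get(group_id, {})
    return [row[column] for column in sorted(row)]


def rename_user(user_id, new_last):
    """The painful part: every denormalized copy must be found and rewritten."""
    old_column = f"{users[user_id]['last']}:{user_id}"
    users[user_id]["last"] = new_last
    for group_id, row in group_members.items():
        if old_column in row:
            del row[old_column]
            join_group(group_id, user_id)
```

Reading a group's members touches one row, while `rename_user` has to visit every group row, which is why the notes suggest denormalizing data that rarely changes.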
Commit log (on a separate disk). Memtable. SSTable.
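
The three pieces named in this note fit together roughly as follows. This is a deliberately simplified toy in plain Python, not Cassandra's implementation: the `ToyStore` class and its `flush_threshold` parameter are invented for illustration, and real concerns such as compaction, timestamps, and replication are left out.

```python
class ToyStore:
    """Toy write path: commit log append, memtable update, flush to an SSTable."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []   # append-only, so writes to it are sequential
        self.memtable = {}     # recent writes, held in memory
        self.sstables = []     # immutable sorted tables, oldest first
        self.flush_threshold = flush_threshold  # hypothetical tuning knob

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record the write for durability
        self.memtable[key] = value            # 2. apply it in memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. Write the memtable out once, sorted by key, and never modify it again.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log

    def read(self, key):
        # Newest data wins: check the memtable, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for stored_key, value in table:
                if stored_key == key:
                    return value
        return None
```

Nothing on disk is ever updated in place in this model: an overwrite is just a newer entry that shadows the older one at read time.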
