Contributing code to Cassandra
Tel Aviv Cassandra Meetup
May 2015
Oded Peer, RSA Security
Brief History
2007
July 2008
Jan 2009
February 2010
An Apache Project
• Project Management Committee (PMC)
o Established by and reports to ASF Board
o Provides oversight (legal, procedures, community)
o PMC members are elected based on merit
• Mailing lists for communication
o “If it didn't happen on the dev list, it didn't happen”
• Apache License
• Voting procedures to reach consensus
• Release procedure
• Tools
Getting involved
• Reporting bugs
• Testing patches
• Help others on the mailing list
• Submitting patches
Dev Roles
Contributor Committer
Code Contribution 2014
• 91 commit authors (46% DataStax)
• 7,682 files changed (70% DataStax)
• 248,411 changed lines (78% DataStax)
Source: ‘git log’
• 26 committers
o 10 from DataStax
o 4 from Apple
o 3 from Twitter
Source: Cassandra wiki
Source Control
Patches, not pull requests
Branching model
• Mainline model
Alternative - Trunk Based Development
Cassandra Repo
• Cassandra trunk
o 1570+ Java files
o 100,000+ lines of code
• MySQL 5.7
o 4200+ c/c++ files
o 1,130,000+ lines of code
Frequency of Changes
[Chart: number of times StorageService was changed per month, Jan to Dec 2014; y-axis from 0 to 50]
Issue tracker
https://issues.apache.org/jira/browse/CASSANDRA
Build
http://cassci.datastax.com/
How to contribute
• Find an issue to work on
• Discuss proposed solution
• Modify the source
• Submit a patch
o Review
o Voting
• Commit (by committer)
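The review and voting step can be sketched as a small rule check. This is a minimal illustration, assuming the ASF code-modification rule described in the speaker notes (any -1 is a veto) and a configurable number of required +1 votes (one for Cassandra, three under the general ASF rule). The class and method names are hypothetical and are not part of any Cassandra or ASF tooling.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only.
final class PatchVote {

    // Returns true when no -1 (veto) is outstanding and at least
    // requiredPlusOnes votes of +1 were cast. Other values (such as 0) are ignored.
    static boolean passes(List<Integer> votes, int requiredPlusOnes) {
        int plusOnes = 0;
        for (int vote : votes) {
            if (vote == -1) {
                return false; // a single veto blocks the change until it is withdrawn
            }
            if (vote == 1) {
                plusOnes++;
            }
        }
        return plusOnes >= requiredPlusOnes;
    }

    public static void main(String[] args) {
        // One reviewer approves, one +1 required (the Cassandra case in the notes).
        System.out.println(passes(Arrays.asList(1), 1));
        // Two approvals but one veto: the proposal is blocked.
        System.out.println(passes(Arrays.asList(1, 1, -1), 1));
    }
}
```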
Submitting a Patch
My case
Porting to C*
• Events have many attributes
• Not all attributes have values
• INSERT only
• JDBC way - Use PreparedStatement
My Problem - Tombstones
• INSERT == UPDATE
• ‘null’ in prepared statements creates tombstones
• Terrible performance
• Workarounds
o Using QueryBuilder
o PreparedStatement for different Permutations of the query
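The second workaround can be sketched as follows: build the INSERT text from only the attributes that have values, so that no null is ever bound. Each distinct query string could then be prepared once and reused for events with the same set of populated attributes. This is a minimal sketch with no driver calls; the class name, the table name `events`, and the column names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Cassandra or any driver.
final class SparseInsert {

    // Builds an INSERT that names only the columns whose value is non-null.
    static String cql(String table, Map<String, Object> attributes) {
        List<String> columns = new ArrayList<>();
        List<String> markers = new ArrayList<>();
        for (Map.Entry<String, Object> entry : attributes.entrySet()) {
            if (entry.getValue() != null) {
                columns.add(entry.getKey());
                markers.add("?");
            }
        }
        if (columns.isEmpty()) {
            throw new IllegalArgumentException("at least one non-null attribute is required");
        }
        return "INSERT INTO " + table
                + " (" + String.join(", ", columns) + ")"
                + " VALUES (" + String.join(", ", markers) + ")";
    }

    // Returns the non-null values in the same order as the columns in cql().
    static List<Object> values(Map<String, Object> attributes) {
        List<Object> bound = new ArrayList<>();
        for (Object value : attributes.values()) {
            if (value != null) {
                bound.add(value);
            }
        }
        return bound;
    }

    public static void main(String[] args) {
        Map<String, Object> event = new LinkedHashMap<>();
        event.put("id", 42);
        event.put("user_name", null); // attribute without a value is skipped
        event.put("action", "login");
        System.out.println(cql("events", event));
        System.out.println(values(event));
    }
}
```

The cost of this approach is the one the slide hints at: the number of distinct statements grows with the number of permutations of populated attributes.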
CASSANDRA-7304
• OR: Ability to distinguish between NULL and UNSET values in Prepared Statements
• Discuss solutions on JIRA
o Query Options
o IGNORE_NULLS keyword
o Indicate ‘unset’ with a special value
• Accepted solution
o Bound value format <length><data>
o Before binary protocol version 4, any negative length indicates null
o In binary protocol version 4, length -1 indicates null and -2 indicates unset; any other negative length is an error
• Submit Patch
• Review
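The accepted bound value format can be illustrated with a small decoder. This is a sketch, not the Cassandra implementation: it assumes the length is a 4-byte signed big-endian integer, and the class, enum, and method names are hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical decoder for a single <length><data> bound value.
final class BoundValue {
    enum Kind { VALUE, NULL, UNSET }

    final Kind kind;
    final ByteBuffer data; // non-null only when kind == VALUE

    private BoundValue(Kind kind, ByteBuffer data) {
        this.kind = kind;
        this.data = data;
    }

    static BoundValue read(ByteBuffer in, int protocolVersion) {
        int length = in.getInt(); // assumed: 4-byte signed length prefix
        if (length >= 0) {
            ByteBuffer data = in.slice();        // view starting at the current position
            data.limit(length);                  // restrict the view to this value's bytes
            in.position(in.position() + length); // advance past the value
            return new BoundValue(Kind.VALUE, data);
        }
        if (protocolVersion < 4 || length == -1) {
            // Before version 4 any negative length means null; in version 4 only -1 does.
            return new BoundValue(Kind.NULL, null);
        }
        if (length == -2) {
            return new BoundValue(Kind.UNSET, null); // version 4 only
        }
        throw new IllegalArgumentException("Invalid negative value length: " + length);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putInt(2).put((byte) 7).put((byte) 9); // a two-byte value
        buf.putInt(-1);                            // null
        buf.putInt(-2);                            // unset
        buf.flip();
        System.out.println(read(buf, 4).kind);
        System.out.println(read(buf, 4).kind);
        System.out.println(read(buf, 4).kind);
    }
}
```

Treating an unset value as "leave the column untouched" rather than "write a null" is what avoids the tombstone.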
Testing CASSANDRA-7304
• Unit test
o CQLTester
• Cover all features (tuples, UDT, UDF, etc.)
• Distributed Tests (dtests)
Questions?
Recap
• Apache Project
• SCM - Git
o Mainline model
• Jira
o Bugs, discussing solutions, code review
• Process
o Assign issue, develop, submit patch, review and commit
Thank You
@odpeer

Editor's Notes

  • #2 In this lecture we will review the tools and process for getting new code into Cassandra. About me: working with Cassandra since version 1.2 and CQL3, with a focus on data modeling and performance. For RSA: Cassandra is a distributed database. Apache Cassandra is named after the Greek mythological prophet Cassandra. When Cassandra of Troy refused Apollo, he put a curse on her so that all of her and her descendants’ predictions would not be believed. Cassandra is the cursed oracle.
  • #3 Cassandra originated at Facebook in 2007 to solve that company's inbox search ... The code was released as an open source Google Code project in July 2008. It was updateable only by Facebook engineers and had little community. In 2009 it moved to the Apache Incubator. The Apache Software Foundation is a decentralized community of developers. The software they produce is distributed under the terms of the Apache License and is therefore free and open source software (FOSS). The Apache projects are characterized by a collaborative, consensus-based development process and an open and pragmatic software license. Each project is managed by a self-selected team of technical experts who are active contributors to the project. The ASF is a meritocracy, implying that membership of the foundation is granted only to volunteers who have actively contributed to Apache projects.
  • #4 https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different http://spyced.blogspot.ie/search?updated-min=2009-01-01T00:00:00-08:00&updated-max=2010-01-01T00:00:00-08:00&max-results=17 I've never been a big fan of JIRA, but you work with the tools you have. Or the ones the ASF inflicts on you
  • #6 https://community.apache.org/contributors/ http://www.apache.org/foundation/how-it-works.html
  • #7 http://wiki.apache.org/cassandra/Committers
    Without merges, on trunk:
    git log --no-merges --since="Jan 1 2014" --until="Dec 31 2014" | grep Author | sort | uniq | grep -v datastax | grep -v "apache.org" | wc -l
    git log --no-merges --since="Jan 1 2014" --until="Dec 31 2014" --author="datastax" --shortstat | grep "files changed" | cut -d" " -f2 | paste -sd+ | bc
    git log --no-merges --since="Jan 1 2014" --until="Dec 31 2014" --shortstat --author="apache" | grep "files changed" | grep -v "0 files changed" | grep '(+),' | less | cut -d" " -f7 | paste -sd+ | bc
  • #8 Apache projects use either git or SVN. Not working with pull requests. Working with patches.
  • #10 The focus is on the next minor release and bug fixing in existing releases, which are on different branches. I think this is why patches on trunk wait a long time before getting reviewed.
  • #12 git log --shortstat --since="Jan 1 2014" --until="Dec 31 2014" -p src/java/org/apache/cassandra/service/StorageService.java | grep Date | cut -d" " -f5 > dates.csv
    230 changes
    git log --since="Jan 1 2014" --until="Dec 31 2014" --shortstat -p src/java/org/apache/cassandra/service/StorageService.java | grep "changed," | cut -d" " -f5 | grep -v "^$" | paste -sd+ | bc
    git log --since="Jan 1 2014" --until="Dec 31 2014" --shortstat -p src/java/org/apache/cassandra/service/StorageService.java | grep "changed," | cut -d" " -f7 | grep -v "^$" | paste -sd+ | bc
  • #13 Apache requires Jira or Bugzilla
  • #14 Discussion on the dev mailing list about moving to Maven; no clear benefit.
  • #15 https://www.apache.org/foundation/voting.html "Because one of the fundamental aspects of accomplishing things within the Apache framework is doing so by consensus, there obviously needs to be a way to tell whether it has been reached. This is done by voting." https://www.apache.org/foundation/glossary.html#ReviewThenCommit Cassandra uses a Review-Then-Commit (RTC) policy. A positive vote carries the very strong implied message, 'I have tested this patch myself, and found it good.' For code-modification votes, +1 votes are in favour of the proposal, but -1 votes are vetos and kill the proposal dead until all vetoers withdraw their -1 votes. Under the ASF voting rules, three +1 votes are required for a code-modification proposal to pass; Cassandra accepts a single +1 vote.
  • #16 Squash local commits into one commit. No formal review tool.
  • #20 A month and a half until review 1. Another month and a half until review 2. Four months until the next review. Why: the mainline model means fewer resources are working on trunk. So what: the frequency of changes leads to conflicts when rebasing.
  • #23 RangeSliceQuery.queryNextPage -> create data range; CompositesSearcher.AbstractScanIterator.computeNext -> if (!range.contains(dk)) continue