Changing the Tires on a Big Data Racecar
@davemcnelis
Sr. Software Engineer, Proofpoint
Who am I?
Software engineer at Proofpoint, formerly Emerging Threats
14 years experience, 7 with Cassandra or Hadoop
Currently using Scala more than any other language
Big data focus areas have included social media analysis, marketing data, smart-meter
analytics, and information security research
Current projects revolve around building threat intelligence APIs and data stores
Goals
Outline approaches to migrating or upgrading your infrastructure / data store
Pros and Cons of these approaches
Demystify the process, identify ‘gotchas’
Establish guidelines and provide ideas for handling these situations, not a gospel to follow
Core System Components
Back-end store (Hadoop, Cassandra, etc.)
Queuing / messaging service (Kafka, Kinesis, AMQP, RabbitMQ)
Event / Data Producers (APIs, log data, sensors)
Generates base data for the messaging service
Analytics (Queuing system consumers, batch jobs)
Access (APIs, front ends, batch job output)
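As a minimal sketch of how these pieces might fit together: producers feed the message bus, and the back-end store and analytics both consume from it. Every type and method name below is hypothetical and only illustrates the shape of the pipeline.

```scala
// Hypothetical component interfaces; they just illustrate the data flow.
final case class Event(source: String, timestamp: Long, payload: String)

trait Producer     { def emit(e: Event): Unit }                // APIs, log data, sensors
trait BackEndStore { def write(e: Event): Unit }               // Hadoop, Cassandra, etc.
trait Analytics    { def consume(e: Event): Unit }             // queue consumers, batch jobs
trait Access       { def lookup(source: String): Seq[Event] }  // APIs, front ends, batch output

trait MessageBus {                                             // Kafka, Kinesis, RabbitMQ, ...
  def publish(e: Event): Unit
  def subscribe(handler: Event => Unit): Unit
}

object Pipeline {
  // Producers publish to the bus; the store and analytics both subscribe to it.
  def wire(bus: MessageBus, store: BackEndStore, analytics: Analytics): Unit = {
    bus.subscribe(e => store.write(e))
    bus.subscribe(e => analytics.consume(e))
  }
}
```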
2 Basic Approaches
Upgrade in place
Build a new cluster and figure out how to get your data over
Upgrading in place
Pros
Least expensive
Data can stay where it already lives
Often has sufficient documentation
Cons
Stability concerns of the new back end
Downtime / customer visibility
Good luck rolling back in the event of a problem
Degradation of performance during the upgrade
Testing in production is bad, mmkay
Generally limited to minor upgrades / updates
Even drop-in upgrades aren’t clean
So you want to build a new cluster
All of these inherently will cost more than upgrading in place.
Start your engines! -- Spin up a new cluster, dark-write to it until it holds enough data, then cut
over consumption
Red Flag -- Stop ingestion and consumption, move data, restart ingestion and
consumption
Black Flag -- Incremental copies of data, potentially pausing ingestion/consumption
for brief periods of time
Green Flag -- Let your foundation do most of the work for you
Keys to Success
Pre-planning is essential. Don’t expect this to be a couple of days of work; plan for weeks.
Solid data flow foundations are key. Consider archiving all incoming data to
something like S3 so you can replay an arbitrary amount of data (sketch after this slide).
Automated / unit testing on data interaction components will create a lot more
confidence and help identify problem areas early.
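A rough sketch of what that archiving consumer could look like, assuming Kafka (a reasonably recent client) and the AWS SDK v2. The broker address, topic, bucket, and key layout are all placeholders.

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import software.amazon.awssdk.core.sync.RequestBody
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.PutObjectRequest

object RawArchiver {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
    props.put("group.id", "raw-archiver")               // separate group, so archiving
                                                         // never interferes with real consumers
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.ByteArrayDeserializer")

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
    consumer.subscribe(Collections.singletonList("raw-events"))   // placeholder topic
    val s3 = S3Client.create()

    while (true) {
      val records = consumer.poll(Duration.ofSeconds(5)).asScala
      for (r <- records) {
        // One object per record keeps the sketch simple; in practice you'd batch
        // records into larger objects before uploading.
        val key = s"raw-events/${r.partition}/${r.offset}"
        s3.putObject(
          PutObjectRequest.builder().bucket("my-raw-archive").key(key).build(),
          RequestBody.fromBytes(r.value()))
      }
      consumer.commitSync()
    }
  }
}
```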
Start your engines!
Spinning up a new cluster and writing data until there is enough to sustain operations
Fine if no historical data older than the spin-up period is required
Least amount of risk, if older data isn’t needed
Can back-fill legacy data after the cut-over has occurred
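A minimal sketch of the dark-write idea, assuming both stores sit behind a simple write interface. The Store trait and all names here are hypothetical; the point is only that the live write must succeed while the dark write is best-effort.

```scala
import scala.util.{Failure, Success, Try}

final case class Record(key: String, value: String)

trait Store { def write(r: Record): Unit }

// Write to the live (old) store as usual; dark-write to the new store on a
// best-effort basis so a failure there never affects customers.
class DualWriter(liveStore: Store, darkStore: Store) {
  def write(r: Record): Unit = {
    liveStore.write(r)                       // must succeed
    Try(darkStore.write(r)) match {          // allowed to fail
      case Success(_)  => ()
      case Failure(ex) => println(s"dark write failed for ${r.key}: ${ex.getMessage}")
    }
  }
}
```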
Red flag -- Stop the race!
Shut it all down, move data to the new format, start everything back up
High customer impact
User visible downtime will occur, not just analytics/ingestion/processing downtime
Might be OK for non-critical, offline systems
Black Flag -- Dealing with the stop and go penalty
Attempts to lessen downtime/customer impact
Significant engineering time to set up properly
If writes aren’t timestamped and your data isn’t written linearly, incremental copies might not be feasible
Longest path, in terms of calendar days
High complexity, high potential for mistakes
Green flag -- Letting your foundation work for you
Only “Start your engines” has less planned downtime
Difficult or impossible if you don’t have a solid data flow architecture in place
Results for this should be reproducible (in other words, can test things multiple times
if needed)
Data needs to come in from either a queue or batch loads
If everything is from batch loads, should be able to avoid any customer disruption
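A rough sketch of replaying archived raw data out of S3 and back through ingestion, assuming an archive layout like the earlier sketch. The bucket, prefix, and ingest function are placeholders.

```scala
import scala.jdk.CollectionConverters._
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{GetObjectRequest, ListObjectsV2Request}

object ArchiveReplay {
  // Placeholder: in reality this pushes the record into the new cluster's ingest path.
  def ingestIntoNewCluster(payload: Array[Byte]): Unit =
    println(s"replaying ${payload.length} bytes")

  def main(args: Array[String]): Unit = {
    val s3 = S3Client.create()
    var req = ListObjectsV2Request.builder()
      .bucket("my-raw-archive")     // placeholder bucket
      .prefix("raw-events/")        // placeholder prefix
      .build()

    var done = false
    while (!done) {
      val page = s3.listObjectsV2(req)
      for (obj <- page.contents().asScala) {
        val bytes = s3.getObjectAsBytes(
          GetObjectRequest.builder().bucket("my-raw-archive").key(obj.key()).build())
        ingestIntoNewCluster(bytes.asByteArray())
      }
      if (page.isTruncated)
        req = req.toBuilder.continuationToken(page.nextContinuationToken()).build()
      else done = true
    }
  }
}
```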
Watch out for that pile up!
Queue / Message bus -- Need ample capacity for the periods when you’re not ingesting
from the bus, e.g. Kinesis retention defaults to 24 hours while Kafka retention is configurable (sketch after this slide)
Testing -- Build in time to test and verify migrations, and then check it all a second
time.
Testing must be multifaceted -- The code, the data, and the infrastructure
Chasing the white rabbit -- Beware the jabberwocky! It’s easy to fall into the bleeding-edge
trap, but it’s high risk for often little reward
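For the Kafka case, a small sketch of raising a topic’s retention ahead of a pause in consumption, assuming a reasonably recent AdminClient. The broker address, topic name, and the seven-day value are placeholders.

```scala
import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AlterConfigOp, ConfigEntry}
import org.apache.kafka.common.config.ConfigResource

object RaiseRetention {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
    val admin = AdminClient.create(props)

    // Bump retention.ms on the raw topic to 7 days so messages survive the window
    // where the new cluster's consumers aren't running yet.
    val topic = new ConfigResource(ConfigResource.Type.TOPIC, "raw-events")  // placeholder topic
    val op = new AlterConfigOp(
      new ConfigEntry("retention.ms", (7L * 24 * 60 * 60 * 1000).toString),
      AlterConfigOp.OpType.SET)

    val configs = new java.util.HashMap[ConfigResource, java.util.Collection[AlterConfigOp]]()
    configs.put(topic, java.util.Collections.singletonList(op))
    admin.incrementalAlterConfigs(configs).all().get()

    admin.close()
  }
}
```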
Example -- Migrating from Cassandra to Hadoop
“Start your engines!” approach
Began with duplicate writing to both systems
Eventually added Kafka with different consumers pushing data to both back ends
Dev work to re-implement things with Hadoop/HBase took most resources/time
Once in a “stable” place, started comparing batch job outputs from the two systems (sketch after this slide)
Brief maintenance window to cut over
Entire process took several months including dev, ops and testing work
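A simplified sketch of the kind of output comparison involved: diff the batch results from the two systems and count the differences. The file paths and the key/value line format are hypothetical.

```scala
import scala.io.Source

object CompareBatchOutputs {
  // Assumes both systems emit one "key<TAB>value" line per record.
  def load(path: String): Map[String, String] =
    Source.fromFile(path).getLines()
      .map { line => val Array(k, v) = line.split("\t", 2); k -> v }
      .toMap

  def main(args: Array[String]): Unit = {
    val oldOut = load("output/cassandra/daily.tsv")   // hypothetical paths
    val newOut = load("output/hbase/daily.tsv")

    val missing    = oldOut.keySet.diff(newOut.keySet)
    val extra      = newOut.keySet.diff(oldOut.keySet)
    val mismatched = oldOut.keySet.intersect(newOut.keySet).filter(k => oldOut(k) != newOut(k))

    println(s"missing from new: ${missing.size}, " +
            s"unexpected in new: ${extra.size}, value mismatches: ${mismatched.size}")
  }
}
```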
Example -- Migrating from Cassandra to Hadoop (cont.)
Unique challenges
Exporting data from Cassandra was hard
Prior to a decent option like Spark
Greatly complicated by vNodes
Used a set of Python scripts to actually export all the data (sketch after this slide)
Had multiple kinds of products to deliver
API under constant customer use, couldn’t afford any downtime
Batch job outputs, hourly and daily
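A rough sketch of the token-range walk those scripts had to do, shown here with the DataStax Java driver from Scala rather than Python. The keyspace, table, and partition key are hypothetical, and the real scripts also had to handle paging, retries, and failures.

```scala
import scala.jdk.CollectionConverters._
import com.datastax.driver.core.Cluster

object ExportByTokenRange {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect()

    // Hypothetical keyspace/table/partition key; the real export wrote rows
    // out to flat files rather than printing them.
    val ps = session.prepare(
      "SELECT * FROM my_keyspace.events WHERE token(id) > ? AND token(id) <= ?")

    // With vNodes every node owns many small token ranges, so a full export is a
    // walk over all of them (unwrapped, since a range can wrap around the ring).
    for {
      range     <- cluster.getMetadata.getTokenRanges.asScala
      unwrapped <- range.unwrap().asScala
    } {
      val bound = ps.bind().setToken(0, unwrapped.getStart).setToken(1, unwrapped.getEnd)
      for (row <- session.execute(bound).asScala)
        println(row.getString("id"))
    }
    cluster.close()
  }
}
```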
Example -- Upgrading major versions of Hadoop
Green flag approach
Had to minimize downtime
Not enough calendar time for “Start your engines”
Leveraged snapshots (both Cassandra and HBase have this construct; sketch after this slide)
Loaded snapshots into testing environments multiple times
Majority of engineering time was in upgrading libraries and verifying there were no
breaking changes because of the version changes
Second most engineering time was spent building and running test clusters
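A small sketch of taking and cloning an HBase snapshot through the Admin API, which is roughly the mechanism the snapshot-based loading relied on. The table and snapshot names are placeholders, and copying snapshot files between clusters typically goes through the ExportSnapshot tool rather than this API.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

object SnapshotExample {
  def main(args: Array[String]): Unit = {
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val admin = connection.getAdmin

    // On the source cluster: take a point-in-time snapshot of the table.
    admin.snapshot("events-pre-upgrade", TableName.valueOf("events"))

    // On the destination cluster, after the snapshot files have been copied over:
    // materialize the snapshot as a live table.
    admin.cloneSnapshot("events-pre-upgrade", TableName.valueOf("events"))

    admin.close()
    connection.close()
  }
}
```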
Example -- Upgrading major versions of Hadoop (steps)
1. Determine the “time” at which to start ingesting into both environments
2. Take snapshots of the original cluster and load them into the new cluster (this can take a long time)
3. Start raw data consumers for the new cluster (i.e. enable data insertion)
4. Once insertion lag has caught up, start analytics-based consumers
5. Enable any batch processing
6. Continue writing to both stores for a couple of weeks
7. Verify the new cluster’s output by comparing batch jobs against the old cluster
8. Cut over customer-facing APIs
Summary
Strong foundations are essential
Number of possible ways to win the race
Plan as far out as you can foresee
Upgrading and migrating are operationally similar and have similar approaches available
Archiving raw incoming data can save you a lot of headaches if you can afford it
Racing analogies only work so long in a presentation before they get worn out
