Changing the Tires on a Big Data Racecar
@davemcnelis
Sr. Software Engineer, Proofpoint
Who am I?
Software engineer at Proofpoint, formerly Emerging Threats
14 years experience, 7 with Cassandra or Hadoop
Currently using Scala more than any other language
Big data focus areas have included social media analysis, marketing data, smart-meter
analytics, and information security research
Current projects revolve around building threat intelligence APIs and data stores
Goals
Outline approaches to migrating or upgrading your infrastructure / data store
Pros and Cons of these approaches
Demystify the process, identify ‘gotchas’
Establish guidelines and provide ideas for handling these situations, not a gospel to follow
Core System Components
Back-end store (Hadoop, Cassandra, etc.)
Queuing / messaging service (Kafka, Kinesis, AMQP, RabbitMQ)
Event / Data Producers (APIs, log data, sensors)
Generates base data for the messaging service
Analytics (Queuing system consumers, batch jobs)
Access (APIs, front ends, batch job output)
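As a minimal sketch of how these pieces might fit together: producers feed the message bus, and the back-end store and analytics both consume from it. Every type and method name below is hypothetical and only illustrates the shape of the pipeline.

```scala
// Hypothetical component interfaces; they just illustrate the data flow.
final case class Event(source: String, timestamp: Long, payload: String)

trait Producer     { def emit(e: Event): Unit }                // APIs, log data, sensors
trait BackEndStore { def write(e: Event): Unit }               // Hadoop, Cassandra, etc.
trait Analytics    { def consume(e: Event): Unit }             // queue consumers, batch jobs
trait Access       { def lookup(source: String): Seq[Event] }  // APIs, front ends, batch output

trait MessageBus {                                             // Kafka, Kinesis, RabbitMQ, ...
  def publish(e: Event): Unit
  def subscribe(handler: Event => Unit): Unit
}

object Pipeline {
  // Producers publish to the bus; the store and analytics both subscribe to it.
  def wire(bus: MessageBus, store: BackEndStore, analytics: Analytics): Unit = {
    bus.subscribe(e => store.write(e))
    bus.subscribe(e => analytics.consume(e))
  }
}
```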
2 Basic Approaches
Upgrade in place
Build a new cluster and figure out how to get your data over
Upgrading in place
Pros
Least expensive
Data can stay where it already lives
Often has sufficient documentation
Cons
Stability concerns of the new back end
Downtime / customer visibility
Good luck rolling back in the event of a problem
Degradation of performance during the upgrade
Testing in production is bad, mmkay
Generally limited to minor upgrades / updates
Even drop-in upgrades aren’t clean
So you want to build a new cluster
All of these inherently will cost more than upgrading in place.
Start your engines! -- Spin up a new cluster, dark-write to it until it holds enough data, then cut
over consumption
Red Flag -- Stop ingestion and consumption, move data, restart ingestion and
consumption
Black Flag -- Incremental copies of data, potentially pausing ingestion/consumption
for brief periods of time
Green Flag -- Let your foundation do most of the work for you
Keys to Success
Pre-planning is essential. Don’t expect this to be a couple of days of work; plan for weeks.
Solid data flow foundations are key. Consider archiving all incoming data to
something like S3 so you can replay an arbitrary amount of data (sketch after this slide).
Automated / unit testing on data interaction components will create a lot more
confidence and help identify problem areas early.
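A rough sketch of what that archiving consumer could look like, assuming Kafka (a reasonably recent client) and the AWS SDK v2. The broker address, topic, bucket, and key layout are all placeholders.

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import software.amazon.awssdk.core.sync.RequestBody
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.PutObjectRequest

object RawArchiver {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
    props.put("group.id", "raw-archiver")               // separate group, so archiving
                                                         // never interferes with real consumers
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.ByteArrayDeserializer")

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
    consumer.subscribe(Collections.singletonList("raw-events"))   // placeholder topic
    val s3 = S3Client.create()

    while (true) {
      val records = consumer.poll(Duration.ofSeconds(5)).asScala
      for (r <- records) {
        // One object per record keeps the sketch simple; in practice you'd batch
        // records into larger objects before uploading.
        val key = s"raw-events/${r.partition}/${r.offset}"
        s3.putObject(
          PutObjectRequest.builder().bucket("my-raw-archive").key(key).build(),
          RequestBody.fromBytes(r.value()))
      }
      consumer.commitSync()
    }
  }
}
```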
Start your engines!
Spinning up a new cluster and writing data until there is enough to sustain operations
Fine if no historical data older than the spin-up period is required
Least amount of risk, if older data isn’t needed
Can back-fill legacy data after the cut-over has occurred
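A minimal sketch of the dark-write idea, assuming both stores sit behind a simple write interface. The Store trait and all names here are hypothetical; the point is only that the live write must succeed while the dark write is best-effort.

```scala
import scala.util.{Failure, Success, Try}

final case class Record(key: String, value: String)

trait Store { def write(r: Record): Unit }

// Write to the live (old) store as usual; dark-write to the new store on a
// best-effort basis so a failure there never affects customers.
class DualWriter(liveStore: Store, darkStore: Store) {
  def write(r: Record): Unit = {
    liveStore.write(r)                       // must succeed
    Try(darkStore.write(r)) match {          // allowed to fail
      case Success(_)  => ()
      case Failure(ex) => println(s"dark write failed for ${r.key}: ${ex.getMessage}")
    }
  }
}
```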
Red flag -- Stop the race!
Shut it all down, move data to the new format, start everything back up
High customer impact
User visible downtime will occur, not just analytics/ingestion/processing downtime
Might be OK for non-critical, offline systems
Black Flag -- Dealing with the stop and go penalty
Attempts to lessen downtime/customer impact
Significant engineering time to set up properly
If writes aren’t timestamped and your data isn’t written linearly, incremental copies might not be feasible
Longest path, in terms of calendar days
High complexity, high potential for mistakes
Green flag -- Letting your foundation work for you
Only “Start your engines” has less planned downtime
Difficult or impossible if you don’t have a solid data flow architecture in place
Results for this should be reproducible (in other words, can test things multiple times
if needed)
Data needs to come in from either a queue or batch loads
If everything is from batch loads, should be able to avoid any customer disruption
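A rough sketch of replaying archived raw data out of S3 and back through ingestion, assuming an archive layout like the earlier sketch. The bucket, prefix, and ingest function are placeholders.

```scala
import scala.jdk.CollectionConverters._
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{GetObjectRequest, ListObjectsV2Request}

object ArchiveReplay {
  // Placeholder: in reality this pushes the record into the new cluster's ingest path.
  def ingestIntoNewCluster(payload: Array[Byte]): Unit =
    println(s"replaying ${payload.length} bytes")

  def main(args: Array[String]): Unit = {
    val s3 = S3Client.create()
    var req = ListObjectsV2Request.builder()
      .bucket("my-raw-archive")     // placeholder bucket
      .prefix("raw-events/")        // placeholder prefix
      .build()

    var done = false
    while (!done) {
      val page = s3.listObjectsV2(req)
      for (obj <- page.contents().asScala) {
        val bytes = s3.getObjectAsBytes(
          GetObjectRequest.builder().bucket("my-raw-archive").key(obj.key()).build())
        ingestIntoNewCluster(bytes.asByteArray())
      }
      if (page.isTruncated)
        req = req.toBuilder.continuationToken(page.nextContinuationToken()).build()
      else done = true
    }
  }
}
```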
Watch out for that pile up!
Queue / Message bus -- Need ample capacity for the periods when you’re not ingesting
from the bus, e.g. Kinesis retention defaults to 24 hours while Kafka retention is configurable (sketch after this slide)
Testing -- Build in time to test and verify migrations, and then check it all a second
time.
Testing must be multifaceted -- The code, the data, and the infrastructure
Chasing the white rabbit -- Beware the jabberwocky! It’s easy to fall into the bleeding-edge
trap, but it’s high risk for often little reward
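For the Kafka case, a small sketch of raising a topic’s retention ahead of a pause in consumption, assuming a reasonably recent AdminClient. The broker address, topic name, and the seven-day value are placeholders.

```scala
import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AlterConfigOp, ConfigEntry}
import org.apache.kafka.common.config.ConfigResource

object RaiseRetention {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
    val admin = AdminClient.create(props)

    // Bump retention.ms on the raw topic to 7 days so messages survive the window
    // where the new cluster's consumers aren't running yet.
    val topic = new ConfigResource(ConfigResource.Type.TOPIC, "raw-events")  // placeholder topic
    val op = new AlterConfigOp(
      new ConfigEntry("retention.ms", (7L * 24 * 60 * 60 * 1000).toString),
      AlterConfigOp.OpType.SET)

    val configs = new java.util.HashMap[ConfigResource, java.util.Collection[AlterConfigOp]]()
    configs.put(topic, java.util.Collections.singletonList(op))
    admin.incrementalAlterConfigs(configs).all().get()

    admin.close()
  }
}
```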
Example -- Migrating from Cassandra to Hadoop
“Start your engines!” approach
Began with duplicate writing to both systems
Eventually added Kafka with different consumers pushing data to both back ends
Dev work to re-implement things with Hadoop/HBase took most resources/time
Once in a “stable” place, started comparing batch job outputs from the two systems (sketch after this slide)
Brief maintenance window to cut over
Entire process took several months including dev, ops and testing work
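A simplified sketch of the kind of output comparison involved: diff the batch results from the two systems and count the differences. The file paths and the key/value line format are hypothetical.

```scala
import scala.io.Source

object CompareBatchOutputs {
  // Assumes both systems emit one "key<TAB>value" line per record.
  def load(path: String): Map[String, String] =
    Source.fromFile(path).getLines()
      .map { line => val Array(k, v) = line.split("\t", 2); k -> v }
      .toMap

  def main(args: Array[String]): Unit = {
    val oldOut = load("output/cassandra/daily.tsv")   // hypothetical paths
    val newOut = load("output/hbase/daily.tsv")

    val missing    = oldOut.keySet.diff(newOut.keySet)
    val extra      = newOut.keySet.diff(oldOut.keySet)
    val mismatched = oldOut.keySet.intersect(newOut.keySet).filter(k => oldOut(k) != newOut(k))

    println(s"missing from new: ${missing.size}, " +
            s"unexpected in new: ${extra.size}, value mismatches: ${mismatched.size}")
  }
}
```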
Example -- Migrating from Cassandra to Hadoop (cont.)
Unique challenges
Exporting data from Cassandra was hard
Prior to a decent option like Spark
Greatly complicated by vNodes
Used a set of Python scripts to actually export all the data (sketch after this slide)
Had multiple kinds of products to deliver
API under constant customer use, couldn’t afford any downtime
Batch job outputs, hourly and daily
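A rough sketch of the token-range walk those scripts had to do, shown here with the DataStax Java driver from Scala rather than Python. The keyspace, table, and partition key are hypothetical, and the real scripts also had to handle paging, retries, and failures.

```scala
import scala.jdk.CollectionConverters._
import com.datastax.driver.core.Cluster

object ExportByTokenRange {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect()

    // Hypothetical keyspace/table/partition key; the real export wrote rows
    // out to flat files rather than printing them.
    val ps = session.prepare(
      "SELECT * FROM my_keyspace.events WHERE token(id) > ? AND token(id) <= ?")

    // With vNodes every node owns many small token ranges, so a full export is a
    // walk over all of them (unwrapped, since a range can wrap around the ring).
    for {
      range     <- cluster.getMetadata.getTokenRanges.asScala
      unwrapped <- range.unwrap().asScala
    } {
      val bound = ps.bind().setToken(0, unwrapped.getStart).setToken(1, unwrapped.getEnd)
      for (row <- session.execute(bound).asScala)
        println(row.getString("id"))
    }
    cluster.close()
  }
}
```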
Example -- Upgrading major versions of Hadoop
Green flag approach
Had to minimize downtime
Not enough calendar time for “Start your engines”
Leveraged snapshots (both Cassandra and HBase have this construct; sketch after this slide)
Loaded snapshots into testing environments multiple times
Majority of engineering time was in upgrading libraries and verifying there were no
breaking changes because of the version changes
Second most engineering time was spent building and running test clusters
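A small sketch of taking and cloning an HBase snapshot through the Admin API, which is roughly the mechanism the snapshot-based loading relied on. The table and snapshot names are placeholders, and copying snapshot files between clusters typically goes through the ExportSnapshot tool rather than this API.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

object SnapshotExample {
  def main(args: Array[String]): Unit = {
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val admin = connection.getAdmin

    // On the source cluster: take a point-in-time snapshot of the table.
    admin.snapshot("events-pre-upgrade", TableName.valueOf("events"))

    // On the destination cluster, after the snapshot files have been copied over:
    // materialize the snapshot as a live table.
    admin.cloneSnapshot("events-pre-upgrade", TableName.valueOf("events"))

    admin.close()
    connection.close()
  }
}
```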
Example -- Upgrading major versions of Hadoop (steps)
1. Determine the “time” at which to start ingesting into both environments
2. Take snapshots of the original cluster and load them into the new cluster (this can take a long time)
3. Start raw data consumers for the new cluster (i.e. enable data insertion)
4. Once insertion lag has caught up, start analytics-based consumers
5. Enable any batch processing
6. Continue writing to both stores for a couple of weeks
7. Verify the new cluster’s output by comparing batch jobs against the old cluster
8. Cut over customer-facing APIs
Summary
Strong foundations are essential
Number of possible ways to win the race
Plan as far out as you can foresee
Upgrading and migrating are operationally similar and have similar approaches available
Archiving raw incoming data can save you a lot of headaches if you can afford it
Racing analogies only work so long in a presentation before they get worn out
