Changing the Tires on a Big Data Racecar
@davemcnelis
Sr. Software Engineer, Proofpoint
Who am I?
Software engineer at Proofpoint, formerly Emerging Threats
14 years of experience, 7 with Cassandra or Hadoop
Currently using Scala more than any other language
Big data focuses have been social media analysis, marketing data, Smart-meter
analytics, and information security research
Current projects revolve around building threat intelligence APIs and data stores
Goals
Outline approaches to migrating or upgrading your infrastructure / data store
Pros and Cons of these approaches
Demystify the process, identify ‘gotchas’
Establish guidelines and provide ideas for handling these situations, not create a
gospel
Core System Components
Back end store (Hadoop, Cassandra, etc.)
Queuing / messaging service (Kafka, Kinesis, AMQP, RabbitMQ)
Event / Data Producers (APIs, log data, sensors)
Generates base data for the messaging service
Analytics (Queuing system consumers, batch jobs)
Access (APIs, front ends, batch job output)
2 Basic Approaches
Upgrade in place
Build a new cluster and figure out how to get your data over
Upgrading in place
Pros
Least expensive
Data can stay where it already lives
Often has sufficient documentation
Cons
Stability concerns of the new back end
Downtime / customer visibility
Good luck rolling back in event of a problem
Degradation of performance during the
upgrade
Testing in production is bad, mmkay
Generally limited to minor upgrades / updates
Even drop-in upgrades aren’t clean
So you want to build a new cluster
All of these inherently will cost more than upgrading in place.
Start your engines! -- Spin up a new cluster, dark write to it until enough data, cut
over consumption
Red Flag -- Stop ingestion and consumption, move data, restart ingestion and
consumption
Black Flag -- Incremental copies of data, potentially pausing ingestion/consumption
for brief periods of time
Green Flag -- Let your foundation do most of the work for you
Keys to Success
Pre-planning is essential. Don’t expect this to be a couple of days’ work; plan for
weeks.
Solid data flow foundations are key. Consider archiving all incoming data to
something like S3 so you can replay an arbitrary amount of data.
Automated / unit testing on data interaction components will create a lot more
confidence and help identify problem areas early.
Start your engines!
Spinning up a new cluster and writing data until there is enough to sustain operations
Fine if no historic data longer than spin-up time is required
Least amount of risk, if older data isn’t needed
Can back-fill legacy data after the cut-over has occurred
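The dark-write step can be sketched roughly as follows. This is an illustration under assumptions, not the author's implementation: `DictStore` and `DualWriter` are hypothetical names, and the dictionaries stand in for the old and new clusters. The key design choice shown is that a failure in the new (dark) store never affects the write to the store customers still depend on.

```python
import logging
from typing import Dict, Protocol

log = logging.getLogger("dual_writer")


class Store(Protocol):
    """Anything that accepts a keyed write (assumed interface)."""

    def write(self, key: str, value: str) -> None: ...


class DictStore:
    """In-memory stand-in for a real cluster client."""

    def __init__(self) -> None:
        self.rows: Dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        self.rows[key] = value


class DualWriter:
    """Write to the primary store, then dark-write to the new store."""

    def __init__(self, primary: Store, shadow: Store) -> None:
        self.primary = primary
        self.shadow = shadow
        self.shadow_failures = 0

    def write(self, key: str, value: str) -> None:
        # Primary failures propagate: callers must see them as before.
        self.primary.write(key, value)
        try:
            # Shadow failures are counted and logged, never raised.
            self.shadow.write(key, value)
        except Exception:
            self.shadow_failures += 1
            log.exception("dark write failed for key %s", key)
```

Tracking `shadow_failures` gives a simple signal for whether the new cluster is keeping up before consumption is cut over.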
Red flag -- Stop the race!
Shut it all down, move data to the new format, start everything back up
High customer impact
User visible downtime will occur, not just analytics/ingestion/processing downtime
Might be OK for non-critical, offline systems
Black Flag -- Dealing with the stop and go penalty
Attempts to lessen downtime/customer impact
Significant engineering time to set up properly
If you don’t have timestamped write times and non-linear data, it might not be feasible
Longest path, in terms of calendar days
High complexity, high potential for mistakes
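To show why timestamped writes matter here, the following is a small sketch of one incremental copy pass driven by a write-time watermark. The row shape, the `copy_increment` name, and the integer timestamps are assumptions for illustration; a real export would page through the source store rather than iterate a dictionary.

```python
from typing import Dict, Tuple

# Hypothetical row shape: key -> (write_time, value)
Row = Tuple[int, str]


def copy_increment(
    source: Dict[str, Row],
    target: Dict[str, Row],
    watermark: int,
) -> int:
    """Copy rows written after `watermark`; return the new watermark.

    Rows whose write_time is <= watermark are assumed to have been copied
    by an earlier pass and are skipped.
    """
    new_watermark = watermark
    for key, (write_time, value) in source.items():
        if write_time > watermark:
            target[key] = (write_time, value)
            new_watermark = max(new_watermark, write_time)
    return new_watermark
```

Each pass copies less than the one before, so the final pass, run with ingestion paused, can be short. One caveat of this sketch: a late write that lands with a timestamp at or below the current watermark would be missed, which is part of why this approach has high potential for mistakes.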
Green flag -- Letting your foundation work for you
Only “Start your engines” has less planned downtime
Difficult or impossible if you don’t have a solid data flow architecture in place
Results for this should be reproducible (in other words, you can test things multiple
times if needed)
Data needs to come in from either a queue or batch loads
If everything is from batch loads, should be able to avoid any customer disruption
Watch out for that pile up!
Queue / Message bus -- Need to have ample capacity for when you’re not ingesting
from the bus. E.g., Kinesis retains records for 24 hours by default; Kafka retention is configurable
Testing -- Build in time to test and verify migrations, and then check it all a second
time.
Testing must be multifaceted -- The code, the data, and the infrastructure
Chasing the white rabbit -- Beware the jabberwocky! Easy to fall into the bleeding edge
trap, but this is high risk, often for little reward
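The queue pile-up can be estimated with simple arithmetic before the migration starts. The sketch below assumes steady ingest and consume rates, which real traffic rarely has, and the function names and the 50% safety margin are illustrative choices rather than recommendations from the talk.

```python
def catch_up_seconds(pause_s: float, ingest_rate: float, consume_rate: float) -> float:
    """Time needed to drain the backlog built up during a consumer pause.

    Rates are in events per second. While draining, new events keep
    arriving, so the backlog shrinks at (consume_rate - ingest_rate).
    """
    if consume_rate <= ingest_rate:
        raise ValueError("consumers can never catch up at these rates")
    backlog = pause_s * ingest_rate
    return backlog / (consume_rate - ingest_rate)


def pause_is_safe(pause_s: float, retention_s: float, margin: float = 0.5) -> bool:
    """True if the pause uses at most `margin` of the queue's retention.

    The oldest unconsumed event is as old as the pause itself, so a pause
    approaching the retention window risks losing data.
    """
    return pause_s <= retention_s * margin
```

For example, a one-hour pause at 1,000 events per second with consumers that can process 1,500 per second needs two hours to drain, since the backlog of 3,600,000 events shrinks by only 500 per second.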
Example -- Migrating from Cassandra to Hadoop
“Start your engines!” approach
Began with duplicate writing to both systems
Eventually added Kafka with different consumers pushing data to both backends
Dev work to re-implement things with Hadoop/HBase took most resources/time
Once in a “stable” place, started comparing batch job outputs from two systems
Brief maintenance window to cut over
Entire process took several months including dev, ops and testing work
Example -- Migrating from Cassandra to Hadoop (cont.)
Unique challenges
Exporting data from Cassandra was hard
Prior to a decent option like Spark
Greatly complicated by vNodes
Used a set of Python scripts to actually export all the data
Had multiple kinds of products to deliver
API under constant customer use, couldn’t afford any downtime
Batch job outputs, hourly and daily
Example -- Upgrading major versions of Hadoop
Green flag approach
Had to minimize downtime
Not enough calendar time for “Start your engines”
Leveraged Snapshots (both Cassandra and HBase have this construct)
Loaded snapshots into testing environments multiple times
Majority of engineering time was in upgrading libraries and verifying there were no
breaking changes because of the version changes
Second most engineering time was spent building and running test clusters
Example -- Upgrading major versions of Hadoop (steps)
1. Determined the “time” to start ingesting into both environments
2. Took snapshots of original cluster, loaded into new cluster (can take a long time)
3. Started raw data consumers for the new cluster (i.e. enabling data insertion)
4. Once lag was reduced on insertion, started analytics based consumers
5. Enabled any batch processing
6. Continued to write to both stores for a couple of weeks
7. Verified new cluster output by comparing batch jobs to old cluster
8. Cut over customer facing APIs
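The verification in step 7 can be sketched as a keyed comparison of the two clusters' batch outputs. This assumes each job's output can be loaded as a mapping from record key to value; `Diff` and `compare_outputs` are hypothetical names, and real outputs would likely need normalization (ordering, float rounding, timestamps) before comparing.

```python
from typing import Dict, List, NamedTuple


class Diff(NamedTuple):
    """Keys that differ between the old and new batch outputs."""
    missing_in_new: List[str]
    extra_in_new: List[str]
    mismatched: List[str]


def compare_outputs(old: Dict[str, str], new: Dict[str, str]) -> Diff:
    """Report keys missing from, added to, or changed in the new output."""
    missing = sorted(old.keys() - new.keys())
    extra = sorted(new.keys() - old.keys())
    mismatched = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return Diff(missing, extra, mismatched)


def outputs_match(old: Dict[str, str], new: Dict[str, str]) -> bool:
    """True only when the two outputs are identical key for key."""
    diff = compare_outputs(old, new)
    return not (diff.missing_in_new or diff.extra_in_new or diff.mismatched)
```

Running this on every hourly and daily job during the dual-write period gives a concrete pass or fail signal before the customer-facing APIs are cut over.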
Summary
Strong foundations are essential
Number of possible ways to win the race
Plan as far out as you can foresee
Upgrading and migrating are operationally similar and have similar approaches available
Archiving raw incoming data can save you a lot of headaches if you can afford it
Racing analogies only work so long in a presentation before they get worn out

More Related Content

What's hot

Nosql East October 2009
Nosql East October 2009Nosql East October 2009
Nosql East October 2009
Christopher Curtin
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Databricks
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
Data Con LA
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream DataDataWorks Summit
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


Cloudera, Inc.
 
Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1
Chris Nauroth
 
TidalScale Overview
TidalScale OverviewTidalScale Overview
TidalScale Overview
Pete Jarvis
 
Realtime analytics with_hadoop
Realtime analytics with_hadoopRealtime analytics with_hadoop
Realtime analytics with_hadoop
Edgar Alejandro Villegas
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
DataWorks Summit/Hadoop Summit
 
Hadoop's Problem and How to Fix it
Hadoop's Problem and How to Fix itHadoop's Problem and How to Fix it
Hadoop's Problem and How to Fix it
Kognitio
 
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
DataStax
 
Spark Technology Center IBM
Spark Technology Center IBMSpark Technology Center IBM
Spark Technology Center IBM
DataWorks Summit/Hadoop Summit
 
Spark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike FreedmanSpark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike Freedman
Spark Summit
 
Hui 3.0
Hui 3.0Hui 3.0
SQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query setSQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query set
Kognitio
 
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
DataWorks Summit
 
myHadoop - Hadoop-on-Demand on Traditional HPC Resources
myHadoop - Hadoop-on-Demand on Traditional HPC ResourcesmyHadoop - Hadoop-on-Demand on Traditional HPC Resources
myHadoop - Hadoop-on-Demand on Traditional HPC Resources
Sriram Krishnan
 
Big Data Testing Approach - Rohit Kharabe
Big Data Testing Approach - Rohit KharabeBig Data Testing Approach - Rohit Kharabe
Big Data Testing Approach - Rohit Kharabe
ROHIT KHARABE
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
Ramesh Pabba - seeking new projects
 
DataStax | DataStax Enterprise Advanced Replication (Brian Hess & Cliff Gilmo...
DataStax | DataStax Enterprise Advanced Replication (Brian Hess & Cliff Gilmo...DataStax | DataStax Enterprise Advanced Replication (Brian Hess & Cliff Gilmo...
DataStax | DataStax Enterprise Advanced Replication (Brian Hess & Cliff Gilmo...
DataStax
 

What's hot (20)

Nosql East October 2009
Nosql East October 2009Nosql East October 2009
Nosql East October 2009
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1
 
TidalScale Overview
TidalScale OverviewTidalScale Overview
TidalScale Overview
 
Realtime analytics with_hadoop
Realtime analytics with_hadoopRealtime analytics with_hadoop
Realtime analytics with_hadoop
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 
Hadoop's Problem and How to Fix it
Hadoop's Problem and How to Fix itHadoop's Problem and How to Fix it
Hadoop's Problem and How to Fix it
 
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
Webinar: The Performance Challenge: Providing an Amazing Customer Experience ...
 
Spark Technology Center IBM
Spark Technology Center IBMSpark Technology Center IBM
Spark Technology Center IBM
 
Spark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike FreedmanSpark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike Freedman
 
Hui 3.0
Hui 3.0Hui 3.0
Hui 3.0
 
SQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query setSQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query set
 
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
 
myHadoop - Hadoop-on-Demand on Traditional HPC Resources
myHadoop - Hadoop-on-Demand on Traditional HPC ResourcesmyHadoop - Hadoop-on-Demand on Traditional HPC Resources
myHadoop - Hadoop-on-Demand on Traditional HPC Resources
 
Big Data Testing Approach - Rohit Kharabe
Big Data Testing Approach - Rohit KharabeBig Data Testing Approach - Rohit Kharabe
Big Data Testing Approach - Rohit Kharabe
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
DataStax | DataStax Enterprise Advanced Replication (Brian Hess & Cliff Gilmo...
DataStax | DataStax Enterprise Advanced Replication (Brian Hess & Cliff Gilmo...DataStax | DataStax Enterprise Advanced Replication (Brian Hess & Cliff Gilmo...
DataStax | DataStax Enterprise Advanced Replication (Brian Hess & Cliff Gilmo...
 

Viewers also liked

Portafolio virtua lalbertaponte
Portafolio virtua lalbertapontePortafolio virtua lalbertaponte
Portafolio virtua lalbertaponte
uftsaia
 
Mk0016 advertising management and sales
Mk0016 advertising management and salesMk0016 advertising management and sales
Mk0016 advertising management and sales
consult4solutions
 
Introduction to Managing Cancer Living Meaningfully (CALM)
Introduction to Managing Cancer Living Meaningfully (CALM) Introduction to Managing Cancer Living Meaningfully (CALM)
Introduction to Managing Cancer Living Meaningfully (CALM)
Global Institute GIPPEC
 
Mu0013 hr audit
Mu0013 hr auditMu0013 hr audit
Mu0013 hr audit
consult4solutions
 
Mi0035 computer networks
Mi0035 computer networksMi0035 computer networks
Mi0035 computer networks
consult4solutions
 
FlexDealer Automotive Digital Marketing Agency Presentation
FlexDealer Automotive Digital Marketing Agency PresentationFlexDealer Automotive Digital Marketing Agency Presentation
FlexDealer Automotive Digital Marketing Agency Presentation
Jason Prud'homme
 
Tenant services hawaii
Tenant services hawaiiTenant services hawaii
Tenant services hawaii
Certifiedps
 
Researchers’ perceptions of DH trends and topics
Researchers’ perceptions of DH trends and topicsResearchers’ perceptions of DH trends and topics
Researchers’ perceptions of DH trends and topics
Uned Laboratorio de Innovación en Humanidades
 
Mu0018 change management
Mu0018 change managementMu0018 change management
Mu0018 change management
consult4solutions
 
Pm0011 project planning and scheduling
Pm0011 project planning and schedulingPm0011 project planning and scheduling
Pm0011 project planning and scheduling
consult4solutions
 
¿Por qué elegir cristalería Chef & Sommelier?
¿Por qué elegir cristalería Chef & Sommelier?¿Por qué elegir cristalería Chef & Sommelier?
¿Por qué elegir cristalería Chef & Sommelier?
Arc Ibérica
 
Los 5 teléfonos mas avanzados - Taylor Rivera
Los 5 teléfonos mas avanzados - Taylor RiveraLos 5 teléfonos mas avanzados - Taylor Rivera
Los 5 teléfonos mas avanzados - Taylor Rivera
Jerson Saritama Bsc
 
букет
букетбукет
букет
arzmary
 
Simulation of Dispersion in a Heterogeneous Aquifer: Discussion of Steady ver...
Simulation of Dispersion in a Heterogeneous Aquifer: Discussion of Steady ver...Simulation of Dispersion in a Heterogeneous Aquifer: Discussion of Steady ver...
Simulation of Dispersion in a Heterogeneous Aquifer: Discussion of Steady ver...
Amro Elfeki
 
İnovatif Kimya Dergisi Sayı-11
İnovatif Kimya Dergisi Sayı-11İnovatif Kimya Dergisi Sayı-11
İnovatif Kimya Dergisi Sayı-11
İnovatif Kimya Dergisi
 

Viewers also liked (15)

Portafolio virtua lalbertaponte
Portafolio virtua lalbertapontePortafolio virtua lalbertaponte
Portafolio virtua lalbertaponte
 
Mk0016 advertising management and sales
Mk0016 advertising management and salesMk0016 advertising management and sales
Mk0016 advertising management and sales
 
Introduction to Managing Cancer Living Meaningfully (CALM)
Introduction to Managing Cancer Living Meaningfully (CALM) Introduction to Managing Cancer Living Meaningfully (CALM)
Introduction to Managing Cancer Living Meaningfully (CALM)
 
Mu0013 hr audit
Mu0013 hr auditMu0013 hr audit
Mu0013 hr audit
 
Mi0035 computer networks
Mi0035 computer networksMi0035 computer networks
Mi0035 computer networks
 
FlexDealer Automotive Digital Marketing Agency Presentation
FlexDealer Automotive Digital Marketing Agency PresentationFlexDealer Automotive Digital Marketing Agency Presentation
FlexDealer Automotive Digital Marketing Agency Presentation
 
Tenant services hawaii
Tenant services hawaiiTenant services hawaii
Tenant services hawaii
 
Researchers’ perceptions of DH trends and topics
Researchers’ perceptions of DH trends and topicsResearchers’ perceptions of DH trends and topics
Researchers’ perceptions of DH trends and topics
 
Mu0018 change management
Mu0018 change managementMu0018 change management
Mu0018 change management
 
Pm0011 project planning and scheduling
Pm0011 project planning and schedulingPm0011 project planning and scheduling
Pm0011 project planning and scheduling
 
¿Por qué elegir cristalería Chef & Sommelier?
¿Por qué elegir cristalería Chef & Sommelier?¿Por qué elegir cristalería Chef & Sommelier?
¿Por qué elegir cristalería Chef & Sommelier?
 
Los 5 teléfonos mas avanzados - Taylor Rivera
Los 5 teléfonos mas avanzados - Taylor RiveraLos 5 teléfonos mas avanzados - Taylor Rivera
Los 5 teléfonos mas avanzados - Taylor Rivera
 
букет
букетбукет
букет
 
Simulation of Dispersion in a Heterogeneous Aquifer: Discussion of Steady ver...
Simulation of Dispersion in a Heterogeneous Aquifer: Discussion of Steady ver...Simulation of Dispersion in a Heterogeneous Aquifer: Discussion of Steady ver...
Simulation of Dispersion in a Heterogeneous Aquifer: Discussion of Steady ver...
 
İnovatif Kimya Dergisi Sayı-11
İnovatif Kimya Dergisi Sayı-11İnovatif Kimya Dergisi Sayı-11
İnovatif Kimya Dergisi Sayı-11
 

Similar to Changing the tires on a big data racecar

Sybase IQ ile Analitik Platform
Sybase IQ ile Analitik PlatformSybase IQ ile Analitik Platform
Sybase IQ ile Analitik Platform
Sybase Türkiye
 
Building an analytical platform
Building an analytical platformBuilding an analytical platform
Building an analytical platform
David Walker
 
Solution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab AcceleratorSolution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab Accelerator
BlueData, Inc.
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
Christopher Curtin
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
Gabriele Modena
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Big Data Aplications Meetup
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Migrating to Cloud: Inhouse Hadoop to Databricks (3)
Migrating to Cloud: Inhouse Hadoop to Databricks (3)Migrating to Cloud: Inhouse Hadoop to Databricks (3)
Migrating to Cloud: Inhouse Hadoop to Databricks (3)
Knoldus Inc.
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
Josh Patterson
 
7 Stages of Scaling Web Applications
7 Stages of Scaling Web Applications7 Stages of Scaling Web Applications
7 Stages of Scaling Web Applications
David Mitzenmacher
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weitingWei Ting Chen
 
Agile data lake? An oxymoron?
Agile data lake? An oxymoron?Agile data lake? An oxymoron?
Agile data lake? An oxymoron?
samthemonad
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
Hadoop User Group
 
Ajug april 2011
Ajug april 2011Ajug april 2011
Ajug april 2011
Christopher Curtin
 
Oracle migrations and upgrades
Oracle migrations and upgradesOracle migrations and upgrades
Oracle migrations and upgrades
Durga Gadiraju
 
Enter the Age of Hadoop SuperComputing
Enter the Age of Hadoop SuperComputingEnter the Age of Hadoop SuperComputing
Enter the Age of Hadoop SuperComputing
Intel IT Center
 
Just In Time Scalability Agile Methods To Support Massive Growth Presentation
Just In Time Scalability  Agile Methods To Support Massive Growth PresentationJust In Time Scalability  Agile Methods To Support Massive Growth Presentation
Just In Time Scalability Agile Methods To Support Massive Growth Presentation
Timothy Fitz
 
Just In Time Scalability Agile Methods To Support Massive Growth Presentation
Just In Time Scalability  Agile Methods To Support Massive Growth PresentationJust In Time Scalability  Agile Methods To Support Massive Growth Presentation
Just In Time Scalability Agile Methods To Support Massive Growth PresentationEric Ries
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupesh Bansal
 

Similar to Changing the tires on a big data racecar (20)

Sybase IQ ile Analitik Platform
Sybase IQ ile Analitik PlatformSybase IQ ile Analitik Platform
Sybase IQ ile Analitik Platform
 
Building an analytical platform
Building an analytical platformBuilding an analytical platform
Building an analytical platform
 
Solution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab AcceleratorSolution Brief: Big Data Lab Accelerator
Solution Brief: Big Data Lab Accelerator
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & Alluxio
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Migrating to Cloud: Inhouse Hadoop to Databricks (3)
Migrating to Cloud: Inhouse Hadoop to Databricks (3)Migrating to Cloud: Inhouse Hadoop to Databricks (3)
Migrating to Cloud: Inhouse Hadoop to Databricks (3)
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
 
7 Stages of Scaling Web Applications
7 Stages of Scaling Web Applications7 Stages of Scaling Web Applications
7 Stages of Scaling Web Applications
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting
 
Agile data lake? An oxymoron?
Agile data lake? An oxymoron?Agile data lake? An oxymoron?
Agile data lake? An oxymoron?
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Ajug april 2011
Ajug april 2011Ajug april 2011
Ajug april 2011
 
Oracle migrations and upgrades
Oracle migrations and upgradesOracle migrations and upgrades
Oracle migrations and upgrades
 
Enter the Age of Hadoop SuperComputing
Enter the Age of Hadoop SuperComputingEnter the Age of Hadoop SuperComputing
Enter the Age of Hadoop SuperComputing
 
Just In Time Scalability Agile Methods To Support Massive Growth Presentation
Just In Time Scalability  Agile Methods To Support Massive Growth PresentationJust In Time Scalability  Agile Methods To Support Massive Growth Presentation
Just In Time Scalability Agile Methods To Support Massive Growth Presentation
 
Just In Time Scalability Agile Methods To Support Massive Growth Presentation
Just In Time Scalability  Agile Methods To Support Massive Growth PresentationJust In Time Scalability  Agile Methods To Support Massive Growth Presentation
Just In Time Scalability Agile Methods To Support Massive Growth Presentation
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata
 

Recently uploaded

一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
alex933524
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
correoyaya
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 

Changing the tires on a big data racecar

  • 1. Changing the Tires on a Big Data Racecar @davemcnelis Sr. Software Engineer, Proofpoint
  • 2. Who am I? Software engineer at Proofpoint, formerly Emerging Threats 14 years experience, 7 with Cassandra or Hadoop Currently using Scala more than any other language Big data focuses have been social media analysis, marketing data, Smart-meter analytics, and information security research Current projects revolve around building threat intelligence APIs and data stores
  • 3. Goals Outline approaches to migrating or upgrading your infrastructure / data store Pros and Cons of these approaches Demystify the process, identify ‘gotchas’ Establish guidelines and provide ideas for handling these situations, not create a gospel
  • 4. Core System Components Back end store (Hadoop, Cassandra, etc.) Queuing / messaging service (Kafka, Kinesis, AMQP, RabbitMQ) Event / Data Producers (APIs, log data, sensors) Generates base data for the messaging service Analytics (Queuing system consumers, batch jobs) Access (APIs, Front ends, batch job output)
  • 5. 2 Basic Approaches Upgrade in place Build a new cluster and figure out how to get your data over
  • 6. Upgrading in place Pros Least expensive Data can stay where it already lives Often has sufficient documentation Cons Stability concerns of the new back end Downtime / customer visibility Good luck rolling back in event of a problem Degradation of performance during the upgrade Testing in production is bad, mmkay Generally limited to minor upgrades / updates Even drop-in upgrades aren’t clean
  • 7. So you want to build a new cluster All of these inherently will cost more than upgrading in place. Start your engines! -- Spin up a new cluster, dark write to it until enough data, cut over consumption Red Flag -- Stop ingestion and consumption, move data, restart ingestion and consumption Black Flag -- Incremental copies of data, potentially pausing ingestion/consumption for brief periods of time Green Flag -- Let your foundation do most of the work for you
  • 8. Keys to Success Pre-planning is essential. Don’t expect this to be a couple of days work, plan for weeks of time. Solid data flow foundations are key. Consider archiving all incoming data to something like S3 so you can replay an arbitrary amount of data. Automated / unit testing on data interaction components will create a lot more confidence and help identify problem areas early.
  • 9. Start your engines! Spinning up a new cluster and writing data until there is enough to sustain operations Fine if no historic data longer than spin-up time is required Least amount of risk, if older data isn’t needed Can back-fill legacy data after the cut-over has occurred
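The dark-write idea behind "Start your engines!" can be sketched roughly as follows. This is a minimal illustration, not the setup used in the talk: `DarkWriter`, `primary_write`, and `shadow_write` are hypothetical names, and the two stores are stand-in Python lists rather than real Cassandra or HBase clients. The point it shows is that a failure in the new (shadow) cluster should never break writes to the cluster that is still serving customers.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


class DarkWriter:
    """Write to the live store, and mirror each write to the new store.

    Hypothetical helper: failures in the new store are recorded for later
    back-fill instead of being raised to the caller.
    """

    def __init__(self,
                 primary_write: Callable[[Record], None],
                 shadow_write: Callable[[Record], None]) -> None:
        self.primary_write = primary_write
        self.shadow_write = shadow_write
        self.shadow_failures: List[Record] = []

    def write(self, record: Record) -> None:
        # Errors from the live store propagate as they always did.
        self.primary_write(record)
        try:
            self.shadow_write(record)
        except Exception:
            # Keep the record so it can be replayed into the new cluster.
            self.shadow_failures.append(record)


# In-memory stand-ins for the old and new clusters.
old_store: List[Record] = []
new_store: List[Record] = []
writer = DarkWriter(old_store.append, new_store.append)
writer.write({"id": "1", "value": "a"})
```

Once the new store holds enough data to sustain operations, consumption can be cut over and anything in `shadow_failures` back-filled.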
  • 10. Red flag -- Stop the race! Shut it all down, move data to the new format, start everything back up High customer impact User visible downtime will occur, not just analytics/ingestion/processing downtime Might be OK for non-critical, offline systems
  • 11. Black Flag -- Dealing with the stop and go penalty Attempts to lessen downtime/customer impact Significant engineering time to set up properly If you don’t have timestamped write times and non-linear data, might not be feasible Longest path, in terms of calendar days High complexity, high potential for mistakes
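To illustrate why the Black Flag approach depends on timestamped writes, here is a rough sketch of one incremental copy pass driven by a write-time watermark. The row shape `(key, write_time, value)` and the function name `incremental_copy` are assumptions made for the example; a real pass would read from and write to the actual stores.

```python
from typing import Dict, Iterable, Tuple

Row = Tuple[str, int, str]  # (key, write_time, value); an assumed shape


def incremental_copy(source: Iterable[Row],
                     target: Dict[str, Tuple[int, str]],
                     watermark: int) -> int:
    """Copy rows written after `watermark` and return the new watermark."""
    new_watermark = watermark
    for key, write_time, value in source:
        if write_time <= watermark:
            continue  # already copied by an earlier pass
        existing = target.get(key)
        # Only overwrite when the incoming row is newer than what we hold.
        if existing is None or existing[0] < write_time:
            target[key] = (write_time, value)
        new_watermark = max(new_watermark, write_time)
    return new_watermark


source_rows = [("a", 1, "x"), ("b", 5, "y")]
target_rows: Dict[str, Tuple[int, str]] = {}
mark = incremental_copy(source_rows, target_rows, watermark=0)

# New writes land in the source between passes.
source_rows.append(("a", 7, "z"))
mark = incremental_copy(source_rows, target_rows, watermark=mark)
```

Each pass copies less data than the one before it. A row that arrives with a write time equal to the current watermark would be skipped by this sketch, which is one reason the final pass is normally run with ingestion briefly paused.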
  • 12. Green flag -- Letting your foundation work for you Only “Start your engines” has less planned downtime Difficult or impossible if you don’t have a solid data flow architecture in place Results for this should be reproducible (in other words, can test things multiple times if needed) Data needs to come in from either a queue or batch loads If everything is from batch loads, should be able to avoid any customer disruption
  • 13. Watch out for that pile up! Queue / Message bus -- Need to have ample capacity for when you’re not ingesting from the bus, e.g. Kinesis TTL is 24 hours by default, Kafka is configurable Testing -- Build in time to test and verify migrations, and then check it all a second time. Testing must be multifaceted -- The code, the data, and the infrastructure Chasing the white rabbit -- Beware the jabberwocky! Easy to fall into the bleeding edge trap, but this is high risk for often little reward
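The queue-capacity warning above comes down to simple arithmetic, sketched here under stated assumptions: ingest and drain rates are constant, consumers read oldest messages first, and the 50% safety margin is an arbitrary illustrative default. The function names and the example numbers are hypothetical.

```python
def catch_up_hours(pause_hours: float,
                   ingest_per_hour: float,
                   drain_per_hour: float) -> float:
    """Hours needed to clear the backlog built up while consumers were paused."""
    if drain_per_hour <= ingest_per_hour:
        raise ValueError("consumers must drain faster than producers write")
    backlog = pause_hours * ingest_per_hour
    # Producers keep writing during catch-up, so only the surplus rate helps.
    return backlog / (drain_per_hour - ingest_per_hour)


def pause_is_safe(pause_hours: float,
                  retention_hours: float,
                  safety_margin: float = 0.5) -> bool:
    """True if the pause uses no more than `safety_margin` of the retention window."""
    return pause_hours <= retention_hours * safety_margin


# Example: a 6 hour pause at 1M events/hour in, 1.5M events/hour out,
# against a 24 hour retention window.
hours_to_recover = catch_up_hours(6, 1_000_000, 1_500_000)
ok_to_pause = pause_is_safe(6, retention_hours=24)
```

With these example numbers the backlog takes 12 hours to clear, twice as long as the pause itself, which is worth knowing before the maintenance window starts.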
  • 14. Example -- Migrating from Cassandra to Hadoop “Start your engines!” approach Began with duplicate writing to both systems Eventually added Kafka with different consumers pushing data to both backends Dev work to re-implement things with Hadoop/HBase took most resources/time Once in a “stable” place, started comparing batch job outputs from two systems Brief maintenance window to cut over Entire process took several months including dev, ops and testing work
  • 15. Example -- Migrating from Cassandra to Hadoop (cont.) Unique challenges Exporting data from Cassandra was hard Prior to a decent option like Spark Greatly complicated by vNodes Used a set of Python scripts to actually export all the data Had multiple kinds of products to deliver API under constant customer use, couldn’t afford any downtime Batch job outputs, hourly and daily
  • 16. Example -- Upgrading major versions of Hadoop Green flag approach Had to minimize downtime Not enough calendar time for “Start your engines” Leveraged Snapshots (both Cassandra and HBase have this construct) Loaded snapshots into testing environments multiple times Majority of engineering time was in upgrading libraries and verifying there were no breaking changes because of the version changes Second most engineering time was spent building and running test clusters
  • 17. Example -- Upgrading major versions of Hadoop (steps) 1. Determine “time” to start ingesting into both environments 2. Took snapshots of original cluster, loaded into new cluster (can take a long time) 3. Started raw data consumers for the new cluster (i.e. enabling data insertion) 4. Once lag was reduced on insertion, started analytics based consumers 5. Enabled any batch processing 6. Continue to write to both stores for a couple of weeks 7. Verify new cluster output by comparing batch jobs to old cluster 8. Cut over customer facing APIs
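Step 7 above (verifying the new cluster by comparing batch job output) could look something like the sketch below. It assumes each batch job's output has already been loaded into a dictionary of key to numeric value; `diff_batch_outputs` and the sample data are hypothetical, and real outputs would more likely be files or tables.

```python
from typing import Dict, List


def diff_batch_outputs(old: Dict[str, float],
                       new: Dict[str, float],
                       tolerance: float = 1e-9) -> Dict[str, List[str]]:
    """Report keys that are missing, unexpected, or different in the new output."""
    missing = sorted(k for k in old if k not in new)
    extra = sorted(k for k in new if k not in old)
    mismatched = sorted(
        k for k in old
        if k in new and abs(old[k] - new[k]) > tolerance
    )
    return {
        "missing_in_new": missing,
        "extra_in_new": extra,
        "mismatched": mismatched,
    }


old_output = {"a": 1.0, "b": 2.0, "c": 3.0}
new_output = {"a": 1.0, "b": 2.5, "d": 4.0}
report = diff_batch_outputs(old_output, new_output)
```

An empty report across a couple of weeks of hourly and daily jobs is a reasonable signal that the customer-facing APIs can be cut over.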
  • 18. Summary Strong foundations are essential Number of possible ways to win the race Plan as far out as you can foresee Upgrading and Migrating are operationally similar, have similar approaches available Archiving raw incoming data can save you a lot of headaches if you can afford it Racing analogies only work so long in a presentation before they get worn out