Light slides supporting a Hadoop and Cassandra integration talk at the 2010 Cassandra Summit.
The code is more interesting: http://github.com/stuhood/cassandra-summit-demo
2. What is Hadoop?
• Distributed processing framework (MapReduce)
– Moves processing to the data
• Distributed filesystem
– Allows data to move when processing can't
3. Why use Hadoop with Cassandra?
Perfect partners for big data laundering
• Cassandra optimized for access
• Hadoop optimized for processing
– Many analytics frameworks
– Existing integrations
• RDBMS → Hadoop → Cassandra
4. Cluster Layouts
• Existing Hadoop cluster?
– Start Hadoop tasktrackers on Cassandra cluster
– Processing performed on local nodes
5. Cluster Layouts
• No Hadoop cluster?
– Start all Hadoop daemons on 2-3 nodes
• MapReduce depends lightly on HDFS
– Start Hadoop tasktrackers on Cassandra cluster
6. Hadoop Integration Points
• JVM MapReduce
– Keys/values iterated in process
• Hadoop Streaming
– Performs IPC with arbitrary processes over stdin/stdout
• Apache Pig
– High-level relational language (SQL alternative)
• Apache Hive
– Forthcoming support for Cassandra storage
7. Demo
• Code
– github.com/stuhood/cassandra-summit-demo
• Flow
– Load with Hadoop Streaming
– Analyze with Apache Pig
– Load/Process with JVM MapReduce
8. Hadoop Streaming Summary
• Mapper/Reducer scripts
– Any language
• Script is moved to the data
cat $input | mapper | sort | reducer > $output
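The shell pipeline above can be simulated in-process. A minimal word-count sketch in Python (the word-count task and function names are illustrative, not from the talk — a real Streaming job would read stdin and write stdout):

```python
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated 'word\t1' pair per word, as a Streaming mapper would on stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    """Sum counts per word; Hadoop's sort phase guarantees input arrives grouped by key."""
    keyed = (pair.split("\t") for pair in sorted_pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Local equivalent of: cat $input | mapper | sort | reducer > $output
    lines = ["to be or not to be"]
    print(list(reducer(sorted(mapper(lines)))))  # ['be\t2', 'not\t1', 'or\t1', 'to\t2']
```

Because the contract is just lines on stdin/stdout plus an external sort, any language that can read and write text can play either role — which is the point of Streaming.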
13. Analytics with Pig
1) Data stored in Cassandra
2) Cassandra's Pig LoadFunc
3) bin/analyze.pig (the code you write)
4) Files in HDFS
14. JVM MapReduce Summary
• Extend Mapper/Reducer base classes
• Hadoop:
– Transports the Jar to nodes near the data
– Efficiently streams data through
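The contract behind those base classes can be sketched as a toy in-memory engine. This is a conceptual illustration in Python, not the actual org.apache.hadoop.mapreduce API — the class and function names here are invented, and real Hadoop does the shuffle/sort across nodes after shipping your jar:

```python
from itertools import groupby

class Mapper:
    """Stand-in for the base class a job extends; map() emits (key, value) pairs."""
    def map(self, key, value):
        raise NotImplementedError

class Reducer:
    """Stand-in for the reducer base class; reduce() sees one key with all its values."""
    def reduce(self, key, values):
        raise NotImplementedError

def run_job(records, mapper, reducer):
    # Toy version of what the framework does for you:
    # map every record, shuffle/sort by key, then reduce each key group.
    shuffled = sorted(
        (kv for k, v in records for kv in mapper.map(k, v)),
        key=lambda kv: kv[0],
    )
    out = []
    for key, group in groupby(shuffled, key=lambda kv: kv[0]):
        out.extend(reducer.reduce(key, [v for _, v in group]))
    return out

class WordCountMapper(Mapper):
    def map(self, key, value):
        return [(word, 1) for word in value.split()]

class WordCountReducer(Reducer):
    def reduce(self, key, values):
        return [(key, sum(values))]

result = run_job([(0, "to be or not to be")], WordCountMapper(), WordCountReducer())
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

The JVM route trades Streaming's flexibility for efficiency: no per-record process IPC, and the serialized keys/values stay inside one JVM on the node holding the data.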