SlideShare a Scribd company logo
1 of 40
NoSQL, Hadoop and Cascading Christopher Curtin
About Me 20+ years in Technology Background in Factory Automation, Warehouse Management and Food Safety system development before Silverpop CTO of Silverpop
Topics Why NoSQL? What is NoSQL? Hadoop Cascading Questions
Why NoSQL? Should be ‘not just SQL’ Not all problems are relational Not all schemas are known when developing a solution Is LAMP the reason? Difficult to assign a value to a lot of the data in an enterprise
Cost Obligatory slam on Oracle/IBM/Microsoft CPU/Core costs  Disks for fast RDBMS performance Clustering, Redundancy and DR Obligatory slam on Accenture, IBM and Corp IT Data warehouses Report writers   Lead Time
NIH Some of this is ‘not invented here’ But many of these solutions came from the Internet-scale businesses like Facebook, Google, Twitter, Amazon Problems most of us will never see or really understand
Topics Why NoSQL? What is NoSQL? Hadoop Cascading Questions
What is NoSQL? Lots of definitions, but  a few key ideas Eventual Consistency vs. ACID (usually) Distributed on commodity hardware, easy to add/remove nodes (Typically) Open Source (Usually) non-SQL interface A few examples Graph databases Key/Value stores Document databases
Graph Databases For handling deeply associative data (graphs/networks) Relational isn’t good at recursive structures Data both at the leaves (nodes) and edges (relationships)  Very fast to navigate “Whiteboard Friendly” Neo4j (www.neo4j.org)
Key/Value Stores Don’t call them databases ;-) No JOIN concept, instead duplicate data Think of it as a huge Associated Array (Map) Not all values have all attributes Designed for fast reading ‘Ridiculous’ sized SimpleDB(Amazon), BigTable (Google)
Document Databases Inspired by Lotus Notes. Really. Collections of Key/Value stores But with some additional visibility into the data of the stores  CouchDB (couchdb.apache.org) Stores JSON documents Can search in documents quickly No schema so Domain changes are easier
Topics Why NoSQL? What is NoSQL? Hadoop Cascading Questions
Map/Reduce Made famous by Google for how the index the web Parallel processing of large datasets across multiple commodity nodes Highly fault tolerant Hadoop is Apache’s implementation of Map/Reduce
Map Identifies what is in the input that you want to process Can be simple: occurrence of a word Can be difficult: evaluate each row and toss those older than 90 days or from IP Range 192.168.1.* Output is a list of name/value pairs Name and value do not have to be primitives
Reduce Takes the name/value pairs from the Map step and does something useful with them Map/Reduce Framework determines which Reduce instance to call for which Map values so a Reduce only ‘sees’ one set of ‘Name’ values Output is the ‘answer’ to the question Example: bytes streamed by IP address from Apache logs
HDFS Distributed File System WITHOUT NFS etc. Hadoop knows which parts of which files are on which machine (say that 5 times fast!) “Move the processing to the data” if possible Simple API to move files in and out of HDFS 3 Copies of the data in the cluster for redundancy AND performance
Runtime Distribution © Concurrent 2009
Topics Why NoSQL? What is NoSQL? Hadoop Cascading Questions
Getting Started with Map/Reduce First challenge: real examples Second challenge: when to map and when to reduce? Third challenge: what if I need more than one of each? How to coordinate? Fourth challenge: non-trivial business logic
Cascading Hadoop coding is non-trivial Hadoop is looking for a class to do Map steps and a class to do Reduce step What if you need multiple in your application? Who coordinates what can be run in parallel? What if you need to do non-Hadoop logic between Hadoop steps?
Tuple A single ‘row’ of data being processed Each column is named Can access data by name or position
TAP Abstraction on top of Hadoop files Allows you to define own parser for files Example: Input = new Hfs(new TextLine(), a_hdfsDirectory + "/" + name);
Operations Define what to do on the data Each – for each “tuple” in data do this to it Group – similar to a ‘group by’ in SQL CoGroup – joins of tuple streams together Every – for every key in the Group or CoGroup do this
Pipes Pipes tie Operations together Pipes can be thought of as ‘tuple streams’ Pipes can be split, allowing parallel execution of Operations
Operations - advanced Each operations allow logic on the row, such a parsing dates, creating new attributes etc. Every operations allow you to iterate over the ‘group’ of rows to do non-trivial operations. Both allow multiple operations in same function, so no nested function calls!
PIPE Splitting PIPEs define processing flows A SPLIT of a PIPE will tell Cascading to pass the TUPLE to another set of logic Allows single parsing EACH to feed multiple EVERY steps Or output from one EVERY to feed different EVERY steps
Flows Flows are reusable combinations of Taps, Pipes and Operations Allows you to build library of functions  Groups of Flow are called Cascades
Cascading Scheduler Once the Flows and Cascades are defined, looks for dependencies When executed, tells Hadoop what Map, Reduce or Shuffle steps to take based on what Operations were used Knows what can be executed in parallel Knows when a step completes what other steps can execute
Example Operation RowAggregator aggr = new RowAggregator(row); Fields groupBy = new Fields(ColumnDefinition.RECIPIENT_ID_NAME); Pipe formatPipe = new Each("reformat_“ new Fields("line"), a_sentFile); formatPipe = new GroupBy(formatPipe, groupBy); formatPipe = new Every(formatPipe, Fields.ALL, aggr);
Runtime Distribution © Concurrent 2009
Dynamic Flow Creation Flows can be created at run time based on inputs. 5 input files one week, 10 the next, Java code creates 10 Flows instead of 5 Group and Every don’t care how many input Taps
Dynamic Tuple Definition Each operations on input Taps can parse text lines into different Fields So one source may have 5 inputs, another 10 Each operations can used meta data to know how to parse Can write Each operations to output common Tuples Every operations can output new Tuples as well Dynamically provide GroupBy fields so end users can build own ad-hoc logic
Mixing non-Hadoop code Cascading allows you to mix regular java between Flows in a Cascade So you can call out to databases, write intermediates to a file etc.
External File Creation An aggregator can write to the local file system Sometimes you don’t need or want to push back to HDFS  NFS mounts needed since you don’t know where the logic is going to execute GROUP BY is critical to doing this right
Real Example For the hundreds of mailings sent last year To millions of recipients Show me who opened, how often Break it down by how long they have been a subscriber And their Gender And the number of times clicked on the offer
RDBMS solution Lots of million + row joins Lots of million + row counts Temporary tables since we want multiple answers Lots of memory Lots of CPU and I/O $$ becomes bottleneck to adding more rows or more clients to same logic
Cascading Solution Let Hadoop parse input files Let Cascading group all inputs by recipient’s email Let Cascading call Every functions to look at all rows for a recipient and ‘flatten’ data Split ‘flattened’ data Pipes to process in parallel: time in list, gender, clicked on links Bandwidth to export data from RDBMS becomes bottleneck
Pros and Cons Pros Mix java between map/reduce steps Don’t have to worry about when to map, when to reduce Don’t have to think about dependencies or how to process Data definition can change on the fly Cons Level above Hadoop – sometimes ‘black magic’ Data must (should) be outside of database to get most concurrency
Other Solutions Apache Pig: http://hadoop.apache.org/pig/ More ‘sql-like’  Not as easy to mix regular Java into processes More ‘ad hoc’ than Cascading Amazon Hadoop http://aws.amazon.com/elasticmapreduce/ Runs on EC2 Provide Map and Reduce functions Can use Cascading  Pay as you go
Resources Me: ccurtin@silverpop.com @ChrisCurtin Chris Wensel: @cwensel  Web site: www.cascading.org Mailing list off website AWSome Atlanta Group: http://www.meetup.com/awsomeatlanta/ O’Reilly Hadoop Book:  http://oreilly.com/catalog/9780596521974/

More Related Content

What's hot

Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIRyuji Tamagawa
 
Getting Started with Amazon Redshift - AWS July 2016 Webinar Series
Getting Started with Amazon Redshift - AWS July 2016 Webinar SeriesGetting Started with Amazon Redshift - AWS July 2016 Webinar Series
Getting Started with Amazon Redshift - AWS July 2016 Webinar SeriesAmazon Web Services
 
SF Big Analytics: Machine Learning with Presto by Christopher Berner
SF Big Analytics: Machine Learning with Presto by Christopher BernerSF Big Analytics: Machine Learning with Presto by Christopher Berner
SF Big Analytics: Machine Learning with Presto by Christopher BernerChester Chen
 
Xldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalyticsXldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalyticsliqiang xu
 
Python and Bigdata - An Introduction to Spark (PySpark)
Python and Bigdata -  An Introduction to Spark (PySpark)Python and Bigdata -  An Introduction to Spark (PySpark)
Python and Bigdata - An Introduction to Spark (PySpark)hiteshnd
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon RedshiftAmazon Web Services
 
Hw09 Matchmaking In The Cloud
Hw09   Matchmaking In The CloudHw09   Matchmaking In The Cloud
Hw09 Matchmaking In The CloudCloudera, Inc.
 
Building AWS Redshift Data Warehouse with Matillion and Tableau
Building AWS Redshift Data Warehouse with Matillion and TableauBuilding AWS Redshift Data Warehouse with Matillion and Tableau
Building AWS Redshift Data Warehouse with Matillion and TableauLynn Langit
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Databricks
 
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On LabsBig Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On LabsIMC Institute
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
 
AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
AWS July Webinar Series: Amazon Redshift Reporting and Advanced AnalyticsAWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
AWS July Webinar Series: Amazon Redshift Reporting and Advanced AnalyticsAmazon Web Services
 
YOW! Data Keynote (2021)
YOW! Data Keynote (2021)YOW! Data Keynote (2021)
YOW! Data Keynote (2021)Sid Anand
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixJeff Magnusson
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 

What's hot (20)

Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame API
 
Getting Started with Amazon Redshift - AWS July 2016 Webinar Series
Getting Started with Amazon Redshift - AWS July 2016 Webinar SeriesGetting Started with Amazon Redshift - AWS July 2016 Webinar Series
Getting Started with Amazon Redshift - AWS July 2016 Webinar Series
 
SF Big Analytics: Machine Learning with Presto by Christopher Berner
SF Big Analytics: Machine Learning with Presto by Christopher BernerSF Big Analytics: Machine Learning with Presto by Christopher Berner
SF Big Analytics: Machine Learning with Presto by Christopher Berner
 
Xldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalyticsXldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalytics
 
Python and Bigdata - An Introduction to Spark (PySpark)
Python and Bigdata -  An Introduction to Spark (PySpark)Python and Bigdata -  An Introduction to Spark (PySpark)
Python and Bigdata - An Introduction to Spark (PySpark)
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
 
Hw09 Matchmaking In The Cloud
Hw09   Matchmaking In The CloudHw09   Matchmaking In The Cloud
Hw09 Matchmaking In The Cloud
 
Masterclass - Redshift
Masterclass - RedshiftMasterclass - Redshift
Masterclass - Redshift
 
Building AWS Redshift Data Warehouse with Matillion and Tableau
Building AWS Redshift Data Warehouse with Matillion and TableauBuilding AWS Redshift Data Warehouse with Matillion and Tableau
Building AWS Redshift Data Warehouse with Matillion and Tableau
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...
 
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On LabsBig Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
AWS July Webinar Series: Amazon Redshift Reporting and Advanced AnalyticsAWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
 
YOW! Data Keynote (2021)
YOW! Data Keynote (2021)YOW! Data Keynote (2021)
YOW! Data Keynote (2021)
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at Netflix
 
Big data clustering
Big data clusteringBig data clustering
Big data clustering
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 

Similar to NoSQL, Hadoop, Cascading June 2010

Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pigSudar Muthu
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questionsKalyan Hadoop
 
Hadoop Technologies
Hadoop TechnologiesHadoop Technologies
Hadoop Technologieszahid-mian
 
Elephant in the room: A DBA's Guide to Hadoop
Elephant in the room: A DBA's Guide to HadoopElephant in the room: A DBA's Guide to Hadoop
Elephant in the room: A DBA's Guide to HadoopStuart Ainsworth
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLArseny Chernov
 
Masterclass Webinar: Amazon Elastic MapReduce (EMR)
Masterclass Webinar: Amazon Elastic MapReduce (EMR)Masterclass Webinar: Amazon Elastic MapReduce (EMR)
Masterclass Webinar: Amazon Elastic MapReduce (EMR)Amazon Web Services
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khanKamranKhan587
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To HadoopAdeel Ahmad
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data trainingagiamas
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesJon Meredith
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-servicesSreenu Musham
 
Clogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overviewClogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overviewMadhur Nawandar
 

Similar to NoSQL, Hadoop, Cascading June 2010 (20)

Nosql East October 2009
Nosql East October 2009Nosql East October 2009
Nosql East October 2009
 
No sql
No sqlNo sql
No sql
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
Hadoop Technologies
Hadoop TechnologiesHadoop Technologies
Hadoop Technologies
 
Elephant in the room: A DBA's Guide to Hadoop
Elephant in the room: A DBA's Guide to HadoopElephant in the room: A DBA's Guide to Hadoop
Elephant in the room: A DBA's Guide to Hadoop
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
 
Masterclass Webinar: Amazon Elastic MapReduce (EMR)
Masterclass Webinar: Amazon Elastic MapReduce (EMR)Masterclass Webinar: Amazon Elastic MapReduce (EMR)
Masterclass Webinar: Amazon Elastic MapReduce (EMR)
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To Hadoop
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data training
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL Databases
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
Clogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overviewClogeny Hadoop ecosystem - an overview
Clogeny Hadoop ecosystem - an overview
 

More from Christopher Curtin

2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013Christopher Curtin
 
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013Christopher Curtin
 
2011 march cloud computing atlanta
2011 march cloud computing atlanta2011 march cloud computing atlanta
2011 march cloud computing atlantaChristopher Curtin
 
AJUG April 2011 Cascading example
AJUG April 2011 Cascading exampleAJUG April 2011 Cascading example
AJUG April 2011 Cascading exampleChristopher Curtin
 
AJUG April 2011 Raw hadoop example
AJUG April 2011 Raw hadoop exampleAJUG April 2011 Raw hadoop example
AJUG April 2011 Raw hadoop exampleChristopher Curtin
 

More from Christopher Curtin (7)

2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013
 
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
 
2011 march cloud computing atlanta
2011 march cloud computing atlanta2011 march cloud computing atlanta
2011 march cloud computing atlanta
 
AJUG April 2011 Cascading example
AJUG April 2011 Cascading exampleAJUG April 2011 Cascading example
AJUG April 2011 Cascading example
 
AJUG April 2011 Raw hadoop example
AJUG April 2011 Raw hadoop exampleAJUG April 2011 Raw hadoop example
AJUG April 2011 Raw hadoop example
 
IASA Atlanta September 2009
IASA Atlanta September 2009IASA Atlanta September 2009
IASA Atlanta September 2009
 

Recently uploaded

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 

Recently uploaded (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

NoSQL, Hadoop, Cascading June 2010

  • 1. NoSQL, Hadoop and Cascading Christopher Curtin
  • 2. About Me 20+ years in Technology Background in Factory Automation, Warehouse Management and Food Safety system development before Silverpop CTO of Silverpop
  • 3. Topics Why NoSQL? What is NoSQL? Hadoop Cascading Questions
  • 4. Why NoSQL? Should be ‘not just SQL’ Not all problems are relational Not all schemas are known when developing a solution Is LAMP the reason? Difficult to assign a value to a lot of the data in an enterprise
  • 5. Cost Obligatory slam on Oracle/IBM/Microsoft CPU/Core costs Disks for fast RDBMS performance Clustering, Redundancy and DR Obligatory slam on Accenture, IBM and Corp IT Data warehouses Report writers Lead Time
  • 6. NIH Some of this is ‘not invented here’ But many of these solutions came from the Internet-scale businesses like Facebook, Google, Twitter, Amazon Problems most of us will never see or really understand
  • 7. Topics Why NoSQL? What is NoSQL? Hadoop Cascading Questions
  • 8. What is NoSQL? Lots of definitions, but a few key ideas Eventual Consistency vs. ACID (usually) Distributed on commodity hardware, easy to add/remove nodes (Typically) Open Source (Usually) non-SQL interface A few examples Graph databases Key/Value stores Document databases
  • 9. Graph Databases For handling deeply associative data (graphs/networks) Relational isn’t good at recursive structures Data both at the leaves (nodes) and edges (relationships) Very fast to navigate “Whiteboard Friendly” Neo4j (www.neo4j.org)
  • 10. Key/Value Stores Don’t call them databases ;-) No JOIN concept, instead duplicate data Think of it as a huge Associated Array (Map) Not all values have all attributes Designed for fast reading ‘Ridiculous’ sized SimpleDB(Amazon), BigTable (Google)
  • 11. Document Databases Inspired by Lotus Notes. Really. Collections of Key/Value stores But with some additional visibility into the data of the stores CouchDB (couchdb.apache.org) Stores JSON documents Can search in documents quickly No schema so Domain changes are easier
  • 12. Topics Why NoSQL? What is NoSQL? Hadoop Cascading Questions
  • 13. Map/Reduce Made famous by Google for how the index the web Parallel processing of large datasets across multiple commodity nodes Highly fault tolerant Hadoop is Apache’s implementation of Map/Reduce
  • 14. Map Identifies what is in the input that you want to process Can be simple: occurrence of a word Can be difficult: evaluate each row and toss those older than 90 days or from IP Range 192.168.1.* Output is a list of name/value pairs Name and value do not have to be primitives
  • 15. Reduce Takes the name/value pairs from the Map step and does something useful with them Map/Reduce Framework determines which Reduce instance to call for which Map values so a Reduce only ‘sees’ one set of ‘Name’ values Output is the ‘answer’ to the question Example: bytes streamed by IP address from Apache logs
  • 16. HDFS Distributed File System WITHOUT NFS etc. Hadoop knows which parts of which files are on which machine (say that 5 times fast!) “Move the processing to the data” if possible Simple API to move files in and out of HDFS 3 Copies of the data in the cluster for redundancy AND performance
  • 17. Runtime Distribution © Concurrent 2009
  • 18. Topics Why NoSQL? What is NoSQL? Hadoop Cascading Questions
  • 19. Getting Started with Map/Reduce First challenge: real examples Second challenge: when to map and when to reduce? Third challenge: what if I need more than one of each? How to coordinate? Fourth challenge: non-trivial business logic
  • 20. Cascading Hadoop coding is non-trivial Hadoop is looking for a class to do Map steps and a class to do Reduce step What if you need multiple in your application? Who coordinates what can be run in parallel? What if you need to do non-Hadoop logic between Hadoop steps?
  • 21. Tuple A single ‘row’ of data being processed Each column is named Can access data by name or position
  • 22. TAP Abstraction on top of Hadoop files Allows you to define own parser for files Example: Input = new Hfs(new TextLine(), a_hdfsDirectory + "/" + name);
  • 23. Operations Define what to do on the data Each – for each “tuple” in data do this to it Group – similar to a ‘group by’ in SQL CoGroup – joins of tuple streams together Every – for every key in the Group or CoGroup do this
  • 24. Pipes Pipes tie Operations together Pipes can be thought of as ‘tuple streams’ Pipes can be split, allowing parallel execution of Operations
  • 25. Operations - advanced Each operations allow logic on the row, such a parsing dates, creating new attributes etc. Every operations allow you to iterate over the ‘group’ of rows to do non-trivial operations. Both allow multiple operations in same function, so no nested function calls!
  • 26. PIPE Splitting PIPEs define processing flows A SPLIT of a PIPE will tell Cascading to pass the TUPLE to another set of logic Allows single parsing EACH to feed multiple EVERY steps Or output from one EVERY to feed different EVERY steps
  • 27. Flows Flows are reusable combinations of Taps, Pipes and Operations Allows you to build library of functions Groups of Flow are called Cascades
  • 28. Cascading Scheduler Once the Flows and Cascades are defined, looks for dependencies When executed, tells Hadoop what Map, Reduce or Shuffle steps to take based on what Operations were used Knows what can be executed in parallel Knows when a step completes what other steps can execute
  • 29. Example Operation RowAggregator aggr = new RowAggregator(row); Fields groupBy = new Fields(ColumnDefinition.RECIPIENT_ID_NAME); Pipe formatPipe = new Each("reformat_“ new Fields("line"), a_sentFile); formatPipe = new GroupBy(formatPipe, groupBy); formatPipe = new Every(formatPipe, Fields.ALL, aggr);
  • 30. Runtime Distribution © Concurrent 2009
  • 31. Dynamic Flow Creation Flows can be created at run time based on inputs. 5 input files one week, 10 the next, Java code creates 10 Flows instead of 5 Group and Every don’t care how many input Taps
  • 32. Dynamic Tuple Definition Each operations on input Taps can parse text lines into different Fields So one source may have 5 inputs, another 10 Each operations can used meta data to know how to parse Can write Each operations to output common Tuples Every operations can output new Tuples as well Dynamically provide GroupBy fields so end users can build own ad-hoc logic
  • 33. Mixing non-Hadoop code Cascading allows you to mix regular java between Flows in a Cascade So you can call out to databases, write intermediates to a file etc.
  • 34. External File Creation An aggregator can write to the local file system Sometimes you don’t need or want to push back to HDFS NFS mounts needed since you don’t know where the logic is going to execute GROUP BY is critical to doing this right
  • 35. Real Example For the hundreds of mailings sent last year To millions of recipients Show me who opened, how often Break it down by how long they have been a subscriber And their Gender And the number of times clicked on the offer
  • 36. RDBMS solution Lots of million + row joins Lots of million + row counts Temporary tables since we want multiple answers Lots of memory Lots of CPU and I/O $$ becomes bottleneck to adding more rows or more clients to same logic
  • 37. Cascading Solution Let Hadoop parse input files Let Cascading group all inputs by recipient’s email Let Cascading call Every functions to look at all rows for a recipient and ‘flatten’ data Split ‘flattened’ data Pipes to process in parallel: time in list, gender, clicked on links Bandwidth to export data from RDBMS becomes bottleneck
  • 38. Pros and Cons Pros Mix java between map/reduce steps Don’t have to worry about when to map, when to reduce Don’t have to think about dependencies or how to process Data definition can change on the fly Cons Level above Hadoop – sometimes ‘black magic’ Data must (should) be outside of database to get most concurrency
  • 39. Other Solutions Apache Pig: http://hadoop.apache.org/pig/ More ‘sql-like’ Not as easy to mix regular Java into processes More ‘ad hoc’ than Cascading Amazon Hadoop http://aws.amazon.com/elasticmapreduce/ Runs on EC2 Provide Map and Reduce functions Can use Cascading Pay as you go
  • 40. Resources Me: ccurtin@silverpop.com @ChrisCurtin Chris Wensel: @cwensel Web site: www.cascading.org Mailing list off website AWSome Atlanta Group: http://www.meetup.com/awsomeatlanta/ O’Reilly Hadoop Book: http://oreilly.com/catalog/9780596521974/