Two-tier Architecture (slide 11)
Real-time Tier (Couchbase)
• Detects user intent
• Gives next best recommendation or deal
Data Bridge (Couchdoop)
Batch Tier (Hadoop)
• Recommends products
Data flows: user events (real-time tier to batch tier), recommendations (batch tier to real-time tier)
Importing Data (slide 16)
{
  "user": "Rudy",
  "action": "view",
  "product": "Fender Guitar"
}
{
  "user": "Rudy",
  "action": "click",
  "product": "Guitar Amplifier"
}
{
  "user": "Emma",
  "action": "buy",
  "product": "Blue Skirt"
}
Diagram labels: Couchdoop, IMPORT, HDFS, Hadoop, Machine Learning, Recommendations
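The editor's notes for the import slides describe pulling event documents through a Couchbase view keyed by date and spreading the view keys across map tasks. The following is a minimal plain-Python simulation of that idea, not Couchbase or Couchdoop code: the `date` field and its values, the document IDs, the function names, and the round-robin split are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical event documents; the IDs and the "date" field are assumed.
EVENTS = {
    "evt::1": {"user": "Rudy", "action": "view", "product": "Fender Guitar", "date": "09-16"},
    "evt::2": {"user": "Rudy", "action": "click", "product": "Guitar Amplifier", "date": "09-16"},
    "evt::3": {"user": "Emma", "action": "buy", "product": "Blue Skirt", "date": "09-17"},
}


def build_view(documents, map_fn):
    """Index document IDs by every key that map_fn emits (a secondary index)."""
    index = defaultdict(list)
    for doc_id, doc in documents.items():
        for view_key in map_fn(doc):
            index[view_key].append(doc_id)
    return index


def emit_date(doc):
    """Stand-in for the view's map function: emit the event date as the view key."""
    if "date" in doc:
        yield doc["date"]


def split_view_keys(view_keys, num_tasks):
    """Deal view keys round-robin into num_tasks groups (assumed policy)."""
    groups = [[] for _ in range(num_tasks)]
    for i, key in enumerate(view_keys):
        groups[i % num_tasks].append(key)
    return groups


view = build_view(EVENTS, emit_date)
splits = split_view_keys(["09-16", "09-17"], num_tasks=2)
for task_id, keys in enumerate(splits):
    # Each simulated map task queries the view with its own view keys.
    docs = [doc_id for key in keys for doc_id in view[key]]
    print(task_id, keys, docs)
```

With two view keys and two tasks, each simulated task queries exactly one view key, which mirrors the default behaviour described in the notes.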
Exporting Data (slide 17)
{
  "user": "Rudy",
  "recommendations": [
    ["Ibanez Acoustic Guitar", 450],
    ["Guitar Tuner", 120],
    ["Sound Mixer", 30]
  ]
}
Diagram labels: Couchdoop, Hadoop, Machine Learning, Recommendations
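According to the editor's notes, failed export operations are retried with exponential back-off. A minimal sketch of that retry pattern follows; the function names, the retry count, the base delay, and the simulated flaky store are assumptions for illustration and are not Couchdoop's actual implementation.

```python
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call operation() until it succeeds, doubling the wait after each failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except IOError:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            sleep(delay)
            delay *= 2  # exponential growth eases pressure on a busy server


class FlakyStore:
    """Hypothetical in-memory store whose first `failures` writes fail."""

    def __init__(self, failures):
        self.failures = failures
        self.data = {}

    def set(self, key, value):
        if self.failures > 0:
            self.failures -= 1
            raise IOError("temporary failure")
        self.data[key] = value
        return True


store = FlakyStore(failures=2)
waits = []  # record the delays instead of really sleeping
ok = retry_with_backoff(
    lambda: store.set("rec::Rudy", ["Guitar Tuner", 120]),
    sleep=waits.append,
)
print(ok, waits)
```

Passing `sleep` as a parameter is only a convenience for experimenting: it lets the delays be recorded rather than waited out.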
Updating Data (slide 18)
{
  "user": "Rudy",
  "recommendations": [
    ["Ibanez Acoustic Guitar", 450],
    ["Guitar Tuner", 120],
    ["Sound Mixer", 30]
  ]
}
Diagram labels: Couchdoop, Hadoop, Machine Learning, Recommendations
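The editor's notes describe updating as combining a document's previous value with new data from Hadoop storage. The sketch below shows one possible merge for the recommendations document on this slide. The merge policy (keep the higher score per product, highest score first), the function name, and the "new" document are assumptions, since the source does not specify how values are combined.

```python
def merge_recommendations(previous, new):
    """Combine two recommendation documents for the same user.

    Assumed policy: keep the higher score per product, highest score first.
    """
    if previous["user"] != new["user"]:
        raise ValueError("documents belong to different users")
    scores = {}
    for product, score in previous["recommendations"] + new["recommendations"]:
        scores[product] = max(score, scores.get(product, score))
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return {"user": previous["user"], "recommendations": [list(pair) for pair in ranked]}


# Previous value, as shown on the slide.
previous = {
    "user": "Rudy",
    "recommendations": [
        ["Ibanez Acoustic Guitar", 450],
        ["Guitar Tuner", 120],
        ["Sound Mixer", 30],
    ],
}
# Hypothetical fresh batch output for the same user.
new = {"user": "Rudy", "recommendations": [["Guitar Tuner", 200], ["Guitar Strap", 60]]}

merged = merge_recommendations(previous, new)
print(merged)
```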
Couchdoop: Connecting Hadoop with Couchbase

Editor's Notes

  1. Hello! My name is Călin-Andrei Burloiu and I work at Avira as a Big Data Engineer. I am going to present Couchdoop, an open-source application and library that we developed in-house at Avira, which helps bring real-time applications closer to Big Data.
  2. My presentation is structured around the idea of splitting Big Data applications into two tiers, a real-time tier and a batch tier. For each tier I will present the technologies used at Avira: Couchbase and Hadoop, respectively. I will show how my project, Couchdoop, sits as a data bridge between the two tiers. At the end I will show how my project performs in production.
  3. First of all, I want to say a few words about myself. I have a master's degree in Computer Science and I have been a Big Data Engineer at Avira for more than a year. I have a year and a half of experience in this field. In this period I had the chance to work with technologies such as Hadoop, HBase, Couchbase, Hive and Spark. My main expertise is in distributed systems, but since I work with data I am eager to learn more about data science topics like statistics, machine learning and data mining.
  4. Consumer applications that work with large amounts of data and serve many users are typically split into two tiers. The batch tier is responsible for transforming large volumes of data into relevant information. Because the resulting information is much smaller, the real-time tier can serve it with low latency.
  5. This is a stacked representation of the Two-tier architecture. My project, Couchdoop, is the thin orange layer between the two tiers, which acts as a data bridge. Its responsibility is to abstract the interface between the layers, with the smallest possible overhead.
  6. I believe in the power of example, so I built my presentation around a real application developed at Avira. Along with our famous antivirus, our security suite includes Avira Browser Safety, an extension for Chrome and Firefox which protects users from malware sites and enhances privacy. This year Avira launched Avira Offers, a feature of Avira Browser Safety which assists users while shopping online by giving them better deals for products. In the future we wish to create a recommendation system which helps users discover new products.
  7. In this screenshot the user landed on a guitar product page on Amazon. Avira Offers shows a deal for the same product, but from a different merchant with 14% off.
  8. Let’s imagine a fictitious user. So, meet Rudy, a guitar player who plays in a heavy metal band. Although he is not a sound engineer, he likes to play with his studio equipment from his basement. He likes to travel around the world and in his spare time he likes to buy clothes.
  9. So how would Avira Offers work for Rudy? Based on his shopping history, the browser extension recommends guitar accessories, recording studio equipment, trips around the world and fancy clothes. But what if he is not in the mood to buy music-related products? Imagine that, after 8 hours of practice and mixing, he only wants to relax by buying a new pair of sunglasses. So, based on his very recent actions, Avira Offers is able to detect his intention and only show what Rudy wants to see at that moment.
  10. Here is what Avira Offers would show to Rudy based on user intent. If he is shopping online for musical instruments, our app might give him a good deal for a new guitar or it might recommend some accessories for his guitars. If he searches for hotels in Paris, then he might be planning a concert with his band there. Avira Offers detects user intent and gives him a voucher for a nice hotel and a deal for a cheap flight. If he is searching for snorkeling equipment and it's summer, then he is probably planning a holiday with his girlfriend in Greece. Avira Offers detects his intent of buying holiday-related products and gives him a better deal for snorkeling equipment or recommends some nice swimsuits for his girlfriend.
  11. Going back to the Two-tier Architecture, let's see how it fits Avira Offers. The real-time tier is responsible for intent detection and must be able to give the next best recommendation. The batch tier must be able to recommend products to the user. All user events captured by the real-time tier flow through the data bridge to the batch tier. Machine learning algorithms use these user events to recommend products. After computing the recommendations, the batch tier uses the data bridge to publish them to the real-time tier. The batch tier uses Hadoop for Big Data processing. The real-time tier stores its data in Couchbase, a NoSQL database. The data bridge is Couchdoop, the open-source project I developed.
  12. Now I will tell you more about Couchbase. But first of all, has anyone here worked with Couchbase before? OK! Has anyone here heard about Couchbase, or is this the first time you hear about it? OK! Couchbase is a document-based NoSQL database, a key-value store, which stores JSON documents. Its main advantage is that it is very fast. It accomplishes this by keeping most of its documents and the index in memory. I am sure a lot of you have worked with Memcached before. Couchbase is basically an evolution of Memcached. Like Memcached, it is scalable, because it runs on a computer cluster. Unlike Memcached, it is fault tolerant: if a node fails, you don't lose the data, because it is persisted on disk and replicated to other nodes.
  13. You may wonder why we chose Couchbase. This graph shows how it performs in comparison with its main competitors. If we assume that most of our operations are reads, we can see that Couchbase has a lower latency than MongoDB and Cassandra.
  14. In order to make the two tiers talk to each other an interface between them is necessary. Avira uses both Hadoop and Couchbase. In order to connect the two technologies we created Couchdoop. It is able to import documents from Couchbase to Hadoop storage. In Avira Offers, it feeds the recommending system with user events. Couchdoop is then able to export documents from Hadoop storage to Couchbase. In Avira Offers it makes the computed recommendations available to the real-time tier. Couchdoop can also update a document from Couchbase by combining its previous value with new data from Hadoop storage. This is useful for performing synchronizations, like updating user profile documents or updating item scores.
  15. The company behind Couchbase already provided a Hadoop connector based on Sqoop. But it didn’t fit our needs at Avira, because of its limitations, so we decided to develop Couchdoop. The Sqoop connector is only able to take full dumps from Couchbase or to stream documents for a limited amount of time. Additionally, Sqoop was designed to work with SQL databases, not with NoSQL.
  16. The real-time tier tracks user activity by storing it in Couchbase as JSON documents. We track when a user views a product web page, when they click on an offer given by our application, or when they buy a product. Then, all these events need to be stored in Hadoop storage, so we use Couchdoop to import them. A machine learning algorithm from Mahout (like collaborative filtering) uses the user events to make recommendations.
  17. After the recommendations are computed, the data needs to be exported back to Couchbase. This is where Couchdoop comes again with its export feature.
  18. After the recommendations are computed, the data needs to be exported back to Couchbase. This is where Couchdoop comes again with its export feature.
  19. In order to import data from Couchbase into Hadoop we need to use a Couchbase feature known as views. Couchbase is a key-value store, so we need to know the key in order to retrieve a document. Keys are the primary index. But what if we don't know the key? In our use case we want to retrieve lots of documents by a secondary index, so we must use views, which provide indexing by using JavaScript MapReduce functions. Do not confuse them with Hadoop MapReduce functions: Couchbase uses its own MapReduce engine.
  20. Let's assume our documents from Couchbase are used to track user events such as clicks from Avira Offers. This sample document says that user Rudy clicked on a Guitar product offer on the 16th of September. If we want to import click events into Hadoop by date, we can create a view with this JavaScript map function. By emitting the date for each document we can retrieve all documents for a particular date by querying the view. Views are queried by using view keys, which are secondary keys. Our sample document can be retrieved with this view key.
  21. Let's assume our documents from Couchbase are used to track user events such as clicks from Avira Offers. This sample document says that user Rudy clicked on a Guitar product offer on the 16th of September. If we want to import click events into Hadoop by date, we can create a view with this JavaScript map function. By emitting the date for each document we can retrieve all documents for a particular date by querying the view. Views are queried by using view keys, which are secondary keys. Our sample document can be retrieved with this view key.
  22. When using Couchdoop's import feature we must pass a set of view keys which are used to query Couchbase for documents. On this slide, you can see how to do this with the command line tool. Colored in red is a list of two view keys separated by a semicolon. The list is used to exploit Hadoop's parallelism, because view keys are distributed evenly across map tasks. Each view key will retrieve multiple documents. The parallelism is controlled by setting the number of map tasks. By default, each map task will use exactly one view key to query Couchbase.
  23. You can use Couchdoop's import feature in your own MapReduce jobs by setting the InputFormat to CouchbaseViewInputFormat. Behind the scenes, the data from Couchbase is partitioned into multiple InputSplits. The parallelism value dictates the number of InputSplits. Each InputSplit contains one or more view keys. The Hadoop framework creates a RecordReader for each InputSplit, which does the following. It connects to Couchbase. Then it queries the view by using the view keys from the InputSplit. Finally, it converts the retrieved Couchbase key-values into Hadoop input key-values. You will then be able to write map functions directly on Couchbase data.
  24. Exporting is much simpler, because you don't need views. Hadoop key-values are directly mapped to Couchbase key-values. For each key-value written to Couchbase from a Hadoop task, you can specify an operation, such as ADD, SET, REPLACE or DELETE. To improve reliability, we used exponential back-off for failed operations: if an operation fails, it is retried several times, and the interval between retries grows exponentially in order to avoid congestion.
  25. In order to write data to Couchbase from your own MapReduce jobs, you must configure the OutputFormat to CouchbaseOutputFormat. Behind the scenes, a RecordWriter is used which does the following. It connects to Couchbase. Then, it converts Hadoop output key-values to Couchbase operations. Finally, the key-value operations are written to Couchbase. OutputCommitters are usually required when writing to a file system, such as HDFS. Here, no OutputCommitter implementation is required.
  26. We would like to thank Bigstep for providing us with their infrastructure for our experiments. We used bare metal instances. The Hadoop cluster had 7 worker nodes. The Couchbase cluster initially had only 2 nodes. Bigstep's infrastructure allowed us to provision an additional Couchbase node within minutes. Each node had two CPUs with a total of 16 physical cores. This means 32 virtual cores with hyper-threading. Each node had 192 GiB and used local hard disks. The nodes were connected with 10-gigabit Ethernet connections.
  27. We tested two features of Couchdoop: importing of documents from Couchbase to Hadoop and exporting documents to Couchbase from Hadoop. Between various experiments we varied the parallelism and the number of nodes used in a Couchbase cluster. To provide statistical significance we repeated each experiment 10 times.
  28. The data used in our experiments consists of fake, auto-generated JSON documents of about 2.2 kilobytes each. Each import experiment pulls about 7 GiB of data from Couchbase. For exporting, we used HDFS files to control parallelism. FileInputFormat will use a map task for each file. So, the number of input files equals the parallelism value. As a consequence, as the parallelism value increases, the size of the data which is pushed to Couchbase also increases, linearly. So there is a difference between imports and exports in this experiment. Imports work with constant amounts of data, while for exports the amount of data is variable.
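For a sense of scale, the two figures given above (about 2.2 kilobytes per document, about 7 GiB per import) imply a document count that can be estimated with a few lines of Python. The estimate assumes 1 kilobyte = 1000 bytes and 1 GiB = 2^30 bytes, and both inputs are approximate, so the result is only an order-of-magnitude figure.

```python
# Approximate figures from the experiment description.
DOC_SIZE_BYTES = 2200           # about 2.2 kilobytes (assuming 1 kB = 1000 bytes)
IMPORT_SIZE_BYTES = 7 * 2**30   # about 7 GiB


def estimated_documents(total_bytes, doc_bytes):
    """Estimate how many documents of `doc_bytes` fit in `total_bytes`."""
    return total_bytes // doc_bytes


docs_per_import = estimated_documents(IMPORT_SIZE_BYTES, DOC_SIZE_BYTES)
print(f"Roughly {docs_per_import / 1e6:.1f} million documents per import")
```

Under these assumptions each import experiment moves on the order of a few million documents.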
  29. Now let’s see how exporting from Hadoop to Couchbase performs.
  30. This graphic shows the evolution of the number of operations per second during an experiment with no parallelism, where we only have 1 mapper. We reach about 4000 operations per second.
  31. Now let’s start to increase the parallelism. With 2 mappers we reach about 7000 operations per second...
  32. ... with 4 mappers we reach 14,000...
  33. ... with 8, about 20,000.
  34. Now we change the scale to accommodate higher values. For 16 mappers, we reach about 30,000 ops/second…
  35. … with 32 mappers we reach more than 50,000…
  36. … with 64 we get 90,000…
  37. … and with a parallelism of 128 we exceed 200,000 operations per second.
  38. The experiment with 256 mappers shows that we exceeded the maximum number of Hadoop tasks that could be executed simultaneously and some tasks wait for others to finish. We can see two waves of tasks executed during the experiment. However, we still see growth to about 340k operations per second.
  39. From this point increasing parallelism no longer improves performance.
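The two waves seen with 256 mappers can be reasoned about with a short Python sketch. The number of simultaneous task slots is not stated in the talk. The value used here, one slot per virtual core on each of the 7 worker nodes, is purely an assumption for illustration, and the function name `task_waves` is hypothetical.

```python
import math


def task_waves(num_tasks, concurrent_slots):
    """Number of sequential waves needed to run `num_tasks` tasks."""
    return math.ceil(num_tasks / concurrent_slots)


# Assumption (not from the talk): one map slot per virtual core,
# on 7 worker nodes with 32 virtual cores each.
assumed_slots = 7 * 32

waves = {mappers: task_waves(mappers, assumed_slots) for mappers in (128, 256)}
```

Under that assumption 128 mappers fit in a single wave while 256 need two, which would be consistent with the behavior described above.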
  40. Here we can see the previous experiments overlapped. We can clearly conclude that increasing the parallelism dramatically improves performance for exports. Notice how the limit is exceeded at 256 mappers.
  41. The limit is better shown in this graph, which shows how performance grows with parallelism.
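The growth described above can also be expressed as speedup relative to a single mapper. The Python sketch below uses the approximate throughput values quoted in the export slides ("about", "more than"), so the results are indicative only. Note also that the exported data volume grows with parallelism in these experiments. The function name `speedup_table` is an assumption.

```python
# Approximate export throughput (ops/s) per number of mappers, as quoted in the slides.
export_ops = {1: 4000, 2: 7000, 4: 14000, 8: 20000, 16: 30000,
              32: 50000, 64: 90000, 128: 200000, 256: 340000}


def speedup_table(throughput):
    """Return {mappers: (speedup, efficiency)} relative to one mapper.

    Efficiency is speedup divided by the number of mappers.
    """
    base = throughput[1]
    return {m: (ops / base, ops / base / m) for m, ops in throughput.items()}


table = speedup_table(export_ops)
for mappers, (speedup, efficiency) in table.items():
    print(f"{mappers:>3} mappers: {speedup:5.1f}x speedup, {efficiency:4.0%} efficiency")
```

Efficiency below 100% means throughput grows more slowly than the number of mappers, even though the absolute number of operations per second keeps rising.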
  42. The previous experiments used a Couchbase cluster with 3 nodes. Now let’s see how they compare with a 2-node cluster. With 1 mapper (no parallelism) we have better results with 3 Couchbase nodes. Makes sense, doesn’t it?
  43. The same thing can be said while we increase the parallelism to 16 mappers…
  44. … then 32…
  45. … but starting from 64 the situation is reversed. More Couchbase nodes actually perform worse with high levels of parallelism.
  46. Same for 128. This is because lots of nodes communicate with each other and the latency grows, affecting the performance. This is a known issue for high contention clouds like Amazon.
  47. Now let’s see how importing data from Couchbase to Hadoop performs.
  48. With no parallelism we reach more than 9000 ops/s.
  49. With 2 mappers we reach 17,000.
  50. With 4, we exceed 30,000…
  51. … with 8 we reach 60,000.
  52. Now we change the scale to accommodate higher values. For 16 mappers, we reach about 90k ops/second…
  53. … with 32 it’s more than 130k…
  54. … with 64 we reach 170,000…
  55. … but at 128 parallelism we see no difference.
  56. By looking at all import experiments overlapped we conclude that at 64 parallelism we reach the limit. Again, increasing parallelism provides significant performance benefits.
  57. This graph shows how performance evolves while increasing parallelism. The limit for our setup is shown at 32 mappers.
  58. Now let’s see how the number of nodes in the Couchbase cluster affects performance. Again, for low levels of parallelism the performance for more nodes is better. This applies to 2 mappers…
  59. … 4 mappers…
  60. … but at 8 mappers, we have similar performance.
  61. Almost the same for 16 mappers.
  62. After reaching 64 mappers, again, as with the export, more Couchbase nodes perform worse for high levels of parallelism.
  63. We can now conclude that most consumer applications should be split into two tiers: real-time and batch. The open-source project we developed, Couchdoop, provides an interface between the two tiers if we want to use Couchbase and Hadoop. It performs very well, as we saw in the experiments, and it has an easy-to-use API. The experiments clearly show that exploiting Hadoop’s parallelism improves the communication with Couchbase. Avira Offers is the first real-life application to use Couchdoop in production in order to implement the 2-tier architecture.
  64. Thank you very much! Check out the Couchdoop project on GitHub. Feel free to use it and fork it. Do you have any questions?