Big Data with Cassandra and Celery
#bbmnk November 2013

Santi Camps Taltavull
@santicamps
@socialvane
The Problem (Big Data)

➲ Large volume of information (terabytes)
➲ Unstructured information
➲ Low density of useful information
➲ Very high processing capacity required
➲ Small budget
The Solutions Applied

➲ Cassandra distributed database
➲ Celery distributed task manager
➲ RabbitMQ message broker
➲ Application --> RabbitMQ --> Celery <--> Cassandra
➲ 4 initial servers
➲ 12 TB of capacity
➲ 208 GB of RAM
➲ 44 CPU cores
➲ Fault tolerant
➲ Redundant
➲ Very easily scalable
➲ And cheap!
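The flow Application --> RabbitMQ --> Celery <--> Cassandra can be pictured with a small standard-library sketch. Everything here is a stand-in chosen for illustration: a `queue.Queue` plays the broker, threads play the workers, and a dictionary plays the database. None of it is the real RabbitMQ, Celery or Cassandra API.

```python
import queue
import threading

# Stand-ins for the real components (all hypothetical, for illustration only):
#   broker -> RabbitMQ, worker threads -> Celery workers, store dict -> Cassandra.
broker = queue.Queue()
store = {}
store_lock = threading.Lock()


def worker():
    """Consume messages from the broker until a None sentinel arrives."""
    while True:
        message = broker.get()
        if message is None:
            break
        key, text = message
        with store_lock:
            # The worker both reads from and writes to the store (the <--> arrow).
            previous = store.get(key, 0)
            store[key] = previous + len(text.split())


def run_pipeline(messages, n_workers=2):
    """Application side: publish messages, then wait for the workers to finish."""
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for message in messages:
        broker.put(message)
    for _ in threads:
        broker.put(None)  # one stop sentinel per worker
    for t in threads:
        t.join()
    return dict(store)


result = run_pipeline(
    [("brand-a", "great tool"), ("brand-a", "nice"), ("brand-b", "ok then")]
)
print(result)
```

The application only talks to the broker. Adding capacity means starting more workers, which is the property the slide calls easily scalable.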
Cassandra

➲ Born inside Facebook and later released as open source
➲ Adopted by the Apache foundation
➲ Twitter also uses it
➲ Written in Java
➲ It is a NoSQL database
➲ Data is stored as key -> value
Cassandra - Advantages

➲ Distributed database
➲ Configurable redundancy
➲ Fault tolerant
➲ Ready for WAN deployments
➲ Fully scalable
Cassandra - Disadvantages

➲ No transaction management
➲ Writes are coordinated by timestamps
➲ With RandomPartitioner, rows cannot be retrieved in key order
➲ With RandomPartitioner, filtering is difficult
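Two of these points can be illustrated with a short sketch. It rests on simplifying assumptions: `reconcile` models timestamp coordination as "the highest timestamp wins", and `md5_token` only approximates how a hash-based partitioner turns a row key into a position. Both function names are hypothetical and this is not Cassandra's actual code.

```python
import hashlib


def reconcile(versions):
    """Pick the winning (value, timestamp) pair: the highest timestamp wins."""
    return max(versions, key=lambda pair: pair[1])


def md5_token(row_key):
    """Approximate a hash-partitioner token: an integer derived from MD5."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return abs(int.from_bytes(digest, byteorder="big", signed=True))


# Two replicas hold different versions of the same column.
winner = reconcile([("old bio", 1383782405981374), ("new bio", 1383782405981999)])

# Rows are placed by token, so token order is unrelated to row-key order.
keys = ["user.0001", "user.0002", "user.0003"]
by_token = sorted(keys, key=md5_token)
print(winner, by_token)
```

Because rows are stored by token, a range scan over row keys comes back in hash order. That is why ordered data in the later slides is kept in column names inside a single row instead.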
Cassandra - Characteristics

➲ Keyspace (name space) = Database
➲ Column Family = Table
➲ Each row can have different columns
➲ A row can have millions of columns
➲ Everything is stored as key -> value
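A rough mental model of these characteristics is a set of nested dictionaries. The sketch below is only that, a model: the `insert` and `get` helpers and the sample row keys are hypothetical and are not part of any Cassandra client.

```python
import time

# keyspace -> column family -> row key -> {column name: (value, timestamp)}
keyspace = {"user_item": {}}


def insert(column_family, row_key, columns, timestamp=None):
    """Write columns into a row; each column carries its own timestamp."""
    ts = timestamp if timestamp is not None else int(time.time() * 1_000_000)
    row = keyspace[column_family].setdefault(row_key, {})
    for name, value in columns.items():
        row[name] = (value, ts)


def get(column_family, row_key):
    """Return the row's (name, value) pairs sorted by column name."""
    row = keyspace[column_family].get(row_key, {})
    return [(name, row[name][0]) for name in sorted(row)]


# Rows in the same column family do not need to share the same columns.
insert("user_item", "twitter.alice", {"name": "Alice", "followers_count": "10"})
insert("user_item", "facebook.bob", {"name": "Bob", "lang": "en_GB", "location": "Mahón"})

print(get("user_item", "twitter.alice"))
print(get("user_item", "facebook.bob"))
```

Each row is itself a sorted map of column name to value, which is what lets one row grow to a very large number of columns.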
Cassandra - Example
create column family user_item
  with key_validation_class = 'UTF8Type'
  and comparator = 'UTF8Type'
  and default_validation_class = 'UTF8Type'
  and column_metadata = [
    {column_name: source, validation_class: UTF8Type, index_type: KEYS},
    {column_name: user_name, validation_class: UTF8Type, index_type: KEYS},
    {column_name: type, validation_class: UTF8Type},
    {column_name: last_update, validation_class: UTF8Type},
    {column_name: id, validation_class: UTF8Type},
    {column_name: profile_image_url, validation_class: UTF8Type},
    {column_name: name, validation_class: UTF8Type},
    {column_name: friends_count, validation_class: UTF8Type},
    {column_name: followers_count, validation_class: UTF8Type},
    {column_name: location, validation_class: UTF8Type},
    {column_name: description, validation_class: UTF8Type},
    {column_name: lang, validation_class: UTF8Type},
    {column_name: geo_latitude, validation_class: FloatType, index_type: KEYS},
    {column_name: geo_longitude, validation_class: FloatType, index_type: KEYS},
    {column_name: geo_radious, validation_class: FloatType},
  ];
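The columns declared with `index_type: KEYS` can be looked up by value, not only by row key. A minimal sketch of that idea follows, assuming a plain in-memory map from column value to row keys. The `put` and `find` helpers and the sample rows are hypothetical, and only two of the indexed columns are modelled.

```python
from collections import defaultdict

# Subset of the columns declared with index_type: KEYS in the definition above.
INDEXED_COLUMNS = ("source", "user_name")

rows = {}
indexes = {column: defaultdict(set) for column in INDEXED_COLUMNS}


def put(row_key, columns):
    """Store a row and keep the value -> row keys indexes in sync."""
    row = rows.setdefault(row_key, {})
    for name, value in columns.items():
        if name in indexes:
            if name in row:
                indexes[name][row[name]].discard(row_key)  # drop the stale entry
            indexes[name][value].add(row_key)
        row[name] = value


def find(column, value):
    """Equality lookup on an indexed column; other columns would need a full scan."""
    if column not in indexes:
        raise KeyError(f"column {column!r} is not indexed")
    return sorted(indexes[column].get(value, ()))


put("facebook.u1", {"source": "facebook", "user_name": "u1", "lang": "en_GB"})
put("twitter.u2", {"source": "twitter", "user_name": "u2"})
put("facebook.u3", {"source": "facebook", "user_name": "u3"})

print(find("source", "facebook"))
```

An index of this kind answers equality questions ("all rows whose source is facebook"). It does not help with ordering or range queries.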
Cassandra - Example
get user_item['facebook.santi.camps.58'];
=> (name=description, value=Me dedico a ..., timestamp=1383782405981374)
=> (name=followers_count, value=0, timestamp=1383782405981374)
=> (name=friends_count, value=, timestamp=1383782405981374)
=> (name=geo_latitude, value=4.264729, timestamp=1383782405981374)
=> (name=geo_longitude, value=39.88943, timestamp=1383782405981374)
=> (name=geo_radious, value=8.976159, timestamp=1383782405981374)
=> (name=id, value=100000444843078, timestamp=1383782405981374)
=> (name=lang, value=en_GB, timestamp=1383782405981374)
=> (name=last_update, value=2013-11-07T01:00:05.981352, timestamp=1383782405981374)
=> (name=location, value=Mahón, Islas Baleares, Spain, timestamp=1383782405981374)
=> (name=name, value=Santi Camps, timestamp=1383782405981374)
=> (name=profile_image_url, value=https://graph.facebook.com/santi.camps.58/picture, timestamp=1383782405981374)
=> (name=profile_url, value=https://www.facebook.com/santi.camps.58, timestamp=1383782405981374)
=> (name=source, value=facebook, timestamp=1383782405981374)
=> (name=type, value=user, timestamp=1383782405981374)
=> (name=user_name, value=santi.camps.58, timestamp=1383782405981374)
Cassandra - Indexing
get user_follower_index['santicamps58.facebook.current'];
=> (name=2013-10-29T11:09:01.979083, value=santicamps58.facebook.100000561127539, timestamp=1381823950979106)
=> (name=2013-10-27T09:59:07.980314, value=santicamps58.facebook.1810751517, timestamp=1381823950980330)
=> (name=2013-10-11T07:50:10.980547, value=santicamps58.facebook.100002326398873, timestamp=1381823950980559)
...
get user_follower_item['santicamps58.facebook.100002326398873'];
=> (name=fetch_date, value=2013-10-15, timestamp=1381823950980662)
=> (name=friend_count, value=134, timestamp=1381823950980662)
=> (name=id, value=100002326398873, timestamp=1381823950980662)
=> (name=lang, value=, timestamp=1381823950980662)
=> (name=name, value=Diego Izquierdo Carranza, timestamp=1381823950980662)
=> (name=profile_image_url, value=https://graph.facebook.com/diego.izquierdocarranza/picture, timestamp=1381823950980662)
=> (name=profile_url, value=https://www.facebook.com/diego.izquierdocarranza, timestamp=1381823950980662)
=> (name=source, value=facebook, timestamp=1381823950980662)
=> (name=start_date, value=2013-10-15, timestamp=1381823950980662)
=> (name=user_name, value=diego.izquierdocarranza, timestamp=1381823950980662)
Cassandra - Indexing
get mention_tag_source_index['803.possitive'];
...
=> (name=2013-11-08T02:00:27.361445, value=803__-UzkY7psQTYJ, timestamp=1383876396514768)
=> (name=2013-11-08T06:53:57, value=803__twitter.398704931630481408, timestamp=1383894677856944)
=> (name=2013-11-08T06:54:38, value=803__twitter.398705100648382464, timestamp=1383894677646453)
=> (name=2013-11-08T06:57:51, value=803__twitter.398705909511503872, timestamp=1383894677313681)
...
get mention_tag_source_index['803.possitive.google'];
=> (name=2012-12-01T00:00:00.395260, value=803__YfOIKwVseDkJ, timestamp=1381830781423739)
=> (name=2012-12-01T00:00:00.420936, value=803__YfOIKwVseDkJ, timestamp=1381867147942586)
=> (name=2012-12-01T00:00:00.633055, value=803__YfOIKwVseDkJ, timestamp=1381830436666804)
=> (name=2013-06-14T00:00:00.055140, value=803__5Bv2Eu9qk04J, timestamp=1381867142254676)
Cassandra - Indexing
get mention_item['803__twitter.398705909511503872'];
=> (name=body, value=@SocialVane INTERESANTÍSIMA HERRAMIENTA DE ANÁLISIS PARA REDES SOCIALES, timestamp=1383894677307778)
=> (name=body_norm, value=your_brand interesante herramienta analisis red your_brand, timestamp=1383894677307778)
=> (name=brand, value=103, timestamp=1383894677307778)
=> (name=checked, value=false, timestamp=1383894677307778)
=> (name=emissor, value=SebastianCamps, timestamp=1383894677307778)
=> (name=emissor_id, value=234140801, timestamp=1383894677307778)
=> (name=emissor_name, value=Sebastián Camps , timestamp=1383894677307778)
=> (name=geo, value=None, timestamp=1383894677307778)
=> (name=id, value=398705909511503872, timestamp=1383894677307778)
=> (name=in_reply_to_id, value=, timestamp=1383894677307778)
=> (name=interest, value=, timestamp=1383894677307778)
=> (name=interest_checked, value=False, timestamp=1383894677307778)
=> (name=lang, value=es, timestamp=1383894677307778)
=> (name=like_action_count, value=0, timestamp=1383894677307778)
=> (name=probability, value=0.482361909795, timestamp=1383894677307778)
=> (name=query, value=803, timestamp=1383894677307778)
=> (name=reply_action_count, value=0, timestamp=1383894677307778)
=> (name=retweeted, value=False, timestamp=1383894677307778)
=> (name=share_action_count, value=0, timestamp=1383894677307778)
=> (name=source, value=twitter, timestamp=1383894677307778)
=> (name=tag, value=possitive, timestamp=1383894677307778)
=> (name=time, value=2013-11-08T06:57:51, timestamp=1383894677307778)
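The three indexing slides follow one pattern: an index row whose column names are timestamps and whose values are item keys, plus one row per item holding the details. A sketch of that pattern in plain Python follows. The `add_mention` and `mentions_between` helpers and the sample item keys are hypothetical, and sorted lists stand in for Cassandra's column ordering.

```python
from bisect import bisect_left, bisect_right

index_rows = {}  # index row key -> sorted list of (column name, item key)
item_rows = {}   # item key -> {column name: value}


def add_mention(query, tag, item_key, when, columns):
    """Store the item, then add an index column named after its ISO timestamp."""
    item_rows[item_key] = dict(columns, time=when)
    entries = index_rows.setdefault(f"{query}.{tag}", [])
    entries.insert(bisect_left(entries, (when, item_key)), (when, item_key))


def mentions_between(query, tag, start, end):
    """Slice the index row by column name, then fetch each referenced item."""
    entries = index_rows.get(f"{query}.{tag}", [])
    lo = bisect_left(entries, (start, ""))
    hi = bisect_right(entries, (end, "\uffff"))
    return [item_rows[item_key] for _, item_key in entries[lo:hi]]


add_mention("803", "possitive", "803__twitter.1", "2013-11-08T06:53:57", {"source": "twitter"})
add_mention("803", "possitive", "803__twitter.2", "2013-11-08T06:57:51", {"source": "twitter"})
add_mention("803", "possitive", "803__web.3", "2013-11-07T10:00:00", {"source": "google"})

recent = mentions_between("803", "possitive", "2013-11-08T00:00:00", "2013-11-08T23:59:59")
print([m["time"] for m in recent])
```

ISO 8601 timestamps sort lexicographically in chronological order, so a UTF8Type comparator on the column names is enough to get time-ordered slices inside one row, even though rows themselves are not ordered under RandomPartitioner.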
Celery

➲ Execution queues are configured
➲ N workers are started on M machines, each listening on a queue
➲ Distributable tasks are marked in the code
➲ The execution queue for each task is defined
➲ Tasks can be called synchronously or asynchronously

➲ Very simple to deploy
➲ Very easy to scale
➲ Concurrency must be watched carefully
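The asynchronous versus synchronous distinction, and the warning about concurrency, can be shown with a standard-library analogy. This is not Celery: a `ThreadPoolExecutor` stands in for the workers, `submit()` stands in for an asynchronous call and `result()` for waiting on it. The `count_words` task is hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

counter = 0
counter_lock = threading.Lock()


def count_words(text):
    """A 'task': returns a result and also updates shared state."""
    global counter
    words = len(text.split())
    with counter_lock:  # without the lock, concurrent updates could be lost
        counter += words
    return words


with ThreadPoolExecutor(max_workers=4) as pool:  # 4 local 'workers'
    # Asynchronous: submit() returns immediately with a handle to the result.
    pending = [pool.submit(count_words, t) for t in ["a b c", "d e", "f"]]
    # Synchronous: result() blocks until that task has finished.
    totals = [f.result() for f in pending]

print(totals, counter)
```

Any state shared between tasks, here the counter and in the deck's setup the rows in Cassandra, can be touched by several workers at once, so read-modify-write sequences need care.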
Celery Example

from celery.task import task  # assumed import (Celery 3.x era); not shown on the original slide

CELERY_ROUTES = {
    "celeryutils.track_all_users_followers": {"queue": "slow", "routing_key": "slow_task"},
    "userfollowers.bulk_insert": {"queue": "slow", "routing_key": "slow_task"},
    "extract_mentions_from_website": {"queue": "slow", "routing_key": "slow_task"},
    "LeadsClassifier.classify_untagged": {"queue": "cpu", "routing_key": "cpu_task"},
    ...
}

@task(name='extract_mentions_from_website', time_limit=300)
def extract_mentions_from_website(brand, query, ...):
    ...

# LOCAL CALL (runs in the current process)
extract_mentions_from_website(params)
# ASYNCHRONOUS DISTRIBUTED CALL (returns immediately)
extract_mentions_from_website.delay(params)
# SYNCHRONOUS DISTRIBUTED CALL (blocks until the worker returns the result)
extract_mentions_from_website.delay(params).get()
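How a routing table like CELERY_ROUTES maps task names to queues can be sketched without Celery itself. The helpers below are hypothetical and only mimic the lookup. The one Celery fact assumed is that tasks without a matching route go to the default queue, which is named "celery" unless configured otherwise.

```python
DEFAULT_QUEUE = "celery"  # Celery's default queue name when no route matches

# Two of the routes from the example above.
ROUTES = {
    "userfollowers.bulk_insert": {"queue": "slow", "routing_key": "slow_task"},
    "LeadsClassifier.classify_untagged": {"queue": "cpu", "routing_key": "cpu_task"},
}


def queue_for(task_name, routes=ROUTES):
    """Return the queue a task name is routed to, or the default queue."""
    return routes.get(task_name, {}).get("queue", DEFAULT_QUEUE)


def group_by_queue(task_names, routes=ROUTES):
    """Group task names by destination queue, e.g. to size each worker pool."""
    grouped = {}
    for name in task_names:
        grouped.setdefault(queue_for(name, routes), []).append(name)
    return grouped


plan = group_by_queue(
    [
        "userfollowers.bulk_insert",
        "LeadsClassifier.classify_untagged",
        "some.unrouted_task",  # hypothetical task with no route
    ]
)
print(plan)
```

Separating slow, I/O-bound tasks from CPU-bound ones lets each queue get its own workers, so a backlog of slow tasks does not starve the classifier.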

More Related Content

Viewers also liked

Conferencia Big Data en #MenorcaConnecta
Conferencia Big Data en #MenorcaConnectaConferencia Big Data en #MenorcaConnecta
Conferencia Big Data en #MenorcaConnecta
Santi Camps
 
Transparencias taller Python
Transparencias taller PythonTransparencias taller Python
Transparencias taller Python
Sergio Soto
 
Knowing your garbage collector - PyCon Italy 2015
Knowing your garbage collector - PyCon Italy 2015Knowing your garbage collector - PyCon Italy 2015
Knowing your garbage collector - PyCon Italy 2015
fcofdezc
 
BDD - Test Academy Barcelona 2017
BDD - Test Academy Barcelona 2017BDD - Test Academy Barcelona 2017
BDD - Test Academy Barcelona 2017
Carlos Ble
 
Madrid SPARQL handson
Madrid SPARQL handsonMadrid SPARQL handson
Madrid SPARQL handson
Victor de Boer
 
Python Dominicana 059: Django Migrations
Python Dominicana 059: Django MigrationsPython Dominicana 059: Django Migrations
Python Dominicana 059: Django Migrations
Rafael Belliard
 
Volunteering assistance to online geocoding services through a distributed kn...
Volunteering assistance to online geocoding services through a distributed kn...Volunteering assistance to online geocoding services through a distributed kn...
Volunteering assistance to online geocoding services through a distributed kn...
José Pablo Gómez Barrón S.
 
Introduccio a python
Introduccio a pythonIntroduccio a python
Introduccio a python
Santi Camps
 
STM on PyPy
STM on PyPySTM on PyPy
STM on PyPy
fcofdezc
 
TDD in the Web with Python and Django
TDD in the Web with Python and DjangoTDD in the Web with Python and Django
TDD in the Web with Python and Django
Carlos Ble
 
A Folksonomy of styles, aka: other stylists also said and Subjective Influenc...
A Folksonomy of styles, aka: other stylists also said and Subjective Influenc...A Folksonomy of styles, aka: other stylists also said and Subjective Influenc...
A Folksonomy of styles, aka: other stylists also said and Subjective Influenc...
Natalia Díaz Rodríguez
 
Presentacion scraping
Presentacion scrapingPresentacion scraping
Presentacion scraping
Jose Mussach Gomez
 
PyQgis gpul-lab Univerisity of A Coruña 20160413
PyQgis gpul-lab Univerisity of A Coruña 20160413PyQgis gpul-lab Univerisity of A Coruña 20160413
PyQgis gpul-lab Univerisity of A Coruña 20160413
Luigi Pirelli
 
Charla mspba
Charla mspbaCharla mspba
Charla mspba
Julián Perelli
 
Guía de Python
Guía de Python Guía de Python
Guía de Python
Lennys Camargo
 
Geospatial and MongoDB
Geospatial and MongoDBGeospatial and MongoDB
Geospatial and MongoDB
Norberto Leite
 
Bucles con Scratch
Bucles con ScratchBucles con Scratch
Bucles con Scratch
Fco Javier Lucena
 

Viewers also liked (19)

Conferencia Big Data en #MenorcaConnecta
Conferencia Big Data en #MenorcaConnectaConferencia Big Data en #MenorcaConnecta
Conferencia Big Data en #MenorcaConnecta
 
Transparencias taller Python
Transparencias taller PythonTransparencias taller Python
Transparencias taller Python
 
Knowing your garbage collector - PyCon Italy 2015
Knowing your garbage collector - PyCon Italy 2015Knowing your garbage collector - PyCon Italy 2015
Knowing your garbage collector - PyCon Italy 2015
 
BDD - Test Academy Barcelona 2017
BDD - Test Academy Barcelona 2017BDD - Test Academy Barcelona 2017
BDD - Test Academy Barcelona 2017
 
Tidy vews, decorator and presenter
Tidy vews, decorator and presenterTidy vews, decorator and presenter
Tidy vews, decorator and presenter
 
Madrid SPARQL handson
Madrid SPARQL handsonMadrid SPARQL handson
Madrid SPARQL handson
 
Python Dominicana 059: Django Migrations
Python Dominicana 059: Django MigrationsPython Dominicana 059: Django Migrations
Python Dominicana 059: Django Migrations
 
Volunteering assistance to online geocoding services through a distributed kn...
Volunteering assistance to online geocoding services through a distributed kn...Volunteering assistance to online geocoding services through a distributed kn...
Volunteering assistance to online geocoding services through a distributed kn...
 
Introduccio a python
Introduccio a pythonIntroduccio a python
Introduccio a python
 
STM on PyPy
STM on PyPySTM on PyPy
STM on PyPy
 
TDD in the Web with Python and Django
TDD in the Web with Python and DjangoTDD in the Web with Python and Django
TDD in the Web with Python and Django
 
A Folksonomy of styles, aka: other stylists also said and Subjective Influenc...
A Folksonomy of styles, aka: other stylists also said and Subjective Influenc...A Folksonomy of styles, aka: other stylists also said and Subjective Influenc...
A Folksonomy of styles, aka: other stylists also said and Subjective Influenc...
 
Presentacion scraping
Presentacion scrapingPresentacion scraping
Presentacion scraping
 
Grunt.js introduction
Grunt.js introductionGrunt.js introduction
Grunt.js introduction
 
PyQgis gpul-lab Univerisity of A Coruña 20160413
PyQgis gpul-lab Univerisity of A Coruña 20160413PyQgis gpul-lab Univerisity of A Coruña 20160413
PyQgis gpul-lab Univerisity of A Coruña 20160413
 
Charla mspba
Charla mspbaCharla mspba
Charla mspba
 
Guía de Python
Guía de Python Guía de Python
Guía de Python
 
Geospatial and MongoDB
Geospatial and MongoDBGeospatial and MongoDB
Geospatial and MongoDB
 
Bucles con Scratch
Bucles con ScratchBucles con Scratch
Bucles con Scratch
 

Similar to Big data amb Cassandra i Celery ##bbmnk

Cabs, Cassandra, and Hailo
Cabs, Cassandra, and HailoCabs, Cassandra, and Hailo
Cabs, Cassandra, and Hailo
Dave Gardner
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long version
Patrick McFadin
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
Revolution Analytics
 
Deploying, Backups, and Restore w Datastax + Azure at Albertsons/Safeway (Gur...
Deploying, Backups, and Restore w Datastax + Azure at Albertsons/Safeway (Gur...Deploying, Backups, and Restore w Datastax + Azure at Albertsons/Safeway (Gur...
Deploying, Backups, and Restore w Datastax + Azure at Albertsons/Safeway (Gur...
DataStax
 
Cassandra 3.x et la future 4.0
Cassandra 3.x et la future 4.0Cassandra 3.x et la future 4.0
Cassandra 3.x et la future 4.0
Alexander DEJANOVSKI
 
Cabs, Cassandra, and Hailo (at Cassandra EU)
Cabs, Cassandra, and Hailo (at Cassandra EU)Cabs, Cassandra, and Hailo (at Cassandra EU)
Cabs, Cassandra, and Hailo (at Cassandra EU)
Dave Gardner
 
Using Spark over Cassandra
Using Spark over CassandraUsing Spark over Cassandra
Using Spark over Cassandra
Noam Barkai
 
C* Summit EU 2013: No Whistling Required: Cabs, Cassandra, and Hailo
C* Summit EU 2013: No Whistling Required: Cabs, Cassandra, and Hailo C* Summit EU 2013: No Whistling Required: Cabs, Cassandra, and Hailo
C* Summit EU 2013: No Whistling Required: Cabs, Cassandra, and Hailo
DataStax Academy
 
Results cache
Results cacheResults cache
Bulletproof Jobs: Patterns For Large-Scale Spark Processing
Bulletproof Jobs: Patterns For Large-Scale Spark ProcessingBulletproof Jobs: Patterns For Large-Scale Spark Processing
Bulletproof Jobs: Patterns For Large-Scale Spark Processing
Spark Summit
 
Ten modules I haven't yet talked about
Ten modules I haven't yet talked aboutTen modules I haven't yet talked about
Ten modules I haven't yet talked about
acme
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
Pollfish
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
Stavros Kontopoulos
 
MUC - Moodle Universal Cache
MUC - Moodle Universal CacheMUC - Moodle Universal Cache
MUC - Moodle Universal Cache
Tim Hunt
 
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidiaRAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
Mail.ru Group
 
C* Summit 2013: No Whistling Required: Cabs, Cassandra, and Hailo by Dave Gar...
C* Summit 2013: No Whistling Required: Cabs, Cassandra, and Hailo by Dave Gar...C* Summit 2013: No Whistling Required: Cabs, Cassandra, and Hailo by Dave Gar...
C* Summit 2013: No Whistling Required: Cabs, Cassandra, and Hailo by Dave Gar...
DataStax Academy
 
Wide Column Store NoSQL vs SQL Data Modeling
Wide Column Store NoSQL vs SQL Data ModelingWide Column Store NoSQL vs SQL Data Modeling
Wide Column Store NoSQL vs SQL Data Modeling
ScyllaDB
 
Think Distributed: The Hazelcast Way
Think Distributed: The Hazelcast WayThink Distributed: The Hazelcast Way
Think Distributed: The Hazelcast Way
Rahul Gupta
 
PWA caching strategies
PWA caching strategiesPWA caching strategies
PWA caching strategies
Gabriele Falasca
 
Distributed caching and computing v3.7
Distributed caching and computing v3.7Distributed caching and computing v3.7
Distributed caching and computing v3.7
Rahul Gupta
 

Similar to Big data amb Cassandra i Celery ##bbmnk (20)

Cabs, Cassandra, and Hailo
Cabs, Cassandra, and HailoCabs, Cassandra, and Hailo
Cabs, Cassandra, and Hailo
 
Time series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long versionTime series with Apache Cassandra - Long version
Time series with Apache Cassandra - Long version
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
Deploying, Backups, and Restore w Datastax + Azure at Albertsons/Safeway (Gur...
Deploying, Backups, and Restore w Datastax + Azure at Albertsons/Safeway (Gur...Deploying, Backups, and Restore w Datastax + Azure at Albertsons/Safeway (Gur...
Deploying, Backups, and Restore w Datastax + Azure at Albertsons/Safeway (Gur...
 
Cassandra 3.x et la future 4.0
Cassandra 3.x et la future 4.0Cassandra 3.x et la future 4.0
Cassandra 3.x et la future 4.0
 
Cabs, Cassandra, and Hailo (at Cassandra EU)
Cabs, Cassandra, and Hailo (at Cassandra EU)Cabs, Cassandra, and Hailo (at Cassandra EU)
Cabs, Cassandra, and Hailo (at Cassandra EU)
 
Using Spark over Cassandra
Using Spark over CassandraUsing Spark over Cassandra
Using Spark over Cassandra
 
C* Summit EU 2013: No Whistling Required: Cabs, Cassandra, and Hailo
C* Summit EU 2013: No Whistling Required: Cabs, Cassandra, and Hailo C* Summit EU 2013: No Whistling Required: Cabs, Cassandra, and Hailo
C* Summit EU 2013: No Whistling Required: Cabs, Cassandra, and Hailo
 
Results cache
Results cacheResults cache
Results cache
 
Bulletproof Jobs: Patterns For Large-Scale Spark Processing
Bulletproof Jobs: Patterns For Large-Scale Spark ProcessingBulletproof Jobs: Patterns For Large-Scale Spark Processing
Bulletproof Jobs: Patterns For Large-Scale Spark Processing
 
Ten modules I haven't yet talked about
Ten modules I haven't yet talked aboutTen modules I haven't yet talked about
Ten modules I haven't yet talked about
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
 
MUC - Moodle Universal Cache
MUC - Moodle Universal CacheMUC - Moodle Universal Cache
MUC - Moodle Universal Cache
 
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidiaRAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
 
C* Summit 2013: No Whistling Required: Cabs, Cassandra, and Hailo by Dave Gar...
C* Summit 2013: No Whistling Required: Cabs, Cassandra, and Hailo by Dave Gar...C* Summit 2013: No Whistling Required: Cabs, Cassandra, and Hailo by Dave Gar...
C* Summit 2013: No Whistling Required: Cabs, Cassandra, and Hailo by Dave Gar...
 
Wide Column Store NoSQL vs SQL Data Modeling
Wide Column Store NoSQL vs SQL Data ModelingWide Column Store NoSQL vs SQL Data Modeling
Wide Column Store NoSQL vs SQL Data Modeling
 
Think Distributed: The Hazelcast Way
Think Distributed: The Hazelcast WayThink Distributed: The Hazelcast Way
Think Distributed: The Hazelcast Way
 
PWA caching strategies
PWA caching strategiesPWA caching strategies
PWA caching strategies
 
Distributed caching and computing v3.7
Distributed caching and computing v3.7Distributed caching and computing v3.7
Distributed caching and computing v3.7
 

Recently uploaded

How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 

Recently uploaded (20)

How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 

Big data amb Cassandra i Celery ##bbmnk

  • 1. Big Data amb Cassandra i Celery #bbmnk novembre 2013 Santi Camps Taltavull @santicamps @socialvane
  • 2. La Problemàtica (Big Data) ➲ ➲ ➲ ➲ ➲ Gran volum d'informació (TeraBytes) Informació no estructurada Poca densitat d'informació útil Altíssima capacitat de processament Poca pasta
  • 3. The solutions applied
      ➲ Distributed database: Cassandra
      ➲ Distributed task manager: Celery
      ➲ Message broker: RabbitMQ
      ➲ Application --> RabbitMQ --> Celery <--> Cassandra
      ➲ 4 initial servers
      ➲ 12 TB of capacity
      ➲ 208 GB of RAM
      ➲ 44 CPU cores
      ➲ Fault tolerant
      ➲ Redundant
      ➲ Very easily scalable
      ➲ And cheap!
  • 4. Cassandra
      ➲ Born inside Facebook and then released as open source
      ➲ Adopted by the Apache foundation
      ➲ Twitter uses it too
      ➲ Written in Java
      ➲ It is a NoSQL database
      ➲ Data is stored as key -> value
  • 5. Cassandra - Advantages
      ➲ Distributed database
      ➲ Configurable redundancy
      ➲ Fault tolerant
      ➲ Ready for WAN deployments
      ➲ Fully scalable
  • 6. Cassandra - Drawbacks
      ➲ No transaction management
      ➲ Coordination relies on timestamps (the most recent write wins)
      ➲ With RandomPartitioner, rows cannot be returned in key order
      ➲ With RandomPartitioner, filtering becomes difficult
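
The two RandomPartitioner drawbacks above follow from how rows are placed: the partitioner derives each row's position (its token) from an MD5 hash of the row key, so keys that are neighbours in key order end up scattered. A minimal Python sketch of the effect follows. The `token` helper and the sample keys are illustrative assumptions, not Cassandra's exact token arithmetic.

```python
import hashlib


def token(row_key: str) -> int:
    """Illustrative token: the MD5 digest of the row key read as an integer.

    Assumption: this mimics the idea behind RandomPartitioner (hash the key),
    not Cassandra's exact token computation.
    """
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


# Hypothetical row keys that are adjacent in key order.
keys = ["user.0001", "user.0002", "user.0003", "user.0004"]

by_key = sorted(keys)               # the order an application would like
by_token = sorted(keys, key=token)  # the order a hash-partitioned scan walks

if __name__ == "__main__":
    print("key order:  ", by_key)
    print("token order:", by_token)
```

Token order is unrelated to key order, so a scan over rows does not come back sorted by key. This is presumably why the indexing slides further on keep explicit index rows, where the ordering lives in the column names inside a single row.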
  • 7. Cassandra - Characteristics
      ➲ Keyspace (namespace) = database
      ➲ Column Family = table
      ➲ Each row can have different columns
      ➲ A row can have millions of columns
      ➲ Everything is stored as key -> value
  • 8. Cassandra - Example
      create column family user_item
        with key_validation_class = 'UTF8Type'
        and comparator = 'UTF8Type'
        and default_validation_class = 'UTF8Type'
        and column_metadata=[
          {column_name: source, validation_class: UTF8Type, index_type: KEYS},
          {column_name: user_name, validation_class: UTF8Type, index_type: KEYS},
          {column_name: type, validation_class: UTF8Type},
          {column_name: last_update, validation_class: UTF8Type},
          {column_name: id, validation_class: UTF8Type},
          {column_name: profile_image_url, validation_class: UTF8Type},
          {column_name: name, validation_class: UTF8Type},
          {column_name: friends_count, validation_class: UTF8Type},
          {column_name: followers_count, validation_class: UTF8Type},
          {column_name: location, validation_class: UTF8Type},
          {column_name: description, validation_class: UTF8Type},
          {column_name: lang, validation_class: UTF8Type},
          {column_name: geo_latitude, validation_class: FloatType, index_type: KEYS},
          {column_name: geo_longitude, validation_class: FloatType, index_type: KEYS},
          {column_name: geo_radious, validation_class: FloatType},
        ];
  • 9. Cassandra - Example
      get user_item['facebook.santi.camps.58'] ... ;
      => (name=description, value=Me dedico a ..., timestamp=1383782405981374)
      => (name=followers_count, value=0, timestamp=1383782405981374)
      => (name=friends_count, value=, timestamp=1383782405981374)
      => (name=geo_latitude, value=4.264729, timestamp=1383782405981374)
      => (name=geo_longitude, value=39.88943, timestamp=1383782405981374)
      => (name=geo_radious, value=8.976159, timestamp=1383782405981374)
      => (name=id, value=100000444843078, timestamp=1383782405981374)
      => (name=lang, value=en_GB, timestamp=1383782405981374)
      => (name=last_update, value=2013-11-07T01:00:05.981352, timestamp=1383782405981374)
      => (name=location, value=Mahón, Islas Baleares, Spain, timestamp=1383782405981374)
      => (name=name, value=Santi Camps, timestamp=1383782405981374)
      => (name=profile_image_url, value=https://graph.facebook.com/santi.camps.58/picture, timestamp=1383782405981374)
      => (name=profile_url, value=https://www.facebook.com/santi.camps.58, timestamp=1383782405981374)
      => (name=source, value=facebook, timestamp=1383782405981374)
      => (name=type, value=user, timestamp=1383782405981374)
      => (name=user_name, value=santi.camps.58, timestamp=1383782405981374)
  • 10. Cassandra - Indexing
      get user_follower_index['santicamps58.facebook.current'];
      => (name=2013-10-29T11:09:01.979083, value=santicamps58.facebook.100000561127539, timestamp=1381823950979106)
      => (name=2013-10-27T09:59:07.980314, value=santicamps58.facebook.1810751517, timestamp=1381823950980330)
      => (name=2013-10-11T07:50:10.980547, value=santicamps58.facebook.100002326398873, timestamp=1381823950980559)
      ...
      get user_follower_item['santicamps58.facebook.100002326398873'];
      => (name=fetch_date, value=2013-10-15, timestamp=1381823950980662)
      => (name=friend_count, value=134, timestamp=1381823950980662)
      => (name=id, value=100002326398873, timestamp=1381823950980662)
      => (name=lang, value=, timestamp=1381823950980662)
      => (name=name, value=Diego Izquierdo Carranza, timestamp=1381823950980662)
      => (name=profile_image_url, value=https://graph.facebook.com/diego.izquierdocarranza/picture, timestamp=1381823950980662)
      => (name=profile_url, value=https://www.facebook.com/diego.izquierdocarranza, timestamp=1381823950980662)
      => (name=source, value=facebook, timestamp=1381823950980662)
      => (name=start_date, value=2013-10-15, timestamp=1381823950980662)
      => (name=user_name, value=diego.izquierdocarranza, timestamp=1381823950980662)
  • 11. Cassandra - Indexing
      get mention_tag_source_index['803.possitive'];
      ...
      => (name=2013-11-08T02:00:27.361445, value=803__-UzkY7psQTYJ, timestamp=1383876396514768)
      => (name=2013-11-08T06:53:57, value=803__twitter.398704931630481408, timestamp=1383894677856944)
      => (name=2013-11-08T06:54:38, value=803__twitter.398705100648382464, timestamp=1383894677646453)
      => (name=2013-11-08T06:57:51, value=803__twitter.398705909511503872, timestamp=1383894677313681)
      ...
      get mention_tag_source_index['803.possitive.google'];
      => (name=2012-12-01T00:00:00.395260, value=803__YfOIKwVseDkJ, timestamp=1381830781423739)
      => (name=2012-12-01T00:00:00.420936, value=803__YfOIKwVseDkJ, timestamp=1381867147942586)
      => (name=2012-12-01T00:00:00.633055, value=803__YfOIKwVseDkJ, timestamp=1381830436666804)
      => (name=2013-06-14T00:00:00.055140, value=803__5Bv2Eu9qk04J, timestamp=1381867142254676)
  • 12. Cassandra - Indexing
      get mention_item['803__twitter.398705909511503872'];
      => (name=body, value=@SocialVane INTERESANTÍSIMA HERRAMIENTA DE ANÁLISIS PARA REDES SOCIALES, timestamp=1383894677307778)
      => (name=body_norm, value=your_brand interesante herramienta analisis red your_brand, timestamp=1383894677307778)
      => (name=brand, value=103, timestamp=1383894677307778)
      => (name=checked, value=false, timestamp=1383894677307778)
      => (name=emissor, value=SebastianCamps, timestamp=1383894677307778)
      => (name=emissor_id, value=234140801, timestamp=1383894677307778)
      => (name=emissor_name, value=Sebastián Camps , timestamp=1383894677307778)
      => (name=geo, value=None, timestamp=1383894677307778)
      => (name=id, value=398705909511503872, timestamp=1383894677307778)
      => (name=in_reply_to_id, value=, timestamp=1383894677307778)
      => (name=interest, value=, timestamp=1383894677307778)
      => (name=interest_checked, value=False, timestamp=1383894677307778)
      => (name=lang, value=es, timestamp=1383894677307778)
      => (name=like_action_count, value=0, timestamp=1383894677307778)
      => (name=probability, value=0.482361909795, timestamp=1383894677307778)
      => (name=query, value=803, timestamp=1383894677307778)
      => (name=reply_action_count, value=0, timestamp=1383894677307778)
      => (name=retweeted, value=False, timestamp=1383894677307778)
      => (name=share_action_count, value=0, timestamp=1383894677307778)
      => (name=source, value=twitter, timestamp=1383894677307778)
      => (name=tag, value=possitive, timestamp=1383894677307778)
      => (name=time, value=2013-11-08T06:57:51, timestamp=1383894677307778)
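
The indexing slides suggest a hand-maintained index: an index row per lookup (`query.tag` and `query.tag.source`), whose column names are timestamps and whose values are keys into `mention_item`. The Python sketch below imitates that pattern with plain dictionaries. The `save_mention` and `mentions_for` helpers are hypothetical, and the item key layout is an assumption modelled on `803__twitter.398705909511503872`.

```python
from typing import Dict, List

# In-memory stand-ins for two column families: row key -> {column name: value}.
mention_item: Dict[str, Dict[str, str]] = {}
mention_tag_source_index: Dict[str, Dict[str, str]] = {}


def save_mention(query: str, tag: str, source: str, time: str,
                 item_id: str, body: str) -> str:
    """Store one mention and register it in both index rows."""
    # Assumed key layout: <query>__<source>.<id>
    item_key = f"{query}__{source}.{item_id}"
    mention_item[item_key] = {
        "id": item_id, "query": query, "tag": tag,
        "source": source, "time": time, "body": body,
    }
    # One index row per lookup to serve: by tag, and by tag + source.
    for index_key in (f"{query}.{tag}", f"{query}.{tag}.{source}"):
        mention_tag_source_index.setdefault(index_key, {})[time] = item_key
    return item_key


def mentions_for(index_key: str, limit: int = 10) -> List[Dict[str, str]]:
    """Read an index row, then fetch each item it points to."""
    row = mention_tag_source_index.get(index_key, {})
    # Column names are ISO-8601 strings, so sorting them sorts by time,
    # much as a UTF8Type comparator orders columns inside a row.
    return [mention_item[row[name]] for name in sorted(row)[:limit]]


if __name__ == "__main__":
    save_mention("803", "possitive", "twitter", "2013-11-08T06:57:51",
                 "398705909511503872", "example body")
    for mention in mentions_for("803.possitive.twitter"):
        print(mention["time"], mention["id"])
```

In this sketch two mentions with the same timestamp in the same index row would overwrite each other, so a real column name would need to be unique.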
  • 13. Celery
      ➲ Execution queues are configured
      ➲ N workers are started on M machines, each listening to a queue
      ➲ Distributable tasks are marked in the code
      ➲ The execution queue is defined for each task
      ➲ Tasks can be called synchronously or asynchronously
      ➲ Very simple to deploy
      ➲ Very easy to scale
      ➲ Concurrency needs careful attention
  • 14. Celery - Example
      CELERY_ROUTES = {
          "celeryutils.track_all_users_followers": {"queue": "slow", "routing_key": "slow_task"},
          "userfollowers.bulk_insert": {"queue": "slow", "routing_key": "slow_task"},
          "extract_mentions_from_website": {"queue": "slow", "routing_key": "slow_task"},
          "LeadsClassifier.classify_untagged": {"queue": "cpu", "routing_key": "cpu_task"},
          ...
      }

      @task(name='extract_mentions_from_website', time_limit=300)
      def extract_mentions_from_website(brand, query,...):
          ...

      # LOCAL CALL
      extract_mentions_from_website(params)

      # DISTRIBUTED ASYNCHRONOUS CALL
      extract_mentions_from_website.delay(params)

      # DISTRIBUTED SYNCHRONOUS CALL
      extract_mentions_from_website.delay(params).get()
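
To make the queue-and-worker model concrete without a broker, here is a standard-library imitation. It is not Celery itself: the `ROUTES` table, the `"default"` fallback queue and the `run_demo` helper are all hypothetical. It routes each task name to a named queue, starts several worker threads per queue, and guards the shared result list with a lock, which is the kind of concurrency care the Celery slide warns about.

```python
import queue
import threading
from typing import Dict, List, Tuple

# Task name -> queue name, mirroring the shape of a routing table.
ROUTES: Dict[str, str] = {
    "extract_mentions_from_website": "slow",
    "LeadsClassifier.classify_untagged": "cpu",
}
DEFAULT_QUEUE = "default"  # hypothetical fallback for unrouted tasks


def run_demo(tasks: List[Tuple[str, str]],
             workers_per_queue: int = 2) -> List[Tuple[str, str, str]]:
    """Dispatch (task name, payload) pairs and report which queue handled each."""
    queues = {name: queue.Queue()
              for name in set(ROUTES.values()) | {DEFAULT_QUEUE}}
    handled: List[Tuple[str, str, str]] = []
    lock = threading.Lock()

    def worker(queue_name: str) -> None:
        while True:
            message = queues[queue_name].get()
            if message is None:  # stop signal
                return
            task_name, payload = message
            with lock:  # several workers share the result list
                handled.append((queue_name, task_name, payload))

    threads = [threading.Thread(target=worker, args=(name,))
               for name in queues for _ in range(workers_per_queue)]
    for thread in threads:
        thread.start()

    # "Asynchronous call": enqueue and move on.
    for task_name, payload in tasks:
        queues[ROUTES.get(task_name, DEFAULT_QUEUE)].put((task_name, payload))

    # One stop signal per worker, queued behind the real tasks.
    for q in queues.values():
        for _ in range(workers_per_queue):
            q.put(None)
    for thread in threads:
        thread.join()
    return sorted(handled)


if __name__ == "__main__":
    for row in run_demo([("extract_mentions_from_website", "brand-1"),
                         ("LeadsClassifier.classify_untagged", "batch-7")]):
        print(row)
```

In this sketch, scaling a queue means raising `workers_per_queue`; in the setup the slides describe, it means starting more Celery workers on more machines.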