Hadoop @ eBuddy
eBuddy
Web-based chat (started in 2003)
● Initially no statistics, MSN only
● Started basic logging in 2004
● Today
  ○ 34,467,010,693 login records (≈34 × 10^9)
  ○ It takes about 40 min to select them all.
XMS (launched May 23, 2011)
● Today
   ○ 1,334,794,121 records (≈1.3 × 10^9)
Website (Google Analytics)
Banners (OpenX)
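
As a rough sanity check on the figures above, the implied scan rate can be worked out directly. This is only a sketch: the record count and the 40 minutes come from the slide, while treating the 40 minutes as wall-clock time for one full scan is an assumption.

```python
# Figures from the slide. Reading "40 min" as the wall-clock time of one
# full scan over all login records is an assumption.
LOGIN_RECORDS = 34_467_010_693
SCAN_MINUTES = 40


def rows_per_second(records: int, minutes: float) -> float:
    """Average scan rate implied by a record count and a duration."""
    return records / (minutes * 60)


rate = rows_per_second(LOGIN_RECORDS, SCAN_MINUTES)
print(f"{rate / 1e6:.1f} million rows/s")
```

Under that assumption the cluster reads on the order of 14 million login records per second.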
Warehousing needs
● Product owners
  ○ Comparing product versions
     ■ avg duration
     ■ msg sent/received
  ○ Churn analysis
  ○ Feature analysis
● Marketing
  ○ What countries should we focus on?
  ○ What people should we target?
● Sales
  ○ Sell banners in countries/products.
● Operations/Dev
  ○ Help solve bugs
  ○ Blocked in countries/providers
Interesting to know
● Developers are Java centric
● Hosting in the US but BI people in Amsterdam
● 18 Hadoop nodes, each with
    ○ 16 cores
    ○ 24 GB RAM
    ○ 4 × 400 GB hard disks
● We make money with banners
    ○ So don't expect deep pockets
Warehouse timeline
● Traditional RDBMS (2004)
● Custom MapReduce code (2008)
  ○ Joining two files (merge join/map join?)
  ○ Repeating code
  ○ Consider abstraction
  ○ Changing data, changing code?
● Pig scripts (2008/2009)
  ○ Much simpler to read but domain specific
● Hive (2009)
  ○ Generic SQL but with some limitations
  ○ Existing tools can be used
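
The "merge join/map join?" question from the custom MapReduce days can be illustrated with a small sketch. It reads "map join" as an in-memory hash join against the smaller file and "merge join" as a sort-merge join over inputs already sorted by key; that reading, the function names, and the sample data are all assumptions, not eBuddy's actual code.

```python
def map_join(small, large):
    """Hash join: hold the small input in memory, stream the large one."""
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    for key, value in large:
        for other in lookup.get(key, []):
            yield key, other, value


def merge_join(left, right):
    """Sort-merge join; both inputs must already be sorted by key.

    The inputs are copied into lists to keep the sketch short; a real
    implementation would stream them.
    """
    left, right = list(left), list(right)
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of right rows sharing this key, then pair it
            # with every left row that has the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            while i < len(left) and left[i][0] == lkey:
                for _, rvalue in right[j:j_end]:
                    yield lkey, left[i][1], rvalue
                i += 1
            j = j_end


# Hypothetical sample data: (user_id, country) and (user_id, session).
users = [(1, "nl"), (2, "us")]
sessions = [(1, "s1"), (1, "s2"), (3, "s3")]
print(list(map_join(users, sessions)))
print(list(merge_join(users, sessions)))
```

Both calls yield the same joined rows. The hash variant only works while one side fits in memory, and the merge variant needs sorted input. Pig and Hive take that choice out of hand-written job code.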
Hive
● Hey I already know this:
select *
from table1 t1
  left outer join table2 t2 on (t1.id = t2.id)
where t2.id is null;


● Java programmers will like this:
  ○ Spring JdbcTemplates
  ○ Existing JDBC tools (SQuirreL)
  ○ Syntax highlighting
  ○ Code completion
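
The query on this slide is the familiar anti-join idiom: keep the rows of table1 that have no partner in table2. A minimal sketch of the same statement, run against an in-memory SQLite database instead of Hive so that it is self-contained; the sample rows and the function name are made up for illustration.

```python
import sqlite3


def ids_missing_from_table2(conn):
    """Return table1 ids that have no match in table2 (an anti-join)."""
    sql = """
        select t1.id
        from table1 t1
          left outer join table2 t2 on (t1.id = t2.id)
        where t2.id is null
        order by t1.id
    """
    return [row[0] for row in conn.execute(sql)]


# SQLite stands in for Hive here; the tables and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table1 (id integer);
    create table table2 (id integer);
    insert into table1 values (1), (2), (3);
    insert into table2 values (2);
""")
missing = ids_missing_from_table2(conn)
print(missing)
```

With this sample data, ids 1 and 3 come back because only id 2 exists in both tables.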
Present
● App servers log to MySQL
  ○ Brittle but it works
● Hive
  ○ SQL (most developers know this)
  ○ Partition pruning issues
  ○ No rollup queries
● ETL
  ○ Star schema
  ○ Fair scheduling (ETL vs BI)
     ■ Capacity reserved for the ETL pool
     ■ Don't start reducers until 90% of mappers are done
  ○ LZO on all jobs
● MicroStrategy (ODBC)
● SQuirreL (JDBC)
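
The reducer threshold and the LZO setting above map onto job configuration properties. The sketch below renders such properties as a Hadoop-style `<configuration>` fragment. The property names are the Hadoop 1.x ones and the codec class comes from the hadoop-lzo project; only the 90% figure and the use of LZO are taken from the slide, so whether eBuddy set exactly these properties is an assumption.

```python
import xml.etree.ElementTree as ET

# Assumed settings: the slide only states "90% mappers done" and "LZO".
JOB_PROPERTIES = {
    # Hold reducers back until 90% of the map tasks have finished.
    "mapred.reduce.slowstart.completed.maps": "0.90",
    # Compress intermediate map output with LZO.
    "mapred.compress.map.output": "true",
    "mapred.map.output.compression.codec": "com.hadoop.compression.lzo.LzoCodec",
}


def to_configuration_xml(properties):
    """Render a dict of properties as a Hadoop-style configuration document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_configuration_xml(JOB_PROPERTIES)
print(xml_text)
```

The pool reservation itself would live in the fair scheduler's allocation file, which is not shown here.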
Future
● Look at users from A to Z
  ○ Website logs
  ○ Banners
● Cassandra handler for Hive
  ○ Looking at contact lists (not just size)
● Streaming ETL
  ○ Flume
      ■ No more MySQL & scripts
      ■ Write directly into the correct partition
  ○ Avro
      ■ Fewer schema-related problems
  ○ Snappy
      ■ Lightweight compression
Questions?
Hive partition pruning
● Won't work
select count(*)
from chatsessions cs
  inner join calendar c on (c.cldr_id = cs.login_cldr_id)
where c.iso_date = '2012-06-14';


● Will work
select cldr_id from calendar where iso_date = '2012-06-14';
select count(*) from chatsessions where login_cldr_id in (1234);
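
The "will work" variant is a two-step pattern: resolve the calendar ids first, then put them into the fact-table query as literals so the planner can see which partitions to read. A minimal sketch of that pattern follows. It assumes chatsessions is partitioned on login_cldr_id, uses an in-memory SQLite table in place of the Hive calendar table, and the function names and sample rows are hypothetical.

```python
import sqlite3


def lookup_calendar_ids(conn, iso_date):
    """Step 1: resolve the date to its calendar id(s) in the dimension table."""
    rows = conn.execute(
        "select cldr_id from calendar where iso_date = ?", (iso_date,)
    )
    return [int(row[0]) for row in rows]


def build_count_query(cldr_ids):
    """Step 2: embed the ids as literals so the partition filter is explicit."""
    if not cldr_ids:
        raise ValueError("no calendar ids found for that date")
    # int() guards against anything but integers ending up in the SQL text.
    id_list = ", ".join(str(int(i)) for i in cldr_ids)
    return f"select count(*) from chatsessions where login_cldr_id in ({id_list})"


# SQLite stands in for the Hive calendar table; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table calendar (cldr_id integer, iso_date text);
    insert into calendar values (1233, '2012-06-13'), (1234, '2012-06-14');
""")
ids = lookup_calendar_ids(conn, "2012-06-14")
query = build_count_query(ids)
print(query)
```

The generated statement has the same shape as the second query on the slide and would then be sent to Hive, for example over JDBC.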
Left outer join in Pig
A = LOAD 'file1' USING PigStorage(',') AS (a1:int,a2:chararray);
B = LOAD 'file2' USING PigStorage(',') AS (b1:int,b2:chararray);
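-- Group both relations by key; each group holds a bag of A rows and a bag of B rows.
C = COGROUP A BY a1, B BY b1 OUTER;
-- Keep only the groups that have no matching B rows.
X = FILTER C BY IsEmpty(B);
-- Emit a2 for every unmatched A row.
Z = FOREACH X GENERATE flatten(A.a2);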
DUMP Z;
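
For readers who do not know Pig, the COGROUP / FILTER / FOREACH pipeline above can be mirrored in plain Python. This is only an illustration of the data flow, not how Pig executes it; the helper names and the sample rows are hypothetical.

```python
from collections import defaultdict


def cogroup(a_rows, b_rows):
    """Group both inputs by their first field, like COGROUP A BY a1, B BY b1."""
    groups = defaultdict(lambda: ([], []))
    for row in a_rows:
        groups[row[0]][0].append(row)
    for row in b_rows:
        groups[row[0]][1].append(row)
    return groups


def a2_without_match(a_rows, b_rows):
    """Return a2 for every A row whose key never appears in B."""
    result = []
    for _, (a_bag, b_bag) in sorted(cogroup(a_rows, b_rows).items()):
        if not b_bag:  # FILTER C BY IsEmpty(B)
            result.extend(a2 for _, a2 in a_bag)  # flatten(A.a2)
    return result


# Hypothetical contents of file1 and file2.
file1 = [(1, "alice"), (2, "bob"), (3, "carol")]
file2 = [(2, "x")]
unmatched = a2_without_match(file1, file2)
print(unmatched)
```

As in the Hive example with `where t2.id is null`, only the rows without a partner in the second file survive.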
● Avro & Hive: https://issues.apache.org/jira/browse/HIVE-895

● Flume: https://cwiki.apache.org/FLUME/
