Hadoop @ eBuddy
eBuddy
Web based chat (Started in 2003)
● Initially no statistics, msn only
● Started basic logging in 2004
● Today
  ○ 34.467.010.693 login records (34x10^9)
  ○ It takes about 40min to select them all.
XMS (Launched May 23, 2011)
● Today
   ○ 1.334.794.121 records (1,3x10^9)
Website (google analytics)
Banners (openx)
Warehousing needs
● Product owners
  ○ Comparing product version
     ■ avg duration
     ■ msg sent/received
  ○ Churn analysis
  ○ Feature analysis
● Marketing
  ○ What countries should we focus on
  ○ What people should we target?
● Sales
  ○ Sell banners in countries/products.
● Operations/Dev
  ○ Help solve bugs
  ○ Blocked in countries/providers
Interesting to know
● Developers are Java centric
● Hosting in the US but BI people in Amsterdam
● 18 hadoop nodes each having
    ○ 16 cores
    ○ 24G ram
    ○ 4x400G HD's
●   We make money with banners
    ○ So don't expect deep pockets
Warehouse timeline
● Traditional rdbms (2004)
● Custom mapreduce code (2008)
  ○ Joining two files (merge join/map join?)
  ○ Repeating code
  ○ Consider abstraction
  ○ Changing data changing code?
● Pig scripts (2008/2009)
  ○ Much simpler to read but domain specific
● Hive (2009)
  ○ Generic sql but with some limitations
  ○ Existing tools can be used
Hive
● Hey I already know this:
select *
from table1 t1
  left outer join table2 t2 on (t1.id = t2.id)
where t2.id is null;


● Java programmers will like this:
  ○ Spring JdbcTemplates
  ○ Existing jdbc tools (SQuirreL)
  ○ Syntax highlighting
  ○ Code completion
Present
● App servers log to mysql
  ○ Brittle but it works
● Hive
  ○ Sql (most developers know this)
  ○ Partition pruning issues
  ○ No rollup queries
● ETL
  ○ Star schema
  ○ Fair scheduling (ETL vs BI)
     ■ reserved for etl pool
     ■ don't start reducers until 90% mappers done
  ○ Lzo on all jobs
● MicroStrategy (odbc)
● SQuirreL (jdbc)
Future
● Look at users from a to z
  ○ website logs
  ○ banners
● Cassandra handler for hive
  ○ Looking at contact lists (not just size)
● Streaming ETL
  ○ flume
      ■ No more mysql & scripts
      ■ Directly write into the correct partition
  ○ avro
      ■ Less schema related problems
  ○ snappy
      ■ Lightweight compression
Questions?
Hive partition pruning
● Won't work
select count(*)
from chatsessions cs
  inner join calendar c on (c.cldr_id = cs.login_cldr_id)
where c.iso_date = '2012-06-14';


● Will work
select cldr_id from calendar where iso_date = '2012-06-14';
select count(*) from chatsessions where login_cldr_id in (1234);
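
The "will work" variant above can be scripted so that the id lookup and the fact-table query run back to back. What follows is a minimal Python sketch, not the authors' code. It assumes chatsessions is partitioned by login_cldr_id, as the slide implies. It uses the standard-library sqlite3 module as a stand-in for a Hive connection, so no partition pruning actually happens here; only the query shape is shown. The helper names count_sessions_on and demo_connection, and all table rows, are made up for illustration.

```python
import sqlite3


def count_sessions_on(conn, iso_date):
    """Count chat sessions for one date using two single-table queries."""
    # Step 1: resolve the date to calendar ids on the small dimension table.
    rows = conn.execute(
        "select cldr_id from calendar where iso_date = ?", (iso_date,)
    ).fetchall()
    ids = [int(r[0]) for r in rows]
    if not ids:
        # Unknown date: nothing to count. Many SQL dialects also reject
        # an empty IN list, so return early.
        return 0

    # Step 2: query the fact table with the ids written out as literals,
    # mirroring "login_cldr_id in (1234)" from the slide.
    id_list = ", ".join(str(i) for i in ids)
    sql = f"select count(*) from chatsessions where login_cldr_id in ({id_list})"
    return conn.execute(sql).fetchone()[0]


def demo_connection():
    """Build a tiny in-memory database with made-up rows (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table calendar (cldr_id integer, iso_date text)")
    conn.execute(
        "create table chatsessions (session_id integer, login_cldr_id integer)"
    )
    conn.executemany(
        "insert into calendar values (?, ?)",
        [(1233, "2012-06-13"), (1234, "2012-06-14")],
    )
    conn.executemany(
        "insert into chatsessions values (?, ?)",
        [(1, 1233), (2, 1234), (3, 1234)],
    )
    return conn


if __name__ == "__main__":
    print(count_sessions_on(demo_connection(), "2012-06-14"))
```

The point of the second step is that the fact-table query contains only constants in its predicate on the partition column, which is the form the slide reports as working.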
Left outer join in Pig
A = LOAD 'file1' USING PigStorage(',') AS (a1:int,a2:chararray);
B = LOAD 'file2' USING PigStorage(',') AS (b1:int,b2:chararray);
C = COGROUP A BY a1, B BY b1 OUTER;
X = FILTER C BY IsEmpty(B);
Z = FOREACH X GENERATE flatten(A.a2);
DUMP Z;
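
Because of the IsEmpty(B) filter, the script above keeps only the rows of file1 whose key has no match in file2, the same result as the Hive query with "where t2.id is null". The following is a minimal Python sketch of that COGROUP-then-filter logic on in-memory lists. It is an illustration of the semantics, not how Pig executes the job; the function names and sample rows are made up.

```python
from collections import defaultdict


def cogroup(a_rows, b_rows):
    """Group two lists of (key, value) tuples by key, like Pig's COGROUP.

    Returns {key: (a_bag, b_bag)}; a bag is empty when that side has no
    rows for the key.
    """
    groups = defaultdict(lambda: ([], []))
    for key, value in a_rows:
        groups[key][0].append(value)
    for key, value in b_rows:
        groups[key][1].append(value)
    return dict(groups)


def unmatched_a_values(a_rows, b_rows):
    """Values from a_rows whose key never appears in b_rows."""
    result = []
    for key, (a_bag, b_bag) in cogroup(a_rows, b_rows).items():
        if not b_bag:              # FILTER C BY IsEmpty(B)
            result.extend(a_bag)   # FOREACH X GENERATE flatten(A.a2)
    return result


if __name__ == "__main__":
    file1 = [(1, "alice"), (2, "bob"), (3, "carol")]  # made-up rows
    file2 = [(2, "x"), (4, "y")]
    print(unmatched_a_values(file1, file2))
```

Keys that appear only in the second input produce a group with an empty first bag and a non-empty second bag, so the filter drops them as well.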
● avro & hive: https://issues.apache.org/jira/browse/HIVE-895

● flume:
   https://cwiki.apache.org/FLUME/
