Hadoop @ eBuddy
Published in: Technology
Transcript of "Hadoop @ eBuddy"

  1. Hadoop @ eBuddy
  2. eBuddy
     Web based chat (Started in 2003)
     ● Initially no statistics, MSN only
     ● Started basic logging in 2004
     ● Today
       ○ 34,467,010,693 login records (34 × 10^9)
       ○ It takes about 40 min to select them all.
     XMS (Launched May 23, 2011)
     ● Today
       ○ 1,334,794,121 records (1.3 × 10^9)
     Website (Google Analytics)
     Banners (OpenX)
  3. Warehousing needs
     ● Product owners
       ○ Comparing product versions
         ■ avg duration
         ■ msg sent/received
       ○ Churn analysis
       ○ Feature analysis
     ● Marketing
       ○ What countries should we focus on?
       ○ What people should we target?
     ● Sales
       ○ Sell banners in countries/products.
     ● Operations/Dev
       ○ Help solve bugs
       ○ Blocked in countries/providers
  4. Interesting to know
     ● Developers are Java centric
     ● Hosting in the US but BI people in Amsterdam
     ● 18 Hadoop nodes, each having
       ○ 16 cores
       ○ 24G RAM
       ○ 4x400G HDs
     ● We make money with banners
       ○ So don't expect deep pockets
  5. Warehouse timeline
     ● Traditional RDBMS (2004)
     ● Custom MapReduce code (2008)
       ○ Joining two files (merge join/map join?)
       ○ Repeating code
       ○ Consider abstraction
       ○ Changing data changing code?
     ● Pig scripts (2008/2009)
       ○ Much simpler to read but domain specific
     ● Hive (2009)
       ○ Generic SQL but with some limitations
       ○ Existing tools can be used
  6. Hive
     ● Hey I already know this:
       select *
       from table1 t1
         left outer join table2 t2 on (t1.id = t2.id)
       where t2.id is null;
     ● Java programmers will like this:
       ○ Spring JdbcTemplates
       ○ Existing JDBC tools (SQuirreL)
       ○ Syntax highlighting
       ○ Code completion
  7. Present
     ● App servers log to MySQL
       ○ Brittle but it works
     ● Hive
       ○ SQL (most developers know this)
       ○ Partition pruning issues
       ○ No rollup queries
     ● ETL
       ○ Star schema
       ○ Fair scheduling (ETL vs BI)
         ■ reserved for ETL pool
         ■ don't start reducers until 90% of mappers are done
       ○ LZO on all jobs
     ● MicroStrategy (ODBC)
     ● SQuirreL (JDBC)
  8. Future
     ● Look at users from A to Z
       ○ website logs
       ○ banners
     ● Cassandra handler for Hive
       ○ Looking at contact lists (not just size)
     ● Streaming ETL
       ○ Flume
         ■ No more MySQL & scripts
         ■ Directly write into the correct partition
       ○ Avro
         ■ Fewer schema related problems
       ○ Snappy
         ■ Lightweight compression
  9. Questions?
  10. Hive partition pruning
      ● Won't work
        select count(*)
        from chatsessions cs
          inner join calendar c on (c.cldr_id = cs.login_cldr_id)
        where c.iso_date = '2012-06-14';
      ● Will work
        select cldr_id from calendar where iso_date = '2012-06-14';
        select count(*) from chatsessions where login_cldr_id in (1234);
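The two-step workaround on slide 10 can be scripted so the calendar lookup and the counting query run back to back. The sketch below is only an illustration: it uses Python's built-in `sqlite3` as a stand-in for a Hive connection, and the helper names (`build_count_query`, `count_sessions_for_date`, `demo`), the `session_id` column, and the sample rows are assumptions, not part of the original deck. The point it shows is that the second query carries the calendar ids as plain literals, which is the form the slide reports as working with partition pruning.

```python
import sqlite3


def build_count_query(cldr_ids):
    """Render the second query with the calendar ids inlined as integer literals."""
    if not cldr_ids:
        raise ValueError("no calendar ids to query")
    id_list = ", ".join(str(int(i)) for i in cldr_ids)
    return f"select count(*) from chatsessions where login_cldr_id in ({id_list})"


def count_sessions_for_date(conn, iso_date):
    """Resolve the date to calendar ids first, then count sessions by literal id."""
    # Step 1: small lookup against the calendar dimension.
    rows = conn.execute(
        "select cldr_id from calendar where iso_date = ?", (iso_date,)
    ).fetchall()
    ids = [row[0] for row in rows]
    if not ids:
        return 0
    # Step 2: the fact-table query only sees literal ids, no join.
    return conn.execute(build_count_query(ids)).fetchone()[0]


def demo():
    # In-memory SQLite stands in for Hive here (assumption for illustration).
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table calendar (cldr_id integer, iso_date text);
        create table chatsessions (session_id integer, login_cldr_id integer);
        insert into calendar values (1233, '2012-06-13'), (1234, '2012-06-14');
        insert into chatsessions values (1, 1233), (2, 1234), (3, 1234);
        """
    )
    return count_sessions_for_date(conn, "2012-06-14")
```

Against a real Hive warehouse the same shape would apply with a Hive client in place of `sqlite3`; the query text produced by `build_count_query` is what matters.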
  11. Left outer join in Pig
      A = LOAD 'file1' USING PigStorage(',') AS (a1:int,a2:chararray);
      B = LOAD 'file2' USING PigStorage(',') AS (b1:int,b2:chararray);
      C = COGROUP A BY a1, B BY b1 OUTER;
      X = FILTER C BY IsEmpty(B);
      Z = FOREACH X GENERATE flatten(A.a2);
      DUMP Z;
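The Pig script on slide 11 and the Hive query on slide 6 express the same idea: keep the rows of the first input that have no match in the second. As a rough illustration of what the COGROUP, FILTER and FOREACH steps do, here is a minimal pure-Python sketch. The function names and the sample tuples are hypothetical, and it models only the grouping logic, not Pig's handling of null keys.

```python
from collections import defaultdict


def cogroup(a_rows, b_rows):
    """Group both inputs by their first field, like COGROUP A BY a1, B BY b1."""
    groups = defaultdict(lambda: ([], []))
    for row in a_rows:
        groups[row[0]][0].append(row)  # bag of A tuples for this key
    for row in b_rows:
        groups[row[0]][1].append(row)  # bag of B tuples for this key
    return dict(groups)


def rows_without_match(a_rows, b_rows):
    """Return a2 for every A row whose key never appears in B."""
    result = []
    for _key, (a_bag, b_bag) in cogroup(a_rows, b_rows).items():
        if not b_bag:  # FILTER C BY IsEmpty(B)
            result.extend(a2 for _a1, a2 in a_bag)  # flatten(A.a2)
    return result
```

For example, `rows_without_match([(1, "x"), (2, "y"), (3, "z")], [(2, "p")])` keeps only the values whose keys 1 and 3 are absent from the second input.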
  12. ● Avro & Hive: https://issues.apache.org/jira/browse/HIVE-895
      ● Flume: https://cwiki.apache.org/FLUME/