Hadoop @ eBuddy

Published in: Technology

  1. Hadoop @ eBuddy
  2. eBuddy
     Web based chat (started in 2003)
     ● Initially no statistics, MSN only
     ● Started basic logging in 2004
     ● Today
       ○ 34,467,010,693 login records (34 × 10^9)
       ○ It takes about 40 min to select them all.
     XMS (launched May 23, 2011)
     ● Today
       ○ 1,334,794,121 records (1.3 × 10^9)
     Website (Google Analytics)
     Banners (OpenX)
  3. Warehousing needs
     ● Product owners
       ○ Comparing product versions
         ■ avg duration
         ■ messages sent/received
       ○ Churn analysis
       ○ Feature analysis
     ● Marketing
       ○ What countries should we focus on?
       ○ What people should we target?
     ● Sales
       ○ Sell banners in countries/products.
     ● Operations/Dev
       ○ Help solve bugs
       ○ Blocked in countries/providers
  4. Interesting to know
     ● Developers are Java centric
     ● Hosting in the US but BI people in Amsterdam
     ● 18 Hadoop nodes, each having
       ○ 16 cores
       ○ 24 GB RAM
       ○ 4 × 400 GB hard disks
     ● We make money with banners
       ○ So don't expect deep pockets
  5. Warehouse timeline
     ● Traditional RDBMS (2004)
     ● Custom MapReduce code (2008)
       ○ Joining two files (merge join/map join?)
       ○ Repeating code
       ○ Consider abstraction
       ○ Changing data, changing code?
     ● Pig scripts (2008/2009)
       ○ Much simpler to read but domain specific
     ● Hive (2009)
       ○ Generic SQL but with some limitations
       ○ Existing tools can be used
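The "merge join/map join?" question on the timeline slide can be illustrated with a minimal Python sketch. This is not eBuddy's MapReduce code, only the two strategies in miniature: a map join assumes one input is small enough to hold in memory, while a merge join assumes both inputs are already sorted by key. The function names and the `(key, value)` row shape are assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[int, str]  # hypothetical (key, value) record


def map_join(small: Iterable[Row], large: Iterable[Row]) -> Iterator[Tuple[int, str, str]]:
    """Map-side join: hold the small input in memory, stream the large one."""
    lookup: Dict[int, List[str]] = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    for key, value in large:
        for small_value in lookup.get(key, []):
            yield key, value, small_value


def merge_join(left: List[Row], right: List[Row]) -> Iterator[Tuple[int, str, str]]:
    """Merge join: both inputs must already be sorted by key."""
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Pair every left row of this key with every right row of this key.
            while i < len(left) and left[i][0] == lkey:
                for k in range(j, j_end):
                    yield lkey, left[i][1], right[k][1]
                i += 1
            j = j_end


users = [(1, "nl"), (2, "us")]                   # small, sorted by key
sessions = [(1, "s1"), (1, "s2"), (3, "s3")]     # large, sorted by key

print(list(map_join(users, sessions)))
print(list(merge_join(sessions, users)))
```

Both calls produce the same joined rows here. Writing and maintaining this kind of logic per job is the "repeating code" the slide mentions, which Pig and Hive later abstract away.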
  6. Hive
     ● Hey, I already know this:
         select *
         from table1 t1 left outer join table2 t2 on (t1.id = t2.id)
         where t2.id is null;
     ● Java programmers will like this:
       ○ Spring JdbcTemplates
       ○ Existing JDBC tools (SQuirreL)
       ○ Syntax highlighting
       ○ Code completion
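The query on the Hive slide is the familiar anti-join pattern: keep the rows of `table1` that have no partner in `table2`. As a rough illustration, the sketch below runs the same statement against an in-memory SQLite database from Python. SQLite stands in for Hive purely for convenience, and the `name` column and sample rows are made up.

```python
import sqlite3

# In-memory stand-in database; table contents are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, name TEXT);
    CREATE TABLE table2 (id INTEGER, name TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table2 VALUES (1, 'x'), (3, 'z');
""")

# Same shape as the slide's query: rows of table1 without a match in table2.
unmatched = conn.execute("""
    SELECT t1.id, t1.name
    FROM table1 t1
    LEFT OUTER JOIN table2 t2 ON (t1.id = t2.id)
    WHERE t2.id IS NULL
""").fetchall()

print(unmatched)  # [(2, 'b')]
```

Only id 2 is missing from `table2`, so it is the only row that survives the `IS NULL` filter.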
  7. Present
     ● App servers log to MySQL
       ○ Brittle but it works
     ● Hive
       ○ SQL (most developers know this)
       ○ Partition pruning issues
       ○ No rollup queries
     ● ETL
       ○ Star schema
       ○ Fair scheduling (ETL vs BI)
         ■ reserved for ETL pool
         ■ don't start reducers until 90% of mappers are done
       ○ LZO on all jobs
     ● MicroStrategy (ODBC)
     ● SQuirreL (JDBC)
  8. Future
     ● Look at users from A to Z
       ○ website logs
       ○ banners
     ● Cassandra handler for Hive
       ○ Looking at contact lists (not just size)
     ● Streaming ETL
       ○ Flume
         ■ No more MySQL & scripts
         ■ Directly write into the correct partition
       ○ Avro
         ■ Less schema related problems
       ○ Snappy
         ■ Lightweight compression
  9. Questions?
  10. Hive partition pruning
      ● Won't work
          select count(*)
          from chatsessions cs inner join calendar c on (c.cldr_id = cs.login_cldr_id)
          where c.iso_date = '2012-06-14';
      ● Will work
          select cldr_id from calendar where iso_date = '2012-06-14';
          select count(*) from chatsessions where login_cldr_id in (1234);
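The "will work" variant is a two-step lookup: first resolve the date to calendar ids, then pass those ids to the large query as literal constants, which is what lets the planner restrict the scan (assuming, as the slide implies, that `chatsessions` is partitioned by `login_cldr_id`). A hedged Python sketch of that pattern follows. SQLite is only a stand-in and has no partitions, so this shows the query shape rather than any pruning, and `build_pruned_query`, `count_sessions_on`, and the `session_id` column are hypothetical.

```python
import sqlite3
from typing import List


def build_pruned_query(cldr_ids: List[int]) -> str:
    """Build step 2: a count over chatsessions filtered by literal ids."""
    if not cldr_ids:
        raise ValueError("no calendar ids found for that date")
    # int() ensures only numeric literals end up in the statement text.
    id_list = ", ".join(str(int(i)) for i in cldr_ids)
    return f"select count(*) from chatsessions where login_cldr_id in ({id_list})"


def count_sessions_on(conn: sqlite3.Connection, iso_date: str) -> int:
    # Step 1: resolve the date to calendar ids with a small standalone query.
    rows = conn.execute(
        "select cldr_id from calendar where iso_date = ?", (iso_date,)
    ).fetchall()
    # Step 2: run the large query with those ids inlined as constants.
    query = build_pruned_query([row[0] for row in rows])
    return conn.execute(query).fetchone()[0]


# Stand-in data (hypothetical); SQLite is used only so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table calendar (cldr_id integer, iso_date text);
    create table chatsessions (session_id integer, login_cldr_id integer);
    insert into calendar values (1234, '2012-06-14'), (1235, '2012-06-15');
    insert into chatsessions values (1, 1234), (2, 1234), (3, 1235);
""")

print(build_pruned_query([1234]))
print(count_sessions_on(conn, "2012-06-14"))  # 2
```

The cost of this workaround is an extra round trip per query, which is easy to hide in ETL code but awkward for ad hoc BI tools that generate the join form themselves.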
  11. Left outer join in Pig
        A = LOAD 'file1' USING PigStorage(',') AS (a1:int, a2:chararray);
        B = LOAD 'file2' USING PigStorage(',') AS (b1:int, b2:chararray);
        C = COGROUP A BY a1, B BY b1 OUTER;
        X = FILTER C BY IsEmpty(B);
        Z = FOREACH X GENERATE flatten(A.a2);
        DUMP Z;
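For readers who do not know Pig, the script above can be read as: group both inputs by key, keep the groups whose B side is empty, and emit the `a2` values from the A side. The Python sketch below mimics those semantics in memory; it is an illustration rather than how Pig executes the script, and the function names and sample rows are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, str]


def cogroup(a_rows: List[Row], b_rows: List[Row]) -> Dict[int, Tuple[List[Row], List[Row]]]:
    """Group both inputs by their first field, like COGROUP A BY a1, B BY b1."""
    groups: Dict[int, Tuple[List[Row], List[Row]]] = defaultdict(lambda: ([], []))
    for row in a_rows:
        groups[row[0]][0].append(row)
    for row in b_rows:
        groups[row[0]][1].append(row)
    return dict(groups)


def a2_without_match(a_rows: List[Row], b_rows: List[Row]) -> List[str]:
    """FILTER ... BY IsEmpty(B), then flatten the a2 values of the A bag."""
    result = []
    # Sorted by key only to make the output deterministic; Pig gives no such order.
    for _key, (a_bag, b_bag) in sorted(cogroup(a_rows, b_rows).items()):
        if not b_bag:
            # A group that exists only in B has an empty A bag and emits nothing.
            result.extend(a2 for _a1, a2 in a_bag)
    return result


A = [(1, "one"), (2, "two"), (3, "three")]
B = [(1, "uno"), (4, "cuatro")]

print(a2_without_match(A, B))  # ['two', 'three']
```

Keys 2 and 3 appear only in A, so their `a2` values are returned; key 4 appears only in B and contributes nothing.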
  12. ● Avro & Hive: https://issues.apache.org/jira/browse/HIVE-895
      ● Flume: https://cwiki.apache.org/FLUME/
