Kamil Chmielewski, Jacek Juraszek - "Hadoop. W poszukiwaniu złotego młotka." ("Hadoop: in search of the golden hammer.") Presentation from j.Piknik, 30.08.2012.
Transcript of "Kamil Chmielewski, Jacek Juraszek - "Hadoop. W poszukiwaniu złotego młotka.""

  1. Kamil Chmielewski, Jacek Juraszek – Hadoop: w poszukiwaniu złotego młotka
  2. Source: IDC's Digital Universe Study, sponsored by EMC, June 2011
  3. • Facebook – 30 PB (2011): 2,000 servers, 22,400 cores, 64 TB RAM
     • Yahoo – 14 PB (2010): 4,000 servers
     • eBay – 5.3 PB: 532 servers, 4,256 cores
     • Google – 24 PB ???
  4. Growth in computing power. Source: The Free Lunch Is Over, Herb Sutter
  5. HDFS architecture
  6. HDFS File System Shell
     • hadoop fs -cat file:///file3 /user/hadoop/file4
     • hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
     • hadoop fs -du /user/hadoop/dir1
     • hadoop fs -get hdfs://nn.example.com/user/hadoop/file localfile
     • hadoop fs -ls /user/hadoop/file1
     • hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir
     • hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
     • hadoop fs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
     • hadoop fs -rm hdfs://nn.example.com/file
     • hadoop fs -tail pathname
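     The same operations are available programmatically through the org.apache.hadoop.fs.FileSystem API. A minimal sketch, assuming an illustrative NameNode address and file path (not taken from the slides):

     import java.io.IOException;
     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.FSDataInputStream;
     import org.apache.hadoop.fs.FSDataOutputStream;
     import org.apache.hadoop.fs.FileSystem;
     import org.apache.hadoop.fs.Path;

     public class HdfsClientSketch {
         public static void main(String[] args) throws IOException {
             Configuration conf = new Configuration();
             // point the client at the NameNode (the key is fs.defaultFS on newer versions)
             conf.set("fs.default.name", "hdfs://nn.example.com:8020");
             FileSystem fs = FileSystem.get(conf);

             Path file = new Path("/user/hadoop/hello.txt");

             // write a file (roughly: hadoop fs -put)
             FSDataOutputStream out = fs.create(file);
             out.writeUTF("hello hdfs");
             out.close();

             // read it back (roughly: hadoop fs -cat)
             FSDataInputStream in = fs.open(file);
             System.out.println(in.readUTF());
             in.close();

             fs.close();
         }
     }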
  7. A distributed client?
  8. NameNode HA
  9. Parallelization – MapReduce
     function map(String name, String document):
       // name: document name
       // document: document contents
       for each word w in document:
         emit (w, 1)

     function reduce(String word, Iterator partialCounts):
       // word: a word
       // partialCounts: a list of aggregated partial counts
       sum = 0
       for each pc in partialCounts:
         sum += pc
       emit (word, sum)

     http://en.wikipedia.org/wiki/MapReduce
  10. MapReduce – Hadoop Java: 63 lines!!! http://wiki.apache.org/hadoop/WordCount
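     For reference, a condensed sketch of a WordCount job along the lines of the linked wiki page, written against the org.apache.hadoop.mapreduce API (trimmed for brevity, so it does not literally reach 63 lines):

     import java.io.IOException;
     import java.util.StringTokenizer;
     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Job;
     import org.apache.hadoop.mapreduce.Mapper;
     import org.apache.hadoop.mapreduce.Reducer;
     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

     public class WordCount {

         public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
             private final static IntWritable ONE = new IntWritable(1);
             private final Text word = new Text();

             @Override
             protected void map(LongWritable key, Text value, Context context)
                     throws IOException, InterruptedException {
                 StringTokenizer tokenizer = new StringTokenizer(value.toString());
                 while (tokenizer.hasMoreTokens()) {
                     word.set(tokenizer.nextToken());
                     context.write(word, ONE);           // emit (w, 1)
                 }
             }
         }

         public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
             @Override
             protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                     throws IOException, InterruptedException {
                 int sum = 0;
                 for (IntWritable value : values) {
                     sum += value.get();
                 }
                 context.write(key, new IntWritable(sum)); // emit (word, sum)
             }
         }

         public static void main(String[] args) throws Exception {
             Job job = new Job(new Configuration(), "wordcount");
             job.setJarByClass(WordCount.class);
             job.setMapperClass(TokenizerMapper.class);
             job.setReducerClass(SumReducer.class);
             job.setOutputKeyClass(Text.class);
             job.setOutputValueClass(IntWritable.class);
             FileInputFormat.addInputPath(job, new Path(args[0]));
             FileOutputFormat.setOutputPath(job, new Path(args[1]));
             System.exit(job.waitForCompletion(true) ? 0 : 1);
         }
     }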
  11. MapReduce – Apache Pig (7 lines good, 63 bad)
     input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
     -- Extract words from each line and put them into a pig bag
     -- datatype, then flatten the bag to get one word on each row
     words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
     -- filter out any words that are just white spaces
     filtered_words = FILTER words BY word MATCHES '\w+';
     -- create a group for each word
     word_groups = GROUP filtered_words BY word;
     -- count the entries in each group
     word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
     -- order the records by count
     ordered_word_count = ORDER word_count BY count DESC;
     STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
     http://en.wikipedia.org/wiki/Pig_(programming_tool)
  12. A real-life example: the same production metrics job shown in two flavours side by side. One listing is the hand-written Java MapReduce over HBase (MetricsMapper extends TableMapper, MetricsReducer extends TableReducer, aggregating visits, page views and bounces and writing Puts back to HBase); the other is the equivalent Pig script (custom UDFs from pl.allegro.cm.pig.udf, LOAD via HBaseStorage, FILTER/GROUP/JOIN, STORE via piggybank DBStorage with an INSERT ... ON DUPLICATE KEY UPDATE into a metrics table). Caption on the slide: "A to jest PIG…" ("And this is the Pig one…").
  13. Hadoop + MongoDB (diagram): MongoDB serves the online data, Hadoop keeps the archive data and runs MR; flushed data and batch-processing results move between the two.
  14. Filesystem = HDFS?
  15. HBase
     key           timestamp   cf dane        cf adres
     80071223097   t3                         miasto=Warszawa
     80071223097   t2                         miasto=Gdańsk
     80071223097   t1          imie=Jan
     86121267222   t2                         ulica=Długa
     86121267222   t1          imie=Maria     miasto=Poznań
  16. HTable table = new HTable("osoby");
     Put event = new Put(Bytes.toBytes("80071223097"))
         .add(Bytes.toBytes("dane"),  Bytes.toBytes("imie"),   Bytes.toBytes("Jan"))
         .add(Bytes.toBytes("adres"), Bytes.toBytes("miasto"), Bytes.toBytes("Warszawa"));
     table.put(event);

     // https://github.com/nearinfinity/hbase-dsl
     HTable table = new HTable("osoby");
     hBase.save(table).row("80071223097").
         family("dane").col("imie", "Jan").
         family("adres").col("miasto", "Warszawa");

     # http://happybase.readthedocs.org/
     table = connection.table('osoby')
     table.put('80071223097', {'dane:imie': 'Jan', 'adres:miasto': 'Warszawa'})
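     Reading the row back with the plain client API, as a minimal sketch (Get and Result come from org.apache.hadoop.hbase.client; table, family and qualifier names follow the example above):

     HTable table = new HTable("osoby");
     Result row = table.get(new Get(Bytes.toBytes("80071223097")));
     // fetch single cells by column family and qualifier
     byte[] imie   = row.getValue(Bytes.toBytes("dane"),  Bytes.toBytes("imie"));
     byte[] miasto = row.getValue(Bytes.toBytes("adres"), Bytes.toBytes("miasto"));
     System.out.println(Bytes.toString(imie) + ", " + Bytes.toString(miasto));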
  17. HBase Shell
     # Count rows in a table
     def _count_internal(interval = 1000, caching_rows = 10)
       # We can safely set scanner caching with the first key only filter
       scan = org.apache.hadoop.hbase.client.Scan.new
       scan.cache_blocks = false
       scan.caching = caching_rows
       scan.setFilter(org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter.new)

       # Run the scanner
       scanner = @table.getScanner(scan)
       count = 0
       iter = scanner.iterator

       # Iterate results
       while iter.hasNext
         row = iter.next
         count += 1
         next unless (block_given? && count % interval == 0)
         # Allow command modules to visualize counting process
         yield(count, String.from_java_bytes(row.getRow))
       end

       # Return the counter
       return count
     end
  18. Package nightmare
     org.apache.hadoop.mapred (status: legacy): the do-everything classes, Chain_mr, the JOIN operation on MR
     org.apache.hadoop.mapreduce: friendly API, base classes, Contexts, support for the CLI and convention over configuration
     Goodies from the Maven repo: repackaged Guava, dependencies on commons-logging, distributions only in 3rd-party repos, HBase with dependencies on jetty and servlet-api
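     To make the difference concrete, a minimal sketch of an identity mapper written against each namespace (two separate source files; the class and method signatures are the standard Hadoop 1.x ones, everything else is illustrative):

     // OldApiMapper.java – the legacy org.apache.hadoop.mapred API
     import java.io.IOException;
     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapred.MapReduceBase;
     import org.apache.hadoop.mapred.Mapper;
     import org.apache.hadoop.mapred.OutputCollector;
     import org.apache.hadoop.mapred.Reporter;

     public class OldApiMapper extends MapReduceBase
             implements Mapper<LongWritable, Text, Text, Text> {
         // output goes through an OutputCollector, progress/counters through a Reporter
         public void map(LongWritable key, Text value,
                         OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
             output.collect(new Text(key.toString()), value);
         }
     }

     // NewApiMapper.java – the newer org.apache.hadoop.mapreduce API
     import java.io.IOException;
     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Mapper;

     public class NewApiMapper extends Mapper<LongWritable, Text, Text, Text> {
         // everything (output, counters, configuration) goes through a single Context object
         @Override
         protected void map(LongWritable key, Text value, Context context)
                 throws IOException, InterruptedException {
             context.write(new Text(key.toString()), value);
         }
     }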
  19. Version mess
  20. Example system architecture. MR = batch; the databases still give the application its meaning.
  21. Hadoop + SOLR = SOLR Cloud
  22. Not every problem is big enough… FACEBOOK CLUSTER: 2k machines, 12 TB per machine, 30 PB of total capacity, 1,200 machines x 8 cores, 800 machines x 16 cores
  23. Use cases: • document indexing • web-site usage analysis • server and firewall logs • image and video repositories • system parameter metrics • recommendation systems
  24. More info …
     http://hortonworks.com/blog/
     http://www.cloudera.com/blog/
     http://hadoopblog.blogspot.com/
     http://www.larsgeorge.com/
     http://natishalom.typepad.com/nati_shaloms_blog/
     http://developer.yahoo.com/blogs/ydn/categories/hadoop/
     http://bradhedlund.com/topics/big-data/