Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski

Nowadays we are producing huge volumes of information, but unfortunately at most 12% of it is ever analyzed.
That is why we should dive into our data lake and pull out the Holy Grail - the knowledge. But big data means big problems.
So, challenge accepted!
The perfect solution for achieving this goal is Hadoop. It is a 'data operating system' which allows us to process large volumes of any data in a distributed way.
Together, we will take a phenomenal journey around the Hadoop world.
First stop: operations basics.
Second stop: a short tour around the Hadoop ecosystem.
At the end of our journey, we will walk through several examples that show the real power of Hadoop as your data platform.

Arkadiusz Osinski - Works at Allegro Group as a system administrator. He has been involved in building and maintaining the Hadoop infrastructure within Allegro Group from the beginning. Previously he was responsible for maintaining large-scale database systems. Passionate about new technologies and cycling.

Robert Mroczkowski - Graduated with a master's degree in Computer Science from Nicolaus Copernicus University in 2006, and with a bachelor's degree in Applied Informatics from the same university in 2007. From 2006 to 2011 he was a PhD student in Computer Science, researching applications of Computer Science in bioinformatics. In 2012 he started working as a Unix system administrator at Allegro Group, where he gained Hadoop experience building and maintaining a cluster for GA. Every day he works with modern high-performance, highly available technologies, centrally managed in a cloud environment.



  1. Hadoop: challenge accepted! Arkadiusz Osiński (arkadiusz.osinski@allegrogroup.com), Robert Mroczkowski (robert.mroczkowski@allegrogroup.com)
  2. ToC - Hadoop basics - Gather data - Process your data - Learn from your data - Visualize your data
  3. BigData - Petabytes of (un)structured data
  4. BigData - Petabytes of (un)structured data - 12% of data is analyzed
  5. BigData - Petabytes of (un)structured data - 12% of data is analyzed - a lot of data is not gathered
  6. BigData - Petabytes of (un)structured data - 12% of data is analyzed - a lot of data is not gathered - how to gain knowledge?
  7. Power: Big Data, Data Lake, Scalability, Petabytes, MapReduce, Commodity
  8. HDFS - Storage layer
  9. HDFS - Storage layer - Distributed file system
  10. HDFS - Storage layer - Distributed file system - Commodity hardware
  11. HDFS - Storage layer - Distributed file system - Commodity hardware - Scalability
  12. HDFS - Storage layer - Distributed file system - Commodity hardware - Scalability - JBOD
  13. HDFS - Storage layer - Distributed file system - Commodity hardware - Scalability - JBOD - Access control
  14. HDFS - Storage layer - Distributed file system - Commodity hardware - Scalability - JBOD - Access control - No SPOF
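
      Side note: besides the hdfs dfs command line, HDFS exposes the WebHDFS REST API, so any language can reach the storage layer over HTTP. A minimal Python 2 sketch of listing a directory - the namenode address is a placeholder assumption:

      #!/usr/bin/python
      # List a directory over WebHDFS (LISTSTATUS is part of the standard WebHDFS REST API).
      # The namenode host/port and the /tweets path are assumptions - adjust to your cluster.
      import json
      import urllib2

      url = 'http://namenode.example.com:50070/webhdfs/v1/tweets?op=LISTSTATUS'
      reply = json.load(urllib2.urlopen(url))
      for entry in reply['FileStatuses']['FileStatus']:
          print entry['type'], entry['length'], entry['pathSuffix']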
  15. YARN - Distributed computing layer
  16. YARN - Distributed computing layer - Operations in place of data
  17. YARN - Distributed computing layer - Operations in place of data - MapReduce…
  18. YARN - Distributed computing layer - Operations in place of data - MapReduce… - and other applications
  19. YARN - Distributed computing layer - Operations in place of data - MapReduce… - and other applications - Resource management
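
      Side note: the ResourceManager likewise exposes a REST API (/ws/v1/cluster/apps), handy for checking what is currently running on the cluster. A minimal Python 2 sketch - the ResourceManager address is a placeholder assumption:

      #!/usr/bin/python
      # List applications known to the ResourceManager via its REST API.
      # The ResourceManager host/port are assumptions - adjust to your cluster.
      import json
      import urllib2

      url = 'http://resourcemanager.example.com:8088/ws/v1/cluster/apps'
      reply = json.load(urllib2.urlopen(url))
      for app in (reply.get('apps') or {}).get('app', []):
          print app['id'], app['name'], app['state'], app['queue']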
  20. 20. Let’s  squize  our  data   to  get  a  juice!!
  21. Gather data

      flume-twitter.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
      flume-twitter.sources.Twitter.channels = MemChannel
      flume-twitter.sources.Twitter.consumerKey = (…)
      flume-twitter.sources.Twitter.consumerSecret = (…)
      flume-twitter.sources.Twitter.accessToken = (…)
      flume-twitter.sources.Twitter.accessTokenSecret = (…)
      flume-twitter.sources.Twitter.keywords = hadoop, big data, nosql
  22. Process your data - Hadoop Streaming!
  23. Process your data - Hadoop Streaming! - No need to write code in Java
  24. Process your data - Hadoop Streaming! - No need to write code in Java - You can use Python, Perl or Awk
  25. Process your data

      #!/usr/bin/python
      # Mapper: for every tweet mentioning the keyword, emit "date<TAB>1".
      import sys
      import json
      import datetime

      keyword = 'hadoop'
      for line in sys.stdin:
          data = json.loads(line.strip())
          if keyword in data['text'].lower():
              day = datetime.datetime.strptime(data['created_at'],
                                               '%a %b %d %H:%M:%S +0000 %Y').strftime('%Y-%m-%d')
              print '{0}\t1'.format(day)
  26. Process your data

      #!/usr/bin/python
      # Reducer: sum the "1"s per date key coming from the sorted mapper output.
      import sys

      (counter, datekey) = (0, '')
      for line in sys.stdin:
          line = line.strip().split('\t')
          if datekey != line[0]:
              if datekey:
                  print '{0}\t{1}'.format(datekey, counter)
              datekey = line[0]
              counter = 1
          else:
              counter += 1
      print '{0}\t{1}'.format(datekey, counter)
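
      Because Hadoop Streaming only pipes lines through stdin/stdout, map.py and reduce.py can be smoke-tested locally before submitting the job below. A small sketch - sample_tweets.json stands in for a hypothetical local sample of the Flume output:

      #!/usr/bin/python
      # Run the mapper and reducer locally over a small sample, emulating
      # the shuffle phase with a plain sort.
      import subprocess

      pipeline = 'cat sample_tweets.json | ./map.py | sort | ./reduce.py'
      print subprocess.check_output(pipeline, shell=True)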
  27. Process your data

      yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -files ./map.py,./reduce.py \
        -mapper ./map.py \
        -reducer ./reduce.py \
        -input /tweets/2014/04/*/*/* \
        -input /tweets/2014/05/*/*/* \
        -output /tweet_keyword
  28. Process your data

      (….)
      2014-04-24  864
      2014-04-25  1121
      2014-04-26  593
      2014-04-27  649
      2014-04-28  1084
      2014-04-29  1575
      2014-04-30  1170
      2014-05-01  1164
      2014-05-02  1175
      2014-05-03  779
      2014-05-04  471
      (….)

  29. Process your data
  30. Recommendations - Which product will a client want next? We've got historical user interactions with items.
  31. Simple Example - Let's just do Mahout - it's easy!

      > apt-get install mahout
      > cat simple_example.csv
      1,101
      1,102
      1,103
      2,101
      > hdfs dfs -put simple_example.csv
      > mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -b \
          -Dmapred.input.dir=/mahout/input/wikilinks/simple_example.csv \
          -Dmapred.output.dir=/mahout/output/wikilinks/simple_example \
          -Dmapred.job.queue.name=atmosphere_prod
  32. Simple Example - Tadadam!

      > hdfs dfs -text /mahout/output/wikilinks/simple_example/part-r-00000.snappy
      1 [105:1.0,104:1.0]
      2 [106:1.0,105:1.0]
      3 [103:1.0,102:1.0]
      4 [105:1.0,102:1.0]
      5 [107:1.0,106:1.0]
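
      Each output line maps a user ID to recommended item IDs with scores. A small Python 2 sketch of parsing that text output (pipe the hdfs dfs -text output from above into it; the script itself is hypothetical):

      #!/usr/bin/python
      # Parse Mahout's textual recommenditembased output: "user<TAB>[item:score,item:score,...]".
      import sys

      for line in sys.stdin:
          user, recs = line.strip().split('\t')
          recs = recs.strip('[]')
          if not recs:
              continue
          for pair in recs.split(','):
              item, score = pair.split(':')
              print user, item, score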
  33. Wiki Case - We've got links between Wikipedia articles, and want to propose new links between articles.

      "Wikipedia (i/ˌwɪkɨˈpiːdiə/ or i/ˌwɪkiˈpiːdiə/ WIK-i-PEE-dee-ə) is a collaboratively edited, multilingual, free Internet encyclopedia that is supported by the non-profit Wikimedia Foundation. Volunteers worldwide collaboratively write Wikipedia's 30 million articles in 287 languages, including over 4.5 million in the English Wikipedia. Anyone who can access"

  34. Wiki Case
  35. Wiki Case - http://users.on.net/%7Ehenry/pagerank/links-simple-sorted.zip

      #!/usr/bin/awk -f
      # Turn each "source: target1 target2 ..." line into "source,target" pairs.
      BEGIN { OFS=","; }
      {
        gsub(":","",$1);
        for (i=2;i<=NF;i++) {
          print $1,$i
        }
      }
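
      The same conversion can also be written in Python for readers less at home with awk; a sketch equivalent to the mapper above:

      #!/usr/bin/python
      # Equivalent of mapper.awk: "12: 34 56" becomes "12,34" and "12,56".
      import sys

      for line in sys.stdin:
          fields = line.strip().split()
          if not fields:
              continue
          source = fields[0].rstrip(':')
          for target in fields[1:]:
              print '{0},{1}'.format(source, target)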
  36. Wiki Case

      yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -Dmapreduce.job.max.split.locations=24 \
        -Dmapreduce.job.queuename=hadoop_prod \
        -Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
        -Dmapred.text.key.comparator.options=-n \
        -Dmapred.output.compress=false \
        -files ./mahout/mapper.awk \
        -mapper ./mapper.awk \
        -input /mahout/input/wikilinks/links-simple-sorted.txt \
        -output /mahout/output/wikilinks/fixedinput
  37. Wiki Case - The Mahout library computed the similarity matrix and gave recommendations for 824 articles. What's important, we didn't gather any knowledge a priori and just ran the algorithms out of the box.
  38. Wiki Case - Recommendations for the article Acadèmia_Valenciana_de_la_Llengua:
      - FIFA
      - Valencia
      - October_1 (calendar)
      - Prehistoric_Iberia (link appeared recently)
      - Ceuta (Spanish city on the north coast of Africa)
      - Roussillon (part of France by the border with Spain)
      - Sweden
      - Turís (municipality in the Valencian Community)
      - Vulgar_Latin (language article)
      - Western_Italo-Western_languages (language article)
      - Àngel_Guimerà (Spanish writer)
  39. Wiki Case
  40. Tweets - Let's find groups of: tags, users
  41. Tweets - Our data is not random - We've picked specific keywords - We'll do the analysis in two orthogonal directions
  42. Tweets

      {
        "filter_level":"medium",
        "contributors":null,
        "text":"PROMOCIÓN MES DE MAYO. con ...",
        "geo":null,
        "retweeted":false,
        "lang":"es",
        "entities":{
          "urls":[
            {
              "expanded_url":"http://www.agmuriel.com",
              "indices":[ 69, 91 ],
              "display_url":"agmuriel.com/#!-/c1gz",
              "url":"http://t.co/APpPjRRTXn"
            }
          ]
        }
      (…)
  43. Tweets

      #!/usr/bin/python
      # Mapper: emit "hashtag<TAB>tweet text" for every hashtag of every English tweet.
      import json, sys

      for line in sys.stdin:
          line = line.strip()
          if '"lang":"en"' in line:
              tweet = json.loads(line)
              try:
                  text = tweet['text'].lower().strip()
                  if text:
                      tags = tweet["entities"]["hashtags"]
                      for tag in tags:
                          print tag["text"] + '\t' + text
              except KeyError:
                  continue

      #!/usr/bin/python
      # Reducer: concatenate all tweet texts sharing the same hashtag into one line.
      import sys

      (lastKey, text) = (None, "")
      for line in sys.stdin:
          (key, value) = line.strip().split('\t', 1)
          if lastKey and lastKey != key:
              print lastKey + '\t' + text
              (lastKey, text) = (key, value)
          elif lastKey is None:
              (lastKey, text) = (key, value)
          else:
              (lastKey, text) = (key, text + " " + value)
      if lastKey:
          print lastKey + '\t' + text
  44. Tweets - Get a SequenceFile with the proper mapping

      yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -Dmapreduce.job.queuename=atmosphere_time \
        -Dmapred.output.compress=false \
        -Dmapreduce.job.max.split.locations=24 \
        -Dmapred.reduce.tasks=20 \
        -files ~/mahout/twitter_map.py,~/mahout/twitter_reduce.py \
        -mapper ./twitter_map.py \
        -reducer ./twitter_reduce.py \
        -input /project/atmosphere/tweets/2014/04/*/* \
        -output /project/atmosphere/tweets/output \
        -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat
  45. Tweets - Calculate the vector representation of the text

      mahout seq2sparse -i /project/atmosphere/tweets/output \
        -o /project/atmosphere/tweets/vectorized \
        -ow -chunk 200 -wt tfidf -s 5 -md 5 -x 90 -ng 2 -ml 50 -seq -n 2

      {10:0.6292275202550768,14:0.7772211575566166}
      {10:0.6292275202550768,14:0.7772211575566166}
      {3:0.37796447439954967,14:0.37796447439954967,19:0.654653676423271,22:0.534522474858859}
      {17:1.0}
      {3:0.37796447439954967,14:0.37796447439954967,19:0.654653676423271,22:0.534522474858859}
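
      The -wt tfidf option asks seq2sparse to weight terms by tf-idf instead of raw counts. A toy Python 2 illustration of the standard tf-idf formula - the numbers are invented, and Mahout's exact smoothing and normalization may differ:

      #!/usr/bin/python
      # Toy tf-idf: the weight grows with the term frequency in a document and
      # shrinks for terms that appear in many documents. All numbers are made up.
      import math

      tf = 3           # occurrences of the term in this tweet-document
      num_docs = 1000  # documents in the corpus
      doc_freq = 50    # documents containing the term

      tfidf = tf * math.log(float(num_docs) / doc_freq)
      print tfidf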
  46. Tweets - It's time to begin clustering - Let's find 100 clusters

      mahout kmeans -i /tweets_5/vectorized/tfidf-vectors \
        -c /tweets_5/kmeans/initial-clusters \
        -o /tweets_5/kmeans/output-clusters \
        -cd 1.0 -k 100 -x 10 -cl -ow \
        -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
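
      Under the hood, k-means alternates between assigning each vector to its nearest centroid and recomputing the centroids; Mahout runs that loop as MapReduce jobs over the tf-idf vectors. A toy single-machine sketch of the same loop, using random 2-D points and the squared Euclidean distance from the command above:

      #!/usr/bin/python
      # Toy k-means: assign each point to its nearest centroid, then recompute
      # the centroids, for a fixed number of iterations.
      import random

      random.seed(0)
      points = [(random.random(), random.random()) for _ in range(200)]
      centroids = random.sample(points, 3)

      def sq_dist(a, b):
          return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

      for _ in range(10):  # like -x 10 above, cap the number of iterations
          clusters = [[] for _ in centroids]
          for p in points:
              nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
              clusters[nearest].append(p)
          new_centroids = []
          for cluster, old in zip(clusters, centroids):
              if cluster:
                  new_centroids.append((sum(x for x, _ in cluster) / len(cluster),
                                        sum(y for _, y in cluster) / len(cluster)))
              else:
                  new_centroids.append(old)  # keep the centroid of an empty cluster
          centroids = new_centroids

      print centroids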
  47. Tweets - A glance at the results: BURN, OPEN, LEATHER, FAT, SOFTWARE, WALLET, WEIGHTLOSS, LINUX, MAN, FITNESS, UBUNTU, ZUMBA, OPENSUSE, PATCHING
  48. Tweets - It was easy because tags are strongly dependent (co-occurrence).
  49. Tweets - Bigger challenge - user clustering: LINUX, UBUNTU, WINDOWS, OS, PATCH, MAC, HACKED, MICROSOFT, FREE, CSRRACING, WON, RACEYOURFRIENDS, ANDROID, CSRCLASSIC
  50. Tweets - Bigger challenge - user clustering - Results show that the dataset is strongly skewed towards mobile and games - The dataset wasn't random - we subscribed to specific keywords - The OS result is great!
  51. Tweets - HADOOP WORLD

      "run predictive machine learning algorithms on hadoop without even knowing mapreduce.: data scientists are very... http://t.co/gdmqm5g1ar"
      "rt @mapr: google cloud storage connector for #hadoop: quick start guide now avail http://t.co/17hxtvdlir #bigdata"
  52. Tweets - HADOOP WORLD - Cloudera wants to do big data in real time. Hortonworks wants to displace Cloudera through research.
  53. Visualize data

      add jar hive-serdes-1.0-SNAPSHOT.jar;
      CREATE TABLE tw_data_201404
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\012'
      STORED AS TEXTFILE
      LOCATION '/tweets/tw_data_201404'
      AS SELECT v_date, LOWER(hashtags.text), lang, COUNT(*) AS total_count
      FROM logs.tweets LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
      WHERE v_date LIKE '2014-04-%'
      GROUP BY v_date, LOWER(hashtags.text), lang;
  54. Visualize data

      add jar elasticsearch-hadoop-hive-2.0.0.RC1.jar;
      CREATE EXTERNAL TABLE es_export (
        v_date string,
        tag string,
        lang string,
        total_count int,
        info string
      )
      STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
      TBLPROPERTIES (
        'es.resource' = 'trends/log',
        'es.index.auto.create' = 'true'
      );
  55. Visualize data

      INSERT OVERWRITE TABLE es_export
      SELECT DISTINCT may.v_date, may.tag, may.lang, may.total_count, 'nt'
      FROM tw_data_201405 may
      LEFT OUTER JOIN tw_data_201404 april ON april.tag = may.tag
      WHERE april.tag IS NULL
      AND may.total_count > 1;
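
      Once the rows land in Elasticsearch they can be charted in Kibana, or inspected directly through the search API. A hedged Python 2 sketch querying the trends/log index created above - the Elasticsearch address is a placeholder assumption:

      #!/usr/bin/python
      # Query the trends/log index via Elasticsearch's _search endpoint.
      # The Elasticsearch host/port are assumptions - adjust to your setup.
      import json
      import urllib2

      url = 'http://elasticsearch.example.com:9200/trends/log/_search?q=lang:en&size=5'
      reply = json.load(urllib2.urlopen(url))
      for hit in reply['hits']['hits']:
          src = hit['_source']
          print src['v_date'], src['tag'], src['total_count']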
  56. Visualize data
  57. Visualize data - Tag: eurovisiontve
  58. Thank you! Questions?
