Nowadays we produce a huge volume of information, but unfortunately at most 12% of it is ever analyzed.
That is why we should dive into our data lake and pull out the Holy Grail - the knowledge. But Big Data means big problems.
So, challenge accepted!
The perfect solution for achieving this goal is Hadoop. It is a 'data operating system' which allows us to process large volumes of any data in a distributed way.
Together, we will take a phenomenal journey around the Hadoop world.
First stop: operations basics.
Second stop: a short tour around the Hadoop ecosystem.
At the end of our journey, we will walk through several examples that show you the real power of Hadoop as your data platform.
Arkadiusz Osinski - Works at Allegro Group as a system administrator. From the beginning he has been involved in building and maintaining the Hadoop infrastructure within Allegro Group. Previously he was responsible for maintaining large-scale database systems. Passionate about new technologies and cycling.
Robert Mroczkowski - In 2006 he graduated with a master's degree in Computer Science from Nicolaus Copernicus University, and in 2007 with a bachelor's degree in Applied Informatics from the same university. In the years 2006-2011 he was a PhD student in Computer Science; his research field was computer science applied to bioinformatics. In 2012 he started to work as a Unix system administrator at Allegro Group, where he gained experience in the Hadoop world building and maintaining a cluster for GA. Every day he works with modern high-performance, highly available technologies, centrally managed in a cloud environment.
24. Process your data
- Hadoop Streaming!
- No need to write code in Java
- You can use Python, Perl or Awk (see the sketch below)
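Streaming speaks a very simple protocol: the mapper and the reducer are plain executables that read records from stdin and write tab-separated key/value lines to stdout. A minimal sketch (a toy word-count mapper, not from the talk, just to show the contract before the real mapper on the next slide):

#!/usr/bin/python
# Toy streaming mapper: emit one "word\t1" line per word on stdout.
# Hadoop sorts these lines by key before they reach the reducer.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print word + '\t1'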
25. Process your data
#!/usr/bin/python
# Mapper: emit "date<TAB>1" for every tweet whose text mentions the keyword.
import sys
import json
import datetime as dt

keyword = 'hadoop'
for line in sys.stdin:
    data = json.loads(line.strip())
    if keyword in data['text'].lower():
        # Twitter's created_at format, e.g. "Thu Apr 24 09:15:02 +0000 2014"
        day = dt.datetime.strptime(data['created_at'],
                                   '%a %b %d %H:%M:%S +0000 %Y').strftime('%Y-%m-%d')
        print '{0}\t1'.format(day)
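To make the mapper's contract concrete, here is a hypothetical input record and the line it would emit (the tweet JSON is invented for illustration):

# Hypothetical tweet, trimmed to the two fields the mapper reads:
line = '{"created_at": "Thu Apr 24 09:15:02 +0000 2014", "text": "Learning Hadoop Streaming"}'
# 'hadoop' appears in the lower-cased text, so the mapper prints:
# 2014-04-24\t1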
26. Process your data
#!/usr/bin/python
# Reducer: input arrives sorted by key, so counting runs of equal dates works.
import sys

(counter, datekey) = (0, '')
for line in sys.stdin:
    fields = line.strip().split('\t')
    if datekey != fields[0]:
        if datekey:
            print '{0}\t{1}'.format(datekey, counter)
        datekey = fields[0]
        counter = 1
    else:
        counter += 1
# flush the last date after the input ends
print '{0}\t{1}'.format(datekey, counter)
27. Process your data
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -files ./map.py,./reduce.py \
  -mapper ./map.py \
  -reducer ./reduce.py \
  -input /tweets/2014/04/*/*/* \
  -input /tweets/2014/05/*/*/* \
  -output /tweet_keyword
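Once the job finishes, the reducers' part files land under the -output directory; a quick way to eyeball them (assuming the hdfs CLI is on PATH) is:

#!/usr/bin/python
# Peek at the job output in HDFS; the path matches the -output flag above.
import subprocess

out = subprocess.check_output(['hdfs', 'dfs', '-cat', '/tweet_keyword/part-*'])
print out[:500]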
28. Process your data
(...)
2014-04-24 864
2014-04-25 1121
2014-04-26 593
2014-04-27 649
2014-04-28 1084
2014-04-29 1575
2014-04-30 1170
2014-05-01 1164
2014-05-02 1175
2014-05-03 779
2014-05-04 471
(...)
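These per-day counts are easy to turn into a trend chart. A minimal sketch, not from the talk, assuming the output was copied to a local file counts.tsv:

#!/usr/bin/python
# Plot the daily keyword counts; counts.tsv is an assumed local copy
# of the job output ("YYYY-MM-DD\tcount" per line).
import matplotlib.pyplot as plt

dates, counts = [], []
for row in open('counts.tsv'):
    d, c = row.strip().split('\t')
    dates.append(d)
    counts.append(int(c))

plt.plot(range(len(dates)), counts)
plt.xticks(range(len(dates)), dates, rotation=45)
plt.ylabel('tweets mentioning "hadoop"')
plt.tight_layout()
plt.savefig('keyword_trend.png')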
34. Wiki Case
We've got links between Wikipedia articles, and we want to propose new links between articles.
"Wikipedia (/ˌwɪkɨˈpiːdiə/ or /ˌwɪkiˈpiːdiə/ WIK-i-PEE-dee-ə) is a collaboratively edited, multilingual, free Internet encyclopedia that is supported by the non-profit Wikimedia Foundation. Volunteers worldwide collaboratively write Wikipedia's 30 million articles in 287 languages, including over 4.5 million in the English Wikipedia. Anyone who can access..."
37. Wiki Case
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -Dmapreduce.job.max.split.locations=24 \
  -Dmapreduce.job.queuename=hadoop_prod \
  -Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
  -Dmapred.text.key.comparator.options=-n \
  -Dmapred.output.compress=false \
  -files ./mahout/mapper.awk \
  -mapper ./mapper.awk \
  -input /mahout/input/wikilinks/links-simple-sorted.txt \
  -output /mahout/output/wikilinks/fixedinput
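mapper.awk itself isn't shown in the deck; a rough Python stand-in, assuming links-simple-sorted.txt's "from: to1 to2 ..." layout and Mahout's "user,item" CSV input (source article as user, linked article as item), might look like:

#!/usr/bin/python
# Hypothetical equivalent of mapper.awk: turn "from: to1 to2 ..."
# into one "from,to" pair per link for Mahout's recommender input.
import sys

for line in sys.stdin:
    src, _, targets = line.partition(':')
    for target in targets.split():
        print '{0},{1}'.format(src.strip(), target)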
38. Wiki Case
The Mahout library computed the similarity matrix and gave recommendations for 824 articles.
What's important, we didn't gather any knowledge a priori and just ran the algorithms out of the box.
39. Wiki Case
Proposed links for Acadèmia_Valenciana_de_la_Llengua:
- FIFA
- Valencia
- October_1
- Calendar
- Prehistoric_Iberia (link appeared recently)
- Ceuta (Spanish city on the north coast of Africa)
- Roussillon (part of France by the border with Spain)
- Sweden
- Turís (municipality in the Valencian Community)
- Vulgar_Latin (language article)
- Western_Italo-Western_languages (language article)
- Àngel_Guimerà (Spanish writer)
44. Tweets
#!/usr/bin/python
# Mapper: for English tweets, emit "hashtag<TAB>tweet text" per hashtag.
import json, sys

for line in sys.stdin:
    line = line.strip()
    if '"lang":"en"' in line:
        tweet = json.loads(line)
        try:
            text = tweet['text'].lower().strip()
            if text:
                tags = tweet['entities']['hashtags']
                for tag in tags:
                    print tag['text'] + '\t' + text
        except KeyError:
            continue
#!/usr/bin/python
# Reducer: concatenate all tweet texts sharing a hashtag into one document.
import sys

(lastKey, text) = (None, '')
for line in sys.stdin:
    (key, value) = line.strip().split('\t')
    if key == lastKey:
        text = text + ' ' + value
    else:
        if lastKey:
            print lastKey + '\t' + text
        (lastKey, text) = (key, value)
# flush the final hashtag after the input ends
if lastKey:
    print lastKey + '\t' + text
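The net effect of this pair is one "document" per hashtag: the mapper fans each tweet out to its hashtags, and the reducer glues together every text that shares a hashtag. The same fold in plain Python, on invented sample pairs:

#!/usr/bin/python
# Illustration only: the reducer's fold over already-sorted pairs.
from itertools import groupby

pairs = [('hadoop', 'learning mapreduce today'),
         ('hadoop', 'yarn is neat'),
         ('linux', 'kernel patch day')]
for tag, group in groupby(pairs, key=lambda kv: kv[0]):
    print tag + '\t' + ' '.join(v for _, v in group)
# hadoop<TAB>learning mapreduce today yarn is neat
# linux<TAB>kernel patch day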
45. Tweets
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -Dmapreduce.job.queuename=atmosphere_time \
  -Dmapred.output.compress=false \
  -Dmapreduce.job.max.split.locations=24 \
  -Dmapred.reduce.tasks=20 \
  -files ~/mahout/twitter_map.py,~/mahout/twitter_reduce.py \
  -mapper ./twitter_map.py \
  -reducer ./twitter_reduce.py \
  -input /project/atmosphere/tweets/2014/04/*/* \
  -output /project/atmosphere/tweets/output \
  -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat
Get a SequenceFile with the proper mapping.
50. Tweets
Bigger challenge – user clustering
[Word cloud of top cluster terms: LINUX, UBUNTU, WINDOWS, OS, PATCH, MAC, HACKED, MICROSOFT, FREE, CSRRACING, WON, RACEYOURFRIENDS, ANDROID, CSRCLASSIC]
51. Tweets
Bigger challenge – user clustering
• Results show that the dataset is strongly skewed towards mobile and games
• The dataset wasn't random – we subscribed to specific keywords
• The OS result is great!
52. Tweets
HADOOP WORLD
run predictive machine learning algorithms on hadoop without even knowing mapreduce.: data scientists are very... http://t.co/gdmqm5g1ar
rt @mapr: google cloud storage connector for #hadoop: quick start guide now avail http://t.co/17hxtvdlir #bigdata
54. Visualize data
add jar hive-serdes-1.0-SNAPSHOT.jar;

CREATE TABLE tw_data_201404
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\012'
STORED AS TEXTFILE LOCATION '/tweets/tw_data_201404' AS
SELECT
  v_date,
  LOWER(hashtags.text) AS tag,
  lang,
  COUNT(*) AS total_count
FROM logs.tweets LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
WHERE v_date LIKE '2014-04-%'
GROUP BY v_date, LOWER(hashtags.text), lang;
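LATERAL VIEW EXPLODE is doing the heavy lifting here: each tweet row fans out into one row per hashtag, which is what makes the GROUP BY possible. A rough Python analogue on an invented row:

#!/usr/bin/python
# What EXPLODE(entities.hashtags) does to a single tweet row:
tweet = {'v_date': '2014-04-24', 'lang': 'en',
         'entities': {'hashtags': [{'text': 'Hadoop'}, {'text': 'BigData'}]}}
for tag in tweet['entities']['hashtags']:
    print tweet['v_date'], tag['text'].lower(), tweet['lang']
# 2014-04-24 hadoop en
# 2014-04-24 bigdata en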
55. Visualize data
add jar elasticsearch-hadoop-hive-2.0.0.RC1.jar;

CREATE EXTERNAL TABLE es_export (
  v_date string,
  tag string,
  lang string,
  total_count int,
  info string )
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource' = 'trends/log',
  'es.index.auto.create' = 'true');
56. Visualize data
INSERT OVERWRITE TABLE es_export
SELECT DISTINCT may.v_date, may.tag, may.lang, may.total_count, 'nt'
FROM tw_data_201405 may
LEFT OUTER JOIN tw_data_201404 april
  ON april.tag = may.tag
WHERE april.tag IS NULL
  AND may.total_count > 1;
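After the INSERT, the hashtags that are new in May are queryable straight from Elasticsearch. A hedged sketch using the plain REST API; the host and port are assumptions:

#!/usr/bin/python
# Query the 'trends/log' index the Hive export wrote to.
# localhost:9200 is an assumed Elasticsearch endpoint.
import json, urllib2

query = json.dumps({'query': {'match_all': {}}, 'size': 5})
resp = urllib2.urlopen('http://localhost:9200/trends/log/_search', query)
for hit in json.load(resp)['hits']['hits']:
    print hit['_source']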