Hadoop: challenge accepted!
Arkadiusz Osiński
arkadiusz.osinski@allegrogroup.com
Robert Mroczkowski
robert.mroczkowski@allegrogroup.com
ToC
- Hadoop basics
- Gather data
- Process your data
- Learn from your data
- Visualize your data
BigData
- Petabytes of (un)structured data
- 12% of data is analyzed
- a lot of data is not gathered
- how to gain knowledge?
Power	
Big  Data	
Data  Lake	
Scalability	
Petabytes	
Mapreduce	
Commodity
HDFS
- Storage layer
- Distributed file system
- Commodity hardware
- Scalability
- JBOD
- Access control
- No SPOF
YARN
- Distributed computing layer
- Computation runs where the data lives
- MapReduce…
- and other applications
- Resource management
Let's squeeze our data to get some juice!
Gather  data	
flume-twitter.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
flume-twitter.sources.Twitter.channels = MemChannel
flume-twitter.sources.Twitter.consumerKey = (…)
flume-twitter.sources.Twitter.consumerSecret = (…)
flume-twitter.sources.Twitter.accessToken = (…)
flume-twitter.sources.Twitter.accessTokenSecret = (…)
flume-twitter.sources.Twitter.keywords = hadoop, big data, nosql
Process your data
- Hadoop Streaming!
- No need to write code in Java
- You can use Python, Perl or Awk
Process  your  data	
#!/usr/bin/python
# Mapper: emit "<date>\t1" for every tweet whose text contains the keyword.
import sys
import json
import datetime as dt

keyword = 'hadoop'

for line in sys.stdin:
    data = json.loads(line.strip())
    if keyword in data['text'].lower():
        # created_at looks like "Thu Apr 24 10:15:00 +0000 2014".
        day = dt.datetime.strptime(
            data['created_at'], '%a %b %d %H:%M:%S +0000 %Y'
        ).strftime('%Y-%m-%d')
        print('{0}\t1'.format(day))
  	
  
Process  your  data	
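#!/usr/bin/python
# Reducer: input arrives sorted by key, so count the lines in each run of equal dates.
import sys

(counter, datekey) = (0, '')

for line in sys.stdin:
    fields = line.strip().split("\t")
    if datekey != fields[0]:
        if datekey:
            print("{0}\t{1}".format(datekey, counter))
        datekey = fields[0]
        counter = 1
    else:
        counter += 1

# Flush the last date (skipped when there was no input at all).
if datekey:
    print("{0}\t{1}".format(datekey, counter))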
  	
  
Process  your  data	
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files ./map.py,./reduce.py \
    -mapper ./map.py \
    -reducer ./reduce.py \
    -input /tweets/2014/04/*/*/* \
    -input /tweets/2014/05/*/*/* \
    -output /tweet_keyword
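The map, sort and reduce steps of this job can be tried out locally before submitting anything to the cluster. The following is only a sketch in Python 3: the sample tweets are made up, the function names `map_lines` and `reduce_pairs` are hypothetical, and `sorted()` stands in for the shuffle/sort that Hadoop Streaming performs between the mapper and the reducer.

```python
import json
from datetime import datetime
from itertools import groupby

KEYWORD = "hadoop"
DATE_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"


def map_lines(lines):
    """Yield (date, 1) for each tweet whose text mentions KEYWORD."""
    for line in lines:
        tweet = json.loads(line)
        if KEYWORD in tweet["text"].lower():
            day = datetime.strptime(tweet["created_at"], DATE_FORMAT)
            yield day.strftime("%Y-%m-%d"), 1


def reduce_pairs(pairs):
    """Sum the values per key; the input must already be sorted by key."""
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(value for _, value in group)


# Made-up sample tweets, for illustration only.
sample = [
    json.dumps({"text": "Hadoop rocks", "created_at": "Thu Apr 24 10:00:00 +0000 2014"}),
    json.dumps({"text": "nothing to see", "created_at": "Thu Apr 24 11:00:00 +0000 2014"}),
    json.dumps({"text": "more #hadoop", "created_at": "Fri Apr 25 09:30:00 +0000 2014"}),
    json.dumps({"text": "HADOOP again", "created_at": "Thu Apr 24 12:00:00 +0000 2014"}),
]

# sorted() plays the role of the shuffle/sort phase between map and reduce.
counts = list(reduce_pairs(sorted(map_lines(sample))))
for day, total in counts:
    print("{0}\t{1}".format(day, total))
```

Because the reducer only relies on equal keys being adjacent, the same logic works whether the sorting is done by `sorted()` here or by the framework on the cluster.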
Process  your  data	
(….)
2014-04-24 864
2014-04-25 1121
2014-04-26 593
2014-04-27 649
2014-04-28 1084
2014-04-29 1575
2014-04-30 1170
2014-05-01 1164
2014-05-02 1175
2014-05-03 779
2014-05-04 471
(….)
Recommendations
Which product will a client want?
We have the history of users' interactions with items.
Simple  Example	
Let's just run Mahout: it's easy!
> apt-get install mahout
> cat simple_example.csv
1,101
1,102
1,103
2,101
> hdfs dfs -put simple_example.csv
> mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -b \
    -Dmapred.input.dir=/mahout/input/wikilinks/simple_example.csv \
    -Dmapred.output.dir=/mahout/output/wikilinks/simple_example \
    -Dmapred.job.queue.name=atmosphere_prod
Simple  Example	
Tadadam!	
> hdfs dfs -text /mahout/output/wikilinks/simple_example/part-r-00000.snappy
1 [105:1.0,104:1.0]
2 [106:1.0,105:1.0]
3 [103:1.0,102:1.0]
4 [105:1.0,102:1.0]
5 [107:1.0,106:1.0]
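To get a feel for what an item-based recommender does with (user, item) pairs, here is a deliberately simplified Python sketch. It scores items by plain co-occurrence counts, which is an assumption made for brevity and is not Mahout's log-likelihood similarity; the function name `recommend` and the `top_n` parameter are hypothetical. The input is the four rows printed by `cat` above, so the result will not match the Mahout output, which evidently came from a larger file.

```python
from collections import defaultdict
from itertools import permutations


def recommend(pairs, top_n=2):
    """Recommend unseen items by plain co-occurrence counts."""
    items_by_user = defaultdict(set)
    for user, item in pairs:
        items_by_user[user].add(item)

    # cooccurrence[a][b] = number of users who interacted with both a and b.
    cooccurrence = defaultdict(lambda: defaultdict(int))
    for items in items_by_user.values():
        for a, b in permutations(items, 2):
            cooccurrence[a][b] += 1

    result = {}
    for user, items in items_by_user.items():
        scores = defaultdict(int)
        for seen in items:
            for other, count in cooccurrence[seen].items():
                if other not in items:
                    scores[other] += count
        # Highest score first; ties broken by item id for a stable order.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[user] = ranked[:top_n]
    return result


# The four (user, item) rows shown on the slide.
pairs = [(1, 101), (1, 102), (1, 103), (2, 101)]
recommendations = recommend(pairs)
print(recommendations)
```

User 2 has only seen item 101, so the sketch proposes the items that user 1 combined with it; user 1 has already seen everything and gets nothing.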
Wiki  Case	
We have the links between Wikipedia articles and want to propose new links between articles.
"Wikipedia (/ˌwɪkɨˈpiːdiə/ or /ˌwɪkiˈpiːdiə/ WIK-i-PEE-dee-ə) is a collaboratively edited, multilingual, free Internet encyclopedia that is supported by the non-profit Wikimedia Foundation. Volunteers worldwide collaboratively write Wikipedia's 30 million articles in 287 languages, including over 4.5 million in the English Wikipedia. Anyone who can access"
Wiki  Case	
http://users.on.net/%7Ehenry/pagerank/links-simple-sorted.zip
#!/usr/bin/awk -f
# Turn "source: target1 target2 ..." into one "source,target" pair per line.
BEGIN {
    OFS = ",";
}
{
    gsub(":", "", $1);
    for (i = 2; i <= NF; i++) {
        print $1, $i
    }
}
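For readers less familiar with awk, the same transformation can be sketched in Python. The function name `expand_links` and the sample line are hypothetical; the input layout ("source: target target ...") is inferred from what the awk script does with its fields.

```python
def expand_links(line):
    """Turn 'src: dst1 dst2 ...' into ['src,dst1', 'src,dst2', ...]."""
    fields = line.split()
    if not fields:
        return []
    source = fields[0].replace(":", "")
    return ["{0},{1}".format(source, target) for target in fields[1:]]


# Hypothetical input line in the "source: targets" layout.
pairs = expand_links("1: 2 3 4")
print("\n".join(pairs))
```

Each output line is a (source, target) pair in the same two-column CSV layout as the simple example fed to the recommender.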
Wiki  Case	
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -Dmapreduce.job.max.split.locations=24 \
    -Dmapreduce.job.queuename=hadoop_prod \
    -Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -Dmapred.text.key.comparator.options=-n \
    -Dmapred.output.compress=false \
    -files ./mahout/mapper.awk \
    -mapper ./mapper.awk \
    -input /mahout/input/wikilinks/links-simple-sorted.txt \
    -output /mahout/output/wikilinks/fixedinput
Wiki  Case	
The Mahout library computed the similarity matrix and produced recommendations for 824 articles.
Importantly, we gathered no a priori knowledge and simply ran the algorithms out of the box.
Wiki  Case	
Acadèmia_Valenciana_de_la_Llengua
FIFA                               Valencia
October_1                          Calendar
Prehistoric_Iberia                 Link appeared recently
Ceuta                              Spanish city on the north coast of Africa
Roussillon                         Part of France by the border with Spain
Sweden                             :)
Turís                              Municipality in the Valencian Community
Vulgar_Latin                       Language article
Western_Italo-Western_languages    Language article
Àngel_Guimerà                      Spanish writer
Tweets	
Let's find groups of:
• tags
• users
Tweets	
• Our data is not random
• We've picked specific keywords
• We'll do the analysis in two orthogonal directions
Tweets	
{
    "filter_level":"medium",
    "contributors":null,
    "text":"PROMOCIÓN MES DE MAYO. con ...",
    "geo":null,
    "retweeted":false,
    "lang":"es",
    "entities":{
        "urls":[
            { "expanded_url":"http://www.agmuriel.com",
              "indices":[ 69, 91 ],
              "display_url":"agmuriel.com/#!-/c1gz",
              "url":"http://t.co/APpPjRRTXn" } ]
    }
    (…)
 
Tweets	
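#!/usr/bin/python
# twitter_map.py: for each English tweet, emit "<hashtag>\t<tweet text>" per hashtag.
import json, sys

for line in sys.stdin:
    line = line.strip()
    # Cheap string pre-filter before paying for JSON parsing.
    if '"lang":"en"' in line:
        tweet = json.loads(line)
        try:
            text = tweet['text'].lower().strip()
            if text:
                tags = tweet["entities"]["hashtags"]
                for tag in tags:
                    print(tag["text"] + "\t" + text)
        except KeyError:
            continue

#!/usr/bin/python
# twitter_reduce.py: concatenate the texts of all tweets that share a hashtag.
import sys

(lastKey, text) = (None, "")

for line in sys.stdin:
    (key, value) = line.strip().split("\t", 1)
    if lastKey and lastKey != key:
        print(lastKey + "\t" + text)
        (lastKey, text) = (key, value)
    else:
        (lastKey, text) = (key, text + " " + value)

# Emit the final hashtag, which the loop itself never flushes.
if lastKey:
    print(lastKey + "\t" + text)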
Tweets	
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -Dmapreduce.job.queuename=atmosphere_time \
    -Dmapred.output.compress=false \
    -Dmapreduce.job.max.split.locations=24 \
    -Dmapred.reduce.tasks=20 \
    -files ~/mahout/twitter_map.py,~/mahout/twitter_reduce.py \
    -mapper ./twitter_map.py \
    -reducer ./twitter_reduce.py \
    -input /project/atmosphere/tweets/2014/04/*/* \
    -output /project/atmosphere/tweets/output \
    -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat
Get  SequenceFile  with  proper  mapping
Tweets	
mahout seq2sparse \
    -i /project/atmosphere/tweets/output \
    -o /project/atmosphere/tweets/vectorized -ow \
    -chunk 200 -wt tfidf -s 5 -md 5 -x 90 -ng 2 -ml 50 -seq -n 2
Calculate  vector  representation  for  text	
{10:0.6292275202550768,14:0.7772211575566166}	
  
{10:0.6292275202550768,14:0.7772211575566166}	
  
{3:0.37796447439954967,14:0.37796447439954967,19:0.654653676423271,22:0.534522474858859}	
  
{17:1.0}	
  
{3:0.37796447439954967,14:0.37796447439954967,19:0.654653676423271,22:0.534522474858859}	
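The vectors above are sparse maps from term index to weight, scaled to unit length (the `-n 2` option). A rough Python sketch of that idea follows. It uses the textbook weighting tf × log(N/df), which is an assumption and may differ from the exact formula Mahout applies; the function name `tfidf_vectors` and the sample documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse {term_index: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    index = {term: i for i, term in enumerate(vocabulary)}

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {
            index[term]: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Drop zero weights (terms present in every document).
        weights = {i: w for i, w in weights.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {i: w / norm for i, w in weights.items()}
        vectors.append(weights)
    return vectors


# Made-up documents standing in for the per-hashtag tweet texts.
docs = ["linux ubuntu patch", "linux windows", "fitness zumba fitness"]
vectors = tfidf_vectors(docs)
for vector in vectors:
    print(vector)
```

Normalizing every vector to length 1 keeps long and short texts comparable when distances between them are computed later.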
  
Tweets	
It's time to start clustering
Let's find 100 clusters
mahout kmeans \
    -i /tweets_5/vectorized/tfidf-vectors \
    -c /tweets_5/kmeans/initial-clusters \
    -o /tweets_5/kmeans/output-clusters \
    -cd 1.0 -k 100 -x 10 -cl -ow \
    -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
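What the k-means job does conceptually can be sketched in a few lines of Python. This is a minimal, single-machine illustration of Lloyd's algorithm with squared Euclidean distance, not Mahout's distributed implementation; the function names, the toy points and the fixed starting centers are assumptions, and `max_iterations` only loosely mirrors the `-x 10` option.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centers, max_iterations=10):
    """Lloyd's algorithm: alternate assignment and center update steps."""
    centers = [list(center) for center in centers]  # do not mutate the caller's list
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: squared_euclidean(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has converged
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centers


# Toy 2-D points with two obvious groups; real input would be TF-IDF vectors.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignment, centers = kmeans(points, centers=[[0.0, 0.0], [10.0, 10.0]])
print(assignment)
print(centers)
```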
Tweets	
Glance  at  results	
BURN          OPEN        LEATHER
FAT           SOFTWARE    WALLET
WEIGHTLOSS    LINUX       MAN
FITNESS       UBUNTU
ZUMBA         OPENSUSE
              PATCHING
Tweets	
This was easy because tags are strongly dependent on one another (co-occurrence).
Tweets	
Bigger  challenge  –  user  clustering	
LINUX	
UBUNTU	
WINDOWS	
OS	
PATCH	
MAC	
HACKED	
MICROSOFT	
FREE	
CSRRACING	
WON	
RACEYOURFRIENDS	
ANDROID	
CSRCLASSIC
Tweets	
Bigger  challenge  –  user  clustering	
• Results show that the dataset is strongly skewed toward mobile and games
• The dataset wasn't random: we subscribed to specific keywords
• The OS result is great!
Tweets	
HADOOP  WORLD	
run predictive machine learning algorithms on hadoop without even knowing mapreduce.: data scientists are very... http://t.co/gdmqm5g1ar
rt @mapr: google cloud storage connector for #hadoop: quick start guide now avail http://t.co/17hxtvdlir #bigdata
Tweets	
HADOOP  WORLD	
Cloudera wants to do big data in real time.
Hortonworks wants to replace Cloudera by research.
Visualize  data	
add jar hive-serdes-1.0-SNAPSHOT.jar;
create table tw_data_201404
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\012'
STORED AS TEXTFILE LOCATION '/tweets/tw_data_201404' AS
SELECT
    v_date,
    -- aliased so that later queries can refer to the column as "tag"
    LOWER(hashtags.text) AS tag,
    lang,
    COUNT(*) AS total_count
FROM logs.tweets LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
WHERE v_date like '2014-04-%'
GROUP BY v_date, LOWER(hashtags.text), lang;
  	
  
Visualize  data	
add jar elasticsearch-hadoop-hive-2.0.0.RC1.jar;
CREATE EXTERNAL TABLE es_export (
v_date string,
tag string,
lang string,
total_count int,
info string )
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
'es.resource' = 'trends/log',
'es.index.auto.create' = 'true') ; 	
  	
  
Visualize  data	
INSERT overwrite TABLE es_export
SELECT distinct may.v_date,may.tag,may.lang,may.total_count,'nt'
FROM tw_data_201405 may
LEFT outer JOIN tw_data_201404 april
ON april.tag = may.tag
WHERE april.tag is null
AND may.total_count>1; 	
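The query above is an anti-join: a LEFT OUTER JOIN followed by `april.tag is null` keeps only the May tags that never appeared in April. The same logic can be sketched in Python as a set lookup. The rows below are invented for illustration, the function name `new_tags` is hypothetical, and the sketch leaves out the DISTINCT and the constant `info` column.

```python
def new_tags(april_rows, may_rows):
    """Return May rows whose tag is absent from April and was seen more than once."""
    april_tags = {row["tag"] for row in april_rows}
    return [
        row for row in may_rows
        if row["tag"] not in april_tags and row["total_count"] > 1
    ]


# Made-up rows in the shape of the tw_data_2014xx tables.
april = [
    {"v_date": "2014-04-30", "tag": "hadoop", "lang": "en", "total_count": 12},
]
may = [
    {"v_date": "2014-05-10", "tag": "hadoop", "lang": "en", "total_count": 10},
    {"v_date": "2014-05-10", "tag": "eurovisiontve", "lang": "es", "total_count": 7},
    {"v_date": "2014-05-11", "tag": "oneoff", "lang": "en", "total_count": 1},
]

fresh = new_tags(april, may)
for row in fresh:
    print(row["v_date"], row["tag"], row["lang"], row["total_count"])
```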
  	
  
Visualize  data	
Tag: eurovisiontve
Thank  you!	
Questions?

 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 
Introduction to Data Analtics with Pandas [PyCon Cz]
Introduction to Data Analtics with Pandas [PyCon Cz]Introduction to Data Analtics with Pandas [PyCon Cz]
Introduction to Data Analtics with Pandas [PyCon Cz]
 
Hadoop & Zing
Hadoop & ZingHadoop & Zing
Hadoop & Zing
 
מיכאל
מיכאלמיכאל
מיכאל
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zing
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADta
 
An R primer for SQL folks
An R primer for SQL folksAn R primer for SQL folks
An R primer for SQL folks
 
Black friday logs - Scaling Elasticsearch
Black friday logs - Scaling ElasticsearchBlack friday logs - Scaling Elasticsearch
Black friday logs - Scaling Elasticsearch
 

Recently uploaded

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 

Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski

  • 1. Hadoop: challenge accepted! Arkadiusz Osiński (arkadiusz.osinski@allegrogroup.com), Robert Mroczkowski (robert.mroczkowski@allegrogroup.com)
  • 2. ToC
      - Hadoop basics
      - Gather data
      - Process your data
      - Learn from your data
      - Visualize your data
  • 6. BigData
      - Petabytes of (un)structured data
      - 12% of data is analyzed
      - a lot of data is not gathered
      - how to gain knowledge?
  • 14. HDFS
      - Storage layer
      - Distributed file system
      - Commodity hardware
      - Scalability
      - JBOD
      - Access control
      - No SPOF
  • 19. YARN
      - Distributed computing layer
      - Computation runs where the data is stored
      - MapReduce…
      - and other applications
      - Resource management
  • 20. Let's squeeze our data to get some juice!
  • 21. Gather data
      flume-twitter.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
      flume-twitter.sources.Twitter.channels = MemChannel
      flume-twitter.sources.Twitter.consumerKey = (…)
      flume-twitter.sources.Twitter.consumerSecret = (…)
      flume-twitter.sources.Twitter.accessToken = (…)
      flume-twitter.sources.Twitter.accessTokenSecret = (…)
      flume-twitter.sources.Twitter.keywords = hadoop, big data, nosql
  • 24. Process your data
      - Hadoop Streaming!
      - No need to write code in Java
      - You can use Python, Perl or Awk
  • 25. Process your data
      #!/usr/bin/python
      # Mapper: emit "<YYYY-MM-DD>\t1" for every tweet whose text contains the keyword.
      import sys
      import json
      import datetime as dt

      keyword = 'hadoop'
      for line in sys.stdin:
          data = json.loads(line.strip())
          if keyword in data['text'].lower():
              # Use a separate name so the module alias `dt` is not overwritten.
              day = dt.datetime.strptime(data['created_at'],
                                         '%a %b %d %H:%M:%S +0000 %Y').strftime('%Y-%m-%d')
              print('{0}\t1'.format(day))
  • 26. Process your data
      #!/usr/bin/python
      # Reducer: input arrives sorted by key, so count the lines of each consecutive date.
      import sys

      (counter, datekey) = (0, '')
      for line in sys.stdin:
          line = line.strip().split("\t")
          if datekey != line[0]:
              if datekey:
                  print("{0}\t{1}".format(datekey, counter))
              datekey = line[0]
              counter = 1
          else:
              counter += 1
      # Flush the last date (skipped when there was no input at all).
      if datekey:
          print("{0}\t{1}".format(datekey, counter))
  • 27. Process your data
      yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
          -files ./map.py,./reduce.py \
          -mapper ./map.py \
          -reducer ./reduce.py \
          -input /tweets/2014/04/*/*/* \
          -input /tweets/2014/05/*/*/* \
          -output /tweet_keyword
  • 28. Process your data
      (…)
      2014-04-24 864
      2014-04-25 1121
      2014-04-26 593
      2014-04-27 649
      2014-04-28 1084
      2014-04-29 1575
      2014-04-30 1170
      2014-05-01 1164
      2014-05-02 1175
      2014-05-03 779
      2014-05-04 471
      (…)
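The map, sort and reduce steps that Hadoop Streaming performs can be rehearsed locally before a job is submitted. The sketch below is only an illustration in Python 3, not the authors' code: `map_tweet`, `reduce_counts` and the sample tweets are hypothetical, and `sorted()` stands in for the shuffle phase that the cluster would normally run.

```python
import json
from datetime import datetime
from itertools import groupby

KEYWORD = "hadoop"
# Timestamp layout of the `created_at` field, as used in the mapper above.
TWITTER_TIME = "%a %b %d %H:%M:%S +0000 %Y"


def map_tweet(line):
    """Yield (date, 1) when the tweet text mentions KEYWORD."""
    tweet = json.loads(line)
    if KEYWORD in tweet["text"].lower():
        day = datetime.strptime(tweet["created_at"], TWITTER_TIME)
        yield day.strftime("%Y-%m-%d"), 1


def reduce_counts(pairs):
    """Sum the values of consecutive pairs that share a key (input must be sorted)."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(value for _, value in group)


# Hypothetical sample input, made up for this illustration.
SAMPLE = [
    json.dumps({"text": "Hadoop rocks", "created_at": "Thu Apr 24 10:00:00 +0000 2014"}),
    json.dumps({"text": "nosql news", "created_at": "Thu Apr 24 11:00:00 +0000 2014"}),
    json.dumps({"text": "more #hadoop", "created_at": "Fri Apr 25 09:30:00 +0000 2014"}),
    json.dumps({"text": "HADOOP again", "created_at": "Thu Apr 24 12:00:00 +0000 2014"}),
]

mapped = [pair for line in SAMPLE for pair in map_tweet(line)]
result = list(reduce_counts(sorted(mapped)))  # sorted() plays the role of the shuffle
for day, count in result:
    print("{0}\t{1}".format(day, count))
```

Keeping the mapper and reducer as plain generator functions makes them easy to test without a cluster; the real job only adds reading from stdin and writing to stdout.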
  • 30. Recommendations
      Which product will a client want?
      We have historical data on users' interactions with items.
  • 32. Simple Example
      Let's just do Mahout - it's easy!
      > apt-get install mahout
      > cat simple_example.csv
      1,101
      1,102
      1,103
      2,101
      > hdfs dfs -put simple_example.csv
      > mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -b \
          -Dmapred.input.dir=/mahout/input/wikilinks/simple_example.csv \
          -Dmapred.output.dir=/mahout/output/wikilinks/simple_example \
          -Dmapred.job.queue.name=atmosphere_prod
  • 33. Simple Example
      Tadadam!
      > hdfs dfs -text /mahout/output/wikilinks/simple_example/part-r-00000.snappy
      1 [105:1.0,104:1.0]
      2 [106:1.0,105:1.0]
      3 [103:1.0,102:1.0]
      4 [105:1.0,102:1.0]
      5 [107:1.0,106:1.0]
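The `SIMILARITY_LOGLIKELIHOOD` option names a log-likelihood ratio measure between items. As a rough sketch of that idea (written from the textbook 2x2 contingency-table formulation, not taken from Mahout's source), the Python 3 code below scores how strongly two items co-occur across users. The function names and the sample interactions are hypothetical, and mapping the ratio to a score with `1 - 1/(1 + llr)` is an assumption made here for illustration.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalised entropy term used by the log-likelihood ratio."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """k11: users with both items, k12 and k21: users with only one, k22: users with neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # clamp tiny negative rounding errors


def similarity(item_a, item_b, users_by_item, total_users):
    """Map the ratio into [0, 1); higher means a stronger association."""
    a, b = users_by_item[item_a], users_by_item[item_b]
    k11, k12, k21 = len(a & b), len(a - b), len(b - a)
    k22 = total_users - k11 - k12 - k21
    llr = log_likelihood_ratio(k11, k12, k21, k22)
    return 1.0 - 1.0 / (1.0 + llr)


# Hypothetical interactions: item id -> set of user ids (not the slide's data).
users_by_item = {101: {1, 2, 3}, 102: {1, 2, 3}, 103: {4}}
print(similarity(101, 102, users_by_item, total_users=6))
print(similarity(101, 103, users_by_item, total_users=6))
```

Items that are always seen together score higher than items that merely exist in the same dataset, which is the behaviour a recommender needs when it only has boolean "user touched item" data.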
  • 34. Wiki Case
      We've got links between Wikipedia articles, and want to propose new links between articles.
      "Wikipedia (/ˌwɪkɨˈpiːdiə/ or /ˌwɪkiˈpiːdiə/ WIK-i-PEE-dee-ə) is a collaboratively edited, multilingual, free Internet encyclopedia that is supported by the non-profit Wikimedia Foundation. Volunteers worldwide collaboratively write Wikipedia's 30 million articles in 287 languages, including over 4.5 million in the English Wikipedia. Anyone who can access"
  • 36. Wiki Case
      http://users.on.net/%7Ehenry/pagerank/links-simple-sorted.zip
      #!/usr/bin/awk -f
      BEGIN { OFS=","; }
      {
          gsub(":","",$1);
          for (i=2;i<=NF;i++) {
              print $1,$i
          }
      }
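For readers less familiar with awk, the same transformation can be sketched in Python 3. This is an illustrative equivalent rather than part of the original job: the function name `expand_links` and the sample line are hypothetical, and the input layout "source: target target ..." is inferred from the awk script, which strips a colon from the first field.

```python
def expand_links(line):
    """Turn 'src: dst1 dst2 ...' into ['src,dst1', 'src,dst2', ...]."""
    fields = line.split()
    if not fields:
        return []
    source = fields[0].replace(":", "")
    return ["{0},{1}".format(source, target) for target in fields[1:]]


# Hypothetical sample line in the "source: target target ..." layout.
for pair in expand_links("1: 101 102 103"):
    print(pair)
```

Each output line is one "user,item" style pair, which is the CSV shape fed to the recommender in the simple example.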
  • 37. Wiki Case
      yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
          -Dmapreduce.job.max.split.locations=24 \
          -Dmapreduce.job.queuename=hadoop_prod \
          -Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
          -Dmapred.text.key.comparator.options=-n \
          -Dmapred.output.compress=false \
          -files ./mahout/mapper.awk \
          -mapper ./mapper.awk \
          -input /mahout/input/wikilinks/links-simple-sorted.txt \
          -output /mahout/output/wikilinks/fixedinput
  • 38. Wiki Case
      The Mahout library computed the similarity matrix and gave recommendations for 824 articles.
      Importantly, we didn't gather any knowledge a priori; we just ran the algorithms out of the box.
  • 39. Wiki Case
      Acadèmia_Valenciana_de_la_Llengua
      - FIFA
      - Valencia
      - October_1 (calendar)
      - Prehistoric_Iberia (link appeared recently)
      - Ceuta (Spanish city on the north coast of Africa)
      - Roussillon (part of France by the border with Spain)
      - Sweden
      - J
      - Turís (municipality in the Valencian Community)
      - Vulgar_Latin (language article)
      - Western_Italo-Western_languages (language article)
      - Àngel_Guimerà (Spanish writer)
  • 41. Tweets
      Let's find groups of:
      - tags
      - users
  • 42. Tweets
      - Our data is not random
      - We've picked specific keywords
      - We'll do analysis in two orthogonal directions
  • 43. Tweets
      {
        "filter_level":"medium",
        "contributors":null,
        "text":"PROMOCIÓN MES DE MAYO. con ...",
        "geo":null,
        "retweeted":false,
        "lang":"es",
        "entities":{
          "urls":[
            {
              "expanded_url":"http://www.agmuriel.com",
              "indices":[ 69, 91 ],
              "display_url":"agmuriel.com/#!-/c1gz",
              "url":"http://t.co/APpPjRRTXn"
            }
          ]
        }
        (…)
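  • 44. Tweets
      #!/usr/bin/python
      # Mapper: for English tweets, emit "<hashtag>\t<tweet text>" once per hashtag.
      import json, sys

      for line in sys.stdin:
          line = line.strip()
          if '"lang":"en"' in line:
              tweet = json.loads(line)
              try:
                  # Collapse whitespace so embedded tabs/newlines cannot break the record.
                  text = ' '.join(tweet['text'].lower().split())
                  if text:
                      tags = tweet["entities"]["hashtags"]
                      for tag in tags:
                          print(tag["text"] + "\t" + text)
              except KeyError:
                  continue

      #!/usr/bin/python
      # Reducer: concatenate the texts of all tweets that share a hashtag.
      import sys

      (lastKey, text) = (None, "")
      for line in sys.stdin:
          (key, value) = line.strip().split("\t", 1)
          if lastKey and lastKey != key:
              print(lastKey + "\t" + text)
              (lastKey, text) = (key, value)
          else:
              (lastKey, text) = (key, text + " " + value)
      # Flush the last hashtag, which the loop itself never prints.
      if lastKey:
          print(lastKey + "\t" + text)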
  • 45. Tweets
      yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
          -Dmapreduce.job.queuename=atmosphere_time \
          -Dmapred.output.compress=false \
          -Dmapreduce.job.max.split.locations=24 \
          -Dmapred.reduce.tasks=20 \
          -files ~/mahout/twitter_map.py,~/mahout/twitter_reduce.py \
          -mapper ./twitter_map.py \
          -reducer ./twitter_reduce.py \
          -input /project/atmosphere/tweets/2014/04/*/* \
          -output /project/atmosphere/tweets/output \
          -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat
      Get a SequenceFile with the proper mapping.
  • 46. Tweets
      mahout seq2sparse \
          -i /project/atmosphere/tweets/output \
          -o /project/atmosphere/tweets/vectorized \
          -ow -chunk 200 -wt tfidf -s 5 -md 5 -x 90 -ng 2 -ml 50 -seq -n 2
      Calculate the vector representation of the text:
      {10:0.6292275202550768,14:0.7772211575566166}
      {10:0.6292275202550768,14:0.7772211575566166}
      {3:0.37796447439954967,14:0.37796447439954967,19:0.654653676423271,22:0.534522474858859}
      {17:1.0}
      {3:0.37796447439954967,14:0.37796447439954967,19:0.654653676423271,22:0.534522474858859}
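The vectors shown above are sparse `{term index: weight}` maps whose squared weights sum to 1, which matches the `-wt tfidf` and `-n 2` options. A minimal Python 3 sketch of that kind of representation follows. It uses the textbook weight `tf * ln(N / df)`, which is an assumption and may differ from the exact formula Mahout applies, and it ignores the n-gram and document-frequency pruning options; the function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse {term_index: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenised for term in tokens})
    index = {term: i for i, term in enumerate(vocabulary)}
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    n_docs = len(documents)

    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {index[t]: count * math.log(n_docs / df[t]) for t, count in tf.items()}
        weights = {i: w for i, w in weights.items() if w > 0.0}  # drop zero weights
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({i: w / norm for i, w in weights.items()} if norm else {})
    return vectors


# Hypothetical documents, one per hashtag as produced by the reducer.
docs = ["linux ubuntu patch", "linux kernel patch", "zumba fitness"]
for vector in tfidf_vectors(docs):
    print(vector)
```

Normalising every vector to unit length keeps long concatenated texts from dominating the distance computation in the clustering step.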
  • 47. Tweets
      It's time to begin clustering. Let's find 100 clusters:
      mahout kmeans \
          -i /tweets_5/vectorized/tfidf-vectors \
          -c /tweets_5/kmeans/initial-clusters \
          -o /tweets_5/kmeans/output-clusters \
          -cd 1.0 -k 100 -x 10 -cl -ow \
          -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
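What the k-means step does can be sketched in a few lines of Python 3. This is plain Lloyd's algorithm on small dense vectors, offered as an illustration and not as Mahout's distributed implementation. The parameters `k`, `max_iterations` and `convergence_delta` only loosely mirror the `-k`, `-x` and `-cd` options, seeding the centroids with the first `k` points is a simplification, and the sample points are hypothetical.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_euclidean(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, max_iterations=10, convergence_delta=1e-9):
    """Lloyd's algorithm; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        moved = 0.0
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if not members:
                continue  # leave the centroid of an empty cluster where it is
            new = [sum(dim) / len(members) for dim in zip(*members)]
            moved = max(moved, squared_euclidean(new, centroids[c]))
            centroids[c] = new
        if moved <= convergence_delta:
            break  # no centroid moved noticeably: converged
    return centroids, assign(points, centroids)


# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

In the real job the points are the sparse TF-IDF vectors from the previous step, and the assignment and update phases run as MapReduce passes over HDFS.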
  • 48. Tweets
      Glance at results
      Cluster 1: BURN, FAT, WEIGHTLOSS, FITNESS, ZUMBA
      Cluster 2: OPEN, SOFTWARE, LINUX, UBUNTU, OPENSUSE, PATCHING
      Cluster 3: LEATHER, WALLET, MAN
  • 49. Tweets
      It was easy because tags are strongly dependent (co-occurrence).
  • 50. Tweets
      Bigger challenge - user clustering
      LINUX UBUNTU WINDOWS OS PATCH MAC HACKED MICROSOFT FREE CSRRACING WON RACEYOURFRIENDS ANDROID CSRCLASSIC
  • 51. Tweets
      Bigger challenge - user clustering
      - Results show that the dataset is strongly skewed toward mobile and games
      - The dataset wasn't random: we subscribed to specific keywords
      - The OS result is great!
  • 52. Tweets
      HADOOP WORLD
      - run predictive machine learning algorithms on hadoop without even knowing mapreduce.: data scientists are very... http://t.co/gdmqm5g1ar
      - rt @mapr: google cloud storage connector for #hadoop: quick start guide now avail http://t.co/17hxtvdlir #bigdata
  • 53. Tweets
      HADOOP WORLD
      - Cloudera wants to do big data in real time.
      - Hortonworks wants to replace Cloudera by research.
  • 54. Visualize data
      add jar hive-serdes-1.0-SNAPSHOT.jar;
      create table tw_data_201404
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        LINES TERMINATED BY '\012'
        STORED AS TEXTFILE
        LOCATION '/tweets/tw_data_201404'
      AS SELECT
        v_date,
        LOWER(hashtags.text) AS tag,
        lang,
        COUNT(*) AS total_count
      FROM logs.tweets
      LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
      WHERE v_date like '2014-04-%'
      GROUP BY v_date, LOWER(hashtags.text), lang;
  • 55. Visualize data
      add jar elasticsearch-hadoop-hive-2.0.0.RC1.jar;
      CREATE EXTERNAL TABLE es_export (
        v_date string,
        tag string,
        lang string,
        total_count int,
        info string
      )
      STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
      TBLPROPERTIES (
        'es.resource' = 'trends/log',
        'es.index.auto.create' = 'true'
      );
  • 56. Visualize data
      INSERT overwrite TABLE es_export
      SELECT distinct may.v_date, may.tag, may.lang, may.total_count, 'nt'
      FROM tw_data_201405 may
      LEFT outer JOIN tw_data_201404 april
        ON april.tag = may.tag
      WHERE april.tag is null
        AND may.total_count > 1;
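The `LEFT OUTER JOIN ... WHERE april.tag IS NULL` pattern in the query above is an anti-join: it keeps only the May rows whose tag has no match in April. The same logic can be sketched in Python 3 on in-memory rows. The function name `new_tags` and the sample rows are hypothetical; the field names follow the columns of the Hive tables.

```python
def new_tags(april_rows, may_rows, min_count=1):
    """May rows whose tag never occurred in April and whose count exceeds min_count."""
    april_tags = {row["tag"] for row in april_rows}
    return [
        row for row in may_rows
        if row["tag"] not in april_tags and row["total_count"] > min_count
    ]


# Hypothetical rows shaped like the tw_data_2014MM tables.
april = [{"v_date": "2014-04-24", "tag": "hadoop", "lang": "en", "total_count": 12}]
may = [
    {"v_date": "2014-05-02", "tag": "hadoop", "lang": "en", "total_count": 9},
    {"v_date": "2014-05-02", "tag": "spark", "lang": "en", "total_count": 4},
    {"v_date": "2014-05-03", "tag": "oneoff", "lang": "en", "total_count": 1},
]
for row in new_tags(april, may):
    print(row["v_date"], row["tag"], row["total_count"])
```

Building a set of April tags first keeps the lookup cheap, which is the in-memory analogue of letting Hive resolve the join before the `IS NULL` filter is applied.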