Big Data in Practice: 
The TrustYou Tech Stack 
Cluj Big Data Meetup, Nov 18th 
Steffen Wenz, CTO
Goals of today’s talk 
● Relate first-hand experiences with a big data tech 
stack 
● Introduce a few essential technologies beyond 
Hadoop: 
○ Hortonworks HDP 
○ Apache Pig 
○ Luigi
Who are we? 
● For each hotel on the 
planet, provide a 
summary of all reviews 
● Expertise: 
○ NLP 
○ Machine Learning 
○ Big Data 
● Clients: …
TrustYou Tech Stack 
Batch Layer 
● Hadoop (HDP 2.1) 
● Python 
● Pig 
● Luigi 
Service Layer 
● PostgreSQL 
● MongoDB 
● Redis 
● Cassandra 
[Diagram: data flows into the batch layer, queries are answered from the service layer. Hadoop cluster (100 nodes), application machines]
Hadoop cluster 
(includes all live and 
development machines)
Python ♥ Big Data 
Hadoop is Java-first, but: 
● Hadoop streaming 
cat input | ./map.py |  
sort | ./reduce.py > output 
○ MRJob, Luigi 
○ VirtualEnv 
● Pig: Python UDFs 
● Real-time processing: 
PySpark, PyStorm 
● Data processing: 
○ Numpy, SciPy 
○ Pandas 
● NLP: 
○ NLTK 
● Machine learning: 
○ Scikit-learn 
○ Gensim (word2vec)
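The streaming pattern above (`cat input | ./map.py | sort | ./reduce.py`) can be sketched as two pure Python functions; this is a hypothetical word count, where `map.py` and `reduce.py` would each wrap one function around stdin/stdout:

```python
from itertools import groupby

def mapper(lines):
    """map.py: emit a (word, 1) pair for every word.
    On the cluster these pairs would be tab-separated lines on stdout."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """reduce.py: sum the counts per key.
    Relies on the sort step between map and reduce grouping equal keys together."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)
```

Because the scripts only talk to stdin/stdout, the local `cat | map | sort | reduce` pipeline doubles as a test harness; on the cluster the same files run unchanged via the Hadoop streaming jar (`-mapper map.py -reducer reduce.py`).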
Use case: Semantic analysis 
● “Nice room” 
● “Room wasn't so great” 
● “The air-conditioning 
was so powerful that we 
were cold in the room 
even when it was off.” 
● “อาหารรสชาติดี” 
● “خدمة جیدة” 
● 20 languages 
● Linguistic system 
(morphology, taggers, 
grammars, parsers …) 
● Hadoop: Scale out CPU 
● Python for ML & NLP 
libraries
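The actual linguistic system is far richer than this, but as a toy illustration of why Python fits the ML & NLP side: a minimal Naive Bayes sentiment classifier, stdlib only (real work would reach for scikit-learn; the class name and training phrases are made up):

```python
import math
from collections import Counter

class TinyNB:
    """Multinomial Naive Bayes with Laplace smoothing over bag-of-words features."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())

        def log_prob(label):
            counts = self.word_counts[label]
            total = sum(counts.values())
            lp = math.log(self.class_counts[label] / total_docs)  # class prior
            for word in doc.lower().split():
                # Laplace smoothing: unseen words get a pseudo-count of 1
                lp += math.log((counts[word] + 1) / (total + len(self.vocab)))
            return lp

        return max(self.class_counts, key=log_prob)
```

The point is less the algorithm than the ecosystem: swapping this toy for `sklearn.naive_bayes.MultinomialNB`, an NLTK tokenizer, or a Gensim word2vec feature space is a few lines each, which is why the batch layer stays in Python.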
Hortonworks Distribution
Hortonworks Distribution 
● Hortonworks Data 
Platform: Enterprise 
architecture out of the 
box 
● Try out in VM: 
Hortonworks Sandbox 
● Alternatives: Cloudera 
CDH, MapR 
TrustYou & Hortonworks @ BITKOM Big Data Summit
Apache Pig 
● Define & execute parallel data flows 
on Hadoop 
○ Engine + Language (“Pig Latin”) + Shell (“Grunt”) 
● “SQL of big data” (bad comparison; many differences) 
● Goal: Make Pig Latin native language of parallel data 
processing 
● Native support for: Projection, filtering, sort, group, join
Why not just MapReduce? 
● Projection 
SELECT a, b ... 
● Filter 
WHERE ... 
● Sort 
● Distinct 
● Group 
● Join 
Source: Hadoop: The Definitive Guide
Pig Example 
Load one day of raw GDELT data 
-- omitted: create table, insert 
select * from gdelt limit 10; 
gdelt = load '20141112.export.CSV'; 
gdelt = limit gdelt 10; 
dump gdelt; 
Pigs eat 
anything!
Specifying a schema 
gdelt = load '20141112.export.CSV' as ( 
event_id: chararray, 
sql_date: chararray, 
month_year: chararray, 
year: chararray, 
fraction_date: chararray, 
actor1: chararray, 
actor1_name: chararray, 
-- ... 59 columns in total ... 
event: int, 
goldstein_scale: float, 
date_added: chararray, 
source_url: chararray 
);
Pig Example 
Look at all non-empty actor countries 
select actor1_country 
from gdelt 
where actor1_country != ''; 
-- where: 
gdelt = filter gdelt 
by actor1_country != ''; 
-- select: 
country = foreach gdelt 
generate actor1_country; 
dump country;
Pig Example 
Get histogram of actor countries 
select actor1_country, count(*) 
from gdelt 
group by actor1_country; 
gdelt_grp = group gdelt 
by actor1_country; 
gdelt_cnt = foreach gdelt_grp generate 
group as country, 
COUNT(gdelt) as count; 
dump gdelt_cnt;
Pig Example 
Count total rows, count distinct event IDs 
select count(*) from gdelt; 
select count(distinct event_id) from 
gdelt; 
gdelt_grp = group gdelt all; 
gdelt_cnt = foreach gdelt_grp generate 
COUNT(gdelt); 
dump gdelt_cnt; -- 180793 
event = foreach gdelt generate event; 
event_dis = distinct event; 
event_grp = group event_dis all; 
event_cnt = foreach event_grp generate 
COUNT(event_dis); 
dump event_cnt; -- 215
Things you can’t do in Pig 
i = 2; 
Top-level variables are bags 
(sort of like tables). 
if (x#a == 2) dump xs; 
None of the usual control 
structures. You define data 
flows. 
For everything else: UDFs 
(user-defined functions). 
Custom operators 
implemented in Java or 
Python. 
Also: Directly call Java 
static methods
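A Python UDF for Pig is just a decorated function. This sketch counts words in a field; the function name and schema are made up for illustration, and since `pig_util` only exists when the script runs under Pig's Jython, a no-op shim keeps it importable locally:

```python
try:
    # Provided by Pig when the script is registered as a UDF module
    from pig_util import outputSchema
except ImportError:
    # No-op shim so the same module also runs (and tests) outside Pig
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

@outputSchema("num_words:int")
def num_words(text):
    """Count whitespace-separated tokens. NULL-safe, since Pig may pass None."""
    if text is None:
        return 0
    return len(text.split())
```

In a Pig script you would then write `register 'udfs.py' using jython as udfs;` and call it inside a foreach, e.g. `foreach gdelt generate udfs.num_words(source_url);`.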
Cool, but where’s the parallelism? 
event = foreach gdelt generate event; 
-- map 
event_dis = distinct event 
parallel 50; -- reduce! 
event_grp = group event_dis all 
parallel 50; -- reduce! 
event_cnt = foreach event_grp generate 
COUNT(event_dis); -- map 
dump event_cnt;
Pig’s execution engine 
$ pig -x local -e "explain -script gdelt.pig" 
#----------------------------------------------- 
# New Logical Plan: 
#----------------------------------------------- 
event_cnt: (Name: LOStore Schema: #131:long) 
ColumnPrune:InputUids=[69]ColumnPrune:OutputUids=[69] 
| 
|---event_cnt: (Name: LOForEach Schema: #131:long) 
| | 
| (Name: LOGenerate[false] Schema: #131:long) 
| | | 
| | (Name: UserFunc(org.apache.pig.builtin. 
COUNT) Type: long Uid: 131) 
| | | 
| | |---event_dis:(Name: Project Type: bag 
Uid: 67 Input: 0 Column: (*)) 
| | 
| |---event_dis: (Name: LOInnerLoad[1] Schema: 
event#27:int) 
| 
|---event_grp: (Name: LOCogroup Schema: group#66: 
chararray,event_dis#67:bag{#129:tuple(event#27:int)}) 
$ pig -x local -e "explain -script gdelt.pig 
-dot -out gdelt.dot" 
$ dot -Tpng gdelt.dot > gdelt.png
Pig advanced: Asymmetric country relations 
-- we're only interested in countries 
gdelt = filter ( 
foreach gdelt generate actor1_country, actor2_country, goldstein_scale 
) by actor1_country != '' and actor2_country != ''; 
gdelt_grp = group gdelt by (actor1_country, actor2_country); 
-- it's not necessary to aggregate twice - except that Pig doesn't allow self joins 
gold_1 = foreach gdelt_grp generate 
group.actor1_country as actor1_country, 
group.actor2_country as actor2_country, 
SUM(gdelt.goldstein_scale) as goldstein_scale; 
gold_2 = foreach gdelt_grp generate 
group.actor1_country as actor1_country, 
group.actor2_country as actor2_country, 
SUM(gdelt.goldstein_scale) as goldstein_scale; 
-- join both sums together, to get the Goldstein values for both directions in one row 
gold = join gold_1 by (actor1_country, actor2_country), gold_2 by (actor2_country, actor1_country);
Pig advanced: Asymmetric country relations 
-- compute the difference in Goldstein score 
gold = foreach gold generate 
gold_1::actor1_country as actor1_country, 
gold_1::actor2_country as actor2_country, 
gold_1::goldstein_scale as gold_1, 
gold_2::goldstein_scale as gold_2, 
ABS(gold_1::goldstein_scale - gold_2::goldstein_scale) as diff; 
-- keep only the values where one direction is positive, the other negative 
-- also, remove all duplicate rows 
gold = filter gold by gold_1 * gold_2 < 0 and actor1_country < actor2_country; 
gold = order gold by diff desc; 
dump gold;
Pig advanced: Asymmetric country relations 
(PSE,USA,93.49999961256981,-76.30000001192093,169.79999962449074) 
(NGA,USA,15.900000423192978,-143.5999995470047,159.49999997019768) 
(ISR,JOR,143.89999967813492,-12.700000494718552,156.60000017285347) 
(IRN,SYR,103.50000095367432,-50.50000023841858,154.0000011920929) 
(IRN,ISR,16.60000056028366,-112.40000087022781,129.00000143051147) 
(GBR,RUS,73.09999999403954,-41.99999952316284,115.09999951720238) 
(EGY,SYR,-87.60000020265579,12.0,99.60000020265579) 
(USA,YEM,-78.30000007152557,15.700000047683716,94.00000011920929) 
(ISR,TUR,2.4000001549720764,-90.60000002384186,93.00000017881393) 
(MYS,UKR,35.10000038146973,-52.0,87.10000038146973) 
(GRC,TUR,-47.60000029206276,36.5,84.10000029206276) 
(HTI,USA,34.99999976158142,-45.40000009536743,80.39999985694885)
Apache Pig @ TrustYou 
● Before: 
○ Usage of Unix utilities (sort, cut, awk etc.) and 
custom tools (map_filter.py, reduce_agg.py) to 
transform data with Hadoop Streaming 
● Now: 
○ Data loading & transformation expressed in Pig 
○ PigUnit for testing 
○ Core algorithms still implemented in Python
Further Reading on Pig 
● O’Reilly Book - 
free online version 
See code samples on 
TrustYou GitHub account: 
https://github.com/trustyou/meetups/tree/master/big-data
Luigi 
● Build complex pipelines of 
batch jobs 
○ Dependency resolution 
○ Parallelism 
○ Resume failed jobs 
● Pythonic replacement for Apache Oozie 
● Not a replacement for Pig, Cascading, Hive
Anatomy of a Luigi task 
class MyTask(luigi.Task): 
# Parameters which control the behavior of the task. Same parameters = the task only needs to run once! 
param1 = luigi.Parameter() 
# These dependencies need to be done before this task can start. Can also be a list or dict 
def requires(self): 
return DependentTask(self.param1) 
# Path to output file (local or HDFS). If this file is present, Luigi considers this task to be done. 
def output(self): 
return luigi.LocalTarget("data/my_task_output_{}".format(self.param1)) 
def run(self): 
# To make task execution atomic, Luigi writes all output to a temporary file, and only renames it when you close the target. 
with self.output().open("w") as out: 
out.write("foo")
Luigi tasks vs. Makefiles 
class MyTask(luigi.Task): 
def requires(self): 
return DependentTask() 
def output(self): 
return luigi.LocalTarget("data/my_task_output") 
def run(self): 
with self.output().open("w") as 
out: 
out.write("foo") 
data/my_task_output: DependentTask 
	run ... 
Luigi Hadoop integration 
class HadoopTask(luigi.hadoop.JobTask): 
def output(self): 
return luigi.HdfsTarget("output_in_hdfs") 
def requires(self): 
return { 
"some_task": SomeTask(), 
"some_other_task": SomeOtherTask() 
} 
def mapper(self, line): 
key, value = line.rstrip().split("\t") 
yield key, value 
def reducer(self, key, values): 
yield key, ", ".join(values)
Luigi example 
Crawl a URL, then extract 
all links from it! CrawlTask(url) 
ExtractTask(url)
Luigi example: CrawlTask 
class CrawlTask(luigi.Task): 
url = luigi.Parameter() 
def output(self): 
url_hash = hashlib.md5(self.url).hexdigest() 
return luigi.LocalTarget(os.path.join("data", "crawl_" + url_hash)) 
def run(self): 
req = requests.get(self.url) 
res = req.text 
with self.output().open("w") as out: 
out.write(res.encode("utf-8"))
Luigi example: ExtractTask 
class ExtractTask(luigi.Task): 
url = luigi.Parameter() 
def requires(self): 
return CrawlTask(self.url) 
def output(self): 
url_hash = hashlib.md5(self.url).hexdigest() 
return luigi.LocalTarget(os.path.join("data", "extract_" + url_hash)) 
def run(self): 
soup = bs4.BeautifulSoup(self.input().open().read()) 
with self.output().open("w") as out: 
for link in soup.find_all("a"): 
out.write(str(link.get("href")) + "\n")
Luigi example: Running it locally 
$ python luigi_demo.py --local-scheduler ExtractTask --url http://www.trustyou.com 
DEBUG: Checking if ExtractTask(url=http://www.trustyou.com) is complete 
INFO: Scheduled ExtractTask(url=http://www.trustyou.com) (PENDING) 
DEBUG: Checking if CrawlTask(url=http://www.trustyou.com) is complete 
INFO: Scheduled CrawlTask(url=http://www.trustyou.com) (PENDING) 
INFO: Done scheduling tasks 
INFO: Running Worker with 1 processes 
DEBUG: Asking scheduler for work... 
DEBUG: Pending tasks: 2 
INFO: [pid 2279] Worker Worker(salt=083397955, host=steffen-thinkpad, username=steffen, pid=2279) running 
CrawlTask(url=http://www.trustyou.com) 
INFO: [pid 2279] Worker Worker(salt=083397955, host=steffen-thinkpad, username=steffen, pid=2279) done 
CrawlTask(url=http://www.trustyou.com) 
DEBUG: 1 running tasks, waiting for next task to finish 
DEBUG: Asking scheduler for work... 
DEBUG: Pending tasks: 1 
INFO: [pid 2279] Worker Worker(salt=083397955, host=steffen-thinkpad, username=steffen, pid=2279) running 
ExtractTask(url=http://www.trustyou.com) 
INFO: [pid 2279] Worker Worker(salt=083397955, host=steffen-thinkpad, username=steffen, pid=2279) done 
ExtractTask(url=http://www.trustyou.com) 
DEBUG: 1 running tasks, waiting for next task to finish 
DEBUG: Asking scheduler for work...
Luigi @ TrustYou 
● Before: 
○ Bash scripts + cron 
○ Manual cleanup after 
failures due to network 
issues etc. 
● Now: 
○ Complex nested Luigi job 
graphs 
○ Failed jobs usually repair 
themselves
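The "repair themselves" property falls out of two Luigi design decisions shown earlier: scheduling is driven purely by whether a task's output target exists, and output is written atomically (temp file, then rename). A minimal sketch of that contract, outside Luigi (the helper name is made up):

```python
import os
import tempfile

def run_if_missing(task_output, run):
    """Luigi-style idempotent execution: run only if the output target is absent,
    and write atomically so a crash never leaves a partial output behind."""
    if os.path.exists(task_output):
        return False  # task already complete; a re-run skips it
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(task_output) or ".")
    with os.fdopen(fd, "w") as out:
        run(out)  # any exception here leaves the real output untouched
    os.rename(tmp_path, task_output)  # atomic on POSIX filesystems
    return True
```

Re-running the whole pipeline after a network failure is therefore safe: finished tasks are skipped, half-written outputs never exist, and only the genuinely missing targets get recomputed, which is exactly what replaced the manual cleanup.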
TrustYou wants you! 
We offer positions 
in Cluj & Munich: 
● Data engineer 
● Application developer 
● Crawling engineer 
Write me at swenz@trustyou.net, check out our website, 
or see you at the next meetup!
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 

Cluj Big Data Meetup - Big Data in Practice

  • 1. Big Data in Practice: The TrustYou Tech Stack Cluj Big Data Meetup, Nov 18th Steffen Wenz, CTO
  • 2. Goals of today’s talk ● Relate first-hand experiences with a big data tech stack ● Introduce a few essential technologies beyond Hadoop: ○ Hortonworks HDP ○ Apache Pig ○ Luigi
  • 4. Who are we? ● For each hotel on the planet, provide a summary of all reviews ● Expertise: ○ NLP ○ Machine Learning ○ Big Data ● Clients: …
  • 6. TrustYou Tech Stack Batch Layer ● Hadoop (HDP 2.1) ● Python ● Pig ● Luigi Service Layer ● PostgreSQL ● MongoDB ● Redis ● Cassandra Data Data Queries Hadoop cluster (100 nodes) Application machines
  • 7. Hadoop cluster (includes all live and development machines)
  • 8. Python ♥ Big Data Hadoop is Java-first, but: ● Hadoop streaming cat input | ./map.py | sort | ./reduce.py > output ○ MRJob, Luigi ○ VirtualEnv ● Pig: Python UDFs ● Real-time processing: PySpark, PyStorm ● Data processing: ○ NumPy, SciPy ○ Pandas ● NLP: ○ NLTK ● Machine learning: ○ Scikit-learn ○ Gensim (word2vec)
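The streaming pattern on this slide can be simulated without a cluster. Below is a minimal word-count sketch; the function names and in-memory wiring are illustrative, not TrustYou code:

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit one tab-separated "word<TAB>1" pair per token,
    # the line format Hadoop streaming expects on stdout.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(lines):
    # Reduce phase: Hadoop delivers mapper output sorted by key, so
    # consecutive identical keys can be summed with groupby.
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))
```

Wired to `sys.stdin` and `print` in two separate scripts, this becomes exactly the `cat input | ./map.py | sort | ./reduce.py > output` pipeline from the slide.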
  • 9. Use case: Semantic analysis ● “Nice room” ● “Room wasn’t so great” ● “The air-conditioning was so powerful that we were cold in the room even when it was off.” ● “อาหารรสชาติดี” ● “خدمة جيدة” ● 20 languages ● Linguistic system (morphology, taggers, grammars, parsers …) ● Hadoop: Scale out CPU ● Python for ML & NLP libraries
  • 11. Hortonworks Distribution ● Hortonworks Data Platform: Enterprise architecture out of the box ● Try out in VM: Hortonworks Sandbox ● Alternatives: Cloudera CDH, MapR TrustYou & Hortonworks @ BITKOM Big Data Summit
  • 12. Apache Pig ● Define & execute parallel data flows on Hadoop ○ Engine + Language (“Pig Latin”) + Shell (“Grunt”) ● “SQL of big data” (bad comparison; many differences) ● Goal: Make Pig Latin native language of parallel data processing ● Native support for: Projection, filtering, sort, group, join
  • 13. Why not just MapReduce? ● Projection SELECT a, b ... ● Filter WHERE ... ● Sort ● Distinct ● Group ● Join Source: Hadoop: The Definitive Guide
  • 14. Pig Example Load one day of raw GDELT data -- omitted: create table, insert select * from gdelt limit 10; gdelt = load '20141112.export.CSV'; gdelt = limit gdelt 10; dump gdelt; Pigs eat anything!
  • 15. Specifying a schema gdelt = load '20141112.export.CSV' as ( event_id: chararray, sql_date: chararray, month_year: chararray, year: chararray, fraction_date: chararray, actor1: chararray, actor1_name: chararray, -- ... 59 columns in total ... event: int, goldstein_scale: float, date_added: chararray, source_url: chararray );
  • 16. Pig Example Look at all non-empty actor countries select actor1_country from gdelt where actor1_country != ''; -- where: gdelt = filter gdelt by actor1_country != ''; -- select: country = foreach gdelt generate actor1_country; dump country;
  • 17. Pig Example Get histogram of actor countries select actor1_country, count(*) from gdelt group by actor1_country; gdelt_grp = group gdelt by actor1_country; gdelt_cnt = foreach gdelt_grp generate group as country, COUNT(gdelt) as count; dump gdelt_cnt;
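In plain Python, the group-and-COUNT above is just a histogram; a toy equivalent (the country list is made up for illustration):

```python
from collections import Counter

# Toy stand-in for the actor1_country column of GDELT
countries = ["USA", "DEU", "USA", "FRA", "USA"]

# group gdelt by actor1_country; COUNT(gdelt) per group
histogram = Counter(countries)
```

The difference, of course, is that Pig runs the same grouping as a distributed reduce over the full dataset, not in one process's memory.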
  • 18. Pig Example Count total rows, count distinct event IDs select count(*) from gdelt; select count(distinct event_id) from gdelt; gdelt_grp = group gdelt all; gdelt_cnt = foreach gdelt_grp generate COUNT(gdelt); dump gdelt_cnt; -- 180793 event = foreach gdelt generate event; event_dis = distinct event; event_grp = group event_dis all; event_cnt = foreach event_grp generate COUNT(event_dis); dump event_cnt; -- 215
  • 19. Things you can’t do in Pig i = 2; Top-level variables are bags (sort of like tables). if (x#a == 2) dump xs; None of the usual control structures. You define data flows. For everything else: UDFs (user-defined functions). Custom operators implemented in Java or Python. Also: Directly call Java static methods
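A Python UDF registered from Pig might look like the sketch below. The `outputSchema` decorator comes from Pig's `pig_util` module when the script runs under Jython, so it is stubbed out here to keep the file importable as plain Python; the function name and schema are illustrative:

```python
try:
    # Available when the script is registered from Pig via Jython
    from pig_util import outputSchema
except ImportError:
    # Stub so the module also runs outside Pig
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

@outputSchema("country_pair:chararray")
def country_pair(actor1_country, actor2_country):
    # Canonical, order-independent key for a pair of countries
    return "-".join(sorted([actor1_country, actor2_country]))
```

From Pig Latin this would be wired in with something along the lines of `register 'udfs.py' using jython as udfs;` and then called like any built-in inside a `foreach ... generate`.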
  • 20. Cool, but where’s the parallelism? event = foreach gdelt generate event; -- map event_dis = distinct event parallel 50; -- reduce! event_grp = group event_dis all parallel 50; -- reduce! event_cnt = foreach event_grp generate COUNT(event_dis); -- map dump event_cnt;
  • 21. Pig’s execution engine $ pig -x local -e "explain -script gdelt.pig" #----------------------------------------------- # New Logical Plan: #----------------------------------------------- event_cnt: (Name: LOStore Schema: #131:long) ColumnPrune:InputUids=[69]ColumnPrune:OutputUids=[69] | |---event_cnt: (Name: LOForEach Schema: #131:long) | | | (Name: LOGenerate[false] Schema: #131:long) | | | | | (Name: UserFunc(org.apache.pig.builtin. COUNT) Type: long Uid: 131) | | | | | |---event_dis:(Name: Project Type: bag Uid: 67 Input: 0 Column: (*)) | | | |---event_dis: (Name: LOInnerLoad[1] Schema: event#27:int) | |---event_grp: (Name: LOCogroup Schema: group#66: chararray,event_dis#67:bag{#129:tuple(event#27:int)}) $ pig -x local -e "explain -script gdelt.pig -dot -out gdelt.dot" $ dot -Tpng gdelt.dot > gdelt.png
  • 22. Pig advanced: Asymmetric country relations -- we're only interested in countries gdelt = filter ( foreach gdelt generate actor1_country, actor2_country, goldstein_scale ) by actor1_country != '' and actor2_country != ''; gdelt_grp = group gdelt by (actor1_country, actor2_country); -- it's not necessary to aggregate twice - except that Pig doesn't allow self joins gold_1 = foreach gdelt_grp generate group.actor1_country as actor1_country, group.actor2_country as actor2_country, SUM(gdelt.goldstein_scale) as goldstein_scale; gold_2 = foreach gdelt_grp generate group.actor1_country as actor1_country, group.actor2_country as actor2_country, SUM(gdelt.goldstein_scale) as goldstein_scale; -- join both sums together, to get the Goldstein values for both directions in one row gold = join gold_1 by (actor1_country, actor2_country), gold_2 by (actor2_country, actor1_country);
  • 23. Pig advanced: Asymmetric country relations -- compute the difference in Goldstein score gold = foreach gold generate gold_1::actor1_country as actor1_country, gold_1::actor2_country as actor2_country, gold_1::goldstein_scale as gold_1, gold_2::goldstein_scale as gold_2, ABS(gold_1::goldstein_scale - gold_2::goldstein_scale) as diff; -- keep only the values where one direction is positive, the other negative -- also, remove all duplicate rows gold = filter gold by gold_1 * gold_2 < 0 and actor1_country < actor2_country; gold = order gold by diff desc; dump gold;
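The two slides above boil down to: sum Goldstein scores per directed country pair, join each direction with its reverse, and keep pairs whose sums have opposite signs. A small pure-Python sketch of that logic (toy scores, not real GDELT data):

```python
from collections import defaultdict

# Toy directed events: (actor1_country, actor2_country, goldstein_scale)
events = [
    ("ISR", "JOR", 80.0), ("ISR", "JOR", 63.9),
    ("JOR", "ISR", -12.7),
    ("GBR", "RUS", 73.1),
    ("RUS", "GBR", -42.0),
]

# group by (actor1, actor2); SUM(goldstein_scale)
sums = defaultdict(float)
for a1, a2, gold in events:
    sums[(a1, a2)] += gold

# join each direction with its reverse, keep opposite-sign pairs,
# de-duplicate by ordering the pair alphabetically
asymmetric = sorted(
    (
        (a1, a2, g1, sums[(a2, a1)], abs(g1 - sums[(a2, a1)]))
        for (a1, a2), g1 in sums.items()
        if (a2, a1) in sums and g1 * sums[(a2, a1)] < 0 and a1 < a2
    ),
    key=lambda row: -row[4],  # order gold by diff desc
)
```

The self-join workaround from the slide (aggregating twice) disappears here because a dict can look itself up; in Pig the two relations are needed only because Pig disallows self joins.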
  • 24. Pig advanced: Asymmetric country relations (PSE,USA,93.49999961256981,-76.30000001192093,169.79999962449074) (NGA,USA,15.900000423192978,-143.5999995470047,159.49999997019768) (ISR,JOR,143.89999967813492,-12.700000494718552,156.60000017285347) (IRN,SYR,103.50000095367432,-50.50000023841858,154.0000011920929) (IRN,ISR,16.60000056028366,-112.40000087022781,129.00000143051147) (GBR,RUS,73.09999999403954,-41.99999952316284,115.09999951720238) (EGY,SYR,-87.60000020265579,12.0,99.60000020265579) (USA,YEM,-78.30000007152557,15.700000047683716,94.00000011920929) (ISR,TUR,2.4000001549720764,-90.60000002384186,93.00000017881393) (MYS,UKR,35.10000038146973,-52.0,87.10000038146973) (GRC,TUR,-47.60000029206276,36.5,84.10000029206276) (HTI,USA,34.99999976158142,-45.40000009536743,80.39999985694885)
  • 25. Apache Pig @ TrustYou ● Before: ○ Usage of Unix utilities (sort, cut, awk etc.) and custom tools (map_filter.py, reduce_agg.py) to transform data with Hadoop Streaming ● Now: ○ Data loading & transformation expressed in Pig ○ PigUnit for testing ○ Core algorithms still implemented in Python
  • 26. Further Reading on Pig ● O’Reilly Book - free online version See code samples on TrustYou GitHub account: https://github.com/trustyou/meetups/tree/master/big-data
  • 27. Luigi ● Build complex pipelines of batch jobs ○ Dependency resolution ○ Parallelism ○ Resume failed jobs ● Pythonic replacement for Apache Oozie ● Not a replacement for Pig, Cascading, Hive
  • 28. Anatomy of a Luigi task class MyTask(luigi.Task): # Parameters which control the behavior of the task. Same parameters = the task only needs to run once! param1 = luigi.Parameter() # These dependencies need to be done before this task can start. Can also be a list or dict def requires(self): return DependentTask(self.param1) # Path to output file (local or HDFS). If this file is present, Luigi considers this task to be done. def output(self): return luigi.LocalTarget("data/my_task_output_{}".format(self.param1)) def run(self): # To make task execution atomic, Luigi writes all output to a temporary file, and only renames when you close the target. with self.output().open("w") as out: out.write("foo")
  • 29. Luigi tasks vs. Makefiles class MyTask(luigi.Task): def requires(self): return DependentTask() def output(self): return luigi.LocalTarget("data/my_task_output") def run(self): with self.output().open("w") as out: out.write("foo") data/my_task_output: DependentTask run run run ...
  • 30. Luigi Hadoop integration class HadoopTask(luigi.hadoop.JobTask): def output(self): return luigi.HdfsTarget("output_in_hdfs") def requires(self): return { "some_task": SomeTask(), "some_other_task": SomeOtherTask() } def mapper(self, line): key, value = line.rstrip().split("\t") yield key, value def reducer(self, key, values): yield key, ", ".join(values)
  • 31. Luigi example Crawl a URL, then extract all links from it! CrawlTask(url) ExtractTask(url)
  • 32. Luigi example: CrawlTask class CrawlTask(luigi.Task): url = luigi.Parameter() def output(self): url_hash = hashlib.md5(self.url).hexdigest() return luigi.LocalTarget(os.path.join("data", "crawl_" + url_hash)) def run(self): req = requests.get(self.url) res = req.text with self.output().open("w") as out: out.write(res.encode("utf-8"))
  • 33. Luigi example: ExtractTask class ExtractTask(luigi.Task): url = luigi.Parameter() def requires(self): return CrawlTask(self.url) def output(self): url_hash = hashlib.md5(self.url).hexdigest() return luigi.LocalTarget(os.path.join("data", "extract_" + url_hash)) def run(self): soup = bs4.BeautifulSoup(self.input().open().read()) with self.output().open("w") as out: for link in soup.find_all("a"): out.write(str(link.get("href")) + "\n")
  • 34. Luigi example: Running it locally $ python luigi_demo.py --local-scheduler ExtractTask --url http://www.trustyou.com DEBUG: Checking if ExtractTask(url=http://www.trustyou.com) is complete INFO: Scheduled ExtractTask(url=http://www.trustyou.com) (PENDING) DEBUG: Checking if CrawlTask(url=http://www.trustyou.com) is complete INFO: Scheduled CrawlTask(url=http://www.trustyou.com) (PENDING) INFO: Done scheduling tasks INFO: Running Worker with 1 processes DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 2 INFO: [pid 2279] Worker Worker(salt=083397955, host=steffen-thinkpad, username=steffen, pid=2279) running CrawlTask(url=http://www.trustyou.com) INFO: [pid 2279] Worker Worker(salt=083397955, host=steffen-thinkpad, username=steffen, pid=2279) done CrawlTask(url=http://www.trustyou.com) DEBUG: 1 running tasks, waiting for next task to finish DEBUG: Asking scheduler for work... DEBUG: Pending tasks: 1 INFO: [pid 2279] Worker Worker(salt=083397955, host=steffen-thinkpad, username=steffen, pid=2279) running ExtractTask(url=http://www.trustyou.com) INFO: [pid 2279] Worker Worker(salt=083397955, host=steffen-thinkpad, username=steffen, pid=2279) done ExtractTask(url=http://www.trustyou.com) DEBUG: 1 running tasks, waiting for next task to finish DEBUG: Asking scheduler for work...
  • 35. Luigi @ TrustYou ● Before: ○ Bash scripts + cron ○ Manual cleanup after failures due to network issues etc. ● Now: ○ Complex nested Luigi job graphs ○ Failed jobs usually repair themselves
  • 36. TrustYou wants you! We offer positions in Cluj & Munich: ● Data engineer ● Application developer ● Crawling engineer Write me at swenz@trustyou.net, check out our website, or see you at the next meetup!