SlideShare a Scribd company logo
1 of 26
SCALABLE HADOOP WITH
SUCCINCT PYTHON:
THE BEST OF BOTH WORLDS
Donald Miner
@donaldpminer
dminer@minerkasch.com
Hadoop Summit 2015
1
About Don
Scalable Hadoop with succinct Python: the best of both worlds
3Scalable Hadoop with succinct Python: the best of both worlds
Hadoop with Python?
4Scalable Hadoop with succinct Python: the best of both worlds
The Good: Hadoop!
ā€¢ Linear scalability
ā€¢ Schema on read and unstructured data
ā€¢ Transparent parallelism
ā€¢ Open source
We want the things we love about Hadoop to be available
in Python, too!
5Scalable Hadoop with succinct Python: the best of both worlds
The Good: Succinct code
6Scalable Hadoop with succinct Python: the best of both worlds
The Good: Succinct code
Faster development
Less enterprise-y
Lower barrier of entry
7Scalable Hadoop with succinct Python: the best of both worlds
The Good: Interpreted, not compiled
Change code in place
Simpler continuous integration
More platform portable
8Scalable Hadoop with succinct Python: the best of both worlds
The Good: Python libraries for data
HYPE (Python & data science)
Tighter ā€œintegration" with data science
Pandas, scikitlearn, nltk, numpy, scipy,
gensim, matplotlib
9Scalable Hadoop with succinct Python: the best of both worlds
The Bad: Python shortfalls
Less enterprise-y
Type safety
No JVM sandbox safety
Performance
10Scalable Hadoop with succinct Python: the best of both worlds
Summary of Good & Bad
Love Python/Hadoop for Data Science and Analysis
Deal with Java/Hadoop for Data Engineering and
production code
The benefit of the good and the cost of the
bad is different for everyone!
11Scalable Hadoop with succinct Python: the best of both worlds
The Ugly
Nothing in Hadoop is native to Python
Performance can be awful due to serialization and IPC in
most cases
Python bindings are almost always behind, if they exist
Clone some random dudeā€™s code from github and pray to
Guido van Rossum that it compiles and/or works
12Scalable Hadoop with succinct Python: the best of both worlds
Overview of Hadoop/Python projects
13
mrjob
ā€¢ Write MapReduce jobs in Python!
ā€¢ Open sourced and maintained by Yelp
ā€¢ Wraps ā€œHadoop Streamingā€ in cpython Python 2.5+
ā€¢ Well documented
ā€¢ Can run locally, in Amazon EMR, or Hadoop
14
from mrjob.job import MRJob
class MRWordFreqCount(MRJob):
def mapper(self, _, line):
for word in line.split():
yield (word.lower(), 1)
def reducer(self, word, counts):
yield (word, sum(counts))
Scalable Hadoop with succinct Python: the best of both worlds
Pydoop
ā€¢ Write MapReduce jobs in Python!
ā€¢ Uses Hadoop C++ Pipes, which should be faster than wrapping streaming
ā€¢ Actively being worked on
ā€¢ Iā€™m not sure which is better
15
def mapper(_, v, writer):
for word in v.split():
writer.emit(word.lower(), 1)
def reducer(word, icounts, writer):
writer.emit(word, sum(icounts))
Scalable Hadoop with succinct Python: the best of both worlds
Python MapReduce options
http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
Hadoop Streaming ā€“ More manual but faster
Hadoopy, Dumbo, havenā€™t seen commits in years, mrjob in the past 12 hours
Pydoop is main competitor (not in this list)
16Scalable Hadoop with succinct Python: the best of both worlds
Pig
Pig is a higher-level platform and language for analyzing data that happens to
run MapReduce underneath
You can write Pig UDFs in Python. Let Pig control data flow and let Python deal
with the data manipulation!
Can use jython (faster) or cpython (access to more libs)
17
b = FOREACH a GENERATE
revstr(phonenum);
m = GROUP j BY username;
n = FOREACH m GENERATE
group, sortedconcat(j.tags);
@outputSchema(ā€œtags:chararray")
def sortedconcat(bag):
out = set()
for tag in bag:
out.add(tag)
return ā€˜-ā€™.join(sorted(out))
@outputSchema(ā€œrev:chararray")
def revstr(instr):
return instr[::-1]
Scalable Hadoop with succinct Python: the best of both worlds
ā€¢ A pure Python client that interacts with HDFS
ā€¢ Handles most NameNode ops (moving/renaming files, deleting)
ā€¢ Handles most DataNode reading ops (reading files, getmerge)
ā€¢ Two ways to use: library and command line interface
18
from snakebite.client import Client
client = Client(ā€1.2.3.4", 54310, use_trash=False)
for x in client.ls(['/data']):
print x
print ā€˜ā€™.join(client.cat(ā€˜/data/ref/refdata*.csvā€™))
$ snakebite get /path/in/hdfs/mydata.txt /local/path/data.txt
Scalable Hadoop with succinct Python: the best of both worlds
Starbase or Happybase
Uses the HBase Thrift gateway interface (slow)
Last commit 6 months ago
Not really there yet and have failed to gain community momentum.
Java is still king.
19Scalable Hadoop with succinct Python: the best of both worlds
PySpark
ā€¢ Programming in Spark (and PySpark) is in the form of
chaining transformations and actions on RDDs
ā€¢ RDDs are ā€œResilient Distributed Datasetsā€
ā€¢ RDDs are kept in memory for the most part
20Scalable Hadoop with succinct Python: the best of both worlds
PySpark Word Count Example
import sys
from operator import add
from pyspark import SparkContext
if __name__ == "__main__":
if len(sys.argv) != 2:
print >> sys.stderr, "Usage: wordcount <file>"
exit(-1)
sc = SparkContext(appName="PythonWordCount")
lines = sc.textFile(sys.argv[1], 1)
counts = lines.flatMap(lambda x: x.split(' ')) 
.map(lambda x: (x, 1)) 
.reduceByKey(add)
output = counts.collect()
for (word, count) in output:
print "%s: %i" % (word, count)
sc.stop()
21
Sparkā€™s native language is Scala, but it
also supports Java and Python
Python API is always a tad behind Scala
Scalable Hadoop with succinct Python: the best of both worlds
Some closing thoughts
22
Cassandra
23Scalable Hadoop with succinct Python: the best of both worlds
MongoDB
24Scalable Hadoop with succinct Python: the best of both worlds
Whatā€™s wrong with this?
Python bindings are almost always fringe projects
Other parts of Hadoop ecosystem are getting way more $
Lack of commercial support
Lack of cohesiveness in APIs and approaches
25Scalable Hadoop with succinct Python: the best of both worlds
SCALABLE HADOOP WITH
SUCCINCT PYTHON:
THE BEST OF BOTH WORLDS
Donald Miner
@donaldpminer
dminer@minerkasch.com
Hadoop Summit 2015
26

More Related Content

What's hot

Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easyVictor Sanchez Anguix
Ā 
Geek camp
Geek campGeek camp
Geek campjdhok
Ā 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To HadoopAdeel Ahmad
Ā 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
Ā 
IPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopIPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopDataWorks Summit
Ā 
Introduction to Pig | Pig Architecture | Pig Fundamentals
Introduction to Pig | Pig Architecture | Pig FundamentalsIntroduction to Pig | Pig Architecture | Pig Fundamentals
Introduction to Pig | Pig Architecture | Pig FundamentalsSkillspeed
Ā 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupHadoop User Group
Ā 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
Ā 
Overview of Spark for HPC
Overview of Spark for HPCOverview of Spark for HPC
Overview of Spark for HPCGlenn K. Lockwood
Ā 
Cascalog internal dsl_preso
Cascalog internal dsl_presoCascalog internal dsl_preso
Cascalog internal dsl_presoHadoop User Group
Ā 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Yahoo Developer Network
Ā 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Trainingstratapps
Ā 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
Ā 
data.table and H2O at LondonR with Matt Dowle
data.table and H2O at LondonR with Matt Dowledata.table and H2O at LondonR with Matt Dowle
data.table and H2O at LondonR with Matt DowleSri Ambati
Ā 
Introduction to Pig
Introduction to PigIntroduction to Pig
Introduction to PigPrashanth Babu
Ā 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
Ā 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
Ā 

What's hot (20)

Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
Ā 
Geek camp
Geek campGeek camp
Geek camp
Ā 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To Hadoop
Ā 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
Ā 
IPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopIPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for Hadoop
Ā 
Introduction to Pig | Pig Architecture | Pig Fundamentals
Introduction to Pig | Pig Architecture | Pig FundamentalsIntroduction to Pig | Pig Architecture | Pig Fundamentals
Introduction to Pig | Pig Architecture | Pig Fundamentals
Ā 
Apache Pig
Apache PigApache Pig
Apache Pig
Ā 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user group
Ā 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
Ā 
Overview of Spark for HPC
Overview of Spark for HPCOverview of Spark for HPC
Overview of Spark for HPC
Ā 
Cascalog internal dsl_preso
Cascalog internal dsl_presoCascalog internal dsl_preso
Cascalog internal dsl_preso
Ā 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
Ā 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Training
Ā 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
Ā 
06 pig etl features
06 pig etl features06 pig etl features
06 pig etl features
Ā 
data.table and H2O at LondonR with Matt Dowle
data.table and H2O at LondonR with Matt Dowledata.table and H2O at LondonR with Matt Dowle
data.table and H2O at LondonR with Matt Dowle
Ā 
Introduction to Pig
Introduction to PigIntroduction to Pig
Introduction to Pig
Ā 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
Ā 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
Ā 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Ā 

Viewers also liked

EuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopEuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopMax Tepkeev
Ā 
Lesson from Dumbo
Lesson from DumboLesson from Dumbo
Lesson from DumboTri Nguyen
Ā 
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit AgarwalSuccinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit AgarwalSpark Summit
Ā 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduceFrane Bandov
Ā 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsSkillspeed
Ā 
Intro to Amazon S3
Intro to Amazon S3Intro to Amazon S3
Intro to Amazon S3Yu Lun Teo
Ā 
How To Run Mapreduce Jobs In Python
How To Run Mapreduce Jobs In PythonHow To Run Mapreduce Jobs In Python
How To Run Mapreduce Jobs In PythonYi Wang
Ā 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
Ā 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
Ā 

Viewers also liked (11)

EuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and HadoopEuroPython 2015 - Big Data with Python and Hadoop
EuroPython 2015 - Big Data with Python and Hadoop
Ā 
Lesson from Dumbo
Lesson from DumboLesson from Dumbo
Lesson from Dumbo
Ā 
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit AgarwalSuccinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Ā 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
Ā 
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce FundamentalsIntroduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals
Ā 
Intro to Amazon S3
Intro to Amazon S3Intro to Amazon S3
Intro to Amazon S3
Ā 
Deep Dive on Amazon S3
Deep Dive on Amazon S3Deep Dive on Amazon S3
Deep Dive on Amazon S3
Ā 
Amazon S3 Masterclass
Amazon S3 MasterclassAmazon S3 Masterclass
Amazon S3 Masterclass
Ā 
How To Run Mapreduce Jobs In Python
How To Run Mapreduce Jobs In PythonHow To Run Mapreduce Jobs In Python
How To Run Mapreduce Jobs In Python
Ā 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
Ā 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
Ā 

Similar to Scalable Hadoop with succinct Python: the best of both worlds

Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Sparkrhatr
Ā 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
Ā 
Hadoop breizhjug
Hadoop breizhjugHadoop breizhjug
Hadoop breizhjugDavid Morin
Ā 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightPaco Nathan
Ā 
Hadoop Spark - Reuniao SouJava 12/04/2014
Hadoop Spark - Reuniao SouJava 12/04/2014Hadoop Spark - Reuniao SouJava 12/04/2014
Hadoop Spark - Reuniao SouJava 12/04/2014soujavajug
Ā 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & HadoopJeffrey Breen
Ā 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
Ā 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Rajan Kanitkar
Ā 
Hadoop - A Very Short Introduction
Hadoop - A Very Short IntroductionHadoop - A Very Short Introduction
Hadoop - A Very Short Introductiondewang_mistry
Ā 
Data Science
Data ScienceData Science
Data ScienceSubhajit75
Ā 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2Wes Floyd
Ā 
Introduction to hadoop V2
Introduction to hadoop V2Introduction to hadoop V2
Introduction to hadoop V2TarjeiRomtveit
Ā 
Getting started big data
Getting started big dataGetting started big data
Getting started big dataKibrom Gebrehiwot
Ā 
HariKrishna4+_cv
HariKrishna4+_cvHariKrishna4+_cv
HariKrishna4+_cvrevuri
Ā 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
Ā 
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Cedric CARBONE
Ā 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
Ā 

Similar to Scalable Hadoop with succinct Python: the best of both worlds (20)

Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
Ā 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
Ā 
Hadoop breizhjug
Hadoop breizhjugHadoop breizhjug
Hadoop breizhjug
Ā 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Ā 
Hadoop Spark - Reuniao SouJava 12/04/2014
Hadoop Spark - Reuniao SouJava 12/04/2014Hadoop Spark - Reuniao SouJava 12/04/2014
Hadoop Spark - Reuniao SouJava 12/04/2014
Ā 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
Ā 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & Hadoop
Ā 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
Ā 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014
Ā 
Hadoop - A Very Short Introduction
Hadoop - A Very Short IntroductionHadoop - A Very Short Introduction
Hadoop - A Very Short Introduction
Ā 
Data Science
Data ScienceData Science
Data Science
Ā 
Unit V.pdf
Unit V.pdfUnit V.pdf
Unit V.pdf
Ā 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
Ā 
Hackathon bonn
Hackathon bonnHackathon bonn
Hackathon bonn
Ā 
Introduction to hadoop V2
Introduction to hadoop V2Introduction to hadoop V2
Introduction to hadoop V2
Ā 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
Ā 
HariKrishna4+_cv
HariKrishna4+_cvHariKrishna4+_cv
HariKrishna4+_cv
Ā 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Ā 
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Ā 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Ā 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash CourseDataWorks Summit
Ā 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
Ā 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
Ā 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
Ā 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
Ā 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
Ā 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
Ā 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
Ā 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
Ā 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
Ā 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
Ā 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
Ā 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Ā 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Ā 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
Ā 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
Ā 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
Ā 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Ā 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
Ā 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Ā 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
Ā 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
Ā 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Ā 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
Ā 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Ā 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
Ā 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
Ā 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
Ā 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Ā 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Ā 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Ā 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
Ā 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
Ā 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Ā 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
Ā 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Ā 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Ā 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Ā 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
Ā 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Ā 

Recently uploaded

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
Ā 
FULL ENJOY šŸ” 8264348440 šŸ” Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY šŸ” 8264348440 šŸ” Call Girls in Diplomatic Enclave | DelhiFULL ENJOY šŸ” 8264348440 šŸ” Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY šŸ” 8264348440 šŸ” Call Girls in Diplomatic Enclave | Delhisoniya singh
Ā 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
Ā 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
Ā 
Swan(sea) Song ā€“ personal research during my six years at Swansea ... and bey...
Swan(sea) Song ā€“ personal research during my six years at Swansea ... and bey...Swan(sea) Song ā€“ personal research during my six years at Swansea ... and bey...
Swan(sea) Song ā€“ personal research during my six years at Swansea ... and bey...Alan Dix
Ā 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
Ā 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
Ā 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
Ā 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
Ā 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
Ā 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
Ā 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
Ā 
#StandardsGoals for 2024: Whatā€™s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: Whatā€™s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: Whatā€™s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: Whatā€™s new for BISAC - Tech Forum 2024BookNet Canada
Ā 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
Ā 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
Ā 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
Ā 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
Ā 

Recently uploaded (20)

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
Ā 
FULL ENJOY šŸ” 8264348440 šŸ” Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY šŸ” 8264348440 šŸ” Call Girls in Diplomatic Enclave | DelhiFULL ENJOY šŸ” 8264348440 šŸ” Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY šŸ” 8264348440 šŸ” Call Girls in Diplomatic Enclave | Delhi
Ā 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
Ā 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
Ā 
Swan(sea) Song ā€“ personal research during my six years at Swansea ... and bey...
Swan(sea) Song ā€“ personal research during my six years at Swansea ... and bey...Swan(sea) Song ā€“ personal research during my six years at Swansea ... and bey...
Swan(sea) Song ā€“ personal research during my six years at Swansea ... and bey...
Ā 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
Ā 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
Ā 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Ā 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Ā 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Ā 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
Ā 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Ā 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
Ā 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
Ā 
#StandardsGoals for 2024: Whatā€™s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: Whatā€™s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: Whatā€™s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: Whatā€™s new for BISAC - Tech Forum 2024
Ā 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
Ā 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
Ā 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Ā 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
Ā 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
Ā 

Scalable Hadoop with succinct Python: the best of both worlds

  • 1. SCALABLE HADOOP WITH SUCCINCT PYTHON: THE BEST OF BOTH WORLDS Donald Miner @donaldpminer dminer@minerkasch.com Hadoop Summit 2015 1
  • 2. About Don Scalable Hadoop with succinct Python: the best of both worlds
  • 3. 3Scalable Hadoop with succinct Python: the best of both worlds
  • 4. Hadoop with Python? 4Scalable Hadoop with succinct Python: the best of both worlds
  • 5. The Good: Hadoop! ā€¢ Linear scalability ā€¢ Schema on read and unstructured data ā€¢ Transparent parallelism ā€¢ Open source We want the things we love about Hadoop to be available in Python, too! 5Scalable Hadoop with succinct Python: the best of both worlds
  • 6. The Good: Succinct code 6Scalable Hadoop with succinct Python: the best of both worlds
  • 7. The Good: Succinct code Faster development Less enterprise-y Lower barrier of entry 7Scalable Hadoop with succinct Python: the best of both worlds
  • 8. The Good: Interpreted, not compiled Change code in place Simpler continuous integration More platform portable 8Scalable Hadoop with succinct Python: the best of both worlds
  • 9. The Good: Python libraries for data HYPE (Python & data science) Tighter ā€œintegration" with data science Pandas, scikitlearn, nltk, numpy, scipy, gensim, matplotlib 9Scalable Hadoop with succinct Python: the best of both worlds
  • 10. The Bad: Python shortfalls Less enterprise-y Type safety No JVM sandbox safety Performance 10Scalable Hadoop with succinct Python: the best of both worlds
  • 11. Summary of Good & Bad Love Python/Hadoop for Data Science and Analysis Deal with Java/Hadoop for Data Engineering and production code The benefit of the good and the cost of the bad is different for everyone! 11Scalable Hadoop with succinct Python: the best of both worlds
  • 12. The Ugly Nothing in Hadoop is native to Python Performance can be awful due to serialization and IPC in most cases Python bindings are almost always behind, if they exist Clone some random dudeā€™s code from github and pray to Guido van Rossum that it compiles and/or works 12Scalable Hadoop with succinct Python: the best of both worlds
  • 14. mrjob ā€¢ Write MapReduce jobs in Python! ā€¢ Open sourced and maintained by Yelp ā€¢ Wraps ā€œHadoop Streamingā€ in cpython Python 2.5+ ā€¢ Well documented ā€¢ Can run locally, in Amazon EMR, or Hadoop 14 from mrjob.job import MRJob class MRWordFreqCount(MRJob): def mapper(self, _, line): for word in line.split(): yield (word.lower(), 1) def reducer(self, word, counts): yield (word, sum(counts)) Scalable Hadoop with succinct Python: the best of both worlds
  • 15. Pydoop ā€¢ Write MapReduce jobs in Python! ā€¢ Uses Hadoop C++ Pipes, which should be faster than wrapping streaming ā€¢ Actively being worked on ā€¢ Iā€™m not sure which is better 15 def mapper(_, v, writer): for word in v.split(): writer.emit(word.lower(), 1) def reducer(word, icounts, writer): writer.emit(word, sum(icounts)) Scalable Hadoop with succinct Python: the best of both worlds
  • 16. Python MapReduce options http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/ Hadoop Streaming ā€“ More manual but faster Hadoopy, Dumbo, havenā€™t seen commits in years, mrjob in the past 12 hours Pydoop is main competitor (not in this list) 16Scalable Hadoop with succinct Python: the best of both worlds
  • 17. Pig Pig is a higher-level platform and language for analyzing data that happens to run MapReduce underneath You can write Pig UDFs in Python. Let Pig control data flow and let Python deal with the data manipulation! Can use jython (faster) or cpython (access to more libs) 17 b = FOREACH a GENERATE revstr(phonenum); m = GROUP j BY username; n = FOREACH m GENERATE group, sortedconcat(j.tags); @outputSchema(ā€œtags:chararray") def sortedconcat(bag): out = set() for tag in bag: out.add(tag) return ā€˜-ā€™.join(sorted(out)) @outputSchema(ā€œrev:chararray") def revstr(instr): return instr[::-1] Scalable Hadoop with succinct Python: the best of both worlds
  • 18. ā€¢ A pure Python client that interacts with HDFS ā€¢ Handles most NameNode ops (moving/renaming files, deleting) ā€¢ Handles most DataNode reading ops (reading files, getmerge) ā€¢ Two ways to use: library and command line interface 18 from snakebite.client import Client client = Client(ā€1.2.3.4", 54310, use_trash=False) for x in client.ls(['/data']): print x print ā€˜ā€™.join(client.cat(ā€˜/data/ref/refdata*.csvā€™)) $ snakebite get /path/in/hdfs/mydata.txt /local/path/data.txt Scalable Hadoop with succinct Python: the best of both worlds
  • 19. Starbase or Happybase Uses the HBase Thrift gateway interface (slow) Last commit 6 months ago Not really there yet and have failed to gain community momentum. Java is still king. 19Scalable Hadoop with succinct Python: the best of both worlds
  • 20. PySpark ā€¢ Programming in Spark (and PySpark) is in the form of chaining transformations and actions on RDDs ā€¢ RDDs are ā€œResilient Distributed Datasetsā€ ā€¢ RDDs are kept in memory for the most part 20Scalable Hadoop with succinct Python: the best of both worlds
  • 21. PySpark Word Count Example import sys from operator import add from pyspark import SparkContext if __name__ == "__main__": if len(sys.argv) != 2: print >> sys.stderr, "Usage: wordcount <file>" exit(-1) sc = SparkContext(appName="PythonWordCount") lines = sc.textFile(sys.argv[1], 1) counts = lines.flatMap(lambda x: x.split(' ')) .map(lambda x: (x, 1)) .reduceByKey(add) output = counts.collect() for (word, count) in output: print "%s: %i" % (word, count) sc.stop() 21 Sparkā€™s native language is Scala, but it also supports Java and Python Python API is always a tad behind Scala Scalable Hadoop with succinct Python: the best of both worlds
  • 23. Cassandra 23Scalable Hadoop with succinct Python: the best of both worlds
  • 24. MongoDB 24Scalable Hadoop with succinct Python: the best of both worlds
  • 25. Whatā€™s wrong with this? Python bindings are almost always fringe projects Other parts of Hadoop ecosystem are getting way more $ Lack of commercial support Lack of cohesiveness in APIs and approaches 25Scalable Hadoop with succinct Python: the best of both worlds
  • 26. SCALABLE HADOOP WITH SUCCINCT PYTHON: THE BEST OF BOTH WORLDS Donald Miner @donaldpminer dminer@minerkasch.com Hadoop Summit 2015 26

Editor's Notes

  1. Working with Hadoop using Python instead of Java is entirely possible with a conglomeration of active open source projects. This talk will survey the key projects and show that not only is Hadoop with Python possible, but that it also has some advantages. With Python data analysts can leverage the scale of Hadoop while also leveraging the best of the best data analysis libraries available, most notably numpy, pandas, nltk, and scikit-learn. The following frameworks and tools will be surveyed: - Interacting with files in the Hadoop Distributed File System with snakebite - Writing MapReduce jobs with mrjob - Writing Pig Python UDFs - Interacting with HBase with starbase
  2. Joke about googling clip art, explaining the metaphor, and that it has nothing to do with this talk.
  3. Talk about what Hadoop with Python literally means here. It means using Python (instead of Java) to interact with Hadoop
  4. Working with Hadoop using Python instead of Java is entirely possible with a conglomeration of active open source projects. This talk will survey the key projects and show that not only is Hadoop with Python possible, but that it also has some advantages. With Python data analysts can leverage the scale of Hadoop while also leveraging the best of the best data analysis libraries available, most notably numpy, pandas, nltk, and scikit-learn. The following frameworks and tools will be surveyed: - Interacting with files in the Hadoop Distributed File System with snakebite - Writing MapReduce jobs with mrjob - Writing Pig Python UDFs - Interacting with HBase with starbase