Scaling out big-data computation & machine
learning using Pig, Python and Luigi
Ron Reiter
VP R&D, Crosswise
AGENDA
§  The goal
§  Data processing at Crosswise
§  The basics of prediction using machine learning
§  The “big data” stack
§  An introduction to Pig
§  Combining Pig and Python
§  Workflow management using Luigi and Amazon EMR
THE GOAL
1.  Process huge amounts of data points
2.  Allow data scientists to focus on their
research
3.  Adjust production systems according to
research conclusions quickly, without
duplicating logic between research and
production systems
DATA PROCESSING AT
CROSSWISE
§  We are building a graph of devices that belong to the
same user, based on browsing data of users
DATA PROCESSING AT
CROSSWISE
§  Interesting facts about our data processing
pipeline:
§  We process 1.5 trillion data points from 1 billion
devices
§  30TB of compressed data
§  Cluster with 1600 cores running for 24 hours
DATA PROCESSING AT
CROSSWISE
§  Our constraints
§  We are dealing with massive amounts of data, and we have to
go for a solid, proven and truly scalable solution
§  Our machine learning research team uses Python and sklearn
§  We are in a race against time (to market)
§  We do not want the overhead of maintaining two separate
processing pipelines, one for research and one for large-scale
prediction
PREDICTING AT SCALE
[Diagram] Model building phase (small / large scale): Labeled Data -> Train Model -> Evaluate Model -> Model
[Diagram] Prediction phase (massive scale): Unlabeled Data + Model -> Predict -> Output
PREDICTING AT SCALE
§  Steps
§  Training & evaluating the model (Iterations on training and
evaluation are done until the model’s performance is acceptable)
§  Predicting using the model at massive scale
§  Assumptions
§  Distributed learning is not required
§  Distributed prediction is required
§  Distributed learning can be achieved but not all machine
learning models support it, and not all infrastructures
know how to do it
THE “BIG DATA” STACK
Workflow Management:    Oozie, Luigi, Azkaban
High Level Language:    Pig, Hive, Scalding, Spark Program, GraphLab Script
Computation Framework:  MapReduce, Tez, Spark, GraphLab
Resource Manager:       YARN, Mesos
PIG
§  Pig is a high level, SQL-like language, which
runs on Hadoop
§  Pig also supports User Defined Functions written
in Java and Python
HOW DOES PIG WORK?
§  Pig converts SQL-like queries to MapReduce iterations
§  Pig builds a work plan based on a DAG it calculates
§  Newer versions of Pig know how to run on different
computation engines, such as Apache Tez and Spark
which offer a higher level of abstraction than MapReduce
[Diagram: the Pig Runner submitting a series of Map and Reduce stages]
PIG DIRECTIVES
The most common Pig directives are:
§  LOAD/STORE – Load and save data sets
§  FOREACH – map function which constructs a new row for
each row in a data set
§  FILTER – keeps or drops rows according to a given criterion
§  GROUP – groups rows by a specific column / set of columns
§  JOIN – join two data sets based on a specific column
And many more functions:
http://pig.apache.org/docs/r0.14.0/func.html
PIG CODE EXAMPLE
-- Load both tab-separated data sets
customers = LOAD 'customers.tsv' USING PigStorage('\t') AS
    (customer_id, first_name, last_name);
orders = LOAD 'orders.tsv' USING PigStorage('\t') AS
    (customer_id, price);
-- Total order price per customer
aggregated = FOREACH (GROUP orders BY customer_id) GENERATE
    group AS customer_id,
    SUM(orders.price) AS price_sum;
-- Pig joins with BY, not ON
joined = JOIN customers BY customer_id, aggregated BY customer_id;
STORE joined INTO 'customers_total.tsv' USING PigStorage('\t');
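To make the data flow concrete, here is a rough plain-Python equivalent of the Pig example, run on a few made-up rows. The sample customers and orders are illustrative assumptions; Pig would run the same GROUP, SUM and JOIN steps as distributed jobs rather than in memory.

```python
from collections import defaultdict

# Made-up sample rows standing in for customers.tsv and orders.tsv.
customers = [
    (1, "Ada", "Lovelace"),
    (2, "Alan", "Turing"),
]
orders = [(1, 10.0), (1, 5.5), (2, 20.0)]

# GROUP orders BY customer_id, then SUM(orders.price) per group.
price_sum = defaultdict(float)
for customer_id, price in orders:
    price_sum[customer_id] += price

# JOIN customers BY customer_id, aggregated BY customer_id (inner join).
# As in Pig, the joined row keeps the key column from both sides.
joined = [
    (cid, first, last, cid, price_sum[cid])
    for cid, first, last in customers
    if cid in price_sum
]

for row in joined:
    print("\t".join(str(field) for field in row))
```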
  
COMBINING PIG &
PYTHON
COMBINING PIG AND
PYTHON
§  Pig gives you the power to scale and
process data conveniently with an SQL-
like syntax
§  Python is easy and productive, and has
many useful scientific packages available
(sklearn, nltk, numpy, scipy, pandas)
MACHINE LEARNING IN PYTHON
USING SCIKIT-LEARN
PYTHON UDF
§  Pig provides two Python UDF (user-defined function)
engines: Jython (JVM) and CPython
§  Mortar (mortardata.com) added support for CPython
UDFs, which can use scientific packages (numpy, scipy,
sklearn, nltk, pandas, etc.)
§  A Python UDF is a function with a decorator that specifies
the output schema (since Python is dynamically typed, the
input schema is not required)
from pig_util import outputSchema

@outputSchema('value:int')
def multiply_by_two(num):
    return num * 2
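How a schema decorator of this kind can work is easy to sketch in plain Python. The `output_schema` function below is a hypothetical stand-in, not the real `pig_util.outputSchema`; it only shows the general idea of attaching a schema string to a function so that a caller can read it later.

```python
def output_schema(schema):
    """Hypothetical stand-in for pig_util.outputSchema (illustration only)."""
    def decorator(func):
        # Record the declared schema on the function object itself.
        func.output_schema = schema
        return func
    return decorator


@output_schema('value:int')
def multiply_by_two(num):
    return num * 2


print(multiply_by_two.output_schema)
print(multiply_by_two(21))
```

Because the decorator returns the original function, the UDF can still be called and unit-tested like any other Python function.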
  
USING THE PYTHON UDF
§  Register the Python UDF:
    REGISTER 'udfs.py' USING streaming_python AS udfs;
§  If you prefer speed over package compatibility, use Jython:
    REGISTER 'udfs.py' USING jython AS udfs;
§  Then, use the UDF within a Pig expression:
    processed = FOREACH data GENERATE udfs.multiply_by_two(num);
CONNECT PIG AND PYTHON
JOBS
§  In many common scenarios, especially in machine
learning, a classifier can usually be trained using a simple
Python script
§  Using the classifier we trained, we can now predict on a
massive scale using a Python UDF
§  Requires a higher-level workflow manager, such as Luigi
[Diagram: PYTHON JOB -> PICKLED MODEL (S3://model.pkl) -> PYTHON UDF inside the PIG JOB]
WORKFLOW MANAGEMENT
[Diagram: Task A, Task B and Task C linked by REQUIRES; tasks OUTPUT and USE targets stored on S3, HDFS, SFTP, a local FILE or a DB; the chain as a whole is the DATA FLOW]
WORKFLOW MANAGEMENT
WITH LUIGI
§  Unlike Oozie and Azkaban, which are heavyweight workflow
managers, Luigi is more of a Python package.
§  Luigi works by resolving dependencies, similar to a
Makefile (or SCons)
§  Luigi defines an interface of “Tasks” and “Targets”, which
we use to connect the training and prediction tasks through dependencies.
[Diagram, repeated for each date (2014-01-01, 2014-01-02): LABELED LOGS -> TRAINED MODEL; UNLABELED LOGS + TRAINED MODEL -> OUTPUT]
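The Makefile-like behaviour can be illustrated without Luigi itself. The sketch below is a simplified plain-Python model of the idea, not Luigi's actual API: the `FileTarget`, `Task` and `build` names are assumptions made for illustration. A task runs only when its output target is missing, and its dependencies are built first.

```python
import os
import tempfile


class FileTarget:
    """A target is 'done' when its file exists."""
    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class Task:
    """Minimal task interface: requires(), output(), run()."""
    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError


def build(task, log):
    """Run `task` after its dependencies, skipping anything already done."""
    if task.output().exists():
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)


workdir = tempfile.mkdtemp()


class LabeledLogs(Task):
    def output(self):
        return FileTarget(os.path.join(workdir, 'labeled.csv'))

    def run(self):
        with open(self.output().path, 'w') as fd:
            fd.write('a,b,c,class\n1,2,3,0\n')


class TrainedModel(Task):
    def requires(self):
        return [LabeledLogs()]

    def output(self):
        return FileTarget(os.path.join(workdir, 'model.txt'))

    def run(self):
        with open(self.output().path, 'w') as fd:
            fd.write('trained')


first_run, second_run = [], []
build(TrainedModel(), first_run)   # runs both tasks, dependencies first
build(TrainedModel(), second_run)  # nothing to do: the output already exists
print(first_run, second_run)
```

Checking for the output before doing any work is what makes re-running a pipeline cheap: a second invocation for the same date finds every target in place and does nothing.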
EXAMPLE - TRAIN MODEL
LUIGI TASK
§  Let’s see how it’s done:
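import pickle

import luigi
import pandas
from luigi.contrib.s3 import S3Target  # luigi.s3 in older Luigi releases
from sklearn.linear_model import SGDClassifier

class TrainModel(luigi.Task):
    target_date = luigi.DateParameter()

    def requires(self):
        return LabelledLogs(self.target_date)

    def output(self):
        return S3Target('s3://mybucket/model_%s.pkl' % self.target_date)

    def run(self):
        clf = SGDClassifier()
        # self.input() is the Target produced by LabelledLogs
        with self.input().open('r') as logs:
            df = pandas.read_csv(logs)
        clf.fit(df[["a", "b", "c"]].values, df["class"].values)
        # pickle produces bytes; on Python 3 the target needs a binary
        # format such as luigi.format.Nop
        with self.output().open('w') as fd:
            fd.write(pickle.dumps(clf))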
  
PREDICT RESULTS LUIGI
TASK
§  We predict using a Pig task which has access to the pickled
model:
import luigi
from luigi.contrib.s3 import S3Target  # luigi.s3 in older Luigi releases

# PigTask is a base class for Luigi tasks that run Pig scripts (defined elsewhere)
class PredictResults(PigTask):
    PIG_SCRIPT = r"""
        REGISTER 'predict.py' USING streaming_python AS udfs;
        data = LOAD '$INPUT' USING PigStorage('\t');
        predicted = FOREACH data GENERATE user_id, udfs.predict_results(*);
        STORE predicted INTO '$OUTPUT' USING PigStorage('\t');
    """
    PYTHON_UDF = 'predict.py'

    target_date = luigi.DateParameter()

    def requires(self):
        return {'logs': UnlabelledLogs(self.target_date),
                'model': TrainModel(self.target_date)}

    def output(self):
        return S3Target('s3://mybucket/results_%s.tsv' % self.target_date)
  
PREDICTION PIG USER-DEFINED
FUNCTION (PYTHON)
§  We can then generate a custom UDF while replacing the
$MODEL with an actual model file.
§  The model will be loaded when the UDF is initialized (this will
happen on every map/reduce task using the UDF)
	
  	
  	
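from pig_util import outputSchema
import numpy, pickle

# download_s3 is a helper (not shown) returning a file object for the model;
# $MODEL is replaced with the model location when the UDF is generated
clf = pickle.load(download_s3('$MODEL'))

@outputSchema('value:int')
def predict_results(feature_vector):
    # predict() expects a 2-D array: one row per sample
    return int(clf.predict(numpy.array([feature_vector]))[0])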
  
PITFALLS
§  For the classifier to work on your Hadoop
cluster, you have to install the required
packages on all of your Hadoop nodes
(numpy, sklearn, etc.)
§  Sending arguments to a UDF is tricky;
there is no way to initialize a UDF with
arguments. To load a classifier to a UDF,
you should generate the UDF using a
template with the model you wish to use
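One way to generate the UDF from a template is Python's standard `string.Template`, which uses the same `$NAME` placeholder style. This is a minimal sketch under assumptions: the template text, the `render_udf` helper and the example S3 path are illustrative, and `download_s3` is the helper named on the slide, which is not defined here.

```python
from string import Template

# Hypothetical UDF template; '$MODEL' is the placeholder to fill in.
# download_s3 is the helper named on the slide and is not defined here.
UDF_TEMPLATE = Template('''\
from pig_util import outputSchema
import numpy, pickle

clf = pickle.load(download_s3('$MODEL'))
''')


def render_udf(model_path):
    """Return UDF source code with the model location filled in."""
    return UDF_TEMPLATE.substitute(MODEL=model_path)


udf_source = render_udf('s3://mybucket/model_2014-01-01.pkl')
print(udf_source)
```

The rendered source can then be written to a file such as `predict.py` and registered from the Pig script.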
CLUSTER PROVISIONING
WITH LUIGI
§  To conserve resources, we use clusters only when needed, so
we created the StartCluster task:
    [Diagram: PigTask REQUIRES StartCluster; StartCluster OUTPUTS a ClusterTarget; PigTask USES the ClusterTarget]
§  With this mechanism in place, we also have a cron job that kills idle
clusters and saves even more money.
§  We use both EMR clusters and clusters provisioned by Xplenty,
which provides us with its Hadoop provisioning infrastructure.
USING LUIGI WITH OTHER
COMPUTATION ENGINES
§  Luigi acts like the “glue” of data pipelines, and we use it to
interconnect Pig and GraphLab jobs
§  Pig is very convenient for large scale data processing, but it is very
weak when it comes to graph analysis and iterative computation
§  One of the main disadvantages of Pig is that it has no conditional
statements, so we need to use other tools to complete our arsenal
[Diagram: Pig task -> Pig task -> GraphLab task]
GRAPHLAB AT CROSSWISE
§  We use GraphLab to run graph processing at scale – for
example, to run connected components and create
“users” from a graph of devices that belong to the same
user
PYTHON API
§  Pig is a “data flow” language rather than a general-purpose
programming language. Its abilities are limited: there are no
conditional blocks or loops. Loops are required when iterating until
“convergence”, such as when finding connected
components in a graph. To overcome this limitation, Pig can be
embedded in Python (Jython) through a scripting API.
from org.apache.pig.scripting import Pig

P = Pig.compile(
    "A = LOAD '$input' AS (name, age, gpa);" +
    "STORE A INTO '$output';")

Q = P.bind({
    'input': 'input.csv',
    'output': 'output.csv'})

result = Q.runSingle()
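Why a loop is needed shows up even in a toy version of connected components. The sketch below is plain Python, not the Pig embedding API, and the device ids are invented; it keeps propagating the smallest label across edges until a full pass changes nothing, which is the convergence check that Pig Latin alone cannot express.

```python
def connected_components(edges):
    """Label each node with the smallest node id in its component."""
    labels = {}
    for a, b in edges:
        labels.setdefault(a, a)
        labels.setdefault(b, b)

    changed = True
    while changed:  # the loop that plain Pig Latin cannot express
        changed = False
        for a, b in edges:
            smallest = min(labels[a], labels[b])
            if labels[a] != smallest or labels[b] != smallest:
                labels[a] = labels[b] = smallest
                changed = True
    return labels


# Hypothetical device graph: an edge means "same user".
edges = [("d1", "d2"), ("d2", "d3"), ("d4", "d5")]
labels = connected_components(edges)
print(labels)
```

In an embedded Pig script, each pass of the `while` loop would correspond to binding and running a compiled Pig job, with the driver deciding whether another iteration is needed.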
  
CROSSWISE HADOOP
SSH JOB RUNNER
STANDARD LUIGI
WORKFLOW
§  Standard Luigi Hadoop tasks need a
correctly configured Hadoop client to
launch jobs.
§  This can be a pain when running an
automatically provisioned Hadoop cluster
(e.g. an EMR cluster).
[Diagram: a HADOOP MASTER NODE and two HADOOP SLAVE NODEs; boxes labeled LUIGI, NAMENODE, HADOOP CLIENT and JOB TRACKER]
LUIGI HADOOP SSH
RUNNER
§  At Crosswise, we implemented a Luigi task that runs
Hadoop JARs (e.g. Pig) remotely, much as the Amazon
EMR API does.
§  Instead of launching steps through the EMR API, we
implemented our own runner so that steps can run
concurrently.
[Diagram: LUIGI connects over API / SSH to two HADOOP CLIENT INSTANCEs; also shown are a CLUSTER MASTER NODE and two EMR SLAVE NODEs]
WHY RUN HADOOP JOBS
EXTERNALLY?
Working with the EMR API is convenient, but Luigi expects to launch
jobs from the master node rather than through the EMR job
submission API
Advantages:
§  Does not require a locally configured Hadoop client
§  Allows clusters to be provisioned as a task (using Amazon
EMR’s API, for example)
§  The same Luigi process can use several Hadoop clusters
at once
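The slides do not show the runner itself, so the following is only a sketch of how a remote launch over SSH might be assembled. The `build_remote_pig_command` helper, the host name and the paths are all assumptions; the only real interfaces used are the `ssh` command and Pig's `-param` and `-f` command-line options.

```python
import shlex


def build_remote_pig_command(host, script_path, params):
    """Build an ssh command line that runs a Pig script on `host`."""
    pig_args = ['pig']
    for name, value in sorted(params.items()):
        pig_args += ['-param', '%s=%s' % (name, value)]
    pig_args += ['-f', script_path]
    # Quote each argument so the remote shell sees it as one word.
    remote = ' '.join(shlex.quote(arg) for arg in pig_args)
    return ['ssh', host, remote]


command = build_remote_pig_command(
    'hadoop@master.example.com',          # hypothetical master node
    '/home/hadoop/predict.pig',           # hypothetical script location
    {'INPUT': 's3://mybucket/logs', 'OUTPUT': 's3://mybucket/results'},
)
print(command)
# The list can be handed to subprocess.run(command, check=True).
```

Because each command targets a host explicitly, one process can hold several such commands for different clusters and run them side by side.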
NEXT STEPS AT CROSSWISE
§  We are planning on moving to Apache Tez since
MapReduce has a high overhead for
complicated processes, and it is hard to tweak
and utilize the framework properly
§  We are also investigating Dato’s distributed data
processing, training and prediction capabilities at
scale (using GraphLab Create)
QUESTIONS?
THANK YOU!

More Related Content

What's hot

Monitoring Spark Applications
Monitoring Spark ApplicationsMonitoring Spark Applications
Monitoring Spark Applications
Tzach Zohar
 
Google apps script introduction
Google apps script introductionGoogle apps script introduction
Google apps script introduction
KAI CHU CHUNG
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
Spark Summit
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
Chun-Kai Wang
 
(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby
Amazon Web Services
 
Scaling Self Service Analytics with Databricks and Apache Spark with Amelia C...
Scaling Self Service Analytics with Databricks and Apache Spark with Amelia C...Scaling Self Service Analytics with Databricks and Apache Spark with Amelia C...
Scaling Self Service Analytics with Databricks and Apache Spark with Amelia C...
Databricks
 
Synchronize applications with akeneo/batch
Synchronize applications with akeneo/batchSynchronize applications with akeneo/batch
Synchronize applications with akeneo/batch
gplanchat
 
Performance Optimization 101 for Ruby developers - Nihad Abbasov (ENG) | Ruby...
Performance Optimization 101 for Ruby developers - Nihad Abbasov (ENG) | Ruby...Performance Optimization 101 for Ruby developers - Nihad Abbasov (ENG) | Ruby...
Performance Optimization 101 for Ruby developers - Nihad Abbasov (ENG) | Ruby...
Ruby Meditation
 
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
Databricks
 
Beyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden KarauBeyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden Karau
Spark Summit
 
Hw09 Building Data Intensive Apps A Closer Look At Trending Topics.Org
Hw09   Building Data Intensive Apps  A Closer Look At Trending Topics.OrgHw09   Building Data Intensive Apps  A Closer Look At Trending Topics.Org
Hw09 Building Data Intensive Apps A Closer Look At Trending Topics.OrgCloudera, Inc.
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
Neville Li
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
Julian Hyde
 
Automated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative InfrastructureAutomated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative Infrastructure
Spark Summit
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Holden Karau
 
Unsupervised Aspect Based Sentiment Analysis at Scale
Unsupervised Aspect Based Sentiment Analysis at ScaleUnsupervised Aspect Based Sentiment Analysis at Scale
Unsupervised Aspect Based Sentiment Analysis at Scale
Aaron (Ari) Bornstein
 
Google Cloud Platform for Data Science teams
Google Cloud Platform for Data Science teamsGoogle Cloud Platform for Data Science teams
Google Cloud Platform for Data Science teams
Barton Rhodes
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Databricks
 
Using Task Queues and D3.js to build an analytics product on App Engine
Using Task Queues and D3.js to build an analytics product on App EngineUsing Task Queues and D3.js to build an analytics product on App Engine
Using Task Queues and D3.js to build an analytics product on App Engine
River of Talent
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
Databricks
 

What's hot (20)

Monitoring Spark Applications
Monitoring Spark ApplicationsMonitoring Spark Applications
Monitoring Spark Applications
 
Google apps script introduction
Google apps script introductionGoogle apps script introduction
Google apps script introduction
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby
 
Scaling Self Service Analytics with Databricks and Apache Spark with Amelia C...
Scaling Self Service Analytics with Databricks and Apache Spark with Amelia C...Scaling Self Service Analytics with Databricks and Apache Spark with Amelia C...
Scaling Self Service Analytics with Databricks and Apache Spark with Amelia C...
 
Synchronize applications with akeneo/batch
Synchronize applications with akeneo/batchSynchronize applications with akeneo/batch
Synchronize applications with akeneo/batch
 
Performance Optimization 101 for Ruby developers - Nihad Abbasov (ENG) | Ruby...
Performance Optimization 101 for Ruby developers - Nihad Abbasov (ENG) | Ruby...Performance Optimization 101 for Ruby developers - Nihad Abbasov (ENG) | Ruby...
Performance Optimization 101 for Ruby developers - Nihad Abbasov (ENG) | Ruby...
 
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
 
Beyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden KarauBeyond Parallelize and Collect by Holden Karau
Beyond Parallelize and Collect by Holden Karau
 
Hw09 Building Data Intensive Apps A Closer Look At Trending Topics.Org
Hw09   Building Data Intensive Apps  A Closer Look At Trending Topics.OrgHw09   Building Data Intensive Apps  A Closer Look At Trending Topics.Org
Hw09 Building Data Intensive Apps A Closer Look At Trending Topics.Org
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
 
Automated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative InfrastructureAutomated Spark Deployment With Declarative Infrastructure
Automated Spark Deployment With Declarative Infrastructure
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
 
Unsupervised Aspect Based Sentiment Analysis at Scale
Unsupervised Aspect Based Sentiment Analysis at ScaleUnsupervised Aspect Based Sentiment Analysis at Scale
Unsupervised Aspect Based Sentiment Analysis at Scale
 
Google Cloud Platform for Data Science teams
Google Cloud Platform for Data Science teamsGoogle Cloud Platform for Data Science teams
Google Cloud Platform for Data Science teams
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
Using Task Queues and D3.js to build an analytics product on App Engine
Using Task Queues and D3.js to build an analytics product on App EngineUsing Task Queues and D3.js to build an analytics product on App Engine
Using Task Queues and D3.js to build an analytics product on App Engine
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
 

Similar to BDX 2015 - Scaling out big-data computation & machine learning using Pig, Python and Luigi

Yaetos Tech Overview
Yaetos Tech OverviewYaetos Tech Overview
Yaetos Tech Overview
prevota
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Jason Dai
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
zmhassan
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Databricks
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Ahmet Bulut
 
Yaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdfYaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdf
prevota
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
Holden Karau
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
felixcss
 
Day 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramDay 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers Program
FIWARE
 
Spark on Dataproc - Israel Spark Meetup at taboola
Spark on Dataproc - Israel Spark Meetup at taboolaSpark on Dataproc - Israel Spark Meetup at taboola
Spark on Dataproc - Israel Spark Meetup at taboola
tsliwowicz
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
Amir Sedighi
 
Data herding
Data herdingData herding
Data herding
unbracketed
 
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides:  Let's build macOS CLI Utilities using SwiftMobileConf 2021 Slides:  Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
Diego Freniche Brito
 
Automation in ArcGIS using Arcpy
Automation in ArcGIS using ArcpyAutomation in ArcGIS using Arcpy
Automation in ArcGIS using ArcpyGeodata AS
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Holden Karau
 
ANSI SQL - a shortcut to Microsoft SQL Server/Azure SQL Database for Intersho...
ANSI SQL - a shortcut to Microsoft SQL Server/Azure SQL Database for Intersho...ANSI SQL - a shortcut to Microsoft SQL Server/Azure SQL Database for Intersho...
ANSI SQL - a shortcut to Microsoft SQL Server/Azure SQL Database for Intersho...
Jens Kleinschmidt
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
Sanket Shikhar
 
Session 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers ProgramSession 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers Program
FIWARE
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Holden Karau
 

Similar to BDX 2015 - Scaling out big-data computation & machine learning using Pig, Python and Luigi (20)

Yaetos Tech Overview
Yaetos Tech OverviewYaetos Tech Overview
Yaetos Tech Overview
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Yaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdfYaetos_Meetup_SparkBCN_v1.pdf
Yaetos_Meetup_SparkBCN_v1.pdf
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 
Day 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramDay 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers Program
 
Spark on Dataproc - Israel Spark Meetup at taboola
Spark on Dataproc - Israel Spark Meetup at taboolaSpark on Dataproc - Israel Spark Meetup at taboola
Spark on Dataproc - Israel Spark Meetup at taboola
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Data herding
Data herdingData herding
Data herding
 
Data herding
Data herdingData herding
Data herding
 
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides:  Let's build macOS CLI Utilities using SwiftMobileConf 2021 Slides:  Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
 
Automation in ArcGIS using Arcpy
Automation in ArcGIS using ArcpyAutomation in ArcGIS using Arcpy
Automation in ArcGIS using Arcpy
 
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016Getting started with Apache Spark in Python - PyLadies Toronto 2016
Getting started with Apache Spark in Python - PyLadies Toronto 2016
 
ANSI SQL - a shortcut to Microsoft SQL Server/Azure SQL Database for Intersho...
ANSI SQL - a shortcut to Microsoft SQL Server/Azure SQL Database for Intersho...ANSI SQL - a shortcut to Microsoft SQL Server/Azure SQL Database for Intersho...
ANSI SQL - a shortcut to Microsoft SQL Server/Azure SQL Database for Intersho...
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Session 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers ProgramSession 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers Program
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
 





BDX 2015 - Scaling out big-data computation & machine learning using Pig, Python and Luigi

  • 2. Scaling out big-data computation & machine learning using Pig, Python and Luigi Ron Reiter VP R&D, Crosswise
  • 3. AGENDA §  The goal §  Data processing at Crosswise §  The basics of prediction using machine learning §  The “big data” stack §  An introduction to Pig §  Combining Pig and Python §  Workflow management using Luigi and Amazon EMR
  • 4. THE GOAL 1.  Process huge amounts of data points 2.  Allow data scientists to focus on their research 3.  Adjust production systems according to research conclusions quickly, without duplicating logic between research and production systems
  • 5. DATA PROCESSING AT CROSSWISE §  We are building a graph of devices that belong to the same user, based on browsing data of users
  • 6. DATA PROCESSING AT CROSSWISE §  Interesting facts about our data processing pipeline: §  We process 1.5 trillion data points from 1 billion devices §  30TB of compressed data §  Cluster with 1600 cores running for 24 hours
  • 7. DATA PROCESSING AT CROSSWISE §  Our constraints §  We are dealing with massive amounts of data, and we have to go for a solid, proven and truly scalable solution §  Our machine learning research team uses Python and sklearn §  We are in a race against time (to market) §  We do not want the overhead of maintaining two separate processing pipelines, one for research and one for large-scale prediction
  • 8. PREDICTING AT SCALE [Diagram] Model building phase (small / large scale): Labeled Data → Train Model → Evaluate Model → Model. Prediction phase (massive scale): Unlabeled Data + Model → Predict → Output
  • 9. PREDICTING AT SCALE §  Steps §  Training & evaluating the model (Iterations on training and evaluation are done until the model’s performance is acceptable) §  Predicting using the model at massive scale §  Assumptions §  Distributed learning is not required §  Distributed prediction is required §  Distributed learning can be achieved but not all machine learning models support it, and not all infrastructures know how to do it
  • 10. THE “BIG DATA” STACK [Diagram of stack layers] Resource Manager: YARN, Mesos. Computation Framework: MapReduce, Tez, Spark, GraphLab. High Level Language: Pig, Hive, Scalding, Spark Program, GraphLab Script. Workflow Management: Oozie, Luigi, Azkaban
  • 11. PIG §  Pig is a high level, SQL-like language, which runs on Hadoop §  Pig also supports User Defined Functions written in Java and Python
  • 12. HOW DOES PIG WORK? §  Pig converts SQL-like queries to MapReduce iterations §  Pig builds a work plan based on a DAG it calculates §  Newer versions of Pig can run on other computation engines, such as Apache Tez and Spark, which offer a higher level of abstraction than MapReduce [Diagram: a Pig Runner driving a series of Map and Reduce stages]
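To make the idea of a "MapReduce iteration" concrete, here is a minimal in-memory sketch of the three phases that a grouped count goes through. The data, the function names and the `device_id` field are assumptions made for illustration; this is not how Pig or Hadoop are implemented, only the shape of the computation.

```python
from collections import defaultdict

def map_phase(rows):
    """Emit one (key, value) pair per input row: (device_id, 1)."""
    for device_id, _url in rows:
        yield device_id, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Collapse the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical browsing events: (device_id, url)
events = [("d1", "a.com"), ("d2", "b.com"), ("d1", "c.com")]
counts = reduce_phase(shuffle(map_phase(events)))
print(counts)
```

A query with several GROUP or JOIN steps would chain several such rounds, which is what the DAG-based work plan describes.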
  • 13. PIG DIRECTIVES The most common Pig directives are: §  LOAD/STORE – load and save data sets §  FOREACH – a map function which constructs a new row for each row in a data set §  FILTER – keeps or drops rows that match a given criterion §  GROUP – groups rows by a specific column / set of columns §  JOIN – joins two data sets based on a specific column And many more functions: http://pig.apache.org/docs/r0.14.0/func.html
  • 14. PIG CODE EXAMPLE

        customers = LOAD 'customers.tsv' USING PigStorage('\t') AS
            (customer_id, first_name, last_name);
        orders = LOAD 'orders.tsv' USING PigStorage('\t') AS
            (customer_id, price);
        aggregated = FOREACH (GROUP orders BY customer_id) GENERATE
            group AS customer_id,
            SUM(orders.price) AS price_sum;
        joined = JOIN customers BY customer_id, aggregated BY customer_id;
        STORE joined INTO 'customers_total.tsv' USING PigStorage('\t');
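To show what the script above produces, here is a small in-memory sketch of the same GROUP, SUM and JOIN steps. The sample rows are made up for illustration. Two things worth noticing: the default Pig JOIN is an inner join, so a customer without orders drops out, and the joined rows carry the fields of both relations, so the key appears twice.

```python
# Hypothetical sample data standing in for customers.tsv and orders.tsv
customers = [
    ("c1", "Ada", "Lovelace"),
    ("c2", "Alan", "Turing"),
    ("c3", "Grace", "Hopper"),   # has no orders
]
orders = [("c1", 10.0), ("c2", 5.0), ("c1", 2.5)]

# GROUP orders BY customer_id, then SUM(orders.price) per group
price_sum = {}
for customer_id, price in orders:
    price_sum[customer_id] = price_sum.get(customer_id, 0.0) + price

# Inner JOIN on customer_id: keep only customers that have an aggregated row.
# Each output row holds the customer fields followed by the aggregated fields.
joined = [
    (cid, first_name, last_name, cid, price_sum[cid])
    for cid, first_name, last_name in customers
    if cid in price_sum
]

for row in joined:
    print("\t".join(str(field) for field in row))
```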
  • 16. COMBINING PIG AND PYTHON §  Pig gives you the power to scale and process data conveniently with an SQL-like syntax §  Python is easy and productive, and has many useful scientific packages available (sklearn, nltk, numpy, scipy, pandas)
  • 17. MACHINE LEARNING IN PYTHON USING SCIKIT-LEARN
  • 18. PYTHON UDF
    §  Pig provides two Python UDF (user-defined function) engines: Jython (JVM) and CPython
    §  Mortar (mortardata.com) added support for CPython UDFs, which support scientific packages (numpy, scipy, sklearn, nltk, pandas, etc.)
    §  A Python UDF is a function with a decorator that specifies the output schema (since Python is dynamically typed, the input schema is not required):

        from pig_util import outputSchema

        @outputSchema('value:int')
        def multiply_by_two(num):
            return num * 2
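Because a UDF is an ordinary Python function, its logic can be exercised locally before it is shipped to a cluster. The sketch below assumes `pig_util` is not importable on the development machine and defines a hypothetical stand-in decorator; the `output_schema` attribute name is an assumption of this sketch, and the real `pig_util.outputSchema` may store the schema differently.

```python
def outputSchema(schema):
    """Hypothetical local stand-in for pig_util.outputSchema.

    It only records the schema string on the function so that the UDF
    body can be imported and tested outside of Pig.
    """
    def decorator(func):
        func.output_schema = schema  # attribute name is an assumption
        return func
    return decorator

@outputSchema('value:int')
def multiply_by_two(num):
    return num * 2

# The decorated function still behaves like a plain function.
print(multiply_by_two(21), multiply_by_two.output_schema)
```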
  • 19. USING THE PYTHON UDF
    §  Register the Python UDF:
        REGISTER 'udfs.py' USING streaming_python AS udfs;
    §  If you prefer speed over package compatibility, use Jython:
        REGISTER 'udfs.py' USING jython AS udfs;
    §  Then, use the UDF within a Pig expression:
        processed = FOREACH data GENERATE udfs.multiply_by_two(num);
  • 20. CONNECT PIG AND PYTHON JOBS §  In many common scenarios, especially in machine learning, a classifier can be trained using a simple Python script §  Using the classifier we trained, we can now predict on a massive scale using a Python UDF §  Requires a higher-level workflow manager, such as Luigi [Diagram: a Python job produces a pickled model at S3://model.pkl, which the Python UDF inside a Pig job uses]
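The hand-off between the two jobs can be sketched with nothing but `pickle`: one side trains and serializes, the other deserializes and predicts row by row. The threshold "model" below is a toy stand-in for an sklearn classifier, and all names and data are assumptions made for illustration; in the deck the bytes would live in S3 rather than in a variable.

```python
import pickle

def train(samples):
    """Toy training step: put a threshold halfway between the class means."""
    positives = [x for x, label in samples if label == 1]
    negatives = [x for x, label in samples if label == 0]
    mean_pos = sum(positives) / len(positives)
    mean_neg = sum(negatives) / len(negatives)
    return {"threshold": (mean_pos + mean_neg) / 2}

# Python job: train once on a small labeled set and serialize the model.
labeled = [(1.0, 0), (3.0, 0), (7.0, 1), (9.0, 1)]
model_bytes = pickle.dumps(train(labeled))

# Pig job side: each task loads the same serialized model once,
# then calls predict() for every row it processes.
model = pickle.loads(model_bytes)

def predict(x):
    return 1 if x >= model["threshold"] else 0

predictions = [predict(x) for x in [2.0, 8.0]]
print(predictions)
```

Training stays on a single machine while prediction is spread across many tasks, which matches the assumption that only distributed prediction is required.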
  • 21. WORKFLOW MANAGEMENT [Diagram: data flow through Task A, Task B and Task C, linked by REQUIRES edges; tasks OUTPUT to and USE storage targets (S3, HDFS, SFTP, file, DB)]
  • 22. WORKFLOW MANAGEMENT WITH LUIGI §  Unlike Oozie and Azkaban, which are heavy workflow managers, Luigi is more of a Python package. §  Luigi works by dependency resolution, similar to a Makefile (or SCons) §  Luigi defines an interface of “Tasks” and “Targets”, which we use to connect the two tasks using dependencies. [Diagram: for each date (2014-01-01, 2014-01-02), LABELED LOGS feed TRAINED MODEL, and UNLABELED LOGS together with TRAINED MODEL feed OUTPUT]
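The Makefile-like behaviour can be illustrated without Luigi itself. The sketch below uses a hypothetical `Task` class and `build` function (these are not Luigi's API) to show the core rule: run a task only after its requirements, and skip anything whose output already exists.

```python
class Task:
    """Hypothetical stand-in for a workflow task; not Luigi's actual class."""
    def __init__(self, name, requires=()):
        self.name = name
        self.requires = list(requires)

def build(task, done, log):
    """Run `task` after its dependencies, skipping tasks that are already done."""
    if task.name in done:
        return                      # output exists, nothing to do
    for dependency in task.requires:
        build(dependency, done, log)
    log.append(task.name)           # "run" the task
    done.add(task.name)             # its output now exists

labeled = Task("labeled_logs")
unlabeled = Task("unlabeled_logs")
model = Task("trained_model", [labeled])
output = Task("output", [unlabeled, model])

done = {"labeled_logs"}             # pretend this target already exists
log = []
build(output, done, log)
print(log)
```

In Luigi the "already done" check is answered by each task's output Target, which is why re-running a pipeline only executes the missing pieces.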
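  • 23. EXAMPLE - TRAIN MODEL LUIGI TASK §  Let’s see how it’s done:

        import pickle

        import luigi
        import pandas
        from luigi.contrib.s3 import S3Target  # module path in recent Luigi releases
        from sklearn.linear_model import SGDClassifier

        class TrainModel(luigi.Task):
            target_date = luigi.DateParameter()

            def requires(self):
                return LabelledLogs(self.target_date)

            def output(self):
                # Nop format keeps the target binary-safe for the pickled bytes
                return S3Target('s3://mybucket/model_%s.pkl' % self.target_date,
                                format=luigi.format.Nop)

            def run(self):
                # Read the labelled logs produced by the required task
                with self.input().open('r') as in_fd:
                    df = pandas.read_csv(in_fd)
                clf = SGDClassifier()
                clf.fit(df[["a", "b", "c"]].values, df["class"].values)
                # Write the model only after training has succeeded
                with self.output().open('w') as fd:
                    fd.write(pickle.dumps(clf))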
  • 24. PREDICT RESULTS LUIGI TASK §  We predict using a Pig task which has access to the pickled model:

        import luigi

        # PigTask and S3Target are not defined on this slide
        class PredictResults(PigTask):
            # '\\t' so that Pig, not Python, interprets the tab escape
            PIG_SCRIPT = """
                REGISTER 'predict.py' USING streaming_python AS udfs;
                data = LOAD '$INPUT' USING PigStorage('\\t');
                predicted = FOREACH data GENERATE user_id, udfs.predict_results(*);
                STORE predicted INTO '$OUTPUT' USING PigStorage('\\t');
            """
            PYTHON_UDF = 'predict.py'

            target_date = luigi.DateParameter()

            def requires(self):
                return {'logs': UnlabelledLogs(self.target_date),
                        'model': TrainModel(self.target_date)}

            def output(self):
                return S3Target('s3://mybucket/results_%s.tsv' % self.target_date)
  • 25. PREDICTION PIG USER-DEFINED FUNCTION (PYTHON) §  We can then generate a custom UDF, replacing $MODEL with an actual model file. §  The model will be loaded when the UDF is initialized (this will happen on every map/reduce task using the UDF)

        from pig_util import outputSchema
        import numpy, pickle

        # $MODEL is replaced with the model path when the UDF is generated;
        # download_s3 is a helper that is not shown on the slide
        clf = pickle.load(download_s3('$MODEL'))

        @outputSchema('value:int')
        def predict_results(feature_vector):
            # predict() expects a 2-D array: one row per sample
            return clf.predict(numpy.array(feature_vector).reshape(1, -1))[0]
  • 26. PITFALLS §  For the classifier to work on your Hadoop cluster, you have to install the required packages on all of your Hadoop nodes (numpy, sklearn, etc.) §  Sending arguments to a UDF is tricky; there is no way to initialize a UDF with arguments. To load a classifier to a UDF, you should generate the UDF using a template with the model you wish to use
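One way to generate the UDF from a template, as the second pitfall suggests, is the standard library's `string.Template`, which uses the same `$MODEL` placeholder syntax. This is a sketch under assumptions: the `render_udf` helper and the example model path are illustrative, and `download_s3` is the helper referenced on the previous slide, which is not shown in the deck.

```python
from string import Template

# UDF source with a $MODEL placeholder; it is only rendered here, never executed.
UDF_TEMPLATE = Template('''\
from pig_util import outputSchema
import numpy, pickle

clf = pickle.load(download_s3('$MODEL'))

@outputSchema('value:int')
def predict_results(feature_vector):
    return clf.predict(numpy.array(feature_vector).reshape(1, -1))[0]
''')

def render_udf(model_path):
    """Return UDF source code with the model path baked in."""
    return UDF_TEMPLATE.substitute(MODEL=model_path)

udf_source = render_udf('s3://mybucket/model_2014-01-01.pkl')
print(udf_source)
```

The rendered text would then be written to the file that the Pig script registers, so each run uses the model matching its date.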
  • 27. CLUSTER PROVISIONING WITH LUIGI §  To conserve resources, we use clusters only when needed, so we created the StartCluster task. §  With this mechanism in place, we also have a cron job that kills idle clusters and saves even more money. §  We use both EMR clusters and clusters provisioned by Xplenty, which provides us with its Hadoop provisioning infrastructure. [Diagram: PigTask REQUIRES StartCluster; StartCluster OUTPUTS a ClusterTarget, which PigTask USES]
  • 28. USING LUIGI WITH OTHER COMPUTATION ENGINES §  Luigi acts like the “glue” of data pipelines, and we use it to interconnect Pig and GraphLab jobs §  Pig is very convenient for large scale data processing, but it is very weak when it comes to graph analysis and iterative computation §  One of the main disadvantages of Pig is that it has no conditional statements, so we need to use other tools to complete our arsenal [Diagram: Pig task, Pig task, GraphLab task]
  • 29. GRAPHLAB AT CROSSWISE §  We use GraphLab to run graph processing at scale – for example, to run connected components and create “users” from a graph of devices that belong to the same user
  • 30. PYTHON API §  Pig is a “data flow” language rather than a general-purpose programming language. Its abilities are limited: there are no conditional blocks or loops. Loops are required when trying to reach “convergence”, such as when finding connected components in a graph. To overcome this limitation, a Python API has been created, which lets a Python script compile, bind and run Pig scripts:

        from org.apache.pig.scripting import Pig

        P = Pig.compile(
            "A = LOAD '$input' AS (name, age, gpa);" +
            "STORE A INTO '$output';")

        Q = P.bind({
            'input': 'input.csv',
            'output': 'output.csv'})

        result = Q.runSingle()
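To see why a loop is needed for convergence, here is a small in-memory sketch of connected components by label propagation. The device ids are made up, and the code is plain Python rather than the embedded Pig API. With that API, each pass of the `while` loop would roughly correspond to binding and running the compiled script once more, stopping when a pass changes nothing.

```python
def connected_components(edges):
    """Give every node the smallest label reachable from it.

    Repeats full passes over the edge list until a pass changes no label,
    which is the convergence condition.
    """
    labels = {}
    for a, b in edges:
        labels.setdefault(a, a)
        labels.setdefault(b, b)

    iterations = 0
    changed = True
    while changed:
        changed = False
        iterations += 1
        for a, b in edges:
            low = min(labels[a], labels[b])
            if labels[a] != low or labels[b] != low:
                labels[a] = labels[b] = low
                changed = True
    return labels, iterations

# Hypothetical device graph: d1-d2-d3 belong to one user, d4-d5 to another.
edges = [("d1", "d2"), ("d2", "d3"), ("d4", "d5")]
labels, iterations = connected_components(edges)
print(labels, iterations)
```

The number of passes depends on the data, so it cannot be written as a fixed sequence of statements in a language without loops.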
  • 32. STANDARD LUIGI WORKFLOW §  Standard Luigi Hadoop tasks need a correctly configured Hadoop client to launch jobs. §  This can be a pain when running an automatically provisioned Hadoop cluster (e.g. an EMR cluster). [Diagram: Hadoop master node (Luigi, Hadoop client, NameNode, JobTracker) and two Hadoop slave nodes]
  • 33. LUIGI HADOOP SSH RUNNER §  At Crosswise, we implemented a Luigi task for running Hadoop JARs (e.g. Pig) remotely, much as the Amazon EMR API allows. §  Instead of launching steps using the EMR API, we implemented our own mechanism, to enable running steps concurrently. [Diagram: Luigi, connected via API / SSH to Hadoop client instances and the cluster master node, with two EMR slave nodes]
  • 34. WHY RUN HADOOP JOBS EXTERNALLY? Working with the EMR API is convenient, but Luigi expects to run jobs from the master node rather than through the EMR job submission API. Advantages: §  Does not require a locally configured Hadoop client §  Allows provisioning the clusters as a task (using Amazon EMR’s API, for example) §  The same Luigi process can utilize several Hadoop clusters at once
  • 35. NEXT STEPS AT CROSSWISE §  We are planning on moving to Apache Tez since MapReduce has a high overhead for complicated processes, and it is hard to tweak and utilize the framework properly §  We are also investigating Dato’s distributed data processing, training and prediction capabilities at scale (using GraphLab Create)