These are the slides from the Denver/Boulder Spark meet-up on February 24th, 2016. (deck build animations are all broken here... sorry!)
This talk provides an evaluation of existing machine learning pipelines through the eyes of the key stakeholders in the data science ecosystem. Focus is placed on the entire process from data to product (and keeping everyone in between happy). Ultimately I explore how to use Spotify’s Luigi pipeline tool in combination with Spark to produce batch-processing machine learning pipelines with operational insight and redundancy built in.
More Data, More Problems: Evolving big data machine learning pipelines with Spark & Luigi
1. More Data, More Problems:
Evolving big data machine learning pipelines with Spark & Luigi
Alex Sadovsky
Director of Data Science: Oracle Data Cloud
alex.sadovsky@oracle.com
It's like the more data we come across
The more problems we see
3. Data Science is growing up
For Data Science to succeed, we need to learn to play well with others.
Important Business Decisions
“How will operations adapt to this code change?”
“We’ll need a classifier capable of capturing non-linear interactions!”
4. Who are the players?
Data Ingest Operations
Product Architecture
Finance / Investors Data Scientists
Litmus Test: These are the parties involved in every Data Science Product
5. What is success?
Success is getting from A to B with everyone staying happy:
Data In
Product Goals & Data Science Realization
Operational Insight
Utilization of Existing Architecture
Don’t Break the Bank
Data Out
6. Are we even talking about Spark today?
Outline:
1. Automated ML services
2. Scikit-Learn Pipelines
3. Spark Pipelines
4. Spotify’s Luigi
5. Data Science Pipelines: Spark + Luigi
Spoiler alert:
Spark is still going to be the answer to all of our big data problems
10. ML on AWS: Who’s happy?
Data Ingest Operations
Product* Architecture
Finance / Investors Data Scientists
Scoring 3 billion records = ($0.10 / 1000) * 3 000 000 000 = $300,000 USD + compute fees
*Amazon Machine Learning can train models on datasets up to 100 GB in size.
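The back-of-envelope cost above can be checked directly (a quick sanity check; the $0.10 per 1,000 predictions figure is the batch price quoted on the slide):

```python
# Batch-prediction price quoted on the slide: $0.10 per 1,000 predictions
records = 3_000_000_000
price_per_1000_usd = 0.10

cost_usd = records / 1000 * price_per_1000_usd
print(cost_usd)  # 300000.0 -- before any compute fees
```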
11. Scikit-Learn Pipelines
# Manual chaining: fit each step yourself
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = SGDClassifier()

vX = vect.fit_transform(Xtrain)
tfidfX = tfidf.fit_transform(vX)
clf.fit(tfidfX, ytrain)
predicted = clf.predict(tfidfX)

# Now evaluate all steps on the test set
# (transform/predict only -- never re-fit on test data)
vX = vect.transform(Xtest)
tfidfX = tfidf.transform(vX)
predicted = clf.predict(tfidfX)

VS

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
predicted = pipeline.fit(Xtrain, ytrain).predict(Xtrain)

# Now evaluate all steps on the test set
predicted = pipeline.predict(Xtest)
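A further payoff of the pipeline form (a sketch, assuming scikit-learn; the toy corpus and alpha grid are illustrative): the whole chain becomes a single estimator, so cross-validation can tune any step’s parameters via the step-name__param convention, with the vectorizer re-fit on training folds only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(random_state=0)),
])

# Toy corpus standing in for Xtrain/ytrain (illustrative data only)
docs = ["spark luigi pipeline", "big data spark", "luigi tasks targets",
        "spark cluster jobs", "cats purr softly", "dogs bark loudly",
        "my cat sleeps", "the dog runs"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# '<step>__<param>' reaches inside the pipeline; each fold re-fits
# every step on its training portion, so no test-fold leakage.
grid = GridSearchCV(pipeline, {'clf__alpha': [1e-3, 1e-4]}, cv=2)
grid.fit(docs, labels)
print(grid.best_params_)
```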
12. Scikit-Learn Pipelines: Who’s happy?
Data Ingest Operations
Product Architecture
Finance / Investors Data Scientists
I thought we were going to have Big Data…
13. Big Data? Sounds like we need Spark.
• Great for data manipulation
• Great for large scale modeling
15. Spark Pipelines: Who’s happy?
Data Ingest Operations
Product Architecture
Finance / Investors Data Scientists
Just not flexible enough to comprise a whole product
16. So why (not) Spark?
• Great for data manipulation
• Great for large scale modeling
• Not a data warehouse
• Not needed for reporting
• Not needed for operational insight
– If anything, it’s an error source!
18. Spotify’s Luigi
https://github.com/spotify/luigi
Luigi is a pipeline tool for workflow management
• Apache 2.0 License
• Similar to the Make utility in Linux
– You have tasks which have dependencies
– Luigi makes sure those dependencies are met
• Similar to Spark
– It creates a directed acyclic graph and executes accordingly
20. How does it work: Tasks & Targets
• Tasks
– Code we want to run (that may require other tasks)
– Tasks output targets
• Targets
– A desired state (e.g. a file that exists)
21. Luigi works with anything
• Tasks
– Hadoop commands
– Spark jobs
– Python, Perl, Fortran, shell scripts
– Anything that can be wrapped in a Python “run” method
• Targets
– local, S3, FTP, HDFS files
– database entries
– Anything that can let a Python wrapper return “true” when it exists
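To make the “targets are just a state check” idea concrete, here is a minimal stdlib-only mimic of the pattern (not the Luigi API itself, just an illustration of how target existence drives whether a task runs):

```python
import os
import tempfile

class FileTarget:
    """A target is a desired state; here, 'this file exists'."""
    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)

class WriteGreeting:
    """A task: run() produces the target; complete() checks for it."""
    def __init__(self, path):
        self.target = FileTarget(path)

    def complete(self):
        return self.target.exists()

    def run(self):
        with open(self.target.path, 'w') as f:
            f.write("hello, foobar world\n")

def run_if_needed(task):
    """The scheduler's core idea in one line: skip completed tasks."""
    if not task.complete():
        task.run()
        return "ran"
    return "skipped"

task = WriteGreeting(os.path.join(tempfile.mkdtemp(), "bar"))
first = run_if_needed(task)   # target missing -> task runs
second = run_if_needed(task)  # target now exists -> task is skipped
print(first, second)  # ran skipped
```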
22. It’s all python too
• No XML or YAML
• Configurable via code
23. Foo and Bar
class Foo(luigi.WrapperTask):
    def run(self):
        print("Running Foo")

    def requires(self):
        yield Bar()
24. Foo and Bar
class Bar(luigi.Task):
    def run(self):
        f = self.output().open('w')
        f.write("hello, foobar world\n")
        f.close()

    def output(self):
        return luigi.LocalTarget('/tmp/bar')
25. /anaconda/bin/python /Users/alexander.sadovsky/PycharmProjects/megalodon/foo2.py Foo
DEBUG: Checking if examples.Foo() is complete
DEBUG: Checking if examples.Bar() is complete
INFO: Informed scheduler that task examples.Foo() has status PENDING
INFO: Informed scheduler that task examples.Bar() has status PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) running examples.Bar()
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) done examples.Bar()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task examples.Bar() has status DONE
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) running examples.Foo()
INFO: [pid 12803] Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) done examples.Foo()
DEBUG: 1 running tasks, waiting for next task to finish
Running Foo
INFO: Informed scheduler that task examples.Foo() has status DONE
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker Worker(salt=955608253, workers=1, host=asadovsky-mac.local, username=alexander.sadovsky, pid=12803) was stopped. Shutting down Keep-Alive thread
INFO:
===== Luigi Execution Summary =====
Scheduled 2 tasks of which:
* 2 ran successfully:
- 1 examples.Bar()
- 1 examples.Foo()
This progress looks :) because there were no failed tasks or missing external dependencies
===== Luigi Execution Summary =====
Process finished with exit code 0
28. Foo and Bars
class Foo(luigi.WrapperTask):
    def run(self):
        print("Running Foo")

    def requires(self):
        for i in range(10):
            yield Bar(i)

class Bar(luigi.Task):
    num = luigi.IntParameter()

    def run(self):
        f = self.output().open('w')
        f.write("hello, foobar world\n")
        f.close()

    def output(self):
        return luigi.LocalTarget('/tmp/bar/%d' % self.num)
30. What about Spark?
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext()
    (sc.textFile(sys.argv[1])
       .flatMap(lambda line: line.split())
       .map(lambda word: (word, 1))
       .reduceByKey(lambda a, b: a + b)
       .saveAsTextFile(sys.argv[2]))
31. What about Spark?
from luigi.contrib.s3 import S3Target
from luigi.contrib.spark import PySparkTask

class InlinePySparkWordCount(PySparkTask):
    def input(self):
        return S3Target("s3n://bucket.example.org/wordcount.input")

    def output(self):
        return S3Target('s3n://bucket.example.org/wordcount.output')

    def main(self, sc, *args):
        (sc.textFile(self.input().path)
           .flatMap(lambda line: line.split())
           .map(lambda word: (word, 1))
           .reduceByKey(lambda a, b: a + b)
           .saveAsTextFile(self.output().path))
35. But wait! There’s more!
• Failure retries are built in
• Upstream failures will stop downstream processing
• If tasks are files/filesystem/database states, entire pipelines can be rerun without actually “re-running” every step
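The built-in retry behaviour is tunable through Luigi’s config file; a sketch of a luigi.cfg fragment (key names as in Luigi’s scheduler configuration; treat them as an assumption to verify against your Luigi version):

```ini
[scheduler]
# How many times the scheduler will retry a failed task
retry_count = 3
# Seconds before a failed task becomes eligible for retry
retry_delay = 600
```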
36. Spark + Luigi: Who’s happy?
Data Ingest Operations
Product Architecture
Finance / Investors Data Scientists
Everybody!