This document discusses scaling out big data computation and machine learning using Pig, Python and Luigi. It describes how the speaker's company processes large amounts of data to build a graph of related devices. It then discusses using Pig for scalable data processing, Python for machine learning tasks, and Luigi for workflow management between Pig and Python jobs. The speaker provides examples of how to use Pig scripts with Python UDFs for tasks like training models and predicting at large scale.
Building data flows with Celery and SQLAlchemy - Roger Barnes
Reporting and analysis systems rely on coherent and reliable data, often from disparate sources. To that end, a series of well established data warehousing practices have emerged to extract data and produce a consistent data store.
This talk will look at some options for composing workflows using Python. In particular, we'll explore beyond Celery's asynchronous task processing functionality into its workflow (aka Canvas) system and how it can be used in conjunction with SQLAlchemy's architecture to provide the building blocks for data stream processing.
Data Migrations in the App Engine Datastore - Ryan Morlok
Data migration is a core problem when dealing with web frameworks. Rails and Django have their own built-in migration tools to help you manage data, but with Google Cloud Datastore, things are a bit more manual. This presentation walks through several techniques and Python examples that leverage deferred tasks or MapReduce to keep the data for your app consistent with the state of your code.
Om nom nom nom
Talk given at Clojure/conj 2014 in Washington DC
Video available here: https://www.youtube.com/watch?v=4-oyZpLRQ20
Have you ever needed an easily customisable dashboard? Or needed to visualise data in a browser but been overwhelmed by d3.js? This talk will cover basics of React and Om, some data visualisation libraries and techniques, ways to handle live data and combining all that into an easily customisable dashboard. Expect demos, code and maybe, just maybe, om nom nom nom cookies.
Benchx: An XQuery benchmarking web application Andy Bunce
A system to record query performance of XQuery statements running on the BaseX (http://basex.org) XML database. It uses Angular on the client side and RESTXQ on the server.
In these slides, we introduce the Solr mechanism used in the Search Engine Back End API Solution for Fast Prototyping (LDSP). You will learn how to create a new core, update the schema, and query and sort in Solr.
Spark is quickly becoming the most popular framework in the MapReduce family. With better performance and much better APIs, it's easier than ever to perform the actual data wrangling. But as always, the challenges of operating, verifying and optimizing your application over time are much greater than the initial setup - and all the more so with distributed systems. At Kenshoo, we've used and developed some tools and techniques to monitor the state of our Spark application: health, correctness, performance, utilization, and business KPIs. We'll discuss some standard tools and less standard techniques to get the most information out of your Spark cluster.
Debugging PySpark: Spark Summit East talk by Holden Karau - Spark Summit
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as a look toward the future at data property type accumulators, which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
The need to crunch large amounts of data to extract useful statistics is increasingly common. Using services like Amazon Redshift and Amazon Elastic MapReduce, we will show how you can process log data to produce helpful reports and give your analysts the tools to find useful data. We will dive deep into these systems, building a usable example from scratch using the AWS SDK for Ruby.
Talk by Nihad Abbasov, Senior Software Engineer at Digital Classifieds, at Ruby Meditation 27, Dnipro, 19.05.2019
Slideshare -
Next conference - http://www.rubymeditation.com/
How fast is your code? Performance is crucial as your startup grows, and optimizing your application can make a huge impact on user experience. During this talk, you will learn hints, techniques and best practices for improving the overall speed of your Ruby application.
Announcements and conference materials https://www.fb.me/RubyMeditation
News https://twitter.com/RubyMeditation
Photos https://www.instagram.com/RubyMeditation
The stream of Ruby conferences (not just ours) https://t.me/RubyMeditation
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca... - Databricks
We can think of an Apache Spark application as the unit of work in complex data workflows. Building a configurable and reusable Apache Spark application comes with its own challenges, especially for developers who are just starting out in the domain. Configuration, parametrization, and reusability of the application code can be challenging. Solving these will allow the developer to focus on value-adding work instead of mundane tasks such as writing a lot of configuration code, initializing the SparkSession or even kicking off a new project.
This presentation uses code samples to describe a developer’s journey from the first steps into Apache Spark all the way to a simple open-source framework that can help kick off an Apache Spark project very easily, with a minimal amount of code. The main ideas covered in this presentation are derived from the separation of concerns principle.
The first idea is to make it even easier to code and test new Apache Spark applications by separating the application logic from the configuration logic.
The second idea is to make it easy to configure the applications, providing SparkSessions out of the box, and easy-to-set-up data readers, data writers and application parameters through configuration alone.
The third idea is that taking a new project off the ground should be very easy and straightforward. These three ideas are a good start in building reusable and production-worthy Apache Spark applications.
The resulting framework, spark-utils, is already available and ready to use as an open-source project, but even more important are the ideas and principles behind it.
Sorry - How Bieber broke Google Cloud at Spotify - Neville Li
Talk at Scala Up North Jul 21 2017
We will talk about Spotify's story with Scala big data and our journey to migrate our entire data infrastructure to Google Cloud and how Justin Bieber contributed to breaking it. We'll talk about Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and the technology behind it, including macros, algebird, chill and shapeless. There'll also be a live coding demo.
Aspect-based sentiment analysis is a text analysis technique that breaks down text into aspects (attributes or components of a product or service), and then scores the sentiment level (positive, negative or neutral) of each aspect. In this talk we'll walk through a production pipeline for training a large Aspect-Based Sentiment Analysis model in Python with the Intel NLP Architect package, based on the following open-sourced code: https://github.com/microsoft/nlp-recipes/tree/master/examples/sentiment_analysis/absa
Recent Developments In SparkR For Advanced Analytics - Databricks
Since its introduction in Spark 1.4, SparkR has received contributions from both the Spark community and the R community. In this talk, we will summarize recent community efforts on extending SparkR for scalable advanced analytics. We start with the computation of summary statistics on distributed datasets, including single-pass approximate algorithms. Then we demonstrate MLlib machine learning algorithms that have been ported to SparkR and compare them with existing solutions on R, e.g., generalized linear models, classification and clustering algorithms. We also show how to integrate existing R packages with SparkR to accelerate existing R workflows.
Tech introduction to Yaetos, an open source tool for data engineers, scientists, and analysts to easily create data pipelines in Python and SQL and put them in production in the cloud.
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK - zmhassan
As Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized Spark environment. In our examples, we will gather Spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric... - Databricks
Recently, there has been increased interest in running analytics and machine learning workloads on top of serverless frameworks in the cloud. The serverless execution model provides fine-grained scaling and unburdens users from having to manage servers, but it also adds substantial performance overheads, because all data and the intermediate state of compute tasks are stored on remote shared storage.
In this talk I first provide a detailed performance breakdown from a machine learning workload using Spark on AWS Lambda. I show how the intermediate state of tasks — such as model updates or broadcast messages — is exchanged using remote storage and what the performance overheads are. Later, I illustrate how the same workload performs on-premise using Apache Spark and Apache Crail deployed on a high-performance cluster (100Gbps network, NVMe Flash, etc.). Serverless computing simplifies the deployment of machine learning applications. The talk shows that performance does not need to be sacrificed.
Your data is getting bigger while your boss is getting anxious to have insights! This tutorial covers Apache Spark that makes data analytics fast to write and fast to run. Tackle big datasets quickly through a simple API in Python, and learn one programming paradigm in order to deploy interactive, batch, and streaming applications while connecting to data sources incl. HDFS, Hive, JSON, and S3.
Introduction to Yaetos, an open source tool for data engineers, scientists, and analysts to easily create data pipelines in python and SQL and put them in production in the AWS cloud. Focus on the Spark component.
Big Data Beyond the JVM - Strata San Jose 2018 - Holden Karau
Many of the recent big data systems, like Hadoop, Spark, and Kafka, are written primarily in JVM languages. At the same time, there is a wealth of tools for data science and data analytics that exist outside of the JVM. Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
Holden and Rachel detail how to bridge the gap using PySpark and discuss other solutions like Kafka Streams as well. They also outline the challenges of pure Python solutions like dask. Holden and Rachel start with the current architecture of PySpark and its evolution. They then turn to the future, covering Arrow-accelerated interchange for Python functions, how to expose Python machine learning models into Spark, and how to use systems like Spark to accelerate training of traditional Python models. They also dive into what other similar systems are doing as well as what the options are for (almost) completely ignoring the JVM in the big data space.
Python users will learn how to more effectively use systems like Spark and understand how the design is changing. JVM developers will gain an understanding of how to work with Python code from data scientists and Python developers while avoiding the traditional trap of needing to rewrite everything.
Day 13 - Creating Data Processing Services | Train the Trainers Program - FIWARE
This technical session for Local Experts in Data Sharing (LEBDs) will explain how to create data processing services that are key to i4Trust.
The primary focus of this presentation is approaching the migration of a large, legacy data store into a new schema built with Django. Includes discussion of how to structure a migration script so that it will run efficiently and scale. Learn how to recognize and evaluate trouble spots.
Also discusses some general tips and tricks for working with data and establishing a productive workflow.
ANSI SQL - a shortcut to Microsoft SQL Server/Azure SQL Database for Intersho... - Jens Kleinschmidt
These slides are for Intershop developers who want to start looking into an Oracle DB alternative - Microsoft SQL Server or Azure SQL Database.
It includes steps the vendor, Intershop, has undertaken to support MS SQL as well as migration hints for projects.
Agenda:
Introduction
Evaluation
Why Microsoft SQL Server?
Work Ahead
MS SQL Support
A Story of Epic proportion
ANSI SQL to the rescue
Migration Steps
Outlook
Summary
Q&A
A Hands-on Intro to Data Science and R Presentation.ppt - Sanket Shikhar
Using popular data science tools such as Python and R, the book offers many examples of real-life applications, with practice ranging from small to big data.
Session 8 - Creating Data Processing Services | Train the Trainers Program - FIWARE
This technical session for Local Experts in Data Sharing (LEBDs) will explain how to create data processing services that are key to i4Trust.
Beyond Wordcount with Spark datasets (and scalaing) - Nike PDX Jan 2018 - Holden Karau
Apache Spark is one of the most popular big data systems, but once the shiny finish starts to wear off you can find yourself wondering if you've accidentally deployed a Ford Pinto into production. This talk will look at the challenges that come with scaling Spark jobs. Also, the talk will explore Spark's new(ish) Dataset/DataFrame API, as well as how it’s evolving in Spark 2.3 with improved Python support.
If you're already a Spark user, come to find out why it’s not all your fault. If you aren't already a Spark user, come to find out how to save yourself from some of the pitfalls once you move beyond the example code.
Check out Holden's newest book, High Performance Spark, for more information!
From https://niketechtalksjan2018.splashthat.com/
Similar to BDX 2015 - Scaling out big-data computation & machine learning using Pig, Python and Luigi
Writing HTML5 Web Apps using Backbone.js and GAE - Ron Reiter
A walkthrough of how to write a complete HTML5 web app (both front end and back end) using Google App Engine (Python), Backbone.js, Require.js, underscore.js and jQuery.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
BDX 2015 - Scaling out big-data computation & machine learning using Pig, Python and Luigi
1.
2. Scaling out big-data computation & machine learning using Pig, Python and Luigi
Ron Reiter
VP R&D, Crosswise
3. AGENDA
§ The goal
§ Data processing at Crosswise
§ The basics of prediction using machine learning
§ The “big data” stack
§ An introduction to Pig
§ Combining Pig and Python
§ Workflow management using Luigi and Amazon EMR
4. THE GOAL
1. Process huge amounts of data points
2. Allow data scientists to focus on their research
3. Adjust production systems according to research conclusions quickly, without duplicating logic between research and production systems
5. DATA PROCESSING AT CROSSWISE
§ We are building a graph of devices that belong to the same user, based on browsing data of users
6. DATA PROCESSING AT CROSSWISE
§ Interesting facts about our data processing pipeline:
§ We process 1.5 trillion data points from 1 billion devices
§ 30TB of compressed data
§ Cluster with 1600 cores running for 24 hours
7. DATA PROCESSING AT CROSSWISE
§ Our constraints:
§ We are dealing with massive amounts of data, and we have to go for a solid, proven and truly scalable solution
§ Our machine learning research team uses Python and sklearn
§ We are in a race against time (to market)
§ We do not want the overhead of maintaining two separate processing pipelines, one for research and one for large-scale prediction
8. PREDICTING AT SCALE
(Diagram: the model-building phase (small or large scale) flows Labeled Data -> Train Model -> Evaluate Model -> Model; the prediction phase (massive scale) flows Unlabeled Data + Model -> Predict -> Output)
9. PREDICTING AT SCALE
§ Steps
§ Training & evaluating the model (iterations on training and evaluation are done until the model's performance is acceptable)
§ Predicting using the model at massive scale
§ Assumptions
§ Distributed learning is not required
§ Distributed prediction is required
§ Distributed learning can be achieved, but not all machine learning models support it, and not all infrastructures know how to do it
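To make the two phases concrete, here is a minimal sketch of the model-building phase on a single machine with scikit-learn; the file name, feature columns and choice of SGDClassifier are illustrative assumptions rather than Crosswise's actual code, and the pickled model is what the massive-scale prediction phase later loads.

    import pickle
    import pandas as pd
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Hypothetical labeled sample, small enough to fit on one machine
    df = pd.read_csv('labeled_logs_sample.csv')
    X, y = df[['a', 'b', 'c']].values, df['class'].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    clf = SGDClassifier()
    clf.fit(X_train, y_train)                            # train on a single machine
    print(accuracy_score(y_test, clf.predict(X_test)))   # evaluate; iterate until acceptable

    # Persist the model so the distributed prediction phase can load it later
    with open('model.pkl', 'wb') as fd:
        pickle.dump(clf, fd)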
10. THE “BIG DATA” STACK
(Diagram: the stack, bottom to top)
Resource Manager: YARN, Mesos
Computation Framework: MapReduce, Tez, Spark, GraphLab
High Level Language: Pig, Hive, Scalding, Spark Program, GraphLab Script
Workflow Management: Oozie, Luigi, Azkaban
11. PIG
§ Pig is a high level, SQL-like language, which runs on Hadoop
§ Pig also supports User Defined Functions written in Java and Python
12. HOW DOES PIG WORK?
§ Pig converts SQL-like queries to MapReduce iterations
§ Pig builds a work plan based on a DAG it calculates
§ Newer versions of Pig know how to run on different computation engines, such as Apache Tez and Spark, which offer a higher level of abstraction than MapReduce
(Diagram: the Pig runner compiles a script into a chain of MapReduce jobs)
13. PIG DIRECTIVES
The most common Pig directives are:
§ LOAD/STORE – load and save data sets
§ FOREACH – map function which constructs a new row for each row in a data set
§ FILTER – filters in/out rows that satisfy a certain criterion
§ GROUP – groups rows by a specific column / set of columns
§ JOIN – join two data sets based on a specific column
And many more functions: http://pig.apache.org/docs/r0.14.0/func.html
14. PIG CODE EXAMPLE

    customers = LOAD 'customers.tsv' USING PigStorage('\t')
        AS (customer_id, first_name, last_name);
    orders = LOAD 'orders.tsv' USING PigStorage('\t')
        AS (customer_id, price);
    aggregated = FOREACH (GROUP orders BY customer_id)
        GENERATE group AS customer_id, SUM(orders.price) AS price_sum;
    joined = JOIN customers BY customer_id, aggregated BY customer_id;
    STORE joined INTO 'customers_total.tsv' USING PigStorage('\t');
16. COMBINING PIG AND PYTHON
§ Pig gives you the power to scale and process data conveniently with an SQL-like syntax
§ Python is easy and productive, and has many useful scientific packages available (sklearn, nltk, numpy, scipy, pandas)
18. PYTHON UDF
§ Pig provides two Python UDF (user-defined function) engines: Jython (JVM) and CPython
§ Mortar (mortardata.com) added support for CPython UDFs, which support scientific packages (numpy, scipy, sklearn, nltk, pandas, etc.)
§ A Python UDF is a function with a decorator that specifies the output schema (since Python is dynamic, the input schema is not required)

    from pig_util import outputSchema

    @outputSchema('value:int')
    def multiply_by_two(num):
        return num * 2
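The schema string passed to the decorator can describe more than a single scalar. As a hedged illustration (the function and field names below are made up, not from the deck), a UDF that returns both a label and a score can declare a tuple schema:

    from pig_util import outputSchema

    @outputSchema('result:(label:chararray, score:double)')
    def classify(text):
        # Toy logic for illustration only
        score = 1.0 if text and 'buy' in text.lower() else 0.0
        return ('spam' if score > 0.5 else 'ham', score)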
19. USING THE PYTHON UDF
§ Register the Python UDF:

    REGISTER 'udfs.py' USING streaming_python AS udfs;

§ If you prefer speed over package compatibility, use Jython:

    REGISTER 'udfs.py' USING jython AS udfs;

§ Then, use the UDF within a Pig expression:

    processed = FOREACH data GENERATE udfs.multiply_by_two(num);
20. CONNECT PIG AND PYTHON JOBS
§ In many common scenarios, especially in machine learning, a classifier can usually be trained using a simple Python script
§ Using the classifier we trained, we can now predict on a massive scale using a Python UDF
§ Requires a higher-level workflow manager, such as Luigi
(Diagram: a Python job trains a model and writes it, pickled, to s3://model.pkl; a Pig job then loads it inside a Python UDF)
21. WORKFLOW MANAGEMENT
(Diagram: data flows through a chain of tasks; Task C requires Task B, which requires Task A; each task outputs a target and uses storage back ends such as S3, HDFS, SFTP, local files and databases)
22. WORKFLOW MANAGEMENT WITH LUIGI
§ Unlike Oozie and Azkaban, which are heavy workflow managers, Luigi is more of a Python package.
§ Luigi works based on dependency resolving, similar to a Makefile (or SCons)
§ Luigi defines an interface of “Tasks” and “Targets”, which we use to connect the two tasks using dependencies.
(Diagram: per day, e.g. 2014-01-01 and 2014-01-02, labeled logs produce a trained model, and the trained model plus the unlabeled logs for that day produce the day's output)
23. EXAMPLE - TRAIN MODEL LUIGI TASK
§ Let’s see how it’s done:

    import pickle
    import luigi
    import pandas
    from luigi.contrib.s3 import S3Target
    from sklearn.linear_model import SGDClassifier

    class TrainModel(luigi.Task):
        target_date = luigi.DateParameter()

        def requires(self):
            return LabelledLogs(self.target_date)

        def output(self):
            return S3Target('s3://mybucket/model_%s.pkl' % self.target_date)

        def run(self):
            clf = SGDClassifier()
            with self.input().open('r') as logs, self.output().open('w') as fd:
                df = pandas.read_csv(logs)
                clf.fit(df[['a', 'b', 'c']].values, df['class'].values)
                fd.write(pickle.dumps(clf))
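The LabelledLogs dependency is referenced but never shown in the deck. Here is a minimal sketch of how it might be declared as an external input, plus how the task could be launched from the command line; the bucket and file layout are assumptions for illustration.

    import luigi
    from luigi.contrib.s3 import S3Target

    class LabelledLogs(luigi.ExternalTask):
        # The labeled logs are produced outside this pipeline,
        # so the task only declares where its output already lives
        target_date = luigi.DateParameter()

        def output(self):
            return S3Target('s3://mybucket/labelled_logs_%s.csv' % self.target_date)

    # Launched from a shell, assuming the tasks live in a module named pipeline:
    #   luigi --module pipeline TrainModel --target-date 2014-01-01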
24. PREDICT RESULTS LUIGI TASK
§ We predict using a Pig task which has access to the pickled model:

    import luigi
    from luigi.contrib.s3 import S3Target

    class PredictResults(PigTask):
        PIG_SCRIPT = """
        REGISTER 'predict.py' USING streaming_python AS predict;
        data = LOAD '$INPUT' USING PigStorage('\t');
        predicted = FOREACH data GENERATE user_id, predict.predict_results(*);
        STORE predicted INTO '$OUTPUT' USING PigStorage('\t');
        """
        PYTHON_UDF = 'predict.py'
        target_date = luigi.DateParameter()

        def requires(self):
            return {'logs': UnlabelledLogs(self.target_date),
                    'model': TrainModel(self.target_date)}

        def output(self):
            return S3Target('s3://mybucket/results_%s.tsv' % self.target_date)
25. PREDICTION PIG USER-DEFINED FUNCTION (PYTHON)
§ We can then generate a custom UDF while replacing the $MODEL with an actual model file.
§ The model will be loaded when the UDF is initialized (this will happen on every map/reduce task using the UDF)

    from pig_util import outputSchema
    import numpy
    import pickle

    clf = pickle.load(download_s3('$MODEL'))

    @outputSchema('value:int')
    def predict_results(feature_vector):
        return clf.predict(numpy.array(feature_vector))[0]
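download_s3 above is a helper the deck assumes but never shows. A minimal sketch of what it could look like with boto3; the temp-file location and URL parsing are assumptions, and in practice the helper runs on whichever node executes the map/reduce task.

    import os
    import tempfile
    from urllib.parse import urlparse

    import boto3

    def download_s3(s3_url):
        # Download s3://bucket/key to a local temporary file and return an open file handle
        parsed = urlparse(s3_url)
        local_path = os.path.join(tempfile.gettempdir(), os.path.basename(parsed.path))
        boto3.client('s3').download_file(parsed.netloc, parsed.path.lstrip('/'), local_path)
        return open(local_path, 'rb')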
26. PITFALLS
§ For the classifier to work on your Hadoop cluster, you have to install the required packages on all of your Hadoop nodes (numpy, sklearn, etc.)
§ Sending arguments to a UDF is tricky; there is no way to initialize a UDF with arguments. To load a classifier into a UDF, you should generate the UDF using a template with the model you wish to use (see the sketch below)
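One way to implement that last point is to render the UDF file from a template just before submitting the Pig job. A hedged sketch; the template file name and helper are assumptions, chosen to match the $MODEL placeholder on the previous slide:

    from string import Template

    def render_udf(template_path, model_s3_path, output_path='predict.py'):
        # Replace the $MODEL placeholder with the concrete pickled-model location,
        # then REGISTER the rendered file from the Pig script
        with open(template_path) as fd:
            udf_source = Template(fd.read()).substitute(MODEL=model_s3_path)
        with open(output_path, 'w') as fd:
            fd.write(udf_source)
        return output_path

    # e.g. render_udf('predict.py.template', 's3://mybucket/model_2014-01-01.pkl')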
27. CLUSTER PROVISIONING WITH LUIGI
§ To conserve resources, we use clusters only when needed, so we created the StartCluster task
§ With this mechanism in place, we also have a cron job that kills idle clusters and saves even more money
§ We use both EMR clusters and clusters provisioned by Xplenty, which provides us with their Hadoop provisioning infrastructure
(Diagram: a PigTask REQUIRES StartCluster; StartCluster OUTPUTS a ClusterTarget, which the PigTask USES)
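A hedged sketch of what a StartCluster task might look like using boto3's EMR API; the instance types, release label and the use of a plain local file in place of the deck's ClusterTarget are illustrative assumptions, not the actual Crosswise implementation.

    import boto3
    import luigi

    class StartCluster(luigi.Task):
        cluster_name = luigi.Parameter(default='pipeline-cluster')

        def output(self):
            # Stand-in for the deck's ClusterTarget: persist the cluster id locally
            return luigi.LocalTarget('cluster_id_%s.txt' % self.cluster_name)

        def run(self):
            emr = boto3.client('emr')
            resp = emr.run_job_flow(
                Name=self.cluster_name,
                ReleaseLabel='emr-5.36.0',
                Applications=[{'Name': 'Pig'}],
                Instances={
                    'MasterInstanceType': 'm5.xlarge',
                    'SlaveInstanceType': 'm5.xlarge',
                    'InstanceCount': 3,
                    'KeepJobFlowAliveWhenNoSteps': True,
                },
                JobFlowRole='EMR_EC2_DefaultRole',
                ServiceRole='EMR_DefaultRole',
            )
            with self.output().open('w') as fd:
                fd.write(resp['JobFlowId'])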
28. USING LUIGI WITH OTHER COMPUTATION ENGINES
§ Luigi acts like the “glue” of data pipelines, and we use it to interconnect Pig and GraphLab jobs
§ Pig is very convenient for large-scale data processing, but it is very weak when it comes to graph analysis and iterative computation
§ One of the main disadvantages of Pig is that it has no conditional statements, so we need to use other tools to complete our arsenal
(Diagram: Pig task -> Pig task -> GraphLab task)
29. GRAPHLAB AT CROSSWISE
§ We use GraphLab to run graph processing at scale – for example, to run connected components and create “users” from a graph of devices that belong to the same user
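GraphLab Create is no longer distributed, so purely to illustrate what "connected components become users" means (and not at Crosswise's scale), here is a toy sketch with networkx; the device names and edges are made up.

    import networkx as nx

    # Nodes are devices; an edge links two devices believed to belong to the same person
    g = nx.Graph()
    g.add_edges_from([
        ('phone-1', 'laptop-1'),   # same user
        ('laptop-1', 'tablet-1'),  # transitively joins phone-1's component
        ('phone-2', 'laptop-2'),   # a second user
    ])

    # Each connected component is one "user" made of several devices
    for user_id, devices in enumerate(nx.connected_components(g)):
        print(user_id, sorted(devices))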
30. PYTHON API
§ Pig is a “data flow” language, and not a real language. Its abilities are limited - there are no conditional blocks or loops. Loops are required when trying to reach “convergence”, such as when finding connected components in a graph. To overcome this limitation, a Python API has been created.

    from org.apache.pig.scripting import Pig

    P = Pig.compile(
        "A = LOAD '$input' AS (name, age, gpa);" +
        "STORE A INTO '$output';")
    Q = P.bind({'input': 'input.csv', 'output': 'output.csv'})
    result = Q.runSingle()
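Because the whole point of the embedded API is to put a loop around Pig, here is a hedged sketch of an iterate-until-done driver run under Pig's Jython embedding (pig script.py); the script body, aliases, file names and the fixed iteration cap are illustrative assumptions rather than a real convergence test.

    from org.apache.pig.scripting import Pig

    # One propagation round: reads the current labelling and writes an updated one
    STEP = Pig.compile("""
    labels = LOAD '$input' AS (device_id:chararray, component:chararray);
    -- ... one round of label propagation would go here ...
    STORE labels INTO '$output';
    """)

    current = 'labels_0'
    for i in range(10):  # hard cap on iterations instead of a real convergence check
        next_output = 'labels_%d' % (i + 1)
        stats = STEP.bind({'input': current, 'output': next_output}).runSingle()
        if not stats.isSuccessful():
            raise RuntimeError('iteration %d failed' % i)
        current = next_output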
32. STANDARD LUIGI WORKFLOW
§ Standard Luigi Hadoop tasks need a correctly configured Hadoop client to launch jobs.
§ This can be a pain when running an automatically provisioned Hadoop cluster (e.g. an EMR cluster).
(Diagram: Luigi drives a local Hadoop client, which talks to the namenode and job tracker on the Hadoop master node; the master schedules work on the Hadoop slave nodes)
33. LUIGI HADOOP SSH RUNNER
§ At Crosswise, we implemented a Luigi task for running Hadoop JARs (e.g. Pig) remotely, just like the Amazon EMR API enables.
§ Instead of launching steps using the EMR API, we implemented our own, to enable running steps concurrently.
(Diagram: Luigi reaches the cluster master node via the API or SSH; Hadoop client instances on the master submit work to the EMR slave nodes)
34. WHY RUN HADOOP JOBS EXTERNALLY?
Working with the EMR API is convenient, but Luigi expects to run jobs from the master node rather than through the EMR job submission API
Advantages:
§ Does not require running on a locally configured Hadoop client
§ Allows provisioning the clusters as a task (using Amazon EMR's API, for example)
§ The same Luigi process can utilize several Hadoop clusters at once
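A hedged sketch of the remote-execution idea: push a Pig script to the cluster's master node over SSH and run it there with the cluster's own Hadoop client, so the Luigi host needs no local Hadoop configuration. The host name, user, key path and use of plain scp/ssh are assumptions for illustration, not Crosswise's actual runner.

    import subprocess

    def run_pig_remotely(master_host, script_path, key_path='~/.ssh/emr.pem'):
        # Copy the Pig script to the master node, then execute it there
        subprocess.check_call(['scp', '-i', key_path, script_path,
                               'hadoop@%s:/tmp/job.pig' % master_host])
        subprocess.check_call(['ssh', '-i', key_path, 'hadoop@%s' % master_host,
                               'pig', '-f', '/tmp/job.pig'])

    # e.g. run_pig_remotely('ec2-master-node.example.com', 'predict.pig')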
35. NEXT STEPS AT CROSSWISE
§ We are planning on moving to Apache Tez, since MapReduce has a high overhead for complicated processes, and it is hard to tweak and utilize the framework properly
§ We are also investigating Dato's distributed data processing, training and prediction capabilities at scale (using GraphLab Create)