This talk discusses Spark (http://spark.apache.org), the Big Data computation system that is emerging as a replacement for MapReduce in Hadoop systems, while it also runs outside of Hadoop. I discuss the issues that make MapReduce ripe for replacement and how Spark addresses them with better performance and a more powerful API.
While Hadoop is the dominant "Big Data" tool suite today, it's a first-generation technology. I discuss its strengths and weaknesses, then look at how we "should" be doing Big Data and currently-available alternative tools.
Reactive design: languages and paradigms (Dean Wampler)
A talk first given at React 2014 and refined for YOW! LambdaJam 2014 that explores the meaning of Reactive Programming, as described in the Reactive Manifesto, and how well it is supported by general design paradigms, like Functional Programming, Object-Oriented Programming, and Domain Driven Design, and by particular design approaches, such as Functional Reactive Programming, Reactive Extensions, Actors, etc.
This is a deck of slides from a recent meetup of AWS Usergroup Greece, presented by Ioannis Konstantinou from the National Technical University of Athens.
The presentation gives an overview of the MapReduce framework and a description of its open source implementation (Hadoop). Amazon's own Elastic MapReduce (EMR) service is also mentioned. With the growing interest in Big Data, this is a good introduction to the subject.
Lions and tigers and Spark, oh my! It's hard enough keeping up with the explosion in data, but just keeping track of the tools is a challenge. What is Big Data? How do I become a data scientist? How can I leverage the cloud? (What is the cloud?). These are all tough questions for anyone to answer, let alone the business analyst who does not have a strong programming and technology background. Put your mind at ease - we are here to help.
This talk will introduce the open source processing engine, Spark, highlighting not only its awesome power but how it fits within the larger data landscape. You will learn why Spark was developed to crunch through Big Data, what MapReduce is, and why Spark can beat the pants off of it in terms of performance and ease of use. You will learn how non-programmers can get started with Spark (warning: there is no escaping code, but you can do it, we promise), where you can find great tutorials on Spark, and how you can use Spark in the cloud with IBM.
At the end of this talk you will be able to firmly place Spark in the Big Data ecosystem and articulate to your colleagues how data processing platforms have evolved to handle large amounts of data. You will know how to get started using Spark and be comfortable enough with Spark syntax to write a few lines of code like the boss you are. This talk will be fast and furious, but fun. Fasten your seatbelts and get ready to learn about Spark.
EclipseCon Keynote: Apache Hadoop - An Introduction (Cloudera, Inc.)
Todd Lipcon explains why you should be interested in Apache Hadoop, what it is, and how it works. Todd also brings to light the Hadoop ecosystem and real business use cases that revolve around Hadoop and the ecosystem.
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data (Jetlore)
Spark is an open source cluster computing framework that can outperform Hadoop by 30x through a combination of in-memory computation and a richer execution engine. Shark is a port of Apache Hive onto Spark, which provides a similar speedup for SQL queries, allowing interactive exploration of data in existing Hive warehouses. This talk will cover how both Spark and Shark are being used at various companies to accelerate big data analytics, the architecture of the systems, and where they are heading. We will also discuss the next major feature we are developing, Spark Streaming, which adds support for low-latency stream processing to Spark, giving users a unified interface for batch and real-time analytics.
In this talk, we present two emerging, popular open source projects: Spark and Shark. Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. It outperforms Hadoop by up to 100x in many real-world applications. Spark programs are often much shorter than their MapReduce counterparts thanks to its high-level APIs and language integration in Java, Scala, and Python. Shark is an analytic query engine built on top of Spark that is compatible with Hive. It can run Hive queries much faster in existing Hive warehouses without modifications.
These systems have been adopted by many organizations large and small (e.g. Yahoo, Intel, Adobe, Alibaba, Tencent) to implement data intensive applications such as ETL, interactive SQL, and machine learning.
Here are our most popular Hadoop Interview Questions and Answers from our Hadoop Developer Interview Guide. The Hadoop Developer Interview Guide has over 100 REAL Hadoop Developer Interview Questions with detailed answers and illustrations asked in REAL interviews. The Hadoop Interview Questions listed in the guide are not "might be asked" interview questions; they were asked in interviews at least once.
Introduction to Hadoop.
What are Hadoop, MapReduce, and the Hadoop Distributed File System?
Who uses Hadoop?
How to run Hadoop?
What are Pig, Hive, Mahout?
Pandas is a fast and expressive library for data analysis that doesn't naturally scale to more data than can fit in memory. PySpark is the Python API for Apache Spark that is designed to scale to huge amounts of data but lacks the natural expressiveness of Pandas. This talk introduces Sparkling Pandas, a library that brings together the best features of Pandas and PySpark: expressiveness, speed, and scalability.
While both Spark 1.3 and Pandas have classes named ‘DataFrame’, the Pandas DataFrame API is broader and not fully covered by the ‘DataFrame’ class in Spark. This talk will explore some of the differences between Spark's DataFrames and Pandas' DataFrames and then examine some of the work done to implement Pandas-like DataFrames on top of Spark. In some cases, providing Pandas-like functionality is computationally expensive in a distributed environment, and we will explore some techniques to minimize this cost.
At the end of this talk you should have a better understanding of both Sparkling Pandas and Spark’s own DataFrames. Whether you end up using Sparkling Pandas or Spark directly, you will have a greater understanding of how to work with structured data in a distributed context using Apache Spark and familiar DataFrame APIs.
Frustration-Reduced PySpark: Data engineering with DataFrames (Ilya Ganelin)
In this talk I describe my recent experience working with Spark DataFrames in Python. For DataFrames, the focus will be on usability. Specifically, a lot of the documentation does not cover common use cases like the intricacies of creating DataFrames, adding or manipulating individual columns, and doing quick and dirty analytics.
Error handling is critical for reactive systems, as expressed by the "resilient" trait. This talk examines strengths and weaknesses of existing approaches. I also consider preventing errors in the first place.
Slides from my talk at the Feb 2011 Seattle Tech Startups meeting. More info here (along with powerpoint slides): http://www.startupmonkeys.com/2011/02/scala-frugal-mechanic/
Spark Streaming makes it easy to build scalable fault-tolerant streaming applications. In this webinar, developers will learn:
*How Spark Streaming works - a quick review.
*Features in Spark Streaming that help prevent potential data loss.
*Complementary tools in a streaming pipeline - Kafka and Akka.
*Design and tuning tips for Reactive Spark Streaming applications.
Booz Allen Hamilton created the Field Guide to Data Science to help organizations and missions understand how to make use of data as a resource. The Second Edition of the Field Guide, updated with new features and content, delivers our latest insights in a fast-changing field. http://bit.ly/1O78U42
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around (Reynold Xin)
Introduction to MapReduce, GFS, HDFS, Spark, and differences between "Big Data" and database systems.
These are slides from our recent HadoopIsrael meetup, dedicated to a comparison of the Spark and Tez frameworks.
At the end of the meetup there is a small update about our ImpalaToGo project.
In this era of ever-growing data, the need to analyze it for meaningful business insights becomes more and more significant. There are different Big Data processing alternatives like Hadoop, Spark, Storm, etc. Spark, however, is unique in providing batch as well as streaming capabilities, thus making it a preferred choice for lightning-fast Big Data analysis platforms.
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package) (Jeffrey Breen)
Quick overview of programming Apache Hadoop with R. Jonathan Seidman's sample code allows a quick comparison of several packages followed by a real example using RHadoop's rmr package. Our example demonstrates using compound (vs. single-field) keys and values and shows the data coming into and out of our mapper and reducer functions.
Presented at the Boston Predictive Analytics Big Data Workshop, March 10, 2012. Sample code and configuration files are available on github.
Hadoop MapReduce is increasingly being replaced by Spark, which is written in Scala. Spark can run workloads up to 100 times faster than Hadoop MapReduce, so tasks performed on Spark are much faster and more efficient than on Hadoop.
1. Big Data Analytics
- Big Data
- Spark: Big Data Analytics
- Resilient Distributed Datasets (RDD)
- Spark libraries (SQL, DataFrames, MLlib for machine learning, GraphX, and Streaming)
- PFP: Parallel FP-Growth
2. Ubiquitous Computing
- Edge Computing
- Cloudlet
- Fog computing
- Internet of Things (IoT)
- Virtualization
- Virtual Conferencing
- Virtual Events (2D, 3D, and Hybrid)
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S... (BigDataEverywhere)
Paco Nathan, Director of Community Evangelism at Databricks
Apache Spark is intended as a fast and powerful general-purpose engine for processing Hadoop data. Spark supports combinations of batch processing, streaming, SQL, ML, Graph, etc., for applications written in Scala, Java, Python, Clojure, and R, among others. In this talk, I'll explore how Spark fits into the Big Data landscape. In addition, I'll describe other systems with which Spark pairs nicely, and will also explain why Spark is needed for the work ahead.
Spark Summit EU 2015: Lessons from 300+ production users (Databricks)
At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use-cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common performance and scalability issues that our users run into. This talk will discuss some of these common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.
1. Why Spark Is the Next Top (Compute) Model
@deanwampler
Scala Days Amsterdam 2015
Copyright (c) 2014-2015, Dean Wampler, All Rights Reserved, unless otherwise noted.
Image: Detail of the London Eye
3. This page provides more information, as well as results of a recent survey of Spark usage, blog posts, and webinars about the world of Spark.
6. Hadoop v2.X Cluster
[Cluster diagram: two master nodes running the Resource Manager and Name Nodes, plus worker nodes, each running a Node Manager and a Data Node over local disks.]
Schematic view of a Hadoop 2 cluster. For a more precise definition of the services and what they do, see e.g., http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html. We aren't interested in great details at this point, but we'll call out a few useful things to know.
7. Hadoop v2.X Cluster: Resource and Node Managers
[Same cluster diagram, highlighting the master Resource Manager and the per-node Node Managers.]
Hadoop 2 uses YARN to manage resources via the master Resource Manager, which includes a pluggable job scheduler and an Applications Manager. It coordinates with the Node Manager on each node to schedule jobs and provide resources. Other services involved, including application-specific Containers and Application Masters, are not shown.
8. Hadoop v2.X Cluster: Name Node and Data Nodes
[Same cluster diagram, highlighting the Name Nodes on the masters and the Data Nodes on the workers.]
Hadoop 2 clusters federate the Name Node services that manage the file system, HDFS. They provide horizontal scalability of file-system operations and resiliency when service instances fail. The Data Node services manage individual blocks for files.
9. MapReduce: the classic compute model for Hadoop
10. MapReduce: 1 map step + 1 reduce step (wash, rinse, repeat)
You get 1 map step (although there is limited support for chaining mappers) and 1 reduce step. If you can't implement an algorithm in these two steps, you can chain jobs together, but you'll pay a tax of flushing the entire data set to disk between these jobs.
12. Before running MapReduce, crawl the interwebs, find all the pages, and build a data set of URLs -> doc contents, written to flat files in HDFS or one of the more "sophisticated" formats.
13. Now we're running MapReduce. In the map step, a task (JVM process) per file *block* (64MB or larger) reads the rows, tokenizes the text, and outputs key-value pairs ("tuples")...
16. The output tuples are sorted by key locally in each map task, then "shuffled" over the cluster network to reduce tasks (each a JVM process, too), where we want all occurrences of a given key to land on the same reduce task.
17. Finally, each reducer just aggregates all the values it receives for each key, then writes out new files to HDFS with the words and a list of (URL-count) tuples (pairs).
18. Altogether...
20. Awkward: the restrictive model makes most algorithms hard to implement.
Writing MapReduce jobs requires arcane, specialized skills that few master. For a good overview, see http://lintool.github.io/MapReduceAlgorithms/
21. Awkward: lack of flexibility inhibits optimizations.
The inflexible compute model leads to complex code to improve performance, where hacks are used to work around the limitations. Hence, optimizations are hard to implement. The Spark team has commented on this; see http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html
22. Performance: full dump of intermediate data to disk between jobs.
Sequencing jobs wouldn't be so bad if the "system" was smart enough to cache data in memory. Instead, each job dumps everything to disk, then the next job reads it back in again. This makes iterative algorithms particularly painful.
23. Streaming: MapReduce only supports "batch mode".
Processing data streams as soon as possible has become very important. MapReduce can't do this, due to its coarse-grained nature and relative inefficiency, so alternatives have to be used.
25. Cluster Computing. Spark can be run in:
•YARN (Hadoop 2)
•Mesos (cluster management)
•EC2
•Standalone mode
•Cassandra, Riak, ...
•...
If you have a Hadoop cluster, you can run Spark as a seamless compute engine on YARN. (You can also use pre-YARN Hadoop v1 clusters, but there you have to manually allocate resources to the embedded Spark cluster vs. Hadoop.) Mesos is a general-purpose cluster resource manager that can also be used to manage Hadoop resources. Scripts for running a Spark cluster in EC2 are available. Standalone just means you run Spark's built-in support for clustering (or run locally on a single box, e.g., for development). EC2 deployments are usually standalone... Finally, database vendors like Datastax are integrating Spark.
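To make that point concrete, here is a minimal sketch (not from the deck; the object name, app name, and host URLs are hypothetical) showing that the same program can target any of these cluster managers just by changing the master URL passed to SparkConf:

import org.apache.spark.{SparkConf, SparkContext}

object MasterDemo {
  def main(args: Array[String]): Unit = {
    // Pick one: "local[*]", "spark://host:7077" (standalone),
    // "mesos://host:5050", or "yarn-client" (Hadoop 2 / YARN).
    val master = if (args.nonEmpty) args(0) else "local[*]"
    val conf = new SparkConf().setAppName("Master Demo").setMaster(master)
    val sc = new SparkContext(conf)
    // A trivial job to prove the context works on whichever cluster we chose.
    println(sc.parallelize(1 to 100).reduce(_ + _))
    sc.stop()
  }
}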
27. Compute Model: RDD (Resilient, Distributed Dataset)
RDDs shard the data over a cluster, like a virtualized, distributed collection (analogous to HDFS). They support intelligent caching, which means no naive flushes of massive datasets to disk. This feature alone allows Spark jobs to run 10-100x faster than comparable MapReduce jobs! The "resilient" part means they will reconstitute shards lost due to process/server crashes.
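Here is a minimal sketch of that caching idea (not from the deck; the input file of comma-separated numbers is hypothetical): mark an RDD as cached before an iterative loop, and only the first pass pays the disk-read cost.

import org.apache.spark.SparkContext

val sc = new SparkContext("local[*]", "Cache Demo")
val points = sc.textFile("data/points.txt")          // hypothetical input
  .map(_.split(",").map(_.toDouble))
  .cache()                                           // keep the parsed data in memory

var total = 0.0
for (i <- 1 to 10) {                                 // each pass reuses the cached shards
  total += points.map(_.sum).reduce(_ + _)
}
println(total)
sc.stop()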
28. Compute Model
29. Compute Model: written in Scala, with Java, Python, and R APIs.
Once you learn the core set of primitives, it's easy to compose non-trivial algorithms with little code.
30. Inverted Index in Java MapReduce
Let's see an actual implementation of the inverted index. First, a Hadoop MapReduce (Java) version, adapted from https://developer.yahoo.com/hadoop/tutorial/module4.html#solution. It's about 90 lines of code, but I reformatted it to fit better. This is also a slightly simpler version than the one I diagrammed. It doesn't record a count of each word in a document; it just writes (word, doc-title) pairs out of the mappers, and the final (word, list) output by the reducers just has a list of documents, hence repeats. A second job would be necessary to count the repeats.
31. import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class LineIndexer {
public static void main(String[] args) {
JobClient client = new JobClient();
JobConf conf =
new JobConf(LineIndexer.class);
conf.setJobName("LineIndexer");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf,
new Path("input"));
I'm not going to explain this in much detail. I used yellow for method calls, because methods do the real work! But notice that the functions in this code don't really do a whole lot, so there's low information density and you do a lot of bit twiddling.
32. JobClient client = new JobClient();
JobConf conf =
new JobConf(LineIndexer.class);
conf.setJobName("LineIndexer");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf,
new Path("input"));
FileOutputFormat.setOutputPath(conf,
new Path("output"));
conf.setMapperClass(
LineIndexMapper.class);
conf.setReducerClass(
LineIndexReducer.class);
client.setConf(conf);
try {
JobClient.runJob(conf);
boilerplate...
34. public static class LineIndexMapper
extends MapReduceBase
implements Mapper<LongWritable, Text,
Text, Text> {
private final static Text word =
new Text();
private final static Text location =
new Text();
public void map(
LongWritable key, Text val,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
FileSplit fileSplit =
(FileSplit)reporter.getInputSplit();
String fileName =
fileSplit.getPath().getName();
location.set(fileName);
This is the LineIndexMapper class for the mapper. The map method does the real work of tokenization and writing the (word, document-name) tuples.
35. FileSplit fileSplit =
(FileSplit)reporter.getInputSplit();
String fileName =
fileSplit.getPath().getName();
location.set(fileName);
String line = val.toString();
StringTokenizer itr = new
StringTokenizer(line.toLowerCase());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, location);
}
}
}
public static class LineIndexReducer
extends MapReduceBase
implements Reducer<Text, Text,
The rest of the LineIndexMapper class and map method.
36. public static class LineIndexReducer
extends MapReduceBase
implements Reducer<Text, Text,
Text, Text> {
public void reduce(Text key,
Iterator<Text> values,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
boolean first = true;
StringBuilder toReturn =
new StringBuilder();
while (values.hasNext()) {
if (!first)
toReturn.append(", ");
first=false;
toReturn.append(
values.next().toString());
}
output.collect(key,
new Text(toReturn.toString()));
The reducer class, LineIndexReducer, with the reduce method that is called for each key and a list of values for that key. The reducer is stupid; it just reformats the values collection into a long string and writes the final (word, list-string) output.
37. boolean first = true;
StringBuilder toReturn =
new StringBuilder();
while (values.hasNext()) {
if (!first)
toReturn.append(", ");
first=false;
toReturn.append(
values.next().toString());
}
output.collect(key,
new Text(toReturn.toString()));
}
}
}
EOF
38. Altogether
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class LineIndexer {
public static void main(String[] args) {
JobClient client = new JobClient();
JobConf conf =
new JobConf(LineIndexer.class);
conf.setJobName("LineIndexer");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf,
new Path("input"));
FileOutputFormat.setOutputPath(conf,
new Path("output"));
conf.setMapperClass(
LineIndexMapper.class);
conf.setReducerClass(
LineIndexReducer.class);
client.setConf(conf);
try {
JobClient.runJob(conf);
} catch (Exception e) {
e.printStackTrace();
}
}
public static class LineIndexMapper
extends MapReduceBase
implements Mapper<LongWritable, Text,
Text, Text> {
private final static Text word =
new Text();
private final static Text location =
new Text();
public void map(
LongWritable key, Text val,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
FileSplit fileSplit =
(FileSplit)reporter.getInputSplit();
String fileName =
fileSplit.getPath().getName();
location.set(fileName);
String line = val.toString();
StringTokenizer itr = new
StringTokenizer(line.toLowerCase());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, location);
}
}
}
public static class LineIndexReducer
extends MapReduceBase
implements Reducer<Text, Text,
Text, Text> {
public void reduce(Text key,
Iterator<Text> values,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
boolean first = true;
StringBuilder toReturn =
new StringBuilder();
while (values.hasNext()) {
if (!first)
toReturn.append(", ");
first=false;
toReturn.append(
values.next().toString());
}
output.collect(key,
new Text(toReturn.toString()));
}
}
}
The whole shebang (6pt. font).
39. Inverted Index in Spark (Scala).
This code is approximately 45 lines, but it does more than the previous Java example; it implements the original inverted index algorithm I diagrammed, where word counts are computed and included in the data.
40. import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object InvertedIndex {
def main(a: Array[String]) = {
val sc = new SparkContext(
"local[*]", "Inverted Idx")
sc.textFile("data/crawl")
.map { line =>
The InvertedIndex implemented in Spark. This time, we'll also count the occurrences in each document (as I originally described the algorithm) and sort the (url, N) pairs descending by N (count) and ascending by URL.
41. import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object InvertedIndex {
def main(a: Array[String]) = {
val sc = new SparkContext(
"local[*]", "Inverted Idx")
sc.textFile("data/crawl")
.map { line =>
It starts with imports, then declares a singleton object (a first-class concept in Scala) with a "main" routine (as in Java). The methods are colored yellow again. Note how dense with meaning they are this time.
42. import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object InvertedIndex {
def main(a: Array[String]) = {
val sc = new SparkContext(
"local[*]", "Inverted Idx")
sc.textFile("data/crawl")
.map { line =>
You begin the workflow by declaring a SparkContext. We're running in "local[*]" mode, in this case, meaning on a single machine, but using all cores available. Normally this argument would be a command-line parameter, so you can develop locally, then submit to a cluster, where "local" would be replaced by the appropriate cluster master URI.
43. sc.textFile("data/crawl")
.map { line =>
val Array(path, text) =
line.split("\t", 2)
(path, text)
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
The rest of the program is a sequence of function calls, analogous to "pipes" we connect together to construct the data flow. Data will only start "flowing" when we ask for results. We start by reading one or more text files from the directory "data/crawl". If running in Hadoop, if there are one or more Hadoop-style "part-NNNNN" files, Spark will process all of them (they will be processed synchronously in "local" mode).
44. sc.textFile("data/crawl")
.map { line =>
val Array(path, text) =
line.split("\t", 2)
(path, text)
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
sc.textFile returns an RDD with a string for each line of input text. So, the first thing we do is map over these strings to extract the original document id (i.e., file name), followed by the text in the document, all on one line. We assume tab is the separator. "(array(0), array(1))" returns a two-element "tuple". Think of the output RDD as having a schema "fileName: String, text: String".
45. sc.textFile("data/crawl")
.map { line =>
val Array(path, text) =
line.split("\t", 2)
(path, text)
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
flatMap maps over each of these 2-element tuples. We split the text into words on non-alphanumeric characters, then output collections of word (our ultimate, final "key") and the path. That is, each line (one thing) is converted to a collection of (word, path) pairs (0 to many things), but we don't want an output collection of nested collections, so flatMap concatenates nested collections into one long "flat" collection of (word, path) pairs.
46. }
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
case (n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
.groupByKey
.mapValues { iter =>
iter.toSeq.sortBy {
case (path, n) => (-n, path)
}.mkString(", ")
((word1, path1), n1)
((word2, path2), n2)
...
Next, we map over these pairs and add a single "seed" count of 1. Note the structure of the returned tuple; it's a two-tuple where the first element is itself a two-tuple holding (word, path). The following special method, reduceByKey, is like a groupBy, where it groups over those (word, path) "keys" and uses the function to sum the integers. The popup shows what the output data looks like.
47. .reduceByKey {
case (n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
.groupByKey
.mapValues { iter =>
iter.toSeq.sortBy {
case (path, n) => (-n, path)
}.mkString(", ")
}
.saveAsTextFile("/path/out")
sc.stop()
}
(word1, (path1, n1))
(word2, (path2, n2))
...
So, the input to the next map is now ((word, path), n), where n is now >= 1. We transform these tuples into the form we actually want, (word, (path, n)). I love how concise and elegant this code is!
48. .map {
case ((w, p), n) => (w, (p, n))
}
.groupByKey
.mapValues { iter =>
iter.toSeq.sortBy {
case (path, n) => (-n, path)
}.mkString(", ")
}
.saveAsTextFile("/path/out")
sc.stop()
}
}
(word1, iter(
(path11, n11), (path12, n12)...))
(word2, iter(
(path21, n21), (path22, n22)...))
...
Now we do an explicit group by to bring all the same words together. The output will be (word, seq((path1, n1), (path2, n2), ...)).
49. .map {
case ((w, p), n) => (w, (p, n))
}
.groupByKey
.mapValues { iter =>
iter.toSeq.sortBy {
case (path, n) => (-n, path)
}.mkString(", ")
}
.saveAsTextFile("/path/out")
sc.stop()
}
}
The last map over just the values (keeping the same keys) sorts by the count descending and path ascending. (Sorting by path is mostly useful for reproducibility, e.g., in tests!)
50. .map {
case ((w, p), n) => (w, (p, n))
}
.groupByKey
.mapValues { iter =>
iter.toSeq.sortBy {
case (path, n) => (-n, path)
}.mkString(", ")
}
.saveAsTextFile("/path/out")
sc.stop()
}
}
Finally, write back to the file system and stop the job.
51. Altogether
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object InvertedIndex {
def main(a: Array[String]) = {
val sc = new SparkContext(
"local[*]", "Inverted Idx")
sc.textFile("data/crawl")
.map { line =>
val Array(path, text) =
line.split("t",2)
(path, text)
}
.flatMap {
case (path, text) =>
text.split("""W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
case (n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
.groupByKey
.mapValues { iter =>
iter.toSeq.sortBy {
case (path, n) => (-n, path)
}
}
.saveAsTextFile("/path/to/out")
sc.stop()
}
}
The whole shebang (14pt. font, this time).
52. }
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
case (n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
.groupByKey
.mapValues { iter =>
iter.toSeq.sortBy {
case (path, n) => (-n, path)
}.mkString(", ")
Concise Operators!
Once you have this arsenal of concise combinators (operators), you can compose sophisticated dataflows very quickly.
53. Another example of a beautiful and profound DSL, in this case from the world of Physics: Maxwell's equations: http://upload.wikimedia.org/wikipedia/commons/c/c4/Maxwell'sEquations.svg
54. The Spark version took me ~30 minutes to write.
Once you learn the core primitives I used, and a few tricks for manipulating the RDD tuples, you can very quickly build complex algorithms for data processing! The Spark API allowed us to focus almost exclusively on the "domain" of data transformations, while the Java MapReduce version (which does less) forced tedious attention to infrastructure mechanics.
55. Use a SQL query when you can!!
56. Spark SQL! New DataFrame API with query optimizer (equal performance for Scala, Java, Python, and R), Python/R-like idioms.
This API sits on top of a new query optimizer called Catalyst, which supports equally fast execution for all high-level languages, a first in the big data world.
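As a rough illustration (this sketch is not from the deck; the Person schema and sample rows are hypothetical), a DataFrame query in Spark 1.3-era Scala looks like this:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)            // hypothetical schema

val sc = new SparkContext("local[*]", "DataFrame Demo")
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._                        // enables .toDF on case-class RDDs

val people = sc.parallelize(Seq(Person("Ann", 35), Person("Pat", 29))).toDF()
people.filter(people("age") > 30)                    // Python/R-like column expressions
  .groupBy("age")
  .count()
  .show()
sc.stop()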
57. Spark SQL! Mix SQL queries with the RDD API.
Use the best tool for a particular problem.
58. Spark SQL! Create, Read, and Delete Hive Tables.
Interoperate with Hive, the original Hadoop SQL tool.
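A small sketch of what that interoperation looks like (assuming a Spark build with Hive support and a hypothetical table named src; HiveContext is the standard Spark 1.x entry point for HiveQL):

import org.apache.spark.sql.hive.HiveContext

val hive = new HiveContext(sc)                       // sc: an existing SparkContext
hive.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hive.sql("SELECT key, value FROM src LIMIT 10").show()
hive.sql("DROP TABLE IF EXISTS src")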
59. Spark SQL! Read JSON and Infer the Schema.
Read strings that are JSON records, infer the schema on the fly. Also, write RDD records as JSON.
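A sketch of the idea, using the Spark 1.3-era calls (the input path, field names, and query are hypothetical):

val events = sqlContext.jsonFile("data/events.json") // one JSON record per line
events.printSchema()                                  // the schema Spark inferred
events.registerTempTable("events")
sqlContext.sql("SELECT userId FROM events WHERE action = 1").show()
events.toJSON.saveAsTextFile("out/events-json")       // and back out as JSON strings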
60. Spark SQL! Read and write Parquet files.
Parquet is a newer file format developed by Twitter and Cloudera that is becoming very popular. It stores data in column order, which is better than row order when you have lots of columns and your queries only need a few of them. Also, columns of the same data types are easier to compress, which Parquet does for you. Finally, Parquet files carry the data schema.
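Continuing the sketch above (assuming the same sqlContext and the hypothetical events DataFrame from the JSON example; paths are hypothetical, and these are the Spark 1.3-era calls):

events.saveAsParquetFile("out/events.parquet")        // schema travels with the file
val restored = sqlContext.parquetFile("out/events.parquet")
restored.printSchema()                                // same schema, no extra metadata needed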
61. Spark SQL: ~10-100x the performance of Hive, due to in-memory caching of RDDs & better Spark abstractions.
62. Combine Spark SQL queries with Machine Learning code.
We'll use the Spark "MLlib" in the example, then return to it in a moment.
63. CREATE TABLE Users(
userId INT,
name STRING,
email STRING,
age INT,
latitude DOUBLE,
longitude DOUBLE,
subscribed BOOLEAN);
CREATE TABLE Events(
userId INT,
action INT);
Equivalent HiveQL schema definitions.
This example was adapted from the following blog post announcing Spark SQL: http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html. It is adapted here to use Spark's own SQL, not the integration with Hive. Imagine we have a stream of events from users and the events that have occurred as they used a system.
64. val trainingDataTable = sql("""
SELECT e.action, u.age,
u.latitude, u.longitude
FROM Users u
JOIN Events e
ON u.userId = e.userId""")
val trainingData =
trainingDataTable map { row =>
val features =
Array[Double](row(1), row(2), row(3))
LabeledPoint(row(0), features)
}
val model =
new LogisticRegressionWithSGD()
.run(trainingData)
val allCandidates = sql("""
Here is some Spark (Scala) code with an embedded SQL query that joins the Users and Events tables. The """...""" string allows embedded line feeds. The "sql" function returns an RDD. If we used the Hive integration and this was a query against a Hive table, we would use the hql(...) function instead.
65. val trainingDataTable = sql("""
SELECT e.action, u.age,
u.latitude, u.longitude
FROM Users u
JOIN Events e
ON u.userId = e.userId""")
val trainingData =
trainingDataTable map { row =>
val features =
Array[Double](row(1), row(2), row(3))
LabeledPoint(row(0), features)
}
val model =
new LogisticRegressionWithSGD()
.run(trainingData)
val allCandidates = sql("""
We map over the RDD to create LabeledPoints, an object used in Spark's MLlib (machine learning library) for a recommendation engine. The "label" is the kind of event, and the user's age and lat/long coordinates are the "features" used for making recommendations. (E.g., if you're 25 and near a certain location in the city, you might be interested in a nightclub nearby...)
66. val model =
new LogisticRegressionWithSGD()
.run(trainingData)
val allCandidates = sql("""
SELECT userId, age, latitude, longitude
FROM Users
WHERE subscribed = FALSE""")
case class Score(
userId: Int, score: Double)
val scores = allCandidates map { row =>
val features =
Array[Double](row(1), row(2), row(3))
Score(row(0), model.predict(features))
}
// In-memory table
scores.registerTempTable("Scores")
Next we train the recommendation engine, using a "logistic regression" fit to the training data, where "stochastic gradient descent" (SGD) is used to train it. (This is a standard tool set for recommendation engines; see for example: http://www.cs.cmu.edu/~wcohen/10-605/assignments/sgd.pdf)
67. val model =
new LogisticRegressionWithSGD()
.run(trainingData)
val allCandidates = sql("""
SELECT userId, age, latitude, longitude
FROM Users
WHERE subscribed = FALSE""")
case class Score(
userId: Int, score: Double)
val scores = allCandidates map { row =>
val features =
Array[Double](row(1), row(2), row(3))
Score(row(0), model.predict(features))
}
// In-memory table
scores.registerTempTable("Scores")
Now run a query against all users who aren't already subscribed to notifications.
68. case class Score(
userId: Int, score: Double)
val scores = allCandidates map { row =>
val features =
Array[Double](row(1), row(2), row(3))
Score(row(0), model.predict(features))
}
// In-memory table
scores.registerTempTable("Scores")
val topCandidates = sql("""
SELECT u.name, u.email
FROM Scores s
JOIN Users u ON s.userId = u.userId
ORDER BY score DESC
LIMIT 100""")Wednesday, June 10, 15
Declare a class to hold each user's "score" as produced by the recommendation engine, and map the "all" query results to Scores.
69. case class Score(
userId: Int, score: Double)
val scores = allCandidates map { row =>
val features =
Array[Double](row(1), row(2), row(3))
Score(row(0), model.predict(features))
}
// In-memory table
scores.registerTempTable("Scores")
val topCandidates = sql("""
SELECT u.name, u.email
FROM Scores s
JOIN Users u ON s.userId = u.userId
ORDER BY score DESC
LIMIT 100""")Wednesday, June 10, 15
Then "register" the scores RDD as a "Scores" table in memory. If you use the Hive binding instead, this would be a table in Hive's metadata storage.
70. // In-memory table
scores.registerTempTable("Scores")
val topCandidates = sql("""
SELECT u.name, u.email
FROM Scores s
JOIN Users u ON s.userId = u.userId
ORDER BY score DESC
LIMIT 100""")
Finally, run a new query to find the people with the highest scores that aren't already subscribing to notifications. You might send them an email next, recommending that they subscribe...
71. val trainingDataTable = sql("""
SELECT e.action, u.age,
u.latitude, u.longitude
FROM Users u
JOIN Events e
ON u.userId = e.userId""")
val trainingData =
trainingDataTable map { row =>
val features =
Array[Double](row(1), row(2), row(3))
LabeledPoint(row(0), features)
}
val model =
new LogisticRegressionWithSGD()
.run(trainingData)
val allCandidates = sql("""
SELECT userId, age, latitude, longitude
FROM Users
WHERE subscribed = FALSE""")
case class Score(
userId: Int, score: Double)
val scores = allCandidates map { row =>
val features =
Array[Double](row(1), row(2), row(3))
Score(row(0), model.predict(features))
}
// In-memory table
scores.registerTempTable("Scores")
val topCandidates = sql("""
SELECT u.name, u.email
FROM Scores s
JOIN Users u ON s.userId = u.userId
ORDER BY score DESC
LIMIT 100""")
Altogether
12-point font again.
73. Spark Streaming: use the same abstractions for near real-time, event streaming.
Once you learn the core set of primitives, it's easy to compose non-trivial algorithms with little code.
74. "Mini batches"
A DStream (discretized stream) wraps the RDDs for each "batch" of events. You can specify the granularity, such as all events in 1-second batches, then your Spark job is passed each batch of data for processing. You can also work with moving windows of batches.
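A sketch of that moving-window idea (not from the deck; the host, port, and durations are hypothetical), using the standard reduceByKeyAndWindow operation on pair DStreams:

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext("local[*]", "Window Demo")
val ssc = new StreamingContext(sc, Seconds(10))       // 10-second batches
val words = ssc.socketTextStream("localhost", 9999)   // hypothetical source
  .flatMap(_.split("""\W+"""))
  .map((_, 1))

// Sum counts over the last 30 seconds of batches, recomputed every 10 seconds.
val windowed = words.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowed.print()

ssc.start()
ssc.awaitTermination()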
76. val sc = new SparkContext(...)
val ssc = new StreamingContext(
sc, Seconds(10))
// A DStream that will listen
// for text on server:port
val lines =
ssc.socketTextStream(s, p)
// Word Count...
val words = lines flatMap {
line => line.split("""\W+""")
}
This example was adapted from the following page on the Spark website: http://spark.apache.org/docs/0.9.0/streaming-programming-guide.html#a-quick-example
77. val sc = new SparkContext(...)
val ssc = new StreamingContext(
sc, Seconds(10))
// A DStream that will listen
// for text on server:port
val lines =
ssc.socketTextStream(s, p)
// Word Count...
val words = lines flatMap {
line => line.split("""\W+""")
}
We create a StreamingContext that wraps a SparkContext (there are alternative ways to construct it...). It will "clump" the events into 1-second intervals.
78. val sc = new SparkContext(...)
val ssc = new StreamingContext(
sc, Seconds(10))
// A DStream that will listen
// for text on server:port
val lines =
ssc.socketTextStream(s, p)
// Word Count...
val words = lines flatMap {
line => line.split("""\W+""")
}
Next we set up a socket to stream text to us from another server and port (one of several ways to ingest data).
79. // Word Count...
val words = lines flatMap {
line => line.split("""\W+""")
}
val pairs = words map ((_, 1))
val wordCounts =
pairs reduceByKey ((i,j) => i+j)
wordCounts.saveAsTextFiles(outpath)
ssc.start()
ssc.awaitTermination()
Now we "count words". For each mini-batch (1 second's worth of data), we split the input text into words (on whitespace, which is too crude). Once we set up the flow, we start it and wait for it to terminate through some means, such as the server socket closing.
80. // Word Count...
val words = lines flatMap {
line => line.split("""\W+""")
}
val pairs = words map ((_, 1))
val wordCounts =
pairs reduceByKey ((i,j) => i+j)
wordCounts.saveAsTextFiles(outpath)
ssc.start()
ssc.awaitTermination()
We count these words just like we counted (word, path) pairs earlier.
81. val pairs = words map ((_, 1))
val wordCounts =
pairs reduceByKey ((i,j) => i+j)
wordCounts.saveAsTextFiles(outpath)
ssc.start()
ssc.awaitTermination()
print is a useful diagnostic tool that prints a header and the first 10 records to the console at each iteration.
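For reference, a one-line sketch of that diagnostic, assuming the wordCounts DStream above:

wordCounts.print()   // header plus the first 10 records of each batch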
82. val pairs = words map ((_, 1))
val wordCounts =
pairs reduceByKey ((i,j) => i+j)
wordCounts.saveAsTextFiles(outpath)
ssc.start()
ssc.awaitTermination()
Now start the data flow and wait for it to terminate (possibly forever).
84. Distributed Graph Computing...
GraphX, which we won't discuss further. Some problems are more naturally represented as graphs. GraphX extends RDDs to support property graphs with directed edges.
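A minimal GraphX sketch (not from the deck; the vertices and edges are hypothetical) showing a small property graph and a trivial computation over it:

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}

val sc = new SparkContext("local[*]", "GraphX Demo")
val users = sc.parallelize(Seq((1L, "Ann"), (2L, "Dean"), (3L, "Pat")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))
val graph = Graph(users, follows)                    // a property graph with directed edges
graph.inDegrees.collect().foreach(println)           // e.g., (2,2): vertex 2 has two followers
sc.stop()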
85. Spark: a flexible, scalable distributed compute platform with concise, powerful APIs and higher-order tools.
spark.apache.org