This document contains code for implementing an inverted index using Apache Hadoop MapReduce. The code includes a main class that sets up the job, a LineIndexMapper class containing the map method to tokenize text and output (word, document) pairs, and a LineIndexReducer class containing the reduce method to aggregate values for each word key into a comma-separated string. The code implements the basic MapReduce pattern of mapping input data to intermediate key-value pairs, shuffling, and reducing to output the final index.
Hadoop Distributed File System (HDFS) is evolving from a MapReduce-centric storage system into a generic, cost-effective storage infrastructure where HDFS stores all of an organization's data. The new use case presents a new set of challenges to the original HDFS architecture. One challenge is to scale the storage management of HDFS: the centralized scheme within the NameNode becomes a bottleneck that limits the total number of files stored. Although a typical large HDFS cluster is able to store several hundred petabytes of data, it is inefficient at handling large numbers of small files under the current architecture.
In this talk, we introduce our new design and in-progress work that re-architects HDFS to attack this limitation. The storage management is enhanced to a distributed scheme. A new concept of storage container is introduced for storing objects. HDFS blocks are stored and managed as objects in the storage containers instead of being tracked only by NameNode. Storage containers are replicated across DataNodes using a newly-developed high-throughput protocol based on the Raft consensus algorithm. Our current prototype shows that under the new architecture the storage management of HDFS scales 10x better, demonstrating that HDFS is capable of storing billions of files.
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ... (Simplilearn)
This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what Spark is, and the difference between Hadoop and Spark. You will learn the different components in Spark and how Spark works with the help of its architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what Apache Spark is.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark usecase
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn's Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Spark shell scripting
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work improving it along many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains, including significantly improved performance for ACID tables. The talk will also provide a glimpse of what is expected to come in the near future.
This is the presentation I made at JavaDay Kiev 2015 regarding the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames and some other high-level stuff, and can be used as an introduction to Apache Spark.
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era. It started all the way from where Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by David Moravek
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag... (Databricks)
Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it easy for the developer to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood which makes all the magic possible. In this talk, I will dive deep into different stateful operations (streaming aggregations, deduplication and joins) and how they work under the hood in the Structured Streaming engine.
Introduction to Big Data & Hadoop Architecture - Module 1 (Rohit Agrawal)
Learning Objectives - In this module, you will understand what Big Data is, what the limitations of the existing solutions for the Big Data problem are, how Hadoop solves the Big Data problem, the common Hadoop ecosystem components, Hadoop Architecture, HDFS and the MapReduce Framework, and the anatomy of a file write and read.
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
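As a small, hedged sketch of what "Java-based file system" means in practice, the snippet below reads a file through the HDFS client API from Scala; the namenode URI and file path are illustrative assumptions, not values from the text above.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

object HdfsReadSketch extends App {
  // Point the client at a cluster; "hdfs://namenode:8020" is a placeholder URI.
  val conf = new Configuration()
  conf.set("fs.defaultFS", "hdfs://namenode:8020")
  val fs = FileSystem.get(conf)

  // Open a (hypothetical) file and print its first few lines.
  val in = fs.open(new Path("/data/crawl/part-00000"))
  try Source.fromInputStream(in).getLines().take(5).foreach(println)
  finally in.close()
}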
Presentation slides of the workshop on "Introduction to Pig" at Fifth Elephant, Bangalore, India on 26th July, 2012.
http://fifthelephant.in/2012/workshop-pig
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
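As a rough illustration of that pattern, here is a word count expressed with plain Scala collections; the three phases below stand in for the map tasks, the framework's sort/shuffle, and the reduce tasks (the input strings are made up for the example).

object MapShuffleReduceSketch extends App {
  // Two "input splits" processed independently in the map phase.
  val chunks = Seq("hadoop provides mapreduce", "hadoop provides hdfs")

  // Map phase: emit (word, 1) pairs from each chunk.
  val mapped = chunks.flatMap(_.split("\\s+").map(word => (word, 1)))

  // Shuffle: bring all pairs with the same key together.
  val shuffled = mapped.groupBy { case (word, _) => word }

  // Reduce phase: sum the counts for each word.
  val reduced = shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  reduced.toSeq.sortBy(_._1).foreach(println)   // (hadoop,2), (hdfs,1), (mapreduce,1), (provides,2)
}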
Running Apache NiFi with Apache Spark: Integration Options (Timothy Spann)
A walk-through of various options for integrating Apache Spark and Apache NiFi in one smooth dataflow. There are now several options for interfacing between Apache NiFi and Apache Spark with Apache Kafka and Apache Livy.
File Format Benchmarks - Avro, JSON, ORC, & Parquet (Owen O'Malley)
Hadoop Summit June 2016
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn't fast or easy to do so. The use cases that we've examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
* writing the data
Furthermore, it is important to benchmark on real data rather than synthetic data. We used the GitHub logs data available freely from http://githubarchive.org We will make all of the benchmark code open source so that our experiments can be replicated.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure like YARN/Mesos etc. This talk will cover a basic introduction to Apache Spark and its various components, like MLlib, Shark and GraphX, with a few examples.
This talk discusses Spark (http://spark.apache.org), the Big Data computation system that is emerging as a replacement for MapReduce in Hadoop systems, while it also runs outside of Hadoop. I discuss the issues why MapReduce needs to be replaced and how Spark addresses them with better performance and a more powerful API.
While Hadoop is the dominant "Big Data" tool suite today, it's a first-generation technology. I discuss its strengths and weaknesses, then look at how we "should" be doing Big Data and currently-available alternative tools.
Back in summer of 2014, we launched the results of a survey on Java 8, which shared a lot of information we were looking for, but also contained a small golden nugget of data that we didn’t expect: that out of more than 3000 developers surveyed, a shocking 17% of them reported using Apache Spark in production.
So we did another survey with 2100+ respondents drilling down into what developers, data scientists, executives and organizations are looking forward to with Apache Spark. You can download the full version of the report for the whole story, but here is a sneak peek into the findings that we discovered.
The full version is at: http://typesafe.com/blog/apache-spark-preparing-for-the-next-wave-of-reactive-big-data
This was a presentation on my book MapReduce Design Patterns, given to the Twin Cities Hadoop Users Group. Check it out if you are interested in seeing what my book is about.
Reactive design: languages, and paradigms (Dean Wampler)
A talk first given at React 2014 and refined for YOW! LambdaJam 2014 that explores the meaning of Reactive Programming, as described in the Reactive Manifesto, and how well it is supported by general design paradigms, like Functional Programming, Object-Oriented Programming, and Domain Driven Design, and by particular design approaches, such as Functional Reactive Programming, Reactive Extensions, Actors, etc.
Handouts to help you prepare for programming/coding/algorithm interview + behavioral interview questions + product management questions, especially at the top tech companies.
Real Time Data Processing using Spark Streaming | Data Day Texas 2015 (Cloudera, Inc.)
Speaker: Hari Shreedharan
Data Day Texas 2015
Apache Spark has emerged over the past year as the imminent successor to Hadoop MapReduce. Spark can process data in memory at very high speed, while still being able to spill to disk if required. Spark's powerful, yet flexible API allows users to write complex applications very easily without worrying about the internal workings and how the data gets processed on the cluster.
Spark comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems like Apache Flume, Apache Kafka, Amazon Kinesis etc. allowing users to process data as it comes in.
In this talk, Hari will discuss the basics of Spark Streaming, its API and its integration with Flume, Kafka and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application execution will be presented, which can help the audience understand good practices for writing such an application. Hari will finally discuss how to write a custom application and a custom receiver to receive data from other systems.
Asynchronous API in Java 8, how to use CompletableFuture (José Paumard)
Slides of my talk as Devoxx 2015. How to set up asynchronous data processing pipelines using the CompletionStage / CompletableFuture API, including how to control threads and how to handle exceptions.
AMIS SIG - Introducing Apache Kafka - Scalable, reliable Event Bus & Message ... (Lucas Jellema)
Introduction to Apache Kafka - the open source platform for real-time message queuing and reliable, scalable, distributed event handling and high-volume pub/sub implementation.
see GitHub https://github.com/MaartenSmeets/kafka-workshop for the workshop resources.
Modern businesses have data at their core, and this data is changing continuously. How can we harness this torrent of information in real-time? The answer is stream processing, and the technology that has since become the core platform for streaming data is Apache Kafka. Among the thousands of companies that use Kafka to transform and reshape their industries are the likes of Netflix, Uber, PayPal, and AirBnB, but also established players such as Goldman Sachs, Cisco, and Oracle.
Unfortunately, today’s common architectures for real-time data processing at scale suffer from complexity: there are many technologies that need to be stitched and operated together, and each individual technology is often complex by itself. This has led to a strong discrepancy between how we, as engineers, would like to work vs. how we actually end up working in practice.
In this session we talk about how Apache Kafka helps you to radically simplify your data processing architectures. We cover how you can now build normal applications to serve your real-time processing needs — rather than building clusters or similar special-purpose infrastructure — and still benefit from properties such as high scalability, distributed computing, and fault-tolerance, which are typically associated exclusively with cluster technologies. Notably, we introduce Kafka’s Streams API, its abstractions for streams and tables, and its recently introduced Interactive Queries functionality. As we will see, Kafka makes such architectures equally viable for small, medium, and large scale use cases.
Apache Spark in Depth: Core Concepts, Architecture & Internals (Anton Kirillov)
Slides cover core concepts of Apache Spark such as RDDs, the DAG, the execution workflow, forming stages of tasks, and the shuffle implementation, and also describe the architecture and main components of the Spark Driver. The workshop part covers Spark execution modes and provides a link to a GitHub repo which contains Spark application examples and a dockerized Hadoop environment to experiment with.
Apache Spark 2.0: Faster, Easier, and Smarter (Databricks)
In this webcast, Reynold Xin from Databricks will be speaking about Apache Spark's new 2.0 major release.
The major themes for Spark 2.0 are:
- Unified APIs: Emphasis on building up higher level APIs including the merging of DataFrame and Dataset APIs
- Structured Streaming: Simplify streaming by building continuous applications on top of DataFrames, allowing us to unify streaming, interactive, and batch queries (a minimal sketch follows this list)
- Tungsten Phase 2: Speed up Apache Spark by 10X
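As a small illustration of the Structured Streaming theme, here is a hedged sketch of a streaming word count written against the same DataFrame/Dataset API used for batch queries; the socket source, host and port are placeholder choices for the example, not part of the announcement.

import org.apache.spark.sql.SparkSession

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StructuredStreamingWordCount")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Treat a socket as an unbounded table of lines (host/port are placeholders).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // The same DataFrame/Dataset operations you would write for a batch query.
    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    // Continuously maintain the running counts on the console.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}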
Apache Spark is written in the Scala programming language, which compiles to bytecode for the JVM, for Spark big data processing.
The open source community has developed a wonderful utility for using Spark from Python for big data processing, known as PySpark.
Hadoop MapReduce is increasingly being replaced by Spark. The basic reason is that Spark can be up to 100 times faster than Hadoop MapReduce, so tasks run on Spark are typically much faster and more efficient than on Hadoop.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
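To make the baseline concrete, here is a minimal single-threaded power-iteration PageRank sketch with per-vertex convergence tracking (one of the optimizations mentioned above); the graph encoding, damping factor and tolerance are illustrative choices, dangling nodes are ignored for brevity, and every vertex must appear as a key of the map.

object PageRankSketch {
  /** Power-iteration PageRank over an out-link adjacency map. */
  def pageRank(outLinks: Map[Int, Seq[Int]],
               damping: Double = 0.85,
               tol: Double = 1e-6,
               maxIter: Int = 100): Map[Int, Double] = {
    val vertices = outLinks.keys.toVector
    val n = vertices.size
    var ranks = vertices.map(v => v -> 1.0 / n).toMap
    var active = vertices.toSet                    // vertices whose rank is still changing

    var iter = 0
    while (iter < maxIter && active.nonEmpty) {
      // Each vertex spreads its current rank equally over its out-links.
      val contrib = collection.mutable.Map[Int, Double]().withDefaultValue(0.0)
      for ((v, links) <- outLinks if links.nonEmpty; u <- links)
        contrib(u) += ranks(v) / links.size

      val next = vertices.map(v => v -> ((1 - damping) / n + damping * contrib(v))).toMap

      // A converged vertex could be skipped in later iterations; here we only track them.
      active = vertices.filter(v => math.abs(next(v) - ranks(v)) >= tol).toSet
      ranks = next
      iter += 1
    }
    ranks
  }

  def main(args: Array[String]): Unit = {
    val g = Map(1 -> Seq(2), 2 -> Seq(3), 3 -> Seq(1, 2))
    pageRank(g).toSeq.sortBy(_._1).foreach(println)
  }
}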
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures that the stolen assets are flagged as scam transactions, making it impossible for the thief to use them.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
These are notes on graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list-based graph representation that stores the out-edges of all vertices contiguously in a single edge array, indexed by a per-vertex offset array.
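A minimal CSR sketch (my own illustration, not code from the report): the offsets array gives, for each vertex, the slice of the flat edge array that holds its out-neighbours.

// Compressed Sparse Row: out-edges of vertex v live in edges(offsets(v) until offsets(v + 1)).
final case class CsrGraph(offsets: Array[Int], edges: Array[Int]) {
  def outDegree(v: Int): Int = offsets(v + 1) - offsets(v)
  def outNeighbours(v: Int): Array[Int] = edges.slice(offsets(v), offsets(v + 1))
}

// Example graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {} (three vertices, so offsets has four entries).
object CsrExample extends App {
  val g = CsrGraph(offsets = Array(0, 2, 3, 3), edges = Array(1, 2, 2))
  println(g.outNeighbours(0).mkString(", "))   // 1, 2
}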
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
1. Why Spark Is the Next Top (Compute) Model
Big Data Everywhere
Chicago, Oct. 1, 2014
@deanwampler
polyglotprogramming.com/talks
Tuesday, September 30, 14
Copyright (c) 2014, Dean Wampler, All Rights Reserved, unless otherwise noted.
Image: Detail of the London Eye
2. Dean Wampler
dean.wampler@typesafe.com
polyglotprogramming.com/talks
@deanwampler
Tuesday, September 30, 14
About me. You can find this presentation and others on Big Data and Scala at polyglotprogramming.com.
Programming Scala, 2nd Edition is forthcoming.
photo: Dusk at 30,000 ft above the Central Plains of the U.S. on a Winter’s Day.
3. Or
this?
Tuesday, September 30, 14
photo: https://twitter.com/john_overholt/status/447431985750106112/photo/1
4. Hadoop
circa 2013
Tuesday, September 30, 14
The state of Hadoop as of last year.
Image: Detail of the London Eye
5. Hadoop v2.X Cluster
node
Node Mgr
Data Node
DiskDiskDiskDiskDisk
node
Node Mgr
Data Node
DiskDiskDiskDiskDisk
node
Node Mgr
Data Node
DiskDiskDiskDiskDisk
master
Resource Mgr
Name Node
master
Name Node
Tuesday, September 30, 14
Schematic view of a Hadoop 2 cluster. For a more precise definition of the services and what they do, see e.g., http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html. We aren't interested in great details at this point, but we'll call out a few useful things to know.
6. Resource and Node Managers
Hadoop v2.X Cluster
node
Node Mgr
Data Node
DiskDiskDiskDiskDisk
node
Node Mgr
Data Node
DiskDiskDiskDiskDisk
node
Node Mgr
Data Node
DiskDiskDiskDiskDisk
master
Resource Mgr
Name Node
master
Name Node
Tuesday, September 30, 14
Hadoop 2 uses YARN to manage resources via the master Resource Manager, which includes a pluggable job scheduler and an Applications
Manager. It coordinates with the Node Manager on each node to schedule jobs and provide resources. Other services involved, including
application-specific Containers and Application Masters are not shown.
7. Name Node and Data Nodes
Hadoop v2.X Cluster
node
Node Mgr
Data Node
DiskDiskDiskDiskDisk
node
Node Mgr
Data Node
DiskDiskDiskDiskDisk
node
Node Mgr
Data Node
DiskDiskDiskDiskDisk
master
Resource Mgr
Name Node
master
Name Node
Tuesday, September 30, 14
Hadoop 2 clusters federate the Name node services that manage the file system, HDFS. They provide horizontal scalability of file-system
operations and resiliency when service instances fail. The data node services manage individual blocks for files.
8. MapReduce
The classic compute model
for Hadoop
Tuesday, September 30, 14
Hadoop 2 clusters federate the Name node services that manage the file system, HDFS. They provide horizontal scalability of file-system
operations and resiliency when service instances fail. The data node services manage individual blocks for files.
9. MapReduce
1 map step + 1 reduce step
(wash, rinse, repeat)
Tuesday, September 30, 14
You get 1 map step (although there is limited support for chaining mappers) and 1 reduce step. If you can’t implement an algorithm in these
two steps, you can chain jobs together, but you’ll pay a tax of flushing the entire data set to disk between these jobs.
10. MapReduce
Example:
Inverted Index
Tuesday, September 30, 14
The inverted index is a classic algorithm needed for building search engines.
11. Tuesday, September 30, 14
Before running MapReduce, crawl teh interwebs, find all the pages, and build a data set of URLs -> doc contents, written to flat files in HDFS
or one of the more “sophisticated” formats.
12. Tuesday, September 30, 14
Now we’re running MapReduce. In the map step, a task (JVM process) per file *block* (64MB or larger) reads the rows, tokenizes the text and
outputs key-value pairs (“tuples”)...
13. Map Task
(hadoop,(wikipedia.org/hadoop,1))
(provides,(wikipedia.org/hadoop,1))
(mapreduce,(wikipedia.org/hadoop, 1))
(and,(wikipedia.org/hadoop,1))
(hdfs,(wikipedia.org/hadoop, 1))
Tuesday, September 30, 14
... the keys are each word found and the values are themselves tuples, each URL and the count of the word. In our simplified example, there
are typically only single occurrences of each word in each document. The real occurrences are interesting because if a word is mentioned a
lot in a document, the chances are higher that you would want to find that document in a search for that word.
15. Tuesday, September 30, 14
The output tuples are sorted by key locally in each map task, then “shuffled” over the cluster network to reduce tasks (each a JVM process,
too), where we want all occurrences of a given key to land on the same reduce task.
16. Tuesday, September 30, 14
Finally, each reducer just aggregates all the values it receives for each key, then writes out new files to HDFS with the words and a list of
(URL-count) tuples (pairs).
17. Altogether...
Tuesday, September 30, 14
Finally, each reducer just aggregates all the values it receives for each key, then writes out new files to HDFS with the words and a list of
(URL-count) tuples (pairs).
18. What's not to like?
Tuesday, September 30, 14
This seems okay, right? What’s wrong with it?
19. Awkward
Most algorithms are
much harder to implement
in this restrictive
map-then-reduce model.
Tuesday, September 30, 14
Writing MapReduce jobs requires arcane, specialized skills that few master. For a good overview, see http://lintool.github.io/MapReduceAlgorithms/.
20. Awkward
Lack of flexibility inhibits
optimizations, too.
Tuesday, September 30, 14
The inflexible compute model leads to complex code to improve performance, where hacks are used to work around the limitations. Hence, optimizations are hard to implement. The Spark team has commented on this; see http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html
21. Performance
Full dump to disk
between jobs.
Tuesday, September 30, 14
Sequencing jobs wouldn’t be so bad if the “system” was smart enough to cache data in memory. Instead, each job dumps everything to disk,
then the next job reads it back in again. This makes iterative algorithms particularly painful.
23. Cluster Computing
Can be run in:
•YARN (Hadoop 2)
•Mesos (Cluster management)
•EC2
•Standalone mode
Tuesday, September 30, 14
If you have a Hadoop cluster, you can run Spark as a seamless compute engine on YARN. (You can also use pre-YARN Hadoop v1 clusters,
but there you have to manually allocate resources to the embedded Spark cluster vs Hadoop.) Mesos is a general-purpose cluster resource
manager that can also be used to manage Hadoop resources. Scripts for running a Spark cluster in EC2 are available. Standalone just means
you run Spark’s built-in support for clustering (or run locally on a single box - e.g., for development). EC2 deployments are usually
standalone...
24. Compute Model
Fine-grained “combinators”
for composing algorithms.
Tuesday, September 30, 14
Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little
code.
25. Compute Model
RDDs:
Resilient,
Distributed
Datasets
Tuesday, September 30, 14
RDDs shard the data over a cluster, like a virtualized, distributed collection (analogous to HDFS). They support intelligent caching, which
means no naive flushes of massive datasets to disk. This feature alone allows Spark jobs to run 10-100x faster than comparable MapReduce
jobs! The “resilient” part means they will reconstitute shards lost due to process/server crashes.
26. Compute Model
Tuesday, September 30, 14
RDDs shard the data over a cluster, like a virtualized, distributed collection (analogous to HDFS). They support intelligent caching, which
means no naive flushes of massive datasets to disk. This feature alone allows Spark jobs to run 10-100x faster than comparable MapReduce
jobs! The “resilient” part means they will reconstitute shards lost due to process/server crashes.
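A tiny illustration of that caching behaviour (not from the slides; a sketch assuming a SparkContext `sc` is in scope and reusing the deck's "data/crawl" input path):

// cache() keeps the lines in memory after the first action, so the second
// action reuses the cached partitions instead of re-reading the files.
val lines = sc.textFile("data/crawl").cache()
val words = lines.flatMap(_.split("""\W+"""))

println(words.count())            // first action: reads the files and populates the cache
println(words.distinct().count()) // second action: recomputes from the cached lines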
27. Compute Model
Written in Scala,
with Java and Python APIs.
Tuesday, September 30, 14
Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little
code.
28. Inverted Index
in MapReduce
(Java).
Tuesday, September 30, 14
Let's see an actual implementation of the inverted index. First, a Hadoop MapReduce (Java) version, adapted from https://developer.yahoo.com/hadoop/tutorial/module4.html#solution. It's about 90 lines of code, but I reformatted it to fit better. This is also a slightly simpler version than the one I diagrammed. It doesn't record a count of each word in a document; it just writes (word, doc-title) pairs out of the mappers, and the final (word, list) output by the reducers just has a list of documents, hence repeats. A second job would be necessary to count the repeats.
29. import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class LineIndexer {
public static void main(String[] args) {
JobClient client = new JobClient();
JobConf conf =
new JobConf(LineIndexer.class);
conf.setJobName("LineIndexer");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf,
new Path("input"));
FileOutputFormat.setOutputPath(conf,
Tuesday, September 30, 14
I've shortened the original code a bit, e.g., using * import statements instead of separate imports for each class. I'm not going to explain every line ... nor most lines. Everything is in one outer class. We start with a main routine that sets up the job. Lotta boilerplate... I used yellow for method calls, because methods do the real work!! But notice that the functions in this code don't really do a whole lot...
30. JobClient client = new JobClient();
JobConf conf =
new JobConf(LineIndexer.class);
conf.setJobName("LineIndexer");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf,
new Path("input"));
FileOutputFormat.setOutputPath(conf,
new Path("output"));
conf.setMapperClass(
LineIndexMapper.class);
conf.setReducerClass(
LineIndexReducer.class);
client.setConf(conf);
try {
JobClient.runJob(conf);
} catch (Exception e) {
Tuesday, September 30, 14
boilerplate...
31. LineIndexMapper.class);
conf.setReducerClass(
LineIndexReducer.class);
client.setConf(conf);
try {
JobClient.runJob(conf);
} catch (Exception e) {
e.printStackTrace();
}
}
public static class LineIndexMapper
extends MapReduceBase
implements Mapper<LongWritable, Text,
Text, Text> {
private final static Text word =
new Text();
private final static Text location =
new Text();
Tuesday, September 30, 14
main ends with a try-catch clause to run the
job.
32. public static class LineIndexMapper
extends MapReduceBase
implements Mapper<LongWritable, Text,
Text, Text> {
private final static Text word =
new Text();
private final static Text location =
new Text();
public void map(
LongWritable key, Text val,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
FileSplit fileSplit =
(FileSplit)reporter.getInputSplit();
String fileName =
fileSplit.getPath().getName();
location.set(fileName);
Tuesday, September 30, 14
This is the LineIndexMapper class for the mapper. The map method does the real work of tokenization and writing the (word, document-name)
tuples.
33. Reporter reporter) throws IOException {
FileSplit fileSplit =
(FileSplit)reporter.getInputSplit();
String fileName =
fileSplit.getPath().getName();
location.set(fileName);
String line = val.toString();
StringTokenizer itr = new
StringTokenizer(line.toLowerCase());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, location);
}
}
}
public static class LineIndexReducer
extends MapReduceBase
implements Reducer<Text, Text,
Text, Text> {
Tuesday, September 30, 14
The rest of the LineIndexMapper class and map
method.
34. public static class LineIndexReducer
extends MapReduceBase
implements Reducer<Text, Text,
Text, Text> {
public void reduce(Text key,
Iterator<Text> values,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
boolean first = true;
StringBuilder toReturn =
new StringBuilder();
while (values.hasNext()) {
if (!first)
toReturn.append(", ");
first=false;
toReturn.append(
values.next().toString());
}
output.collect(key,
new Text(toReturn.toString()));
}
Tuesday, September 30, 14
The reducer class, LineIndexReducer, with the reduce method that is called for each key and a list of values for that key. The reducer is
stupid; it just reformats the values collection into a long string and writes the final (word,list-string) output.
35. Reporter reporter) throws IOException {
boolean first = true;
StringBuilder toReturn =
new StringBuilder();
while (values.hasNext()) {
if (!first)
toReturn.append(", ");
first=false;
toReturn.append(
values.next().toString());
}
output.collect(key,
new Text(toReturn.toString()));
}
}
}
Tuesday, September 30, 14
EOF
36. Altogether
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class LineIndexer {
public static void main(String[] args) {
JobClient client = new JobClient();
JobConf conf =
new JobConf(LineIndexer.class);
conf.setJobName("LineIndexer");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf,
new Path("input"));
FileOutputFormat.setOutputPath(conf,
new Path("output"));
conf.setMapperClass(
LineIndexMapper.class);
conf.setReducerClass(
LineIndexReducer.class);
client.setConf(conf);
try {
JobClient.runJob(conf);
} catch (Exception e) {
e.printStackTrace();
}
}
public static class LineIndexMapper
extends MapReduceBase
implements Mapper<LongWritable, Text,
Text, Text> {
private final static Text word =
new Text();
private final static Text location =
new Text();
public void map(
LongWritable key, Text val,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
FileSplit fileSplit =
(FileSplit)reporter.getInputSplit();
String fileName =
fileSplit.getPath().getName();
location.set(fileName);
String line = val.toString();
StringTokenizer itr = new
StringTokenizer(line.toLowerCase());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, location);
}
}
}
public static class LineIndexReducer
extends MapReduceBase
implements Reducer<Text, Text,
Text, Text> {
public void reduce(Text key,
Iterator<Text> values,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
boolean first = true;
StringBuilder toReturn =
new StringBuilder();
while (values.hasNext()) {
if (!first)
toReturn.append(", ");
first=false;
toReturn.append(
values.next().toString());
}
output.collect(key,
new Text(toReturn.toString()));
}
}
}
Tuesday, September 30, 14
The whole shebang (6pt. font)
37. Inverted Index
in Spark
(Scala).
Tuesday, September 30, 14
This code is approximately 45 lines, but it does more than the previous Java example; it implements the original inverted index algorithm I diagrammed, where word counts are computed and included in the data.
38. import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object InvertedIndex {
def main(args: Array[String]) = {
val sc = new SparkContext(
"local", "Inverted Index")
sc.textFile("data/crawl")
.map { line =>
val array = line.split("\t", 2)
(array(0), array(1))
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
Tuesday, September 30, 14
It starts with imports, then declares a singleton object (a first-class concept in Scala), with a main routine (as in Java). The methods are colored yellow again. Note this time how dense with meaning they are.
}
39. val sc = new SparkContext(
"local", "Inverted Index")
sc.textFile("data/crawl")
.map { line =>
val array = line.split("\t", 2)
(array(0), array(1))
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
case (n1, n2) => n1 + n2
}
.map {
Tuesday, September 30, 14
You begin the workflow by declaring a SparkContext (in "local" mode, in this case). The rest of the program is a sequence of function calls, analogous to "pipes" we connect together to perform the data flow. Next we read one or more text files. If "data/crawl" has 1 or more Hadoop-style "part-NNNNN" files, Spark will process all of them (in parallel if running a distributed configuration; they will be processed synchronously in local mode).
40. sc.textFile("data/crawl")
.map { line =>
val array = line.split("\t", 2)
(array(0), array(1))
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
case (n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
.groupBy {
case (w, (p, n)) => w
Tuesday, September 30, 14
sc.textFile returns an RDD with a string for each line of input text. So, the first thing we do is map over these strings to extract the original document id (i.e., file name), followed by the text in the document, all on one line. We assume tab is the separator. "(array(0), array(1))" returns a two-element "tuple". Think of the output RDD as having a schema "String fileName, String text".
41. }
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
case (n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
.groupBy {
case (w, (p, n)) => w
}
.map {
case (w, seq) =>
Beautiful, no?
Tuesday, September 30, 14
flatMap maps over each of these 2-element tuples. We split the text into words on non-alphanumeric characters, then output collections of word (our ultimate, final "key") and the path. Each line is converted to a collection of (word, path) pairs, so flatMap converts the collection of collections into one long "flat" collection of (word, path) pairs. Then we map over these pairs and add a single count of 1.
42. Tuesday, September 30, 14
Another example of a beautiful and profound DSL, in this case from the world of Physics: Maxwell's equations: http://upload.wikimedia.org/wikipedia/commons/c/c4/Maxwell'sEquations.svg
43. case (w, p) => ((w, p), 1)
}
.reduceByKey {
case (n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
.groupBy {
case (w, (p, n)) => w
}
.map {
case (w, seq) =>
(word1, (path1, n1)), (word2, (path2, n2)), ...
val seq2 = seq map {
case (_, (p, n)) => (p, n)
}
(w, seq2.mkString(", "))
}
.saveAsTextFile(argz.outpath)
sc.stop()
}
Tuesday, September 30, 14
reduceByKey does an implicit "group by" to bring together all occurrences of the same (word, path) and then sums up their counts. Note the input to the next map is now ((word, path), n), where n is now >= 1. We transform these tuples into the form we actually want, (word, (path, n)).
44. }
.groupBy {
case (w, (p, n)) => w
}
.map {
case (w, seq) =>
(word, seq((word, (path1, n1)), (word, (path2, n2)), ...))
val seq2 = seq map {
case (_, (p, n)) => (p, n)
}
(w, seq2.mkString(", "))
}
.saveAsTextFile(argz.outpath)
sc.stop()
}
}
(word, "(path1, n1), (path2, n2), ...")
Tuesday, September 30, 14
Now we do an explicit group by to bring all the same words together. The output will be (word, (word, (path1, n1)), (word, (path2, n2)), ...). The last map removes the redundant "word" values in the sequences of the previous output. It outputs the sequence as a final string of comma-separated (path, n) pairs. We finish by saving the output as text file(s) and stopping the workflow.
45. import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object InvertedIndex {
def main(args: Array[String]) = {
val sc = new SparkContext(
"local", "Inverted Index")
sc.textFile("data/crawl")
.map { line =>
val array = line.split("\t", 2)
(array(0), array(1))
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
case (n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
.groupBy {
case (w, (p, n)) => w
}
.map {
case (w, seq) =>
val seq2 = seq map {
case (_, (p, n)) => (p, n)
}
(w, seq2.mkString(", "))
}
.saveAsTextFile(argz.outpath)
sc.stop()
}
}
Altogether
Tuesday, September 30, 14
The whole shebang (12pt. font, this time)
46. sc.textFile("data/crawl")
.map { line =>
val array = line.split("\t", 2)
(array(0), array(1))
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
Concise Combinators!
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey{
case (n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
.groupBy {
Tuesday, September 30, 14
I've shortened the original code a bit, e.g., using * import statements instead of separate imports for each class. I'm not going to explain every line ... nor most lines. Everything is in one outer class. We start with a main routine that sets up the job. Lotta boilerplate...
47. The Spark version
took me ~30 minutes
to write.
Tuesday, September 30, 14
Once you learn the core primitives I used, and a few tricks for manipulating the RDD tuples, you can very quickly build complex algorithms for data processing! The Spark API allowed us to focus almost exclusively on the "domain" of data transformations, while the Java MapReduce version (which does less) forced tedious attention to infrastructure mechanics.
48. Use a SQL query when
you can!!
Tuesday, September 30, 14
49. Spark SQL!
Mix SQL queries with
the RDD API.
Tuesday, September 30, 14
Use the best tool for a particular
problem.
50. Spark SQL!
Create, Read, and Delete
Hive Tables
Tuesday, September 30, 14
Interoperate with Hive, the original Hadoop SQL tool.
51. Spark SQL!
Read JSON and
Infer the Schema
Tuesday, September 30, 14
Read strings that are JSON records, infer the schema on the fly. Also, write RDD records as
JSON.
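A minimal sketch of that JSON support (not from the slides), using the 1.x-era SQLContext API consistent with the rest of this deck; the file path is an assumption, and a SparkContext `sc` is assumed to be in scope.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)              // wraps the existing SparkContext
// jsonFile reads a file of JSON records and infers the schema from the data.
val people = sqlContext.jsonFile("data/people.json")
people.printSchema()
people.registerAsTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)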
52. SparkSQL
~10-100x the performance of
Hive, due to in-memory
caching of RDDs & better
Spark abstractions.
Tuesday, September 30, 14
53. Combine SparkSQL
queries with Machine
Learning code.
Tuesday, September 30, 14
We'll use the Spark "MLlib" in the example, then return to it in a moment.
54. CREATE TABLE Users(
userId INT,
name STRING,
email STRING,
age INT,
latitude DOUBLE,
longitude DOUBLE,
subscribed BOOLEAN);
CREATE TABLE Events(
userId INT,
action INT);
HiveQL schema definitions.
Tuesday, September 30, 14
This example is adapted from the following blog post announcing Spark SQL: http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html. Assume we have these Hive/Shark tables, with data about our users and events that have occurred.
55. val trainingDataTable = sql("""
SELECT e.action, u.age,
u.latitude, u.longitude
FROM Users u
JOIN Events e
ON u.userId = e.userId""")
val trainingData =
trainingDataTable map { row =>
val features =
Array[Double](row(1), row(2), row(3))
LabeledPoint(row(0), features)
}
val model =
new LogisticRegressionWithSGD()
.run(trainingData)
val allCandidates = sql("""
SELECT userId, age, latitude, longitude
Tuesday, September 30, 14
Here is some Spark (Scala) code with an embedded SQL query that joins the Users and Events tables. The """...""" string allows embedded line feeds. The "sql" function returns an RDD, which we then map over to create LabeledPoints, an object used in Spark's MLlib (machine learning library) for a recommendation engine. The "label" is the kind of event, and the user's age and lat/long coordinates are the "features" used for making recommendations. (E.g., if you're 25 and near a certain location in the city, you might be interested in a nightclub nearby...)
56. }
val model =
new LogisticRegressionWithSGD()
.run(trainingData)
val allCandidates = sql("""
SELECT userId, age, latitude, longitude
FROM Users
WHERE subscribed = FALSE""")
case class Score(
userId: Int, score: Double)
val scores = allCandidates map { row =>
val features =
Array[Double](row(1), row(2), row(3))
Score(row(0), model.predict(features))
}
// Hive table
scores.registerAsTable("Scores")
Tuesday, September 30, 14
Next we train the recommendation engine, using a "logistic regression" fit to the training data, where "stochastic gradient descent" (SGD) is used to train it. (This is a standard tool set for recommendation engines; see for example: http://www.cs.cmu.edu/~wcohen/10-605/assignments/sgd.pdf)
57. .run(trainingData)
val allCandidates = sql("""
SELECT userId, age, latitude, longitude
FROM Users
WHERE subscribed = FALSE""")
case class Score(
userId: Int, score: Double)
val scores = allCandidates map { row =>
val features =
Array[Double](row(1), row(2), row(3))
Score(row(0), model.predict(features))
}
// Hive table
scores.registerAsTable("Scores")
val topCandidates = sql("""
SELECT u.name, u.email
FROM Scores s
JOIN Users u ON s.userId = u.userId
Tuesday, September 30, 14
Now run a query against all users who aren't already subscribed to notifications.
58. case class Score(
userId: Int, score: Double)
val scores = allCandidates map { row =>
val features =
Array[Double](row(1), row(2), row(3))
Score(row(0), model.predict(features))
}
// Hive table
scores.registerAsTable("Scores")
val topCandidates = sql("""
SELECT u.name, u.email
FROM Scores s
JOIN Users u ON s.userId = u.userId
ORDER BY score DESC
LIMIT 100""")
Tuesday, September 30, 14
Declare a class to hold each user's "score" as produced by the recommendation engine and map the "all" query results to Scores. Then "register" the scores RDD as a "Scores" table in Hive's metadata repository. This is equivalent to running a "CREATE TABLE Scores ..." command at the Hive/Shark prompt!
59. }
// Hive table
scores.registerAsTable("Scores")
val topCandidates = sql("""
SELECT u.name, u.email
FROM Scores s
JOIN Users u ON s.userId = u.userId
ORDER BY score DESC
LIMIT 100""")
Tuesday, September 30, 14
Finally, run a new query to find the people with the highest scores that aren't already subscribing to notifications. You might send them an email next recommending that they subscribe...
61. Spark Streaming
Use the same abstractions
for near real-time,
event streaming.
Tuesday, September 30, 14
Once you learn the core set of primitives, it’s easy to compose non-trivial algorithms with little
code.
62. Tuesday, September 30, 14
A DStream (discretized stream) wraps the RDDs for each "batch" of events. You can specify the granularity, such as all events in 1-second
batches, then your Spark job is passed each batch of data for processing. You can also work with moving windows of batches.
64. val sc = new SparkContext(...)
val ssc = new StreamingContext(
sc, Seconds(1))
// A DStream that will listen to server:port
val lines =
ssc.socketTextStream(server, port)
// Word Count...
val words = lines flatMap {
line => line.split("""\W+""")
}
val pairs = words map (word => (word, 1))
val wordCounts =
pairs reduceByKey ((n1, n2) => n1 + n2)
wordCounts.print() // print a few counts...
ssc.start()
Tuesday, September 30, 14
This example is adapted from the following page on the Spark website: http://spark.apache.org/docs/0.9.0/streaming-programming-guide.html#a-quick-example. We create a StreamingContext that wraps a SparkContext (there are alternative ways to construct it...). It will "clump" the events into 1-second intervals. Next we set up a socket to stream text to us from another server and port (one of several ways to ingest data).
65. ssc.socketTextStream(server, port)
// Word Count...
val words = lines flatMap {
line => line.split("""\W+""")
}
val pairs = words map (word => (word, 1))
val wordCounts =
pairs reduceByKey ((n1, n2) => n1 + n2)
wordCounts.print() // print a few counts...
ssc.start()
ssc.awaitTermination()
Tuesday, September 30, 14
Now the "word count" happens over each interval (aggregation across intervals is also possible), but otherwise it works like before. Once we set up the flow, we start it and wait for it to terminate through some means, such as the server socket closing.
67. MLlib
•Linear regression
•Binary classification
•Collaborative filtering
•Clustering
•Others...
Tuesday, September 30, 14
Not as full-featured as more mature toolkits, but the Mahout project has announced they are going to port their algorithms to Spark, which
include powerful Mathematics, e.g., Matrix support libraries.
68. Distributed Graph
Computing
Tuesday, September 30, 14
Some problems are more naturally represented as graphs.
Extends RDDs to support property graphs with directed edges.
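A minimal GraphX sketch (not from the slides; the tiny three-vertex graph is made up for illustration, and a SparkContext `sc` is assumed as elsewhere in this deck):

import org.apache.spark.graphx.{Edge, Graph}

// A small property graph: vertex ids with String attributes, Int edge attributes.
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph    = Graph(vertices, edges)

// Run the built-in PageRank over the property graph and print the vertex ranks.
graph.pageRank(0.001).vertices.collect().foreach(println)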
72. Tuesday, September 30, 14
Why is Spark so good (and Java MapReduce so bad)? Because fundamentally, data analytics is Mathematics and programming tools inspired
by Mathematics - like Functional Programming - are ideal tools for working with data. This is why Spark code is so concise, yet powerful. This
is why it is a great platform for performance optimizations. This is why Spark is a great platform for higher-level tools, like SQL, graphs, etc.
Interest in FP started growing ~10 years ago as a tool to attack concurrency. I believe that data is now driving FP adoption even faster. I know
many Java shops that switched to Scala when they adopted tools like Spark and Scalding (https://github.com/twitter/scalding).
73. Spark
A flexible, scalable distributed
compute platform with
concise, powerful APIs and
higher-order tools.
spark.apache.org
Tuesday, September 30, 14
74. Why Spark Is the Next Top (Compute) Model
@deanwampler
polyglotprogramming.com/talks
Tuesday, September 30, 14
Copyright (c) 2014, Dean Wampler, All Rights Reserved, unless otherwise noted.
Image: The London Eye on one side of the Thames, Parliament on the other.