Spark is a fast and general cluster computing system that improves on MapReduce by keeping data in-memory between jobs. It was developed in 2009 at UC Berkeley and open sourced in 2010. Spark core provides in-memory computing capabilities and a programming model that allows users to write programs as transformations on distributed datasets.
Part 2 of a three part presentation showing how nutch and solr may be used to crawl the web, extract data and prepare it for loading into a data warehouse.
On July 6, 2021, MariaDB 10.6 became generally available (production ready). This presentation focuses on the most important aspects of it as well as the influence it has. Improvements to InnoDB, SYS Schema Adoption, and deprecated variables and engines are all part of this presentation.
Building Data Pipelines with SMACK: Designing Storage Strategies for Scale an... (DataStax)
Learn how to build an effective storage layer for a variety of workloads. With changing trends in system and storage hardware, understanding design trade-offs can be a challenge. This webinar will focus on cutting through the noise and diving into the choices that matter when designing for scale and performance.
Video: https://youtu.be/uEL8vyVSIis
Understanding BlueStore, Ceph's new storage backend - Tim Serong, SUSE (OpenStack)
Audience Level
Intermediate
Synopsis
Ceph – the most popular storage solution for OpenStack – stores all data as a collection of objects. This object store was originally implemented on top of a POSIX filesystem, an approach that turned out to have a number of problems, notably with performance and complexity.
BlueStore, a new storage backend for Ceph, was created to solve these issues; the Ceph Jewel release included an early prototype. The code and on-disk format were declared stable (but experimental) for Ceph Kraken, and now in the upcoming Ceph Luminous release, BlueStore will be the recommended default storage backend.
With a 2-3x performance boost, you’ll want to look at migrating your Ceph clusters to BlueStore. This talk goes into detail about what BlueStore does, the problems it solves, and what you need to do to use it.
Speaker Bio:
Tim works for SUSE, hacking on Ceph and related technologies. He has spoken often about distributed storage and high availability at conferences such as linux.conf.au. In his spare time he wrangles pigs, chickens, sheep and ducks, and was declared by one colleague “teammate most likely to survive the zombie apocalypse”.
Large Scale Crawling with Apache Nutch and Friends (lucenerevolution)
Presented by Julien Nioche, Director, DigitalPebble
This session will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, SOLR, Tika or HBase. The second part of the presentation will be focused on the latest developments in Nutch, the differences between the 1.x and 2.x branch and what we can expect to see in Nutch in the future. This session will cover many practical aspects and should be a good starting point to crawling on a large scale with Apache Nutch and SOLR.
Adobe has packaged HBase in Docker containers and uses Marathon and Mesos to schedule them—allowing them to decouple the HBase RegionServer from the host, express resource requirements declaratively, and open the door for unassisted real-time deployments, elastic (up and down) real-time scalability, and more.
A brief, but action-packed introduction to DataStax Enterprise Search. In this deck, we'll get an overview of DSE Search's value proposition, see some example CQL search queries, and dive into the details of the indexing and query paths.
Evolution of MongoDB Replicaset and Its Best Practices (Mydbops)
There are several exciting and long-awaited features in MongoDB 4.0. This talk focuses on the prime features, the problems they solve, and best practices for deploying replica sets.
If you want to stay up to date, subscribe to our newsletter here: https://bit.ly/3tiw1I8
Introduction to HCatalog - its primary motivation, goals, the most important features (e.g. data discovery, notifications of data availability, WebHCat), currently supported file formats and projects.
Big Data and Machine Learning Workshop - Day 7 @ UTACM (Amir Sedighi)
Slides from day seven of the seven-day Big Data and Machine Learning workshop, which covered implementing a sample industrial machine-learning service and an introduction to installing and using TensorFlow.
Each session is two hours long.
Big Data and Machine Learning Workshop - Day 5 @ UTACM (Amir Sedighi)
Slides from day five of the seven-day Big Data and Machine Learning workshop, held with an emphasis on deep learning. The sixth session of the workshop will also be devoted to deep learning and its applications. The workshop is organized by the University of Tehran ACM chapter and held at the Faculty of Engineering.
Each session is two hours long.
Apache Kafka is an open-source message-broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Case Studies on Big-Data Processing and Streaming - Iranian Java User Group (Amir Sedighi)
During recent years, the data science has undergone a big shift towards big data processing. As a result, a change in our methodology seems to be inevitable. This change, however, does not necessarily translate to a loss in decades of investments in classical data processing technologies and data warehousing. Instead, it supports adapting to the new environment with regards to the mass production of business data, by adopting modern practices.
In this talk we review some frameworks and solutions to modern big data processing approaches, along with a few case studies that have been carried out in Iran.
A detailed presentation on the capabilities of in-memory analytics using Apache Spark: an overview of Spark, its programming model, cluster mode with Mesos, supported operations, and a comparison with Hadoop MapReduce, followed by the Spark stack extensions: Shark, Spark Streaming, MLlib, and GraphX.
Unified Big Data Processing with Apache Spark (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
Unified Big Data Processing with Apache Spark (QCON 2014) (Databricks)
While early big data systems, such as MapReduce, focused on batch processing, the demands on these systems have quickly grown. Users quickly needed to run (1) more interactive ad-hoc queries, (2) sophisticated multi-pass algorithms (e.g. machine learning), and (3) real-time stream processing. The result has been an explosion of specialized systems to tackle these new workloads. Unfortunately, this means more systems to learn, manage, and stitch together into pipelines. Spark is unique in taking a step back and trying to provide a *unified* post-MapReduce programming model that tackles all these workloads. By generalizing MapReduce to support fast data sharing and low-latency jobs, we achieve best-in-class performance in a variety of workloads, while providing a simple programming model that lets users easily and efficiently combine them.
Today, Spark is the most active open source project in big data, with high activity in both the core engine and a growing array of standard libraries built on top (e.g. machine learning, stream processing, SQL). I'm going to talk about the latest developments in Spark and show examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code.
Talk by Databricks CTO and Apache Spark creator Matei Zaharia at QCON San Francisco 2014.
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming (Paco Nathan)
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro batch" computations operated on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple uses cases: streaming, but also interactive, iterative, machine learning, etc.
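The micro-batch model described above can be sketched in a few lines of plain Python (illustrative only, not Spark's API): chop a timestamped stream into fixed intervals and run the same batch function over each interval.

```python
# Plain-Python sketch of discretized streams: an unbounded stream is cut
# into small batches (e.g. every 500 ms) and each batch is processed with
# ordinary batch logic. Timestamps are in milliseconds.

events = [(100, "a"), (400, "b"), (600, "a"), (900, "c"), (1200, "a")]
interval_ms = 500

# group events into micro-batches by interval index
batches = {}
for ts, value in events:
    batches.setdefault(ts // interval_ms, []).append(value)

# the same per-batch function handles every interval
def count_batch(values):
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return counts

per_batch_counts = {i: count_batch(vs) for i, vs in sorted(batches.items())}
print(per_batch_counts)
# {0: {'a': 1, 'b': 1}, 1: {'a': 1, 'c': 1}, 2: {'a': 1}}
```

Because each interval is just a small batch, the same counting logic could equally be applied interactively or over historical data, which is the unification the talk emphasizes.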
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
http://bit.ly/1BTaXZP – As organizations look for even faster ways to derive value from big data, they are turning to Apache Spark, an in-memory processing framework that offers lightning-fast big data analytics, providing speed, developer productivity, and real-time processing advantages. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative in-memory and streaming analysis. This talk will give an introduction to the Spark stack, explain how Spark delivers its lightning-fast results, and show how it complements Apache Hadoop. By the end of the session, you'll come away with a deeper understanding of how you can unlock deeper insights from your data, faster, with Spark.
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
Volodymyr Lyubinets, "Introduction to big data processing with Apache Spark" (IT Event)
In this talk we’ll explore Apache Spark — the most popular cluster computing framework right now. We’ll look at the improvements that Spark brought over Hadoop MapReduce and what makes Spark so fast; explore Spark programming model and RDDs; and look at some sample use cases for Spark and big data in general.
This talk will be interesting for people who have little or no experience with Spark and would like to learn more about it. It will also be interesting to a general engineering audience as we’ll go over the Spark programming model and some engineering tricks that make Spark fast.
Big Data and Machine Learning Workshop - Day 6 @ UTACM (Amir Sedighi)
Slides from day six of the seven-day Big Data and Machine Learning workshop, held with an emphasis on deep learning. The sixth session of the workshop was devoted to deep learning and its applications. The workshop is organized by the University of Tehran ACM chapter and held at the Faculty of Engineering.
Each session is two hours long.
Big Data and Machine Learning Workshop - Day 4 @ UTACM (Amir Sedighi)
Slides from day four of the seven-day Big Data and Machine Learning workshop, including an introduction to artificial neural networks and a simple sample implementation in Java. The course is organized by the University of Tehran ACM chapter.
Each session is two hours long.
Big Data and Machine Learning Workshop - Day 3 @ UTACM (Amir Sedighi)
Slides from the third day of the seven-day Big Data and Machine Learning workshop, which introduced open-source big-data processing solutions and stream-processing approaches. The core concepts were reviewed, and a small working example of using Hadoop was presented. The course is organized by the University of Tehran ACM chapter.
Each session is two hours long.
Big Data and Machine Learning Workshop - Day 2 @ UTACM (Amir Sedighi)
Slides from the second day of the seven-day Big Data and Machine Learning workshop, with an emphasis on unsupervised learning and a practical example of text clustering using term-weighting, Canopy, and k-means algorithms, held on 13 Mordad 1395 (August 2016) at the Faculty of Engineering, University of Tehran. The course is organized by the University of Tehran ACM chapter.
Each session is two hours long.
Big Data and Machine Learning Workshop - Day 1 @ UTACM (Amir Sedighi)
The first day of the seven-day Big Data and Machine Learning workshop, with an emphasis on supervised learning and a practical fraud-detection example, was held on 6 Mordad 1395 (July 2016) at the Faculty of Engineering, University of Tehran. These are the day-one slides. The course is organized by the University of Tehran ACM chapter.
Each session is two hours long.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
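For reference, the "monolithic" baseline the report compares against is standard power-iteration PageRank, in which every vertex is processed each iteration. A toy plain-Python sketch (hypothetical 4-vertex graph, assumed damping factor 0.85) with dead ends handled by spreading their rank uniformly:

```python
# Minimal monolithic PageRank power iteration on a toy graph (an assumed
# example, not the report's code). A dead-end vertex's rank is distributed
# uniformly over all vertices each iteration.

graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}  # vertex -> out-neighbors
n = len(graph)
damping, iters = 0.85, 50

ranks = [1.0 / n] * n
for _ in range(iters):
    new = [(1.0 - damping) / n] * n          # teleport term
    for u, outs in graph.items():
        if outs:
            share = damping * ranks[u] / len(outs)
            for v in outs:
                new[v] += share
        else:                                 # dead end: spread uniformly
            for v in range(n):
                new[v] += damping * ranks[u] / n
    ranks = new

assert abs(sum(ranks) - 1.0) < 1e-9          # ranks remain a distribution
```

Levelwise PageRank instead runs this style of iteration per strongly connected component, one topological level at a time, which is why dead ends must be eliminated (or handled, as in the loop-based strategy above) beforehand.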
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Adjusting primitives for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Notes on adjusting primitives for graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
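For context, the CSR structure mentioned above can be built in a few lines; a toy plain-Python sketch (the report's actual C++/CUDA code is not shown here):

```python
# Toy Compressed Sparse Row (CSR) build: edge list -> (offsets, targets).
# targets[offsets[v] : offsets[v + 1]] holds vertex v's out-neighbors.

edges = [(0, 1), (0, 2), (1, 2), (3, 0)]  # (source, destination) pairs
n = 4                                      # number of vertices

degree = [0] * n
for u, _ in edges:
    degree[u] += 1

offsets = [0] * (n + 1)                    # prefix sum of out-degrees
for v in range(n):
    offsets[v + 1] = offsets[v] + degree[v]

targets = [0] * len(edges)
fill = offsets[:-1].copy()                 # next free slot per vertex
for u, v in edges:
    targets[fill[u]] = v
    fill[u] += 1

def neighbors(v):
    return targets[offsets[v]:offsets[v + 1]]

print(neighbors(0))  # [1, 2]
```

The two flat arrays give contiguous, cache-friendly neighbor scans, which is why CSR is the usual substrate for the map/reduce-style primitives benchmarked in these notes.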
An Introduction to Apache Spark
1. An Introduction to Apache Spark
By Amir Sedighi
Datis Pars Data Technology
Slides adopted from Databricks
(Paco Nathan and Aaron Davidson)
@amirsedighi
http://hexican.com
2. History
● Developed in 2009 at UC Berkeley AMPLab.
● Open sourced in 2010.
● Spark has become one of the largest big-data
projects, with more than 400 contributors from 50+
organizations such as:
– Databricks, Yahoo!, Intel, Cloudera, IBM, …
3. What is Spark?
● Fast and general cluster computing system
interoperable with Hadoop datasets.
4. What are Spark's improvements?
● Improves efficiency through:
– In-memory computing primitives.
– General computation graphs.
● Improves usability through:
– Rich APIs in Scala, Java, Python
– Interactive shell (Scala/Python)
6. MapReduce
● MapReduce is great for single-pass batch jobs,
but many use cases need to run MapReduce
in a multi-pass manner...
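A toy illustration of the cost (plain Python, neither MapReduce nor Spark code): a multi-pass job that re-reads its input on every pass versus one that loads the dataset once and keeps it in memory.

```python
# Toy sketch: why multi-pass workloads hurt in MapReduce. Each pass
# would re-read its input from disk (HDFS); keeping the dataset in
# memory between passes is the optimization Spark's RDDs provide.

def load_dataset():
    # stands in for an expensive HDFS read
    return list(range(10))

passes = 3

# MapReduce-style: every pass pays the load cost again
results_mr = []
for _ in range(passes):
    data = load_dataset()                     # re-read per pass
    results_mr.append(sum(x * x for x in data))

# Spark-style: load once, cache in memory, run all passes on it
data = load_dataset()                         # read once ("cache")
results_spark = [sum(x * x for x in data) for _ in range(passes)]

assert results_mr == results_spark            # same answers, fewer reads
```

The answers are identical; the difference is that the Spark-style version calls `load_dataset()` once instead of once per pass.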
7. What improvements does Spark make over MapReduce?
● Spark improves on MapReduce by supporting
multi-pass analytics, interactive, and real-time
distributed computation on top of Hadoop.
Note:
– Spark is a successor to Hadoop MapReduce.
12. Spark Programming Model
● At a high level, every Spark application consists
of a driver program that runs the user's main
function.
● Encourages you to write programs in terms of
transformations on distributed datasets.
13. Spark Programming Model
● The main abstraction Spark provides is a
resilient distributed dataset (RDD).
– A collection of elements partitioned across the
cluster (in memory or on disk)
– Can be accessed and operated on in parallel (map,
filter, ...)
– Automatically rebuilt on failure
14. Spark Programming Model
● RDD Operations
– Transformations: Create a new dataset from an
existing one.
● Example: map()
– Actions: Return a value to the driver program after
running a computation on the dataset.
● Example: reduce()
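The split between lazy transformations and eager actions can be mimicked with Python's built-in lazy iterators; a sketch of the model, not Spark's API:

```python
# Plain-Python sketch of the RDD model: transformations are lazy and
# only describe a computation; an action forces evaluation and returns
# a value to the "driver".
from functools import reduce

data = range(1, 5)

# "transformations": build a lazy pipeline, nothing is computed yet
mapped = map(lambda x: x * 10, data)          # like rdd.map(...)
filtered = filter(lambda x: x > 10, mapped)   # like rdd.filter(...)

# "action": pulls data through the pipeline and returns a result
total = reduce(lambda a, b: a + b, filtered)  # like rdd.reduce(...)
print(total)  # 20 + 30 + 40 = 90
```

As in Spark, nothing runs until the action at the end; `map` and `filter` here return lazy iterator objects, just as RDD transformations return new (unevaluated) RDDs.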
16. Spark Programming Model
● Another abstraction is shared variables:
– Broadcast variables, which can be used to cache a
value in memory on all nodes.
– Accumulators, which workers can only add to and
the driver can read.
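A minimal plain-Python sketch of the two shared-variable kinds (the class names are illustrative, not Spark's API):

```python
# Sketch of shared-variable semantics. A broadcast variable is a
# read-only value shipped once to every worker; an accumulator is
# write-only for workers and readable by the driver.

class Broadcast:
    def __init__(self, value):
        self.value = value           # read-only by convention

class Accumulator:
    def __init__(self, value=0):
        self._value = value
    def add(self, amount):           # workers may only add
        self._value += amount
    @property
    def value(self):                 # the driver reads the total
        return self._value

lookup = Broadcast({"a": 1, "b": 2})  # shipped once to all nodes
errors = Accumulator(0)

for record in ["a", "b", "x", "b"]:   # stands in for distributed tasks
    if record not in lookup.value:
        errors.add(1)                 # count bad records across tasks

print(errors.value)  # 1
```

In real Spark the driver creates these via `sc.broadcast(...)` and an accumulator API, and the add-only restriction is what makes accumulator updates safe to apply from many tasks in parallel.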
20. Ease of Use
● Spark offers over 80 high-level operators that
make it easy to build parallel apps.
● Scala and Python shells to use it interactively.
23. Apache Spark Core
● Spark Core is the general engine for the Spark
platform.
– In-memory computing capabilities deliver speed
– General execution model supports wide delivery of
use cases
– Ease of development – native APIs in Java, Scala,
Python (+ SQL, Clojure, R)
35. Spark Streaming
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
DStream: a sequence of distributed datasets (RDDs)
representing a distributed stream of data
36. Spark Streaming
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
DStream: a sequence of distributed datasets (RDDs)
representing a distributed stream of data
transformation: modify data in one DStream to create
another (new) DStream
37. Spark Streaming
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()
sliding window operation: parameters are the window
length and the sliding interval
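In plain Python the windowed count above can be mimicked like this (a toy sketch, not Spark's API; a window of 3 batches sliding by 1 stands in for `window(Minutes(1), Seconds(1))`):

```python
# Sketch of a windowed count over micro-batches: each batch is one
# interval's worth of hashtags; the window covers the last 3 batches
# and slides forward by 1 batch at a time.

batches = [["spark"], ["spark", "kafka"], ["kafka"], ["spark"]]
window_len = 3

def count_by_value(items):
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts

windowed = []
for i in range(len(batches)):
    window = batches[max(0, i - window_len + 1): i + 1]
    flat = [tag for batch in window for tag in batch]
    windowed.append(count_by_value(flat))

print(windowed[-1])  # counts over the last 3 batches
```

Each output is recomputed from the batches currently inside the window, which is exactly what `hashTags.window(...).countByValue()` expresses in the Scala snippet above.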
40. MLlib
● MLlib is Spark's scalable machine learning
library.
● MLlib works on any Hadoop data source, such
as HDFS, HBase, or local files.
41. MLlib
● Algorithms:
– linear SVM and logistic regression
– classification and regression tree
– k-means clustering
– recommendation via alternating least squares
– singular value decomposition
– linear regression with L1- and L2-regularization
– multinomial naive Bayes
– basic statistics
– feature transformations
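As a flavor of what one of these algorithms does, here is a toy one-dimensional k-means in plain Python (not MLlib's API, which runs the same idea distributed over RDDs):

```python
# One-dimensional k-means, a toy version of the clustering MLlib
# provides at scale. Alternates assign/update steps: assign each point
# to its nearest center, then move each center to its cluster's mean.

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers = [1.0, 10.0]                 # initial guesses, k = 2

for _ in range(10):
    # assignment step: nearest center for each point
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # update step: move each center to its cluster's mean
    centers = [sum(c) / len(c) for c in clusters]

print(centers)  # [1.5, 10.5]
```

The two well-separated groups converge immediately here; MLlib's version distributes the assignment step as a map over partitions and the update step as a reduce.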
46. Spark Runs Everywhere
● Spark runs on Hadoop, Mesos, standalone, or
in the cloud.
● Spark accesses diverse data sources including
HDFS, Cassandra, HBase, S3.
47. Resources
● http://spark.apache.org
● Intro to Apache Spark by Paco Nathan
● Building a Unified Data Pipeline in Spark by Aaron Davidson.
● http://www.slideshare.net/manishgforce/lightening-fast-big-data-analytics-using-apache-spark
● Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup