My experience in Spark tuning. All tests were made in a production environment (a 600+ node Hadoop cluster). The tuning results are useful for Spark SQL use cases.
Build a 1-trillion-record warehouse based on CarbonData (boxu42)
Apache CarbonData & Spark Meetup
Build a 1-trillion-record warehouse based on CarbonData
Huawei
Apache Spark™ is a unified analytics engine for large-scale data processing.
CarbonData is a high-performance data solution that supports various data analytic scenarios, including BI analysis, ad-hoc SQL query, fast filter lookup on detail records, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments; in one of the largest scenarios, it supports queries on a single table with 3 PB of data (more than 5 trillion records) with a response time of less than 3 seconds!
How to plan a Hadoop cluster for testing and production environments (Anna Yen)
Athemaster shares its experience planning hardware specs, server initialization, and role deployment with new Hadoop users. Two testing environments and three production environments serve as case studies.
Adobe Spark enables you to tell stories and share ideas quickly and beautifully. Spark lets you create three types of content:
Use Page to create a story using text, images, and video. When you’re done, Adobe will present your story as a responsive web page that can be viewed in any web browser.
Use Post to create images optimized for social media; you provide images and text and we’ll help with the design. Adobe will even help you create the right shape and size image for each social media platform.
Use Video to create, well, a video. Adobe gives you access to icons and images (or use your own); add your voice and background music, and we’ll turn your story into a video ready to share with the world.
Brian O'Neill from Monetate gave a presentation on Spark. He discussed Spark's history from Hadoop and MapReduce, the basics of RDDs, DataFrames, SQL and streaming in Spark. He demonstrated how to build and run Spark applications using Java and SQL with DataFrames. Finally, he covered Spark deployment architectures and ran a demo of a Spark application on Cassandra.
Apache Spark is an open-source framework developed by the AMPLab at the University of California, Berkeley, and subsequently donated to the Apache Software Foundation. Unlike Hadoop's MapReduce paradigm, which writes intermediate results to disk between stages, Spark's in-memory primitives can deliver performance up to 100 times better.
The document discusses concepts for rebranding an organization called Spark Leadership.
Concept 1 focuses on using a unique rounded font to give a soft expression to the name Spark Leadership. It also discusses using color symbolism by relating the word "growth" to turning green. Other branding ideas discussed include business cards, banners, t-shirts, notebooks and coffee cups.
Concept 2 uses an asterisk symbol next to the name to represent a focus point. It discusses using color references and features examples of other organizations to reference. Additional branding concepts include cards and banners.
The document outlines the proposed website structure and sitemap, including sections for growth, Rockefeller Habits training, Go Fast Forward training, events, and the company blog.
Lightning talk covering various aspects of software system performance: latency, data structures, garbage collection, troubleshooting methods such as the workload saturation method, quick diagnostic tools, flame graphs, and PerfView.
Apache Spark is a fast, general engine for large-scale data processing. It supports batch, interactive, and stream processing using a unified API. Spark uses resilient distributed datasets (RDDs), which are immutable distributed collections of objects that can be operated on in parallel. RDDs support transformations like map, filter, and reduce and actions that return final results to the driver program. Spark provides high-level APIs in Scala, Java, Python, and R and an optimized engine that supports general computation graphs for data analysis.
This talk discusses Spark (http://spark.apache.org), the Big Data computation system that is emerging as a replacement for MapReduce in Hadoop systems, while it also runs outside of Hadoop. I discuss the issues that make MapReduce ripe for replacement and how Spark addresses them with better performance and a more powerful API.
Introduction to Apache Spark. With an emphasis on the RDD API, Spark SQL (DataFrame and Dataset API) and Spark Streaming.
Presented at the Desert Code Camp:
http://oct2016.desertcodecamp.com/sessions/all
1. The document discusses the future of data science and big data technologies. It describes the roles of data scientists and their typical skills, salaries, and job outlook.
2. It discusses technologies like Hadoop, Spark, and distributed computing that are used to handle big data. While Hadoop is good for batch processing, Spark can perform both batch and real-time processing, up to 100x faster.
3. Going forward, data science will shift from descriptive to predictive analytics using machine learning to improve customer experience and business outcomes across industries like internet search and digital advertising.
PixieDust is an open source library that simplifies and improves Jupyter Python notebooks. It allows users to:
1. Easily install Python packages and libraries without modifying configuration files.
2. Create visualizations with a simple display() API that includes options for performance statistics, panning, and zooming.
3. Export data to cloud services or locally in CSV, JSON, HTML formats for further use or sharing.
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming (Paco Nathan)
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
How Spark is Enabling the New Wave of Converged Applications (MapR Technologies)
Apache Spark has become the de-facto compute engine of choice for data engineers, developers, and data scientists because of its ability to run multiple analytic workloads with a single compute engine. Spark is speeding up data pipeline development, enabling richer predictive analytics, and bringing a new class of applications to market.
This document outlines steps for developing analytic applications using Apache Spark and Python. It covers prerequisites for accessing flight and weather data, deploying a simple data pipe tool to build training, test, and blind datasets, and using an IPython notebook to train predictive models on flight delay data. The agenda includes accessing necessary services on Bluemix, preparing the data, training models in the notebook, evaluating model accuracy, and deploying models.
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016 (StampedeCon)
Spark 2.0 includes many exciting new features including Structured Streaming, and the unification of Datasets (new in 1.6) with DataFrames. Structured Streaming allows one to define recurrent queries on a stream of data that is handled as an infinite DataFrame. This query is incrementally updated with new data. This allows for code reuse between batch and streaming and an easier logical model to reason about. Datasets, an extension of DataFrames, were added as an experimental feature in Spark 1.6. They allow us to manipulate collections of objects in a type-safe fashion. In Spark 2.0 the two abstractions have been unified and now DataFrame = Dataset[Row]. We will discuss both of these new features and look at practical real world examples.
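As a minimal sketch of that unification (Spark 2.x API; the Person class and input path are made-up examples):

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder().appName("DatasetDemo").getOrCreate()
    import spark.implicits._

    // In Spark 2.0, DataFrame is just an alias for Dataset[Row]
    val df = spark.read.json("people.json")
    // ...which can be converted to a typed Dataset for compile-time safety
    val people = df.as[Person]
    people.filter(_.age > 21).show()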
Introduction to Structured Data Processing with Spark SQL (datamantra)
An introduction to structured data processing using the Data Source and DataFrame APIs of Spark. Presented at the Bangalore Apache Spark Meetup by Madhukara Phatak on 31/05/2015.
This course guides those who are new to data analysis but full of interest through the complete process of using the R language: collecting data, interpreting it through exploratory analysis, and performing text mining to uncover meaning hidden beneath the data that is invisible to the naked eye. It is designed for those with a basic knowledge of R who want more hands-on analysis practice; by the end, you will be more comfortable with R as a rich analysis tool. Using a dataset of charitable donations from the Apple Daily, you will learn how to parse web pages from scratch and write crawlers to collect information automatically; once the data is obtained, you will clean, integrate, and explore it, and use off-the-shelf packages for text mining and text analysis. Step by step, we will walk through the full data analysis process of processing, observing, and deconstructing the data, to see which factors influence people's donation decisions and how those findings are mined from the data.
20. RDDs maintain lineage information that can be used to reconstruct lost partitions. Ex:

    messages = textFile(...).filter(_.startsWith("ERROR"))
    result = messages.map(_.split('\t')(2))

[Diagram: HDFS File -> filter (func = _.startsWith(...)) -> Filtered RDD -> map (func = _.split(...)) -> Mapped RDD]
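A self-contained version of that example (a minimal sketch; the input path and local[*] master are assumptions for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    object LineageDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("LineageDemo").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Each transformation records its parent, building a lineage graph:
        // HDFS file -> filter -> map. A lost partition can be recomputed by
        // replaying these transformations on the surviving parent data.
        val messages = sc.textFile("hdfs:///logs/app.log")  // hypothetical path
                         .filter(_.startsWith("ERROR"))
        val result = messages.map(_.split('\t')(2))

        println(result.toDebugString)  // prints the RDD's lineage
        sc.stop()
      }
    }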
27. Streaming version:

    val conf = new SparkConf()
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.textFileStream(args(1))
    val words = lines.flatMap(_.split(" "))
    val result = words.map(x => (x, 1)).reduceByKey(_ + _)
    result.print()  // DStreams have no collect(); print() emits each batch's counts
    ssc.start()
    ssc.awaitTermination()

Batch version:

    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val lines = sc.textFile(args(1))
    val words = lines.flatMap(_.split(" "))
    val result = words.map(x => (x, 1)).reduceByKey(_ + _).collect()
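Note that the streaming and batch versions share the same transformation chain (flatMap, map, reduceByKey); only the context (StreamingContext vs. SparkContext) and the input source differ, which is the unification point the deck highlights. Either one can be packaged and launched the usual way, e.g. (class and jar names here are hypothetical):

    spark-submit --class WordCount --master yarn wordcount.jar /data/input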
28.
- Hive-like interface (JDBC service / CLI)
- Both Hive QL & simple SQL dialects are supported
- DDL is 100% compatible with the Hive Metastore
- Hive QL aims to be 100% compatible with Hive DML

[Architecture diagram: user applications and the CLI go through the SQL API, and data analysts connect via the JDBC service; Hive QL and simple SQL are compiled by Catalyst into Spark execution operators running on Spark Core, with metadata served by the Hive Metastore or a simple catalog]
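A minimal sketch of querying through that interface from Scala (Spark 1.x-era API; the table and query are made-up examples):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("SqlDemo"))
    val hiveContext = new HiveContext(sc)

    // Hive QL statements run against tables registered in the Hive Metastore
    hiveContext.sql("CREATE TABLE IF NOT EXISTS logs (level STRING, msg STRING)")
    hiveContext.sql("SELECT level, COUNT(*) AS n FROM logs GROUP BY level")
      .collect()
      .foreach(println)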
29.
- First released in Spark 1.0 (May 2014)
- Initially committed by Michael Armbrust & Reynold Xin from Databricks
30. MLlib (machine learning algorithm library):
- Initial contribution from AMPLab, UC Berkeley
- Shipped with Spark since version 0.8 (Sep 2013)

Data types:
- Dense
- Sparse (since 1.0)
  - In the real world, many datasets are sparse

Algorithm set:
- Classification / regression / collaborative filtering / clustering / decomposition
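A minimal sketch of the two vector types (MLlib linalg API; the values are made up):

    import org.apache.spark.mllib.linalg.Vectors

    // Dense: stores every entry explicitly
    val dense = Vectors.dense(1.0, 0.0, 3.0)

    // Sparse: stores (size, non-zero indices, non-zero values),
    // which is far cheaper when most features are zero
    val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

    println(dense.toArray.mkString(","))   // 1.0,0.0,3.0
    println(sparse.toArray.mkString(","))  // 1.0,0.0,3.0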
34. Spark REPL:
- Used to run and test Spark programs interactively
  - Convenient for quickly testing parts of a program's logic
- Built on top of the Scala REPL
  - REPL: read, evaluate, print, loop
  - Extensions:
    - Modified wrapper code generation so that each line typed has references to objects for its dependencies
    - Distribute generated classes over the network
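In practice a session looks like this (a sketch; the input path is hypothetical, and sc is pre-created by the shell):

    $ ./bin/spark-shell
    scala> val lines = sc.textFile("hdfs:///logs/app.log")
    scala> val errors = lines.filter(_.startsWith("ERROR"))
    scala> errors.count()

Each typed line is wrapped in generated code that references the objects defined on earlier lines, and the generated classes are shipped over the network so the closures can execute on remote executors.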
36.
- Pluggable shuffle interface
  - Hash -> Sort (memory/performance, etc.)
- Improved data transfer mechanism
  - Pluggable
  - Employs Netty
- Others
  - PySpark / JDBC server / dynamic metrics …
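Those choices surface as configuration. In the Spark 1.x era, the shuffle implementation and the transfer service could be selected roughly like this (a sketch; exact keys and defaults varied across releases):

    val conf = new SparkConf()
      .set("spark.shuffle.manager", "sort")                 // "hash" was the earlier default
      .set("spark.shuffle.blockTransferService", "netty")   // Netty-based transfer (vs. "nio")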
37.
- Core
  - Pluggable storage interface
    - To support various storage types: SSD, HDFS cache, etc.
- Spark SQL
  - Support for more data sources
    - (Cassandra, MongoDB), RDBMS (SAP/Vertica/Oracle)
  - Performance optimization (code gen, faster joins, etc.)
  - Syntax enhancements (towards SQL-92)
- GraphX
  - Move GraphX out of "Alpha"
- Stability and scalability
38.
- Better YARN integration
  - Security
  - Dynamic resource adjustment
- More algorithms for MLlib
  - 15+ as of June; should double quickly
- Spark Streaming
  - Streaming SQL / more data sources, etc.