Spark is a unified framework for big data analytics. Spark provides one integrated API for use by developers, data scientists, and analysts to perform diverse tasks that would have previously required separate processing engines such as batch analytics, stream processing and statistical modeling. Spark supports a wide range of popular languages including Python, R, Scala, SQL, and Java. Spark can read from diverse data sources and scale to thousands of nodes.
In this presentation we discuss Microsoft's HDInsight offering of Spark. Azure HDInsight is Microsoft’s managed Hadoop and Spark cloud service, which runs the Hortonworks Data Platform. Spark for Azure HDInsight offers customers an enterprise-ready Spark solution that is fully managed, secured, and highly available, and made simpler for users with compelling, interactive experiences.
Spark with Azure HDInsight - Tampa Bay Data Science - Adnan Masood, PhD
1. Adnan Masood, PhD
Systems Architect / Data Scientist
adnan.masood@owasp.org
(http://blog.adnanmasood.com)
GitHub (github.com/adnanmasood),
Twitter (@adnanmasood).
Presented at Microsoft Data Science Group – Tampa Bay Data Science Professionals
http://www.meetup.com/data-scientists-tampa-bay/events/231293077/
Spark with Azure HDInsight
2. About the Speaker
Adnan Masood, Ph.D. is a developer, software architect, and researcher who specializes
in FinTech, machine learning, and Bayesian belief networks. Before joining PDS
Health care, and GDC (a leading prepaid financial technology institution), he enjoyed
life as a principal engineer of a start-up and worked for a leading UK based nonprofit
organization as a solutions architect.
A strong believer in the development community, Adnan is an active member of the
Open Web Application Security Project (OWASP), an organization dedicated to
software security. In the .NET community, he is a cofounder and president of the
Pasadena .NET Developers group, which he has been successfully leading for 8 years.
He led a number of successful enterprise solutions and consulted for several Fortune
500 company projects.
Adnan devotes himself to his own continual, practical education. He holds
certifications in big data, machine learning, and systems architecture from
Massachusetts Institute of Technology; an Application Security certification from
Stanford University; an SOA Smarts certification from Carnegie Mellon University;
and certifications as a ScrumMaster, Microsoft Certified Trainer, Microsoft Certified
Solutions Developer, and Sun Certified Java Developer.
For more details, visit Adnan's blog (http://blog.adnanmasood.com), GitHub
repository (http://github.com/adnanmasood), and Twitter (@adnanmasood). Adnan
can be reached at adnan.masood@owasp.org.
4. Channel 9 Walkthrough of Apache Spark on Azure HDInsight
5. Spark 101
Spark is a unified framework for big data
analytics. Spark provides one integrated API
for use by developers, data scientists, and
analysts to perform diverse tasks that would
have previously required separate
processing engines such as batch analytics,
stream processing and statistical modeling.
Spark supports a wide range of popular
languages including Python, R, Scala, SQL,
and Java. Spark can read from diverse data
sources and scale to thousands of nodes.
8. Big Data Deployment – Public Cloud
• Hadoop-as-a-Service
- Amazon Web Services EC2 and EMR
- Microsoft Azure HDInsight
- Google Cloud Dataproc
- IBM Bluemix ... and others
• Spark-as-a-Service
- All of the above
- Databricks
9. Big Data Deployment – On-Premises
• Bare-Metal
• Virtual Machines
- VMware Big Data Extensions
- OpenStack Sahara
• Containers
- BlueData
- Mesos
17. HDInsight Spark Streaming
“Along with traditional Hadoop technologies, HDInsight also provides
Spark as a cloud service. Spark is an integrated set of open source
technologies that can run on a Hadoop cluster. The Spark family
includes options for analyzing large amounts of operational data,
doing machine learning, and more. It also includes Spark Streaming, a
technology for working with streaming data.
Spark Streaming is similar to Storm in some ways. Like Storm, it’s a
general-purpose technology for processing streaming data. Unlike
Storm, Spark Streaming is implemented as an extension to the basic
Spark engine—it’s not an add-on technology. This tight connection can
make Spark applications faster, since there’s less need to move data
between components, and easier to create, since everything uses the
same core Spark technology. Because of this, Spark Streaming (and
Spark in general) are getting more popular by the day”
David Chappell
STREAMING SCENARIOS USING THE MICROSOFT DATA PLATFORM
A GUIDE FOR IT LEADERS
18. HDInsight Spark Streaming
• What is it?
- Distributed compute framework, an extension of the core Apache Spark API
- Allows users to integrate real-time data from disparate event streams (e.g. Kafka,
HDFS, Twitter) in event-driven, asynchronous, scalable, type-safe, and fault-tolerant
applications
• When to use it?
- When organizations need real-time decision making
- When you are working with streams of continuous data
• Why Spark Streaming?
- Enables high-throughput and reliable processing of live data streams
- Batch, Iterative, and Streaming analysis on the same platform
- Easily add Machine Learning for streaming data pathways
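The idea of running batch and streaming analysis on the same platform can be illustrated with a small sketch. The code below is plain Python, not the Spark Streaming API: it only imitates the micro-batch style of processing, where a live stream is cut into small batches and each batch is handled by the same logic used for batch jobs. The function names (`count_words`, `micro_batches`, `streaming_word_count`) and the sample events are hypothetical, and batches are cut by record count here for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_words(lines: Iterable[str]) -> Counter:
    """Batch logic: count words in a finite collection of lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group an incoming stream into batches of at most batch_size records.

    Simplifying assumption: batches are cut by record count, not by time.
    """
    batch: List[str] = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly shorter, batch
        yield batch


def streaming_word_count(stream: Iterable[str], batch_size: int = 2) -> Iterator[Counter]:
    """Yield a snapshot of the running word totals after each micro-batch."""
    state: Counter = Counter()
    for batch in micro_batches(stream, batch_size):
        # Reuse the batch function on each micro-batch, then merge into state.
        state.update(count_words(batch))
        yield Counter(state)


# Hypothetical sample events standing in for a live feed.
events = ["spark on azure", "azure hdinsight", "spark streaming", "spark sql", "storm"]

batch_result = count_words(events)                       # one batch job over all data
snapshots = list(streaming_word_count(events, batch_size=2))
stream_result = snapshots[-1]                            # totals after the last micro-batch

print(stream_result == batch_result)
```

Because the streaming path calls the same `count_words` function as the batch path, the final streaming totals match the batch totals. That reuse of one piece of logic across both modes is the point the slide makes about a single platform.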
20. References & Further Reading
Use MapReduce in Hadoop on HDInsight
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-mapreduce
Get started: Create Apache Spark cluster on HDInsight Linux and run interactive queries using Spark SQL
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-zeppelin-notebook-jupyter-spark-sql/
Azure Machine Learning - https://azure.microsoft.com/en-us/services/machine-learning/
21. References & Further Reading
Announcing Apache Spark on Azure HDInsight
https://channel9.msdn.com/Shows/Azure-Friday/Announcing-Apache-Spark-on-Azure-HDInsight
Apache Zeppelin https://zeppelin.incubator.apache.org
Project Jupyter http://jupyter.org/
https://azure.microsoft.com/en-us/services/hdinsight/
https://azure.microsoft.com/en-us/blog/apache-spark-for-azure-hdinsight-now-generally-available/
Microsoft expands its commitment to Apache Spark big-data framework
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-use-zeppelin-notebook/
http://www.c-sharpcorner.com/UploadFile/aa700f/jumpstart-into-big-data-with-hdinsight/
22. References & Further Reading
Get started: Create Apache Spark cluster on HDInsight Linux and run interactive queries using Spark SQL
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-jupyter-spark-sql/
EdX Course: Processing Big Data with Azure HDInsight. Learn how to use Hadoop technologies in Microsoft Azure HDInsight to process big data in this five-week, hands-on course.
https://www.edx.org/course/processing-big-data-azure-hdinsight-microsoft-dat202-1x-0
Apache Spark for Azure HDInsight
https://azure.microsoft.com/en-us/services/hdinsight/apache-spark/
Build Machine Learning applications to run on Apache Spark clusters on HDInsight Linux
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-ipython-notebook-machine-learning/
Slides courtesy of Microsoft Corporation. Scott talks to Asad Khan about the addition of Apache Spark on Azure HDInsight. Apache Spark is a unified, open source, parallel data processing framework for Big Data Analytics. Spark brings together batch processing, real-time processing, stream analytics, machine learning, and interactive SQL, and Azure makes Spark "Software as a Service." It's a great time for you to jump into the world of Big Data on Azure.