SlideShare a Scribd company logo
1 of 9
Download to read offline
Building Big Data Applications:
Getting started with Apache Spark
Happiest People Happiest Customers
© Happiest Minds Technologies Pvt. Ltd. All Rights Reserved2
Contents
Brief history of big data................................................................................................................................3
Trends in big data........................................................................................................................................3
How Spark address the big data trends.......................................................................................................4
Common misconceptions.............................................................................................................................5
Hadoop and Spark.......................................................................................................................................5
Besides Hadoop?k.......................................................................................................................................5
The future of Apache Spark.........................................................................................................................6
Use cases....................................................................................................................................................7
Conclusion...................................................................................................................................................8
About Author................................................................................................................................................9
© Happiest Minds Technologies Pvt. Ltd. All Rights Reserved
The rise of petabyte-scale data sets
These days business organizations are churning out huge amounts of transactional data, capturing trillions of bytes of
infor-mation about their customers, suppliers, and operations. While this may have once concerned only a few data geeks,
big data is now relevant across business sectors; consumers of products and services. Everyone stands to benefit from its
application. The combination of massive data sets and the rapid development of new technologies capable of storing and
processing infor-mation out of them have already transformed the way businesses operate.
Over the last decades, internet giants like Amazon, Google, Yahoo!, eBay and Twitter have invented and used tools for work-
ing with colossal data sets that was beyond the realm of traditional data management tools.
Some of the leading open-source tools for big data include Hadoop and NoSQL databases like Cassandra and MongoDB.
Hadoop has been a spectacular success, offering cheap storage in HDFS (Hadoop distributed file system) of datasets up to
Petabytes in size, with batch-mode (“offline”) analysis using Map Reduce jobs.
In the past, emerging technologies took years to mature. In the case of big data, while effective tools are still emerging, the
analytics requirements are changing rapidly resulting in businesses to either make it or be left behind.
Various tools have emerged in the big data space to address different use-cases for building NRT (near real time), advance
analytics, recommendation applications and many more making it complex to integrate these various tools and different
storage layers. At some point during this evolution, a powerful yet simple, open source framework called Apache Spark was
introduced to address various common use cases of big data under one umbrella.
Besides Apache Spark, another next generation tool called Apache Flink, formerly known as Stratosphere, is also available.
Apache Flink is almost similar to Apache Spark except in the way it handles streaming data; however it is still not as mature
as Apache Spark as a big data tool.
Both Apache Spark and Apache Flink have the capability to build interactive, real time applications.
Since stable version of Apache Flink was released only on Nov 15th and there is no commercial vendor support, we will
discuss this in our subsequent papers.
3
Brief history of big data:
Trends in big data:
4 © Happiest Minds Technologies Pvt. Ltd. All Rights Reserved4
Various projects like MLlib, Spark Streaming, Spark Sql, GraphX, Spark DataFrames, and MLPipelines are built on the core
Spark API.
Spark:
Spark provides core API to process data in a distributed environment. It offers various high level API like map, reduce, filter,
reduce by key to build distributed applications. Businesses can use these APIs for ETL pipelines and Data exploration. All
projects in the Spark ecosystem use these core APIs to build specific systems for machine learning, interactive queries and
graph based analytics.
Spark Sql:
Spark Sql is a spark module which provides SQL and data frame interfaces for structured data. This is an interesting
capability which allows use of Sql or Data frames API over any structured intermediate data. It also allows usage from any
underlying data source like hive, parquet, ORC, NoSQL and SQL databases. Use of data frames / SQL helps spark
optimize the performance since it is aware of the underlying schema. It is preferable to use Data frame API over the core
API as one gains performance improvement in that process.
Spark MLlib:
MLlib module provides help with scalable machine learning algorithms for supervised learning, unsupervised learning and
data mining. It is a fast growing package of distributed machine learning algorithms that has its own advantages.
Spark Streaming:
Spark streaming module helps in building applications to process real time data with a latency of above
0.5ms. Based on micro batch style computing, it allows windows function and has inbuilt capability to process data from
Apache Kafka, Flume and many other such data sources.
How Spark addresses the big data trends
Spark GraphX:
GraphX is a new component for graph parallel computation. GraphX includes a growing collection of graph algorithms
and builders to simplify graph analytic tasks.
Most common spark development environments (cluster Managers)
48% 40% 11%
Standalone
mode
Yarn Mesos
Image courtesy Spark-Survey-2015-Infographic
© Happiest Minds Technologies Pvt. Ltd. All Rights Reserved
Common Misconceptions:
Hadoop and Spark:
Besides Hadoop:
spark is used to create many types of products inside different organizations
http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_
Content/Spark-Survey-2015-Infographic.pdf?t=1443057549926
5
As with everything there are some common misconceptions about Spark which can be dispelled.
a) Spark expects the data to sit in memory: Not true. Spark can use memory for holding the data sets if required. It can
work on disk too. There is a huge performance boost when data is cached, particularly for iterative machine learning
algorithms and interactive queries.
b) Do I need to invest in new hardware: No. One can use existing Hadoop/NoSQL deployment. Spark runs as another
service in the existing deployment. It can work independently too when starting with distributed systems.
c) Can I use my existing skill set: Yes. Spark exposes its APIs in 4 different languages (Scala, Java, Python and R). In that
sense, small learning curve is required to get started with Spark and some extensive training if one is well versed is any of
the above mentioned languages.
Preferable Languages: Of all the languages, Scala and Python is the most preferable when to comes to using Spark. Scala
and Python are the most preferred as they provide ease while building data products. All the new features on Spark s are
generally made available initially in Scala and Java as Spark itself is built in Scala.
If you already have a Hadoop deployment, Spark perfectly sits in your ecosystem. It is well integrated with Yarn, HBase,
Hdfs and Hive. It can leverage existing infrastructure and resource management capabilities of Yarn. Major Hadoop distri-
butions like Cloudera, Hortonworks, and MapR currently support Apache Spark and promote it.
Spark works closely with many of the NoSQL and storage layers like Cassandra, MongoDB, Elastic Search and any
database which provides a JDBC driver. It is well integrated with Apache Mesos which works similar to YARN but not
confined to only Hadoop ecosystem.
68%
44%
36%
12%
52%
40%
29%
Other Recommendation
System
User
Facing
Services
Business
Inteligence
Log
Processing
Data
Warehousing
Fraud Detection
& Security System
129
12
© Happiest Minds Technologies Pvt. Ltd. All Rights Reserved6
Apache Spark has some promising projects in the ecosystem. They are as follows:
DataFrames:
DataFrames API is inspired by DataFrames in R and Pandas in Python, it is being built from scratch on Big Data/Distributed
systems. It exposes APIs for Python, Java, Scala and R. It uses state of the art optimizations and code generation through SQL
catalyst optimizer.
MLPipelines:
Spark MLPipelines provides a uniform set of high level API built on top of DataFrame. It provides a set of reusable pipelines for
various data science activities like data preparation, feature extraction, model training and validation.
Future of Apache Spark:
64%
52% 28%
47%
Advanced analytics Data Frames
SQL StandardsReal - time streaming
Most Important Spark Features
Another interesting project which is not part of Apache Spark is KeystoneML which provides reusable Pipelines for
various tasks like audio, text and images. It provides various examples which reproduce state-of-the-art results on public
data sets. This project is also from Amp Labs.
Tungsten:
Project Tungsten brings in drastic change to the Spark's execution engine from the starting of the project.
Let us look into some factors that partake in achieving this.
• Memory Management and Binary Processing ► Used to maximize the usage of application semantics for memory
man agement exclusively, along with removing the disadvantages of garbage collection and JVM object model.
• Cache-aware processing ► Helps in exploiting the memory hierarchy for data structures and algorithm.
• Code generation ► Helps in exploiting the compilers and CPUs
Spark applications, while using this will be benefited with efficient CPU and memory utilization.
Spark for mobiles:
As processing is moving more each day towards mobile based technology , the contributors of Spark are re-architecting
Spark to fit the scenario and this ability is expected to be available from Spark version 2.0 . With this feature, mobiles
can act as nodes for the spark cluster. This can change the way big data is currently being processed.
129
12
© Happiest Minds Technologies Pvt. Ltd. All Rights Reserved7
Let us discuss some use case scenarios:
Interactive exploratory analysis:
• Leverage Spark’s in memory caching and powerful execution capabilities to explore large datasets.
• Use Spark’s rich SQL, DataFrame and core API to explore various kind of structured, semi structured and unstructured
datasets.
• Connect external applications like Tableau to power existing exploration through SQL drivers.
Faster ETL:
• Help port existing Hive scripts by changing the underlying execution engine from map-reduce to spark for an
instant boost.
• Help migrate your pig scripts to Spark.
• Leverage Spark’s optimized scheduling for efficient IO and in-memory processing capabilities on large data sets.
Real time dash boards:
• Use Spark streaming to perform near real time aggregations and window based aggregations.
• Integrate Spark SQL on real time data for interactive processing on real time data.
• Apply machine learning algorithms like clustering, classification, recommendations in NRT for identifying anomalies,
recommending products.
• Since the same core is used for Spark API for batch and real time, most of the code can be reused instead of re-engi
neering it separately for batch and streaming.
Use Cases:
Languages used with Spark:
http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_Content/Spark-Survey-2015-
Survey respondents can choose multiple languages
58%
31%
71%
36% 18%
Machine learning:
Use out of box algorithms from Spark MLlib for supervised, unsupervised Data mining.
Use Spark’s caching capabilities for improving performance of iterative algorithms.
Can take the advantage of some advance algorithms which can be learned as the data trickles in.
129
12
© Happiest Minds Technologies Pvt. Ltd. All Rights Reserved8
Conclusion
With the rate of information growth, businesses across verticals are challenged to finding actionable insight from the
massive data that they have accumulated. Big data is here to stay and using the right tool can give any business a boost
and help charter its growth.
We have seen a drastic improvement in the performance and a decrease in number of calendar days across various
projects executed in Spark. Many applications are being migrated to Spark for the ease it offers to developers. Apache
Spark is worth it.
Vishnu Subramanian works as solution architect for Happiest minds with years of experience in building
distributed systems using Hadoop, Spark, ElasticSearch , Cassandra, Machine Learning.A Databricks
certified spark developer and having experience in building Data Products. His interests are in IOT, Data
Science , BigData Security.
129
12
© Happiest Minds Technologies Pvt. Ltd. All Rights Reserved
Vishnu Subramanian
Happiest Minds
9
© 2014 Happiest Minds. All Rights Reserved.
E-mail: Business@happiestminds.com
Visit us: www.happiestminds.com
Follow us on
This Document is an exclusive property of Happiest Minds Technologies Pvt. Ltd
Happiest Minds enables Digital Transformation for enterprises and technology providers by delivering seamless
customer experience, business efficiency and actionable insights through an integrated set of disruptive technologies: big
data analyt-ics, internet of things, mobility, cloud, security, unified communications, etc. Happiest Minds offers domain
centric solutions applying skills, IPs and functional expertise in IT Services, Product Engineering, Infrastructure
Management and Security. These services have applicability across industry sectors such as retail, consumer packaged
goods, e-commerce, banking, insurance, hi-tech, engineering R&D, manufacturing, automotive and travel/transportation/
hospitality.Headquartered in Bangalore, India, Happiest Minds has operations in the US, UK, Singapore, Australia and has
secured $ 52.5 million Series-A funding. Its investors are JPMorgan Private Equity Group, Intel Capital and Ashok Soota.
About the Author

More Related Content

What's hot

Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch ProcessingEdureka!
 
Comparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkComparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkAgnihotriGhosh2
 
Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Edureka!
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Simplilearn
 
Hadoop & Complex Systems Research
Hadoop & Complex Systems ResearchHadoop & Complex Systems Research
Hadoop & Complex Systems ResearchDr. Mirko Kämpf
 
Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache SparkEdureka!
 
Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...
Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...
Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...Edureka!
 
Streaming analytics state of the art
Streaming analytics state of the artStreaming analytics state of the art
Streaming analytics state of the artStavros Kontopoulos
 
Big data Processing with Apache Spark & Scala
Big data Processing with Apache Spark & ScalaBig data Processing with Apache Spark & Scala
Big data Processing with Apache Spark & ScalaEdureka!
 
Hadoop Vs Spark — Choosing the Right Big Data Framework
Hadoop Vs Spark — Choosing the Right Big Data FrameworkHadoop Vs Spark — Choosing the Right Big Data Framework
Hadoop Vs Spark — Choosing the Right Big Data FrameworkAlaina Carter
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?Paco Nathan
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduceEdureka!
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With SparkEdureka!
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!Edureka!
 
ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...
ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...
ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...DataWorks Summit/Hadoop Summit
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Data Con LA
 

What's hot (20)

Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
Comparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and sparkComparison among rdbms, hadoop and spark
Comparison among rdbms, hadoop and spark
 
Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala Big Data Processing with Spark and Scala
Big Data Processing with Spark and Scala
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
 
Hadoop & Complex Systems Research
Hadoop & Complex Systems ResearchHadoop & Complex Systems Research
Hadoop & Complex Systems Research
 
Spark SQL | Apache Spark
Spark SQL | Apache SparkSpark SQL | Apache Spark
Spark SQL | Apache Spark
 
Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...
Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...
Spark Interview Questions and Answers | Apache Spark Interview Questions | Sp...
 
Streaming analytics state of the art
Streaming analytics state of the artStreaming analytics state of the art
Streaming analytics state of the art
 
Big data Processing with Apache Spark & Scala
Big data Processing with Apache Spark & ScalaBig data Processing with Apache Spark & Scala
Big data Processing with Apache Spark & Scala
 
Hadoop Vs Spark — Choosing the Right Big Data Framework
Hadoop Vs Spark — Choosing the Right Big Data FrameworkHadoop Vs Spark — Choosing the Right Big Data Framework
Hadoop Vs Spark — Choosing the Right Big Data Framework
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduce
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With Spark
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!5 Reasons why Spark is in demand!
5 Reasons why Spark is in demand!
 
ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...
ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...
ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014
 

Viewers also liked

No al Ciberbullying!!
No al Ciberbullying!!No al Ciberbullying!!
No al Ciberbullying!!HuskyS
 
Programa 5
Programa 5Programa 5
Programa 5yito24
 
"Board game" thing
"Board game" thing"Board game" thing
"Board game" thingSarah Wilson
 
Proyecto flor stella perez
Proyecto flor stella perez Proyecto flor stella perez
Proyecto flor stella perez SEBASGEL
 
'14,04,10 Scholar's Forum (Yuki Goto)
'14,04,10 Scholar's Forum (Yuki Goto)'14,04,10 Scholar's Forum (Yuki Goto)
'14,04,10 Scholar's Forum (Yuki Goto)Yuki Goto
 
We have a plan - Move to Transylvania
We have a plan - Move to TransylvaniaWe have a plan - Move to Transylvania
We have a plan - Move to TransylvaniaDaniel Pirciu
 
Schaaf Technology Ref ltr_1
Schaaf Technology Ref ltr_1Schaaf Technology Ref ltr_1
Schaaf Technology Ref ltr_1Anne Cornish
 
Environ Evo FINAL logo
Environ Evo FINAL logoEnviron Evo FINAL logo
Environ Evo FINAL logoScott White
 
What does the word “classical” mean
What does the word “classical” meanWhat does the word “classical” mean
What does the word “classical” meanJohn Peter Holly
 
Enabling digital transformation through digital business platforms
Enabling digital transformation through digital business platformsEnabling digital transformation through digital business platforms
Enabling digital transformation through digital business platformsHappiest Minds Technologies
 
20140915 etat de l'art sécurisation des sites sensibles personal interactor
20140915 etat de l'art sécurisation des sites sensibles personal interactor20140915 etat de l'art sécurisation des sites sensibles personal interactor
20140915 etat de l'art sécurisation des sites sensibles personal interactorPersonal Interactor
 
Les impacts du nouveau règlement européen GDPR
Les impacts du nouveau règlement européen GDPRLes impacts du nouveau règlement européen GDPR
Les impacts du nouveau règlement européen GDPRArismore
 
8086 instructions
8086 instructions8086 instructions
8086 instructionsRavi Anand
 

Viewers also liked (17)

No al Ciberbullying!!
No al Ciberbullying!!No al Ciberbullying!!
No al Ciberbullying!!
 
Programa 5
Programa 5Programa 5
Programa 5
 
"Board game" thing
"Board game" thing"Board game" thing
"Board game" thing
 
Proyecto flor stella perez
Proyecto flor stella perez Proyecto flor stella perez
Proyecto flor stella perez
 
'14,04,10 Scholar's Forum (Yuki Goto)
'14,04,10 Scholar's Forum (Yuki Goto)'14,04,10 Scholar's Forum (Yuki Goto)
'14,04,10 Scholar's Forum (Yuki Goto)
 
We have a plan - Move to Transylvania
We have a plan - Move to TransylvaniaWe have a plan - Move to Transylvania
We have a plan - Move to Transylvania
 
Schaaf Technology Ref ltr_1
Schaaf Technology Ref ltr_1Schaaf Technology Ref ltr_1
Schaaf Technology Ref ltr_1
 
Environ Evo FINAL logo
Environ Evo FINAL logoEnviron Evo FINAL logo
Environ Evo FINAL logo
 
IMG
IMGIMG
IMG
 
Campana Navidad
Campana NavidadCampana Navidad
Campana Navidad
 
Carta organisasi
Carta organisasiCarta organisasi
Carta organisasi
 
What does the word “classical” mean
What does the word “classical” meanWhat does the word “classical” mean
What does the word “classical” mean
 
Enabling digital transformation through digital business platforms
Enabling digital transformation through digital business platformsEnabling digital transformation through digital business platforms
Enabling digital transformation through digital business platforms
 
20140915 etat de l'art sécurisation des sites sensibles personal interactor
20140915 etat de l'art sécurisation des sites sensibles personal interactor20140915 etat de l'art sécurisation des sites sensibles personal interactor
20140915 etat de l'art sécurisation des sites sensibles personal interactor
 
Les impacts du nouveau règlement européen GDPR
Les impacts du nouveau règlement européen GDPRLes impacts du nouveau règlement européen GDPR
Les impacts du nouveau règlement européen GDPR
 
8086 instructions
8086 instructions8086 instructions
8086 instructions
 
Gestion des risques
Gestion des risquesGestion des risques
Gestion des risques
 

Similar to Started with-apache-spark

Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...rajeshseo5
 
Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0Stratio
 
Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark ZaranTech LLC
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsDr. Mirko Kämpf
 
Pyspark vs Spark Let's Unravel the Bond!
Pyspark vs Spark Let's Unravel the Bond!Pyspark vs Spark Let's Unravel the Bond!
Pyspark vs Spark Let's Unravel the Bond!ankitbhandari32
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Jyotasana Bharti
 
Machine Learning with SparkR
Machine Learning with SparkRMachine Learning with SparkR
Machine Learning with SparkROlgun Aydın
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewMario Cartia
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkLaxmi8
 
Scalable Machine Learning with PySpark
Scalable Machine Learning with PySparkScalable Machine Learning with PySpark
Scalable Machine Learning with PySparkLadle Patel
 
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018Codemotion
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 

Similar to Started with-apache-spark (20)

Apache spark
Apache sparkApache spark
Apache spark
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
 
Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
SparkPaper
SparkPaperSparkPaper
SparkPaper
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
IOT.ppt
IOT.pptIOT.ppt
IOT.ppt
 
Pyspark vs Spark Let's Unravel the Bond!
Pyspark vs Spark Let's Unravel the Bond!Pyspark vs Spark Let's Unravel the Bond!
Pyspark vs Spark Let's Unravel the Bond!
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)
 
Machine Learning with SparkR
Machine Learning with SparkRMachine Learning with SparkR
Machine Learning with SparkR
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 preview
 
Spark 101
Spark 101Spark 101
Spark 101
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs Spark
 
Scalable Machine Learning with PySpark
Scalable Machine Learning with PySparkScalable Machine Learning with PySpark
Scalable Machine Learning with PySpark
 
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 

More from Happiest Minds Technologies

Largest Electricity provider in the US- Case Study
Largest Electricity provider in the US- Case StudyLargest Electricity provider in the US- Case Study
Largest Electricity provider in the US- Case StudyHappiest Minds Technologies
 
Exploring the Potential of ChatGPT in Banking, Financial SERVICES & Insurance
Exploring the Potential of ChatGPT in Banking, Financial SERVICES & InsuranceExploring the Potential of ChatGPT in Banking, Financial SERVICES & Insurance
Exploring the Potential of ChatGPT in Banking, Financial SERVICES & InsuranceHappiest Minds Technologies
 
AUTOMATING CYBER RISK DETECTION AND PROTECTION WITH SOC 2.0
AUTOMATING CYBER RISK DETECTION AND PROTECTION WITH SOC 2.0AUTOMATING CYBER RISK DETECTION AND PROTECTION WITH SOC 2.0
AUTOMATING CYBER RISK DETECTION AND PROTECTION WITH SOC 2.0Happiest Minds Technologies
 
Automating SOC1/2 Compliance- For a leading Software solution company in UK
Automating SOC1/2 Compliance- For a leading Software solution company in UKAutomating SOC1/2 Compliance- For a leading Software solution company in UK
Automating SOC1/2 Compliance- For a leading Software solution company in UKHappiest Minds Technologies
 
GUIDE TO KEEP YOUR END-USERS CONNECTED TO THE DIGITAL WORKPLACE DURING DISRUP...
GUIDE TO KEEP YOUR END-USERS CONNECTED TO THE DIGITAL WORKPLACE DURING DISRUP...GUIDE TO KEEP YOUR END-USERS CONNECTED TO THE DIGITAL WORKPLACE DURING DISRUP...
GUIDE TO KEEP YOUR END-USERS CONNECTED TO THE DIGITAL WORKPLACE DURING DISRUP...Happiest Minds Technologies
 
Complete Guide to General Data Protection Regulation (GDPR)
Complete Guide to General Data Protection Regulation (GDPR)Complete Guide to General Data Protection Regulation (GDPR)
Complete Guide to General Data Protection Regulation (GDPR)Happiest Minds Technologies
 
Azure bastion- Remote desktop RDP/SSH in Azure using Bastion Service as (PaaS)
Azure bastion- Remote desktop RDP/SSH in Azure using Bastion Service as (PaaS)Azure bastion- Remote desktop RDP/SSH in Azure using Bastion Service as (PaaS)
Azure bastion- Remote desktop RDP/SSH in Azure using Bastion Service as (PaaS)Happiest Minds Technologies
 
REDUCING TRANSPORTATION COSTS IN RETAIL THROUGH INTELLIGENT FREIGHT AUDIT
REDUCING TRANSPORTATION COSTS IN RETAIL THROUGH INTELLIGENT FREIGHT AUDITREDUCING TRANSPORTATION COSTS IN RETAIL THROUGH INTELLIGENT FREIGHT AUDIT
REDUCING TRANSPORTATION COSTS IN RETAIL THROUGH INTELLIGENT FREIGHT AUDITHappiest Minds Technologies
 
REDUCING TRANSPORTATION COSTS IN CPG THROUGH INTELLIGENT FREIGHT AUDIT
REDUCING TRANSPORTATION COSTS IN CPG THROUGH INTELLIGENT FREIGHT AUDITREDUCING TRANSPORTATION COSTS IN CPG THROUGH INTELLIGENT FREIGHT AUDIT
REDUCING TRANSPORTATION COSTS IN CPG THROUGH INTELLIGENT FREIGHT AUDITHappiest Minds Technologies
 
REDUCING TRANSPORTATION COSTS IN RETAIL THROUGH INTELLIGENT FREIGHT AUDIT
REDUCING TRANSPORTATION COSTS IN RETAIL THROUGH INTELLIGENT FREIGHT AUDITREDUCING TRANSPORTATION COSTS IN RETAIL THROUGH INTELLIGENT FREIGHT AUDIT
REDUCING TRANSPORTATION COSTS IN RETAIL THROUGH INTELLIGENT FREIGHT AUDITHappiest Minds Technologies
 

More from Happiest Minds Technologies (20)

Largest Electricity provider in the US- Case Study
Largest Electricity provider in the US- Case StudyLargest Electricity provider in the US- Case Study
Largest Electricity provider in the US- Case Study
 
BFSI GLOBAL TRENDS FY 24
BFSI GLOBAL TRENDS FY 24BFSI GLOBAL TRENDS FY 24
BFSI GLOBAL TRENDS FY 24
 
ARTIFICIAL INTELLIGENCE IN DIGITAL BANKING
ARTIFICIAL INTELLIGENCE IN DIGITAL BANKINGARTIFICIAL INTELLIGENCE IN DIGITAL BANKING
ARTIFICIAL INTELLIGENCE IN DIGITAL BANKING
 
DIGITAL MANUFACTURING
DIGITAL MANUFACTURINGDIGITAL MANUFACTURING
DIGITAL MANUFACTURING
 
Exploring the Potential of ChatGPT in Banking, Financial SERVICES & Insurance
Exploring the Potential of ChatGPT in Banking, Financial SERVICES & InsuranceExploring the Potential of ChatGPT in Banking, Financial SERVICES & Insurance
Exploring the Potential of ChatGPT in Banking, Financial SERVICES & Insurance
 
AN OVERVIEW OF THE METAVERSE
AN OVERVIEW OF THE METAVERSEAN OVERVIEW OF THE METAVERSE
AN OVERVIEW OF THE METAVERSE
 
VMware to AWS Cloud Migration
VMware to AWS Cloud MigrationVMware to AWS Cloud Migration
VMware to AWS Cloud Migration
 
Digital-Content-Monetization-DCM-Platform-2.pdf
Digital-Content-Monetization-DCM-Platform-2.pdfDigital-Content-Monetization-DCM-Platform-2.pdf
Digital-Content-Monetization-DCM-Platform-2.pdf
 
AUTOMATING CYBER RISK DETECTION AND PROTECTION WITH SOC 2.0
AUTOMATING CYBER RISK DETECTION AND PROTECTION WITH SOC 2.0AUTOMATING CYBER RISK DETECTION AND PROTECTION WITH SOC 2.0
AUTOMATING CYBER RISK DETECTION AND PROTECTION WITH SOC 2.0
 
Cloud Reshaping Banking
Cloud Reshaping BankingCloud Reshaping Banking
Cloud Reshaping Banking
 
Automating SOC1/2 Compliance- For a leading Software solution company in UK
Automating SOC1/2 Compliance- For a leading Software solution company in UKAutomating SOC1/2 Compliance- For a leading Software solution company in UK
Automating SOC1/2 Compliance- For a leading Software solution company in UK
 
PAMaaS- Powered by CyberArk
PAMaaS- Powered by CyberArkPAMaaS- Powered by CyberArk
PAMaaS- Powered by CyberArk
 
GUIDE TO KEEP YOUR END-USERS CONNECTED TO THE DIGITAL WORKPLACE DURING DISRUP...
GUIDE TO KEEP YOUR END-USERS CONNECTED TO THE DIGITAL WORKPLACE DURING DISRUP...GUIDE TO KEEP YOUR END-USERS CONNECTED TO THE DIGITAL WORKPLACE DURING DISRUP...
GUIDE TO KEEP YOUR END-USERS CONNECTED TO THE DIGITAL WORKPLACE DURING DISRUP...
 
SECURING THE CLOUD DATA LAKES
SECURING THE CLOUD DATA LAKESSECURING THE CLOUD DATA LAKES
SECURING THE CLOUD DATA LAKES
 
Complete Guide to General Data Protection Regulation (GDPR)
Complete Guide to General Data Protection Regulation (GDPR)Complete Guide to General Data Protection Regulation (GDPR)
Complete Guide to General Data Protection Regulation (GDPR)
 
Azure bastion- Remote desktop RDP/SSH in Azure using Bastion Service as (PaaS)
Azure bastion- Remote desktop RDP/SSH in Azure using Bastion Service as (PaaS)Azure bastion- Remote desktop RDP/SSH in Azure using Bastion Service as (PaaS)
Azure bastion- Remote desktop RDP/SSH in Azure using Bastion Service as (PaaS)
 
REDUCING TRANSPORTATION COSTS IN RETAIL THROUGH INTELLIGENT FREIGHT AUDIT
REDUCING TRANSPORTATION COSTS IN RETAIL THROUGH INTELLIGENT FREIGHT AUDITREDUCING TRANSPORTATION COSTS IN RETAIL THROUGH INTELLIGENT FREIGHT AUDIT
REDUCING TRANSPORTATION COSTS IN RETAIL THROUGH INTELLIGENT FREIGHT AUDIT
 
REDUCING TRANSPORTATION COSTS IN CPG THROUGH INTELLIGENT FREIGHT AUDIT
REDUCING TRANSPORTATION COSTS IN CPG THROUGH INTELLIGENT FREIGHT AUDITREDUCING TRANSPORTATION COSTS IN CPG THROUGH INTELLIGENT FREIGHT AUDIT
REDUCING TRANSPORTATION COSTS IN CPG THROUGH INTELLIGENT FREIGHT AUDIT
 
How to Approach Tool Integrations
How to Approach Tool IntegrationsHow to Approach Tool Integrations
How to Approach Tool Integrations
 
REDUCING TRANSPORTATION COSTS IN RETAIL THROUGH INTELLIGENT FREIGHT AUDIT
REDUCING TRANSPORTATION COSTS IN RETAIL THROUGH INTELLIGENT FREIGHT AUDITREDUCING TRANSPORTATION COSTS IN RETAIL THROUGH INTELLIGENT FREIGHT AUDIT
REDUCING TRANSPORTATION COSTS IN RETAIL THROUGH INTELLIGENT FREIGHT AUDIT
 

Recently uploaded

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 

Recently uploaded (20)

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 

Started with-apache-spark

  • 1. Building Big Data Applications: Getting started with Apache Spark Happiest People Happiest Customers
  • 2. © Happiest Minds Technologies Pvt. Ltd. All Rights Reserved2 Contents Brief history of big data................................................................................................................................3 Trends in big data........................................................................................................................................3 How Spark address the big data trends.......................................................................................................4 Common misconceptions.............................................................................................................................5 Hadoop and Spark.......................................................................................................................................5 Besides Hadoop?k.......................................................................................................................................5 The future of Apache Spark.........................................................................................................................6 Use cases....................................................................................................................................................7 Conclusion...................................................................................................................................................8 About Author................................................................................................................................................9
  • 3. © Happiest Minds Technologies Pvt. Ltd. All Rights Reserved The rise of petabyte-scale data sets These days business organizations are churning out huge amounts of transactional data, capturing trillions of bytes of infor-mation about their customers, suppliers, and operations. While this may have once concerned only a few data geeks, big data is now relevant across business sectors; consumers of products and services. Everyone stands to benefit from its application. The combination of massive data sets and the rapid development of new technologies capable of storing and processing infor-mation out of them have already transformed the way businesses operate. Over the last decades, internet giants like Amazon, Google, Yahoo!, eBay and Twitter have invented and used tools for work- ing with colossal data sets that was beyond the realm of traditional data management tools. Some of the leading open-source tools for big data include Hadoop and NoSQL databases like Cassandra and MongoDB. Hadoop has been a spectacular success, offering cheap storage in HDFS (Hadoop distributed file system) of datasets up to Petabytes in size, with batch-mode (“offline”) analysis using Map Reduce jobs. In the past, emerging technologies took years to mature. In the case of big data, while effective tools are still emerging, the analytics requirements are changing rapidly resulting in businesses to either make it or be left behind. Various tools have emerged in the big data space to address different use-cases for building NRT (near real time), advance analytics, recommendation applications and many more making it complex to integrate these various tools and different storage layers. At some point during this evolution, a powerful yet simple, open source framework called Apache Spark was introduced to address various common use cases of big data under one umbrella. Besides Apache Spark, another next generation tool called Apache Flink, formerly known as Stratosphere, is also available. Apache Flink is almost similar to Apache Spark except in the way it handles streaming data; however it is still not as mature as Apache Spark as a big data tool. Both Apache Spark and Apache Flink have the capability to build interactive, real time applications. Since stable version of Apache Flink was released only on Nov 15th and there is no commercial vendor support, we will discuss this in our subsequent papers. 3 Brief history of big data: Trends in big data:
  • 4. 4 © Happiest Minds Technologies Pvt. Ltd. All Rights Reserved4 Various projects like MLlib, Spark Streaming, Spark Sql, GraphX, Spark DataFrames, and MLPipelines are built on the core Spark API. Spark: Spark provides core API to process data in a distributed environment. It offers various high level API like map, reduce, filter, reduce by key to build distributed applications. Businesses can use these APIs for ETL pipelines and Data exploration. All projects in the Spark ecosystem use these core APIs to build specific systems for machine learning, interactive queries and graph based analytics. Spark Sql: Spark Sql is a spark module which provides SQL and data frame interfaces for structured data. This is an interesting capability which allows use of Sql or Data frames API over any structured intermediate data. It also allows usage from any underlying data source like hive, parquet, ORC, NoSQL and SQL databases. Use of data frames / SQL helps spark optimize the performance since it is aware of the underlying schema. It is preferable to use Data frame API over the core API as one gains performance improvement in that process. Spark MLlib: MLlib module provides help with scalable machine learning algorithms for supervised learning, unsupervised learning and data mining. It is a fast growing package of distributed machine learning algorithms that has its own advantages. Spark Streaming: Spark streaming module helps in building applications to process real time data with a latency of above 0.5ms. Based on micro batch style computing, it allows windows function and has inbuilt capability to process data from Apache Kafka, Flume and many other such data sources. How Spark addresses the big data trends Spark GraphX: GraphX is a new component for graph parallel computation. GraphX includes a growing collection of graph algorithms and builders to simplify graph analytic tasks. Most common spark development environments (cluster Managers) 48% 40% 11% Standalone mode Yarn Mesos Image courtesy Spark-Survey-2015-Infographic
  • 5. © Happiest Minds Technologies Pvt. Ltd. All Rights Reserved Common Misconceptions: Hadoop and Spark: Besides Hadoop: spark is used to create many types of products inside different organizations http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_ Content/Spark-Survey-2015-Infographic.pdf?t=1443057549926 5 As with everything there are some common misconceptions about Spark which can be dispelled. a) Spark expects the data to sit in memory: Not true. Spark can use memory for holding the data sets if required. It can work on disk too. There is a huge performance boost when data is cached, particularly for iterative machine learning algorithms and interactive queries. b) Do I need to invest in new hardware: No. One can use existing Hadoop/NoSQL deployment. Spark runs as another service in the existing deployment. It can work independently too when starting with distributed systems. c) Can I use my existing skill set: Yes. Spark exposes its APIs in 4 different languages (Scala, Java, Python and R). In that sense, small learning curve is required to get started with Spark and some extensive training if one is well versed is any of the above mentioned languages. Preferable Languages: Of all the languages, Scala and Python is the most preferable when to comes to using Spark. Scala and Python are the most preferred as they provide ease while building data products. All the new features on Spark s are generally made available initially in Scala and Java as Spark itself is built in Scala. If you already have a Hadoop deployment, Spark perfectly sits in your ecosystem. It is well integrated with Yarn, HBase, Hdfs and Hive. It can leverage existing infrastructure and resource management capabilities of Yarn. Major Hadoop distri- butions like Cloudera, Hortonworks, and MapR currently support Apache Spark and promote it. Spark works closely with many of the NoSQL and storage layers like Cassandra, MongoDB, Elastic Search and any database which provides a JDBC driver. It is well integrated with Apache Mesos which works similar to YARN but not confined to only Hadoop ecosystem. 68% 44% 36% 12% 52% 40% 29% Other Recommendation System User Facing Services Business Inteligence Log Processing Data Warehousing Fraud Detection & Security System
  • 6. 129 12 © Happiest Minds Technologies Pvt. Ltd. All Rights Reserved6 Apache Spark has some promising projects in the ecosystem. They are as follows: DataFrames: DataFrames API is inspired by DataFrames in R and Pandas in Python, it is being built from scratch on Big Data/Distributed systems. It exposes APIs for Python, Java, Scala and R. It uses state of the art optimizations and code generation through SQL catalyst optimizer. MLPipelines: Spark MLPipelines provides a uniform set of high level API built on top of DataFrame. It provides a set of reusable pipelines for various data science activities like data preparation, feature extraction, model training and validation. Future of Apache Spark: 64% 52% 28% 47% Advanced analytics Data Frames SQL StandardsReal - time streaming Most Important Spark Features Another interesting project which is not part of Apache Spark is KeystoneML which provides reusable Pipelines for various tasks like audio, text and images. It provides various examples which reproduce state-of-the-art results on public data sets. This project is also from Amp Labs. Tungsten: Project Tungsten brings in drastic change to the Spark's execution engine from the starting of the project. Let us look into some factors that partake in achieving this. • Memory Management and Binary Processing ► Used to maximize the usage of application semantics for memory man agement exclusively, along with removing the disadvantages of garbage collection and JVM object model. • Cache-aware processing ► Helps in exploiting the memory hierarchy for data structures and algorithm. • Code generation ► Helps in exploiting the compilers and CPUs Spark applications, while using this will be benefited with efficient CPU and memory utilization. Spark for mobiles: As processing is moving more each day towards mobile based technology , the contributors of Spark are re-architecting Spark to fit the scenario and this ability is expected to be available from Spark version 2.0 . With this feature, mobiles can act as nodes for the spark cluster. This can change the way big data is currently being processed.
  • 7. 129 12 © Happiest Minds Technologies Pvt. Ltd. All Rights Reserved7 Let us discuss some use case scenarios: Interactive exploratory analysis: • Leverage Spark’s in memory caching and powerful execution capabilities to explore large datasets. • Use Spark’s rich SQL, DataFrame and core API to explore various kind of structured, semi structured and unstructured datasets. • Connect external applications like Tableau to power existing exploration through SQL drivers. Faster ETL: • Help port existing Hive scripts by changing the underlying execution engine from map-reduce to spark for an instant boost. • Help migrate your pig scripts to Spark. • Leverage Spark’s optimized scheduling for efficient IO and in-memory processing capabilities on large data sets. Real time dash boards: • Use Spark streaming to perform near real time aggregations and window based aggregations. • Integrate Spark SQL on real time data for interactive processing on real time data. • Apply machine learning algorithms like clustering, classification, recommendations in NRT for identifying anomalies, recommending products. • Since the same core is used for Spark API for batch and real time, most of the code can be reused instead of re-engi neering it separately for batch and streaming. Use Cases: Languages used with Spark: http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_Content/Spark-Survey-2015- Survey respondents can choose multiple languages 58% 31% 71% 36% 18% Machine learning: Use out of box algorithms from Spark MLlib for supervised, unsupervised Data mining. Use Spark’s caching capabilities for improving performance of iterative algorithms. Can take the advantage of some advance algorithms which can be learned as the data trickles in.
  • 8. 129 12 © Happiest Minds Technologies Pvt. Ltd. All Rights Reserved8 Conclusion With the rate of information growth, businesses across verticals are challenged to finding actionable insight from the massive data that they have accumulated. Big data is here to stay and using the right tool can give any business a boost and help charter its growth. We have seen a drastic improvement in the performance and a decrease in number of calendar days across various projects executed in Spark. Many applications are being migrated to Spark for the ease it offers to developers. Apache Spark is worth it.
  • 9. Vishnu Subramanian works as solution architect for Happiest minds with years of experience in building distributed systems using Hadoop, Spark, ElasticSearch , Cassandra, Machine Learning.A Databricks certified spark developer and having experience in building Data Products. His interests are in IOT, Data Science , BigData Security. 129 12 © Happiest Minds Technologies Pvt. Ltd. All Rights Reserved Vishnu Subramanian Happiest Minds 9 © 2014 Happiest Minds. All Rights Reserved. E-mail: Business@happiestminds.com Visit us: www.happiestminds.com Follow us on This Document is an exclusive property of Happiest Minds Technologies Pvt. Ltd Happiest Minds enables Digital Transformation for enterprises and technology providers by delivering seamless customer experience, business efficiency and actionable insights through an integrated set of disruptive technologies: big data analyt-ics, internet of things, mobility, cloud, security, unified communications, etc. Happiest Minds offers domain centric solutions applying skills, IPs and functional expertise in IT Services, Product Engineering, Infrastructure Management and Security. These services have applicability across industry sectors such as retail, consumer packaged goods, e-commerce, banking, insurance, hi-tech, engineering R&D, manufacturing, automotive and travel/transportation/ hospitality.Headquartered in Bangalore, India, Happiest Minds has operations in the US, UK, Singapore, Australia and has secured $ 52.5 million Series-A funding. Its investors are JPMorgan Private Equity Group, Intel Capital and Ashok Soota. About the Author