1. Big Data and Spark
Big Data overview and Spark
training@itversity.com
2. Agenda
• Introduction
• Big Data ecosystem
• Job roles
• Spark (Core Spark and Data Frames)
• Demo
3. Introduction
• About me - https://www.linkedin.com/in/durga0gadiraju/
• 13+ years of rich industry experience in building large-scale data-driven applications
• IT Versity, LLC is a Dallas-based startup specializing in low-cost, quality training in emerging technologies such as Big Data, Cloud, etc.
• We provide training using the following platforms
• https://labs.itversity.com - low-cost Big Data lab to learn the technologies
• http://discuss.itversity.com - support while learning
• http://www.itversity.com - website for content
• https://youtube.com/itversityin
• https://github.com/dgadiraju
4. Big Data ecosystem
Big Data technologies: HDFS, Hive, Pig, Sqoop, Impala, Tez, Flume, Spark, Ganglia, HBase, Zookeeper, MapReduce, YARN, Kafka, Storm, Flink, Datameer, AWS S3, Azure Blob
6. Big Data ecosystem – High-level categories
All the technologies in the previous slide can be categorized into these:
• File system
• Data ingestion
• Data processing
  • Batch
  • Real time
  • Streaming
• Visualization
• Support
7. Big Data ecosystem – High-level categories
[Diagram: File System → Data Ingestion → Data Processing → Visualization → Insights, with Support spanning all categories]
8. Big Data ecosystem – File System
File systems supporting Big Data are typically distributed file systems. However, cloud-based storage is also becoming quite popular, as its pay-as-you-go model can cut operational costs significantly.
• HDFS – Hadoop Distributed File System
• AWS S3 – Amazon's cloud-based storage
• Azure Blob – Microsoft Azure's cloud-based storage
• NoSQL file systems
9. Big Data ecosystem – Data Ingestion
Data ingestion can be done either in real time or in batches. Data can be pulled from relational databases or streamed from web logs.
• Sqoop – a MapReduce-based tool to pull data in batches from relational databases into Big Data file systems
• Flume – an agent-based technology that can poll web server logs and deliver the data to a sink; Big Data stores are one category of sink
• Kafka – a queue-based technology from which data can be consumed by any downstream technology, including Big Data stores
• There are many other tools, and at times we may have to build custom ingestion to meet our requirements
10. Big Data ecosystem – Data Processing
Data processing is categorized into
• Batch
  • MapReduce – I/O driven
  • Spark – memory driven
• Real time (real-time operations)
  • NoSQL – HBase/MongoDB/Cassandra
  • Ad hoc querying – Impala/Presto/Spark SQL
• Streaming (near-real-time data processing)
  • Spark Streaming
  • Flink
  • Storm
11. Big Data ecosystem – Data Processing
Well-known examples of large-scale data processing:
• Amazon recommendation engine
• LinkedIn endorsements
12. Big Data ecosystem – Visualization
Once the data is processed, we need to visualize it using standard reporting tools or custom applications.
• Datameer
• d3.js
• Tableau
• QlikView
• and many more
13. Big Data ecosystem – Support
There are a bunch of tools used to support the clusters
• Ambari/Cloudera Manager/Ganglia – used to set up, monitor, and maintain the cluster
• Zookeeper – load balancing and failover
• Kerberos – security (authentication)
• Knox/Ranger – security (perimeter access and authorization)
14. Big Data ecosystem – High-level categories and skills mapping
[Diagram mapping categories to technologies: File System (HDFS, S3, Azure Blob, other), Data Ingestion (Sqoop, Flume, Kafka, custom), Data Processing (batch, real time, streaming), Visualization (Datameer, BI tools, custom), Insights, and Support (DevOps, Hadoop)]
15. Job Roles – Skills and Technologies
• Data Engineer
  • Responsibilities: data ingestion and processing
  • Skills: basic programming, Data Warehousing, ETL, data integration
  • Technologies (Big Data): Scala/Python, Spark, Flume/Kafka, Spark Streaming/Storm/Flink, etc.
• BI Developer
  • Responsibilities: reporting and visualization
  • Skills: reporting, domain knowledge, Data Warehousing, BI
  • Technologies (Big Data): BI tools such as Tableau, data modeling, visualization frameworks of R, Python, etc.
• Application Developer
  • Responsibilities: developing applications
  • Skills: advanced programming, application frameworks, databases
  • Technologies (Big Data): Java/Python, MVC, microservices, NoSQL, etc.
• DevOps Engineer
  • Responsibilities: maintaining infrastructure such as Big Data clusters
  • Skills: system administration, DevOps, cloud-based technologies
  • Technologies (Big Data): Puppet/Chef/Ansible, Cloudera/Hortonworks/MapR, etc., AWS Solutions Architect
16. Big Data ecosystem – High-level categories and skills mapping
(recap of the skills-mapping diagram from slide 14)
17. Big Data ecosystem – Data Science
• Data Science and Big Data are two different fields
• Data Science does not always sit on top of huge volumes of data
• Data Science can be implemented even using Excel on smaller volumes of data
• When it comes to larger volumes of data, Data Science teams work closely with Data Engineers to
  • Ingest data from different sources
  • Process data – data cleansing, standardization, aggregations, etc.
• Data can be ported to data science algorithms after processing
  • Data science algorithms can be applied using Big Data modules such as Mahout, Spark MLlib, etc.
• Data Scientists should be cognizant of the Big Data ecosystem, but need not be hands-on. Data Engineers are the ones who work on the Big Data ecosystem, but in smaller organizations the Data Scientist/Data Engineer has to be a master of both.
18. Apache Spark
• Apache Spark is an in-memory distributed processing framework on top of file systems such as HDFS, s3, Azure Blob, etc.
• There are a bunch of APIs to process the data, called Transformations and Actions; these are also known as Core Spark
• Tightly integrated with programming languages such as Scala, Python, Java, and R
• To be proficient in Spark, you need to learn one of these programming languages – preferably Scala or Python
• You will often hear about YARN and Mesos – they are just frameworks to run the jobs; developers do not need to worry much about them
21. Core Spark
• Core Spark is nothing but the low-level APIs that facilitate in-memory processing
• Building blocks
  • RDD (Resilient Distributed Dataset – an in-memory distributed collection)
  • APIs (Transformations and Actions)
• You can develop distributed applications using programming languages such as Java, Scala, Python, and R (Scala and Python are the better choices)
22. Data Frames
• In-memory distributed collection with structure
• Structure can be defined on the fly
• Inherited from the data frames of Python (pandas), R, etc.
• Once a Data Frame is created, the data can be processed using high-level APIs/interfaces such as Data Frame operations, SQL, etc.
23. Spark – Core vs. Data Frames
• Standard transformations
  • Data cleansing
  • Data standardization
  • Aggregations
  • Joins
  • Applying machine learning algorithms
• Core Spark is
  • a low-level API that can perform the standard transformations
  • able to process both structured and unstructured data
• Data Frames is a high-level API to process structured data; it is used as part of
  • Data Frame operations
  • Spark native SQL
  • Spark SQL in Hive context
24. Learning Spark (with weightages)
• It is mainly a Data Engineer skill – a must-have going forward
• Programming (Scala/Python/Java – preferably Scala or Python) – 30%
• Spark APIs – 10%
• SQL – 10%
• Sqoop – 5%
• Flume and Kafka – 15%
• Miscellaneous – 10%
• Overhead – 20%
• Start with programming, understand the APIs, learn SQL, learn Flume and Kafka along with Spark Streaming, and then keep adding other skills
25. Demo
• Programming Languages – Scala and Python
• File System – HDFS
• Data Processing – Spark
• Execution Framework – YARN
• The demo will be given on labs.itversity.com
26. We believe training related to open source should also be open source
• https://labs.itversity.com - low-cost Big Data lab to learn the technologies
• http://discuss.itversity.com - support while learning
• http://www.itversity.com - website for content
• https://youtube.com/itversityin
• https://github.com/dgadiraju