Big Data and Spark
Big Data overview and Spark
training@itversity.com
Agenda
• Introduction
• Big Data eco system
• Job roles
• Spark (Core Spark and Data Frames)
• Demo
Introduction
• About me - https://www.linkedin.com/in/durga0gadiraju/
• 13+ years of industry experience building large-scale, data-driven applications
• IT Versity, LLC is a Dallas-based startup specializing in low-cost, quality training in emerging technologies such as Big Data and Cloud
• We provide training using the following platforms
• https://labs.itversity.com - low cost big data lab to learn technologies.
• http://discuss.itversity.com - support while learning
• http://www.itversity.com - website for content
• https://youtube.com/itversityin
• https://github.com/dgadiraju
Big Data eco system
Big Data technologies
• HDFS, AWS s3, Azure Blob
• Hive, Pig, Sqoop, Impala, Tez
• Flume, Kafka, Storm, Flink
• Spark, Map Reduce, YARN, HBase
• Zookeeper, Ganglia, Datameer
Let us understand this vast array of tools
Big Data eco system – High level categories
All the technologies on the previous slide can be grouped into these categories
• File system
• Data ingestion
• Data processing
  • Batch
  • Real time
  • Streaming
• Visualization
• Support
Big Data eco system – High level categories
• File System
• Data Ingestion
• Data Processing
• Visualization
• Insights
• Support
Big Data eco system – File System
File systems supporting Big Data are typically distributed file systems. However, cloud-based storage is also becoming popular, as a pay-as-you-go model can cut operational costs significantly.
• HDFS – Hadoop Distributed File System
• AWS S3 – Amazon’s cloud-based storage
• Azure Blob – Microsoft Azure’s cloud-based storage
• NoSQL data stores
Big Data eco system – Data Ingestion
Data ingestion can be done either in real time or in batches. Data can be pulled from relational databases or streamed from sources such as web logs.
• Sqoop – a Map Reduce based tool to pull data in batches from relational databases into Big Data file systems
• Flume – an agent-based technology that can collect web server logs and deliver the data to any sink; Big Data technologies are one category of sink
• Kafka – a queue-based technology from which data can be consumed by any downstream technology; Big Data systems are one category of consumer
• There are many other tools, and at times we might have to build custom ingestion as per our requirements
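The queue-based hand-off described for Kafka can be sketched in plain Python. This is only an analogy under stated assumptions: it uses the standard library `queue.Queue` rather than a Kafka client, and the names `produce`, `consume` and the sample log lines are hypothetical.

```python
import queue

def produce(q, lines):
    """Producer side: push each log line onto the queue, then a sentinel."""
    for line in lines:
        q.put(line)
    q.put(None)  # sentinel marking the end of the sample data

def consume(q):
    """Consumer side: pull records until the sentinel and collect them in a sink."""
    sink = []
    while True:
        record = q.get()
        if record is None:
            break
        sink.append(record)
    return sink

# Hypothetical web server log lines standing in for a real source.
log_lines = ["GET /index.html 200", "GET /cart 500", "POST /login 200"]

q = queue.Queue()
produce(q, log_lines)
landed = consume(q)
print(landed)
```

The point of the sketch is the decoupling: the producer does not need to know which technology consumes the data, which is why the same queue can feed Big Data systems or anything else.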
Big Data eco system – Data Processing
Data processing is categorized into
• Batch
  • Map Reduce – I/O driven
  • Spark – memory driven
• Real time (real time operations)
  • NoSQL – HBase/MongoDB/Cassandra
  • Ad hoc querying – Impala/Presto/Spark SQL
• Streaming (near real time data processing)
  • Spark Streaming
  • Flink
  • Storm
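The difference between batch and near real time processing can be illustrated with a small plain-Python sketch. It is an assumption-laden simplification: real streaming engines usually cut batches by time interval, while the hypothetical `micro_batches` helper below cuts them by record count, and the event values are made up.

```python
from itertools import islice

def micro_batches(events, batch_size):
    """Yield successive small batches from a (possibly unbounded) event source."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

events = [3, 1, 4, 1, 5, 9, 2]  # hypothetical event values

# Batch processing: wait for all the data, then process it once.
batch_total = sum(events)

# Near real time processing: produce a result for each small batch as it arrives.
batch_totals = [sum(b) for b in micro_batches(events, 3)]

print(batch_total)   # 25
print(batch_totals)  # [8, 15, 2]
```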
Big Data eco system – Data Processing
• Amazon Recommendation engine
• LinkedIn endorsements
Big Data eco system – Visualization
Once the data is processed we need to visualize the data using
standard reporting tools or custom applications.
• Datameer
• D3.js
• Tableau
• QlikView
• and many more
Big Data eco system – Support
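A number of tools are used to support the clusters
• Ambari/Cloudera Manager/Ganglia – used to set up, monitor and maintain the tools
• Zookeeper – coordination service used for configuration, leader election and failover
• Kerberos – security (authentication)
• Knox/Ranger – gateway access (Knox) and authorization policies (Ranger)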
Big Data eco system – High level categories and skills mapping
• File System – HDFS, s3, Azure Blob, Other
• Data Ingestion – Sqoop, Flume, Kafka, Custom
• Data Processing – Batch, Real Time, Streaming
• Visualization – Datameer, BI Tools, Custom
• Insights
• Support – DevOps, Hadoop
Job Roles – Skills and Technologies
• Data Engineer
  • Responsibilities – Data ingestion and processing
  • Skills – Basic programming, Data Warehousing, ETL, Data integration
  • Technologies (Big Data) – Scala/Python, Spark, Flume/Kafka, Spark Streaming/Storm/Flink etc
• BI Developer
  • Responsibilities – Reporting and Visualization
  • Skills – Reporting, Domain knowledge, Data Warehousing, BI
  • Technologies (Big Data) – BI Tools such as Tableau, Data Modeling, Visualization frameworks of R, Python etc.
• Application Developer
  • Responsibilities – Developing applications
  • Skills – Advanced programming, Application frameworks, Databases
  • Technologies (Big Data) – Java/Python, MVC, Micro Services, NoSQL etc
• DevOps Engineer
  • Responsibilities – Maintaining infrastructure such as Big Data clusters
  • Skills – System Administration, DevOps, Cloud based technologies
  • Technologies (Big Data) – Puppet/Chef/Ansible, Cloudera/Hortonworks/MapR etc, AWS
• Solutions Architect
Big Data eco system – Data Science
• Data Science and Big Data are two different fields
• Data Science is not always applied to huge volumes of data
• Data Science can be done even in Excel on smaller volumes of data
• When it comes to larger volumes of data, Data Science teams work closely with Data Engineers to
  • Ingest data from different sources
  • Process data – Data Cleansing, Standardization, Aggregations etc
• After processing, the data can be fed to data science algorithms
• Data science algorithms can be applied using Big Data modules such as Mahout, Spark MLlib etc
• Data Scientists should be familiar with the Big Data eco system, but need not be hands-on. Data Engineers are the ones who work on the Big Data eco system. In smaller organizations, however, a Data Scientist/Data Engineer has to master both.
Apache Spark
• Apache Spark is an in-memory distributed processing framework that runs on top of file systems such as HDFS, s3, Azure Blob etc
• It provides a set of APIs to process data, called Transformations and Actions. Together they are known as Core Spark.
• Tightly integrated with programming languages such as Scala, Python, Java, R
• To be proficient in Spark, you need to learn one of these programming languages – preferably Scala or Python
• You will often hear about YARN and Mesos – they are cluster managers on which Spark jobs run. Developers usually do not need to worry about them.
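In Spark, transformations only describe work and an action triggers it. The plain-Python sketch below shows the same idea with lazy built-ins; it is an analogy, not the Spark API, and the `parse` helper, the `log` list and the sample data are hypothetical.

```python
log = []  # records each time real work happens

def parse(x):
    """Convert a string to an int and note that the call was made."""
    log.append(f"parse {x}")
    return int(x)

raw = ["1", "2", "3", "4"]

# "Transformations": map and filter build a lazy pipeline; nothing runs yet.
parsed = map(parse, raw)
evens = filter(lambda n: n % 2 == 0, parsed)
calls_before_action = len(log)

# "Action": consuming the pipeline is what triggers the work.
total = sum(evens)
calls_after_action = len(log)

print(calls_before_action, calls_after_action, total)  # 0 4 6
```

Deferring the work this way is what lets an engine see the whole pipeline before running it.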
Spark Modules
• Spark Core
• Programming Languages (Scala, Java, Python, R)
• DataFrames
• Spark SQL
• Spark Streaming
• MLlib
• GraphX
Spark Execution Modes
• Local
• Standalone
• Mesos
• YARN
Core Spark
• Core Spark is the set of low level APIs that facilitate in-memory processing
• Building Blocks
  • RDD (Resilient Distributed Dataset – in-memory distributed collection)
  • APIs (Transformations and Actions)
• You can develop distributed applications using programming languages such as Java, Scala, Python, R (Scala and Python are preferred)
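The idea of a distributed collection processed partition by partition can be sketched in plain Python. This is a single-machine illustration, not Spark code: the `partitions` list, the `count_words` helper and the sample sentences are assumptions made for the example.

```python
from collections import Counter
from functools import reduce

# Hypothetical sample data, split into two "partitions" the way a
# distributed collection would be spread across machines.
partitions = [
    ["spark is fast", "spark is in memory"],
    ["hdfs is a file system"],
]

def count_words(lines):
    """Per-partition work: split lines into words and count them locally."""
    return Counter(word for line in lines for word in line.split())

# Each partition is processed independently of the others.
partial_counts = [count_words(p) for p in partitions]

# Partial results are merged key by key into the final counts.
word_counts = reduce(lambda a, b: a + b, partial_counts, Counter())

print(word_counts["spark"], word_counts["is"])  # 2 3
```

Because each partition is counted on its own and only the small partial results are merged, the same pattern scales out when partitions live on different machines.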
Data Frames
• In-memory distributed collection with structure
• Structure can be defined on the fly
• Modeled on the data frames of Python, R etc
• Once a Data Frame is created, data can be processed using high level APIs/interfaces such as Data Frame operations, SQL etc.
Spark – Core vs. Data Frames
• Standard transformations
  • Data Cleansing
  • Data Standardization
  • Aggregations
  • Joins
  • Applying machine learning algorithms
• Core Spark is
  • A low level API which can perform standard transformations
  • Able to process both structured and unstructured data
• Data Frames is a high level API to process structured data; it is used through
  • Data Frame operations
  • Spark native SQL
  • Spark SQL in Hive Context
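The contrast between a low level API and a structured, SQL-style interface can be shown with one aggregation written both ways. This sketch uses the standard library `sqlite3` module as a stand-in for a SQL interface over structured data; it is not Spark SQL, and the `orders` data and column names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical (category, amount) records.
orders = [("books", 12.5), ("games", 30.0), ("books", 7.5)]

# Low level style: positional tuples and a hand-written aggregation.
totals_low_level = defaultdict(float)
for category, amount in orders:
    totals_low_level[category] += amount

# High level style: give the data named columns, then express the
# same aggregation declaratively in SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
totals_sql = dict(
    conn.execute("SELECT category, SUM(amount) FROM orders GROUP BY category")
)
conn.close()

print(dict(totals_low_level))
print(totals_sql)
```

Both styles give the same totals; the structured version trades flexibility over arbitrary data for a shorter, declarative query.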
Learning Spark (with weightages)
• It is mainly a Data Engineer skill, and a must-have going forward
• Programming (Scala/Python/Java – preferably Scala or Python) – 30%
• Spark APIs – 10%
• SQL – 10%
• Sqoop – 5%
• Flume and Kafka – 15%
• Miscellaneous – 10%
• Overhead – 20%
• Start with programming, understand the APIs, learn SQL, learn Flume and Kafka along with Spark Streaming, and then keep adding other skills
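The weightages above can be turned into a rough study plan. The sketch below checks that they add up to 100% and splits a total number of hours accordingly; the 200-hour budget is purely a hypothetical figure chosen for the example.

```python
# Weightages (in percent) as listed on the slide.
weights = {
    "Programming": 30,
    "Spark APIs": 10,
    "SQL": 10,
    "Sqoop": 5,
    "Flume and Kafka": 15,
    "Miscellaneous": 10,
    "Overhead": 20,
}

total_pct = sum(weights.values())

total_hours = 200  # hypothetical study budget, not from the slides
hours = {topic: total_hours * pct // 100 for topic, pct in weights.items()}

print(total_pct)                              # 100
print(hours["Programming"], hours["Sqoop"])   # 60 10
```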
Demo
• Programming Language – Scala and Python
• File System – HDFS
• Data Processing – Spark
• Execution Framework – YARN
• Demo will be given on labs.itversity.com
We believe that training on open source technologies should itself be open source
• https://labs.itversity.com - low cost big data lab to learn technologies.
• http://discuss.itversity.com - support while learning
• http://www.itversity.com - website for content
• https://youtube.com/itversityin
• https://github.com/dgadiraju
Q&A

Big Data Introduction - Solix empower

  • 1. Big Data and Spark Big Data overview and Spark training@itversity.com
  • 2. Agenda • Introduction • Big Data eco system • Job roles • Spark (Core Spark and Data Frames) • Demo
  • 3. Introduction • About me - https://www.linkedin.com/in/durga0gadiraju/ • 13+ years of rich industry experience in building large scale data driven applications • IT Versity, LLC is a Dallas-based startup specializing in low cost, quality training in emerging technologies such as Big Data, Cloud etc. • We provide training using the following platforms • https://labs.itversity.com - low cost big data lab to learn technologies. • http://discuss.itversity.com - support while learning • http://www.itversity.com - website for content • https://youtube.com/itversityin • https://github.com/dgadiraju
  • 4. Big Data eco system – Big Data technologies: HDFS, Hive, Pig, Sqoop, Impala, Tez, Flume, Spark, Ganglia, HBase, Zookeeper, Map Reduce, YARN, Kafka, Storm, Flink, Datameer, AWS s3, Azure Blob
  • 5. Let us understand this vast array of tools
  • 6. Big Data eco system – High level categories All the technologies in the previous slide can be categorized into these • File system • Data ingestion • Data processing • Batch • Real time • Streaming • Visualization • Support
  • 7. Big Data eco system – High level categories (diagram): File System, Data Ingestion, Data Processing, Visualization, Insights, Support
  • 8. Big Data eco system – File System File systems supporting Big Data should typically be distributed file systems. However, cloud based storage is also becoming quite popular, as it can cut operational costs significantly with a pay-as-you-go model. • HDFS – Hadoop Distributed File System • AWS S3 – Amazon’s cloud based storage • Azure Blob – Microsoft Azure’s cloud based storage • NoSQL file systems
  • 9. Big Data eco system – Data Ingestion Data ingestion can be done either in real time or in batches. Data can be pulled from relational databases or streamed from web logs. • Sqoop – a Map Reduce based tool to pull data in batches from relational databases into Big Data file systems • Flume – an agent based technology which can poll web server logs and pull data to save it in any sink. One category of sink is Big Data technologies • Kafka – a queue based technology from which data can be consumed by any technology. One category of consumer is Big Data technologies • There are many other tools, and at times we might have to customize them as per our requirements
  • 10. Big Data eco system – Data Processing Data processing is categorized into • Batch • Map Reduce – I/O driven • Spark – Memory driven • Real time (real time operations) • NoSQL – HBase/MongoDB/Cassandra • Ad hoc querying – Impala/Presto/Spark SQL • Streaming (near real time data processing) • Spark Streaming • Flink • Storm
  • 11. Big Data eco system – Data Processing • Amazon Recommendation engine • LinkedIn endorsements
  • 12. Big Data eco system – Visualization Once the data is processed we need to visualize the data using standard reporting tools or custom applications. • Datameer • d3js • Tableau • Qlikview • and many more
  • 13. Big Data eco system – Support There are a bunch of tools which are used to support the clusters • Ambari/Cloudera Manager/Ganglia – Used to set up and maintain the tools • Zookeeper – Load balancing and fail over • Kerberos – Security • Knox/Ranger
  • 14. Big Data eco system – High level categories and skills mapping (diagram) • File System: HDFS, s3, Azure Blob, Other • Data Ingestion: Sqoop, Flume, Kafka, Custom • Data Processing: Batch, Real Time, Streaming • Visualization: Datameer, BI Tools, Custom • Insights • Support: DevOps, Hadoop
  • 15. Job Roles – Skills and Technologies
      • Data Engineer – Responsibilities: Data ingestion and processing. Skills: Basic programming, Data Warehousing, ETL, Data integration. Technologies (Big Data): Scala/Python, Spark, Flume/Kafka, Spark Streaming/Storm/Flink etc.
      • BI Developer – Responsibilities: Reporting and Visualization. Skills: Reporting, Domain knowledge, Data Warehousing, BI. Technologies (Big Data): BI Tools such as Tableau, Data Modeling, Visualization frameworks of R, Python etc.
      • Application Developer – Responsibilities: Developing applications. Skills: Advanced programming, Application frameworks, Databases. Technologies (Big Data): Java/Python, MVC, Micro Services, NoSQL etc.
      • DevOps Engineer – Responsibilities: Maintaining infrastructure such as Big Data clusters. Skills: System Administration, DevOps, Cloud based technologies. Technologies (Big Data): Puppet/Chef/Ansible, Cloudera/Hortonworks/MapR etc., AWS Solutions Architect
  • 16. Big Data eco system – High level categories and skills mapping (diagram) • File System: HDFS, s3, Azure Blob, Other • Data Ingestion: Sqoop, Flume, Kafka, Custom • Data Processing: Batch, Real Time, Streaming • Visualization: Datameer, BI Tools, Custom • Insights • Support: DevOps, Hadoop
  • 17. Big Data eco system – Data Science • Data Science and Big Data are 2 different fields • Data Science is not always on top of huge volumes of data • Data Science can be implemented even using Excel on smaller volumes of data • When it comes to larger volumes of data, Data Science teams work closely with Data Engineers to • Ingest data from different sources • Process data – Data Cleansing, Standardization, Aggregations etc. • Data can be ported to data science algorithms after processing • Data science algorithms can be applied by using Big Data modules such as Mahout, Spark MLlib etc. • Data Scientists should be cognizant of the Big Data eco system, but need not be hands on. Data Engineers are the ones who work on the Big Data eco system. But in smaller organizations, a Data Scientist/Data Engineer has to master both.
  • 18. Apache Spark • Apache Spark is an in-memory distributed processing framework on top of file systems such as HDFS, s3, Azure Blob etc. • There are a bunch of APIs to process the data. They are called Transformations and Actions. They are also known as Core Spark. • Tightly integrated with programming languages such as Scala, Python, Java, R • To be proficient in Spark, you need to learn one of the programming languages – preferably Scala or Python • You often hear about YARN and Mesos – they are just frameworks to run the jobs. Developers do not need to worry about them.
  • 19. Spark Modules (diagram): Spark Core • Programming Languages (Scala, Java, Python, R) • DataFrames • Spark SQL • Spark Streaming • MLlib • GraphX
  • 20. Spark Execution Modes • Local • Standalone • Mesos • YARN
  • 21. Core Spark • Core Spark is nothing but low level APIs to facilitate in-memory processing • Building Blocks • RDD (Resilient Distributed Dataset – in-memory distributed collection) • APIs (Transformations and Actions) • You can develop distributed applications using programming languages such as Java, Scala, Python, R (Scala and Python are better)
  • 22. Data Frames • In-memory distributed collection with structure • Structure can be defined on the fly • Inherited from data frames of Python, R etc. • Once a Data Frame is created, data can be processed using high level APIs/interfaces such as Data Frame operations, SQL etc.
  • 23. Spark – Core vs. Data Frames • Standard transformations • Data Cleansing • Data Standardization • Aggregations • Joins • Applying machine learning algorithms • Core Spark is • A low level API which can perform standard transformations • Able to process both structured and unstructured data • Data Frames is a high level API to process structured data; it is used as part of • Data Frame operations • Spark native SQL • Spark SQL in Hive Context
  • 24. Learning Spark (with weightages) • It is mainly a Data Engineer skill. A must have going forward • Programming (Scala/Python/Java – preferably Scala or Python) – 30% • Spark APIs – 10% • SQL – 10% • Sqoop – 5% • Flume and Kafka – 15% • Miscellaneous – 10% • Overhead – 20% • Start with programming, understand the APIs, learn SQL, learn Flume and Kafka along with Spark Streaming, and then keep adding other skills
  • 25. Demo • Programming Language – Scala and Python • File System – HDFS • Data Processing – Spark • Execution Framework – YARN • Demo will be given on labs.itversity.com
  • 26. We believe that training related to open source should also be open source • https://labs.itversity.com - low cost big data lab to learn technologies. • http://discuss.itversity.com - support while learning • http://www.itversity.com - website for content • https://youtube.com/itversityin • https://github.com/dgadiraju
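
The Core Spark and Data Frames slides describe two ways of expressing the same standard transformations (cleansing, aggregation). The following is a minimal sketch of that contrast in Python, not the demo from the talk. The sample records, the function names (`parse`, `totals_locally`, `totals_with_spark`) and the column names are all hypothetical, and the Spark part assumes `pyspark` is installed and runs in local mode rather than on YARN.

```python
from collections import defaultdict

# Hypothetical sample data: "category,amount" text lines, one of them malformed.
LINES = ["books,10.0", "music,5.5", "books,2.5", "bad line"]


def parse(line):
    """Data cleansing step: return (category, amount), or None if malformed."""
    parts = line.split(",")
    if len(parts) != 2:
        return None
    try:
        return parts[0].strip().lower(), float(parts[1])
    except ValueError:
        return None


def totals_locally(lines):
    """Plain-Python reference for the aggregation, handy for checking results."""
    totals = defaultdict(float)
    for category, amount in filter(None, map(parse, lines)):
        totals[category] += amount
    return dict(totals)


def totals_with_spark(lines):
    """Same aggregation with Core Spark (RDD) and with Data Frames.

    Assumes pyspark is available; imported here so the rest of the file
    still runs without it.
    """
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .master("local[*]")           # Local execution mode
             .appName("core-vs-data-frames")
             .getOrCreate())
    try:
        # Core Spark: map and filter are transformations and stay lazy.
        records = (spark.sparkContext.parallelize(lines)
                   .map(parse)
                   .filter(lambda record: record is not None))
        # reduceByKey is also a transformation; collect is the action
        # that actually triggers execution.
        rdd_totals = dict(records.reduceByKey(lambda a, b: a + b).collect())

        # Data Frames: attach structure (column names) on the fly, then use
        # the high level API instead of hand-written functions.
        df = spark.createDataFrame(records, ["category", "amount"])
        rows = df.groupBy("category").agg(F.sum("amount").alias("total")).collect()
        df_totals = {row["category"]: row["total"] for row in rows}
        return rdd_totals, df_totals
    finally:
        spark.stop()
```

On a machine with `pyspark` installed, `totals_with_spark(LINES)` should return two dictionaries that match `totals_locally(LINES)`. The cleansing logic lives in an ordinary Python function, which is why it can be checked without a cluster and then passed unchanged to `map`.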