SlideShare a Scribd company logo
A tool for big data analysis
● Ganesha Yadiyala
● Big data consultant at
datamantra.io
● Consult in spark and scala
● ganeshayadiyala@gmail.com
Agenda
● Problem Statement
● Business view
● Why spark
● Thinking REST
● Load API
● Transform API
● Machine learning
● Pipeline API
● Save API
Problem Statement
Build a generic solution which can be used to do
transformation on data and then analyse it to get useful result
out of it.
Business view
● This is an era of big data.
● All companies are trying to get something useful from the
data and solve problems.
● There exists many frameworks in big data but we need a
tool which will leverage most of them and can solve
problem easily.
● So if there is a general solution or tool which can be able
to solve many of these problem that would be a big plus.
Why we used spark
There are many big data frameworks out there which can be
used for analysis of data, but we chose to use spark because,
● Capability to handle multiple data source
● Easy binding with the external data
● Good support for machine learning through spark-ml and
spark mllib
Thinking REST
To do all this transformation and analysis we provided REST
api because,
● Minimise the coupling between client and server
● Different clients can use the REST api to interact with the
tool.
● Used Akka-http for rest service
Akka-http
It is an Actor-based toolkit for interacting with web services
and clients,
● It is also written in scala and it uses same configuration
management library as spark
● It is an actor and future based
Rest server interaction
Client
Spark cluster
Rest Server
Call Rest API
Call spark API
Rest server design
● Instead of going with spark jobserver we went with our
own rest server
● Once the rest server is started spark context is created
● All the configuration is passed to the spark context
through typesafe during its creation
● Same context is used for all the operation.
Loading from different sources
We supported different types of data,
● Csv datasource
● Json datasource
● Parquet datasource
● Xml datasource
Loading from different sources
We also supported some of the sources like,
● Mongodb
● Kafka
● JDBC
● Cassandra
Transformation
In big data world data which is coming to the system cannot
be used as it is, we may have to transform the data as
needed for the operation
We gave the API’s in REST to do this transformation, which
internally call spark dataframe API’s
Example
Some of the transformation we provided is,
● Cast - Cast the datatype of a column
● Filter - filter based on the formula or condition
● Aggregation - Max,min,sum,median etc
● Joins - Joining two datasets
Machine learning - spark ml
Spark ml provides higher level API which is built on top of the
dataframe.
● We did not used mllib because that is built on top of the
rdd.
● We provided rest API which will talk to these ML apis
Example
Some of the ml apis we provided are,
● Linear regression
● Decision tree (regressor and classifier)
● Ridge regression
● KMeans etc...
Challenges in spark ml
● It was very difficult to write generic api because not all the
ml algorithms expect similar inputs
● Not all the apis are documented properly
● Validation on the type of the columns which can be given
to these API are really difficult.
Save API
Once the transformation is done or ml gives the output use
may want to save the result. We support,
● text
● json
● parquet
● mongodb
● cassandra etc...
Pipeline and scheduling
We also implemented a pipeline api which will pipe all the
loading, transformation or ml apis.
If the user want to run this operation at scheduled time it is
possible through schedule API which we have provided.
Summary
No solution will be able to solve all the big data problems, but
we tried to build a tool which is generic enough to write your
own transformation on data, analyse it and we can solve
many of the problems

More Related Content

What's hot

Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQL
datamantra
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark
datamantra
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
datamantra
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
datamantra
 
Introduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actorsIntroduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actors
Shashank L
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
datamantra
 
Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
datamantra
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
datamantra
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
datamantra
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
datamantra
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2
datamantra
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
datamantra
 
Evolution of apache spark
Evolution of apache sparkEvolution of apache spark
Evolution of apache spark
datamantra
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
datamantra
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
datamantra
 
Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streaming
datamantra
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
datamantra
 
Laskar: High-Velocity GraphQL & Lambda-based Software Development Model
Laskar: High-Velocity GraphQL & Lambda-based Software Development ModelLaskar: High-Velocity GraphQL & Lambda-based Software Development Model
Laskar: High-Velocity GraphQL & Lambda-based Software Development Model
Garindra Prahandono
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
datamantra
 
Machine learning pipeline with spark ml
Machine learning pipeline with spark mlMachine learning pipeline with spark ml
Machine learning pipeline with spark ml
datamantra
 

What's hot (20)

Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQL
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
 
Introduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actorsIntroduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actors
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
 
Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
 
Evolution of apache spark
Evolution of apache sparkEvolution of apache spark
Evolution of apache spark
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
 
Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streaming
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
 
Laskar: High-Velocity GraphQL & Lambda-based Software Development Model
Laskar: High-Velocity GraphQL & Lambda-based Software Development ModelLaskar: High-Velocity GraphQL & Lambda-based Software Development Model
Laskar: High-Velocity GraphQL & Lambda-based Software Development Model
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
 
Machine learning pipeline with spark ml
Machine learning pipeline with spark mlMachine learning pipeline with spark ml
Machine learning pipeline with spark ml
 

Viewers also liked

Learn Cloud Computing the right way!
Learn Cloud Computing the right way!Learn Cloud Computing the right way!
Learn Cloud Computing the right way!
Stefano Bellasio
 
Learn How to Build Mobile Apps Using Cloud Services
Learn How to Build Mobile Apps Using Cloud ServicesLearn How to Build Mobile Apps Using Cloud Services
Learn How to Build Mobile Apps Using Cloud Services
Max Katz
 
Introduction to RichFaces
Introduction to RichFacesIntroduction to RichFaces
Introduction to RichFaces
Max Katz
 
Starky and hutch 70s tv theme
Starky and hutch 70s tv themeStarky and hutch 70s tv theme
Starky and hutch 70s tv themeAlex Grasso
 
MongoDB World 2016: Lunch & Learn: Google Cloud for the Enterprise
MongoDB World 2016: Lunch & Learn: Google Cloud for the EnterpriseMongoDB World 2016: Lunch & Learn: Google Cloud for the Enterprise
MongoDB World 2016: Lunch & Learn: Google Cloud for the Enterprise
MongoDB
 
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Big Data Spain
 
How to Create 80% of a Big Data Pilot Project
How to Create 80% of a Big Data Pilot ProjectHow to Create 80% of a Big Data Pilot Project
How to Create 80% of a Big Data Pilot Project
Greg Makowski
 
The Google Scholar Revolution: a big data bibliometric tool
The Google Scholar Revolution:  a big data bibliometric toolThe Google Scholar Revolution:  a big data bibliometric tool
The Google Scholar Revolution: a big data bibliometric tool
Emilio Delgado Lopez-Cozar, Universidad de Granada
 
Journey Through the Cloud - Data Analysis
Journey Through the Cloud - Data AnalysisJourney Through the Cloud - Data Analysis
Journey Through the Cloud - Data Analysis
Amazon Web Services
 
Restaurant management
Restaurant managementRestaurant management
SC4 Workshop 1: Logistics and big data German herrero
SC4 Workshop 1: Logistics and big data  German herreroSC4 Workshop 1: Logistics and big data  German herrero
SC4 Workshop 1: Logistics and big data German herrero
BigData_Europe
 
Big data 101
Big data 101Big data 101
Big data 101
Lars Marius Garshol
 
Introduction to Hive and HCatalog
Introduction to Hive and HCatalogIntroduction to Hive and HCatalog
Introduction to Hive and HCatalog
markgrover
 
BigDataEurope - Big Data & Climate Change
BigDataEurope - Big Data & Climate ChangeBigDataEurope - Big Data & Climate Change
BigDataEurope - Big Data & Climate Change
BigData_Europe
 
Data analysis of weather forecasting
Data analysis of weather forecastingData analysis of weather forecasting
Data analysis of weather forecasting
Trupti Shingala, WAS, CPACC, CPWA, JAWS, CSM
 
Beyond Big Data: Harnessing the Industrial Internet for Wind Power
Beyond Big Data: Harnessing the Industrial Internet for Wind PowerBeyond Big Data: Harnessing the Industrial Internet for Wind Power
Beyond Big Data: Harnessing the Industrial Internet for Wind PowerGE_India
 
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Jeffrey Breen
 
Pareto chart using Openoffice.org
Pareto chart using Openoffice.orgPareto chart using Openoffice.org
Pareto chart using Openoffice.org
wremes
 
BIG DATA TO AVOID WEATHER RELATED FLIGHT DELAYS PPT
BIG DATA TO AVOID WEATHER RELATED FLIGHT DELAYS PPTBIG DATA TO AVOID WEATHER RELATED FLIGHT DELAYS PPT
BIG DATA TO AVOID WEATHER RELATED FLIGHT DELAYS PPT
JSPM's JSCOE , Pune Maharashtra.
 
Big data analysis concepts and references
Big data analysis concepts and referencesBig data analysis concepts and references
Big data analysis concepts and references
Information Security Awareness Group
 

Viewers also liked (20)

Learn Cloud Computing the right way!
Learn Cloud Computing the right way!Learn Cloud Computing the right way!
Learn Cloud Computing the right way!
 
Learn How to Build Mobile Apps Using Cloud Services
Learn How to Build Mobile Apps Using Cloud ServicesLearn How to Build Mobile Apps Using Cloud Services
Learn How to Build Mobile Apps Using Cloud Services
 
Introduction to RichFaces
Introduction to RichFacesIntroduction to RichFaces
Introduction to RichFaces
 
Starky and hutch 70s tv theme
Starky and hutch 70s tv themeStarky and hutch 70s tv theme
Starky and hutch 70s tv theme
 
MongoDB World 2016: Lunch & Learn: Google Cloud for the Enterprise
MongoDB World 2016: Lunch & Learn: Google Cloud for the EnterpriseMongoDB World 2016: Lunch & Learn: Google Cloud for the Enterprise
MongoDB World 2016: Lunch & Learn: Google Cloud for the Enterprise
 
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
 
How to Create 80% of a Big Data Pilot Project
How to Create 80% of a Big Data Pilot ProjectHow to Create 80% of a Big Data Pilot Project
How to Create 80% of a Big Data Pilot Project
 
The Google Scholar Revolution: a big data bibliometric tool
The Google Scholar Revolution:  a big data bibliometric toolThe Google Scholar Revolution:  a big data bibliometric tool
The Google Scholar Revolution: a big data bibliometric tool
 
Journey Through the Cloud - Data Analysis
Journey Through the Cloud - Data AnalysisJourney Through the Cloud - Data Analysis
Journey Through the Cloud - Data Analysis
 
Restaurant management
Restaurant managementRestaurant management
Restaurant management
 
SC4 Workshop 1: Logistics and big data German herrero
SC4 Workshop 1: Logistics and big data  German herreroSC4 Workshop 1: Logistics and big data  German herrero
SC4 Workshop 1: Logistics and big data German herrero
 
Big data 101
Big data 101Big data 101
Big data 101
 
Introduction to Hive and HCatalog
Introduction to Hive and HCatalogIntroduction to Hive and HCatalog
Introduction to Hive and HCatalog
 
BigDataEurope - Big Data & Climate Change
BigDataEurope - Big Data & Climate ChangeBigDataEurope - Big Data & Climate Change
BigDataEurope - Big Data & Climate Change
 
Data analysis of weather forecasting
Data analysis of weather forecastingData analysis of weather forecasting
Data analysis of weather forecasting
 
Beyond Big Data: Harnessing the Industrial Internet for Wind Power
Beyond Big Data: Harnessing the Industrial Internet for Wind PowerBeyond Big Data: Harnessing the Industrial Internet for Wind Power
Beyond Big Data: Harnessing the Industrial Internet for Wind Power
 
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
 
Pareto chart using Openoffice.org
Pareto chart using Openoffice.orgPareto chart using Openoffice.org
Pareto chart using Openoffice.org
 
BIG DATA TO AVOID WEATHER RELATED FLIGHT DELAYS PPT
BIG DATA TO AVOID WEATHER RELATED FLIGHT DELAYS PPTBIG DATA TO AVOID WEATHER RELATED FLIGHT DELAYS PPT
BIG DATA TO AVOID WEATHER RELATED FLIGHT DELAYS PPT
 
Big data analysis concepts and references
Big data analysis concepts and referencesBig data analysis concepts and references
Big data analysis concepts and references
 

Similar to A Tool For Big Data Analysis using Apache Spark

Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
datamantra
 
A compute infrastructure for data scientists
A compute infrastructure for data scientistsA compute infrastructure for data scientists
A compute infrastructure for data scientists
Stitch Fix Algorithms
 
Building RESTtful services in MEAN
Building RESTtful services in MEANBuilding RESTtful services in MEAN
Building RESTtful services in MEAN
Madhukara Phatak
 
Spark Workflow Management
Spark Workflow ManagementSpark Workflow Management
Spark Workflow Management
Romi Kuntsman
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
Databricks
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
Omid Vahdaty
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache Flink
AKASH SIHAG
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKC
Mark Smith
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
datamantra
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
confluent
 
Building self service framework
Building self service frameworkBuilding self service framework
Building self service framework
Rovshan Musayev
 
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionData Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
Formulatedby
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
datamantra
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
DataWorks Summit
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
Knoldus Inc.
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines Workshop
Holden Karau
 
Dataflow.pptx
Dataflow.pptxDataflow.pptx
Dataflow.pptx
Sadeka Islam
 
Apache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectosApache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectos
OpenSistemas
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Databricks
 
Cassandra Lunch #88: Cadence
Cassandra Lunch #88: CadenceCassandra Lunch #88: Cadence
Cassandra Lunch #88: Cadence
Anant Corporation
 

Similar to A Tool For Big Data Analysis using Apache Spark (20)

Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
 
A compute infrastructure for data scientists
A compute infrastructure for data scientistsA compute infrastructure for data scientists
A compute infrastructure for data scientists
 
Building RESTtful services in MEAN
Building RESTtful services in MEANBuilding RESTtful services in MEAN
Building RESTtful services in MEAN
 
Spark Workflow Management
Spark Workflow ManagementSpark Workflow Management
Spark Workflow Management
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache Flink
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKC
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
 
Building self service framework
Building self service frameworkBuilding self service framework
Building self service framework
 
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionData Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines Workshop
 
Dataflow.pptx
Dataflow.pptxDataflow.pptx
Dataflow.pptx
 
Apache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectosApache spark y cómo lo usamos en nuestros proyectos
Apache spark y cómo lo usamos en nuestros proyectos
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Cassandra Lunch #88: Cadence
Cassandra Lunch #88: CadenceCassandra Lunch #88: Cadence
Cassandra Lunch #88: Cadence
 

More from datamantra

State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming
datamantra
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
datamantra
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
datamantra
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
datamantra
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
datamantra
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
datamantra
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
datamantra
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
datamantra
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
datamantra
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
datamantra
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
datamantra
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
datamantra
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
datamantra
 
Platform for Data Scientists
Platform for Data ScientistsPlatform for Data Scientists
Platform for Data Scientists
datamantra
 
Building scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPBuilding scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTP
datamantra
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
datamantra
 

More from datamantra (16)

State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
 
Platform for Data Scientists
Platform for Data ScientistsPlatform for Data Scientists
Platform for Data Scientists
 
Building scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPBuilding scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTP
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
 

Recently uploaded

一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
AnirbanRoy608946
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
eddie19851
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 

Recently uploaded (20)

一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 

A Tool For Big Data Analysis using Apache Spark

  • 1. A tool for big data analysis
  • 2. ● Ganesha Yadiyala ● Big data consultant at datamantra.io ● Consult in spark and scala ● ganeshayadiyala@gmail.com
  • 3. Agenda ● Problem Statement ● Business view ● Why spark ● Thinking REST ● Load API ● Transform API ● Machine learning ● Pipeline API ● Save API
  • 4. Problem Statement Build a generic solution which can be used to do transformation on data and then analyse it to get useful result out of it.
  • 5. Business view ● This is an era of big data. ● All companies are trying to get something useful from the data and solve problems. ● There exists many frameworks in big data but we need a tool which will leverage most of them and can solve problem easily. ● So if there is a general solution or tool which can be able to solve many of these problem that would be a big plus.
  • 6. Why we used spark There are many big data frameworks out there which can be used for analysis of data, but we chose to use spark because, ● Capability to handle multiple data source ● Easy binding with the external data ● Good support for machine learning through spark-ml and spark mllib
  • 7. Thinking REST To do all this transformation and analysis we provided REST api because, ● Minimise the coupling between client and server ● Different clients can use the REST api to interact with the tool. ● Used Akka-http for rest service
  • 8. Akka-http It is an Actor-based toolkit for interacting with web services and clients, ● It is also written in scala and it uses same configuration management library as spark ● It is an actor and future based
  • 9. Rest server interaction Client Spark cluster Rest Server Call Rest API Call spark API
  • 10. Rest server design ● Instead of going with spark jobserver we went with our own rest server ● Once the rest server is started spark context is created ● All the configuration is passed to the spark context through typesafe during its creation ● Same context is used for all the operation.
  • 11. Loading from different sources We supported different types of data, ● Csv datasource ● Json datasource ● Parquet datasource ● Xml datasource
  • 12. Loading from different sources We also supported some of the sources like, ● Mongodb ● Kafka ● JDBC ● Cassandra
  • 13. Transformation In big data world data which is coming to the system cannot be used as it is, we may have to transform the data as needed for the operation We gave the API’s in REST to do this transformation, which internally call spark dataframe API’s
  • 14. Example Some of the transformation we provided is, ● Cast - Cast the datatype of a column ● Filter - filter based on the formula or condition ● Aggregation - Max,min,sum,median etc ● Joins - Joining two datasets
  • 15. Machine learning - spark ml Spark ml provides higher level API which is built on top of the dataframe. ● We did not used mllib because that is built on top of the rdd. ● We provided rest API which will talk to these ML apis
  • 16. Example Some of the ml apis we provided are, ● Linear regression ● Decision tree (regressor and classifier) ● Ridge regression ● KMeans etc...
  • 17. Challenges in spark ml ● It was very difficult to write generic api because not all the ml algorithms expect similar inputs ● Not all the apis are documented properly ● Validation on the type of the columns which can be given to these API are really difficult.
  • 18. Save API Once the transformation is done or ml gives the output use may want to save the result. We support, ● text ● json ● parquet ● mongodb ● cassandra etc...
  • 19. Pipeline and scheduling We also implemented a pipeline api which will pipe all the loading, transformation or ml apis. If the user want to run this operation at scheduled time it is possible through schedule API which we have provided.
  • 20. Summary No solution will be able to solve all the big data problems, but we tried to build a tool which is generic enough to write your own transformation on data, analyse it and we can solve many of the problems