SlideShare a Scribd company logo
Using Spark for
Timeseries Graph
Analytics
- VED MULKALWAR
Content
 Introduction to sample problem statement
 Which Graph database is used and why
 Installing Titan
 Titan with Cassandra
 The Gremlin Cassandra script: A way to store data in cassandra from Titan
Gremlin
 Accessing Titan with Spark
Introducing which kind of time-series
problems that can be solved using
graph analytics and introducing the
problem statement which I have been
working on solving.
The dynamics of a complex system is usually
recorded in the form of time series. In recent
years, the visibility graph algorithm and the
horizontal visibility graph algorithm have been
recently introduced as the mapping between
time series and complex networks.
Transforming time series into the graphs, the
algorithms allows applying the methods of
graph theoretical tools for characterizing time
series, opening the possibility of building
fruitful connections between time series
analysis, nonlinear dynamics, and graph theory.
 The problem statement which I have been working on:
 Our initial goal was finding anomalies in the given timeseries. We started
with slicing the data into small meaningfull parts so that the data becomes
smaller and contiguous data points having the same property get clubbed
together.
 Later we created a graph with these parts and tried to find the parts having
similar properties. Doing this will single out anomalies which was the goal.
Which graph database I used and why
i.e. Difference between graphX, neo4j
and titan , why we used titan
Titan vs graphX
 The fundamental difference between Titan and GraphX lies in how they
persistence data and how they process data, Titan by default persists data
(vertices and edges and properties) to a distributed data store in the form of
tables bound to a specific schema. This schema can be stored in Berkley DB
tables, Cassandra tables or Hbase tables.
 In the case of Titan the graph is stored as vertices in a vertex table and edges
in an edge table.
 GraphX has no real persistence layer (yet?), yes it can persist to HDFS files, but
it cannot persist to a distributed datastore in a common schema like form as in
Titan’s case.
 A graph only exists in GraphX when it is loaded into memory off raw data and
interpreted as Graph RDDs, Titan stores the graph permanently.
 The other major difference between the two solutions is how they process
graph data.
 By default, GraphX solves queries via distributed processing on many nodes in
parallel where possible as opposed to Titan processing pipelines on a single
node. Titan can also take advantage of parallel processing via Faunus/HDFS if
necessary.
Neo4j vs titan
 The primary difference between Titan and Neo4j is scalability: Titan can distribute the graph
across multiple machines (using either Cassandra or Hbase as the storage backend) which
does three things:
 1) It allows Titan to store really, really large graphs by scaling out
 2) It allows Titan to handle massive write loads
 3) No single point of failure for very high availability. While Neo4j's HA setup gives you
redundancy through replication, death of the master in such a setup leads to temporary
service interruption while a new master is elected.
 This allows Titan to scale to large deployments with thousands of concurrent users as
demonstrated in the benchmark:
 http://thinkaurelius.com/2012/08/06/titan-provides-real-time-big-graph-data/
 Neo4j has been around much longer and is therefore more "mature" in some regards:
 1) Integrated with an external lucene index gives it more sophisticated search capabilities
 2) More integration points into other programming languages and development frameworks
(e.g. spring)
 3) Has been used by more people
 Titan is a native blueprints implementation and benefits from all the features that come with
the Tinkerpop stack. Titan does not support cypher but only the gremlin standard.
Intro to Titan graph db and how to
use it with cassandra & spark
Installing Titan
 Downloaded the latest prebuilt version (0.9.0-M2) of Titan at
s3.thinkaurelius.com/downloads/titan/titan-0.9.0-M2-hadoop1.zip.
 Carry out the following steps to ensure that Titan is installed on each node
in the cluster:
 Now, use the Linux su (switch user) command to change to the root
account, and move the install to the /usr/local/ location. Change the file
and group membership of the install to the hadoop user, and create a
symbolic link called titan so that the current Titan release can be referred
to as the simplified path called /usr/local/titan:
Titan with Cassandra
 In this section, the Cassandra NoSQL database will be used as a storage
mechanism for Titan. Although it does not use Hadoop, it is a large-scale,
cluster-based database in its own right, and can scale to very large cluster
sizes. A graph will be created, and stored in Cassandra using the Titan
Gremlin shell. It will then be checked using Gremlin, and the stored data
will be checked in Cassandra. The raw Titan Cassandra graph-based data
will then be accessed from Spark. The first step then will be to install
Cassandra on each node in the cluster.
 Install Cassandra on all the nodes
 Set up the Cassandra configuration under /etc/cassandra/conf by altering
the cassandra.yaml file:
 Install Cassandra on all the nodes
 Set up the Cassandra configuration under /etc/cassandra/conf by altering
the cassandra.yaml file:
 Log files can be found under /var/log/cassandra, and the data is stored
under /var/lib/cassandra. The nodetool command can be used on any
Cassandra node to check the status of the Cassandra cluster:
 The Cassandra CQL shell command called cqlsh can be used to access the
cluster, and create objects. The shell is invoked next, and it shows that
Cassandra version 2.0.13 is installed:
 The Cassandra query language next shows a key space called keyspace1
that is being created and used via the CQL shell:
The Gremlin Cassandra script
 The interactive Titan Gremlin shell can be found within the bin directory of
the Titan install, as shown here. Once started, it offers a Gremlin prompt:
 The following script will be entered using the Gremlin shell. The first
section of the script defines the configuration in terms of the storage
(Cassandra), the port number, and the keyspace name that is to be used:
 Next define the generic vertex properties' name and age for the graph to
be created using the Management System. It then commits the
management system changes:
 Now, six vertices are added to the graph. Each one is given a numeric label
to represent its identity. Each vertex is given an age and name value:
 Finally, the graph edges are added to join the vertices together. Each edge
has a relationship value. Once created, the changes are committed to store
them to Titan, and therefore Cassandra:
 This results in a simple person-based graph, shown in the following figure
 This graph can then be tested in Titan via the Gremlin shell using a similar
script to the previous one. Just enter the following script at the gremlin>
prompt, as was shown previously. It uses the same initial six lines to create
the titanGraph configuration, but it then creates a graph traversal variable
g.
 The graph traversal variable can be used to check the graph contents.
Using the ValueMap option, it is possible to search for the graph nodes
called Mike and Flo. They have been successfully found here:
 Using the Cassandra CQL shell, and the Titan keyspace, it can be seen that
a number of Titan tables have been created in Cassandra:
 It can also be seen that the data exists in the edgestore table within
Cassandra:
 This assures us that a Titan graph has been created in the Gremlin shell,
and is stored in Cassandra. Now, I will try to access the data from Spark.
Accessing Titan with Spark
 So far, Titan 0.9.0-M2 has been installed, and the graphs have successfully
been created using Cassandra as backend storage options. These graphs
have been created using Gremlin-based scripts. In this section, a properties
file will be used via a Gremlin script to process a Titan-based graph using
Apache Spark. Cassandra will be used with Titan as backend storage
option.
The following figure, shows the
architecture used in this section.
 Let us examine a properties file that can be used to connect to Cassandra
as a storage backend for Titan. It contains sections for Cassandra, Apache
Spark, and the Hadoop Gremlin configuration. My Cassandra properties
file is called cassandra.properties, and it looks like this
Into Titan code
 The following necessary TinkerPop and Aurelius classes that will be used:
Scala code
Output:
Thank You

More Related Content

What's hot

Graph computation
Graph computationGraph computation
Graph computation
Sigmoid
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Spark Summit
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
MLconf
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Summit
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
Databricks
 
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Databricks
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Spark Summit
 
Cassandra & Spark for IoT
Cassandra & Spark for IoTCassandra & Spark for IoT
Cassandra & Spark for IoT
Matthias Niehoff
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Anton Kirillov
 
Modeling the IoT with TitanDB and Cassandra
Modeling the IoT with TitanDB and CassandraModeling the IoT with TitanDB and Cassandra
Modeling the IoT with TitanDB and Cassandra
twilmes
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Brian O'Neill
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Databricks
 
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Databricks
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
Patrick McFadin
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
Spark Summit
 
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Spark Summit
 
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDeep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Databricks
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
Databricks
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Flink Forward
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit
 

What's hot (20)

Graph computation
Graph computationGraph computation
Graph computation
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
 
Cassandra & Spark for IoT
Cassandra & Spark for IoTCassandra & Spark for IoT
Cassandra & Spark for IoT
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
 
Modeling the IoT with TitanDB and Cassandra
Modeling the IoT with TitanDB and CassandraModeling the IoT with TitanDB and Cassandra
Modeling the IoT with TitanDB and Cassandra
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
 
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
 
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDeep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
 

Viewers also liked

Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosBuilding a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Spark Summit
 
Apache Cassandra for Timeseries- and Graph-Data
Apache Cassandra for Timeseries- and Graph-DataApache Cassandra for Timeseries- and Graph-Data
Apache Cassandra for Timeseries- and Graph-Data
Guido Schmutz
 
Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphX
Andrea Iacono
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
Kenny Bastani
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
Paco Nathan
 
Introductory Keynote at Hadoop Workshop by Ospcon (2014)
Introductory Keynote at Hadoop Workshop by Ospcon (2014)Introductory Keynote at Hadoop Workshop by Ospcon (2014)
Introductory Keynote at Hadoop Workshop by Ospcon (2014)
Andrei Nikolaenko
 
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
Shamim bhuiyan
 
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Spark Summit
 
Apache spark
Apache sparkApache spark
Apache spark
Anton Anokhin
 
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
Антон Шестаков
 
Temporal graph
Temporal graphTemporal graph
Temporal graph
Vinay Sarda
 
WEBSOCKETS AND WEBWORKERS
WEBSOCKETS AND WEBWORKERSWEBSOCKETS AND WEBWORKERS
WEBSOCKETS AND WEBWORKERS
Sigmoid
 
Equation solving-at-scale-using-apache-spark
Equation solving-at-scale-using-apache-sparkEquation solving-at-scale-using-apache-spark
Equation solving-at-scale-using-apache-spark
Sigmoid
 
Real-time Supply Chain Analytics
Real-time Supply Chain AnalyticsReal-time Supply Chain Analytics
Real-time Supply Chain Analytics
Sigmoid
 
Building high scalable distributed framework on apache mesos
Building high scalable distributed framework on apache mesosBuilding high scalable distributed framework on apache mesos
Building high scalable distributed framework on apache mesos
Sigmoid
 
Productionizing spark
Productionizing sparkProductionizing spark
Productionizing spark
Sigmoid
 
Angular js performance improvements
Angular js performance improvementsAngular js performance improvements
Angular js performance improvements
Sigmoid
 
Failsafe Hadoop Infrastructure and the way they work
Failsafe Hadoop Infrastructure and the way they workFailsafe Hadoop Infrastructure and the way they work
Failsafe Hadoop Infrastructure and the way they work
Sigmoid
 
Spark and spark streaming internals
Spark and spark streaming internalsSpark and spark streaming internals
Spark and spark streaming internals
Sigmoid
 
Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)
Sigmoid
 

Viewers also liked (20)

Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosBuilding a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
 
Apache Cassandra for Timeseries- and Graph-Data
Apache Cassandra for Timeseries- and Graph-DataApache Cassandra for Timeseries- and Graph-Data
Apache Cassandra for Timeseries- and Graph-Data
 
Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphX
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Introductory Keynote at Hadoop Workshop by Ospcon (2014)
Introductory Keynote at Hadoop Workshop by Ospcon (2014)Introductory Keynote at Hadoop Workshop by Ospcon (2014)
Introductory Keynote at Hadoop Workshop by Ospcon (2014)
 
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
 
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
 
Apache spark
Apache sparkApache spark
Apache spark
 
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
 
Temporal graph
Temporal graphTemporal graph
Temporal graph
 
WEBSOCKETS AND WEBWORKERS
WEBSOCKETS AND WEBWORKERSWEBSOCKETS AND WEBWORKERS
WEBSOCKETS AND WEBWORKERS
 
Equation solving-at-scale-using-apache-spark
Equation solving-at-scale-using-apache-sparkEquation solving-at-scale-using-apache-spark
Equation solving-at-scale-using-apache-spark
 
Real-time Supply Chain Analytics
Real-time Supply Chain AnalyticsReal-time Supply Chain Analytics
Real-time Supply Chain Analytics
 
Building high scalable distributed framework on apache mesos
Building high scalable distributed framework on apache mesosBuilding high scalable distributed framework on apache mesos
Building high scalable distributed framework on apache mesos
 
Productionizing spark
Productionizing sparkProductionizing spark
Productionizing spark
 
Angular js performance improvements
Angular js performance improvementsAngular js performance improvements
Angular js performance improvements
 
Failsafe Hadoop Infrastructure and the way they work
Failsafe Hadoop Infrastructure and the way they workFailsafe Hadoop Infrastructure and the way they work
Failsafe Hadoop Infrastructure and the way they work
 
Spark and spark streaming internals
Spark and spark streaming internalsSpark and spark streaming internals
Spark and spark streaming internals
 
Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)Sparkstreaming with kafka and h base at scale (1)
Sparkstreaming with kafka and h base at scale (1)
 

Similar to Using spark for timeseries graph analytics

GRAPH 101- GETTING STARTED WITH TITAN AND CASSANDRA
GRAPH 101- GETTING STARTED WITH TITAN AND CASSANDRAGRAPH 101- GETTING STARTED WITH TITAN AND CASSANDRA
GRAPH 101- GETTING STARTED WITH TITAN AND CASSANDRAShaunak Das
 
Storage cassandra
Storage   cassandraStorage   cassandra
Storage cassandraPL dream
 
Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skills
Ravindra kumar
 
Cassandra no sql ecosystem
Cassandra no sql ecosystemCassandra no sql ecosystem
Scaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosqlScaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosql
David Daeschler
 
Nyc summit intro_to_cassandra
Nyc summit intro_to_cassandraNyc summit intro_to_cassandra
Nyc summit intro_to_cassandra
zznate
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
datastack
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceP. Taylor Goetz
 
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormC*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
DataStax
 
Cassandra - A decentralized storage system
Cassandra - A decentralized storage systemCassandra - A decentralized storage system
Cassandra - A decentralized storage system
Arunit Gupta
 
Cassandra advanced part-ll
Cassandra advanced part-llCassandra advanced part-ll
Cassandra advanced part-ll
achudhivi
 
White paper on cassandra
White paper on cassandraWhite paper on cassandra
White paper on cassandra
Navanit Katiyar
 
Cassandra synergy
Cassandra synergyCassandra synergy
Cassandra synergy
niallmilton
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
Eva Tse
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
Amazon Web Services
 
NoSql Database
NoSql DatabaseNoSql Database
NoSql Database
Suresh Parmar
 
Dynamo cassandra
Dynamo cassandraDynamo cassandra
Dynamo cassandra
Wu Liang
 
Time series data monitoring at 99acres.com
Time series data monitoring at 99acres.comTime series data monitoring at 99acres.com
Time series data monitoring at 99acres.com
Ravi Raj
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
DataStax
 

Similar to Using spark for timeseries graph analytics (20)

GRAPH 101- GETTING STARTED WITH TITAN AND CASSANDRA
GRAPH 101- GETTING STARTED WITH TITAN AND CASSANDRAGRAPH 101- GETTING STARTED WITH TITAN AND CASSANDRA
GRAPH 101- GETTING STARTED WITH TITAN AND CASSANDRA
 
Storage cassandra
Storage   cassandraStorage   cassandra
Storage cassandra
 
Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skills
 
Cassandra no sql ecosystem
Cassandra no sql ecosystemCassandra no sql ecosystem
Cassandra no sql ecosystem
 
Scaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosqlScaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosql
 
Nyc summit intro_to_cassandra
Nyc summit intro_to_cassandraNyc summit intro_to_cassandra
Nyc summit intro_to_cassandra
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market Sceince
 
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormC*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
 
Cassandra - A decentralized storage system
Cassandra - A decentralized storage systemCassandra - A decentralized storage system
Cassandra - A decentralized storage system
 
Cassandra advanced part-ll
Cassandra advanced part-llCassandra advanced part-ll
Cassandra advanced part-ll
 
White paper on cassandra
White paper on cassandraWhite paper on cassandra
White paper on cassandra
 
Cassandra synergy
Cassandra synergyCassandra synergy
Cassandra synergy
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
No sql
No sqlNo sql
No sql
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
NoSql Database
NoSql DatabaseNoSql Database
NoSql Database
 
Dynamo cassandra
Dynamo cassandraDynamo cassandra
Dynamo cassandra
 
Time series data monitoring at 99acres.com
Time series data monitoring at 99acres.comTime series data monitoring at 99acres.com
Time series data monitoring at 99acres.com
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
 

More from Sigmoid

Monitoring and tuning Spark applications
Monitoring and tuning Spark applicationsMonitoring and tuning Spark applications
Monitoring and tuning Spark applications
Sigmoid
 
Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1
Sigmoid
 
Real-Time Stock Market Analysis using Spark Streaming
 Real-Time Stock Market Analysis using Spark Streaming Real-Time Stock Market Analysis using Spark Streaming
Real-Time Stock Market Analysis using Spark Streaming
Sigmoid
 
Levelling up in Akka
Levelling up in AkkaLevelling up in Akka
Levelling up in Akka
Sigmoid
 
Expression Problem: Discussing the problems in OOPs language & their solutions
Expression Problem: Discussing the problems in OOPs language & their solutionsExpression Problem: Discussing the problems in OOPs language & their solutions
Expression Problem: Discussing the problems in OOPs language & their solutions
Sigmoid
 
Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0
Sigmoid
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0
Sigmoid
 
Joining Large data at Scale
Joining Large data at ScaleJoining Large data at Scale
Joining Large data at Scale
Sigmoid
 
Building bots to automate common developer tasks - Writing your first smart c...
Building bots to automate common developer tasks - Writing your first smart c...Building bots to automate common developer tasks - Writing your first smart c...
Building bots to automate common developer tasks - Writing your first smart c...
Sigmoid
 
Introduction to apache nutch
Introduction to apache nutchIntroduction to apache nutch
Introduction to apache nutch
Sigmoid
 
Approaches to text analysis
Approaches to text analysisApproaches to text analysis
Approaches to text analysis
Sigmoid
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big data
Sigmoid
 
Time series database by Harshil Ambagade
Time series database by Harshil AmbagadeTime series database by Harshil Ambagade
Time series database by Harshil Ambagade
Sigmoid
 
Dashboard design By Anu Vijayan
Dashboard design By Anu VijayanDashboard design By Anu Vijayan
Dashboard design By Anu Vijayan
Sigmoid
 
Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith
Sigmoid
 
Spark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSpark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. Jyotiska
Sigmoid
 
Tale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark StreamingTale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark Streaming
Sigmoid
 
Composing and scaling data platforms
Composing and scaling data platformsComposing and scaling data platforms
Composing and scaling data platforms
Sigmoid
 
Real Time search using Spark and Elasticsearch
Real Time search using Spark and ElasticsearchReal Time search using Spark and Elasticsearch
Real Time search using Spark and Elasticsearch
Sigmoid
 

More from Sigmoid (19)

Monitoring and tuning Spark applications
Monitoring and tuning Spark applicationsMonitoring and tuning Spark applications
Monitoring and tuning Spark applications
 
Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1
 
Real-Time Stock Market Analysis using Spark Streaming
 Real-Time Stock Market Analysis using Spark Streaming Real-Time Stock Market Analysis using Spark Streaming
Real-Time Stock Market Analysis using Spark Streaming
 
Levelling up in Akka
Levelling up in AkkaLevelling up in Akka
Levelling up in Akka
 
Expression Problem: Discussing the problems in OOPs language & their solutions
Expression Problem: Discussing the problems in OOPs language & their solutionsExpression Problem: Discussing the problems in OOPs language & their solutions
Expression Problem: Discussing the problems in OOPs language & their solutions
 
Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0Spark 1.6 vs Spark 2.0
Spark 1.6 vs Spark 2.0
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0
 
Joining Large data at Scale
Joining Large data at ScaleJoining Large data at Scale
Joining Large data at Scale
 
Building bots to automate common developer tasks - Writing your first smart c...
Building bots to automate common developer tasks - Writing your first smart c...Building bots to automate common developer tasks - Writing your first smart c...
Building bots to automate common developer tasks - Writing your first smart c...
 
Introduction to apache nutch
Introduction to apache nutchIntroduction to apache nutch
Introduction to apache nutch
 
Approaches to text analysis
Approaches to text analysisApproaches to text analysis
Approaches to text analysis
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big data
 
Time series database by Harshil Ambagade
Time series database by Harshil AmbagadeTime series database by Harshil Ambagade
Time series database by Harshil Ambagade
 
Dashboard design By Anu Vijayan
Dashboard design By Anu VijayanDashboard design By Anu Vijayan
Dashboard design By Anu Vijayan
 
Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith
 
Spark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSpark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. Jyotiska
 
Tale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark StreamingTale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark Streaming
 
Composing and scaling data platforms
Composing and scaling data platformsComposing and scaling data platforms
Composing and scaling data platforms
 
Real Time search using Spark and Elasticsearch
Real Time search using Spark and ElasticsearchReal Time search using Spark and Elasticsearch
Real Time search using Spark and Elasticsearch
 

Recently uploaded

standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 

Recently uploaded (20)

standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 

Using spark for timeseries graph analytics

  • 1. Using Spark for Timeseries Graph Analytics - VED MULKALWAR
  • 2. Content  Introduction to sample problem statement  Which Graph database is used and why  Installing Titan  Titan with Cassandra  The Gremlin Cassandra script: A way to store data in cassandra from Titan Gremlin  Accessing Titan with Spark
  • 3. Introducing which kind of time-series problems that can be solved using graph analytics and introducing the problem statement which I have been working on solving.
  • 4. The dynamics of a complex system is usually recorded in the form of time series. In recent years, the visibility graph algorithm and the horizontal visibility graph algorithm have been recently introduced as the mapping between time series and complex networks. Transforming time series into the graphs, the algorithms allows applying the methods of graph theoretical tools for characterizing time series, opening the possibility of building fruitful connections between time series analysis, nonlinear dynamics, and graph theory.
  • 5.  The problem statement which I have been working on:  Our initial goal was finding anomalies in the given timeseries. We started with slicing the data into small meaningfull parts so that the data becomes smaller and contiguous data points having the same property get clubbed together.  Later we created a graph with these parts and tried to find the parts having similar properties. Doing this will single out anomalies which was the goal.
  • 6. Which graph database I used and why i.e. Difference between graphX, neo4j and titan , why we used titan
  • 7. Titan vs graphX  The fundamental difference between Titan and GraphX lies in how they persistence data and how they process data, Titan by default persists data (vertices and edges and properties) to a distributed data store in the form of tables bound to a specific schema. This schema can be stored in Berkley DB tables, Cassandra tables or Hbase tables.  In the case of Titan the graph is stored as vertices in a vertex table and edges in an edge table.  GraphX has no real persistence layer (yet?), yes it can persist to HDFS files, but it cannot persist to a distributed datastore in a common schema like form as in Titan’s case.  A graph only exists in GraphX when it is loaded into memory off raw data and interpreted as Graph RDDs, Titan stores the graph permanently.  The other major difference between the two solutions is how they process graph data.  By default, GraphX solves queries via distributed processing on many nodes in parallel where possible as opposed to Titan processing pipelines on a single node. Titan can also take advantage of parallel processing via Faunus/HDFS if necessary.
  • 8. Neo4j vs titan  The primary difference between Titan and Neo4j is scalability: Titan can distribute the graph across multiple machines (using either Cassandra or Hbase as the storage backend) which does three things:  1) It allows Titan to store really, really large graphs by scaling out  2) It allows Titan to handle massive write loads  3) No single point of failure for very high availability. While Neo4j's HA setup gives you redundancy through replication, death of the master in such a setup leads to temporary service interruption while a new master is elected.  This allows Titan to scale to large deployments with thousands of concurrent users as demonstrated in the benchmark:  http://thinkaurelius.com/2012/08/06/titan-provides-real-time-big-graph-data/  Neo4j has been around much longer and is therefore more "mature" in some regards:  1) Integrated with an external lucene index gives it more sophisticated search capabilities  2) More integration points into other programming languages and development frameworks (e.g. spring)  3) Has been used by more people  Titan is a native blueprints implementation and benefits from all the features that come with the Tinkerpop stack. Titan does not support cypher but only the gremlin standard.
  • 9. Intro to Titan graph db and how to use it with cassandra & spark
  • 10. Installing Titan  Downloaded the latest prebuilt version (0.9.0-M2) of Titan at s3.thinkaurelius.com/downloads/titan/titan-0.9.0-M2-hadoop1.zip.  Carry out the following steps to ensure that Titan is installed on each node in the cluster:
  • 11.  Now, use the Linux su (switch user) command to change to the root account, and move the install to the /usr/local/ location. Change the file and group membership of the install to the hadoop user, and create a symbolic link called titan so that the current Titan release can be referred to as the simplified path called /usr/local/titan:
  • 12. Titan with Cassandra  In this section, the Cassandra NoSQL database will be used as a storage mechanism for Titan. Although it does not use Hadoop, it is a large-scale, cluster-based database in its own right, and can scale to very large cluster sizes. A graph will be created, and stored in Cassandra using the Titan Gremlin shell. It will then be checked using Gremlin, and the stored data will be checked in Cassandra. The raw Titan Cassandra graph-based data will then be accessed from Spark. The first step then will be to install Cassandra on each node in the cluster.
  • 13.  Install Cassandra on all the nodes  Set up the Cassandra configuration under /etc/cassandra/conf by altering the cassandra.yaml file:  Install Cassandra on all the nodes  Set up the Cassandra configuration under /etc/cassandra/conf by altering the cassandra.yaml file:
  • 14.  Log files can be found under /var/log/cassandra, and the data is stored under /var/lib/cassandra. The nodetool command can be used on any Cassandra node to check the status of the Cassandra cluster:  The Cassandra CQL shell command called cqlsh can be used to access the cluster, and create objects. The shell is invoked next, and it shows that Cassandra version 2.0.13 is installed:
  • 15.  The Cassandra query language next shows a key space called keyspace1 that is being created and used via the CQL shell:
  • 16. The Gremlin Cassandra script  The interactive Titan Gremlin shell can be found within the bin directory of the Titan install, as shown here. Once started, it offers a Gremlin prompt:
  • 17.  The following script will be entered using the Gremlin shell. The first section of the script defines the configuration in terms of the storage (Cassandra), the port number, and the keyspace name that is to be used:
  • 18.  Next define the generic vertex properties' name and age for the graph to be created using the Management System. It then commits the management system changes:
  • 19.  Now, six vertices are added to the graph. Each one is given a numeric label to represent its identity. Each vertex is given an age and name value:
  • 20.  Finally, the graph edges are added to join the vertices together. Each edge has a relationship value. Once created, the changes are committed to store them to Titan, and therefore Cassandra:
  • 21.  This results in a simple person-based graph, shown in the following figure
  • 22.  This graph can then be tested in Titan via the Gremlin shell using a similar script to the previous one. Just enter the following script at the gremlin> prompt, as was shown previously. It uses the same initial six lines to create the titanGraph configuration, but it then creates a graph traversal variable g.  The graph traversal variable can be used to check the graph contents. Using the ValueMap option, it is possible to search for the graph nodes called Mike and Flo. They have been successfully found here:
  • 23.  Using the Cassandra CQL shell, and the Titan keyspace, it can be seen that a number of Titan tables have been created in Cassandra:  It can also be seen that the data exists in the edgestore table within Cassandra:  This assures us that a Titan graph has been created in the Gremlin shell, and is stored in Cassandra. Now, I will try to access the data from Spark.
  • 24. Accessing Titan with Spark  So far, Titan 0.9.0-M2 has been installed, and the graphs have successfully been created using Cassandra as backend storage options. These graphs have been created using Gremlin-based scripts. In this section, a properties file will be used via a Gremlin script to process a Titan-based graph using Apache Spark. Cassandra will be used with Titan as backend storage option.
  • 25. The following figure, shows the architecture used in this section.
  • 26.  Let us examine a properties file that can be used to connect to Cassandra as a storage backend for Titan. It contains sections for Cassandra, Apache Spark, and the Hadoop Gremlin configuration. My Cassandra properties file is called cassandra.properties, and it looks like this
  • 27. Into Titan code  The following necessary TinkerPop and Aurelius classes that will be used: