Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra (Piotr Kolaczkowski)
We present the basic functionality of the official DataStax spark-cassandra connector: how to load Cassandra tables as Spark RDDs and how to save Spark RDDs to Cassandra.
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ... (randyguck)
Slides from my Strata+Hadoop 2015 Conference session titled: One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP. This talk describes the Doradus OLAP query/storage engine, which is an open source module that runs on top of the Cassandra NoSQL DB. Among the benefits of this service is fast data loading, a rich query language with full text and graph query features, and very dense data storage. See the Notes section for details on each slide.
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ... (StampedeCon)
Learn how to model beyond traditional direct access in Apache Cassandra. Utilizing the DataStax platform to harness the power of Spark and Solr to perform search, analytics, and complex operations in place on your Cassandra data!
Time series with Apache Cassandra - Long version (Patrick McFadin)
Apache Cassandra has proven to be one of the best solutions for storing and retrieving time series data. This talk will give you an overview of the many ways you can be successful. We will discuss how the storage model of Cassandra is well suited for this pattern and go over examples of how best to build data models.
Hadoop + Cassandra: Fast queries on data lakes, and Wikipedia search tutorial (Natalino Busa)
Today’s services rely on massive amounts of data to be processed, but are required at the same time to be fast and responsive. Building fast services on batch-oriented big data frameworks is definitely a challenge. At ING, we have worked on a stack that can alleviate this problem: we materialize data models by map-reducing Hadoop queries from Hive into Cassandra. Instead of sinking the results back to HDFS, we propagate them into Cassandra key-value tables. Those Cassandra tables are finally exposed via an HTTP API front-end service.
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher)
This presentation describes how you can use Spark as an ETL tool to get data from a relational database into Cassandra. I go through the concept in general and then talk about some specific issues you might run into and how to fix them.
Apache Cassandra and Python for Analyzing Streaming Big Data (prajods)
This presentation was made at the Open Source India Conference, Nov 2015. It explains how Apache Spark, PySpark, Cassandra, Node.js and D3.js can be used to create a platform for visualizing and analyzing streaming big data.
Apache Cassandra is a leading open-source distributed database capable of amazing feats of scale, but its data model requires a bit of planning for it to perform well. Of course, the nature of ad-hoc data exploration and analysis requires that we be able to ask questions we hadn’t planned on asking—and get an answer fast. Enter Apache Spark.
Spark is a distributed computation framework optimized to work in-memory, and heavily influenced by concepts from functional programming languages. It’s exactly what a Cassandra cluster needs to deliver real-time, ad-hoc querying of operational data at scale.
In this talk, we’ll explore Spark and see how it works together with Cassandra to deliver a powerful open-source big data analytic solution.
From Postgres to Cassandra (Rimas Silkaitis, Heroku) | C* Summit 2016 (DataStax)
Most web applications start out with a Postgres database, and it serves the application very well for an extended period of time. Depending on the type of application, the data model will have a table that tracks some kind of state for either objects in the system or the users of the application. Common names for this table include logs, messages, or events. The growth in the number of rows in this table is not linear as traffic to the app increases; it's typically exponential.
Over time, the state table will increasingly become the bulk of the data volume in Postgres, think terabytes, and become increasingly hard to query. This use case can be characterized as the one-big-table problem. In this situation, it makes sense to move that table out of Postgres and into Cassandra. This talk will walk through the conceptual differences between the two systems, a bit of data modeling, as well as advice on making the conversion.
About the Speaker
Rimas Silkaitis Product Manager, Heroku
Rimas currently runs Product for Heroku Postgres and Heroku Redis but the common thread throughout his career is data. From data analysis, building data warehouses and ultimately building data products, he's held various positions that have allowed him to see the challenges of working with data at all levels of an organization. This experience spans the smallest of startups to the biggest enterprises.
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016 (StampedeCon)
Have you ever wanted to analyze sensor data that arrives every second from across the world? Or maybe you want to analyze intra-day trading prices of millions of financial instruments? Or take all the page views from Wikipedia and compare the hourly statistics? To do this or any similar analysis, you will need to analyze large sequences of measurements over time. And what better way to do this than with Apache Spark? In this session we will dig into how to consume data, analyze it with Spark, and store the results in Apache Cassandra.
Most people hear "Spark" and think "Analytics". But the ability of Spark to efficiently distribute and manage a full-table traversal while functionally transforming the data makes it perfectly suited to executing "Big Data" maintenance jobs.
Video of the talk: https://www.youtube.com/watch?v=gd4Jqtyo7mM
Apache Spark is a next-generation engine for large-scale data processing built with Scala. This talk will first show how Spark takes advantage of Scala's functional idioms to produce an expressive and intuitive API. You will learn about the design of Spark RDDs and how that abstraction enables the Spark execution engine to be extended to support a wide variety of use cases (Spark SQL, Spark Streaming, MLlib and GraphX). The Spark source will be referenced to illustrate how these concepts are implemented in Scala.
http://www.meetup.com/Scala-Bay/events/209740892/
5 Ways to Use Spark to Enrich your Cassandra Environment (Jim Hatcher)
Apache Cassandra is a powerful system for supporting large-scale, low-latency data systems, but it has some tradeoffs. Apache Spark can help fill those gaps, and this presentation will show you how.
Jump Start with Apache Spark 2.0 on Databricks (Databricks)
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Apache Spark part of Eindhoven Java Meetup (Patrick Deenen)
The presentation of Apache Spark by Mylène Reiners during our first Eindhoven Java Meetup (see http://www.opencirclesolutions.nl/eindhoven-java-meetup/).
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S... (Helena Edelson)
Whatever meaning we are searching for in our vast amounts of data, whether we are in science, finance, technology, energy, or health care, we all share the same questions: how do we achieve it, and which technologies best support the requirements? This talk is about leveraging fast access to historical data together with real-time streaming data for predictive modeling in a lambda architecture with Spark Streaming, Kafka, Cassandra, Akka and Scala: efficient stream computation, composable data pipelines, data locality, the Cassandra data model and low latency, Kafka producers and HTTP endpoints as Akka actors...
A Tale of Two APIs: Using Spark Streaming In Production (Lightbend)
Fast Data architectures are the answer to the increasing need for the enterprise to process and analyze continuous streams of data to accelerate decision making and become reactive to the particular characteristics of their market.
Apache Spark is a popular framework for data analytics. Its capabilities include SQL-based analytics, dataflow processing, graph analytics and a rich library of built-in machine learning algorithms. These libraries can be combined to address a wide range of requirements for large-scale data analytics.
To address Fast Data flows, Spark offers two APIs: the mature Spark Streaming and its younger sibling, Structured Streaming. In this talk, we are going to introduce both APIs. Using practical examples, you will get a taste of each one and obtain guidance on how to choose the right one for your application.
Unlocking Your Hadoop Data with Apache Spark and CDH5 (SAP Concur)
The Spark/Mesos Seattle Meetup group shares the latest presentation from their recent meetup event, showcasing real-world implementations of working with Spark within the context of your Big Data infrastructure.
Sessions are demo-heavy and slide-light, focusing on getting your development environment up and running, including configuration issues, SparkSQL vs. Hive, etc.
To learn more about the Seattle meetup: http://www.meetup.com/Seattle-Spark-Meetup/members/21698691/
SparkR - Play Spark Using R (20160909 HadoopCon) (wqchen)
1. Introduction to SparkR
2. Demo
Starting to use SparkR
DataFrames: dplyr style, SQL style
RDDs vs. DataFrames
SparkR on MLlib: GLM, K-means
3. Use Case
Median: approxQuantile()
ID Match: dplyr style, SQL style, SparkR function
SparkR + Shiny
4. The Future of SparkR
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an... (Anton Kirillov)
This talk is about architecture designs for data processing platforms based on the SMACK stack, which stands for Spark, Mesos, Akka, Cassandra and Kafka. The main topics of the talk are:
- SMACK stack overview
- storage layer layout
- fixing NoSQL limitations (joins and group by)
- cluster resource management and dynamic allocation
- reliable scheduling and execution at scale
- different options for getting the data into your system
- preparing for failures with proper backup and patching strategies
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production (Chetan Khatri)
Scala Toronto July 2019 event at 500px.
Pure Functional API Integration
Apache Spark Internals tuning
Performance tuning
Query execution plan optimisation
Cats Effect for switching the execution model at runtime.
Discovery / experience with Monix, Scala Future.
Author: Stefan Papp, Data Architect at “The unbelievable Machine Company”. An overview of big data processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. The following questions are addressed:
• What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them?
• When to use batch processing and when to use stream processing?
• What is a Lambda-Architecture and a Kappa Architecture?
• What are the best practices for your project?
Apache Spark for Library Developers with Erik Erlandson and William Benton (Databricks)
As a developer, data engineer, or data scientist, you’ve seen how Apache Spark is expressive enough to let you solve problems elegantly and efficient enough to let you scale out to handle more data. However, if you’re solving the same problems again and again, you probably want to capture and distribute your solutions so that you can focus on new problems and so other people can reuse and remix them: you want to develop a library that extends Spark.
You faced a learning curve when you first started using Spark, and you’ll face a different learning curve as you start to develop reusable abstractions atop Spark. In this talk, two experienced Spark library developers will give you the background and context you’ll need to turn your code into a library that you can share with the world. We’ll cover: Issues to consider when developing parallel algorithms with Spark, Designing generic, robust functions that operate on data frames and datasets, Extending data frames with user-defined functions (UDFs) and user-defined aggregates (UDAFs), Best practices around caching and broadcasting, and why these are especially important for library developers, Integrating with ML pipelines, Exposing key functionality in both Python and Scala, and How to test, build, and publish your library for the community.
We’ll back up our advice with concrete examples from real packages built atop Spark. You’ll leave this talk informed and inspired to take your Spark proficiency to the next level and develop and publish an awesome library of your own.
2. @chbatey
Scalability & Performance
• Scalability
- No single point of failure
- No special nodes that become the bottleneck
- Work/data can be re-distributed
• Operational Performance, i.e. single-digit ms
- Single node for query
- Single disk seek per query
4. @chbatey
But but…
• Sometimes you don’t need answers in milliseconds
• Data models done wrong - how do I fix it?
• New requirements for old data?
• Ad-hoc operational queries
• Managers always want counts / maxes
5. @chbatey
Apache Spark
• 10x faster on disk, 100x faster in memory than Hadoop MR
• Works out of the box on EMR
• Fault Tolerant Distributed Datasets
• Batch, iterative and streaming analysis
• In Memory Storage and Disk
• Integrates with Most File and Storage Options
9. @chbatey
RDD Operations
• Transformations - Similar to Scala collections API
• Produce new RDDs
• filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
• Actions
• Require materialization of the records to generate a value
• collect: Array[T], count, fold, reduce..
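Transformations are lazy and actions are eager; a minimal sketch (illustrative, not from the original deck) makes the split concrete:
val nums = sc.parallelize(1 to 10)   // RDD[Int]
val evens = nums.filter(_ % 2 == 0)  // transformation: builds lineage, nothing runs
val scaled = evens.map(_ * 10)       // transformation: still lazy
val total = scaled.reduce(_ + _)     // action: triggers the job; total == 300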
10. @chbatey
Word count
val file: RDD[String] = sc.textFile("hdfs://...")
val counts: RDD[(String, Int)] = file
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
13. @chbatey
Partitioning
• Large data sets from S3, HDFS, Cassandra etc
• Split into small chunks called partitions
• Each operation is done locally on a partition before combining with other partitions
• So partitioning is important for data locality
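A quick way to see partitioning in action (a sketch; the path and partition count are illustrative, not from the deck):
val lines = sc.textFile("hdfs://...", 8)  // request at least 8 partitions
println(lines.partitions.length)          // how many chunks the data was split into
// mapPartitions runs once per partition, i.e. once per locally-held chunk
val perPartitionCounts = lines.mapPartitions(it => Iterator(it.size))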
16. @chbatey
Spark Cassandra Connector
• Loads data from Cassandra to Spark
• Writes data from Spark to Cassandra
• Implicit Type Conversions and Object Mapping
• Implemented in Scala (offers a Java API)
• Open Source
• Exposes Cassandra Tables as Spark RDDs + Spark DStreams
22. @chbatey
Boiler plate
import org.apache.spark._
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._
import com.datastax.spark.connector.rdd._

object BasicCassandraInteraction extends App {
  // Cassandra host
  val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
  // Spark master, e.g. spark://host:port (here: local[4])
  val sc = new SparkContext("local[4]", "AppName", conf)
  // cool stuff
}
23. @chbatey
Executing code against the driver
CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH REPLICATION = " +
    "{'class': 'SimpleStrategy', 'replication_factor': 1 }")
  session.execute("CREATE TABLE IF NOT EXISTS test.kv(key text PRIMARY KEY, value int)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('chris', 10)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('dan', 1)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('charlieS', 2)")
}
24. @chbatey
Reading data from Cassandra
CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE TABLE IF NOT EXISTS test.kv(key text PRIMARY KEY, value int)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('chris', 10)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('dan', 1)")
  session.execute("INSERT INTO test.kv(key, value) VALUES ('charlieS', 2)")
}

val rdd: CassandraRDD[CassandraRow] = sc.cassandraTable("test", "kv")
println(rdd.max()(new Ordering[CassandraRow] {
  override def compare(x: CassandraRow, y: CassandraRow): Int =
    x.getInt("value").compare(y.getInt("value"))
}))
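The custom Ordering can be avoided by projecting to the value column first, and the connector can push column projections down to Cassandra; a hedged sketch against the same test.kv table (not from the original deck):
// Map to the Int column, then use the numeric max
val maxValue = sc.cassandraTable("test", "kv").map(_.getInt("value")).max()
// select fetches only the named columns instead of whole rows
val projected = sc.cassandraTable("test", "kv").select("key", "value")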
25. @chbatey
Word Count + Save to Cassandra
val textFile: RDD[String] = sc.textFile("Spark-Readme.md")
val words: RDD[String] = textFile.flatMap(line => line.split("\\s+"))
val wordAndCount: RDD[(String, Int)] = words.map((_, 1))
val wordCounts: RDD[(String, Int)] = wordAndCount.reduceByKey(_ + _)
println(wordCounts.first())
wordCounts.saveToCassandra("test", "words", SomeColumns("word", "count"))
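saveToCassandra writes into an existing table, and the deck doesn't show the schema of test.words; a minimal schema matching the SomeColumns above might look like this (hypothetical, in the style of the earlier slides):
CassandraConnector(conf).withSessionDo { session =>
  // Assumed target table for the save above
  session.execute("CREATE TABLE IF NOT EXISTS test.words(word text PRIMARY KEY, count int)")
}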
26. @chbatey
Migrating from an RDBMS
create table store(
  store_name varchar(32) primary key,
  location varchar(32),
  store_type varchar(10));

create table staff(
  name varchar(32) primary key,
  favourite_colour varchar(32),
  job_title varchar(32));

create table customer_events(
  id MEDIUMINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  customer varchar(12),
  time timestamp,
  event_type varchar(16),
  store varchar(32),
  staff varchar(32),
  foreign key fk_store(store) references store(store_name),
  foreign key fk_staff(staff) references staff(name));
27. @chbatey
Denormalised table
CREATE TABLE IF NOT EXISTS customer_events(
customer_id text,
time timestamp,
id uuid,
event_type text,
store_name text,
store_type text,
store_location text,
staff_name text,
staff_title text,
PRIMARY KEY ((customer_id), time, id))
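The payoff of this denormalisation is that the common read becomes a single-partition query, with events clustered by time; an illustrative read (not from the deck, table name as used on the next slide):
CassandraConnector(conf).withSessionDo { session =>
  // All events for one customer, ordered by the clustering column time
  val events = session.execute("SELECT * FROM test.customer_events WHERE customer_id = 'chris'")
}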
28. @chbatey
Migration time
val customerEvents = new JdbcRDD(sc, () => { DriverManager.getConnection(mysqlJdbcString) },
  "select * from customer_events ce, staff, store where ce.store = store.store_name and ce.staff = staff.name " +
    "and ce.id >= ? and ce.id <= ?", 0, 1000, 6,
  (r: ResultSet) => {
    (r.getString("customer"),
     r.getTimestamp("time"),
     UUID.randomUUID(),
     r.getString("event_type"),
     r.getString("store_name"),
     r.getString("location"),
     r.getString("store_type"),
     r.getString("staff"),
     r.getString("job_title"))
  })

customerEvents.saveToCassandra("test", "customer_events",
  SomeColumns("customer_id", "time", "id", "event_type", "store_name", "store_type", "store_location", "staff_name", "staff_title"))
34. @chbatey
Now now…
val cc = new CassandraSQLContext(sc)
cc.setKeyspace("test")
val rdd: SchemaRDD = cc.sql("SELECT store_name, event_type, count(store_name) from customer_events " +
  "GROUP BY store_name, event_type")
rdd.collect().foreach(println)
[SportsApp,WATCH_STREAM,1]
[SportsApp,LOGOUT,1]
[SportsApp,LOGIN,1]
[ChrisBatey.com,WATCH_MOVIE,1]
[ChrisBatey.com,LOGOUT,1]
[ChrisBatey.com,BUY_MOVIE,1]
[SportsApp,WATCH_MOVIE,2]
37. @chbatey
Network word count
CassandraConnector(conf).withSessionDo { session =>
  session.execute("CREATE TABLE IF NOT EXISTS test.network_word_count(word text PRIMARY KEY, number int)")
  session.execute("CREATE TABLE IF NOT EXISTS test.network_word_count_raw(time timeuuid PRIMARY KEY, raw text)")
}

val ssc = new StreamingContext(conf, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
lines.map((UUIDs.timeBased(), _)).saveToCassandra("test", "network_word_count_raw")

val words = lines.flatMap(_.split("\\s+"))
val countOfOne = words.map((_, 1))
val reduced = countOfOne.reduceByKey(_ + _)
reduced.saveToCassandra("test", "network_word_count")
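One thing the slide leaves implicit: a StreamingContext does nothing until it is started, so the job above also needs the standard Spark Streaming calls:
ssc.start()            // begin consuming from the socket
ssc.awaitTermination() // keep the driver alive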
39. @chbatey
Stream processing customer events
val joeBuy = write(CustomerEvent("joe", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "BUY"))
val joeBuy2 = write(CustomerEvent("joe", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "BUY"))
val joeSell = write(CustomerEvent("joe", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "SELL"))
val chrisBuy = write(CustomerEvent("chris", "chris", "WEB", "NEW_CUSTOMER", "lots of fancy content", event_type = "BUY"))
CassandraConnector(conf).withSessionDo { session =>
session.execute("CREATE TABLE IF NOT EXISTS streaming.customer_events_by_type ( nameAndType text primary key, number int)")
session.execute("CREATE TABLE IF NOT EXISTS streaming.customer_events ( " +
"customer_id text, " +
"staff_id text, " +
"store_type text, " +
"group text static, " +
"content text, " +
"time timeuuid, " +
"event_type text, " +
"PRIMARY KEY ((customer_id), time) )")
}
40. @chbatey
Save + Process
val rawEvents: ReceiverInputDStream[(String, String)] =
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafka.kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)
val events: DStream[CustomerEvent] = rawEvents.map({ case (k, v) =>
  parse(v).extract[CustomerEvent]
})
events.saveToCassandra("streaming", "customer_events")

val eventsByCustomerAndType = events
  .map(event => (s"${event.customer_id}-${event.event_type}", 1))
  .reduceByKey(_ + _)
eventsByCustomerAndType.saveToCassandra("streaming", "customer_events_by_type")
41. @chbatey
Summary
• Cassandra is an operational database
• Spark gives us the flexibility to do slower things
- Schema migrations
- Ad-hoc queries
- Report generation
• Spark Streaming + Cassandra allow us to build online analytical platforms
42. @chbatey
Thanks for listening
• Follow me on twitter @chbatey
• Cassandra + fault tolerance posts aplenty:
• http://christopher-batey.blogspot.co.uk/
• Github for all examples:
• https://github.com/chbatey/spark-sandbox
• Cassandra resources: http://planetcassandra.org/