Escape From Hadoop: Spark One Liners for C* Ops (Russell Spitzer)
Apache Cassandra and Spark, when combined, can provide powerful OLTP and OLAP functionality for your data. We’ll walk through the basics of both platforms before diving into applications that combine the two. Joins, changing a partition key, and importing data are usually difficult in Cassandra, but we’ll see how to do these and other operations as a set of simple Spark Shell one-liners!
Spark and Cassandra with the DataStax Spark Cassandra Connector
How it works and how to use it!
Missed Spark Summit but still want to see some slides?
This slide deck is for you!
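As a taste of the format, here is a minimal sketch of one such one-liner, assuming the spark-cassandra-connector is on the shell classpath and a hypothetical pair of tables (ks.users_by_id and ks.users_by_email) that share columns but lead their primary keys with different columns:

```scala
// Spark Shell sketch (hypothetical schema): "change" a partition key by
// copying one table into another whose PRIMARY KEY starts differently.
import com.datastax.spark.connector._

sc.cassandraTable("ks", "users_by_id").saveToCassandra("ks", "users_by_email")
```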
Cassandra and Spark: Optimizing for Data Locality (Russell Spitzer)
This document discusses how the Spark Cassandra Connector optimizes for data locality when performing analytics on Cassandra data using Spark. It does this by using the partition keys and token ranges to create Spark partitions that correspond to the data distribution across the Cassandra nodes, allowing work to be done locally to each data node without moving data across the network. This improves performance and avoids the costs of data shuffling.
This document provides an agenda and overview of Big Data Analytics using Spark and Cassandra. It discusses Cassandra as a distributed database and Spark as a data processing framework. It covers connecting Spark and Cassandra, reading and writing Cassandra tables as Spark RDDs, and using Spark SQL, Spark Streaming, and Spark MLLib with Cassandra data. Key capabilities of each technology are highlighted such as Cassandra's tunable consistency and Spark's fault tolerance through RDD lineage. Examples demonstrate basic operations like filtering, aggregating, and joining Cassandra data with Spark.
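For a flavor of those basic operations, here is a minimal sketch against a hypothetical ks.purchases table (columns user_id and amount), assuming the connector is loaded:

```scala
import com.datastax.spark.connector._

// Filter rows on the Spark side, then aggregate totals per user.
val totals = sc.cassandraTable("ks", "purchases")
  .filter(_.getDouble("amount") > 100.0)
  .map(row => (row.getString("user_id"), row.getDouble("amount")))
  .reduceByKey(_ + _)
totals.take(10).foreach(println)
```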
Spark Cassandra Connector: Past, Present, and Future (Russell Spitzer)
The Spark Cassandra Connector allows integration between Spark and Cassandra for distributed analytics. Previously, integrating Hadoop and Cassandra required complex code and configuration. The connector maps Cassandra data distributed across nodes based on token ranges to Spark partitions, enabling analytics on large Cassandra datasets using Spark's APIs. This provides an easier method for tasks like generating reports, analytics, and ETL compared to previous options.
Spark Cassandra Connector: API, Best Practices and Use Cases (Duyhai Doan)
- The document discusses the Spark/Cassandra connector API, best practices, and use cases.
- It describes the connector architecture including support for Spark Core, SQL, and Streaming APIs. Data is read from and written to Cassandra tables mapped as RDDs.
- Best practices around data locality, failure handling, and cross-region/cluster operations are covered. Locality is important for performance.
- Use cases include data cleaning, schema migration, and analytics like joins and aggregation. The connector allows processing and analytics on Cassandra data with Spark.
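A minimal sketch of the data cleaning / schema migration use case from the list above, with hypothetical table names; both tables must already exist, and only rows passing the validity check are copied:

```scala
import com.datastax.spark.connector._

// Read every row of the old table, drop malformed rows, write to the new one.
sc.cassandraTable("ks", "events_v1")
  .filter(_.get[Option[String]]("user_id").isDefined)  // skip rows missing a user_id
  .saveToCassandra("ks", "events_v2")
```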
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer)
This document summarizes a presentation about tuning the Spark Cassandra Connector for optimal performance. It discusses various write tuning techniques like batching writes by partition key, sorting data within partitions, and adjusting batch sizes and concurrency levels. It also covers read tuning, noting the relationship between Spark and Cassandra partitions and how to avoid out of memory errors by changing the number of partitions. Maximizing read speed requires tuning Cassandra's paging behavior. The presentation encourages contributions to the open source Spark Cassandra Connector project.
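As a rough sketch of where those knobs live: the connector reads its tuning from the SparkConf. The property names below are from 1.6-era connector docs and have shifted slightly between versions, so treat them as representative rather than exact:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cassandra.output.batch.size.rows", "auto")  // rows per write batch
  .set("spark.cassandra.output.concurrent.writes", "5")   // batches in flight
  .set("spark.cassandra.input.split.sizeInMB", "64")      // size of a Spark partition
  .set("spark.cassandra.input.fetch.sizeInRows", "1000")  // Cassandra paging
val sc = new SparkContext(conf)
```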
This document provides an agenda for a presentation on Big Data Analytics with Cassandra, Spark, and MLLib. The presentation covers Spark basics, using Spark with Cassandra, Spark Streaming, Spark SQL, and Spark MLLib. It also includes examples of querying and analyzing Cassandra data with Spark and Spark SQL, and machine learning with Spark MLLib.
The document discusses how the Spark Cassandra Connector works. It explains that the connector uses information about how data is partitioned in Cassandra nodes to generate Spark partitions that correspond to the token ranges in Cassandra. This allows data to be read from Cassandra in parallel across the Spark partitions. The connector also supports automatically pushing down filter predicates to the Cassandra database to reduce the amount of data read.
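A small sketch of that pushdown in action, using a hypothetical ks.sensor_data table: select() and where() are turned into CQL executed by Cassandra, so only the matching columns and rows cross the network.

```scala
import com.datastax.spark.connector._

val recent = sc.cassandraTable("ks", "sensor_data")
  .select("sensor_id", "time", "value")       // column pruning in Cassandra
  .where("time > ?", "2016-01-01 00:00:00")   // predicate pushed down as CQL
println(recent.count())
```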
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra (Piotr Kolaczkowski)
The document discusses using Apache Spark and Apache Cassandra together for fast data analysis as an alternative to Hadoop. It provides examples of basic Spark operations on Cassandra tables like counting rows, filtering, joining with external data sources, and importing/exporting data. The document argues that Spark on Cassandra provides a simpler distributed processing framework compared to Hadoop.
- Apache Cassandra is a linearly scalable and fault tolerant NoSQL database that increases throughput linearly with additional machines
- It is an AP system that is eventually consistent according to the CAP theorem, sacrificing consistency in favor of availability and partition tolerance
- Cassandra uses replication and consistency levels to control fault tolerance at the server and client levels respectively
- Its data model and use of SSTables allow for fast writes and queries along clustering columns
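Picking up the consistency-level point above on the client side: a sketch of how a Spark application would choose its read and write consistency through the connector (property names as in the open source connector docs):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cassandra.input.consistency.level", "LOCAL_QUORUM")   // reads
  .set("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")  // writes
```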
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentric AG) (DataStax)
We built an application based on the principles of CQRS and Event Sourcing using Cassandra and Spark. During the project we encountered a number of challenges and problems with Cassandra and the Spark Connector.
In this talk we want to outline a few of those problems and our actions to solve them. While some problems are specific to CQRS and Event Sourcing applications, most of them are use-case independent.
About the Speakers
Matthias Niehoff, IT Consultant, codecentric AG
Matthias works as an IT consultant at codecentric AG in Germany. His focus is on big data and streaming applications with Apache Cassandra and Apache Spark, though he keeps track of other tools in the big data space as well. He shares his experiences at conferences, meetups, and user groups.
Stephan Kepser, Senior IT Consultant and Data Architect, codecentric AG
Dr. Stephan Kepser is an expert on cloud computing and big data. He has written a number of journal articles and blog posts in both fields. His interests range from legal questions to the architecture and design of cloud computing and big data systems, down to the technical details of NoSQL databases.
Lightning fast analytics with Spark and Cassandra (Rustam Aliyev)
Spark is an open-source cluster computing framework that provides a fast and general engine for large-scale data processing. It is up to 100x faster than Hadoop for certain applications. The Cassandra Spark driver exposes Cassandra tables as resilient distributed datasets (RDDs) in Spark, enabling analytics like joins, aggregations, and machine learning on Cassandra data. It maps Cassandra data types to Scala types and rows to case classes, allowing data to be queried, transformed, and saved to and from Cassandra using Spark's APIs, with optimizations for performance and fault tolerance.
Spark streaming can be used for near-real-time data analysis of data streams. It processes data in micro-batches and provides windowing operations. Stateful operations like updateStateByKey allow tracking state across batches. Data can be obtained from sources like Kafka, Flume, HDFS and processed using transformations before being saved to destinations like Cassandra. Fault tolerance is provided by replicating batches, but some data may be lost depending on how receivers collect data.
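A minimal streaming sketch along those lines: count words per micro-batch from a socket and append the counts to a hypothetical ks.wordcount(word, count) table, assuming an existing SparkContext sc with the connector loaded.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector.streaming._

val ssc = new StreamingContext(sc, Seconds(5))
ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveToCassandra("ks", "wordcount")   // written once per micro-batch
ssc.start()
ssc.awaitTermination()
```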
Real time data pipeline with Spark Streaming and Cassandra with Mesos (Rahul Kumar)
This document discusses building real-time data pipelines with Apache Spark Streaming and Cassandra using Mesos. It provides an overview of data management challenges and introduces Cassandra and Spark concepts. It then describes how to use the Spark Cassandra Connector to expose Cassandra tables as Spark RDDs and write back to Cassandra. It recommends designing scalable pipelines by identifying bottlenecks, parsing data efficiently, modeling data properly, and using compression.
Apache Cassandra and Spark: You got the lighter, let's start the fire (Patrick McFadin)
An introduction to analyzing Apache Cassandra data using Apache Spark. This includes data models, operations topics, and the internals of how Spark interfaces with Cassandra.
How do you rapidly derive complex insights on top of really big data sets in Cassandra? This session draws upon Evan's experience building a distributed, interactive, columnar query engine on top of Cassandra and Spark. We will start by surveying the existing query landscape of Cassandra and discuss ways to integrate Cassandra and Spark. We will dive into the design and architecture of a fast, column-oriented query architecture for Spark, and why columnar stores are so advantageous for OLAP workloads. I will present a schema for Parquet-like storage of analytical datasets on Cassandra. Find out why Cassandra and Spark are the perfect match for enabling fast, scalable, complex querying and storage of big analytical data.
Real time data processing with Spark & Cassandra @ NoSQLMatters 2015 Paris (Duyhai Doan)
This document provides an overview of Spark and its integration with Cassandra for real-time data processing. It begins with introductions of the speaker and Datastax. It then discusses what Spark and Cassandra are, including their architectures and key characteristics like Spark being fast, easy to use, and supporting multiple languages. The document demonstrates basic Spark code and how RDDs work. It covers the Spark and Cassandra connectors and how they provide locality-aware joins. It also discusses use cases and deployment options. Finally, it considers future improvements like leveraging Solr for local filtering to improve data locality during joins.
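A sketch of such a locality-aware join with hypothetical names: repartition a generic RDD so each chunk lands on a node owning the matching Cassandra replicas, then join against the table without shuffling the table itself.

```scala
import com.datastax.spark.connector._

case class UserKey(user_id: String)

// Move each key to a node holding its replica, then do point lookups per key
// (hypothetical ks.users table keyed by user_id).
val keys = sc.parallelize(Seq(UserKey("alice"), UserKey("bob")))
val joined = keys
  .repartitionByCassandraReplica("ks", "users")
  .joinWithCassandraTable("ks", "users")
joined.collect().foreach(println)
```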
An Introduction to time series with Team Apache (Patrick McFadin)
We as an industry are collecting more data every year. IoT, web, and mobile applications send torrents of bits to our data centers that have to be processed and stored, even as users expect an always-on experience—leaving little room for error. Patrick McFadin explores how successful companies do this every day using the powerful Team Apache: Apache Kafka, Spark, and Cassandra.
Patrick walks you through organizing a stream of data into an efficient queue using Apache Kafka, processing the data in flight using Apache Spark Streaming, storing the data in a highly scaling and fault-tolerant database using Apache Cassandra, and transforming and finding insights in volumes of stored data using Apache Spark.
Topics include:
- Understanding the right use case
- Considerations when deploying Apache Kafka
- Processing streams with Apache Spark Streaming
- A deep dive into how Apache Cassandra stores data
- Integration between Cassandra and Spark
- Data models for time series
- Postprocessing without ETL using Apache Spark on Cassandra
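A compressed sketch of that pipeline end to end, with made-up topic, group, and table names (Kafka receiver API as in Spark 1.x's spark-streaming-kafka module):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector.streaming._

val ssc = new StreamingContext(sc, Seconds(1))

KafkaUtils.createStream(ssc, "zk1:2181", "metrics-group", Map("metrics" -> 1))
  .map(_._2)                                    // drop the Kafka message key
  .map(_.split(","))                            // "sensor,ts,value" lines
  .map(f => (f(0), f(1).toLong, f(2).toDouble))
  .saveToCassandra("ks", "metrics")             // (sensor, ts, value) columns
ssc.start()
ssc.awaitTermination()
```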
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016 (StampedeCon)
Have you ever wanted to analyze sensor data that arrives every second from across the world? Or maybe you want to analyze intra-day trading prices of millions of financial instruments? Or take all the page views from Wikipedia and compare the hourly statistics? To do this or any similar analysis, you will need to analyze large sequences of measurements over time. And what better way to do this than with Apache Spark? In this session we will dig into how to consume data, analyze it with Spark, and store the results in Apache Cassandra.
Owning time series with Team Apache, Strata San Jose 2015 (Patrick McFadin)
Break out your laptops: this hands-on tutorial is geared toward understanding the basics of how Apache Cassandra stores and accesses time series data. We'll start with an overview of how Cassandra works and how that can be a perfect fit for time series. Then we will add in Apache Spark as a perfect analytics companion. There will be coding as part of the hands-on tutorial. The goal will be to take an example application and code through the different aspects of working with this unique data pattern. The final section will cover building an end-to-end data pipeline to ingest, process, and store high speed time series data.
The document discusses Spark job failures and Spark/YARN architecture. It describes a Spark job failure due to a task failing 4 times with a NumberFormatException when parsing a string. It then explains that Spark jobs are divided into stages made up of tasks, and the entire job fails if a stage fails. The document also provides an overview of the Spark and YARN architectures, showing how Spark jobs are submitted to and run via the YARN resource manager.
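The failure mode above (one malformed string fails a task, four retries fail the stage, and the stage failure aborts the job) is often avoided with a defensive parse; a tiny sketch:

```scala
import scala.util.Try

val raw = sc.parallelize(Seq("1", "2", "oops", "4"))
val parsed = raw.flatMap(s => Try(s.toInt).toOption)  // silently drop bad rows
println(parsed.sum())  // 7.0, instead of a NumberFormatException killing the job
```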
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra (Natalino Busa)
We present a solution for streaming anomaly detection, named “Coral”, based on Spark, Akka and Cassandra. In the system presented, Spark runs the data analytics pipeline for anomaly detection. By running Spark on the latest events and data, we make sure that the model is always up-to-date and that the number of false positives stays low, even under changing trends and conditions. Our machine learning pipeline uses Spark decision tree ensembles and k-means clustering. Once the model is trained by Spark, its parameters are pushed to the streaming event processing layer, implemented in Akka. The Akka layer then scores thousands of events per second against the latest model provided by Spark. Spark and Akka communicate with each other using Cassandra as a low-latency data store. By doing so, we make sure that every element of this solution is resilient and distributed. Spark performs micro-batches to keep the model up-to-date while Akka detects new anomalies using the latest Spark-generated data model. The project is hosted on GitHub: http://coral-streaming.github.io
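A sketch of the Spark half of such a pipeline: train k-means on feature vectors and pull out the cluster centers that a scoring layer would consume (feature values invented for illustration):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val features = sc.parallelize(Seq(
  Vectors.dense(0.10, 0.20), Vectors.dense(0.15, 0.22),   // "normal" cluster
  Vectors.dense(9.00, 8.50), Vectors.dense(9.20, 8.70)))  // "anomalous" cluster

val model = KMeans.train(features, 2, 20)  // k = 2, 20 iterations
model.clusterCenters.foreach(println)      // parameters handed to the scorer
```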
Lightning fast analytics with Spark and Cassandra (nickmbailey)
Spark is a fast and general engine for large-scale data processing. It provides APIs for Java, Scala, and Python that allow users to load data into a distributed cluster as resilient distributed datasets (RDDs) and then perform operations like map, filter, reduce, join and save. The Cassandra Spark driver allows accessing Cassandra tables as RDDs to perform analytics and run Spark SQL queries across Cassandra data. It provides server-side data selection and mapping of rows to Scala case classes or other objects.
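A small sketch of that row-to-case-class mapping, with a hypothetical ks.users(user_id, email) table; the connector binds snake_case columns to camelCase fields by name:

```scala
import com.datastax.spark.connector._

case class User(userId: String, email: String)

val users = sc.cassandraTable[User]("ks", "users")   // typed RDD[User]
println(users.filter(_.email.endsWith("@example.com")).count())
```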
Cassandra and Spark, closing the gap between NoSQL and analytics, Codemotion (Duyhai Doan)
This document discusses how Spark and Cassandra can be used together. It begins with an introduction to Spark and Cassandra individually, explaining their architectures and key features. It then details the Spark-Cassandra connector, describing how Cassandra tables can be exposed as Spark RDDs and DataFrames. Various use cases for Spark and Cassandra are presented, including data cleaning, schema migration, and analytics. The document emphasizes the importance of data locality when performing joins and writes between Spark and Cassandra. Code examples are provided for common tasks like data cleaning, migration, and analytics.
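A sketch of the DataFrame path mentioned above (Spark 1.4-era API): load a Cassandra table through the connector's data source and filter it with a SQL expression; keyspace and table names are placeholders.

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val users = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "users"))
  .load()
users.filter("email LIKE '%@example.com'").show()
```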
Big Data Day LA 2015 - Sparking up your Cassandra Cluster: Analytics made Awe... (Data Con LA)
After a brief technical introduction to Apache Cassandra, we'll go into the exciting world of Apache Spark integration and learn how you can turn your transactional datastore into an analytics platform. Apache Spark has taken the Hadoop world by storm (no pun intended!) and is widely seen as the replacement for Hadoop MapReduce. Apache Spark and Cassandra are perfect allies: Cassandra does the distributed data storage, Spark does the distributed computation.
Spark is a unified analytics engine for large-scale data processing. It provides APIs for SQL queries, streaming data, and machine learning. Spark uses RDDs (Resilient Distributed Datasets) as its fundamental data abstraction, which allows data to be operated on in parallel. RDDs track lineage information to efficiently recover lost data. Spark offers advantages over MapReduce like being faster, using less code, and supporting iterative algorithms. It can also be used for both batch and streaming workloads using the same APIs. While still maturing, Spark is gaining popularity for its ease of use and performance.
This document provides an introduction and overview of Apache Spark. It discusses why Spark is useful, describes some Spark basics including Resilient Distributed Datasets (RDDs) and DataFrames, and gives a quick tour of Spark Core, SQL, and Streaming functionality. It also provides some tips for using Spark and describes how to set up Spark locally. The presenter is introduced as a data engineer who uses Spark to load data from Kafka streams into Redshift and Cassandra. Ways to learn more about Spark are suggested at the end.
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark (DataStax Academy)
The document discusses using Apache Spark and Cassandra for online analytical processing (OLAP) of big data. It describes challenges with relational databases and OLAP cubes at large scales and how Spark can provide fast, distributed querying of data stored in Cassandra. The key points made are that Spark and Cassandra combine to provide horizontally scalable storage with Cassandra and fast, in-memory analytics with Spark; and that for optimal performance, data should be cached in Spark SQL tables for column-oriented querying and aggregation.
This document discusses using Apache Spark to perform analytics on Cassandra data. It provides an overview of Spark and how it can be used to query and aggregate Cassandra data through transformations and actions on resilient distributed datasets (RDDs). It also describes how to use the Spark Cassandra connector to load data from Cassandra into Spark and write data from Spark back to Cassandra.
This document summarizes Spark, an open-source cluster computing framework that is 10-100x faster than Hadoop for interactive queries and stream processing. It discusses how Spark works and its Resilient Distributed Datasets (RDD) API. It then explains how Spark can be used with Cassandra for fast analytics, including reading and writing Cassandra data as RDDs and mapping rows to objects. Finally, it briefly covers the Shark SQL query engine on Spark.
Apache Spark: The Analytics Operating System (Adarsh Pannu)
This presentation was delivered by Adarsh Pannu at IBM's Insight Conference in Nov 2015. For a recording, visit: https://www.youtube.com/watch?v=Tbm7HIlmwJQ
The presentation provides an overview of Apache Spark, a general-purpose big data processing engine built around speed, ease of use and sophisticated analytics. It enumerates the benefits of incorporating Spark in the enterprise, including how it allows developers to write fully-featured distributed applications ranging from traditional data processing pipelines to complex machine learning. The presentation uses the Airline "On Time" data set to explore various components of the Spark stack.
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar) (Helena Edelson)
This document provides an overview of streaming big data with Spark, Kafka, Cassandra, Akka, and Scala. It discusses delivering meaning in near-real time at high velocity and gives an overview of Spark Streaming, Kafka, and Akka. It also covers Cassandra and the Spark Cassandra Connector, as well as integration in big data applications. The presentation is given by Helena Edelson, a Spark Cassandra Connector committer and Akka contributor, a Scala and big data conference speaker, and a senior software engineer at DataStax.
This document discusses using PySpark with Cassandra for analytics. It provides background on Cassandra, Spark, and PySpark. Key features of PySpark Cassandra include scanning Cassandra tables into RDDs, writing RDDs to Cassandra, and joining RDDs with Cassandra tables. Examples demonstrate using operators like scan, project, filter, join, and save to perform tasks like processing time series data, media metadata processing, and earthquake monitoring. The document discusses getting started, compatibility, and provides code samples for common operations.
Spark is an open-source cluster computing framework. It was developed in 2009 at UC Berkeley and open sourced in 2010. Spark supports batch, streaming, and interactive computations in a unified framework. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across a cluster for parallel processing. RDDs support transformations like map and filter that return new RDDs and actions that return values to the driver program.
Lightning Fast Analytics with Cassandra and Spark (Tim Vincent)
A presentation on the integration of Apache Cassandra with Apache Spark to deliver near real-time analytics against operational data in your Cassandra distributed database.
Cassandra Day Denver 2014: Feelin' the Flow: Analyzing Data with Spark and Ca... (DataStax Academy)
Speaker: Rich Beaudoin, Senior Software Engineer at Pearson eCollege
In the world of Big Data it's crucial that your data is accessible. Cassandra provides us with a means to reliably store our data, but how can we keep it flowing? That's where Spark steps up to provide a powerful one-two punch with Cassandra to get your data flowing in all the right directions.
Spark with Elasticsearch - umd version 2014 (Holden Karau)
Holden Karau gave a talk on using Apache Spark and Elasticsearch. The talk covered indexing data from Spark to Elasticsearch both online using Spark Streaming and offline. It showed how to customize the Elasticsearch connector to write indexed data directly to shards based on partitions to reduce network overhead. It also demonstrated querying Elasticsearch from Spark, extracting top tags from tweets, and reindexing data from Twitter to Elasticsearch.
This document discusses managing Apache Cassandra at scale. It provides an overview of Cassandra's history and evolution from Dynamo and BigTable. It also discusses Cassandra's data model and how it handles operations like reads, writes and updates in a distributed system without relying on read-modify-writes. The document also covers Cassandra best practices like using collections, lightweight transactions and time series data modeling to optimize for scalability.
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S... (Helena Edelson)
Regardless of the meaning we are searching for over our vast amounts of data, whether we are in science, finance, technology, energy, health care…, we all share the same problems that must be solved: How do we achieve that? What technologies best support the requirements? This talk is about how to leverage fast access to historical data with real time streaming data for predictive modeling for lambda architecture with Spark Streaming, Kafka, Cassandra, Akka and Scala. Efficient Stream Computation, Composable Data Pipelines, Data Locality, Cassandra data model and low latency, Kafka producers and HTTP endpoints as akka actors...
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft (DataStax Academy)
Companies today are innovating with real-time data to deliver truly amazing customer experiences in the moment. Real-time data management for real-time customer experience is core to staying ahead of the competition and driving revenue growth. Join Trays to learn how Comcast is differentiating itself from its own historical reputation with Customer Experience strategies.
Introduction to DataStax Enterprise Graph Database (DataStax Academy)
DataStax Enterprise (DSE) Graph is built to manage, analyze, and search highly connected data. DSE Graph, built on the NoSQL database Apache Cassandra, delivers continuous uptime along with predictable performance and scale for modern systems dealing with complex and constantly changing data.
Download DataStax Enterprise: Academy.DataStax.com/Download
Start free training for DataStax Enterprise Graph: Academy.DataStax.com/courses/ds332-datastax-enterprise-graph
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra (DataStax Academy)
DataStax Enterprise Advanced Replication supports one-way distributed data replication from remote database clusters that might experience periods of network or internet downtime, benefiting use cases that require a 'hub and spoke' architecture.
Learn more at http://www.datastax.com/2016/07/stay-100-connected-with-dse-advanced-replication
Advanced Replication docs – https://docs.datastax.com/en/latest-dse/datastax_enterprise/advRep/advRepTOC.html
This document discusses using Docker containers to run Cassandra clusters at Walmart. It proposes transforming existing Cassandra hardware into containers to better utilize unused compute. It also suggests building new Cassandra clusters in containers and migrating old clusters to double capacity on existing hardware and save costs. Benchmark results show Docker containers outperforming virtual machines on OpenStack and Azure in terms of reads, writes, throughput and latency for an in-house application.
The document discusses the evolution of Cassandra's data modeling capabilities over different versions of CQL. It covers features introduced in each version such as user defined types, functions, aggregates, materialized views, and storage attached secondary indexes (SASI). It provides examples of how to create user defined types, functions, materialized views, and SASI indexes in CQL. It also discusses when each feature should and should not be used.
Cisco has a large global IT infrastructure supporting many applications, databases, and employees. The document discusses Cisco's existing customer service and commerce systems (CSCC/SMS3) and some of the performance, scalability, and user experience issues. It then presents a proposed new architecture using modern technologies like Elasticsearch, Cassandra, and microservices to address these issues and improve agility, performance, scalability, uptime, and the user interface.
Data Modeling is the one of the first things to sink your teeth into when trying out a new database. That's why we are going to cover this foundational topic in enough detail for you to get dangerous. Data Modeling for relational databases is more than a touch different than the way it's approached with Cassandra. We will address the quintessential query-driven methodology through a couple of different use cases, including working with time series data for IoT. We will also demo a new tool to get you bootstrapped quickly with MovieLens sample data. This talk should give you the basics you need to get serious with Apache Cassandra.
Hear about how Coursera uses Cassandra as the core of its scalable online education platform. I'll discuss the strengths of Cassandra that we leverage, as well as some limitations that you might run into as well in practice.
In the second part of this talk, we'll dive into how best to effectively use the DataStax Java drivers. We'll dig into how the driver is architected and use this understanding to develop best practices to follow. I'll also share a couple of interesting bugs we've run into at Coursera.
This document promotes DataStax Academy and certification resources for learning Cassandra, including a three-step process: learn Cassandra, get certified, and profit. It lists community evangelists like Luke Tillman, Patrick McFadin, Jon Haddad, and Duy Hai Doan who can provide help and resources.
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python (DataStax Academy)
This document summarizes three presentations from a Cassandra Meetup:
1. Jason Cacciatore discussed monitoring Cassandra health at scale across hundreds of clusters and thousands of nodes using the reactive stream processing system Mantis.
2. Minh Do explained how Cassandra uses the gossip protocol for tasks like discovering cluster topology and sharing load information. Gossip also has limitations and race conditions that can cause problems.
3. Chris Kalantzis presented Cassandra Tickler, an open source tool he created to help repair operations that get stuck by running lightweight consistency checks on an old Cassandra version or a node with space issues.
Cassandra @ Sony: The good, the bad, and the ugly, part 1 (DataStax Academy)
This talk covers scaling Cassandra to a fast-growing user base. Alex and Isaias will cover new best practices and how to work with the strengths and weaknesses of Cassandra at large scale. They will discuss how to adapt to bottlenecks while providing a rich feature set to the PlayStation community.
Cassandra @ Sony: The good, the bad, and the ugly, part 2 (DataStax Academy)
The document discusses Cassandra's use by Sony Network Entertainment to handle the large amount of user and transaction data from the growing PlayStation Network. It describes how the relational database they previously used did not scale sufficiently, so they transitioned to using Cassandra in a denormalized and customized way. Some of the techniques discussed include caching user data locally on application servers, secondary indexing, and using a real-time indexer to enable personalized search by friends.
This document provides guidance on setting up server monitoring, application metrics, log aggregation, time synchronization, replication strategies, and garbage collection for a Cassandra cluster. Key recommendations include:
1. Use monitoring tools like Monit, Munin, Nagios, or OpsCenter to monitor processes, disk usage, and system performance. Aggregate all logs centrally with tools like Splunk, Logstash, or Graylog.
2. Install NTP to synchronize server times which are critical for consistency.
3. Use the NetworkTopologyStrategy replication strategy and avoid SimpleStrategy for production.
4. Avoid shared storage and focus on low latency and high throughput using multiple local disks.
5. Understand
This document discusses real time analytics using Spark and Spark Streaming. It provides an introduction to Spark and highlights limitations of Hadoop for real-time analytics. It then describes Spark's advantages like in-memory processing and rich APIs. The document discusses Spark Streaming and the Spark Cassandra Connector. It also introduces DataStax Enterprise which integrates Spark, Cassandra and Solr to allow real-time analytics without separate clusters. Examples of streaming use cases and demos are provided.
Introduction to Data Modeling with Apache Cassandra (DataStax Academy)
This document provides an introduction to data modeling with Apache Cassandra. It discusses how Cassandra data models are designed based on the queries an application will perform, unlike relational databases which are designed based on normalization rules. Key aspects covered include avoiding joins by denormalizing data, using a partition key to group related data on nodes, and controlling the clustering order of columns. The document provides examples of modeling time series and tag data in Cassandra.
The document discusses different data storage options for small, medium, and large datasets. It argues that relational databases do not scale well for large datasets due to limitations with replication, normalization, sharding, and high availability. The document then introduces Apache Cassandra as a fast, distributed, highly available, and linearly scalable database that addresses these limitations through its use of a hash ring architecture and tunable consistency levels. It describes Cassandra's key features including replication, compaction, and multi-datacenter support.
Enabling Search in your Cassandra Application with DataStax Enterprise (DataStax Academy)
This document provides an overview of using Datastax Enterprise (DSE) Search to enable full-text search capabilities in Cassandra applications. It discusses how DSE Search integrates Solr/Lucene indexing with the Cassandra database to allow searching of application data without requiring a separate search cluster, external ETL processes, or custom application code for data management. The document also includes examples of different types of searches that can be performed, such as filtering, faceting, geospatial searches, and joins. It concludes with basic steps for getting started with DSE Search such as creating a Solr core and executing search queries using CQL.
The document discusses common bad habits that can occur when working with Apache Cassandra and provides recommendations to avoid them. Specifically, it addresses issues like sliding back into a relational mindset when the data model is different, improperly benchmarking Cassandra systems, having slow client performance, and neglecting important operations tasks. The presentation provides guidance on how to approach data modeling, querying, benchmarking, driver usage, and operations management in a Cassandra-oriented way.
This document provides an overview and examples of modeling data in Apache Cassandra. It begins with an introduction to thinking about data models and queries before modeling, and emphasizes that Cassandra requires modeling around queries due to its limitations on joins and indexes. The document then provides examples of modeling user, video, and other entity data for a video sharing application to support common queries. It also discusses techniques for handling queries that could become hotspots, such as bucketing or adding random values. The examples illustrate best practices for data duplication, materialized views, and time series data storage in Cassandra.
The document discusses best practices for using Apache Cassandra, including:
- Topology considerations like replication strategies and snitches
- Booting new datacenters and replacing nodes
- Security techniques like authentication, authorization, and SSL encryption
- Using prepared statements for efficiency
- Asynchronous execution for request pipelining
- Batch statements and their appropriate uses
- Improving performance through techniques like the new row cache
1. Escape From Hadoop: Spark One Liners for C* Ops
Kurt Russell Spitzer
DataStax
2. Who am I?
• Bioinformatics Ph.D from UCSF
• Works on the integration of Cassandra (C*) with Hadoop, Solr, and SPARK!!
• Spends a lot of time spinning up clusters on EC2, GCE, Azure, …
  http://www.datastax.com/dev/blog/testing-cassandra-1000-nodes-at-a-time
• Developing new ways to make sure that C* Scales
3. Why escape from Hadoop?
HADOOP
Many Moving Pieces
Map Reduce
Single Points of Failure
Lots of Overhead
And there is a way out!
4. Spark Provides a Simple and Efficient framework for Distributed Computations
Node Roles: 2
In Memory Caching: Yes!
Generic DAG Execution: Yes!
Great Abstraction For Datasets? RDD!
[Diagram: a Spark Master coordinating several Spark Workers, each running Spark Executors that hold pieces of a Resilient Distributed Dataset]
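As a quick taste of the RDD API (not from the deck), any Spark shell can run:

sc.parallelize(1 to 1000).map(_ * 2).reduce(_ + _)   // distribute, transform, reduce: Int = 1001000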
6. Spark is Compatible with HDFS, Parquet, CSVs, …
AND APACHE CASSANDRA
7. Apache Cassandra is a Linearly Scaling and Fault Tolerant NoSQL Database
Linearly Scaling: the power of the database increases linearly with the number of machines. 2x machines = 2x throughput
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Fault Tolerant:
Nodes down != Database Down
Datacenter down != Database Down
8. Apache Cassandra Architecture is Very Simple
Node Roles: 1
Replication: Tunable
Consistency: Tunable
[Diagram: a client connecting to a ring of identical C* nodes]
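Tunable consistency is set per request; for example, in cqlsh:

cqlsh> CONSISTENCY QUORUM;   -- reads/writes in this session now require a quorum of replicas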
10. Spark Cassandra Connector uses the DataStax Java Driver to Read from and Write to C*
Each Spark Executor maintains a connection to the C* cluster through the DataStax Java Driver.
RDDs are read in as separate splits, each covering a set of tokens out of the full token range (Tokens 1-1000, Tokens 1001-2000, …).
[Diagram: Spark Executors pulling distinct token ranges from the C* ring]
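In a stand-alone app (as opposed to the shell, where sc already exists), wiring the connector up looks roughly like this; the host value is an assumption for a local node:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._                     // adds cassandraTable / saveToCassandra

val conf = new SparkConf()
  .setAppName("C* one-liners")
  .set("spark.cassandra.connection.host", "127.0.0.1")   // any reachable C* node
val sc = new SparkContext(conf)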
11. Co-locate Spark and C* for Best Performance
Running Spark Workers on the same nodes as your C* Cluster will save network hops when reading and writing.
[Diagram: a Spark Worker alongside C* on each node, coordinated by a Spark Master]
12. Setting up C* and Spark
DSE > 4.5.0
Just start your nodes with
dse cassandra -k
Apache Cassandra
Follow the excellent guide by Al Tobey
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
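Once the nodes are up, getting a connector-ready REPL is one command (the assembly jar name below is illustrative):

dse spark                                                         # DSE ships a Spark shell wired to C*
spark-shell --jars spark-cassandra-connector-assembly-1.1.0.jar   # the open source route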
13. We need a Distributed System For Analytics and Batch Jobs
But it doesn’t have to be complicated!
14. Even count needs to be distributed
Ask me to write a Map Reduce for word count, I dare you.
You could make this easier by adding yet another technology to your Hadoop Stack (Hive, Pig, Impala), or we could just do one-liners in the Spark shell.
15. Basics: Getting a Table and Counting
CREATE KEYSPACE newyork WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
use newyork;
CREATE TABLE presidentlocations ( time int, location text, PRIMARY KEY (time) );
INSERT INTO presidentlocations (time, location) VALUES ( 1, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 2, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 3, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 4, 'White House' );
INSERT INTO presidentlocations (time, location) VALUES ( 5, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 6, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 7, 'Air Force 1' );
INSERT INTO presidentlocations (time, location) VALUES ( 8, 'NYC' );
INSERT INTO presidentlocations (time, location) VALUES ( 9, 'NYC' );
INSERT INTO presidentlocations (time, location) VALUES ( 10, 'NYC' );
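The counting half of the title is then a Spark shell one-liner; a minimal sketch:

scala> sc.cassandraTable("newyork","presidentlocations").count
// 10, one per row inserted above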
22. Basics: Getting Row Values out of a CassandraRow
scala> sc.cassandraTable("newyork","presidentlocations").take(1)(0).get[Int]("time")

res5: Int = 9

take(1) pulls back an Array of CassandraRows (here the row 9 | NYC); get[Int]("time") then extracts a typed value. The same pattern gives you get[String], get[Any], and, when a column might hold a null, get[Option[Int]].
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkSupportedTypes.html
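A quick sketch of the null-safe variant:

val row = sc.cassandraTable("newyork","presidentlocations").first
row.get[Option[Int]]("time")   // Some(time), or None if the cell were null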
25. Copy A Table
Say we want to restructure our table or add a new column?
CREATE TABLE characterlocations (
  time int,
  character text,
  location text,
  PRIMARY KEY (time,character)
);

sc.cassandraTable("newyork","presidentlocations")
  .map( row => (
      row.get[Int]("time"),
      "president",
      row.get[String]("location")
  )).saveToCassandra("newyork","characterlocations")

cqlsh:newyork> SELECT * FROM characterlocations ;

 time | character | location
------+-----------+-------------
    5 | president | Air Force 1
   10 | president |         NYC
…
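The tuple here happens to match the target columns; when it doesn't, the connector's SomeColumns lets you name them explicitly. A sketch (argument style may vary slightly by connector version):

import com.datastax.spark.connector._

sc.cassandraTable("newyork","presidentlocations")
  .map( row => (row.get[Int]("time"), "president", row.get[String]("location")))
  .saveToCassandra("newyork","characterlocations", SomeColumns("time","character","location"))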
31. Filter a Table
What if we want to filter based on a non-clustering key column?
scala> sc.cassandraTable("newyork","presidentlocations")
  .filter( _.get[Int]("time") > 7 )
  .toArray

res9: Array[com.datastax.spark.connector.CassandraRow] =
Array(
  CassandraRow{time: 9, location: NYC},
  CassandraRow{time: 10, location: NYC},
  CassandraRow{time: 8, location: NYC}
)

Each CassandraRow reaches the anonymous parameter ( _ ), get[Int] extracts its time, and only rows with time > 7 survive the filter.
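Note that this filter runs in Spark after the rows are read; time is the partition key here, so Cassandra cannot serve the range query itself. For predicates C* can evaluate, for example on a clustering column such as time in the timelines table on the next slides, the connector's where pushes the CQL down instead. A sketch (exact pushdown behavior depends on the connector version):

sc.cassandraTable("newyork","timelines")
  .where("time > ?", 7)   // predicate appended to the connector's CQL and evaluated server-side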
37. Backfill a Table with a Different Key!
If we actually want to have quick access to timelines we need a C* table with a different structure.
CREATE TABLE timelines (
  time int,
  character text,
  location text,
  PRIMARY KEY ((character), time)
)

sc.cassandraTable("newyork","characterlocations")
  .saveToCassandra("newyork","timelines")

cqlsh:newyork> select * from timelines;

 character | time | location
-----------+------+-------------
 president |    1 | White House
 president |    2 | White House
 president |    3 | White House
 president |    4 | White House
 president |    5 | Air Force 1
 president |    6 | Air Force 1
 president |    7 | Air Force 1
 president |    8 |         NYC
 president |    9 |         NYC
 president |   10 |         NYC
41. Import a CSV
I have some data in another source which I could really use in my Cassandra table
sc.textFile("file:///Users/russellspitzer/ReallyImportantDocuments/PlisskenLocations.csv")
  .map(_.split(","))
  .map( line =>
      (line(0),line(1),line(2)))
  .saveToCassandra("newyork","timelines")

Each text line (e.g. plissken,1,Federal Reserve) is split on commas into a tuple and saved to C*.

cqlsh:newyork> select * from timelines where character = 'plissken';

 character | time | location
-----------+------+-----------------
  plissken |    1 | Federal Reserve
  plissken |    2 | Federal Reserve
  plissken |    3 | Federal Reserve
  plissken |    4 |           Court
  plissken |    5 |           Court
  plissken |    6 |           Court
  plissken |    7 |           Court
  plissken |    8 |  Stealth Glider
  plissken |    9 |             NYC
  plissken |   10 |             NYC
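One subtlety: split(",") yields Strings, while time is an int column. The connector's type converters generally handle the coercion, but parsing explicitly is a safer sketch:

sc.textFile("file:///Users/russellspitzer/ReallyImportantDocuments/PlisskenLocations.csv")
  .map(_.split(","))
  .map( line => (line(0), line(1).toInt, line(2)))   // parse time up front
  .saveToCassandra("newyork","timelines")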
47. Perform a Join with MySQL
Maybe a little more than one line …
MySQL Table "quotes" in "escape_from_ny"
import java.sql._
import org.apache.spark.rdd.JdbcRDD
Class.forName("com.mysql.jdbc.Driver").newInstance() // Connector/J added to Spark Shell Classpath
val quotes = new JdbcRDD(
  sc,
  () => {
    DriverManager.getConnection("jdbc:mysql://Localhost/escape_from_ny?user=root")},
  "SELECT * FROM quotes WHERE ? <= ID and ID <= ?",
  0,
  100,
  5,
  (r: ResultSet) => {
    (r.getInt(2), r.getString(3))
  }
)

quotes: org.apache.spark.rdd.JdbcRDD[(Int, String)] = JdbcRDD[9] at JdbcRDD at <console>:23
48. Perform a Join with MySQL
Maybe a little more than one line …
quotes.join(
  sc.cassandraTable("newyork","timelines")
  .filter( _.get[String]("character") == "plissken")
  .map( row => (row.get[Int]("time"), row.get[String]("location"))))
  .take(1)
  .foreach(println)

(5,
  (Bob Hauk:  There was an accident.
      About an hour ago, a small jet went down inside New York City.
      The President was on board.
   Snake Plissken: The president of what?,
  Court)
)

join needs both sides in the form RDD[K,V]: the JdbcRDD already holds (time, quote) pairs, and the C* rows are mapped down to (time, location) pairs, so time 5 joins 'Bob Hauk: …' with Court.
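Worth noting: later connector releases (1.2 and up) added joinWithCassandraTable, which fetches only the matching primary keys on the C* side instead of scanning and filtering the whole table as above.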
52. Easy Objects with Case Classes
We have the technology to make this even easier!
case class timelineRow (character:String, time:Int, location:String)
sc.cassandraTable[timelineRow]("newyork","timelines")
  .filter( _.character == "plissken")
  .filter( _.time == 8)
  .toArray
res13: Array[timelineRow] = Array(timelineRow(plissken,8,Stealth Glider))

cassandraTable[timelineRow] maps each row onto the case class fields (character, time, location), so the filters read as plain field comparisons instead of get[...] calls.
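The mapping also works for writes; a sketch with a made-up row (the value is hypothetical):

sc.parallelize(Seq(timelineRow("plissken", 11, "Harlem")))   // hypothetical new row
  .saveToCassandra("newyork","timelines")                    // fields map to columns by name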
58. A Map Reduce for Word Count …
scala> sc.cassandraTable("newyork","presidentlocations")
  .map( _.get[String]("location") )
  .flatMap( _.split(" ") )
  .map( (_,1) )
  .reduceByKey( _ + _ )
  .toArray
res17: Array[(String, Int)] = Array((1,3), (House,4), (NYC,3), (Force,3), (White,4), (Air,3))

Each location (e.g. "White House") is split into words, each word becomes a (word, 1) pair, and reduceByKey sums the counts per word.
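Since the result is just an RDD of (word, count) pairs, one more call writes it back to C*; a sketch assuming a hypothetical wordcounts table:

cqlsh> CREATE TABLE newyork.wordcounts ( word text PRIMARY KEY, count int );   -- hypothetical results table

scala> sc.cassandraTable("newyork","presidentlocations")
  .map( _.get[String]("location") )
  .flatMap( _.split(" ") )
  .map( (_,1) )
  .reduceByKey( _ + _ )
  .saveToCassandra("newyork","wordcounts")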
64. Stand Alone App Example
https://github.com/RussellSpitzer/spark4cassandra4csv
Car, Model, Color
Dodge, Caravan, Red
Ford, F150, Black
Toyota, Prius, Green
[Diagram: Spark + SCC moving the favorite-cars CSV through an RDD[CassandraRow], via a column mapping, into the FavoriteCars table in Cassandra]
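For the stand-alone route, the build needs the connector on the classpath; a build.sbt sketch (versions are illustrative for the Spark 1.x era):

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-core"                % "1.1.0" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0"
)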
66. Getting started with Cassandra?
DataStax Academy offers free online Cassandra training!
Planet Cassandra has resources for learning the basics, from ‘Try Cassandra’ tutorials to in-depth language and migration pages!
Find a way to contribute back to the community: talk at a meetup, or share your story on PlanetCassandra.org!
Need help? Get questions answered with Planet Cassandra’s free virtual office hours running weekly!
Email us: Community@DataStax.com
In production? Tweet us: @PlanetCassandra
Thanks for coming to the meetup!