This document introduces Riak and summarizes its key features. Riak is a flexible storage engine that uses a key-value store model and supports document storage via JSON. It has a REST API and supports map/reduce functions. Riak is highly distributed, fault-tolerant, and optimized for availability. It aims to balance consistency and availability according to the CAP theorem by allowing tunable consistency on a per-request basis.
Agenda
- What is NOSQL?
- Motivations for NOSQL?
- Brewer’s CAP Theorem
- Taxonomy of NOSQL databases
- Apache Cassandra
- Features
- Data Model
- Consistency
- Operations
- Cluster Membership
- What Does NOSQL Mean for RDBMS?
Information technology has led us into an era in which the production, sharing, and use of information are part of everyday life, often without our being aware of it: it is now almost impossible not to leave a digital trail of many of our daily actions, for example through digital content such as photos, videos, and blog posts, and everything that revolves around social networks (Facebook and Twitter in particular). On top of this, with the "Internet of Things" we see a growing number of devices such as watches, bracelets, thermostats, and many other items that can connect to the network and therefore generate large data streams. This explosion of data explains the birth of the term Big Data: it denotes data produced in large volumes, at remarkable speed, and in varied formats, which requires processing technologies and resources that go far beyond conventional data management and storage systems. It is immediately clear that 1) storage models based on the relational model and 2) processing systems based on stored procedures and grid computing are not applicable in these contexts. Regarding point 1, RDBMSs, widely used for a great variety of applications, run into problems when the amount of data grows beyond certain limits. Scalability and implementation cost are only part of the disadvantages: very often, when facing big data, variability, that is, the lack of a fixed structure, is also a significant problem. This has given a boost to the development of NoSQL databases. The website NoSQL Databases defines NoSQL databases as "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open source and horizontally scalable."
These databases are distributed, open source, horizontally scalable, schema-less (key-value, column-oriented, document-based, and graph-based), easily replicable, free of ACID guarantees, and able to handle large amounts of data. They are typically integrated with processing tools based on the MapReduce paradigm proposed by Google in 2004. MapReduce, together with the open-source Hadoop framework, represents the new model for distributed processing of large amounts of data, supplanting techniques based on stored procedures and computational grids (point 2). The relational model taught in basic database design courses has many limitations compared to the demands posed by new applications, which use NoSQL databases to store Big Data and MapReduce to process it.
Course Website http://pbdmng.datatoknowledge.it/
Contact me for more information and for downloads
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2L4rPmM
This CloudxLab tutorial helps you understand the basics of RDD in detail. Below are the topics covered in this tutorial:
1) What is RDD - Resilient Distributed Datasets
2) Creating RDD in Scala
3) RDD Operations - Transformations & Actions
4) RDD Transformations - map() & filter()
5) RDD Actions - take() & saveAsTextFile()
6) Lazy Evaluation & Instant Evaluation
7) Lineage Graph
8) flatMap and Union
9) Scala Transformations - Union
10) Scala Actions - saveAsTextFile(), collect(), take() and count()
11) More Actions - reduce()
12) Can We Use reduce() for Computing Average?
13) Solving Problems with Spark
14) Compute Average and Standard Deviation with Spark
15) Pick Random Samples From a Dataset using Spark
This presentation is about NoSQL, which stands for "Not Only SQL". It covers aspects of using NoSQL for Big Data and the differences from RDBMS.
Presentation slides of the workshop on "Introduction to Pig" at Fifth Elephant, Bangalore, India on 26th July, 2012.
http://fifthelephant.in/2012/workshop-pig
- Introduction to MariaDB
- Understanding MariaDB server configuration and architecture
- MariaDB storage engines
- MariaDB database administration
- Understanding transactions / locking
- MariaDB security
- Database management through backup and recovery
- MariaDB upgrade
- MariaDB monitoring
- Migrating from MySQL to MariaDB
This tutorial covers all parallel replication implementations in MariaDB 10.0 and 10.1 and MySQL 5.6, 5.7 and 8.0 (including how it works in Group Replication).
MySQL and MariaDB have different types of parallel replication. In this tutorial, we present the different implementations that allow us to understand their limitations and tuning parameters. We cover how to make parallel replication faster and what to avoid for maximizing its benefits. We also present tests from Booking.com workloads.
Some of the subjects covered are group commit and optimistic parallel replication in MariaDB, the parallelism interval of MySQL and its Write Set optimization, and the "slowing down the master to speed up the slave" optimization.
After this tutorial, you will know everything you need to implement and tune parallel replication in your environment. More importantly, we will show how you can test the benefits of parallel replication in a non-disruptive way before deployment.
Riak ( http://wiki.basho.com ), a Dynamo-inspired, open-source key/value datastore, was built to scale from a single machine to a 100+ server cluster without driving you or your operations team crazy. This presentation discusses the characteristics of Riak that become important in small, medium, and large clusters.
This is a presentation given by Matt Brender (@mjbrender) at Big Data TechCon 2015.
In this class, we will discuss why companies choose Riak over a relational database with a specific focus on availability, scalability, and the key/value data model. We then analyze the decision points that should be considered when choosing a non-relational solution and review data modeling, querying, and consistency guarantees. Finally, we end with simple patterns for building common applications in Riak using its key/value design, dealing with data conflicts that emerge in an eventually consistent system, and discuss multi-datacenter replication.
Embrace NoSQL and Eventual Consistency with Ripple (Sean Cribbs)
So, there's this "NoSQL" thing you may have heard of, and this related thing called "eventual consistency". Supposedly, they help you scale, but no one has ever explained why! Well, wonder no more! This talk will demystify NoSQL, eventual consistency, how they might help you scale, and -- most importantly -- why you should care.
We'll look closely at how Riak, a linearly-scalable, distributed and fault-tolerant NoSQL datastore, implements eventual consistency, and how you can harness it from Ruby via the slick Ripple client/ORM. When the talk is finished, you'll have the tools both to understand eventual consistency and to handle it like a pro inside your next Ruby application.
Find out how to build decentralized, fault-tolerant, stateful application services using core concepts and techniques from the Amazon Dynamo paper using riak_core as a toolkit.
Convergent Replicated Data Types in Riak 2.0 (Big Data Spain)
Talk by Gordon Guthrie, Senior Software Engineer at Basho
Summary
A review of the CAP theorem and the difficulties of resolving conflicts in highly distributed systems, covering the issues and various theories on how to resolve them, including the use of CRDTs in Riak.
Details
CRDTs are used to replicate data across multiple computers in a network, executing updates without the need for remote synchronisation. Independent updates lead to merge conflicts in systems using conventional eventual-consistency technology, but CRDTs are designed so that conflicts are mathematically impossible. Under the constraints of the CAP theorem, they provide the strongest consistency guarantees for available/partition-tolerant (AP) settings.
The CRDT concept was first formally defined in 2007 by Marc Shapiro and Nuno Preguiça in terms of operation commutativity, and development was initially motivated by collaborative text editing. The concept of semilattice evolution of replicated states was first defined by Baquero and Moura in 1997, and development was initially motivated by mobile computing. The two concepts were later unified in 2011.
Basho has worked with the EU and Marc Shapiro's team to push CRDTs into distributed systems. Riak v2.x is the first commercial product to include this functionality.
When OLAP Meets Real-Time, What Happens in eBay? (DataWorks Summit)
An OLAP cube is about pre-aggregation: it reduces query latency by spending more time and resources on data preparation. But for real-time analytics, data preparation and visibility latency are critical. What happens when the OLAP cube meets real-time use cases?
Can we pre-build the cubes in real time in a quick and more cost-effective way? This is hard but still doable.
At eBay, we built our own real-time OLAP solution based on Apache Kylin and Apache Kafka. We read unbounded events from a Kafka cluster, then divide the streaming data into three stages: an In-Memory Stage (continuous in-memory aggregation), an On-Disk Stage (flushed to disk, with columnar storage and indexes), and a Full Cubing Stage (with MR or Spark, saved to HBase). Data is aggregated into different layers at different stages, but all of it is queryable. Data is transformed from one stage to the next automatically and transparently to the user.
This solution was built to support quite a few real-time analytics use cases at eBay; in this session we will also share some of them, such as site-speed monitoring and eBay site deal performance.
Speaker:
Qiaoneng Qian, Senior Product Manager, eBay
Scylla Summit 2018: Scalable Stream Processing with KSQL, Kafka and ScyllaDB (ScyllaDB)
The increasing demand to manage real-time data (RTD) resulted in growing adoption of stream processing systems. Organizations can no longer wait for nightly batch jobs to process data and then take actions. In this talk we show how the powerful combination of KSQL, Kafka and ScyllaDB can help you implement scalable stream processing applications. We present a real-time streaming pipeline where massive amounts of data are ingested into Kafka, then processed by KSQL to keep the real-time results in Scylla tables. Whenever you query the Scylla tables you are sure you have the latest results at your fingertips.
Large Scale Machine Learning with Apache Spark (Cloudera, Inc.)
Spark offers a number of advantages over its predecessor MapReduce that make it ideal for large-scale machine learning. For example, Spark includes MLlib, a library of machine learning algorithms for large data. The presentation will cover the state of MLlib and the details of some of the scalable algorithms it includes.
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin... (DB Tsai)
Apache Spark is a new cluster computing engine offering a number of advantages over its predecessor MapReduce. In-memory cache is utilized in Apache Spark to scale and parallelize iterative algorithms which makes it ideal for large-scale machine learning. It is one of the most active open source projects in big data, surpassing even Hadoop MapReduce. In this talk, DB will introduce Spark and show how to use Spark’s high-level API in Java, Scala or Python. Then, he will show how to use MLlib, a library of machine learning algorithms for big data included in Spark to do classification, regression, clustering, and recommendation in large scale.
20-24. What Is Riak?
A flexible storage engine...
...with a REST API...
...and map/reduce capability...
...designed to be fault-tolerant...
...distributed...
...and ops friendly
32-34. CAP Theorem
Consistent: Reads and writes reflect a globally consistent system state
Available: System is available for reads and writes
Partition Tolerant: System can handle the failure of individual parts
42-43. Dynamo Influences
• N = The number of replicas
• R = The number of replicas needed for a successful read
• W = The number of replicas needed for a successful write
50-51. Dynamo Math
N = 4, W = 2, R = 1
4 - 2 = 2 hosts can be down and Riak can still perform writes.
4 - 1 = 3 hosts can be down and Riak can still perform reads.
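The arithmetic above generalizes: with N replicas, writes tolerate N - W host failures and reads tolerate N - R. A minimal sketch of the quorum math (the function names are mine, not part of Riak's API):

```javascript
// N = number of replicas, W = replicas required for a successful write,
// R = replicas required for a successful read.
function writeFaultTolerance(n, w) {
  return n - w; // hosts that can be down with writes still succeeding
}

function readFaultTolerance(n, r) {
  return n - r; // hosts that can be down with reads still succeeding
}

// A read quorum is guaranteed to overlap a write quorum when R + W > N.
function readsSeeLatestWrite(n, r, w) {
  return r + w > n;
}

// The slides' example: N = 4, W = 2, R = 1
console.log(writeFaultTolerance(4, 2)); // 2
console.log(readFaultTolerance(4, 1));  // 3
console.log(readsSeeLatestWrite(4, 1, 2)); // false: a read may miss the latest write
```

Note the trade-off the last function exposes: the slides' N=4, W=2, R=1 configuration favors availability, so a read may not overlap the most recent write quorum.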
80-81. Linking Objects
• Objects can store pointers, or links, to other objects
• The linked object doesn't have to be in the same bucket
• Object links are described in a Link header
82. Link Header Format
</riak/demo/test1>; riaktag="userinfo"
The angle brackets enclose the object URL; the riaktag parameter carries the link tag.
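The header value pairs an object URL in angle brackets with a riaktag parameter. A small parser sketch for this format (the helper name is mine, not part of any Riak client):

```javascript
// Parse a Riak Link header value of the form:
//   </riak/demo/test1>; riaktag="userinfo"
// into its object URL and link tag.
function parseRiakLink(header) {
  const m = header.match(/<([^>]+)>\s*;\s*riaktag="([^"]*)"/);
  return m ? { url: m[1], tag: m[2] } : null;
}

console.log(parseRiakLink('</riak/demo/test1>; riaktag="userinfo"'));
// { url: '/riak/demo/test1', tag: 'userinfo' }
```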
95. Link Walking Examples
/riak/demo/test1/_,_,0/_,_,1
Start walking at /demo/test1, find any linked objects, then find and return any objects linked to those
98. Link Walking Examples
/riak/demo/test1/_,child,0/_,_,1
Start walking at /demo/test1, find any linked objects with the link tag "child", then find and return any objects linked to those
100-103. Map/Reduce Terms
• Phase: A step within a job
• Job: A sequence of phases and inputs
• Map: Data collection phase
• Reduce: Data collation or processing phase
106-108. Map/Reduce Overview
• Map phases execute in parallel with data locality
• Reduce phases execute in parallel on the node where the job was submitted
• Results are not cached or stored
• Phases can be written in Erlang or Javascript
119-121. Erlang Map Phase
• Two types: modfun and qfun
• modfuns reference the module and name of the Erlang function to call
• qfuns are anonymous Erlang functions*
*Must be on the server-side codepath
131-133. Erlang Map Built-Ins
riak_mapreduce:map_object_value/3
• Returns the object value wrapped in a list
riak_mapreduce:map_object_value_list/3
• Returns the object value; the object value must already be a list
142-143. Erlang & Javascript
• Same environment as Firefox minus the browser bits
• Erlang to Javascript data is JSON encoded
• Javascript to Erlang data is JSON decoded
152-154. Javascript Map Built-Ins
Riak.mapValues
• Returns object values. Handles detecting when/if to use list wrapping.
Riak.mapValuesJson
• Returns JSON-parsed object values. Also performs list wrapping, if needed.
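Custom Javascript map functions follow the same contract as these built-ins: they receive the Riak object and must return a list. A sketch under the assumption that the stored value is JSON, using the object shape Riak hands to Javascript map functions (the stored body lives in value.values[0].data; the function and field names are mine):

```javascript
// A map phase function receives (value, keyData, arg) and returns a list.
// `value.values[0].data` holds the stored body, as in the object Riak
// passes to Javascript map functions.
function mapHighVolume(value, keyData, arg) {
  const doc = JSON.parse(value.values[0].data);
  // Emit the document only when its volume exceeds the phase argument.
  return doc.volume > arg ? [doc] : [];
}

// Simulated Riak object for local testing:
const obj = { values: [{ data: '{"ticker":"GOOG","volume":5000}' }] };
console.log(mapHighVolume(obj, null, 1000)); // one matching document
console.log(mapHighVolume(obj, null, 9000)); // []
```

Returning `[]` rather than `null` for non-matching objects matters: the combined map outputs are concatenated into one list before the reduce phase runs.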
157-158. Reduce Phase
• Performed on the node coordinating the map/reduce job
• Two processes per reduce phase to add minor parallelism
• Must return a list
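The "must return a list" rule exists because a reduce phase may be invoked repeatedly over partial results, so its output has to be valid input for another reduce pass. A minimal sum-reduce sketch (the function name is mine, not a Riak built-in):

```javascript
// A reduce phase receives the combined list of outputs from the map phase
// (or from earlier reduce passes) and must return a list.
function reduceSum(values) {
  return [values.reduce((acc, v) => acc + v, 0)];
}

console.log(reduceSum([1, 2, 3, 4])); // [10]
// Re-reducible: feeding reduce output back into reduce gives the same result.
console.log(reduceSum(reduceSum([1, 2]).concat(reduceSum([3, 4])))); // [10]
```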
161-165. Erlang Reduce Built-Ins
riak_mapreduce:reduce_set_union/2
• Returns the unique set of values
riak_mapreduce:reduce_sum/2
• Returns the sum of the inputs
riak_mapreduce:reduce_sort/2
• Returns the sorted list of inputs
168-172. Javascript Reduce Built-Ins
Riak.reduceMin
• Returns the minimum value of the input set
Riak.reduceMax
• Returns the maximum value of the input set
Riak.reduceSort
• Returns a sorted list of the input set
188-190. Erlang Phase (JSON)
{Type: {"language": "erlang", "module": Module, "function": Function, "keep": Flag}}
• Type: "map" or "reduce"
• Module: String name of the Erlang module
• Function: String name of the Erlang function
• Flag: Boolean accumulation toggle
203-204. Javascript Phase (JSON)
{Type: {"language": "javascript", "name": Name, "keep": Flag}}
• Type: "map" or "reduce"
• Name: String name of the Javascript function
• Flag: Boolean accumulation toggle
205. Putting It Together
{"inputs": [["stocks", "goog"]],
 "query": [{"map": {"language": "javascript",
                    "name": "Riak.mapValuesJson",
                    "keep": true}}]}
206. Putting It Together
{"inputs": [["stocks", "goog"],
            ["stocks", "csco"]],
 "query": [{"map": {"language": "javascript",
                    "name": "Riak.mapValuesJson",
                    "keep": true}}]}
207. Putting It Together
{"inputs": "stocks",
 "query": [{"map": {"language": "javascript",
                    "name": "App.extractTickers",
                    "arg": "GOOG",
                    "keep": false}},
           {"reduce": {"language": "javascript",
                       "name": "Riak.reduceMin",
                       "keep": true}}]}
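A job spec like the last example can be built programmatically before being POSTed to Riak's /mapred endpoint with Content-Type: application/json. A sketch (buildStockJob is my own helper; App.extractTickers is the slides' hypothetical user function, while Riak.reduceMin is a real built-in):

```javascript
// Assemble a two-phase map/reduce job like slide 207: a map phase that
// filters by ticker, then a reduce phase that finds the minimum value.
function buildStockJob(bucket, ticker) {
  return {
    inputs: bucket, // a bare bucket name means "all keys in this bucket"
    query: [
      { map: { language: 'javascript',
               name: 'App.extractTickers', // hypothetical user function
               arg: ticker,
               keep: false } },
      { reduce: { language: 'javascript',
                  name: 'Riak.reduceMin', // Riak built-in
                  keep: true } }
    ]
  };
}

const job = buildStockJob('stocks', 'GOOG');
console.log(JSON.stringify(job, null, 2));
```

Because only the reduce phase sets "keep": true, only its output is returned to the client; the intermediate map results are discarded.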