TriHUG January 2012 Talk by Chris Shain
2. Hadoop for Financial Services
First completely Hadoop-powered analytics application
Widely recognized as “Big Data Startup to Watch”
Winner of the 2011 NCTA Award for Emerging Tech Company of the Year
Based in Charlotte NC
We are hiring! curious@tresata.com
3. Software Development Lead at Tresata
Background in Financial Services IT
End-User Applications
Data Warehousing
ETL
Email: chris@tresata.com
Twitter: @chrisshain
4. What is HBase?
From: http://hbase.apache.org:
▪ “HBase is the Hadoop database.”
I think this is a confusing statement
5. ‘Database’, to many, means:
Transactions
Joins
Indexes
HBase has none of these
More on this later
6. HBase is a data storage platform designed to work hand-in-hand with Hadoop
Distributed
Failure-tolerant
Semi-structured
Low latency
Strictly consistent
HDFS-Aware
“NoSQL”
7. Need for a low-latency, distributed datastore with unlimited horizontal scale
Hadoop (MapReduce) doesn’t provide low latency
Traditional RDBMSs don’t scale out horizontally
8. November 2006: Google BigTable whitepaper published:
http://research.google.com/archive/bigtable.html
February 2007: Initial HBase Prototype
October 2007: First ‘usable’ HBase
January 2008: HBase becomes Apache subproject of Hadoop
March 2009: HBase 0.20.0
May 10th, 2010: HBase becomes Apache Top Level Project
9. Web Indexing
Social Graph
Messaging (Email etc.)
10. HBase is written almost entirely in Java
JVM clients are first-class citizens
[Diagram: JVM clients talk directly to the HBase Master and the RegionServers; non-JVM clients (Thrift or REST) go through a proxy to reach the RegionServers]
11. All data is stored in Tables
Table rows have exactly one Key, and all rows in a table are physically ordered by key
Tables have a fixed number of Column Families (more on this later!)
Each row can have many Columns in each column family
Each column has a set of values, each with a timestamp
Each row:family:column:timestamp combination represents coordinates for a Cell
12. Defined by the Table
A Column Family is a group of related columns with its own name
All columns must be in a column family
Each row can have a completely different set of columns for a column family

Row:    Column Family:  Columns:
Chris   Friends         Friends:Bob
Bob     Friends         Friends:Chris, Friends:James
James   Friends         Friends:Bob
13. Not exactly the same as rows in a traditional RDBMS
Key: a byte array (usually a UTF-8 String)
Data: Cells, qualified by column family, column, and timestamp (not shown here)

Row Key:          Column Families:        Columns:                  Cells:
(Defined by keys) (Defined by the Table)  (Defined by the Row,      (Created with Columns)
                                          may vary between rows)
Chris             Attributes              Attributes:Age            30
                                          Attributes:Height         68
                  Friends                 Friends:Bob               1 (Bob’s a cool guy)
                                          Friends:Jane              0 (Jane and I don’t get along)
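The layout above — sorted rows, each holding families, qualifiers, and timestamped versions — can be modeled as nested sorted maps. This is a plain-Java sketch of the logical data model only, not the real HBase client API:

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of HBase's logical layout:
// row key -> column family -> column qualifier -> timestamp -> cell value.
// Every level is sorted, which is why rows are physically ordered by key.
public class LogicalTable {
    private final NavigableMap<String,
            NavigableMap<String,
            NavigableMap<String,
            NavigableMap<Long, String>>>> rows = new TreeMap<>();

    public void put(String row, String family, String qualifier, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .computeIfAbsent(qualifier, q -> new TreeMap<>())
            .put(timestamp, value);
    }

    // Reading a cell returns the newest version at or before the given timestamp.
    public String get(String row, String family, String qualifier, long asOf) {
        NavigableMap<Long, String> versions = rows
                .getOrDefault(row, new TreeMap<>())
                .getOrDefault(family, new TreeMap<>())
                .getOrDefault(qualifier, new TreeMap<>());
        Map.Entry<Long, String> entry = versions.floorEntry(asOf);
        return entry == null ? null : entry.getValue();
    }
}
```

Putting a second value into the same row:family:column coordinates with a later timestamp adds a new version rather than overwriting the old one, mirroring slide 14.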
14. All cells are created with a timestamp
Column family defines how many versions of a cell to keep
Updates always create a new cell
Deletes create a tombstone (more on that later)
Queries can include an “as-of” timestamp to return point-in-time values
15. HBase deletes are a form of write called a “tombstone”
Indicates that “beyond this point any previously written value is dead”
Old values can still be read using point-in-time queries

Timestamp  Write Type       Resulting Value  Point-In-Time Value “as of” T+1
T+0        PUT (“Foo”)      “Foo”            “Foo”
T+1        PUT (“Bar”)      “Bar”            “Bar”
T+2        DELETE           <none>           “Bar”
T+3        PUT (“Foo Too”)  “Foo Too”        “Bar”
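Tombstone semantics are easy to model: a delete is just another write, and a point-in-time read sees whatever the newest write at or before its timestamp says. A toy plain-Java sketch (not the HBase API) of one cell's history:

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// One cell's version history. Deletes write a tombstone marker; reads "as of"
// a timestamp pick the newest write at or before that timestamp.
public class CellHistory {
    // Sentinel compared by identity (==) so it can never collide with user data.
    private static final String TOMBSTONE = new String("<tombstone>");
    private final NavigableMap<Long, String> writes = new TreeMap<>();

    public void put(long ts, String value) { writes.put(ts, value); }
    public void delete(long ts)            { writes.put(ts, TOMBSTONE); }

    // Point-in-time read: a tombstone hides every value written before it.
    public String asOf(long ts) {
        Map.Entry<Long, String> e = writes.floorEntry(ts);
        if (e == null || e.getValue() == TOMBSTONE) return null;
        return e.getValue();
    }
}
```

Replaying the table above: `asOf(1)` returns “Bar”, `asOf(2)` returns nothing (the tombstone hides “Bar”), and `asOf(3)` returns “Foo Too” again.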
16. Requirement: Store real-time stock tick data

Ticker  Timestamp     Sequence  Bid     Ask
IBM     09:15:03:001  1         179.16  179.18
MSFT    09:15:04:112  2         28.25   28.27
GOOG    09:15:04:114  3         624.94  624.99
IBM     09:15:04:155  4         179.18  179.19

Requirement: Accommodate many simultaneous readers & writers
Requirement: Allow for reading of current price for any ticker at any point in time
18. Row Key:
[Ticker].[Rev_Timestamp].[Rev_Sequence_Number]

Family:Column:
Prices:Bid
Prices:Ask

HBase throughput will scale linearly with # of nodes
No need to keep separate “latest price” table
A scan starting at “ticker” will always return the latest price row
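The trick in this row key design is reversing the timestamp and sequence number (subtracting them from `Long.MAX_VALUE`) so that newer ticks sort first. A sketch — the class and format are illustrative, not from the slides:

```java
// Builds the slide's [Ticker].[Rev_Timestamp].[Rev_Sequence_Number] row key.
// Reversed values make the NEWEST tick sort first, so a scan starting at the
// bare ticker prefix hits the latest price immediately.
public class TickRowKey {
    public static String build(String ticker, long timestampMillis, long sequence) {
        // Zero-pad to a fixed width so lexicographic order matches numeric order.
        return String.format("%s.%019d.%019d",
                ticker,
                Long.MAX_VALUE - timestampMillis,
                Long.MAX_VALUE - sequence);
    }
}
```

With this key, the newer of two IBM ticks compares lexicographically *before* the older one, which is exactly what makes the “scan from the ticker prefix” query return the current price first.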
19. HBase scales horizontally
Needs to split data over many RegionServers
Regions are the unit of scale
20. All HBase tables are broken into 1 or more regions
Regions have a start row key and an end row key
Each Region lives on exactly one RegionServer
RegionServers may host many Regions
When RegionServers die, Master detects this and assigns Regions to other RegionServers
21. “Users” Table

.META. Table:
Table  Region                 Server
Users  “Aaron” – “George”     Node01
Users  “George” – “Matthew”   Node02
Users  “Matthew” – “Zachary”  Node01

Row Keys in Region “Aaron” – “George”: “Aaron”, “Bob”, “Chris”
Row Keys in Region “George” – “Matthew”: “George”
Row Keys in Region “Matthew” – “Zachary”: “Matthew”, “Nancy”, “Zachary”
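Because regions are sorted by start key, finding the region (and hence the server) for a row key is a floor lookup: the region whose start key is the greatest one not exceeding the key. A plain-Java sketch of that lookup (node names mirror the slide and are illustrative):

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of the .META. lookup: region start key -> hosting server.
// A row key belongs to the region with the greatest start key <= the row key.
public class RegionLocator {
    private final NavigableMap<String, String> regionsByStartKey = new TreeMap<>();

    public void addRegion(String startKey, String server) {
        regionsByStartKey.put(startKey, server);
    }

    public String serverFor(String rowKey) {
        Map.Entry<String, String> e = regionsByStartKey.floorEntry(rowKey);
        return e == null ? null : e.getValue();
    }
}
```

Using the slide's layout, “Bob” falls in the “Aaron” – “George” region on Node01, while “George” itself starts the second region on Node02.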
24. ZooKeeper
Keeps track of which server is the current HBase Master
HBase Master
Keeps track of Region/RegionServer mapping
Manages the -ROOT- and .META. tables
Responsible for updating ZooKeeper when these change
25. RegionServer
Stores table regions
Clients
Need to be smarter than RDBMS clients
First connect to ZooKeeper to get the RegionServer for a given Table/Region
Then connect directly to the RegionServer to interact with the data
All connections over Hadoop RPC – non-JVM clients use a proxy (Thrift or REST (Stargate))
27. HBase Master is not necessarily a single point of failure (SPOF)
Multiple masters can be running
Current ‘active’ Master controlled via ZooKeeper
Make sure you have enough ZooKeeper nodes!
Master is not needed for client connectivity
Clients connect directly to ZooKeeper to find Regions
Everything Master does can be put off until one is elected
29. HBase tolerates RegionServer failure when running on HDFS
Data is replicated by HDFS (dfs.replication setting)
Lots of issues around fsync, failure before data is flushed – some probably still not fixed
Thus, data can still be lost if a node fails after a write
HDFS NameNode is still a SPOF, even for HBase
30. Similar to the log in many RDBMSs
All operations by default written to the log before considered ‘committed’ (can be overridden for ‘disposable fast writes’)
Log can be replayed when a region is moved to another RegionServer
One WAL per RegionServer
Flushed periodically (10s by default)
[Diagram: Writes go to both the WAL and the MemStore; the MemStore is flushed to HFiles when it gets too big]
32. A RegionServer is not guaranteed to be on the same physical node as its data
Compaction causes the RegionServer to write preferentially to the local node
But this is a function of the HDFS client, not HBase
33. All data is in memory initially (memstore)
HBase is a write-only system
Modifications and deletes are just writes with later timestamps
Function of HDFS being append-only
Eventually old writes need to be discarded
2 Types of Compactions:
Minor
Major
34. All HBase edits are initially stored in memory (memstore)
Flushes occur when the memstore reaches a certain size
By default 67,108,864 bytes
Controlled by the hbase.hregion.memstore.flush.size configuration property
Each flush creates a new HFile
35. Triggered when a certain number of HFiles are created for a given Region Store (+ some other conditions)
By default 3 HFiles
Controlled by the hbase.hstore.compactionThreshold configuration property
Compacts most recent HFiles into one
By default, uses the RegionServer-local HDFS node
Does not eliminate deletes
Only touches most recent HFiles
NOTE: All column families are compacted at once (this might change in the future)
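Both tuning knobs named on slides 34 and 35 are ordinary configuration properties. A minimal hbase-site.xml fragment, using the default values the slides quote:

```xml
<!-- hbase-site.xml fragment; values shown are the defaults cited in the slides -->
<configuration>
  <!-- Flush the memstore to a new HFile once it reaches 64 MB -->
  <property>
    <name>hbase.hregion.memstore.flush.size</name>
    <value>67108864</value>
  </property>
  <!-- Trigger a minor compaction once a store holds 3 HFiles -->
  <property>
    <name>hbase.hstore.compactionThreshold</name>
    <value>3</value>
  </property>
</configuration>
```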
36. Triggered every 24 hours (with random offset) or manually
Large HBase installations usually leave this for manual operators
Re-writes all HFiles into one
Processes deletes
Eliminates tombstones
Erases earlier entries
37. HBase does not have transactions
However:
Row-level modifications are atomic: All modifications to a row will succeed or fail as a unit
Gets are consistent for a given point in time
▪ But Scans may return 2 rows from different points in time
All data read has been ‘durably stored’
▪ Does NOT mean flushed to disk – can still be lost!
39. DO: Design your schema for linear range scans on your most common queries.
Scans are the most efficient way to query a lot of rows quickly
DON’T: Use more than 2 or 3 column families.
Some operations (flushing and compacting) operate on the whole row
DO: Be aware of the relative cardinality of column families
Wildly differing cardinality leads to sparsity and bad scanning results.
40. DO: Be mindful of the size of your row and column keys
They are used in indexes and queries, can be quite large!
DON’T: Use monotonically increasing row keys
Can lead to hotspots on writes
DO: Store timestamp keys in reverse
Rows in a table need to be read in order, usually you want the most recent
41. DO: Query single rows using exact-match on key (Gets) or Scans for multiple rows
Scans allow efficient I/O vs. multiple gets
DON’T: Use regex-based or non-prefix column filters
Very inefficient
DO: Tune the scan cache and batch size parameters
Drastically improves performance when returning lots of rows
45. Requirement: Store an arbitrary set of preferences for all users
Requirement: Each user may choose to store a different set of preferences
Requirement: Preferences may be of different data types (Strings, Integers, etc)
Requirement: Developers will add new preference options all the time, so we shouldn’t need to modify the database structure when adding them
46. One possible RDBMS solution:
Key/Value table
All values as strings
Flexible, but wastes space

Keys:  Column:          Data Type:
PK     UserID           Int
PK     PreferenceName   Varchar
       PreferenceValue  Varchar
47. Store all preferences in the Preferences column family
Preference name as column name, preference value as (serialized) byte array:
HBase client library provides methods for serializing many common data types

Row Key:  Family:      Column:    Value:
Chris     Preferences  Age        30
                       Hometown   “Mineola, NY”
Joe       Preferences  Birthdate  11/13/1987
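Since every HBase cell is just a byte array, the client library's serialization helpers boil down to converting typed values to and from bytes. A plain-Java sketch of that idea using java.nio, so it runs without HBase on the classpath (the class name is illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Converts typed preference values to and from the raw byte arrays an
// HBase cell would store, mimicking what the client library's helpers do.
public class PreferenceCodec {
    public static byte[] fromInt(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
    }

    public static int toInt(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    public static byte[] fromString(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    public static String asString(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

So Chris's `Preferences:Age` cell would hold the 4 bytes of the int 30, while `Preferences:Hometown` holds the UTF-8 bytes of “Mineola, NY” – two different data types in the same column family, with no schema change needed to add a third.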