This document summarizes a webinar about data modeling and indexing for Apache Accumulo using Sqrrl. It discusses Accumulo and Sqrrl technology, including table designs for dynamic documents, graphs and inverted indexes. It also describes how Sqrrl Enterprise allows building advanced indexes and the real-time operational applications it enables.
This talk introduces the new Keras interface for R. This is significant because it opens up all the great innovation in Keras with a TensorFlow backend. The talk walks through the history of connecting R and Python, the importance of Keras, and an overview of a simple image classification project using this new interface.
Event Sourcing is a powerful way to think about domain objects and transaction processing. Rather than persisting current state, event sourcing writes an immutable log of deltas to the database. Application state at any point in the past is derived simply by replaying the event history. Event sourcing is a deceptively radical idea that challenges contemporary notions of transaction processing, while also being a mature pattern with a long history of use in databases, IoT, and finance.
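The core idea of the talk can be sketched in a few lines: instead of storing a balance, store the events and derive the balance by replay. This is a minimal illustration in Python; the `Account`/`Event` names and amounts are invented for the example, not taken from the talk.

```python
# Minimal event-sourcing sketch: state is never stored directly; it is
# derived by replaying an immutable, append-only log of deltas.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Event:
    kind: str      # e.g. "deposited" or "withdrew"
    amount: int

@dataclass
class Account:
    log: list = field(default_factory=list)  # the append-only event log

    def deposit(self, amount):
        self.log.append(Event("deposited", amount))

    def withdraw(self, amount):
        self.log.append(Event("withdrew", amount))

    def balance(self, upto=None):
        """Derive state by replaying the log, optionally only up to a
        given point in history (time travel for free)."""
        events = self.log if upto is None else self.log[:upto]
        total = 0
        for e in events:
            total += e.amount if e.kind == "deposited" else -e.amount
        return total

acct = Account()
acct.deposit(100)
acct.withdraw(30)
acct.deposit(5)
print(acct.balance())        # current state: 75
print(acct.balance(upto=2))  # state after the first two events: 70
```

Because the log is immutable, any past state is recoverable, which is exactly the property the talk highlights.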
Jupyter notebooks are transforming the way we look at computing, coding and problem solving. But is this the only “data scientist experience” that this technology can provide?
In this webinar, Natalino will sketch how you could use Jupyter to create interactive and compelling data science web applications and provide new ways of data exploration and analysis. In the background, these apps are still powered by well understood and documented Jupyter notebooks.
They will present an architecture composed of four parts: a Jupyter server-only gateway, a Scala/Spark Jupyter kernel, a Spark cluster, and an Angular/Bootstrap web application.
Deep dive into the Hadoop/Spark connector for MongoDB. Video of this presentation is available here:
https://www.youtube.com/watch?v=MPPwn1XmhzQ&t=24m54s
This session offers solutions to the problems developers inevitably run into when studying artificial intelligence. With demos, it presents a range of services and use cases covering easy deep learning infrastructure setup, fast model training, and ways to add AI features to existing services.
- Introduction to deep learning infrastructure setup and fast model training
- Use cases for AI-based image recognition and TTS services
The AllTomato project helps people find the best local restaurants based on their Facebook friends' recommendations.
I joined this project as an iOS developer.
Mission: “To publicize the progress of the work to reposition the CAMPING Y CARAVANING brand online, both to business owners and to current and potential users in this sector, promoting its main attributes and working to make it a priority tourism option.”
http://www.campingycaravaning.es
Magic Mirror has proven to be a must-have photo booth for a successful event, be it a promotional event for a top brand, a festive event, or a private party. These slides highlight successful cases of using Magic Mirror on different occasions and in different industries.
What is a distributed data science pipeline, and how to build one with Apache Spark and friends - Andy Petrella
What was a data product before the world changed and got so complex.
Why distributed computing/data science is the solution.
What problems does that add?
How to solve most of them using the right technologies, like Spark Notebook, Spark, Scala, Mesos and so on, in an accompanying framework
Towards a rebirth of data science (by Data Fellas) - Andy Petrella
Nowadays, Data Science is buzzing all over the place.
But what is a so-called Data Scientist?
Some will argue that a Data Scientist is a person able to report and present insights in a data set. Others will say that a Data Scientist can handle a high throughput of values and expose them in services. Yet another definition includes the capacity to create meaningful visualizations on the data.
However, we are entering an age where velocity is key. Not only is the velocity of your data high, but time to market is also shortened. Hence, the time separating the moment you receive a set of data from the moment you can deliver added value is crucial.
In this talk, we’ll review legacy Data Science methodologies and what they meant in terms of delivered work and results.
Afterwards, we’ll move on to the different concepts, techniques, and tools that Data Scientists will have to learn and adopt in order to accomplish their tasks in the age of Big Data.
The talk closes by presenting the Data Fellas view on a solution to these challenges, especially through the Spark Notebook and the Shar3 product we develop.
On-demand slides provide a technical overview of the open source, NoSQL database Apache Accumulo. We will discuss how Accumulo was born out of the security and performance needs of the National Security Agency (NSA) and cover the concept of "cell-level security".
Query Your Streaming Data on Kafka using SQL: Why, How, and What - Hosted by Confluent
"Streaming data is rapidly becoming a key component in modern applications, and Apache Kafka has emerged as a popular and powerful platform for managing and processing these data streams. However, as the volume and complexity of streaming data continue to grow, it becomes increasingly critical to have efficient and effective ways of querying and analyzing this data.
This is where query engines like Apache Flink, Trino, Timeplus, Materialize, and ksqlDB come in. These powerful tools offer flexible and scalable ways of processing and analyzing streaming data in real-time, enabling users to extract valuable insights from their data streams.
In this talk, we will introduce the audience to the world of querying streaming data on Apache Kafka with SQL, compare and contrast the features and capabilities of each of these tools, and provide an in-depth analysis of their respective Pros and Cons. We will also discuss the best practices and scenarios where each tool is most effective.
In conclusion, query engines like Apache Flink, Trino, ksqlDB, Timeplus, and Materialize are useful tools for processing and analyzing streaming data on Kafka. With their ability to extract valuable insights from real-time data streams, these tools are a valuable asset for modern data-driven applications."
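The kind of query these engines express in streaming SQL (for example, a count per key over a tumbling window) can be sketched in plain Python to make the semantics concrete. The event data and function name below are invented for illustration and are not from any of the engines discussed.

```python
# Tumbling-window count per key over an event stream: a plain-Python
# approximation of what a streaming SQL GROUP BY over a time window
# computes continuously.
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    """events: iterable of (timestamp_ms, key) pairs.
    Returns {(window_start_ms, key): count} for each tumbling window."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_ms) * window_ms  # align to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)

stream = [(0, "click"), (400, "click"), (900, "view"),
          (1000, "click"), (1700, "view"), (1999, "view")]
print(tumbling_window_counts(stream, 1000))
# windows [0,1000) and [1000,2000): click counted 2 then 1, view 1 then 2
```

A real engine does this incrementally over an unbounded stream and handles late data and watermarks, which is where the feature comparisons in the talk come in.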
These are the slides from my presentation on Running R in the Database using Oracle R Enterprise. The second half of the presentation was a live demo of Oracle R Enterprise. Unfortunately, the demo is not included in these slides.
Solr™ is the popular, blazing fast open source enterprise search platform from the Apache Lucene™ project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites, including AOL, Yahoo, Buy.com, CNET, CitySearch, Netflix, Zappos, StubHub, Digg, eTrade, Disney, Apple, NASA, and MTV.
In this talk we look to the future by introducing Spark as an alternative to the classic Hadoop MapReduce engine. We describe the most important differences between the two, detail the main components that make up the Spark ecosystem, and introduce the basic concepts needed to start developing simple applications on it.
Agile Data Science 2.0 (O'Reilly 2017) defines a methodology and a software stack with which to apply the methods. *The methodology* seeks to deliver data products in short sprints by going meta and putting the focus on the applied research process itself. *The stack* is but an example of one meeting the requirements that it be utterly scalable and utterly efficient in use by application developers as well as data engineers. It includes everything needed to build a full-blown predictive system: Apache Spark, Apache Kafka, Apache Incubating Airflow, MongoDB, ElasticSearch, Apache Parquet, Python/Flask, JQuery. This talk will cover the full lifecycle of large data application development and will show how to use lessons from agile software engineering to apply data science using this full-stack to build better analytics applications. The entire lifecycle of big data application development is discussed. The system starts with plumbing, moving on to data tables, charts and search, through interactive reports, and building towards predictions in both batch and realtime (and defining the role for both), the deployment of predictive systems and how to iteratively improve predictions that prove valuable.
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ... - DevClub_lv
This talk is about the difficulties they met on the road, and how they overcame some of them. It covers both the technical and the management side.
Alberts Bušinskis (Garuda Dev Giri) is a Yoga & Meditation teacher, IT enthusiast, and software developer. Many years ago he left a very good job in Switzerland and went to the Indian Himalayas to practice Yoga. After realizing that software development is not so different from Yoga, he came back to his homeland, trying to be helpful and inspiring.
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
A talk given by Julian Hyde at Apache: Big Data, Miami, on May 16th 2017.
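The profiling techniques the talk describes (finding keys and functional dependencies by scanning data) can be illustrated with a toy scanner. This is a naive sketch in Python, not Apache Calcite's actual profiler; the table data and column names are made up, and real profilers use sampling and sketches to scale.

```python
# A toy data profiler: scan a table's rows to find candidate unique keys
# and single-column functional dependencies (A -> B), the kinds of
# statistics that can feed a query optimizer's cardinality estimates.
from itertools import permutations

def candidate_keys(rows, columns):
    """A column is a candidate key if all of its values are distinct."""
    keys = []
    for col in columns:
        values = [r[col] for r in rows]
        if len(set(values)) == len(values):
            keys.append(col)
    return keys

def functional_dependencies(rows, columns):
    """A -> B holds if every value of A maps to exactly one value of B."""
    fds = []
    for a, b in permutations(columns, 2):
        mapping = {}
        holds = True
        for r in rows:
            if mapping.setdefault(r[a], r[b]) != r[b]:
                holds = False
                break
        if holds:
            fds.append((a, b))
    return fds

rows = [
    {"id": 1, "zip": "02134", "city": "Boston"},
    {"id": 2, "zip": "02134", "city": "Boston"},
    {"id": 3, "zip": "10001", "city": "New York"},
]
print(candidate_keys(rows, ["id", "zip", "city"]))          # ['id']
print(functional_dependencies(rows, ["id", "zip", "city"]))
```

On this sample, `zip -> city` holds, so an optimizer joining on `zip` can assume at most one `city` per zip code; the algorithms in the talk discover such dependencies without the quadratic brute force used here.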
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio - Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Jennie Wang, Software Engineer (Intel)
Tsai Louie, Software Engineer (Intel)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud - Jaipaul Agonus
This presentation is a real-world case study about moving a large portfolio of batch analytical programs that process 30 billion or more transactions every day, from a proprietary MPP database appliance architecture to the Hadoop ecosystem in the cloud, leveraging Hive, Amazon EMR, and S3.
Data-intensive tasks call for distributed computation. By using one of the many available libraries built on the Hadoop/Spark engine, even a newcomer to distributed computing can take advantage of parallelism using familiar programming languages, without having to worry about the underlying data and task distribution. We present demos of SparkR (the distributed version of R), Koalas (similar to Pandas), and SparkSQL.
Presentation at ASHPC22, the Austrian-Slovenian HPC Meeting - 31 May – 2 June 2022 (https://vsc.ac.at/research/conferences/ashpc22/).
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
A talk given by Julian Hyde at DataWorks Summit, San Jose, on June 14th 2017.
Similar to Sqrrl October Webinar: Data Modeling and Indexing
Leveraging Threat Intelligence to Guide Your Hunts - Sqrrl
This webinar training session covers everything from what threat intelligence is to specific examples of how to hunt with it: applying intel during a tactical hunt, and what you should be looking out for when searching for adversaries on your enterprise network. It is taught by Keith Gilbert, an experienced threat researcher with a background in Digital Forensics and Incident Response.
How to Hunt for Lateral Movement on Your Network - Sqrrl
Once inside your network, most cyber-attacks go sideways. They progressively move deeper into the network, laterally compromising other systems as they search for key assets and data. Would you spot this lateral movement on your enterprise network?
In this training session, we review the various techniques attackers use to spread through a network, which data sets you can use to reliably find them, and how data science techniques can be used to help automate the detection of lateral movement.
Machine Learning for Incident Detection: Getting Started - Sqrrl
This presentation walks you through the uses of machine learning in incident detection and response, outlining some of the basic features of machine learning and specific tools you can use.
Watch the presentation with audio here: https://www.youtube.com/watch?v=4pArapSIu_w
Building a Next-Generation Security Operations Center (SOC) - Sqrrl
So, you need to build a Security Operations Center (SOC)? What does that mean? What does the modern SOC need to do? Learn from Dr. Terry Brugger, who has been doing information security work for over 15 years, including building out a SOC for a large Federal agency and consulting for numerous large enterprises on their security operations.
Watch the presentation with audio here: http://info.sqrrl.com/sqrrl-october-webinar-next-generation-soc
User and Entity Behavior Analytics using the Sqrrl Behavior Graph - Sqrrl
UEBA leverages advanced statistical techniques and machine learning to surface subtle behaviors that are indicative of attacker presence. In this presentation, Sqrrl's Director of Data Science, Chris McCubbin, and Sqrrl's Director of Products, Joe Travaglini, provide an overview of how machine learning and UEBA can be used to detect cyber threats using Sqrrl's Behavior Graph.
Watch the presentation with audio here: http://info.sqrrl.com/april-2016-ueba-webinar-on-demand
Threat Hunting Platforms (Collaboration with SANS Institute) - Sqrrl
Traditional security measures like firewalls, IDS, endpoint protection, and SIEMs are only part of the network security puzzle. Threat hunting is a proactive approach to uncovering threats that lie hidden in your network or system, that can evade more traditional security tools. Go in-depth with Sqrrl and SANS Institute to learn how hunting platforms work.
Watch the recording with audio here: http://info.sqrrl.com/sans-sqrrl-threat-hunting-webcast
Sqrrl and IBM: Threat Hunting for QRadar Users - Sqrrl
This joint webinar, in collaboration with IBM, offers a look at the industry leading Threat Hunting App for IBM QRadar. By combining the threat detection capabilities of QRadar and Sqrrl, security analysts are armed with advanced analytics and visualization to hunt for unknown threats and more efficiently investigate known incidents.
Watch the training with audio here: http://info.sqrrl.com/sqrrl-ibm-threat-hunting-for-qradar-users
Threat Hunting for Command and Control Activity - Sqrrl
Sqrrl's Security Technologist Josh Liburdi provides an overview of how to detect C2 through a combination of automated detection and hunting.
Watch the presentation with audio here: http://info.sqrrl.com/threat-hunting-for-command-and-control-activity
Today's threats demand a more active role in detecting and isolating sophisticated attacks. This must-see presentation provides practical guidance on modernizing your SOC and building out an effective threat hunting program. Ed Amoroso and David Bianco discuss best practices for developing and staffing a modern SOC, including the essential shifts in how to think about threat detection.
Watch the presentation with audio here: http://info.sqrrl.com/webinar-modernizing-your-security-operations
Threat Hunting vs. UEBA: Similarities, Differences, and How They Work Together - Sqrrl
This presentation explains how security teams can leverage hunting and analytics to detect advanced threats faster, more reliably, and with common analyst skill sets. Watch the presentation with audio here: http://info.sqrrl.com/threat-hunting-and-ueba-webinar
In this training session, two leading security experts review how adversaries use DNS to achieve their mission, how to use DNS data as a starting point for launching an investigation, the data science behind automated detection of DNS-based malicious techniques and how DNS tunneling and DGA machine learning algorithms work.
Watch the presentation with audio here: http://info.sqrrl.com/leveraging-dns-for-proactive-investigations
If you follow the trade press, one theme you hear over and over again is that organizations are drowning in alerts. It’s true that we need technological solutions to prioritize and escalate the most important alerts to our analysts, but humans have a critical part to play in this process as well. The quicker they are able to make decisions about the alerts they review, the better they are able to keep up. An incident responder’s most common task is alert triage, the process of investigation and escalation that ultimately results in the creation of security incidents. As crucial as this process is, remarkably little has been written about how to do it correctly and efficiently. In this presentation, learn incident response best practices from Sqrrl security expert David Bianco.
Slides from the webinar led by Ely Kahn and Luis Maldonado discussing strategies to reduce Mean Time to Know in detecting cybersecurity attacks, threats, or data breaches.
Sqrrl Enterprise: Big Data Security Analytics Use Case - Sqrrl
Organizations are utilizing Sqrrl Enterprise to securely integrate vast amounts of multi-structured data (e.g., tens of petabytes) onto a single Big Data platform and then are building real-time applications using this data and Sqrrl Enterprise’s analytical interfaces. The secure integration is enabled by Accumulo’s innovative cell-level security capabilities and Sqrrl Enterprise’s security extensions, such as encryption.
Benchmarking The Apache Accumulo Distributed Key–Value Store - Sqrrl
This paper presents the results of benchmarking the Apache Accumulo distributed table store using the continuous test suite included in its open source distribution.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
1. Securely explore your data
DATA MODELING AND
INDEXING FOR
APACHE ACCUMULO
Sqrrl Webinar Series
October, 2013
Adam Fuchs, CTO
Sqrrl Data, Inc.
2. RECAP
1. Introduction to Sqrrl and Accumulo
2. Security In The Wild
3. Sqrrl and Accumulo Technology
4. The Data-Centric Security Ecosystem
In our September Webinar:
Sqrrl, Apache Accumulo, and Cell-Level Security
Sqrrl Data, Inc. Confidential and Proprietary
3. TODAY’S DISCUSSION
1. Sqrrl and Accumulo Technology Review
2. Table Designs
1. Dynamic Documents
2. Graphs
3. Inverted Indexes
3. Putting It All Together with Sqrrl
Data Modeling and Indexing for Apache Accumulo
4. LAYERED ARCHITECTURE
Turtles all the way down...
Application
Sqrrl API over Apache Thrift RPC (JSON, Graph, Aggregation, Search, etc.)
Sqrrl Enterprise
Accumulo RPC (Sorted Key/Value I/O)
Hadoop RPC (File I/O)
5. An Accumulo key is a 5-tuple, consisting of:
- Row: Controls Atomicity
- Column Family: Controls Locality
- Column Qualifier: Controls Uniqueness
- Visibility Label: Controls Access
- Timestamp: Controls Versioning
Accumulo Key/Value Example:

Row       Col. Fam.     Col. Qual.     Visibility   Timestamp  Value
John Doe  Notes         PCP            PCP_JD       20120912   Patient suffers from an acute …
John Doe  Test Results  Cholesterol    JD|PCP_JD    20120912   183
John Doe  Test Results  Mental Health  JD|PSYCH_JD  20120801   Pass
John Doe  Test Results  X-Ray          JD|PHYS_JD   20120513   1010110110100…

ACCUMULO DATA FORMAT
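The 5-tuple key and cell-level visibility on this slide can be sketched in a few lines. This is an illustrative Python model of the data model only, not the real Accumulo client API: entries sort by key, and a scan shows a cell only if the reader's authorizations satisfy its visibility label (simplified here to OR-joined tokens; real Accumulo labels also support `&` and parentheses).

```python
# Toy model of Accumulo's sorted key/value store with cell-level security.
from collections import namedtuple

Key = namedtuple("Key", "row cf cq visibility ts")

def visible(label, auths):
    """Simplified visibility check: label is a '|'-joined token list
    with OR semantics; the reader needs at least one matching auth."""
    return any(tok in auths for tok in label.split("|"))

# Cells from the slide's example table (values abbreviated).
table = {
    Key("John Doe", "Notes", "PCP", "PCP_JD", 20120912): "Patient suffers from an acute ...",
    Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912): "183",
    Key("John Doe", "Test Results", "Mental Health", "JD|PSYCH_JD", 20120801): "Pass",
}

def scan(table, auths):
    for key in sorted(table):               # Accumulo keeps keys sorted
        if visible(key.visibility, auths):  # filter each cell on access
            yield key.cq, table[key]

# A primary-care physician's scan sees the note and cholesterol result,
# but the mental health cell is filtered out.
print(dict(scan(table, {"PCP_JD"})))
```

This mirrors the slide: the row groups the patient's cells atomically, the column family controls locality, and each cell carries its own visibility label.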
8. TODAY’S DISCUSSION
1. Sqrrl and Accumulo Technology Review
2. Table Designs
1. Dynamic Documents
2. Graphs
3. Inverted Indexes
3. Putting It All Together with Sqrrl
Data Modeling and Indexing for Apache Accumulo
17. TODAY’S DISCUSSION
1. Sqrrl and Accumulo Technology Review
2. Table Designs
1. Dynamic Documents
2. Graphs
3. Inverted Indexes
3. Putting It All Together with Sqrrl
Data Modeling and Indexing for Apache Accumulo
18. SQRRL ENTERPRISE
• Dynamic Documents
• JSON I/O support
• Cell-level Security and Efficient Aggregation Extensions
• Dynamic Graphs
• Co-partitioned with Documents for Integrated Search and Discovery
• Search
• Lucene Query Syntax
• Accumulo Indexes Preserve Security Model
• Processing
• SQL-Like Language for Transforming and Aggregating Results
• Parallel Slicing and Extraction
Simple API for Advanced Accumulo Usage
20. HOW TO LEARN MORE
Download our White Paper
- www.sqrrl.com/whitepaper
Watch a video
- www.sqrrl.com/downloads#videos
Request a demo or one-on-one workshop
- www.sqrrl.com/contact
Come meet us
- Accumulo Meetup (October 28, New York)
- Strata + Hadoop World (October 28-30, New York)
- IBM IOD (November 4-7, Las Vegas)
- SC13 (November 18-21, Denver)
21. THANK YOU
Thanks for attending!
To keep up to date with Sqrrl, check out our social media sites:
www.twitter.com/sqrrl_inc
www.linkedin.com/company/sqrrl