Stratio big data spain

•Download as PPTX, PDF•

1 like•1,854 views

This document discusses efficient data mining solutions using Hadoop, Cassandra, and Spark. It describes Cassandra as a fast, robust, and efficient key-value database but notes it has limitations for certain queries. Spark is presented as an alternative to Hadoop MapReduce that can be 100 times faster for interactive algorithms and data mining. The document demonstrates how Spark can integrate with Cassandra to allow distributed data processing over Cassandra data without needing to clone the data or use other databases. Future extensions are proposed to directly access Cassandra's SSTable files from Spark and extend CQL3 to leverage Spark.

Stratio Deep
An efficient data mining solution
“Two and two are four?
Sometimes… Sometimes they are five.”

G. Orwell

#StratioBD

Goals
•
•
•
•

#StratioBD

Why do you need Cassandra?
What is the problem?
Why do you need Spark?
How do they work together?

Cassandra
•
•
•
•

#StratioBD

Based on DynamoDB…
Replication, Key/Value, P2P
And based on Big Table…
Column oriented

NO
BOTTLENECK

DECENTRALIZED

REPLICATED

Case A

One User – Lot of data
#StratioBD

Case C

Many user – Lot of data
#StratioBD

Crawler app
100M
Indexed
pages

3k
reads

Cassandra, I choose you

#StratioBD

Query time

< 1s

New query
“I need to find all the reference to the domain ACME.
I need the answer by Friday.”

#StratioBD

Problem
Cassandra is not well suited to resolved this type of

queries
You need to design the schema with the query in mind

#StratioBD

What options do we have?

•
•
•

#StratioBD

Run Hive Query on top of C*
Write an ETL script and load data into another DB
Clone the cluster

What options do we have?
Run Hive Query on top of C*
Write ETL scripts and load into another DB
Clone the cluster

#StratioBD

And now… what can we do?

“We can't solve problems by using the same kind
of thinking we used when we created them”

Albert Einstein

#StratioBD

Spark
•
•
•
•
•

Alternative to MapReduce
A low latency cluster computing system
For very large datasets
Create by UC Berkeley AMP Lab in 2010.
May be 100 times faster than MapReduce for:



#StratioBD

Interactive algorithms.
Interactive data mining

Logistic regression in
Spark vs Hadoop

SOURCE | http://spark.incubator.apache.org/

#StratioBD

Spark and Cassandra

Integration points

#StratioBD

Cassandra’s HDFS abstraction layer
Advantantages:
•

Easily integrates with legacy systems.

Drawbacks:
•
•

Very high-level: no access to low level Cassandra’s features.
Questionable performance.

INTEGRATION POINTS: HDFS OVER CASSANDRA

#StratioBD

Cassandra’s Hadoop Interface
•

Thrift protocol

•

CQL3 (our implementation)


Uses the novel Cassandra’s CqlPagingInputFormat

INTEGRATION POINTS: HDFS OVER CASSANDRA

#StratioBD

CQL3 Integration
•
•
•

Supports CQL3 features
Respects data locality
Good compromise between
performance / implementation complexity

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

#StratioBD

CQL3 Integration (II)
Provides a Java friendly API:
•

Developers map Column Families to custom serializable POJOs

•

StratioDeep wraps the complexity of performing Spark calculations

directly over the user provided POJOs.

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

#StratioBD

CQL3 Integration (III)
Drawbacks:
•

Still not preforming as well as we’d like


•

No analyst-friendly interface:


#StratioBD

Uses Cassandra’s Hadoop Interface

No SQL-like query features

INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

Future extensions
What are we currently working on?

Bring the integration to another level:
•
•
•

#StratioBD

Dump Cassandra’s Hadoop Interface
Direct access to Cassandra’s SSTable(s) files.
Extend Cassandra’s CQL3 to make use of Spark’s distributed
data processing power

Integrating C* and Spark gives us a system that combines the best of both worlds. The goal of this integration is to obtain a better result than using Spark over HDFS because Cassandra´s philosophy is much closer to RDD's philosophy than what HDFS is. The goal with Cassandra is to have a system that mines all the information stored in C* in a much more efficient way than having the information stored in HDFS. Cassandra data storage and Spark data mining power: an unrivalled mix.

R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ...

Spark Summit

The central premise of DataXu is to apply data science to better marketing. At its core, is the Real Time Bidding Platform that processes 2 Petabytes of data per day and responds to ad auctions at a rate of 2.1 million requests per second across 5 different continents. Serving on top of this platform is Dataxu’s analytics engine that gives their clients insightful analytics reports addressed towards client marketing business questions. Some common requirements for both these platforms are the ability to do real-time processing, scalable machine learning, and ad-hoc analytics. This talk will showcase DataXu’s successful use-cases of using the Apache Spark framework and Databricks to address all of the above challenges while maintaining its agility and rapid prototyping strengths to take a product from initial R&D phase to full production. The team will share their best practices and highlight the steps of large scale Spark ETL processing, model testing, all the way through to interactive analytics.

Trends for Big Data and Apache Spark in 2017 by Matei Zaharia

Spark Summit

Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, I’ll cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, I’ll talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.

Why spark by Stratio - v.1.0

Stratio

Tactical Data Science Tips: Python and Spark Together

Databricks

Running Spark and Python data science workloads can be challenging given the complexity of the various data science tools in the ecosystem like sci-kit Learn, TensorFlow, Spark, Pandas, and MLlib. All these various tools and architectures, provide important trade-offs to consider when it comes to moving to proofs of concept and going to production. While proof of concepts may be relatively straightforward, moving to production can be challenging because it’s difficult to understand not just the short term effort to develop a solution, but the long term cost of supporting projects over the long term. This talk will discuss important tactical patterns for evaluating projects, running proofs of concept to inform going to production, and finally the key tactics we use internally at Databricks to take data and machine learning projects into production. This session will cover some architectural choices involving Spark, PySpark, Pandas, notebooks, various machine learning toolkits, as well as frameworks and technologies necessary to support them.

Spark Summit EU talk by Tug Grall

Spark Summit

DASK and Apache Spark

Databricks

For a Python driven Data Science team, DASK presents a very obvious logical next step for distributed analysis. However, today the de-facto standard choice for exact same purpose is Apache Spark. DASK is a pure Python framework, which does more of same i.e. it allows one to run the same Pandas or NumPy code either locally or on a cluster. Whereas, Apache Spark brings about a learning curve involving a new API and execution model although with a Python wrapper. Given the above statement, do we even need to compare and contrast to make a choice? Shouldn't DASK be the default choice? Well, that's what this session is about. It goes in detail explaining the various viewpoints and dimensions that need to be considered to pick one over other.

Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...

Spark Summit

Streaming applications have often been complex to design and maintain because of the significant upfront infrastructure investment required. However, with the advent of Spark an easy transition to stream processing is now available, enabling personalization applications and experiments to consume near real-time data without massive development cycles. Our decision to evaluate Spark as our stream processing engine was primarily led by the following considerations: 1) Ease of development for the team (already familiar with spark for batch), 2) the scope/requirements of our problem, 3) re-usability of code from spark batch jobs, and 4) Spark support from infrastructure teams within the company. In this session, we will present our experience using Spark for stream processing unbounded datasets in the personalization space. The datasets consisted of, but were not limited, to the stream of playback events that are used as feedback for all personalization algorithms. These plays are used to extract specific behaviors which are highly predictive of a customer’s enjoyment of our service. This dataset is massive and has to be further enriched by other online and offline Netflix data sources. These datasets, when consumed by our machine learning models, directly affect the customer’s personalized experience, which means that the impact is high and tolerance for failure is low. We’ll talk about the experiments we did to compare Spark with other streaming solutions like Apache Flink , the impact that we had on our customers, and most importantly, the challenges we faced. Take-aways for the audience: 1) A great example of stream processing large, personalization datasets at scale. 2) An increased awareness of the costs/requirements for making the transition from batch to streaming successfully. 3) Exposure to some of the technical challenges that should be expected along the way.

Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...

Spark Summit

Drizzle is a low latency execution engine for Apache Spark that is targeted at stream processing and iterative workloads. Currently, Spark uses a BSP computation model, and notifies the scheduler at the end of each task. Invoking the scheduler at the end of each task adds overheads and results in decreased throughput and increased latency. In Drizzle, we introduce group scheduling, where multiple batches (or a group) of computation are scheduled at once. This helps decouple the granularity of task execution from scheduling and amortize the costs of task serialization and launch. Our experiments on a 128 node EC2 cluster show that Drizzle can achieve end-to-end streaming latencies of less than 100ms and can get up to 3.5x lower latency than Spark Streaming. Compared to Apache Flink, a record-at-a-time streaming system, we show that Drizzle can recover around 4x faster from failures and that Drizzle has up to 13x lower latency during recovery.

Spark Summit San Francisco 2016 - Ali Ghodsi Keynote

Databricks

Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Databricks

Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R

Databricks

This talk discusses integrating common data science tools like Python pandas, scikit-learn, and R with MLlib, Spark’s distributed Machine Learning (ML) library. Integration is simple; migration to distributed ML can be done lazily; and scaling to big data can significantly improve accuracy. We demonstrate integration with a simple data science workflow. Data scientists often encounter scaling bottlenecks with single-machine ML tools. Yet the overhead in migrating to a distributed workflow can seem daunting. In this talk, we demonstrate such a migration, taking advantage of Spark and MLlib’s integration with common ML libraries. We begin with a small dataset which runs on a single machine. Increasing the size, we hit bottlenecks in various parts of the workflow: hyperparameter tuning, then ETL, and eventually the core learning algorithm. As we hit each bottleneck, we parallelize that part of the workflow using Spark and MLlib. As we increase the dataset and model size, we can see significant gains in accuracy. We end with results demonstrating the impressive scalability of MLlib algorithms. With accuracy comparable to traditional ML libraries, combined with state-of-the-art distributed scalability, MLlib is a valuable new tool for the modern data scientist.

Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...

Spark Summit

Cybercrime is big business. Gartner reports worldwide security spending at $80B, with annual losses totalling more than $1.2T (in 2015). Small to medium sized businesses now account for more than half of the attacks targeting enterprises today. The threat actors behind these attacks are continually shifting their techniques and toolkits to evade the security defenses that businesses commonly use. Thanks to the growing frequency and complexity of attacks, the task of identifying and mitigating security-related events has become increasingly difficult. At eSentire, we use a combination of data and human analytics to identify, respond to and mitigate cyber threats in real-time. We capture all network traffic on our customers’ networks, hence ingesting a large amount of time-series data. We process the data as it is being streamed into our system to extract relevant threat insights and block attacks in real-time. Furthermore, we enable our cybersecurity analysts to perform in-depth investigations to: i) confirm attacks and ii) identify threats that analytical models miss. Having security experts in the loop provides feedback to our analytics engine, thereby improving the overall threat detection effectiveness. So how exactly can you build an analytics pipeline to handle a large amount of time-series/event-driven data? How do you build the tools that allow people to query this data with the expectation of mission-critical response times? In this presentation, William Callaghan will focus on the challenges faced and lessons learned in building a human-in-the loop cyber threat analytics pipeline. They will discuss the topic of analytics in cybersecurity and highlight the use of technologies such as Spark Streaming/SQL, Cassandra, Kafka and Alluxio in creating an analytics architecture with missions-critical response times.

Rental Cars and Industrialized Learning to Rank with Sean Downes

Databricks

Data can be viewed as the exhaust of online activity. With the rise of cloud-based data platforms, barriers to data storage and transfer have crumbled. The demand for creative applications and learning from those datasets has accelerated. Rapid acceleration can quickly accrue disorder, and disorderly data design can turn the deepest data lake into an impenetrable swamp. In this talk, I will discuss the evolution of the data science workflow at Expedia with a special emphasis on Learning to Rank problems. From the heroic early days of ad-hoc Spark exploration to our first production sort model on the cloud, we will explore the process of industrializing the workflow. Layered over our story, I will share some best practices and suggestions on how to keep your data productive, or even pull your organization out of the data swamp.

Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin

Spark Summit

Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...

Spark Summit

Across all assets globally, Shell carries a huge stock of spare part inventory which ties up large quantities of working capital. Over the past 2 years an interdisciplinary project team has produced a tool, Inventory Optimization Analytics solution (IOTA), based on advanced analytical methods, that helps assets optimise stock levels and purchase strategies. To calculate the recommended stocking inventory level requirement for a material the Data Science team have written a Markov Chain Monte Carlo (MCMC) bootstrapping statistical model in R. Cumulatively, the computational task is large but, fortunately, is one of an embarrassingly parallel nature because the model can be applied independently to each material. The original solution which utilised the R “parallel” package was deployed on a single 48 core PC and took 48 hours to run. In this presentation, we describe how we moved the original solution to a distributed cloud-based Apache Spark framework. Using the new R User Defined Functions API in Apache Spark and with only a minimal amount of code changes the computational run time was reduced to 4 hours. A restructuring of the architecture to “pipeline” the problem resulted in a run time of less than 1 hour. This use case is important because it verifies the scalability and performance of SparkR.

AI on Spark for Malware Analysis and Anomalous Threat Detection

Databricks

At Avast, we believe everyone has the right to be safe. We are dedicated to creating a world that provides safety and privacy for all, not matter where you are, who you are, or how you connect. With over 1.5 billion attacks stopped and 30 million new executable files monthly, big data pipelines are crucial for the security of our customers. At Avast we are leveraging Apache Spark machine learning libraries and TensorflowOnSpark for a variety of tasks ranging from marketing and advertisement, through network security to malware detection. This talk will cover our main cybersecurity usecases of Spark. After describing our cluster environment we will first demonstrate anomaly detection on time series of threats. Having thousands of types of attacks and malware, AI helps human analysts select and focus on most urgent or dire threats. We will walk through our setup for distributed training of deep neural networks with Tensorflow to deploying and monitoring of a streaming anomaly detection application with trained model. Next we will show how we use Spark for analysis and clustering of malicious files and large scale experimentation to automatically process and handle changes in malware. In the end, we will give comparison to other tools we used for solving those problems.

Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...

Spark Summit

Reinsurance company’s core competencies include the quantification of risk associated with catastrophes, such as hurricanes and earthquakes. Various so-called catastrophe models are available publicly, some commercial and some open-source. The volume of data processed by such “cat models” requires Big Data and High Performance capabilities. This is clearly reflected in the landscape of public models. And the observed trend is towards more and more detailed inputs, as well as outputs. This makes scalability an important concern. Companies that deal with catastrophe risk commonly use one or several public cat models. If they wish to differentiate themselves from the market, they may build internal proprietary models, in particular in areas that are not covered by existing models. The result is a deeper understanding and an independent quantification of risk, both of which can lead to a competitive edge.

Apache Spark Briefing

Thomas W. Dinsmore

Implementing the Lambda Architecture efficiently with Apache Spark

DataWorks Summit

Spark Summit EU talk by Christos Erotocritou

Spark Summit

High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...

Spark Summit

As advanced sensor technologies are becoming widely deployed in the energy industry, the availability of higher-frequency data results in both analytical benefits and computational costs. To an energy forecaster or data scientist, some of these benefits might include enhanced predictive performance from forecasting models as well as improved pattern recognition in energy consumption across building types, economic sectors, and geographies. To a utility or electricity service provider, these benefits might include significantly deeper insights into their diverse customer base. However, these advantages can come with a high computational price tag. With Spark 2.0, User-Defined Functions can be applied across grouped SparkDataFrames in the SparkR API to solve the multivariate optimization and model selection problems typically required for fitting site-level models. This recently added feature of Spark 2.0 on Databricks has allowed DNV GL to efficiently fit predictive models that relate weather, electricity, water, and gas consumption across virtually any number of buildings.

Apache Spark At Scale in the Cloud

Databricks

Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on. You may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process. It’s frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal. By using Spark UI and simple metrics, explore how to diagnose and remedy issues on jobs: Sizing the cluster based on your dataset (shuffle partitions) Ingestion challenges – well begun is half done (globbing S3, small files) Managing memory (sorting GC – when to go parallel, when to go G1, when offheap can help you) Shuffle (give a little to get a lot – configs for better out of box shuffle) – Spill (partitioning for the win) Scheduling (FAIR vs FIFO, is there a difference for your pipeline?) Caching and persistence (it’s the cost of doing business, so what are your options?) Fault tolerance (blacklisting, speculation, task reaping) Making the best of a bad deal (skew joins, windowing, UDFs, very large query plans) Writing to S3 (dealing with write partitions, HDFS and s3DistCp vs writing directly to S3)

Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...

Spark Summit

In this presentation, we are going to talk about the state of the art infrastructure we have established at Walmart Labs for the Search product using Spark Streaming and DataFrames. First, we have been able to successfully use multiple micro batch spark streaming pipelines to update and process information like product availability, pick up today etc. along with updating our product catalog information in our search index to up to 10,000 kafka events per sec in near real-time. Earlier, all the product catalog changes in the index had a 24 hour delay, using Spark Streaming we have made it possible to see these changes in near real-time. This addition has provided a great boost to the business by giving the end-costumers instant access to features likes availability of a product, store pick up, etc. Second, we have built a scalable anomaly detection framework purely using Spark Data Frames that is being used by our data pipelines to detect abnormality in search data. Anomaly detection is an important problem not only in the search domain but also many domains such as performance monitoring, fraud detection, etc. During this, we realized that not only are Spark DataFrames able to process information faster but also are more flexible to work with. One could write hive like queries, pig like code, UDFs, UDAFs, python like code etc. all at the same place very easily and can build DataFrame template which can be used and reused by multiple teams effectively. We believe that if implemented correctly Spark Data Frames can potentially replace hive/pig in big data space and have the potential of becoming unified data language. We conclude that Spark Streaming and Data Frames are the key to processing extremely large streams of data in real-time with ease of use.

Learn to use Stratio Crossdata

Álvaro Agea Herradón

This is achieved thanks to its generic architecture and the definition of a custom SQL-like language. Our language augments the classical SQL data manipulation language in order to add support for streaming queries. From the point of view of the user, a common logical view of the existing catalogs and datastores is presented independently of which cluster or technology stores a particular table. Supporting multiple architectures imposes two main challenges: how to normalize the access to the datastores, and how to cope with datastore limitations. In order to be able to access multiple datastore technologies Crossdata defines a common unifying interface containing a basic set of operations that a datastore may support. New connectors can be easily added to Crossdata to increase its connectivity

Why Spark?

Álvaro Agea Herradón

What's hot

Big Data Ecosystem - 1000 Simulated Drones

Espeo Software

Hadoop at ayasdi

Mohit Jaggi

Scala: the unpredicted lingua franca for data science

Andy Petrella

Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...

Spark Summit

Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...

Spark Summit

Spark Summit San Francisco 2016 - Ali Ghodsi Keynote

Databricks

Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Databricks

Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R

Databricks

Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...

Spark Summit

Rental Cars and Industrialized Learning to Rank with Sean Downes

Databricks

Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin

Spark Summit

Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...

Spark Summit

AI on Spark for Malware Analysis and Anomalous Threat Detection

Databricks

Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...

Spark Summit

Apache Spark Briefing

Thomas W. Dinsmore

Implementing the Lambda Architecture efficiently with Apache Spark

DataWorks Summit

Spark Summit EU talk by Christos Erotocritou

Spark Summit

High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...

Spark Summit

Apache Spark At Scale in the Cloud

Databricks

Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...

Spark Summit

What's hot (20)

Big Data Ecosystem - 1000 Simulated Drones

Hadoop at ayasdi

Scala: the unpredicted lingua franca for data science

Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...

Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...

Spark Summit San Francisco 2016 - Ali Ghodsi Keynote

Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R

Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...

Rental Cars and Industrialized Learning to Rank with Sean Downes

Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin

Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...

AI on Spark for Malware Analysis and Anomalous Threat Detection

Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...

Apache Spark Briefing

Implementing the Lambda Architecture efficiently with Apache Spark

Spark Summit EU talk by Christos Erotocritou

High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...

Apache Spark At Scale in the Cloud

Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...

Viewers also liked

Learn to use Stratio Crossdata

Álvaro Agea Herradón

Why Spark?

Álvaro Agea Herradón

StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit ...

Álvaro Agea Herradón

We present StratioDeep, an integration layer between the Spark distributed computing framework and Cassandra, a NoSQL distributed database. Cassandra brings together the distributed system technologies from Dynamo and the data model from Google’s BigTable. Like Dynamo, Cassandra is eventually consistent and based on a P2P model without a single point of failure. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems. For these reasons, C* is one of the most popular NoSQL databases, but one of its handicaps is that it’s necessary to model the schema on the executed queries. This is because C* is oriented to search by key. Integrating C* and Spark gives us a system that combines the best of both worlds. Existing integrations between the two systems are not satisfactory: they basically provide an HDFS abstraction layer over C*. We believe this solution is not efficient because introduces an important overhead between the two systems. The purpose of our work has been to provide an much lower-level integration that not only performs better, it also opens to Cassandra the possibility to solve a wide range of new use cases thanks to the powerfulness of the Spark distributed computing framework. We’ve already deployed this solution in real applications with diverse clients: pattern detection, log mining, fraud detection, sentiment analysis and financial transaction analysis. In addition this integration is the building block for our challenging and novel Lambda architecture completely based on Cassandra. In order to complete the integration, we provide a seamless extension to the Cassandra Query Language: CQL is oriented to key-based search. As such, it is not a good choice to perform queries that move an huge amount of data. We’ve extended CQL in order to provide a user-friendly interface. This is a new approach for batch processing over C*. It consists in an abstraction layer that translates custom CQL queries to Spark jobs and delegates the complexity of distributing the query itself over the underlying cluster of commodity machines to Spar

Stratio platform overview v4.1

Stratio

Stratio CrossData: an efficient distributed datahub with batch and streaming ...

Stratio

Crossdata: an efficient distributed datahub with batch and streaming query ca...

Álvaro Agea Herradón

Big Data analysis is commonly associated with batch processing of data stored in distributed file systems. The advent of streaming data is exposing the shortcomings of the traditional data analysis. Users aiming to combine both worlds - batch processing and streaming - had to turn to unreliable in-house developments. We propose Stratio META to meet this new need. META is a technology based on a structured NoSQL datastore with advanced indexing capabilities. META includes an efficient query planner designed from scratch. The planner determines which is the optimal path to execute a query and which components should be involved.

Primeros pasos con Apache Spark - Madrid Meetup

dhiguero

La Unión Bancaria Europea

koball

PresentacionComunicacionesPDB

El modelo europeo de reporting y el lenguaje XBRL - Ignacio Boixo

Asociación XBRL España

UNION BANCARIA EN LA UNION EUROPEA

Ramiro Ojeda

Recuperación y Unión Bancaria Europea. Emilio Ontiveros

Universidad de Deusto - Deustuko Unibertsitatea - University of Deusto

11 Tools for your Open Source devops stack

Kris Buytaert

Estándares en Unión Europea: Marco, Desafíos y Oportunidades - Francisco Garc...Asociación XBRL España

Tutorial en Apache Spark - Clasificando tweets en realtime

Socialmetrix

Apache Spark [1] es un nuevo framework de procesamiento distribuido para big data, escrito en Scala con wrappers para Python, que viene generando mucha atención de la comunidad por su potencia, simplicidad de uso y velocidad de procesamiento. Ya siendo llamado como el remplazo de Apache Hadoop. Socialmetrix desarrolla soluciones en este framework para generar reportes y dashboards de información a partir de los datos extraídos de redes sociales. Los participantes de este tutorial van aprender a levantar información de Twitter usando Spark Streaming, Desarrollar algoritmos para calcular hashtags más frecuentes, usuarios más activos en batch processing aplicarlos en realtime a los nuevos tweets que lleguen a través del stream.

Distributed Logistic Model Trees

Stratio

Classification algorithms play an important role in different business areas, such as fraud detection, cross selling or customer behavior. In the business context, interpretability is a very desirable property, sometimes even a hard requirement. However, interpretable algorithms are usually outperformed by other non-interpretable algorithms such as Random Forest. In this talk Antonio Soriano and Mateo Alvarez presented a distributed implementation in Spark of the Logistic Model Tree (LMT) algorithm (Landwehr, et al. (2005). Machine Learning, 59(1-2), 161-205.), which consists of a decision tree with logistic classifiers in the leaves. While being highly interpretable, the LMT consistently performs equally or better than other popular algorithms in several performance metrics such as accuracy, precision/recall or area under the ROC curve.

[Strata] Sparkta

Stratio

La translación del marco regulatorio Solvencia II al estándar XBRL - Aitor Az...

Asociación XBRL España

A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)

Spark Summit

El impacto del big data en la estrategia de los medios de comunicacion by Osc...

ACTUONDA

El impacto del Big Data en la estrategia de negocio de los medios de comunicación Oscar Mendez (CEO, Stratio) @omendezsoto @stratioDB Primer encuentro BIG MEDIA Conectando Media, Audiencia y Publicidad con Datos 24 de junio 2014, Madrid • Sponsor Platinum : Perfect Memory • Sponsor Gold : Stratio, Paradigma • Con el apoyo de : Big Data Spain, Medios On • Socio tecnológico : Agora News • Organizadores : Actuonda y Cátedra Big Data UAM-BM • Contacto : Nicolas Moulard (Actuonda) moulard@actuonda.com @Radio_20 www.bigmediaconnect.es

Viewers also liked (20)

Learn to use Stratio Crossdata

Why Spark?

StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit ...

Stratio platform overview v4.1

Stratio CrossData: an efficient distributed datahub with batch and streaming ...

Crossdata: an efficient distributed datahub with batch and streaming query ca...

Primeros pasos con Apache Spark - Madrid Meetup

La Unión Bancaria Europea

Presentacion

El modelo europeo de reporting y el lenguaje XBRL - Ignacio Boixo

UNION BANCARIA EN LA UNION EUROPEA

Recuperación y Unión Bancaria Europea. Emilio Ontiveros

11 Tools for your Open Source devops stack

Estándares en Unión Europea: Marco, Desafíos y Oportunidades - Francisco Garc...

Tutorial en Apache Spark - Clasificando tweets en realtime

Distributed Logistic Model Trees

[Strata] Sparkta

La translación del marco regulatorio Solvencia II al estándar XBRL - Aitor Az...

A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)

El impacto del big data en la estrategia de los medios de comunicacion by Osc...

Similar to Stratio big data spain

SparkNitish Upreti

Big data clustering

Jagadeesan A S

Paris Data Geek - Spark Streaming

Djamel Zouaoui

DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup

Victor Coustenoble

Processing Large Data with Apache Spark -- HasGeek

Venkata Naga Ravi

Indexing 3-dimensional trajectories: Apache Spark and Cassandra integration

Cesare Cugnasco

Navigating NoSQL in cloudy skies

shnkr_rmchndrn

NoSQL is not a buzzword anymore. The array of non- relational technologies have found wide-scale adoption even in non-Internet scale focus areas. With the advent of the Cloud...the churn has increased even more yet there is no crystal clear guidance on adoption techniques and architectural choices surrounding the plethora of options available. This session initiates you into the whys & wherefores, architectural patterns, caveats and techniques that will augment your decision making process & boost your perception of architecting scalable, fault-tolerant & distributed solutions.

From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data

Databricks

Hadoop world overview trends and topics

Valentin Kropov

Valentyn Kropov, Big Data Solutions Architect has recently attended "Hadoop World / Strata" – biggest and coolest Big Data conference in a World, and he can't wait to share fresh trends and topics straight from New-York. Come and learn how Hadoop cluster will help NASA to explore Mars, how Netflix build 10PB platform, what are the latest trends in Spark, to learn about newest, just announced storage engine from Cloudera called Kudu and many many more interesting stuff.

Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...

Spark Summit

Almost all organizations now have a need for datascience and as such the main challenge after determining the algorithm is to scale it up and make it operational. We at comcast use several tools and technologies such as Python, R, SaS, H2O and so on. In this talk we will show how many common use cases use the common algorithms like Logistic Regression, Random Forest, Decision Trees , Clustering, NLP etc. Spark has several Machine Learning algorithms built in and has excellent scalability. Hence we at comcast built a platform to provide DSaaS on top of Spark with REST API as a means of controlling and submitting jobs so as to abstract most users from the rigor of writing(repeating ) code instead focusing on the actual requirements. We will show how we solved some of the problems of establishing feature vectors, choosing algorithms and then deploying models into production. We will showcase our use of Scala, R and Python to implement models using language of choice yet deploying quickly into production on 500 node Spark clusters.

Apache spark-melbourne-april-2015-meetup

Ned Shawa

BI, Reporting and Analytics on Apache Cassandra

Victor Coustenoble

Sa introduction to big data pipelining with cassandra & spark west mins...

Simon Ambridge

Big Data for Data Scientists - WeCloudData

WeCloudData

New Analytics Toolbox DevNexus 2015

Robbie Strickland

The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill, that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. In this talk I will introduce you to a powerhouse combination of Cassandra and Spark, which provides a high-speed platform for both real-time and batch analysis.

Cleveland Hadoop Users Group - Spark

Vince Gonzalez

Business Growth Is Fueled By Your Event-Centric Digital Strategy

zitipoff

Neo, Titan & Cassandra

johnrjenson

Cassandra - A Basic Introduction Guide

Mohammed Fazuluddin

Jump Start on Apache Spark 2.2 with Databricks

Anyscale

Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data. In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas: Agenda: • Overview of Spark Fundamentals & Architecture • What’s new in Spark 2.x • Unified APIs: SparkSessions, SQL, DataFrames, Datasets • Introduction to DataFrames, Datasets and Spark SQL • Introduction to Structured Streaming Concepts • Four Hands-On Labs

Similar to Stratio big data spain (20)

Spark

Big data clustering

Paris Data Geek - Spark Streaming

DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup

Processing Large Data with Apache Spark -- HasGeek

Indexing 3-dimensional trajectories: Apache Spark and Cassandra integration

Navigating NoSQL in cloudy skies

From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data

Hadoop world overview trends and topics

Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...

Apache spark-melbourne-april-2015-meetup

BI, Reporting and Analytics on Apache Cassandra

Sa introduction to big data pipelining with cassandra & spark west mins...

Big Data for Data Scientists - WeCloudData

New Analytics Toolbox DevNexus 2015

Cleveland Hadoop Users Group - Spark

Business Growth Is Fueled By Your Event-Centric Digital Strategy

Neo, Titan & Cassandra

Cassandra - A Basic Introduction Guide

Jump Start on Apache Spark 2.2 with Databricks

Recently uploaded

FIDO Alliance Osaka Seminar: Overview.pdf

FIDO Alliance

State of ICS and IoT Cyber Threat Landscape Report 2024 preview

Prayukth K V

The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development. The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers: State of global ICS asset and network exposure Sectoral targets and attacks as well as the cost of ransom Global APT activity, AI usage, actor and tactic profiles, and implications Rise in volumes of AI-powered cyberattacks Major cyber events in 2024 Malware and malicious payload trends Cyberattack types and targets Vulnerability exploit attempts on CVEs Attacks on counties – USA Expansion of bot farms – how, where, and why In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East Why are attacks on smart factories rising? Cyber risk predictions Axis of attacks – Europe Systemic attacks in the Middle East Download the full report from here: https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...

Jeffrey Haguewood

Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows. We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases. This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams. Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.

Accelerate your Kubernetes clusters with Varnish Caching

Thijs Feryn

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf

FIDO Alliance

How world-class product teams are winning in the AI era by CEO and Founder, P...

Product School

The Future of Platform Engineering

Jemma Hussein Allen

Key Trends Shaping the Future of Infrastructure.pdf

Cheryl Hung

Securing your Kubernetes cluster_ a step-by-step guide to success !

KatiaHIMEUR1

Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster. However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks. In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.

Bits & Pixels using AI for Good.........

Alison B. Lowndes

From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...

Product School

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf

FIDO Alliance

Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...

Product School

Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...

Thierry Lestable

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024

Tobias Schneck

As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other? Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.

UiPath Test Automation using UiPath Test Suite series, part 3

DianaGray10

DevOps and Testing slides at DASA Connect

Kari Kakkonen

Generating a custom Ruby SDK for your web service or Rails API using Smithy

g2nightmarescribd

Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.

The Art of the Pitch: WordPress Relationships and Sales

Laura Byrne

Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes? All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.

Designing Great Products: The Power of Design and Leadership by Chief Designe...

Product School

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Overview.pdf

State of ICS and IoT Cyber Threat Landscape Report 2024 preview

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...

Accelerate your Kubernetes clusters with Varnish Caching

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf

How world-class product teams are winning in the AI era by CEO and Founder, P...

The Future of Platform Engineering

Key Trends Shaping the Future of Infrastructure.pdf

Securing your Kubernetes cluster_ a step-by-step guide to success !

Bits & Pixels using AI for Good.........

From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf

Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...

Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024

UiPath Test Automation using UiPath Test Suite series, part 3

DevOps and Testing slides at DASA Connect

Generating a custom Ruby SDK for your web service or Rails API using Smithy

The Art of the Pitch: WordPress Relationships and Sales

Designing Great Products: The Power of Design and Leadership by Chief Designe...

Stratio big data spain

1. AN EFFICIENT DATA MINING SOLUTION

2. Hadoop?

3. Cassandra?

4. Spark?

5. Stratio Deep An efficient data mining solution “Two and two are four? Sometimes… Sometimes they are five.” G. Orwell #StratioBD

7. Goals • • • • #StratioBD Why do you need Cassandra? What is the problem? Why do you need Spark? How do they work together?

8. Cassandra • • • • #StratioBD Based on DynamoDB… Replication, Key/Value, P2P And based on Big Table… Column oriented

9. ROBUST FAST EFFICENT

10. NO BOTTLENECK DECENTRALIZED REPLICATED

11. Another Database?

12. Why?

13. Case A One User – Lot of data #StratioBD

14. Case B Many User – Few data #StratioBD

15. Case C Many user – Lot of data #StratioBD

16. Crawler app 100M Indexed pages 3k reads Cassandra, I choose you #StratioBD Query time < 1s

17. But…

18. Marketing walks in

19. New query “I need to find all the reference to the domain ACME. I need the answer by Friday.” #StratioBD

20. Problem Cassandra is not well suited to resolved this type of queries You need to design the schema with the query in mind #StratioBD

21. Challenge Accepted

22. What options do we have? • • • #StratioBD Run Hive Query on top of C* Write an ETL script and load data into another DB Clone the cluster

23. What options do we have? Run Hive Query on top of C* Write ETL scripts and load into another DB Clone the cluster #StratioBD

24. And now… what can we do? “We can't solve problems by using the same kind of thinking we used when we created them” Albert Einstein #StratioBD

25. Spark • • • • • Alternative to MapReduce A low latency cluster computing system For very large datasets Create by UC Berkeley AMP Lab in 2010. May be 100 times faster than MapReduce for:   #StratioBD Interactive algorithms. Interactive data mining

26. Logistic regression in Spark vs Hadoop SOURCE | http://spark.incubator.apache.org/ #StratioBD

27. WHO USES SPARK?

28. Spark and Cassandra Integration points #StratioBD

29. Cassandra’s HDFS abstraction layer Advantantages: • Easily integrates with legacy systems. Drawbacks: • • Very high-level: no access to low level Cassandra’s features. Questionable performance. INTEGRATION POINTS: HDFS OVER CASSANDRA #StratioBD

30. Cassandra’s Hadoop Interface • Thrift protocol • CQL3 (our implementation)  Uses the novel Cassandra’s CqlPagingInputFormat INTEGRATION POINTS: HDFS OVER CASSANDRA #StratioBD

31. CQL3 Integration • • • Supports CQL3 features Respects data locality Good compromise between performance / implementation complexity INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3 #StratioBD

32. CQL3 Integration (II) Provides a Java friendly API: • Developers map Column Families to custom serializable POJOs • StratioDeep wraps the complexity of performing Spark calculations directly over the user provided POJOs. INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3 #StratioBD

33. Demo

34. CQL3 Integration (III) Drawbacks: • Still not preforming as well as we’d like  • No analyst-friendly interface:  #StratioBD Uses Cassandra’s Hadoop Interface No SQL-like query features INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3

35. Future extensions What are we currently working on? Bring the integration to another level: • • • #StratioBD Dump Cassandra’s Hadoop Interface Direct access to Cassandra’s SSTable(s) files. Extend Cassandra’s CQL3 to make use of Spark’s distributed data processing power

36. Conclusion #StratioBD

37. THANKS

Editor's Notes

Good afternoon, in this moment, everybody should know Stratio, the big data company. Now, I need to know if you seem familiar some concepts.

Stratio big data spain

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Stratio big data spain

Similar to Stratio big data spain (20)

Recently uploaded

Recently uploaded (20)

Stratio big data spain

Editor's Notes