Fast Queries
on Data Lakes
Exposing big data and streaming analytics using Hadoop, Cassandra, Akka and Spray
Natalino Busa
@natalinobusa
Big and Fast.
Tools Architecture Hands on Application!
Parallelism Hadoop Cassandra Akka
Machine Learning Statistics Big Data
Algorithms Cloud Computing Scala Spray
Natalino Busa
@natalinobusa
www.natalinobusa.com
Challenges
Not much time to react
Events must be delivered fast to the new machine APIs
It’s Web, and Mobile Apps: latency budget is limited
Loads of information to process
Understand the user history well
Access a larger context
OK, let’s build some apps
home-brewed Wikipedia search engine … Yeee ^-^/
Tools of the day:
Hadoop: Distributed Data OS
Reliable
Distributed, Replicated File System
Low cost
↓ Cost vs ↑ Performance/Storage
Computing Powerhouse
All the cluster's CPUs work in parallel to run queries
Cassandra: A low-latency 2D store
Reliable
Distributed, replicated data store
Low latency
Sub-millisecond read/write operations
Tunable CAP
Define your level of consistency
Data model:
hashed rows, sorted wide columns
Architecture model:
No SPOF, ring of nodes,
homogeneous system
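"Define your level of consistency" is often reasoned about with one rule of thumb: when the number of replicas read plus the number of replicas written exceeds the replication factor, every read overlaps the latest acknowledged write. A minimal sketch of that arithmetic follows; the function names are illustrative and not part of any Cassandra API.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_sees_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3  # matches the replication_factor used later in the deck
q = quorum(rf)
print("quorum size:", q)
print("QUORUM read + QUORUM write overlap:", read_sees_latest_write(q, q, rf))
print("ONE read + ONE write overlap:", read_sees_latest_write(1, 1, rf))
```

Reading and writing a single replica is faster but gives no overlap guarantee, which is the latency versus consistency trade the slide refers to.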
Lambda architecture: Batch + Streaming
[Diagram: batch computing (All Data) and streaming computing (Fast Data), in-memory distributed databases, and low-latency web API services exposed through an HTTP RESTful API]
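The idea of combining "All Data" with "Fast Data" can be sketched in a few lines: a batch view is recomputed now and then from the full history, a speed view is updated per event, and a query merges the two. This is a toy illustration under those assumptions; the names `batch_view`, `speed_view`, `on_event` and `query` are made up for the example.

```python
from collections import Counter

# Batch view: counts precomputed from the full history (hypothetical numbers).
batch_view = Counter({"gold": 120, "apple": 75})

# Speed view: counts for events that arrived after the last batch run.
speed_view = Counter()


def on_event(word):
    """Streaming path: update the speed view as each event arrives."""
    speed_view[word] += 1


def query(word):
    """Serving path: merge the batch and speed views at read time."""
    return batch_view[word] + speed_view[word]


on_event("gold")
on_event("gold")
print(query("gold"), query("apple"), query("pear"))
```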
How to: build an inverted index
Apple -> Apple Inc, Apple Tree, The Big Apple
Wikipedia abstracts (url, title, abstract, sections) -> hadoop mapper.py -> hadoop reducer.py
Publish pages on Cassandra
Produce inverted index entries
Top 10 URLs per word go to Cassandra
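The Apple example above can be reproduced in memory. The sketch below maps each lower-cased word to the titles that contain it; it is a single-process illustration only, and `build_inverted_index` is a name chosen for this example rather than code from the tutorial.

```python
from collections import defaultdict

titles = ["Apple Inc", "Apple Tree", "The Big Apple"]


def build_inverted_index(docs):
    """Map every lower-cased word to the documents that contain it."""
    index = defaultdict(list)
    for doc in docs:
        # set() so a word repeated inside one document is indexed once
        for word in set(doc.lower().split()):
            index[word].append(doc)
    return index


index = build_inverted_index(titles)
print(index["apple"])
```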
CREATE KEYSPACE wikipedia WITH replication =
  {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE wikipedia.pages (
  url text,
  title text,
  abstract text,
  length int,
  refs int,
  PRIMARY KEY (url)
);

CREATE TABLE wikipedia.inverted (
  keyword text,
  relevance int,
  url text,
  PRIMARY KEY ((keyword), relevance)
);
Data model ...
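With PRIMARY KEY ((keyword), relevance), a row is identified by keyword plus relevance, and in Cassandra an INSERT on an existing primary key overwrites that row. The sketch below imitates this behaviour with plain dictionaries (no Cassandra involved) to show that two URLs with the same score for one keyword cannot both be stored. The helper names and the second URL are hypothetical.

```python
# Toy stand-in for wikipedia.inverted:
# partition key (keyword) -> {clustering key (relevance) -> url}
inverted = {}


def insert(keyword, relevance, url):
    """Mimic a CQL INSERT: the same primary key overwrites the row."""
    inverted.setdefault(keyword, {})[relevance] = url


def top(keyword, n=10):
    """Return up to n (relevance, url) rows, highest relevance first."""
    rows = inverted.get(keyword, {})
    return [(rel, rows[rel]) for rel in sorted(rows, reverse=True)[:n]]


insert("ductile", 8930, "http://en.wikipedia.org/wiki/Gold")
insert("ductile", 8452, "http://en.wikipedia.org/wiki/Hydroforming")
insert("ductile", 8452, "http://example.org/same-score")  # hypothetical URL

print(top("ductile"))  # the Hydroforming row has been replaced
```

One possible remedy, not shown in the deck, would be to add url to the clustering columns so that equal scores no longer collide.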
[Diagram: MapReduce data flow across nodes (disk, compute in memory, disk): map (k,v) -> shuffle & sort -> reduce (k,list(v))]
cat enwiki-latest-abstracts.xml | ./mapper.py | ./reducer.py
Map-Reduce
demystified
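The same three stages can be imitated in a single Python process. In the sketch below a plain `sorted()` call stands in for Hadoop's shuffle & sort, which is what guarantees that the reducer sees all values of one key next to each other (on the command line, a `sort` between the two scripts would play that role). The example is a word count rather than the deck's inverted index job, chosen only because it is short.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map: emit a (key, value) pair for every word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(pairs):
    """Reduce: pairs must arrive grouped by key, as after shuffle & sort."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


lines = ["gold is soft", "gold is dense"]
shuffled = sorted(mapper(lines), key=itemgetter(0))  # stands in for shuffle & sort
counts = dict(reducer(shuffled))
print(counts)
```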
./mapper.py
produces tab separated triplets:
element 008930 http://en.wikipedia.org/wiki/Gold
with 008930 http://en.wikipedia.org/wiki/Gold
symbol 008930 http://en.wikipedia.org/wiki/Gold
atomic 008930 http://en.wikipedia.org/wiki/Gold
number 008930 http://en.wikipedia.org/wiki/Gold
dense 008930 http://en.wikipedia.org/wiki/Gold
soft 008930 http://en.wikipedia.org/wiki/Gold
malleable 008930 http://en.wikipedia.org/wiki/Gold
ductile 008930 http://en.wikipedia.org/wiki/Gold
Map-Reduce
demystified
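Each mapper line is a word, a zero-padded relevance and a URL separated by tabs. The sketch below builds and parses one such line. The zero padding is worth noting: one plausible reason for it (an assumption, the deck does not say) is that fixed-width numbers keep their numeric order when compared as text. `format_triplet` is a name chosen for this example.

```python
def format_triplet(word, relevance, url):
    """Build one tab-separated mapper output line."""
    return "%s\t%06d\t%s" % (word, relevance, url)


line = format_triplet("ductile", 8930, "http://en.wikipedia.org/wiki/Gold")
word, relevance, url = line.split("\t", 2)
print(word, relevance, url)

# Compared as text, unpadded numbers sort in the wrong order ...
unpadded = sorted(["9", "10"])
# ... while fixed-width, zero-padded numbers keep their numeric order.
padded = sorted(["%06d" % n for n in (10, 9)])
print(unpadded, padded)
```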
./reducer.py
produces tab separated triplets for the same key:
ductile 008930 http://en.wikipedia.org/wiki/Gold
ductile 008452 http://en.wikipedia.org/wiki/Hydroforming
ductile 007930 http://en.wikipedia.org/wiki/Liquid_metal_embrittlement
...
Map-Reduce
demystified
def main():
    global cassandra_client
    logging.basicConfig()
    cassandra_client = CassandraClient()
    cassandra_client.connect(['127.0.0.1'])
    readLoop()
    cassandra_client.close()
Mapper ...
doc = ET.fromstring(doc)
...
# extract words from title and abstract
words = [w for w in txt.split() if w not in STOPWORDS and len(w) > 2]
# relevance algorithm
relevance = len(abstract) * len(links)
# mapper output to the Cassandra wikipedia.pages table
cassandra_client.insertPage(url, title, abstract, length, refs)
# emit each unique key-value pair once, as tab-separated fields
emitted = set()
for word in words:
    if word not in emitted:
        print('%s\t%06d\t%s' % (word, relevance, url))
        emitted.add(word)
Mapper ...
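The elided parsing step might look roughly like the sketch below. It assumes each record is a `<doc>` element with `<title>`, `<url>`, `<abstract>` and `<links><sublink>` children, which is how the Wikipedia abstract dumps are commonly laid out; the sample record, the stopword list and the punctuation stripping are assumptions made for this illustration, not the tutorial's code.

```python
import xml.etree.ElementTree as ET

STOPWORDS = {"the", "and", "with", "for"}  # assumed list, for illustration only

SAMPLE = """<doc>
<title>Wikipedia: Gold</title>
<url>http://en.wikipedia.org/wiki/Gold</url>
<abstract>Gold is a dense, soft, malleable and ductile metal.</abstract>
<links><sublink><anchor>History</anchor></sublink>
<sublink><anchor>Chemistry</anchor></sublink></links>
</doc>"""


def parse_doc(xml_text):
    """Return (url, words, relevance) for one <doc> record."""
    doc = ET.fromstring(xml_text)
    title = doc.findtext("title", default="")
    url = doc.findtext("url", default="")
    abstract = doc.findtext("abstract", default="")
    links = doc.findall("links/sublink")
    # extract words from title and abstract, dropping punctuation
    txt = (title + " " + abstract).lower()
    words = [w.strip(".,;:()") for w in txt.split()]
    words = [w for w in words if w not in STOPWORDS and len(w) > 2]
    # same relevance formula as on the slide
    relevance = len(abstract) * len(links)
    return url, words, relevance


url, words, relevance = parse_doc(SAMPLE)
print(url, relevance, sorted(set(words)))
```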
T split !!!
Inverted index: Apple -> Apple Inc, Apple Tree, The Big Apple
Wikipedia abstracts (url, title, abstract, sections) -> hadoop mapper.sh -> hadoop reducer.sh
Publish pages on Cassandra
Extract inverted index
Top 10 URLs per word go to Cassandra
Export during the "map" phase
[Diagram: MapReduce data flow, map (k,v) -> shuffle & sort -> reduce (k,list(v)), with Cassandra outputs added]
import logging

from cassandra.cluster import Cluster

log = logging.getLogger()  # logger assumed here; its setup is not shown on the slide


class CassandraClient:
    session = None
    insert_page_statement = None

    def connect(self, nodes):
        cluster = Cluster(nodes)
        metadata = cluster.metadata
        self.session = cluster.connect()
        log.info('Connected to cluster: ' + metadata.cluster_name)
        # prepare the statements once, right after connecting
        self.prepareStatements()

    def close(self):
        self.session.cluster.shutdown()
        self.session.shutdown()
        log.info('Connection closed.')
Cassandra client
    # named prepareStatements to match the call made in connect()
    def prepareStatements(self):
        self.insert_page_statement = self.session.prepare("""
            INSERT INTO wikipedia.pages
            (url, title, abstract, length, refs)
            VALUES (?, ?, ?, ?, ?);
        """)

    def insertPage(self, url, title, abstract, length, refs):
        self.session.execute(
            self.insert_page_statement.bind(
                (url, title, abstract, length, refs)))
Cassandra client
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar \
  -files mapper.py,reducer.py \
  -mapper ./mapper.py \
  -reducer ./reducer.py \
  -jobconf stream.num.map.output.key.fields=1 \
  -jobconf stream.num.reduce.output.key.fields=1 \
  -jobconf mapred.reduce.tasks=16 \
  -input wikipedia-latest-abstract \
  -output $HADOOP_OUTPUT_DIR
YARN: mapreduce v2
Using map-reduce and yarn
Inverted index: Apple -> Apple Inc, Apple Tree, The Big Apple
Wikipedia abstracts (url, title, abstract, sections) -> hadoop mapper.sh -> hadoop reducer.sh
Publish pages on Cassandra
Extract inverted index
Top 10 URLs per word go to Cassandra
Export inverted index during the "reduce" phase
SELECT TRANSFORM (url, abstract, links)
USING 'mapper.py' AS
(relevance, url)
FROM hive_wiki_table
ORDER BY relevance LIMIT 50;
Hive UDF functions and hooks
Second method: using Hive SQL queries
def emit_ranking(n=100):
    global sorted_dict
    for i in range(n):
        cassandra_client.insertWord(current_word, relevance, url)
    …

def readLoop():
    # input comes from STDIN
    for line in sys.stdin:
        # parse the tab-separated input we got from mapper.py
        word, relevance, url = line.strip().split('\t', 2)
        if current_word == word:
            sorted_dict[relevance] = url
        else:
            if current_word:
                emit_ranking()
… Reducer ...
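Since the slide elides most of the reducer, here is a self-contained sketch of the "top N URLs per word" step. It is one way to do it under stated assumptions, not the tutorial's implementation: it relies on input lines arriving grouped by word (as after shuffle & sort), uses `itertools.groupby` and `heapq.nlargest`, and returns the ranking instead of writing to Cassandra. The names `parse` and `top_urls` are made up for the example.

```python
import heapq
from itertools import groupby


def parse(line):
    """Split one mapper line into (word, relevance, url)."""
    word, relevance, url = line.rstrip("\n").split("\t", 2)
    return word, int(relevance), url


def top_urls(lines, n=10):
    """Yield (word, [(relevance, url), ...]) with the n best URLs per word."""
    rows = (parse(line) for line in lines if line.strip())
    for word, group in groupby(rows, key=lambda row: row[0]):
        best = heapq.nlargest(n, ((rel, url) for _, rel, url in group))
        yield word, best


sample = [
    "ductile\t007930\thttp://en.wikipedia.org/wiki/Liquid_metal_embrittlement\n",
    "ductile\t008930\thttp://en.wikipedia.org/wiki/Gold\n",
    "ductile\t008452\thttp://en.wikipedia.org/wiki/Hydroforming\n",
    "soft\t008930\thttp://en.wikipedia.org/wiki/Gold\n",
]
ranking = dict(top_urls(sample, n=2))
print(ranking["ductile"])
```

In a real job, each ranked entry would be handed to the Cassandra client instead of being collected in a dictionary.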
[Diagram: MapReduce data flow, map (k,v) -> shuffle & sort -> reduce (k,list(v)), with Cassandra outputs added]
Front-end:
import json

from flask import Flask, Response

app = Flask(__name__)


@app.route('/word/<keyword>')
def fetch_word(keyword):
    # get_cassandra() is defined elsewhere in the tutorial code
    db = get_cassandra()
    pages = []
    results = db.fetchWordResults(keyword)
    for hit in results:
        pages.append(db.fetchPageDetails(hit["url"]))
    return Response(json.dumps(pages), status=200, mimetype="application/json")


if __name__ == '__main__':
    app.run()
Front-End:
prototyping in Flask
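The handler performs a two-step lookup: one read on the inverted index for the keyword, then one read per hit for the page details. The sketch below shows that access pattern with in-memory dictionaries standing in for the two Cassandra tables, so it runs without Flask or a cluster. The sample rows are invented for the example.

```python
import json

# Stand-ins for wikipedia.inverted and wikipedia.pages (sample rows only).
INVERTED = {
    "ductile": [{"relevance": 8930, "url": "http://en.wikipedia.org/wiki/Gold"}],
}
PAGES = {
    "http://en.wikipedia.org/wiki/Gold": {
        "url": "http://en.wikipedia.org/wiki/Gold",
        "title": "Gold",
    },
}


def fetch_word(keyword):
    """Return the JSON body the endpoint would send for a keyword."""
    hits = INVERTED.get(keyword, [])                                    # step 1: index read
    pages = [PAGES[hit["url"]] for hit in hits if hit["url"] in PAGES]  # step 2: page reads
    return json.dumps(pages)


print(fetch_word("ductile"))
print(fetch_word("unknown"))
```

Because the index keeps only the top URLs per word, the number of page reads per request stays small and bounded.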
Expose during Map or Reduce?
Expose in Map
- only access to local information
- simple, distributed "awk" filter
Expose in Reduce
- need to collect data scattered across your cluster
- analysis on all the available data
Latency tradeoffs
Two runtime frameworks:
cassandra : in-memory, low-latency
hadoop : extensive, exhaustive, churns all the data
Statistics and machine learning:
Python and R : they can be used for batch and/or realtime
Fastest analysis: still the domain of C, Java, Scala
Some lessons learned
● Use mapreduce to (pre)process data
● Connect to Cassandra during MR
● Use MR for batch heavy lifting
● Lambda architecture: Fast Data + All Data
Some lessons learned
Expose results to Cassandra for fast access
- responsive apps
- high throughput / low latency
Hadoop as a background tool
- data validation, new extractions, new algorithms
- data harmonization, correction, immutable system of records
The tutorial is on github
https://github.com/natalinobusa/wikipedia
Parallelism Mathematics Programming
Languages Machine Learning Statistics
Big Data Algorithms Cloud Computing
Natalino Busa
@natalinobusa
www.natalinobusa.com
Thanks !
Any questions?

Hadoop + Cassandra: Fast queries on data lakes, and wikipedia search tutorial.

  • 1. Fast Queries on Data Lakes Exposing bigdata and streaming analytics using hadoop, cassandra, akka and spray Natalino Busa @natalinobusa
  • 2. Big and Fast. Tools Architecture Hands on Application!
  • 3. Parallelism Hadoop Cassandra Akka Machine Learning Statistics Big Data Algorithms Cloud Computing Scala Spray Natalino Busa @natalinobusa www.natalinobusa.com
  • 4. Challenges Not much time to react Events must be delivered fast to the new machine APIs It’s Web, and Mobile Apps: latency budget is limited Loads of information to process Understand well the user history Access a larger context
  • 5. OK, let’s build some apps
  • 8. Hadoop: Distributed Data OS Reliable Distributed, Replicated File System Low cost ↓ Cost vs ↑ Performance/Storage Computing Powerhouse All clusters CPU’s working in parallel for running queries
  • 9. Cassandra: A low-latency 2D store Reliable Distributed, Replicated File System Low latency Sub msec. read/write operations Tunable CAP Define your level of consistency Data model: hashed rows, sorted wide columns Architecture model: No SPOF, ring of nodes, omogeneous system
  • 10. Lambda architecture Batch Computing HTTP RESTful API In-Memory Distributed Database In-memory Distributed DB’s Lambda Architecture Batch + Streaming low-latency Web API services Streaming Computing All Data Fast Data
  • 11. wikipedia abstracts (url, title, abstract, sections) hadoop mapper.py hadoop reducer.py Publish pages on Cassandra Produce inverted index entries Top 10 Urls per word go to Cassandra How to: Build an inverted index : Apple -> Apple Inc, Apple Tree, The Big Apple
  • 12. CREATE KEYSPACE wikipedia WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}; CREATE TABLE wikipedia.pages ( url text, title text, abstract text, length int, refs int, PRIMARY KEY (url) ); CREATE TABLE wikipedia.inverted ( keyword text, relevance int, url text, PRIMARY KEY ((keyword), relevance) ); Data model ...
  • 13. [Diagram: MapReduce data flow, map (k,v) → shuffle & sort → reduce (k,list(v)), built from memory, disk and compute blocks.]
  • 14. cat enwiki-latest-abstracts.xml | ./mapper.py | ./reducer.py Map-Reduce demystified
  • 15. ./mapper.py produces tab-separated triplets: element 008930 http://en.wikipedia.org/wiki/Gold with 008930 http://en.wikipedia.org/wiki/Gold symbol 008930 http://en.wikipedia.org/wiki/Gold atomic 008930 http://en.wikipedia.org/wiki/Gold number 008930 http://en.wikipedia.org/wiki/Gold dense 008930 http://en.wikipedia.org/wiki/Gold soft 008930 http://en.wikipedia.org/wiki/Gold malleable 008930 http://en.wikipedia.org/wiki/Gold ductile 008930 http://en.wikipedia.org/wiki/Gold Map-Reduce demystified
  • 16. ./reducer.py produces tab-separated triplets for the same key: ductile 008930 http://en.wikipedia.org/wiki/Gold ductile 008452 http://en.wikipedia.org/wiki/Hydroforming ductile 007930 http://en.wikipedia.org/wiki/Liquid_metal_embrittlement ... Map-Reduce demystified
  • 17. [Diagram: MapReduce data flow, map (k,v) → shuffle & sort → reduce (k,list(v)), built from memory, disk and compute blocks.]
  • 18. def main(): global cassandra_client logging.basicConfig() cassandra_client = CassandraClient() cassandra_client.connect(['127.0.0.1']) readLoop() cassandra_client.close() Mapper ...
  • 19. doc = ET.fromstring(doc) ... #extract words from title and abstract words = [w for w in txt.split() if w not in STOPWORDS and len(w) > 2] #relevance algorithm relevance = len(abstract) * len(links) #mapper output to cassandra wikipedia.pages table cassandra_client.insertPage(url, title, abstract, length, refs) #emit the unique key-value pairs emitted = list() for word in words: if word not in emitted: print '%s\t%06d\t%s' % (word, relevance, url) emitted.append(word) Mapper ... \t split !!!
  • 20. wikipedia abstracts (url, title, abstract, sections) hadoop mapper.sh hadoop reducer.sh Publish pages on Cassandra Extract inverted index Top 10 Urls per word go to Cassandra Inverted index : Apple -> Apple Inc, Apple Tree, The Big Apple Export during the "map" phase
  • 21. [Diagram: MapReduce data flow, map (k,v) → shuffle & sort → reduce (k,list(v)), built from memory, disk and compute blocks, with cassandra nodes attached.]
  • 22. from cassandra.cluster import Cluster class CassandraClient: session = None insert_page_statement = None def connect(self, nodes): cluster = Cluster(nodes) metadata = cluster.metadata self.session = cluster.connect() log.info('Connected to cluster: ' + metadata.cluster_name) self.prepareStatement() def close(self): self.session.cluster.shutdown() self.session.shutdown() log.info('Connection closed.') Cassandra client
  • 23. def prepareStatement(self): self.insert_page_statement = self.session.prepare(""" INSERT INTO wikipedia.pages (url, title, abstract, length, refs) VALUES (?, ?, ?, ?, ?); """) def insertPage(self, url, title, abstract, length, refs): self.session.execute( self.insert_page_statement.bind( (url, title, abstract, length, refs))) Cassandra client
  • 24. $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar -files mapper.py,reducer.py -mapper ./mapper.py -reducer ./reducer.py -jobconf stream.num.map.output.key.fields=1 -jobconf stream.num.reduce.output.key.fields=1 -jobconf mapred.reduce.tasks=16 -input wikipedia-latest-abstract -output $HADOOP_OUTPUT_DIR YARN: mapreduce v2 Using map-reduce and yarn
  • 25. wikipedia abstracts (url, title, abstract, sections) hadoop mapper.sh hadoop reducer.sh Publish pages on Cassandra Extract inverted index Top 10 Urls per word go to Cassandra Inverted index : Apple -> Apple Inc, Apple Tree, The Big Apple Export inverted index during "reduce" phase
  • 26. SELECT TRANSFORM (url, abstract, links) USING 'mapper.py' AS (relevance, url) FROM hive_wiki_table ORDER BY relevance LIMIT 50; Hive UDF functions and hooks Second method: using hive sql queries def emit_ranking(n=100): global sorted_dict for i in range(n): cassandra_client.insertWord(current_word, relevance, url) … def readLoop(): # input comes from STDIN for line in sys.stdin: # parse the input we got from mapper.py word, relevance, url = line.split('\t', 2) if current_word == word : sorted_dict[relevance] = url else: if current_word: emit_ranking() … Reducer ...
  • 27. [Diagram: MapReduce data flow, map (k,v) → shuffle & sort → reduce (k,list(v)), built from memory, disk and compute blocks, with cassandra nodes attached.]
  • 29. @app.route('/word/<keyword>') def fetch_word(keyword): db = get_cassandra() pages = [] results = db.fetchWordResults(keyword) for hit in results: pages.append(db.fetchPageDetails(hit["url"])) return Response(json.dumps(pages), status=200, mimetype="application/json") if __name__ == '__main__': app.run() Front-End: prototyping in Flask
  • 30. Expose during Map or Reduce? Expose in Map: only access to local information; a simple, distributed "awk" filter. Expose in Reduce: need to collect data scattered across your cluster; analysis on all the available data.
  • 31. Latency tradeoffs. Two runtime frameworks: cassandra: in-memory, low-latency; hadoop: extensive, exhaustive, churns all the data. Statistics and machine learning: Python and R can be used for batch and/or realtime. Fastest analysis: still the domain of C, Java, Scala.
  • 32. Some lessons learned ● Use mapreduce to (pre)process data ● Connect to Cassandra during MR ● Use MR for batch heavy lifting ● Lambda architecture: Fast Data + All Data
  • 33. Some lessons learned. Expose results to Cassandra for fast access: responsive apps; high throughput / low latency. Hadoop as a background tool: data validation, new extractions, new algorithms; data harmonization, correction, immutable system of records.
  • 34. The tutorial is on github https://github.com/natalinobusa/wikipedia
  • 35. Parallelism Mathematics Programming Languages Machine Learning Statistics Big Data Algorithms Cloud Computing Natalino Busa @natalinobusa www.natalinobusa.com Thanks ! Any questions?
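
The map, shuffle & sort, and reduce steps that the slides walk through can be sketched in memory, without Hadoop or Cassandra. This is a minimal Python 3 illustration and not the code from the deck: the function names, the tiny `STOPWORDS` set, the lowercasing, and the sample pages are all assumptions made for the example. Only the relevance formula (abstract length times number of links), the word filter, and the "top 10 URLs per word" cut follow the slides.

```python
from itertools import groupby

# Hypothetical stopword list; the deck's own STOPWORDS is not shown.
STOPWORDS = {"the", "and", "with", "for"}


def map_page(url, title, abstract, n_links):
    """Map step: yield one (word, relevance, url) triplet per unique word."""
    # Relevance as on the mapper slide: abstract length times link count.
    relevance = len(abstract) * n_links
    # Lowercasing is an assumption; the slide only shows txt.split().
    words = (title + " " + abstract).lower().split()
    seen = set()
    for word in words:
        if len(word) > 2 and word not in STOPWORDS and word not in seen:
            seen.add(word)
            yield (word, relevance, url)


def reduce_word(word, triplets, top_n=10):
    """Reduce step: keep the top_n URLs for one word, most relevant first."""
    ranked = sorted(triplets, key=lambda t: t[1], reverse=True)
    return word, [(rel, url) for _, rel, url in ranked[:top_n]]


def build_inverted_index(pages, top_n=10):
    """Run map, then shuffle & sort, then reduce, entirely in memory."""
    mapped = [t for page in pages for t in map_page(*page)]
    # Shuffle & sort: bring all triplets with the same key next to each other.
    mapped.sort(key=lambda t: t[0])
    index = {}
    for word, group in groupby(mapped, key=lambda t: t[0]):
        key, hits = reduce_word(word, list(group), top_n)
        index[key] = hits
    return index


if __name__ == "__main__":
    # Made-up sample pages: (url, title, abstract, number of links).
    pages = [
        ("http://en.wikipedia.org/wiki/Apple_Inc", "Apple Inc",
         "Apple designs consumer electronics", 5),
        ("http://en.wikipedia.org/wiki/Apple", "Apple",
         "The apple tree bears fruit", 3),
    ]
    for relevance, url in build_inverted_index(pages)["apple"]:
        print(relevance, url)
```

In the deck's pipeline, the dictionary write in `build_inverted_index` would instead be an insert into the `wikipedia.inverted` table, and the sort would be done by Hadoop between the mapper and the reducer.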