Hadoop + Cassandra: Fast Queries on Data Lakes, and a Wikipedia Search Tutorial


Today’s services rely on massive amounts of data, yet must remain fast and responsive. Building fast services on big-data, batch-oriented frameworks is definitely a challenge. At ING, we have worked on a stack that alleviates this problem: we materialize data models by map-reducing Hadoop queries from Hive into Cassandra. Instead of sinking the results back to HDFS, we propagate them into Cassandra key-value tables. Those Cassandra tables are finally exposed through an HTTP API front-end service.



  1. Fast Queries on Data Lakes: exposing big data and streaming analytics using Hadoop, Cassandra, Akka and Spray. Natalino Busa @natalinobusa
  2. Big and Fast: Tools, Architecture, Hands-on Application!
  3. Parallelism, Hadoop, Cassandra, Akka, Machine Learning, Statistics, Big Data, Algorithms, Cloud Computing, Scala, Spray. Natalino Busa @natalinobusa www.natalinobusa.com
  4. Challenges: Not much time to react: events must be delivered fast to the new machine APIs. It's web and mobile apps: the latency budget is limited. Loads of information to process: understand the user history well and access a larger context.
  5. OK, let's build some apps
  6. A home-brewed Wikipedia search engine … Yeee ^-^/
  7. Tools of the day:
  8. Hadoop: a distributed data OS. Reliable: a distributed, replicated file system. Low cost: lower cost versus higher performance and storage. A computing powerhouse: all the cluster's CPUs working in parallel to run queries.
  9. Cassandra: a low-latency 2D store. Reliable: a distributed, replicated file system. Low latency: sub-millisecond read/write operations. Tunable CAP: define your level of consistency. Data model: hashed rows, sorted wide columns. Architecture model: no SPOF, a ring of nodes, a homogeneous system.
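A quick aside on "tunable CAP": a minimal sketch with the DataStax Python driver showing per-statement consistency levels. The node address is illustrative, and the wikipedia.pages table is the one defined on slide 12.

      # Sketch: per-statement consistency with the DataStax Python driver.
      # Assumes a local node and the wikipedia.pages table from slide 12.
      from cassandra import ConsistencyLevel
      from cassandra.cluster import Cluster
      from cassandra.query import SimpleStatement

      session = Cluster(['127.0.0.1']).connect()

      # Stronger write: a quorum of replicas must acknowledge.
      write = SimpleStatement(
          "INSERT INTO wikipedia.pages (url, title) VALUES (%s, %s)",
          consistency_level=ConsistencyLevel.QUORUM)
      session.execute(write, ('http://en.wikipedia.org/wiki/Gold', 'Gold'))

      # Faster read: a single replica is enough.
      read = SimpleStatement(
          "SELECT title FROM wikipedia.pages WHERE url = %s",
          consistency_level=ConsistencyLevel.ONE)
      rows = session.execute(read, ('http://en.wikipedia.org/wiki/Gold',))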
  10. Lambda architecture: batch computing over All Data plus streaming computing over Fast Data, with in-memory distributed databases in between and an HTTP RESTful API on top, for low-latency web API services.
  11. How to build an inverted index: Apple -> Apple Inc, Apple Tree, The Big Apple. Wikipedia abstracts (url, title, abstract, sections) run through hadoop mapper.py and reducer.py: pages are published to Cassandra, inverted index entries are produced, and the top-10 URLs per word go to Cassandra (a toy sketch of the idea follows below).
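Before the distributed version, the idea in miniature: a toy, in-memory inverted index. Illustrative only; the page texts and the relevance stand-in are made up.

      # Toy in-memory inverted index: word -> urls, ranked by relevance.
      from collections import defaultdict

      pages = {
          'http://en.wikipedia.org/wiki/Apple_Inc': 'apple computer company',
          'http://en.wikipedia.org/wiki/Apple': 'apple fruit of the apple tree',
          'http://en.wikipedia.org/wiki/The_Big_Apple': 'apple nickname of new york',
      }

      index = defaultdict(list)
      for url, abstract in pages.items():
          relevance = len(abstract)  # stand-in for the deck's relevance score
          for word in set(abstract.split()):
              index[word].append((relevance, url))

      # Keep the top-10 urls per word, highest relevance first.
      for word, hits in index.items():
          index[word] = [url for _, url in sorted(hits, reverse=True)[:10]]

      print(index['apple'])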
  12. Data model ...

      CREATE KEYSPACE wikipedia
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

      CREATE TABLE wikipedia.pages (
          url      text,
          title    text,
          abstract text,
          length   int,
          refs     int,
          PRIMARY KEY (url)
      );

      CREATE TABLE wikipedia.inverted (
          keyword   text,
          relevance int,
          url       text,
          PRIMARY KEY ((keyword), relevance)
      );
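With this model a keyword lookup is a single-partition slice, which is what keeps reads fast. A hedged sketch of the query, reusing the session from the consistency sketch above; the keyword is illustrative:

      # Fetch the 10 most relevant pages for a keyword.
      rows = session.execute(
          "SELECT url FROM wikipedia.inverted "
          "WHERE keyword = %s ORDER BY relevance DESC LIMIT 10",
          ('apple',))
      top_urls = [row.url for row in rows]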
  13. [Diagram: map (k,v) → shuffle & sort → reduce (k, list(v)), spread across the nodes' memory, disk and compute]
  14. Map-Reduce demystified: cat enwiki-latest-abstracts.xml | ./mapper.py | sort | ./reducer.py (the local sort stands in for Hadoop's shuffle & sort phase, which the reducer relies on).
  15. Map-Reduce demystified: ./mapper.py produces tab-separated triplets:

      element    008930    http://en.wikipedia.org/wiki/Gold
      with       008930    http://en.wikipedia.org/wiki/Gold
      symbol     008930    http://en.wikipedia.org/wiki/Gold
      atomic     008930    http://en.wikipedia.org/wiki/Gold
      number     008930    http://en.wikipedia.org/wiki/Gold
      dense      008930    http://en.wikipedia.org/wiki/Gold
      soft       008930    http://en.wikipedia.org/wiki/Gold
      malleable  008930    http://en.wikipedia.org/wiki/Gold
      ductile    008930    http://en.wikipedia.org/wiki/Gold
  16. Map-Reduce demystified: ./reducer.py produces tab-separated triplets for the same key:

      ductile    008930    http://en.wikipedia.org/wiki/Gold
      ductile    008452    http://en.wikipedia.org/wiki/Hydroforming
      ductile    007930    http://en.wikipedia.org/wiki/Liquid_metal_embrittlement
      ...
  17. [Diagram, repeated: map (k,v) → shuffle & sort → reduce (k, list(v)) across the cluster]
  18. Mapper ...

      def main():
          global cassandra_client
          logging.basicConfig()
          cassandra_client = CassandraClient()
          cassandra_client.connect(['127.0.0.1'])
          readLoop()
          cassandra_client.close()
  19. Mapper ... Note the \t tab separator! (A fuller sketch of readLoop follows below.)

      doc = ET.fromstring(doc)
      ...
      # extract words from title and abstract
      words = [w for w in txt.split() if w not in STOPWORDS and len(w) > 2]

      # relevance algorithm
      relevance = len(abstract) * len(links)

      # mapper output to the cassandra wikipedia.pages table
      cassandra_client.insertPage(url, title, abstract, length, refs)

      # emit the unique key-value pairs
      emitted = list()
      for word in words:
          if word not in emitted:
              print '%s\t%06d\t%s' % (word, relevance, url)
              emitted.append(word)
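The deck never shows readLoop itself. Here is a hedged sketch of what it might look like, assuming the enwiki abstract dump layout (<doc> elements with <title>, <url>, <abstract> and <links>/<sublink> children; verify against the actual dump):

      # Hypothetical readLoop for mapper.py; plugs into main() from slide 18.
      import sys
      import xml.etree.ElementTree as ET

      STOPWORDS = {'the', 'and', 'for', 'with'}  # illustrative; use a real list
      cassandra_client = None                    # set by main() on slide 18

      def readLoop():
          # Stream <doc> elements so the full dump never sits in memory.
          for event, doc in ET.iterparse(sys.stdin, events=('end',)):
              if doc.tag != 'doc':
                  continue
              url = doc.findtext('url') or ''
              title = doc.findtext('title') or ''
              abstract = doc.findtext('abstract') or ''
              links = doc.findall('.//sublink')
              relevance = len(abstract) * len(links)
              txt = (title + ' ' + abstract).lower()
              words = [w for w in txt.split()
                       if w not in STOPWORDS and len(w) > 2]
              if cassandra_client is not None:
                  cassandra_client.insertPage(url, title, abstract,
                                              len(abstract), len(links))
              for word in set(words):
                  print '%s\t%06d\t%s' % (word, relevance, url)
              doc.clear()  # free the element we just processed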
  20. Export during the "map" phase: wikipedia abstracts (url, title, abstract, sections) run through hadoop mapper.sh and reducer.sh; pages are published to Cassandra, the inverted index is extracted, and the top-10 URLs per word go to Cassandra. Inverted index: Apple -> Apple Inc, Apple Tree, The Big Apple.
  21. [Diagram: the same map → shuffle & sort → reduce flow, with the map tasks also writing to Cassandra]
  22. Cassandra client

      import logging

      from cassandra.cluster import Cluster

      log = logging.getLogger(__name__)

      class CassandraClient:
          session = None
          insert_page_statement = None

          def connect(self, nodes):
              cluster = Cluster(nodes)
              metadata = cluster.metadata
              self.session = cluster.connect()
              log.info('Connected to cluster: ' + metadata.cluster_name)
              self.prepareStatement()

          def close(self):
              self.session.shutdown()
              self.session.cluster.shutdown()
              log.info('Connection closed.')
  23. Cassandra client

      def prepareStatement(self):
          self.insert_page_statement = self.session.prepare("""
              INSERT INTO wikipedia.pages (url, title, abstract, length, refs)
              VALUES (?, ?, ?, ?, ?);
          """)

      def insertPage(self, url, title, abstract, length, refs):
          self.session.execute(
              self.insert_page_statement.bind(
                  (url, title, abstract, length, refs)))
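The reducer on slide 26 calls cassandra_client.insertWord, which the deck never shows. A plausible companion to insertPage against the wikipedia.inverted table; this is an assumption, not the author's code:

      # Hypothetical additions inside class CassandraClient;
      # call self.prepareWordStatement() from connect() as well.
      def prepareWordStatement(self):
          self.insert_word_statement = self.session.prepare("""
              INSERT INTO wikipedia.inverted (keyword, relevance, url)
              VALUES (?, ?, ?);
          """)

      def insertWord(self, keyword, relevance, url):
          self.session.execute(
              self.insert_word_statement.bind((keyword, relevance, url)))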
  24. Using map-reduce and YARN (mapreduce v2):

      $HADOOP_HOME/bin/hadoop jar \
          $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar \
          -files mapper.py,reducer.py \
          -mapper ./mapper.py \
          -reducer ./reducer.py \
          -jobconf stream.num.map.output.key.fields=1 \
          -jobconf stream.num.reduce.output.key.fields=1 \
          -jobconf mapred.reduce.tasks=16 \
          -input wikipedia-latest-abstract \
          -output $HADOOP_OUTPUT_DIR
  25. Export the inverted index during the "reduce" phase: wikipedia abstracts (url, title, abstract, sections) run through hadoop mapper.sh and reducer.sh; pages are published to Cassandra, the inverted index is extracted, and the top-10 URLs per word go to Cassandra. Inverted index: Apple -> Apple Inc, Apple Tree, The Big Apple.
  26. Reducer ... Second method: using Hive SQL queries (Hive UDF functions and hooks). A fuller reducer sketch follows below.

      SELECT TRANSFORM (url, abstract, links)
      USING 'mapper.py' AS (relevance, url)
      FROM hive_wiki_table
      ORDER BY relevance
      LIMIT 50;

      def emit_ranking(n=100):
          global sorted_dict
          for i in range(n):
              cassandra_client.insertWord(current_word, relevance, url)
          …

      def readLoop():
          # input comes from STDIN
          for line in sys.stdin:
              # parse the input we got from mapper.py
              word, relevance, url = line.split('\t', 2)
              if current_word == word:
                  sorted_dict[relevance] = url
              else:
                  if current_word:
                      emit_ranking()
          …
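Pieced together, a minimal self-contained reducer might look like this. A sketch: the top-10 cutoff and plain-print output are assumptions, and the zero-padded relevance strings sort numerically:

      #!/usr/bin/env python
      # Hypothetical complete reducer.py, assembled from the fragments above.
      import sys

      current_word = None
      sorted_dict = {}

      def emit_ranking(n=10):
          # Emit the n most relevant urls for the word just finished;
          # zero-padded relevance strings sort in numeric order.
          for relevance in sorted(sorted_dict, reverse=True)[:n]:
              print '%s\t%s\t%s' % (current_word, relevance,
                                    sorted_dict[relevance])

      for line in sys.stdin:
          word, relevance, url = line.strip().split('\t', 2)
          if word == current_word:
              sorted_dict[relevance] = url  # collect hits for the same word
          else:
              if current_word:
                  emit_ranking()
              current_word, sorted_dict = word, {relevance: url}

      if current_word:
          emit_ranking()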
  27. [Diagram: map → shuffle & sort → reduce flow, with the reduce tasks also writing to Cassandra]
  28. Front-end:
  29. Front-End: prototyping in Flask (sketches of the two fetch helpers follow below)

      @app.route('/word/<keyword>')
      def fetch_word(keyword):
          db = get_cassandra()
          pages = []
          results = db.fetchWordResults(keyword)
          for hit in results:
              pages.append(db.fetchPageDetails(hit["url"]))
          return Response(json.dumps(pages), status=200,
                          mimetype="application/json")

      if __name__ == '__main__':
          app.run()
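fetchWordResults and fetchPageDetails are not shown in the deck. A plausible sketch against the two tables from slide 12; the names and queries are assumptions:

      # Hypothetical helpers inside class CassandraClient:
      def fetchWordResults(self, keyword):
          # Top-10 urls for a keyword, most relevant first.
          rows = self.session.execute(
              "SELECT relevance, url FROM wikipedia.inverted "
              "WHERE keyword = %s ORDER BY relevance DESC LIMIT 10",
              (keyword,))
          return [{'url': row.url, 'relevance': row.relevance} for row in rows]

      def fetchPageDetails(self, url):
          rows = self.session.execute(
              "SELECT url, title, abstract FROM wikipedia.pages "
              "WHERE url = %s", (url,))
          for row in rows:
              return {'url': row.url, 'title': row.title,
                      'abstract': row.abstract}
          return None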
  30. Expose during Map or Reduce?
      Expose in Map: only access to local information; a simple, distributed "awk"-style filter.
      Expose in Reduce: collects data scattered across your cluster; analysis on all the available data.
  31. Latency tradeoffs. Two runtime frameworks: Cassandra is in-memory and low-latency; Hadoop is extensive and exhaustive, churning through all the data. Statistics and machine learning: Python and R can be used for batch and/or realtime. The fastest analysis is still the domain of C, Java and Scala.
  32. Some lessons learned
      ● Use mapreduce to (pre)process data
      ● Connect to Cassandra during MR
      ● Use MR for the batch heavy lifting
      ● Lambda architecture: Fast Data + All Data
  33. Some lessons learned
      Expose results to Cassandra for fast access: responsive apps, high throughput / low latency.
      Hadoop as a background tool: data validation, new extractions, new algorithms; data harmonization, correction, an immutable system of records.
  34. The tutorial is on github: https://github.com/natalinobusa/wikipedia
  35. Parallelism, Mathematics, Programming Languages, Machine Learning, Statistics, Big Data, Algorithms, Cloud Computing. Natalino Busa @natalinobusa www.natalinobusa.com Thanks! Any questions?
