
Emerging Technologies/Frameworks in Big Data



A short overview presentation on emerging technologies/frameworks in Big Data, covering Apache Parquet, Apache Flink, and Apache Drill, along with the basic concepts of columnar storage and Dremel.



  1. Emerging Technologies/Frameworks in Big Data. Rahul Jain (@rahuldausa), Meetup, Sep 2015
  2. About Me • Independent Big Data/Search consultant • 8+ years of learning experience • Worked on high-volume distributed applications • Still a learner (and beginner)
  3. Quick Questionnaire How many people know of or have heard of Apache Parquet? How many people know of or have heard of Apache Drill? How many people know of or have heard of Apache Flink?
  4. What are we going to learn/see today? • Columnar Storage (overview) • Apache Parquet (with demo) • Dremel (basic overview) • Apache Drill (with demo) • Apache Flink (with demo)
  5. Let’s discuss Columnar Storage
  6. Let’s say we have an Employee table:
     RowId  EmpId  Lastname  Firstname  Salary
     001    10     Smith     Joe        40000
     002    12     Jones     Mary       50000
     003    11     Johnson   Cathy      44000
     004    22     Jones     Bob        55000
  7. Table storage in a row-oriented system. In row-oriented systems, the table above will be stored as:
     001:10,Smith,Joe,40000; 002:12,Jones,Mary,50000; 003:11,Johnson,Cathy,44000; 004:22,Jones,Bob,55000;
  8. Table storage in a column-oriented system. In row-oriented systems the table is stored as:
     001:10,Smith,Joe,40000; 002:12,Jones,Mary,50000; 003:11,Johnson,Cathy,44000; 004:22,Jones,Bob,55000;
     But in column-oriented systems, it will be stored as:
     10:001,12:002,11:003,22:004; Smith:001,Jones:002,Johnson:003,Jones:004; Joe:001,Mary:002,Cathy:003,Bob:004; 40000:001,50000:002,44000:003,55000:004;
  9. Row vs Column Storage
     Row-oriented storage:
     001:10,Smith,Joe,40000;002:12,Jones,Mary,50000;003:11,Johnson,Cathy,44000;004:22,Jones,Bob,55000;
     Column-oriented storage:
     10:001,12:002,11:003,22:004;Smith:001,Jones:002,Johnson:003,Jones:004;Joe:001,Mary:002,Cathy:003,Bob:004;40000:001,50000:002,44000:003,55000:004;
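To make the two layouts above concrete, here is a small plain-Python sketch (illustrative only; real columnar engines store compressed binary pages, not Python lists). It shows why an aggregate like SUM(Salary) is cheaper on a column store: only one column is scanned.

```python
# The same employee table in a row layout and a column layout.
rows = [
    ("001", 10, "Smith",   "Joe",   40000),
    ("002", 12, "Jones",   "Mary",  50000),
    ("003", 11, "Johnson", "Cathy", 44000),
    ("004", 22, "Jones",   "Bob",   55000),
]

# Row-oriented layout: whole records stored together.
row_store = rows

# Column-oriented layout: one list per column.
col_store = {
    "RowId":     [r[0] for r in rows],
    "EmpId":     [r[1] for r in rows],
    "Lastname":  [r[2] for r in rows],
    "Firstname": [r[3] for r in rows],
    "Salary":    [r[4] for r in rows],
}

# SUM(Salary) on the row store must step over every field of every record;
# on the column store it scans just the Salary list.
total_row = sum(r[4] for r in row_store)
total_col = sum(col_store["Salary"])
assert total_row == total_col == 189000
```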
  10. Apache Parquet (Columnar Storage for the Hadoop ecosystem)
  11. About Apache Parquet • Columnar storage format • Initially started by Twitter and Cloudera • Stores nested data structures in a flat columnar format using a technique outlined in the Dremel paper from Google • Can store very large datasets with a very high compression rate • Due to compression, less I/O and faster processing • Provides high-level APIs in Java • Integration with Hadoop and its ecosystem • http://parquet.apache.org
  12. Parquet Design • required: exactly one occurrence • optional: 0 or 1 occurrence • repeated: 0 or more occurrences
     For example, an address book schema:
     message AddressBook {
       required string owner;
       repeated string ownerPhoneNumbers;
       repeated group contacts {
         required string name;
         optional string phoneNumber;
       }
     }
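The required/optional/repeated cardinalities map naturally onto plain fields, Optional fields, and List fields. The sketch below mirrors the AddressBook schema with Python dataclasses purely as an analogy; it is not Parquet's API, and the sample values are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Contact:
    name: str                           # required: exactly one occurrence
    phoneNumber: Optional[str] = None   # optional: 0 or 1 occurrence

@dataclass
class AddressBook:
    owner: str                                                   # required
    ownerPhoneNumbers: List[str] = field(default_factory=list)   # repeated
    contacts: List[Contact] = field(default_factory=list)        # repeated

# Hypothetical sample record, analogous to one row of the nested schema.
book = AddressBook(
    owner="Jane Doe",
    ownerPhoneNumbers=["555-0100"],
    contacts=[Contact(name="Joe"), Contact(name="Mary", phoneNumber="555-0101")],
)
assert book.contacts[0].phoneNumber is None   # optional field absent
```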
  13. Size Comparison
     $ du -sch test.*
     407M  test.csv      (1 million records, 4 columns)
     70M   test.csv.gz   (~83% reduction)
     35M   test.parquet  (~92% reduction)
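A toy stdlib experiment hints at why the Parquet file above is so much smaller: grouping a column's values together creates long runs of similar bytes, which general-purpose compressors exploit far better than row-interleaved text. This is only an illustration with made-up data, not Parquet's actual encodings.

```python
import random
import zlib

random.seed(42)
n = 10_000
statuses = [("ACTIVE" if i % 2 == 0 else "INACTIVE") for i in range(n)]
salaries = [random.randint(30_000, 99_999) for _ in range(n)]

# Row layout: the fields of each record interleaved, like a CSV.
row_bytes = "".join(f"{statuses[i]},{salaries[i]};" for i in range(n)).encode()

# Column layout: all statuses together, then all salaries.
col_bytes = (",".join(statuses) + "|" + ",".join(map(str, salaries))).encode()

row_gz = len(zlib.compress(row_bytes, 9))
col_gz = len(zlib.compress(col_bytes, 9))
print(f"row layout:    {len(row_bytes)} -> {row_gz} bytes")
print(f"column layout: {len(col_bytes)} -> {col_gz} bytes")

# The low-cardinality status column compresses to almost nothing once its
# values are contiguous, so the columnar layout wins overall.
assert col_gz < row_gz
```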
  14. Let’s first discuss Dremel: Interactive Analysis of Web-Scale Datasets
  15. What is Dremel • A paper published by Google in 2010 • Interactive analysis of web-scale datasets – ad-hoc queries on very large datasets (petabytes) – near real time – MR (MapReduce) works, but it is meant for batch processing • SQL-like query interface • Nested data (with a columnar storage representation) • Paper: – http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf • Projects (implementations): – Google BigQuery (cloud-based) – Apache Drill (open source)
  16. Why Dremel: Speed Matters Credit: http://www.slideshare.net/robertlz/dremel-interactive-analysis-of-webscale-datasets
  17. Widely used inside Google Credit: http://www.slideshare.net/robertlz/dremel-interactive-analysis-of-webscale-datasets
  18. Tree-based structure Credit: http://www.alberton.info/images/articles/papers/dremel1.png
  19. Column-striped representation Credit: http://www.alberton.info/images/articles/papers/dremel2.png
  20. Query Processing Credit: http://farm9.staticflickr.com/8426/7843420938_9cb23a4cb0_b.jpg
  21. Let’s move to Apache Drill
  22. About Apache Drill • Based on Google’s Dremel paper • Supports data-intensive distributed applications for interactive analysis of large-scale datasets • Has a datastore-aware optimizer, which constructs the query plan based on the datastore’s processing capabilities • Supports data locality • http://drill.apache.org/
  23. So Why Drill? • Flexible data model • Fixed schema (Avro) / dynamic schema (JSON) / schema-less SQL • Schema can be discovered on the fly • Built-in optimistic query execution engine • Doesn’t require a particular storage or execution system (MapReduce, Spark, Tez) • Better performance and manageability • Cluster of commodity servers • Daemon (drillbit) on each data node • Works with Hadoop, CSV, JSON, Avro/Parquet, MongoDB, HBase, Solr, etc.
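The "schema can be discovered on the fly" point is the essence of schema-on-read, and can be illustrated in a few lines of Python (a conceptual analogy, not Drill's internals): the field set comes from the records themselves rather than from a schema declared up front.

```python
import json

# Records with no declared schema; later records may add new fields.
records = [
    '{"name": "Joe", "salary": 40000}',
    '{"name": "Mary", "salary": 50000, "dept": "HR"}',
]

# Discover the union of fields by inspecting the data itself.
schema = set()
for rec in records:
    schema.update(json.loads(rec).keys())

assert schema == {"name", "salary", "dept"}
```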
  24. Query any non-relational datastore
  25. Distributed SQL query engine Credit: http://www.slideshare.net/MapRTechnologies/drill-highperformancesqlenginewithjsondatamodel
  26. Designed to support a wide set of use-cases Credit: http://www.slideshare.net/MapRTechnologies/drill-highperformancesqlenginewithjsondatamodel
  27. Querying
     CSV:
     0: jdbc:drill:> select count(*) from dfs.`/tmp/test.csv`;
     +-----------+
     |  EXPR$0   |
     +-----------+
     | 10000001  |
     +-----------+
     1 row selected (5.771 seconds)
     Parquet:
     0: jdbc:drill:> select count(*) from dfs.`/tmp/test.parquet`;
     +-----------+
     |  EXPR$0   |
     +-----------+
     | 10000001  |
     +-----------+
     1 row selected (0.257 seconds)
  28. Drill Shell
     ./bin/drill-embedded
     This starts Drill in embedded mode. You will see output like:
     org.glassfish.jersey.server.ApplicationHandler initialize
     INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
     apache drill 1.0.0
     "say hello to my little drill"
     0: jdbc:drill:zk=local>
     For Windows, this starts the shell with Drill in embedded mode:
     ./bin/sqlline.bat -u "jdbc:drill:schema=dfs;zk=local"
  29. Terminology • Drillbit – a Drillbit runs on each data node in the cluster; Drill maximizes data locality during query execution, so movement of data over the network or between nodes is minimized or eliminated when possible.
  30. Drill Configuration
     Configuration file: $DRILL_HOME/conf/drill-override.conf
     drill.exec: {
       cluster-id: "<cluster_name>",
       zk.connect: "<zkhostname1>:<port>,<zkhostname2>:<port>,<zkhostname3>:<port>"
     }
     Default configuration:
     drill.exec: {
       cluster-id: "drillbits1",
       zk.connect: "localhost:2181"
     }
  31. Starting Drill in Distributed Mode
     ./bin/drillbit.sh [--config <conf-dir>] (start|stop|status|restart|autorestart)
     ./bin/drillbit.sh restart
     This restarts the Drillbit service, using the configuration provided in drill-override.conf.
     Tip: check which hostname the Drillbit is listening on. For example:
     2015-09-05 03:21:20,070 [main] INFO o.apache.drill.exec.server.Drillbit - Drillbit environment: host.name=192.168.0.101
     Start the shell:
     ./bin/drill-localhost (if the Drillbit is listening on localhost), otherwise:
     ./bin/sqlline -u "jdbc:drill:drillbit=192.168.0.101"
  32. Verify it once, and try a sample
     0: jdbc:drill:zk=local> select * from sys.drillbits;
     +----------------+------------+---------------+------------+----------+
     |    hostname    | user_port  | control_port  | data_port  | current  |
     +----------------+------------+---------------+------------+----------+
     | 192.168.0.101  | 31010      | 31011         | 31012      | true     |
     +----------------+------------+---------------+------------+----------+
     0: jdbc:drill:zk=local> select count(*) from `dfs`.`$DRILL_HOME/sample-data/nation.parquet`;
     +---------+
     | EXPR$0  |
     +---------+
     | 25      |
     +---------+
     1 row selected (1.752 seconds)
  33. Drill – Web Client: a storage plugin can be added/enabled
  34. Let’s move to Apache Flink
  35. About Apache Flink • Open source framework for Big Data analytics • Distributed streaming dataflow engine • Runs computations in-memory • Executes programs in a data-parallel and pipelined manner • Most popular for stream data processing • Provides high-level APIs in Java, Scala, and Python • Integration with Hadoop and its ecosystem; can read existing data from HDFS or HBase • https://flink.apache.org
  36. So Why Flink? Credit: compiled from several articles, blogs, and Stack Overflow posts listed on the references page. • Shares many similarities with a relational DBMS • Data is serialized in byte buffers and processed largely in a binary representation, which allows fine-grained memory control • Uses a pipeline-based processing model with a cost-based optimizer to choose the execution strategy • Optimized for cyclic or iterative processing by using iterative transformations on collections, achieved by optimized join algorithms, operator chaining, and reuse of partitioning and sorting • Flink streaming processes data streams as true streams, i.e., data elements are immediately "pipelined" through a streaming program as soon as they arrive • Also has its own memory management system, separate from Java’s garbage collector
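The "true streams / pipelined" point can be sketched with Python generators (a conceptual analogy, not Flink's API): each element flows through the whole operator chain as soon as it arrives, instead of waiting for the previous stage to finish a batch.

```python
# A tiny pipelined word-count: source -> flatMap -> running count.
def source():
    for line in ["to be", "or not", "to be"]:
        yield line

def flat_map(stream):
    # Split each line into words as it arrives.
    for line in stream:
        for word in line.split():
            yield word

def running_count(stream):
    # Emit an updated (word, count) result per element, not per batch.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]

results = list(running_count(flat_map(source())))
assert results[0] == ("to", 1)
assert results[-1] == ("be", 2)   # "be" has been seen twice by the end
```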
  37. Credit: http://www.slideshare.net/stephanewen1/apache-flink-overview
  38. Flink vs Spark (they look pretty similar)
     Apache Flink:
     case class Word(word: String, frequency: Int)
     val counts = text
       .flatMap { line => line.split(" ").map(word => Word(word, 1)) }
       .groupBy("word").sum("frequency")
     Apache Spark:
     val counts = text
       .flatMap(line => line.split(" ")).map(word => (word, 1))
       .reduceByKey { case (x, y) => x + y }
  39. But…. Apache Spark is a batch processing framework that can approximate stream processing (so-called micro-batching). Apache Flink is primarily a stream processing framework that can look like a batch processor.
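The contrast can be sketched in a few lines of stdlib Python (conceptual only, with made-up data): a micro-batcher produces results only at batch boundaries, while a true streamer produces an updated result for every element.

```python
from itertools import islice

events = list(range(10))   # made-up event stream: 0..9

def micro_batches(stream, batch_size):
    # Micro-batching: results appear once per batch.
    it = iter(stream)
    while batch := list(islice(it, batch_size)):
        yield sum(batch)

def true_stream(stream):
    # True streaming: a running result per element.
    total = 0
    for e in stream:
        total += e
        yield total

assert list(micro_batches(events, 5)) == [10, 35]   # two batch results
assert list(true_stream(events))[-1] == 45          # ten incremental results
```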
  40. Credit: http://www.slideshare.net/stephanewen1/apache-flink-overview
  41. Credit: http://www.slideshare.net/stephanewen1/apache-flink-overview
  42. Flink – Web Client: arguments to the program are separated by spaces
  43. Flink – Web Client
  44. References
     • https://flink.apache.org/
     • https://www.quora.com/What-are-the-differences-between-Apache-Spark-and-Apache-Flink
     • http://stackoverflow.com/questions/28082581/what-is-the-differences-between-apache-spark-and-apache-flink
     • http://statrgy.com/2015/06/01/best-data-processing-engine-flink-vs-spark/
     • http://stackoverflow.com/questions/29780747/apache-flink-vs-apache-spark-as-platforms-for-large-scale-machine-learning
     • http://www.infoworld.com/article/2919602/hadoop/flink-hadoops-new-contender-for-mapreduce-spark.html
     • http://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.html
  45. Thanks! @rahuldausa on Twitter and SlideShare http://www.linkedin.com/in/rahuldausa
