
Small intro to Big Data - Old version


Presented at Codemotion Warsaw 2016 and JDD 2016.

Pig, Hive, Flink, Kafka, Zeppelin... if you are now wondering whether someone just tried to offend you, or whether those are just Pokémon names, then this talk is for you!

Big Data is everywhere, and new tools for it are released almost at the speed of new JavaScript frameworks. During this entry-level presentation we will walk through the challenges that Big Data presents, reflect on how big "big" really is, and introduce the currently most popular (mostly open-source) tools.

We'll try to spark off interest in Big Data by showing application areas and suggesting topics you can later dive into.


  1. Small intro to Big Data - Michał Matłoka (@mmatloka)
  2. Outline
     1. What is Big Data?
     2. Storing
     3. Batch & stream processing
     4. Resource managers
     5. Machine learning
     6. Analysis & visualization
     7. Other
  3. What is Big Data?
     ● Volume
     ● Velocity
     ● Variety
  4. How big is Big?
  5. ThoughtWorks: Big Data envy
  6. Storing
  7. CAP theorem (Brewer’s theorem)
     In a distributed system you can only have two of three guarantees:
     ● Consistency
     ● Availability
     ● Partition tolerance
  8. Relational scaling (horizontal)
     Example limitations:
     ● Max 48 nodes
     ● Read-only nodes
     ● Cross-shard joins…
     ● Auto-increments
     ● Distributed transactions are possible, but…
     It can work!
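     Why cross-shard joins and fan-out queries hurt becomes clearer with a toy sketch: in hash-based horizontal sharding each row is routed to exactly one shard by its key, so single-key lookups are cheap while anything touching many keys must contact every shard. The ShardRouter class below is a hypothetical, in-memory illustration (not real database code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of hash-based sharding: each row lands on one shard,
// chosen by hashing its key.
public class ShardRouter {
    private final List<Map<String, String>> shards = new ArrayList<>();

    public ShardRouter(int shardCount) {
        for (int i = 0; i < shardCount; i++) shards.add(new HashMap<>());
    }

    // A single-key operation hits exactly one shard -- the cheap case.
    private int shardFor(String key) {
        return Math.floorMod(key.hashCode(), shards.size());
    }

    public void put(String key, String value) {
        shards.get(shardFor(key)).put(key, value);
    }

    public String get(String key) {
        return shards.get(shardFor(key)).get(key);
    }

    // A query without a shard key (count, join, range scan) must
    // fan out to every shard -- the expensive case.
    public int totalRows() {
        return shards.stream().mapToInt(Map::size).sum();
    }
}
```

     Note there is no global auto-increment here either: each shard only sees its own data, which is exactly why the slide lists auto-increments among the limitations.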
  9. You don’t always need ACID
  10. BASE might be enough
  11. NoSQL (Not only SQL)
     ● Key-value (Redis, Dynamo, ...)
     ● Column (Cassandra, HBase, ...)
     ● Document (MongoDB, ...)
     ● Graph (Neo4j, ...)
     ● Multi-model (OrientDB, ...)
     Apple - 115k Cassandra nodes with over 10 PB of data!
  12. Source: http://blog.nahurst.com/visual-guide-to-nosql-systems
  13. Batch processing
     ● Processes data from one or more sources covering a longer period of time (e.g. a day or a month)
     ● Sources: databases, Apache Parquet, ...
     ● Not real-time
     ● Can take hours or more
  14. Apache Hadoop MapReduce
     ● Based on a Google paper
     ● First release in 2006!
     ● Map -> (shuffle) -> Reduce
     ● Was the beginning of many projects
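     Before the real Hadoop code on the next slides, the Map -> (shuffle) -> Reduce flow can be sketched framework-free in plain Java: map emits a (word, 1) pair per word, the shuffle groups pairs by key, and reduce sums each group. This MiniMapReduce class is a conceptual illustration only, not Hadoop code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Conceptual word count in the MapReduce shape:
// map -> (word, 1) pairs, shuffle -> group by word, reduce -> sum counts.
public class MiniMapReduce {
    public static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
                // map: split each line into words (each word stands for a (word, 1) pair)
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(w -> !w.isEmpty())
                // shuffle + reduce: group identical words, then sum one count per occurrence
                .collect(Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)));
    }
}
```

     The Hadoop version that follows does exactly this, except the map and reduce steps run as distributed tasks and the shuffle moves data between cluster nodes.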
  15. Hadoop Wordcount - part I

     public static class TokenizerMapper
         extends Mapper<Object, Text, Text, IntWritable> {

       private final static IntWritable one = new IntWritable(1);
       private Text word = new Text();

       public void map(Object key, Text value, Context context)
           throws IOException, InterruptedException {
         StringTokenizer itr = new StringTokenizer(value.toString());
         while (itr.hasMoreTokens()) {
           word.set(itr.nextToken());
           context.write(word, one);
         }
       }
     }

     Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
  16. Hadoop Wordcount - part II

     public static class IntSumReducer
         extends Reducer<Text, IntWritable, Text, IntWritable> {

       private IntWritable result = new IntWritable();

       public void reduce(Text key, Iterable<IntWritable> values, Context context)
           throws IOException, InterruptedException {
         int sum = 0;
         for (IntWritable val : values) {
           sum += val.get();
         }
         result.set(sum);
         context.write(key, result);
       }
     }

     Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
  17. Hadoop Wordcount - part III

     public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();
       Job job = Job.getInstance(conf, "word count");
       job.setJarByClass(WordCount.class);
       job.setMapperClass(TokenizerMapper.class);
       job.setCombinerClass(IntSumReducer.class);
       job.setReducerClass(IntSumReducer.class);
       job.setOutputKeyClass(Text.class);
       job.setOutputValueClass(IntWritable.class);
       FileInputFormat.addInputPath(job, new Path(args[0]));
       FileOutputFormat.setOutputPath(job, new Path(args[1]));
       System.exit(job.waitForCompletion(true) ? 0 : 1);
     }

     Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
  18. Apache Spark
     RDD (Resilient Distributed Dataset), DAG (directed acyclic graph)
     ● RDD operations - map, filter, count, etc.
     ● Spark SQL
     ● MLlib
     ● GraphX
     ● Spark Streaming
     ● API: Scala, Java, Python, R*
  19. Spark Wordcount

     val textFile = sc.textFile("hdfs://...")
     val counts = textFile
       .flatMap(line => line.split(" "))
       .map(word => (word, 1))
       .reduceByKey(_ + _)
     counts.saveAsTextFile("hdfs://...")

     Source: http://spark.apache.org/examples.html
  20. Apache Flink
     DAGs with iterations
     ● Batch & native streaming
     ● FlinkML
     ● Table API & SQL (beta)
     ● Gelly - graph analytics
     ● FlinkCEP - detects patterns in data streams
     ● Compatible with Apache Hadoop and Apache Storm APIs
     ● API: Scala, Java, Python*
  21. Stream processing
     ● Near real-time / real-time
     ● Processing (usually) does not end
     ● Sources: files, Apache Kafka, sockets, Akka actors, Twitter, RabbitMQ, etc.
     ● Event time vs processing time
     ● Windows - fixed, sliding, session
     ● Watermarks
     ● State
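     The event-time windowing idea above can be sketched in a few lines of plain Java, with no streaming framework: each event carries its own timestamp (event time) and is assigned to the fixed-size window containing that timestamp, regardless of when it arrives (processing time). The TumblingWindows helper below is a hypothetical illustration of fixed (tumbling) windows only; watermarks and state handling are what real frameworks add on top:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch of event-time tumbling (fixed) windows.
public class TumblingWindows {
    // An event carries its own event-time timestamp plus a payload value.
    record Event(long eventTimeMillis, int value) {}

    // Sum event values per window of length windowMillis,
    // keyed by the window's start timestamp.
    public static Map<Long, Integer> sumPerWindow(List<Event> events, long windowMillis) {
        Map<Long, Integer> sums = new TreeMap<>();
        for (Event e : events) {
            // Assign by event time: a late-arriving event still lands
            // in the window its timestamp belongs to.
            long windowStart = (e.eventTimeMillis() / windowMillis) * windowMillis;
            sums.merge(windowStart, e.value(), Integer::sum);
        }
        return sums;
    }
}
```

     A sliding window would assign each event to several overlapping windows, and a session window would be bounded by gaps of inactivity instead of fixed edges.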
  22. Stream processing - differences
     ● Native/micro-batch
     ● Latency
     ● Throughput
     ● Delivery guarantees
     ● Resource managers
     ● API - compositional/declarative
     ● Maturity
  23. Stream processing
     ● Apache Storm
     ● Apache Storm Trident
     ● Alibaba JStorm
     ● Twitter Heron
     ● Apache Spark Streaming
     ● Apache Flink
     ● Apache Beam
     ● Apache Kafka Streams
     ● Apache Samza
     ● Apache Gearpump
     ● Apache Apex
     ● Apache Ignite Streaming
     ● Apache S4
     ● ...
  24. Resource management
     ● Apache YARN
     ● Apache Mesos (1.0.1!)
     ● Apache Slider - deploy existing apps on YARN
     ● Apache Myriad - YARN on Mesos
     ● DC/OS
  25. Source: https://docs.mesosphere.com/wp-content/uploads/2016/04/dashboard-ee-600x395@2x.gif
  26. Analysis - SQL engines & querying
     ● Apache Hive
     ● Apache Pig
     ● Apache HAWQ
     ● Apache Impala
     ● Apache Phoenix
     ● Apache Spark SQL
     ● Apache Drill
     ● Facebook Presto
     ● ...
  27. Machine learning
     ● Apache Mahout
     ● Apache SAMOA
     ● Spark MLlib
     ● FlinkML
     ● H2O
     ● TensorFlow
  28. Notebooks
     ● IPython
     ● Jupyter
     ● Apache Zeppelin
  29. Source: https://zeppelin.apache.org/assets/themes/zeppelin/img/notebook.png
  30. Hadoop-related
     ● Apache Sqoop
     ● Apache Flume
     ● Apache Oozie
     ● Hue
     ● Apache HDFS
     ● Apache Ambari
     ● Apache Knox
     ● Apache ZooKeeper
  31. Awesome Big Data: https://github.com/onurakpolat/awesome-bigdata
  32. Conclusions
     ● There is a lot of it!
     ● https://pixelastic.github.io/pokemonorbigdata/
     ● If you want to learn, start with the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka)
  33. Articles & references
     ● https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/
     ● http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1
     ● https://dcos.io/
     ● https://www.oreilly.com/ideas/a-tale-of-two-clusters-mesos-and-yarn
     ● http://spark.apache.org/
     ● https://flink.apache.org/
     ● http://www.51zero.com/blog/2015/12/13/why-apache-flink-is-the-4th-generation-of-big-data-analytics-frameworks
     ● http://www.slideshare.net/AndyPiper1/reactconf-2014-event-stream-processing
     ● https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
     ● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
     ● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
  34. Thank you. Q&A?
     @mmatloka
     http://www.slideshare.net/softwaremill
     https://softwaremill.com/blog/
