JDD 2016 - Michal Matloka - Small Intro To Big Data

Pig, Hive, Flink, Kafka, Zeppelin... If you are now wondering whether someone just tried to offend you or whether those are just Pokemon names, then this talk is for you! Big Data is everywhere, and new tools for it are released almost as fast as new JavaScript frameworks. In this entry-level presentation we will walk through the challenges that Big Data presents, reflect on how big "big" really is, and introduce the most popular (mostly open source) tools of the moment. We'll try to spark interest in Big Data by showing application areas and suggesting topics you can dive into later.

Published in: Technology

  1. Small intro to Big Data
     Michał Matłoka @mmatloka
  2. Outline
     1. What is Big Data?
     2. Storing
     3. Batch & Stream processing
     4. Resource Managers
     5. Machine Learning
     6. Analysis & Visualization
     7. Other
  3. What is Big Data?
     ● Volume
     ● Velocity
     ● Variety
  4. How big is Big?
  5. ThoughtWorks: Big Data envy
  6. Storing
  7. CAP theorem (Brewer’s theorem)
     In a distributed system you can only have two of three guarantees:
     ● Consistency
     ● Availability
     ● Partition Tolerance
  8. Relational scaling (horizontal)
     Example limitations:
     ● Max 48 nodes
     ● Read-only nodes
     ● Cross-shard joins…
     ● Auto-increments
     ● Distributed transactions, possible, but…
     It can work!
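The auto-increment limitation listed on the relational scaling slide can be illustrated with a small sketch. It is plain Java and does not model any particular database; the class `ShardIdSketch` and all of its method names are hypothetical. It assumes each shard keeps its own local counter, shows why that produces colliding IDs, and shows one common workaround, an offset-and-stride scheme.

```java
// Illustrative only: models ID generation on shards, not any database's real behaviour.
final class ShardIdSketch {

    // Routes a key to one of shardCount shards by hash (hypothetical routing rule).
    static int shardFor(String key, int shardCount) {
        return Math.floorMod(key.hashCode(), shardCount);
    }

    // Naive auto-increment: every shard counts 1, 2, 3, ... independently,
    // so the shard index plays no role and different shards hand out the same ID.
    static long naiveId(int shardIndex, long localSequence) {
        return localSequence;
    }

    // Offset-and-stride: shard i of n hands out i + 1, i + 1 + n, i + 1 + 2n, ...
    // so IDs stay unique across shards without any coordination.
    static long stridedId(int shardIndex, int shardCount, long localSequence) {
        return (localSequence - 1) * shardCount + shardIndex + 1;
    }

    public static void main(String[] args) {
        int shards = 3;
        for (int shard = 0; shard < shards; shard++) {
            System.out.println("shard " + shard
                    + " naive=" + naiveId(shard, 1)
                    + " strided=" + stridedId(shard, shards, 1));
        }
    }
}
```

The strided variant only stays collision-free while the shard count is fixed, which is one reason resharding a relational database is harder than it first looks.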
  9. You don’t always need ACID
  10. BASE might be enough
  11. NoSQL (Not only SQL)
      ● Key-value (Redis, Dynamo, ...)
      ● Column (Cassandra, HBase, ...)
      ● Document (MongoDB, …)
      ● Graph (Neo4j, …)
      ● Multi-model (OrientDB, …)
      Apple - 115k Cassandra nodes with over 10PB of data!
  12. Source: http://blog.nahurst.com/visual-guide-to-nosql-systems
  13. Batch Processing
      ● Processes data from one or more sources covering a longer period of time (e.g. day, month)
      ● Source: db, Apache Parquet, ...
      ● Not real-time
      ● Can take hours or more
  14. Apache Hadoop (MapReduce)
      ● Based on a Google paper
      ● First release in 2006!
      ● Map -> (shuffle) -> Reduce
      ● Was the beginning of many projects
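The Map -> (shuffle) -> Reduce flow named on the Hadoop slide can be sketched in memory before looking at the real API. This is a minimal single-JVM illustration, not Hadoop code: the class `MapReduceSketch` and its method names are assumptions made for this example, and nothing here is distributed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the three phases; this is NOT the Hadoop API.
final class MapReduceSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all values that were emitted for the same key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the grouped values of one key into a single number.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases in order over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(mapped).forEach((word, ones) -> result.put(word, reduce(ones)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data", "big tools")));
    }
}
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the sketch only shows how the data changes shape between phases.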
  15. Hadoop Wordcount - part I
      // Imports used by parts I-III (all three parts are members of one WordCount class)
      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      // Mapper: emits (word, 1) for every token in the input line
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }
      Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
  16. Hadoop Wordcount - part II
      // Reducer: sums the counts collected for one word
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
      Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
  17. Hadoop Wordcount - part III
      // Driver: wires mapper, combiner and reducer into a job; args[0] is the input path, args[1] the output path
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
      Source: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
  18. Apache Spark
      RDD (Resilient Distributed Dataset)
      DAG (Directed acyclic graph)
      ● RDD - map, filter, count etc.
      ● Spark SQL
      ● MLlib
      ● GraphX
      ● Spark Streaming
      ● API: Scala, Java, Python, R*
  19. Spark Wordcount
      // sc is an existing SparkContext
      val textFile = sc.textFile("hdfs://...")
      val counts = textFile
        .flatMap(line => line.split(" "))  // one line becomes many words
        .map(word => (word, 1))            // pair each word with a count of 1
        .reduceByKey(_ + _)                // sum the counts per word
      counts.saveAsTextFile("hdfs://...")
      Source: http://spark.apache.org/examples.html
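For readers who follow the Java examples more easily than Scala, the same flatMap / map / reduceByKey shape can be tried locally with `java.util.stream`. This is only an analogue of the Spark pipeline: it runs in a single JVM, does not use Spark, and the class name `StreamWordCount` is made up for this sketch. The filter for empty strings is an addition that the Spark example does not have.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Local, single-JVM analogue of the Spark pipeline; uses java.util.stream, not Spark.
final class StreamWordCount {

    static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
                // flatMap: one line becomes many words
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // addition for this sketch: drop empty tokens produced by repeated spaces
                .filter(word -> !word.isEmpty())
                // map + reduceByKey: pair each word with 1 and sum the ones per word
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("spark flink spark", "kafka  spark")));
    }
}
```

The merge function passed to `Collectors.toMap` plays the role of `reduceByKey(_ + _)`: it is called whenever two values meet under the same key.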
  20. Apache Flink
      DAGs with iterations
      ● Batch & native streaming
      ● FlinkML
      ● Table API & SQL (Beta)
      ● Gelly - graph analytics
      ● FlinkCEP - detect patterns in data streams
      ● Compatible with Apache Hadoop and Apache Storm APIs
      ● API: Scala, Java, Python*
  21. Stream processing
      ● Near real-time/real time
      ● Processing (usually) does not end
      ● Source: files, Apache Kafka, Socket, Akka Actors, Twitter, RabbitMQ etc.
      ● Event time vs processing time
      ● Windows - fixed, sliding, session
      ● Watermarks
      ● State
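The window and watermark bullets above can be made concrete with a small sketch of how an event's timestamp maps to fixed and sliding windows. It is plain Java rather than the API of any streaming framework; `WindowSketch` and its method names are hypothetical, windows are assumed to be half-open intervals [start, start + size) aligned to the epoch, and the lateness check is a deliberate simplification of what real engines do. Session windows are not covered.

```java
import java.util.ArrayList;
import java.util.List;

// Window assignment by event time; all names are illustrative, not a framework API.
final class WindowSketch {

    // Fixed (tumbling) window: every timestamp belongs to exactly one window.
    static long fixedWindowStart(long eventTimeMs, long sizeMs) {
        return eventTimeMs - Math.floorMod(eventTimeMs, sizeMs);
    }

    // Sliding window: a timestamp belongs to every window of length sizeMs whose
    // start is a multiple of slideMs and whose interval still covers the timestamp.
    static List<Long> slidingWindowStarts(long eventTimeMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        long lastStart = eventTimeMs - Math.floorMod(eventTimeMs, slideMs);
        for (long start = lastStart; start > eventTimeMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    // Simplified watermark rule (an assumption of this sketch): an event is late
    // when the watermark has already reached the end of the event's fixed window.
    static boolean isLate(long eventTimeMs, long sizeMs, long watermarkMs) {
        return fixedWindowStart(eventTimeMs, sizeMs) + sizeMs <= watermarkMs;
    }

    public static void main(String[] args) {
        System.out.println(fixedWindowStart(12_345L, 10_000L));
        System.out.println(slidingWindowStarts(12_000L, 10_000L, 5_000L));
        System.out.println(isLate(12_345L, 10_000L, 20_000L));
    }
}
```

Because assignment uses the event's own timestamp instead of the arrival time, an event that arrives late still lands in the window it belongs to; the watermark is what decides whether that window is still open.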
  22. Stream processing: Differences
      ● Native/micro-batch
      ● Latency
      ● Throughput
      ● Delivery guarantees
      ● Resource managers
      ● API - compositional/declarative
      ● Maturity
  23. Stream processing
      ● Apache Storm
      ● Apache Storm Trident
      ● Alibaba JStorm
      ● Twitter Heron
      ● Apache Spark Streaming
      ● Apache Flink
      ● Apache Beam
      ● Apache Kafka Streams
      ● Apache Samza
      ● Apache Gearpump
      ● Apache Apex
      ● Apache Ignite Streaming
      ● Apache S4
      ● ...
  24. Resource management
      ● Apache YARN
      ● Apache Mesos (1.0.1!)
      ● Apache Slider - deploy existing apps on YARN
      ● Apache Myriad - YARN on Mesos
      ● DC/OS
  25. Source: https://docs.mesosphere.com/wp-content/uploads/2016/04/dashboard-ee-600x395@2x.gif
  26. Analysis: SQL Engines & Querying
      ● Apache Hive
      ● Apache Pig
      ● Apache HAWQ
      ● Apache Impala
      ● Apache Phoenix
      ● Apache Spark SQL
      ● Apache Drill
      ● Facebook Presto
      ● ...
  27. Machine Learning
      ● Apache Mahout
      ● Apache Samoa
      ● Spark MLlib
      ● FlinkML
      ● H2O
      ● TensorFlow
  28. Notebooks
      ● IPython
      ● Jupyter
      ● Spark Notebook
      ● Apache Zeppelin
  29. Source: https://zeppelin.apache.org/assets/themes/zeppelin/img/notebook.png
  30. Hadoop-related
      ● Apache Sqoop
      ● Apache Flume
      ● Apache Oozie
      ● Hue
      ● Apache HDFS
      ● Apache Ambari
      ● Apache Knox
      ● Apache ZooKeeper
  31. Awesome Big Data
      https://github.com/onurakpolat/awesome-bigdata
  32. Conclusions
      ● There is a lot of it!
      ● https://pixelastic.github.io/pokemonorbigdata/
      ● If you want to learn, start with the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka)
  33. Articles & references
      ● https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/
      ● http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1
      ● https://dcos.io/
      ● https://www.oreilly.com/ideas/a-tale-of-two-clusters-mesos-and-yarn
      ● http://spark.apache.org/
      ● https://flink.apache.org/
      ● http://www.51zero.com/blog/2015/12/13/why-apache-flink-is-the-4th-generation-of-big-data-analytics-frameworks
      ● http://www.slideshare.net/AndyPiper1/reactconf-2014-event-stream-processing
      ● https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
      ● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
      ● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
  34. Thank you, Q&A?
      @mmatloka
      http://www.slideshare.net/softwaremill
      https://softwaremill.com/blog/