Slides for the Apache Giraph tutorial for the Data Mining class.
Sapienza, University of Rome.
Master of Science in Engineering in Computer Science
Prof. A. Anagnostopoulos, I. Chatzigiannakis, A. Gionis
Data Mining class
Fall 2016
Parallelizing with Apache Spark in Unexpected Ways (Databricks)
Out of the box, Spark provides rich and extensive APIs for performing in-memory, large-scale computation across data. Once a system has been built and tuned with Spark Datasets/DataFrames/RDDs, have you ever been left wondering if you could push the limits of Spark even further? In this session, we will cover some of the tips learned while building retail-scale systems at Target to maximize the parallelization that you can achieve from Spark in ways that may not be obvious from current documentation. Specifically, we will cover multithreading the Spark driver with Scala Futures to enable parallel job submission. We will talk about developing custom partitioners to leverage the ability to apply operations across understood chunks of data and what tradeoffs that entails. We will also dive into strategies for parallelizing scripts that might have nothing to do with Spark, to support environments where peers work in multiple languages or perhaps a different language/library is just the best tool to get the job done. Come learn how to squeeze every last drop out of your Spark job with strategies for parallelization that go off the beaten path.
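The parallel-submission pattern the abstract describes uses Scala Futures; since no code accompanies the abstract, here is a minimal sketch of the same idea in Python using `concurrent.futures`. The `run_job` function is a hypothetical stand-in for an independent Spark action (e.g. a `count()` or a write) triggered from the driver:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an independent Spark action; in a real driver
# each call would trigger its own job against the cluster.
def run_job(name, data):
    return name, sum(data)

jobs = {"clicks": range(100), "views": range(50)}

# Submitting from multiple driver threads lets independent jobs run
# concurrently instead of strictly one after another.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_job, name, data) for name, data in jobs.items()]
    results = dict(f.result() for f in futures)

print(results)  # {'clicks': 4950, 'views': 1225}
```

The same shape applies in Scala: wrap each action in a `Future`, then await the collection of futures, letting the Spark scheduler interleave the jobs.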
Hive Bucketing in Apache Spark with Tejas Patil (Databricks)
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (e.g., a single-stage sort-merge join), the ability to short-circuit FILTER operations if the file is pre-sorted on the column in a filter predicate, and quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to make Spark 3-5x faster when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in 2-3x savings when compared to Hive. You’ll also hear about real-world applications of bucketing, like loading cumulative tables with daily deltas, and the characteristics that help identify candidate jobs that can benefit from bucketing.
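The mechanism can be sketched outside Spark: write-time bucketing assigns each row to a bucket by hashing the join key, so two tables bucketed the same way can be joined bucket-by-bucket with no shuffle. A minimal Python sketch, where the bucket count and sample tables are invented for illustration:

```python
NUM_BUCKETS = 4

def bucket_of(key):
    # A deterministic hash of the join key decides the bucket at write time.
    return hash(key) % NUM_BUCKETS

def write_bucketed(rows, key_col):
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[key_col])].append(row)
    return buckets

orders = [{"user": u, "amount": a} for u, a in [(1, 10), (2, 20), (3, 30)]]
users  = [{"user": u, "name": n} for u, n in [(1, "ann"), (2, "bob"), (3, "cat")]]

ob = write_bucketed(orders, "user")
ub = write_bucketed(users, "user")

# Join bucket-by-bucket: matching keys are guaranteed to share a bucket
# index, so no cross-bucket data movement (shuffle) is needed at read time.
joined = []
for i in range(NUM_BUCKETS):
    names = {r["user"]: r["name"] for r in ub[i]}
    for r in ob[i]:
        if r["user"] in names:
            joined.append((r["user"], names[r["user"]], r["amount"]))

print(sorted(joined))
```

The one-time cost is the hashing at write time; every later join or aggregation on the bucketing key skips the shuffle stage entirely.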
Compression Options in Hadoop - A Tale of Tradeoffs (DataWorks Summit)
Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo!'s Hadoop clusters. A key component that enables this efficient operation is data compression. With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4 and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with Gridmix and with several corpora of data are presented. The paper also describes enhancements we have made to the bzip2 codec that improve its performance. This will be of particular interest to the increasing number of users operating on “Big Data” who require the best possible ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.
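The ratio-versus-speed tension is easy to observe directly. A small Python sketch comparing two codec families Hadoop also supports (zlib, the DEFLATE algorithm behind gzip, and bzip2) on invented repetitive sample data:

```python
import bz2
import time
import zlib

# Highly repetitive input, like many log files.
data = b"row,of,log,data\n" * 50_000

for name, compress in [("zlib (gzip-style)", zlib.compress), ("bzip2", bz2.compress)]:
    start = time.perf_counter()
    out = compress(data)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Ratio = original size / compressed size; higher is better.
    print(f"{name}: ratio {len(data) / len(out):.0f}x in {elapsed_ms:.1f} ms")
```

On data like this, bzip2 tends to achieve a better ratio while taking noticeably longer, which is exactly the tradeoff to weigh when picking a per-job codec.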
ORC files were originally introduced in Hive, but have now migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and NiFi. There are also many new tools built on top of ORC, such as Hive’s ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to only see the rows and columns that they have permission for.
This talk will discuss the details of the ORC and Parquet formats and what the relevant tradeoffs are. In particular, it will discuss how to format your data and the options to use to maximize your read performance. Specifically, we’ll discuss when and how to use ORC’s schema evolution, bloom filters, and predicate pushdown. It will also show you how to use the tools to translate ORC files into human-readable formats, such as JSON, and display the rich metadata from the file, including the types in the file and the min, max, and count for each column.
File Format Benchmarks - Avro, JSON, ORC, & Parquet (Owen O'Malley)
Hadoop Summit June 2016
The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so. The use cases that we’ve examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
* writing the data
Furthermore, it is important to benchmark on real data rather than synthetic data. We used the GitHub logs data available freely from http://githubarchive.org. We will make all of the benchmark code open source so that our experiments can be replicated.
Observabilidad: Todo lo que hay que ver (Software Guru)
The code we write lives, and finds its purpose, the moment it reaches production... How do we know how effective it is? We can only know by measuring it.
Presented by Isaac Ruiz Guerra at SG Virtual Conference 2020.
From Query Plan to Query Performance: Supercharging your Apache Spark Queries... (Databricks)
The SQL tab in the Spark UI provides a lot of information for analyzing your Spark queries, ranging from the query plan to all associated statistics. However, many new Spark practitioners get overwhelmed by the information presented and have trouble using it to their benefit. In this talk we want to give a gentle introduction to reading this SQL tab. We will first go over the common Spark operations, such as scans, projects, filters, aggregations, and joins, and how they relate to the Spark code written. In the second part of the talk we will show how to read the associated statistics to pinpoint performance bottlenecks.
A Deep Dive into Query Execution Engine of Spark SQL (Databricks)
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics-database technologies. Relational queries are compiled to executable physical plans consisting of transformations and actions on RDDs, with generated Java code. The code is compiled to Java bytecode, executed by the JVM, and optimized by the JIT compiler into native machine code at runtime. This talk takes a deep dive into the Spark SQL execution engine, covering pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, and lineage-based RDD transformations and actions.
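The whole-stage idea can be illustrated in miniature: instead of interpreting a chain of operators row by row, fuse them into one generated function and compile it. A toy Python sketch, where the "plan" and the code generator are simplified inventions, not Spark's actual Catalyst output:

```python
# A toy physical plan: filter rows, then project a computed column.
plan = [("filter", "row['x'] > 2"), ("project", "{'y': row['x'] * 10}")]

def compile_stage(plan):
    # Fuse all operators into one generated loop body, then compile it,
    # mimicking whole-stage code generation.
    body = ["def stage(rows):", "    out = []", "    for row in rows:"]
    indent = "        "
    for op, expr in plan:
        if op == "filter":
            body.append(f"{indent}if not ({expr}): continue")
        elif op == "project":
            body.append(f"{indent}row = {expr}")
    body.append(f"{indent}out.append(row)")
    body.append("    return out")
    namespace = {}
    exec("\n".join(body), namespace)  # compile generated source to bytecode
    return namespace["stage"]

stage = compile_stage(plan)
print(stage([{"x": 1}, {"x": 3}, {"x": 5}]))  # [{'y': 30}, {'y': 50}]
```

The win is the same as in Spark: one tight compiled loop with no per-operator virtual-call overhead between the filter and the projection.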
Rethinking State Management in Cloud-Native Streaming Systems (Yingjun Wu)
Current 2022 talk.
Speaker: Yingjun Wu
Title: Rethinking State Management in Cloud-Native Streaming Systems.
Abstract:
Stream processing is becoming increasingly essential for extracting business value from data in real time. To achieve strict user-defined SLAs under constantly changing workloads, modern streaming systems have started taking advantage of the cloud for scalable and resilient resources. This new demand opens new opportunities and challenges for state management, which is at the core of streaming systems. Existing approaches typically use embedded key-value storage so that each worker can access it locally to achieve high performance. However, this requires an external durable file system for checkpointing, makes it complicated and time-consuming to redistribute state during scaling and migration, and is prone to performance throttling. Therefore, we propose shared storage based on an LSM-tree. State is stored in cloud object storage, which seamlessly makes it durable, and the high bandwidth of cloud storage enables fast recovery. The location of a state partition is decoupled from compute nodes, making scaling straightforward and more efficient. Compaction in this shared LSM-tree is globally coordinated with opportunistic serverless boosting instead of relying on individual compute nodes. We design a streaming-aware compaction and caching strategy to achieve smoother and better end-to-end performance.
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P... (Databricks)
Parquet is a very popular column-based format. Spark can automatically filter out useless data by pushing down filters against parquet file statistics, such as min-max statistics. In addition, Spark users can enable the parquet vectorized reader to read parquet files in batches. These features improve Spark performance greatly and save both CPU and IO. Parquet is the default data format of the data warehouse at Bytedance. In practice, we found that parquet pushdown filters work poorly, reading far too much unnecessary data, because the statistics have no discrimination across parquet row groups (column data is out of order when written to parquet files by ETL jobs).
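The min-max mechanism the abstract refers to can be sketched: each row group stores its column min and max, and a pushdown filter skips whole groups whose range cannot match the predicate. Sorted data gives tight ranges, while unsorted data makes every group's range wide so nothing is skipped. The row-group layout below is invented for illustration:

```python
def row_groups(values, group_size=4):
    # Each "row group" keeps min/max statistics, as parquet footers do.
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g), g) for g in groups]

def read_where_equals(groups, target):
    scanned = 0
    hits = []
    for lo, hi, group in groups:
        if target < lo or target > hi:
            continue  # pushdown: skip the whole group without reading it
        scanned += 1
        hits += [v for v in group if v == target]
    return hits, scanned

sorted_groups = row_groups([1, 2, 3, 4, 5, 6, 7, 8])
shuffled_groups = row_groups([8, 1, 6, 3, 2, 7, 4, 5])

print(read_where_equals(sorted_groups, 6))    # ([6], 1): first group skipped
print(read_where_equals(shuffled_groups, 6))  # ([6], 2): both ranges overlap 6
```

This is why out-of-order ETL output defeats pushdown: with shuffled data every row group's [min, max] covers the target, so every group must be scanned.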
HBaseCon 2015 General Session: Zen - A Graph Data Model on HBase (HBaseCon)
Zen is a storage service built at Pinterest that offers a graph data model on top of HBase and potentially other storage backends. In this talk, Zen's architects go over the design motivation for Zen and describe its internals, including the API, type system, and HBase backend.
Fast, Scalable Graph Processing: Apache Giraph on YARN (DataWorks Summit)
Apache Giraph performs offline, batch processing of very large graph datasets on top of a Hadoop cluster. Giraph replaces iterative MapReduce-style solutions with Bulk Synchronous Parallel graph processing using in-memory or disk-based data sets, loosely following the model of Google's Pregel. Many recent advances have left Giraph more robust, efficient, and fast, and able to accept a variety of I/O formats typical for graph data in and out of the Hadoop ecosystem. Giraph's recent port to a pure YARN platform offers increased performance, fine-grained resource control, and scalability that Giraph atop Hadoop MRv1 cannot match, while paving the way for ports to other platforms like Apache Mesos. Come see what's on the roadmap for Giraph, what Giraph on YARN means, and how Giraph is leveraging the power of YARN to become a more robust, usable, and useful platform for processing Big Graph datasets.
Aspect-level sentiment analysis of customer reviews using Double Propagation (Hardik Dalal)
Aspect-Based Sentiment Analysis (ABSA) of customer reviews is an ongoing research area in the data mining domain. The algorithm detects aspects in reviews using Double Propagation, and uses PageRank to rank the aspects based on their occurrence.
Discover psychographic marketing and how your business can turn around within 30 days. Not many people are talking about this type of marketing, because it can be confusing.
Hopefully, this will show you how to break it down so it is easier to digest. Marketing is key to success!
Yelp Data Challenge - Discovering Latent Factors using Ratings and Reviews (Tharindu Mathew)
A restaurant's average rating and reviews on Yelp influence customers to an incredible degree. An extra half-star rating causes restaurants to sell out 19 percentage points (49%) more frequently. Despite the impact on the restaurant's business, achieving a better overall rating is not straightforward. A user may give only one star to the restaurant just because he or she found the quality of service to be abysmal, even though the food and the restaurant's location were up to his or her standard. These facts may have been mentioned in the review in detail, but the final rating would just reflect the poor quality of service. The user rating alone does not provide any additional details, and as a result the restaurant may not be able to understand which aspects create a negative impact on user experience. Another case may be that a certain popular dish makes users give the restaurant five-star ratings, but they would not be satisfied with another aspect of the restaurant, such as the dessert. The high user ratings may hide the fact that some aspects of the user experience were negative and that the restaurant has room to improve. Traditional recommender systems usually use only the aggregated ratings without considering the hidden factors in the preferences of the users and the properties of the restaurants. For the restaurant domain, this could mean main cuisine, dessert, service, staff friendliness, knowledge of staff, location, ambiance, price, and many more aspects. Without considering the ratings for individual aspects, it is likely that recommendation systems will give inaccurate predictions to restaurants as well as users.
In this project, we aim to uncover hidden details about the users' preferences with respect to restaurant properties. With this information, we can provide precise recommendations to the restaurants regarding what aspects they should concentrate on to improve user experience. Since we are backed by more meaningful information about users' preferences we can provide better recommendations to users as to which restaurants they would prefer and why. To summarize, from the results of this project, we can answer the following questions: "what does a particular user care about when dining from a restaurant?", "which aspect should the restaurant improve in order to effectively increase the rating?", and "which restaurant is the best for a particular user?"
Jed Nachman, Vice President of Sales at Yelp, presents on how companies can manage and benefit from user-generated reviews. He begins the presentation by reviewing the benefits and downsides of online reviews, and then moves on to ways apartment operators and other industries can manage a significant amount of user-generated content.
Snapchat is an awesome messaging app, but why can users not yet communicate with groups in-app? This presentation shows how Group Snaps can fit snugly into the existing Snapchat app, from start to finish in the product development cycle.
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra (DataStax Academy)
This session covers our experience with using the Spark and Shark frameworks for running real-time queries on top of Cassandra data. We will start by surveying the current Cassandra analytics landscape, including Hadoop and Hive, and touch on the use of custom input formats to extract data from Cassandra. We will then dive into Spark and Shark, two memory-based cluster computing frameworks, and how they enable often dramatic improvements in query speed and productivity over the standard solutions today.
Debugging Apache Spark - Scala & Python super happy fun times 2017 (Holden Karau)
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in our job.
Since its debut in 2010, Apache Spark has become one of the most popular Big Data technologies in the Apache open source ecosystem. In addition to enabling processing of large data sets through its distributed computing architecture, Spark provides out-of-the-box support for machine learning, streaming, and graph processing in a single framework. Spark has been supported by companies like Microsoft, Google, Amazon, and IBM, and in financial services, companies like Blackrock (http://bit.ly/1Q1DVJH) and Bloomberg (http://bit.ly/29LXbPv) have started to integrate Apache Spark into their tool chain, and the interest is growing. Unlike other big-data technologies which require intensive programming using Java etc., Spark enables data scientists to work with a big-data technology using higher-level languages like Python and R, making it accessible for conducting experiments and rapid prototyping.
In this talk, we will introduce Apache Spark and discuss the key features that differentiate Apache Spark from other technologies. We will provide examples on how Apache Spark can help scale analytics and discuss how the machine learning API could be used to solve large-scale machine learning problems using Spark’s distributed computing framework. We will also illustrate enterprise use cases for scaling analytics with Apache Spark.
Data Science at Scale: Using Apache Spark for Data Science at Bitly (Sarah Guido)
Given at Data Day Seattle 2015.
Bitly generates over 9 billion clicks on shortened links a month, as well as over 100 million unique link shortens. Analyzing data of this scale is not without its challenges. At Bitly, we have started adopting Apache Spark as a way to process our data. In this talk, I’ll elaborate on how I use Spark as part of my data science workflow. I’ll cover how Spark fits into our existing architecture, the kind of problems I’m solving with Spark, and the benefits and challenges of using Spark for large-scale data science.
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch... (Simplilearn)
This presentation on Pig will help you understand why Pig is required, what is Pig, MapReduce vs Hive vs Pig, Pig architecture, working of Pig, Pig Latin data model, Pig Execution modes, and finally a demo which shows Pig Latin scripts. Pig is a scripting platform that runs on Hadoop clusters, designed to process and analyze large datasets. It operates on various types of data like structured, semi-structured and unstructured data. Pig Latin is the procedural data flow language used in Pig to analyze data. It is easy to program using Pig Latin as it is similar to SQL.
Now, let us get started with Pig.
Below topics are explained in this Pig presentation:
1. Why Pig?
2. What is Pig?
3. MapReduce vs Hive vs Pig
4. Pig architecture
5. Working of Pig
6. Pig Latin data model
7. Pig Execution modes
8. Use case – Twitter
9. Features of Pig
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
I am shubham sharma graduated from Acropolis Institute of technology in Computer Science and Engineering. I have spent around 2 years in field of Machine learning. I am currently working as Data Scientist in Reliance industries private limited Mumbai. Mainly focused on problems related to data handing, data analysis, modeling, forecasting, statistics and machine learning, Deep learning, Computer Vision, Natural language processing etc. Area of interests are Data Analytics, Machine Learning, Machine learning, Time Series Forecasting, web information retrieval, algorithms, Data structures, design patterns, OOAD.
Big Data Analysis : Deciphering the haystack Srinath Perera
A primary outcome of Bigdata is to derive useful and actionable insights from large or challenges data collections. The goal is to run the transformations from data, to information, to knowledge, and finally to insights. This includes calculating simple analytics like Mean, Max, and Median, to derive overall understanding about data by building models, and finally to derive predictions from data. Some cases we can afford to wait to collect and processes them, while in other cases we need to know the outputs right away. MapReduce has been the defacto standard for data processing, and we will start our discussion from there. However, that is only one side of the problem. There are other technologies like Apache Spark and Apache Drill graining ground, and also realtime processing technologies like Stream Processing and Complex Event Processing. Finally there are lot of work on porting decision technologies like Machine learning into big data landscape. This talk discusses big data processing in general and look at each of those different technologies comparing and contrasting them.
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...Big Data Spain
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine Learning is not new. Big Machine Learning is qualitatively different: More data beats algorithm improvement, scale trumps noise and sample size effects, can brute-force manual tasks.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
Start guide to web scraping with Scrapy, one of best python modules to do web scraping, with Scrapy everything is more easy.
This presentation covers the key concepts of scrapy and the process of criation of spiders.
It's the first draft version and will be other versions, until the last version, if you see something that you want to be improved, give feedback and I will take that in consideration.
I also talk about some alternatives to scrapy like lxml, newspapers and others.
In the final i give you acess to the code used on this presentation, so you cant test easy and fast the concepts talked on this presentation.
I hope you like it :D
Agile Data: Building Hadoop Analytics ApplicationsDataWorks Summit
Mining data requires a deep investment in people and time. How can you be sure you’re building the right models? What tools help you connect with the customer’s needs? With this hands-on presentation, you’ll learn a flexible toolset and methodology for building effective analytics applications. Agile Data (the book) shows you how to create an environment for exploring data, using lightweight tools such as Python, Apache Pig, and the D3.js (Data-Driven Documents) JavaScript library. You’ll learn an iterative approach that allows you to quickly change the kind of analysis you’re doing, as you discover what the data is telling you. All the example code in this book is available as working web applications. We will cover how to: * Build an application to mine your own email inbox * Use different data structures and algorithms to extract multiple features from a single dataset, and learn how different perspectives can yield insight * Rapidly boot your applications as simple front-ends to a document store * Add features driven by descriptive and inferential statistics, machine learning, and data visualization * Gather usage data and talk to real users to help guide your data-driven exploration
The story of how solving one problem the OpenSource way
opened doors to so much more. Talk presented by Pranav Prakash and Hari Prasanna at OSDConf 2014, New Delhi.
Introduction to GraphQL (or How I Learned to Stop Worrying about REST APIs)Hafiz Ismail
Talk for FOSSASIA 2016 (http://2016.fossasia.org)
----
This talk will give a brief and enlightening look into how GraphQL can help you address common weaknesses that you, as a web / mobile developer, would normally face with using / building typical REST API systems.
Let's stop fighting about whether we should implement the strictest interpretation of REST or how pragmatic REST-ful design is the only way to go, or debate about what REST is or what it should be.
A couple of demos (In Golang! Yay!) will be shown that are guaranteed to open up your eyes and see that the dawn of liberation for product developers is finally here.
Background: GraphQL is a data query language and runtime designed and used at Facebook to request and deliver data to mobile and web apps since 2012.
Hafiz Ismail (@sogko) is a contributor to Go / Golang implementation of GraphQL server library (https://github.com/graphql-go/graphql) and is looking to encourage fellow developers to join in the collaborative effort.
Similar to Apache Giraph: Large-scale graph processing done better (20)
Honest Reviews of Tim Han LMA Course Program.pptxtimhan337
Personal development courses are widely available today, with each one promising life-changing outcomes. Tim Han’s Life Mastery Achievers (LMA) Course has drawn a lot of interest. In addition to offering my frank assessment of Success Insider’s LMA Course, this piece examines the course’s effects via a variety of Tim Han LMA course reviews and Success Insider comments.
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdfTechSoup
In this webinar you will learn how your organization can access TechSoup's wide variety of product discount and donation programs. From hardware to software, we'll give you a tour of the tools available to help your nonprofit with productivity, collaboration, financial management, donor tracking, security, and more.
The French Revolution, which began in 1789, was a period of radical social and political upheaval in France. It marked the decline of absolute monarchies, the rise of secular and democratic republics, and the eventual rise of Napoleon Bonaparte. This revolutionary period is crucial in understanding the transition from feudalism to modernity in Europe.
For more information, visit-www.vavaclasses.com
The Roman Empire A Historical Colossus.pdfkaushalkr1407
The Roman Empire, a vast and enduring power, stands as one of history's most remarkable civilizations, leaving an indelible imprint on the world. It emerged from the Roman Republic, transitioning into an imperial powerhouse under the leadership of Augustus Caesar in 27 BCE. This transformation marked the beginning of an era defined by unprecedented territorial expansion, architectural marvels, and profound cultural influence.
The empire's roots lie in the city of Rome, founded, according to legend, by Romulus in 753 BCE. Over centuries, Rome evolved from a small settlement to a formidable republic, characterized by a complex political system with elected officials and checks on power. However, internal strife, class conflicts, and military ambitions paved the way for the end of the Republic. Julius Caesar’s dictatorship and subsequent assassination in 44 BCE created a power vacuum, leading to a civil war. Octavian, later Augustus, emerged victorious, heralding the Roman Empire’s birth.
Under Augustus, the empire experienced the Pax Romana, a 200-year period of relative peace and stability. Augustus reformed the military, established efficient administrative systems, and initiated grand construction projects. The empire's borders expanded, encompassing territories from Britain to Egypt and from Spain to the Euphrates. Roman legions, renowned for their discipline and engineering prowess, secured and maintained these vast territories, building roads, fortifications, and cities that facilitated control and integration.
The Roman Empire’s society was hierarchical, with a rigid class system. At the top were the patricians, wealthy elites who held significant political power. Below them were the plebeians, free citizens with limited political influence, and the vast numbers of slaves who formed the backbone of the economy. The family unit was central, governed by the paterfamilias, the male head who held absolute authority.
Culturally, the Romans were eclectic, absorbing and adapting elements from the civilizations they encountered, particularly the Greeks. Roman art, literature, and philosophy reflected this synthesis, creating a rich cultural tapestry. Latin, the Roman language, became the lingua franca of the Western world, influencing numerous modern languages.
Roman architecture and engineering achievements were monumental. They perfected the arch, vault, and dome, constructing enduring structures like the Colosseum, Pantheon, and aqueducts. These engineering marvels not only showcased Roman ingenuity but also served practical purposes, from public entertainment to water supply.
Instructions for Submissions thorugh G- Classroom.pptxJheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
Operation “Blue Star” is the only event in the history of Independent India where the state went into war with its own people. Even after about 40 years it is not clear if it was culmination of states anger over people of the region, a political game of power or start of dictatorial chapter in the democratic setup.
The people of Punjab felt alienated from main stream due to denial of their just demands during a long democratic struggle since independence. As it happen all over the word, it led to militant struggle with great loss of lives of military, police and civilian personnel. Killing of Indira Gandhi and massacre of innocent Sikhs in Delhi and other India cities was also associated with this movement.
Acetabularia Information For Class 9 .docxvaibhavrinwa19
Acetabularia acetabulum is a single-celled green alga that in its vegetative state is morphologically differentiated into a basal rhizoid and an axially elongated stalk, which bears whorls of branching hairs. The single diploid nucleus resides in the rhizoid.
Embracing GenAI - A Strategic ImperativePeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Francesca Gottschalk - How can education support child empowerment.pptxEduSkills OECD
Francesca Gottschalk from the OECD’s Centre for Educational Research and Innovation presents at the Ask an Expert Webinar: How can education support child empowerment?
2. Basic concepts Let’s start Get our hands dirty
Hi!
Simone Santacroce
santacroce.1542338@studenti.uniroma1.it
https://it.linkedin.com/in/simone-santacroce-272739134
Manuel Coppotelli
coppotelli.1540732@studenti.uniroma1.it
https://it.linkedin.com/in/manuelcoppotelli
George Adrian Munteanu
munteanu.1540833@studenti.uniroma1.it
https://it.linkedin.com/in/george-adrian-munteanu-707744134
Lorenzo Marconi
marconi.1494505@studenti.uniroma1.it
https://www.linkedin.com/in/lorenzo-marconi-1a2580105
Antonio La Torre
alatorre182@hotmail.it
https://www.linkedin.com/in/antonio-la-torre-768738134
Lucio Burlini
burlini.1705432@studenti.uniroma1.it
https://www.linkedin.com/in/lucio-burlini-827739134
Apache Giraph
Agenda
1 Basic concepts
• Graphs in the real world
• Challenges on graphs
• MapReduce
• Giraph
2 Let’s start
• Out-Degree & In-Degree
3 Get our hands dirty
• Simple PageRank
Graphs 101
• Graph: representation of a set of objects, G = <V, E>
• Captures pairwise relationships between objects
• Can have directions, weights, ...
Social networks
• Both physical and Internet-mediated
• Users are vertices
• Any kind of interaction generates edges
Graphs are nasty
• Graphs need processing
• Each vertex depends on its neighbors, recursively
• Recursive problems are nicely solved iteratively
So what?
Why not MapReduce?¹
MapReduce is the current standard for intensive computation over big data sets.
Repeat N times ...
¹ https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
MapReduce Drawbacks
• Each job is executed N times
• Job bootstrap cost on every iteration
• Mappers resend both the values and the graph structure every time
• Extensive I/O at input, shuffle & sort, and output
Disk I/O and job scheduling quickly dominate the algorithm's runtime
Google's Pregel²
• Developed especially for large-scale graph processing
• Intuitive API that lets you "think like a vertex"
• Bulk Synchronous Parallel (BSP) as the execution model
• Fault tolerance by checkpointing
² https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p135-malewicz.pdf
Think like a vertex
• Each vertex has an id, a value, a list of adjacent neighbors and the corresponding edge values
• Vertices implement algorithms by sending messages
• Messages are delivered at the start of each superstep
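To make the vertex-centric model concrete, here is a toy, in-memory sketch of the superstep loop in plain Java. It is NOT the real Giraph API (the class and method names are invented for illustration): each vertex adopts the largest vertex id it has seen and messages its neighbors whenever its value changes; when no messages are produced, the computation halts.

```java
import java.util.*;

// Toy vertex-centric BSP loop (illustrative sketch, NOT the Giraph API).
// Vertices propagate the maximum vertex id they have seen; once no value
// changes, no messages are sent and the computation terminates.
public class MaxIdDemo {

    public static Map<Long, Long> run(Map<Long, List<Long>> graph) {
        Map<Long, Long> value = new HashMap<>();
        Map<Long, List<Long>> inbox = new HashMap<>();
        // Superstep 0: each vertex takes its own id and tells its neighbors.
        for (long v : graph.keySet()) {
            value.put(v, v);
            for (long n : graph.get(v))
                inbox.computeIfAbsent(n, k -> new ArrayList<>()).add(v);
        }
        // Messages sent in superstep t are delivered at the start of t + 1.
        while (!inbox.isEmpty()) {
            Map<Long, List<Long>> next = new HashMap<>();
            for (Map.Entry<Long, List<Long>> e : inbox.entrySet()) {
                long v = e.getKey();
                long best = value.get(v);
                for (long m : e.getValue()) best = Math.max(best, m);
                if (best > value.get(v)) {       // value changed: notify neighbors
                    value.put(v, best);
                    for (long n : graph.get(v))
                        next.computeIfAbsent(n, k -> new ArrayList<>()).add(best);
                }
            }
            inbox = next;                        // synchronization barrier
        }
        return value;
    }

    public static void main(String[] args) {
        Map<Long, List<Long>> g = new HashMap<>();
        g.put(1L, List.of(2L));
        g.put(2L, List.of(1L, 3L));
        g.put(3L, List.of(2L));
        System.out.println(run(g));  // every vertex converges to 3
    }
}
```

The per-superstep message delivery plus the barrier between supersteps is the essence of BSP; a real Giraph `Computation` adds distribution, halting votes and fault tolerance on top of the same loop.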
Other things
Aggregators
• Mechanism for global communication and global computation
• A global value calculated in superstep t is available in superstep t + 1
• Pre-defined (e.g. sum, max, min) or user-defined functions³
Combiners
• User-defined function³ applied to messages before they are sent or delivered
• Similar to Hadoop combiners
• Saves network bandwidth or memory
Checkpointing
• Store work to disk at user-defined intervals (disk isn't always evil)
• Restart on failure
³ The function has to be both commutative and associative
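As a sketch of why a combiner must be commutative and associative: all messages addressed to the same vertex can be folded into one, in any grouping order, before they cross the network. The class below is illustrative plain Java (`CombinerDemo` is an invented name, not a Giraph `MessageCombiner`), using summation as the combining function.

```java
import java.util.*;

// Illustrative sketch of a sum combiner: messages bound for the same
// destination vertex are folded into a single message, cutting traffic.
// This is only safe because addition is commutative and associative,
// so the order and grouping of the fold cannot change the result.
public class CombinerDemo {

    // Collapse each per-destination message list into one combined message.
    public static Map<Long, Double> combine(Map<Long, List<Double>> outbox) {
        Map<Long, Double> combined = new HashMap<>();
        for (Map.Entry<Long, List<Double>> e : outbox.entrySet()) {
            double sum = 0.0;
            for (double m : e.getValue()) sum += m;  // fold in any order
            combined.put(e.getKey(), sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        Map<Long, List<Double>> outbox = new HashMap<>();
        outbox.put(7L, List.of(0.5, 0.25, 0.25));  // three messages for vertex 7
        System.out.println(combine(outbox));       // one message: {7=1.0}
    }
}
```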
2 Let's start
LongLongNullTextInputFormat
org.apache.giraph.io.formats.LongLongNullTextInputFormat
If there is an edge from Node 1 to Node 2, then Node 2 appears in the neighbor list of Node 1:
<NODE1 ID> <SPACE> <NEIGHBOR1 ID> <SPACE> <NEIGHBOR2 ID> ...
<NODE2 ID> <SPACE> <NEIGHBOR1 ID> <SPACE> <NEIGHBOR2 ID> ...
...
IdWithValueTextOutputFormat
org.apache.giraph.io.formats.IdWithValueTextOutputFormat
For each node, print the Node ID and the Node Value:
<NODE1 ID> <TAB> <NODE1 VALUE>
<NODE2 ID> <TAB> <NODE2 VALUE>
...
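The two text formats above are enough to dry-run the Out-Degree exercise outside of Giraph: one adjacency line per vertex comes in, and one `<id><TAB><value>` line per vertex goes out, where the value is simply the neighbor count. The sketch below is plain, illustrative Java (`OutDegreeDemo` is an invented name); the real exercise would extend a Giraph computation class instead.

```java
import java.util.*;

// Illustrative sketch of the Out-Degree exercise done outside Giraph:
// parse LongLongNullTextInputFormat-style lines ("<id> <nbr> <nbr> ...")
// and emit IdWithValueTextOutputFormat-style lines ("<id>\t<value>").
public class OutDegreeDemo {

    // Out-degree of a vertex = number of neighbor ids on its input line.
    public static List<String> outDegrees(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String[] tok = line.trim().split("\\s+");
            long id = Long.parseLong(tok[0]);      // first token is the vertex id
            out.add(id + "\t" + (tok.length - 1)); // the rest are its neighbors
        }
        return out;
    }

    public static void main(String[] args) {
        // 1 -> {2, 3}, 2 -> {3}, 3 -> {} in the input format above
        for (String line : outDegrees(List.of("1 2 3", "2 3", "3")))
            System.out.println(line);
    }
}
```

In-degree, by contrast, cannot be read off a single line: each vertex must message its neighbors and count the messages it receives in the next superstep, which is exactly why the exercise is a good first Giraph job.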
3 Get our hands dirty
Google's PageRank⁴
• The success factor of Google's search engine
• A graph algorithm computing the "importance" of webpages
  ◦ Important pages have a lot of links from other important pages
  ◦ Look at the structure of the underlying network
• Ability to conduct web-scale graph processing
⁴ http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
Simple PageRank
• Recursive definition:

  PageRank_{i+1}(v) = (1 − d)/N + d · Σ_{u→v} PageRank_i(u) / O(u)

• Where:
  ◦ d: damping factor; the fraction of PageRank transferred to the neighbors, usually 0.85
  ◦ N: total number of pages
  ◦ O(u): out-degree of u; the number of links on page u
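The recursive definition above can be iterated directly on a small in-memory graph. The sketch below is plain, illustrative Java (`PageRankDemo` is an invented name, not the Giraph example class); it starts every vertex at 1/N (the fixed point is the same whichever uniform start is chosen) and, for simplicity, assumes every vertex has at least one out-link.

```java
import java.util.*;

// Illustrative plain-Java version of the Simple PageRank update:
//   PR_{i+1}(v) = (1 - d)/N + d * sum over in-neighbors u of PR_i(u)/O(u)
public class PageRankDemo {

    public static Map<Long, Double> pageRank(Map<Long, List<Long>> graph,
                                             double d, int iterations) {
        int n = graph.size();
        Map<Long, Double> rank = new HashMap<>();
        for (long v : graph.keySet()) rank.put(v, 1.0 / n);  // uniform start

        for (int i = 0; i < iterations; i++) {
            // Base term (1 - d)/N for every vertex.
            Map<Long, Double> next = new HashMap<>();
            for (long v : graph.keySet()) next.put(v, (1 - d) / n);
            // Each vertex u "messages" rank(u)/O(u) along its out-edges.
            // (Assumes O(u) > 0; dangling vertices are not handled here.)
            for (Map.Entry<Long, List<Long>> e : graph.entrySet()) {
                double share = rank.get(e.getKey()) / e.getValue().size();
                for (long v : e.getValue())
                    next.merge(v, d * share, Double::sum);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<Long, List<Long>> g = new HashMap<>();
        g.put(1L, List.of(2L));   // 1 -> 2
        g.put(2L, List.of(3L));   // 2 -> 3
        g.put(3L, List.of(1L));   // 3 -> 1
        // A 3-cycle is symmetric, so every page keeps rank 1/3.
        System.out.println(pageRank(g, 0.85, 20));
    }
}
```

Note how the inner loop mirrors the BSP model: the "messages" produced from `rank` in one pass are only consumed when building `next`, just as a Giraph superstep delivers messages at the start of the following superstep.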
Simple PageRank Example
[Figure: a three-vertex graph, each vertex's PageRank initialized to 1.0]
Thank you for your attention
Contact us for any questions or problems
Demo code
https://github.com/manuelcoppotelli/giraph-demo
Homework
https://github.com/manuelcoppotelli/giraph-homework