Content presented at a talk on Aug. 29th. The purpose is to inform a fairly technical audience about the primary tenets of Big Data and the Hadoop stack. The talk also included a walk-through of Hadoop and parts of the Hadoop stack, i.e. Pig, Hive, and HBase.
Quick brief about "What is Hadoop"
I don't explain Hadoop in detail, but reading these slides will give you insight into Hadoop and core product usage. This document will be most useful for PMs, newbies, and technical architects entering cloud computing.
A presentation on Big Data, from the workshop "The Era of Big Data: Why and How?" at the 22nd Conference of the Computer Society of Iran, csicc2017.ir
Vahid Amiri
vahidamiry.ir
datastack.ir
The critical thing to remember about Spark and Hadoop is that they are not mutually exclusive; they work well together, and the combination is strong enough for many big data applications.
The presentation covers the following topics: 1) Hadoop introduction, 2) Hadoop nodes and daemons, 3) architecture, 4) Hadoop's best features, 5) Hadoop characteristics. For further knowledge of Hadoop, refer to the link: http://data-flair.training/blogs/hadoop-tutorial-for-beginners/
A comparison between RDBMS, Hadoop, and Apache based on parameters like data variety, data storage, querying, cost, schema, speed, data objects, hardware profile, and use cases. It also mentions benefits and limitations.
These slides provide highlights of my book HDInsight Essentials. The book link is here: http://www.packtpub.com/establish-a-big-data-solution-using-hdinsight/book
Cassandra Lunch #89: Semi-Structured Data in Cassandra (Anant Corporation)
In Cassandra Lunch #89, we discuss how to store and parse semi-structured data in Cassandra using Spark.
Accompanying Blog: Coming Soon!
Accompanying YouTube: https://youtu.be/ZhNnn51BRUc
Sign Up For Our Newsletter: http://eepurl.com/grdMkn
Join Cassandra Lunch Weekly at 12 PM EST Every Wednesday: https://www.meetup.com/Cassandra-DataStax-DC/events/
Cassandra.Link:
https://cassandra.link/
Follow Us and Reach Us At:
Anant:
https://www.anant.us/
Awesome Cassandra:
https://github.com/Anant/awesome-cassandra
Cassandra.Lunch:
https://github.com/Anant/Cassandra.Lunch
Email:
solutions@anant.us
LinkedIn:
https://www.linkedin.com/company/anant/
Twitter:
https://twitter.com/anantcorp
Eventbrite:
https://www.eventbrite.com/o/anant-1072927283
Facebook:
https://www.facebook.com/AnantCorp/
Join The Anant Team:
https://www.careers.anant.us
This presentation details the capabilities of in-memory analytics using Apache Spark: an Apache Spark overview with its programming model, cluster mode with Mesos, supported operations, and a comparison with Hadoop MapReduce. It also elaborates on the Apache Spark stack, e.g. Shark, Streaming, MLlib, and GraphX.
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering a Big Data overview, Spark overview, Spark internals, and its supported libraries
1. Big Data Analytics
- Big Data
- Spark: Big Data Analytics
- Resilient Distributed Datasets (RDD)
- Spark libraries (SQL, DataFrames, MLlib for machine learning, GraphX, and Streaming)
- PFP: Parallel FP-Growth
2. Ubiquitous Computing
- Edge Computing
- Cloudlet
- Fog computing
- Internet of Things (IoT)
- Virtualization
- Virtual Conferencing
- Virtual Events (2D, 3D, and Hybrid)
Hadoop MapReduce is increasingly being replaced with Spark (which is written in Scala). The basic reason is that Spark can run workloads up to 100 times faster than Hadoop MapReduce, so tasks performed on Spark are much faster and more efficient.
In this era of ever-growing data, the need to analyze it for meaningful business insights becomes more and more significant. There are different Big Data processing alternatives like Hadoop, Spark, and Storm. Spark, however, is unique in providing batch as well as streaming capabilities, making it a preferred choice for lightning-fast Big Data analysis platforms.
Today, most personalization and recommendation services are built around interest extraction models, but the outputs of these algorithms are ambiguous in nature. This makes it difficult to understand what users are personally interested in and, more importantly, what they are feeling towards these interests and how their interests transition through time. By studying both users' interests and emotions simultaneously, one can further investigate the motivation behind these interests. Such findings can be useful for building better interest extraction models and algorithms that leverage personalization and recommendation services (e.g., ad targeting, e-commerce, and dating sites). In this paper, we propose the demonstration of a web visualization tool, EmoViz, which facilitates the further exploration of users' interests and their emotions at a global scale. This tool, through the use of various visual components, aims to alleviate the problem of understanding what users of the world are interested in and the motivations behind their interests and feelings.
Accompanying paper for this work: http://ieeexplore.ieee.org/document/7403627/
Subconscious Crowdsourcing: A Feasible Data Collection Mechanism for Mental D... (Elvis Saravia)
Mental disorders are currently affecting millions of people from different cultures, age groups, and geographic regions. The challenge of mental disorders is that they are difficult to detect in suffering patients, resulting in an alarming number of undetected cases and misdiagnoses. In this paper, we aim at building predictive models that leverage language and behavioral patterns, used particularly in social media, to determine whether a user is suffering from either of two mental disorders. These predictive models are made possible by employing a novel data collection process, coined Subconscious Crowdsourcing, which helps to collect a faster and more reliable dataset of patients. Our experiments suggest that extracting specific language patterns and social interaction features from reliable patient datasets can greatly contribute to further analysis and detection of mental disorders.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
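As a rough illustration of the convergence-skipping idea, here is a minimal pure-Python power-iteration PageRank that stops updating vertices whose rank has already converged. The example graph, damping factor, and tolerance are illustrative assumptions, not taken from the STICD paper.

```python
# Minimal power-iteration PageRank that skips already-converged vertices.
# The tiny example graph, damping factor, and tolerance are illustrative
# assumptions, not from the STICD paper.

def pagerank_skip(graph, damping=0.85, tol=1e-10, max_iter=100):
    """graph: dict vertex -> list of out-neighbors (no dangling nodes)."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    # Precompute in-neighbors, since PageRank pulls rank along in-links.
    in_nbrs = {v: [] for v in graph}
    for u, outs in graph.items():
        for v in outs:
            in_nbrs[v].append(u)
    converged = set()
    for _ in range(max_iter):
        new_rank = {}
        for v in graph:
            if v in converged:           # skip work for converged vertices
                new_rank[v] = rank[v]
                continue
            s = sum(rank[u] / len(graph[u]) for u in in_nbrs[v])
            new_rank[v] = (1 - damping) / n + damping * s
            if abs(new_rank[v] - rank[v]) < tol:
                converged.add(v)
        rank = new_rank
        if len(converged) == n:          # all vertices converged: stop early
            break
    return rank

# Example: a small 3-cycle, where every vertex ends up with rank 1/3.
g = {"a": ["b"], "b": ["c"], "c": ["a"]}
ranks = pagerank_skip(g)
```

Real implementations operate on CSR arrays rather than dicts, but the skip test per vertex is the same idea.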
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, often operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy vs in-place CUDA-based vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
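The element-sum experiments above all follow a chunked-reduction pattern. As a rough plain-Python sketch (illustrating only the structure, not the performance, of the real OpenMP/CUDA implementations):

```python
# Chunked parallel reduction in plain Python, illustrating the structure of
# the OpenMP/CUDA vector-element-sum experiments. This is a sketch only;
# the real experiments use OpenMP threads or CUDA thread blocks.
from concurrent.futures import ThreadPoolExecutor

def sequential_sum(xs):
    total = 0.0
    for x in xs:
        total += x
    return total

def parallel_sum(xs, workers=4):
    # Split the vector into one contiguous chunk per worker, reduce each
    # chunk independently, then combine the partial sums.
    n = len(xs)
    step = (n + workers - 1) // workers
    chunks = [xs[i:i + step] for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sequential_sum, chunks))
    return sequential_sum(partials)

xs = [float(i) for i in range(1000)]
```

Both strategies produce the same result; launch-config tuning in the CUDA versions corresponds to choosing the chunk/worker split here.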
2. Big Data
It was first mentioned by NASA researchers Michael Cox and David Ellsworth in 1997.
Definition: data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating, information privacy, and real-time capabilities. (Wikipedia)
3. Big Data Processing
The rise of Big Data required faster tools for processing data.
Can traditional databases handle Big Data? What are the limitations?
What are the solutions?
4. What is Spark?
A big data processing framework built around speed, generality, ease of use, and accessibility.
5. Properties of Spark
Speed: run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Ease of use: write applications quickly in Java, Scala, Python, and R.
Generality: combines SQL, streaming, and complex analytics.
Accessibility: Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
6. Spark Features
❏ Works directly in memory for speed-up
❏ Supports MapReduce
❏ Lazy evaluation of big data queries for optimization (computation is deferred until a result is actually needed)
❏ Operators and APIs that allow for easy interaction through Scala, R, and Python.
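Lazy evaluation can be illustrated in plain Python with a generator pipeline. This is a sketch of the idea only; Spark builds a plan of transformations and runs it only when an action demands a result.

```python
# Plain-Python sketch of lazy evaluation using generators; Spark's RDD
# transformations behave analogously (build a plan, execute on demand).
log = []

def numbers():
    for i in range(5):
        log.append(f"produce {i}")    # record when work actually happens
        yield i

# Building the pipeline performs no work yet...
pipeline = (x * x for x in numbers() if x % 2 == 0)
assert log == []                      # nothing produced so far

# ...work happens only when a result is demanded.
result = list(pipeline)               # result == [0, 4, 16]
```

In Spark terms, composing the generator expression corresponds to chaining transformations, and `list(pipeline)` corresponds to calling an action.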
7. Spark nuggets
Spark Streaming - processes real-time data using the basic abstraction called Resilient Distributed Datasets (RDDs).
Spark SQL - allows for ad-hoc querying using the JDBC API.
Spark MLlib - provides optimized machine learning libraries for regression, classification, and clustering tasks.
Spark GraphX - an extension of RDDs for graph-parallel computation.
8. RDD
A fundamental data structure in Spark: an immutable collection of objects that supports in-memory processing.
Two ways of creating RDDs:
❏ Parallelizing an existing data collection in your driver program
❏ Importing directly from HDFS, HBase, or any other Hadoop InputFormat.
9. Map/Reduce
A programming model for processing and generating large amounts of data with a parallel, distributed algorithm on a cluster.
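The model can be emulated in plain Python as a single-process sketch of the map, shuffle, and reduce phases. This is illustrative only; real MapReduce distributes these phases across a cluster.

```python
# A toy, single-process emulation of the map -> shuffle -> reduce phases,
# counting words. Illustrative only: real MapReduce runs mappers and
# reducers in parallel on many machines.
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["to be or not to be"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```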
10. A Simple WordCount program
text_file = sc.textFile("hdfs://...")  # sc is the SparkContext
counts = (text_file
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // Create a JobConf object and assign a job name for identification purposes
        JobConf conf = new JobConf(getConf(), WordCount.class);
        conf.setJobName("WordCount");
        // Set the configuration object with the data type of the output key and value
        conf.setOutputKeyClass(Text.class);