Big data analytics has evolved beyond batch processing with Hadoop to extract intelligence from data streams in real time. New technologies preserve data locality, allow real-time processing and streaming, support complex analytics functions, provide rich data models and queries, optimize data flow and queries, and leverage CPU caches and distributed memory for speed. Frameworks like Spark and Shark improve on MapReduce with in-memory computation and dynamic resource allocation.
Disclaimer: The images, company, product, and service names used in this presentation are for illustration purposes only. All trademarks and registered trademarks are the property of their respective owners. Data and images were collected from various sources on the Internet. The intention is to present the big picture of Big Data & Hadoop.
Core concepts and Key technologies - Big Data Analytics
The Big Data business has solved the problem of 'big data acquisition and persistence' using daily ETL and batch analysis through the Hadoop ecosystem.

Now let's see how the big data market has evolved beyond batch processing (the Hadoop/RDBMS world) to extract intelligence from global data streams in real time.

Tremendous research work has been underway over the last few years to challenge conventional wisdom and create the augmented reality of big data analytics.

'Dynamic decision making' is no longer driven by 'traditional business intelligence'; it involves fast exploration of data patterns and performing complex deterministic or stochastic approximations or accurate queries.

Let's glance through some of the core concepts and key technologies that are driving this renaissance.
Need to preserve Data Locality

Traditional Hadoop MR does not preserve data locality during the map-reduce transition or between iterations. To send data to the next job in an MR workflow, an MR job needs to store its output in HDFS, so it incurs communication overhead and extra processing time. The Bulk Synchronous Parallel (BSP) concept was implemented in Pregel, Giraph, and Hama to solve this very important MR problem. Hama initiates peer-to-peer communication only when necessary, and peers focus on keeping locally processed data on local nodes.

BSP manages synchronization and communication in a middle layer, as opposed to a file-system-based parallel random access pattern; K-Means clustering is a typical showcase for the model. Apache Hama provides a stable reference implementation for analyzing streaming events or big data with graph/network structure by implementing a deadlock-free message passing interface and barrier synchronization (which reduces significant network overheads).
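To make the BSP model concrete, here is a minimal, runnable Scala sketch of the superstep pattern. It is illustrative only (Peer, Msg, and superstep are names invented here, not the real Hama API): each peer computes on its local partition, ships only a small aggregate, and meets the others at a barrier.

```scala
object BspSketch {
  case class Msg(payload: Double)

  // A toy in-memory "peer"; real BSP peers (Hama, Pregel) run on separate nodes.
  class Peer(val id: Int, val localData: Vector[Double], world: Array[Peer]) {
    private var inbox     = Vector.empty[Msg]
    private var nextInbox = Vector.empty[Msg]

    def send(to: Int, m: Msg): Unit = world(to).nextInbox :+= m
    def sync(): Unit = { inbox = nextInbox; nextInbox = Vector.empty } // barrier stand-in
    def incoming: Vector[Msg] = inbox

    // One superstep: compute on local data only, then ship a small aggregate.
    def superstep(): Unit = send(0, Msg(localData.sum))
  }

  def main(args: Array[String]): Unit = {
    val world = new Array[Peer](2)
    world(0) = new Peer(0, Vector(1.0, 2.0), world)
    world(1) = new Peer(1, Vector(3.0, 4.0), world)

    world.foreach(_.superstep())   // every peer computes on its local partition
    world.foreach(_.sync())        // in a real BSP runtime this barrier is global
    println(world(0).incoming.map(_.payload).sum) // 10.0: global sum from local partials
  }
}
```

The point of the pattern: data stays where it was processed, and only the small messages cross the network at each barrier.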
Need for Real-time processing and streaming ETL

Hadoop was purpose-built for distributed batch processing using static input, output, and processor configuration. But fast ad-hoc machine learning queries require real-time distributed processing and real-time updates based on dynamically changing configuration, without requiring any code change. Saffron Memory Base (SMB) is a technology that offers real-time analytics on hybrid data.

The SQLstream Connector for Hadoop provides bi-directional, continuous integration with Hadoop HBase. DataTorrent's Apex is another front-runner in stream processing.

With SciDB, one can run a query the moment it occurs to the user. By contrast, Hadoop arguably imposes a huge burden of infrastructure setup, data preparation, map-reduce configuration, and architectural coding. Both SciDB and SMB position themselves as complete replacements for Hadoop MR when it comes to complex data analysis.
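To make 'streaming ETL' concrete, the toy Scala sketch below (all names invented here) updates a tumbling-window count per event as it arrives, so the answer is queryable at any moment instead of only after a batch job completes. Real engines like SQLstream and Apex distribute this pattern across nodes.

```scala
import scala.collection.mutable

// Hypothetical event type; a real pipeline would consume these from a message bus.
case class Event(tsMillis: Long, user: String)

// Per-window counts, updated incrementally as each event arrives.
val counts = mutable.Map.empty[Long, Int].withDefaultValue(0)

def ingest(e: Event, windowMillis: Long = 1000L): Unit = {
  val windowStart = (e.tsMillis / windowMillis) * windowMillis
  counts(windowStart) += 1               // one cheap update, no re-scan of history
}

Seq(Event(100, "a"), Event(900, "b"), Event(1500, "a")).foreach(ingest(_))
println(counts) // window 0 has 2 events, window 1000 has 1; queryable mid-stream
```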
Need for complex analytic functions

Hadoop MR is not suitable for increasingly complex mathematical and graph functions like PageRank, BFS, and matrix multiplication, which require repetitive MR jobs.

So many interesting research works have spawned in recent times: BSP, Twister, HaLoop, RHadoop.
Need for rich Data Model and rich Query syntax

The existing MR query API has limited syntax for relational joins and group-bys. It does not directly support iteration or recursion in declarative form, and it cannot handle complex semi-structured nested scientific data. Here comes MRQL (Map-Reduce Query Language) to the rescue of Hadoop MR, supporting nested collections, trees, arbitrary query nesting, and user-defined types and functions.

Impala reads directly from HDFS and HBase data. It will add a columnar storage engine, a cost-based optimizer, and other distinctly database-like features.
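To see what nested collections and query nesting buy, here is the shape of such a query approximated in plain Scala over hypothetical nested records; raw MR would force the same logic into chained jobs, while an MRQL-style language expresses it declaratively in one statement.

```scala
// Hypothetical nested records: each author carries a nested list of papers.
case class Paper(venue: String, citations: Int)
case class Author(name: String, papers: List[Paper])

val authors = List(
  Author("a", List(Paper("kdd", 10), Paper("vldb", 3))),
  Author("b", List(Paper("kdd", 7))))

// Nested aggregation: unnest the inner collection, then a relational group-by.
val perVenue = authors
  .flatMap(_.papers)
  .groupBy(_.venue)
  .map { case (venue, ps) => venue -> ps.map(_.citations).sum }

println(perVenue) // Map(kdd -> 17, vldb -> 3)
```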
Need to optimize data flow and query execution

All the big data analytics datastores, both proprietary and open source, are trying their best to redefine traditional Hadoop MR. MapReduce simply does not make sense as an engine for querying!

So here comes Shark, a distributed in-memory MR framework with great speed of execution.
Need for speed and versatility of MR Query

Apache Hadoop is designed to achieve very high throughput, but it is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Here come Google Dremel and Apache Drill. Drill's columnar query execution engine offers low-latency interactive reporting using DrQL, and it encompasses a broad range of low-latency frameworks like Mongo Query, Cascading, and Plume. It adheres to the Hadoop philosophy of connecting to multiple storage systems, but broadens the scope and introduces enormous flexibility by supporting multiple query languages, data formats, and data sources. Shark extends Hive syntax but employs a very efficient column store with an interactive multi-stage computing flow.
Ability to cache and reuse intermediate map outputs

HaLoop introduces recursive joins for effective iterative data analysis by caching and reusing loop-invariant data across iterations, a significant improvement over conventional general-purpose MapReduce.

Twister offers a modest mechanism for managing configurable and cacheable MR tasks, and implements effective pub/sub-based communication and, of course, special support for iterative MR computations.
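The same loop-invariant caching idea is visible in Spark's public API as rdd.cache(). A minimal Spark (Scala) sketch, with a hypothetical HDFS path, loads the static link data once and reuses it across every pass of a PageRank-style loop:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("iterative-caching").setMaster("local[*]"))

// Loop-invariant data: parsed and cached once, reused by every iteration
// instead of being re-read from HDFS on each pass.
val links = sc.textFile("hdfs:///graph/edges")   // hypothetical path
  .map { line => val Array(src, dst) = line.split("\t"); (src, dst) }
  .groupByKey()
  .cache()

var ranks = links.mapValues(_ => 1.0)
for (_ <- 1 to 10) {                             // each iteration hits the cache
  val contribs = links.join(ranks).values
    .flatMap { case (dsts, rank) => dsts.map(d => (d, rank / dsts.size)) }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
ranks.collect().foreach(println)
```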
Leverage CPU cache and distributed memory access patterns

There are quite a few frameworks that store data in distributed memory instead of HDFS: GridGain, Hazelcast, RDD, Piccolo.

Hadoop was not designed to facilitate interactive analytics, so it took a few game changers to exploit the CPU cache and distributed memory.
Single-bus Multi-core CPU

It's a well-known fact that a single-bus multi-core CPU offers simultaneous multithreading that significantly reduces latency for certain types of algorithms and data structures. So DBMSs are being re-architected from the ground up to leverage the CPU-bound partitioning-phase hash join as opposed to the memory-bound hash join.

It's the ever-increasing speed of CPU caches and TLBs that allows blazingly fast computation and retrieval of hashed results. It's also noteworthy how modern multi-core CPUs and GPGPUs offer cheap compression schemes at virtually no CPU cost; as we know, access to memory is becoming pathetically slow compared to ever-galloping processor clock speeds.

ElastiCube from SiSense leverages query plans optimized for fast response time and parallel execution on multiple cores, plus continuous 'instruction recycling' for reusing pre-computed results.

A few other business analytics databases/tools like VectorWise, the Tableau Data Engine, and SalesEdge also make great use of the CPU cache to offer blazingly fast ad-hoc queries.
Implementing Parallel Vectorization of compressed data through SIMD (Single Instruction, Multiple Data)

VectorWise efficiently utilizes the techniques of vectorization, CPU compression, and 'using the CPU as execution memory'. This is also a core technology behind many leading analytics column stores.
Driving Positional Delta Trees (PDT)

A PDT stores both the position and the delta in memory, and they are effectively merged with the data during optimized query execution. Ad-hoc querying in most cases is about identifying the 'difference'. VectorWise makes effective use of PDTs; more can be found in its white paper.
Need to directly query compressed data residing in heavily indexed columnar files

SSDs and flash storage will get cheaper (which means cheaper cloud services) with innovative compression and de-duplication in the file system. All the analytics datastores are gearing up to make the most of this feature. Watch out for Pure Storage.
Need for Dynamic Memory Computation and Resource Allocation

The output of a batch job needs to be dumped to secondary storage. Rather, it would be a good idea to constantly compute the data size, create peer processors, and make sure collective memory does not exceed the total data size. So it's important to understand that Hadoop MR is not the best fit for processing all types of data structures! The BSP model should be adopted for massive graph processing, where the bulk of the static data can remain in the filesystem while dynamic data is processed by peers and results are kept in memory.
Usage of Fractal Tree Indexes

Local processing is the key! Keep enough buffers and pivots in the tree node itself to avoid frequent, costly round trips along the tree for individual items. That means keep filling up your local buffer, then do a bulk flush; new-age drives love bulk updates (more changes per write) because they avoid fragmentation.

TokuDB replaces the MySQL and MongoDB B-tree implementations with fractal trees and achieves massive performance gains.
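The buffering idea reduces to a few lines: writes accumulate in a node-local buffer and are pushed down in bulk, so the tree is traversed once per batch instead of once per item. A toy Scala sketch follows (the BufferedNode structure is invented here; TokuDB's real fractal tree is far more involved):

```scala
import scala.collection.mutable.ArrayBuffer

// Toy stand-in for a tree node with an internal message buffer.
class BufferedNode(flushThreshold: Int, sink: Seq[(String, String)] => Unit) {
  private val buffer = ArrayBuffer.empty[(String, String)]

  def put(key: String, value: String): Unit = {
    buffer += (key -> value)                 // cheap: no tree traversal yet
    if (buffer.size >= flushThreshold) flush()
  }

  // One bulk flush replaces many individual root-to-leaf walks.
  def flush(): Unit = { sink(buffer.toList); buffer.clear() }
}

val node = new BufferedNode(3, batch => println(s"bulk flush of ${batch.size} writes"))
Seq("a", "b", "c", "d").foreach(k => node.put(k, "v"))  // flushes once at 3 writes
node.flush()                                            // drain the remainder
```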
Distributed Shared Memory Abstraction

It provides a radical performance improvement over disk-based MR. A partial DAG (Directed Acyclic Graph) execution model describes parallel processing for in-memory computation (aggregate result sets that fit in memory, e.g. intermediate map outputs).

Spark uses the 'Resilient Distributed Dataset' (RDD) architecture to convert a query into an 'operator tree'. Shark keeps re-optimizing a running query after running the first few stages of the task DAG, thereby selecting a better join strategy and the right degree of parallelism. Shark also offers co-partitioning of multiple tables on a common key for faster join queries.

It leverages SSDs, CPU cores, and main memory to the fullest extent.

It's worth mentioning how DDF (Distributed DataFrame, a next-generation extension of the Spark RDD), the H2O DataFrame, and Dato's SFrame are expanding the horizons of machine learning at massive scale.
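Co-partitioning on a common key is also visible in Spark's public API: when both sides of a join share the same partitioner, the join runs node-locally without a shuffle. A small sketch with hypothetical data:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("copartition").setMaster("local[*]"))
val part = new HashPartitioner(8)

// Both tables hashed by the same key into the same partitions up front.
val users  = sc.parallelize(Seq(1 -> "alice", 2 -> "bob")).partitionBy(part).cache()
val clicks = sc.parallelize(Seq(1 -> "/home", 1 -> "/buy", 2 -> "/home")).partitionBy(part)

// Because users and clicks share `part`, matching keys already live on the
// same node, so this join needs no network shuffle.
val joined = users.join(clicks)
joined.collect().foreach(println) // e.g. (1,(alice,/home)) ...
```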
Parallel Array Computation

This is a very simple yet powerful mathematical approach: embed big math functions directly inside the database engine. SciDB has mastered this concept by embedding statistical computation in distributed, multidimensional arrays.
Semi-computation and instant approximation

Fast response to ad-hoc queries comes from continuous learning from experience and instant approximation, as opposed to waiting for the end of processing and the computation of a final result. A bunch of analytics products are coming to market with built-in 'data science capabilities', for example H2O from 0xdata.
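Instant approximation in miniature: the running estimate below (a toy Scala sketch, not any vendor's implementation) can answer a mean query at any point during the scan instead of only at the end.

```scala
// Running mean with O(1) state: queryable mid-stream, refined as data arrives.
final class RunningMean {
  private var n    = 0L
  private var mean = 0.0
  def observe(x: Double): Unit = { n += 1; mean += (x - mean) / n }
  def estimate: Double = mean
}

val rm = new RunningMean
Iterator.fill(1000)(scala.util.Random.nextDouble()).zipWithIndex.foreach { case (x, i) =>
  rm.observe(x)
  if ((i + 1) % 250 == 0)                  // answer available long before row 1000
    println(f"after ${i + 1} rows: estimate = ${rm.estimate}%.3f")
}
```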
Need for Push-based Map-Reduce Resource Management

Though Hadoop's main strength is distributed processing over clusters, at peak load utilization drops due to scheduling overhead. It's well known how a map-reduce cluster is divided into a fixed number of processor slots based on static configuration.

So Facebook introduced Corona, a push-based scheduler where a cluster manager tracks nodes and dynamically allocates slots, making it easier to utilize all the slots based on cluster workload for both map-reduce and non-MR applications.
Topological Data Analysis

It is a giant leap forward! It treats the data model as a topology of nodes and discovers patterns and results by measuring similarity. Ayasdi has pioneered this idea to build the first 'query-free exploratory analytics tool'. It's a true example of analytics based on unsupervised learning, without requiring an a priori algebraic model.

Ayasdi Iris is a mind-boggling insight discovery tool!
Learn from the Knowledge Base

InfoBright offers fast analytics based on a knowledge base built by continuously updating metadata about the data, data access patterns, queries, and aggregate results. This kind of innovative 'dynamic introspection' helps the columnar storage avoid the need for indexing and costly sub-selects, and allows it to decompress only the required data.
Reduction in Cluster size

'Data analysis' processors do not need the same number of machines as Hadoop nodes. For example, thanks to advanced storage schema optimization, ParAccel can handle in one node a data analysis that Hadoop needs (on average) 8 nodes to perform. Enhanced file systems like QFS offer a much lower replication factor (1.5) and higher throughput.
Avoid Data Duplication

Both ParAccel and Hadapt share a similar vision of analyzing data as close to the data node as possible, without moving it to a separate BI layer.

As opposed to the Hadoop connector strategies of MPP analytic datastores, Hadapt processes data in an RDBMS layer sitting close to HDFS and load-balances queries across a virtualized environment of adaptive query engines.
Ensure Fault-tolerance & Reliability

Apache Hama ensures reliability through process communication and barrier synchronization. Each peer uses checkpoint recovery to occasionally flush the volatile part of its state to the DFS, allowing a rollback to the last checkpoint in the event of failure.

Per the Shark documentation, "RDDs track the series of transformations used to build them (their lineage) to recompute lost data."

Shark/Spark, with its compact Resilient Distributed Dataset-based distributed in-memory computing, offers great hope to startups and open-source enthusiasts for building a lightning-fast data warehouse system.
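That lineage is directly inspectable in Spark: toDebugString prints the chain of transformations an RDD was built from, which is exactly what gets replayed to recompute a lost partition. The sketch assumes a SparkContext sc as in the earlier examples, and the path is hypothetical.

```scala
// Lineage, not replication, is the recovery mechanism: lose a partition and
// Spark replays these transformations for just that partition.
val wordCounts = sc.textFile("hdfs:///logs/part-00000")   // hypothetical path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

println(wordCounts.toDebugString) // prints the transformation chain (the lineage)
```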
References:

Single-Bus Multi-Core CPU: http://pages.cs.wisc.edu/~jignesh/publ/hashjoin.pdf
Bulk Synchronous Parallel: http://www.staff.science.uu.nl/~bisse101/Book/PSC/psc1_2.pdf
Spark: http://spark-project.org/research/
Shark: https://github.com/amplab/shark
Pregel: http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
Giraph: http://giraph.apache.org/
Hama: http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
Saffron Memory Base: http://www.slideshare.net/paulhofmann/big-data-and-saffron
SQLstream: http://www.sqlstream.com/applications/hadoop/
RHadoop: https://github.com/RevolutionAnalytics/RHadoop/wiki
MRQL: http://lambda.uta.edu/mrql/
Apache Drill: http://incubator.apache.org/drill/
HaLoop: http://code.google.com/p/haloop/wiki/UserManual
Twister: http://www.iterativemapreduce.org/
BSP: http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
Corona: https://github.com/facebook/hadoop-20/tree/master/src/contrib/corona
ParAccel: http://img.en25.com/Web/ParAccel/%7Be72a7284-edb0-4e58-bb75-ff1145717d2b%7D_Hadoop-Limitations-for-Big-Data-ParAccel-Whitepaper.pdf
QFS: https://github.com/quantcast/qfs/wiki/Performance-Comparison-to-HDFS
Hadapt: http://hadapt.com/assets/Hadapt-Product-Overview1.pdf
ElastiCube: http://pages.sisense.com/elasticube-whitepaper.html?src=bottom
VectorWise: http://fastreporting.files.wordpress.com/2011/03/vectorwise-whitepaper.pdf