The document provides an overview of Hadoop and the Hadoop ecosystem. It discusses the history of Hadoop, how big data is defined in terms of volume, velocity, variety and veracity. It then explains what Hadoop is, the core components of HDFS and MapReduce, how Hadoop is used for distributed processing of large datasets, and how Hadoop compares to traditional RDBMS. The document also outlines other tools in the Hadoop ecosystem like Pig, Hive, HBase and gives a brief demo.
Hadoop has showed itself as a great tool in resolving problems with different data aspects as Data Velocity, Variety and Volume, that are causing troubles to relational database storage. In this presentation you'll learn what problems with data are occurring nowdays and how Hadoop can solve them . You'll learn about Hadop basic components and principles that make Hadoop such great tool.
Introduction to Big Data Analytics on Apache HadoopAvkash Chauhan
In the age of Big Data and large volume analytics there is a lot to cover and a lot to learn. While at Microsoft developing Windows HDInsight and now developing a one of kind Big Data product at my own company Big Data Perspective, San Francisco I have lived last several years covering Big Data at various level. This talk is customized for database and business intelligence (BI) professionals, programmers, Hadoop administrators, researchers, technical architects, operations engineers, data analysts, and data scientists understand the core concepts of Big Data Analytics on Hadoop. This webinar will be useful for those, who wants to know what is Hadoop, and how they can take advantage just by spending few dollars to run the cluster. The webinar is great for those who are looking to deploy their first data cluster and run MapReduce jobs to discover insights.
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...Dataconomy Media
Dev Lakhani, Data Scientist at Batch Insights talks on "Real Time Big Data Applications for Investment Banks and Financial Institutions" at the first Big Data Frankfurt event that took place at Die Zentrale, organised by Dataconomy Media
Hadoop has showed itself as a great tool in resolving problems with different data aspects as Data Velocity, Variety and Volume, that are causing troubles to relational database storage. In this presentation you'll learn what problems with data are occurring nowdays and how Hadoop can solve them . You'll learn about Hadop basic components and principles that make Hadoop such great tool.
Introduction to Big Data Analytics on Apache HadoopAvkash Chauhan
In the age of Big Data and large volume analytics there is a lot to cover and a lot to learn. While at Microsoft developing Windows HDInsight and now developing a one of kind Big Data product at my own company Big Data Perspective, San Francisco I have lived last several years covering Big Data at various level. This talk is customized for database and business intelligence (BI) professionals, programmers, Hadoop administrators, researchers, technical architects, operations engineers, data analysts, and data scientists understand the core concepts of Big Data Analytics on Hadoop. This webinar will be useful for those, who wants to know what is Hadoop, and how they can take advantage just by spending few dollars to run the cluster. The webinar is great for those who are looking to deploy their first data cluster and run MapReduce jobs to discover insights.
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...Dataconomy Media
Dev Lakhani, Data Scientist at Batch Insights talks on "Real Time Big Data Applications for Investment Banks and Financial Institutions" at the first Big Data Frankfurt event that took place at Die Zentrale, organised by Dataconomy Media
Big Data with Hadoop and HDInsight. This is an intro to the technology. If you are new to BigData or just heard of it. This presentation help you to know just little bit more about the technology.
This presentation Simplify the concepts of Big data and NoSQL databases & Hadoop components.
The Original Source:
http://zohararad.github.io/presentations/big-data-introduction/
Facing trouble in distinguishing Big Data, Hadoop & NoSQL as well as finding connection among them? This slide of Savvycom team can definitely help you.
Enjoy reading!
Big Data raises challenges about how to process such vast pool of raw data and how to aggregate value to our lives. For addressing these demands an ecosystem of tools named Hadoop was conceived.
This was presented at NHN on Jan. 27, 2009.
It introduces Big Data, its storages, and its analyses.
Especially, it covers MapReduce debates and hybrid systems of RDBMS and MapReduce.
In addition, in terms of Schema-Free, various non-relational data storages are explained.
Presentation regarding big data. The presentation also contains basics regarding Hadoop and Hadoop components along with their architecture. Contents of the PPT are
1. Understanding Big Data
2. Understanding Hadoop & It’s Components
3. Components of Hadoop Ecosystem
4. Data Storage Component of Hadoop
5. Data Processing Component of Hadoop
6. Data Access Component of Hadoop
7. Data Management Component of Hadoop
8.Hadoop Security Management Tool: Knox ,Ranger
Big Data with Hadoop and HDInsight. This is an intro to the technology. If you are new to BigData or just heard of it. This presentation help you to know just little bit more about the technology.
This presentation Simplify the concepts of Big data and NoSQL databases & Hadoop components.
The Original Source:
http://zohararad.github.io/presentations/big-data-introduction/
Facing trouble in distinguishing Big Data, Hadoop & NoSQL as well as finding connection among them? This slide of Savvycom team can definitely help you.
Enjoy reading!
Big Data raises challenges about how to process such vast pool of raw data and how to aggregate value to our lives. For addressing these demands an ecosystem of tools named Hadoop was conceived.
This was presented at NHN on Jan. 27, 2009.
It introduces Big Data, its storages, and its analyses.
Especially, it covers MapReduce debates and hybrid systems of RDBMS and MapReduce.
In addition, in terms of Schema-Free, various non-relational data storages are explained.
Presentation regarding big data. The presentation also contains basics regarding Hadoop and Hadoop components along with their architecture. Contents of the PPT are
1. Understanding Big Data
2. Understanding Hadoop & It’s Components
3. Components of Hadoop Ecosystem
4. Data Storage Component of Hadoop
5. Data Processing Component of Hadoop
6. Data Access Component of Hadoop
7. Data Management Component of Hadoop
8.Hadoop Security Management Tool: Knox ,Ranger
This talk was for GDG Fresno meeting. The demo used Google Compute Engine and Google Cloud Storage. The actual talk was different than the slides. There were a lot of good questions from the audience, and diverted to side topics many times.
What is Bigdata
Sources of Bigdata
What can be done with Big data?
Handling Bigdata
MapReduce
What is Hadoop?
Why Hadoop is Useful?
Other big data use cases
This is an updated version of Amr's Hadoop presentation. Amr gave this talk recently at NASA CIDU event, TDWI LA Chapter, and also Netflix HQ. You should watch the powerpoint version as it has animations. The slides also include handout notes with additional information.
Big data analytics: Technology's bleeding edgeBhavya Gulati
There can be data without information , but there can not be information without data.
Companies without Big Data Analytics are deaf and dumb , mere wanderers on web.
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be a complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with a dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Show drafts
volume_up
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
3. 3
Brief History in time
In pioneer days they used oxen for heavy pulling, and when one ox couldn’t
budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for
bigger computers, but more systems of computers.
—Grace Hopper, American Computer Scientist
9. 9
BIG
DATA
Volume
Big Data comes in on large
scale. Its on TB and even PB
Records, Transaction,
Tables , Files
Veracity
Quality, consistency,
reliability and provenance of
data
Good, bad, undefined,
inconsistency, incomplete.
Variety
Big Data extends structured,
including semi- structured and
unstructured data of all variety
text, log, xml, audio, video, stream,
flat files
Velocity
Data flown continues, time
sensitive, streaming flow
Batch, Real time, Streams,
Historic
Challenges in managing Big Data
10. 10
To overcome Big Data challenges Hadoop evolves
• Cost Effective – Commodity HW
• Big Cluster – (1000 Nodes) --- Provides Storage &
Processing
• Parallel Processing – Map reduce
• Big Storage – Memory per node * no of Nodes / RF
• Fail over mechanism – Automatic Failover
• Data Distribution
• Moving Code to data
• Heterogeneous Hardware System
(IBM,HP,AIX,Oracle Machine of any
memory and CPU configuration)
• Scalable
16. 16
Stop and Ponder
• Is Hadoop an alternative for RDBMS?
• Hadoop is not replacing the traditional data systems used for building
analytic applications – the RDBMS, EDW and MPP systems – but rather is a
complement. & Works fine together with RDBMs.
• Hadoop is being used to distill large quantities of data into something more
manageable
17. 17
Stop and Ponder
• But Don’t we know Coherence to be distributed too? Why Hadoop?
Coherence is the market leading In-Memory Data Grid. While Hadoop works fine
for large processing operations, i.e. requiring many TB of data, that can be
processed in a batch like way, there are use cases where the processing
requirements are more real-time and the data volumes are smaller, where
Coherence is a better choice than HDFS for storing the data
18. 18
Hadoop vs. RDBMS
RDBMS MapReduce
Data size Gigabytes Petabytes
Access Interactive and batch Batch
Structure Fixed schema Unstructured schema
Language SQL Procedural (Java, C++, Ruby, etc)
Integrity High Low
Scaling Nonlinear Linear
Updates Read and write Write once, read many times
Latency Low High
20. 20
Hadoop Architecture
• Hadoop Distributed File System (HDFS™): A distributed file system that
provides high-throughput access to application data.
• Hadoop MapReduce: A software framework for distributed processing of
large data sets on compute clusters.
HDFS
Map
Reduce
Hadoop
30. 30
Let’s do it again…
• Map/Reduce has 3 stages : Map/Shuffle/Reduce
• The Shuffle part is done automatically by Hadoop, you just need to
implement the Map and Reduce parts.
• You get input data as <Key,Value> for the Map part.
• In this example, the Key is the City name, and the Value is the set
of attributes : State and City yearly average temperature.
31. 31
• Since you want to regroup your temperatures by state, you’re going to get
rid of the city name, and the State will become the Key, while the
Temperature will become the Value.
32. 32
Shuffle
• Now, the shuffle task will run on the output of the Map task. It is going to
group all the values by Key, and you’ll get a List<Value>
33. 33
Reduce
• The Reduce task is the one that does the logic on the data, in our case this
is the calculation of the State yearly average temperature.
• And that’s what we will get as final output