7. Breakthrough Enabling Technologies
1) Hadoop
2) Spark
• Apache Hadoop is an open-source software framework for storage and large-
scale processing of datasets on clusters of commodity hardware
- MapReduce: a programming model and software framework for processing
large datasets stored across a cluster of commodity hardware (Hadoop uses
MapReduce to process data)
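As a conceptual sketch (plain Python, not Hadoop's actual Java API), the MapReduce model can be illustrated with a word count: the map step emits (word, 1) pairs, a shuffle groups the pairs by key, and the reduce step sums the counts per word:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between stages.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: combine the values for each key (here, sum the counts).
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big insights", "big clusters"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'big': 3, 'data': 1, 'insights': 1, 'clusters': 1}
```

In real Hadoop the map and reduce tasks run in parallel on different nodes and the shuffle moves data over the network; the logic per stage is the same.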
9. Why is Big Data so important?
Enables businesses/organizations to derive new and better insights
It’s the backbone of the emerging technologies that are radically
changing our world
Big Data is the Lifeblood of the 4th Industrial Revolution
Artificial Intelligence
Machine Learning
Deep Learning
20. References
Slides on Hadoop ecosystem overview
Big Data Modelling and Management Systems course on Coursera:
https://www.coursera.org/learn/big-data-management
Mayor Andrew Ginther's 2017 State of the City speech:
https://www.columbusunderground.com/full-text-of-the-2017-state-of-the-city-address
Editor's Notes
There is no single definition of Big Data. In a nutshell, it refers to data so large in scale and so varied that it becomes impractical to store and process using traditional RDBMS technologies
Big data sources: Machines, people and organizations
Big Data characteristics: volume, velocity, variety and veracity
Structured data: data with a fixed schema, e.g. CSV files, RDBMS tables. Unstructured data: videos, audio, PDFs, email messages, etc. (data with no underlying model/structure). Semi-structured data: XML, JSON
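A small sketch of the distinction, using Python's standard library (the sample records are made up for illustration): structured CSV rows share one fixed set of columns, while a semi-structured JSON record carries its own nested, per-record structure:

```python
import csv
import io
import json

# Structured: every CSV row conforms to the same fixed schema (name, age).
csv_text = "name,age\nAda,36\nAlan,41\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: JSON embeds its own structure; fields and nesting
# can vary from record to record.
json_text = '{"name": "Ada", "skills": ["math", "computing"], "extra": {"born": 1815}}'
record = json.loads(json_text)

print(rows[0]["name"])        # Ada
print(record["skills"][0])    # math
```

Unstructured data (video, audio, free-form email text) has no such model at all, which is why it needs different storage and processing tools.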
Hadoop is an open-source ecosystem of software tools used to store and process Big Data. It was started by Doug Cutting while he worked at Yahoo. The idea came from papers Google published on its distributed storage and processing systems. It is currently an open-source Apache project with numerous contributors, and new software is added to the ecosystem from time to time
Spark is also a framework for processing data on a cluster. It can be up to 100x faster than MapReduce and supports a concept called "in-memory computation", which is very important for the field of machine learning. Spark has an SQL interface called Spark SQL and provides APIs for three programming languages (Java, Python and Scala)
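A toy illustration of why in-memory computation matters for iterative workloads like machine learning (plain Python with a hypothetical `Dataset` class, not Spark's actual API): without a cache, every pass over the data recomputes the transformation; with one, it is computed once and then served from memory:

```python
class Dataset:
    """Toy lazily-evaluated dataset with an optional in-memory cache."""

    def __init__(self, records, transform):
        self.records = list(records)
        self.transform = transform
        self.computations = 0   # how many times the transform actually ran
        self._cache = None

    def collect(self):
        if self._cache is not None:   # cache hit: no recomputation
            return self._cache
        self.computations += 1
        return [self.transform(r) for r in self.records]

    def cache(self):
        # Materialise the result in memory, analogous in spirit to
        # caching/persisting a dataset in Spark.
        self.computations += 1
        self._cache = [self.transform(r) for r in self.records]
        return self

data = Dataset(range(5), lambda x: x * x)
for _ in range(3):            # three "training iterations", no caching
    data.collect()
print(data.computations)      # 3: the transform ran on every pass

cached = Dataset(range(5), lambda x: x * x).cache()
for _ in range(3):
    cached.collect()
print(cached.computations)    # 1: computed once, then read from memory
```

In real Spark the savings are far larger, because an uncached dataset may be re-read from disk and re-shuffled across the network on every iteration.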
Autonomous trucks are already being used in the Australian outback; it is not a matter of if this will occur, just a matter of when. Many white-collar jobs that exist today will be gone in 15 years, and where they do still exist, the number of professionals needed per task will be significantly smaller.
HDFS (Hadoop Distributed File System) is basically the file system of Hadoop.
Manages resources in the ecosystem (CPU cores, RAM, storage); basically, it allocates resources to jobs run on the platform.
The MapReduce model operates by processing data in two stages: the "Map" stage and the "Reduce" stage. Each stage generates output based on the code you write for it
It’s a high-level scripting language used for ETL
Used to analyze data from social networks
Basically ensures all the software in the ecosystem stays in sync, i.e. works harmoniously
Sir Arthur C. Clarke was a British science fiction writer.