Google Next Extended (https://cloudnext.withgoogle.com/) is an annual Google event focusing on Google Cloud technologies. This presentation is from a tech talk held at the Google Next Extended 2017 Karachi event.
Big Data with Hadoop, Spark and BigQuery (Google Cloud Next Extended 2017 Karachi)
1. Big Data with Hadoop, Spark
and BigQuery
Google Cloud Next Extended 2017
Speaker: Imam Raza
2. Speaker.bio.toString()
Senior Software Architect @Folio3
Specialities:
Designing scalable Enterprise Software Architecture,
Designing scalable mobile apps.
IBM Big Data certified professional.
MongoDB certified professional.
3. About this presentation
me.lovesQuestions == true. Let's have an interactive session.
The content is designed on the basis of industry experience.
We will have some lab sessions.
Switching gears with interesting Silicon Valley facts.
4. Agenda
What is Big Data?
What are the components of Big Data?
What is Hadoop?
What is Spark?
What is BigQuery?
Designing scalable vs. fashionable applications.
20. Big Data Business Applications
Better understand and target customers
Understand and optimize business processes
Improving Health
Improving security and Law enforcement
Improving sports performance
Improving and optimizing Cities and Countries
21. Types of Big Data Sources
Structured Data (RDBMS, Spreadsheets)
Unstructured Data (raw data)
Semi-Structured Data (XML, JSON)
22. Switching gears
A mandatory book for Silicon Valley graduates looking for jobs.
27. Hadoop
Hadoop is an open-source software framework that
supports data-intensive distributed applications
A Hadoop cluster is composed of a single master node
and multiple worker nodes
28. Hadoop Primary Components
HDFS – Hadoop Distributed File System (storing large amounts of data)
MapReduce programming model (processing large amounts of data; see the sketch below)
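To make the MapReduce programming model concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are ordinary Python scripts reading stdin and writing tab-separated key/value pairs to stdout. The script names and the streaming setup are illustrative assumptions, not part of the slides.

# mapper.py (illustrative): emit "word<TAB>1" for every word in the input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py (illustrative): Hadoop sorts by key, so all counts for a word arrive together
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word and current_word is not None:
        print(current_word + "\t" + str(current_count))
        current_count = 0
    current_word = word
    current_count += int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

These scripts would normally be submitted through the Hadoop Streaming jar, but can also be tested locally with a shell pipeline such as: cat input.txt | python mapper.py | sort | python reducer.py (the sort step stands in for Hadoop's shuffle).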
30. Moving Code to Data Philosophy
If code and data are on different machines, one of them must be moved to
the other machine before the code can be executed on the data.
If the code is smaller than the data, better to send the code to the machine
holding the data than the other way around, if all the machines are equally
fast.
In the world of Big Data, the code is almost always smaller than the data.
34. Hadoop/MapReduce vs. RDBMS
Size of data: Petabytes (Hadoop/MapReduce) vs. Gigabytes (RDBMS)
Integrity of data: Low vs. High (referential, typed)
Data schema: Dynamic vs. Static
Access method: Batch vs. Interactive and Batch
Scaling: Linear vs. Nonlinear (worse than linear)
Data structure: Unstructured vs. Structured
Normalization of data: Not required vs. Required
Query response time: Has latency (due to batch processing) vs. Can be near immediate
35. Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation.
It is based on Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computations, including interactive queries and stream processing.
36. Apache Spark features
Speed: Spark runs applications in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk.
Multi-language support: provides built-in APIs in Java, Scala, and Python (see the PySpark sketch below).
Advanced analytics: supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
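As an illustrative sketch (not from the slides), the same word count expressed with the PySpark RDD API; the input path is only a placeholder.

# word_count_spark.py (illustrative): word count with the PySpark RDD API
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (
    sc.textFile("hdfs:///path/to/input.txt")   # placeholder input path
      .flatMap(lambda line: line.split())       # split each line into words
      .map(lambda word: (word, 1))              # emit (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)          # sum the counts per word
)

for word, count in counts.take(10):             # print a small sample
    print(word, count)

sc.stop()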
41. BigQuery
A service that enables interactive analysis of massively large datasets.
Based on Dremel, a scalable, interactive ad hoc query system for analysis of read-only nested data.
Works in conjunction with Google Cloud Storage.
Has a RESTful web service interface.
42. BigQuery
You can issue SQL queries over big data (see the sketch below).
Interactive web interface.
Response times kept as low as possible.
Auto-scales under the hood.
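A minimal sketch of issuing a SQL query from Python with the google-cloud-bigquery client library; the public sample table used here is only for illustration and assumes default project credentials are configured.

# bigquery_query.py (illustrative): run an interactive SQL query against BigQuery
from google.cloud import bigquery

client = bigquery.Client()  # uses the default project and credentials

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

for row in client.query(query).result():  # blocks until the query finishes
    print(row.name, row.total)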
45. Switching gears
Zareen's is a Pakistani restaurant near Google's Mountain View campus.
1477 Plymouth Street, Suite C
Mountain View, CA 94043
http://www.zareensrestaurant.com/
…refers to the vast amounts of data generated every second. We are not talking terabytes but zettabytes or brontobytes. If we take all the data generated in the world between the beginning of time and 2000, the same amount of data will soon be generated every minute. New big data tools use distributed systems so that we can store and analyse data across databases dotted around the world.
Quintillion = 10^18
How big is a zettabyte?
One bit is binary.
It's either a one or a zero.
Eight bits make up one byte, and 1024 bytes
make up one kilobyte.
1024 kilobytes make up one megabyte.
Large videos and DVDs will be in gigabytes
where 1024 megabytes make up one gigabyte of storage space.
These days we have USBs or memory sticks
that can store a few dozen gigabytes of information
where computers and hard drives now store
terabytes of information.
One terabyte is 1024 gigabytes.
1024 terabytes make up one petabyte,
and 1024 petabytes make up an exabyte.
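As a quick arithmetic aside (not part of the talk), the binary unit ladder from byte to zettabyte can be printed directly:

# units.py (illustrative): each unit is 1024x (2^10) the previous one
units = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte"]
for power, name in enumerate(units):
    print(f"1 {name} = 2^{10 * power} bytes = {1024 ** power:,} bytes")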
Think of a big urban city or a busy international airport
like Heathrow, JFK, O'Hare, Dubai,
or O. R. Tambo in Johannesburg.
And now we're talking petabytes and exabytes.
All those airplanes are capturing and transmitting data.
All the people in those airports have mobile devices.
Also consider the security cameras and all the staff
in and around the airport.
A digital universe study conducted by IDC
claimed digital information reached
0.8 zettabytes last year and predicted this number
would grow to 35 zettabytes by 2020.
It is predicted that by 2020, one tenth of the world's data
will be produced by machines, and most of the world's data
will be produced in emerging markets.
It is also predicted that the amount of data produced
will increasingly outpace available storage.
Advances in cloud computing have contributed to this growth.
Refers to the different types of data we can now use. In the past we only focused on structured data that neatly fitted into tables or relational databases, such as financial data. In fact, 80% of the world's data is unstructured (text, images, video, voice, etc.). With big data technology we can now analyse and bring together data of different types such as messages, social media conversations, photos, sensor data, video or voice recordings.
Big Data Veracity refers to the biases, noise and abnormality in data.
refers to the messiness or trustworthiness of the data. With many forms of big data, quality and accuracy are less controllable (just think of Twitter posts with hashtags, abbreviations, typos and colloquial speech, as well as the reliability and accuracy of content), but technology now allows us to work with this type of data.
The first season of the show was released in 2013 and it was an immediate hit.
At the time, the New York Times reported that
Netflix executives knew that House of Cards
would be a hit before they even filmed it,
but how do they know that?
Big data.
Netflix has a lot of data.
Netflix knows the time of day when movies are watched.
It logs when users pause, rewind and fast forward.
It has ratings from millions of users
as well as the information on searches they make.
By looking at all this big data,
Netflix knew many of its users
had streamed the work of David Fincher
and films featuring Kevin Spacey had always done well.
And it knew that the British version of House of Cards
had also done well.
It also knew that people who liked Fincher
also liked Spacey.
All this information suggested
that buying the series would be a good bet for the company,
and in fact it was.
In other words, thanks to big data,
Netflix knows what people want before they do.
Better understand and target customers:
To better understand and target customers, companies expand their traditional data sets with social media data, browser data, text analytics or sensor data to get a more complete picture of their customers. The big objective, in many cases, is to create predictive models. Using big data, telecom companies can now better predict customer churn, retailers can predict which products will sell, and car insurance companies can understand how well their customers actually drive.
Understand and Optimize Business Processes:
Big data is also increasingly used to optimize business processes. Retailers are able to optimize their stock based on predictive models generated from social media data, web search trends and weather forecasts. Another example is supply chain or delivery route optimization using data from geographic positioning and radio frequency identification sensors.
Improving Health:
The computing power of big data analytics enables us to find new cures and better understand and predict disease patterns. We can use all the data from smart watches and wearable devices to better understand links between lifestyles and diseases. Big data analytics also allow us to monitor and predict epidemics and disease outbreaks, simply by listening to what people are saying, i.e. “Feeling rubbish today - in bed with a cold” or searching for on the Internet, i.e. “cures for flu”.
Improving Security and Law Enforcement:
Security services use big data analytics to foil terrorist plots and detect cyber attacks. Police forces use big data tools to catch criminals and even predict criminal activity, and credit card companies use big data analytics to detect fraudulent transactions.
Improving Sports Performance:
Most elite sports have now embraced big data analytics. Many use video analytics to track the performance of every player in a football or baseball game, sensor technology is built into sports equipment such as basketballs or golf clubs, and many elite sports teams track athletes outside of the sporting environment, using smart technology to track nutrition and sleep, as well as social media conversations to monitor emotional wellbeing.