Big Data: The What and the How

Alan Koo
Senior Consultant | Nagnoi, Inc.
www.alankoo.com | @alan_koo
The video of this presentation is on my blog (www.alankoo.com).
An introduction to what Big Data is and is not, and to Microsoft's strategy based on HDInsight, a distribution based 100% on Apache Hadoop that lets us handle new scenarios in the world of Business Intelligence. This webcast was originally presented at the "Maratón de Business Intelligence" event (Intermezo and Microsoft TechNet).

Published in: Technology
Notes
  • Slide objectives: Set up the problem. Devices and social networks are causing an explosion of data: 1.8 zettabytes last year, and in 2 years 7.8 zettabytes of data will be created each year. Speaking points: New devices and usage scenarios are creating more data than ever. Cheaper storage and compute make it possible to process some of that data, which is why "big data" tools and an industry around them have emerged. Notes: These are the trends triggering the big data revolution. Most of us are already familiar with them, but we need to look at them again from new perspectives. Almost everyone here has one or more mobile devices; the world currently has 5.5 billion devices, reaching 70% of the world's population. Social networks such as Facebook and Twitter have more than 2 billion users and are growing fast; we will reach 7.2 zettabytes of information created per year by 2015. Beyond the data humans create, the next growth area is sensor networks, or the "Internet of Things": there will be more than 10 billion networked sensors in the very near future. At the same time, two other trends are moving in the opposite direction: the costs of compute and of storage have both fallen rapidly. These two trends also help the big data industry grow. Explosive data growth combined with a rapid drop in storage prices suddenly creates an opportunity to invest in big data. In return we get not only information and insight, but also increased productivity and competitiveness. Things we could not do before have suddenly become feasible.
  • Slide objectives: Types of data and the characteristics of big data. Transition: Big data is not simply about volume, but also about how fast the data moves and its unstructured nature. Speaking points: Volume: we created 1.8 zettabytes in the past year, and that amount will double every 1.5 years. Data velocity adds to the difficulty: an SLA becomes much harder to meet when data arrives constantly, as it does from social networks and the Internet of Things. We cannot simply stop data sources from producing data while we fix our systems. Notes: Variety means different types of data; variability means the data structure changes over time. Gartner's Merv Adrian, in a Q1 2011 Teradata Magazine article: "Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population." McKinsey Global Institute, May 2011: "Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze."
  • Telematics: Progressive's in-car telematics device lets insured drivers earn better discounts based on their driving habits. Text: JSON (semi-structured) loaded into relational stores. Dell and Twitter: pattern recognition on the kinds of complaints, to detect problems before they happen. Healthcare: fraud detection, patients, doctors' notes. Legal cases: searching in emails. Place and time: Foursquare, Facebook, Fitbit, RunKeeper; geospatial information. Facebook recommends friends using geospatial data, based on familiarity with the places I go. Biking: who goes biking, and recommending it. RFID: (today) tags in warehouses, on pallets. Each item in the grocery store: where people pick up items (at the door, in the aisle, at the checkout, etc.). Smart grid: smart meters produce a lot of data, to bill specifically for what you use, with more information about when the service is used. Sensors everywhere, in cars and airplanes, feeding predictive analytics to prevent future failures. Xbox, gaming: what are you using, what is too hard, what is too easy? The game can then be made more difficult or easier. Retail: vending machines, inventory, stock. Law enforcement: bracelets. Moving from one phone company to another: how many different people a customer interacts with; the company does not want to lose her. A lot of interactions with their customers. Organization:
  • Similar items: similar web pages. Collaborative filtering: Amazon. Data stream mining: summarize it, or evaluate a set? For example, "from the last 30 tweets, this is what people are saying." Images: case study with the NY police department, Manhattan. Item sets: diapers and beer; someone buying a laptop is likely to buy a mouse or a monitor; placing items together in the market, such as wine and cheese. Plagiarism: items (documents). Related web pages (based on words). Customers are more positive about this and more negative about that. Clustering: cluster items; SSAS, Excel Data Mining. Recommendation systems: Netflix, movies. Social network: unsuccessful communication between team members. They
  • Speaking points: MapReduce is about minimizing the movement of data inside your cluster. The job tracker knows where all the data blocks are and sends the operation code to the node that contains the data.
  • Slide objectives: Understand the HDInsight ecosystem. Speaking points: The biggest buzzword in big data right now is Hadoop. It can mean many things, but it always includes HDFS and MapReduce. HDInsight color legend: Red = in the product now; Blue = planned for the product; Green = ecosystem, can connect now; Purple = samples available; Orange = ecosystem, planned. Flume and HBase are not available in the first release of the HDInsight Service. As of 3/15 there is no on-premises solution, so AD integration is not yet available; System Center integration will come later as well. The green boxes are ecosystem packages that are not included in the service but should work out of the box once downloaded.
  • Slide objectives: Provide one layer to access both the attached/local storage on each node and the remote Windows Azure Blob storage, which is the default. Speaking points: One interface to rule both DFS and Azure Blob storage. Blob storage: Front end: security/auth and scaled-out request handler. Partition layer: object layer, mapping of objects such as tables, blobs, and queues to streams (cached in the front end), CC. Stream layer: 3-node HA, scale-out stream store. See the Windows Azure Storage paper for details. In some ways ASV changes things again: we are now moving data to the compute, since the data is remote. Blob storage lets you persist your data even when you tear down your cluster.
  • Slide objectives: Understand the details of ASV. Speaking points: You need to create an Azure storage account, and you need your account name and key. Create the cluster close to where your data is (for storage in the West region, create the cluster in the West data center).
  • Slide objectives: Walk from the bottom layer up to discuss the Microsoft big data solution. Speaking points: BI platform: SQL Server Analysis Services and Reporting Services. Self-service BI: Power View, PowerPivot, predictive analysis, and embedded BI. Unstructured and structured data sources are taken in through Hadoop or PDW.
  • Slide objectives: Vision slide. Speaking points: Broaden access to Hadoop on the Windows platform. Enterprise ready through AD and System Center (to come). BI integration and self-service BI.
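The volume claim in the speaker notes (1.8 zettabytes in the base year, doubling every 1.5 years) can be checked with a line of arithmetic. What follows is a minimal sketch that assumes steady exponential growth; the function name `projected_zettabytes` is illustrative and does not come from the presentation.

```python
# Figures taken from the speaker notes; steady doubling is an assumption.
BASE_ZB = 1.8          # zettabytes created in the base year
DOUBLING_YEARS = 1.5   # time for the yearly volume to double


def projected_zettabytes(years_from_base: float) -> float:
    """Yearly data volume after `years_from_base` years of steady doubling."""
    return BASE_ZB * 2 ** (years_from_base / DOUBLING_YEARS)


if __name__ == "__main__":
    for years in (0, 1.5, 3):
        print(f"{years:>4} years: {projected_zettabytes(years):.1f} ZB")
```

Three years at this rate is two doublings, so 4 × 1.8 = 7.2 ZB, the same figure the notes quote for 2015.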
Transcript of "Big Data: The What and the How"

    1. Big Data: The What and the How. Alan Koo, Senior Consultant | Nagnoi, Inc. www.alankoo.com | @alan_koo
    2. About me
       • Senior Consultant at Nagnoi, Inc.
       • 13+ years with SQL Server
       • 8+ years in BI & OLAP
       • Microsoft certifications in SQL Server, Business Intelligence, and .NET
       • MCT Regional Lead, Puerto Rico
       • MCT since 2004 for Business Intelligence / SQL Server / .NET
       • Member of the Microsoft BI Advisors group
       • Member of the SSRS Insiders Group
       • Microsoft MVP (2008 – 2011)
       • Co-founder of Puerto Rico PASS
       • Blogger: www.alankoo.com
       Alan Koo | www.alankoo.com
    3. Agenda
       • What is Big Data?
       • Common sources of Big Data
       • Common Big Data scenarios
       • HDInsight: Windows Azure + Hadoop
    4. What is Big Data?
    5. Order in the old world. Hans Lipperhey
    6. The new world
    7. Order of the old world
    8. The new world
    9. Key trends
    10. The transition to Big Data [Diagram labels: Volume, Velocity; small, big; fk/pk, k/v; pull, push; SQL Server, PDW, HDInsight]
    11. What is Big Data? [Figure: data sources arranged by volume, from gigabytes (10^9) through terabytes (10^12) and petabytes (10^15) to exabytes (10^18), and by velocity, variety, and variability. Eras labeled on the figure: ERP / CRM, Web 2.0, Internet of Things. Source labels: social sentiment, click stream, mobile, blogs, wikis, sensors / RFID / devices, audio / video, log files, advertising, eCommerce, collaboration, digital marketing, search marketing, payments, payroll, inventory, contacts, order tracking, sales management, spatial coordinates & GPS, data market feeds, eGov feeds, web logs, weather, recommendations, text/images. Storage cost per GB: 1980 $190,000; 1990 $9,000; 2000 $15; 2010 $0.07.]
    12. Big Data: "Big data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning." http://www.technologyreview.com/view/519851/the-big-data-conundrum-how-to-define-it/
    13. Big Data is…
       • Not the size of the data
       • Not the cool tools like Hadoop and R
       • A new paradigm in how to collect and use data differently
    14. Big Data scenarios
    15. Traditional DW/BI environment [Diagram labels: ETL, Data Warehouse, OLAP, Reporting]
    16. Things customers may be saying
       • We need to parallelize data operations, but it is too costly and complicated…
       • The business cannot access all the relevant data; we need external data…
       • We cannot match customer master data during live interactions…
       • We cannot force everything into a star schema
       • Our BI reports and charts tell us nothing we don't already know
       • We are missing the ETL window; the data we need does not arrive on time…
       • We cannot predict with confidence unless we can explore the data and develop our own models
    17. Types of data generated by sector
    18. Fundamentals
    19. If Chris Paul passes the ball to a teammate within five feet of the basket, there's an 89 percent chance it will result in a score. http://www.adweek.com/news/technology/nba-making-big-data-play-153264
    20. Common data sources. Progressive: http://articles.chicagotribune.com/2013-09-15/classified/ct-biz-0915--telematics-insure-20130915_1_insurance-companies-insurance-telematics-progressive-snapshot
    21. Common algorithms in Big Data
    22. Making it real
    23. Hadoop
       • Collection of open source Apache projects for storing/processing big data (large volumes of unstructured/semi-structured data)
       • Has evolved over the last 7+ years to support some of the largest websites/products in terms of data
       • The base/"kernel" of HDInsight
    24. Hadoop: Distributed architecture
    25. MapReduce: Moving code to the data
    26. How does it work?
    27. Traditional RDBMS vs. NoSQL
    28. What is HDInsight?
       • Enterprise-grade data platform
       • Built on Hadoop in partnership with Hortonworks
       • Currently available as a "preview" service on Windows Azure
    29. Windows Azure HDInsight Service [Diagram labels: Job submission (Hive query, etc.); Query & Metadata; Data Movement; Workflow; Monitoring; Hadoop Filesystem Interface; Data upload/download]
    30. Windows Azure HDInsight Service [Diagram label: Job submission (Hive query, etc.)]
    31. [Diagram labels: Distributed Processing (MapReduce); Distributed Storage (HDFS); ODBC; Query (Hive). Legend: Red = Core Hadoop; Gray = Data processing; Purple = Microsoft integration points and value adds; Orange = Data movement; Green = Packages]
    32. Storing data in HDInsight
    33. HDFS on Azure: A Tale of Two File Systems [Diagram labels: HDFS API; Name Node; Data Node, Data Node, …; DFS (1 Data Node per Worker Role) and Compute Cluster; Azure Blob Storage: Front end, Front end, Front end, Partition Layer, Stream Layer; Azure Storage (ASV)]
    34. Azure Storage (ASV)
       • Default file system for HDInsight
       • Provides shareable, persistent, highly scalable, and highly available storage (Azure Blob Store)
       • Azure storage alone does not provide compute
       • Fast access from the compute nodes to data in the same data center
       • Multiple file systems, reachable via: asv[s]://<container>@<account>.blob.core.windows.net/<path>
       • Requires the storage key in core-site.xml:
         <property>
           <name>fs.azure.account.key.accountname</name>
           <value>enterthekeyvaluehere</value>
         </property>
    35. Consuming results from HDInsight
       Destination | Tool / Library | Requires an active HDInsight cluster
       SQL Server, Azure SQL DB | Sqoop (Hadoop ecosystem project) | Yes
       Excel | Codename "Data Explorer" | No
       Another Blob Storage account | Azure Blob Storage REST APIs (Copy Blob, etc.) | No
       SQL Server Analysis Services | Hive ODBC Driver | Yes
       Existing BI apps | Hive ODBC Driver (assumes the app supports ODBC connections to data sources) | Yes
    36. Traditional DW/BI environment [Diagram labels: ETL, Data Warehouse, OLAP, Reporting]
    37. Tomorrow's DW/BI environment [Diagram labels: ETL, Data Warehouse, Business critical, OLAP, Reporting]
    38. Microsoft's Big Data solution
    39. In summary
       • HDInsight is an enterprise-grade, Hadoop-based platform for "big data" storage
       • Azure Blob Storage + HDInsight == simple "big data" storage and processing in the cloud, available to try today
       • We can consume HDInsight results in familiar tools and applications (Excel, etc.); it is simple with Power Query, Azure Blob APIs, Sqoop, ODBC, etc.
    40. Questions?
    41. Resources / References
       • http://brianwmitchell.com/
       • bit.ly/loKoMN
       • Do You Have Big Data? (Most Likely!): bit.ly/1awKcqE
       • Introduction To Windows Azure HDInsight Service: bit.ly/1awL923
       • Data Management in Microsoft HDInsight: How to Move and Store Your Data: bit.ly/16jqv9M
       • Make Your Apps Smarter with Azure HDInsight: bit.ly/1b1mtQN
       • http://nuget.org/packages?q=hadoop
       • http://hadoopsdk.codeplex.com
    42. Thank you! Alan Koo Labrín, Senior Consultant | Nagnoi, Inc. Blog: www.alankoo.com Twitter: @alan_koo
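Slide 25 and its speaker note describe MapReduce as moving code to the data: the job tracker sends the operation to the node that already holds each block. The sketch below imitates that flow in a single Python process with a word count. It is an illustration under stated assumptions, not Hadoop's API: the node names, the `NODE_BLOCKS` layout, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical cluster: each "node" already holds its own block of text.
NODE_BLOCKS: Dict[str, List[str]] = {
    "node-1": ["big data big"],
    "node-2": ["data moves code"],
}


def map_words(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step: runs against one node's block and emits (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group the emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(node_blocks: Dict[str, List[str]]) -> Dict[str, int]:
    """Apply the same map function to every node's block, then reduce."""
    emitted: List[Tuple[str, int]] = []
    for _node, block in node_blocks.items():
        # Only the small (word, 1) pairs leave the node, not the raw block.
        emitted.extend(map_words(block))
    return reduce_counts(shuffle(emitted))


if __name__ == "__main__":
    print(run_job(NODE_BLOCKS))
```

The point of the design is that `map_words` is small and travels to the data, while the blocks themselves stay where they are stored; only the intermediate pairs cross the network in a real cluster.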

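Slide 34 gives the ASV address pattern and the `core-site.xml` property that carries the storage key. As a rough sketch, the two helpers below build both strings from their parts. The helper names are hypothetical, the `://` separator assumes the usual Hadoop `scheme://` URI form, and the property name simply follows the pattern shown on the slide; check the HDInsight documentation before relying on either.

```python
from xml.etree import ElementTree as ET


def asv_uri(container: str, account: str, path: str, secure: bool = False) -> str:
    """Build an asv:// (or asvs://) address following the slide's pattern."""
    scheme = "asvs" if secure else "asv"
    clean_path = path.lstrip("/")
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{clean_path}"


def storage_key_property(account: str, key: str) -> str:
    """Render the <property> element that the slide places in core-site.xml."""
    prop = ET.Element("property")
    # Property name pattern copied from the slide (an assumption, not verified).
    ET.SubElement(prop, "name").text = f"fs.azure.account.key.{account}"
    ET.SubElement(prop, "value").text = key
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    # Placeholder container, account, and key values for illustration only.
    print(asv_uri("logs", "myaccount", "/2013/09/clicks.txt"))
    print(storage_key_property("accountname", "enterthekeyvaluehere"))
```

Keeping the key in configuration rather than in the URI means jobs can refer to data by address alone, which fits the slide's point that the storage outlives any one cluster.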