Abdulrahman AlQallaf

Decluttering my mind into the web ...









Chapter 7

Big Data Concepts and Tools





1. Introduction



Big Data is data that is difficult to:

  • capture
  • process
  • store
  • manage

The “V”s that define big data:

  • volume
  • variety
  • velocity – the speed at which data is generated and must be handled; at-rest analytics vs. stream analytics
  • veracity – conformity to facts; how accurate and trustworthy the data is
  • variability – data flows can be highly inconsistent, with periodic peaks
  • value proposition



Big data by itself is useless unless someone does something with it that delivers value.

The critical success factors for big data analytics are:

(see Figure 7.4)



High-performance computing techniques:

  • in-memory analytics – process the data in RAM instead of on disk
  • in-database analytics – push the analytics into the DBMS, next to the data
  • grid computing – pool the resources of many distributed machines
  • appliances – purpose-built, integrated hardware/software bundles

Things to be aware of when considering big data projects:

  • data volume
  • data integration
  • processing capabilities
  • data governance
  • skills availability
  • solution cost



2. Big Data Technologies



Big data technologies share three characteristics:

  • They take advantage of commodity hardware to enable scale and parallel processing techniques.
  • They employ non-relational data storage capabilities to process unstructured and semistructured data.
  • They apply advanced analytics and data visualization technology to big data to convey insights to end users.



The three big data technologies:

  • MapReduce
  • Hadoop
  • NoSQL

Note: it is best to think of them as an ecosystem (mostly of open source software), not a single product.



MapReduce

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

(see Figure 7.6)
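To make the model concrete, here is a minimal word-count sketch in plain Python that mimics the map, shuffle/sort, and reduce phases on a single machine (no Hadoop involved; the phase functions are illustrative, not a real framework API):

    from itertools import groupby
    from operator import itemgetter

    # Map phase: emit (key, value) pairs -- here, (word, 1) for every word.
    def map_phase(document):
        for word in document.split():
            yield (word.lower(), 1)

    # Shuffle/sort phase: group all values that share the same key.
    # On a real cluster the framework does this across machines.
    def shuffle(pairs):
        grouped = groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))
        return ((key, [v for _, v in group]) for key, group in grouped)

    # Reduce phase: fold each key's list of values into one result.
    def reduce_phase(key, values):
        return (key, sum(values))

    docs = ["big data is big", "data in motion"]
    pairs = [p for d in docs for p in map_phase(d)]
    print([reduce_phase(k, vs) for k, vs in shuffle(pairs)])
    # [('big', 2), ('data', 2), ('in', 1), ('is', 1), ('motion', 1)]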



Hadoop

Its fundamental concept is that rather than banging away at one huge block of data with a single machine, Hadoop breaks big data up into multiple parts so that each part can be processed and analyzed at the same time (a toy illustration follows the list below).

  • Hadoop sits at both ends of the large-scale data life cycle:
    • First when raw data is born…
    • and finally when data is retiring, but is still occasionally needed.
  • Therefore, Hadoop can be used as an active, always-on archive.
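Here is a toy illustration of that split-and-process idea, using Python's multiprocessing pool as a stand-in for a cluster (the data, the chunking, and the process_part job are all made up for the example):

    from multiprocessing import Pool

    # Pretend this list is one huge block of data.
    big_data = [f"record-{i}" for i in range(1_000_000)]

    # Hypothetical per-part job: here, just count the records in one chunk.
    def process_part(chunk):
        return len(chunk)

    # Break the data into roughly equal parts.
    def split(data, parts):
        size = len(data) // parts
        chunks = [data[i * size:(i + 1) * size] for i in range(parts - 1)]
        return chunks + [data[(parts - 1) * size:]]

    if __name__ == "__main__":
        with Pool(4) as pool:
            # Each part is processed at the same time; results are then combined.
            print(sum(pool.map(process_part, split(big_data, 4))))  # 1000000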

Hadoop technical components:

  • HDFS – the Hadoop Distributed File System; a file system, not a DBMS
  • name node – keeps track of where each block of data lives in the cluster
  • secondary node – periodically checkpoints the name node's metadata (not a hot standby)
  • job tracker – coordinates MapReduce jobs and hands work to the slave nodes
  • slave nodes – store the data blocks and do the actual processing

Hadoop ecosystem / complementary subprojects:

  • Hive, an open-source data warehouse – its query language resembles SQL, but it is not standard SQL (see the sketch after this list).
  • Mahout, a machine-learning library for data science (clustering, classification, recommendation).
  • HCatalog, for metadata management.
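To get a feel for how HiveQL resembles, yet diverges from, standard SQL, here is a hedged sketch: the table, its HDFS path, and the HiveServer2 endpoint are assumptions, and PyHive is just one common Python client:

    from pyhive import hive  # assumes a reachable HiveServer2

    # HiveQL looks like SQL, but note the non-standard parts:
    # EXTERNAL tables over HDFS directories and explicit field delimiters.
    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
        ip  STRING,
        ts  STRING,
        url STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/raw/web_logs'
    """

    conn = hive.connect(host="localhost", port=10000)  # hypothetical endpoint
    cur = conn.cursor()
    cur.execute(ddl)
    cur.execute("SELECT url, COUNT(*) AS hits FROM web_logs "
                "GROUP BY url ORDER BY hits DESC LIMIT 10")
    print(cur.fetchall())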

Important remarks:

  • Hadoop is about data diversity, not just data volume.
  • Hadoop complements a DW; it is rarely a replacement.



Hadoop vs. a DW

(see Table 7.1)



Hadoop coexistence with a DW

(see Figure 7.8)


The coexistence of Hadoop and DW:

  • Use Hadoop for storing and archiving multistructured data.
  • Use Hadoop for filtering, transforming, and/or consolidating multistructured data.
  • Use Hadoop to analyze large volumes of multistructured data and publish the analytics results.
  • Use a relational DBMS that provides MapReduce capabilities as an investigative computing platform.
  • Use a front-end query tool to access and analyze data.



NoSQL (Not only SQL)

  • Offers schema on read (unlike the traditional schema-on-write approach): structure is imposed when the data is read, not when it is stored (sketch after this list).
  • The downside of most NoSQL databases today is that they trade ACID (atomicity, consistency, isolation, durability) compliance for performance and scalability.
  • Many also lack mature management and monitoring tools.
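A minimal sketch of schema on read, using JSON documents as a stand-in for a NoSQL document store (the documents and field names are invented): the records are stored exactly as they arrive, and structure is only imposed when they are read back.

    import json

    # Stored as-is, with no schema enforced at write time --
    # note that the three documents have three different shapes.
    raw_docs = [
        '{"user": "amal", "clicks": 12}',
        '{"user": "noor", "clicks": "7", "country": "KW"}',
        '{"user": "sara"}',
    ]

    # The "schema" is applied here, at read time: pick the fields you need,
    # coerce types, and supply defaults for anything missing.
    def read_with_schema(doc):
        d = json.loads(doc)
        return {"user": d["user"], "clicks": int(d.get("clicks", 0))}

    print([read_with_schema(d) for d in raw_docs])
    # [{'user': 'amal', 'clicks': 12}, {'user': 'noor', 'clicks': 7},
    #  {'user': 'sara', 'clicks': 0}]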



New / Emerging

Stream analytics == data-in-motion analytics == real-time data analytics.
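As a tiny taste of data-in-motion analytics, here is a sketch that processes each event as it arrives and keeps only a running statistic, never storing the stream (the sensor readings and the alert threshold are simulated, made up for the example):

    import random

    # Simulated unbounded stream of sensor readings (data in motion).
    def sensor_stream(n):
        for _ in range(n):
            yield random.gauss(20.0, 2.0)

    # Analyze each event as it arrives: incremental mean, no data at rest.
    count, mean = 0, 0.0
    for reading in sensor_stream(10_000):
        count += 1
        mean += (reading - mean) / count  # running mean, updated in place
        if reading > 28.0:  # hypothetical alert threshold
            print(f"alert: reading {reading:.1f} at event {count}")
    print(f"running mean after {count} events: {mean:.2f}")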