BigData Analytics Platform
System Components:
- Data Sources: External data sources from which data will be collected, such as DMP/DSP, CRM, email marketing, First data, Google Analytics, legacy systems, etc.
- Data Model: The data model and schema used to store raw, in-process, analyzed, and aggregated data in data stores such as Hadoop, event stores, and the data warehouse.
- Data Ingestion: The data ingestion system will consist of data collectors that import data from the data sources and inject it into the data staging servers. A set of microservices will be developed to run the data collectors for the different data sources. Collectors can be batch processors or continuous-feed processors, or even push-pull systems, depending on the data source and its interaction protocol (HTTP, etc.).
- Data Staging: Data staging consists of intermediary servers that temporarily host the incoming ingested data (from the data collectors) and stream it into the HDFS cluster. This decouples ingestion from storage in the Hadoop cluster. Using Kafka requires separate services to stream data into HDFS, whereas Flume has built-in support for such tasks. Apache Sqoop will be used to ingest relational SQL data.
- Data Lake: Hadoop cluster to store the incoming raw data.
- Data Processing: Spark batch processing for data management tasks such as transforming, integrating, correlating, and enriching data. This massaged data will be stored back to HDFS for further processing.
- Data Analysis: The data analytics system will analyze the integrated data on the Apache Spark platform.
- Data Warehouse: The data warehouse stores the results of the analysis. It can be a SQL Server database or an Amazon Redshift warehouse. The resulting analyzed data (apart from being stored in SQL Server/Redshift) will also be stored on the HDFS cluster in Apache Hive.
- Workflow Orchestration System: A centralized coordination service used to create workflows for data ingestion, storage, processing, and integration tasks. Example: Apache Oozie.
- Tableau Data Exporters: Tableau data exporters export data from both Apache Hive storage and the data warehouse (SQL Server / Amazon Redshift). Tableau can connect live to different data sources such as Amazon Redshift, Hive, SQL Server, file servers, etc. Exporters are required only when a direct connection to the data warehousing system is not allowed.
- Data Visualization: Tableau/Power BI server platform for visualization, creating graphs and charts from the analyzed and aggregated data.
- Application Services: Application services for application configuration, integrated entity descriptions, data models, monitoring, and administration.
- Application UI: UI for exposing the data from the application services; it could be both web and mobile.
- Application and Infrastructure Monitoring Platform: Centralized platform to collect, store, and analyze monitoring events from the different applications and infrastructure components, and to define policies and rules for raising alerts and notifications. Examples: log aggregation from different services for centralized storage, search, and visualization in Elasticsearch; integration with Amazon CloudWatch.
- Security Model: The security model is responsible for the complete security of all in-flight and stored data. It consists of elements to secure the Hadoop and cloud components, access to external data sources, the ingested data channels, and the enterprise warehouse.
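For the relational side of data staging, a Sqoop import of a source table into HDFS looks roughly like the following command fragment. The JDBC URL, credentials, table name, and target directory are placeholders, not the platform's actual configuration:

```shell
# Illustrative only: connection details and paths are placeholders.
sqoop import \
  --connect jdbc:mysql://crm-db.example.com:3306/crm \
  --username etl_user -P \
  --table customers \
  --target-dir /data/raw/crm/customers \
  --as-avrodatafile
```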
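If the workflow orchestration system is implemented with Apache Oozie, the pipeline steps are expressed as workflow XML. A minimal sketch with a single Sqoop action follows; the workflow name, transitions, and all `${...}` properties are hypothetical placeholders:

```xml
<workflow-app name="daily-ingest" xmlns="uri:oozie:workflow:0.5">
  <start to="sqoop-import"/>
  <action name="sqoop-import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <command>import --connect ${jdbcUrl} --table customers --target-dir ${rawDir}</command>
    </sqoop>
    <ok to="done"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Ingestion failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="done"/>
</workflow-app>
```

A full pipeline would chain further actions (Spark transform, Hive load) through the same ok/error transitions.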
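To make the data ingestion stage concrete, a batch collector can be sketched as below. This is an illustrative Python sketch only, not the platform's actual code: `collect_batch`, `fake_crm_source`, and the newline-delimited-JSON staging layout are hypothetical stand-ins for a real source-specific collector microservice.

```python
import json
from pathlib import Path

def collect_batch(fetch_records, staging_dir, source_name):
    """Pull records from a source and write them to a staging file as
    newline-delimited JSON, ready to be streamed into HDFS."""
    staging_dir = Path(staging_dir)
    staging_dir.mkdir(parents=True, exist_ok=True)
    out_path = staging_dir / f"{source_name}.jsonl"
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for record in fetch_records():
            out.write(json.dumps(record) + "\n")
            count += 1
    return out_path, count

# Stubbed CRM source, standing in for a real HTTP or push-pull feed.
def fake_crm_source():
    yield {"customer_id": 1, "email": "a@example.com"}
    yield {"customer_id": 2, "email": "b@example.com"}
```

A real collector would replace `fake_crm_source` with a source-specific client (HTTP pull, webhook push, file drop) while keeping the same staging contract.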
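The transform/correlate/enrich step in data processing would run as a Spark batch job; its core logic can be sketched in plain Python. The field names (`customer_id`, `segment`, `clicks`) are hypothetical, and the dictionary lookup stands in for what would be a Spark join in production:

```python
def enrich(events, customers):
    """Correlate raw events with CRM customer records -- a pure-Python
    sketch of what would be a Spark join on customer_id in production."""
    by_id = {c["customer_id"]: c for c in customers}
    enriched = []
    for event in events:
        customer = by_id.get(event["customer_id"], {})
        # Attach the customer's segment; mark unknown when unmatched.
        enriched.append({**event, "segment": customer.get("segment", "unknown")})
    return enriched
```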
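The data analysis stage ends in aggregates destined for the warehouse and the visualization layer. A pure-Python sketch of such a roll-up (again with hypothetical fields, standing in for a Spark aggregation):

```python
from collections import defaultdict

def aggregate_by_segment(enriched_events):
    """Roll enriched events up into per-segment totals -- the kind of
    aggregate that would land in the warehouse for Tableau/Power BI."""
    totals = defaultdict(int)
    for event in enriched_events:
        totals[event["segment"]] += event.get("clicks", 0)
    return dict(totals)
```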
Author: Mayank Garg, Technology Enthusiast and Georgia Tech Alumnus (https://in.linkedin.com/in/mayankgarg12)