Big data refers to data sets so voluminous and complex that traditional data-processing application software is inadequate to deal with them. Big data challenges include data capture, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data sources. A number of concepts are associated with big data: originally there were three, volume, variety, and velocity. Concepts later attributed to big data are veracity (i.e., how much noise is in the data) and value.
Lately, the term "big data" tends to refer to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. "There is little doubt that the amount of data available today is indeed large, but that is not the most relevant characteristic of this new data ecosystem." Analysis of data sets can find new correlations for "looking at business trends, preventing illness, fighting crime, and so on." Scientists, business executives, medical practitioners, advertisers, and governments alike regularly meet difficulties with large data sets in areas including Internet search, fintech, urban informatics, and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics, connectomics, complex physics simulations, biology, and environmental research.
Data sets grow rapidly, in part because they are increasingly gathered by cheap and numerous information-sensing Internet of things devices such as mobile devices, aerial (remote sensing) platforms, software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, 2.5 exabytes (2.5×10^18 bytes) of data were generated every day. Based on an IDC report prediction, the global data volume will grow exponentially from 4.4 zettabytes to 44 zettabytes between 2013 and 2020. By 2025, IDC predicts there will be 163 zettabytes of data. One question for large enterprises is determining who should own big-data initiatives that affect the entire organization.
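The growth figures above can be put on a common footing with a little arithmetic. The following is a minimal sketch, not taken from the cited reports: it converts the 40-month doubling time into an annual growth rate and derives the compound annual growth rate implied by the IDC figures (4.4 to 44 zettabytes, 2013 to 2020). The function names are illustrative.

```python
import math


def annual_rate_from_doubling(months):
    """Annual growth rate implied by a doubling time given in months."""
    return 2 ** (12 / months) - 1


def compound_annual_growth(start, end, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1 / years) - 1


# Storage capacity per capita: doubling every 40 months.
storage_growth = annual_rate_from_doubling(40)

# IDC prediction: 4.4 ZB (2013) to 44 ZB (2020), i.e. tenfold in 7 years.
idc_growth = compound_annual_growth(4.4, 44.0, 2020 - 2013)

# Doubling time (in months) implied by the IDC growth rate.
idc_doubling_months = 12 * math.log(2) / math.log(1 + idc_growth)

print(f"storage capacity: about {storage_growth:.1%} per year")
print(f"IDC data volume:  about {idc_growth:.1%} per year")
print(f"IDC doubling time: about {idc_doubling_months:.1f} months")
```

Under these assumptions the predicted data volume grows noticeably faster (roughly 39% per year) than per-capita storage capacity (roughly 23% per year).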
Relational database management systems and desktop statistics and visualization packages often have difficulty handling big data. The work may require "massive parallel software running on dozens, hundreds, or even thousands of servers". What counts as "big data" varies depending on the capabilities of the users and their tools, and expanding capabilities make big data a moving target. "For some organizations, encountering hundreds of gigabytes of data for the first time can trigger the need to reconsider data management options. For others, it may take tens or hundreds of terabytes before the size of the data becomes a significant consideration."
Definition
The term has been in use since the 1990s, with some giving credit to John Mashey for coining it or at least popularizing it. Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time. Big data philosophy encompasses unstructured, semi-structured, and structured data, but the main focus is on unstructured data. Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many exabytes of data. Big data requires a set of techniques and technologies with new forms of integration to reveal insights from data sets that are diverse, complex, and of massive scale.
A 2016 definition states that "Big data represents the information assets characterized by such a high volume, velocity and variety to require specific technology and analytical methods for its transformation into value". In addition, a new V, veracity, is added by some organizations to describe it, a revisionism challenged by some industry authorities. The three Vs (volume, variety, and velocity) have been further expanded to other complementary characteristics of big data:
- Machine learning: big data often does not ask why and simply detects patterns
- Digital footprint: big data is often a cost-free byproduct of digital interaction
A 2018 definition states "Big data is where parallel computing tools are needed to handle data", and notes, "This represents a distinct and clearly defined change in the computer science used, via parallel programming theories, and losses of some of the guarantees and capabilities made by Codd's relational model."
The growing maturity of the concept more starkly delineates the difference between "big data" and "business intelligence":
- Business intelligence uses descriptive statistics with data with high information density to measure things, detect trends, etc.
- Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density to reveal relationships and dependencies, or to predict outcomes and behaviors.
Characteristics
Big data can be described by the following characteristics:
- Volume: the quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not.
- Variety: the type and nature of the data. This helps people who analyze it to effectively use the resulting insight. Big data draws from text, images, audio, and video; it also completes missing pieces through data fusion.
- Velocity: in this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time.
- Veracity: the quality of captured data can vary greatly, affecting the accuracy of analysis.
Factory work and cyber-physical systems may have a 6C system:
- Connection (sensor and network)
- Cloud (computing and data on demand)
- Cyber (model and memory)
- Content/context (meaning and correlation)
- Communities (sharing and collaboration)
- Customization (personalization and value)
Data must be processed with advanced tools (analytics and algorithms) to reveal meaningful information. For example, to manage a factory, one must consider both visible and invisible issues with various components. Information generation algorithms must detect and address invisible issues such as machine failure, component wear, etc. on the factory floor.
Architecture
Big data repositories have existed in many forms, often built by corporations with a special need. Commercial vendors historically offered parallel database management systems for big data beginning in the 1990s. For many years, WinterCorp published a largest database report.
Teradata Corporation in 1984 marketed the parallel processing DBC 1012 system. Teradata systems were the first to store and analyze 1 terabyte of data in 1992. Hard disk drives were 2.5 GB in 1991, so the definition of big data continuously evolves according to Kryder's Law. Teradata installed the first petabyte-class RDBMS-based system in 2007. As of 2017, there are a few dozen petabyte-class Teradata relational databases installed, the largest of which exceeds 50 PB. Systems up until 2008 were 100% structured relational data. Since then, Teradata has added unstructured data types including XML, JSON, and Avro.
In 2000, Seisint Inc. (now LexisNexis Group) developed a C++-based distributed file-sharing framework for data storage and query. The system stores and distributes structured, semi-structured, and unstructured data across multiple servers. Users can build queries in a C++ dialect called ECL. ECL uses an "apply schema on read" method to infer the structure of stored data when it is queried, instead of when it is stored. In 2004, LexisNexis acquired Seisint Inc. and in 2008 acquired ChoicePoint, Inc. and their high-speed parallel processing platform. The two platforms were merged into HPCC (High-Performance Computing Cluster) Systems, and in 2011 HPCC was open-sourced under the Apache v2.0 License. The Quantcast File System became available at about the same time.
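The "apply schema on read" idea can be illustrated with a small sketch. This is not ECL and not how HPCC is implemented; it is a hypothetical Python example in which records are stored as raw text and a schema (field names and converters, both chosen here for illustration) is applied only when the data is queried.

```python
import json

# Raw records are stored untouched; no structure is enforced at write time.
RAW_STORE = [
    '{"name": "Ada", "age": "36", "city": "London"}',
    '{"name": "Linus", "age": "54"}',
    'not valid json',
]


def read_with_schema(raw_records, schema):
    """Yield rows shaped by `schema` (field name -> converter) at read time.

    Records that cannot be parsed as JSON objects are skipped; fields
    missing from a record are returned as None.
    """
    for raw in raw_records:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        row = {}
        for field, convert in schema.items():
            value = record.get(field)
            row[field] = convert(value) if value is not None else None
        yield row


# Two different "views" over the same stored bytes.
people = list(read_with_schema(RAW_STORE, {"name": str, "age": int}))
cities = list(read_with_schema(RAW_STORE, {"city": str}))
print(people)
print(cities)
```

The design trade-off is that writes stay cheap and flexible, while every reader pays the cost of parsing and has to decide how to treat malformed or incomplete records.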
CERN and other physics experiments have collected big data sets for many decades, usually analyzed via high-performance computing (supercomputers) rather than the commodity map-reduce architectures usually meant by the current "big data" movement.
In 2004, Google published a paper on a process called MapReduce that uses a similar architecture. The MapReduce concept provides a parallel processing model, and an associated implementation was released to process huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step). The results are then gathered and delivered (the Reduce step). The framework was very successful, so others wanted to replicate the algorithm. Therefore, an implementation of the MapReduce framework was adopted by an Apache open-source project named Hadoop. Apache Spark was developed in 2012 in response to limitations in the MapReduce paradigm, as it adds the ability to set up many operations (not just map followed by reduce).
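The Map and Reduce steps can be sketched with a word count, the customary introductory example. This is a single-process illustration of the programming model only, not Google's or Hadoop's implementation; in a real cluster each chunk would be mapped on a different node and the grouping step would move data across the network. All function names are illustrative.

```python
from collections import defaultdict


def map_step(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the values emitted by all mappers by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_step(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is mapped independently, which is what makes the Map
    # step easy to distribute; here the calls simply run one after another.
    mapped = [map_step(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_step(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```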
MIKE2.0 is an open approach to information management that acknowledges the need for revisions due to big data implications identified in an article titled "Big Data Solution Offering". The methodology addresses handling big data in terms of useful permutations of data sources, complexity in interrelationships, and difficulty in deleting (or modifying) individual records.
Studies in 2012 showed that a multiple-layer architecture is one option to address the issues that big data presents. A distributed parallel architecture distributes data across multiple servers; these parallel execution environments can dramatically improve data processing speeds. This type of architecture inserts data into a parallel DBMS, which implements the use of MapReduce and Hadoop frameworks. This type of framework looks to make the processing power transparent to the end user by using a front-end application server.
Big data analytics for manufacturing applications is marketed as a 5C architecture (connection, conversion, cyber, cognition, and configuration).
The data lake allows an organization to shift its focus from centralized control to a shared model to respond to the changing dynamics of information management. This enables quick segregation of data into the data lake, thereby reducing the overhead time.
Technology
A 2011 McKinsey Global Institute report characterizes the main components and ecosystem of big data as follows:
- Techniques for analyzing data, such as A/B testing, machine learning, and natural language processing
- Big data technologies, such as business intelligence, cloud computing, and databases
- Visualization, such as charts, graphs, and other displays of the data
Multidimensional big data can also be represented as tensors, which can be more efficiently handled by tensor-based computation, such as multilinear subspace learning. Additional technologies being applied to big data include massively parallel processing (MPP) databases, search-based applications, data mining, distributed file systems, distributed databases, cloud and HPC-based infrastructure (applications, storage, and computing resources), and the Internet. Although many approaches and technologies have been developed, it still remains difficult to carry out machine learning with big data.
Some MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS.
DARPA's Topological Data Analysis program seeks the fundamental structure of massive data sets, and in 2008 the technology went public with the launch of a company called Ayasdi.
Practitioners of big data analytics processes are generally hostile to slower shared storage, preferring direct-attached storage (DAS) in its various forms, from solid state drives (SSD) to high-capacity SATA disks buried inside parallel processing nodes. The perception of shared storage architectures, storage area network (SAN) and network-attached storage (NAS), is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost.
Real-time or near-real-time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in memory is good; data on a spinning disk at the other end of an FC SAN connection is not. The cost of a SAN at the scale needed for analytics applications is much higher than that of other storage techniques.
There are advantages as well as disadvantages to shared storage in big data analytics, but big data analytics practitioners as of 2011 did not favor it.
Big Data virtualization
Big Data virtualization is a way of gathering data from multiple sources in a single layer. The resulting data layer is virtual. Unlike other methods, most of the data remains in place and is taken on demand directly from the source systems.
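A minimal sketch of such a virtual layer follows. It is a hypothetical illustration, not a description of any particular product: the class name, the source names, and the in-memory dictionaries standing in for source systems are all assumptions. The point it shows is that the layer stores no data of its own and fetches from the source only when a query arrives.

```python
class VirtualDataLayer:
    """A single access point that holds fetch functions, not copies of data."""

    def __init__(self):
        self._sources = {}
        self.fetch_log = []  # records every on-demand fetch, for illustration

    def register(self, name, fetch):
        """Register a source by name with a callable that fetches by key."""
        self._sources[name] = fetch

    def query(self, name, key):
        """Fetch `key` from the named source at the moment of the query."""
        self.fetch_log.append((name, key))
        return self._sources[name](key)


# Stand-ins for two independent source systems (hypothetical data).
crm = {"c1": {"name": "Ada"}}
orders = {"c1": [101, 102]}

layer = VirtualDataLayer()
layer.register("crm", crm.get)
layer.register("orders", lambda key: list(orders.get(key, [])))

customer = layer.query("crm", "c1")
customer_orders = layer.query("orders", "c1")

# A change in the source system is visible on the next query,
# because nothing was copied into the layer.
orders["c1"].append(103)
fresh_orders = layer.query("orders", "c1")
print(customer, customer_orders, fresh_orders)
```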
Applications
Big data has increased the demand for information management specialists so much that Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP, and Dell have spent more than $15 billion on software firms specializing in data management and analytics. In 2010, this industry was worth more than $100 billion and was growing at almost 10 percent a year: about twice as fast as the software business as a whole.
Developed economies increasingly use data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people accessing the Internet. Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, which means more people became more literate, which in turn led to information growth. The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, and 65 exabytes in 2007, and predictions put the amount of Internet traffic at 667 exabytes annually by 2014. According to one estimate, one-third of the globally stored information is in the form of alphanumeric text and still image data, which is the format most useful for most big data applications. This also shows the potential of yet unused data (i.e., in the form of video and audio content).
While many vendors offer off-the-shelf solutions for big data, experts recommend the development of in-house solutions custom-tailored to solve the company's problem at hand if the company has sufficient technical capabilities.
Government
The use and adoption of big data within governmental processes allows efficiencies in terms of cost, productivity, and innovation, but does not come without its flaws. Data analysis often requires multiple parts of government (central and local) to work in collaboration and create new and innovative processes to deliver the desired outcome.
CRVS (Civil Registration and Vital Statistics) collects all certificate statuses from birth to death. CRVS is a source of big data for governments.
International development
Research on the effective usage of information and communication technologies for development (also known as ICT4D) suggests that big data technology can make important contributions but also present unique challenges to international development. Advancements in big data analysis offer cost-effective opportunities to improve decision-making in critical development areas such as health care, employment, economic productivity, crime, security, and natural disaster and resource management. Additionally, user-generated data offers new opportunities to give the unheard a voice. However, longstanding challenges for developing regions, such as inadequate technological infrastructure and economic and human resource scarcity, exacerbate existing concerns with big data such as privacy, imperfect methodology, and interoperability issues.
Manufacturing
Based on the TCS 2013 Global Trend Study, improvements in supply planning and product quality provide the greatest benefit of big data for manufacturing. Big data provides an infrastructure for transparency in the manufacturing industry, which is the ability to unravel uncertainties such as inconsistent component performance and availability. Predictive manufacturing, as an applicable approach toward near-zero downtime and transparency, requires vast amounts of data and advanced prediction tools for a systematic process of turning data into useful information. A conceptual framework of predictive manufacturing begins with data acquisition, where different types of sensory data are available to acquire, such as acoustics, vibration, pressure, current, voltage, and controller data. The vast amount of sensory data, in addition to historical data, constitutes the big data in manufacturing. The generated big data acts as the input into predictive tools and preventive strategies such as Prognostics and Health Management (PHM).
Health Care
Big data analytics has helped healthcare improve by providing personalized medicine and prescriptive analytics, clinical risk intervention and predictive analytics, waste and care variability reduction, automated external and internal reporting of patient data, standardized medical terms and patient registries, and fragmented point solutions. Some areas of improvement are more aspirational than actually implemented. The level of data generated within healthcare systems is not trivial. With the added adoption of mHealth, eHealth, and wearable technologies, the volume of data will continue to increase. This includes electronic health record data, imaging data, patient-generated data, sensor data, and other forms of difficult-to-process data. There is now an even greater need for such environments to pay greater attention to data and information quality. "Big data very often means 'dirty data' and the fraction of data inaccuracies increases with the growth of data volumes." Human inspection at the big data scale is impossible, and there is a pressing need in health services for intelligent tools for accuracy and believability control and for handling missed information. While extensive information in healthcare is now electronic, it fits under the big data umbrella because most of it is unstructured and difficult to use.
Education
A McKinsey Global Institute study found a shortage of 1.5 million highly trained data professionals and managers, and a number of universities, including the University of Tennessee and UC Berkeley, have created master's programs to meet this demand. Private bootcamps have also developed programs to meet that demand, including free programs like The Data Incubator or paid programs like General Assembly. In the specific field of marketing, one of the problems stressed by Wedel and Kannan is that marketing has several subdomains (e.g., advertising, promotions, product development, branding) that all use different types of data. Because one-size-fits-all analytical solutions are not desirable, business schools should prepare marketing managers to have wide knowledge of all the different techniques used in these subdomains to get the big picture and work effectively with analysts.
Media
To understand how the media uses big data, it is first necessary to provide some context on the mechanisms used in media processes. It has been suggested by Nick Couldry and Joseph Turow that practitioners in media and advertising approach big data as many actionable points of information about millions of individuals. The industry appears to be moving away from the traditional approach of using specific media environments such as newspapers, magazines, or television shows and instead taps into consumers with technologies that reach targeted people at optimal times in optimal locations. The ultimate aim is to serve or convey a message or content that is (statistically speaking) in line with the consumer's mindset. For example, publishing environments are increasingly tailoring messages (advertisements) and content (articles) to appeal to consumers that have been exclusively gleaned through various data-mining activities.
- Targeting of consumers (for advertising by marketers)
- Data capture
- Data journalism: publishers and journalists use big data tools to provide unique and innovative insights and infographics.
Channel 4, the British public-service television broadcaster, is a leader in the field of big data and data analysis.
Internet of Things (IoT)
Big data and the IoT work in conjunction. Data extracted from IoT devices provides a mapping of device interconnectivity. Such mappings have been used by the media industry, companies, and governments to more accurately target their audience and increase media efficiency. The IoT is also increasingly adopted as a means of gathering sensory data, and this sensory data has been used in medical and manufacturing contexts.
Kevin Ashton, the digital innovation expert credited with coining the term, defines the Internet of Things in this quote: "If we had computers that knew everything there was to know about things - using data they gathered without any help from us - we would be able to track and count everything, and greatly reduce waste, loss and cost. We would know when things needed replacing, repairing or recalling, and whether they were fresh or past their best."
Information Technology
Especially since 2015, big data has come to prominence within business operations as a tool to help employees work more efficiently and streamline the collection and distribution of information technology (IT). The use of big data to resolve IT and data collection issues within an enterprise is called IT operations analytics (ITOA). By applying big data principles to the concepts of machine intelligence and deep computing, IT departments can predict potential issues and move to provide solutions before the problems even happen. ITOA businesses are also beginning to play a major role in systems management by offering platforms that bring individual data silos together and generate insights from the whole of the system rather than from isolated pockets of data.
Case studies
Government
United States
- In 2012, the Obama administration announced the Big Data Research and Development Initiative, to explore how big data could be used to address important problems faced by the government. The initiative is composed of 84 different big data programs spread across six departments.
- Big data analysis played a large role in Barack Obama's successful 2012 re-election campaign.
- The United States Federal Government owns four of the ten most powerful supercomputers in the world.
- The Utah Data Center has been constructed by the United States National Security Agency. When finished, the facility will be able to handle a large amount of information collected by the NSA over the Internet. The exact amount of storage space is unknown, but more recent sources claim it will be on the order of a few exabytes. This has raised security concerns regarding the anonymity of the data collected.
India
- Big data analysis was tried out for the BJP to win the 2014 Indian general election.
- The Indian government uses numerous techniques to ascertain how the Indian electorate is responding to government action, as well as ideas for policy augmentation.
United Kingdom
Sports
Big data can be used to improve training and to understand competitors, using sport sensors. It is also possible to predict winners in a match using big data analytics. Future performance of players can be predicted as well. Thus, players' value and salary are determined by data collected throughout the season.
The movie Moneyball demonstrates how big data can be used to scout players and to identify undervalued players.
In Formula One races, race cars with hundreds of sensors generate terabytes of data. These sensors collect data points from tire pressure to fuel burn efficiency. Based on the data, engineers and data analysts decide whether adjustments should be made in order to win a race. In addition, using big data, race teams try to predict beforehand the time in which they will finish the race, based on simulations using data collected over the season.
Technology
- eBay.com uses two data warehouses at 7.5 petabytes and 40 PB as well as a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising.
- Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 they had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.
- Facebook handles 50 billion photos from its user base.
- Google was handling roughly 100 billion searches per month as of August 2012.
Research activities
Encrypted search and cluster formation in big data were demonstrated in March 2014 at the American Society of Engineering Education. Gautam Siwach, engaged in Tackling the Challenges of Big Data by the MIT Computer Science and Artificial Intelligence Laboratory, and Dr. Amir Esmailpour at the UNH Research Group investigated the key features of big data as the formation of clusters and their interconnections. They focused on the security of big data and the orientation of the term towards the presence of different types of data in encrypted form at the cloud interface, providing raw definitions and real-time examples within the technology. In addition, they proposed an approach for identifying the encoding technique to advance towards an expedited search over encrypted text, leading to security enhancements in big data.
In March 2012, the White House announced a national "Big Data Initiative" that consisted of six Federal departments and agencies committing more than $200 million to big data research projects.
The initiative included a National Science Foundation "Expeditions in Computing" grant of $10 million over 5 years to the AMPLab at the University of California, Berkeley. The AMPLab also received funds from DARPA and over a dozen industrial sponsors, and uses big data to attack a wide range of problems from predicting traffic congestion to fighting cancer.
The White House Big Data Initiative also included a commitment by the Department of Energy to provide $25 million in funding over 5 years to establish the Scalable Data Management, Analysis and Visualization (SDAV) Institute, led by the Energy Department's Lawrence Berkeley National Laboratory. The SDAV Institute aims to bring together the expertise of six national laboratories and seven universities to develop new tools to help scientists manage and visualize data on the Department's supercomputers.
The U.S. state of Massachusetts announced the Massachusetts Big Data Initiative in May 2012, which provides funding from the state government and private companies to a variety of research institutions. The Massachusetts Institute of Technology hosts the Intel Science and Technology Center for Big Data in the MIT Computer Science and Artificial Intelligence Laboratory, combining government, corporate, and institutional funding and research efforts.
The European Commission is funding the 2-year-long Big Data Public Private Forum through their Seventh Framework Program to engage companies, academics, and other stakeholders in discussing big data issues. The project aims to define a strategy in terms of research and innovation to guide supporting actions from the European Commission in the successful implementation of the big data economy. Outcomes of this project will be used as input for Horizon 2020, their next framework program.
The British government announced in March 2014 the founding of the Alan Turing Institute, named after the computer pioneer and code-breaker, which will focus on new ways to collect and analyze large data sets.
At the University of Waterloo Stratford Campus Canadian Open Data Experience (CODE) Inspiration Day, participants demonstrated how using data visualization can increase the understanding and appeal of big data sets and communicate their story to the world.
To make manufacturing more competitive in the United States (and the world), there is a need to integrate more American ingenuity and innovation into manufacturing; therefore, the National Science Foundation has given a grant to the Industry/University Cooperative Research Center for Intelligent Maintenance Systems (IMS) at the University of Cincinnati to focus on developing advanced predictive tools and techniques applicable in a big data environment. In May 2013, the IMS Center held an industry advisory board meeting focusing on big data, where presenters from various industrial companies discussed their concerns, issues, and future goals in the big data environment.
Computational social sciences: anyone can use application programming interfaces (APIs) provided by big data holders, such as Google and Twitter, to do research in the social and behavioral sciences. Often these APIs are provided for free. Tobias Preis et al. used Google Trends data to demonstrate that Internet users from countries with a higher per capita gross domestic product (GDP) are more likely to search for information about the future than information about the past. The findings suggest there may be a link between online behavior and real-world economic indicators. The authors of the study examined Google queries by the ratio of the volume of searches for the coming year ('2011') to the volume of searches for the previous year ('2009'), which they call the 'future orientation index'. They compared the future orientation index to the per capita GDP of each country, and found a strong tendency for countries where Google users inquire more about the future to have a higher GDP. The results hint that there may be a relationship between the economic success of a country and the information-seeking behavior of its citizens captured in big data.
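The future orientation index described above is a simple ratio, and a short sketch makes the definition concrete. The search volumes below are invented placeholders for two hypothetical countries, not data from the study, and the function name is illustrative.

```python
def future_orientation_index(volume_next_year, volume_previous_year):
    """Ratio of search volume for the coming year to that for the previous year."""
    if volume_previous_year <= 0:
        raise ValueError("search volume for the previous year must be positive")
    return volume_next_year / volume_previous_year


# Hypothetical search volumes: (searches for '2011', searches for '2009').
searches = {
    "Country A": (1200, 1000),
    "Country B": (450, 900),
}

index = {
    country: future_orientation_index(next_year, previous_year)
    for country, (next_year, previous_year) in searches.items()
}

# An index above 1 means more searches about the coming year than the past one.
for country, value in sorted(index.items(), key=lambda item: -item[1]):
    print(f"{country}: {value:.2f}")
```

In the study, an index of this kind was then compared with per capita GDP across countries; that comparison is not reproduced here.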
Tobias Preis and his colleagues Helen Susannah Moat and H. Eugene Stanley introduced methods to identify online precursors for stock market movements, using trading strategies based on search volume data provided by Google Trends. Their analysis of Google's search volume for 98 terms of various financial relevance, published in Scientific Report , shows that an increase in search volume for financially relevant search terms tends to precede major losses in financial markets.
Large data sets come with algorithmic challenges that did not exist previously. Therefore, there is a need to fundamentally change the way of processing.
The Algorithm Workshop for the Modern Massive Data Collection (MMDS) gathers computer scientists, statisticians, mathematicians, and data analysis practitioners to discuss major algorithmic data challenges.
Retrieving large data samples
An important research question that can be asked about big data sets is whether the full data must be examined to draw certain conclusions about its properties, or whether a sample is good enough. The name big data itself refers to size, and size is an important characteristic of big data. But sampling (statistics) enables the selection of the right data points from within the larger data set to estimate the characteristics of the whole population. For example, about 600 million tweets are produced every day. Is it necessary to look at all of them to determine the topics discussed during the day? Is it necessary to look at all the tweets to determine the sentiment on each topic? In manufacturing, different types of sensor data such as acoustics, vibration, pressure, current, voltage and controller data are available at short time intervals. To predict downtime it may not be necessary to look at all the data; a sample may be sufficient. Big data can be broken down by various data point categories such as demographic, psychographic, behavioral, and transactional data. With large sets of data points, marketers are able to create and use more customized consumer segments for more strategic targeting.
There has been some work on sampling algorithms for big data. A theoretical formulation for sampling Twitter data has been developed.
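As a rough illustration of how a fixed-size sample can be drawn from a stream too large to hold in memory, the sketch below uses reservoir sampling (Algorithm R). It is a generic technique, not the Twitter-specific formulation mentioned above; the function name `reservoir_sample` and the integer stand-in for tweet IDs are assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # overwriting a uniformly chosen slot in the reservoir.
            j = rng.randint(0, i)  # both bounds are inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Hypothetical stand-in for a stream of tweet IDs.
    tweet_ids = range(100_000)
    sample = reservoir_sample(tweet_ids, k=1000, seed=42)
    print(len(sample))
```

The stream is read once and only k items are kept at any time, which is why this style of sampling suits data that arrives continuously. Whether a uniform sample is adequate for a given question (for example, rare topics) is a separate matter that the sketch does not address.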
Critique
Critiques of the big data paradigm come in two flavors: those that question the implications of the approach itself, and those that question the way it is currently done. One approach to this criticism is the field of critical data studies.
Criticism of the big data paradigm
"A crucial problem is that we do not know much about the underlying empirical micro-processes that lead to the emergence of the[se] typical network characteristics of Big Data". In their critique, Snijders, Matzat, and Reips point out that often very strong assumptions are made about mathematical properties that may not reflect what actually happens at the level of micro-processes. Mark Graham has leveled broad critiques at Chris Anderson's assertion that big data will spell the end of theory, focusing in particular on the notion that big data must always be contextualized in its social, economic, and political contexts. Even as companies invest eight- and nine-figure sums to derive insight from information streaming in from suppliers and customers, fewer than 40% of employees have sufficiently mature processes and skills to do so. To overcome this insight deficit, big data, no matter how comprehensive or well analyzed, must be complemented by "big judgment," according to an article in the Harvard Business Review.
Much in the same line, it has been pointed out that decisions based on the analysis of big data are inevitably "informed by the world as it was in the past, or, at best, as it currently is". Fed by large amounts of data on past experiences, algorithms can predict future developments only if the future is similar to the past. If the dynamics of the system change in the future (if it is not a stationary process), the past can say little about the future. To make predictions in changing environments, it is necessary to have a thorough understanding of the system's dynamics, which requires theory. As a response to this critique, Alemany Oliver and Vayre suggest using "abductive reasoning as a first step in the research process in order to bring context to consumers' digital traces and make new theories emerge". Additionally, it has been suggested to combine big data approaches with computer simulations, such as agent-based models and complex systems. Agent-based models are getting better at predicting the outcome of social complexities, even for unknown future scenarios, through computer simulations based on a collection of mutually interdependent algorithms. Finally, multivariate methods that probe for the latent structure of the data, such as factor analysis and cluster analysis, have proven useful as analytic approaches that go well beyond the bivariate approaches (cross-tabs) typically employed with smaller data sets.
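The point about non-stationary processes can be made concrete with a toy example. The sketch below fits a straight line to observations from a made-up system and then asks the fitted model about a point after the system's dynamics have changed. The process `true_process`, its change point at x = 10, and all numbers are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def true_process(x: float) -> float:
    """Hypothetical system whose dynamics change at x = 10."""
    return 2.0 * x if x < 10 else 20.0 - 0.5 * (x - 10)


# Train only on the past regime (x = 0..9).
past_x = [float(x) for x in range(10)]
past_y = [true_process(x) for x in past_x]
slope, intercept = fit_line(past_x, past_y)


def predict(x: float) -> float:
    """Prediction from the model fitted to past data."""
    return slope * x + intercept


# The error is negligible inside the regime the model was trained on ...
error_before_change = abs(predict(9.0) - true_process(9.0))
# ... and large once the underlying dynamics have shifted.
error_after_change = abs(predict(15.0) - true_process(15.0))
print(error_before_change, error_after_change)
```

Adding more observations from the old regime would not reduce the second error; only knowledge of how the system changes would, which is the role the critique assigns to theory.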
In health and biology, conventional scientific approaches are based on experimentation. For these approaches, the limiting factor is the relevant data that can confirm or refute the initial hypothesis. A new postulate is now accepted in the biosciences: the information provided by data in huge volumes (omics) without a prior hypothesis is complementary, and sometimes necessary, to conventional approaches based on experimentation. In these massive approaches, it is the formulation of a relevant hypothesis to explain the data that is the limiting factor. The search logic is reversed, and the limits of induction ("Glory of Science and Philosophy scandal", C. D. Broad, 1926) are to be considered.
Privacy advocates are concerned about the threat to privacy represented by the increasing storage and integration of personally identifiable information; expert panels have released various policy recommendations to bring practice in line with expectations of privacy. The misuse of big data in several cases by media, companies and even governments has contributed to the erosion of trust in almost every fundamental institution holding up society.
Nayef Al-Rodhan argues that a new kind of social contract will be needed to protect individual liberties in a context of big data and giant corporations that own vast amounts of information. The use of big data should be monitored and better regulated at the national and international levels. Barocas and Nissenbaum argue that one way of protecting individual users is to inform them about the types of information being collected, with whom it is shared, under what constraints and for what purposes.
Critiques of the 'V' model
The 'V' model of big data is a cause for concern because it centers on computational scalability and gives little attention to the perceptibility and understandability of information. This led to the framework of cognitive big data, which characterizes big data applications according to:
- Data completeness: understanding of the non-obvious from data;
- Data correlation, causation, and predictability: causality as a non-essential requirement for achieving predictability;
- Explainability and interpretability: humans want to understand and accept what they understand, which algorithms do not provide;
- Level of automated decision-making: algorithms that support automated decision-making and algorithmic self-learning.
Critique of novelty
Large data sets have been analyzed by computing machines for well over a century, including the 1890 US census analysis performed on punch-card tabulating machines built by a forerunner of IBM, which computed statistics including means and variances of populations across the whole continent. In more recent decades, science experiments such as those at CERN have produced data on scales similar to current commercial "big data". However, science experiments have tended to analyze their data using specialized, custom-built high-performance computing (supercomputing) clusters and grids, rather than clouds of cheap commodity computers as in the current commercial wave, implying a difference in both culture and technology stack.
Critiques of big data execution
Ulf-Dietrich Reips and Uwe Matzat wrote in 2014 that big data had become a "fad" in scientific research. Researcher Danah Boyd has raised concerns that the use of big data in science neglects principles such as choosing a representative sample, because researchers are too concerned with handling the huge amounts of data. This approach may bias results in one way or another. Integration across heterogeneous data resources (some that might be considered big data and others not) presents formidable logistical as well as analytical challenges, but many researchers argue that such integrations are likely to represent the most promising new frontiers in science. In the provocative article "Critical Questions for Big Data", the authors call big data a part of mythology: "large data sets offer a higher form of intelligence and knowledge [...], with the aura of truth, objectivity, and accuracy". Users of big data are often "lost in the sheer volume of numbers", and "working with Big Data is still subjective, and what it quantifies does not necessarily have a closer claim on objective truth". Recent developments in the BI domain, such as pro-active reporting, especially target improvements in the usability of big data through automated filtering of non-useful data and correlations.
Big data analysis is often shallow compared to the analysis of smaller data sets. In many big data projects, no large-scale data analysis takes place; the challenge lies instead in the extract, transform, load part of data preprocessing.
Source of the article: Wikipedia